Merge tag 'dm-3.12-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm

Pull device-mapper updates from Mike Snitzer: "Add the ability to collect I/O statistics on user-defined regions of a device-mapper device. This dm-stats code required the reintroduction of a div64_u64_rem() helper, but as a separate method that doesn't slow down div64_u64() -- especially on 32-bit systems. Allow the error target to replace request-based DM devices (e.g. multipath) in addition to bio-based DM devices. Various other small code fixes and improvements to thin-provisioning, DM cache and the DM ioctl interface" * tag 'dm-3.12-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: dm stripe: silence a couple sparse warnings dm: add statistics support dm thin: always return -ENOSPC if no_free_space is set dm ioctl: cleanup error handling in table_load dm ioctl: increase granularity of type_lock when loading table dm ioctl: prevent rename to empty name or uuid dm thin: set pool read-only if breaking_sharing fails block allocation dm thin: prefix pool error messages with pool device name dm: allow error target to replace bio-based and request-based targets math64: New separate div64_u64_rem helper dm space map: optimise sm_ll_dec and sm_ll_inc dm btree: prefetch child nodes when walking tree for a dm_btree_del dm btree: use pop_frame in dm_btree_del to cleanup code dm cache: eliminate holes in cache structure dm cache: fix stacking of geometry limits dm thin: fix stacking of geometry limits dm thin: add data block size limits to Documentation dm cache: add data block size limits to code and Documentation dm cache: document metadata device is exclussive to a cache dm: stop using WQ_NON_REENTRANT
author: Linus Torvalds <torvalds@linux-foundation.org> 2013-09-10 13:06:15 -0700
committer: Linus Torvalds <torvalds@linux-foundation.org> 2013-09-10 13:06:15 -0700
commit: 7426d62871dafbeeed087d609c6469a515c88389 (patch)
tree: 7d935f360eeb5e78ba633238a29e9213c291aad7
parent: 4d7696f1b05f4aeb586c74868fe3da2731daca4b (diff)
parent: 7fff5e8f727285cf54e6aba10f31b196f207b98a (diff)
25 files changed, 1621 insertions, 162 deletions
diff --git a/Documentation/device-mapper/cache.txt b/Documentation/device-mapper/cache.txt
index e8cdf7241b6..33d45ee0b73 100644
--- a/Documentation/device-mapper/cache.txt
+++ b/Documentation/device-mapper/cache.txt
@@ -50,14 +50,16 @@ other parameters detailed later):
    which are dirty, and extra hints for use by the policy object.
    This information could be put on the cache device, but having it
    separate allows the volume manager to configure it differently,
-   e.g. as a mirror for extra robustness.
+   e.g. as a mirror for extra robustness.  This metadata device may only
+   be used by a single cache device.
 
 Fixed block size
 ----------------
 
 The origin is divided up into blocks of a fixed size.  This block size
 is configurable when you first create the cache.  Typically we've been
-using block sizes of 256k - 1024k.
+using block sizes of 256KB - 1024KB.  The block size must be between 64
+(32KB) and 2097152 (1GB) and a multiple of 64 (32KB).
 
 Having a fixed block size simplifies the target a lot.  But it is
 something of a compromise.  For instance, a small part of a block may be
diff --git a/Documentation/device-mapper/statistics.txt b/Documentation/device-mapper/statistics.txt
new file mode 100644
index 00000000000..2a1673adc20
--- /dev/null
+++ b/Documentation/device-mapper/statistics.txt
@@ -0,0 +1,186 @@
+DM statistics
+=============
+
+Device Mapper supports the collection of I/O statistics on user-defined
+regions of a DM device.	 If no regions are defined no statistics are
+collected so there isn't any performance impact.  Only bio-based DM
+devices are currently supported.
+
+Each user-defined region specifies a starting sector, length and step.
+Individual statistics will be collected for each step-sized area within
+the range specified.
+
+The I/O statistics counters for each step-sized area of a region are
+in the same format as /sys/block/*/stat or /proc/diskstats (see:
+Documentation/iostats.txt).  But two extra counters (12 and 13) are
+provided: total time spent reading and writing in milliseconds.	 All
+these counters may be accessed by sending the @stats_print message to
+the appropriate DM device via dmsetup.
+
+Each region has a corresponding unique identifier, which we call a
+region_id, that is assigned when the region is created.	 The region_id
+must be supplied when querying statistics about the region, deleting the
+region, etc.  Unique region_ids enable multiple userspace programs to
+request and process statistics for the same DM device without stepping
+on each other's data.
+
+The creation of DM statistics will allocate memory via kmalloc or
+fallback to using vmalloc space.  At most, 1/4 of the overall system
+memory may be allocated by DM statistics.  The admin can see how much
+memory is used by reading
+/sys/module/dm_mod/parameters/stats_current_allocated_bytes
+
+Messages
+========
+
+    @stats_create <range> <step> [<program_id> [<aux_data>]]
+
+	Create a new region and return the region_id.
+
+	<range>
+	  "-" - whole device
+	  "<start_sector>+<length>" - a range of <length> 512-byte sectors
+				      starting with <start_sector>.
+
+	<step>
+	  "<area_size>" - the range is subdivided into areas each containing
+			  <area_size> sectors.
+	  "/<number_of_areas>" - the range is subdivided into the specified
+				 number of areas.
+
+	<program_id>
+	  An optional parameter.  A name that uniquely identifies
+	  the userspace owner of the range.  This groups ranges together
+	  so that userspace programs can identify the ranges they
+	  created and ignore those created by others.
+	  The kernel returns this string back in the output of
+	  @stats_list message, but it doesn't use it for anything else.
+
+	<aux_data>
+	  An optional parameter.  A word that provides auxiliary data
+	  that is useful to the client program that created the range.
+	  The kernel returns this string back in the output of
+	  @stats_list message, but it doesn't use this value for anything.
+
+    @stats_delete <region_id>
+
+	Delete the region with the specified id.
+
+	<region_id>
+	  region_id returned from @stats_create
+
+    @stats_clear <region_id>
+
+	Clear all the counters except the in-flight i/o counters.
+
+	<region_id>
+	  region_id returned from @stats_create
+
+    @stats_list [<program_id>]
+
+	List all regions registered with @stats_create.
+
+	<program_id>
+	  An optional parameter.
+	  If this parameter is specified, only matching regions
+	  are returned.
+	  If it is not specified, all regions are returned.
+
+	Output format:
+	  <region_id>: <start_sector>+<length> <step> <program_id> <aux_data>
+
+    @stats_print <region_id> [<starting_line> <number_of_lines>]
+
+	Print counters for each step-sized area of a region.
+
+	<region_id>
+	  region_id returned from @stats_create
+
+	<starting_line>
+	  The index of the starting line in the output.
+	  If omitted, all lines are returned.
+
+	<number_of_lines>
+	  The number of lines to include in the output.
+	  If omitted, all lines are returned.
+
+	Output format for each step-sized area of a region:
+
+	  <start_sector>+<length> counters
+
+	  The first 11 counters have the same meaning as
+	  /sys/block/*/stat or /proc/diskstats.
+
+	  Please refer to Documentation/iostats.txt for details.
+
+	  1. the number of reads completed
+	  2. the number of reads merged
+	  3. the number of sectors read
+	  4. the number of milliseconds spent reading
+	  5. the number of writes completed
+	  6. the number of writes merged
+	  7. the number of sectors written
+	  8. the number of milliseconds spent writing
+	  9. the number of I/Os currently in progress
+	  10. the number of milliseconds spent doing I/Os
+	  11. the weighted number of milliseconds spent doing I/Os
+
+	  Additional counters:
+	  12. the total time spent reading in milliseconds
+	  13. the total time spent writing in milliseconds
+
+    @stats_print_clear <region_id> [<starting_line> <number_of_lines>]
+
+	Atomically print and then clear all the counters except the
+	in-flight i/o counters.	 Useful when the client consuming the
+	statistics does not want to lose any statistics (those updated
+	between printing and clearing).
+
+	<region_id>
+	  region_id returned from @stats_create
+
+	<starting_line>
+	  The index of the starting line in the output.
+	  If omitted, all lines are printed and then cleared.
+
+	<number_of_lines>
+	  The number of lines to process.
+	  If omitted, all lines are printed and then cleared.
+
+    @stats_set_aux <region_id> <aux_data>
+
+	Store auxiliary data aux_data for the specified region.
+
+	<region_id>
+	  region_id returned from @stats_create
+
+	<aux_data>
+	  The string that identifies data which is useful to the client
+	  program that created the range.  The kernel returns this
+	  string back in the output of @stats_list message, but it
+	  doesn't use this value for anything.
+
+Examples
+========
+
+Subdivide the DM device 'vol' into 100 pieces and start collecting
+statistics on them:
+
+  dmsetup message vol 0 @stats_create - /100
+
+Set the auxillary data string to "foo bar baz" (the escape for each
+space must also be escaped, otherwise the shell will consume them):
+
+  dmsetup message vol 0 @stats_set_aux 0 foo\\ bar\\ baz
+
+List the statistics:
+
+  dmsetup message vol 0 @stats_list
+
+Print the statistics:
+
+  dmsetup message vol 0 @stats_print 0
+
+Delete the statistics:
+
+  dmsetup message vol 0 @stats_delete 0
diff --git a/Documentation/device-mapper/thin-provisioning.txt b/Documentation/device-mapper/thin-provisioning.txt
index 30b8b83bd33..50c44cf79b0 100644
--- a/Documentation/device-mapper/thin-provisioning.txt
+++ b/Documentation/device-mapper/thin-provisioning.txt
@@ -99,13 +99,14 @@ Using an existing pool device
 		 $data_block_size $low_water_mark"
 
 $data_block_size gives the smallest unit of disk space that can be
-allocated at a time expressed in units of 512-byte sectors.  People
-primarily interested in thin provisioning may want to use a value such
-as 1024 (512KB).  People doing lots of snapshotting may want a smaller value
-such as 128 (64KB).  If you are not zeroing newly-allocated data,
-a larger $data_block_size in the region of 256000 (128MB) is suggested.
-$data_block_size must be the same for the lifetime of the
-metadata device.
+allocated at a time expressed in units of 512-byte sectors.
+$data_block_size must be between 128 (64KB) and 2097152 (1GB) and a
+multiple of 128 (64KB).  $data_block_size cannot be changed after the
+thin-pool is created.  People primarily interested in thin provisioning
+may want to use a value such as 1024 (512KB).  People doing lots of
+snapshotting may want a smaller value such as 128 (64KB).  If you are
+not zeroing newly-allocated data, a larger $data_block_size in the
+region of 256000 (128MB) is suggested.
 
 $low_water_mark is expressed in blocks of size $data_block_size.  If
 free space on the data device drops below this level then a dm event
diff --git a/drivers/md/Makefile b/drivers/md/Makefile
index 5ef78efc27f..2acc43fe022 100644
--- a/drivers/md/Makefile
+++ b/drivers/md/Makefile
@@ -3,7 +3,7 @@
 #
 
 dm-mod-y	+= dm.o dm-table.o dm-target.o dm-linear.o dm-stripe.o \
-		   dm-ioctl.o dm-io.o dm-kcopyd.o dm-sysfs.o
+		   dm-ioctl.o dm-io.o dm-kcopyd.o dm-sysfs.o dm-stats.o
 dm-multipath-y	+= dm-path-selector.o dm-mpath.o
 dm-snapshot-y	+= dm-snap.o dm-exception-store.o dm-snap-transient.o \
 		    dm-snap-persistent.o
diff --git a/drivers/md/dm-cache-target.c b/drivers/md/dm-cache-target.c
index 0df3ec085eb..29569768ffb 100644
--- a/drivers/md/dm-cache-target.c
+++ b/drivers/md/dm-cache-target.c
@@ -67,9 +67,11 @@ static void free_bitset(unsigned long *bits)
 #define MIGRATION_COUNT_WINDOW 10
 
 /*
- * The block size of the device holding cache data must be >= 32KB
+ * The block size of the device holding cache data must be
+ * between 32KB and 1GB.
  */
 #define DATA_DEV_BLOCK_SIZE_MIN_SECTORS (32 * 1024 >> SECTOR_SHIFT)
+#define DATA_DEV_BLOCK_SIZE_MAX_SECTORS (1024 * 1024 * 1024 >> SECTOR_SHIFT)
 
 /*
  * FIXME: the cache is read/write for the time being.
@@ -101,6 +103,8 @@ struct cache {
 	struct dm_target *ti;
 	struct dm_target_callbacks callbacks;
 
+	struct dm_cache_metadata *cmd;
+
 	/*
 	 * Metadata is written to this device.
 	 */
@@ -117,11 +121,6 @@ struct cache {
 	struct dm_dev *cache_dev;
 
 	/*
-	 * Cache features such as write-through.
-	 */
-	struct cache_features features;
-
-	/*
 	 * Size of the origin device in _complete_ blocks and native sectors.
 	 */
 	dm_oblock_t origin_blocks;
@@ -138,8 +137,6 @@ struct cache {
 	uint32_t sectors_per_block;
 	int sectors_per_block_shift;
 
-	struct dm_cache_metadata *cmd;
-
 	spinlock_t lock;
 	struct bio_list deferred_bios;
 	struct bio_list deferred_flush_bios;
@@ -148,8 +145,8 @@ struct cache {
 	struct list_head completed_migrations;
 	struct list_head need_commit_migrations;
 	sector_t migration_threshold;
-	atomic_t nr_migrations;
 	wait_queue_head_t migration_wait;
+	atomic_t nr_migrations;
 
 	/*
 	 * cache_size entries, dirty if set
@@ -160,9 +157,16 @@ struct cache {
 	/*
 	 * origin_blocks entries, discarded if set.
 	 */
-	uint32_t discard_block_size; /* a power of 2 times sectors per block */
 	dm_dblock_t discard_nr_blocks;
 	unsigned long *discard_bitset;
+	uint32_t discard_block_size; /* a power of 2 times sectors per block */
+
+	/*
+	 * Rather than reconstructing the table line for the status we just
+	 * save it and regurgitate.
+	 */
+	unsigned nr_ctr_args;
+	const char **ctr_args;
 
 	struct dm_kcopyd_client *copier;
 	struct workqueue_struct *wq;
@@ -187,14 +191,12 @@ struct cache {
 	bool loaded_mappings:1;
 	bool loaded_discards:1;
 
-	struct cache_stats stats;
-
 	/*
-	 * Rather than reconstructing the table line for the status we just
-	 * save it and regurgitate.
+	 * Cache features such as write-through.
 	 */
-	unsigned nr_ctr_args;
-	const char **ctr_args;
+	struct cache_features features;
+
+	struct cache_stats stats;
 };
 
 struct per_bio_data {
@@ -1687,24 +1689,25 @@ static int parse_origin_dev(struct cache_args *ca, struct dm_arg_set *as,
 static int parse_block_size(struct cache_args *ca, struct dm_arg_set *as,
 			    char **error)
 {
-	unsigned long tmp;
+	unsigned long block_size;
 
 	if (!at_least_one_arg(as, error))
 		return -EINVAL;
 
-	if (kstrtoul(dm_shift_arg(as), 10, &tmp) || !tmp ||
-	    tmp < DATA_DEV_BLOCK_SIZE_MIN_SECTORS ||
-	    tmp & (DATA_DEV_BLOCK_SIZE_MIN_SECTORS - 1)) {
+	if (kstrtoul(dm_shift_arg(as), 10, &block_size) || !block_size ||
+	    block_size < DATA_DEV_BLOCK_SIZE_MIN_SECTORS ||
+	    block_size > DATA_DEV_BLOCK_SIZE_MAX_SECTORS ||
+	    block_size & (DATA_DEV_BLOCK_SIZE_MIN_SECTORS - 1)) {
 		*error = "Invalid data block size";
 		return -EINVAL;
 	}
 
-	if (tmp > ca->cache_sectors) {
+	if (block_size > ca->cache_sectors) {
 		*error = "Data block size is larger than the cache device";
 		return -EINVAL;
 	}
 
-	ca->block_size = tmp;
+	ca->block_size = block_size;
 
 	return 0;
 }
@@ -2609,9 +2612,17 @@ static void set_discard_limits(struct cache *cache, struct queue_limits *limits)
 static void cache_io_hints(struct dm_target *ti, struct queue_limits *limits)
 {
 	struct cache *cache = ti->private;
+	uint64_t io_opt_sectors = limits->io_opt >> SECTOR_SHIFT;
 
-	blk_limits_io_min(limits, 0);
-	blk_limits_io_opt(limits, cache->sectors_per_block << SECTOR_SHIFT);
+	/*
+	 * If the system-determined stacked limits are compatible with the
+	 * cache's blocksize (io_opt is a factor) do not override them.
+	 */
+	if (io_opt_sectors < cache->sectors_per_block ||
+	    do_div(io_opt_sectors, cache->sectors_per_block)) {
+		blk_limits_io_min(limits, 0);
+		blk_limits_io_opt(limits, cache->sectors_per_block << SECTOR_SHIFT);
+	}
 	set_discard_limits(cache, limits);
 }
 
diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
index 6d2d41ae9e3..0fce0bc1a95 100644
--- a/drivers/md/dm-crypt.c
+++ b/drivers/md/dm-crypt.c
@@ -1645,20 +1645,14 @@ static int crypt_ctr(struct dm_target *ti, unsigned int argc, char **argv)
 	}
 
 	ret = -ENOMEM;
-	cc->io_queue = alloc_workqueue("kcryptd_io",
-				       WQ_NON_REENTRANT|
-				       WQ_MEM_RECLAIM,
-				       1);
+	cc->io_queue = alloc_workqueue("kcryptd_io", WQ_MEM_RECLAIM, 1);
 	if (!cc->io_queue) {
 		ti->error = "Couldn't create kcryptd io queue";
 		goto bad;
 	}
 
 	cc->crypt_queue = alloc_workqueue("kcryptd",
-					  WQ_NON_REENTRANT|
-					  WQ_CPU_INTENSIVE|
-					  WQ_MEM_RECLAIM,
-					  1);
+					  WQ_CPU_INTENSIVE | WQ_MEM_RECLAIM, 1);
 	if (!cc->crypt_queue) {
 		ti->error = "Couldn't create kcryptd queue";
 		goto bad;
diff --git a/drivers/md/dm-ioctl.c b/drivers/md/dm-ioctl.c
index f1b758675ec..afe08146f73 100644
--- a/drivers/md/dm-ioctl.c
+++ b/drivers/md/dm-ioctl.c
@@ -877,7 +877,7 @@ static int dev_rename(struct dm_ioctl *param, size_t param_size)
 	unsigned change_uuid = (param->flags & DM_UUID_FLAG) ? 1 : 0;
 
 	if (new_data < param->data ||
-	    invalid_str(new_data, (void *) param + param_size) ||
+	    invalid_str(new_data, (void *) param + param_size) || !*new_data ||
 	    strlen(new_data) > (change_uuid ? DM_UUID_LEN - 1 : DM_NAME_LEN - 1)) {
 		DMWARN("Invalid new mapped device name or uuid string supplied.");
 		return -EINVAL;
@@ -1262,44 +1262,37 @@ static int table_load(struct dm_ioctl *param, size_t param_size)
 
 	r = dm_table_create(&t, get_mode(param), param->target_count, md);
 	if (r)
-		goto out;
+		goto err;
 
+	/* Protect md->type and md->queue against concurrent table loads. */
+	dm_lock_md_type(md);
 	r = populate_table(t, param, param_size);
-	if (r) {
-		dm_table_destroy(t);
-		goto out;
-	}
+	if (r)
+		goto err_unlock_md_type;
 
 	immutable_target_type = dm_get_immutable_target_type(md);
 	if (immutable_target_type &&
 	    (immutable_target_type != dm_table_get_immutable_target_type(t))) {
 		DMWARN("can't replace immutable target type %s",
 		       immutable_target_type->name);
-		dm_table_destroy(t);
 		r = -EINVAL;
-		goto out;
+		goto err_unlock_md_type;
 	}
 
-	/* Protect md->type and md->queue against concurrent table loads. */
-	dm_lock_md_type(md);
 	if (dm_get_md_type(md) == DM_TYPE_NONE)
 		/* Initial table load: acquire type of table. */
 		dm_set_md_type(md, dm_table_get_type(t));
 	else if (dm_get_md_type(md) != dm_table_get_type(t)) {
 		DMWARN("can't change device type after initial table load.");
-		dm_table_destroy(t);
-		dm_unlock_md_type(md);
 		r = -EINVAL;
-		goto out;
+		goto err_unlock_md_type;
 	}
 
 	/* setup md->queue to reflect md's type (may block) */
 	r = dm_setup_md_queue(md);
 	if (r) {
 		DMWARN("unable to set up device queue for new table.");
-		dm_table_destroy(t);
-		dm_unlock_md_type(md);
-		goto out;
+		goto err_unlock_md_type;
 	}
 	dm_unlock_md_type(md);
 
@@ -1309,9 +1302,8 @@ static int table_load(struct dm_ioctl *param, size_t param_size)
 	if (!hc || hc->md != md) {
 		DMWARN("device has been removed from the dev hash table.");
 		up_write(&_hash_lock);
-		dm_table_destroy(t);
 		r = -ENXIO;
-		goto out;
+		goto err_destroy_table;
 	}
 
 	if (hc->new_map)
@@ -1322,7 +1314,6 @@ static int table_load(struct dm_ioctl *param, size_t param_size)
 	param->flags |= DM_INACTIVE_PRESENT_FLAG;
 	__dev_status(md, param);
 
-out:
 	if (old_map) {
 		dm_sync_table(md);
 		dm_table_destroy(old_map);
@@ -1330,6 +1321,15 @@ out:
 
 	dm_put(md);
 
+	return 0;
+
+err_unlock_md_type:
+	dm_unlock_md_type(md);
+err_destroy_table:
+	dm_table_destroy(t);
+err:
+	dm_put(md);
+
 	return r;
 }
 
@@ -1455,20 +1455,26 @@ static int table_status(struct dm_ioctl *param, size_t param_size)
 	return 0;
 }
 
-static bool buffer_test_overflow(char *result, unsigned maxlen)
-{
-	return !maxlen || strlen(result) + 1 >= maxlen;
-}
-
 /*
- * Process device-mapper dependent messages.
+ * Process device-mapper dependent messages.  Messages prefixed with '@'
+ * are processed by the DM core.  All others are delivered to the target.
  * Returns a number <= 1 if message was processed by device mapper.
  * Returns 2 if message should be delivered to the target.
  */
 static int message_for_md(struct mapped_device *md, unsigned argc, char **argv,
 			  char *result, unsigned maxlen)
 {
-	return 2;
+	int r;
+
+	if (**argv != '@')
+		return 2; /* no '@' prefix, deliver to target */
+
+	r = dm_stats_message(md, argc, argv, result, maxlen);
+	if (r < 2)
+		return r;
+
+	DMERR("Unsupported message sent to DM core: %s", argv[0]);
+	return -EINVAL;
 }
 
 /*
@@ -1542,7 +1548,7 @@ static int target_message(struct dm_ioctl *param, size_t param_size)
 
 	if (r == 1) {
 		param->flags |= DM_DATA_OUT_FLAG;
-		if (buffer_test_overflow(result, maxlen))
+		if (dm_message_test_buffer_overflow(result, maxlen))
 			param->flags |= DM_BUFFER_FULL_FLAG;
 		else
 			param->data_size = param->data_start + strlen(result) + 1;
diff --git a/drivers/md/dm-kcopyd.c b/drivers/md/dm-kcopyd.c
index d581fe5d2fa..3a7cade5e27 100644
--- a/drivers/md/dm-kcopyd.c
+++ b/drivers/md/dm-kcopyd.c
@@ -833,8 +833,7 @@ struct dm_kcopyd_client *dm_kcopyd_client_create(struct dm_kcopyd_throttle *thro
 		goto bad_slab;
 
 	INIT_WORK(&kc->kcopyd_work, do_work);
-	kc->kcopyd_wq = alloc_workqueue("kcopyd",
-					WQ_NON_REENTRANT | WQ_MEM_RECLAIM, 0);
+	kc->kcopyd_wq = alloc_workqueue("kcopyd", WQ_MEM_RECLAIM, 0);
 	if (!kc->kcopyd_wq)
 		goto bad_workqueue;
 
diff --git a/drivers/md/dm-raid1.c b/drivers/md/dm-raid1.c
index 699b5be68d3..9584443c561 100644
--- a/drivers/md/dm-raid1.c
+++ b/drivers/md/dm-raid1.c
@@ -1080,8 +1080,7 @@ static int mirror_ctr(struct dm_target *ti, unsigned int argc, char **argv)
 	ti->per_bio_data_size = sizeof(struct dm_raid1_bio_record);
 	ti->discard_zeroes_data_unsupported = true;
 
-	ms->kmirrord_wq = alloc_workqueue("kmirrord",
-					  WQ_NON_REENTRANT | WQ_MEM_RECLAIM, 0);
+	ms->kmirrord_wq = alloc_workqueue("kmirrord", WQ_MEM_RECLAIM, 0);
 	if (!ms->kmirrord_wq) {
 		DMERR("couldn't start kmirrord");
 		r = -ENOMEM;
diff --git a/drivers/md/dm-stats.c b/drivers/md/dm-stats.c
new file mode 100644
index 00000000000..8ae31e8d3d6
--- /dev/null
+++ b/drivers/md/dm-stats.c
@@ -0,0 +1,969 @@
+#include <linux/errno.h>
+#include <linux/numa.h>
+#include <linux/slab.h>
+#include <linux/rculist.h>
+#include <linux/threads.h>
+#include <linux/preempt.h>
+#include <linux/irqflags.h>
+#include <linux/vmalloc.h>
+#include <linux/mm.h>
+#include <linux/module.h>
+#include <linux/device-mapper.h>
+
+#include "dm.h"
+#include "dm-stats.h"
+
+#define DM_MSG_PREFIX "stats"
+
+static int dm_stat_need_rcu_barrier;
+
+/*
+ * Using 64-bit values to avoid overflow (which is a
+ * problem that block/genhd.c's IO accounting has).
+ */
+struct dm_stat_percpu {
+	unsigned long long sectors[2];
+	unsigned long long ios[2];
+	unsigned long long merges[2];
+	unsigned long long ticks[2];
+	unsigned long long io_ticks[2];
+	unsigned long long io_ticks_total;
+	unsigned long long time_in_queue;
+};
+
+struct dm_stat_shared {
+	atomic_t in_flight[2];
+	unsigned long stamp;
+	struct dm_stat_percpu tmp;
+};
+
+struct dm_stat {
+	struct list_head list_entry;
+	int id;
+	size_t n_entries;
+	sector_t start;
+	sector_t end;
+	sector_t step;
+	const char *program_id;
+	const char *aux_data;
+	struct rcu_head rcu_head;
+	size_t shared_alloc_size;
+	size_t percpu_alloc_size;
+	struct dm_stat_percpu *stat_percpu[NR_CPUS];
+	struct dm_stat_shared stat_shared[0];
+};
+
+struct dm_stats_last_position {
+	sector_t last_sector;
+	unsigned last_rw;
+};
+
+/*
+ * A typo on the command line could possibly make the kernel run out of memory
+ * and crash. To prevent the crash we account all used memory. We fail if we
+ * exhaust 1/4 of all memory or 1/2 of vmalloc space.
+ */
+#define DM_STATS_MEMORY_FACTOR		4
+#define DM_STATS_VMALLOC_FACTOR		2
+
+static DEFINE_SPINLOCK(shared_memory_lock);
+
+static unsigned long shared_memory_amount;
+
+static bool __check_shared_memory(size_t alloc_size)
+{
+	size_t a;
+
+	a = shared_memory_amount + alloc_size;
+	if (a < shared_memory_amount)
+		return false;
+	if (a >> PAGE_SHIFT > totalram_pages / DM_STATS_MEMORY_FACTOR)
+		return false;
+#ifdef CONFIG_MMU
+	if (a > (VMALLOC_END - VMALLOC_START) / DM_STATS_VMALLOC_FACTOR)
+		return false;
+#endif
+	return true;
+}
+
+static bool check_shared_memory(size_t alloc_size)
+{
+	bool ret;
+
+	spin_lock_irq(&shared_memory_lock);
+
+	ret = __check_shared_memory(alloc_size);
+
+	spin_unlock_irq(&shared_memory_lock);
+
+	return ret;
+}
+
+static bool claim_shared_memory(size_t alloc_size)
+{
+	spin_lock_irq(&shared_memory_lock);
+
+	if (!__check_shared_memory(alloc_size)) {
+		spin_unlock_irq(&shared_memory_lock);
+		return false;
+	}
+
+	shared_memory_amount += alloc_size;
+
+	spin_unlock_irq(&shared_memory_lock);
+
+	return true;
+}
+
+static void free_shared_memory(size_t alloc_size)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&shared_memory_lock, flags);
+
+	if (WARN_ON_ONCE(shared_memory_amount < alloc_size)) {
+		spin_unlock_irqrestore(&shared_memory_lock, flags);
+		DMCRIT("Memory usage accounting bug.");
+		return;
+	}
+
+	shared_memory_amount -= alloc_size;
+
+	spin_unlock_irqrestore(&shared_memory_lock, flags);
+}
+
+static void *dm_kvzalloc(size_t alloc_size, int node)
+{
+	void *p;
+
+	if (!claim_shared_memory(alloc_size))
+		return NULL;
+
+	if (alloc_size <= KMALLOC_MAX_SIZE) {
+		p = kzalloc_node(alloc_size, GFP_KERNEL | __GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN, node);
+		if (p)
+			return p;
+	}
+	p = vzalloc_node(alloc_size, node);
+	if (p)
+		return p;
+
+	free_shared_memory(alloc_size);
+
+	return NULL;
+}
+
+static void dm_kvfree(void *ptr, size_t alloc_size)
+{
+	if (!ptr)
+		return;
+
+	free_shared_memory(alloc_size);
+
+	if (is_vmalloc_addr(ptr))
+		vfree(ptr);
+	else
+		kfree(ptr);
+}
+
+static void dm_stat_free(struct rcu_head *head)
+{
+	int cpu;
+	struct dm_stat *s = container_of(head, struct dm_stat, rcu_head);
+
+	kfree(s->program_id);
+	kfree(s->aux_data);
+	for_each_possible_cpu(cpu)
+		dm_kvfree(s->stat_percpu[cpu], s->percpu_alloc_size);
+	dm_kvfree(s, s->shared_alloc_size);
+}
+
+static int dm_stat_in_flight(struct dm_stat_shared *shared)
+{
+	return atomic_read(&shared->in_flight[READ]) +
+	       atomic_read(&shared->in_flight[WRITE]);
+}
+
+void dm_stats_init(struct dm_stats *stats)
+{
+	int cpu;
+	struct dm_stats_last_position *last;
+
+	mutex_init(&stats->mutex);
+	INIT_LIST_HEAD(&stats->list);
+	stats->last = alloc_percpu(struct dm_stats_last_position);
+	for_each_possible_cpu(cpu) {
+		last = per_cpu_ptr(stats->last, cpu);
+		last->last_sector = (sector_t)ULLONG_MAX;
+		last->last_rw = UINT_MAX;
+	}
+}
+
+void dm_stats_cleanup(struct dm_stats *stats)
+{
+	size_t ni;
+	struct dm_stat *s;
+	struct dm_stat_shared *shared;
+
+	while (!list_empty(&stats->list)) {
+		s = container_of(stats->list.next, struct dm_stat, list_entry);
+		list_del(&s->list_entry);
+		for (ni = 0; ni < s->n_entries; ni++) {
+			shared = &s->stat_shared[ni];
+			if (WARN_ON(dm_stat_in_flight(shared))) {
+				DMCRIT("leaked in-flight counter at index %lu "
+				       "(start %llu, end %llu, step %llu): reads %d, writes %d",
+				       (unsigned long)ni,
+				       (unsigned long long)s->start,
+				       (unsigned long long)s->end,
+				       (unsigned long long)s->step,
+				       atomic_read(&shared->in_flight[READ]),
+				       atomic_read(&shared->in_flight[WRITE]));
+			}
+		}
+		dm_stat_free(&s->rcu_head);
+	}
+	free_percpu(stats->last);
+}
+
+static int dm_stats_create(struct dm_stats *stats, sector_t start, sector_t end,
+			   sector_t step, const char *program_id, const char *aux_data,
+			   void (*suspend_callback)(struct mapped_device *),
+			   void (*resume_callback)(struct mapped_device *),
+			   struct mapped_device *md)
+{
+	struct list_head *l;
+	struct dm_stat *s, *tmp_s;
+	sector_t n_entries;
+	size_t ni;
+	size_t shared_alloc_size;
+	size_t percpu_alloc_size;
+	struct dm_stat_percpu *p;
+	int cpu;
+	int ret_id;
+	int r;
+
+	if (end < start || !step)
+		return -EINVAL;
+
+	n_entries = end - start;
+	if (dm_sector_div64(n_entries, step))
+		n_entries++;
author	Linus Torvalds <torvalds@linux-foundation.org>	2013-09-10 13:06:15 -0700
committer	Linus Torvalds <torvalds@linux-foundation.org>	2013-09-10 13:06:15 -0700
commit	7426d62871dafbeeed087d609c6469a515c88389 (patch)
tree	7d935f360eeb5e78ba633238a29e9213c291aad7
parent	4d7696f1b05f4aeb586c74868fe3da2731daca4b (diff)
parent	7fff5e8f727285cf54e6aba10f31b196f207b98a (diff)