aboutsummaryrefslogtreecommitdiff
path: root/drivers/md
AgeCommit message (Collapse)Author
2010-02-23dm log: userspace fix overhead_size calcuationsJonathan Brassow
commit ebfd32bba9b518d684009d9d21a56742337ca1b3 upstream. This patch fixes two bugs that revolve around the miscalculation and misuse of the variable 'overhead_size'. 'overhead_size' is the size of the various header structures used during communication. The first bug is the use of 'sizeof' with the pointer of a structure instead of the structure itself - resulting in the wrong size being computed. This is then used in a check to see if the payload (data_size) would be to large for the preallocated structure. Since the bug produces a smaller value for the overhead, it was possible for the structure to be breached. (Although the current users of the code do not currently send enough data to trigger this bug.) The second bug is that the 'overhead_size' value is used to compute how much of the preallocated space should be cleared before populating it with fresh data. This should have simply been 'sizeof(struct cn_msg)' not overhead_size. The fact that 'overhead_size' was computed incorrectly made this problem "less bad" - leaving only a pointer's worth of space at the end uncleared. Thus, this bug was never producing a bad result, but still needs to be fixed - especially now that the value is computed correctly. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-02-23dm stripe: avoid divide by zero with invalid stripe countNikanth Karthikesan
commit 781248c1b50c776a9ef4be1130f84ced1cba42fe upstream. If a table containing zero as stripe count is passed into stripe_ctr the code attempts to divide by zero. This patch changes DM_TABLE_LOAD to return -EINVAL if the stripe count is zero. We now get the following error messages: device-mapper: table: 253:0: striped: Invalid stripe count device-mapper: ioctl: error adding target to table Signed-off-by: Nikanth Karthikesan <knikanth@suse.de> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-02-23md: fix 'degraded' calculation when starting a reshape.NeilBrown
commit 9eb07c259207d048e3ee8be2a77b2a4680b1edd4 upstream. This code was written long ago when it was not possible to reshape a degraded array. Now it is so the current level of degraded-ness needs to be taken in to account. Also newly addded devices should only reduce degradedness if they are deemed to be in-sync. In particular, if you convert a RAID5 to a RAID6, and increase the number of devices at the same time, then the 5->6 conversion will make the array degraded so the current code will produce a wrong value for 'degraded' - "-1" to be precise. If the reshape runs to completion end_reshape will calculate a correct new value for 'degraded', but if a device fails during the reshape an incorrect decision might be made based on the incorrect value of "degraded". This patch is suitable for 2.6.32-stable and if they are still open, 2.6.31-stable and 2.6.30-stable as well. Reported-by: Michael Evans <mjevans1983@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-01-28md: fix small irregularity with start_ro module parameterNeilBrown
commit 0f9552b5dc4fe10da37fa3f4a4ca185d90fa41c9 upstream. The start_ro modules parameter can be used to force arrays to be started in 'auto-readonly' in which they are read-only until the first write. This ensures that no resync/recovery happens until something else writes to the device. This is important for resume-from-disk off an md array. However if an array is started 'readonly' (by writing 'readonly' to the 'array_state' sysfs attribute) we want it to be really 'readonly', not 'auto-readonly'. So strengthen the condition to only set auto-readonly if the array is not already read-only. Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-01-25DM: Fix device mapper topology stackingMartin K. Petersen
commit b27d7f16d3c6c27345d4280a739809c1c2c4c0b5 upstream. Make DM use bdev_stack_limits() function so that partition offsets get taken into account when calculating alignment. Clarify stacking warnings. Also remove obsolete clearing of final alignment_offset and misalignment flag. Signed-off-by: Martin K. Petersen <martin.petersen@oracle.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Cc: Alasdair G. Kergon <agk@redhat.com> Signed-off-by: Jens Axboe <jens.axboe@oracle.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2010-01-06md: Fix unfortunate interaction with evmsNeilBrown
commit cbd1998377504df005302ac90d49db72a48552a6 upstream. evms configures md arrays by: open device send ioctl close device for each different ioctl needed. Since 2.6.29, the device can disappear after the 'close' unless a significant configuration has happened to the device. The change made by "SET_ARRAY_INFO" can too minor to stop the device from disappearing, but important enough that losing the change is bad. So: make sure SET_ARRAY_INFO sets mddev->ctime, and keep the device active as long as ctime is non-zero (it gets zeroed with lots of other things when the array is stopped). This is suitable for -stable kernels since 2.6.29. Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-12-18dm crypt: make wipe message also wipe essiv keyMilan Broz
commit 542da317668c35036e8471822a564b609d05af66 upstream. The "wipe key" message is used to wipe the volume key from memory temporarily, for example when suspending to RAM. But the initialisation vector in ESSIV mode is calculated from the hashed volume key, so the wipe message should wipe this IV key too and reinitialise it when the volume key is reinstated. This patch adds an IV wipe method called from a wipe message callback. ESSIV is then reinitialised using the init function added by the last patch. Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-12-18dm crypt: separate essiv allocation from initialisationMilan Broz
commit b95bf2d3d5a48b095bffe2a0cd8c40453cf59557 upstream. This patch separates the construction of IV from its initialisation. (For ESSIV it is a hash calculation based on volume key.) Constructor code now preallocates hash tfm and salt array and saves it in a private IV structure. The next patch requires this to reinitialise the wiped IV without reallocating memory when resuming a suspended device. Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-12-18dm snapshot: cope with chunk size larger than originMikulas Patocka
commit 8e87b9b81b3c370f7e53c1ab6e1c3519ef37a644 upstream. Under some special conditions the snapshot hash_size is calculated as zero. This patch instead sets a minimum value of 64, the same as for the pending exception table. rounddown_pow_of_two(0) is an undefined operation (it expands to shift by -1). init_exception_table with an argument of 0 would fail with -ENOMEM. The way to trigger the problem is to create a snapshot with a chunk size that is larger than the origin device. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-12-18dm: avoid _hash_lock deadlockMikulas Patocka
commit 6076905b5ef39e0ea58db32583c9e0036c05e47b upstream. Fix a reported deadlock if there are still unprocessed multipath events on a device that is being removed. _hash_lock is held during dev_remove while trying to send the outstanding events. Sending the events requests the _hash_lock again in dm_copy_name_and_uuid. This patch introduces a separate lock around regions that modify the link to the hash table (dm_set_mdptr) or the name or uuid so that dm_copy_name_and_uuid no longer needs _hash_lock. Additionally, dm_copy_name_and_uuid can only be called if md exists so we can drop the dm_get() and dm_put() which can lead to a BUG() while md is being freed. The deadlock: #0 [ffff8106298dfb48] schedule at ffffffff80063035 #1 [ffff8106298dfc20] __down_read at ffffffff8006475d #2 [ffff8106298dfc60] dm_copy_name_and_uuid at ffffffff8824f740 #3 [ffff8106298dfc90] dm_send_uevents at ffffffff88252685 #4 [ffff8106298dfcd0] event_callback at ffffffff8824c678 #5 [ffff8106298dfd00] dm_table_event at ffffffff8824dd01 #6 [ffff8106298dfd10] __hash_remove at ffffffff882507ad #7 [ffff8106298dfd30] dev_remove at ffffffff88250865 #8 [ffff8106298dfd60] ctl_ioctl at ffffffff88250d80 #9 [ffff8106298dfee0] do_ioctl at ffffffff800418c4 #10 [ffff8106298dff00] vfs_ioctl at ffffffff8002fab9 #11 [ffff8106298dff40] sys_ioctl at ffffffff8004bdaf #12 [ffff8106298dff80] tracesys at ffffffff8005d28d (via system_call) Reported-by: guy keren <choo@actcom.co.il> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-12-18dm crypt: restructure essiv error pathMilan Broz
commit 5861f1be00b3b70f8ab5e5a81392a6cf69666cd2 upstream. Use kzfree for salt deallocation because it is derived from the volume key. Use a common error path in ESSIV constructor. Required by a later patch which fixes the way key material is wiped from memory. Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-12-18dm crypt: move private iv fields to structsMilan Broz
commit 6047359277517c4e56d8bfd6ea4966d7a3924151 upstream. Define private structures for IV so it's easy to add further attributes in a following patch which fixes the way key material is wiped from memory. Also move ESSIV destructor and remove unnecessary 'status' operation. There are no functional changes in this patch. Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-12-18dm snapshot: only take lock for statustype info not tableMikulas Patocka
commit 94e76572b5dd37b1f0f4b3742ee8a565daead932 upstream. Take snapshot lock only for STATUSTYPE_INFO, not STATUSTYPE_TABLE. Commit 4c6fff445d7aa753957856278d4d93bcad6e2c14 (dm-snapshot-lock-snapshot-while-supplying-status.patch) introduced this use of the lock, but userspace applications using libdevmapper have been found to request STATUSTYPE_TABLE while the device is suspended and the lock is already held, leading to deadlock. Since the lock is not necessary in this case, don't try to take it. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-12-18dm exception store: free tmp_store on persistent flag errorJulia Lawall
commit 613978f8711c7fd4d0aa856872375d2abd7c92ff upstream. Error handling code following a kmalloc should free the allocated data. Signed-off-by: Julia Lawall <julia@diku.dk> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-12-18md/bitmap: protect against bitmap removal while being updated.NeilBrown
commit aa5cbd103887011b4830355f88fb055f9ad2d556 upstream. A write intent bitmap can be removed from an array while the array is active. When this happens, all IO is suspended and flushed before the bitmap is removed. However it is possible that bitmap_daemon_work is still running to clear old bits from the bitmap. If it is, it can dereference the bitmap after it has been freed. So introduce a new mutex to protect bitmap_daemon_work and get it before destroying a bitmap. This is suitable for any current -stable kernel. Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-12-01md: revert incorrect fix for read error handling in raid1.NeilBrown
commit 4706b349f was a forward port of a fix that was needed for SLES10. But in fact it is not needed in mainline because the earlier commit dd00a99e7a fixes the same problem in a better way. Further, this commit introduces a bug in the way it interacts with the automatic read-error-correction. If, after a read error is successfully corrected, the same disk is chosen to re-read - the re-read won't be attempted but an error will be returned instead. After reverting that commit, there is the possibility that a read error on a read-only array (where read errors cannot be corrected as that requires a write) will repeatedly read the same device and continue to get an error. So in the "Array is readonly" case, fail the drive immediately on a read error. Signed-off-by: NeilBrown <neilb@suse.de> Cc: stable@kernel.org
2009-11-13md/raid5: Allow dirty-degraded arrays to be assembled when only party is ↵NeilBrown
degraded. Normally is it not safe to allow a raid5 that is both dirty and degraded to be assembled without explicit request from that admin, as it can cause hidden data corruption. This is because 'dirty' means that the parity cannot be trusted, and 'degraded' means that the parity needs to be used. However, if the device that is missing contains only parity, then there is no issue and assembly can continue. This particularly applies when a RAID5 is being converted to a RAID6 and there is an unclean shutdown while the conversion is happening. So check for whether the degraded space only contains parity, and in that case, allow the assembly. Signed-off-by: NeilBrown <neilb@suse.de>
2009-11-13Don't unconditionally set in_sync on newly added device in raid5_reshapeNeilBrown
When a reshape finds that it can add spare devices into the array, those devices might already be 'in_sync' if they are beyond the old size of the array, or they might not if they are within the array. The first case happens when we change an N-drive RAID5 to an N+1-drive RAID5. The second happens when we convert an N-drive RAID5 to an N+1-drive RAID6. So set the flag more carefully. Also, ->recovery_offset is only meaningful when the flag is clear, so only set it in that case. This change needs the preceding two to ensure that the non-in_sync device doesn't get evicted from the array when it is stopped, in the case where v0.90 metadata is used. Signed-off-by: NeilBrown <neilb@suse.de>
2009-11-13md: allow v0.91 metadata to record devices as being active but not in-sync.NeilBrown
This is a combination that didn't really make sense before. However when a reshape is converting e.g. raid5 -> raid6, the extra device is not fully in-sync, but is certainly active and contains important data. So allow that start to be meaningful and in particular get the 'recovery_offset' value (which is needed for any non-in-sync active device) from the reshape_position. Signed-off-by: NeilBrown <neilb@suse.de>
2009-11-12md: factor out updating of 'recovery_offset'.NeilBrown
Each device has its own 'recovery_offset' showing how far recovery has progressed on the device. As the only real significance of this is that fact that it can be stored in the metadata and recovered at restart, and as only 1.x metadata can do this, we were only updating 'recovery_offset' to 'curr_resync_completed' when updating v1.x metadata. But this is wrong, and we will shortly make limited use of this field in v0.90 metadata. So move the update into common code. Signed-off-by: NeilBrown <neilb@suse.de>
2009-11-06md/raid5: make sure curr_sync_completes is uptodate when reshape startsNeilBrown
This value is visible through sysfs and is used by mdadm when it manages a reshape (backing up data that is about to be rearranged). So it is important that it is always correct. Current it does not get updated properly when a reshape starts which can cause problems when assembling an array that is in the middle of being reshaped. This is suitable for 2.6.31.y stable kernels. Cc: stable@kernel.org Signed-off-by: NeilBrown <neilb@suse.de>
2009-11-06md: don't clear endpoint for resync when resync is interrupted.NeilBrown
If a 'sync_max' has been set (via sysfs), it is wrong to clear it until a resync (or reshape or recovery ...) actually reached that point. So if a resync is interrupted (e.g. by device failure), leave 'resync_max' unchanged. This is particularly important for 'reshape' operations that do not change the size of the array. For such operations mdadm needs to monitor the reshape taking rolling backups of the section being reshaped. If resync_max gets cleared, the reshape can get ahead of mdadm and then the backups that mdadm creates are useless. This is suitable for 2.6.31.y stable kernels. Cc: stable@kernel.org Signed-off-by: NeilBrown <neilb@suse.de>
2009-10-31Merge branch 'for-linus' of git://neil.brown.name/mdLinus Torvalds
* 'for-linus' of git://neil.brown.name/md: async_tx: fix asynchronous raid6 recovery for ddf layouts async_pq: rename scribble page async_pq: kill a stray dma_map() call and other cleanups md/raid6: kill a gcc-4.0.1 'uninitialized variable' warning raid6/async_tx: handle holes in block list in async_syndrome_val md/async: don't pass a memory pointer as a page pointer. md: Fix handling of raid5 array which is being reshaped to fewer devices. md: fix problems with RAID6 calculations for DDF. md/raid456: downlevel multicore operations to raid_run_ops md: drivers/md/unroll.pl replaced with awk analog md: remove clumsy usage of do_sync_mapping_range from bitmap code md: raid1/raid10: handle allocation errors during array setup. md/raid5: initialize conf->device_lock earlier md/raid1/raid10: add a cond_resched Revert "md: do not progress the resync process if the stripe was blocked"
2009-10-19md/raid6: kill a gcc-4.0.1 'uninitialized variable' warningDan Williams
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2009-10-16dm snapshot: allow chunk size to be less than page sizeMikulas Patocka
Allow the snapshot chunk size to be smaller than the page size The code is now capable of handling this due to some previous fixes and enhancements. As the page size varies between computers, prior to this patch, the chunk size of a snapshot dictated which machines could read it: Snapshots created on one machine might not be readable on another. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Reviewed-by: Mike Snitzer <snitzer@redhat.com> Reviewed-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-10-16dm snapshot: use unsigned integer chunk sizeMikulas Patocka
Use unsigned integer chunk size. Maximum chunk size is 512kB, there won't ever be need to use 4GB chunk size, so the number can be 32-bit. This fixes compiler failure on 32-bit systems with large block devices. Cc: stable@kernel.org Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Reviewed-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-10-16dm snapshot: lock snapshot while supplying statusMikulas Patocka
This patch locks the snapshot when returning status. It fixes a race when it could return an invalid number of free chunks if someone was simultaneously modifying it. Cc: stable@kernel.org Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-10-16dm exception store: fix failed set_chunk_size error pathMikulas Patocka
Properly close the device if failing because of an invalid chunk size. Cc: stable@kernel.org Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-10-16dm snapshot: require non zero chunk size by end of ctrMikulas Patocka
If we are creating snapshot with memory-stored exception store, fail if the user didn't specify chunk size. Zero chunk size would probably crash a lot of places in the rest of snapshot code. Cc: stable@kernel.org Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Reviewed-by: Jonathan Brassow <jbrassow@redhat.com> Reviewed-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-10-16dm: dec_pending needs locking to save error valueKiyoshi Ueda
Multiple instances of dec_pending() can run concurrently so a lock is needed when it saves the first error code. I have never experienced actual problem without locking and just found this during code inspection while implementing the barrier support patch for request-based dm. This patch adds the locking. I've done compile, boot and basic I/O testings. Cc: stable@kernel.org Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com> Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-10-16dm: add missing del_gendisk to alloc_dev error pathZdenek Kabelac
Add missing del_gendisk() to error path when creation of workqueue fails. Otherwice there is a resource leak and following warning is shown: WARNING: at fs/sysfs/dir.c:487 sysfs_add_one+0xc5/0x160() sysfs: cannot create duplicate filename '/devices/virtual/block/dm-0' Cc: stable@kernel.org Signed-off-by: Zdenek Kabelac <zkabelac@redhat.com> Reviewed-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-10-16dm log: userspace fix incorrect luid cast in userspace_ctrAndrew Morton
mips: drivers/md/dm-log-userspace-base.c: In function `userspace_ctr': drivers/md/dm-log-userspace-base.c:159: warning: cast from pointer to integer of different size Cc: stable@kernel.org Cc: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-10-16dm snapshot: free exception store on init failureJonathan Brassow
While initializing the snapshot module, if we fail to register the snapshot target then we must back-out the exception store module initialization. Cc: stable@kernel.org Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Reviewed-by: Mikulas Patocka <mpatocka@redhat.com> Reviewed-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-10-16dm snapshot: sort by chunk size to fix raceMikulas Patocka
Avoid a race causing corruption when snapshots of the same origin have different chunk sizes by sorting the internal list of snapshots by chunk size, largest first. https://bugzilla.redhat.com/show_bug.cgi?id=182659 For example, let's have two snapshots with different chunk sizes. The first snapshot (1) has small chunk size and the second snapshot (2) has large chunk size. Let's have chunks A, B, C in these snapshots: snapshot1: ====A==== ====B==== snapshot2: ==========C========== (Chunk size is a power of 2. Chunks are aligned.) A write to the origin at a position within A and C comes along. It triggers reallocation of A, then reallocation of C and links them together using A as the 'primary' exception. Then another write to the origin comes along at a position within B and C. It creates pending exception for B. C already has a reallocation in progress and it already has a primary exception (A), so nothing is done to it: B and C are not linked. If the reallocation of B finishes before the reallocation of C, because there is no link with the pending exception for C it does not know to wait for it and, the second write is dispatched to the origin and causes data corruption in the chunk C in snapshot2. To avoid this situation, we maintain snapshots sorted in descending order of chunk size. This leads to a guaranteed ordering on the links between the pending exceptions and avoids the problem explained above - both A and B now get linked to C. Cc: stable@kernel.org Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2009-10-16md/async: don't pass a memory pointer as a page pointer.NeilBrown
md/raid6 passes a list of 'struct page *' to the async_tx routines, which then either DMA map them for offload, or take the page_address for CPU based calculations. For RAID6 we sometime leave 'blanks' in the list of pages. For CPU based calcs, we want to treat theses as a page of zeros. For offloaded calculations, we simply don't pass a page to the hardware. Currently the 'blanks' are encoded as a pointer to raid6_empty_zero_page. This is a 4096 byte memory region, not a 'struct page'. This is mostly handled correctly but is rather ugly. So change the code to pass and expect a NULL pointer for the blanks. When taking page_address of a page, we need to check for a NULL and in that case use raid6_empty_zero_page. Signed-off-by: NeilBrown <neilb@suse.de>
2009-10-16md: Fix handling of raid5 array which is being reshaped to fewer devices.NeilBrown
When a raid5 (or raid6) array is being reshaped to have fewer devices, conf->raid_disks is the latter and hence smaller number of devices. However sometimes we want to use a number which is the total number of currently required devices - the larger of the 'old' and 'new' sizes. Before we implemented reducing the number of devices, this was always 'new' i.e. ->raid_disks. Now we need max(raid_disks, previous_raid_disks) in those places. This particularly affects assembling an array that was shutdown while in the middle of a reshape to fewer devices. md.c needs a similar fix when interpreting the md metadata. Signed-off-by: NeilBrown <neilb@suse.de>
2009-10-16md: fix problems with RAID6 calculations for DDF.NeilBrown
Signed-off-by: NeilBrown <neilb@suse.de>
2009-10-16md/raid456: downlevel multicore operations to raid_run_opsDan Williams
The percpu conversion allowed a straightforward handoff of stripe processing to the async subsytem that initially showed some modest gains (+4%). However, this model is too simplistic and leads to stripes bouncing between raid5d and the async thread pool for every invocation of handle_stripe(). As reported by Holger this can fall into a pathological situation severely impacting throughput (6x performance loss). By downleveling the parallelism to raid_run_ops the pathological stripe_head bouncing is eliminated. This version still exhibits an average 11% throughput loss for: mdadm --create /dev/md0 /dev/sd[b-q] -n 16 -l 6 echo 1024 > /sys/block/md0/md/stripe_cache_size dd if=/dev/zero of=/dev/md0 bs=1024k count=2048 ...but the results are at least stable and can be used as a base for further multicore experimentation. Reported-by: Holger Kiehl <Holger.Kiehl@dwd.de> Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: NeilBrown <neilb@suse.de>
2009-10-16md: drivers/md/unroll.pl replaced with awk analogVladimir Dronnikov
drivers/md/unroll.pl replaced by awk script to drop build-time dependency on perl Signed-off-by: Vladimir Dronnikov <dronnikov@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
2009-10-16md: remove clumsy usage of do_sync_mapping_range from bitmap codeNeilBrown
and replace with vfs_fsync which is much neater (but wasn't exported, or even in existence at the time the code was written). Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: NeilBrown <neilb@suse.de>
2009-10-16md: raid1/raid10: handle allocation errors during array setup.NeilBrown
Both raid1 and raid10 create a mempool during startup. If the 'alloc' function for this mempool fails, unplug_slaves is called. If that happens when the pool is being initialised, unplug_slaves will try to use the 'conf' structure that isn't filled in yet, and badness will happen. So ensure that unplug_slaves doesn't get called unless we know that the conf structure if fully initialised. Signed-off-by: NeilBrown <neilb@suse.de>
2009-10-16md/raid5: initialize conf->device_lock earlierDan Williams
Deallocating a raid5_conf_t structure requires taking 'device_lock'. Ensure it is initialized before it is used, i.e. initialize the lock before attempting any further initializations that might fail. Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: NeilBrown <neilb@suse.de>
2009-10-16md/raid1/raid10: add a cond_reschedNeilBrown
During 'check' of a raid1 or raid10 it is possible for the management thread to spend a lot of time running 'memcmp' on blocks from different devices, so make sure the thread has a chance to schedule. raid5d already has a cond_resched (in process_stripe). Reported-By: Lee Howard <faxguy@howardsilvan.com> Signed-off-by: NeilBrown <neilb@suse.de>
2009-10-16Revert "md: do not progress the resync process if the stripe was blocked"NeilBrown
This reverts commit df10cfbc4d7ab93260d997df754219d390d62a9d. This patch was based on a misunderstanding and risks introducing a busy-wait loop. So revert it. Acked-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: NeilBrown <neilb@suse.de>
2009-10-06block: Seperate read and write statistics of in_flight requests v2Nikanth Karthikesan
Commit a9327cac440be4d8333bba975cbbf76045096275 added seperate read and write statistics of in_flight requests. And exported the number of read and write requests in progress seperately through sysfs. But Corrado Zoccolo <czoccolo@gmail.com> reported getting strange output from "iostat -kx 2". Global values for service time and utilization were garbage. For interval values, utilization was always 100%, and service time is higher than normal. So this was reverted by commit 0f78ab9899e9d6acb09d5465def618704255963b The problem was in part_round_stats_single(), I missed the following: if (now == part->stamp) return; - if (part->in_flight) { + if (part_in_flight(part)) { __part_stat_add(cpu, part, time_in_queue, part_in_flight(part) * (now - part->stamp)); __part_stat_add(cpu, part, io_ticks, (now - part->stamp)); With this chunk included, the reported regression gets fixed. Signed-off-by: Nikanth Karthikesan <knikanth@suse.de> -- Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-10-04Merge branch 'for-linus' of git://git.kernel.dk/linux-2.6-blockLinus Torvalds
* 'for-linus' of git://git.kernel.dk/linux-2.6-block: (41 commits) Revert "Seperate read and write statistics of in_flight requests" cfq-iosched: don't delay async queue if it hasn't dispatched at all block: Topology ioctls cfq-iosched: use assigned slice sync value, not default cfq-iosched: rename 'desktop' sysfs entry to 'low_latency' cfq-iosched: implement slower async initiate and queue ramp up cfq-iosched: delay async IO dispatch, if sync IO was just done cfq-iosched: add a knob for desktop interactiveness Add a tracepoint for block request remapping block: allow large discard requests block: use normal I/O path for discard requests swapfile: avoid NULL pointer dereference in swapon when s_bdev is NULL fs/bio.c: move EXPORT* macros to line after function Add missing blk_trace_remove_sysfs to be in pair with blk_trace_init_sysfs cciss: fix build when !PROC_FS block: Do not clamp max_hw_sectors for stacking devices block: Set max_sectors correctly for stacking devices cciss: cciss_host_attr_groups should be const cciss: Dynamically allocate the drive_info_struct for each logical drive. cciss: Add usage_count attribute to each logical drive in /sys ...
2009-10-04Revert "Seperate read and write statistics of in_flight requests"Jens Axboe
This reverts commit a9327cac440be4d8333bba975cbbf76045096275. Corrado Zoccolo <czoccolo@gmail.com> reports: "with 2.6.32-rc1 I started getting the following strange output from "iostat -kx 2": Linux 2.6.31bisect (et2) 04/10/2009 _i686_ (2 CPU) avg-cpu: %user %nice %system %iowait %steal %idle 10,70 0,00 3,16 15,75 0,00 70,38 Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util sda 18,22 0,00 0,67 0,01 14,77 0,02 43,94 0,01 10,53 39043915,03 2629219,87 sdb 60,89 9,68 50,79 3,04 1724,43 50,52 65,95 0,70 13,06 488437,47 2629219,87 avg-cpu: %user %nice %system %iowait %steal %idle 2,72 0,00 0,74 0,00 0,00 96,53 Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util sda 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 100,00 sdb 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 100,00 avg-cpu: %user %nice %system %iowait %steal %idle 6,68 0,00 0,99 0,00 0,00 92,33 Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util sda 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 100,00 sdb 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 100,00 avg-cpu: %user %nice %system %iowait %steal %idle 4,40 0,00 0,73 1,47 0,00 93,40 Device: rrqm/s wrqm/s r/s w/s rkB/s wkB/s avgrq-sz avgqu-sz await svctm %util sda 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 0,00 100,00 sdb 0,00 4,00 0,00 3,00 0,00 28,00 18,67 0,06 19,50 333,33 100,00 Global values for service time and utilization are garbage. For interval values, utilization is always 100%, and service time is higher than normal. I bisected it down to: [a9327cac440be4d8333bba975cbbf76045096275] Seperate read and write statistics of in_flight requests and verified that reverting just that commit indeed solves the issue on 2.6.32-rc1." So until this is debugged, revert the bad commit. Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
2009-10-02dm/connector: Only process connector packages from privileged processesPhilipp Reisner
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2009-10-02connector/dm: Fixed a compilation warningPhilipp Reisner
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Acked-by: Lars Ellenberg <lars.ellenberg@linbit.com> Acked-by: Evgeniy Polyakov <zbr@ioremap.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2009-10-02connector: Provide the sender's credentials to the callbackPhilipp Reisner
Signed-off-by: Philipp Reisner <philipp.reisner@linbit.com> Acked-by: Lars Ellenberg <lars.ellenberg@linbit.com> Acked-by: Evgeniy Polyakov <zbr@ioremap.net> Signed-off-by: David S. Miller <davem@davemloft.net>