linux/drivers/md, branch v3.4.73

dm: fix truncated status strings

2013-12-08T15:29:43Z

commit fd7c092e711ebab55b2688d3859d95dfd0301f73 upstream. Avoid returning a truncated table or status string instead of setting the DM_BUFFER_FULL_FLAG when the last target of a table fills the buffer. When processing a table or status request, the function retrieve_status calls ti->type->status. If ti->type->status returns non-zero, retrieve_status assumes that the buffer overflowed and sets DM_BUFFER_FULL_FLAG. However, targets don't return non-zero values from their status method on overflow. Most targets returns always zero. If a buffer overflow happens in a target that is not the last in the table, it gets noticed during the next iteration of the loop in retrieve_status; but if a buffer overflow happens in the last target, it goes unnoticed and erroneously truncated data is returned. In the current code, the targets behave in the following way: * dm-crypt returns -ENOMEM if there is not enough space to store the key, but it returns 0 on all other overflows. * dm-thin returns errors from the status method if a disk error happened. This is incorrect because retrieve_status doesn't check the error code, it assumes that all non-zero values mean buffer overflow. * all the other targets always return 0. This patch changes the ti->type->status function to return void (because most targets don't use the return code). Overflow is detected in retrieve_status: if the status method fills up the remaining space completely, it is assumed that buffer overflow happened. Signed-off-by: Mikulas Patocka Signed-off-by: Alasdair G Kergon Signed-off-by: Greg Kroah-Hartman

md: fix calculation of stacking limits on level change.

2013-12-04T18:50:33Z

commit 02e5f5c0a0f726e66e3d8506ea1691e344277969 upstream. The various ->run routines of md personalities assume that the 'queue' has been initialised by the blk_set_stacking_limits() call in md_alloc(). However when the level is changed (by level_store()) the ->run routine for the new level is called for an array which has already had the stacking limits modified. This can result in incorrect final settings. So call blk_set_stacking_limits() before ->run in level_store(). A specific consequence of this bug is that it causes discard_granularity to be set incorrectly when reshaping a RAID4 to a RAID0. This is suitable for any -stable kernel since 3.3 in which blk_set_stacking_limits() was introduced. Reported-and-tested-by: "Baldysiak, Pawel" Signed-off-by: NeilBrown Signed-off-by: Greg Kroah-Hartman

dm: allocate buffer for messages with small number of arguments using GFP_NOIO

2013-12-04T18:50:31Z

commit f36afb3957353d2529cb2b00f78fdccd14fc5e9c upstream. dm-mpath and dm-thin must process messages even if some device is suspended, so we allocate argv buffer with GFP_NOIO. These messages have a small fixed number of arguments. On the other hand, dm-switch needs to process bulk data using messages so excessive use of GFP_NOIO could cause trouble. The patch also lowers the default number of arguments from 64 to 8, so that there is smaller load on GFP_NOIO allocations. Signed-off-by: Mikulas Patocka Acked-by: Alasdair G Kergon Signed-off-by: Mike Snitzer Signed-off-by: Greg Kroah-Hartman

md: Fix skipping recovery for read-only arrays.

2013-11-13T03:01:48Z

commit 61e4947c99c4494336254ec540c50186d186150b upstream. Since: commit 7ceb17e87bde79d285a8b988cfed9eaeebe60b86 md: Allow devices to be re-added to a read-only array. spares are activated on a read-only array. In case of raid1 and raid10 personalities it causes that not-in-sync devices are marked in-sync without checking if recovery has been finished. If a read-only array is degraded and one of its devices is not in-sync (because the array has been only partially recovered) recovery will be skipped. This patch adds checking if recovery has been finished before marking a device in-sync for raid1 and raid10 personalities. In case of raid5 personality such condition is already present (at raid5.c:6029). Bug was introduced in 3.10 and causes data corruption. Signed-off-by: Pawel Baldysiak Signed-off-by: Lukasz Dorau Signed-off-by: NeilBrown Signed-off-by: Greg Kroah-Hartman

dm snapshot: fix data corruption

2013-11-04T12:23:42Z

commit e9c6a182649f4259db704ae15a91ac820e63b0ca upstream. This patch fixes a particular type of data corruption that has been encountered when loading a snapshot's metadata from disk. When we allocate a new chunk in persistent_prepare, we increment ps->next_free and we make sure that it doesn't point to a metadata area by further incrementing it if necessary. When we load metadata from disk on device activation, ps->next_free is positioned after the last used data chunk. However, if this last used data chunk is followed by a metadata area, ps->next_free is positioned erroneously to the metadata area. A newly-allocated chunk is placed at the same location as the metadata area, resulting in data or metadata corruption. This patch changes the code so that ps->next_free skips the metadata area when metadata are loaded in function read_exceptions. The patch also moves a piece of code from persistent_prepare_exception to a separate function skip_metadata to avoid code duplication. CVE-2013-4299 Signed-off-by: Mikulas Patocka Cc: Mike Snitzer Signed-off-by: Alasdair G Kergon Signed-off-by: Greg Kroah-Hartman

dm-snapshot: fix performance degradation due to small hash size

2013-10-05T14:06:54Z

commit 60e356f381954d79088d0455e357db48cfdd6857 upstream. LVM2, since version 2.02.96, creates origin with zero size, then loads the snapshot driver and then loads the origin. Consequently, the snapshot driver sees the origin size zero and sets the hash size to the lower bound 64. Such small hash table causes performance degradation. This patch changes it so that the hash size is determined by the size of snapshot volume, not minimum of origin and snapshot size. It doesn't make sense to set the snapshot size significantly larger than the origin size, so we do not need to take origin size into account when calculating the hash size. Signed-off-by: Mikulas Patocka Signed-off-by: Mike Snitzer Signed-off-by: Greg Kroah-Hartman

dm snapshot: workaround for a false positive lockdep warning

2013-10-05T14:06:54Z

commit 5ea330a75bd86b2b2a01d7b85c516983238306fb upstream. The kernel reports a lockdep warning if a snapshot is invalidated because it runs out of space. The lockdep warning was triggered by commit 0976dfc1d0cd80a4e9dfaf87bd87 ("workqueue: Catch more locking problems with flush_work()") in v3.5. The warning is false positive. The real cause for the warning is that the lockdep engine treats different instances of md->lock as a single lock. This patch is a workaround - we use flush_workqueue instead of flush_work. This code path is not performance sensitive (it is called only on initialization or invalidation), thus it doesn't matter that we flush the whole workqueue. The real fix for the problem would be to teach the lockdep engine to treat different instances of md->lock as separate locks. Signed-off-by: Mikulas Patocka Acked-by: Alasdair G Kergon Signed-off-by: Mike Snitzer Signed-off-by: Greg Kroah-Hartman

md/raid1,raid10: use freeze_array in place of raise_barrier in various places.

2013-08-20T15:26:28Z

commit e2d59925221cd562e07fee38ec8839f7209ae603 upstream. Various places in raid1 and raid10 are calling raise_barrier when they really should call freeze_array. The former is only intended to be called from "make_request". The later has extra checks for 'nr_queued' and makes a call to flush_pending_writes(), so it is safe to call it from within the management thread. Using raise_barrier will sometimes deadlock. Using freeze_array should not. As 'freeze_array' currently expects one request to be pending (in handle_read_error - the only previous caller), we need to pass it the number of pending requests (extra) to ignore. The deadlock was made particularly noticeable by commits 050b66152f87c7 (raid10) and 6b740b8d79252f13 (raid1) which appeared in 3.4, so the fix is appropriate for any -stable kernel since then. This patch probably won't apply directly to some early kernels and will need to be applied by hand. Cc: stable@vger.kernel.org Reported-by: Alexander Lyakas Signed-off-by: NeilBrown [adjust context to make it can be apply on top of 3.4 ] Signed-off-by: Jack Wang Signed-off-by: Greg Kroah-Hartman

md/raid10: remove use-after-free bug.

2013-08-04T08:26:00Z

commit 0eb25bb027a100f5a9df8991f2f628e7d851bc1e upstream. We always need to be careful when calling generic_make_request, as it can start a chain of events which might free something that we are using. Here is one place I wasn't careful enough. If the wbio2 is not in use, then it might get freed at the first generic_make_request call. So perform all necessary tests first. This bug was introduced in 3.3-rc3 (24afd80d99) and can cause an oops, so fix is suitable for any -stable since then. Signed-off-by: NeilBrown Signed-off-by: Greg Kroah-Hartman

md/raid5: fix interaction of 'replace' and 'recovery'.

2013-08-04T08:26:00Z

commit f94c0b6658c7edea8bc19d13be321e3860a3fa54 upstream. If a device in a RAID4/5/6 is being replaced while another is being recovered, then the writes to the replacement device currently don't happen, resulting in corruption when the replacement completes and the new drive takes over. This is because the replacement writes are only triggered when 's.replacing' is set and not when the similar 's.sync' is set (which is the case during resync and recovery - it means all devices need to be read). So schedule those writes when s.replacing is set as well. In this case we cannot use "STRIPE_INSYNC" to record that the replacement has happened as that is needed for recording that any parity calculation is complete. So introduce STRIPE_REPLACED to record if the replacement has happened. For safety we should also check that STRIPE_COMPUTE_RUN is not set. This has a similar effect to the "s.locked == 0" test. The latter ensure that now IO has been flagged but not started. The former checks if any parity calculation has been flagged by not started. We must wait for both of these to complete before triggering the 'replace'. Add a similar test to the subsequent check for "are we finished yet". This possibly isn't needed (is subsumed in the STRIPE_INSYNC test), but it makes it more obvious that the REPLACE will happen before we think we are finished. Finally if a NeedReplace device is not UPTODATE then that is an error. We really must trigger a warning. This bug was introduced in commit 9a3e1101b827a59ac9036a672f5fa8d5279d0fe2 (md/raid5: detect and handle replacements during recovery.) which introduced replacement for raid5. That was in 3.3-rc3, so any stable kernel since then would benefit from this fix. Reported-by: qindehua <13691222965@163.com> Tested-by: qindehua Signed-off-by: NeilBrown Signed-off-by: Greg Kroah-Hartman