linux/drivers/md, branch v3.10.20

md: Fix skipping recovery for read-only arrays.

2013-11-13T03:05:32Z

commit 61e4947c99c4494336254ec540c50186d186150b upstream. Since: commit 7ceb17e87bde79d285a8b988cfed9eaeebe60b86 md: Allow devices to be re-added to a read-only array. spares are activated on a read-only array. In case of raid1 and raid10 personalities it causes that not-in-sync devices are marked in-sync without checking if recovery has been finished. If a read-only array is degraded and one of its devices is not in-sync (because the array has been only partially recovered) recovery will be skipped. This patch adds checking if recovery has been finished before marking a device in-sync for raid1 and raid10 personalities. In case of raid5 personality such condition is already present (at raid5.c:6029). Bug was introduced in 3.10 and causes data corruption. Signed-off-by: Pawel Baldysiak Signed-off-by: Lukasz Dorau Signed-off-by: NeilBrown Signed-off-by: Greg Kroah-Hartman

md: avoid deadlock when md_set_badblocks.

2013-11-13T03:05:32Z

commit 905b0297a9533d7a6ee00a01a990456636877dd6 upstream. When operate harddisk and hit errors, md_set_badblocks is called after scsi_restart_operations which already disabled the irq. but md_set_badblocks will call write_sequnlock_irq and enable irq. so softirq can preempt the current thread and that may cause a deadlock. I think this situation should use write_sequnlock_irqsave/irqrestore instead. I met the situation and the call trace is below: [ 638.919974] BUG: spinlock recursion on CPU#0, scsi_eh_13/1010 [ 638.921923] lock: 0xffff8800d4d51fc8, .magic: dead4ead, .owner: scsi_eh_13/1010, .owner_cpu: 0 [ 638.923890] CPU: 0 PID: 1010 Comm: scsi_eh_13 Not tainted 3.12.0-rc5+ #37 [ 638.925844] Hardware name: To be filled by O.E.M. To be filled by O.E.M./MAHOBAY, BIOS 4.6.5 03/05/2013 [ 638.927816] ffff880037ad4640 ffff880118c03d50 ffffffff8172ff85 0000000000000007 [ 638.929829] ffff8800d4d51fc8 ffff880118c03d70 ffffffff81730030 ffff8800d4d51fc8 [ 638.931848] ffffffff81a72eb0 ffff880118c03d90 ffffffff81730056 ffff8800d4d51fc8 [ 638.933884] Call Trace: [ 638.935867] [] dump_stack+0x55/0x76 [ 638.937878] [] spin_dump+0x8a/0x8f [ 638.939861] [] spin_bug+0x21/0x26 [ 638.941836] [] do_raw_spin_lock+0xa4/0xc0 [ 638.943801] [] _raw_spin_lock+0x66/0x80 [ 638.945747] [] ? scsi_device_unbusy+0x9d/0xd0 [ 638.947672] [] ? _raw_spin_unlock+0x2b/0x50 [ 638.949595] [] scsi_device_unbusy+0x9d/0xd0 [ 638.951504] [] scsi_finish_command+0x37/0xe0 [ 638.953388] [] scsi_softirq_done+0xa8/0x140 [ 638.955248] [] blk_done_softirq+0x7b/0x90 [ 638.957116] [] __do_softirq+0xfd/0x330 [ 638.958987] [] ? __lock_release+0x6f/0x100 [ 638.960861] [] call_softirq+0x1c/0x30 [ 638.962724] [] do_softirq+0x8d/0xc0 [ 638.964565] [] irq_exit+0x10e/0x150 [ 638.966390] [] smp_apic_timer_interrupt+0x4a/0x60 [ 638.968223] [] apic_timer_interrupt+0x6f/0x80 [ 638.970079] [] ? __lock_release+0x6f/0x100 [ 638.971899] [] ? _raw_spin_unlock_irq+0x3a/0x50 [ 638.973691] [] ? _raw_spin_unlock_irq+0x30/0x50 [ 638.975475] [] md_set_badblocks+0x1f3/0x4a0 [ 638.977243] [] rdev_set_badblocks+0x27/0x80 [ 638.978988] [] raid5_end_read_request+0x36b/0x4e0 [raid456] [ 638.980723] [] bio_endio+0x1d/0x40 [ 638.982463] [] req_bio_endio.isra.65+0x83/0xa0 [ 638.984214] [] blk_update_request+0x7f/0x350 [ 638.985967] [] blk_update_bidi_request+0x31/0x90 [ 638.987710] [] __blk_end_bidi_request+0x20/0x50 [ 638.989439] [] __blk_end_request_all+0x1f/0x30 [ 638.991149] [] blk_peek_request+0x106/0x250 [ 638.992861] [] ? scsi_kill_request.isra.32+0xe9/0x130 [ 638.994561] [] scsi_request_fn+0x4a/0x3d0 [ 638.996251] [] __blk_run_queue+0x37/0x50 [ 638.997900] [] blk_run_queue+0x2f/0x50 [ 638.999553] [] scsi_run_queue+0xe0/0x1c0 [ 639.001185] [] scsi_run_host_queues+0x21/0x40 [ 639.002798] [] scsi_restart_operations+0x177/0x200 [ 639.004391] [] scsi_error_handler+0xc9/0xe0 [ 639.005996] [] ? scsi_unjam_host+0xd0/0xd0 [ 639.007600] [] kthread+0xdb/0xe0 [ 639.009205] [] ? flush_kthread_worker+0x170/0x170 [ 639.010821] [] ret_from_fork+0x7c/0xb0 [ 639.012437] [] ? flush_kthread_worker+0x170/0x170 This bug was introduce in commit 2e8ac30312973dd20e68073653 (the first time rdev_set_badblock was call from interrupt context), so this patch is appropriate for 3.5 and subsequent kernels. Signed-off-by: Bian Yu Reviewed-by: Jianpeng Ma Signed-off-by: NeilBrown Signed-off-by: Greg Kroah-Hartman

raid5: avoid finding "discard" stripe

2013-11-13T03:05:31Z

commit d47648fcf0611812286f68131b40251c6fa54f5e upstream. SCSI discard will damage discard stripe bio setting, eg, some fields are changed. If the stripe is reused very soon, we have wrong bios setting. We remove discard stripe from hash list, so next time the strip will be fully initialized. Suitable for backport to 3.7+. Signed-off-by: Shaohua Li Signed-off-by: NeilBrown Signed-off-by: Greg Kroah-Hartman

raid5: set bio bi_vcnt 0 for discard request

2013-11-13T03:05:31Z

commit 37c61ff31e9b5e3fcf3cc6579f5c68f6ad40c4b1 upstream. SCSI layer will add new payload for discard request. If two bios are merged to one, the second bio has bi_vcnt 1 which is set in raid5. This will confuse SCSI and cause oops. Suitable for backport to 3.7+ Reported-by: Jes Sorensen Signed-off-by: Shaohua Li Signed-off-by: NeilBrown Acked-by: Martin K. Petersen Signed-off-by: Greg Kroah-Hartman

bcache: Fixed incorrect order of arguments to bio_alloc_bioset()

2013-11-13T03:05:30Z

commit d4eddd42f592a0cf06818fae694a3d271f842e4d upstream. Signed-off-by: Kent Overstreet Signed-off-by: Linus Torvalds Signed-off-by: Greg Kroah-Hartman

dm snapshot: fix data corruption

2013-11-04T12:31:06Z

commit e9c6a182649f4259db704ae15a91ac820e63b0ca upstream. This patch fixes a particular type of data corruption that has been encountered when loading a snapshot's metadata from disk. When we allocate a new chunk in persistent_prepare, we increment ps->next_free and we make sure that it doesn't point to a metadata area by further incrementing it if necessary. When we load metadata from disk on device activation, ps->next_free is positioned after the last used data chunk. However, if this last used data chunk is followed by a metadata area, ps->next_free is positioned erroneously to the metadata area. A newly-allocated chunk is placed at the same location as the metadata area, resulting in data or metadata corruption. This patch changes the code so that ps->next_free skips the metadata area when metadata are loaded in function read_exceptions. The patch also moves a piece of code from persistent_prepare_exception to a separate function skip_metadata to avoid code duplication. CVE-2013-4299 Signed-off-by: Mikulas Patocka Cc: Mike Snitzer Signed-off-by: Alasdair G Kergon Signed-off-by: Greg Kroah-Hartman

bcache: Fix a null ptr deref regression

2013-10-13T23:08:35Z

commit 2fe80d3bbf1c8bd9efc5b8154207c8dd104e7306 upstream. Commit c0f04d88e46d ("bcache: Fix flushes in writeback mode") was fixing a reported data corruption bug, but it seems some last minute refactoring or rebasing introduced a null pointer deref. Signed-off-by: Kent Overstreet Reported-by: Gabriel de Perthuis Signed-off-by: Linus Torvalds Signed-off-by: Greg Kroah-Hartman

dm-raid: silence compiler warning on rebuilds_per_group.

2013-10-05T14:13:11Z

commit 3f6bbd3ffd7b733dd705e494663e5761aa2cb9c1 upstream. This doesn't really need to be initialised, but it doesn't hurt, silences the compiler, and as it is a counter it makes sense for it to start at zero. Signed-off-by: NeilBrown Signed-off-by: Greg Kroah-Hartman

dm mpath: disable WRITE SAME if it fails

2013-10-05T14:13:11Z

commit f84cb8a46a771f36a04a02c61ea635c968ed5f6a upstream. Workaround the SCSI layer's problematic WRITE SAME heuristics by disabling WRITE SAME in the DM multipath device's queue_limits if an underlying device disabled it. The WRITE SAME heuristics, with both the original commit 5db44863b6eb ("[SCSI] sd: Implement support for WRITE SAME") and the updated commit 66c28f971 ("[SCSI] sd: Update WRITE SAME heuristics"), default to enabling WRITE SAME(10) even without successfully determining it is supported. After the first failed WRITE SAME the SCSI layer will disable WRITE SAME for the device (by setting sdkp->device->no_write_same which results in 'max_write_same_sectors' in device's queue_limits to be set to 0). When a device is stacked ontop of such a SCSI device any changes to that SCSI device's queue_limits do not automatically propagate up the stack. As such, a DM multipath device will not have its WRITE SAME support disabled. This causes the block layer to continue to issue WRITE SAME requests to the mpath device which causes paths to fail and (if mpath IO isn't configured to queue when no paths are available) it will result in actual IO errors to the upper layers. This fix doesn't help configurations that have additional devices stacked ontop of the mpath device (e.g. LVM created linear DM devices ontop). A proper fix that restacks all the queue_limits from the bottom of the device stack up will need to be explored if SCSI will continue to use this model of optimistically allowing op codes and then disabling them after they fail for the first time. Before this patch: EXT4-fs (dm-6): mounted filesystem with ordered data mode. Opts: (null) device-mapper: multipath: XXX snitm debugging: got -EREMOTEIO (-121) device-mapper: multipath: XXX snitm debugging: failing WRITE SAME IO with error=-121 end_request: critical target error, dev dm-6, sector 528 dm-6: WRITE SAME failed. Manually zeroing. device-mapper: multipath: Failing path 8:112. end_request: I/O error, dev dm-6, sector 4616 dm-6: WRITE SAME failed. Manually zeroing. end_request: I/O error, dev dm-6, sector 4616 end_request: I/O error, dev dm-6, sector 5640 end_request: I/O error, dev dm-6, sector 6664 end_request: I/O error, dev dm-6, sector 7688 end_request: I/O error, dev dm-6, sector 524288 Buffer I/O error on device dm-6, logical block 65536 lost page write due to I/O error on dm-6 JBD2: Error -5 detected when updating journal superblock for dm-6-8. end_request: I/O error, dev dm-6, sector 524296 Aborting journal on device dm-6-8. end_request: I/O error, dev dm-6, sector 524288 Buffer I/O error on device dm-6, logical block 65536 lost page write due to I/O error on dm-6 JBD2: Error -5 detected when updating journal superblock for dm-6-8. # cat /sys/block/sdh/queue/write_same_max_bytes 0 # cat /sys/block/dm-6/queue/write_same_max_bytes 33553920 After this patch: EXT4-fs (dm-6): mounted filesystem with ordered data mode. Opts: (null) device-mapper: multipath: XXX snitm debugging: got -EREMOTEIO (-121) device-mapper: multipath: XXX snitm debugging: WRITE SAME I/O failed with error=-121 end_request: critical target error, dev dm-6, sector 528 dm-6: WRITE SAME failed. Manually zeroing. # cat /sys/block/sdh/queue/write_same_max_bytes 0 # cat /sys/block/dm-6/queue/write_same_max_bytes 0 It should be noted that WRITE SAME support wasn't enabled in DM multipath until v3.10. Signed-off-by: Mike Snitzer Cc: Martin K. Petersen Cc: Hannes Reinecke Signed-off-by: Greg Kroah-Hartman

dm-snapshot: fix performance degradation due to small hash size

2013-10-05T14:13:11Z

commit 60e356f381954d79088d0455e357db48cfdd6857 upstream. LVM2, since version 2.02.96, creates origin with zero size, then loads the snapshot driver and then loads the origin. Consequently, the snapshot driver sees the origin size zero and sets the hash size to the lower bound 64. Such small hash table causes performance degradation. This patch changes it so that the hash size is determined by the size of snapshot volume, not minimum of origin and snapshot size. It doesn't make sense to set the snapshot size significantly larger than the origin size, so we do not need to take origin size into account when calculating the hash size. Signed-off-by: Mikulas Patocka Signed-off-by: Mike Snitzer Signed-off-by: Greg Kroah-Hartman