linux - Linux kernel source tree

Age	Commit message (Collapse)	Author
2013-05-13	md: bad block list should default to disabled.	NeilBrown
	commit 486adf72ccc0c235754923d47a2270c5dcb0c98b upstream. Maintenance of a bad-block-list currently defaults to 'enabled' and is then disabled when it cannot be supported. This is backwards and causes problem for dm-raid which didn't know to disable it. So fix the defaults, and only enabled for v1.x metadata which explicitly has bad blocks enabled. The problem with dm-raid has been present since badblock support was added in v3.1, so this patch is suitable for any -stable from 3.1 onwards. Reported-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2013-03-20	md: fix two bugs when attempting to resize RAID0 array.	NeilBrown
	commit a64685399181780998281fe07309a94b25dd24c3 upstream. You cannot resize a RAID0 array (in terms of making the devices bigger), but the code doesn't entirely stop you. So: disable setting of the available size on each device for RAID0 and Linear devices. This must not change as doing so can change the effective layout of data. Make sure that the size that raid0_size() reports is accurate, but rounding devices sizes to chunk sizes. As the device sizes cannot change now, this isn't so important, but it is best to be safe. Without this change: mdadm --grow /dev/md0 -z max mdadm --grow /dev/md0 -Z max then read to the end of the array can cause a BUG in a RAID0 array. These bugs have been present ever since it became possible to resize any device, which is a long time. So the fix is suitable for any -stable kerenl. Signed-off-by: NeilBrown <neilb@suse.de> [bwh: Backported to 3.2: adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2013-03-20	md: protect against crash upon fsync on ro array	Sebastian Riemer
	commit bbfa57c0f2243a7c31fd248d22e9861a2802cad5 upstream. If an fsync occurs on a read-only array, we need to send a completion for the IO and may not increment the active IO count. Otherwise, we hit a bug trace and can't stop the MD array anymore. By advice of Christoph Hellwig we return success upon a flush request but we return -EROFS for other writes. We detect flush requests by checking if the bio has zero sectors. This patch is suitable to any -stable kernel to which it applies. Cc: Christoph Hellwig <hch@infradead.org> Cc: Ben Hutchings <ben@decadent.org.uk> Cc: NeilBrown <neilb@suse.de> Signed-off-by: Sebastian Riemer <sebastian.riemer@profitbricks.com> Reported-by: Ben Hutchings <ben@decadent.org.uk> Acked-by: Paul Menzel <paulepanter@users.sourceforge.net> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-12-06	md: Avoid write invalid address if read_seqretry returned true.	majianpeng
	commit 35f9ac2dcec8f79d7059ce174fd7b7ee3290d620 upstream. If read_seqretry returned true and bbp was changed, it will write invalid address which can cause some serious problem. This bug was introduced by commit v3.0-rc7-130-g2699b67. So fix is suitable for 3.0.y thru 3.6.y. Reported-by: zhuwenfeng@kedacom.com Tested-by: zhuwenfeng@kedacom.com Signed-off-by: Jianpeng Ma <majianpeng@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-12-06	md: Reassigned the parameters if read_seqretry returned true in func ↵	majianpeng
	md_is_badblock. commit ab05613a0646dcc11049692d54bae76ca9ffa910 upstream. This bug was introduced by commit(v3.0-rc7-126-g2230dfe). So fix is suitable for 3.0.y thru 3.6.y. Signed-off-by: Jianpeng Ma <majianpeng@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-09-12	md: Don't truncate size at 4TB for RAID0 and Linear	NeilBrown
	commit 667a5313ecd7308d79629c0738b0db588b0b0a4e upstream. commit 27a7b260f71439c40546b43588448faac01adb93 md: Fix handling for devices from 2TB to 4TB in 0.90 metadata. changed 0.90 metadata handling to truncated size to 4TB as that is all that 0.90 can record. However for RAID0 and Linear, 0.90 doesn't need to record the size, so this truncation is not needed and causes working arrays to become too small. So avoid the truncation for RAID0 and Linear This bug was introduced in 3.1 and is suitable for any stable kernels from then onwards. As the offending commit was tagged for 'stable', any stable kernel that it was applied to should also get this patch. That includes at least 2.6.32, 2.6.33 and 3.0. (Thanks to Ben Hutchings for providing that list). Signed-off-by: Neil Brown <neilb@suse.de> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-07-25	md: avoid crash when stopping md array races with closing other open fds.	NeilBrown
	commit a05b7ea03d72f36edb0cec05e8893803335c61a0 upstream. md will refuse to stop an array if any other fd (or mounted fs) is using it. When any fs is unmounted of when the last open fd is closed all pending IO will be flushed (e.g. sync_blockdev call in __blkdev_put) so there will be no pending IO to worry about when the array is stopped. However in order to send the STOP_ARRAY ioctl to stop the array one must first get and open fd on the block device. If some fd is being used to write to the block device and it is closed after mdadm open the block device, but before mdadm issues the STOP_ARRAY ioctl, then there will be no last-close on the md device so __blkdev_put will not call sync_blockdev. If this happens, then IO can still be in-flight while md tears down the array and bad things can happen (use-after-free and subsequent havoc). So in the case where do_md_stop is being called from an open file descriptor, call sync_block after taking the mutex to ensure there will be no new openers. This is needed when setting a read-write device to read-only too. Reported-by: majianpeng <majianpeng@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-05-31	md: using GFP_NOIO to allocate bio for flush request	Shaohua Li
	commit b5e1b8cee7ad58a15d2fa79bcd7946acb592602d upstream. A flush request is usually issued in transaction commit code path, so using GFP_KERNEL to allocate memory for flush request bio falls into the classic deadlock issue. This is suitable for any -stable kernel to which it applies as it avoids a possible deadlock. Signed-off-by: Shaohua Li <shli@fusionio.com> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-05-20	MD: Add del_timer_sync to mddev_suspend (fix nasty panic)	Jonathan Brassow
	commit 0d9f4f135eb6dea06bdcb7065b1e4ff78274a5e9 upstream. Use del_timer_sync to remove timer before mddev_suspend finishes. We don't want a timer going off after an mddev_suspend is called. This is especially true with device-mapper, since it can call the destructor function immediately following a suspend. This results in the removal (kfree) of the structures upon which the timer depends - resulting in a very ugly panic. Therefore, we add a del_timer_sync to mddev_suspend to prevent this. Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-05-11	md: fix possible corruption of array metadata on shutdown.	NeilBrown
	commit 30b8aa9172dfeaac6d77897c67ee9f9fc574cdbb upstream. commit c744a65c1e2d59acc54333ce8 md: don't set md arrays to readonly on shutdown. removed the possibility of a 'BUG' when data is written to an array that has just been switched to read-only, but also introduced the possibility that the array metadata could be corrupted. If, when md_notify_reboot gets the mddev lock, the array is in a state where it is assembled but hasn't been started (as can happen if the personality module is not available, or in other unusual situations), then incorrect metadata will be written out making it impossible to re-assemble the array. So only call __md_stop_writes() if the array has actually been activated. This patch is needed for any stable kernel which has had the above commit applied. Reported-by: Christoph Nelles <evilazrael@evilazrael.de> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-04-02	md: fix clearing of the 'changed' flags for the bad blocks list.	NeilBrown
	commit d0962936bff659d20522555b517582a2715fd23f upstream. In super_1_sync (the first hunk) we need to clear 'changed' before checking read_seqretry(), otherwise we might race with other code adding a bad block and so won't retry later. In md_update_sb (the second hunk), in the case where there is no metadata (neither persistent nor external), we treat any bad blocks as an error. However we need to clear the 'changed' flag before calling md_ack_all_badblocks, else it won't do anything. This patch is suitable for -stable release 3.0 and later. Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-04-02	md: don't set md arrays to readonly on shutdown.	NeilBrown
	commit c744a65c1e2d59acc54333ce80a5b0702a98010b upstream. It seems that with recent kernel, writeback can still be happening while shutdown is happening, and consequently data can be written after the md reboot notifier switches all arrays to read-only. This causes a BUG. So don't switch them to read-only - just mark them clean and set 'safemode' to '2' which mean that immediately after any write the array will be switch back to 'clean'. This could result in the shutdown happening when array is marked dirty, thus forcing a resync on reboot. However if you reboot without performing a "sync" first, you get to keep both halves. This is suitable for any stable kernel (though there might be some conflicts with obvious fixes in earlier kernels). Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2011-12-23	md: don't give up looking for spares on first failure-to-add	NeilBrown
	Before performing a recovery we try to remove any spares that might not be working, then add any that might have become relevant. Currently we abort on the first spare that cannot be added. This is a false optimisation. It is conceivable that - depending on rules in the personality - a subsequent spare might be accepted. Also the loop does other things like count the available spares and reset the 'recovery_offset' value. If we abort early these might not happen properly. So remove the early abort. In particular if you have an array what is undergoing recovery and which has extra spares, then the recovery may not restart after as reboot as the could of 'spares' might end up as zero. Reported-by: Anssi Hannula <anssi.hannula@iki.fi> Signed-off-by: NeilBrown <neilb@suse.de>
2011-12-08	md: ensure new badblocks are handled promptly.	NeilBrown
	When we mark blocks as bad we need them to be acknowledged by the metadata handler promptly. For an in-kernel metadata handler that was already being done. But for an external metadata handler we need to alert it of the change by sending a notification through the sysfs file. This adds that notification. Signed-off-by: NeilBrown <neilb@suse.de>
2011-12-08	md: bad blocks shouldn't cause a Blocked status on a Faulty device.	NeilBrown
	Once a device is marked Faulty the badblocks - whether acknowledged or not - become irrelevant. So they shouldn't cause the device to be marked as Blocked. Without this patch, a process might write "-blocked" to clear the Blocked status, but while that will correctly fail the device, it won't remove the apparent 'blocked' status. Signed-off-by: NeilBrown <neilb@suse.de>
2011-12-08	md: take a reference to mddev during sysfs access.	NeilBrown
	When we are accessing an mddev via sysfs we know that the mddev cannot disappear because it has an embedded kobj which is refcounted by sysfs. And we also take the mddev_lock. However this is not enough. The final mddev_put could have been called and the mddev_delayed_delete is waiting for sysfs to let go so it can destroy the kobj and mddev. In this state there are a lot of changes that should not be attempted. To to guard against this we: - initialise mddev->all_mddevs in on last put so the state can be easily detected. - in md_attr_show and md_attr_store, check ->all_mddevs under all_mddevs_lock and mddev_get the mddev if it still appears to be active. This means that if we get to sysfs as the mddev is being deleted we will get -EBUSY. rdev_attr_store and rdev_attr_show are similar but already have sufficient protection. They check that rdev->mddev still points to mddev after taking mddev_lock. As this is cleared before delayed removal which can only be requested under the mddev_lock, this ensure the rdev and mddev are still alive. Signed-off-by: NeilBrown <neilb@suse.de>
2011-12-08	md: refine interpretation of "hold_active == UNTIL_IOCTL".	NeilBrown
	We like md devices to disappear when they really are not needed. However it is not possible to tell from the current state whether it is needed or not. We can only tell from recent history of changes. In particular immediately after we create an md device it looks very similar to immediately after we have finished with it. So we always preserve a newly created md device until something significant happens. This state is stored in 'hold_active'. The normal case is to keep it until an ioctl happens, as that will normally either activate it, or explicitly de-activate it. If it doesn't then it was probably created by mistake and it is now time to get rid of it. We can also modify an array via sysfs (instead of via ioctl) and we currently treat any change via sysfs like an ioctl as a sign that if it now isn't more active, it should be destroyed. However this is not appropriate as changes made via sysfs are more gradual so we should look for a more definitive change. So this patch only clears 'hold_active' from UNTIL_IOCTL to clear when the array_state is changed via sysfs. Other changes via sysfs are ignored. Signed-off-by: NeilBrown <neilb@suse.de>
2011-11-06	Merge branch 'modsplit-Oct31_2011' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux * 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux: (230 commits) Revert "tracing: Include module.h in define_trace.h" irq: don't put module.h into irq.h for tracking irqgen modules. bluetooth: macroize two small inlines to avoid module.h ip_vs.h: fix implicit use of module_get/module_put from module.h nf_conntrack.h: fix up fallout from implicit moduleparam.h presence include: replace linux/module.h with "struct module" wherever possible include: convert various register fcns to macros to avoid include chaining crypto.h: remove unused crypto_tfm_alg_modname() inline uwb.h: fix implicit use of asm/page.h for PAGE_SIZE pm_runtime.h: explicitly requires notifier.h linux/dmaengine.h: fix implicit use of bitmap.h and asm/page.h miscdevice.h: fix up implicit use of lists and types stop_machine.h: fix implicit use of smp.h for smp_processor_id of: fix implicit use of errno.h in include/linux/of.h of_platform.h: delete needless include <linux/module.h> acpi: remove module.h include from platform/aclinux.h miscdevice.h: delete unnecessary inclusion of module.h device_cgroup.h: delete needless include <linux/module.h> net: sch_generic remove redundant use of <linux/module.h> net: inet_timewait_sock doesnt need <linux/module.h> ... Fix up trivial conflicts (other header files, and removal of the ab3550 mfd driver) in - drivers/media/dvb/frontends/dibx000_common.c - drivers/media/video/{mt9m111.c,ov6650.c} - drivers/mfd/ab3550-core.c - include/linux/dmaengine.h
2011-11-04	Merge branch 'for-3.2/core' of git://git.kernel.dk/linux-block	Linus Torvalds
	* 'for-3.2/core' of git://git.kernel.dk/linux-block: (29 commits) block: don't call blk_drain_queue() if elevator is not up blk-throttle: use queue_is_locked() instead of lockdep_is_held() blk-throttle: Take blkcg->lock while traversing blkcg->policy_list blk-throttle: Free up policy node associated with deleted rule block: warn if tag is greater than real_max_depth. block: make gendisk hold a reference to its queue blk-flush: move the queue kick into blk-flush: fix invalid BUG_ON in blk_insert_flush block: Remove the control of complete cpu from bio. block: fix a typo in the blk-cgroup.h file block: initialize the bounce pool if high memory may be added later block: fix request_queue lifetime handling by making blk_queue_cleanup() properly shutdown block: drop @tsk from attempt_plug_merge() and explain sync rules block: make get_request[_wait]() fail if queue is dead block: reorganize throtl_get_tg() and blk_throtl_bio() block: reorganize queue draining block: drop unnecessary blk_get/put_queue() in scsi_cmd_ioctl() and blk_get_tg() block: pass around REQ_* flags instead of broken down booleans during request alloc/free block: move blk_throtl prototypes to block/blk.h block: fix genhd refcounting in blkio_policy_parse_and_set() ... Fix up trivial conflicts due to "mddev_t" -> "struct mddev" conversion and making the request functions be of type "void" instead of "int" in - drivers/md/{faulty.c,linear.c,md.c,md.h,multipath.c,raid0.c,raid1.c,raid10.c,raid5.c} - drivers/staging/zram/zram_drv.c
2011-10-31	md: Add module.h to all files using it implicitly	Paul Gortmaker
	A pending cleanup will mean that module.h won't be implicitly everywhere anymore. Make sure the modular drivers in md dir are actually calling out for <module.h> explicitly in advance. Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2011-10-19	Merge branch 'v3.1-rc10' into for-3.2/core	Jens Axboe
	Conflicts: block/blk-core.c include/linux/blkdev.h Signed-off-by: Jens Axboe <axboe@kernel.dk>
2011-10-19	md.c: trivial comment fix	Chris Dunlop
	Trivial comment fix Signed-off-by: Chris Dunlop <chris@onthe.net.au> Signed-off-by: NeilBrown <neilb@suse.de>
2011-10-18	MD: Allow restarting an interrupted incremental recovery.	Andrei Warkentin
	If an incremental recovery was interrupted, a subsequent re-add will result in a full recovery, even though an incremental should be possible (seen with raid1). Solve this problem by not updating the superblock on the recovering device until array is not degraded any longer. Cc: Neil Brown <neilb@suse.de> Signed-off-by: Andrei Warkentin <andreiw@vmware.com> Signed-off-by: NeilBrown <neilb@suse.de>
2011-10-18	md: clear In_sync bit on devices added to an active array.	NeilBrown
	When we add a device to an active array it can be meaningful to set the 'insync' flag. This indicates that the device is in-sync with the array except for locations recorded in the bitmap. A bitmap-based recovery can then bring it completely in-sync. Internally we move that flag to 'saved_raid_disk' but forgot to clear In_sync like we do in add_new_disk. So clear In_sync after moving its value to saved_raid_disk. Reported-by: Andrei Warkentin <andreiw@vmware.com> Signed-off-by: NeilBrown <neilb@suse.de>
2011-10-11	md: rename "mdk_personality" to "md_personality"	NeilBrown
	"mdk" doesn't mean anything any more. Signed-off-by: NeilBrown <neilb@suse.de>
2011-10-11	md: remove typedefs: mdk_thread_t -> struct md_thread	NeilBrown
	Signed-off-by: NeilBrown <neilb@suse.de>
2011-10-11	md: remove typedefs: mddev_t -> struct mddev	NeilBrown
	Having mddev_t and 'struct mddev_s' is ugly and not preferred Signed-off-by: NeilBrown <neilb@suse.de>
2011-10-11	md: removing typedefs: mdk_rdev_t -> struct md_rdev	NeilBrown
	The typedefs are just annoying. 'mdk' probably refers to 'md_k.h' which used to be an include file that defined this thing. Signed-off-by: NeilBrown <neilb@suse.de>
2011-10-07	md: remove PRINTK and dprintk debugging and use pr_debug	NeilBrown
	Being able to dynamically enable these make them much more useful. Signed-off-by: NeilBrown <neilb@suse.de>
2011-09-23	md: don't delay reboot by 1 second if no MD devices exist	Daniel P. Berrange
	The md_notify_reboot() method includes a call to mdelay(1000), to deal with "exotic SCSI devices" which are too volatile on reboot. The delay is unconditional. Even if the machine does not have any block devices, let alone MD devices, the kernel shutdown sequence is slowed down. 1 second does not matter much with physical hardware, but with certain virtualization use cases any wasted time in the bootup & shutdown sequence counts for alot. * drivers/md/md.c: md_notify_reboot() - only impose a delay if there was at least one MD device to be stopped during reboot Signed-off-by: Daniel P. Berrange <berrange@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>
2011-09-21	md: Avoid waking up a thread after it has been freed.	NeilBrown
	Two related problems: 1/ some error paths call "md_unregister_thread(mddev->thread)" without subsequently clearing ->thread. A subsequent call to mddev_unlock will try to wake the thread, and crash. 2/ Most calls to md_wakeup_thread are protected against the thread disappeared either by: - holding the ->mutex - having an active request, so something else must be keeping the array active. However mddev_unlock calls md_wakeup_thread after dropping the mutex and without any certainty of an active request, so the ->thread could theoretically disappear. So we need a spinlock to provide some protections. So change md_unregister_thread to take a pointer to the thread pointer, and ensure that it always does the required locking, and clears the pointer properly. Reported-by: "Moshe Melnikov" <moshe@zadarastorage.com> Signed-off-by: NeilBrown <neilb@suse.de> cc: stable@kernel.org
2011-09-12	block: remove support for bio remapping from ->make_request	Christoph Hellwig
	There is very little benefit in allowing to let a ->make_request instance update the bios device and sector and loop around it in __generic_make_request when we can archive the same through calling generic_make_request from the driver and letting the loop in generic_make_request handle it. Note that various drivers got the return value from ->make_request and returned non-zero values for errors. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: NeilBrown <neilb@suse.de> Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
2011-09-10	md: Fix handling for devices from 2TB to 4TB in 0.90 metadata.	NeilBrown
	0.90 metadata uses an unsigned 32bit number to count the number of kilobytes used from each device. This should allow up to 4TB per device. However we multiply this by 2 (to get sectors) before casting to a larger type, so sizes above 2TB get truncated. Also we allow rdev->sectors to be larger than 4TB, so it is possible for the array to be resized larger than the metadata can handle. So make sure rdev->sectors never exceeds 4TB when 0.90 metadata is in used. Also the sanity check at the end of super_90_load should include level 1 as it used ->size too. (RAID0 and Linear don't use ->size at all). Reported-by: Pim Zandbergen <P.Zandbergen@macroscoop.nl> Cc: stable@kernel.org Signed-off-by: NeilBrown <neilb@suse.de>
2011-08-30	md: fix clearing of 'blocked' flag in the presence of bad blocks.	NeilBrown
	When the 'blocked' flag on a device is cleared while there are unacknowledged bad blocks we must fail the device. This is needed for backwards compatability of the interface. The code currently uses the wrong test for "unacknowledged bad blocks exist". Change it to the right test. Signed-off-by: NeilBrown <neilb@suse.de>
2011-08-25	md: use REQ_NOIDLE flag in md_super_write()	Namhyung Kim
	Queue idling is used for the anticipation of immediate sequencial I/O's but md_super_write() is a kind of one- shot operation, coupled with md_super_wait(), so the idling in this case will be just a waste of time. Specifying REQ_NOIDLE prevents it. Instead of adding the flag to submit_bio() directly, use pre-defined macro WRITE_FLUSH_FUA. Signed-off-by: Namhyung Kim <namhyung@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
2011-08-25	md: ensure changes to 'write-mostly' are reflected in metadata.	NeilBrown
	The 'write-mostly' flag can be changed through sysfs. With 0.90 metadata, those changes are reflected in the metadata. For 1.x metadata, they aren't. So fix super_1_sync to record 'write-mostly' status. Signed-off-by: NeilBrown <neilb@suse.de>
2011-08-25	md: report failure if a 'set faulty' request doesn't.	NeilBrown
	Sometimes a device will refuse to be set faulty. e.g. RAID1 will never let the last working device become faulty. So check if "md_error()" did manage to set the faulty flag and fail with EBUSY if it didn't. Resolves-Debian-Bug: http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=601198 Reported-by: Mike Hommey <mh+reportbug@glandium.org> Signed-off-by: NeilBrown <neilb@suse.de>
2011-07-28	Merge branch 'for-linus' of git://neil.brown.name/md	Linus Torvalds
	* 'for-linus' of git://neil.brown.name/md: (75 commits) md/raid10: handle further errors during fix_read_error better. md/raid10: Handle read errors during recovery better. md/raid10: simplify read error handling during recovery. md/raid10: record bad blocks due to write errors during resync/recovery. md/raid10: attempt to fix read errors during resync/check md/raid10: Handle write errors by updating badblock log. md/raid10: clear bad-block record when write succeeds. md/raid10: avoid writing to known bad blocks on known bad drives. md/raid10 record bad blocks as needed during recovery. md/raid10: avoid reading known bad blocks during resync/recovery. md/raid10 - avoid reading from known bad blocks - part 3 md/raid10: avoid reading from known bad blocks - part 2 md/raid10: avoid reading from known bad blocks - part 1 md/raid10: Split handle_read_error out from raid10d. md/raid10: simplify/reindent some loops. md/raid5: Clear bad blocks on successful write. md/raid5. Don't write to known bad block on doubtful devices. md/raid5: write errors should be recorded as bad blocks if possible. md/raid5: use bad-block log to improve handling of uncorrectable read errors. md/raid5: avoid reading from known bad blocks. ...
2011-07-28	md/raid10 record bad blocks as needed during recovery.	NeilBrown
	When recovering one or more devices, if all the good devices have bad blocks we should record a bad block on the device being rebuilt. If this fails, we need to abort the recovery. To ensure we don't think that we aborted later than we actually did, we need to move the check for MD_RECOVERY_INTR earlier in md_do_sync, in particular before mddev->curr_resync is updated. Signed-off-by: NeilBrown <neilb@suse.de>
2011-07-28	md: make it easier to wait for bad blocks to be acknowledged.	NeilBrown
	It is only safe to choose not to write to a bad block if that bad block is safely recorded in metadata - i.e. if it has been 'acknowledged'. If it hasn't we need to wait for the acknowledgement. We support that using rdev->blocked wait and md_wait_for_blocked_rdev by introducing a new device flag 'BlockedBadBlock'. This flag is only advisory. It is cleared whenever we acknowledge a bad block, so that a waiter can re-check the particular bad blocks that it is interested it. It should be set by a caller when they find they need to wait. This (set after test) is inherently racy, but as md_wait_for_blocked_rdev already has a timeout, losing the race will have minimal impact. When we clear "Blocked" was also clear "BlockedBadBlocks" incase it was set incorrectly (see above race). We also modify the way we manage 'Blocked' to fit better with the new handling of 'BlockedBadBlocks' and to make it consistent between externally managed and internally managed metadata. This requires that each raidXd loop checks if the metadata needs to be written and triggers a write (md_check_recovery) if needed. Otherwise a queued write request might cause raidXd to wait for the metadata to write, and only that thread can write it. Before writing metadata, we set FaultRecorded for all devices that are Faulty, then after writing the metadata we clear Blocked for any device for which the Fault was certainly Recorded. The 'faulty' device flag now appears in sysfs if the device is faulty or it has unacknowledged bad blocks. So user-space which does not understand bad blocks can continue to function correctly. User space which does, should not assume a device is faulty until it sees the 'faulty' flag, and then sees the list of unacknowledged bad blocks is empty. Signed-off-by: NeilBrown <neilb@suse.de>
2011-07-28	md: add 'write_error' flag to component devices.	NeilBrown
	If a device has ever seen a write error, we will want to handle known-bad-blocks differently. So create an appropriate state flag and export it via sysfs. Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Namhyung Kim <namhyung@gmail.com>
2011-07-28	md/raid1: avoid reading from known bad blocks.	NeilBrown
	Now that we have a bad block list, we should not read from those blocks. There are several main parts to this: 1/ read_balance needs to check for bad blocks, and return not only the chosen device, but also how many good blocks are available there. 2/ fix_read_error needs to avoid trying to read from bad blocks. 3/ read submission must be ready to issue multiple reads to different devices as different bad blocks on different devices could mean that a single large read cannot be served by any one device, but can still be served by the array. This requires keeping count of the number of outstanding requests per bio. This count is stored in 'bi_phys_segments' 4/ retrying a read needs to also be ready to submit a smaller read and queue another request for the rest. This does not yet handle bad blocks when reading to perform resync, recovery, or check. 'md_trim_bio' will also be used for RAID10, so put it in md.c and export it. Signed-off-by: NeilBrown <neilb@suse.de>
2011-07-28	md: Disable bad blocks and v0.90 metadata.	NeilBrown
	v0.90 metadata cannot record bad blocks, so when loading metadata for such a device, set shift to -1. Signed-off-by: NeilBrown <neilb@suse.de>
2011-07-28	md: load/store badblock list from v1.x metadata	NeilBrown
	Space must have been allocated when array was created. A feature flag is set when the badblock list is non-empty, to ensure old kernels don't load and trust the whole device. We only update the on-disk badblocklist when it has changed. If the badblocklist (or other metadata) is stored on a bad block, we don't cope very well. If metadata has no room for bad block, flag bad-blocks as disabled, and do the same for 0.90 metadata. Signed-off-by: NeilBrown <neilb@suse.de>
2011-07-28	md/bad-block-log: add sysfs interface for accessing bad-block-log.	NeilBrown
	This can show the log (providing it fits in one page) and allows bad blocks to be 'acknowledged' meaning that they have safely been recorded in metadata. Clearing bad blocks is not allowed via sysfs (except for code testing). A bad block can only be cleared when a write to the block succeeds. Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Namhyung Kim <namhyung@gmail.com>
2011-07-28	md: beginnings of bad block management.	NeilBrown
	This the first step in allowing md to track bad-blocks per-device so that we can fail individual blocks rather than the whole device. This patch just adds a data structure for recording bad blocks, with routines to add, remove, search the list. Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Namhyung Kim <namhyung@gmail.com>
2011-07-28	md: remove suspicious size_of()	NeilBrown
	When calling bioset_create we pass the size of the front_pad as sizeof(mddev) which looks suspicious as mddev is a pointer and so it looks like a common mistake where sizeof(mddev) was intended. The size is actually correct as we want to store a pointer in the front padding of the bios created by the bioset, so make the intent more explicit by using sizeof(mddev_t ) Reported-by: Zdenek Kabelac <zdenek.kabelac@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
2011-07-27	MD: generate an event when array sync is complete	Jonathan Brassow
	This patch causes MD to generate an event (for device-mapper) when the synchronization thread is reaped. This is expected behavior for device-mapper. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: NeilBrown <neilb@suse.de>
2011-07-27	md: get rid of unnecessary casts on page_address()	Namhyung Kim
	page_address() returns void pointer, so the casts can be removed. Signed-off-by: Namhyung Kim <namhyung@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
2011-07-27	md: change managed of recovery_disabled.	NeilBrown
	If we hit a read error while recovering a mirror, we want to abort the recovery without necessarily failing the disk - as having a disk this a read error is better than not having an array at all. Currently this is managed with a per-array flag "recovery_disabled" and is only implemented for RAID1. For RAID10 we will need finer grained control as we might want to disable recovery for individual devices separately. So push more of the decision making into the personality. 'recovery_disabled' is now a 'cookie' which is copied when the personality want to disable recovery and is changed when a device is added to the array as this is used as a trigger to 'try recovery again'. This will allow RAID10 to get the control that it needs. Signed-off-by: NeilBrown <neilb@suse.de>