path: root/drivers/md
Age | Commit message | Author
2011-11-11md/raid5: STRIPE_ACTIVE has lock semantics, add barriersDan Williams
commit 257a4b42af7586fab4eaec7f04e6896b86551843 upstream. All updates that occur under STRIPE_ACTIVE should be globally visible when STRIPE_ACTIVE clears. test_and_set_bit() implies a barrier, but clear_bit() does not. This is suitable for 3.1-stable. Signed-off-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
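The lock-like behaviour described above can be sketched in userspace with C11 atomics. This is an illustration only, not the kernel patch: `struct stripe`, `stripe_try_lock`, `stripe_unlock` and `STRIPE_ACTIVE_BIT` are hypothetical names, and acquire/release orderings stand in for the kernel's barrier primitives.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical flag bit standing in for STRIPE_ACTIVE. */
#define STRIPE_ACTIVE_BIT 1u

struct stripe {
    atomic_uint state;  /* flag word holding the lock bit */
    int data;           /* only touched while the bit is held */
};

/* Try to take the bit lock; returns true if this caller set the bit.
 * The acquire ordering keeps later accesses from moving above it. */
static bool stripe_try_lock(struct stripe *sh)
{
    unsigned old = atomic_fetch_or_explicit(&sh->state, STRIPE_ACTIVE_BIT,
                                            memory_order_acquire);
    return (old & STRIPE_ACTIVE_BIT) == 0;
}

/* Clear the bit with release ordering, so every update made while the
 * bit was set is visible to whoever takes the bit next. */
static void stripe_unlock(struct stripe *sh)
{
    atomic_fetch_and_explicit(&sh->state, ~STRIPE_ACTIVE_BIT,
                              memory_order_release);
}
```

The point of the sketch is the asymmetry the commit describes: the set side already orders memory, so the clear side needs an explicit release.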
2011-11-11md/raid5: abort any pending parity operations when array fails.NeilBrown
commit 9a3f530f39f4490eaa18b02719fb74ce5f4d2d86 upstream. When the number of failed devices exceeds the allowed number we must abort any active parity operations (checks or updates) as they are no longer meaningful, and can lead to a BUG_ON in handle_parity_checks6. This bug was introduced by commit 6c0069c0ae9659e3a91b68eaed06a5c6c37f45c8 in 2.6.29. Reported-by: Manish Katiyar <mkatiyar@gmail.com> Tested-by: Manish Katiyar <mkatiyar@gmail.com> Acked-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-11-11md/raid10: Fix bug when activating a hot-spare.NeilBrown
commit 7fcc7c8acf0fba44d19a713207af7e58267c1179 upstream. This is a fairly serious bug in RAID10. When a RAID10 array is degraded and a hot-spare is activated, the spare does not take up the empty slot, but rather replaces the first working device. This is likely to make the array non-functional. It would normally be possible to recover the data, but that would need care and is not guaranteed. This bug was introduced in commit 2bb77736ae5dca0a189829fbb7379d43364a9dac which first appeared in 3.1. Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-11-11md/raid5: fix bug that could result in reads from a failed device.NeilBrown
commit 355840e7a7e56bb2834fd3b0da64da5465f8aeaa upstream. This bug was introduced in 415e72d034c50520ddb7ff79e7d1792c1306f0c9 which was in 2.6.36. There is a small window of time between when a device fails and when it is removed from the array. During this time we might still read from it, but we won't write to it - so it is possible that we could read stale data. We didn't need the test of 'Faulty' before because the test on In_sync is sufficient. Since we started allowing reads from the early part of non-In_sync devices we need a test on Faulty too. This is suitable for any kernel from 2.6.36 onwards, though the patch might need a bit of tweaking in 3.0 and earlier. Signed-off-by: NeilBrown <neilb@suse.de> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-10-23dm kcopyd: fix job_pool leakAlasdair G Kergon
Fix memory leak introduced by commit a6e50b409d3f9e0833e69c3c9cca822e8fa4adbb (dm snapshot: skip reading origin when overwriting complete chunk). When allocating a set of jobs from kc->job_pool, job->master_job must be set (to point to itself) so that the mempool item gets freed when the master_job completes. master_job was introduced by commit c6ea41fbbe08f270a8edef99dc369faf809d1bd6 (dm kcopyd: preallocate sub jobs to avoid deadlock) Reported-by: Michael Leun <ml@newton.leun.net> Cc: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-10-06Merge branch 'for-linus' of http://people.redhat.com/agk/git/linux-dmLinus Torvalds
* 'for-linus' of http://people.redhat.com/agk/git/linux-dm: dm crypt: always disable discard_zeroes_data dm: raid fix write_mostly arg validation dm table: avoid crash if integrity profile changes dm: flakey fix corrupt_bio_byte error path
2011-09-25dm crypt: always disable discard_zeroes_dataMilan Broz
If optional discard support in dm-crypt is enabled, discard requests bypass the crypt queue and blocks of the underlying device are discarded. For the read path, discarded blocks are handled the same as normal ciphertext blocks, thus decrypted. So if the underlying device announces that discarded regions return zeroes, dm-crypt must disable this flag because after decryption there is just random noise instead of zeroes. Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-09-25dm: raid fix write_mostly arg validationJonathan Brassow
Fix off-by-one error in validation of write_mostly. The user-supplied value given for the 'write_mostly' argument must be an index starting at 0. The validation of the supplied argument failed to check for 'N' ('>' vs '>='), which would have caused an access beyond the end of the array. Reported-by: Doug Ledford <dledford@redhat.com> Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
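A minimal sketch of the bounds check at issue, assuming a hypothetical helper `write_mostly_index_ok` (not a function from dm-raid) that validates a zero-based index against an array of `n` devices:

```c
#include <stdbool.h>
#include <stddef.h>

/* Returns true if idx is a valid zero-based index into n devices.
 * The buggy form rejected only idx > n, which let idx == n through
 * and so allowed an access one element past the end of the array. */
static bool write_mostly_index_ok(size_t idx, size_t n)
{
    return idx < n;   /* equivalent to rejecting idx >= n */
}
```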
2011-09-25dm table: avoid crash if integrity profile changesMike Snitzer
Commit a63a5cf (dm: improve block integrity support) introduced a two-phase initialization of a DM device's integrity profile. This patch avoids dereferencing a NULL 'template_disk' pointer in blk_integrity_register() if there is an integrity profile mismatch in dm_table_set_integrity(). This can occur if the integrity profiles for stacked devices in a DM table are changed between the call to dm_table_prealloc_integrity() and dm_table_set_integrity(). Reported-by: Zdenek Kabelac <zkabelac@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com> Cc: stable@kernel.org # 2.6.39
2011-09-25dm: flakey fix corrupt_bio_byte error pathMike Snitzer
If no arguments were provided to the corrupt_bio_byte feature an error should be returned immediately. Reported-by: Zdenek Kabelac <zkabelac@redhat.com> Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-09-21md: Avoid waking up a thread after it has been freed.NeilBrown
Two related problems: 1/ some error paths call "md_unregister_thread(mddev->thread)" without subsequently clearing ->thread. A subsequent call to mddev_unlock will try to wake the thread, and crash. 2/ Most calls to md_wakeup_thread are protected against the thread disappearing either by: - holding the ->mutex - having an active request, so something else must be keeping the array active. However mddev_unlock calls md_wakeup_thread after dropping the mutex and without any certainty of an active request, so the ->thread could theoretically disappear. So we need a spinlock to provide some protection. So change md_unregister_thread to take a pointer to the thread pointer, and ensure that it always does the required locking, and clears the pointer properly. Reported-by: "Moshe Melnikov" <moshe@zadarastorage.com> Signed-off-by: NeilBrown <neilb@suse.de> cc: stable@kernel.org
2011-09-10md: Fix handling for devices from 2TB to 4TB in 0.90 metadata.NeilBrown
0.90 metadata uses an unsigned 32bit number to count the number of kilobytes used from each device. This should allow up to 4TB per device. However we multiply this by 2 (to get sectors) before casting to a larger type, so sizes above 2TB get truncated. Also we allow rdev->sectors to be larger than 4TB, so it is possible for the array to be resized larger than the metadata can handle. So make sure rdev->sectors never exceeds 4TB when 0.90 metadata is in use. Also the sanity check at the end of super_90_load should include level 1 as it uses ->size too. (RAID0 and Linear don't use ->size at all). Reported-by: Pim Zandbergen <P.Zandbergen@macroscoop.nl> Cc: stable@kernel.org Signed-off-by: NeilBrown <neilb@suse.de>
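The truncation can be reproduced in a few lines of userspace C. This is a sketch under the assumption that `unsigned int` is 32 bits wide; the function names are hypothetical and do not appear in md.

```c
#include <stdint.h>

/* Buggy order: the multiply is done in 32 bits and wraps, and only the
 * already-truncated result is widened to 64 bits. */
static uint64_t kb_to_sectors_truncating(uint32_t size_kb)
{
    return (uint64_t)(size_kb * 2u);
}

/* Fixed order: widen to 64 bits first, then multiply. */
static uint64_t kb_to_sectors(uint32_t size_kb)
{
    return (uint64_t)size_kb * 2;
}
```

For a 3TB device (3221225472 KB) the fixed form yields 6442450944 sectors, while the truncating form wraps to 2147483648 sectors, i.e. 1TB.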
2011-09-10md/raid1,10: Remove use-after-free bug in make_request.NeilBrown
A single request to RAID1 or RAID10 might result in multiple requests if there are known bad blocks that need to be avoided. To detect if we need to submit another write request we test: if (sectors_handled < (bio->bi_size >> 9)) { However this is after we call **_write_done() so the 'bio' no longer belongs to us - the writes could have completed and the bio freed. So move the **_write_done call until after the test against bio->bi_size. This addresses https://bugzilla.kernel.org/show_bug.cgi?id=41862 Reported-by: Bruno Wolff III <bruno@wolff.to> Tested-by: Bruno Wolff III <bruno@wolff.to> Signed-off-by: NeilBrown <neilb@suse.de>
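The ordering rule can be sketched in userspace as follows. All names here (`fake_bio`, `fake_write_done`, `submit_part`) are hypothetical stand-ins, and `free()` plays the role of the bio being completed and released by someone else.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for a bio; only the size matters here. */
struct fake_bio {
    unsigned size_bytes;
};

/* Stand-in for the write-done call: after this returns, the caller no
 * longer owns the bio and must not dereference it. */
static void fake_write_done(struct fake_bio *bio)
{
    free(bio);
}

/* Returns true if another request is needed for the remaining sectors.
 * The test on the bio size is evaluated BEFORE ownership is given up. */
static bool submit_part(struct fake_bio *bio, unsigned sectors_handled)
{
    bool more = sectors_handled < (bio->size_bytes >> 9);

    fake_write_done(bio);   /* bio must not be touched past this point */
    return more;
}
```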
2011-09-10md/raid10: unify handling of write completion.NeilBrown
A write can complete at two different places: 1/ when the last member-device write completes, through raid10_end_write_request 2/ in make_request() when we remove the initial bias from ->remaining. These two should do exactly the same thing and the comment says they do, but they don't. So factor the correct code out into a function and call it in both places. This makes the code much more similar to RAID1. The difference is only significant if there is an error, and they usually take a while, so it is unlikely that there will be an error already when make_request is completing, so this is unlikely to cause real problems. Signed-off-by: NeilBrown <neilb@suse.de>
2011-08-31md/raid5: fix a hang on device failure.NeilBrown
Waiting for a 'blocked' rdev to become unblocked in the raid5d thread cannot work with internal metadata as it is the raid5d thread which will clear the blocked flag. This wasn't a problem in 3.0 and earlier as we only set the blocked flag when external metadata was used then. However we now set it always, so we need to be more careful. Signed-off-by: NeilBrown <neilb@suse.de>
2011-08-30md: fix clearing of 'blocked' flag in the presence of bad blocks.NeilBrown
When the 'blocked' flag on a device is cleared while there are unacknowledged bad blocks we must fail the device. This is needed for backwards compatibility of the interface. The code currently uses the wrong test for "unacknowledged bad blocks exist". Change it to the right test. Signed-off-by: NeilBrown <neilb@suse.de>
2011-08-25md/linear: avoid corrupting structure while waiting for rcu_free to complete.NeilBrown
I don't know what I was thinking putting 'rcu' after a dynamically sized array! The array could still be in use when we call rcu_free() (That is the point) so we mustn't corrupt it. Cc: stable@kernel.org Signed-off-by: NeilBrown <neilb@suse.de>
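The layout problem can be illustrated with a userspace struct. This is a sketch with hypothetical types (`fake_rcu_head`, `dev_info`, `linear_conf_sketch`), not the md/linear definitions: any fixed-size member placed after a dynamically sized array would overlap the array's storage, so fixed-size members go first and the array goes last.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the kernel types. */
struct fake_rcu_head {
    void *next;
    void (*func)(void *);
};

struct dev_info {
    unsigned long long end_sector;
};

struct linear_conf_sketch {
    struct fake_rcu_head rcu;   /* fixed-size members come first */
    int raid_disks;
    struct dev_info disks[];    /* dynamically sized array comes last */
};

/* Allocate a conf with room for raid_disks array entries. */
static struct linear_conf_sketch *conf_alloc(int raid_disks)
{
    struct linear_conf_sketch *c;

    if (raid_disks < 0)
        return NULL;
    c = calloc(1, sizeof(*c) + (size_t)raid_disks * sizeof(c->disks[0]));
    if (c)
        c->raid_disks = raid_disks;
    return c;
}
```

With this ordering, writing to `rcu` while the structure waits to be freed cannot disturb any `disks[]` entry that readers may still be using.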
2011-08-25md: use REQ_NOIDLE flag in md_super_write()Namhyung Kim
Queue idling is used in anticipation of immediate sequential I/O, but md_super_write() is a kind of one-shot operation, coupled with md_super_wait(), so the idling in this case will be just a waste of time. Specifying REQ_NOIDLE prevents it. Instead of adding the flag to submit_bio() directly, use pre-defined macro WRITE_FLUSH_FUA. Signed-off-by: Namhyung Kim <namhyung@gmail.com> Signed-off-by: NeilBrown <neilb@suse.de>
2011-08-25md: ensure changes to 'write-mostly' are reflected in metadata.NeilBrown
The 'write-mostly' flag can be changed through sysfs. With 0.90 metadata, those changes are reflected in the metadata. For 1.x metadata, they aren't. So fix super_1_sync to record 'write-mostly' status. Signed-off-by: NeilBrown <neilb@suse.de>
2011-08-25md: report failure if a 'set faulty' request doesn't.NeilBrown
Sometimes a device will refuse to be set faulty. e.g. RAID1 will never let the last working device become faulty. So check if "md_error()" did manage to set the faulty flag and fail with EBUSY if it didn't. Resolves-Debian-Bug: http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=601198 Reported-by: Mike Hommey <mh+reportbug@glandium.org> Signed-off-by: NeilBrown <neilb@suse.de>
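A minimal sketch of the check, using hypothetical userspace stand-ins (`fake_array`, `fake_dev`, `fake_md_error`, `set_faulty`) rather than the md structures: the error handler may decline to fail the device, so the caller inspects the flag afterwards and reports EBUSY if nothing changed.

```c
#include <errno.h>
#include <stdbool.h>

struct fake_array {
    int working;        /* number of working member devices */
};

struct fake_dev {
    bool faulty;
};

/* Stand-in for the error handler: it refuses to fail the last
 * working device, in which case the flag is simply left unset. */
static void fake_md_error(struct fake_array *a, struct fake_dev *d)
{
    if (d->faulty || a->working <= 1)
        return;
    d->faulty = true;
    a->working--;
}

/* Returns 0 if the device is now faulty, -EBUSY if the request was
 * refused. */
static int set_faulty(struct fake_array *a, struct fake_dev *d)
{
    fake_md_error(a, d);
    return d->faulty ? 0 : -EBUSY;
}
```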
2011-08-02dm table: set flush capability based on underlying devicesMike Snitzer
DM has always advertised both REQ_FLUSH and REQ_FUA flush capabilities regardless of whether or not a given DM device's underlying devices also advertised a need for them. Block's flush-merge changes from 2.6.39 have proven to be more costly for DM devices. Performance regressions have been reported even when DM's underlying devices do not advertise that they have a write cache. Fix the performance regressions by configuring a DM device's flushing capabilities based on those of the underlying devices' capabilities. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-08-02dm crypt: optionally support discard requestsMilan Broz
Add optional parameter field to dmcrypt table and support "allow_discards" option. Discard requests bypass crypt queue processing. Bio is simply remapped to underlying device. Note that discard will never be enabled by default because of security consequences. It is up to the administrator to enable it for encrypted devices. (Note that userspace cryptsetup does not understand new optional parameters yet. Support for this will come later. Until then, you should use 'dmsetup' to enable and disable this.) Signed-off-by: Milan Broz <mbroz@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-08-02dm raid: add md raid1 supportJonathan Brassow
Support the MD RAID1 personality through dm-raid. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-08-02dm raid: support metadata devicesJonathan Brassow
Add the ability to parse and use metadata devices to dm-raid. Although not strictly required, without the metadata devices, many features of RAID are unavailable. They are used to store a superblock and bitmap. The role, or position in the array, of each device must be recorded in its superblock. This is to help with fault handling, array reshaping, and sanity checks. RAID 4/5/6 devices must be loaded in a specific order: in this way, the 'array_position' field helps validate the correctness of the mapping when it is loaded. It can be used during reshaping to identify which devices are added/removed. Fault handling is impossible without this field. For example, when a device fails it is recorded in the superblock. If this is a RAID1 device and the offending device is removed from the array, there must be a way during subsequent array assembly to determine that the failed device was the one removed. This is done by correlating the 'array_position' field and the bit-field variable 'failed_devices'. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-08-02dm raid: add write_mostly parameterJonathan Brassow
Add the write_mostly parameter to RAID1 dm-raid tables. This allows the user to set the WriteMostly flag on a RAID1 device that should normally be avoided for read I/O. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-08-02dm raid: add region_size parameterJonathan Brassow
Allow the user to specify the region_size. Ensures that the supplied value meets md's constraints, viz. the number of regions does not exceed 2^21. Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
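The constraint can be sketched as a small check. The helper name `region_size_ok`, the use of sectors as the unit, and the round-up of a partial final region are assumptions made for illustration; only the 2^21 limit comes from the commit message.

```c
#include <stdbool.h>
#include <stdint.h>

/* Upper bound on the number of regions, as stated in the commit. */
#define MAX_REGIONS (UINT64_C(1) << 21)

/* Returns true if dividing array_sectors into regions of
 * region_sectors each yields no more than MAX_REGIONS regions. */
static bool region_size_ok(uint64_t array_sectors, uint64_t region_sectors)
{
    uint64_t regions;

    if (region_sectors == 0)
        return false;
    /* Round up without risking overflow in the addition. */
    regions = array_sectors / region_sectors
            + (array_sectors % region_sectors != 0);
    return regions <= MAX_REGIONS;
}
```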
2011-08-02dm ioctl: forbid multiple device specifiersMikulas Patocka
Exactly one of name, uuid or device must be specified when referencing an existing device. This removes the ambiguity (risking the wrong device being updated) if two conflicting parameters were specified. Previously one parameter got used and any others were ignored silently. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
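The "exactly one" rule can be sketched as a count of the specifiers that are set. The struct and function below are hypothetical and do not mirror the dm ioctl structure; an empty string or a zero device number is assumed to mean "not specified".

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical reference to an existing device. */
struct dev_ref {
    const char *name;           /* NULL or "" means not given */
    const char *uuid;           /* NULL or "" means not given */
    unsigned long long dev;     /* 0 means not given */
};

/* Returns true only when exactly one specifier is present, so that
 * conflicting parameters are rejected instead of silently ignored. */
static bool exactly_one_specifier(const struct dev_ref *r)
{
    int n = 0;

    n += (r->name != NULL && r->name[0] != '\0');
    n += (r->uuid != NULL && r->uuid[0] != '\0');
    n += (r->dev != 0);
    return n == 1;
}
```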
2011-08-02dm ioctl: introduce __get_dev_cellMikulas Patocka
Move logic to find device based on major/minor number to a separate function __get_dev_cell (similar to __get_uuid_cell and __get_name_cell). This makes the function __find_device_hash_cell more straightforward. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-08-02dm ioctl: fill in device parameters in more ioctlsMikulas Patocka
Move parameter filling from find_device to __find_device_hash_cell. This patch causes ioctls using __find_device_hash_cell (DM_DEV_REMOVE_CMD, DM_DEV_SUSPEND_CMD - resume, DM_TABLE_CLEAR_CMD) to return device parameters, bringing them into line with the other ioctls. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-08-02dm flakey: add corrupt_bio_byte featureMike Snitzer
Add corrupt_bio_byte feature to simulate corruption by overwriting a byte at a specified position with a specified value during intervals when the device is "down". Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
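The core operation, overwriting one byte at a given position with a given value, can be sketched on a plain buffer. The helper `corrupt_byte` is hypothetical, and the zero-based offset is an assumption of this sketch; the counting convention used by the real target is not stated in the commit message.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Overwrite buf[offset] with value. Returns false, leaving the buffer
 * untouched, if the offset lies outside the buffer. */
static bool corrupt_byte(uint8_t *buf, size_t len, size_t offset,
                         uint8_t value)
{
    if (buf == NULL || offset >= len)
        return false;
    buf[offset] = value;
    return true;
}
```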
2011-08-02dm flakey: add drop_writesMike Snitzer
Add 'drop_writes' option to drop writes silently while the device is 'down'. Reads are not touched. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-08-02dm flakey: support feature argsMike Snitzer
Add the ability to specify arbitrary feature flags when creating a flakey target. This code uses the same target argument helpers that the multipath target does. Also remove the superfluous 'dm-flakey' prefixes from the error messages, as they already contain the prefix 'flakey'. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-08-02dm flakey: use dm_target_offset and support discardsMike Snitzer
Use dm_target_offset() and support discards. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-08-02dm table: share target argument parsing functionsMike Snitzer
Move multipath target argument parsing code into dm-table so other targets can share it. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-08-02dm snapshot: skip reading origin when overwriting complete chunkMikulas Patocka
If we write a full chunk in the snapshot, skip reading the origin device because the whole chunk will be overwritten anyway.
This patch changes the snapshot write logic when a full chunk is written. In this case:
1. allocate the exception
2. dispatch the bio (but don't report the bio completion to device mapper)
3. write the exception record
4. report bio completed
Callbacks must be done through the kcopyd thread, because callbacks must not race with each other. So we create two new functions:
dm_kcopyd_prepare_callback: allocate a job structure and prepare the callback. (This function must not be called from interrupt context.)
dm_kcopyd_do_callback: submit callback. (This function may be called from interrupt context.)
Performance test (on snapshots with 4k chunk size):
without the patch:
non-direct-io sequential write (dd): 17.7MB/s
direct-io sequential write (dd): 20.9MB/s
non-direct-io random write (mkfs.ext2): 0.44s
with the patch:
non-direct-io sequential write (dd): 26.5MB/s
direct-io sequential write (dd): 33.2MB/s
non-direct-io random write (mkfs.ext2): 0.27s
Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-08-02dm: ignore merge_bvec for snapshots when safeMikulas Patocka
Add a new flag DMF_MERGE_IS_OPTIONAL to struct mapped_device to indicate whether the device can accept bios larger than the size its merge function returns. When set, use this to send large bios to snapshots which can split them if necessary. Snapshot I/O may be significantly fragmented and this approach seems to improve performance. Before the patch, dm_set_device_limits restricted bio size to page size if the underlying device had a merge function and the target didn't provide a merge function. After the patch, dm_set_device_limits restricts bio size to page size if the underlying device has a merge function, doesn't have DMF_MERGE_IS_OPTIONAL flag and the target doesn't provide a merge function. The snapshot target can't provide a merge function because when the merge function is called, it is impossible to determine where the bio will be remapped. Previously this led us to impose a 4k limit, which we can now remove if the snapshot store is located on a device without a merge function. Together with another patch for optimizing full chunk writes, it improves performance from 29MB/s to 40MB/s when writing to the filesystem on snapshot store. If the snapshot store is placed on a non-dm device with a merge function (such as md-raid), device mapper still limits all bios to page size. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-08-02dm table: clean dm_get_device and move exportsMike Snitzer
There is no need for __table_get_device to be factored out. Also move the exports to the end of their respective functions. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-08-02dm raid: tidy includesAlasdair G Kergon
A dm target only needs to use include/linux dm headers. Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-08-02dm ioctl: prevent empty messageAlasdair G Kergon
Detect invalid empty messages in core dm instead of requiring every target to check this. Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-08-02dm raid: cleanup parameter handlingJonathan Brassow
Re-order the parameters so they are handled consistently in the same order where defined, parsed and output. Only include rebuild parameters in the STATUSTYPE_TABLE output if they were supplied in the original table line. Correct the parameter count when outputting rebuild: there are two words, not one. Use case-independent checks for keywords (as in other device-mapper targets). Signed-off-by: Jonathan Brassow <jbrassow@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-08-02dm snapshot: style cleanupsJonathan Brassow
Coding style cleanups. Signed-off-by: Alasdair G Kergon <agk@redhat.com> Signed-off-by: Jonathan Brassow <jbrassow@redhat.com>
2011-08-02dm snapshot: remove unused definitionsMikulas Patocka
Remove a couple of unused #defines. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-08-02dm kcopyd: remove nr_pages field from job structureMikulas Patocka
The nr_pages field in struct kcopyd_job is only used temporarily in run_pages_job() to count the number of required pages. We can use a local variable instead. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-08-02dm kcopyd: remove offset field from job structureMikulas Patocka
The offset field in struct kcopyd_job is always zero so remove it. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-08-02dm: use vzallocJoe Perches
Use vzalloc() instead of vmalloc()+memset(). Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-08-02dm log: userspace use list_moveKirill A. Shutemov
Replace list_del() followed by list_add() with list_move(). Signed-off-by: Kirill A. Shutemov <kirill@shutemov.name> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
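The equivalence can be shown with a small userspace reimplementation of a circular doubly linked list. The functions below borrow the kernel helpers' names for readability but are written from scratch for this sketch and are not the kernel's own code.

```c
/* Minimal circular doubly linked list, userspace sketch. */
struct list_head {
    struct list_head *next, *prev;
};

static void list_init(struct list_head *h)
{
    h->next = h;
    h->prev = h;
}

/* Insert n right after head. */
static void list_add(struct list_head *n, struct list_head *head)
{
    n->next = head->next;
    n->prev = head;
    head->next->prev = n;
    head->next = n;
}

/* Unlink e from whatever list it is on. */
static void list_del(struct list_head *e)
{
    e->prev->next = e->next;
    e->next->prev = e->prev;
}

/* Move e from its current list to just after head: one call that does
 * exactly what list_del() followed by list_add() does. */
static void list_move(struct list_head *e, struct list_head *head)
{
    list_del(e);
    list_add(e, head);
}
```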
2011-08-02dm log: clean up bit little endian bitopsAkinobu Mita
Calls to __test_and_{set,clear}_bit_le() that ignore the return value can be replaced with __{set,clear}_bit_le(). This also removes unnecessary casts. Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-08-02dm table: fix discard supportMike Snitzer
Remove 'discards_supported' from the dm_table structure. The same information can be easily discovered from the table's target(s) in dm_table_supports_discards(). Before this fix dm_table_supports_discards() would skip checking the individual targets' 'discards_supported' flag if any one target in the table didn't set num_discard_requests > 0. Now the per-target 'discards_supported' flag is effective at ensuring the final DM device advertises discard support. But, to be clear, targets that don't support discards (!num_discard_requests) will not receive discard requests. Also DMWARN if a target sets 'discards_supported' override but forgets to set 'num_discard_requests'. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-08-02dm: suppress endian warningsAlasdair G Kergon
Suppress sparse warnings about cpu_to_le32() by using __le32 types for on-disk data etc. Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2011-08-02dm: fix idr leak on module removalAlasdair G Kergon
Destroy _minor_idr when unloading the core dm module. (Found by kmemleak.) Cc: stable@kernel.org Signed-off-by: Alasdair G Kergon <agk@redhat.com>