Diffstat (limited to 'Documentation/device-mapper')
 -rw-r--r--   Documentation/device-mapper/cache-policies.txt      |  91
 -rw-r--r--   Documentation/device-mapper/cache.txt                | 298
 -rw-r--r--   Documentation/device-mapper/dm-crypt.txt             |  11
 -rw-r--r--   Documentation/device-mapper/dm-raid.txt              | 148
 -rw-r--r--   Documentation/device-mapper/era.txt                  | 108
 -rw-r--r--   Documentation/device-mapper/statistics.txt           | 186
 -rw-r--r--   Documentation/device-mapper/striped.txt              |   7
 -rw-r--r--   Documentation/device-mapper/switch.txt               | 126
 -rw-r--r--   Documentation/device-mapper/thin-provisioning.txt    |  83

9 files changed, 1026 insertions, 32 deletions
diff --git a/Documentation/device-mapper/cache-policies.txt b/Documentation/device-mapper/cache-policies.txt
new file mode 100644
index 00000000000..66c2774c0c6
--- /dev/null
+++ b/Documentation/device-mapper/cache-policies.txt
@@ -0,0 +1,91 @@
+Guidance for writing policies
+=============================
+
+Try to keep transactionality out of it. The core is careful to
+avoid asking about anything that is migrating. This is a pain, but
+makes it easier to write the policies.
+
+Mappings are loaded into the policy at construction time.
+
+Every bio that is mapped by the target is referred to the policy.
+The policy can return a simple HIT or MISS or issue a migration.
+
+Currently there's no way for the policy to issue background work,
+e.g. to start writing back dirty blocks that are going to be evicted
+soon.
+
+Because we map bios, rather than requests, it's easy for the policy
+to get fooled by many small bios. For this reason the core target
+issues periodic ticks to the policy. It's suggested that the policy
+doesn't update states (eg, hit counts) for a block more than once
+for each tick. The core ticks by watching bios complete, and so
+trying to see when the io scheduler has let the ios run.
+
+
+Overview of supplied cache replacement policies
+===============================================
+
+multiqueue
+----------
+
+This policy is the default.
+
+The multiqueue policy has three sets of 16 queues: one set for entries
+waiting for the cache and another two for those in the cache (a set for
+clean entries and a set for dirty entries).
+
+Cache entries in the queues are aged based on logical time. Entry into
+the cache is based on variable thresholds and queue selection is based
+on hit count on entry. The policy aims to take different cache miss
+costs into account and to adjust to varying load patterns automatically.
+
+Message and constructor argument pairs are:
+	'sequential_threshold <#nr_sequential_ios>'
+	'random_threshold <#nr_random_ios>'
+	'read_promote_adjustment <value>'
+	'write_promote_adjustment <value>'
+	'discard_promote_adjustment <value>'
+
+The sequential threshold indicates the number of contiguous I/Os
+required before a stream is treated as sequential. The random threshold
+is the number of intervening non-contiguous I/Os that must be seen
+before the stream is treated as random again.
+
+The sequential and random thresholds default to 512 and 4 respectively.
+
+Large, sequential ios are probably better left on the origin device
+since spindles tend to have good bandwidth. The io_tracker counts
+contiguous I/Os to try to spot when the io is in one of these sequential
+modes.
+
+Internally the mq policy maintains a promotion threshold variable. If
+the hit count of a block not in the cache goes above this threshold it
+gets promoted to the cache. The read, write and discard promote adjustment
+tunables allow you to tweak the promotion threshold by adding a small
+value based on the io type. They default to 4, 8 and 1 respectively.
+If you're trying to quickly warm a new cache device you may wish to
+reduce these to encourage promotion. Remember to switch them back to
+their defaults after the cache fills though.
+
+cleaner
+-------
+
+The cleaner writes back all dirty blocks in a cache to decommission it.
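As a rough sketch of how the cleaner policy is typically used to decommission
a cache (the device name, table geometry and use of dmsetup reload here are
illustrative assumptions, not something this patch defines):

    # Assumed workflow: switch the running cache to the cleaner policy,
    # let it write everything back, then take the cache down.
    dmsetup reload my_cache --table \
      "0 41943040 cache /dev/mapper/metadata /dev/mapper/ssd /dev/mapper/origin 512 0 cleaner 0"
    dmsetup suspend my_cache && dmsetup resume my_cache

    # Watch the #dirty counter in the status line (see cache.txt) until it
    # reaches 0, after which the cache device can be removed.
    watch -n 5 'dmsetup status my_cache'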
+ +Examples +======== + +The syntax for a table is: + cache <metadata dev> <cache dev> <origin dev> <block size> + <#feature_args> [<feature arg>]* + <policy> <#policy_args> [<policy arg>]* + +The syntax to send a message using the dmsetup command is: + dmsetup message <mapped device> 0 sequential_threshold 1024 + dmsetup message <mapped device> 0 random_threshold 8 + +Using dmsetup: + dmsetup create blah --table "0 268435456 cache /dev/sdb /dev/sdc \ + /dev/sdd 512 0 mq 4 sequential_threshold 1024 random_threshold 8" + creates a 128GB large mapped device named 'blah' with the + sequential threshold set to 1024 and the random_threshold set to 8. diff --git a/Documentation/device-mapper/cache.txt b/Documentation/device-mapper/cache.txt new file mode 100644 index 00000000000..68c0f517c60 --- /dev/null +++ b/Documentation/device-mapper/cache.txt @@ -0,0 +1,298 @@ +Introduction +============ + +dm-cache is a device mapper target written by Joe Thornber, Heinz +Mauelshagen, and Mike Snitzer. + +It aims to improve performance of a block device (eg, a spindle) by +dynamically migrating some of its data to a faster, smaller device +(eg, an SSD). + +This device-mapper solution allows us to insert this caching at +different levels of the dm stack, for instance above the data device for +a thin-provisioning pool. Caching solutions that are integrated more +closely with the virtual memory system should give better performance. + +The target reuses the metadata library used in the thin-provisioning +library. + +The decision as to what data to migrate and when is left to a plug-in +policy module. Several of these have been written as we experiment, +and we hope other people will contribute others for specific io +scenarios (eg. a vm image server). + +Glossary +======== + + Migration - Movement of the primary copy of a logical block from one + device to the other. + Promotion - Migration from slow device to fast device. + Demotion - Migration from fast device to slow device. + +The origin device always contains a copy of the logical block, which +may be out of date or kept in sync with the copy on the cache device +(depending on policy). + +Design +====== + +Sub-devices +----------- + +The target is constructed by passing three devices to it (along with +other parameters detailed later): + +1. An origin device - the big, slow one. + +2. A cache device - the small, fast one. + +3. A small metadata device - records which blocks are in the cache, + which are dirty, and extra hints for use by the policy object. + This information could be put on the cache device, but having it + separate allows the volume manager to configure it differently, + e.g. as a mirror for extra robustness. This metadata device may only + be used by a single cache device. + +Fixed block size +---------------- + +The origin is divided up into blocks of a fixed size. This block size +is configurable when you first create the cache. Typically we've been +using block sizes of 256KB - 1024KB. The block size must be between 64 +(32KB) and 2097152 (1GB) and a multiple of 64 (32KB). + +Having a fixed block size simplifies the target a lot. But it is +something of a compromise. For instance, a small part of a block may be +getting hit a lot, yet the whole block will be promoted to the cache. +So large block sizes are bad because they waste cache space. And small +block sizes are bad because they increase the amount of metadata (both +in core and on disk). 
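To make the sector units above concrete, here is a quick back-of-the-envelope
check (plain shell arithmetic, purely illustrative):

    # Cache block sizes are given in 512-byte sectors, so a block size of
    # 512 sectors is 512 * 512 = 262144 bytes, i.e. 256KB -- the low end of
    # the 256KB - 1024KB range that has worked well in practice.
    echo $((512 * 512))    # prints 262144

    # Legal values are multiples of 64 sectors (32KB) between 64 and
    # 2097152 sectors (32KB - 1GB).
    bs=512
    if [ $((bs % 64)) -eq 0 ] && [ "$bs" -ge 64 ] && [ "$bs" -le 2097152 ]; then
        echo "block size $bs sectors is acceptable"
    fi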
+ +Cache operating modes +--------------------- + +The cache has three operating modes: writeback, writethrough and +passthrough. + +If writeback, the default, is selected then a write to a block that is +cached will go only to the cache and the block will be marked dirty in +the metadata. + +If writethrough is selected then a write to a cached block will not +complete until it has hit both the origin and cache devices. Clean +blocks should remain clean. + +If passthrough is selected, useful when the cache contents are not known +to be coherent with the origin device, then all reads are served from +the origin device (all reads miss the cache) and all writes are +forwarded to the origin device; additionally, write hits cause cache +block invalidates. To enable passthrough mode the cache must be clean. +Passthrough mode allows a cache device to be activated without having to +worry about coherency. Coherency that exists is maintained, although +the cache will gradually cool as writes take place. If the coherency of +the cache can later be verified, or established through use of the +"invalidate_cblocks" message, the cache device can be transitioned to +writethrough or writeback mode while still warm. Otherwise, the cache +contents can be discarded prior to transitioning to the desired +operating mode. + +A simple cleaner policy is provided, which will clean (write back) all +dirty blocks in a cache. Useful for decommissioning a cache or when +shrinking a cache. Shrinking the cache's fast device requires all cache +blocks, in the area of the cache being removed, to be clean. If the +area being removed from the cache still contains dirty blocks the resize +will fail. Care must be taken to never reduce the volume used for the +cache's fast device until the cache is clean. This is of particular +importance if writeback mode is used. Writethrough and passthrough +modes already maintain a clean cache. Future support to partially clean +the cache, above a specified threshold, will allow for keeping the cache +warm and in writeback mode during resize. + +Migration throttling +-------------------- + +Migrating data between the origin and cache device uses bandwidth. +The user can set a throttle to prevent more than a certain amount of +migration occurring at any one time. Currently we're not taking any +account of normal io traffic going to the devices. More work needs +doing here to avoid migrating during those peak io moments. + +For the time being, a message "migration_threshold <#sectors>" +can be used to set the maximum number of sectors being migrated, +the default being 204800 sectors (or 100MB). + +Updating on-disk metadata +------------------------- + +On-disk metadata is committed every time a FLUSH or FUA bio is written. +If no such requests are made then commits will occur every second. This +means the cache behaves like a physical disk that has a volatile write +cache. If power is lost you may lose some recent writes. The metadata +should always be consistent in spite of any crash. + +The 'dirty' state for a cache block changes far too frequently for us +to keep updating it on the fly. So we treat it as a hint. In normal +operation it will be written when the dm device is suspended. If the +system crashes all cache blocks will be assumed dirty when restarted. + +Per-block policy hints +---------------------- + +Policy plug-ins can store a chunk of data per cache block. It's up to +the policy how big this chunk is, but it should be kept small. 
Like the +dirty flags this data is lost if there's a crash so a safe fallback +value should always be possible. + +For instance, the 'mq' policy, which is currently the default policy, +uses this facility to store the hit count of the cache blocks. If +there's a crash this information will be lost, which means the cache +may be less efficient until those hit counts are regenerated. + +Policy hints affect performance, not correctness. + +Policy messaging +---------------- + +Policies will have different tunables, specific to each one, so we +need a generic way of getting and setting these. Device-mapper +messages are used. Refer to cache-policies.txt. + +Discard bitset resolution +------------------------- + +We can avoid copying data during migration if we know the block has +been discarded. A prime example of this is when mkfs discards the +whole block device. We store a bitset tracking the discard state of +blocks. However, we allow this bitset to have a different block size +from the cache blocks. This is because we need to track the discard +state for all of the origin device (compare with the dirty bitset +which is just for the smaller cache device). + +Target interface +================ + +Constructor +----------- + + cache <metadata dev> <cache dev> <origin dev> <block size> + <#feature args> [<feature arg>]* + <policy> <#policy args> [policy args]* + + metadata dev : fast device holding the persistent metadata + cache dev : fast device holding cached data blocks + origin dev : slow device holding original data blocks + block size : cache unit size in sectors + + #feature args : number of feature arguments passed + feature args : writethrough or passthrough (The default is writeback.) + + policy : the replacement policy to use + #policy args : an even number of arguments corresponding to + key/value pairs passed to the policy + policy args : key/value pairs passed to the policy + E.g. 'sequential_threshold 1024' + See cache-policies.txt for details. + +Optional feature arguments are: + writethrough : write through caching that prohibits cache block + content from being different from origin block content. + Without this argument, the default behaviour is to write + back cache block contents later for performance reasons, + so they may differ from the corresponding origin blocks. + + passthrough : a degraded mode useful for various cache coherency + situations (e.g., rolling back snapshots of + underlying storage). Reads and writes always go to + the origin. If a write goes to a cached origin + block, then the cache block is invalidated. + To enable passthrough mode the cache must be clean. + +A policy called 'default' is always registered. This is an alias for +the policy we currently think is giving best all round performance. + +As the default policy could vary between kernels, if you are relying on +the characteristics of a specific policy, always request it by name. 
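For example, a table that pins the mq policy by name instead of relying on
the 'default' alias might look like this (a sketch reusing the device names
and sizes from the examples later in this file):

    dmsetup create my_cache --table \
      '0 41943040 cache /dev/mapper/metadata /dev/mapper/ssd /dev/mapper/origin 512 1 writeback mq 0'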
+ +Status +------ + +<metadata block size> <#used metadata blocks>/<#total metadata blocks> +<cache block size> <#used cache blocks>/<#total cache blocks> +<#read hits> <#read misses> <#write hits> <#write misses> +<#demotions> <#promotions> <#dirty> <#features> <features>* +<#core args> <core args>* <policy name> <#policy args> <policy args>* + +metadata block size : Fixed block size for each metadata block in + sectors +#used metadata blocks : Number of metadata blocks used +#total metadata blocks : Total number of metadata blocks +cache block size : Configurable block size for the cache device + in sectors +#used cache blocks : Number of blocks resident in the cache +#total cache blocks : Total number of cache blocks +#read hits : Number of times a READ bio has been mapped + to the cache +#read misses : Number of times a READ bio has been mapped + to the origin +#write hits : Number of times a WRITE bio has been mapped + to the cache +#write misses : Number of times a WRITE bio has been + mapped to the origin +#demotions : Number of times a block has been removed + from the cache +#promotions : Number of times a block has been moved to + the cache +#dirty : Number of blocks in the cache that differ + from the origin +#feature args : Number of feature args to follow +feature args : 'writethrough' (optional) +#core args : Number of core arguments (must be even) +core args : Key/value pairs for tuning the core + e.g. migration_threshold +policy name : Name of the policy +#policy args : Number of policy arguments to follow (must be even) +policy args : Key/value pairs + e.g. sequential_threshold + +Messages +-------- + +Policies will have different tunables, specific to each one, so we +need a generic way of getting and setting these. Device-mapper +messages are used. (A sysfs interface would also be possible.) + +The message format is: + + <key> <value> + +E.g. + dmsetup message my_cache 0 sequential_threshold 1024 + + +Invalidation is removing an entry from the cache without writing it +back. Cache blocks can be invalidated via the invalidate_cblocks +message, which takes an arbitrary number of cblock ranges. Each cblock +range's end value is "one past the end", meaning 5-10 expresses a range +of values from 5 to 9. Each cblock must be expressed as a decimal +value, in the future a variant message that takes cblock ranges +expressed in hexidecimal may be needed to better support efficient +invalidation of larger caches. The cache must be in passthrough mode +when invalidate_cblocks is used. + + invalidate_cblocks [<cblock>|<cblock begin>-<cblock end>]* + +E.g. + dmsetup message my_cache 0 invalidate_cblocks 2345 3456-4567 5678-6789 + +Examples +======== + +The test suite can be found here: + +https://github.com/jthornber/device-mapper-test-suite + +dmsetup create my_cache --table '0 41943040 cache /dev/mapper/metadata \ + /dev/mapper/ssd /dev/mapper/origin 512 1 writeback default 0' +dmsetup create my_cache --table '0 41943040 cache /dev/mapper/metadata \ + /dev/mapper/ssd /dev/mapper/origin 1024 1 writeback \ + mq 4 sequential_threshold 1024 random_threshold 8' diff --git a/Documentation/device-mapper/dm-crypt.txt b/Documentation/device-mapper/dm-crypt.txt index 2c656ae43ba..c81839b52c4 100644 --- a/Documentation/device-mapper/dm-crypt.txt +++ b/Documentation/device-mapper/dm-crypt.txt @@ -4,12 +4,15 @@ dm-crypt Device-Mapper's "crypt" target provides transparent encryption of block devices using the kernel crypto API. 
+For a more detailed description of supported parameters see: +http://code.google.com/p/cryptsetup/wiki/DMCrypt + Parameters: <cipher> <key> <iv_offset> <device path> \ <offset> [<#opt_params> <opt_params>] <cipher> Encryption cipher and an optional IV generation mode. - (In format cipher[:keycount]-chainmode-ivopts:ivmode). + (In format cipher[:keycount]-chainmode-ivmode[:ivopts]). Examples: des aes-cbc-essiv:sha256 @@ -19,7 +22,11 @@ Parameters: <cipher> <key> <iv_offset> <device path> \ <key> Key used for encryption. It is encoded as a hexadecimal number. - You can only use key sizes that are valid for the selected cipher. + You can only use key sizes that are valid for the selected cipher + in combination with the selected iv mode. + Note that for some iv modes the key string can contain additional + keys (for example IV seed) so the key contains more parts concatenated + into a single string. <keycount> Multi-key compatibility mode. You can define <keycount> keys and diff --git a/Documentation/device-mapper/dm-raid.txt b/Documentation/device-mapper/dm-raid.txt index 946c73342cd..ef8ba9fa58c 100644 --- a/Documentation/device-mapper/dm-raid.txt +++ b/Documentation/device-mapper/dm-raid.txt @@ -1,10 +1,13 @@ dm-raid -------- +======= The device-mapper RAID (dm-raid) target provides a bridge from DM to MD. It allows the MD RAID drivers to be accessed using a device-mapper interface. + +Mapping Table Interface +----------------------- The target is named "raid" and it accepts the following parameters: <raid_type> <#raid_params> <raid_params> \ @@ -27,6 +30,11 @@ The target is named "raid" and it accepts the following parameters: - rotating parity N (right-to-left) with data restart raid6_nc RAID6 N continue - rotating parity N (right-to-left) with data continuation + raid10 Various RAID10 inspired algorithms chosen by additional params + - RAID10: Striped Mirrors (aka 'Striping on top of mirrors') + - RAID1E: Integrated Adjacent Stripe Mirroring + - RAID1E: Integrated Offset Stripe Mirroring + - and other similar RAID10 variants Reference: Chapter 4 of http://www.snia.org/sites/default/files/SNIA_DDF_Technical_Position_v2.0.pdf @@ -42,7 +50,7 @@ The target is named "raid" and it accepts the following parameters: followed by optional parameters (in any order): [sync|nosync] Force or prevent RAID initialization. - [rebuild <idx>] Rebuild drive number idx (first drive is 0). + [rebuild <idx>] Rebuild drive number 'idx' (first drive is 0). [daemon_sleep <ms>] Interval between runs of the bitmap daemon that @@ -51,14 +59,63 @@ The target is named "raid" and it accepts the following parameters: [min_recovery_rate <kB/sec/disk>] Throttle RAID initialization [max_recovery_rate <kB/sec/disk>] Throttle RAID initialization - [write_mostly <idx>] Drive index is write-mostly - [max_write_behind <sectors>] See '-write-behind=' (man mdadm) - [stripe_cache <sectors>] Stripe cache size (higher RAIDs only) + [write_mostly <idx>] Mark drive index 'idx' write-mostly. + [max_write_behind <sectors>] See '--write-behind=' (man mdadm) + [stripe_cache <sectors>] Stripe cache size (RAID 4/5/6 only) [region_size <sectors>] The region_size multiplied by the number of regions is the logical size of the array. The bitmap records the device synchronisation state for each region. + [raid10_copies <# copies>] + [raid10_format <near|far|offset>] + These two options are used to alter the default layout of + a RAID10 configuration. The number of copies is can be + specified, but the default is 2. 
There are also three + variations to how the copies are laid down - the default + is "near". Near copies are what most people think of with + respect to mirroring. If these options are left unspecified, + or 'raid10_copies 2' and/or 'raid10_format near' are given, + then the layouts for 2, 3 and 4 devices are: + 2 drives 3 drives 4 drives + -------- ---------- -------------- + A1 A1 A1 A1 A2 A1 A1 A2 A2 + A2 A2 A2 A3 A3 A3 A3 A4 A4 + A3 A3 A4 A4 A5 A5 A5 A6 A6 + A4 A4 A5 A6 A6 A7 A7 A8 A8 + .. .. .. .. .. .. .. .. .. + The 2-device layout is equivalent 2-way RAID1. The 4-device + layout is what a traditional RAID10 would look like. The + 3-device layout is what might be called a 'RAID1E - Integrated + Adjacent Stripe Mirroring'. + + If 'raid10_copies 2' and 'raid10_format far', then the layouts + for 2, 3 and 4 devices are: + 2 drives 3 drives 4 drives + -------- -------------- -------------------- + A1 A2 A1 A2 A3 A1 A2 A3 A4 + A3 A4 A4 A5 A6 A5 A6 A7 A8 + A5 A6 A7 A8 A9 A9 A10 A11 A12 + .. .. .. .. .. .. .. .. .. + A2 A1 A3 A1 A2 A2 A1 A4 A3 + A4 A3 A6 A4 A5 A6 A5 A8 A7 + A6 A5 A9 A7 A8 A10 A9 A12 A11 + .. .. .. .. .. .. .. .. .. + + If 'raid10_copies 2' and 'raid10_format offset', then the + layouts for 2, 3 and 4 devices are: + 2 drives 3 drives 4 drives + -------- ------------ ----------------- + A1 A2 A1 A2 A3 A1 A2 A3 A4 + A2 A1 A3 A1 A2 A2 A1 A4 A3 + A3 A4 A4 A5 A6 A5 A6 A7 A8 + A4 A3 A6 A4 A5 A6 A5 A8 A7 + A5 A6 A7 A8 A9 A9 A10 A11 A12 + A6 A5 A9 A7 A8 A10 A9 A12 A11 + .. .. .. .. .. .. .. .. .. + Here we see layouts closely akin to 'RAID1E - Integrated + Offset Stripe Mirroring'. + <#raid_devs>: The number of devices composing the array. Each device consists of two entries. The first is the device containing the metadata (if any); the second is the one containing the @@ -68,7 +125,7 @@ The target is named "raid" and it accepts the following parameters: given for both the metadata and data drives for a given position. -Example tables +Example Tables -------------- # RAID4 - 4 data drives, 1 parity (no metadata devices) # No metadata devices specified to hold superblock/bitmap info @@ -87,22 +144,83 @@ Example tables raid4 4 2048 sync min_recovery_rate 20 \ 5 8:17 8:18 8:33 8:34 8:49 8:50 8:65 8:66 8:81 8:82 + +Status Output +------------- 'dmsetup table' displays the table used to construct the mapping. The optional parameters are always printed in the order listed above with "sync" or "nosync" always output ahead of the other arguments, regardless of the order used when originally loading the table. Arguments that can be repeated are ordered by value. -'dmsetup status' yields information on the state and health of the -array. -The output is as follows: + +'dmsetup status' yields information on the state and health of the array. +The output is as follows (normally a single line, but expanded here for +clarity): 1: <s> <l> raid \ -2: <raid_type> <#devices> <1 health char for each dev> <resync_ratio> +2: <raid_type> <#devices> <health_chars> \ +3: <sync_ratio> <sync_action> <mismatch_cnt> Line 1 is the standard output produced by device-mapper. -Line 2 is produced by the raid target, and best explained by example: - 0 1960893648 raid raid4 5 AAAAA 2/490221568 +Line 2 & 3 are produced by the raid target and are best explained by example: + 0 1960893648 raid raid4 5 AAAAA 2/490221568 init 0 Here we can see the RAID type is raid4, there are 5 devices - all of -which are 'A'live, and the array is 2/490221568 complete with recovery. -Faulty or missing devices are marked 'D'. 
Devices that are out-of-sync -are marked 'a'. +which are 'A'live, and the array is 2/490221568 complete with its initial +recovery. Here is a fuller description of the individual fields: + <raid_type> Same as the <raid_type> used to create the array. + <health_chars> One char for each device, indicating: 'A' = alive and + in-sync, 'a' = alive but not in-sync, 'D' = dead/failed. + <sync_ratio> The ratio indicating how much of the array has undergone + the process described by 'sync_action'. If the + 'sync_action' is "check" or "repair", then the process + of "resync" or "recover" can be considered complete. + <sync_action> One of the following possible states: + idle - No synchronization action is being performed. + frozen - The current action has been halted. + resync - Array is undergoing its initial synchronization + or is resynchronizing after an unclean shutdown + (possibly aided by a bitmap). + recover - A device in the array is being rebuilt or + replaced. + check - A user-initiated full check of the array is + being performed. All blocks are read and + checked for consistency. The number of + discrepancies found are recorded in + <mismatch_cnt>. No changes are made to the + array by this action. + repair - The same as "check", but discrepancies are + corrected. + reshape - The array is undergoing a reshape. + <mismatch_cnt> The number of discrepancies found between mirror copies + in RAID1/10 or wrong parity values found in RAID4/5/6. + This value is valid only after a "check" of the array + is performed. A healthy array has a 'mismatch_cnt' of 0. + +Message Interface +----------------- +The dm-raid target will accept certain actions through the 'message' interface. +('man dmsetup' for more information on the message interface.) These actions +include: + "idle" - Halt the current sync action. + "frozen" - Freeze the current sync action. + "resync" - Initiate/continue a resync. + "recover"- Initiate/continue a recover process. + "check" - Initiate a check (i.e. a "scrub") of the array. + "repair" - Initiate a repair of the array. + "reshape"- Currently unsupported (-EINVAL). + +Version History +--------------- +1.0.0 Initial version. Support for RAID 4/5/6 +1.1.0 Added support for RAID 1 +1.2.0 Handle creation of arrays that contain failed devices. +1.3.0 Added support for RAID 10 +1.3.1 Allow device replacement/rebuild for RAID 10 +1.3.2 Fix/improve redundancy checking for RAID10 +1.4.0 Non-functional change. Removes arg from mapping function. +1.4.1 RAID10 fix redundancy validation checks (commit 55ebbb5). +1.4.2 Add RAID10 "far" and "offset" algorithm support. +1.5.0 Add message interface to allow manipulation of the sync_action. + New status (STATUSTYPE_INFO) fields: sync_action and mismatch_cnt. +1.5.1 Add ability to restore transiently failed devices on resume. +1.5.2 'mismatch_cnt' is zero unless [last_]sync_action is "check". diff --git a/Documentation/device-mapper/era.txt b/Documentation/device-mapper/era.txt new file mode 100644 index 00000000000..3c6d01be356 --- /dev/null +++ b/Documentation/device-mapper/era.txt @@ -0,0 +1,108 @@ +Introduction +============ + +dm-era is a target that behaves similar to the linear target. In +addition it keeps track of which blocks were written within a user +defined period of time called an 'era'. Each era target instance +maintains the current era as a monotonically increasing 32-bit +counter. 
+
+Use cases include tracking changed blocks for backup software, and
+partially invalidating the contents of a cache to restore cache
+coherency after rolling back a vendor snapshot.
+
+Constructor
+===========
+
+ era <metadata dev> <origin dev> <block size>
+
+ metadata dev    : fast device holding the persistent metadata
+ origin dev      : device holding data blocks that may change
+ block size      : block size of origin data device, granularity that is
+                   tracked by the target
+
+Messages
+========
+
+None of the dm messages take any arguments.
+
+checkpoint
+----------
+
+Possibly move to a new era. You shouldn't assume the era has
+incremented. After sending this message, you should check the
+current era via the status line.
+
+take_metadata_snap
+------------------
+
+Create a clone of the metadata, to allow a userland process to read it.
+
+drop_metadata_snap
+------------------
+
+Drop the metadata snapshot.
+
+Status
+======
+
+<metadata block size> <#used metadata blocks>/<#total metadata blocks>
+<current era> <held metadata root | '-'>
+
+metadata block size      : Fixed block size for each metadata block in
+                           sectors
+#used metadata blocks    : Number of metadata blocks used
+#total metadata blocks   : Total number of metadata blocks
+current era              : The current era
+held metadata root       : The location, in blocks, of the metadata root
+                           that has been 'held' for userspace read
+                           access. '-' indicates there is no held root
+
+Detailed use case
+=================
+
+The scenario of invalidating a cache when rolling back a vendor
+snapshot was the primary use case when developing this target:
+
+Taking a vendor snapshot
+------------------------
+
+- Send a checkpoint message to the era target
+- Make a note of the current era in its status line
+- Take vendor snapshot (the era and snapshot should be forever
+  associated now).
+
+Rolling back to a vendor snapshot
+---------------------------------
+
+- Cache enters passthrough mode (see: dm-cache's docs in cache.txt)
+- Rollback vendor storage
+- Take metadata snapshot
+- Ascertain which blocks have been written since the snapshot was taken
+  by checking each block's era
+- Invalidate those blocks in the caching software
+- Cache returns to writeback/writethrough mode
+
+Memory usage
+============
+
+The target uses a bitset to record writes in the current era. It also
+has a spare bitset ready for switching over to a new era. Other than
+that it uses a few 4k blocks for updating metadata.
+
+   (4 * nr_blocks) bytes + buffers
+
+Resilience
+==========
+
+Metadata is updated on disk before a write to a previously unwritten
+block is performed. As such dm-era should not be affected by a hard
+crash such as power failure.
+
+Userland tools
+==============
+
+Userland tools are found in the increasingly poorly named
+thin-provisioning-tools project:
+
+    https://github.com/jthornber/thin-provisioning-tools
diff --git a/Documentation/device-mapper/statistics.txt b/Documentation/device-mapper/statistics.txt
new file mode 100644
index 00000000000..2a1673adc20
--- /dev/null
+++ b/Documentation/device-mapper/statistics.txt
@@ -0,0 +1,186 @@
+DM statistics
+=============
+
+Device Mapper supports the collection of I/O statistics on user-defined
+regions of a DM device. If no regions are defined no statistics are
+collected so there isn't any performance impact. Only bio-based DM
+devices are currently supported.
+
+Each user-defined region specifies a starting sector, length and step.
+Individual statistics will be collected for each step-sized area within +the range specified. + +The I/O statistics counters for each step-sized area of a region are +in the same format as /sys/block/*/stat or /proc/diskstats (see: +Documentation/iostats.txt). But two extra counters (12 and 13) are +provided: total time spent reading and writing in milliseconds. All +these counters may be accessed by sending the @stats_print message to +the appropriate DM device via dmsetup. + +Each region has a corresponding unique identifier, which we call a +region_id, that is assigned when the region is created. The region_id +must be supplied when querying statistics about the region, deleting the +region, etc. Unique region_ids enable multiple userspace programs to +request and process statistics for the same DM device without stepping +on each other's data. + +The creation of DM statistics will allocate memory via kmalloc or +fallback to using vmalloc space. At most, 1/4 of the overall system +memory may be allocated by DM statistics. The admin can see how much +memory is used by reading +/sys/module/dm_mod/parameters/stats_current_allocated_bytes + +Messages +======== + + @stats_create <range> <step> [<program_id> [<aux_data>]] + + Create a new region and return the region_id. + + <range> + "-" - whole device + "<start_sector>+<length>" - a range of <length> 512-byte sectors + starting with <start_sector>. + + <step> + "<area_size>" - the range is subdivided into areas each containing + <area_size> sectors. + "/<number_of_areas>" - the range is subdivided into the specified + number of areas. + + <program_id> + An optional parameter. A name that uniquely identifies + the userspace owner of the range. This groups ranges together + so that userspace programs can identify the ranges they + created and ignore those created by others. + The kernel returns this string back in the output of + @stats_list message, but it doesn't use it for anything else. + + <aux_data> + An optional parameter. A word that provides auxiliary data + that is useful to the client program that created the range. + The kernel returns this string back in the output of + @stats_list message, but it doesn't use this value for anything. + + @stats_delete <region_id> + + Delete the region with the specified id. + + <region_id> + region_id returned from @stats_create + + @stats_clear <region_id> + + Clear all the counters except the in-flight i/o counters. + + <region_id> + region_id returned from @stats_create + + @stats_list [<program_id>] + + List all regions registered with @stats_create. + + <program_id> + An optional parameter. + If this parameter is specified, only matching regions + are returned. + If it is not specified, all regions are returned. + + Output format: + <region_id>: <start_sector>+<length> <step> <program_id> <aux_data> + + @stats_print <region_id> [<starting_line> <number_of_lines>] + + Print counters for each step-sized area of a region. + + <region_id> + region_id returned from @stats_create + + <starting_line> + The index of the starting line in the output. + If omitted, all lines are returned. + + <number_of_lines> + The number of lines to include in the output. + If omitted, all lines are returned. + + Output format for each step-sized area of a region: + + <start_sector>+<length> counters + + The first 11 counters have the same meaning as + /sys/block/*/stat or /proc/diskstats. + + Please refer to Documentation/iostats.txt for details. + + 1. the number of reads completed + 2. 
the number of reads merged + 3. the number of sectors read + 4. the number of milliseconds spent reading + 5. the number of writes completed + 6. the number of writes merged + 7. the number of sectors written + 8. the number of milliseconds spent writing + 9. the number of I/Os currently in progress + 10. the number of milliseconds spent doing I/Os + 11. the weighted number of milliseconds spent doing I/Os + + Additional counters: + 12. the total time spent reading in milliseconds + 13. the total time spent writing in milliseconds + + @stats_print_clear <region_id> [<starting_line> <number_of_lines>] + + Atomically print and then clear all the counters except the + in-flight i/o counters. Useful when the client consuming the + statistics does not want to lose any statistics (those updated + between printing and clearing). + + <region_id> + region_id returned from @stats_create + + <starting_line> + The index of the starting line in the output. + If omitted, all lines are printed and then cleared. + + <number_of_lines> + The number of lines to process. + If omitted, all lines are printed and then cleared. + + @stats_set_aux <region_id> <aux_data> + + Store auxiliary data aux_data for the specified region. + + <region_id> + region_id returned from @stats_create + + <aux_data> + The string that identifies data which is useful to the client + program that created the range. The kernel returns this + string back in the output of @stats_list message, but it + doesn't use this value for anything. + +Examples +======== + +Subdivide the DM device 'vol' into 100 pieces and start collecting +statistics on them: + + dmsetup message vol 0 @stats_create - /100 + +Set the auxillary data string to "foo bar baz" (the escape for each +space must also be escaped, otherwise the shell will consume them): + + dmsetup message vol 0 @stats_set_aux 0 foo\\ bar\\ baz + +List the statistics: + + dmsetup message vol 0 @stats_list + +Print the statistics: + + dmsetup message vol 0 @stats_print 0 + +Delete the statistics: + + dmsetup message vol 0 @stats_delete 0 diff --git a/Documentation/device-mapper/striped.txt b/Documentation/device-mapper/striped.txt index f34d3236b9d..45f3b91ea4c 100644 --- a/Documentation/device-mapper/striped.txt +++ b/Documentation/device-mapper/striped.txt @@ -9,15 +9,14 @@ devices in parallel. Parameters: <num devs> <chunk size> [<dev path> <offset>]+ <num devs>: Number of underlying devices. - <chunk size>: Size of each chunk of data. Must be a power-of-2 and at - least as large as the system's PAGE_SIZE. + <chunk size>: Size of each chunk of data. Must be at least as + large as the system's PAGE_SIZE. <dev path>: Full pathname to the underlying block-device, or a "major:minor" device-number. <offset>: Starting sector within the device. One or more underlying devices can be specified. The striped device size must -be a multiple of the chunk size and a multiple of the number of underlying -devices. +be a multiple of the chunk size multiplied by the number of underlying devices. Example scripts diff --git a/Documentation/device-mapper/switch.txt b/Documentation/device-mapper/switch.txt new file mode 100644 index 00000000000..2fa749387be --- /dev/null +++ b/Documentation/device-mapper/switch.txt @@ -0,0 +1,126 @@ +dm-switch +========= + +The device-mapper switch target creates a device that supports an +arbitrary mapping of fixed-size regions of I/O across a fixed set of +paths. The path used for any specific region can be switched +dynamically by sending the target a message. 
+ +It maps I/O to underlying block devices efficiently when there is a large +number of fixed-sized address regions but there is no simple pattern +that would allow for a compact representation of the mapping such as +dm-stripe. + +Background +---------- + +Dell EqualLogic and some other iSCSI storage arrays use a distributed +frameless architecture. In this architecture, the storage group +consists of a number of distinct storage arrays ("members") each having +independent controllers, disk storage and network adapters. When a LUN +is created it is spread across multiple members. The details of the +spreading are hidden from initiators connected to this storage system. +The storage group exposes a single target discovery portal, no matter +how many members are being used. When iSCSI sessions are created, each +session is connected to an eth port on a single member. Data to a LUN +can be sent on any iSCSI session, and if the blocks being accessed are +stored on another member the I/O will be forwarded as required. This +forwarding is invisible to the initiator. The storage layout is also +dynamic, and the blocks stored on disk may be moved from member to +member as needed to balance the load. + +This architecture simplifies the management and configuration of both +the storage group and initiators. In a multipathing configuration, it +is possible to set up multiple iSCSI sessions to use multiple network +interfaces on both the host and target to take advantage of the +increased network bandwidth. An initiator could use a simple round +robin algorithm to send I/O across all paths and let the storage array +members forward it as necessary, but there is a performance advantage to +sending data directly to the correct member. + +A device-mapper table already lets you map different regions of a +device onto different targets. However in this architecture the LUN is +spread with an address region size on the order of 10s of MBs, which +means the resulting table could have more than a million entries and +consume far too much memory. + +Using this device-mapper switch target we can now build a two-layer +device hierarchy: + + Upper Tier – Determine which array member the I/O should be sent to. + Lower Tier – Load balance amongst paths to a particular member. + +The lower tier consists of a single dm multipath device for each member. +Each of these multipath devices contains the set of paths directly to +the array member in one priority group, and leverages existing path +selectors to load balance amongst these paths. We also build a +non-preferred priority group containing paths to other array members for +failover reasons. + +The upper tier consists of a single dm-switch device. This device uses +a bitmap to look up the location of the I/O and choose the appropriate +lower tier device to route the I/O. By using a bitmap we are able to +use 4 bits for each address range in a 16 member group (which is very +large for us). This is a much denser representation than the dm table +b-tree can achieve. + +Construction Parameters +======================= + + <num_paths> <region_size> <num_optional_args> [<optional_args>...] + [<dev_path> <offset>]+ + +<num_paths> + The number of paths across which to distribute the I/O. + +<region_size> + The number of 512-byte sectors in a region. Each region can be redirected + to any of the available paths. + +<num_optional_args> + The number of optional arguments. Currently, no optional arguments + are supported and so this must be zero. 
+ +<dev_path> + The block device that represents a specific path to the device. + +<offset> + The offset of the start of data on the specific <dev_path> (in units + of 512-byte sectors). This number is added to the sector number when + forwarding the request to the specific path. Typically it is zero. + +Messages +======== + +set_region_mappings <index>:<path_nr> [<index>]:<path_nr> [<index>]:<path_nr>... + +Modify the region table by specifying which regions are redirected to +which paths. + +<index> + The region number (region size was specified in constructor parameters). + If index is omitted, the next region (previous index + 1) is used. + Expressed in hexadecimal (WITHOUT any prefix like 0x). + +<path_nr> + The path number in the range 0 ... (<num_paths> - 1). + Expressed in hexadecimal (WITHOUT any prefix like 0x). + +Status +====== + +No status line is reported. + +Example +======= + +Assume that you have volumes vg1/switch0 vg1/switch1 vg1/switch2 with +the same size. + +Create a switch device with 64kB region size: + dmsetup create switch --table "0 `blockdev --getsize /dev/vg1/switch0` + switch 3 128 0 /dev/vg1/switch0 0 /dev/vg1/switch1 0 /dev/vg1/switch2 0" + +Set mappings for the first 7 entries to point to devices switch0, switch1, +switch2, switch0, switch1, switch2, switch1: + dmsetup message switch 0 set_region_mappings 0:0 :1 :2 :0 :1 :2 :1 diff --git a/Documentation/device-mapper/thin-provisioning.txt b/Documentation/device-mapper/thin-provisioning.txt index f5cfc62b7ad..2f5173500bd 100644 --- a/Documentation/device-mapper/thin-provisioning.txt +++ b/Documentation/device-mapper/thin-provisioning.txt @@ -99,13 +99,14 @@ Using an existing pool device $data_block_size $low_water_mark" $data_block_size gives the smallest unit of disk space that can be -allocated at a time expressed in units of 512-byte sectors. People -primarily interested in thin provisioning may want to use a value such -as 1024 (512KB). People doing lots of snapshotting may want a smaller value -such as 128 (64KB). If you are not zeroing newly-allocated data, -a larger $data_block_size in the region of 256000 (128MB) is suggested. -$data_block_size must be the same for the lifetime of the -metadata device. +allocated at a time expressed in units of 512-byte sectors. +$data_block_size must be between 128 (64KB) and 2097152 (1GB) and a +multiple of 128 (64KB). $data_block_size cannot be changed after the +thin-pool is created. People primarily interested in thin provisioning +may want to use a value such as 1024 (512KB). People doing lots of +snapshotting may want a smaller value such as 128 (64KB). If you are +not zeroing newly-allocated data, a larger $data_block_size in the +region of 256000 (128MB) is suggested. $low_water_mark is expressed in blocks of size $data_block_size. If free space on the data device drops below this level then a dm event @@ -115,6 +116,35 @@ Resuming a device with a new table itself triggers an event so the userspace daemon can use this to detect a situation where a new table already exceeds the threshold. +A low water mark for the metadata device is maintained in the kernel and +will trigger a dm event if free space on the metadata device drops below +it. + +Updating on-disk metadata +------------------------- + +On-disk metadata is committed every time a FLUSH or FUA bio is written. +If no such requests are made then commits will occur every second. This +means the thin-provisioning target behaves like a physical disk that has +a volatile write cache. 
If power is lost you may lose some recent +writes. The metadata should always be consistent in spite of any crash. + +If data space is exhausted the pool will either error or queue IO +according to the configuration (see: error_if_no_space). If metadata +space is exhausted or a metadata operation fails: the pool will error IO +until the pool is taken offline and repair is performed to 1) fix any +potential inconsistencies and 2) clear the flag that imposes repair. +Once the pool's metadata device is repaired it may be resized, which +will allow the pool to return to normal operation. Note that if a pool +is flagged as needing repair, the pool's data and metadata devices +cannot be resized until repair is performed. It should also be noted +that when the pool's metadata space is exhausted the current metadata +transaction is aborted. Given that the pool will cache IO whose +completion may have already been acknowledged to upper IO layers +(e.g. filesystem) it is strongly suggested that consistency checks +(e.g. fsck) be performed on those layers when repair of the pool is +required. + Thin provisioning ----------------- @@ -231,6 +261,11 @@ i) Constructor no_discard_passdown: Don't pass discards down to the underlying data device, but just remove the mapping. + read_only: Don't allow any changes to be made to the pool + metadata. + + error_if_no_space: Error IOs, instead of queueing, if no space. + Data block size must be between 64KB (128 sectors) and 1GB (2097152 sectors) inclusive. @@ -239,7 +274,7 @@ ii) Status <transaction id> <used metadata blocks>/<total metadata blocks> <used data blocks>/<total data blocks> <held metadata root> - + [no_]discard_passdown ro|rw transaction id: A 64-bit number used by userspace to help synchronise with metadata @@ -252,10 +287,32 @@ ii) Status should register for the event and then check the target's status. held metadata root: - The location, in sectors, of the metadata root that has been + The location, in blocks, of the metadata root that has been 'held' for userspace read access. '-' indicates there is no - held root. This feature is not yet implemented so '-' is - always returned. + held root. + + discard_passdown|no_discard_passdown + Whether or not discards are actually being passed down to the + underlying device. When this is enabled when loading the table, + it can get disabled if the underlying device doesn't support it. + + ro|rw + If the pool encounters certain types of device failures it will + drop into a read-only metadata mode in which no changes to + the pool metadata (like allocating new blocks) are permitted. + + In serious cases where even a read-only mode is deemed unsafe + no further I/O will be permitted and the status will just + contain the string 'Fail'. The userspace recovery tools + should then be used. + + error_if_no_space|queue_if_no_space + If the pool runs out of data or metadata space, the pool will + either queue or error the IO destined to the data device. The + default is to queue the IO until more space is added or the + 'no_space_timeout' expires. The 'no_space_timeout' dm-thin-pool + module parameter can be used to change this timeout -- it + defaults to 60 seconds but may be disabled using a value of 0. iii) Messages @@ -329,3 +386,7 @@ regain some space then send the 'trim' message to the pool. ii) Status <nr mapped sectors> <highest mapped sector> + + If the pool has encountered device errors and failed, the status + will just contain the string 'Fail'. The userspace recovery + tools should then be used. |
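To round this out, a monitoring script might watch the data-block usage and
the 'Fail' state roughly as sketched below. The pool name 'my_pool' and the
exact field positions are assumptions based on the status layout described
above; verify them against 'dmsetup status' output on your system.

    #!/bin/sh
    # Report thin-pool data usage, or flag a failed pool.
    status=$(dmsetup status my_pool)
    case "$status" in
      *Fail*)
        echo "pool is in Fail mode; use the userspace recovery tools" ;;
      *)
        # Assuming 'start length thin-pool <transaction id>
        # <used meta>/<total meta> <used data>/<total data> ...',
        # the data usage is field 6.
        echo "$status" | awk '{ split($6, d, "/");
          printf "data blocks used: %s of %s\n", d[1], d[2] }' ;;
    esac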
