48 files changed, 8791 insertions, 601 deletions
diff --git a/Documentation/device-mapper/cache-policies.txt b/Documentation/device-mapper/cache-policies.txt new file mode 100644 index 00000000000..d7c440b444c --- /dev/null +++ b/Documentation/device-mapper/cache-policies.txt @@ -0,0 +1,77 @@ +Guidance for writing policies +============================= + +Try to keep transactionality out of it. The core is careful to +avoid asking about anything that is migrating. This is a pain, but +makes it easier to write the policies. + +Mappings are loaded into the policy at construction time. + +Every bio that is mapped by the target is referred to the policy. +The policy can return a simple HIT or MISS or issue a migration. + +Currently there's no way for the policy to issue background work, +e.g. to start writing back dirty blocks that are going to be evicted +soon. + +Because we map bios, rather than requests it's easy for the policy +to get fooled by many small bios. For this reason the core target +issues periodic ticks to the policy. It's suggested that the policy +doesn't update states (eg, hit counts) for a block more than once +for each tick. The core ticks by watching bios complete, and so +trying to see when the io scheduler has let the ios run. + + +Overview of supplied cache replacement policies +=============================================== + +multiqueue +---------- + +This policy is the default. + +The multiqueue policy has two sets of 16 queues: one set for entries +waiting for the cache and another one for those in the cache. +Cache entries in the queues are aged based on logical time. Entry into +the cache is based on variable thresholds and queue selection is based +on hit count on entry. The policy aims to take different cache miss +costs into account and to adjust to varying load patterns automatically. + +Message and constructor argument pairs are: + 'sequential_threshold <#nr_sequential_ios>' and + 'random_threshold <#nr_random_ios>'. 
+ +The sequential threshold indicates the number of contiguous I/Os +required before a stream is treated as sequential. The random threshold +is the number of intervening non-contiguous I/Os that must be seen +before the stream is treated as random again. + +The sequential and random thresholds default to 512 and 4 respectively. + +Large, sequential ios are probably better left on the origin device +since spindles tend to have good bandwidth. The io_tracker counts +contiguous I/Os to try to spot when the io is in one of these sequential +modes. + +cleaner +------- + +The cleaner writes back all dirty blocks in a cache to decommission it. + +Examples +======== + +The syntax for a table is: + cache <metadata dev> <cache dev> <origin dev> <block size> + <#feature_args> [<feature arg>]* + <policy> <#policy_args> [<policy arg>]* + +The syntax to send a message using the dmsetup command is: + dmsetup message <mapped device> 0 sequential_threshold 1024 + dmsetup message <mapped device> 0 random_threshold 8 + +Using dmsetup: + dmsetup create blah --table "0 268435456 cache /dev/sdb /dev/sdc \ + /dev/sdd 512 0 mq 4 sequential_threshold 1024 random_threshold 8" + creates a 128GB large mapped device named 'blah' with the + sequential threshold set to 1024 and the random_threshold set to 8. diff --git a/Documentation/device-mapper/cache.txt b/Documentation/device-mapper/cache.txt new file mode 100644 index 00000000000..f50470abe24 --- /dev/null +++ b/Documentation/device-mapper/cache.txt @@ -0,0 +1,243 @@ +Introduction +============ + +dm-cache is a device mapper target written by Joe Thornber, Heinz +Mauelshagen, and Mike Snitzer. + +It aims to improve performance of a block device (eg, a spindle) by +dynamically migrating some of its data to a faster, smaller device +(eg, an SSD). + +This device-mapper solution allows us to insert this caching at +different levels of the dm stack, for instance above the data device for +a thin-provisioning pool. 
Caching solutions that are integrated more +closely with the virtual memory system should give better performance. + +The target reuses the metadata library used in the thin-provisioning +library. + +The decision as to what data to migrate and when is left to a plug-in +policy module. Several of these have been written as we experiment, +and we hope other people will contribute others for specific io +scenarios (eg. a vm image server). + +Glossary +======== + + Migration - Movement of the primary copy of a logical block from one + device to the other. + Promotion - Migration from slow device to fast device. + Demotion - Migration from fast device to slow device. + +The origin device always contains a copy of the logical block, which +may be out of date or kept in sync with the copy on the cache device +(depending on policy). + +Design +====== + +Sub-devices +----------- + +The target is constructed by passing three devices to it (along with +other parameters detailed later): + +1. An origin device - the big, slow one. + +2. A cache device - the small, fast one. + +3. A small metadata device - records which blocks are in the cache, + which are dirty, and extra hints for use by the policy object. + This information could be put on the cache device, but having it + separate allows the volume manager to configure it differently, + e.g. as a mirror for extra robustness. + +Fixed block size +---------------- + +The origin is divided up into blocks of a fixed size. This block size +is configurable when you first create the cache. Typically we've been +using block sizes of 256k - 1024k. + +Having a fixed block size simplifies the target a lot. But it is +something of a compromise. For instance, a small part of a block may be +getting hit a lot, yet the whole block will be promoted to the cache. +So large block sizes are bad because they waste cache space. And small +block sizes are bad because they increase the amount of metadata (both +in core and on disk). 
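To make the block size trade-off above concrete, here is a minimal C sketch of the arithmetic. It is an illustration only and not code from this patch: the helper names `nr_cache_blocks` and `sectors_to_bytes` are hypothetical, 512-byte sectors are assumed, and the sketch assumes a partial trailing block is simply left unused.

```c
#include <stdint.h>

/* Assumption: sizes are expressed in 512-byte sectors, as in dmsetup tables. */
#define SECTOR_SIZE 512ULL

/*
 * Hypothetical helper: number of whole cache blocks that fit on a cache
 * device. A trailing partial block is dropped (an assumption of this
 * sketch). Returns 0 for a zero block size rather than dividing by zero.
 */
uint64_t nr_cache_blocks(uint64_t cache_dev_sectors, uint64_t block_size_sectors)
{
	if (block_size_sectors == 0)
		return 0;
	return cache_dev_sectors / block_size_sectors;
}

/* Hypothetical helper: convert a sector count to bytes. */
uint64_t sectors_to_bytes(uint64_t sectors)
{
	return sectors * SECTOR_SIZE;
}
```

For a hypothetical 64 GiB cache device (134217728 sectors), a block size of 512 sectors (256k) gives 262144 cache blocks to track, while 2048 sectors (1024k) gives 65536. The larger block size needs a quarter of the per-block metadata entries, but every promotion then copies four times as much data, even if only a small part of the block is hot.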
+ +Writeback/writethrough +---------------------- + +The cache has two modes, writeback and writethrough. + +If writeback, the default, is selected then a write to a block that is +cached will go only to the cache and the block will be marked dirty in +the metadata. + +If writethrough is selected then a write to a cached block will not +complete until it has hit both the origin and cache devices. Clean +blocks should remain clean. + +A simple cleaner policy is provided, which will clean (write back) all +dirty blocks in a cache. Useful for decommissioning a cache. + +Migration throttling +-------------------- + +Migrating data between the origin and cache device uses bandwidth. +The user can set a throttle to prevent more than a certain amount of +migration occurring at any one time. Currently we're not taking any +account of normal io traffic going to the devices. More work needs +doing here to avoid migrating during those peak io moments. + +For the time being, a message "migration_threshold <#sectors>" +can be used to set the maximum number of sectors being migrated, +the default being 204800 sectors (or 100MB). + +Updating on-disk metadata +------------------------- + +On-disk metadata is committed every time a REQ_SYNC or REQ_FUA bio is +written. If no such requests are made then commits will occur every +second. This means the cache behaves like a physical disk that has a +write cache (the same is true of the thin-provisioning target). If +power is lost you may lose some recent writes. The metadata should +always be consistent in spite of any crash. + +The 'dirty' state for a cache block changes far too frequently for us +to keep updating it on the fly. So we treat it as a hint. In normal +operation it will be written when the dm device is suspended. If the +system crashes all cache blocks will be assumed dirty when restarted. + +Per-block policy hints +---------------------- + +Policy plug-ins can store a chunk of data per cache block. 
It's up to +the policy how big this chunk is, but it should be kept small. Like the +dirty flags this data is lost if there's a crash so a safe fallback +value should always be possible. + +For instance, the 'mq' policy, which is currently the default policy, +uses this facility to store the hit count of the cache blocks. If +there's a crash this information will be lost, which means the cache +may be less efficient until those hit counts are regenerated. + +Policy hints affect performance, not correctness. + +Policy messaging +---------------- + +Policies will have different tunables, specific to each one, so we +need a generic way of getting and setting these. Device-mapper +messages are used. Refer to cache-policies.txt. + +Discard bitset resolution +------------------------- + +We can avoid copying data during migration if we know the block has +been discarded. A prime example of this is when mkfs discards the +whole block device. We store a bitset tracking the discard state of +blocks. However, we allow this bitset to have a different block size +from the cache blocks. This is because we need to track the discard +state for all of the origin device (compare with the dirty bitset +which is just for the smaller cache device). + +Target interface +================ + +Constructor +----------- + + cache <metadata dev> <cache dev> <origin dev> <block size> + <#feature args> [<feature arg>]* + <policy> <#policy args> [policy args]* + + metadata dev : fast device holding the persistent metadata + cache dev : fast device holding cached data blocks + origin dev : slow device holding original data blocks + block size : cache unit size in sectors + + #feature args : number of feature arguments passed + feature args : writethrough. (The default is writeback.) + + policy : the replacement policy to use + #policy args : an even number of arguments corresponding to + key/value pairs passed to the policy + policy args : key/value pairs passed to the policy + E.g. 
'sequential_threshold 1024' + See cache-policies.txt for details. + +Optional feature arguments are: + writethrough : write through caching that prohibits cache block + content from being different from origin block content. + Without this argument, the default behaviour is to write + back cache block contents later for performance reasons, + so they may differ from the corresponding origin blocks. + +A policy called 'default' is always registered. This is an alias for +the policy we currently think is giving best all round performance. + +As the default policy could vary between kernels, if you are relying on +the characteristics of a specific policy, always request it by name. + +Status +------ + +<#used metadata blocks>/<#total metadata blocks> <#read hits> <#read misses> +<#write hits> <#write misses> <#demotions> <#promotions> <#blocks in cache> +<#dirty> <#features> <features>* <#core args> <core args>* <#policy args> +<policy args>* + +#used metadata blocks : Number of metadata blocks used +#total metadata blocks : Total number of metadata blocks +#read hits : Number of times a READ bio has been mapped + to the cache +#read misses : Number of times a READ bio has been mapped + to the origin +#write hits : Number of times a WRITE bio has been mapped + to the cache +#write misses : Number of times a WRITE bio has been + mapped to the origin +#demotions : Number of times a block has been removed + from the cache +#promotions : Number of times a block has been moved to + the cache +#blocks in cache : Number of blocks resident in the cache +#dirty : Number of blocks in the cache that differ + from the origin +#feature args : Number of feature args to follow +feature args : 'writethrough' (optional) +#core args : Number of core arguments (must be even) +core args : Key/value pairs for tuning the core + e.g. migration_threshold +#policy args : Number of policy arguments to follow (must be even) +policy args : Key/value pairs + e.g. 
'sequential_threshold 1024 + +Messages +-------- + +Policies will have different tunables, specific to each one, so we +need a generic way of getting and setting these. Device-mapper +messages are used. (A sysfs interface would also be possible.) + +The message format is: + + <key> <value> + +E.g. + dmsetup message my_cache 0 sequential_threshold 1024 + +Examples +======== + +The test suite can be found here: + +https://github.com/jthornber/thinp-test-suite + +dmsetup create my_cache --table '0 41943040 cache /dev/mapper/metadata \ + /dev/mapper/ssd /dev/mapper/origin 512 1 writeback default 0' +dmsetup create my_cache --table '0 41943040 cache /dev/mapper/metadata \ + /dev/mapper/ssd /dev/mapper/origin 1024 1 writeback \ + mq 4 sequential_threshold 1024 random_threshold 8' diff --git a/drivers/md/Kconfig b/drivers/md/Kconfig index 91a02eeeb31..e30b490055a 100644 --- a/drivers/md/Kconfig +++ b/drivers/md/Kconfig @@ -210,7 +210,7 @@ config DM_DEBUG config DM_BUFIO tristate - depends on BLK_DEV_DM && EXPERIMENTAL + depends on BLK_DEV_DM ---help--- This interface allows you to do buffered I/O on a device and acts as a cache, holding recently-read blocks in memory and performing @@ -218,7 +218,7 @@ config DM_BUFIO config DM_BIO_PRISON tristate - depends on BLK_DEV_DM && EXPERIMENTAL + depends on BLK_DEV_DM ---help--- Some bio locking schemes used by other device-mapper targets including thin provisioning. @@ -251,8 +251,8 @@ config DM_SNAPSHOT Allow volume managers to take writable snapshots of a device. config DM_THIN_PROVISIONING - tristate "Thin provisioning target (EXPERIMENTAL)" - depends on BLK_DEV_DM && EXPERIMENTAL + tristate "Thin provisioning target" + depends on BLK_DEV_DM select DM_PERSISTENT_DATA select DM_BIO_PRISON ---help--- @@ -268,6 +268,37 @@ config DM_DEBUG_BLOCK_STACK_TRACING If unsure, say N. 
+config DM_CACHE + tristate "Cache target (EXPERIMENTAL)" + depends on BLK_DEV_DM + default n + select DM_PERSISTENT_DATA + select DM_BIO_PRISON + ---help--- + dm-cache attempts to improve performance of a block device by + moving frequently used data to a smaller, higher performance + device. Different 'policy' plugins can be used to change the + algorithms used to select which blocks are promoted, demoted, + cleaned etc. It supports writeback and writethrough modes. + +config DM_CACHE_MQ + tristate "MQ Cache Policy (EXPERIMENTAL)" + depends on DM_CACHE + default y + ---help--- + A cache policy that uses a multiqueue ordered by recent hit + count to select which blocks should be promoted and demoted. + This is meant to be a general purpose policy. It prioritises + reads over writes. + +config DM_CACHE_CLEANER + tristate "Cleaner Cache Policy (EXPERIMENTAL)" + depends on DM_CACHE + default y + ---help--- + A simple cache policy that writes back all data to the + origin. Used when decommissioning a dm-cache. + config DM_MIRROR tristate "Mirror target" depends on BLK_DEV_DM @@ -302,8 +333,8 @@ config DM_RAID in one of the available parity distribution methods. config DM_LOG_USERSPACE - tristate "Mirror userspace logging (EXPERIMENTAL)" - depends on DM_MIRROR && EXPERIMENTAL && NET + tristate "Mirror userspace logging" + depends on DM_MIRROR && NET select CONNECTOR ---help--- The userspace logging module provides a mechanism for @@ -350,8 +381,8 @@ config DM_MULTIPATH_ST If unsure, say N. config DM_DELAY - tristate "I/O delaying target (EXPERIMENTAL)" - depends on BLK_DEV_DM && EXPERIMENTAL + tristate "I/O delaying target" + depends on BLK_DEV_DM ---help--- A target that delays reads and/or writes and can send them to different devices. Useful for testing. @@ -365,14 +396,14 @@ config DM_UEVENT Generate udev events for DM events. 
config DM_FLAKEY - tristate "Flakey target (EXPERIMENTAL)" - depends on BLK_DEV_DM && EXPERIMENTAL + tristate "Flakey target" + depends on BLK_DEV_DM ---help--- A target that intermittently fails I/O for debugging purposes. config DM_VERITY - tristate "Verity target support (EXPERIMENTAL)" - depends on BLK_DEV_DM && EXPERIMENTAL + tristate "Verity target support" + depends on BLK_DEV_DM select CRYPTO select CRYPTO_HASH select DM_BUFIO diff --git a/drivers/md/Makefile b/drivers/md/Makefile index 94dce8b4932..7ceeaefc0e9 100644 --- a/drivers/md/Makefile +++ b/drivers/md/Makefile @@ -11,6 +11,9 @@ dm-mirror-y += dm-raid1.o dm-log-userspace-y \ += dm-log-userspace-base.o dm-log-userspace-transfer.o dm-thin-pool-y += dm-thin.o dm-thin-metadata.o +dm-cache-y += dm-cache-target.o dm-cache-metadata.o dm-cache-policy.o +dm-cache-mq-y += dm-cache-policy-mq.o +dm-cache-cleaner-y += dm-cache-policy-cleaner.o md-mod-y += md.o bitmap.o raid456-y += raid5.o @@ -44,6 +47,9 @@ obj-$(CONFIG_DM_ZERO) += dm-zero.o obj-$(CONFIG_DM_RAID) += dm-raid.o obj-$(CONFIG_DM_THIN_PROVISIONING) += dm-thin-pool.o obj-$(CONFIG_DM_VERITY) += dm-verity.o +obj-$(CONFIG_DM_CACHE) += dm-cache.o +obj-$(CONFIG_DM_CACHE_MQ) += dm-cache-mq.o +obj-$(CONFIG_DM_CACHE_CLEANER) += dm-cache-cleaner.o ifeq ($(CONFIG_DM_UEVENT),y) dm-mod-objs += dm-uevent.o diff --git a/drivers/md/dm-bio-prison.c b/drivers/md/dm-bio-prison.c index d9d3f1c7b66..85f0b707425 100644 --- a/drivers/md/dm-bio-prison.c +++ b/drivers/md/dm-bio-prison.c @@ -14,14 +14,6 @@ /*----------------------------------------------------------------*/ -struct dm_bio_prison_cell { - struct hlist_node list; - struct dm_bio_prison *prison; - struct dm_cell_key key; - struct bio *holder; - struct bio_list bios; -}; - struct dm_bio_prison { spinlock_t lock; mempool_t *cell_pool; @@ -87,6 +79,19 @@ void dm_bio_prison_destroy(struct dm_bio_prison *prison) } EXPORT_SYMBOL_GPL(dm_bio_prison_destroy); +struct dm_bio_prison_cell *dm_bio_prison_alloc_cell(struct 
dm_bio_prison *prison, gfp_t gfp) +{ + return mempool_alloc(prison->cell_pool, gfp); +} +EXPORT_SYMBOL_GPL(dm_bio_prison_alloc_cell); + +void dm_bio_prison_free_cell(struct dm_bio_prison *prison, + struct dm_bio_prison_cell *cell) +{ + mempool_free(cell, prison->cell_pool); +} +EXPORT_SYMBOL_GPL(dm_bio_prison_free_cell); + static uint32_t hash_key(struct dm_bio_prison *prison, struct dm_cell_key *key) { const unsigned long BIG_PRIME = 4294967291UL; @@ -114,91 +119,95 @@ static struct dm_bio_prison_cell *__search_bucket(struct hlist_head *bucket, return NULL; } -/* - * This may block if a new cell needs allocating. You must ensure that - * cells will be unlocked even if the calling thread is blocked. - * - * Returns 1 if the cell was already held, 0 if @inmate is the new holder. - */ -int dm_bio_detain(struct dm_bio_prison *prison, struct dm_cell_key *key, - struct bio *inmate, struct dm_bio_prison_cell **ref) +static void __setup_new_cell(struct dm_bio_prison *prison, + struct dm_cell_key *key, + struct bio *holder, + uint32_t hash, + struct dm_bio_prison_cell *cell) { - int r = 1; - unsigned long flags; - uint32_t hash = hash_key(prison, key); - struct dm_bio_prison_cell *cell, *cell2; - - BUG_ON(hash > prison->nr_buckets); - - spin_lock_irqsave(&prison->lock, flags); - - cell = __search_bucket(prison->cells + hash, key); - if (cell) { - bio_list_add(&cell->bios, inmate); - goto out; - } + memcpy(&cell->key, key, sizeof(cell->key)); + cell->holder = holder; + bio_list_init(&cell->bios); + hlist_add_head(&cell->list, prison->cells + hash); +} - /* - * Allocate a new cell - */ - spin_unlock_irqrestore(&prison->lock, flags); - cell2 = mempool_alloc(prison->cell_pool, GFP_NOIO); - spin_lock_irqsave(&prison->lock, flags); +static int __bio_detain(struct dm_bio_prison *prison, + struct dm_cell_key *key, + struct bio *inmate, + struct dm_bio_prison_cell *cell_prealloc, + struct dm_bio_prison_cell **cell_result) +{ + uint32_t hash = hash_key(prison, key); + struct 
dm_bio_prison_cell *cell; - /* - * We've been unlocked, so we have to double check that - * nobody else has inserted this cell in the meantime. - */ cell = __search_bucket(prison->cells + hash, key); if (cell) { - mempool_free(cell2, prison->cell_pool); - bio_list_add(&cell->bios, inmate); - goto out; + if (inmate) + bio_list_add(&cell->bios, inmate); + *cell_result = cell; + return 1; } - /* - * Use new cell. - */ - cell = cell2; - - cell->prison = prison; - memcpy(&cell->key, key, sizeof(cell->key)); - cell->holder = inmate; - bio_list_init(&cell->bios); - hlist_add_head(&cell->list, prison->cells + hash); + __setup_new_cell(prison, key, inmate, hash, cell_prealloc); + *cell_result = cell_prealloc; + return 0; +} - r = 0; +static int bio_detain(struct dm_bio_prison *prison, + struct dm_cell_key *key, + struct bio *inmate, + struct dm_bio_prison_cell *cell_prealloc, + struct dm_bio_prison_cell **cell_result) +{ + int r; + unsigned long flags; -out: + spin_lock_irqsave(&prison->lock, flags); + r = __bio_detain(prison, key, inmate, cell_prealloc, cell_result); spin_unlock_irqrestore(&prison->lock, flags); - *ref = cell; - return r; } + +int dm_bio_detain(struct dm_bio_prison *prison, + struct dm_cell_key *key, + struct bio *inmate, + struct dm_bio_prison_cell *cell_prealloc, + struct dm_bio_prison_cell **cell_result) +{ + return bio_detain(prison, key, inmate, cell_prealloc, cell_result); +} EXPORT_SYMBOL_GPL(dm_bio_detain); +int dm_get_cell(struct dm_bio_prison *prison, + struct dm_cell_key *key, + struct dm_bio_prison_cell *cell_prealloc, + struct dm_bio_prison_cell **cell_result) +{ + return bio_detain(prison, key, NULL, cell_prealloc, cell_result); +} +EXPORT_SYMBOL_GPL(dm_get_cell); + /* * @inmates must have been initialised prior to this call */ -static void __cell_release(struct dm_bio_prison_cell *cell, struct bio_list *inmates) +static void __cell_release(struct dm_bio_prison_cell *cell, + struct bio_list *inmates) { - struct dm_bio_prison *prison = 
cell->prison; - hlist_del(&cell->list); if (inmates) { - bio_list_add(inmates, cell->holder); + if (cell->holder) + bio_list_add(inmates, cell->holder); bio_list_merge(inmates, &cell->bios); } - - mempool_free(cell, prison->cell_pool); } -void dm_cell_release(struct dm_bio_prison_cell *cell, struct bio_list *bios) +void dm_cell_release(struct dm_bio_prison *prison, + struct dm_bio_prison_cell *cell, + struct bio_list *bios) { unsigned long flags; - struct dm_bio_prison *prison = cell->prison; spin_lock_irqsave(&prison->lock, flags); __cell_release(cell, bios); @@ -209,20 +218,18 @@ EXPORT_SYMBOL_GPL(dm_cell_release); /* * Sometimes we don't want the holder, just the additional bios. */ -static void __cell_release_no_holder(struct dm_bio_prison_cell *cell, struct bio_list *inmates) +static void __cell_release_no_holder(struct dm_bio_prison_cell *cell, + struct bio_list *inmates) { - struct dm_bio_prison *prison = cell->prison; - hlist_del(&cell->list); bio_list_merge(inmates, &cell->bios); - - mempool_free(cell, prison->cell_pool); } -void dm_cell_release_no_holder(struct dm_bio_prison_cell *cell, struct bio_list *inmates) +void dm_cell_release_no_holder(struct dm_bio_prison *prison, + struct dm_bio_prison_cell *cell, + struct bio_list *inmates) { unsigned long flags; - struct dm_bio_prison *prison = cell->prison; spin_lock_irqsave(&prison->lock, flags); __cell_release_no_holder(cell, inmates); @@ -230,9 +237,9 @@ void dm_cell_release_no_holder(struct dm_bio_prison_cell *cell, struct bio_list } EXPORT_SYMBOL_GPL(dm_cell_release_no_holder); -void dm_cell_error(struct dm_bio_prison_cell *cell) +void dm_cell_error(struct dm_bio_prison *prison, |