diff options
author | Linus Torvalds <torvalds@linux-foundation.org> | 2013-07-11 13:03:24 -0700 |
---|---|---|
committer | Linus Torvalds <torvalds@linux-foundation.org> | 2013-07-11 13:03:24 -0700 |
commit | 36805aaea5ae3cf1bb32f1643e0a800bb69f0d5b (patch) | |
tree | 5565132549a0733772b3a2ac6b5cda516ea8cdce | |
parent | 6d2fa9e141ea56a571ec842fd4f3a86bea44a203 (diff) | |
parent | d50235b7bc3ee0a0427984d763ea7534149531b4 (diff) |
Merge branch 'for-3.11/core' of git://git.kernel.dk/linux-block
Pull core block IO updates from Jens Axboe:
"Here are the core IO block bits for 3.11. It contains:
- A tweak to the reserved tag logic from Jan, for weirdo devices with
just 3 free tags. But for those it improves things substantially
for random writes.
- Periodic writeback fix from Jan. Marked for stable as well.
- Fix for a race condition in IO scheduler switching from Jianpeng.
- The hierarchical blk-cgroup support from Tejun. This is the grunt
of the series.
- blk-throttle fix from Vivek.
Just a note that I'm in the middle of a relocation, whole family is
flying out tomorrow. Hence I will be awal the remainder of this week,
but back at work again on Monday the 15th. CC'ing Tejun, since any
potential "surprises" will most likely be from the blk-cgroup work.
But it's been brewing for a while and sitting in my tree and
linux-next for a long time, so should be solid."
* 'for-3.11/core' of git://git.kernel.dk/linux-block: (36 commits)
elevator: Fix a race in elevator switching
block: Reserve only one queue tag for sync IO if only 3 tags are available
writeback: Fix periodic writeback after fs mount
blk-throttle: implement proper hierarchy support
blk-throttle: implement throtl_grp->has_rules[]
blk-throttle: Account for child group's start time in parent while bio climbs up
blk-throttle: add throtl_qnode for dispatch fairness
blk-throttle: make throtl_pending_timer_fn() ready for hierarchy
blk-throttle: make tg_dispatch_one_bio() ready for hierarchy
blk-throttle: make blk_throtl_bio() ready for hierarchy
blk-throttle: make blk_throtl_drain() ready for hierarchy
blk-throttle: dispatch from throtl_pending_timer_fn()
blk-throttle: implement dispatch looping
blk-throttle: separate out throtl_service_queue->pending_timer from throtl_data->dispatch_work
blk-throttle: set REQ_THROTTLED from throtl_charge_bio() and gate stats update with it
blk-throttle: implement sq_to_tg(), sq_to_td() and throtl_log()
blk-throttle: add throtl_service_queue->parent_sq
blk-throttle: generalize update_disptime optimization in blk_throtl_bio()
blk-throttle: dispatch to throtl_data->service_queue.bio_lists[]
blk-throttle: move bio_lists[] and friends to throtl_service_queue
...
-rw-r--r-- | Documentation/cgroups/blkio-controller.txt | 29 | ||||
-rw-r--r-- | block/blk-cgroup.c | 105 | ||||
-rw-r--r-- | block/blk-cgroup.h | 38 | ||||
-rw-r--r-- | block/blk-tag.c | 11 | ||||
-rw-r--r-- | block/blk-throttle.c | 1064 | ||||
-rw-r--r-- | block/cfq-iosched.c | 17 | ||||
-rw-r--r-- | block/deadline-iosched.c | 16 | ||||
-rw-r--r-- | block/elevator.c | 25 | ||||
-rw-r--r-- | block/noop-iosched.c | 17 | ||||
-rw-r--r-- | fs/block_dev.c | 9 | ||||
-rw-r--r-- | include/linux/cgroup.h | 2 | ||||
-rw-r--r-- | include/linux/elevator.h | 6 |
12 files changed, 905 insertions, 434 deletions
diff --git a/Documentation/cgroups/blkio-controller.txt b/Documentation/cgroups/blkio-controller.txt index da272c8f44e..cd556b91478 100644 --- a/Documentation/cgroups/blkio-controller.txt +++ b/Documentation/cgroups/blkio-controller.txt @@ -94,11 +94,13 @@ Throttling/Upper Limit policy Hierarchical Cgroups ==================== -- Currently only CFQ supports hierarchical groups. For throttling, - cgroup interface does allow creation of hierarchical cgroups and - internally it treats them as flat hierarchy. - If somebody created a hierarchy like as follows. +Both CFQ and throttling implement hierarchy support; however, +throttling's hierarchy support is enabled iff "sane_behavior" is +enabled from cgroup side, which currently is a development option and +not publicly available. + +If somebody created a hierarchy like as follows. root / \ @@ -106,21 +108,20 @@ Hierarchical Cgroups | test3 - CFQ will handle the hierarchy correctly but and throttling will - practically treat all groups at same level. For details on CFQ - hierarchy support, refer to Documentation/block/cfq-iosched.txt. - Throttling will treat the hierarchy as if it looks like the - following. +CFQ by default and throttling with "sane_behavior" will handle the +hierarchy correctly. For details on CFQ hierarchy support, refer to +Documentation/block/cfq-iosched.txt. For throttling, all limits apply +to the whole subtree while all statistics are local to the IOs +directly generated by tasks in that cgroup. + +Throttling without "sane_behavior" enabled from cgroup side will +practically treat all groups at same level as if it looks like the +following. pivot / / \ \ root test1 test2 test3 - Nesting cgroups, while allowed, isn't officially supported and blkio - genereates warning when cgroups nest. Once throttling implements - hierarchy support, hierarchy will be supported and the warning will - be removed. - Various user visible config options =================================== CONFIG_BLK_CGROUP diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c index e8918ffaf96..290792a13e3 100644 --- a/block/blk-cgroup.c +++ b/block/blk-cgroup.c @@ -32,26 +32,6 @@ EXPORT_SYMBOL_GPL(blkcg_root); static struct blkcg_policy *blkcg_policy[BLKCG_MAX_POLS]; -static struct blkcg_gq *__blkg_lookup(struct blkcg *blkcg, - struct request_queue *q, bool update_hint); - -/** - * blkg_for_each_descendant_pre - pre-order walk of a blkg's descendants - * @d_blkg: loop cursor pointing to the current descendant - * @pos_cgrp: used for iteration - * @p_blkg: target blkg to walk descendants of - * - * Walk @c_blkg through the descendants of @p_blkg. Must be used with RCU - * read locked. If called under either blkcg or queue lock, the iteration - * is guaranteed to include all and only online blkgs. The caller may - * update @pos_cgrp by calling cgroup_rightmost_descendant() to skip - * subtree. - */ -#define blkg_for_each_descendant_pre(d_blkg, pos_cgrp, p_blkg) \ - cgroup_for_each_descendant_pre((pos_cgrp), (p_blkg)->blkcg->css.cgroup) \ - if (((d_blkg) = __blkg_lookup(cgroup_to_blkcg(pos_cgrp), \ - (p_blkg)->q, false))) - static bool blkcg_policy_enabled(struct request_queue *q, const struct blkcg_policy *pol) { @@ -71,18 +51,8 @@ static void blkg_free(struct blkcg_gq *blkg) if (!blkg) return; - for (i = 0; i < BLKCG_MAX_POLS; i++) { - struct blkcg_policy *pol = blkcg_policy[i]; - struct blkg_policy_data *pd = blkg->pd[i]; - - if (!pd) - continue; - - if (pol && pol->pd_exit_fn) - pol->pd_exit_fn(blkg); - - kfree(pd); - } + for (i = 0; i < BLKCG_MAX_POLS; i++) + kfree(blkg->pd[i]); blk_exit_rl(&blkg->rl); kfree(blkg); @@ -134,10 +104,6 @@ static struct blkcg_gq *blkg_alloc(struct blkcg *blkcg, struct request_queue *q, blkg->pd[i] = pd; pd->blkg = blkg; pd->plid = i; - - /* invoke per-policy init */ - if (pol->pd_init_fn) - pol->pd_init_fn(blkg); } return blkg; @@ -158,8 +124,8 @@ err_free: * @q's bypass state. If @update_hint is %true, the caller should be * holding @q->queue_lock and lookup hint is updated on success. */ -static struct blkcg_gq *__blkg_lookup(struct blkcg *blkcg, - struct request_queue *q, bool update_hint) +struct blkcg_gq *__blkg_lookup(struct blkcg *blkcg, struct request_queue *q, + bool update_hint) { struct blkcg_gq *blkg; @@ -234,16 +200,25 @@ static struct blkcg_gq *blkg_create(struct blkcg *blkcg, } blkg = new_blkg; - /* link parent and insert */ + /* link parent */ if (blkcg_parent(blkcg)) { blkg->parent = __blkg_lookup(blkcg_parent(blkcg), q, false); if (WARN_ON_ONCE(!blkg->parent)) { - blkg = ERR_PTR(-EINVAL); + ret = -EINVAL; goto err_put_css; } blkg_get(blkg->parent); } + /* invoke per-policy init */ + for (i = 0; i < BLKCG_MAX_POLS; i++) { + struct blkcg_policy *pol = blkcg_policy[i]; + + if (blkg->pd[i] && pol->pd_init_fn) + pol->pd_init_fn(blkg); + } + + /* insert */ spin_lock(&blkcg->lock); ret = radix_tree_insert(&blkcg->blkg_tree, q->id, blkg); if (likely(!ret)) { @@ -394,30 +369,38 @@ static void blkg_destroy_all(struct request_queue *q) q->root_rl.blkg = NULL; } -static void blkg_rcu_free(struct rcu_head *rcu_head) +/* + * A group is RCU protected, but having an rcu lock does not mean that one + * can access all the fields of blkg and assume these are valid. For + * example, don't try to follow throtl_data and request queue links. + * + * Having a reference to blkg under an rcu allows accesses to only values + * local to groups like group stats and group rate limits. + */ +void __blkg_release_rcu(struct rcu_head *rcu_head) { - blkg_free(container_of(rcu_head, struct blkcg_gq, rcu_head)); -} + struct blkcg_gq *blkg = container_of(rcu_head, struct blkcg_gq, rcu_head); + int i; + + /* tell policies that this one is being freed */ + for (i = 0; i < BLKCG_MAX_POLS; i++) { + struct blkcg_policy *pol = blkcg_policy[i]; + + if (blkg->pd[i] && pol->pd_exit_fn) + pol->pd_exit_fn(blkg); + } -void __blkg_release(struct blkcg_gq *blkg) -{ /* release the blkcg and parent blkg refs this blkg has been holding */ css_put(&blkg->blkcg->css); - if (blkg->parent) + if (blkg->parent) { + spin_lock_irq(blkg->q->queue_lock); blkg_put(blkg->parent); + spin_unlock_irq(blkg->q->queue_lock); + } - /* - * A group is freed in rcu manner. But having an rcu lock does not - * mean that one can access all the fields of blkg and assume these - * are valid. For example, don't try to follow throtl_data and - * request queue links. - * - * Having a reference to blkg under an rcu allows acess to only - * values local to groups like group stats and group rate limits - */ - call_rcu(&blkg->rcu_head, blkg_rcu_free); + blkg_free(blkg); } -EXPORT_SYMBOL_GPL(__blkg_release); +EXPORT_SYMBOL_GPL(__blkg_release_rcu); /* * The next function used by blk_queue_for_each_rl(). It's a bit tricky @@ -928,14 +911,6 @@ struct cgroup_subsys blkio_subsys = { .subsys_id = blkio_subsys_id, .base_cftypes = blkcg_files, .module = THIS_MODULE, - - /* - * blkio subsystem is utterly broken in terms of hierarchy support. - * It treats all cgroups equally regardless of where they're - * located in the hierarchy - all cgroups are treated as if they're - * right below the root. Fix it and remove the following. - */ - .broken_hierarchy = true, }; EXPORT_SYMBOL_GPL(blkio_subsys); diff --git a/block/blk-cgroup.h b/block/blk-cgroup.h index 4e595ee8c91..8056c03a338 100644 --- a/block/blk-cgroup.h +++ b/block/blk-cgroup.h @@ -266,7 +266,7 @@ static inline void blkg_get(struct blkcg_gq *blkg) blkg->refcnt++; } -void __blkg_release(struct blkcg_gq *blkg); +void __blkg_release_rcu(struct rcu_head *rcu); /** * blkg_put - put a blkg reference @@ -279,9 +279,43 @@ static inline void blkg_put(struct blkcg_gq *blkg) lockdep_assert_held(blkg->q->queue_lock); WARN_ON_ONCE(blkg->refcnt <= 0); if (!--blkg->refcnt) - __blkg_release(blkg); + call_rcu(&blkg->rcu_head, __blkg_release_rcu); } +struct blkcg_gq *__blkg_lookup(struct blkcg *blkcg, struct request_queue *q, + bool update_hint); + +/** + * blkg_for_each_descendant_pre - pre-order walk of a blkg's descendants + * @d_blkg: loop cursor pointing to the current descendant + * @pos_cgrp: used for iteration + * @p_blkg: target blkg to walk descendants of + * + * Walk @c_blkg through the descendants of @p_blkg. Must be used with RCU + * read locked. If called under either blkcg or queue lock, the iteration + * is guaranteed to include all and only online blkgs. The caller may + * update @pos_cgrp by calling cgroup_rightmost_descendant() to skip + * subtree. + */ +#define blkg_for_each_descendant_pre(d_blkg, pos_cgrp, p_blkg) \ + cgroup_for_each_descendant_pre((pos_cgrp), (p_blkg)->blkcg->css.cgroup) \ + if (((d_blkg) = __blkg_lookup(cgroup_to_blkcg(pos_cgrp), \ + (p_blkg)->q, false))) + +/** + * blkg_for_each_descendant_post - post-order walk of a blkg's descendants + * @d_blkg: loop cursor pointing to the current descendant + * @pos_cgrp: used for iteration + * @p_blkg: target blkg to walk descendants of + * + * Similar to blkg_for_each_descendant_pre() but performs post-order + * traversal instead. Synchronization rules are the same. + */ +#define blkg_for_each_descendant_post(d_blkg, pos_cgrp, p_blkg) \ + cgroup_for_each_descendant_post((pos_cgrp), (p_blkg)->blkcg->css.cgroup) \ + if (((d_blkg) = __blkg_lookup(cgroup_to_blkcg(pos_cgrp), \ + (p_blkg)->q, false))) + /** * blk_get_rl - get request_list to use * @q: request_queue of interest diff --git a/block/blk-tag.c b/block/blk-tag.c index cc345e1d8d4..3f33d867226 100644 --- a/block/blk-tag.c +++ b/block/blk-tag.c @@ -348,9 +348,16 @@ int blk_queue_start_tag(struct request_queue *q, struct request *rq) */ max_depth = bqt->max_depth; if (!rq_is_sync(rq) && max_depth > 1) { - max_depth -= 2; - if (!max_depth) + switch (max_depth) { + case 2: max_depth = 1; + break; + case 3: + max_depth = 2; + break; + default: + max_depth -= 2; + } if (q->in_flight[BLK_RW_ASYNC] > max_depth) return 1; } diff --git a/block/blk-throttle.c b/block/blk-throttle.c index 31146225f3d..08a32dfd384 100644 --- a/block/blk-throttle.c +++ b/block/blk-throttle.c @@ -25,18 +25,61 @@ static struct blkcg_policy blkcg_policy_throtl; /* A workqueue to queue throttle related work */ static struct workqueue_struct *kthrotld_workqueue; -static void throtl_schedule_delayed_work(struct throtl_data *td, - unsigned long delay); - -struct throtl_rb_root { - struct rb_root rb; - struct rb_node *left; - unsigned int count; - unsigned long min_disptime; + +/* + * To implement hierarchical throttling, throtl_grps form a tree and bios + * are dispatched upwards level by level until they reach the top and get + * issued. When dispatching bios from the children and local group at each + * level, if the bios are dispatched into a single bio_list, there's a risk + * of a local or child group which can queue many bios at once filling up + * the list starving others. + * + * To avoid such starvation, dispatched bios are queued separately + * according to where they came from. When they are again dispatched to + * the parent, they're popped in round-robin order so that no single source + * hogs the dispatch window. + * + * throtl_qnode is used to keep the queued bios separated by their sources. + * Bios are queued to throtl_qnode which in turn is queued to + * throtl_service_queue and then dispatched in round-robin order. + * + * It's also used to track the reference counts on blkg's. A qnode always + * belongs to a throtl_grp and gets queued on itself or the parent, so + * incrementing the reference of the associated throtl_grp when a qnode is + * queued and decrementing when dequeued is enough to keep the whole blkg + * tree pinned while bios are in flight. + */ +struct throtl_qnode { + struct list_head node; /* service_queue->queued[] */ + struct bio_list bios; /* queued bios */ + struct throtl_grp *tg; /* tg this qnode belongs to */ }; -#define THROTL_RB_ROOT (struct throtl_rb_root) { .rb = RB_ROOT, .left = NULL, \ - .count = 0, .min_disptime = 0} +struct throtl_service_queue { + struct throtl_service_queue *parent_sq; /* the parent service_queue */ + + /* + * Bios queued directly to this service_queue or dispatched from + * children throtl_grp's. + */ + struct list_head queued[2]; /* throtl_qnode [READ/WRITE] */ + unsigned int nr_queued[2]; /* number of queued bios */ + + /* + * RB tree of active children throtl_grp's, which are sorted by + * their ->disptime. + */ + struct rb_root pending_tree; /* RB tree of active tgs */ + struct rb_node *first_pending; /* first node in the tree */ + unsigned int nr_pending; /* # queued in the tree */ + unsigned long first_pending_disptime; /* disptime of the first tg */ + struct timer_list pending_timer; /* fires on first_pending_disptime */ +}; + +enum tg_state_flags { + THROTL_TG_PENDING = 1 << 0, /* on parent's pending tree */ + THROTL_TG_WAS_EMPTY = 1 << 1, /* bio_lists[] became non-empty */ +}; #define rb_entry_tg(node) rb_entry((node), struct throtl_grp, rb_node) @@ -52,9 +95,26 @@ struct throtl_grp { /* must be the first member */ struct blkg_policy_data pd; - /* active throtl group service_tree member */ + /* active throtl group service_queue member */ struct rb_node rb_node; + /* throtl_data this group belongs to */ + struct throtl_data *td; + + /* this group's service queue */ + struct throtl_service_queue service_queue; + + /* + * qnode_on_self is used when bios are directly queued to this + * throtl_grp so that local bios compete fairly with bios + * dispatched from children. qnode_on_parent is used when bios are + * dispatched from this throtl_grp into its parent and will compete + * with the sibling qnode_on_parents and the parent's + * qnode_on_self. + */ + struct throtl_qnode qnode_on_self[2]; + struct throtl_qnode qnode_on_parent[2]; + /* * Dispatch time in jiffies. This is the estimated time when group * will unthrottle and is ready to dispatch more bio. It is used as @@ -64,11 +124,8 @@ struct throtl_grp { unsigned int flags; - /* Two lists for READ and WRITE */ - struct bio_list bio_lists[2]; - - /* Number of queued bios on READ and WRITE lists */ - unsigned int nr_queued[2]; + /* are there any throtl rules between this group and td? */ + bool has_rules[2]; /* bytes per second rate limits */ uint64_t bps[2]; @@ -85,9 +142,6 @@ struct throtl_grp { unsigned long slice_start[2]; unsigned long slice_end[2]; - /* Some throttle limits got updated for the group */ - int limits_changed; - /* Per cpu stats pointer */ struct tg_stats_cpu __percpu *stats_cpu; @@ -98,7 +152,7 @@ struct throtl_grp { struct throtl_data { /* service tree for active throtl groups */ - struct throtl_rb_root tg_service_tree; + struct throtl_service_queue service_queue; struct request_queue *queue; @@ -111,9 +165,7 @@ struct throtl_data unsigned int nr_undestroyed_grps; /* Work for dispatching throttled bios */ - struct delayed_work throtl_work; - - int limits_changed; + struct work_struct dispatch_work; }; /* list and work item to allocate percpu group stats */ @@ -123,6 +175,8 @@ static LIST_HEAD(tg_stats_alloc_list); static void tg_stats_alloc_fn(struct work_struct *); static DECLARE_DELAYED_WORK(tg_stats_alloc_work, tg_stats_alloc_fn); +static void throtl_pending_timer_fn(unsigned long arg); + static inline struct throtl_grp *pd_to_tg(struct blkg_policy_data *pd) { return pd ? container_of(pd, struct throtl_grp, pd) : NULL; @@ -143,41 +197,65 @@ static inline struct throtl_grp *td_root_tg(struct throtl_data *td) return blkg_to_tg(td->queue->root_blkg); } -enum tg_state_flags { - THROTL_TG_FLAG_on_rr = 0, /* on round-robin busy list */ -}; - -#define THROTL_TG_FNS(name) \ -static inline void throtl_mark_tg_##name(struct throtl_grp *tg) \ -{ \ - (tg)->flags |= (1 << THROTL_TG_FLAG_##name); \ -} \ -static inline void throtl_clear_tg_##name(struct throtl_grp *tg) \ -{ \ - (tg)->flags &= ~(1 << THROTL_TG_FLAG_##name); \ -} \ -static inline int throtl_tg_##name(const struct throtl_grp *tg) \ -{ \ - return ((tg)->flags & (1 << THROTL_TG_FLAG_##name)) != 0; \ +/** + * sq_to_tg - return the throl_grp the specified service queue belongs to + * @sq: the throtl_service_queue of interest + * + * Return the throtl_grp @sq belongs to. If @sq is the top-level one + * embedded in throtl_data, %NULL is returned. + */ +static struct throtl_grp *sq_to_tg(struct throtl_service_queue *sq) +{ + if (sq && sq->parent_sq) + return container_of(sq, struct throtl_grp, service_queue); + else + return NULL; } -THROTL_TG_FNS(on_rr); +/** + * sq_to_td - return throtl_data the specified service queue belongs to + * @sq: the throtl_service_queue of interest + * + * A service_queue can be embeded in either a throtl_grp or throtl_data. + * Determine the associated throtl_data accordingly and return it. + */ +static struct throtl_data *sq_to_td(struct throtl_service_queue *sq) +{ + struct throtl_grp *tg = sq_to_tg(sq); -#define throtl_log_tg(td, tg, fmt, args...) do { \ - char __pbuf[128]; \ + if (tg) + return tg->td; + else + return container_of(sq, struct throtl_data, service_queue); +} + +/** + * throtl_log - log debug message via blktrace + * @sq: the service_queue being reported + * @fmt: printf format string + * @args: printf args + * + * The messages are prefixed with "throtl BLKG_NAME" if @sq belongs to a + * throtl_grp; otherwise, just "throtl". + * + * TODO: this should be made a function and name formatting should happen + * after testing whether blktrace is enabled. + */ +#define throtl_log(sq, fmt, args...) do { \ + struct throtl_grp *__tg = sq_to_tg((sq)); \ + struct throtl_data *__td = sq_to_td((sq)); \ + \ + (void)__td; \ + if ((__tg)) { \ + char __pbuf[128]; \ \ - blkg_path(tg_to_blkg(tg), __pbuf, sizeof(__pbuf)); \ - blk_add_trace_msg((td)->queue, "throtl %s " fmt, __pbuf, ##args); \ + blkg_path(tg_to_blkg(__tg), __pbuf, sizeof(__pbuf)); \ + blk_add_trace_msg(__td->queue, "throtl %s " fmt, __pbuf, ##args); \ + } else { \ + blk_add_trace_msg(__td->queue, "throtl " fmt, ##args); \ + } \ } while (0) -#define throtl_log(td, fmt, args...) \ - blk_add_trace_msg((td)->queue, "throtl " fmt, ##args) - -static inline unsigned int total_nr_queued(struct throtl_data *td) -{ - return td->nr_queued[0] + td->nr_queued[1]; -} - /* * Worker for allocating per cpu stat for tgs. This is scheduled on the * system_wq once there are some groups on the alloc_list waiting for @@ -215,15 +293,141 @@ alloc_stats: goto alloc_stats; } +static void throtl_qnode_init(struct throtl_qnode *qn, struct throtl_grp *tg) +{ + INIT_LIST_HEAD(&qn->node); + bio_list_init(&qn->bios); + qn->tg = tg; +} + +/** + * throtl_qnode_add_bio - add a bio to a throtl_qnode and activate it + * @bio: bio being added + * @qn: qnode to add bio to + * @queued: the service_queue->queued[] list @qn belongs to + * + * Add @bio to @qn and put @qn on @queued if it's not already on. + * @qn->tg's reference count is bumped when @qn is activated. See the + * comment on top of throtl_qnode definition for details. + */ +static void throtl_qnode_add_bio(struct bio *bio, struct throtl_qnode *qn, + struct list_head *queued) +{ + bio_list_add(&qn->bios, bio); + if (list_empty(&qn->node)) { + list_add_tail(&qn->node, queued); + blkg_get(tg_to_blkg(qn->tg)); + } +} + +/** + * throtl_peek_queued - peek the first bio on a qnode list + * @queued: the qnode list to peek + */ +static struct bio *throtl_peek_queued(struct list_head *queued) +{ + struct throtl_qnode *qn = list_first_entry(queued, struct throtl_qnode, node); + struct bio *bio; + + if (list_empty(queued)) + return NULL; + + bio = bio_list_peek(&qn->bios); + WARN_ON_ONCE(!bio); + return bio; +} + +/** + * throtl_pop_queued - pop the first bio form a qnode list + * @queued: the qnode list to pop a bio from + * @tg_to_put: optional out argument for throtl_grp to put + * + * Pop the first bio from the qnode list @queued. After popping, the first + * qnode is removed from @queued if empty or moved to the end of @queued so + * that the popping order is round-robin. + * + * When the first qnode is removed, its associated throtl_grp should be put + * too. If @tg_to_put is NULL, this function automatically puts it; + * otherwise, *@tg_to_put is set to the throtl_grp to put and the caller is + * responsible for putting it. + */ +static struct bio *throtl_pop_queued(struct list_head *queued, + struct throtl_grp **tg_to_put) +{ + struct throtl_qnode *qn = list_first_entry(queued, struct throtl_qnode, node); + struct bio *bio; + + if (list_empty(queued)) + return NULL; + + bio = bio_list_pop(&qn->bios); + WARN_ON_ONCE(!bio); + + if (bio_list_empty(&qn->bios)) { + list_del_init(&qn->node); + if (tg_to_put) + *tg_to_put = qn->tg; + else + blkg_put(tg_to_blkg(qn->tg)); + } else { + list_move_tail(&qn->node, queued); + } + + return bio; +} + +/* init a service_queue, assumes the caller zeroed it */ +static void throtl_service_queue_init(struct throtl_service_queue *sq, + struct throtl_service_queue *parent_sq) +{ + INIT_LIST_HEAD(&sq->queued[0]); + INIT_LIST_HEAD(&sq->queued[1]); + sq->pending_tree = RB_ROOT; + sq->parent_sq = parent_sq; + setup_timer(&sq->pending_timer, throtl_pending_timer_fn, + (unsigned long)sq); +} + +static void throtl_service_queue_exit(struct throtl_service_queue *sq) +{ + del_timer_sync(&sq->pending_timer); +} + static void throtl_pd_init(struct blkcg_gq *blkg) { struct throtl_grp *tg = blkg_to_tg(blkg); + struct throtl_data *td = blkg->q->td; + struct throtl_service_queue *parent_sq; unsigned long flags; + int rw; + + /* + * If sane_hierarchy is enabled, we switch to properly hierarchical + * behavior where limits on a given throtl_grp are applied to the + * whole subtree rather than just the group itself. e.g. If 16M + * read_bps limit is set on the root group, the whole system can't + * exceed 16M for the device. + * + * If sane_hierarchy is not enabled, the broken flat hierarchy + * behavior is retained where all throtl_grps are treated as if + * they're all separate root groups right below throtl_data. + * Limits of a group don't interact with limits of other groups + * regardless of the position of the group in the hierarchy. + */ + parent_sq = &td->service_queue; + + if (cgroup_sane_behavior(blkg->blkcg->css.cgroup) && blkg->parent) + parent_sq = &blkg_to_tg(blkg->parent)->service_queue; + + throtl_service_queue_init(&tg->service_queue, parent_sq); + + for (rw = READ; rw <= WRITE; rw++) { + throtl_qnode_init(&tg->qnode_on_self[rw], tg); + throtl_qnode_init(&tg->qnode_on_parent[rw], tg); + } RB_CLEAR_NODE(&tg->rb_node); - bio_list_init(&tg->bio_lists[0]); - bio_list_init(&tg->bio_lists[1]); - tg->limits_changed = false; + tg->td = td; tg->bps[READ] = -1; tg->bps[WRITE] = -1; @@ -241,6 +445,30 @@ static void throtl_pd_init(struct blkcg_gq *blkg) spin_unlock_irqrestore(&tg_stats_alloc_lock, flags); } +/* + * Set has_rules[] if @tg or any of its parents have limits configured. + * This doesn't require walking up to the top of the hierarchy as the + * parent's has_rules[] is guaranteed to be correct. + */ +static void tg_update_has_rules(struct throtl_grp *tg) +{ + struct throtl_grp *parent_tg = sq_to_tg(tg->service_queue.parent_sq); + int rw; + + for (rw = READ; rw <= WRITE; rw++) + tg->has_rules[rw] = (parent_tg && parent_tg->has_rules[rw]) || + (tg->bps[rw] != -1 || tg->iops[rw] != -1); +} + +static void throtl_pd_online(struct blkcg_gq *blkg) +{ + /* + * We don't want new groups to escape the limits of its ancestors. + * Update has_rules[] after a new group is brought online. + */ + tg_update_has_rules(blkg_to_tg(blkg)); +} + static void throtl_pd_exit(struct blkcg_gq *blkg) { struct throtl_grp *tg = blkg_to_tg(blkg); @@ -251,6 +479,8 @@ static void throtl_pd_exit(struct blkcg_gq *blkg) spin_unlock_irqrestore(&tg_stats_alloc_lock, flags); free_percpu(tg->stats_cpu); + + throtl_service_queue_exit(&tg->service_queue); } static void throtl_pd_reset_stats(struct blkcg_gq *blkg) @@ -309,17 +539,18 @@ static struct throtl_grp *throtl_lookup_create_tg(struct throtl_data *td, return tg; } -static struct throtl_grp *throtl_rb_first(struct throtl_rb_root *root) +static struct throtl_grp * +throtl_rb_first(struct throtl_service_queue *parent_sq) { /* Service tree is empty */ - if (!root->count) + if (!parent_sq->nr_pending) return NULL; - if (!root->left) - root->left = rb_first(&root->rb); + if (!parent_sq->first_pending) + parent_sq->first_pending = rb_first(&parent_sq->pending_tree); - if (root->left) - return rb_entry_tg(root->left); + if (parent_sq->first_pending) + return rb_entry_tg(parent_sq->first_pending); return NULL; } @@ -330,29 +561,30 @@ static void rb_erase_init(struct rb_node *n, struct rb_root *root) RB_CLEAR_NODE(n); } -static void throtl_rb_erase(struct rb_node *n, struct throtl_rb_root *root) +static void throtl_rb_erase(struct rb_node *n, + struct throtl_service_queue *parent_sq) { - if (root->left == n) - root->left = NULL; - rb_erase_init(n, &root->rb); - --root->count; + if (parent_sq->first_pending == n) + parent_sq->first_pending = NULL; + rb_erase_init(n, &parent_sq->pending_tree); + --parent_sq->nr_pending; } -static void update_min_dispatch_time(struct throtl_rb_root *st) +static void update_min_dispatch_time(struct throtl_service_queue *parent_sq) { struct throtl_grp *tg; - tg = throtl_rb_first(st); + tg = throtl_rb_first(parent_sq); if (!tg) return; - st->min_disptime = tg->disptime; + parent_sq->first_pending_disptime = tg->disptime; } -static void -tg_service_tree_add(struct throtl_rb_root *st, struct throtl_grp *tg) +static void tg_service_queue_add(struct throtl_grp *tg) { - struct rb_node **node = &st->rb.rb_node; + struct throtl_service_queue *parent_sq = tg->service_queue.parent_sq; + struct rb_node **node = &parent_sq->pending_tree.rb_node; struct rb_node *parent = NULL; struct throtl_grp *__tg; unsigned long key = tg->disptime; @@ -371,89 +603,135 @@ tg_service_tree_add(struct throtl_rb_root *st, struct throtl_grp *tg) } if (left) - st->left = &tg->rb_node; + parent_sq->first_pending = &tg->rb_node; rb_link_node(&tg->rb_node, parent, node); - rb_insert_color(&tg->rb_node, &st->rb); + rb_insert_color(&tg->rb_node, &parent_sq->pending_tree); } -static void __throtl_enqueue_tg(struct throtl_data *td, struct throtl_grp *tg) +static void __throtl_enqueue_tg(struct throtl_grp *tg) { - struct throtl_rb_root *st = &td->tg_service_tree; + tg_service_queue_add(tg); + tg->flags |= THROTL_TG_PENDING; + tg->service_queue.parent_sq->nr_pending++; +} - tg_service_tree_add(st, tg); - throtl_mark_tg_on_rr(tg); - st->count++; +static void throtl_enqueue_tg(struct throtl_grp *tg) +{ + if (!(tg->flags & THROTL_TG_PENDING)) + __throtl_enqueue_tg(tg); } -static void throtl_enqueue_tg(struct throtl_data *td, struct throtl_grp *tg) +static void __throtl_dequeue_tg(struct throtl_grp *tg) { - if (!throtl_tg_on_rr(tg)) - __throtl_enqueue_tg(td, tg); + throtl_rb_erase(&tg->rb_node, tg->service_queue.parent_sq); + tg->flags &= ~THROTL_TG_PENDING; } -static void __throtl_dequeue_tg(struct throtl_data *td, struct throtl_grp *tg) +static void throtl_dequeue_tg(struct throtl_grp *tg) { - throtl_rb_erase(&tg->rb_node, &td->tg_service_tree); - throtl_clear_tg_on_rr(tg); + if (tg->flags & THROTL_TG_PENDING) + __throtl_dequeue_tg(tg); } -static void throtl_dequeue_tg(struct throtl_data *td, struct throtl_grp *tg) +/* Call with queue lock held */ +static void throtl_schedule_pending_timer(struct throtl_service_queue *sq, + unsigned long expires) { - if (throtl_tg_on_rr(tg)) - __throtl_dequeue_tg(td, tg); + mod_timer(&sq->pending_timer, expires); + throtl_log(sq, "schedule timer. delay=%lu jiffies=%lu", + expires - jiffies, jiffies); } -static void throtl_schedule_next_dispatch(struct throtl_data *td) +/** + * throtl_schedule_next_dispatch - schedule the next dispatch cycle + * @sq: the service_queue to schedule dispatch for + * @force: force scheduling + * + * Arm @sq->pending_timer so that the next dispatch cycle starts on the + * dispatch time of the first pending child. Returns %true if either timer + * is armed or there's no pending child left. %false if the current + * dispatch window is still open and the caller should continue + * dispatching. + * + * If @force is %true, the dispatch timer is always scheduled and this + * function is guaranteed to return %true. This is to be used when the + * caller can't dispatch itself and needs to invoke pending_timer + * unconditionally. Note that forced scheduling is likely to induce short + * delay before dispatch starts even if @sq->first_pending_disptime is not + * in the future and thus shouldn't be used in hot paths. + */ +static bool throtl_schedule_next_dispatch(struct throtl_service_queue *sq, + bool force) { - struct throtl_rb_root *st = &td->tg_service_tree; + /* any pending children left? */ + if (!sq->nr_pending) + return true; - /* - * If there are more bios pending, schedule more work. - */ - if (!total_nr_queued(td)) - return; + update_min_dispatch_time(sq); - BUG_ON(!st->count); + /* is the next dispatch time in the future? */ + if (force || time_after(sq->first_pending_disptime, jiffies)) { + throtl_schedule_pending_timer(sq, sq->first_pending_disptime); + return true; + } - update_min_dispatch_time(st); + /* tell the caller to continue dispatching */ + return false; +} - if (time_before_eq(st->min_disptime, jiffies)) - throtl_schedule_delayed_work(td, 0); - else - throtl_schedule_delayed_work(td, (st->min_disptime - jiffies)); +static inline void throtl_start_new_slice_with_credit(struct throtl_grp *tg, + bool rw, unsigned long start) +{ + tg->bytes_disp[rw] = 0; + tg->io_disp[rw] = 0; + + /* + * Previous slice has expired. We must have trimmed it after last + * bio dispatch. Tha |