aboutsummaryrefslogtreecommitdiff
path: root/block
AgeCommit message (Collapse)Author
2012-04-01blkcg: cfq doesn't need per-cpu dispatch statsTejun Heo
blkio_group_stats_cpu is used to count dispatch stats using per-cpu counters. This is used by both blk-throtl and cfq-iosched but the sharing is rather silly. * cfq-iosched doesn't need per-cpu dispatch stats. cfq always updates those stats while holding queue_lock. * blk-throtl needs per-cpu dispatch stats but only service_bytes and serviced. It doesn't make use of sectors. This patch makes cfq add and use global stats for service_bytes, serviced and sectors, removes per-cpu sectors counter and moves per-cpu stat printing code to blk-throttle.c. Signed-off-by: Tejun Heo <tj@kernel.org>
2012-04-01blkcg: move statistics update code to policiesTejun Heo
As with conf/stats file handling code, there's no reason for stat update code to live in blkcg core with policies calling into update them. The current organization is both inflexible and complex. This patch moves stat update code to specific policies. All blkiocg_update_*_stats() functions which deal with BLKIO_POLICY_PROP stats are collapsed into their cfq_blkiocg_update_*_stats() counterparts. blkiocg_update_dispatch_stats() is used by both policies and duplicated as throtl_update_dispatch_stats() and cfq_blkiocg_update_dispatch_stats(). This will be cleaned up later. Signed-off-by: Tejun Heo <tj@kernel.org>
2012-04-01cfq: collapse cfq.h into cfq-iosched.cTejun Heo
block/cfq.h contains some functions which interact with blkcg; however, this is only part of it and cfq-iosched.c already has quite some #ifdef CONFIG_CFQ_GROUP_IOSCHED. With conf/stat handling being moved to specific policies, having these relay functions isolated in cfq.h doesn't make much sense. Collapse cfq.h into cfq-iosched.c for now. Let's split blkcg support properly later if necessary. Signed-off-by: Tejun Heo <tj@kernel.org>
2012-04-01blkcg: move conf/stat file handling code to policiesTejun Heo
blkcg conf/stat handling is convoluted in that details which belong to specific policy implementations are all out in blkcg core and then policies hook into core layer to access and manipulate confs and stats. This sadly achieves both inflexibility (confs/stats can't be modified without messing with blkcg core) and complexity (all the call-ins and call-backs). The previous patches restructured conf and stat handling code such that they can be separated out. This patch relocates the file handling part. All conf/stat file handling code which belongs to BLKIO_POLICY_PROP is moved to cfq-iosched.c and all BKLIO_POLICY_THROTL code to blk-throtl.c. The move is verbatim except for blkio_update_group_{weight|bps|iops}() callbacks which relays conf changes to policies. The configuration settings are handled in policies themselves so the relaying isn't necessary. Conf setting functions are modified to directly call per-policy update functions and the relaying mechanism is dropped. Signed-off-by: Tejun Heo <tj@kernel.org>
2012-04-01blkcg: implement blkio_policy_type->cftypesTejun Heo
Add blkiop->cftypes which is added and removed together with the policy. This will be used to move conf/stat handling to the policies. Signed-off-by: Tejun Heo <tj@kernel.org>
2012-04-01blkcg: export conf/stat helpers to prepare for reorganizationTejun Heo
conf/stat handling is about to be moved to policy implementation from blkcg core. Export conf/stat helpers from blkcg core so that blk-throttle and cfq-iosched can use them. Signed-off-by: Tejun Heo <tj@kernel.org>
2012-04-01blkcg: simplify blkg_conf_prep()Tejun Heo
blkg_conf_prep() implements "MAJ:MIN VAL" parsing manually, which is unnecessary. Just use sscanf("%u:%u %llu"). This might not reject some malformed input (extra input at the end) but we don't care. Signed-off-by: Tejun Heo <tj@kernel.org>
2012-04-01blkcg: restructure blkio_group configruation settingTejun Heo
As part of userland interface restructuring, this patch updates per-blkio_group configuration setting. Instead of funneling everything through a master function which has hard-coded cases for each config file it may handle, the common part is factored into blkg_conf_prep() and blkg_conf_finish() and different configuration setters are implemented using the helpers. While this doesn't result in immediate LOC reduction, this enables further cleanups and more modular implementation. Signed-off-by: Tejun Heo <tj@kernel.org>
2012-04-01blkcg: restructure configuration printingTejun Heo
Similarly to the previous stat restructuring, this patch restructures conf printing code such that, * Conf printing uses the same helpers as stat. * Printing function doesn't require hardcoded switching on the config being printed. Note that this isn't complete yet for throttle confs. The next patch will convert setting for these confs and will complete the transition. * Printing uses read_seq_string callback (other methods will be phased out). Note that blkio_group_conf.iops[2] is changed to u64 so that they can be manipulated with the same functions. This is transitional and will go away later. After this patch, per-device configurations - weight, bps and iops - use __blkg_prfill_u64() for printing which uses white space as delimiter instead of tab. Signed-off-by: Tejun Heo <tj@kernel.org>
2012-04-01blkcg: drop blkiocg_file_write_u64()Tejun Heo
blkiocg_file_write_u64() has single switch case. Drop blkiocg_file_write_u64(), rename blkio_weight_write() to blkcg_set_weight() and use it directly for .write_u64 callback. Signed-off-by: Tejun Heo <tj@kernel.org>
2012-04-01blkcg: restructure statistics printingTejun Heo
blkcg stats handling is a mess. None of the stats has much to do with blkcg core but they are all implemented in blkcg core. Code sharing is achieved by mixing common code with hard-coded cases for each stat counter. This patch restructures statistics printing such that * Common logic exists as helper functions and specific print functions use the helpers to implement specific cases. * Printing functions serving multiple counters don't require hardcoded switching on specific counters. * Printing uses read_seq_string callback (other methods will be phased out). This change enables further cleanups and relocating stats code to the policy implementation it belongs to. Signed-off-by: Tejun Heo <tj@kernel.org>
2012-04-01blkcg: introduce blkg_stat and blkg_rwstatTejun Heo
blkcg uses u64_stats_sync to avoid reading wrong u64 statistic values on 32bit archs and some stat counters have subtypes to distinguish read/writes and sync/async IOs. The stat code paths are confusing and involve a lot of going back and forth between blkcg core and specific policy implementations, and synchronization and subtype handling are open coded in blkcg core. This patch introduces struct blkg_stat and blkg_rwstat which, with accompanying operations, encapsulate stat updating and accessing with proper synchronization. blkg_stat is simple u64 counter with 64bit read-access protection. blkg_rwstat is the one with rw and [a]sync subcounters and takes @rw flags to distinguish IO subtypes (%REQ_WRITE and %REQ_SYNC) and replaces stat_sub_type indexed arrays. All counters in blkio_group_stats and blkio_group_stats_cpu are replaced with either blkg_stat or blkg_rwstat along with all users. This does add one u64_stats_sync per counter and increase stats_sync operations but they're empty/noops on 64bit archs and blkcg doesn't have too many counters, especially with DEBUG_BLK_CGROUP off. While the currently resulting code isn't necessarily simpler at the moment, this will enable further clean up of blkcg stats code. - BLKIO_STAT_{READ|WRITE|SYNC|ASYNC|TOTAL} renamed to BLKG_RWSTAT_{READ|WRITE|SYNC|ASYNC|TOTAL}. - blkg_stat_add() replaces blkio_add_stat() and blkio_check_and_dec_stat(). Note that BUG_ON() on underflow in the latter function no longer exists. It's *way* better to have underflowed stat counters than oopsing. - blkio_group_stats->dequeue is now a proper u64 stat counter instead of ulong. - reset_stats() updated to clear each stat counters individually and BLKG_STATS_DEBUG_CLEAR_{START|SIZE} are removed. - Some functions reconstruct rw flags from direction and sync booleans. This will be removed by future patches. Signed-off-by: Tejun Heo <tj@kernel.org>
2012-04-01blkcg: BLKIO_STAT_CPU_SECTORS doesn't have subcountersTejun Heo
BLKIO_STAT_CPU_SECTORS doesn't need read/write/sync/async subcounters and is counted by blkio_group_stats_cpu->sectors; however, it still holds a member in blkio_group_stats_cpu->stat_arr_cpu. Rearrange stat_type_cpu and define BLKIO_STAT_CPU_ARR_NR and use it for stat_arr_cpu[] size so that only SERVICE_BYTES and SERVICED have subcounters. Signed-off-by: Tejun Heo <tj@kernel.org>
2012-04-01blkcg: remove unused @pol and @plid parametersTejun Heo
@pol to blkg_to_pdata() and @plid to blkg_lookup_create() are no longer necessary. Drop them. Signed-off-by: Tejun Heo <tj@kernel.org>
2012-04-01block: Make cfq_target_latency tunable through sysfs.Tao Ma
In cfq, when we calculate a time slice for a process(or a cfqq to be precise), we have to consider the cfq_target_latency so that all the sync request have an estimated latency(300ms) and it is controlled by cfq_target_latency. But in some hadoop test, we have found that if there are many processes doing sequential read(24 for example), the throughput is bad because every process can only work for about 25ms and the cfqq is switched. That leads to a higher disk seek. We can achive the good throughput by setting low_latency=0, but then some read's latency is too much for the application. So this patch makes cfq_target_latency tunable through sysfs so that we can tune it and find some magic number which is not bad for both the throughput and the read latency. Cc: Jens Axboe <axboe@kernel.dk> Signed-off-by: Tao Ma <boyu.mt@taobao.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-04-01Merge branch 'for-3.5' of ../cgroup into block/for-3.5/core-mergedTejun Heo
cgroup/for-3.5 contains the following changes which blk-cgroup needs to proceed with the on-going cleanup. * Dynamic addition and removal of cftypes to make config/stat file handling modular for policies. * cgroup removal update to not wait for css references to drain to fix blkcg removal hang caused by cfq caching cfqgs. Pull in cgroup/for-3.5 into block/for-3.5/core. This causes the following conflicts in block/blk-cgroup.c. * 761b3ef50e "cgroup: remove cgroup_subsys argument from callbacks" conflicts with blkiocg_pre_destroy() addition and blkiocg_attach() removal. Resolved by removing @subsys from all subsys methods. * 676f7c8f84 "cgroup: relocate cftype and cgroup_subsys definitions in controllers" conflicts with ->pre_destroy() and ->attach() updates and removal of modular config. Resolved by dropping forward declarations of the methods and applying updates to the relocated blkio_subsys. * 4baf6e3325 "cgroup: convert all non-memcg controllers to the new cftype interface" builds upon the previous item. Resolved by adding ->base_cftypes to the relocated blkio_subsys. Signed-off-by: Tejun Heo <tj@kernel.org>
2012-04-01cgroup: convert all non-memcg controllers to the new cftype interfaceTejun Heo
Convert debug, freezer, cpuset, cpu_cgroup, cpuacct, net_prio, blkio, net_cls and device controllers to use the new cftype based interface. Termination entry is added to cftype arrays and populate callbacks are replaced with cgroup_subsys->base_cftypes initializations. This is functionally identical transformation. There shouldn't be any visible behavior change. memcg is rather special and will be converted separately. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Paul Menage <paul@paulmenage.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Peter Zijlstra <peterz@infradead.org> Cc: "David S. Miller" <davem@davemloft.net> Cc: Vivek Goyal <vgoyal@redhat.com>
2012-04-01cgroup: relocate cftype and cgroup_subsys definitions in controllersTejun Heo
blk-cgroup, netprio_cgroup, cls_cgroup and tcp_memcontrol unnecessarily define cftype array and cgroup_subsys structures at the top of the file, which is unconventional and necessiates forward declaration of methods. This patch relocates those below the definitions of the methods and removes the forward declarations. Note that forward declaration of tcp_files[] is added in tcp_memcontrol.c for tcp_init_cgroup(). This will be removed soon by another patch. This patch doesn't introduce any functional change. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizf@cn.fujitsu.com>
2012-03-30block: use lockdep_assert_held for queue lockingAndi Kleen
Instead of an ugly open coded variant. Cc: axboe@kernel.dk Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-29blkcg: change a spin_lock() to spin_lock_irq()Dan Carpenter
Smatch complains that we re-enable IRQs twice. It looks like we forgot to disable them here on the spin_trylock() failure path. This was added in 9f13ef678e "blkcg: use double locking instead of RCU for blkg synchronization". Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>` Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-23cfq: fix cfqg ref handling when BLK_CGROUP && !CFQ_GROUP_IOSCHEDTejun Heo
When BLK_CGROUP is enabled but CFQ_GROUP_IOSCHED is, cfq ends up calling blkg_get/put() on dummy cfqg leading to the following crash. BUG: unable to handle kernel NULL pointer dereference at 00000000000000b0 IP: [<ffffffff813d44d8>] cfq_init_queue+0x258/0x430 PGD 0 Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC CPU 0 Modules linked in: Pid: 1, comm: swapper/0 Not tainted 3.3.0-rc6-work+ #125 Bochs Bochs RIP: 0010:[<ffffffff813d44d8>] [<ffffffff813d44d8>] cfq_init_queue+0x258/0x430 RSP: 0018:ffff88001f9dfd80 EFLAGS: 00010046 RAX: ffff88001aefbbf0 RBX: ffff88001aeedbf0 RCX: 0000000000000100 RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffff820ffd40 RBP: ffff88001f9dfdd0 R08: 0000000000000000 R09: 0000000000000001 R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000000 R13: 0000000000000009 R14: ffff88001aefbc30 R15: 0000000000000003 FS: 0000000000000000(0000) GS:ffff88001fc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 00000000000000b0 CR3: 000000000206f000 CR4: 00000000000006f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process swapper/0 (pid: 1, threadinfo ffff88001f9de000, task ffff88001f9dc040) Stack: ffff88001aeedbf0 ffff88001aefbdb0 ffff88001aef1548 ffff88001aefbbf0 ffff88001f9dfdd0 ffff88001aef1548 ffffffff820d6320 ffffffff8165ce30 ffffffff82c555e0 ffff88001aeebbf0 ffff88001f9dfe00 ffffffff813b0507 Call Trace: [<ffffffff813b0507>] elevator_init+0xd7/0x140 [<ffffffff813b83d5>] blk_init_allocated_queue+0x125/0x150 [<ffffffff813b94d3>] blk_init_queue_node+0x43/0x80 [<ffffffff813b9523>] blk_init_queue+0x13/0x20 [<ffffffff821aec00>] floppy_init+0x82/0xec7 [<ffffffff810001d2>] do_one_initcall+0x42/0x170 [<ffffffff821835fc>] kernel_init+0xcb/0x14f [<ffffffff81b40b24>] kernel_thread_helper+0x4/0x10 Code: 00 e8 1d 9e 76 00 48 8b 43 48 48 85 c0 48 89 83 28 03 00 00 74 07 4c 8b a0 10 ff ff ff 8b 15 b0 2e d0 00 85 d2 0f 85 49 01 00 00 <41> 8b 84 24 b0 00 00 00 85 c0 0f 8e 8c 01 00 00 83 e8 01 85 c0 RIP [<ffffffff813d44d8>] cfq_init_queue+0x258/0x430 Because cfq's blkcg support has a on/off switch, CFQ_GROUP_IOSCHED, separate from BLK_CGROUP, blkg access through cfqg needs to be conditioned on it. * Make blkg_to_cfqg() and cfqg_to_blkg() conditioned on CFQ_GROUP_IOSCHED. If disabled, they always return %NULL. * Introduce cfqg_get() and cfqg_put() conditioned on CFQ_GROUP_IOSCHED. If disabled, they are noops. Reported-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-23block: blk_alloc_queue_node(): use caller's GFP flags instead of GFP_KERNELDan Carpenter
We should use the GFP flags that the caller specified instead of picking our own. All the callers specify GFP_KERNEL so this doesn't make a difference to how the kernel runs, it's just a cleanup. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-20Merge branch 'for-3.4' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup changes from Tejun Heo: "Out of the 8 commits, one fixes a long-standing locking issue around tasklist walking and others are cleanups." * 'for-3.4' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup: Walk task list under tasklist_lock in cgroup_enable_task_cg_list cgroup: Remove wrong comment on cgroup_enable_task_cg_list() cgroup: remove cgroup_subsys argument from callbacks cgroup: remove extra calls to find_existing_css_set cgroup: replace tasklist_lock with rcu_read_lock cgroup: simplify double-check locking in cgroup_attach_proc cgroup: move struct cgroup_pidlist out from the header file cgroup: remove cgroup_attach_task_current_cg()
2012-03-20Merge branch 'sched-core-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler changes for v3.4 from Ingo Molnar * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (27 commits) printk: Make it compile with !CONFIG_PRINTK sched/x86: Fix overflow in cyc2ns_offset sched: Fix nohz load accounting -- again! sched: Update yield() docs printk/sched: Introduce special printk_sched() for those awkward moments sched/nohz: Correctly initialize 'next_balance' in 'nohz' idle balancer sched: Cleanup cpu_active madness sched: Fix load-balance wreckage sched: Clean up parameter passing of proc_sched_autogroup_set_nice() sched: Ditch per cgroup task lists for load-balancing sched: Rename load-balancing fields sched: Move load-balancing arguments into helper struct sched/rt: Do not submit new work when PI-blocked sched/rt: Prevent idle task boosting sched/wait: Add __wake_up_all_locked() API sched/rt: Document scheduler related skip-resched-check sites sched/rt: Use schedule_preempt_disabled() sched/rt: Add schedule_preempt_disabled() sched/rt: Do not throttle when PI boosting sched/rt: Keep period timer ticking when rt throttling is active ...
2012-03-20block: remove ioc_*_changed()Tejun Heo
After the previous patch to cfq, there's no ioc_get_changed() user left. This patch yanks out ioc_{ioprio|cgroup|get}_changed() and all related stuff. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-20cfq: don't use icq_get_changed()Tejun Heo
cfq caches the associated cfqq's for a given cic. The cache needs to be flushed if the cic's ioprio or blkcg has changed. It is currently done by requiring the changing action to set the respective ICQ_*_CHANGED bit in the icq and testing it from cfq_set_request(), which involves iterating through all the affected icqs. All cfq wants to know is whether ioprio and/or blkcg have changed since the last flush and can be easily achieved by just remembering the current ioprio and blkcg ID in cic. This patch adds cic->{ioprio|blkcg_id}, updates all ioprio users to use the remembered value instead, and updates cfq_set_request() path such that, instead of using icq_get_changed(), the current values are compared against the remembered ones and trigger appropriate flush action if not. Condition tests are moved inside both _changed functions which are now named check_ioprio_changed() and check_blkcg_changed(). ioprio.h::task_ioprio*() can't be used anymore and replaced with open-coded IOPRIO_CLASS_NONE case in cfq_async_queue_prio(). Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-20cfq: pass around cfq_io_cq instead of io_contextTejun Heo
Now that io_cq is managed by block core and guaranteed to exist for any in-flight request, it is easier and carries more information to pass around cfq_io_cq than io_context. This patch updates cfq_init_prio_data(), cfq_find_alloc_queue() and cfq_get_queue() to take @cic instead of @ioc. This change removes a duplicate cfq_cic_lookup() from cfq_find_alloc_queue(). This change enables the use of cic-cached ioprio in the next patch. Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-20blkcg: add blkcg->idTejun Heo
Add 64bit unique id to blkcg. This will be used by policies which want blkcg identity test to tell whether the associated blkcg has changed. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-20blkcg: remove blkio_group->stats_lockTejun Heo
With recent plug merge updates, all non-percpu stat updates happen under queue_lock making stats_lock unnecessary to synchronize stat updates. The only synchronization necessary is stat reading, which can be done using u64_stats_sync instead. This patch removes blkio_group->stats_lock and adds blkio_group_stats->syncp for reader synchronization. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-20blkcg: restructure blkio_get_stat()Tejun Heo
Restructure blkio_get_stat() to prepare for removal of stats_lock. * Define BLKIO_STAT_ARR_NR explicitly to denote which stats have subtypes instead of using BLKIO_STAT_QUEUED. * Separate out stat acquisition and printing. After this, there are only two users of blkio_fill_stat(). Just open code it. * The code was mixing MAX_KEY_LEN and MAX_KEY_LEN - 1. There's no need to subtract one. Use MAX_KEY_LEN consistently. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-20blkcg: simplify stat resetTejun Heo
blkiocg_reset_stats() implements stat reset for blkio.reset_stats cgroupfs file. This feature is very unconventional and something which shouldn't have been merged. It's only useful when there's only one user or tool looking at the stats. As soon as multiple users and/or tools are involved, it becomes useless as resetting disrupts other usages. There are very good reasons why all other stats expect readers to read values at the start and end of a period and subtract to determine delta over the period. The implementation is rather complex - some fields shouldn't be cleared and it saves some fields, resets whole and restores for some reason. Reset of percpu stats is also racy. The comment points to 64bit store atomicity for the reason but even without that stores for zero can simply race with other CPUs doing RMW and get clobbered. Simplify reset by * Clear selectively instead of resetting and restoring. * Grouping debug stat fields to be reset and using memset() over them. * Not caring about stats_lock. * Using memset() to reset percpu stats. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-20blkcg: don't use percpu for merged statsTejun Heo
With recent plug merge updates, merged stats are no longer called for plug merges and now only updated while holding queue_lock. As stats_lock is scheduled to be removed, there's no reason to use percpu for merged stats. Don't use percpu for merged stats. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-20blkcg: alloc per cpu stats from worker thread in a delayed mannerVivek Goyal
Current per cpu stat allocation assumes GFP_KERNEL allocation flag. But in IO path there are times when we want GFP_NOIO semantics. As there is no way to pass the allocation flags to alloc_percpu(), this patch delays the allocation of stats using a worker thread. v2-> tejun suggested following changes. Changed the patch accordingly. - move alloc_node location in structure - reduce the size of names of some of the fields - Reduce the scope of locking of alloc_list_lock - Simplified stat_alloc_fn() by allocating stats for all policies in one go and then assigning these to a group. v3 -> Andrew suggested to put some comments in the code. Also raised concerns about trying to allocate infinitely in case of allocation failure. I have changed the logic to sleep for 10ms before retrying. That should take care of non-preemptible UP kernels. v4 -> Tejun had more suggestions. - drop list_for_each_entry_all() - instead of msleep() use queue_delayed_work() - Some cleanups realted to more compact coding. v5-> tejun suggested more cleanups leading to more compact code. tj: - Relocated pcpu_stats into blkio_stat_alloc_fn(). - Minor comment update. - This also fixes suspicious RCU usage warning caused by invoking cgroup_path() from blkg_alloc() without holding RCU read lock. Now that blkg_alloc() doesn't require sleepable context, RCU read lock from blkg_lookup_create() is maintained throughout blkg_alloc(). Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-14Merge branch 'for-linus' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block fixes from Jens Axboe: "Been sitting on this for a while, but lets get this out the door. This fixes various important bugs for 3.3 final, along with a few more trivial ones. Please pull!" * 'for-linus' of git://git.kernel.dk/linux-block: block: fix ioc leak in put_io_context block, sx8: fix pointer math issue getting fw version Block: use a freezable workqueue for disk-event polling drivers/block/DAC960: fix -Wuninitialized warning drivers/block/DAC960: fix DAC960_V2_IOCTL_Opcode_T -Wenum-compare warning block: fix __blkdev_get and add_disk race condition block: Fix setting bio flags in drivers (sd_dif/floppy) block: Fix NULL pointer dereference in sd_revalidate_disk block: exit_io_context() should call elevator_exit_icq_fn() block: simplify ioc_release_fn() block: replace icq->changed with icq->flags
2012-03-14block: fix ioc leak in put_io_contextXiaotian Feng
When put_io_context is called, if ioc->icq_list is empty and refcount is 1, kernel will not free the ioc. This is caught by following kmemleak: unreferenced object 0xffff880036349fe0 (size 216): comm "sh", pid 2137, jiffies 4294931140 (age 290579.412s) hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 01 00 01 00 ad 4e ad de ff ff ff ff 00 00 00 00 .....N.......... backtrace: [<ffffffff8169f926>] kmemleak_alloc+0x26/0x50 [<ffffffff81195a9c>] kmem_cache_alloc_node+0x1cc/0x2a0 [<ffffffff81356b67>] create_io_context_slowpath+0x27/0x130 [<ffffffff81356d2b>] get_task_io_context+0xbb/0xf0 [<ffffffff81055f0e>] copy_process+0x188e/0x18b0 [<ffffffff8105609b>] do_fork+0x11b/0x420 [<ffffffff810247f8>] sys_clone+0x28/0x30 [<ffffffff816d3373>] stub_clone+0x13/0x20 [<ffffffffffffffff>] 0xffffffffffffffff ioc should be freed if ioc->icq_list is empty. Signed-off-by: Xiaotian Feng <dannyfeng@tencent.com> Acked-by: Vivek Goyal <vgoyal@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-06block: make blk-throttle preserve the issuing task on delayed biosTejun Heo
Make blk-throttle call bio_associate_current() on bios being delayed such that they get issued to block layer with the original io_context. This allows stacking blk-throttle and cfq-iosched propio policies. bios will always be issued with the correct ioc and blkcg whether it gets delayed by blk-throttle or not. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-06block: make block cgroup policies follow bio task associationTejun Heo
Implement bio_blkio_cgroup() which returns the blkcg associated with the bio if exists or %current's blkcg, and use it in blk-throttle and cfq-iosched propio. This makes both cgroup policies honor task association for the bio instead of always assuming %current. As nobody is using bio_set_task() yet, this doesn't introduce any behavior change. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-06block: implement bio_associate_current()Tejun Heo
IO scheduling and cgroup are tied to the issuing task via io_context and cgroup of %current. Unfortunately, there are cases where IOs need to be routed via a different task which makes scheduling and cgroup limit enforcement applied completely incorrectly. For example, all bios delayed by blk-throttle end up being issued by a delayed work item and get assigned the io_context of the worker task which happens to serve the work item and dumped to the default block cgroup. This is double confusing as bios which aren't delayed end up in the correct cgroup and makes using blk-throttle and cfq propio together impossible. Any code which punts IO issuing to another task is affected which is getting more and more common (e.g. btrfs). As both io_context and cgroup are firmly tied to task including userland visible APIs to manipulate them, it makes a lot of sense to match up tasks to bios. This patch implements bio_associate_current() which associates the specified bio with %current. The bio will record the associated ioc and blkcg at that point and block layer will use the recorded ones regardless of which task actually ends up issuing the bio. bio release puts the associated ioc and blkcg. It grabs and remembers ioc and blkcg instead of the task itself because task may already be dead by the time the bio is issued making ioc and blkcg inaccessible and those are all block layer cares about. elevator_set_req_fn() is updated such that the bio elvdata is being allocated for is available to the elevator. This doesn't update block cgroup policies yet. Further patches will implement the support. -v2: #ifdef CONFIG_BLK_CGROUP added around bio->bi_ioc dereference in rq_ioc() to fix build breakage. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Kent Overstreet <koverstreet@google.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-06block: add io_context->active_refTejun Heo
Currently ioc->nr_tasks is used to decide two things - whether an ioc is done issuing IOs and whether it's shared by multiple tasks. This patch separate out the first into ioc->active_ref, which is acquired and released using {get|put}_io_context_active() respectively. This will be used to associate bio's with a given task. This patch doesn't introduce any visible behavior change. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-06block: interface update for ioc/icq creation functionsTejun Heo
Make the following interface updates to prepare for future ioc related changes. * create_io_context() returning ioc only works for %current because it doesn't increment ref on the ioc. Drop @task parameter from it and always assume %current. * Make create_io_context_slowpath() return 0 or -errno and rename it to create_task_io_context(). * Make ioc_create_icq() take @ioc as parameter instead of assuming that of %current. The caller, get_request(), is updated to create ioc explicitly and then pass it into ioc_create_icq(). Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-06block: restructure get_request()Tejun Heo
get_request() is structured a bit unusually in that failure path is inlined in the usual flow with goto labels atop and inside it. Relocate the error path to the end of the function. This is to prepare for icq handling changes in get_request() and doesn't introduce any behavior change. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-06blkcg: drop unnecessary RCU lockingTejun Heo
Now that blkg additions / removals are always done under both q and blkcg locks, the only places RCU locking is necessary are blkg_lookup[_create]() for lookup w/o blkcg lock. This patch drops unncessary RCU locking replacing it with plain blkcg locking as necessary. * blkiocg_pre_destroy() already perform proper locking and don't need RCU. Dropped. * blkio_read_blkg_stats() now uses blkcg->lock instead of RCU read lock. This isn't a hot path. * Now unnecessary synchronize_rcu() from queue exit paths removed. This makes q->nr_blkgs unnecessary. Dropped. * RCU annotation on blkg->q removed. -v2: Vivek pointed out that blkg_lookup_create() still needs to be called under rcu_read_lock(). Updated. -v3: After the update, stats_lock locking in blkio_read_blkg_stats() shouldn't be using _irq variant as it otherwise ends up enabling irq while blkcg->lock is locked. Fixed. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-06blkcg: use double locking instead of RCU for blkg synchronizationTejun Heo
blkgs are chained from both blkcgs and request_queues and thus subjected to two locks - blkcg->lock and q->queue_lock. As both blkcg and q can go away anytime, locking during removal is tricky. It's currently solved by wrapping removal inside RCU, which makes the synchronization complex. There are three locks to worry about - the outer RCU, q lock and blkcg lock, and it leads to nasty subtle complications like conditional synchronize_rcu() on queue exit paths. For all other paths, blkcg lock is naturally nested inside q lock and the only exception is blkcg removal path, which is a very cold path and can be implemented as clumsy but conceptually-simple reverse double lock dancing. This patch updates blkg removal path such that blkgs are removed while holding both q and blkcg locks, which is trivial for request queue exit path - blkg_destroy_all(). The blkcg removal path, blkiocg_pre_destroy(), implements reverse double lock dancing essentially identical to ioc_release_fn(). This simplifies blkg locking - no half-dead blkgs to worry about. Now unnecessary RCU annotations will be removed by the next patch. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-06blkcg: unify blkg's for blkcg policiesTejun Heo
Currently, blkg is per cgroup-queue-policy combination. This is unnatural and leads to various convolutions in partially used duplicate fields in blkg, config / stat access, and general management of blkgs. This patch make blkg's per cgroup-queue and let them serve all policies. blkgs are now created and destroyed by blkcg core proper. This will allow further consolidation of common management logic into blkcg core and API with better defined semantics and layering. As a transitional step to untangle blkg management, elvswitch and policy [de]registration, all blkgs except the root blkg are being shot down during elvswitch and bypass. This patch adds blkg_root_update() to update root blkg in place on policy change. This is hacky and racy but should be good enough as interim step until we get locking simplified and switch over to proper in-place update for all blkgs. -v2: Root blkgs need to be updated on elvswitch too and blkg_alloc() comment wasn't updated according to the function change. Fixed. Both pointed out by Vivek. -v3: v2 updated blkg_destroy_all() to invoke update_root_blkg_pd() for all policies. This freed root pd during elvswitch before the last queue finished exiting and led to oops. Directly invoke update_root_blkg_pd() only on BLKIO_POLICY_PROP from cfq_exit_queue(). This also is closer to what will be done with proper in-place blkg update. Reported by Vivek. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-06blkcg: let blkcg core manage per-queue blkg list and counterTejun Heo
With the previous patch to move blkg list heads and counters to request_queue and blkg, logic to manage them in both policies are almost identical and can be moved to blkcg core. This patch moves blkg link logic into blkg_lookup_create(), implements common blkg unlink code in blkg_destroy(), and updates blkg_destory_all() so that it's policy specific and can skip root group. The updated blkg_destroy_all() is now used to both clear queue for bypassing and elv switching, and release all blkgs on q exit. This patch introduces a race window where policy [de]registration may race against queue blkg clearing. This can only be a problem on cfq unload and shouldn't be a real problem in practice (and we have many other places where this race already exists). Future patches will remove these unlikely races. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-06blkcg: move per-queue blkg list heads and counters to queue and blkgTejun Heo
Currently, specific policy implementations are responsible for maintaining list and number of blkgs. This duplicates code unnecessarily, and hinders factoring common code and providing blkcg API with better defined semantics. After this patch, request_queue hosts list heads and counters and blkg has list nodes for both policies. This patch only relocates the necessary fields and the next patch will actually move management code into blkcg core. Note that request_queue->blkg_list[] and ->nr_blkgs[] are hardcoded to have 2 elements. This is to avoid include dependency and will be removed by the next patch. This patch doesn't introduce any behavior change. -v2: Now unnecessary conditional on CONFIG_BLK_CGROUP_MODULE removed as pointed out by Vivek. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-06blkcg: don't use blkg->plid in stat related functionsTejun Heo
blkg is scheduled to be unified for all policies and thus there won't be one-to-one mapping from blkg to policy. Update stat related functions to take explicit @pol or @plid arguments and not use blkg->plid. This is painful for now but most of specific stat interface functions will be replaced with a handful of generic helpers. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-06blkcg: make blkg->pd an array and move configuration and stats into itTejun Heo
To prepare for unifying blkgs for different policies, make blkg->pd an array with BLKIO_NR_POLICIES elements and move blkg->conf, ->stats, and ->stats_cpu into blkg_policy_data. This patch doesn't introduce any functional difference. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-06blkcg: move refcnt to blkcg coreTejun Heo
Currently, blkcg policy implementations manage blkg refcnt duplicating mostly identical code in both policies. This patch moves refcnt to blkg and let blkcg core handle refcnt and freeing of blkgs. * cfq blkgs now also get freed via RCU. * cfq blkgs lose RB_EMPTY_ROOT() sanity check on blkg free. If necessary, we can add blkio_exit_group_fn() to resurrect this. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2012-03-06blkcg: let blkcg core handle policy private data allocationTejun Heo
Currently, blkg's are embedded in private data blkcg policy private data structure and thus allocated and freed by policies. This leads to duplicate codes in policies, hinders implementing common part in blkcg core with strong semantics, and forces duplicate blkg's for the same cgroup-q association. This patch introduces struct blkg_policy_data which is a separate data structure chained from blkg. Policies specifies the amount of private data it needs in its blkio_policy_type->pdata_size and blkcg core takes care of allocating them along with blkg which can be accessed using blkg_to_pdata(). blkg can be determined from pdata using pdata_to_blkg(). blkio_alloc_group_fn() method is accordingly updated to blkio_init_group_fn(). For consistency, tg_of_blkg() and cfqg_of_blkg() are replaced with blkg_to_tg() and blkg_to_cfqg() respectively, and functions to map in the reverse direction are added. Except that policy specific data now lives in a separate data structure from blkg, this patch doesn't introduce any functional difference. This will be used to unify blkg's for different policies. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>