aboutsummaryrefslogtreecommitdiff
path: root/kernel/sched
AgeCommit message (Collapse)Author
2014-02-22sched: Check for idle task in might_sleep()Thomas Gleixner
Idle is not allowed to call sleeping functions ever! Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1391803122-4425-3-git-send-email-bigeasy@linutronix.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-02-22sched: Init idle->on_rq in init_idle()Thomas Gleixner
We stumbled in RT over a SMP bringup issue on ARM where the idle->on_rq == 0 was causing try_to_wakeup() on the other cpu to run into nada land. After adding that idle->on_rq = 1; I was able to find the root cause of the lockup: the idle task on the newly woken up cpu was fiddling with a sleeping spinlock, which is a nono. I kept the init of idle->on_rq to keep the state consistent and to avoid another long lasting debug session. As a side note, the whole debug mess could have been avoided if might_sleep() would have yelled when called from the idle task. That's fixed with patch 2/6 - and that one actually has a changelog :) Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1391803122-4425-2-git-send-email-bigeasy@linutronix.de Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-02-21sched: Remove some #ifdefferyPeter Zijlstra
Remove a few gratuitous #ifdefs in pick_next_task*(). Cc: Ingo Molnar <mingo@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Juri Lelli <juri.lelli@gmail.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/n/tip-nnzddp5c4fijyzzxxrwlxghf@git.kernel.org Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-02-21sched: Fix hotplug task migrationPeter Zijlstra
Dan Carpenter reported: > kernel/sched/rt.c:1347 pick_next_task_rt() warn: variable dereferenced before check 'prev' (see line 1338) > kernel/sched/deadline.c:1011 pick_next_task_dl() warn: variable dereferenced before check 'prev' (see line 1005) Kirill also spotted that migrate_tasks() will have an instant NULL deref because pick_next_task() will immediately deref prev. Instead of fixing all the corner cases because migrate_tasks() can pass in a NULL prev task in the unlikely case of hot-un-plug, provide a fake task such that we can remove all the NULL checks from the far more common paths. A further problem; not previously spotted; is that because we pushed pre_schedule() and idle_balance() into pick_next_task() we now need to avoid those getting called and pulling more tasks on our dying CPU. We avoid pull_{dl,rt}_task() by setting fake_task.prio to MAX_PRIO+1. We also note that since we call pick_next_task() exactly the amount of times we have runnable tasks present, we should never land in idle_balance(). Fixes: 38033c37faab ("sched: Push down pre_schedule() and idle_balance()") Cc: Juri Lelli <juri.lelli@gmail.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Steven Rostedt <rostedt@goodmis.org> Reported-by: Kirill Tkhai <tkhai@yandex.ru> Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20140212094930.GB3545@laptop.programming.kicks-ass.net Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-02-21sched/fair: Remove idle_balance() declaration in sched.hPeter Zijlstra
Remove idle_balance() from the public life; also reduce some #ifdef clutter by folding the pick_next_task_fair() idle path into idle_balance(). Cc: mingo@kernel.org Reported-by: Daniel Lezcano <daniel.lezcano@linaro.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20140211151148.GP27965@twins.programming.kicks-ass.net Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-02-21sched/fair: Reset se-depth when task switched to FAIRMichael wang
Sasha reported: [ 522.645288] BUG: unable to handle kernel NULL pointer dereference at ... [ 522.646271] IP: [<ffffffff81186c6f>] check_preempt_wakeup+0x11f/0x210 ... [ 522.650021] Call Trace: [ 522.650021] <IRQ> [ 522.650021] [<ffffffff8117361d>] check_preempt_curr+0x3d/0xb0 [ 522.650021] [<ffffffff81175d88>] ttwu_do_wakeup+0x18/0x130 ... which was caused by the se-depth changed during the time when task is not FAIR, and we will use the wrong depth value after it switched back to FAIR. This patch reset the depth at the time when task switched to FAIR, make sure that we always have the correct value when task is FAIR. Cc: Ingo Molnar <mingo@kernel.org> Reported-by: Sasha Levin <sasha.levin@oracle.com> Tested-by: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Michael Wang <wangyun@linux.vnet.ibm.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/5305732D.70001@linux.vnet.ibm.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-02-21sched/deadline: Remove useless dl_nr_totalKirill Tkhai
In deadline class we do not have group scheduling like in RT. dl_nr_total is the same as dl_nr_running. So, one of them should be removed. Cc: Ingo Molnar <mingo@redhat.com> Cc: Juri Lelli <juri.lelli@gmail.com> Signed-off-by: Kirill Tkhai <tkhai@yandex.ru> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/368631392675853@web20h.yandex.ru Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-02-21sched/deadline: Test for CPU's presence explicitlyBoris Ostrovsky
A hot-removed CPU may have ID that is numerically larger than the number of existing CPUs in the system (e.g. we can unplug CPU 4 from a system that has CPUs 0, 1 and 4). Thus the WARN_ONs should check whether the CPU in question is currently present, not whether its ID value is less than num_present_cpus(). Cc: Ingo Molnar <mingo@kernel.org> Cc: Juri Lelli <juri.lelli@gmail.com> Cc: Steven Rostedt <rostedt@goodmis.org> Reported-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1392646353-1874-1-git-send-email-boris.ostrovsky@oracle.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-02-21sched: Add 'flags' argument to sched_{set,get}attr() syscallsPeter Zijlstra
Because of a recent syscall design debate; its deemed appropriate for each syscall to have a flags argument for future extension; without immediately requiring new syscalls. Cc: juri.lelli@gmail.com Cc: Ingo Molnar <mingo@redhat.com> Suggested-by: Michael Kerrisk <mtk.manpages@gmail.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20140214161929.GL27965@twins.programming.kicks-ass.net Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-02-21sched: Fix information leak in sys_sched_getattr()Vegard Nossum
We're copying the on-stack structure to userspace, but forgot to give the right number of bytes to copy. This allows the calling process to obtain up to PAGE_SIZE bytes from the stack (and possibly adjacent kernel memory). This fix copies only as much as we actually have on the stack (attr->size defaults to the size of the struct) and leaves the rest of the userspace-provided buffer untouched. Found using kmemcheck + trinity. Fixes: d50dde5a10f30 ("sched: Add new scheduler syscalls to support an extended scheduling parameters ABI") Cc: Dario Faggioli <raistlin@linux.it> Cc: Juri Lelli <juri.lelli@gmail.com> Cc: Ingo Molnar <mingo@kernel.org> Signed-off-by: Vegard Nossum <vegard.nossum@oracle.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1392585857-10725-1-git-send-email-vegard.nossum@oracle.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-02-21sched,numa: add cond_resched to task_numa_workRik van Riel
Normally task_numa_work scans over a fairly small amount of memory, but it is possible to run into a large unpopulated part of virtual memory, with no pages mapped. In that case, task_numa_work can run for a while, and it may make sense to reschedule as required. Cc: akpm@linux-foundation.org Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Rik van Riel <riel@redhat.com> Reported-by: Xing Gang <gang.xing@hp.com> Tested-by: Chegu Vinod <chegu_vinod@hp.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1392761566-24834-2-git-send-email-riel@redhat.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-02-21sched/core: Make dl_b->lock IRQ safeJuri Lelli
Fix this lockdep warning: [ 44.804600] ========================================================= [ 44.805746] [ INFO: possible irq lock inversion dependency detected ] [ 44.805746] 3.14.0-rc2-test+ #14 Not tainted [ 44.805746] --------------------------------------------------------- [ 44.805746] bash/3674 just changed the state of lock: [ 44.805746] (&dl_b->lock){+.....}, at: [<ffffffff8106ad15>] sched_rt_handler+0x132/0x248 [ 44.805746] but this lock was taken by another, HARDIRQ-safe lock in the past: [ 44.805746] (&rq->lock){-.-.-.} and interrupts could create inverse lock ordering between them. [ 44.805746] [ 44.805746] other info that might help us debug this: [ 44.805746] Possible interrupt unsafe locking scenario: [ 44.805746] [ 44.805746] CPU0 CPU1 [ 44.805746] ---- ---- [ 44.805746] lock(&dl_b->lock); [ 44.805746] local_irq_disable(); [ 44.805746] lock(&rq->lock); [ 44.805746] lock(&dl_b->lock); [ 44.805746] <Interrupt> [ 44.805746] lock(&rq->lock); by making dl_b->lock acquiring always IRQ safe. Cc: Ingo Molnar <mingo@redhat.com> Signed-off-by: Juri Lelli <juri.lelli@gmail.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1392107067-19907-3-git-send-email-juri.lelli@gmail.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-02-21sched/core: Fix sched_rt_global_validateJuri Lelli
Don't compare sysctl_sched_rt_runtime against sysctl_sched_rt_period if the former is equal to RUNTIME_INF, otherwise disabling -rt bandwidth management (with CONFIG_RT_GROUP_SCHED=n) fails. Cc: Ingo Molnar <mingo@redhat.com> Signed-off-by: Juri Lelli <juri.lelli@gmail.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1392107067-19907-2-git-send-email-juri.lelli@gmail.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-02-21sched/deadline: Fix overflow to handle period==0 and deadline!=0Steven Rostedt
While debugging the crash with the bad nr_running accounting, I hit another bug where, after running my sched deadline test, I was getting failures to take a CPU offline. It was giving me a -EBUSY error. Adding a bunch of trace_printk()s around, I found that the cpu notifier that called sched_cpu_inactive() was returning a failure. The overflow value was coming up negative? Talking this over with Juri, the problem is that the total_bw update was suppose to be made by dl_overflow() which, during my tests, seemed to not be called. Adding more trace_printk()s, it wasn't that it wasn't called, but it exited out right away with the check of new_bw being equal to p->dl.dl_bw. The new_bw calculates the ratio between period and runtime. The bug is that if you set a deadline, you do not need to set a period if you plan on the period being equal to the deadline. That is, if period is zero and deadline is not, then the system call should set the period to be equal to the deadline. This is done elsewhere in the code. The fix is easy, check if period is set, and if it is not, then use the deadline. Cc: Juri Lelli <juri.lelli@gmail.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20140219135335.7e74abd4@gandalf.local.home Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-02-21sched/deadline: Fix bad accounting of nr_runningJuri Lelli
Rostedt writes: My test suite was locking up hard when enabling mmiotracer. This was due to the mmiotracer placing all but one CPU offline. I found this out when I was able to reproduce the bug with just my stress-cpu-hotplug test. This bug baffled me because it would not always trigger, and would only trigger on the first run after boot up. The stress-cpu-hotplug test would crash hard the first run, or never crash at all. But a new reboot may cause it to crash on the first run again. I spent all week bisecting this, as I couldn't find a consistent reproducer. I finally narrowed it down to the sched deadline patches, and even more peculiar, to the commit that added the sched deadline boot up self test to the latency tracer. Then it dawned on me to what the bug was. All it took was to run a task under sched deadline to screw up the CPU hot plugging. This explained why it would lock up only on the first run of the stress-cpu-hotplug test. The bug happened when the boot up self test of the schedule latency tracer would test a deadline task. The deadline task would corrupt something that would cause CPU hotplug to fail. If it didn't corrupt it, the stress test would always work (there's no other sched deadline tasks that would run to cause problems). If it did corrupt on boot up, the first test would lockup hard. I proved this theory by running my deadline test program on another box, and then run the stress-cpu-hotplug test, and it would now consistently lock up. I could run stress-cpu-hotplug over and over with no problem, but once I ran the deadline test, the next run of the stress-cpu-hotplug would lock hard. After adding lots of tracing to the code, I found the cause. The function tracer showed that migrate_tasks() was stuck in an infinite loop, where rq->nr_running never equaled 1 to break out of it. When I added a trace_printk() to see what that number was, it was 335 and never decrementing! Looking at the deadline code I found: static void __dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags) { dequeue_dl_entity(&p->dl); dequeue_pushable_dl_task(rq, p); } static void dequeue_task_dl(struct rq *rq, struct task_struct *p, int flags) { update_curr_dl(rq); __dequeue_task_dl(rq, p, flags); dec_nr_running(rq); } And this: if (dl_runtime_exceeded(rq, dl_se)) { __dequeue_task_dl(rq, curr, 0); if (likely(start_dl_timer(dl_se, curr->dl.dl_boosted))) dl_se->dl_throttled = 1; else enqueue_task_dl(rq, curr, ENQUEUE_REPLENISH); if (!is_leftmost(curr, &rq->dl)) resched_task(curr); } Notice how we call __dequeue_task_dl() and in the else case we call enqueue_task_dl()? Also notice that dequeue_task_dl() has underscores where enqueue_task_dl() does not. The enqueue_task_dl() calls inc_nr_running(rq), but __dequeue_task_dl() does not. This is where we get nr_running out of sync. [snip] Another point where nr_running can get out of sync is when the dl_timer fires: dl_se->dl_throttled = 0; if (p->on_rq) { enqueue_task_dl(rq, p, ENQUEUE_REPLENISH); if (task_has_dl_policy(rq->curr)) check_preempt_curr_dl(rq, p, 0); else resched_task(rq->curr); This patch does two things: - correctly accounts for throttled tasks (that are now considered !running); - fixes the bug, updating nr_running from {inc,dec}_dl_tasks(), since we risk to update it twice in some situations (e.g., a task is dequeued while it has exceeded its budget). Cc: mingo@redhat.com Cc: torvalds@linux-foundation.org Cc: akpm@linux-foundation.org Reported-by: Steven Rostedt <rostedt@goodmis.org> Reviewed-by: Steven Rostedt <rostedt@goodmis.org> Tested-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Juri Lelli <juri.lelli@gmail.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1392884379-13744-1-git-send-email-juri.lelli@gmail.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
2014-02-21sched/mm: call finish_arch_post_lock_switch in idle_task_exit and use_mmMartin Schwidefsky
The finish_arch_post_lock_switch is called at the end of the task switch after all locks have been released. In concept it is paired with the switch_mm function, but the current code only does the call in finish_task_switch. Add the call to idle_task_exit and use_mm. One use case for the additional calls is s390 which will use finish_arch_post_lock_switch to wait for the completion of TLB flush operations. Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
2014-02-13cgroup: drop @skip_css from cgroup_taskset_for_each()Tejun Heo
If !NULL, @skip_css makes cgroup_taskset_for_each() skip the matching css. The intention of the interface is to make it easy to skip css's (cgroup_subsys_states) which already match the migration target; however, this is entirely unnecessary as migration taskset doesn't include tasks which are already in the target cgroup. Drop @skip_css from cgroup_taskset_for_each(). Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Paul Mackerras <paulus@samba.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net> Cc: Daniel Borkmann <dborkman@redhat.com>
2014-02-12cgroup: remove cgroup->nameTejun Heo
cgroup->name handling became quite complicated over time involving dedicated struct cgroup_name for RCU protection. Now that cgroup is on kernfs, we can drop all of it and simply use kernfs_name/path() and friends. Replace cgroup->name and all related code with kernfs name/path constructs. * Reimplement cgroup_name() and cgroup_path() as thin wrappers on top of kernfs counterparts, which involves semantic changes. pr_cont_cgroup_name() and pr_cont_cgroup_path() added. * cgroup->name handling dropped from cgroup_rename(). * All users of cgroup_name/path() updated to the new semantics. Users which were formatting the string just to printk them are converted to use pr_cont_cgroup_name/path() instead, which simplifies things quite a bit. As cgroup_name() no longer requires RCU read lock around it, RCU lockings which were protecting only cgroup_name() are removed. v2: Comment above oom_info_lock updated as suggested by Michal. v3: dummy_top doesn't have a kn associated and pr_cont_cgroup_name/path() ended up calling the matching kernfs functions with NULL kn leading to oops. Test for NULL kn and print "/" if so. This issue was reported by Fengguang Wu. v4: Rebased on top of 0ab02ca8f887 ("cgroup: protect modifications to cgroup_idr with cgroup_mutex"). Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Li Zefan <lizefan@huawei.com> Cc: Fengguang Wu <fengguang.wu@intel.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Balbir Singh <bsingharora@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
2014-02-11sched/idle: Move cpu/idle.c to sched/idle.cNicolas Pitre
Integration of cpuidle with the scheduler requires that the idle loop be closely integrated with the scheduler proper. Moving cpu/idle.c into the sched directory will allow for a smoother integration, and eliminate a subdirectory which contained only one source file. Signed-off-by: Nicolas Pitre <nico@linaro.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/alpine.LFD.2.11.1401301102210.1652@knanqh.ubzr Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-02-11sched: Add statistic for newidle load balance costAlex Shi
Tracking rq->max_idle_balance_cost and sd->max_newidle_lb_cost. It's useful to know these values in debug mode. Signed-off-by: Alex Shi <alex.shi@linaro.org> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/52E0F3BF.5020904@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-02-11sched: Delete is_same_group() outside CONFIG_FAIR_GROUP_SCHEDDietmar Eggemann
Since is_same_group() is only used in the group scheduling code, there is no need to define it outside CONFIG_FAIR_GROUP_SCHED. Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1391005773-29493-1-git-send-email-dietmar.eggemann@arm.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-02-11sched: Push down pre_schedule() and idle_balance()Peter Zijlstra
This patch both merged idle_balance() and pre_schedule() and pushes both of them into pick_next_task(). Conceptually pre_schedule() and idle_balance() are rather similar, both are used to pull more work onto the current CPU. We cannot however first move idle_balance() into pre_schedule_fair() since there is no guarantee the last runnable task is a fair task, and thus we would miss newidle balances. Similarly, the dl and rt pre_schedule calls must be ran before idle_balance() since their respective tasks have higher priority and it would not do to delay their execution searching for less important tasks first. However, by noticing that pick_next_tasks() already traverses the sched_class hierarchy in the right order, we can get the right behaviour and do away with both calls. We must however change the special case optimization to also require that prev is of sched_class_fair, otherwise we can miss doing a dl or rt pull where we needed one. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/n/tip-a8k6vvaebtn64nie345kx1je@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-02-10sched: Clean up idle task SMP logicPeter Zijlstra
The idle post_schedule flag is just a vile waste of time, furthermore it appears unneeded, move the idle_enter_fair() call into pick_next_task_idle(). Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Daniel Lezcano <daniel.lezcano@linaro.org> Cc: Vincent Guittot <vincent.guittot@linaro.org> Cc: alex.shi@linaro.org Cc: mingo@kernel.org Cc: Steven Rostedt <rostedt@goodmis.org> Link: http://lkml.kernel.org/n/tip-aljykihtxJt3mkokxi0qZurb@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-02-10sched/fair: Optimize cgroup pick_next_task_fair()Peter Zijlstra
Since commit 2f36825b1 ("sched: Next buddy hint on sleep and preempt path") it is likely we pick a new task from the same cgroup, doing a put and then set on all intermediate entities is a waste of time, so try to avoid this. Measured using: mount nodev /cgroup -t cgroup -o cpu cd /cgroup mkdir a; cd a mkdir b; cd b mkdir c; cd c echo $$ > tasks perf stat --repeat 10 -- taskset 1 perf bench sched pipe PRE : 4.542422684 seconds time elapsed ( +- 0.33% ) POST: 4.389409991 seconds time elapsed ( +- 0.32% ) Which shows a significant improvement of ~3.5% Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tejun Heo <tj@kernel.org> Link: http://lkml.kernel.org/r/1328936700.2476.17.camel@laptop Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-02-10sched/fair: Clean up the __clear_buddies_*() functionsPeter Zijlstra
Slightly easier code flow, no functional changes. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1328936700.2476.17.camel@laptop Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-02-10sched: Push put_prev_task() into pick_next_task()Peter Zijlstra
In order to avoid having to do put/set on a whole cgroup hierarchy when we context switch, push the put into pick_next_task() so that both operations are in the same function. Further changes then allow us to possibly optimize away redundant work. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1328936700.2476.17.camel@laptop Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-02-10sched/fair: Track cgroup depthPeter Zijlstra
Track depth in cgroup tree, this is useful for things like find_matching_se() where you need to get to a common parent of two sched entities. Keeping the depth avoids having to calculate it on the spot, which saves a number of possible cache-misses. Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1328936700.2476.17.camel@laptop Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-02-10sched: Move rq->idle_stamp up to the coreDaniel Lezcano
idle_balance() modifies the rq->idle_stamp field, making this information shared across core.c and fair.c. As we know if the cpu is going to idle or not with the previous patch, let's encapsulate the rq->idle_stamp information in core.c by moving it up to the caller. The idle_balance() function returns true in case a balancing occured and the cpu won't be idle, false if no balance happened and the cpu is going idle. Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org> Cc: alex.shi@linaro.org Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1389949444-14821-3-git-send-email-daniel.lezcano@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-02-10sched: Fix race in idle_balance()Daniel Lezcano
The scheduler main function 'schedule()' checks if there are no more tasks on the runqueue. Then it checks if a task should be pulled in the current runqueue in idle_balance() assuming it will go to idle otherwise. But idle_balance() releases the rq->lock in order to look up the sched domains and takes the lock again right after. That opens a window where another cpu may put a task in our runqueue, so we won't go to idle but we have filled the idle_stamp, thinking we will. This patch closes the window by checking if the runqueue has been modified but without pulling a task after taking the lock again, so we won't go to idle right after in the __schedule() function. Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org> Cc: alex.shi@linaro.org Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1389949444-14821-2-git-send-email-daniel.lezcano@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-02-10sched: Remove 'cpu' parameter from idle_balance()Daniel Lezcano
The cpu parameter passed to idle_balance() is not needed as it could be retrieved from 'struct rq.' Signed-off-by: Daniel Lezcano <daniel.lezcano@linaro.org> Cc: alex.shi@linaro.org Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1389949444-14821-1-git-send-email-daniel.lezcano@linaro.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-02-09sched: Implement task_nice() as static inline functionDongsheng Yang
As patch "sched: Move the priority specific bits into a new header file" exposes the priority related macros in linux/sched/prio.h, we don't have to implement task_nice() in kernel/sched/core.c any more. This patch implements it in linux/sched/sched.h as static inline function, saving the kernel stack and enhancing performance a bit. Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com> Cc: clark.williams@gmail.com Cc: rostedt@goodmis.org Cc: raistlin@linux.it Cc: juri.lelli@gmail.com Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1390878045-7096-1-git-send-email-yangds.fnst@cn.fujitsu.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-02-09sched: Expose some macros related to priorityDongsheng Yang
Some macros in kernel/sched/sched.h about priority are private to kernel/sched. But they are useful to other parts of the core kernel. This patch moves these macros from kernel/sched/sched.h to include/linux/sched/prio.h so that they are available to other subsystems. Signed-off-by: Dongsheng Yang <yangds.fnst@cn.fujitsu.com> Cc: raistlin@linux.it Cc: juri.lelli@gmail.com Cc: clark.williams@gmail.com Cc: rostedt@goodmis.org Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/2b022810905b52d13238466807f4b2a691577180.1390859827.git.yangds.fnst@cn.fujitsu.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-02-09sched/deadline: Skip in switched_to_dl() if task is currentKirill Tkhai
When p is current and it's not of dl class, then there are no other dl taks in the rq. If we had had pushable tasks in some other rq, they would have been pushed earlier. So, skip "p == rq->curr" case. Signed-off-by: Kirill Tkhai <ktkhai@parallels.com> Acked-by: Juri Lelli <juri.lelli@gmail.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20140128072421.32315.25300.stgit@tkhai Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-02-08cgroup: clean up cgroup_subsys names and initializationTejun Heo
cgroup_subsys is a bit messier than it needs to be. * The name of a subsys can be different from its internal identifier defined in cgroup_subsys.h. Most subsystems use the matching name but three - cpu, memory and perf_event - use different ones. * cgroup_subsys_id enums are postfixed with _subsys_id and each cgroup_subsys is postfixed with _subsys. cgroup.h is widely included throughout various subsystems, it doesn't and shouldn't have claim on such generic names which don't have any qualifier indicating that they belong to cgroup. * cgroup_subsys->subsys_id should always equal the matching cgroup_subsys_id enum; however, we require each controller to initialize it and then BUG if they don't match, which is a bit silly. This patch cleans up cgroup_subsys names and initialization by doing the followings. * cgroup_subsys_id enums are now postfixed with _cgrp_id, and each cgroup_subsys with _cgrp_subsys. * With the above, renaming subsys identifiers to match the userland visible names doesn't cause any naming conflicts. All non-matching identifiers are renamed to match the official names. cpu_cgroup -> cpu mem_cgroup -> memory perf -> perf_event * controllers no longer need to initialize ->subsys_id and ->name. They're generated in cgroup core and set automatically during boot. * Redundant cgroup_subsys declarations removed. * While updating BUG_ON()s in cgroup_init_early(), convert them to WARN()s. BUGging that early during boot is stupid - the kernel can't print anything, even through serial console and the trap handler doesn't even link stack frame properly for back-tracing. This patch doesn't introduce any behavior changes. v2: Rebased on top of fe1217c4f3f7 ("net: net_cls: move cgroupfs classid handling into core"). Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Neil Horman <nhorman@tuxdriver.com> Acked-by: "David S. Miller" <davem@davemloft.net> Acked-by: "Rafael J. Wysocki" <rjw@rjwysocki.net> Acked-by: Michal Hocko <mhocko@suse.cz> Acked-by: Peter Zijlstra <peterz@infradead.org> Acked-by: Aristeu Rozanski <aris@redhat.com> Acked-by: Ingo Molnar <mingo@redhat.com> Acked-by: Li Zefan <lizefan@huawei.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Balbir Singh <bsingharora@gmail.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Serge E. Hallyn <serue@us.ibm.com> Cc: Vivek Goyal <vgoyal@redhat.com> Cc: Thomas Graf <tgraf@suug.ch>
2014-02-02Merge branch 'linus' into sched/core, to resolve conflictsIngo Molnar
Conflicts: kernel/sysctl.c Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-01-31Merge branch 'timers-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer/dynticks updates from Ingo Molnar: "This tree contains misc dynticks updates: a fix and three cleanups" * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/nohz: Fix overflow error in scheduler_tick_max_deferment() nohz_full: fix code style issue of tick_nohz_full_stop_tick nohz: Get timekeeping max deferment outside jiffies_lock tick: Rename tick_check_idle() to tick_irq_enter()
2014-01-31Merge branch 'sched-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Ingo Molnar: "A crash fix and documentation updates" * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched: Make sched_class::get_rr_interval() optional sched/deadline: Add sched_dl documentation sched: Fix docbook parameter annotation error in wait.h
2014-01-28sched/numa: Turn some magic numbers into #definesRik van Riel
Cleanup suggested by Mel Gorman. Now the code contains some more hints on what statistics go where. Suggested-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Rik van Riel <riel@redhat.com> Acked-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Chegu Vinod <chegu_vinod@hp.com> Link: http://lkml.kernel.org/r/1390860228-21539-10-git-send-email-riel@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-01-28sched/numa: Rename variables in task_numa_fault()Rik van Riel
We track both the node of the memory after a NUMA fault, and the node of the CPU on which the fault happened. Rename the local variables in task_numa_fault to make things more explicit. Suggested-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Rik van Riel <riel@redhat.com> Acked-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Chegu Vinod <chegu_vinod@hp.com> Link: http://lkml.kernel.org/r/1390860228-21539-9-git-send-email-riel@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-01-28sched/numa: Do statistics calculation using local variables onlyRik van Riel
The current code in task_numa_placement calculates the difference between the old and the new value, but also temporarily stores half of the old value in the per-process variables. The NUMA balancing code looks at those per-process variables, and having other tasks temporarily see halved statistics could lead to unwanted numa migrations. This can be avoided by doing all the math in local variables. This change also simplifies the code a little. Signed-off-by: Rik van Riel <riel@redhat.com> Acked-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Chegu Vinod <chegu_vinod@hp.com> Link: http://lkml.kernel.org/r/1390860228-21539-8-git-send-email-riel@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-01-28sched/numa: Normalize faults_cpu stats and weigh by CPU useRik van Riel
Tracing the code that decides the active nodes has made it abundantly clear that the naive implementation of the faults_from code has issues. Specifically, the garbage collector in some workloads will access orders of magnitudes more memory than the threads that do all the active work. This resulted in the node with the garbage collector being marked the only active node in the group. This issue is avoided if we weigh the statistics by CPU use of each task in the numa group, instead of by how many faults each thread has occurred. To achieve this, we normalize the number of faults to the fraction of faults that occurred on each node, and then multiply that fraction by the fraction of CPU time the task has used since the last time task_numa_placement was invoked. This way the nodes in the active node mask will be the ones where the tasks from the numa group are most actively running, and the influence of eg. the garbage collector and other do-little threads is properly minimized. On a 4 node system, using CPU use statistics calculated over a longer interval results in about 1% fewer page migrations with two 32-warehouse specjbb runs on a 4 node system, and about 5% fewer page migrations, as well as 1% better throughput, with two 8-warehouse specjbb runs, as compared with the shorter term statistics kept by the scheduler. Signed-off-by: Rik van Riel <riel@redhat.com> Acked-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Chegu Vinod <chegu_vinod@hp.com> Link: http://lkml.kernel.org/r/1390860228-21539-7-git-send-email-riel@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-01-28sched/numa, mm: Use active_nodes nodemask to limit numa migrationsRik van Riel
Use the active_nodes nodemask to make smarter decisions on NUMA migrations. In order to maximize performance of workloads that do not fit in one NUMA node, we want to satisfy the following criteria: 1) keep private memory local to each thread 2) avoid excessive NUMA migration of pages 3) distribute shared memory across the active nodes, to maximize memory bandwidth available to the workload This patch accomplishes that by implementing the following policy for NUMA migrations: 1) always migrate on a private fault 2) never migrate to a node that is not in the set of active nodes for the numa_group 3) always migrate from a node outside of the set of active nodes, to a node that is in that set 4) within the set of active nodes in the numa_group, only migrate from a node with more NUMA page faults, to a node with fewer NUMA page faults, with a 25% margin to avoid ping-ponging This results in most pages of a workload ending up on the actively used nodes, with reduced ping-ponging of pages between those nodes. Signed-off-by: Rik van Riel <riel@redhat.com> Acked-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Chegu Vinod <chegu_vinod@hp.com> Link: http://lkml.kernel.org/r/1390860228-21539-6-git-send-email-riel@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-01-28sched/numa: Build per numa_group active node mask from numa_faults_cpu ↵Rik van Riel
statistics The numa_faults_cpu statistics are used to maintain an active_nodes nodemask per numa_group. This allows us to be smarter about when to do numa migrations. Signed-off-by: Rik van Riel <riel@redhat.com> Acked-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Chegu Vinod <chegu_vinod@hp.com> Link: http://lkml.kernel.org/r/1390860228-21539-5-git-send-email-riel@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-01-28sched/numa: Track from which nodes NUMA faults are triggeredRik van Riel
Track which nodes NUMA faults are triggered from, in other words the CPUs on which the NUMA faults happened. This uses a similar mechanism to what is used to track the memory involved in numa faults. The next patches use this to build up a bitmap of which nodes a workload is actively running on. Signed-off-by: Rik van Riel <riel@redhat.com> Acked-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Chegu Vinod <chegu_vinod@hp.com> Link: http://lkml.kernel.org/r/1390860228-21539-4-git-send-email-riel@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-01-28sched/numa: Rename p->numa_faults to numa_faults_memoryRik van Riel
In order to get a more consistent naming scheme, making it clear which fault statistics track memory locality, and which track CPU locality, rename the memory fault statistics. Suggested-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Rik van Riel <riel@redhat.com> Acked-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Chegu Vinod <chegu_vinod@hp.com> Link: http://lkml.kernel.org/r/1390860228-21539-3-git-send-email-riel@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-01-28sched/numa, mm: Remove p->numa_migrate_deferredRik van Riel
Excessive migration of pages can hurt the performance of workloads that span multiple NUMA nodes. However, it turns out that the p->numa_migrate_deferred knob is a really big hammer, which does reduce migration rates, but does not actually help performance. Now that the second stage of the automatic numa balancing code has stabilized, it is time to replace the simplistic migration deferral code with something smarter. Signed-off-by: Rik van Riel <riel@redhat.com> Acked-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Cc: Chegu Vinod <chegu_vinod@hp.com> Link: http://lkml.kernel.org/r/1390860228-21539-2-git-send-email-riel@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-01-28sched: Make sched_class::get_rr_interval() optionalPeter Zijlstra
Not all classes implement (or can implement) a useful get_rr_interval() function, default to a 0 time-slice for them. This fixes a crash reported by Tommi Rantala. Reported-by: Tommi Rantala <tt.rantala@gmail.com> Cc: Dave Jones <davej@redhat.com> Cc: Tommi Rantala <tt.rantala@gmail.com> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20140127105413.GC11314@laptop.programming.kicks-ass.net Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-01-28sched/deadline: Add sched_dl documentationDario Faggioli
Add in Documentation/scheduler/ some hints about the design choices, the usage and the future possible developments of the sched_dl scheduling class and of the SCHED_DEADLINE policy. Reviewed-by: Henrik Austad <henrik@austad.us> Signed-off-by: Dario Faggioli <raistlin@linux.it> Signed-off-by: Juri Lelli <juri.lelli@gmail.com> [ Re-wrote sections 2 and 3. ] Signed-off-by: Luca Abeni <luca.abeni@unitn.it> Signed-off-by: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/1390821615-23247-1-git-send-email-juri.lelli@gmail.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2014-01-25Merge branch 'sched-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Ingo Molnar: "A couple of regression fixes mostly hitting virtualized setups, but also some bare metal systems" * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/x86/tsc: Initialize multiplier to 0 sched/clock: Fixup early initialization sched/preempt/x86: Fix voluntary preempt for x86 Revert "sched: Fix sleep time double accounting in enqueue entity"
2014-01-25Merge branch 'timers/core' of ↵Ingo Molnar
git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks into timers/urgent Pull dynticks cleanups from Frederic Weisbecker. Signed-off-by: Ingo Molnar <mingo@kernel.org>