aboutsummaryrefslogtreecommitdiff
path: root/kernel/cgroup.c
AgeCommit message (Collapse)Author
2013-05-13cgroup: fix an off-by-one bug which may trigger BUG_ON()Li Zefan
commit 3ac1707a13a3da9cfc8f242a15b2fae6df2c5f88 upstream. The 3rd parameter of flex_array_prealloc() is the number of elements, not the index of the last element. The effect of the bug is, when opening cgroup.procs, a flex array will be allocated and all elements of the array is allocated with GFP_KERNEL flag, but the last one is GFP_ATOMIC, and if we fail to allocate memory for it, it'll trigger a BUG_ON(). Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2013-03-06cgroup: fix exit() vs rmdir() raceLi Zefan
commit 71b5707e119653039e6e95213f00479668c79b75 upstream. In cgroup_exit() put_css_set_taskexit() is called without any lock, which might lead to accessing a freed cgroup: thread1 thread2 --------------------------------------------- exit() cgroup_exit() put_css_set_taskexit() atomic_dec(cgrp->count); rmdir(); /* not safe !! */ check_for_release(cgrp); rcu_read_lock() can be used to make sure the cgroup is alive. Signed-off-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2013-01-03cgroup: remove incorrect dget/dput() pair in cgroup_create_dir()Tejun Heo
commit 175431635ec09b1d1bba04979b006b99e8305a83 upstream. cgroup_create_dir() does weird dancing with dentry refcnt. On success, it gets and then puts it achieving nothing. On failure, it puts but there isn't no matching get anywhere leading to the following oops if cgroup_create_file() fails for whatever reason. ------------[ cut here ]------------ kernel BUG at /work/os/work/fs/dcache.c:552! invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC Modules linked in: CPU 2 Pid: 697, comm: mkdir Not tainted 3.7.0-rc4-work+ #3 Bochs Bochs RIP: 0010:[<ffffffff811d9c0c>] [<ffffffff811d9c0c>] dput+0x1dc/0x1e0 RSP: 0018:ffff88001a3ebef8 EFLAGS: 00010246 RAX: 0000000000000000 RBX: ffff88000e5b1ef8 RCX: 0000000000000403 RDX: 0000000000000303 RSI: 2000000000000000 RDI: ffff88000e5b1f58 RBP: ffff88001a3ebf18 R08: ffffffff82c76960 R09: 0000000000000001 R10: ffff880015022080 R11: ffd9bed70f48a041 R12: 00000000ffffffea R13: 0000000000000001 R14: ffff88000e5b1f58 R15: 00007fff57656d60 FS: 00007ff05fcb3800(0000) GS:ffff88001fd00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00000000004046f0 CR3: 000000001315f000 CR4: 00000000000006e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process mkdir (pid: 697, threadinfo ffff88001a3ea000, task ffff880015022080) Stack: ffff88001a3ebf48 00000000ffffffea 0000000000000001 0000000000000000 ffff88001a3ebf38 ffffffff811cc889 0000000000000001 ffff88000e5b1ef8 ffff88001a3ebf68 ffffffff811d1fc9 ffff8800198d7f18 ffff880019106ef8 Call Trace: [<ffffffff811cc889>] done_path_create+0x19/0x50 [<ffffffff811d1fc9>] sys_mkdirat+0x59/0x80 [<ffffffff811d2009>] sys_mkdir+0x19/0x20 [<ffffffff81be1e02>] system_call_fastpath+0x16/0x1b Code: 00 48 8d 90 18 01 00 00 48 89 93 c0 00 00 00 4c 89 a0 18 01 00 00 48 8b 83 a0 00 00 00 83 80 28 01 00 00 01 e8 e6 6f a0 00 eb 92 <0f> 0b 66 90 0f 1f 44 00 00 55 48 89 e5 41 57 41 56 49 89 fe 41 RIP [<ffffffff811d9c0c>] dput+0x1dc/0x1e0 RSP <ffff88001a3ebef8> ---[ end trace 1277bcfd9561ddb0 ]--- Fix it by dropping the unnecessary dget/dput() pair. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2013-01-03cgroup: cgroup_subsys->fork() should be called after the task is added to ↵Tejun Heo
css_set commit 5edee61edeaaebafe584f8fb7074c1ef4658596b upstream. cgroup core has a bug which violates a basic rule about event notifications - when a new entity needs to be added, you add that to the notification list first and then make the new entity conform to the current state. If done in the reverse order, an event happening inbetween will be lost. cgroup_subsys->fork() is invoked way before the new task is added to the css_set. Currently, cgroup_freezer is the only user of ->fork() and uses it to make new tasks conform to the current state of the freezer. If FROZEN state is requested while fork is in progress between cgroup_fork_callbacks() and cgroup_post_fork(), the child could escape freezing - the cgroup isn't frozen when ->fork() is called and the freezer couldn't see the new task on the css_set. This patch moves cgroup_subsys->fork() invocation to cgroup_post_fork() after the new task is added to the css_set. cgroup_fork_callbacks() is removed. Because now a task may be migrated during cgroup_subsys->fork(), freezer_fork() is updated so that it adheres to the usual RCU locking and the rather pointless comment on why locking can be different there is removed (if it doesn't make anything simpler, why even bother?). Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Rafael J. Wysocki <rjw@sisk.pl> [bwh: Backported to 3.2: - Adjust context - Iterate over first CGROUP_BUILTIN_SUBSYS_COUNT elements of subsys - cgroup_subsys::fork takes cgroup_subsys pointer as first parameter] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-10-30cgroup: notify_on_release may not be triggered in some casesDaisuke Nishimura
commit 1f5320d5972aa50d3e8d2b227b636b370e608359 upstream. notify_on_release must be triggered when the last process in a cgroup is move to another. But if the first(and only) process in a cgroup is moved to another, notify_on_release is not triggered. # mkdir /cgroup/cpu/SRC # mkdir /cgroup/cpu/DST # # echo 1 >/cgroup/cpu/SRC/notify_on_release # echo 1 >/cgroup/cpu/DST/notify_on_release # # sleep 300 & [1] 8629 # # echo 8629 >/cgroup/cpu/SRC/tasks # echo 8629 >/cgroup/cpu/DST/tasks -> notify_on_release for /SRC must be triggered at this point, but it isn't. This is because put_css_set() is called before setting CGRP_RELEASABLE in cgroup_task_migrate(), and is a regression introduce by the commit:74a1166d(cgroups: make procs file writable), which was merged into v3.0. Cc: Ben Blum <bblum@andrew.cmu.edu> Acked-by: Li Zefan <lizefan@huawei.com> Signed-off-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> Signed-off-by: Tejun Heo <tj@kernel.org> [bwh: Backported to 3.2: adjust context] Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
2012-01-12cgroup: fix to allow mounting a hierarchy by nameLi Zefan
commit 0d19ea866562e46989412a0676412fa0983c9ce7 upstream. If we mount a hierarchy with a specified name, the name is unique, and we can use it to mount the hierarchy without specifying its set of subsystem names. This feature is documented is Documentation/cgroups/cgroups.txt section 2.3 Here's an example: # mount -t cgroup -o cpuset,name=myhier xxx /cgroup1 # mount -t cgroup -o name=myhier xxx /cgroup2 But it was broken by commit 32a8cf235e2f192eb002755076994525cdbaa35a (cgroup: make the mount options parsing more accurate) This fixes the regression. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Tejun Heo <tj@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2011-12-19cgroups: fix a css_set not found bug in cgroup_attach_procMandeep Singh Baines
There is a BUG when migrating a PF_EXITING proc. Since css_set_prefetch() is not called for the PF_EXITING case, find_existing_css_set() will return NULL inside cgroup_task_migrate() causing a BUG. This bug is easy to reproduce. Create a zombie and echo its pid to cgroup.procs. $ cat zombie.c \#include <unistd.h> int main() { if (fork()) pause(); return 0; } $ We are hitting this bug pretty regularly on ChromeOS. This bug is already fixed by Tejun Heo's cgroup patchset which is targetted for the next merge window: https://lkml.org/lkml/2011/11/1/356 I've create a smaller patch here which just fixes this bug so that a fix can be merged into the current release and stable. Signed-off-by: Mandeep Singh Baines <msb@chromium.org> Downstream-Bug-Report: http://crosbug.com/23953 Reviewed-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Tejun Heo <tj@kernel.org> Cc: containers@lists.linux-foundation.org Cc: cgroups@vger.kernel.org Cc: stable@kernel.org Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Paul Menage <paul@paulmenage.org> Cc: Olof Johansson <olofj@chromium.org>
2011-11-02memcg: replace ss->id_lock with a rwlockAndrew Bresticker
While back-porting Johannes Weiner's patch "mm: memcg-aware global reclaim" for an internal effort, we noticed a significant performance regression during page-reclaim heavy workloads due to high contention of the ss->id_lock. This lock protects idr map, and serializes calls to idr_get_next() in css_get_next() (which is used during the memcg hierarchy walk). Since idr_get_next() is just doing a look up, we need only serialize it with respect to idr_remove()/idr_get_new(). By making the ss->id_lock a rwlock, contention is greatly reduced and performance improves. Tested: cat a 256m file from a ramdisk in a 128m container 50 times on each core (one file + container per core) in parallel on a NUMA machine. Result is the time for the test to complete in 1 of the containers. Both kernels included Johannes' memcg-aware global reclaim patches. Before rwlock patch: 1710.778s After rwlock patch: 152.227s Signed-off-by: Andrew Bresticker <abrestic@google.com> Cc: Paul Menage <menage@gmail.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Ying Han <yinghan@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-11-02cgroups: don't attach task to subsystem if migration failedBen Blum
If a task has exited to the point it has called cgroup_exit() already, then we can't migrate it to another cgroup anymore. This can happen when we are attaching a task to a new cgroup between the call to ->can_attach_task() on subsystems and the migration that is eventually tried in cgroup_task_migrate(). In this case cgroup_task_migrate() returns -ESRCH and we don't want to attach the task to the subsystems because the attachment to the new cgroup itself failed. Fix this by only calling ->attach_task() on the subsystems if the cgroup migration succeeded. Reported-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Ben Blum <bblum@andrew.cmu.edu> Acked-by: Paul Menage <paul@paulmenage.org> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-11-02cgroups: more safe tasklist locking in cgroup_attach_procBen Blum
Fix unstable tasklist locking in cgroup_attach_proc. According to this thread - https://lkml.org/lkml/2011/7/27/243 - RCU is not sufficient to guarantee the tasklist is stable w.r.t. de_thread and exit. Taking tasklist_lock for reading, instead of rcu_read_lock, ensures proper exclusion. Signed-off-by: Ben Blum <bblum@andrew.cmu.edu> Acked-by: Paul Menage <paul@paulmenage.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Neil Brown <neilb@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-09-13locking, sched, cgroups: Annotate release_list_lock as rawThomas Gleixner
The release_list_lock can be taken in atomic context and therefore cannot be preempted on -rt - annotate it. In mainline this change documents the low level nature of the lock - otherwise there's no functional difference. Lockdep and Sparse checking will work as usual. Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-07-27Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6: (54 commits) tpm_nsc: Fix bug when loading multiple TPM drivers tpm: Move tpm_tis_reenable_interrupts out of CONFIG_PNP block tpm: Fix compilation warning when CONFIG_PNP is not defined TOMOYO: Update kernel-doc. tpm: Fix a typo tpm_tis: Probing function for Intel iTPM bug tpm_tis: Fix the probing for interrupts tpm_tis: Delay ACPI S3 suspend while the TPM is busy tpm_tis: Re-enable interrupts upon (S3) resume tpm: Fix display of data in pubek sysfs entry tpm_tis: Add timeouts sysfs entry tpm: Adjust interface timeouts if they are too small tpm: Use interface timeouts returned from the TPM tpm_tis: Introduce durations sysfs entry tpm: Adjust the durations if they are too small tpm: Use durations returned from TPM TOMOYO: Enable conditional ACL. TOMOYO: Allow using argv[]/envp[] of execve() as conditions. TOMOYO: Allow using executable's realpath and symlink's target as conditions. TOMOYO: Allow using owner/group etc. of file objects as conditions. ... Fix up trivial conflict in security/tomoyo/realpath.c
2011-07-26atomic: use <linux/atomic.h>Arun Sharma
This allows us to move duplicated code in <asm/atomic.h> (atomic_inc_not_zero() for now) to <linux/atomic.h> Signed-off-by: Arun Sharma <asharma@fb.com> Reviewed-by: Eric Dumazet <eric.dumazet@gmail.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: David Miller <davem@davemloft.net> Cc: Eric Dumazet <eric.dumazet@gmail.com> Acked-by: Mike Frysinger <vapier@gentoo.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-07-25Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (43 commits) fs: Merge split strings treewide: fix potentially dangerous trailing ';' in #defined values/expressions uwb: Fix misspelling of neighbourhood in comment net, netfilter: Remove redundant goto in ebt_ulog_packet trivial: don't touch files that are removed in the staging tree lib/vsprintf: replace link to Draft by final RFC number doc: Kconfig: `to be' -> `be' doc: Kconfig: Typo: square -> squared doc: Konfig: Documentation/power/{pm => apm-acpi}.txt drivers/net: static should be at beginning of declaration drivers/media: static should be at beginning of declaration drivers/i2c: static should be at beginning of declaration XTENSA: static should be at beginning of declaration SH: static should be at beginning of declaration MIPS: static should be at beginning of declaration ARM: static should be at beginning of declaration rcu: treewide: Do not use rcu_read_lock_held when calling rcu_dereference_check Update my e-mail address PCIe ASPM: forcedly -> forcibly gma500: push through device driver tree ... Fix up trivial conflicts: - arch/arm/mach-ep93xx/dma-m2p.c (deleted) - drivers/gpio/gpio-ep93xx.c (renamed and context nearby) - drivers/net/r8169.c (just context changes)
2011-07-20kill file_permission() completelyAl Viro
convert the last remaining caller to inode_permission() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-07-08rcu: treewide: Do not use rcu_read_lock_held when calling rcu_dereference_checkMichal Hocko
Since ca5ecddf (rcu: define __rcu address space modifier for sparse) rcu_dereference_check use rcu_read_lock_held as a part of condition automatically so callers do not have to do that as well. Signed-off-by: Michal Hocko <mhocko@suse.cz> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2011-06-09cgroupfs: use init_cred when populating new cgroupfs mounteparis@redhat
We recently found that in some configurations SELinux was blocking the ability for cgroupfs to be mounted. The reason for this is because cgroupfs creates files and directories during the get_sb() call and also uses lookup_one_len() during that same get_sb() call. This is a problem since the security subsystem cannot initialize the superblock and the inodes in that filesystem until after the get_sb() call returns. Thus we leave the inodes in an unitialized state during get_sb(). For the vast majority of filesystems this is not an issue, but since cgroupfs uses lookup_on_len() it does search permission checks on the directories in the path it walks. Since the inode security state is not set up SELinux does these checks as if the inodes were 'unlabeled.' Many 'normal' userspace process do not have permission to interact with unlabeled inodes. The solution presented here is to do the permission checks of path walk and inode creation as the kernel rather than as the task that called mount. Since the kernel has permission to read/write/create unlabeled inodes the get_sb() call will complete successfully and the SELinux code will be able to initialize the superblock and those inodes created during the get_sb() call. This appears to be the same solution used by other filesystems such as devtmpfs to solve the same issue and should thus have no negative impact on other LSMs which currently work. Signed-off-by: Eric Paris <eparis@redhat.com> Acked-by: Paul Menage <menage@google.com> Signed-off-by: James Morris <jmorris@namei.org>
2011-05-26cgroup: remove the ns_cgroupDaniel Lezcano
The ns_cgroup is an annoying cgroup at the namespace / cgroup frontier and leads to some problems: * cgroup creation is out-of-control * cgroup name can conflict when pids are looping * it is not possible to have a single process handling a lot of namespaces without falling in a exponential creation time * we may want to create a namespace without creating a cgroup The ns_cgroup was replaced by a compatibility flag 'clone_children', where a newly created cgroup will copy the parent cgroup values. The userspace has to manually create a cgroup and add a task to the 'tasks' file. This patch removes the ns_cgroup as suggested in the following thread: https://lists.linux-foundation.org/pipermail/containers/2009-June/018616.html The 'cgroup_clone' function is removed because it is no longer used. This is a userspace-visible change. Commit 45531757b45c ("cgroup: notify ns_cgroup deprecated") (merged into 2.6.27) caused the kernel to emit a printk warning users that the feature is planned for removal. Since that time we have heard from XXX users who were affected by this. Signed-off-by: Daniel Lezcano <daniel.lezcano@free.fr> Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Jamal Hadi Salim <hadi@cyberus.ca> Reviewed-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Paul Menage <menage@google.com> Acked-by: Matt Helsley <matthltc@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-26cgroups: use flex_array in attach_procBen Blum
Convert cgroup_attach_proc to use flex_array. The cgroup_attach_proc implementation requires a pre-allocated array to store task pointers to atomically move a thread-group, but asking for a monolithic array with kmalloc() may be unreliable for very large groups. Using flex_array provides the same functionality with less risk of failure. This is a post-patch for cgroup-procs-write.patch. Signed-off-by: Ben Blum <bblum@andrew.cmu.edu> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Matt Helsley <matthltc@us.ibm.com> Reviewed-by: Paul Menage <menage@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-26cgroups: make procs file writableBen Blum
Make procs file writable to move all threads by tgid at once. Add functionality that enables users to move all threads in a threadgroup at once to a cgroup by writing the tgid to the 'cgroup.procs' file. This current implementation makes use of a per-threadgroup rwsem that's taken for reading in the fork() path to prevent newly forking threads within the threadgroup from "escaping" while the move is in progress. Signed-off-by: Ben Blum <bblum@andrew.cmu.edu> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Matt Helsley <matthltc@us.ibm.com> Reviewed-by: Paul Menage <menage@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-26cgroups: add per-thread subsystem callbacksBen Blum
Add cgroup subsystem callbacks for per-thread attachment in atomic contexts Add can_attach_task(), pre_attach(), and attach_task() as new callbacks for cgroups's subsystem interface. Unlike can_attach and attach, these are for per-thread operations, to be called potentially many times when attaching an entire threadgroup. Also, the old "bool threadgroup" interface is removed, as replaced by this. All subsystems are modified for the new interface - of note is cpuset, which requires from/to nodemasks for attach to be globally scoped (though per-cpuset would work too) to persist from its pre_attach to attach_task and attach. This is a pre-patch for cgroup-procs-writable.patch. Signed-off-by: Ben Blum <bblum@andrew.cmu.edu> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Matt Helsley <matthltc@us.ibm.com> Reviewed-by: Paul Menage <menage@google.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: David Rientjes <rientjes@google.com> Cc: Miao Xie <miaox@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-05-07cgroup,rcu: convert call_rcu(__free_css_id_cb) to kfree_rcu()Lai Jiangshan
The rcu callback __free_css_id_cb() just calls a kfree(), so we use kfree_rcu() instead of the call_rcu(__free_css_id_cb). Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Acked-by: Paul Menage <menage@google.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2011-05-07cgroup,rcu: convert call_rcu(free_cgroup_rcu) to kfree_rcu()Lai Jiangshan
The rcu callback free_cgroup_rcu() just calls a kfree(), so we use kfree_rcu() instead of the call_rcu(free_cgroup_rcu). Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Acked-by: Paul Menage <menage@google.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2011-05-07cgroup,rcu: convert call_rcu(free_css_set_rcu) to kfree_rcu()Lai Jiangshan
The rcu callback free_css_set_rcu() just calls a kfree(), so we use kfree_rcu() instead of the call_rcu(free_css_set_rcu). Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Acked-by: Paul Menage <menage@google.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2011-03-31Fix common misspellingsLucas De Marchi
Fixes generated by 'codespell' and manually reviewed. Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>
2011-03-22cgroups: if you list_empty() a head then don't list_del() itPhil Carmody
list_del() leaves poison in the prev and next pointers. The next list_empty() will compare those poisons, and say the list isn't empty. Any list operations that assume the node is on a list because of such a check will be fooled into dereferencing poison. One needs to INIT the node after the del, and fortunately there's already a wrapper for that - list_del_init(). Some of the dels are followed by deallocations, so can be ignored, and one can be merged with an add to make a move. Apart from that, I erred on the side of caution in making nodes list_empty()-queriable. Signed-off-by: Phil Carmody <ext-phil.2.carmody@nokia.com> Reviewed-by: Paul Menage <menage@google.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Kirill A. Shutemov <kirill@shutemov.name> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-02-16perf: Add cgroup supportStephane Eranian
This kernel patch adds the ability to filter monitoring based on container groups (cgroups). This is for use in per-cpu mode only. The cgroup to monitor is passed as a file descriptor in the pid argument to the syscall. The file descriptor must be opened to the cgroup name in the cgroup filesystem. For instance, if the cgroup name is foo and cgroupfs is mounted in /cgroup, then the file descriptor is opened to /cgroup/foo. Cgroup mode is activated by passing PERF_FLAG_PID_CGROUP in the flags argument to the syscall. For instance to measure in cgroup foo on CPU1 assuming cgroupfs is mounted under /cgroup: struct perf_event_attr attr; int cgroup_fd, fd; cgroup_fd = open("/cgroup/foo", O_RDONLY); fd = perf_event_open(&attr, cgroup_fd, 1, -1, PERF_FLAG_PID_CGROUP); close(cgroup_fd); Signed-off-by: Stephane Eranian <eranian@google.com> [ added perf_cgroup_{exit,attach} ] Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <4d590250.114ddf0a.689e.4482@mx.google.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-02-16cgroup: Fix cgroup_subsys::exit callbackPeter Zijlstra
Make the ::exit method act like ::attach, it is after all very nearly the same thing. The bug had no effect on correctness - fixing it is an optimization for the scheduler. Also, later perf-cgroups patches rely on it. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Paul Menage <menage@google.com> LKML-Reference: <1297160655.13327.92.camel@laptop> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-01-14Merge branch 'vfs-scale-working' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin * 'vfs-scale-working' of git://git.kernel.org/pub/scm/linux/kernel/git/npiggin/linux-npiggin: kernel: fix hlist_bl again cgroups: Fix a lockdep warning at cgroup removal fs: namei fix ->put_link on wrong inode in do_filp_open
2011-01-14cgroup_fs: fix cgroup use of simple_lookup()Al Viro
cgroup can't use simple_lookup(), since that'd override its desired ->d_op. Tested-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-01-14cgroups: Fix a lockdep warning at cgroup removalLi Zefan
Commit 2fd6b7f5 ("fs: dcache scale subdirs") forgot to annotate a dentry lock, which caused a lockdep warning. Reported-by: Valdis Kletnieks <Valdis.Kletnieks@vt.edu> Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
2011-01-12switch cgroupAl Viro
switching it to s_d_op allows to kill the cgroup_lookup() kludge. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2011-01-07fs: dcache reduce branches in lookup pathNick Piggin
Reduce some branches and memory accesses in dcache lookup by adding dentry flags to indicate common d_ops are set, rather than having to check them. This saves a pointer memory access (dentry->d_op) in common path lookup situations, and saves another pointer load and branch in cases where we have d_op but not the particular operation. Patched with: git grep -E '[.>]([[:space:]])*d_op([[:space:]])*=' | xargs sed -e 's/\([^\t ]*\)->d_op = \(.*\);/d_set_d_op(\1, \2);/' -e 's/\([^\t ]*\)\.d_op = \(.*\);/d_set_d_op(\&\1, \2);/' -i Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07fs: dcache rationalise dget variantsNick Piggin
dget_locked was a shortcut to avoid the lazy lru manipulation when we already held dcache_lock (lru manipulation was relatively cheap at that point). However, how that the lru lock is an innermost one, we never hold it at any caller, so the lock cost can now be avoided. We already have well working lazy dcache LRU, so it should be fine to defer LRU manipulations to scan time. Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07fs: dcache remove dcache_lockNick Piggin
dcache_lock no longer protects anything. remove it. Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07fs: dcache scale subdirsNick Piggin
Protect d_subdirs and d_child with d_lock, except in filesystems that aren't using dcache_lock for these anyway (eg. using i_mutex). Note: if we change the locking rule in future so that ->d_child protection is provided only with ->d_parent->d_lock, it may allow us to reduce some locking. But it would be an exception to an otherwise regular locking scheme, so we'd have to see some good results. Probably not worthwhile. Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07fs: dcache scale dentry refcountNick Piggin
Make d_count non-atomic and protect it with d_lock. This allows us to ensure a 0 refcount dentry remains 0 without dcache_lock. It is also fairly natural when we start protecting many other dentry members with d_lock. Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07fs: change d_delete semanticsNick Piggin
Change d_delete from a dentry deletion notification to a dentry caching advise, more like ->drop_inode. Require it to be constant and idempotent, and not take d_lock. This is how all existing filesystems use the callback anyway. This makes fine grained dentry locking of dput and dentry lru scanning much simpler. Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2011-01-07cgroup fs: avoid switching ->d_op on live dentryNick Piggin
Switching d_op on a live dentry is racy in general, so avoid it. In this case it is a negative dentry, which is safer, but there are still concurrent ops which may be called on d_op in that case (eg. d_revalidate). So in general a filesystem may not do this. Fix cgroupfs so as not to do this. Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2010-10-29convert cgroup and cpusetAl Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-10-27cgroups: add check for strcpy destination string overflowEvgeny Kuznetsov
Function "strcpy" is used without check for maximum allowed source string length and could cause destination string overflow. Check for string length is added before using "strcpy". Function now is return error if source string length is more than a maximum. akpm: presently considered NotABug, but add the check for general future-safeness and robustness. Signed-off-by: Evgeny Kuznetsov <EXT-Eugeny.Kuznetsov@nokia.com> Acked-by: Paul Menage <menage@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-27cgroup: make the mount options parsing more accurateDaniel Lezcano
Current behavior: ================= (1) When we mount a cgroup, we can specify the 'all' option which means to enable all the cgroup subsystems. This is the default option when no option is specified. (2) If we want to mount a cgroup with a subset of the supported cgroup subsystems, we have to specify a subsystems name list for the mount option. (3) If we specify another option like 'noprefix' or 'release_agent', the actual code wants the 'all' or a subsystem name option specified also. Not critical but a bit not friendly as we should assume (1) in this case. (4) Logically, the 'all' option is mutually exclusive with a subsystem name, but this is not detected. In other words: succeed : mount -t cgroup -o all,freezer cgroup /cgroup => is it 'all' or 'freezer' ? fails : mount -t cgroup -o noprefix cgroup /cgroup => succeed if we do '-o noprefix,all' The following patches consolidate a bit the mount options check. New behavior: ============= (1) untouched (2) untouched (3) the 'all' option will be by default when specifying other than a subsystem name option (4) raises an error In other words: fails : mount -t cgroup -o all,freezer cgroup /cgroup succeed : mount -t cgroup -o noprefix cgroup /cgroup For the sake of lisibility, the if ... then ... else ... if ... indentation when parsing the options has been changed to: if ... then ... continue fi Signed-off-by: Daniel Lezcano <daniel.lezcano@free.fr> Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com> Reviewed-by: Li Zefan <lizf@cn.fujitsu.com> Reviewed-by: Paul Menage <menage@google.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Jamal Hadi Salim <hadi@cyberus.ca> Cc: Matt Helsley <matthltc@us.ibm.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-27cgroup: add clone_children control fileDaniel Lezcano
The ns_cgroup is a control group interacting with the namespaces. When a new namespace is created, a corresponding cgroup is automatically created too. The cgroup name is the pid of the process who did 'unshare' or the child of 'clone'. This cgroup is tied with the namespace because it prevents a process to escape the control group and use the post_clone callback, so the child cgroup inherits the values of the parent cgroup. Unfortunately, the more we use this cgroup and the more we are facing problems with it: (1) when a process unshares, the cgroup name may conflict with a previous cgroup with the same pid, so unshare or clone return -EEXIST (2) the cgroup creation is out of control because there may have an application creating several namespaces where the system will automatically create several cgroups in his back and let them on the cgroupfs (eg. a vrf based on the network namespace). (3) the mix of (1) and (2) force an administrator to regularly check and clean these cgroups. This patchset removes the ns_cgroup by adding a new flag to the cgroup and the cgroupfs mount option. It enables the copy of the parent cgroup when a child cgroup is created. We can then safely remove the ns_cgroup as this flag brings a compatibility. We have now to manually create and add the task to a cgroup, which is consistent with the cgroup framework. This patch: Sent as an answer to a previous thread around the ns_cgroup. https://lists.linux-foundation.org/pipermail/containers/2009-June/018627.html It adds a control file 'clone_children' for a cgroup. This control file is a boolean specifying if the child cgroup should be a clone of the parent cgroup or not. The default value is 'false'. This flag makes the child cgroup to call the post_clone callback of all the subsystem, if it is available. At present, the cpuset is the only one which had implemented the post_clone callback. The option can be set at mount time by specifying the 'clone_children' mount option. Signed-off-by: Daniel Lezcano <daniel.lezcano@free.fr> Signed-off-by: Serge E. Hallyn <serge.hallyn@canonical.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Acked-by: Paul Menage <menage@google.com> Reviewed-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Jamal Hadi Salim <hadi@cyberus.ca> Cc: Matt Helsley <matthltc@us.ibm.com> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-10-25fs: do not assign default i_ino in new_inodeChristoph Hellwig
Instead of always assigning an increasing inode number in new_inode move the call to assign it into those callers that actually need it. For now callers that need it is estimated conservatively, that is the call is added to all filesystems that do not assign an i_ino by themselves. For a few more filesystems we can avoid assigning any inode number given that they aren't user visible, and for others it could be done lazily when an inode number is actually needed, but that's left for later patches. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-10-22Merge branch 'vfs' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/bklLinus Torvalds
* 'vfs' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/bkl: (30 commits) BKL: remove BKL from freevxfs BKL: remove BKL from qnx4 autofs4: Only declare function when CONFIG_COMPAT is defined autofs: Only declare function when CONFIG_COMPAT is defined ncpfs: Lock socket in ncpfs while setting its callbacks fs/locks.c: prepare for BKL removal BKL: Remove BKL from ncpfs BKL: Remove BKL from OCFS2 BKL: Remove BKL from squashfs BKL: Remove BKL from jffs2 BKL: Remove BKL from ecryptfs BKL: Remove BKL from afs BKL: Remove BKL from USB gadgetfs BKL: Remove BKL from autofs4 BKL: Remove BKL from isofs BKL: Remove BKL from fat BKL: Remove BKL from ext2 filesystem BKL: Remove BKL from do_new_mount() BKL: Remove BKL from cgroup BKL: Remove BKL from NTFS ...
2010-10-07Merge branch 'rcu/urgent' of ↵Ingo Molnar
git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-2.6-rcu into core/rcu
2010-10-04BKL: Remove BKL from cgroupJan Blunck
The BKL is only used in remount_fs and get_sb that are both protected by the superblocks s_umount rw_semaphore. Therefore it is safe to remove the BKL entirely. Signed-off-by: Jan Blunck <jblunck@infradead.org> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2010-10-04BKL: Explicitly add BKL around get_sb/fill_superJan Blunck
This patch is a preparation necessary to remove the BKL from do_new_mount(). It explicitly adds calls to lock_kernel()/unlock_kernel() around get_sb/fill_super operations for filesystems that still uses the BKL. I've read through all the code formerly covered by the BKL inside do_kern_mount() and have satisfied myself that it doesn't need the BKL any more. do_kern_mount() is already called without the BKL when mounting the rootfs and in nfsctl. do_kern_mount() calls vfs_kern_mount(), which is called from various places without BKL: simple_pin_fs(), nfs_do_clone_mount() through nfs_follow_mountpoint(), afs_mntpt_do_automount() through afs_mntpt_follow_link(). Both later functions are actually the filesystems follow_link inode operation. vfs_kern_mount() is calling the specified get_sb function and lets the filesystem do its job by calling the given fill_super function. Therefore I think it is safe to push down the BKL from the VFS to the low-level filesystems get_sb/fill_super operation. [arnd: do not add the BKL to those file systems that already don't use it elsewhere] Signed-off-by: Jan Blunck <jblunck@infradead.org> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Cc: Matthew Wilcox <matthew@wil.cx> Cc: Christoph Hellwig <hch@infradead.org>
2010-09-09cgroups: fix API thinkoMichael S. Tsirkin
Add cgroup_attach_task_all() The existing cgroup_attach_task_current_cg() API is called by a thread to attach another thread to all of its cgroups; this is unsuitable for cases where a privileged task wants to attach itself to the cgroups of a less privileged one, since the call must be made from the context of the target task. This patch adds a more generic cgroup_attach_task_all() API that allows both the source task and to-be-moved task to be specified. cgroup_attach_task_current_cg() becomes a specialization of the more generic new function. [menage@google.com: rewrote changelog] [akpm@linux-foundation.org: address reviewer comments] Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Tested-by: Alex Williamson <alex.williamson@redhat.com> Acked-by: Paul Menage <menage@google.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Ben Blum <bblum@google.com> Cc: Sridhar Samudrala <sri@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-08-19cgroups: __rcu annotationsArnd Bergmann
Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Acked-by: Paul Menage <menage@google.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>