aboutsummaryrefslogtreecommitdiff
path: root/fs/proc
AgeCommit message (Collapse)Author
2010-08-10Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (96 commits) no need for list_for_each_entry_safe()/resetting with superblock list Fix sget() race with failing mount vfs: don't hold s_umount over close_bdev_exclusive() call sysv: do not mark superblock dirty on remount sysv: do not mark superblock dirty on mount btrfs: remove junk sb_dirt change BFS: clean up the superblock usage AFFS: wait for sb synchronization when needed AFFS: clean up dirty flag usage cifs: truncate fallout mbcache: fix shrinker function return value mbcache: Remove unused features add f_flags to struct statfs(64) pass a struct path to vfs_statfs update VFS documentation for method changes. All filesystems that need invalidate_inode_buffers() are doing that explicitly convert remaining ->clear_inode() to ->evict_inode() Make ->drop_inode() just return whether inode needs to be dropped fs/inode.c:clear_inode() is gone fs/inode.c:evict() doesn't care about delete vs. non-delete paths now ... Fix up trivial conflicts in fs/nilfs2/super.c
2010-08-09oom: deprecate oom_adj tunableDavid Rientjes
/proc/pid/oom_adj is now deprecated so that that it may eventually be removed. The target date for removal is August 2012. A warning will be printed to the kernel log if a task attempts to use this interface. Future warning will be suppressed until the kernel is rebooted to prevent spamming the kernel log. Signed-off-by: David Rientjes <rientjes@google.com> Cc: Nick Piggin <npiggin@suse.de> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Balbir Singh <balbir@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-08-09oom: badness heuristic rewriteDavid Rientjes
This a complete rewrite of the oom killer's badness() heuristic which is used to determine which task to kill in oom conditions. The goal is to make it as simple and predictable as possible so the results are better understood and we end up killing the task which will lead to the most memory freeing while still respecting the fine-tuning from userspace. Instead of basing the heuristic on mm->total_vm for each task, the task's rss and swap space is used instead. This is a better indication of the amount of memory that will be freeable if the oom killed task is chosen and subsequently exits. This helps specifically in cases where KDE or GNOME is chosen for oom kill on desktop systems instead of a memory hogging task. The baseline for the heuristic is a proportion of memory that each task is currently using in memory plus swap compared to the amount of "allowable" memory. "Allowable," in this sense, means the system-wide resources for unconstrained oom conditions, the set of mempolicy nodes, the mems attached to current's cpuset, or a memory controller's limit. The proportion is given on a scale of 0 (never kill) to 1000 (always kill), roughly meaning that if a task has a badness() score of 500 that the task consumes approximately 50% of allowable memory resident in RAM or in swap space. The proportion is always relative to the amount of "allowable" memory and not the total amount of RAM systemwide so that mempolicies and cpusets may operate in isolation; they shall not need to know the true size of the machine on which they are running if they are bound to a specific set of nodes or mems, respectively. Root tasks are given 3% extra memory just like __vm_enough_memory() provides in LSMs. In the event of two tasks consuming similar amounts of memory, it is generally better to save root's task. Because of the change in the badness() heuristic's baseline, it is also necessary to introduce a new user interface to tune it. It's not possible to redefine the meaning of /proc/pid/oom_adj with a new scale since the ABI cannot be changed for backward compatability. Instead, a new tunable, /proc/pid/oom_score_adj, is added that ranges from -1000 to +1000. It may be used to polarize the heuristic such that certain tasks are never considered for oom kill while others may always be considered. The value is added directly into the badness() score so a value of -500, for example, means to discount 50% of its memory consumption in comparison to other tasks either on the system, bound to the mempolicy, in the cpuset, or sharing the same memory controller. /proc/pid/oom_adj is changed so that its meaning is rescaled into the units used by /proc/pid/oom_score_adj, and vice versa. Changing one of these per-task tunables will rescale the value of the other to an equivalent meaning. Although /proc/pid/oom_adj was originally defined as a bitshift on the badness score, it now shares the same linear growth as /proc/pid/oom_score_adj but with different granularity. This is required so the ABI is not broken with userspace applications and allows oom_adj to be deprecated for future removal. Signed-off-by: David Rientjes <rientjes@google.com> Cc: Nick Piggin <npiggin@suse.de> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Balbir Singh <balbir@in.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-08-09oom: move badness() declaration into oom.hAndrew Morton
Cc: Minchan Kim <minchan.kim@gmail.com> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-08-09oom: /proc/<pid>/oom_score treat kernel thread honestlyKOSAKI Motohiro
If a kernel thread is using use_mm(), badness() returns a positive value. This is not a big issue because caller take care of it correctly. But there is one exception, /proc/<pid>/oom_score calls badness() directly and doesn't care that the task is a regular process. Another example, /proc/1/oom_score return !0 value. But it's unkillable. This incorrectness makes administration a little confusing. This patch fixes it. Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Minchan Kim <minchan.kim@gmail.com> Cc: David Rientjes <rientjes@google.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-08-09switch procfs to ->evict_inode()Al Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-08-09remove inode_setattrChristoph Hellwig
Replace inode_setattr with opencoded variants of it in all callers. This moves the remaining call to vmtruncate into the filesystem methods where it can be replaced with the proper truncate sequence. In a few cases it was obvious that we would never end up calling vmtruncate so it was left out in the opencoded variant: spufs: explicitly checks for ATTR_SIZE earlier btrfs,hugetlbfs,logfs,dlmfs: explicitly clears ATTR_SIZE earlier ufs: contains an opencoded simple_seattr + truncate that sets the filesize just above In addition to that ncpfs called inode_setattr with handcrafted iattrs, which allowed to trim down the opencoded variant. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-07-29CRED: Fix get_task_cred() and task_state() to not resurrect dead credentialsDavid Howells
It's possible for get_task_cred() as it currently stands to 'corrupt' a set of credentials by incrementing their usage count after their replacement by the task being accessed. What happens is that get_task_cred() can race with commit_creds(): TASK_1 TASK_2 RCU_CLEANER -->get_task_cred(TASK_2) rcu_read_lock() __cred = __task_cred(TASK_2) -->commit_creds() old_cred = TASK_2->real_cred TASK_2->real_cred = ... put_cred(old_cred) call_rcu(old_cred) [__cred->usage == 0] get_cred(__cred) [__cred->usage == 1] rcu_read_unlock() -->put_cred_rcu() [__cred->usage == 1] panic() However, since a tasks credentials are generally not changed very often, we can reasonably make use of a loop involving reading the creds pointer and using atomic_inc_not_zero() to attempt to increment it if it hasn't already hit zero. If successful, we can safely return the credentials in the knowledge that, even if the task we're accessing has released them, they haven't gone to the RCU cleanup code. We then change task_state() in procfs to use get_task_cred() rather than calling get_cred() on the result of __task_cred(), as that suffers from the same problem. Without this change, a BUG_ON in __put_cred() or in put_cred_rcu() can be tripped when it is noticed that the usage count is not zero as it ought to be, for example: kernel BUG at kernel/cred.c:168! invalid opcode: 0000 [#1] SMP last sysfs file: /sys/kernel/mm/ksm/run CPU 0 Pid: 2436, comm: master Not tainted 2.6.33.3-85.fc13.x86_64 #1 0HR330/OptiPlex 745 RIP: 0010:[<ffffffff81069881>] [<ffffffff81069881>] __put_cred+0xc/0x45 RSP: 0018:ffff88019e7e9eb8 EFLAGS: 00010202 RAX: 0000000000000001 RBX: ffff880161514480 RCX: 00000000ffffffff RDX: 00000000ffffffff RSI: ffff880140c690c0 RDI: ffff880140c690c0 RBP: ffff88019e7e9eb8 R08: 00000000000000d0 R09: 0000000000000000 R10: 0000000000000001 R11: 0000000000000040 R12: ffff880140c690c0 R13: ffff88019e77aea0 R14: 00007fff336b0a5c R15: 0000000000000001 FS: 00007f12f50d97c0(0000) GS:ffff880007400000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f8f461bc000 CR3: 00000001b26ce000 CR4: 00000000000006f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process master (pid: 2436, threadinfo ffff88019e7e8000, task ffff88019e77aea0) Stack: ffff88019e7e9ec8 ffffffff810698cd ffff88019e7e9ef8 ffffffff81069b45 <0> ffff880161514180 ffff880161514480 ffff880161514180 0000000000000000 <0> ffff88019e7e9f28 ffffffff8106aace 0000000000000001 0000000000000246 Call Trace: [<ffffffff810698cd>] put_cred+0x13/0x15 [<ffffffff81069b45>] commit_creds+0x16b/0x175 [<ffffffff8106aace>] set_current_groups+0x47/0x4e [<ffffffff8106ac89>] sys_setgroups+0xf6/0x105 [<ffffffff81009b02>] system_call_fastpath+0x16/0x1b Code: 48 8d 71 ff e8 7e 4e 15 00 85 c0 78 0b 8b 75 ec 48 89 df e8 ef 4a 15 00 48 83 c4 18 5b c9 c3 55 8b 07 8b 07 48 89 e5 85 c0 74 04 <0f> 0b eb fe 65 48 8b 04 25 00 cc 00 00 48 3b b8 58 04 00 00 75 RIP [<ffffffff81069881>] __put_cred+0xc/0x45 RSP <ffff88019e7e9eb8> ---[ end trace df391256a100ebdd ]--- Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Jiri Olsa <jolsa@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-06-29nommu: add '[stack]' label to /proc/pid/maps outputMike Frysinger
Add support to the NOMMU /proc/pid/maps file to show which mapping is the stack of the original thread after execve. This is largely based on the MMU code. Subsidiary thread stacks are not indicated. For FDPIC, we now get: root:/> cat /proc/self/maps 02064000-02067ccc rw-p 0004d000 00:01 22 /bin/busybox 0206e000-0206f35c rw-p 00006000 00:01 295 /lib/ld-uClibc.so.0 025f0000-025f6f0c r-xs 00000000 00:01 295 /lib/ld-uClibc.so.0 02680000-026ba6b0 r-xs 00000000 00:01 297 /lib/libc.so.0 02700000-0274d384 r-xs 00000000 00:01 22 /bin/busybox 02816000-02817000 rw-p 00000000 00:00 0 02848000-0284c0d8 rw-p 00000000 00:00 0 02860000-02880000 rw-p 00000000 00:00 0 [stack] The semi-downside here is that for FLAT, we get: root:/> cat /proc/155/maps 029f0000-029f9000 rwxp 00000000 00:00 0 [stack] The reason being that FLAT combines a whole lot of stuff into one map (including the stack). But this isn't any worse than the current output (which is nothing), so screw it. Signed-off-by: Mike Frysinger <vapier@gentoo.org> Signed-off-by: David Howells <dhowells@redhat.com> Cc: Greg Ungerer <gerg@snapgear.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-06-13of: Drop properties with "/" in their nameMichael Ellerman
Some bogus firmwares include properties with "/" in their name. This causes problems when creating the /proc/device-tree file system, because the slash is taken to indicate a directory. We don't care about those properties, and we don't want to encourage them, so just throw them away when creating /proc/device-tree. Signed-off-by: Michael Ellerman <michael@ellerman.id.au> Tested-by: Christian Kujau <lists@nerdbynature.de> Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
2010-05-27kcore: add _text to KCORE_TEXTWu Fengguang
Extend KCORE_TEXT to cover the pages between _text and _stext, to allow examining some important page table pages. `readelf -a` output on x86_64 before and after patch: Type Offset VirtAddr PhysAddr before LOAD 0x00007fff8100c000 0xffffffff81009000 0x0000000000000000 after LOAD 0x00007fff81003000 0xffffffff81000000 0x0000000000000000 The newly covered pages are: 0xffffffff81000000 <startup_64> etc. 0xffffffff81001000 <init_level4_pgt> 0xffffffff81002000 <level3_ident_pgt> 0xffffffff81003000 <level3_kernel_pgt> 0xffffffff81004000 <level2_fixmap_pgt> 0xffffffff81005000 <level1_fixmap_pgt> 0xffffffff81006000 <level2_ident_pgt> 0xffffffff81007000 <level2_kernel_pgt> 0xffffffff81008000 <level2_spare_pgt> Before patch, /proc/kcore shows outdated contents for the above page table pages, for example: (gdb) p level3_ident_pgt $1 = {<text variable, no debug info>} 0xffffffff81002000 <level3_ident_pgt> (gdb) p/x *((pud_t *)&level3_ident_pgt)@512 $2 = {{pud = 0x1006063}, {pud = 0x0} <repeats 511 times>} while the real content is: root@hp /home/wfg# hexdump -s 0x1002000 -n 4096 /dev/mem 1002000 6063 0100 0000 0000 8067 0000 0000 0000 1002010 0000 0000 0000 0000 0000 0000 0000 0000 * 1003000 That is, on a x86_64 box with 2GB memory, we can see first-1GB / full-2GB identity mapping before/after patch: (gdb) p/x *((pud_t *)&level3_ident_pgt)@512 before $1 = {{pud = 0x1006063}, {pud = 0x0} <repeats 511 times>} after $1 = {{pud = 0x1006063}, {pud = 0x8067}, {pud = 0x0} <repeats 510 times>} Obviously the content before patch is wrong. Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Cc: Andi Kleen <andi@firstfloor.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-05-27proc: remove obsolete commentsAmerigo Wang
A quick test shows these comments are obsolete, so just remove them. Signed-off-by: WANG Cong <amwang@redhat.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-05-27proc: cleanup: remove unused assignmentsDan Carpenter
I removed 3 unused assignments. The first two get reset on the first statement of their functions. For "err" in root.c we don't return an error and we don't use the variable again. Signed-off-by: Dan Carpenter <error27@gmail.com> Cc: Oleg Nesterov <oleg@redhat.com> Acked-by: Serge Hallyn <serue@us.ibm.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-05-27proc: get_nr_threads() doesn't need ->siglock any longerOleg Nesterov
Now that task->signal can't go away get_nr_threads() doesn't need ->siglock to read signal->count. Also, make it inline, move into sched.h, and convert 2 other proc users of signal->count to use this (now trivial) helper. Henceforth get_nr_threads() is the only valid user of signal->count, we are ready to turn it into "int nr_threads" or, perhaps, kill it. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: David Howells <dhowells@redhat.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Roland McGrath <roland@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-05-25pagemap: add #ifdefs CONFIG_HUGETLB_PAGE on code walking hugetlb vmaNaoya Horiguchi
If !CONFIG_HUGETLB_PAGE, pagemap_hugetlb_range() is never called. So put it (and its calling function) into #ifdef block. Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Acked-by: Matt Mackall <mpm@selenic.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-05-19Merge branch 'bkl/procfs' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/frederic/random-tracing * 'bkl/procfs' of git://git.kernel.org/pub/scm/linux/kernel/git/frederic/random-tracing: sunrpc: Include missing smp_lock.h procfs: Kill the bkl in ioctl procfs: Push down the bkl from ioctl procfs: Use generic_file_llseek in /proc/vmcore procfs: Use generic_file_llseek in /proc/kmsg procfs: Use generic_file_llseek in /proc/kcore procfs: Kill BKL in llseek on proc base
2010-05-17procfs: Kill the bkl in ioctlFrederic Weisbecker
There are no more users of procfs that implement the ioctl callback. Drop the bkl from this path and warn on any use of this callback. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: John Kacur <jkacur@redhat.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Al Viro <viro@ZenIV.linux.org.uk>
2010-05-11revert "procfs: provide stack information for threads" and its fixup commitsRobin Holt
Originally, commit d899bf7b ("procfs: provide stack information for threads") attempted to introduce a new feature for showing where the threadstack was located and how many pages are being utilized by the stack. Commit c44972f1 ("procfs: disable per-task stack usage on NOMMU") was applied to fix the NO_MMU case. Commit 89240ba0 ("x86, fs: Fix x86 procfs stack information for threads on 64-bit") was applied to fix a bug in ia32 executables being loaded. Commit 9ebd4eba7 ("procfs: fix /proc/<pid>/stat stack pointer for kernel threads") was applied to fix a bug which had kernel threads printing a userland stack address. Commit 1306d603f ('proc: partially revert "procfs: provide stack information for threads"') was then applied to revert the stack pages being used to solve a significant performance regression. This patch nearly undoes the effect of all these patches. The reason for reverting these is it provides an unusable value in field 28. For x86_64, a fork will result in the task->stack_start value being updated to the current user top of stack and not the stack start address. This unpredictability of the stack_start value makes it worthless. That includes the intended use of showing how much stack space a thread has. Other architectures will get different values. As an example, ia64 gets 0. The do_fork() and copy_process() functions appear to treat the stack_start and stack_size parameters as architecture specific. I only partially reverted c44972f1 ("procfs: disable per-task stack usage on NOMMU") . If I had completely reverted it, I would have had to change mm/Makefile only build pagewalk.o when CONFIG_PROC_PAGE_MONITOR is configured. Since I could not test the builds without significant effort, I decided to not change mm/Makefile. I only partially reverted 89240ba0 ("x86, fs: Fix x86 procfs stack information for threads on 64-bit") . I left the KSTK_ESP() change in place as that seemed worthwhile. Signed-off-by: Robin Holt <holt@sgi.com> Cc: Stefani Seibold <stefani@seibold.net> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Michal Simek <monstr@monstr.eu> Cc: Ingo Molnar <mingo@elte.hu> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-04-27procfs: fix tid fdinfoJerome Marchand
Correct the file_operations struct in fdinfo entry of tid_base_stuff[]. Presently /proc/*/task/*/fdinfo contains symlinks to opened files like /proc/*/fd/. Signed-off-by: Jerome Marchand <jmarchan@redhat.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Miklos Szeredi <mszeredi@suse.cz> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-04-09procfs: Use generic_file_llseek in /proc/vmcoreFrederic Weisbecker
/proc/vmcore has no llseek and then falls down to use default_llseek. This is racy against read_vmcore() that directly manipulates fpos but it doesn't hold the bkl there so using it in llseek doesn't protect anything. Let's use generic_file_llseek() instead. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Acked-by: Arnd Bergmann <arnd@arndb.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: John Kacur <jkacur@redhat.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Al Viro <viro@ZenIV.linux.org.uk>
2010-04-09procfs: Use generic_file_llseek in /proc/kmsgFrederic Weisbecker
No need to hold the bkl to seek here, none of the other fops callbacks use it. Use generic_file_llseek explicitly. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Acked-by: Arnd Bergmann <arnd@arndb.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: John Kacur <jkacur@redhat.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Al Viro <viro@ZenIV.linux.org.uk>
2010-04-09procfs: Use generic_file_llseek in /proc/kcoreFrederic Weisbecker
/proc/kcore has no llseek and then falls down to use default_llseek. This is racy against read_kcore() that directly manipulates fpos but it doesn't hold the bkl there so using it in llseek doesn't protect anything. Let's use generic_file_llseek() instead. Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Acked-by: Arnd Bergmann <arnd@arndb.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: John Kacur <jkacur@redhat.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Al Viro <viro@ZenIV.linux.org.uk>
2010-04-09procfs: Kill BKL in llseek on proc baseArnd Bergmann
We don't use the BKL elsewhere, so use generic_file_llseek so we can avoid default_llseek taking the BKL. Signed-off-by: Arnd Bergmann <arnd@arndb.de> [restore proc_fdinfo_file_operations as non-seekable] Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: John Kacur <jkacur@redhat.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Al Viro <viro@ZenIV.linux.org.uk>
2010-04-07pagemap: fix pfn calculation for hugepageNaoya Horiguchi
When we look into pagemap using page-types with option -p, the value of pfn for hugepages looks wrong (see below.) This is because pte was evaluated only once for one vma although it should be updated for each hugepage. This patch fixes it. $ page-types -p 3277 -Nl -b huge voffset offset len flags 7f21e8a00 11e400 1 ___U___________H_G________________ 7f21e8a01 11e401 1ff ________________TG________________ ^^^ 7f21e8c00 11e400 1 ___U___________H_G________________ 7f21e8c01 11e401 1ff ________________TG________________ ^^^ One hugepage contains 1 head page and 511 tail pages in x86_64 and each two lines represent each hugepage. Voffset and offset mean virtual address and physical address in the page unit, respectively. The different hugepages should not have the same offset value. With this patch applied: $ page-types -p 3386 -Nl -b huge voffset offset len flags 7fec7a600 112c00 1 ___UD__________H_G________________ 7fec7a601 112c01 1ff ________________TG________________ ^^^ 7fec7a800 113200 1 ___UD__________H_G________________ 7fec7a801 113201 1ff ________________TG________________ ^^^ OK More info: - This patch modifies walk_page_range()'s hugepage walker. But the change only affects pagemap_read(), which is the only caller of hugepage callback. - Without this patch, hugetlb_entry() callback is called per vma, that doesn't match the natural expectation from its name. - With this patch, hugetlb_entry() is called per hugepte entry and the callback can become much simpler. Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Matt Mackall <mpm@selenic.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-04-06proc: copy_to_user() returns unsignedDan Carpenter
copy_to_user() returns the number of bytes left to be copied. This was a typo from: d82ef020cf31 "proc: pagemap: Hold mmap_sem during page walk". Signed-off-by: Dan Carpenter <error27@gmail.com> Acked-by: Matt Mackall <mpm@selenic.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-04-05Merge branch 'master' into export-slabhTejun Heo
2010-04-04proc: pagemap: Hold mmap_sem during page walkKAMEZAWA Hiroyuki
In initial design, walk_page_range() was designed just for walking page table and it didn't require mmap_sem. Now, find_vma() etc.. are used in walk_page_range() and we need mmap_sem around it. This patch adds mmap_sem around walk_page_range(). Because /proc/<pid>/pagemap's callback routine use put_user(), we have to get rid of it to do sane fix. Changelog: 2010/Apr/2 - fixed start_vaddr and end overflow Changelog: 2010/Apr/1 - fixed start_vaddr calculation - removed unnecessary cast. - removed unnecessary change in smaps. - use GFP_TEMPORARY instead of GFP_KERNEL Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Matt Mackall <mpm@selenic.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: San Mehat <san@google.com> Cc: Brian Swetland <swetland@google.com> Cc: Dave Hansen <haveblue@us.ibm.com> Cc: Andrew Morton <akpm@linux-foundation.org> [ Fixed kmalloc failure return code as per Matt ] Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-04-01oom: fix the unsafe usage of badness() in proc_oom_score()Oleg Nesterov
proc_oom_score(task) has a reference to task_struct, but that is all. If this task was already released before we take tasklist_lock - we can't use task->group_leader, it points to nowhere - it is not safe to call badness() even if this task is ->group_leader, has_intersects_mems_allowed() assumes it is safe to iterate over ->thread_group list. - even worse, badness() can hit ->signal == NULL Add the pid_alive() check to ensure __unhash_process() was not called. Also, use "task" instead of task->group_leader. badness() should return the same result for any sub-thread. Currently this is not true, but this should be changed anyway. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: stable@kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-30include cleanup: Update gfp.h and slab.h includes to prepare for breaking ↵Tejun Heo
implicit slab.h inclusion from percpu.h percpu.h is included by sched.h and module.h and thus ends up being included when building most .c files. percpu.h includes slab.h which in turn includes gfp.h making everything defined by the two files universally available and complicating inclusion dependencies. percpu.h -> slab.h dependency is about to be removed. Prepare for this change by updating users of gfp and slab facilities include those headers directly instead of assuming availability. As this conversion needs to touch large number of source files, the following script is used as the basis of conversion. http://userweb.kernel.org/~tj/misc/slabh-sweep.py The script does the followings. * Scan files for gfp and slab usages and update includes such that only the necessary includes are there. ie. if only gfp is used, gfp.h, if slab is used, slab.h. * When the script inserts a new include, it looks at the include blocks and try to put the new include such that its order conforms to its surrounding. It's put in the include block which contains core kernel includes, in the same order that the rest are ordered - alphabetical, Christmas tree, rev-Xmas-tree or at the end if there doesn't seem to be any matching order. * If the script can't find a place to put a new include (mostly because the file doesn't have fitting include block), it prints out an error message indicating which .h file needs to be added to the file. The conversion was done in the following steps. 1. The initial automatic conversion of all .c files updated slightly over 4000 files, deleting around 700 includes and adding ~480 gfp.h and ~3000 slab.h inclusions. The script emitted errors for ~400 files. 2. Each error was manually checked. Some didn't need the inclusion, some needed manual addition while adding it to implementation .h or embedding .c file was more appropriate for others. This step added inclusions to around 150 files. 3. The script was run again and the output was compared to the edits from #2 to make sure no file was left behind. 4. Several build tests were done and a couple of problems were fixed. e.g. lib/decompress_*.c used malloc/free() wrappers around slab APIs requiring slab.h to be added manually. 5. The script was run on all .h files but without automatically editing them as sprinkling gfp.h and slab.h inclusions around .h files could easily lead to inclusion dependency hell. Most gfp.h inclusion directives were ignored as stuff from gfp.h was usually wildly available and often used in preprocessor macros. Each slab.h inclusion directive was examined and added manually as necessary. 6. percpu.h was updated not to include slab.h. 7. Build test were done on the following configurations and failures were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my distributed build env didn't work with gcov compiles) and a few more options had to be turned off depending on archs to make things build (like ipr on powerpc/64 which failed due to missing writeq). * x86 and x86_64 UP and SMP allmodconfig and a custom test config. * powerpc and powerpc64 SMP allmodconfig * sparc and sparc64 SMP allmodconfig * ia64 SMP allmodconfig * s390 SMP allmodconfig * alpha SMP allmodconfig * um on x86_64 SMP allmodconfig 8. percpu.h modifications were reverted so that it could be applied as a separate patch and serve as bisection point. Given the fact that I had only a couple of failures from tests on step 6, I'm fairly confident about the coverage of this conversion patch. If there is a breakage, it's likely to be something in one of the arch headers which should be easily discoverable easily on most builds of the specific arch. Signed-off-by: Tejun Heo <tj@kernel.org> Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
2010-03-24kcore: fix test for end of listDan Carpenter
"m" is never NULL here. We need a different test for the end of list condition. Signed-off-by: Dan Carpenter <error27@gmail.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: WANG Cong <xiyou.wangcong@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-06proc: warn on non-existing proc entriesAlexey Dobriyan
* warn if creation goes on to non-existent directory * warn if removal goes on from non-existing directory * warn if non-existing proc entry is removed Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-06proc: do translation + unlink atomically at remove_proc_entry()Alexey Dobriyan
remove_proc_entry() does lock lookup parent unlock lock unlink proc entry from lists unlock which can be made bit more correct by doing parent translation + unlink without dropping lock. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-06fs: use rlimit helpersJiri Slaby
Make sure compiler won't do weird things with limits. E.g. fetching them twice may return 2 different values after writable limits are implemented. I.e. either use rlimit helpers added in commit 3e10e716abf3 ("resource: add helpers for fetching rlimits") or ACCESS_ONCE if not applicable. Signed-off-by: Jiri Slaby <jslaby@suse.cz> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-06mm: count swap usageKAMEZAWA Hiroyuki
A frequent questions from users about memory management is what numbers of swap ents are user for processes. And this information will give some hints to oom-killer. Besides we can count the number of swapents per a process by scanning /proc/<pid>/smaps, this is very slow and not good for usual process information handler which works like 'ps' or 'top'. (ps or top is now enough slow..) This patch adds a counter of swapents to mm_counter and update is at each swap events. Information is exported via /proc/<pid>/status file as [kamezawa@bluextal memory]$ cat /proc/self/status Name: cat State: R (running) Tgid: 2910 Pid: 2910 PPid: 2823 TracerPid: 0 Uid: 500 500 500 500 Gid: 500 500 500 500 FDSize: 256 Groups: 500 VmPeak: 82696 kB VmSize: 82696 kB VmLck: 0 kB VmHWM: 432 kB VmRSS: 432 kB VmData: 172 kB VmStk: 84 kB VmExe: 48 kB VmLib: 1568 kB VmPTE: 40 kB VmSwap: 0 kB <=============== this. [akpm@linux-foundation.org: coding-style fixes] Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Reviewed-by: Christoph Lameter <cl@linux-foundation.org> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-06mm: clean up mm_counterKAMEZAWA Hiroyuki
Presently, per-mm statistics counter is defined by macro in sched.h This patch modifies it to - defined in mm.h as inlinf functions - use array instead of macro's name creation. This patch is for reducing patch size in future patch to modify implementation of per-mm counter. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Reviewed-by: Minchan Kim <minchan.kim@gmail.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-04Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (52 commits) init: Open /dev/console from rootfs mqueue: fix typo "failues" -> "failures" mqueue: only set error codes if they are really necessary mqueue: simplify do_open() error handling mqueue: apply mathematics distributivity on mq_bytes calculation mqueue: remove unneeded info->messages initialization mqueue: fix mq_open() file descriptor leak on user-space processes fix race in d_splice_alias() set S_DEAD on unlink() and non-directory rename() victims vfs: add NOFOLLOW flag to umount(2) get rid of ->mnt_parent in tomoyo/realpath hppfs can use existing proc_mnt, no need for do_kern_mount() in there Mirror MS_KERNMOUNT in ->mnt_flags get rid of useless vfsmount_lock use in put_mnt_ns() Take vfsmount_lock to fs/internal.h get rid of insanity with namespace roots in tomoyo take check for new events in namespace (guts of mounts_poll()) to namespace.c Don't mess with generic_permission() under ->d_lock in hpfs sanitize const/signedness for udf nilfs: sanitize const/signedness in dealing with ->d_name.name ... Fix up fairly trivial (famous last words...) conflicts in drivers/infiniband/core/uverbs_main.c and security/tomoyo/realpath.c
2010-03-03take check for new events in namespace (guts of mounts_poll()) to namespace.cAl Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-03-03some clean up in fs/procHelight.Xu
EXPORT_SYMBOL(proc_symlink); EXPORT_SYMBOL(proc_mkdir); EXPORT_SYMBOL(create_proc_entry); EXPORT_SYMBOL(proc_create_data); EXPORT_SYMBOL(remove_proc_entry); Those EXPORT_SYMBOL shouldn't be in fs/proc/root.c, should be in fs/proc/generic.c. Signed-off-by: Helight.Xu <helight.xu@gmail.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-03-01Merge branch 'next' into for-linusJames Morris
2010-02-28Merge branch 'core-rcu-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip * 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (44 commits) rcu: Fix accelerated GPs for last non-dynticked CPU rcu: Make non-RCU_PROVE_LOCKING rcu_read_lock_sched_held() understand boot rcu: Fix accelerated grace periods for last non-dynticked CPU rcu: Export rcu_scheduler_active rcu: Make rcu_read_lock_sched_held() take boot time into account rcu: Make lockdep_rcu_dereference() message less alarmist sched, cgroups: Fix module export rcu: Add RCU_CPU_STALL_VERBOSE to dump detailed per-task information rcu: Fix rcutorture mod_timer argument to delay one jiffy rcu: Fix deadlock in TREE_PREEMPT_RCU CPU stall detection rcu: Convert to raw_spinlocks rcu: Stop overflowing signed integers rcu: Use canonical URL for Mathieu's dissertation rcu: Accelerate grace period if last non-dynticked CPU rcu: Fix citation of Mathieu's dissertation rcu: Documentation update for CONFIG_PROVE_RCU security: Apply lockdep-based checking to rcu_dereference() uses idr: Apply lockdep-based diagnostics to rcu_dereference() uses radix-tree: Disable RCU lockdep checking in radix tree vfs: Abstract rcu_dereference_check for files-fdtable use ...
2010-02-25Merge branch 'next-devicetree' of git://git.secretlab.ca/git/linux-2.6Linus Torvalds
* 'next-devicetree' of git://git.secretlab.ca/git/linux-2.6: (41 commits) of: remove undefined request_OF_resource & release_OF_resource of/sparc: Remove sparc-local declaration of allnodes and devtree_lock of: move definition of of_chosen into common code. of: remove unused extern reference to devtree_lock of: put default string compare and #a/s-cell values into common header of/flattree: Don't assume HAVE_LMB of: protect linux/of.h with CONFIG_OF proc_devtree: fix THIS_MODULE without module.h of: Remove old and misplaced function declarations of/flattree: Make the kernel accept ePAPR style phandle information of/flattree: endian-convert members of boot_param_header of: assume big-endian properties, adding conversions where necessary of: use __be32 for cell value accessors of/flattree: use OF_ROOT_NODE_{SIZE,ADDR}_CELLS DEFAULT for fdt parsing of/flattree: use callback to setup initrd from /chosen proc_devtree: include linux/of.h of: make set_node_proc_entry private to proc_devtree.c of: include linux/proc_fs.h of/flattree: merge early_init_dt_scan_memory() common code of: add 'of_' prefix to machine_is_compatible() ...
2010-02-25vfs: Apply lockdep-based checking to rcu_dereference() usesPaul E. McKenney
Add lockdep-ified RCU primitives to alloc_fd(), files_fdtable() and fcheck_files(). Cc: Alexander Viro <viro@zeniv.linux.org.uk> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com Cc: Alexander Viro <viro@zeniv.linux.org.uk> LKML-Reference: <1266887105-1528-8-git-send-email-paulmck@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2010-02-19Switch proc/self to nd_set_link()Al Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-02-14proc_devtree: fix THIS_MODULE without module.hJeremy Kerr
Commit e22f628395432b967f2f505858c64450f7835365 introduced a build breakage for ARM devtree work: the THIS_MODULE macro was added, but we don't have module.h This change adds the necessary #include to get THIS_MODULE defined. While we could just replace it with NULL (PROC_FS is a bool, not a tristate), using THIS_MODULE will prevent unexpected breakage if we ever do compile this as a module. Signed-off-by: Jeremy Kerr <jeremy.kerr@canonical.com> Signed-off-by: Grant Likely <grant.likely@secretlab.ca> Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Acked-by: Michal Simek <monstr@monstr.eu>
2010-02-09proc_devtree: include linux/of.hJeremy Kerr
Currenly, proc_devtree.c depends on asm/prom.h to include linux/of.h, to provide some device-tree definitions (eg, struct property). Instead, include linux/of.h directly. We still need asm/prom.h for HAVE_ARCH_DEVTREE_FIXUPS. Signed-off-by: Jeremy Kerr <jeremy.kerr@canonical.com> Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
2010-02-09of: make set_node_proc_entry private to proc_devtree.cJeremy Kerr
We only need set_node_proc_entry in proc_devtree.c, so move it there. This fixes the !HAVE_ARCH_DEVTREE_FIXUPS build, as we can't make make the definition in linux/of.h conditional on this #define (definitions in asm/prom.h can't be exposed to linux/of.h, due to the enforced #include ordering). Signed-off-by: Jeremy Kerr <jeremy.kerr@canonical.com> Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
2010-02-04syslog: use defined constants instead of raw numbersKees Cook
Right now the syslog "type" action are just raw numbers which makes the source difficult to follow. This patch replaces the raw numbers with defined constants for some level of sanity. Signed-off-by: Kees Cook <kees.cook@canonical.com> Acked-by: John Johansen <john.johansen@canonical.com> Acked-by: Serge Hallyn <serue@us.ibm.com> Signed-off-by: James Morris <jmorris@namei.org>
2010-02-04syslog: distinguish between /proc/kmsg and syscallsKees Cook
This allows the LSM to distinguish between syslog functions originating from /proc/kmsg access and direct syscalls. By default, the commoncaps will now no longer require CAP_SYS_ADMIN to read an opened /proc/kmsg file descriptor. For example the kernel syslog reader can now drop privileges after opening /proc/kmsg, instead of staying privileged with CAP_SYS_ADMIN. MAC systems that implement security_syslog have unchanged behavior. Signed-off-by: Kees Cook <kees.cook@canonical.com> Acked-by: Serge Hallyn <serue@us.ibm.com> Acked-by: John Johansen <john.johansen@canonical.com> Signed-off-by: James Morris <jmorris@namei.org>
2010-01-14fix autofs/afs/etc. magic mountpoint breakageAl Viro
We end up trying to kfree() nd.last.name on open("/mnt/tmp", O_CREAT) if /mnt/tmp is an autofs direct mount. The reason is that nd.last_type is bogus here; we want LAST_BIND for everything of that kind and we get LAST_NORM left over from finding parent directory. So make sure that it *is* set properly; set to LAST_BIND before doing ->follow_link() - for normal symlinks it will be changed by __vfs_follow_link() and everything else needs it set that way. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2010-01-11smaps: fix wrong rss countMinchan Kim
A long time ago we regarded zero page as file_rss and vm_normal_page doesn't return NULL. But now, we reinstated ZERO_PAGE and vm_normal_page's implementation can return NULL in case of zero page. Also we don't count it with file_rss any more. Then, RSS and PSS can't be matched. For consistency, Let's ignore zero page in smaps_pte_range. Signed-off-by: Minchan Kim <minchan.kim@gmail.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Hugh Dickins <hugh.dickins@tiscali.co.uk> Acked-by: Matt Mackall <mpm@selenic.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>