aboutsummaryrefslogtreecommitdiff
path: root/mm/slab.c
AgeCommit message (Collapse)Author
2012-10-13slab: fix the DEADLOCK issue on l3 alien lockMichael Wang
commit 947ca1856a7e60aa6d20536785e6a42dff25aa6e upstream. DEADLOCK will be report while running a kernel with NUMA and LOCKDEP enabled, the process of this fake report is: kmem_cache_free() //free obj in cachep -> cache_free_alien() //acquire cachep's l3 alien lock -> __drain_alien_cache() -> free_block() -> slab_destroy() -> kmem_cache_free() //free slab in cachep->slabp_cache -> cache_free_alien() //acquire cachep->slabp_cache's l3 alien lock Since the cachep and cachep->slabp_cache's l3 alien are in the same lock class, fake report generated. This should not happen since we already have init_lock_keys() which will reassign the lock class for both l3 list and l3 alien. However, init_lock_keys() was invoked at a wrong position which is before we invoke enable_cpucache() on each cache. Since until set slab_state to be FULL, we won't invoke enable_cpucache() on caches to build their l3 alien while creating them, so although we invoked init_lock_keys(), the l3 alien lock class won't change since we don't have them until invoked enable_cpucache() later. This patch will invoke init_lock_keys() after we done enable_cpucache() instead of before to avoid the fake DEADLOCK report. Michael traced the problem back to a commit in release 3.0.0: commit 30765b92ada267c5395fc788623cb15233276f5c Author: Peter Zijlstra <peterz@infradead.org> Date: Thu Jul 28 23:22:56 2011 +0200 slab, lockdep: Annotate the locks before using them Fernando found we hit the regular OFF_SLAB 'recursion' before we annotate the locks, cure this. The relevant portion of the stack-trace: > [ 0.000000] [<c085e24f>] rt_spin_lock+0x50/0x56 > [ 0.000000] [<c04fb406>] __cache_free+0x43/0xc3 > [ 0.000000] [<c04fb23f>] kmem_cache_free+0x6c/0xdc > [ 0.000000] [<c04fb2fe>] slab_destroy+0x4f/0x53 > [ 0.000000] [<c04fb396>] free_block+0x94/0xc1 > [ 0.000000] [<c04fc551>] do_tune_cpucache+0x10b/0x2bb > [ 0.000000] [<c04fc8dc>] enable_cpucache+0x7b/0xa7 > [ 0.000000] [<c0bd9d3c>] kmem_cache_init_late+0x1f/0x61 > [ 0.000000] [<c0bba687>] start_kernel+0x24c/0x363 > [ 0.000000] [<c0bba0ba>] i386_start_kernel+0xa9/0xaf Reported-by: Fernando Lopez-Lezcano <nando@ccrma.Stanford.EDU> Acked-by: Pekka Enberg <penberg@kernel.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1311888176.2617.379.camel@laptop Signed-off-by: Ingo Molnar <mingo@elte.hu> The commit moved init_lock_keys() before we build up the alien, so we failed to reclass it. Acked-by: Christoph Lameter <cl@linux.com> Tested-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Michael Wang <wangyun@linux.vnet.ibm.com> Signed-off-by: Pekka Enberg <penberg@kernel.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-03-28Merge branch 'slab/for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux Pull SLAB changes from Pekka Enberg: "There's the new kmalloc_array() API, minor fixes and performance improvements, but quite honestly, nothing terribly exciting." * 'slab/for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux: mm: SLAB Out-of-memory diagnostics slab: introduce kmalloc_array() slub: per cpu partial statistics change slub: include include for prefetch slub: Do not hold slub_lock when calling sysfs_slab_add() slub: prefetch next freelist pointer in slab_alloc() slab, cleanup: remove unneeded return
2012-03-21cpuset: mm: reduce large amounts of memory barrier related damage v3Mel Gorman
Commit c0ff7453bb5c ("cpuset,mm: fix no node to alloc memory when changing cpuset's mems") wins a super prize for the largest number of memory barriers entered into fast paths for one commit. [get|put]_mems_allowed is incredibly heavy with pairs of full memory barriers inserted into a number of hot paths. This was detected while investigating at large page allocator slowdown introduced some time after 2.6.32. The largest portion of this overhead was shown by oprofile to be at an mfence introduced by this commit into the page allocator hot path. For extra style points, the commit introduced the use of yield() in an implementation of what looks like a spinning mutex. This patch replaces the full memory barriers on both read and write sides with a sequence counter with just read barriers on the fast path side. This is much cheaper on some architectures, including x86. The main bulk of the patch is the retry logic if the nodemask changes in a manner that can cause a false failure. While updating the nodemask, a check is made to see if a false failure is a risk. If it is, the sequence number gets bumped and parallel allocators will briefly stall while the nodemask update takes place. In a page fault test microbenchmark, oprofile samples from __alloc_pages_nodemask went from 4.53% of all samples to 1.15%. The actual results were 3.3.0-rc3 3.3.0-rc3 rc3-vanilla nobarrier-v2r1 Clients 1 UserTime 0.07 ( 0.00%) 0.08 (-14.19%) Clients 2 UserTime 0.07 ( 0.00%) 0.07 ( 2.72%) Clients 4 UserTime 0.08 ( 0.00%) 0.07 ( 3.29%) Clients 1 SysTime 0.70 ( 0.00%) 0.65 ( 6.65%) Clients 2 SysTime 0.85 ( 0.00%) 0.82 ( 3.65%) Clients 4 SysTime 1.41 ( 0.00%) 1.41 ( 0.32%) Clients 1 WallTime 0.77 ( 0.00%) 0.74 ( 4.19%) Clients 2 WallTime 0.47 ( 0.00%) 0.45 ( 3.73%) Clients 4 WallTime 0.38 ( 0.00%) 0.37 ( 1.58%) Clients 1 Flt/sec/cpu 497620.28 ( 0.00%) 520294.53 ( 4.56%) Clients 2 Flt/sec/cpu 414639.05 ( 0.00%) 429882.01 ( 3.68%) Clients 4 Flt/sec/cpu 257959.16 ( 0.00%) 258761.48 ( 0.31%) Clients 1 Flt/sec 495161.39 ( 0.00%) 517292.87 ( 4.47%) Clients 2 Flt/sec 820325.95 ( 0.00%) 850289.77 ( 3.65%) Clients 4 Flt/sec 1020068.93 ( 0.00%) 1022674.06 ( 0.26%) MMTests Statistics: duration Sys Time Running Test (seconds) 135.68 132.17 User+Sys Time Running Test (seconds) 164.2 160.13 Total Elapsed Time (seconds) 123.46 120.87 The overall improvement is small but the System CPU time is much improved and roughly in correlation to what oprofile reported (these performance figures are without profiling so skew is expected). The actual number of page faults is noticeably improved. For benchmarks like kernel builds, the overall benefit is marginal but the system CPU time is slightly reduced. To test the actual bug the commit fixed I opened two terminals. The first ran within a cpuset and continually ran a small program that faulted 100M of anonymous data. In a second window, the nodemask of the cpuset was continually randomised in a loop. Without the commit, the program would fail every so often (usually within 10 seconds) and obviously with the commit everything worked fine. With this patch applied, it also worked fine so the fix should be functionally equivalent. Signed-off-by: Mel Gorman <mgorman@suse.de> Cc: Miao Xie <miaox@cn.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Christoph Lameter <cl@linux.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-03-10mm: SLAB Out-of-memory diagnosticsRafael Aquini
Following the example at mm/slub.c, add out-of-memory diagnostics to the SLAB allocator to help on debugging certain OOM conditions. An example print out looks like this: <snip page allocator out-of-memory message> SLAB: Unable to allocate memory on node 0 (gfp=0x11200) cache: bio-0, object size: 192, order: 0 node 0: slabs: 3/3, objs: 60/60, free: 0 Signed-off-by: Rafael Aquini <aquini@redhat.com> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: David Rientjes <rientjes@google.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2012-01-23slab, cleanup: remove unneeded returnZhao Jin
The procedure ends right after the if-statement, so remove ``return''. Also move the last common statement outside. Signed-off-by: Zhao Jin <cronozhj@gmail.com> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2012-01-11Merge branch 'slab/for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux * 'slab/for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux: slub: disallow changing cpu_partial from userspace for debug caches slub: add missed accounting slub: Extract get_freelist from __slab_alloc slub: Switch per cpu partial page support off for debugging slub: fix a possible memleak in __slab_alloc() slub: fix slub_max_order Documentation slub: add missed accounting slab: add taint flag outputting to debug paths. slub: add taint flag outputting to debug paths slab: introduce slab_max_order kernel parameter slab: rename slab_break_gfp_order to slab_max_order
2012-01-11Merge branch 'slab/urgent' into slab/for-linusPekka Enberg
2012-01-09tracing/mm: Move include of trace/events/kmem.h out of header into slab.cSteven Rostedt
Including trace/events/*.h TRACE_EVENT() macro headers in other headers can cause strange side effects if another trace/event/*.h header includes that header. Having trace/events/kmem.h inside slab_def.h caused a compile error in sparc64 when changes were done to some header files. Moving the kmem.h trace header out of slab.h and into slab.c fixes the problem. Note, both slub.c and slob.c already include the trace/events/kmem.h file. Only slab.c had it missing. Link: http://lkml.kernel.org/r/20120105190405.1e3191fb5a43b2a0f1655e1f@canb.auug.org.au Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-12-05slab, lockdep: Fix silly bugPeter Zijlstra
Commit 30765b92 ("slab, lockdep: Annotate the locks before using them") moves the init_lock_keys() call from after g_cpucache_up = FULL, to before it. And overlooks the fact that init_node_lock_keys() tests for it and ignores everything !FULL. Introduce a LATE stage and change the lockdep test to be <LATE. Acked-by: Christoph Lameter <cl@linux.com> Cc: Pekka Enberg <penberg@kernel.org> Cc: stable@kernel.org Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-11-16slab: add taint flag outputting to debug paths.Dave Jones
When we get corruption reports, it's useful to see if the kernel was tainted, to rule out problems we can't do anything about. Signed-off-by: Dave Jones <davej@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2011-11-10slab: introduce slab_max_order kernel parameterDavid Rientjes
Introduce new slab_max_order kernel parameter which is the equivalent of slub_max_order. For immediate purposes, allows users to override the heuristic that sets the max order to 1 by default if they have more than 32MB of RAM. This may result in page allocation failures if there is substantial fragmentation. Another usecase would be to increase the max order for better performance. Acked-by: Christoph Lameter <cl@linux.com> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2011-11-10slab: rename slab_break_gfp_order to slab_max_orderDavid Rientjes
slab_break_gfp_order is more appropriately named slab_max_order since it enforces the maximum order size of slabs as long as a single object will still fit. Also rename BREAK_GFP_ORDER_{LO,HI} accordingly. Acked-by: Christoph Lameter <cl@linux.com> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2011-09-27mm: restrict access to slab files under procfs and sysfsVasiliy Kulikov
Historically /proc/slabinfo and files under /sys/kernel/slab/* have world read permissions and are accessible to the world. slabinfo contains rather private information related both to the kernel and userspace tasks. Depending on the situation, it might reveal either private information per se or information useful to make another targeted attack. Some examples of what can be learned by reading/watching for /proc/slabinfo entries: 1) dentry (and different *inode*) number might reveal other processes fs activity. The number of dentry "active objects" doesn't strictly show file count opened/touched by a process, however, there is a good correlation between them. The patch "proc: force dcache drop on unauthorized access" relies on the privacy of dentry count. 2) different inode entries might reveal the same information as (1), but these are more fine granted counters. If a filesystem is mounted in a private mount point (or even a private namespace) and fs type differs from other mounted fs types, fs activity in this mount point/namespace is revealed. If there is a single ecryptfs mount point, the whole fs activity of a single user is revealed. Number of files in ecryptfs mount point is a private information per se. 3) fuse_* reveals number of files / fs activity of a user in a user private mount point. It is approx. the same severity as ecryptfs infoleak in (2). 4) sysfs_dir_cache similar to (2) reveals devices' addition/removal, which can be otherwise hidden by "chmod 0700 /sys/". With 0444 slabinfo the precise number of sysfs files is known to the world. 5) buffer_head might reveal some kernel activity. With other information leaks an attacker might identify what specific kernel routines generate buffer_head activity. 6) *kmalloc* infoleaks are very situational. Attacker should watch for the specific kmalloc size entry and filter the noise related to the unrelated kernel activity. If an attacker has relatively silent victim system, he might get rather precise counters. Additional information sources might significantly increase the slabinfo infoleak benefits. E.g. if an attacker knows that the processes activity on the system is very low (only core daemons like syslog and cron), he may run setxid binaries / trigger local daemon activity / trigger network services activity / await sporadic cron jobs activity / etc. and get rather precise counters for fs and network activity of these privileged tasks, which is unknown otherwise. Also hiding slabinfo and /sys/kernel/slab/* is a one step to complicate exploitation of kernel heap overflows (and possibly, other bugs). The related discussion: http://thread.gmane.org/gmane.linux.kernel/1108378 To keep compatibility with old permission model where non-root monitoring daemon could watch for kernel memleaks though slabinfo one should do: groupadd slabinfo usermod -a -G slabinfo $MONITOR_USER And add the following commands to init scripts (to mountall.conf in Ubuntu's upstart case): chmod g+r /proc/slabinfo /sys/kernel/slab/*/* chgrp slabinfo /proc/slabinfo /sys/kernel/slab/*/* Signed-off-by: Vasiliy Kulikov <segoon@openwall.com> Reviewed-by: Kees Cook <kees@ubuntu.com> Reviewed-by: Dave Hansen <dave@linux.vnet.ibm.com> Acked-by: Christoph Lameter <cl@gentwo.org> Acked-by: David Rientjes <rientjes@google.com> CC: Valdis.Kletnieks@vt.edu CC: Linus Torvalds <torvalds@linux-foundation.org> CC: Alan Cox <alan@linux.intel.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2011-09-19Merge branch 'slab/urgent' into slab/nextPekka Enberg
2011-08-04slab, lockdep: Annotate the locks before using themPeter Zijlstra
Fernando found we hit the regular OFF_SLAB 'recursion' before we annotate the locks, cure this. The relevant portion of the stack-trace: > [ 0.000000] [<c085e24f>] rt_spin_lock+0x50/0x56 > [ 0.000000] [<c04fb406>] __cache_free+0x43/0xc3 > [ 0.000000] [<c04fb23f>] kmem_cache_free+0x6c/0xdc > [ 0.000000] [<c04fb2fe>] slab_destroy+0x4f/0x53 > [ 0.000000] [<c04fb396>] free_block+0x94/0xc1 > [ 0.000000] [<c04fc551>] do_tune_cpucache+0x10b/0x2bb > [ 0.000000] [<c04fc8dc>] enable_cpucache+0x7b/0xa7 > [ 0.000000] [<c0bd9d3c>] kmem_cache_init_late+0x1f/0x61 > [ 0.000000] [<c0bba687>] start_kernel+0x24c/0x363 > [ 0.000000] [<c0bba0ba>] i386_start_kernel+0xa9/0xaf Reported-by: Fernando Lopez-Lezcano <nando@ccrma.Stanford.EDU> Acked-by: Pekka Enberg <penberg@kernel.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1311888176.2617.379.camel@laptop Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-08-04slab, lockdep: Annotate slab -> rcu -> debug_object -> slabPeter Zijlstra
Lockdep thinks there's lock recursion through: kmem_cache_free() cache_flusharray() spin_lock(&l3->list_lock) <----------------. free_block() | slab_destroy() | call_rcu() | debug_object_activate() | debug_object_init() | __debug_object_init() | kmem_cache_alloc() | cache_alloc_refill() | spin_lock(&l3->list_lock) --' Now debug objects doesn't use SLAB_DESTROY_BY_RCU and hence there is no actual possibility of recursing. Luckily debug objects marks it slab with SLAB_DEBUG_OBJECTS so we can identify the thing. Mark all SLAB_DEBUG_OBJECTS (all one!) slab caches with a special lockdep key so that lockdep sees its a different cachep. Also add a WARN on trying to create a SLAB_DESTROY_BY_RCU | SLAB_DEBUG_OBJECTS cache, to avoid possible future trouble. Reported-and-tested-by: Sebastian Siewior <sebastian@breakpoint.cc> [ fixes to the initial patch ] Reported-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Pekka Enberg <penberg@kernel.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1311341165.27400.58.camel@twins Signed-off-by: Ingo Molnar <mingo@elte.hu>
2011-07-31slab: use print_hex_dumpSebastian Andrzej Siewior
Less code and the advantage of ascii dump. before: | Slab corruption: names_cache start=c5788000, len=4096 | 000: 6b 6b 01 00 00 00 56 00 00 00 24 00 00 00 2a 00 | 010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 | 020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ff ff | 030: ff ff ff ff e2 b4 17 18 c7 e4 08 06 00 01 08 00 | 040: 06 04 00 01 e2 b4 17 18 c7 e4 0a 00 00 01 00 00 | 050: 00 00 00 00 0a 00 00 02 6b 6b 6b 6b 6b 6b 6b 6b after: | Slab corruption: size-4096 start=c38a9000, len=4096 | 000: 6b 6b 01 00 00 00 56 00 00 00 24 00 00 00 2a 00 kk....V...$...*. | 010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ | 020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ff ff ................ | 030: ff ff ff ff d2 56 5f aa db 9c 08 06 00 01 08 00 .....V_......... | 040: 06 04 00 01 d2 56 5f aa db 9c 0a 00 00 01 00 00 .....V_......... | 050: 00 00 00 00 0a 00 00 02 6b 6b 6b 6b 6b 6b 6b 6b ........kkkkkkkk Acked-by: Christoph Lameter <cl@linux.com> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2011-07-31slab: use NUMA_NO_NODEAndrew Morton
Use the nice enumerated constant. Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2011-07-28slab: remove one NR_CPUS dependencyEric Dumazet
Reduce high order allocations in do_tune_cpucache() for some setups. (NR_CPUS=4096 -> we need 64KB) Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Acked-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2011-07-22Merge branch 'slab-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6 * 'slab-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6: slab: fix DEBUG_SLAB warning slab: shrink sizeof(struct kmem_cache) slab: fix DEBUG_SLAB build SLUB: Fix missing <linux/stacktrace.h> include slub: reduce overhead of slub_debug slub: Add method to verify memory is not freed slub: Enable backtrace for create/delete points slab allocators: Provide generic description of alignment defines slab, slub, slob: Unify alignment definition slob/lockdep: Fix gfp flags passed to lockdep
2011-07-22slab: fix DEBUG_SLAB warningTetsuo Handa
In commit c225150b "slab: fix DEBUG_SLAB build", "if ((unsigned long)objp & (ARCH_SLAB_MINALIGN-1))" is always true if ARCH_SLAB_MINALIGN == 0. Do not print warning if ARCH_SLAB_MINALIGN == 0. Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2011-07-20slab: shrink sizeof(struct kmem_cache)Eric Dumazet
Reduce high order allocations for some setups. (NR_CPUS=4096 -> we need 64KB per kmem_cache struct) We now allocate exact needed size (using nr_cpu_ids and nr_node_ids) This also makes code a bit smaller on x86_64, since some field offsets are less than the 127 limit : Before patch : # size mm/slab.o text data bss dec hex filename 22605 361665 32 384302 5dd2e mm/slab.o After patch : # size mm/slab.o text data bss dec hex filename 22349 353473 8224 384046 5dc2e mm/slab.o CC: Andrew Morton <akpm@linux-foundation.org> Reported-by: Konstantin Khlebnikov <khlebnikov@openvz.org> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Acked-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2011-07-18slab: fix DEBUG_SLAB buildHugh Dickins
Fix CONFIG_SLAB=y CONFIG_DEBUG_SLAB=y build error and warnings. Now that ARCH_SLAB_MINALIGN defaults to __alignof__(unsigned long long), it is always defined (when slab.h included), but cannot be used in #if: mm/slab.c: In function `cache_alloc_debugcheck_after': mm/slab.c:3156:5: warning: "__alignof__" is not defined mm/slab.c:3156:5: error: missing binary operator before token "(" make[1]: *** [mm/slab.o] Error 1 So just remove the #if and #endif lines, but then 64-bit build warns: mm/slab.c: In function `cache_alloc_debugcheck_after': mm/slab.c:3156:6: warning: cast from pointer to integer of different size mm/slab.c:3158:10: warning: format `%d' expects type `int', but argument 3 has type `long unsigned int' Fix those with casts, whatever the actual type of ARCH_SLAB_MINALIGN. Acked-by: Christoph Lameter <cl@linux.com> Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2011-06-03SLAB: Record actual last user of freed objects.Suleiman Souhlal
Currently, when using CONFIG_DEBUG_SLAB, we put in kfree() or kmem_cache_free() as the last user of free objects, which is not very useful, so change it to the caller of those functions instead. Acked-by: David Rientjes <rientjes@google.com> Acked-by: Christoph Lameter <cl@linux.com> Signed-off-by: Suleiman Souhlal <suleiman@google.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2011-05-20sanitize <linux/prefetch.h> usageLinus Torvalds
Commit e66eed651fd1 ("list: remove prefetching from regular list iterators") removed the include of prefetch.h from list.h, which uncovered several cases that had apparently relied on that rather obscure header file dependency. So this fixes things up a bit, using grep -L linux/prefetch.h $(git grep -l '[^a-z_]prefetchw*(' -- '*.[ch]') grep -L 'prefetchw*(' $(git grep -l 'linux/prefetch.h' -- '*.[ch]') to guide us in finding files that either need <linux/prefetch.h> inclusion, or have it despite not needing it. There are more of them around (mostly network drivers), but this gets many core ones. Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-31Fix common misspellingsLucas De Marchi
Fixes generated by 'codespell' and manually reviewed. Signed-off-by: Lucas De Marchi <lucas.demarchi@profusion.mobi>
2011-03-22mm: notifier_from_errno() cleanupPrarit Bhargava
While looking at some other notifier callbacks I noticed this code could use a simple cleanup. notifier_from_errno() no longer needs the if (ret)/else conditional. That same conditional is now done in notifier_from_errno(). Signed-off-by: Prarit Bhargava <prarit@redhat.com> Cc: Paul Menage <menage@google.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-03-11Merge branch 'slab/urgent' into slab/nextPekka Enberg
2011-03-11Merge branch 'slab/rcu' into slab/nextPekka Enberg
Conflicts: mm/slub.c
2011-03-11slab,rcu: don't assume the size of struct rcu_headLai Jiangshan
The size of struct rcu_head may be changed. When it becomes larger, it may pollute the data after struct slab. Acked-by: Christoph Lameter <cl@linux.com> Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2011-02-14Revert "slab: Fix missing DEBUG_SLAB last user"Pekka Enberg
This reverts commit 5c5e3b33b7cb959a401f823707bee006caadd76e. The commit breaks ARM thusly: | Mount-cache hash table entries: 512 | slab error in verify_redzone_free(): cache `idr_layer_cache': memory outside object was overwritten | Backtrace: | [<c0227088>] (dump_backtrace+0x0/0x110) from [<c0431afc>] (dump_stack+0x18/0x1c) | [<c0431ae4>] (dump_stack+0x0/0x1c) from [<c0293304>] (__slab_error+0x28/0x30) | [<c02932dc>] (__slab_error+0x0/0x30) from [<c0293a74>] (cache_free_debugcheck+0x1c0/0x2b8) | [<c02938b4>] (cache_free_debugcheck+0x0/0x2b8) from [<c0293f78>] (kmem_cache_free+0x3c/0xc0) | [<c0293f3c>] (kmem_cache_free+0x0/0xc0) from [<c032b1c8>] (ida_get_new_above+0x19c/0x1c0) | [<c032b02c>] (ida_get_new_above+0x0/0x1c0) from [<c02af7ec>] (alloc_vfsmnt+0x54/0x144) | [<c02af798>] (alloc_vfsmnt+0x0/0x144) from [<c0299830>] (vfs_kern_mount+0x30/0xec) | [<c0299800>] (vfs_kern_mount+0x0/0xec) from [<c0299908>] (kern_mount_data+0x1c/0x20) | [<c02998ec>] (kern_mount_data+0x0/0x20) from [<c02146c4>] (sysfs_init+0x68/0xc8) | [<c021465c>] (sysfs_init+0x0/0xc8) from [<c02137d4>] (mnt_init+0x90/0x1b0) | [<c0213744>] (mnt_init+0x0/0x1b0) from [<c0213388>] (vfs_caches_init+0x100/0x140) | [<c0213288>] (vfs_caches_init+0x0/0x140) from [<c0208c0c>] (start_kernel+0x2e8/0x368) | [<c0208924>] (start_kernel+0x0/0x368) from [<c0208034>] (__enable_mmu+0x0/0x2c) | c0113268: redzone 1:0xd84156c5c032b3ac, redzone 2:0xd84156c5635688c0. | slab error in cache_alloc_debugcheck_after(): cache `idr_layer_cache': double free, or memory outside object was overwritten | ... | c011307c: redzone 1:0x9f91102ffffffff, redzone 2:0x9f911029d74e35b | slab: Internal list corruption detected in cache 'idr_layer_cache'(24), slabp c0113000(16). Hexdump: | | 000: 20 4f 10 c0 20 4f 10 c0 7c 00 00 00 7c 30 11 c0 | 010: 10 00 00 00 10 00 00 00 00 00 c9 17 fe ff ff ff | 020: fe ff ff ff fe ff ff ff fe ff ff ff fe ff ff ff | 030: fe ff ff ff fe ff ff ff fe ff ff ff fe ff ff ff | 040: fe ff ff ff fe ff ff ff fe ff ff ff fe ff ff ff | 050: fe ff ff ff fe ff ff ff fe ff ff ff 11 00 00 00 | 060: 12 00 00 00 13 00 00 00 14 00 00 00 15 00 00 00 | 070: 16 00 00 00 17 00 00 00 c0 88 56 63 | kernel BUG at /home/rmk/git/linux-2.6-rmk/mm/slab.c:2928! Reference: https://lkml.org/lkml/2011/2/7/238 Cc: <stable@kernel.org> # 2.6.35.y and later Reported-and-analyzed-by: Russell King <rmk@arm.linux.org.uk> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2011-01-23mm: Remove support for kmem_cache_name()Christoph Lameter
The last user was ext4 and Eric Sandeen removed the call in a recent patch. See the following URL for the discussion: http://marc.info/?l=linux-ext4&m=129546975702198&w=2 Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2011-01-15mm/slab.c: make local symbols staticH Hartley Sweeten
Local symbols should be static. Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Matt Mackall <mpm@selenic.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2011-01-10Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6: slub: Fix a crash during slabinfo -v tracing/slab: Move kmalloc tracepoint out of inline code slub: Fix slub_lock down/up imbalance slub: Fix build breakage in Documentation/vm slub tracing: move trace calls out of always inlined functions to reduce kernel code size slub: move slabinfo.c to tools/slub/slabinfo.c
2011-01-07Merge branch 'for-2.6.38' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu * 'for-2.6.38' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu: (30 commits) gameport: use this_cpu_read instead of lookup x86: udelay: Use this_cpu_read to avoid address calculation x86: Use this_cpu_inc_return for nmi counter x86: Replace uses of current_cpu_data with this_cpu ops x86: Use this_cpu_ops to optimize code vmstat: User per cpu atomics to avoid interrupt disable / enable irq_work: Use per cpu atomics instead of regular atomics cpuops: Use cmpxchg for xchg to avoid lock semantics x86: this_cpu_cmpxchg and this_cpu_xchg operations percpu: Generic this_cpu_cmpxchg() and this_cpu_xchg support percpu,x86: relocate this_cpu_add_return() and friends connector: Use this_cpu operations xen: Use this_cpu_inc_return taskstats: Use this_cpu_ops random: Use this_cpu_inc_return fs: Use this_cpu_inc_return in buffer.c highmem: Use this_cpu_xx_return() operations vmstat: Use this_cpu_inc_return for vm statistics x86: Support for this_cpu_add, sub, dec, inc_return percpu: Generic support for this_cpu_add, sub, dec, inc_return ... Fixed up conflicts: in arch/x86/kernel/{apic/nmi.c, apic/x2apic_uv_x.c, process.c} as per Tejun.
2011-01-07Merge branch 'for-2.6.38' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wqLinus Torvalds
* 'for-2.6.38' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: (33 commits) usb: don't use flush_scheduled_work() speedtch: don't abuse struct delayed_work media/video: don't use flush_scheduled_work() media/video: explicitly flush request_module work ioc4: use static work_struct for ioc4_load_modules() init: don't call flush_scheduled_work() from do_initcalls() s390: don't use flush_scheduled_work() rtc: don't use flush_scheduled_work() mmc: update workqueue usages mfd: update workqueue usages dvb: don't use flush_scheduled_work() leds-wm8350: don't use flush_scheduled_work() mISDN: don't use flush_scheduled_work() macintosh/ams: don't use flush_scheduled_work() vmwgfx: don't use flush_scheduled_work() tpm: don't use flush_scheduled_work() sonypi: don't use flush_scheduled_work() hvsi: don't use flush_scheduled_work() xen: don't use flush_scheduled_work() gdrom: don't use flush_scheduled_work() ... Fixed up trivial conflict in drivers/media/video/bt8xx/bttv-input.c as per Tejun.
2011-01-07kernel: kmem_ptr_validate considered harmfulNick Piggin
This is a nasty and error prone API. It is no longer used, remove it. Signed-off-by: Nick Piggin <npiggin@kernel.dk>
2010-12-17core: Replace __get_cpu_var with __this_cpu_read if not used for an address.Christoph Lameter
__get_cpu_var() can be replaced with this_cpu_read and will then use a single read instruction with implied address calculation to access the correct per cpu instance. However, the address of a per cpu variable passed to __this_cpu_read() cannot be determined (since it's an implied address conversion through segment prefixes). Therefore apply this only to uses of __get_cpu_var where the address of the variable is not used. Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Hugh Dickins <hughd@google.com> Cc: Thomas Gleixner <tglx@linutronix.de> Acked-by: H. Peter Anvin <hpa@zytor.com> Signed-off-by: Christoph Lameter <cl@linux.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2010-12-15workqueue: convert cancel_rearming_delayed_work[queue]() users to ↵Tejun Heo
cancel_delayed_work_sync() cancel_rearming_delayed_work[queue]() has been superceded by cancel_delayed_work_sync() quite some time ago. Convert all the in-kernel users. The conversions are completely equivalent and trivial. Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: "David S. Miller" <davem@davemloft.net> Acked-by: Greg Kroah-Hartman <gregkh@suse.de> Acked-by: Evgeniy Polyakov <zbr@ioremap.net> Cc: Jeff Garzik <jgarzik@pobox.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Mauro Carvalho Chehab <mchehab@infradead.org> Cc: netdev@vger.kernel.org Cc: Anton Vorontsov <cbou@mail.ru> Cc: David Woodhouse <dwmw2@infradead.org> Cc: "J. Bruce Fields" <bfields@fieldses.org> Cc: Neil Brown <neilb@suse.de> Cc: Alex Elder <aelder@sgi.com> Cc: xfs-masters@oss.sgi.com Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: netfilter-devel@vger.kernel.org Cc: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: linux-nfs@vger.kernel.org
2010-11-28tracing/slab: Move kmalloc tracepoint out of inline codeSteven Rostedt
The tracepoint for kmalloc is in the slab inlined code which causes every instance of kmalloc to have the tracepoint. This patch moves the tracepoint out of the inline code to the slab C file, which removes a large number of inlined trace points. objdump -dr vmlinux.slab| grep 'jmpq.*<trace_kmalloc' |wc -l 213 objdump -dr vmlinux.slab.patched| grep 'jmpq.*<trace_kmalloc' |wc -l 1 This also has a nice impact on size. text data bss dec hex filename 7023060 2121564 2482432 11627056 b16a30 vmlinux.slab 6970579 2109772 2482432 11562783 b06f1f vmlinux.slab.patched Signed-off-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2010-10-26replace nested max/min macros with {max,min}3 macroHagen Paul Pfeifer
Use the new {max,min}3 macros to save some cycles and bytes on the stack. This patch substitutes trivial nested macros with their counterpart. Signed-off-by: Hagen Paul Pfeifer <hagen@jauu.net> Cc: Joe Perches <joe@perches.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Hartley Sweeten <hsweeten@visionengravers.com> Cc: Russell King <linux@arm.linux.org.uk> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Roland Dreier <rolandd@cisco.com> Cc: Sean Hefty <sean.hefty@intel.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-08-22Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6: slab: fix object alignment slub: add missing __percpu markup in mm/slub_def.h
2010-08-09gcc-4.6: mm: fix unused but set warningsAndi Kleen
No real bugs, just some dead code and some fixups. Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-08-09slab: fix object alignmentCarsten Otte
This patch fixes alignment of slab objects in case CONFIG_DEBUG_PAGEALLOC is active. Before this spot in kmem_cache_create, we have this situation: - align contains the required alignment of the object - cachep->obj_offset is 0 or equals align in case of CONFIG_DEBUG_SLAB - size equals the size of the object, or object plus trailing redzone in case of CONFIG_DEBUG_SLAB This spot tries to fill one page per object if the object is in certain size limits, however setting obj_offset to PAGE_SIZE - size does break the object alignment since size may not be aligned with the required alignment. This patch simply adds an ALIGN(size, align) to the equation and fixes the object size detection accordingly. This code in drivers/s390/cio/qdio_setup_init has lead to incorrectly aligned slab objects (sizeof(struct qdio_q) equals 1792): qdio_q_cache = kmem_cache_create("qdio_q", sizeof(struct qdio_q), 256, 0, NULL); Acked-by: Christoph Lameter <cl@linux.com> Signed-off-by: Carsten Otte <cotte@de.ibm.com> Signed-off-by: Pekka Enberg <penberg@kernel.org>
2010-08-06Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6: slub: Allow removal of slab caches during boot Revert "slub: Allow removal of slab caches during boot" slub numa: Fix rare allocation from unexpected node slab: use deferable timers for its periodic housekeeping slub: Use kmem_cache flags to detect if slab is in debugging mode. slub: Allow removal of slab caches during boot slub: Check kasprintf results in kmem_cache_init() SLUB: Constants need UL slub: Use a constant for a unspecified node. SLOB: Free objects to their own list slab: fix caller tracking on !CONFIG_DEBUG_SLAB && CONFIG_TRACING
2010-07-20slab: use deferable timers for its periodic housekeepingArjan van de Ven
slab has a "once every 2 second" timer for its housekeeping. As the number of logical processors is growing, its more and more common that this 2 second timer becomes the primary wakeup source. This patch turns this housekeeping timer into a deferable timer, which means that the timer does not interrupt idle, but just runs at the next event that wakes the cpu up. The impact is that the timer likely runs a bit later, but during the delay no code is running so there's not all that much reason for a difference in housekeeping to occur because of this delay. Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
2010-06-09tracing: Remove kmemtrace ftrace pluginLi Zefan
We have been resisting new ftrace plugins and removing existing ones, and kmemtrace has been superseded by kmem trace events and perf-kmem, so we remove it. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Acked-by: Pekka Enberg <penberg@cs.helsinki.fi> Acked-by: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro> Cc: Ingo Molnar <mingo@elte.hu> Cc: Steven Rostedt <rostedt@goodmis.org> [ remove kmemtrace from the makefile, handle slob too ] Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
2010-05-27numa: slab: use numa_mem_id() for slab local memory nodeLee Schermerhorn
Example usage of generic "numa_mem_id()": The mainline slab code, since ~ 2.6.19, does not handle memoryless nodes well. Specifically, the "fast path"--____cache_alloc()--will never succeed as slab doesn't cache offnode object on the per cpu queues, and for memoryless nodes, all memory will be "off node" relative to numa_node_id(). This adds significant overhead to all kmem cache allocations, incurring a significant regression relative to earlier kernels [from before slab.c was reorganized]. This patch uses the generic topology function "numa_mem_id()" to return the "effective local memory node" for the calling context. This is the first node in the local node's generic fallback zonelist-- the same node that "local" mempolicy-based allocations would use. This lets slab cache these "local" allocations and avoid fallback/refill on every allocation. N.B.: Slab will need to handle node and memory hotplug events that could change the value returned by numa_mem_id() for any given node if recent changes to address memory hotplug don't already address this. E.g., flush all per cpu slab queues before rebuilding the zonelists while the "machine" is held in the stopped state. Performance impact on "hackbench 400 process 200" 2.6.34-rc3-mmotm-100405-1609 no-patch this-patch ia64 no memoryless nodes [avg of 10]: 11.713 11.637 ~0.65 diff ia64 cpus all on memless nodes [10]: 228.259 26.484 ~8.6x speedup The slowdown of the patched kernel from ~12 sec to ~28 seconds when configured with memoryless nodes is the result of all cpus allocating from a single node's mm pagepool. The cache lines of the single node are distributed/interleaved over the memory of the real physical nodes, but the zone lock, list heads, ... of the single node with memory still each live in a single cache line that is accessed from all processors. x86_64 [8x6 AMD] [avg of 40]: 2.883 2.845 Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Tejun Heo <tj@kernel.org> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Nick Piggin <npiggin@suse.de> Cc: David Rientjes <rientjes@google.com> Cc: Eric Whitney <eric.whitney@hp.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: "Luck, Tony" <tony.luck@intel.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: <linux-arch@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-05-27slab: convert cpu notifier to return encapsulate errno valueAkinobu Mita
By the previous modification, the cpu notifier can return encapsulate errno value. This converts the cpu notifiers for slab. Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com> Cc: Christoph Lameter <cl@linux-foundation.org> Acked-by: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Matt Mackall <mpm@selenic.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-05-27cpusets: new round-robin rotor for SLAB allocationsJack Steiner
We have observed several workloads running on multi-node systems where memory is assigned unevenly across the nodes in the system. There are numerous reasons for this but one is the round-robin rotor in cpuset_mem_spread_node(). For example, a simple test that writes a multi-page file will allocate pages on nodes 0 2 4 6 ... Odd nodes are skipped. (Sometimes it allocates on odd nodes & skips even nodes). An example is shown below. The program "lfile" writes a file consisting of 10 pages. The program then mmaps the file & uses get_mempolicy(..., MPOL_F_NODE) to determine the nodes where the file pages were allocated. The output is shown below: # ./lfile allocated on nodes: 2 4 6 0 1 2 6 0 2 There is a single rotor that is used for allocating both file pages & slab pages. Writing the file allocates both a data page & a slab page (buffer_head). This advances the RR rotor 2 nodes for each page allocated. A quick confirmation seems to confirm this is the cause of the uneven allocation: # echo 0 >/dev/cpuset/memory_spread_slab # ./lfile allocated on nodes: 6 7 8 9 0 1 2 3 4 5 This patch introduces a second rotor that is used for slab allocations. Signed-off-by: Jack Steiner <steiner@sgi.com> Acked-by: Christoph Lameter <cl@linux-foundation.org> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Paul Menage <menage@google.com> Cc: Jack Steiner <steiner@sgi.com> Cc: Robin Holt <holt@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>