aboutsummaryrefslogtreecommitdiff
path: root/lib/radix-tree.c
AgeCommit message (Collapse)Author
2010-12-14radix-tree: fix RCU bugNick Piggin
commit 27d20fddc8af539464fc3ba499d6a830054c3bd6 upstream. Salman Qazi describes the following radix-tree bug: In the following case, we get can get a deadlock: 0. The radix tree contains two items, one has the index 0. 1. The reader (in this case find_get_pages) takes the rcu_read_lock. 2. The reader acquires slot(s) for item(s) including the index 0 item. 3. The non-zero index item is deleted, and as a consequence the other item is moved to the root of the tree. The place where it used to be is queued for deletion after the readers finish. 3b. The zero item is deleted, removing it from the direct slot, it remains in the rcu-delayed indirect node. 4. The reader looks at the index 0 slot, and finds that the page has 0 ref count 5. The reader looks at it again, hoping that the item will either be freed or the ref count will increase. This never happens, as the slot it is looking at will never be updated. Also, this slot can never be reclaimed because the reader is holding rcu_read_lock and is in an infinite loop. The fix is to re-use the same "indirect" pointer case that requires a slot lookup retry into a general "retry the lookup" bit. Signed-off-by: Nick Piggin <npiggin@kernel.dk> Reported-by: Salman Qazi <sqazi@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de> Signed-off-by: Andi Kleen <ak@linux.intel.com>
2010-05-27radix-tree: fix radix_tree_prev_hole() underflow caseCesar Eduardo Barros
radix_tree_prev_hole() used LONG_MAX to detect underflow; however, ULONG_MAX is clearly what was intended, both here and by its only user (count_history_pages at mm/readahead.c). Reviewed-by: Wu Fengguang <fengguang.wu@intel.com> Signed-off-by: Cesar Eduardo Barros <cesarb@cesarb.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-04-09radix_tree_tag_get() is not as safe as the docs make out [ver #2]David Howells
radix_tree_tag_get() is not safe to use concurrently with radix_tree_tag_set() or radix_tree_tag_clear(). The problem is that the double tag_get() in radix_tree_tag_get(): if (!tag_get(node, tag, offset)) saw_unset_tag = 1; if (height == 1) { int ret = tag_get(node, tag, offset); may see the value change due to the action of set/clear. RCU is no protection against this as no pointers are being changed, no nodes are being replaced according to a COW protocol - set/clear alter the node directly. The documentation in linux/radix-tree.h, however, says that radix_tree_tag_get() is an exception to the rule that "any function modifying the tree or tags (...) must exclude other modifications, and exclude any functions reading the tree". The problem is that the next statement in radix_tree_tag_get() checks that the tag doesn't vary over time: BUG_ON(ret && saw_unset_tag); This has been seen happening in FS-Cache: https://www.redhat.com/archives/linux-cachefs/2010-April/msg00013.html To this end, remove the BUG_ON() from radix_tree_tag_get() and note in various comments that the value of the tag may change whilst the RCU read lock is held, and thus that the return value of radix_tree_tag_get() may not be relied upon unless radix_tree_tag_set/clear() and radix_tree_delete() are excluded from running concurrently with it. Reported-by: Romain DEGEZ <romain.degez@smartjog.com> Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2010-03-30include cleanup: Update gfp.h and slab.h includes to prepare for breaking ↵Tejun Heo
implicit slab.h inclusion from percpu.h percpu.h is included by sched.h and module.h and thus ends up being included when building most .c files. percpu.h includes slab.h which in turn includes gfp.h making everything defined by the two files universally available and complicating inclusion dependencies. percpu.h -> slab.h dependency is about to be removed. Prepare for this change by updating users of gfp and slab facilities include those headers directly instead of assuming availability. As this conversion needs to touch large number of source files, the following script is used as the basis of conversion. http://userweb.kernel.org/~tj/misc/slabh-sweep.py The script does the followings. * Scan files for gfp and slab usages and update includes such that only the necessary includes are there. ie. if only gfp is used, gfp.h, if slab is used, slab.h. * When the script inserts a new include, it looks at the include blocks and try to put the new include such that its order conforms to its surrounding. It's put in the include block which contains core kernel includes, in the same order that the rest are ordered - alphabetical, Christmas tree, rev-Xmas-tree or at the end if there doesn't seem to be any matching order. * If the script can't find a place to put a new include (mostly because the file doesn't have fitting include block), it prints out an error message indicating which .h file needs to be added to the file. The conversion was done in the following steps. 1. The initial automatic conversion of all .c files updated slightly over 4000 files, deleting around 700 includes and adding ~480 gfp.h and ~3000 slab.h inclusions. The script emitted errors for ~400 files. 2. Each error was manually checked. Some didn't need the inclusion, some needed manual addition while adding it to implementation .h or embedding .c file was more appropriate for others. This step added inclusions to around 150 files. 3. The script was run again and the output was compared to the edits from #2 to make sure no file was left behind. 4. Several build tests were done and a couple of problems were fixed. e.g. lib/decompress_*.c used malloc/free() wrappers around slab APIs requiring slab.h to be added manually. 5. The script was run on all .h files but without automatically editing them as sprinkling gfp.h and slab.h inclusions around .h files could easily lead to inclusion dependency hell. Most gfp.h inclusion directives were ignored as stuff from gfp.h was usually wildly available and often used in preprocessor macros. Each slab.h inclusion directive was examined and added manually as necessary. 6. percpu.h was updated not to include slab.h. 7. Build test were done on the following configurations and failures were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my distributed build env didn't work with gcov compiles) and a few more options had to be turned off depending on archs to make things build (like ipr on powerpc/64 which failed due to missing writeq). * x86 and x86_64 UP and SMP allmodconfig and a custom test config. * powerpc and powerpc64 SMP allmodconfig * sparc and sparc64 SMP allmodconfig * ia64 SMP allmodconfig * s390 SMP allmodconfig * alpha SMP allmodconfig * um on x86_64 SMP allmodconfig 8. percpu.h modifications were reverted so that it could be applied as a separate patch and serve as bisection point. Given the fact that I had only a couple of failures from tests on step 6, I'm fairly confident about the coverage of this conversion patch. If there is a breakage, it's likely to be something in one of the arch headers which should be easily discoverable easily on most builds of the specific arch. Signed-off-by: Tejun Heo <tj@kernel.org> Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
2010-02-25radix-tree: Disable RCU lockdep checking in radix treePaul E. McKenney
Because the radix tree is used with many different locking designs, we cannot do any effective checking without changing the radix-tree APIs. It might make sense to do this later, but only if the RCU lockdep checking proves itself sufficiently valuable. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Cc: laijs@cn.fujitsu.com Cc: dipankar@in.ibm.com Cc: mathieu.desnoyers@polymtl.ca Cc: josh@joshtriplett.org Cc: dvhltc@us.ibm.com Cc: niv@us.ibm.com Cc: peterz@infradead.org Cc: rostedt@goodmis.org Cc: Valdis.Kletnieks@vt.edu Cc: dhowells@redhat.com LKML-Reference: <1266887105-1528-10-git-send-email-paulmck@linux.vnet.ibm.com> Signed-off-by: Ingo Molnar <mingo@elte.hu>
2009-11-19FS-Cache: Don't delete pending pages from the page-store tracking treeDavid Howells
Don't delete pending pages from the page-store tracking tree, but rather send them for another write as they've presumably been updated. Signed-off-by: David Howells <dhowells@redhat.com>
2009-11-19FS-Cache: Use radix tree preload correctly in tracking of pages to be storedDavid Howells
__fscache_write_page() attempts to load the radix tree preallocation pool for the CPU it is on before calling radix_tree_insert(), as the insertion must be done inside a pair of spinlocks. Use of the preallocation pool, however, is contingent on the radix tree being initialised without __GFP_WAIT specified. __fscache_acquire_cookie() was passing GFP_NOFS to INIT_RADIX_TREE() - but that includes __GFP_WAIT. The solution is to AND out __GFP_WAIT. Additionally, the banner comment to radix_tree_preload() is altered to make note of this prerequisite. Possibly there should be a WARN_ON() too. Without this fix, I have seen the following recursive deadlock caused by radix_tree_insert() attempting to allocate memory inside the spinlocked region, which resulted in FS-Cache being called back into to release memory - which required the spinlock already held. ============================================= [ INFO: possible recursive locking detected ] 2.6.32-rc6-cachefs #24 --------------------------------------------- nfsiod/7916 is trying to acquire lock: (&cookie->lock){+.+.-.}, at: [<ffffffffa0076872>] __fscache_uncache_page+0xdb/0x160 [fscache] but task is already holding lock: (&cookie->lock){+.+.-.}, at: [<ffffffffa0076acc>] __fscache_write_page+0x15c/0x3f3 [fscache] other info that might help us debug this: 5 locks held by nfsiod/7916: #0: (nfsiod){+.+.+.}, at: [<ffffffff81048290>] worker_thread+0x19a/0x2e2 #1: (&task->u.tk_work#2){+.+.+.}, at: [<ffffffff81048290>] worker_thread+0x19a/0x2e2 #2: (&cookie->lock){+.+.-.}, at: [<ffffffffa0076acc>] __fscache_write_page+0x15c/0x3f3 [fscache] #3: (&object->lock#2){+.+.-.}, at: [<ffffffffa0076b07>] __fscache_write_page+0x197/0x3f3 [fscache] #4: (&cookie->stores_lock){+.+...}, at: [<ffffffffa0076b0f>] __fscache_write_page+0x19f/0x3f3 [fscache] stack backtrace: Pid: 7916, comm: nfsiod Not tainted 2.6.32-rc6-cachefs #24 Call Trace: [<ffffffff8105ac7f>] __lock_acquire+0x1649/0x16e3 [<ffffffff81059ded>] ? __lock_acquire+0x7b7/0x16e3 [<ffffffff8100e27d>] ? dump_trace+0x248/0x257 [<ffffffff8105ad70>] lock_acquire+0x57/0x6d [<ffffffffa0076872>] ? __fscache_uncache_page+0xdb/0x160 [fscache] [<ffffffff8135467c>] _spin_lock+0x2c/0x3b [<ffffffffa0076872>] ? __fscache_uncache_page+0xdb/0x160 [fscache] [<ffffffffa0076872>] __fscache_uncache_page+0xdb/0x160 [fscache] [<ffffffffa0077eb7>] ? __fscache_check_page_write+0x0/0x71 [fscache] [<ffffffffa00b4755>] nfs_fscache_release_page+0x86/0xc4 [nfs] [<ffffffffa00907f0>] nfs_release_page+0x3c/0x41 [nfs] [<ffffffff81087ffb>] try_to_release_page+0x32/0x3b [<ffffffff81092c2b>] shrink_page_list+0x316/0x4ac [<ffffffff81058a9b>] ? mark_held_locks+0x52/0x70 [<ffffffff8135451b>] ? _spin_unlock_irq+0x2b/0x31 [<ffffffff81093153>] shrink_inactive_list+0x392/0x67c [<ffffffff81058a9b>] ? mark_held_locks+0x52/0x70 [<ffffffff810934ca>] shrink_list+0x8d/0x8f [<ffffffff81093744>] shrink_zone+0x278/0x33c [<ffffffff81052c70>] ? ktime_get_ts+0xad/0xba [<ffffffff8109453b>] try_to_free_pages+0x22e/0x392 [<ffffffff8109184c>] ? isolate_pages_global+0x0/0x212 [<ffffffff8108e16b>] __alloc_pages_nodemask+0x3dc/0x5cf [<ffffffff810ae24a>] cache_alloc_refill+0x34d/0x6c1 [<ffffffff811bcf74>] ? radix_tree_node_alloc+0x52/0x5c [<ffffffff810ae929>] kmem_cache_alloc+0xb2/0x118 [<ffffffff811bcf74>] radix_tree_node_alloc+0x52/0x5c [<ffffffff811bcfd5>] radix_tree_insert+0x57/0x19c [<ffffffffa0076b53>] __fscache_write_page+0x1e3/0x3f3 [fscache] [<ffffffffa00b4248>] __nfs_readpage_to_fscache+0x58/0x11e [nfs] [<ffffffffa009bb77>] nfs_readpage_release+0x34/0x9b [nfs] [<ffffffffa009c0d9>] nfs_readpage_release_full+0x32/0x4b [nfs] [<ffffffffa0006cff>] rpc_release_calldata+0x12/0x14 [sunrpc] [<ffffffffa0006e2d>] rpc_free_task+0x59/0x61 [sunrpc] [<ffffffffa0006f03>] rpc_async_release+0x10/0x12 [sunrpc] [<ffffffff810482e5>] worker_thread+0x1ef/0x2e2 [<ffffffff81048290>] ? worker_thread+0x19a/0x2e2 [<ffffffff81352433>] ? thread_return+0x3e/0x101 [<ffffffffa0006ef3>] ? rpc_async_release+0x0/0x12 [sunrpc] [<ffffffff8104bff5>] ? autoremove_wake_function+0x0/0x34 [<ffffffff81058d25>] ? trace_hardirqs_on+0xd/0xf [<ffffffff810480f6>] ? worker_thread+0x0/0x2e2 [<ffffffff8104bd21>] kthread+0x7a/0x82 [<ffffffff8100beda>] child_rip+0xa/0x20 [<ffffffff8100b87c>] ? restore_args+0x0/0x30 [<ffffffff8104c2b9>] ? add_wait_queue+0x15/0x44 [<ffffffff8104bca7>] ? kthread+0x0/0x82 [<ffffffff8100bed0>] ? child_rip+0x0/0x20 Signed-off-by: David Howells <dhowells@redhat.com>
2009-06-16lib: do code optimization for radix_tree_lookup() and radix_tree_lookup_slot()Huang Shijie
radix_tree_lookup() and radix_tree_lookup_slot() have much the same code except for the return value. Introduce radix_tree_lookup_element() to do the real work. /* * is_slot == 1 : search for the slot. * is_slot == 0 : search for the node. */ static void * radix_tree_lookup_element(struct radix_tree_root *root, unsigned long index, int is_slot); Signed-off-by: Huang Shijie <shijie8@gmail.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-06-16radix-tree: add radix_tree_prev_hole()Wu Fengguang
The counterpart of radix_tree_next_hole(). To be used by context readahead. Signed-off-by: Wu Fengguang <fengguang.wu@intel.com> Cc: Vladislav Bolkhovitin <vst@vlnb.net> Cc: Jens Axboe <jens.axboe@oracle.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Ying Han <yinghan@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-07Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (24 commits) trivial: chack -> check typo fix in main Makefile trivial: Add a space (and a comma) to a printk in 8250 driver trivial: Fix misspelling of "firmware" in docs for ncr53c8xx/sym53c8xx trivial: Fix misspelling of "firmware" in powerpc Makefile trivial: Fix misspelling of "firmware" in usb.c trivial: Fix misspelling of "firmware" in qla1280.c trivial: Fix misspelling of "firmware" in a100u2w.c trivial: Fix misspelling of "firmware" in megaraid.c trivial: Fix misspelling of "firmware" in ql4_mbx.c trivial: Fix misspelling of "firmware" in acpi_memhotplug.c trivial: Fix misspelling of "firmware" in ipw2100.c trivial: Fix misspelling of "firmware" in atmel.c trivial: Fix misspelled firmware in Kconfig trivial: fix an -> a typos in documentation and comments trivial: fix then -> than typos in comments and documentation trivial: update Jesper Juhl CREDITS entry with new email trivial: fix singal -> signal typo trivial: Fix incorrect use of "loose" in event.c trivial: printk: fix indentation of new_text_line declaration trivial: rtc-stk17ta8: fix sparse warning ...
2009-01-06lib: radix_tree.c make percpu variable staticHarvey Harrison
radix_tree_preloads is unused outside of this file, make it static. Noticed by sparse: lib/radix-tree.c:84:1: warning: symbol 'per_cpu__radix_tree_preloads' was not declared. Should it be static? Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-01-06trivial: radix-tree: document wrap-around issue of radix_tree_next_hole()Wu Fengguang
And some 80-line cleanups. Signed-off-by: Wu Fengguang <wfg@linux.intel.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2008-07-26SL*B: drop kmem cache argument from constructorAlexey Dobriyan
Kmem cache passed to constructor is only needed for constructors that are themselves multiplexeres. Nobody uses this "feature", nor does anybody uses passed kmem cache in non-trivial way, so pass only pointer to object. Non-trivial places are: arch/powerpc/mm/init_64.c arch/powerpc/mm/hugetlbpage.c This is flag day, yes. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Acked-by: Pekka Enberg <penberg@cs.helsinki.fi> Acked-by: Christoph Lameter <cl@linux-foundation.org> Cc: Jon Tollefson <kniht@linux.vnet.ibm.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Matt Mackall <mpm@selenic.com> [akpm@linux-foundation.org: fix arch/powerpc/mm/hugetlbpage.c] [akpm@linux-foundation.org: fix mm/slab.c] [akpm@linux-foundation.org: fix ubifs] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-26radix-tree: add gang_lookup_slot, gang_lookup_slot_tagNick Piggin
Introduce gang_lookup_slot() and gang_lookup_slot_tag() functions, which are used by lockless pagecache. Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Cc: Hugh Dickins <hugh@veritas.com> Cc: "Paul E. McKenney" <paulmck@us.ibm.com> Reviewed-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-07-04Christoph has movedChristoph Lameter
Remove all clameter@sgi.com addresses from the kernel tree since they will become invalid on June 27th. Change my maintainer email address for the slab allocators to cl@linux-foundation.org (which will be the new email address for the future). Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Christoph Lameter <cl@linux-foundation.org> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Matt Mackall <mpm@selenic.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-06-12radix-tree: fix small lockless radix-tree bugNick Piggin
We shrink a radix tree when its root node has only one child, in the left most slot. The child becomes the new root node. To perform this operation in a manner compatible with concurrent lockless lookups, we atomically switch the root pointer from the parent to its child. However a concurrent lockless lookup may now have loaded a pointer to the parent (and is presently deciding what to do next). For this reason, we also have to keep the parent node in a valid state after shrinking the tree, until the next RCU grace period -- otherwise this lookup with the parent pointer may not do the right thing. Notably, we need to keep the child in the left most slot there in case that is requested by the lookup. This is all pretty standard RCU stuff. It is worth repeating because in my eagerness to obey the radix tree node constructor scheme, I had broken it by zeroing the radix tree node before the grace period. What could happen is that a lookup can load the parent pointer, then decide it wants to follow the left most child slot, only to find the slot contained NULL due to the concurrent shrinker having zeroed the parent node before waiting for a grace period. The lookup would return a false negative as a result. Fix it by doing that clearing in the RCU callback. I would normally want to rip out the constructor entirely, but radix tree nodes are one of those places where they make sense (only few cachelines will be touched soon after allocation). This was never actually found in any lockless pagecache testing or by the test harness, but by seeing the odd problem with my scalable vmap rewrite. I have not tickled the test harness into reproducing it yet, but I'll keep working at it. Fortunately, it is not a problem anywhere lockless pagecache is used in mainline kernels (pagecache probe is not a guarantee, and brd does not have concurrent lookups and deletes). Signed-off-by: Nick Piggin <npiggin@suse.de> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: "Paul E. McKenney" <paulmck@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-28Remove set_migrateflags()Christoph Lameter
Migrate flags must be set on slab creation as agreed upon when the antifrag logic was reviewed. Otherwise some slabs of a slabcache will end up in the unmovable and others in the reclaimable section depending on which flag was active when a new slab page was allocated. This likely slid in somehow when antifrag was merged. Remove it. The buffer_heads are always allocated with __GFP_RECLAIMABLE because the SLAB_RECLAIM_ACCOUNT option is set. The set_migrateflags() never had any effect there. Radix tree allocations are not directly reclaimable but they are allocated with __GFP_RECLAIMABLE set on each allocation. We now set SLAB_RECLAIM_ACCOUNT on radix tree slab creation making sure that radix tree slabs are consistently placed in the reclaimable section. Radix tree slabs will also be accounted as such. There is then no user left of set_migratepages. So remove it. Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-02-05radix-tree: avoid atomic allocations for preloaded insertionsNick Piggin
Most pagecache (and some other) radix tree insertions have the great opportunity to preallocate a few nodes with relaxed gfp flags. But the preallocation is squandered when it comes time to allocate a node, we default to first attempting a GFP_ATOMIC allocation -- that doesn't normally fail, but it can eat into atomic memory reserves that we don't need to be using. Another upshot of this is that it removes the sometimes highly contended zone->lock from underneath tree_lock. Pagecache insertions are always performed with a radix tree preload, and after this change, such a situation will never fall back to kmem_cache_alloc within radix_tree_node_alloc. David Miller reports seeing this allocation fail on a highly threaded sparc64 system: [527319.459981] dd: page allocation failure. order:0, mode:0x20 [527319.460403] Call Trace: [527319.460568] [00000000004b71e0] __slab_alloc+0x1b0/0x6a8 [527319.460636] [00000000004b7bbc] kmem_cache_alloc+0x4c/0xa8 [527319.460698] [000000000055309c] radix_tree_node_alloc+0x20/0x90 [527319.460763] [0000000000553238] radix_tree_insert+0x12c/0x260 [527319.460830] [0000000000495cd0] add_to_page_cache+0x38/0xb0 [527319.460893] [00000000004e4794] mpage_readpages+0x6c/0x134 [527319.460955] [000000000049c7fc] __do_page_cache_readahead+0x170/0x280 [527319.461028] [000000000049cc88] ondemand_readahead+0x208/0x214 [527319.461094] [0000000000496018] do_generic_mapping_read+0xe8/0x428 [527319.461152] [0000000000497948] generic_file_aio_read+0x108/0x170 [527319.461217] [00000000004badac] do_sync_read+0x88/0xd0 [527319.461292] [00000000004bb5cc] vfs_read+0x78/0x10c [527319.461361] [00000000004bb920] sys_read+0x34/0x60 [527319.461424] [0000000000406294] linux_sparc_syscall32+0x3c/0x40 The calltrace is significant: __do_page_cache_readahead allocates a number of pages with GFP_KERNEL, and hence it should have reclaimed sufficient memory to satisfy GFP_ATOMIC allocations. However after the list of pages goes to mpage_readpages, there can be significant intervals (including disk IO) before all the pages are inserted into the radix-tree. So the reserves can easily be depleted at that point. The patch is confirmed to fix the problem. Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: "David S. Miller" <davem@davemloft.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17avoid negative (and full-width) shifts in radix-tree.cPeter Lund
Negative shifts are not allowed in C (the result is undefined). Same thing with full-width shifts. It works on most platforms but not on the VAX with gcc 4.0.1 (it results in an "operand reserved" fault). Shifting by more than the width of the value on the left is also not allowed. I think the extra '>> 1' tacked on at the end in the original code was an attempt to work around that. Getting rid of that is an extra feature of this patch. Here's the chapter and verse, taken from the final draft of the C99 standard ("6.5.7 Bitwise shift operators", paragraph 3): "The integer promotions are performed on each of the operands. The type of the result is that of the promoted left operand. If the value of the right operand is negative or is greater than or equal to the width of the promoted left operand, the behavior is undefined." Thank you to Jan-Benedict Glaw, Christoph Hellwig, Maciej Rozycki, Pekka Enberg, Andreas Schwab, and Christoph Lameter for review. Special thanks to Andreas for spotting that my fix only removed half the undefined behaviour. Signed-off-by: Peter Lund <firefly@vax64.dk> Christoph Lameter <clameter@sgi.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: "Maciej W. Rozycki" <macro@linux-mips.org> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Andreas Schwab <schwab@suse.de> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: WU Fengguang <wfg@mail.ustc.edu.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-17Slab API: remove useless ctor parameter and reorder parametersChristoph Lameter
Slab constructors currently have a flags parameter that is never used. And the order of the arguments is opposite to other slab functions. The object pointer is placed before the kmem_cache pointer. Convert ctor(void *object, struct kmem_cache *s, unsigned long flags) to ctor(struct kmem_cache *s, void *object) throughout the kernel [akpm@linux-foundation.org: coupla fixes] Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16Group short-lived and reclaimable kernel allocationsMel Gorman
This patch marks a number of allocations that are either short-lived such as network buffers or are reclaimable such as inode allocations. When something like updatedb is called, long-lived and unmovable kernel allocations tend to be spread throughout the address space which increases fragmentation. This patch groups these allocations together as much as possible by adding a new MIGRATE_TYPE. The MIGRATE_RECLAIMABLE type is for allocations that can be reclaimed on demand, but not moved. i.e. they can be migrated by deleting them and re-reading the information from elsewhere. Signed-off-by: Mel Gorman <mel@csn.ul.ie> Cc: Andy Whitcroft <apw@shadowen.org> Cc: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16fix the max path calculation in radix-tree.cJeff Moyer
A while back, Nick Piggin introduced a patch to reduce the node memory usage for small files (commit cfd9b7df4abd3257c9e381b0e445817b26a51c0c): -#define RADIX_TREE_MAP_SHIFT 6 +#define RADIX_TREE_MAP_SHIFT (CONFIG_BASE_SMALL ? 4 : 6) Unfortunately, he didn't take into account the fact that the calculation of the maximum path was based on an assumption of having to round up: #define RADIX_TREE_MAX_PATH (RADIX_TREE_INDEX_BITS/RADIX_TREE_MAP_SHIFT + 2) So, if CONFIG_BASE_SMALL is set, you will end up with a RADIX_TREE_MAX_PATH that is one greater than necessary. The practical upshot of this is just a bit of wasted memory (one long in the height_to_maxindex array, an extra pre-allocated radix tree node per cpu, and extra stack usage in a couple of functions), but it seems worth getting right. It's also worth noting that I never build with CONFIG_BASE_SMALL. What I did to test this was duplicate the code in a small user-space program and check the results of the calculations for max path and the contents of the height_to_maxindex array. Signed-off-by: Jeff Moyer <jmoyer@redhat.com> Acked-by: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16radix-tree: use indirect bitNick Piggin
Rather than sign direct radix-tree pointers with a special bit, sign the indirect one that hangs off the root. This means that, given a lookup_slot operation, the invalid result will be differentiated from the valid (previously, valid results could have the bit either set or clear). This does not affect slot lookups which occur under lock -- they can never return an invalid result. Is needed in future for lockless pagecache. Signed-off-by: Nick Piggin <npiggin@suse.de> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-10-16radixtree: introduce radix_tree_next_hole()Fengguang Wu
Introduce radix_tree_next_hole(root, index, max_scan) to scan radix tree for the first hole. It will be used in interleaved readahead. The implementation is dumb and obviously correct. It can help debug(and document) the possible smart one in future. Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-07-20mm: Remove slab destructors from kmem_cache_create().Paul Mundt
Slab destructors were no longer supported after Christoph's c59def9f222d44bb7e2f0a559f2906191a0862d7 change. They've been BUGs for both slab and slub, and slob never supported them either. This rips out support for the dtor pointer from kmem_cache_create() completely and fixes up every single callsite in the kernel (there were about 224, not including the slab allocator definitions themselves, or the documentation references). Signed-off-by: Paul Mundt <lethal@linux-sh.org>
2007-07-14[LIB]: export radix_tree_preload()David Chinner
XFS filestreams functionality uses radix trees and the preload functions. XFS can be built as a module and hence we need radix_tree_preload() exported. radix_tree_preload_end() is a static inline, so it doesn't need exporting. Signed-Off-By: Dave Chinner <dgc@sgi.com> Signed-Off-By: Tim Shimmin <tes@sgi.com>
2007-05-09Add suspend-related notifications for CPU hotplugRafael J. Wysocki
Since nonboot CPUs are now disabled after tasks and devices have been frozen and the CPU hotplug infrastructure is used for this purpose, we need special CPU hotplug notifications that will help the CPU-hotplug-aware subsystems distinguish normal CPU hotplug events from CPU hotplug events related to a system-wide suspend or resume operation in progress. This patch introduces such notifications and causes them to be used during suspend and resume transitions. It also changes all of the CPU-hotplug-aware subsystems to take these notifications into consideration (for now they are handled in the same way as the corresponding "normal" ones). [oleg@tv-sign.ru: cleanups] Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Cc: Gautham R Shenoy <ego@in.ibm.com> Cc: Pavel Machek <pavel@ucw.cz> Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2006-12-07[PATCH] hotplug CPU: clean up hotcpu_notifier() useIngo Molnar
There was lots of #ifdef noise in the kernel due to hotcpu_notifier(fn, prio) not correctly marking 'fn' as used in the !HOTPLUG_CPU case, and thus generating compiler warnings of unused symbols, hence forcing people to add #ifdefs. the compiler can skip truly unused functions just fine: text data bss dec hex filename 1624412 728710 3674856 6027978 5bfaca vmlinux.before 1624412 728710 3674856 6027978 5bfaca vmlinux.after [akpm@osdl.org: topology.c fix] Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07[PATCH] radix-tree: RCU lockless readsideNick Piggin
Make radix tree lookups safe to be performed without locks. Readers are protected against nodes being deleted by using RCU based freeing. Readers are protected against new node insertion by using memory barriers to ensure the node itself will be properly written before it is visible in the radix tree. Each radix tree node keeps a record of their height (above leaf nodes). This height does not change after insertion -- when the radix tree is extended, higher nodes are only inserted in the top. So a lookup can take the pointer to what is *now* the root node, and traverse down it even if the tree is concurrently extended and this node becomes a subtree of a new root. "Direct" pointers (tree height of 0, where root->rnode points directly to the data item) are handled by using the low bit of the pointer to signal whether rnode is a direct pointer or a pointer to a radix tree node. When a reader wants to traverse the next branch, they will take a copy of the pointer. This pointer will be either NULL (and the branch is empty) or non-NULL (and will point to a valid node). [akpm@osdl.org: cleanups] [Lee.Schermerhorn@hp.com: bugfixes, comments, simplifications] [clameter@sgi.com: build fix] Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: "Paul E. McKenney" <paulmck@us.ibm.com> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Christoph Lameter <clameter@engr.sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-07[PATCH] slab: remove kmem_cache_tChristoph Lameter
Replace all uses of kmem_cache_t with struct kmem_cache. The patch was generated using the following script: #!/bin/sh # # Replace one string by another in all the kernel sources. # set -e for file in `find * -name "*.c" -o -name "*.h"|xargs grep -l $1`; do quilt add $file sed -e "1,\$s/$1/$2/g" $file >/tmp/$$ mv /tmp/$$ $file quilt refresh done The script was run like this sh replace kmem_cache_t "struct kmem_cache" Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-10-10[PATCH] gfp annotations: radix_tree_rootAl Viro
struct radix_tree_root has unused upper bits of ->gfp_mask reused for tags bitmap. Annotated. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-25[PATCH] radixtree: normalize radix_tree_tag_get() return valueWu Fengguang
In radix_tree_tag_get(), return normalized value of 0/1, as indicated by its comment. Signed-off-by: Wu Fengguang <wfg@mail.ustc.edu.cn> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23[PATCH] buglet in radix_tree_tag_setPeter Zijlstra
The comment states: 'Setting a tag on a not-present item is a BUG.' Hence if 'index' is larger than the maxindex; the item _cannot_ be presen; it should also be a BUG. Also, this allows the following statement (assume a fresh tree): radix_tree_tag_set(root, 16, 1); to fail silently, but when preceded by: radix_tree_insert(root, 32, item); it would BUG, because the height has been extended by the insert. In neither case was 16 present. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Acked-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23[PATCH] radix-tree: smallNick Piggin
Reduce radix tree node memory usage by about a factor of 4 for small files (< 64K). There are pointer traversal and memory usage costs for large files with dense pagecache. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-06-23[PATCH] radix-tree: direct dataNick Piggin
The ability to have height 0 radix trees (a direct pointer to the data item rather than going through a full node->slot) quietly disappeared with old-2.6-bkcvs commit ffee171812d51652f9ba284302d9e5c5cc14bdfd. On 64-bit machines this causes nearly 600 bytes to be used for every <= 4K file in pagecache. Re-introduce this feature, root tags stored in spare ->gfp_mask bits. Simplify radix_tree_delete's complex tag clearing arrangement (which would become even more complex) by just falling back to tag clearing functions (the pagecache radix-tree never uses this path anyway, so the icache savings will mean it's actually a speedup). On my 4GB G5, this saves 8MB RAM per kernel kernel source+object tree in pagecache. Pagecache lookup, insertion, and removal speed for small files will also be improved. This makes RCU radix tree harder, but it's worth it. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-03-25[PATCH] radix-tree documentation cleanupsJonathan Corbet
Documentation changes to help radix tree users avoid overrunning the tags array. RADIX_TREE_TAGS moves to linux/radix-tree.h and is now known as RADIX_TREE_MAX_TAGS (Nick Piggin's idea). Tag parameters are changed to unsigned, and some comments are updated. Signed-off-by: Jonathan Corbet <corbet@lwn.net> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-02-16[PATCH] Fix over-zealous tag clearing in radix_tree_deleteNeilBrown
If a tag is set for a node being deleted from a radix_tree, then that tag gets cleared from the parent of the node, even if it is set for some siblings of the node begin deleted. This patch changes the logic to include a test for any_tag_set similar to the logic a little futher down. Care is taken to ensure that 'nr_cleared_tags' remains equals to the number of entries in the 'tags' array which are set to '0' (which means that this tag is not set in the tree below pathp->node, and should be cleared at pathp->node and possibly above. [ Nick says: "Linus FYI, I was able to modify the radix tree test harness to catch the bug and can no longer trigger it after the fix. Resulting code passes all other harness tests as well of course." ] Signed-off-by: Neil Brown <neilb@suse.de> Acked-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] radix-tree: reduce tree height upon partial truncationNick Piggin
Shrink the height of a radix tree when it is partially truncated - we only do shrinkage of full truncation at present. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] radix tree: early termination of tag clearingNick Piggin
Correctly determine the tags to be cleared in radix_tree_delete() so we don't keep moving up the tree clearing tags that we don't need to. For example, if a tag is simply not set in the deleted item, nor anywhere up the tree, radix_tree_delete() would attempt to clear it up the entire height of the tree. Also, tag_set() was made conditional so as not to dirty too many cachelines high up in the radix tree. Instead, put this logic into radix_tree_tag_set(). Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-01-08[PATCH] radix tree: code consolidationNick Piggin
Introduce helper any_tag_set() rather than repeat the same code sequence 4 times. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-11-07[PATCH] reiser4: add radix_tree_lookup_slot()Hans Reiser
Reiser4 uses radix trees to solve a trouble reiser4_readdir has serving nfs requests. Unfortunately, radix tree api lacks an operation suitable for modifying existing entry. This patch adds radix_tree_lookup_slot which returns pointer to found item within the tree. That location can be then updated. Both Nick and Christoph Lameter have patches which need this as well. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-10-08[PATCH] gfp flags annotations - part 1Al Viro
- added typedef unsigned int __nocast gfp_t; - replaced __nocast uses for gfp flags with gfp_t - it gives exactly the same warnings as far as sparse is concerned, doesn't change generated code (from gcc point of view we replaced unsigned int with typedef) and documents what's going on far better. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-10[PATCH] lib/radix-tree: Fix "nocast type" warningsVictor Fusco
Fix the sparse warning "implicit cast to nocast type" Signed-off-by: Victor Fusco <victor@cetuc.puc-rio.br> Signed-off-by: Domen Puncer <domen@coderock.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-07[PATCH] radix_tag_get(): differentiate between no present node and tag unset ↵Marcelo Tosatti
cases Simple patch to radix_tree_tag_get() to return different values for non present node and tag unset. The function is not used by any in-kernel callers (yet), but this information is definitely useful. Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-09-07[PATCH] radix-tree: Remove unnecessary indirections and clean up codeChristoph Lameter
- There is frequent use of indirections in the radix code. This patch removes those indirections, makes the code more readable and allows the compilers to generate better code. - Removing indirections allows the removal of several casts. - Removing indirections allows the reduction of the radix_tree_path size from 3 to 2 words. - Use pathp-> consistently. - Remove unnecessary tmp variable in radix_tree_insert - Separate the upper layer processing from the lowest layer in __lookup() in order to make it easier to understand what is going on and allow compilers to generate better code for the loop. Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: James Bottomley <James.Bottomley@steeleye.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-07-07[PATCH] mostly_read data sectionChristoph Lameter
Add a new section called ".data.read_mostly" for data items that are read frequently and rarely written to like cpumaps etc. If these maps are placed in the .data section then these frequenly read items may end up in cachelines with data is is frequently updated. In that case all processors in an SMP system must needlessly reload the cachelines again and again containing elements of those frequently used variables. The ability to share these cachelines will allow each cpu in an SMP system to keep local copies of those shared cachelines thereby optimizing performance. Signed-off-by: Alok N Kataria <alokk@calsoftinc.com> Signed-off-by: Shobhit Dayal <shobhit@calsoftinc.com> Signed-off-by: Christoph Lameter <christoph@scalex86.org> Signed-off-by: Shai Fultheim <shai@scalex86.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2005-04-16Linux-2.6.12-rc2v2.6.12-rc2Linus Torvalds
Initial git repository build. I'm not bothering with the full history, even though we have it. We can create a separate "historical" git archive of that later if we want to, and in the meantime it's about 3.2GB when imported into git - space that would just make the early git days unnecessarily complicated, when we don't have a lot of good infrastructure for it. Let it rip!