linux - Linux kernel source tree

Age	Commit message (Collapse)	Author
2008-07-24	slub: Fix use-after-preempt of per-CPU data structure	Dmitry Adamushko
	commit bdb21928512a860a60e6a24a849dc5b63cbaf96a upstream Vegard Nossum reported a crash in kmem_cache_alloc(): BUG: unable to handle kernel paging request at da87d000 IP: [<c01991c7>] kmem_cache_alloc+0xc7/0xe0 pde = 28180163 pte = 1a87d160 Oops: 0002 [#1] PREEMPT SMP DEBUG_PAGEALLOC Pid: 3850, comm: grep Not tainted (2.6.26-rc9-00059-gb190333 #5) EIP: 0060:[<c01991c7>] EFLAGS: 00210203 CPU: 0 EIP is at kmem_cache_alloc+0xc7/0xe0 EAX: 00000000 EBX: da87c100 ECX: 1adad71a EDX: 6b6b6b6b ESI: 00200282 EDI: da87d000 EBP: f60bfe74 ESP: f60bfe54 DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068 and analyzed it: "The register %ecx looks innocent but is very important here. The disassembly: mov %edx,%ecx shr $0x2,%ecx rep stos %eax,%es:(%edi) <-- the fault So %ecx has been loaded from %edx... which is 0x6b6b6b6b/POISON_FREE. (0x6b6b6b6b >> 2 == 0x1adadada.) %ecx is the counter for the memset, from here: memset(object, 0, c->objsize); i.e. %ecx was loaded from c->objsize, so "c" must have been freed. Where did "c" come from? Uh-oh... c = get_cpu_slab(s, smp_processor_id()); This looks like it has very much to do with CPU hotplug/unplug. Is there a race between SLUB/hotplug since the CPU slab is used after it has been freed?" Good analysis. Yeah, it's possible that a caller of kmem_cache_alloc() -> slab_alloc() can be migrated on another CPU right after local_irq_restore() and before memset(). The inital cpu can become offline in the mean time (or a migration is a consequence of the CPU going offline) so its 'kmem_cache_cpu' structure gets freed ( slab_cpuup_callback). At some point of time the caller continues on another CPU having an obsolete pointer... Signed-off-by: Dmitry Adamushko <dmitry.adamushko@gmail.com> Reported-by: Vegard Nossum <vegard.nossum@gmail.com> Acked-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2008-06-24	Fix ZERO_PAGE breakage with vmware	Linus Torvalds
	commit 672ca28e300c17bf8d792a2a7a8631193e580c74 upstream Commit 89f5b7da2a6bad2e84670422ab8192382a5aeb9f ("Reinstate ZERO_PAGE optimization in 'get_user_pages()' and fix XIP") broke vmware, as reported by Jeff Chua: "This broke vmware 6.0.4. Jun 22 14:53:03.845: vmx\| NOT_IMPLEMENTED /build/mts/release/bora-93057/bora/vmx/main/vmmonPosix.c:774" and the reason seems to be that there's an old bug in how we handle do FOLL_ANON on VM_SHARED areas in get_user_pages(), but since it only triggered if the whole page table was missing, nobody had apparently hit it before. The recent changes to 'follow_page()' made the FOLL_ANON logic trigger not just for whole missing page tables, but for individual pages as well, and exposed this problem. This fixes it by making the test for when FOLL_ANON is used more careful, and also makes the code easier to read and understand by moving the logic to a separate inline function. Reported-and-tested-by: Jeff Chua <jeff.chua.linux@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2008-06-24	Add return value to reserve_bootmem_node()	Bernhard Walle
	commit 71c2742f5e6348d76ee62085cf0a13e5eff0f00e upstream This patch changes the function reserve_bootmem_node() from void to int, returning -ENOMEM if the allocation fails. This fixes a build problem on x86 with CONFIG_KEXEC=y and CONFIG_NEED_MULTIPLE_NODES=y Signed-off-by: Bernhard Walle <bwalle@suse.de> Reported-by: Adrian Bunk <bunk@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2008-06-24	Reinstate ZERO_PAGE optimization in 'get_user_pages()' and fix XIP	Linus Torvalds
	commit 89f5b7da2a6bad2e84670422ab8192382a5aeb9f upstream KAMEZAWA Hiroyuki and Oleg Nesterov point out that since the commit 557ed1fa2620dc119adb86b34c614e152a629a80 ("remove ZERO_PAGE") removed the ZERO_PAGE from the VM mappings, any users of get_user_pages() will generally now populate the VM with real empty pages needlessly. We used to get the ZERO_PAGE when we did the "handle_mm_fault()", but since fault handling no longer uses ZERO_PAGE for new anonymous pages, we now need to handle that special case in follow_page() instead. In particular, the removal of ZERO_PAGE effectively removed the core file writing optimization where we would skip writing pages that had not been populated at all, and increased memory pressure a lot by allocating all those useless newly zeroed pages. This reinstates the optimization by making the unmapped PTE case the same as for a non-existent page table, which already did this correctly. While at it, this also fixes the XIP case for follow_page(), where the caller could not differentiate between the case of a page that simply could not be used (because it had no "struct page" associated with it) and a page that just wasn't mapped. We do that by simply returning an error pointer for pages that could not be turned into a "struct page *". The error is arbitrarily picked to be EFAULT, since that was what get_user_pages() already used for the equivalent IO-mapped page case. [ Also removed an impossible test for pte_offset_map_lock() failing: that's not how that function works ] Acked-by: Oleg Nesterov <oleg@tv-sign.ru> Acked-by: Nick Piggin <npiggin@suse.de> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hugh Dickins <hugh@veritas.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Ingo Molnar <mingo@elte.hu> Cc: Roland McGrath <roland@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2008-06-09	brk: make sys_brk() honor COMPAT_BRK when computing lower bound	Jiri Kosina
	upstream commit: a5b4592cf77b973c29e7c9695873a26052b58951 Fix a regression introduced by commit 4cc6028d4040f95cdb590a87db478b42b8be0508 Author: Jiri Kosina <jkosina@suse.cz> Date: Wed Feb 6 22:39:44 2008 +0100 brk: check the lower bound properly The check in sys_brk() on minimum value the brk might have must take CONFIG_COMPAT_BRK setting into account. When this option is turned on (i.e. we support ancient legacy binaries, e.g. libc5-linked stuff), the lower bound on brk value is mm->end_code, otherwise the brk start is allowed to be arbitrarily shifted. Signed-off-by: Jiri Kosina <jkosina@suse.cz> Tested-by: Geert Uytterhoeven <geert@linux-m68k.org> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Chris Wright <chrisw@sous-sol.org>
2008-06-09	memory_hotplug: always initialize pageblock bitmap	Heiko Carstens
	commit 76cdd58e558669366adfaded436fda01b30cce3e upstream Trying to online a new memory section that was added via memory hotplug sometimes results in crashes when the new pages are added via __free_page. Reason for that is that the pageblock bitmap isn't initialized and hence contains random stuff. That means that get_pageblock_migratetype() returns also random stuff and therefore list_add(&page->lru, &zone->free_area[order].free_list[migratetype]); in __free_one_page() tries to do a list_add to something that isn't even necessarily a list. This happens since 86051ca5eaf5e560113ec7673462804c54284456 ("mm: fix usemap initialization") which makes sure that the pageblock bitmap gets only initialized for pages present in a zone. Unfortunately for hot-added memory the zones "grow" after the memmap and the pageblock memmap have been initialized. Which means that the new pages have an unitialized bitmap. To solve this the calls to grow_zone_span() and grow_pgdat_span() are moved to __add_zone() just before the initialization happens. The patch also moves the two functions since __add_zone() is the only caller and I didn't want to add a forward declaration. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Andy Whitcroft <apw@shadowen.org> Cc: Dave Hansen <haveblue@us.ibm.com> Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Yasunori Goto <y-goto@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de> Signed-off-by: Chris Wright <chrisw@sous-sol.org>
2008-05-09	mm: fix usemap initialization	KAMEZAWA Hiroyuki
	commit: 86051ca5eaf5e560113ec7673462804c54284456 upstream usemap must be initialized only when pfn is within zone. If not, it corrupts memory. And this patch also reduces the number of calls to set_pageblock_migratetype() from (pfn & (pageblock_nr_pages -1) to !(pfn & (pageblock_nr_pages-1) it should be called once per pageblock. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Cc: Hugh Dickins <hugh@veritas.com> Cc: Shi Weihua <shiwh@cn.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2008-05-01	mm: fix possible off-by-one in walk_pte_range()	Johannes Weiner
	commit 556637cdabcd5918c7d4a1a2679b8f86fc81e891 upstream After the loop in walk_pte_range() pte might point to the first address after the pmd it walks. The pte_unmap() is then applied to something bad. Spotted by Roel Kluin and Andreas Schwab. Signed-off-by: Johannes Weiner <hannes@saeurebad.de> Cc: Roel Kluin <12o3l@tiscali.nl> Cc: Andreas Schwab <schwab@suse.de> Acked-by: Matt Mackall <mpm@selenic.com> Acked-by: Mikael Pettersson <mikpe@it.uu.se> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2008-04-15	add "Isolate" migratetype name to /proc/pagetypeinfo	KOSAKI Motohiro
	In a5d76b54a3f3a40385d7f76069a2feac9f1bad63 (memory unplug: page isolation by KAMEZAWA Hiroyuki), "isolate" migratetype added. but unfortunately, it doesn't treat /proc/pagetypeinfo display logic. this patch add "Isolate" to pagetype name field. /proc/pagetype before: ------------------------------------------------------------------------------------------------------------------------ Free pages count per migrate type at order 0 1 2 3 4 5 6 7 8 9 10 Node 0, zone DMA, type Unmovable 1 2 2 2 1 2 2 1 1 0 0 Node 0, zone DMA, type Reclaimable 0 0 0 0 0 0 0 0 0 0 0 Node 0, zone DMA, type Movable 2 3 3 1 3 3 2 0 0 0 0 Node 0, zone DMA, type Reserve 0 0 0 0 0 0 0 0 0 0 1 Node 0, zone DMA, type <NULL> 0 0 0 0 0 0 0 0 0 0 0 Node 0, zone Normal, type Unmovable 1 9 7 4 1 1 1 1 0 0 0 Node 0, zone Normal, type Reclaimable 5 2 0 0 1 1 0 0 0 1 0 Node 0, zone Normal, type Movable 0 1 1 0 0 0 1 0 0 1 60 Node 0, zone Normal, type Reserve 0 0 0 0 0 0 0 0 0 0 1 Node 0, zone Normal, type <NULL> 0 0 0 0 0 0 0 0 0 0 0 Node 0, zone HighMem, type Unmovable 0 0 1 1 1 0 1 1 2 2 0 Node 0, zone HighMem, type Reclaimable 0 0 0 0 0 0 0 0 0 0 0 Node 0, zone HighMem, type Movable 236 62 6 2 2 1 1 0 1 1 16 Node 0, zone HighMem, type Reserve 0 0 0 0 0 0 0 0 0 0 1 Node 0, zone HighMem, type <NULL> 0 0 0 0 0 0 0 0 0 0 0 Number of blocks type Unmovable Reclaimable Movable Reserve <NULL> Node 0, zone DMA 1 0 2 1 0 Node 0, zone Normal 10 40 169 1 0 Node 0, zone HighMem 2 0 283 1 0 after: ------------------------------------------------------------------------------------------------------------------------ Free pages count per migrate type at order 0 1 2 3 4 5 6 7 8 9 10 Node 0, zone DMA, type Unmovable 1 2 2 2 1 2 2 1 1 0 0 Node 0, zone DMA, type Reclaimable 0 0 0 0 0 0 0 0 0 0 0 Node 0, zone DMA, type Movable 2 3 3 1 3 3 2 0 0 0 0 Node 0, zone DMA, type Reserve 0 0 0 0 0 0 0 0 0 0 1 Node 0, zone DMA, type Isolate 0 0 0 0 0 0 0 0 0 0 0 Node 0, zone Normal, type Unmovable 0 2 1 1 0 1 0 0 0 0 0 Node 0, zone Normal, type Reclaimable 1 1 1 1 1 0 1 1 1 0 0 Node 0, zone Normal, type Movable 0 1 1 1 0 1 0 1 0 0 196 Node 0, zone Normal, type Reserve 0 0 0 0 0 0 0 0 0 0 1 Node 0, zone Normal, type Isolate 0 0 0 0 0 0 0 0 0 0 0 Node 0, zone HighMem, type Unmovable 0 1 0 0 0 1 1 1 2 2 0 Node 0, zone HighMem, type Reclaimable 0 0 0 0 0 0 0 0 0 0 0 Node 0, zone HighMem, type Movable 1 0 1 1 0 0 0 0 1 0 200 Node 0, zone HighMem, type Reserve 0 0 0 0 0 0 0 0 0 0 1 Node 0, zone HighMem, type Isolate 0 0 0 0 0 0 0 0 0 0 0 Number of blocks type Unmovable Reclaimable Movable Reserve Isolate Node 0, zone DMA 1 0 2 1 0 Node 0, zone Normal 8 4 207 1 0 Node 0, zone HighMem 2 0 283 1 0 Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Mel Gorman <mel@csn.ul.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-15	memcg: fix oops in oom handling	Li Zefan
	When I used a test program to fork mass processes and immediately move them to a cgroup where the memory limit is low enough to trigger oom kill, I got oops: BUG: unable to handle kernel NULL pointer dereference at 0000000000000808 IP: [<ffffffff8045c47f>] _spin_lock_irqsave+0x8/0x18 PGD 4c95f067 PUD 4406c067 PMD 0 Oops: 0002 [1] SMP CPU 2 Modules linked in: Pid: 11973, comm: a.out Not tainted 2.6.25-rc7 #5 RIP: 0010:[<ffffffff8045c47f>] [<ffffffff8045c47f>] _spin_lock_irqsave+0x8/0x18 RSP: 0018:ffff8100448c7c30 EFLAGS: 00010002 RAX: 0000000000000202 RBX: 0000000000000009 RCX: 000000000001c9f3 RDX: 0000000000000100 RSI: 0000000000000001 RDI: 0000000000000808 RBP: ffff81007e444080 R08: 0000000000000000 R09: ffff8100448c7900 R10: ffff81000105f480 R11: 00000100ffffffff R12: ffff810067c84140 R13: 0000000000000001 R14: ffff8100441d0018 R15: ffff81007da56200 FS: 00007f70eb1856f0(0000) GS:ffff81007fbad3c0(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 0000000000000808 CR3: 000000004498a000 CR4: 00000000000006e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process a.out (pid: 11973, threadinfo ffff8100448c6000, task ffff81007da533e0) Stack: ffffffff8023ef5a 00000000000000d0 ffffffff80548dc0 00000000000000d0 ffff810067c84140 ffff81007e444080 ffffffff8026cef9 00000000000000d0 ffff8100441d0000 00000000000000d0 ffff8100441d0000 ffff8100505445c0 Call Trace: [<ffffffff8023ef5a>] ? force_sig_info+0x25/0xb9 [<ffffffff8026cef9>] ? oom_kill_task+0x77/0xe2 [<ffffffff8026d696>] ? mem_cgroup_out_of_memory+0x55/0x67 [<ffffffff802910ad>] ? mem_cgroup_charge_common+0xec/0x202 [<ffffffff8027997b>] ? handle_mm_fault+0x24e/0x77f [<ffffffff8022c4af>] ? default_wake_function+0x0/0xe [<ffffffff8027a17a>] ? get_user_pages+0x2ce/0x3af [<ffffffff80290fee>] ? mem_cgroup_charge_common+0x2d/0x202 [<ffffffff8027a441>] ? make_pages_present+0x8e/0xa4 [<ffffffff8027d1ab>] ? mmap_region+0x373/0x429 [<ffffffff8027d7eb>] ? do_mmap_pgoff+0x2ff/0x364 [<ffffffff80210471>] ? sys_mmap+0xe5/0x111 [<ffffffff8020bfc9>] ? tracesys+0xdc/0xe1 Code: 00 00 01 48 8b 3c 24 e9 46 d4 dd ff f0 ff 07 48 8b 3c 24 e9 3a d4 dd ff fe 07 48 8b 3c 24 e9 2f d4 dd ff 9c 58 fa ba 00 01 00 00 <f0> 66 0f c1 17 38 f2 74 06 f3 90 8a 17 eb f6 c3 fa b8 00 01 00 RIP [<ffffffff8045c47f>] _spin_lock_irqsave+0x8/0x18 RSP <ffff8100448c7c30> CR2: 0000000000000808 ---[ end trace c3702fa668021ea4 ]--- It's reproducable in a x86_64 box, but doesn't happen in x86_32. This is because tsk->sighand is not guarded by RCU, so we have to hold tasklist_lock, just as what out_of_memory() does. Signed-off-by: Li Zefan <lizf@cn.fujitsu> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Pavel Emelianov <xemul@openvz.org> Cc: Paul Menage <menage@google.com> Cc: Oleg Nesterov <oleg@tv-sign.ru> Cc: David Rientjes <rientjes@cs.washington.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-15	mm: sparsemem memory_present() fix	Ingo Molnar
	Fix memory corruption and crash on 32-bit x86 systems. If a !PAE x86 kernel is booted on a 32-bit system with more than 4GB of RAM, then we call memory_present() with a start/end that goes outside the scope of MAX_PHYSMEM_BITS. That causes this loop to happily walk over the limit of the sparse memory section map: for (pfn = start; pfn < end; pfn += PAGES_PER_SECTION) { unsigned long section = pfn_to_section_nr(pfn); struct mem_section *ms; sparse_index_init(section, nid); set_section_nid(section, nid); ms = __nr_to_section(section); if (!ms->section_mem_map) ms->section_mem_map = sparse_encode_early_nid(nid) \| SECTION_MARKED_PRESENT; 'ms' will be out of bounds and we'll corrupt a small amount of memory by encoding the node ID and writing SECTION_MARKED_PRESENT (==0x1) over it. The corruption might happen when encoding a non-zero node ID, or due to the SECTION_MARKED_PRESENT which is 0x1: mmzone.h:#define SECTION_MARKED_PRESENT (1UL<<0) The fix is to sanity check anything the architecture passes to sparsemem. This bug seems to be rather old (as old as sparsemem support itself), but the exact incarnation depended on random details like configs, which made this bug more prominent in v2.6.25-to-be. An additional enhancement might be to print a warning about ignored or trimmed memory ranges. Signed-off-by: Ingo Molnar <mingo@elte.hu> Tested-by: Christoph Lameter <clameter@sgi.com> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Nick Piggin <npiggin@suse.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Rafael J. Wysocki <rjw@sisk.pl> Cc: Yinghai Lu <Yinghai.Lu@sun.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-08	memcg: fix node_state handling	KAMEZAWA Hiroyuki
	This should be N_NORMAL_MEMORY. N_NORMAL_MEMORY is "true" if a node has memory for the kernel. N_HIGH_MEMORY is "true" if a node has memory for HIGHMEM. (If CONFIG_HIGHMEM=n, always "true") This check is used for testing whether we can use kmalloc_node() on a node. Then, if there is a node which only contains HIGHMEM, the system will call kmalloc_node() which doesn't contain memory for the kernel. If it happens under SLUB, the kernel will panic. I think this only happens on x86_32-numa. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-04	memory controller: make memory resource control aware of boot options	Balbir Singh
	A boot option for the memory controller was discussed on lkml. It is a good idea to add it, since it saves memory for people who want to turn off the memory controller. By default the option is on for the following two reasons: 1. It provides compatibility with the current scheme where the memory controller turns on if the config option is enabled 2. It allows for wider testing of the memory controller, once the config option is enabled We still allow the create, destroy callbacks to succeed, since they are not aware of boot options. We do not populate the directory will memory resource controller specific files. Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Paul Menage <menage@google.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: Pavel Emelyanov <xemul@openvz.org> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hugh Dickins <hugh@veritas.com> Cc: Sudhir Kumar <skumar@linux.vnet.ibm.com> Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp> Cc: David Rientjes <rientjes@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-04-01	Fix undefined count_partial if !CONFIG_SLABINFO	Christoph Lameter
	Small typo in the patch recently merged to avoid the unused symbol message for count_partial(). Discussion thread with confirmation of fix at http://marc.info/?t=120696854400001&r=1&w=2 Typo in the check if we need the count_partial function that was introduced by 53625b4204753b904addd40ca96d9ba802e6977d Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-03-30	NULL noise: fs/, mm/, kernel/*	Al Viro
	Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-03-27	Revert "SLUB: remove useless masking of GFP_ZERO"	Linus Torvalds
	This reverts commit 3811dbf67162bd08412f1b0e02e554f353e93bdb. The masking was not at all useless, and it was sensible. We handle GFP_ZERO in the caller, and passing it down to any page allocator logic is buggy and wrong. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-03-26	hugetlb: fix potential livelock in return_unused_surplus_hugepages()	Nishanth Aravamudan
	Running the counters testcase from libhugetlbfs results in on 2.6.25-rc5 and 2.6.25-rc5-mm1: BUG: soft lockup - CPU#3 stuck for 61s! [counters:10531] NIP: c0000000000d1f3c LR: c0000000000d1f2c CTR: c0000000001b5088 REGS: c000005db12cb360 TRAP: 0901 Not tainted (2.6.25-rc5-autokern1) MSR: 8000000000009032 <EE,ME,IR,DR> CR: 48008448 XER: 20000000 TASK = c000005dbf3d6000[10531] 'counters' THREAD: c000005db12c8000 CPU: 3 GPR00: 0000000000000004 c000005db12cb5e0 c000000000879228 0000000000000004 GPR04: 0000000000000010 0000000000000000 0000000000200200 0000000000100100 GPR08: c0000000008aba10 000000000000ffff 0000000000000004 0000000000000000 GPR12: 0000000028000442 c000000000770080 NIP [c0000000000d1f3c] .return_unused_surplus_pages+0x84/0x18c LR [c0000000000d1f2c] .return_unused_surplus_pages+0x74/0x18c Call Trace: [c000005db12cb5e0] [c000005db12cb670] 0xc000005db12cb670 (unreliable) [c000005db12cb670] [c0000000000d24c4] .hugetlb_acct_memory+0x2e0/0x354 [c000005db12cb740] [c0000000001b5048] .truncate_hugepages+0x1d4/0x214 [c000005db12cb890] [c0000000001b50a4] .hugetlbfs_delete_inode+0x1c/0x3c [c000005db12cb920] [c000000000103fd8] .generic_delete_inode+0xf8/0x1c0 [c000005db12cb9b0] [c0000000001b5100] .hugetlbfs_drop_inode+0x3c/0x24c [c000005db12cba50] [c00000000010287c] .iput+0xdc/0xf8 [c000005db12cbad0] [c0000000000fee54] .dentry_iput+0x12c/0x194 [c000005db12cbb60] [c0000000000ff050] .d_kill+0x6c/0xa4 [c000005db12cbbf0] [c0000000000ffb74] .dput+0x18c/0x1b0 [c000005db12cbc70] [c0000000000e9e98] .__fput+0x1a4/0x1e8 [c000005db12cbd10] [c0000000000e61ec] .filp_close+0xb8/0xe0 [c000005db12cbda0] [c0000000000e62d0] .sys_close+0xbc/0x134 [c000005db12cbe30] [c00000000000872c] syscall_exit+0x0/0x40 Instruction dump: ebbe8038 38800010 e8bf0002 3bbd0008 7fa3eb78 38a50001 7ca507b4 4818df25 60000000 38800010 38a00000 7c601b78 <7fa3eb78> 2f800010 409d0008 38000010 This was tracked down to a potential livelock in return_unused_surplus_hugepages(). In the case where we have surplus pages on some node, but no free pages on the same node, we may never break out of the loop. To avoid this livelock, terminate the search if we iterate a number of times equal to the number of online nodes without freeing a page. Thanks to Andy Whitcroft and Adam Litke for helping with debugging and the patch. Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-03-26	hugetlb: indicate surplus huge page counts in per-node meminfo	Nishanth Aravamudan
	Currently we show the surplus hugetlb pool state in /proc/meminfo, but not in the per-node meminfo files, even though we track the information on a per-node basis. Printing it there can help track down dynamic pool bugs including the one in the follow-on patch. Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-03-26	slab: fix cache_cache bootstrap in kmem_cache_init()	Daniel Yeisley
	Commit 556a169dab38b5100df6f4a45b655dddd3db94c1 ("slab: fix bootstrap on memoryless node") introduced bootstrap-time cache_cache list3s for all nodes but forgot that initkmem_list3 needs to be accessed by [somevalue + node]. This patch fixes list_add() corruption in mm/slab.c seen on the ES7000. Cc: Mel Gorman <mel@csn.ul.ie> Cc: Olaf Hering <olaf@aepfle.de> Cc: Christoph Lameter <clameter@sgi.com> Signed-off-by: Dan Yeisley <dan.yeisley@unisys.com> Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Christoph Lameter <clameter@sgi.com>
2008-03-26	count_partial() is not used if !SLUB_DEBUG and !CONFIG_SLABINFO	Christoph Lameter
	Avoid warnings about unused functions if neither SLUB_DEBUG nor CONFIG_SLABINFO is defined. This patch will be reversed when slab defrag is merged since slab defrag requires count_partial() to determine the fragmentation status of slab caches. Signed-off-by: Christoph Lameter <clameter@sgi.com>
2008-03-25	Merge branch 'for-linus' of ↵	Linus Torvalds
	git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: [PATCH] get stack footprint of pathname resolution back to relative sanity [PATCH] double iput() on failure exit in hugetlb [PATCH] double dput() on failure exit in tiny-shmem [PATCH] fix up new filp allocators [PATCH] check for null vfsmount in dentry_open() [PATCH] reiserfs: eliminate private use of struct file in xattr [PATCH] sanitize hppfs hppfs pass vfsmount to dentry_open() [PATCH] restore export of do_kern_mount()
2008-03-24	revert "kswapd should only wait on IO if there is IO"	Andrew Morton
	Revert commit f1a9ee758de7de1e040de849fdef46e6802ea117: Author: Rik van Riel <riel@redhat.com> Date: Thu Feb 7 00:14:08 2008 -0800 kswapd should only wait on IO if there is IO The current kswapd (and try_to_free_pages) code has an oddity where the code will wait on IO, even if there is no IO in flight. This problem is notable especially when the system scans through many unfreeable pages, causing unnecessary stalls in the VM. Additionally, tasks without __GFP_FS or __GFP_IO in the direct reclaim path will sleep if a significant number of pages are encountered that should be written out. This gives kswapd a chance to write out those pages, while the direct reclaim task sleeps. Signed-off-by: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Because of large latencies and interactivity problems reported by Carlos, here: http://lkml.org/lkml/2008/3/22/211 Cc: Rik van Riel <riel@redhat.com> Cc: "Carlos R. Mafra" <crmafra2@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-03-24	mm: fix boundary checking in free_bootmem_core	Yinghai Lu
	With numa enabled, some callers could have a range of memory on one node but try to free that on other node. This can cause some pages to be freed wrongly. For example: when we try to allocate 128g boot ram early for gart/swiotlb, and free that range later so gart/swiotlb can get some range afterwards. With this patch, we don't need to care which node holds the range, just loop to call free_bootmem_node for all online nodes. This patch makes free_bootmem_core() more robust by trimming the sidx and eidx according the ram range that the node has. And make the free_bootmem_core handle this out of range case. We could use bdata_list to make sure the range can be freed for sure. So next time, we don't need to loop online nodes and could use free_bootmem directly. Signed-off-by: Yinghai Lu <yhlu.kernel@gmail.com> Cc: Andi Kleen <ak@suse.de> Cc: Yasunori Goto <y-goto@jp.fujitsu.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Acked-by: Ingo Molnar <mingo@elte.hu> Tested-by: Ingo Molnar <mingo@elte.hu> Cc: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-03-19	mm/readahead: fix kernel-doc notation	Randy Dunlap
	Fix kernel-doc notation in mm/readahead.c. Change ":" to ";" so that it doesn't get treated as a doc section heading. Move the comment block ending "*/" to a line by itself so that the text on that last line is not lost (dropped). Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-03-19	memcgroup: fix check for thread being a group leader in memcgroup	Pavel Emelyanov
	The check t->pid == t->pid is not the blessed way to check whether a task is a group leader. This is not about the code beautifulness only, but about pid namespaces fixes - both the tgid and the pid fields on the task_struct are (slowly :( ) becoming deprecated. Besides, the thread_group_leader() macro makes only one dereference :) Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Paul Menage <menage@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-03-19	mm: rmap kernel-doc fixes	Randy Dunlap
	Correct kernel-doc function names and parameters in rmap.c. Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-03-19	mm: highmem kernel-doc additions	Randy Dunlap
	Add kernel-doc comments to highmem.c. Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-03-19	mm/oom_kill: fix kernel-doc	Randy Dunlap
	Fix kernel-doc notation in oom_kill.c. Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-03-19	mm/shmem and tiny-shmem: fix some kernel-doc	Randy Dunlap
	Convert tiny-shmem.c function comments to kernel-doc. Add parameters and convert/fix other kernel-doc in shmem.c. Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-03-19	mm: fix various kernel-doc comments	Randy Dunlap
	Fix various kernel-doc notation in mm/: filemap.c: add function short description; convert 2 to kernel-doc fremap.c: change parameter 'prot' to @prot pagewalk.c: change "-" in function parameters to ":" slab.c: fix short description of kmem_ptr_validate() swap.c: fix description & parameters of put_pages_list() swap_state.c: fix function parameters vmalloc.c: change "@returns" to "Returns:" since that is not a parameter Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-03-19	[PATCH] double dput() on failure exit in tiny-shmem	Al Viro
	Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2008-03-17	slub page alloc fallback: Enable interrupts for GFP_WAIT.	Christoph Lameter
	The fallback path needs to enable interrupts like done for the other page allocator calls. This was not necessary with the alternate fast path since we handled irq enable/disable in the slow path. The regular fastpath handles irq enable/disable around calls to the slow path so we need to restore the proper status before calling the page allocator from the slowpath. Signed-off-by: Christoph Lameter <clameter@sgi.com>
2008-03-10	iov_iter_advance() fix	Nick Piggin
	iov_iter_advance() skips over zero-length iovecs, however it does not properly terminate at the end of the iovec array. Fix this by checking against i->count before we skip a zero-length iov. The bug was reproduced with a test program that continually randomly creates iovs to writev. The fix was also verified with the same program and also it could verify that the correct data was contained in the file after each writev. Signed-off-by: Nick Piggin <npiggin@suse.de> Tested-by: "Kevin Coffman" <kwc@citi.umich.edu> Cc: "Alexey Dobriyan" <adobriyan@gmail.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-03-10	hugetlb: correct page count for surplus huge pages	Adam Litke
	Free pages in the hugetlb pool are free and as such have a reference count of zero. Regular allocations into the pool from the buddy are "freed" into the pool which results in their page_count dropping to zero. However, surplus pages can be directly utilized by the caller without first being freed to the pool. Therefore, a call to put_page_testzero() is in order so that such a page will be handed to the caller with a correct count. This has not affected end users because the bad page count is reset before the page is handed off. However, under CONFIG_DEBUG_VM this triggers a BUG when the page count is validated. Thanks go to Mel for first spotting this issue and providing an initial fix. Signed-off-by: Adam Litke <agl@us.ibm.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Dave Hansen <haveblue@us.ibm.com> Cc: William Lee Irwin III <wli@holomorphy.com> Cc: Andy Whitcroft <apw@shadowen.org> Cc: Mel Gorman <mel@csn.ul.ie> Cc: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-03-10	mempolicy: fix reference counting bugs	Lee Schermerhorn
	Address 3 known bugs in the current memory policy reference counting method. I have a series of patches to rework the reference counting to reduce overhead in the allocation path. However, that series will require testing in -mm once I repost it. 1) alloc_page_vma() does not release the extra reference taken for vma/shared mempolicy when the mode == MPOL_INTERLEAVE. This can result in leaking mempolicy structures. This is probably occurring, but not being noticed. Fix: add the conditional release of the reference. 2) hugezonelist unconditionally releases a reference on the mempolicy when mode == MPOL_INTERLEAVE. This can result in decrementing the reference count for system default policy [should have no ill effect] or premature freeing of task policy. If this occurred, the next allocation using task mempolicy would use the freed structure and probably BUG out. Fix: add the necessary check to the release. 3) The current reference counting method assumes that vma 'get_policy()' methods automatically add an extra reference a non-NULL returned mempolicy. This is true for shmem_get_policy() used by tmpfs mappings, including regular page shm segments. However, SHM_HUGETLB shm's, backed by hugetlbfs, just use the vma policy without the extra reference. This results in freeing of the vma policy on the first allocation, with reuse of the freed mempolicy structure on subsequent allocations. Fix: Rather than add another condition to the conditional reference release, which occur in the allocation path, just add a reference when returning the vma policy in shm_get_policy() to match the assumptions. Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Greg KH <greg@kroah.com> Cc: Andi Kleen <ak@suse.de> Cc: Christoph Lameter <clameter@sgi.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: David Rientjes <rientjes@google.com> Cc: <eric.whitney@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-03-09	Do not include linux/backing-dev.h twice	Jesper Juhl
	Don't include linux/backing-dev.h twice in mm/filemap.c, it's pointless. Signed-off-by: Jesper Juhl <jesper.juhl@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-03-06	slab: NUMA slab allocator migration bugfix	Joe Korty
	NUMA slab allocator cpu migration bugfix The NUMA slab allocator (specifically, cache_alloc_refill) is not refreshing its local copies of what cpu and what numa node it is on, when it drops and reacquires the irq block that it inherited from its caller. As a result those values become invalid if an attempt to migrate the process to another numa node occured while the irq block had been dropped. The solution is to make cache_alloc_refill reload these variables whenever it drops and reacquires the irq block. The error is very difficult to hit. When it does occur, one gets the following oops + stack traceback bits in check_spinlock_acquired: kernel BUG at mm/slab.c:2417 cache_alloc_refill+0xe6 kmem_cache_alloc+0xd0 ... This patch was developed against 2.6.23, ported to and compiled-tested only against 2.6.25-rc4. Signed-off-by: Joe Korty <joe.korty@ccur.com> Signed-off-by: Christoph Lameter <clameter@sgi.com>
2008-03-06	slub: Do not cross cacheline boundaries for very small objects	Nick Piggin
	SLUB should pack even small objects nicely into cachelines if that is what has been asked for. Use the same algorithm as SLAB for this. The effect of this patch for a system with a cacheline size of 64 bytes is that the 24 byte sized slab caches will now put exactly 2 objects into a cacheline instead of 3 with some overlap into the next cacheline. This reduces the object density in a 4k slab from 170 to 128 objects (same as SLAB). Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Christoph Lameter <clameter@sgi.com>
2008-03-06	slab - use angle brackets for include of kmalloc_sizes.h	Joe Perches
	Make them all use angle brackets and the directory name. Acked-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Joe Perches <joe@perches.com> Signed-off-by: Christoph Lameter <clameter@sgi.com>
2008-03-06	slab numa fallback logic: Do not pass unfiltered flags to page allocator	Christoph Lameter
	The NUMA fallback logic should be passing local_flags to kmem_get_pages() and not simply the flags passed in. Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Christoph Lameter <clameter@sgi.com>
2008-03-06	slub statistics: Fix check for DEACTIVATE_REMOTE_FREES	Christoph Lameter
	The remote frees are in the freelist of the page and not in the percpu freelist. Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Christoph Lameter <clameter@sgi.com>
2008-03-04	hugetlb: fix pool shrinking while in restricted cpuset	Nishanth Aravamudan
	Adam Litke noticed that currently we grow the hugepage pool independent of any cpuset the running process may be in, but when shrinking the pool, the cpuset is checked. This leads to inconsistency when shrinking the pool in a restricted cpuset -- an administrator may have been able to grow the pool on a node restricted by a containing cpuset, but they cannot shrink it there. There are two options: either prevent growing of the pool outside of the cpuset or allow shrinking outside of the cpuset. >From previous discussions on linux-mm, /proc/sys/vm/nr_hugepages is an administrative interface that should not be restricted by cpusets. So allow shrinking the pool by removing pages from nodes outside of current's cpuset. Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com> Acked-by: Adam Litke <agl@us.ibm.com> Cc: William Irwin <wli@holomorphy.com> Cc: Lee Schermerhorn <Lee.Schermerhonr@hp.com> Cc: Christoph Lameter <clameter@sgi.com> Cc: Paul Jackson <pj@sgi.com> Cc: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-03-04	hugetlb: close a difficult to trigger reservation race	Adam Litke
	A hugetlb reservation may be inadequately backed in the event of racing allocations and frees when utilizing surplus huge pages. Consider the following series of events in processes A and B: A) Allocates some surplus pages to satisfy a reservation B) Frees some huge pages A) A notices the extra free pages and drops hugetlb_lock to free some of its surplus pages back to the buddy allocator. B) Allocates some huge pages A) Reacquires hugetlb_lock and returns from gather_surplus_huge_pages() Avoid this by commiting the reservation after pages have been allocated but before dropping the lock to free excess pages. For parity, release the reservation in return_unused_surplus_pages(). This patch also corrects the cpuset_mems_nr() error path in hugetlb_acct_memory(). If the cpuset check fails, uncommit the reservation, but also be sure to return any surplus huge pages that may have been allocated to back the failed reservation. Thanks to Andy Whitcroft for discovering this. Signed-off-by: Adam Litke <agl@us.ibm.com> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Andy Whitcroft <apw@shadowen.org> Cc: Dave Hansen <haveblue@us.ibm.com> Cc: William Lee Irwin III <wli@holomorphy.com> Cc: Andy Whitcroft <apw@shadowen.org> Cc: Mel Gorman <mel@csn.ul.ie> Cc: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-03-04	memcg: fix oops on NULL lru list	Hugh Dickins
	While testing force_empty, during an exit_mmap, __mem_cgroup_remove_list called from mem_cgroup_uncharge_page oopsed on a NULL pointer in the lru list. I couldn't see what racing tasks on other cpus were doing, but surmise that another must have been in mem_cgroup_charge_common on the same page, between its unlock_page_cgroup and spin_lock_irqsave near done (thanks to that kzalloc which I'd almost changed to a kmalloc). Normally such a race cannot happen, the ref_cnt prevents it, the final uncharge cannot race with the initial charge. But force_empty buggers the ref_cnt, that's what it's all about; and thereafter forced pages are vulnerable to races such as this (just think of a shared page also mapped into an mm of another mem_cgroup than that just emptied). And remain vulnerable until they're freed indefinitely later. This patch just fixes the oops by moving the unlock_page_cgroups down below adding to and removing from the list (only possible given the previous patch); and while we're at it, we might as well make it an invariant that page->page_cgroup is always set while pc is on lru. But this behaviour of force_empty seems highly unsatisfactory to me: why have a ref_cnt if we always have to cope with it being violated (as in the earlier page migration patch). We may prefer force_empty to move pages to an orphan mem_cgroup (could be the root, but better not), from which other cgroups could recover them; we might need to reverse the locking again; but no time now for such concerns. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-03-04	memcg: simplify force_empty and move_lists	Hirokazu Takahashi
	As for force_empty, though this may not be the main topic here, mem_cgroup_force_empty_list() can be implemented simpler. It is possible to make the function just call mem_cgroup_uncharge_page() instead of releasing page_cgroups by itself. The tip is to call get_page() before invoking mem_cgroup_uncharge_page(), so the page won't be released during this function. Kamezawa-san points out that by the time mem_cgroup_uncharge_page() uncharges, the page might have been reassigned to an lru of a different mem_cgroup, and now be emptied from that; but Hugh claims that's okay, the end state is the same as when it hasn't gone to another list. And once force_empty stops taking lock_page_cgroup within mz->lru_lock, mem_cgroup_move_lists() can be simplified to take mz->lru_lock directly while holding page_cgroup lock (but still has to use try_lock_page_cgroup). Signed-off-by: Hirokazu Takahashi <taka@valinux.co.jp> Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: David Rientjes <rientjes@google.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp> Cc: Paul Menage <menage@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-03-04	memcg: fix mem_cgroup_move_lists locking	Hugh Dickins
	Ever since the VM_BUG_ON(page_get_page_cgroup(page)) (now Bad page state) went into page freeing, I've hit it from time to time in testing on some machines, sometimes only after many days. Recently found a machine which could usually produce it within a few hours, which got me there at last. The culprit is mem_cgroup_move_lists, whose locking is inadequate; and the arrangement of structures was such that you got page_cgroups from the lru list neatly put on to SLUB's freelist. Kamezawa-san identified the same hole independently. The main problem was that it was missing the lock_page_cgroup it needs to safely page_get_page_cgroup; but it's tricky to go beyond that too, and I couldn't do it with SLAB_DESTROY_BY_RCU as I'd expected. See the code for comments on the constraints. This patch immediately gets replaced by a simpler one from Hirokazu-san; but is it just foolish pride that tells me to put this one on record, in case we need to come back to it later? Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: David Rientjes <rientjes@google.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hirokazu Takahashi <taka@valinux.co.jp> Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp> Cc: Paul Menage <menage@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-03-04	memcg: css_put after remove_list	Hugh Dickins
	mem_cgroup_uncharge_page does css_put on the mem_cgroup before uncharging from it, and before removing page_cgroup from one of its lru lists: isn't there a danger that struct mem_cgroup memory could be freed and reused before completing that, so corrupting something? Never seen it, and for all I know there may be other constraints which make it impossible; but let's be defensive and reverse the ordering there. mem_cgroup_force_empty_list is safe because there's an extra css_get around all its works; but even so, change its ordering the same way round, to help get in the habit of doing it like this. Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: David Rientjes <rientjes@google.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hirokazu Takahashi <taka@valinux.co.jp> Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp> Cc: Paul Menage <menage@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-03-04	memcg: remove clear_page_cgroup and atomics	Hugh Dickins
	Remove clear_page_cgroup: it's an unhelpful helper, see for example how mem_cgroup_uncharge_page had to unlock_page_cgroup just in order to call it (serious races from that? I'm not sure). Once that's gone, you can see it's pointless for page_cgroup's ref_cnt to be atomic: it's always manipulated under lock_page_cgroup, except where force_empty unilaterally reset it to 0 (and how does uncharge's atomic_dec_and_test protect against that?). Simplify this page_cgroup locking: if you've got the lock and the pc is attached, then the ref_cnt must be positive: VM_BUG_ONs to check that, and to check that pc->page matches page (we're on the way to finding why sometimes it doesn't, but this patch doesn't fix that). Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: David Rientjes <rientjes@google.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hirokazu Takahashi <taka@valinux.co.jp> Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp> Cc: Paul Menage <menage@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-03-04	memcg: memcontrol uninlined and static	Hugh Dickins
	More cleanup to memcontrol.c, this time changing some of the code generated. Let the compiler decide what to inline (except for page_cgroup_locked which is only used when CONFIG_DEBUG_VM): the __always_inline on lock_page_cgroup etc. was quite a waste since bit_spin_lock etc. are inlines in a header file; made mem_cgroup_force_empty and mem_cgroup_write_strategy static. Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: David Rientjes <rientjes@google.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hirokazu Takahashi <taka@valinux.co.jp> Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp> Cc: Paul Menage <menage@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-03-04	memcg: memcontrol whitespace cleanups	Hugh Dickins
	Sorry, before getting down to more important changes, I'd like to do some cleanup in memcontrol.c. This patch doesn't change the code generated, but cleans up whitespace, moves up a double declaration, removes an unused enum, removes void returns, removes misleading comments, that kind of thing. Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: David Rientjes <rientjes@google.com> Cc: Balbir Singh <balbir@linux.vnet.ibm.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Hirokazu Takahashi <taka@valinux.co.jp> Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp> Cc: Paul Menage <menage@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>