aboutsummaryrefslogtreecommitdiff
path: root/mm
AgeCommit message (Collapse)Author
2007-08-25[PATCH] do not limit locked memory when RLIMIT_MEMLOCK is RLIM_INFINITYHerbert van den Bergh
Fix a bug in mm/mlock.c on 32-bit architectures that prevents a user from locking more than 4GB of shared memory, or allocating more than 4GB of shared memory in hugepages, when rlim[RLIMIT_MEMLOCK] is set to RLIM_INFINITY. Signed-off-by: Herbert van den Bergh <herbert.van.den.bergh@oracle.com> Acked-by: Chris Mason <chris.mason@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de> Signed-off-by: Willy Tarreau <w@1wt.eu>
2007-08-25[PATCH] hugetlb: fix race in alloc_fresh_huge_page()Joe Jin
That static `nid' index needs locking. Without it we can end up calling alloc_pages_node() with an illegal node ID and the kernel crashes. Acked-by: Gurudas Pai <gurudas.pai@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de> Signed-off-by: Willy Tarreau <w@1wt.eu>
2007-08-25[PATCH] readahead: MIN_RA_PAGES/MAX_RA_PAGES macrosFengguang Wu
Define two convenient macros for read-ahead: - MAX_RA_PAGES: rounded down counterpart of VM_MAX_READAHEAD - MIN_RA_PAGES: rounded _up_ counterpart of VM_MIN_READAHEAD Note that the rounded up MIN_RA_PAGES will work flawlessly with _large_ page sizes like 64k. Signed-off-by: Fengguang Wu <wfg@mail.ustc.edu.cn> Cc: Steven Pratt <slpratt@austin.ibm.com> Cc: Ram Pai <linuxram@us.ibm.com> Cc: Rusty Russell <rusty@rustcorp.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de> Signed-off-by: Willy Tarreau <w@1wt.eu>
2007-08-15[PATCH] mm: kill validate_anon_vma to avoid mapcount BUGHugh Dickins
validate_anon_vma gave a useful check on the integrity of the anon_vma list when Andrea was developing obj rmap; but it was not enabled in SLES9 itself, nor in mainline, until Nick changed commented-out RMAP_DEBUG to configurable CONFIG_DEBUG_VM in 2.6.17. Now Petr Vandrovec reports that its BUG_ON(mapcount > 100000) can easily crash a CONFIG_DEBUG_VM=y system. That limit was just an arbitrary number to protect against an infinite loop. We could raise it to something enormous (depending on sizeof struct vma and size of memory?); but I rather think validate_anon_vma has outlived its usefulness, and is better just removed - which gives a magnificent performance boost to anything like Petr's test program ;) Of course, a very long anon_vma list is bad news for preemption latency, and I believe there has been one recent report of such: let's not forget that, but validate_anon_vma only makes it worse not better. Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: Petr Vandrovec <petr@vmware.com> Acked-by: Nick Piggin <npiggin@suse.de> Cc: Andrea Arcangeli <andrea@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Chris Wright <chrisw@sous-sol.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2007-08-15[PATCH] x86_64: allocate sparsemem memmap above 4GZou Nan hai
On systems with huge amount of physical memory, VFS cache and memory memmap may eat all available system memory under 4G, then the system may fail to allocate swiotlb bounce buffer. There was a fix for this issue in arch/x86_64/mm/numa.c, but that fix dose not cover sparsemem model. This patch add fix to sparsemem model by first try to allocate memmap above 4G. Signed-off-by: Zou Nan hai <nanhai.zou@intel.com> Acked-by: Suresh Siddha <suresh.b.siddha@intel.com> Cc: Andi Kleen <ak@suse.de> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> [chrisw: trivial backport] Signed-off-by: Chris Wright <chrisw@sous-sol.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2007-06-11[PATCH] s390: page_mkclean data corruption.Martin Schwidefsky
The git commit c2fda5fed81eea077363b285b66eafce20dfd45a which added the page_test_and_clear_dirty call to page_mkclean and the git commit 7658cc289288b8ae7dd2c2224549a048431222b3 which fixes the "nasty and subtle race in shared mmap'ed page writeback" problem in clear_page_dirty_for_io cause data corruption on s390. The effect of the two changes is that for every call to clear_page_dirty_for_io a page_test_and_clear_dirty is done. If the per page dirty bit is set set_page_dirty is called. Strangly clear_page_dirty_for_io is called for not-uptodate pages, e.g. over this call-chain: [<000000000007c0f2>] clear_page_dirty_for_io+0x12a/0x130 [<000000000007c494>] generic_writepages+0x258/0x3e0 [<000000000007c692>] do_writepages+0x76/0x7c [<00000000000c7a26>] __writeback_single_inode+0xba/0x3e4 [<00000000000c831a>] sync_sb_inodes+0x23e/0x398 [<00000000000c8802>] writeback_inodes+0x12e/0x140 [<000000000007b9ee>] wb_kupdate+0xd2/0x178 [<000000000007cca2>] pdflush+0x162/0x23c The bad news now is that page_test_and_clear_dirty might claim that a not-uptodate page is dirty since SetPageUptodate which resets the per page dirty bit has not yet been called. The page writeback that follows clobbers the data on disk. The simplest solution to this problem is to move the call to page_test_and_clear_dirty under the "if (page_mapped(page))". If a file backed page is mapped it is uptodate. Signed-off-by: Martin Schwidefsky <schwidefsky@de.ibm.com> Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de> Signed-off-by: Chris Wright <chrisw@sous-sol.org>
2007-06-11[PATCH] oom: kill all threads that share mm with killed taskDavid Rientjes
oom_kill_task() calls __oom_kill_task() to OOM kill a selected task. When finding other threads that share an mm with that task, we need to kill those individual threads and not the same one. (Bug introduced by f2a2a7108aa0039ba7a5fe7a0d2ecef2219a7584) Acked-by: William Irwin <bill.irwin@oracle.com> Acked-by: Christoph Lameter <clameter@engr.sgi.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Andrew Morton <akpm@osdl.org> Cc: Andi Kleen <ak@suse.de> Signed-off-by: David Rientjes <rientjes@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Chris Wright <chrisw@sous-sol.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2007-05-01page migration: fix NR_FILE_PAGES accountingChristoph Lameter
NR_FILE_PAGES must be accounted for depending on the zone that the page belongs to. If we replace the page in the radix tree then we may have to shift the count to another zone. Suggested-by: Ethan Solomita <solo@google.com> Cc: Martin Bligh <mbligh@mbligh.org> Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2007-05-01fix OOM killing processes wrongly thought MPOL_BINDHugh Dickins
I only have CONFIG_NUMA=y for build testing: surprised when trying a memhog to see lots of other processes killed with "No available memory (MPOL_BIND)". memhog is killed correctly once we initialize nodemask in constrained_alloc(). Signed-off-by: Hugh Dickins <hugh@veritas.com> Acked-by: Christoph Lameter <clameter@sgi.com> Acked-by: William Irwin <bill.irwin@oracle.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2007-05-01holepunch: fix mmap_sem i_mutex deadlockHugh Dickins
sys_madvise has down_write of mmap_sem, then madvise_remove calls vmtruncate_range which takes i_mutex and i_alloc_sem: no, we can easily devise deadlocks from that ordering. madvise_remove drop mmap_sem while calling vmtruncate_range: luckily, since madvise_remove doesn't split or merge vmas, it's easy to handle this case with a NULL prev, without restructuring sys_madvise. (Though sad to retake mmap_sem when it's unlikely to be needed, and certainly down_read is sufficient for MADV_REMOVE, unlike the other madvices.) Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2007-05-01holepunch: fix disconnected pages after second truncateHugh Dickins
shmem_truncate_range has its own truncate_inode_pages_range, to free any pages racily instantiated while it was in progress: a SHMEM_PAGEIN flag is set when this might have happened. But holepunching gets no chance to clear that flag at the start of vmtruncate_range, so it's always set (unless a truncate came just before), so holepunch almost always does this second truncate_inode_pages_range. shmem holepunch has unlikely swap<->file races hereabouts whatever we do (without a fuller rework than is fit for this release): I was going to skip the second truncate in the punch_hole case, but Miklos points out that would make holepunch correctness more vulnerable to swapoff. So keep the second truncate, but follow it by an unmap_mapping_range to eliminate the disconnected pages (freed from pagecache while still mapped in userspace) that it might have left behind. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2007-05-01holepunch: fix shmem_truncate_range punch lockingHugh Dickins
Miklos Szeredi observes that during truncation of shmem page directories, info->lock is released to improve latency (after lowering i_size and next_index to exclude races); but this is quite wrong for holepunching, which receives no such protection from i_size or next_index, and is left vulnerable to races with shmem_unuse, shmem_getpage and shmem_writepage. Hold info->lock throughout when holepunching? No, any user could prevent rescheduling for far too long. Instead take info->lock just when needed: in shmem_free_swp when removing the swap entries, and whenever removing a directory page from the level above. But so long as we remove before scanning, we can safely skip taking the lock at the lower levels, except at misaligned start and end of the hole. Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2007-05-01holepunch: fix shmem_truncate_range punching too farHugh Dickins
Miklos Szeredi observes BUG_ON(!entry) in shmem_writepage() triggered in rare circumstances, because shmem_truncate_range() erroneously removes partially truncated directory pages at the end of the range: later reclaim on pages pointing to these removed directories triggers the BUG. Indeed, and it can also cause data loss beyond the hole. Fix this as in the patch proposed by Miklos, but distinguish between "limit" (how far we need to search: ignore truncation's next_index optimization in the holepunch case - if there are races it's more consistent to act on the whole range specified) and "upper_limit" (how far we can free directory pages: generally we must be careful to keep partially punched pages, but can relax at end of file - i_size being held stable by i_mutex). Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2007-03-23dio: invalidate clean pages before dio writeZach Brown
This patch fixes a user-triggerable oops that was reported by Leonid Ananiev as archived at http://lkml.org/lkml/2007/2/8/337. dio writes invalidate clean pages that intersect the written region so that subsequent buffered reads go to disk to read the new data. If this fails the interface tries to tell the caller that the cache is inconsistent by returning EIO. Before this patch we had the problem where this invalidation failure would clobber -EIOCBQUEUED as it made its way from fs/direct-io.c to fs/aio.c. Both fs/aio.c and bio completion call aio_complete() and we reference freed memory, usually oopsing. This patch addresses this problem by invalidating before the write so that we can cleanly return -EIO before ->direct_IO() has had a chance to return -EIOCBQUEUED. There is a compromise here. During the dio write we can fault in mmap()ed pages which intersect the written range with get_user_pages() if the user provided them for the source buffer. This is a crazy thing to do, but we can make it mostly work in most cases by trying the invalidation again. The compromise is that we won't return an error if this second invalidation fails if it's an AIO write and we have -EIOCBQUEUED. This was tested by having two processes race performing large O_DIRECT and buffered ordered writes. Within minutes ext3 would see a race between ext3_releasepage() and jbd holding a reference on ordered data buffers and would cause invalidation to fail, panicing the box. The test can be found in the 'aio_dio_bugs' test group in test.kernel.org/autotest. After this patch the test passes. Signed-off-by: Zach Brown <zach.brown@oracle.com> Signed-off-by: Benjamin LaHaise <bcrl@kvack.org> Cc: Chuck Ebbert <cebbert@redhat.com> Cc: Leonid Ananiev <leonid.i.ananiev@linux.intel.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2007-03-23oom fix: prevent oom from killing a process with children/sibling unkillableAnkita Garg
Looking at oom_kill.c, found that the intention to not kill the selected process if any of its children/siblings has OOM_DISABLE set, is not being met. Signed-off-by: Ankita Garg <ankita@in.ibm.com> Cc: Chuck Ebbert <cebbert@redhat.com> Acked-by: Nick Piggin <npiggin@suse.de> Acked-by: William Irwin <wli@holomorphy.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2007-03-23mm: fix madvise infinine loopNick Piggin
madvise(MADV_REMOVE) can go into an infinite loop or cause an oops if the call covers a region from the start of a vma, and extending past that vma. Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Badari Pulavarty <pbadari@us.ibm.com> Acked-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2007-03-09throttle_vm_writeout(): don't loop on GFP_NOFS and GFP_NOIO allocationsAndrew Morton
throttle_vm_writeout() is designed to wait for the dirty levels to subside. But if the caller holds IO or FS locks, we might be holding up that writeout. So change it to take a single nap to give other devices a chance to clean some memory, then return. Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Cc: Kumar Gala <galak@kernel.crashing.org> Cc: Pete Zaitcev <zaitcev@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2007-03-09hugetlb: preserve hugetlb pte dirty stateKen Chen
__unmap_hugepage_range() is buggy that it does not preserve dirty state of huge_pte when unmapping hugepage range. It causes data corruption in the event of dop_caches being used by sys admin. For example, an application creates a hugetlb file, modify pages, then unmap it. While leaving the hugetlb file alive, comes along sys admin doing a "echo 3 > /proc/sys/vm/drop_caches". drop_pagecache_sb() will happily free all pages that aren't marked dirty if there are no active mapping. Later when application remaps the hugetlb file back and all data are gone, triggering catastrophic flip over on application. Not only that, the internal resv_huge_pages count will also get all messed up. Fix it up by marking page dirty appropriately. Signed-off-by: Ken Chen <kenchen@google.com> Cc: "Nish Aravamudan" <nish.aravamudan@gmail.com> Cc: Adam Litke <agl@us.ibm.com> Cc: David Gibson <david@gibson.dropbear.id.au> Acked-by: William Irwin <bill.irwin@oracle.com> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2007-01-31Revert "[PATCH] mm: micro optimise zone_watermark_ok"Linus Torvalds
This reverts commit e80ee884ae0e3794ef2b65a18a767d502ad712ee. Pawel Sikora had a boot-time oops due to it - because the sign change invalidates the following comparisons, since 'free_pages' can be negative. The micro-optimization just isn't worth it. Bisected-by: Pawel Sikora <pluto@agmk.net> Acked-by: Andrew Morton <akpm@osdl.org> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-01-30[PATCH] Don't allow the stack to grow into hugetlb reserved regionsAdam Litke
When expanding the stack, we don't currently check if the VMA will cross into an area of the address space that is reserved for hugetlb pages. Subsequent faults on the expanded portion of such a VMA will confuse the low-level MMU code, resulting in an OOPS. Check for this. Signed-off-by: Adam Litke <agl@us.ibm.com> Cc: David Gibson <david@gibson.dropbear.id.au> Cc: William Lee Irwin III <wli@holomorphy.com> Cc: Hugh Dickins <hugh@veritas.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-01-30[PATCH] mm: mremap correct rmap accountingHugh Dickins
Nick Piggin points out that page accounting on MIPS multiple ZERO_PAGEs is not maintained by its move_pte, and could lead to freeing a ZERO_PAGE. Instead of complicating that move_pte, just forget the minor optimization when mremapping, and change the one thing which needed it for correctness - filemap_xip use ZERO_PAGE(0) throughout instead of according to address. [ "There is no block device driver one could use for XIP on mips platforms" - Carsten Otte ] Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Andrew Morton <akpm@osdl.org> Cc: Ralf Baechle <ralf@linux-mips.org> Cc: Carsten Otte <cotte@de.ibm.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-01-29Fix balance_dirty_page() calculations with CONFIG_HIGHMEMLinus Torvalds
This makes balance_dirty_page() always base its calculations on the amount of non-highmem memory in the machine, rather than try to base it on total memory and then falling back on non-highmem memory if the mapping it was writing wasn't highmem capable. This not only fixes a situation where two different writers can have wildly different notions about what is a "balanced" dirty state, but it also means that people with highmem machines don't run into an OOM situation when regular memory fills up with dirty pages. We used to try to handle the latter case by scaling down the dirty_ratio if the machine had a lot of highmem pages in page_writeback_init(), but it wasn't aggressive enough for some situations, and since basing the dirty ratio on highmem memory was broken in the first place, let's just stop doing so. (A variation of this theme fixed Justin Piszcz's OOM problem when copying an 18GB file on a RAID setup). Acked-by: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Justin Piszcz <jpiszcz@lucidpixels.com> Cc: Andrew Morton <akpm@osdl.org> Cc: Neil Brown <neilb@suse.de> Cc: Ingo Molnar <mingo@elte.hu> Cc: Randy Dunlap <rdunlap@xenotime.net> Cc: Christoph Lameter <clameter@sgi.com> Cc: Jens Axboe <jens.axboe@oracle.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Adrian Bunk <bunk@stusta.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-01-26[PATCH] MM: Remove [PATCH] invalidate_inode_pages2_range() debugTrond Myklebust
NFS can handle the case where invalidate_inode_pages2_range() fails, so the premise behind commit 8258d4a574d3a8c01f0ef68aa26b969398a0e140 is now gone. Remove the WARN_ON_ONCE() which is causing users grief as we can see from http://bugzilla.kernel.org/show_bug.cgi?id=7826 Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-01-26[PATCH] i386 vDSO: use VM_ALWAYSDUMPRoland McGrath
This patch fixes core dumps to include the vDSO vma, which is left out now. It removes the special-case core writing macros, which were not doing the right thing for the vDSO vma anyway. Instead, it uses VM_ALWAYSDUMP in the vma; there is no need for the fixmap page to be installed. It handles the CONFIG_COMPAT_VDSO case by making elf_core_dump use the fake vma from get_gate_vma after real vmas in the same way the /proc/PID/maps code does. This changes core dumps so they no longer include the non-PT_LOAD phdrs from the vDSO. I made the change to add them in the first place, but in turned out that nothing ever wanted them there since the advent of NT_AUXV. It's cleaner to leave them out, and just let the phdrs inside the vDSO image speak for themselves. Signed-off-by: Roland McGrath <roland@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Paul Mackerras <paulus@samba.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Andi Kleen <ak@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-01-26[PATCH] Fix gate_vma.vm_flagsRoland McGrath
This patch fixes the initialization of gate_vma.vm_flags and gate_vma.vm_page_prot to reflect reality. This makes the "[vdso]" line in /proc/PID/maps correctly show r-xp instead of ---p, when gate_vma is used (CONFIG_COMPAT_VDSO on i386). Signed-off-by: Roland McGrath <roland@redhat.com> Cc: Ingo Molnar <mingo@elte.hu> Cc: Paul Mackerras <paulus@samba.org> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Andi Kleen <ak@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-01-26Resurrect 'try_to_free_buffers()' VM hackeryLinus Torvalds
It's not pretty, but it appears that ext3 with data=journal will clean pages without ever actually telling the VM that they are clean. This, in turn, will result in the VM (and balance_dirty_pages() in particular) to never realize that the pages got cleaned, and wait forever for an event that already happened. Technically, this seems to be a problem with ext3 itself, but it used to be hidden by 'try_to_free_buffers()' noticing this situation on its own, and just working around the filesystem problem. This commit re-instates that hack, in order to avoid a regression for the 2.6.20 release. This fixes bugzilla 7844: http://bugzilla.kernel.org/show_bug.cgi?id=7844 Peter Zijlstra points out that we should probably retain the debugging code that this removes from cancel_dirty_page(), and I agree, but for the imminent release we might as well just silence the warning too (since it's not a new bug: anything that triggers that warning has been around forever). Acked-by: Randy Dunlap <rdunlap@xenotime.net> Acked-by: Jens Axboe <jens.axboe@oracle.com> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-01-23[PATCH] mbind: restrict nodes to the currently allowed cpusetChristoph Lameter
Currently one can specify an arbitrary node mask to mbind that includes nodes not allowed. If that is done with an interleave policy then we will go around all the nodes. Those outside of the currently allowed cpuset will be redirected to the border nodes. Interleave will then create imbalances at the borders of the cpuset. This patch restricts the nodes to the currently allowed cpuset. The RFC for this patch was discussed at http://marc.theaimsgroup.com/?t=116793842100004&r=1&w=2 Signed-off-by: Christoph Lameter <clameter@sgi.com> Cc: Paul Jackson <pj@sgi.com> Cc: Andi Kleen <ak@suse.de> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2007-01-12[PATCH] blktrace: only add a bounce trace when we really bounceJens Axboe
Currently we issue a bounce trace when __blk_queue_bounce() is called, but that merely means that the device has a lower dma mask than the higher pages in the system. The bio itself may still be lower pages. So move the bounce trace into __blk_queue_bounce(), when we know there will actually be page bouncing. Signed-off-by: Jens Axboe <jens.axboe@oracle.com> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2007-01-11[PATCH] NFS: Fix race in nfs_release_page()Trond Myklebust
NFS: Fix race in nfs_release_page() invalidate_inode_pages2() may find the dirty bit has been set on a page owing to the fact that the page may still be mapped after it was locked. Only after the call to unmap_mapping_range() are we sure that the page can no longer be dirtied. In order to fix this, NFS has hooked the releasepage() method and tries to write the page out between the call to unmap_mapping_range() and the call to remove_mapping(). This, however leads to deadlocks in the page reclaim code, where the page may be locked without holding a reference to the inode or dentry. Fix is to add a new address_space_operation, launder_page(), which will attempt to write out a dirty page without releasing the page lock. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Also, the bare SetPageDirty() can skew all sort of accounting leading to other nasties. [akpm@osdl.org: cleanup] Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Trond Myklebust <Trond.Myklebust@netapp.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2007-01-11[PATCH] Fix sparsemem on CellDave Hansen
Fix an oops experienced on the Cell architecture when init-time functions, early_*(), are called at runtime. It alters the call paths to make sure that the callers explicitly say whether the call is being made on behalf of a hotplug even, or happening at boot-time. It has been compile tested on ppc64, ia64, s390, i386 and x86_64. Acked-by: Arnd Bergmann <arndb@de.ibm.com> Signed-off-by: Dave Hansen <haveblue@us.ibm.com> Cc: Yasunori Goto <y-goto@jp.fujitsu.com> Acked-by: Andy Whitcroft <apw@shadowen.org> Cc: Christoph Lameter <clameter@engr.sgi.com> Cc: Martin Schwidefsky <schwidefsky@de.ibm.com> Acked-by: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> Cc: Paul Mackerras <paulus@samba.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2007-01-08Merge master.kernel.org:/home/rmk/linux-2.6-armLinus Torvalds
* master.kernel.org:/home/rmk/linux-2.6-arm: [ARM] Provide basic printk_clock() implementation [ARM] Resolve fuse and direct-IO failures due to missing cache flushes [ARM] pass vma for flush_anon_page() [ARM] Fix potential MMCI bug [ARM] Fix kernel-mode undefined instruction aborts [ARM] 4082/1: iop3xx: fix iop33x gpio register offset [ARM] 4070/1: arch/arm/kernel: fix warnings from missing includes [ARM] 4079/1: iop: Update MAINTAINERS
2007-01-08[ARM] pass vma for flush_anon_page()Russell King
Since get_user_pages() may be used with processes other than the current process and calls flush_anon_page(), flush_anon_page() has to cope in some way with non-current processes. It may not be appropriate, or even desirable to flush a region of virtual memory cache in the current process when that is different to the process that we want the flush to occur for. Therefore, pass the vma into flush_anon_page() so that the architecture can work out whether the 'vmaddr' is for the current process or not. Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
2007-01-05[PATCH] shrink_all_memory(): fix lru_pages handlingAndrew Morton
At the end of shrink_all_memory() we forget to recalculate lru_pages: it can be zero. Fix that up, and add a helper function for this operation too. Also, recalculate lru_pages each time around the inner loop to get the balancing correct. Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Cc: Pavel Machek <pavel@ucw.cz> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2007-01-05[PATCH] fix OOM killing of swapoffHugh Dickins
These days, if you swapoff when there isn't enough memory, OOM killer gives "BUG: scheduling while atomic" and the machine hangs: badness() needs to do its PF_SWAPOFF return after the task_unlock (tasklist_lock is also held here, so p isn't going to be freed: PF_SWAPOFF might get turned off at any moment, but that doesn't really matter). Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2007-01-05[PATCH] Check for populated zone in __drain_pagesChristoph Lameter
Both process_zones() and drain_node_pages() check for populated zones before touching pagesets. However, __drain_pages does not do so, This may result in a NULL pointer dereference for pagesets in unpopulated zones if a NUMA setup is combined with cpu hotplug. Initially the unpopulated zone has the pcp pointers pointing to the boot pagesets. Since the zone is not populated the boot pageset pointers will not be changed during page allocator and slab bootstrap. If a cpu is later brought down (first call to __drain_pages()) then the pcp pointers for cpus in unpopulated zones are set to NULL since __drain_pages does not first check for an unpopulated zone. If the cpu is then brought up again then we call process_zones() which will ignore the unpopulated zone. So the pageset pointers will still be NULL. If the cpu is then again brought down then __drain_pages will attempt to drain pages by following the NULL pageset pointer for unpopulated zones. Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2007-01-05[PATCH] fix BUG_ON(!PageSlab) from fallback_allocHugh Dickins
pdflush hit the BUG_ON(!PageSlab(page)) in kmem_freepages called from fallback_alloc: cache_grow already freed those pages when alloc_slabmgmt failed. But it wouldn't have freed them if __GFP_NO_GROW, so make sure fallback_alloc doesn't waste its time on that case. Signed-off-by: Hugh Dickins <hugh@veritas.com> Acked-by: Christoph Lameter <clameter@sgi.com> Acked-by: Pekka J Enberg <penberg@cs.helsinki.fi> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2007-01-05[PATCH] Sanely size hash tables when using large base pagesPaul Mundt
At the moment the inode/dentry cache hash tables (common by way of alloc_large_system_hash()) are incorrectly sized by their respective detection logic when we attempt to use large base pages on systems with little memory. This results in odd behaviour when using a 64kB PAGE_SIZE, such as: Dentry cache hash table entries: 8192 (order: -1, 32768 bytes) Inode-cache hash table entries: 4096 (order: -2, 16384 bytes) The mount cache hash table is seemingly the only one that gets this right by directly taking PAGE_SIZE in to account. The following patch attempts to catch the bogus values and round it up to at least 0-order. Signed-off-by: Paul Mundt <lethal@linux-sh.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2007-01-05[PATCH] swsusp: Do not fail if resume device is not setRafael J. Wysocki
In the kernels later than 2.6.19 there is a regression that makes swsusp fail if the resume device is not explicitly specified. It can be fixed by adding an additional parameter to mm/swapfile.c:swap_type_of() allowing us to pass the (struct block_device *) corresponding to the first available swap back to the caller. Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl> Acked-by: Pavel Machek <pavel@ucw.cz> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-30[PATCH] Buglet in vmscan.cShantanu Goel
Fix a rather obvious buglet. Noticed while instrumenting the VM using /proc/vmstat. Cc: Christoph Lameter <clameter@engr.sgi.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-30[PATCH] page_mkclean_one(): fix call to set_pte_at()Al Viro
(akpm: macros are wonderful) Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-30[PATCH] MM: SLOB is broken by recent cleanup of slab.hDimitri Gorokhovik
Recent cleanup of slab.h broke SLOB allocator: the routine kmem_cache_init has now the __init attribute for both slab.c and slob.c. This routine cannot be removed after init in the case of slob.c -- it serves as a timer callback. Provide a separate timer callback routine, call it once from kmem_cache_init, keep the __init attribute on the latter. Signed-off-by: Dimitri Gorokhovik <dimitri.gorokhovik@free.fr> Cc: Christoph Lameter <clameter@engr.sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-30[PATCH] fix oom killer kills current every time if there is memory-less-node ↵KAMEZAWA Hiroyuki
take2 constrained_alloc(), which is called to detect where oom is from, checks passed zone_list(). If zone_list doesn't include all nodes, it thinks oom is from mempolicy. But there is memory-less-node. memory-less-node's zones are never included in zonelist[]. contstrained_alloc() should get memory_less_node into count. Otherwise, it always thinks 'oom is from mempolicy'. This means that current process dies at any time. This patch fix it. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Paul Jackson <pj@sgi.com> Cc: Christoph Lameter <clameter@engr.sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-29VM: Fix nasty and subtle race in shared mmap'ed page writebackLinus Torvalds
The VM layer (on the face of it, fairly reasonably) expected that when it does a ->writepage() call to the filesystem, it would write out the full page at that point in time. Especially since it had earlier marked the whole page dirty with "set_page_dirty()". But that isn't actually the case: ->writepage() does not actually write a page, it writes the parts of the page that have been explicitly marked dirty before, *and* that had not got written out for other reasons since the last time we told it they were dirty. That last caveat is the important one. Which _most_ of the time ends up being the whole page (since we had called "set_page_dirty()" on the page earlier), but if the filesystem had done any dirty flushing of its own (for example, to honor some internal write ordering guarantees), it might end up doing only a partial page IO (or none at all) when ->writepage() is actually called. That is the correct thing in general (since we actually often _want_ only the known-dirty parts of the page to be written out), but the shared dirty page handling had implicitly forgotten about these details, and had a number of cases where it was doing just the "->writepage()" part, without telling the low-level filesystem that the whole page might have been re-dirtied as part of being mapped writably into user space. Since most of the time the FS did actually write out the full page, we didn't notice this for a loong time, and this needed some really odd patterns to trigger. But it caused occasional corruption with rtorrent and with the Debian "apt" database, because both use shared mmaps to update the end result. This fixes it. Finally. After way too much hair-pulling. Acked-by: Nick Piggin <nickpiggin@yahoo.com.au> Acked-by: Martin J. Bligh <mbligh@google.com> Acked-by: Martin Michlmayr <tbm@cyrius.com> Acked-by: Martin Johansson <martin@fatbob.nu> Acked-by: Ingo Molnar <mingo@elte.hu> Acked-by: Andrei Popa <andrei.popa@i-neo.ro> Cc: High Dickins <hugh@veritas.com> Cc: Andrew Morton <akpm@osdl.org>, Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Segher Boessenkool <segher@kernel.crashing.org> Cc: David Miller <davem@davemloft.net> Cc: Arjan van de Ven <arjan@infradead.org> Cc: Gordon Farquharson <gordonfarquharson@gmail.com> Cc: Guillaume Chazarain <guichaz@yahoo.fr> Cc: Theodore Tso <tytso@mit.edu> Cc: Kenneth Cheng <kenneth.w.chen@intel.com> Cc: Tobias Diedrich <ranma@tdiedrich.de> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-23Clean up and export cancel_dirty_page() to modulesLinus Torvalds
Make cancel_dirty_page() act more like all the other dirty and writeback accounting functions: test for "mapping" being NULL, and do the NR_FILE_DIRY accounting purely based on mapping_cap_account_dirty()). Also, add it to the exports, so that modular filesystems can use it. Acked-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-22[PATCH] Fix up page_mkclean_one(): virtual caches, s390Peter Zijlstra
- add flush_cache_page() for all those virtual indexed cache architectures. - handle s390. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-22[PATCH] mm: more rmap debuggingNick Piggin
Add more debugging in the rmap code in an attempt to locate to source of the occasional "mapcount went negative" assertions. Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-22[PATCH] Fix swapped parameters in mm/vmscan.cNigel Cunningham
The version of mm/vmscan.c in Linus' current tree has swapped parameters in the shrink_all_zones declaration and call, used by the various suspend-to-disk implementations. This doesn't seem to have any great adverse effect, but it's clearly wrong. Signed-off-by: Nigel Cunningham <nigel@suspend2.net> Cc: Pavel Machek <pavel@ucw.cz> Cc: "Rafael J. Wysocki" <rjw@sisk.pl> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-22[PATCH] fix kernel-doc warnings in 2.6.20-rc1Randy Dunlap
Fix kernel-doc warnings in 2.6.20-rc1. Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-22[PATCH] slab: fix kmem_ptr_validate definitionChristoph Lameter
The declaration of kmem_ptr_validate in slab.h does not match the one in slab.c. Remove the fastcall attribute (this is the only use in slab.c). Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
2006-12-22[PATCH] Fix for shmem_truncate_range() BUG_ON()Badari Pulavarty
Ran into BUG() while doing madvise(REMOVE) testing. If we are punching a hole into shared memory segment using madvise(REMOVE) and the entire hole is below the indirect blocks, we hit following assert. BUG_ON(limit <= SHMEM_NR_DIRECT); Signed-off-by: Badari Pulavarty <pbadari@us.ibm.com> Cc: Hugh Dickins <hugh@veritas.com> Cc: <stable@kernel.org> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>