path: root/mm
Age | Commit message | Author
2009-02-17 | mm: rearrange exit_mmap() to unlock before arch_exit_mmap | Jeremy Fitzhardinge
commit 9480c53e9b2aa13a06283ffb96bb8f1873ac4e9a upstream. Christophe Saout reported [in precursor to: http://marc.info/?l=linux-kernel&m=123209902707347&w=4]: > Note that I also some a different issue with CONFIG_UNEVICTABLE_LRU. > Seems like Xen tears down current->mm early on process termination, so > that __get_user_pages in exit_mmap causes nasty messages when the > process had any mlocked pages. (in fact, it somehow manages to get into > the swapping code and produces a null pointer dereference trying to get > a swap token) Jeremy explained: Yes. In the normal case under Xen, an in-use pagetable is "pinned", meaning that it is RO to the kernel, and all updates must go via hypercall (or writes are trapped and emulated, which is much the same thing). An unpinned pagetable is not currently in use by any process, and can be directly accessed as normal RW pages. As an optimisation at process exit time, we unpin the pagetable as early as possible (switching the process to init_mm), so that all the normal pagetable teardown can happen with direct memory accesses. This happens in exit_mmap() -> arch_exit_mmap(). The munlocking happens a few lines below. The obvious thing to do would be to move arch_exit_mmap() to below the munlock code, but I think we'd want to call it even if mm->mmap is NULL, just to be on the safe side. Thus, this patch: exit_mmap() needs to unlock any locked vmas before calling arch_exit_mmap, as the latter may switch the current mm to init_mm, which would cause the former to fail. Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com> Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Christophe Saout <christophe@saout.de> Cc: Keir Fraser <keir.fraser@eu.citrix.com> Cc: Christophe Saout <christophe@saout.de> Cc: Alex Williamson <alex.williamson@hp.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-02-17 | writeback: fix break condition | Federico Cuello
commit 89e1219004b3657cc014521663eeef0744f1c99d upstream. Commit dcf6a79dda5cc2a2bec183e50d829030c0972aaa ("write-back: fix nr_to_write counter") fixed nr_to_write counter, but didn't set the break condition properly. If nr_to_write == 0 after being decremented it will loop one more time before setting done = 1 and breaking the loop. [akpm@linux-foundation.org: coding-style fixes] Cc: Artem Bityutskiy <Artem.Bityutskiy@nokia.com> Acked-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
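The off-by-one described above can be sketched in user space as follows. This is a simplified model, not the kernel code: `write_pages_buggy()` and `write_pages_fixed()` are hypothetical names, `dirty` stands for the number of dirty pages found, and only `nr_to_write` and `done` mirror identifiers from the commit message.

```c
#include <stdbool.h>

/*
 * Hypothetical model of the loop before the fix: the budget is only
 * noticed as exhausted on the iteration AFTER it reaches zero, so one
 * extra page is written.
 */
long write_pages_buggy(long dirty, long nr_to_write)
{
	long written = 0;
	bool done = false;

	while (!done && written < dirty) {
		written++;                     /* "write" one page */
		if (nr_to_write > 0)
			nr_to_write--;
		else
			done = true;           /* noticed one iteration late */
	}
	return written;
}

/* Hypothetical model of the fixed loop: stop as soon as the budget hits 0. */
long write_pages_fixed(long dirty, long nr_to_write)
{
	long written = 0;
	bool done = false;

	while (!done && written < dirty) {
		written++;                     /* "write" one page */
		if (nr_to_write > 0) {
			nr_to_write--;
			if (nr_to_write == 0)
				done = true;   /* budget used up: break now */
		}
	}
	return written;
}
```

With ten dirty pages and a budget of three, the first variant writes four pages and the second writes three.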
2009-02-17 | write-back: fix nr_to_write counter | Artem Bityutskiy
commit dcf6a79dda5cc2a2bec183e50d829030c0972aaa upstream.
Commit 05fe478dd04e02fa230c305ab9b5616669821dd3 introduced some @wbc->nr_to_write breakage. It made the following changes:
 1. Decrement wbc->nr_to_write instead of nr_to_write
 2. Decrement wbc->nr_to_write _only_ if wbc->sync_mode == WB_SYNC_NONE
 3. If synced nr_to_write pages, stop only if wbc->sync_mode == WB_SYNC_NONE, otherwise keep going.
However, according to the commit message, the intention was to only make change 3. Change 1 is a bug. Change 2 does not seem to be necessary, and it breaks UBIFS expectations, so if needed, it should be done separately later. And change 2 does not seem to be documented in the commit message.
This patch does the following:
 1. Undo changes 1 and 2
 2. Add a comment explaining change 3 (it is very useful to have comments in _code_, not only in the commit).
Signed-off-by: Artem Bityutskiy <Artem.Bityutskiy@nokia.com>
Acked-by: Nick Piggin <npiggin@suse.de>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-02-17 | Fix page writeback thinko, causing Berkeley DB slowdown | Nick Piggin
commit 3a4c6800f31ea8395628af5e7e490270ee5d0585 upstream. A bug was introduced into write_cache_pages cyclic writeout by commit 31a12666d8f0c22235297e1c1575f82061480029 ("mm: write_cache_pages cyclic fix"). The intention (and comments) is that we should cycle back and look for more dirty pages at the beginning of the file if there is no more work to be done. But the !done condition was dropped from the test. This means that any time the page writeout loop breaks (eg. due to nr_to_write == 0), we will set index to 0, then goto again. This will set done_index to index, then find done is set, so will proceed to the end of the function. When updating mapping->writeback_index for cyclic writeout, we now use done_index == 0, so we're always cycling back to 0. This seemed to be causing random mmap writes (slapadd and iozone) to start writing more pages from the LRU and writeout would slowdown, and caused bugzilla entry http://bugzilla.kernel.org/show_bug.cgi?id=12604 about Berkeley DB slowing down dramatically. With this patch, iozone random write performance is increased nearly 5x on my system (iozone -B -r 4k -s 64k -s 512m -s 1200m on ext2). Signed-off-by: Nick Piggin <npiggin@suse.de> Reported-and-tested-by: Jan Kara <jack@suse.cz> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-02-12 | mm: fix error case in mlock downgrade reversion | Hugh Dickins
commit d5b562330ec766292a3ac54ae5e0673610bd5b3d upstream. Commit 27421e211a39784694b597dbf35848b88363c248, Manually revert "mlock: downgrade mmap sem while populating mlocked regions", has introduced its own regression: __mlock_vma_pages_range() may report an error (for example, -EFAULT from trying to lock down pages from beyond EOF), but mlock_vma_pages_range() must hide that from its callers as before. Reported-by: Sami Farin <safari-kernel@safari.iki.fi> Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2009-02-12 | do_wp_page: fix regression with execute in place | Carsten Otte
commit ab92661d5d9514647346047f30f67a7f35ffea67 upstream. Fix do_wp_page for VM_MIXEDMAP mappings. In the case where pfn_valid returns 0 for a pfn at the beginning of do_wp_page and the mapping is not shared writable, the code branches to label `gotten:' with old_page == NULL. In case the vma is locked (vma->vm_flags & VM_LOCKED), lock_page, clear_page_mlock, and unlock_page try to access the old_page. This patch checks whether old_page is valid before it is dereferenced. The regression was introduced by "mlock: mlocked pages are unevictable" (commit b291f000393f5a0b679012b39d79fbc85c018233). Signed-off-by: Carsten Otte <cotte@de.ibm.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
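The guard this commit describes can be illustrated with a small user-space model. Everything here is an assumption made for illustration: `struct page_model` is a hypothetical stand-in for `struct page`, and `vma_locked` stands for the `vma->vm_flags & VM_LOCKED` test.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct page; it only tracks the mlocked state. */
struct page_model {
	bool mlocked;
};

/*
 * Models the guarded cleanup: the old page is only touched when the vma
 * is locked AND old_page is valid. Returns true if the page was touched.
 */
bool clear_old_page_mlock(struct page_model *old_page, bool vma_locked)
{
	if (vma_locked && old_page != NULL) {
		old_page->mlocked = false;  /* safe: old_page is valid here */
		return true;
	}
	return false;                       /* nothing to do, no dereference */
}
```

Calling `clear_old_page_mlock(NULL, true)` returns false without dereferencing anything, which is the case the VM_MIXEDMAP path needs.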
2009-02-06 | Manually revert "mlock: downgrade mmap sem while populating mlocked regions" | Linus Torvalds
commit 27421e211a39784694b597dbf35848b88363c248 upstream. This essentially reverts commit 8edb08caf68184fb170f4f69c7445929e199eaea. It downgraded our mmap semaphore to a read-lock while mlocking pages, in order to allow other threads (and external accesses like "ps" et al) to walk the vma lists and take page faults etc. Which is a nice idea, but the implementation does not work. Because we cannot upgrade the lock back to a write lock without releasing the mmap semaphore, the code had to release the lock entirely and then re-take it as a writelock. However, that meant that the caller possibly lost the vma chain that it was following, since now another thread could come in and mmap/munmap the range. The code tried to work around that by just looking up the vma again and erroring out if that happened, but quite frankly, that was just a buggy hack that doesn't actually protect against anything (the other thread could just have replaced the vma with another one instead of totally unmapping it). The only way to downgrade to a read map _reliably_ is to do it at the end, which is likely the right thing to do: do all the 'vma' operations with the write-lock held, then downgrade to a read after completing them all, and then do the "populate the newly mlocked regions" while holding just the read lock. And then just drop the read-lock and return to user space. The (perhaps somewhat simpler) alternative is to just make all the callers of mlock_vma_pages_range() know that the mmap lock got dropped, and just re-grab the mmap semaphore if it needs to mlock more than one vma region. So we can do this "downgrade mmap sem while populating mlocked regions" thing right, but the way it was done here was absolutely not correct. Thus the revert, in the expectation that we will do it all correctly some day. 
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com> Cc: Rik van Riel <riel@redhat.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-02-02 | Fix OOPS in mmap_region() when merging adjacent VM_LOCKED file segments | Andrew Morton
This patch differs from the upstream commit de33c8db5910cda599899dd431cc30d7c1018cbf written by Linus, as it aims to only prevent the oops from happening, not attempt to change anything else. The problem was introduced by commit ba470de43188cdbff795b5da43a1474523c6c2fb which added new references to *vma after we've potentially freed it. From: Andrew Morton <akpm@linux-foundation.org> Reported-by: Maksim Yevmenkin <maksim.yevmenkin@gmail.com> Tested-by: Maksim Yevmenkin <maksim.yevmenkin@gmail.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Rik van Riel <riel@redhat.com> Cc: Hugh Dickins <hugh@veritas.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-02-02 | alpha: fix vmalloc breakage | Ivan Kokshaysky
commit 822c18f2e38cbc775792ab65ace4f9198678dec9 upstream. On alpha, we have to map some stuff in the VMALLOC space very early in the boot process (to make SRM console callbacks work and so on, see arch/alpha/mm/init.c). For old VM allocator, we just manually placed a vm_struct onto the global vmlist and this worked for ages. Unfortunately, the new allocator isn't aware of this, so it constantly tries to allocate the VM space which is already in use, making vmalloc on alpha defunct. This patch forces KVA to import vmlist entries on init. [akpm@linux-foundation.org: remove unneeded check (per Johannes)] Signed-off-by: Ivan Kokshaysky <ink@jurassic.park.msu.ru> Cc: Nick Piggin <npiggin@suse.de> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Richard Henderson <rth@twiddle.net> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-01-24 | mm: direct IO starvation improvement | Nick Piggin
commit 48b47c561e41525061b5bc0cfd67d6367fd11dc4 upstream. Direct IO can invalidate and sync a lot of pagecache pages in the mapping. A 4K direct IO will actually try to sync and/or invalidate the pagecache of the entire file, for example (which might be many GB or TB large). Improve this by doing range syncs. Also, memory no longer has to be unmapped to catch the dirty bits for syncing, as dirty bits would remain coherent due to dirty mmap accounting. This fixes the immediate DM deadlocks when doing direct IO reads to block device with a mounted filesystem, if only by papering over the problem somewhat rather than addressing the fsync starvation cases. Signed-off-by: Nick Piggin <npiggin@suse.de> Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-01-24 | mm: write_cache_pages more terminate quickly | Andrew Morton
commit 82fd1a9a8ced9607312b54859572bcc6211e8919 upstream. Now that we have the early-termination logic in place, it makes sense to bail out early in all other cases where done is set to 1. Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Chris Mason <chris.mason@oracle.com> Cc: Dave Chinner <david@fromorbit.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-01-24 | mm: write_cache_pages terminate quickly | Nick Piggin
commit d5482cdf8a0aacb1e6468a97d5544f5829c8d8c4 upstream. Terminate the write_cache_pages loop upon encountering the first page past end, without locking the page. Pages cannot have their index change when we have a reference on them (truncate, eg truncate_inode_pages_range performs the same check without the page lock). Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Chris Mason <chris.mason@oracle.com> Cc: Dave Chinner <david@fromorbit.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-01-24 | mm: write_cache_pages optimise page cleaning | Nick Piggin
commit 515f4a037fb9ab736f8bad733fcd2ffd350cf265 upstream. In write_cache_pages, if we get stuck behind another process that is cleaning pages, we will be forced to wait for them to finish, then perform our own writeout (if it was redirtied during the long wait), then wait for that. If a page under writeout is still clean, we can skip waiting for it (if we're part of a data integrity sync, we'll be waiting for all writeout pages afterwards, so we'll still be waiting for the other guy's write that's cleaned the page). Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Chris Mason <chris.mason@oracle.com> Cc: Dave Chinner <david@fromorbit.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-01-24 | mm: write_cache_pages cleanups | Nick Piggin
commit 5a3d5c9813db56a75934eb1015367fda23a8b0b4 upstream. Get rid of some complex expressions from flow control statements, add a comment, remove some duplicate code. Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Chris Mason <chris.mason@oracle.com> Cc: Dave Chinner <david@fromorbit.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-01-24 | mm: write_cache_pages integrity fix | Nick Piggin
commit 05fe478dd04e02fa230c305ab9b5616669821dd3 upstream.
In write_cache_pages, nr_to_write is heeded even for data-integrity syncs, so the function will return success after writing out nr_to_write pages, even if that was not sufficient to guarantee data integrity.
The callers tend to set it to values that could break data integrity semantics easily in practice. For example, nr_to_write can be set to mapping->nrpages * 2, however if a file has a single, dirty page, then fsync is called, subsequent pages might be concurrently added and dirtied, then write_cache_pages might write out two of these newly dirty pages, while not writing out the old page that should have been written out.
Fix this by ignoring nr_to_write if it is a data integrity sync. This is a data integrity bug.
The reason this has been done in the past is to avoid stalling sync operations behind page dirtiers. "If a file has one dirty page at offset 1000000000000000 then someone does an fsync() and someone else gets in first and starts madly writing pages at offset 0, we want to write that page at 1000000000000000. Somehow."
What we do today is return success after an arbitrary number of pages are written, whether or not we have provided the data-integrity semantics that the caller has asked for. Even this doesn't actually fix all stall cases completely: in the above situation, if the file has a huge number of pages in pagecache (but not dirty), then mapping->nrpages is going to be huge, even if pages are being dirtied.
This change does indeed make the possibility of long stalls larger, and that's not a good thing, but lying about data integrity is even worse. We have to either perform the sync, or return -ELINUXISLAME so at least the caller knows what has happened.
There are subsequent competing approaches in the works to solve the stall problems properly, without compromising data integrity.
Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Chris Mason <chris.mason@oracle.com> Cc: Dave Chinner <david@fromorbit.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-01-24 | mm: write_cache_pages writepage error fix | Nick Piggin
commit 00266770b8b3a6a77f896ca501a0613739086832 upstream.
In write_cache_pages, if ret signals a real error, but we still have some pages left in the pagevec, done would be set to 1, but the remaining pages would continue to be processed and ret will be overwritten in the process. It could easily be overwritten with success, and thus success will be returned even if there is an error. Thus the caller is told all writes succeeded, whereas in reality some did not.
Fix this by bailing immediately if there is an error, and retaining the first error code. This is a data integrity bug.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Chris Mason <chris.mason@oracle.com>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
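The difference between overwriting `ret` and keeping the first error can be sketched as below. This is a user-space model under stated assumptions: `page_result[]` is a hypothetical array holding the per-page return value (0 for success, a negative errno for failure), and both function names are invented for illustration.

```c
#include <errno.h>
#include <stddef.h>

/*
 * Model of the old behaviour: the rest of the batch is still processed
 * after an error, so a later success overwrites the earlier error.
 */
int writeout_batch_old(const int *page_result, size_t n)
{
	int ret = 0;

	for (size_t i = 0; i < n; i++)
		ret = page_result[i];          /* last result wins */
	return ret;
}

/* Model of the fixed behaviour: bail out at once, keep the first error. */
int writeout_batch_fixed(const int *page_result, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (page_result[i] != 0)
			return page_result[i]; /* first error is retained */
	}
	return 0;
}
```

For a batch whose results are success, `-EIO`, success, the old model reports success while the fixed model reports `-EIO`.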
2009-01-24 | mm: write_cache_pages early loop termination | Nick Piggin
commit bd19e012f6fd3b7309689165ea865cbb7bb88c1e upstream. We'd like to break out of the loop early in many situations, however the existing code has been setting mapping->writeback_index past the final page in the pagevec lookup for cyclic writeback. This is a problem if we don't process all pages up to the final page. Currently the code mostly keeps writeback_index reasonable and hacked around this by not breaking out of the loop or writing pages outside the range in these cases. Keep track of a real "done index" that enables us to terminate the loop in a much more flexible manner. Needed by the subsequent patch to preserve writepage errors, and then further patches to break out of the loop early for other reasons. However there are no functional changes with this patch alone. Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Chris Mason <chris.mason@oracle.com> Cc: Dave Chinner <david@fromorbit.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-01-24 | mm: write_cache_pages cyclic fix | Nick Piggin
commit 31a12666d8f0c22235297e1c1575f82061480029 upstream. In write_cache_pages, scanned == 1 is supposed to mean that cyclic writeback has circled through zero, thus we should not circle again. However it gets set to 1 after the first successful pagevec lookup. This leads to cases where not enough data gets written. Counterexample: file with first 10 pages dirty, writeback_index == 5, nr_to_write == 10. Then the 5 last pages will be found, and scanned will be set to 1, after writing those out, we will not cycle back to get the first 5. Rework this logic, now we'll always cycle unless we started off from index 0. When cycling, only write out as far as 1 page before the start page from the first cycle (so we don't write parts of the file twice). Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Chris Mason <chris.mason@oracle.com> Cc: Dave Chinner <david@fromorbit.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
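The counterexample in this message (first 10 pages dirty, writeback_index == 5, nr_to_write == 10) can be replayed with a small user-space model. It is only a sketch: `write_range()` and `run_counterexample()` are hypothetical helpers, and the page cache is reduced to an array of dirty flags.

```c
#include <stdbool.h>
#include <stddef.h>

#define NPAGES 10   /* pages 0..9, all dirty in the counterexample */

/* Write dirty pages in [start, end] while budget remains; return the count. */
static long write_range(bool *dirty, size_t start, size_t end, long *budget)
{
	long written = 0;

	for (size_t i = start; i <= end && *budget > 0; i++) {
		if (dirty[i]) {
			dirty[i] = false;
			written++;
			(*budget)--;
		}
	}
	return written;
}

/*
 * Replays the counterexample. With cycle_back == false the scan stops at
 * the end of the file (the old behaviour); with cycle_back == true it
 * wraps to index 0 and writes up to one page before the start page.
 */
long run_counterexample(bool cycle_back)
{
	bool dirty[NPAGES];
	size_t writeback_index = 5;
	long nr_to_write = 10;
	long written;

	for (size_t i = 0; i < NPAGES; i++)
		dirty[i] = true;

	written = write_range(dirty, writeback_index, NPAGES - 1, &nr_to_write);
	if (cycle_back && writeback_index != 0)
		written += write_range(dirty, 0, writeback_index - 1, &nr_to_write);
	return written;
}
```

Without cycling only pages 5 to 9 are written (5 pages); with cycling all 10 are written, and no page is written twice.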
2009-01-18 | System call wrappers part 28 | Heiko Carstens
commit 938bb9f5e840eddbf54e4f62f6c5ba9b3ae12c9d upstream. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-01-18 | System call wrappers part 26 | Heiko Carstens
commit c4ea37c26a691ad0b7e86aa5884aab27830e95c9 upstream. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-01-18 | System call wrappers part 14 | Heiko Carstens
commit 3480b25743cb7404928d57efeaa3d085708b04c2 upstream. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-01-18 | System call wrappers part 13 | Heiko Carstens
commit 6a6160a7b5c27b3c38651baef92a14fa7072b3c1 upstream. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-01-18 | System call wrapper special cases | Heiko Carstens
commit 6673e0c3fbeaed2cd08e2fd4a4aa97382d6fedb0 upstream. System calls with an unsigned long long argument can't be converted with the standard wrappers since that would include a cast to long, which in turn means that we would lose the upper 32 bit on 32 bit architectures. Also semctl can't use the standard wrapper since it has a 'union' parameter. So we handle them as special case and add some extra wrappers instead. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
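The loss of the upper 32 bits can be shown with a short user-space sketch. It assumes a 32-bit `long`, modelled here with `uint32_t` so that the narrowing conversion is well defined; `pass_through_32bit_long()` is a hypothetical name, not a kernel function.

```c
#include <stdint.h>

/*
 * Models what happens when a 64-bit syscall argument is cast to a
 * 32-bit long inside a generic wrapper: only the low 32 bits survive.
 */
uint64_t pass_through_32bit_long(uint64_t arg)
{
	uint32_t narrowed = (uint32_t)arg;  /* upper 32 bits are dropped */

	return narrowed;                    /* widened back, high half is 0 */
}
```

Passing `0x100000002` through this model yields `2`, which is why such system calls need wrappers that do not cast the argument to long.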
2009-01-18 | Convert all system calls to return a long | Heiko Carstens
commit 2ed7c03ec17779afb4fcfa3b8c61df61bd4879ba upstream. Convert all system calls to return a long. This should be a NOP since all converted types should have the same size anyway. With the exception of sys_exit_group which returned void. But that doesn't matter since the system call doesn't return. Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
2009-01-18 | fs: symlink write_begin allocation context fix | Nick Piggin
commit 54566b2c1594c2326a645a3551f9d989f7ba3c5e upstream.
With the write_begin/write_end aops, page_symlink was broken because it could no longer pass a GFP_NOFS type mask into the point where the allocations happened. They are done in write_begin, which would always assume that the filesystem can be entered from reclaim. This bug could cause filesystem deadlocks.
The funny thing with having a gfp_t mask there is that it doesn't really allow the caller to arbitrarily tinker with the context in which it can be called. It couldn't ever be GFP_ATOMIC, for example, because it needs to take the page lock. The only thing any callers care about is __GFP_FS anyway, so turn that into a single flag.
Add a new flag for write_begin, AOP_FLAG_NOFS. Filesystems can now act on this flag in their write_begin function. Change __grab_cache_page to accept a nofs argument as well, to honour that flag (while we're there, change the name to grab_cache_page_write_begin which is more instructive and does away with random leading underscores).
This is really a more flexible way to go in the end anyway -- if a filesystem happens to want any extra allocations aside from the pagecache ones in its write_begin function, it may now use GFP_KERNEL (rather than GFP_NOFS) for common case allocations (eg. ocfs2_alloc_write_ctxt, for a random example).
[kosaki.motohiro@jp.fujitsu.com: fix ubifs]
[kosaki.motohiro@jp.fujitsu.com: fix fuse]
Signed-off-by: Nick Piggin <npiggin@suse.de>
Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
[ Cleaned up the calling convention: just pass in the AOP flags untouched to the grab_cache_page_write_begin() function. That just simplifies everybody, and may even allow future expansion of the logic. - Linus ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
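How a single AOP flag can stand in for a full gfp_t mask is sketched below. This is a user-space illustration only: the `MODEL_*` constants are made-up bit values (the real kernel constants differ), and `write_begin_gfp()` is a hypothetical helper showing the idea of masking out the "may enter the filesystem" bit when the NOFS flag is set.

```c
/* Illustrative bit values only; these are NOT the kernel's constants. */
#define MODEL_GFP_FS        0x1u                  /* may recurse into the fs */
#define MODEL_GFP_KERNEL    (0x2u | MODEL_GFP_FS) /* normal allocation mask */
#define MODEL_AOP_FLAG_NOFS 0x4u                  /* caller forbids fs entry */

/*
 * Returns the allocation mask to use for the page cache allocation:
 * the mapping's mask, minus the fs bit when the caller passed NOFS.
 */
unsigned int write_begin_gfp(unsigned int mapping_gfp, unsigned int aop_flags)
{
	unsigned int gfp_notmask = 0;

	if (aop_flags & MODEL_AOP_FLAG_NOFS)
		gfp_notmask = MODEL_GFP_FS;

	return mapping_gfp & ~gfp_notmask;
}
```

The design point is that callers such as page_symlink only ever need to say "do not enter the filesystem", so one flag carries everything they previously expressed through a gfp_t argument.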
2009-01-18 | vmalloc.c: fix flushing in vmap_page_range() | Adam Lackorzynski
commit 2e4e27c7d082b2198b63041310609d7191185a9d upstream. The flush_cache_vmap in vmap_page_range() is called with the end of the range twice. The following patch fixes this for me. Signed-off-by: Adam Lackorzynski <adam@os.inf.tu-dresden.de> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
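The "end of the range twice" problem can be modelled in user space as follows. The sketch assumes the mapping loop advances `addr` until it reaches `end` and that the flush is issued afterwards; `struct flush_range` and both function names are hypothetical, and the returned struct stands for the arguments that would be passed to flush_cache_vmap().

```c
/* Hypothetical record of the (start, end) pair handed to the cache flush. */
struct flush_range {
	unsigned long start;
	unsigned long end;
};

/* Buggy model: addr has already been advanced to end when the flush runs. */
struct flush_range map_then_flush_buggy(unsigned long addr, unsigned long end,
					unsigned long page_size)
{
	while (addr < end)
		addr += page_size;                  /* mapping loop (page_size > 0) */

	return (struct flush_range){ addr, end };   /* flush(end, end): empty */
}

/* Fixed model: remember the original start before the loop moves addr. */
struct flush_range map_then_flush_fixed(unsigned long addr, unsigned long end,
					unsigned long page_size)
{
	unsigned long start = addr;                 /* saved for the flush */

	while (addr < end)
		addr += page_size;                  /* mapping loop (page_size > 0) */

	return (struct flush_range){ start, end };  /* flush(start, end) */
}
```

In the buggy model the flushed range is empty, so nothing in the newly mapped area is actually flushed.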
2008-12-16 | mm: Don't touch uninitialized variable in do_pages_stat_array() | KOSAKI Motohiro
Commit 80bba1290ab5122c60cdb73332b26d288dc8aedd removed one necessary variable initialization. As a result, the following warning appeared:
  CC mm/migrate.o
  mm/migrate.c: In function 'sys_move_pages':
  mm/migrate.c:1001: warning: 'err' may be used uninitialized in this function
Worse, if find_vma() failed, the kernel read uninitialized memory.
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
CC: Brice Goglin <Brice.Goglin@inria.fr>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Nick Piggin <npiggin@suse.de>
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-12-15 | slob: do not pass the SLAB flags as GFP in kmem_cache_create() | Catalin Marinas
The kmem_cache_create() function in the slob allocator passes the SLAB flags as GFP flags to the slob_alloc() function. The patch changes this call to pass GFP_KERNEL as the other allocators seem to do. Signed-off-by: Catalin Marinas <catalin.marinas@arm.com> Acked-by: Matt Mackall <mpm@selenic.com> Cc: Cyrill Gorcunov <gorcunov@gmail.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-12-10 | KSYM_SYMBOL_LEN fixes | Hugh Dickins
Miles Lane tailing /sys files hit a BUG which Pekka Enberg has tracked to my 966c8c12dc9e77f931e2281ba25d2f0244b06949 ("sprint_symbol(): use less stack") exposing a bug in slub's list_locations(): kallsyms_lookup() writes a 0 to namebuf[KSYM_NAME_LEN-1], but that was beyond the end of the page provided.
The 100 bytes of slop which list_locations() allows at the end of the page looks roughly enough for all the other stuff it might print after the symbol before it checks again: break out KSYM_SYMBOL_LEN earlier than before.
Latencytop and ftrace are using KSYM_NAME_LEN buffers where they need KSYM_SYMBOL_LEN buffers, and vmallocinfo a 2*KSYM_NAME_LEN buffer where it wants a KSYM_SYMBOL_LEN buffer: fix those before anyone copies them.
[akpm@linux-foundation.org: ftrace.h needs module.h]
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Miles Lane <miles.lane@gmail.com>
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Acked-by: Steven Rostedt <srostedt@redhat.com>
Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-12-10 | mm: no get_user/put_user while holding mmap_sem in do_pages_stat? | Brice Goglin
Since commit 2f007e74bb85b9fc4eab28524052161703300f1a, do_pages_stat() gets the page address from user-space and puts the corresponding status back while holding the mmap_sem for read. There is no need to hold mmap_sem there while some page-faults may occur. This patch adds a temporary address and status buffer so as to only hold mmap_sem while working on these kernel buffers. This is implemented by extracting do_pages_stat_array() out of do_pages_stat(). Signed-off-by: Brice Goglin <Brice.Goglin@inria.fr> Cc: Christoph Lameter <clameter@sgi.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Nick Piggin <npiggin@suse.de> Cc: Hugh Dickins <hugh@veritas.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-12-10 | page_cgroup should ignore empty nodes | KAMEZAWA Hiroyuki
Fix a total bootup freeze on ia64. Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Tested-by: Li Zefan <lizf@cn.fujitsu.com> Reported-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-12-10 | mm: remove UP version of lru_add_drain_all() | KOSAKI Motohiro
Currently, lru_add_drain_all() has two versions:
 (1) one that uses schedule_on_each_cpu()
 (2) one that does not use schedule_on_each_cpu()
Gerald Schaefer reported that it doesn't work well on an SMP (non-NUMA) s390 machine:
offline_pages() calls lru_add_drain_all() followed by drain_all_pages(). While drain_all_pages() works on each cpu, lru_add_drain_all() only runs on the current cpu for architectures w/o CONFIG_NUMA. This let us run into the BUG_ON(!PageBuddy(page)) in __offline_isolated_pages() during memory hotplug stress test on s390. The page in question was still on the pcp list, because of a race with lru_add_drain_all() and drain_all_pages() on different cpus.
In practice, almost every machine has CONFIG_UNEVICTABLE_LRU=y, so almost every machine uses version (1) of lru_add_drain_all() even when it is UP. The ifdef is therefore of little value, and simply removing it is better.
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Acked-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Cc: Dave Hansen <dave@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-12-10 | mm/backing-dev.c: remove recently-added WARN_ON() | Andrew Morton
On second thoughts, this is just going to disturb people while telling us things which we already knew. Cc: Peter Korsgaard <jacmet@sunsite.dk> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Kay Sievers <kay.sievers@vrfy.org> Cc: David Woodhouse <dwmw2@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-12-02 | vmscan: evict streaming IO first | Rik van Riel
Count the insertion of new pages in the statistics used to drive the pageout scanning code. This should help the kernel quickly evict streaming file IO. We count on the fact that new file pages start on the inactive file LRU and new anonymous pages start on the active anon list. This means streaming file IO will increment the recent scanned file statistic, while leaving the recent rotated file statistic alone, driving pageout scanning to the file LRUs. Pageout activity does its own list manipulation. Signed-off-by: Rik van Riel <riel@redhat.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Tested-by: Gene Heskett <gene.heskett@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-12-02 | bdi: register sysfs bdi device only once per queue | Kay Sievers
Devices which share the same queue, like floppies and mtd devices, get registered multiple times in the bdi interface, but bdi accounts only the last registered device of the devices sharing one queue. On remove, all earlier registered devices leak, stay around in sysfs, and cause "duplicate filename" errors if the devices are re-created. This prevents the creation of multiple bdi interfaces per queue, and the bdi device will carry the dev_t name of the block device which is the first one registered, of the pool of devices using the same queue. [akpm@linux-foundation.org: add a WARN_ON so we know which drivers are misbehaving] Tested-by: Peter Korsgaard <jacmet@sunsite.dk> Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Signed-off-by: Kay Sievers <kay.sievers@vrfy.org> Cc: David Woodhouse <dwmw2@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-12-01memcg: memory hotplug fix for notifier callbackKAMEZAWA Hiroyuki
Fixes for memcg/memory hotplug. While memory hotplug allocates and frees memmap, page_cgroup is not freed at OFFLINE when it was allocated via bootmem (because freeing bootmem requires special care). So, if page_cgroup was allocated by bootmem and memmap is freed and reallocated by memory hotplug, page_cgroup->page == page is no longer true. But the current MEM_ONLINE handler doesn't check this, and doesn't update page_cgroup->page when it is not necessary to allocate page_cgroup. (This was not found because memmap is not freed if SPARSEMEM_VMEMMAP is y.) I also noticed that MEM_ONLINE can be called against part of a section, so freeing page_cgroup at CANCEL_ONLINE would cause trouble (freeing a page_cgroup that is in use). Don't roll back at CANCEL. One more thing: the current memory hotplug notifier chain is stopped by slub because it sets NOTIFY_STOP_MASK in its return value, so page_cgroup's callback is never called (it now has lower priority than slub). I think this slub behavior is not intentional (a bug), and this patch fixes it. Another approach to page_cgroup allocation that could be considered: free page_cgroup at OFFLINE even if it came from bootmem, and remove the special handler. But that requires more changes. Addresses http://bugzilla.kernel.org/show_bug.cgi?id=12041 Signed-off-by: KAMEZAWA Hiruyoki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Li Zefan <lizf@cn.fujitsu.com> Cc: Balbir Singh <balbir@in.ibm.com> Cc: Pavel Emelyanov <xemul@openvz.org> Tested-by: Badari Pulavarty <pbadari@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-12-01mm: vmalloc fix lazy unmapping cache aliasingNick Piggin
Jim Radford has reported that the vmap subsystem rewrite was sometimes causing his VIVT ARM system to behave strangely (seemed like going into infinite loops trying to fault in pages to userspace). We determined that the problem was most likely due to a cache aliasing issue. flush_cache_vunmap was only being called at the moment the page tables were to be taken down, however with lazy unmapping, this can happen after the page has subsequently been freed and allocated for something else. The dangling alias may still have dirty data attached to it. The fix for this problem is to do the cache flushing when the caller has called vunmap -- it would be a bug for them to write anything else to the mapping at that point. That appeared to solve Jim's problems. Reported-by: Jim Radford <radford@blackbean.org> Signed-off-by: Nick Piggin <npiggin@suse.de> Cc: Russell King <rmk@arm.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-12-01vmscan: protect zone rotation stats by lru lockJohannes Weiner
The zone's rotation statistics must not be accessed without the corresponding LRU lock held. Fix an unprotected write in shrink_active_list(). Acked-by: Rik van Riel <riel@redhat.com> Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Johannes Weiner <hannes@saeurebad.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-11-30meminit section warningsAl Viro
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-11-19vmscan: fix get_scan_ratio() commentRik van Riel
Fix the old comment on the scan ratio calculations. Signed-off-by: Rik van Riel <riel@redhat.com> Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-11-19vmscan: let GFP_NOFS go to swap againHugh Dickins
In the past, GFP_NOFS (but of course not GFP_NOIO) was allowed to reclaim by writing to swap. That got partially broken in 2.6.23, when the may_enter_fs initialization was moved up before the allocation of swap, so its PageSwapCache test was failing the first time around. Fix it by setting may_enter_fs when add_to_swap() succeeds with __GFP_IO. In fact, check __GFP_IO before calling add_to_swap(): allocating swap we're not ready to use just increases disk seeking. Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: Rik van Riel <riel@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
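The ordering described above can be sketched roughly as follows. This is a simplified model, not the vmscan code: `TOY_GFP_IO`, `TOY_GFP_FS`, `toy_page`, `toy_add_to_swap()` and `toy_may_writeout()` are hypothetical names, and the swap allocation here always succeeds.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for __GFP_IO and __GFP_FS. */
#define TOY_GFP_IO 0x1u
#define TOY_GFP_FS 0x2u

struct toy_page {
    bool anon;            /* anonymous page (needs swap to be written) */
    bool in_swap_cache;   /* swap space already allocated */
};

/* Stand-in for add_to_swap(); always succeeds in this sketch. */
static bool toy_add_to_swap(struct toy_page *page)
{
    page->in_swap_cache = true;
    return true;
}

/* Returns true if reclaim would be allowed to write this page out. */
static bool toy_may_writeout(struct toy_page *page, unsigned int gfp_mask)
{
    assert(page != NULL);

    /* Early initialization: the swap cache test fails for fresh anon pages. */
    bool may_enter_fs = (gfp_mask & TOY_GFP_FS) ||
                        (page->in_swap_cache && (gfp_mask & TOY_GFP_IO));

    if (page->anon && !page->in_swap_cache) {
        /* Don't allocate swap that we are not allowed to use. */
        if (!(gfp_mask & TOY_GFP_IO))
            return false;
        if (!toy_add_to_swap(page))
            return false;
        /* Swap was just allocated with IO allowed: writing is permitted. */
        may_enter_fs = true;
    }
    return may_enter_fs;
}
```

Under these assumptions an IO-only mask (the GFP_NOFS case) can write a fresh anonymous page to swap, while a mask without IO never allocates swap at all.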
2008-11-19migration: fix writepage errorHugh Dickins
Page migration's writeout() has got understandably confused by the nasty AOP_WRITEPAGE_ACTIVATE case: as in normal success, a writepage() error has unlocked the page, so writeout() then needs to relock it. Signed-off-by: Hugh Dickins <hugh@veritas.com> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Christoph Lameter <cl@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-11-19mm: vmalloc search restart fixGlauber Costa
Currently, vmalloc restarts its search for a free area if it can't find one, because some areas are lazily freed and could possibly be freed now. However, the current implementation restarts the search from the last failing address, which is pretty much by definition at the end of the address space. So we fail. This patch restarts the search from the beginning of the requested vstart address instead. This fixes the regression in running KVM virtual machines for me, described in http://lkml.org/lkml/2008/10/28/349, caused by commit db64fe02258f1507e13fe5212a989922323685ce. Signed-off-by: Glauber Costa <glommer@redhat.com> Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
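As an illustration of why the restart point matters, here is a toy slot allocator. It is only a sketch: the real allocator walks a tree of address ranges, and `toy_state`, `toy_find_free()`, `toy_purge_lazy()` and `toy_alloc()` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_SLOTS 8

enum toy_state { TOY_FREE, TOY_USED, TOY_LAZY };

/* First-fit search for a free slot at or after 'start'; -1 if none. */
static int toy_find_free(const enum toy_state *slots, size_t start)
{
    for (size_t i = start; i < TOY_SLOTS; i++)
        if (slots[i] == TOY_FREE)
            return (int)i;
    return -1;
}

/* Turn lazily freed slots into really free ones. */
static void toy_purge_lazy(enum toy_state *slots)
{
    for (size_t i = 0; i < TOY_SLOTS; i++)
        if (slots[i] == TOY_LAZY)
            slots[i] = TOY_FREE;
}

/*
 * Allocate one slot at or after 'vstart'. On failure, purge lazily
 * freed slots and search again from 'vstart', not from the address
 * where the first search gave up (which is the end of the range).
 */
static int toy_alloc(enum toy_state *slots, size_t vstart)
{
    assert(vstart <= TOY_SLOTS);
    int idx = toy_find_free(slots, vstart);
    if (idx < 0) {
        toy_purge_lazy(slots);
        idx = toy_find_free(slots, vstart);
    }
    if (idx >= 0)
        slots[idx] = TOY_USED;
    return idx;
}
```

Had the retry started from the failing position at the end of the range, the slot released by the purge would never be seen.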
2008-11-19mm: vmalloc failure flush fixNick Piggin
An initial vmalloc failure should start off a synchronous flush of lazy areas, in case someone is in progress flushing them already, which could cause us to return an allocation failure even if there is plenty of KVA free. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-11-19mm: vmalloc allocator off by oneNick Piggin
Fix off by one bug in the KVA allocator that can leave gaps in the address space. Signed-off-by: Nick Piggin <npiggin@suse.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-11-19cpuset: update top cpuset's mems after adding a nodeMiao Xie
After adding a node into the machine, the top cpuset's mems isn't updated. By reviewing the code, we found that the update function cpuset_track_online_nodes() was invoked after node_states[N_ONLINE] changes. That is wrong, because N_ONLINE just means the node has a pgdat; if the node has (or was given) memory, we use N_HIGH_MEMORY. So we should invoke the update function after node_states[N_HIGH_MEMORY] changes, just like its commit says. This patch fixes it, and uses the memory hotplug notifier instead of calling cpuset_track_online_nodes() directly. Signed-off-by: Miao Xie <miaox@cn.fujitsu.com> Acked-by: Yasunori Goto <y-goto@jp.fujitsu.com> Cc: David Rientjes <rientjes@google.com> Cc: Paul Menage <menage@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-11-16unitialized return value in mm/mlock.c: __mlock_vma_pages_range()Helge Deller
Fix an uninitialized return value when compiling on parisc (with CONFIG_UNEVICTABLE_LRU=y): mm/mlock.c: In function `__mlock_vma_pages_range': mm/mlock.c:165: warning: `ret' might be used uninitialized in this function Signed-off-by: Helge Deller <deller@gmx.de> [ It isn't ever really used uninitialized, since no caller should ever call this function with an empty range. But the compiler is correct that from a local analysis standpoint that is impossible to see, and fixing the warning is appropriate. ] Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
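The class of warning discussed above can be reproduced with a small, generic example. This is not the mlock code; `toy_process_range()` is a hypothetical function whose return value is only assigned inside a loop, so the compiler cannot prove it is set when the range is empty.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Walk a range and stop at the first negative value.
 * Returns 0 on success, -1 if a negative value was found.
 */
static int toy_process_range(const int *vals, size_t n)
{
    /*
     * Without this initializer, 'ret' would be read uninitialized
     * whenever n == 0, which is what the compiler warns about even
     * if no caller ever passes an empty range.
     */
    int ret = 0;

    assert(vals != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        ret = vals[i] < 0 ? -1 : 0;
        if (ret)
            break;
    }
    return ret;
}
```

Initializing at the declaration costs nothing and gives the empty-range case a defined result.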
2008-11-15mm: remove unevictable's show_page_pathKOSAKI Motohiro
Hugh Dickins reported that show_page_path() is buggy and unsafe because it
- lacks dput() against d_find_alias()
- doesn't handle vma->vm_mm->owner == NULL
- lacks lock_page()
It was only for debugging, so rather than trying to fix it, just remove it now. Reported-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: Hugh Dickins <hugh@veritas.com> Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> CC: Lee Schermerhorn <Lee.Schermerhorn@hp.com> CC: Rik van Riel <riel@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-11-12memcg: bugfix for memory hotplugKAMEZAWA Hiroyuki
The start pfn calculation in page_cgroup's memory hotplug notifier chain is wrong. Tested-by: Badari Pulavarty <pbadari@us.ibm.com> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2008-11-12mm: remove lru_add_drain_all() from the munlock pathKOSAKI Motohiro
lockdep prints the warning below at boot time on one of my test machines. It shows that schedule_on_each_cpu() shouldn't be called while the task holds mmap_sem. lru_add_drain_all() exists to prevent unevictable pages from staying on the reclaimable LRU list, but the current unevictable code can rescue unevictable pages even if they stay on the reclaimable list, so removing the call is better. In addition, this patch adds lru_add_drain_all() to sys_mlock() and sys_mlockall(). It isn't required, but it reduces failures to move pages to the unevictable list. Such failures can be rescued later in vmscan, but reducing them is better. Note: if the above rescue happens, the Mlocked and Unevictable fields in /proc/meminfo will mismatch, but that doesn't cause any real trouble. ======================================================= [ INFO: possible circular locking dependency detected ] 2.6.28-rc2-mm1 #2 ------------------------------------------------------- lvm/1103 is trying to acquire lock: (&cpu_hotplug.lock){--..}, at: [<c0130789>] get_online_cpus+0x29/0x50 but task is already holding lock: (&mm->mmap_sem){----}, at: [<c01878ae>] sys_mlockall+0x4e/0xb0 which lock already depends on the new lock. 
the existing dependency chain (in reverse order) is: -> #3 (&mm->mmap_sem){----}: [<c0153da2>] check_noncircular+0x82/0x110 [<c0185e6a>] might_fault+0x4a/0xa0 [<c0156161>] validate_chain+0xb11/0x1070 [<c0185e6a>] might_fault+0x4a/0xa0 [<c0156923>] __lock_acquire+0x263/0xa10 [<c015714c>] lock_acquire+0x7c/0xb0 (*) grab mmap_sem [<c0185e6a>] might_fault+0x4a/0xa0 [<c0185e9b>] might_fault+0x7b/0xa0 [<c0185e6a>] might_fault+0x4a/0xa0 [<c0294dd0>] copy_to_user+0x30/0x60 [<c01ae3ec>] filldir+0x7c/0xd0 [<c01e3a6a>] sysfs_readdir+0x11a/0x1f0 (*) grab sysfs_mutex [<c01ae370>] filldir+0x0/0xd0 [<c01ae370>] filldir+0x0/0xd0 [<c01ae4c6>] vfs_readdir+0x86/0xa0 (*) grab i_mutex [<c01ae75b>] sys_getdents+0x6b/0xc0 [<c010355a>] syscall_call+0x7/0xb [<ffffffff>] 0xffffffff -> #2 (sysfs_mutex){--..}: [<c0153da2>] check_noncircular+0x82/0x110 [<c01e3d2c>] sysfs_addrm_start+0x2c/0xc0 [<c0156161>] validate_chain+0xb11/0x1070 [<c01e3d2c>] sysfs_addrm_start+0x2c/0xc0 [<c0156923>] __lock_acquire+0x263/0xa10 [<c015714c>] lock_acquire+0x7c/0xb0 (*) grab sysfs_mutex [<c01e3d2c>] sysfs_addrm_start+0x2c/0xc0 [<c04f8b55>] mutex_lock_nested+0xa5/0x2f0 [<c01e3d2c>] sysfs_addrm_start+0x2c/0xc0 [<c01e3d2c>] sysfs_addrm_start+0x2c/0xc0 [<c01e3d2c>] sysfs_addrm_start+0x2c/0xc0 [<c01e422f>] create_dir+0x3f/0x90 [<c01e42a9>] sysfs_create_dir+0x29/0x50 [<c04faaf5>] _spin_unlock+0x25/0x40 [<c028f21d>] kobject_add_internal+0xcd/0x1a0 [<c028f37a>] kobject_set_name_vargs+0x3a/0x50 [<c028f41d>] kobject_init_and_add+0x2d/0x40 [<c019d4d2>] sysfs_slab_add+0xd2/0x180 [<c019d580>] sysfs_add_func+0x0/0x70 [<c019d5dc>] sysfs_add_func+0x5c/0x70 (*) grab slub_lock [<c01400f2>] run_workqueue+0x172/0x200 [<c014008f>] run_workqueue+0x10f/0x200 [<c0140bd0>] worker_thread+0x0/0xf0 [<c0140c6c>] worker_thread+0x9c/0xf0 [<c0143c80>] autoremove_wake_function+0x0/0x50 [<c0140bd0>] worker_thread+0x0/0xf0 [<c0143972>] kthread+0x42/0x70 [<c0143930>] kthread+0x0/0x70 [<c01042db>] kernel_thread_helper+0x7/0x1c [<ffffffff>] 
0xffffffff -> #1 (slub_lock){----}: [<c0153d2d>] check_noncircular+0xd/0x110 [<c04f650f>] slab_cpuup_callback+0x11f/0x1d0 [<c0156161>] validate_chain+0xb11/0x1070 [<c04f650f>] slab_cpuup_callback+0x11f/0x1d0 [<c015433d>] mark_lock+0x35d/0xd00 [<c0156923>] __lock_acquire+0x263/0xa10 [<c015714c>] lock_acquire+0x7c/0xb0 [<c04f650f>] slab_cpuup_callback+0x11f/0x1d0 [<c04f93a3>] down_read+0x43/0x80 [<c04f650f>] slab_cpuup_callback+0x11f/0x1d0 (*) grab slub_lock [<c04f650f>] slab_cpuup_callback+0x11f/0x1d0 [<c04fd9ac>] notifier_call_chain+0x3c/0x70 [<c04f5454>] _cpu_up+0x84/0x110 [<c04f552b>] cpu_up+0x4b/0x70 (*) grab cpu_hotplug.lock [<c06d1530>] kernel_init+0x0/0x170 [<c06d15e5>] kernel_init+0xb5/0x170 [<c06d1530>] kernel_init+0x0/0x170 [<c01042db>] kernel_thread_helper+0x7/0x1c [<ffffffff>] 0xffffffff -> #0 (&cpu_hotplug.lock){--..}: [<c0155bff>] validate_chain+0x5af/0x1070 [<c040f7e0>] dev_status+0x0/0x50 [<c0156923>] __lock_acquire+0x263/0xa10 [<c015714c>] lock_acquire+0x7c/0xb0 [<c0130789>] get_online_cpus+0x29/0x50 [<c04f8b55>] mutex_lock_nested+0xa5/0x2f0 [<c0130789>] get_online_cpus+0x29/0x50 [<c0130789>] get_online_cpus+0x29/0x50 [<c017bc30>] lru_add_drain_per_cpu+0x0/0x10 [<c0130789>] get_online_cpus+0x29/0x50 (*) grab cpu_hotplug.lock [<c0140cf2>] schedule_on_each_cpu+0x32/0xe0 [<c0187095>] __mlock_vma_pages_range+0x85/0x2c0 [<c0156945>] __lock_acquire+0x285/0xa10 [<c0188f09>] vma_merge+0xa9/0x1d0 [<c0187450>] mlock_fixup+0x180/0x200 [<c0187548>] do_mlockall+0x78/0x90 (*) grab mmap_sem [<c01878e1>] sys_mlockall+0x81/0xb0 [<c010355a>] syscall_call+0x7/0xb [<ffffffff>] 0xffffffff other info that might help us debug this: 1 lock held by lvm/1103: #0: (&mm->mmap_sem){----}, at: [<c01878ae>] sys_mlockall+0x4e/0xb0 stack backtrace: Pid: 1103, comm: lvm Not tainted 2.6.28-rc2-mm1 #2 Call Trace: [<c01555fc>] print_circular_bug_tail+0x7c/0xd0 [<c0155bff>] validate_chain+0x5af/0x1070 [<c040f7e0>] dev_status+0x0/0x50 [<c0156923>] __lock_acquire+0x263/0xa10 [<c015714c>] 
lock_acquire+0x7c/0xb0 [<c0130789>] get_online_cpus+0x29/0x50 [<c04f8b55>] mutex_lock_nested+0xa5/0x2f0 [<c0130789>] get_online_cpus+0x29/0x50 [<c0130789>] get_online_cpus+0x29/0x50 [<c017bc30>] lru_add_drain_per_cpu+0x0/0x10 [<c0130789>] get_online_cpus+0x29/0x50 [<c0140cf2>] schedule_on_each_cpu+0x32/0xe0 [<c0187095>] __mlock_vma_pages_range+0x85/0x2c0 [<c0156945>] __lock_acquire+0x285/0xa10 [<c0188f09>] vma_merge+0xa9/0x1d0 [<c0187450>] mlock_fixup+0x180/0x200 [<c0187548>] do_mlockall+0x78/0x90 [<c01878e1>] sys_mlockall+0x81/0xb0 [<c010355a>] syscall_call+0x7/0xb Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Tested-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com> Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com> Cc: Christoph Lameter <cl@linux-foundation.org> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Cc: Hugh Dickins <hugh@veritas.com> Cc: Rik van Riel <riel@redhat.com> Cc: Peter Zijlstra <peterz@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>