linux/mm/filemap.c, branch v3.3.5

readahead: fix pipeline break caused by block plug

2012-02-04T00:16:41Z

Herbert Poetzl reported a performance regression since 2.6.39. The test is a simple dd read, but with big block size. The reason is: T1: ra (A, A+128k), (A+128k, A+256k) T2: lock_page for page A, submit the 256k T3: hit page A+128K, ra (A+256k, A+384). the range isn't submitted because of plug and there isn't any lock_page till we hit page A+256k because all pages from A to A+256k is in memory T4: hit page A+256k, ra (A+384, A+ 512). Because of plug, the range isn't submitted again. T5: lock_page A+256k, so (A+256k, A+512k) will be submitted. The task is waitting for (A+256k, A+512k) finish. There is no request to disk in T3 and T4, so readahead pipeline breaks. We really don't need block plug for generic_file_aio_read() for buffered I/O. The readahead already has plug and has fine grained control when I/O should be submitted. Deleting plug for buffered I/O fixes the regression. One side effect is plug makes the request size 256k, the size is 128k without it. This is because default ra size is 128k and not a reason we need plug here. Vivek said: : We submit some readahead IO to device request queue but because of nested : plug, queue never gets unplugged. When read logic reaches a page which is : not in page cache, it waits for page to be read from the disk : (lock_page_killable()) and that time we flush the plug list. : : So effectively read ahead logic is kind of broken in parts because of : nested plugging. Removing top level plug (generic_file_aio_read()) for : buffered reads, will allow unplugging queue earlier for readahead. Signed-off-by: Shaohua Li Signed-off-by: Wu Fengguang Reported-by: Herbert Poetzl Tested-by: Eric Dumazet Cc: Christoph Hellwig Cc: Jens Axboe Cc: Vivek Goyal Cc: Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds

memcg: add mem_cgroup_replace_page_cache() to fix LRU issue

2012-01-13T04:13:04Z

Commit ef6a3c6311 ("mm: add replace_page_cache_page() function") added a function replace_page_cache_page(). This function replaces a page in the radix-tree with a new page. WHen doing this, memory cgroup needs to fix up the accounting information. memcg need to check PCG_USED bit etc. In some(many?) cases, 'newpage' is on LRU before calling replace_page_cache(). So, memcg's LRU accounting information should be fixed, too. This patch adds mem_cgroup_replace_page_cache() and removes the old hooks. In that function, old pages will be unaccounted without touching res_counter and new page will be accounted to the memcg (of old page). WHen overwriting pc->mem_cgroup of newpage, take zone->lru_lock and avoid races with LRU handling. Background: replace_page_cache_page() is called by FUSE code in its splice() handling. Here, 'newpage' is replacing oldpage but this newpage is not a newly allocated page and may be on LRU. LRU mis-accounting will be critical for memory cgroup because rmdir() checks the whole LRU is empty and there is no account leak. If a page is on the other LRU than it should be, rmdir() will fail. This bug was added in March 2011, but no bug report yet. I guess there are not many people who use memcg and FUSE at the same time with upstream kernels. The result of this bug is that admin cannot destroy a memcg because of account leak. So, no panic, no deadlock. And, even if an active cgroup exist, umount can succseed. So no problem at shutdown. Signed-off-by: KAMEZAWA Hiroyuki Acked-by: Johannes Weiner Acked-by: Michal Hocko Cc: Miklos Szeredi Cc: Hugh Dickins Cc: Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds

mm: filemap: pass __GFP_WRITE from grab_cache_page_write_begin()

2012-01-11T00:30:43Z

Tell the page allocator that pages allocated through grab_cache_page_write_begin() are expected to become dirty soon. Signed-off-by: Johannes Weiner Reviewed-by: Rik van Riel Acked-by: Mel Gorman Reviewed-by: Minchan Kim Reviewed-by: Michal Hocko Cc: KAMEZAWA Hiroyuki Cc: Christoph Hellwig Cc: Wu Fengguang Cc: Dave Chinner Cc: Jan Kara Cc: Shaohua Li Cc: Chris Mason Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds

should_remove_suid(): inode->i_mode is umode_t

2012-01-04T03:55:14Z

Signed-off-by: Al Viro

vfs: __read_cache_page should use gfp argument rather than GFP_KERNEL

2011-12-22T01:02:46Z

lockdep reports a deadlock in jfs because a special inode's rw semaphore is taken recursively. The mapping's gfp mask is GFP_NOFS, but is not used when __read_cache_page() calls add_to_page_cache_lru(). Signed-off-by: Dave Kleikamp Acked-by: Hugh Dickins Acked-by: Al Viro Cc: stable@kernel.org Signed-off-by: Linus Torvalds

fs: Make write(2) interruptible by a fatal signal

2011-12-02T01:17:05Z

Currently write(2) to a file is not interruptible by any signal. Sometimes this is desirable, e.g. when you want to quickly kill a process hogging your disk. Also, with commit 499d05ecf990 ("mm: Make task in balance_dirty_pages() killable"), it's necessary to abort the current write accordingly to avoid it quickly dirtying lots more pages at unthrottled rate. This patch makes write interruptible by SIGKILL. We do not allow write to be interruptible by any other signal because that has larger potential of screwing some badly written applications. Reported-by: Kazuya Mio Tested-by: Kazuya Mio Acked-by: Matthew Wilcox Signed-off-by: Jan Kara Signed-off-by: Wu Fengguang

mm: Map most files to use export.h instead of module.h

2011-10-31T13:20:12Z

The files changed within are only using the EXPORT_SYMBOL macro variants. They are not using core modular infrastructure and hence don't need module.h but only the export.h header. Signed-off-by: Paul Gortmaker

vfs: iov_iter: have iov_iter_advance decrement nr_segs appropriately

2011-10-28T11:55:08Z

Currently, when you call iov_iter_advance, then the pointer to the iovec array can be incremented, but it does not decrement the nr_segs value in the iov_iter struct. The result is a iov_iter struct with a nr_segs value that goes beyond the end of the array. While I'm not aware of anything that's specifically broken by this, it seems odd and a bit dangerous not to decrement that value. If someone were to trust the nr_segs value to be correct, then they could end up walking off the end of the array. Changing this might also provide some micro-optimization when dealing with the last iovec in an array. Many of the other routines that deal with iov_iter have optimized codepaths when nr_segs == 1. Cc: Nick Piggin Signed-off-by: Jeff Layton Signed-off-by: Christoph Hellwig

mm: account skipped entries to avoid looping in find_get_pages

2011-09-15T01:17:56Z

The found entries by find_get_pages() could be all swap entries. In this case we skip the entries, but make sure the skipped entries are accounted, so we don't keep looping. Using nr_found > nr_skip to simplify code as suggested by Eric. Reported-and-tested-by: Eric Dumazet Signed-off-by: Shaohua Li Signed-off-by: Linus Torvalds

mm: clarify the radix_tree exceptional cases

2011-08-04T00:25:24Z

Make the radix_tree exceptional cases, mostly in filemap.c, clearer. It's hard to devise a suitable snappy name that illuminates the use by shmem/tmpfs for swap, while keeping filemap/pagecache/radix_tree generality. And akpm points out that /* radix_tree_deref_retry(page) */ comments look like calls that have been commented out for unknown reason. Skirt the naming difficulty by rearranging these blocks to handle the transient radix_tree_deref_retry(page) case first; then just explain the remaining shmem/tmpfs swap case in a comment. Signed-off-by: Hugh Dickins Signed-off-by: Andrew Morton Signed-off-by: Linus Torvalds