<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux/mm/filemap.c, branch v3.2.38</title>
<subtitle>Linux kernel source tree</subtitle>
<id>https://git.amat.us/linux/atom/mm/filemap.c?h=v3.2.38</id>
<link rel='self' href='https://git.amat.us/linux/atom/mm/filemap.c?h=v3.2.38'/>
<link rel='alternate' type='text/html' href='https://git.amat.us/linux/'/>
<updated>2012-08-02T13:37:36Z</updated>
<entry>
<title>cpuset: mm: reduce large amounts of memory barrier related damage v3</title>
<updated>2012-08-02T13:37:36Z</updated>
<author>
<name>Mel Gorman</name>
<email>mgorman@suse.de</email>
</author>
<published>2012-03-21T23:34:11Z</published>
<link rel='alternate' type='text/html' href='https://git.amat.us/linux/commit/?id=e8bf81d11ffae58e78a1d8f58f51accdae627c65'/>
<id>urn:sha1:e8bf81d11ffae58e78a1d8f58f51accdae627c65</id>
<content type='text'>
commit cc9a6c8776615f9c194ccf0b63a0aa5628235545 upstream.

Stable note:  Not tracked in Bugzilla. [get|put]_mems_allowed() is extremely
	expensive and severely impacted page allocator performance. This
	is part of a series of patches that reduce page allocator overhead.

Commit c0ff7453bb5c ("cpuset,mm: fix no node to alloc memory when
changing cpuset's mems") wins a super prize for the largest number of
memory barriers entered into fast paths for one commit.

[get|put]_mems_allowed is incredibly heavy with pairs of full memory
barriers inserted into a number of hot paths.  This was detected while
investigating at large page allocator slowdown introduced some time
after 2.6.32.  The largest portion of this overhead was shown by
oprofile to be at an mfence introduced by this commit into the page
allocator hot path.

For extra style points, the commit introduced the use of yield() in an
implementation of what looks like a spinning mutex.

This patch replaces the full memory barriers on both read and write
sides with a sequence counter with just read barriers on the fast path
side.  This is much cheaper on some architectures, including x86.  The
main bulk of the patch is the retry logic if the nodemask changes in a
manner that can cause a false failure.

While updating the nodemask, a check is made to see if a false failure
is a risk.  If it is, the sequence number gets bumped and parallel
allocators will briefly stall while the nodemask update takes place.

In a page fault test microbenchmark, oprofile samples from
__alloc_pages_nodemask went from 4.53% of all samples to 1.15%.  The
actual results were

                             3.3.0-rc3          3.3.0-rc3
                             rc3-vanilla        nobarrier-v2r1
    Clients   1 UserTime       0.07 (  0.00%)   0.08 (-14.19%)
    Clients   2 UserTime       0.07 (  0.00%)   0.07 (  2.72%)
    Clients   4 UserTime       0.08 (  0.00%)   0.07 (  3.29%)
    Clients   1 SysTime        0.70 (  0.00%)   0.65 (  6.65%)
    Clients   2 SysTime        0.85 (  0.00%)   0.82 (  3.65%)
    Clients   4 SysTime        1.41 (  0.00%)   1.41 (  0.32%)
    Clients   1 WallTime       0.77 (  0.00%)   0.74 (  4.19%)
    Clients   2 WallTime       0.47 (  0.00%)   0.45 (  3.73%)
    Clients   4 WallTime       0.38 (  0.00%)   0.37 (  1.58%)
    Clients   1 Flt/sec/cpu  497620.28 (  0.00%) 520294.53 (  4.56%)
    Clients   2 Flt/sec/cpu  414639.05 (  0.00%) 429882.01 (  3.68%)
    Clients   4 Flt/sec/cpu  257959.16 (  0.00%) 258761.48 (  0.31%)
    Clients   1 Flt/sec      495161.39 (  0.00%) 517292.87 (  4.47%)
    Clients   2 Flt/sec      820325.95 (  0.00%) 850289.77 (  3.65%)
    Clients   4 Flt/sec      1020068.93 (  0.00%) 1022674.06 (  0.26%)
    MMTests Statistics: duration
    Sys Time Running Test (seconds)             135.68    132.17
    User+Sys Time Running Test (seconds)         164.2    160.13
    Total Elapsed Time (seconds)                123.46    120.87

The overall improvement is small but the System CPU time is much
improved and roughly in correlation to what oprofile reported (these
performance figures are without profiling so skew is expected).  The
actual number of page faults is noticeably improved.

For benchmarks like kernel builds, the overall benefit is marginal but
the system CPU time is slightly reduced.

To test the actual bug the commit fixed I opened two terminals.  The
first ran within a cpuset and continually ran a small program that
faulted 100M of anonymous data.  In a second window, the nodemask of the
cpuset was continually randomised in a loop.

Without the commit, the program would fail every so often (usually
within 10 seconds) and obviously with the commit everything worked fine.
With this patch applied, it also worked fine so the fix should be
functionally equivalent.

Signed-off-by: Mel Gorman &lt;mgorman@suse.de&gt;
Cc: Miao Xie &lt;miaox@cn.fujitsu.com&gt;
Cc: David Rientjes &lt;rientjes@google.com&gt;
Cc: Peter Zijlstra &lt;a.p.zijlstra@chello.nl&gt;
Cc: Christoph Lameter &lt;cl@linux.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Mel Gorman &lt;mgorman@suse.de&gt;
[bwh: Forward-ported from 3.0 to 3.2: apply the upstream changes
 to get_any_partial()]
Signed-off-by: Ben Hutchings &lt;ben@decadent.org.uk&gt;
</content>
</entry>
<entry>
<title>readahead: fix pipeline break caused by block plug</title>
<updated>2012-02-13T19:16:50Z</updated>
<author>
<name>Shaohua Li</name>
<email>shaohua.li@intel.com</email>
</author>
<published>2012-02-03T23:37:17Z</published>
<link rel='alternate' type='text/html' href='https://git.amat.us/linux/commit/?id=76af79f393ad562077f79627a4c719219ef09ee8'/>
<id>urn:sha1:76af79f393ad562077f79627a4c719219ef09ee8</id>
<content type='text'>
commit 3deaa7190a8da38453c4fabd9dec7f66d17fff67 upstream.

Herbert Poetzl reported a performance regression since 2.6.39.  The test
is a simple dd read, but with big block size.  The reason is:

T1: ra (A, A+128k), (A+128k, A+256k)
T2: lock_page for page A, submit the 256k
T3: hit page A+128K, ra (A+256k, A+384). the range isn't submitted
because of plug and there isn't any lock_page till we hit page A+256k
because all pages from A to A+256k is in memory
T4: hit page A+256k, ra (A+384, A+ 512). Because of plug, the range isn't
submitted again.
T5: lock_page A+256k, so (A+256k, A+512k) will be submitted. The task is
waitting for (A+256k, A+512k) finish.

There is no request to disk in T3 and T4, so readahead pipeline breaks.

We really don't need block plug for generic_file_aio_read() for buffered
I/O.  The readahead already has plug and has fine grained control when I/O
should be submitted.  Deleting plug for buffered I/O fixes the regression.

One side effect is plug makes the request size 256k, the size is 128k
without it.  This is because default ra size is 128k and not a reason we
need plug here.

Vivek said:

: We submit some readahead IO to device request queue but because of nested
: plug, queue never gets unplugged.  When read logic reaches a page which is
: not in page cache, it waits for page to be read from the disk
: (lock_page_killable()) and that time we flush the plug list.
:
: So effectively read ahead logic is kind of broken in parts because of
: nested plugging.  Removing top level plug (generic_file_aio_read()) for
: buffered reads, will allow unplugging queue earlier for readahead.

Signed-off-by: Shaohua Li &lt;shaohua.li@intel.com&gt;
Signed-off-by: Wu Fengguang &lt;fengguang.wu@intel.com&gt;
Reported-by: Herbert Poetzl &lt;herbert@13thfloor.at&gt;
Tested-by: Eric Dumazet &lt;eric.dumazet@gmail.com&gt;
Cc: Christoph Hellwig &lt;hch@infradead.org&gt;
Cc: Jens Axboe &lt;axboe@kernel.dk&gt;
Cc: Vivek Goyal &lt;vgoyal@redhat.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
</entry>
<entry>
<title>memcg: add mem_cgroup_replace_page_cache() to fix LRU issue</title>
<updated>2012-01-26T00:13:23Z</updated>
<author>
<name>KAMEZAWA Hiroyuki</name>
<email>kamezawa.hiroyu@jp.fujitsu.com</email>
</author>
<published>2012-01-13T01:17:44Z</published>
<link rel='alternate' type='text/html' href='https://git.amat.us/linux/commit/?id=e6e6e8cd34116b467c4201ef5c5d354f9919005c'/>
<id>urn:sha1:e6e6e8cd34116b467c4201ef5c5d354f9919005c</id>
<content type='text'>
commit ab936cbcd02072a34b60d268f94440fd5cf1970b upstream.

Commit ef6a3c6311 ("mm: add replace_page_cache_page() function") added a
function replace_page_cache_page().  This function replaces a page in the
radix-tree with a new page.  WHen doing this, memory cgroup needs to fix
up the accounting information.  memcg need to check PCG_USED bit etc.

In some(many?) cases, 'newpage' is on LRU before calling
replace_page_cache().  So, memcg's LRU accounting information should be
fixed, too.

This patch adds mem_cgroup_replace_page_cache() and removes the old hooks.
 In that function, old pages will be unaccounted without touching
res_counter and new page will be accounted to the memcg (of old page).
WHen overwriting pc-&gt;mem_cgroup of newpage, take zone-&gt;lru_lock and avoid
races with LRU handling.

Background:
  replace_page_cache_page() is called by FUSE code in its splice() handling.
  Here, 'newpage' is replacing oldpage but this newpage is not a newly allocated
  page and may be on LRU. LRU mis-accounting will be critical for memory cgroup
  because rmdir() checks the whole LRU is empty and there is no account leak.
  If a page is on the other LRU than it should be, rmdir() will fail.

This bug was added in March 2011, but no bug report yet.  I guess there
are not many people who use memcg and FUSE at the same time with upstream
kernels.

The result of this bug is that admin cannot destroy a memcg because of
account leak.  So, no panic, no deadlock.  And, even if an active cgroup
exist, umount can succseed.  So no problem at shutdown.

Signed-off-by: KAMEZAWA Hiroyuki &lt;kamezawa.hiroyu@jp.fujitsu.com&gt;
Acked-by: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Acked-by: Michal Hocko &lt;mhocko@suse.cz&gt;
Cc: Miklos Szeredi &lt;mszeredi@suse.cz&gt;
Cc: Hugh Dickins &lt;hughd@google.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@suse.de&gt;

</content>
</entry>
<entry>
<title>vfs: __read_cache_page should use gfp argument rather than GFP_KERNEL</title>
<updated>2011-12-22T01:02:46Z</updated>
<author>
<name>Dave Kleikamp</name>
<email>dave.kleikamp@oracle.com</email>
</author>
<published>2011-12-21T17:05:48Z</published>
<link rel='alternate' type='text/html' href='https://git.amat.us/linux/commit/?id=e6f67b8c05f5e129e126f4409ddac6f25f58ffcb'/>
<id>urn:sha1:e6f67b8c05f5e129e126f4409ddac6f25f58ffcb</id>
<content type='text'>
lockdep reports a deadlock in jfs because a special inode's rw semaphore
is taken recursively.  The mapping's gfp mask is GFP_NOFS, but is not
used when __read_cache_page() calls add_to_page_cache_lru().

Signed-off-by: Dave Kleikamp &lt;dave.kleikamp@oracle.com&gt;
Acked-by: Hugh Dickins &lt;hughd@google.com&gt;
Acked-by: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>fs: Make write(2) interruptible by a fatal signal</title>
<updated>2011-12-02T01:17:05Z</updated>
<author>
<name>Jan Kara</name>
<email>jack@suse.cz</email>
</author>
<published>2011-12-02T01:17:02Z</published>
<link rel='alternate' type='text/html' href='https://git.amat.us/linux/commit/?id=a50527b19c62c808a7fca022816fff88a50b948d'/>
<id>urn:sha1:a50527b19c62c808a7fca022816fff88a50b948d</id>
<content type='text'>
Currently write(2) to a file is not interruptible by any signal.
Sometimes this is desirable, e.g. when you want to quickly kill a
process hogging your disk. Also, with commit 499d05ecf990 ("mm: Make
task in balance_dirty_pages() killable"), it's necessary to abort the
current write accordingly to avoid it quickly dirtying lots more pages
at unthrottled rate.

This patch makes write interruptible by SIGKILL. We do not allow write
to be interruptible by any other signal because that has larger
potential of screwing some badly written applications.

Reported-by: Kazuya Mio &lt;k-mio@sx.jp.nec.com&gt;
Tested-by: Kazuya Mio &lt;k-mio@sx.jp.nec.com&gt;
Acked-by: Matthew Wilcox &lt;matthew.r.wilcox@intel.com&gt;
Signed-off-by: Jan Kara &lt;jack@suse.cz&gt;
Signed-off-by: Wu Fengguang &lt;fengguang.wu@intel.com&gt;
</content>
</entry>
<entry>
<title>mm: Map most files to use export.h instead of module.h</title>
<updated>2011-10-31T13:20:12Z</updated>
<author>
<name>Paul Gortmaker</name>
<email>paul.gortmaker@windriver.com</email>
</author>
<published>2011-10-16T06:01:52Z</published>
<link rel='alternate' type='text/html' href='https://git.amat.us/linux/commit/?id=b95f1b31b75588306e32b2afd32166cad48f670b'/>
<id>urn:sha1:b95f1b31b75588306e32b2afd32166cad48f670b</id>
<content type='text'>
The files changed within are only using the EXPORT_SYMBOL
macro variants.  They are not using core modular infrastructure
and hence don't need module.h but only the export.h header.

Signed-off-by: Paul Gortmaker &lt;paul.gortmaker@windriver.com&gt;
</content>
</entry>
<entry>
<title>vfs: iov_iter: have iov_iter_advance decrement nr_segs appropriately</title>
<updated>2011-10-28T11:55:08Z</updated>
<author>
<name>Jeff Layton</name>
<email>jlayton@redhat.com</email>
</author>
<published>2011-10-27T21:53:08Z</published>
<link rel='alternate' type='text/html' href='https://git.amat.us/linux/commit/?id=39be79c16f2b8eb07dd0d4e965cddfe39cc0534a'/>
<id>urn:sha1:39be79c16f2b8eb07dd0d4e965cddfe39cc0534a</id>
<content type='text'>
Currently, when you call iov_iter_advance, then the pointer to the iovec
array can be incremented, but it does not decrement the nr_segs value in
the iov_iter struct. The result is a iov_iter struct with a nr_segs
value that goes beyond the end of the array.

While I'm not aware of anything that's specifically broken by this, it
seems odd and a bit dangerous not to decrement that value. If someone
were to trust the nr_segs value to be correct, then they could end up
walking off the end of the array.

Changing this might also provide some micro-optimization when dealing
with the last iovec in an array. Many of the other routines that deal
with iov_iter have optimized codepaths when nr_segs == 1.

Cc: Nick Piggin &lt;npiggin@suse.de&gt;
Signed-off-by: Jeff Layton &lt;jlayton@redhat.com&gt;
Signed-off-by: Christoph Hellwig &lt;hch@lst.de&gt;
</content>
</entry>
<entry>
<title>mm: account skipped entries to avoid looping in find_get_pages</title>
<updated>2011-09-15T01:17:56Z</updated>
<author>
<name>Shaohua Li</name>
<email>shaohua.li@intel.com</email>
</author>
<published>2011-09-15T00:45:19Z</published>
<link rel='alternate' type='text/html' href='https://git.amat.us/linux/commit/?id=cc39c6a9bbdebfcf1a7dee64d83bf302bc38d941'/>
<id>urn:sha1:cc39c6a9bbdebfcf1a7dee64d83bf302bc38d941</id>
<content type='text'>
The found entries by find_get_pages() could be all swap entries.  In
this case we skip the entries, but make sure the skipped entries are
accounted, so we don't keep looping.

Using nr_found &gt; nr_skip to simplify code as suggested by Eric.

Reported-and-tested-by: Eric Dumazet &lt;eric.dumazet@gmail.com&gt;
Signed-off-by: Shaohua Li &lt;shaohua.li@intel.com&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm: clarify the radix_tree exceptional cases</title>
<updated>2011-08-04T00:25:24Z</updated>
<author>
<name>Hugh Dickins</name>
<email>hughd@google.com</email>
</author>
<published>2011-08-03T23:21:28Z</published>
<link rel='alternate' type='text/html' href='https://git.amat.us/linux/commit/?id=8079b1c859c44f27d63da4951f5038a16589a563'/>
<id>urn:sha1:8079b1c859c44f27d63da4951f5038a16589a563</id>
<content type='text'>
Make the radix_tree exceptional cases, mostly in filemap.c, clearer.

It's hard to devise a suitable snappy name that illuminates the use by
shmem/tmpfs for swap, while keeping filemap/pagecache/radix_tree
generality.  And akpm points out that /* radix_tree_deref_retry(page) */
comments look like calls that have been commented out for unknown
reason.

Skirt the naming difficulty by rearranging these blocks to handle the
transient radix_tree_deref_retry(page) case first; then just explain the
remaining shmem/tmpfs swap case in a comment.

Signed-off-by: Hugh Dickins &lt;hughd@google.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm: a few small updates for radix-swap</title>
<updated>2011-08-04T00:25:24Z</updated>
<author>
<name>Hugh Dickins</name>
<email>hughd@google.com</email>
</author>
<published>2011-08-03T23:21:27Z</published>
<link rel='alternate' type='text/html' href='https://git.amat.us/linux/commit/?id=31475dd611209413bace21651a400afb91d0bd9d'/>
<id>urn:sha1:31475dd611209413bace21651a400afb91d0bd9d</id>
<content type='text'>
Remove PageSwapBacked (!page_is_file_cache) cases from
add_to_page_cache_locked() and add_to_page_cache_lru(): those pages now
go through shmem_add_to_page_cache().

Remove a comment on maximum tmpfs size from fsstack_copy_inode_size(),
and add a comment on swap entries to invalidate_mapping_pages().

And mincore_page() uses find_get_page() on what might be shmem or a
tmpfs file: allow for a radix_tree_exceptional_entry(), and proceed to
find_get_page() on swapper_space if so (oh, swapper_space needs #ifdef).

Signed-off-by: Hugh Dickins &lt;hughd@google.com&gt;
Acked-by: Rik van Riel &lt;riel@redhat.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
</feed>
