<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux/mm/page_alloc.c, branch v3.3.5</title>
<subtitle>Linux kernel source tree</subtitle>
<id>https://git.amat.us/linux/atom/mm/page_alloc.c?h=v3.3.5</id>
<link rel='self' href='https://git.amat.us/linux/atom/mm/page_alloc.c?h=v3.3.5'/>
<link rel='alternate' type='text/html' href='https://git.amat.us/linux/'/>
<updated>2012-02-14T01:45:38Z</updated>
<entry>
<title>vfs: fix panic in __d_lookup() with high dentry hashtable counts</title>
<updated>2012-02-14T01:45:38Z</updated>
<author>
<name>Dimitri Sivanich</name>
<email>sivanich@sgi.com</email>
</author>
<published>2012-02-08T20:39:07Z</published>
<link rel='alternate' type='text/html' href='https://git.amat.us/linux/commit/?id=074b85175a43a23fdbde60f55feea636e0bf0f85'/>
<id>urn:sha1:074b85175a43a23fdbde60f55feea636e0bf0f85</id>
<content type='text'>
When the number of dentry cache hash table entries gets too high
(2147483648 entries), as happens by default on a 16TB system, use of a
signed integer in the dcache_init() initialization loop prevents the
dentry_hashtable from getting initialized, causing a panic in
__d_lookup().  Fix this in dcache_init() and similar areas.
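
A sketch of the failure mode (loop shape follows dcache_init(); an
illustration, not the literal diff).  With d_hash_shift == 31, the
signed expression 1 &lt;&lt; 31 evaluates to INT_MIN on these targets, so the
loop condition is false on the very first test and the table is never
touched; an unsigned counter and shift avoid the wrap:

  int loop;                                          /* buggy: signed */
  for (loop = 0; loop &lt; (1 &lt;&lt; d_hash_shift); loop++) /* 0 &lt; INT_MIN */
      INIT_HLIST_BL_HEAD(&amp;dentry_hashtable[loop]);   /* never runs */

  unsigned int uloop;                                /* fixed */
  for (uloop = 0; uloop &lt; (1U &lt;&lt; d_hash_shift); uloop++)
      INIT_HLIST_BL_HEAD(&amp;dentry_hashtable[uloop]);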

Signed-off-by: Dimitri Sivanich &lt;sivanich@sgi.com&gt;
Acked-by: David S. Miller &lt;davem@davemloft.net&gt;
Cc: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
</content>
</entry>
<entry>
<title>mm: __count_immobile_pages(): make sure the node is online</title>
<updated>2012-01-23T16:38:47Z</updated>
<author>
<name>Michal Hocko</name>
<email>mhocko@suse.cz</email>
</author>
<published>2012-01-20T22:33:58Z</published>
<link rel='alternate' type='text/html' href='https://git.amat.us/linux/commit/?id=656a070629adfe23c12768e35ddf91feab469ff7'/>
<id>urn:sha1:656a070629adfe23c12768e35ddf91feab469ff7</id>
<content type='text'>
page_zone() requires an online node; otherwise we are accessing a NULL
NODE_DATA.  This is not an issue at the moment because node_zones are
located at the beginning of the structure, but this might change in the
future, so better be careful about that.
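
A minimal sketch of the kind of guard this adds (modeled on
is_pageblock_removable_nolock(); an illustration rather than the exact
diff):

  bool is_pageblock_removable_nolock(struct page *page)
  {
      struct zone *zone;

      /* If the node is offline, NODE_DATA(nid) is NULL and page_zone()
       * would compute an address relative to it, so bail out first. */
      if (!node_online(page_to_nid(page)))
          return false;

      zone = page_zone(page);
      return __count_immobile_pages(zone, page, 0);
  }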

Signed-off-by: Michal Hocko &lt;mhocko@suse.cz&gt;
Signed-off-by: KAMEZAWA Hiroyuki &lt;kamezawa.hiroyu@jp.fujitsu.com&gt;
Acked-by: Mel Gorman &lt;mgorman@suse.de&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm: fix NULL ptr dereference in __count_immobile_pages</title>
<updated>2012-01-23T16:38:47Z</updated>
<author>
<name>Michal Hocko</name>
<email>mhocko@suse.cz</email>
</author>
<published>2012-01-20T22:33:55Z</published>
<link rel='alternate' type='text/html' href='https://git.amat.us/linux/commit/?id=687875fb7de4a95223af20ee024282fa9099f860'/>
<id>urn:sha1:687875fb7de4a95223af20ee024282fa9099f860</id>
<content type='text'>
Fix the following NULL ptr dereference caused by

  cat /sys/devices/system/memory/memory0/removable

Pid: 13979, comm: sed Not tainted 3.0.13-0.5-default #1 IBM BladeCenter LS21 -[7971PAM]-/Server Blade
RIP: __count_immobile_pages+0x4/0x100
Process sed (pid: 13979, threadinfo ffff880221c36000, task ffff88022e788480)
Call Trace:
  is_pageblock_removable_nolock+0x34/0x40
  is_mem_section_removable+0x74/0xf0
  show_mem_removable+0x41/0x70
  sysfs_read_file+0xfe/0x1c0
  vfs_read+0xc7/0x130
  sys_read+0x53/0xa0
  system_call_fastpath+0x16/0x1b

We are crashing because we are trying to dereference a NULL zone which
came from pfn=0 (struct page ffffea0000000000).  According to the boot
log this page is marked reserved:
e820 update range: 0000000000000000 - 0000000000010000 (usable) ==&gt; (reserved)

and early_node_map confirms that:
early_node_map[3] active PFN ranges
    1: 0x00000010 -&gt; 0x0000009c
    1: 0x00000100 -&gt; 0x000bffa3
    1: 0x00100000 -&gt; 0x00240000

The problem is that memory_present works in PAGE_SECTION_MASK aligned
blocks, so the reserved range sneaks into the section as well.  This
also means that free_area_init_node will not take care of those reserved
pages and they stay uninitialized.

When we try to read the removable status we walk through all available
sections and hope that the zone is valid for all pages in the section.
But this is not true in this case as the zone and nid are not initialized.

We have only one node in this particular case and it is marked as node=1
(rather than 0), which made the problem visible, because page_to_nid
returns 0 for the uninitialized page and there are no zones on node 0.

Let's check that the zone is valid and that the given pfn falls within
its boundaries, and otherwise mark the section not removable.  This
might cause some false positives, but we do not have any sane way to
find out whether the page is reserved by the platform or simply not
used for some other reason.
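
A sketch of the added check (names follow is_pageblock_removable_nolock()
in mainline; illustrative only):

  struct zone *zone = page_zone(page);
  unsigned long pfn = page_to_pfn(page);

  /* Memory sections are not zone aware, so a reserved page inside the
   * section may lie outside any zone and carry uninitialized zone/nid
   * fields; report such sections as not removable. */
  if (!zone || zone-&gt;zone_start_pfn &gt; pfn ||
      zone-&gt;zone_start_pfn + zone-&gt;spanned_pages &lt;= pfn)
      return false;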

Signed-off-by: Michal Hocko &lt;mhocko@suse.cz&gt;
Acked-by: Mel Gorman &lt;mgorman@suse.de&gt;
Cc: KAMEZAWA Hiroyuki &lt;kamezawa.hiroyu@jp.fujitsu.com&gt;
Cc: Andrea Arcangeli &lt;aarcange@redhat.com&gt;
Cc: David Rientjes &lt;rientjes@google.com&gt;
Cc: &lt;stable@vger.kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm: enum lru_list lru</title>
<updated>2012-01-13T04:13:10Z</updated>
<author>
<name>Hugh Dickins</name>
<email>hughd@google.com</email>
</author>
<published>2012-01-13T01:20:01Z</published>
<link rel='alternate' type='text/html' href='https://git.amat.us/linux/commit/?id=4111304dab198c687bc60f2e235a9f7ee92c47c8'/>
<id>urn:sha1:4111304dab198c687bc60f2e235a9f7ee92c47c8</id>
<content type='text'>
Mostly we use "enum lru_list lru": change those few "l"s to "lru"s.

Signed-off-by: Hugh Dickins &lt;hughd@google.com&gt;
Reviewed-by: KOSAKI Motohiro &lt;kosaki.motohiro@jp.fujitsu.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm: page allocator: do not call direct reclaim for THP allocations while compaction is deferred</title>
<updated>2012-01-13T04:13:09Z</updated>
<author>
<name>Mel Gorman</name>
<email>mgorman@suse.de</email>
</author>
<published>2012-01-13T01:19:41Z</published>
<link rel='alternate' type='text/html' href='https://git.amat.us/linux/commit/?id=66199712e9eef5aede09dbcd9dfff87798a66917'/>
<id>urn:sha1:66199712e9eef5aede09dbcd9dfff87798a66917</id>
<content type='text'>
If compaction is deferred, direct reclaim is used to try to free enough
pages for the allocation to succeed.  For small high-order allocations, this has a
reasonable chance of success.  However, if the caller has specified
__GFP_NO_KSWAPD to limit the disruption to the system, it makes more sense
to fail the allocation rather than stall the caller in direct reclaim.
This patch skips direct reclaim if compaction is deferred and the caller
specifies __GFP_NO_KSWAPD.

Async compaction only considers a subset of pages, so compaction can be
deferred prematurely; with the check above, that would skip direct
reclaim even in cases where it should run.  To compensate, this patch
also defers compaction only if sync compaction failed.
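
In code, roughly (names follow the mainline patch; a sketch, not the
full diff):

  /* __alloc_pages_slowpath(), after direct compaction fails: */
  if (deferred_compaction &amp;&amp; (gfp_mask &amp; __GFP_NO_KSWAPD))
      goto nopage;               /* fail fast, skip direct reclaim */

  /* ...and only a failed sync compaction pass marks compaction as
   * deferred, so a premature async failure cannot trigger the
   * bail-out above: */
  if (sync_migration)
      defer_compaction(preferred_zone);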

Signed-off-by: Mel Gorman &lt;mgorman@suse.de&gt;
Acked-by: Minchan Kim &lt;minchan.kim@gmail.com&gt;
Reviewed-by: Rik van Riel &lt;riel@redhat.com&gt;
Cc: Andrea Arcangeli &lt;aarcange@redhat.com&gt;
Cc: Dave Jones &lt;davej@redhat.com&gt;
Cc: Jan Kara &lt;jack@suse.cz&gt;
Cc: Andy Isaacson &lt;adi@hexapodia.org&gt;
Cc: Nai Xia &lt;nai.xia@gmail.com&gt;
Cc: Johannes Weiner &lt;jweiner@redhat.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>page_alloc: break early in check_for_regular_memory()</title>
<updated>2012-01-13T04:13:07Z</updated>
<author>
<name>Bob Liu</name>
<email>lliubbo@gmail.com</email>
</author>
<published>2012-01-13T01:19:07Z</published>
<link rel='alternate' type='text/html' href='https://git.amat.us/linux/commit/?id=d0048b0e59c1218d62bb4d014f34bbd7e7c0a214'/>
<id>urn:sha1:d0048b0e59c1218d62bb4d014f34bbd7e7c0a214</id>
<content type='text'>
If a zone below ZONE_NORMAL has present_pages, we can set the node
state to N_NORMAL_MEMORY; there is no need to loop to the end.
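
The resulting loop, roughly (a sketch):

  static void check_for_regular_memory(pg_data_t *pgdat)
  {
  #ifdef CONFIG_HIGHMEM
      enum zone_type zone_type;

      for (zone_type = 0; zone_type &lt;= ZONE_NORMAL; zone_type++) {
          struct zone *zone = &amp;pgdat-&gt;node_zones[zone_type];
          if (zone-&gt;present_pages) {
              node_set_state(zone_to_nid(zone), N_NORMAL_MEMORY);
              break;             /* one populated zone suffices */
          }
      }
  #endif
  }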

Signed-off-by: Bob Liu &lt;lliubbo@gmail.com&gt;
Acked-by: Michal Hocko &lt;mhocko@suse.cz&gt;
Cc: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Acked-by: KAMEZAWA Hiroyuki &lt;kamezawa.hiroyu@jp.fujitsu.com&gt;
Cc: KOSAKI Motohiro &lt;kosaki.motohiro@jp.fujitsu.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm: collect LRU list heads into struct lruvec</title>
<updated>2012-01-13T04:13:05Z</updated>
<author>
<name>Johannes Weiner</name>
<email>jweiner@redhat.com</email>
</author>
<published>2012-01-13T01:18:10Z</published>
<link rel='alternate' type='text/html' href='https://git.amat.us/linux/commit/?id=6290df545814990ca2663baf6e894669132d5f73'/>
<id>urn:sha1:6290df545814990ca2663baf6e894669132d5f73</id>
<content type='text'>
Having a unified structure with an LRU list set for both global zones
and per-memcg zones makes it possible to keep the code that deals with
LRU lists simple, since that code does not care about the container
itself.

Once the per-memcg LRU lists directly link struct pages, the isolation
function and all other list manipulations are shared between the memcg
case and the global LRU case.
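
The structure itself is tiny (a sketch following the patch):

  struct lruvec {
      struct list_head lists[NR_LRU_LISTS];
  };

It is embedded both in struct zone and in the per-memcg per-zone
structure, so list-manipulation code only ever sees a struct lruvec and
never the surrounding container.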

Signed-off-by: Johannes Weiner &lt;jweiner@redhat.com&gt;
Reviewed-by: KAMEZAWA Hiroyuki &lt;kamezawa.hiroyu@jp.fujitsu.com&gt;
Reviewed-by: Michal Hocko &lt;mhocko@suse.cz&gt;
Reviewed-by: Kirill A. Shutemov &lt;kirill@shutemov.name&gt;
Cc: Daisuke Nishimura &lt;nishimura@mxp.nes.nec.co.jp&gt;
Cc: Balbir Singh &lt;bsingharora@gmail.com&gt;
Cc: Ying Han &lt;yinghan@google.com&gt;
Cc: Greg Thelen &lt;gthelen@google.com&gt;
Cc: Michel Lespinasse &lt;walken@google.com&gt;
Cc: Rik van Riel &lt;riel@redhat.com&gt;
Cc: Minchan Kim &lt;minchan.kim@gmail.com&gt;
Cc: Christoph Hellwig &lt;hch@infradead.org&gt;
Cc: Hugh Dickins &lt;hughd@google.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm: page_alloc: generalize order handling in __free_pages_bootmem()</title>
<updated>2012-01-11T00:30:44Z</updated>
<author>
<name>Johannes Weiner</name>
<email>hannes@cmpxchg.org</email>
</author>
<published>2012-01-10T23:08:10Z</published>
<link rel='alternate' type='text/html' href='https://git.amat.us/linux/commit/?id=c3993076f842de3754360e5b998d6657a9d30303'/>
<id>urn:sha1:c3993076f842de3754360e5b998d6657a9d30303</id>
<content type='text'>
__free_pages_bootmem() used to special-case higher-order frees to save
individual page checking with free_pages_bulk().

Nowadays, both zero order and non-zero order frees use free_pages(), which
checks each individual page anyway, and so there is little point in making
the distinction anymore.  The higher-order loop will work just fine for
zero order pages.
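
The unified helper then looks roughly like this (a sketch of the
patched function; for order 0, nr_pages is 1, the loop body runs zero
times and the trailing statements handle the single page):

  void __meminit __free_pages_bootmem(struct page *page, unsigned int order)
  {
      unsigned int nr_pages = 1 &lt;&lt; order;
      struct page *p = page;
      unsigned int loop;

      prefetchw(p);
      for (loop = 0; loop &lt; nr_pages - 1; loop++, p++) {
          prefetchw(p + 1);
          __ClearPageReserved(p);
          set_page_count(p, 0);
      }
      __ClearPageReserved(p);
      set_page_count(p, 0);

      set_page_refcounted(page);
      __free_pages(page, order);
  }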

Signed-off-by: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Cc: Uwe Kleine-König &lt;u.kleine-koenig@pengutronix.de&gt;
Cc: Tejun Heo &lt;tj@kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm: fix off-by-two in __zone_watermark_ok()</title>
<updated>2012-01-11T00:30:44Z</updated>
<author>
<name>Michal Hocko</name>
<email>mhocko@suse.cz</email>
</author>
<published>2012-01-10T23:08:02Z</published>
<link rel='alternate' type='text/html' href='https://git.amat.us/linux/commit/?id=df0a6daa01fa3856c08f4274d4f21a8092caa480'/>
<id>urn:sha1:df0a6daa01fa3856c08f4274d4f21a8092caa480</id>
<content type='text'>
Commit 88f5acf88ae6 ("mm: page allocator: adjust the per-cpu counter
threshold when memory is low") changed how free_pages is calculated,
but it forgot that we used to do free_pages - ((1 &lt;&lt; order) - 1), so we
ended up with an off-by-two error when calculating free_pages.
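
The correction in __zone_watermark_ok() is a one-character change
(shown here as before/after):

  /* before (two pages too strict): */
  free_pages -= (1 &lt;&lt; order) + 1;
  /* after, restoring the original calculation: */
  free_pages -= (1 &lt;&lt; order) - 1;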

Reported-by: Wang Sheng-Hui &lt;shhuiw@gmail.com&gt;
Signed-off-by: Michal Hocko &lt;mhocko@suse.cz&gt;
Acked-by: Mel Gorman &lt;mgorman@suse.de&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm: try to distribute dirty pages fairly across zones</title>
<updated>2012-01-11T00:30:43Z</updated>
<author>
<name>Johannes Weiner</name>
<email>jweiner@redhat.com</email>
</author>
<published>2012-01-10T23:07:49Z</published>
<link rel='alternate' type='text/html' href='https://git.amat.us/linux/commit/?id=a756cf5908530e8b40bdf569eb48b40139e8d7fd'/>
<id>urn:sha1:a756cf5908530e8b40bdf569eb48b40139e8d7fd</id>
<content type='text'>
The maximum number of dirty pages that exist in the system at any time is
determined by a number of pages considered dirtyable and a user-configured
percentage of those, or an absolute number in bytes.

This number of dirtyable pages is the sum of memory provided by all the
zones in the system minus their lowmem reserves and high watermarks, so
that the system can retain a healthy number of free pages without having
to reclaim dirty pages.

But there is a flaw in that we have a zoned page allocator which does not
care about the global state but rather the state of individual memory
zones.  And right now there is nothing that prevents one zone from filling
up with dirty pages while other zones are spared, which frequently leads
to situations where kswapd, in order to restore the watermark of free
pages, does indeed have to write pages from that zone's LRU list.  This
can interfere so badly with IO from the flusher threads that major
filesystems (btrfs, xfs, ext4) mostly ignore write requests from reclaim
already, taking away the VM's only possibility to keep such a zone
balanced, aside from hoping the flushers will soon clean pages from that
zone.

Enter per-zone dirty limits.  They are to a zone's dirtyable memory what
the global limit is to the global amount of dirtyable memory, and try to
make sure that no single zone receives more than its fair share of the
globally allowed dirty pages in the first place.  As the number of pages
considered dirtyable excludes the zones' lowmem reserves and high
watermarks, the maximum number of dirty pages in a zone is such that the
zone can always be balanced without requiring page cleaning.
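
As a simplified sketch (the real helpers also handle vm_dirty_bytes and
per-task adjustments):

  static unsigned long zone_dirty_limit(struct zone *zone)
  {
      /* the ratio that governs the global limit, applied to this
       * zone's dirtyable memory */
      return zone_dirtyable_memory(zone) * vm_dirty_ratio / 100;
  }

  bool zone_dirty_ok(struct zone *zone)
  {
      unsigned long limit = zone_dirty_limit(zone);

      return zone_page_state(zone, NR_FILE_DIRTY) +
             zone_page_state(zone, NR_UNSTABLE_NFS) +
             zone_page_state(zone, NR_WRITEBACK) &lt;= limit;
  }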

As this is a placement decision in the page allocator and pages are
dirtied only after the allocation, this patch allows allocators to pass
__GFP_WRITE when they know in advance that the page will be written to and
become dirty soon.  The page allocator will then attempt to allocate from
the first zone of the zonelist - which on NUMA is determined by the task's
NUMA memory policy - that has not exceeded its dirty limit.
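
In get_page_from_freelist() terms, a simplified sketch:

  for_each_zone_zonelist_nodemask(zone, z, zonelist,
                                  high_zoneidx, nodemask) {
      /* Callers passing __GFP_WRITE promise the page will be dirtied
       * soon; skip zones already at their share of the dirty limit. */
      if ((gfp_mask &amp; __GFP_WRITE) &amp;&amp; !zone_dirty_ok(zone))
          continue;

      /* ...watermark checks and buffered_rmqueue() as before... */
  }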

At first glance, it would appear that the diversion to lower zones can
increase pressure on them, but this is not the case.  With a full high
zone, allocations will be diverted to lower zones eventually, so it is
more of a shift in timing of the lower zone allocations.  Workloads that
previously could fit their dirty pages completely in the higher zone may
be forced to allocate from lower zones, but the number of pages that
"spill over" is itself limited by the lower zones' dirty constraints,
and is thus unlikely to become a problem.

For now, the problem of unfair dirty page distribution remains for NUMA
configurations where the zones allowed for allocation are in sum not big
enough to trigger the global dirty limits, wake up the flusher threads and
remedy the situation.  Because of this, an allocation that could not
succeed on any of the considered zones is allowed to ignore the dirty
limits before going into direct reclaim or even failing the allocation,
until a future patch changes the global dirty throttling and flusher
thread activation so that they take individual zone states into account.

			Test results

15M DMA + 3246M DMA32 + 504M Normal = 3765M memory
40% dirty ratio
16G USB thumb drive
10 runs of dd if=/dev/zero of=disk/zeroes bs=32k count=$((10 &lt;&lt; 15))

		seconds			nr_vmscan_write
		        (stddev)	       min|     median|        max
xfs
vanilla:	 549.747( 3.492)	     0.000|      0.000|      0.000
patched:	 550.996( 3.802)	     0.000|      0.000|      0.000

fuse-ntfs
vanilla:	1183.094(53.178)	 54349.000|  59341.000|  65163.000
patched:	 558.049(17.914)	     0.000|      0.000|     43.000

btrfs
vanilla:	 573.679(14.015)	156657.000| 460178.000| 606926.000
patched:	 563.365(11.368)	     0.000|      0.000|   1362.000

ext4
vanilla:	 561.197(15.782)	     0.000|2725438.000|4143837.000
patched:	 568.806(17.496)	     0.000|      0.000|      0.000

Signed-off-by: Johannes Weiner &lt;jweiner@redhat.com&gt;
Reviewed-by: Minchan Kim &lt;minchan.kim@gmail.com&gt;
Acked-by: Mel Gorman &lt;mgorman@suse.de&gt;
Reviewed-by: Michal Hocko &lt;mhocko@suse.cz&gt;
Tested-by: Wu Fengguang &lt;fengguang.wu@intel.com&gt;
Cc: KAMEZAWA Hiroyuki &lt;kamezawa.hiroyu@jp.fujitsu.com&gt;
Cc: Christoph Hellwig &lt;hch@infradead.org&gt;
Cc: Dave Chinner &lt;david@fromorbit.com&gt;
Cc: Jan Kara &lt;jack@suse.cz&gt;
Cc: Shaohua Li &lt;shaohua.li@intel.com&gt;
Cc: Rik van Riel &lt;riel@redhat.com&gt;
Cc: Chris Mason &lt;chris.mason@oracle.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
</feed>
