<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux/mm, branch v3.13.4</title>
<subtitle>Linux kernel source tree</subtitle>
<id>https://git.amat.us/linux/atom/mm?h=v3.13.4</id>
<link rel='self' href='https://git.amat.us/linux/atom/mm?h=v3.13.4'/>
<link rel='alternate' type='text/html' href='https://git.amat.us/linux/'/>
<updated>2014-02-20T19:10:09Z</updated>
<entry>
<title>mm: __set_page_dirty_nobuffers() uses spin_lock_irqsave() instead of spin_lock_irq()</title>
<updated>2014-02-20T19:10:09Z</updated>
<author>
<name>KOSAKI Motohiro</name>
<email>kosaki.motohiro@jp.fujitsu.com</email>
</author>
<published>2014-02-06T20:04:24Z</published>
<link rel='alternate' type='text/html' href='https://git.amat.us/linux/commit/?id=27b4328e523b3de854229e6b505f94aa9708dde6'/>
<id>urn:sha1:27b4328e523b3de854229e6b505f94aa9708dde6</id>
<content type='text'>
commit a85d9df1ea1d23682a0ed1e100e6965006595d06 upstream.

During an aio stress test, we observed the following lockdep warning.  This
means AIO+numa_balancing can currently deadlock.

The problem is that aio_migratepage disables interrupts, but
__set_page_dirty_nobuffers unintentionally enables them again.

Generally, all helper functions should use spin_lock_irqsave() instead of
spin_lock_irq() because they know nothing about their callers.

   other info that might help us debug this:
    Possible unsafe locking scenario:

          CPU0
          ----
     lock(&amp;(&amp;ctx-&gt;completion_lock)-&gt;rlock);
     &lt;Interrupt&gt;
       lock(&amp;(&amp;ctx-&gt;completion_lock)-&gt;rlock);

    *** DEADLOCK ***

      dump_stack+0x19/0x1b
      print_usage_bug+0x1f7/0x208
      mark_lock+0x21d/0x2a0
      mark_held_locks+0xb9/0x140
      trace_hardirqs_on_caller+0x105/0x1d0
      trace_hardirqs_on+0xd/0x10
      _raw_spin_unlock_irq+0x2c/0x50
      __set_page_dirty_nobuffers+0x8c/0xf0
      migrate_page_copy+0x434/0x540
      aio_migratepage+0xb1/0x140
      move_to_new_page+0x7d/0x230
      migrate_pages+0x5e5/0x700
      migrate_misplaced_page+0xbc/0xf0
      do_numa_page+0x102/0x190
      handle_pte_fault+0x241/0x970
      handle_mm_fault+0x265/0x370
      __do_page_fault+0x172/0x5a0
      do_page_fault+0x1a/0x70
      page_fault+0x28/0x30
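
The general rule above can be illustrated with a sketch of the two
locking patterns (an illustration, not the actual patch):

  #include &lt;linux/spinlock.h&gt;

  /* A helper that may be called with interrupts already disabled must
   * save and restore the caller's interrupt state rather than blindly
   * re-enabling it. */
  static void helper_unsafe(spinlock_t *lock)
  {
          spin_lock_irq(lock);            /* disables interrupts */
          /* ... critical section ... */
          spin_unlock_irq(lock);          /* unconditionally re-enables them,
                                             even if the caller had them off */
  }

  static void helper_safe(spinlock_t *lock)
  {
          unsigned long flags;

          spin_lock_irqsave(lock, flags); /* remembers the caller's IRQ state */
          /* ... critical section ... */
          spin_unlock_irqrestore(lock, flags); /* restores it exactly */
  }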

Signed-off-by: KOSAKI Motohiro &lt;kosaki.motohiro@jp.fujitsu.com&gt;
Cc: Larry Woodman &lt;lwoodman@redhat.com&gt;
Cc: Rik van Riel &lt;riel@redhat.com&gt;
Cc: Johannes Weiner &lt;jweiner@redhat.com&gt;
Acked-by: David Rientjes &lt;rientjes@google.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
</entry>
<entry>
<title>mm/swap: fix race on swap_info reuse between swapoff and swapon</title>
<updated>2014-02-20T19:10:09Z</updated>
<author>
<name>Weijie Yang</name>
<email>weijie.yang@samsung.com</email>
</author>
<published>2014-02-06T20:04:23Z</published>
<link rel='alternate' type='text/html' href='https://git.amat.us/linux/commit/?id=ef0a37df2dc7570681f12a9968f7757b55d2f936'/>
<id>urn:sha1:ef0a37df2dc7570681f12a9968f7757b55d2f936</id>
<content type='text'>
commit f893ab41e4dae2fe8991faf5d86d029068d1ef3a upstream.

swapoff clears swap_info's SWP_USED flag prematurely and frees its
resources after that.  A concurrent swapon can then reuse this swap_info
while its previous resources have not been completely cleared.

These late-freed resources are:
 - p-&gt;percpu_cluster
 - swap_cgroup_ctrl[type]
 - block_device setting
 - inode-&gt;i_flags &amp;= ~S_SWAPFILE

This patch clears the SWP_USED flag only after all of its resources have
been freed, so that swapon can safely reuse this swap_info via
alloc_swap_info().
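
A rough sketch of the resulting order at the end of sys_swapoff()
(simplified for illustration; see the diff for the exact placement):

  free_percpu(p-&gt;percpu_cluster);
  p-&gt;percpu_cluster = NULL;
  swap_cgroup_swapoff(type);         /* releases swap_cgroup_ctrl[type] */
  ...
  inode-&gt;i_flags &amp;= ~S_SWAPFILE;
  ...
  /* only now make this swap_info visible to alloc_swap_info() again */
  spin_lock(&amp;swap_lock);
  p-&gt;flags = 0;                      /* clears SWP_USED */
  spin_unlock(&amp;swap_lock);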

[akpm@linux-foundation.org: tidy up code comment]
Signed-off-by: Weijie Yang &lt;weijie.yang@samsung.com&gt;
Acked-by: Hugh Dickins &lt;hughd@google.com&gt;
Cc: Krzysztof Kozlowski &lt;k.kozlowski@samsung.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
</entry>
<entry>
<title>powerpc/thp: Fix crash on mremap</title>
<updated>2014-02-13T21:55:44Z</updated>
<author>
<name>Aneesh Kumar K.V</name>
<email>aneesh.kumar@linux.vnet.ibm.com</email>
</author>
<published>2014-01-13T06:04:24Z</published>
<link rel='alternate' type='text/html' href='https://git.amat.us/linux/commit/?id=cba4fbae241f7afb60d24529d2f7d24486ce5109'/>
<id>urn:sha1:cba4fbae241f7afb60d24529d2f7d24486ce5109</id>
<content type='text'>
commit b3084f4db3aeb991c507ca774337c7e7893ed04f upstream.

This patch fixes the crash below:

NIP [c00000000004cee4] .__hash_page_thp+0x2a4/0x440
LR [c0000000000439ac] .hash_page+0x18c/0x5e0
...
Call Trace:
[c000000736103c40] [00001ffffb000000] 0x1ffffb000000(unreliable)
[437908.479693] [c000000736103d50] [c0000000000439ac] .hash_page+0x18c/0x5e0
[437908.479699] [c000000736103e30] [c00000000000924c] .do_hash_page+0x4c/0x58

On ppc64 we use the pgtable for storing the hpte slot information and
store the address of the pgtable at a constant offset (PTRS_PER_PMD)
from the pmd.  On mremap, when we switch the pmd, we need to withdraw
and deposit the pgtable again, so that we find the pgtable at the
PTRS_PER_PMD offset from the new pmd.

We also want to move the withdraw and deposit before the set_pmd so
that, when a page fault finds the pmd to be trans huge, we can be sure
that the pgtable can be located at that offset.
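
A simplified sketch of that ordering in move_huge_pmd() (illustrative,
using the generic THP helpers; not the literal diff):

  pmd_t pmd;
  pgtable_t pgtable;

  pmd = pmdp_get_and_clear(mm, old_addr, old_pmd);
  /* withdraw/deposit before set_pmd_at() so that a page fault which sees
     the new trans huge pmd always finds the pgtable at the expected
     PTRS_PER_PMD offset */
  pgtable = pgtable_trans_huge_withdraw(mm, old_pmd);
  pgtable_trans_huge_deposit(mm, new_pmd, pgtable);
  set_pmd_at(mm, new_addr, new_pmd, pmd);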

Signed-off-by: Aneesh Kumar K.V &lt;aneesh.kumar@linux.vnet.ibm.com&gt;
Acked-by: Kirill A. Shutemov &lt;kirill.shutemov@linux.intel.com&gt;
Signed-off-by: Benjamin Herrenschmidt &lt;benh@kernel.crashing.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
</entry>
<entry>
<title>mm, oom: base root bonus on current usage</title>
<updated>2014-02-13T21:55:37Z</updated>
<author>
<name>David Rientjes</name>
<email>rientjes@google.com</email>
</author>
<published>2014-01-30T23:46:11Z</published>
<link rel='alternate' type='text/html' href='https://git.amat.us/linux/commit/?id=23e8b4acf59dcca7782fa43f9528f811b5cf5746'/>
<id>urn:sha1:23e8b4acf59dcca7782fa43f9528f811b5cf5746</id>
<content type='text'>
commit 778c14affaf94a9e4953179d3e13a544ccce7707 upstream.

A bonus of 3% of system memory is sometimes excessive in comparison to
other processes.

With commit a63d83f427fb ("oom: badness heuristic rewrite"), the OOM
killer tries to avoid killing privileged tasks by subtracting 3% of
overall memory (system or cgroup) from their per-task consumption.  But
as a result, all root tasks that consume less than 3% of overall memory
are considered equal, and so it only takes 33+ privileged tasks pushing
the system out of memory for the OOM killer to do something stupid and
kill dhclient or other root-owned processes.  For example, on a 32G
machine it can't tell the difference between the 1M agetty and the 10G
fork bomb member.

The changelog describes this 3% boost as the equivalent to the global
overcommit limit being 3% higher for privileged tasks, but this is not
the same as discounting 3% of overall memory from _every privileged task
individually_ during OOM selection.

Replace the 3% of system memory bonus with a 3% of current memory usage
bonus.
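
Sketched against oom_badness() (simplified, not the literal diff):

  /* before: a flat discount worth 3% of overall memory;
   * after: a discount proportional to the task's own badness score */
  if (has_capability_noaudit(p, CAP_SYS_ADMIN))
          points -= (points * 3) / 100;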

By giving root tasks a bonus that is proportional to their actual size,
they remain comparable even when relatively small.  In the example
above, the OOM killer will discount the 1M agetty's 256 badness points
down to 179, and the 10G fork bomb's 262144 points down to 183500 points
and make the right choice, instead of discounting both to 0 and killing
agetty because it's first in the task list.

Signed-off-by: David Rientjes &lt;rientjes@google.com&gt;
Reported-by: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Acked-by: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Cc: Michal Hocko &lt;mhocko@suse.cz&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
</entry>
<entry>
<title>mm: numa: initialise numa balancing after jump label initialisation</title>
<updated>2014-02-13T21:55:37Z</updated>
<author>
<name>Mel Gorman</name>
<email>mgorman@suse.de</email>
</author>
<published>2014-01-29T22:05:42Z</published>
<link rel='alternate' type='text/html' href='https://git.amat.us/linux/commit/?id=496a762865aa021f8f775b2644d7acae731adc35'/>
<id>urn:sha1:496a762865aa021f8f775b2644d7acae731adc35</id>
<content type='text'>
commit c297663c0b3930491a3cb2aba4b6e5a7159c3503 upstream.

The command line parsing takes place before jump labels are initialised,
which generates a warning if numa_balancing= is specified and
CONFIG_JUMP_LABEL is set.

On older kernels before commit c4b2c0c5f647 ("static_key: WARN on usage
before jump_label_init was called") the kernel would have crashed.  This
patch enables automatic numa balancing later in the initialisation
process if numa_balancing= is specified.
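
The shape of the change, as an illustrative sketch (the helper and
variable names below are hypothetical; only set_numabalancing_state() is
an existing interface):

  static int __initdata numabalancing_override;      /* hypothetical */

  static int __init setup_numabalancing(char *str)   /* hypothetical */
  {
          if (!strcmp(str, "enable"))
                  numabalancing_override = 1;
          else if (!strcmp(str, "disable"))
                  numabalancing_override = -1;
          return 1;
  }
  __setup("numa_balancing=", setup_numabalancing);

  /* called later in boot, once jump labels are initialised */
  void __init late_numa_balancing_init(void)         /* hypothetical hook */
  {
          if (numabalancing_override)
                  set_numabalancing_state(numabalancing_override == 1);
  }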

Signed-off-by: Mel Gorman &lt;mgorman@suse.de&gt;
Acked-by: Rik van Riel &lt;riel@redhat.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
</entry>
<entry>
<title>numa: add a sysctl for numa_balancing</title>
<updated>2014-02-13T21:55:36Z</updated>
<author>
<name>Andi Kleen</name>
<email>ak@linux.intel.com</email>
</author>
<published>2014-01-23T23:53:13Z</published>
<link rel='alternate' type='text/html' href='https://git.amat.us/linux/commit/?id=03876923ca84ba27f66a6e941ba925e94434ca5c'/>
<id>urn:sha1:03876923ca84ba27f66a6e941ba925e94434ca5c</id>
<content type='text'>
commit 54a43d54988a3731d644fdeb7a1d6f46b4ac64c7 upstream.

Add a working sysctl to enable/disable automatic numa memory balancing
at runtime.

This allows us to track down performance problems with this feature and
is generally a good idea.

This was possible earlier through debugfs, but only with special
debugging options set.  Also fix the boot message.
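
For reference, the knob can be flipped from userspace once it is exposed
(assumed here to live at /proc/sys/kernel/numa_balancing):

  #include &lt;stdio.h&gt;

  int main(void)
  {
          FILE *f = fopen("/proc/sys/kernel/numa_balancing", "w");

          if (!f)
                  return 1;
          fputs("0\n", f);        /* 0 = disable, 1 = enable */
          return fclose(f) ? 1 : 0;
  }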

[akpm@linux-foundation.org: s/sched_numa_balancing/sysctl_numa_balancing/]
Signed-off-by: Andi Kleen &lt;ak@linux.intel.com&gt;
Acked-by: Mel Gorman &lt;mgorman@suse.de&gt;
Cc: Ingo Molnar &lt;mingo@elte.hu&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
</entry>
<entry>
<title>slub: Fix calculation of cpu slabs</title>
<updated>2014-02-13T21:55:30Z</updated>
<author>
<name>Li Zefan</name>
<email>lizefan@huawei.com</email>
</author>
<published>2013-09-10T03:43:37Z</published>
<link rel='alternate' type='text/html' href='https://git.amat.us/linux/commit/?id=3ae84de8c0a75feb7cdb57f9bd133e8f50b23565'/>
<id>urn:sha1:3ae84de8c0a75feb7cdb57f9bd133e8f50b23565</id>
<content type='text'>
commit 8afb1474db4701d1ab80cd8251137a3260e6913e upstream.

  /sys/kernel/slab/:t-0000048 # cat cpu_slabs
  231 N0=16 N1=215
  /sys/kernel/slab/:t-0000048 # cat slabs
  145 N0=36 N1=109

As shown, the number of slabs is smaller than the number of cpu slabs.

The bug was introduced by commit 49e2258586b423684f03c278149ab46d8f8b6700
("slub: per cpu cache for partial pages").

We should use page-&gt;pages instead of page-&gt;pobjects when calculating
the number of cpu partial slabs. This also fixes the mapping of slabs
and nodes.

As there's no variable storing the number of total/active objects in
cpu partial slabs, and we don't have user interfaces requiring those
statistics, I just add WARN_ON for those cases.
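
A simplified sketch of the corrected accounting in show_slab_objects()
(not the literal diff):

  struct page *page = ACCESS_ONCE(c-&gt;partial);
  unsigned long x = 0;

  if (page) {
          int node = page_to_nid(page);

          if (flags &amp; (SO_TOTAL | SO_OBJECTS))
                  WARN_ON_ONCE(1);        /* no per-object count is kept */
          else
                  x = page-&gt;pages;        /* count slabs, not pobjects */
          total += x;
          nodes[node] += x;
  }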

Acked-by: Christoph Lameter &lt;cl@linux.com&gt;
Reviewed-by: Wanpeng Li &lt;liwanp@linux.vnet.ibm.com&gt;
Signed-off-by: Li Zefan &lt;lizefan@huawei.com&gt;
Signed-off-by: Pekka Enberg &lt;penberg@kernel.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
</entry>
<entry>
<title>mm: ignore VM_SOFTDIRTY on VMA merging</title>
<updated>2014-02-13T21:55:29Z</updated>
<author>
<name>Cyrill Gorcunov</name>
<email>gorcunov@gmail.com</email>
</author>
<published>2014-01-23T23:53:42Z</published>
<link rel='alternate' type='text/html' href='https://git.amat.us/linux/commit/?id=1140f8ce36b1cb82091eefc34d2fea953df30d25'/>
<id>urn:sha1:1140f8ce36b1cb82091eefc34d2fea953df30d25</id>
<content type='text'>
commit 34228d473efe764d4db7c0536375f0c993e6e06a upstream.

The VM_SOFTDIRTY bit affects the vma merge routine: if two VMAs have all
bits in vm_flags matching except the dirty bit, the kernel can no longer
merge them and is forced to generate new VMAs instead.

This may eventually lead to a situation where a userspace application
reaches the vm.max_map_count limit and, in the worst case, crashes:

 | (gimp:11768): GLib-ERROR **: gmem.c:110: failed to allocate 4096 bytes
 |
 | (file-tiff-load:12038): LibGimpBase-WARNING **: file-tiff-load: gimp_wire_read(): error
 | xinit: connection to X server lost
 |
 | waiting for X server to shut down
 | /usr/lib64/gimp/2.0/plug-ins/file-tiff-load terminated: Hangup
 | /usr/lib64/gimp/2.0/plug-ins/script-fu terminated: Hangup
 | /usr/lib64/gimp/2.0/plug-ins/script-fu terminated: Hangup

  https://bugzilla.kernel.org/show_bug.cgi?id=67651
  https://bugzilla.gnome.org/show_bug.cgi?id=719619#c0

The initial problem came from a missing VM_SOFTDIRTY in the do_brk()
routine, but even if we set up VM_SOFTDIRTY there, there is still a way
to prevent VMAs from merging: one can call

 | echo 4 &gt; /proc/$PID/clear_refs

and clear VM_SOFTDIRTY on all VMAs present in the memory map; a new
do_brk() will then try to extend the old VMA, find that the dirty bit
doesn't match, and generate a new VMA instead.

As discussed with Pavel, the right approach is to ignore the
VM_SOFTDIRTY bit when trying to merge VMAs and, if the merge succeeds,
mark the extended VMA with the dirty bit where needed.
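
The key test, sketched from is_mergeable_vma() in mm/mmap.c (simplified):

  /*
   * Compare all flags except VM_SOFTDIRTY; the caller marks the merged
   * VMA with the dirty bit where needed.
   */
  if ((vma-&gt;vm_flags ^ vm_flags) &amp; ~VM_SOFTDIRTY)
          return 0;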

Signed-off-by: Cyrill Gorcunov &lt;gorcunov@openvz.org&gt;
Reported-by: Bastian Hougaard &lt;gnome@rvzt.net&gt;
Reported-by: Mel Gorman &lt;mgorman@suse.de&gt;
Cc: Pavel Emelyanov &lt;xemul@parallels.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
</entry>
<entry>
<title>memcg: fix css reference leak and endless loop in mem_cgroup_iter</title>
<updated>2014-02-13T21:55:29Z</updated>
<author>
<name>Michal Hocko</name>
<email>mhocko@suse.cz</email>
</author>
<published>2014-01-23T23:53:37Z</published>
<link rel='alternate' type='text/html' href='https://git.amat.us/linux/commit/?id=fd074b7b40fa4e99c92ee26104272f3ba1eee537'/>
<id>urn:sha1:fd074b7b40fa4e99c92ee26104272f3ba1eee537</id>
<content type='text'>
commit 0eef615665ede1e0d603ea9ecca88c1da6f02234 upstream.

Commit 19f39402864e ("memcg: simplify mem_cgroup_iter") has reorganized
mem_cgroup_iter code in order to simplify it.  A part of that change was
dropping an optimization which didn't call css_tryget on the root of the
walked tree.  The patch however didn't change the css_put part in
mem_cgroup_iter which excludes root.

This wasn't an issue at the time because __mem_cgroup_iter_next bailed
out for root early without taking a reference as cgroup iterators
(css_next_descendant_pre) didn't visit root themselves.

Nevertheless, cgroup iterators were reworked to visit root by commit
bd8815a6d802 ("cgroup: make css_for_each_descendant() and friends
include the origin css in the iteration") when the root bypass was
dropped in __mem_cgroup_iter_next.  This means that css_put is not
called for root, and so the css, along with the mem_cgroup and other
cgroup-internal objects tied to the css lifetime, is never freed.

Fix the issue by reintroducing the root check in __mem_cgroup_iter_next
and not taking a css reference for it.
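
Sketched (simplified) against __mem_cgroup_iter_next():

  /* the walk root is pinned by the caller, so never take a reference
     on it */
  if (css == &amp;root-&gt;css || css_tryget(css))
          memcg = mem_cgroup_from_css(css);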

This reference-counting magic also protects us from another issue, an
endless loop reported by Hugh Dickins when reclaim races with root
removal and the css_tryget called internally by the iterator fails.
There would then be no other nodes to visit, so __mem_cgroup_iter_next
would return NULL and mem_cgroup_iter would interpret that as "start
looping from root again", and so loop forever internally.

Signed-off-by: Michal Hocko &lt;mhocko@suse.cz&gt;
Reported-by: Hugh Dickins &lt;hughd@google.com&gt;
Tested-by: Hugh Dickins &lt;hughd@google.com&gt;
Cc: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Cc: Greg Thelen &lt;gthelen@google.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
</entry>
<entry>
<title>memcg: fix endless loop caused by mem_cgroup_iter</title>
<updated>2014-02-13T21:55:29Z</updated>
<author>
<name>Michal Hocko</name>
<email>mhocko@suse.cz</email>
</author>
<published>2014-01-23T23:53:35Z</published>
<link rel='alternate' type='text/html' href='https://git.amat.us/linux/commit/?id=e6edc509f574f4a7b3dc37392c8afa70d851002d'/>
<id>urn:sha1:e6edc509f574f4a7b3dc37392c8afa70d851002d</id>
<content type='text'>
commit ecc736fc3c71c411a9d201d8588c9e7e049e5d8c upstream.

Hugh has reported an endless loop when the hardlimit reclaim sees the
same group all the time.  This might happen when the reclaim races with
the memcg removal.

shrink_zone
                                                [rmdir root]
  mem_cgroup_iter(root, NULL, reclaim)
    // prev = NULL
    rcu_read_lock()
    mem_cgroup_iter_load
      last_visited = iter-&gt;last_visited   // gets root || NULL
      css_tryget(last_visited)            // failed
      last_visited = NULL                 [1]
    memcg = root = __mem_cgroup_iter_next(root, NULL)
    mem_cgroup_iter_update
      iter-&gt;last_visited = root;
    reclaim-&gt;generation = iter-&gt;generation

 mem_cgroup_iter(root, root, reclaim)
   // prev = root
   rcu_read_lock
    mem_cgroup_iter_load
      last_visited = iter-&gt;last_visited   // gets root
      css_tryget(last_visited)            // failed
    [1]

The issue seems to have been introduced by commit 5f5781619718 ("memcg:
relax memcg iter caching"), which replaced the unconditional
css_get/css_put with css_tryget/css_put for the cached iterator.

This patch fixes the issue by skipping css_tryget on the root of the
tree walk in mem_cgroup_iter_load and, symmetrically, by not releasing
it in mem_cgroup_iter_update.
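
Sketched (simplified) as the two sides of that change:

  /* mem_cgroup_iter_load(): the walk root is pinned by the caller */
  if (last_visited &amp;&amp; last_visited != root &amp;&amp;
      !css_tryget(&amp;last_visited-&gt;css))
          last_visited = NULL;

  /* mem_cgroup_iter_update(): symmetrically, never put the root */
  if (last_visited &amp;&amp; last_visited != root)
          css_put(&amp;last_visited-&gt;css);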

Signed-off-by: Michal Hocko &lt;mhocko@suse.cz&gt;
Reported-by: Hugh Dickins &lt;hughd@google.com&gt;
Tested-by: Hugh Dickins &lt;hughd@google.com&gt;
Cc: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Cc: Greg Thelen &lt;gthelen@google.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
</entry>
</feed>
