<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux/mm, branch v3.12</title>
<subtitle>Linux kernel source tree</subtitle>
<id>https://git.amat.us/linux/atom/mm?h=v3.12</id>
<link rel='self' href='https://git.amat.us/linux/atom/mm?h=v3.12'/>
<link rel='alternate' type='text/html' href='https://git.amat.us/linux/'/>
<updated>2013-11-01T19:22:28Z</updated>
<entry>
<title>memcg: remove incorrect underflow check</title>
<updated>2013-11-01T19:22:28Z</updated>
<author>
<name>Greg Thelen</name>
<email>gthelen@google.com</email>
</author>
<published>2013-11-01T19:16:59Z</published>
<link rel='alternate' type='text/html' href='https://git.amat.us/linux/commit/?id=6920a1bd037374a632d585de127b6f945199dcb8'/>
<id>urn:sha1:6920a1bd037374a632d585de127b6f945199dcb8</id>
<content type='text'>
When a memcg is deleted mem_cgroup_reparent_charges() moves charged
memory to the parent memcg.  As of v3.11-9444-g3ea67d0 "memcg: add per
cgroup writeback pages accounting" there's bad pointer read.  The goal
was to check for counter underflow.  The counter is a per cpu counter
and there are two problems with the code:

 (1) per cpu access function isn't used, instead a naked pointer is used
     which easily causes oops.
 (2) the check doesn't sum all cpus

Test:
  $ cd /sys/fs/cgroup/memory
  $ mkdir x
  $ echo 3 &gt; /proc/sys/vm/drop_caches
  $ (echo $BASHPID &gt;&gt; x/tasks &amp;&amp; exec cat) &amp;
  [1] 7154
  $ grep ^mapped x/memory.stat
  mapped_file 53248
  $ echo 7154 &gt; tasks
  $ rmdir x
  &lt;OOPS&gt;

The fix is to remove the check.  It's currently dangerous and isn't
worth fixing it to use something expensive, such as
percpu_counter_sum(), for each reparented page.  __this_cpu_read() isn't
enough to fix this because there's no guarantees of the current cpus
count.  The only guarantees is that the sum of all per-cpu counter is &gt;=
nr_pages.

Fixes: 3ea67d06e467 ("memcg: add per cgroup writeback pages accounting")
Reported-and-tested-by: Flavio Leitner &lt;fbl@redhat.com&gt;
Signed-off-by: Greg Thelen &lt;gthelen@google.com&gt;
Reviewed-by: Sha Zhengju &lt;handai.szj@taobao.com&gt;
Acked-by: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Signed-off-by: Hugh Dickins &lt;hughd@google.com&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>Merge branch 'akpm' (fixes from Andrew Morton)</title>
<updated>2013-10-31T23:58:23Z</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2013-10-31T23:58:23Z</published>
<link rel='alternate' type='text/html' href='https://git.amat.us/linux/commit/?id=4f794ee8c40119366e0be45f9d49e7ff1a89b79b'/>
<id>urn:sha1:4f794ee8c40119366e0be45f9d49e7ff1a89b79b</id>
<content type='text'>
Merge four more fixes from Andrew Morton.

* emailed patches from Andrew Morton &lt;akpm@linux-foundation.org&gt;:
  lib/scatterlist.c: don't flush_kernel_dcache_page on slab page
  mm: memcg: fix test for child groups
  mm: memcg: lockdep annotation for memcg OOM lock
  mm: memcg: use proper memcg in limit bypass
</content>
</entry>
<entry>
<title>mm: memcg: fix test for child groups</title>
<updated>2013-10-31T23:58:13Z</updated>
<author>
<name>Johannes Weiner</name>
<email>hannes@cmpxchg.org</email>
</author>
<published>2013-10-31T23:34:15Z</published>
<link rel='alternate' type='text/html' href='https://git.amat.us/linux/commit/?id=696ac172fffa653dca401bb2b0cad91cf2ce453f'/>
<id>urn:sha1:696ac172fffa653dca401bb2b0cad91cf2ce453f</id>
<content type='text'>
When memcg code needs to know whether any given memcg has children, it
uses the cgroup child iteration primitives and returns true/false
depending on whether the iteration loop is executed at least once or
not.

Because a cgroup's list of children is RCU protected, these primitives
require the RCU read-lock to be held, which is not the case for all
memcg callers.  This results in the following splat when e.g.  enabling
hierarchy mode:

  WARNING: CPU: 3 PID: 1 at kernel/cgroup.c:3043 css_next_child+0xa3/0x160()
  CPU: 3 PID: 1 Comm: systemd Not tainted 3.12.0-rc5-00117-g83f11a9-dirty #18
  Hardware name: LENOVO 3680B56/3680B56, BIOS 6QET69WW (1.39 ) 04/26/2012
  Call Trace:
    dump_stack+0x54/0x74
    warn_slowpath_common+0x78/0xa0
    warn_slowpath_null+0x1a/0x20
    css_next_child+0xa3/0x160
    mem_cgroup_hierarchy_write+0x5b/0xa0
    cgroup_file_write+0x108/0x2a0
    vfs_write+0xbd/0x1e0
    SyS_write+0x4c/0xa0
    system_call_fastpath+0x16/0x1b

In the memcg case, we only care about children when we are attempting to
modify inheritable attributes interactively.  Racing with deletion could
mean a spurious -EBUSY, no problem.  Racing with addition is handled
just fine as well through the memcg_create_mutex: if the child group is
not on the list after the mutex is acquired, it won't be initialized
from the parent's attributes until after the unlock.

Signed-off-by: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Acked-by: Michal Hocko &lt;mhocko@suse.cz&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm: memcg: lockdep annotation for memcg OOM lock</title>
<updated>2013-10-31T23:58:13Z</updated>
<author>
<name>Johannes Weiner</name>
<email>hannes@cmpxchg.org</email>
</author>
<published>2013-10-31T23:34:14Z</published>
<link rel='alternate' type='text/html' href='https://git.amat.us/linux/commit/?id=0056f4e66a1b8f00245248877e80386af36af14c'/>
<id>urn:sha1:0056f4e66a1b8f00245248877e80386af36af14c</id>
<content type='text'>
The memcg OOM lock is a mutex-type lock that is open-coded due to
memcg's special needs.  Add annotations for lockdep coverage.

Signed-off-by: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Cc: Michal Hocko &lt;mhocko@suse.cz&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm: memcg: use proper memcg in limit bypass</title>
<updated>2013-10-31T23:58:13Z</updated>
<author>
<name>Johannes Weiner</name>
<email>hannes@cmpxchg.org</email>
</author>
<published>2013-10-31T23:34:13Z</published>
<link rel='alternate' type='text/html' href='https://git.amat.us/linux/commit/?id=3168ecbe1c04ec3feb7cb42388a17d7f047fe1a2'/>
<id>urn:sha1:3168ecbe1c04ec3feb7cb42388a17d7f047fe1a2</id>
<content type='text'>
Commit 84235de394d9 ("fs: buffer: move allocation failure loop into the
allocator") allowed __GFP_NOFAIL allocations to bypass the limit if they
fail to reclaim enough memory for the charge.  But because the main test
case was on a 3.2-based system, the patch missed the fact that on newer
kernels the charge function needs to return root_mem_cgroup when
bypassing the limit, and not NULL.  This will corrupt whatever memory is
at NULL + percpu pointer offset.  Fix this quickly before problems are
reported.

Signed-off-by: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Acked-by: Michal Hocko &lt;mhocko@suse.cz&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>Merge branch 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip</title>
<updated>2013-10-31T22:21:26Z</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2013-10-31T22:21:26Z</published>
<link rel='alternate' type='text/html' href='https://git.amat.us/linux/commit/?id=52469b4fcd4fc433ffc78cec4cf94368e9052890'/>
<id>urn:sha1:52469b4fcd4fc433ffc78cec4cf94368e9052890</id>
<content type='text'>
Pull NUMA balancing memory corruption fixes from Ingo Molnar:
 "So these fixes are definitely not something I'd like to sit on, but as
  I said to Mel at the KS the timing is quite tight, with Linus planning
  v3.12-final within a week.

  Fedora-19 is affected:

   comet:~&gt; grep NUMA_BALANCING /boot/config-3.11.3-201.fc19.x86_64

   CONFIG_ARCH_SUPPORTS_NUMA_BALANCING=y
   CONFIG_NUMA_BALANCING_DEFAULT_ENABLED=y
   CONFIG_NUMA_BALANCING=y

  AFAICS Ubuntu will be affected as well, once it updates the kernel:

   hubble:~&gt; grep NUMA_BALANCING /boot/config-3.8.0-32-generic

   CONFIG_ARCH_SUPPORTS_NUMA_BALANCING=y
   CONFIG_NUMA_BALANCING_DEFAULT_ENABLED=y
   CONFIG_NUMA_BALANCING=y

  These 6 commits are a minimalized set of cherry-picks needed to fix
  the memory corruption bugs.  All commits are fixes, except "mm: numa:
  Sanitize task_numa_fault() callsites" which is a cleanup that made two
  followup fixes simpler.

  I've done targeted testing with just this SHA1 to try to make sure
  there are no cherry-picking artifacts.  The original non-cherry-picked
  set of fixes were exposed to linux-next for a couple of weeks"

* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  mm: Account for a THP NUMA hinting update as one PTE update
  mm: Close races between THP migration and PMD numa clearing
  mm: numa: Sanitize task_numa_fault() callsites
  mm: Prevent parallel splits during THP migration
  mm: Wait for THP migrations to complete during NUMA hinting faults
  mm: numa: Do not account for a hinting fault if we raced
</content>
</entry>
<entry>
<title>Merge branch 'akpm' (fixes from Andrew Morton)</title>
<updated>2013-10-30T21:27:10Z</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2013-10-30T21:27:10Z</published>
<link rel='alternate' type='text/html' href='https://git.amat.us/linux/commit/?id=12aee278b50c4a94a93fa0b4d201ae35d792c696'/>
<id>urn:sha1:12aee278b50c4a94a93fa0b4d201ae35d792c696</id>
<content type='text'>
Merge three fixes from Andrew Morton.

* emailed patches from Andrew Morton &lt;akpm@linux-foundation.org&gt;:
  memcg: use __this_cpu_sub() to dec stats to avoid incorrect subtrahend casting
  percpu: fix this_cpu_sub() subtrahend casting for unsigneds
  mm/pagewalk.c: fix walk_page_range() access of wrong PTEs
</content>
</entry>
<entry>
<title>memcg: use __this_cpu_sub() to dec stats to avoid incorrect subtrahend casting</title>
<updated>2013-10-30T21:27:03Z</updated>
<author>
<name>Greg Thelen</name>
<email>gthelen@google.com</email>
</author>
<published>2013-10-30T20:56:21Z</published>
<link rel='alternate' type='text/html' href='https://git.amat.us/linux/commit/?id=5e8cfc3c75b3e43497389896c0ecda62fc311ce9'/>
<id>urn:sha1:5e8cfc3c75b3e43497389896c0ecda62fc311ce9</id>
<content type='text'>
As of commit 3ea67d06e467 ("memcg: add per cgroup writeback pages
accounting") memcg counter errors are possible when moving charged
memory to a different memcg.  Charge movement occurs when processing
writes to memory.force_empty, moving tasks to a memcg with
memcg.move_charge_at_immigrate=1, or memcg deletion.

An example showing error after memory.force_empty:

  $ cd /sys/fs/cgroup/memory
  $ mkdir x
  $ rm /data/tmp/file
  $ (echo $BASHPID &gt;&gt; x/tasks &amp;&amp; exec mmap_writer /data/tmp/file 1M) &amp;
  [1] 13600
  $ grep ^mapped x/memory.stat
  mapped_file 1048576
  $ echo 13600 &gt; tasks
  $ echo 1 &gt; x/memory.force_empty
  $ grep ^mapped x/memory.stat
  mapped_file 4503599627370496

mapped_file should end with 0.
  4503599627370496 == 0x10,0000,0000,0000 == 0x100,0000,0000 pages
  1048576          == 0x10,0000           == 0x100 pages

This issue only affects the source memcg on 64 bit machines; the
destination memcg counters are correct.  So the rmdir case is not too
important because such counters are soon disappearing with the entire
memcg.  But the memcg.force_empty and memory.move_charge_at_immigrate=1
cases are larger problems as the bogus counters are visible for the
(possibly long) remaining life of the source memcg.

The problem is due to memcg use of __this_cpu_from(.., -nr_pages), which
is subtly wrong because it subtracts the unsigned int nr_pages (either
-1 or -512 for THP) from a signed long percpu counter.  When
nr_pages=-1, -nr_pages=0xffffffff.  On 64 bit machines stat-&gt;count[idx]
is signed 64 bit.  So memcg's attempt to simply decrement a count (e.g.
from 1 to 0) boils down to:

  long count = 1
  unsigned int nr_pages = 1
  count += -nr_pages  /* -nr_pages == 0xffff,ffff */
  count is now 0x1,0000,0000 instead of 0

The fix is to subtract the unsigned page count rather than adding its
negation.  This only works once "percpu: fix this_cpu_sub() subtrahend
casting for unsigneds" is applied to fix this_cpu_sub().

Signed-off-by: Greg Thelen &lt;gthelen@google.com&gt;
Acked-by: Tejun Heo &lt;tj@kernel.org&gt;
Acked-by: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm/pagewalk.c: fix walk_page_range() access of wrong PTEs</title>
<updated>2013-10-30T21:27:03Z</updated>
<author>
<name>Chen LinX</name>
<email>linx.z.chen@intel.com</email>
</author>
<published>2013-10-30T20:56:18Z</published>
<link rel='alternate' type='text/html' href='https://git.amat.us/linux/commit/?id=3017f079efd6af199b0852b5c425364513db460e'/>
<id>urn:sha1:3017f079efd6af199b0852b5c425364513db460e</id>
<content type='text'>
When walk_page_range walk a memory map's page tables, it'll skip
VM_PFNMAP area, then variable 'next' will to assign to vma-&gt;vm_end, it
maybe larger than 'end'.  In next loop, 'addr' will be larger than
'next'.  Then in /proc/XXXX/pagemap file reading procedure, the 'addr'
will growing forever in pagemap_pte_range, pte_to_pagemap_entry will
access the wrong pte.

  BUG: Bad page map in process procrank  pte:8437526f pmd:785de067
  addr:9108d000 vm_flags:00200073 anon_vma:f0d99020 mapping:  (null) index:9108d
  CPU: 1 PID: 4974 Comm: procrank Tainted: G    B   W  O 3.10.1+ #1
  Call Trace:
    dump_stack+0x16/0x18
    print_bad_pte+0x114/0x1b0
    vm_normal_page+0x56/0x60
    pagemap_pte_range+0x17a/0x1d0
    walk_page_range+0x19e/0x2c0
    pagemap_read+0x16e/0x200
    vfs_read+0x84/0x150
    SyS_read+0x4a/0x80
    syscall_call+0x7/0xb

Signed-off-by: Liu ShuoX &lt;shuox.liu@intel.com&gt;
Signed-off-by: Chen LinX &lt;linx.z.chen@intel.com&gt;
Acked-by: Kirill A. Shutemov &lt;kirill.shutemov@linux.intel.com&gt;
Reviewed-by: Naoya Horiguchi &lt;n-horiguchi@ah.jp.nec.com&gt;
Cc: &lt;stable@vger.kernel.org&gt;	[3.10.x+]
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
<entry>
<title>mm: list_lru: fix almost infinite loop causing effective livelock</title>
<updated>2013-10-30T19:57:46Z</updated>
<author>
<name>Russell King</name>
<email>rmk+kernel@arm.linux.org.uk</email>
</author>
<published>2013-10-30T14:16:16Z</published>
<link rel='alternate' type='text/html' href='https://git.amat.us/linux/commit/?id=c56b097af26cb11c1f49a4311ba538c825666fed'/>
<id>urn:sha1:c56b097af26cb11c1f49a4311ba538c825666fed</id>
<content type='text'>
I've seen a fair number of issues with kswapd and other processes
appearing to get stuck in v3.12-rc.  Using sysrq-p many times seems to
indicate that it gets stuck somewhere in list_lru_walk_node(), called
from prune_icache_sb() and super_cache_scan().

I never seem to be able to trigger a calltrace for functions above that
point.

So I decided to add the following to super_cache_scan():

    @@ -81,10 +81,14 @@ static unsigned long super_cache_scan(struct shrinker *shrink,
            inodes = list_lru_count_node(&amp;sb-&gt;s_inode_lru, sc-&gt;nid);
            dentries = list_lru_count_node(&amp;sb-&gt;s_dentry_lru, sc-&gt;nid);
            total_objects = dentries + inodes + fs_objects + 1;
    +printk("%s:%u: %s: dentries %lu inodes %lu total %lu\n", current-&gt;comm, current-&gt;pid, __func__, dentries, inodes, total_objects);

            /* proportion the scan between the caches */
            dentries = mult_frac(sc-&gt;nr_to_scan, dentries, total_objects);
            inodes = mult_frac(sc-&gt;nr_to_scan, inodes, total_objects);
    +printk("%s:%u: %s: dentries %lu inodes %lu\n", current-&gt;comm, current-&gt;pid, __func__, dentries, inodes);
    +BUG_ON(dentries == 0);
    +BUG_ON(inodes == 0);

            /*
             * prune the dcache first as the icache is pinned by it, then
    @@ -99,7 +103,7 @@ static unsigned long super_cache_scan(struct shrinker *shrink,
                    freed += sb-&gt;s_op-&gt;free_cached_objects(sb, fs_objects,
                                                           sc-&gt;nid);
            }
    -
    +printk("%s:%u: %s: dentries %lu inodes %lu freed %lu\n", current-&gt;comm, current-&gt;pid, __func__, dentries, inodes, freed);
            drop_super(sb);
            return freed;
     }

and shortly thereafter, having applied some pressure, I got this:

    update-apt-xapi:1616: super_cache_scan: dentries 25632 inodes 2 total 25635
    update-apt-xapi:1616: super_cache_scan: dentries 1023 inodes 0
    ------------[ cut here ]------------
    Kernel BUG at c0101994 [verbose debug info unavailable]
    Internal error: Oops - BUG: 0 [#3] SMP ARM
    Modules linked in: fuse rfcomm bnep bluetooth hid_cypress
    CPU: 0 PID: 1616 Comm: update-apt-xapi Tainted: G      D      3.12.0-rc7+ #154
    task: daea1200 ti: c3bf8000 task.ti: c3bf8000
    PC is at super_cache_scan+0x1c0/0x278
    LR is at trace_hardirqs_on+0x14/0x18
    Process update-apt-xapi (pid: 1616, stack limit = 0xc3bf8240)
    ...
    Backtrace:
      (super_cache_scan) from [&lt;c00cd69c&gt;] (shrink_slab+0x254/0x4c8)
      (shrink_slab) from [&lt;c00d09a0&gt;] (try_to_free_pages+0x3a0/0x5e0)
      (try_to_free_pages) from [&lt;c00c59cc&gt;] (__alloc_pages_nodemask+0x5)
      (__alloc_pages_nodemask) from [&lt;c00e07c0&gt;] (__pte_alloc+0x2c/0x13)
      (__pte_alloc) from [&lt;c00e3a70&gt;] (handle_mm_fault+0x84c/0x914)
      (handle_mm_fault) from [&lt;c001a4cc&gt;] (do_page_fault+0x1f0/0x3bc)
      (do_page_fault) from [&lt;c001a7b0&gt;] (do_translation_fault+0xac/0xb8)
      (do_translation_fault) from [&lt;c000840c&gt;] (do_DataAbort+0x38/0xa0)
      (do_DataAbort) from [&lt;c00133f8&gt;] (__dabt_usr+0x38/0x40)

Notice that we had a very low number of inodes, which were reduced to
zero my mult_frac().

Now, prune_icache_sb() calls list_lru_walk_node() passing that number of
inodes (0) into that as the number of objects to scan:

    long prune_icache_sb(struct super_block *sb, unsigned long nr_to_scan,
                         int nid)
    {
            LIST_HEAD(freeable);
            long freed;

            freed = list_lru_walk_node(&amp;sb-&gt;s_inode_lru, nid, inode_lru_isolate,
                                           &amp;freeable, &amp;nr_to_scan);

which does:

    unsigned long
    list_lru_walk_node(struct list_lru *lru, int nid, list_lru_walk_cb isolate,
                       void *cb_arg, unsigned long *nr_to_walk)
    {

            struct list_lru_node    *nlru = &amp;lru-&gt;node[nid];
            struct list_head *item, *n;
            unsigned long isolated = 0;

            spin_lock(&amp;nlru-&gt;lock);
    restart:
            list_for_each_safe(item, n, &amp;nlru-&gt;list) {
                    enum lru_status ret;

                    /*
                     * decrement nr_to_walk first so that we don't livelock if we
                     * get stuck on large numbesr of LRU_RETRY items
                     */
                    if (--(*nr_to_walk) == 0)
                            break;

So, if *nr_to_walk was zero when this function was entered, that means
we're wanting to operate on (~0UL)+1 objects - which might as well be
infinite.

Clearly this is not correct behaviour.  If we think about the behaviour
of this function when *nr_to_walk is 1, then clearly it's wrong - we
decrement first and then test for zero - which results in us doing
nothing at all.  A post-decrement would give the desired behaviour -
we'd try to walk one object and one object only if *nr_to_walk were one.

It also gives the correct behaviour for zero - we exit at this point.

Fixes: 5cedf721a7cd ("list_lru: fix broken LRU_RETRY behaviour")
Signed-off-by: Russell King &lt;rmk+kernel@arm.linux.org.uk&gt;
Cc: Dave Chinner &lt;dchinner@redhat.com&gt;
Cc: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
Cc: Andrew Morton &lt;akpm@linux-foundation.org&gt;
[ Modified to make sure we never underflow the count: this function gets
  called in a loop, so the 0 -&gt; ~0ul transition is dangerous  - Linus ]
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
</entry>
</feed>
