Age | Commit message (Collapse) | Author |
|
upstream commit: be852795e1c8d3829ddf3cb1ce806113611fa555
Some oprofile results obtained while using tbench on a 2x2 cpu machine were
very surprising.
For example, loopback_xmit() function was using high number of cpu cycles
to perform the statistic updates, supposed to be real cheap since they use
percpu data
pcpu_lstats = netdev_priv(dev);
lb_stats = per_cpu_ptr(pcpu_lstats, smp_processor_id());
lb_stats->packets++; /* HERE : serious contention */
lb_stats->bytes += skb->len;
struct pcpu_lstats is a small structure containing two longs. It appears
that on my 32bits platform, alloc_percpu(8) allocates a single cache line,
instead of giving to each cpu a separate cache line.
Using the following patch gave me impressive boost in various benchmarks
( 6 % in tbench)
(all percpu_counters hit this bug too)
Long term fix (ie >= 2.6.26) would be to let each CPU allocate their own
block of memory, so that we dont need to roudup sizes to L1_CACHE_BYTES, or
merging the SGI stuff of course...
Note : SLUB vs SLAB is important here to *show* the improvement, since they
dont have the same minimum allocation sizes (8 bytes vs 32 bytes). This
could very well explain regressions some guys reported when they switched
to SLUB.
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Chris Wright <chrisw@sous-sol.org>
|
|
upstream commit: b3242151906372f30f57feaa43b4cac96a23edb1
Instead of allocating a fix sized array of NR_CPUS pointers for percpu_data,
we can use nr_cpu_ids, which is generally < NR_CPUS.
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Chris Wright <chrisw@sous-sol.org>
|
|
upstream commit: ec1f5eeeb5a79a0d48036de649a3498da42db565
Commit 556a169dab38b5100df6f4a45b655dddd3db94c1 ("slab: fix bootstrap on
memoryless node") introduced bootstrap-time cache_cache list3s for all nodes
but forgot that initkmem_list3 needs to be accessed by [somevalue + node]. This
patch fixes list_add() corruption in mm/slab.c seen on the ES7000.
Cc: Mel Gorman <mel@csn.ul.ie>
Cc: Olaf Hering <olaf@aepfle.de>
Signed-off-by: Dan Yeisley <dan.yeisley@unisys.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Chris Wright <chrisw@sous-sol.org>
|
|
NUMA slab allocator cpu migration bugfix
The NUMA slab allocator (specifically, cache_alloc_refill)
is not refreshing its local copies of what cpu and what
numa node it is on, when it drops and reacquires the irq
block that it inherited from its caller. As a result
those values become invalid if an attempt to migrate the
process to another numa node occured while the irq block
had been dropped.
The solution is to make cache_alloc_refill reload these
variables whenever it drops and reacquires the irq block.
The error is very difficult to hit. When it does occur,
one gets the following oops + stack traceback bits in
check_spinlock_acquired:
kernel BUG at mm/slab.c:2417
cache_alloc_refill+0xe6
kmem_cache_alloc+0xd0
...
This patch was developed against 2.6.23, ported to and
compiled-tested only against 2.6.25-rc4.
Signed-off-by: Joe Korty <joe.korty@ccur.com>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Chris Wright <chrisw@sous-sol.org>
|
|
commit: e5df70ab194543522397fa3da8c8f80564a0f7d3
When we free a page via free_huge_page and we detect that we are in surplus
the page will be returned to the buddy. After this we no longer own the page.
However at the end free_huge_page we clear out our mapping pointer from
page private. Even where the page is not a surplus we free the page to
the hugepage pool, drop the pool locks and then clear page private. In
either case the page may have been reallocated. BAD.
Make sure we clear out page private before we free the page.
Signed-off-by: Andy Whitcroft <apw@shadowen.org>
Acked-by: Adam Litke <agl@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Chris Wright <chrisw@sous-sol.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
|
|
commit: f7009264c519603b8ec67c881bd368a56703cfc9
iov_iter_advance() skips over zero-length iovecs, however it does not properly
terminate at the end of the iovec array. Fix this by checking against
i->count before we skip a zero-length iov.
The bug was reproduced with a test program that continually randomly creates
iovs to writev. The fix was also verified with the same program and also it
could verify that the correct data was contained in the file after each
writev.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Tested-by: "Kevin Coffman" <kwc@citi.umich.edu>
Cc: "Alexey Dobriyan" <adobriyan@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Chris Wright <chrisw@sous-sol.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
|
|
patch 5bb983b0cce9b7b281af15730f7019116dd42568 in mainline.
gcc 4.2 spits out an annoying warning if one casts a const void *
pointer to a void * pointer. No warning is generated if the
conversion is done through an assignment.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
|
|
patch 900cf086fd2fbad07f72f4575449e0d0958f860f in mainline.
So I spent a while pounding my head against my monitor trying to figure
out the vmsplice() vulnerability - how could a failure to check for
*read* access turn into a root exploit? It turns out that it's a buffer
overflow problem which is made easy by the way get_user_pages() is
coded.
In particular, "len" is a signed int, and it is only checked at the
*end* of a do {} while() loop. So, if it is passed in as zero, the loop
will execute once and decrement len to -1. At that point, the loop will
proceed until the next invalid address is found; in the process, it will
likely overflow the pages array passed in to get_user_pages().
I think that, if get_user_pages() has been asked to grab zero pages,
that's what it should do. Thus this patch; it is, among other things,
enough to block the (already fixed) root exploit and any others which
might be lurking in similar code. I also think that the number of pages
should be unsigned, but changing the prototype of this function probably
requires some more careful review.
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
|
|
patch a3d0c6aa1bb342b9b2c7b123b52ac2f48a4d4d0a in mainline.
When I replaced hugetlb_dynamic_pool with nr_overcommit_hugepages I used
proc_doulongvec_minmax() directly. However, hugetlb.c's locking rules
require that all counter modifications occur under the hugetlb_lock. Add a
callback into the hugetlb code similar to the one for nr_hugepages. Grab
the lock around the manipulation of nr_overcommit_hugepages in
proc_doulongvec_minmax().
Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
Acked-by: Adam Litke <agl@us.ibm.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: William Lee Irwin III <wli@holomorphy.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
|
|
patch 124d3b7041f9a0ca7c43a6293e1cae4576c32fd5 in mainline.
Frederik Himpe reported an unkillable and un-straceable pan process.
Zero length iovecs can go into an infinite loop in writev, because the
iovec iterator does not always advance over them.
The sequence required to trigger this is not trivial. I think it
requires that a zero-length iovec be followed by a non-zero-length iovec
which causes a pagefault in the atomic usercopy. This causes the writev
code to drop back into single-segment copy mode, which then tries to
copy the 0 bytes of the zero-length iovec; a zero length copy looks like
a failure though, so it loops.
Put a test into iov_iter_advance to catch zero-length iovecs. We could
just put the test in the fallback path, but I feel it is more robust to
skip over zero-length iovecs throughout the code (iovec iterator may be
used in filesystems too, so it should be robust).
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
|
|
patch 8a459e44ad837018ea5c34a9efe8eb4ad27ded26 in mainline.
Fix ->vm_file accounting, mmap_region() may do do_munmap().
Signed-off-by: Oleg Nesterov <oleg@tv-sign.ru>
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
|
|
Drivers that register a ->fault handler, but do not range-check the
offset argument, must set VM_DONTEXPAND in the vm_flags in order to
prevent an expanding mremap from overflowing the resource.
I've audited the tree and attempted to fix these problems (usually by
adding VM_DONTEXPAND where it is not obvious).
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>
|
|
[ Upstream commit: 556a169dab38b5100df6f4a45b655dddd3db94c1 ]
If the node we're booting on doesn't have memory, bootstrapping kmalloc()
caches resorts to fallback_alloc() which requires ->nodelists set for all
nodes. Fix that by calling set_up_list3s() for CACHE_CACHE in
kmem_cache_init().
As kmem_getpages() is called with GFP_THISNODE set, this used to work before
because of breakage in 2.6.22 and before with GFP_THISNODE returning pages from
the wrong node if a node had no memory. So it may have worked accidentally and
in an unsafe manner because the pages would have been associated with the wrong
node which could trigger bug ons and locking troubles.
Tested-by: Mel Gorman <mel@csn.ul.ie>
Tested-by: Olaf Hering <olaf@aepfle.de>
Reviewed-by: Christoph Lameter <clameter@sgi.com>
[ With additional one-liner by Olaf Hering - Linus ]
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Chris Wright <chrisw@sous-sol.org>
|
|
Partial revert the changes made by 04231b3002ac53f8a64a7bd142fde3fa4b6808c6
to the kmem_list3 management. On a machine with a memoryless node, this
BUG_ON was triggering
static void *____cache_alloc_node(struct kmem_cache *cachep, gfp_t flags, int nodeid)
{
struct list_head *entry;
struct slab *slabp;
struct kmem_list3 *l3;
void *obj;
int x;
l3 = cachep->nodelists[nodeid];
BUG_ON(!l3);
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Acked-by: Christoph Lameter <clameter@sgi.com>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
Cc: Nishanth Aravamudan <nacc@us.ibm.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
The shared page table code for hugetlb memory on x86 and x86_64
is causing a leak. When a user of hugepages exits using this code
the system leaks some of the hugepages.
-------------------------------------------------------
Part of /proc/meminfo just before database startup:
HugePages_Total: 5500
HugePages_Free: 5500
HugePages_Rsvd: 0
Hugepagesize: 2048 kB
Just before shutdown:
HugePages_Total: 5500
HugePages_Free: 4475
HugePages_Rsvd: 0
Hugepagesize: 2048 kB
After shutdown:
HugePages_Total: 5500
HugePages_Free: 4988
HugePages_Rsvd:
0 Hugepagesize: 2048 kB
----------------------------------------------------------
The problem occurs durring a fork, in copy_hugetlb_page_range(). It
locates the dst_pte using huge_pte_alloc(). Since huge_pte_alloc() calls
huge_pmd_share() it will share the pmd page if can, yet the main loop in
copy_hugetlb_page_range() does a get_page() on every hugepage. This is a
violation of the shared hugepmd pagetable protocol and creates additional
referenced to the hugepages causing a leak when the unmap of the VMA
occurs. We can skip the entire replication of the ptes when the hugepage
pagetables are shared. The attached patch skips copying the ptes and the
get_page() calls if the hugetlbpage pagetable is shared.
[akpm@linux-foundation.org: coding-style cleanups]
Signed-off-by: Larry Woodman <lwoodman@redhat.com>
Signed-off-by: Adam Litke <agl@us.ibm.com>
Cc: Badari Pulavarty <pbadari@us.ibm.com>
Cc: Ken Chen <kenchen@google.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: William Lee Irwin III <wli@holomorphy.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Update ctime and mtime for memory-mapped files at a write access on
a present, read-only PTE, as well as at a write on a non-present PTE.
Signed-off-by: Anton Salikhmetov <salikhmetov@gmail.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
This patch puts #ifdef CONFIG_DEBUG_VM around a check in vm_normal_page
that verifies that a pfn is valid. This patch increases performance of the
page fault microbenchmark in lmbench by 13% and overall dbench performance
by 7% on s390x. pfn_valid() is an expensive operation on s390 that needs a
high double digit amount of CPU cycles. Nick Piggin suggested that
pfn_valid() involves an array lookup on systems with sparsemem, and
therefore is an expensive operation there too.
The check looks like a clear debug thing to me, it should never trigger on
regular kernels. And if a pte is created for an invalid pfn, we'll find
out once the memory gets accessed later on anyway. Please consider
inclusion of this patch into mm.
Signed-off-by: Carsten Otte <cotte@de.ibm.com>
Acked-by: Nick Piggin <nickpiggin@yahoo.com.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
With CONFIG_HOTPLUG=n and CONFIG_HOTPLUG_CPU=y we saw
following warning:
WARNING: mm/built-in.o(.text+0x6864): Section mismatch: reference to .init.text: (between 'process_zones' and 'pageset_cpuup_callback')
The culprit was zone_batchsize() which were annotated __devinit but used
from process_zones() which is annotated __cpuinit. zone_batchsize() are
used from another function annotated __meminit so the only valid option is
to drop the annotation of zone_batchsize() so we know it is always valid to
use it.
Signed-off-by: Sam Ravnborg <sam@ravnborg.org>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
This reverts commit 2e6883bdf49abd0e7f0d9b6297fc3be7ebb2250b, as
requested by Fengguang Wu. It's not quite fully baked yet, and while
there are patches around to fix the problems it caused, they should get
more testing. Says Fengguang: "I'll resend them both for -mm later on,
in a more complete patchset".
See
http://bugzilla.kernel.org/show_bug.cgi?id=9738
for some of this discussion.
Requested-by: Fengguang Wu <wfg@mail.ustc.edu.cn>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
In the error path of both shared and private hugetlb page allocation,
the file system quota is never undone, leading to fs quota leak. Fix
them up.
[akpm@linux-foundation.org: cleanup, micro-optimise]
Signed-off-by: Ken Chen <kenchen@google.com>
Acked-by: Adam Litke <agl@us.ibm.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Cc: William Lee Irwin III <wli@holomorphy.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Quicklists calculates the size of the quicklists based on the number of
free pages. This must be the number of free pages that can be allocated
with GFP_KERNEL. node_page_state() includes the pages in ZONE_HIGHMEM and
ZONE_MOVABLE which may lead the quicklists to become too large causing OOM.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Tested-by: Dhaval Giani <dhaval@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
When using FLAT_MEMORY and ARCH_PFN_OFFSET is not 0, the kernel crashes in
memmap_init_zone(). This bug got introduced by commit
c713216deebd95d2b0ab38fef8bb2361c0180c2d
Signed-off-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Acked-by: Mel Gorman <mel@csn.ul.ie>
Cc: Bob Picco <bob.picco@hp.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: Andy Whitcroft <apw@shadowen.org>
Cc: Andi Kleen <ak@muc.de>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: "Keith Mannthey" <kmannth@gmail.com>
Cc: "Luck, Tony" <tony.luck@intel.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Yasunori Goto <y-goto@jp.fujitsu.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
The use of get_zeroed_page() with __GFP_HIGHMEM is invalid. Use
alloc_page() with __GFP_ZERO instead of invalid get_zeroed_page().
(This patch is only compile tested)
Cc: Carsten Otte <cotte@de.ibm.com>
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Acked-by: Hugh Dickins <hugh@veritas.com>
Acked-by: Carsten Otte <cotte@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Both SLUB and SLAB really did almost exactly the same thing for
/proc/slabinfo setup, using duplicate code and per-allocator #ifdef's.
This just creates a common CONFIG_SLABINFO that is enabled by both SLUB
and SLAB, and shares all the setup code. Maybe SLOB will want this some
day too.
Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
This adds a read-only /proc/slabinfo file on SLUB, that makes slabtop work.
[ mingo@elte.hu: build fix. ]
Cc: Andi Kleen <andi@firstfloor.org>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Increase the mininum number of partial slabs to keep around and put
partial slabs to the end of the partial queue so that they can add
more objects.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Krzysztof Oledzki noticed a dirty page accounting leak on some of his
machines, causing the machine to eventually lock up when the kernel
decided that there was too much dirty data, but nobody could actually
write anything out to fix it.
The culprit turns out to be filesystems (cough ext3 with data=journal
cough) that re-dirty the page when the "->invalidatepage()" callback is
called.
Fix it up by doing a final dirty page accounting check when we actually
remove the page from the page cache.
This fixes bugzilla entry 9182:
http://bugzilla.kernel.org/show_bug.cgi?id=9182
Tested-by: Ingo Molnar <mingo@elte.hu>
Tested-by: Krzysztof Oledzki <olel@ans.pl>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Remove a recently added useless masking of GFP_ZERO. GFP_ZERO is already
masked out in new_slab() (See how it calls allocate_slab). No need to do
it twice.
This reverts the SLUB parts of 7fd272550bd43cc1d7289ef0ab2fa50de137e767.
Cc: Matt Mackall <mpm@selenic.com>
Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
This reverts commit 54f9f80d6543fb7b157d3b11e2e7911dc1379790 ("hugetlb:
Add hugetlb_dynamic_pool sysctl")
Given the new sysctl nr_overcommit_hugepages, the boolean dynamic pool
sysctl is not needed, as its semantics can be expressed by 0 in the
overcommit sysctl (no dynamic pool) and non-0 in the overcommit sysctl
(pool enabled).
(Needed in 2.6.24 since it reverts a post-2.6.23 userspace-visible change)
Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
Acked-by: Adam Litke <agl@us.ibm.com>
Cc: William Lee Irwin III <wli@holomorphy.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
hugetlb: introduce nr_overcommit_hugepages sysctl
While examining the code to support /proc/sys/vm/hugetlb_dynamic_pool, I
became convinced that having a boolean sysctl was insufficient:
1) To support per-node control of hugepages, I have previously submitted
patches to add a sysfs attribute related to nr_hugepages. However, with
a boolean global value and per-mount quota enforcement constraining the
dynamic pool, adding corresponding control of the dynamic pool on a
per-node basis seems inconsistent to me.
2) Administration of the hugetlb dynamic pool with multiple hugetlbfs
mount points is, arguably, more arduous than it needs to be. Each quota
would need to be set separately, and the sum would need to be monitored.
To ease the administration, and to help make the way for per-node
control of the static & dynamic hugepage pool, I added a separate
sysctl, nr_overcommit_hugepages. This value serves as a high watermark
for the overall hugepage pool, while nr_hugepages serves as a low
watermark. The boolean sysctl can then be removed, as the condition
nr_overcommit_hugepages > 0
indicates the same administrative setting as
hugetlb_dynamic_pool == 1
Quotas still serve as local enforcement of the size of the pool on a
per-mount basis.
A few caveats:
1) There is a race whereby the global surplus huge page counter is
incremented before a hugepage has allocated. Another process could then
try grow the pool, and fail to convert a surplus huge page to a normal
huge page and instead allocate a fresh huge page. I believe this is
benign, as no memory is leaked (the actual pages are still tracked
correctly) and the counters won't go out of sync.
2) Shrinking the static pool while a surplus is in effect will allow the
number of surplus huge pages to exceed the overcommit value. As long as
this condition holds, however, no more surplus huge pages will be
allowed on the system until one of the two sysctls are increased
sufficiently, or the surplus huge pages go out of use and are freed.
Successfully tested on x86_64 with the current libhugetlbfs snapshot,
modified to use the new sysctl.
Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
Acked-by: Adam Litke <agl@us.ibm.com>
Cc: William Lee Irwin III <wli@holomorphy.com>
Cc: Dave Hansen <haveblue@us.ibm.com>
Cc: David Gibson <david@gibson.dropbear.id.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
In some cases the IO subsystem is able to merge requests if the pages are
adjacent in physical memory. This was achieved in the allocator by having
expand() return pages in physically contiguous order in situations were a
large buddy was split. However, list-based anti-fragmentation changed the
order pages were returned in to avoid searching in buffered_rmqueue() for a
page of the appropriate migrate type.
This patch restores behaviour of rmqueue_bulk() preserving the physical
order of pages returned by the allocator without incurring increased search
costs for anti-fragmentation.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Cc: James Bottomley <James.Bottomley@steeleye.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
Cc: Mark Lord <mlord@pobox.com
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Improve the error handling for mm/sparse.c::sparse_add_one_section(). And I
see no reason to check 'usemap' until holding the 'pgdat_resize_lock'.
[geoffrey.levand@am.sony.com: sparse_index_init() returns -EEXIST]
Cc: Christoph Lameter <clameter@sgi.com>
Acked-by: Dave Hansen <haveblue@us.ibm.com>
Cc: Rik van Riel <riel@redhat.com>
Acked-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: WANG Cong <xiyou.wangcong@gmail.com>
Signed-off-by: Geoff Levand <geoffrey.levand@am.sony.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Since sparse_index_alloc() can return NULL on memory allocation failure,
we must deal with the failure condition when calling it.
Signed-off-by: WANG Cong <xiyou.wangcong@gmail.com>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
SPARSEMEM_VMEMMAP needs to be a selectable config option to support
building the kernel both with and without sparsemem vmemmap support. This
selection is desirable for platforms which could be configured one way for
platform specific builds and the other for multi-platform builds.
Signed-off-by: Miguel Botón <mboton@gmail.com>
Signed-off-by: Geoff Levand <geoffrey.levand@am.sony.com>
Acked-by: Yasunori Goto <y-goto@jp.fujitsu.com>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
The follow_hugetlb_page() fix I posted (merged as git commit
5b23dbe8173c212d6a326e35347b038705603d39) missed one case. If the pte is
present, but not writable and write access is requested by the caller to
get_user_pages(), the code will do the wrong thing. Rather than calling
hugetlb_fault to make the pte writable, it notes the presence of the pte
and continues.
This simple one-liner makes sure we also fault on the pte for this case.
Please apply.
Signed-off-by: Adam Litke <agl@us.ibm.com>
Acked-by: Dave Kleikamp <shaggy@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Both slob and slub react to __GFP_ZERO by clearing the allocation, which
means that passing the GFP_ZERO bit down to the page allocator is just
wasteful and pointless.
Acked-by: Matt Mackall <mpm@selenic.com>
Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/selinux-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/selinux-2.6:
VM/Security: add security hook to do_brk
Security: round mmap hint address above mmap_min_addr
security: protect from stack expantion into low vm addresses
Security: allow capable check to permit mmap or low vm space
SELinux: detect dead booleans
SELinux: do not clear f_op when removing entries
|
|
Given a specifically crafted binary do_brk() can be used to get low pages
available in userspace virtual memory and can thus be used to circumvent
the mmap_min_addr low memory protection. Add security checks in do_brk().
Signed-off-by: Eric Paris <eparis@redhat.com>
Acked-by: Alan Cox <alan@redhat.com>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Cc: James Morris <jmorris@namei.org>
Cc: Chris Wright <chrisw@sous-sol.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
I can't pass memory allocated by kmalloc() to ksize() if it is allocated by
SLUB allocator and size is larger than (I guess) PAGE_SIZE / 2.
The error of ksize() seems to be that it does not check if the allocation
was made by SLUB or the page allocator.
Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>
Tested-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Christoph Lameter <clameter@sgi.com>, Matt Mackall <mpm@selenic.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Writing to XIP files at a non-page-aligned offset results in data corruption
because the writes were always sent to the start of the page.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Acked-by: Carsten Otte <cotte@de.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
mm/slub.c exports ksize(), but mm/slob.c and mm/slab.c don't.
It's used by binfmt_flat, which can be built as a module.
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Christoph Lameter <clameter@sgi.com>
Cc: Matt Mackall <mpm@selenic.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
this call should use the array index j, not i. But with this approach, just
one int i is enough, int j is not needed.
Signed-off-by: Denis Cheng <crquan@gmail.com>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Given a specifically crafted binary do_brk() can be used to get low
pages available in userspace virtually memory and can thus be used to
circumvent the mmap_min_addr low memory protection. Add security checks
in do_brk().
Signed-off-by: Eric Paris <eparis@redhat.com>
Acked-by: Alan Cox <alan@redhat.com>
Signed-off-by: James Morris <jmorris@namei.org>
|
|
If mmap_min_addr is set and a process attempts to mmap (not fixed) with a
non-null hint address less than mmap_min_addr the mapping will fail the
security checks. Since this is just a hint address this patch will round
such a hint address above mmap_min_addr.
gcj was found to try to be very frugal with vm usage and give hint addresses
in the 8k-32k range. Without this patch all such programs failed and with
the patch they happily get a higher address.
This patch is wrappad in CONFIG_SECURITY since mmap_min_addr doesn't exist
without it and there would be no security check possible no matter what. So
we should not bother compiling in this rounding if it is just a waste of
time.
Signed-off-by: Eric Paris <eparis@redhat.com>
Signed-off-by: James Morris <jmorris@namei.org>
|
|
Add security checks to make sure we are not attempting to expand the
stack into memory protected by mmap_min_addr
Signed-off-by: Eric Paris <eparis@redhat.com>
Signed-off-by: James Morris <jmorris@namei.org>
|
|
The database performance group have found that half the cycles spent
in kmem_cache_free are spent in this one call to BUG_ON. Moving it
into the CONFIG_SLAB_DEBUG-only function cache_free_debugcheck() is a
performance win of almost 0.5% on their particular benchmark.
The call was added as part of commit ddc2e812d592457747c4367fb73edcaa8e1e49ff
with the comment that "overhead should be minimal". It may have been
minimal at the time, but it isn't now.
[ Quoth Pekka Enberg: "I don't think the BUG_ON per se caused the
performance regression but rather the virt_to_head_page() changes to
virt_to_cache() that were added later." ]
Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
Acked-by: Pekka J Enberg <penberg@cs.helsinki.fi>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Fixes section mismatch below.
WARNING: vmlinux.o(.text+0x946b5): Section mismatch: reference to .init.text:'
__alloc_bootmem_node (between 'vmemmap_alloc_block' and 'vmemmap_pgd_populate')
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com>
Cc: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Ordinarily the size of a pageblock is determined at compile-time based on the
hugepage size. On PPC64, the hugepage size is determined at runtime based on
what is supported by the machine. With legacy machines such as iSeries that
do not support hugepages, HPAGE_SHIFT is 0. This results in pageblock_order
being set to -PAGE_SHIFT and a crash results shortly afterwards.
This patch adds a function to select a sensible value for pageblock order by
default when HUGETLB_PAGE_SIZE_VARIABLE is set. It checks that HPAGE_SHIFT
is a sensible value before using the hugepage size; if it is not MAX_ORDER-1
is used.
This is a fix for 2.6.24.
Credit goes to Stephen Rothwell for identifying the bug and testing candidate
patches. Additional credit goes to Andy Whitcroft for spotting a problem
with respects to IA-64 before releasing. Additional credit to David Gibson
for testing with the libhugetlbfs test suite.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Tested-by: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Acked-by: Paul Mackerras <paulus@samba.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
2.6.11 gave __GFP_ZERO's prep_zero_page a bogus "highmem may have to wait"
assertion. Presumably added under the misconception that clear_highpage
uses nonatomic kmap; but then and now it uses kmap_atomic, so no problem.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
tmpfs was misconverted to __GFP_ZERO in 2.6.11. There's an unusual case in
which shmem_getpage receives the page from its caller instead of allocating.
We must cover this case by clear_highpage before SetPageUptodate, as before.
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|