Age | Commit message (Collapse) | Author |
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (162 commits)
tracing/kprobes: unregister_trace_probe needs to be called under mutex
perf: expose event__process function
perf events: Fix mmap offset determination
perf, powerpc: fsl_emb: Restore setting perf_sample_data.period
perf, powerpc: Convert the FSL driver to use local64_t
perf tools: Don't keep unreferenced maps when unmaps are detected
perf session: Invalidate last_match when removing threads from rb_tree
perf session: Free the ref_reloc_sym memory at the right place
x86,mmiotrace: Add support for tracing STOS instruction
perf, sched migration: Librarize task states and event headers helpers
perf, sched migration: Librarize the GUI class
perf, sched migration: Make the GUI class client agnostic
perf, sched migration: Make it vertically scrollable
perf, sched migration: Parameterize cpu height and spacing
perf, sched migration: Fix key bindings
perf, sched migration: Ignore unhandled task states
perf, sched migration: Handle ignored migrate out events
perf: New migration tool overview
tracing: Drop cpparg() macro
perf: Use tracepoint_synchronize_unregister() to flush any pending tracepoint call
...
Fix up trivial conflicts in Makefile and drivers/cpufreq/cpufreq.c
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip
* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
Revert "net: Make accesses to ->br_port safe for sparse RCU"
mce: convert to rcu_dereference_index_check()
net: Make accesses to ->br_port safe for sparse RCU
vfs: add fs.h to define struct file
lockdep: Add an in_workqueue_context() lockdep-based test function
rcu: add __rcu API for later sparse checking
rcu: add an rcu_dereference_index_check()
tree/tiny rcu: Add debug RCU head objects
mm: remove all rcu head initializations
fs: remove all rcu head initializations, except on_stack initializations
powerpc: remove all rcu head initializations
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/jwessel/linux-2.6-kgdb
* 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jwessel/linux-2.6-kgdb:
debug_core,kdb: fix crash when arch does not have single step
kgdb,x86: use macro HBP_NUM to replace magic number 4
kgdb,mips: remove unused kgdb_cpu_doing_single_step operations
mm,kdb,kgdb: Add a debug reference for the kdb kmap usage
KGDB: Remove set but unused newPC
ftrace,kdb: Allow dumping a specific cpu's buffer with ftdump
ftrace,kdb: Extend kdb to be able to dump the ftrace buffer
kgdb,powerpc: Replace hardcoded offset by BREAK_INSTR_SIZE
arm,kgdb: Add ability to trap into debugger on notify_die
gdbstub: do not directly use dbg_reg_def[] in gdb_cmd_reg_set()
gdbstub: Implement gdbserial 'p' and 'P' packets
kgdb,arm: Individual register get/set for arm
kgdb,mips: Individual register get/set for mips
kgdb,x86: Individual register get/set for x86
kgdb,kdb: individual register set and and get API
gdbstub: Optimize kgdb's "thread:" response for the gdb serial protocol
kgdb: remove custom hex_to_bin()implementation
|
|
The kdb kmap should never get used outside of the kernel debugger
exception context.
Signed-off-by: Jason Wessel<jason.wessel@windriver.com>
CC: Andrew Morton <akpm@linux-foundation.org>
CC: Ingo Molnar <mingo@elte.hu>
CC: linux-mm@kvack.org
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu:
percpu: allow limited allocation before slab is online
percpu: make @dyn_size always mean min dyn_size in first chunk init functions
|
|
* 'kvm-updates/2.6.36' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (198 commits)
KVM: VMX: Fix host GDT.LIMIT corruption
KVM: MMU: using __xchg_spte more smarter
KVM: MMU: cleanup spte set and accssed/dirty tracking
KVM: MMU: don't atomicly set spte if it's not present
KVM: MMU: fix page dirty tracking lost while sync page
KVM: MMU: fix broken page accessed tracking with ept enabled
KVM: MMU: add missing reserved bits check in speculative path
KVM: MMU: fix mmu notifier invalidate handler for huge spte
KVM: x86 emulator: fix xchg instruction emulation
KVM: x86: Call mask notifiers from pic
KVM: x86: never re-execute instruction with enabled tdp
KVM: Document KVM_GET_SUPPORTED_CPUID2 ioctl
KVM: x86: emulator: inc/dec can have lock prefix
KVM: MMU: Eliminate redundant temporaries in FNAME(fetch)
KVM: MMU: Validate all gptes during fetch, not just those used for new pages
KVM: MMU: Simplify spte fetch() function
KVM: MMU: Add gpte_valid() helper
KVM: MMU: Add validate_direct_spte() helper
KVM: MMU: Add drop_large_spte() helper
KVM: MMU: Use __set_spte to link shadow pages
...
|
|
Conflicts:
tools/perf/Makefile
tools/perf/util/hist.c
Merge reason: Resolve the conflicts and update to latest upstream.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
|
|
is_hwpoison_address accesses the page table, so the caller must hold
current->mm->mmap_sem in read mode. So fix its usage in hva_to_pfn of
kvm accordingly.
Comment is_hwpoison_address to remind other users.
Reported-by: Avi Kivity <avi@redhat.com>
Signed-off-by: Huang Ying <ying.huang@intel.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
|
|
In common cases, guest SRAO MCE will cause corresponding poisoned page
be un-mapped and SIGBUS be sent to QEMU-KVM, then QEMU-KVM will relay
the MCE to guest OS.
But it is reported that if the poisoned page is accessed in guest
after unmapping and before MCE is relayed to guest OS, userspace will
be killed.
The reason is as follows. Because poisoned page has been un-mapped,
guest access will cause guest exit and kvm_mmu_page_fault will be
called. kvm_mmu_page_fault can not get the poisoned page for fault
address, so kernel and user space MMIO processing is tried in turn. In
user MMIO processing, poisoned page is accessed again, then userspace
is killed by force_sig_info.
To fix the bug, kvm_mmu_page_fault send HWPOISON signal to QEMU-KVM
and do not try kernel and user space MMIO processing for poisoned
page.
[xiao: fix warning introduced by avi]
Reported-by: Max Asbock <masbock@linux.vnet.ibm.com>
Signed-off-by: Huang Ying <ying.huang@intel.com>
Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Avi Kivity <avi@redhat.com>
|
|
Debian's ia64 autobuilders have been seeing kernel freeze or reboot
when running the gdb testsuite (Debian bug 588574): dannf bisected to
2.6.32 62eede62dafb4a6633eae7ffbeb34c60dba5e7b1 "mm: ZERO_PAGE without
PTE_SPECIAL"; and reproduced it with gdb's gcore on a simple target.
I'd missed updating the gate_vma handling in __get_user_pages(): that
happens to use vm_normal_page() (nowadays failing on the zero page),
yet reported success even when it failed to get a page - boom when
access_process_vm() tried to copy that to its intermediate buffer.
Fix this, resisting cleanups: in particular, leave it for now reporting
success when not asked to get any pages - very probably safe to change,
but let's not risk it without testing exposure.
Why did ia64 crash with 16kB pages, but succeed with 64kB pages?
Because setup_gate() pads each 64kB of its gate area with zero pages.
Reported-by: Andreas Barth <aba@not.so.argh.org>
Bisected-by: dann frazier <dannf@debian.org>
Signed-off-by: Hugh Dickins <hughd@google.com>
Tested-by: dann frazier <dannf@dannf.org>
Cc: stable@kernel.org
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
Merge reason: Pick up the latest perf fixes.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
|
|
numa is used
Borislav Petkov reported his 32bit numa system has problem:
[ 0.000000] Reserving total of 4c00 pages for numa KVA remap
[ 0.000000] kva_start_pfn ~ 32800 max_low_pfn ~ 375fe
[ 0.000000] max_pfn = 238000
[ 0.000000] 8202MB HIGHMEM available.
[ 0.000000] 885MB LOWMEM available.
[ 0.000000] mapped low ram: 0 - 375fe000
[ 0.000000] low ram: 0 - 375fe000
[ 0.000000] alloc (nid=8 100000 - 7ee00000) (1000000 - ffffffff) 1000 1000 => 34e7000
[ 0.000000] alloc (nid=8 100000 - 7ee00000) (1000000 - ffffffff) 200 40 => 34c9d80
[ 0.000000] alloc (nid=0 100000 - 7ee00000) (1000000 - ffffffffffffffff) 180 40 => 34e6140
[ 0.000000] alloc (nid=1 80000000 - c7e60000) (1000000 - ffffffffffffffff) 240 40 => 80000000
[ 0.000000] BUG: unable to handle kernel paging request at 40000000
[ 0.000000] IP: [<c2c8cff1>] __alloc_memory_core_early+0x147/0x1d6
[ 0.000000] *pdpt = 0000000000000000 *pde = f000ff53f000ff00
...
[ 0.000000] Call Trace:
[ 0.000000] [<c2c8b4f8>] ? __alloc_bootmem_node+0x216/0x22f
[ 0.000000] [<c2c90c9b>] ? sparse_early_usemaps_alloc_node+0x5a/0x10b
[ 0.000000] [<c2c9149e>] ? sparse_init+0x1dc/0x499
[ 0.000000] [<c2c79118>] ? paging_init+0x168/0x1df
[ 0.000000] [<c2c780ff>] ? native_pagetable_setup_start+0xef/0x1bb
looks like it allocates too much high address for bootmem.
Try to cut limit with get_max_mapped()
Reported-by: Borislav Petkov <borislav.petkov@amd.com>
Tested-by: Conny Seidel <conny.seidel@amd.com>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Cc: <stable@kernel.org> [2.6.34.x]
Cc: Ingo Molnar <mingo@elte.hu>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Lee Schermerhorn <lee.schermerhorn@hp.com>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
We need lock_page_nosync() here because we have no reference to the
mapping when taking the page lock.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Mel Gorman <mel@csn.ul.ie>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
The current shrinker implementation requires the registered callback
to have global state to work from. This makes it difficult to shrink
caches that are not global (e.g. per-filesystem caches). Pass the shrinker
structure to the callback so that users can embed the shrinker structure
in the context the shrinker needs to operate on and get back to it in the
callback via container_of().
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/cmarinas/linux-2.6-cm
* 'kmemleak' of git://git.kernel.org/pub/scm/linux/kernel/git/cmarinas/linux-2.6-cm:
kmemleak: Add support for NO_BOOTMEM configurations
kmemleak: Annotate false positive in init_section_page_cgroup()
|
|
With commits 08677214 and 59be5a8e, alloc_bootmem()/free_bootmem() and
friends use the early_res functions for memory management when
NO_BOOTMEM is enabled. This patch adds the kmemleak calls in the
corresponding code paths for bootmem allocations.
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Acked-by: Yinghai Lu <yinghai@kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: stable@kernel.org
|
|
The pointer to the page_cgroup table allocated in
init_section_page_cgroup() is stored in section->page_cgroup as (base -
pfn). Since this value does not point to the beginning or inside the
allocated memory block, kmemleak reports a false positive.
This was reported in bugzilla.kernel.org as #16297.
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Reported-by: Adrien Dessemond <adrien.dessemond@gmail.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Andrew Morton <akpm@linux-foundation.org>
|
|
via following scripts
FILES=$(find * -type f | grep -vE 'oprofile|[^K]config')
sed -i \
-e 's/lmb/memblock/g' \
-e 's/LMB/MEMBLOCK/g' \
$FILES
for N in $(find . -name lmb.[ch]); do
M=$(echo $N | sed 's/lmb/memblock/g')
mv $N $M
done
and remove some wrong change like lmbench and dlmb etc.
also move memblock.c from lib/ to mm/
Suggested-by: Ingo Molnar <mingo@elte.hu>
Acked-by: "H. Peter Anvin" <hpa@zytor.com>
Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
|
|
* 'for-linus' of git://git.kernel.dk/linux-2.6-block:
writeback: simplify the write back thread queue
writeback: split writeback_inodes_wb
writeback: remove writeback_inodes_wbc
fs-writeback: fix kernel-doc warnings
splice: check f_mode for seekable file
splice: direct_splice_actor() should not use pos in sd
|
|
First remove items from work_list as soon as we start working on them. This
means we don't have to track any pending or visited state and can get
rid of all the RCU magic freeing the work items - we can simply free
them once the operation has finished. Second use a real completion for
tracking synchronous requests - if the caller sets the completion pointer
we complete it, otherwise use it as a boolean indicator that we can free
the work item directly. Third unify struct wb_writeback_args and struct
bdi_work into a single data structure, wb_writeback_work. Previous we
set all parameters into a struct wb_writeback_args, copied it into
struct bdi_work, copied it again on the stack to use it there. Instead
of just allocate one structure dynamically or on the stack and use it
all the way through the stack.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
|
|
This was just an odd wrapper around writeback_inodes_wb. Removing this
also allows to get rid of the bdi member of struct writeback_control
which was rather out of place there.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
|
|
Merge reason: Pick up the latest perf fixes
Signed-off-by: Ingo Molnar <mingo@elte.hu>
|
|
Conflicts:
fs/fs-writeback.c
Merge reason: Resolve the conflict
Note, i picked the version from Linus's tree, which effectively reverts
the fs-writeback.c bits of:
b97181f: fs: remove all rcu head initializations, except on_stack initializations
As the upstream changes to this file changed this code heavily and the
first attempt to resolve the conflict resulted in a non-booting kernel.
It's safer to re-try this portion of the commit cleanly.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
|
|
My patch to "Factor out duplicate put/frees in mpol_shared_policy_init()
to a common return path"; and Dan Carpenter's fix thereto both left a
dangling reference to the incoming tmpfs superblock mempolicy structure.
A similar leak was introduced earlier when the nodemask was moved offstack
to the scratch area despite the note in the comment block regarding the
incoming ref.
Move the remaining 'put of the incoming "mpol" to the common exit path to
drop the reference.
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Acked-by: Dan Carpenter <error27@gmail.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
OOM-waitqueue should be waken up when oom_disable is canceled. This is a
fix for 3c11ecf448eff8f1 ("memcg: oom kill disable and oom status").
How to test:
Create a cgroup A...
1. set memory.limit and memory.memsw.limit to be small value
2. echo 1 > /cgroup/A/memory.oom_control, this disables oom-kill.
3. run a program which must cause OOM.
A program executed in 3 will sleep by oom_waiqueue in memcg. Then, how to
wake it up is problem.
1. echo 0 > /cgroup/A/memory.oom_control (enable OOM-killer)
2. echo big mem > /cgroup/A/memory.memsw.limit_in_bytes(allow more swap)
etc..
Without the patch, a task in slept can not be waken up.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Cc: Balbir Singh <balbir@in.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
* 'for-linus' of git://git.kernel.dk/linux-2.6-block:
block: Don't count_vm_events for discard bio in submit_bio.
cfq: fix recursive call in cfq_blkiocg_update_completion_stats()
cfq-iosched: Fixed boot warning with BLK_CGROUP=y and CFQ_GROUP_IOSCHED=n
cfq: Don't allow queue merges for queues that have no process references
block: fix DISCARD_BARRIER requests
cciss: set SCSI max cmd len to 16, as default is wrong
cpqarray: fix two more wrong section type
cpqarray: fix wrong __init type on pci probe function
drbd: Fixed a race between disk-attach and unexpected state changes
writeback: fix pin_sb_for_writeback
writeback: add missing requeue_io in writeback_inodes_wb
writeback: simplify and split bdi_start_writeback
writeback: simplify wakeup_flusher_threads
writeback: fix writeback_inodes_wb from writeback_inodes_sb
writeback: enforce s_umount locking in writeback_inodes_sb
writeback: queue work on stack in writeback_inodes_sb
writeback: fix writeback completion notifications
|
|
Reason: Further changes conflict with upstream fixes
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
|
|
This patch updates percpu allocator such that it can serve limited
amount of allocation before slab comes online. This is primarily to
allow slab to depend on working percpu allocator.
Two parameters, PERCPU_DYNAMIC_EARLY_SIZE and SLOTS, determine how
much memory space and allocation map slots are reserved. If this
reserved area is exhausted, WARN_ON_ONCE() will trigger and allocation
will fail till slab comes online.
The following changes are made to implement early alloc.
* pcpu_mem_alloc() now checks slab_is_available()
* Chunks are allocated using pcpu_mem_alloc()
* Init paths make sure ai->dyn_size is at least as large as
PERCPU_DYNAMIC_EARLY_SIZE.
* Initial alloc maps are allocated in __initdata and copied to
kmalloc'd areas once slab is online.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Christoph Lameter <cl@linux-foundation.org>
|
|
In pcpu_build_alloc_info() and pcpu_embed_first_chunk(), @dyn_size was
ssize_t, -1 meant auto-size, 0 forced 0 and positive meant minimum
size. There's no use case for forcing 0 and the upcoming early alloc
support always requires non-zero dynamic size. Make @dyn_size always
mean minimum dyn_size.
While at it, make pcpu_build_alloc_info() static which doesn't have
any external caller as suggested by David Rientjes.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: David Rientjes <rientjes@google.com>
|
|
per_cpu_ptr_to_phys() determines whether the passed in @addr belongs
to the first_chunk or not by just matching the address against the
address range of the base unit (unit0, used by cpu0). When an adress
from another cpu was passed in, it will always determine that the
address doesn't belong to the first chunk even when it does. This
makes the function return a bogus physical address which may lead to
crash.
This problem was discovered by Cliff Wickman while investigating a
crash during kdump on a SGI UV system.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Cliff Wickman <cpw@sgi.com>
Tested-by: Cliff Wickman <cpw@sgi.com>
Cc: stable@kernel.org
|
|
Merge reason: Go from -rc1 base to -rc3 base, merge in fixes.
|
|
Fix the following two trivial bugs in pcpu_build_alloc_info()
* we should memset group_cnt to 0 by size of group_cnt, not size of
group_map (both are of the same size, so the bug isn't dangerous)
* we can delete useless variable group_cnt_max.
Signed-off-by: Pavel V. Panteleev <pp_84@mail.ru>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
Remove all rcu head inits. We don't care about the RCU head state before passing
it to call_rcu() anyway. Only leave the "on_stack" variants so debugobjects can
keep track of objects on stack.
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Christoph Lameter <cl@linux-foundation.org>
Cc: Pekka Enberg <penberg@cs.helsinki.fi>
Cc: Matt Mackall <mpm@selenic.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
|
|
bdi_start_writeback now never gets a superblock passed, so we can just remove
that case. And to further untangle the code and flatten the call stack
split it into two trivial helpers for it's two callers.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/frederic/random-tracing into perf/core
|
|
We have been resisting new ftrace plugins and removing existing
ones, and kmemtrace has been superseded by kmem trace events
and perf-kmem, so we remove it.
Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
Acked-by: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Steven Rostedt <rostedt@goodmis.org>
[ remove kmemtrace from the makefile, handle slob too ]
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
|
|
Add the capacility to track data mmap()s. This can be used together
with PERF_SAMPLE_ADDR for data profiling.
Signed-off-by: Anton Blanchard <anton@samba.org>
[Updated code for stable perf ABI]
Signed-off-by: Eric B Munson <ebmunson@us.ibm.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Steven Rostedt <rostedt@goodmis.org>
LKML-Reference: <1274193049-25997-1-git-send-email-ebmunson@us.ibm.com>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
|
|
sync can currently take a really long time if a concurrent writer is
extending a file. The problem is that the dirty pages on the address
space grow in the same direction as write_cache_pages scans, so if
the writer keeps ahead of writeback, the writeback will not
terminate until the writer stops adding dirty pages.
For a data integrity sync, we only need to write the pages dirty at
the time we start the writeback, so we can stop scanning once we get
to the page that was at the end of the file at the time the scan
started.
This will prevent operations like copying a large file preventing
sync from completing as it will not write back pages that were
dirtied after the sync was started. This does not impact the
existing integrity guarantees, as any dirty page (old or new)
within the EOF range at the start of the scan will still be
captured.
This patch will not prevent sync from blocking on large writes into
holes. That requires more complex intervention while this patch only
addresses the common append-case of this sync holdoff.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
If a filesystem writes more than one page in ->writepage, write_cache_pages
fails to notice this and continues to attempt writeback when wbc->nr_to_write
has gone negative - this trace was captured from XFS:
wbc_writeback_start: towrt=1024
wbc_writepage: towrt=1024
wbc_writepage: towrt=0
wbc_writepage: towrt=-1
wbc_writepage: towrt=-5
wbc_writepage: towrt=-21
wbc_writepage: towrt=-85
This has adverse effects on filesystem writeback behaviour. write_cache_pages()
needs to terminate after a certain number of pages are written, not after a
certain number of calls to ->writepage are made. This is a regression
introduced by 17bc6c30cf6bfffd816bdc53682dd46fc34a2cf4 ("vfs: Add
no_nrwrite_index_update writeback control flag"), but cannot be reverted
directly due to subsequent bug fixes that have gone in on top of it.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
Minix: Clean up left over label
fix truncate inode time modification breakage
fix setattr error handling in sysfs, configfs
fcntl: return -EFAULT if copy_to_user fails
wrong type for 'magic' argument in simple_fill_super()
fix the deadlock in qib_fs
mqueue doesn't need make_bad_inode()
|
|
* 'for-linus' of git://git.kernel.dk/linux-2.6-block: (27 commits)
block: make blk_init_free_list and elevator_init idempotent
block: avoid unconditionally freeing previously allocated request_queue
pipe: change /proc/sys/fs/pipe-max-pages to byte sized interface
pipe: change the privilege required for growing a pipe beyond system max
pipe: adjust minimum pipe size to 1 page
block: disable preemption before using sched_clock()
cciss: call BUG() earlier
Preparing 8.3.8rc2
drbd: Reduce verbosity
drbd: use drbd specific ratelimit instead of global printk_ratelimit
drbd: fix hang on local read errors while disconnected
drbd: Removed the now empty w_io_error() function
drbd: removed duplicated #includes
drbd: improve usage of MSG_MORE
drbd: need to set socket bufsize early to take effect
drbd: improve network latency, TCP_QUICKACK
drbd: Revert "drbd: Create new current UUID as late as possible"
brd: support discard
Revert "writeback: fix WB_SYNC_NONE writeback from umount"
Revert "writeback: ensure that WB_SYNC_NONE writeback with sb pinned is sync"
...
|
|
Greg Thelen reported recent Johannes's stack diet patch makes kernel hang.
His test is following.
mount -t cgroup none /cgroups -o memory
mkdir /cgroups/cg1
echo $$ > /cgroups/cg1/tasks
dd bs=1024 count=1024 if=/dev/null of=/data/foo
echo $$ > /cgroups/tasks
echo 1 > /cgroups/cg1/memory.force_empty
Actually, This OOM hard to try logic have been corrupted since following
two years old patch.
commit a41f24ea9fd6169b147c53c2392e2887cc1d9247
Author: Nishanth Aravamudan <nacc@us.ibm.com>
Date: Tue Apr 29 00:58:25 2008 -0700
page allocator: smarter retry of costly-order allocations
Original intention was "return success if the system have shrinkable zones
though priority==0 reclaim was failure". But the above patch changed to
"return nr_reclaimed if .....". Oh, That forgot nr_reclaimed may be 0 if
priority==0 reclaim failure.
And Johannes's patch 0aeb2339e54e ("vmscan: remove all_unreclaimable scan
control") made it more corrupt. Originally, priority==0 reclaim failure
on memcg return 0, but this patch changed to return 1. It totally
confused memcg.
This patch fixes it completely.
Reported-by: Greg Thelen <gthelen@google.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Tested-by: Greg Thelen <gthelen@google.com>
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|
mtime and ctime should be changed only if the file size has actually
changed. Patches changing ext2 and tmpfs from vmtruncate to new truncate
sequence has caused regressions where they always update timestamps.
There is some strange cases in POSIX where truncate(2) must not update
times unless the size has acutally changed, see 6e656be89.
This area is all still rather buggy in different ways in a lot of
filesystems and needs a cleanup and audit (ideally the vfs will provide
a simple attribute or call to direct all filesystems exactly which
attributes to change). But coming up with the best solution will take a
while and is not appropriate for rc anyway.
So fix recent regression for now.
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
Conflicts:
fs/pipe.c
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
|
|
This reverts commit e913fc825dc685a444cb4c1d0f9d32f372f59861.
We are investigating a hang associated with the WB_SYNC_NONE changes,
so revert them for now.
Conflicts:
fs/fs-writeback.c
mm/page-writeback.c
Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6
* 'slub/urgent' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6:
SLUB: Allow full duplication of kmalloc array for 390
slub: move kmem_cache_node into it's own cacheline
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse:
mm: export generic_pipe_buf_*() to modules
fuse: support splice() reading from fuse device
fuse: allow splice to move pages
mm: export remove_from_page_cache() to modules
mm: export lru_cache_add_*() to modules
fuse: support splice() writing to fuse device
fuse: get page reference for readpages
fuse: use get_user_pages_fast()
fuse: remove unneeded variable
|
|
Cc: Christoph Hellwig <hch@lst.de>
Acked-by: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
Introduce a new truncate calling sequence into fs/mm subsystems. Rather than
setattr > vmtruncate > truncate, have filesystems call their truncate sequence
from ->setattr if filesystem specific operations are required. vmtruncate is
deprecated, and truncate_pagecache and inode_newsize_ok helpers introduced
previously should be used.
simple_setattr is introduced for simple in-ram filesystems to implement
the new truncate sequence. Eventually all filesystems should be converted
to implement a setattr, and the default code in notify_change should go
away.
simple_setsize is also introduced to perform just the ATTR_SIZE portion
of simple_setattr (ie. changing i_size and trimming pagecache).
To implement the new truncate sequence:
- filesystem specific manipulations (eg freeing blocks) must be done in
the setattr method rather than ->truncate.
- vmtruncate can not be used by core code to trim blocks past i_size in
the event of write failure after allocation, so this must be performed
in the fs code.
- convert usage of helpers block_write_begin, nobh_write_begin,
cont_write_begin, and *blockdev_direct_IO* to use _newtrunc postfixed
variants. These avoid calling vmtruncate to trim blocks (see previous).
- inode_setattr should not be used. generic_setattr is a new function
to be used to copy simple attributes into the generic inode.
- make use of the better opportunity to handle errors with the new sequence.
Big problem with the previous calling sequence: the filesystem is not called
until i_size has already changed. This means it is not allowed to fail the
call, and also it does not know what the previous i_size was. Also, generic
code calling vmtruncate to truncate allocated blocks in case of error had
no good way to return a meaningful error (or, for example, atomically handle
block deallocation).
Cc: Christoph Hellwig <hch@lst.de>
Acked-by: Jan Kara <jack@suse.cz>
Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
We don't name our generic fsync implementations very well currently.
The no-op implementation for in-memory filesystems currently is called
simple_sync_file which doesn't make too much sense to start with,
the the generic one for simple filesystems is called simple_fsync
which can lead to some confusion.
This patch renames the generic file fsync method to generic_file_fsync
to match the other generic_file_* routines it is supposed to be used
with, and the no-op implementation to noop_fsync to make it obvious
what to expect. In addition add some documentation for both methods.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|