aboutsummaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2012-06-06perf/x86: Use rdpmc() rather than rdmsr() when possible in the kernelVince Weaver
The rdpmc instruction is faster than the equivelant rdmsr call, so use it when possible in the kernel. The perfctr kernel patches did this, after extensive testing showed rdpmc to always be faster (One can look in etc/costs in the perfctr-2.6 package to see a historical list of the overhead). I have done some tests on a 3.2 kernel, the kernel module I used was included in the first posting of this patch: rdmsr rdpmc Core2 T9900: 203.9 cycles 30.9 cycles AMD fam0fh: 56.2 cycles 9.8 cycles Atom 6/28/2: 129.7 cycles 50.6 cycles The speedup of using rdpmc is large. [ It's probably possible (and desirable) to do this without requiring a new field in the hw_perf_event structure, but the fixed events make this tricky. ] Signed-off-by: Vince Weaver <vweaver1@eecs.utk.edu> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/alpine.DEB.2.00.1203011724030.26934@cl320.eecs.utk.edu Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-06-06x86: Add rdpmcl()Andi Kleen
Add a version of rdpmc() that directly reads into a u64 Signed-off-by: Andi Kleen <ak@linux.intel.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1338944211-28275-4-git-send-email-andi@firstfloor.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-06-06perf/x86: Fix wrmsrl() debug wrapperPeter Zijlstra
Move the wrmslr() debug wrapper to the common header now that all the include games are gone. Also clean it up a bit to avoid multiple evaluation of the argument. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/n/tip-l4gkfnivwv4yi5mqxjlovymx@git.kernel.org Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-06-06Merge branch 'perf/urgent' into perf/coreIngo Molnar
Eliminate a conflict in a patch I am going to apply. Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-06-06perf/x86: Update SNB PEBS constraintsPeter Zijlstra
Afaict there's no need to (incompletely) iterate the MEM_UOPS_RETIRED.* umask state. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Stephane Eranian <eranian@google.com> Link: http://lkml.kernel.org/r/1338884803.28282.153.camel@twins Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-06-06perf/x86: Enable/Add IvyBridge hardware supportPeter Zijlstra
Implement rudimentary IVB perf support. The SDM states its identical to SNB with exception of the exact event tables, but a quick look suggests they're similar enough. Also mark SNB-EP as broken for now. Requested-and-tested-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Stephane Eranian <eranian@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1338884803.28282.153.camel@twins Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-06-06perf/x86: Implement cycles:p for SNB/IVBPeter Zijlstra
Now that there's finally a chip with working PEBS (IvyBridge), we can enable the hardware and implement cycles:p for SNB/IVB. Cc: Stephane Eranian <eranian@google.com> Requested-and-tested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1338884803.28282.153.camel@twins Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-06-06perf/x86: Fix Intel shared extra MSR allocationPeter Zijlstra
Zheng Yan reported that event group validation can wreck event state when Intel extra_reg allocation changes event state. Validation shouldn't change any persistent state. Cloning events in validate_{event,group}() isn't really pretty either, so add a few special cases to avoid modifying the event state. The code is restructured to minimize the special case impact. Reported-by: Zheng Yan <zheng.z.yan@linux.intel.com> Acked-by: Stephane Eranian <eranian@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1338903031.28282.175.camel@twins Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-06-06uprobes: Kill uprobes_srcu/uprobe_srcu_idOleg Nesterov
Kill the no longer needed uprobes_srcu/uprobe_srcu_id code. It doesn't really work anyway. synchronize_srcu() can only synchronize with the code "inside" the srcu_read_lock/srcu_read_unlock section, while uprobe_pre_sstep_notifier() does srcu_read_lock() _after_ we already hit the breakpoint. I guess this probably works "in practice". synchronize_srcu() is slow and it implies synchronize_sched(), and the probed task enters the non- preemptible section at the start of exception handler. Still this is not right at least in theory, and task->uprobe_srcu_id blows task_struct. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: Anton Arapov <anton@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20120529193008.GG8057@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-06-06uprobes: Teach handle_swbp() to rely on "is_swbp" rather than uprobes_srcuOleg Nesterov
Currently handle_swbp() assumes that it can't race with unregister, so it roughly does: if (find_uprobe(vaddr)) process_uprobe(); else send_sig(SIGTRAP); This relies on the not-really-working uprobes_srcu code we are going to remove, see the next patch. With this patch we rely on the result of is_swbp_at_addr(bp_vaddr) if find_uprobe() fails. If is_swbp == 1, then we hit the normal int3, we should send SIGTRAP. If is_swbp == 0, we raced with uprobe_unregister(), we simply restart this insn again. The "difficult" case is is_swbp == -EFAULT, when we can't read this memory. In this case I think we should restart too, and this is more correct compared to the current code which sends SIGTRAP. Ignoring ENOMEM/etc from get_user_pages(), this can only happen if another thread unmaps this memory before find_active_uprobe() takes mmap_sem. It would be better to pretend it was unmapped before this insn was executed, restart, and get SIGSEGV. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: Anton Arapov <anton@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20120529192947.GF8057@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-06-06uprobes: Change register_for_each_vma() to take mm->mmap_sem for writingOleg Nesterov
Change register_for_each_vma() to take mm->mmap_sem for writing. This is a bit unfortunate but hopefully not too bad, this is the slow path anyway. This is needed to ensure that find_active_uprobe() can not race with uprobe_register() which adds the new bp at the same bp_vaddr, after find_uprobe() fails and before is_swbp_at_addr_fast() checks the memory. IOW, this is needed to ensure that if find_active_uprobe() returns NULL but is_swbp == true, we can safely assume that it was the "normal" int3 and we should send SIGTRAP. There is another reason for this change. We are going to replace uprobes_state->count with MMF_ flags set by register/unregister and cleared by find_active_uprobe(), and set/clear shouldn't race with each other. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: Anton Arapov <anton@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20120529192928.GE8057@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-06-06uprobes: Teach find_active_uprobe() to provide the "is_swbp" infoOleg Nesterov
A separate patch to simplify the review, and for the documentation. The patch adds another "int *is_swbp" argument to find_active_uprobe(), so far its only caller doesn't use this info. With this patch find_active_uprobe() additionally does: - if find_vma() + ->vm_start check fails, *is_swbp = -EFAULT - otherwise, if valid_vma() + find_uprobe() fails, it holds the result of is_swbp_at_addr(), can be negative too. The latter is only possible if we raced with another thread which did munmap/etc after we hit this bp. IOW. If find_active_uprobe(&is_swbp) returns NULL, the caller can look at is_swbp to figure out whether the current insn is bp or not, or detect the race with another thread if it is negative. Note: I think that performance-wise this change is fine. This adds is_swbp_at_addr(), but only if we raced with uprobe_unregister() or if we hit the "normal" int3 but this mm has uprobes as well. And even in this case the slow read_opcode() path is very unlikely, this insn recently triggered do_int3(), __copy_from_user_inatomic() shouldn't fail in the likely case. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: Anton Arapov <anton@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20120529192914.GD8057@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-06-06uprobes: Introduce find_active_uprobe() helperOleg Nesterov
No functional changes. Move the "find uprobe" code from handle_swbp() to the new helper, find_active_uprobe(). Note: with or without this change, the find-active-uprobe logic is not exactly right. We can race with another thread which unmaps the memory with the valid uprobe before we take mm->mmap_sem. We can't find this uprobe simply because find_vma() fails. In this case we wrongly assume that this trap was not caused by uprobe and send the erroneous SIGTRAP. See the next changes. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: Anton Arapov <anton@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20120529192857.GC8057@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-06-06uprobes: Change read_opcode() to use FOLL_FORCEOleg Nesterov
set_orig_insn()->read_opcode() should not fail if the probed task did mprotect() after uprobe_register(), change it to use FOLL_FORCE. Without FOLL_WRITE this doesn't have any "side" effect but allows to read the !VM_READ memory. There is another reason for this change, we are going to use is_swbp_at_addr() from handle_swbp() which can race with another thread doing mprotect(). Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: Anton Arapov <anton@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20120529192759.GB8057@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-06-06uprobes: Optimize is_swbp_at_addr() for current->mmOleg Nesterov
Change is_swbp_at_addr() to try to avoid the costly read_opcode() if mm == current->mm, __copy_from_user_inatomic() should succeed in the likely case. Currently this optimization is not important, but we are going to add more is_swbp_at_addr(current->mm) callers. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com> Cc: Anton Arapov <anton@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Cc: Peter Zijlstra <peterz@infradead.org> Link: http://lkml.kernel.org/r/20120529192744.GA8057@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-06-06perf/x86: Check user address explicitly in copy_from_user_nmi()Arun Sharma
Signed-off-by: Arun Sharma <asharma@fb.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: linux-kernel@vger.kernel.org Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1334961696-19580-5-git-send-email-asharma@fb.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-06-06perf/x86: Check if user fp is validArun Sharma
Signed-off-by: Arun Sharma <asharma@fb.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: linux-kernel@vger.kernel.org Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1334961696-19580-4-git-send-email-asharma@fb.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-06-06perf: Limit callchains to 127Arun Sharma
Stack depth of 255 seems excessive, given that copy_from_user_nmi() could be slow. Signed-off-by: Arun Sharma <asharma@fb.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: linux-kernel@vger.kernel.org Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1334961696-19580-3-git-send-email-asharma@fb.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-06-06perf/x86: Allow multiple stacksArun Sharma
Without this patch, applications with two different stack regions (eg: native stack vs JIT stack) get truncated callchains even when RBP chaining is present. GDB shows proper stack traces and the frame pointer chaining is intact. This patch disables the (fp < RSP) check, hoping that other checks in the code save the day for us. In our limited testing, this didn't seem to break anything. In the long term, we could potentially have userspace advise the kernel on the range of valid stack addresses, so we don't spend a lot of time unwinding from bogus addresses. Signed-off-by: Arun Sharma <asharma@fb.com> CC: Arnaldo Carvalho de Melo <acme@redhat.com> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Mike Galbraith <efault@gmx.de> Cc: Paul Mackerras <paulus@samba.org> Cc: Stephane Eranian <eranian@google.com> Cc: Namhyung Kim <namhyung.kim@lge.com> Cc: Tom Zanussi <tzanussi@gmail.com> Cc: linux-kernel@vger.kernel.org Cc: linux-perf-users@vger.kernel.org Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1334961696-19580-2-git-send-email-asharma@fb.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-06-06perf/x86: Update SNB PEBS constraintsPeter Zijlstra
Afaict there's no need to (incompletely) iterate the MEM_UOPS_RETIRED.* umask state. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Stephane Eranian <eranian@google.com> Link: http://lkml.kernel.org/r/1338884803.28282.153.camel@twins Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-06-06perf/x86: Enable/Add IvyBridge hardware supportPeter Zijlstra
Implement rudimentary IVB perf support. The SDM states its identical to SNB with exception of the exact event tables, but a quick look suggests they're similar enough. Also mark SNB-EP as broken for now. Requested-and-tested-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Stephane Eranian <eranian@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1338884803.28282.153.camel@twins Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-06-06perf/x86: Implement cycles:p for SNB/IVBPeter Zijlstra
Now that there's finally a chip with working PEBS (IvyBridge), we can enable the hardware and implement cycles:p for SNB/IVB. Cc: Stephane Eranian <eranian@google.com> Requested-and-tested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1338884803.28282.153.camel@twins Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-06-06perf/x86: Fix Intel shared extra MSR allocationPeter Zijlstra
Zheng Yan reported that event group validation can wreck event state when Intel extra_reg allocation changes event state. Validation shouldn't change any persistent state. Cloning events in validate_{event,group}() isn't really pretty either, so add a few special cases to avoid modifying the event state. The code is restructured to minimize the special case impact. Reported-by: Zheng Yan <zheng.z.yan@linux.intel.com> Acked-by: Stephane Eranian <eranian@google.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/1338903031.28282.175.camel@twins Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-06-06x86/decoder: Fix bsr/bsf/jmpe decoding with operand-size prefixMasami Hiramatsu
Fix the x86 instruction decoder to decode bsr/bsf/jmpe with operand-size prefix (66h). This fixes the test case failure reported by Linus, attached below. bsf/bsr/jmpe have a special encoding. Opcode map in Intel Software Developers Manual vol2 says they have TZCNT/LZCNT variants if it has F3h prefix. However, there is no information if it has other 66h or F2h prefixes. Current instruction decoder supposes that those are bad instructions, but it actually accepts at least operand-size prefixes. H. Peter Anvin further explains: " TZCNT/LZCNT are F3 + BSF/BSR exactly because the F2 and F3 prefixes have historically been no-ops with most instructions. This allows software to unconditionally use the prefixed versions and get TZCNT/LZCNT on the processors that have them if they don't care about the difference. " This fixes errors reported by test_get_len: Warning: arch/x86/tools/test_get_len found difference at <em_bsf>:ffffffff81036d87 Warning: ffffffff81036de5: 66 0f bc c2 bsf %dx,%ax Warning: objdump says 4 bytes, but insn_get_length() says 3 Warning: arch/x86/tools/test_get_len found difference at <em_bsr>:ffffffff81036ea6 Warning: ffffffff81036f04: 66 0f bd c2 bsr %dx,%ax Warning: objdump says 4 bytes, but insn_get_length() says 3 Warning: decoded and checked 13298882 instructions with 2 warnings Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Reported-by: Pekka Enberg <penberg@kernel.org> Signed-off-by: Masami Hiramatsu <masami.hiramatsu.pt@hitachi.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: <yrl.pp-manager.tt@hitachi.com> Link: http://lkml.kernel.org/r/20120604150911.22338.43296.stgit@localhost.localdomain Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-06-06Merge tag 'perf-urgent-for-mingo' of ↵Ingo Molnar
git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux into perf/urgent Pull perf fixes from Arnaldo Carvalho de Melo: * Endianness fixes from Jiri Olsa * Fixes for make perf tarball * Fix for DSO name in perf script callchains, from David Ahern * Segfault fixes for perf top --callchain, from Namhyung Kim * Minor function result fixes from Srikar Dronamraju * Add missing 3rd ioctl parameter, from Namhyung Kim * Fix pager usage in minimal embedded systems, from Avik Sil Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-06-05Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse Pull fuse updates from Miklos Szeredi. * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: fuse: fix blksize calculation fuse: fix stat call on 32 bit platforms fuse: optimize fallocate on permanent failure fuse: add FALLOCATE operation fuse: Convert to kstrtoul_from_user
2012-06-05Merge branch 'sched-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler fixes from Ingo Molnar. * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched: Remove NULL assignment of dattr_cur sched: Remove the last NULL entry from sched_feat_names sched: Make sched_feat_names const sched/rt: Fix SCHED_RR across cgroups sched: Move nr_cpus_allowed out of 'struct sched_rt_entity' sched: Make sure to not re-read variables after validation sched: Fix SD_OVERLAP sched: Don't try allocating memory from offline nodes sched/nohz: Fix rq->cpu_load calculations some more sched/x86: Use cpu_llc_shared_mask(cpu) for coregroup_mask
2012-06-04Pull 'for-linus' branches of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/viro/{signal,vfs} Pull signal and vfs compile breakage fixes from Al Viro. * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/signal: fixups for signal breakage * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: nommu: fix compilation of nommu.c
2012-06-04Merge branch 'for-next' of git://git.samba.org/sfrench/cifs-2.6Linus Torvalds
Pull cifs fixes from Steve French. * 'for-next' of git://git.samba.org/sfrench/cifs-2.6: CIFS: Move get_next_mid to ops struct CIFS: Make accessing is_valid_oplock/dump_detail ops struct field safe CIFS: Improve identation in cifs_unlock_range CIFS: Fix possible wrong memory allocation
2012-06-04fixups for signal breakageAl Viro
Obvious brainos spotted by Geert. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-06-04nommu: fix compilation of nommu.cGreg Ungerer
Compiling 3.5-rc1 for nommu targets gives: CC mm/nommu.o mm/nommu.c: In function ‘sys_mmap_pgoff’: mm/nommu.c:1489:2: error: ‘ret’ undeclared (first use in this function) mm/nommu.c:1489:2: note: each undeclared identifier is reported only once for each function it appears in It is trivially fixed by replacing 'ret' with the local variable that is already defined for the return value 'retval'. Signed-off-by: Greg Ungerer <gerg@uclinux.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2012-06-04Merge tag 'stable/frontswap.v16-tag' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/konrad/mm Pull frontswap feature from Konrad Rzeszutek Wilk: "Frontswap provides a "transcendent memory" interface for swap pages. In some environments, dramatic performance savings may be obtained because swapped pages are saved in RAM (or a RAM-like device) instead of a swap disk. This tag provides the basic infrastructure along with some changes to the existing backends." Fix up trivial conflict in mm/Makefile due to removal of swap token code changing a line next to the new frontswap entry. This pull request came in before the merge window even opened, it got delayed to after the merge window by me just wanting to make sure it had actual users. Apparently IBM is using this on their embedded side, and Jan Beulich says that it's already made available for SLES and OpenSUSE users. Also acked by Rik van Riel, and Konrad points to other people liking it too. So in it goes. By Dan Magenheimer (4) and Konrad Rzeszutek Wilk (2) via Konrad Rzeszutek Wilk * tag 'stable/frontswap.v16-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/mm: frontswap: s/put_page/store/g s/get_page/load MAINTAINER: Add myself for the frontswap API mm: frontswap: config and doc files mm: frontswap: core frontswap functionality mm: frontswap: core swap subsystem hooks and headers mm: frontswap: add frontswap header file
2012-06-04Merge branches 'irq-urgent-for-linus' and 'smp-hotplug-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull irq and smpboot updates from Thomas Gleixner: "Just cleanup patches with no functional change and a fix for suspend issues." * 'irq-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: genirq: Introduce irq_do_set_affinity() to reduce duplicated code genirq: Add IRQS_PENDING for nested and simple irq * 'smp-hotplug-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: smpboot, idle: Fix comment mismatch over idle_threads_init() smpboot, idle: Optimize calls to smp_processor_id() in idle_threads_init()
2012-06-04Merge branch 'timers-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull timer updates from Thomas Gleixner: "The clocksource driver is pure hardware enablement and the skew option is default off, well tested and non dangerous." * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: tick: Move skew_tick option into the HIGH_RES_TIMER section clocksource: em_sti: Add DT support clocksource: em_sti: Emma Mobile STI driver clockevents: Make clockevents_config() a global symbol tick: Add tick skew boot option
2012-06-04vfs: Fix /proc/<tid>/fdinfo/<fd> file handlingLinus Torvalds
Cyrill Gorcunov reports that I broke the fdinfo files with commit 30a08bf2d31d ("proc: move fd symlink i_mode calculations into tid_fd_revalidate()"), and he's quite right. The tid_fd_revalidate() function is not just used for the <tid>/fd symlinks, it's also used for the <tid>/fdinfo/<fd> files, and the permission model for those are different. So do the dynamic symlink permission handling just for symlinks, making the fdinfo files once more appear as the proper regular files they are. Of course, Al Viro argued (probably correctly) that we shouldn't do the symlink permission games at all, and make the symlinks always just be the normal 'lrwxrwxrwx'. That would have avoided this issue too, but since somebody noticed that the permissions had changed (which was the reason for that original commit 30a08bf2d31d in the first place), people do apparently use this feature. [ Basically, you can use the symlink permission data as a cheap "fdinfo" replacement, since you see whether the file is open for reading and/or writing by just looking at st_mode of the symlink. So the feature does make sense, even if the pain it has caused means we probably shouldn't have done it to begin with. ] Reported-and-tested-by: Cyrill Gorcunov <gorcunov@openvz.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-06-03gpio/samsung: fix the typo 'exynos5_xxx' instead of 'exonys5_xxx'Kukjin Kim
Should be 'exynos5_xxx' instead of 'exonys5_xxx'. It happened at the commit 30b842889eea ("Merge tag 'soc2' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc") during v3.5 merge window. Signed-off-by: Kukjin Kim <kgene.kim@samsung.com> [ My bad - Linus ] Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-06-03Merge branch 'pm-acpi' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull some left-over PM patches from Rafael J. Wysocki. * 'pm-acpi' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: ACPI / PM: Make acpi_pm_device_sleep_state() follow the specification ACPI / PM: Make __acpi_bus_get_power() cover D3cold correctly ACPI / PM: Fix error messages in drivers/acpi/bus.c rtc-cmos / PM: report wakeup event on ACPI RTC alarm ACPI / PM: Generate wakeup events on fixed power button
2012-06-03Revert "mm: compaction: handle incorrect MIGRATE_UNMOVABLE type pageblocks"Linus Torvalds
This reverts commit 5ceb9ce6fe9462a298bb2cd5c9f1ca6cb80a0199. That commit seems to be the cause of the mm compation list corruption issues that Dave Jones reported. The locking (or rather, absense there-of) is dubious, as is the use of the 'page' variable once it has been found to be outside the pageblock range. So revert it for now, we can re-visit this for 3.6. If we even need to: as Minchan Kim says, "The patch wasn't a bug fix and even test workload was very theoretical". Reported-and-tested-by: Dave Jones <davej@redhat.com> Acked-by: Hugh Dickins <hughd@google.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@gmail.com> Acked-by: Minchan Kim <minchan@kernel.org> Cc: Bartlomiej Zolnierkiewicz <b.zolnierkie@samsung.com> Cc: Kyungmin Park <kyungmin.park@samsung.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-06-03mm: fix warning in __set_page_dirty_nobuffersHugh Dickins
New tmpfs use of !PageUptodate pages for fallocate() is triggering the WARNING: at mm/page-writeback.c:1990 when __set_page_dirty_nobuffers() is called from migrate_page_copy() for compaction. It is anomalous that migration should use __set_page_dirty_nobuffers() on an address_space that does not participate in dirty and writeback accounting; and this has also been observed to insert surprising dirty tags into a tmpfs radix_tree, despite tmpfs not using tags at all. We should probably give migrate_page_copy() a better way to preserve the tag and migrate accounting info, when mapping_cap_account_dirty(). But that needs some more work: so in the interim, avoid the warning by using a simple SetPageDirty on PageSwapBacked pages. Reported-and-tested-by: Dave Jones <davej@redhat.com> Signed-off-by: Hugh Dickins <hughd@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-06-03vfs: move inode stat information closer togetherLinus Torvalds
The comment above it says "Stat data, not accessed from path walking", but in fact some of inode fields we use for the common stat data was way down at the end of the inode, causing unnecessary cache misses for the common stat operations. The inode structure is pretty big, and this can change padding depending on field width, but at least on the common 64-bit configurations this doesn't change the size. Some of our inode layout has historically been to tro to avoid unnecessary padding fields, but cache locality is at least as important for layout, if not more. Noticed by looking at kernel profiles, and noticing that the "i_blkbits" access stood out like a sore thumb. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-06-02Linux 3.5-rc1v3.5-rc1Linus Torvalds
2012-06-02Merge tag 'dm-3.5-changes-1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-dm Pull device-mapper updates from Alasdair G Kergon: "Improve multipath's retrying mechanism in some defined circumstances and provide a simple reserve/release mechanism for userspace tools to access thin provisioning metadata while the pool is in use." * tag 'dm-3.5-changes-1' of git://git.kernel.org/pub/scm/linux/kernel/git/agk/linux-dm: dm thin: provide userspace access to pool metadata dm thin: use slab mempools dm mpath: allow ioctls to trigger pg init dm mpath: delay retry of bypassed pg dm mpath: reduce size of struct multipath
2012-06-03dm thin: provide userspace access to pool metadataJoe Thornber
This patch implements two new messages that can be sent to the thin pool target allowing it to take a snapshot of the _metadata_. This, read-only snapshot can be accessed by userland, concurrently with the live target. Only one metadata snapshot can be held at a time. The pool's status line will give the block location for the current msnap. Since version 0.1.5 of the userland thin provisioning tools, the thin_dump program displays the msnap as follows: thin_dump -m <msnap root> <metadata dev> Available here: https://github.com/jthornber/thin-provisioning-tools Now that userland can access the metadata we can do various things that have traditionally been kernel side tasks: i) Incremental backups. By using metadata snapshots we can work out what blocks have changed over time. Combined with data snapshots we can ensure the data doesn't change while we back it up. A short proof of concept script can be found here: https://github.com/jthornber/thinp-test-suite/blob/master/incremental_backup_example.rb ii) Migration of thin devices from one pool to another. iii) Merging snapshots back into an external origin. iv) Asyncronous replication. Signed-off-by: Joe Thornber <ejt@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-06-03dm thin: use slab mempoolsMike Snitzer
Use dedicated caches prefixed with a "dm_" name rather than relying on kmalloc mempools backed by generic slab caches so the memory usage of thin provisioning (and any leaks) can be accounted for independently. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-06-03dm mpath: allow ioctls to trigger pg initMikulas Patocka
After the failure of a group of paths, any alternative paths that need initialising do not become available until further I/O is sent to the device. Until this has happened, ioctls return -EAGAIN. With this patch, new paths are made available in response to an ioctl too. The processing of the ioctl gets delayed until this has happened. Instead of returning an error, we submit a work item to kmultipathd (that will potentially activate the new path) and retry in ten milliseconds. Note that the patch doesn't retry an ioctl if the ioctl itself fails due to a path failure. Such retries should be handled intelligently by the code that generated the ioctl in the first place, noting that some SCSI commands should not be retried because they are not idempotent (XOR write commands). For commands that could be retried, there is a danger that if the device rejected the SCSI command, the path could be errorneously marked as failed, and the request would be retried on another path which might fail too. It can be determined if the failure happens on the device or on the SCSI controller, but there is no guarantee that all SCSI drivers set these flags correctly. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-06-03dm mpath: delay retry of bypassed pgMike Christie
If I/O needs retrying and only bypassed priority groups are available, set the pg_init_delay_retry flag to wait before retrying. If, for example, the reason for the bypass is that the controller is getting reset or there is a firmware upgrade happening, retrying right away would cause a flood of log messages and retries for what could be a few seconds or even several minutes. Signed-off-by: Mike Christie <michaelc@cs.wisc.edu> Acked-by: Mike Snitzer <snitzer@redhat.com> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-06-03dm mpath: reduce size of struct multipathMike Snitzer
Move multipath structure's 'lock' and 'queue_size' members to eliminate two 4-byte holes. Also use a bit within a single unsigned int for each existing flag (saves 8-bytes). This allows future flags to be added without each consuming an unsigned int. Signed-off-by: Mike Snitzer <snitzer@redhat.com> Acked-by: Hannes Reinecke <hare@suse.de> Signed-off-by: Alasdair G Kergon <agk@redhat.com>
2012-06-02Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds
Pull networking updates from David Miller: 1) Make syn floods consume significantly less resources by a) Not pre-COW'ing routing metrics for SYN/ACKs b) Mirroring the device queue mapping of the SYN for the SYN/ACK reply. Both from Eric Dumazet. 2) Fix calculation errors in Byte Queue Limiting, from Hiroaki SHIMODA. 3) Validate the length requested when building a paged SKB for a socket, so we don't overrun the page vector accidently. From Jason Wang. 4) When netlabel is disabled, we abort all IP option processing when we see a CIPSO option. This isn't the right thing to do, we should simply skip over it and continue processing the remaining options (if any). Fix from Paul Moore. 5) SRIOV fixes for the mellanox driver from Jack orgenstein and Marcel Apfelbaum. 6) 8139cp enables the receiver before the ring address is properly programmed, which potentially lets the device crap over random memory. Fix from Jason Wang. 7) e1000/e1000e fixes for i217 RST handling, and an improper buffer address reference in jumbo RX frame processing from Bruce Allan and Sebastian Andrzej Siewior, respectively. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: fec_mpc52xx: fix timestamp filtering mcs7830: Implement link state detection e1000e: fix Rapid Start Technology support for i217 e1000: look into the page instead of skb->data for e1000_tbi_adjust_stats() r8169: call netif_napi_del at errpaths and at driver unload tcp: reflect SYN queue_mapping into SYNACK packets tcp: do not create inetpeer on SYNACK message 8139cp/8139too: terminate the eeprom access with the right opmode 8139cp: set ring address before enabling receiver cipso: handle CIPSO options correctly when NetLabel is disabled net: sock: validate data_len before allocating skb in sock_alloc_send_pskb() bql: Avoid possible inconsistent calculation. bql: Avoid unneeded limit decrement. bql: Fix POSDIFF() to integer overflow aware. net/mlx4_core: Fix obscure mlx4_cmd_box parameter in QUERY_DEV_CAP net/mlx4_core: Check port out-of-range before using in mlx4_slave_cap net/mlx4_core: Fixes for VF / Guest startup flow net/mlx4_en: Fix improper use of "port" parameter in mlx4_en_event net/mlx4_core: Fix number of EQs used in ICM initialisation net/mlx4_core: Fix the slave_id out-of-range test in mlx4_eq_int
2012-06-02Merge branch 'x86-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull straggler x86 fixes from Peter Anvin: "Three groups of patches: - EFI boot stub documentation and the ability to print error messages; - Removal for PTRACE_ARCH_PRCTL for x32 (obsolete interface which should never have been ported, and the port is broken and potentially dangerous.) - ftrace stack corruption fixes. I'm not super-happy about the technical implementation, but it is probably the least invasive in the short term. In the future I would like a single method for nesting the debug stack, however." * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86, x32, ptrace: Remove PTRACE_ARCH_PRCTL for x32 x86, efi: Add EFI boot stub documentation x86, efi; Add EFI boot stub console support x86, efi: Only close open files in error path ftrace/x86: Do not change stacks in DEBUG when calling lockdep x86: Allow nesting of the debug stack IDT setting x86: Reset the debug_stack update counter ftrace: Use breakpoint method to update ftrace caller ftrace: Synchronize variable setting with breakpoints
2012-06-02tty: Revert the tty locking series, it needs more workLinus Torvalds
This reverts the tty layer change to use per-tty locking, because it's not correct yet, and fixing it will require some more deep surgery. The main revert is d29f3ef39be4 ("tty_lock: Localise the lock"), but there are several smaller commits that built upon it, they also get reverted here. The list of reverted commits is: fde86d310886 - tty: add lockdep annotations 8f6576ad476b - tty: fix ldisc lock inversion trace d3ca8b64b97e - pty: Fix lock inversion b1d679afd766 - tty: drop the pty lock during hangup abcefe5fc357 - tty/amiserial: Add missing argument for tty_unlock() fd11b42e3598 - cris: fix missing tty arg in wait_event_interruptible_tty call d29f3ef39be4 - tty_lock: Localise the lock The revert had a trivial conflict in the 68360serial.c staging driver that got removed in the meantime. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>