aboutsummaryrefslogtreecommitdiff
path: root/include/trace
AgeCommit message (Collapse)Author
2012-07-24Merge branch 'for-3.6' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wqLinus Torvalds
Pull workqueue changes from Tejun Heo: "There are three major changes. - WQ_HIGHPRI has been reimplemented so that high priority work items are served by worker threads with -20 nice value from dedicated highpri worker pools. - CPU hotplug support has been reimplemented such that idle workers are kept across CPU hotplug events. This makes CPU hotplug cheaper (for PM) and makes the code simpler. - flush_kthread_work() has been reimplemented so that a work item can be freed while executing. This removes an annoying behavior difference between kthread_worker and workqueue." * 'for-3.6' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: workqueue: fix spurious CPU locality WARN from process_one_work() kthread_worker: reimplement flush_kthread_work() to allow freeing the work item being executed kthread_worker: reorganize to prepare for flush_kthread_work() reimplementation workqueue: simplify CPU hotplug code workqueue: remove CPU offline trustee workqueue: don't butcher idle workers on an offline CPU workqueue: reimplement CPU online rebinding to handle idle workers workqueue: drop @bind from create_worker() workqueue: use mutex for global_cwq manager exclusion workqueue: ROGUE workers are UNBOUND workers workqueue: drop CPU_DYING notifier operation workqueue: perform cpu down operations from low priority cpu_notifier() workqueue: reimplement WQ_HIGHPRI using a separate worker_pool workqueue: introduce NR_WORKER_POOLS and for_each_worker_pool() workqueue: separate out worker_pool flags workqueue: use @pool instead of @gcwq or @cpu where applicable workqueue: factor out worker_pool from global_cwq workqueue: don't use WQ_HIGHPRI for unbound workqueues
2012-07-24Merge tag 'kvm-3.6-1' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds
Pull KVM updates from Avi Kivity: "Highlights include - full big real mode emulation on pre-Westmere Intel hosts (can be disabled with emulate_invalid_guest_state=0) - relatively small ppc and s390 updates - PCID/INVPCID support in guests - EOI avoidance; 3.6 guests should perform better on 3.6 hosts on interrupt intensive workloads) - Lockless write faults during live migration - EPT accessed/dirty bits support for new Intel processors" Fix up conflicts in: - Documentation/virtual/kvm/api.txt: Stupid subchapter numbering, added next to each other. - arch/powerpc/kvm/booke_interrupts.S: PPC asm changes clashing with the KVM fixes - arch/s390/include/asm/sigp.h, arch/s390/kvm/sigp.c: Duplicated commits through the kvm tree and the s390 tree, with subsequent edits in the KVM tree. * tag 'kvm-3.6-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (93 commits) KVM: fix race with level interrupts x86, hyper: fix build with !CONFIG_KVM_GUEST Revert "apic: fix kvm build on UP without IOAPIC" KVM guest: switch to apic_set_eoi_write, apic_write apic: add apic_set_eoi_write for PV use KVM: VMX: Implement PCID/INVPCID for guests with EPT KVM: Add x86_hyper_kvm to complete detect_hypervisor_platform check KVM: PPC: Critical interrupt emulation support KVM: PPC: e500mc: Fix tlbilx emulation for 64-bit guests KVM: PPC64: booke: Set interrupt computation mode for 64-bit host KVM: PPC: bookehv: Add ESR flag to Data Storage Interrupt KVM: PPC: bookehv64: Add support for std/ld emulation. booke: Added crit/mc exception handler for e500v2 booke/bookehv: Add host crit-watchdog exception support KVM: MMU: document mmu-lock and fast page fault KVM: MMU: fix kvm_mmu_pagetable_walk tracepoint KVM: MMU: trace fast page fault KVM: MMU: fast path of handling guest page fault KVM: MMU: introduce SPTE_MMU_WRITEABLE bit KVM: MMU: fold tlb flush judgement into mmu_spte_update ...
2012-07-22Merge branch 'perf-core-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf events changes from Ingo Molnar: "- kernel side: - Intel uncore PMU support for Nehalem and Sandy Bridge CPUs, we support both the events available via the MSR and via the PCI access space. - various uprobes cleanups and restructurings - PMU driver quirks by microcode version and required x86 microcode loader cleanups/robustization - various tracing robustness updates - static keys: remove obsolete static_branch() - tooling side: - GTK browser improvements - perf report browser: support screenshots to file - more automated tests - perf kvm improvements - perf bench refinements - build environment improvements - pipe mode improvements - libtraceevent updates, we have now hopefully merged most bits with the out of tree forked code base ... and many other goodies." * 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (138 commits) tracing: Check for allocation failure in __tracing_open() perf/x86: Fix intel_perfmon_event_mapformatting jump label: Remove static_branch() tracepoint: Use static_key_false(), since static_branch() is deprecated perf/x86: Uncore filter support for SandyBridge-EP perf/x86: Detect number of instances of uncore CBox perf/x86: Fix event constraint for SandyBridge-EP C-Box perf/x86: Use 0xff as pseudo code for fixed uncore event perf/x86: Save a few bytes in 'struct x86_pmu' perf/x86: Add a microcode revision check for SNB-PEBS perf/x86: Improve debug output in check_hw_exists() perf/x86/amd: Unify AMD's generic and family 15h pmus perf/x86: Move Intel specific code to intel_pmu_init() perf/x86: Rename Intel specific macros perf/x86: Fix USER/KERNEL tagging of samples perf tools: Split event symbols arrays to hw and sw parts perf tools: Split out PE_VALUE_SYM parsing token to SW and HW tokens perf tools: Add empty rule for new line in event syntax parsing perf test: Use ARRAY_SIZE in parse events tests tools lib traceevent: Cleanup realloc use ...
2012-07-12workqueue: factor out worker_pool from global_cwqTejun Heo
Move worklist and all worker management fields from global_cwq into the new struct worker_pool. worker_pool points back to the containing gcwq. worker and cpu_workqueue_struct are updated to point to worker_pool instead of gcwq too. This change is mechanical and doesn't introduce any functional difference other than rearranging of fields and an added level of indirection in some places. This is to prepare for multiple pools per gcwq. v2: Comment typo fixes as suggested by Namhyung. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Namhyung Kim <namhyung@kernel.org>
2012-07-06Merge branch 'tip/perf/core' of ↵Ingo Molnar
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace into perf/core Pull tracing updates from Steve Rostedt. Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-07-02rcu: Add tracing for _rcu_barrier()Paul E. McKenney
This commit adds event tracing for _rcu_barrier() execution. This is defined only if RCU_TRACE=y. Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Reviewed-by: Josh Triplett <josh@joshtriplett.org>
2012-06-28tracing/kvm: Use __print_hex() for kvm_emulate_insn tracepointNamhyung Kim
The kvm_emulate_insn tracepoint used __print_insn() for printing its instructions. However it makes the format of the event hard to parse as it reveals TP internals. Fortunately, kernel provides __print_hex for almost same purpose, we can use it instead of open coding it. The user-space can be changed to parse it later. That means raw kernel tracing will not be affected by this change: # cd /sys/kernel/debug/tracing/ # cat events/kvm/kvm_emulate_insn/format name: kvm_emulate_insn ID: 29 format: ... print fmt: "%x:%llx:%s (%s)%s", REC->csbase, REC->rip, __print_hex(REC->insn, REC->len), \ __print_symbolic(REC->flags, { 0, "real" }, { (1 << 0) | (1 << 1), "vm16" }, \ { (1 << 0), "prot16" }, { (1 << 0) | (1 << 2), "prot32" }, { (1 << 0) | (1 << 3), "prot64" }), \ REC->failed ? " failed" : "" # echo 1 > events/kvm/kvm_emulate_insn/enable # cat trace # tracer: nop # # entries-in-buffer/entries-written: 2183/2183 #P:12 # # _-----=> irqs-off # / _----=> need-resched # | / _---=> hardirq/softirq # || / _--=> preempt-depth # ||| / delay # TASK-PID CPU# |||| TIMESTAMP FUNCTION # | | | |||| | | qemu-kvm-1782 [002] ...1 140.931636: kvm_emulate_insn: 0:c102fa25:89 10 (prot32) qemu-kvm-1781 [004] ...1 140.931637: kvm_emulate_insn: 0:c102fa25:89 10 (prot32) Link: http://lkml.kernel.org/n/tip-wfw6y3b9ugtey8snaow9nmg5@git.kernel.org Link: http://lkml.kernel.org/r/1340757701-10711-2-git-send-email-namhyung@kernel.org Cc: Arnaldo Carvalho de Melo <acme@ghostprotocols.net> Cc: Frederic Weisbecker <fweisbec@gmail.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Ingo Molnar <mingo@kernel.org> Cc: Namhyung Kim <namhyung.kim@lge.com> Cc: kvm@vger.kernel.org Acked-by: Avi Kivity <avi@redhat.com> Signed-off-by: Namhyung Kim <namhyung@kernel.org> Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
2012-06-18KVM: Introduce __KVM_HAVE_IRQ_LINEChristoffer Dall
This is a preparatory patch for the KVM/ARM implementation. KVM/ARM will use the KVM_IRQ_LINE ioctl, which is currently conditional on __KVM_HAVE_IOAPIC, but ARM obviously doesn't have any IOAPIC support and we need a separate define. Signed-off-by: Christoffer Dall <c.dall@virtualopensystems.com> Signed-off-by: Avi Kivity <avi@redhat.com>
2012-06-13KVM: trace events: update list of exit reasonsCornelia Huck
The list of exit reasons for the kvm_userspace_exit event was missing recent additions; bring it into sync again. Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Cornelia Huck <cornelia.huck@de.ibm.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
2012-06-06rcu: Update RCU_FAST_NO_HZ tracing for lazy callbacksPaul E. McKenney
In the current code, a short dyntick-idle interval (where there is at least one non-lazy callback on the CPU) and a long dyntick-idle interval (where there are only lazy callbacks on the CPU) are traced identically, which can be less than helpful. This commit therefore emits different event traces in these two cases. Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Tested-by: Heiko Carstens <heiko.carstens@de.ibm.com> Tested-by: Pascal Chapperon <pascal.chapperon@wanadoo.fr>
2012-05-29mm: vmscan: remove reclaim_mode_tMel Gorman
There is little motiviation for reclaim_mode_t once RECLAIM_MODE_[A]SYNC and lumpy reclaim have been removed. This patch gets rid of reclaim_mode_t as well and improves the documentation about what reclaim/compaction is and when it is triggered. Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Konstantin Khlebnikov <khlebnikov@openvz.org> Cc: Hugh Dickins <hughd@google.com> Cc: Ying Han <yinghan@google.com> Cc: Andy Whitcroft <apw@shadowen.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-29mm: vmscan: do not stall on writeback during memory compactionMel Gorman
This patch stops reclaim/compaction entering sync reclaim as this was only intended for lumpy reclaim and an oversight. Page migration has its own logic for stalling on writeback pages if necessary and memory compaction is already using it. Waiting on page writeback is bad for a number of reasons but the primary one is that waiting on writeback to a slow device like USB can take a considerable length of time. Page reclaim instead uses wait_iff_congested() to throttle if too many dirty pages are being scanned. Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Konstantin Khlebnikov <khlebnikov@openvz.org> Cc: Hugh Dickins <hughd@google.com> Cc: Ying Han <yinghan@google.com> Cc: Andy Whitcroft <apw@shadowen.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-29mm: vmscan: remove lumpy reclaimMel Gorman
This series removes lumpy reclaim and some stalling logic that was unintentionally being used by memory compaction. The end result is that stalling on dirty pages during page reclaim now depends on wait_iff_congested(). Four kernels were compared 3.3.0 vanilla 3.4.0-rc2 vanilla 3.4.0-rc2 lumpyremove-v2 is patch one from this series 3.4.0-rc2 nosync-v2r3 is the full series Removing lumpy reclaim saves almost 900 bytes of text whereas the full series removes 1200 bytes. text data bss dec hex filename 6740375 1927944 2260992 10929311 a6c49f vmlinux-3.4.0-rc2-vanilla 6739479 1927944 2260992 10928415 a6c11f vmlinux-3.4.0-rc2-lumpyremove-v2 6739159 1927944 2260992 10928095 a6bfdf vmlinux-3.4.0-rc2-nosync-v2 There are behaviour changes in the series and so tests were run with monitoring of ftrace events. This disrupts results so the performance results are distorted but the new behaviour should be clearer. fs-mark running in a threaded configuration showed little of interest as it did not push reclaim aggressively FS-Mark Multi Threaded 3.3.0-vanilla rc2-vanilla lumpyremove-v2r3 nosync-v2r3 Files/s min 3.20 ( 0.00%) 3.20 ( 0.00%) 3.20 ( 0.00%) 3.20 ( 0.00%) Files/s mean 3.20 ( 0.00%) 3.20 ( 0.00%) 3.20 ( 0.00%) 3.20 ( 0.00%) Files/s stddev 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) Files/s max 3.20 ( 0.00%) 3.20 ( 0.00%) 3.20 ( 0.00%) 3.20 ( 0.00%) Overhead min 508667.00 ( 0.00%) 521350.00 (-2.49%) 544292.00 (-7.00%) 547168.00 (-7.57%) Overhead mean 551185.00 ( 0.00%) 652690.73 (-18.42%) 991208.40 (-79.83%) 570130.53 (-3.44%) Overhead stddev 18200.69 ( 0.00%) 331958.29 (-1723.88%) 1579579.43 (-8578.68%) 9576.81 (47.38%) Overhead max 576775.00 ( 0.00%) 1846634.00 (-220.17%) 6901055.00 (-1096.49%) 585675.00 (-1.54%) MMTests Statistics: duration Sys Time Running Test (seconds) 309.90 300.95 307.33 298.95 User+Sys Time Running Test (seconds) 319.32 309.67 315.69 307.51 Total Elapsed Time (seconds) 1187.85 1193.09 1191.98 1193.73 MMTests Statistics: vmstat Page Ins 80532 82212 81420 79480 Page Outs 111434984 111456240 111437376 111582628 Swap Ins 0 0 0 0 Swap Outs 0 0 0 0 Direct pages scanned 44881 27889 27453 34843 Kswapd pages scanned 25841428 25860774 25861233 25843212 Kswapd pages reclaimed 25841393 25860741 25861199 25843179 Direct pages reclaimed 44881 27889 27453 34843 Kswapd efficiency 99% 99% 99% 99% Kswapd velocity 21754.791 21675.460 21696.029 21649.127 Direct efficiency 100% 100% 100% 100% Direct velocity 37.783 23.375 23.031 29.188 Percentage direct scans 0% 0% 0% 0% ftrace showed that there was no stalling on writeback or pages submitted for IO from reclaim context. postmark was similar and while it was more interesting, it also did not push reclaim heavily. POSTMARK 3.3.0-vanilla rc2-vanilla lumpyremove-v2r3 nosync-v2r3 Transactions per second: 16.00 ( 0.00%) 20.00 (25.00%) 18.00 (12.50%) 17.00 ( 6.25%) Data megabytes read per second: 18.80 ( 0.00%) 24.27 (29.10%) 22.26 (18.40%) 20.54 ( 9.26%) Data megabytes written per second: 35.83 ( 0.00%) 46.25 (29.08%) 42.42 (18.39%) 39.14 ( 9.24%) Files created alone per second: 28.00 ( 0.00%) 38.00 (35.71%) 34.00 (21.43%) 30.00 ( 7.14%) Files create/transact per second: 8.00 ( 0.00%) 10.00 (25.00%) 9.00 (12.50%) 8.00 ( 0.00%) Files deleted alone per second: 556.00 ( 0.00%) 1224.00 (120.14%) 3062.00 (450.72%) 6124.00 (1001.44%) Files delete/transact per second: 8.00 ( 0.00%) 10.00 (25.00%) 9.00 (12.50%) 8.00 ( 0.00%) MMTests Statistics: duration Sys Time Running Test (seconds) 113.34 107.99 109.73 108.72 User+Sys Time Running Test (seconds) 145.51 139.81 143.32 143.55 Total Elapsed Time (seconds) 1159.16 899.23 980.17 1062.27 MMTests Statistics: vmstat Page Ins 13710192 13729032 13727944 13760136 Page Outs 43071140 42987228 42733684 42931624 Swap Ins 0 0 0 0 Swap Outs 0 0 0 0 Direct pages scanned 0 0 0 0 Kswapd pages scanned 9941613 9937443 9939085 9929154 Kswapd pages reclaimed 9940926 9936751 9938397 9928465 Direct pages reclaimed 0 0 0 0 Kswapd efficiency 99% 99% 99% 99% Kswapd velocity 8576.567 11051.058 10140.164 9347.109 Direct efficiency 100% 100% 100% 100% Direct velocity 0.000 0.000 0.000 0.000 It looks like here that the full series regresses performance but as ftrace showed no usage of wait_iff_congested() or sync reclaim I am assuming it's a disruption due to monitoring. Other data such as memory usage, page IO, swap IO all looked similar. Running a benchmark with a plain DD showed nothing very interesting. The full series stalled in wait_iff_congested() slightly less but stall times on vanilla kernels were marginal. Running a benchmark that hammered on file-backed mappings showed stalls due to congestion but not in sync writebacks MICRO 3.3.0-vanilla rc2-vanilla lumpyremove-v2r3 nosync-v2r3 MMTests Statistics: duration Sys Time Running Test (seconds) 308.13 294.50 298.75 299.53 User+Sys Time Running Test (seconds) 330.45 316.28 318.93 320.79 Total Elapsed Time (seconds) 1814.90 1833.88 1821.14 1832.91 MMTests Statistics: vmstat Page Ins 108712 120708 97224 110344 Page Outs 155514576 156017404 155813676 156193256 Swap Ins 0 0 0 0 Swap Outs 0 0 0 0 Direct pages scanned 2599253 1550480 2512822 2414760 Kswapd pages scanned 69742364 71150694 68839041 69692533 Kswapd pages reclaimed 34824488 34773341 34796602 34799396 Direct pages reclaimed 53693 94750 61792 75205 Kswapd efficiency 49% 48% 50% 49% Kswapd velocity 38427.662 38797.901 37799.972 38022.889 Direct efficiency 2% 6% 2% 3% Direct velocity 1432.174 845.464 1379.807 1317.446 Percentage direct scans 3% 2% 3% 3% Page writes by reclaim 0 0 0 0 Page writes file 0 0 0 0 Page writes anon 0 0 0 0 Page reclaim immediate 0 0 0 1218 Page rescued immediate 0 0 0 0 Slabs scanned 15360 16384 13312 16384 Direct inode steals 0 0 0 0 Kswapd inode steals 4340 4327 1630 4323 FTrace Reclaim Statistics: congestion_wait Direct number congest waited 0 0 0 0 Direct time congest waited 0ms 0ms 0ms 0ms Direct full congest waited 0 0 0 0 Direct number conditional waited 900 870 754 789 Direct time conditional waited 0ms 0ms 0ms 20ms Direct full conditional waited 0 0 0 0 KSwapd number congest waited 2106 2308 2116 1915 KSwapd time congest waited 139924ms 157832ms 125652ms 132516ms KSwapd full congest waited 1346 1530 1202 1278 KSwapd number conditional waited 12922 16320 10943 14670 KSwapd time conditional waited 0ms 0ms 0ms 0ms KSwapd full conditional waited 0 0 0 0 Reclaim statistics are not radically changed. The stall times in kswapd are massive but it is clear that it is due to calls to congestion_wait() and that is almost certainly the call in balance_pgdat(). Otherwise stalls due to dirty pages are non-existant. I ran a benchmark that stressed high-order allocation. This is very artifical load but was used in the past to evaluate lumpy reclaim and compaction. Generally I look at allocation success rates and latency figures. STRESS-HIGHALLOC 3.3.0-vanilla rc2-vanilla lumpyremove-v2r3 nosync-v2r3 Pass 1 81.00 ( 0.00%) 28.00 (-53.00%) 24.00 (-57.00%) 28.00 (-53.00%) Pass 2 82.00 ( 0.00%) 39.00 (-43.00%) 38.00 (-44.00%) 43.00 (-39.00%) while Rested 88.00 ( 0.00%) 87.00 (-1.00%) 88.00 ( 0.00%) 88.00 ( 0.00%) MMTests Statistics: duration Sys Time Running Test (seconds) 740.93 681.42 685.14 684.87 User+Sys Time Running Test (seconds) 2922.65 3269.52 3281.35 3279.44 Total Elapsed Time (seconds) 1161.73 1152.49 1159.55 1161.44 MMTests Statistics: vmstat Page Ins 4486020 2807256 2855944 2876244 Page Outs 7261600 7973688 7975320 7986120 Swap Ins 31694 0 0 0 Swap Outs 98179 0 0 0 Direct pages scanned 53494 57731 34406 113015 Kswapd pages scanned 6271173 1287481 1278174 1219095 Kswapd pages reclaimed 2029240 1281025 1260708 1201583 Direct pages reclaimed 1468 14564 16649 92456 Kswapd efficiency 32% 99% 98% 98% Kswapd velocity 5398.133 1117.130 1102.302 1049.641 Direct efficiency 2% 25% 48% 81% Direct velocity 46.047 50.092 29.672 97.306 Percentage direct scans 0% 4% 2% 8% Page writes by reclaim 1616049 0 0 0 Page writes file 1517870 0 0 0 Page writes anon 98179 0 0 0 Page reclaim immediate 103778 27339 9796 17831 Page rescued immediate 0 0 0 0 Slabs scanned 1096704 986112 980992 998400 Direct inode steals 223 215040 216736 247881 Kswapd inode steals 175331 61548 68444 63066 Kswapd skipped wait 21991 0 1 0 THP fault alloc 1 135 125 134 THP collapse alloc 393 311 228 236 THP splits 25 13 7 8 THP fault fallback 0 0 0 0 THP collapse fail 3 5 7 7 Compaction stalls 865 1270 1422 1518 Compaction success 370 401 353 383 Compaction failures 495 869 1069 1135 Compaction pages moved 870155 3828868 4036106 4423626 Compaction move failure 26429 23865 29742 27514 Success rates are completely hosed for 3.4-rc2 which is almost certainly due to commit fe2c2a106663 ("vmscan: reclaim at order 0 when compaction is enabled"). I expected this would happen for kswapd and impair allocation success rates (https://lkml.org/lkml/2012/1/25/166) but I did not anticipate this much a difference: 80% less scanning, 37% less reclaim by kswapd In comparison, reclaim/compaction is not aggressive and gives up easily which is the intended behaviour. hugetlbfs uses __GFP_REPEAT and would be much more aggressive about reclaim/compaction than THP allocations are. The stress test above is allocating like neither THP or hugetlbfs but is much closer to THP. Mainline is now impaired in terms of high order allocation under heavy load although I do not know to what degree as I did not test with __GFP_REPEAT. Keep this in mind for bugs related to hugepage pool resizing, THP allocation and high order atomic allocation failures from network devices. In terms of congestion throttling, I see the following for this test FTrace Reclaim Statistics: congestion_wait Direct number congest waited 3 0 0 0 Direct time congest waited 0ms 0ms 0ms 0ms Direct full congest waited 0 0 0 0 Direct number conditional waited 957 512 1081 1075 Direct time conditional waited 0ms 0ms 0ms 0ms Direct full conditional waited 0 0 0 0 KSwapd number congest waited 36 4 3 5 KSwapd time congest waited 3148ms 400ms 300ms 500ms KSwapd full congest waited 30 4 3 5 KSwapd number conditional waited 88514 197 332 542 KSwapd time conditional waited 4980ms 0ms 0ms 0ms KSwapd full conditional waited 49 0 0 0 The "conditional waited" times are the most interesting as this is directly impacted by the number of dirty pages encountered during scan. As lumpy reclaim is no longer scanning contiguous ranges, it is finding fewer dirty pages. This brings wait times from about 5 seconds to 0. kswapd itself is still calling congestion_wait() so it'll still stall but it's a lot less. In terms of the type of IO we were doing, I see this FTrace Reclaim Statistics: mm_vmscan_writepage Direct writes anon sync 0 0 0 0 Direct writes anon async 0 0 0 0 Direct writes file sync 0 0 0 0 Direct writes file async 0 0 0 0 Direct writes mixed sync 0 0 0 0 Direct writes mixed async 0 0 0 0 KSwapd writes anon sync 0 0 0 0 KSwapd writes anon async 91682 0 0 0 KSwapd writes file sync 0 0 0 0 KSwapd writes file async 822629 0 0 0 KSwapd writes mixed sync 0 0 0 0 KSwapd writes mixed async 0 0 0 0 In 3.2, kswapd was doing a bunch of async writes of pages but reclaim/compaction was never reaching a point where it was doing sync IO. This does not guarantee that reclaim/compaction was not calling wait_on_page_writeback() but I would consider it unlikely. It indicates that merging patches 2 and 3 to stop reclaim/compaction calling wait_on_page_writeback() should be safe. This patch: Lumpy reclaim had a purpose but in the mind of some, it was to kick the system so hard it trashed. For others the purpose was to complicate vmscan.c. Over time it was giving softer shoes and a nicer attitude but memory compaction needs to step up and replace it so this patch sends lumpy reclaim to the farm. The tracepoint format changes for isolating LRU pages with this patch applied. Furthermore reclaim/compaction can no longer queue dirty pages in pageout() if the underlying BDI is congested. Lumpy reclaim used this logic and reclaim/compaction was using it in error. Signed-off-by: Mel Gorman <mgorman@suse.de> Acked-by: Rik van Riel <riel@redhat.com> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Cc: Konstantin Khlebnikov <khlebnikov@openvz.org> Cc: Hugh Dickins <hughd@google.com> Cc: Ying Han <yinghan@google.com> Cc: Andy Whitcroft <apw@shadowen.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-29mm: remove swap token codeRik van Riel
The swap token code no longer fits in with the current VM model. It does not play well with cgroups or the better NUMA placement code in development, since we have only one swap token globally. It also has the potential to mess with scalability of the system, by increasing the number of non-reclaimable pages on the active and inactive anon LRU lists. Last but not least, the swap token code has been broken for a year without complaints, as reported by Konstantin Khlebnikov. This suggests we no longer have much use for it. The days of sub-1G memory systems with heavy use of swap are over. If we ever need thrashing reducing code in the future, we will have to implement something that does scale. Signed-off-by: Rik van Riel <riel@redhat.com> Cc: Konstantin Khlebnikov <khlebnikov@openvz.org> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Mel Gorman <mel@csn.ul.ie> Cc: Hugh Dickins <hughd@google.com> Acked-by: Bob Picco <bpicco@meloft.net> Acked-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2012-05-28Merge tag 'writeback' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linuxLinus Torvalds
Pull writeback tree from Wu Fengguang: "Mainly from Jan Kara to avoid iput() in the flusher threads." * tag 'writeback' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux: writeback: Avoid iput() from flusher thread vfs: Rename end_writeback() to clear_inode() vfs: Move waiting for inode writeback from end_writeback() to evict_inode() writeback: Refactor writeback_single_inode() writeback: Remove wb->list_lock from writeback_single_inode() writeback: Separate inode requeueing after writeback writeback: Move I_DIRTY_PAGES handling writeback: Move requeueing when I_SYNC set to writeback_sb_inodes() writeback: Move clearing of I_SYNC into inode_sync_complete() writeback: initialize global_dirty_limit fs: remove 8 bytes of padding from struct writeback_control on 64 bit builds mm: page-writeback.c: local functions should not be exposed globally
2012-05-25Merge branch 'for_linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs Pull ext2, ext3 and quota fixes from Jan Kara: "Interesting bits are: - removal of a special i_mutex locking subclass (I_MUTEX_QUOTA) since quota code does not need i_mutex anymore in any unusual way. - backport (from ext4) of a fix of a checkpointing bug (missing cache flush) that could lead to fs corruption on power failure The rest are just random small fixes & cleanups." * 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs: ext2: trivial fix to comment for ext2_free_blocks ext2: remove the redundant comment for ext2_export_ops ext3: return 32/64-bit dir name hash according to usage type quota: Get rid of nested I_MUTEX_QUOTA locking subclass quota: Use precomputed value of sb_dqopt in dquot_quota_sync ext2: Remove i_mutex use from ext2_quota_write() reiserfs: Remove i_mutex use from reiserfs_quota_write() ext4: Remove i_mutex use from ext4_quota_write() ext3: Remove i_mutex use from ext3_quota_write() quota: Fix double lock in add_dquot_ref() with CONFIG_QUOTA_DEBUG jbd: Write journal superblock with WRITE_FUA after checkpointing jbd: protect all log tail updates with j_checkpoint_mutex jbd: Split updating of journal superblock and marking journal empty ext2: do not register write_super within VFS ext2: Remove s_dirt handling ext2: write superblock only once on unmount ext3: update documentation with barrier=1 default ext3: remove max_debt in find_group_orlov() jbd: Refine commit writeout logic
2012-05-23Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace Pull user namespace enhancements from Eric Biederman: "This is a course correction for the user namespace, so that we can reach an inexpensive, maintainable, and reasonably complete implementation. Highlights: - Config guards make it impossible to enable the user namespace and code that has not been converted to be user namespace safe. - Use of the new kuid_t type ensures the if you somehow get past the config guards the kernel will encounter type errors if you enable user namespaces and attempt to compile in code whose permission checks have not been updated to be user namespace safe. - All uids from child user namespaces are mapped into the initial user namespace before they are processed. Removing the need to add an additional check to see if the user namespace of the compared uids remains the same. - With the user namespaces compiled out the performance is as good or better than it is today. - For most operations absolutely nothing changes performance or operationally with the user namespace enabled. - The worst case performance I could come up with was timing 1 billion cache cold stat operations with the user namespace code enabled. This went from 156s to 164s on my laptop (or 156ns to 164ns per stat operation). - (uid_t)-1 and (gid_t)-1 are reserved as an internal error value. Most uid/gid setting system calls treat these value specially anyway so attempting to use -1 as a uid would likely cause entertaining failures in userspace. - If setuid is called with a uid that can not be mapped setuid fails. I have looked at sendmail, login, ssh and every other program I could think of that would call setuid and they all check for and handle the case where setuid fails. - If stat or a similar system call is called from a context in which we can not map a uid we lie and return overflowuid. The LFS experience suggests not lying and returning an error code might be better, but the historical precedent with uids is different and I can not think of anything that would break by lying about a uid we can't map. - Capabilities are localized to the current user namespace making it safe to give the initial user in a user namespace all capabilities. My git tree covers all of the modifications needed to convert the core kernel and enough changes to make a system bootable to runlevel 1." Fix up trivial conflicts due to nearby independent changes in fs/stat.c * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (46 commits) userns: Silence silly gcc warning. cred: use correct cred accessor with regards to rcu read lock userns: Convert the move_pages, and migrate_pages permission checks to use uid_eq userns: Convert cgroup permission checks to use uid_eq userns: Convert tmpfs to use kuid and kgid where appropriate userns: Convert sysfs to use kgid/kuid where appropriate userns: Convert sysctl permission checks to use kuid and kgids. userns: Convert proc to use kuid/kgid where appropriate userns: Convert ext4 to user kuid/kgid where appropriate userns: Convert ext3 to use kuid/kgid where appropriate userns: Convert ext2 to use kuid/kgid where appropriate. userns: Convert devpts to use kuid/kgid where appropriate userns: Convert binary formats to use kuid/kgid where appropriate userns: Add negative depends on entries to avoid building code that is userns unsafe userns: signal remove unnecessary map_cred_ns userns: Teach inode_capable to understand inodes whose uids map to other namespaces. userns: Fail exec for suid and sgid binaries with ids outside our user namespace. userns: Convert stat to return values mapped from kuids and kgids userns: Convert user specfied uids and gids in chown into kuids and kgid userns: Use uid_eq gid_eq helpers when comparing kuids and kgids in the vfs ...
2012-05-23Merge tag 'pm-for-3.5' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm Pull power management updates from Rafael Wysocki: - Implementation of opportunistic suspend (autosleep) and user space interface for manipulating wakeup sources. - Hibernate updates from Bojan Smojver and Minho Ban. - Updates of the runtime PM core and generic PM domains framework related to PM QoS. - Assorted fixes. * tag 'pm-for-3.5' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (25 commits) epoll: Fix user space breakage related to EPOLLWAKEUP PM / Domains: Make it possible to add devices to inactive domains PM / Hibernate: Use get_gendisk to verify partition if resume_file is integer format PM / Domains: Fix computation of maximum domain off time PM / Domains: Fix link checking when add subdomain PM / Sleep: User space wakeup sources garbage collector Kconfig option PM / Sleep: Make the limit of user space wakeup sources configurable PM / Documentation: suspend-and-cpuhotplug.txt: Fix typo PM / Domains: Cache device stop and domain power off governor results, v3 PM / Domains: Make device removal more straightforward PM / Sleep: Fix a mistake in a conditional in autosleep_store() epoll: Add a flag, EPOLLWAKEUP, to prevent suspend while epoll events are ready PM / QoS: Create device constraints objects on notifier registration PM / Runtime: Remove device fields related to suspend time, v2 PM / Domains: Rework default domain power off governor function, v2 PM / Domains: Rework default device stop governor function, v2 PM / Sleep: Add user space interface for manipulating wakeup sources, v3 PM / Sleep: Add "prevent autosleep time" statistics to wakeup sources PM / Sleep: Implement opportunistic sleep, v2 PM / Sleep: Add wakeup_source_activate and wakeup_source_deactivate tracepoints ...
2012-05-23Merge tag 'sound-3.5' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound Pull sound updates from Takashi Iwai: "This is the first big chunk for 3.5 merges of sound stuff. There are a few big changes in different areas. First off, the streaming logic of USB-audio endpoints has been largely rewritten for the better support of "implicit feedback". If anything about USB got broken, this change has to be checked. For HD-audio, the resume procedure was changed; instead of delaying the resume of the hardware until the first use, now waking up immediately at resume. This is for buggy BIOS. For ASoC, dynamic PCM support and the improved support for digital links between off-SoC devices are major framework changes. Some highlights are below: * HD-audio - Avoid accesses of invalid pin-control bits that may stall the codec - V-ref setup cleanups - Fix the races in power-saving code - Fix the races in codec cache hashes and connection lists - Split some common codes for BIOS auto-parser to hda_auto_parser.c - Changed the PM resume code to wake up immediately for buggy BIOS - Creative SoundCore3D support - Add Conexant CX20751/2/3/4 codec support * ASoC - Dynamic PCM support, allowing support for SoCs with internal routing through components with tight sequencing and formatting constraints within their internal paths or where there are multiple components connected with CPU managed DMA controllers inside the SoC. - Greatly improved support for direct digital links between off-SoC devices, providing a much simpler way of connecting things like digital basebands to CODECs. - Much more fine grained and robust locking, cleaning up some of the confusion that crept in with multi-component. - CPU support for nVidia Tegra 30 I2S and audio hub controllers and ST-Ericsson MSP I2S controolers - New CODEC drivers for Cirrus CS42L52, LAPIS Semiconductor ML26124, Texas Instruments LM49453. - Some regmap changes needed by the Tegra I2S driver. - mc13783 audio support. * Misc - Rewrite with module_pci_driver() - Xonar DGX support for snd-oxygen - Improvement of packet handling in snd-firewire driver - New USB-endpoint streaming logic - Enhanced M-audio FTU quirks and relevant cleanups - Increment the support of OSS devices to 256 - snd-aloop accuracy improvement There are a few more pending changes for 3.5, but they will be sent slightly later as partly depending on the changes of DRM." Fix up conflicts in regmap (due to duplicate patches, with some further updates then having already come in from the regmap tree). Also some fairly trivial context conflicts in the imx and mcx soc drivers. * tag 'sound-3.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound: (280 commits) ALSA: snd-usb: fix stream info output in /proc ALSA: pcm - Add proper state checks to snd_pcm_drain() ALSA: sh: Fix up namespace collision in sh_dac_audio. ALSA: hda/realtek - Fix unused variable compile warning ASoC: sh: fsi: enable chip specific data transfer mode ASoC: sh: fsi: call fsi_hw_startup/shutdown from fsi_dai_trigger() ASoC: sh: fsi: use same format for IN/OUT ASoC: sh: fsi: add fsi_version() and removed meaningless version check ASoC: sh: fsi: use register field macro name on IN/OUT_DMAC ASoC: tegra: Add machine driver for WM8753 codec ALSA: hda - Fix possible races of accesses to connection list array ASoC: OMAP: HDMI: Introduce codec ARM: mx31_3ds: Add sound support ASoC: imx-mc13783 cleanup mx31moboard: Add sound support ASoC: mc13783 codec cleanups ASoC: add imx-mc13783 sound support ASoC: Add mc13783 codec mfd: mc13xxx: add codec platform data ASoC: don't flip master of DT-instantiated DAI links ...
2012-05-22Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial Pull trivial updates from Jiri Kosina: "As usual, it's mostly typo fixes, redundant code elimination and some documentation updates." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (57 commits) edac, mips: don't change code that has been removed in edac/mips tree xtensa: Change mail addresses of Hannes Weiner and Oskar Schirmer lib: Change mail address of Oskar Schirmer net: Change mail address of Oskar Schirmer arm/m68k: Change mail address of Sebastian Hess i2c: Change mail address of Oskar Schirmer net: Fix tcp_build_and_update_options comment in struct tcp_sock atomic64_32.h: fix parameter naming mismatch Kconfig: replace "--- help ---" with "---help---" c2port: fix bogus Kconfig "default no" edac: Fix spelling errors. qla1280: Remove redundant NULL check before release_firmware() call remoteproc: remove redundant NULL check before release_firmware() qla2xxx: Remove redundant NULL check before release_firmware() call. aic94xx: Get rid of redundant NULL check before release_firmware() call tehuti: delete redundant NULL check before release_firmware() qlogic: get rid of a redundant test for NULL before call to release_firmware() bna: remove redundant NULL test before release_firmware() tg3: remove redundant NULL test before release_firmware() call typhoon: get rid of redundant conditional before all to release_firmware() ...
2012-05-15userns: Convert ext4 to user kuid/kgid where appropriateEric W. Biederman
Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-05-15userns: Convert ext3 to use kuid/kgid where appropriateEric W. Biederman
Acked-by: Serge Hallyn <serge.hallyn@canonical.com> Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2012-05-15jbd: Write journal superblock with WRITE_FUA after checkpointingJan Kara
If journal superblock is written only in disk's caches and other transaction starts reusing space of the transaction cleaned from the log, it can happen blocks of a new transaction reach the disk before journal superblock. When power failure happens in such case, subsequent journal replay would still try to replay the old transaction but some of it's blocks may be already overwritten by the new transaction. For this reason we must use WRITE_FUA when updating log tail and we must first write new log tail to disk and update in-memory information only after that. Signed-off-by: Jan Kara <jack@suse.cz>
2012-05-15jbd: Split updating of journal superblock and marking journal emptyJan Kara
There are three case of updating journal superblock. In the first case, we want to mark journal as empty (setting s_sequence to 0), in the second case we want to update log tail, in the third case we want to update s_errno. Split these cases into separate functions. It makes the code slightly more straightforward and later patches will make the distinction even more important. Signed-off-by: Jan Kara <jack@suse.cz>
2012-05-09rcu: Make RCU_FAST_NO_HZ handle timer migrationPaul E. McKenney
The current RCU_FAST_NO_HZ assumes that timers do not migrate unless a CPU goes offline, in which case it assumes that the CPU will have to come out of dyntick-idle mode (cancelling the timer) in order to go offline. This is important because when RCU_FAST_NO_HZ permits a CPU to enter dyntick-idle mode despite having RCU callbacks pending, it posts a timer on that CPU to force a wakeup on that CPU. This wakeup ensures that the CPU will eventually handle the end of the grace period, including invoking its RCU callbacks. However, Pascal Chapperon's test setup shows that the timer handler rcu_idle_gp_timer_func() really does get invoked in some cases. This is problematic because this can cause the CPU that entered dyntick-idle mode despite still having RCU callbacks pending to remain in dyntick-idle mode indefinitely, which means that its RCU callbacks might never be invoked. This situation can result in grace-period delays or even system hangs, which matches Pascal's observations of slow boot-up and shutdown (https://lkml.org/lkml/2012/4/5/142). See also the bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=806548 This commit therefore causes the "should never be invoked" timer handler rcu_idle_gp_timer_func() to use smp_call_function_single() to wake up the CPU for which the timer was intended, allowing that CPU to invoke its RCU callbacks in a timely manner. Reported-by: Pascal Chapperon <pascal.chapperon@wanadoo.fr> Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-05-06writeback: Move requeueing when I_SYNC set to writeback_sb_inodes()Jan Kara
When writeback_single_inode() is called on inode which has I_SYNC already set while doing WB_SYNC_NONE, inode is moved to b_more_io list. However this makes sense only if the caller is flusher thread. For other callers of writeback_single_inode() it doesn't really make sense and may be even wrong - flusher thread may be doing WB_SYNC_ALL writeback in parallel. So we move requeueing from writeback_single_inode() to writeback_sb_inodes(). Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: Fengguang Wu <fengguang.wu@intel.com>
2012-05-01PM / Sleep: Add wakeup_source_activate and wakeup_source_deactivate tracepointsArve Hjønnevåg
Add tracepoints to wakeup_source_activate and wakeup_source_deactivate. Useful for checking that specific wakeup sources overlap as expected. Signed-off-by: Arve Hjønnevåg <arve@android.com> Acked-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
2012-04-24rcu: Add RCU_FAST_NO_HZ tracing for idle exitPaul E. McKenney
Traces of rcu_prep_idle events can be confusing because rcu_cleanup_after_idle() does no tracing. This commit therefore adds this tracing. Signed-off-by: Paul E. McKenney <paul.mckenney@linaro.org> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
2012-04-23ASoC: dapm: Fix x86_64 build warning.Liam Girdwood
Fixes the following build warning on x86_64. In file included from include/trace/ftrace.h:567:0, from include/trace/define_trace.h:86, from include/trace/events/asoc.h:410, from sound/soc/soc-core.c:45: include/trace/events/asoc.h: In function 'ftrace_raw_event_snd_soc_dapm_output_path': include/trace/events/asoc.h:246:1: warning: cast from pointer to integer of different size [-Wpointer-to-int-cast] include/trace/events/asoc.h: In function 'ftrace_raw_event_snd_soc_dapm_input_path': include/trace/events/asoc.h:275:1: warning: cast from pointer to integer of different size [-Wpointer-to-int-cast] Signed-off-by: Liam Girdwood <lrg@ti.com> Signed-off-by: Mark Brown <broonie@opensource.wolfsonmicro.com>
2012-04-18ASoC: dapm: Add API call to query valid DAPM pathsLiam Girdwood
In preparation for ASoC DSP support. Add a DAPM API call to determine whether a DAPM audio path is valid between source and sink widgets. This also takes into account all kcontrol mux and mixer settings in between the source and sink widgets to validate the audio path. This will be used by the DSP core to determine the runtime DAI mappings between FE and BE DAIs in order to run PCM operations. Signed-off-by: Liam Girdwood <lrg@ti.com> Signed-off-by: Mark Brown <broonie@opensource.wolfsonmicro.com>
2012-04-11jbd: Refine commit writeout logicJan Kara
Currently we write out all journal buffers in WRITE_SYNC mode. This improves performance for fsync heavy workloads but hinders performance when writes are mostly asynchronous, most noticably it slows down readers and users complain about slow desktop response etc. So submit writes as asynchronous in the normal case and only submit writes as WRITE_SYNC if we detect someone is waiting for current transaction commit. I've gathered some numbers to back this change. The first is the read latency test. It measures time to read 1 MB after several seconds of sleeping in presence of streaming writes. Top 10 times (out of 90) in us: Before After 2131586 697473 1709932 557487 1564598 535642 1480462 347573 1478579 323153 1408496 222181 1388960 181273 1329565 181070 1252486 172832 1223265 172278 Average: 619377 82180 So the improvement in both maximum and average latency is massive. I've measured fsync throughput by: fs_mark -n 100 -t 1 -s 16384 -d /mnt/fsync/ -S 1 -L 4 in presence of streaming reader. The numbers (fsyncs/s) are: Before After 9.9 6.3 6.8 6.0 6.3 6.2 5.8 6.1 So fsync performance seems unharmed by this change. Signed-off-by: Jan Kara <jack@suse.cz>
2012-04-10workqueue: Fix workqueue_execute_end() commentStephen Boyd
workqueue_execute_end() is called after the callback function, not before. Signed-off-by: Stephen Boyd <sboyd@codeaurora.org> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2012-04-04Merge branch 'perf-urgent-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fixes from Ingo Molnar. * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: perf/x86/p4: Add format attributes tracing, sched, vfs: Fix 'old_pid' usage in trace_sched_process_exec()
2012-03-31tracing, sched, vfs: Fix 'old_pid' usage in trace_sched_process_exec()Oleg Nesterov
1. TRACE_EVENT(sched_process_exec) forgets to actually use the old pid argument, it sets ->old_pid = p->pid. 2. search_binary_handler() uses the wrong pid number. tracepoint needs the global pid_t from the root namespace, while old_pid is the virtual pid number as it seen by the tracer/parent. With this patch we have two pid_t's in search_binary_handler(), not really nice. Perhaps we should switch to "struct pid*", but in this case it would be better to cleanup the current code first and move the "depth == 0" code outside. Signed-off-by: Oleg Nesterov <oleg@redhat.com> Cc: David Smith <dsmith@redhat.com> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Denys Vlasenko <dvlasenk@redhat.com> Link: http://lkml.kernel.org/r/20120330162636.GA4857@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
2012-03-30Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs Pull btrfs fixes and features from Chris Mason: "We've merged in the error handling patches from SuSE. These are already shipping in the sles kernel, and they give btrfs the ability to abort transactions and go readonly on errors. It involves a lot of churn as they clarify BUG_ONs, and remove the ones we now properly deal with. Josef reworked the way our metadata interacts with the page cache. page->private now points to the btrfs extent_buffer object, which makes everything faster. He changed it so we write an whole extent buffer at a time instead of allowing individual pages to go down,, which will be important for the raid5/6 code (for the 3.5 merge window ;) Josef also made us more aggressive about dropping pages for metadata blocks that were freed due to COW. Overall, our metadata caching is much faster now. We've integrated my patch for metadata bigger than the page size. This allows metadata blocks up to 64KB in size. In practice 16K and 32K seem to work best. For workloads with lots of metadata, this cuts down the size of the extent allocation tree dramatically and fragments much less. Scrub was updated to support the larger block sizes, which ended up being a fairly large change (thanks Stefan Behrens). We also have an assortment of fixes and updates, especially to the balancing code (Ilya Dryomov), the back ref walker (Jan Schmidt) and the defragging code (Liu Bo)." Fixed up trivial conflicts in fs/btrfs/scrub.c that were just due to removal of the second argument to k[un]map_atomic() in commit 7ac687d9e047. * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mason/linux-btrfs: (75 commits) Btrfs: update the checks for mixed block groups with big metadata blocks Btrfs: update to the right index of defragment Btrfs: do not bother to defrag an extent if it is a big real extent Btrfs: add a check to decide if we should defrag the range Btrfs: fix recursive defragment with autodefrag option Btrfs: fix the mismatch of page->mapping Btrfs: fix race between direct io and autodefrag Btrfs: fix deadlock during allocating chunks Btrfs: show useful info in space reservation tracepoint Btrfs: don't use crc items bigger than 4KB Btrfs: flush out and clean up any block device pages during mount btrfs: disallow unequal data/metadata blocksize for mixed block groups Btrfs: enhance superblock sanity checks Btrfs: change scrub to support big blocks Btrfs: minor cleanup in scrub Btrfs: introduce common define for max number of mirrors Btrfs: fix infinite loop in btrfs_shrink_device() Btrfs: fix memory leak in resolver code Btrfs: allow dup for data chunks in mixed mode Btrfs: validate target profiles only if we are going to use them ...
2012-03-28Merge tag 'ext4_for_linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 updates for 3.4 from Ted Ts'o: "Ext4 commits for 3.3 merge window; mostly cleanups and bug fixes The changes to export dirty_writeback_interval are from Artem's s_dirt cleanup patch series. The same is true of the change to remove the s_dirt helper functions which never got used by anyone in-tree. I've run these changes by Al Viro, and am carrying them so that Artem can more easily fix up the rest of the file systems during the next merge window. (Originally we had hopped to remove the use of s_dirt from ext4 during this merge window, but his patches had some bugs, so I ultimately ended dropping them from the ext4 tree.)" * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (66 commits) vfs: remove unused superblock helpers mm: export dirty_writeback_interval ext4: remove useless s_dirt assignment ext4: write superblock only once on unmount ext4: do not mark superblock as dirty unnecessarily ext4: correct ext4_punch_hole return codes ext4: remove restrictive checks for EOFBLOCKS_FL ext4: always set then trimmed blocks count into len ext4: fix trimmed block count accunting ext4: fix start and len arguments handling in ext4_trim_fs() ext4: update s_free_{inodes,blocks}_count during online resize ext4: change some printk() calls to use ext4_msg() instead ext4: avoid output message interleaving in ext4_error_<foo>() ext4: remove trailing newlines from ext4_msg() and ext4_error() messages ext4: add no_printk argument validation, fix fallout ext4: remove redundant "EXT4-fs: " from uses of ext4_msg ext4: give more helpful error message in ext4_ext_rm_leaf() ext4: remove unused code from ext4_ext_map_blocks() ext4: rewrite punch hole to use ext4_ext_remove_space() jbd2: cleanup journal tail after transaction commit ...
2012-03-24Merge tag 'device-for-3.4' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux Pull <linux/device.h> avoidance patches from Paul Gortmaker: "Nearly every subsystem has some kind of header with a proto like: void foo(struct device *dev); and yet there is no reason for most of these guys to care about the sub fields within the device struct. This allows us to significantly reduce the scope of headers including headers. For this instance, a reduction of about 40% is achieved by replacing the include with the simple fact that the device is some kind of a struct. Unlike the much larger module.h cleanup, this one is simply two commits. One to fix the implicit <linux/device.h> users, and then one to delete the device.h includes from the linux/include/ dir wherever possible." * tag 'device-for-3.4' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux: device.h: audit and cleanup users in main include dir device.h: cleanup users outside of linux/include (C files)
2012-03-23Merge tag 'nfs-for-3.4-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds
Pull NFS client updates for Linux 3.4 from Trond Myklebust: "New features include: - Add NFS client support for containers. This should enable most of the necessary functionality, including lockd support, and support for rpc.statd, NFSv4 idmapper and RPCSEC_GSS upcalls into the correct network namespace from which the mount system call was issued. - NFSv4 idmapper scalability improvements Base the idmapper cache on the keyring interface to allow concurrent access to idmapper entries. Start the process of migrating users from the single-threaded daemon-based approach to the multi-threaded request-key based approach. - NFSv4.1 implementation id. Allows the NFSv4.1 client and server to mutually identify each other for logging and debugging purposes. - Support the 'vers=4.1' mount option for mounting NFSv4.1 instead of having to use the more counterintuitive 'vers=4,minorversion=1'. - SUNRPC tracepoints. Start the process of adding tracepoints in order to improve debugging of the RPC layer. - pNFS object layout support for autologin. Important bugfixes include: - Fix a bug in rpc_wake_up/rpc_wake_up_status that caused them to fail to wake up all tasks when applied to priority waitqueues. - Ensure that we handle read delegations correctly, when we try to truncate a file. - A number of fixes for NFSv4 state manager loops (mostly to do with delegation recovery)." * tag 'nfs-for-3.4-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (224 commits) NFS: fix sb->s_id in nfs debug prints xprtrdma: Remove assumption that each segment is <= PAGE_SIZE xprtrdma: The transport should not bug-check when a dup reply is received pnfs-obj: autologin: Add support for protocol autologin NFS: Remove nfs4_setup_sequence from generic rename code NFS: Remove nfs4_setup_sequence from generic unlink code NFS: Remove nfs4_setup_sequence from generic read code NFS: Remove nfs4_setup_sequence from generic write code NFS: Fix more NFS debug related build warnings SUNRPC/LOCKD: Fix build warnings when CONFIG_SUNRPC_DEBUG is undefined nfs: non void functions must return a value SUNRPC: Kill compiler warning when RPC_DEBUG is unset SUNRPC/NFS: Add Kbuild dependencies for NFS_DEBUG/RPC_DEBUG NFS: Use cond_resched_lock() to reduce latencies in the commit scans NFSv4: It is not safe to dereference lsp->ls_state in release_lockowner NFS: ncommit count is being double decremented SUNRPC: We must not use list_for_each_entry_safe() in rpc_wake_up() Try using machine credentials for RENEW calls NFSv4.1: Fix a few issues in filelayout_commit_pagelist NFSv4.1: Clean ups and bugfixes for the pNFS read/writeback/commit code ...
2012-03-22Merge tag 'regmap-3.4' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regmap Pull regmap updates from Mark Brown: "Things are really quieting down with the regmap API, while we're still seeing a trickle of new features coming in they're getting much smaller than they were. It's also nice to have some features which support other subsystems building infrastructure on top of regmap. Highlights include: - Support for padding between the register and the value when interacting with the device, sometimes needed for fast interfaces. - Support for applying register updates to the device when restoring the register state. This is intended to be used to apply updates supplied by manufacturers for tuning the performance of the device (many of which are to undocumented registers which aren't otherwise covered). - Support for multi-register operations on cached registers. - Support for syncing only part of the register cache. - Stubs and parameter query functions intended to make it easier for other subsystems to build infrastructure on top of the regmap API. plus a few driver updates making use of the new features which it was easier to merge via this tree." * tag 'regmap-3.4' of git://git.kernel.org/pub/scm/linux/kernel/git/broonie/regmap: (41 commits) regmap: Fix future missing prototype of devres_alloc() and friends regmap: Rejig struct declarations for stubbed API regmap: Fix rbtree block base in sync regcache: Make sure we sync register 0 in an rbtree cache regmap: delete unused module.h from drivers/base/regmap files regmap: Add stub for regcache_sync_region() mfd: Improve performance of later WM1811 revisions regmap: Fix x86_64 breakage regmap: Allow drivers to sync only part of the register cache regmap: Supply ranges to the sync operations regmap: Add tracepoints for cache only and cache bypass regmap: Mark the cache as clean after a successful sync regmap: Remove default cache sync implementation regmap: Skip hardware defaults for LZO caches regmap: Expose the driver name in debugfs mfd: wm8400: Convert to devm_regmap_init_i2c() mfd: wm831x: Convert to devm_regmap_init() mfd: wm8994: Convert to devm_regmap_init() mfd/ASoC: Convert WM8994 driver to use regmap patches mfd: Add __devinit and __devexit annotations in wm8994 ...
2012-03-22btrfs: return void in functions without error conditionsJeff Mahoney
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
2012-03-20Merge branch 'for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial Pull trivial tree from Jiri Kosina: "It's indeed trivial -- mostly documentation updates and a bunch of typo fixes from Masanari. There are also several linux/version.h include removals from Jesper." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (101 commits) kcore: fix spelling in read_kcore() comment constify struct pci_dev * in obvious cases Revert "char: Fix typo in viotape.c" init: fix wording error in mm_init comment usb: gadget: Kconfig: fix typo for 'different' Revert "power, max8998: Include linux/module.h just once in drivers/power/max8998_charger.c" writeback: fix fn name in writeback_inodes_sb_nr_if_idle() comment header writeback: fix typo in the writeback_control comment Documentation: Fix multiple typo in Documentation tpm_tis: fix tis_lock with respect to RCU Revert "media: Fix typo in mixer_drv.c and hdmi_drv.c" Doc: Update numastat.txt qla4xxx: Add missing spaces to error messages compiler.h: Fix typo security: struct security_operations kerneldoc fix Documentation: broken URL in libata.tmpl Documentation: broken URL in filesystems.tmpl mtd: simplify return logic in do_map_probe() mm: fix comment typo of truncate_inode_pages_range power: bq27x00: Fix typos in comment ...
2012-03-20Merge branch 'perf-core-for-linus' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf events changes for v3.4 from Ingo Molnar: - New "hardware based branch profiling" feature both on the kernel and the tooling side, on CPUs that support it. (modern x86 Intel CPUs with the 'LBR' hardware feature currently.) This new feature is basically a sophisticated 'magnifying glass' for branch execution - something that is pretty difficult to extract from regular, function histogram centric profiles. The simplest mode is activated via 'perf record -b', and the result looks like this in perf report: $ perf record -b any_call,u -e cycles:u branchy $ perf report -b --sort=symbol 52.34% [.] main [.] f1 24.04% [.] f1 [.] f3 23.60% [.] f1 [.] f2 0.01% [k] _IO_new_file_xsputn [k] _IO_file_overflow 0.01% [k] _IO_vfprintf_internal [k] _IO_new_file_xsputn 0.01% [k] _IO_vfprintf_internal [k] strchrnul 0.01% [k] __printf [k] _IO_vfprintf_internal 0.01% [k] main [k] __printf This output shows from/to branch columns and shows the highest percentage (from,to) jump combinations - i.e. the most likely taken branches in the system. "branches" can also include function calls and any other synchronous and asynchronous transitions of the instruction pointer that are not 'next instruction' - such as system calls, traps, interrupts, etc. This feature comes with (hopefully intuitive) flat ascii and TUI support in perf report. - Various 'perf annotate' visual improvements for us assembly junkies. It will now recognize function calls in the TUI and by hitting enter you can follow the call (recursively) and back, amongst other improvements. - Multiple threads/processes recording support in perf record, perf stat, perf top - which is activated via a comma-list of PIDs: perf top -p 21483,21485 perf stat -p 21483,21485 -ddd perf record -p 21483,21485 - Support for per UID views, via the --uid paramter to perf top, perf report, etc. For example 'perf top --uid mingo' will only show the tasks that I am running, excluding other users, root, etc. - Jump label restructurings and improvements - this includes the factoring out of the (hopefully much clearer) include/linux/static_key.h generic facility: struct static_key key = STATIC_KEY_INIT_FALSE; ... if (static_key_false(&key)) do unlikely code else do likely code ... static_key_slow_inc(); ... static_key_slow_inc(); ... The static_key_false() branch will be generated into the code with as little impact to the likely code path as possible. the static_key_slow_*() APIs flip the branch via live kernel code patching. This facility can now be used more widely within the kernel to micro-optimize hot branches whose likelihood matches the static-key usage and fast/slow cost patterns. - SW function tracer improvements: perf support and filtering support. - Various hardenings of the perf.data ABI, to make older perf.data's smoother on newer tool versions, to make new features integrate more smoothly, to support cross-endian recording/analyzing workflows better, etc. - Restructuring of the kprobes code, the splitting out of 'optprobes', and a corner case bugfix. - Allow the tracing of kernel console output (printk). - Improvements/fixes to user-space RDPMC support, allowing user-space self-profiling code to extract PMU counts without performing any system calls, while playing nice with the kernel side. - 'perf bench' improvements - ... and lots of internal restructurings, cleanups and fixes that made these features possible. And, as usual this list is incomplete as there were also lots of other improvements * 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (120 commits) perf report: Fix annotate double quit issue in branch view mode perf report: Remove duplicate annotate choice in branch view mode perf/x86: Prettify pmu config literals perf report: Enable TUI in branch view mode perf report: Auto-detect branch stack sampling mode perf record: Add HEADER_BRANCH_STACK tag perf record: Provide default branch stack sampling mode option perf tools: Make perf able to read files from older ABIs perf tools: Fix ABI compatibility bug in print_event_desc() perf tools: Enable reading of perf.data files from different ABI rev perf: Add ABI reference sizes perf report: Add support for taken branch sampling perf record: Add support for sampling taken branch perf tools: Add code to support PERF_SAMPLE_BRANCH_STACK x86/kprobes: Split out optprobe related code to kprobes-opt.c x86/kprobes: Fix a bug which can modify kernel code permanently x86/kprobes: Fix instruction recovery on optimized path perf: Add callback to flush branch_stack on context switch perf: Disable PERF_SAMPLE_BRANCH_* when not supported perf/x86: Add LBR software filter support for Intel CPUs ...
2012-03-16device.h: audit and cleanup users in main include dirPaul Gortmaker
The <linux/device.h> header includes a lot of stuff, and it in turn gets a lot of use just for the basic "struct device" which appears so often. Clean up the users as follows: 1) For those headers only needing "struct device" as a pointer in fcn args, replace the include with exactly that. 2) For headers not really using anything from device.h, simply delete the include altogether. 3) For headers relying on getting device.h implicitly before being included themselves, now explicitly include device.h 4) For files in which doing #1 or #2 uncovers an implicit dependency on some other header, fix by explicitly adding the required header(s). Any C files that were implicitly relying on device.h to be present have already been dealt with in advance. Total removals from #1 and #2: 51. Total additions coming from #3: 9. Total other implicit dependencies from #4: 7. As of 3.3-rc1, there were 110, so a net removal of 42 gives about a 38% reduction in device.h presence in include/* Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
2012-03-14Merge remote-tracking branch 'regmap/topic/drivers' into regmap-nextMark Brown
Resolved simple add/add conflicts: drivers/base/regmap/internal.h drivers/base/regmap/regmap.c
2012-03-13jbd2: issue cache flush after checkpointing even with internal journalJan Kara
When we reach jbd2_cleanup_journal_tail(), there is no guarantee that checkpointed buffers are on a stable storage - especially if buffers were written out by jbd2_log_do_checkpoint(), they are likely to be only in disk's caches. Thus when we update journal superblock effectively removing old transaction from journal, this write of superblock can get to stable storage before those checkpointed buffers which can result in filesystem corruption after a crash. Thus we must unconditionally issue a cache flush before we update journal superblock in these cases. A similar problem can also occur if journal superblock is written only in disk's caches, other transaction starts reusing space of the transaction cleaned from the log and power failure happens. Subsequent journal replay would still try to replay the old transaction but some of it's blocks may be already overwritten by the new transaction. For this reason we must use WRITE_FUA when updating log tail and we must first write new log tail to disk and update in-memory information only after that. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-03-13jbd2: split updating of journal superblock and marking journal emptyJan Kara
There are three case of updating journal superblock. In the first case, we want to mark journal as empty (setting s_sequence to 0), in the second case we want to update log tail, in the third case we want to update s_errno. Split these cases into separate functions. It makes the code slightly more straightforward and later patches will make the distinction even more important. Signed-off-by: Jan Kara <jack@suse.cz> Signed-off-by: "Theodore Ts'o" <tytso@mit.edu>
2012-03-05Merge branch 'perf/urgent' into perf/coreIngo Molnar
Conflicts: tools/perf/builtin-record.c tools/perf/builtin-top.c tools/perf/perf.h tools/perf/util/top.h Merge reason: resolve these cherry-picking conflicts. Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-02-28Merge branch 'rcu/next' of ↵Ingo Molnar
git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu The major features of this series are: - making RCU more aggressive about entering dyntick-idle mode in order to improve energy efficiency - converting a few more call_rcu()s to kfree_rcu()s - applying a number of rcutree fixes and cleanups to rcutiny - removing CONFIG_SMP #ifdefs from treercu - allowing RCU CPU stall times to be set via sysfs - adding CPU-stall capability to rcutorture - adding more RCU-abuse diagnostics - updating documentation - fixing yet more issues located by the still-ongoing top-to-bottom inspection of RCU, this time with a special focus on the CPU-hotplug code path. Signed-off-by: Ingo Molnar <mingo@elte.hu>
2012-02-23regmap: Add tracepoints for cache only and cache bypassMark Brown
Useful for figuring out where the hardware interaction went or came from. Signed-off-by: Mark Brown <broonie@opensource.wolfsonmicro.com>
2012-02-23tracepoint, vfs, sched: Add exec() tracepointDavid Smith
Added a minimal exec tracepoint. Exec is an important major event in the life of a task, like fork(), clone() or exit(), all of which we already trace. [ We also do scheduling re-balancing during exec() - so it's useful from a scheduler instrumentation POV as well. ] If you want to watch a task start up, when it gets exec'ed is a good place to start. With the addition of this tracepoint, exec's can be monitored and better picture of general system activity can be obtained. This tracepoint will also enable better process life tracking, allowing you to answer questions like "what process keeps starting up binary X?". This tracepoint can also be useful in ftrace filtering and trigger conditions: i.e. starting or stopping filtering when exec is called. Signed-off-by: David Smith <dsmith@redhat.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: http://lkml.kernel.org/r/4F314D19.7030504@redhat.com Signed-off-by: Ingo Molnar <mingo@elte.hu>