diff options
Diffstat (limited to 'Documentation/trace')
| -rw-r--r-- | Documentation/trace/events-kmem.txt | 107 | ||||
| -rw-r--r-- | Documentation/trace/events-nmi.txt | 43 | ||||
| -rw-r--r-- | Documentation/trace/events-power.txt | 96 | ||||
| -rw-r--r-- | Documentation/trace/events.txt | 434 | ||||
| -rw-r--r-- | Documentation/trace/ftrace-design.txt | 408 | ||||
| -rw-r--r-- | Documentation/trace/ftrace.txt | 2652 | ||||
| -rw-r--r-- | Documentation/trace/function-graph-fold.vim | 42 | ||||
| -rw-r--r-- | Documentation/trace/kmemtrace.txt | 126 | ||||
| -rw-r--r-- | Documentation/trace/kprobetrace.txt | 172 | ||||
| -rw-r--r-- | Documentation/trace/mmiotrace.txt | 39 | ||||
| -rw-r--r-- | Documentation/trace/postprocess/trace-pagealloc-postprocess.pl | 418 | ||||
| -rw-r--r-- | Documentation/trace/postprocess/trace-vmscan-postprocess.pl | 704 | ||||
| -rw-r--r-- | Documentation/trace/power.txt | 17 | ||||
| -rw-r--r-- | Documentation/trace/ring-buffer-design.txt | 955 | ||||
| -rw-r--r-- | Documentation/trace/tracepoint-analysis.txt | 327 | ||||
| -rw-r--r-- | Documentation/trace/tracepoints.txt | 55 | ||||
| -rw-r--r-- | Documentation/trace/uprobetracer.txt | 159 |
17 files changed, 5739 insertions, 1015 deletions
diff --git a/Documentation/trace/events-kmem.txt b/Documentation/trace/events-kmem.txt new file mode 100644 index 00000000000..19480041006 --- /dev/null +++ b/Documentation/trace/events-kmem.txt @@ -0,0 +1,107 @@ + Subsystem Trace Points: kmem + +The kmem tracing system captures events related to object and page allocation +within the kernel. Broadly speaking there are five major subheadings. + + o Slab allocation of small objects of unknown type (kmalloc) + o Slab allocation of small objects of known type + o Page allocation + o Per-CPU Allocator Activity + o External Fragmentation + +This document describes what each of the tracepoints is and why they +might be useful. + +1. Slab allocation of small objects of unknown type +=================================================== +kmalloc call_site=%lx ptr=%p bytes_req=%zu bytes_alloc=%zu gfp_flags=%s +kmalloc_node call_site=%lx ptr=%p bytes_req=%zu bytes_alloc=%zu gfp_flags=%s node=%d +kfree call_site=%lx ptr=%p + +Heavy activity for these events may indicate that a specific cache is +justified, particularly if kmalloc slab pages are getting significantly +internal fragmented as a result of the allocation pattern. By correlating +kmalloc with kfree, it may be possible to identify memory leaks and where +the allocation sites were. + + +2. Slab allocation of small objects of known type +================================================= +kmem_cache_alloc call_site=%lx ptr=%p bytes_req=%zu bytes_alloc=%zu gfp_flags=%s +kmem_cache_alloc_node call_site=%lx ptr=%p bytes_req=%zu bytes_alloc=%zu gfp_flags=%s node=%d +kmem_cache_free call_site=%lx ptr=%p + +These events are similar in usage to the kmalloc-related events except that +it is likely easier to pin the event down to a specific cache. At the time +of writing, no information is available on what slab is being allocated from, +but the call_site can usually be used to extrapolate that information. + +3. Page allocation +================== +mm_page_alloc page=%p pfn=%lu order=%d migratetype=%d gfp_flags=%s +mm_page_alloc_zone_locked page=%p pfn=%lu order=%u migratetype=%d cpu=%d percpu_refill=%d +mm_page_free page=%p pfn=%lu order=%d +mm_page_free_batched page=%p pfn=%lu order=%d cold=%d + +These four events deal with page allocation and freeing. mm_page_alloc is +a simple indicator of page allocator activity. Pages may be allocated from +the per-CPU allocator (high performance) or the buddy allocator. + +If pages are allocated directly from the buddy allocator, the +mm_page_alloc_zone_locked event is triggered. This event is important as high +amounts of activity imply high activity on the zone->lock. Taking this lock +impairs performance by disabling interrupts, dirtying cache lines between +CPUs and serialising many CPUs. + +When a page is freed directly by the caller, the only mm_page_free event +is triggered. Significant amounts of activity here could indicate that the +callers should be batching their activities. + +When pages are freed in batch, the also mm_page_free_batched is triggered. +Broadly speaking, pages are taken off the LRU lock in bulk and +freed in batch with a page list. Significant amounts of activity here could +indicate that the system is under memory pressure and can also indicate +contention on the zone->lru_lock. + +4. Per-CPU Allocator Activity +============================= +mm_page_alloc_zone_locked page=%p pfn=%lu order=%u migratetype=%d cpu=%d percpu_refill=%d +mm_page_pcpu_drain page=%p pfn=%lu order=%d cpu=%d migratetype=%d + +In front of the page allocator is a per-cpu page allocator. It exists only +for order-0 pages, reduces contention on the zone->lock and reduces the +amount of writing on struct page. + +When a per-CPU list is empty or pages of the wrong type are allocated, +the zone->lock will be taken once and the per-CPU list refilled. The event +triggered is mm_page_alloc_zone_locked for each page allocated with the +event indicating whether it is for a percpu_refill or not. + +When the per-CPU list is too full, a number of pages are freed, each one +which triggers a mm_page_pcpu_drain event. + +The individual nature of the events is so that pages can be tracked +between allocation and freeing. A number of drain or refill pages that occur +consecutively imply the zone->lock being taken once. Large amounts of per-CPU +refills and drains could imply an imbalance between CPUs where too much work +is being concentrated in one place. It could also indicate that the per-CPU +lists should be a larger size. Finally, large amounts of refills on one CPU +and drains on another could be a factor in causing large amounts of cache +line bounces due to writes between CPUs and worth investigating if pages +can be allocated and freed on the same CPU through some algorithm change. + +5. External Fragmentation +========================= +mm_page_alloc_extfrag page=%p pfn=%lu alloc_order=%d fallback_order=%d pageblock_order=%d alloc_migratetype=%d fallback_migratetype=%d fragmenting=%d change_ownership=%d + +External fragmentation affects whether a high-order allocation will be +successful or not. For some types of hardware, this is important although +it is avoided where possible. If the system is using huge pages and needs +to be able to resize the pool over the lifetime of the system, this value +is important. + +Large numbers of this event implies that memory is fragmenting and +high-order allocations will start failing at some time in the future. One +means of reducing the occurrence of this event is to increase the size of +min_free_kbytes in increments of 3*pageblock_size*nr_online_nodes where +pageblock_size is usually the size of the default hugepage size. diff --git a/Documentation/trace/events-nmi.txt b/Documentation/trace/events-nmi.txt new file mode 100644 index 00000000000..c03c8c89f08 --- /dev/null +++ b/Documentation/trace/events-nmi.txt @@ -0,0 +1,43 @@ +NMI Trace Events + +These events normally show up here: + + /sys/kernel/debug/tracing/events/nmi + +-- + +nmi_handler: + +You might want to use this tracepoint if you suspect that your +NMI handlers are hogging large amounts of CPU time. The kernel +will warn if it sees long-running handlers: + + INFO: NMI handler took too long to run: 9.207 msecs + +and this tracepoint will allow you to drill down and get some +more details. + +Let's say you suspect that perf_event_nmi_handler() is causing +you some problems and you only want to trace that handler +specifically. You need to find its address: + + $ grep perf_event_nmi_handler /proc/kallsyms + ffffffff81625600 t perf_event_nmi_handler + +Let's also say you are only interested in when that function is +really hogging a lot of CPU time, like a millisecond at a time. +Note that the kernel's output is in milliseconds, but the input +to the filter is in nanoseconds! You can filter on 'delta_ns': + +cd /sys/kernel/debug/tracing/events/nmi/nmi_handler +echo 'handler==0xffffffff81625600 && delta_ns>1000000' > filter +echo 1 > enable + +Your output would then look like: + +$ cat /sys/kernel/debug/tracing/trace_pipe +<idle>-0 [000] d.h3 505.397558: nmi_handler: perf_event_nmi_handler() delta_ns: 3236765 handled: 1 +<idle>-0 [000] d.h3 505.805893: nmi_handler: perf_event_nmi_handler() delta_ns: 3174234 handled: 1 +<idle>-0 [000] d.h3 506.158206: nmi_handler: perf_event_nmi_handler() delta_ns: 3084642 handled: 1 +<idle>-0 [000] d.h3 506.334346: nmi_handler: perf_event_nmi_handler() delta_ns: 3080351 handled: 1 + diff --git a/Documentation/trace/events-power.txt b/Documentation/trace/events-power.txt new file mode 100644 index 00000000000..21d514ced21 --- /dev/null +++ b/Documentation/trace/events-power.txt @@ -0,0 +1,96 @@ + + Subsystem Trace Points: power + +The power tracing system captures events related to power transitions +within the kernel. Broadly speaking there are three major subheadings: + + o Power state switch which reports events related to suspend (S-states), + cpuidle (C-states) and cpufreq (P-states) + o System clock related changes + o Power domains related changes and transitions + +This document describes what each of the tracepoints is and why they +might be useful. + +Cf. include/trace/events/power.h for the events definitions. + +1. Power state switch events +============================ + +1.1 Trace API +----------------- + +A 'cpu' event class gathers the CPU-related events: cpuidle and +cpufreq. + +cpu_idle "state=%lu cpu_id=%lu" +cpu_frequency "state=%lu cpu_id=%lu" + +A suspend event is used to indicate the system going in and out of the +suspend mode: + +machine_suspend "state=%lu" + + +Note: the value of '-1' or '4294967295' for state means an exit from the current state, +i.e. trace_cpu_idle(4, smp_processor_id()) means that the system +enters the idle state 4, while trace_cpu_idle(PWR_EVENT_EXIT, smp_processor_id()) +means that the system exits the previous idle state. + +The event which has 'state=4294967295' in the trace is very important to the user +space tools which are using it to detect the end of the current state, and so to +correctly draw the states diagrams and to calculate accurate statistics etc. + +2. Clocks events +================ +The clock events are used for clock enable/disable and for +clock rate change. + +clock_enable "%s state=%lu cpu_id=%lu" +clock_disable "%s state=%lu cpu_id=%lu" +clock_set_rate "%s state=%lu cpu_id=%lu" + +The first parameter gives the clock name (e.g. "gpio1_iclk"). +The second parameter is '1' for enable, '0' for disable, the target +clock rate for set_rate. + +3. Power domains events +======================= +The power domain events are used for power domains transitions + +power_domain_target "%s state=%lu cpu_id=%lu" + +The first parameter gives the power domain name (e.g. "mpu_pwrdm"). +The second parameter is the power domain target state. + +4. PM QoS events +================ +The PM QoS events are used for QoS add/update/remove request and for +target/flags update. + +pm_qos_add_request "pm_qos_class=%s value=%d" +pm_qos_update_request "pm_qos_class=%s value=%d" +pm_qos_remove_request "pm_qos_class=%s value=%d" +pm_qos_update_request_timeout "pm_qos_class=%s value=%d, timeout_us=%ld" + +The first parameter gives the QoS class name (e.g. "CPU_DMA_LATENCY"). +The second parameter is value to be added/updated/removed. +The third parameter is timeout value in usec. + +pm_qos_update_target "action=%s prev_value=%d curr_value=%d" +pm_qos_update_flags "action=%s prev_value=0x%x curr_value=0x%x" + +The first parameter gives the QoS action name (e.g. "ADD_REQ"). +The second parameter is the previous QoS value. +The third parameter is the current QoS value to update. + +And, there are also events used for device PM QoS add/update/remove request. + +dev_pm_qos_add_request "device=%s type=%s new_value=%d" +dev_pm_qos_update_request "device=%s type=%s new_value=%d" +dev_pm_qos_remove_request "device=%s type=%s new_value=%d" + +The first parameter gives the device name which tries to add/update/remove +QoS requests. +The second parameter gives the request type (e.g. "DEV_PM_QOS_RESUME_LATENCY"). +The third parameter is value to be added/updated/removed. diff --git a/Documentation/trace/events.txt b/Documentation/trace/events.txt index f157d7594ea..75d25a1d6e4 100644 --- a/Documentation/trace/events.txt +++ b/Documentation/trace/events.txt @@ -1,7 +1,7 @@ Event Tracing Documentation written by Theodore Ts'o - Updated by Li Zefan + Updated by Li Zefan and Tom Zanussi 1. Introduction =============== @@ -22,12 +22,12 @@ tracing information should be printed. --------------------------------- The events which are available for tracing can be found in the file -/debug/tracing/available_events. +/sys/kernel/debug/tracing/available_events. To enable a particular event, such as 'sched_wakeup', simply echo it -to /debug/tracing/set_event. For example: +to /sys/kernel/debug/tracing/set_event. For example: - # echo sched_wakeup >> /debug/tracing/set_event + # echo sched_wakeup >> /sys/kernel/debug/tracing/set_event [ Note: '>>' is necessary, otherwise it will firstly disable all the events. ] @@ -35,15 +35,15 @@ to /debug/tracing/set_event. For example: To disable an event, echo the event name to the set_event file prefixed with an exclamation point: - # echo '!sched_wakeup' >> /debug/tracing/set_event + # echo '!sched_wakeup' >> /sys/kernel/debug/tracing/set_event To disable all events, echo an empty line to the set_event file: - # echo > /debug/tracing/set_event + # echo > /sys/kernel/debug/tracing/set_event To enable all events, echo '*:*' or '*:' to the set_event file: - # echo *:* > /debug/tracing/set_event + # echo *:* > /sys/kernel/debug/tracing/set_event The events are organized into subsystems, such as ext4, irq, sched, etc., and a full event name looks like this: <subsystem>:<event>. The @@ -52,29 +52,29 @@ file. All of the events in a subsystem can be specified via the syntax "<subsystem>:*"; for example, to enable all irq events, you can use the command: - # echo 'irq:*' > /debug/tracing/set_event + # echo 'irq:*' > /sys/kernel/debug/tracing/set_event 2.2 Via the 'enable' toggle --------------------------- -The events available are also listed in /debug/tracing/events/ hierarchy +The events available are also listed in /sys/kernel/debug/tracing/events/ hierarchy of directories. To enable event 'sched_wakeup': - # echo 1 > /debug/tracing/events/sched/sched_wakeup/enable + # echo 1 > /sys/kernel/debug/tracing/events/sched/sched_wakeup/enable To disable it: - # echo 0 > /debug/tracing/events/sched/sched_wakeup/enable + # echo 0 > /sys/kernel/debug/tracing/events/sched/sched_wakeup/enable To enable all events in sched subsystem: - # echo 1 > /debug/tracing/events/sched/enable + # echo 1 > /sys/kernel/debug/tracing/events/sched/enable -To eanble all events: +To enable all events: - # echo 1 > /debug/tracing/events/enable + # echo 1 > /sys/kernel/debug/tracing/events/enable When reading one of these enable files, there are four results: @@ -83,8 +83,414 @@ When reading one of these enable files, there are four results: X - there is a mixture of events enabled and disabled ? - this file does not affect any event +2.3 Boot option +--------------- + +In order to facilitate early boot debugging, use boot option: + + trace_event=[event-list] + +event-list is a comma separated list of events. See section 2.1 for event +format. + 3. Defining an event-enabled tracepoint ======================================= See The example provided in samples/trace_events +4. Event formats +================ + +Each trace event has a 'format' file associated with it that contains +a description of each field in a logged event. This information can +be used to parse the binary trace stream, and is also the place to +find the field names that can be used in event filters (see section 5). + +It also displays the format string that will be used to print the +event in text mode, along with the event name and ID used for +profiling. + +Every event has a set of 'common' fields associated with it; these are +the fields prefixed with 'common_'. The other fields vary between +events and correspond to the fields defined in the TRACE_EVENT +definition for that event. + +Each field in the format has the form: + + field:field-type field-name; offset:N; size:N; + +where offset is the offset of the field in the trace record and size +is the size of the data item, in bytes. + +For example, here's the information displayed for the 'sched_wakeup' +event: + +# cat /sys/kernel/debug/tracing/events/sched/sched_wakeup/format + +name: sched_wakeup +ID: 60 +format: + field:unsigned short common_type; offset:0; size:2; + field:unsigned char common_flags; offset:2; size:1; + field:unsigned char common_preempt_count; offset:3; size:1; + field:int common_pid; offset:4; size:4; + field:int common_tgid; offset:8; size:4; + + field:char comm[TASK_COMM_LEN]; offset:12; size:16; + field:pid_t pid; offset:28; size:4; + field:int prio; offset:32; size:4; + field:int success; offset:36; size:4; + field:int cpu; offset:40; size:4; + +print fmt: "task %s:%d [%d] success=%d [%03d]", REC->comm, REC->pid, + REC->prio, REC->success, REC->cpu + +This event contains 10 fields, the first 5 common and the remaining 5 +event-specific. All the fields for this event are numeric, except for +'comm' which is a string, a distinction important for event filtering. + +5. Event filtering +================== + +Trace events can be filtered in the kernel by associating boolean +'filter expressions' with them. As soon as an event is logged into +the trace buffer, its fields are checked against the filter expression +associated with that event type. An event with field values that +'match' the filter will appear in the trace output, and an event whose +values don't match will be discarded. An event with no filter +associated with it matches everything, and is the default when no +filter has been set for an event. + +5.1 Expression syntax +--------------------- + +A filter expression consists of one or more 'predicates' that can be +combined using the logical operators '&&' and '||'. A predicate is +simply a clause that compares the value of a field contained within a +logged event with a constant value and returns either 0 or 1 depending +on whether the field value matched (1) or didn't match (0): + + field-name relational-operator value + +Parentheses can be used to provide arbitrary logical groupings and +double-quotes can be used to prevent the shell from interpreting +operators as shell metacharacters. + +The field-names available for use in filters can be found in the +'format' files for trace events (see section 4). + +The relational-operators depend on the type of the field being tested: + +The operators available for numeric fields are: + +==, !=, <, <=, >, >=, & + +And for string fields they are: + +==, !=, ~ + +The glob (~) only accepts a wild card character (*) at the start and or +end of the string. For example: + + prev_comm ~ "*sh" + prev_comm ~ "sh*" + prev_comm ~ "*sh*" + +But does not allow for it to be within the string: + + prev_comm ~ "ba*sh" <-- is invalid + +5.2 Setting filters +------------------- + +A filter for an individual event is set by writing a filter expression +to the 'filter' file for the given event. + +For example: + +# cd /sys/kernel/debug/tracing/events/sched/sched_wakeup +# echo "common_preempt_count > 4" > filter + +A slightly more involved example: + +# cd /sys/kernel/debug/tracing/events/signal/signal_generate +# echo "((sig >= 10 && sig < 15) || sig == 17) && comm != bash" > filter + +If there is an error in the expression, you'll get an 'Invalid +argument' error when setting it, and the erroneous string along with +an error message can be seen by looking at the filter e.g.: + +# cd /sys/kernel/debug/tracing/events/signal/signal_generate +# echo "((sig >= 10 && sig < 15) || dsig == 17) && comm != bash" > filter +-bash: echo: write error: Invalid argument +# cat filter +((sig >= 10 && sig < 15) || dsig == 17) && comm != bash +^ +parse_error: Field not found + +Currently the caret ('^') for an error always appears at the beginning of +the filter string; the error message should still be useful though +even without more accurate position info. + +5.3 Clearing filters +-------------------- + +To clear the filter for an event, write a '0' to the event's filter +file. + +To clear the filters for all events in a subsystem, write a '0' to the +subsystem's filter file. + +5.3 Subsystem filters +--------------------- + +For convenience, filters for every event in a subsystem can be set or +cleared as a group by writing a filter expression into the filter file +at the root of the subsystem. Note however, that if a filter for any +event within the subsystem lacks a field specified in the subsystem +filter, or if the filter can't be applied for any other reason, the +filter for that event will retain its previous setting. This can +result in an unintended mixture of filters which could lead to +confusing (to the user who might think different filters are in +effect) trace output. Only filters that reference just the common +fields can be guaranteed to propagate successfully to all events. + +Here are a few subsystem filter examples that also illustrate the +above points: + +Clear the filters on all events in the sched subsystem: + +# cd /sys/kernel/debug/tracing/events/sched +# echo 0 > filter +# cat sched_switch/filter +none +# cat sched_wakeup/filter +none + +Set a filter using only common fields for all events in the sched +subsystem (all events end up with the same filter): + +# cd /sys/kernel/debug/tracing/events/sched +# echo common_pid == 0 > filter +# cat sched_switch/filter +common_pid == 0 +# cat sched_wakeup/filter +common_pid == 0 + +Attempt to set a filter using a non-common field for all events in the +sched subsystem (all events but those that have a prev_pid field retain +their old filters): + +# cd /sys/kernel/debug/tracing/events/sched +# echo prev_pid == 0 > filter +# cat sched_switch/filter +prev_pid == 0 +# cat sched_wakeup/filter +common_pid == 0 + +6. Event triggers +================= + +Trace events can be made to conditionally invoke trigger 'commands' +which can take various forms and are described in detail below; +examples would be enabling or disabling other trace events or invoking +a stack trace whenever the trace event is hit. Whenever a trace event +with attached triggers is invoked, the set of trigger commands +associated with that event is invoked. Any given trigger can +additionally have an event filter of the same form as described in +section 5 (Event filtering) associated with it - the command will only +be invoked if the event being invoked passes the associated filter. +If no filter is associated with the trigger, it always passes. + +Triggers are added to and removed from a particular event by writing +trigger expressions to the 'trigger' file for the given event. + +A given event can have any number of triggers associated with it, +subject to any restrictions that individual commands may have in that +regard. + +Event triggers are implemented on top of "soft" mode, which means that +whenever a trace event has one or more triggers associated with it, +the event is activated even if it isn't actually enabled, but is +disabled in a "soft" mode. That is, the tracepoint will be called, +but just will not be traced, unless of course it's actually enabled. +This scheme allows triggers to be invoked even for events that aren't +enabled, and also allows the current event filter implementation to be +used for conditionally invoking triggers. + +The syntax for event triggers is roughly based on the syntax for +set_ftrace_filter 'ftrace filter commands' (see the 'Filter commands' +section of Documentation/trace/ftrace.txt), but there are major +differences and the implementation isn't currently tied to it in any +way, so beware about making generalizations between the two. + +6.1 Expression syntax +--------------------- + +Triggers are added by echoing the command to the 'trigger' file: + + # echo 'command[:count] [if filter]' > trigger + +Triggers are removed by echoing the same command but starting with '!' +to the 'trigger' file: + + # echo '!command[:count] [if filter]' > trigger + +The [if filter] part isn't used in matching commands when removing, so +leaving that off in a '!' command will accomplish the same thing as +having it in. + +The filter syntax is the same as that described in the 'Event +filtering' section above. + +For ease of use, writing to the trigger file using '>' currently just +adds or removes a single trigger and there's no explicit '>>' support +('>' actually behaves like '>>') or truncation support to remove all +triggers (you have to use '!' for each one added.) + +6.2 Supported trigger commands +------------------------------ + +The following commands are supported: + +- enable_event/disable_event + + These commands can enable or disable another trace event whenever + the triggering event is hit. When these commands are registered, + the other trace event is activated, but disabled in a "soft" mode. + That is, the tracepoint will be called, but just will not be traced. + The event tracepoint stays in this mode as long as there's a trigger + in effect that can trigger it. + + For example, the following trigger causes kmalloc events to be + traced when a read system call is entered, and the :1 at the end + specifies that this enablement happens only once: + + # echo 'enable_event:kmem:kmalloc:1' > \ + /sys/kernel/debug/tracing/events/syscalls/sys_enter_read/trigger + + The following trigger causes kmalloc events to stop being traced + when a read system call exits. This disablement happens on every + read system call exit: + + # echo 'disable_event:kmem:kmalloc' > \ + /sys/kernel/debug/tracing/events/syscalls/sys_exit_read/trigger + + The format is: + + enable_event:<system>:<event>[:count] + disable_event:<system>:<event>[:count] + + To remove the above commands: + + # echo '!enable_event:kmem:kmalloc:1' > \ + /sys/kernel/debug/tracing/events/syscalls/sys_enter_read/trigger + + # echo '!disable_event:kmem:kmalloc' > \ + /sys/kernel/debug/tracing/events/syscalls/sys_exit_read/trigger + + Note that there can be any number of enable/disable_event triggers + per triggering event, but there can only be one trigger per + triggered event. e.g. sys_enter_read can have triggers enabling both + kmem:kmalloc and sched:sched_switch, but can't have two kmem:kmalloc + versions such as kmem:kmalloc and kmem:kmalloc:1 or 'kmem:kmalloc if + bytes_req == 256' and 'kmem:kmalloc if bytes_alloc == 256' (they + could be combined into a single filter on kmem:kmalloc though). + +- stacktrace + + This command dumps a stacktrace in the trace buffer whenever the + triggering event occurs. + + For example, the following trigger dumps a stacktrace every time the + kmalloc tracepoint is hit: + + # echo 'stacktrace' > \ + /sys/kernel/debug/tracing/events/kmem/kmalloc/trigger + + The following trigger dumps a stacktrace the first 5 times a kmalloc + request happens with a size >= 64K + + # echo 'stacktrace:5 if bytes_req >= 65536' > \ + /sys/kernel/debug/tracing/events/kmem/kmalloc/trigger + + The format is: + + stacktrace[:count] + + To remove the above commands: + + # echo '!stacktrace' > \ + /sys/kernel/debug/tracing/events/kmem/kmalloc/trigger + + # echo '!stacktrace:5 if bytes_req >= 65536' > \ + /sys/kernel/debug/tracing/events/kmem/kmalloc/trigger + + The latter can also be removed more simply by the following (without + the filter): + + # echo '!stacktrace:5' > \ + /sys/kernel/debug/tracing/events/kmem/kmalloc/trigger + + Note that there can be only one stacktrace trigger per triggering + event. + +- snapshot + + This command causes a snapshot to be triggered whenever the + triggering event occurs. + + The following command creates a snapshot every time a block request + queue is unplugged with a depth > 1. If you were tracing a set of + events or functions at the time, the snapshot trace buffer would + capture those events when the trigger event occurred: + + # echo 'snapshot if nr_rq > 1' > \ + /sys/kernel/debug/tracing/events/block/block_unplug/trigger + + To only snapshot once: + + # echo 'snapshot:1 if nr_rq > 1' > \ + /sys/kernel/debug/tracing/events/block/block_unplug/trigger + + To remove the above commands: + + # echo '!snapshot if nr_rq > 1' > \ + /sys/kernel/debug/tracing/events/block/block_unplug/trigger + + # echo '!snapshot:1 if nr_rq > 1' > \ + /sys/kernel/debug/tracing/events/block/block_unplug/trigger + + Note that there can be only one snapshot trigger per triggering + event. + +- traceon/traceoff + + These commands turn tracing on and off when the specified events are + hit. The parameter determines how many times the tracing system is + turned on and off. If unspecified, there is no limit. + + The following command turns tracing off the first time a block + request queue is unplugged with a depth > 1. If you were tracing a + set of events or functions at the time, you could then examine the + trace buffer to see the sequence of events that led up to the + trigger event: + + # echo 'traceoff:1 if nr_rq > 1' > \ + /sys/kernel/debug/tracing/events/block/block_unplug/trigger + + To always disable tracing when nr_rq > 1 : + + # echo 'traceoff if nr_rq > 1' > \ + /sys/kernel/debug/tracing/events/block/block_unplug/trigger + + To remove the above commands: + + # echo '!traceoff:1 if nr_rq > 1' > \ + /sys/kernel/debug/tracing/events/block/block_unplug/trigger + + # echo '!traceoff if nr_rq > 1' > \ + /sys/kernel/debug/tracing/events/block/block_unplug/trigger + + Note that there can be only one traceon or traceoff trigger per + triggering event. diff --git a/Documentation/trace/ftrace-design.txt b/Documentation/trace/ftrace-design.txt new file mode 100644 index 00000000000..3f669b9e885 --- /dev/null +++ b/Documentation/trace/ftrace-design.txt @@ -0,0 +1,408 @@ + function tracer guts + ==================== + By Mike Frysinger + +Introduction +------------ + +Here we will cover the architecture pieces that the common function tracing +code relies on for proper functioning. Things are broken down into increasing +complexity so that you can start simple and at least get basic functionality. + +Note that this focuses on architecture implementation details only. If you +want more explanation of a feature in terms of common code, review the common +ftrace.txt file. + +Ideally, everyone who wishes to retain performance while supporting tracing in +their kernel should make it all the way to dynamic ftrace support. + + +Prerequisites +------------- + +Ftrace relies on these features being implemented: + STACKTRACE_SUPPORT - implement save_stack_trace() + TRACE_IRQFLAGS_SUPPORT - implement include/asm/irqflags.h + + +HAVE_FUNCTION_TRACER +-------------------- + +You will need to implement the mcount and the ftrace_stub functions. + +The exact mcount symbol name will depend on your toolchain. Some call it +"mcount", "_mcount", or even "__mcount". You can probably figure it out by +running something like: + $ echo 'main(){}' | gcc -x c -S -o - - -pg | grep mcount + call mcount +We'll make the assumption below that the symbol is "mcount" just to keep things +nice and simple in the examples. + +Keep in mind that the ABI that is in effect inside of the mcount function is +*highly* architecture/toolchain specific. We cannot help you in this regard, +sorry. Dig up some old documentation and/or find someone more familiar than +you to bang ideas off of. Typically, register usage (argument/scratch/etc...) +is a major issue at this point, especially in relation to the location of the +mcount call (before/after function prologue). You might also want to look at +how glibc has implemented the mcount function for your architecture. It might +be (semi-)relevant. + +The mcount function should check the function pointer ftrace_trace_function +to see if it is set to ftrace_stub. If it is, there is nothing for you to do, +so return immediately. If it isn't, then call that function in the same way +the mcount function normally calls __mcount_internal -- the first argument is +the "frompc" while the second argument is the "selfpc" (adjusted to remove the +size of the mcount call that is embedded in the function). + +For example, if the function foo() calls bar(), when the bar() function calls +mcount(), the arguments mcount() will pass to the tracer are: + "frompc" - the address bar() will use to return to foo() + "selfpc" - the address bar() (with mcount() size adjustment) + +Also keep in mind that this mcount function will be called *a lot*, so +optimizing for the default case of no tracer will help the smooth running of +your system when tracing is disabled. So the start of the mcount function is +typically the bare minimum with checking things before returning. That also +means the code flow should usually be kept linear (i.e. no branching in the nop +case). This is of course an optimization and not a hard requirement. + +Here is some pseudo code that should help (these functions should actually be +implemented in assembly): + +void ftrace_stub(void) +{ + return; +} + +void mcount(void) +{ + /* save any bare state needed in order to do initial checking */ + + extern void (*ftrace_trace_function)(unsigned long, unsigned long); + if (ftrace_trace_function != ftrace_stub) + goto do_trace; + + /* restore any bare state */ + + return; + +do_trace: + + /* save all state needed by the ABI (see paragraph above) */ + + unsigned long frompc = ...; + unsigned long selfpc = <return address> - MCOUNT_INSN_SIZE; + ftrace_trace_function(frompc, selfpc); + + /* restore all state needed by the ABI */ +} + +Don't forget to export mcount for modules ! +extern void mcount(void); +EXPORT_SYMBOL(mcount); + + +HAVE_FUNCTION_TRACE_MCOUNT_TEST +------------------------------- + +This is an optional optimization for the normal case when tracing is turned off +in the system. If you do not enable this Kconfig option, the common ftrace +code will take care of doing the checking for you. + +To support this feature, you only need to check the function_trace_stop +variable in the mcount function. If it is non-zero, there is no tracing to be +done at all, so you can return. + +This additional pseudo code would simply be: +void mcount(void) +{ + /* save any bare state needed in order to do initial checking */ + ++ if (function_trace_stop) ++ return; + + extern void (*ftrace_trace_function)(unsigned long, unsigned long); + if (ftrace_trace_function != ftrace_stub) +... + + +HAVE_FUNCTION_GRAPH_TRACER +-------------------------- + +Deep breath ... time to do some real work. Here you will need to update the +mcount function to check ftrace graph function pointers, as well as implement +some functions to save (hijack) and restore the return address. + +The mcount function should check the function pointers ftrace_graph_return +(compare to ftrace_stub) and ftrace_graph_entry (compare to +ftrace_graph_entry_stub). If either of those is not set to the relevant stub +function, call the arch-specific function ftrace_graph_caller which in turn +calls the arch-specific function prepare_ftrace_return. Neither of these +function names is strictly required, but you should use them anyway to stay +consistent across the architecture ports -- easier to compare & contrast +things. + +The arguments to prepare_ftrace_return are slightly different than what are +passed to ftrace_trace_function. The second argument "selfpc" is the same, +but the first argument should be a pointer to the "frompc". Typically this is +located on the stack. This allows the function to hijack the return address +temporarily to have it point to the arch-specific function return_to_handler. +That function will simply call the common ftrace_return_to_handler function and +that will return the original return address with which you can return to the +original call site. + +Here is the updated mcount pseudo code: +void mcount(void) +{ +... + if (ftrace_trace_function != ftrace_stub) + goto do_trace; + ++#ifdef CONFIG_FUNCTION_GRAPH_TRACER ++ extern void (*ftrace_graph_return)(...); ++ extern void (*ftrace_graph_entry)(...); ++ if (ftrace_graph_return != ftrace_stub || ++ ftrace_graph_entry != ftrace_graph_entry_stub) ++ ftrace_graph_caller(); ++#endif + + /* restore any bare state */ +... + +Here is the pseudo code for the new ftrace_graph_caller assembly function: +#ifdef CONFIG_FUNCTION_GRAPH_TRACER +void ftrace_graph_caller(void) +{ + /* save all state needed by the ABI */ + + unsigned long *frompc = &...; + unsigned long selfpc = <return address> - MCOUNT_INSN_SIZE; + /* passing frame pointer up is optional -- see below */ + prepare_ftrace_return(frompc, selfpc, frame_pointer); + + /* restore all state needed by the ABI */ +} +#endif + +For information on how to implement prepare_ftrace_return(), simply look at the +x86 version (the frame pointer passing is optional; see the next section for +more information). The only architecture-specific piece in it is the setup of +the fault recovery table (the asm(...) code). The rest should be the same +across architectures. + +Here is the pseudo code for the new return_to_handler assembly function. Note +that the ABI that applies here is different from what applies to the mcount +code. Since you are returning from a function (after the epilogue), you might +be able to skimp on things saved/restored (usually just registers used to pass +return values). + +#ifdef CONFIG_FUNCTION_GRAPH_TRACER +void return_to_handler(void) +{ + /* save all state needed by the ABI (see paragraph above) */ + + void (*original_return_point)(void) = ftrace_return_to_handler(); + + /* restore all state needed by the ABI */ + + /* this is usually either a return or a jump */ + original_return_point(); +} +#endif + + +HAVE_FUNCTION_GRAPH_FP_TEST +--------------------------- + +An arch may pass in a unique value (frame pointer) to both the entering and +exiting of a function. On exit, the value is compared and if it does not +match, then it will panic the kernel. This is largely a sanity check for bad +code generation with gcc. If gcc for your port sanely updates the frame +pointer under different optimization levels, then ignore this option. + +However, adding support for it isn't terribly difficult. In your assembly code +that calls prepare_ftrace_return(), pass the frame pointer as the 3rd argument. +Then in the C version of that function, do what the x86 port does and pass it +along to ftrace_push_return_trace() instead of a stub value of 0. + +Similarly, when you call ftrace_return_to_handler(), pass it the frame pointer. + + +HAVE_FTRACE_NMI_ENTER +--------------------- + +If you can't trace NMI functions, then skip this option. + +<details to be filled> + + +HAVE_SYSCALL_TRACEPOINTS +------------------------ + +You need very few things to get the syscalls tracing in an arch. + +- Support HAVE_ARCH_TRACEHOOK (see arch/Kconfig). +- Have a NR_syscalls variable in <asm/unistd.h> that provides the number + of syscalls supported by the arch. +- Support the TIF_SYSCALL_TRACEPOINT thread flags. +- Put the trace_sys_enter() and trace_sys_exit() tracepoints calls from ptrace + in the ptrace syscalls tracing path. +- If the system call table on this arch is more complicated than a simple array + of addresses of the system calls, implement an arch_syscall_addr to return + the address of a given system call. +- If the symbol names of the system calls do not match the function names on + this arch, define ARCH_HAS_SYSCALL_MATCH_SYM_NAME in asm/ftrace.h and + implement arch_syscall_match_sym_name with the appropriate logic to return + true if the function name corresponds with the symbol name. +- Tag this arch as HAVE_SYSCALL_TRACEPOINTS. + + +HAVE_FTRACE_MCOUNT_RECORD +------------------------- + +See scripts/recordmcount.pl for more info. Just fill in the arch-specific +details for how to locate the addresses of mcount call sites via objdump. +This option doesn't make much sense without also implementing dynamic ftrace. + + +HAVE_DYNAMIC_FTRACE +------------------- + +You will first need HAVE_FTRACE_MCOUNT_RECORD and HAVE_FUNCTION_TRACER, so +scroll your reader back up if you got over eager. + +Once those are out of the way, you will need to implement: + - asm/ftrace.h: + - MCOUNT_ADDR + - ftrace_call_adjust() + - struct dyn_arch_ftrace{} + - asm code: + - mcount() (new stub) + - ftrace_caller() + - ftrace_call() + - ftrace_stub() + - C code: + - ftrace_dyn_arch_init() + - ftrace_make_nop() + - ftrace_make_call() + - ftrace_update_ftrace_func() + +First you will need to fill out some arch details in your asm/ftrace.h. + +Define MCOUNT_ADDR as the address of your mcount symbol similar to: + #define MCOUNT_ADDR ((unsigned long)mcount) +Since no one else will have a decl for that function, you will need to: + extern void mcount(void); + +You will also need the helper function ftrace_call_adjust(). Most people +will be able to stub it out like so: + static inline unsigned long ftrace_call_adjust(unsigned long addr) + { + return addr; + } +<details to be filled> + +Lastly you will need the custom dyn_arch_ftrace structure. If you need +some extra state when runtime patching arbitrary call sites, this is the +place. For now though, create an empty struct: + struct dyn_arch_ftrace { + /* No extra data needed */ + }; + +With the header out of the way, we can fill out the assembly code. While we +did already create a mcount() function earlier, dynamic ftrace only wants a +stub function. This is because the mcount() will only be used during boot +and then all references to it will be patched out never to return. Instead, +the guts of the old mcount() will be used to create a new ftrace_caller() +function. Because the two are hard to merge, it will most likely be a lot +easier to have two separate definitions split up by #ifdefs. Same goes for +the ftrace_stub() as that will now be inlined in ftrace_caller(). + +Before we get confused anymore, let's check out some pseudo code so you can +implement your own stuff in assembly: + +void mcount(void) +{ + return; +} + +void ftrace_caller(void) +{ + /* implement HAVE_FUNCTION_TRACE_MCOUNT_TEST if you desire */ + + /* save all state needed by the ABI (see paragraph above) */ + + unsigned long frompc = ...; + unsigned long selfpc = <return address> - MCOUNT_INSN_SIZE; + +ftrace_call: + ftrace_stub(frompc, selfpc); + + /* restore all state needed by the ABI */ + +ftrace_stub: + return; +} + +This might look a little odd at first, but keep in mind that we will be runtime +patching multiple things. First, only functions that we actually want to trace +will be patched to call ftrace_caller(). Second, since we only have one tracer +active at a time, we will patch the ftrace_caller() function itself to call the +specific tracer in question. That is the point of the ftrace_call label. + +With that in mind, let's move on to the C code that will actually be doing the +runtime patching. You'll need a little knowledge of your arch's opcodes in +order to make it through the next section. + +Every arch has an init callback function. If you need to do something early on +to initialize some state, this is the time to do that. Otherwise, this simple +function below should be sufficient for most people: + +int __init ftrace_dyn_arch_init(void) +{ + return 0; +} + +There are two functions that are used to do runtime patching of arbitrary +functions. The first is used to turn the mcount call site into a nop (which +is what helps us retain runtime performance when not tracing). The second is +used to turn the mcount call site into a call to an arbitrary location (but +typically that is ftracer_caller()). See the general function definition in +linux/ftrace.h for the functions: + ftrace_make_nop() + ftrace_make_call() +The rec->ip value is the address of the mcount call site that was collected +by the scripts/recordmcount.pl during build time. + +The last function is used to do runtime patching of the active tracer. This +will be modifying the assembly code at the location of the ftrace_call symbol +inside of the ftrace_caller() function. So you should have sufficient padding +at that location to support the new function calls you'll be inserting. Some +people will be using a "call" type instruction while others will be using a +"branch" type instruction. Specifically, the function is: + ftrace_update_ftrace_func() + + +HAVE_DYNAMIC_FTRACE + HAVE_FUNCTION_GRAPH_TRACER +------------------------------------------------ + +The function grapher needs a few tweaks in order to work with dynamic ftrace. +Basically, you will need to: + - update: + - ftrace_caller() + - ftrace_graph_call() + - ftrace_graph_caller() + - implement: + - ftrace_enable_ftrace_graph_caller() + - ftrace_disable_ftrace_graph_caller() + +<details to be filled> +Quick notes: + - add a nop stub after the ftrace_call location named ftrace_graph_call; + stub needs to be large enough to support a call to ftrace_graph_caller() + - update ftrace_graph_caller() to work with being called by the new + ftrace_caller() since some semantics may have changed + - ftrace_enable_ftrace_graph_caller() will runtime patch the + ftrace_graph_call location with a call to ftrace_graph_caller() + - ftrace_disable_ftrace_graph_caller() will runtime patch the + ftrace_graph_call location with nops diff --git a/Documentation/trace/ftrace.txt b/Documentation/trace/ftrace.txt index 2a82d860294..2479b2a0c77 100644 --- a/Documentation/trace/ftrace.txt +++ b/Documentation/trace/ftrace.txt @@ -7,8 +7,8 @@ Copyright 2008 Red Hat Inc. (dual licensed under the GPL v2) Reviewers: Elias Oltmanns, Randy Dunlap, Andrew Morton, John Kacur, and David Teigland. - Written for: 2.6.28-rc2 +Updated for: 3.10 Introduction ------------ @@ -18,13 +18,22 @@ designers of systems to find what is going on inside the kernel. It can be used for debugging or analyzing latencies and performance issues that take place outside of user-space. -Although ftrace is the function tracer, it also includes an -infrastructure that allows for other types of tracing. Some of -the tracers that are currently in ftrace include a tracer to -trace context switches, the time it takes for a high priority -task to run after it was woken up, the time interrupts are -disabled, and more (ftrace allows for tracer plugins, which -means that the list of tracers can always grow). +Although ftrace is typically considered the function tracer, it +is really a frame work of several assorted tracing utilities. +There's latency tracing to examine what occurs between interrupts +disabled and enabled, as well as for preemption and from a time +a task is woken to the task is actually scheduled in. + +One of the most common uses of ftrace is the event tracing. +Through out the kernel is hundreds of static event points that +can be enabled via the debugfs file system to see what is +going on in certain parts of the kernel. + + +Implementation Details +---------------------- + +See ftrace-design.txt for details for arch porters and such. The File System @@ -33,17 +42,30 @@ The File System Ftrace uses the debugfs file system to hold the control files as well as the files to display output. -To mount the debugfs system: +When debugfs is configured into the kernel (which selecting any ftrace +option will do) the directory /sys/kernel/debug will be created. To mount +this directory, you can add to your /etc/fstab file: - # mkdir /debug - # mount -t debugfs nodev /debug + debugfs /sys/kernel/debug debugfs defaults 0 0 -( Note: it is more common to mount at /sys/kernel/debug, but for - simplicity this document will use /debug) +Or you can mount it at run time with: + + mount -t debugfs nodev /sys/kernel/debug + +For quicker access to that directory you may want to make a soft link to +it: + + ln -s /sys/kernel/debug /debug + +Any selected ftrace option will also create a directory called tracing +within the debugfs. The rest of the document will assume that you are in +the ftrace directory (cd /sys/kernel/debug/tracing) and will only concentrate +on the files within that directory and not distract from the content with +the extended "/sys/kernel/debug/tracing" path name. That's it! (assuming that you have ftrace configured into your kernel) -After mounting the debugfs, you can see a directory called +After mounting debugfs, you can see a directory called "tracing". This directory contains the control and output files of ftrace. Here is a list of some of the key files: @@ -62,58 +84,68 @@ of ftrace. Here is a list of some of the key files: tracers listed here can be configured by echoing their name into current_tracer. - tracing_enabled: + tracing_on: - This sets or displays whether the current_tracer - is activated and tracing or not. Echo 0 into this - file to disable the tracer or 1 to enable it. + This sets or displays whether writing to the trace + ring buffer is enabled. Echo 0 into this file to disable + the tracer or 1 to enable it. Note, this only disables + writing to the ring buffer, the tracing overhead may + still be occurring. trace: This file holds the output of the trace in a human readable format (described below). - latency_trace: - - This file shows the same trace but the information - is organized more to display possible latencies - in the system (described below). - trace_pipe: The output is the same as the "trace" file but this file is meant to be streamed with live tracing. - Reads from this file will block until new data - is retrieved. Unlike the "trace" and "latency_trace" - files, this file is a consumer. This means reading - from this file causes sequential reads to display - more current data. Once data is read from this - file, it is consumed, and will not be read - again with a sequential read. The "trace" and - "latency_trace" files are static, and if the - tracer is not adding more data, they will display - the same information every time they are read. + Reads from this file will block until new data is + retrieved. Unlike the "trace" file, this file is a + consumer. This means reading from this file causes + sequential reads to display more current data. Once + data is read from this file, it is consumed, and + will not be read again with a sequential read. The + "trace" file is static, and if the tracer is not + adding more data,they will display the same + information every time they are read. trace_options: This file lets the user control the amount of data that is displayed in one of the above output - files. + files. Options also exist to modify how a tracer + or events work (stack traces, timestamps, etc). + + options: + + This is a directory that has a file for every available + trace option (also in trace_options). Options may also be set + or cleared by writing a "1" or "0" respectively into the + corresponding file with the option name. tracing_max_latency: Some of the tracers record the max latency. For example, the time interrupts are disabled. This time is saved in this file. The max trace - will also be stored, and displayed by either - "trace" or "latency_trace". A new max trace will - only be recorded if the latency is greater than - the value in this file. (in microseconds) + will also be stored, and displayed by "trace". + A new max trace will only be recorded if the + latency is greater than the value in this + file. (in microseconds) + + tracing_thresh: + + Some latency tracers will record a trace whenever the + latency is greater than the number in this file. + Only active when the file contains a number greater than 0. + (in microseconds) buffer_size_kb: This sets or displays the number of kilobytes each CPU - buffer can hold. The tracer buffers are the same size + buffer holds. By default, the trace buffers are the same size for each CPU. The displayed number is the size of the CPU buffer and not total size of all buffers. The trace buffers are allocated in pages (blocks of memory @@ -122,16 +154,30 @@ of ftrace. Here is a list of some of the key files: than requested, the rest of the page will be used, making the actual allocation bigger than requested. ( Note, the size may not be a multiple of the page size - due to buffer managment overhead. ) + due to buffer management meta-data. ) + + buffer_total_size_kb: + + This displays the total combined size of all the trace buffers. + + free_buffer: + + If a process is performing the tracing, and the ring buffer + should be shrunk "freed" when the process is finished, even + if it were to be killed by a signal, this file can be used + for that purpose. On close of this file, the ring buffer will + be resized to its minimum size. Having a process that is tracing + also open this file, when the process exits its file descriptor + for this file will be closed, and in doing so, the ring buffer + will be "freed". - This can only be updated when the current_tracer - is set to "nop". + It may also stop tracing if disable_on_free option is set. tracing_cpumask: This is a mask that lets the user only trace - on specified CPUS. The format is a hex string - representing the CPUS. + on specified CPUs. The format is a hex string + representing the CPUs. set_ftrace_filter: @@ -144,6 +190,9 @@ of ftrace. Here is a list of some of the key files: to be traced. Echoing names of functions into this file will limit the trace to only those functions. + This interface also allows for commands to be used. See the + "Filter commands" section for more details. + set_ftrace_notrace: This has an effect opposite to that of @@ -169,6 +218,261 @@ of ftrace. Here is a list of some of the key files: "set_ftrace_notrace". (See the section "dynamic ftrace" below for more details.) + enabled_functions: + + This file is more for debugging ftrace, but can also be useful + in seeing if any function has a callback attached to it. + Not only does the trace infrastructure use ftrace function + trace utility, but other subsystems might too. This file + displays all functions that have a callback attached to them + as well as the number of callbacks that have been attached. + Note, a callback may also call multiple functions which will + not be listed in this count. + + If the callback registered to be traced by a function with + the "save regs" attribute (thus even more overhead), a 'R' + will be displayed on the same line as the function that + is returning registers. + + function_profile_enabled: + + When set it will enable all functions with either the function + tracer, or if enabled, the function graph tracer. It will + keep a histogram of the number of functions that were called + and if run with the function graph tracer, it will also keep + track of the time spent in those functions. The histogram + content can be displayed in the files: + + trace_stats/function<cpu> ( function0, function1, etc). + + trace_stats: + + A directory that holds different tracing stats. + + kprobe_events: + + Enable dynamic trace points. See kprobetrace.txt. + + kprobe_profile: + + Dynamic trace points stats. See kprobetrace.txt. + + max_graph_depth: + + Used with the function graph tracer. This is the max depth + it will trace into a function. Setting this to a value of + one will show only the first kernel function that is called + from user space. + + printk_formats: + + This is for tools that read the raw format files. If an event in + the ring buffer references a string (currently only trace_printk() + does this), only a pointer to the string is recorded into the buffer + and not the string itself. This prevents tools from knowing what + that string was. This file displays the string and address for + the string allowing tools to map the pointers to what the + strings were. + + saved_cmdlines: + + Only the pid of the task is recorded in a trace event unless + the event specifically saves the task comm as well. Ftrace + makes a cache of pid mappings to comms to try to display + comms for events. If a pid for a comm is not listed, then + "<...>" is displayed in the output. + + snapshot: + + This displays the "snapshot" buffer and also lets the user + take a snapshot of the current running trace. + See the "Snapshot" section below for more details. + + stack_max_size: + + When the stack tracer is activated, this will display the + maximum stack size it has encountered. + See the "Stack Trace" section below. + + stack_trace: + + This displays the stack back trace of the largest stack + that was encountered when the stack tracer is activated. + See the "Stack Trace" section below. + + stack_trace_filter: + + This is similar to "set_ftrace_filter" but it limits what + functions the stack tracer will check. + + trace_clock: + + Whenever an event is recorded into the ring buffer, a + "timestamp" is added. This stamp comes from a specified + clock. By default, ftrace uses the "local" clock. This + clock is very fast and strictly per cpu, but on some + systems it may not be monotonic with respect to other + CPUs. In other words, the local clocks may not be in sync + with local clocks on other CPUs. + + Usual clocks for tracing: + + # cat trace_clock + [local] global counter x86-tsc + + local: Default clock, but may not be in sync across CPUs + + global: This clock is in sync with all CPUs but may + be a bit slower than the local clock. + + counter: This is not a clock at all, but literally an atomic + counter. It counts up one by one, but is in sync + with all CPUs. This is useful when you need to + know exactly the order events occurred with respect to + each other on different CPUs. + + uptime: This uses the jiffies counter and the time stamp + is relative to the time since boot up. + + perf: This makes ftrace use the same clock that perf uses. + Eventually perf will be able to read ftrace buffers + and this will help out in interleaving the data. + + x86-tsc: Architectures may define their own clocks. For + example, x86 uses its own TSC cycle clock here. + + To set a clock, simply echo the clock name into this file. + + echo global > trace_clock + + trace_marker: + + This is a very useful file for synchronizing user space + with events happening in the kernel. Writing strings into + this file will be written into the ftrace buffer. + + It is useful in applications to open this file at the start + of the application and just reference the file descriptor + for the file. + + void trace_write(const char *fmt, ...) + { + va_list ap; + char buf[256]; + int n; + + if (trace_fd < 0) + return; + + va_start(ap, fmt); + n = vsnprintf(buf, 256, fmt, ap); + va_end(ap); + + write(trace_fd, buf, n); + } + + start: + + trace_fd = open("trace_marker", WR_ONLY); + + uprobe_events: + + Add dynamic tracepoints in programs. + See uprobetracer.txt + + uprobe_profile: + + Uprobe statistics. See uprobetrace.txt + + instances: + + This is a way to make multiple trace buffers where different + events can be recorded in different buffers. + See "Instances" section below. + + events: + + This is the trace event directory. It holds event tracepoints + (also known as static tracepoints) that have been compiled + into the kernel. It shows what event tracepoints exist + and how they are grouped by system. There are "enable" + files at various levels that can enable the tracepoints + when a "1" is written to them. + + See events.txt for more information. + + per_cpu: + + This is a directory that contains the trace per_cpu information. + + per_cpu/cpu0/buffer_size_kb: + + The ftrace buffer is defined per_cpu. That is, there's a separate + buffer for each CPU to allow writes to be done atomically, + and free from cache bouncing. These buffers may have different + size buffers. This file is similar to the buffer_size_kb + file, but it only displays or sets the buffer size for the + specific CPU. (here cpu0). + + per_cpu/cpu0/trace: + + This is similar to the "trace" file, but it will only display + the data specific for the CPU. If written to, it only clears + the specific CPU buffer. + + per_cpu/cpu0/trace_pipe + + This is similar to the "trace_pipe" file, and is a consuming + read, but it will only display (and consume) the data specific + for the CPU. + + per_cpu/cpu0/trace_pipe_raw + + For tools that can parse the ftrace ring buffer binary format, + the trace_pipe_raw file can be used to extract the data + from the ring buffer directly. With the use of the splice() + system call, the buffer data can be quickly transferred to + a file or to the network where a server is collecting the + data. + + Like trace_pipe, this is a consuming reader, where multiple + reads will always produce different data. + + per_cpu/cpu0/snapshot: + + This is similar to the main "snapshot" file, but will only + snapshot the current CPU (if supported). It only displays + the content of the snapshot for a given CPU, and if + written to, only clears this CPU buffer. + + per_cpu/cpu0/snapshot_raw: + + Similar to the trace_pipe_raw, but will read the binary format + from the snapshot buffer for the given CPU. + + per_cpu/cpu0/stats: + + This displays certain stats about the ring buffer: + + entries: The number of events that are still in the buffer. + + overrun: The number of lost events due to overwriting when + the buffer was full. + + commit overrun: Should always be zero. + This gets set if so many events happened within a nested + event (ring buffer is re-entrant), that it fills the + buffer and starts dropping events. + + bytes: Bytes actually read (not overwritten). + + oldest event ts: The oldest timestamp in the buffer + + now ts: The current timestamp + + dropped events: Events lost due to overwrite option being off. + + read events: The number of events read. The Tracers ----------- @@ -188,17 +492,13 @@ Here is the list of current tracers that may be configured. to draw a graph of function calls similar to C code source. - "sched_switch" - - Traces the context switches and wakeups between tasks. - "irqsoff" Traces the areas that disable interrupts and saves the trace with the longest max latency. See tracing_max_latency. When a new max is recorded, it replaces the old trace. It is best to view this - trace via the latency_trace file. + trace with the latency-format option enabled. "preemptoff" @@ -216,11 +516,13 @@ Here is the list of current tracers that may be configured. Traces and records the max latency that it takes for the highest priority task to get scheduled after it has been woken up. + Traces all tasks as an average developer would expect. - "hw-branch-tracer" + "wakeup_rt" - Uses the BTS CPU feature on x86 CPUs to traces all - branches executed. + Traces and records the max latency that it takes for just + RT tasks (as the current "wakeup" does). This is useful + for those interested in wake up timings of RT tasks. "nop" @@ -244,103 +546,100 @@ Here is an example of the output format of the file "trace" -------- # tracer: function # -# TASK-PID CPU# TIMESTAMP FUNCTION -# | | | | | - bash-4251 [01] 10152.583854: path_put <-path_walk - bash-4251 [01] 10152.583855: dput <-path_put - bash-4251 [01] 10152.583855: _atomic_dec_and_lock <-dput +# entries-in-buffer/entries-written: 140080/250280 #P:4 +# +# _-----=> irqs-off +# / _----=> need-resched +# | / _---=> hardirq/softirq +# || / _--=> preempt-depth +# ||| / delay +# TASK-PID CPU# |||| TIMESTAMP FUNCTION +# | | | |||| | | + bash-1977 [000] .... 17284.993652: sys_close <-system_call_fastpath + bash-1977 [000] .... 17284.993653: __close_fd <-sys_close + bash-1977 [000] .... 17284.993653: _raw_spin_lock <-__close_fd + sshd-1974 [003] .... 17284.993653: __srcu_read_unlock <-fsnotify + bash-1977 [000] .... 17284.993654: add_preempt_count <-_raw_spin_lock + bash-1977 [000] ...1 17284.993655: _raw_spin_unlock <-__close_fd + bash-1977 [000] ...1 17284.993656: sub_preempt_count <-_raw_spin_unlock + bash-1977 [000] .... 17284.993657: filp_close <-__close_fd + bash-1977 [000] .... 17284.993657: dnotify_flush <-filp_close + sshd-1974 [003] .... 17284.993658: sys_select <-system_call_fastpath -------- A header is printed with the tracer name that is represented by -the trace. In this case the tracer is "function". Then a header -showing the format. Task name "bash", the task PID "4251", the -CPU that it was running on "01", the timestamp in <secs>.<usecs> -format, the function name that was traced "path_put" and the -parent function that called this function "path_walk". The -timestamp is the time at which the function was entered. - -The sched_switch tracer also includes tracing of task wakeups -and context switches. - - ksoftirqd/1-7 [01] 1453.070013: 7:115:R + 2916:115:S - ksoftirqd/1-7 [01] 1453.070013: 7:115:R + 10:115:S - ksoftirqd/1-7 [01] 1453.070013: 7:115:R ==> 10:115:R - events/1-10 [01] 1453.070013: 10:115:S ==> 2916:115:R - kondemand/1-2916 [01] 1453.070013: 2916:115:S ==> 7:115:R - ksoftirqd/1-7 [01] 1453.070013: 7:115:S ==> 0:140:R - -Wake ups are represented by a "+" and the context switches are -shown as "==>". The format is: - - Context switches: - - Previous task Next Task - - <pid>:<prio>:<state> ==> <pid>:<prio>:<state> - - Wake ups: - - Current task Task waking up - - <pid>:<prio>:<state> + <pid>:<prio>:<state> - -The prio is the internal kernel priority, which is the inverse -of the priority that is usually displayed by user-space tools. -Zero represents the highest priority (99). Prio 100 starts the -"nice" priorities with 100 being equal to nice -20 and 139 being -nice 19. The prio "140" is reserved for the idle task which is -the lowest priority thread (pid 0). - +the trace. In this case the tracer is "function". Then it shows the +number of events in the buffer as well as the total number of entries +that were written. The difference is the number of entries that were +lost due to the buffer filling up (250280 - 140080 = 110200 events +lost). + +The header explains the content of the events. Task name "bash", the task +PID "1977", the CPU that it was running on "000", the latency format +(explained below), the timestamp in <secs>.<usecs> format, the +function name that was traced "sys_close" and the parent function that +called this function "system_call_fastpath". The timestamp is the time +at which the function was entered. Latency trace format -------------------- -For traces that display latency times, the latency_trace file -gives somewhat more information to see why a latency happened. -Here is a typical trace. +When the latency-format option is enabled or when one of the latency +tracers is set, the trace file gives somewhat more information to see +why a latency happened. Here is a typical trace. # tracer: irqsoff # -irqsoff latency trace v1.1.5 on 2.6.26-rc8 --------------------------------------------------------------------- - latency: 97 us, #3/3, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2) - ----------------- - | task: swapper-0 (uid:0 nice:0 policy:0 rt_prio:0) - ----------------- - => started at: apic_timer_interrupt - => ended at: do_softirq - -# _------=> CPU# -# / _-----=> irqs-off -# | / _----=> need-resched -# || / _---=> hardirq/softirq -# ||| / _--=> preempt-depth -# |||| / -# ||||| delay -# cmd pid ||||| time | caller -# \ / ||||| \ | / - <idle>-0 0d..1 0us+: trace_hardirqs_off_thunk (apic_timer_interrupt) - <idle>-0 0d.s. 97us : __do_softirq (do_softirq) - <idle>-0 0d.s1 98us : trace_hardirqs_on (do_softirq) +# irqsoff latency trace v1.1.5 on 3.8.0-test+ +# -------------------------------------------------------------------- +# latency: 259 us, #4/4, CPU#2 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4) +# ----------------- +# | task: ps-6143 (uid:0 nice:0 policy:0 rt_prio:0) +# ----------------- +# => started at: __lock_task_sighand +# => ended at: _raw_spin_unlock_irqrestore +# +# +# _------=> CPU# +# / _-----=> irqs-off +# | / _----=> need-resched +# || / _---=> hardirq/softirq +# ||| / _--=> preempt-depth +# |||| / delay +# cmd pid ||||| time | caller +# \ / ||||| \ | / + ps-6143 2d... 0us!: trace_hardirqs_off <-__lock_task_sighand + ps-6143 2d..1 259us+: trace_hardirqs_on <-_raw_spin_unlock_irqrestore + ps-6143 2d..1 263us+: time_hardirqs_on <-_raw_spin_unlock_irqrestore + ps-6143 2d..1 306us : <stack trace> + => trace_hardirqs_on_caller + => trace_hardirqs_on + => _raw_spin_unlock_irqrestore + => do_task_stat + => proc_tgid_stat + => proc_single_show + => seq_read + => vfs_read + => sys_read + => system_call_fastpath This shows that the current tracer is "irqsoff" tracing the time -for which interrupts were disabled. It gives the trace version -and the version of the kernel upon which this was executed on -(2.6.26-rc8). Then it displays the max latency in microsecs (97 -us). The number of trace entries displayed and the total number -recorded (both are three: #3/3). The type of preemption that was -used (PREEMPT). VP, KP, SP, and HP are always zero and are -reserved for later use. #P is the number of online CPUS (#P:2). +for which interrupts were disabled. It gives the trace version (which +never changes) and the version of the kernel upon which this was executed on +(3.10). Then it displays the max latency in microseconds (259 us). The number +of trace entries displayed and the total number (both are four: #4/4). +VP, KP, SP, and HP are always zero and are reserved for later use. +#P is the number of online CPUs (#P:4). The task is the process that was running when the latency -occurred. (swapper pid: 0). +occurred. (ps pid: 6143). The start and stop (the functions in which the interrupts were disabled and enabled respectively) that caused the latencies: - apic_timer_interrupt is where the interrupts were disabled. - do_softirq is where they were enabled again. + __lock_task_sighand is where the interrupts were disabled. + _raw_spin_unlock_irqrestore is where they were enabled again. The next lines after the header are the trace itself. The header explains which is which. @@ -356,7 +655,11 @@ explains which is which. read the irq flags variable, an 'X' will always be printed here. - need-resched: 'N' task need_resched is set, '.' otherwise. + need-resched: + 'N' both TIF_NEED_RESCHED and PREEMPT_NEED_RESCHED is set, + 'n' only TIF_NEED_RESCHED is set, + 'p' only PREEMPT_NEED_RESCHED is set, + '.' otherwise. hardirq/softirq: 'H' - hard irq occurred inside a softirq. @@ -368,9 +671,10 @@ explains which is which. The above is mostly meaningful for kernel developers. - time: This differs from the trace file output. The trace file output - includes an absolute timestamp. The timestamp used by the - latency_trace file is relative to the start of the trace. + time: When the latency-format option is enabled, the trace file + output includes a timestamp relative to the start of the + trace. This differs from the output when latency-format + is disabled, which includes an absolute timestamp. delay: This is just to help catch your eye a bit better. And needs to be fixed to be only relative to the same CPU. @@ -382,25 +686,52 @@ The above is mostly meaningful for kernel developers. The rest is the same as the 'trace' file. + Note, the latency tracers will usually end with a back trace + to easily find where the latency occurred. trace_options ------------- -The trace_options file is used to control what gets printed in -the trace output. To see what is available, simply cat the file: - - cat /debug/tracing/trace_options - print-parent nosym-offset nosym-addr noverbose noraw nohex nobin \ - noblock nostacktrace nosched-tree nouserstacktrace nosym-userobj +The trace_options file (or the options directory) is used to control +what gets printed in the trace output, or manipulate the tracers. +To see what is available, simply cat the file: + + cat trace_options +print-parent +nosym-offset +nosym-addr +noverbose +noraw +nohex +nobin +noblock +nostacktrace +trace_printk +noftrace_preempt +nobranch +annotate +nouserstacktrace +nosym-userobj +noprintk-msg-only +context-info +latency-format +sleep-time +graph-time +record-cmd +overwrite +nodisable_on_free +irq-info +markers +function-trace To disable one of the options, echo in the option prepended with "no". - echo noprint-parent > /debug/tracing/trace_options + echo noprint-parent > trace_options To enable an option, leave off the "no". - echo sym-offset > /debug/tracing/trace_options + echo sym-offset > trace_options Here are the available options: @@ -408,7 +739,7 @@ Here are the available options: function as well as the function being traced. print-parent: - bash-4000 [01] 1477.606694: simple_strtoul <-strict_strtoul + bash-4000 [01] 1477.606694: simple_strtoul <-kstrtoul noprint-parent: bash-4000 [01] 1477.606694: simple_strtoul @@ -428,10 +759,11 @@ Here are the available options: sym-addr: bash-4000 [01] 1477.606694: simple_strtoul <c0339346> - verbose - This deals with the latency_trace file. + verbose - This deals with the trace file when the + latency-format option is enabled. bash 4000 1 0 00000000 00010a95 [58127d26] 1720.415ms \ - (+0.000ms): simple_strtoul (strict_strtoul) + (+0.000ms): simple_strtoul (kstrtoul) raw - This will display raw numbers. This option is best for use with user applications that can translate the raw @@ -442,13 +774,34 @@ Here are the available options: bin - This will print out the formats in raw binary. - block - TBD (needs update) + block - When set, reading trace_pipe will not block when polled. stacktrace - This is one of the options that changes the trace itself. When a trace is recorded, so is the stack of functions. This allows for back traces of trace sites. + trace_printk - Can disable trace_printk() from writing into the buffer. + + branch - Enable branch tracing with the tracer. + + annotate - It is sometimes confusing when the CPU buffers are full + and one CPU buffer had a lot of events recently, thus + a shorter time frame, were another CPU may have only had + a few events, which lets it have older events. When + the trace is reported, it shows the oldest events first, + and it may look like only one CPU ran (the one with the + oldest events). When the annotate option is set, it will + display when a new CPU buffer started: + + <idle>-0 [001] dNs4 21169.031481: wake_up_idle_cpu <-add_timer_on + <idle>-0 [001] dNs4 21169.031482: _raw_spin_unlock_irqrestore <-add_timer_on + <idle>-0 [001] .Ns4 21169.031484: sub_preempt_count <-_raw_spin_unlock_irqrestore +##### CPU 2 buffer started #### + <idle>-0 [002] .N.1 21169.031484: rcu_idle_exit <-cpu_idle + <idle>-0 [001] .Ns3 21169.031484: _raw_spin_unlock <-clocksource_watchdog + <idle>-0 [001] .Ns3 21169.031485: sub_preempt_count <-_raw_spin_unlock + userstacktrace - This option changes the trace. It records a stacktrace of the current userspace thread. @@ -460,109 +813,80 @@ Here are the available options: the app is no longer running The lookup is performed when you read - trace,trace_pipe,latency_trace. Example: + trace,trace_pipe. Example: a.out-1623 [000] 40874.465068: /root/a.out[+0x480] <-/root/a.out[+0 x494] <- /root/a.out[+0x4a8] <- /lib/libc-2.7.so[+0x1e1a6] - sched-tree - trace all tasks that are on the runqueue, at - every scheduling event. Will add overhead if - there's a lot of tasks running at once. + printk-msg-only - When set, trace_printk()s will only show the format + and not their parameters (if trace_bprintk() or + trace_bputs() was used to save the trace_printk()). -sched_switch ------------- + context-info - Show only the event data. Hides the comm, PID, + timestamp, CPU, and other useful data. -This tracer simply records schedule switches. Here is an example -of how to use it. + latency-format - This option changes the trace. When + it is enabled, the trace displays + additional information about the + latencies, as described in "Latency + trace format". - # echo sched_switch > /debug/tracing/current_tracer - # echo 1 > /debug/tracing/tracing_enabled - # sleep 1 - # echo 0 > /debug/tracing/tracing_enabled - # cat /debug/tracing/trace + sleep-time - When running function graph tracer, to include + the time a task schedules out in its function. + When enabled, it will account time the task has been + scheduled out as part of the function call. -# tracer: sched_switch -# -# TASK-PID CPU# TIMESTAMP FUNCTION -# | | | | | - bash-3997 [01] 240.132281: 3997:120:R + 4055:120:R - bash-3997 [01] 240.132284: 3997:120:R ==> 4055:120:R - sleep-4055 [01] 240.132371: 4055:120:S ==> 3997:120:R - bash-3997 [01] 240.132454: 3997:120:R + 4055:120:S - bash-3997 [01] 240.132457: 3997:120:R ==> 4055:120:R - sleep-4055 [01] 240.132460: 4055:120:D ==> 3997:120:R - bash-3997 [01] 240.132463: 3997:120:R + 4055:120:D - bash-3997 [01] 240.132465: 3997:120:R ==> 4055:120:R - <idle>-0 [00] 240.132589: 0:140:R + 4:115:S - <idle>-0 [00] 240.132591: 0:140:R ==> 4:115:R - ksoftirqd/0-4 [00] 240.132595: 4:115:S ==> 0:140:R - <idle>-0 [00] 240.132598: 0:140:R + 4:115:S - <idle>-0 [00] 240.132599: 0:140:R ==> 4:115:R - ksoftirqd/0-4 [00] 240.132603: 4:115:S ==> 0:140:R - sleep-4055 [01] 240.133058: 4055:120:S ==> 3997:120:R - [...] + graph-time - When running function graph tracer, to include the + time to call nested functions. When this is not set, + the time reported for the function will only include + the time the function itself executed for, not the time + for functions that it called. + record-cmd - When any event or tracer is enabled, a hook is enabled + in the sched_switch trace point to fill comm cache + with mapped pids and comms. But this may cause some + overhead, and if you only care about pids, and not the + name of the task, disabling this option can lower the + impact of tracing. -As we have discussed previously about this format, the header -shows the name of the trace and points to the options. The -"FUNCTION" is a misnomer since here it represents the wake ups -and context switches. - -The sched_switch file only lists the wake ups (represented with -'+') and context switches ('==>') with the previous task or -current task first followed by the next task or task waking up. -The format for both of these is PID:KERNEL-PRIO:TASK-STATE. -Remember that the KERNEL-PRIO is the inverse of the actual -priority with zero (0) being the highest priority and the nice -values starting at 100 (nice -20). Below is a quick chart to map -the kernel priority to user land priorities. - - Kernel Space User Space - =============================================================== - 0(high) to 98(low) user RT priority 99(high) to 1(low) - with SCHED_RR or SCHED_FIFO - --------------------------------------------------------------- - 99 sched_priority is not used in scheduling - decisions(it must be specified as 0) - --------------------------------------------------------------- - 100(high) to 139(low) user nice -20(high) to 19(low) - --------------------------------------------------------------- - 140 idle task priority - --------------------------------------------------------------- - -The task states are: - - R - running : wants to run, may not actually be running - S - sleep : process is waiting to be woken up (handles signals) - D - disk sleep (uninterruptible sleep) : process must be woken up - (ignores signals) - T - stopped : process suspended - t - traced : process is being traced (with something like gdb) - Z - zombie : process waiting to be cleaned up - X - unknown + overwrite - This controls what happens when the trace buffer is + full. If "1" (default), the oldest events are + discarded and overwritten. If "0", then the newest + events are discarded. + (see per_cpu/cpu0/stats for overrun and dropped) + disable_on_free - When the free_buffer is closed, tracing will + stop (tracing_on set to 0). -ftrace_enabled --------------- + irq-info - Shows the interrupt, preempt count, need resched data. + When disabled, the trace looks like: -The following tracers (listed below) give different output -depending on whether or not the sysctl ftrace_enabled is set. To -set ftrace_enabled, one can either use the sysctl function or -set it via the proc file system interface. +# tracer: function +# +# entries-in-buffer/entries-written: 144405/9452052 #P:4 +# +# TASK-PID CPU# TIMESTAMP FUNCTION +# | | | | | + <idle>-0 [002] 23636.756054: ttwu_do_activate.constprop.89 <-try_to_wake_up + <idle>-0 [002] 23636.756054: activate_task <-ttwu_do_activate.constprop.89 + <idle>-0 [002] 23636.756055: enqueue_task <-activate_task - sysctl kernel.ftrace_enabled=1 - or + markers - When set, the trace_marker is writable (only by root). + When disabled, the trace_marker will error with EINVAL + on write. - echo 1 > /proc/sys/kernel/ftrace_enabled -To disable ftrace_enabled simply replace the '1' with '0' in the -above commands. + function-trace - The latency tracers will enable function tracing + if this option is enabled (default it is). When + it is disabled, the latency tracers do not trace + functions. This keeps the overhead of the tracer down + when performing latency tests. + + Note: Some tracers have their own options. They only appear + when the tracer is active. -When ftrace_enabled is set the tracers will also record the -functions that are within the trace. The descriptions of the -tracers will also show an example with ftrace enabled. irqsoff @@ -583,94 +907,133 @@ new trace is saved. To reset the maximum, echo 0 into tracing_max_latency. Here is an example: - # echo irqsoff > /debug/tracing/current_tracer - # echo 0 > /debug/tracing/tracing_max_latency - # echo 1 > /debug/tracing/tracing_enabled + # echo 0 > options/function-trace + # echo irqsoff > current_tracer + # echo 1 > tracing_on + # echo 0 > tracing_max_latency # ls -ltr [...] - # echo 0 > /debug/tracing/tracing_enabled - # cat /debug/tracing/latency_trace + # echo 0 > tracing_on + # cat trace # tracer: irqsoff # -irqsoff latency trace v1.1.5 on 2.6.26 --------------------------------------------------------------------- - latency: 12 us, #3/3, CPU#1 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2) - ----------------- - | task: bash-3730 (uid:0 nice:0 policy:0 rt_prio:0) - ----------------- - => started at: sys_setpgid - => ended at: sys_setpgid - -# _------=> CPU# -# / _-----=> irqs-off -# | / _----=> need-resched -# || / _---=> hardirq/softirq -# ||| / _--=> preempt-depth -# |||| / -# ||||| delay -# cmd pid ||||| time | caller -# \ / ||||| \ | / - bash-3730 1d... 0us : _write_lock_irq (sys_setpgid) - bash-3730 1d..1 1us+: _write_unlock_irq (sys_setpgid) - bash-3730 1d..2 14us : trace_hardirqs_on (sys_setpgid) - - -Here we see that that we had a latency of 12 microsecs (which is -very good). The _write_lock_irq in sys_setpgid disabled -interrupts. The difference between the 12 and the displayed -timestamp 14us occurred because the clock was incremented +# irqsoff latency trace v1.1.5 on 3.8.0-test+ +# -------------------------------------------------------------------- +# latency: 16 us, #4/4, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4) +# ----------------- +# | task: swapper/0-0 (uid:0 nice:0 policy:0 rt_prio:0) +# ----------------- +# => started at: run_timer_softirq +# => ended at: run_timer_softirq +# +# +# _------=> CPU# +# / _-----=> irqs-off +# | / _----=> need-resched +# || / _---=> hardirq/softirq +# ||| / _--=> preempt-depth +# |||| / delay +# cmd pid ||||| time | caller +# \ / ||||| \ | / + <idle>-0 0d.s2 0us+: _raw_spin_lock_irq <-run_timer_softirq + <idle>-0 0dNs3 17us : _raw_spin_unlock_irq <-run_timer_softirq + <idle>-0 0dNs3 17us+: trace_hardirqs_on <-run_timer_softirq + <idle>-0 0dNs3 25us : <stack trace> + => _raw_spin_unlock_irq + => run_timer_softirq + => __do_softirq + => call_softirq + => do_softirq + => irq_exit + => smp_apic_timer_interrupt + => apic_timer_interrupt + => rcu_idle_exit + => cpu_idle + => rest_init + => start_kernel + => x86_64_start_reservations + => x86_64_start_kernel + +Here we see that that we had a latency of 16 microseconds (which is +very good). The _raw_spin_lock_irq in run_timer_softirq disabled +interrupts. The difference between the 16 and the displayed +timestamp 25us occurred because the clock was incremented between the time of recording the max latency and the time of recording the function that had that latency. -Note the above example had ftrace_enabled not set. If we set the -ftrace_enabled, we get a much larger output: +Note the above example had function-trace not set. If we set +function-trace, we get a much larger output: + + with echo 1 > options/function-trace # tracer: irqsoff # -irqsoff latency trace v1.1.5 on 2.6.26-rc8 --------------------------------------------------------------------- - latency: 50 us, #101/101, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2) - ----------------- - | task: ls-4339 (uid:0 nice:0 policy:0 rt_prio:0) - ----------------- - => started at: __alloc_pages_internal - => ended at: __alloc_pages_internal - -# _------=> CPU# -# / _-----=> irqs-off -# | / _----=> need-resched -# || / _---=> hardirq/softirq -# ||| / _--=> preempt-depth -# |||| / -# ||||| delay -# cmd pid ||||| time | caller -# \ / ||||| \ | / - ls-4339 0...1 0us+: get_page_from_freelist (__alloc_pages_internal) - ls-4339 0d..1 3us : rmqueue_bulk (get_page_from_freelist) - ls-4339 0d..1 3us : _spin_lock (rmqueue_bulk) - ls-4339 0d..1 4us : add_preempt_count (_spin_lock) - ls-4339 0d..2 4us : __rmqueue (rmqueue_bulk) - ls-4339 0d..2 5us : __rmqueue_smallest (__rmqueue) - ls-4339 0d..2 5us : __mod_zone_page_state (__rmqueue_smallest) - ls-4339 0d..2 6us : __rmqueue (rmqueue_bulk) - ls-4339 0d..2 6us : __rmqueue_smallest (__rmqueue) - ls-4339 0d..2 7us : __mod_zone_page_state (__rmqueue_smallest) - ls-4339 0d..2 7us : __rmqueue (rmqueue_bulk) - ls-4339 0d..2 8us : __rmqueue_smallest (__rmqueue) +# irqsoff latency trace v1.1.5 on 3.8.0-test+ +# -------------------------------------------------------------------- +# latency: 71 us, #168/168, CPU#3 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4) +# ----------------- +# | task: bash-2042 (uid:0 nice:0 policy:0 rt_prio:0) +# ----------------- +# => started at: ata_scsi_queuecmd +# => ended at: ata_scsi_queuecmd +# +# +# _------=> CPU# +# / _-----=> irqs-off +# | / _----=> need-resched +# || / _---=> hardirq/softirq +# ||| / _--=> preempt-depth +# |||| / delay +# cmd pid ||||| time | caller +# \ / ||||| \ | / + bash-2042 3d... 0us : _raw_spin_lock_irqsave <-ata_scsi_queuecmd + bash-2042 3d... 0us : add_preempt_count <-_raw_spin_lock_irqsave + bash-2042 3d..1 1us : ata_scsi_find_dev <-ata_scsi_queuecmd + bash-2042 3d..1 1us : __ata_scsi_find_dev <-ata_scsi_find_dev + bash-2042 3d..1 2us : ata_find_dev.part.14 <-__ata_scsi_find_dev + bash-2042 3d..1 2us : ata_qc_new_init <-__ata_scsi_queuecmd + bash-2042 3d..1 3us : ata_sg_init <-__ata_scsi_queuecmd + bash-2042 3d..1 4us : ata_scsi_rw_xlat <-__ata_scsi_queuecmd + bash-2042 3d..1 4us : ata_build_rw_tf <-ata_scsi_rw_xlat [...] - ls-4339 0d..2 46us : __rmqueue_smallest (__rmqueue) - ls-4339 0d..2 47us : __mod_zone_page_state (__rmqueue_smallest) - ls-4339 0d..2 47us : __rmqueue (rmqueue_bulk) - ls-4339 0d..2 48us : __rmqueue_smallest (__rmqueue) - ls-4339 0d..2 48us : __mod_zone_page_state (__rmqueue_smallest) - ls-4339 0d..2 49us : _spin_unlock (rmqueue_bulk) - ls-4339 0d..2 49us : sub_preempt_count (_spin_unlock) - ls-4339 0d..1 50us : get_page_from_freelist (__alloc_pages_internal) - ls-4339 0d..2 51us : trace_hardirqs_on (__alloc_pages_internal) - - - -Here we traced a 50 microsecond latency. But we also see all the + bash-2042 3d..1 67us : delay_tsc <-__delay + bash-2042 3d..1 67us : add_preempt_count <-delay_tsc + bash-2042 3d..2 67us : sub_preempt_count <-delay_tsc + bash-2042 3d..1 67us : add_preempt_count <-delay_tsc + bash-2042 3d..2 68us : sub_preempt_count <-delay_tsc + bash-2042 3d..1 68us+: ata_bmdma_start <-ata_bmdma_qc_issue + bash-2042 3d..1 71us : _raw_spin_unlock_irqrestore <-ata_scsi_queuecmd + bash-2042 3d..1 71us : _raw_spin_unlock_irqrestore <-ata_scsi_queuecmd + bash-2042 3d..1 72us+: trace_hardirqs_on <-ata_scsi_queuecmd + bash-2042 3d..1 120us : <stack trace> + => _raw_spin_unlock_irqrestore + => ata_scsi_queuecmd + => scsi_dispatch_cmd + => scsi_request_fn + => __blk_run_queue_uncond + => __blk_run_queue + => blk_queue_bio + => generic_make_request + => submit_bio + => submit_bh + => __ext3_get_inode_loc + => ext3_iget + => ext3_lookup + => lookup_real + => __lookup_hash + => walk_component + => lookup_last + => path_lookupat + => filename_lookup + => user_path_at_empty + => user_path_at + => vfs_fstatat + => vfs_stat + => sys_newstat + => system_call_fastpath + + +Here we traced a 71 microsecond latency. But we also see all the functions that were called during that time. Note that by enabling function tracing, we incur an added overhead. This overhead may extend the latency times. But nevertheless, this @@ -690,119 +1053,122 @@ Like the irqsoff tracer, it records the maximum latency for which preemption was disabled. The control of preemptoff tracer is much like the irqsoff tracer. - # echo preemptoff > /debug/tracing/current_tracer - # echo 0 > /debug/tracing/tracing_max_latency - # echo 1 > /debug/tracing/tracing_enabled + # echo 0 > options/function-trace + # echo preemptoff > current_tracer + # echo 1 > tracing_on + # echo 0 > tracing_max_latency # ls -ltr [...] - # echo 0 > /debug/tracing/tracing_enabled - # cat /debug/tracing/latency_trace + # echo 0 > tracing_on + # cat trace # tracer: preemptoff # -preemptoff latency trace v1.1.5 on 2.6.26-rc8 --------------------------------------------------------------------- - latency: 29 us, #3/3, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2) - ----------------- - | task: sshd-4261 (uid:0 nice:0 policy:0 rt_prio:0) - ----------------- - => started at: do_IRQ - => ended at: __do_softirq - -# _------=> CPU# -# / _-----=> irqs-off -# | / _----=> need-resched -# || / _---=> hardirq/softirq -# ||| / _--=> preempt-depth -# |||| / -# ||||| delay -# cmd pid ||||| time | caller -# \ / ||||| \ | / - sshd-4261 0d.h. 0us+: irq_enter (do_IRQ) - sshd-4261 0d.s. 29us : _local_bh_enable (__do_softirq) - sshd-4261 0d.s1 30us : trace_preempt_on (__do_softirq) +# preemptoff latency trace v1.1.5 on 3.8.0-test+ +# -------------------------------------------------------------------- +# latency: 46 us, #4/4, CPU#1 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4) +# ----------------- +# | task: sshd-1991 (uid:0 nice:0 policy:0 rt_prio:0) +# ----------------- +# => started at: do_IRQ +# => ended at: do_IRQ +# +# +# _------=> CPU# +# / _-----=> irqs-off +# | / _----=> need-resched +# || / _---=> hardirq/softirq +# ||| / _--=> preempt-depth +# |||| / delay +# cmd pid ||||| time | caller +# \ / ||||| \ | / + sshd-1991 1d.h. 0us+: irq_enter <-do_IRQ + sshd-1991 1d..1 46us : irq_exit <-do_IRQ + sshd-1991 1d..1 47us+: trace_preempt_on <-do_IRQ + sshd-1991 1d..1 52us : <stack trace> + => sub_preempt_count + => irq_exit + => do_IRQ + => ret_from_intr This has some more changes. Preemption was disabled when an -interrupt came in (notice the 'h'), and was enabled while doing -a softirq. (notice the 's'). But we also see that interrupts -have been disabled when entering the preempt off section and -leaving it (the 'd'). We do not know if interrupts were enabled -in the mean time. +interrupt came in (notice the 'h'), and was enabled on exit. +But we also see that interrupts have been disabled when entering +the preempt off section and leaving it (the 'd'). We do not know if +interrupts were enabled in the mean time or shortly after this +was over. # tracer: preemptoff # -preemptoff latency trace v1.1.5 on 2.6.26-rc8 --------------------------------------------------------------------- - latency: 63 us, #87/87, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2) - ----------------- - | task: sshd-4261 (uid:0 nice:0 policy:0 rt_prio:0) - ----------------- - => started at: remove_wait_queue - => ended at: __do_softirq - -# _------=> CPU# -# / _-----=> irqs-off -# | / _----=> need-resched -# || / _---=> hardirq/softirq -# ||| / _--=> preempt-depth -# |||| / -# ||||| delay -# cmd pid ||||| time | caller -# \ / ||||| \ | / - sshd-4261 0d..1 0us : _spin_lock_irqsave (remove_wait_queue) - sshd-4261 0d..1 1us : _spin_unlock_irqrestore (remove_wait_queue) - sshd-4261 0d..1 2us : do_IRQ (common_interrupt) - sshd-4261 0d..1 2us : irq_enter (do_IRQ) - sshd-4261 0d..1 2us : idle_cpu (irq_enter) - sshd-4261 0d..1 3us : add_preempt_count (irq_enter) - sshd-4261 0d.h1 3us : idle_cpu (irq_enter) - sshd-4261 0d.h. 4us : handle_fasteoi_irq (do_IRQ) +# preemptoff latency trace v1.1.5 on 3.8.0-test+ +# -------------------------------------------------------------------- +# latency: 83 us, #241/241, CPU#1 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4) +# ----------------- +# | task: bash-1994 (uid:0 nice:0 policy:0 rt_prio:0) +# ----------------- +# => started at: wake_up_new_task +# => ended at: task_rq_unlock +# +# +# _------=> CPU# +# / _-----=> irqs-off +# | / _----=> need-resched +# || / _---=> hardirq/softirq +# ||| / _--=> preempt-depth +# |||| / delay +# cmd pid ||||| time | caller +# \ / ||||| \ | / + bash-1994 1d..1 0us : _raw_spin_lock_irqsave <-wake_up_new_task + bash-1994 1d..1 0us : select_task_rq_fair <-select_task_rq + bash-1994 1d..1 1us : __rcu_read_lock <-select_task_rq_fair + bash-1994 1d..1 1us : source_load <-select_task_rq_fair + bash-1994 1d..1 1us : source_load <-select_task_rq_fair [...] - sshd-4261 0d.h. 12us : add_preempt_count (_spin_lock) - sshd-4261 0d.h1 12us : ack_ioapic_quirk_irq (handle_fasteoi_irq) - sshd-4261 0d.h1 13us : move_native_irq (ack_ioapic_quirk_irq) - sshd-4261 0d.h1 13us : _spin_unlock (handle_fasteoi_irq) - sshd-4261 0d.h1 14us : sub_preempt_count (_spin_unlock) - sshd-4261 0d.h1 14us : irq_exit (do_IRQ) - sshd-4261 0d.h1 15us : sub_preempt_count (irq_exit) - sshd-4261 0d..2 15us : do_softirq (irq_exit) - sshd-4261 0d... 15us : __do_softirq (do_softirq) - sshd-4261 0d... 16us : __local_bh_disable (__do_softirq) - sshd-4261 0d... 16us+: add_preempt_count (__local_bh_disable) - sshd-4261 0d.s4 20us : add_preempt_count (__local_bh_disable) - sshd-4261 0d.s4 21us : sub_preempt_count (local_bh_enable) - sshd-4261 0d.s5 21us : sub_preempt_count (local_bh_enable) + bash-1994 1d..1 12us : irq_enter <-smp_apic_timer_interrupt + bash-1994 1d..1 12us : rcu_irq_enter <-irq_enter + bash-1994 1d..1 13us : add_preempt_count <-irq_enter + bash-1994 1d.h1 13us : exit_idle <-smp_apic_timer_interrupt + bash-1994 1d.h1 13us : hrtimer_interrupt <-smp_apic_timer_interrupt + bash-1994 1d.h1 13us : _raw_spin_lock <-hrtimer_interrupt + bash-1994 1d.h1 14us : add_preempt_count <-_raw_spin_lock + bash-1994 1d.h2 14us : ktime_get_update_offsets <-hrtimer_interrupt [...] - sshd-4261 0d.s6 41us : add_preempt_count (__local_bh_disable) - sshd-4261 0d.s6 42us : sub_preempt_count (local_bh_enable) - sshd-4261 0d.s7 42us : sub_preempt_count (local_bh_enable) - sshd-4261 0d.s5 43us : add_preempt_count (__local_bh_disable) - sshd-4261 0d.s5 43us : sub_preempt_count (local_bh_enable_ip) - sshd-4261 0d.s6 44us : sub_preempt_count (local_bh_enable_ip) - sshd-4261 0d.s5 44us : add_preempt_count (__local_bh_disable) - sshd-4261 0d.s5 45us : sub_preempt_count (local_bh_enable) + bash-1994 1d.h1 35us : lapic_next_event <-clockevents_program_event + bash-1994 1d.h1 35us : irq_exit <-smp_apic_timer_interrupt + bash-1994 1d.h1 36us : sub_preempt_count <-irq_exit + bash-1994 1d..2 36us : do_softirq <-irq_exit + bash-1994 1d..2 36us : __do_softirq <-call_softirq + bash-1994 1d..2 36us : __local_bh_disable <-__do_softirq + bash-1994 1d.s2 37us : add_preempt_count <-_raw_spin_lock_irq + bash-1994 1d.s3 38us : _raw_spin_unlock <-run_timer_softirq + bash-1994 1d.s3 39us : sub_preempt_count <-_raw_spin_unlock + bash-1994 1d.s2 39us : call_timer_fn <-run_timer_softirq [...] - sshd-4261 0d.s. 63us : _local_bh_enable (__do_softirq) - sshd-4261 0d.s1 64us : trace_preempt_on (__do_softirq) + bash-1994 1dNs2 81us : cpu_needs_another_gp <-rcu_process_callbacks + bash-1994 1dNs2 82us : __local_bh_enable <-__do_softirq + bash-1994 1dNs2 82us : sub_preempt_count <-__local_bh_enable + bash-1994 1dN.2 82us : idle_cpu <-irq_exit + bash-1994 1dN.2 83us : rcu_irq_exit <-irq_exit + bash-1994 1dN.2 83us : sub_preempt_count <-irq_exit + bash-1994 1.N.1 84us : _raw_spin_unlock_irqrestore <-task_rq_unlock + bash-1994 1.N.1 84us+: trace_preempt_on <-task_rq_unlock + bash-1994 1.N.1 104us : <stack trace> + => sub_preempt_count + => _raw_spin_unlock_irqrestore + => task_rq_unlock + => wake_up_new_task + => do_fork + => sys_clone + => stub_clone The above is an example of the preemptoff trace with -ftrace_enabled set. Here we see that interrupts were disabled +function-trace set. Here we see that interrupts were not disabled the entire time. The irq_enter code lets us know that we entered an interrupt 'h'. Before that, the functions being traced still show that it is not in an interrupt, but we can see from the functions themselves that this is not the case. -Notice that __do_softirq when called does not have a -preempt_count. It may seem that we missed a preempt enabling. -What really happened is that the preempt count is held on the -thread's stack and we switched to the softirq stack (4K stacks -in effect). The code does not copy the preempt count, but -because interrupts are disabled, we do not need to worry about -it. Having a tracer like this is good for letting people know -what really happens inside the kernel. - - preemptirqsoff -------------- @@ -837,37 +1203,57 @@ tracer. Again, using this trace is much like the irqsoff and preemptoff tracers. - # echo preemptirqsoff > /debug/tracing/current_tracer - # echo 0 > /debug/tracing/tracing_max_latency - # echo 1 > /debug/tracing/tracing_enabled + # echo 0 > options/function-trace + # echo preemptirqsoff > current_tracer + # echo 1 > tracing_on + # echo 0 > tracing_max_latency # ls -ltr [...] - # echo 0 > /debug/tracing/tracing_enabled - # cat /debug/tracing/latency_trace + # echo 0 > tracing_on + # cat trace # tracer: preemptirqsoff # -preemptirqsoff latency trace v1.1.5 on 2.6.26-rc8 --------------------------------------------------------------------- - latency: 293 us, #3/3, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2) - ----------------- - | task: ls-4860 (uid:0 nice:0 policy:0 rt_prio:0) - ----------------- - => started at: apic_timer_interrupt - => ended at: __do_softirq - -# _------=> CPU# -# / _-----=> irqs-off -# | / _----=> need-resched -# || / _---=> hardirq/softirq -# ||| / _--=> preempt-depth -# |||| / -# ||||| delay -# cmd pid ||||| time | caller -# \ / ||||| \ | / - ls-4860 0d... 0us!: trace_hardirqs_off_thunk (apic_timer_interrupt) - ls-4860 0d.s. 294us : _local_bh_enable (__do_softirq) - ls-4860 0d.s1 294us : trace_preempt_on (__do_softirq) - +# preemptirqsoff latency trace v1.1.5 on 3.8.0-test+ +# -------------------------------------------------------------------- +# latency: 100 us, #4/4, CPU#3 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4) +# ----------------- +# | task: ls-2230 (uid:0 nice:0 policy:0 rt_prio:0) +# ----------------- +# => started at: ata_scsi_queuecmd +# => ended at: ata_scsi_queuecmd +# +# +# _------=> CPU# +# / _-----=> irqs-off +# | / _----=> need-resched +# || / _---=> hardirq/softirq +# ||| / _--=> preempt-depth +# |||| / delay +# cmd pid ||||| time | caller +# \ / ||||| \ | / + ls-2230 3d... 0us+: _raw_spin_lock_irqsave <-ata_scsi_queuecmd + ls-2230 3...1 100us : _raw_spin_unlock_irqrestore <-ata_scsi_queuecmd + ls-2230 3...1 101us+: trace_preempt_on <-ata_scsi_queuecmd + ls-2230 3...1 111us : <stack trace> + => sub_preempt_count + => _raw_spin_unlock_irqrestore + => ata_scsi_queuecmd + => scsi_dispatch_cmd + => scsi_request_fn + => __blk_run_queue_uncond + => __blk_run_queue + => blk_queue_bio + => generic_make_request + => submit_bio + => submit_bh + => ext3_bread + => ext3_dir_bread + => htree_dirblock_to_tree + => ext3_htree_fill_tree + => ext3_readdir + => vfs_readdir + => sys_getdents + => system_call_fastpath The trace_hardirqs_off_thunk is called from assembly on x86 when @@ -876,105 +1262,158 @@ function tracing, we do not know if interrupts were enabled within the preemption points. We do see that it started with preemption enabled. -Here is a trace with ftrace_enabled set: - +Here is a trace with function-trace set: # tracer: preemptirqsoff # -preemptirqsoff latency trace v1.1.5 on 2.6.26-rc8 --------------------------------------------------------------------- - latency: 105 us, #183/183, CPU#0 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2) - ----------------- - | task: sshd-4261 (uid:0 nice:0 policy:0 rt_prio:0) - ----------------- - => started at: write_chan - => ended at: __do_softirq - -# _------=> CPU# -# / _-----=> irqs-off -# | / _----=> need-resched -# || / _---=> hardirq/softirq -# ||| / _--=> preempt-depth -# |||| / -# ||||| delay -# cmd pid ||||| time | caller -# \ / ||||| \ | / - ls-4473 0.N.. 0us : preempt_schedule (write_chan) - ls-4473 0dN.1 1us : _spin_lock (schedule) - ls-4473 0dN.1 2us : add_preempt_count (_spin_lock) - ls-4473 0d..2 2us : put_prev_task_fair (schedule) -[...] - ls-4473 0d..2 13us : set_normalized_timespec (ktime_get_ts) - ls-4473 0d..2 13us : __switch_to (schedule) - sshd-4261 0d..2 14us : finish_task_switch (schedule) - sshd-4261 0d..2 14us : _spin_unlock_irq (finish_task_switch) - sshd-4261 0d..1 15us : add_preempt_count (_spin_lock_irqsave) - sshd-4261 0d..2 16us : _spin_unlock_irqrestore (hrtick_set) - sshd-4261 0d..2 16us : do_IRQ (common_interrupt) - sshd-4261 0d..2 17us : irq_enter (do_IRQ) - sshd-4261 0d..2 17us : idle_cpu (irq_enter) - sshd-4261 0d..2 18us : add_preempt_count (irq_enter) - sshd-4261 0d.h2 18us : idle_cpu (irq_enter) - sshd-4261 0d.h. 18us : handle_fasteoi_irq (do_IRQ) - sshd-4261 0d.h. 19us : _spin_lock (handle_fasteoi_irq) - sshd-4261 0d.h. 19us : add_preempt_count (_spin_lock) - sshd-4261 0d.h1 20us : _spin_unlock (handle_fasteoi_irq) - sshd-4261 0d.h1 20us : sub_preempt_count (_spin_unlock) -[...] - sshd-4261 0d.h1 28us : _spin_unlock (handle_fasteoi_irq) - sshd-4261 0d.h1 29us : sub_preempt_count (_spin_unlock) - sshd-4261 0d.h2 29us : irq_exit (do_IRQ) - sshd-4261 0d.h2 29us : sub_preempt_count (irq_exit) - sshd-4261 0d..3 30us : do_softirq (irq_exit) - sshd-4261 0d... 30us : __do_softirq (do_softirq) - sshd-4261 0d... 31us : __local_bh_disable (__do_softirq) - sshd-4261 0d... 31us+: add_preempt_count (__local_bh_disable) - sshd-4261 0d.s4 34us : add_preempt_count (__local_bh_disable) +# preemptirqsoff latency trace v1.1.5 on 3.8.0-test+ +# -------------------------------------------------------------------- +# latency: 161 us, #339/339, CPU#3 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4) +# ----------------- +# | task: ls-2269 (uid:0 nice:0 policy:0 rt_prio:0) +# ----------------- +# => started at: schedule +# => ended at: mutex_unlock +# +# +# _------=> CPU# +# / _-----=> irqs-off +# | / _----=> need-resched +# || / _---=> hardirq/softirq +# ||| / _--=> preempt-depth +# |||| / delay +# cmd pid ||||| time | caller +# \ / ||||| \ | / +kworker/-59 3...1 0us : __schedule <-schedule +kworker/-59 3d..1 0us : rcu_preempt_qs <-rcu_note_context_switch +kworker/-59 3d..1 1us : add_preempt_count <-_raw_spin_lock_irq +kworker/-59 3d..2 1us : deactivate_task <-__schedule +kworker/-59 3d..2 1us : dequeue_task <-deactivate_task +kworker/-59 3d..2 2us : update_rq_clock <-dequeue_task +kworker/-59 3d..2 2us : dequeue_task_fair <-dequeue_task +kworker/-59 3d..2 2us : update_curr <-dequeue_task_fair +kworker/-59 3d..2 2us : update_min_vruntime <-update_curr +kworker/-59 3d..2 3us : cpuacct_charge <-update_curr +kworker/-59 3d..2 3us : __rcu_read_lock <-cpuacct_charge +kworker/-59 3d..2 3us : __rcu_read_unlock <-cpuacct_charge +kworker/-59 3d..2 3us : update_cfs_rq_blocked_load <-dequeue_task_fair +kworker/-59 3d..2 4us : clear_buddies <-dequeue_task_fair +kworker/-59 3d..2 4us : account_entity_dequeue <-dequeue_task_fair +kworker/-59 3d..2 4us : update_min_vruntime <-dequeue_task_fair +kworker/-59 3d..2 4us : update_cfs_shares <-dequeue_task_fair +kworker/-59 3d..2 5us : hrtick_update <-dequeue_task_fair +kworker/-59 3d..2 5us : wq_worker_sleeping <-__schedule +kworker/-59 3d..2 5us : kthread_data <-wq_worker_sleeping +kworker/-59 3d..2 5us : put_prev_task_fair <-__schedule +kworker/-59 3d..2 6us : pick_next_task_fair <-pick_next_task +kworker/-59 3d..2 6us : clear_buddies <-pick_next_task_fair +kworker/-59 3d..2 6us : set_next_entity <-pick_next_task_fair +kworker/-59 3d..2 6us : update_stats_wait_end <-set_next_entity + ls-2269 3d..2 7us : finish_task_switch <-__schedule + ls-2269 3d..2 7us : _raw_spin_unlock_irq <-finish_task_switch + ls-2269 3d..2 8us : do_IRQ <-ret_from_intr + ls-2269 3d..2 8us : irq_enter <-do_IRQ + ls-2269 3d..2 8us : rcu_irq_enter <-irq_enter + ls-2269 3d..2 9us : add_preempt_count <-irq_enter + ls-2269 3d.h2 9us : exit_idle <-do_IRQ [...] - sshd-4261 0d.s3 43us : sub_preempt_count (local_bh_enable_ip) - sshd-4261 0d.s4 44us : sub_preempt_count (local_bh_enable_ip) - sshd-4261 0d.s3 44us : smp_apic_timer_interrupt (apic_timer_interrupt) - sshd-4261 0d.s3 45us : irq_enter (smp_apic_timer_interrupt) - sshd-4261 0d.s3 45us : idle_cpu (irq_enter) - sshd-4261 0d.s3 46us : add_preempt_count (irq_enter) - sshd-4261 0d.H3 46us : idle_cpu (irq_enter) - sshd-4261 0d.H3 47us : hrtimer_interrupt (smp_apic_timer_interrupt) - sshd-4261 0d.H3 47us : ktime_get (hrtimer_interrupt) + ls-2269 3d.h3 20us : sub_preempt_count <-_raw_spin_unlock + ls-2269 3d.h2 20us : irq_exit <-do_IRQ + ls-2269 3d.h2 21us : sub_preempt_count <-irq_exit + ls-2269 3d..3 21us : do_softirq <-irq_exit + ls-2269 3d..3 21us : __do_softirq <-call_softirq + ls-2269 3d..3 21us+: __local_bh_disable <-__do_softirq + ls-2269 3d.s4 29us : sub_preempt_count <-_local_bh_enable_ip + ls-2269 3d.s5 29us : sub_preempt_count <-_local_bh_enable_ip + ls-2269 3d.s5 31us : do_IRQ <-ret_from_intr + ls-2269 3d.s5 31us : irq_enter <-do_IRQ + ls-2269 3d.s5 31us : rcu_irq_enter <-irq_enter [...] - sshd-4261 0d.H3 81us : tick_program_event (hrtimer_interrupt) - sshd-4261 0d.H3 82us : ktime_get (tick_program_event) - sshd-4261 0d.H3 82us : ktime_get_ts (ktime_get) - sshd-4261 0d.H3 83us : getnstimeofday (ktime_get_ts) - sshd-4261 0d.H3 83us : set_normalized_timespec (ktime_get_ts) - sshd-4261 0d.H3 84us : clockevents_program_event (tick_program_event) - sshd-4261 0d.H3 84us : lapic_next_event (clockevents_program_event) - sshd-4261 0d.H3 85us : irq_exit (smp_apic_timer_interrupt) - sshd-4261 0d.H3 85us : sub_preempt_count (irq_exit) - sshd-4261 0d.s4 86us : sub_preempt_count (irq_exit) - sshd-4261 0d.s3 86us : add_preempt_count (__local_bh_disable) + ls-2269 3d.s5 31us : rcu_irq_enter <-irq_enter + ls-2269 3d.s5 32us : add_preempt_count <-irq_enter + ls-2269 3d.H5 32us : exit_idle <-do_IRQ + ls-2269 3d.H5 32us : handle_irq <-do_IRQ + ls-2269 3d.H5 32us : irq_to_desc <-handle_irq + ls-2269 3d.H5 33us : handle_fasteoi_irq <-handle_irq [...] - sshd-4261 0d.s1 98us : sub_preempt_count (net_rx_action) - sshd-4261 0d.s. 99us : add_preempt_count (_spin_lock_irq) - sshd-4261 0d.s1 99us+: _spin_unlock_irq (run_timer_softirq) - sshd-4261 0d.s. 104us : _local_bh_enable (__do_softirq) - sshd-4261 0d.s. 104us : sub_preempt_count (_local_bh_enable) - sshd-4261 0d.s. 105us : _local_bh_enable (__do_softirq) - sshd-4261 0d.s1 105us : trace_preempt_on (__do_softirq) - - -This is a very interesting trace. It started with the preemption -of the ls task. We see that the task had the "need_resched" bit -set via the 'N' in the trace. Interrupts were disabled before -the spin_lock at the beginning of the trace. We see that a -schedule took place to run sshd. When the interrupts were -enabled, we took an interrupt. On return from the interrupt -handler, the softirq ran. We took another interrupt while -running the softirq as we see from the capital 'H'. + ls-2269 3d.s5 158us : _raw_spin_unlock_irqrestore <-rtl8139_poll + ls-2269 3d.s3 158us : net_rps_action_and_irq_enable.isra.65 <-net_rx_action + ls-2269 3d.s3 159us : __local_bh_enable <-__do_softirq + ls-2269 3d.s3 159us : sub_preempt_count <-__local_bh_enable + ls-2269 3d..3 159us : idle_cpu <-irq_exit + ls-2269 3d..3 159us : rcu_irq_exit <-irq_exit + ls-2269 3d..3 160us : sub_preempt_count <-irq_exit + ls-2269 3d... 161us : __mutex_unlock_slowpath <-mutex_unlock + ls-2269 3d... 162us+: trace_hardirqs_on <-mutex_unlock + ls-2269 3d... 186us : <stack trace> + => __mutex_unlock_slowpath + => mutex_unlock + => process_output + => n_tty_write + => tty_write + => vfs_write + => sys_write + => system_call_fastpath + +This is an interesting trace. It started with kworker running and +scheduling out and ls taking over. But as soon as ls released the +rq lock and enabled interrupts (but not preemption) an interrupt +triggered. When the interrupt finished, it started running softirqs. +But while the softirq was running, another interrupt triggered. +When an interrupt is running inside a softirq, the annotation is 'H'. wakeup ------ +One common case that people are interested in tracing is the +time it takes for a task that is woken to actually wake up. +Now for non Real-Time tasks, this can be arbitrary. But tracing +it none the less can be interesting. + +Without function tracing: + + # echo 0 > options/function-trace + # echo wakeup > current_tracer + # echo 1 > tracing_on + # echo 0 > tracing_max_latency + # chrt -f 5 sleep 1 + # echo 0 > tracing_on + # cat trace +# tracer: wakeup +# +# wakeup latency trace v1.1.5 on 3.8.0-test+ +# -------------------------------------------------------------------- +# latency: 15 us, #4/4, CPU#3 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4) +# ----------------- +# | task: kworker/3:1H-312 (uid:0 nice:-20 policy:0 rt_prio:0) +# ----------------- +# +# _------=> CPU# +# / _-----=> irqs-off +# | / _----=> need-resched +# || / _---=> hardirq/softirq +# ||| / _--=> preempt-depth +# |||| / delay +# cmd pid ||||| time | caller +# \ / ||||| \ | / + <idle>-0 3dNs7 0us : 0:120:R + [003] 312:100:R kworker/3:1H + <idle>-0 3dNs7 1us+: ttwu_do_activate.constprop.87 <-try_to_wake_up + <idle>-0 3d..3 15us : __schedule <-schedule + <idle>-0 3d..3 15us : 0:120:R ==> [003] 312:100:R kworker/3:1H + +The tracer only traces the highest priority task in the system +to avoid tracing the normal circumstances. Here we see that +the kworker with a nice priority of -20 (not very nice), took +just 15 microseconds from the time it woke up, to the time it +ran. + +Non Real-Time tasks are not that interesting. A more interesting +trace is to concentrate only on Real-Time tasks. + +wakeup_rt +--------- + In a Real-Time environment it is very important to know the wakeup time it takes for the highest priority task that is woken up to the time that it executes. This is also known as "schedule @@ -988,123 +1427,229 @@ Real-Time environments are interested in the worst case latency. That is the longest latency it takes for something to happen, and not the average. We can have a very fast scheduler that may only have a large latency once in a while, but that would not -work well with Real-Time tasks. The wakeup tracer was designed +work well with Real-Time tasks. The wakeup_rt tracer was designed to record the worst case wakeups of RT tasks. Non-RT tasks are not recorded because the tracer only records one worst case and tracing non-RT tasks that are unpredictable will overwrite the -worst case latency of RT tasks. +worst case latency of RT tasks (just run the normal wakeup +tracer for a while to see that effect). Since this tracer only deals with RT tasks, we will run this slightly differently than we did with the previous tracers. Instead of performing an 'ls', we will run 'sleep 1' under 'chrt' which changes the priority of the task. - # echo wakeup > /debug/tracing/current_tracer - # echo 0 > /debug/tracing/tracing_max_latency - # echo 1 > /debug/tracing/tracing_enabled + # echo 0 > options/function-trace + # echo wakeup_rt > current_tracer + # echo 1 > tracing_on + # echo 0 > tracing_max_latency # chrt -f 5 sleep 1 - # echo 0 > /debug/tracing/tracing_enabled - # cat /debug/tracing/latency_trace + # echo 0 > tracing_on + # cat trace # tracer: wakeup # -wakeup latency trace v1.1.5 on 2.6.26-rc8 --------------------------------------------------------------------- - latency: 4 us, #2/2, CPU#1 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2) - ----------------- - | task: sleep-4901 (uid:0 nice:0 policy:1 rt_prio:5) - ----------------- - -# _------=> CPU# -# / _-----=> irqs-off -# | / _----=> need-resched -# || / _---=> hardirq/softirq -# ||| / _--=> preempt-depth -# |||| / -# ||||| delay -# cmd pid ||||| time | caller -# \ / ||||| \ | / - <idle>-0 1d.h4 0us+: try_to_wake_up (wake_up_process) - <idle>-0 1d..4 4us : schedule (cpu_idle) - - -Running this on an idle system, we see that it only took 4 -microseconds to perform the task switch. Note, since the trace -marker in the schedule is before the actual "switch", we stop -the tracing when the recorded task is about to schedule in. This -may change if we add a new marker at the end of the scheduler. - -Notice that the recorded task is 'sleep' with the PID of 4901 +# tracer: wakeup_rt +# +# wakeup_rt latency trace v1.1.5 on 3.8.0-test+ +# -------------------------------------------------------------------- +# latency: 5 us, #4/4, CPU#3 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4) +# ----------------- +# | task: sleep-2389 (uid:0 nice:0 policy:1 rt_prio:5) +# ----------------- +# +# _------=> CPU# +# / _-----=> irqs-off +# | / _----=> need-resched +# || / _---=> hardirq/softirq +# ||| / _--=> preempt-depth +# |||| / delay +# cmd pid ||||| time | caller +# \ / ||||| \ | / + <idle>-0 3d.h4 0us : 0:120:R + [003] 2389: 94:R sleep + <idle>-0 3d.h4 1us+: ttwu_do_activate.constprop.87 <-try_to_wake_up + <idle>-0 3d..3 5us : __schedule <-schedule + <idle>-0 3d..3 5us : 0:120:R ==> [003] 2389: 94:R sleep + + +Running this on an idle system, we see that it only took 5 microseconds +to perform the task switch. Note, since the trace point in the schedule +is before the actual "switch", we stop the tracing when the recorded task +is about to schedule in. This may change if we add a new marker at the +end of the scheduler. + +Notice that the recorded task is 'sleep' with the PID of 2389 and it has an rt_prio of 5. This priority is user-space priority and not the internal kernel priority. The policy is 1 for SCHED_FIFO and 2 for SCHED_RR. -Doing the same with chrt -r 5 and ftrace_enabled set. +Note, that the trace data shows the internal priority (99 - rtprio). -# tracer: wakeup + <idle>-0 3d..3 5us : 0:120:R ==> [003] 2389: 94:R sleep + +The 0:120:R means idle was running with a nice priority of 0 (120 - 20) +and in the running state 'R'. The sleep task was scheduled in with +2389: 94:R. That is the priority is the kernel rtprio (99 - 5 = 94) +and it too is in the running state. + +Doing the same with chrt -r 5 and function-trace set. + + echo 1 > options/function-trace + +# tracer: wakeup_rt # -wakeup latency trace v1.1.5 on 2.6.26-rc8 --------------------------------------------------------------------- - latency: 50 us, #60/60, CPU#1 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:2) - ----------------- - | task: sleep-4068 (uid:0 nice:0 policy:2 rt_prio:5) - ----------------- - -# _------=> CPU# -# / _-----=> irqs-off -# | / _----=> need-resched -# || / _---=> hardirq/softirq -# ||| / _--=> preempt-depth -# |||| / -# ||||| delay -# cmd pid ||||| time | caller -# \ / ||||| \ | / -ksoftirq-7 1d.H3 0us : try_to_wake_up (wake_up_process) -ksoftirq-7 1d.H4 1us : sub_preempt_count (marker_probe_cb) -ksoftirq-7 1d.H3 2us : check_preempt_wakeup (try_to_wake_up) -ksoftirq-7 1d.H3 3us : update_curr (check_preempt_wakeup) -ksoftirq-7 1d.H3 4us : calc_delta_mine (update_curr) -ksoftirq-7 1d.H3 5us : __resched_task (check_preempt_wakeup) -ksoftirq-7 1d.H3 6us : task_wake_up_rt (try_to_wake_up) -ksoftirq-7 1d.H3 7us : _spin_unlock_irqrestore (try_to_wake_up) -[...] -ksoftirq-7 1d.H2 17us : irq_exit (smp_apic_timer_interrupt) -ksoftirq-7 1d.H2 18us : sub_preempt_count (irq_exit) -ksoftirq-7 1d.s3 19us : sub_preempt_count (irq_exit) -ksoftirq-7 1..s2 20us : rcu_process_callbacks (__do_softirq) -[...] -ksoftirq-7 1..s2 26us : __rcu_process_callbacks (rcu_process_callbacks) -ksoftirq-7 1d.s2 27us : _local_bh_enable (__do_softirq) -ksoftirq-7 1d.s2 28us : sub_preempt_count (_local_bh_enable) -ksoftirq-7 1.N.3 29us : sub_preempt_count (ksoftirqd) -ksoftirq-7 1.N.2 30us : _cond_resched (ksoftirqd) -ksoftirq-7 1.N.2 31us : __cond_resched (_cond_resched) -ksoftirq-7 1.N.2 32us : add_preempt_count (__cond_resched) -ksoftirq-7 1.N.2 33us : schedule (__cond_resched) -ksoftirq-7 1.N.2 33us : add_preempt_count (schedule) -ksoftirq-7 1.N.3 34us : hrtick_clear (schedule) -ksoftirq-7 1dN.3 35us : _spin_lock (schedule) -ksoftirq-7 1dN.3 36us : add_preempt_count (_spin_lock) -ksoftirq-7 1d..4 37us : put_prev_task_fair (schedule) -ksoftirq-7 1d..4 38us : update_curr (put_prev_task_fair) -[...] -ksoftirq-7 1d..5 47us : _spin_trylock (tracing_record_cmdline) -ksoftirq-7 1d..5 48us : add_preempt_count (_spin_trylock) -ksoftirq-7 1d..6 49us : _spin_unlock (tracing_record_cmdline) -ksoftirq-7 1d..6 49us : sub_preempt_count (_spin_unlock) -ksoftirq-7 1d..4 50us : schedule (__cond_resched) - -The interrupt went off while running ksoftirqd. This task runs -at SCHED_OTHER. Why did not we see the 'N' set early? This may -be a harmless bug with x86_32 and 4K stacks. On x86_32 with 4K -stacks configured, the interrupt and softirq run with their own -stack. Some information is held on the top of the task's stack -(need_resched and preempt_count are both stored there). The -setting of the NEED_RESCHED bit is done directly to the task's -stack, but the reading of the NEED_RESCHED is done by looking at -the current stack, which in this case is the stack for the hard -interrupt. This hides the fact that NEED_RESCHED has been set. -We do not see the 'N' until we switch back to the task's -assigned stack. +# wakeup_rt latency trace v1.1.5 on 3.8.0-test+ +# -------------------------------------------------------------------- +# latency: 29 us, #85/85, CPU#3 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4) +# ----------------- +# | task: sleep-2448 (uid:0 nice:0 policy:1 rt_prio:5) +# ----------------- +# +# _------=> CPU# +# / _-----=> irqs-off +# | / _----=> need-resched +# || / _---=> hardirq/softirq +# ||| / _--=> preempt-depth +# |||| / delay +# cmd pid ||||| time | caller +# \ / ||||| \ | / + <idle>-0 3d.h4 1us+: 0:120:R + [003] 2448: 94:R sleep + <idle>-0 3d.h4 2us : ttwu_do_activate.constprop.87 <-try_to_wake_up + <idle>-0 3d.h3 3us : check_preempt_curr <-ttwu_do_wakeup + <idle>-0 3d.h3 3us : resched_task <-check_preempt_curr + <idle>-0 3dNh3 4us : task_woken_rt <-ttwu_do_wakeup + <idle>-0 3dNh3 4us : _raw_spin_unlock <-try_to_wake_up + <idle>-0 3dNh3 4us : sub_preempt_count <-_raw_spin_unlock + <idle>-0 3dNh2 5us : ttwu_stat <-try_to_wake_up + <idle>-0 3dNh2 5us : _raw_spin_unlock_irqrestore <-try_to_wake_up + <idle>-0 3dNh2 6us : sub_preempt_count <-_raw_spin_unlock_irqrestore + <idle>-0 3dNh1 6us : _raw_spin_lock <-__run_hrtimer + <idle>-0 3dNh1 6us : add_preempt_count <-_raw_spin_lock + <idle>-0 3dNh2 7us : _raw_spin_unlock <-hrtimer_interrupt + <idle>-0 3dNh2 7us : sub_preempt_count <-_raw_spin_unlock + <idle>-0 3dNh1 7us : tick_program_event <-hrtimer_interrupt + <idle>-0 3dNh1 7us : clockevents_program_event <-tick_program_event + <idle>-0 3dNh1 8us : ktime_get <-clockevents_program_event + <idle>-0 3dNh1 8us : lapic_next_event <-clockevents_program_event + <idle>-0 3dNh1 8us : irq_exit <-smp_apic_timer_interrupt + <idle>-0 3dNh1 9us : sub_preempt_count <-irq_exit + <idle>-0 3dN.2 9us : idle_cpu <-irq_exit + <idle>-0 3dN.2 9us : rcu_irq_exit <-irq_exit + <idle>-0 3dN.2 10us : rcu_eqs_enter_common.isra.45 <-rcu_irq_exit + <idle>-0 3dN.2 10us : sub_preempt_count <-irq_exit + <idle>-0 3.N.1 11us : rcu_idle_exit <-cpu_idle + <idle>-0 3dN.1 11us : rcu_eqs_exit_common.isra.43 <-rcu_idle_exit + <idle>-0 3.N.1 11us : tick_nohz_idle_exit <-cpu_idle + <idle>-0 3dN.1 12us : menu_hrtimer_cancel <-tick_nohz_idle_exit + <idle>-0 3dN.1 12us : ktime_get <-tick_nohz_idle_exit + <idle>-0 3dN.1 12us : tick_do_update_jiffies64 <-tick_nohz_idle_exit + <idle>-0 3dN.1 13us : update_cpu_load_nohz <-tick_nohz_idle_exit + <idle>-0 3dN.1 13us : _raw_spin_lock <-update_cpu_load_nohz + <idle>-0 3dN.1 13us : add_preempt_count <-_raw_spin_lock + <idle>-0 3dN.2 13us : __update_cpu_load <-update_cpu_load_nohz + <idle>-0 3dN.2 14us : sched_avg_update <-__update_cpu_load + <idle>-0 3dN.2 14us : _raw_spin_unlock <-update_cpu_load_nohz + <idle>-0 3dN.2 14us : sub_preempt_count <-_raw_spin_unlock + <idle>-0 3dN.1 15us : calc_load_exit_idle <-tick_nohz_idle_exit + <idle>-0 3dN.1 15us : touch_softlockup_watchdog <-tick_nohz_idle_exit + <idle>-0 3dN.1 15us : hrtimer_cancel <-tick_nohz_idle_exit + <idle>-0 3dN.1 15us : hrtimer_try_to_cancel <-hrtimer_cancel + <idle>-0 3dN.1 16us : lock_hrtimer_base.isra.18 <-hrtimer_try_to_cancel + <idle>-0 3dN.1 16us : _raw_spin_lock_irqsave <-lock_hrtimer_base.isra.18 + <idle>-0 3dN.1 16us : add_preempt_count <-_raw_spin_lock_irqsave + <idle>-0 3dN.2 17us : __remove_hrtimer <-remove_hrtimer.part.16 + <idle>-0 3dN.2 17us : hrtimer_force_reprogram <-__remove_hrtimer + <idle>-0 3dN.2 17us : tick_program_event <-hrtimer_force_reprogram + <idle>-0 3dN.2 18us : clockevents_program_event <-tick_program_event + <idle>-0 3dN.2 18us : ktime_get <-clockevents_program_event + <idle>-0 3dN.2 18us : lapic_next_event <-clockevents_program_event + <idle>-0 3dN.2 19us : _raw_spin_unlock_irqrestore <-hrtimer_try_to_cancel + <idle>-0 3dN.2 19us : sub_preempt_count <-_raw_spin_unlock_irqrestore + <idle>-0 3dN.1 19us : hrtimer_forward <-tick_nohz_idle_exit + <idle>-0 3dN.1 20us : ktime_add_safe <-hrtimer_forward + <idle>-0 3dN.1 20us : ktime_add_safe <-hrtimer_forward + <idle>-0 3dN.1 20us : hrtimer_start_range_ns <-hrtimer_start_expires.constprop.11 + <idle>-0 3dN.1 20us : __hrtimer_start_range_ns <-hrtimer_start_range_ns + <idle>-0 3dN.1 21us : lock_hrtimer_base.isra.18 <-__hrtimer_start_range_ns + <idle>-0 3dN.1 21us : _raw_spin_lock_irqsave <-lock_hrtimer_base.isra.18 + <idle>-0 3dN.1 21us : add_preempt_count <-_raw_spin_lock_irqsave + <idle>-0 3dN.2 22us : ktime_add_safe <-__hrtimer_start_range_ns + <idle>-0 3dN.2 22us : enqueue_hrtimer <-__hrtimer_start_range_ns + <idle>-0 3dN.2 22us : tick_program_event <-__hrtimer_start_range_ns + <idle>-0 3dN.2 23us : clockevents_program_event <-tick_program_event + <idle>-0 3dN.2 23us : ktime_get <-clockevents_program_event + <idle>-0 3dN.2 23us : lapic_next_event <-clockevents_program_event + <idle>-0 3dN.2 24us : _raw_spin_unlock_irqrestore <-__hrtimer_start_range_ns + <idle>-0 3dN.2 24us : sub_preempt_count <-_raw_spin_unlock_irqrestore + <idle>-0 3dN.1 24us : account_idle_ticks <-tick_nohz_idle_exit + <idle>-0 3dN.1 24us : account_idle_time <-account_idle_ticks + <idle>-0 3.N.1 25us : sub_preempt_count <-cpu_idle + <idle>-0 3.N.. 25us : schedule <-cpu_idle + <idle>-0 3.N.. 25us : __schedule <-preempt_schedule + <idle>-0 3.N.. 26us : add_preempt_count <-__schedule + <idle>-0 3.N.1 26us : rcu_note_context_switch <-__schedule + <idle>-0 3.N.1 26us : rcu_sched_qs <-rcu_note_context_switch + <idle>-0 3dN.1 27us : rcu_preempt_qs <-rcu_note_context_switch + <idle>-0 3.N.1 27us : _raw_spin_lock_irq <-__schedule + <idle>-0 3dN.1 27us : add_preempt_count <-_raw_spin_lock_irq + <idle>-0 3dN.2 28us : put_prev_task_idle <-__schedule + <idle>-0 3dN.2 28us : pick_next_task_stop <-pick_next_task + <idle>-0 3dN.2 28us : pick_next_task_rt <-pick_next_task + <idle>-0 3dN.2 29us : dequeue_pushable_task <-pick_next_task_rt + <idle>-0 3d..3 29us : __schedule <-preempt_schedule + <idle>-0 3d..3 30us : 0:120:R ==> [003] 2448: 94:R sleep + +This isn't that big of a trace, even with function tracing enabled, +so I included the entire trace. + +The interrupt went off while when the system was idle. Somewhere +before task_woken_rt() was called, the NEED_RESCHED flag was set, +this is indicated by the first occurrence of the 'N' flag. + +Latency tracing and events +-------------------------- +As function tracing can induce a much larger latency, but without +seeing what happens within the latency it is hard to know what +caused it. There is a middle ground, and that is with enabling +events. + + # echo 0 > options/function-trace + # echo wakeup_rt > current_tracer + # echo 1 > events/enable + # echo 1 > tracing_on + # echo 0 > tracing_max_latency + # chrt -f 5 sleep 1 + # echo 0 > tracing_on + # cat trace +# tracer: wakeup_rt +# +# wakeup_rt latency trace v1.1.5 on 3.8.0-test+ +# -------------------------------------------------------------------- +# latency: 6 us, #12/12, CPU#2 | (M:preempt VP:0, KP:0, SP:0 HP:0 #P:4) +# ----------------- +# | task: sleep-5882 (uid:0 nice:0 policy:1 rt_prio:5) +# ----------------- +# +# _------=> CPU# +# / _-----=> irqs-off +# | / _----=> need-resched +# || / _---=> hardirq/softirq +# ||| / _--=> preempt-depth +# |||| / delay +# cmd pid ||||| time | caller +# \ / ||||| \ | / + <idle>-0 2d.h4 0us : 0:120:R + [002] 5882: 94:R sleep + <idle>-0 2d.h4 0us : ttwu_do_activate.constprop.87 <-try_to_wake_up + <idle>-0 2d.h4 1us : sched_wakeup: comm=sleep pid=5882 prio=94 success=1 target_cpu=002 + <idle>-0 2dNh2 1us : hrtimer_expire_exit: hrtimer=ffff88007796feb8 + <idle>-0 2.N.2 2us : power_end: cpu_id=2 + <idle>-0 2.N.2 3us : cpu_idle: state=4294967295 cpu_id=2 + <idle>-0 2dN.3 4us : hrtimer_cancel: hrtimer=ffff88007d50d5e0 + <idle>-0 2dN.3 4us : hrtimer_start: hrtimer=ffff88007d50d5e0 function=tick_sched_timer expires=34311211000000 softexpires=34311211000000 + <idle>-0 2.N.2 5us : rcu_utilization: Start context switch + <idle>-0 2.N.2 5us : rcu_utilization: End context switch + <idle>-0 2d..3 6us : __schedule <-schedule + <idle>-0 2d..3 6us : 0:120:R ==> [002] 5882: 94:R sleep + function -------- @@ -1112,32 +1657,33 @@ function This tracer is the function tracer. Enabling the function tracer can be done from the debug file system. Make sure the ftrace_enabled is set; otherwise this tracer is a nop. +See the "ftrace_enabled" section below. # sysctl kernel.ftrace_enabled=1 - # echo function > /debug/tracing/current_tracer - # echo 1 > /debug/tracing/tracing_enabled + # echo function > current_tracer + # echo 1 > tracing_on # usleep 1 - # echo 0 > /debug/tracing/tracing_enabled - # cat /debug/tracing/trace + # echo 0 > tracing_on + # cat trace # tracer: function # -# TASK-PID CPU# TIMESTAMP FUNCTION -# | | | | | - bash-4003 [00] 123.638713: finish_task_switch <-schedule - bash-4003 [00] 123.638714: _spin_unlock_irq <-finish_task_switch - bash-4003 [00] 123.638714: sub_preempt_count <-_spin_unlock_irq - bash-4003 [00] 123.638715: hrtick_set <-schedule - bash-4003 [00] 123.638715: _spin_lock_irqsave <-hrtick_set - bash-4003 [00] 123.638716: add_preempt_count <-_spin_lock_irqsave - bash-4003 [00] 123.638716: _spin_unlock_irqrestore <-hrtick_set - bash-4003 [00] 123.638717: sub_preempt_count <-_spin_unlock_irqrestore - bash-4003 [00] 123.638717: hrtick_clear <-hrtick_set - bash-4003 [00] 123.638718: sub_preempt_count <-schedule - bash-4003 [00] 123.638718: sub_preempt_count <-preempt_schedule - bash-4003 [00] 123.638719: wait_for_completion <-__stop_machine_run - bash-4003 [00] 123.638719: wait_for_common <-wait_for_completion - bash-4003 [00] 123.638720: _spin_lock_irq <-wait_for_common - bash-4003 [00] 123.638720: add_preempt_count <-_spin_lock_irq +# entries-in-buffer/entries-written: 24799/24799 #P:4 +# +# _-----=> irqs-off +# / _----=> need-resched +# | / _---=> hardirq/softirq +# || / _--=> preempt-depth +# ||| / delay +# TASK-PID CPU# |||| TIMESTAMP FUNCTION +# | | | |||| | | + bash-1994 [002] .... 3082.063030: mutex_unlock <-rb_simple_write + bash-1994 [002] .... 3082.063031: __mutex_unlock_slowpath <-mutex_unlock + bash-1994 [002] .... 3082.063031: __fsnotify_parent <-fsnotify_modify + bash-1994 [002] .... 3082.063032: fsnotify <-fsnotify_modify + bash-1994 [002] .... 3082.063032: __srcu_read_lock <-fsnotify + bash-1994 [002] .... 3082.063032: add_preempt_count <-__srcu_read_lock + bash-1994 [002] ...1 3082.063032: sub_preempt_count <-__srcu_read_lock + bash-1994 [002] .... 3082.063033: __srcu_read_unlock <-fsnotify [...] @@ -1155,7 +1701,7 @@ int trace_fd; [...] int main(int argc, char *argv[]) { [...] - trace_fd = open("/debug/tracing/tracing_enabled", O_WRONLY); + trace_fd = open(tracing_file("tracing_on"), O_WRONLY); [...] if (condition_hit()) { write(trace_fd, "0", 1); @@ -1163,26 +1709,20 @@ int main(int argc, char *argv[]) { [...] } -Note: Here we hard coded the path name. The debugfs mount is not -guaranteed to be at /debug (and is more commonly at -/sys/kernel/debug). For simple one time traces, the above is -sufficent. For anything else, a search through /proc/mounts may -be needed to find where the debugfs file-system is mounted. - Single thread tracing --------------------- -By writing into /debug/tracing/set_ftrace_pid you can trace a +By writing into set_ftrace_pid you can trace a single thread. For example: -# cat /debug/tracing/set_ftrace_pid +# cat set_ftrace_pid no pid -# echo 3111 > /debug/tracing/set_ftrace_pid -# cat /debug/tracing/set_ftrace_pid +# echo 3111 > set_ftrace_pid +# cat set_ftrace_pid 3111 -# echo function > /debug/tracing/current_tracer -# cat /debug/tracing/trace | head +# echo function > current_tracer +# cat trace | head # tracer: function # # TASK-PID CPU# TIMESTAMP FUNCTION @@ -1193,8 +1733,8 @@ no pid yum-updatesd-3111 [003] 1637.254683: lock_hrtimer_base <-hrtimer_try_to_cancel yum-updatesd-3111 [003] 1637.254685: fget_light <-do_sys_poll yum-updatesd-3111 [003] 1637.254686: pipe_poll <-do_sys_poll -# echo -1 > /debug/tracing/set_ftrace_pid -# cat /debug/tracing/trace |head +# echo -1 > set_ftrace_pid +# cat trace |head # tracer: function # # TASK-PID CPU# TIMESTAMP FUNCTION @@ -1215,6 +1755,53 @@ something like this simple program: #include <sys/stat.h> #include <fcntl.h> #include <unistd.h> +#include <string.h> + +#define _STR(x) #x +#define STR(x) _STR(x) +#define MAX_PATH 256 + +const char *find_debugfs(void) +{ + static char debugfs[MAX_PATH+1]; + static int debugfs_found; + char type[100]; + FILE *fp; + + if (debugfs_found) + return debugfs; + + if ((fp = fopen("/proc/mounts","r")) == NULL) { + perror("/proc/mounts"); + return NULL; + } + + while (fscanf(fp, "%*s %" + STR(MAX_PATH) + "s %99s %*s %*d %*d\n", + debugfs, type) == 2) { + if (strcmp(type, "debugfs") == 0) + break; + } + fclose(fp); + + if (strcmp(type, "debugfs") != 0) { + fprintf(stderr, "debugfs not mounted"); + return NULL; + } + + strcat(debugfs, "/tracing/"); + debugfs_found = 1; + + return debugfs; +} + +const char *tracing_file(const char *file_name) +{ + static char trace_file[MAX_PATH+1]; + snprintf(trace_file, MAX_PATH, "%s/%s", find_debugfs(), file_name); + return trace_file; +} int main (int argc, char **argv) { @@ -1226,12 +1813,12 @@ int main (int argc, char **argv) char line[64]; int s; - ffd = open("/debug/tracing/current_tracer", O_WRONLY); + ffd = open(tracing_file("current_tracer"), O_WRONLY); if (ffd < 0) exit(-1); write(ffd, "nop", 3); - fd = open("/debug/tracing/set_ftrace_pid", O_WRONLY); + fd = open(tracing_file("set_ftrace_pid"), O_WRONLY); s = sprintf(line, "%d\n", getpid()); write(fd, line, s); @@ -1246,77 +1833,19 @@ int main (int argc, char **argv) return 0; } +Or this simple script! -hw-branch-tracer (x86 only) ---------------------------- - -This tracer uses the x86 last branch tracing hardware feature to -collect a branch trace on all cpus with relatively low overhead. - -The tracer uses a fixed-size circular buffer per cpu and only -traces ring 0 branches. The trace file dumps that buffer in the -following format: - -# tracer: hw-branch-tracer -# -# CPU# TO <- FROM - 0 scheduler_tick+0xb5/0x1bf <- task_tick_idle+0x5/0x6 - 2 run_posix_cpu_timers+0x2b/0x72a <- run_posix_cpu_timers+0x25/0x72a - 0 scheduler_tick+0x139/0x1bf <- scheduler_tick+0xed/0x1bf - 0 scheduler_tick+0x17c/0x1bf <- scheduler_tick+0x148/0x1bf - 2 run_posix_cpu_timers+0x9e/0x72a <- run_posix_cpu_timers+0x5e/0x72a - 0 scheduler_tick+0x1b6/0x1bf <- scheduler_tick+0x1aa/0x1bf - - -The tracer may be used to dump the trace for the oops'ing cpu on -a kernel oops into the system log. To enable this, -ftrace_dump_on_oops must be set. To set ftrace_dump_on_oops, one -can either use the sysctl function or set it via the proc system -interface. - - sysctl kernel.ftrace_dump_on_oops=1 - -or - - echo 1 > /proc/sys/kernel/ftrace_dump_on_oops - - -Here's an example of such a dump after a null pointer -dereference in a kernel module: - -[57848.105921] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000 -[57848.106019] IP: [<ffffffffa0000006>] open+0x6/0x14 [oops] -[57848.106019] PGD 2354e9067 PUD 2375e7067 PMD 0 -[57848.106019] Oops: 0002 [#1] SMP -[57848.106019] last sysfs file: /sys/devices/pci0000:00/0000:00:1e.0/0000:20:05.0/local_cpus -[57848.106019] Dumping ftrace buffer: -[57848.106019] --------------------------------- -[...] -[57848.106019] 0 chrdev_open+0xe6/0x165 <- cdev_put+0x23/0x24 -[57848.106019] 0 chrdev_open+0x117/0x165 <- chrdev_open+0xfa/0x165 -[57848.106019] 0 chrdev_open+0x120/0x165 <- chrdev_open+0x11c/0x165 -[57848.106019] 0 chrdev_open+0x134/0x165 <- chrdev_open+0x12b/0x165 -[57848.106019] 0 open+0x0/0x14 [oops] <- chrdev_open+0x144/0x165 -[57848.106019] 0 page_fault+0x0/0x30 <- open+0x6/0x14 [oops] -[57848.106019] 0 error_entry+0x0/0x5b <- page_fault+0x4/0x30 -[57848.106019] 0 error_kernelspace+0x0/0x31 <- error_entry+0x59/0x5b -[57848.106019] 0 error_sti+0x0/0x1 <- error_kernelspace+0x2d/0x31 -[57848.106019] 0 page_fault+0x9/0x30 <- error_sti+0x0/0x1 -[57848.106019] 0 do_page_fault+0x0/0x881 <- page_fault+0x1a/0x30 -[...] -[57848.106019] 0 do_page_fault+0x66b/0x881 <- is_prefetch+0x1ee/0x1f2 -[57848.106019] 0 do_page_fault+0x6e0/0x881 <- do_page_fault+0x67a/0x881 -[57848.106019] 0 oops_begin+0x0/0x96 <- do_page_fault+0x6e0/0x881 -[57848.106019] 0 trace_hw_branch_oops+0x0/0x2d <- oops_begin+0x9/0x96 -[...] -[57848.106019] 0 ds_suspend_bts+0x2a/0xe3 <- ds_suspend_bts+0x1a/0xe3 -[57848.106019] --------------------------------- -[57848.106019] CPU 0 -[57848.106019] Modules linked in: oops -[57848.106019] Pid: 5542, comm: cat Tainted: G W 2.6.28 #23 -[57848.106019] RIP: 0010:[<ffffffffa0000006>] [<ffffffffa0000006>] open+0x6/0x14 [oops] -[57848.106019] RSP: 0018:ffff880235457d48 EFLAGS: 00010246 -[...] +------ +#!/bin/bash + +debugfs=`sed -ne 's/^debugfs \(.*\) debugfs.*/\1/p' /proc/mounts` +echo nop > $debugfs/tracing/current_tracer +echo 0 > $debugfs/tracing/tracing_on +echo $$ > $debugfs/tracing/set_ftrace_pid +echo function > $debugfs/tracing/current_tracer +echo 1 > $debugfs/tracing/tracing_on +exec "$@" +------ function graph tracer @@ -1383,22 +1912,22 @@ want, depending on your needs. tracing_cpu_mask file) or you might sometimes see unordered function calls while cpu tracing switch. - hide: echo nofuncgraph-cpu > /debug/tracing/trace_options - show: echo funcgraph-cpu > /debug/tracing/trace_options + hide: echo nofuncgraph-cpu > trace_options + show: echo funcgraph-cpu > trace_options - The duration (function's time of execution) is displayed on the closing bracket line of a function or on the same line than the current function in case of a leaf one. It is default enabled. - hide: echo nofuncgraph-duration > /debug/tracing/trace_options - show: echo funcgraph-duration > /debug/tracing/trace_options + hide: echo nofuncgraph-duration > trace_options + show: echo funcgraph-duration > trace_options - The overhead field precedes the duration field in case of reached duration thresholds. - hide: echo nofuncgraph-overhead > /debug/tracing/trace_options - show: echo funcgraph-overhead > /debug/tracing/trace_options + hide: echo nofuncgraph-overhead > trace_options + show: echo funcgraph-overhead > trace_options depends on: funcgraph-duration ie: @@ -1427,8 +1956,8 @@ want, depending on your needs. - The task/pid field displays the thread cmdline and pid which executed the function. It is default disabled. - hide: echo nofuncgraph-proc > /debug/tracing/trace_options - show: echo funcgraph-proc > /debug/tracing/trace_options + hide: echo nofuncgraph-proc > trace_options + show: echo funcgraph-proc > trace_options ie: @@ -1451,8 +1980,8 @@ want, depending on your needs. system clock since it started. A snapshot of this time is given on each entry/exit of functions - hide: echo nofuncgraph-abstime > /debug/tracing/trace_options - show: echo funcgraph-abstime > /debug/tracing/trace_options + hide: echo nofuncgraph-abstime > trace_options + show: echo funcgraph-abstime > trace_options ie: @@ -1474,6 +2003,32 @@ want, depending on your needs. 360.774530 | 1) 0.594 us | __phys_addr(); +The function name is always displayed after the closing bracket +for a function if the start of that function is not in the +trace buffer. + +Display of the function name after the closing bracket may be +enabled for functions whose start is in the trace buffer, +allowing easier searching with grep for function durations. +It is default disabled. + + hide: echo nofuncgraph-tail > trace_options + show: echo funcgraph-tail > trace_options + + Example with nofuncgraph-tail (default): + 0) | putname() { + 0) | kmem_cache_free() { + 0) 0.518 us | __phys_addr(); + 0) 1.757 us | } + 0) 2.861 us | } + + Example with funcgraph-tail: + 0) | putname() { + 0) | kmem_cache_free() { + 0) 0.518 us | __phys_addr(); + 0) 1.757 us | } /* kmem_cache_free() */ + 0) 2.861 us | } /* putname() */ + You can put some comments on specific functions by using trace_printk() For example, if you want to put a comment inside the __might_sleep() function, you just have to include @@ -1503,16 +2058,18 @@ starts of pointing to a simple return. (Enabling FTRACE will include the -pg switch in the compiling of the kernel.) At compile time every C file object is run through the -recordmcount.pl script (located in the scripts directory). This -script will process the C object using objdump to find all the -locations in the .text section that call mcount. (Note, only the -.text section is processed, since processing other sections like -.init.text may cause races due to those sections being freed). +recordmcount program (located in the scripts directory). This +program will parse the ELF headers in the C object to find all +the locations in the .text section that call mcount. (Note, only +white listed .text sections are processed, since processing other +sections like .init.text may cause races due to those sections +being freed unexpectedly). A new section called "__mcount_loc" is created that holds references to all the mcount call sites in the .text section. -This section is compiled back into the original object. The -final linker will add all these references into a single table. +The recordmcount program re-links this section back into the +original object. The final linking stage of the kernel will add all these +references into a single table. On boot up, before SMP is initialized, the dynamic ftrace code scans this table and updates all the locations into nops. It @@ -1523,13 +2080,25 @@ unloaded, it also removes its functions from the ftrace function list. This is automatic in the module unload code, and the module author does not need to worry about it. -When tracing is enabled, kstop_machine is called to prevent -races with the CPUS executing code being modified (which can -cause the CPU to do undesireable things), and the nops are +When tracing is enabled, the process of modifying the function +tracepoints is dependent on architecture. The old method is to use +kstop_machine to prevent races with the CPUs executing code being +modified (which can cause the CPU to do undesirable things, especially +if the modified code crosses cache (or page) boundaries), and the nops are patched back to calls. But this time, they do not call mcount (which is just a function stub). They now call into the ftrace infrastructure. +The new method of modifying the function tracepoints is to place +a breakpoint at the location to be modified, sync all CPUs, modify +the rest of the instruction not covered by the breakpoint. Sync +all CPUs again, and then remove the breakpoint with the finished +version to the ftrace call site. + +Some archs do not even need to monkey around with the synchronization, +and can just slap the new code on top of the old without any +problems with other CPUs executing it at the same time. + One special side-effect to the recording of the functions being traced is that we can now selectively choose which functions we wish to trace and which ones we want the mcount calls to remain @@ -1549,7 +2118,7 @@ listed in: available_filter_functions - # cat /debug/tracing/available_filter_functions + # cat available_filter_functions put_prev_task_idle kmem_cache_create pick_next_task_rt @@ -1560,24 +2129,32 @@ mutex_lock If I am only interested in sys_nanosleep and hrtimer_interrupt: - # echo sys_nanosleep hrtimer_interrupt \ - > /debug/tracing/set_ftrace_filter - # echo ftrace > /debug/tracing/current_tracer - # echo 1 > /debug/tracing/tracing_enabled + # echo sys_nanosleep hrtimer_interrupt > set_ftrace_filter + # echo function > current_tracer + # echo 1 > tracing_on # usleep 1 - # echo 0 > /debug/tracing/tracing_enabled - # cat /debug/tracing/trace -# tracer: ftrace + # echo 0 > tracing_on + # cat trace +# tracer: function +# +# entries-in-buffer/entries-written: 5/5 #P:4 # -# TASK-PID CPU# TIMESTAMP FUNCTION -# | | | | | - usleep-4134 [00] 1317.070017: hrtimer_interrupt <-smp_apic_timer_interrupt - usleep-4134 [00] 1317.070111: sys_nanosleep <-syscall_call - <idle>-0 [00] 1317.070115: hrtimer_interrupt <-smp_apic_timer_interrupt +# _-----=> irqs-off +# / _----=> need-resched +# | / _---=> hardirq/softirq +# || / _--=> preempt-depth +# ||| / delay +# TASK-PID CPU# |||| TIMESTAMP FUNCTION +# | | | |||| | | + usleep-2665 [001] .... 4186.475355: sys_nanosleep <-system_call_fastpath + <idle>-0 [001] d.h1 4186.475409: hrtimer_interrupt <-smp_apic_timer_interrupt + usleep-2665 [001] d.h1 4186.475426: hrtimer_interrupt <-smp_apic_timer_interrupt + <idle>-0 [003] d.h1 4186.475426: hrtimer_interrupt <-smp_apic_timer_interrupt + <idle>-0 [002] d.h1 4186.475427: hrtimer_interrupt <-smp_apic_timer_interrupt To see which functions are being traced, you can cat the file: - # cat /debug/tracing/set_ftrace_filter + # cat set_ftrace_filter hrtimer_interrupt sys_nanosleep @@ -1597,28 +2174,33 @@ Note: It is better to use quotes to enclose the wild cards, otherwise the shell may expand the parameters into names of files in the local directory. - # echo 'hrtimer_*' > /debug/tracing/set_ftrace_filter + # echo 'hrtimer_*' > set_ftrace_filter Produces: -# tracer: ftrace +# tracer: function # -# TASK-PID CPU# TIMESTAMP FUNCTION -# | | | | | - bash-4003 [00] 1480.611794: hrtimer_init <-copy_process - bash-4003 [00] 1480.611941: hrtimer_start <-hrtick_set - bash-4003 [00] 1480.611956: hrtimer_cancel <-hrtick_clear - bash-4003 [00] 1480.611956: hrtimer_try_to_cancel <-hrtimer_cancel - <idle>-0 [00] 1480.612019: hrtimer_get_next_event <-get_next_timer_interrupt - <idle>-0 [00] 1480.612025: hrtimer_get_next_event <-get_next_timer_interrupt - <idle>-0 [00] 1480.612032: hrtimer_get_next_event <-get_next_timer_interrupt - <idle>-0 [00] 1480.612037: hrtimer_get_next_event <-get_next_timer_interrupt - <idle>-0 [00] 1480.612382: hrtimer_get_next_event <-get_next_timer_interrupt - +# entries-in-buffer/entries-written: 897/897 #P:4 +# +# _-----=> irqs-off +# / _----=> need-resched +# | / _---=> hardirq/softirq +# || / _--=> preempt-depth +# ||| / delay +# TASK-PID CPU# |||| TIMESTAMP FUNCTION +# | | | |||| | | + <idle>-0 [003] dN.1 4228.547803: hrtimer_cancel <-tick_nohz_idle_exit + <idle>-0 [003] dN.1 4228.547804: hrtimer_try_to_cancel <-hrtimer_cancel + <idle>-0 [003] dN.2 4228.547805: hrtimer_force_reprogram <-__remove_hrtimer + <idle>-0 [003] dN.1 4228.547805: hrtimer_forward <-tick_nohz_idle_exit + <idle>-0 [003] dN.1 4228.547805: hrtimer_start_range_ns <-hrtimer_start_expires.constprop.11 + <idle>-0 [003] d..1 4228.547858: hrtimer_get_next_event <-get_next_timer_interrupt + <idle>-0 [003] d..1 4228.547859: hrtimer_start <-__tick_nohz_idle_enter + <idle>-0 [003] d..2 4228.547860: hrtimer_force_reprogram <-__rem Notice that we lost the sys_nanosleep. - # cat /debug/tracing/set_ftrace_filter + # cat set_ftrace_filter hrtimer_run_queues hrtimer_run_pending hrtimer_init @@ -1644,17 +2226,17 @@ To append to the filters, use '>>' To clear out a filter so that all functions will be recorded again: - # echo > /debug/tracing/set_ftrace_filter - # cat /debug/tracing/set_ftrace_filter + # echo > set_ftrace_filter + # cat set_ftrace_filter # Again, now we want to append. - # echo sys_nanosleep > /debug/tracing/set_ftrace_filter - # cat /debug/tracing/set_ftrace_filter + # echo sys_nanosleep > set_ftrace_filter + # cat set_ftrace_filter sys_nanosleep - # echo 'hrtimer_*' >> /debug/tracing/set_ftrace_filter - # cat /debug/tracing/set_ftrace_filter + # echo 'hrtimer_*' >> set_ftrace_filter + # cat set_ftrace_filter hrtimer_run_queues hrtimer_run_pending hrtimer_init @@ -1677,23 +2259,33 @@ hrtimer_init_sleeper The set_ftrace_notrace prevents those functions from being traced. - # echo '*preempt*' '*lock*' > /debug/tracing/set_ftrace_notrace + # echo '*preempt*' '*lock*' > set_ftrace_notrace Produces: -# tracer: ftrace +# tracer: function # -# TASK-PID CPU# TIMESTAMP FUNCTION -# | | | | | - bash-4043 [01] 115.281644: finish_task_switch <-schedule - bash-4043 [01] 115.281645: hrtick_set <-schedule - bash-4043 [01] 115.281645: hrtick_clear <-hrtick_set - bash-4043 [01] 115.281646: wait_for_completion <-__stop_machine_run - bash-4043 [01] 115.281647: wait_for_common <-wait_for_completion - bash-4043 [01] 115.281647: kthread_stop <-stop_machine_run - bash-4043 [01] 115.281648: init_waitqueue_head <-kthread_stop - bash-4043 [01] 115.281648: wake_up_process <-kthread_stop - bash-4043 [01] 115.281649: try_to_wake_up <-wake_up_process +# entries-in-buffer/entries-written: 39608/39608 #P:4 +# +# _-----=> irqs-off +# / _----=> need-resched +# | / _---=> hardirq/softirq +# || / _--=> preempt-depth +# ||| / delay +# TASK-PID CPU# |||| TIMESTAMP FUNCTION +# | | | |||| | | + bash-1994 [000] .... 4342.324896: file_ra_state_init <-do_dentry_open + bash-1994 [000] .... 4342.324897: open_check_o_direct <-do_last + bash-1994 [000] .... 4342.324897: ima_file_check <-do_last + bash-1994 [000] .... 4342.324898: process_measurement <-ima_file_check + bash-1994 [000] .... 4342.324898: ima_get_action <-process_measurement + bash-1994 [000] .... 4342.324898: ima_match_policy <-ima_get_action + bash-1994 [000] .... 4342.324899: do_truncate <-do_last + bash-1994 [000] .... 4342.324899: should_remove_suid <-do_truncate + bash-1994 [000] .... 4342.324899: notify_change <-do_truncate + bash-1994 [000] .... 4342.324900: current_fs_time <-notify_change + bash-1994 [000] .... 4342.324900: current_kernel_time <-current_fs_time + bash-1994 [000] .... 4342.324900: timespec_trunc <-current_fs_time We can see that there's no more lock or preempt tracing. @@ -1759,6 +2351,128 @@ this special filter via: echo > set_graph_function +ftrace_enabled +-------------- + +Note, the proc sysctl ftrace_enable is a big on/off switch for the +function tracer. By default it is enabled (when function tracing is +enabled in the kernel). If it is disabled, all function tracing is +disabled. This includes not only the function tracers for ftrace, but +also for any other uses (perf, kprobes, stack tracing, profiling, etc). + +Please disable this with care. + +This can be disable (and enabled) with: + + sysctl kernel.ftrace_enabled=0 + sysctl kernel.ftrace_enabled=1 + + or + + echo 0 > /proc/sys/kernel/ftrace_enabled + echo 1 > /proc/sys/kernel/ftrace_enabled + + +Filter commands +--------------- + +A few commands are supported by the set_ftrace_filter interface. +Trace commands have the following format: + +<function>:<command>:<parameter> + +The following commands are supported: + +- mod + This command enables function filtering per module. The + parameter defines the module. For example, if only the write* + functions in the ext3 module are desired, run: + + echo 'write*:mod:ext3' > set_ftrace_filter + + This command interacts with the filter in the same way as + filtering based on function names. Thus, adding more functions + in a different module is accomplished by appending (>>) to the + filter file. Remove specific module functions by prepending + '!': + + echo '!writeback*:mod:ext3' >> set_ftrace_filter + +- traceon/traceoff + These commands turn tracing on and off when the specified + functions are hit. The parameter determines how many times the + tracing system is turned on and off. If unspecified, there is + no limit. For example, to disable tracing when a schedule bug + is hit the first 5 times, run: + + echo '__schedule_bug:traceoff:5' > set_ftrace_filter + + To always disable tracing when __schedule_bug is hit: + + echo '__schedule_bug:traceoff' > set_ftrace_filter + + These commands are cumulative whether or not they are appended + to set_ftrace_filter. To remove a command, prepend it by '!' + and drop the parameter: + + echo '!__schedule_bug:traceoff:0' > set_ftrace_filter + + The above removes the traceoff command for __schedule_bug + that have a counter. To remove commands without counters: + + echo '!__schedule_bug:traceoff' > set_ftrace_filter + +- snapshot + Will cause a snapshot to be triggered when the function is hit. + + echo 'native_flush_tlb_others:snapshot' > set_ftrace_filter + + To only snapshot once: + + echo 'native_flush_tlb_others:snapshot:1' > set_ftrace_filter + + To remove the above commands: + + echo '!native_flush_tlb_others:snapshot' > set_ftrace_filter + echo '!native_flush_tlb_others:snapshot:0' > set_ftrace_filter + +- enable_event/disable_event + These commands can enable or disable a trace event. Note, because + function tracing callbacks are very sensitive, when these commands + are registered, the trace point is activated, but disabled in + a "soft" mode. That is, the tracepoint will be called, but + just will not be traced. The event tracepoint stays in this mode + as long as there's a command that triggers it. + + echo 'try_to_wake_up:enable_event:sched:sched_switch:2' > \ + set_ftrace_filter + + The format is: + + <function>:enable_event:<system>:<event>[:count] + <function>:disable_event:<system>:<event>[:count] + + To remove the events commands: + + + echo '!try_to_wake_up:enable_event:sched:sched_switch:0' > \ + set_ftrace_filter + echo '!schedule:disable_event:sched:sched_switch' > \ + set_ftrace_filter + +- dump + When the function is hit, it will dump the contents of the ftrace + ring buffer to the console. This is useful if you need to debug + something, and want to dump the trace when a certain function + is hit. Perhaps its a function that is called before a tripple + fault happens and does not allow you to get a regular dump. + +- cpudump + When the function is hit, it will dump the contents of the ftrace + ring buffer for the current CPU to the console. Unlike the "dump" + command, it only prints out the contents of the ring buffer for the + CPU that executed the function that triggered the dump. + trace_pipe ---------- @@ -1767,37 +2481,40 @@ the effect on the tracing is different. Every read from trace_pipe is consumed. This means that subsequent reads will be different. The trace is live. - # echo function > /debug/tracing/current_tracer - # cat /debug/tracing/trace_pipe > /tmp/trace.out & + # echo function > current_tracer + # cat trace_pipe > /tmp/trace.out & [1] 4153 - # echo 1 > /debug/tracing/tracing_enabled + # echo 1 > tracing_on # usleep 1 - # echo 0 > /debug/tracing/tracing_enabled - # cat /debug/tracing/trace + # echo 0 > tracing_on + # cat trace # tracer: function # -# TASK-PID CPU# TIMESTAMP FUNCTION -# | | | | | +# entries-in-buffer/entries-written: 0/0 #P:4 +# +# _-----=> irqs-off +# / _----=> need-resched +# | / _---=> hardirq/softirq +# || / _--=> preempt-depth +# ||| / delay +# TASK-PID CPU# |||| TIMESTAMP FUNCTION +# | | | |||| | | # # cat /tmp/trace.out - bash-4043 [00] 41.267106: finish_task_switch <-schedule - bash-4043 [00] 41.267106: hrtick_set <-schedule - bash-4043 [00] 41.267107: hrtick_clear <-hrtick_set - bash-4043 [00] 41.267108: wait_for_completion <-__stop_machine_run - bash-4043 [00] 41.267108: wait_for_common <-wait_for_completion - bash-4043 [00] 41.267109: kthread_stop <-stop_machine_run - bash-4043 [00] 41.267109: init_waitqueue_head <-kthread_stop - bash-4043 [00] 41.267110: wake_up_process <-kthread_stop - bash-4043 [00] 41.267110: try_to_wake_up <-wake_up_process - bash-4043 [00] 41.267111: select_task_rq_rt <-try_to_wake_up + bash-1994 [000] .... 5281.568961: mutex_unlock <-rb_simple_write + bash-1994 [000] .... 5281.568963: __mutex_unlock_slowpath <-mutex_unlock + bash-1994 [000] .... 5281.568963: __fsnotify_parent <-fsnotify_modify + bash-1994 [000] .... 5281.568964: fsnotify <-fsnotify_modify + bash-1994 [000] .... 5281.568964: __srcu_read_lock <-fsnotify + bash-1994 [000] .... 5281.568964: add_preempt_count <-__srcu_read_lock + bash-1994 [000] ...1 5281.568965: sub_preempt_count <-__srcu_read_lock + bash-1994 [000] .... 5281.568965: __srcu_read_unlock <-fsnotify + bash-1994 [000] .... 5281.568967: sys_dup2 <-system_call_fastpath Note, reading the trace_pipe file will block until more input is -added. By changing the tracer, trace_pipe will issue an EOF. We -needed to set the function tracer _before_ we "cat" the -trace_pipe file. - +added. trace entries ------------- @@ -1806,32 +2523,315 @@ Having too much or not enough data can be troublesome in diagnosing an issue in the kernel. The file buffer_size_kb is used to modify the size of the internal trace buffers. The number listed is the number of entries that can be recorded per -CPU. To know the full size, multiply the number of possible CPUS +CPU. To know the full size, multiply the number of possible CPUs with the number of entries. - # cat /debug/tracing/buffer_size_kb + # cat buffer_size_kb 1408 (units kilobytes) -Note, to modify this, you must have tracing completely disabled. -To do that, echo "nop" into the current_tracer. If the -current_tracer is not set to "nop", an EINVAL error will be -returned. +Or simply read buffer_total_size_kb - # echo nop > /debug/tracing/current_tracer - # echo 10000 > /debug/tracing/buffer_size_kb - # cat /debug/tracing/buffer_size_kb + # cat buffer_total_size_kb +5632 + +To modify the buffer, simple echo in a number (in 1024 byte segments). + + # echo 10000 > buffer_size_kb + # cat buffer_size_kb 10000 (units kilobytes) -The number of pages which will be allocated is limited to a -percentage of available memory. Allocating too much will produce -an error. +It will try to allocate as much as possible. If you allocate too +much, it can cause Out-Of-Memory to trigger. - # echo 1000000000000 > /debug/tracing/buffer_size_kb + # echo 1000000000000 > buffer_size_kb -bash: echo: write error: Cannot allocate memory - # cat /debug/tracing/buffer_size_kb + # cat buffer_size_kb 85 +The per_cpu buffers can be changed individually as well: + + # echo 10000 > per_cpu/cpu0/buffer_size_kb + # echo 100 > per_cpu/cpu1/buffer_size_kb + +When the per_cpu buffers are not the same, the buffer_size_kb +at the top level will just show an X + + # cat buffer_size_kb +X + +This is where the buffer_total_size_kb is useful: + + # cat buffer_total_size_kb +12916 + +Writing to the top level buffer_size_kb will reset all the buffers +to be the same again. + +Snapshot +-------- +CONFIG_TRACER_SNAPSHOT makes a generic snapshot feature +available to all non latency tracers. (Latency tracers which +record max latency, such as "irqsoff" or "wakeup", can't use +this feature, since those are already using the snapshot +mechanism internally.) + +Snapshot preserves a current trace buffer at a particular point +in time without stopping tracing. Ftrace swaps the current +buffer with a spare buffer, and tracing continues in the new +current (=previous spare) buffer. + +The following debugfs files in "tracing" are related to this +feature: + + snapshot: + + This is used to take a snapshot and to read the output + of the snapshot. Echo 1 into this file to allocate a + spare buffer and to take a snapshot (swap), then read + the snapshot from this file in the same format as + "trace" (described above in the section "The File + System"). Both reads snapshot and tracing are executable + in parallel. When the spare buffer is allocated, echoing + 0 frees it, and echoing else (positive) values clear the + snapshot contents. + More details are shown in the table below. + + status\input | 0 | 1 | else | + --------------+------------+------------+------------+ + not allocated |(do nothing)| alloc+swap |(do nothing)| + --------------+------------+------------+------------+ + allocated | free | swap | clear | + --------------+------------+------------+------------+ + +Here is an example of using the snapshot feature. + + # echo 1 > events/sched/enable + # echo 1 > snapshot + # cat snapshot +# tracer: nop +# +# entries-in-buffer/entries-written: 71/71 #P:8 +# +# _-----=> irqs-off +# / _----=> need-resched +# | / _---=> hardirq/softirq +# || / _--=> preempt-depth +# ||| / delay +# TASK-PID CPU# |||| TIMESTAMP FUNCTION +# | | | |||| | | + <idle>-0 [005] d... 2440.603828: sched_switch: prev_comm=swapper/5 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=snapshot-test-2 next_pid=2242 next_prio=120 + sleep-2242 [005] d... 2440.603846: sched_switch: prev_comm=snapshot-test-2 prev_pid=2242 prev_prio=120 prev_state=R ==> next_comm=kworker/5:1 next_pid=60 next_prio=120 +[...] + <idle>-0 [002] d... 2440.707230: sched_switch: prev_comm=swapper/2 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=snapshot-test-2 next_pid=2229 next_prio=120 + + # cat trace +# tracer: nop +# +# entries-in-buffer/entries-written: 77/77 #P:8 +# +# _-----=> irqs-off +# / _----=> need-resched +# | / _---=> hardirq/softirq +# || / _--=> preempt-depth +# ||| / delay +# TASK-PID CPU# |||| TIMESTAMP FUNCTION +# | | | |||| | | + <idle>-0 [007] d... 2440.707395: sched_switch: prev_comm=swapper/7 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=snapshot-test-2 next_pid=2243 next_prio=120 + snapshot-test-2-2229 [002] d... 2440.707438: sched_switch: prev_comm=snapshot-test-2 prev_pid=2229 prev_prio=120 prev_state=S ==> next_comm=swapper/2 next_pid=0 next_prio=120 +[...] + + +If you try to use this snapshot feature when current tracer is +one of the latency tracers, you will get the following results. + + # echo wakeup > current_tracer + # echo 1 > snapshot +bash: echo: write error: Device or resource busy + # cat snapshot +cat: snapshot: Device or resource busy + + +Instances +--------- +In the debugfs tracing directory is a directory called "instances". +This directory can have new directories created inside of it using +mkdir, and removing directories with rmdir. The directory created +with mkdir in this directory will already contain files and other +directories after it is created. + + # mkdir instances/foo + # ls instances/foo +buffer_size_kb buffer_total_size_kb events free_buffer per_cpu +set_event snapshot trace trace_clock trace_marker trace_options +trace_pipe tracing_on + +As you can see, the new directory looks similar to the tracing directory +itself. In fact, it is very similar, except that the buffer and +events are agnostic from the main director, or from any other +instances that are created. + +The files in the new directory work just like the files with the +same name in the tracing directory except the buffer that is used +is a separate and new buffer. The files affect that buffer but do not +affect the main buffer with the exception of trace_options. Currently, +the trace_options affect all instances and the top level buffer +the same, but this may change in future releases. That is, options +may become specific to the instance they reside in. + +Notice that none of the function tracer files are there, nor is +current_tracer and available_tracers. This is because the buffers +can currently only have events enabled for them. + + # mkdir instances/foo + # mkdir instances/bar + # mkdir instances/zoot + # echo 100000 > buffer_size_kb + # echo 1000 > instances/foo/buffer_size_kb + # echo 5000 > instances/bar/per_cpu/cpu1/buffer_size_kb + # echo function > current_trace + # echo 1 > instances/foo/events/sched/sched_wakeup/enable + # echo 1 > instances/foo/events/sched/sched_wakeup_new/enable + # echo 1 > instances/foo/events/sched/sched_switch/enable + # echo 1 > instances/bar/events/irq/enable + # echo 1 > instances/zoot/events/syscalls/enable + # cat trace_pipe +CPU:2 [LOST 11745 EVENTS] + bash-2044 [002] .... 10594.481032: _raw_spin_lock_irqsave <-get_page_from_freelist + bash-2044 [002] d... 10594.481032: add_preempt_count <-_raw_spin_lock_irqsave + bash-2044 [002] d..1 10594.481032: __rmqueue <-get_page_from_freelist + bash-2044 [002] d..1 10594.481033: _raw_spin_unlock <-get_page_from_freelist + bash-2044 [002] d..1 10594.481033: sub_preempt_count <-_raw_spin_unlock + bash-2044 [002] d... 10594.481033: get_pageblock_flags_group <-get_pageblock_migratetype + bash-2044 [002] d... 10594.481034: __mod_zone_page_state <-get_page_from_freelist + bash-2044 [002] d... 10594.481034: zone_statistics <-get_page_from_freelist + bash-2044 [002] d... 10594.481034: __inc_zone_state <-zone_statistics + bash-2044 [002] d... 10594.481034: __inc_zone_state <-zone_statistics + bash-2044 [002] .... 10594.481035: arch_dup_task_struct <-copy_process +[...] + + # cat instances/foo/trace_pipe + bash-1998 [000] d..4 136.676759: sched_wakeup: comm=kworker/0:1 pid=59 prio=120 success=1 target_cpu=000 + bash-1998 [000] dN.4 136.676760: sched_wakeup: comm=bash pid=1998 prio=120 success=1 target_cpu=000 + <idle>-0 [003] d.h3 136.676906: sched_wakeup: comm=rcu_preempt pid=9 prio=120 success=1 target_cpu=003 + <idle>-0 [003] d..3 136.676909: sched_switch: prev_comm=swapper/3 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=rcu_preempt next_pid=9 next_prio=120 + rcu_preempt-9 [003] d..3 136.676916: sched_switch: prev_comm=rcu_preempt prev_pid=9 prev_prio=120 prev_state=S ==> next_comm=swapper/3 next_pid=0 next_prio=120 + bash-1998 [000] d..4 136.677014: sched_wakeup: comm=kworker/0:1 pid=59 prio=120 success=1 target_cpu=000 + bash-1998 [000] dN.4 136.677016: sched_wakeup: comm=bash pid=1998 prio=120 success=1 target_cpu=000 + bash-1998 [000] d..3 136.677018: sched_switch: prev_comm=bash prev_pid=1998 prev_prio=120 prev_state=R+ ==> next_comm=kworker/0:1 next_pid=59 next_prio=120 + kworker/0:1-59 [000] d..4 136.677022: sched_wakeup: comm=sshd pid=1995 prio=120 success=1 target_cpu=001 + kworker/0:1-59 [000] d..3 136.677025: sched_switch: prev_comm=kworker/0:1 prev_pid=59 prev_prio=120 prev_state=S ==> next_comm=bash next_pid=1998 next_prio=120 +[...] + + # cat instances/bar/trace_pipe + migration/1-14 [001] d.h3 138.732674: softirq_raise: vec=3 [action=NET_RX] + <idle>-0 [001] dNh3 138.732725: softirq_raise: vec=3 [action=NET_RX] + bash-1998 [000] d.h1 138.733101: softirq_raise: vec=1 [action=TIMER] + bash-1998 [000] d.h1 138.733102: softirq_raise: vec=9 [action=RCU] + bash-1998 [000] ..s2 138.733105: softirq_entry: vec=1 [action=TIMER] + bash-1998 [000] ..s2 138.733106: softirq_exit: vec=1 [action=TIMER] + bash-1998 [000] ..s2 138.733106: softirq_entry: vec=9 [action=RCU] + bash-1998 [000] ..s2 138.733109: softirq_exit: vec=9 [action=RCU] + sshd-1995 [001] d.h1 138.733278: irq_handler_entry: irq=21 name=uhci_hcd:usb4 + sshd-1995 [001] d.h1 138.733280: irq_handler_exit: irq=21 ret=unhandled + sshd-1995 [001] d.h1 138.733281: irq_handler_entry: irq=21 name=eth0 + sshd-1995 [001] d.h1 138.733283: irq_handler_exit: irq=21 ret=handled +[...] + + # cat instances/zoot/trace +# tracer: nop +# +# entries-in-buffer/entries-written: 18996/18996 #P:4 +# +# _-----=> irqs-off +# / _----=> need-resched +# | / _---=> hardirq/softirq +# || / _--=> preempt-depth +# ||| / delay +# TASK-PID CPU# |||| TIMESTAMP FUNCTION +# | | | |||| | | + bash-1998 [000] d... 140.733501: sys_write -> 0x2 + bash-1998 [000] d... 140.733504: sys_dup2(oldfd: a, newfd: 1) + bash-1998 [000] d... 140.733506: sys_dup2 -> 0x1 + bash-1998 [000] d... 140.733508: sys_fcntl(fd: a, cmd: 1, arg: 0) + bash-1998 [000] d... 140.733509: sys_fcntl -> 0x1 + bash-1998 [000] d... 140.733510: sys_close(fd: a) + bash-1998 [000] d... 140.733510: sys_close -> 0x0 + bash-1998 [000] d... 140.733514: sys_rt_sigprocmask(how: 0, nset: 0, oset: 6e2768, sigsetsize: 8) + bash-1998 [000] d... 140.733515: sys_rt_sigprocmask -> 0x0 + bash-1998 [000] d... 140.733516: sys_rt_sigaction(sig: 2, act: 7fff718846f0, oact: 7fff71884650, sigsetsize: 8) + bash-1998 [000] d... 140.733516: sys_rt_sigaction -> 0x0 + +You can see that the trace of the top most trace buffer shows only +the function tracing. The foo instance displays wakeups and task +switches. + +To remove the instances, simply delete their directories: + + # rmdir instances/foo + # rmdir instances/bar + # rmdir instances/zoot + +Note, if a process has a trace file open in one of the instance +directories, the rmdir will fail with EBUSY. + + +Stack trace ----------- +Since the kernel has a fixed sized stack, it is important not to +waste it in functions. A kernel developer must be conscience of +what they allocate on the stack. If they add too much, the system +can be in danger of a stack overflow, and corruption will occur, +usually leading to a system panic. + +There are some tools that check this, usually with interrupts +periodically checking usage. But if you can perform a check +at every function call that will become very useful. As ftrace provides +a function tracer, it makes it convenient to check the stack size +at every function call. This is enabled via the stack tracer. + +CONFIG_STACK_TRACER enables the ftrace stack tracing functionality. +To enable it, write a '1' into /proc/sys/kernel/stack_tracer_enabled. + + # echo 1 > /proc/sys/kernel/stack_tracer_enabled + +You can also enable it from the kernel command line to trace +the stack size of the kernel during boot up, by adding "stacktrace" +to the kernel command line parameter. + +After running it for a few minutes, the output looks like: + + # cat stack_max_size +2928 + + # cat stack_trace + Depth Size Location (18 entries) + ----- ---- -------- + 0) 2928 224 update_sd_lb_stats+0xbc/0x4ac + 1) 2704 160 find_busiest_group+0x31/0x1f1 + 2) 2544 256 load_balance+0xd9/0x662 + 3) 2288 80 idle_balance+0xbb/0x130 + 4) 2208 128 __schedule+0x26e/0x5b9 + 5) 2080 16 schedule+0x64/0x66 + 6) 2064 128 schedule_timeout+0x34/0xe0 + 7) 1936 112 wait_for_common+0x97/0xf1 + 8) 1824 16 wait_for_completion+0x1d/0x1f + 9) 1808 128 flush_work+0xfe/0x119 + 10) 1680 16 tty_flush_to_ldisc+0x1e/0x20 + 11) 1664 48 input_available_p+0x1d/0x5c + 12) 1616 48 n_tty_poll+0x6d/0x134 + 13) 1568 64 tty_poll+0x64/0x7f + 14) 1504 880 do_select+0x31e/0x511 + 15) 624 400 core_sys_select+0x177/0x216 + 16) 224 96 sys_select+0x91/0xb9 + 17) 128 128 system_call_fastpath+0x16/0x1b + +Note, if -mfentry is being used by gcc, functions get traced before +they set up the stack frame. This means that leaf level functions +are not tested by the stack tracer when -mfentry is used. + +Currently, -mfentry is used by gcc 4.6.0 and above on x86 only. + +--------- More details can be found in the source code, in the -kernel/tracing/*.c files. +kernel/trace/*.c files. diff --git a/Documentation/trace/function-graph-fold.vim b/Documentation/trace/function-graph-fold.vim new file mode 100644 index 00000000000..0544b504c8b --- /dev/null +++ b/Documentation/trace/function-graph-fold.vim @@ -0,0 +1,42 @@ +" Enable folding for ftrace function_graph traces. +" +" To use, :source this file while viewing a function_graph trace, or use vim's +" -S option to load from the command-line together with a trace. You can then +" use the usual vim fold commands, such as "za", to open and close nested +" functions. While closed, a fold will show the total time taken for a call, +" as would normally appear on the line with the closing brace. Folded +" functions will not include finish_task_switch(), so folding should remain +" relatively sane even through a context switch. +" +" Note that this will almost certainly only work well with a +" single-CPU trace (e.g. trace-cmd report --cpu 1). + +function! FunctionGraphFoldExpr(lnum) + let line = getline(a:lnum) + if line[-1:] == '{' + if line =~ 'finish_task_switch() {$' + return '>1' + endif + return 'a1' + elseif line[-1:] == '}' + return 's1' + else + return '=' + endif +endfunction + +function! FunctionGraphFoldText() + let s = split(getline(v:foldstart), '|', 1) + if getline(v:foldend+1) =~ 'finish_task_switch() {$' + let s[2] = ' task switch ' + else + let e = split(getline(v:foldend), '|', 1) + let s[2] = e[2] + endif + return join(s, '|') +endfunction + +setlocal foldexpr=FunctionGraphFoldExpr(v:lnum) +setlocal foldtext=FunctionGraphFoldText() +setlocal foldcolumn=12 +setlocal foldmethod=expr diff --git a/Documentation/trace/kmemtrace.txt b/Documentation/trace/kmemtrace.txt deleted file mode 100644 index a956d9b7f94..00000000000 --- a/Documentation/trace/kmemtrace.txt +++ /dev/null @@ -1,126 +0,0 @@ - kmemtrace - Kernel Memory Tracer - - by Eduard - Gabriel Munteanu - <eduard.munteanu@linux360.ro> - -I. Introduction -=============== - -kmemtrace helps kernel developers figure out two things: -1) how different allocators (SLAB, SLUB etc.) perform -2) how kernel code allocates memory and how much - -To do this, we trace every allocation and export information to the userspace -through the relay interface. We export things such as the number of requested -bytes, the number of bytes actually allocated (i.e. including internal -fragmentation), whether this is a slab allocation or a plain kmalloc() and so -on. - -The actual analysis is performed by a userspace tool (see section III for -details on where to get it from). It logs the data exported by the kernel, -processes it and (as of writing this) can provide the following information: -- the total amount of memory allocated and fragmentation per call-site -- the amount of memory allocated and fragmentation per allocation -- total memory allocated and fragmentation in the collected dataset -- number of cross-CPU allocation and frees (makes sense in NUMA environments) - -Moreover, it can potentially find inconsistent and erroneous behavior in -kernel code, such as using slab free functions on kmalloc'ed memory or -allocating less memory than requested (but not truly failed allocations). - -kmemtrace also makes provisions for tracing on some arch and analysing the -data on another. - -II. Design and goals -==================== - -kmemtrace was designed to handle rather large amounts of data. Thus, it uses -the relay interface to export whatever is logged to userspace, which then -stores it. Analysis and reporting is done asynchronously, that is, after the -data is collected and stored. By design, it allows one to log and analyse -on different machines and different arches. - -As of writing this, the ABI is not considered stable, though it might not -change much. However, no guarantees are made about compatibility yet. When -deemed stable, the ABI should still allow easy extension while maintaining -backward compatibility. This is described further in Documentation/ABI. - -Summary of design goals: - - allow logging and analysis to be done across different machines - - be fast and anticipate usage in high-load environments (*) - - be reasonably extensible - - make it possible for GNU/Linux distributions to have kmemtrace - included in their repositories - -(*) - one of the reasons Pekka Enberg's original userspace data analysis - tool's code was rewritten from Perl to C (although this is more than a - simple conversion) - - -III. Quick usage guide -====================== - -1) Get a kernel that supports kmemtrace and build it accordingly (i.e. enable -CONFIG_KMEMTRACE). - -2) Get the userspace tool and build it: -$ git-clone git://repo.or.cz/kmemtrace-user.git # current repository -$ cd kmemtrace-user/ -$ ./autogen.sh -$ ./configure -$ make - -3) Boot the kmemtrace-enabled kernel if you haven't, preferably in the -'single' runlevel (so that relay buffers don't fill up easily), and run -kmemtrace: -# '$' does not mean user, but root here. -$ mount -t debugfs none /sys/kernel/debug -$ mount -t proc none /proc -$ cd path/to/kmemtrace-user/ -$ ./kmemtraced -Wait a bit, then stop it with CTRL+C. -$ cat /sys/kernel/debug/kmemtrace/total_overruns # Check if we didn't - # overrun, should - # be zero. -$ (Optionally) [Run kmemtrace_check separately on each cpu[0-9]*.out file to - check its correctness] -$ ./kmemtrace-report - -Now you should have a nice and short summary of how the allocator performs. - -IV. FAQ and known issues -======================== - -Q: 'cat /sys/kernel/debug/kmemtrace/total_overruns' is non-zero, how do I fix -this? Should I worry? -A: If it's non-zero, this affects kmemtrace's accuracy, depending on how -large the number is. You can fix it by supplying a higher -'kmemtrace.subbufs=N' kernel parameter. ---- - -Q: kmemtrace_check reports errors, how do I fix this? Should I worry? -A: This is a bug and should be reported. It can occur for a variety of -reasons: - - possible bugs in relay code - - possible misuse of relay by kmemtrace - - timestamps being collected unorderly -Or you may fix it yourself and send us a patch. ---- - -Q: kmemtrace_report shows many errors, how do I fix this? Should I worry? -A: This is a known issue and I'm working on it. These might be true errors -in kernel code, which may have inconsistent behavior (e.g. allocating memory -with kmem_cache_alloc() and freeing it with kfree()). Pekka Enberg pointed -out this behavior may work with SLAB, but may fail with other allocators. - -It may also be due to lack of tracing in some unusual allocator functions. - -We don't want bug reports regarding this issue yet. ---- - -V. See also -=========== - -Documentation/kernel-parameters.txt -Documentation/ABI/testing/debugfs-kmemtrace - diff --git a/Documentation/trace/kprobetrace.txt b/Documentation/trace/kprobetrace.txt new file mode 100644 index 00000000000..d68ea5fc812 --- /dev/null +++ b/Documentation/trace/kprobetrace.txt @@ -0,0 +1,172 @@ + Kprobe-based Event Tracing + ========================== + + Documentation is written by Masami Hiramatsu + + +Overview +-------- +These events are similar to tracepoint based events. Instead of Tracepoint, +this is based on kprobes (kprobe and kretprobe). So it can probe wherever +kprobes can probe (this means, all functions body except for __kprobes +functions). Unlike the Tracepoint based event, this can be added and removed +dynamically, on the fly. + +To enable this feature, build your kernel with CONFIG_KPROBE_EVENT=y. + +Similar to the events tracer, this doesn't need to be activated via +current_tracer. Instead of that, add probe points via +/sys/kernel/debug/tracing/kprobe_events, and enable it via +/sys/kernel/debug/tracing/events/kprobes/<EVENT>/enabled. + + +Synopsis of kprobe_events +------------------------- + p[:[GRP/]EVENT] [MOD:]SYM[+offs]|MEMADDR [FETCHARGS] : Set a probe + r[:[GRP/]EVENT] [MOD:]SYM[+0] [FETCHARGS] : Set a return probe + -:[GRP/]EVENT : Clear a probe + + GRP : Group name. If omitted, use "kprobes" for it. + EVENT : Event name. If omitted, the event name is generated + based on SYM+offs or MEMADDR. + MOD : Module name which has given SYM. + SYM[+offs] : Symbol+offset where the probe is inserted. + MEMADDR : Address where the probe is inserted. + + FETCHARGS : Arguments. Each probe can have up to 128 args. + %REG : Fetch register REG + @ADDR : Fetch memory at ADDR (ADDR should be in kernel) + @SYM[+|-offs] : Fetch memory at SYM +|- offs (SYM should be a data symbol) + $stackN : Fetch Nth entry of stack (N >= 0) + $stack : Fetch stack address. + $retval : Fetch return value.(*) + +|-offs(FETCHARG) : Fetch memory at FETCHARG +|- offs address.(**) + NAME=FETCHARG : Set NAME as the argument name of FETCHARG. + FETCHARG:TYPE : Set TYPE as the type of FETCHARG. Currently, basic types + (u8/u16/u32/u64/s8/s16/s32/s64), "string" and bitfield + are supported. + + (*) only for return probe. + (**) this is useful for fetching a field of data structures. + +Types +----- +Several types are supported for fetch-args. Kprobe tracer will access memory +by given type. Prefix 's' and 'u' means those types are signed and unsigned +respectively. Traced arguments are shown in decimal (signed) or hex (unsigned). +String type is a special type, which fetches a "null-terminated" string from +kernel space. This means it will fail and store NULL if the string container +has been paged out. +Bitfield is another special type, which takes 3 parameters, bit-width, bit- +offset, and container-size (usually 32). The syntax is; + + b<bit-width>@<bit-offset>/<container-size> + + +Per-Probe Event Filtering +------------------------- + Per-probe event filtering feature allows you to set different filter on each +probe and gives you what arguments will be shown in trace buffer. If an event +name is specified right after 'p:' or 'r:' in kprobe_events, it adds an event +under tracing/events/kprobes/<EVENT>, at the directory you can see 'id', +'enabled', 'format' and 'filter'. + +enabled: + You can enable/disable the probe by writing 1 or 0 on it. + +format: + This shows the format of this probe event. + +filter: + You can write filtering rules of this event. + +id: + This shows the id of this probe event. + + +Event Profiling +--------------- + You can check the total number of probe hits and probe miss-hits via +/sys/kernel/debug/tracing/kprobe_profile. + The first column is event name, the second is the number of probe hits, +the third is the number of probe miss-hits. + + +Usage examples +-------------- +To add a probe as a new event, write a new definition to kprobe_events +as below. + + echo 'p:myprobe do_sys_open dfd=%ax filename=%dx flags=%cx mode=+4($stack)' > /sys/kernel/debug/tracing/kprobe_events + + This sets a kprobe on the top of do_sys_open() function with recording +1st to 4th arguments as "myprobe" event. Note, which register/stack entry is +assigned to each function argument depends on arch-specific ABI. If you unsure +the ABI, please try to use probe subcommand of perf-tools (you can find it +under tools/perf/). +As this example shows, users can choose more familiar names for each arguments. + + echo 'r:myretprobe do_sys_open $retval' >> /sys/kernel/debug/tracing/kprobe_events + + This sets a kretprobe on the return point of do_sys_open() function with +recording return value as "myretprobe" event. + You can see the format of these events via +/sys/kernel/debug/tracing/events/kprobes/<EVENT>/format. + + cat /sys/kernel/debug/tracing/events/kprobes/myprobe/format +name: myprobe +ID: 780 +format: + field:unsigned short common_type; offset:0; size:2; signed:0; + field:unsigned char common_flags; offset:2; size:1; signed:0; + field:unsigned char common_preempt_count; offset:3; size:1;signed:0; + field:int common_pid; offset:4; size:4; signed:1; + + field:unsigned long __probe_ip; offset:12; size:4; signed:0; + field:int __probe_nargs; offset:16; size:4; signed:1; + field:unsigned long dfd; offset:20; size:4; signed:0; + field:unsigned long filename; offset:24; size:4; signed:0; + field:unsigned long flags; offset:28; size:4; signed:0; + field:unsigned long mode; offset:32; size:4; signed:0; + + +print fmt: "(%lx) dfd=%lx filename=%lx flags=%lx mode=%lx", REC->__probe_ip, +REC->dfd, REC->filename, REC->flags, REC->mode + + You can see that the event has 4 arguments as in the expressions you specified. + + echo > /sys/kernel/debug/tracing/kprobe_events + + This clears all probe points. + + Or, + + echo -:myprobe >> kprobe_events + + This clears probe points selectively. + + Right after definition, each event is disabled by default. For tracing these +events, you need to enable it. + + echo 1 > /sys/kernel/debug/tracing/events/kprobes/myprobe/enable + echo 1 > /sys/kernel/debug/tracing/events/kprobes/myretprobe/enable + + And you can see the traced information via /sys/kernel/debug/tracing/trace. + + cat /sys/kernel/debug/tracing/trace +# tracer: nop +# +# TASK-PID CPU# TIMESTAMP FUNCTION +# | | | | | + <...>-1447 [001] 1038282.286875: myprobe: (do_sys_open+0x0/0xd6) dfd=3 filename=7fffd1ec4440 flags=8000 mode=0 + <...>-1447 [001] 1038282.286878: myretprobe: (sys_openat+0xc/0xe <- do_sys_open) $retval=fffffffffffffffe + <...>-1447 [001] 1038282.286885: myprobe: (do_sys_open+0x0/0xd6) dfd=ffffff9c filename=40413c flags=8000 mode=1b6 + <...>-1447 [001] 1038282.286915: myretprobe: (sys_open+0x1b/0x1d <- do_sys_open) $retval=3 + <...>-1447 [001] 1038282.286969: myprobe: (do_sys_open+0x0/0xd6) dfd=ffffff9c filename=4041c6 flags=98800 mode=10 + <...>-1447 [001] 1038282.286976: myretprobe: (sys_open+0x1b/0x1d <- do_sys_open) $retval=3 + + + Each line shows when the kernel hits an event, and <- SYMBOL means kernel +returns from SYMBOL(e.g. "sys_open+0x1b/0x1d <- do_sys_open" means kernel +returns from do_sys_open to sys_open+0x1b). + diff --git a/Documentation/trace/mmiotrace.txt b/Documentation/trace/mmiotrace.txt index 5731c67abc5..664e7386d89 100644 --- a/Documentation/trace/mmiotrace.txt +++ b/Documentation/trace/mmiotrace.txt @@ -32,41 +32,42 @@ is no way to automatically detect if you are losing events due to CPUs racing. Usage Quick Reference --------------------- -$ mount -t debugfs debugfs /debug -$ echo mmiotrace > /debug/tracing/current_tracer -$ cat /debug/tracing/trace_pipe > mydump.txt & +$ mount -t debugfs debugfs /sys/kernel/debug +$ echo mmiotrace > /sys/kernel/debug/tracing/current_tracer +$ cat /sys/kernel/debug/tracing/trace_pipe > mydump.txt & Start X or whatever. -$ echo "X is up" > /debug/tracing/trace_marker -$ echo nop > /debug/tracing/current_tracer +$ echo "X is up" > /sys/kernel/debug/tracing/trace_marker +$ echo nop > /sys/kernel/debug/tracing/current_tracer Check for lost events. Usage ----- -Make sure debugfs is mounted to /debug. If not, (requires root privileges) -$ mount -t debugfs debugfs /debug +Make sure debugfs is mounted to /sys/kernel/debug. +If not (requires root privileges): +$ mount -t debugfs debugfs /sys/kernel/debug Check that the driver you are about to trace is not loaded. Activate mmiotrace (requires root privileges): -$ echo mmiotrace > /debug/tracing/current_tracer +$ echo mmiotrace > /sys/kernel/debug/tracing/current_tracer Start storing the trace: -$ cat /debug/tracing/trace_pipe > mydump.txt & +$ cat /sys/kernel/debug/tracing/trace_pipe > mydump.txt & The 'cat' process should stay running (sleeping) in the background. Load the driver you want to trace and use it. Mmiotrace will only catch MMIO accesses to areas that are ioremapped while mmiotrace is active. During tracing you can place comments (markers) into the trace by -$ echo "X is up" > /debug/tracing/trace_marker +$ echo "X is up" > /sys/kernel/debug/tracing/trace_marker This makes it easier to see which part of the (huge) trace corresponds to which action. It is recommended to place descriptive markers about what you do. Shut down mmiotrace (requires root privileges): -$ echo nop > /debug/tracing/current_tracer +$ echo nop > /sys/kernel/debug/tracing/current_tracer The 'cat' process exits. If it does not, kill it by issuing 'fg' command and pressing ctrl+c. @@ -78,10 +79,10 @@ to view your kernel log and look for "mmiotrace has lost events" warning. If events were lost, the trace is incomplete. You should enlarge the buffers and try again. Buffers are enlarged by first seeing how large the current buffers are: -$ cat /debug/tracing/buffer_size_kb +$ cat /sys/kernel/debug/tracing/buffer_size_kb gives you a number. Approximately double this number and write it back, for instance: -$ echo 128000 > /debug/tracing/buffer_size_kb +$ echo 128000 > /sys/kernel/debug/tracing/buffer_size_kb Then start again from the top. If you are doing a trace for a driver project, e.g. Nouveau, you should also @@ -91,7 +92,7 @@ $ dmesg > dmesg.txt $ tar zcf pciid-nick-mmiotrace.tar.gz mydump.txt lspci.txt dmesg.txt and then send the .tar.gz file. The trace compresses considerably. Replace "pciid" and "nick" with the PCI ID or model name of your piece of hardware -under investigation and your nick name. +under investigation and your nickname. How Mmiotrace Works @@ -100,7 +101,7 @@ How Mmiotrace Works Access to hardware IO-memory is gained by mapping addresses from PCI bus by calling one of the ioremap_*() functions. Mmiotrace is hooked into the __ioremap() function and gets called whenever a mapping is created. Mapping is -an event that is recorded into the trace log. Note, that ISA range mappings +an event that is recorded into the trace log. Note that ISA range mappings are not caught, since the mapping always exists and is returned directly. MMIO accesses are recorded via page faults. Just before __ioremap() returns, @@ -122,11 +123,11 @@ Trace Log Format ---------------- The raw log is text and easily filtered with e.g. grep and awk. One record is -one line in the log. A record starts with a keyword, followed by keyword -dependant arguments. Arguments are separated by a space, or continue until the +one line in the log. A record starts with a keyword, followed by keyword- +dependent arguments. Arguments are separated by a space, or continue until the end of line. The format for version 20070824 is as follows: -Explanation Keyword Space separated arguments +Explanation Keyword Space-separated arguments --------------------------------------------------------------------------- read event R width, timestamp, map id, physical, value, PC, PID @@ -136,7 +137,7 @@ iounmap event UNMAP timestamp, map id, PC, PID marker MARK timestamp, text version VERSION the string "20070824" info for reader LSPCI one line from lspci -v -PCI address map PCIDEV space separated /proc/bus/pci/devices data +PCI address map PCIDEV space-separated /proc/bus/pci/devices data unk. opcode UNKNOWN timestamp, map id, physical, data, PC, PID Timestamp is in seconds with decimals. Physical is a PCI bus address, virtual diff --git a/Documentation/trace/postprocess/trace-pagealloc-postprocess.pl b/Documentation/trace/postprocess/trace-pagealloc-postprocess.pl new file mode 100644 index 00000000000..0a120aae33c --- /dev/null +++ b/Documentation/trace/postprocess/trace-pagealloc-postprocess.pl @@ -0,0 +1,418 @@ +#!/usr/bin/perl +# This is a POC (proof of concept or piece of crap, take your pick) for reading the +# text representation of trace output related to page allocation. It makes an attempt +# to extract some high-level information on what is going on. The accuracy of the parser +# may vary considerably +# +# Example usage: trace-pagealloc-postprocess.pl < /sys/kernel/debug/tracing/trace_pipe +# other options +# --prepend-parent Report on the parent proc and PID +# --read-procstat If the trace lacks process info, get it from /proc +# --ignore-pid Aggregate processes of the same name together +# +# Copyright (c) IBM Corporation 2009 +# Author: Mel Gorman <mel@csn.ul.ie> +use strict; +use Getopt::Long; + +# Tracepoint events +use constant MM_PAGE_ALLOC => 1; +use constant MM_PAGE_FREE => 2; +use constant MM_PAGE_FREE_BATCHED => 3; +use constant MM_PAGE_PCPU_DRAIN => 4; +use constant MM_PAGE_ALLOC_ZONE_LOCKED => 5; +use constant MM_PAGE_ALLOC_EXTFRAG => 6; +use constant EVENT_UNKNOWN => 7; + +# Constants used to track state +use constant STATE_PCPU_PAGES_DRAINED => 8; +use constant STATE_PCPU_PAGES_REFILLED => 9; + +# High-level events extrapolated from tracepoints +use constant HIGH_PCPU_DRAINS => 10; +use constant HIGH_PCPU_REFILLS => 11; +use constant HIGH_EXT_FRAGMENT => 12; +use constant HIGH_EXT_FRAGMENT_SEVERE => 13; +use constant HIGH_EXT_FRAGMENT_MODERATE => 14; +use constant HIGH_EXT_FRAGMENT_CHANGED => 15; + +my %perprocesspid; +my %perprocess; +my $opt_ignorepid; +my $opt_read_procstat; +my $opt_prepend_parent; + +# Catch sigint and exit on request +my $sigint_report = 0; +my $sigint_exit = 0; +my $sigint_pending = 0; +my $sigint_received = 0; +sub sigint_handler { + my $current_time = time; + if ($current_time - 2 > $sigint_received) { + print "SIGINT received, report pending. Hit ctrl-c again to exit\n"; + $sigint_report = 1; + } else { + if (!$sigint_exit) { + print "Second SIGINT received quickly, exiting\n"; + } + $sigint_exit++; + } + + if ($sigint_exit > 3) { + print "Many SIGINTs received, exiting now without report\n"; + exit; + } + + $sigint_received = $current_time; + $sigint_pending = 1; +} +$SIG{INT} = "sigint_handler"; + +# Parse command line options +GetOptions( + 'ignore-pid' => \$opt_ignorepid, + 'read-procstat' => \$opt_read_procstat, + 'prepend-parent' => \$opt_prepend_parent, +); + +# Defaults for dynamically discovered regex's +my $regex_fragdetails_default = 'page=([0-9a-f]*) pfn=([0-9]*) alloc_order=([-0-9]*) fallback_order=([-0-9]*) pageblock_order=([-0-9]*) alloc_migratetype=([-0-9]*) fallback_migratetype=([-0-9]*) fragmenting=([-0-9]) change_ownership=([-0-9])'; + +# Dyanically discovered regex +my $regex_fragdetails; + +# Static regex used. Specified like this for readability and for use with /o +# (process_pid) (cpus ) ( time ) (tpoint ) (details) +my $regex_traceevent = '\s*([a-zA-Z0-9-]*)\s*(\[[0-9]*\])\s*([0-9.]*):\s*([a-zA-Z_]*):\s*(.*)'; +my $regex_statname = '[-0-9]*\s\((.*)\).*'; +my $regex_statppid = '[-0-9]*\s\(.*\)\s[A-Za-z]\s([0-9]*).*'; + +sub generate_traceevent_regex { + my $event = shift; + my $default = shift; + my $regex; + + # Read the event format or use the default + if (!open (FORMAT, "/sys/kernel/debug/tracing/events/$event/format")) { + $regex = $default; + } else { + my $line; + while (!eof(FORMAT)) { + $line = <FORMAT>; + if ($line =~ /^print fmt:\s"(.*)",.*/) { + $regex = $1; + $regex =~ s/%p/\([0-9a-f]*\)/g; + $regex =~ s/%d/\([-0-9]*\)/g; + $regex =~ s/%lu/\([0-9]*\)/g; + } + } + } + + # Verify fields are in the right order + my $tuple; + foreach $tuple (split /\s/, $regex) { + my ($key, $value) = split(/=/, $tuple); + my $expected = shift; + if ($key ne $expected) { + print("WARNING: Format not as expected '$key' != '$expected'"); + $regex =~ s/$key=\((.*)\)/$key=$1/; + } + } + + if (defined shift) { + die("Fewer fields than expected in format"); + } + + return $regex; +} +$regex_fragdetails = generate_traceevent_regex("kmem/mm_page_alloc_extfrag", + $regex_fragdetails_default, + "page", "pfn", + "alloc_order", "fallback_order", "pageblock_order", + "alloc_migratetype", "fallback_migratetype", + "fragmenting", "change_ownership"); + +sub read_statline($) { + my $pid = $_[0]; + my $statline; + + if (open(STAT, "/proc/$pid/stat")) { + $statline = <STAT>; + close(STAT); + } + + if ($statline eq '') { + $statline = "-1 (UNKNOWN_PROCESS_NAME) R 0"; + } + + return $statline; +} + +sub guess_process_pid($$) { + my $pid = $_[0]; + my $statline = $_[1]; + + if ($pid == 0) { + return "swapper-0"; + } + + if ($statline !~ /$regex_statname/o) { + die("Failed to math stat line for process name :: $statline"); + } + return "$1-$pid"; +} + +sub parent_info($$) { + my $pid = $_[0]; + my $statline = $_[1]; + my $ppid; + + if ($pid == 0) { + return "NOPARENT-0"; + } + + if ($statline !~ /$regex_statppid/o) { + die("Failed to match stat line process ppid:: $statline"); + } + + # Read the ppid stat line + $ppid = $1; + return guess_process_pid($ppid, read_statline($ppid)); +} + +sub process_events { + my $traceevent; + my $process_pid; + my $cpus; + my $timestamp; + my $tracepoint; + my $details; + my $statline; + + # Read each line of the event log +EVENT_PROCESS: + while ($traceevent = <STDIN>) { + if ($traceevent =~ /$regex_traceevent/o) { + $process_pid = $1; + $tracepoint = $4; + + if ($opt_read_procstat || $opt_prepend_parent) { + $process_pid =~ /(.*)-([0-9]*)$/; + my $process = $1; + my $pid = $2; + + $statline = read_statline($pid); + + if ($opt_read_procstat && $process eq '') { + $process_pid = guess_process_pid($pid, $statline); + } + + if ($opt_prepend_parent) { + $process_pid = parent_info($pid, $statline) . " :: $process_pid"; + } + } + + # Unnecessary in this script. Uncomment if required + # $cpus = $2; + # $timestamp = $3; + } else { + next; + } + + # Perl Switch() sucks majorly + if ($tracepoint eq "mm_page_alloc") { + $perprocesspid{$process_pid}->{MM_PAGE_ALLOC}++; + } elsif ($tracepoint eq "mm_page_free") { + $perprocesspid{$process_pid}->{MM_PAGE_FREE}++ + } elsif ($tracepoint eq "mm_page_free_batched") { + $perprocesspid{$process_pid}->{MM_PAGE_FREE_BATCHED}++; + } elsif ($tracepoint eq "mm_page_pcpu_drain") { + $perprocesspid{$process_pid}->{MM_PAGE_PCPU_DRAIN}++; + $perprocesspid{$process_pid}->{STATE_PCPU_PAGES_DRAINED}++; + } elsif ($tracepoint eq "mm_page_alloc_zone_locked") { + $perprocesspid{$process_pid}->{MM_PAGE_ALLOC_ZONE_LOCKED}++; + $perprocesspid{$process_pid}->{STATE_PCPU_PAGES_REFILLED}++; + } elsif ($tracepoint eq "mm_page_alloc_extfrag") { + + # Extract the details of the event now + $details = $5; + + my ($page, $pfn); + my ($alloc_order, $fallback_order, $pageblock_order); + my ($alloc_migratetype, $fallback_migratetype); + my ($fragmenting, $change_ownership); + + if ($details !~ /$regex_fragdetails/o) { + print "WARNING: Failed to parse mm_page_alloc_extfrag as expected\n"; + next; + } + + $perprocesspid{$process_pid}->{MM_PAGE_ALLOC_EXTFRAG}++; + $page = $1; + $pfn = $2; + $alloc_order = $3; + $fallback_order = $4; + $pageblock_order = $5; + $alloc_migratetype = $6; + $fallback_migratetype = $7; + $fragmenting = $8; + $change_ownership = $9; + + if ($fragmenting) { + $perprocesspid{$process_pid}->{HIGH_EXT_FRAG}++; + if ($fallback_order <= 3) { + $perprocesspid{$process_pid}->{HIGH_EXT_FRAGMENT_SEVERE}++; + } else { + $perprocesspid{$process_pid}->{HIGH_EXT_FRAGMENT_MODERATE}++; + } + } + if ($change_ownership) { + $perprocesspid{$process_pid}->{HIGH_EXT_FRAGMENT_CHANGED}++; + } + } else { + $perprocesspid{$process_pid}->{EVENT_UNKNOWN}++; + } + + # Catch a full pcpu drain event + if ($perprocesspid{$process_pid}->{STATE_PCPU_PAGES_DRAINED} && + $tracepoint ne "mm_page_pcpu_drain") { + + $perprocesspid{$process_pid}->{HIGH_PCPU_DRAINS}++; + $perprocesspid{$process_pid}->{STATE_PCPU_PAGES_DRAINED} = 0; + } + + # Catch a full pcpu refill event + if ($perprocesspid{$process_pid}->{STATE_PCPU_PAGES_REFILLED} && + $tracepoint ne "mm_page_alloc_zone_locked") { + $perprocesspid{$process_pid}->{HIGH_PCPU_REFILLS}++; + $perprocesspid{$process_pid}->{STATE_PCPU_PAGES_REFILLED} = 0; + } + + if ($sigint_pending) { + last EVENT_PROCESS; + } + } +} + +sub dump_stats { + my $hashref = shift; + my %stats = %$hashref; + + # Dump per-process stats + my $process_pid; + my $max_strlen = 0; + + # Get the maximum process name + foreach $process_pid (keys %perprocesspid) { + my $len = length($process_pid); + if ($len > $max_strlen) { + $max_strlen = $len; + } + } + $max_strlen += 2; + + printf("\n"); + printf("%-" . $max_strlen . "s %8s %10s %8s %8s %8s %8s %8s %8s %8s %8s %8s %8s %8s\n", + "Process", "Pages", "Pages", "Pages", "Pages", "PCPU", "PCPU", "PCPU", "Fragment", "Fragment", "MigType", "Fragment", "Fragment", "Unknown"); + printf("%-" . $max_strlen . "s %8s %10s %8s %8s %8s %8s %8s %8s %8s %8s %8s %8s %8s\n", + "details", "allocd", "allocd", "freed", "freed", "pages", "drains", "refills", "Fallback", "Causing", "Changed", "Severe", "Moderate", ""); + + printf("%-" . $max_strlen . "s %8s %10s %8s %8s %8s %8s %8s %8s %8s %8s %8s %8s %8s\n", + "", "", "under lock", "direct", "pagevec", "drain", "", "", "", "", "", "", "", ""); + + foreach $process_pid (keys %stats) { + # Dump final aggregates + if ($stats{$process_pid}->{STATE_PCPU_PAGES_DRAINED}) { + $stats{$process_pid}->{HIGH_PCPU_DRAINS}++; + $stats{$process_pid}->{STATE_PCPU_PAGES_DRAINED} = 0; + } + if ($stats{$process_pid}->{STATE_PCPU_PAGES_REFILLED}) { + $stats{$process_pid}->{HIGH_PCPU_REFILLS}++; + $stats{$process_pid}->{STATE_PCPU_PAGES_REFILLED} = 0; + } + + printf("%-" . $max_strlen . "s %8d %10d %8d %8d %8d %8d %8d %8d %8d %8d %8d %8d %8d\n", + $process_pid, + $stats{$process_pid}->{MM_PAGE_ALLOC}, + $stats{$process_pid}->{MM_PAGE_ALLOC_ZONE_LOCKED}, + $stats{$process_pid}->{MM_PAGE_FREE}, + $stats{$process_pid}->{MM_PAGE_FREE_BATCHED}, + $stats{$process_pid}->{MM_PAGE_PCPU_DRAIN}, + $stats{$process_pid}->{HIGH_PCPU_DRAINS}, + $stats{$process_pid}->{HIGH_PCPU_REFILLS}, + $stats{$process_pid}->{MM_PAGE_ALLOC_EXTFRAG}, + $stats{$process_pid}->{HIGH_EXT_FRAG}, + $stats{$process_pid}->{HIGH_EXT_FRAGMENT_CHANGED}, + $stats{$process_pid}->{HIGH_EXT_FRAGMENT_SEVERE}, + $stats{$process_pid}->{HIGH_EXT_FRAGMENT_MODERATE}, + $stats{$process_pid}->{EVENT_UNKNOWN}); + } +} + +sub aggregate_perprocesspid() { + my $process_pid; + my $process; + undef %perprocess; + + foreach $process_pid (keys %perprocesspid) { + $process = $process_pid; + $process =~ s/-([0-9])*$//; + if ($process eq '') { + $process = "NO_PROCESS_NAME"; + } + + $perprocess{$process}->{MM_PAGE_ALLOC} += $perprocesspid{$process_pid}->{MM_PAGE_ALLOC}; + $perprocess{$process}->{MM_PAGE_ALLOC_ZONE_LOCKED} += $perprocesspid{$process_pid}->{MM_PAGE_ALLOC_ZONE_LOCKED}; + $perprocess{$process}->{MM_PAGE_FREE} += $perprocesspid{$process_pid}->{MM_PAGE_FREE}; + $perprocess{$process}->{MM_PAGE_FREE_BATCHED} += $perprocesspid{$process_pid}->{MM_PAGE_FREE_BATCHED}; + $perprocess{$process}->{MM_PAGE_PCPU_DRAIN} += $perprocesspid{$process_pid}->{MM_PAGE_PCPU_DRAIN}; + $perprocess{$process}->{HIGH_PCPU_DRAINS} += $perprocesspid{$process_pid}->{HIGH_PCPU_DRAINS}; + $perprocess{$process}->{HIGH_PCPU_REFILLS} += $perprocesspid{$process_pid}->{HIGH_PCPU_REFILLS}; + $perprocess{$process}->{MM_PAGE_ALLOC_EXTFRAG} += $perprocesspid{$process_pid}->{MM_PAGE_ALLOC_EXTFRAG}; + $perprocess{$process}->{HIGH_EXT_FRAG} += $perprocesspid{$process_pid}->{HIGH_EXT_FRAG}; + $perprocess{$process}->{HIGH_EXT_FRAGMENT_CHANGED} += $perprocesspid{$process_pid}->{HIGH_EXT_FRAGMENT_CHANGED}; + $perprocess{$process}->{HIGH_EXT_FRAGMENT_SEVERE} += $perprocesspid{$process_pid}->{HIGH_EXT_FRAGMENT_SEVERE}; + $perprocess{$process}->{HIGH_EXT_FRAGMENT_MODERATE} += $perprocesspid{$process_pid}->{HIGH_EXT_FRAGMENT_MODERATE}; + $perprocess{$process}->{EVENT_UNKNOWN} += $perprocesspid{$process_pid}->{EVENT_UNKNOWN}; + } +} + +sub report() { + if (!$opt_ignorepid) { + dump_stats(\%perprocesspid); + } else { + aggregate_perprocesspid(); + dump_stats(\%perprocess); + } +} + +# Process events or signals until neither is available +sub signal_loop() { + my $sigint_processed; + do { + $sigint_processed = 0; + process_events(); + + # Handle pending signals if any + if ($sigint_pending) { + my $current_time = time; + + if ($sigint_exit) { + print "Received exit signal\n"; + $sigint_pending = 0; + } + if ($sigint_report) { + if ($current_time >= $sigint_received + 2) { + report(); + $sigint_report = 0; + $sigint_pending = 0; + $sigint_processed = 1; + } + } + } + } while ($sigint_pending || $sigint_processed); +} + +signal_loop(); +report(); diff --git a/Documentation/trace/postprocess/trace-vmscan-postprocess.pl b/Documentation/trace/postprocess/trace-vmscan-postprocess.pl new file mode 100644 index 00000000000..78c9a7b2b58 --- /dev/null +++ b/Documentation/trace/postprocess/trace-vmscan-postprocess.pl @@ -0,0 +1,704 @@ +#!/usr/bin/perl +# This is a POC for reading the text representation of trace output related to +# page reclaim. It makes an attempt to extract some high-level information on +# what is going on. The accuracy of the parser may vary +# +# Example usage: trace-vmscan-postprocess.pl < /sys/kernel/debug/tracing/trace_pipe +# other options +# --read-procstat If the trace lacks process info, get it from /proc +# --ignore-pid Aggregate processes of the same name together +# +# Copyright (c) IBM Corporation 2009 +# Author: Mel Gorman <mel@csn.ul.ie> +use strict; +use Getopt::Long; + +# Tracepoint events +use constant MM_VMSCAN_DIRECT_RECLAIM_BEGIN => 1; +use constant MM_VMSCAN_DIRECT_RECLAIM_END => 2; +use constant MM_VMSCAN_KSWAPD_WAKE => 3; +use constant MM_VMSCAN_KSWAPD_SLEEP => 4; +use constant MM_VMSCAN_LRU_SHRINK_ACTIVE => 5; +use constant MM_VMSCAN_LRU_SHRINK_INACTIVE => 6; +use constant MM_VMSCAN_LRU_ISOLATE => 7; +use constant MM_VMSCAN_WRITEPAGE_FILE_SYNC => 8; +use constant MM_VMSCAN_WRITEPAGE_ANON_SYNC => 9; +use constant MM_VMSCAN_WRITEPAGE_FILE_ASYNC => 10; +use constant MM_VMSCAN_WRITEPAGE_ANON_ASYNC => 11; +use constant MM_VMSCAN_WRITEPAGE_ASYNC => 12; +use constant EVENT_UNKNOWN => 13; + +# Per-order events +use constant MM_VMSCAN_DIRECT_RECLAIM_BEGIN_PERORDER => 11; +use constant MM_VMSCAN_WAKEUP_KSWAPD_PERORDER => 12; +use constant MM_VMSCAN_KSWAPD_WAKE_PERORDER => 13; +use constant HIGH_KSWAPD_REWAKEUP_PERORDER => 14; + +# Constants used to track state +use constant STATE_DIRECT_BEGIN => 15; +use constant STATE_DIRECT_ORDER => 16; +use constant STATE_KSWAPD_BEGIN => 17; +use constant STATE_KSWAPD_ORDER => 18; + +# High-level events extrapolated from tracepoints +use constant HIGH_DIRECT_RECLAIM_LATENCY => 19; +use constant HIGH_KSWAPD_LATENCY => 20; +use constant HIGH_KSWAPD_REWAKEUP => 21; +use constant HIGH_NR_SCANNED => 22; +use constant HIGH_NR_TAKEN => 23; +use constant HIGH_NR_RECLAIMED => 24; + +my %perprocesspid; +my %perprocess; +my %last_procmap; +my $opt_ignorepid; +my $opt_read_procstat; + +my $total_wakeup_kswapd; +my ($total_direct_reclaim, $total_direct_nr_scanned); +my ($total_direct_latency, $total_kswapd_latency); +my ($total_direct_nr_reclaimed); +my ($total_direct_writepage_file_sync, $total_direct_writepage_file_async); +my ($total_direct_writepage_anon_sync, $total_direct_writepage_anon_async); +my ($total_kswapd_nr_scanned, $total_kswapd_wake); +my ($total_kswapd_writepage_file_sync, $total_kswapd_writepage_file_async); +my ($total_kswapd_writepage_anon_sync, $total_kswapd_writepage_anon_async); +my ($total_kswapd_nr_reclaimed); + +# Catch sigint and exit on request +my $sigint_report = 0; +my $sigint_exit = 0; +my $sigint_pending = 0; +my $sigint_received = 0; +sub sigint_handler { + my $current_time = time; + if ($current_time - 2 > $sigint_received) { + print "SIGINT received, report pending. Hit ctrl-c again to exit\n"; + $sigint_report = 1; + } else { + if (!$sigint_exit) { + print "Second SIGINT received quickly, exiting\n"; + } + $sigint_exit++; + } + + if ($sigint_exit > 3) { + print "Many SIGINTs received, exiting now without report\n"; + exit; + } + + $sigint_received = $current_time; + $sigint_pending = 1; +} +$SIG{INT} = "sigint_handler"; + +# Parse command line options +GetOptions( + 'ignore-pid' => \$opt_ignorepid, + 'read-procstat' => \$opt_read_procstat, +); + +# Defaults for dynamically discovered regex's +my $regex_direct_begin_default = 'order=([0-9]*) may_writepage=([0-9]*) gfp_flags=([A-Z_|]*)'; +my $regex_direct_end_default = 'nr_reclaimed=([0-9]*)'; +my $regex_kswapd_wake_default = 'nid=([0-9]*) order=([0-9]*)'; +my $regex_kswapd_sleep_default = 'nid=([0-9]*)'; +my $regex_wakeup_kswapd_default = 'nid=([0-9]*) zid=([0-9]*) order=([0-9]*)'; +my $regex_lru_isolate_default = 'isolate_mode=([0-9]*) order=([0-9]*) nr_requested=([0-9]*) nr_scanned=([0-9]*) nr_taken=([0-9]*) file=([0-9]*)'; +my $regex_lru_shrink_inactive_default = 'nid=([0-9]*) zid=([0-9]*) nr_scanned=([0-9]*) nr_reclaimed=([0-9]*) priority=([0-9]*) flags=([A-Z_|]*)'; +my $regex_lru_shrink_active_default = 'lru=([A-Z_]*) nr_scanned=([0-9]*) nr_rotated=([0-9]*) priority=([0-9]*)'; +my $regex_writepage_default = 'page=([0-9a-f]*) pfn=([0-9]*) flags=([A-Z_|]*)'; + +# Dyanically discovered regex +my $regex_direct_begin; +my $regex_direct_end; +my $regex_kswapd_wake; +my $regex_kswapd_sleep; +my $regex_wakeup_kswapd; +my $regex_lru_isolate; +my $regex_lru_shrink_inactive; +my $regex_lru_shrink_active; +my $regex_writepage; + +# Static regex used. Specified like this for readability and for use with /o +# (process_pid) (cpus ) ( time ) (tpoint ) (details) +my $regex_traceevent = '\s*([a-zA-Z0-9-]*)\s*(\[[0-9]*\])(\s*[dX.][Nnp.][Hhs.][0-9a-fA-F.]*|)\s*([0-9.]*):\s*([a-zA-Z_]*):\s*(.*)'; +my $regex_statname = '[-0-9]*\s\((.*)\).*'; +my $regex_statppid = '[-0-9]*\s\(.*\)\s[A-Za-z]\s([0-9]*).*'; + +sub generate_traceevent_regex { + my $event = shift; + my $default = shift; + my $regex; + + # Read the event format or use the default + if (!open (FORMAT, "/sys/kernel/debug/tracing/events/$event/format")) { + print("WARNING: Event $event format string not found\n"); + return $default; + } else { + my $line; + while (!eof(FORMAT)) { + $line = <FORMAT>; + $line =~ s/, REC->.*//; + if ($line =~ /^print fmt:\s"(.*)".*/) { + $regex = $1; + $regex =~ s/%s/\([0-9a-zA-Z|_]*\)/g; + $regex =~ s/%p/\([0-9a-f]*\)/g; + $regex =~ s/%d/\([-0-9]*\)/g; + $regex =~ s/%ld/\([-0-9]*\)/g; + $regex =~ s/%lu/\([0-9]*\)/g; + } + } + } + + # Can't handle the print_flags stuff but in the context of this + # script, it really doesn't matter + $regex =~ s/\(REC.*\) \? __print_flags.*//; + + # Verify fields are in the right order + my $tuple; + foreach $tuple (split /\s/, $regex) { + my ($key, $value) = split(/=/, $tuple); + my $expected = shift; + if ($key ne $expected) { + print("WARNING: Format not as expected for event $event '$key' != '$expected'\n"); + $regex =~ s/$key=\((.*)\)/$key=$1/; + } + } + + if (defined shift) { + die("Fewer fields than expected in format"); + } + + return $regex; +} + +$regex_direct_begin = generate_traceevent_regex( + "vmscan/mm_vmscan_direct_reclaim_begin", + $regex_direct_begin_default, + "order", "may_writepage", + "gfp_flags"); +$regex_direct_end = generate_traceevent_regex( + "vmscan/mm_vmscan_direct_reclaim_end", + $regex_direct_end_default, + "nr_reclaimed"); +$regex_kswapd_wake = generate_traceevent_regex( + "vmscan/mm_vmscan_kswapd_wake", + $regex_kswapd_wake_default, + "nid", "order"); +$regex_kswapd_sleep = generate_traceevent_regex( + "vmscan/mm_vmscan_kswapd_sleep", + $regex_kswapd_sleep_default, + "nid"); +$regex_wakeup_kswapd = generate_traceevent_regex( + "vmscan/mm_vmscan_wakeup_kswapd", + $regex_wakeup_kswapd_default, + "nid", "zid", "order"); +$regex_lru_isolate = generate_traceevent_regex( + "vmscan/mm_vmscan_lru_isolate", + $regex_lru_isolate_default, + "isolate_mode", "order", + "nr_requested", "nr_scanned", "nr_taken", + "file"); +$regex_lru_shrink_inactive = generate_traceevent_regex( + "vmscan/mm_vmscan_lru_shrink_inactive", + $regex_lru_shrink_inactive_default, + "nid", "zid", + "nr_scanned", "nr_reclaimed", "priority", + "flags"); +$regex_lru_shrink_active = generate_traceevent_regex( + "vmscan/mm_vmscan_lru_shrink_active", + $regex_lru_shrink_active_default, + "nid", "zid", + "lru", + "nr_scanned", "nr_rotated", "priority"); +$regex_writepage = generate_traceevent_regex( + "vmscan/mm_vmscan_writepage", + $regex_writepage_default, + "page", "pfn", "flags"); + +sub read_statline($) { + my $pid = $_[0]; + my $statline; + + if (open(STAT, "/proc/$pid/stat")) { + $statline = <STAT>; + close(STAT); + } + + if ($statline eq '') { + $statline = "-1 (UNKNOWN_PROCESS_NAME) R 0"; + } + + return $statline; +} + +sub guess_process_pid($$) { + my $pid = $_[0]; + my $statline = $_[1]; + + if ($pid == 0) { + return "swapper-0"; + } + + if ($statline !~ /$regex_statname/o) { + die("Failed to math stat line for process name :: $statline"); + } + return "$1-$pid"; +} + +# Convert sec.usec timestamp format +sub timestamp_to_ms($) { + my $timestamp = $_[0]; + + my ($sec, $usec) = split (/\./, $timestamp); + return ($sec * 1000) + ($usec / 1000); +} + +sub process_events { + my $traceevent; + my $process_pid; + my $cpus; + my $timestamp; + my $tracepoint; + my $details; + my $statline; + + # Read each line of the event log +EVENT_PROCESS: + while ($traceevent = <STDIN>) { + if ($traceevent =~ /$regex_traceevent/o) { + $process_pid = $1; + $timestamp = $4; + $tracepoint = $5; + + $process_pid =~ /(.*)-([0-9]*)$/; + my $process = $1; + my $pid = $2; + + if ($process eq "") { + $process = $last_procmap{$pid}; + $process_pid = "$process-$pid"; + } + $last_procmap{$pid} = $process; + + if ($opt_read_procstat) { + $statline = read_statline($pid); + if ($opt_read_procstat && $process eq '') { + $process_pid = guess_process_pid($pid, $statline); + } + } + } else { + next; + } + + # Perl Switch() sucks majorly + if ($tracepoint eq "mm_vmscan_direct_reclaim_begin") { + $timestamp = timestamp_to_ms($timestamp); + $perprocesspid{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN}++; + $perprocesspid{$process_pid}->{STATE_DIRECT_BEGIN} = $timestamp; + + $details = $6; + if ($details !~ /$regex_direct_begin/o) { + print "WARNING: Failed to parse mm_vmscan_direct_reclaim_begin as expected\n"; + print " $details\n"; + print " $regex_direct_begin\n"; + next; + } + my $order = $1; + $perprocesspid{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN_PERORDER}[$order]++; + $perprocesspid{$process_pid}->{STATE_DIRECT_ORDER} = $order; + } elsif ($tracepoint eq "mm_vmscan_direct_reclaim_end") { + # Count the event itself + my $index = $perprocesspid{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_END}; + $perprocesspid{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_END}++; + + # Record how long direct reclaim took this time + if (defined $perprocesspid{$process_pid}->{STATE_DIRECT_BEGIN}) { + $timestamp = timestamp_to_ms($timestamp); + my $order = $perprocesspid{$process_pid}->{STATE_DIRECT_ORDER}; + my $latency = ($timestamp - $perprocesspid{$process_pid}->{STATE_DIRECT_BEGIN}); + $perprocesspid{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index] = "$order-$latency"; + } + } elsif ($tracepoint eq "mm_vmscan_kswapd_wake") { + $details = $6; + if ($details !~ /$regex_kswapd_wake/o) { + print "WARNING: Failed to parse mm_vmscan_kswapd_wake as expected\n"; + print " $details\n"; + print " $regex_kswapd_wake\n"; + next; + } + + my $order = $2; + $perprocesspid{$process_pid}->{STATE_KSWAPD_ORDER} = $order; + if (!$perprocesspid{$process_pid}->{STATE_KSWAPD_BEGIN}) { + $timestamp = timestamp_to_ms($timestamp); + $perprocesspid{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE}++; + $perprocesspid{$process_pid}->{STATE_KSWAPD_BEGIN} = $timestamp; + $perprocesspid{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE_PERORDER}[$order]++; + } else { + $perprocesspid{$process_pid}->{HIGH_KSWAPD_REWAKEUP}++; + $perprocesspid{$process_pid}->{HIGH_KSWAPD_REWAKEUP_PERORDER}[$order]++; + } + } elsif ($tracepoint eq "mm_vmscan_kswapd_sleep") { + + # Count the event itself + my $index = $perprocesspid{$process_pid}->{MM_VMSCAN_KSWAPD_SLEEP}; + $perprocesspid{$process_pid}->{MM_VMSCAN_KSWAPD_SLEEP}++; + + # Record how long kswapd was awake + $timestamp = timestamp_to_ms($timestamp); + my $order = $perprocesspid{$process_pid}->{STATE_KSWAPD_ORDER}; + my $latency = ($timestamp - $perprocesspid{$process_pid}->{STATE_KSWAPD_BEGIN}); + $perprocesspid{$process_pid}->{HIGH_KSWAPD_LATENCY}[$index] = "$order-$latency"; + $perprocesspid{$process_pid}->{STATE_KSWAPD_BEGIN} = 0; + } elsif ($tracepoint eq "mm_vmscan_wakeup_kswapd") { + $perprocesspid{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD}++; + + $details = $6; + if ($details !~ /$regex_wakeup_kswapd/o) { + print "WARNING: Failed to parse mm_vmscan_wakeup_kswapd as expected\n"; + print " $details\n"; + print " $regex_wakeup_kswapd\n"; + next; + } + my $order = $3; + $perprocesspid{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD_PERORDER}[$order]++; + } elsif ($tracepoint eq "mm_vmscan_lru_isolate") { + $details = $6; + if ($details !~ /$regex_lru_isolate/o) { + print "WARNING: Failed to parse mm_vmscan_lru_isolate as expected\n"; + print " $details\n"; + print " $regex_lru_isolate/o\n"; + next; + } + my $isolate_mode = $1; + my $nr_scanned = $4; + + # To closer match vmstat scanning statistics, only count isolate_both + # and isolate_inactive as scanning. isolate_active is rotation + # isolate_inactive == 1 + # isolate_active == 2 + # isolate_both == 3 + if ($isolate_mode != 2) { + $perprocesspid{$process_pid}->{HIGH_NR_SCANNED} += $nr_scanned; + } + } elsif ($tracepoint eq "mm_vmscan_lru_shrink_inactive") { + $details = $6; + if ($details !~ /$regex_lru_shrink_inactive/o) { + print "WARNING: Failed to parse mm_vmscan_lru_shrink_inactive as expected\n"; + print " $details\n"; + print " $regex_lru_shrink_inactive/o\n"; + next; + } + my $nr_reclaimed = $4; + $perprocesspid{$process_pid}->{HIGH_NR_RECLAIMED} += $nr_reclaimed; + } elsif ($tracepoint eq "mm_vmscan_writepage") { + $details = $6; + if ($details !~ /$regex_writepage/o) { + print "WARNING: Failed to parse mm_vmscan_writepage as expected\n"; + print " $details\n"; + print " $regex_writepage\n"; + next; + } + + my $flags = $3; + my $file = 0; + my $sync_io = 0; + if ($flags =~ /RECLAIM_WB_FILE/) { + $file = 1; + } + if ($flags =~ /RECLAIM_WB_SYNC/) { + $sync_io = 1; + } + if ($sync_io) { + if ($file) { + $perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_SYNC}++; + } else { + $perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_SYNC}++; + } + } else { + if ($file) { + $perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_ASYNC}++; + } else { + $perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_ASYNC}++; + } + } + } else { + $perprocesspid{$process_pid}->{EVENT_UNKNOWN}++; + } + + if ($sigint_pending) { + last EVENT_PROCESS; + } + } +} + +sub dump_stats { + my $hashref = shift; + my %stats = %$hashref; + + # Dump per-process stats + my $process_pid; + my $max_strlen = 0; + + # Get the maximum process name + foreach $process_pid (keys %perprocesspid) { + my $len = length($process_pid); + if ($len > $max_strlen) { + $max_strlen = $len; + } + } + $max_strlen += 2; + + # Work out latencies + printf("\n") if !$opt_ignorepid; + printf("Reclaim latencies expressed as order-latency_in_ms\n") if !$opt_ignorepid; + foreach $process_pid (keys %stats) { + + if (!$stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[0] && + !$stats{$process_pid}->{HIGH_KSWAPD_LATENCY}[0]) { + next; + } + + printf "%-" . $max_strlen . "s ", $process_pid if !$opt_ignorepid; + my $index = 0; + while (defined $stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index] || + defined $stats{$process_pid}->{HIGH_KSWAPD_LATENCY}[$index]) { + + if ($stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index]) { + printf("%s ", $stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index]) if !$opt_ignorepid; + my ($dummy, $latency) = split(/-/, $stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index]); + $total_direct_latency += $latency; + } else { + printf("%s ", $stats{$process_pid}->{HIGH_KSWAPD_LATENCY}[$index]) if !$opt_ignorepid; + my ($dummy, $latency) = split(/-/, $stats{$process_pid}->{HIGH_KSWAPD_LATENCY}[$index]); + $total_kswapd_latency += $latency; + } + $index++; + } + print "\n" if !$opt_ignorepid; + } + + # Print out process activity + printf("\n"); + printf("%-" . $max_strlen . "s %8s %10s %8s %8s %8s %8s %8s %8s\n", "Process", "Direct", "Wokeup", "Pages", "Pages", "Pages", "Pages", "Time"); + printf("%-" . $max_strlen . "s %8s %10s %8s %8s %8s %8s %8s %8s\n", "details", "Rclms", "Kswapd", "Scanned", "Rclmed", "Sync-IO", "ASync-IO", "Stalled"); + foreach $process_pid (keys %stats) { + + if (!$stats{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN}) { + next; + } + + $total_direct_reclaim += $stats{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN}; + $total_wakeup_kswapd += $stats{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD}; + $total_direct_nr_scanned += $stats{$process_pid}->{HIGH_NR_SCANNED}; + $total_direct_nr_reclaimed += $stats{$process_pid}->{HIGH_NR_RECLAIMED}; + $total_direct_writepage_file_sync += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_SYNC}; + $total_direct_writepage_anon_sync += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_SYNC}; + $total_direct_writepage_file_async += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_ASYNC}; + + $total_direct_writepage_anon_async += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_ASYNC}; + + my $index = 0; + my $this_reclaim_delay = 0; + while (defined $stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index]) { + my ($dummy, $latency) = split(/-/, $stats{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$index]); + $this_reclaim_delay += $latency; + $index++; + } + + printf("%-" . $max_strlen . "s %8d %10d %8u %8u %8u %8u %8.3f", + $process_pid, + $stats{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN}, + $stats{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD}, + $stats{$process_pid}->{HIGH_NR_SCANNED}, + $stats{$process_pid}->{HIGH_NR_RECLAIMED}, + $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_SYNC} + $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_SYNC}, + $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_ASYNC} + $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_ASYNC}, + $this_reclaim_delay / 1000); + + if ($stats{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN}) { + print " "; + for (my $order = 0; $order < 20; $order++) { + my $count = $stats{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN_PERORDER}[$order]; + if ($count != 0) { + print "direct-$order=$count "; + } + } + } + if ($stats{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD}) { + print " "; + for (my $order = 0; $order < 20; $order++) { + my $count = $stats{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD_PERORDER}[$order]; + if ($count != 0) { + print "wakeup-$order=$count "; + } + } + } + + print "\n"; + } + + # Print out kswapd activity + printf("\n"); + printf("%-" . $max_strlen . "s %8s %10s %8s %8s %8s %8s\n", "Kswapd", "Kswapd", "Order", "Pages", "Pages", "Pages", "Pages"); + printf("%-" . $max_strlen . "s %8s %10s %8s %8s %8s %8s\n", "Instance", "Wakeups", "Re-wakeup", "Scanned", "Rclmed", "Sync-IO", "ASync-IO"); + foreach $process_pid (keys %stats) { + + if (!$stats{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE}) { + next; + } + + $total_kswapd_wake += $stats{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE}; + $total_kswapd_nr_scanned += $stats{$process_pid}->{HIGH_NR_SCANNED}; + $total_kswapd_nr_reclaimed += $stats{$process_pid}->{HIGH_NR_RECLAIMED}; + $total_kswapd_writepage_file_sync += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_SYNC}; + $total_kswapd_writepage_anon_sync += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_SYNC}; + $total_kswapd_writepage_file_async += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_ASYNC}; + $total_kswapd_writepage_anon_async += $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_ASYNC}; + + printf("%-" . $max_strlen . "s %8d %10d %8u %8u %8i %8u", + $process_pid, + $stats{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE}, + $stats{$process_pid}->{HIGH_KSWAPD_REWAKEUP}, + $stats{$process_pid}->{HIGH_NR_SCANNED}, + $stats{$process_pid}->{HIGH_NR_RECLAIMED}, + $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_SYNC} + $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_SYNC}, + $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_ASYNC} + $stats{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_ASYNC}); + + if ($stats{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE}) { + print " "; + for (my $order = 0; $order < 20; $order++) { + my $count = $stats{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE_PERORDER}[$order]; + if ($count != 0) { + print "wake-$order=$count "; + } + } + } + if ($stats{$process_pid}->{HIGH_KSWAPD_REWAKEUP}) { + print " "; + for (my $order = 0; $order < 20; $order++) { + my $count = $stats{$process_pid}->{HIGH_KSWAPD_REWAKEUP_PERORDER}[$order]; + if ($count != 0) { + print "rewake-$order=$count "; + } + } + } + printf("\n"); + } + + # Print out summaries + $total_direct_latency /= 1000; + $total_kswapd_latency /= 1000; + print "\nSummary\n"; + print "Direct reclaims: $total_direct_reclaim\n"; + print "Direct reclaim pages scanned: $total_direct_nr_scanned\n"; + print "Direct reclaim pages reclaimed: $total_direct_nr_reclaimed\n"; + print "Direct reclaim write file sync I/O: $total_direct_writepage_file_sync\n"; + print "Direct reclaim write anon sync I/O: $total_direct_writepage_anon_sync\n"; + print "Direct reclaim write file async I/O: $total_direct_writepage_file_async\n"; + print "Direct reclaim write anon async I/O: $total_direct_writepage_anon_async\n"; + print "Wake kswapd requests: $total_wakeup_kswapd\n"; + printf "Time stalled direct reclaim: %-1.2f seconds\n", $total_direct_latency; + print "\n"; + print "Kswapd wakeups: $total_kswapd_wake\n"; + print "Kswapd pages scanned: $total_kswapd_nr_scanned\n"; + print "Kswapd pages reclaimed: $total_kswapd_nr_reclaimed\n"; + print "Kswapd reclaim write file sync I/O: $total_kswapd_writepage_file_sync\n"; + print "Kswapd reclaim write anon sync I/O: $total_kswapd_writepage_anon_sync\n"; + print "Kswapd reclaim write file async I/O: $total_kswapd_writepage_file_async\n"; + print "Kswapd reclaim write anon async I/O: $total_kswapd_writepage_anon_async\n"; + printf "Time kswapd awake: %-1.2f seconds\n", $total_kswapd_latency; +} + +sub aggregate_perprocesspid() { + my $process_pid; + my $process; + undef %perprocess; + + foreach $process_pid (keys %perprocesspid) { + $process = $process_pid; + $process =~ s/-([0-9])*$//; + if ($process eq '') { + $process = "NO_PROCESS_NAME"; + } + + $perprocess{$process}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN} += $perprocesspid{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN}; + $perprocess{$process}->{MM_VMSCAN_KSWAPD_WAKE} += $perprocesspid{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE}; + $perprocess{$process}->{MM_VMSCAN_WAKEUP_KSWAPD} += $perprocesspid{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD}; + $perprocess{$process}->{HIGH_KSWAPD_REWAKEUP} += $perprocesspid{$process_pid}->{HIGH_KSWAPD_REWAKEUP}; + $perprocess{$process}->{HIGH_NR_SCANNED} += $perprocesspid{$process_pid}->{HIGH_NR_SCANNED}; + $perprocess{$process}->{HIGH_NR_RECLAIMED} += $perprocesspid{$process_pid}->{HIGH_NR_RECLAIMED}; + $perprocess{$process}->{MM_VMSCAN_WRITEPAGE_FILE_SYNC} += $perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_SYNC}; + $perprocess{$process}->{MM_VMSCAN_WRITEPAGE_ANON_SYNC} += $perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_SYNC}; + $perprocess{$process}->{MM_VMSCAN_WRITEPAGE_FILE_ASYNC} += $perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_FILE_ASYNC}; + $perprocess{$process}->{MM_VMSCAN_WRITEPAGE_ANON_ASYNC} += $perprocesspid{$process_pid}->{MM_VMSCAN_WRITEPAGE_ANON_ASYNC}; + + for (my $order = 0; $order < 20; $order++) { + $perprocess{$process}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN_PERORDER}[$order] += $perprocesspid{$process_pid}->{MM_VMSCAN_DIRECT_RECLAIM_BEGIN_PERORDER}[$order]; + $perprocess{$process}->{MM_VMSCAN_WAKEUP_KSWAPD_PERORDER}[$order] += $perprocesspid{$process_pid}->{MM_VMSCAN_WAKEUP_KSWAPD_PERORDER}[$order]; + $perprocess{$process}->{MM_VMSCAN_KSWAPD_WAKE_PERORDER}[$order] += $perprocesspid{$process_pid}->{MM_VMSCAN_KSWAPD_WAKE_PERORDER}[$order]; + + } + + # Aggregate direct reclaim latencies + my $wr_index = $perprocess{$process}->{MM_VMSCAN_DIRECT_RECLAIM_END}; + my $rd_index = 0; + while (defined $perprocesspid{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$rd_index]) { + $perprocess{$process}->{HIGH_DIRECT_RECLAIM_LATENCY}[$wr_index] = $perprocesspid{$process_pid}->{HIGH_DIRECT_RECLAIM_LATENCY}[$rd_index]; + $rd_index++; + $wr_index++; + } + $perprocess{$process}->{MM_VMSCAN_DIRECT_RECLAIM_END} = $wr_index; + + # Aggregate kswapd latencies + my $wr_index = $perprocess{$process}->{MM_VMSCAN_KSWAPD_SLEEP}; + my $rd_index = 0; + while (defined $perprocesspid{$process_pid}->{HIGH_KSWAPD_LATENCY}[$rd_index]) { + $perprocess{$process}->{HIGH_KSWAPD_LATENCY}[$wr_index] = $perprocesspid{$process_pid}->{HIGH_KSWAPD_LATENCY}[$rd_index]; + $rd_index++; + $wr_index++; + } + $perprocess{$process}->{MM_VMSCAN_DIRECT_RECLAIM_END} = $wr_index; + } +} + +sub report() { + if (!$opt_ignorepid) { + dump_stats(\%perprocesspid); + } else { + aggregate_perprocesspid(); + dump_stats(\%perprocess); + } +} + +# Process events or signals until neither is available +sub signal_loop() { + my $sigint_processed; + do { + $sigint_processed = 0; + process_events(); + + # Handle pending signals if any + if ($sigint_pending) { + my $current_time = time; + + if ($sigint_exit) { + print "Received exit signal\n"; + $sigint_pending = 0; + } + if ($sigint_report) { + if ($current_time >= $sigint_received + 2) { + report(); + $sigint_report = 0; + $sigint_pending = 0; + $sigint_processed = 1; + } + } + } + } while ($sigint_pending || $sigint_processed); +} + +signal_loop(); +report(); diff --git a/Documentation/trace/power.txt b/Documentation/trace/power.txt deleted file mode 100644 index cd805e16dc2..00000000000 --- a/Documentation/trace/power.txt +++ /dev/null @@ -1,17 +0,0 @@ -The power tracer collects detailed information about C-state and P-state -transitions, instead of just looking at the high-level "average" -information. - -There is a helper script found in scrips/tracing/power.pl in the kernel -sources which can be used to parse this information and create a -Scalable Vector Graphics (SVG) picture from the trace data. - -To use this tracer: - - echo 0 > /sys/kernel/debug/tracing/tracing_enabled - echo power > /sys/kernel/debug/tracing/current_tracer - echo 1 > /sys/kernel/debug/tracing/tracing_enabled - sleep 1 - echo 0 > /sys/kernel/debug/tracing/tracing_enabled - cat /sys/kernel/debug/tracing/trace | \ - perl scripts/tracing/power.pl > out.sv diff --git a/Documentation/trace/ring-buffer-design.txt b/Documentation/trace/ring-buffer-design.txt new file mode 100644 index 00000000000..ff747b6fa39 --- /dev/null +++ b/Documentation/trace/ring-buffer-design.txt @@ -0,0 +1,955 @@ + Lockless Ring Buffer Design + =========================== + +Copyright 2009 Red Hat Inc. + Author: Steven Rostedt <srostedt@redhat.com> + License: The GNU Free Documentation License, Version 1.2 + (dual licensed under the GPL v2) +Reviewers: Mathieu Desnoyers, Huang Ying, Hidetoshi Seto, + and Frederic Weisbecker. + + +Written for: 2.6.31 + +Terminology used in this Document +--------------------------------- + +tail - where new writes happen in the ring buffer. + +head - where new reads happen in the ring buffer. + +producer - the task that writes into the ring buffer (same as writer) + +writer - same as producer + +consumer - the task that reads from the buffer (same as reader) + +reader - same as consumer. + +reader_page - A page outside the ring buffer used solely (for the most part) + by the reader. + +head_page - a pointer to the page that the reader will use next + +tail_page - a pointer to the page that will be written to next + +commit_page - a pointer to the page with the last finished non-nested write. + +cmpxchg - hardware-assisted atomic transaction that performs the following: + + A = B iff previous A == C + + R = cmpxchg(A, C, B) is saying that we replace A with B if and only if + current A is equal to C, and we put the old (current) A into R + + R gets the previous A regardless if A is updated with B or not. + + To see if the update was successful a compare of R == C may be used. + +The Generic Ring Buffer +----------------------- + +The ring buffer can be used in either an overwrite mode or in +producer/consumer mode. + +Producer/consumer mode is where if the producer were to fill up the +buffer before the consumer could free up anything, the producer +will stop writing to the buffer. This will lose most recent events. + +Overwrite mode is where if the producer were to fill up the buffer +before the consumer could free up anything, the producer will +overwrite the older data. This will lose the oldest events. + +No two writers can write at the same time (on the same per-cpu buffer), +but a writer may interrupt another writer, but it must finish writing +before the previous writer may continue. This is very important to the +algorithm. The writers act like a "stack". The way interrupts works +enforces this behavior. + + + writer1 start + <preempted> writer2 start + <preempted> writer3 start + writer3 finishes + writer2 finishes + writer1 finishes + +This is very much like a writer being preempted by an interrupt and +the interrupt doing a write as well. + +Readers can happen at any time. But no two readers may run at the +same time, nor can a reader preempt/interrupt another reader. A reader +cannot preempt/interrupt a writer, but it may read/consume from the +buffer at the same time as a writer is writing, but the reader must be +on another processor to do so. A reader may read on its own processor +and can be preempted by a writer. + +A writer can preempt a reader, but a reader cannot preempt a writer. +But a reader can read the buffer at the same time (on another processor) +as a writer. + +The ring buffer is made up of a list of pages held together by a linked list. + +At initialization a reader page is allocated for the reader that is not +part of the ring buffer. + +The head_page, tail_page and commit_page are all initialized to point +to the same page. + +The reader page is initialized to have its next pointer pointing to +the head page, and its previous pointer pointing to a page before +the head page. + +The reader has its own page to use. At start up time, this page is +allocated but is not attached to the list. When the reader wants +to read from the buffer, if its page is empty (like it is on start-up), +it will swap its page with the head_page. The old reader page will +become part of the ring buffer and the head_page will be removed. +The page after the inserted page (old reader_page) will become the +new head page. + +Once the new page is given to the reader, the reader could do what +it wants with it, as long as a writer has left that page. + +A sample of how the reader page is swapped: Note this does not +show the head page in the buffer, it is for demonstrating a swap +only. + + +------+ + |reader| RING BUFFER + |page | + +------+ + +---+ +---+ +---+ + | |-->| |-->| | + | |<--| |<--| | + +---+ +---+ +---+ + ^ | ^ | + | +-------------+ | + +-----------------+ + + + +------+ + |reader| RING BUFFER + |page |-------------------+ + +------+ v + | +---+ +---+ +---+ + | | |-->| |-->| | + | | |<--| |<--| |<-+ + | +---+ +---+ +---+ | + | ^ | ^ | | + | | +-------------+ | | + | +-----------------+ | + +------------------------------------+ + + +------+ + |reader| RING BUFFER + |page |-------------------+ + +------+ <---------------+ v + | ^ +---+ +---+ +---+ + | | | |-->| |-->| | + | | | | | |<--| |<-+ + | | +---+ +---+ +---+ | + | | | ^ | | + | | +-------------+ | | + | +-----------------------------+ | + +------------------------------------+ + + +------+ + |buffer| RING BUFFER + |page |-------------------+ + +------+ <---------------+ v + | ^ +---+ +---+ +---+ + | | | | | |-->| | + | | New | | | |<--| |<-+ + | | Reader +---+ +---+ +---+ | + | | page ----^ | | + | | | | + | +-----------------------------+ | + +------------------------------------+ + + + +It is possible that the page swapped is the commit page and the tail page, +if what is in the ring buffer is less than what is held in a buffer page. + + + reader page commit page tail page + | | | + v | | + +---+ | | + | |<----------+ | + | |<------------------------+ + | |------+ + +---+ | + | + v + +---+ +---+ +---+ +---+ +<---| |--->| |--->| |--->| |---> +--->| |<---| |<---| |<---| |<--- + +---+ +---+ +---+ +---+ + +This case is still valid for this algorithm. +When the writer leaves the page, it simply goes into the ring buffer +since the reader page still points to the next location in the ring +buffer. + + +The main pointers: + + reader page - The page used solely by the reader and is not part + of the ring buffer (may be swapped in) + + head page - the next page in the ring buffer that will be swapped + with the reader page. + + tail page - the page where the next write will take place. + + commit page - the page that last finished a write. + +The commit page only is updated by the outermost writer in the +writer stack. A writer that preempts another writer will not move the +commit page. + +When data is written into the ring buffer, a position is reserved +in the ring buffer and passed back to the writer. When the writer +is finished writing data into that position, it commits the write. + +Another write (or a read) may take place at anytime during this +transaction. If another write happens it must finish before continuing +with the previous write. + + + Write reserve: + + Buffer page + +---------+ + |written | + +---------+ <--- given back to writer (current commit) + |reserved | + +---------+ <--- tail pointer + | empty | + +---------+ + + Write commit: + + Buffer page + +---------+ + |written | + +---------+ + |written | + +---------+ <--- next position for write (current commit) + | empty | + +---------+ + + + If a write happens after the first reserve: + + Buffer page + +---------+ + |written | + +---------+ <-- current commit + |reserved | + +---------+ <--- given back to second writer + |reserved | + +---------+ <--- tail pointer + + After second writer commits: + + + Buffer page + +---------+ + |written | + +---------+ <--(last full commit) + |reserved | + +---------+ + |pending | + |commit | + +---------+ <--- tail pointer + + When the first writer commits: + + Buffer page + +---------+ + |written | + +---------+ + |written | + +---------+ + |written | + +---------+ <--(last full commit and tail pointer) + + +The commit pointer points to the last write location that was +committed without preempting another write. When a write that +preempted another write is committed, it only becomes a pending commit +and will not be a full commit until all writes have been committed. + +The commit page points to the page that has the last full commit. +The tail page points to the page with the last write (before +committing). + +The tail page is always equal to or after the commit page. It may +be several pages ahead. If the tail page catches up to the commit +page then no more writes may take place (regardless of the mode +of the ring buffer: overwrite and produce/consumer). + +The order of pages is: + + head page + commit page + tail page + +Possible scenario: + tail page + head page commit page | + | | | + v v v + +---+ +---+ +---+ +---+ +<---| |--->| |--->| |--->| |---> +--->| |<---| |<---| |<---| |<--- + +---+ +---+ +---+ +---+ + +There is a special case that the head page is after either the commit page +and possibly the tail page. That is when the commit (and tail) page has been +swapped with the reader page. This is because the head page is always +part of the ring buffer, but the reader page is not. Whenever there +has been less than a full page that has been committed inside the ring buffer, +and a reader swaps out a page, it will be swapping out the commit page. + + + reader page commit page tail page + | | | + v | | + +---+ | | + | |<----------+ | + | |<------------------------+ + | |------+ + +---+ | + | + v + +---+ +---+ +---+ +---+ +<---| |--->| |--->| |--->| |---> +--->| |<---| |<---| |<---| |<--- + +---+ +---+ +---+ +---+ + ^ + | + head page + + +In this case, the head page will not move when the tail and commit +move back into the ring buffer. + +The reader cannot swap a page into the ring buffer if the commit page +is still on that page. If the read meets the last commit (real commit +not pending or reserved), then there is nothing more to read. +The buffer is considered empty until another full commit finishes. + +When the tail meets the head page, if the buffer is in overwrite mode, +the head page will be pushed ahead one. If the buffer is in producer/consumer +mode, the write will fail. + +Overwrite mode: + + tail page + | + v + +---+ +---+ +---+ +---+ +<---| |--->| |--->| |--->| |---> +--->| |<---| |<---| |<---| |<--- + +---+ +---+ +---+ +---+ + ^ + | + head page + + + tail page + | + v + +---+ +---+ +---+ +---+ +<---| |--->| |--->| |--->| |---> +--->| |<---| |<---| |<---| |<--- + +---+ +---+ +---+ +---+ + ^ + | + head page + + + tail page + | + v + +---+ +---+ +---+ +---+ +<---| |--->| |--->| |--->| |---> +--->| |<---| |<---| |<---| |<--- + +---+ +---+ +---+ +---+ + ^ + | + head page + +Note, the reader page will still point to the previous head page. +But when a swap takes place, it will use the most recent head page. + + +Making the Ring Buffer Lockless: +-------------------------------- + +The main idea behind the lockless algorithm is to combine the moving +of the head_page pointer with the swapping of pages with the reader. +State flags are placed inside the pointer to the page. To do this, +each page must be aligned in memory by 4 bytes. This will allow the 2 +least significant bits of the address to be used as flags, since +they will always be zero for the address. To get the address, +simply mask out the flags. + + MASK = ~3 + + address & MASK + +Two flags will be kept by these two bits: + + HEADER - the page being pointed to is a head page + + UPDATE - the page being pointed to is being updated by a writer + and was or is about to be a head page. + + + reader page + | + v + +---+ + | |------+ + +---+ | + | + v + +---+ +---+ +---+ +---+ +<---| |--->| |-H->| |--->| |---> +--->| |<---| |<---| |<---| |<--- + +---+ +---+ +---+ +---+ + + +The above pointer "-H->" would have the HEADER flag set. That is +the next page is the next page to be swapped out by the reader. +This pointer means the next page is the head page. + +When the tail page meets the head pointer, it will use cmpxchg to +change the pointer to the UPDATE state: + + + tail page + | + v + +---+ +---+ +---+ +---+ +<---| |--->| |-H->| |--->| |---> +--->| |<---| |<---| |<---| |<--- + +---+ +---+ +---+ +---+ + + tail page + | + v + +---+ +---+ +---+ +---+ +<---| |--->| |-U->| |--->| |---> +--->| |<---| |<---| |<---| |<--- + +---+ +---+ +---+ +---+ + +"-U->" represents a pointer in the UPDATE state. + +Any access to the reader will need to take some sort of lock to serialize +the readers. But the writers will never take a lock to write to the +ring buffer. This means we only need to worry about a single reader, +and writes only preempt in "stack" formation. + +When the reader tries to swap the page with the ring buffer, it +will also use cmpxchg. If the flag bit in the pointer to the +head page does not have the HEADER flag set, the compare will fail +and the reader will need to look for the new head page and try again. +Note, the flags UPDATE and HEADER are never set at the same time. + +The reader swaps the reader page as follows: + + +------+ + |reader| RING BUFFER + |page | + +------+ + +---+ +---+ +---+ + | |--->| |--->| | + | |<---| |<---| | + +---+ +---+ +---+ + ^ | ^ | + | +---------------+ | + +-----H-------------+ + +The reader sets the reader page next pointer as HEADER to the page after +the head page. + + + +------+ + |reader| RING BUFFER + |page |-------H-----------+ + +------+ v + | +---+ +---+ +---+ + | | |--->| |--->| | + | | |<---| |<---| |<-+ + | +---+ +---+ +---+ | + | ^ | ^ | | + | | +---------------+ | | + | +-----H-------------+ | + +--------------------------------------+ + +It does a cmpxchg with the pointer to the previous head page to make it +point to the reader page. Note that the new pointer does not have the HEADER +flag set. This action atomically moves the head page forward. + + +------+ + |reader| RING BUFFER + |page |-------H-----------+ + +------+ v + | ^ +---+ +---+ +---+ + | | | |-->| |-->| | + | | | |<--| |<--| |<-+ + | | +---+ +---+ +---+ | + | | | ^ | | + | | +-------------+ | | + | +-----------------------------+ | + +------------------------------------+ + +After the new head page is set, the previous pointer of the head page is +updated to the reader page. + + +------+ + |reader| RING BUFFER + |page |-------H-----------+ + +------+ <---------------+ v + | ^ +---+ +---+ +---+ + | | | |-->| |-->| | + | | | | | |<--| |<-+ + | | +---+ +---+ +---+ | + | | | ^ | | + | | +-------------+ | | + | +-----------------------------+ | + +------------------------------------+ + + +------+ + |buffer| RING BUFFER + |page |-------H-----------+ <--- New head page + +------+ <---------------+ v + | ^ +---+ +---+ +---+ + | | | | | |-->| | + | | New | | | |<--| |<-+ + | | Reader +---+ +---+ +---+ | + | | page ----^ | | + | | | | + | +-----------------------------+ | + +------------------------------------+ + +Another important point: The page that the reader page points back to +by its previous pointer (the one that now points to the new head page) +never points back to the reader page. That is because the reader page is +not part of the ring buffer. Traversing the ring buffer via the next pointers +will always stay in the ring buffer. Traversing the ring buffer via the +prev pointers may not. + +Note, the way to determine a reader page is simply by examining the previous +pointer of the page. If the next pointer of the previous page does not +point back to the original page, then the original page is a reader page: + + + +--------+ + | reader | next +----+ + | page |-------->| |<====== (buffer page) + +--------+ +----+ + | | ^ + | v | next + prev | +----+ + +------------->| | + +----+ + +The way the head page moves forward: + +When the tail page meets the head page and the buffer is in overwrite mode +and more writes take place, the head page must be moved forward before the +writer may move the tail page. The way this is done is that the writer +performs a cmpxchg to convert the pointer to the head page from the HEADER +flag to have the UPDATE flag set. Once this is done, the reader will +not be able to swap the head page from the buffer, nor will it be able to +move the head page, until the writer is finished with the move. + +This eliminates any races that the reader can have on the writer. The reader +must spin, and this is why the reader cannot preempt the writer. + + tail page + | + v + +---+ +---+ +---+ +---+ +<---| |--->| |-H->| |--->| |---> +--->| |<---| |<---| |<---| |<--- + +---+ +---+ +---+ +---+ + + tail page + | + v + +---+ +---+ +---+ +---+ +<---| |--->| |-U->| |--->| |---> +--->| |<---| |<---| |<---| |<--- + +---+ +---+ +---+ +---+ + +The following page will be made into the new head page. + + tail page + | + v + +---+ +---+ +---+ +---+ +<---| |--->| |-U->| |-H->| |---> +--->| |<---| |<---| |<---| |<--- + +---+ +---+ +---+ +---+ + +After the new head page has been set, we can set the old head page +pointer back to NORMAL. + + tail page + | + v + +---+ +---+ +---+ +---+ +<---| |--->| |--->| |-H->| |---> +--->| |<---| |<---| |<---| |<--- + +---+ +---+ +---+ +---+ + +After the head page has been moved, the tail page may now move forward. + + tail page + | + v + +---+ +---+ +---+ +---+ +<---| |--->| |--->| |-H->| |---> +--->| |<---| |<---| |<---| |<--- + +---+ +---+ +---+ +---+ + + +The above are the trivial updates. Now for the more complex scenarios. + + +As stated before, if enough writes preempt the first write, the +tail page may make it all the way around the buffer and meet the commit +page. At this time, we must start dropping writes (usually with some kind +of warning to the user). But what happens if the commit was still on the +reader page? The commit page is not part of the ring buffer. The tail page +must account for this. + + + reader page commit page + | | + v | + +---+ | + | |<----------+ + | | + | |------+ + +---+ | + | + v + +---+ +---+ +---+ +---+ +<---| |--->| |-H->| |--->| |---> +--->| |<---| |<---| |<---| |<--- + +---+ +---+ +---+ +---+ + ^ + | + tail page + +If the tail page were to simply push the head page forward, the commit when +leaving the reader page would not be pointing to the correct page. + +The solution to this is to test if the commit page is on the reader page +before pushing the head page. If it is, then it can be assumed that the +tail page wrapped the buffer, and we must drop new writes. + +This is not a race condition, because the commit page can only be moved +by the outermost writer (the writer that was preempted). +This means that the commit will not move while a writer is moving the +tail page. The reader cannot swap the reader page if it is also being +used as the commit page. The reader can simply check that the commit +is off the reader page. Once the commit page leaves the reader page +it will never go back on it unless a reader does another swap with the +buffer page that is also the commit page. + + +Nested writes +------------- + +In the pushing forward of the tail page we must first push forward +the head page if the head page is the next page. If the head page +is not the next page, the tail page is simply updated with a cmpxchg. + +Only writers move the tail page. This must be done atomically to protect +against nested writers. + + temp_page = tail_page + next_page = temp_page->next + cmpxchg(tail_page, temp_page, next_page) + +The above will update the tail page if it is still pointing to the expected +page. If this fails, a nested write pushed it forward, the current write +does not need to push it. + + + temp page + | + v + tail page + | + v + +---+ +---+ +---+ +---+ +<---| |--->| |--->| |--->| |---> +--->| |<---| |<---| |<---| |<--- + +---+ +---+ +---+ +---+ + +Nested write comes in and moves the tail page forward: + + tail page (moved by nested writer) + temp page | + | | + v v + +---+ +---+ +---+ +---+ +<---| |--->| |--->| |--->| |---> +--->| |<---| |<---| |<---| |<--- + +---+ +---+ +---+ +---+ + +The above would fail the cmpxchg, but since the tail page has already +been moved forward, the writer will just try again to reserve storage +on the new tail page. + +But the moving of the head page is a bit more complex. + + tail page + | + v + +---+ +---+ +---+ +---+ +<---| |--->| |-H->| |--->| |---> +--->| |<---| |<---| |<---| |<--- + +---+ +---+ +---+ +---+ + +The write converts the head page pointer to UPDATE. + + tail page + | + v + +---+ +---+ +---+ +---+ +<---| |--->| |-U->| |--->| |---> +--->| |<---| |<---| |<---| |<--- + +---+ +---+ +---+ +---+ + +But if a nested writer preempts here, it will see that the next +page is a head page, but it is also nested. It will detect that +it is nested and will save that information. The detection is the +fact that it sees the UPDATE flag instead of a HEADER or NORMAL +pointer. + +The nested writer will set the new head page pointer. + + tail page + | + v + +---+ +---+ +---+ +---+ +<---| |--->| |-U->| |-H->| |---> +--->| |<---| |<---| |<---| |<--- + +---+ +---+ +---+ +---+ + +But it will not reset the update back to normal. Only the writer +that converted a pointer from HEAD to UPDATE will convert it back +to NORMAL. + + tail page + | + v + +---+ +---+ +---+ +---+ +<---| |--->| |-U->| |-H->| |---> +--->| |<---| |<---| |<---| |<--- + +---+ +---+ +---+ +---+ + +After the nested writer finishes, the outermost writer will convert +the UPDATE pointer to NORMAL. + + + tail page + | + v + +---+ +---+ +---+ +---+ +<---| |--->| |--->| |-H->| |---> +--->| |<---| |<---| |<---| |<--- + +---+ +---+ +---+ +---+ + + +It can be even more complex if several nested writes came in and moved +the tail page ahead several pages: + + +(first writer) + + tail page + | + v + +---+ +---+ +---+ +---+ +<---| |--->| |-H->| |--->| |---> +--->| |<---| |<---| |<---| |<--- + +---+ +---+ +---+ +---+ + +The write converts the head page pointer to UPDATE. + + tail page + | + v + +---+ +---+ +---+ +---+ +<---| |--->| |-U->| |--->| |---> +--->| |<---| |<---| |<---| |<--- + +---+ +---+ +---+ +---+ + +Next writer comes in, and sees the update and sets up the new +head page. + +(second writer) + + tail page + | + v + +---+ +---+ +---+ +---+ +<---| |--->| |-U->| |-H->| |---> +--->| |<---| |<---| |<---| |<--- + +---+ +---+ +---+ +---+ + +The nested writer moves the tail page forward. But does not set the old +update page to NORMAL because it is not the outermost writer. + + tail page + | + v + +---+ +---+ +---+ +---+ +<---| |--->| |-U->| |-H->| |---> +--->| |<---| |<---| |<---| |<--- + +---+ +---+ +---+ +---+ + +Another writer preempts and sees the page after the tail page is a head page. +It changes it from HEAD to UPDATE. + +(third writer) + + tail page + | + v + +---+ +---+ +---+ +---+ +<---| |--->| |-U->| |-U->| |---> +--->| |<---| |<---| |<---| |<--- + +---+ +---+ +---+ +---+ + +The writer will move the head page forward: + + +(third writer) + + tail page + | + v + +---+ +---+ +---+ +---+ +<---| |--->| |-U->| |-U->| |-H-> +--->| |<---| |<---| |<---| |<--- + +---+ +---+ +---+ +---+ + +But now that the third writer did change the HEAD flag to UPDATE it +will convert it to normal: + + +(third writer) + + tail page + | + v + +---+ +---+ +---+ +---+ +<---| |--->| |-U->| |--->| |-H-> +--->| |<---| |<---| |<---| |<--- + +---+ +---+ +---+ +---+ + + +Then it will move the tail page, and return back to the second writer. + + +(second writer) + + tail page + | + v + +---+ +---+ +---+ +---+ +<---| |--->| |-U->| |--->| |-H-> +--->| |<---| |<---| |<---| |<--- + +---+ +---+ +---+ +---+ + + +The second writer will fail to move the tail page because it was already +moved, so it will try again and add its data to the new tail page. +It will return to the first writer. + + +(first writer) + + tail page + | + v + +---+ +---+ +---+ +---+ +<---| |--->| |-U->| |--->| |-H-> +--->| |<---| |<---| |<---| |<--- + +---+ +---+ +---+ +---+ + +The first writer cannot know atomically if the tail page moved +while it updates the HEAD page. It will then update the head page to +what it thinks is the new head page. + + +(first writer) + + tail page + | + v + +---+ +---+ +---+ +---+ +<---| |--->| |-U->| |-H->| |-H-> +--->| |<---| |<---| |<---| |<--- + +---+ +---+ +---+ +---+ + +Since the cmpxchg returns the old value of the pointer the first writer +will see it succeeded in updating the pointer from NORMAL to HEAD. +But as we can see, this is not good enough. It must also check to see +if the tail page is either where it use to be or on the next page: + + +(first writer) + + A B tail page + | | | + v v v + +---+ +---+ +---+ +---+ +<---| |--->| |-U->| |-H->| |-H-> +--->| |<---| |<---| |<---| |<--- + +---+ +---+ +---+ +---+ + +If tail page != A and tail page != B, then it must reset the pointer +back to NORMAL. The fact that it only needs to worry about nested +writers means that it only needs to check this after setting the HEAD page. + + +(first writer) + + A B tail page + | | | + v v v + +---+ +---+ +---+ +---+ +<---| |--->| |-U->| |--->| |-H-> +--->| |<---| |<---| |<---| |<--- + +---+ +---+ +---+ +---+ + +Now the writer can update the head page. This is also why the head page must +remain in UPDATE and only reset by the outermost writer. This prevents +the reader from seeing the incorrect head page. + + +(first writer) + + A B tail page + | | | + v v v + +---+ +---+ +---+ +---+ +<---| |--->| |--->| |--->| |-H-> +--->| |<---| |<---| |<---| |<--- + +---+ +---+ +---+ +---+ + diff --git a/Documentation/trace/tracepoint-analysis.txt b/Documentation/trace/tracepoint-analysis.txt new file mode 100644 index 00000000000..058cc6c9dc5 --- /dev/null +++ b/Documentation/trace/tracepoint-analysis.txt @@ -0,0 +1,327 @@ + Notes on Analysing Behaviour Using Events and Tracepoints + + Documentation written by Mel Gorman + PCL information heavily based on email from Ingo Molnar + +1. Introduction +=============== + +Tracepoints (see Documentation/trace/tracepoints.txt) can be used without +creating custom kernel modules to register probe functions using the event +tracing infrastructure. + +Simplistically, tracepoints represent important events that can be +taken in conjunction with other tracepoints to build a "Big Picture" of +what is going on within the system. There are a large number of methods for +gathering and interpreting these events. Lacking any current Best Practises, +this document describes some of the methods that can be used. + +This document assumes that debugfs is mounted on /sys/kernel/debug and that +the appropriate tracing options have been configured into the kernel. It is +assumed that the PCL tool tools/perf has been installed and is in your path. + +2. Listing Available Events +=========================== + +2.1 Standard Utilities +---------------------- + +All possible events are visible from /sys/kernel/debug/tracing/events. Simply +calling + + $ find /sys/kernel/debug/tracing/events -type d + +will give a fair indication of the number of events available. + +2.2 PCL (Performance Counters for Linux) +------- + +Discovery and enumeration of all counters and events, including tracepoints, +are available with the perf tool. Getting a list of available events is a +simple case of: + + $ perf list 2>&1 | grep Tracepoint + ext4:ext4_free_inode [Tracepoint event] + ext4:ext4_request_inode [Tracepoint event] + ext4:ext4_allocate_inode [Tracepoint event] + ext4:ext4_write_begin [Tracepoint event] + ext4:ext4_ordered_write_end [Tracepoint event] + [ .... remaining output snipped .... ] + + +3. Enabling Events +================== + +3.1 System-Wide Event Enabling +------------------------------ + +See Documentation/trace/events.txt for a proper description on how events +can be enabled system-wide. A short example of enabling all events related +to page allocation would look something like: + + $ for i in `find /sys/kernel/debug/tracing/events -name "enable" | grep mm_`; do echo 1 > $i; done + +3.2 System-Wide Event Enabling with SystemTap +--------------------------------------------- + +In SystemTap, tracepoints are accessible using the kernel.trace() function +call. The following is an example that reports every 5 seconds what processes +were allocating the pages. + + global page_allocs + + probe kernel.trace("mm_page_alloc") { + page_allocs[execname()]++ + } + + function print_count() { + printf ("%-25s %-s\n", "#Pages Allocated", "Process Name") + foreach (proc in page_allocs-) + printf("%-25d %s\n", page_allocs[proc], proc) + printf ("\n") + delete page_allocs + } + + probe timer.s(5) { + print_count() + } + +3.3 System-Wide Event Enabling with PCL +--------------------------------------- + +By specifying the -a switch and analysing sleep, the system-wide events +for a duration of time can be examined. + + $ perf stat -a \ + -e kmem:mm_page_alloc -e kmem:mm_page_free \ + -e kmem:mm_page_free_batched \ + sleep 10 + Performance counter stats for 'sleep 10': + + 9630 kmem:mm_page_alloc + 2143 kmem:mm_page_free + 7424 kmem:mm_page_free_batched + + 10.002577764 seconds time elapsed + +Similarly, one could execute a shell and exit it as desired to get a report +at that point. + +3.4 Local Event Enabling +------------------------ + +Documentation/trace/ftrace.txt describes how to enable events on a per-thread +basis using set_ftrace_pid. + +3.5 Local Event Enablement with PCL +----------------------------------- + +Events can be activated and tracked for the duration of a process on a local +basis using PCL such as follows. + + $ perf stat -e kmem:mm_page_alloc -e kmem:mm_page_free \ + -e kmem:mm_page_free_batched ./hackbench 10 + Time: 0.909 + + Performance counter stats for './hackbench 10': + + 17803 kmem:mm_page_alloc + 12398 kmem:mm_page_free + 4827 kmem:mm_page_free_batched + + 0.973913387 seconds time elapsed + +4. Event Filtering +================== + +Documentation/trace/ftrace.txt covers in-depth how to filter events in +ftrace. Obviously using grep and awk of trace_pipe is an option as well +as any script reading trace_pipe. + +5. Analysing Event Variances with PCL +===================================== + +Any workload can exhibit variances between runs and it can be important +to know what the standard deviation is. By and large, this is left to the +performance analyst to do it by hand. In the event that the discrete event +occurrences are useful to the performance analyst, then perf can be used. + + $ perf stat --repeat 5 -e kmem:mm_page_alloc -e kmem:mm_page_free + -e kmem:mm_page_free_batched ./hackbench 10 + Time: 0.890 + Time: 0.895 + Time: 0.915 + Time: 1.001 + Time: 0.899 + + Performance counter stats for './hackbench 10' (5 runs): + + 16630 kmem:mm_page_alloc ( +- 3.542% ) + 11486 kmem:mm_page_free ( +- 4.771% ) + 4730 kmem:mm_page_free_batched ( +- 2.325% ) + + 0.982653002 seconds time elapsed ( +- 1.448% ) + +In the event that some higher-level event is required that depends on some +aggregation of discrete events, then a script would need to be developed. + +Using --repeat, it is also possible to view how events are fluctuating over +time on a system-wide basis using -a and sleep. + + $ perf stat -e kmem:mm_page_alloc -e kmem:mm_page_free \ + -e kmem:mm_page_free_batched \ + -a --repeat 10 \ + sleep 1 + Performance counter stats for 'sleep 1' (10 runs): + + 1066 kmem:mm_page_alloc ( +- 26.148% ) + 182 kmem:mm_page_free ( +- 5.464% ) + 890 kmem:mm_page_free_batched ( +- 30.079% ) + + 1.002251757 seconds time elapsed ( +- 0.005% ) + +6. Higher-Level Analysis with Helper Scripts +============================================ + +When events are enabled the events that are triggering can be read from +/sys/kernel/debug/tracing/trace_pipe in human-readable format although binary +options exist as well. By post-processing the output, further information can +be gathered on-line as appropriate. Examples of post-processing might include + + o Reading information from /proc for the PID that triggered the event + o Deriving a higher-level event from a series of lower-level events. + o Calculating latencies between two events + +Documentation/trace/postprocess/trace-pagealloc-postprocess.pl is an example +script that can read trace_pipe from STDIN or a copy of a trace. When used +on-line, it can be interrupted once to generate a report without exiting +and twice to exit. + +Simplistically, the script just reads STDIN and counts up events but it +also can do more such as + + o Derive high-level events from many low-level events. If a number of pages + are freed to the main allocator from the per-CPU lists, it recognises + that as one per-CPU drain even though there is no specific tracepoint + for that event + o It can aggregate based on PID or individual process number + o In the event memory is getting externally fragmented, it reports + on whether the fragmentation event was severe or moderate. + o When receiving an event about a PID, it can record who the parent was so + that if large numbers of events are coming from very short-lived + processes, the parent process responsible for creating all the helpers + can be identified + +7. Lower-Level Analysis with PCL +================================ + +There may also be a requirement to identify what functions within a program +were generating events within the kernel. To begin this sort of analysis, the +data must be recorded. At the time of writing, this required root: + + $ perf record -c 1 \ + -e kmem:mm_page_alloc -e kmem:mm_page_free \ + -e kmem:mm_page_free_batched \ + ./hackbench 10 + Time: 0.894 + [ perf record: Captured and wrote 0.733 MB perf.data (~32010 samples) ] + +Note the use of '-c 1' to set the event period to sample. The default sample +period is quite high to minimise overhead but the information collected can be +very coarse as a result. + +This record outputted a file called perf.data which can be analysed using +perf report. + + $ perf report + # Samples: 30922 + # + # Overhead Command Shared Object + # ........ ......... ................................ + # + 87.27% hackbench [vdso] + 6.85% hackbench /lib/i686/cmov/libc-2.9.so + 2.62% hackbench /lib/ld-2.9.so + 1.52% perf [vdso] + 1.22% hackbench ./hackbench + 0.48% hackbench [kernel] + 0.02% perf /lib/i686/cmov/libc-2.9.so + 0.01% perf /usr/bin/perf + 0.01% perf /lib/ld-2.9.so + 0.00% hackbench /lib/i686/cmov/libpthread-2.9.so + # + # (For more details, try: perf report --sort comm,dso,symbol) + # + +According to this, the vast majority of events triggered on events +within the VDSO. With simple binaries, this will often be the case so let's +take a slightly different example. In the course of writing this, it was +noticed that X was generating an insane amount of page allocations so let's look +at it: + + $ perf record -c 1 -f \ + -e kmem:mm_page_alloc -e kmem:mm_page_free \ + -e kmem:mm_page_free_batched \ + -p `pidof X` + +This was interrupted after a few seconds and + + $ perf report + # Samples: 27666 + # + # Overhead Command Shared Object + # ........ ....... ....................................... + # + 51.95% Xorg [vdso] + 47.95% Xorg /opt/gfx-test/lib/libpixman-1.so.0.13.1 + 0.09% Xorg /lib/i686/cmov/libc-2.9.so + 0.01% Xorg [kernel] + # + # (For more details, try: perf report --sort comm,dso,symbol) + # + +So, almost half of the events are occurring in a library. To get an idea which +symbol: + + $ perf report --sort comm,dso,symbol + # Samples: 27666 + # + # Overhead Command Shared Object Symbol + # ........ ....... ....................................... ...... + # + 51.95% Xorg [vdso] [.] 0x000000ffffe424 + 47.93% Xorg /opt/gfx-test/lib/libpixman-1.so.0.13.1 [.] pixmanFillsse2 + 0.09% Xorg /lib/i686/cmov/libc-2.9.so [.] _int_malloc + 0.01% Xorg /opt/gfx-test/lib/libpixman-1.so.0.13.1 [.] pixman_region32_copy_f + 0.01% Xorg [kernel] [k] read_hpet + 0.01% Xorg /opt/gfx-test/lib/libpixman-1.so.0.13.1 [.] get_fast_path + 0.00% Xorg [kernel] [k] ftrace_trace_userstack + +To see where within the function pixmanFillsse2 things are going wrong: + + $ perf annotate pixmanFillsse2 + [ ... ] + 0.00 : 34eeb: 0f 18 08 prefetcht0 (%eax) + : } + : + : extern __inline void __attribute__((__gnu_inline__, __always_inline__, _ + : _mm_store_si128 (__m128i *__P, __m128i __B) : { + : *__P = __B; + 12.40 : 34eee: 66 0f 7f 80 40 ff ff movdqa %xmm0,-0xc0(%eax) + 0.00 : 34ef5: ff + 12.40 : 34ef6: 66 0f 7f 80 50 ff ff movdqa %xmm0,-0xb0(%eax) + 0.00 : 34efd: ff + 12.39 : 34efe: 66 0f 7f 80 60 ff ff movdqa %xmm0,-0xa0(%eax) + 0.00 : 34f05: ff + 12.67 : 34f06: 66 0f 7f 80 70 ff ff movdqa %xmm0,-0x90(%eax) + 0.00 : 34f0d: ff + 12.58 : 34f0e: 66 0f 7f 40 80 movdqa %xmm0,-0x80(%eax) + 12.31 : 34f13: 66 0f 7f 40 90 movdqa %xmm0,-0x70(%eax) + 12.40 : 34f18: 66 0f 7f 40 a0 movdqa %xmm0,-0x60(%eax) + 12.31 : 34f1d: 66 0f 7f 40 b0 movdqa %xmm0,-0x50(%eax) + +At a glance, it looks like the time is being spent copying pixmaps to +the card. Further investigation would be needed to determine why pixmaps +are being copied around so much but a starting point would be to take an +ancient build of libpixmap out of the library path where it was totally +forgotten about from months ago! diff --git a/Documentation/trace/tracepoints.txt b/Documentation/trace/tracepoints.txt index c0e1ceed75a..a3efac621c5 100644 --- a/Documentation/trace/tracepoints.txt +++ b/Documentation/trace/tracepoints.txt @@ -40,7 +40,13 @@ Two elements are required for tracepoints : In order to use tracepoints, you should include linux/tracepoint.h. -In include/trace/subsys.h : +In include/trace/events/subsys.h : + +#undef TRACE_SYSTEM +#define TRACE_SYSTEM subsys + +#if !defined(_TRACE_SUBSYS_H) || defined(TRACE_HEADER_MULTI_READ) +#define _TRACE_SUBSYS_H #include <linux/tracepoint.h> @@ -48,10 +54,16 @@ DECLARE_TRACE(subsys_eventname, TP_PROTO(int firstarg, struct task_struct *p), TP_ARGS(firstarg, p)); +#endif /* _TRACE_SUBSYS_H */ + +/* This part must be outside protection */ +#include <trace/define_trace.h> + In subsys/file.c (where the tracing statement must be added) : -#include <trace/subsys.h> +#include <trace/events/subsys.h> +#define CREATE_TRACE_POINTS DEFINE_TRACE(subsys_eventname); void somefct(void) @@ -72,6 +84,9 @@ Where : - TP_ARGS(firstarg, p) are the parameters names, same as found in the prototype. +- if you use the header in multiple source files, #define CREATE_TRACE_POINTS + should appear only in one source file. + Connecting a function (probe) to a tracepoint is done by providing a probe (function to call) for the specific tracepoint through register_trace_subsys_eventname(). Removing a probe is done through @@ -81,7 +96,6 @@ tracepoint_synchronize_unregister() must be called before the end of the module exit function to make sure there is no caller left using the probe. This, and the fact that preemption is disabled around the probe call, make sure that probe removal and module unload are safe. -See the "Probe example" section below for a sample probe module. The tracepoint mechanism supports inserting multiple instances of the same tracepoint, but a single definition must be made of a given @@ -101,16 +115,31 @@ If the tracepoint has to be used in kernel modules, an EXPORT_TRACEPOINT_SYMBOL_GPL() or EXPORT_TRACEPOINT_SYMBOL() can be used to export the defined tracepoints. -* Probe / tracepoint example +If you need to do a bit of work for a tracepoint parameter, and +that work is only used for the tracepoint, that work can be encapsulated +within an if statement with the following: + + if (trace_foo_bar_enabled()) { + int i; + int tot = 0; + + for (i = 0; i < count; i++) + tot += calculate_nuggets(); + + trace_foo_bar(tot); + } -See the example provided in samples/tracepoints +All trace_<tracepoint>() calls have a matching trace_<tracepoint>_enabled() +function defined that returns true if the tracepoint is enabled and +false otherwise. The trace_<tracepoint>() should always be within the +block of the if (trace_<tracepoint>_enabled()) to prevent races between +the tracepoint being enabled and the check being seen. -Compile them with your kernel. They are built during 'make' (not -'make modules') when CONFIG_SAMPLE_TRACEPOINTS=m. +The advantage of using the trace_<tracepoint>_enabled() is that it uses +the static_key of the tracepoint to allow the if statement to be implemented +with jump labels and avoid conditional branches. -Run, as root : -modprobe tracepoint-sample (insmod order is not important) -modprobe tracepoint-probe-sample -cat /proc/tracepoint-sample (returns an expected error) -rmmod tracepoint-sample tracepoint-probe-sample -dmesg +Note: The convenience macro TRACE_EVENT provides an alternative way to + define tracepoints. Check http://lwn.net/Articles/379903, + http://lwn.net/Articles/381064 and http://lwn.net/Articles/383362 + for a series of articles with more details. diff --git a/Documentation/trace/uprobetracer.txt b/Documentation/trace/uprobetracer.txt new file mode 100644 index 00000000000..f1cf9a34ad9 --- /dev/null +++ b/Documentation/trace/uprobetracer.txt @@ -0,0 +1,159 @@ + Uprobe-tracer: Uprobe-based Event Tracing + ========================================= + + Documentation written by Srikar Dronamraju + + +Overview +-------- +Uprobe based trace events are similar to kprobe based trace events. +To enable this feature, build your kernel with CONFIG_UPROBE_EVENT=y. + +Similar to the kprobe-event tracer, this doesn't need to be activated via +current_tracer. Instead of that, add probe points via +/sys/kernel/debug/tracing/uprobe_events, and enable it via +/sys/kernel/debug/tracing/events/uprobes/<EVENT>/enabled. + +However unlike kprobe-event tracer, the uprobe event interface expects the +user to calculate the offset of the probepoint in the object. + +Synopsis of uprobe_tracer +------------------------- + p[:[GRP/]EVENT] PATH:OFFSET [FETCHARGS] : Set a uprobe + r[:[GRP/]EVENT] PATH:OFFSET [FETCHARGS] : Set a return uprobe (uretprobe) + -:[GRP/]EVENT : Clear uprobe or uretprobe event + + GRP : Group name. If omitted, "uprobes" is the default value. + EVENT : Event name. If omitted, the event name is generated based + on PATH+OFFSET. + PATH : Path to an executable or a library. + OFFSET : Offset where the probe is inserted. + + FETCHARGS : Arguments. Each probe can have up to 128 args. + %REG : Fetch register REG + @ADDR : Fetch memory at ADDR (ADDR should be in userspace) + @+OFFSET : Fetch memory at OFFSET (OFFSET from same file as PATH) + $stackN : Fetch Nth entry of stack (N >= 0) + $stack : Fetch stack address. + $retval : Fetch return value.(*) + +|-offs(FETCHARG) : Fetch memory at FETCHARG +|- offs address.(**) + NAME=FETCHARG : Set NAME as the argument name of FETCHARG. + FETCHARG:TYPE : Set TYPE as the type of FETCHARG. Currently, basic types + (u8/u16/u32/u64/s8/s16/s32/s64), "string" and bitfield + are supported. + + (*) only for return probe. + (**) this is useful for fetching a field of data structures. + +Types +----- +Several types are supported for fetch-args. Uprobe tracer will access memory +by given type. Prefix 's' and 'u' means those types are signed and unsigned +respectively. Traced arguments are shown in decimal (signed) or hex (unsigned). +String type is a special type, which fetches a "null-terminated" string from +user space. +Bitfield is another special type, which takes 3 parameters, bit-width, bit- +offset, and container-size (usually 32). The syntax is; + + b<bit-width>@<bit-offset>/<container-size> + + +Event Profiling +--------------- +You can check the total number of probe hits and probe miss-hits via +/sys/kernel/debug/tracing/uprobe_profile. +The first column is event name, the second is the number of probe hits, +the third is the number of probe miss-hits. + +Usage examples +-------------- + * Add a probe as a new uprobe event, write a new definition to uprobe_events +as below: (sets a uprobe at an offset of 0x4245c0 in the executable /bin/bash) + + echo 'p: /bin/bash:0x4245c0' > /sys/kernel/debug/tracing/uprobe_events + + * Add a probe as a new uretprobe event: + + echo 'r: /bin/bash:0x4245c0' > /sys/kernel/debug/tracing/uprobe_events + + * Unset registered event: + + echo '-:bash_0x4245c0' >> /sys/kernel/debug/tracing/uprobe_events + + * Print out the events that are registered: + + cat /sys/kernel/debug/tracing/uprobe_events + + * Clear all events: + + echo > /sys/kernel/debug/tracing/uprobe_events + +Following example shows how to dump the instruction pointer and %ax register +at the probed text address. Probe zfree function in /bin/zsh: + + # cd /sys/kernel/debug/tracing/ + # cat /proc/`pgrep zsh`/maps | grep /bin/zsh | grep r-xp + 00400000-0048a000 r-xp 00000000 08:03 130904 /bin/zsh + # objdump -T /bin/zsh | grep -w zfree + 0000000000446420 g DF .text 0000000000000012 Base zfree + + 0x46420 is the offset of zfree in object /bin/zsh that is loaded at + 0x00400000. Hence the command to uprobe would be: + + # echo 'p:zfree_entry /bin/zsh:0x46420 %ip %ax' > uprobe_events + + And the same for the uretprobe would be: + + # echo 'r:zfree_exit /bin/zsh:0x46420 %ip %ax' >> uprobe_events + +Please note: User has to explicitly calculate the offset of the probe-point +in the object. We can see the events that are registered by looking at the +uprobe_events file. + + # cat uprobe_events + p:uprobes/zfree_entry /bin/zsh:0x00046420 arg1=%ip arg2=%ax + r:uprobes/zfree_exit /bin/zsh:0x00046420 arg1=%ip arg2=%ax + +Format of events can be seen by viewing the file events/uprobes/zfree_entry/format + + # cat events/uprobes/zfree_entry/format + name: zfree_entry + ID: 922 + format: + field:unsigned short common_type; offset:0; size:2; signed:0; + field:unsigned char common_flags; offset:2; size:1; signed:0; + field:unsigned char common_preempt_count; offset:3; size:1; signed:0; + field:int common_pid; offset:4; size:4; signed:1; + field:int common_padding; offset:8; size:4; signed:1; + + field:unsigned long __probe_ip; offset:12; size:4; signed:0; + field:u32 arg1; offset:16; size:4; signed:0; + field:u32 arg2; offset:20; size:4; signed:0; + + print fmt: "(%lx) arg1=%lx arg2=%lx", REC->__probe_ip, REC->arg1, REC->arg2 + +Right after definition, each event is disabled by default. For tracing these +events, you need to enable it by: + + # echo 1 > events/uprobes/enable + +Lets disable the event after sleeping for some time. + + # sleep 20 + # echo 0 > events/uprobes/enable + +And you can see the traced information via /sys/kernel/debug/tracing/trace. + + # cat trace + # tracer: nop + # + # TASK-PID CPU# TIMESTAMP FUNCTION + # | | | | | + zsh-24842 [006] 258544.995456: zfree_entry: (0x446420) arg1=446420 arg2=79 + zsh-24842 [007] 258545.000270: zfree_exit: (0x446540 <- 0x446420) arg1=446540 arg2=0 + zsh-24842 [002] 258545.043929: zfree_entry: (0x446420) arg1=446420 arg2=79 + zsh-24842 [004] 258547.046129: zfree_exit: (0x446540 <- 0x446420) arg1=446540 arg2=0 + +Output shows us uprobe was triggered for a pid 24842 with ip being 0x446420 +and contents of ax register being 79. And uretprobe was triggered with ip at +0x446540 with counterpart function entry at 0x446420. |
