ftrace - Function Tracer
========================
Copyright 2008 Red Hat Inc.
Author: Steven Rostedt <srostedt@redhat.com>
License: The GNU Free Documentation License, Version 1.2
(dual licensed under the GPL v2)
Reviewers: Elias Oltmanns, Randy Dunlap, Andrew Morton,
John Kacur, and David Teigland.
Written for: 2.6.28-rc2
Introduction
------------
Ftrace is an internal tracer designed to help out developers and
designers of systems to find what is going on inside the kernel.
It can be used for debugging or analyzing latencies and performance
issues that take place outside of user-space.
Although ftrace is the function tracer, it also includes an
infrastructure that allows for other types of tracing. Some of the
tracers that are currently in ftrace include a tracer to trace
context switches, the time it takes for a high priority task to
run after it was woken up, the time interrupts are disabled, and
more (ftrace allows for tracer plugins, which means that the list of
tracers can always grow).
The File System
---------------
Ftrace uses the debugfs file system to hold the control files as well
as the files to display output.
To mount the debugfs system:
# mkdir /debug
# mount -t debugfs nodev /debug
(Note: it is more common to mount at /sys/kernel/debug, but for simplicity
this document will use /debug)
That's it! (assuming that you have ftrace configured into your kernel)
After mounting the debugfs, you can see a directory called
"tracing". This directory contains the control and output files
of ftrace. Here is a list of some of the key files:
Note: all time values are in microseconds.
current_tracer: This is used to set or display the current tracer
that is configured.
available_tracers: This holds the different types of tracers that
have been compiled into the kernel. The tracers
listed here can be configured by echoing their name
into current_tracer.
tracing_enabled: This sets or displays whether the current_tracer
is activated and tracing or not. Echo 0 into this
file to disable the tracer or 1 to enable it.
trace: This file holds the output of the trace in a human readable
format (described below).
latency_trace: This file shows the same trace but the information
is organized more to display possible latencies
in the system (described below).
trace_pipe: The output is the same as the "trace" file but this
file is meant to be streamed with live tracing.
Reads from this file will block until new data
is retrieved. Unlike the "trace" and "latency_trace"
files, this file is a consumer. This means reading
from this file causes sequential reads to display
more current data. Once data is read from this
file, it is consumed, and will not be read
again with a sequential read. The "trace" and
"latency_trace" files are static, and if the
tracer is not adding more data, they will display
the same information every time they are read.
trace_options: This file lets the user control the amount of data
that is displayed in one of the above output
files.
trace_max_latency: Some of the tracers record the max latency.
For example, the time interrupts are disabled.
This time is saved in this file. The max trace
will also be stored, and displayed by either
"trace" or "latency_trace". A new max trace will
only be recorded if the latency is greater than
the value in this file. (in microseconds)
buffer_size_kb: This sets or displays the number of kilobytes each CPU
buffer can hold. The tracer buffers are the same size
for each CPU. The displayed number is the size of the
CPU buffer and not total size of all buffers. The
trace buffers are allocated in pages (blocks of memory
that the kernel uses for allocation, usually 4 KB in size).
If the last page allocated has room for more bytes
than requested, the rest of the page will be used,
making the actual allocation bigger than requested.
(Note, the size may not be a multiple of the page size due
to buffer managment overhead.)
This can only be updated when the current_tracer
is set to "nop".
tracing_cpumask: This is a mask that lets the user only trace
on specified CPUS. The format is a hex string
representing the CPUS.
set_ftrace_filter: When dynamic ftrace is configured in (see the
section below "dynamic ftrace"), the code is dynamically
modified (code text rewrite) to disable calling of the
function profiler (mcount). This lets tracing be configured
in with practically no overhead in performance. This also
has a side effect of enabling or disabling specific functions
to be traced. Echoing names of functions into this file
will limit the trace to only those functions.
set_ftrace_notrace: This has an effect opposite to that of
set_ftrace_filter. Any function that is added here will not
be traced. If a function exists in both set_ftrace_filter
and set_ftrace_notrace, the function will _not_ be traced.
set_ftrace_pid: Have the function tracer only trace a single thread.
available_filter_functions: This lists the functions that ftrace
has processed and can trace. These are the function
names that you can pass to "set_ftrace_filter" or
"set_ftrace_notrace". (See the section "dynamic ftrace"
below for more details.)
The Tracers
-----------
Here is the list of current tracers that may be configured.
function - function tracer that uses mcount to trace all functions.
sched_switch - traces the context switches between tasks.
irqsoff - traces the areas that disable interrupts and saves
the trace with the longest max latency.
See tracing_max_latency. When a new max is recorded,
it replaces the old trace. It is best to view this
trace via the latency_trace file.
preemptoff - Similar to irqsoff but traces and records the amount of
time for which preemption is disabled.
preemptirqsoff - Similar to irqsoff and preemptoff, but traces and
records the largest time for which irqs and/or preemption
is disabled.
wakeup - Traces and records the max latency that it takes for
the highest priority task to get scheduled after
it has been woken up.
nop - This is not a tracer. To remove all tracers from tracing
simply echo "nop" into current_tracer.
Examples of using the tracer
----------------------------
Here are typical examples of using the tracers when controlling them only
with the debugfs interface (without using any user-land utilities).
Output format:
--------------
Here is an example of the output format of the file "trace"
--------
# tracer: function
#
# TASK-PID CPU# TIMESTAMP FUNCTION
# | | | | |
bash-4251 [01] 10152.583854: path_put <-path_walk
bash-4251 [01] 10152.583855: dput <-path_put
bash-4251 [01] 10152.583855: _atomic_dec_and_lock <-dput
--------
A header is printed with the tracer name that is represented by the trace.
In this case the tracer is "function". Then a header showing the format. Task
name "bash", the task PID "4251", the CPU that it was running on
"01", the timestamp in <secs>.<usecs> format, the function name that was
traced "path_put" and the parent function that called this function
"path_walk". The timestamp is the time at which the function wa