aboutsummaryrefslogtreecommitdiff
path: root/Documentation
diff options
context:
space:
mode:
Diffstat (limited to 'Documentation')
-rw-r--r--Documentation/ABI/testing/debugfs-kmemtrace71
-rw-r--r--Documentation/filesystems/pohmelfs/design_notes.txt70
-rw-r--r--Documentation/filesystems/pohmelfs/info.txt86
-rw-r--r--Documentation/filesystems/pohmelfs/network_protocol.txt227
-rw-r--r--Documentation/ftrace.txt1134
-rw-r--r--Documentation/kernel-parameters.txt17
-rw-r--r--Documentation/sysrq.txt2
-rw-r--r--Documentation/tracepoints.txt21
-rw-r--r--Documentation/vm/kmemtrace.txt126
9 files changed, 1379 insertions, 375 deletions
diff --git a/Documentation/ABI/testing/debugfs-kmemtrace b/Documentation/ABI/testing/debugfs-kmemtrace
new file mode 100644
index 00000000000..5e6a92a02d8
--- /dev/null
+++ b/Documentation/ABI/testing/debugfs-kmemtrace
@@ -0,0 +1,71 @@
+What: /sys/kernel/debug/kmemtrace/
+Date: July 2008
+Contact: Eduard - Gabriel Munteanu <eduard.munteanu@linux360.ro>
+Description:
+
+In kmemtrace-enabled kernels, the following files are created:
+
+/sys/kernel/debug/kmemtrace/
+ cpu<n> (0400) Per-CPU tracing data, see below. (binary)
+ total_overruns (0400) Total number of bytes which were dropped from
+ cpu<n> files because of full buffer condition,
+ non-binary. (text)
+ abi_version (0400) Kernel's kmemtrace ABI version. (text)
+
+Each per-CPU file should be read according to the relay interface. That is,
+the reader should set affinity to that specific CPU and, as currently done by
+the userspace application (though there are other methods), use poll() with
+an infinite timeout before every read(). Otherwise, erroneous data may be
+read. The binary data has the following _core_ format:
+
+ Event ID (1 byte) Unsigned integer, one of:
+ 0 - represents an allocation (KMEMTRACE_EVENT_ALLOC)
+ 1 - represents a freeing of previously allocated memory
+ (KMEMTRACE_EVENT_FREE)
+ Type ID (1 byte) Unsigned integer, one of:
+ 0 - this is a kmalloc() / kfree()
+ 1 - this is a kmem_cache_alloc() / kmem_cache_free()
+ 2 - this is a __get_free_pages() et al.
+ Event size (2 bytes) Unsigned integer representing the
+ size of this event. Used to extend
+ kmemtrace. Discard the bytes you
+ don't know about.
+ Sequence number (4 bytes) Signed integer used to reorder data
+ logged on SMP machines. Wraparound
+ must be taken into account, although
+ it is unlikely.
+ Caller address (8 bytes) Return address to the caller.
+ Pointer to mem (8 bytes) Pointer to target memory area. Can be
+ NULL, but not all such calls might be
+ recorded.
+
+In case of KMEMTRACE_EVENT_ALLOC events, the next fields follow:
+
+ Requested bytes (8 bytes) Total number of requested bytes,
+ unsigned, must not be zero.
+ Allocated bytes (8 bytes) Total number of actually allocated
+ bytes, unsigned, must not be lower
+ than requested bytes.
+ Requested flags (4 bytes) GFP flags supplied by the caller.
+ Target CPU (4 bytes) Signed integer, valid for event id 1.
+ If equal to -1, target CPU is the same
+ as origin CPU, but the reverse might
+ not be true.
+
+The data is made available in the same endianness the machine has.
+
+Other event ids and type ids may be defined and added. Other fields may be
+added by increasing event size, but see below for details.
+Every modification to the ABI, including new id definitions, are followed
+by bumping the ABI version by one.
+
+Adding new data to the packet (features) is done at the end of the mandatory
+data:
+ Feature size (2 byte)
+ Feature ID (1 byte)
+ Feature data (Feature size - 3 bytes)
+
+
+Users:
+ kmemtrace-user - git://repo.or.cz/kmemtrace-user.git
+
diff --git a/Documentation/filesystems/pohmelfs/design_notes.txt b/Documentation/filesystems/pohmelfs/design_notes.txt
new file mode 100644
index 00000000000..6d6db60d567
--- /dev/null
+++ b/Documentation/filesystems/pohmelfs/design_notes.txt
@@ -0,0 +1,70 @@
+POHMELFS: Parallel Optimized Host Message Exchange Layered File System.
+
+ Evgeniy Polyakov <zbr@ioremap.net>
+
+Homepage: http://www.ioremap.net/projects/pohmelfs
+
+POHMELFS first began as a network filesystem with coherent local data and
+metadata caches but is now evolving into a parallel distributed filesystem.
+
+Main features of this FS include:
+ * Locally coherent cache for data and metadata with (potentially) byte-range locks.
+ Since all Linux filesystems lock the whole inode during writing, algorithm
+ is very simple and does not use byte-ranges, although they are sent in
+ locking messages.
+ * Completely async processing of all events except creation of hard and symbolic
+ links, and rename events.
+ Object creation and data reading and writing are processed asynchronously.
+ * Flexible object architecture optimized for network processing.
+ Ability to create long paths to objects and remove arbitrarily huge
+ directories with a single network command.
+ (like removing the whole kernel tree via a single network command).
+ * Very high performance.
+ * Fast and scalable multithreaded userspace server. Being in userspace it works
+ with any underlying filesystem and still is much faster than async in-kernel NFS one.
+ * Client is able to switch between different servers (if one goes down, client
+ automatically reconnects to second and so on).
+ * Transactions support. Full failover for all operations.
+ Resending transactions to different servers on timeout or error.
+ * Read request (data read, directory listing, lookup requests) balancing between multiple servers.
+ * Write requests are replicated to multiple servers and completed only when all of them are acked.
+ * Ability to add and/or remove servers from the working set at run-time.
+ * Strong authentification and possible data encryption in network channel.
+ * Extended attributes support.
+
+POHMELFS is based on transactions, which are potentially long-standing objects that live
+in the client's memory. Each transaction contains all the information needed to process a given
+command (or set of commands, which is frequently used during data writing: single transactions
+can contain creation and data writing commands). Transactions are committed by all the servers
+to which they are sent and, in case of failures, are eventually resent or dropped with an error.
+For example, reading will return an error if no servers are available.
+
+POHMELFS uses a asynchronous approach to data processing. Courtesy of transactions, it is
+possible to detach replies from requests and, if the command requires data to be received, the
+caller sleeps waiting for it. Thus, it is possible to issue multiple read commands to different
+servers and async threads will pick up replies in parallel, find appropriate transactions in the
+system and put the data where it belongs (like the page or inode cache).
+
+The main feature of POHMELFS is writeback data and the metadata cache.
+Only a few non-performance critical operations use the write-through cache and
+are synchronous: hard and symbolic link creation, and object rename. Creation,
+removal of objects and data writing are asynchronous and are sent to
+the server during system writeback. Only one writer at a time is allowed for any
+given inode, which is guarded by an appropriate locking protocol.
+Because of this feature, POHMELFS is extremely fast at metadata intensive
+workloads and can fully utilize the bandwidth to the servers when doing bulk
+data transfers.
+
+POHMELFS clients operate with a working set of servers and are capable of balancing read-only
+operations (like lookups or directory listings) between them.
+Administrators can add or remove servers from the set at run-time via special commands (described
+in Documentation/pohmelfs/info.txt file). Writes are replicated to all servers.
+
+POHMELFS is capable of full data channel encryption and/or strong crypto hashing.
+One can select any kernel supported cipher, encryption mode, hash type and operation mode
+(hmac or digest). It is also possible to use both or neither (default). Crypto configuration
+is checked during mount time and, if the server does not support it, appropriate capabilities
+will be disabled or mount will fail (if 'crypto_fail_unsupported' mount option is specified).
+Crypto performance heavily depends on the number of crypto threads, which asynchronously perform
+crypto operations and send the resulting data to server or submit it up the stack. This number
+can be controlled via a mount option.
diff --git a/Documentation/filesystems/pohmelfs/info.txt b/Documentation/filesystems/pohmelfs/info.txt
new file mode 100644
index 00000000000..4e3d5015708
--- /dev/null
+++ b/Documentation/filesystems/pohmelfs/info.txt
@@ -0,0 +1,86 @@
+POHMELFS usage information.
+
+Mount options:
+idx=%u
+ Each mountpoint is associated with a special index via this option.
+ Administrator can add or remove servers from the given index, so all mounts,
+ which were attached to it, are updated.
+ Default it is 0.
+
+trans_scan_timeout=%u
+ This timeout, expressed in milliseconds, specifies time to scan transaction
+ trees looking for stale requests, which have to be resent, or if number of
+ retries exceed specified limit, dropped with error.
+ Default is 5 seconds.
+
+drop_scan_timeout=%u
+ Internal timeout, expressed in milliseconds, which specifies how frequently
+ inodes marked to be dropped are freed. It also specifies how frequently
+ the system checks that servers have to be added or removed from current working set.
+ Default is 1 second.
+
+wait_on_page_timeout=%u
+ Number of milliseconds to wait for reply from remote server for data reading command.
+ If this timeout is exceeded, reading returns an error.
+ Default is 5 seconds.
+
+trans_retries=%u
+ This is the number of times that a transaction will be resent to a server that did
+ not answer for the last @trans_scan_timeout milliseconds.
+ When the number of resends exceeds this limit, the transaction is completed with error.
+ Default is 5 resends.
+
+crypto_thread_num=%u
+ Number of crypto processing threads. Threads are used both for RX and TX traffic.
+ Default is 2, or no threads if crypto operations are not supported.
+
+trans_max_pages=%u
+ Maximum number of pages in a single transaction. This parameter also controls
+ the number of pages, allocated for crypto processing (each crypto thread has
+ pool of pages, the number of which is equal to 'trans_max_pages'.
+ Default is 100 pages.
+
+crypto_fail_unsupported
+ If specified, mount will fail if the server does not support requested crypto operations.
+ By default mount will disable non-matching crypto operations.
+
+mcache_timeout=%u
+ Maximum number of milliseconds to wait for the mcache objects to be processed.
+ Mcache includes locks (given lock should be granted by server), attributes (they should be
+ fully received in the given timeframe).
+ Default is 5 seconds.
+
+Usage examples.
+
+Add (or remove if it already exists) server server1.net:1025 into the working set with index $idx
+with appropriate hash algorithm and key file and cipher algorithm, mode and key file:
+$cfg -a server1.net -p 1025 -i $idx -K $hash_key -k $cipher_key
+
+Mount filesystem with given index $idx to /mnt mountpoint.
+Client will connect to all servers specified in the working set via previous command:
+mount -t pohmel -o idx=$idx q /mnt
+
+One can add or remove servers from working set after mounting too.
+
+
+Server installation.
+
+Creating a server, which listens at port 1025 and 0.0.0.0 address.
+Working root directory (note, that server chroots there, so you have to have appropriate permissions)
+is set to /mnt, server will negotiate hash/cipher with client, in case client requested it, there
+are appropriate key files.
+Number of working threads is set to 10.
+
+# ./fserver -a 0.0.0.0 -p 1025 -r /mnt -w 10 -K hash_key -k cipher_key
+
+ -A 6 - listen on ipv6 address. Default: Disabled.
+ -r root - path to root directory. Default: /tmp.
+ -a addr - listen address. Default: 0.0.0.0.
+ -p port - listen port. Default: 1025.
+ -w workers - number of workers per connected client. Default: 1.
+ -K file - hash key size. Default: none.
+ -k file - cipher key size. Default: none.
+ -h - this help.
+
+Number of worker threads specifies how many workers will be created for each client.
+Bulk single-client transafers usually are better handled with smaller number (like 1-3).
diff --git a/Documentation/filesystems/pohmelfs/network_protocol.txt b/Documentation/filesystems/pohmelfs/network_protocol.txt
new file mode 100644
index 00000000000..40ea6c295af
--- /dev/null
+++ b/Documentation/filesystems/pohmelfs/network_protocol.txt
@@ -0,0 +1,227 @@
+POHMELFS network protocol.
+
+Basic structure used in network communication is following command:
+
+struct netfs_cmd
+{
+ __u16 cmd; /* Command number */
+ __u16 csize; /* Attached crypto information size */
+ __u16 cpad; /* Attached padding size */
+ __u16 ext; /* External flags */
+ __u32 size; /* Size of the attached data */
+ __u32 trans; /* Transaction id */
+ __u64 id; /* Object ID to operate on. Used for feedback.*/
+ __u64 start; /* Start of the object. */
+ __u64 iv; /* IV sequence */
+ __u8 data[0];
+};
+
+Commands can be embedded into transaction command (which in turn has own command),
+so one can extend protocol as needed without breaking backward compatibility as long
+as old commands are supported. All string lengths include tail 0 byte.
+
+All commans are transfered over the network in big-endian. CPU endianess is used at the end peers.
+
+@cmd - command number, which specifies command to be processed. Following
+ commands are used currently:
+
+ NETFS_READDIR = 1, /* Read directory for given inode number */
+ NETFS_READ_PAGE, /* Read data page from the server */
+ NETFS_WRITE_PAGE, /* Write data page to the server */
+ NETFS_CREATE, /* Create directory entry */
+ NETFS_REMOVE, /* Remove directory entry */
+ NETFS_LOOKUP, /* Lookup single object */
+ NETFS_LINK, /* Create a link */
+ NETFS_TRANS, /* Transaction */
+ NETFS_OPEN, /* Open intent */
+ NETFS_INODE_INFO, /* Metadata cache coherency synchronization message */
+ NETFS_PAGE_CACHE, /* Page cache invalidation message */
+ NETFS_READ_PAGES, /* Read multiple contiguous pages in one go */
+ NETFS_RENAME, /* Rename object */
+ NETFS_CAPABILITIES, /* Capabilities of the client, for example supported crypto */
+ NETFS_LOCK, /* Distributed lock message */
+ NETFS_XATTR_SET, /* Set extended attribute */
+ NETFS_XATTR_GET, /* Get extended attribute */
+
+@ext - external flags. Used by different commands to specify some extra arguments
+ like partial size of the embedded objects or creation flags.
+
+@size - size of the attached data. For NETFS_READ_PAGE and NETFS_READ_PAGES no data is attached,
+ but size of the requested data is incorporated here. It does not include size of the command
+ header (struct netfs_cmd) itself.
+
+@id - id of the object this command operates on. Each command can use it for own purpose.
+
+@start - start of the object this command operates on. Each command can use it for own purpose.
+
+@csize, @cpad - size and padding size of the (attached if needed) crypto information.
+
+Command specifications.
+
+@NETFS_READDIR
+This command is used to sync content of the remote dir to the client.
+
+@ext - length of the path to object.
+@size - the same.
+@id - local inode number of the directory to read.
+@start - zero.
+
+
+@NETFS_READ_PAGE
+This command is used to read data from remote server.
+Data size does not exceed local page cache size.
+
+@id - inode number.
+@start - first byte offset.
+@size - number of bytes to read plus length of the path to object.
+@ext - object path length.
+
+
+@NETFS_CREATE
+Used to create object.
+It does not require that all directories on top of the object were
+already created, it will create them automatically. Each object has
+associated @netfs_path_entry data structure, which contains creation
+mode (permissions and type) and length of the name as long as name itself.
+
+@start - 0
+@size - size of the all data structures needed to create a path
+@id - local inode number
+@ext - 0
+
+
+@NETFS_REMOVE
+Used to remove object.
+
+@ext - length of the path to object.
+@size - the same.
+@id - local inode number.
+@start - zero.
+
+
+@NETFS_LOOKUP
+Lookup information about object on server.
+
+@ext - length of the path to object.
+@size - the same.
+@id - local inode number of the directory to look object in.
+@start - local inode number of the object to look at.
+
+
+@NETFS_LINK
+Create hard of symlink.
+Command is sent as "object_path|target_path".
+
+@size - size of the above string.
+@id - parent local inode number.
+@start - 1 for symlink, 0 for hardlink.
+@ext - size of the "object_path" above.
+
+
+@NETFS_TRANS
+Transaction header.
+
+@size - incorporates all embedded command sizes including theirs header sizes.
+@start - transaction generation number - unique id used to find transaction.
+@ext - transaction flags. Unused at the moment.
+@id - 0.
+
+
+@NETFS_OPEN
+Open intent for given transaction.
+
+@id - local inode number.
+@start - 0.
+@size - path length to the object.
+@ext - open flags (O_RDWR and so on).
+
+
+@NETFS_INODE_INFO
+Metadata update command.
+It is sent to servers when attributes of the object are changed and received
+when data or metadata were updated. It operates with the following structure:
+
+struct netfs_inode_info
+{
+ unsigned int mode;
+ unsigned int nlink;
+ unsigned int uid;
+ unsigned int gid;
+ unsigned int blocksize;
+ unsigned int padding;
+ __u64 ino;
+ __u64 blocks;
+ __u64 rdev;
+ __u64 size;
+ __u64 version;
+};
+
+It effectively mirrors stat(2) returned data.
+
+
+@ext - path length to the object.
+@size - the same plus size of the netfs_inode_info structure.
+@id - local inode number.
+@start - 0.
+
+
+@NETFS_PAGE_CACHE
+Command is only received by clients. It contains information about
+page to be marked as not up-to-date.
+
+@id - client's inode number.
+@start - last byte of the page to be invalidated. If it is not equal to
+ current inode size, it will be vmtruncated().
+@size - 0
+@ext - 0
+
+
+@NETFS_READ_PAGES
+Used to read multiple contiguous pages in one go.
+
+@start - first byte of the contiguous region to read.
+@size - contains of two fields: lower 8 bits are used to represent page cache shift
+ used by client, another 3 bytes are used to get number of pages.
+@id - local inode number.
+@ext - path length to the object.
+
+
+@NETFS_RENAME
+Used to rename object.
+Attached data is formed into following string: "old_path|new_path".
+
+@id - local inode number.
+@start - parent inode number.
+@size - length of the above string.
+@ext - length of the old path part.
+
+
+@NETFS_CAPABILITIES
+Used to exchange crypto capabilities with server.
+If crypto capabilities are not supported by server, then client will disable it
+or fail (if 'crypto_fail_unsupported' mount options was specified).
+
+@id - superblock index. Used to specify crypto information for group of servers.
+@size - size of the attached capabilities structure.
+@start - 0.
+@size - 0.
+@scsize - 0.
+
+@NETFS_LOCK
+Used to send lock request/release messages. Although it sends byte range request
+and is capable of flushing pages based on that, it is not used, since all Linux
+filesystems lock the whole inode.
+
+@id - lock generation number.
+@start - start of the locked range.
+@size - size of the locked range.
+@ext - lock type: read/write. Not used actually. 15'th bit is used to determine,
+ if it is lock request (1) or release (0).
+
+@NETFS_XATTR_SET
+@NETFS_XATTR_GET
+Used to set/get extended attributes for given inode.
+@id - attribute generation number or xattr setting type
+@start - size of the attribute (request or attached)
+@size - name length, path len and data size for given attribute
+@ext - path length for given object
diff --git a/Documentation/ftrace.txt b/Documentation/ftrace.txt
index 803b1318b13..fd9a3e69381 100644
--- a/Documentation/ftrace.txt
+++ b/Documentation/ftrace.txt
@@ -15,31 +15,31 @@ Introduction
Ftrace is an internal tracer designed to help out developers and
designers of systems to find what is going on inside the kernel.
-It can be used for debugging or analyzing latencies and performance
-issues that take place outside of user-space.
+It can be used for debugging or analyzing latencies and
+performance issues that take place outside of user-space.
Although ftrace is the function tracer, it also includes an
-infrastructure that allows for other types of tracing. Some of the
-tracers that are currently in ftrace include a tracer to trace
-context switches, the time it takes for a high priority task to
-run after it was woken up, the time interrupts are disabled, and
-more (ftrace allows for tracer plugins, which means that the list of
-tracers can always grow).
+infrastructure that allows for other types of tracing. Some of
+the tracers that are currently in ftrace include a tracer to
+trace context switches, the time it takes for a high priority
+task to run after it was woken up, the time interrupts are
+disabled, and more (ftrace allows for tracer plugins, which
+means that the list of tracers can always grow).
The File System
---------------
-Ftrace uses the debugfs file system to hold the control files as well
-as the files to display output.
+Ftrace uses the debugfs file system to hold the control files as
+well as the files to display output.
To mount the debugfs system:
# mkdir /debug
# mount -t debugfs nodev /debug
-(Note: it is more common to mount at /sys/kernel/debug, but for simplicity
- this document will use /debug)
+( Note: it is more common to mount at /sys/kernel/debug, but for
+ simplicity this document will use /debug)
That's it! (assuming that you have ftrace configured into your kernel)
@@ -50,90 +50,124 @@ of ftrace. Here is a list of some of the key files:
Note: all time values are in microseconds.
- current_tracer: This is used to set or display the current tracer
- that is configured.
-
- available_tracers: This holds the different types of tracers that
- have been compiled into the kernel. The tracers
- listed here can be configured by echoing their name
- into current_tracer.
-
- tracing_enabled: This sets or displays whether the current_tracer
- is activated and tracing or not. Echo 0 into this
- file to disable the tracer or 1 to enable it.
-
- trace: This file holds the output of the trace in a human readable
- format (described below).
-
- latency_trace: This file shows the same trace but the information
- is organized more to display possible latencies
- in the system (described below).
-
- trace_pipe: The output is the same as the "trace" file but this
- file is meant to be streamed with live tracing.
- Reads from this file will block until new data
- is retrieved. Unlike the "trace" and "latency_trace"
- files, this file is a consumer. This means reading
- from this file causes sequential reads to display
- more current data. Once data is read from this
- file, it is consumed, and will not be read
- again with a sequential read. The "trace" and
- "latency_trace" files are static, and if the
- tracer is not adding more data, they will display
- the same information every time they are read.
-
- trace_options: This file lets the user control the amount of data
- that is displayed in one of the above output
- files.
-
- trace_max_latency: Some of the tracers record the max latency.
- For example, the time interrupts are disabled.
- This time is saved in this file. The max trace
- will also be stored, and displayed by either
- "trace" or "latency_trace". A new max trace will
- only be recorded if the latency is greater than
- the value in this file. (in microseconds)
-
- buffer_size_kb: This sets or displays the number of kilobytes each CPU
- buffer can hold. The tracer buffers are the same size
- for each CPU. The displayed number is the size of the
- CPU buffer and not total size of all buffers. The
- trace buffers are allocated in pages (blocks of memory
- that the kernel uses for allocation, usually 4 KB in size).
- If the last page allocated has room for more bytes
- than requested, the rest of the page will be used,
- making the actual allocation bigger than requested.
- (Note, the size may not be a multiple of the page size due
- to buffer managment overhead.)
-
- This can only be updated when the current_tracer
- is set to "nop".
-
- tracing_cpumask: This is a mask that lets the user only trace
- on specified CPUS. The format is a hex string
- representing the CPUS.
-
- set_ftrace_filter: When dynamic ftrace is configured in (see the
- section below "dynamic ftrace"), the code is dynamically
- modified (code text rewrite) to disable calling of the
- function profiler (mcount). This lets tracing be configured
- in with practically no overhead in performance. This also
- has a side effect of enabling or disabling specific functions
- to be traced. Echoing names of functions into this file
- will limit the trace to only those functions.
-
- set_ftrace_notrace: This has an effect opposite to that of
- set_ftrace_filter. Any function that is added here will not
- be traced. If a function exists in both set_ftrace_filter
- and set_ftrace_notrace, the function will _not_ be traced.
-
- set_ftrace_pid: Have the function tracer only trace a single thread.
-
- available_filter_functions: This lists the functions that ftrace
- has processed and can trace. These are the function
- names that you can pass to "set_ftrace_filter" or
- "set_ftrace_notrace". (See the section "dynamic ftrace"
- below for more details.)
+ current_tracer:
+
+ This is used to set or display the current tracer
+ that is configured.
+
+ available_tracers:
+
+ This holds the different types of tracers that
+ have been compiled into the kernel. The
+ tracers listed here can be configured by
+ echoing their name into current_tracer.
+
+ tracing_enabled:
+
+ This sets or displays whether the current_tracer
+ is activated and tracing or not. Echo 0 into this
+ file to disable the tracer or 1 to enable it.
+
+ trace:
+
+ This file holds the output of the trace in a human
+ readable format (described below).
+
+ latency_trace:
+
+ This file shows the same trace but the information
+ is organized more to display possible latencies
+ in the system (described below).
+
+ trace_pipe:
+
+ The output is the same as the "trace" file but this
+ file is meant to be streamed with live tracing.
+ Reads from this file will block until new data
+ is retrieved. Unlike the "trace" and "latency_trace"
+ files, this file is a consumer. This means reading
+ from this file causes sequential reads to display
+ more current data. Once data is read from this
+ file, it is consumed, and will not be read
+ again with a sequential read. The "trace" and
+ "latency_trace" files are static, and if the
+ tracer is not adding more data, they will display
+ the same information every time they are read.
+
+ trace_options:
+
+ This file lets the user control the amount of data
+ that is displayed in one of the above output
+ files.
+
+ tracing_max_latency:
+
+ Some of the tracers record the max latency.
+ For example, the time interrupts are disabled.
+ This time is saved in this file. The max trace
+ will also be stored, and displayed by either
+ "trace" or "latency_trace". A new max trace will
+ only be recorded if the latency is greater than
+ the value in this file. (in microseconds)
+
+ buffer_size_kb:
+
+ This sets or displays the number of kilobytes each CPU
+ buffer can hold. The tracer buffers are the same size
+ for each CPU. The displayed number is the size of the
+ CPU buffer and not total size of all buffers. The
+ trace buffers are allocated in pages (blocks of memory
+ that the kernel uses for allocation, usually 4 KB in size).
+ If the last page allocated has room for more bytes
+ than requested, the rest of the page will be used,
+ making the actual allocation bigger than requested.
+ ( Note, the size may not be a multiple of the page size
+ due to buffer managment overhead. )
+
+ This can only be updated when the current_tracer
+ is set to "nop".
+
+ tracing_cpumask:
+
+ This is a mask that lets the user only trace
+ on specified CPUS. The format is a hex string
+ representing the CPUS.
+
+ set_ftrace_filter:
+
+ When dynamic ftrace is configured in (see the
+ section below "dynamic ftrace"), the code is dynamically
+ modified (code text rewrite) to disable calling of the
+ function profiler (mcount). This lets tracing be configured
+ in with practically no overhead in performance. This also
+ has a side effect of enabling or disabling specific functions
+ to be traced. Echoing names of functions into this file
+ will limit the trace to only those functions.
+
+ set_ftrace_notrace:
+
+ This has an effect opposite to that of
+ set_ftrace_filter. Any function that is added here will not
+ be traced. If a function exists in both set_ftrace_filter
+ and set_ftrace_notrace, the function will _not_ be traced.
+
+ set_ftrace_pid:
+
+ Have the function tracer only trace a single thread.
+
+ set_graph_function:
+
+ Set a "trigger" function where tracing should start
+ with the function graph tracer (See the section
+ "dynamic ftrace" for more details).
+
+ available_filter_functions:
+
+ This lists the functions that ftrace
+ has processed and can trace. These are the function
+ names that you can pass to "set_ftrace_filter" or
+ "set_ftrace_notrace". (See the section "dynamic ftrace"
+ below for more details.)
The Tracers
@@ -141,36 +175,66 @@ The Tracers
Here is the list of current tracers that may be configured.
- function - function tracer that uses mcount to trace all functions.
+ "function"
+
+ Function call tracer to trace all kernel functions.
+
+ "function_graph_tracer"
+
+ Similar to the function tracer except that the
+ function tracer probes the functions on their entry
+ whereas the function graph tracer traces on both entry
+ and exit of the functions. It then provides the ability
+ to draw a graph of function calls similar to C code
+ source.
- sched_switch - traces the context switches between tasks.
+ "sched_switch"
- irqsoff - traces the areas that disable interrupts and saves
- the trace with the longest max latency.
- See tracing_max_latency. When a new max is recorded,
- it replaces the old trace. It is best to view this
- trace via the latency_trace file.
+ Traces the context switches and wakeups between tasks.
- preemptoff - Similar to irqsoff but traces and records the amount of
- time for which preemption is disabled.
+ "irqsoff"
- preemptirqsoff - Similar to irqsoff and preemptoff, but traces and
- records the largest time for which irqs and/or preemption
- is disabled.
+ Traces the areas that disable interrupts and saves
+ the trace with the longest max latency.
+ See tracing_max_latency. When a new max is recorded,
+ it replaces the old trace. It is best to view this
+ trace via the latency_trace file.
- wakeup - Traces and records the max latency that it takes for
- the highest priority task to get scheduled after
- it has been woken up.
+ "preemptoff"
- nop - This is not a tracer. To remove all tracers from tracing
- simply echo "nop" into current_tracer.
+ Similar to irqsoff but traces and records the amount of
+ time for which preemption is disabled.
+
+ "preemptirqsoff"
+
+ Similar to irqsoff and preemptoff, but traces and
+ records the largest time for which irqs and/or preemption
+ is disabled.
+
+ "wakeup"
+
+ Traces and records the max latency that it takes for
+ the highest priority task to get scheduled after
+ it has been woken up.
+
+ "hw-branch-tracer"
+
+ Uses the BTS CPU feature on x86 CPUs to traces all
+ branches executed.
+
+ "nop"
+
+ This is the "trace nothing" tracer. To remove all
+ tracers from tracing simply echo "nop" into
+ current_tracer.
Examples of using the tracer
----------------------------
-Here are typical examples of using the tracers when controlling them only
-with the debugfs interface (without using any user-land utilities).
+Here are typical examples of using the tracers when controlling
+them only with the debugfs interface (without using any
+user-land utilities).
Output format:
--------------
@@ -187,16 +251,16 @@ Here is an example of the output format of the file "trace"
bash-4251 [01] 10152.583855: _atomic_dec_and_lock <-dput
--------
-A header is printed with the tracer name that is represented by the trace.
-In this case the tracer is "function". Then a header showing the format. Task
-name "bash", the task PID "4251", the CPU that it was running on
-"01", the timestamp in <secs>.<usecs> format, the function name that was
-traced "path_put" and the parent function that called this function
-"path_walk". The timestamp is the time at which the function was
-entered.
+A header is printed with the tracer name that is represented by
+the trace. In this case the tracer is "function". Then a header
+showing the format. Task name "bash", the task PID "4251", the
+CPU that it was running on "01", the timestamp in <secs>.<usecs>
+format, the function name that was traced "path_put" and the
+parent function that called this function "path_walk". The
+timestamp is the time at which the function was entered.
-The sched_switch tracer also includes tracing of task wakeups and
-context switches.
+The sched_switch tracer also includes tracing of task wakeups
+and context switches.
ksoftirqd/1-7 [01] 1453.070013: 7:115:R + 2916:115:S
ksoftirqd/1-7 [01] 1453.070013: 7:115:R + 10:115:S
@@ -205,8 +269,8 @@ context switches.
kondemand/1-2916 [01] 1453.070013: 2916:115:S ==> 7:115:R
ksoftirqd/1-7 [01] 1453.070013: 7:115:S ==> 0:140:R
-Wake ups are represented by a "+" and the context switches are shown as
-"==>". The format is:
+Wake ups are represented by a "+" and the context switches are
+shown as "==>". The format is:
Context switches:
@@ -220,19 +284,20 @@ Wake ups are represented by a "+" and the context switches are shown as
<pid>:<prio>:<state> + <pid>:<prio>:<state>
-The prio is the internal kernel priority, which is the inverse of the
-priority that is usually displayed by user-space tools. Zero represents
-the highest priority (99). Prio 100 starts the "nice" priorities with
-100 being equal to nice -20 and 139 being nice 19. The prio "140" is
-reserved for the idle task which is the lowest priority thread (pid 0).
+The prio is the internal kernel priority, which is the inverse
+of the priority that is usually displayed by user-space tools.
+Zero represents the highest priority (99). Prio 100 starts the
+"nice" priorities with 100 being equal to nice -20 and 139 being
+nice 19. The prio "140" is reserved for the idle task which is
+the lowest priority thread (pid 0).
Latency trace format
--------------------
-For traces that display latency times, the latency_trace file gives
-somewhat more information to see why a latency happened. Here is a typical
-trace.
+For traces that display latency times, the latency_trace file
+gives somewhat more information to see why a latency happened.
+Here is a typical trace.
# tracer: irqsoff
#
@@ -259,20 +324,20 @@ irqsoff latency trace v1.1.5 on 2.6.26-rc8
<idle>-0 0d.s1 98us : trace_hardirqs_on (do_softirq)
+This shows that the current tracer is "irqsoff" tracing the time
+for which interrupts were disabled. It gives the trace version
+and the version of the kernel upon which this was executed on
+(2.6.26-rc8). Then it displays the max latency in microsecs (97
+us). The number of trace entries displayed and the total number
+recorded (both are three: #3/3). The type of preemption that was
+used (PREEMPT). VP, KP, SP, and HP are always zero and are
+reserved for later use. #P is the number of online CPUS (#P:2).
-This shows that the current tracer is "irqsoff" tracing the time fo