Diffstat (limited to 'Documentation/filesystems/ext4.txt')
-rw-r--r-- Documentation/filesystems/ext4.txt 321
1 files changed, 266 insertions, 55 deletions
diff --git a/Documentation/filesystems/ext4.txt b/Documentation/filesystems/ext4.txt
index 6ab9442d7ee..919a3293aaa 100644
--- a/Documentation/filesystems/ext4.txt
+++ b/Documentation/filesystems/ext4.txt
@@ -2,7 +2,7 @@
Ext4 Filesystem
===============
-Ext4 is an an advanced level of the ext3 filesystem which incorporates
+Ext4 is an advanced level of the ext3 filesystem which incorporates
scalability and reliability enhancements for supporting large filesystems
(64 bit) in keeping with increasing disk capacities and state-of-the-art
feature requirements.
@@ -68,12 +68,12 @@ Note: More extensive information for getting started with ext4 can be
'-o barriers=[0|1]' mount option for both ext3 and ext4 filesystems
for a fair comparison. When tuning ext3 for best benchmark numbers,
it is often worthwhile to try changing the data journaling mode; '-o
- data=writeback,nobh' can be faster for some workloads. (Note
- however that running mounted with data=writeback can potentially
- leave stale data exposed in recently written files in case of an
- unclean shutdown, which could be a security exposure in some
- situations.) Configuring the filesystem with a large journal can
- also be helpful for metadata-intensive workloads.
+ data=writeback' can be faster for some workloads. (Note however that
+ running mounted with data=writeback can potentially leave stale data
+ exposed in recently written files in case of an unclean shutdown,
+ which could be a security exposure in some situations.) Configuring
+ the filesystem with a large journal can also be helpful for
+ metadata-intensive workloads.
2. Features
===========
@@ -97,7 +97,7 @@ Note: More extensive information for getting started with ext4 can be
* Inode allocation using large virtual block groups via flex_bg
* delayed allocation
* large block (up to pagesize) support
-* efficent new ordered mode in JBD2 and ext4(avoid using buffer head to force
+* efficient new ordered mode in JBD2 and ext4(avoid using buffer head to force
the ordering)
[1] Filesystems with a block size of 1k may see a limit imposed by the
@@ -106,7 +106,7 @@ directory hash tree having a maximum depth of two.
2.2 Candidate features for future inclusion
* Online defrag (patches available but not well tested)
-* reduced mke2fs time via lazy itable initialization in conjuction with
+* reduced mke2fs time via lazy itable initialization in conjunction with
the uninit_bg feature (capability to do this is available in e2fsprogs
but a kernel thread to do lazy zeroing of unused inode table blocks
after filesystem is first mounted is required for safety)
@@ -144,14 +144,12 @@ journal_async_commit Commit block can be written to disk without waiting
mount the device. This will enable 'journal_checksum'
internally.
-journal=update Update the ext4 file system's journal to the current
- format.
-
+journal_path=path
journal_dev=devnum When the external journal device's major/minor numbers
- have changed, this option allows the user to specify
+ have changed, these options allow the user to specify
the new journal location. The journal device is
- identified through its new major/minor numbers encoded
- in devnum.
+ identified through either its new major/minor numbers
+ encoded in devnum, or via a path to the device.
norecovery Don't load the journal on mounting. Note that
noload if the filesystem was not unmounted cleanly,
@@ -160,7 +158,9 @@ noload if the filesystem was not unmounted cleanly,
lead to any number of problems.
data=journal All data are committed into the journal prior to being
- written into the main file system.
+ written into the main file system. Enabling
+ this mode will disable delayed allocation and
+ O_DIRECT support.
data=ordered (*) All data are forced directly out to the main file
system prior to its metadata being committed to the
@@ -201,34 +201,16 @@ inode_readahead_blks=n This tuning parameter controls the maximum
table readahead algorithm will pre-read into
the buffer cache. The default value is 32 blocks.
-orlov (*) This enables the new Orlov block allocator. It is
- enabled by default.
-
-oldalloc This disables the Orlov block allocator and enables
- the old block allocator. Orlov should have better
- performance - we'd like to get some feedback if it's
- the contrary for you.
-
-user_xattr Enables Extended User Attributes. Additionally, you
- need to have extended attribute support enabled in the
- kernel configuration (CONFIG_EXT4_FS_XATTR). See the
- attr(5) manual page and http://acl.bestbits.at/ to
- learn more about extended attributes.
-
-nouser_xattr Disables Extended User Attributes.
-
-acl Enables POSIX Access Control Lists support.
- Additionally, you need to have ACL support enabled in
- the kernel configuration (CONFIG_EXT4_FS_POSIX_ACL).
- See the acl(5) manual page and http://acl.bestbits.at/
- for more information.
+nouser_xattr Disables Extended User Attributes. See the
+ attr(5) manual page and http://acl.bestbits.at/
+ for more information about extended attributes.
noacl This option disables POSIX Access Control List
- support.
-
-reservation
-
-noreservation
+ support. If ACL support is enabled in the kernel
+ configuration (CONFIG_EXT4_FS_POSIX_ACL), ACL is
+ enabled by default on mount. See the acl(5) manual
+ page and http://acl.bestbits.at/ for more information
+ about ACLs.
bsddf (*) Make 'df' act like BSD.
minixdf Make 'df' act like Minix.
@@ -276,14 +258,6 @@ grpjquota=<file> during journal replay. They replace the above
package for more details
(http://sourceforge.net/projects/linuxquota).
-bh (*) ext4 associates buffer heads to data pages to
-nobh (a) cache disk block mapping information
- (b) link pages into transaction to provide
- ordering guarantees.
- "bh" option forces use of buffer heads.
- "nobh" option tries to avoid associating buffer
- heads (supported only for "writeback" mode).
-
stripe=n Number of filesystem blocks that mballoc will try
to use for allocation size and alignment. For RAID5/6
systems this should be the number of data
@@ -329,7 +303,7 @@ min_batch_time=usec This parameter sets the commit time (as
fast disks, at the cost of increasing latency.
journal_ioprio=prio The I/O priority (from 0 to 7, where 0 is the
- highest priorty) which should be used for I/O
+ highest priority) which should be used for I/O
operations submitted by kjournald2 during a
commit operation. This defaults to 3, which is
a slightly higher priority than the default I/O
@@ -364,15 +338,54 @@ noinit_itable Do not initialize any uninitialized inode table
init_itable=n The lazy itable init code will wait n times the
number of milliseconds it took to zero out the
previous block group's inode table. This
- minimizes the impact on the systme performance
+ minimizes the impact on the system performance
while file system's inode table is being initialized.
-discard Controls whether ext4 should issue discard/TRIM
+discard Controls whether ext4 should issue discard/TRIM
nodiscard(*) commands to the underlying block device when
blocks are freed. This is useful for SSD devices
and sparse/thinly-provisioned LUNs, but it is off
by default until sufficient testing has been done.
+nouid32 Disables 32-bit UIDs and GIDs. This is for
+ interoperability with older kernels which only
+ store and expect 16-bit values.
+
+block_validity These options enable or disable the in-kernel
+noblock_validity facility for tracking filesystem metadata blocks
+ within internal data structures. This allows the
+ multi-block allocator and other routines to quickly
+ locate extents which might overlap with filesystem
+ metadata blocks. These options are intended for
+ debugging purposes, and since the facility
+ negatively affects performance, it is off by default.
+
+dioread_lock Controls whether or not ext4 should use the DIO read
+dioread_nolock locking. If the dioread_nolock option is specified,
+ ext4 will allocate an uninitialized extent before a
+ buffer write and convert the extent to initialized
+ after the IO completes. This approach allows ext4
+ code to avoid using the inode mutex, which improves
+ scalability on high-speed storage. However, this
+ does not work with data journaling, and the
+ dioread_nolock option will be ignored with a kernel
+ warning. Note that the dioread_nolock code path is
+ only used for extent-based files. Because of the
+ restrictions this option imposes, it is off by
+ default (i.e. dioread_lock).
+
+max_dir_size_kb=n This limits the size of directories so that any
+ attempt to expand them beyond the specified
+ limit in kilobytes will cause an ENOSPC error.
+ This is useful in memory-constrained
+ environments, where a very large directory can
+ cause severe performance problems or even
+ provoke the Out Of Memory killer. (For example,
+ if there is only 512MB memory available, a 176MB
+ directory may seriously cramp the system's style.)
+
+i_version Enable 64-bit inode version support. This option is
+ off by default.
+
Data Mode
=========
There are 3 different data modes:
@@ -397,8 +410,206 @@ written to the journal first, and then to its final location.
In the event of a crash, the journal can be replayed, bringing both data and
metadata into a consistent state. This mode is the slowest except when data
needs to be read from and written to disk at the same time where it
-outperforms all others modes. Currently ext4 does not have delayed
-allocation support if this data journalling mode is selected.
+outperforms all other modes. Enabling this mode will disable delayed
+allocation and O_DIRECT support.
+
+/proc entries
+=============
+
+Information about mounted ext4 file systems can be found in
+/proc/fs/ext4. Each mounted filesystem will have a directory in
+/proc/fs/ext4 based on its device name (e.g., /proc/fs/ext4/hdc or
+/proc/fs/ext4/dm-0). The files in each per-device directory are shown
+in the table below.
+
+Files in /proc/fs/ext4/<devname>
+..............................................................................
+ File Content
+ mb_groups details of multiblock allocator buddy cache of free blocks
+..............................................................................
+
+/sys entries
+============
+
+Information about mounted ext4 file systems can be found in
+/sys/fs/ext4. Each mounted filesystem will have a directory in
+/sys/fs/ext4 based on its device name (e.g., /sys/fs/ext4/hdc or
+/sys/fs/ext4/dm-0). The files in each per-device directory are shown
+in the table below.
+
+Files in /sys/fs/ext4/<devname>
+(see also Documentation/ABI/testing/sysfs-fs-ext4)
+..............................................................................
+ File Content
+
+ delayed_allocation_blocks This file is read-only and shows the number of
+ blocks that are dirty in the page cache, but
+ which do not have their location in the
+ filesystem allocated yet.
+
+ inode_goal Tuning parameter which (if non-zero) controls
+ the goal inode used by the inode allocator in
+ preference to all other allocation heuristics.
+ This is intended for debugging use only, and
+ should be 0 on production systems.
+
+ inode_readahead_blks Tuning parameter which controls the maximum
+ number of inode table blocks that ext4's inode
+ table readahead algorithm will pre-read into
+ the buffer cache
+
+ lifetime_write_kbytes This file is read-only and shows the number of
+ kilobytes of data that have been written to this
+ filesystem since it was created.
+
+ max_writeback_mb_bump The maximum number of megabytes the writeback
+ code will try to write out before moving on to
+ another inode.
+
+ mb_group_prealloc The multiblock allocator will round up allocation
+ requests to a multiple of this tuning parameter if
+ the stripe size is not set in the ext4 superblock
+
+ mb_max_to_scan The maximum number of extents the multiblock
+ allocator will search to find the best extent
+
+ mb_min_to_scan The minimum number of extents the multiblock
+ allocator will search to find the best extent
+
+ mb_order2_req Tuning parameter which controls the minimum size
+ for requests (as a power of 2) where the buddy
+ cache is used
+
+ mb_stats Controls whether the multiblock allocator should
+ collect statistics, which are shown during the
+ unmount. 1 means to collect statistics, 0 means
+ not to collect statistics
+
+ mb_stream_req Files which have fewer blocks than this tunable
+ parameter will have their blocks allocated out
+ of a block group specific preallocation pool, so
+ that small files are packed closely together.
+ Each large file will have its blocks allocated
+ out of its own unique preallocation pool.
+
+ session_write_kbytes This file is read-only and shows the number of
+ kilobytes of data that have been written to this
+ filesystem since it was mounted.
+
+ reserved_clusters This file is read-write and contains the number
+ of reserved clusters in the file system, which
+ are used in specific situations to avoid costly
+ zeroout, unexpected ENOSPC, or possible data
+ loss. The default is 2% or 4096 clusters,
+ whichever is smaller. This can be changed, but
+ it can never exceed the number of clusters in
+ the file system. If there is not enough space
+ for the reserved clusters when mounting the file
+ system, the mount will _not_ fail.
+..............................................................................
+
+Ioctls
+======
+
+There is some Ext4 specific functionality which can be accessed by applications
+through the system call interfaces. The list of all Ext4 specific ioctls is
+shown in the table below.
+
+Table of Ext4 specific ioctls
+..............................................................................
+ Ioctl Description
+ EXT4_IOC_GETFLAGS Get additional attributes associated with inode.
+ The ioctl argument is an integer bitfield, with
+ bit values described in ext4.h. This ioctl is an
+ alias for FS_IOC_GETFLAGS.
+
+ EXT4_IOC_SETFLAGS Set additional attributes associated with inode.
+ The ioctl argument is an integer bitfield, with
+ bit values described in ext4.h. This ioctl is an
+ alias for FS_IOC_SETFLAGS.
+
+ EXT4_IOC_GETVERSION
+ EXT4_IOC_GETVERSION_OLD
+ Get the inode i_generation number stored for
+ each inode. The i_generation number is normally
+ changed only when a new inode is created, and it
+ is particularly useful for network filesystems.
+ The '_OLD' version of this ioctl is an alias for
+ FS_IOC_GETVERSION.
+
+ EXT4_IOC_SETVERSION
+ EXT4_IOC_SETVERSION_OLD
+ Set the inode i_generation number stored for
+ each inode. The '_OLD' version of this ioctl
+ is an alias for FS_IOC_SETVERSION.
+
+ EXT4_IOC_GROUP_EXTEND This ioctl has the same purpose as the resize
+ mount option. It allows resizing the filesystem
+ to the end of the last existing block group;
+ further resizing has to be done with resize2fs,
+ either online or offline. The argument points
+ to an unsigned long number representing the
+ filesystem's new block count.
+
+ EXT4_IOC_MOVE_EXT Move the block extents from orig_fd (the one
+ this ioctl is pointing to) to the donor_fd (the
+ one specified in move_extent structure passed
+ as an argument to this ioctl). Then, exchange
+ inode metadata between orig_fd and donor_fd.
+ This is especially useful for online
+ defragmentation, because the allocator has the
+ opportunity to allocate moved blocks better,
+ ideally into one contiguous extent.
+
+ EXT4_IOC_GROUP_ADD Add a new group descriptor to an existing or
+ new group descriptor block. The new group
+ descriptor is described by ext4_new_group_input
+ structure, which is passed as an argument to
+ this ioctl. This is especially useful in
+ conjunction with EXT4_IOC_GROUP_EXTEND,
+ which allows online resize of the filesystem
+ to the end of the last existing block group.
+ Those two ioctls combined are used in the
+ userspace online resize tool (e.g. resize2fs).
+
+ EXT4_IOC_MIGRATE This ioctl operates on the filesystem itself.
+ It converts (migrates) an ext3 indirect block
+ mapped inode to an ext4 extent mapped inode by
+ walking through the indirect block mapping of
+ the original inode and converting contiguous
+ block ranges into ext4 extents of a temporary
+ inode. Then, the inodes are swapped. This ioctl
+ might help when migrating from an ext3 to an
+ ext4 filesystem; however, the suggestion is to
+ create a fresh ext4 filesystem and copy data
+ from the backup. Note that the filesystem has
+ to support extents for this ioctl to work.
+
+ EXT4_IOC_ALLOC_DA_BLKS Force all of the delay allocated blocks to be
+ allocated to preserve application-expected ext3
+ behaviour. Note that this will also start
+ triggering a write of the data blocks, but this
+ behaviour may change in the future as it is
+ not necessary and has been done this way only
+ for the sake of simplicity.
+
+ EXT4_IOC_RESIZE_FS Resize the filesystem to a new size. The number
+ of blocks of the resized filesystem is passed in
+ via a 64 bit integer argument. The kernel
+ allocates bitmaps and the inode table; the
+ userspace tool thus just passes the new number
+ of blocks.
+
+EXT4_IOC_SWAP_BOOT Swap i_blocks and associated attributes
+ (like i_blocks, i_size, i_flags, ...) from
+ the specified inode with inode
+ EXT4_BOOT_LOADER_INO (#5). This is typically
+ used to store a boot loader in a secure part of
+ the filesystem, where it can't be changed by a
+ normal user by accident.
+ The data blocks of the previous boot loader
+ will be associated with the given inode.
+
+..............................................................................
References
==========
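
The ioctl table added in the last hunk states that EXT4_IOC_GETFLAGS is an
alias for FS_IOC_GETFLAGS and that its argument is an integer bitfield. As a
rough illustration only (it is not part of this patch or of ext4.txt), the
following Python sketch shows how an application might issue that call. The
helper names ioc_read and get_inode_flags are hypothetical, and the request
number is built under the assumption of the generic Linux ioctl encoding used
on x86 and ARM; some architectures encode the direction bits differently.

```python
import fcntl
import struct

# Assumption: generic Linux _IOC bit layout (x86, ARM, ...). A few
# architectures (e.g. PowerPC, MIPS, SPARC) use different direction
# bits, so this sketch is not portable to them.
_IOC_NRSHIFT = 0
_IOC_TYPESHIFT = 8
_IOC_SIZESHIFT = 16
_IOC_DIRSHIFT = 30
_IOC_READ = 2


def ioc_read(ioc_type, nr, size):
    """Build a request number like the C macro _IOR(type, nr, size)."""
    return ((_IOC_READ << _IOC_DIRSHIFT)
            | (size << _IOC_SIZESHIFT)
            | (ord(ioc_type) << _IOC_TYPESHIFT)
            | (nr << _IOC_NRSHIFT))


def get_inode_flags(path):
    """Return the inode attribute bitfield of `path` (hypothetical helper)."""
    # <linux/fs.h> declares FS_IOC_GETFLAGS as _IOR('f', 1, long).
    long_size = struct.calcsize("l")
    request = ioc_read("f", 1, long_size)
    buf = bytearray(long_size)
    with open(path, "rb") as f:
        # The mutable buffer is filled in place by the kernel.
        fcntl.ioctl(f.fileno(), request, buf)
    # Assumption: the kernel writes a 32-bit int at the start of the buffer.
    return struct.unpack_from("I", buf)[0]
```

On a filesystem that supports the call, get_inode_flags() on a regular file
would return the bitfield whose bit values are described in ext4.h; where the
ioctl is not supported, fcntl.ioctl raises OSError instead.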