Diffstat (limited to 'Documentation/filesystems/ext4.txt')
| -rw-r--r-- | Documentation/filesystems/ext4.txt | 321 | 
1 files changed, 266 insertions, 55 deletions
diff --git a/Documentation/filesystems/ext4.txt b/Documentation/filesystems/ext4.txt
index 6ab9442d7ee..919a3293aaa 100644
--- a/Documentation/filesystems/ext4.txt
+++ b/Documentation/filesystems/ext4.txt
@@ -2,7 +2,7 @@
 Ext4 Filesystem
 ===============
 
-Ext4 is an an advanced level of the ext3 filesystem which incorporates
+Ext4 is an advanced level of the ext3 filesystem which incorporates
 scalability and reliability enhancements for supporting large filesystems
 (64 bit) in keeping with increasing disk capacities and state-of-the-art
 feature requirements.
@@ -68,12 +68,12 @@ Note: More extensive information for getting started with ext4 can be
     '-o barriers=[0|1]' mount option for both ext3 and ext4 filesystems
     for a fair comparison.  When tuning ext3 for best benchmark numbers,
     it is often worthwhile to try changing the data journaling mode; '-o
-    data=writeback,nobh' can be faster for some workloads.  (Note
-    however that running mounted with data=writeback can potentially
-    leave stale data exposed in recently written files in case of an
-    unclean shutdown, which could be a security exposure in some
-    situations.)  Configuring the filesystem with a large journal can
-    also be helpful for metadata-intensive workloads.
+    data=writeback' can be faster for some workloads.  (Note however that
+    running mounted with data=writeback can potentially leave stale data
+    exposed in recently written files in case of an unclean shutdown,
+    which could be a security exposure in some situations.)  Configuring
+    the filesystem with a large journal can also be helpful for
+    metadata-intensive workloads.
 
 2. Features
 ===========
@@ -97,7 +97,7 @@ Note: More extensive information for getting started with ext4 can be
 * Inode allocation using large virtual block groups via flex_bg
 * delayed allocation
 * large block (up to pagesize) support
-* efficent new ordered mode in JBD2 and ext4(avoid using buffer head to force
+* efficient new ordered mode in JBD2 and ext4(avoid using buffer head to force
   the ordering)
 
 [1] Filesystems with a block size of 1k may see a limit imposed by the
@@ -106,7 +106,7 @@ directory hash tree having a maximum depth of two.
 2.2 Candidate features for future inclusion
 
 * Online defrag (patches available but not well tested)
-* reduced mke2fs time via lazy itable initialization in conjuction with
+* reduced mke2fs time via lazy itable initialization in conjunction with
   the uninit_bg feature (capability to do this is available in e2fsprogs
   but a kernel thread to do lazy zeroing of unused inode table blocks
   after filesystem is first mounted is required for safety)
@@ -144,14 +144,12 @@ journal_async_commit	Commit block can be written to disk without waiting
 			mount the device. This will enable 'journal_checksum'
 			internally.
 
-journal=update		Update the ext4 file system's journal to the current
-			format.
-
+journal_path=path
 journal_dev=devnum	When the external journal device's major/minor numbers
-			have changed, this option allows the user to specify
+			have changed, these options allow the user to specify
 			the new journal location.  The journal device is
-			identified through its new major/minor numbers encoded
-			in devnum.
+			identified through either its new major/minor numbers
+			encoded in devnum, or via a path to the device.
 
 norecovery		Don't load the journal on mounting.  Note that
 noload			if the filesystem was not unmounted cleanly,
@@ -160,7 +158,9 @@ noload			if the filesystem was not unmounted cleanly,
                      	lead to any number of problems.
 
 data=journal		All data are committed into the journal prior to being
-			written into the main file system.
+			written into the main file system.  Enabling
+			this mode will disable delayed allocation and
+			O_DIRECT support.
 
 data=ordered	(*)	All data are forced directly out to the main file
 			system prior to its metadata being committed to the
@@ -201,34 +201,16 @@ inode_readahead_blks=n	This tuning parameter controls the maximum
 			table readahead algorithm will pre-read into
 			the buffer cache.  The default value is 32 blocks.
 
-orlov		(*)	This enables the new Orlov block allocator. It is
-			enabled by default.
-
-oldalloc		This disables the Orlov block allocator and enables
-			the old block allocator.  Orlov should have better
-			performance - we'd like to get some feedback if it's
-			the contrary for you.
-
-user_xattr		Enables Extended User Attributes.  Additionally, you
-			need to have extended attribute support enabled in the
-			kernel configuration (CONFIG_EXT4_FS_XATTR).  See the
-			attr(5) manual page and http://acl.bestbits.at/ to
-			learn more about extended attributes.
-
-nouser_xattr		Disables Extended User Attributes.
-
-acl			Enables POSIX Access Control Lists support.
-			Additionally, you need to have ACL support enabled in
-			the kernel configuration (CONFIG_EXT4_FS_POSIX_ACL).
-			See the acl(5) manual page and http://acl.bestbits.at/
-			for more information.
+nouser_xattr		Disables Extended User Attributes.  See the
+			attr(5) manual page and http://acl.bestbits.at/
+			for more information about extended attributes.
 
 noacl			This option disables POSIX Access Control List
-			support.
-
-reservation
-
-noreservation
+			support. If ACL support is enabled in the kernel
+			configuration (CONFIG_EXT4_FS_POSIX_ACL), ACL is
+			enabled by default on mount. See the acl(5) manual
+			page and http://acl.bestbits.at/ for more information
+			about acl.
 
 bsddf		(*)	Make 'df' act like BSD.
 minixdf			Make 'df' act like Minix.
@@ -276,14 +258,6 @@ grpjquota=<file>	during journal replay. They replace the above
 			package for more details
 			(http://sourceforge.net/projects/linuxquota).
 
-bh		(*)	ext4 associates buffer heads to data pages to
-nobh			(a) cache disk block mapping information
-			(b) link pages into transaction to provide
-			    ordering guarantees.
-			"bh" option forces use of buffer heads.
-			"nobh" option tries to avoid associating buffer
-			heads (supported only for "writeback" mode).
-
 stripe=n		Number of filesystem blocks that mballoc will try
 			to use for allocation size and alignment. For RAID5/6
 			systems this should be the number of data
@@ -329,7 +303,7 @@ min_batch_time=usec	This parameter sets the commit time (as
 			fast disks, at the cost of increasing latency.
 
 journal_ioprio=prio	The I/O priority (from 0 to 7, where 0 is the
-			highest priorty) which should be used for I/O
+			highest priority) which should be used for I/O
 			operations submitted by kjournald2 during a
 			commit operation.  This defaults to 3, which is
 			a slightly higher priority than the default I/O
@@ -364,15 +338,54 @@ noinit_itable		Do not initialize any uninitialized inode table
 init_itable=n		The lazy itable init code will wait n times the
 			number of milliseconds it took to zero out the
 			previous block group's inode table.  This
-			minimizes the impact on the systme performance
+			minimizes the impact on the system performance
 			while file system's inode table is being initialized.
 
-discard		Controls whether ext4 should issue discard/TRIM
+discard			Controls whether ext4 should issue discard/TRIM
 nodiscard(*)		commands to the underlying block device when
 			blocks are freed.  This is useful for SSD devices
 			and sparse/thinly-provisioned LUNs, but it is off
 			by default until sufficient testing has been done.
 
+nouid32			Disables 32-bit UIDs and GIDs.  This is for
+			interoperability  with  older kernels which only
+			store and expect 16-bit values.
+
+block_validity		This options allows to enables/disables the in-kernel
+noblock_validity	facility for tracking filesystem metadata blocks
+			within internal data structures. This allows multi-
+			block allocator and other routines to quickly locate
+			extents which might overlap with filesystem metadata
+			blocks. This option is intended for debugging
+			purposes and since it negatively affects the
+			performance, it is off by default.
+
+dioread_lock		Controls whether or not ext4 should use the DIO read
+dioread_nolock		locking. If the dioread_nolock option is specified
+			ext4 will allocate uninitialized extent before buffer
+			write and convert the extent to initialized after IO
+			completes. This approach allows ext4 code to avoid
+			using inode mutex, which improves scalability on high
+			speed storages. However this does not work with
+			data journaling and dioread_nolock option will be
+			ignored with kernel warning. Note that dioread_nolock
+			code path is only used for extent-based files.
+			Because of the restrictions this options comprises
+			it is off by default (e.g. dioread_lock).
+
+max_dir_size_kb=n	This limits the size of directories so that any
+			attempt to expand them beyond the specified
+			limit in kilobytes will cause an ENOSPC error.
+			This is useful in memory constrained
+			environments, where a very large directory can
+			cause severe performance problems or even
+			provoke the Out Of Memory killer.  (For example,
+			if there is only 512mb memory available, a 176mb
+			directory may seriously cramp the system's style.)
+
+i_version		Enable 64-bit inode version support. This option is
+			off by default.
+
 Data Mode
 =========
 There are 3 different data modes:
@@ -397,8 +410,206 @@ written to the journal first, and then to its final location.
 In the event of a crash, the journal can be replayed, bringing both data and
 metadata into a consistent state.  This mode is the slowest except when data
 needs to be read from and written to disk at the same time where it
-outperforms all others modes.  Currently ext4 does not have delayed
-allocation support if this data journalling mode is selected.
+outperforms all others modes.  Enabling this mode will disable delayed
+allocation and O_DIRECT support.
+
+/proc entries
+=============
+
+Information about mounted ext4 file systems can be found in
+/proc/fs/ext4.  Each mounted filesystem will have a directory in
+/proc/fs/ext4 based on its device name (i.e., /proc/fs/ext4/hdc or
+/proc/fs/ext4/dm-0).   The files in each per-device directory are shown
+in table below.
+
+Files in /proc/fs/ext4/<devname>
+..............................................................................
+ File            Content
+ mb_groups       details of multiblock allocator buddy cache of free blocks
+..............................................................................
+
+/sys entries
+============
+
+Information about mounted ext4 file systems can be found in
+/sys/fs/ext4.  Each mounted filesystem will have a directory in
+/sys/fs/ext4 based on its device name (i.e., /sys/fs/ext4/hdc or
+/sys/fs/ext4/dm-0).   The files in each per-device directory are shown
+in table below.
+
+Files in /sys/fs/ext4/<devname>
+(see also Documentation/ABI/testing/sysfs-fs-ext4)
+..............................................................................
+ File                         Content
+
+ delayed_allocation_blocks    This file is read-only and shows the number of
+                              blocks that are dirty in the page cache, but
+                              which do not have their location in the
+                              filesystem allocated yet.
+
+ inode_goal                   Tuning parameter which (if non-zero) controls
+                              the goal inode used by the inode allocator in
+                              preference to all other allocation heuristics.
+                              This is intended for debugging use only, and
+                              should be 0 on production systems.
+
+ inode_readahead_blks         Tuning parameter which controls the maximum
+                              number of inode table blocks that ext4's inode
+                              table readahead algorithm will pre-read into
+                              the buffer cache
+
+ lifetime_write_kbytes        This file is read-only and shows the number of
+                              kilobytes of data that have been written to this
+                              filesystem since it was created.
+
+ max_writeback_mb_bump        The maximum number of megabytes the writeback
+                              code will try to write out before move on to
+                              another inode.
+
+ mb_group_prealloc            The multiblock allocator will round up allocation
+                              requests to a multiple of this tuning parameter if
+                              the stripe size is not set in the ext4 superblock
+
+ mb_max_to_scan               The maximum number of extents the multiblock
+                              allocator will search to find the best extent
+
+ mb_min_to_scan               The minimum number of extents the multiblock
+                              allocator will search to find the best extent
+
+ mb_order2_req                Tuning parameter which controls the minimum size
+                              for requests (as a power of 2) where the buddy
+                              cache is used
+
+ mb_stats                     Controls whether the multiblock allocator should
+                              collect statistics, which are shown during the
+                              unmount. 1 means to collect statistics, 0 means
+                              not to collect statistics
+
+ mb_stream_req                Files which have fewer blocks than this tunable
+                              parameter will have their blocks allocated out
+                              of a block group specific preallocation pool, so
+                              that small files are packed closely together.
+                              Each large file will have its blocks allocated
+                              out of its own unique preallocation pool.
+
+ session_write_kbytes         This file is read-only and shows the number of
+                              kilobytes of data that have been written to this
+                              filesystem since it was mounted.
+
+ reserved_clusters            This is RW file and contains number of reserved
+                              clusters in the file system which will be used
+                              in the specific situations to avoid costly
+                              zeroout, unexpected ENOSPC, or possible data
+                              loss. The default is 2% or 4096 clusters,
+                              whichever is smaller and this can be changed
+                              however it can never exceed number of clusters
+                              in the file system. If there is not enough space
+                              for the reserved space when mounting the file
+                              mount will _not_ fail.
+..............................................................................
+
+Ioctls
+======
+
+There is some Ext4 specific functionality which can be accessed by applications
+through the system call interfaces. The list of all Ext4 specific ioctls are
+shown in the table below.
+
+Table of Ext4 specific ioctls
+..............................................................................
+ Ioctl			      Description
+ EXT4_IOC_GETFLAGS	      Get additional attributes associated with inode.
+			      The ioctl argument is an integer bitfield, with
+			      bit values described in ext4.h. This ioctl is an
+			      alias for FS_IOC_GETFLAGS.
+
+ EXT4_IOC_SETFLAGS	      Set additional attributes associated with inode.
+			      The ioctl argument is an integer bitfield, with
+			      bit values described in ext4.h. This ioctl is an
+			      alias for FS_IOC_SETFLAGS.
+
+ EXT4_IOC_GETVERSION
+ EXT4_IOC_GETVERSION_OLD
+			      Get the inode i_generation number stored for
+			      each inode. The i_generation number is normally
+			      changed only when new inode is created and it is
+			      particularly useful for network filesystems. The
+			      '_OLD' version of this ioctl is an alias for
+			      FS_IOC_GETVERSION.
+
+ EXT4_IOC_SETVERSION
+ EXT4_IOC_SETVERSION_OLD
+			      Set the inode i_generation number stored for
+			      each inode. The '_OLD' version of this ioctl
+			      is an alias for FS_IOC_SETVERSION.
+
+ EXT4_IOC_GROUP_EXTEND	      This ioctl has the same purpose as the resize
+			      mount option. It allows to resize filesystem
+			      to the end of the last existing block group,
+			      further resize has to be done with resize2fs,
+			      either online, or offline. The argument points
+			      to the unsigned logn number representing the
+			      filesystem new block count.
+
+ EXT4_IOC_MOVE_EXT	      Move the block extents from orig_fd (the one
+			      this ioctl is pointing to) to the donor_fd (the
+			      one specified in move_extent structure passed
+			      as an argument to this ioctl). Then, exchange
+			      inode metadata between orig_fd and donor_fd.
+			      This is especially useful for online
+			      defragmentation, because the allocator has the
+			      opportunity to allocate moved blocks better,
+			      ideally into one contiguous extent.
+
+ EXT4_IOC_GROUP_ADD	      Add a new group descriptor to an existing or
+			      new group descriptor block. The new group
+			      descriptor is described by ext4_new_group_input
+			      structure, which is passed as an argument to
+			      this ioctl. This is especially useful in
+			      conjunction with EXT4_IOC_GROUP_EXTEND,
+			      which allows online resize of the filesystem
+			      to the end of the last existing block group.
+			      Those two ioctls combined is used in userspace
+			      online resize tool (e.g. resize2fs).
+
+ EXT4_IOC_MIGRATE	      This ioctl operates on the filesystem itself.
+			      It converts (migrates) ext3 indirect block mapped
+			      inode to ext4 extent mapped inode by walking
+			      through indirect block mapping of the original
+			      inode and converting contiguous block ranges
+			      into ext4 extents of the temporary inode. Then,
+			      inodes are swapped. This ioctl might help, when
+			      migrating from ext3 to ext4 filesystem, however
+			      suggestion is to create fresh ext4 filesystem
+			      and copy data from the backup. Note, that
+			      filesystem has to support extents for this ioctl
+			      to work.
+
+ EXT4_IOC_ALLOC_DA_BLKS	      Force all of the delay allocated blocks to be
+			      allocated to preserve application-expected ext3
+			      behaviour. Note that this will also start
+			      triggering a write of the data blocks, but this
+			      behaviour may change in the future as it is
+			      not necessary and has been done this way only
+			      for sake of simplicity.
+
+ EXT4_IOC_RESIZE_FS	      Resize the filesystem to a new size.  The number
+			      of blocks of resized filesystem is passed in via
+			      64 bit integer argument.  The kernel allocates
+			      bitmaps and inode table, the userspace tool thus
+			      just passes the new number of blocks.
+
+EXT4_IOC_SWAP_BOOT	      Swap i_blocks and associated attributes
+			      (like i_blocks, i_size, i_flags, ...) from
+			      the specified inode with inode
+			      EXT4_BOOT_LOADER_INO (#5). This is typically
+			      used to store a boot loader in a secure part of
+			      the filesystem, where it can't be changed by a
+			      normal user by accident.
+			      The data blocks of the previous boot loader
+			      will be associated with the given inode.
+
+..............................................................................
 
 References
 ==========
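Reviewer's illustration, not part of the patch: the "/sys entries" section added above describes one directory per mounted filesystem under /sys/fs/ext4, with one small file per tunable or counter. A minimal sketch of how a script might dump those files is shown below. The helper name read_ext4_sysfs and its root parameter are hypothetical and exist only for this example; "dm-0" is simply the example device name used in the documentation text, and which files are actually present will depend on the kernel in use.

```python
from pathlib import Path


def read_ext4_sysfs(devname, root="/sys/fs/ext4"):
    """Return {file name: contents} for one per-device ext4 sysfs directory.

    Hypothetical helper for illustration; `root` is a parameter so the
    function can be pointed at a test directory instead of the real sysfs.
    """
    values = {}
    for entry in sorted(Path(root, devname).iterdir()):
        if not entry.is_file():
            continue
        try:
            values[entry.name] = entry.read_text().strip()
        except OSError:
            # Some attributes may be unreadable (for example write-only); skip them.
            continue
    return values


if __name__ == "__main__":
    # "dm-0" is only the example device name from the documentation text.
    try:
        for name, value in read_ext4_sysfs("dm-0").items():
            print(f"{name}: {value}")
    except OSError:
        print("could not read /sys/fs/ext4/dm-0 (no such ext4 mount?)")
```

On a system with an ext4 filesystem mounted from dm-0, this would print entries such as session_write_kbytes and lifetime_write_kbytes alongside the mb_* tunables listed in the table.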
