diff options
Diffstat (limited to 'Documentation/filesystems')
101 files changed, 16247 insertions, 2769 deletions
diff --git a/Documentation/filesystems/00-INDEX b/Documentation/filesystems/00-INDEX index e68021c08fb..ac28149aede 100644 --- a/Documentation/filesystems/00-INDEX +++ b/Documentation/filesystems/00-INDEX @@ -1,9 +1,9 @@ 00-INDEX - this file (info on some of the filesystems supported by linux). -Exporting - - explanation of how to make filesystems exportable. Locking - info on locking rules as they pertain to Linux VFS. +Makefile + - Makefile for building the filsystems-part of DocBook. 9p.txt - 9p (v9fs) is an implementation of the Plan 9 remote fs protocol. adfs.txt @@ -12,40 +12,64 @@ afs.txt - info and examples for the distributed AFS (Andrew File System) fs. affs.txt - info and mount options for the Amiga Fast File System. +autofs4-mount-control.txt + - info on device control operations for autofs4 module. automount-support.txt - information about filesystem automount support. befs.txt - information about the BeOS filesystem for Linux. bfs.txt - info for the SCO UnixWare Boot Filesystem (BFS). -cifs.txt - - description of the CIFS filesystem. +btrfs.txt + - info for the BTRFS filesystem. +caching/ + - directory containing filesystem cache documentation. +ceph.txt + - info for the Ceph Distributed File System. +cifs/ + - directory containing CIFS filesystem documentation and example code. coda.txt - description of the CODA filesystem. configfs/ - directory containing configfs documentation and example code. cramfs.txt - info on the cram filesystem for small storage (ROMs etc). -dentry-locking.txt - - info on the RCU-based dcache locking model. +debugfs.txt + - info on the debugfs filesystem. +devpts.txt + - info on the devpts filesystem. directory-locking - info about the locking scheme used for directory operations. dlmfs.txt - info on the userspace interface to the OCFS2 DLM. dnotify.txt - info about directory notification in Linux. +dnotify_test.c + - example program for dnotify. ecryptfs.txt - docs on eCryptfs: stacked cryptographic filesystem for Linux. +efivarfs.txt + - info for the efivarfs filesystem. +exofs.txt + - info, usage, mount options, design about EXOFS. ext2.txt - info, mount options and specifications for the Ext2 filesystem. ext3.txt - info, mount options and specifications for the Ext3 filesystem. ext4.txt - info, mount options and specifications for the Ext4 filesystem. +f2fs.txt + - info and mount options for the F2FS filesystem. +fiemap.txt + - info on fiemap ioctl. files.txt - info on file management in the Linux kernel. fuse.txt - info on the Filesystem in User SpacE including mount options. +gfs2-glocks.txt + - info on the Global File System 2 - Glock internal locking rules. +gfs2-uevents.txt + - info on the Global File System 2 - uevents. gfs2.txt - info on the Global File System 2. hfs.txt @@ -62,48 +86,72 @@ jfs.txt - info and mount options for the JFS filesystem. locks.txt - info on file locking implementations, flock() vs. fcntl(), etc. +logfs.txt + - info on the LogFS flash filesystem. mandatory-locking.txt - info on the Linux implementation of Sys V mandatory file locking. ncpfs.txt - info on Novell Netware(tm) filesystem using NCP protocol. +nfs/ + - nfs-related documentation. +nilfs2.txt + - info and mount options for the NILFS2 filesystem. ntfs.txt - info and mount options for the NTFS filesystem (Windows NT). ocfs2.txt - info and mount options for the OCFS2 clustered filesystem. +omfs.txt + - info on the Optimized MPEG FileSystem. +path-lookup.txt + - info on path walking and name lookup locking. +pohmelfs/ + - directory containing pohmelfs filesystem documentation. porting - various information on filesystem porting. proc.txt - info on Linux's /proc filesystem. +qnx6.txt + - info on the QNX6 filesystem. +quota.txt + - info on Quota subsystem. ramfs-rootfs-initramfs.txt - info on the 'in memory' filesystems ramfs, rootfs and initramfs. -reiser4.txt - - info on the Reiser4 filesystem based on dancing tree algorithms. relay.txt - info on relay, for efficient streaming from kernel to user space. romfs.txt - description of the ROMFS filesystem. +seq_file.txt + - how to use the seq_file API. sharedsubtree.txt - a description of shared subtrees for namespaces. -smbfs.txt - - info on using filesystems with the SMB protocol (Win 3.11 and NT). spufs.txt - info and mount options for the SPU filesystem used on Cell. +squashfs.txt + - info on the squashfs filesystem. sysfs-pci.txt - info on accessing PCI device resources through sysfs. +sysfs-tagging.txt + - info on sysfs tagging to avoid duplicates. sysfs.txt - info on sysfs, a ram-based filesystem for exporting kernel objects. sysv-fs.txt - info on the SystemV/V7/Xenix/Coherent filesystem. tmpfs.txt - info on tmpfs, a filesystem that holds all files in virtual memory. +ubifs.txt + - info on the Unsorted Block Images FileSystem. udf.txt - info and mount options for the UDF filesystem. ufs.txt - info on the ufs filesystem. vfat.txt - - info on using the VFAT filesystem used in Windows NT and Windows 95 + - info on using the VFAT filesystem used in Windows NT and Windows 95. vfs.txt - - overview of the Virtual File System + - overview of the Virtual File System. +xfs-delayed-logging-design.txt + - info on the XFS Delayed Logging Design. +xfs-self-describing-metadata.txt + - info on XFS Self Describing Metadata. xfs.txt - info and mount options for the XFS filesystem. xip.txt diff --git a/Documentation/filesystems/9p.txt b/Documentation/filesystems/9p.txt index bf8080640eb..fec7144e817 100644 --- a/Documentation/filesystems/9p.txt +++ b/Documentation/filesystems/9p.txt @@ -18,13 +18,15 @@ the 9p client is available in the form of a USENIX paper: Other applications are described in the following papers: * XCPU & Clustering - http://www.xcpu.org/xcpu-talk.pdf + http://xcpu.org/papers/xcpu-talk.pdf * KVMFS: control file system for KVM - http://www.xcpu.org/kvmfs.pdf - * CellFS: A New ProgrammingModel for the Cell BE - http://www.xcpu.org/cellfs-talk.pdf + http://xcpu.org/papers/kvmfs.pdf + * CellFS: A New Programming Model for the Cell BE + http://xcpu.org/papers/cellfs-talk.pdf * PROSE I/O: Using 9p to enable Application Partitions http://plan9.escet.urjc.es/iwp9/cready/PROSE_iwp9_2006.pdf + * VirtFS: A Virtualization Aware File System pass-through + http://goo.gl/3WPDg USAGE ===== @@ -37,6 +39,15 @@ For Plan 9 From User Space applications (http://swtch.com/plan9) mount -t 9p `namespace`/acme /mnt/9 -o trans=unix,uname=$USER +For server running on QEMU host with virtio transport: + + mount -t 9p -o trans=virtio <mount_tag> /mnt/9 + +where mount_tag is the tag associated by the server to each of the exported +mount points. Each 9P export is seen by the client as a virtio device with an +associated "mount_tag" property. Available mount tags can be +seen by reading /sys/bus/virtio/drivers/9pnet_virtio/virtio<n>/mount_tag files. + OPTIONS ======= @@ -47,7 +58,8 @@ OPTIONS fd - used passed file descriptors for connection (see rfdno and wfdno) virtio - connect to the next virtio channel available - (from lguest or KVM with trans_virtio module) + (from QEMU with trans_virtio module) + rdma - connect to a specified RDMA channel uname=name user name to attempt mount as on the remote server. The server may override or ignore this value. Certain user @@ -57,28 +69,43 @@ OPTIONS offering several exported file systems. cache=mode specifies a caching policy. By default, no caches are used. + none = default no cache policy, metadata and data + alike are synchronous. loose = no attempts are made at consistency, intended for exclusive, read-only mounts + fscache = use FS-Cache for a persistent, read-only + cache backend. + mmap = minimal cache that is only used for read-write + mmap. Northing else is cached, like cache=none debug=n specifies debug level. The debug level is a bitmask. - 0x01 = display verbose error messages - 0x02 = developer debug (DEBUG_CURRENT) - 0x04 = display 9p trace - 0x08 = display VFS trace - 0x10 = display Marshalling debug - 0x20 = display RPC debug - 0x40 = display transport debug - 0x80 = display allocation debug + 0x01 = display verbose error messages + 0x02 = developer debug (DEBUG_CURRENT) + 0x04 = display 9p trace + 0x08 = display VFS trace + 0x10 = display Marshalling debug + 0x20 = display RPC debug + 0x40 = display transport debug + 0x80 = display allocation debug + 0x100 = display protocol message debug + 0x200 = display Fid debug + 0x400 = display packet debug + 0x800 = display fscache tracing debug rfdno=n the file descriptor for reading with trans=fd wfdno=n the file descriptor for writing with trans=fd - maxdata=n the number of bytes to use for 9p packet payload (msize) + msize=n the number of bytes to use for 9p packet payload port=n port to connect to on the remote server - noextend force legacy mode (no 9p2000.u semantics) + noextend force legacy mode (no 9p2000.u or 9p2000.L semantics) + + version=name Select 9P protocol version. Valid options are: + 9p2000 - Legacy mode (same as noextend) + 9p2000.u - Use 9P2000.u protocol + 9p2000.L - Use 9P2000.L protocol dfltuid attempt to mount as a particular uid @@ -90,7 +117,7 @@ OPTIONS This can be used to share devices/named pipes/sockets between hosts. This functionality will be expanded in later versions. - access there are three access modes. + access there are four access modes. user = if a user tries to access a file on v9fs filesystem for the first time, v9fs sends an attach command (Tattach) for that user. @@ -99,34 +126,32 @@ OPTIONS the files on the mounted filesystem any = v9fs does single attach and performs all operations as one user + client = ACL based access check on the 9p client + side for access validation + + cachetag cache tag to use the specified persistent cache. + cache tags for existing cache sessions can be listed at + /sys/fs/9p/caches. (applies only to cache=fscache) RESOURCES ========= -Our current recommendation is to use Inferno (http://www.vitanuova.com/inferno) -as the 9p server. You can start a 9p server under Inferno by issuing the -following command: - ; styxlisten -A tcp!*!564 export '#U*' +Protocol specifications are maintained on github: +http://ericvh.github.com/9p-rfc/ -The -A specifies an unauthenticated export. The 564 is the port # (you may -have to choose a higher port number if running as a normal user). The '#U*' -specifies exporting the root of the Linux name space. You may specify a -subset of the namespace by extending the path: '#U*'/tmp would just export -/tmp. For more information, see the Inferno manual pages covering styxlisten -and export. +9p client and server implementations are listed on +http://9p.cat-v.org/implementations -A Linux version of the 9p server is now maintained under the npfs project -on sourceforge (http://sourceforge.net/projects/npfs). The currently -maintained version is the single-threaded version of the server (named spfs) -available from the same CVS repository. +A 9p2000.L server is being developed by LLNL and can be found +at http://code.google.com/p/diod/ There are user and developer mailing lists available through the v9fs project on sourceforge (http://sourceforge.net/projects/v9fs). -News and other information is maintained on SWiK (http://swik.net/v9fs). +News and other information is maintained on a Wiki. +(http://sf.net/apps/mediawiki/v9fs/index.php). -Bug reports may be issued through the kernel.org bugzilla -(http://bugzilla.kernel.org) +Bug reports are best issued via the mailing list. For more information on the Plan 9 Operating System check out http://plan9.bell-labs.com/plan9 @@ -134,11 +159,3 @@ http://plan9.bell-labs.com/plan9 For information on Plan 9 from User Space (Plan 9 applications and libraries ported to Linux/BSD/OSX/etc) check out http://swtch.com/plan9 - -STATUS -====== - -The 2.6 kernel support is working on PPC and x86. - -PLEASE USE THE KERNEL BUGZILLA TO REPORT PROBLEMS. (http://bugzilla.kernel.org) - diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking index 42d4b30b104..b18dd177902 100644 --- a/Documentation/filesystems/Locking +++ b/Documentation/filesystems/Locking @@ -9,51 +9,67 @@ be able to use diff(1). --------------------------- dentry_operations -------------------------- prototypes: - int (*d_revalidate)(struct dentry *, int); - int (*d_hash) (struct dentry *, struct qstr *); - int (*d_compare) (struct dentry *, struct qstr *, struct qstr *); + int (*d_revalidate)(struct dentry *, unsigned int); + int (*d_weak_revalidate)(struct dentry *, unsigned int); + int (*d_hash)(const struct dentry *, struct qstr *); + int (*d_compare)(const struct dentry *, const struct dentry *, + unsigned int, const char *, const struct qstr *); int (*d_delete)(struct dentry *); void (*d_release)(struct dentry *); void (*d_iput)(struct dentry *, struct inode *); char *(*d_dname)((struct dentry *dentry, char *buffer, int buflen); + struct vfsmount *(*d_automount)(struct path *path); + int (*d_manage)(struct dentry *, bool); locking rules: - none have BKL - dcache_lock rename_lock ->d_lock may block -d_revalidate: no no no yes -d_hash no no no yes -d_compare: no yes no no -d_delete: yes no yes no -d_release: no no no yes -d_iput: no no no yes + rename_lock ->d_lock may block rcu-walk +d_revalidate: no no yes (ref-walk) maybe +d_weak_revalidate:no no yes no +d_hash no no no maybe +d_compare: yes no no maybe +d_delete: no yes no no +d_release: no no yes no +d_prune: no yes no no +d_iput: no no yes no d_dname: no no no no +d_automount: no no yes no +d_manage: no no yes (ref-walk) maybe --------------------------- inode_operations --------------------------- prototypes: - int (*create) (struct inode *,struct dentry *,int, struct nameidata *); - struct dentry * (*lookup) (struct inode *,struct dentry *, struct nameid -ata *); + int (*create) (struct inode *,struct dentry *,umode_t, bool); + struct dentry * (*lookup) (struct inode *,struct dentry *, unsigned int); int (*link) (struct dentry *,struct inode *,struct dentry *); int (*unlink) (struct inode *,struct dentry *); int (*symlink) (struct inode *,struct dentry *,const char *); - int (*mkdir) (struct inode *,struct dentry *,int); + int (*mkdir) (struct inode *,struct dentry *,umode_t); int (*rmdir) (struct inode *,struct dentry *); - int (*mknod) (struct inode *,struct dentry *,int,dev_t); + int (*mknod) (struct inode *,struct dentry *,umode_t,dev_t); int (*rename) (struct inode *, struct dentry *, struct inode *, struct dentry *); + int (*rename2) (struct inode *, struct dentry *, + struct inode *, struct dentry *, unsigned int); int (*readlink) (struct dentry *, char __user *,int); - int (*follow_link) (struct dentry *, struct nameidata *); + void * (*follow_link) (struct dentry *, struct nameidata *); + void (*put_link) (struct dentry *, struct nameidata *, void *); void (*truncate) (struct inode *); - int (*permission) (struct inode *, int, struct nameidata *); + int (*permission) (struct inode *, int, unsigned int); + int (*get_acl)(struct inode *, int); int (*setattr) (struct dentry *, struct iattr *); int (*getattr) (struct vfsmount *, struct dentry *, struct kstat *); int (*setxattr) (struct dentry *, const char *,const void *,size_t,int); ssize_t (*getxattr) (struct dentry *, const char *, void *, size_t); ssize_t (*listxattr) (struct dentry *, char *, size_t); int (*removexattr) (struct dentry *, const char *); + int (*fiemap)(struct inode *, struct fiemap_extent_info *, u64 start, u64 len); + void (*update_time)(struct inode *, struct timespec *, int); + int (*atomic_open)(struct inode *, struct dentry *, + struct file *, unsigned open_flag, + umode_t create_mode, int *opened); + int (*tmpfile) (struct inode *, struct dentry *, umode_t); locking rules: - all may block, none have BKL + all may block i_mutex(inode) lookup: yes create: yes @@ -64,24 +80,27 @@ mkdir: yes unlink: yes (both) rmdir: yes (both) (see below) rename: yes (all) (see below) +rename2: yes (all) (see below) readlink: no follow_link: no -truncate: yes (see below) +put_link: no setattr: yes -permission: no +permission: no (may not block if called in rcu-walk mode) +get_acl: no getattr: no setxattr: yes getxattr: no listxattr: no removexattr: yes +fiemap: no +update_time: no +atomic_open: yes +tmpfile: no + Additionally, ->rmdir(), ->unlink() and ->rename() have ->i_mutex on victim. - cross-directory ->rename() has (per-superblock) ->s_vfs_rename_sem. - ->truncate() is never called directly - it's a callback, not a -method. It's called by vmtruncate() - library function normally used by -->setattr(). Locking information above applies to that call (i.e. is -inherited from ->setattr() - vmtruncate() is used when ATTR_SIZE had been -passed). + cross-directory ->rename() and rename2() has (per-superblock) +->s_vfs_rename_sem. See Documentation/filesystems/directory-locking for more detailed discussion of the locking scheme for directory operations. @@ -90,67 +109,71 @@ of the locking scheme for directory operations. prototypes: struct inode *(*alloc_inode)(struct super_block *sb); void (*destroy_inode)(struct inode *); - void (*dirty_inode) (struct inode *); - int (*write_inode) (struct inode *, int); - void (*put_inode) (struct inode *); - void (*drop_inode) (struct inode *); - void (*delete_inode) (struct inode *); + void (*dirty_inode) (struct inode *, int flags); + int (*write_inode) (struct inode *, struct writeback_control *wbc); + int (*drop_inode) (struct inode *); + void (*evict_inode) (struct inode *); void (*put_super) (struct super_block *); - void (*write_super) (struct super_block *); int (*sync_fs)(struct super_block *sb, int wait); - void (*write_super_lockfs) (struct super_block *); - void (*unlockfs) (struct super_block *); + int (*freeze_fs) (struct super_block *); + int (*unfreeze_fs) (struct super_block *); int (*statfs) (struct dentry *, struct kstatfs *); int (*remount_fs) (struct super_block *, int *, char *); - void (*clear_inode) (struct inode *); void (*umount_begin) (struct super_block *); - int (*show_options)(struct seq_file *, struct vfsmount *); + int (*show_options)(struct seq_file *, struct dentry *); ssize_t (*quota_read)(struct super_block *, int, char *, size_t, loff_t); ssize_t (*quota_write)(struct super_block *, int, const char *, size_t, loff_t); + int (*bdev_try_to_free_page)(struct super_block*, struct page*, gfp_t); locking rules: - All may block. - BKL s_lock s_umount -alloc_inode: no no no -destroy_inode: no -dirty_inode: no (must not sleep) -write_inode: no -put_inode: no -drop_inode: no !!!inode_lock!!! -delete_inode: no -put_super: yes yes no -write_super: no yes read -sync_fs: no no read -write_super_lockfs: ? -unlockfs: ? -statfs: no no no -remount_fs: yes yes maybe (see below) -clear_inode: no -umount_begin: yes no no -show_options: no (vfsmount->sem) -quota_read: no no no (see below) -quota_write: no no no (see below) - -->remount_fs() will have the s_umount lock if it's already mounted. -When called from get_sb_single, it does NOT have the s_umount lock. + All may block [not true, see below] + s_umount +alloc_inode: +destroy_inode: +dirty_inode: +write_inode: +drop_inode: !!!inode->i_lock!!! +evict_inode: +put_super: write +sync_fs: read +freeze_fs: write +unfreeze_fs: write +statfs: maybe(read) (see below) +remount_fs: write +umount_begin: no +show_options: no (namespace_sem) +quota_read: no (see below) +quota_write: no (see below) +bdev_try_to_free_page: no (see below) + +->statfs() has s_umount (shared) when called by ustat(2) (native or +compat), but that's an accident of bad API; s_umount is used to pin +the superblock down when we only have dev_t given us by userland to +identify the superblock. Everything else (statfs(), fstatfs(), etc.) +doesn't hold it when calling ->statfs() - superblock is pinned down +by resolving the pathname passed to syscall. ->quota_read() and ->quota_write() functions are both guaranteed to be the only ones operating on the quota file by the quota code (via dqio_sem) (unless an admin really wants to screw up something and writes to quota files with quotas on). For other details about locking see also dquot_operations section. +->bdev_try_to_free_page is called from the ->releasepage handler of +the block device inode. See there for more details. --------------------------- file_system_type --------------------------- prototypes: int (*get_sb) (struct file_system_type *, int, const char *, void *, struct vfsmount *); + struct dentry *(*mount) (struct file_system_type *, int, + const char *, void *); void (*kill_sb) (struct super_block *); locking rules: - may block BKL -get_sb yes yes -kill_sb yes yes + may block +mount yes +kill_sb yes -->get_sb() returns error or 0 with locked superblock attached to the vfsmount -(exclusive on ->s_umount). +->mount() returns ERR_PTR or the root dentry; its superblock should be locked +on return. ->kill_sb() takes a write-locked superblock, does all shutdown work on it, unlocks and drops the reference. @@ -163,37 +186,52 @@ prototypes: int (*set_page_dirty)(struct page *page); int (*readpages)(struct file *filp, struct address_space *mapping, struct list_head *pages, unsigned nr_pages); - int (*prepare_write)(struct file *, struct page *, unsigned, unsigned); - int (*commit_write)(struct file *, struct page *, unsigned, unsigned); + int (*write_begin)(struct file *, struct address_space *mapping, + loff_t pos, unsigned len, unsigned flags, + struct page **pagep, void **fsdata); + int (*write_end)(struct file *, struct address_space *mapping, + loff_t pos, unsigned len, unsigned copied, + struct page *page, void *fsdata); sector_t (*bmap)(struct address_space *, sector_t); - int (*invalidatepage) (struct page *, unsigned long); + void (*invalidatepage) (struct page *, unsigned int, unsigned int); int (*releasepage) (struct page *, int); - int (*direct_IO)(int, struct kiocb *, const struct iovec *iov, - loff_t offset, unsigned long nr_segs); - int (*launder_page) (struct page *); + void (*freepage)(struct page *); + int (*direct_IO)(int, struct kiocb *, struct iov_iter *iter, loff_t offset); + int (*get_xip_mem)(struct address_space *, pgoff_t, int, void **, + unsigned long *); + int (*migratepage)(struct address_space *, struct page *, struct page *); + int (*launder_page)(struct page *); + int (*is_partially_uptodate)(struct page *, unsigned long, unsigned long); + int (*error_remove_page)(struct address_space *, struct page *); + int (*swap_activate)(struct file *); + int (*swap_deactivate)(struct file *); locking rules: - All except set_page_dirty may block - - BKL PageLocked(page) i_sem -writepage: no yes, unlocks (see below) -readpage: no yes, unlocks -sync_page: no maybe -writepages: no -set_page_dirty no no -readpages: no -prepare_write: no yes yes -commit_write: no yes yes -write_begin: no locks the page yes -write_end: no yes, unlocks yes -perform_write: no n/a yes -bmap: yes -invalidatepage: no yes -releasepage: no yes -direct_IO: no -launder_page: no yes - - ->prepare_write(), ->commit_write(), ->sync_page() and ->readpage() + All except set_page_dirty and freepage may block + + PageLocked(page) i_mutex +writepage: yes, unlocks (see below) +readpage: yes, unlocks +sync_page: maybe +writepages: +set_page_dirty no +readpages: +write_begin: locks the page yes +write_end: yes, unlocks yes +bmap: +invalidatepage: yes +releasepage: yes +freepage: yes +direct_IO: +get_xip_mem: maybe +migratepage: yes (both) +launder_page: yes +is_partially_uptodate: yes +error_remove_page: yes +swap_activate: no +swap_deactivate: no + + ->write_begin(), ->write_end(), ->sync_page() and ->readpage() may be called from the request handler (/dev/loop). ->readpage() unlocks the page, either synchronously or via I/O @@ -271,13 +309,12 @@ under spinlock (it cannot block) and is sometimes called with the page not locked. ->bmap() is currently used by legacy ioctl() (FIBMAP) provided by some -filesystems and by the swapper. The latter will eventually go away. All -instances do not actually need the BKL. Please, keep it that way and don't -breed new callers. +filesystems and by the swapper. The latter will eventually go away. Please, +keep it that way and don't breed new callers. ->invalidatepage() is called when the filesystem must attempt to drop -some or all of the buffers from the page when it is being truncated. It -returns zero on success. If ->invalidatepage is zero, the kernel uses +some or all of the buffers from the page when it is being truncated. It +returns zero on success. If ->invalidatepage is zero, the kernel uses block_invalidatepage() instead. ->releasepage() is called when the kernel is about to try to drop the @@ -285,55 +322,64 @@ buffers from the page in preparation for freeing it. It returns zero to indicate that the buffers are (or may be) freeable. If ->releasepage is zero, the kernel assumes that the fs has no private interest in the buffers. + ->freepage() is called when the kernel is done dropping the page +from the page cache. + ->launder_page() may be called prior to releasing a page if it is still found to be dirty. It returns zero if the page was successfully cleaned, or an error value if not. Note that in order to prevent the page getting mapped back in and redirtied, it needs to be kept locked across the entire operation. - Note: currently almost all instances of address_space methods are -using BKL for internal serialization and that's one of the worst sources -of contention. Normally they are calling library functions (in fs/buffer.c) -and pass foo_get_block() as a callback (on local block-based filesystems, -indeed). BKL is not needed for library stuff and is usually taken by -foo_get_block(). It's an overkill, since block bitmaps can be protected by -internal fs locking and real critical areas are much smaller than the areas -filesystems protect now. + ->swap_activate will be called with a non-zero argument on +files backing (non block device backed) swapfiles. A return value +of zero indicates success, in which case this file can be used for +backing swapspace. The swapspace operations will be proxied to the +address space operations. + + ->swap_deactivate() will be called in the sys_swapoff() +path after ->swap_activate() returned success. ----------------------- file_lock_operations ------------------------------ prototypes: - void (*fl_insert)(struct file_lock *); /* lock insertion callback */ - void (*fl_remove)(struct file_lock *); /* lock removal callback */ void (*fl_copy_lock)(struct file_lock *, struct file_lock *); void (*fl_release_private)(struct file_lock *); locking rules: - BKL may block -fl_insert: yes no -fl_remove: yes no -fl_copy_lock: yes no -fl_release_private: yes yes + inode->i_lock may block +fl_copy_lock: yes no +fl_release_private: maybe no ----------------------- lock_manager_operations --------------------------- prototypes: - int (*fl_compare_owner)(struct file_lock *, struct file_lock *); - void (*fl_notify)(struct file_lock *); /* unblock callback */ - void (*fl_copy_lock)(struct file_lock *, struct file_lock *); - void (*fl_release_private)(struct file_lock *); - void (*fl_break)(struct file_lock *); /* break_lease callback */ + int (*lm_compare_owner)(struct file_lock *, struct file_lock *); + unsigned long (*lm_owner_key)(struct file_lock *); + void (*lm_notify)(struct file_lock *); /* unblock callback */ + int (*lm_grant)(struct file_lock *, struct file_lock *, int); + void (*lm_break)(struct file_lock *); /* break_lease callback */ + int (*lm_change)(struct file_lock **, int); locking rules: - BKL may block -fl_compare_owner: yes no -fl_notify: yes no -fl_copy_lock: yes no -fl_release_private: yes yes -fl_break: yes no - - Currently only NFSD and NLM provide instances of this class. None of the -them block. If you have out-of-tree instances - please, show up. Locking -in that area will change. + + inode->i_lock blocked_lock_lock may block +lm_compare_owner: yes[1] maybe no +lm_owner_key yes[1] yes no +lm_notify: yes yes no +lm_grant: no no no +lm_break: yes no no +lm_change yes no no + +[1]: ->lm_compare_owner and ->lm_owner_key are generally called with +*an* inode->i_lock held. It may not be the i_lock of the inode +associated with either file_lock argument! This is the case with deadlock +detection, since the code has to chase down the owners of locks that may +be entirely unrelated to the one on which the lock is being acquired. +For deadlock detection however, the blocked_lock_lock is also held. The +fact that these locks are held ensures that the file_locks do not +disappear out from under you while doing the comparison or generating an +owner key. + --------------------------- buffer_head ----------------------------------- prototypes: void (*b_end_io)(struct buffer_head *bh, int uptodate); @@ -346,21 +392,36 @@ call this method upon the IO completion. --------------------------- block_device_operations ----------------------- prototypes: - int (*open) (struct inode *, struct file *); - int (*release) (struct inode *, struct file *); - int (*ioctl) (struct inode *, struct file *, unsigned, unsigned long); + int (*open) (struct block_device *, fmode_t); + int (*release) (struct gendisk *, fmode_t); + int (*ioctl) (struct block_device *, fmode_t, unsigned, unsigned long); + int (*compat_ioctl) (struct block_device *, fmode_t, unsigned, unsigned long); + int (*direct_access) (struct block_device *, sector_t, void **, unsigned long *); int (*media_changed) (struct gendisk *); + void (*unlock_native_capacity) (struct gendisk *); int (*revalidate_disk) (struct gendisk *); + int (*getgeo)(struct block_device *, struct hd_geometry *); + void (*swap_slot_free_notify) (struct block_device *, unsigned long); locking rules: - BKL bd_sem -open: yes yes -release: yes yes -ioctl: yes no -media_changed: no no -revalidate_disk: no no + bd_mutex +open: yes +release: yes +ioctl: no +compat_ioctl: no +direct_access: no +media_changed: no +unlock_native_capacity: no +revalidate_disk: no +getgeo: no +swap_slot_free_notify: no (see below) + +media_changed, unlock_native_capacity and revalidate_disk are called only from +check_disk_change(). + +swap_slot_free_notify is called with swap_lock and sometimes the page lock +held. -The last two are called only from check_disk_change(). --------------------------- file_operations ------------------------------- prototypes: @@ -369,17 +430,17 @@ prototypes: ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *); ssize_t (*aio_read) (struct kiocb *, const struct iovec *, unsigned long, loff_t); ssize_t (*aio_write) (struct kiocb *, const struct iovec *, unsigned long, loff_t); - int (*readdir) (struct file *, void *, filldir_t); + ssize_t (*read_iter) (struct kiocb *, struct iov_iter *); + ssize_t (*write_iter) (struct kiocb *, struct iov_iter *); + int (*iterate) (struct file *, struct dir_context *); unsigned int (*poll) (struct file *, struct poll_table_struct *); - int (*ioctl) (struct inode *, struct file *, unsigned int, - unsigned long); long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long); long (*compat_ioctl) (struct file *, unsigned int, unsigned long); int (*mmap) (struct file *, struct vm_area_struct *); int (*open) (struct inode *, struct file *); int (*flush) (struct file *); int (*release) (struct inode *, struct file *); - int (*fsync) (struct file *, struct dentry *, int datasync); + int (*fsync) (struct file *, loff_t start, loff_t end, int datasync); int (*aio_fsync) (struct kiocb *, int datasync); int (*fasync) (int, struct file *, int); int (*lock) (struct file *, int, struct file_lock *); @@ -394,60 +455,33 @@ prototypes: unsigned long (*get_unmapped_area)(struct file *, unsigned long, unsigned long, unsigned long, unsigned long); int (*check_flags)(int); - int (*dir_notify)(struct file *, unsigned long); + int (*flock) (struct file *, int, struct file_lock *); + ssize_t (*splice_write)(struct pipe_inode_info *, struct file *, loff_t *, + size_t, unsigned int); + ssize_t (*splice_read)(struct file *, loff_t *, struct pipe_inode_info *, + size_t, unsigned int); + int (*setlease)(struct file *, long, struct file_lock **); + long (*fallocate)(struct file *, int, loff_t, loff_t); }; locking rules: - All except ->poll() may block. - BKL -llseek: no (see below) -read: no -aio_read: no -write: no -aio_write: no -readdir: no -poll: no -ioctl: yes (see below) -unlocked_ioctl: no (see below) -compat_ioctl: no -mmap: no -open: maybe (see below) -flush: no -release: no -fsync: no (see below) -aio_fsync: no -fasync: yes (see below) -lock: yes -readv: no -writev: no -sendfile: no -sendpage: no -get_unmapped_area: no -check_flags: no -dir_notify: no + All may block except for ->setlease. + No VFS locks held on entry except for ->setlease. + +->setlease has the file_list_lock held and must not sleep. ->llseek() locking has moved from llseek to the individual llseek implementations. If your fs is not using generic_file_llseek, you need to acquire and release the appropriate locks in your ->llseek(). For many filesystems, it is probably safe to acquire the inode -semaphore. Note some filesystems (i.e. remote ones) provide no -protection for i_size so you will need to use the BKL. - -->open() locking is in-transit: big lock partially moved into the methods. -The only exception is ->open() in the instances of file_operations that never -end up in ->i_fop/->proc_fops, i.e. ones that belong to character devices -(chrdev_open() takes lock before replacing ->f_op and calling the secondary -method. As soon as we fix the handling of module reference counters all -instances of ->open() will be called without the BKL. +mutex or just to use i_size_read() instead. +Note: this does not protect the file->f_pos against concurrent modifications +since this is something the userspace has to take care about. -Note: ext2_release() was *the* source of contention on fs-intensive -loads and dropping BKL on ->release() helps to get rid of that (we still -grab BKL for cases when we close a file that had been opened r/w, but that -can and should be done using the internal locking with smaller critical areas). -Current worst offender is ext2_get_block()... - -->fasync() is a mess. This area needs a big cleanup and that will probably -affect locking. +->fasync() is responsible for maintaining the FASYNC bit in filp->f_flags. +Most instances call fasync_helper(), which does that maintenance, so it's +not normally something one needs to worry about. Return values > 0 will be +mapped to zero in the VFS layer. ->readdir() and ->ioctl() on directories must be changed. Ideally we would move ->readdir() to inode_operations and use a separate method for directory @@ -455,23 +489,11 @@ move ->readdir() to inode_operations and use a separate method for directory anything that resembles union-mount we won't have a struct file for all components. And there are other reasons why the current interface is a mess... -->ioctl() on regular files is superceded by the ->unlocked_ioctl() that -doesn't take the BKL. - ->read on directories probably must go away - we should just enforce -EISDIR in sys_read() and friends. -->fsync() has i_mutex on inode. - --------------------------- dquot_operations ------------------------------- prototypes: - int (*initialize) (struct inode *, int); - int (*drop) (struct inode *); - int (*alloc_space) (struct inode *, qsize_t, int); - int (*alloc_inode) (const struct inode *, unsigned long); - int (*free_space) (struct inode *, qsize_t); - int (*free_inode) (const struct inode *, unsigned long); - int (*transfer) (struct inode *, struct iattr *); int (*write_dquot) (struct dquot *); int (*acquire_dquot) (struct dquot *); int (*release_dquot) (struct dquot *); @@ -484,13 +506,6 @@ a proper locking wrt the filesystem and call the generic quota operations. What filesystem should expect from the generic quota functions: FS recursion Held locks when called -initialize: yes maybe dqonoff_sem -drop: yes - -alloc_space: ->mark_dirty() - -alloc_inode: ->mark_dirty() - -free_space: ->mark_dirty() - -free_inode: ->mark_dirty() - -transfer: yes - write_dquot: yes dqonoff_sem or dqptr_sem acquire_dquot: yes dqonoff_sem or dqptr_sem release_dquot: yes dqonoff_sem or dqptr_sem @@ -500,10 +515,6 @@ write_info: yes dqonoff_sem FS recursion means calling ->quota_read() and ->quota_write() from superblock operations. -->alloc_space(), ->alloc_inode(), ->free_space(), ->free_inode() are called -only directly by the filesystem and do not call any fs functions only -the ->mark_dirty() operation. - More details about quota locking can be found in fs/dquot.c. --------------------------- vm_operations_struct ----------------------------- @@ -511,30 +522,49 @@ prototypes: void (*open)(struct vm_area_struct*); void (*close)(struct vm_area_struct*); int (*fault)(struct vm_area_struct*, struct vm_fault *); - struct page *(*nopage)(struct vm_area_struct*, unsigned long, int *); - int (*page_mkwrite)(struct vm_area_struct *, struct page *); + int (*page_mkwrite)(struct vm_area_struct *, struct vm_fault *); + int (*access)(struct vm_area_struct *, unsigned long, void*, int, int); locking rules: - BKL mmap_sem PageLocked(page) -open: no yes -close: no yes -fault: no yes -nopage: no yes -page_mkwrite: no yes no - - ->page_mkwrite() is called when a previously read-only page is -about to become writeable. The file system is responsible for -protecting against truncate races. Once appropriate action has been -taking to lock out truncate, the page range should be verified to be -within i_size. The page mapping should also be checked that it is not -NULL. + mmap_sem PageLocked(page) +open: yes +close: yes +fault: yes can return with page locked +map_pages: yes +page_mkwrite: yes can return with page locked +access: yes + + ->fault() is called when a previously not present pte is about +to be faulted in. The filesystem must find and return the page associated +with the passed in "pgoff" in the vm_fault structure. If it is possible that +the page may be truncated and/or invalidated, then the filesystem must lock +the page, then ensure it is not already truncated (the page lock will block +subsequent truncate), and then return with VM_FAULT_LOCKED, and the page +locked. The VM will unlock the page. + + ->map_pages() is called when VM asks to map easy accessible pages. +Filesystem should find and map pages associated with offsets from "pgoff" +till "max_pgoff". ->map_pages() is called with page table locked and must +not block. If it's not possible to reach a page without blocking, +filesystem should skip it. Filesystem should use do_set_pte() to setup +page table entry. Pointer to entry associated with offset "pgoff" is +passed in "pte" field in vm_fault structure. Pointers to entries for other +offsets should be calculated relative to "pte". + + ->page_mkwrite() is called when a previously read-only pte is +about to become writeable. The filesystem again must ensure that there are +no truncate/invalidate races, and then return with the page locked. If +the page has been truncated, the filesystem should not look up a new page +like the ->fault() handler, but simply return with VM_FAULT_NOPAGE, which +will cause the VM to retry the fault. + + ->access() is called when get_user_pages() fails in +access_process_vm(), typically used to debug a process through +/proc/pid/mem or ptrace. This function is needed only for +VM_IO | VM_PFNMAP VMAs. ================================================================================ Dubious stuff (if you break something or notice that it is broken and do not fix it yourself - at least put it here) - -ipc/shm.c::shm_delete() - may need BKL. -->read() and ->write() in many drivers are (probably) missing BKL. -drivers/sgi/char/graphics.c::sgi_graphics_nopage() - may need BKL. diff --git a/Documentation/filesystems/Makefile b/Documentation/filesystems/Makefile new file mode 100644 index 00000000000..a5dd114da14 --- /dev/null +++ b/Documentation/filesystems/Makefile @@ -0,0 +1,8 @@ +# kbuild trick to avoid linker error. Can be omitted if a module is built. +obj- := dummy.o + +# List of programs to build +hostprogs-y := dnotify_test + +# Tell kbuild to always build the programs +always := $(hostprogs-y) diff --git a/Documentation/filesystems/adfs.txt b/Documentation/filesystems/adfs.txt index 9e8811f92b8..5949766353f 100644 --- a/Documentation/filesystems/adfs.txt +++ b/Documentation/filesystems/adfs.txt @@ -9,6 +9,9 @@ Mount options for ADFS will be nnn. Default 0700. othmask=nnn The permission mask for ADFS 'other' permissions will be nnn. Default 0077. + ftsuffix=n When ftsuffix=0, no file type suffix will be applied. + When ftsuffix=1, a hexadecimal suffix corresponding to + the RISC OS file type will be added. Default 0. Mapping of ADFS permissions to Linux permissions ------------------------------------------------ @@ -55,3 +58,18 @@ Mapping of ADFS permissions to Linux permissions You can therefore tailor the permission translation to whatever you desire the permissions should be under Linux. + +RISC OS file type suffix +------------------------ + + RISC OS file types are stored in bits 19..8 of the file load address. + + To enable non-RISC OS systems to be used to store files without losing + file type information, a file naming convention was devised (initially + for use with NFS) such that a hexadecimal suffix of the form ,xyz + denoted the file type: e.g. BasicFile,ffb is a BASIC (0xffb) file. This + naming convention is now also used by RISC OS emulators such as RPCEmu. + + Mounting an ADFS disc with option ftsuffix=1 will cause appropriate file + type suffixes to be appended to file names read from a directory. If the + ftsuffix option is zero or omitted, no file type suffixes will be added. diff --git a/Documentation/filesystems/affs.txt b/Documentation/filesystems/affs.txt index 2d1524469c2..71b63c2b984 100644 --- a/Documentation/filesystems/affs.txt +++ b/Documentation/filesystems/affs.txt @@ -49,6 +49,10 @@ mode=mode Sets the mode flags to the given (octal) value, regardless This is useful since most of the plain AmigaOS files will map to 600. +nofilenametruncate + The file system will return an error when filename exceeds + standard maximum filename length (30 characters). + reserved=num Sets the number of reserved blocks at the start of the partition to num. You should never need this option. Default is 2. @@ -181,9 +185,8 @@ tested, though several hundred MB have been read and written using this fs. For a most up-to-date list of bugs please consult fs/affs/Changes. -Filenames are truncated to 30 characters without warning (this -can be changed by setting the compile-time option AFFS_NO_TRUNCATE -in include/linux/amigaffs.h). +By default, filenames are truncated to 30 characters without warning. +'nofilenametruncate' mount option can change that behavior. Case is ignored by the affs in filename matching, but Linux shells do care about the case. Example (with /wb being an affs mounted fs): @@ -216,4 +219,4 @@ due to an incompatibility with the Amiga floppy controller. If you are interested in an Amiga Emulator for Linux, look at -http://www.freiburg.linux.de/~uae/ +http://web.archive.org/web/*/http://www.freiburg.linux.de/~uae/ diff --git a/Documentation/filesystems/afs.txt b/Documentation/filesystems/afs.txt index 12ad6c7f4e5..ffef91c4e0d 100644 --- a/Documentation/filesystems/afs.txt +++ b/Documentation/filesystems/afs.txt @@ -23,15 +23,13 @@ it does support include: (*) Security (currently only AFS kaserver and KerberosIV tickets). - (*) File reading. + (*) File reading and writing. (*) Automounting. -It does not yet support the following AFS features: - - (*) Write support. + (*) Local caching (via fscache). - (*) Local caching. +It does not yet support the following AFS features: (*) pioctl() system call. @@ -56,7 +54,7 @@ They permit the debugging messages to be turned on dynamically by manipulating the masks in the following files: /sys/module/af_rxrpc/parameters/debug - /sys/module/afs/parameters/debug + /sys/module/kafs/parameters/debug ===== @@ -66,9 +64,9 @@ USAGE When inserting the driver modules the root cell must be specified along with a list of volume location server IP addresses: - insmod af_rxrpc.o - insmod rxkad.o - insmod kafs.o rootcell=cambridge.redhat.com:172.16.18.73:172.16.18.91 + modprobe af_rxrpc + modprobe rxkad + modprobe kafs rootcell=cambridge.redhat.com:172.16.18.73:172.16.18.91 The first module is the AF_RXRPC network protocol driver. This provides the RxRPC remote operation protocol and may also be accessed from userspace. See: @@ -81,7 +79,7 @@ is the actual filesystem driver for the AFS filesystem. Once the module has been loaded, more modules can be added by the following procedure: - echo add grand.central.org 18.7.14.88:128.2.191.224 >/proc/fs/afs/cells + echo add grand.central.org 18.9.48.14:128.2.203.61:130.237.48.87 >/proc/fs/afs/cells Where the parameters to the "add" command are the name of a cell and a list of volume location servers within that cell, with the latter separated by colons. @@ -101,7 +99,7 @@ The name of the volume can be suffixes with ".backup" or ".readonly" to specify connection to only volumes of those types. The name of the cell is optional, and if not given during a mount, then the -named volume will be looked up in the cell specified during insmod. +named volume will be looked up in the cell specified during modprobe. Additional cells can be added through /proc (see later section). @@ -163,14 +161,14 @@ THE CELL DATABASE The filesystem maintains an internal database of all the cells it knows and the IP addresses of the volume location servers for those cells. The cell to which -the system belongs is added to the database when insmod is performed by the +the system belongs is added to the database when modprobe is performed by the "rootcell=" argument or, if compiled in, using a "kafs.rootcell=" argument on the kernel command line. Further cells can be added by commands similar to the following: echo add CELLNAME VLADDR[:VLADDR][:VLADDR]... >/proc/fs/afs/cells - echo add grand.central.org 18.7.14.88:128.2.191.224 >/proc/fs/afs/cells + echo add grand.central.org 18.9.48.14:128.2.203.61:130.237.48.87 >/proc/fs/afs/cells No other cell database operations are available at this time. @@ -233,7 +231,7 @@ insmod /tmp/kafs.o rootcell=cambridge.redhat.com:172.16.18.91 mount -t afs \%root.afs. /afs mount -t afs \%cambridge.redhat.com:root.cell. /afs/cambridge.redhat.com/ -echo add grand.central.org 18.7.14.88:128.2.191.224 > /proc/fs/afs/cells +echo add grand.central.org 18.9.48.14:128.2.203.61:130.237.48.87 > /proc/fs/afs/cells mount -t afs "#grand.central.org:root.cell." /afs/grand.central.org/ mount -t afs "#grand.central.org:root.archive." /afs/grand.central.org/archive mount -t afs "#grand.central.org:root.contrib." /afs/grand.central.org/contrib diff --git a/Documentation/filesystems/autofs4-mount-control.txt b/Documentation/filesystems/autofs4-mount-control.txt new file mode 100644 index 00000000000..aff22113a98 --- /dev/null +++ b/Documentation/filesystems/autofs4-mount-control.txt @@ -0,0 +1,393 @@ + +Miscellaneous Device control operations for the autofs4 kernel module +==================================================================== + +The problem +=========== + +There is a problem with active restarts in autofs (that is to say +restarting autofs when there are busy mounts). + +During normal operation autofs uses a file descriptor opened on the +directory that is being managed in order to be able to issue control +operations. Using a file descriptor gives ioctl operations access to +autofs specific information stored in the super block. The operations +are things such as setting an autofs mount catatonic, setting the +expire timeout and requesting expire checks. As is explained below, +certain types of autofs triggered mounts can end up covering an autofs +mount itself which prevents us being able to use open(2) to obtain a +file descriptor for these operations if we don't already have one open. + +Currently autofs uses "umount -l" (lazy umount) to clear active mounts +at restart. While using lazy umount works for most cases, anything that +needs to walk back up the mount tree to construct a path, such as +getcwd(2) and the proc file system /proc/<pid>/cwd, no longer works +because the point from which the path is constructed has been detached +from the mount tree. + +The actual problem with autofs is that it can't reconnect to existing +mounts. Immediately one thinks of just adding the ability to remount +autofs file systems would solve it, but alas, that can't work. This is +because autofs direct mounts and the implementation of "on demand mount +and expire" of nested mount trees have the file system mounted directly +on top of the mount trigger directory dentry. + +For example, there are two types of automount maps, direct (in the kernel +module source you will see a third type called an offset, which is just +a direct mount in disguise) and indirect. + +Here is a master map with direct and indirect map entries: + +/- /etc/auto.direct +/test /etc/auto.indirect + +and the corresponding map files: + +/etc/auto.direct: + +/automount/dparse/g6 budgie:/autofs/export1 +/automount/dparse/g1 shark:/autofs/export1 +and so on. + +/etc/auto.indirect: + +g1 shark:/autofs/export1 +g6 budgie:/autofs/export1 +and so on. + +For the above indirect map an autofs file system is mounted on /test and +mounts are triggered for each sub-directory key by the inode lookup +operation. So we see a mount of shark:/autofs/export1 on /test/g1, for +example. + +The way that direct mounts are handled is by making an autofs mount on +each full path, such as /automount/dparse/g1, and using it as a mount +trigger. So when we walk on the path we mount shark:/autofs/export1 "on +top of this mount point". Since these are always directories we can +use the follow_link inode operation to trigger the mount. + +But, each entry in direct and indirect maps can have offsets (making +them multi-mount map entries). + +For example, an indirect mount map entry could also be: + +g1 \ + / shark:/autofs/export5/testing/test \ + /s1 shark:/autofs/export/testing/test/s1 \ + /s2 shark:/autofs/export5/testing/test/s2 \ + /s1/ss1 shark:/autofs/export1 \ + /s2/ss2 shark:/autofs/export2 + +and a similarly a direct mount map entry could also be: + +/automount/dparse/g1 \ + / shark:/autofs/export5/testing/test \ + /s1 shark:/autofs/export/testing/test/s1 \ + /s2 shark:/autofs/export5/testing/test/s2 \ + /s1/ss1 shark:/autofs/export2 \ + /s2/ss2 shark:/autofs/export2 + +One of the issues with version 4 of autofs was that, when mounting an +entry with a large number of offsets, possibly with nesting, we needed +to mount and umount all of the offsets as a single unit. Not really a +problem, except for people with a large number of offsets in map entries. +This mechanism is used for the well known "hosts" map and we have seen +cases (in 2.4) where the available number of mounts are exhausted or +where the number of privileged ports available is exhausted. + +In version 5 we mount only as we go down the tree of offsets and +similarly for expiring them which resolves the above problem. There is +somewhat more detail to the implementation but it isn't needed for the +sake of the problem explanation. The one important detail is that these +offsets are implemented using the same mechanism as the direct mounts +above and so the mount points can be covered by a mount. + +The current autofs implementation uses an ioctl file descriptor opened +on the mount point for control operations. The references held by the +descriptor are accounted for in checks made to determine if a mount is +in use and is also used to access autofs file system information held +in the mount super block. So the use of a file handle needs to be +retained. + + +The Solution +============ + +To be able to restart autofs leaving existing direct, indirect and +offset mounts in place we need to be able to obtain a file handle +for these potentially covered autofs mount points. Rather than just +implement an isolated operation it was decided to re-implement the +existing ioctl interface and add new operations to provide this +functionality. + +In addition, to be able to reconstruct a mount tree that has busy mounts, +the uid and gid of the last user that triggered the mount needs to be +available because these can be used as macro substitution variables in +autofs maps. They are recorded at mount request time and an operation +has been added to retrieve them. + +Since we're re-implementing the control interface, a couple of other +problems with the existing interface have been addressed. First, when +a mount or expire operation completes a status is returned to the +kernel by either a "send ready" or a "send fail" operation. The +"send fail" operation of the ioctl interface could only ever send +ENOENT so the re-implementation allows user space to send an actual +status. Another expensive operation in user space, for those using +very large maps, is discovering if a mount is present. Usually this +involves scanning /proc/mounts and since it needs to be done quite +often it can introduce significant overhead when there are many entries +in the mount table. An operation to lookup the mount status of a mount +point dentry (covered or not) has also been added. + +Current kernel development policy recommends avoiding the use of the +ioctl mechanism in favor of systems such as Netlink. An implementation +using this system was attempted to evaluate its suitability and it was +found to be inadequate, in this case. The Generic Netlink system was +used for this as raw Netlink would lead to a significant increase in +complexity. There's no question that the Generic Netlink system is an +elegant solution for common case ioctl functions but it's not a complete +replacement probably because its primary purpose in life is to be a +message bus implementation rather than specifically an ioctl replacement. +While it would be possible to work around this there is one concern +that lead to the decision to not use it. This is that the autofs +expire in the daemon has become far to complex because umount +candidates are enumerated, almost for no other reason than to "count" +the number of times to call the expire ioctl. This involves scanning +the mount table which has proved to be a big overhead for users with +large maps. The best way to improve this is try and get back to the +way the expire was done long ago. That is, when an expire request is +issued for a mount (file handle) we should continually call back to +the daemon until we can't umount any more mounts, then return the +appropriate status to the daemon. At the moment we just expire one +mount at a time. A Generic Netlink implementation would exclude this +possibility for future development due to the requirements of the +message bus architecture. + + +autofs4 Miscellaneous Device mount control interface +==================================================== + +The control interface is opening a device node, typically /dev/autofs. + +All the ioctls use a common structure to pass the needed parameter +information and return operation results: + +struct autofs_dev_ioctl { + __u32 ver_major; + __u32 ver_minor; + __u32 size; /* total size of data passed in + * including this struct */ + __s32 ioctlfd; /* automount command fd */ + + __u32 arg1; /* Command parameters */ + __u32 arg2; + + char path[0]; +}; + +The ioctlfd field is a mount point file descriptor of an autofs mount +point. It is returned by the open call and is used by all calls except +the check for whether a given path is a mount point, where it may +optionally be used to check a specific mount corresponding to a given +mount point file descriptor, and when requesting the uid and gid of the +last successful mount on a directory within the autofs file system. + +The fields arg1 and arg2 are used to communicate parameters and results of +calls made as described below. + +The path field is used to pass a path where it is needed and the size field +is used account for the increased structure length when translating the +structure sent from user space. + +This structure can be initialized before setting specific fields by using +the void function call init_autofs_dev_ioctl(struct autofs_dev_ioctl *). + +All of the ioctls perform a copy of this structure from user space to +kernel space and return -EINVAL if the size parameter is smaller than +the structure size itself, -ENOMEM if the kernel memory allocation fails +or -EFAULT if the copy itself fails. Other checks include a version check +of the compiled in user space version against the module version and a +mismatch results in a -EINVAL return. If the size field is greater than +the structure size then a path is assumed to be present and is checked to +ensure it begins with a "/" and is NULL terminated, otherwise -EINVAL is +returned. Following these checks, for all ioctl commands except +AUTOFS_DEV_IOCTL_VERSION_CMD, AUTOFS_DEV_IOCTL_OPENMOUNT_CMD and +AUTOFS_DEV_IOCTL_CLOSEMOUNT_CMD the ioctlfd is validated and if it is +not a valid descriptor or doesn't correspond to an autofs mount point +an error of -EBADF, -ENOTTY or -EINVAL (not an autofs descriptor) is +returned. + + +The ioctls +========== + +An example of an implementation which uses this interface can be seen +in autofs version 5.0.4 and later in file lib/dev-ioctl-lib.c of the +distribution tar available for download from kernel.org in directory +/pub/linux/daemons/autofs/v5. + +The device node ioctl operations implemented by this interface are: + + +AUTOFS_DEV_IOCTL_VERSION +------------------------ + +Get the major and minor version of the autofs4 device ioctl kernel module +implementation. It requires an initialized struct autofs_dev_ioctl as an +input parameter and sets the version information in the passed in structure. +It returns 0 on success or the error -EINVAL if a version mismatch is +detected. + + +AUTOFS_DEV_IOCTL_PROTOVER_CMD and AUTOFS_DEV_IOCTL_PROTOSUBVER_CMD +------------------------------------------------------------------ + +Get the major and minor version of the autofs4 protocol version understood +by loaded module. This call requires an initialized struct autofs_dev_ioctl +with the ioctlfd field set to a valid autofs mount point descriptor +and sets the requested version number in structure field arg1. These +commands return 0 on success or one of the negative error codes if +validation fails. + + +AUTOFS_DEV_IOCTL_OPENMOUNT and AUTOFS_DEV_IOCTL_CLOSEMOUNT +---------------------------------------------------------- + +Obtain and release a file descriptor for an autofs managed mount point +path. The open call requires an initialized struct autofs_dev_ioctl with +the path field set and the size field adjusted appropriately as well +as the arg1 field set to the device number of the autofs mount. The +device number can be obtained from the mount options shown in +/proc/mounts. The close call requires an initialized struct +autofs_dev_ioct with the ioctlfd field set to the descriptor obtained +from the open call. The release of the file descriptor can also be done +with close(2) so any open descriptors will also be closed at process exit. +The close call is included in the implemented operations largely for +completeness and to provide for a consistent user space implementation. + + +AUTOFS_DEV_IOCTL_READY_CMD and AUTOFS_DEV_IOCTL_FAIL_CMD +-------------------------------------------------------- + +Return mount and expire result status from user space to the kernel. +Both of these calls require an initialized struct autofs_dev_ioctl +with the ioctlfd field set to the descriptor obtained from the open +call and the arg1 field set to the wait queue token number, received +by user space in the foregoing mount or expire request. The arg2 field +is set to the status to be returned. For the ready call this is always +0 and for the fail call it is set to the errno of the operation. + + +AUTOFS_DEV_IOCTL_SETPIPEFD_CMD +------------------------------ + +Set the pipe file descriptor used for kernel communication to the daemon. +Normally this is set at mount time using an option but when reconnecting +to a existing mount we need to use this to tell the autofs mount about +the new kernel pipe descriptor. In order to protect mounts against +incorrectly setting the pipe descriptor we also require that the autofs +mount be catatonic (see next call). + +The call requires an initialized struct autofs_dev_ioctl with the +ioctlfd field set to the descriptor obtained from the open call and +the arg1 field set to descriptor of the pipe. On success the call +also sets the process group id used to identify the controlling process +(eg. the owning automount(8) daemon) to the process group of the caller. + + +AUTOFS_DEV_IOCTL_CATATONIC_CMD +------------------------------ + +Make the autofs mount point catatonic. The autofs mount will no longer +issue mount requests, the kernel communication pipe descriptor is released +and any remaining waits in the queue released. + +The call requires an initialized struct autofs_dev_ioctl with the +ioctlfd field set to the descriptor obtained from the open call. + + +AUTOFS_DEV_IOCTL_TIMEOUT_CMD +---------------------------- + +Set the expire timeout for mounts within an autofs mount point. + +The call requires an initialized struct autofs_dev_ioctl with the +ioctlfd field set to the descriptor obtained from the open call. + + +AUTOFS_DEV_IOCTL_REQUESTER_CMD +------------------------------ + +Return the uid and gid of the last process to successfully trigger a the +mount on the given path dentry. + +The call requires an initialized struct autofs_dev_ioctl with the path +field set to the mount point in question and the size field adjusted +appropriately as well as the arg1 field set to the device number of the +containing autofs mount. Upon return the struct field arg1 contains the +uid and arg2 the gid. + +When reconstructing an autofs mount tree with active mounts we need to +re-connect to mounts that may have used the original process uid and +gid (or string variations of them) for mount lookups within the map entry. +This call provides the ability to obtain this uid and gid so they may be +used by user space for the mount map lookups. + + +AUTOFS_DEV_IOCTL_EXPIRE_CMD +--------------------------- + +Issue an expire request to the kernel for an autofs mount. Typically +this ioctl is called until no further expire candidates are found. + +The call requires an initialized struct autofs_dev_ioctl with the +ioctlfd field set to the descriptor obtained from the open call. In +addition an immediate expire, independent of the mount timeout, can be +requested by setting the arg1 field to 1. If no expire candidates can +be found the ioctl returns -1 with errno set to EAGAIN. + +This call causes the kernel module to check the mount corresponding +to the given ioctlfd for mounts that can be expired, issues an expire +request back to the daemon and waits for completion. + +AUTOFS_DEV_IOCTL_ASKUMOUNT_CMD +------------------------------ + +Checks if an autofs mount point is in use. + +The call requires an initialized struct autofs_dev_ioctl with the +ioctlfd field set to the descriptor obtained from the open call and +it returns the result in the arg1 field, 1 for busy and 0 otherwise. + + +AUTOFS_DEV_IOCTL_ISMOUNTPOINT_CMD +--------------------------------- + +Check if the given path is a mountpoint. + +The call requires an initialized struct autofs_dev_ioctl. There are two +possible variations. Both use the path field set to the path of the mount +point to check and the size field adjusted appropriately. One uses the +ioctlfd field to identify a specific mount point to check while the other +variation uses the path and optionally arg1 set to an autofs mount type. +The call returns 1 if this is a mount point and sets arg1 to the device +number of the mount and field arg2 to the relevant super block magic +number (described below) or 0 if it isn't a mountpoint. In both cases +the the device number (as returned by new_encode_dev()) is returned +in field arg1. + +If supplied with a file descriptor we're looking for a specific mount, +not necessarily at the top of the mounted stack. In this case the path +the descriptor corresponds to is considered a mountpoint if it is itself +a mountpoint or contains a mount, such as a multi-mount without a root +mount. In this case we return 1 if the descriptor corresponds to a mount +point and and also returns the super magic of the covering mount if there +is one or 0 if it isn't a mountpoint. + +If a path is supplied (and the ioctlfd field is set to -1) then the path +is looked up and is checked to see if it is the root of a mount. If a +type is also given we are looking for a particular autofs mount and if +a match isn't found a fail is returned. If the the located path is the +root of a mount 1 is returned along with the super magic of the mount +or 0 otherwise. + diff --git a/Documentation/filesystems/befs.txt b/Documentation/filesystems/befs.txt index 67391a15949..da45e6c842b 100644 --- a/Documentation/filesystems/befs.txt +++ b/Documentation/filesystems/befs.txt @@ -27,11 +27,11 @@ His original code can still be found at: Does anyone know of a more current email address for Makoto? He doesn't respond to the address given above... -Current maintainer: Sergey S. Kostyliov <rathamahata@php4.ru> +This filesystem doesn't have a maintainer. WHAT IS THIS DRIVER? ================== -This module implements the native filesystem of BeOS <http://www.be.com/> +This module implements the native filesystem of BeOS http://www.beincorporated.com/ for the linux 2.4.1 and later kernels. Currently it is a read-only implementation. @@ -61,7 +61,7 @@ step 2. Configuration & make kernel The linux kernel has many compile-time options. Most of them are beyond the scope of this document. I suggest the Kernel-HOWTO document as a good general -reference on this topic. <http://www.linux.com/howto/Kernel-HOWTO.html> +reference on this topic. http://www.linuxdocs.org/HOWTOs/Kernel-HOWTO-4.html However, to use the BeFS module, you must enable it at configure time. diff --git a/Documentation/filesystems/bfs.txt b/Documentation/filesystems/bfs.txt index ea825e178e7..78043d5a8fc 100644 --- a/Documentation/filesystems/bfs.txt +++ b/Documentation/filesystems/bfs.txt @@ -26,11 +26,11 @@ You can simplify mounting by just typing: this will allocate the first available loopback device (and load loop.o kernel module if necessary) automatically. If the loopback driver is not -loaded automatically, make sure that your kernel is compiled with kmod -support (CONFIG_KMOD) enabled. Beware that umount will not -deallocate /dev/loopN device if /etc/mtab file on your system is a -symbolic link to /proc/mounts. You will need to do it manually using -"-d" switch of losetup(8). Read losetup(8) manpage for more info. +loaded automatically, make sure that you have compiled the module and +that modprobe is functioning. Beware that umount will not deallocate +/dev/loopN device if /etc/mtab file on your system is a symbolic link to +/proc/mounts. You will need to do it manually using "-d" switch of +losetup(8). Read losetup(8) manpage for more info. To create the BFS image under UnixWare you need to find out first which slice contains it. The command prtvtoc(1M) is your friend: diff --git a/Documentation/filesystems/btrfs.txt b/Documentation/filesystems/btrfs.txt new file mode 100644 index 00000000000..d11cc2f8077 --- /dev/null +++ b/Documentation/filesystems/btrfs.txt @@ -0,0 +1,270 @@ + +BTRFS +===== + +Btrfs is a copy on write filesystem for Linux aimed at +implementing advanced features while focusing on fault tolerance, +repair and easy administration. Initially developed by Oracle, Btrfs +is licensed under the GPL and open for contribution from anyone. + +Linux has a wealth of filesystems to choose from, but we are facing a +number of challenges with scaling to the large storage subsystems that +are becoming common in today's data centers. Filesystems need to scale +in their ability to address and manage large storage, and also in +their ability to detect, repair and tolerate errors in the data stored +on disk. Btrfs is under heavy development, and is not suitable for +any uses other than benchmarking and review. The Btrfs disk format is +not yet finalized. + +The main Btrfs features include: + + * Extent based file storage (2^64 max file size) + * Space efficient packing of small files + * Space efficient indexed directories + * Dynamic inode allocation + * Writable snapshots + * Subvolumes (separate internal filesystem roots) + * Object level mirroring and striping + * Checksums on data and metadata (multiple algorithms available) + * Compression + * Integrated multiple device support, with several raid algorithms + * Online filesystem check (not yet implemented) + * Very fast offline filesystem check + * Efficient incremental backup and FS mirroring (not yet implemented) + * Online filesystem defragmentation + + +Mount Options +============= + +When mounting a btrfs filesystem, the following option are accepted. +Options with (*) are default options and will not show in the mount options. + + alloc_start=<bytes> + Debugging option to force all block allocations above a certain + byte threshold on each block device. The value is specified in + bytes, optionally with a K, M, or G suffix, case insensitive. + Default is 1MB. + + noautodefrag(*) + autodefrag + Disable/enable auto defragmentation. + Auto defragmentation detects small random writes into files and queue + them up for the defrag process. Works best for small files; + Not well suited for large database workloads. + + check_int + check_int_data + check_int_print_mask=<value> + These debugging options control the behavior of the integrity checking + module (the BTRFS_FS_CHECK_INTEGRITY config option required). + + check_int enables the integrity checker module, which examines all + block write requests to ensure on-disk consistency, at a large + memory and CPU cost. + + check_int_data includes extent data in the integrity checks, and + implies the check_int option. + + check_int_print_mask takes a bitmask of BTRFSIC_PRINT_MASK_* values + as defined in fs/btrfs/check-integrity.c, to control the integrity + checker module behavior. + + See comments at the top of fs/btrfs/check-integrity.c for more info. + + commit=<seconds> + Set the interval of periodic commit, 30 seconds by default. Higher + values defer data being synced to permanent storage with obvious + consequences when the system crashes. The upper bound is not forced, + but a warning is printed if it's more than 300 seconds (5 minutes). + + compress + compress=<type> + compress-force + compress-force=<type> + Control BTRFS file data compression. Type may be specified as "zlib" + "lzo" or "no" (for no compression, used for remounting). If no type + is specified, zlib is used. If compress-force is specified, + all files will be compressed, whether or not they compress well. + If compression is enabled, nodatacow and nodatasum are disabled. + + degraded + Allow mounts to continue with missing devices. A read-write mount may + fail with too many devices missing, for example if a stripe member + is completely missing. + + device=<devicepath> + Specify a device during mount so that ioctls on the control device + can be avoided. Especially useful when trying to mount a multi-device + setup as root. May be specified multiple times for multiple devices. + + nodiscard(*) + discard + Disable/enable discard mount option. + Discard issues frequent commands to let the block device reclaim space + freed by the filesystem. + This is useful for SSD devices, thinly provisioned + LUNs and virtual machine images, but may have a significant + performance impact. (The fstrim command is also available to + initiate batch trims from userspace). + + noenospc_debug(*) + enospc_debug + Disable/enable debugging option to be more verbose in some ENOSPC conditions. + + fatal_errors=<action> + Action to take when encountering a fatal error: + "bug" - BUG() on a fatal error. This is the default. + "panic" - panic() on a fatal error. + + noflushoncommit(*) + flushoncommit + The 'flushoncommit' mount option forces any data dirtied by a write in a + prior transaction to commit as part of the current commit. This makes + the committed state a fully consistent view of the file system from the + application's perspective (i.e., it includes all completed file system + operations). This was previously the behavior only when a snapshot is + created. + + inode_cache + Enable free inode number caching. Defaults to off due to an overflow + problem when the free space crcs don't fit inside a single page. + + max_inline=<bytes> + Specify the maximum amount of space, in bytes, that can be inlined in + a metadata B-tree leaf. The value is specified in bytes, optionally + with a K, M, or G suffix, case insensitive. In practice, this value + is limited by the root sector size, with some space unavailable due + to leaf headers. For a 4k sectorsize, max inline data is ~3900 bytes. + + metadata_ratio=<value> + Specify that 1 metadata chunk should be allocated after every <value> + data chunks. Off by default. + + acl(*) + noacl + Enable/disable support for Posix Access Control Lists (ACLs). See the + acl(5) manual page for more information about ACLs. + + barrier(*) + nobarrier + Enable/disable the use of block layer write barriers. Write barriers + ensure that certain IOs make it through the device cache and are on + persistent storage. If disabled on a device with a volatile + (non-battery-backed) write-back cache, nobarrier option will lead to + filesystem corruption on a system crash or power loss. + + datacow(*) + nodatacow + Enable/disable data copy-on-write for newly created files. + Nodatacow implies nodatasum, and disables all compression. + + datasum(*) + nodatasum + Enable/disable data checksumming for newly created files. + Datasum implies datacow. + + treelog(*) + notreelog + Enable/disable the tree logging used for fsync and O_SYNC writes. + + recovery + Enable autorecovery attempts if a bad tree root is found at mount time. + Currently this scans a list of several previous tree roots and tries to + use the first readable. + + rescan_uuid_tree + Force check and rebuild procedure of the UUID tree. This should not + normally be needed. + + skip_balance + Skip automatic resume of interrupted balance operation after mount. + May be resumed with "btrfs balance resume." + + space_cache (*) + Enable the on-disk freespace cache. + nospace_cache + Disable freespace cache loading without clearing the cache. + clear_cache + Force clearing and rebuilding of the disk space cache if something + has gone wrong. + + ssd + nossd + ssd_spread + Options to control ssd allocation schemes. By default, BTRFS will + enable or disable ssd allocation heuristics depending on whether a + rotational or nonrotational disk is in use. The ssd and nossd options + can override this autodetection. + + The ssd_spread mount option attempts to allocate into big chunks + of unused space, and may perform better on low-end ssds. ssd_spread + implies ssd, enabling all other ssd heuristics as well. + + subvol=<path> + Mount subvolume at <path> rather than the root subvolume. <path> is + relative to the top level subvolume. + + subvolid=<ID> + Mount subvolume specified by an ID number rather than the root subvolume. + This allows mounting of subvolumes which are not in the root of the mounted + filesystem. + You can use "btrfs subvolume list" to see subvolume ID numbers. + + subvolrootid=<objectid> (deprecated) + Mount subvolume specified by <objectid> rather than the root subvolume. + This allows mounting of subvolumes which are not in the root of the mounted + filesystem. + You can use "btrfs subvolume show " to see the object ID for a subvolume. + + thread_pool=<number> + The number of worker threads to allocate. The default number is equal + to the number of CPUs + 2, or 8, whichever is smaller. + + user_subvol_rm_allowed + Allow subvolumes to be deleted by a non-root user. Use with caution. + +MAILING LIST +============ + +There is a Btrfs mailing list hosted on vger.kernel.org. You can +find details on how to subscribe here: + +http://vger.kernel.org/vger-lists.html#linux-btrfs + +Mailing list archives are available from gmane: + +http://dir.gmane.org/gmane.comp.file-systems.btrfs + + + +IRC +=== + +Discussion of Btrfs also occurs on the #btrfs channel of the Freenode +IRC network. + + + + UTILITIES + ========= + +Userspace tools for creating and manipulating Btrfs file systems are +available from the git repository at the following location: + + http://git.kernel.org/?p=linux/kernel/git/mason/btrfs-progs.git + git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-progs.git + +These include the following tools: + +* mkfs.btrfs: create a filesystem + +* btrfs: a single tool to manage the filesystems, refer to the manpage for more details + +* 'btrfsck' or 'btrfs check': do a consistency check of the filesystem + +Other tools for specific tasks: + +* btrfs-convert: in-place conversion from ext2/3/4 filesystems + +* btrfs-image: dump filesystem metadata for debugging diff --git a/Documentation/filesystems/caching/backend-api.txt b/Documentation/filesystems/caching/backend-api.txt new file mode 100644 index 00000000000..277d1e81067 --- /dev/null +++ b/Documentation/filesystems/caching/backend-api.txt @@ -0,0 +1,703 @@ + ========================== + FS-CACHE CACHE BACKEND API + ========================== + +The FS-Cache system provides an API by which actual caches can be supplied to +FS-Cache for it to then serve out to network filesystems and other interested +parties. + +This API is declared in <linux/fscache-cache.h>. + + +==================================== +INITIALISING AND REGISTERING A CACHE +==================================== + +To start off, a cache definition must be initialised and registered for each +cache the backend wants to make available. For instance, CacheFS does this in +the fill_super() operation on mounting. + +The cache definition (struct fscache_cache) should be initialised by calling: + + void fscache_init_cache(struct fscache_cache *cache, + struct fscache_cache_ops *ops, + const char *idfmt, + ...); + +Where: + + (*) "cache" is a pointer to the cache definition; + + (*) "ops" is a pointer to the table of operations that the backend supports on + this cache; and + + (*) "idfmt" is a format and printf-style arguments for constructing a label + for the cache. + + +The cache should then be registered with FS-Cache by passing a pointer to the +previously initialised cache definition to: + + int fscache_add_cache(struct fscache_cache *cache, + struct fscache_object *fsdef, + const char *tagname); + +Two extra arguments should also be supplied: + + (*) "fsdef" which should point to the object representation for the FS-Cache + master index in this cache. Netfs primary index entries will be created + here. FS-Cache keeps the caller's reference to the index object if + successful and will release it upon withdrawal of the cache. + + (*) "tagname" which, if given, should be a text string naming this cache. If + this is NULL, the identifier will be used instead. For CacheFS, the + identifier is set to name the underlying block device and the tag can be + supplied by mount. + +This function may return -ENOMEM if it ran out of memory or -EEXIST if the tag +is already in use. 0 will be returned on success. + + +===================== +UNREGISTERING A CACHE +===================== + +A cache can be withdrawn from the system by calling this function with a +pointer to the cache definition: + + void fscache_withdraw_cache(struct fscache_cache *cache); + +In CacheFS's case, this is called by put_super(). + + +======== +SECURITY +======== + +The cache methods are executed one of two contexts: + + (1) that of the userspace process that issued the netfs operation that caused + the cache method to be invoked, or + + (2) that of one of the processes in the FS-Cache thread pool. + +In either case, this may not be an appropriate context in which to access the +cache. + +The calling process's fsuid, fsgid and SELinux security identities may need to +be masqueraded for the duration of the cache driver's access to the cache. +This is left to the cache to handle; FS-Cache makes no effort in this regard. + + +=================================== +CONTROL AND STATISTICS PRESENTATION +=================================== + +The cache may present data to the outside world through FS-Cache's interfaces +in sysfs and procfs - the former for control and the latter for statistics. + +A sysfs directory called /sys/fs/fscache/<cachetag>/ is created if CONFIG_SYSFS +is enabled. This is accessible through the kobject struct fscache_cache::kobj +and is for use by the cache as it sees fit. + + +======================== +RELEVANT DATA STRUCTURES +======================== + + (*) Index/Data file FS-Cache representation cookie: + + struct fscache_cookie { + struct fscache_object_def *def; + struct fscache_netfs *netfs; + void *netfs_data; + ... + }; + + The fields that might be of use to the backend describe the object + definition, the netfs definition and the netfs's data for this cookie. + The object definition contain functions supplied by the netfs for loading + and matching index entries; these are required to provide some of the + cache operations. + + + (*) In-cache object representation: + + struct fscache_object { + int debug_id; + enum { + FSCACHE_OBJECT_RECYCLING, + ... + } state; + spinlock_t lock + struct fscache_cache *cache; + struct fscache_cookie *cookie; + ... + }; + + Structures of this type should be allocated by the cache backend and + passed to FS-Cache when requested by the appropriate cache operation. In + the case of CacheFS, they're embedded in CacheFS's internal object + structures. + + The debug_id is a simple integer that can be used in debugging messages + that refer to a particular object. In such a case it should be printed + using "OBJ%x" to be consistent with FS-Cache. + + Each object contains a pointer to the cookie that represents the object it + is backing. An object should retired when put_object() is called if it is + in state FSCACHE_OBJECT_RECYCLING. The fscache_object struct should be + initialised by calling fscache_object_init(object). + + + (*) FS-Cache operation record: + + struct fscache_operation { + atomic_t usage; + struct fscache_object *object; + unsigned long flags; + #define FSCACHE_OP_EXCLUSIVE + void (*processor)(struct fscache_operation *op); + void (*release)(struct fscache_operation *op); + ... + }; + + FS-Cache has a pool of threads that it uses to give CPU time to the + various asynchronous operations that need to be done as part of driving + the cache. These are represented by the above structure. The processor + method is called to give the op CPU time, and the release method to get + rid of it when its usage count reaches 0. + + An operation can be made exclusive upon an object by setting the + appropriate flag before enqueuing it with fscache_enqueue_operation(). If + an operation needs more processing time, it should be enqueued again. + + + (*) FS-Cache retrieval operation record: + + struct fscache_retrieval { + struct fscache_operation op; + struct address_space *mapping; + struct list_head *to_do; + ... + }; + + A structure of this type is allocated by FS-Cache to record retrieval and + allocation requests made by the netfs. This struct is then passed to the + backend to do the operation. The backend may get extra refs to it by + calling fscache_get_retrieval() and refs may be discarded by calling + fscache_put_retrieval(). + + A retrieval operation can be used by the backend to do retrieval work. To + do this, the retrieval->op.processor method pointer should be set + appropriately by the backend and fscache_enqueue_retrieval() called to + submit it to the thread pool. CacheFiles, for example, uses this to queue + page examination when it detects PG_lock being cleared. + + The to_do field is an empty list available for the cache backend to use as + it sees fit. + + + (*) FS-Cache storage operation record: + + struct fscache_storage { + struct fscache_operation op; + pgoff_t store_limit; + ... + }; + + A structure of this type is allocated by FS-Cache to record outstanding + writes to be made. FS-Cache itself enqueues this operation and invokes + the write_page() method on the object at appropriate times to effect + storage. + + +================ +CACHE OPERATIONS +================ + +The cache backend provides FS-Cache with a table of operations that can be +performed on the denizens of the cache. These are held in a structure of type: + + struct fscache_cache_ops + + (*) Name of cache provider [mandatory]: + + const char *name + + This isn't strictly an operation, but should be pointed at a string naming + the backend. + + + (*) Allocate a new object [mandatory]: + + struct fscache_object *(*alloc_object)(struct fscache_cache *cache, + struct fscache_cookie *cookie) + + This method is used to allocate a cache object representation to back a + cookie in a particular cache. fscache_object_init() should be called on + the object to initialise it prior to returning. + + This function may also be used to parse the index key to be used for + multiple lookup calls to turn it into a more convenient form. FS-Cache + will call the lookup_complete() method to allow the cache to release the + form once lookup is complete or aborted. + + + (*) Look up and create object [mandatory]: + + void (*lookup_object)(struct fscache_object *object) + + This method is used to look up an object, given that the object is already + allocated and attached to the cookie. This should instantiate that object + in the cache if it can. + + The method should call fscache_object_lookup_negative() as soon as + possible if it determines the object doesn't exist in the cache. If the + object is found to exist and the netfs indicates that it is valid then + fscache_obtained_object() should be called once the object is in a + position to have data stored in it. Similarly, fscache_obtained_object() + should also be called once a non-present object has been created. + + If a lookup error occurs, fscache_object_lookup_error() should be called + to abort the lookup of that object. + + + (*) Release lookup data [mandatory]: + + void (*lookup_complete)(struct fscache_object *object) + + This method is called to ask the cache to release any resources it was + using to perform a lookup. + + + (*) Increment object refcount [mandatory]: + + struct fscache_object *(*grab_object)(struct fscache_object *object) + + This method is called to increment the reference count on an object. It + may fail (for instance if the cache is being withdrawn) by returning NULL. + It should return the object pointer if successful. + + + (*) Lock/Unlock object [mandatory]: + + void (*lock_object)(struct fscache_object *object) + void (*unlock_object)(struct fscache_object *object) + + These methods are used to exclusively lock an object. It must be possible + to schedule with the lock held, so a spinlock isn't sufficient. + + + (*) Pin/Unpin object [optional]: + + int (*pin_object)(struct fscache_object *object) + void (*unpin_object)(struct fscache_object *object) + + These methods are used to pin an object into the cache. Once pinned an + object cannot be reclaimed to make space. Return -ENOSPC if there's not + enough space in the cache to permit this. + + + (*) Check coherency state of an object [mandatory]: + + int (*check_consistency)(struct fscache_object *object) + + This method is called to have the cache check the saved auxiliary data of + the object against the netfs's idea of the state. 0 should be returned + if they're consistent and -ESTALE otherwise. -ENOMEM and -ERESTARTSYS + may also be returned. + + (*) Update object [mandatory]: + + int (*update_object)(struct fscache_object *object) + + This is called to update the index entry for the specified object. The + new information should be in object->cookie->netfs_data. This can be + obtained by calling object->cookie->def->get_aux()/get_attr(). + + + (*) Invalidate data object [mandatory]: + + int (*invalidate_object)(struct fscache_operation *op) + + This is called to invalidate a data object (as pointed to by op->object). + All the data stored for this object should be discarded and an + attr_changed operation should be performed. The caller will follow up + with an object update operation. + + fscache_op_complete() must be called on op before returning. + + + (*) Discard object [mandatory]: + + void (*drop_object)(struct fscache_object *object) + + This method is called to indicate that an object has been unbound from its + cookie, and that the cache should release the object's resources and + retire it if it's in state FSCACHE_OBJECT_RECYCLING. + + This method should not attempt to release any references held by the + caller. The caller will invoke the put_object() method as appropriate. + + + (*) Release object reference [mandatory]: + + void (*put_object)(struct fscache_object *object) + + This method is used to discard a reference to an object. The object may + be freed when all the references to it are released. + + + (*) Synchronise a cache [mandatory]: + + void (*sync)(struct fscache_cache *cache) + + This is called to ask the backend to synchronise a cache with its backing + device. + + + (*) Dissociate a cache [mandatory]: + + void (*dissociate_pages)(struct fscache_cache *cache) + + This is called to ask a cache to perform any page dissociations as part of + cache withdrawal. + + + (*) Notification that the attributes on a netfs file changed [mandatory]: + + int (*attr_changed)(struct fscache_object *object); + + This is called to indicate to the cache that certain attributes on a netfs + file have changed (for example the maximum size a file may reach). The + cache can read these from the netfs by calling the cookie's get_attr() + method. + + The cache may use the file size information to reserve space on the cache. + It should also call fscache_set_store_limit() to indicate to FS-Cache the + highest byte it's willing to store for an object. + + This method may return -ve if an error occurred or the cache object cannot + be expanded. In such a case, the object will be withdrawn from service. + + This operation is run asynchronously from FS-Cache's thread pool, and + storage and retrieval operations from the netfs are excluded during the + execution of this operation. + + + (*) Reserve cache space for an object's data [optional]: + + int (*reserve_space)(struct fscache_object *object, loff_t size); + + This is called to request that cache space be reserved to hold the data + for an object and the metadata used to track it. Zero size should be + taken as request to cancel a reservation. + + This should return 0 if successful, -ENOSPC if there isn't enough space + available, or -ENOMEM or -EIO on other errors. + + The reservation may exceed the current size of the object, thus permitting + future expansion. If the amount of space consumed by an object would + exceed the reservation, it's permitted to refuse requests to allocate + pages, but not required. An object may be pruned down to its reservation + size if larger than that already. + + + (*) Request page be read from cache [mandatory]: + + int (*read_or_alloc_page)(struct fscache_retrieval *op, + struct page *page, + gfp_t gfp) + + This is called to attempt to read a netfs page from the cache, or to + reserve a backing block if not. FS-Cache will have done as much checking + as it can before calling, but most of the work belongs to the backend. + + If there's no page in the cache, then -ENODATA should be returned if the + backend managed to reserve a backing block; -ENOBUFS or -ENOMEM if it + didn't. + + If there is suitable data in the cache, then a read operation should be + queued and 0 returned. When the read finishes, fscache_end_io() should be + called. + + The fscache_mark_pages_cached() should be called for the page if any cache + metadata is retained. This will indicate to the netfs that the page needs + explicit uncaching. This operation takes a pagevec, thus allowing several + pages to be marked at once. + + The retrieval record pointed to by op should be retained for each page + queued and released when I/O on the page has been formally ended. + fscache_get/put_retrieval() are available for this purpose. + + The retrieval record may be used to get CPU time via the FS-Cache thread + pool. If this is desired, the op->op.processor should be set to point to + the appropriate processing routine, and fscache_enqueue_retrieval() should + be called at an appropriate point to request CPU time. For instance, the + retrieval routine could be enqueued upon the completion of a disk read. + The to_do field in the retrieval record is provided to aid in this. + + If an I/O error occurs, fscache_io_error() should be called and -ENOBUFS + returned if possible or fscache_end_io() called with a suitable error + code. + + fscache_put_retrieval() should be called after a page or pages are dealt + with. This will complete the operation when all pages are dealt with. + + + (*) Request pages be read from cache [mandatory]: + + int (*read_or_alloc_pages)(struct fscache_retrieval *op, + struct list_head *pages, + unsigned *nr_pages, + gfp_t gfp) + + This is like the read_or_alloc_page() method, except it is handed a list + of pages instead of one page. Any pages on which a read operation is + started must be added to the page cache for the specified mapping and also + to the LRU. Such pages must also be removed from the pages list and + *nr_pages decremented per page. + + If there was an error such as -ENOMEM, then that should be returned; else + if one or more pages couldn't be read or allocated, then -ENOBUFS should + be returned; else if one or more pages couldn't be read, then -ENODATA + should be returned. If all the pages are dispatched then 0 should be + returned. + + + (*) Request page be allocated in the cache [mandatory]: + + int (*allocate_page)(struct fscache_retrieval *op, + struct page *page, + gfp_t gfp) + + This is like the read_or_alloc_page() method, except that it shouldn't + read from the cache, even if there's data there that could be retrieved. + It should, however, set up any internal metadata required such that + the write_page() method can write to the cache. + + If there's no backing block available, then -ENOBUFS should be returned + (or -ENOMEM if there were other problems). If a block is successfully + allocated, then the netfs page should be marked and 0 returned. + + + (*) Request pages be allocated in the cache [mandatory]: + + int (*allocate_pages)(struct fscache_retrieval *op, + struct list_head *pages, + unsigned *nr_pages, + gfp_t gfp) + + This is an multiple page version of the allocate_page() method. pages and + nr_pages should be treated as for the read_or_alloc_pages() method. + + + (*) Request page be written to cache [mandatory]: + + int (*write_page)(struct fscache_storage *op, + struct page *page); + + This is called to write from a page on which there was a previously + successful read_or_alloc_page() call or similar. FS-Cache filters out + pages that don't have mappings. + + This method is called asynchronously from the FS-Cache thread pool. It is + not required to actually store anything, provided -ENODATA is then + returned to the next read of this page. + + If an error occurred, then a negative error code should be returned, + otherwise zero should be returned. FS-Cache will take appropriate action + in response to an error, such as withdrawing this object. + + If this method returns success then FS-Cache will inform the netfs + appropriately. + + + (*) Discard retained per-page metadata [mandatory]: + + void (*uncache_page)(struct fscache_object *object, struct page *page) + + This is called when a netfs page is being evicted from the pagecache. The + cache backend should tear down any internal representation or tracking it + maintains for this page. + + +================== +FS-CACHE UTILITIES +================== + +FS-Cache provides some utilities that a cache backend may make use of: + + (*) Note occurrence of an I/O error in a cache: + + void fscache_io_error(struct fscache_cache *cache) + + This tells FS-Cache that an I/O error occurred in the cache. After this + has been called, only resource dissociation operations (object and page + release) will be passed from the netfs to the cache backend for the + specified cache. + + This does not actually withdraw the cache. That must be done separately. + + + (*) Invoke the retrieval I/O completion function: + + void fscache_end_io(struct fscache_retrieval *op, struct page *page, + int error); + + This is called to note the end of an attempt to retrieve a page. The + error value should be 0 if successful and an error otherwise. + + + (*) Record that one or more pages being retrieved or allocated have been dealt + with: + + void fscache_retrieval_complete(struct fscache_retrieval *op, + int n_pages); + + This is called to record the fact that one or more pages have been dealt + with and are no longer the concern of this operation. When the number of + pages remaining in the operation reaches 0, the operation will be + completed. + + + (*) Record operation completion: + + void fscache_op_complete(struct fscache_operation *op); + + This is called to record the completion of an operation. This deducts + this operation from the parent object's run state, potentially permitting + one or more pending operations to start running. + + + (*) Set highest store limit: + + void fscache_set_store_limit(struct fscache_object *object, + loff_t i_size); + + This sets the limit FS-Cache imposes on the highest byte it's willing to + try and store for a netfs. Any page over this limit is automatically + rejected by fscache_read_alloc_page() and co with -ENOBUFS. + + + (*) Mark pages as being cached: + + void fscache_mark_pages_cached(struct fscache_retrieval *op, + struct pagevec *pagevec); + + This marks a set of pages as being cached. After this has been called, + the netfs must call fscache_uncache_page() to unmark the pages. + + + (*) Perform coherency check on an object: + + enum fscache_checkaux fscache_check_aux(struct fscache_object *object, + const void *data, + uint16_t datalen); + + This asks the netfs to perform a coherency check on an object that has + just been looked up. The cookie attached to the object will determine the + netfs to use. data and datalen should specify where the auxiliary data + retrieved from the cache can be found. + + One of three values will be returned: + + (*) FSCACHE_CHECKAUX_OKAY + + The coherency data indicates the object is valid as is. + + (*) FSCACHE_CHECKAUX_NEEDS_UPDATE + + The coherency data needs updating, but otherwise the object is + valid. + + (*) FSCACHE_CHECKAUX_OBSOLETE + + The coherency data indicates that the object is obsolete and should + be discarded. + + + (*) Initialise a freshly allocated object: + + void fscache_object_init(struct fscache_object *object); + + This initialises all the fields in an object representation. + + + (*) Indicate the destruction of an object: + + void fscache_object_destroyed(struct fscache_cache *cache); + + This must be called to inform FS-Cache that an object that belonged to a + cache has been destroyed and deallocated. This will allow continuation + of the cache withdrawal process when it is stopped pending destruction of + all the objects. + + + (*) Indicate negative lookup on an object: + + void fscache_object_lookup_negative(struct fscache_object *object); + + This is called to indicate to FS-Cache that a lookup process for an object + found a negative result. + + This changes the state of an object to permit reads pending on lookup + completion to go off and start fetching data from the netfs server as it's + known at this point that there can't be any data in the cache. + + This may be called multiple times on an object. Only the first call is + significant - all subsequent calls are ignored. + + + (*) Indicate an object has been obtained: + + void fscache_obtained_object(struct fscache_object *object); + + This is called to indicate to FS-Cache that a lookup process for an object + produced a positive result, or that an object was created. This should + only be called once for any particular object. + + This changes the state of an object to indicate: + + (1) if no call to fscache_object_lookup_negative() has been made on + this object, that there may be data available, and that reads can + now go and look for it; and + + (2) that writes may now proceed against this object. + + + (*) Indicate that object lookup failed: + + void fscache_object_lookup_error(struct fscache_object *object); + + This marks an object as having encountered a fatal error (usually EIO) + and causes it to move into a state whereby it will be withdrawn as soon + as possible. + + + (*) Get and release references on a retrieval record: + + void fscache_get_retrieval(struct fscache_retrieval *op); + void fscache_put_retrieval(struct fscache_retrieval *op); + + These two functions are used to retain a retrieval record whilst doing + asynchronous data retrieval and block allocation. + + + (*) Enqueue a retrieval record for processing. + + void fscache_enqueue_retrieval(struct fscache_retrieval *op); + + This enqueues a retrieval record for processing by the FS-Cache thread + pool. One of the threads in the pool will invoke the retrieval record's + op->op.processor callback function. This function may be called from + within the callback function. + + + (*) List of object state names: + + const char *fscache_object_states[]; + + For debugging purposes, this may be used to turn the state that an object + is in into a text string for display purposes. diff --git a/Documentation/filesystems/caching/cachefiles.txt b/Documentation/filesystems/caching/cachefiles.txt new file mode 100644 index 00000000000..748a1ae49e1 --- /dev/null +++ b/Documentation/filesystems/caching/cachefiles.txt @@ -0,0 +1,501 @@ + =============================================== + CacheFiles: CACHE ON ALREADY MOUNTED FILESYSTEM + =============================================== + +Contents: + + (*) Overview. + + (*) Requirements. + + (*) Configuration. + + (*) Starting the cache. + + (*) Things to avoid. + + (*) Cache culling. + + (*) Cache structure. + + (*) Security model and SELinux. + + (*) A note on security. + + (*) Statistical information. + + (*) Debugging. + + +======== +OVERVIEW +======== + +CacheFiles is a caching backend that's meant to use as a cache a directory on +an already mounted filesystem of a local type (such as Ext3). + +CacheFiles uses a userspace daemon to do some of the cache management - such as +reaping stale nodes and culling. This is called cachefilesd and lives in +/sbin. + +The filesystem and data integrity of the cache are only as good as those of the +filesystem providing the backing services. Note that CacheFiles does not +attempt to journal anything since the journalling interfaces of the various +filesystems are very specific in nature. + +CacheFiles creates a misc character device - "/dev/cachefiles" - that is used +to communication with the daemon. Only one thing may have this open at once, +and whilst it is open, a cache is at least partially in existence. The daemon +opens this and sends commands down it to control the cache. + +CacheFiles is currently limited to a single cache. + +CacheFiles attempts to maintain at least a certain percentage of free space on +the filesystem, shrinking the cache by culling the objects it contains to make +space if necessary - see the "Cache Culling" section. This means it can be +placed on the same medium as a live set of data, and will expand to make use of +spare space and automatically contract when the set of data requires more +space. + + +============ +REQUIREMENTS +============ + +The use of CacheFiles and its daemon requires the following features to be +available in the system and in the cache filesystem: + + - dnotify. + + - extended attributes (xattrs). + + - openat() and friends. + + - bmap() support on files in the filesystem (FIBMAP ioctl). + + - The use of bmap() to detect a partial page at the end of the file. + +It is strongly recommended that the "dir_index" option is enabled on Ext3 +filesystems being used as a cache. + + +============= +CONFIGURATION +============= + +The cache is configured by a script in /etc/cachefilesd.conf. These commands +set up cache ready for use. The following script commands are available: + + (*) brun <N>% + (*) bcull <N>% + (*) bstop <N>% + (*) frun <N>% + (*) fcull <N>% + (*) fstop <N>% + + Configure the culling limits. Optional. See the section on culling + The defaults are 7% (run), 5% (cull) and 1% (stop) respectively. + + The commands beginning with a 'b' are file space (block) limits, those + beginning with an 'f' are file count limits. + + (*) dir <path> + + Specify the directory containing the root of the cache. Mandatory. + + (*) tag <name> + + Specify a tag to FS-Cache to use in distinguishing multiple caches. + Optional. The default is "CacheFiles". + + (*) debug <mask> + + Specify a numeric bitmask to control debugging in the kernel module. + Optional. The default is zero (all off). The following values can be + OR'd into the mask to collect various information: + + 1 Turn on trace of function entry (_enter() macros) + 2 Turn on trace of function exit (_leave() macros) + 4 Turn on trace of internal debug points (_debug()) + + This mask can also be set through sysfs, eg: + + echo 5 >/sys/modules/cachefiles/parameters/debug + + +================== +STARTING THE CACHE +================== + +The cache is started by running the daemon. The daemon opens the cache device, +configures the cache and tells it to begin caching. At that point the cache +binds to fscache and the cache becomes live. + +The daemon is run as follows: + + /sbin/cachefilesd [-d]* [-s] [-n] [-f <configfile>] + +The flags are: + + (*) -d + + Increase the debugging level. This can be specified multiple times and + is cumulative with itself. + + (*) -s + + Send messages to stderr instead of syslog. + + (*) -n + + Don't daemonise and go into background. + + (*) -f <configfile> + + Use an alternative configuration file rather than the default one. + + +=============== +THINGS TO AVOID +=============== + +Do not mount other things within the cache as this will cause problems. The +kernel module contains its own very cut-down path walking facility that ignores +mountpoints, but the daemon can't avoid them. + +Do not create, rename or unlink files and directories in the cache whilst the +cache is active, as this may cause the state to become uncertain. + +Renaming files in the cache might make objects appear to be other objects (the +filename is part of the lookup key). + +Do not change or remove the extended attributes attached to cache files by the +cache as this will cause the cache state management to get confused. + +Do not create files or directories in the cache, lest the cache get confused or +serve incorrect data. + +Do not chmod files in the cache. The module creates things with minimal +permissions to prevent random users being able to access them directly. + + +============= +CACHE CULLING +============= + +The cache may need culling occasionally to make space. This involves +discarding objects from the cache that have been used less recently than +anything else. Culling is based on the access time of data objects. Empty +directories are culled if not in use. + +Cache culling is done on the basis of the percentage of blocks and the +percentage of files available in the underlying filesystem. There are six +"limits": + + (*) brun + (*) frun + + If the amount of free space and the number of available files in the cache + rises above both these limits, then culling is turned off. + + (*) bcull + (*) fcull + + If the amount of available space or the number of available files in the + cache falls below either of these limits, then culling is started. + + (*) bstop + (*) fstop + + If the amount of available space or the number of available files in the + cache falls below either of these limits, then no further allocation of + disk space or files is permitted until culling has raised things above + these limits again. + +These must be configured thusly: + + 0 <= bstop < bcull < brun < 100 + 0 <= fstop < fcull < frun < 100 + +Note that these are percentages of available space and available files, and do +_not_ appear as 100 minus the percentage displayed by the "df" program. + +The userspace daemon scans the cache to build up a table of cullable objects. +These are then culled in least recently used order. A new scan of the cache is +started as soon as space is made in the table. Objects will be skipped if +their atimes have changed or if the kernel module says it is still using them. + + +=============== +CACHE STRUCTURE +=============== + +The CacheFiles module will create two directories in the directory it was +given: + + (*) cache/ + + (*) graveyard/ + +The active cache objects all reside in the first directory. The CacheFiles +kernel module moves any retired or culled objects that it can't simply unlink +to the graveyard from which the daemon will actually delete them. + +The daemon uses dnotify to monitor the graveyard directory, and will delete +anything that appears therein. + + +The module represents index objects as directories with the filename "I..." or +"J...". Note that the "cache/" directory is itself a special index. + +Data objects are represented as files if they have no children, or directories +if they do. Their filenames all begin "D..." or "E...". If represented as a +directory, data objects will have a file in the directory called "data" that +actually holds the data. + +Special objects are similar to data objects, except their filenames begin +"S..." or "T...". + + +If an object has children, then it will be represented as a directory. +Immediately in the representative directory are a collection of directories +named for hash values of the child object keys with an '@' prepended. Into +this directory, if possible, will be placed the representations of the child +objects: + + INDEX INDEX INDEX DATA FILES + ========= ========== ================================= ================ + cache/@4a/I03nfs/@30/Ji000000000000000--fHg8hi8400 + cache/@4a/I03nfs/@30/Ji000000000000000--fHg8hi8400/@75/Es0g000w...DB1ry + cache/@4a/I03nfs/@30/Ji000000000000000--fHg8hi8400/@75/Es0g000w...N22ry + cache/@4a/I03nfs/@30/Ji000000000000000--fHg8hi8400/@75/Es0g000w...FP1ry + + +If the key is so long that it exceeds NAME_MAX with the decorations added on to +it, then it will be cut into pieces, the first few of which will be used to +make a nest of directories, and the last one of which will be the objects +inside the last directory. The names of the intermediate directories will have +'+' prepended: + + J1223/@23/+xy...z/+kl...m/Epqr + + +Note that keys are raw data, and not only may they exceed NAME_MAX in size, +they may also contain things like '/' and NUL characters, and so they may not +be suitable for turning directly into a filename. + +To handle this, CacheFiles will use a suitably printable filename directly and +"base-64" encode ones that aren't directly suitable. The two versions of +object filenames indicate the encoding: + + OBJECT TYPE PRINTABLE ENCODED + =============== =============== =============== + Index "I..." "J..." + Data "D..." "E..." + Special "S..." "T..." + +Intermediate directories are always "@" or "+" as appropriate. + + +Each object in the cache has an extended attribute label that holds the object +type ID (required to distinguish special objects) and the auxiliary data from +the netfs. The latter is used to detect stale objects in the cache and update +or retire them. + + +Note that CacheFiles will erase from the cache any file it doesn't recognise or +any file of an incorrect type (such as a FIFO file or a device file). + + +========================== +SECURITY MODEL AND SELINUX +========================== + +CacheFiles is implemented to deal properly with the LSM security features of +the Linux kernel and the SELinux facility. + +One of the problems that CacheFiles faces is that it is generally acting on +behalf of a process, and running in that process's context, and that includes a +security context that is not appropriate for accessing the cache - either +because the files in the cache are inaccessible to that process, or because if +the process creates a file in the cache, that file may be inaccessible to other +processes. + +The way CacheFiles works is to temporarily change the security context (fsuid, +fsgid and actor security label) that the process acts as - without changing the +security context of the process when it the target of an operation performed by +some other process (so signalling and suchlike still work correctly). + + +When the CacheFiles module is asked to bind to its cache, it: + + (1) Finds the security label attached to the root cache directory and uses + that as the security label with which it will create files. By default, + this is: + + cachefiles_var_t + + (2) Finds the security label of the process which issued the bind request + (presumed to be the cachefilesd daemon), which by default will be: + + cachefilesd_t + + and asks LSM to supply a security ID as which it should act given the + daemon's label. By default, this will be: + + cachefiles_kernel_t + + SELinux transitions the daemon's security ID to the module's security ID + based on a rule of this form in the policy. + + type_transition <daemon's-ID> kernel_t : process <module's-ID>; + + For instance: + + type_transition cachefilesd_t kernel_t : process cachefiles_kernel_t; + + +The module's security ID gives it permission to create, move and remove files +and directories in the cache, to find and access directories and files in the +cache, to set and access extended attributes on cache objects, and to read and +write files in the cache. + +The daemon's security ID gives it only a very restricted set of permissions: it +may scan directories, stat files and erase files and directories. It may +not read or write files in the cache, and so it is precluded from accessing the +data cached therein; nor is it permitted to create new files in the cache. + + +There are policy source files available in: + + http://people.redhat.com/~dhowells/fscache/cachefilesd-0.8.tar.bz2 + +and later versions. In that tarball, see the files: + + cachefilesd.te + cachefilesd.fc + cachefilesd.if + +They are built and installed directly by the RPM. + +If a non-RPM based system is being used, then copy the above files to their own +directory and run: + + make -f /usr/share/selinux/devel/Makefile + semodule -i cachefilesd.pp + +You will need checkpolicy and selinux-policy-devel installed prior to the +build. + + +By default, the cache is located in /var/fscache, but if it is desirable that +it should be elsewhere, than either the above policy files must be altered, or +an auxiliary policy must be installed to label the alternate location of the +cache. + +For instructions on how to add an auxiliary policy to enable the cache to be +located elsewhere when SELinux is in enforcing mode, please see: + + /usr/share/doc/cachefilesd-*/move-cache.txt + +When the cachefilesd rpm is installed; alternatively, the document can be found +in the sources. + + +================== +A NOTE ON SECURITY +================== + +CacheFiles makes use of the split security in the task_struct. It allocates +its own task_security structure, and redirects current->cred to point to it +when it acts on behalf of another process, in that process's context. + +The reason it does this is that it calls vfs_mkdir() and suchlike rather than +bypassing security and calling inode ops directly. Therefore the VFS and LSM +may deny the CacheFiles access to the cache data because under some +circumstances the caching code is running in the security context of whatever +process issued the original syscall on the netfs. + +Furthermore, should CacheFiles create a file or directory, the security +parameters with that object is created (UID, GID, security label) would be +derived from that process that issued the system call, thus potentially +preventing other processes from accessing the cache - including CacheFiles's +cache management daemon (cachefilesd). + +What is required is to temporarily override the security of the process that +issued the system call. We can't, however, just do an in-place change of the +security data as that affects the process as an object, not just as a subject. +This means it may lose signals or ptrace events for example, and affects what +the process looks like in /proc. + +So CacheFiles makes use of a logical split in the security between the +objective security (task->real_cred) and the subjective security (task->cred). +The objective security holds the intrinsic security properties of a process and +is never overridden. This is what appears in /proc, and is what is used when a +process is the target of an operation by some other process (SIGKILL for +example). + +The subjective security holds the active security properties of a process, and +may be overridden. This is not seen externally, and is used whan a process +acts upon another object, for example SIGKILLing another process or opening a +file. + +LSM hooks exist that allow SELinux (or Smack or whatever) to reject a request +for CacheFiles to run in a context of a specific security label, or to create +files and directories with another security label. + + +======================= +STATISTICAL INFORMATION +======================= + +If FS-Cache is compiled with the following option enabled: + + CONFIG_CACHEFILES_HISTOGRAM=y + +then it will gather certain statistics and display them through a proc file. + + (*) /proc/fs/cachefiles/histogram + + cat /proc/fs/cachefiles/histogram + JIFS SECS LOOKUPS MKDIRS CREATES + ===== ===== ========= ========= ========= + + This shows the breakdown of the number of times each amount of time + between 0 jiffies and HZ-1 jiffies a variety of tasks took to run. The + columns are as follows: + + COLUMN TIME MEASUREMENT + ======= ======================================================= + LOOKUPS Length of time to perform a lookup on the backing fs + MKDIRS Length of time to perform a mkdir on the backing fs + CREATES Length of time to perform a create on the backing fs + + Each row shows the number of events that took a particular range of times. + Each step is 1 jiffy in size. The JIFS column indicates the particular + jiffy range covered, and the SECS field the equivalent number of seconds. + + +========= +DEBUGGING +========= + +If CONFIG_CACHEFILES_DEBUG is enabled, the CacheFiles facility can have runtime +debugging enabled by adjusting the value in: + + /sys/module/cachefiles/parameters/debug + +This is a bitmask of debugging streams to enable: + + BIT VALUE STREAM POINT + ======= ======= =============================== ======================= + 0 1 General Function entry trace + 1 2 Function exit trace + 2 4 General + +The appropriate set of values should be OR'd together and the result written to +the control file. For example: + + echo $((1|4|8)) >/sys/module/cachefiles/parameters/debug + +will turn on all function entry debugging. diff --git a/Documentation/filesystems/caching/fscache.txt b/Documentation/filesystems/caching/fscache.txt new file mode 100644 index 00000000000..770267af5b3 --- /dev/null +++ b/Documentation/filesystems/caching/fscache.txt @@ -0,0 +1,443 @@ + ========================== + General Filesystem Caching + ========================== + +======== +OVERVIEW +======== + +This facility is a general purpose cache for network filesystems, though it +could be used for caching other things such as ISO9660 filesystems too. + +FS-Cache mediates between cache backends (such as CacheFS) and network +filesystems: + + +---------+ + | | +--------------+ + | NFS |--+ | | + | | | +-->| CacheFS | + +---------+ | +----------+ | | /dev/hda5 | + | | | | +--------------+ + +---------+ +-->| | | + | | | |--+ + | AFS |----->| FS-Cache | + | | | |--+ + +---------+ +-->| | | + | | | | +--------------+ + +---------+ | +----------+ | | | + | | | +-->| CacheFiles | + | ISOFS |--+ | /var/cache | + | | +--------------+ + +---------+ + +Or to look at it another way, FS-Cache is a module that provides a caching +facility to a network filesystem such that the cache is transparent to the +user: + + +---------+ + | | + | Server | + | | + +---------+ + | NETWORK + ~~~~~|~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + | + | +----------+ + V | | + +---------+ | | + | | | | + | NFS |----->| FS-Cache | + | | | |--+ + +---------+ | | | +--------------+ +--------------+ + | | | | | | | | + V +----------+ +-->| CacheFiles |-->| Ext3 | + +---------+ | /var/cache | | /dev/sda6 | + | | +--------------+ +--------------+ + | VFS | ^ ^ + | | | | + +---------+ +--------------+ | + | KERNEL SPACE | | + ~~~~~|~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~|~~~~~~|~~~~ + | USER SPACE | | + V | | + +---------+ +--------------+ + | | | | + | Process | | cachefilesd | + | | | | + +---------+ +--------------+ + + +FS-Cache does not follow the idea of completely loading every netfs file +opened in its entirety into a cache before permitting it to be accessed and +then serving the pages out of that cache rather than the netfs inode because: + + (1) It must be practical to operate without a cache. + + (2) The size of any accessible file must not be limited to the size of the + cache. + + (3) The combined size of all opened files (this includes mapped libraries) + must not be limited to the size of the cache. + + (4) The user should not be forced to download an entire file just to do a + one-off access of a small portion of it (such as might be done with the + "file" program). + +It instead serves the cache out in PAGE_SIZE chunks as and when requested by +the netfs('s) using it. + + +FS-Cache provides the following facilities: + + (1) More than one cache can be used at once. Caches can be selected + explicitly by use of tags. + + (2) Caches can be added / removed at any time. + + (3) The netfs is provided with an interface that allows either party to + withdraw caching facilities from a file (required for (2)). + + (4) The interface to the netfs returns as few errors as possible, preferring + rather to let the netfs remain oblivious. + + (5) Cookies are used to represent indices, files and other objects to the + netfs. The simplest cookie is just a NULL pointer - indicating nothing + cached there. + + (6) The netfs is allowed to propose - dynamically - any index hierarchy it + desires, though it must be aware that the index search function is + recursive, stack space is limited, and indices can only be children of + indices. + + (7) Data I/O is done direct to and from the netfs's pages. The netfs + indicates that page A is at index B of the data-file represented by cookie + C, and that it should be read or written. The cache backend may or may + not start I/O on that page, but if it does, a netfs callback will be + invoked to indicate completion. The I/O may be either synchronous or + asynchronous. + + (8) Cookies can be "retired" upon release. At this point FS-Cache will mark + them as obsolete and the index hierarchy rooted at that point will get + recycled. + + (9) The netfs provides a "match" function for index searches. In addition to + saying whether a match was made or not, this can also specify that an + entry should be updated or deleted. + +(10) As much as possible is done asynchronously. + + +FS-Cache maintains a virtual indexing tree in which all indices, files, objects +and pages are kept. Bits of this tree may actually reside in one or more +caches. + + FSDEF + | + +------------------------------------+ + | | + NFS AFS + | | + +--------------------------+ +-----------+ + | | | | + homedir mirror afs.org redhat.com + | | | + +------------+ +---------------+ +----------+ + | | | | | | + 00001 00002 00007 00125 vol00001 vol00002 + | | | | | + +---+---+ +-----+ +---+ +------+------+ +-----+----+ + | | | | | | | | | | | | | +PG0 PG1 PG2 PG0 XATTR PG0 PG1 DIRENT DIRENT DIRENT R/W R/O Bak + | | + PG0 +-------+ + | | + 00001 00003 + | + +---+---+ + | | | + PG0 PG1 PG2 + +In the example above, you can see two netfs's being backed: NFS and AFS. These +have different index hierarchies: + + (*) The NFS primary index contains per-server indices. Each server index is + indexed by NFS file handles to get data file objects. Each data file + objects can have an array of pages, but may also have further child + objects, such as extended attributes and directory entries. Extended + attribute objects themselves have page-array contents. + + (*) The AFS primary index contains per-cell indices. Each cell index contains + per-logical-volume indices. Each of volume index contains up to three + indices for the read-write, read-only and backup mirrors of those volumes. + Each of these contains vnode data file objects, each of which contains an + array of pages. + +The very top index is the FS-Cache master index in which individual netfs's +have entries. + +Any index object may reside in more than one cache, provided it only has index +children. Any index with non-index object children will be assumed to only +reside in one cache. + + +The netfs API to FS-Cache can be found in: + + Documentation/filesystems/caching/netfs-api.txt + +The cache backend API to FS-Cache can be found in: + + Documentation/filesystems/caching/backend-api.txt + +A description of the internal representations and object state machine can be +found in: + + Documentation/filesystems/caching/object.txt + + +======================= +STATISTICAL INFORMATION +======================= + +If FS-Cache is compiled with the following options enabled: + + CONFIG_FSCACHE_STATS=y + CONFIG_FSCACHE_HISTOGRAM=y + +then it will gather certain statistics and display them through a number of +proc files. + + (*) /proc/fs/fscache/stats + + This shows counts of a number of events that can happen in FS-Cache: + + CLASS EVENT MEANING + ======= ======= ======================================================= + Cookies idx=N Number of index cookies allocated + dat=N Number of data storage cookies allocated + spc=N Number of special cookies allocated + Objects alc=N Number of objects allocated + nal=N Number of object allocation failures + avl=N Number of objects that reached the available state + ded=N Number of objects that reached the dead state + ChkAux non=N Number of objects that didn't have a coherency check + ok=N Number of objects that passed a coherency check + upd=N Number of objects that needed a coherency data update + obs=N Number of objects that were declared obsolete + Pages mrk=N Number of pages marked as being cached + unc=N Number of uncache page requests seen + Acquire n=N Number of acquire cookie requests seen + nul=N Number of acq reqs given a NULL parent + noc=N Number of acq reqs rejected due to no cache available + ok=N Number of acq reqs succeeded + nbf=N Number of acq reqs rejected due to error + oom=N Number of acq reqs failed on ENOMEM + Lookups n=N Number of lookup calls made on cache backends + neg=N Number of negative lookups made + pos=N Number of positive lookups made + crt=N Number of objects created by lookup + tmo=N Number of lookups timed out and requeued + Updates n=N Number of update cookie requests seen + nul=N Number of upd reqs given a NULL parent + run=N Number of upd reqs granted CPU time + Relinqs n=N Number of relinquish cookie requests seen + nul=N Number of rlq reqs given a NULL parent + wcr=N Number of rlq reqs waited on completion of creation + AttrChg n=N Number of attribute changed requests seen + ok=N Number of attr changed requests queued + nbf=N Number of attr changed rejected -ENOBUFS + oom=N Number of attr changed failed -ENOMEM + run=N Number of attr changed ops given CPU time + Allocs n=N Number of allocation requests seen + ok=N Number of successful alloc reqs + wt=N Number of alloc reqs that waited on lookup completion + nbf=N Number of alloc reqs rejected -ENOBUFS + int=N Number of alloc reqs aborted -ERESTARTSYS + ops=N Number of alloc reqs submitted + owt=N Number of alloc reqs waited for CPU time + abt=N Number of alloc reqs aborted due to object death + Retrvls n=N Number of retrieval (read) requests seen + ok=N Number of successful retr reqs + wt=N Number of retr reqs that waited on lookup completion + nod=N Number of retr reqs returned -ENODATA + nbf=N Number of retr reqs rejected -ENOBUFS + int=N Number of retr reqs aborted -ERESTARTSYS + oom=N Number of retr reqs failed -ENOMEM + ops=N Number of retr reqs submitted + owt=N Number of retr reqs waited for CPU time + abt=N Number of retr reqs aborted due to object death + Stores n=N Number of storage (write) requests seen + ok=N Number of successful store reqs + agn=N Number of store reqs on a page already pending storage + nbf=N Number of store reqs rejected -ENOBUFS + oom=N Number of store reqs failed -ENOMEM + ops=N Number of store reqs submitted + run=N Number of store reqs granted CPU time + pgs=N Number of pages given store req processing time + rxd=N Number of store reqs deleted from tracking tree + olm=N Number of store reqs over store limit + VmScan nos=N Number of release reqs against pages with no pending store + gon=N Number of release reqs against pages stored by time lock granted + bsy=N Number of release reqs ignored due to in-progress store + can=N Number of page stores cancelled due to release req + Ops pend=N Number of times async ops added to pending queues + run=N Number of times async ops given CPU time + enq=N Number of times async ops queued for processing + can=N Number of async ops cancelled + rej=N Number of async ops rejected due to object lookup/create failure + dfr=N Number of async ops queued for deferred release + rel=N Number of async ops released + gc=N Number of deferred-release async ops garbage collected + CacheOp alo=N Number of in-progress alloc_object() cache ops + luo=N Number of in-progress lookup_object() cache ops + luc=N Number of in-progress lookup_complete() cache ops + gro=N Number of in-progress grab_object() cache ops + upo=N Number of in-progress update_object() cache ops + dro=N Number of in-progress drop_object() cache ops + pto=N Number of in-progress put_object() cache ops + syn=N Number of in-progress sync_cache() cache ops + atc=N Number of in-progress attr_changed() cache ops + rap=N Number of in-progress read_or_alloc_page() cache ops + ras=N Number of in-progress read_or_alloc_pages() cache ops + alp=N Number of in-progress allocate_page() cache ops + als=N Number of in-progress allocate_pages() cache ops + wrp=N Number of in-progress write_page() cache ops + ucp=N Number of in-progress uncache_page() cache ops + dsp=N Number of in-progress dissociate_pages() cache ops + + + (*) /proc/fs/fscache/histogram + + cat /proc/fs/fscache/histogram + JIFS SECS OBJ INST OP RUNS OBJ RUNS RETRV DLY RETRIEVLS + ===== ===== ========= ========= ========= ========= ========= + + This shows the breakdown of the number of times each amount of time + between 0 jiffies and HZ-1 jiffies a variety of tasks took to run. The + columns are as follows: + + COLUMN TIME MEASUREMENT + ======= ======================================================= + OBJ INST Length of time to instantiate an object + OP RUNS Length of time a call to process an operation took + OBJ RUNS Length of time a call to process an object event took + RETRV DLY Time between an requesting a read and lookup completing + RETRIEVLS Time between beginning and end of a retrieval + + Each row shows the number of events that took a particular range of times. + Each step is 1 jiffy in size. The JIFS column indicates the particular + jiffy range covered, and the SECS field the equivalent number of seconds. + + +=========== +OBJECT LIST +=========== + +If CONFIG_FSCACHE_OBJECT_LIST is enabled, the FS-Cache facility will maintain a +list of all the objects currently allocated and allow them to be viewed +through: + + /proc/fs/fscache/objects + +This will look something like: + + [root@andromeda ~]# head /proc/fs/fscache/objects + OBJECT PARENT STAT CHLDN OPS OOP IPR EX READS EM EV F S | NETFS_COOKIE_DEF TY FL NETFS_DATA OBJECT_KEY, AUX_DATA + ======== ======== ==== ===== === === === == ===== == == = = | ================ == == ================ ================ + 17e4b 2 ACTV 0 0 0 0 0 0 7b 4 0 0 | NFS.fh DT 0 ffff88001dd82820 010006017edcf8bbc93b43298fdfbe71e50b57b13a172c0117f38472, e567634700000000000000000000000063f2404a000000000000000000000000c9030000000000000000000063f2404a + 1693a 2 ACTV 0 0 0 0 0 0 7b 4 0 0 | NFS.fh DT 0 ffff88002db23380 010006017edcf8bbc93b43298fdfbe71e50b57b1e0162c01a2df0ea6, 420ebc4a000000000000000000000000420ebc4a0000000000000000000000000e1801000000000000000000420ebc4a + +where the first set of columns before the '|' describe the object: + + COLUMN DESCRIPTION + ======= =============================================================== + OBJECT Object debugging ID (appears as OBJ%x in some debug messages) + PARENT Debugging ID of parent object + STAT Object state + CHLDN Number of child objects of this object + OPS Number of outstanding operations on this object + OOP Number of outstanding child object management operations + IPR + EX Number of outstanding exclusive operations + READS Number of outstanding read operations + EM Object's event mask + EV Events raised on this object + F Object flags + S Object work item busy state mask (1:pending 2:running) + +and the second set of columns describe the object's cookie, if present: + + COLUMN DESCRIPTION + =============== ======================================================= + NETFS_COOKIE_DEF Name of netfs cookie definition + TY Cookie type (IX - index, DT - data, hex - special) + FL Cookie flags + NETFS_DATA Netfs private data stored in the cookie + OBJECT_KEY Object key } 1 column, with separating comma + AUX_DATA Object aux data } presence may be configured + +The data shown may be filtered by attaching the a key to an appropriate keyring +before viewing the file. Something like: + + keyctl add user fscache:objlist <restrictions> @s + +where <restrictions> are a selection of the following letters: + + K Show hexdump of object key (don't show if not given) + A Show hexdump of object aux data (don't show if not given) + +and the following paired letters: + + C Show objects that have a cookie + c Show objects that don't have a cookie + B Show objects that are busy + b Show objects that aren't busy + W Show objects that have pending writes + w Show objects that don't have pending writes + R Show objects that have outstanding reads + r Show objects that don't have outstanding reads + S Show objects that have work queued + s Show objects that don't have work queued + +If neither side of a letter pair is given, then both are implied. For example: + + keyctl add user fscache:objlist KB @s + +shows objects that are busy, and lists their object keys, but does not dump +their auxiliary data. It also implies "CcWwRrSs", but as 'B' is given, 'b' is +not implied. + +By default all objects and all fields will be shown. + + +========= +DEBUGGING +========= + +If CONFIG_FSCACHE_DEBUG is enabled, the FS-Cache facility can have runtime +debugging enabled by adjusting the value in: + + /sys/module/fscache/parameters/debug + +This is a bitmask of debugging streams to enable: + + BIT VALUE STREAM POINT + ======= ======= =============================== ======================= + 0 1 Cache management Function entry trace + 1 2 Function exit trace + 2 4 General + 3 8 Cookie management Function entry trace + 4 16 Function exit trace + 5 32 General + 6 64 Page handling Function entry trace + 7 128 Function exit trace + 8 256 General + 9 512 Operation management Function entry trace + 10 1024 Function exit trace + 11 2048 General + +The appropriate set of values should be OR'd together and the result written to +the control file. For example: + + echo $((1|8|64)) >/sys/module/fscache/parameters/debug + +will turn on all function entry debugging. diff --git a/Documentation/filesystems/caching/netfs-api.txt b/Documentation/filesystems/caching/netfs-api.txt new file mode 100644 index 00000000000..aed6b94160b --- /dev/null +++ b/Documentation/filesystems/caching/netfs-api.txt @@ -0,0 +1,915 @@ + =============================== + FS-CACHE NETWORK FILESYSTEM API + =============================== + +There's an API by which a network filesystem can make use of the FS-Cache +facilities. This is based around a number of principles: + + (1) Caches can store a number of different object types. There are two main + object types: indices and files. The first is a special type used by + FS-Cache to make finding objects faster and to make retiring of groups of + objects easier. + + (2) Every index, file or other object is represented by a cookie. This cookie + may or may not have anything associated with it, but the netfs doesn't + need to care. + + (3) Barring the top-level index (one entry per cached netfs), the index + hierarchy for each netfs is structured according the whim of the netfs. + +This API is declared in <linux/fscache.h>. + +This document contains the following sections: + + (1) Network filesystem definition + (2) Index definition + (3) Object definition + (4) Network filesystem (un)registration + (5) Cache tag lookup + (6) Index registration + (7) Data file registration + (8) Miscellaneous object registration + (9) Setting the data file size + (10) Page alloc/read/write + (11) Page uncaching + (12) Index and data file consistency + (13) Cookie enablement + (14) Miscellaneous cookie operations + (15) Cookie unregistration + (16) Index invalidation + (17) Data file invalidation + (18) FS-Cache specific page flags. + + +============================= +NETWORK FILESYSTEM DEFINITION +============================= + +FS-Cache needs a description of the network filesystem. This is specified +using a record of the following structure: + + struct fscache_netfs { + uint32_t version; + const char *name; + struct fscache_cookie *primary_index; + ... + }; + +This first two fields should be filled in before registration, and the third +will be filled in by the registration function; any other fields should just be +ignored and are for internal use only. + +The fields are: + + (1) The name of the netfs (used as the key in the toplevel index). + + (2) The version of the netfs (if the name matches but the version doesn't, the + entire in-cache hierarchy for this netfs will be scrapped and begun + afresh). + + (3) The cookie representing the primary index will be allocated according to + another parameter passed into the registration function. + +For example, kAFS (linux/fs/afs/) uses the following definitions to describe +itself: + + struct fscache_netfs afs_cache_netfs = { + .version = 0, + .name = "afs", + }; + + +================ +INDEX DEFINITION +================ + +Indices are used for two purposes: + + (1) To aid the finding of a file based on a series of keys (such as AFS's + "cell", "volume ID", "vnode ID"). + + (2) To make it easier to discard a subset of all the files cached based around + a particular key - for instance to mirror the removal of an AFS volume. + +However, since it's unlikely that any two netfs's are going to want to define +their index hierarchies in quite the same way, FS-Cache tries to impose as few +restraints as possible on how an index is structured and where it is placed in +the tree. The netfs can even mix indices and data files at the same level, but +it's not recommended. + +Each index entry consists of a key of indeterminate length plus some auxiliary +data, also of indeterminate length. + +There are some limits on indices: + + (1) Any index containing non-index objects should be restricted to a single + cache. Any such objects created within an index will be created in the + first cache only. The cache in which an index is created can be + controlled by cache tags (see below). + + (2) The entry data must be atomically journallable, so it is limited to about + 400 bytes at present. At least 400 bytes will be available. + + (3) The depth of the index tree should be judged with care as the search + function is recursive. Too many layers will run the kernel out of stack. + + +================= +OBJECT DEFINITION +================= + +To define an object, a structure of the following type should be filled out: + + struct fscache_cookie_def + { + uint8_t name[16]; + uint8_t type; + + struct fscache_cache_tag *(*select_cache)( + const void *parent_netfs_data, + const void *cookie_netfs_data); + + uint16_t (*get_key)(const void *cookie_netfs_data, + void *buffer, + uint16_t bufmax); + + void (*get_attr)(const void *cookie_netfs_data, + uint64_t *size); + + uint16_t (*get_aux)(const void *cookie_netfs_data, + void *buffer, + uint16_t bufmax); + + enum fscache_checkaux (*check_aux)(void *cookie_netfs_data, + const void *data, + uint16_t datalen); + + void (*get_context)(void *cookie_netfs_data, void *context); + + void (*put_context)(void *cookie_netfs_data, void *context); + + void (*mark_pages_cached)(void *cookie_netfs_data, + struct address_space *mapping, + struct pagevec *cached_pvec); + + void (*now_uncached)(void *cookie_netfs_data); + }; + +This has the following fields: + + (1) The type of the object [mandatory]. + + This is one of the following values: + + (*) FSCACHE_COOKIE_TYPE_INDEX + + This defines an index, which is a special FS-Cache type. + + (*) FSCACHE_COOKIE_TYPE_DATAFILE + + This defines an ordinary data file. + + (*) Any other value between 2 and 255 + + This defines an extraordinary object such as an XATTR. + + (2) The name of the object type (NUL terminated unless all 16 chars are used) + [optional]. + + (3) A function to select the cache in which to store an index [optional]. + + This function is invoked when an index needs to be instantiated in a cache + during the instantiation of a non-index object. Only the immediate index + parent for the non-index object will be queried. Any indices above that + in the hierarchy may be stored in multiple caches. This function does not + need to be supplied for any non-index object or any index that will only + have index children. + + If this function is not supplied or if it returns NULL then the first + cache in the parent's list will be chosen, or failing that, the first + cache in the master list. + + (4) A function to retrieve an object's key from the netfs [mandatory]. + + This function will be called with the netfs data that was passed to the + cookie acquisition function and the maximum length of key data that it may + provide. It should write the required key data into the given buffer and + return the quantity it wrote. + + (5) A function to retrieve attribute data from the netfs [optional]. + + This function will be called with the netfs data that was passed to the + cookie acquisition function. It should return the size of the file if + this is a data file. The size may be used to govern how much cache must + be reserved for this file in the cache. + + If the function is absent, a file size of 0 is assumed. + + (6) A function to retrieve auxiliary data from the netfs [optional]. + + This function will be called with the netfs data that was passed to the + cookie acquisition function and the maximum length of auxiliary data that + it may provide. It should write the auxiliary data into the given buffer + and return the quantity it wrote. + + If this function is absent, the auxiliary data length will be set to 0. + + The length of the auxiliary data buffer may be dependent on the key + length. A netfs mustn't rely on being able to provide more than 400 bytes + for both. + + (7) A function to check the auxiliary data [optional]. + + This function will be called to check that a match found in the cache for + this object is valid. For instance with AFS it could check the auxiliary + data against the data version number returned by the server to determine + whether the index entry in a cache is still valid. + + If this function is absent, it will be assumed that matching objects in a + cache are always valid. + + If present, the function should return one of the following values: + + (*) FSCACHE_CHECKAUX_OKAY - the entry is okay as is + (*) FSCACHE_CHECKAUX_NEEDS_UPDATE - the entry requires update + (*) FSCACHE_CHECKAUX_OBSOLETE - the entry should be deleted + + This function can also be used to extract data from the auxiliary data in + the cache and copy it into the netfs's structures. + + (8) A pair of functions to manage contexts for the completion callback + [optional]. + + The cache read/write functions are passed a context which is then passed + to the I/O completion callback function. To ensure this context remains + valid until after the I/O completion is called, two functions may be + provided: one to get an extra reference on the context, and one to drop a + reference to it. + + If the context is not used or is a type of object that won't go out of + scope, then these functions are not required. These functions are not + required for indices as indices may not contain data. These functions may + be called in interrupt context and so may not sleep. + + (9) A function to mark a page as retaining cache metadata [optional]. + + This is called by the cache to indicate that it is retaining in-memory + information for this page and that the netfs should uncache the page when + it has finished. This does not indicate whether there's data on the disk + or not. Note that several pages at once may be presented for marking. + + The PG_fscache bit is set on the pages before this function would be + called, so the function need not be provided if this is sufficient. + + This function is not required for indices as they're not permitted data. + +(10) A function to unmark all the pages retaining cache metadata [mandatory]. + + This is called by FS-Cache to indicate that a backing store is being + unbound from a cookie and that all the marks on the pages should be + cleared to prevent confusion. Note that the cache will have torn down all + its tracking information so that the pages don't need to be explicitly + uncached. + + This function is not required for indices as they're not permitted data. + + +=================================== +NETWORK FILESYSTEM (UN)REGISTRATION +=================================== + +The first step is to declare the network filesystem to the cache. This also +involves specifying the layout of the primary index (for AFS, this would be the +"cell" level). + +The registration function is: + + int fscache_register_netfs(struct fscache_netfs *netfs); + +It just takes a pointer to the netfs definition. It returns 0 or an error as +appropriate. + +For kAFS, registration is done as follows: + + ret = fscache_register_netfs(&afs_cache_netfs); + +The last step is, of course, unregistration: + + void fscache_unregister_netfs(struct fscache_netfs *netfs); + + +================ +CACHE TAG LOOKUP +================ + +FS-Cache permits the use of more than one cache. To permit particular index +subtrees to be bound to particular caches, the second step is to look up cache +representation tags. This step is optional; it can be left entirely up to +FS-Cache as to which cache should be used. The problem with doing that is that +FS-Cache will always pick the first cache that was registered. + +To get the representation for a named tag: + + struct fscache_cache_tag *fscache_lookup_cache_tag(const char *name); + +This takes a text string as the name and returns a representation of a tag. It +will never return an error. It may return a dummy tag, however, if it runs out +of memory; this will inhibit caching with this tag. + +Any representation so obtained must be released by passing it to this function: + + void fscache_release_cache_tag(struct fscache_cache_tag *tag); + +The tag will be retrieved by FS-Cache when it calls the object definition +operation select_cache(). + + +================== +INDEX REGISTRATION +================== + +The third step is to inform FS-Cache about part of an index hierarchy that can +be used to locate files. This is done by requesting a cookie for each index in +the path to the file: + + struct fscache_cookie * + fscache_acquire_cookie(struct fscache_cookie *parent, + const struct fscache_object_def *def, + void *netfs_data, + bool enable); + +This function creates an index entry in the index represented by parent, +filling in the index entry by calling the operations pointed to by def. + +Note that this function never returns an error - all errors are handled +internally. It may, however, return NULL to indicate no cookie. It is quite +acceptable to pass this token back to this function as the parent to another +acquisition (or even to the relinquish cookie, read page and write page +functions - see below). + +Note also that no indices are actually created in a cache until a non-index +object needs to be created somewhere down the hierarchy. Furthermore, an index +may be created in several different caches independently at different times. +This is all handled transparently, and the netfs doesn't see any of it. + +A cookie will be created in the disabled state if enabled is false. A cookie +must be enabled to do anything with it. A disabled cookie can be enabled by +calling fscache_enable_cookie() (see below). + +For example, with AFS, a cell would be added to the primary index. This index +entry would have a dependent inode containing a volume location index for the +volume mappings within this cell: + + cell->cache = + fscache_acquire_cookie(afs_cache_netfs.primary_index, + &afs_cell_cache_index_def, + cell, true); + +Then when a volume location was accessed, it would be entered into the cell's +index and an inode would be allocated that acts as a volume type and hash chain +combination: + + vlocation->cache = + fscache_acquire_cookie(cell->cache, + &afs_vlocation_cache_index_def, + vlocation, true); + +And then a particular flavour of volume (R/O for example) could be added to +that index, creating another index for vnodes (AFS inode equivalents): + + volume->cache = + fscache_acquire_cookie(vlocation->cache, + &afs_volume_cache_index_def, + volume, true); + + +====================== +DATA FILE REGISTRATION +====================== + +The fourth step is to request a data file be created in the cache. This is +identical to index cookie acquisition. The only difference is that the type in +the object definition should be something other than index type. + + vnode->cache = + fscache_acquire_cookie(volume->cache, + &afs_vnode_cache_object_def, + vnode, true); + + +================================= +MISCELLANEOUS OBJECT REGISTRATION +================================= + +An optional step is to request an object of miscellaneous type be created in +the cache. This is almost identical to index cookie acquisition. The only +difference is that the type in the object definition should be something other +than index type. Whilst the parent object could be an index, it's more likely +it would be some other type of object such as a data file. + + xattr->cache = + fscache_acquire_cookie(vnode->cache, + &afs_xattr_cache_object_def, + xattr, true); + +Miscellaneous objects might be used to store extended attributes or directory +entries for example. + + +========================== +SETTING THE DATA FILE SIZE +========================== + +The fifth step is to set the physical attributes of the file, such as its size. +This doesn't automatically reserve any space in the cache, but permits the +cache to adjust its metadata for data tracking appropriately: + + int fscache_attr_changed(struct fscache_cookie *cookie); + +The cache will return -ENOBUFS if there is no backing cache or if there is no +space to allocate any extra metadata required in the cache. The attributes +will be accessed with the get_attr() cookie definition operation. + +Note that attempts to read or write data pages in the cache over this size may +be rebuffed with -ENOBUFS. + +This operation schedules an attribute adjustment to happen asynchronously at +some point in the future, and as such, it may happen after the function returns +to the caller. The attribute adjustment excludes read and write operations. + + +===================== +PAGE ALLOC/READ/WRITE +===================== + +And the sixth step is to store and retrieve pages in the cache. There are +three functions that are used to do this. + +Note: + + (1) A page should not be re-read or re-allocated without uncaching it first. + + (2) A read or allocated page must be uncached when the netfs page is released + from the pagecache. + + (3) A page should only be written to the cache if previous read or allocated. + +This permits the cache to maintain its page tracking in proper order. + + +PAGE READ +--------- + +Firstly, the netfs should ask FS-Cache to examine the caches and read the +contents cached for a particular page of a particular file if present, or else +allocate space to store the contents if not: + + typedef + void (*fscache_rw_complete_t)(struct page *page, + void *context, + int error); + + int fscache_read_or_alloc_page(struct fscache_cookie *cookie, + struct page *page, + fscache_rw_complete_t end_io_func, + void *context, + gfp_t gfp); + +The cookie argument must specify a cookie for an object that isn't an index, +the page specified will have the data loaded into it (and is also used to +specify the page number), and the gfp argument is used to control how any +memory allocations made are satisfied. + +If the cookie indicates the inode is not cached: + + (1) The function will return -ENOBUFS. + +Else if there's a copy of the page resident in the cache: + + (1) The mark_pages_cached() cookie operation will be called on that page. + + (2) The function will submit a request to read the data from the cache's + backing device directly into the page specified. + + (3) The function will return 0. + + (4) When the read is complete, end_io_func() will be invoked with: + + (*) The netfs data supplied when the cookie was created. + + (*) The page descriptor. + + (*) The context argument passed to the above function. This will be + maintained with the get_context/put_context functions mentioned above. + + (*) An argument that's 0 on success or negative for an error code. + + If an error occurs, it should be assumed that the page contains no usable + data. fscache_readpages_cancel() may need to be called. + + end_io_func() will be called in process context if the read is results in + an error, but it might be called in interrupt context if the read is + successful. + +Otherwise, if there's not a copy available in cache, but the cache may be able +to store the page: + + (1) The mark_pages_cached() cookie operation will be called on that page. + + (2) A block may be reserved in the cache and attached to the object at the + appropriate place. + + (3) The function will return -ENODATA. + +This function may also return -ENOMEM or -EINTR, in which case it won't have +read any data from the cache. + + +PAGE ALLOCATE +------------- + +Alternatively, if there's not expected to be any data in the cache for a page +because the file has been extended, a block can simply be allocated instead: + + int fscache_alloc_page(struct fscache_cookie *cookie, + struct page *page, + gfp_t gfp); + +This is similar to the fscache_read_or_alloc_page() function, except that it +never reads from the cache. It will return 0 if a block has been allocated, +rather than -ENODATA as the other would. One or the other must be performed +before writing to the cache. + +The mark_pages_cached() cookie operation will be called on the page if +successful. + + +PAGE WRITE +---------- + +Secondly, if the netfs changes the contents of the page (either due to an +initial download or if a user performs a write), then the page should be +written back to the cache: + + int fscache_write_page(struct fscache_cookie *cookie, + struct page *page, + gfp_t gfp); + +The cookie argument must specify a data file cookie, the page specified should +contain the data to be written (and is also used to specify the page number), +and the gfp argument is used to control how any memory allocations made are +satisfied. + +The page must have first been read or allocated successfully and must not have +been uncached before writing is performed. + +If the cookie indicates the inode is not cached then: + + (1) The function will return -ENOBUFS. + +Else if space can be allocated in the cache to hold this page: + + (1) PG_fscache_write will be set on the page. + + (2) The function will submit a request to write the data to cache's backing + device directly from the page specified. + + (3) The function will return 0. + + (4) When the write is complete PG_fscache_write is cleared on the page and + anyone waiting for that bit will be woken up. + +Else if there's no space available in the cache, -ENOBUFS will be returned. It +is also possible for the PG_fscache_write bit to be cleared when no write took +place if unforeseen circumstances arose (such as a disk error). + +Writing takes place asynchronously. + + +MULTIPLE PAGE READ +------------------ + +A facility is provided to read several pages at once, as requested by the +readpages() address space operation: + + int fscache_read_or_alloc_pages(struct fscache_cookie *cookie, + struct address_space *mapping, + struct list_head *pages, + int *nr_pages, + fscache_rw_complete_t end_io_func, + void *context, + gfp_t gfp); + +This works in a similar way to fscache_read_or_alloc_page(), except: + + (1) Any page it can retrieve data for is removed from pages and nr_pages and + dispatched for reading to the disk. Reads of adjacent pages on disk may + be merged for greater efficiency. + + (2) The mark_pages_cached() cookie operation will be called on several pages + at once if they're being read or allocated. + + (3) If there was an general error, then that error will be returned. + + Else if some pages couldn't be allocated or read, then -ENOBUFS will be + returned. + + Else if some pages couldn't be read but were allocated, then -ENODATA will + be returned. + + Otherwise, if all pages had reads dispatched, then 0 will be returned, the + list will be empty and *nr_pages will be 0. + + (4) end_io_func will be called once for each page being read as the reads + complete. It will be called in process context if error != 0, but it may + be called in interrupt context if there is no error. + +Note that a return of -ENODATA, -ENOBUFS or any other error does not preclude +some of the pages being read and some being allocated. Those pages will have +been marked appropriately and will need uncaching. + + +CANCELLATION OF UNREAD PAGES +---------------------------- + +If one or more pages are passed to fscache_read_or_alloc_pages() but not then +read from the cache and also not read from the underlying filesystem then +those pages will need to have any marks and reservations removed. This can be +done by calling: + + void fscache_readpages_cancel(struct fscache_cookie *cookie, + struct list_head *pages); + +prior to returning to the caller. The cookie argument should be as passed to +fscache_read_or_alloc_pages(). Every page in the pages list will be examined +and any that have PG_fscache set will be uncached. + + +============== +PAGE UNCACHING +============== + +To uncache a page, this function should be called: + + void fscache_uncache_page(struct fscache_cookie *cookie, + struct page *page); + +This function permits the cache to release any in-memory representation it +might be holding for this netfs page. This function must be called once for +each page on which the read or write page functions above have been called to +make sure the cache's in-memory tracking information gets torn down. + +Note that pages can't be explicitly deleted from the a data file. The whole +data file must be retired (see the relinquish cookie function below). + +Furthermore, note that this does not cancel the asynchronous read or write +operation started by the read/alloc and write functions, so the page +invalidation functions must use: + + bool fscache_check_page_write(struct fscache_cookie *cookie, + struct page *page); + +to see if a page is being written to the cache, and: + + void fscache_wait_on_page_write(struct fscache_cookie *cookie, + struct page *page); + +to wait for it to finish if it is. + + +When releasepage() is being implemented, a special FS-Cache function exists to +manage the heuristics of coping with vmscan trying to eject pages, which may +conflict with the cache trying to write pages to the cache (which may itself +need to allocate memory): + + bool fscache_maybe_release_page(struct fscache_cookie *cookie, + struct page *page, + gfp_t gfp); + +This takes the netfs cookie, and the page and gfp arguments as supplied to +releasepage(). It will return false if the page cannot be released yet for +some reason and if it returns true, the page has been uncached and can now be +released. + +To make a page available for release, this function may wait for an outstanding +storage request to complete, or it may attempt to cancel the storage request - +in which case the page will not be stored in the cache this time. + + +BULK INODE PAGE UNCACHE +----------------------- + +A convenience routine is provided to perform an uncache on all the pages +attached to an inode. This assumes that the pages on the inode correspond on a +1:1 basis with the pages in the cache. + + void fscache_uncache_all_inode_pages(struct fscache_cookie *cookie, + struct inode *inode); + +This takes the netfs cookie that the pages were cached with and the inode that +the pages are attached to. This function will wait for pages to finish being +written to the cache and for the cache to finish with the page generally. No +error is returned. + + +=============================== +INDEX AND DATA FILE CONSISTENCY +=============================== + +To find out whether auxiliary data for an object is up to data within the +cache, the following function can be called: + + int fscache_check_consistency(struct fscache_cookie *cookie) + +This will call back to the netfs to check whether the auxiliary data associated +with a cookie is correct. It returns 0 if it is and -ESTALE if it isn't; it +may also return -ENOMEM and -ERESTARTSYS. + +To request an update of the index data for an index or other object, the +following function should be called: + + void fscache_update_cookie(struct fscache_cookie *cookie); + +This function will refer back to the netfs_data pointer stored in the cookie by +the acquisition function to obtain the data to write into each revised index +entry. The update method in the parent index definition will be called to +transfer the data. + +Note that partial updates may happen automatically at other times, such as when +data blocks are added to a data file object. + + +================= +COOKIE ENABLEMENT +================= + +Cookies exist in one of two states: enabled and disabled. If a cookie is +disabled, it ignores all attempts to acquire child cookies; check, update or +invalidate its state; allocate, read or write backing pages - though it is +still possible to uncache pages and relinquish the cookie. + +The initial enablement state is set by fscache_acquire_cookie(), but the cookie +can be enabled or disabled later. To disable a cookie, call: + + void fscache_disable_cookie(struct fscache_cookie *cookie, + bool invalidate); + +If the cookie is not already disabled, this locks the cookie against other +enable and disable ops, marks the cookie as being disabled, discards or +invalidates any backing objects and waits for cessation of activity on any +associated object before unlocking the cookie. + +All possible failures are handled internally. The caller should consider +calling fscache_uncache_all_inode_pages() afterwards to make sure all page +markings are cleared up. + +Cookies can be enabled or reenabled with: + + void fscache_enable_cookie(struct fscache_cookie *cookie, + bool (*can_enable)(void *data), + void *data) + +If the cookie is not already enabled, this locks the cookie against other +enable and disable ops, invokes can_enable() and, if the cookie is not an index +cookie, will begin the procedure of acquiring backing objects. + +The optional can_enable() function is passed the data argument and returns a +ruling as to whether or not enablement should actually be permitted to begin. + +All possible failures are handled internally. The cookie will only be marked +as enabled if provisional backing objects are allocated. + + +=============================== +MISCELLANEOUS COOKIE OPERATIONS +=============================== + +There are a number of operations that can be used to control cookies: + + (*) Cookie pinning: + + int fscache_pin_cookie(struct fscache_cookie *cookie); + void fscache_unpin_cookie(struct fscache_cookie *cookie); + + These operations permit data cookies to be pinned into the cache and to + have the pinning removed. They are not permitted on index cookies. + + The pinning function will return 0 if successful, -ENOBUFS in the cookie + isn't backed by a cache, -EOPNOTSUPP if the cache doesn't support pinning, + -ENOSPC if there isn't enough space to honour the operation, -ENOMEM or + -EIO if there's any other problem. + + (*) Data space reservation: + + int fscache_reserve_space(struct fscache_cookie *cookie, loff_t size); + + This permits a netfs to request cache space be reserved to store up to the + given amount of a file. It is permitted to ask for more than the current + size of the file to allow for future file expansion. + + If size is given as zero then the reservation will be cancelled. + + The function will return 0 if successful, -ENOBUFS in the cookie isn't + backed by a cache, -EOPNOTSUPP if the cache doesn't support reservations, + -ENOSPC if there isn't enough space to honour the operation, -ENOMEM or + -EIO if there's any other problem. + + Note that this doesn't pin an object in a cache; it can still be culled to + make space if it's not in use. + + +===================== +COOKIE UNREGISTRATION +===================== + +To get rid of a cookie, this function should be called. + + void fscache_relinquish_cookie(struct fscache_cookie *cookie, + bool retire); + +If retire is non-zero, then the object will be marked for recycling, and all +copies of it will be removed from all active caches in which it is present. +Not only that but all child objects will also be retired. + +If retire is zero, then the object may be available again when next the +acquisition function is called. Retirement here will overrule the pinning on a +cookie. + +One very important note - relinquish must NOT be called for a cookie unless all +the cookies for "child" indices, objects and pages have been relinquished +first. + + +================== +INDEX INVALIDATION +================== + +There is no direct way to invalidate an index subtree. To do this, the caller +should relinquish and retire the cookie they have, and then acquire a new one. + + +====================== +DATA FILE INVALIDATION +====================== + +Sometimes it will be necessary to invalidate an object that contains data. +Typically this will be necessary when the server tells the netfs of a foreign +change - at which point the netfs has to throw away all the state it had for an +inode and reload from the server. + +To indicate that a cache object should be invalidated, the following function +can be called: + + void fscache_invalidate(struct fscache_cookie *cookie); + +This can be called with spinlocks held as it defers the work to a thread pool. +All extant storage, retrieval and attribute change ops at this point are +cancelled and discarded. Some future operations will be rejected until the +cache has had a chance to insert a barrier in the operations queue. After +that, operations will be queued again behind the invalidation operation. + +The invalidation operation will perform an attribute change operation and an +auxiliary data update operation as it is very likely these will have changed. + +Using the following function, the netfs can wait for the invalidation operation +to have reached a point at which it can start submitting ordinary operations +once again: + + void fscache_wait_on_invalidate(struct fscache_cookie *cookie); + + +=========================== +FS-CACHE SPECIFIC PAGE FLAG +=========================== + +FS-Cache makes use of a page flag, PG_private_2, for its own purpose. This is +given the alternative name PG_fscache. + +PG_fscache is used to indicate that the page is known by the cache, and that +the cache must be informed if the page is going to go away. It's an indication +to the netfs that the cache has an interest in this page, where an interest may +be a pointer to it, resources allocated or reserved for it, or I/O in progress +upon it. + +The netfs can use this information in methods such as releasepage() to +determine whether it needs to uncache a page or update it. + +Furthermore, if this bit is set, releasepage() and invalidatepage() operations +will be called on a page to get rid of it, even if PG_private is not set. This +allows caching to attempted on a page before read_cache_pages() to be called +after fscache_read_or_alloc_pages() as the former will try and release pages it +was given under certain circumstances. + +This bit does not overlap with such as PG_private. This means that FS-Cache +can be used with a filesystem that uses the block buffering code. + +There are a number of operations defined on this flag: + + int PageFsCache(struct page *page); + void SetPageFsCache(struct page *page) + void ClearPageFsCache(struct page *page) + int TestSetPageFsCache(struct page *page) + int TestClearPageFsCache(struct page *page) + +These functions are bit test, bit set, bit clear, bit test and set and bit +test and clear operations on PG_fscache. diff --git a/Documentation/filesystems/caching/object.txt b/Documentation/filesystems/caching/object.txt new file mode 100644 index 00000000000..100ff41127e --- /dev/null +++ b/Documentation/filesystems/caching/object.txt @@ -0,0 +1,320 @@ + ==================================================== + IN-KERNEL CACHE OBJECT REPRESENTATION AND MANAGEMENT + ==================================================== + +By: David Howells <dhowells@redhat.com> + +Contents: + + (*) Representation + + (*) Object management state machine. + + - Provision of cpu time. + - Locking simplification. + + (*) The set of states. + + (*) The set of events. + + +============== +REPRESENTATION +============== + +FS-Cache maintains an in-kernel representation of each object that a netfs is +currently interested in. Such objects are represented by the fscache_cookie +struct and are referred to as cookies. + +FS-Cache also maintains a separate in-kernel representation of the objects that +a cache backend is currently actively caching. Such objects are represented by +the fscache_object struct. The cache backends allocate these upon request, and +are expected to embed them in their own representations. These are referred to +as objects. + +There is a 1:N relationship between cookies and objects. A cookie may be +represented by multiple objects - an index may exist in more than one cache - +or even by no objects (it may not be cached). + +Furthermore, both cookies and objects are hierarchical. The two hierarchies +correspond, but the cookies tree is a superset of the union of the object trees +of multiple caches: + + NETFS INDEX TREE : CACHE 1 : CACHE 2 + : : + : +-----------+ : + +----------->| IObject | : + +-----------+ | : +-----------+ : + | ICookie |-------+ : | : + +-----------+ | : | : +-----------+ + | +------------------------------>| IObject | + | : | : +-----------+ + | : V : | + | : +-----------+ : | + V +----------->| IObject | : | + +-----------+ | : +-----------+ : | + | ICookie |-------+ : | : V + +-----------+ | : | : +-----------+ + | +------------------------------>| IObject | + +-----+-----+ : | : +-----------+ + | | : | : | + V | : V : | + +-----------+ | : +-----------+ : | + | ICookie |------------------------->| IObject | : | + +-----------+ | : +-----------+ : | + | V : | : V + | +-----------+ : | : +-----------+ + | | ICookie |-------------------------------->| IObject | + | +-----------+ : | : +-----------+ + V | : V : | + +-----------+ | : +-----------+ : | + | DCookie |------------------------->| DObject | : | + +-----------+ | : +-----------+ : | + | : : | + +-------+-------+ : : | + | | : : | + V V : : V + +-----------+ +-----------+ : : +-----------+ + | DCookie | | DCookie |------------------------>| DObject | + +-----------+ +-----------+ : : +-----------+ + : : + +In the above illustration, ICookie and IObject represent indices and DCookie +and DObject represent data storage objects. Indices may have representation in +multiple caches, but currently, non-index objects may not. Objects of any type +may also be entirely unrepresented. + +As far as the netfs API goes, the netfs is only actually permitted to see +pointers to the cookies. The cookies themselves and any objects attached to +those cookies are hidden from it. + + +=============================== +OBJECT MANAGEMENT STATE MACHINE +=============================== + +Within FS-Cache, each active object is managed by its own individual state +machine. The state for an object is kept in the fscache_object struct, in +object->state. A cookie may point to a set of objects that are in different +states. + +Each state has an action associated with it that is invoked when the machine +wakes up in that state. There are four logical sets of states: + + (1) Preparation: states that wait for the parent objects to become ready. The + representations are hierarchical, and it is expected that an object must + be created or accessed with respect to its parent object. + + (2) Initialisation: states that perform lookups in the cache and validate + what's found and that create on disk any missing metadata. + + (3) Normal running: states that allow netfs operations on objects to proceed + and that update the state of objects. + + (4) Termination: states that detach objects from their netfs cookies, that + delete objects from disk, that handle disk and system errors and that free + up in-memory resources. + + +In most cases, transitioning between states is in response to signalled events. +When a state has finished processing, it will usually set the mask of events in +which it is interested (object->event_mask) and relinquish the worker thread. +Then when an event is raised (by calling fscache_raise_event()), if the event +is not masked, the object will be queued for processing (by calling +fscache_enqueue_object()). + + +PROVISION OF CPU TIME +--------------------- + +The work to be done by the various states was given CPU time by the threads of +the slow work facility. This was used in preference to the workqueue facility +because: + + (1) Threads may be completely occupied for very long periods of time by a + particular work item. These state actions may be doing sequences of + synchronous, journalled disk accesses (lookup, mkdir, create, setxattr, + getxattr, truncate, unlink, rmdir, rename). + + (2) Threads may do little actual work, but may rather spend a lot of time + sleeping on I/O. This means that single-threaded and 1-per-CPU-threaded + workqueues don't necessarily have the right numbers of threads. + + +LOCKING SIMPLIFICATION +---------------------- + +Because only one worker thread may be operating on any particular object's +state machine at once, this simplifies the locking, particularly with respect +to disconnecting the netfs's representation of a cache object (fscache_cookie) +from the cache backend's representation (fscache_object) - which may be +requested from either end. + + +================= +THE SET OF STATES +================= + +The object state machine has a set of states that it can be in. There are +preparation states in which the object sets itself up and waits for its parent +object to transit to a state that allows access to its children: + + (1) State FSCACHE_OBJECT_INIT. + + Initialise the object and wait for the parent object to become active. In + the cache, it is expected that it will not be possible to look an object + up from the parent object, until that parent object itself has been looked + up. + +There are initialisation states in which the object sets itself up and accesses +disk for the object metadata: + + (2) State FSCACHE_OBJECT_LOOKING_UP. + + Look up the object on disk, using the parent as a starting point. + FS-Cache expects the cache backend to probe the cache to see whether this + object is represented there, and if it is, to see if it's valid (coherency + management). + + The cache should call fscache_object_lookup_negative() to indicate lookup + failure for whatever reason, and should call fscache_obtained_object() to + indicate success. + + At the completion of lookup, FS-Cache will let the netfs go ahead with + read operations, no matter whether the file is yet cached. If not yet + cached, read operations will be immediately rejected with ENODATA until + the first known page is uncached - as to that point there can be no data + to be read out of the cache for that file that isn't currently also held + in the pagecache. + + (3) State FSCACHE_OBJECT_CREATING. + + Create an object on disk, using the parent as a starting point. This + happens if the lookup failed to find the object, or if the object's + coherency data indicated what's on disk is out of date. In this state, + FS-Cache expects the cache to create + + The cache should call fscache_obtained_object() if creation completes + successfully, fscache_object_lookup_negative() otherwise. + + At the completion of creation, FS-Cache will start processing write + operations the netfs has queued for an object. If creation failed, the + write ops will be transparently discarded, and nothing recorded in the + cache. + +There are some normal running states in which the object spends its time +servicing netfs requests: + + (4) State FSCACHE_OBJECT_AVAILABLE. + + A transient state in which pending operations are started, child objects + are permitted to advance from FSCACHE_OBJECT_INIT state, and temporary + lookup data is freed. + + (5) State FSCACHE_OBJECT_ACTIVE. + + The normal running state. In this state, requests the netfs makes will be + passed on to the cache. + + (6) State FSCACHE_OBJECT_INVALIDATING. + + The object is undergoing invalidation. When the state comes here, it + discards all pending read, write and attribute change operations as it is + going to clear out the cache entirely and reinitialise it. It will then + continue to the FSCACHE_OBJECT_UPDATING state. + + (7) State FSCACHE_OBJECT_UPDATING. + + The state machine comes here to update the object in the cache from the + netfs's records. This involves updating the auxiliary data that is used + to maintain coherency. + +And there are terminal states in which an object cleans itself up, deallocates +memory and potentially deletes stuff from disk: + + (8) State FSCACHE_OBJECT_LC_DYING. + + The object comes here if it is dying because of a lookup or creation + error. This would be due to a disk error or system error of some sort. + Temporary data is cleaned up, and the parent is released. + + (9) State FSCACHE_OBJECT_DYING. + + The object comes here if it is dying due to an error, because its parent + cookie has been relinquished by the netfs or because the cache is being + withdrawn. + + Any child objects waiting on this one are given CPU time so that they too + can destroy themselves. This object waits for all its children to go away + before advancing to the next state. + +(10) State FSCACHE_OBJECT_ABORT_INIT. + + The object comes to this state if it was waiting on its parent in + FSCACHE_OBJECT_INIT, but its parent died. The object will destroy itself + so that the parent may proceed from the FSCACHE_OBJECT_DYING state. + +(11) State FSCACHE_OBJECT_RELEASING. +(12) State FSCACHE_OBJECT_RECYCLING. + + The object comes to one of these two states when dying once it is rid of + all its children, if it is dying because the netfs relinquished its + cookie. In the first state, the cached data is expected to persist, and + in the second it will be deleted. + +(13) State FSCACHE_OBJECT_WITHDRAWING. + + The object transits to this state if the cache decides it wants to + withdraw the object from service, perhaps to make space, but also due to + error or just because the whole cache is being withdrawn. + +(14) State FSCACHE_OBJECT_DEAD. + + The object transits to this state when the in-memory object record is + ready to be deleted. The object processor shouldn't ever see an object in + this state. + + +THE SET OF EVENTS +----------------- + +There are a number of events that can be raised to an object state machine: + + (*) FSCACHE_OBJECT_EV_UPDATE + + The netfs requested that an object be updated. The state machine will ask + the cache backend to update the object, and the cache backend will ask the + netfs for details of the change through its cookie definition ops. + + (*) FSCACHE_OBJECT_EV_CLEARED + + This is signalled in two circumstances: + + (a) when an object's last child object is dropped and + + (b) when the last operation outstanding on an object is completed. + + This is used to proceed from the dying state. + + (*) FSCACHE_OBJECT_EV_ERROR + + This is signalled when an I/O error occurs during the processing of some + object. + + (*) FSCACHE_OBJECT_EV_RELEASE + (*) FSCACHE_OBJECT_EV_RETIRE + + These are signalled when the netfs relinquishes a cookie it was using. + The event selected depends on whether the netfs asks for the backing + object to be retired (deleted) or retained. + + (*) FSCACHE_OBJECT_EV_WITHDRAW + + This is signalled when the cache backend wants to withdraw an object. + This means that the object will have to be detached from the netfs's + cookie. + +Because the withdrawing releasing/retiring events are all handled by the object +state machine, it doesn't matter if there's a collision with both ends trying +to sever the connection at the same time. The state machine can just pick +which one it wants to honour, and that effects the other. diff --git a/Documentation/filesystems/caching/operations.txt b/Documentation/filesystems/caching/operations.txt new file mode 100644 index 00000000000..bee2a5f93d6 --- /dev/null +++ b/Documentation/filesystems/caching/operations.txt @@ -0,0 +1,213 @@ + ================================ + ASYNCHRONOUS OPERATIONS HANDLING + ================================ + +By: David Howells <dhowells@redhat.com> + +Contents: + + (*) Overview. + + (*) Operation record initialisation. + + (*) Parameters. + + (*) Procedure. + + (*) Asynchronous callback. + + +======== +OVERVIEW +======== + +FS-Cache has an asynchronous operations handling facility that it uses for its +data storage and retrieval routines. Its operations are represented by +fscache_operation structs, though these are usually embedded into some other +structure. + +This facility is available to and expected to be be used by the cache backends, +and FS-Cache will create operations and pass them off to the appropriate cache +backend for completion. + +To make use of this facility, <linux/fscache-cache.h> should be #included. + + +=============================== +OPERATION RECORD INITIALISATION +=============================== + +An operation is recorded in an fscache_operation struct: + + struct fscache_operation { + union { + struct work_struct fast_work; + struct slow_work slow_work; + }; + unsigned long flags; + fscache_operation_processor_t processor; + ... + }; + +Someone wanting to issue an operation should allocate something with this +struct embedded in it. They should initialise it by calling: + + void fscache_operation_init(struct fscache_operation *op, + fscache_operation_release_t release); + +with the operation to be initialised and the release function to use. + +The op->flags parameter should be set to indicate the CPU time provision and +the exclusivity (see the Parameters section). + +The op->fast_work, op->slow_work and op->processor flags should be set as +appropriate for the CPU time provision (see the Parameters section). + +FSCACHE_OP_WAITING may be set in op->flags prior to each submission of the +operation and waited for afterwards. + + +========== +PARAMETERS +========== + +There are a number of parameters that can be set in the operation record's flag +parameter. There are three options for the provision of CPU time in these +operations: + + (1) The operation may be done synchronously (FSCACHE_OP_MYTHREAD). A thread + may decide it wants to handle an operation itself without deferring it to + another thread. + + This is, for example, used in read operations for calling readpages() on + the backing filesystem in CacheFiles. Although readpages() does an + asynchronous data fetch, the determination of whether pages exist is done + synchronously - and the netfs does not proceed until this has been + determined. + + If this option is to be used, FSCACHE_OP_WAITING must be set in op->flags + before submitting the operation, and the operating thread must wait for it + to be cleared before proceeding: + + wait_on_bit(&op->flags, FSCACHE_OP_WAITING, + fscache_wait_bit, TASK_UNINTERRUPTIBLE); + + + (2) The operation may be fast asynchronous (FSCACHE_OP_FAST), in which case it + will be given to keventd to process. Such an operation is not permitted + to sleep on I/O. + + This is, for example, used by CacheFiles to copy data from a backing fs + page to a netfs page after the backing fs has read the page in. + + If this option is used, op->fast_work and op->processor must be + initialised before submitting the operation: + + INIT_WORK(&op->fast_work, do_some_work); + + + (3) The operation may be slow asynchronous (FSCACHE_OP_SLOW), in which case it + will be given to the slow work facility to process. Such an operation is + permitted to sleep on I/O. + + This is, for example, used by FS-Cache to handle background writes of + pages that have just been fetched from a remote server. + + If this option is used, op->slow_work and op->processor must be + initialised before submitting the operation: + + fscache_operation_init_slow(op, processor) + + +Furthermore, operations may be one of two types: + + (1) Exclusive (FSCACHE_OP_EXCLUSIVE). Operations of this type may not run in + conjunction with any other operation on the object being operated upon. + + An example of this is the attribute change operation, in which the file + being written to may need truncation. + + (2) Shareable. Operations of this type may be running simultaneously. It's + up to the operation implementation to prevent interference between other + operations running at the same time. + + +========= +PROCEDURE +========= + +Operations are used through the following procedure: + + (1) The submitting thread must allocate the operation and initialise it + itself. Normally this would be part of a more specific structure with the + generic op embedded within. + + (2) The submitting thread must then submit the operation for processing using + one of the following two functions: + + int fscache_submit_op(struct fscache_object *object, + struct fscache_operation *op); + + int fscache_submit_exclusive_op(struct fscache_object *object, + struct fscache_operation *op); + + The first function should be used to submit non-exclusive ops and the + second to submit exclusive ones. The caller must still set the + FSCACHE_OP_EXCLUSIVE flag. + + If successful, both functions will assign the operation to the specified + object and return 0. -ENOBUFS will be returned if the object specified is + permanently unavailable. + + The operation manager will defer operations on an object that is still + undergoing lookup or creation. The operation will also be deferred if an + operation of conflicting exclusivity is in progress on the object. + + If the operation is asynchronous, the manager will retain a reference to + it, so the caller should put their reference to it by passing it to: + + void fscache_put_operation(struct fscache_operation *op); + + (3) If the submitting thread wants to do the work itself, and has marked the + operation with FSCACHE_OP_MYTHREAD, then it should monitor + FSCACHE_OP_WAITING as described above and check the state of the object if + necessary (the object might have died whilst the thread was waiting). + + When it has finished doing its processing, it should call + fscache_op_complete() and fscache_put_operation() on it. + + (4) The operation holds an effective lock upon the object, preventing other + exclusive ops conflicting until it is released. The operation can be + enqueued for further immediate asynchronous processing by adjusting the + CPU time provisioning option if necessary, eg: + + op->flags &= ~FSCACHE_OP_TYPE; + op->flags |= ~FSCACHE_OP_FAST; + + and calling: + + void fscache_enqueue_operation(struct fscache_operation *op) + + This can be used to allow other things to have use of the worker thread + pools. + + +===================== +ASYNCHRONOUS CALLBACK +===================== + +When used in asynchronous mode, the worker thread pool will invoke the +processor method with a pointer to the operation. This should then get at the +container struct by using container_of(): + + static void fscache_write_op(struct fscache_operation *_op) + { + struct fscache_storage *op = + container_of(_op, struct fscache_storage, op); + ... + } + +The caller holds a reference on the operation, and will invoke +fscache_put_operation() when the processor function returns. The processor +function is at liberty to call fscache_enqueue_operation() or to take extra +references. diff --git a/Documentation/filesystems/ceph.txt b/Documentation/filesystems/ceph.txt new file mode 100644 index 00000000000..d6030aa3337 --- /dev/null +++ b/Documentation/filesystems/ceph.txt @@ -0,0 +1,148 @@ +Ceph Distributed File System +============================ + +Ceph is a distributed network file system designed to provide good +performance, reliability, and scalability. + +Basic features include: + + * POSIX semantics + * Seamless scaling from 1 to many thousands of nodes + * High availability and reliability. No single point of failure. + * N-way replication of data across storage nodes + * Fast recovery from node failures + * Automatic rebalancing of data on node addition/removal + * Easy deployment: most FS components are userspace daemons + +Also, + * Flexible snapshots (on any directory) + * Recursive accounting (nested files, directories, bytes) + +In contrast to cluster filesystems like GFS, OCFS2, and GPFS that rely +on symmetric access by all clients to shared block devices, Ceph +separates data and metadata management into independent server +clusters, similar to Lustre. Unlike Lustre, however, metadata and +storage nodes run entirely as user space daemons. Storage nodes +utilize btrfs to store data objects, leveraging its advanced features +(checksumming, metadata replication, etc.). File data is striped +across storage nodes in large chunks to distribute workload and +facilitate high throughputs. When storage nodes fail, data is +re-replicated in a distributed fashion by the storage nodes themselves +(with some minimal coordination from a cluster monitor), making the +system extremely efficient and scalable. + +Metadata servers effectively form a large, consistent, distributed +in-memory cache above the file namespace that is extremely scalable, +dynamically redistributes metadata in response to workload changes, +and can tolerate arbitrary (well, non-Byzantine) node failures. The +metadata server takes a somewhat unconventional approach to metadata +storage to significantly improve performance for common workloads. In +particular, inodes with only a single link are embedded in +directories, allowing entire directories of dentries and inodes to be +loaded into its cache with a single I/O operation. The contents of +extremely large directories can be fragmented and managed by +independent metadata servers, allowing scalable concurrent access. + +The system offers automatic data rebalancing/migration when scaling +from a small cluster of just a few nodes to many hundreds, without +requiring an administrator carve the data set into static volumes or +go through the tedious process of migrating data between servers. +When the file system approaches full, new nodes can be easily added +and things will "just work." + +Ceph includes flexible snapshot mechanism that allows a user to create +a snapshot on any subdirectory (and its nested contents) in the +system. Snapshot creation and deletion are as simple as 'mkdir +.snap/foo' and 'rmdir .snap/foo'. + +Ceph also provides some recursive accounting on directories for nested +files and bytes. That is, a 'getfattr -d foo' on any directory in the +system will reveal the total number of nested regular files and +subdirectories, and a summation of all nested file sizes. This makes +the identification of large disk space consumers relatively quick, as +no 'du' or similar recursive scan of the file system is required. + + +Mount Syntax +============ + +The basic mount syntax is: + + # mount -t ceph monip[:port][,monip2[:port]...]:/[subdir] mnt + +You only need to specify a single monitor, as the client will get the +full list when it connects. (However, if the monitor you specify +happens to be down, the mount won't succeed.) The port can be left +off if the monitor is using the default. So if the monitor is at +1.2.3.4, + + # mount -t ceph 1.2.3.4:/ /mnt/ceph + +is sufficient. If /sbin/mount.ceph is installed, a hostname can be +used instead of an IP address. + + + +Mount Options +============= + + ip=A.B.C.D[:N] + Specify the IP and/or port the client should bind to locally. + There is normally not much reason to do this. If the IP is not + specified, the client's IP address is determined by looking at the + address its connection to the monitor originates from. + + wsize=X + Specify the maximum write size in bytes. By default there is no + maximum. Ceph will normally size writes based on the file stripe + size. + + rsize=X + Specify the maximum readahead. + + mount_timeout=X + Specify the timeout value for mount (in seconds), in the case + of a non-responsive Ceph file system. The default is 30 + seconds. + + rbytes + When stat() is called on a directory, set st_size to 'rbytes', + the summation of file sizes over all files nested beneath that + directory. This is the default. + + norbytes + When stat() is called on a directory, set st_size to the + number of entries in that directory. + + nocrc + Disable CRC32C calculation for data writes. If set, the storage node + must rely on TCP's error correction to detect data corruption + in the data payload. + + dcache + Use the dcache contents to perform negative lookups and + readdir when the client has the entire directory contents in + its cache. (This does not change correctness; the client uses + cached metadata only when a lease or capability ensures it is + valid.) + + nodcache + Do not use the dcache as above. This avoids a significant amount of + complex code, sacrificing performance without affecting correctness, + and is useful for tracking down bugs. + + noasyncreaddir + Do not use the dcache as above for readdir. + +More Information +================ + +For more information on Ceph, see the home page at + http://ceph.newdream.net/ + +The Linux kernel client source tree is available at + git://ceph.newdream.net/git/ceph-client.git + git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client.git + +and the source for the full system is at + git://ceph.newdream.net/git/ceph.git diff --git a/Documentation/filesystems/cifs.txt b/Documentation/filesystems/cifs.txt deleted file mode 100644 index 49cc923a93e..00000000000 --- a/Documentation/filesystems/cifs.txt +++ /dev/null @@ -1,51 +0,0 @@ - This is the client VFS module for the Common Internet File System - (CIFS) protocol which is the successor to the Server Message Block - (SMB) protocol, the native file sharing mechanism for most early - PC operating systems. CIFS is fully supported by current network - file servers such as Windows 2000, Windows 2003 (including - Windows XP) as well by Samba (which provides excellent CIFS - server support for Linux and many other operating systems), so - this network filesystem client can mount to a wide variety of - servers. The smbfs module should be used instead of this cifs module - for mounting to older SMB servers such as OS/2. The smbfs and cifs - modules can coexist and do not conflict. The CIFS VFS filesystem - module is designed to work well with servers that implement the - newer versions (dialects) of the SMB/CIFS protocol such as Samba, - the program written by Andrew Tridgell that turns any Unix host - into a SMB/CIFS file server. - - The intent of this module is to provide the most advanced network - file system function for CIFS compliant servers, including better - POSIX compliance, secure per-user session establishment, high - performance safe distributed caching (oplock), optional packet - signing, large files, Unicode support and other internationalization - improvements. Since both Samba server and this filesystem client support - the CIFS Unix extensions, the combination can provide a reasonable - alternative to NFSv4 for fileserving in some Linux to Linux environments, - not just in Linux to Windows environments. - - This filesystem has an optional mount utility (mount.cifs) that can - be obtained from the project page and installed in the path in the same - directory with the other mount helpers (such as mount.smbfs). - Mounting using the cifs filesystem without installing the mount helper - requires specifying the server's ip address. - - For Linux 2.4: - mount //anything/here /mnt_target -o - user=username,pass=password,unc=//ip_address_of_server/sharename - - For Linux 2.5: - mount //ip_address_of_server/sharename /mnt_target -o user=username, pass=password - - - For more information on the module see the project page at - - http://us1.samba.org/samba/Linux_CIFS_client.html - - For more information on CIFS see: - - http://www.snia.org/tech_activities/CIFS - - or the Samba site: - - http://www.samba.org diff --git a/Documentation/filesystems/cifs/AUTHORS b/Documentation/filesystems/cifs/AUTHORS new file mode 100644 index 00000000000..ca4a67a0bb1 --- /dev/null +++ b/Documentation/filesystems/cifs/AUTHORS @@ -0,0 +1,56 @@ +Original Author +=============== +Steve French (sfrench@samba.org) + +The author wishes to express his appreciation and thanks to: +Andrew Tridgell (Samba team) for his early suggestions about smb/cifs VFS +improvements. Thanks to IBM for allowing me time and test resources to pursue +this project, to Jim McDonough from IBM (and the Samba Team) for his help, to +the IBM Linux JFS team for explaining many esoteric Linux filesystem features. +Jeremy Allison of the Samba team has done invaluable work in adding the server +side of the original CIFS Unix extensions and reviewing and implementing +portions of the newer CIFS POSIX extensions into the Samba 3 file server. Thank +Dave Boutcher of IBM Rochester (author of the OS/400 smb/cifs filesystem client) +for proving years ago that very good smb/cifs clients could be done on Unix-like +operating systems. Volker Lendecke, Andrew Tridgell, Urban Widmark, John +Newbigin and others for their work on the Linux smbfs module. Thanks to +the other members of the Storage Network Industry Association CIFS Technical +Workgroup for their work specifying this highly complex protocol and finally +thanks to the Samba team for their technical advice and encouragement. + +Patch Contributors +------------------ +Zwane Mwaikambo +Andi Kleen +Amrut Joshi +Shobhit Dayal +Sergey Vlasov +Richard Hughes +Yury Umanets +Mark Hamzy (for some of the early cifs IPv6 work) +Domen Puncer +Jesper Juhl (in particular for lots of whitespace/formatting cleanup) +Vince Negri and Dave Stahl (for finding an important caching bug) +Adrian Bunk (kcalloc cleanups) +Miklos Szeredi +Kazeon team for various fixes especially for 2.4 version. +Asser Ferno (Change Notify support) +Shaggy (Dave Kleikamp) for innumerable small fs suggestions and some good cleanup +Gunter Kukkukk (testing and suggestions for support of old servers) +Igor Mammedov (DFS support) +Jeff Layton (many, many fixes, as well as great work on the cifs Kerberos code) +Scott Lovenberg + +Test case and Bug Report contributors +------------------------------------- +Thanks to those in the community who have submitted detailed bug reports +and debug of problems they have found: Jochen Dolze, David Blaine, +Rene Scharfe, Martin Josefsson, Alexander Wild, Anthony Liguori, +Lars Muller, Urban Widmark, Massimiliano Ferrero, Howard Owen, +Olaf Kirch, Kieron Briggs, Nick Millington and others. Also special +mention to the Stanford Checker (SWAT) which pointed out many minor +bugs in error paths. Valuable suggestions also have come from Al Viro +and Dave Miller. + +And thanks to the IBM LTC and Power test teams and SuSE testers for +finding multiple bugs during excellent stress test runs. diff --git a/Documentation/filesystems/cifs/CHANGES b/Documentation/filesystems/cifs/CHANGES new file mode 100644 index 00000000000..bc0025cdd1c --- /dev/null +++ b/Documentation/filesystems/cifs/CHANGES @@ -0,0 +1,1065 @@ +Version 1.62 +------------ +Add sockopt=TCP_NODELAY mount option. EA (xattr) routines hardened +to more strictly handle corrupt frames. + +Version 1.61 +------------ +Fix append problem to Samba servers (files opened with O_APPEND could +have duplicated data). Fix oops in cifs_lookup. Workaround problem +mounting to OS/400 Netserve. Fix oops in cifs_get_tcp_session. +Disable use of server inode numbers when server only +partially supports them (e.g. for one server querying inode numbers on +FindFirst fails but QPathInfo queries works). Fix oops with dfs in +cifs_put_smb_ses. Fix mmap to work on directio mounts (needed +for OpenOffice when on forcedirectio mount e.g.) + +Version 1.60 +------------- +Fix memory leak in reconnect. Fix oops in DFS mount error path. +Set s_maxbytes to smaller (the max that vfs can handle) so that +sendfile will now work over cifs mounts again. Add noforcegid +and noforceuid mount parameters. Fix small mem leak when using +ntlmv2. Fix 2nd mount to same server but with different port to +be allowed (rather than reusing the 1st port) - only when the +user explicitly overrides the port on the 2nd mount. + +Version 1.59 +------------ +Client uses server inode numbers (which are persistent) rather than +client generated ones by default (mount option "serverino" turned +on by default if server supports it). Add forceuid and forcegid +mount options (so that when negotiating unix extensions specifying +which uid mounted does not immediately force the server's reported +uids to be overridden). Add support for scope mount parm. Improve +hard link detection to use same inode for both. Do not set +read-only dos attribute on directories (for chmod) since Windows +explorer special cases this attribute bit for directories for +a different purpose. + +Version 1.58 +------------ +Guard against buffer overruns in various UCS-2 to UTF-8 string conversions +when the UTF-8 string is composed of unusually long (more than 4 byte) converted +characters. Add support for mounting root of a share which redirects immediately +to DFS target. Convert string conversion functions from Unicode to more +accurately mark string length before allocating memory (which may help the +rare cases where a UTF-8 string is much larger than the UCS2 string that +we converted from). Fix endianness of the vcnum field used during +session setup to distinguish multiple mounts to same server from different +userids. Raw NTLMSSP fixed (it requires /proc/fs/cifs/experimental +flag to be set to 2, and mount must enable krb5 to turn on extended security). +Performance of file create to Samba improved (posix create on lookup +removes 1 of 2 network requests sent on file create) + +Version 1.57 +------------ +Improve support for multiple security contexts to the same server. We +used to use the same "vcnumber" for all connections which could cause +the server to treat subsequent connections, especially those that +are authenticated as guest, as reconnections, invalidating the earlier +user's smb session. This fix allows cifs to mount multiple times to the +same server with different userids without risking invalidating earlier +established security contexts. fsync now sends SMB Flush operation +to better ensure that we wait for server to write all of the data to +server disk (not just write it over the network). Add new mount +parameter to allow user to disable sending the (slow) SMB flush on +fsync if desired (fsync still flushes all cached write data to the server). +Posix file open support added (turned off after one attempt if server +fails to support it properly, as with Samba server versions prior to 3.3.2) +Fix "redzone overwritten" bug in cifs_put_tcon (CIFSTcon may allocate too +little memory for the "nativeFileSystem" field returned by the server +during mount). Endian convert inode numbers if necessary (makes it easier +to compare inode numbers on network files from big endian systems). + +Version 1.56 +------------ +Add "forcemandatorylock" mount option to allow user to use mandatory +rather than posix (advisory) byte range locks, even though server would +support posix byte range locks. Fix query of root inode when prefixpath +specified and user does not have access to query information about the +top of the share. Fix problem in 2.6.28 resolving DFS paths to +Samba servers (worked to Windows). Fix rmdir so that pending search +(readdir) requests do not get invalid results which include the now +removed directory. Fix oops in cifs_dfs_ref.c when prefixpath is not reachable +when using DFS. Add better file create support to servers which support +the CIFS POSIX protocol extensions (this adds support for new flags +on create, and improves semantics for write of locked ranges). + +Version 1.55 +------------ +Various fixes to make delete of open files behavior more predictable +(when delete of an open file fails we mark the file as "delete-on-close" +in a way that more servers accept, but only if we can first rename the +file to a temporary name). Add experimental support for more safely +handling fcntl(F_SETLEASE). Convert cifs to using blocking tcp +sends, and also let tcp autotune the socket send and receive buffers. +This reduces the number of EAGAIN errors returned by TCP/IP in +high stress workloads (and the number of retries on socket writes +when sending large SMBWriteX requests). Fix case in which a portion of +data can in some cases not get written to the file on the server before the +file is closed. Fix DFS parsing to properly handle path consumed field, +and to handle certain codepage conversions better. Fix mount and +umount race that can cause oops in mount or umount or reconnect. + +Version 1.54 +------------ +Fix premature write failure on congested networks (we would give up +on EAGAIN from the socket too quickly on large writes). +Cifs_mkdir and cifs_create now respect the setgid bit on parent dir. +Fix endian problems in acl (mode from/to cifs acl) on bigendian +architectures. Fix problems with preserving timestamps on copying open +files (e.g. "cp -a") to Windows servers. For mkdir and create honor setgid bit +on parent directory when server supports Unix Extensions but not POSIX +create. Update cifs.upcall version to handle new Kerberos sec flags +(this requires update of cifs.upcall program from Samba). Fix memory leak +on dns_upcall (resolving DFS referralls). Fix plain text password +authentication (requires setting SecurityFlags to 0x30030 to enable +lanman and plain text though). Fix writes to be at correct offset when +file is open with O_APPEND and file is on a directio (forcediretio) mount. +Fix bug in rewinding readdir directory searches. Add nodfs mount option. + +Version 1.53 +------------ +DFS support added (Microsoft Distributed File System client support needed +for referrals which enable a hierarchical name space among servers). +Disable temporary caching of mode bits to servers which do not support +storing of mode (e.g. Windows servers, when client mounts without cifsacl +mount option) and add new "dynperm" mount option to enable temporary caching +of mode (enable old behavior). Fix hang on mount caused when server crashes +tcp session during negotiate protocol. + +Version 1.52 +------------ +Fix oops on second mount to server when null auth is used. +Enable experimental Kerberos support. Return writebehind errors on flush +and sync so that events like out of disk space get reported properly on +cached files. Fix setxattr failure to certain Samba versions. Fix mount +of second share to disconnected server session (autoreconnect on this). +Add ability to modify cifs acls for handling chmod (when mounted with +cifsacl flag). Fix prefixpath path separator so we can handle mounts +with prefixpaths longer than one directory (one path component) when +mounted to Windows servers. Fix slow file open when cifsacl +enabled. Fix memory leak in FindNext when the SMB call returns -EBADF. + + +Version 1.51 +------------ +Fix memory leak in statfs when mounted to very old servers (e.g. +Windows 9x). Add new feature "POSIX open" which allows servers +which support the current POSIX Extensions to provide better semantics +(e.g. delete for open files opened with posix open). Take into +account umask on posix mkdir not just older style mkdir. Add +ability to mount to IPC$ share (which allows CIFS named pipes to be +opened, read and written as if they were files). When 1st tree +connect fails (e.g. due to signing negotiation failure) fix +leak that causes cifsd not to stop and rmmod to fail to cleanup +cifs_request_buffers pool. Fix problem with POSIX Open/Mkdir on +bigendian architectures. Fix possible memory corruption when +EAGAIN returned on kern_recvmsg. Return better error if server +requires packet signing but client has disabled it. When mounted +with cifsacl mount option - mode bits are approximated based +on the contents of the ACL of the file or directory. When cifs +mount helper is missing convert make sure that UNC name +has backslash (not forward slash) between ip address of server +and the share name. + +Version 1.50 +------------ +Fix NTLMv2 signing. NFS server mounted over cifs works (if cifs mount is +done with "serverino" mount option). Add support for POSIX Unlink +(helps with certain sharing violation cases when server such as +Samba supports newer POSIX CIFS Protocol Extensions). Add "nounix" +mount option to allow disabling the CIFS Unix Extensions for just +that mount. Fix hang on spinlock in find_writable_file (race when +reopening file after session crash). Byte range unlock request to +windows server could unlock more bytes (on server copy of file) +than intended if start of unlock request is well before start of +a previous byte range lock that we issued. + +Version 1.49 +------------ +IPv6 support. Enable ipv6 addresses to be passed on mount (put the ipv6 +address after the "ip=" mount option, at least until mount.cifs is fixed to +handle DNS host to ipv6 name translation). Accept override of uid or gid +on mount even when Unix Extensions are negotiated (it used to be ignored +when Unix Extensions were ignored). This allows users to override the +default uid and gid for files when they are certain that the uids or +gids on the server do not match those of the client. Make "sec=none" +mount override username (so that null user connection is attempted) +to match what documentation said. Support for very large reads, over 127K, +available to some newer servers (such as Samba 3.0.26 and later but +note that it also requires setting CIFSMaxBufSize at module install +time to a larger value which may hurt performance in some cases). +Make sign option force signing (or fail if server does not support it). + +Version 1.48 +------------ +Fix mtime bouncing around from local idea of last write times to remote time. +Fix hang (in i_size_read) when simultaneous size update of same remote file +on smp system corrupts sequence number. Do not reread unnecessarily partial page +(which we are about to overwrite anyway) when writing out file opened rw. +When DOS attribute of file on non-Unix server's file changes on the server side +from read-only back to read-write, reflect this change in default file mode +(we had been leaving a file's mode read-only until the inode were reloaded). +Allow setting of attribute back to ATTR_NORMAL (removing readonly dos attribute +when archive dos attribute not set and we are changing mode back to writeable +on server which does not support the Unix Extensions). Remove read only dos +attribute on chmod when adding any write permission (ie on any of +user/group/other (not all of user/group/other ie 0222) when +mounted to windows. Add support for POSIX MkDir (slight performance +enhancement and eliminates the network race between the mkdir and set +path info of the mode). + + +Version 1.47 +------------ +Fix oops in list_del during mount caused by unaligned string. +Fix file corruption which could occur on some large file +copies caused by writepages page i/o completion bug. +Seek to SEEK_END forces check for update of file size for non-cached +files. Allow file size to be updated on remote extend of locally open, +non-cached file. Fix reconnect to newer Samba servers (or other servers +which support the CIFS Unix/POSIX extensions) so that we again tell the +server the Unix/POSIX cifs capabilities which we support (SetFSInfo). +Add experimental support for new POSIX Open/Mkdir (which returns +stat information on the open, and allows setting the mode). + +Version 1.46 +------------ +Support deep tree mounts. Better support OS/2, Win9x (DOS) time stamps. +Allow null user to be specified on mount ("username="). Do not return +EINVAL on readdir when filldir fails due to overwritten blocksize +(fixes FC problem). Return error in rename 2nd attempt retry (ie report +if rename by handle also fails, after rename by path fails, we were +not reporting whether the retry worked or not). Fix NTLMv2 to +work to Windows servers (mount with option "sec=ntlmv2"). + +Version 1.45 +------------ +Do not time out lockw calls when using posix extensions. Do not +time out requests if server still responding reasonably fast +on requests on other threads. Improve POSIX locking emulation, +(lock cancel now works, and unlock of merged range works even +to Windows servers now). Fix oops on mount to lanman servers +(win9x, os/2 etc.) when null password. Do not send listxattr +(SMB to query all EAs) if nouser_xattr specified. Fix SE Linux +problem (instantiate inodes/dentries in right order for readdir). + +Version 1.44 +------------ +Rewritten sessionsetup support, including support for legacy SMB +session setup needed for OS/2 and older servers such as Windows 95 and 98. +Fix oops on ls to OS/2 servers. Add support for level 1 FindFirst +so we can do search (ls etc.) to OS/2. Do not send NTCreateX +or recent levels of FindFirst unless server says it supports NT SMBs +(instead use legacy equivalents from LANMAN dialect). Fix to allow +NTLMv2 authentication support (now can use stronger password hashing +on mount if corresponding /proc/fs/cifs/SecurityFlags is set (0x4004). +Allow override of global cifs security flags on mount via "sec=" option(s). + +Version 1.43 +------------ +POSIX locking to servers which support CIFS POSIX Extensions +(disabled by default controlled by proc/fs/cifs/Experimental). +Handle conversion of long share names (especially Asian languages) +to Unicode during mount. Fix memory leak in sess struct on reconnect. +Fix rare oops after acpi suspend. Fix O_TRUNC opens to overwrite on +cifs open which helps rare case when setpathinfo fails or server does +not support it. + +Version 1.42 +------------ +Fix slow oplock break when mounted to different servers at the same time and +the tids match and we try to find matching fid on wrong server. Fix read +looping when signing required by server (2.6.16 kernel only). Fix readdir +vs. rename race which could cause each to hang. Return . and .. even +if server does not. Allow searches to skip first three entries and +begin at any location. Fix oops in find_writeable_file. + +Version 1.41 +------------ +Fix NTLMv2 security (can be enabled in /proc/fs/cifs) so customers can +configure stronger authentication. Fix sfu symlinks so they can +be followed (not just recognized). Fix wraparound of bcc on +read responses when buffer size over 64K and also fix wrap of +max smb buffer size when CIFSMaxBufSize over 64K. Fix oops in +cifs_user_read and cifs_readpages (when EAGAIN on send of smb +on socket is returned over and over). Add POSIX (advisory) byte range +locking support (requires server with newest CIFS UNIX Extensions +to the protocol implemented). Slow down negprot slightly in port 139 +RFC1001 case to give session_init time on buggy servers. + +Version 1.40 +------------ +Use fsuid (fsgid) more consistently instead of uid (gid). Improve performance +of readpages by eliminating one extra memcpy. Allow update of file size +from remote server even if file is open for write as long as mount is +directio. Recognize share mode security and send NTLM encrypted password +on tree connect if share mode negotiated. + +Version 1.39 +------------ +Defer close of a file handle slightly if pending writes depend on that handle +(this reduces the EBADF bad file handle errors that can be logged under heavy +stress on writes). Modify cifs Kconfig options to expose CONFIG_CIFS_STATS2 +Fix SFU style symlinks and mknod needed for servers which do not support the +CIFS Unix Extensions. Fix setfacl/getfacl on bigendian. Timeout negative +dentries so files that the client sees as deleted but that later get created +on the server will be recognized. Add client side permission check on setattr. +Timeout stuck requests better (where server has never responded or sent corrupt +responses) + +Version 1.38 +------------ +Fix tcp socket retransmission timeouts (e.g. on ENOSPACE from the socket) +to be smaller at first (but increasing) so large write performance performance +over GigE is better. Do not hang thread on illegal byte range lock response +from Windows (Windows can send an RFC1001 size which does not match smb size) by +allowing an SMBs TCP length to be up to a few bytes longer than it should be. +wsize and rsize can now be larger than negotiated buffer size if server +supports large readx/writex, even when directio mount flag not specified. +Write size will in many cases now be 16K instead of 4K which greatly helps +file copy performance on lightly loaded networks. Fix oops in dnotify +when experimental config flag enabled. Make cifsFYI more granular. + +Version 1.37 +------------ +Fix readdir caching when unlink removes file in current search buffer, +and this is followed by a rewind search to just before the deleted entry. +Do not attempt to set ctime unless atime and/or mtime change requested +(most servers throw it away anyway). Fix length check of received smbs +to be more accurate. Fix big endian problem with mapchars mount option, +and with a field returned by statfs. + +Version 1.36 +------------ +Add support for mounting to older pre-CIFS servers such as Windows9x and ME. +For these older servers, add option for passing netbios name of server in +on mount (servernetbiosname). Add suspend support for power management, to +avoid cifsd thread preventing software suspend from working. +Add mount option for disabling the default behavior of sending byte range lock +requests to the server (necessary for certain applications which break with +mandatory lock behavior such as Evolution), and also mount option for +requesting case insensitive matching for path based requests (requesting +case sensitive is the default). + +Version 1.35 +------------ +Add writepage performance improvements. Fix path name conversions +for long filenames on mounts which were done with "mapchars" mount option +specified. Ensure multiplex ids do not collide. Fix case in which +rmmod can oops if done soon after last unmount. Fix truncated +search (readdir) output when resume filename was a long filename. +Fix filename conversion when mapchars mount option was specified and +filename was a long filename. + +Version 1.34 +------------ +Fix error mapping of the TOO_MANY_LINKS (hardlinks) case. +Do not oops if root user kills cifs oplock kernel thread or +kills the cifsd thread (NB: killing the cifs kernel threads is not +recommended, unmount and rmmod cifs will kill them when they are +no longer needed). Fix readdir to ASCII servers (ie older servers +which do not support Unicode) and also require asterisk. +Fix out of memory case in which data could be written one page +off in the page cache. + +Version 1.33 +------------ +Fix caching problem, in which readdir of directory containing a file +which was cached could cause the file's time stamp to be updated +without invalidating the readahead data (so we could get stale +file data on the client for that file even as the server copy changed). +Cleanup response processing so cifsd can not loop when abnormally +terminated. + + +Version 1.32 +------------ +Fix oops in ls when Transact2 FindFirst (or FindNext) returns more than one +transact response for an SMB request and search entry split across two frames. +Add support for lsattr (getting ext2/ext3/reiserfs attr flags from the server) +as new protocol extensions. Do not send Get/Set calls for POSIX ACLs +unless server explicitly claims to support them in CIFS Unix extensions +POSIX ACL capability bit. Fix packet signing when multiuser mounting with +different users from the same client to the same server. Fix oops in +cifs_close. Add mount option for remapping reserved characters in +filenames (also allow recognizing files with created by SFU which have any +of these seven reserved characters, except backslash, to be recognized). +Fix invalid transact2 message (we were sometimes trying to interpret +oplock breaks as SMB responses). Add ioctl for checking that the +current uid matches the uid of the mounter (needed by umount.cifs). +Reduce the number of large buffer allocations in cifs response processing +(significantly reduces memory pressure under heavy stress with multiple +processes accessing the same server at the same time). + +Version 1.31 +------------ +Fix updates of DOS attributes and time fields so that files on NT4 servers +do not get marked delete on close. Display sizes of cifs buffer pools in +cifs stats. Fix oops in unmount when cifsd thread being killed by +shutdown. Add generic readv/writev and aio support. Report inode numbers +consistently in readdir and lookup (when serverino mount option is +specified use the inode number that the server reports - for both lookup +and readdir, otherwise by default the locally generated inode number is used +for inodes created in either path since servers are not always able to +provide unique inode numbers when exporting multiple volumes from under one +sharename). + +Version 1.30 +------------ +Allow new nouser_xattr mount parm to disable xattr support for user namespace. +Do not flag user_xattr mount parm in dmesg. Retry failures setting file time +(mostly affects NT4 servers) by retry with handle based network operation. +Add new POSIX Query FS Info for returning statfs info more accurately. +Handle passwords with multiple commas in them. + +Version 1.29 +------------ +Fix default mode in sysfs of cifs module parms. Remove old readdir routine. +Fix capabilities flags for large readx so as to allow reads larger than 64K. + +Version 1.28 +------------ +Add module init parm for large SMB buffer size (to allow it to be changed +from its default of 16K) which is especially useful for large file copy +when mounting with the directio mount option. Fix oops after +returning from mount when experimental ExtendedSecurity enabled and +SpnegoNegotiated returning invalid error. Fix case to retry better when +peek returns from 1 to 3 bytes on socket which should have more data. +Fixed path based calls (such as cifs lookup) to handle path names +longer than 530 (now can handle PATH_MAX). Fix pass through authentication +from Samba server to DC (Samba required dummy LM password). + +Version 1.27 +------------ +Turn off DNOTIFY (directory change notification support) by default +(unless built with the experimental flag) to fix hang with KDE +file browser. Fix DNOTIFY flag mappings. Fix hang (in wait_event +waiting on an SMB response) in SendReceive when session dies but +reconnects quickly from another task. Add module init parms for +minimum number of large and small network buffers in the buffer pools, +and for the maximum number of simultaneous requests. + +Version 1.26 +------------ +Add setfacl support to allow setting of ACLs remotely to Samba 3.10 and later +and other POSIX CIFS compliant servers. Fix error mapping for getfacl +to EOPNOTSUPP when server does not support posix acls on the wire. Fix +improperly zeroed buffer in CIFS Unix extensions set times call. + +Version 1.25 +------------ +Fix internationalization problem in cifs readdir with filenames that map to +longer UTF-8 strings than the string on the wire was in Unicode. Add workaround +for readdir to netapp servers. Fix search rewind (seek into readdir to return +non-consecutive entries). Do not do readdir when server negotiates +buffer size to small to fit filename. Add support for reading POSIX ACLs from +the server (add also acl and noacl mount options). + +Version 1.24 +------------ +Optionally allow using server side inode numbers, rather than client generated +ones by specifying mount option "serverino" - this is required for some apps +to work which double check hardlinked files and have persistent inode numbers. + +Version 1.23 +------------ +Multiple bigendian fixes. On little endian systems (for reconnect after +network failure) fix tcp session reconnect code so we do not try first +to reconnect on reverse of port 445. Treat reparse points (NTFS junctions) +as directories rather than symlinks because we can do follow link on them. + +Version 1.22 +------------ +Add config option to enable XATTR (extended attribute) support, mapping +xattr names in the "user." namespace space to SMB/CIFS EAs. Lots of +minor fixes pointed out by the Stanford SWAT checker (mostly missing +or out of order NULL pointer checks in little used error paths). + +Version 1.21 +------------ +Add new mount parm to control whether mode check (generic_permission) is done +on the client. If Unix extensions are enabled and the uids on the client +and server do not match, client permission checks are meaningless on +server uids that do not exist on the client (this does not affect the +normal ACL check which occurs on the server). Fix default uid +on mknod to match create and mkdir. Add optional mount parm to allow +override of the default uid behavior (in which the server sets the uid +and gid of newly created files). Normally for network filesystem mounts +user want the server to set the uid/gid on newly created files (rather than +using uid of the client processes you would in a local filesystem). + +Version 1.20 +------------ +Make transaction counts more consistent. Merge /proc/fs/cifs/SimultaneousOps +info into /proc/fs/cifs/DebugData. Fix oops in rare oops in readdir +(in build_wildcard_path_from_dentry). Fix mknod to pass type field +(block/char/fifo) properly. Remove spurious mount warning log entry when +credentials passed as mount argument. Set major/minor device number in +inode for block and char devices when unix extensions enabled. + +Version 1.19 +------------ +Fix /proc/fs/cifs/Stats and DebugData display to handle larger +amounts of return data. Properly limit requests to MAX_REQ (50 +is the usual maximum active multiplex SMB/CIFS requests per server). +Do not kill cifsd (and thus hurt the other SMB session) when more than one +session to the same server (but with different userids) exists and one +of the two user's smb sessions is being removed while leaving the other. +Do not loop reconnecting in cifsd demultiplex thread when admin +kills the thread without going through unmount. + +Version 1.18 +------------ +Do not rename hardlinked files (since that should be a noop). Flush +cached write behind data when reopening a file after session abend, +except when already in write. Grab per socket sem during reconnect +to avoid oops in sendmsg if overlapping with reconnect. Do not +reset cached inode file size on readdir for files open for write on +client. + + +Version 1.17 +------------ +Update number of blocks in file so du command is happier (in Linux a fake +blocksize of 512 is required for calculating number of blocks in inode). +Fix prepare write of partial pages to read in data from server if possible. +Fix race on tcpStatus field between unmount and reconnection code, causing +cifsd process sometimes to hang around forever. Improve out of memory +checks in cifs_filldir + +Version 1.16 +------------ +Fix incorrect file size in file handle based setattr on big endian hardware. +Fix oops in build_path_from_dentry when out of memory. Add checks for invalid +and closing file structs in writepage/partialpagewrite. Add statistics +for each mounted share (new menuconfig option). Fix endianness problem in +volume information displayed in /proc/fs/cifs/DebugData (only affects +affects big endian architectures). Prevent renames while constructing +path names for open, mkdir and rmdir. + +Version 1.15 +------------ +Change to mempools for alloc smb request buffers and multiplex structs +to better handle low memory problems (and potential deadlocks). + +Version 1.14 +------------ +Fix incomplete listings of large directories on Samba servers when Unix +extensions enabled. Fix oops when smb_buffer can not be allocated. Fix +rename deadlock when writing out dirty pages at same time. + +Version 1.13 +------------ +Fix open of files in which O_CREATE can cause the mode to change in +some cases. Fix case in which retry of write overlaps file close. +Fix PPC64 build error. Reduce excessive stack usage in smb password +hashing. Fix overwrite of Linux user's view of file mode to Windows servers. + +Version 1.12 +------------ +Fixes for large file copy, signal handling, socket retry, buffer +allocation and low memory situations. + +Version 1.11 +------------ +Better port 139 support to Windows servers (RFC1001/RFC1002 Session_Initialize) +also now allowing support for specifying client netbiosname. NT4 support added. + +Version 1.10 +------------ +Fix reconnection (and certain failed mounts) to properly wake up the +blocked users thread so it does not seem hung (in some cases was blocked +until the cifs receive timeout expired). Fix spurious error logging +to kernel log when application with open network files killed. + +Version 1.09 +------------ +Fix /proc/fs module unload warning message (that could be logged +to the kernel log). Fix intermittent failure in connectathon +test7 (hardlink count not immediately refreshed in case in which +inode metadata can be incorrectly kept cached when time near zero) + +Version 1.08 +------------ +Allow file_mode and dir_mode (specified at mount time) to be enforced +locally (the server already enforced its own ACLs too) for servers +that do not report the correct mode (do not support the +CIFS Unix Extensions). + +Version 1.07 +------------ +Fix some small memory leaks in some unmount error paths. Fix major leak +of cache pages in readpages causing multiple read oriented stress +testcases (including fsx, and even large file copy) to fail over time. + +Version 1.06 +------------ +Send NTCreateX with ATTR_POSIX if Linux/Unix extensions negotiated with server. +This allows files that differ only in case and improves performance of file +creation and file open to such servers. Fix semaphore conflict which causes +slow delete of open file to Samba (which unfortunately can cause an oplock +break to self while vfs_unlink held i_sem) which can hang for 20 seconds. + +Version 1.05 +------------ +fixes to cifs_readpages for fsx test case + +Version 1.04 +------------ +Fix caching data integrity bug when extending file size especially when no +oplock on file. Fix spurious logging of valid already parsed mount options +that are parsed outside of the cifs vfs such as nosuid. + + +Version 1.03 +------------ +Connect to server when port number override not specified, and tcp port +unitialized. Reset search to restart at correct file when kernel routine +filldir returns error during large directory searches (readdir). + +Version 1.02 +------------ +Fix caching problem when files opened by multiple clients in which +page cache could contain stale data, and write through did +not occur often enough while file was still open when read ahead +(read oplock) not allowed. Treat "sep=" when first mount option +as an override of comma as the default separator between mount +options. + +Version 1.01 +------------ +Allow passwords longer than 16 bytes. Allow null password string. + +Version 1.00 +------------ +Gracefully clean up failed mounts when attempting to mount to servers such as +Windows 98 that terminate tcp sessions during protocol negotiation. Handle +embedded commas in mount parsing of passwords. + +Version 0.99 +------------ +Invalidate local inode cached pages on oplock break and when last file +instance is closed so that the client does not continue using stale local +copy rather than later modified server copy of file. Do not reconnect +when server drops the tcp session prematurely before negotiate +protocol response. Fix oops in reopen_file when dentry freed. Allow +the support for CIFS Unix Extensions to be disabled via proc interface. + +Version 0.98 +------------ +Fix hang in commit_write during reconnection of open files under heavy load. +Fix unload_nls oops in a mount failure path. Serialize writes to same socket +which also fixes any possible races when cifs signatures are enabled in SMBs +being sent out of signature sequence number order. + +Version 0.97 +------------ +Fix byte range locking bug (endian problem) causing bad offset and +length. + +Version 0.96 +------------ +Fix oops (in send_sig) caused by CIFS unmount code trying to +wake up the demultiplex thread after it had exited. Do not log +error on harmless oplock release of closed handle. + +Version 0.95 +------------ +Fix unsafe global variable usage and password hash failure on gcc 3.3.1 +Fix problem reconnecting secondary mounts to same server after session +failure. Fix invalid dentry - race in mkdir when directory gets created +by another client between the lookup and mkdir. + +Version 0.94 +------------ +Fix to list processing in reopen_files. Fix reconnection when server hung +but tcpip session still alive. Set proper timeout on socket read. + +Version 0.93 +------------ +Add missing mount options including iocharset. SMP fixes in write and open. +Fix errors in reconnecting after TCP session failure. Fix module unloading +of default nls codepage + +Version 0.92 +------------ +Active smb transactions should never go negative (fix double FreeXid). Fix +list processing in file routines. Check return code on kmalloc in open. +Fix spinlock usage for SMP. + +Version 0.91 +------------ +Fix oops in reopen_files when invalid dentry. drop dentry on server rename +and on revalidate errors. Fix cases where pid is now tgid. Fix return code +on create hard link when server does not support them. + +Version 0.90 +------------ +Fix scheduling while atomic error in getting inode info on newly created file. +Fix truncate of existing files opened with O_CREAT but not O_TRUNC set. + +Version 0.89 +------------ +Fix oops on write to dead tcp session. Remove error log write for case when file open +O_CREAT but not O_EXCL + +Version 0.88 +------------ +Fix non-POSIX behavior on rename of open file and delete of open file by taking +advantage of trans2 SetFileInfo rename facility if available on target server. +Retry on ENOSPC and EAGAIN socket errors. + +Version 0.87 +------------ +Fix oops on big endian readdir. Set blksize to be even power of two (2**blkbits) to fix +allocation size miscalculation. After oplock token lost do not read through +cache. + +Version 0.86 +------------ +Fix oops on empty file readahead. Fix for file size handling for locally cached files. + +Version 0.85 +------------ +Fix oops in mkdir when server fails to return inode info. Fix oops in reopen_files +during auto reconnection to server after server recovered from failure. + +Version 0.84 +------------ +Finish support for Linux 2.5 open/create changes, which removes the +redundant NTCreate/QPathInfo/close that was sent during file create. +Enable oplock by default. Enable packet signing by default (needed to +access many recent Windows servers) + +Version 0.83 +------------ +Fix oops when mounting to long server names caused by inverted parms to kmalloc. +Fix MultiuserMount (/proc/fs/cifs configuration setting) so that when enabled +we will choose a cifs user session (smb uid) that better matches the local +uid if a) the mount uid does not match the current uid and b) we have another +session to the same server (ip address) for a different mount which +matches the current local uid. + +Version 0.82 +------------ +Add support for mknod of block or character devices. Fix oplock +code (distributed caching) to properly send response to oplock +break from server. + +Version 0.81 +------------ +Finish up CIFS packet digital signing for the default +NTLM security case. This should help Windows 2003 +network interoperability since it is common for +packet signing to be required now. Fix statfs (stat -f) +which recently started returning errors due to +invalid value (-1 instead of 0) being set in the +struct kstatfs f_ffiles field. + +Version 0.80 +----------- +Fix oops on stopping oplock thread when removing cifs when +built as module. + +Version 0.79 +------------ +Fix mount options for ro (readonly), uid, gid and file and directory mode. + +Version 0.78 +------------ +Fix errors displayed on failed mounts to be more understandable. +Fixed various incorrect or misleading smb to posix error code mappings. + +Version 0.77 +------------ +Fix display of NTFS DFS junctions to display as symlinks. +They are the network equivalent. Fix oops in +cifs_partialpagewrite caused by missing spinlock protection +of openfile linked list. Allow writebehind caching errors to +be returned to the application at file close. + +Version 0.76 +------------ +Clean up options displayed in /proc/mounts by show_options to +be more consistent with other filesystems. + +Version 0.75 +------------ +Fix delete of readonly file to Windows servers. Reflect +presence or absence of read only dos attribute in mode +bits for servers that do not support CIFS Unix extensions. +Fix shortened results on readdir of large directories to +servers supporting CIFS Unix extensions (caused by +incorrect resume key). + +Version 0.74 +------------ +Fix truncate bug (set file size) that could cause hangs e.g. running fsx + +Version 0.73 +------------ +unload nls if mount fails. + +Version 0.72 +------------ +Add resume key support to search (readdir) code to workaround +Windows bug. Add /proc/fs/cifs/LookupCacheEnable which +allows disabling caching of attribute information for +lookups. + +Version 0.71 +------------ +Add more oplock handling (distributed caching code). Remove +dead code. Remove excessive stack space utilization from +symlink routines. + +Version 0.70 +------------ +Fix oops in get dfs referral (triggered when null path sent in to +mount). Add support for overriding rsize at mount time. + +Version 0.69 +------------ +Fix buffer overrun in readdir which caused intermittent kernel oopses. +Fix writepage code to release kmap on write data. Allow "-ip=" new +mount option to be passed in on parameter distinct from the first part +(server name portion of) the UNC name. Allow override of the +tcp port of the target server via new mount option "-port=" + +Version 0.68 +------------ +Fix search handle leak on rewind. Fix setuid and gid so that they are +reflected in the local inode immediately. Cleanup of whitespace +to make 2.4 and 2.5 versions more consistent. + + +Version 0.67 +------------ +Fix signal sending so that captive thread (cifsd) exits on umount +(which was causing the warning in kmem_cache_free of the request buffers +at rmmod time). This had broken as a sideeffect of the recent global +kernel change to daemonize. Fix memory leak in readdir code which +showed up in "ls -R" (and applications that did search rewinding). + +Version 0.66 +------------ +Reconnect tids and fids after session reconnection (still do not +reconnect byte range locks though). Fix problem caching +lookup information for directory inodes, improving performance, +especially in deep directory trees. Fix various build warnings. + +Version 0.65 +------------ +Finish fixes to commit write for caching/readahead consistency. fsx +now works to Samba servers. Fix oops caused when readahead +was interrupted by a signal. + +Version 0.64 +------------ +Fix data corruption (in partial page after truncate) that caused fsx to +fail to Windows servers. Cleaned up some extraneous error logging in +common error paths. Add generic sendfile support. + +Version 0.63 +------------ +Fix memory leak in AllocMidQEntry. +Finish reconnection logic, so connection with server can be dropped +(or server rebooted) and the cifs client will reconnect. + +Version 0.62 +------------ +Fix temporary socket leak when bad userid or password specified +(or other SMBSessSetup failure). Increase maximum buffer size to slightly +over 16K to allow negotiation of up to Samba and Windows server default read +sizes. Add support for readpages + +Version 0.61 +------------ +Fix oops when username not passed in on mount. Extensive fixes and improvements +to error logging (strip redundant newlines, change debug macros to ensure newline +passed in and to be more consistent). Fix writepage wrong file handle problem, +a readonly file handle could be incorrectly used to attempt to write out +file updates through the page cache to multiply open files. This could cause +the iozone benchmark to fail on the fwrite test. Fix bug mounting two different +shares to the same Windows server when using different usernames +(doing this to Samba servers worked but Windows was rejecting it) - now it is +possible to use different userids when connecting to the same server from a +Linux client. Fix oops when treeDisconnect called during unmount on +previously freed socket. + +Version 0.60 +------------ +Fix oops in readpages caused by not setting address space operations in inode in +rare code path. + +Version 0.59 +------------ +Includes support for deleting of open files and renaming over existing files (per POSIX +requirement). Add readlink support for Windows junction points (directory symlinks). + +Version 0.58 +------------ +Changed read and write to go through pagecache. Added additional address space operations. +Memory mapped operations now working. + +Version 0.57 +------------ +Added writepage code for additional memory mapping support. Fixed leak in xids causing +the simultaneous operations counter (/proc/fs/cifs/SimultaneousOps) to increase on +every stat call. Additional formatting cleanup. + +Version 0.56 +------------ +Fix bigendian bug in order of time conversion. Merge 2.5 to 2.4 version. Formatting cleanup. + +Version 0.55 +------------ +Fixes from Zwane Mwaikambo for adding missing return code checking in a few places. +Also included a modified version of his fix to protect global list manipulation of +the smb session and tree connection and mid related global variables. + +Version 0.54 +------------ +Fix problem with captive thread hanging around at unmount time. Adjust to 2.5.42-pre +changes to superblock layout. Remove wasteful allocation of smb buffers (now the send +buffer is reused for responses). Add more oplock handling. Additional minor cleanup. + +Version 0.53 +------------ +More stylistic updates to better match kernel style. Add additional statistics +for filesystem which can be viewed via /proc/fs/cifs. Add more pieces of NTLMv2 +and CIFS Packet Signing enablement. + +Version 0.52 +------------ +Replace call to sleep_on with safer wait_on_event. +Make stylistic changes to better match kernel style recommendations. +Remove most typedef usage (except for the PDUs themselves). + +Version 0.51 +------------ +Update mount so the -unc mount option is no longer required (the ip address can be specified +in a UNC style device name. Implementation of readpage/writepage started. + +Version 0.50 +------------ +Fix intermittent problem with incorrect smb header checking on badly +fragmented tcp responses + +Version 0.49 +------------ +Fixes to setting of allocation size and file size. + +Version 0.48 +------------ +Various 2.5.38 fixes. Now works on 2.5.38 + +Version 0.47 +------------ +Prepare for 2.5 kernel merge. Remove ifdefs. + +Version 0.46 +------------ +Socket buffer management fixes. Fix dual free. + +Version 0.45 +------------ +Various big endian fixes for hardlinks and symlinks and also for dfs. + +Version 0.44 +------------ +Various big endian fixes for servers with Unix extensions such as Samba + +Version 0.43 +------------ +Various FindNext fixes for incorrect filenames on large directory searches on big endian +clients. basic posix file i/o tests now work on big endian machines, not just le + +Version 0.42 +------------ +SessionSetup and NegotiateProtocol now work from Big Endian machines. +Various Big Endian fixes found during testing on the Linux on 390. Various fixes for compatibility with older +versions of 2.4 kernel (now builds and works again on kernels at least as early as 2.4.7). + +Version 0.41 +------------ +Various minor fixes for Connectathon Posix "basic" file i/o test suite. Directory caching fixed so hardlinked +files now return the correct number of links on fstat as they are repeatedly linked and unlinked. + +Version 0.40 +------------ +Implemented "Raw" (i.e. not encapsulated in SPNEGO) NTLMSSP (i.e. the Security Provider Interface used to negotiate +session advanced session authentication). Raw NTLMSSP is preferred by Windows 2000 Professional and Windows XP. +Began implementing support for SPNEGO encapsulation of NTLMSSP based session authentication blobs +(which is the mechanism preferred by Windows 2000 server in the absence of Kerberos). + +Version 0.38 +------------ +Introduced optional mount helper utility mount.cifs and made coreq changes to cifs vfs to enable +it. Fixed a few bugs in the DFS code (e.g. bcc two bytes too short and incorrect uid in PDU). + +Version 0.37 +------------ +Rewrote much of connection and mount/unmount logic to handle bugs with +multiple uses to same share, multiple users to same server etc. + +Version 0.36 +------------ +Fixed major problem with dentry corruption (missing call to dput) + +Version 0.35 +------------ +Rewrite of readdir code to fix bug. Various fixes for bigendian machines. +Begin adding oplock support. Multiusermount and oplockEnabled flags added to /proc/fs/cifs +although corresponding function not fully implemented in the vfs yet + +Version 0.34 +------------ +Fixed dentry caching bug, misc. cleanup + +Version 0.33 +------------ +Fixed 2.5 support to handle build and configure changes as well as misc. 2.5 changes. Now can build +on current 2.5 beta version (2.5.24) of the Linux kernel as well as on 2.4 Linux kernels. +Support for STATUS codes (newer 32 bit NT error codes) added. DFS support begun to be added. + +Version 0.32 +------------ +Unix extensions (symlink, readlink, hardlink, chmod and some chgrp and chown) implemented +and tested against Samba 2.2.5 + + +Version 0.31 +------------ +1) Fixed lockrange to be correct (it was one byte too short) + +2) Fixed GETLK (i.e. the fcntl call to test a range of bytes in a file to see if locked) to correctly +show range as locked when there is a conflict with an existing lock. + +3) default file perms are now 2767 (indicating support for mandatory locks) instead of 777 for directories +in most cases. Eventually will offer optional ability to query server for the correct perms. + +3) Fixed eventual trap when mounting twice to different shares on the same server when the first succeeded +but the second one was invalid and failed (the second one was incorrectly disconnecting the tcp and smb +session) + +4) Fixed error logging of valid mount options + +5) Removed logging of password field. + +6) Moved negotiate, treeDisconnect and uloggoffX (only tConx and SessSetup remain in connect.c) to cifssmb.c +and cleaned them up and made them more consistent with other cifs functions. + +7) Server support for Unix extensions is now fully detected and FindFirst is implemented both ways +(with or without Unix extensions) but FindNext and QueryPathInfo with the Unix extensions are not completed, +nor is the symlink support using the Unix extensions + +8) Started adding the readlink and follow_link code + +Version 0.3 +----------- +Initial drop + diff --git a/Documentation/filesystems/cifs/README b/Documentation/filesystems/cifs/README new file mode 100644 index 00000000000..2d5622f60e1 --- /dev/null +++ b/Documentation/filesystems/cifs/README @@ -0,0 +1,753 @@ +The CIFS VFS support for Linux supports many advanced network filesystem +features such as hierarchical dfs like namespace, hardlinks, locking and more. +It was designed to comply with the SNIA CIFS Technical Reference (which +supersedes the 1992 X/Open SMB Standard) as well as to perform best practice +practical interoperability with Windows 2000, Windows XP, Samba and equivalent +servers. This code was developed in participation with the Protocol Freedom +Information Foundation. + +Please see + http://protocolfreedom.org/ and + http://samba.org/samba/PFIF/ +for more details. + + +For questions or bug reports please contact: + sfrench@samba.org (sfrench@us.ibm.com) + +Build instructions: +================== +For Linux 2.4: +1) Get the kernel source (e.g.from http://www.kernel.org) +and download the cifs vfs source (see the project page +at http://us1.samba.org/samba/Linux_CIFS_client.html) +and change directory into the top of the kernel directory +then patch the kernel (e.g. "patch -p1 < cifs_24.patch") +to add the cifs vfs to your kernel configure options if +it has not already been added (e.g. current SuSE and UL +users do not need to apply the cifs_24.patch since the cifs vfs is +already in the kernel configure menu) and then +mkdir linux/fs/cifs and then copy the current cifs vfs files from +the cifs download to your kernel build directory e.g. + + cp <cifs_download_dir>/fs/cifs/* to <kernel_download_dir>/fs/cifs + +2) make menuconfig (or make xconfig) +3) select cifs from within the network filesystem choices +4) save and exit +5) make dep +6) make modules (or "make" if CIFS VFS not to be built as a module) + +For Linux 2.6: +1) Download the kernel (e.g. from http://www.kernel.org) +and change directory into the top of the kernel directory tree +(e.g. /usr/src/linux-2.5.73) +2) make menuconfig (or make xconfig) +3) select cifs from within the network filesystem choices +4) save and exit +5) make + + +Installation instructions: +========================= +If you have built the CIFS vfs as module (successfully) simply +type "make modules_install" (or if you prefer, manually copy the file to +the modules directory e.g. /lib/modules/2.4.10-4GB/kernel/fs/cifs/cifs.o). + +If you have built the CIFS vfs into the kernel itself, follow the instructions +for your distribution on how to install a new kernel (usually you +would simply type "make install"). + +If you do not have the utility mount.cifs (in the Samba 3.0 source tree and on +the CIFS VFS web site) copy it to the same directory in which mount.smbfs and +similar files reside (usually /sbin). Although the helper software is not +required, mount.cifs is recommended. Eventually the Samba 3.0 utility program +"net" may also be helpful since it may someday provide easier mount syntax for +users who are used to Windows e.g. + net use <mount point> <UNC name or cifs URL> +Note that running the Winbind pam/nss module (logon service) on all of your +Linux clients is useful in mapping Uids and Gids consistently across the +domain to the proper network user. The mount.cifs mount helper can be +trivially built from Samba 3.0 or later source e.g. by executing: + + gcc samba/source/client/mount.cifs.c -o mount.cifs + +If cifs is built as a module, then the size and number of network buffers +and maximum number of simultaneous requests to one server can be configured. +Changing these from their defaults is not recommended. By executing modinfo + modinfo kernel/fs/cifs/cifs.ko +on kernel/fs/cifs/cifs.ko the list of configuration changes that can be made +at module initialization time (by running insmod cifs.ko) can be seen. + +Allowing User Mounts +==================== +To permit users to mount and unmount over directories they own is possible +with the cifs vfs. A way to enable such mounting is to mark the mount.cifs +utility as suid (e.g. "chmod +s /sbin/mount.cifs). To enable users to +umount shares they mount requires +1) mount.cifs version 1.4 or later +2) an entry for the share in /etc/fstab indicating that a user may +unmount it e.g. +//server/usersharename /mnt/username cifs user 0 0 + +Note that when the mount.cifs utility is run suid (allowing user mounts), +in order to reduce risks, the "nosuid" mount flag is passed in on mount to +disallow execution of an suid program mounted on the remote target. +When mount is executed as root, nosuid is not passed in by default, +and execution of suid programs on the remote target would be enabled +by default. This can be changed, as with nfs and other filesystems, +by simply specifying "nosuid" among the mount options. For user mounts +though to be able to pass the suid flag to mount requires rebuilding +mount.cifs with the following flag: + + gcc samba/source/client/mount.cifs.c -DCIFS_ALLOW_USR_SUID -o mount.cifs + +There is a corresponding manual page for cifs mounting in the Samba 3.0 and +later source tree in docs/manpages/mount.cifs.8 + +Allowing User Unmounts +====================== +To permit users to ummount directories that they have user mounted (see above), +the utility umount.cifs may be used. It may be invoked directly, or if +umount.cifs is placed in /sbin, umount can invoke the cifs umount helper +(at least for most versions of the umount utility) for umount of cifs +mounts, unless umount is invoked with -i (which will avoid invoking a umount +helper). As with mount.cifs, to enable user unmounts umount.cifs must be marked +as suid (e.g. "chmod +s /sbin/umount.cifs") or equivalent (some distributions +allow adding entries to a file to the /etc/permissions file to achieve the +equivalent suid effect). For this utility to succeed the target path +must be a cifs mount, and the uid of the current user must match the uid +of the user who mounted the resource. + +Also note that the customary way of allowing user mounts and unmounts is +(instead of using mount.cifs and unmount.cifs as suid) to add a line +to the file /etc/fstab for each //server/share you wish to mount, but +this can become unwieldy when potential mount targets include many +or unpredictable UNC names. + +Samba Considerations +==================== +To get the maximum benefit from the CIFS VFS, we recommend using a server that +supports the SNIA CIFS Unix Extensions standard (e.g. Samba 2.2.5 or later or +Samba 3.0) but the CIFS vfs works fine with a wide variety of CIFS servers. +Note that uid, gid and file permissions will display default values if you do +not have a server that supports the Unix extensions for CIFS (such as Samba +2.2.5 or later). To enable the Unix CIFS Extensions in the Samba server, add +the line: + + unix extensions = yes + +to your smb.conf file on the server. Note that the following smb.conf settings +are also useful (on the Samba server) when the majority of clients are Unix or +Linux: + + case sensitive = yes + delete readonly = yes + ea support = yes + +Note that server ea support is required for supporting xattrs from the Linux +cifs client, and that EA support is present in later versions of Samba (e.g. +3.0.6 and later (also EA support works in all versions of Windows, at least to +shares on NTFS filesystems). Extended Attribute (xattr) support is an optional +feature of most Linux filesystems which may require enabling via +make menuconfig. Client support for extended attributes (user xattr) can be +disabled on a per-mount basis by specifying "nouser_xattr" on mount. + +The CIFS client can get and set POSIX ACLs (getfacl, setfacl) to Samba servers +version 3.10 and later. Setting POSIX ACLs requires enabling both XATTR and +then POSIX support in the CIFS configuration options when building the cifs +module. POSIX ACL support can be disabled on a per mount basic by specifying +"noacl" on mount. + +Some administrators may want to change Samba's smb.conf "map archive" and +"create mask" parameters from the default. Unless the create mask is changed +newly created files can end up with an unnecessarily restrictive default mode, +which may not be what you want, although if the CIFS Unix extensions are +enabled on the server and client, subsequent setattr calls (e.g. chmod) can +fix the mode. Note that creating special devices (mknod) remotely +may require specifying a mkdev function to Samba if you are not using +Samba 3.0.6 or later. For more information on these see the manual pages +("man smb.conf") on the Samba server system. Note that the cifs vfs, +unlike the smbfs vfs, does not read the smb.conf on the client system +(the few optional settings are passed in on mount via -o parameters instead). +Note that Samba 2.2.7 or later includes a fix that allows the CIFS VFS to delete +open files (required for strict POSIX compliance). Windows Servers already +supported this feature. Samba server does not allow symlinks that refer to files +outside of the share, so in Samba versions prior to 3.0.6, most symlinks to +files with absolute paths (ie beginning with slash) such as: + ln -s /mnt/foo bar +would be forbidden. Samba 3.0.6 server or later includes the ability to create +such symlinks safely by converting unsafe symlinks (ie symlinks to server +files that are outside of the share) to a samba specific format on the server +that is ignored by local server applications and non-cifs clients and that will +not be traversed by the Samba server). This is opaque to the Linux client +application using the cifs vfs. Absolute symlinks will work to Samba 3.0.5 or +later, but only for remote clients using the CIFS Unix extensions, and will +be invisbile to Windows clients and typically will not affect local +applications running on the same server as Samba. + +Use instructions: +================ +Once the CIFS VFS support is built into the kernel or installed as a module +(cifs.o), you can use mount syntax like the following to access Samba or Windows +servers: + + mount -t cifs //9.53.216.11/e$ /mnt -o user=myname,pass=mypassword + +Before -o the option -v may be specified to make the mount.cifs +mount helper display the mount steps more verbosely. +After -o the following commonly used cifs vfs specific options +are supported: + + user=<username> + pass=<password> + domain=<domain name> + +Other cifs mount options are described below. Use of TCP names (in addition to +ip addresses) is available if the mount helper (mount.cifs) is installed. If +you do not trust the server to which are mounted, or if you do not have +cifs signing enabled (and the physical network is insecure), consider use +of the standard mount options "noexec" and "nosuid" to reduce the risk of +running an altered binary on your local system (downloaded from a hostile server +or altered by a hostile router). + +Although mounting using format corresponding to the CIFS URL specification is +not possible in mount.cifs yet, it is possible to use an alternate format +for the server and sharename (which is somewhat similar to NFS style mount +syntax) instead of the more widely used UNC format (i.e. \\server\share): + mount -t cifs tcp_name_of_server:share_name /mnt -o user=myname,pass=mypasswd + +When using the mount helper mount.cifs, passwords may be specified via alternate +mechanisms, instead of specifying it after -o using the normal "pass=" syntax +on the command line: +1) By including it in a credential file. Specify credentials=filename as one +of the mount options. Credential files contain two lines + username=someuser + password=your_password +2) By specifying the password in the PASSWD environment variable (similarly +the user name can be taken from the USER environment variable). +3) By specifying the password in a file by name via PASSWD_FILE +4) By specifying the password in a file by file descriptor via PASSWD_FD + +If no password is provided, mount.cifs will prompt for password entry + +Restrictions +============ +Servers must support either "pure-TCP" (port 445 TCP/IP CIFS connections) or RFC +1001/1002 support for "Netbios-Over-TCP/IP." This is not likely to be a +problem as most servers support this. + +Valid filenames differ between Windows and Linux. Windows typically restricts +filenames which contain certain reserved characters (e.g.the character : +which is used to delimit the beginning of a stream name by Windows), while +Linux allows a slightly wider set of valid characters in filenames. Windows +servers can remap such characters when an explicit mapping is specified in +the Server's registry. Samba starting with version 3.10 will allow such +filenames (ie those which contain valid Linux characters, which normally +would be forbidden for Windows/CIFS semantics) as long as the server is +configured for Unix Extensions (and the client has not disabled +/proc/fs/cifs/LinuxExtensionsEnabled). + + +CIFS VFS Mount Options +====================== +A partial list of the supported mount options follows: + user The user name to use when trying to establish + the CIFS session. + password The user password. If the mount helper is + installed, the user will be prompted for password + if not supplied. + ip The ip address of the target server + unc The target server Universal Network Name (export) to + mount. + domain Set the SMB/CIFS workgroup name prepended to the + username during CIFS session establishment + forceuid Set the default uid for inodes to the uid + passed in on mount. For mounts to servers + which do support the CIFS Unix extensions, such as a + properly configured Samba server, the server provides + the uid, gid and mode so this parameter should not be + specified unless the server and clients uid and gid + numbering differ. If the server and client are in the + same domain (e.g. running winbind or nss_ldap) and + the server supports the Unix Extensions then the uid + and gid can be retrieved from the server (and uid + and gid would not have to be specifed on the mount. + For servers which do not support the CIFS Unix + extensions, the default uid (and gid) returned on lookup + of existing files will be the uid (gid) of the person + who executed the mount (root, except when mount.cifs + is configured setuid for user mounts) unless the "uid=" + (gid) mount option is specified. Also note that permission + checks (authorization checks) on accesses to a file occur + at the server, but there are cases in which an administrator + may want to restrict at the client as well. For those + servers which do not report a uid/gid owner + (such as Windows), permissions can also be checked at the + client, and a crude form of client side permission checking + can be enabled by specifying file_mode and dir_mode on + the client. (default) + forcegid (similar to above but for the groupid instead of uid) (default) + noforceuid Fill in file owner information (uid) by requesting it from + the server if possible. With this option, the value given in + the uid= option (on mount) will only be used if the server + can not support returning uids on inodes. + noforcegid (similar to above but for the group owner, gid, instead of uid) + uid Set the default uid for inodes, and indicate to the + cifs kernel driver which local user mounted. If the server + supports the unix extensions the default uid is + not used to fill in the owner fields of inodes (files) + unless the "forceuid" parameter is specified. + gid Set the default gid for inodes (similar to above). + file_mode If CIFS Unix extensions are not supported by the server + this overrides the default mode for file inodes. + fsc Enable local disk caching using FS-Cache (off by default). This + option could be useful to improve performance on a slow link, + heavily loaded server and/or network where reading from the + disk is faster than reading from the server (over the network). + This could also impact scalability positively as the + number of calls to the server are reduced. However, local + caching is not suitable for all workloads for e.g. read-once + type workloads. So, you need to consider carefully your + workload/scenario before using this option. Currently, local + disk caching is functional for CIFS files opened as read-only. + dir_mode If CIFS Unix extensions are not supported by the server + this overrides the default mode for directory inodes. + port attempt to contact the server on this tcp port, before + trying the usual ports (port 445, then 139). + iocharset Codepage used to convert local path names to and from + Unicode. Unicode is used by default for network path + names if the server supports it. If iocharset is + not specified then the nls_default specified + during the local client kernel build will be used. + If server does not support Unicode, this parameter is + unused. + rsize default read size (usually 16K). The client currently + can not use rsize larger than CIFSMaxBufSize. CIFSMaxBufSize + defaults to 16K and may be changed (from 8K to the maximum + kmalloc size allowed by your kernel) at module install time + for cifs.ko. Setting CIFSMaxBufSize to a very large value + will cause cifs to use more memory and may reduce performance + in some cases. To use rsize greater than 127K (the original + cifs protocol maximum) also requires that the server support + a new Unix Capability flag (for very large read) which some + newer servers (e.g. Samba 3.0.26 or later) do. rsize can be + set from a minimum of 2048 to a maximum of 130048 (127K or + CIFSMaxBufSize, whichever is smaller) + wsize default write size (default 57344) + maximum wsize currently allowed by CIFS is 57344 (fourteen + 4096 byte pages) + actimeo=n attribute cache timeout in seconds (default 1 second). + After this timeout, the cifs client requests fresh attribute + information from the server. This option allows to tune the + attribute cache timeout to suit the workload needs. Shorter + timeouts mean better the cache coherency, but increased number + of calls to the server. Longer timeouts mean reduced number + of calls to the server at the expense of less stricter cache + coherency checks (i.e. incorrect attribute cache for a short + period of time). + rw mount the network share read-write (note that the + server may still consider the share read-only) + ro mount network share read-only + version used to distinguish different versions of the + mount helper utility (not typically needed) + sep if first mount option (after the -o), overrides + the comma as the separator between the mount + parms. e.g. + -o user=myname,password=mypassword,domain=mydom + could be passed instead with period as the separator by + -o sep=.user=myname.password=mypassword.domain=mydom + this might be useful when comma is contained within username + or password or domain. This option is less important + when the cifs mount helper cifs.mount (version 1.1 or later) + is used. + nosuid Do not allow remote executables with the suid bit + program to be executed. This is only meaningful for mounts + to servers such as Samba which support the CIFS Unix Extensions. + If you do not trust the servers in your network (your mount + targets) it is recommended that you specify this option for + greater security. + exec Permit execution of binaries on the mount. + noexec Do not permit execution of binaries on the mount. + dev Recognize block devices on the remote mount. + nodev Do not recognize devices on the remote mount. + suid Allow remote files on this mountpoint with suid enabled to + be executed (default for mounts when executed as root, + nosuid is default for user mounts). + credentials Although ignored by the cifs kernel component, it is used by + the mount helper, mount.cifs. When mount.cifs is installed it + opens and reads the credential file specified in order + to obtain the userid and password arguments which are passed to + the cifs vfs. + guest Although ignored by the kernel component, the mount.cifs + mount helper will not prompt the user for a password + if guest is specified on the mount options. If no + password is specified a null password will be used. + perm Client does permission checks (vfs_permission check of uid + and gid of the file against the mode and desired operation), + Note that this is in addition to the normal ACL check on the + target machine done by the server software. + Client permission checking is enabled by default. + noperm Client does not do permission checks. This can expose + files on this mount to access by other users on the local + client system. It is typically only needed when the server + supports the CIFS Unix Extensions but the UIDs/GIDs on the + client and server system do not match closely enough to allow + access by the user doing the mount, but it may be useful with + non CIFS Unix Extension mounts for cases in which the default + mode is specified on the mount but is not to be enforced on the + client (e.g. perhaps when MultiUserMount is enabled) + Note that this does not affect the normal ACL check on the + target machine done by the server software (of the server + ACL against the user name provided at mount time). + serverino Use server's inode numbers instead of generating automatically + incrementing inode numbers on the client. Although this will + make it easier to spot hardlinked files (as they will have + the same inode numbers) and inode numbers may be persistent, + note that the server does not guarantee that the inode numbers + are unique if multiple server side mounts are exported under a + single share (since inode numbers on the servers might not + be unique if multiple filesystems are mounted under the same + shared higher level directory). Note that some older + (e.g. pre-Windows 2000) do not support returning UniqueIDs + or the CIFS Unix Extensions equivalent and for those + this mount option will have no effect. Exporting cifs mounts + under nfsd requires this mount option on the cifs mount. + This is now the default if server supports the + required network operation. + noserverino Client generates inode numbers (rather than using the actual one + from the server). These inode numbers will vary after + unmount or reboot which can confuse some applications, + but not all server filesystems support unique inode + numbers. + setuids If the CIFS Unix extensions are negotiated with the server + the client will attempt to set the effective uid and gid of + the local process on newly created files, directories, and + devices (create, mkdir, mknod). If the CIFS Unix Extensions + are not negotiated, for newly created files and directories + instead of using the default uid and gid specified on + the mount, cache the new file's uid and gid locally which means + that the uid for the file can change when the inode is + reloaded (or the user remounts the share). + nosetuids The client will not attempt to set the uid and gid on + on newly created files, directories, and devices (create, + mkdir, mknod) which will result in the server setting the + uid and gid to the default (usually the server uid of the + user who mounted the share). Letting the server (rather than + the client) set the uid and gid is the default. If the CIFS + Unix Extensions are not negotiated then the uid and gid for + new files will appear to be the uid (gid) of the mounter or the + uid (gid) parameter specified on the mount. + netbiosname When mounting to servers via port 139, specifies the RFC1001 + source name to use to represent the client netbios machine + name when doing the RFC1001 netbios session initialize. + direct Do not do inode data caching on files opened on this mount. + This precludes mmapping files on this mount. In some cases + with fast networks and little or no caching benefits on the + client (e.g. when the application is doing large sequential + reads bigger than page size without rereading the same data) + this can provide better performance than the default + behavior which caches reads (readahead) and writes + (writebehind) through the local Linux client pagecache + if oplock (caching token) is granted and held. Note that + direct allows write operations larger than page size + to be sent to the server. + strictcache Use for switching on strict cache mode. In this mode the + client read from the cache all the time it has Oplock Level II, + otherwise - read from the server. All written data are stored + in the cache, but if the client doesn't have Exclusive Oplock, + it writes the data to the server. + rwpidforward Forward pid of a process who opened a file to any read or write + operation on that file. This prevent applications like WINE + from failing on read and write if we use mandatory brlock style. + acl Allow setfacl and getfacl to manage posix ACLs if server + supports them. (default) + noacl Do not allow setfacl and getfacl calls on this mount + user_xattr Allow getting and setting user xattrs (those attributes whose + name begins with "user." or "os2.") as OS/2 EAs (extended + attributes) to the server. This allows support of the + setfattr and getfattr utilities. (default) + nouser_xattr Do not allow getfattr/setfattr to get/set/list xattrs + mapchars Translate six of the seven reserved characters (not backslash) + *?<>|: + to the remap range (above 0xF000), which also + allows the CIFS client to recognize files created with + such characters by Windows's POSIX emulation. This can + also be useful when mounting to most versions of Samba + (which also forbids creating and opening files + whose names contain any of these seven characters). + This has no effect if the server does not support + Unicode on the wire. + nomapchars Do not translate any of these seven characters (default). + nocase Request case insensitive path name matching (case + sensitive is the default if the server supports it). + (mount option "ignorecase" is identical to "nocase") + posixpaths If CIFS Unix extensions are supported, attempt to + negotiate posix path name support which allows certain + characters forbidden in typical CIFS filenames, without + requiring remapping. (default) + noposixpaths If CIFS Unix extensions are supported, do not request + posix path name support (this may cause servers to + reject creatingfile with certain reserved characters). + nounix Disable the CIFS Unix Extensions for this mount (tree + connection). This is rarely needed, but it may be useful + in order to turn off multiple settings all at once (ie + posix acls, posix locks, posix paths, symlink support + and retrieving uids/gids/mode from the server) or to + work around a bug in server which implement the Unix + Extensions. + nobrl Do not send byte range lock requests to the server. + This is necessary for certain applications that break + with cifs style mandatory byte range locks (and most + cifs servers do not yet support requesting advisory + byte range locks). + forcemandatorylock Even if the server supports posix (advisory) byte range + locking, send only mandatory lock requests. For some + (presumably rare) applications, originally coded for + DOS/Windows, which require Windows style mandatory byte range + locking, they may be able to take advantage of this option, + forcing the cifs client to only send mandatory locks + even if the cifs server would support posix advisory locks. + "forcemand" is accepted as a shorter form of this mount + option. + nostrictsync If this mount option is set, when an application does an + fsync call then the cifs client does not send an SMB Flush + to the server (to force the server to write all dirty data + for this file immediately to disk), although cifs still sends + all dirty (cached) file data to the server and waits for the + server to respond to the write. Since SMB Flush can be + very slow, and some servers may be reliable enough (to risk + delaying slightly flushing the data to disk on the server), + turning on this option may be useful to improve performance for + applications that fsync too much, at a small risk of server + crash. If this mount option is not set, by default cifs will + send an SMB flush request (and wait for a response) on every + fsync call. + nodfs Disable DFS (global name space support) even if the + server claims to support it. This can help work around + a problem with parsing of DFS paths with Samba server + versions 3.0.24 and 3.0.25. + remount remount the share (often used to change from ro to rw mounts + or vice versa) + cifsacl Report mode bits (e.g. on stat) based on the Windows ACL for + the file. (EXPERIMENTAL) + servern Specify the server 's netbios name (RFC1001 name) to use + when attempting to setup a session to the server. + This is needed for mounting to some older servers (such + as OS/2 or Windows 98 and Windows ME) since they do not + support a default server name. A server name can be up + to 15 characters long and is usually uppercased. + sfu When the CIFS Unix Extensions are not negotiated, attempt to + create device files and fifos in a format compatible with + Services for Unix (SFU). In addition retrieve bits 10-12 + of the mode via the SETFILEBITS extended attribute (as + SFU does). In the future the bottom 9 bits of the + mode also will be emulated using queries of the security + descriptor (ACL). + mfsymlinks Enable support for Minshall+French symlinks + (see http://wiki.samba.org/index.php/UNIX_Extensions#Minshall.2BFrench_symlinks) + This option is ignored when specified together with the + 'sfu' option. Minshall+French symlinks are used even if + the server supports the CIFS Unix Extensions. + sign Must use packet signing (helps avoid unwanted data modification + by intermediate systems in the route). Note that signing + does not work with lanman or plaintext authentication. + seal Must seal (encrypt) all data on this mounted share before + sending on the network. Requires support for Unix Extensions. + Note that this differs from the sign mount option in that it + causes encryption of data sent over this mounted share but other + shares mounted to the same server are unaffected. + locallease This option is rarely needed. Fcntl F_SETLEASE is + used by some applications such as Samba and NFSv4 server to + check to see whether a file is cacheable. CIFS has no way + to explicitly request a lease, but can check whether a file + is cacheable (oplocked). Unfortunately, even if a file + is not oplocked, it could still be cacheable (ie cifs client + could grant fcntl leases if no other local processes are using + the file) for cases for example such as when the server does not + support oplocks and the user is sure that the only updates to + the file will be from this client. Specifying this mount option + will allow the cifs client to check for leases (only) locally + for files which are not oplocked instead of denying leases + in that case. (EXPERIMENTAL) + sec Security mode. Allowed values are: + none attempt to connection as a null user (no name) + krb5 Use Kerberos version 5 authentication + krb5i Use Kerberos authentication and packet signing + ntlm Use NTLM password hashing (default) + ntlmi Use NTLM password hashing with signing (if + /proc/fs/cifs/PacketSigningEnabled on or if + server requires signing also can be the default) + ntlmv2 Use NTLMv2 password hashing + ntlmv2i Use NTLMv2 password hashing with packet signing + lanman (if configured in kernel config) use older + lanman hash +hard Retry file operations if server is not responding +soft Limit retries to unresponsive servers (usually only + one retry) before returning an error. (default) + +The mount.cifs mount helper also accepts a few mount options before -o +including: + + -S take password from stdin (equivalent to setting the environment + variable "PASSWD_FD=0" + -V print mount.cifs version + -? display simple usage information + +With most 2.6 kernel versions of modutils, the version of the cifs kernel +module can be displayed via modinfo. + +Misc /proc/fs/cifs Flags and Debug Info +======================================= +Informational pseudo-files: +DebugData Displays information about active CIFS sessions and + shares, features enabled as well as the cifs.ko + version. +Stats Lists summary resource usage information as well as per + share statistics, if CONFIG_CIFS_STATS in enabled + in the kernel configuration. + +Configuration pseudo-files: +PacketSigningEnabled If set to one, cifs packet signing is enabled + and will be used if the server requires + it. If set to two, cifs packet signing is + required even if the server considers packet + signing optional. (default 1) +SecurityFlags Flags which control security negotiation and + also packet signing. Authentication (may/must) + flags (e.g. for NTLM and/or NTLMv2) may be combined with + the signing flags. Specifying two different password + hashing mechanisms (as "must use") on the other hand + does not make much sense. Default flags are + 0x07007 + (NTLM, NTLMv2 and packet signing allowed). The maximum + allowable flags if you want to allow mounts to servers + using weaker password hashes is 0x37037 (lanman, + plaintext, ntlm, ntlmv2, signing allowed). Some + SecurityFlags require the corresponding menuconfig + options to be enabled (lanman and plaintext require + CONFIG_CIFS_WEAK_PW_HASH for example). Enabling + plaintext authentication currently requires also + enabling lanman authentication in the security flags + because the cifs module only supports sending + laintext passwords using the older lanman dialect + form of the session setup SMB. (e.g. for authentication + using plain text passwords, set the SecurityFlags + to 0x30030): + + may use packet signing 0x00001 + must use packet signing 0x01001 + may use NTLM (most common password hash) 0x00002 + must use NTLM 0x02002 + may use NTLMv2 0x00004 + must use NTLMv2 0x04004 + may use Kerberos security 0x00008 + must use Kerberos 0x08008 + may use lanman (weak) password hash 0x00010 + must use lanman password hash 0x10010 + may use plaintext passwords 0x00020 + must use plaintext passwords 0x20020 + (reserved for future packet encryption) 0x00040 + +cifsFYI If set to non-zero value, additional debug information + will be logged to the system error log. This field + contains three flags controlling different classes of + debugging entries. The maximum value it can be set + to is 7 which enables all debugging points (default 0). + Some debugging statements are not compiled into the + cifs kernel unless CONFIG_CIFS_DEBUG2 is enabled in the + kernel configuration. cifsFYI may be set to one or + nore of the following flags (7 sets them all): + + log cifs informational messages 0x01 + log return codes from cifs entry points 0x02 + log slow responses (ie which take longer than 1 second) + CONFIG_CIFS_STATS2 must be enabled in .config 0x04 + + +traceSMB If set to one, debug information is logged to the + system error log with the start of smb requests + and responses (default 0) +LookupCacheEnable If set to one, inode information is kept cached + for one second improving performance of lookups + (default 1) +OplockEnabled If set to one, safe distributed caching enabled. + (default 1) +LinuxExtensionsEnabled If set to one then the client will attempt to + use the CIFS "UNIX" extensions which are optional + protocol enhancements that allow CIFS servers + to return accurate UID/GID information as well + as support symbolic links. If you use servers + such as Samba that support the CIFS Unix + extensions but do not want to use symbolic link + support and want to map the uid and gid fields + to values supplied at mount (rather than the + actual values, then set this to zero. (default 1) + +These experimental features and tracing can be enabled by changing flags in +/proc/fs/cifs (after the cifs module has been installed or built into the +kernel, e.g. insmod cifs). To enable a feature set it to 1 e.g. to enable +tracing to the kernel message log type: + + echo 7 > /proc/fs/cifs/cifsFYI + +cifsFYI functions as a bit mask. Setting it to 1 enables additional kernel +logging of various informational messages. 2 enables logging of non-zero +SMB return codes while 4 enables logging of requests that take longer +than one second to complete (except for byte range lock requests). +Setting it to 4 requires defining CONFIG_CIFS_STATS2 manually in the +source code (typically by setting it in the beginning of cifsglob.h), +and setting it to seven enables all three. Finally, tracing +the start of smb requests and responses can be enabled via: + + echo 1 > /proc/fs/cifs/traceSMB + +Per share (per client mount) statistics are available in /proc/fs/cifs/Stats +if the kernel was configured with cifs statistics enabled. The statistics +represent the number of successful (ie non-zero return code from the server) +SMB responses to some of the more common commands (open, delete, mkdir etc.). +Also recorded is the total bytes read and bytes written to the server for +that share. Note that due to client caching effects this can be less than the +number of bytes read and written by the application running on the client. +The statistics for the number of total SMBs and oplock breaks are different in +that they represent all for that share, not just those for which the server +returned success. + +Also note that "cat /proc/fs/cifs/DebugData" will display information about +the active sessions and the shares that are mounted. + +Enabling Kerberos (extended security) works but requires version 1.2 or later +of the helper program cifs.upcall to be present and to be configured in the +/etc/request-key.conf file. The cifs.upcall helper program is from the Samba +project(http://www.samba.org). NTLM and NTLMv2 and LANMAN support do not +require this helper. Note that NTLMv2 security (which does not require the +cifs.upcall helper program), instead of using Kerberos, is sufficient for +some use cases. + +DFS support allows transparent redirection to shares in an MS-DFS name space. +In addition, DFS support for target shares which are specified as UNC +names which begin with host names (rather than IP addresses) requires +a user space helper (such as cifs.upcall) to be present in order to +translate host names to ip address, and the user space helper must also +be configured in the file /etc/request-key.conf. Samba, Windows servers and +many NAS appliances support DFS as a way of constructing a global name +space to ease network configuration and improve reliability. + +To use cifs Kerberos and DFS support, the Linux keyutils package should be +installed and something like the following lines should be added to the +/etc/request-key.conf file: + +create cifs.spnego * * /usr/local/sbin/cifs.upcall %k +create dns_resolver * * /usr/local/sbin/cifs.upcall %k + +CIFS kernel module parameters +============================= +These module parameters can be specified or modified either during the time of +module loading or during the runtime by using the interface + /proc/module/cifs/parameters/<param> + +i.e. echo "value" > /sys/module/cifs/parameters/<param> + +1. enable_oplocks - Enable or disable oplocks. Oplocks are enabled by default. + [Y/y/1]. To disable use any of [N/n/0]. + diff --git a/Documentation/filesystems/cifs/TODO b/Documentation/filesystems/cifs/TODO new file mode 100644 index 00000000000..355abcdcda9 --- /dev/null +++ b/Documentation/filesystems/cifs/TODO @@ -0,0 +1,129 @@ +Version 1.53 May 20, 2008 + +A Partial List of Missing Features +================================== + +Contributions are welcome. There are plenty of opportunities +for visible, important contributions to this module. Here +is a partial list of the known problems and missing features: + +a) Support for SecurityDescriptors(Windows/CIFS ACLs) for chmod/chgrp/chown +so that these operations can be supported to Windows servers + +b) Mapping POSIX ACLs (and eventually NFSv4 ACLs) to CIFS +SecurityDescriptors + +c) Better pam/winbind integration (e.g. to handle uid mapping +better) + +d) Cleanup now unneeded SessSetup code in +fs/cifs/connect.c and add back in NTLMSSP code if any servers +need it + +e) fix NTLMv2 signing when two mounts with different users to same +server. + +f) Directory entry caching relies on a 1 second timer, rather than +using FindNotify or equivalent. - (started) + +g) quota support (needs minor kernel change since quota calls +to make it to network filesystems or deviceless filesystems) + +h) investigate sync behavior (including syncpage) and check +for proper behavior of intr/nointr + +i) improve support for very old servers (OS/2 and Win9x for example) +Including support for changing the time remotely (utimes command). + +j) hook lower into the sockets api (as NFS/SunRPC does) to avoid the +extra copy in/out of the socket buffers in some cases. + +k) Better optimize open (and pathbased setfilesize) to reduce the +oplock breaks coming from windows srv. Piggyback identical file +opens on top of each other by incrementing reference count rather +than resending (helps reduce server resource utilization and avoid +spurious oplock breaks). + +l) Improve performance of readpages by sending more than one read +at a time when 8 pages or more are requested. In conjuntion +add support for async_cifs_readpages. + +m) Add support for storing symlink info to Windows servers +in the Extended Attribute format their SFU clients would recognize. + +n) Finish fcntl D_NOTIFY support so kde and gnome file list windows +will autorefresh (partially complete by Asser). Needs minor kernel +vfs change to support removing D_NOTIFY on a file. + +o) Add GUI tool to configure /proc/fs/cifs settings and for display of +the CIFS statistics (started) + +p) implement support for security and trusted categories of xattrs +(requires minor protocol extension) to enable better support for SELINUX + +q) Implement O_DIRECT flag on open (already supported on mount) + +r) Create UID mapping facility so server UIDs can be mapped on a per +mount or a per server basis to client UIDs or nobody if no mapping +exists. This is helpful when Unix extensions are negotiated to +allow better permission checking when UIDs differ on the server +and client. Add new protocol request to the CIFS protocol +standard for asking the server for the corresponding name of a +particular uid. + +s) Add support for CIFS Unix and also the newer POSIX extensions to the +server side for Samba 4. + +t) In support for OS/2 (LANMAN 1.2 and LANMAN2.1 based SMB servers) +need to add ability to set time to server (utimes command) + +u) DOS attrs - returned as pseudo-xattr in Samba format (check VFAT and NTFS for this too) + +v) mount check for unmatched uids + +w) Add support for new vfs entry point for fallocate + +x) Fix Samba 3 server to handle Linux kernel aio so dbench with lots of +processes can proceed better in parallel (on the server) + +y) Fix Samba 3 to handle reads/writes over 127K (and remove the cifs mount +restriction of wsize max being 127K) + +KNOWN BUGS (updated April 24, 2007) +==================================== +See http://bugzilla.samba.org - search on product "CifsVFS" for +current bug list. + +1) existing symbolic links (Windows reparse points) are recognized but +can not be created remotely. They are implemented for Samba and those that +support the CIFS Unix extensions, although earlier versions of Samba +overly restrict the pathnames. +2) follow_link and readdir code does not follow dfs junctions +but recognizes them +3) create of new files to FAT partitions on Windows servers can +succeed but still return access denied (appears to be Windows +server not cifs client problem) and has not been reproduced recently. +NTFS partitions do not have this problem. +4) Unix/POSIX capabilities are reset after reconnection, and affect +a few fields in the tree connection but we do do not know which +superblocks to apply these changes to. We should probably walk +the list of superblocks to set these. Also need to check the +flags on the second mount to the same share, and see if we +can do the same trick that NFS does to remount duplicate shares. + +Misc testing to do +================== +1) check out max path names and max path name components against various server +types. Try nested symlinks (8 deep). Return max path name in stat -f information + +2) Modify file portion of ltp so it can run against a mounted network +share and run it against cifs vfs in automated fashion. + +3) Additional performance testing and optimization using iozone and similar - +there are some easy changes that can be done to parallelize sequential writes, +and when signing is disabled to request larger read sizes (larger than +negotiated size) and send larger write sizes to modern servers. + +4) More exhaustively test against less common servers. More testing +against Windows 9x, Windows ME servers. + diff --git a/Documentation/filesystems/cifs/cifs.txt b/Documentation/filesystems/cifs/cifs.txt new file mode 100644 index 00000000000..2fac91ac96c --- /dev/null +++ b/Documentation/filesystems/cifs/cifs.txt @@ -0,0 +1,31 @@ + This is the client VFS module for the Common Internet File System + (CIFS) protocol which is the successor to the Server Message Block + (SMB) protocol, the native file sharing mechanism for most early + PC operating systems. New and improved versions of CIFS are now + called SMB2 and SMB3. These dialects are also supported by the + CIFS VFS module. CIFS is fully supported by network + file servers such as Windows 2000, 2003, 2008 and 2012 + as well by Samba (which provides excellent CIFS + server support for Linux and many other operating systems), so + this network filesystem client can mount to a wide variety of + servers. + + The intent of this module is to provide the most advanced network + file system function for CIFS compliant servers, including better + POSIX compliance, secure per-user session establishment, high + performance safe distributed caching (oplock), optional packet + signing, large files, Unicode support and other internationalization + improvements. Since both Samba server and this filesystem client support + the CIFS Unix extensions, the combination can provide a reasonable + alternative to NFSv4 for fileserving in some Linux to Linux environments, + not just in Linux to Windows environments. + + This filesystem has an mount utility (mount.cifs) that can be obtained from + + https://ftp.samba.org/pub/linux-cifs/cifs-utils/ + + It must be installed in the directory with the other mount helpers. + + For more information on the module see the project wiki page at + + https://wiki.samba.org/index.php/LinuxCIFS_utils diff --git a/Documentation/filesystems/cifs/winucase_convert.pl b/Documentation/filesystems/cifs/winucase_convert.pl new file mode 100755 index 00000000000..322a9c833f2 --- /dev/null +++ b/Documentation/filesystems/cifs/winucase_convert.pl @@ -0,0 +1,62 @@ +#!/usr/bin/perl -w +# +# winucase_convert.pl -- convert "Windows 8 Upper Case Mapping Table.txt" to +# a two-level set of C arrays. +# +# Copyright 2013: Jeff Layton <jlayton@redhat.com> +# +# This program is free software: you can redistribute it and/or modify +# it under the terms of the GNU General Public License as published by +# the Free Software Foundation, either version 3 of the License, or +# (at your option) any later version. +# +# This program is distributed in the hope that it will be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +# GNU General Public License for more details. +# +# You should have received a copy of the GNU General Public License +# along with this program. If not, see <http://www.gnu.org/licenses/>. +# + +while(<>) { + next if (!/^0x(..)(..)\t0x(....)\t/); + $firstchar = hex($1); + $secondchar = hex($2); + $uppercase = hex($3); + + $top[$firstchar][$secondchar] = $uppercase; +} + +for ($i = 0; $i < 256; $i++) { + next if (!$top[$i]); + + printf("static const wchar_t t2_%2.2x[256] = {", $i); + for ($j = 0; $j < 256; $j++) { + if (($j % 8) == 0) { + print "\n\t"; + } else { + print " "; + } + printf("0x%4.4x,", $top[$i][$j] ? $top[$i][$j] : 0); + } + print "\n};\n\n"; +} + +printf("static const wchar_t *const toplevel[256] = {", $i); +for ($i = 0; $i < 256; $i++) { + if (($i % 8) == 0) { + print "\n\t"; + } elsif ($top[$i]) { + print " "; + } else { + print " "; + } + + if ($top[$i]) { + printf("t2_%2.2x,", $i); + } else { + print "NULL,"; + } +} +print "\n};\n\n"; diff --git a/Documentation/filesystems/configfs/Makefile b/Documentation/filesystems/configfs/Makefile new file mode 100644 index 00000000000..be7ec5e67db --- /dev/null +++ b/Documentation/filesystems/configfs/Makefile @@ -0,0 +1,3 @@ +ifneq ($(CONFIG_CONFIGFS_FS),) +obj-m += configfs_example_explicit.o configfs_example_macros.o +endif diff --git a/Documentation/filesystems/configfs/configfs.txt b/Documentation/filesystems/configfs/configfs.txt index 44c97e6accb..b40fec9d3f5 100644 --- a/Documentation/filesystems/configfs/configfs.txt +++ b/Documentation/filesystems/configfs/configfs.txt @@ -192,7 +192,7 @@ attribute value uses the store_attribute() method. struct configfs_attribute { char *ca_name; struct module *ca_owner; - mode_t ca_mode; + umode_t ca_mode; }; When a config_item wants an attribute to appear as a file in the item's @@ -311,9 +311,20 @@ the subsystem must be ready for it. [An Example] The best example of these basic concepts is the simple_children -subsystem/group and the simple_child item in configfs_example.c It -shows a trivial object displaying and storing an attribute, and a simple -group creating and destroying these children. +subsystem/group and the simple_child item in configfs_example_explicit.c +and configfs_example_macros.c. It shows a trivial object displaying and +storing an attribute, and a simple group creating and destroying these +children. + +The only difference between configfs_example_explicit.c and +configfs_example_macros.c is how the attributes of the childless item +are defined. The childless item has extended attributes, each with +their own show()/store() operation. This follows a convention commonly +used in sysfs. configfs_example_explicit.c creates these attributes +by explicitly defining the structures involved. Conversely +configfs_example_macros.c uses some convenience macros from configfs.h +to define the attributes. These macros are similar to their sysfs +counterparts. [Hierarchy Navigation and the Subsystem Mutex] @@ -398,7 +409,7 @@ As a consequence of this, default_groups cannot be removed directly via rmdir(2). They also are not considered when rmdir(2) on the parent group is checking for children. -[Dependant Subsystems] +[Dependent Subsystems] Sometimes other drivers depend on particular configfs items. For example, ocfs2 mounts depend on a heartbeat region item. If that diff --git a/Documentation/filesystems/configfs/configfs_example.c b/Documentation/filesystems/configfs/configfs_example_explicit.c index 25151fd5c2c..1420233dfa5 100644 --- a/Documentation/filesystems/configfs/configfs_example.c +++ b/Documentation/filesystems/configfs/configfs_example_explicit.c @@ -1,8 +1,10 @@ /* * vim: noexpandtab ts=8 sts=0 sw=8: * - * configfs_example.c - This file is a demonstration module containing - * a number of configfs subsystems. + * configfs_example_explicit.c - This file is a demonstration module + * containing a number of configfs subsystems. It explicitly defines + * each structure without using the helper macros defined in + * configfs.h. * * This program is free software; you can redistribute it and/or * modify it under the terms of the GNU General Public @@ -87,7 +89,7 @@ static ssize_t childless_storeme_write(struct childless *childless, char *p = (char *) page; tmp = simple_strtoul(p, &p, 10); - if (!p || (*p && (*p != '\n'))) + if ((*p != '\0') && (*p != '\n')) return -EINVAL; if (tmp > INT_MAX) @@ -279,8 +281,7 @@ static struct config_item *simple_children_make_item(struct config_group *group, simple_child = kzalloc(sizeof(struct simple_child), GFP_KERNEL); if (!simple_child) - return NULL; - + return ERR_PTR(-ENOMEM); config_item_init_type_name(&simple_child->item, name, &simple_child_type); @@ -302,8 +303,8 @@ static struct configfs_attribute *simple_children_attrs[] = { }; static ssize_t simple_children_attr_show(struct config_item *item, - struct configfs_attribute *attr, - char *page) + struct configfs_attribute *attr, + char *page) { return sprintf(page, "[02-simple-children]\n" @@ -318,7 +319,7 @@ static void simple_children_release(struct config_item *item) } static struct configfs_item_operations simple_children_item_ops = { - .release = simple_children_release, + .release = simple_children_release, .show_attribute = simple_children_attr_show, }; @@ -366,8 +367,7 @@ static struct config_group *group_children_make_group(struct config_group *group simple_children = kzalloc(sizeof(struct simple_children), GFP_KERNEL); if (!simple_children) - return NULL; - + return ERR_PTR(-ENOMEM); config_group_init_type_name(&simple_children->group, name, &simple_children_type); @@ -387,8 +387,8 @@ static struct configfs_attribute *group_children_attrs[] = { }; static ssize_t group_children_attr_show(struct config_item *item, - struct configfs_attribute *attr, - char *page) + struct configfs_attribute *attr, + char *page) { return sprintf(page, "[03-group-children]\n" @@ -464,9 +464,8 @@ static int __init configfs_example_init(void) return 0; out_unregister: - for (; i >= 0; i--) { + for (i--; i >= 0; i--) configfs_unregister_subsystem(example_subsys[i]); - } return ret; } @@ -475,9 +474,8 @@ static void __exit configfs_example_exit(void) { int i; - for (i = 0; example_subsys[i]; i++) { + for (i = 0; example_subsys[i]; i++) configfs_unregister_subsystem(example_subsys[i]); - } } module_init(configfs_example_init); diff --git a/Documentation/filesystems/configfs/configfs_example_macros.c b/Documentation/filesystems/configfs/configfs_example_macros.c new file mode 100644 index 00000000000..327dfbc640a --- /dev/null +++ b/Documentation/filesystems/configfs/configfs_example_macros.c @@ -0,0 +1,446 @@ +/* + * vim: noexpandtab ts=8 sts=0 sw=8: + * + * configfs_example_macros.c - This file is a demonstration module + * containing a number of configfs subsystems. It uses the helper + * macros defined by configfs.h + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public + * License as published by the Free Software Foundation; either + * version 2 of the License, or (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * General Public License for more details. + * + * You should have received a copy of the GNU General Public + * License along with this program; if not, write to the + * Free Software Foundation, Inc., 59 Temple Place - Suite 330, + * Boston, MA 021110-1307, USA. + * + * Based on sysfs: + * sysfs is Copyright (C) 2001, 2002, 2003 Patrick Mochel + * + * configfs Copyright (C) 2005 Oracle. All rights reserved. + */ + +#include <linux/init.h> +#include <linux/module.h> +#include <linux/slab.h> + +#include <linux/configfs.h> + + + +/* + * 01-childless + * + * This first example is a childless subsystem. It cannot create + * any config_items. It just has attributes. + * + * Note that we are enclosing the configfs_subsystem inside a container. + * This is not necessary if a subsystem has no attributes directly + * on the subsystem. See the next example, 02-simple-children, for + * such a subsystem. + */ + +struct childless { + struct configfs_subsystem subsys; + int showme; + int storeme; +}; + +static inline struct childless *to_childless(struct config_item *item) +{ + return item ? container_of(to_configfs_subsystem(to_config_group(item)), struct childless, subsys) : NULL; +} + +CONFIGFS_ATTR_STRUCT(childless); +#define CHILDLESS_ATTR(_name, _mode, _show, _store) \ +struct childless_attribute childless_attr_##_name = __CONFIGFS_ATTR(_name, _mode, _show, _store) +#define CHILDLESS_ATTR_RO(_name, _show) \ +struct childless_attribute childless_attr_##_name = __CONFIGFS_ATTR_RO(_name, _show); + +static ssize_t childless_showme_read(struct childless *childless, + char *page) +{ + ssize_t pos; + + pos = sprintf(page, "%d\n", childless->showme); + childless->showme++; + + return pos; +} + +static ssize_t childless_storeme_read(struct childless *childless, + char *page) +{ + return sprintf(page, "%d\n", childless->storeme); +} + +static ssize_t childless_storeme_write(struct childless *childless, + const char *page, + size_t count) +{ + unsigned long tmp; + char *p = (char *) page; + + tmp = simple_strtoul(p, &p, 10); + if (!p || (*p && (*p != '\n'))) + return -EINVAL; + + if (tmp > INT_MAX) + return -ERANGE; + + childless->storeme = tmp; + + return count; +} + +static ssize_t childless_description_read(struct childless *childless, + char *page) +{ + return sprintf(page, +"[01-childless]\n" +"\n" +"The childless subsystem is the simplest possible subsystem in\n" +"configfs. It does not support the creation of child config_items.\n" +"It only has a few attributes. In fact, it isn't much different\n" +"than a directory in /proc.\n"); +} + +CHILDLESS_ATTR_RO(showme, childless_showme_read); +CHILDLESS_ATTR(storeme, S_IRUGO | S_IWUSR, childless_storeme_read, + childless_storeme_write); +CHILDLESS_ATTR_RO(description, childless_description_read); + +static struct configfs_attribute *childless_attrs[] = { + &childless_attr_showme.attr, + &childless_attr_storeme.attr, + &childless_attr_description.attr, + NULL, +}; + +CONFIGFS_ATTR_OPS(childless); +static struct configfs_item_operations childless_item_ops = { + .show_attribute = childless_attr_show, + .store_attribute = childless_attr_store, +}; + +static struct config_item_type childless_type = { + .ct_item_ops = &childless_item_ops, + .ct_attrs = childless_attrs, + .ct_owner = THIS_MODULE, +}; + +static struct childless childless_subsys = { + .subsys = { + .su_group = { + .cg_item = { + .ci_namebuf = "01-childless", + .ci_type = &childless_type, + }, + }, + }, +}; + + +/* ----------------------------------------------------------------- */ + +/* + * 02-simple-children + * + * This example merely has a simple one-attribute child. Note that + * there is no extra attribute structure, as the child's attribute is + * known from the get-go. Also, there is no container for the + * subsystem, as it has no attributes of its own. + */ + +struct simple_child { + struct config_item item; + int storeme; +}; + +static inline struct simple_child *to_simple_child(struct config_item *item) +{ + return item ? container_of(item, struct simple_child, item) : NULL; +} + +static struct configfs_attribute simple_child_attr_storeme = { + .ca_owner = THIS_MODULE, + .ca_name = "storeme", + .ca_mode = S_IRUGO | S_IWUSR, +}; + +static struct configfs_attribute *simple_child_attrs[] = { + &simple_child_attr_storeme, + NULL, +}; + +static ssize_t simple_child_attr_show(struct config_item *item, + struct configfs_attribute *attr, + char *page) +{ + ssize_t count; + struct simple_child *simple_child = to_simple_child(item); + + count = sprintf(page, "%d\n", simple_child->storeme); + + return count; +} + +static ssize_t simple_child_attr_store(struct config_item *item, + struct configfs_attribute *attr, + const char *page, size_t count) +{ + struct simple_child *simple_child = to_simple_child(item); + unsigned long tmp; + char *p = (char *) page; + + tmp = simple_strtoul(p, &p, 10); + if (!p || (*p && (*p != '\n'))) + return -EINVAL; + + if (tmp > INT_MAX) + return -ERANGE; + + simple_child->storeme = tmp; + + return count; +} + +static void simple_child_release(struct config_item *item) +{ + kfree(to_simple_child(item)); +} + +static struct configfs_item_operations simple_child_item_ops = { + .release = simple_child_release, + .show_attribute = simple_child_attr_show, + .store_attribute = simple_child_attr_store, +}; + +static struct config_item_type simple_child_type = { + .ct_item_ops = &simple_child_item_ops, + .ct_attrs = simple_child_attrs, + .ct_owner = THIS_MODULE, +}; + + +struct simple_children { + struct config_group group; +}; + +static inline struct simple_children *to_simple_children(struct config_item *item) +{ + return item ? container_of(to_config_group(item), struct simple_children, group) : NULL; +} + +static struct config_item *simple_children_make_item(struct config_group *group, const char *name) +{ + struct simple_child *simple_child; + + simple_child = kzalloc(sizeof(struct simple_child), GFP_KERNEL); + if (!simple_child) + return ERR_PTR(-ENOMEM); + + config_item_init_type_name(&simple_child->item, name, + &simple_child_type); + + simple_child->storeme = 0; + + return &simple_child->item; +} + +static struct configfs_attribute simple_children_attr_description = { + .ca_owner = THIS_MODULE, + .ca_name = "description", + .ca_mode = S_IRUGO, +}; + +static struct configfs_attribute *simple_children_attrs[] = { + &simple_children_attr_description, + NULL, +}; + +static ssize_t simple_children_attr_show(struct config_item *item, + struct configfs_attribute *attr, + char *page) +{ + return sprintf(page, +"[02-simple-children]\n" +"\n" +"This subsystem allows the creation of child config_items. These\n" +"items have only one attribute that is readable and writeable.\n"); +} + +static void simple_children_release(struct config_item *item) +{ + kfree(to_simple_children(item)); +} + +static struct configfs_item_operations simple_children_item_ops = { + .release = simple_children_release, + .show_attribute = simple_children_attr_show, +}; + +/* + * Note that, since no extra work is required on ->drop_item(), + * no ->drop_item() is provided. + */ +static struct configfs_group_operations simple_children_group_ops = { + .make_item = simple_children_make_item, +}; + +static struct config_item_type simple_children_type = { + .ct_item_ops = &simple_children_item_ops, + .ct_group_ops = &simple_children_group_ops, + .ct_attrs = simple_children_attrs, + .ct_owner = THIS_MODULE, +}; + +static struct configfs_subsystem simple_children_subsys = { + .su_group = { + .cg_item = { + .ci_namebuf = "02-simple-children", + .ci_type = &simple_children_type, + }, + }, +}; + + +/* ----------------------------------------------------------------- */ + +/* + * 03-group-children + * + * This example reuses the simple_children group from above. However, + * the simple_children group is not the subsystem itself, it is a + * child of the subsystem. Creation of a group in the subsystem creates + * a new simple_children group. That group can then have simple_child + * children of its own. + */ + +static struct config_group *group_children_make_group(struct config_group *group, const char *name) +{ + struct simple_children *simple_children; + + simple_children = kzalloc(sizeof(struct simple_children), + GFP_KERNEL); + if (!simple_children) + return ERR_PTR(-ENOMEM); + + config_group_init_type_name(&simple_children->group, name, + &simple_children_type); + + return &simple_children->group; +} + +static struct configfs_attribute group_children_attr_description = { + .ca_owner = THIS_MODULE, + .ca_name = "description", + .ca_mode = S_IRUGO, +}; + +static struct configfs_attribute *group_children_attrs[] = { + &group_children_attr_description, + NULL, +}; + +static ssize_t group_children_attr_show(struct config_item *item, + struct configfs_attribute *attr, + char *page) +{ + return sprintf(page, +"[03-group-children]\n" +"\n" +"This subsystem allows the creation of child config_groups. These\n" +"groups are like the subsystem simple-children.\n"); +} + +static struct configfs_item_operations group_children_item_ops = { + .show_attribute = group_children_attr_show, +}; + +/* + * Note that, since no extra work is required on ->drop_item(), + * no ->drop_item() is provided. + */ +static struct configfs_group_operations group_children_group_ops = { + .make_group = group_children_make_group, +}; + +static struct config_item_type group_children_type = { + .ct_item_ops = &group_children_item_ops, + .ct_group_ops = &group_children_group_ops, + .ct_attrs = group_children_attrs, + .ct_owner = THIS_MODULE, +}; + +static struct configfs_subsystem group_children_subsys = { + .su_group = { + .cg_item = { + .ci_namebuf = "03-group-children", + .ci_type = &group_children_type, + }, + }, +}; + +/* ----------------------------------------------------------------- */ + +/* + * We're now done with our subsystem definitions. + * For convenience in this module, here's a list of them all. It + * allows the init function to easily register them. Most modules + * will only have one subsystem, and will only call register_subsystem + * on it directly. + */ +static struct configfs_subsystem *example_subsys[] = { + &childless_subsys.subsys, + &simple_children_subsys, + &group_children_subsys, + NULL, +}; + +static int __init configfs_example_init(void) +{ + int ret; + int i; + struct configfs_subsystem *subsys; + + for (i = 0; example_subsys[i]; i++) { + subsys = example_subsys[i]; + + config_group_init(&subsys->su_group); + mutex_init(&subsys->su_mutex); + ret = configfs_register_subsystem(subsys); + if (ret) { + printk(KERN_ERR "Error %d while registering subsystem %s\n", + ret, + subsys->su_group.cg_item.ci_namebuf); + goto out_unregister; + } + } + + return 0; + +out_unregister: + for (i--; i >= 0; i--) + configfs_unregister_subsystem(example_subsys[i]); + + return ret; +} + +static void __exit configfs_example_exit(void) +{ + int i; + + for (i = 0; example_subsys[i]; i++) + configfs_unregister_subsystem(example_subsys[i]); +} + +module_init(configfs_example_init); +module_exit(configfs_example_exit); +MODULE_LICENSE("GPL"); diff --git a/Documentation/filesystems/debugfs.txt b/Documentation/filesystems/debugfs.txt new file mode 100644 index 00000000000..3a863f69272 --- /dev/null +++ b/Documentation/filesystems/debugfs.txt @@ -0,0 +1,191 @@ +Copyright 2009 Jonathan Corbet <corbet@lwn.net> + +Debugfs exists as a simple way for kernel developers to make information +available to user space. Unlike /proc, which is only meant for information +about a process, or sysfs, which has strict one-value-per-file rules, +debugfs has no rules at all. Developers can put any information they want +there. The debugfs filesystem is also intended to not serve as a stable +ABI to user space; in theory, there are no stability constraints placed on +files exported there. The real world is not always so simple, though [1]; +even debugfs interfaces are best designed with the idea that they will need +to be maintained forever. + +Debugfs is typically mounted with a command like: + + mount -t debugfs none /sys/kernel/debug + +(Or an equivalent /etc/fstab line). +The debugfs root directory is accessible only to the root user by +default. To change access to the tree the "uid", "gid" and "mode" mount +options can be used. + +Note that the debugfs API is exported GPL-only to modules. + +Code using debugfs should include <linux/debugfs.h>. Then, the first order +of business will be to create at least one directory to hold a set of +debugfs files: + + struct dentry *debugfs_create_dir(const char *name, struct dentry *parent); + +This call, if successful, will make a directory called name underneath the +indicated parent directory. If parent is NULL, the directory will be +created in the debugfs root. On success, the return value is a struct +dentry pointer which can be used to create files in the directory (and to +clean it up at the end). A NULL return value indicates that something went +wrong. If ERR_PTR(-ENODEV) is returned, that is an indication that the +kernel has been built without debugfs support and none of the functions +described below will work. + +The most general way to create a file within a debugfs directory is with: + + struct dentry *debugfs_create_file(const char *name, umode_t mode, + struct dentry *parent, void *data, + const struct file_operations *fops); + +Here, name is the name of the file to create, mode describes the access +permissions the file should have, parent indicates the directory which +should hold the file, data will be stored in the i_private field of the +resulting inode structure, and fops is a set of file operations which +implement the file's behavior. At a minimum, the read() and/or write() +operations should be provided; others can be included as needed. Again, +the return value will be a dentry pointer to the created file, NULL for +error, or ERR_PTR(-ENODEV) if debugfs support is missing. + +In a number of cases, the creation of a set of file operations is not +actually necessary; the debugfs code provides a number of helper functions +for simple situations. Files containing a single integer value can be +created with any of: + + struct dentry *debugfs_create_u8(const char *name, umode_t mode, + struct dentry *parent, u8 *value); + struct dentry *debugfs_create_u16(const char *name, umode_t mode, + struct dentry *parent, u16 *value); + struct dentry *debugfs_create_u32(const char *name, umode_t mode, + struct dentry *parent, u32 *value); + struct dentry *debugfs_create_u64(const char *name, umode_t mode, + struct dentry *parent, u64 *value); + +These files support both reading and writing the given value; if a specific +file should not be written to, simply set the mode bits accordingly. The +values in these files are in decimal; if hexadecimal is more appropriate, +the following functions can be used instead: + + struct dentry *debugfs_create_x8(const char *name, umode_t mode, + struct dentry *parent, u8 *value); + struct dentry *debugfs_create_x16(const char *name, umode_t mode, + struct dentry *parent, u16 *value); + struct dentry *debugfs_create_x32(const char *name, umode_t mode, + struct dentry *parent, u32 *value); + struct dentry *debugfs_create_x64(const char *name, umode_t mode, + struct dentry *parent, u64 *value); + +These functions are useful as long as the developer knows the size of the +value to be exported. Some types can have different widths on different +architectures, though, complicating the situation somewhat. There is a +function meant to help out in one special case: + + struct dentry *debugfs_create_size_t(const char *name, umode_t mode, + struct dentry *parent, + size_t *value); + +As might be expected, this function will create a debugfs file to represent +a variable of type size_t. + +Boolean values can be placed in debugfs with: + + struct dentry *debugfs_create_bool(const char *name, umode_t mode, + struct dentry *parent, u32 *value); + +A read on the resulting file will yield either Y (for non-zero values) or +N, followed by a newline. If written to, it will accept either upper- or +lower-case values, or 1 or 0. Any other input will be silently ignored. + +Another option is exporting a block of arbitrary binary data, with +this structure and function: + + struct debugfs_blob_wrapper { + void *data; + unsigned long size; + }; + + struct dentry *debugfs_create_blob(const char *name, umode_t mode, + struct dentry *parent, + struct debugfs_blob_wrapper *blob); + +A read of this file will return the data pointed to by the +debugfs_blob_wrapper structure. Some drivers use "blobs" as a simple way +to return several lines of (static) formatted text output. This function +can be used to export binary information, but there does not appear to be +any code which does so in the mainline. Note that all files created with +debugfs_create_blob() are read-only. + +If you want to dump a block of registers (something that happens quite +often during development, even if little such code reaches mainline. +Debugfs offers two functions: one to make a registers-only file, and +another to insert a register block in the middle of another sequential +file. + + struct debugfs_reg32 { + char *name; + unsigned long offset; + }; + + struct debugfs_regset32 { + struct debugfs_reg32 *regs; + int nregs; + void __iomem *base; + }; + + struct dentry *debugfs_create_regset32(const char *name, umode_t mode, + struct dentry *parent, + struct debugfs_regset32 *regset); + + int debugfs_print_regs32(struct seq_file *s, struct debugfs_reg32 *regs, + int nregs, void __iomem *base, char *prefix); + +The "base" argument may be 0, but you may want to build the reg32 array +using __stringify, and a number of register names (macros) are actually +byte offsets over a base for the register block. + + +There are a couple of other directory-oriented helper functions: + + struct dentry *debugfs_rename(struct dentry *old_dir, + struct dentry *old_dentry, + struct dentry *new_dir, + const char *new_name); + + struct dentry *debugfs_create_symlink(const char *name, + struct dentry *parent, + const char *target); + +A call to debugfs_rename() will give a new name to an existing debugfs +file, possibly in a different directory. The new_name must not exist prior +to the call; the return value is old_dentry with updated information. +Symbolic links can be created with debugfs_create_symlink(). + +There is one important thing that all debugfs users must take into account: +there is no automatic cleanup of any directories created in debugfs. If a +module is unloaded without explicitly removing debugfs entries, the result +will be a lot of stale pointers and no end of highly antisocial behavior. +So all debugfs users - at least those which can be built as modules - must +be prepared to remove all files and directories they create there. A file +can be removed with: + + void debugfs_remove(struct dentry *dentry); + +The dentry value can be NULL, in which case nothing will be removed. + +Once upon a time, debugfs users were required to remember the dentry +pointer for every debugfs file they created so that all files could be +cleaned up. We live in more civilized times now, though, and debugfs users +can call: + + void debugfs_remove_recursive(struct dentry *dentry); + +If this function is passed a pointer for the dentry corresponding to the +top-level directory, the entire hierarchy below that directory will be +removed. + +Notes: + [1] http://lwn.net/Articles/309298/ diff --git a/Documentation/filesystems/dentry-locking.txt b/Documentation/filesystems/dentry-locking.txt deleted file mode 100644 index 4c0c575a401..00000000000 --- a/Documentation/filesystems/dentry-locking.txt +++ /dev/null @@ -1,173 +0,0 @@ -RCU-based dcache locking model -============================== - -On many workloads, the most common operation on dcache is to look up a -dentry, given a parent dentry and the name of the child. Typically, -for every open(), stat() etc., the dentry corresponding to the -pathname will be looked up by walking the tree starting with the first -component of the pathname and using that dentry along with the next -component to look up the next level and so on. Since it is a frequent -operation for workloads like multiuser environments and web servers, -it is important to optimize this path. - -Prior to 2.5.10, dcache_lock was acquired in d_lookup and thus in -every component during path look-up. Since 2.5.10 onwards, fast-walk -algorithm changed this by holding the dcache_lock at the beginning and -walking as many cached path component dentries as possible. This -significantly decreases the number of acquisition of -dcache_lock. However it also increases the lock hold time -significantly and affects performance in large SMP machines. Since -2.5.62 kernel, dcache has been using a new locking model that uses RCU -to make dcache look-up lock-free. - -The current dcache locking model is not very different from the -existing dcache locking model. Prior to 2.5.62 kernel, dcache_lock -protected the hash chain, d_child, d_alias, d_lru lists as well as -d_inode and several other things like mount look-up. RCU-based changes -affect only the way the hash chain is protected. For everything else -the dcache_lock must be taken for both traversing as well as -updating. The hash chain updates too take the dcache_lock. The -significant change is the way d_lookup traverses the hash chain, it -doesn't acquire the dcache_lock for this and rely on RCU to ensure -that the dentry has not been *freed*. - - -Dcache locking details -====================== - -For many multi-user workloads, open() and stat() on files are very -frequently occurring operations. Both involve walking of path names to -find the dentry corresponding to the concerned file. In 2.4 kernel, -dcache_lock was held during look-up of each path component. Contention -and cache-line bouncing of this global lock caused significant -scalability problems. With the introduction of RCU in Linux kernel, -this was worked around by making the look-up of path components during -path walking lock-free. - - -Safe lock-free look-up of dcache hash table -=========================================== - -Dcache is a complex data structure with the hash table entries also -linked together in other lists. In 2.4 kernel, dcache_lock protected -all the lists. We applied RCU only on hash chain walking. The rest of -the lists are still protected by dcache_lock. Some of the important -changes are : - -1. The deletion from hash chain is done using hlist_del_rcu() macro - which doesn't initialize next pointer of the deleted dentry and - this allows us to walk safely lock-free while a deletion is - happening. - -2. Insertion of a dentry into the hash table is done using - hlist_add_head_rcu() which take care of ordering the writes - the - writes to the dentry must be visible before the dentry is - inserted. This works in conjunction with hlist_for_each_rcu() while - walking the hash chain. The only requirement is that all - initialization to the dentry must be done before - hlist_add_head_rcu() since we don't have dcache_lock protection - while traversing the hash chain. This isn't different from the - existing code. - -3. The dentry looked up without holding dcache_lock by cannot be - returned for walking if it is unhashed. It then may have a NULL - d_inode or other bogosity since RCU doesn't protect the other - fields in the dentry. We therefore use a flag DCACHE_UNHASHED to - indicate unhashed dentries and use this in conjunction with a - per-dentry lock (d_lock). Once looked up without the dcache_lock, - we acquire the per-dentry lock (d_lock) and check if the dentry is - unhashed. If so, the look-up is failed. If not, the reference count - of the dentry is increased and the dentry is returned. - -4. Once a dentry is looked up, it must be ensured during the path walk - for that component it doesn't go away. In pre-2.5.10 code, this was - done holding a reference to the dentry. dcache_rcu does the same. - In some sense, dcache_rcu path walking looks like the pre-2.5.10 - version. - -5. All dentry hash chain updates must take the dcache_lock as well as - the per-dentry lock in that order. dput() does this to ensure that - a dentry that has just been looked up in another CPU doesn't get - deleted before dget() can be done on it. - -6. There are several ways to do reference counting of RCU protected - objects. One such example is in ipv4 route cache where deferred - freeing (using call_rcu()) is done as soon as the reference count - goes to zero. This cannot be done in the case of dentries because - tearing down of dentries require blocking (dentry_iput()) which - isn't supported from RCU callbacks. Instead, tearing down of - dentries happen synchronously in dput(), but actual freeing happens - later when RCU grace period is over. This allows safe lock-free - walking of the hash chains, but a matched dentry may have been - partially torn down. The checking of DCACHE_UNHASHED flag with - d_lock held detects such dentries and prevents them from being - returned from look-up. - - -Maintaining POSIX rename semantics -================================== - -Since look-up of dentries is lock-free, it can race against a -concurrent rename operation. For example, during rename of file A to -B, look-up of either A or B must succeed. So, if look-up of B happens -after A has been removed from the hash chain but not added to the new -hash chain, it may fail. Also, a comparison while the name is being -written concurrently by a rename may result in false positive matches -violating rename semantics. Issues related to race with rename are -handled as described below : - -1. Look-up can be done in two ways - d_lookup() which is safe from - simultaneous renames and __d_lookup() which is not. If - __d_lookup() fails, it must be followed up by a d_lookup() to - correctly determine whether a dentry is in the hash table or - not. d_lookup() protects look-ups using a sequence lock - (rename_lock). - -2. The name associated with a dentry (d_name) may be changed if a - rename is allowed to happen simultaneously. To avoid memcmp() in - __d_lookup() go out of bounds due to a rename and false positive - comparison, the name comparison is done while holding the - per-dentry lock. This prevents concurrent renames during this - operation. - -3. Hash table walking during look-up may move to a different bucket as - the current dentry is moved to a different bucket due to rename. - But we use hlists in dcache hash table and they are - null-terminated. So, even if a dentry moves to a different bucket, - hash chain walk will terminate. [with a list_head list, it may not - since termination is when the list_head in the original bucket is - reached]. Since we redo the d_parent check and compare name while - holding d_lock, lock-free look-up will not race against d_move(). - -4. There can be a theoretical race when a dentry keeps coming back to - original bucket due to double moves. Due to this look-up may - consider that it has never moved and can end up in a infinite loop. - But this is not any worse that theoretical livelocks we already - have in the kernel. - - -Important guidelines for filesystem developers related to dcache_rcu -==================================================================== - -1. Existing dcache interfaces (pre-2.5.62) exported to filesystem - don't change. Only dcache internal implementation changes. However - filesystems *must not* delete from the dentry hash chains directly - using the list macros like allowed earlier. They must use dcache - APIs like d_drop() or __d_drop() depending on the situation. - -2. d_flags is now protected by a per-dentry lock (d_lock). All access - to d_flags must be protected by it. - -3. For a hashed dentry, checking of d_count needs to be protected by - d_lock. - - -Papers and other documentation on dcache locking -================================================ - -1. Scaling dcache with RCU (http://linuxjournal.com/article.php?sid=7124). - -2. http://lse.sourceforge.net/locking/dcache/dcache.html - - - diff --git a/Documentation/filesystems/devpts.txt b/Documentation/filesystems/devpts.txt new file mode 100644 index 00000000000..68dffd87f9b --- /dev/null +++ b/Documentation/filesystems/devpts.txt @@ -0,0 +1,132 @@ + +To support containers, we now allow multiple instances of devpts filesystem, +such that indices of ptys allocated in one instance are independent of indices +allocated in other instances of devpts. + +To preserve backward compatibility, this support for multiple instances is +enabled only if: + + - CONFIG_DEVPTS_MULTIPLE_INSTANCES=y, and + - '-o newinstance' mount option is specified while mounting devpts + +IOW, devpts now supports both single-instance and multi-instance semantics. + +If CONFIG_DEVPTS_MULTIPLE_INSTANCES=n, there is no change in behavior and +this referred to as the "legacy" mode. In this mode, the new mount options +(-o newinstance and -o ptmxmode) will be ignored with a 'bogus option' message +on console. + +If CONFIG_DEVPTS_MULTIPLE_INSTANCES=y and devpts is mounted without the +'newinstance' option (as in current start-up scripts) the new mount binds +to the initial kernel mount of devpts. This mode is referred to as the +'single-instance' mode and the current, single-instance semantics are +preserved, i.e PTYs are common across the system. + +The only difference between this single-instance mode and the legacy mode +is the presence of new, '/dev/pts/ptmx' node with permissions 0000, which +can safely be ignored. + +If CONFIG_DEVPTS_MULTIPLE_INSTANCES=y and 'newinstance' option is specified, +the mount is considered to be in the multi-instance mode and a new instance +of the devpts fs is created. Any ptys created in this instance are independent +of ptys in other instances of devpts. Like in the single-instance mode, the +/dev/pts/ptmx node is present. To effectively use the multi-instance mode, +open of /dev/ptmx must be a redirected to '/dev/pts/ptmx' using a symlink or +bind-mount. + +Eg: A container startup script could do the following: + + $ chmod 0666 /dev/pts/ptmx + $ rm /dev/ptmx + $ ln -s pts/ptmx /dev/ptmx + $ ns_exec -cm /bin/bash + + # We are now in new container + + $ umount /dev/pts + $ mount -t devpts -o newinstance lxcpts /dev/pts + $ sshd -p 1234 + +where 'ns_exec -cm /bin/bash' calls clone() with CLONE_NEWNS flag and execs +/bin/bash in the child process. A pty created by the sshd is not visible in +the original mount of /dev/pts. + +User-space changes +------------------ + +In multi-instance mode (i.e '-o newinstance' mount option is specified at least +once), following user-space issues should be noted. + +1. If -o newinstance mount option is never used, /dev/pts/ptmx can be ignored + and no change is needed to system-startup scripts. + +2. To effectively use multi-instance mode (i.e -o newinstance is specified) + administrators or startup scripts should "redirect" open of /dev/ptmx to + /dev/pts/ptmx using either a bind mount or symlink. + + $ mount -t devpts -o newinstance devpts /dev/pts + + followed by either + + $ rm /dev/ptmx + $ ln -s pts/ptmx /dev/ptmx + $ chmod 666 /dev/pts/ptmx + or + $ mount -o bind /dev/pts/ptmx /dev/ptmx + +3. The '/dev/ptmx -> pts/ptmx' symlink is the preferred method since it + enables better error-reporting and treats both single-instance and + multi-instance mounts similarly. + + But this method requires that system-startup scripts set the mode of + /dev/pts/ptmx correctly (default mode is 0000). The scripts can set the + mode by, either + + - adding ptmxmode mount option to devpts entry in /etc/fstab, or + - using 'chmod 0666 /dev/pts/ptmx' + +4. If multi-instance mode mount is needed for containers, but the system + startup scripts have not yet been updated, container-startup scripts + should bind mount /dev/ptmx to /dev/pts/ptmx to avoid breaking single- + instance mounts. + + Or, in general, container-startup scripts should use: + + mount -t devpts -o newinstance -o ptmxmode=0666 devpts /dev/pts + if [ ! -L /dev/ptmx ]; then + mount -o bind /dev/pts/ptmx /dev/ptmx + fi + + When all devpts mounts are multi-instance, /dev/ptmx can permanently be + a symlink to pts/ptmx and the bind mount can be ignored. + +5. A multi-instance mount that is not accompanied by the /dev/ptmx to + /dev/pts/ptmx redirection would result in an unusable/unreachable pty. + + mount -t devpts -o newinstance lxcpts /dev/pts + + immediately followed by: + + open("/dev/ptmx") + + would create a pty, say /dev/pts/7, in the initial kernel mount. + But /dev/pts/7 would be invisible in the new mount. + +6. The permissions for /dev/pts/ptmx node should be specified when mounting + /dev/pts, using the '-o ptmxmode=%o' mount option (default is 0000). + + mount -t devpts -o newinstance -o ptmxmode=0644 devpts /dev/pts + + The permissions can be later be changed as usual with 'chmod'. + + chmod 666 /dev/pts/ptmx + +7. A mount of devpts without the 'newinstance' option results in binding to + initial kernel mount. This behavior while preserving legacy semantics, + does not provide strict isolation in a container environment. i.e by + mounting devpts without the 'newinstance' option, a container could + get visibility into the 'host' or root container's devpts. + + To workaround this and have strict isolation, all mounts of devpts, + including the mount in the root container, should use the newinstance + option. diff --git a/Documentation/filesystems/directory-locking b/Documentation/filesystems/directory-locking index ff7b611abf3..09bbf9a54f8 100644 --- a/Documentation/filesystems/directory-locking +++ b/Documentation/filesystems/directory-locking @@ -2,6 +2,10 @@ kinds of locks - per-inode (->i_mutex) and per-filesystem (->s_vfs_rename_mutex). + When taking the i_mutex on multiple non-directory objects, we +always acquire the locks in order by increasing address. We'll call +that "inode pointer" order in the following. + For our purposes all operations fall in 5 classes: 1) read access. Locking rules: caller locks directory we are accessing. @@ -12,8 +16,9 @@ kinds of locks - per-inode (->i_mutex) and per-filesystem locks victim and calls the method. 4) rename() that is _not_ cross-directory. Locking rules: caller locks -the parent, finds source and target, if target already exists - locks it -and then calls the method. +the parent and finds source and target. If target already exists, lock +it. If source is a non-directory, lock it. If that means we need to +lock both, lock them in inode pointer order. 5) link creation. Locking rules: * lock parent @@ -30,7 +35,9 @@ rules: fail with -ENOTEMPTY * if new parent is equal to or is a descendent of source fail with -ELOOP - * if target exists - lock it. + * If target exists, lock it. If source is a non-directory, lock + it. In case that means we need to lock both source and target, + do so in inode pointer order. * call the method. @@ -56,9 +63,11 @@ objects - A < B iff A is an ancestor of B. renames will be blocked on filesystem lock and we don't start changing the order until we had acquired all locks). -(3) any operation holds at most one lock on non-directory object and - that lock is acquired after all other locks. (Proof: see descriptions - of operations). +(3) locks on non-directory objects are acquired only after locks on + directory objects, and are acquired in inode pointer order. + (Proof: all operations but renames take lock on at most one + non-directory object, except renames, which take locks on source and + target in inode pointer order in the case they are not directories.) Now consider the minimal deadlock. Each process is blocked on attempt to acquire some lock and already holds at least one lock. Let's @@ -66,9 +75,13 @@ consider the set of contended locks. First of all, filesystem lock is not contended, since any process blocked on it is not holding any locks. Thus all processes are blocked on ->i_mutex. - Non-directory objects are not contended due to (3). Thus link -creation can't be a part of deadlock - it can't be blocked on source -and it means that it doesn't hold any locks. + By (3), any process holding a non-directory lock can only be +waiting on another non-directory lock with a larger address. Therefore +the process holding the "largest" such lock can always make progress, and +non-directory objects are not included in the set of contended locks. + + Thus link creation can't be a part of deadlock - it can't be +blocked on source and it means that it doesn't hold any locks. Any contended object is either held by cross-directory rename or has a child that is also contended. Indeed, suppose that it is held by diff --git a/Documentation/filesystems/dlmfs.txt b/Documentation/filesystems/dlmfs.txt index c50bbb2d52b..1b528b2ad80 100644 --- a/Documentation/filesystems/dlmfs.txt +++ b/Documentation/filesystems/dlmfs.txt @@ -47,7 +47,7 @@ You'll want to start heartbeating on a volume which all the nodes in your lockspace can access. The easiest way to do this is via ocfs2_hb_ctl (distributed with ocfs2-tools). Right now it requires that an OCFS2 file system be in place so that it can automatically -find it's heartbeat area, though it will eventually support heartbeat +find its heartbeat area, though it will eventually support heartbeat against raw disks. Please see the ocfs2_hb_ctl and mkfs.ocfs2 manual pages distributed diff --git a/Documentation/filesystems/dnotify.txt b/Documentation/filesystems/dnotify.txt index 9f5d338ddbb..6baf88f4685 100644 --- a/Documentation/filesystems/dnotify.txt +++ b/Documentation/filesystems/dnotify.txt @@ -62,38 +62,9 @@ disabled, fcntl(fd, F_NOTIFY, ...) will return -EINVAL. Example ------- +See Documentation/filesystems/dnotify_test.c for an example. - #define _GNU_SOURCE /* needed to get the defines */ - #include <fcntl.h> /* in glibc 2.2 this has the needed - values defined */ - #include <signal.h> - #include <stdio.h> - #include <unistd.h> - - static volatile int event_fd; - - static void handler(int sig, siginfo_t *si, void *data) - { - event_fd = si->si_fd; - } - - int main(void) - { - struct sigaction act; - int fd; - - act.sa_sigaction = handler; - sigemptyset(&act.sa_mask); - act.sa_flags = SA_SIGINFO; - sigaction(SIGRTMIN + 1, &act, NULL); - - fd = open(".", O_RDONLY); - fcntl(fd, F_SETSIG, SIGRTMIN + 1); - fcntl(fd, F_NOTIFY, DN_MODIFY|DN_CREATE|DN_MULTISHOT); - /* we will now be notified if any of the files - in "." is modified or new files are created */ - while (1) { - pause(); - printf("Got event on fd=%d\n", event_fd); - } - } +NOTE +---- +Beginning with Linux 2.6.13, dnotify has been replaced by inotify. +See Documentation/filesystems/inotify.txt for more information on it. diff --git a/Documentation/filesystems/dnotify_test.c b/Documentation/filesystems/dnotify_test.c new file mode 100644 index 00000000000..8b37b4a1e18 --- /dev/null +++ b/Documentation/filesystems/dnotify_test.c @@ -0,0 +1,34 @@ +#define _GNU_SOURCE /* needed to get the defines */ +#include <fcntl.h> /* in glibc 2.2 this has the needed + values defined */ +#include <signal.h> +#include <stdio.h> +#include <unistd.h> + +static volatile int event_fd; + +static void handler(int sig, siginfo_t *si, void *data) +{ + event_fd = si->si_fd; +} + +int main(void) +{ + struct sigaction act; + int fd; + + act.sa_sigaction = handler; + sigemptyset(&act.sa_mask); + act.sa_flags = SA_SIGINFO; + sigaction(SIGRTMIN + 1, &act, NULL); + + fd = open(".", O_RDONLY); + fcntl(fd, F_SETSIG, SIGRTMIN + 1); + fcntl(fd, F_NOTIFY, DN_MODIFY|DN_CREATE|DN_MULTISHOT); + /* we will now be notified if any of the files + in "." is modified or new files are created */ + while (1) { + pause(); + printf("Got event on fd=%d\n", event_fd); + } +} diff --git a/Documentation/filesystems/efivarfs.txt b/Documentation/filesystems/efivarfs.txt new file mode 100644 index 00000000000..c477af086e6 --- /dev/null +++ b/Documentation/filesystems/efivarfs.txt @@ -0,0 +1,16 @@ + +efivarfs - a (U)EFI variable filesystem + +The efivarfs filesystem was created to address the shortcomings of +using entries in sysfs to maintain EFI variables. The old sysfs EFI +variables code only supported variables of up to 1024 bytes. This +limitation existed in version 0.99 of the EFI specification, but was +removed before any full releases. Since variables can now be larger +than a single page, sysfs isn't the best interface for this. + +Variables can be created, deleted and modified with the efivarfs +filesystem. + +efivarfs is typically mounted like this, + + mount -t efivarfs none /sys/firmware/efi/efivars diff --git a/Documentation/filesystems/exofs.txt b/Documentation/filesystems/exofs.txt new file mode 100644 index 00000000000..23583a13697 --- /dev/null +++ b/Documentation/filesystems/exofs.txt @@ -0,0 +1,185 @@ +=============================================================================== +WHAT IS EXOFS? +=============================================================================== + +exofs is a file system that uses an OSD and exports the API of a normal Linux +file system. Users access exofs like any other local file system, and exofs +will in turn issue commands to the local OSD initiator. + +OSD is a new T10 command set that views storage devices not as a large/flat +array of sectors but as a container of objects, each having a length, quota, +time attributes and more. Each object is addressed by a 64bit ID, and is +contained in a 64bit ID partition. Each object has associated attributes +attached to it, which are integral part of the object and provide metadata about +the object. The standard defines some common obligatory attributes, but user +attributes can be added as needed. + +=============================================================================== +ENVIRONMENT +=============================================================================== + +To use this file system, you need to have an object store to run it on. You +may download a target from: +http://open-osd.org + +See Documentation/scsi/osd.txt for how to setup a working osd environment. + +=============================================================================== +USAGE +=============================================================================== + +1. Download and compile exofs and open-osd initiator: + You need an external Kernel source tree or kernel headers from your + distribution. (anything based on 2.6.26 or later). + + a. download open-osd including exofs source using: + [parent-directory]$ git clone git://git.open-osd.org/open-osd.git + + b. Build the library module like this: + [parent-directory]$ make -C KSRC=$(KER_DIR) open-osd + + This will build both the open-osd initiator as well as the exofs kernel + module. Use whatever parameters you compiled your Kernel with and + $(KER_DIR) above pointing to the Kernel you compile against. See the file + open-osd/top-level-Makefile for an example. + +2. Get the OSD initiator and target set up properly, and login to the target. + See Documentation/scsi/osd.txt for farther instructions. Also see ./do-osd + for example script that does all these steps. + +3. Insmod the exofs.ko module: + [exofs]$ insmod exofs.ko + +4. Make sure the directory where you want to mount exists. If not, create it. + (For example, mkdir /mnt/exofs) + +5. At first run you will need to invoke the mkfs.exofs application + + As an example, this will create the file system on: + /dev/osd0 partition ID 65536 + + mkfs.exofs --pid=65536 --format /dev/osd0 + + The --format is optional. If not specified, no OSD_FORMAT will be + performed and a clean file system will be created in the specified pid, + in the available space of the target. (Use --format=size_in_meg to limit + the total LUN space available) + + If pid already exists, it will be deleted and a new one will be created in + its place. Be careful. + + An exofs lives inside a single OSD partition. You can create multiple exofs + filesystems on the same device using multiple pids. + + (run mkfs.exofs without any parameters for usage help message) + +6. Mount the file system. + + For example, to mount /dev/osd0, partition ID 0x10000 on /mnt/exofs: + + mount -t exofs -o pid=65536 /dev/osd0 /mnt/exofs/ + +7. For reference (See do-exofs example script): + do-exofs start - an example of how to perform the above steps. + do-exofs stop - an example of how to unmount the file system. + do-exofs format - an example of how to format and mkfs a new exofs. + +8. Extra compilation flags (uncomment in fs/exofs/Kbuild): + CONFIG_EXOFS_DEBUG - for debug messages and extra checks. + +=============================================================================== +exofs mount options +=============================================================================== +Similar to any mount command: + mount -t exofs -o exofs_options /dev/osdX mount_exofs_directory + +Where: + -t exofs: specifies the exofs file system + + /dev/osdX: X is a decimal number. /dev/osdX was created after a successful + login into an OSD target. + + mount_exofs_directory: The directory to mount the file system on + + exofs specific options: Options are separated by commas (,) + pid=<integer> - The partition number to mount/create as + container of the filesystem. + This option is mandatory. integer can be + Hex by pre-pending an 0x to the number. + osdname=<id> - Mount by a device's osdname. + osdname is usually a 36 character uuid of the + form "d2683732-c906-4ee1-9dbd-c10c27bb40df". + It is one of the device's uuid specified in the + mkfs.exofs format command. + If this option is specified then the /dev/osdX + above can be empty and is ignored. + to=<integer> - Timeout in ticks for a single command. + default is (60 * HZ) [for debugging only] + +=============================================================================== +DESIGN +=============================================================================== + +* The file system control block (AKA on-disk superblock) resides in an object + with a special ID (defined in common.h). + Information included in the file system control block is used to fill the + in-memory superblock structure at mount time. This object is created before + the file system is used by mkexofs.c. It contains information such as: + - The file system's magic number + - The next inode number to be allocated + +* Each file resides in its own object and contains the data (and it will be + possible to extend the file over multiple objects, though this has not been + implemented yet). + +* A directory is treated as a file, and essentially contains a list of <file + name, inode #> pairs for files that are found in that directory. The object + IDs correspond to the files' inode numbers and will be allocated according to + a bitmap (stored in a separate object). Now they are allocated using a + counter. + +* Each file's control block (AKA on-disk inode) is stored in its object's + attributes. This applies to both regular files and other types (directories, + device files, symlinks, etc.). + +* Credentials are generated per object (inode and superblock) when they are + created in memory (read from disk or created). The credential works for all + operations and is used as long as the object remains in memory. + +* Async OSD operations are used whenever possible, but the target may execute + them out of order. The operations that concern us are create, delete, + readpage, writepage, update_inode, and truncate. The following pairs of + operations should execute in the order written, and we need to prevent them + from executing in reverse order: + - The following are handled with the OBJ_CREATED and OBJ_2BCREATED + flags. OBJ_CREATED is set when we know the object exists on the OSD - + in create's callback function, and when we successfully do a + read_inode. + OBJ_2BCREATED is set in the beginning of the create function, so we + know that we should wait. + - create/delete: delete should wait until the object is created + on the OSD. + - create/readpage: readpage should be able to return a page + full of zeroes in this case. If there was a write already + en-route (i.e. create, writepage, readpage) then the page + would be locked, and so it would really be the same as + create/writepage. + - create/writepage: if writepage is called for a sync write, it + should wait until the object is created on the OSD. + Otherwise, it should just return. + - create/truncate: truncate should wait until the object is + created on the OSD. + - create/update_inode: update_inode should wait until the + object is created on the OSD. + - Handled by VFS locks: + - readpage/delete: shouldn't happen because of page lock. + - writepage/delete: shouldn't happen because of page lock. + - readpage/writepage: shouldn't happen because of page lock. + +=============================================================================== +LICENSE/COPYRIGHT +=============================================================================== +The exofs file system is based on ext2 v0.5b (distributed with the Linux kernel +version 2.6.10). All files include the original copyrights, and the license +is GPL version 2 (only version 2, as is true for the Linux kernel). The +Linux kernel can be downloaded from www.kernel.org. diff --git a/Documentation/filesystems/ext2.txt b/Documentation/filesystems/ext2.txt index 4333e836c49..67639f905f1 100644 --- a/Documentation/filesystems/ext2.txt +++ b/Documentation/filesystems/ext2.txt @@ -322,7 +322,7 @@ an upper limit on the block size imposed by the page size of the kernel, so 8kB blocks are only allowed on Alpha systems (and other architectures which support larger pages). -There is an upper limit of 32768 subdirectories in a single directory. +There is an upper limit of 32000 subdirectories in a single directory. There is a "soft" upper limit of about 10-15k files in a single directory with the current linear linked-list directory implementation. This limit @@ -373,10 +373,11 @@ Filesystem Resizing http://ext2resize.sourceforge.net/ Compression (*) http://e2compr.sourceforge.net/ Implementations for: -Windows 95/98/NT/2000 http://uranus.it.swin.edu.au/~jn/linux/Explore2fs.htm -Windows 95 (*) http://www.yipton.demon.co.uk/content.html#FSDEXT2 +Windows 95/98/NT/2000 http://www.chrysocome.net/explore2fs +Windows 95 (*) http://www.yipton.net/content.html#FSDEXT2 DOS client (*) ftp://metalab.unc.edu/pub/Linux/system/filesystems/ext2/ -OS/2 http://perso.wanadoo.fr/matthieu.willm/ext2-os2/ -RISC OS client ftp://ftp.barnet.ac.uk/pub/acorn/armlinux/iscafs/ +OS/2 (+) ftp://metalab.unc.edu/pub/Linux/system/filesystems/ext2/ +RISC OS client http://www.esw-heim.tu-clausthal.de/~marco/smorbrod/IscaFS/ (*) no longer actively developed/supported (as of Apr 2001) +(+) no longer actively developed/supported (as of Mar 2009) diff --git a/Documentation/filesystems/ext3.txt b/Documentation/filesystems/ext3.txt index b45f3c1b8b4..7ed0d17d672 100644 --- a/Documentation/filesystems/ext3.txt +++ b/Documentation/filesystems/ext3.txt @@ -14,6 +14,11 @@ Options When mounting an ext3 filesystem, the following option are accepted: (*) == default +ro Mount filesystem read only. Note that ext3 will replay + the journal (and thus write to the partition) even when + mounted "read only". Mount options "ro,noload" can be + used to prevent writes to the filesystem. + journal=update Update the ext3 file system's journal to the current format. @@ -21,13 +26,16 @@ journal=inum When a journal already exists, this option is ignored. Otherwise, it specifies the number of the inode which will represent the ext3 file system's journal file. +journal_path=path journal_dev=devnum When the external journal device's major/minor numbers - have changed, this option allows the user to specify + have changed, these options allow the user to specify the new journal location. The journal device is - identified through its new major/minor numbers encoded - in devnum. + identified through either its new major/minor numbers + encoded in devnum, or via a path to the device. -noload Don't load the journal on mounting. +norecovery Don't load the journal on mounting. Note that this forces +noload mount of inconsistent filesystem, which can lead to + various problems. data=journal All data are committed into the journal prior to being written into the main file system. @@ -52,16 +60,19 @@ commit=nrsec (*) Ext3 can be told to sync all its data and metadata Setting it to very large values will improve performance. -barrier=1 This enables/disables barriers. barrier=0 disables - it, barrier=1 enables it. - -orlov (*) This enables the new Orlov block allocator. It is - enabled by default. - -oldalloc This disables the Orlov block allocator and enables - the old block allocator. Orlov should have better - performance - we'd like to get some feedback if it's - the contrary for you. +barrier=<0|1(*)> This enables/disables the use of write barriers in +barrier (*) the jbd code. barrier=0 disables, barrier=1 enables. +nobarrier This also requires an IO stack which can support + barriers, and if jbd gets an error on a barrier + write, it will disable again with a warning. + Write barriers enforce proper on-disk ordering + of journal commits, making volatile disk write caches + safe to use, at some performance penalty. If + your disks are battery-backed in one way or another, + disabling barriers may safely improve performance. + The mount options "barrier" and "nobarrier" can + also be used to enable or disable barriers, for + consistency with other ext3 mount options. user_xattr Enables Extended User Attributes. Additionally, you need to have extended attribute support enabled in the @@ -92,9 +103,17 @@ nocheck debug Extra debugging information is sent to syslog. -errors=remount-ro(*) Remount the filesystem read-only on an error. +errors=remount-ro Remount the filesystem read-only on an error. errors=continue Keep going on a filesystem error. errors=panic Panic and halt the machine if an error occurs. + (These mount options override the errors behavior + specified in the superblock, which can be + configured using tune2fs.) + +data_err=ignore(*) Just print an error message if an error occurs + in a file data buffer in ordered mode. +data_err=abort Abort the journal if an error occurs in a file + data buffer in ordered mode. grpid Give objects the same group ID as their creator. bsdgroups @@ -108,19 +127,18 @@ resuid=n The user ID which may use the reserved blocks. sb=n Use alternate superblock at this location. -quota -noquota -grpquota -usrquota - -bh (*) ext3 associates buffer heads to data pages to -nobh (a) cache disk block mapping information - (b) link pages into transaction to provide - ordering guarantees. - "bh" option forces use of buffer heads. - "nobh" option tries to avoid associating buffer - heads (supported only for "writeback" mode). +quota These options are ignored by the filesystem. They +noquota are used only by quota tools to recognize volumes +grpquota where quota should be turned on. See documentation +usrquota in the quota-tools package for more details + (http://sourceforge.net/projects/linuxquota). +jqfmt=<quota type> These options tell filesystem details about quota +usrjquota=<file> so that quota information can be properly updated +grpjquota=<file> during journal replay. They replace the above + quota options. See documentation in the quota-tools + package for more details + (http://sourceforge.net/projects/linuxquota). Specification ============= @@ -193,6 +211,5 @@ kernel source: <file:fs/ext3/> programs: http://e2fsprogs.sourceforge.net/ http://ext2resize.sourceforge.net -useful links: http://www.zip.com.au/~akpm/linux/ext3/ext3-usage.html - http://www-106.ibm.com/developerworks/linux/library/l-fs7/ - http://www-106.ibm.com/developerworks/linux/library/l-fs8/ +useful links: http://www.ibm.com/developerworks/library/l-fs7/index.html + http://www.ibm.com/developerworks/library/l-fs8/index.html diff --git a/Documentation/filesystems/ext4.txt b/Documentation/filesystems/ext4.txt index 560f88dc709..919a3293aaa 100644 --- a/Documentation/filesystems/ext4.txt +++ b/Documentation/filesystems/ext4.txt @@ -2,83 +2,125 @@ Ext4 Filesystem =============== -This is a development version of the ext4 filesystem, an advanced level -of the ext3 filesystem which incorporates scalability and reliability -enhancements for supporting large filesystems (64 bit) in keeping with -increasing disk capacities and state-of-the-art feature requirements. +Ext4 is an advanced level of the ext3 filesystem which incorporates +scalability and reliability enhancements for supporting large filesystems +(64 bit) in keeping with increasing disk capacities and state-of-the-art +feature requirements. -Mailing list: linux-ext4@vger.kernel.org +Mailing list: linux-ext4@vger.kernel.org +Web site: http://ext4.wiki.kernel.org 1. Quick usage instructions: =========================== - - Grab updated e2fsprogs from - ftp://ftp.kernel.org/pub/linux/kernel/people/tytso/e2fsprogs-interim/ - This is a patchset on top of e2fsprogs-1.39, which can be found at +Note: More extensive information for getting started with ext4 can be + found at the ext4 wiki site at the URL: + http://ext4.wiki.kernel.org/index.php/Ext4_Howto + + - Compile and install the latest version of e2fsprogs (as of this + writing version 1.41.3) from: + + http://sourceforge.net/project/showfiles.php?group_id=2406 + + or + ftp://ftp.kernel.org/pub/linux/kernel/people/tytso/e2fsprogs/ - - It's still mke2fs -j /dev/hda1 + or grab the latest git repository from: + + git://git.kernel.org/pub/scm/fs/ext2/e2fsprogs.git + + - Note that it is highly important to install the mke2fs.conf file + that comes with the e2fsprogs 1.41.x sources in /etc/mke2fs.conf. If + you have edited the /etc/mke2fs.conf file installed on your system, + you will need to merge your changes with the version from e2fsprogs + 1.41.x. + + - Create a new filesystem using the ext4 filesystem type: + + # mke2fs -t ext4 /dev/hda1 + + Or to configure an existing ext3 filesystem to support extents: - - mount /dev/hda1 /wherever -t ext4dev + # tune2fs -O extents /dev/hda1 - - To enable extents, + If the filesystem was created with 128 byte inodes, it can be + converted to use 256 byte for greater efficiency via: - mount /dev/hda1 /wherever -t ext4dev -o extents + # tune2fs -I 256 /dev/hda1 - - The filesystem is compatible with the ext3 driver until you add a file - which has extents (ie: `mount -o extents', then create a file). + (Note: we currently do not have tools to convert an ext4 + filesystem back to ext3; so please do not do try this on production + filesystems.) - NOTE: The "extents" mount flag is temporary. It will soon go away and - extents will be enabled by the "-o extents" flag to mke2fs or tune2fs + - Mounting: - - When comparing performance with other filesystems, remember that - ext3/4 by default offers higher data integrity guarantees than most. So - when comparing with a metadata-only journalling filesystem, use `mount -o - data=writeback'. And you might as well use `mount -o nobh' too along - with it. Making the journal larger than the mke2fs default often helps - performance with metadata-intensive workloads. + # mount -t ext4 /dev/hda1 /wherever + + - When comparing performance with other filesystems, it's always + important to try multiple workloads; very often a subtle change in a + workload parameter can completely change the ranking of which + filesystems do well compared to others. When comparing versus ext3, + note that ext4 enables write barriers by default, while ext3 does + not enable write barriers by default. So it is useful to use + explicitly specify whether barriers are enabled or not when via the + '-o barriers=[0|1]' mount option for both ext3 and ext4 filesystems + for a fair comparison. When tuning ext3 for best benchmark numbers, + it is often worthwhile to try changing the data journaling mode; '-o + data=writeback' can be faster for some workloads. (Note however that + running mounted with data=writeback can potentially leave stale data + exposed in recently written files in case of an unclean shutdown, + which could be a security exposure in some situations.) Configuring + the filesystem with a large journal can also be helpful for + metadata-intensive workloads. 2. Features =========== 2.1 Currently available -* ability to use filesystems > 16TB +* ability to use filesystems > 16TB (e2fsprogs support not available yet) * extent format reduces metadata overhead (RAM, IO for access, transactions) * extent format more robust in face of on-disk corruption due to magics, -* internal redunancy in tree - -2.1 Previously available, soon to be enabled by default by "mkefs.ext4": - -* dir_index and resize inode will be on by default -* large inodes will be used by default for fast EAs, nsec timestamps, etc +* internal redundancy in tree +* improved file allocation (multi-block alloc) +* lift 32000 subdirectory limit imposed by i_links_count[1] +* nsec timestamps for mtime, atime, ctime, create time +* inode version field on disk (NFSv4, Lustre) +* reduced e2fsck time via uninit_bg feature +* journal checksumming for robustness, performance +* persistent file preallocation (e.g for streaming media, databases) +* ability to pack bitmaps and inode tables into larger virtual groups via the + flex_bg feature +* large file support +* Inode allocation using large virtual block groups via flex_bg +* delayed allocation +* large block (up to pagesize) support +* efficient new ordered mode in JBD2 and ext4(avoid using buffer head to force + the ordering) + +[1] Filesystems with a block size of 1k may see a limit imposed by the +directory hash tree having a maximum depth of two. 2.2 Candidate features for future inclusion -There are several under discussion, whether they all make it in is -partly a function of how much time everyone has to work on them: +* Online defrag (patches available but not well tested) +* reduced mke2fs time via lazy itable initialization in conjunction with + the uninit_bg feature (capability to do this is available in e2fsprogs + but a kernel thread to do lazy zeroing of unused inode table blocks + after filesystem is first mounted is required for safety) -* improved file allocation (multi-block alloc, delayed alloc; basically done) -* fix 32000 subdirectory limit (patch exists, needs some e2fsck work) -* nsec timestamps for mtime, atime, ctime, create time (patch exists, - needs some e2fsck work) -* inode version field on disk (NFSv4, Lustre; prototype exists) -* reduced mke2fs/e2fsck time via uninitialized groups (prototype exists) -* journal checksumming for robustness, performance (prototype exists) -* persistent file preallocation (e.g for streaming media, databases) +There are several others under discussion, whether they all make it in is +partly a function of how much time everyone has to work on them. Features like +metadata checksumming have been discussed and planned for a bit but no patches +exist yet so I'm not sure they're in the near-term roadmap. -Features like metadata checksumming have been discussed and planned for -a bit but no patches exist yet so I'm not sure they're in the near-term -roadmap. +The big performance win will come with mballoc, delalloc and flex_bg +grouping of bitmaps and inode tables. Some test results available here: -The big performance win will come with mballoc and delalloc. CFS has -been using mballoc for a few years already with Lustre, and IBM + Bull -did a lot of benchmarking on it. The reason it isn't in the first set of -patches is partly a manageability issue, and partly because it doesn't -directly affect the on-disk format (outside of much better allocation) -so it isn't critical to get into the first round of changes. I believe -Alex is working on a new set of patches right now. + - http://www.bullopensource.org/ext4/20080818-ffsb/ffsb-write-2.6.27-rc1.html + - http://www.bullopensource.org/ext4/20080818-ffsb/ffsb-readwrite-2.6.27-rc1.html 3. Options ========== @@ -86,10 +128,11 @@ Alex is working on a new set of patches right now. When mounting an ext4 filesystem, the following option are accepted: (*) == default -extents (*) ext4 will use extents to address file data. The - file system will no longer be mountable by ext3. - -noextents ext4 will not use extents for newly created files +ro Mount filesystem read only. Note that ext4 will + replay the journal (and thus write to the + partition) even when mounted "read only". The + mount options "ro,noload" can be used to prevent + writes to the filesystem. journal_checksum Enable checksumming of the journal transactions. This will allow the recovery code in e2fsck and the @@ -101,23 +144,23 @@ journal_async_commit Commit block can be written to disk without waiting mount the device. This will enable 'journal_checksum' internally. -journal=update Update the ext4 file system's journal to the current - format. - -journal=inum When a journal already exists, this option is ignored. - Otherwise, it specifies the number of the inode which - will represent the ext4 file system's journal file. - +journal_path=path journal_dev=devnum When the external journal device's major/minor numbers - have changed, this option allows the user to specify + have changed, these options allow the user to specify the new journal location. The journal device is - identified through its new major/minor numbers encoded - in devnum. + identified through either its new major/minor numbers + encoded in devnum, or via a path to the device. -noload Don't load the journal on mounting. +norecovery Don't load the journal on mounting. Note that +noload if the filesystem was not unmounted cleanly, + skipping the journal replay will lead to the + filesystem containing inconsistencies that can + lead to any number of problems. data=journal All data are committed into the journal prior to being - written into the main file system. + written into the main file system. Enabling + this mode will disable delayed allocation and + O_DIRECT support. data=ordered (*) All data are forced directly out to the main file system prior to its metadata being committed to the @@ -139,49 +182,56 @@ commit=nrsec (*) Ext4 can be told to sync all its data and metadata Setting it to very large values will improve performance. -barrier=1 This enables/disables barriers. barrier=0 disables - it, barrier=1 enables it. - -orlov (*) This enables the new Orlov block allocator. It is - enabled by default. - -oldalloc This disables the Orlov block allocator and enables - the old block allocator. Orlov should have better - performance - we'd like to get some feedback if it's - the contrary for you. - -user_xattr Enables Extended User Attributes. Additionally, you - need to have extended attribute support enabled in the - kernel configuration (CONFIG_EXT4_FS_XATTR). See the - attr(5) manual page and http://acl.bestbits.at/ to - learn more about extended attributes. - -nouser_xattr Disables Extended User Attributes. - -acl Enables POSIX Access Control Lists support. - Additionally, you need to have ACL support enabled in - the kernel configuration (CONFIG_EXT4_FS_POSIX_ACL). - See the acl(5) manual page and http://acl.bestbits.at/ - for more information. +barrier=<0|1(*)> This enables/disables the use of write barriers in +barrier(*) the jbd code. barrier=0 disables, barrier=1 enables. +nobarrier This also requires an IO stack which can support + barriers, and if jbd gets an error on a barrier + write, it will disable again with a warning. + Write barriers enforce proper on-disk ordering + of journal commits, making volatile disk write caches + safe to use, at some performance penalty. If + your disks are battery-backed in one way or another, + disabling barriers may safely improve performance. + The mount options "barrier" and "nobarrier" can + also be used to enable or disable barriers, for + consistency with other ext4 mount options. + +inode_readahead_blks=n This tuning parameter controls the maximum + number of inode table blocks that ext4's inode + table readahead algorithm will pre-read into + the buffer cache. The default value is 32 blocks. + +nouser_xattr Disables Extended User Attributes. See the + attr(5) manual page and http://acl.bestbits.at/ + for more information about extended attributes. noacl This option disables POSIX Access Control List - support. - -reservation - -noreservation + support. If ACL support is enabled in the kernel + configuration (CONFIG_EXT4_FS_POSIX_ACL), ACL is + enabled by default on mount. See the acl(5) manual + page and http://acl.bestbits.at/ for more information + about acl. bsddf (*) Make 'df' act like BSD. minixdf Make 'df' act like Minix. -check=none Don't do extra checking of bitmaps on mount. -nocheck - debug Extra debugging information is sent to syslog. -errors=remount-ro(*) Remount the filesystem read-only on an error. +abort Simulate the effects of calling ext4_abort() for + debugging purposes. This is normally used while + remounting a filesystem which is already mounted. + +errors=remount-ro Remount the filesystem read-only on an error. errors=continue Keep going on a filesystem error. errors=panic Panic and halt the machine if an error occurs. + (These mount options override the errors behavior + specified in the superblock, which can be configured + using tune2fs) + +data_err=ignore(*) Just print an error message if an error occurs + in a file data buffer in ordered mode. +data_err=abort Abort the journal if an error occurs in a file + data buffer in ordered mode. grpid Give objects the same group ID as their creator. bsdgroups @@ -195,28 +245,149 @@ resuid=n The user ID which may use the reserved blocks. sb=n Use alternate superblock at this location. -quota -noquota -grpquota -usrquota - -bh (*) ext4 associates buffer heads to data pages to -nobh (a) cache disk block mapping information - (b) link pages into transaction to provide - ordering guarantees. - "bh" option forces use of buffer heads. - "nobh" option tries to avoid associating buffer - heads (supported only for "writeback" mode). - -mballoc (*) Use the multiple block allocator for block allocation -nomballoc disabled multiple block allocator for block allocation. +quota These options are ignored by the filesystem. They +noquota are used only by quota tools to recognize volumes +grpquota where quota should be turned on. See documentation +usrquota in the quota-tools package for more details + (http://sourceforge.net/projects/linuxquota). + +jqfmt=<quota type> These options tell filesystem details about quota +usrjquota=<file> so that quota information can be properly updated +grpjquota=<file> during journal replay. They replace the above + quota options. See documentation in the quota-tools + package for more details + (http://sourceforge.net/projects/linuxquota). + stripe=n Number of filesystem blocks that mballoc will try to use for allocation size and alignment. For RAID5/6 systems this should be the number of data disks * RAID chunk size in file system blocks. +delalloc (*) Defer block allocation until just before ext4 + writes out the block(s) in question. This + allows ext4 to better allocation decisions + more efficiently. +nodelalloc Disable delayed allocation. Blocks are allocated + when the data is copied from userspace to the + page cache, either via the write(2) system call + or when an mmap'ed page which was previously + unallocated is written for the first time. + +max_batch_time=usec Maximum amount of time ext4 should wait for + additional filesystem operations to be batch + together with a synchronous write operation. + Since a synchronous write operation is going to + force a commit and then a wait for the I/O + complete, it doesn't cost much, and can be a + huge throughput win, we wait for a small amount + of time to see if any other transactions can + piggyback on the synchronous write. The + algorithm used is designed to automatically tune + for the speed of the disk, by measuring the + amount of time (on average) that it takes to + finish committing a transaction. Call this time + the "commit time". If the time that the + transaction has been running is less than the + commit time, ext4 will try sleeping for the + commit time to see if other operations will join + the transaction. The commit time is capped by + the max_batch_time, which defaults to 15000us + (15ms). This optimization can be turned off + entirely by setting max_batch_time to 0. + +min_batch_time=usec This parameter sets the commit time (as + described above) to be at least min_batch_time. + It defaults to zero microseconds. Increasing + this parameter may improve the throughput of + multi-threaded, synchronous workloads on very + fast disks, at the cost of increasing latency. + +journal_ioprio=prio The I/O priority (from 0 to 7, where 0 is the + highest priority) which should be used for I/O + operations submitted by kjournald2 during a + commit operation. This defaults to 3, which is + a slightly higher priority than the default I/O + priority. + +auto_da_alloc(*) Many broken applications don't use fsync() when +noauto_da_alloc replacing existing files via patterns such as + fd = open("foo.new")/write(fd,..)/close(fd)/ + rename("foo.new", "foo"), or worse yet, + fd = open("foo", O_TRUNC)/write(fd,..)/close(fd). + If auto_da_alloc is enabled, ext4 will detect + the replace-via-rename and replace-via-truncate + patterns and force that any delayed allocation + blocks are allocated such that at the next + journal commit, in the default data=ordered + mode, the data blocks of the new file are forced + to disk before the rename() operation is + committed. This provides roughly the same level + of guarantees as ext3, and avoids the + "zero-length" problem that can happen when a + system crashes before the delayed allocation + blocks are forced to disk. + +noinit_itable Do not initialize any uninitialized inode table + blocks in the background. This feature may be + used by installation CD's so that the install + process can complete as quickly as possible; the + inode table initialization process would then be + deferred until the next time the file system + is unmounted. + +init_itable=n The lazy itable init code will wait n times the + number of milliseconds it took to zero out the + previous block group's inode table. This + minimizes the impact on the system performance + while file system's inode table is being initialized. + +discard Controls whether ext4 should issue discard/TRIM +nodiscard(*) commands to the underlying block device when + blocks are freed. This is useful for SSD devices + and sparse/thinly-provisioned LUNs, but it is off + by default until sufficient testing has been done. + +nouid32 Disables 32-bit UIDs and GIDs. This is for + interoperability with older kernels which only + store and expect 16-bit values. + +block_validity This options allows to enables/disables the in-kernel +noblock_validity facility for tracking filesystem metadata blocks + within internal data structures. This allows multi- + block allocator and other routines to quickly locate + extents which might overlap with filesystem metadata + blocks. This option is intended for debugging + purposes and since it negatively affects the + performance, it is off by default. + +dioread_lock Controls whether or not ext4 should use the DIO read +dioread_nolock locking. If the dioread_nolock option is specified + ext4 will allocate uninitialized extent before buffer + write and convert the extent to initialized after IO + completes. This approach allows ext4 code to avoid + using inode mutex, which improves scalability on high + speed storages. However this does not work with + data journaling and dioread_nolock option will be + ignored with kernel warning. Note that dioread_nolock + code path is only used for extent-based files. + Because of the restrictions this options comprises + it is off by default (e.g. dioread_lock). + +max_dir_size_kb=n This limits the size of directories so that any + attempt to expand them beyond the specified + limit in kilobytes will cause an ENOSPC error. + This is useful in memory constrained + environments, where a very large directory can + cause severe performance problems or even + provoke the Out Of Memory killer. (For example, + if there is only 512mb memory available, a 176mb + directory may seriously cramp the system's style.) + +i_version Enable 64-bit inode version support. This option is + off by default. + Data Mode ---------- +========= There are 3 different data modes: * writeback mode @@ -228,10 +399,10 @@ typically provide the best ext4 performance. * ordered mode In data=ordered mode, ext4 only officially journals metadata, but it logically -groups metadata and data blocks into a single unit called a transaction. When -it's time to write the new metadata out to disk, the associated data blocks -are written first. In general, this mode performs slightly slower than -writeback but significantly faster than journal mode. +groups metadata information related to data changes with the data blocks into a +single unit called a transaction. When it's time to write the new metadata +out to disk, the associated data blocks are written first. In general, +this mode performs slightly slower than writeback but significantly faster than journal mode. * journal mode data=journal mode provides full data and metadata journaling. All new data is @@ -239,7 +410,206 @@ written to the journal first, and then to its final location. In the event of a crash, the journal can be replayed, bringing both data and metadata into a consistent state. This mode is the slowest except when data needs to be read from and written to disk at the same time where it -outperforms all others modes. +outperforms all others modes. Enabling this mode will disable delayed +allocation and O_DIRECT support. + +/proc entries +============= + +Information about mounted ext4 file systems can be found in +/proc/fs/ext4. Each mounted filesystem will have a directory in +/proc/fs/ext4 based on its device name (i.e., /proc/fs/ext4/hdc or +/proc/fs/ext4/dm-0). The files in each per-device directory are shown +in table below. + +Files in /proc/fs/ext4/<devname> +.............................................................................. + File Content + mb_groups details of multiblock allocator buddy cache of free blocks +.............................................................................. + +/sys entries +============ + +Information about mounted ext4 file systems can be found in +/sys/fs/ext4. Each mounted filesystem will have a directory in +/sys/fs/ext4 based on its device name (i.e., /sys/fs/ext4/hdc or +/sys/fs/ext4/dm-0). The files in each per-device directory are shown +in table below. + +Files in /sys/fs/ext4/<devname> +(see also Documentation/ABI/testing/sysfs-fs-ext4) +.............................................................................. + File Content + + delayed_allocation_blocks This file is read-only and shows the number of + blocks that are dirty in the page cache, but + which do not have their location in the + filesystem allocated yet. + + inode_goal Tuning parameter which (if non-zero) controls + the goal inode used by the inode allocator in + preference to all other allocation heuristics. + This is intended for debugging use only, and + should be 0 on production systems. + + inode_readahead_blks Tuning parameter which controls the maximum + number of inode table blocks that ext4's inode + table readahead algorithm will pre-read into + the buffer cache + + lifetime_write_kbytes This file is read-only and shows the number of + kilobytes of data that have been written to this + filesystem since it was created. + + max_writeback_mb_bump The maximum number of megabytes the writeback + code will try to write out before move on to + another inode. + + mb_group_prealloc The multiblock allocator will round up allocation + requests to a multiple of this tuning parameter if + the stripe size is not set in the ext4 superblock + + mb_max_to_scan The maximum number of extents the multiblock + allocator will search to find the best extent + + mb_min_to_scan The minimum number of extents the multiblock + allocator will search to find the best extent + + mb_order2_req Tuning parameter which controls the minimum size + for requests (as a power of 2) where the buddy + cache is used + + mb_stats Controls whether the multiblock allocator should + collect statistics, which are shown during the + unmount. 1 means to collect statistics, 0 means + not to collect statistics + + mb_stream_req Files which have fewer blocks than this tunable + parameter will have their blocks allocated out + of a block group specific preallocation pool, so + that small files are packed closely together. + Each large file will have its blocks allocated + out of its own unique preallocation pool. + + session_write_kbytes This file is read-only and shows the number of + kilobytes of data that have been written to this + filesystem since it was mounted. + + reserved_clusters This is RW file and contains number of reserved + clusters in the file system which will be used + in the specific situations to avoid costly + zeroout, unexpected ENOSPC, or possible data + loss. The default is 2% or 4096 clusters, + whichever is smaller and this can be changed + however it can never exceed number of clusters + in the file system. If there is not enough space + for the reserved space when mounting the file + mount will _not_ fail. +.............................................................................. + +Ioctls +====== + +There is some Ext4 specific functionality which can be accessed by applications +through the system call interfaces. The list of all Ext4 specific ioctls are +shown in the table below. + +Table of Ext4 specific ioctls +.............................................................................. + Ioctl Description + EXT4_IOC_GETFLAGS Get additional attributes associated with inode. + The ioctl argument is an integer bitfield, with + bit values described in ext4.h. This ioctl is an + alias for FS_IOC_GETFLAGS. + + EXT4_IOC_SETFLAGS Set additional attributes associated with inode. + The ioctl argument is an integer bitfield, with + bit values described in ext4.h. This ioctl is an + alias for FS_IOC_SETFLAGS. + + EXT4_IOC_GETVERSION + EXT4_IOC_GETVERSION_OLD + Get the inode i_generation number stored for + each inode. The i_generation number is normally + changed only when new inode is created and it is + particularly useful for network filesystems. The + '_OLD' version of this ioctl is an alias for + FS_IOC_GETVERSION. + + EXT4_IOC_SETVERSION + EXT4_IOC_SETVERSION_OLD + Set the inode i_generation number stored for + each inode. The '_OLD' version of this ioctl + is an alias for FS_IOC_SETVERSION. + + EXT4_IOC_GROUP_EXTEND This ioctl has the same purpose as the resize + mount option. It allows to resize filesystem + to the end of the last existing block group, + further resize has to be done with resize2fs, + either online, or offline. The argument points + to the unsigned logn number representing the + filesystem new block count. + + EXT4_IOC_MOVE_EXT Move the block extents from orig_fd (the one + this ioctl is pointing to) to the donor_fd (the + one specified in move_extent structure passed + as an argument to this ioctl). Then, exchange + inode metadata between orig_fd and donor_fd. + This is especially useful for online + defragmentation, because the allocator has the + opportunity to allocate moved blocks better, + ideally into one contiguous extent. + + EXT4_IOC_GROUP_ADD Add a new group descriptor to an existing or + new group descriptor block. The new group + descriptor is described by ext4_new_group_input + structure, which is passed as an argument to + this ioctl. This is especially useful in + conjunction with EXT4_IOC_GROUP_EXTEND, + which allows online resize of the filesystem + to the end of the last existing block group. + Those two ioctls combined is used in userspace + online resize tool (e.g. resize2fs). + + EXT4_IOC_MIGRATE This ioctl operates on the filesystem itself. + It converts (migrates) ext3 indirect block mapped + inode to ext4 extent mapped inode by walking + through indirect block mapping of the original + inode and converting contiguous block ranges + into ext4 extents of the temporary inode. Then, + inodes are swapped. This ioctl might help, when + migrating from ext3 to ext4 filesystem, however + suggestion is to create fresh ext4 filesystem + and copy data from the backup. Note, that + filesystem has to support extents for this ioctl + to work. + + EXT4_IOC_ALLOC_DA_BLKS Force all of the delay allocated blocks to be + allocated to preserve application-expected ext3 + behaviour. Note that this will also start + triggering a write of the data blocks, but this + behaviour may change in the future as it is + not necessary and has been done this way only + for sake of simplicity. + + EXT4_IOC_RESIZE_FS Resize the filesystem to a new size. The number + of blocks of resized filesystem is passed in via + 64 bit integer argument. The kernel allocates + bitmaps and inode table, the userspace tool thus + just passes the new number of blocks. + +EXT4_IOC_SWAP_BOOT Swap i_blocks and associated attributes + (like i_blocks, i_size, i_flags, ...) from + the specified inode with inode + EXT4_BOOT_LOADER_INO (#5). This is typically + used to store a boot loader in a secure part of + the filesystem, where it can't be changed by a + normal user by accident. + The data blocks of the previous boot loader + will be associated with the given inode. + +.............................................................................. References ========== @@ -248,7 +618,8 @@ kernel source: <file:fs/ext4/> <file:fs/jbd2/> programs: http://e2fsprogs.sourceforge.net/ - http://ext2resize.sourceforge.net useful links: http://fedoraproject.org/wiki/ext3-devel http://www.bullopensource.org/ext4/ + http://ext4.wiki.kernel.org/index.php/Main_Page + http://fedoraproject.org/wiki/Features/Ext4 diff --git a/Documentation/filesystems/f2fs.txt b/Documentation/filesystems/f2fs.txt new file mode 100644 index 00000000000..51afba17bba --- /dev/null +++ b/Documentation/filesystems/f2fs.txt @@ -0,0 +1,545 @@ +================================================================================ +WHAT IS Flash-Friendly File System (F2FS)? +================================================================================ + +NAND flash memory-based storage devices, such as SSD, eMMC, and SD cards, have +been equipped on a variety systems ranging from mobile to server systems. Since +they are known to have different characteristics from the conventional rotating +disks, a file system, an upper layer to the storage device, should adapt to the +changes from the sketch in the design level. + +F2FS is a file system exploiting NAND flash memory-based storage devices, which +is based on Log-structured File System (LFS). The design has been focused on +addressing the fundamental issues in LFS, which are snowball effect of wandering +tree and high cleaning overhead. + +Since a NAND flash memory-based storage device shows different characteristic +according to its internal geometry or flash memory management scheme, namely FTL, +F2FS and its tools support various parameters not only for configuring on-disk +layout, but also for selecting allocation and cleaning algorithms. + +The following git tree provides the file system formatting tool (mkfs.f2fs), +a consistency checking tool (fsck.f2fs), and a debugging tool (dump.f2fs). +>> git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs-tools.git + +For reporting bugs and sending patches, please use the following mailing list: +>> linux-f2fs-devel@lists.sourceforge.net + +================================================================================ +BACKGROUND AND DESIGN ISSUES +================================================================================ + +Log-structured File System (LFS) +-------------------------------- +"A log-structured file system writes all modifications to disk sequentially in +a log-like structure, thereby speeding up both file writing and crash recovery. +The log is the only structure on disk; it contains indexing information so that +files can be read back from the log efficiently. In order to maintain large free +areas on disk for fast writing, we divide the log into segments and use a +segment cleaner to compress the live information from heavily fragmented +segments." from Rosenblum, M. and Ousterhout, J. K., 1992, "The design and +implementation of a log-structured file system", ACM Trans. Computer Systems +10, 1, 26–52. + +Wandering Tree Problem +---------------------- +In LFS, when a file data is updated and written to the end of log, its direct +pointer block is updated due to the changed location. Then the indirect pointer +block is also updated due to the direct pointer block update. In this manner, +the upper index structures such as inode, inode map, and checkpoint block are +also updated recursively. This problem is called as wandering tree problem [1], +and in order to enhance the performance, it should eliminate or relax the update +propagation as much as possible. + +[1] Bityutskiy, A. 2005. JFFS3 design issues. http://www.linux-mtd.infradead.org/ + +Cleaning Overhead +----------------- +Since LFS is based on out-of-place writes, it produces so many obsolete blocks +scattered across the whole storage. In order to serve new empty log space, it +needs to reclaim these obsolete blocks seamlessly to users. This job is called +as a cleaning process. + +The process consists of three operations as follows. +1. A victim segment is selected through referencing segment usage table. +2. It loads parent index structures of all the data in the victim identified by + segment summary blocks. +3. It checks the cross-reference between the data and its parent index structure. +4. It moves valid data selectively. + +This cleaning job may cause unexpected long delays, so the most important goal +is to hide the latencies to users. And also definitely, it should reduce the +amount of valid data to be moved, and move them quickly as well. + +================================================================================ +KEY FEATURES +================================================================================ + +Flash Awareness +--------------- +- Enlarge the random write area for better performance, but provide the high + spatial locality +- Align FS data structures to the operational units in FTL as best efforts + +Wandering Tree Problem +---------------------- +- Use a term, “node”, that represents inodes as well as various pointer blocks +- Introduce Node Address Table (NAT) containing the locations of all the “node” + blocks; this will cut off the update propagation. + +Cleaning Overhead +----------------- +- Support a background cleaning process +- Support greedy and cost-benefit algorithms for victim selection policies +- Support multi-head logs for static/dynamic hot and cold data separation +- Introduce adaptive logging for efficient block allocation + +================================================================================ +MOUNT OPTIONS +================================================================================ + +background_gc=%s Turn on/off cleaning operations, namely garbage + collection, triggered in background when I/O subsystem is + idle. If background_gc=on, it will turn on the garbage + collection and if background_gc=off, garbage collection + will be truned off. + Default value for this option is on. So garbage + collection is on by default. +disable_roll_forward Disable the roll-forward recovery routine +discard Issue discard/TRIM commands when a segment is cleaned. +no_heap Disable heap-style segment allocation which finds free + segments for data from the beginning of main area, while + for node from the end of main area. +nouser_xattr Disable Extended User Attributes. Note: xattr is enabled + by default if CONFIG_F2FS_FS_XATTR is selected. +noacl Disable POSIX Access Control List. Note: acl is enabled + by default if CONFIG_F2FS_FS_POSIX_ACL is selected. +active_logs=%u Support configuring the number of active logs. In the + current design, f2fs supports only 2, 4, and 6 logs. + Default number is 6. +disable_ext_identify Disable the extension list configured by mkfs, so f2fs + does not aware of cold files such as media files. +inline_xattr Enable the inline xattrs feature. +inline_data Enable the inline data feature: New created small(<~3.4k) + files can be written into inode block. +flush_merge Merge concurrent cache_flush commands as much as possible + to eliminate redundant command issues. If the underlying + device handles the cache_flush command relatively slowly, + recommend to enable this option. + +================================================================================ +DEBUGFS ENTRIES +================================================================================ + +/sys/kernel/debug/f2fs/ contains information about all the partitions mounted as +f2fs. Each file shows the whole f2fs information. + +/sys/kernel/debug/f2fs/status includes: + - major file system information managed by f2fs currently + - average SIT information about whole segments + - current memory footprint consumed by f2fs. + +================================================================================ +SYSFS ENTRIES +================================================================================ + +Information about mounted f2f2 file systems can be found in +/sys/fs/f2fs. Each mounted filesystem will have a directory in +/sys/fs/f2fs based on its device name (i.e., /sys/fs/f2fs/sda). +The files in each per-device directory are shown in table below. + +Files in /sys/fs/f2fs/<devname> +(see also Documentation/ABI/testing/sysfs-fs-f2fs) +.............................................................................. + File Content + + gc_max_sleep_time This tuning parameter controls the maximum sleep + time for the garbage collection thread. Time is + in milliseconds. + + gc_min_sleep_time This tuning parameter controls the minimum sleep + time for the garbage collection thread. Time is + in milliseconds. + + gc_no_gc_sleep_time This tuning parameter controls the default sleep + time for the garbage collection thread. Time is + in milliseconds. + + gc_idle This parameter controls the selection of victim + policy for garbage collection. Setting gc_idle = 0 + (default) will disable this option. Setting + gc_idle = 1 will select the Cost Benefit approach + & setting gc_idle = 2 will select the greedy aproach. + + reclaim_segments This parameter controls the number of prefree + segments to be reclaimed. If the number of prefree + segments is larger than the number of segments + in the proportion to the percentage over total + volume size, f2fs tries to conduct checkpoint to + reclaim the prefree segments to free segments. + By default, 5% over total # of segments. + + max_small_discards This parameter controls the number of discard + commands that consist small blocks less than 2MB. + The candidates to be discarded are cached until + checkpoint is triggered, and issued during the + checkpoint. By default, it is disabled with 0. + + ipu_policy This parameter controls the policy of in-place + updates in f2fs. There are five policies: + 0: F2FS_IPU_FORCE, 1: F2FS_IPU_SSR, + 2: F2FS_IPU_UTIL, 3: F2FS_IPU_SSR_UTIL, + 4: F2FS_IPU_DISABLE. + + min_ipu_util This parameter controls the threshold to trigger + in-place-updates. The number indicates percentage + of the filesystem utilization, and used by + F2FS_IPU_UTIL and F2FS_IPU_SSR_UTIL policies. + + max_victim_search This parameter controls the number of trials to + find a victim segment when conducting SSR and + cleaning operations. The default value is 4096 + which covers 8GB block address range. + + dir_level This parameter controls the directory level to + support large directory. If a directory has a + number of files, it can reduce the file lookup + latency by increasing this dir_level value. + Otherwise, it needs to decrease this value to + reduce the space overhead. The default value is 0. + + ram_thresh This parameter controls the memory footprint used + by free nids and cached nat entries. By default, + 10 is set, which indicates 10 MB / 1 GB RAM. + +================================================================================ +USAGE +================================================================================ + +1. Download userland tools and compile them. + +2. Skip, if f2fs was compiled statically inside kernel. + Otherwise, insert the f2fs.ko module. + # insmod f2fs.ko + +3. Create a directory trying to mount + # mkdir /mnt/f2fs + +4. Format the block device, and then mount as f2fs + # mkfs.f2fs -l label /dev/block_device + # mount -t f2fs /dev/block_device /mnt/f2fs + +mkfs.f2fs +--------- +The mkfs.f2fs is for the use of formatting a partition as the f2fs filesystem, +which builds a basic on-disk layout. + +The options consist of: +-l [label] : Give a volume label, up to 512 unicode name. +-a [0 or 1] : Split start location of each area for heap-based allocation. + 1 is set by default, which performs this. +-o [int] : Set overprovision ratio in percent over volume size. + 5 is set by default. +-s [int] : Set the number of segments per section. + 1 is set by default. +-z [int] : Set the number of sections per zone. + 1 is set by default. +-e [str] : Set basic extension list. e.g. "mp3,gif,mov" +-t [0 or 1] : Disable discard command or not. + 1 is set by default, which conducts discard. + +fsck.f2fs +--------- +The fsck.f2fs is a tool to check the consistency of an f2fs-formatted +partition, which examines whether the filesystem metadata and user-made data +are cross-referenced correctly or not. +Note that, initial version of the tool does not fix any inconsistency. + +The options consist of: + -d debug level [default:0] + +dump.f2fs +--------- +The dump.f2fs shows the information of specific inode and dumps SSA and SIT to +file. Each file is dump_ssa and dump_sit. + +The dump.f2fs is used to debug on-disk data structures of the f2fs filesystem. +It shows on-disk inode information reconized by a given inode number, and is +able to dump all the SSA and SIT entries into predefined files, ./dump_ssa and +./dump_sit respectively. + +The options consist of: + -d debug level [default:0] + -i inode no (hex) + -s [SIT dump segno from #1~#2 (decimal), for all 0~-1] + -a [SSA dump segno from #1~#2 (decimal), for all 0~-1] + +Examples: +# dump.f2fs -i [ino] /dev/sdx +# dump.f2fs -s 0~-1 /dev/sdx (SIT dump) +# dump.f2fs -a 0~-1 /dev/sdx (SSA dump) + +================================================================================ +DESIGN +================================================================================ + +On-disk Layout +-------------- + +F2FS divides the whole volume into a number of segments, each of which is fixed +to 2MB in size. A section is composed of consecutive segments, and a zone +consists of a set of sections. By default, section and zone sizes are set to one +segment size identically, but users can easily modify the sizes by mkfs. + +F2FS splits the entire volume into six areas, and all the areas except superblock +consists of multiple segments as described below. + + align with the zone size <-| + |-> align with the segment size + _________________________________________________________________________ + | | | Segment | Node | Segment | | + | Superblock | Checkpoint | Info. | Address | Summary | Main | + | (SB) | (CP) | Table (SIT) | Table (NAT) | Area (SSA) | | + |____________|_____2______|______N______|______N______|______N_____|__N___| + . . + . . + . . + ._________________________________________. + |_Segment_|_..._|_Segment_|_..._|_Segment_| + . . + ._________._________ + |_section_|__...__|_ + . . + .________. + |__zone__| + +- Superblock (SB) + : It is located at the beginning of the partition, and there exist two copies + to avoid file system crash. It contains basic partition information and some + default parameters of f2fs. + +- Checkpoint (CP) + : It contains file system information, bitmaps for valid NAT/SIT sets, orphan + inode lists, and summary entries of current active segments. + +- Segment Information Table (SIT) + : It contains segment information such as valid block count and bitmap for the + validity of all the blocks. + +- Node Address Table (NAT) + : It is composed of a block address table for all the node blocks stored in + Main area. + +- Segment Summary Area (SSA) + : It contains summary entries which contains the owner information of all the + data and node blocks stored in Main area. + +- Main Area + : It contains file and directory data including their indices. + +In order to avoid misalignment between file system and flash-based storage, F2FS +aligns the start block address of CP with the segment size. Also, it aligns the +start block address of Main area with the zone size by reserving some segments +in SSA area. + +Reference the following survey for additional technical details. +https://wiki.linaro.org/WorkingGroups/Kernel/Projects/FlashCardSurvey + +File System Metadata Structure +------------------------------ + +F2FS adopts the checkpointing scheme to maintain file system consistency. At +mount time, F2FS first tries to find the last valid checkpoint data by scanning +CP area. In order to reduce the scanning time, F2FS uses only two copies of CP. +One of them always indicates the last valid data, which is called as shadow copy +mechanism. In addition to CP, NAT and SIT also adopt the shadow copy mechanism. + +For file system consistency, each CP points to which NAT and SIT copies are +valid, as shown as below. + + +--------+----------+---------+ + | CP | SIT | NAT | + +--------+----------+---------+ + . . . . + . . . . + . . . . + +-------+-------+--------+--------+--------+--------+ + | CP #0 | CP #1 | SIT #0 | SIT #1 | NAT #0 | NAT #1 | + +-------+-------+--------+--------+--------+--------+ + | ^ ^ + | | | + `----------------------------------------' + +Index Structure +--------------- + +The key data structure to manage the data locations is a "node". Similar to +traditional file structures, F2FS has three types of node: inode, direct node, +indirect node. F2FS assigns 4KB to an inode block which contains 923 data block +indices, two direct node pointers, two indirect node pointers, and one double +indirect node pointer as described below. One direct node block contains 1018 +data blocks, and one indirect node block contains also 1018 node blocks. Thus, +one inode block (i.e., a file) covers: + + 4KB * (923 + 2 * 1018 + 2 * 1018 * 1018 + 1018 * 1018 * 1018) := 3.94TB. + + Inode block (4KB) + |- data (923) + |- direct node (2) + | `- data (1018) + |- indirect node (2) + | `- direct node (1018) + | `- data (1018) + `- double indirect node (1) + `- indirect node (1018) + `- direct node (1018) + `- data (1018) + +Note that, all the node blocks are mapped by NAT which means the location of +each node is translated by the NAT table. In the consideration of the wandering +tree problem, F2FS is able to cut off the propagation of node updates caused by +leaf data writes. + +Directory Structure +------------------- + +A directory entry occupies 11 bytes, which consists of the following attributes. + +- hash hash value of the file name +- ino inode number +- len the length of file name +- type file type such as directory, symlink, etc + +A dentry block consists of 214 dentry slots and file names. Therein a bitmap is +used to represent whether each dentry is valid or not. A dentry block occupies +4KB with the following composition. + + Dentry Block(4 K) = bitmap (27 bytes) + reserved (3 bytes) + + dentries(11 * 214 bytes) + file name (8 * 214 bytes) + + [Bucket] + +--------------------------------+ + |dentry block 1 | dentry block 2 | + +--------------------------------+ + . . + . . + . [Dentry Block Structure: 4KB] . + +--------+----------+----------+------------+ + | bitmap | reserved | dentries | file names | + +--------+----------+----------+------------+ + [Dentry Block: 4KB] . . + . . + . . + +------+------+-----+------+ + | hash | ino | len | type | + +------+------+-----+------+ + [Dentry Structure: 11 bytes] + +F2FS implements multi-level hash tables for directory structure. Each level has +a hash table with dedicated number of hash buckets as shown below. Note that +"A(2B)" means a bucket includes 2 data blocks. + +---------------------- +A : bucket +B : block +N : MAX_DIR_HASH_DEPTH +---------------------- + +level #0 | A(2B) + | +level #1 | A(2B) - A(2B) + | +level #2 | A(2B) - A(2B) - A(2B) - A(2B) + . | . . . . +level #N/2 | A(2B) - A(2B) - A(2B) - A(2B) - A(2B) - ... - A(2B) + . | . . . . +level #N | A(4B) - A(4B) - A(4B) - A(4B) - A(4B) - ... - A(4B) + +The number of blocks and buckets are determined by, + + ,- 2, if n < MAX_DIR_HASH_DEPTH / 2, + # of blocks in level #n = | + `- 4, Otherwise + + ,- 2^(n + dir_level), + | if n + dir_level < MAX_DIR_HASH_DEPTH / 2, + # of buckets in level #n = | + `- 2^((MAX_DIR_HASH_DEPTH / 2) - 1), + Otherwise + +When F2FS finds a file name in a directory, at first a hash value of the file +name is calculated. Then, F2FS scans the hash table in level #0 to find the +dentry consisting of the file name and its inode number. If not found, F2FS +scans the next hash table in level #1. In this way, F2FS scans hash tables in +each levels incrementally from 1 to N. In each levels F2FS needs to scan only +one bucket determined by the following equation, which shows O(log(# of files)) +complexity. + + bucket number to scan in level #n = (hash value) % (# of buckets in level #n) + +In the case of file creation, F2FS finds empty consecutive slots that cover the +file name. F2FS searches the empty slots in the hash tables of whole levels from +1 to N in the same way as the lookup operation. + +The following figure shows an example of two cases holding children. + --------------> Dir <-------------- + | | + child child + + child - child [hole] - child + + child - child - child [hole] - [hole] - child + + Case 1: Case 2: + Number of children = 6, Number of children = 3, + File size = 7 File size = 7 + +Default Block Allocation +------------------------ + +At runtime, F2FS manages six active logs inside "Main" area: Hot/Warm/Cold node +and Hot/Warm/Cold data. + +- Hot node contains direct node blocks of directories. +- Warm node contains direct node blocks except hot node blocks. +- Cold node contains indirect node blocks +- Hot data contains dentry blocks +- Warm data contains data blocks except hot and cold data blocks +- Cold data contains multimedia data or migrated data blocks + +LFS has two schemes for free space management: threaded log and copy-and-compac- +tion. The copy-and-compaction scheme which is known as cleaning, is well-suited +for devices showing very good sequential write performance, since free segments +are served all the time for writing new data. However, it suffers from cleaning +overhead under high utilization. Contrarily, the threaded log scheme suffers +from random writes, but no cleaning process is needed. F2FS adopts a hybrid +scheme where the copy-and-compaction scheme is adopted by default, but the +policy is dynamically changed to the threaded log scheme according to the file +system status. + +In order to align F2FS with underlying flash-based storage, F2FS allocates a +segment in a unit of section. F2FS expects that the section size would be the +same as the unit size of garbage collection in FTL. Furthermore, with respect +to the mapping granularity in FTL, F2FS allocates each section of the active +logs from different zones as much as possible, since FTL can write the data in +the active logs into one allocation unit according to its mapping granularity. + +Cleaning process +---------------- + +F2FS does cleaning both on demand and in the background. On-demand cleaning is +triggered when there are not enough free segments to serve VFS calls. Background +cleaner is operated by a kernel thread, and triggers the cleaning job when the +system is idle. + +F2FS supports two victim selection policies: greedy and cost-benefit algorithms. +In the greedy algorithm, F2FS selects a victim segment having the smallest number +of valid blocks. In the cost-benefit algorithm, F2FS selects a victim segment +according to the segment age and the number of valid blocks in order to address +log block thrashing problem in the greedy algorithm. F2FS adopts the greedy +algorithm for on-demand cleaner, while background cleaner adopts cost-benefit +algorithm. + +In order to identify whether the data in the victim segment are valid or not, +F2FS manages a bitmap. Each bit represents the validity of a block, and the +bitmap is composed of a bit stream covering whole blocks in main area. diff --git a/Documentation/filesystems/fiemap.txt b/Documentation/filesystems/fiemap.txt new file mode 100644 index 00000000000..1b805a0efbb --- /dev/null +++ b/Documentation/filesystems/fiemap.txt @@ -0,0 +1,228 @@ +============ +Fiemap Ioctl +============ + +The fiemap ioctl is an efficient method for userspace to get file +extent mappings. Instead of block-by-block mapping (such as bmap), fiemap +returns a list of extents. + + +Request Basics +-------------- + +A fiemap request is encoded within struct fiemap: + +struct fiemap { + __u64 fm_start; /* logical offset (inclusive) at + * which to start mapping (in) */ + __u64 fm_length; /* logical length of mapping which + * userspace cares about (in) */ + __u32 fm_flags; /* FIEMAP_FLAG_* flags for request (in/out) */ + __u32 fm_mapped_extents; /* number of extents that were + * mapped (out) */ + __u32 fm_extent_count; /* size of fm_extents array (in) */ + __u32 fm_reserved; + struct fiemap_extent fm_extents[0]; /* array of mapped extents (out) */ +}; + + +fm_start, and fm_length specify the logical range within the file +which the process would like mappings for. Extents returned mirror +those on disk - that is, the logical offset of the 1st returned extent +may start before fm_start, and the range covered by the last returned +extent may end after fm_length. All offsets and lengths are in bytes. + +Certain flags to modify the way in which mappings are looked up can be +set in fm_flags. If the kernel doesn't understand some particular +flags, it will return EBADR and the contents of fm_flags will contain +the set of flags which caused the error. If the kernel is compatible +with all flags passed, the contents of fm_flags will be unmodified. +It is up to userspace to determine whether rejection of a particular +flag is fatal to its operation. This scheme is intended to allow the +fiemap interface to grow in the future but without losing +compatibility with old software. + +fm_extent_count specifies the number of elements in the fm_extents[] array +that can be used to return extents. If fm_extent_count is zero, then the +fm_extents[] array is ignored (no extents will be returned), and the +fm_mapped_extents count will hold the number of extents needed in +fm_extents[] to hold the file's current mapping. Note that there is +nothing to prevent the file from changing between calls to FIEMAP. + +The following flags can be set in fm_flags: + +* FIEMAP_FLAG_SYNC +If this flag is set, the kernel will sync the file before mapping extents. + +* FIEMAP_FLAG_XATTR +If this flag is set, the extents returned will describe the inodes +extended attribute lookup tree, instead of its data tree. + + +Extent Mapping +-------------- + +Extent information is returned within the embedded fm_extents array +which userspace must allocate along with the fiemap structure. The +number of elements in the fiemap_extents[] array should be passed via +fm_extent_count. The number of extents mapped by kernel will be +returned via fm_mapped_extents. If the number of fiemap_extents +allocated is less than would be required to map the requested range, +the maximum number of extents that can be mapped in the fm_extent[] +array will be returned and fm_mapped_extents will be equal to +fm_extent_count. In that case, the last extent in the array will not +complete the requested range and will not have the FIEMAP_EXTENT_LAST +flag set (see the next section on extent flags). + +Each extent is described by a single fiemap_extent structure as +returned in fm_extents. + +struct fiemap_extent { + __u64 fe_logical; /* logical offset in bytes for the start of + * the extent */ + __u64 fe_physical; /* physical offset in bytes for the start + * of the extent */ + __u64 fe_length; /* length in bytes for the extent */ + __u64 fe_reserved64[2]; + __u32 fe_flags; /* FIEMAP_EXTENT_* flags for this extent */ + __u32 fe_reserved[3]; +}; + +All offsets and lengths are in bytes and mirror those on disk. It is valid +for an extents logical offset to start before the request or its logical +length to extend past the request. Unless FIEMAP_EXTENT_NOT_ALIGNED is +returned, fe_logical, fe_physical, and fe_length will be aligned to the +block size of the file system. With the exception of extents flagged as +FIEMAP_EXTENT_MERGED, adjacent extents will not be merged. + +The fe_flags field contains flags which describe the extent returned. +A special flag, FIEMAP_EXTENT_LAST is always set on the last extent in +the file so that the process making fiemap calls can determine when no +more extents are available, without having to call the ioctl again. + +Some flags are intentionally vague and will always be set in the +presence of other more specific flags. This way a program looking for +a general property does not have to know all existing and future flags +which imply that property. + +For example, if FIEMAP_EXTENT_DATA_INLINE or FIEMAP_EXTENT_DATA_TAIL +are set, FIEMAP_EXTENT_NOT_ALIGNED will also be set. A program looking +for inline or tail-packed data can key on the specific flag. Software +which simply cares not to try operating on non-aligned extents +however, can just key on FIEMAP_EXTENT_NOT_ALIGNED, and not have to +worry about all present and future flags which might imply unaligned +data. Note that the opposite is not true - it would be valid for +FIEMAP_EXTENT_NOT_ALIGNED to appear alone. + +* FIEMAP_EXTENT_LAST +This is the last extent in the file. A mapping attempt past this +extent will return nothing. + +* FIEMAP_EXTENT_UNKNOWN +The location of this extent is currently unknown. This may indicate +the data is stored on an inaccessible volume or that no storage has +been allocated for the file yet. + +* FIEMAP_EXTENT_DELALLOC + - This will also set FIEMAP_EXTENT_UNKNOWN. +Delayed allocation - while there is data for this extent, its +physical location has not been allocated yet. + +* FIEMAP_EXTENT_ENCODED +This extent does not consist of plain filesystem blocks but is +encoded (e.g. encrypted or compressed). Reading the data in this +extent via I/O to the block device will have undefined results. + +Note that it is *always* undefined to try to update the data +in-place by writing to the indicated location without the +assistance of the filesystem, or to access the data using the +information returned by the FIEMAP interface while the filesystem +is mounted. In other words, user applications may only read the +extent data via I/O to the block device while the filesystem is +unmounted, and then only if the FIEMAP_EXTENT_ENCODED flag is +clear; user applications must not try reading or writing to the +filesystem via the block device under any other circumstances. + +* FIEMAP_EXTENT_DATA_ENCRYPTED + - This will also set FIEMAP_EXTENT_ENCODED +The data in this extent has been encrypted by the file system. + +* FIEMAP_EXTENT_NOT_ALIGNED +Extent offsets and length are not guaranteed to be block aligned. + +* FIEMAP_EXTENT_DATA_INLINE + This will also set FIEMAP_EXTENT_NOT_ALIGNED +Data is located within a meta data block. + +* FIEMAP_EXTENT_DATA_TAIL + This will also set FIEMAP_EXTENT_NOT_ALIGNED +Data is packed into a block with data from other files. + +* FIEMAP_EXTENT_UNWRITTEN +Unwritten extent - the extent is allocated but its data has not been +initialized. This indicates the extent's data will be all zero if read +through the filesystem but the contents are undefined if read directly from +the device. + +* FIEMAP_EXTENT_MERGED +This will be set when a file does not support extents, i.e., it uses a block +based addressing scheme. Since returning an extent for each block back to +userspace would be highly inefficient, the kernel will try to merge most +adjacent blocks into 'extents'. + + +VFS -> File System Implementation +--------------------------------- + +File systems wishing to support fiemap must implement a ->fiemap callback on +their inode_operations structure. The fs ->fiemap call is responsible for +defining its set of supported fiemap flags, and calling a helper function on +each discovered extent: + +struct inode_operations { + ... + + int (*fiemap)(struct inode *, struct fiemap_extent_info *, u64 start, + u64 len); + +->fiemap is passed struct fiemap_extent_info which describes the +fiemap request: + +struct fiemap_extent_info { + unsigned int fi_flags; /* Flags as passed from user */ + unsigned int fi_extents_mapped; /* Number of mapped extents */ + unsigned int fi_extents_max; /* Size of fiemap_extent array */ + struct fiemap_extent *fi_extents_start; /* Start of fiemap_extent array */ +}; + +It is intended that the file system should not need to access any of this +structure directly. + + +Flag checking should be done at the beginning of the ->fiemap callback via the +fiemap_check_flags() helper: + +int fiemap_check_flags(struct fiemap_extent_info *fieinfo, u32 fs_flags); + +The struct fieinfo should be passed in as received from ioctl_fiemap(). The +set of fiemap flags which the fs understands should be passed via fs_flags. If +fiemap_check_flags finds invalid user flags, it will place the bad values in +fieinfo->fi_flags and return -EBADR. If the file system gets -EBADR, from +fiemap_check_flags(), it should immediately exit, returning that error back to +ioctl_fiemap(). + + +For each extent in the request range, the file system should call +the helper function, fiemap_fill_next_extent(): + +int fiemap_fill_next_extent(struct fiemap_extent_info *info, u64 logical, + u64 phys, u64 len, u32 flags, u32 dev); + +fiemap_fill_next_extent() will use the passed values to populate the +next free extent in the fm_extents array. 'General' extent flags will +automatically be set from specific flags on behalf of the calling file +system so that the userspace API is not broken. + +fiemap_fill_next_extent() returns 0 on success, and 1 when the +user-supplied fm_extents array is full. If an error is encountered +while copying the extent to user memory, -EFAULT will be returned. diff --git a/Documentation/filesystems/files.txt b/Documentation/filesystems/files.txt index bb0142f6108..46dfc6b038c 100644 --- a/Documentation/filesystems/files.txt +++ b/Documentation/filesystems/files.txt @@ -76,13 +76,13 @@ the fdtable structure - 5. Handling of the file structures is special. Since the look-up of the fd (fget()/fget_light()) are lock-free, it is possible that look-up may race with the last put() operation on the - file structure. This is avoided using atomic_inc_not_zero() + file structure. This is avoided using atomic_long_inc_not_zero() on ->f_count : rcu_read_lock(); file = fcheck_files(files, fd); if (file) { - if (atomic_inc_not_zero(&file->f_count)) + if (atomic_long_inc_not_zero(&file->f_count)) *fput_needed = 1; else /* Didn't get the reference, someone's freed */ @@ -92,7 +92,7 @@ the fdtable structure - .... return file; - atomic_inc_not_zero() detects if refcounts is already zero or + atomic_long_inc_not_zero() detects if refcounts is already zero or goes to zero during increment. If it does, we fail fget()/fget_light(). @@ -113,8 +113,8 @@ the fdtable structure - if (fd >= 0) { /* locate_fd() may have expanded fdtable, load the ptr */ fdt = files_fdtable(files); - FD_SET(fd, fdt->open_fds); - FD_CLR(fd, fdt->close_on_exec); + __set_open_fd(fd, fdt); + __clear_close_on_exec(fd, fdt); spin_unlock(&files->file_lock); ..... diff --git a/Documentation/filesystems/fuse.txt b/Documentation/filesystems/fuse.txt index 397a41adb4c..13af4a49e7d 100644 --- a/Documentation/filesystems/fuse.txt +++ b/Documentation/filesystems/fuse.txt @@ -91,7 +91,7 @@ Mount options 'default_permissions' By default FUSE doesn't check file access permissions, the - filesystem is free to implement it's access policy or leave it to + filesystem is free to implement its access policy or leave it to the underlying file access mechanism (e.g. in case of network filesystems). This option enables permission checking, restricting access based on file mode. It is usually useful together with the @@ -171,7 +171,7 @@ or may honor them by sending a reply to the _original_ request, with the error set to EINTR. It is also possible that there's a race between processing the -original request and it's INTERRUPT request. There are two possibilities: +original request and its INTERRUPT request. There are two possibilities: 1) The INTERRUPT request is processed before the original request is processed diff --git a/Documentation/filesystems/gfs2-glocks.txt b/Documentation/filesystems/gfs2-glocks.txt new file mode 100644 index 00000000000..fcc79957be6 --- /dev/null +++ b/Documentation/filesystems/gfs2-glocks.txt @@ -0,0 +1,231 @@ + Glock internal locking rules + ------------------------------ + +This documents the basic principles of the glock state machine +internals. Each glock (struct gfs2_glock in fs/gfs2/incore.h) +has two main (internal) locks: + + 1. A spinlock (gl_spin) which protects the internal state such + as gl_state, gl_target and the list of holders (gl_holders) + 2. A non-blocking bit lock, GLF_LOCK, which is used to prevent other + threads from making calls to the DLM, etc. at the same time. If a + thread takes this lock, it must then call run_queue (usually via the + workqueue) when it releases it in order to ensure any pending tasks + are completed. + +The gl_holders list contains all the queued lock requests (not +just the holders) associated with the glock. If there are any +held locks, then they will be contiguous entries at the head +of the list. Locks are granted in strictly the order that they +are queued, except for those marked LM_FLAG_PRIORITY which are +used only during recovery, and even then only for journal locks. + +There are three lock states that users of the glock layer can request, +namely shared (SH), deferred (DF) and exclusive (EX). Those translate +to the following DLM lock modes: + +Glock mode | DLM lock mode +------------------------------ + UN | IV/NL Unlocked (no DLM lock associated with glock) or NL + SH | PR (Protected read) + DF | CW (Concurrent write) + EX | EX (Exclusive) + +Thus DF is basically a shared mode which is incompatible with the "normal" +shared lock mode, SH. In GFS2 the DF mode is used exclusively for direct I/O +operations. The glocks are basically a lock plus some routines which deal +with cache management. The following rules apply for the cache: + +Glock mode | Cache data | Cache Metadata | Dirty Data | Dirty Metadata +-------------------------------------------------------------------------- + UN | No | No | No | No + SH | Yes | Yes | No | No + DF | No | Yes | No | No + EX | Yes | Yes | Yes | Yes + +These rules are implemented using the various glock operations which +are defined for each type of glock. Not all types of glocks use +all the modes. Only inode glocks use the DF mode for example. + +Table of glock operations and per type constants: + +Field | Purpose +---------------------------------------------------------------------------- +go_xmote_th | Called before remote state change (e.g. to sync dirty data) +go_xmote_bh | Called after remote state change (e.g. to refill cache) +go_inval | Called if remote state change requires invalidating the cache +go_demote_ok | Returns boolean value of whether its ok to demote a glock + | (e.g. checks timeout, and that there is no cached data) +go_lock | Called for the first local holder of a lock +go_unlock | Called on the final local unlock of a lock +go_dump | Called to print content of object for debugfs file, or on + | error to dump glock to the log. +go_type | The type of the glock, LM_TYPE_..... +go_callback | Called if the DLM sends a callback to drop this lock +go_flags | GLOF_ASPACE is set, if the glock has an address space + | associated with it + +The minimum hold time for each lock is the time after a remote lock +grant for which we ignore remote demote requests. This is in order to +prevent a situation where locks are being bounced around the cluster +from node to node with none of the nodes making any progress. This +tends to show up most with shared mmaped files which are being written +to by multiple nodes. By delaying the demotion in response to a +remote callback, that gives the userspace program time to make +some progress before the pages are unmapped. + +There is a plan to try and remove the go_lock and go_unlock callbacks +if possible, in order to try and speed up the fast path though the locking. +Also, eventually we hope to make the glock "EX" mode locally shared +such that any local locking will be done with the i_mutex as required +rather than via the glock. + +Locking rules for glock operations: + +Operation | GLF_LOCK bit lock held | gl_spin spinlock held +----------------------------------------------------------------- +go_xmote_th | Yes | No +go_xmote_bh | Yes | No +go_inval | Yes | No +go_demote_ok | Sometimes | Yes +go_lock | Yes | No +go_unlock | Yes | No +go_dump | Sometimes | Yes +go_callback | Sometimes (N/A) | Yes + +N.B. Operations must not drop either the bit lock or the spinlock +if its held on entry. go_dump and do_demote_ok must never block. +Note that go_dump will only be called if the glock's state +indicates that it is caching uptodate data. + +Glock locking order within GFS2: + + 1. i_mutex (if required) + 2. Rename glock (for rename only) + 3. Inode glock(s) + (Parents before children, inodes at "same level" with same parent in + lock number order) + 4. Rgrp glock(s) (for (de)allocation operations) + 5. Transaction glock (via gfs2_trans_begin) for non-read operations + 6. Page lock (always last, very important!) + +There are two glocks per inode. One deals with access to the inode +itself (locking order as above), and the other, known as the iopen +glock is used in conjunction with the i_nlink field in the inode to +determine the lifetime of the inode in question. Locking of inodes +is on a per-inode basis. Locking of rgrps is on a per rgrp basis. +In general we prefer to lock local locks prior to cluster locks. + + Glock Statistics + ------------------ + +The stats are divided into two sets: those relating to the +super block and those relating to an individual glock. The +super block stats are done on a per cpu basis in order to +try and reduce the overhead of gathering them. They are also +further divided by glock type. All timings are in nanoseconds. + +In the case of both the super block and glock statistics, +the same information is gathered in each case. The super +block timing statistics are used to provide default values for +the glock timing statistics, so that newly created glocks +should have, as far as possible, a sensible starting point. +The per-glock counters are initialised to zero when the +glock is created. The per-glock statistics are lost when +the glock is ejected from memory. + +The statistics are divided into three pairs of mean and +variance, plus two counters. The mean/variance pairs are +smoothed exponential estimates and the algorithm used is +one which will be very familiar to those used to calculation +of round trip times in network code. See "TCP/IP Illustrated, +Volume 1", W. Richard Stevens, sect 21.3, "Round-Trip Time Measurement", +p. 299 and onwards. Also, Volume 2, Sect. 25.10, p. 838 and onwards. +Unlike the TCP/IP Illustrated case, the mean and variance are +not scaled, but are in units of integer nanoseconds. + +The three pairs of mean/variance measure the following +things: + + 1. DLM lock time (non-blocking requests) + 2. DLM lock time (blocking requests) + 3. Inter-request time (again to the DLM) + +A non-blocking request is one which will complete right +away, whatever the state of the DLM lock in question. That +currently means any requests when (a) the current state of +the lock is exclusive, i.e. a lock demotion (b) the requested +state is either null or unlocked (again, a demotion) or (c) the +"try lock" flag is set. A blocking request covers all the other +lock requests. + +There are two counters. The first is there primarily to show +how many lock requests have been made, and thus how much data +has gone into the mean/variance calculations. The other counter +is counting queuing of holders at the top layer of the glock +code. Hopefully that number will be a lot larger than the number +of dlm lock requests issued. + +So why gather these statistics? There are several reasons +we'd like to get a better idea of these timings: + +1. To be able to better set the glock "min hold time" +2. To spot performance issues more easily +3. To improve the algorithm for selecting resource groups for +allocation (to base it on lock wait time, rather than blindly +using a "try lock") + +Due to the smoothing action of the updates, a step change in +some input quantity being sampled will only fully be taken +into account after 8 samples (or 4 for the variance) and this +needs to be carefully considered when interpreting the +results. + +Knowing both the time it takes a lock request to complete and +the average time between lock requests for a glock means we +can compute the total percentage of the time for which the +node is able to use a glock vs. time that the rest of the +cluster has its share. That will be very useful when setting +the lock min hold time. + +Great care has been taken to ensure that we +measure exactly the quantities that we want, as accurately +as possible. There are always inaccuracies in any +measuring system, but I hope this is as accurate as we +can reasonably make it. + +Per sb stats can be found here: +/sys/kernel/debug/gfs2/<fsname>/sbstats +Per glock stats can be found here: +/sys/kernel/debug/gfs2/<fsname>/glstats + +Assuming that debugfs is mounted on /sys/kernel/debug and also +that <fsname> is replaced with the name of the gfs2 filesystem +in question. + +The abbreviations used in the output as are follows: + +srtt - Smoothed round trip time for non-blocking dlm requests +srttvar - Variance estimate for srtt +srttb - Smoothed round trip time for (potentially) blocking dlm requests +srttvarb - Variance estimate for srttb +sirt - Smoothed inter-request time (for dlm requests) +sirtvar - Variance estimate for sirt +dlm - Number of dlm requests made (dcnt in glstats file) +queue - Number of glock requests queued (qcnt in glstats file) + +The sbstats file contains a set of these stats for each glock type (so 8 lines +for each type) and for each cpu (one column per cpu). The glstats file contains +a set of these stats for each glock in a similar format to the glocks file, but +using the format mean/variance for each of the timing stats. + +The gfs2_glock_lock_time tracepoint prints out the current values of the stats +for the glock in question, along with some addition information on each dlm +reply that is received: + +status - The status of the dlm request +flags - The dlm request flags +tdiff - The time taken by this specific request +(remaining fields as per above list) + + diff --git a/Documentation/filesystems/gfs2-uevents.txt b/Documentation/filesystems/gfs2-uevents.txt new file mode 100644 index 00000000000..19a19ebebc3 --- /dev/null +++ b/Documentation/filesystems/gfs2-uevents.txt @@ -0,0 +1,100 @@ + uevents and GFS2 + ================== + +During the lifetime of a GFS2 mount, a number of uevents are generated. +This document explains what the events are and what they are used +for (by gfs_controld in gfs2-utils). + +A list of GFS2 uevents +----------------------- + +1. ADD + +The ADD event occurs at mount time. It will always be the first +uevent generated by the newly created filesystem. If the mount +is successful, an ONLINE uevent will follow. If it is not successful +then a REMOVE uevent will follow. + +The ADD uevent has two environment variables: SPECTATOR=[0|1] +and RDONLY=[0|1] that specify the spectator status (a read-only mount +with no journal assigned), and read-only (with journal assigned) status +of the filesystem respectively. + +2. ONLINE + +The ONLINE uevent is generated after a successful mount or remount. It +has the same environment variables as the ADD uevent. The ONLINE +uevent, along with the two environment variables for spectator and +RDONLY are a relatively recent addition (2.6.32-rc+) and will not +be generated by older kernels. + +3. CHANGE + +The CHANGE uevent is used in two places. One is when reporting the +successful mount of the filesystem by the first node (FIRSTMOUNT=Done). +This is used as a signal by gfs_controld that it is then ok for other +nodes in the cluster to mount the filesystem. + +The other CHANGE uevent is used to inform of the completion +of journal recovery for one of the filesystems journals. It has +two environment variables, JID= which specifies the journal id which +has just been recovered, and RECOVERY=[Done|Failed] to indicate the +success (or otherwise) of the operation. These uevents are generated +for every journal recovered, whether it is during the initial mount +process or as the result of gfs_controld requesting a specific journal +recovery via the /sys/fs/gfs2/<fsname>/lock_module/recovery file. + +Because the CHANGE uevent was used (in early versions of gfs_controld) +without checking the environment variables to discover the state, we +cannot add any more functions to it without running the risk of +someone using an older version of the user tools and breaking their +cluster. For this reason the ONLINE uevent was used when adding a new +uevent for a successful mount or remount. + +4. OFFLINE + +The OFFLINE uevent is only generated due to filesystem errors and is used +as part of the "withdraw" mechanism. Currently this doesn't give any +information about what the error is, which is something that needs to +be fixed. + +5. REMOVE + +The REMOVE uevent is generated at the end of an unsuccessful mount +or at the end of a umount of the filesystem. All REMOVE uevents will +have been preceded by at least an ADD uevent for the same filesystem, +and unlike the other uevents is generated automatically by the kernel's +kobject subsystem. + + +Information common to all GFS2 uevents (uevent environment variables) +---------------------------------------------------------------------- + +1. LOCKTABLE= + +The LOCKTABLE is a string, as supplied on the mount command +line (locktable=) or via fstab. It is used as a filesystem label +as well as providing the information for a lock_dlm mount to be +able to join the cluster. + +2. LOCKPROTO= + +The LOCKPROTO is a string, and its value depends on what is set +on the mount command line, or via fstab. It will be either +lock_nolock or lock_dlm. In the future other lock managers +may be supported. + +3. JOURNALID= + +If a journal is in use by the filesystem (journals are not +assigned for spectator mounts) then this will give the +numeric journal id in all GFS2 uevents. + +4. UUID= + +With recent versions of gfs2-utils, mkfs.gfs2 writes a UUID +into the filesystem superblock. If it exists, this will +be included in every uevent relating to the filesystem. + + + diff --git a/Documentation/filesystems/gfs2.txt b/Documentation/filesystems/gfs2.txt index 593004b6bba..cc4f2306609 100644 --- a/Documentation/filesystems/gfs2.txt +++ b/Documentation/filesystems/gfs2.txt @@ -1,7 +1,7 @@ Global File System ------------------ -http://sources.redhat.com/cluster/ +https://fedorahosted.org/cluster/wiki/HomePage GFS is a cluster file system. It allows a cluster of computers to simultaneously use a block device that is shared between them (with FC, @@ -11,18 +11,15 @@ their I/O so file system consistency is maintained. One of the nifty features of GFS is perfect consistency -- changes made to the file system on one machine show up immediately on all other machines in the cluster. -GFS uses interchangable inter-node locking mechanisms. Different lock -modules can plug into GFS and each file system selects the appropriate -lock module at mount time. Lock modules include: +GFS uses interchangeable inter-node locking mechanisms, the currently +supported mechanisms are: lock_nolock -- allows gfs to be used as a local file system lock_dlm -- uses a distributed lock manager (dlm) for inter-node locking The dlm is found at linux/fs/dlm/ -In addition to interfacing with an external locking manager, a gfs lock -module is responsible for interacting with external cluster management -systems. Lock_dlm depends on user space cluster management systems found +Lock_dlm depends on user space cluster management systems found at the URL above. To use gfs as a local file system, no external clustering systems are @@ -31,13 +28,18 @@ needed, simply: $ mkfs -t gfs2 -p lock_nolock -j 1 /dev/block_device $ mount -t gfs2 /dev/block_device /dir -GFS2 is not on-disk compatible with previous versions of GFS. +If you are using Fedora, you need to install the gfs2-utils package +and, for lock_dlm, you will also need to install the cman package +and write a cluster.conf as per the documentation. For F17 and above +cman has been replaced by the dlm package. + +GFS2 is not on-disk compatible with previous versions of GFS, but it +is pretty close. The following man pages can be found at the URL above: - gfs2_fsck to repair a filesystem - gfs2_grow to expand a filesystem online - gfs2_jadd to add journals to a filesystem online - gfs2_tool to manipulate, examine and tune a filesystem - gfs2_quota to examine and change quota values in a filesystem - mount.gfs2 to help mount(8) mount a filesystem - mkfs.gfs2 to make a filesystem + fsck.gfs2 to repair a filesystem + gfs2_grow to expand a filesystem online + gfs2_jadd to add journals to a filesystem online + tunegfs2 to manipulate, examine and tune a filesystem + gfs2_convert to convert a gfs filesystem to gfs2 in-place + mkfs.gfs2 to make a filesystem diff --git a/Documentation/filesystems/hfs.txt b/Documentation/filesystems/hfs.txt index bd0fa770403..d096df6db07 100644 --- a/Documentation/filesystems/hfs.txt +++ b/Documentation/filesystems/hfs.txt @@ -1,3 +1,4 @@ +Note: This filesystem doesn't have a maintainer. Macintosh HFS Filesystem for Linux ================================== @@ -76,8 +77,6 @@ hformat that can be used to create HFS filesystem. See Credits ======= -The HFS drivers was written by Paul H. Hargrovea (hargrove@sccm.Stanford.EDU) -and is now maintained by Roman Zippel (roman@ardistech.com) at Ardis -Technologies. -Roman rewrote large parts of the code and brought in btree routines derived -from Brad Boyer's hfsplus driver (also maintained by Roman now). +The HFS drivers was written by Paul H. Hargrovea (hargrove@sccm.Stanford.EDU). +Roman Zippel (roman@ardistech.com) rewrote large parts of the code and brought +in btree routines derived from Brad Boyer's hfsplus driver. diff --git a/Documentation/filesystems/hfsplus.txt b/Documentation/filesystems/hfsplus.txt index af1628a1061..59f7569fc9e 100644 --- a/Documentation/filesystems/hfsplus.txt +++ b/Documentation/filesystems/hfsplus.txt @@ -56,4 +56,4 @@ References kernel source: <file:fs/hfsplus> -Apple Technote 1150 http://developer.apple.com/technotes/tn/tn1150.html +Apple Technote 1150 https://developer.apple.com/legacy/library/technotes/tn/tn1150.html diff --git a/Documentation/filesystems/hpfs.txt b/Documentation/filesystems/hpfs.txt index fa45c3baed9..74630bd504f 100644 --- a/Documentation/filesystems/hpfs.txt +++ b/Documentation/filesystems/hpfs.txt @@ -103,7 +103,7 @@ to analyze or change OS2SYS.INI. Codepages HPFS can contain several uppercasing tables for several codepages and each -file has a pointer to codepage it's name is in. However OS/2 was created in +file has a pointer to codepage its name is in. However OS/2 was created in America where people don't care much about codepages and so multiple codepages support is quite buggy. I have Czech OS/2 working in codepage 852 on my disk. Once I booted English OS/2 working in cp 850 and I created a file on my 852 diff --git a/Documentation/filesystems/inotify.txt b/Documentation/filesystems/inotify.txt index 59a919f1614..cfd02712b83 100644 --- a/Documentation/filesystems/inotify.txt +++ b/Documentation/filesystems/inotify.txt @@ -194,7 +194,8 @@ associated with the inotify_handle, and on which events are queued. Each watch is associated with an inotify_watch structure. Watches are chained off of each associated inotify_handle and each associated inode. -See fs/inotify.c and fs/inotify_user.c for the locking and lifetime rules. +See fs/notify/inotify/inotify_fsnotify.c and fs/notify/inotify/inotify_user.c +for the locking and lifetime rules. (vi) Rationale diff --git a/Documentation/filesystems/isofs.txt b/Documentation/filesystems/isofs.txt index 6973b980ca2..ba0a93384de 100644 --- a/Documentation/filesystems/isofs.txt +++ b/Documentation/filesystems/isofs.txt @@ -23,8 +23,13 @@ Mount options unique to the isofs filesystem. map=off Do not map non-Rock Ridge filenames to lower case map=normal Map non-Rock Ridge filenames to lower case map=acorn As map=normal but also apply Acorn extensions if present - mode=xxx Sets the permissions on files to xxx - dmode=xxx Sets the permissions on directories to xxx + mode=xxx Sets the permissions on files to xxx unless Rock Ridge + extensions set the permissions otherwise + dmode=xxx Sets the permissions on directories to xxx unless Rock Ridge + extensions set the permissions otherwise + overriderockperm Set permissions on files and directories according to + 'mode' and 'dmode' even though Rock Ridge extensions are + present. nojoliet Ignore Joliet extensions if they are present. norock Ignore Rock Ridge extensions if they are present. hide Completely strip hidden files from the file system. @@ -36,7 +41,7 @@ Mount options unique to the isofs filesystem. sbsector=xxx Session begins from sector xxx Recommended documents about ISO 9660 standard are located at: -http://www.y-adagio.com/public/standards/iso_cdromr/tocont.htm +http://www.y-adagio.com/ ftp://ftp.ecma.ch/ecma-st/Ecma-119.pdf Quoting from the PDF "This 2nd Edition of Standard ECMA-119 is technically identical with ISO 9660.", so it is a valid and gratis substitute of the diff --git a/Documentation/filesystems/jfs.txt b/Documentation/filesystems/jfs.txt index 26ebde77e82..41fd757997b 100644 --- a/Documentation/filesystems/jfs.txt +++ b/Documentation/filesystems/jfs.txt @@ -3,6 +3,7 @@ IBM's Journaled File System (JFS) for Linux JFS Homepage: http://jfs.sourceforge.net/ The following mount options are supported: +(*) == default iocharset=name Character set to use for converting from Unicode to ASCII. The default is to do no conversion. Use @@ -21,12 +22,12 @@ nointegrity Do not write to the journal. The primary use of this option from backup media. The integrity of the volume is not guaranteed if the system abnormally abends. -integrity Default. Commit metadata changes to the journal. Use this - option to remount a volume where the nointegrity option was +integrity(*) Commit metadata changes to the journal. Use this option to + remount a volume where the nointegrity option was previously specified in order to restore normal behavior. errors=continue Keep going on a filesystem error. -errors=remount-ro Default. Remount the filesystem read-only on an error. +errors=remount-ro(*) Remount the filesystem read-only on an error. errors=panic Panic and halt the machine if an error occurs. uid=value Override on-disk uid with specified value @@ -35,7 +36,17 @@ umask=value Override on-disk umask with specified octal value. For directories, the execute bit will be set if the corresponding read bit is set. -Please send bugs, comments, cards and letters to shaggy@linux.vnet.ibm.com. +discard=minlen This enables/disables the use of discard/TRIM commands. +discard The discard/TRIM commands are sent to the underlying +nodiscard(*) block device when blocks are freed. This is useful for SSD + devices and sparse/thinly-provisioned LUNs. The FITRIM ioctl + command is also available together with the nodiscard option. + The value of minlen specifies the minimum blockcount, when + a TRIM command to the block device is considered useful. + When no value is given to the discard option, it defaults to + 64 blocks, which means 256KiB in JFS. + The minlen value of discard overrides the minlen value given + on an FITRIM ioctl(). The JFS mailing list can be subscribed to by using the link labeled "Mail list Subscribe" at our web page http://jfs.sourceforge.net/ diff --git a/Documentation/filesystems/locks.txt b/Documentation/filesystems/locks.txt index fab857accbd..2cf81082581 100644 --- a/Documentation/filesystems/locks.txt +++ b/Documentation/filesystems/locks.txt @@ -53,11 +53,12 @@ fcntl(), with all the problems that implies. 1.3 Mandatory Locking As A Mount Option --------------------------------------- -Mandatory locking, as described in 'Documentation/filesystems/mandatory.txt' -was prior to this release a general configuration option that was valid for -all mounted filesystems. This had a number of inherent dangers, not the -least of which was the ability to freeze an NFS server by asking it to read -a file for which a mandatory lock existed. +Mandatory locking, as described in +'Documentation/filesystems/mandatory-locking.txt' was prior to this release a +general configuration option that was valid for all mounted filesystems. This +had a number of inherent dangers, not the least of which was the ability to +freeze an NFS server by asking it to read a file for which a mandatory lock +existed. From this release of the kernel, mandatory locking can be turned on and off on a per-filesystem basis, using the mount options 'mand' and 'nomand'. diff --git a/Documentation/filesystems/logfs.txt b/Documentation/filesystems/logfs.txt new file mode 100644 index 00000000000..bca42c22a14 --- /dev/null +++ b/Documentation/filesystems/logfs.txt @@ -0,0 +1,241 @@ + +The LogFS Flash Filesystem +========================== + +Specification +============= + +Superblocks +----------- + +Two superblocks exist at the beginning and end of the filesystem. +Each superblock is 256 Bytes large, with another 3840 Bytes reserved +for future purposes, making a total of 4096 Bytes. + +Superblock locations may differ for MTD and block devices. On MTD the +first non-bad block contains a superblock in the first 4096 Bytes and +the last non-bad block contains a superblock in the last 4096 Bytes. +On block devices, the first 4096 Bytes of the device contain the first +superblock and the last aligned 4096 Byte-block contains the second +superblock. + +For the most part, the superblocks can be considered read-only. They +are written only to correct errors detected within the superblocks, +move the journal and change the filesystem parameters through tunefs. +As a result, the superblock does not contain any fields that require +constant updates, like the amount of free space, etc. + +Segments +-------- + +The space in the device is split up into equal-sized segments. +Segments are the primary write unit of LogFS. Within each segments, +writes happen from front (low addresses) to back (high addresses. If +only a partial segment has been written, the segment number, the +current position within and optionally a write buffer are stored in +the journal. + +Segments are erased as a whole. Therefore Garbage Collection may be +required to completely free a segment before doing so. + +Journal +-------- + +The journal contains all global information about the filesystem that +is subject to frequent change. At mount time, it has to be scanned +for the most recent commit entry, which contains a list of pointers to +all currently valid entries. + +Object Store +------------ + +All space except for the superblocks and journal is part of the object +store. Each segment contains a segment header and a number of +objects, each consisting of the object header and the payload. +Objects are either inodes, directory entries (dentries), file data +blocks or indirect blocks. + +Levels +------ + +Garbage collection (GC) may fail if all data is written +indiscriminately. One requirement of GC is that data is separated +roughly according to the distance between the tree root and the data. +Effectively that means all file data is on level 0, indirect blocks +are on levels 1, 2, 3 4 or 5 for 1x, 2x, 3x, 4x or 5x indirect blocks, +respectively. Inode file data is on level 6 for the inodes and 7-11 +for indirect blocks. + +Each segment contains objects of a single level only. As a result, +each level requires its own separate segment to be open for writing. + +Inode File +---------- + +All inodes are stored in a special file, the inode file. Single +exception is the inode file's inode (master inode) which for obvious +reasons is stored in the journal instead. Instead of data blocks, the +leaf nodes of the inode files are inodes. + +Aliases +------- + +Writes in LogFS are done by means of a wandering tree. A naïve +implementation would require that for each write or a block, all +parent blocks are written as well, since the block pointers have +changed. Such an implementation would not be very efficient. + +In LogFS, the block pointer changes are cached in the journal by means +of alias entries. Each alias consists of its logical address - inode +number, block index, level and child number (index into block) - and +the changed data. Any 8-byte word can be changes in this manner. + +Currently aliases are used for block pointers, file size, file used +bytes and the height of an inodes indirect tree. + +Segment Aliases +--------------- + +Related to regular aliases, these are used to handle bad blocks. +Initially, bad blocks are handled by moving the affected segment +content to a spare segment and noting this move in the journal with a +segment alias, a simple (to, from) tupel. GC will later empty this +segment and the alias can be removed again. This is used on MTD only. + +Vim +--- + +By cleverly predicting the life time of data, it is possible to +separate long-living data from short-living data and thereby reduce +the GC overhead later. Each type of distinc life expectency (vim) can +have a separate segment open for writing. Each (level, vim) tupel can +be open just once. If an open segment with unknown vim is encountered +at mount time, it is closed and ignored henceforth. + +Indirect Tree +------------- + +Inodes in LogFS are similar to FFS-style filesystems with direct and +indirect block pointers. One difference is that LogFS uses a single +indirect pointer that can be either a 1x, 2x, etc. indirect pointer. +A height field in the inode defines the height of the indirect tree +and thereby the indirection of the pointer. + +Another difference is the addressing of indirect blocks. In LogFS, +the first 16 pointers in the first indirect block are left empty, +corresponding to the 16 direct pointers in the inode. In ext2 (maybe +others as well) the first pointer in the first indirect block +corresponds to logical block 12, skipping the 12 direct pointers. +So where ext2 is using arithmetic to better utilize space, LogFS keeps +arithmetic simple and uses compression to save space. + +Compression +----------- + +Both file data and metadata can be compressed. Compression for file +data can be enabled with chattr +c and disabled with chattr -c. Doing +so has no effect on existing data, but new data will be stored +accordingly. New inodes will inherit the compression flag of the +parent directory. + +Metadata is always compressed. However, the space accounting ignores +this and charges for the uncompressed size. Failing to do so could +result in GC failures when, after moving some data, indirect blocks +compress worse than previously. Even on a 100% full medium, GC may +not consume any extra space, so the compression gains are lost space +to the user. + +However, they are not lost space to the filesystem internals. By +cheating the user for those bytes, the filesystem gained some slack +space and GC will run less often and faster. + +Garbage Collection and Wear Leveling +------------------------------------ + +Garbage collection is invoked whenever the number of free segments +falls below a threshold. The best (known) candidate is picked based +on the least amount of valid data contained in the segment. All +remaining valid data is copied elsewhere, thereby invalidating it. + +The GC code also checks for aliases and writes then back if their +number gets too large. + +Wear leveling is done by occasionally picking a suboptimal segment for +garbage collection. If a stale segments erase count is significantly +lower than the active segments' erase counts, it will be picked. Wear +leveling is rate limited, so it will never monopolize the device for +more than one segment worth at a time. + +Values for "occasionally", "significantly lower" are compile time +constants. + +Hashed directories +------------------ + +To satisfy efficient lookup(), directory entries are hashed and +located based on the hash. In order to both support large directories +and not be overly inefficient for small directories, several hash +tables of increasing size are used. For each table, the hash value +modulo the table size gives the table index. + +Tables sizes are chosen to limit the number of indirect blocks with a +fully populated table to 0, 1, 2 or 3 respectively. So the first +table contains 16 entries, the second 512-16, etc. + +The last table is special in several ways. First its size depends on +the effective 32bit limit on telldir/seekdir cookies. Since logfs +uses the upper half of the address space for indirect blocks, the size +is limited to 2^31. Secondly the table contains hash buckets with 16 +entries each. + +Using single-entry buckets would result in birthday "attacks". At +just 2^16 used entries, hash collisions would be likely (P >= 0.5). +My math skills are insufficient to do the combinatorics for the 17x +collisions necessary to overflow a bucket, but testing showed that in +10,000 runs the lowest directory fill before a bucket overflow was +188,057,130 entries with an average of 315,149,915 entries. So for +directory sizes of up to a million, bucket overflows should be +virtually impossible under normal circumstances. + +With carefully chosen filenames, it is obviously possible to cause an +overflow with just 21 entries (4 higher tables + 16 entries + 1). So +there may be a security concern if a malicious user has write access +to a directory. + +Open For Discussion +=================== + +Device Address Space +-------------------- + +A device address space is used for caching. Both block devices and +MTD provide functions to either read a single page or write a segment. +Partial segments may be written for data integrity, but where possible +complete segments are written for performance on simple block device +flash media. + +Meta Inodes +----------- + +Inodes are stored in the inode file, which is just a regular file for +most purposes. At umount time, however, the inode file needs to +remain open until all dirty inodes are written. So +generic_shutdown_super() may not close this inode, but shouldn't +complain about remaining inodes due to the inode file either. Same +goes for mapping inode of the device address space. + +Currently logfs uses a hack that essentially copies part of fs/inode.c +code over. A general solution would be preferred. + +Indirect block mapping +---------------------- + +With compression, the block device (or mapping inode) cannot be used +to cache indirect blocks. Some other place is required. Currently +logfs uses the top half of each inode's address space. The low 8TB +(on 32bit) are filled with file data, the high 8TB are used for +indirect blocks. + +One problem is that 16TB files created on 64bit systems actually have +data in the top 8TB. But files >16TB would cause problems anyway, so +only the limit has changed. diff --git a/Documentation/filesystems/ncpfs.txt b/Documentation/filesystems/ncpfs.txt index f12c30c93f2..5af164f4b37 100644 --- a/Documentation/filesystems/ncpfs.txt +++ b/Documentation/filesystems/ncpfs.txt @@ -7,6 +7,6 @@ ftp.gwdg.de/pub/linux/misc/ncpfs, but sunsite and its many mirrors will have it as well. Related products are linware and mars_nwe, which will give Linux partial -NetWare server functionality. Linware's home site is -klokan.sh.cvut.cz/pub/linux/linware; mars_nwe can be found on -ftp.gwdg.de/pub/linux/misc/ncpfs. +NetWare server functionality. + +mars_nwe can be found on ftp.gwdg.de/pub/linux/misc/ncpfs. diff --git a/Documentation/filesystems/nfs/00-INDEX b/Documentation/filesystems/nfs/00-INDEX new file mode 100644 index 00000000000..53f3b596ac0 --- /dev/null +++ b/Documentation/filesystems/nfs/00-INDEX @@ -0,0 +1,26 @@ +00-INDEX + - this file (nfs-related documentation). +Exporting + - explanation of how to make filesystems exportable. +fault_injection.txt + - information for using fault injection on the server +knfsd-stats.txt + - statistics which the NFS server makes available to user space. +nfs.txt + - nfs client, and DNS resolution for fs_locations. +nfs41-server.txt + - info on the Linux server implementation of NFSv4 minor version 1. +nfs-rdma.txt + - how to install and setup the Linux NFS/RDMA client and server software +nfsd-admin-interfaces.txt + - Administrative interfaces for nfsd. +nfsroot.txt + - short guide on setting up a diskless box with NFS root filesystem. +pnfs.txt + - short explanation of some of the internals of the pnfs client code +rpc-cache.txt + - introduction to the caching mechanisms in the sunrpc layer. +idmapper.txt + - information for configuring request-keys to be used by idmapper +rpc-server-gss.txt + - Information on GSS authentication support in the NFS Server diff --git a/Documentation/filesystems/Exporting b/Documentation/filesystems/nfs/Exporting index 87019d2b598..e543b1a619c 100644 --- a/Documentation/filesystems/Exporting +++ b/Documentation/filesystems/nfs/Exporting @@ -92,7 +92,14 @@ For a filesystem to be exportable it must: 1/ provide the filehandle fragment routines described below. 2/ make sure that d_splice_alias is used rather than d_add when ->lookup finds an inode for a given parent and name. - Typically the ->lookup routine will end with a: + + If inode is NULL, d_splice_alias(inode, dentry) is equivalent to + + d_add(dentry, inode), NULL + + Similarly, d_splice_alias(ERR_PTR(err), dentry) = ERR_PTR(err) + + Typically the ->lookup routine will simply end with a: return d_splice_alias(inode, dentry); } diff --git a/Documentation/filesystems/nfs/fault_injection.txt b/Documentation/filesystems/nfs/fault_injection.txt new file mode 100644 index 00000000000..426d166089a --- /dev/null +++ b/Documentation/filesystems/nfs/fault_injection.txt @@ -0,0 +1,69 @@ + +Fault Injection +=============== +Fault injection is a method for forcing errors that may not normally occur, or +may be difficult to reproduce. Forcing these errors in a controlled environment +can help the developer find and fix bugs before their code is shipped in a +production system. Injecting an error on the Linux NFS server will allow us to +observe how the client reacts and if it manages to recover its state correctly. + +NFSD_FAULT_INJECTION must be selected when configuring the kernel to use this +feature. + + +Using Fault Injection +===================== +On the client, mount the fault injection server through NFS v4.0+ and do some +work over NFS (open files, take locks, ...). + +On the server, mount the debugfs filesystem to <debug_dir> and ls +<debug_dir>/nfsd. This will show a list of files that will be used for +injecting faults on the NFS server. As root, write a number n to the file +corresponding to the action you want the server to take. The server will then +process the first n items it finds. So if you want to forget 5 locks, echo '5' +to <debug_dir>/nfsd/forget_locks. A value of 0 will tell the server to forget +all corresponding items. A log message will be created containing the number +of items forgotten (check dmesg). + +Go back to work on the client and check if the client recovered from the error +correctly. + + +Available Faults +================ +forget_clients: + The NFS server keeps a list of clients that have placed a mount call. If + this list is cleared, the server will have no knowledge of who the client + is, forcing the client to reauthenticate with the server. + +forget_openowners: + The NFS server keeps a list of what files are currently opened and who + they were opened by. Clearing this list will force the client to reopen + its files. + +forget_locks: + The NFS server keeps a list of what files are currently locked in the VFS. + Clearing this list will force the client to reclaim its locks (files are + unlocked through the VFS as they are cleared from this list). + +forget_delegations: + A delegation is used to assure the client that a file, or part of a file, + has not changed since the delegation was awarded. Clearing this list will + force the client to reaquire its delegation before accessing the file + again. + +recall_delegations: + Delegations can be recalled by the server when another client attempts to + access a file. This test will notify the client that its delegation has + been revoked, forcing the client to reaquire the delegation before using + the file again. + + +tools/nfs/inject_faults.sh script +================================= +This script has been created to ease the fault injection process. This script +will detect the mounted debugfs directory and write to the files located there +based on the arguments passed by the user. For example, running +`inject_faults.sh forget_locks 1` as root will instruct the server to forget +one lock. Running `inject_faults forget_locks` will instruct the server to +forgetall locks. diff --git a/Documentation/filesystems/nfs/idmapper.txt b/Documentation/filesystems/nfs/idmapper.txt new file mode 100644 index 00000000000..fe03d10bb79 --- /dev/null +++ b/Documentation/filesystems/nfs/idmapper.txt @@ -0,0 +1,75 @@ + +========= +ID Mapper +========= +Id mapper is used by NFS to translate user and group ids into names, and to +translate user and group names into ids. Part of this translation involves +performing an upcall to userspace to request the information. There are two +ways NFS could obtain this information: placing a call to /sbin/request-key +or by placing a call to the rpc.idmap daemon. + +NFS will attempt to call /sbin/request-key first. If this succeeds, the +result will be cached using the generic request-key cache. This call should +only fail if /etc/request-key.conf is not configured for the id_resolver key +type, see the "Configuring" section below if you wish to use the request-key +method. + +If the call to /sbin/request-key fails (if /etc/request-key.conf is not +configured with the id_resolver key type), then the idmapper will ask the +legacy rpc.idmap daemon for the id mapping. This result will be stored +in a custom NFS idmap cache. + + +=========== +Configuring +=========== +The file /etc/request-key.conf will need to be modified so /sbin/request-key can +direct the upcall. The following line should be added: + +#OP TYPE DESCRIPTION CALLOUT INFO PROGRAM ARG1 ARG2 ARG3 ... +#====== ======= =============== =============== =============================== +create id_resolver * * /usr/sbin/nfs.idmap %k %d 600 + +This will direct all id_resolver requests to the program /usr/sbin/nfs.idmap. +The last parameter, 600, defines how many seconds into the future the key will +expire. This parameter is optional for /usr/sbin/nfs.idmap. When the timeout +is not specified, nfs.idmap will default to 600 seconds. + +id mapper uses for key descriptions: + uid: Find the UID for the given user + gid: Find the GID for the given group + user: Find the user name for the given UID + group: Find the group name for the given GID + +You can handle any of these individually, rather than using the generic upcall +program. If you would like to use your own program for a uid lookup then you +would edit your request-key.conf so it look similar to this: + +#OP TYPE DESCRIPTION CALLOUT INFO PROGRAM ARG1 ARG2 ARG3 ... +#====== ======= =============== =============== =============================== +create id_resolver uid:* * /some/other/program %k %d 600 +create id_resolver * * /usr/sbin/nfs.idmap %k %d 600 + +Notice that the new line was added above the line for the generic program. +request-key will find the first matching line and corresponding program. In +this case, /some/other/program will handle all uid lookups and +/usr/sbin/nfs.idmap will handle gid, user, and group lookups. + +See <file:Documentation/security/keys-request-key.txt> for more information +about the request-key function. + + +========= +nfs.idmap +========= +nfs.idmap is designed to be called by request-key, and should not be run "by +hand". This program takes two arguments, a serialized key and a key +description. The serialized key is first converted into a key_serial_t, and +then passed as an argument to keyctl_instantiate (both are part of keyutils.h). + +The actual lookups are performed by functions found in nfsidmap.h. nfs.idmap +determines the correct function to call by looking at the first part of the +description string. For example, a uid lookup description will appear as +"uid:user@domain". + +nfs.idmap will return 0 if the key was instantiated, and non-zero otherwise. diff --git a/Documentation/filesystems/nfs/knfsd-stats.txt b/Documentation/filesystems/nfs/knfsd-stats.txt new file mode 100644 index 00000000000..64ced5149d3 --- /dev/null +++ b/Documentation/filesystems/nfs/knfsd-stats.txt @@ -0,0 +1,159 @@ + +Kernel NFS Server Statistics +============================ + +This document describes the format and semantics of the statistics +which the kernel NFS server makes available to userspace. These +statistics are available in several text form pseudo files, each of +which is described separately below. + +In most cases you don't need to know these formats, as the nfsstat(8) +program from the nfs-utils distribution provides a helpful command-line +interface for extracting and printing them. + +All the files described here are formatted as a sequence of text lines, +separated by newline '\n' characters. Lines beginning with a hash +'#' character are comments intended for humans and should be ignored +by parsing routines. All other lines contain a sequence of fields +separated by whitespace. + +/proc/fs/nfsd/pool_stats +------------------------ + +This file is available in kernels from 2.6.30 onwards, if the +/proc/fs/nfsd filesystem is mounted (it almost always should be). + +The first line is a comment which describes the fields present in +all the other lines. The other lines present the following data as +a sequence of unsigned decimal numeric fields. One line is shown +for each NFS thread pool. + +All counters are 64 bits wide and wrap naturally. There is no way +to zero these counters, instead applications should do their own +rate conversion. + +pool + The id number of the NFS thread pool to which this line applies. + This number does not change. + + Thread pool ids are a contiguous set of small integers starting + at zero. The maximum value depends on the thread pool mode, but + currently cannot be larger than the number of CPUs in the system. + Note that in the default case there will be a single thread pool + which contains all the nfsd threads and all the CPUs in the system, + and thus this file will have a single line with a pool id of "0". + +packets-arrived + Counts how many NFS packets have arrived. More precisely, this + is the number of times that the network stack has notified the + sunrpc server layer that new data may be available on a transport + (e.g. an NFS or UDP socket or an NFS/RDMA endpoint). + + Depending on the NFS workload patterns and various network stack + effects (such as Large Receive Offload) which can combine packets + on the wire, this may be either more or less than the number + of NFS calls received (which statistic is available elsewhere). + However this is a more accurate and less workload-dependent measure + of how much CPU load is being placed on the sunrpc server layer + due to NFS network traffic. + +sockets-enqueued + Counts how many times an NFS transport is enqueued to wait for + an nfsd thread to service it, i.e. no nfsd thread was considered + available. + + The circumstance this statistic tracks indicates that there was NFS + network-facing work to be done but it couldn't be done immediately, + thus introducing a small delay in servicing NFS calls. The ideal + rate of change for this counter is zero; significantly non-zero + values may indicate a performance limitation. + + This can happen either because there are too few nfsd threads in the + thread pool for the NFS workload (the workload is thread-limited), + or because the NFS workload needs more CPU time than is available in + the thread pool (the workload is CPU-limited). In the former case, + configuring more nfsd threads will probably improve the performance + of the NFS workload. In the latter case, the sunrpc server layer is + already choosing not to wake idle nfsd threads because there are too + many nfsd threads which want to run but cannot, so configuring more + nfsd threads will make no difference whatsoever. The overloads-avoided + statistic (see below) can be used to distinguish these cases. + +threads-woken + Counts how many times an idle nfsd thread is woken to try to + receive some data from an NFS transport. + + This statistic tracks the circumstance where incoming + network-facing NFS work is being handled quickly, which is a good + thing. The ideal rate of change for this counter will be close + to but less than the rate of change of the packets-arrived counter. + +overloads-avoided + Counts how many times the sunrpc server layer chose not to wake an + nfsd thread, despite the presence of idle nfsd threads, because + too many nfsd threads had been recently woken but could not get + enough CPU time to actually run. + + This statistic counts a circumstance where the sunrpc layer + heuristically avoids overloading the CPU scheduler with too many + runnable nfsd threads. The ideal rate of change for this counter + is zero. Significant non-zero values indicate that the workload + is CPU limited. Usually this is associated with heavy CPU usage + on all the CPUs in the nfsd thread pool. + + If a sustained large overloads-avoided rate is detected on a pool, + the top(1) utility should be used to check for the following + pattern of CPU usage on all the CPUs associated with the given + nfsd thread pool. + + - %us ~= 0 (as you're *NOT* running applications on your NFS server) + + - %wa ~= 0 + + - %id ~= 0 + + - %sy + %hi + %si ~= 100 + + If this pattern is seen, configuring more nfsd threads will *not* + improve the performance of the workload. If this patten is not + seen, then something more subtle is wrong. + +threads-timedout + Counts how many times an nfsd thread triggered an idle timeout, + i.e. was not woken to handle any incoming network packets for + some time. + + This statistic counts a circumstance where there are more nfsd + threads configured than can be used by the NFS workload. This is + a clue that the number of nfsd threads can be reduced without + affecting performance. Unfortunately, it's only a clue and not + a strong indication, for a couple of reasons: + + - Currently the rate at which the counter is incremented is quite + slow; the idle timeout is 60 minutes. Unless the NFS workload + remains constant for hours at a time, this counter is unlikely + to be providing information that is still useful. + + - It is usually a wise policy to provide some slack, + i.e. configure a few more nfsds than are currently needed, + to allow for future spikes in load. + + +Note that incoming packets on NFS transports will be dealt with in +one of three ways. An nfsd thread can be woken (threads-woken counts +this case), or the transport can be enqueued for later attention +(sockets-enqueued counts this case), or the packet can be temporarily +deferred because the transport is currently being used by an nfsd +thread. This last case is not very interesting and is not explicitly +counted, but can be inferred from the other counters thus: + +packets-deferred = packets-arrived - ( sockets-enqueued + threads-woken ) + + +More +---- +Descriptions of the other statistics file should go here. + + +Greg Banks <gnb@sgi.com> +26 Mar 2009 diff --git a/Documentation/filesystems/nfs/nfs-rdma.txt b/Documentation/filesystems/nfs/nfs-rdma.txt new file mode 100644 index 00000000000..e386f7e4bce --- /dev/null +++ b/Documentation/filesystems/nfs/nfs-rdma.txt @@ -0,0 +1,271 @@ +################################################################################ +# # +# NFS/RDMA README # +# # +################################################################################ + + Author: NetApp and Open Grid Computing + Date: May 29, 2008 + +Table of Contents +~~~~~~~~~~~~~~~~~ + - Overview + - Getting Help + - Installation + - Check RDMA and NFS Setup + - NFS/RDMA Setup + +Overview +~~~~~~~~ + + This document describes how to install and setup the Linux NFS/RDMA client + and server software. + + The NFS/RDMA client was first included in Linux 2.6.24. The NFS/RDMA server + was first included in the following release, Linux 2.6.25. + + In our testing, we have obtained excellent performance results (full 10Gbit + wire bandwidth at minimal client CPU) under many workloads. The code passes + the full Connectathon test suite and operates over both Infiniband and iWARP + RDMA adapters. + +Getting Help +~~~~~~~~~~~~ + + If you get stuck, you can ask questions on the + + nfs-rdma-devel@lists.sourceforge.net + + mailing list. + +Installation +~~~~~~~~~~~~ + + These instructions are a step by step guide to building a machine for + use with NFS/RDMA. + + - Install an RDMA device + + Any device supported by the drivers in drivers/infiniband/hw is acceptable. + + Testing has been performed using several Mellanox-based IB cards, the + Ammasso AMS1100 iWARP adapter, and the Chelsio cxgb3 iWARP adapter. + + - Install a Linux distribution and tools + + The first kernel release to contain both the NFS/RDMA client and server was + Linux 2.6.25 Therefore, a distribution compatible with this and subsequent + Linux kernel release should be installed. + + The procedures described in this document have been tested with + distributions from Red Hat's Fedora Project (http://fedora.redhat.com/). + + - Install nfs-utils-1.1.2 or greater on the client + + An NFS/RDMA mount point can be obtained by using the mount.nfs command in + nfs-utils-1.1.2 or greater (nfs-utils-1.1.1 was the first nfs-utils + version with support for NFS/RDMA mounts, but for various reasons we + recommend using nfs-utils-1.1.2 or greater). To see which version of + mount.nfs you are using, type: + + $ /sbin/mount.nfs -V + + If the version is less than 1.1.2 or the command does not exist, + you should install the latest version of nfs-utils. + + Download the latest package from: + + http://www.kernel.org/pub/linux/utils/nfs + + Uncompress the package and follow the installation instructions. + + If you will not need the idmapper and gssd executables (you do not need + these to create an NFS/RDMA enabled mount command), the installation + process can be simplified by disabling these features when running + configure: + + $ ./configure --disable-gss --disable-nfsv4 + + To build nfs-utils you will need the tcp_wrappers package installed. For + more information on this see the package's README and INSTALL files. + + After building the nfs-utils package, there will be a mount.nfs binary in + the utils/mount directory. This binary can be used to initiate NFS v2, v3, + or v4 mounts. To initiate a v4 mount, the binary must be called + mount.nfs4. The standard technique is to create a symlink called + mount.nfs4 to mount.nfs. + + This mount.nfs binary should be installed at /sbin/mount.nfs as follows: + + $ sudo cp utils/mount/mount.nfs /sbin/mount.nfs + + In this location, mount.nfs will be invoked automatically for NFS mounts + by the system mount command. + + NOTE: mount.nfs and therefore nfs-utils-1.1.2 or greater is only needed + on the NFS client machine. You do not need this specific version of + nfs-utils on the server. Furthermore, only the mount.nfs command from + nfs-utils-1.1.2 is needed on the client. + + - Install a Linux kernel with NFS/RDMA + + The NFS/RDMA client and server are both included in the mainline Linux + kernel version 2.6.25 and later. This and other versions of the 2.6 Linux + kernel can be found at: + + ftp://ftp.kernel.org/pub/linux/kernel/v2.6/ + + Download the sources and place them in an appropriate location. + + - Configure the RDMA stack + + Make sure your kernel configuration has RDMA support enabled. Under + Device Drivers -> InfiniBand support, update the kernel configuration + to enable InfiniBand support [NOTE: the option name is misleading. Enabling + InfiniBand support is required for all RDMA devices (IB, iWARP, etc.)]. + + Enable the appropriate IB HCA support (mlx4, mthca, ehca, ipath, etc.) or + iWARP adapter support (amso, cxgb3, etc.). + + If you are using InfiniBand, be sure to enable IP-over-InfiniBand support. + + - Configure the NFS client and server + + Your kernel configuration must also have NFS file system support and/or + NFS server support enabled. These and other NFS related configuration + options can be found under File Systems -> Network File Systems. + + - Build, install, reboot + + The NFS/RDMA code will be enabled automatically if NFS and RDMA + are turned on. The NFS/RDMA client and server are configured via the hidden + SUNRPC_XPRT_RDMA config option that depends on SUNRPC and INFINIBAND. The + value of SUNRPC_XPRT_RDMA will be: + + - N if either SUNRPC or INFINIBAND are N, in this case the NFS/RDMA client + and server will not be built + - M if both SUNRPC and INFINIBAND are on (M or Y) and at least one is M, + in this case the NFS/RDMA client and server will be built as modules + - Y if both SUNRPC and INFINIBAND are Y, in this case the NFS/RDMA client + and server will be built into the kernel + + Therefore, if you have followed the steps above and turned no NFS and RDMA, + the NFS/RDMA client and server will be built. + + Build a new kernel, install it, boot it. + +Check RDMA and NFS Setup +~~~~~~~~~~~~~~~~~~~~~~~~ + + Before configuring the NFS/RDMA software, it is a good idea to test + your new kernel to ensure that the kernel is working correctly. + In particular, it is a good idea to verify that the RDMA stack + is functioning as expected and standard NFS over TCP/IP and/or UDP/IP + is working properly. + + - Check RDMA Setup + + If you built the RDMA components as modules, load them at + this time. For example, if you are using a Mellanox Tavor/Sinai/Arbel + card: + + $ modprobe ib_mthca + $ modprobe ib_ipoib + + If you are using InfiniBand, make sure there is a Subnet Manager (SM) + running on the network. If your IB switch has an embedded SM, you can + use it. Otherwise, you will need to run an SM, such as OpenSM, on one + of your end nodes. + + If an SM is running on your network, you should see the following: + + $ cat /sys/class/infiniband/driverX/ports/1/state + 4: ACTIVE + + where driverX is mthca0, ipath5, ehca3, etc. + + To further test the InfiniBand software stack, use IPoIB (this + assumes you have two IB hosts named host1 and host2): + + host1$ ifconfig ib0 a.b.c.x + host2$ ifconfig ib0 a.b.c.y + host1$ ping a.b.c.y + host2$ ping a.b.c.x + + For other device types, follow the appropriate procedures. + + - Check NFS Setup + + For the NFS components enabled above (client and/or server), + test their functionality over standard Ethernet using TCP/IP or UDP/IP. + +NFS/RDMA Setup +~~~~~~~~~~~~~~ + + We recommend that you use two machines, one to act as the client and + one to act as the server. + + One time configuration: + + - On the server system, configure the /etc/exports file and + start the NFS/RDMA server. + + Exports entries with the following formats have been tested: + + /vol0 192.168.0.47(fsid=0,rw,async,insecure,no_root_squash) + /vol0 192.168.0.0/255.255.255.0(fsid=0,rw,async,insecure,no_root_squash) + + The IP address(es) is(are) the client's IPoIB address for an InfiniBand + HCA or the cleint's iWARP address(es) for an RNIC. + + NOTE: The "insecure" option must be used because the NFS/RDMA client does + not use a reserved port. + + Each time a machine boots: + + - Load and configure the RDMA drivers + + For InfiniBand using a Mellanox adapter: + + $ modprobe ib_mthca + $ modprobe ib_ipoib + $ ifconfig ib0 a.b.c.d + + NOTE: use unique addresses for the client and server + + - Start the NFS server + + If the NFS/RDMA server was built as a module (CONFIG_SUNRPC_XPRT_RDMA=m in + kernel config), load the RDMA transport module: + + $ modprobe svcrdma + + Regardless of how the server was built (module or built-in), start the + server: + + $ /etc/init.d/nfs start + + or + + $ service nfs start + + Instruct the server to listen on the RDMA transport: + + $ echo rdma 20049 > /proc/fs/nfsd/portlist + + - On the client system + + If the NFS/RDMA client was built as a module (CONFIG_SUNRPC_XPRT_RDMA=m in + kernel config), load the RDMA client module: + + $ modprobe xprtrdma.ko + + Regardless of how the client was built (module or built-in), use this + command to mount the NFS/RDMA server: + + $ mount -o rdma,port=20049 <IPoIB-server-name-or-address>:/<export> /mnt + + To verify that the mount is using RDMA, run "cat /proc/mounts" and check + the "proto" field for the given mount. + + Congratulations! You're using NFS/RDMA! diff --git a/Documentation/filesystems/nfs/nfs.txt b/Documentation/filesystems/nfs/nfs.txt new file mode 100644 index 00000000000..f2571c8bef7 --- /dev/null +++ b/Documentation/filesystems/nfs/nfs.txt @@ -0,0 +1,136 @@ + +The NFS client +============== + +The NFS version 2 protocol was first documented in RFC1094 (March 1989). +Since then two more major releases of NFS have been published, with NFSv3 +being documented in RFC1813 (June 1995), and NFSv4 in RFC3530 (April +2003). + +The Linux NFS client currently supports all the above published versions, +and work is in progress on adding support for minor version 1 of the NFSv4 +protocol. + +The purpose of this document is to provide information on some of the +special features of the NFS client that can be configured by system +administrators. + + +The nfs4_unique_id parameter +============================ + +NFSv4 requires clients to identify themselves to servers with a unique +string. File open and lock state shared between one client and one server +is associated with this identity. To support robust NFSv4 state recovery +and transparent state migration, this identity string must not change +across client reboots. + +Without any other intervention, the Linux client uses a string that contains +the local system's node name. System administrators, however, often do not +take care to ensure that node names are fully qualified and do not change +over the lifetime of a client system. Node names can have other +administrative requirements that require particular behavior that does not +work well as part of an nfs_client_id4 string. + +The nfs.nfs4_unique_id boot parameter specifies a unique string that can be +used instead of a system's node name when an NFS client identifies itself to +a server. Thus, if the system's node name is not unique, or it changes, its +nfs.nfs4_unique_id stays the same, preventing collision with other clients +or loss of state during NFS reboot recovery or transparent state migration. + +The nfs.nfs4_unique_id string is typically a UUID, though it can contain +anything that is believed to be unique across all NFS clients. An +nfs4_unique_id string should be chosen when a client system is installed, +just as a system's root file system gets a fresh UUID in its label at +install time. + +The string should remain fixed for the lifetime of the client. It can be +changed safely if care is taken that the client shuts down cleanly and all +outstanding NFSv4 state has expired, to prevent loss of NFSv4 state. + +This string can be stored in an NFS client's grub.conf, or it can be provided +via a net boot facility such as PXE. It may also be specified as an nfs.ko +module parameter. Specifying a uniquifier string is not support for NFS +clients running in containers. + + +The DNS resolver +================ + +NFSv4 allows for one server to refer the NFS client to data that has been +migrated onto another server by means of the special "fs_locations" +attribute. See + http://tools.ietf.org/html/rfc3530#section-6 +and + http://tools.ietf.org/html/draft-ietf-nfsv4-referrals-00 + +The fs_locations information can take the form of either an ip address and +a path, or a DNS hostname and a path. The latter requires the NFS client to +do a DNS lookup in order to mount the new volume, and hence the need for an +upcall to allow userland to provide this service. + +Assuming that the user has the 'rpc_pipefs' filesystem mounted in the usual +/var/lib/nfs/rpc_pipefs, the upcall consists of the following steps: + + (1) The process checks the dns_resolve cache to see if it contains a + valid entry. If so, it returns that entry and exits. + + (2) If no valid entry exists, the helper script '/sbin/nfs_cache_getent' + (may be changed using the 'nfs.cache_getent' kernel boot parameter) + is run, with two arguments: + - the cache name, "dns_resolve" + - the hostname to resolve + + (3) After looking up the corresponding ip address, the helper script + writes the result into the rpc_pipefs pseudo-file + '/var/lib/nfs/rpc_pipefs/cache/dns_resolve/channel' + in the following (text) format: + + "<ip address> <hostname> <ttl>\n" + + Where <ip address> is in the usual IPv4 (123.456.78.90) or IPv6 + (ffee:ddcc:bbaa:9988:7766:5544:3322:1100, ffee::1100, ...) format. + <hostname> is identical to the second argument of the helper + script, and <ttl> is the 'time to live' of this cache entry (in + units of seconds). + + Note: If <ip address> is invalid, say the string "0", then a negative + entry is created, which will cause the kernel to treat the hostname + as having no valid DNS translation. + + + + +A basic sample /sbin/nfs_cache_getent +===================================== + +#!/bin/bash +# +ttl=600 +# +cut=/usr/bin/cut +getent=/usr/bin/getent +rpc_pipefs=/var/lib/nfs/rpc_pipefs +# +die() +{ + echo "Usage: $0 cache_name entry_name" + exit 1 +} + +[ $# -lt 2 ] && die +cachename="$1" +cache_path=${rpc_pipefs}/cache/${cachename}/channel + +case "${cachename}" in + dns_resolve) + name="$2" + result="$(${getent} hosts ${name} | ${cut} -f1 -d\ )" + [ -z "${result}" ] && result="0" + ;; + *) + die + ;; +esac +echo "${result} ${name} ${ttl}" >${cache_path} + diff --git a/Documentation/filesystems/nfs/nfs41-server.txt b/Documentation/filesystems/nfs/nfs41-server.txt new file mode 100644 index 00000000000..c49cd7e796e --- /dev/null +++ b/Documentation/filesystems/nfs/nfs41-server.txt @@ -0,0 +1,180 @@ +NFSv4.1 Server Implementation + +Server support for minorversion 1 can be controlled using the +/proc/fs/nfsd/versions control file. The string output returned +by reading this file will contain either "+4.1" or "-4.1" +correspondingly. + +Currently, server support for minorversion 1 is enabled by default. +It can be disabled at run time by writing the string "-4.1" to +the /proc/fs/nfsd/versions control file. Note that to write this +control file, the nfsd service must be taken down. You can use rpc.nfsd +for this; see rpc.nfsd(8). + +(Warning: older servers will interpret "+4.1" and "-4.1" as "+4" and +"-4", respectively. Therefore, code meant to work on both new and old +kernels must turn 4.1 on or off *before* turning support for version 4 +on or off; rpc.nfsd does this correctly.) + +The NFSv4 minorversion 1 (NFSv4.1) implementation in nfsd is based +on RFC 5661. + +From the many new features in NFSv4.1 the current implementation +focuses on the mandatory-to-implement NFSv4.1 Sessions, providing +"exactly once" semantics and better control and throttling of the +resources allocated for each client. + +Other NFSv4.1 features, Parallel NFS operations in particular, +are still under development out of tree. +See http://wiki.linux-nfs.org/wiki/index.php/PNFS_prototype_design +for more information. + +The table below, taken from the NFSv4.1 document, lists +the operations that are mandatory to implement (REQ), optional +(OPT), and NFSv4.0 operations that are required not to implement (MNI) +in minor version 1. The first column indicates the operations that +are not supported yet by the linux server implementation. + +The OPTIONAL features identified and their abbreviations are as follows: + pNFS Parallel NFS + FDELG File Delegations + DDELG Directory Delegations + +The following abbreviations indicate the linux server implementation status. + I Implemented NFSv4.1 operations. + NS Not Supported. + NS* unimplemented optional feature. + P pNFS features implemented out of tree. + PNS pNFS features that are not supported yet (out of tree). + +Operations + + +----------------------+------------+--------------+----------------+ + | Operation | REQ, REC, | Feature | Definition | + | | OPT, or | (REQ, REC, | | + | | MNI | or OPT) | | + +----------------------+------------+--------------+----------------+ + | ACCESS | REQ | | Section 18.1 | +I | BACKCHANNEL_CTL | REQ | | Section 18.33 | +I | BIND_CONN_TO_SESSION | REQ | | Section 18.34 | + | CLOSE | REQ | | Section 18.2 | + | COMMIT | REQ | | Section 18.3 | + | CREATE | REQ | | Section 18.4 | +I | CREATE_SESSION | REQ | | Section 18.36 | +NS*| DELEGPURGE | OPT | FDELG (REQ) | Section 18.5 | + | DELEGRETURN | OPT | FDELG, | Section 18.6 | + | | | DDELG, pNFS | | + | | | (REQ) | | +I | DESTROY_CLIENTID | REQ | | Section 18.50 | +I | DESTROY_SESSION | REQ | | Section 18.37 | +I | EXCHANGE_ID | REQ | | Section 18.35 | +I | FREE_STATEID | REQ | | Section 18.38 | + | GETATTR | REQ | | Section 18.7 | +P | GETDEVICEINFO | OPT | pNFS (REQ) | Section 18.40 | +P | GETDEVICELIST | OPT | pNFS (OPT) | Section 18.41 | + | GETFH | REQ | | Section 18.8 | +NS*| GET_DIR_DELEGATION | OPT | DDELG (REQ) | Section 18.39 | +P | LAYOUTCOMMIT | OPT | pNFS (REQ) | Section 18.42 | +P | LAYOUTGET | OPT | pNFS (REQ) | Section 18.43 | +P | LAYOUTRETURN | OPT | pNFS (REQ) | Section 18.44 | + | LINK | OPT | | Section 18.9 | + | LOCK | REQ | | Section 18.10 | + | LOCKT | REQ | | Section 18.11 | + | LOCKU | REQ | | Section 18.12 | + | LOOKUP | REQ | | Section 18.13 | + | LOOKUPP | REQ | | Section 18.14 | + | NVERIFY | REQ | | Section 18.15 | + | OPEN | REQ | | Section 18.16 | +NS*| OPENATTR | OPT | | Section 18.17 | + | OPEN_CONFIRM | MNI | | N/A | + | OPEN_DOWNGRADE | REQ | | Section 18.18 | + | PUTFH | REQ | | Section 18.19 | + | PUTPUBFH | REQ | | Section 18.20 | + | PUTROOTFH | REQ | | Section 18.21 | + | READ | REQ | | Section 18.22 | + | READDIR | REQ | | Section 18.23 | + | READLINK | OPT | | Section 18.24 | + | RECLAIM_COMPLETE | REQ | | Section 18.51 | + | RELEASE_LOCKOWNER | MNI | | N/A | + | REMOVE | REQ | | Section 18.25 | + | RENAME | REQ | | Section 18.26 | + | RENEW | MNI | | N/A | + | RESTOREFH | REQ | | Section 18.27 | + | SAVEFH | REQ | | Section 18.28 | + | SECINFO | REQ | | Section 18.29 | +I | SECINFO_NO_NAME | REC | pNFS files | Section 18.45, | + | | | layout (REQ) | Section 13.12 | +I | SEQUENCE | REQ | | Section 18.46 | + | SETATTR | REQ | | Section 18.30 | + | SETCLIENTID | MNI | | N/A | + | SETCLIENTID_CONFIRM | MNI | | N/A | +NS | SET_SSV | REQ | | Section 18.47 | +I | TEST_STATEID | REQ | | Section 18.48 | + | VERIFY | REQ | | Section 18.31 | +NS*| WANT_DELEGATION | OPT | FDELG (OPT) | Section 18.49 | + | WRITE | REQ | | Section 18.32 | + +Callback Operations + + +-------------------------+-----------+-------------+---------------+ + | Operation | REQ, REC, | Feature | Definition | + | | OPT, or | (REQ, REC, | | + | | MNI | or OPT) | | + +-------------------------+-----------+-------------+---------------+ + | CB_GETATTR | OPT | FDELG (REQ) | Section 20.1 | +P | CB_LAYOUTRECALL | OPT | pNFS (REQ) | Section 20.3 | +NS*| CB_NOTIFY | OPT | DDELG (REQ) | Section 20.4 | +P | CB_NOTIFY_DEVICEID | OPT | pNFS (OPT) | Section 20.12 | +NS*| CB_NOTIFY_LOCK | OPT | | Section 20.11 | +NS*| CB_PUSH_DELEG | OPT | FDELG (OPT) | Section 20.5 | + | CB_RECALL | OPT | FDELG, | Section 20.2 | + | | | DDELG, pNFS | | + | | | (REQ) | | +NS*| CB_RECALL_ANY | OPT | FDELG, | Section 20.6 | + | | | DDELG, pNFS | | + | | | (REQ) | | +NS | CB_RECALL_SLOT | REQ | | Section 20.8 | +NS*| CB_RECALLABLE_OBJ_AVAIL | OPT | DDELG, pNFS | Section 20.7 | + | | | (REQ) | | +I | CB_SEQUENCE | OPT | FDELG, | Section 20.9 | + | | | DDELG, pNFS | | + | | | (REQ) | | +NS*| CB_WANTS_CANCELLED | OPT | FDELG, | Section 20.10 | + | | | DDELG, pNFS | | + | | | (REQ) | | + +-------------------------+-----------+-------------+---------------+ + +Implementation notes: + +SSV: +* The spec claims this is mandatory, but we don't actually know of any + implementations, so we're ignoring it for now. The server returns + NFS4ERR_ENCR_ALG_UNSUPP on EXCHANGE_ID, which should be future-proof. + +GSS on the backchannel: +* Again, theoretically required but not widely implemented (in + particular, the current Linux client doesn't request it). We return + NFS4ERR_ENCR_ALG_UNSUPP on CREATE_SESSION. + +DELEGPURGE: +* mandatory only for servers that support CLAIM_DELEGATE_PREV and/or + CLAIM_DELEG_PREV_FH (which allows clients to keep delegations that + persist across client reboots). Thus we need not implement this for + now. + +EXCHANGE_ID: +* implementation ids are ignored + +CREATE_SESSION: +* backchannel attributes are ignored + +SEQUENCE: +* no support for dynamic slot table renegotiation (optional) + +Nonstandard compound limitations: +* No support for a sessions fore channel RPC compound that requires both a + ca_maxrequestsize request and a ca_maxresponsesize reply, so we may + fail to live up to the promise we made in CREATE_SESSION fore channel + negotiation. + +See also http://wiki.linux-nfs.org/wiki/index.php/Server_4.0_and_4.1_issues. diff --git a/Documentation/filesystems/nfs/nfsd-admin-interfaces.txt b/Documentation/filesystems/nfs/nfsd-admin-interfaces.txt new file mode 100644 index 00000000000..56a96fb08a7 --- /dev/null +++ b/Documentation/filesystems/nfs/nfsd-admin-interfaces.txt @@ -0,0 +1,41 @@ +Administrative interfaces for nfsd +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Note that normally these interfaces are used only by the utilities in +nfs-utils. + +nfsd is controlled mainly by pseudofiles under the "nfsd" filesystem, +which is normally mounted at /proc/fs/nfsd/. + +The server is always started by the first write of a nonzero value to +nfsd/threads. + +Before doing that, NFSD can be told which sockets to listen on by +writing to nfsd/portlist; that write may be: + + - an ascii-encoded file descriptor, which should refer to a + bound (and listening, for tcp) socket, or + - "transportname port", where transportname is currently either + "udp", "tcp", or "rdma". + +If nfsd is started without doing any of these, then it will create one +udp and one tcp listener at port 2049 (see nfsd_init_socks). + +On startup, nfsd and lockd grace periods start. + +nfsd is shut down by a write of 0 to nfsd/threads. All locks and state +are thrown away at that point. + +Between startup and shutdown, the number of threads may be adjusted up +or down by additional writes to nfsd/threads or by writes to +nfsd/pool_threads. + +For more detail about files under nfsd/ and what they control, see +fs/nfsd/nfsctl.c; most of them have detailed comments. + +Implementation notes +^^^^^^^^^^^^^^^^^^^^ + +Note that the rpc server requires the caller to serialize addition and +removal of listening sockets, and startup and shutdown of the server. +For nfsd this is done using nfsd_mutex. diff --git a/Documentation/filesystems/nfs/nfsroot.txt b/Documentation/filesystems/nfs/nfsroot.txt new file mode 100644 index 00000000000..2d66ed68812 --- /dev/null +++ b/Documentation/filesystems/nfs/nfsroot.txt @@ -0,0 +1,302 @@ +Mounting the root filesystem via NFS (nfsroot) +=============================================== + +Written 1996 by Gero Kuhlmann <gero@gkminix.han.de> +Updated 1997 by Martin Mares <mj@atrey.karlin.mff.cuni.cz> +Updated 2006 by Nico Schottelius <nico-kernel-nfsroot@schottelius.org> +Updated 2006 by Horms <horms@verge.net.au> + + + +In order to use a diskless system, such as an X-terminal or printer server +for example, it is necessary for the root filesystem to be present on a +non-disk device. This may be an initramfs (see Documentation/filesystems/ +ramfs-rootfs-initramfs.txt), a ramdisk (see Documentation/initrd.txt) or a +filesystem mounted via NFS. The following text describes on how to use NFS +for the root filesystem. For the rest of this text 'client' means the +diskless system, and 'server' means the NFS server. + + + + +1.) Enabling nfsroot capabilities + ----------------------------- + +In order to use nfsroot, NFS client support needs to be selected as +built-in during configuration. Once this has been selected, the nfsroot +option will become available, which should also be selected. + +In the networking options, kernel level autoconfiguration can be selected, +along with the types of autoconfiguration to support. Selecting all of +DHCP, BOOTP and RARP is safe. + + + + +2.) Kernel command line + ------------------- + +When the kernel has been loaded by a boot loader (see below) it needs to be +told what root fs device to use. And in the case of nfsroot, where to find +both the server and the name of the directory on the server to mount as root. +This can be established using the following kernel command line parameters: + + +root=/dev/nfs + + This is necessary to enable the pseudo-NFS-device. Note that it's not a + real device but just a synonym to tell the kernel to use NFS instead of + a real device. + + +nfsroot=[<server-ip>:]<root-dir>[,<nfs-options>] + + If the `nfsroot' parameter is NOT given on the command line, + the default "/tftpboot/%s" will be used. + + <server-ip> Specifies the IP address of the NFS server. + The default address is determined by the `ip' parameter + (see below). This parameter allows the use of different + servers for IP autoconfiguration and NFS. + + <root-dir> Name of the directory on the server to mount as root. + If there is a "%s" token in the string, it will be + replaced by the ASCII-representation of the client's + IP address. + + <nfs-options> Standard NFS options. All options are separated by commas. + The following defaults are used: + port = as given by server portmap daemon + rsize = 4096 + wsize = 4096 + timeo = 7 + retrans = 3 + acregmin = 3 + acregmax = 60 + acdirmin = 30 + acdirmax = 60 + flags = hard, nointr, noposix, cto, ac + + +ip=<client-ip>:<server-ip>:<gw-ip>:<netmask>:<hostname>:<device>:<autoconf>: + <dns0-ip>:<dns1-ip> + + This parameter tells the kernel how to configure IP addresses of devices + and also how to set up the IP routing table. It was originally called + `nfsaddrs', but now the boot-time IP configuration works independently of + NFS, so it was renamed to `ip' and the old name remained as an alias for + compatibility reasons. + + If this parameter is missing from the kernel command line, all fields are + assumed to be empty, and the defaults mentioned below apply. In general + this means that the kernel tries to configure everything using + autoconfiguration. + + The <autoconf> parameter can appear alone as the value to the `ip' + parameter (without all the ':' characters before). If the value is + "ip=off" or "ip=none", no autoconfiguration will take place, otherwise + autoconfiguration will take place. The most common way to use this + is "ip=dhcp". + + <client-ip> IP address of the client. + + Default: Determined using autoconfiguration. + + <server-ip> IP address of the NFS server. If RARP is used to determine + the client address and this parameter is NOT empty only + replies from the specified server are accepted. + + Only required for NFS root. That is autoconfiguration + will not be triggered if it is missing and NFS root is not + in operation. + + Default: Determined using autoconfiguration. + The address of the autoconfiguration server is used. + + <gw-ip> IP address of a gateway if the server is on a different subnet. + + Default: Determined using autoconfiguration. + + <netmask> Netmask for local network interface. If unspecified + the netmask is derived from the client IP address assuming + classful addressing. + + Default: Determined using autoconfiguration. + + <hostname> Name of the client. May be supplied by autoconfiguration, + but its absence will not trigger autoconfiguration. + If specified and DHCP is used, the user provided hostname will + be carried in the DHCP request to hopefully update DNS record. + + Default: Client IP address is used in ASCII notation. + + <device> Name of network device to use. + + Default: If the host only has one device, it is used. + Otherwise the device is determined using + autoconfiguration. This is done by sending + autoconfiguration requests out of all devices, + and using the device that received the first reply. + + <autoconf> Method to use for autoconfiguration. In the case of options + which specify multiple autoconfiguration protocols, + requests are sent using all protocols, and the first one + to reply is used. + + Only autoconfiguration protocols that have been compiled + into the kernel will be used, regardless of the value of + this option. + + off or none: don't use autoconfiguration + (do static IP assignment instead) + on or any: use any protocol available in the kernel + (default) + dhcp: use DHCP + bootp: use BOOTP + rarp: use RARP + both: use both BOOTP and RARP but not DHCP + (old option kept for backwards compatibility) + + Default: any + + <dns0-ip> IP address of first nameserver. + Value gets exported by /proc/net/pnp which is often linked + on embedded systems by /etc/resolv.conf. + + <dns1-ip> IP address of secound nameserver. + Same as above. + + +nfsrootdebug + + This parameter enables debugging messages to appear in the kernel + log at boot time so that administrators can verify that the correct + NFS mount options, server address, and root path are passed to the + NFS client. + + +rdinit=<executable file> + + To specify which file contains the program that starts system + initialization, administrators can use this command line parameter. + The default value of this parameter is "/init". If the specified + file exists and the kernel can execute it, root filesystem related + kernel command line parameters, including `nfsroot=', are ignored. + + A description of the process of mounting the root file system can be + found in: + + Documentation/early-userspace/README + + + + +3.) Boot Loader + ---------- + +To get the kernel into memory different approaches can be used. +They depend on various facilities being available: + + +3.1) Booting from a floppy using syslinux + + When building kernels, an easy way to create a boot floppy that uses + syslinux is to use the zdisk or bzdisk make targets which use zimage + and bzimage images respectively. Both targets accept the + FDARGS parameter which can be used to set the kernel command line. + + e.g. + make bzdisk FDARGS="root=/dev/nfs" + + Note that the user running this command will need to have + access to the floppy drive device, /dev/fd0 + + For more information on syslinux, including how to create bootdisks + for prebuilt kernels, see http://syslinux.zytor.com/ + + N.B: Previously it was possible to write a kernel directly to + a floppy using dd, configure the boot device using rdev, and + boot using the resulting floppy. Linux no longer supports this + method of booting. + +3.2) Booting from a cdrom using isolinux + + When building kernels, an easy way to create a bootable cdrom that + uses isolinux is to use the isoimage target which uses a bzimage + image. Like zdisk and bzdisk, this target accepts the FDARGS + parameter which can be used to set the kernel command line. + + e.g. + make isoimage FDARGS="root=/dev/nfs" + + The resulting iso image will be arch/<ARCH>/boot/image.iso + This can be written to a cdrom using a variety of tools including + cdrecord. + + e.g. + cdrecord dev=ATAPI:1,0,0 arch/x86/boot/image.iso + + For more information on isolinux, including how to create bootdisks + for prebuilt kernels, see http://syslinux.zytor.com/ + +3.2) Using LILO + When using LILO all the necessary command line parameters may be + specified using the 'append=' directive in the LILO configuration + file. + + However, to use the 'root=' directive you also need to create + a dummy root device, which may be removed after LILO is run. + + mknod /dev/boot255 c 0 255 + + For information on configuring LILO, please refer to its documentation. + +3.3) Using GRUB + When using GRUB, kernel parameter are simply appended after the kernel + specification: kernel <kernel> <parameters> + +3.4) Using loadlin + loadlin may be used to boot Linux from a DOS command prompt without + requiring a local hard disk to mount as root. This has not been + thoroughly tested by the authors of this document, but in general + it should be possible configure the kernel command line similarly + to the configuration of LILO. + + Please refer to the loadlin documentation for further information. + +3.5) Using a boot ROM + This is probably the most elegant way of booting a diskless client. + With a boot ROM the kernel is loaded using the TFTP protocol. The + authors of this document are not aware of any no commercial boot + ROMs that support booting Linux over the network. However, there + are two free implementations of a boot ROM, netboot-nfs and + etherboot, both of which are available on sunsite.unc.edu, and both + of which contain everything you need to boot a diskless Linux client. + +3.6) Using pxelinux + Pxelinux may be used to boot linux using the PXE boot loader + which is present on many modern network cards. + + When using pxelinux, the kernel image is specified using + "kernel <relative-path-below /tftpboot>". The nfsroot parameters + are passed to the kernel by adding them to the "append" line. + It is common to use serial console in conjunction with pxeliunx, + see Documentation/serial-console.txt for more information. + + For more information on isolinux, including how to create bootdisks + for prebuilt kernels, see http://syslinux.zytor.com/ + + + + +4.) Credits + ------- + + The nfsroot code in the kernel and the RARP support have been written + by Gero Kuhlmann <gero@gkminix.han.de>. + + The rest of the IP layer autoconfiguration code has been written + by Martin Mares <mj@atrey.karlin.mff.cuni.cz>. + + In order to write the initial version of nfsroot I would like to thank + Jens-Uwe Mager <jum@anubis.han.de> for his help. diff --git a/Documentation/filesystems/nfs/pnfs.txt b/Documentation/filesystems/nfs/pnfs.txt new file mode 100644 index 00000000000..adc81a35fe2 --- /dev/null +++ b/Documentation/filesystems/nfs/pnfs.txt @@ -0,0 +1,109 @@ +Reference counting in pnfs: +========================== + +The are several inter-related caches. We have layouts which can +reference multiple devices, each of which can reference multiple data servers. +Each data server can be referenced by multiple devices. Each device +can be referenced by multiple layouts. To keep all of this straight, +we need to reference count. + + +struct pnfs_layout_hdr +---------------------- +The on-the-wire command LAYOUTGET corresponds to struct +pnfs_layout_segment, usually referred to by the variable name lseg. +Each nfs_inode may hold a pointer to a cache of these layout +segments in nfsi->layout, of type struct pnfs_layout_hdr. + +We reference the header for the inode pointing to it, across each +outstanding RPC call that references it (LAYOUTGET, LAYOUTRETURN, +LAYOUTCOMMIT), and for each lseg held within. + +Each header is also (when non-empty) put on a list associated with +struct nfs_client (cl_layouts). Being put on this list does not bump +the reference count, as the layout is kept around by the lseg that +keeps it in the list. + +deviceid_cache +-------------- +lsegs reference device ids, which are resolved per nfs_client and +layout driver type. The device ids are held in a RCU cache (struct +nfs4_deviceid_cache). The cache itself is referenced across each +mount. The entries (struct nfs4_deviceid) themselves are held across +the lifetime of each lseg referencing them. + +RCU is used because the deviceid is basically a write once, read many +data structure. The hlist size of 32 buckets needs better +justification, but seems reasonable given that we can have multiple +deviceid's per filesystem, and multiple filesystems per nfs_client. + +The hash code is copied from the nfsd code base. A discussion of +hashing and variations of this algorithm can be found at: +http://groups.google.com/group/comp.lang.c/browse_thread/thread/9522965e2b8d3809 + +data server cache +----------------- +file driver devices refer to data servers, which are kept in a module +level cache. Its reference is held over the lifetime of the deviceid +pointing to it. + +lseg +---- +lseg maintains an extra reference corresponding to the NFS_LSEG_VALID +bit which holds it in the pnfs_layout_hdr's list. When the final lseg +is removed from the pnfs_layout_hdr's list, the NFS_LAYOUT_DESTROYED +bit is set, preventing any new lsegs from being added. + +layout drivers +-------------- + +PNFS utilizes what is called layout drivers. The STD defines 3 basic +layout types: "files" "objects" and "blocks". For each of these types +there is a layout-driver with a common function-vectors table which +are called by the nfs-client pnfs-core to implement the different layout +types. + +Files-layout-driver code is in: fs/nfs/nfs4filelayout.c && nfs4filelayoutdev.c +Objects-layout-deriver code is in: fs/nfs/objlayout/.. directory +Blocks-layout-deriver code is in: fs/nfs/blocklayout/.. directory + +objects-layout setup +-------------------- + +As part of the full STD implementation the objlayoutdriver.ko needs, at times, +to automatically login to yet undiscovered iscsi/osd devices. For this the +driver makes up-calles to a user-mode script called *osd_login* + +The path_name of the script to use is by default: + /sbin/osd_login. +This name can be overridden by the Kernel module parameter: + objlayoutdriver.osd_login_prog + +If Kernel does not find the osd_login_prog path it will zero it out +and will not attempt farther logins. An admin can then write new value +to the objlayoutdriver.osd_login_prog Kernel parameter to re-enable it. + +The /sbin/osd_login is part of the nfs-utils package, and should usually +be installed on distributions that support this Kernel version. + +The API to the login script is as follows: + Usage: $0 -u <URI> -o <OSDNAME> -s <SYSTEMID> + Options: + -u target uri e.g. iscsi://<ip>:<port> + (allways exists) + (More protocols can be defined in the future. + The client does not interpret this string it is + passed unchanged as received from the Server) + -o osdname of the requested target OSD + (Might be empty) + (A string which denotes the OSD name, there is a + limit of 64 chars on this string) + -s systemid of the requested target OSD + (Might be empty) + (This string, if not empty is always an hex + representation of the 20 bytes osd_system_id) + +blocks-layout setup +------------------- + +TODO: Document the setup needs of the blocks layout driver diff --git a/Documentation/filesystems/nfs/rpc-cache.txt b/Documentation/filesystems/nfs/rpc-cache.txt new file mode 100644 index 00000000000..ebcaaee2161 --- /dev/null +++ b/Documentation/filesystems/nfs/rpc-cache.txt @@ -0,0 +1,202 @@ + This document gives a brief introduction to the caching +mechanisms in the sunrpc layer that is used, in particular, +for NFS authentication. + +CACHES +====== +The caching replaces the old exports table and allows for +a wide variety of values to be caches. + +There are a number of caches that are similar in structure though +quite possibly very different in content and use. There is a corpus +of common code for managing these caches. + +Examples of caches that are likely to be needed are: + - mapping from IP address to client name + - mapping from client name and filesystem to export options + - mapping from UID to list of GIDs, to work around NFS's limitation + of 16 gids. + - mappings between local UID/GID and remote UID/GID for sites that + do not have uniform uid assignment + - mapping from network identify to public key for crypto authentication. + +The common code handles such things as: + - general cache lookup with correct locking + - supporting 'NEGATIVE' as well as positive entries + - allowing an EXPIRED time on cache items, and removing + items after they expire, and are no longer in-use. + - making requests to user-space to fill in cache entries + - allowing user-space to directly set entries in the cache + - delaying RPC requests that depend on as-yet incomplete + cache entries, and replaying those requests when the cache entry + is complete. + - clean out old entries as they expire. + +Creating a Cache +---------------- + +1/ A cache needs a datum to store. This is in the form of a + structure definition that must contain a + struct cache_head + as an element, usually the first. + It will also contain a key and some content. + Each cache element is reference counted and contains + expiry and update times for use in cache management. +2/ A cache needs a "cache_detail" structure that + describes the cache. This stores the hash table, some + parameters for cache management, and some operations detailing how + to work with particular cache items. + The operations requires are: + struct cache_head *alloc(void) + This simply allocates appropriate memory and returns + a pointer to the cache_detail embedded within the + structure + void cache_put(struct kref *) + This is called when the last reference to an item is + dropped. The pointer passed is to the 'ref' field + in the cache_head. cache_put should release any + references create by 'cache_init' and, if CACHE_VALID + is set, any references created by cache_update. + It should then release the memory allocated by + 'alloc'. + int match(struct cache_head *orig, struct cache_head *new) + test if the keys in the two structures match. Return + 1 if they do, 0 if they don't. + void init(struct cache_head *orig, struct cache_head *new) + Set the 'key' fields in 'new' from 'orig'. This may + include taking references to shared objects. + void update(struct cache_head *orig, struct cache_head *new) + Set the 'content' fileds in 'new' from 'orig'. + int cache_show(struct seq_file *m, struct cache_detail *cd, + struct cache_head *h) + Optional. Used to provide a /proc file that lists the + contents of a cache. This should show one item, + usually on just one line. + int cache_request(struct cache_detail *cd, struct cache_head *h, + char **bpp, int *blen) + Format a request to be send to user-space for an item + to be instantiated. *bpp is a buffer of size *blen. + bpp should be moved forward over the encoded message, + and *blen should be reduced to show how much free + space remains. Return 0 on success or <0 if not + enough room or other problem. + int cache_parse(struct cache_detail *cd, char *buf, int len) + A message from user space has arrived to fill out a + cache entry. It is in 'buf' of length 'len'. + cache_parse should parse this, find the item in the + cache with sunrpc_cache_lookup, and update the item + with sunrpc_cache_update. + + +3/ A cache needs to be registered using cache_register(). This + includes it on a list of caches that will be regularly + cleaned to discard old data. + +Using a cache +------------- + +To find a value in a cache, call sunrpc_cache_lookup passing a pointer +to the cache_head in a sample item with the 'key' fields filled in. +This will be passed to ->match to identify the target entry. If no +entry is found, a new entry will be create, added to the cache, and +marked as not containing valid data. + +The item returned is typically passed to cache_check which will check +if the data is valid, and may initiate an up-call to get fresh data. +cache_check will return -ENOENT in the entry is negative or if an up +call is needed but not possible, -EAGAIN if an upcall is pending, +or 0 if the data is valid; + +cache_check can be passed a "struct cache_req *". This structure is +typically embedded in the actual request and can be used to create a +deferred copy of the request (struct cache_deferred_req). This is +done when the found cache item is not uptodate, but the is reason to +believe that userspace might provide information soon. When the cache +item does become valid, the deferred copy of the request will be +revisited (->revisit). It is expected that this method will +reschedule the request for processing. + +The value returned by sunrpc_cache_lookup can also be passed to +sunrpc_cache_update to set the content for the item. A second item is +passed which should hold the content. If the item found by _lookup +has valid data, then it is discarded and a new item is created. This +saves any user of an item from worrying about content changing while +it is being inspected. If the item found by _lookup does not contain +valid data, then the content is copied across and CACHE_VALID is set. + +Populating a cache +------------------ + +Each cache has a name, and when the cache is registered, a directory +with that name is created in /proc/net/rpc + +This directory contains a file called 'channel' which is a channel +for communicating between kernel and user for populating the cache. +This directory may later contain other files of interacting +with the cache. + +The 'channel' works a bit like a datagram socket. Each 'write' is +passed as a whole to the cache for parsing and interpretation. +Each cache can treat the write requests differently, but it is +expected that a message written will contain: + - a key + - an expiry time + - a content. +with the intention that an item in the cache with the give key +should be create or updated to have the given content, and the +expiry time should be set on that item. + +Reading from a channel is a bit more interesting. When a cache +lookup fails, or when it succeeds but finds an entry that may soon +expire, a request is lodged for that cache item to be updated by +user-space. These requests appear in the channel file. + +Successive reads will return successive requests. +If there are no more requests to return, read will return EOF, but a +select or poll for read will block waiting for another request to be +added. + +Thus a user-space helper is likely to: + open the channel. + select for readable + read a request + write a response + loop. + +If it dies and needs to be restarted, any requests that have not been +answered will still appear in the file and will be read by the new +instance of the helper. + +Each cache should define a "cache_parse" method which takes a message +written from user-space and processes it. It should return an error +(which propagates back to the write syscall) or 0. + +Each cache should also define a "cache_request" method which +takes a cache item and encodes a request into the buffer +provided. + +Note: If a cache has no active readers on the channel, and has had not +active readers for more than 60 seconds, further requests will not be +added to the channel but instead all lookups that do not find a valid +entry will fail. This is partly for backward compatibility: The +previous nfs exports table was deemed to be authoritative and a +failed lookup meant a definite 'no'. + +request/response format +----------------------- + +While each cache is free to use its own format for requests +and responses over channel, the following is recommended as +appropriate and support routines are available to help: +Each request or response record should be printable ASCII +with precisely one newline character which should be at the end. +Fields within the record should be separated by spaces, normally one. +If spaces, newlines, or nul characters are needed in a field they +much be quoted. two mechanisms are available: +1/ If a field begins '\x' then it must contain an even number of + hex digits, and pairs of these digits provide the bytes in the + field. +2/ otherwise a \ in the field must be followed by 3 octal digits + which give the code for a byte. Other characters are treated + as them selves. At the very least, space, newline, nul, and + '\' must be quoted in this way. diff --git a/Documentation/filesystems/nfs/rpc-server-gss.txt b/Documentation/filesystems/nfs/rpc-server-gss.txt new file mode 100644 index 00000000000..716f4be8e8b --- /dev/null +++ b/Documentation/filesystems/nfs/rpc-server-gss.txt @@ -0,0 +1,91 @@ + +rpcsec_gss support for kernel RPC servers +========================================= + +This document gives references to the standards and protocols used to +implement RPCGSS authentication in kernel RPC servers such as the NFS +server and the NFS client's NFSv4.0 callback server. (But note that +NFSv4.1 and higher don't require the client to act as a server for the +purposes of authentication.) + +RPCGSS is specified in a few IETF documents: + - RFC2203 v1: http://tools.ietf.org/rfc/rfc2203.txt + - RFC5403 v2: http://tools.ietf.org/rfc/rfc5403.txt +and there is a 3rd version being proposed: + - http://tools.ietf.org/id/draft-williams-rpcsecgssv3.txt + (At draft n. 02 at the time of writing) + +Background +---------- + +The RPCGSS Authentication method describes a way to perform GSSAPI +Authentication for NFS. Although GSSAPI is itself completely mechanism +agnostic, in many cases only the KRB5 mechanism is supported by NFS +implementations. + +The Linux kernel, at the moment, supports only the KRB5 mechanism, and +depends on GSSAPI extensions that are KRB5 specific. + +GSSAPI is a complex library, and implementing it completely in kernel is +unwarranted. However GSSAPI operations are fundementally separable in 2 +parts: +- initial context establishment +- integrity/privacy protection (signing and encrypting of individual + packets) + +The former is more complex and policy-independent, but less +performance-sensitive. The latter is simpler and needs to be very fast. + +Therefore, we perform per-packet integrity and privacy protection in the +kernel, but leave the initial context establishment to userspace. We +need upcalls to request userspace to perform context establishment. + +NFS Server Legacy Upcall Mechanism +---------------------------------- + +The classic upcall mechanism uses a custom text based upcall mechanism +to talk to a custom daemon called rpc.svcgssd that is provide by the +nfs-utils package. + +This upcall mechanism has 2 limitations: + +A) It can handle tokens that are no bigger than 2KiB + +In some Kerberos deployment GSSAPI tokens can be quite big, up and +beyond 64KiB in size due to various authorization extensions attacked to +the Kerberos tickets, that needs to be sent through the GSS layer in +order to perform context establishment. + +B) It does not properly handle creds where the user is member of more +than a few housand groups (the current hard limit in the kernel is 65K +groups) due to limitation on the size of the buffer that can be send +back to the kernel (4KiB). + +NFS Server New RPC Upcall Mechanism +----------------------------------- + +The newer upcall mechanism uses RPC over a unix socket to a daemon +called gss-proxy, implemented by a userspace program called Gssproxy. + +The gss_proxy RPC protocol is currently documented here: + + https://fedorahosted.org/gss-proxy/wiki/ProtocolDocumentation + +This upcall mechanism uses the kernel rpc client and connects to the gssproxy +userspace program over a regular unix socket. The gssproxy protocol does not +suffer from the size limitations of the legacy protocol. + +Negotiating Upcall Mechanisms +----------------------------- + +To provide backward compatibility, the kernel defaults to using the +legacy mechanism. To switch to the new mechanism, gss-proxy must bind +to /var/run/gssproxy.sock and then write "1" to +/proc/net/rpc/use-gss-proxy. If gss-proxy dies, it must repeat both +steps. + +Once the upcall mechanism is chosen, it cannot be changed. To prevent +locking into the legacy mechanisms, the above steps must be performed +before starting nfsd. Whoever starts nfsd can guarantee this by reading +from /proc/net/rpc/use-gss-proxy and checking that it contains a +"1"--the read will block until gss-proxy has done its write to the file. diff --git a/Documentation/filesystems/nilfs2.txt b/Documentation/filesystems/nilfs2.txt new file mode 100644 index 00000000000..41c3d332acc --- /dev/null +++ b/Documentation/filesystems/nilfs2.txt @@ -0,0 +1,270 @@ +NILFS2 +------ + +NILFS2 is a log-structured file system (LFS) supporting continuous +snapshotting. In addition to versioning capability of the entire file +system, users can even restore files mistakenly overwritten or +destroyed just a few seconds ago. Since NILFS2 can keep consistency +like conventional LFS, it achieves quick recovery after system +crashes. + +NILFS2 creates a number of checkpoints every few seconds or per +synchronous write basis (unless there is no change). Users can select +significant versions among continuously created checkpoints, and can +change them into snapshots which will be preserved until they are +changed back to checkpoints. + +There is no limit on the number of snapshots until the volume gets +full. Each snapshot is mountable as a read-only file system +concurrently with its writable mount, and this feature is convenient +for online backup. + +The userland tools are included in nilfs-utils package, which is +available from the following download page. At least "mkfs.nilfs2", +"mount.nilfs2", "umount.nilfs2", and "nilfs_cleanerd" (so called +cleaner or garbage collector) are required. Details on the tools are +described in the man pages included in the package. + +Project web page: http://nilfs.sourceforge.net/ +Download page: http://nilfs.sourceforge.net/en/download.html +List info: http://vger.kernel.org/vger-lists.html#linux-nilfs + +Caveats +======= + +Features which NILFS2 does not support yet: + + - atime + - extended attributes + - POSIX ACLs + - quotas + - fsck + - defragmentation + +Mount options +============= + +NILFS2 supports the following mount options: +(*) == default + +barrier(*) This enables/disables the use of write barriers. This +nobarrier requires an IO stack which can support barriers, and + if nilfs gets an error on a barrier write, it will + disable again with a warning. +errors=continue Keep going on a filesystem error. +errors=remount-ro(*) Remount the filesystem read-only on an error. +errors=panic Panic and halt the machine if an error occurs. +cp=n Specify the checkpoint-number of the snapshot to be + mounted. Checkpoints and snapshots are listed by lscp + user command. Only the checkpoints marked as snapshot + are mountable with this option. Snapshot is read-only, + so a read-only mount option must be specified together. +order=relaxed(*) Apply relaxed order semantics that allows modified data + blocks to be written to disk without making a + checkpoint if no metadata update is going. This mode + is equivalent to the ordered data mode of the ext3 + filesystem except for the updates on data blocks still + conserve atomicity. This will improve synchronous + write performance for overwriting. +order=strict Apply strict in-order semantics that preserves sequence + of all file operations including overwriting of data + blocks. That means, it is guaranteed that no + overtaking of events occurs in the recovered file + system after a crash. +norecovery Disable recovery of the filesystem on mount. + This disables every write access on the device for + read-only mounts or snapshots. This option will fail + for r/w mounts on an unclean volume. +discard This enables/disables the use of discard/TRIM commands. +nodiscard(*) The discard/TRIM commands are sent to the underlying + block device when blocks are freed. This is useful + for SSD devices and sparse/thinly-provisioned LUNs. + +Ioctls +====== + +There is some NILFS2 specific functionality which can be accessed by applications +through the system call interfaces. The list of all NILFS2 specific ioctls are +shown in the table below. + +Table of NILFS2 specific ioctls +.............................................................................. + Ioctl Description + NILFS_IOCTL_CHANGE_CPMODE Change mode of given checkpoint between + checkpoint and snapshot state. This ioctl is + used in chcp and mkcp utilities. + + NILFS_IOCTL_DELETE_CHECKPOINT Remove checkpoint from NILFS2 file system. + This ioctl is used in rmcp utility. + + NILFS_IOCTL_GET_CPINFO Return info about requested checkpoints. This + ioctl is used in lscp utility and by + nilfs_cleanerd daemon. + + NILFS_IOCTL_GET_CPSTAT Return checkpoints statistics. This ioctl is + used by lscp, rmcp utilities and by + nilfs_cleanerd daemon. + + NILFS_IOCTL_GET_SUINFO Return segment usage info about requested + segments. This ioctl is used in lssu, + nilfs_resize utilities and by nilfs_cleanerd + daemon. + + NILFS_IOCTL_SET_SUINFO Modify segment usage info of requested + segments. This ioctl is used by + nilfs_cleanerd daemon to skip unnecessary + cleaning operation of segments and reduce + performance penalty or wear of flash device + due to redundant move of in-use blocks. + + NILFS_IOCTL_GET_SUSTAT Return segment usage statistics. This ioctl + is used in lssu, nilfs_resize utilities and + by nilfs_cleanerd daemon. + + NILFS_IOCTL_GET_VINFO Return information on virtual block addresses. + This ioctl is used by nilfs_cleanerd daemon. + + NILFS_IOCTL_GET_BDESCS Return information about descriptors of disk + block numbers. This ioctl is used by + nilfs_cleanerd daemon. + + NILFS_IOCTL_CLEAN_SEGMENTS Do garbage collection operation in the + environment of requested parameters from + userspace. This ioctl is used by + nilfs_cleanerd daemon. + + NILFS_IOCTL_SYNC Make a checkpoint. This ioctl is used in + mkcp utility. + + NILFS_IOCTL_RESIZE Resize NILFS2 volume. This ioctl is used + by nilfs_resize utility. + + NILFS_IOCTL_SET_ALLOC_RANGE Define lower limit of segments in bytes and + upper limit of segments in bytes. This ioctl + is used by nilfs_resize utility. + +NILFS2 usage +============ + +To use nilfs2 as a local file system, simply: + + # mkfs -t nilfs2 /dev/block_device + # mount -t nilfs2 /dev/block_device /dir + +This will also invoke the cleaner through the mount helper program +(mount.nilfs2). + +Checkpoints and snapshots are managed by the following commands. +Their manpages are included in the nilfs-utils package above. + + lscp list checkpoints or snapshots. + mkcp make a checkpoint or a snapshot. + chcp change an existing checkpoint to a snapshot or vice versa. + rmcp invalidate specified checkpoint(s). + +To mount a snapshot, + + # mount -t nilfs2 -r -o cp=<cno> /dev/block_device /snap_dir + +where <cno> is the checkpoint number of the snapshot. + +To unmount the NILFS2 mount point or snapshot, simply: + + # umount /dir + +Then, the cleaner daemon is automatically shut down by the umount +helper program (umount.nilfs2). + +Disk format +=========== + +A nilfs2 volume is equally divided into a number of segments except +for the super block (SB) and segment #0. A segment is the container +of logs. Each log is composed of summary information blocks, payload +blocks, and an optional super root block (SR): + + ______________________________________________________ + | |SB| | Segment | Segment | Segment | ... | Segment | | + |_|__|_|____0____|____1____|____2____|_____|____N____|_| + 0 +1K +4K +8M +16M +24M +(8MB x N) + . . (Typical offsets for 4KB-block) + . . + .______________________. + | log | log |... | log | + |__1__|__2__|____|__m__| + . . + . . + . . + .______________________________. + | Summary | Payload blocks |SR| + |_blocks__|_________________|__| + +The payload blocks are organized per file, and each file consists of +data blocks and B-tree node blocks: + + |<--- File-A --->|<--- File-B --->| + _______________________________________________________________ + | Data blocks | B-tree blocks | Data blocks | B-tree blocks | ... + _|_____________|_______________|_____________|_______________|_ + + +Since only the modified blocks are written in the log, it may have +files without data blocks or B-tree node blocks. + +The organization of the blocks is recorded in the summary information +blocks, which contains a header structure (nilfs_segment_summary), per +file structures (nilfs_finfo), and per block structures (nilfs_binfo): + + _________________________________________________________________________ + | Summary | finfo | binfo | ... | binfo | finfo | binfo | ... | binfo |... + |_blocks__|___A___|_(A,1)_|_____|(A,Na)_|___B___|_(B,1)_|_____|(B,Nb)_|___ + + +The logs include regular files, directory files, symbolic link files +and several meta data files. The mata data files are the files used +to maintain file system meta data. The current version of NILFS2 uses +the following meta data files: + + 1) Inode file (ifile) -- Stores on-disk inodes + 2) Checkpoint file (cpfile) -- Stores checkpoints + 3) Segment usage file (sufile) -- Stores allocation state of segments + 4) Data address translation file -- Maps virtual block numbers to usual + (DAT) block numbers. This file serves to + make on-disk blocks relocatable. + +The following figure shows a typical organization of the logs: + + _________________________________________________________________________ + | Summary | regular file | file | ... | ifile | cpfile | sufile | DAT |SR| + |_blocks__|_or_directory_|_______|_____|_______|________|________|_____|__| + + +To stride over segment boundaries, this sequence of files may be split +into multiple logs. The sequence of logs that should be treated as +logically one log, is delimited with flags marked in the segment +summary. The recovery code of nilfs2 looks this boundary information +to ensure atomicity of updates. + +The super root block is inserted for every checkpoints. It includes +three special inodes, inodes for the DAT, cpfile, and sufile. Inodes +of regular files, directories, symlinks and other special files, are +included in the ifile. The inode of ifile itself is included in the +corresponding checkpoint entry in the cpfile. Thus, the hierarchy +among NILFS2 files can be depicted as follows: + + Super block (SB) + | + v + Super root block (the latest cno=xx) + |-- DAT + |-- sufile + `-- cpfile + |-- ifile (cno=c1) + |-- ifile (cno=c2) ---- file (ino=i1) + : : |-- file (ino=i2) + `-- ifile (cno=xx) |-- file (ino=i3) + : : + `-- file (ino=yy) + ( regular file, directory, or symlink ) + +For detail on the format of each file, please see include/linux/nilfs2_fs.h. diff --git a/Documentation/filesystems/ntfs.txt b/Documentation/filesystems/ntfs.txt index e79ee2db183..61947facfc0 100644 --- a/Documentation/filesystems/ntfs.txt +++ b/Documentation/filesystems/ntfs.txt @@ -40,7 +40,7 @@ Web site ======== There is plenty of additional information on the linux-ntfs web site -at http://linux-ntfs.sourceforge.net/ +at http://www.linux-ntfs.org/ The web site has a lot of additional information, such as a comprehensive FAQ, documentation on the NTFS on-disk format, information on the Linux-NTFS @@ -272,7 +272,7 @@ And you would know that /dev/hda2 has a size of 37768814 - 4209030 + 1 = For Win2k and later dynamic disks, you can for example use the ldminfo utility which is part of the Linux LDM tools (the latest version at the time of writing is linux-ldm-0.0.8.tar.bz2). You can download it from: - http://linux-ntfs.sourceforge.net/downloads.html + http://www.linux-ntfs.org/ Simply extract the downloaded archive (tar xvjf linux-ldm-0.0.8.tar.bz2), go into it (cd linux-ldm-0.0.8) and change to the test directory (cd test). You will find the precompiled (i386) ldminfo utility there. NOTE: You will not be @@ -350,7 +350,7 @@ Note the "Should sync?" parameter "nosync" means that the two mirrors are already in sync which will be the case on a clean shutdown of Windows. If the mirrors are not clean, you can specify the "sync" option instead of "nosync" and the Device-Mapper driver will then copy the entirety of the "Source Device" -to the "Target Device" or if you specified multipled target devices to all of +to the "Target Device" or if you specified multiple target devices to all of them. Once you have your table, save it in a file somewhere (e.g. /etc/ntfsvolume1), @@ -455,8 +455,11 @@ not have this problem with odd numbers of sectors. ChangeLog ========= -Note, a technical ChangeLog aimed at kernel hackers is in fs/ntfs/ChangeLog. - +2.1.30: + - Fix writev() (it kept writing the first segment over and over again + instead of moving onto subsequent segments). + - Fix crash in ntfs_mft_record_alloc() when mapping the new extent mft + record failed. 2.1.29: - Fix a deadlock when mounting read-write. 2.1.28: diff --git a/Documentation/filesystems/ocfs2.txt b/Documentation/filesystems/ocfs2.txt index c318a8bbb1e..7618a287aa4 100644 --- a/Documentation/filesystems/ocfs2.txt +++ b/Documentation/filesystems/ocfs2.txt @@ -20,21 +20,18 @@ Lots of code taken from ext3 and other projects. Authors in alphabetical order: Joel Becker <joel.becker@oracle.com> Zach Brown <zach.brown@oracle.com> -Mark Fasheh <mark.fasheh@oracle.com> +Mark Fasheh <mfasheh@suse.com> Kurt Hackel <kurt.hackel@oracle.com> +Tao Ma <tao.ma@oracle.com> Sunil Mushran <sunil.mushran@oracle.com> Manish Singh <manish.singh@oracle.com> +Tiger Yang <tiger.yang@oracle.com> Caveats ======= Features which OCFS2 does not support yet: - - extended attributes - - quotas - - cluster aware flock - - cluster aware lockf - Directory change notification (F_NOTIFY) - Distributed Caching (F_SETLEASE/F_GETLEASE/break_lease) - - POSIX ACLs Mount options ============= @@ -49,9 +46,15 @@ errors=panic Panic and halt the machine if an error occurs. intr (*) Allow signals to interrupt cluster operations. nointr Do not allow signals to interrupt cluster operations. +noatime Do not update access time. +relatime(*) Update atime if the previous atime is older than + mtime or ctime +strictatime Always update atime, but the minimum update interval + is specified by atime_quantum. atime_quantum=60(*) OCFS2 will not update atime unless this number of seconds has passed since the last update. - Set to zero to always update atime. + Set to zero to always update atime. This option need + work with strictatime. data=ordered (*) All data are forced directly out to the main file system prior to its metadata being committed to the journal. @@ -74,5 +77,26 @@ commit=nrsec (*) Ocfs2 can be told to sync all its data and metadata performance. localalloc=8(*) Allows custom localalloc size in MB. If the value is too large, the fs will silently revert it to the default. - Localalloc is not enabled for local mounts. localflocks This disables cluster aware flock. +inode64 Indicates that Ocfs2 is allowed to create inodes at + any location in the filesystem, including those which + will result in inode numbers occupying more than 32 + bits of significance. +user_xattr (*) Enables Extended User Attributes. +nouser_xattr Disables Extended User Attributes. +acl Enables POSIX Access Control Lists support. +noacl (*) Disables POSIX Access Control Lists support. +resv_level=2 (*) Set how aggressive allocation reservations will be. + Valid values are between 0 (reservations off) to 8 + (maximum space for reservations). +dir_resv_level= (*) By default, directory reservations will scale with file + reservations - users should rarely need to change this + value. If allocation reservations are turned off, this + option will have no effect. +coherency=full (*) Disallow concurrent O_DIRECT writes, cluster inode + lock will be taken to force other nodes drop cache, + therefore full cluster coherency is guaranteed even + for O_DIRECT writes. +coherency=buffered Allow concurrent O_DIRECT writes without EX lock among + nodes, which gains high performance at risk of getting + stale data on other nodes. diff --git a/Documentation/filesystems/omfs.txt b/Documentation/filesystems/omfs.txt new file mode 100644 index 00000000000..1d0d41ff5c6 --- /dev/null +++ b/Documentation/filesystems/omfs.txt @@ -0,0 +1,106 @@ +Optimized MPEG Filesystem (OMFS) + +Overview +======== + +OMFS is a filesystem created by SonicBlue for use in the ReplayTV DVR +and Rio Karma MP3 player. The filesystem is extent-based, utilizing +block sizes from 2k to 8k, with hash-based directories. This +filesystem driver may be used to read and write disks from these +devices. + +Note, it is not recommended that this FS be used in place of a general +filesystem for your own streaming media device. Native Linux filesystems +will likely perform better. + +More information is available at: + + http://linux-karma.sf.net/ + +Various utilities, including mkomfs and omfsck, are included with +omfsprogs, available at: + + http://bobcopeland.com/karma/ + +Instructions are included in its README. + +Options +======= + +OMFS supports the following mount-time options: + + uid=n - make all files owned by specified user + gid=n - make all files owned by specified group + umask=xxx - set permission umask to xxx + fmask=xxx - set umask to xxx for files + dmask=xxx - set umask to xxx for directories + +Disk format +=========== + +OMFS discriminates between "sysblocks" and normal data blocks. The sysblock +group consists of super block information, file metadata, directory structures, +and extents. Each sysblock has a header containing CRCs of the entire +sysblock, and may be mirrored in successive blocks on the disk. A sysblock may +have a smaller size than a data block, but since they are both addressed by the +same 64-bit block number, any remaining space in the smaller sysblock is +unused. + +Sysblock header information: + +struct omfs_header { + __be64 h_self; /* FS block where this is located */ + __be32 h_body_size; /* size of useful data after header */ + __be16 h_crc; /* crc-ccitt of body_size bytes */ + char h_fill1[2]; + u8 h_version; /* version, always 1 */ + char h_type; /* OMFS_INODE_X */ + u8 h_magic; /* OMFS_IMAGIC */ + u8 h_check_xor; /* XOR of header bytes before this */ + __be32 h_fill2; +}; + +Files and directories are both represented by omfs_inode: + +struct omfs_inode { + struct omfs_header i_head; /* header */ + __be64 i_parent; /* parent containing this inode */ + __be64 i_sibling; /* next inode in hash bucket */ + __be64 i_ctime; /* ctime, in milliseconds */ + char i_fill1[35]; + char i_type; /* OMFS_[DIR,FILE] */ + __be32 i_fill2; + char i_fill3[64]; + char i_name[OMFS_NAMELEN]; /* filename */ + __be64 i_size; /* size of file, in bytes */ +}; + +Directories in OMFS are implemented as a large hash table. Filenames are +hashed then prepended into the bucket list beginning at OMFS_DIR_START. +Lookup requires hashing the filename, then seeking across i_sibling pointers +until a match is found on i_name. Empty buckets are represented by block +pointers with all-1s (~0). + +A file is an omfs_inode structure followed by an extent table beginning at +OMFS_EXTENT_START: + +struct omfs_extent_entry { + __be64 e_cluster; /* start location of a set of blocks */ + __be64 e_blocks; /* number of blocks after e_cluster */ +}; + +struct omfs_extent { + __be64 e_next; /* next extent table location */ + __be32 e_extent_count; /* total # extents in this table */ + __be32 e_fill; + struct omfs_extent_entry e_entry; /* start of extent entries */ +}; + +Each extent holds the block offset followed by number of blocks allocated to +the extent. The final extent in each table is a terminator with e_cluster +being ~0 and e_blocks being ones'-complement of the total number of blocks +in the table. + +If this table overflows, a continuation inode is written and pointed to by +e_next. These have a header but lack the rest of the inode structure. + diff --git a/Documentation/filesystems/path-lookup.txt b/Documentation/filesystems/path-lookup.txt new file mode 100644 index 00000000000..3571667c710 --- /dev/null +++ b/Documentation/filesystems/path-lookup.txt @@ -0,0 +1,382 @@ +Path walking and name lookup locking +==================================== + +Path resolution is the finding a dentry corresponding to a path name string, by +performing a path walk. Typically, for every open(), stat() etc., the path name +will be resolved. Paths are resolved by walking the namespace tree, starting +with the first component of the pathname (eg. root or cwd) with a known dentry, +then finding the child of that dentry, which is named the next component in the +path string. Then repeating the lookup from the child dentry and finding its +child with the next element, and so on. + +Since it is a frequent operation for workloads like multiuser environments and +web servers, it is important to optimize this code. + +Path walking synchronisation history: +Prior to 2.5.10, dcache_lock was acquired in d_lookup (dcache hash lookup) and +thus in every component during path look-up. Since 2.5.10 onwards, fast-walk +algorithm changed this by holding the dcache_lock at the beginning and walking +as many cached path component dentries as possible. This significantly +decreases the number of acquisition of dcache_lock. However it also increases +the lock hold time significantly and affects performance in large SMP machines. +Since 2.5.62 kernel, dcache has been using a new locking model that uses RCU to +make dcache look-up lock-free. + +All the above algorithms required taking a lock and reference count on the +dentry that was looked up, so that may be used as the basis for walking the +next path element. This is inefficient and unscalable. It is inefficient +because of the locks and atomic operations required for every dentry element +slows things down. It is not scalable because many parallel applications that +are path-walk intensive tend to do path lookups starting from a common dentry +(usually, the root "/" or current working directory). So contention on these +common path elements causes lock and cacheline queueing. + +Since 2.6.38, RCU is used to make a significant part of the entire path walk +(including dcache look-up) completely "store-free" (so, no locks, atomics, or +even stores into cachelines of common dentries). This is known as "rcu-walk" +path walking. + +Path walking overview +===================== + +A name string specifies a start (root directory, cwd, fd-relative) and a +sequence of elements (directory entry names), which together refer to a path in +the namespace. A path is represented as a (dentry, vfsmount) tuple. The name +elements are sub-strings, separated by '/'. + +Name lookups will want to find a particular path that a name string refers to +(usually the final element, or parent of final element). This is done by taking +the path given by the name's starting point (which we know in advance -- eg. +current->fs->cwd or current->fs->root) as the first parent of the lookup. Then +iteratively for each subsequent name element, look up the child of the current +parent with the given name and if it is not the desired entry, make it the +parent for the next lookup. + +A parent, of course, must be a directory, and we must have appropriate +permissions on the parent inode to be able to walk into it. + +Turning the child into a parent for the next lookup requires more checks and +procedures. Symlinks essentially substitute the symlink name for the target +name in the name string, and require some recursive path walking. Mount points +must be followed into (thus changing the vfsmount that subsequent path elements +refer to), switching from the mount point path to the root of the particular +mounted vfsmount. These behaviours are variously modified depending on the +exact path walking flags. + +Path walking then must, broadly, do several particular things: +- find the start point of the walk; +- perform permissions and validity checks on inodes; +- perform dcache hash name lookups on (parent, name element) tuples; +- traverse mount points; +- traverse symlinks; +- lookup and create missing parts of the path on demand. + +Safe store-free look-up of dcache hash table +============================================ + +Dcache name lookup +------------------ +In order to lookup a dcache (parent, name) tuple, we take a hash on the tuple +and use that to select a bucket in the dcache-hash table. The list of entries +in that bucket is then walked, and we do a full comparison of each entry +against our (parent, name) tuple. + +The hash lists are RCU protected, so list walking is not serialised with +concurrent updates (insertion, deletion from the hash). This is a standard RCU +list application with the exception of renames, which will be covered below. + +Parent and name members of a dentry, as well as its membership in the dcache +hash, and its inode are protected by the per-dentry d_lock spinlock. A +reference is taken on the dentry (while the fields are verified under d_lock), +and this stabilises its d_inode pointer and actual inode. This gives a stable +point to perform the next step of our path walk against. + +These members are also protected by d_seq seqlock, although this offers +read-only protection and no durability of results, so care must be taken when +using d_seq for synchronisation (see seqcount based lookups, below). + +Renames +------- +Back to the rename case. In usual RCU protected lists, the only operations that +will happen to an object is insertion, and then eventually removal from the +list. The object will not be reused until an RCU grace period is complete. +This ensures the RCU list traversal primitives can run over the object without +problems (see RCU documentation for how this works). + +However when a dentry is renamed, its hash value can change, requiring it to be +moved to a new hash list. Allocating and inserting a new alias would be +expensive and also problematic for directory dentries. Latency would be far to +high to wait for a grace period after removing the dentry and before inserting +it in the new hash bucket. So what is done is to insert the dentry into the +new list immediately. + +However, when the dentry's list pointers are updated to point to objects in the +new list before waiting for a grace period, this can result in a concurrent RCU +lookup of the old list veering off into the new (incorrect) list and missing +the remaining dentries on the list. + +There is no fundamental problem with walking down the wrong list, because the +dentry comparisons will never match. However it is fatal to miss a matching +dentry. So a seqlock is used to detect when a rename has occurred, and so the +lookup can be retried. + + 1 2 3 + +---+ +---+ +---+ +hlist-->| N-+->| N-+->| N-+-> +head <--+-P |<-+-P |<-+-P | + +---+ +---+ +---+ + +Rename of dentry 2 may require it deleted from the above list, and inserted +into a new list. Deleting 2 gives the following list. + + 1 3 + +---+ +---+ (don't worry, the longer pointers do not +hlist-->| N-+-------->| N-+-> impose a measurable performance overhead +head <--+-P |<--------+-P | on modern CPUs) + +---+ +---+ + ^ 2 ^ + | +---+ | + | | N-+----+ + +----+-P | + +---+ + +This is a standard RCU-list deletion, which leaves the deleted object's +pointers intact, so a concurrent list walker that is currently looking at +object 2 will correctly continue to object 3 when it is time to traverse the +next object. + +However, when inserting object 2 onto a new list, we end up with this: + + 1 3 + +---+ +---+ +hlist-->| N-+-------->| N-+-> +head <--+-P |<--------+-P | + +---+ +---+ + 2 + +---+ + | N-+----> + <----+-P | + +---+ + +Because we didn't wait for a grace period, there may be a concurrent lookup +still at 2. Now when it follows 2's 'next' pointer, it will walk off into +another list without ever having checked object 3. + +A related, but distinctly different, issue is that of rename atomicity versus +lookup operations. If a file is renamed from 'A' to 'B', a lookup must only +find either 'A' or 'B'. So if a lookup of 'A' returns NULL, a subsequent lookup +of 'B' must succeed (note the reverse is not true). + +Between deleting the dentry from the old hash list, and inserting it on the new +hash list, a lookup may find neither 'A' nor 'B' matching the dentry. The same +rename seqlock is also used to cover this race in much the same way, by +retrying a negative lookup result if a rename was in progress. + +Seqcount based lookups +---------------------- +In refcount based dcache lookups, d_lock is used to serialise access to +the dentry, stabilising it while comparing its name and parent and then +taking a reference count (the reference count then gives a stable place to +start the next part of the path walk from). + +As explained above, we would like to do path walking without taking locks or +reference counts on intermediate dentries along the path. To do this, a per +dentry seqlock (d_seq) is used to take a "coherent snapshot" of what the dentry +looks like (its name, parent, and inode). That snapshot is then used to start +the next part of the path walk. When loading the coherent snapshot under d_seq, +care must be taken to load the members up-front, and use those pointers rather +than reloading from the dentry later on (otherwise we'd have interesting things +like d_inode going NULL underneath us, if the name was unlinked). + +Also important is to avoid performing any destructive operations (pretty much: +no non-atomic stores to shared data), and to recheck the seqcount when we are +"done" with the operation. Retry or abort if the seqcount does not match. +Avoiding destructive or changing operations means we can easily unwind from +failure. + +What this means is that a caller, provided they are holding RCU lock to +protect the dentry object from disappearing, can perform a seqcount based +lookup which does not increment the refcount on the dentry or write to +it in any way. This returned dentry can be used for subsequent operations, +provided that d_seq is rechecked after that operation is complete. + +Inodes are also rcu freed, so the seqcount lookup dentry's inode may also be +queried for permissions. + +With this two parts of the puzzle, we can do path lookups without taking +locks or refcounts on dentry elements. + +RCU-walk path walking design +============================ + +Path walking code now has two distinct modes, ref-walk and rcu-walk. ref-walk +is the traditional[*] way of performing dcache lookups using d_lock to +serialise concurrent modifications to the dentry and take a reference count on +it. ref-walk is simple and obvious, and may sleep, take locks, etc while path +walking is operating on each dentry. rcu-walk uses seqcount based dentry +lookups, and can perform lookup of intermediate elements without any stores to +shared data in the dentry or inode. rcu-walk can not be applied to all cases, +eg. if the filesystem must sleep or perform non trivial operations, rcu-walk +must be switched to ref-walk mode. + +[*] RCU is still used for the dentry hash lookup in ref-walk, but not the full + path walk. + +Where ref-walk uses a stable, refcounted ``parent'' to walk the remaining +path string, rcu-walk uses a d_seq protected snapshot. When looking up a +child of this parent snapshot, we open d_seq critical section on the child +before closing d_seq critical section on the parent. This gives an interlocking +ladder of snapshots to walk down. + + + proc 101 + /----------------\ + / comm: "vi" \ + / fs.root: dentry0 \ + \ fs.cwd: dentry2 / + \ / + \----------------/ + +So when vi wants to open("/home/npiggin/test.c", O_RDWR), then it will +start from current->fs->root, which is a pinned dentry. Alternatively, +"./test.c" would start from cwd; both names refer to the same path in +the context of proc101. + + dentry 0 + +---------------------+ rcu-walk begins here, we note d_seq, check the + | name: "/" | inode's permission, and then look up the next + | inode: 10 | path element which is "home"... + | children:"home", ...| + +---------------------+ + | + dentry 1 V + +---------------------+ ... which brings us here. We find dentry1 via + | name: "home" | hash lookup, then note d_seq and compare name + | inode: 678 | string and parent pointer. When we have a match, + | children:"npiggin" | we now recheck the d_seq of dentry0. Then we + +---------------------+ check inode and look up the next element. + | + dentry2 V + +---------------------+ Note: if dentry0 is now modified, lookup is + | name: "npiggin" | not necessarily invalid, so we need only keep a + | inode: 543 | parent for d_seq verification, and grandparents + | children:"a.c", ... | can be forgotten. + +---------------------+ + | + dentry3 V + +---------------------+ At this point we have our destination dentry. + | name: "a.c" | We now take its d_lock, verify d_seq of this + | inode: 14221 | dentry. If that checks out, we can increment + | children:NULL | its refcount because we're holding d_lock. + +---------------------+ + +Taking a refcount on a dentry from rcu-walk mode, by taking its d_lock, +re-checking its d_seq, and then incrementing its refcount is called +"dropping rcu" or dropping from rcu-walk into ref-walk mode. + +It is, in some sense, a bit of a house of cards. If the seqcount check of the +parent snapshot fails, the house comes down, because we had closed the d_seq +section on the grandparent, so we have nothing left to stand on. In that case, +the path walk must be fully restarted (which we do in ref-walk mode, to avoid +live locks). It is costly to have a full restart, but fortunately they are +quite rare. + +When we reach a point where sleeping is required, or a filesystem callout +requires ref-walk, then instead of restarting the walk, we attempt to drop rcu +at the last known good dentry we have. Avoiding a full restart in ref-walk in +these cases is fundamental for performance and scalability because blocking +operations such as creates and unlinks are not uncommon. + +The detailed design for rcu-walk is like this: +* LOOKUP_RCU is set in nd->flags, which distinguishes rcu-walk from ref-walk. +* Take the RCU lock for the entire path walk, starting with the acquiring + of the starting path (eg. root/cwd/fd-path). So now dentry refcounts are + not required for dentry persistence. +* synchronize_rcu is called when unregistering a filesystem, so we can + access d_ops and i_ops during rcu-walk. +* Similarly take the vfsmount lock for the entire path walk. So now mnt + refcounts are not required for persistence. Also we are free to perform mount + lookups, and to assume dentry mount points and mount roots are stable up and + down the path. +* Have a per-dentry seqlock to protect the dentry name, parent, and inode, + so we can load this tuple atomically, and also check whether any of its + members have changed. +* Dentry lookups (based on parent, candidate string tuple) recheck the parent + sequence after the child is found in case anything changed in the parent + during the path walk. +* inode is also RCU protected so we can load d_inode and use the inode for + limited things. +* i_mode, i_uid, i_gid can be tested for exec permissions during path walk. +* i_op can be loaded. +* When the destination dentry is reached, drop rcu there (ie. take d_lock, + verify d_seq, increment refcount). +* If seqlock verification fails anywhere along the path, do a full restart + of the path lookup in ref-walk mode. -ECHILD tends to be used (for want of + a better errno) to signal an rcu-walk failure. + +The cases where rcu-walk cannot continue are: +* NULL dentry (ie. any uncached path element) +* Following links + +It may be possible eventually to make following links rcu-walk aware. + +Uncached path elements will always require dropping to ref-walk mode, at the +very least because i_mutex needs to be grabbed, and objects allocated. + +Final note: +"store-free" path walking is not strictly store free. We take vfsmount lock +and refcounts (both of which can be made per-cpu), and we also store to the +stack (which is essentially CPU-local), and we also have to take locks and +refcount on final dentry. + +The point is that shared data, where practically possible, is not locked +or stored into. The result is massive improvements in performance and +scalability of path resolution. + + +Interesting statistics +====================== + +The following table gives rcu lookup statistics for a few simple workloads +(2s12c24t Westmere, debian non-graphical system). Ungraceful are attempts to +drop rcu that fail due to d_seq failure and requiring the entire path lookup +again. Other cases are successful rcu-drops that are required before the final +element, nodentry for missing dentry, revalidate for filesystem revalidate +routine requiring rcu drop, permission for permission check requiring drop, +and link for symlink traversal requiring drop. + + rcu-lookups restart nodentry link revalidate permission +bootup 47121 0 4624 1010 10283 7852 +dbench 25386793 0 6778659(26.7%) 55 549 1156 +kbuild 2696672 10 64442(2.3%) 108764(4.0%) 1 1590 +git diff 39605 0 28 2 0 106 +vfstest 24185492 4945 708725(2.9%) 1076136(4.4%) 0 2651 + +What this shows is that failed rcu-walk lookups, ie. ones that are restarted +entirely with ref-walk, are quite rare. Even the "vfstest" case which +specifically has concurrent renames/mkdir/rmdir/ creat/unlink/etc to exercise +such races is not showing a huge amount of restarts. + +Dropping from rcu-walk to ref-walk mean that we have encountered a dentry where +the reference count needs to be taken for some reason. This is either because +we have reached the target of the path walk, or because we have encountered a +condition that can't be resolved in rcu-walk mode. Ideally, we drop rcu-walk +only when we have reached the target dentry, so the other statistics show where +this does not happen. + +Note that a graceful drop from rcu-walk mode due to something such as the +dentry not existing (which can be common) is not necessarily a failure of +rcu-walk scheme, because some elements of the path may have been walked in +rcu-walk mode. The further we get from common path elements (such as cwd or +root), the less contended the dentry is likely to be. The closer we are to +common path elements, the more likely they will exist in dentry cache. + + +Papers and other documentation on dcache locking +================================================ + +1. Scaling dcache with RCU (http://linuxjournal.com/article.php?sid=7124). + +2. http://lse.sourceforge.net/locking/dcache/dcache.html + + diff --git a/Documentation/filesystems/pohmelfs/design_notes.txt b/Documentation/filesystems/pohmelfs/design_notes.txt new file mode 100644 index 00000000000..8aef9133570 --- /dev/null +++ b/Documentation/filesystems/pohmelfs/design_notes.txt @@ -0,0 +1,72 @@ +POHMELFS: Parallel Optimized Host Message Exchange Layered File System. + + Evgeniy Polyakov <zbr@ioremap.net> + +Homepage: http://www.ioremap.net/projects/pohmelfs + +POHMELFS first began as a network filesystem with coherent local data and +metadata caches but is now evolving into a parallel distributed filesystem. + +Main features of this FS include: + * Locally coherent cache for data and metadata with (potentially) byte-range locks. + Since all Linux filesystems lock the whole inode during writing, algorithm + is very simple and does not use byte-ranges, although they are sent in + locking messages. + * Completely async processing of all events except creation of hard and symbolic + links, and rename events. + Object creation and data reading and writing are processed asynchronously. + * Flexible object architecture optimized for network processing. + Ability to create long paths to objects and remove arbitrarily huge + directories with a single network command. + (like removing the whole kernel tree via a single network command). + * Very high performance. + * Fast and scalable multithreaded userspace server. Being in userspace it works + with any underlying filesystem and still is much faster than async in-kernel NFS one. + * Client is able to switch between different servers (if one goes down, client + automatically reconnects to second and so on). + * Transactions support. Full failover for all operations. + Resending transactions to different servers on timeout or error. + * Read request (data read, directory listing, lookup requests) balancing between multiple servers. + * Write requests are replicated to multiple servers and completed only when all of them are acked. + * Ability to add and/or remove servers from the working set at run-time. + * Strong authentification and possible data encryption in network channel. + * Extended attributes support. + +POHMELFS is based on transactions, which are potentially long-standing objects that live +in the client's memory. Each transaction contains all the information needed to process a given +command (or set of commands, which is frequently used during data writing: single transactions +can contain creation and data writing commands). Transactions are committed by all the servers +to which they are sent and, in case of failures, are eventually resent or dropped with an error. +For example, reading will return an error if no servers are available. + +POHMELFS uses a asynchronous approach to data processing. Courtesy of transactions, it is +possible to detach replies from requests and, if the command requires data to be received, the +caller sleeps waiting for it. Thus, it is possible to issue multiple read commands to different +servers and async threads will pick up replies in parallel, find appropriate transactions in the +system and put the data where it belongs (like the page or inode cache). + +The main feature of POHMELFS is writeback data and the metadata cache. +Only a few non-performance critical operations use the write-through cache and +are synchronous: hard and symbolic link creation, and object rename. Creation, +removal of objects and data writing are asynchronous and are sent to +the server during system writeback. Only one writer at a time is allowed for any +given inode, which is guarded by an appropriate locking protocol. +Because of this feature, POHMELFS is extremely fast at metadata intensive +workloads and can fully utilize the bandwidth to the servers when doing bulk +data transfers. + +POHMELFS clients operate with a working set of servers and are capable of balancing read-only +operations (like lookups or directory listings) between them according to IO priorities. +Administrators can add or remove servers from the set at run-time via special commands (described +in Documentation/filesystems/pohmelfs/info.txt file). Writes are replicated to all servers, which +are connected with write permission turned on. IO priority and permissions can be changed in +run-time. + +POHMELFS is capable of full data channel encryption and/or strong crypto hashing. +One can select any kernel supported cipher, encryption mode, hash type and operation mode +(hmac or digest). It is also possible to use both or neither (default). Crypto configuration +is checked during mount time and, if the server does not support it, appropriate capabilities +will be disabled or mount will fail (if 'crypto_fail_unsupported' mount option is specified). +Crypto performance heavily depends on the number of crypto threads, which asynchronously perform +crypto operations and send the resulting data to server or submit it up the stack. This number +can be controlled via a mount option. diff --git a/Documentation/filesystems/pohmelfs/info.txt b/Documentation/filesystems/pohmelfs/info.txt new file mode 100644 index 00000000000..db2e4139362 --- /dev/null +++ b/Documentation/filesystems/pohmelfs/info.txt @@ -0,0 +1,99 @@ +POHMELFS usage information. + +Mount options. +All but index, number of crypto threads and maximum IO size can changed via remount. + +idx=%u + Each mountpoint is associated with a special index via this option. + Administrator can add or remove servers from the given index, so all mounts, + which were attached to it, are updated. + Default it is 0. + +trans_scan_timeout=%u + This timeout, expressed in milliseconds, specifies time to scan transaction + trees looking for stale requests, which have to be resent, or if number of + retries exceed specified limit, dropped with error. + Default is 5 seconds. + +drop_scan_timeout=%u + Internal timeout, expressed in milliseconds, which specifies how frequently + inodes marked to be dropped are freed. It also specifies how frequently + the system checks that servers have to be added or removed from current working set. + Default is 1 second. + +wait_on_page_timeout=%u + Number of milliseconds to wait for reply from remote server for data reading command. + If this timeout is exceeded, reading returns an error. + Default is 5 seconds. + +trans_retries=%u + This is the number of times that a transaction will be resent to a server that did + not answer for the last @trans_scan_timeout milliseconds. + When the number of resends exceeds this limit, the transaction is completed with error. + Default is 5 resends. + +crypto_thread_num=%u + Number of crypto processing threads. Threads are used both for RX and TX traffic. + Default is 2, or no threads if crypto operations are not supported. + +trans_max_pages=%u + Maximum number of pages in a single transaction. This parameter also controls + the number of pages, allocated for crypto processing (each crypto thread has + pool of pages, the number of which is equal to 'trans_max_pages'. + Default is 100 pages. + +crypto_fail_unsupported + If specified, mount will fail if the server does not support requested crypto operations. + By default mount will disable non-matching crypto operations. + +mcache_timeout=%u + Maximum number of milliseconds to wait for the mcache objects to be processed. + Mcache includes locks (given lock should be granted by server), attributes (they should be + fully received in the given timeframe). + Default is 5 seconds. + +Usage examples. + +Add server server1.net:1025 into the working set with index $idx +with appropriate hash algorithm and key file and cipher algorithm, mode and key file: +$cfg A add -a server1.net -p 1025 -i $idx -K $hash_key -k $cipher_key + +Mount filesystem with given index $idx to /mnt mountpoint. +Client will connect to all servers specified in the working set via previous command: +mount -t pohmel -o idx=$idx q /mnt + +Change permissions to read-only (-I 1 option, '-I 2' - write-only, 3 - rw): +$cfg A modify -a server1.net -p 1025 -i $idx -I 1 + +Change IO priority to 123 (node with the highest priority gets read requests). +$cfg A modify -a server1.net -p 1025 -i $idx -P 123 + +One can check currect status of all connections in the mountstats file: +# cat /proc/$PID/mountstats +... +device none mounted on /mnt with fstype pohmel +idx addr(:port) socket_type protocol active priority permissions +0 server1.net:1026 1 6 1 250 1 +0 server2.net:1025 1 6 1 123 3 + +Server installation. + +Creating a server, which listens at port 1025 and 0.0.0.0 address. +Working root directory (note, that server chroots there, so you have to have appropriate permissions) +is set to /mnt, server will negotiate hash/cipher with client, in case client requested it, there +are appropriate key files. +Number of working threads is set to 10. + +# ./fserver -a 0.0.0.0 -p 1025 -r /mnt -w 10 -K hash_key -k cipher_key + + -A 6 - listen on ipv6 address. Default: Disabled. + -r root - path to root directory. Default: /tmp. + -a addr - listen address. Default: 0.0.0.0. + -p port - listen port. Default: 1025. + -w workers - number of workers per connected client. Default: 1. + -K file - hash key size. Default: none. + -k file - cipher key size. Default: none. + -h - this help. + +Number of worker threads specifies how many workers will be created for each client. +Bulk single-client transafers usually are better handled with smaller number (like 1-3). diff --git a/Documentation/filesystems/pohmelfs/network_protocol.txt b/Documentation/filesystems/pohmelfs/network_protocol.txt new file mode 100644 index 00000000000..c680b4b5353 --- /dev/null +++ b/Documentation/filesystems/pohmelfs/network_protocol.txt @@ -0,0 +1,227 @@ +POHMELFS network protocol. + +Basic structure used in network communication is following command: + +struct netfs_cmd +{ + __u16 cmd; /* Command number */ + __u16 csize; /* Attached crypto information size */ + __u16 cpad; /* Attached padding size */ + __u16 ext; /* External flags */ + __u32 size; /* Size of the attached data */ + __u32 trans; /* Transaction id */ + __u64 id; /* Object ID to operate on. Used for feedback.*/ + __u64 start; /* Start of the object. */ + __u64 iv; /* IV sequence */ + __u8 data[0]; +}; + +Commands can be embedded into transaction command (which in turn has own command), +so one can extend protocol as needed without breaking backward compatibility as long +as old commands are supported. All string lengths include tail 0 byte. + +All commands are transferred over the network in big-endian. CPU endianness is used at the end peers. + +@cmd - command number, which specifies command to be processed. Following + commands are used currently: + + NETFS_READDIR = 1, /* Read directory for given inode number */ + NETFS_READ_PAGE, /* Read data page from the server */ + NETFS_WRITE_PAGE, /* Write data page to the server */ + NETFS_CREATE, /* Create directory entry */ + NETFS_REMOVE, /* Remove directory entry */ + NETFS_LOOKUP, /* Lookup single object */ + NETFS_LINK, /* Create a link */ + NETFS_TRANS, /* Transaction */ + NETFS_OPEN, /* Open intent */ + NETFS_INODE_INFO, /* Metadata cache coherency synchronization message */ + NETFS_PAGE_CACHE, /* Page cache invalidation message */ + NETFS_READ_PAGES, /* Read multiple contiguous pages in one go */ + NETFS_RENAME, /* Rename object */ + NETFS_CAPABILITIES, /* Capabilities of the client, for example supported crypto */ + NETFS_LOCK, /* Distributed lock message */ + NETFS_XATTR_SET, /* Set extended attribute */ + NETFS_XATTR_GET, /* Get extended attribute */ + +@ext - external flags. Used by different commands to specify some extra arguments + like partial size of the embedded objects or creation flags. + +@size - size of the attached data. For NETFS_READ_PAGE and NETFS_READ_PAGES no data is attached, + but size of the requested data is incorporated here. It does not include size of the command + header (struct netfs_cmd) itself. + +@id - id of the object this command operates on. Each command can use it for own purpose. + +@start - start of the object this command operates on. Each command can use it for own purpose. + +@csize, @cpad - size and padding size of the (attached if needed) crypto information. + +Command specifications. + +@NETFS_READDIR +This command is used to sync content of the remote dir to the client. + +@ext - length of the path to object. +@size - the same. +@id - local inode number of the directory to read. +@start - zero. + + +@NETFS_READ_PAGE +This command is used to read data from remote server. +Data size does not exceed local page cache size. + +@id - inode number. +@start - first byte offset. +@size - number of bytes to read plus length of the path to object. +@ext - object path length. + + +@NETFS_CREATE +Used to create object. +It does not require that all directories on top of the object were +already created, it will create them automatically. Each object has +associated @netfs_path_entry data structure, which contains creation +mode (permissions and type) and length of the name as long as name itself. + +@start - 0 +@size - size of the all data structures needed to create a path +@id - local inode number +@ext - 0 + + +@NETFS_REMOVE +Used to remove object. + +@ext - length of the path to object. +@size - the same. +@id - local inode number. +@start - zero. + + +@NETFS_LOOKUP +Lookup information about object on server. + +@ext - length of the path to object. +@size - the same. +@id - local inode number of the directory to look object in. +@start - local inode number of the object to look at. + + +@NETFS_LINK +Create hard of symlink. +Command is sent as "object_path|target_path". + +@size - size of the above string. +@id - parent local inode number. +@start - 1 for symlink, 0 for hardlink. +@ext - size of the "object_path" above. + + +@NETFS_TRANS +Transaction header. + +@size - incorporates all embedded command sizes including theirs header sizes. +@start - transaction generation number - unique id used to find transaction. +@ext - transaction flags. Unused at the moment. +@id - 0. + + +@NETFS_OPEN +Open intent for given transaction. + +@id - local inode number. +@start - 0. +@size - path length to the object. +@ext - open flags (O_RDWR and so on). + + +@NETFS_INODE_INFO +Metadata update command. +It is sent to servers when attributes of the object are changed and received +when data or metadata were updated. It operates with the following structure: + +struct netfs_inode_info +{ + unsigned int mode; + unsigned int nlink; + unsigned int uid; + unsigned int gid; + unsigned int blocksize; + unsigned int padding; + __u64 ino; + __u64 blocks; + __u64 rdev; + __u64 size; + __u64 version; +}; + +It effectively mirrors stat(2) returned data. + + +@ext - path length to the object. +@size - the same plus size of the netfs_inode_info structure. +@id - local inode number. +@start - 0. + + +@NETFS_PAGE_CACHE +Command is only received by clients. It contains information about +page to be marked as not up-to-date. + +@id - client's inode number. +@start - last byte of the page to be invalidated. If it is not equal to + current inode size, it will be vmtruncated(). +@size - 0 +@ext - 0 + + +@NETFS_READ_PAGES +Used to read multiple contiguous pages in one go. + +@start - first byte of the contiguous region to read. +@size - contains of two fields: lower 8 bits are used to represent page cache shift + used by client, another 3 bytes are used to get number of pages. +@id - local inode number. +@ext - path length to the object. + + +@NETFS_RENAME +Used to rename object. +Attached data is formed into following string: "old_path|new_path". + +@id - local inode number. +@start - parent inode number. +@size - length of the above string. +@ext - length of the old path part. + + +@NETFS_CAPABILITIES +Used to exchange crypto capabilities with server. +If crypto capabilities are not supported by server, then client will disable it +or fail (if 'crypto_fail_unsupported' mount options was specified). + +@id - superblock index. Used to specify crypto information for group of servers. +@size - size of the attached capabilities structure. +@start - 0. +@size - 0. +@scsize - 0. + +@NETFS_LOCK +Used to send lock request/release messages. Although it sends byte range request +and is capable of flushing pages based on that, it is not used, since all Linux +filesystems lock the whole inode. + +@id - lock generation number. +@start - start of the locked range. +@size - size of the locked range. +@ext - lock type: read/write. Not used actually. 15'th bit is used to determine, + if it is lock request (1) or release (0). + +@NETFS_XATTR_SET +@NETFS_XATTR_GET +Used to set/get extended attributes for given inode. +@id - attribute generation number or xattr setting type +@start - size of the attribute (request or attached) +@size - name length, path len and data size for given attribute +@ext - path length for given object diff --git a/Documentation/filesystems/porting b/Documentation/filesystems/porting index 92b888d540a..0f3a1390bf0 100644 --- a/Documentation/filesystems/porting +++ b/Documentation/filesystems/porting @@ -94,9 +94,8 @@ protected. --- [mandatory] -BKL is also moved from around sb operations. ->write_super() Is now called -without BKL held. BKL should have been shifted into individual fs sb_op -functions. If you don't need it, remove it. +BKL is also moved from around sb operations. BKL should have been shifted into +individual fs sb_op functions. If you don't need it, remove it. --- [informational] @@ -140,7 +139,7 @@ Callers of notify_change() need ->i_mutex now. New super_block field "struct export_operations *s_export_op" for explicit support for exporting, e.g. via NFS. The structure is fully documented at its declaration in include/linux/fs.h, and in -Documentation/filesystems/Exporting. +Documentation/filesystems/nfs/Exporting. Briefly it allows for the definition of decode_fh and encode_fh operations to encode and decode filehandles, and allows the filesystem to use @@ -216,7 +215,6 @@ had ->revalidate()) add calls in ->follow_link()/->readlink(). ->d_parent changes are not protected by BKL anymore. Read access is safe if at least one of the following is true: * filesystem has no cross-directory rename() - * dcache_lock is held * we know that parent had been locked (e.g. we are looking at ->d_parent of ->lookup() argument). * we are called from ->rename(). @@ -273,3 +271,195 @@ it's safe to remove it. If you don't need it, remove it. deliberate; as soon as struct block_device * is propagated in a reasonable way by that code fixing will become trivial; until then nothing can be done. + +[mandatory] + + block truncatation on error exit from ->write_begin, and ->direct_IO +moved from generic methods (block_write_begin, cont_write_begin, +nobh_write_begin, blockdev_direct_IO*) to callers. Take a look at +ext2_write_failed and callers for an example. + +[mandatory] + + ->truncate is gone. The whole truncate sequence needs to be +implemented in ->setattr, which is now mandatory for filesystems +implementing on-disk size changes. Start with a copy of the old inode_setattr +and vmtruncate, and the reorder the vmtruncate + foofs_vmtruncate sequence to +be in order of zeroing blocks using block_truncate_page or similar helpers, +size update and on finally on-disk truncation which should not fail. +inode_change_ok now includes the size checks for ATTR_SIZE and must be called +in the beginning of ->setattr unconditionally. + +[mandatory] + + ->clear_inode() and ->delete_inode() are gone; ->evict_inode() should +be used instead. It gets called whenever the inode is evicted, whether it has +remaining links or not. Caller does *not* evict the pagecache or inode-associated +metadata buffers; the method has to use truncate_inode_pages_final() to get rid +of those. Caller makes sure async writeback cannot be running for the inode while +(or after) ->evict_inode() is called. + + ->drop_inode() returns int now; it's called on final iput() with +inode->i_lock held and it returns true if filesystems wants the inode to be +dropped. As before, generic_drop_inode() is still the default and it's been +updated appropriately. generic_delete_inode() is also alive and it consists +simply of return 1. Note that all actual eviction work is done by caller after +->drop_inode() returns. + + As before, clear_inode() must be called exactly once on each call of +->evict_inode() (as it used to be for each call of ->delete_inode()). Unlike +before, if you are using inode-associated metadata buffers (i.e. +mark_buffer_dirty_inode()), it's your responsibility to call +invalidate_inode_buffers() before clear_inode(). + + NOTE: checking i_nlink in the beginning of ->write_inode() and bailing out +if it's zero is not *and* *never* *had* *been* enough. Final unlink() and iput() +may happen while the inode is in the middle of ->write_inode(); e.g. if you blindly +free the on-disk inode, you may end up doing that while ->write_inode() is writing +to it. + +--- +[mandatory] + + .d_delete() now only advises the dcache as to whether or not to cache +unreferenced dentries, and is now only called when the dentry refcount goes to +0. Even on 0 refcount transition, it must be able to tolerate being called 0, +1, or more times (eg. constant, idempotent). + +--- +[mandatory] + + .d_compare() calling convention and locking rules are significantly +changed. Read updated documentation in Documentation/filesystems/vfs.txt (and +look at examples of other filesystems) for guidance. + +--- +[mandatory] + + .d_hash() calling convention and locking rules are significantly +changed. Read updated documentation in Documentation/filesystems/vfs.txt (and +look at examples of other filesystems) for guidance. + +--- +[mandatory] + dcache_lock is gone, replaced by fine grained locks. See fs/dcache.c +for details of what locks to replace dcache_lock with in order to protect +particular things. Most of the time, a filesystem only needs ->d_lock, which +protects *all* the dcache state of a given dentry. + +-- +[mandatory] + + Filesystems must RCU-free their inodes, if they can have been accessed +via rcu-walk path walk (basically, if the file can have had a path name in the +vfs namespace). + + Even though i_dentry and i_rcu share storage in a union, we will +initialize the former in inode_init_always(), so just leave it alone in +the callback. It used to be necessary to clean it there, but not anymore +(starting at 3.2). + +-- +[recommended] + vfs now tries to do path walking in "rcu-walk mode", which avoids +atomic operations and scalability hazards on dentries and inodes (see +Documentation/filesystems/path-lookup.txt). d_hash and d_compare changes +(above) are examples of the changes required to support this. For more complex +filesystem callbacks, the vfs drops out of rcu-walk mode before the fs call, so +no changes are required to the filesystem. However, this is costly and loses +the benefits of rcu-walk mode. We will begin to add filesystem callbacks that +are rcu-walk aware, shown below. Filesystems should take advantage of this +where possible. + +-- +[mandatory] + d_revalidate is a callback that is made on every path element (if +the filesystem provides it), which requires dropping out of rcu-walk mode. This +may now be called in rcu-walk mode (nd->flags & LOOKUP_RCU). -ECHILD should be +returned if the filesystem cannot handle rcu-walk. See +Documentation/filesystems/vfs.txt for more details. + + permission and check_acl are inode permission checks that are called +on many or all directory inodes on the way down a path walk (to check for +exec permission). These must now be rcu-walk aware (flags & IPERM_FLAG_RCU). +See Documentation/filesystems/vfs.txt for more details. + +-- +[mandatory] + In ->fallocate() you must check the mode option passed in. If your +filesystem does not support hole punching (deallocating space in the middle of a +file) you must return -EOPNOTSUPP if FALLOC_FL_PUNCH_HOLE is set in mode. +Currently you can only have FALLOC_FL_PUNCH_HOLE with FALLOC_FL_KEEP_SIZE set, +so the i_size should not change when hole punching, even when puching the end of +a file off. + +-- +[mandatory] + ->get_sb() is gone. Switch to use of ->mount(). Typically it's just +a matter of switching from calling get_sb_... to mount_... and changing the +function type. If you were doing it manually, just switch from setting ->mnt_root +to some pointer to returning that pointer. On errors return ERR_PTR(...). + +-- +[mandatory] + ->permission() and generic_permission()have lost flags +argument; instead of passing IPERM_FLAG_RCU we add MAY_NOT_BLOCK into mask. + generic_permission() has also lost the check_acl argument; ACL checking +has been taken to VFS and filesystems need to provide a non-NULL ->i_op->get_acl +to read an ACL from disk. + +-- +[mandatory] + If you implement your own ->llseek() you must handle SEEK_HOLE and +SEEK_DATA. You can hanle this by returning -EINVAL, but it would be nicer to +support it in some way. The generic handler assumes that the entire file is +data and there is a virtual hole at the end of the file. So if the provided +offset is less than i_size and SEEK_DATA is specified, return the same offset. +If the above is true for the offset and you are given SEEK_HOLE, return the end +of the file. If the offset is i_size or greater return -ENXIO in either case. + +[mandatory] + If you have your own ->fsync() you must make sure to call +filemap_write_and_wait_range() so that all dirty pages are synced out properly. +You must also keep in mind that ->fsync() is not called with i_mutex held +anymore, so if you require i_mutex locking you must make sure to take it and +release it yourself. + +-- +[mandatory] + d_alloc_root() is gone, along with a lot of bugs caused by code +misusing it. Replacement: d_make_root(inode). The difference is, +d_make_root() drops the reference to inode if dentry allocation fails. + +-- +[mandatory] + The witch is dead! Well, 2/3 of it, anyway. ->d_revalidate() and +->lookup() do *not* take struct nameidata anymore; just the flags. +-- +[mandatory] + ->create() doesn't take struct nameidata *; unlike the previous +two, it gets "is it an O_EXCL or equivalent?" boolean argument. Note that +local filesystems can ignore tha argument - they are guaranteed that the +object doesn't exist. It's remote/distributed ones that might care... +-- +[mandatory] + FS_REVAL_DOT is gone; if you used to have it, add ->d_weak_revalidate() +in your dentry operations instead. +-- +[mandatory] + vfs_readdir() is gone; switch to iterate_dir() instead +-- +[mandatory] + ->readdir() is gone now; switch to ->iterate() +[mandatory] + vfs_follow_link has been removed. Filesystems must use nd_set_link + from ->follow_link for normal symlinks, or nd_jump_link for magic + /proc/<pid> style links. +-- +[mandatory] + iget5_locked()/ilookup5()/ilookup5_nowait() test() callback used to be + called with both ->i_lock and inode_hash_lock held; the former is *not* + taken anymore, so verify that your callbacks do not rely on it (none + of the in-tree instances did). inode_hash_lock is still held, + of course, so they are still serialized wrt removal from inode hash, + as well as wrt set() callback of iget5_locked(). diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt index 518ebe609e2..ddc531a74d0 100644 --- a/Documentation/filesystems/proc.txt +++ b/Documentation/filesystems/proc.txt @@ -5,10 +5,12 @@ Bodo Bauer <bb@ricochet.net> 2.4.x update Jorge Nerin <comandante@zaralinux.com> November 14 2000 +move /proc/sys Shen Feng <shen@cn.fujitsu.com> April 1 2009 ------------------------------------------------------------------------------ Version 1.3 Kernel version 2.2.12 Kernel version 2.4.0-test11-pre4 ------------------------------------------------------------------------------ +fixes/update part 1.1 Stefani Seibold <stefani@seibold.net> June 9 2009 Table of Contents ----------------- @@ -26,23 +28,23 @@ Table of Contents 1.6 Parallel port info in /proc/parport 1.7 TTY info in /proc/tty 1.8 Miscellaneous kernel statistics in /proc/stat + 1.9 Ext4 file system parameters 2 Modifying System Parameters - 2.1 /proc/sys/fs - File system data - 2.2 /proc/sys/fs/binfmt_misc - Miscellaneous binary formats - 2.3 /proc/sys/kernel - general kernel parameters - 2.4 /proc/sys/vm - The virtual memory subsystem - 2.5 /proc/sys/dev - Device specific parameters - 2.6 /proc/sys/sunrpc - Remote procedure calls - 2.7 /proc/sys/net - Networking stuff - 2.8 /proc/sys/net/ipv4 - IPV4 settings - 2.9 Appletalk - 2.10 IPX - 2.11 /proc/sys/fs/mqueue - POSIX message queues filesystem - 2.12 /proc/<pid>/oom_adj - Adjust the oom-killer score - 2.13 /proc/<pid>/oom_score - Display current oom-killer score - 2.14 /proc/<pid>/io - Display the IO accounting fields - 2.15 /proc/<pid>/coredump_filter - Core dump filtering settings + + 3 Per-Process Parameters + 3.1 /proc/<pid>/oom_adj & /proc/<pid>/oom_score_adj - Adjust the oom-killer + score + 3.2 /proc/<pid>/oom_score - Display current oom-killer score + 3.3 /proc/<pid>/io - Display the IO accounting fields + 3.4 /proc/<pid>/coredump_filter - Core dump filtering settings + 3.5 /proc/<pid>/mountinfo - Information about mounts + 3.6 /proc/<pid>/comm & /proc/<pid>/task/<tid>/comm + 3.7 /proc/<pid>/task/<tid>/children - Information about task children + 3.8 /proc/<pid>/fdinfo/<fd> - Information about opened file + + 4 Configuring procfs + 4.1 Mount options ------------------------------------------------------------------------------ Preface @@ -76,9 +78,9 @@ contact Bodo Bauer at bb@ricochet.net. We'll be happy to add them to this document. The latest version of this document is available online at -http://skaro.nightcrawler.com/~bb/Docs/Proc as HTML version. +http://tldp.org/LDP/Linux-Filesystem-Hierarchy/html/proc.html -If the above direction does not works for you, ypu could try the kernel +If the above direction does not works for you, you could try the kernel mailing list at linux-kernel@vger.kernel.org and/or try to reach me at comandante@zaralinux.com. @@ -121,7 +123,7 @@ The link self points to the process reading the file system. Each process subdirectory has the entries listed in Table 1-1. -Table 1-1: Process specific entries in /proc +Table 1-1: Process specific entries in /proc .............................................................................. File Content clear_refs Clears page referenced bits shown in smaps output @@ -138,46 +140,115 @@ Table 1-1: Process specific entries in /proc statm Process memory status information status Process status in human readable form wchan If CONFIG_KALLSYMS is set, a pre-decoded wchan - smaps Extension based on maps, the rss size for each mapped file + pagemap Page table + stack Report full stack trace, enable via CONFIG_STACKTRACE + smaps a extension based on maps, showing the memory consumption of + each mapping and flags associated with it .............................................................................. For example, to get the status information of a process, all you have to do is read the file /proc/PID/status: - >cat /proc/self/status - Name: cat - State: R (running) - Pid: 5452 - PPid: 743 + >cat /proc/self/status + Name: cat + State: R (running) + Tgid: 5452 + Pid: 5452 + PPid: 743 TracerPid: 0 (2.4) - Uid: 501 501 501 501 - Gid: 100 100 100 100 - Groups: 100 14 16 - VmSize: 1112 kB - VmLck: 0 kB - VmRSS: 348 kB - VmData: 24 kB - VmStk: 12 kB - VmExe: 8 kB - VmLib: 1044 kB - SigPnd: 0000000000000000 - SigBlk: 0000000000000000 - SigIgn: 0000000000000000 - SigCgt: 0000000000000000 - CapInh: 00000000fffffeff - CapPrm: 0000000000000000 - CapEff: 0000000000000000 - + Uid: 501 501 501 501 + Gid: 100 100 100 100 + FDSize: 256 + Groups: 100 14 16 + VmPeak: 5004 kB + VmSize: 5004 kB + VmLck: 0 kB + VmHWM: 476 kB + VmRSS: 476 kB + VmData: 156 kB + VmStk: 88 kB + VmExe: 68 kB + VmLib: 1412 kB + VmPTE: 20 kb + VmSwap: 0 kB + Threads: 1 + SigQ: 0/28578 + SigPnd: 0000000000000000 + ShdPnd: 0000000000000000 + SigBlk: 0000000000000000 + SigIgn: 0000000000000000 + SigCgt: 0000000000000000 + CapInh: 00000000fffffeff + CapPrm: 0000000000000000 + CapEff: 0000000000000000 + CapBnd: ffffffffffffffff + Seccomp: 0 + voluntary_ctxt_switches: 0 + nonvoluntary_ctxt_switches: 1 This shows you nearly the same information you would get if you viewed it with the ps command. In fact, ps uses the proc file system to obtain its -information. The statm file contains more detailed information about the -process memory usage. Its seven fields are explained in Table 1-2. The stat -file contains details information about the process itself. Its fields are -explained in Table 1-3. +information. But you get a more detailed view of the process by reading the +file /proc/PID/status. It fields are described in table 1-2. + +The statm file contains more detailed information about the process +memory usage. Its seven fields are explained in Table 1-3. The stat file +contains details information about the process itself. Its fields are +explained in Table 1-4. +(for SMP CONFIG users) +For making accounting scalable, RSS related information are handled in +asynchronous manner and the vaule may not be very precise. To see a precise +snapshot of a moment, you can see /proc/<pid>/smaps file and scan page table. +It's slow but very precise. -Table 1-2: Contents of the statm files (as of 2.6.8-rc3) +Table 1-2: Contents of the status files (as of 2.6.30-rc7) +.............................................................................. + Field Content + Name filename of the executable + State state (R is running, S is sleeping, D is sleeping + in an uninterruptible wait, Z is zombie, + T is traced or stopped) + Tgid thread group ID + Pid process id + PPid process id of the parent process + TracerPid PID of process tracing this process (0 if not) + Uid Real, effective, saved set, and file system UIDs + Gid Real, effective, saved set, and file system GIDs + FDSize number of file descriptor slots currently allocated + Groups supplementary group list + VmPeak peak virtual memory size + VmSize total program size + VmLck locked memory size + VmHWM peak resident set size ("high water mark") + VmRSS size of memory portions + VmData size of data, stack, and text segments + VmStk size of data, stack, and text segments + VmExe size of text segment + VmLib size of shared library code + VmPTE size of page table entries + VmSwap size of swap usage (the number of referred swapents) + Threads number of threads + SigQ number of signals queued/max. number for queue + SigPnd bitmap of pending signals for the thread + ShdPnd bitmap of shared pending signals for the process + SigBlk bitmap of blocked signals + SigIgn bitmap of ignored signals + SigCgt bitmap of caught signals + CapInh bitmap of inheritable capabilities + CapPrm bitmap of permitted capabilities + CapEff bitmap of effective capabilities + CapBnd bitmap of capabilities bounding set + Seccomp seccomp mode, like prctl(PR_GET_SECCOMP, ...) + Cpus_allowed mask of CPUs on which this process may run + Cpus_allowed_list Same as previous, but in "list format" + Mems_allowed mask of memory nodes allowed to this process + Mems_allowed_list Same as previous, but in "list format" + voluntary_ctxt_switches number of voluntary context switches + nonvoluntary_ctxt_switches number of non voluntary context switches +.............................................................................. + +Table 1-3: Contents of the statm files (as of 2.6.8-rc3) .............................................................................. Field Content size total program size (pages) (same as VmSize in status) @@ -192,7 +263,7 @@ Table 1-2: Contents of the statm files (as of 2.6.8-rc3) .............................................................................. -Table 1-3: Contents of the stat files (as of 2.6.22-rc3) +Table 1-4: Contents of the stat files (as of 2.6.30-rc7) .............................................................................. Field Content pid process id @@ -223,13 +294,13 @@ Table 1-3: Contents of the stat files (as of 2.6.22-rc3) rsslim current limit in bytes on the rss start_code address above which program text can run end_code address below which program text can run - start_stack address of the start of the stack + start_stack address of the start of the main process stack esp current value of ESP eip current value of EIP - pending bitmap of pending signals (obsolete) - blocked bitmap of blocked signals (obsolete) - sigign bitmap of ignored signals (obsolete) - sigcatch bitmap of catched signals (obsolete) + pending bitmap of pending signals + blocked bitmap of blocked signals + sigign bitmap of ignored signals + sigcatch bitmap of caught signals wchan address where process went to sleep 0 (place holder) 0 (place holder) @@ -238,19 +309,201 @@ Table 1-3: Contents of the stat files (as of 2.6.22-rc3) rt_priority realtime priority policy scheduling policy (man sched_setscheduler) blkio_ticks time spent waiting for block IO + gtime guest time of the task in jiffies + cgtime guest time of the task children in jiffies + start_data address above which program data+bss is placed + end_data address below which program data+bss is placed + start_brk address above which program heap can be expanded with brk() + arg_start address above which program command line is placed + arg_end address below which program command line is placed + env_start address above which program environment is placed + env_end address below which program environment is placed + exit_code the thread's exit_code in the form reported by the waitpid system call .............................................................................. +The /proc/PID/maps file containing the currently mapped memory regions and +their access permissions. + +The format is: + +address perms offset dev inode pathname + +08048000-08049000 r-xp 00000000 03:00 8312 /opt/test +08049000-0804a000 rw-p 00001000 03:00 8312 /opt/test +0804a000-0806b000 rw-p 00000000 00:00 0 [heap] +a7cb1000-a7cb2000 ---p 00000000 00:00 0 +a7cb2000-a7eb2000 rw-p 00000000 00:00 0 +a7eb2000-a7eb3000 ---p 00000000 00:00 0 +a7eb3000-a7ed5000 rw-p 00000000 00:00 0 [stack:1001] +a7ed5000-a8008000 r-xp 00000000 03:00 4222 /lib/libc.so.6 +a8008000-a800a000 r--p 00133000 03:00 4222 /lib/libc.so.6 +a800a000-a800b000 rw-p 00135000 03:00 4222 /lib/libc.so.6 +a800b000-a800e000 rw-p 00000000 00:00 0 +a800e000-a8022000 r-xp 00000000 03:00 14462 /lib/libpthread.so.0 +a8022000-a8023000 r--p 00013000 03:00 14462 /lib/libpthread.so.0 +a8023000-a8024000 rw-p 00014000 03:00 14462 /lib/libpthread.so.0 +a8024000-a8027000 rw-p 00000000 00:00 0 +a8027000-a8043000 r-xp 00000000 03:00 8317 /lib/ld-linux.so.2 +a8043000-a8044000 r--p 0001b000 03:00 8317 /lib/ld-linux.so.2 +a8044000-a8045000 rw-p 0001c000 03:00 8317 /lib/ld-linux.so.2 +aff35000-aff4a000 rw-p 00000000 00:00 0 [stack] +ffffe000-fffff000 r-xp 00000000 00:00 0 [vdso] + +where "address" is the address space in the process that it occupies, "perms" +is a set of permissions: + + r = read + w = write + x = execute + s = shared + p = private (copy on write) + +"offset" is the offset into the mapping, "dev" is the device (major:minor), and +"inode" is the inode on that device. 0 indicates that no inode is associated +with the memory region, as the case would be with BSS (uninitialized data). +The "pathname" shows the name associated file for this mapping. If the mapping +is not associated with a file: + + [heap] = the heap of the program + [stack] = the stack of the main process + [stack:1001] = the stack of the thread with tid 1001 + [vdso] = the "virtual dynamic shared object", + the kernel system call handler + + or if empty, the mapping is anonymous. + +The /proc/PID/task/TID/maps is a view of the virtual memory from the viewpoint +of the individual tasks of a process. In this file you will see a mapping marked +as [stack] if that task sees it as a stack. This is a key difference from the +content of /proc/PID/maps, where you will see all mappings that are being used +as stack by all of those tasks. Hence, for the example above, the task-level +map, i.e. /proc/PID/task/TID/maps for thread 1001 will look like this: + +08048000-08049000 r-xp 00000000 03:00 8312 /opt/test +08049000-0804a000 rw-p 00001000 03:00 8312 /opt/test +0804a000-0806b000 rw-p 00000000 00:00 0 [heap] +a7cb1000-a7cb2000 ---p 00000000 00:00 0 +a7cb2000-a7eb2000 rw-p 00000000 00:00 0 +a7eb2000-a7eb3000 ---p 00000000 00:00 0 +a7eb3000-a7ed5000 rw-p 00000000 00:00 0 [stack] +a7ed5000-a8008000 r-xp 00000000 03:00 4222 /lib/libc.so.6 +a8008000-a800a000 r--p 00133000 03:00 4222 /lib/libc.so.6 +a800a000-a800b000 rw-p 00135000 03:00 4222 /lib/libc.so.6 +a800b000-a800e000 rw-p 00000000 00:00 0 +a800e000-a8022000 r-xp 00000000 03:00 14462 /lib/libpthread.so.0 +a8022000-a8023000 r--p 00013000 03:00 14462 /lib/libpthread.so.0 +a8023000-a8024000 rw-p 00014000 03:00 14462 /lib/libpthread.so.0 +a8024000-a8027000 rw-p 00000000 00:00 0 +a8027000-a8043000 r-xp 00000000 03:00 8317 /lib/ld-linux.so.2 +a8043000-a8044000 r--p 0001b000 03:00 8317 /lib/ld-linux.so.2 +a8044000-a8045000 rw-p 0001c000 03:00 8317 /lib/ld-linux.so.2 +aff35000-aff4a000 rw-p 00000000 00:00 0 +ffffe000-fffff000 r-xp 00000000 00:00 0 [vdso] + +The /proc/PID/smaps is an extension based on maps, showing the memory +consumption for each of the process's mappings. For each of mappings there +is a series of lines such as the following: + +08048000-080bc000 r-xp 00000000 03:02 13130 /bin/bash +Size: 1084 kB +Rss: 892 kB +Pss: 374 kB +Shared_Clean: 892 kB +Shared_Dirty: 0 kB +Private_Clean: 0 kB +Private_Dirty: 0 kB +Referenced: 892 kB +Anonymous: 0 kB +Swap: 0 kB +KernelPageSize: 4 kB +MMUPageSize: 4 kB +Locked: 374 kB +VmFlags: rd ex mr mw me de + +the first of these lines shows the same information as is displayed for the +mapping in /proc/PID/maps. The remaining lines show the size of the mapping +(size), the amount of the mapping that is currently resident in RAM (RSS), the +process' proportional share of this mapping (PSS), the number of clean and +dirty private pages in the mapping. Note that even a page which is part of a +MAP_SHARED mapping, but has only a single pte mapped, i.e. is currently used +by only one process, is accounted as private and not as shared. "Referenced" +indicates the amount of memory currently marked as referenced or accessed. +"Anonymous" shows the amount of memory that does not belong to any file. Even +a mapping associated with a file may contain anonymous pages: when MAP_PRIVATE +and a page is modified, the file page is replaced by a private anonymous copy. +"Swap" shows how much would-be-anonymous memory is also used, but out on +swap. + +"VmFlags" field deserves a separate description. This member represents the kernel +flags associated with the particular virtual memory area in two letter encoded +manner. The codes are the following: + rd - readable + wr - writeable + ex - executable + sh - shared + mr - may read + mw - may write + me - may execute + ms - may share + gd - stack segment growns down + pf - pure PFN range + dw - disabled write to the mapped file + lo - pages are locked in memory + io - memory mapped I/O area + sr - sequential read advise provided + rr - random read advise provided + dc - do not copy area on fork + de - do not expand area on remapping + ac - area is accountable + nr - swap space is not reserved for the area + ht - area uses huge tlb pages + nl - non-linear mapping + ar - architecture specific flag + dd - do not include area into core dump + sd - soft-dirty flag + mm - mixed map area + hg - huge page advise flag + nh - no-huge page advise flag + mg - mergable advise flag + +Note that there is no guarantee that every flag and associated mnemonic will +be present in all further kernel releases. Things get changed, the flags may +be vanished or the reverse -- new added. + +This file is only present if the CONFIG_MMU kernel configuration option is +enabled. + +The /proc/PID/clear_refs is used to reset the PG_Referenced and ACCESSED/YOUNG +bits on both physical and virtual pages associated with a process, and the +soft-dirty bit on pte (see Documentation/vm/soft-dirty.txt for details). +To clear the bits for all the pages associated with the process + > echo 1 > /proc/PID/clear_refs + +To clear the bits for the anonymous pages associated with the process + > echo 2 > /proc/PID/clear_refs + +To clear the bits for the file mapped pages associated with the process + > echo 3 > /proc/PID/clear_refs + +To clear the soft-dirty bit + > echo 4 > /proc/PID/clear_refs + +Any other value written to /proc/PID/clear_refs will have no effect. + +The /proc/pid/pagemap gives the PFN, which can be used to find the pageflags +using /proc/kpageflags and number of times a page is mapped using +/proc/kpagecount. For detailed explanation, see Documentation/vm/pagemap.txt. 1.2 Kernel data --------------- Similar to the process entries, the kernel data files give information about the running kernel. The files used to obtain this information are contained in -/proc and are listed in Table 1-4. Not all of these will be present in your +/proc and are listed in Table 1-5. Not all of these will be present in your system. It depends on the kernel configuration and the loaded modules, which files are there, and which are missing. -Table 1-4: Kernel info in /proc +Table 1-5: Kernel info in /proc .............................................................................. File Content apm Advanced power management info @@ -281,20 +534,23 @@ Table 1-4: Kernel info in /proc modules List of loaded modules mounts Mounted filesystems net Networking info (see text) + pagetypeinfo Additional page allocator information (see text) (2.5) partitions Table of partitions known to the system pci Deprecated info of PCI bus (new way -> /proc/bus/pci/, decoupled by lspci (2.4) rtc Real time clock scsi SCSI info (see text) slabinfo Slab pool info + softirqs softirq usage stat Overall statistics swaps Swap space utilization sys See chapter 2 sysvipc Info of SysVIPC Resources (msg, sem, shm) (2.4) tty Info of tty drivers - uptime System uptime + uptime Wall clock since boot, combined idle time of all cpus version Kernel version video bttv info of video resources (2.4) + vmallocinfo Show vmalloced areas .............................................................................. You can, for example, check which interrupts are currently in use and what @@ -369,9 +625,9 @@ just those considered 'most important'. The new vectors are: RES, CAL, TLB -- rescheduling, call and TLB flush interrupts are sent from one CPU to another per the needs of the OS. Typically, their statistics are used by kernel developers and interested users to - determine the occurance of interrupt of the given type. + determine the occurrence of interrupts of the given type. -The above IRQ vectors are displayed only when relevent. For example, +The above IRQ vectors are displayed only when relevant. For example, the threshold vector does not exist on x86_64 platforms. Others are suppressed when the system is a uniprocessor. As of this writing, only i386 and x86_64 platforms support the new IRQ vector displays. @@ -379,33 +635,51 @@ i386 and x86_64 platforms support the new IRQ vector displays. Of some interest is the introduction of the /proc/irq directory to 2.4. It could be used to set IRQ to CPU affinity, this means that you can "hook" an IRQ to only one CPU, or to exclude a CPU of handling IRQs. The contents of the -irq subdir is one subdir for each IRQ, and one file; prof_cpu_mask +irq subdir is one subdir for each IRQ, and two files; default_smp_affinity and +prof_cpu_mask. For example > ls /proc/irq/ 0 10 12 14 16 18 2 4 6 8 prof_cpu_mask - 1 11 13 15 17 19 3 5 7 9 + 1 11 13 15 17 19 3 5 7 9 default_smp_affinity > ls /proc/irq/0/ smp_affinity -The contents of the prof_cpu_mask file and each smp_affinity file for each IRQ -is the same by default: +smp_affinity is a bitmask, in which you can specify which CPUs can handle the +IRQ, you can set it by doing: + + > echo 1 > /proc/irq/10/smp_affinity - > cat /proc/irq/0/smp_affinity +This means that only the first CPU will handle the IRQ, but you can also echo +5 which means that only the first and fourth CPU can handle the IRQ. + +The contents of each smp_affinity file is the same by default: + + > cat /proc/irq/0/smp_affinity ffffffff -It's a bitmask, in which you can specify which CPUs can handle the IRQ, you can -set it by doing: +There is an alternate interface, smp_affinity_list which allows specifying +a cpu range instead of a bitmask: + + > cat /proc/irq/0/smp_affinity_list + 1024-1031 + +The default_smp_affinity mask applies to all non-active IRQs, which are the +IRQs which have not yet been allocated/activated, and hence which lack a +/proc/irq/[0-9]* directory. - > echo 1 > /proc/irq/prof_cpu_mask +The node file on an SMP system shows the node to which the device using the IRQ +reports itself as being attached. This hardware locality information does not +include information about any possible driver locality preference. -This means that only the first CPU will handle the IRQ, but you can also echo 5 -which means that only the first and fourth CPU can handle the IRQ. +prof_cpu_mask specifies which CPUs are to be profiled by the system wide +profiler. Default value is ffffffff (all cpus if there are only 32 of them). The way IRQs are routed is handled by the IO-APIC, and it's Round Robin between all the CPUs which are allowed to handle it. As usual the kernel has more info than you and does a better job than you, so the defaults are the -best choice for almost everyone. +best choice for almost everyone. [Note this applies only to those IO-APIC's +that support "Round Robin" interrupt distribution.] There are three more important subdirectories in /proc: net, scsi, and sys. The general rule is that the contents, or even the existence of these @@ -426,7 +700,7 @@ Node 0, zone DMA 0 4 5 4 4 3 ... Node 0, zone Normal 1 0 0 1 101 8 ... Node 0, zone HighMem 2 0 0 1 1 0 ... -Memory fragmentation is a problem under some workloads, and buddyinfo is a +External fragmentation is a problem under some workloads, and buddyinfo is a useful tool for helping diagnose these problems. Buddyinfo will give you a clue as to how big an area you can safely allocate, or why a previous allocation failed. @@ -436,6 +710,48 @@ available. In this case, there are 0 chunks of 2^0*PAGE_SIZE available in ZONE_DMA, 4 chunks of 2^1*PAGE_SIZE in ZONE_DMA, 101 chunks of 2^4*PAGE_SIZE available in ZONE_NORMAL, etc... +More information relevant to external fragmentation can be found in +pagetypeinfo. + +> cat /proc/pagetypeinfo +Page block order: 9 +Pages per block: 512 + +Free pages count per migrate type at order 0 1 2 3 4 5 6 7 8 9 10 +Node 0, zone DMA, type Unmovable 0 0 0 1 1 1 1 1 1 1 0 +Node 0, zone DMA, type Reclaimable 0 0 0 0 0 0 0 0 0 0 0 +Node 0, zone DMA, type Movable 1 1 2 1 2 1 1 0 1 0 2 +Node 0, zone DMA, type Reserve 0 0 0 0 0 0 0 0 0 1 0 +Node 0, zone DMA, type Isolate 0 0 0 0 0 0 0 0 0 0 0 +Node 0, zone DMA32, type Unmovable 103 54 77 1 1 1 11 8 7 1 9 +Node 0, zone DMA32, type Reclaimable 0 0 2 1 0 0 0 0 1 0 0 +Node 0, zone DMA32, type Movable 169 152 113 91 77 54 39 13 6 1 452 +Node 0, zone DMA32, type Reserve 1 2 2 2 2 0 1 1 1 1 0 +Node 0, zone DMA32, type Isolate 0 0 0 0 0 0 0 0 0 0 0 + +Number of blocks type Unmovable Reclaimable Movable Reserve Isolate +Node 0, zone DMA 2 0 5 1 0 +Node 0, zone DMA32 41 6 967 2 0 + +Fragmentation avoidance in the kernel works by grouping pages of different +migrate types into the same contiguous regions of memory called page blocks. +A page block is typically the size of the default hugepage size e.g. 2MB on +X86-64. By keeping pages grouped based on their ability to move, the kernel +can reclaim pages within a page block to satisfy a high-order allocation. + +The pagetypinfo begins with information on the size of a page block. It +then gives the same type of information as buddyinfo except broken down +by migrate-type and finishes with details on how many page blocks of each +type exist. + +If min_free_kbytes has been tuned correctly (recommendations made by hugeadm +from libhugetlbfs http://sourceforge.net/projects/libhugetlbfs/), one can +make an estimate of the likely number of huge pages that can be allocated +at a given point in time. All the "Movable" blocks should be allocatable +unless memory has been mlock()'d. Some of the Reclaimable blocks should +also be allocatable although a lot of filesystem metadata may have to be +reclaimed to achieve this. + .............................................................................. meminfo: @@ -446,9 +762,12 @@ varies by architecture and compile options. The following is from a > cat /proc/meminfo +The "Locked" indicates whether the mapping is locked in memory or not. + MemTotal: 16344972 kB MemFree: 13634064 kB +MemAvailable: 14836172 kB Buffers: 3656 kB Cached: 1195708 kB SwapCached: 0 kB @@ -462,18 +781,33 @@ SwapTotal: 0 kB SwapFree: 0 kB Dirty: 968 kB Writeback: 0 kB +AnonPages: 861800 kB Mapped: 280372 kB -Slab: 684068 kB +Slab: 284364 kB +SReclaimable: 159856 kB +SUnreclaim: 124508 kB +PageTables: 24448 kB +NFS_Unstable: 0 kB +Bounce: 0 kB +WritebackTmp: 0 kB CommitLimit: 7669796 kB Committed_AS: 100056 kB -PageTables: 24448 kB VmallocTotal: 112216 kB VmallocUsed: 428 kB VmallocChunk: 111088 kB +AnonHugePages: 49152 kB MemTotal: Total usable ram (i.e. physical ram minus a few reserved bits and the kernel binary code) MemFree: The sum of LowFree+HighFree +MemAvailable: An estimate of how much memory is available for starting new + applications, without swapping. Calculated from MemFree, + SReclaimable, the size of the file LRU lists, and the low + watermarks in each zone. + The estimate takes into account that the system needs some + page cache to function well, and that not all reclaimable + slab will be reclaimable, due to items being in use. The + impact of those factors will vary from system to system. Buffers: Relatively temporary storage for raw disk blocks shouldn't get tremendously large (20MB or so) Cached: in-memory cache for files read from the disk (the @@ -502,15 +836,26 @@ VmallocChunk: 111088 kB on the disk Dirty: Memory which is waiting to get written back to the disk Writeback: Memory which is actively being written back to the disk + AnonPages: Non-file backed pages mapped into userspace page tables +AnonHugePages: Non-file backed huge pages mapped into userspace page tables Mapped: files which have been mmaped, such as libraries Slab: in-kernel data structures cache +SReclaimable: Part of Slab, that might be reclaimed, such as caches + SUnreclaim: Part of Slab, that cannot be reclaimed on memory pressure + PageTables: amount of memory dedicated to the lowest level of page + tables. +NFS_Unstable: NFS pages sent to the server, but not yet committed to stable + storage + Bounce: Memory used for block device "bounce buffers" +WritebackTmp: Memory used by FUSE for temporary writeback buffers CommitLimit: Based on the overcommit ratio ('vm.overcommit_ratio'), this is the total amount of memory currently available to be allocated on the system. This limit is only adhered to if strict overcommit accounting is enabled (mode 2 in 'vm.overcommit_memory'). The CommitLimit is calculated with the following formula: - CommitLimit = ('vm.overcommit_ratio' * Physical RAM) + Swap + CommitLimit = ([total RAM pages] - [total huge TLB pages]) * + overcommit_ratio / 100 + [total swap pages] For example, on a system with 1G of physical RAM and 7G of swap with a `vm.overcommit_ratio` of 30 it would yield a CommitLimit of 7.3G. @@ -520,21 +865,80 @@ Committed_AS: The amount of memory presently allocated on the system. The committed memory is a sum of all of the memory which has been allocated by processes, even if it has not been "used" by them as of yet. A process which malloc()'s 1G - of memory, but only touches 300M of it will only show up - as using 300M of memory even if it has the address space - allocated for the entire 1G. This 1G is memory which has - been "committed" to by the VM and can be used at any time - by the allocating application. With strict overcommit - enabled on the system (mode 2 in 'vm.overcommit_memory'), - allocations which would exceed the CommitLimit (detailed - above) will not be permitted. This is useful if one needs - to guarantee that processes will not fail due to lack of - memory once that memory has been successfully allocated. - PageTables: amount of memory dedicated to the lowest level of page - tables. + of memory, but only touches 300M of it will show up as + using 1G. This 1G is memory which has been "committed" to + by the VM and can be used at any time by the allocating + application. With strict overcommit enabled on the system + (mode 2 in 'vm.overcommit_memory'),allocations which would + exceed the CommitLimit (detailed above) will not be permitted. + This is useful if one needs to guarantee that processes will + not fail due to lack of memory once that memory has been + successfully allocated. VmallocTotal: total size of vmalloc memory area VmallocUsed: amount of vmalloc area which is used -VmallocChunk: largest contigious block of vmalloc area which is free +VmallocChunk: largest contiguous block of vmalloc area which is free + +.............................................................................. + +vmallocinfo: + +Provides information about vmalloced/vmaped areas. One line per area, +containing the virtual address range of the area, size in bytes, +caller information of the creator, and optional information depending +on the kind of area : + + pages=nr number of pages + phys=addr if a physical address was specified + ioremap I/O mapping (ioremap() and friends) + vmalloc vmalloc() area + vmap vmap()ed pages + user VM_USERMAP area + vpages buffer for pages pointers was vmalloced (huge area) + N<node>=nr (Only on NUMA kernels) + Number of pages allocated on memory node <node> + +> cat /proc/vmallocinfo +0xffffc20000000000-0xffffc20000201000 2101248 alloc_large_system_hash+0x204 ... + /0x2c0 pages=512 vmalloc N0=128 N1=128 N2=128 N3=128 +0xffffc20000201000-0xffffc20000302000 1052672 alloc_large_system_hash+0x204 ... + /0x2c0 pages=256 vmalloc N0=64 N1=64 N2=64 N3=64 +0xffffc20000302000-0xffffc20000304000 8192 acpi_tb_verify_table+0x21/0x4f... + phys=7fee8000 ioremap +0xffffc20000304000-0xffffc20000307000 12288 acpi_tb_verify_table+0x21/0x4f... + phys=7fee7000 ioremap +0xffffc2000031d000-0xffffc2000031f000 8192 init_vdso_vars+0x112/0x210 +0xffffc2000031f000-0xffffc2000032b000 49152 cramfs_uncompress_init+0x2e ... + /0x80 pages=11 vmalloc N0=3 N1=3 N2=2 N3=3 +0xffffc2000033a000-0xffffc2000033d000 12288 sys_swapon+0x640/0xac0 ... + pages=2 vmalloc N1=2 +0xffffc20000347000-0xffffc2000034c000 20480 xt_alloc_table_info+0xfe ... + /0x130 [x_tables] pages=4 vmalloc N0=4 +0xffffffffa0000000-0xffffffffa000f000 61440 sys_init_module+0xc27/0x1d00 ... + pages=14 vmalloc N2=14 +0xffffffffa000f000-0xffffffffa0014000 20480 sys_init_module+0xc27/0x1d00 ... + pages=4 vmalloc N1=4 +0xffffffffa0014000-0xffffffffa0017000 12288 sys_init_module+0xc27/0x1d00 ... + pages=2 vmalloc N1=2 +0xffffffffa0017000-0xffffffffa0022000 45056 sys_init_module+0xc27/0x1d00 ... + pages=10 vmalloc N0=10 + +.............................................................................. + +softirqs: + +Provides counts of softirq handlers serviced since boot time, for each cpu. + +> cat /proc/softirqs + CPU0 CPU1 CPU2 CPU3 + HI: 0 0 0 0 + TIMER: 27166 27120 27097 27034 + NET_TX: 0 0 0 17 + NET_RX: 42 0 0 39 + BLOCK: 0 0 107 1121 + TASKLET: 0 0 0 290 + SCHED: 27035 26983 26971 26746 + HRTIMER: 0 0 0 0 + RCU: 1678 1769 2178 2250 1.3 IDE devices in /proc/ide @@ -554,10 +958,10 @@ IDE devices: More detailed information can be found in the controller specific subdirectories. These are named ide0, ide1 and so on. Each of these -directories contains the files shown in table 1-5. +directories contains the files shown in table 1-6. -Table 1-5: IDE controller info in /proc/ide/ide? +Table 1-6: IDE controller info in /proc/ide/ide? .............................................................................. File Content channel IDE channel (0 or 1) @@ -567,11 +971,11 @@ Table 1-5: IDE controller info in /proc/ide/ide? .............................................................................. Each device connected to a controller has a separate subdirectory in the -controllers directory. The files listed in table 1-6 are contained in these +controllers directory. The files listed in table 1-7 are contained in these directories. -Table 1-6: IDE device information +Table 1-7: IDE device information .............................................................................. File Content cache The cache @@ -613,12 +1017,12 @@ the drive parameters: 1.4 Networking info in /proc/net -------------------------------- -The subdirectory /proc/net follows the usual pattern. Table 1-6 shows the +The subdirectory /proc/net follows the usual pattern. Table 1-8 shows the additional values you get for IP version 6 if you configure the kernel to -support this. Table 1-7 lists the files and their meaning. +support this. Table 1-9 lists the files and their meaning. -Table 1-6: IPv6 info in /proc/net +Table 1-8: IPv6 info in /proc/net .............................................................................. File Content udp6 UDP sockets (IPv6) @@ -633,7 +1037,7 @@ Table 1-6: IPv6 info in /proc/net .............................................................................. -Table 1-7: Network info in /proc/net +Table 1-9: Network info in /proc/net .............................................................................. File Content arp Kernel ARP table @@ -654,7 +1058,6 @@ Table 1-7: Network info in /proc/net snmp SNMP data sockstat Socket statistics tcp TCP sockets - tr_rif Token ring RIF routing table udp UDP sockets unix UNIX domain sockets wireless Wireless interface data (Wavelan etc) @@ -681,7 +1084,7 @@ your system and how much traffic was routed over those devices: ...] 1375103 17405 0 0 0 0 0 0 ...] 1703981 5535 0 0 0 3 0 0 -In addition, each Channel Bond interface has it's own directory. For +In addition, each Channel Bond interface has its own directory. For example, the bond0 device will have a directory called /proc/net/bond0/. It will contain information that is specific to that bond, such as the current slaves of the bond, the link status of the slaves, and how @@ -757,10 +1160,10 @@ The directory /proc/parport contains information about the parallel ports of your system. It has one subdirectory for each port, named after the port number (0,1,2,...). -These directories contain the four files shown in Table 1-8. +These directories contain the four files shown in Table 1-10. -Table 1-8: Files in /proc/parport +Table 1-10: Files in /proc/parport .............................................................................. File Content autoprobe Any IEEE-1284 device ID information that has been acquired. @@ -778,10 +1181,10 @@ Table 1-8: Files in /proc/parport Information about the available and actually used tty's can be found in the directory /proc/tty.You'll find entries for drivers and line disciplines in -this directory, as shown in Table 1-9. +this directory, as shown in Table 1-11. -Table 1-9: Files in /proc/tty +Table 1-11: Files in /proc/tty .............................................................................. File Content drivers list of drivers and their usage @@ -814,15 +1217,16 @@ Various pieces of information about kernel activity are available in the since the system first booted. For a quick look, simply cat the file: > cat /proc/stat - cpu 2255 34 2290 22625563 6290 127 456 0 - cpu0 1132 34 1441 11311718 3675 127 438 0 - cpu1 1123 0 849 11313845 2614 0 18 0 + cpu 2255 34 2290 22625563 6290 127 456 0 0 + cpu0 1132 34 1441 11311718 3675 127 438 0 0 + cpu1 1123 0 849 11313845 2614 0 18 0 0 intr 114930548 113199788 3 0 5 263 0 4 [... lots more numbers ...] ctxt 1990473 btime 1062191376 processes 2915 procs_running 1 procs_blocked 0 + softirq 183433 0 21755 12 39 1137 231 21459 2263 The very first "cpu" line aggregates the numbers in all of the other "cpuN" lines. These numbers identify the amount of time the CPU has spent performing @@ -837,11 +1241,14 @@ second). The meanings of the columns are as follows, from left to right: - irq: servicing interrupts - softirq: servicing softirqs - steal: involuntary wait +- guest: running a normal guest +- guest_nice: running a niced guest The "intr" line gives counts of interrupts serviced since boot time, for each of the possible system interrupts. The first column is the total of all -interrupts serviced; each subsequent column is the total for that particular -interrupt. +interrupts serviced including unnumbered architecture specific interrupts; +each subsequent column is the total for that particular numbered interrupt. +Unnumbered interrupts are not shown, only summed into the total. The "ctxt" line gives the total number of context switches across all CPUs. @@ -852,51 +1259,57 @@ The "processes" line gives the number of processes and threads created, which includes (but is not limited to) those created by calls to the fork() and clone() system calls. -The "procs_running" line gives the number of processes currently running on -CPUs. +The "procs_running" line gives the total number of threads that are +running or ready to run (i.e., the total number of runnable threads). The "procs_blocked" line gives the number of processes currently blocked, waiting for I/O to complete. +The "softirq" line gives counts of softirqs serviced since boot time, for each +of the possible system softirqs. The first column is the total of all +softirqs serviced; each subsequent column is the total for that particular +softirq. + + 1.9 Ext4 file system parameters ------------------------------ -Ext4 file system have one directory per partition under /proc/fs/ext4/ -# ls /proc/fs/ext4/hdc/ -group_prealloc max_to_scan mb_groups mb_history min_to_scan order2_req -stats stream_req -mb_groups: -This file gives the details of mutiblock allocator buddy cache of free blocks +Information about mounted ext4 file systems can be found in +/proc/fs/ext4. Each mounted filesystem will have a directory in +/proc/fs/ext4 based on its device name (i.e., /proc/fs/ext4/hdc or +/proc/fs/ext4/dm-0). The files in each per-device directory are shown +in Table 1-12, below. -mb_history: -Multiblock allocation history. - -stats: -This file indicate whether the multiblock allocator should start collecting -statistics. The statistics are shown during unmount +Table 1-12: Files in /proc/fs/ext4/<devname> +.............................................................................. + File Content + mb_groups details of multiblock allocator buddy cache of free blocks +.............................................................................. -group_prealloc: -The multiblock allocator normalize the block allocation request to -group_prealloc filesystem blocks if we don't have strip value set. -The stripe value can be specified at mount time or during mke2fs. +2.0 /proc/consoles +------------------ +Shows registered system console lines. -max_to_scan: -How long multiblock allocator can look for a best extent (in found extents) +To see which character device lines are currently used for the system console +/dev/console, you may simply look into the file /proc/consoles: -min_to_scan: -How long multiblock allocator must look for a best extent + > cat /proc/consoles + tty0 -WU (ECp) 4:7 + ttyS0 -W- (Ep) 4:64 -order2_req: -Multiblock allocator use 2^N search using buddies only for requests greater -than or equal to order2_req. The request size is specfied in file system -blocks. A value of 2 indicate only if the requests are greater than or equal -to 4 blocks. +The columns are: -stream_req: -Files smaller than stream_req are served by the stream allocator, whose -purpose is to pack requests as close each to other as possible to -produce smooth I/O traffic. Avalue of 16 indicate that file smaller than 16 -filesystem block size will use group based preallocation. + device name of the device + operations R = can do read operations + W = can do write operations + U = can do unblank + flags E = it is enabled + C = it is preferred console + B = it is primary boot console + p = it is used for printk buffer + b = it is not a TTY but a Braille device + a = it is safe to use when cpu is offline + major:minor major and minor number of the device separated by a colon ------------------------------------------------------------------------------ Summary @@ -945,1259 +1358,8 @@ review the kernel documentation in the directory /usr/src/linux/Documentation. This chapter is heavily based on the documentation included in the pre 2.2 kernels, and became part of it in version 2.2.1 of the Linux kernel. -2.1 /proc/sys/fs - File system data ------------------------------------ - -This subdirectory contains specific file system, file handle, inode, dentry -and quota information. - -Currently, these files are in /proc/sys/fs: - -dentry-state ------------- - -Status of the directory cache. Since directory entries are dynamically -allocated and deallocated, this file indicates the current status. It holds -six values, in which the last two are not used and are always zero. The others -are listed in table 2-1. - - -Table 2-1: Status files of the directory cache -.............................................................................. - File Content - nr_dentry Almost always zero - nr_unused Number of unused cache entries - age_limit - in seconds after the entry may be reclaimed, when memory is short - want_pages internally -.............................................................................. - -dquot-nr and dquot-max ----------------------- - -The file dquot-max shows the maximum number of cached disk quota entries. - -The file dquot-nr shows the number of allocated disk quota entries and the -number of free disk quota entries. - -If the number of available cached disk quotas is very low and you have a large -number of simultaneous system users, you might want to raise the limit. - -file-nr and file-max --------------------- - -The kernel allocates file handles dynamically, but doesn't free them again at -this time. - -The value in file-max denotes the maximum number of file handles that the -Linux kernel will allocate. When you get a lot of error messages about running -out of file handles, you might want to raise this limit. The default value is -10% of RAM in kilobytes. To change it, just write the new number into the -file: - - # cat /proc/sys/fs/file-max - 4096 - # echo 8192 > /proc/sys/fs/file-max - # cat /proc/sys/fs/file-max - 8192 - - -This method of revision is useful for all customizable parameters of the -kernel - simply echo the new value to the corresponding file. - -Historically, the three values in file-nr denoted the number of allocated file -handles, the number of allocated but unused file handles, and the maximum -number of file handles. Linux 2.6 always reports 0 as the number of free file -handles -- this is not an error, it just means that the number of allocated -file handles exactly matches the number of used file handles. - -Attempts to allocate more file descriptors than file-max are reported with -printk, look for "VFS: file-max limit <number> reached". - -inode-state and inode-nr ------------------------- - -The file inode-nr contains the first two items from inode-state, so we'll skip -to that file... - -inode-state contains two actual numbers and five dummy values. The numbers -are nr_inodes and nr_free_inodes (in order of appearance). - -nr_inodes -~~~~~~~~~ - -Denotes the number of inodes the system has allocated. This number will -grow and shrink dynamically. - -nr_open -------- - -Denotes the maximum number of file-handles a process can -allocate. Default value is 1024*1024 (1048576) which should be -enough for most machines. Actual limit depends on RLIMIT_NOFILE -resource limit. - -nr_free_inodes --------------- - -Represents the number of free inodes. Ie. The number of inuse inodes is -(nr_inodes - nr_free_inodes). - -aio-nr and aio-max-nr ---------------------- - -aio-nr is the running total of the number of events specified on the -io_setup system call for all currently active aio contexts. If aio-nr -reaches aio-max-nr then io_setup will fail with EAGAIN. Note that -raising aio-max-nr does not result in the pre-allocation or re-sizing -of any kernel data structures. - -2.2 /proc/sys/fs/binfmt_misc - Miscellaneous binary formats ------------------------------------------------------------ - -Besides these files, there is the subdirectory /proc/sys/fs/binfmt_misc. This -handles the kernel support for miscellaneous binary formats. - -Binfmt_misc provides the ability to register additional binary formats to the -Kernel without compiling an additional module/kernel. Therefore, binfmt_misc -needs to know magic numbers at the beginning or the filename extension of the -binary. - -It works by maintaining a linked list of structs that contain a description of -a binary format, including a magic with size (or the filename extension), -offset and mask, and the interpreter name. On request it invokes the given -interpreter with the original program as argument, as binfmt_java and -binfmt_em86 and binfmt_mz do. Since binfmt_misc does not define any default -binary-formats, you have to register an additional binary-format. - -There are two general files in binfmt_misc and one file per registered format. -The two general files are register and status. - -Registering a new binary format -------------------------------- - -To register a new binary format you have to issue the command - - echo :name:type:offset:magic:mask:interpreter: > /proc/sys/fs/binfmt_misc/register - - - -with appropriate name (the name for the /proc-dir entry), offset (defaults to -0, if omitted), magic, mask (which can be omitted, defaults to all 0xff) and -last but not least, the interpreter that is to be invoked (for example and -testing /bin/echo). Type can be M for usual magic matching or E for filename -extension matching (give extension in place of magic). - -Check or reset the status of the binary format handler ------------------------------------------------------- - -If you do a cat on the file /proc/sys/fs/binfmt_misc/status, you will get the -current status (enabled/disabled) of binfmt_misc. Change the status by echoing -0 (disables) or 1 (enables) or -1 (caution: this clears all previously -registered binary formats) to status. For example echo 0 > status to disable -binfmt_misc (temporarily). - -Status of a single handler --------------------------- - -Each registered handler has an entry in /proc/sys/fs/binfmt_misc. These files -perform the same function as status, but their scope is limited to the actual -binary format. By cating this file, you also receive all related information -about the interpreter/magic of the binfmt. - -Example usage of binfmt_misc (emulate binfmt_java) --------------------------------------------------- - - cd /proc/sys/fs/binfmt_misc - echo ':Java:M::\xca\xfe\xba\xbe::/usr/local/java/bin/javawrapper:' > register - echo ':HTML:E::html::/usr/local/java/bin/appletviewer:' > register - echo ':Applet:M::<!--applet::/usr/local/java/bin/appletviewer:' > register - echo ':DEXE:M::\x0eDEX::/usr/bin/dosexec:' > register - - -These four lines add support for Java executables and Java applets (like -binfmt_java, additionally recognizing the .html extension with no need to put -<!--applet> to every applet file). You have to install the JDK and the -shell-script /usr/local/java/bin/javawrapper too. It works around the -brokenness of the Java filename handling. To add a Java binary, just create a -link to the class-file somewhere in the path. - -2.3 /proc/sys/kernel - general kernel parameters ------------------------------------------------- - -This directory reflects general kernel behaviors. As I've said before, the -contents depend on your configuration. Here you'll find the most important -files, along with descriptions of what they mean and how to use them. - -acct ----- - -The file contains three values; highwater, lowwater, and frequency. - -It exists only when BSD-style process accounting is enabled. These values -control its behavior. If the free space on the file system where the log lives -goes below lowwater percentage, accounting suspends. If it goes above -highwater percentage, accounting resumes. Frequency determines how often you -check the amount of free space (value is in seconds). Default settings are: 4, -2, and 30. That is, suspend accounting if there is less than 2 percent free; -resume it if we have a value of 3 or more percent; consider information about -the amount of free space valid for 30 seconds - -ctrl-alt-del ------------- - -When the value in this file is 0, ctrl-alt-del is trapped and sent to the init -program to handle a graceful restart. However, when the value is greater that -zero, Linux's reaction to this key combination will be an immediate reboot, -without syncing its dirty buffers. - -[NOTE] - When a program (like dosemu) has the keyboard in raw mode, the - ctrl-alt-del is intercepted by the program before it ever reaches the - kernel tty layer, and it is up to the program to decide what to do with - it. - -domainname and hostname ------------------------ - -These files can be controlled to set the NIS domainname and hostname of your -box. For the classic darkstar.frop.org a simple: - - # echo "darkstar" > /proc/sys/kernel/hostname - # echo "frop.org" > /proc/sys/kernel/domainname - - -would suffice to set your hostname and NIS domainname. - -osrelease, ostype and version ------------------------------ - -The names make it pretty obvious what these fields contain: - - > cat /proc/sys/kernel/osrelease - 2.2.12 - - > cat /proc/sys/kernel/ostype - Linux - - > cat /proc/sys/kernel/version - #4 Fri Oct 1 12:41:14 PDT 1999 - - -The files osrelease and ostype should be clear enough. Version needs a little -more clarification. The #4 means that this is the 4th kernel built from this -source base and the date after it indicates the time the kernel was built. The -only way to tune these values is to rebuild the kernel. - -panic ------ - -The value in this file represents the number of seconds the kernel waits -before rebooting on a panic. When you use the software watchdog, the -recommended setting is 60. If set to 0, the auto reboot after a kernel panic -is disabled, which is the default setting. - -printk ------- - -The four values in printk denote -* console_loglevel, -* default_message_loglevel, -* minimum_console_loglevel and -* default_console_loglevel -respectively. - -These values influence printk() behavior when printing or logging error -messages, which come from inside the kernel. See syslog(2) for more -information on the different log levels. - -console_loglevel ----------------- - -Messages with a higher priority than this will be printed to the console. - -default_message_level ---------------------- - -Messages without an explicit priority will be printed with this priority. - -minimum_console_loglevel ------------------------- - -Minimum (highest) value to which the console_loglevel can be set. - -default_console_loglevel ------------------------- - -Default value for console_loglevel. - -sg-big-buff ------------ - -This file shows the size of the generic SCSI (sg) buffer. At this point, you -can't tune it yet, but you can change it at compile time by editing -include/scsi/sg.h and changing the value of SG_BIG_BUFF. - -If you use a scanner with SANE (Scanner Access Now Easy) you might want to set -this to a higher value. Refer to the SANE documentation on this issue. - -modprobe --------- - -The location where the modprobe binary is located. The kernel uses this -program to load modules on demand. - -unknown_nmi_panic ------------------ - -The value in this file affects behavior of handling NMI. When the value is -non-zero, unknown NMI is trapped and then panic occurs. At that time, kernel -debugging information is displayed on console. - -NMI switch that most IA32 servers have fires unknown NMI up, for example. -If a system hangs up, try pressing the NMI switch. - -nmi_watchdog ------------- - -Enables/Disables the NMI watchdog on x86 systems. When the value is non-zero -the NMI watchdog is enabled and will continuously test all online cpus to -determine whether or not they are still functioning properly. - -Because the NMI watchdog shares registers with oprofile, by disabling the NMI -watchdog, oprofile may have more registers to utilize. - -maps_protect ------------- - -Enables/Disables the protection of the per-process proc entries "maps" and -"smaps". When enabled, the contents of these files are visible only to -readers that are allowed to ptrace() the given process. - - -2.4 /proc/sys/vm - The virtual memory subsystem ------------------------------------------------ - -The files in this directory can be used to tune the operation of the virtual -memory (VM) subsystem of the Linux kernel. - -vfs_cache_pressure ------------------- - -Controls the tendency of the kernel to reclaim the memory which is used for -caching of directory and inode objects. - -At the default value of vfs_cache_pressure=100 the kernel will attempt to -reclaim dentries and inodes at a "fair" rate with respect to pagecache and -swapcache reclaim. Decreasing vfs_cache_pressure causes the kernel to prefer -to retain dentry and inode caches. Increasing vfs_cache_pressure beyond 100 -causes the kernel to prefer to reclaim dentries and inodes. - -dirty_background_ratio ----------------------- - -Contains, as a percentage of total system memory, the number of pages at which -the pdflush background writeback daemon will start writing out dirty data. - -dirty_ratio ------------------ - -Contains, as a percentage of total system memory, the number of pages at which -a process which is generating disk writes will itself start writing out dirty -data. - -dirty_writeback_centisecs -------------------------- - -The pdflush writeback daemons will periodically wake up and write `old' data -out to disk. This tunable expresses the interval between those wakeups, in -100'ths of a second. - -Setting this to zero disables periodic writeback altogether. - -dirty_expire_centisecs ----------------------- - -This tunable is used to define when dirty data is old enough to be eligible -for writeout by the pdflush daemons. It is expressed in 100'ths of a second. -Data which has been dirty in-memory for longer than this interval will be -written out next time a pdflush daemon wakes up. - -highmem_is_dirtyable --------------------- - -Only present if CONFIG_HIGHMEM is set. - -This defaults to 0 (false), meaning that the ratios set above are calculated -as a percentage of lowmem only. This protects against excessive scanning -in page reclaim, swapping and general VM distress. - -Setting this to 1 can be useful on 32 bit machines where you want to make -random changes within an MMAPed file that is larger than your available -lowmem without causing large quantities of random IO. Is is safe if the -behavior of all programs running on the machine is known and memory will -not be otherwise stressed. - -legacy_va_layout ----------------- - -If non-zero, this sysctl disables the new 32-bit mmap mmap layout - the kernel -will use the legacy (2.4) layout for all processes. - -lowmem_reserve_ratio ---------------------- - -For some specialised workloads on highmem machines it is dangerous for -the kernel to allow process memory to be allocated from the "lowmem" -zone. This is because that memory could then be pinned via the mlock() -system call, or by unavailability of swapspace. - -And on large highmem machines this lack of reclaimable lowmem memory -can be fatal. - -So the Linux page allocator has a mechanism which prevents allocations -which _could_ use highmem from using too much lowmem. This means that -a certain amount of lowmem is defended from the possibility of being -captured into pinned user memory. - -(The same argument applies to the old 16 megabyte ISA DMA region. This -mechanism will also defend that region from allocations which could use -highmem or lowmem). - -The `lowmem_reserve_ratio' tunable determines how aggressive the kernel is -in defending these lower zones. - -If you have a machine which uses highmem or ISA DMA and your -applications are using mlock(), or if you are running with no swap then -you probably should change the lowmem_reserve_ratio setting. - -The lowmem_reserve_ratio is an array. You can see them by reading this file. -- -% cat /proc/sys/vm/lowmem_reserve_ratio -256 256 32 -- -Note: # of this elements is one fewer than number of zones. Because the highest - zone's value is not necessary for following calculation. - -But, these values are not used directly. The kernel calculates # of protection -pages for each zones from them. These are shown as array of protection pages -in /proc/zoneinfo like followings. (This is an example of x86-64 box). -Each zone has an array of protection pages like this. - -- -Node 0, zone DMA - pages free 1355 - min 3 - low 3 - high 4 - : - : - numa_other 0 - protection: (0, 2004, 2004, 2004) - ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - pagesets - cpu: 0 pcp: 0 - : -- -These protections are added to score to judge whether this zone should be used -for page allocation or should be reclaimed. - -In this example, if normal pages (index=2) are required to this DMA zone and -pages_high is used for watermark, the kernel judges this zone should not be -used because pages_free(1355) is smaller than watermark + protection[2] -(4 + 2004 = 2008). If this protection value is 0, this zone would be used for -normal page requirement. If requirement is DMA zone(index=0), protection[0] -(=0) is used. - -zone[i]'s protection[j] is calculated by following exprssion. - -(i < j): - zone[i]->protection[j] - = (total sums of present_pages from zone[i+1] to zone[j] on the node) - / lowmem_reserve_ratio[i]; -(i = j): - (should not be protected. = 0; -(i > j): - (not necessary, but looks 0) - -The default values of lowmem_reserve_ratio[i] are - 256 (if zone[i] means DMA or DMA32 zone) - 32 (others). -As above expression, they are reciprocal number of ratio. -256 means 1/256. # of protection pages becomes about "0.39%" of total present -pages of higher zones on the node. - -If you would like to protect more pages, smaller values are effective. -The minimum value is 1 (1/1 -> 100%). - -page-cluster ------------- - -page-cluster controls the number of pages which are written to swap in -a single attempt. The swap I/O size. - -It is a logarithmic value - setting it to zero means "1 page", setting -it to 1 means "2 pages", setting it to 2 means "4 pages", etc. - -The default value is three (eight pages at a time). There may be some -small benefits in tuning this to a different value if your workload is -swap-intensive. - -overcommit_memory ------------------ - -Controls overcommit of system memory, possibly allowing processes -to allocate (but not use) more memory than is actually available. - - -0 - Heuristic overcommit handling. Obvious overcommits of - address space are refused. Used for a typical system. It - ensures a seriously wild allocation fails while allowing - overcommit to reduce swap usage. root is allowed to - allocate slightly more memory in this mode. This is the - default. - -1 - Always overcommit. Appropriate for some scientific - applications. - -2 - Don't overcommit. The total address space commit - for the system is not permitted to exceed swap plus a - configurable percentage (default is 50) of physical RAM. - Depending on the percentage you use, in most situations - this means a process will not be killed while attempting - to use already-allocated memory but will receive errors - on memory allocation as appropriate. - -overcommit_ratio ----------------- - -Percentage of physical memory size to include in overcommit calculations -(see above.) - -Memory allocation limit = swapspace + physmem * (overcommit_ratio / 100) - - swapspace = total size of all swap areas - physmem = size of physical memory in system - -nr_hugepages and hugetlb_shm_group ----------------------------------- - -nr_hugepages configures number of hugetlb page reserved for the system. - -hugetlb_shm_group contains group id that is allowed to create SysV shared -memory segment using hugetlb page. - -hugepages_treat_as_movable --------------------------- - -This parameter is only useful when kernelcore= is specified at boot time to -create ZONE_MOVABLE for pages that may be reclaimed or migrated. Huge pages -are not movable so are not normally allocated from ZONE_MOVABLE. A non-zero -value written to hugepages_treat_as_movable allows huge pages to be allocated -from ZONE_MOVABLE. - -Once enabled, the ZONE_MOVABLE is treated as an area of memory the huge -pages pool can easily grow or shrink within. Assuming that applications are -not running that mlock() a lot of memory, it is likely the huge pages pool -can grow to the size of ZONE_MOVABLE by repeatedly entering the desired value -into nr_hugepages and triggering page reclaim. - -laptop_mode ------------ - -laptop_mode is a knob that controls "laptop mode". All the things that are -controlled by this knob are discussed in Documentation/laptops/laptop-mode.txt. - -block_dump ----------- - -block_dump enables block I/O debugging when set to a nonzero value. More -information on block I/O debugging is in Documentation/laptops/laptop-mode.txt. - -swap_token_timeout ------------------- - -This file contains valid hold time of swap out protection token. The Linux -VM has token based thrashing control mechanism and uses the token to prevent -unnecessary page faults in thrashing situation. The unit of the value is -second. The value would be useful to tune thrashing behavior. - -drop_caches ------------ - -Writing to this will cause the kernel to drop clean caches, dentries and -inodes from memory, causing that memory to become free. - -To free pagecache: - echo 1 > /proc/sys/vm/drop_caches -To free dentries and inodes: - echo 2 > /proc/sys/vm/drop_caches -To free pagecache, dentries and inodes: - echo 3 > /proc/sys/vm/drop_caches - -As this is a non-destructive operation and dirty objects are not freeable, the -user should run `sync' first. - - -2.5 /proc/sys/dev - Device specific parameters ----------------------------------------------- - -Currently there is only support for CDROM drives, and for those, there is only -one read-only file containing information about the CD-ROM drives attached to -the system: - - >cat /proc/sys/dev/cdrom/info - CD-ROM information, Id: cdrom.c 2.55 1999/04/25 - - drive name: sr0 hdb - drive speed: 32 40 - drive # of slots: 1 0 - Can close tray: 1 1 - Can open tray: 1 1 - Can lock tray: 1 1 - Can change speed: 1 1 - Can select disk: 0 1 - Can read multisession: 1 1 - Can read MCN: 1 1 - Reports media changed: 1 1 - Can play audio: 1 1 - - -You see two drives, sr0 and hdb, along with a list of their features. - -2.6 /proc/sys/sunrpc - Remote procedure calls ---------------------------------------------- - -This directory contains four files, which enable or disable debugging for the -RPC functions NFS, NFS-daemon, RPC and NLM. The default values are 0. They can -be set to one to turn debugging on. (The default value is 0 for each) - -2.7 /proc/sys/net - Networking stuff ------------------------------------- - -The interface to the networking parts of the kernel is located in -/proc/sys/net. Table 2-3 shows all possible subdirectories. You may see only -some of them, depending on your kernel's configuration. - - -Table 2-3: Subdirectories in /proc/sys/net -.............................................................................. - Directory Content Directory Content - core General parameter appletalk Appletalk protocol - unix Unix domain sockets netrom NET/ROM - 802 E802 protocol ax25 AX25 - ethernet Ethernet protocol rose X.25 PLP layer - ipv4 IP version 4 x25 X.25 protocol - ipx IPX token-ring IBM token ring - bridge Bridging decnet DEC net - ipv6 IP version 6 -.............................................................................. - -We will concentrate on IP networking here. Since AX15, X.25, and DEC Net are -only minor players in the Linux world, we'll skip them in this chapter. You'll -find some short info on Appletalk and IPX further on in this chapter. Review -the online documentation and the kernel source to get a detailed view of the -parameters for those protocols. In this section we'll discuss the -subdirectories printed in bold letters in the table above. As default values -are suitable for most needs, there is no need to change these values. - -/proc/sys/net/core - Network core options ------------------------------------------ - -rmem_default ------------- - -The default setting of the socket receive buffer in bytes. - -rmem_max --------- - -The maximum receive socket buffer size in bytes. - -wmem_default ------------- - -The default setting (in bytes) of the socket send buffer. - -wmem_max --------- - -The maximum send socket buffer size in bytes. - -message_burst and message_cost ------------------------------- - -These parameters are used to limit the warning messages written to the kernel -log from the networking code. They enforce a rate limit to make a -denial-of-service attack impossible. A higher message_cost factor, results in -fewer messages that will be written. Message_burst controls when messages will -be dropped. The default settings limit warning messages to one every five -seconds. - -warnings --------- - -This controls console messages from the networking stack that can occur because -of problems on the network like duplicate address or bad checksums. Normally, -this should be enabled, but if the problem persists the messages can be -disabled. - - -netdev_max_backlog ------------------- - -Maximum number of packets, queued on the INPUT side, when the interface -receives packets faster than kernel can process them. - -optmem_max ----------- - -Maximum ancillary buffer size allowed per socket. Ancillary data is a sequence -of struct cmsghdr structures with appended data. - -/proc/sys/net/unix - Parameters for Unix domain sockets -------------------------------------------------------- - -There are only two files in this subdirectory. They control the delays for -deleting and destroying socket descriptors. - -2.8 /proc/sys/net/ipv4 - IPV4 settings --------------------------------------- - -IP version 4 is still the most used protocol in Unix networking. It will be -replaced by IP version 6 in the next couple of years, but for the moment it's -the de facto standard for the internet and is used in most networking -environments around the world. Because of the importance of this protocol, -we'll have a deeper look into the subtree controlling the behavior of the IPv4 -subsystem of the Linux kernel. - -Let's start with the entries in /proc/sys/net/ipv4. - -ICMP settings -------------- - -icmp_echo_ignore_all and icmp_echo_ignore_broadcasts ----------------------------------------------------- - -Turn on (1) or off (0), if the kernel should ignore all ICMP ECHO requests, or -just those to broadcast and multicast addresses. - -Please note that if you accept ICMP echo requests with a broadcast/multi\-cast -destination address your network may be used as an exploder for denial of -service packet flooding attacks to other hosts. - -icmp_destunreach_rate, icmp_echoreply_rate, icmp_paramprob_rate and icmp_timeexeed_rate ---------------------------------------------------------------------------------------- - -Sets limits for sending ICMP packets to specific targets. A value of zero -disables all limiting. Any positive value sets the maximum package rate in -hundredth of a second (on Intel systems). - -IP settings ------------ - -ip_autoconfig -------------- - -This file contains the number one if the host received its IP configuration by -RARP, BOOTP, DHCP or a similar mechanism. Otherwise it is zero. - -ip_default_ttl --------------- - -TTL (Time To Live) for IPv4 interfaces. This is simply the maximum number of -hops a packet may travel. - -ip_dynaddr ----------- - -Enable dynamic socket address rewriting on interface address change. This is -useful for dialup interface with changing IP addresses. - -ip_forward ----------- - -Enable or disable forwarding of IP packages between interfaces. Changing this -value resets all other parameters to their default values. They differ if the -kernel is configured as host or router. - -ip_local_port_range -------------------- - -Range of ports used by TCP and UDP to choose the local port. Contains two -numbers, the first number is the lowest port, the second number the highest -local port. Default is 1024-4999. Should be changed to 32768-61000 for -high-usage systems. - -ip_no_pmtu_disc ---------------- - -Global switch to turn path MTU discovery off. It can also be set on a per -socket basis by the applications or on a per route basis. - -ip_masq_debug -------------- - -Enable/disable debugging of IP masquerading. - -IP fragmentation settings -------------------------- - -ipfrag_high_trash and ipfrag_low_trash --------------------------------------- - -Maximum memory used to reassemble IP fragments. When ipfrag_high_thresh bytes -of memory is allocated for this purpose, the fragment handler will toss -packets until ipfrag_low_thresh is reached. - -ipfrag_time ------------ - -Time in seconds to keep an IP fragment in memory. - -TCP settings ------------- - -tcp_ecn -------- - -This file controls the use of the ECN bit in the IPv4 headers. This is a new -feature about Explicit Congestion Notification, but some routers and firewalls -block traffic that has this bit set, so it could be necessary to echo 0 to -/proc/sys/net/ipv4/tcp_ecn if you want to talk to these sites. For more info -you could read RFC2481. - -tcp_retrans_collapse --------------------- - -Bug-to-bug compatibility with some broken printers. On retransmit, try to send -larger packets to work around bugs in certain TCP stacks. Can be turned off by -setting it to zero. - -tcp_keepalive_probes --------------------- - -Number of keep alive probes TCP sends out, until it decides that the -connection is broken. - -tcp_keepalive_time ------------------- - -How often TCP sends out keep alive messages, when keep alive is enabled. The -default is 2 hours. - -tcp_syn_retries ---------------- - -Number of times initial SYNs for a TCP connection attempt will be -retransmitted. Should not be higher than 255. This is only the timeout for -outgoing connections, for incoming connections the number of retransmits is -defined by tcp_retries1. - -tcp_sack --------- - -Enable select acknowledgments after RFC2018. - -tcp_timestamps --------------- - -Enable timestamps as defined in RFC1323. - -tcp_stdurg ----------- - -Enable the strict RFC793 interpretation of the TCP urgent pointer field. The -default is to use the BSD compatible interpretation of the urgent pointer -pointing to the first byte after the urgent data. The RFC793 interpretation is -to have it point to the last byte of urgent data. Enabling this option may -lead to interoperability problems. Disabled by default. - -tcp_syncookies --------------- - -Only valid when the kernel was compiled with CONFIG_SYNCOOKIES. Send out -syncookies when the syn backlog queue of a socket overflows. This is to ward -off the common 'syn flood attack'. Disabled by default. - -Note that the concept of a socket backlog is abandoned. This means the peer -may not receive reliable error messages from an over loaded server with -syncookies enabled. - -tcp_window_scaling ------------------- - -Enable window scaling as defined in RFC1323. - -tcp_fin_timeout ---------------- - -The length of time in seconds it takes to receive a final FIN before the -socket is always closed. This is strictly a violation of the TCP -specification, but required to prevent denial-of-service attacks. - -tcp_max_ka_probes ------------------ - -Indicates how many keep alive probes are sent per slow timer run. Should not -be set too high to prevent bursts. - -tcp_max_syn_backlog -------------------- - -Length of the per socket backlog queue. Since Linux 2.2 the backlog specified -in listen(2) only specifies the length of the backlog queue of already -established sockets. When more connection requests arrive Linux starts to drop -packets. When syncookies are enabled the packets are still answered and the -maximum queue is effectively ignored. - -tcp_retries1 ------------- - -Defines how often an answer to a TCP connection request is retransmitted -before giving up. - -tcp_retries2 ------------- - -Defines how often a TCP packet is retransmitted before giving up. - -Interface specific settings ---------------------------- - -In the directory /proc/sys/net/ipv4/conf you'll find one subdirectory for each -interface the system knows about and one directory calls all. Changes in the -all subdirectory affect all interfaces, whereas changes in the other -subdirectories affect only one interface. All directories have the same -entries: - -accept_redirects ----------------- - -This switch decides if the kernel accepts ICMP redirect messages or not. The -default is 'yes' if the kernel is configured for a regular host and 'no' for a -router configuration. - -accept_source_route -------------------- - -Should source routed packages be accepted or declined. The default is -dependent on the kernel configuration. It's 'yes' for routers and 'no' for -hosts. - -bootp_relay -~~~~~~~~~~~ - -Accept packets with source address 0.b.c.d with destinations not to this host -as local ones. It is supposed that a BOOTP relay daemon will catch and forward -such packets. - -The default is 0, since this feature is not implemented yet (kernel version -2.2.12). - -forwarding ----------- - -Enable or disable IP forwarding on this interface. - -log_martians ------------- - -Log packets with source addresses with no known route to kernel log. - -mc_forwarding -------------- - -Do multicast routing. The kernel needs to be compiled with CONFIG_MROUTE and a -multicast routing daemon is required. - -proxy_arp ---------- - -Does (1) or does not (0) perform proxy ARP. - -rp_filter ---------- - -Integer value determines if a source validation should be made. 1 means yes, 0 -means no. Disabled by default, but local/broadcast address spoofing is always -on. - -If you set this to 1 on a router that is the only connection for a network to -the net, it will prevent spoofing attacks against your internal networks -(external addresses can still be spoofed), without the need for additional -firewall rules. - -secure_redirects ----------------- - -Accept ICMP redirect messages only for gateways, listed in default gateway -list. Enabled by default. - -shared_media ------------- - -If it is not set the kernel does not assume that different subnets on this -device can communicate directly. Default setting is 'yes'. - -send_redirects --------------- - -Determines whether to send ICMP redirects to other hosts. - -Routing settings ----------------- - -The directory /proc/sys/net/ipv4/route contains several file to control -routing issues. - -error_burst and error_cost --------------------------- - -These parameters are used to limit how many ICMP destination unreachable to -send from the host in question. ICMP destination unreachable messages are -sent when we cannot reach the next hop while trying to transmit a packet. -It will also print some error messages to kernel logs if someone is ignoring -our ICMP redirects. The higher the error_cost factor is, the fewer -destination unreachable and error messages will be let through. Error_burst -controls when destination unreachable messages and error messages will be -dropped. The default settings limit warning messages to five every second. - -flush ------ - -Writing to this file results in a flush of the routing cache. - -gc_elasticity, gc_interval, gc_min_interval_ms, gc_timeout, gc_thresh ---------------------------------------------------------------------- - -Values to control the frequency and behavior of the garbage collection -algorithm for the routing cache. gc_min_interval is deprecated and replaced -by gc_min_interval_ms. - - -max_size --------- - -Maximum size of the routing cache. Old entries will be purged once the cache -reached has this size. - -redirect_load, redirect_number ------------------------------- - -Factors which determine if more ICPM redirects should be sent to a specific -host. No redirects will be sent once the load limit or the maximum number of -redirects has been reached. - -redirect_silence ----------------- - -Timeout for redirects. After this period redirects will be sent again, even if -this has been stopped, because the load or number limit has been reached. - -Network Neighbor handling -------------------------- - -Settings about how to handle connections with direct neighbors (nodes attached -to the same link) can be found in the directory /proc/sys/net/ipv4/neigh. - -As we saw it in the conf directory, there is a default subdirectory which -holds the default values, and one directory for each interface. The contents -of the directories are identical, with the single exception that the default -settings contain additional options to set garbage collection parameters. - -In the interface directories you'll find the following entries: - -base_reachable_time, base_reachable_time_ms -------------------------------------------- - -A base value used for computing the random reachable time value as specified -in RFC2461. - -Expression of base_reachable_time, which is deprecated, is in seconds. -Expression of base_reachable_time_ms is in milliseconds. - -retrans_time, retrans_time_ms ------------------------------ - -The time between retransmitted Neighbor Solicitation messages. -Used for address resolution and to determine if a neighbor is -unreachable. - -Expression of retrans_time, which is deprecated, is in 1/100 seconds (for -IPv4) or in jiffies (for IPv6). -Expression of retrans_time_ms is in milliseconds. - -unres_qlen ----------- - -Maximum queue length for a pending arp request - the number of packets which -are accepted from other layers while the ARP address is still resolved. - -anycast_delay -------------- - -Maximum for random delay of answers to neighbor solicitation messages in -jiffies (1/100 sec). Not yet implemented (Linux does not have anycast support -yet). - -ucast_solicit -------------- - -Maximum number of retries for unicast solicitation. - -mcast_solicit -------------- - -Maximum number of retries for multicast solicitation. - -delay_first_probe_time ----------------------- - -Delay for the first time probe if the neighbor is reachable. (see -gc_stale_time) - -locktime --------- - -An ARP/neighbor entry is only replaced with a new one if the old is at least -locktime old. This prevents ARP cache thrashing. - -proxy_delay ------------ - -Maximum time (real time is random [0..proxytime]) before answering to an ARP -request for which we have an proxy ARP entry. In some cases, this is used to -prevent network flooding. - -proxy_qlen ----------- - -Maximum queue length of the delayed proxy arp timer. (see proxy_delay). - -app_solicit ----------- - -Determines the number of requests to send to the user level ARP daemon. Use 0 -to turn off. - -gc_stale_time -------------- - -Determines how often to check for stale ARP entries. After an ARP entry is -stale it will be resolved again (which is useful when an IP address migrates -to another machine). When ucast_solicit is greater than 0 it first tries to -send an ARP packet directly to the known host When that fails and -mcast_solicit is greater than 0, an ARP request is broadcasted. - -2.9 Appletalk -------------- - -The /proc/sys/net/appletalk directory holds the Appletalk configuration data -when Appletalk is loaded. The configurable parameters are: - -aarp-expiry-time ----------------- - -The amount of time we keep an ARP entry before expiring it. Used to age out -old hosts. - -aarp-resolve-time ------------------ - -The amount of time we will spend trying to resolve an Appletalk address. - -aarp-retransmit-limit ---------------------- - -The number of times we will retransmit a query before giving up. - -aarp-tick-time --------------- - -Controls the rate at which expires are checked. - -The directory /proc/net/appletalk holds the list of active Appletalk sockets -on a machine. - -The fields indicate the DDP type, the local address (in network:node format) -the remote address, the size of the transmit pending queue, the size of the -received queue (bytes waiting for applications to read) the state and the uid -owning the socket. - -/proc/net/atalk_iface lists all the interfaces configured for appletalk.It -shows the name of the interface, its Appletalk address, the network range on -that address (or network number for phase 1 networks), and the status of the -interface. - -/proc/net/atalk_route lists each known network route. It lists the target -(network) that the route leads to, the router (may be directly connected), the -route flags, and the device the route is using. - -2.10 IPX --------- - -The IPX protocol has no tunable values in proc/sys/net. - -The IPX protocol does, however, provide proc/net/ipx. This lists each IPX -socket giving the local and remote addresses in Novell format (that is -network:node:port). In accordance with the strange Novell tradition, -everything but the port is in hex. Not_Connected is displayed for sockets that -are not tied to a specific remote address. The Tx and Rx queue sizes indicate -the number of bytes pending for transmission and reception. The state -indicates the state the socket is in and the uid is the owning uid of the -socket. - -The /proc/net/ipx_interface file lists all IPX interfaces. For each interface -it gives the network number, the node number, and indicates if the network is -the primary network. It also indicates which device it is bound to (or -Internal for internal networks) and the Frame Type if appropriate. Linux -supports 802.3, 802.2, 802.2 SNAP and DIX (Blue Book) ethernet framing for -IPX. - -The /proc/net/ipx_route table holds a list of IPX routes. For each route it -gives the destination network, the router node (or Directly) and the network -address of the router (or Connected) for internal networks. - -2.11 /proc/sys/fs/mqueue - POSIX message queues filesystem ----------------------------------------------------------- - -The "mqueue" filesystem provides the necessary kernel features to enable the -creation of a user space library that implements the POSIX message queues -API (as noted by the MSG tag in the POSIX 1003.1-2001 version of the System -Interfaces specification.) - -The "mqueue" filesystem contains values for determining/setting the amount of -resources used by the file system. - -/proc/sys/fs/mqueue/queues_max is a read/write file for setting/getting the -maximum number of message queues allowed on the system. - -/proc/sys/fs/mqueue/msg_max is a read/write file for setting/getting the -maximum number of messages in a queue value. In fact it is the limiting value -for another (user) limit which is set in mq_open invocation. This attribute of -a queue must be less or equal then msg_max. - -/proc/sys/fs/mqueue/msgsize_max is a read/write file for setting/getting the -maximum message size value (it is every message queue's attribute set during -its creation). - -2.12 /proc/<pid>/oom_adj - Adjust the oom-killer score ------------------------------------------------------- - -This file can be used to adjust the score used to select which processes -should be killed in an out-of-memory situation. Giving it a high score will -increase the likelihood of this process being killed by the oom-killer. Valid -values are in the range -16 to +15, plus the special value -17, which disables -oom-killing altogether for this process. - -2.13 /proc/<pid>/oom_score - Display current oom-killer score -------------------------------------------------------------- - ------------------------------------------------------------------------------- -This file can be used to check the current score used by the oom-killer is for -any given <pid>. Use it together with /proc/<pid>/oom_adj to tune which -process should be killed in an out-of-memory situation. +Please see: Documentation/sysctl/ directory for descriptions of these +entries. ------------------------------------------------------------------------------ Summary @@ -2209,7 +1371,76 @@ command to write value into these files, thereby changing the default settings of the kernel. ------------------------------------------------------------------------------ -2.14 /proc/<pid>/io - Display the IO accounting fields +------------------------------------------------------------------------------ +CHAPTER 3: PER-PROCESS PARAMETERS +------------------------------------------------------------------------------ + +3.1 /proc/<pid>/oom_adj & /proc/<pid>/oom_score_adj- Adjust the oom-killer score +-------------------------------------------------------------------------------- + +These file can be used to adjust the badness heuristic used to select which +process gets killed in out of memory conditions. + +The badness heuristic assigns a value to each candidate task ranging from 0 +(never kill) to 1000 (always kill) to determine which process is targeted. The +units are roughly a proportion along that range of allowed memory the process +may allocate from based on an estimation of its current memory and swap use. +For example, if a task is using all allowed memory, its badness score will be +1000. If it is using half of its allowed memory, its score will be 500. + +There is an additional factor included in the badness score: the current memory +and swap usage is discounted by 3% for root processes. + +The amount of "allowed" memory depends on the context in which the oom killer +was called. If it is due to the memory assigned to the allocating task's cpuset +being exhausted, the allowed memory represents the set of mems assigned to that +cpuset. If it is due to a mempolicy's node(s) being exhausted, the allowed +memory represents the set of mempolicy nodes. If it is due to a memory +limit (or swap limit) being reached, the allowed memory is that configured +limit. Finally, if it is due to the entire system being out of memory, the +allowed memory represents all allocatable resources. + +The value of /proc/<pid>/oom_score_adj is added to the badness score before it +is used to determine which task to kill. Acceptable values range from -1000 +(OOM_SCORE_ADJ_MIN) to +1000 (OOM_SCORE_ADJ_MAX). This allows userspace to +polarize the preference for oom killing either by always preferring a certain +task or completely disabling it. The lowest possible value, -1000, is +equivalent to disabling oom killing entirely for that task since it will always +report a badness score of 0. + +Consequently, it is very simple for userspace to define the amount of memory to +consider for each task. Setting a /proc/<pid>/oom_score_adj value of +500, for +example, is roughly equivalent to allowing the remainder of tasks sharing the +same system, cpuset, mempolicy, or memory controller resources to use at least +50% more memory. A value of -500, on the other hand, would be roughly +equivalent to discounting 50% of the task's allowed memory from being considered +as scoring against the task. + +For backwards compatibility with previous kernels, /proc/<pid>/oom_adj may also +be used to tune the badness score. Its acceptable values range from -16 +(OOM_ADJUST_MIN) to +15 (OOM_ADJUST_MAX) and a special value of -17 +(OOM_DISABLE) to disable oom killing entirely for that task. Its value is +scaled linearly with /proc/<pid>/oom_score_adj. + +The value of /proc/<pid>/oom_score_adj may be reduced no lower than the last +value set by a CAP_SYS_RESOURCE process. To reduce the value any lower +requires CAP_SYS_RESOURCE. + +Caveat: when a parent task is selected, the oom killer will sacrifice any first +generation children with separate address spaces instead, if possible. This +avoids servers and important system daemons from being killed and loses the +minimal amount of work. + + +3.2 /proc/<pid>/oom_score - Display current oom-killer score +------------------------------------------------------------- + +This file can be used to check the current score used by the oom-killer is for +any given <pid>. Use it together with /proc/<pid>/oom_score_adj to tune which +process should be killed in an out-of-memory situation. + + +3.3 /proc/<pid>/io - Display the IO accounting fields ------------------------------------------------------- This file contains IO statistics for each running process @@ -2295,7 +1526,7 @@ been accounted as having caused 1MB of write. In other words: The number of bytes which this process caused to not happen, by truncating pagecache. A task can cause "negative" IO too. If this task truncates some dirty pagecache, some IO which another task has been accounted -for (in it's write_bytes) will not be happening. We _could_ just subtract that +for (in its write_bytes) will not be happening. We _could_ just subtract that from the truncating task's write_bytes, but there is information loss in doing that. @@ -2311,7 +1542,7 @@ those 64-bit counters, process A could see an intermediate result. More information about this can be found within the taskstats documentation in Documentation/accounting. -2.15 /proc/<pid>/coredump_filter - Core dump filtering settings +3.4 /proc/<pid>/coredump_filter - Core dump filtering settings --------------------------------------------------------------- When a process is dumped, all anonymous memory is written to a core file as long as the size of the core file isn't limited. But sometimes we don't want @@ -2324,22 +1555,29 @@ will be dumped when the <pid> process is dumped. coredump_filter is a bitmask of memory types. If a bit of the bitmask is set, memory segments of the corresponding memory type are dumped, otherwise they are not dumped. -The following 4 memory types are supported: +The following 7 memory types are supported: - (bit 0) anonymous private memory - (bit 1) anonymous shared memory - (bit 2) file-backed private memory - (bit 3) file-backed shared memory + - (bit 4) ELF header pages in file-backed private memory areas (it is + effective only if the bit 2 is cleared) + - (bit 5) hugetlb private memory + - (bit 6) hugetlb shared memory Note that MMIO pages such as frame buffer are never dumped and vDSO pages are always dumped regardless of the bitmask status. -Default value of coredump_filter is 0x3; this means all anonymous memory -segments are dumped. + Note bit 0-4 doesn't effect any hugetlb memory. hugetlb memory are only + effected by bit 5-6. + +Default value of coredump_filter is 0x23; this means all anonymous memory +segments and hugetlb private memory are dumped. If you don't want to dump all shared memory segments attached to pid 1234, -write 1 to the process's proc file. +write 0x21 to the process's proc file. - $ echo 0x1 > /proc/1234/coredump_filter + $ echo 0x21 > /proc/1234/coredump_filter When a new process is created, the process inherits the bitmask status from its parent. It is useful to set up coredump_filter before the program runs. @@ -2348,4 +1586,196 @@ For example: $ echo 0x7 > /proc/self/coredump_filter $ ./some_program +3.5 /proc/<pid>/mountinfo - Information about mounts +-------------------------------------------------------- + +This file contains lines of the form: + +36 35 98:0 /mnt1 /mnt2 rw,noatime master:1 - ext3 /dev/root rw,errors=continue +(1)(2)(3) (4) (5) (6) (7) (8) (9) (10) (11) + +(1) mount ID: unique identifier of the mount (may be reused after umount) +(2) parent ID: ID of parent (or of self for the top of the mount tree) +(3) major:minor: value of st_dev for files on filesystem +(4) root: root of the mount within the filesystem +(5) mount point: mount point relative to the process's root +(6) mount options: per mount options +(7) optional fields: zero or more fields of the form "tag[:value]" +(8) separator: marks the end of the optional fields +(9) filesystem type: name of filesystem of the form "type[.subtype]" +(10) mount source: filesystem specific information or "none" +(11) super options: per super block options + +Parsers should ignore all unrecognised optional fields. Currently the +possible optional fields are: + +shared:X mount is shared in peer group X +master:X mount is slave to peer group X +propagate_from:X mount is slave and receives propagation from peer group X (*) +unbindable mount is unbindable + +(*) X is the closest dominant peer group under the process's root. If +X is the immediate master of the mount, or if there's no dominant peer +group under the same root, then only the "master:X" field is present +and not the "propagate_from:X" field. + +For more information on mount propagation see: + + Documentation/filesystems/sharedsubtree.txt + + +3.6 /proc/<pid>/comm & /proc/<pid>/task/<tid>/comm +-------------------------------------------------------- +These files provide a method to access a tasks comm value. It also allows for +a task to set its own or one of its thread siblings comm value. The comm value +is limited in size compared to the cmdline value, so writing anything longer +then the kernel's TASK_COMM_LEN (currently 16 chars) will result in a truncated +comm value. + + +3.7 /proc/<pid>/task/<tid>/children - Information about task children +------------------------------------------------------------------------- +This file provides a fast way to retrieve first level children pids +of a task pointed by <pid>/<tid> pair. The format is a space separated +stream of pids. + +Note the "first level" here -- if a child has own children they will +not be listed here, one needs to read /proc/<children-pid>/task/<tid>/children +to obtain the descendants. + +Since this interface is intended to be fast and cheap it doesn't +guarantee to provide precise results and some children might be +skipped, especially if they've exited right after we printed their +pids, so one need to either stop or freeze processes being inspected +if precise results are needed. + + +3.8 /proc/<pid>/fdinfo/<fd> - Information about opened file +--------------------------------------------------------------- +This file provides information associated with an opened file. The regular +files have at least three fields -- 'pos', 'flags' and mnt_id. The 'pos' +represents the current offset of the opened file in decimal form [see lseek(2) +for details], 'flags' denotes the octal O_xxx mask the file has been +created with [see open(2) for details] and 'mnt_id' represents mount ID of +the file system containing the opened file [see 3.5 /proc/<pid>/mountinfo +for details]. + +A typical output is + + pos: 0 + flags: 0100002 + mnt_id: 19 + +The files such as eventfd, fsnotify, signalfd, epoll among the regular pos/flags +pair provide additional information particular to the objects they represent. + + Eventfd files + ~~~~~~~~~~~~~ + pos: 0 + flags: 04002 + mnt_id: 9 + eventfd-count: 5a + + where 'eventfd-count' is hex value of a counter. + + Signalfd files + ~~~~~~~~~~~~~~ + pos: 0 + flags: 04002 + mnt_id: 9 + sigmask: 0000000000000200 + + where 'sigmask' is hex value of the signal mask associated + with a file. + + Epoll files + ~~~~~~~~~~~ + pos: 0 + flags: 02 + mnt_id: 9 + tfd: 5 events: 1d data: ffffffffffffffff + + where 'tfd' is a target file descriptor number in decimal form, + 'events' is events mask being watched and the 'data' is data + associated with a target [see epoll(7) for more details]. + + Fsnotify files + ~~~~~~~~~~~~~~ + For inotify files the format is the following + + pos: 0 + flags: 02000000 + inotify wd:3 ino:9e7e sdev:800013 mask:800afce ignored_mask:0 fhandle-bytes:8 fhandle-type:1 f_handle:7e9e0000640d1b6d + + where 'wd' is a watch descriptor in decimal form, ie a target file + descriptor number, 'ino' and 'sdev' are inode and device where the + target file resides and the 'mask' is the mask of events, all in hex + form [see inotify(7) for more details]. + + If the kernel was built with exportfs support, the path to the target + file is encoded as a file handle. The file handle is provided by three + fields 'fhandle-bytes', 'fhandle-type' and 'f_handle', all in hex + format. + + If the kernel is built without exportfs support the file handle won't be + printed out. + + If there is no inotify mark attached yet the 'inotify' line will be omitted. + + For fanotify files the format is + + pos: 0 + flags: 02 + mnt_id: 9 + fanotify flags:10 event-flags:0 + fanotify mnt_id:12 mflags:40 mask:38 ignored_mask:40000003 + fanotify ino:4f969 sdev:800013 mflags:0 mask:3b ignored_mask:40000000 fhandle-bytes:8 fhandle-type:1 f_handle:69f90400c275b5b4 + + where fanotify 'flags' and 'event-flags' are values used in fanotify_init + call, 'mnt_id' is the mount point identifier, 'mflags' is the value of + flags associated with mark which are tracked separately from events + mask. 'ino', 'sdev' are target inode and device, 'mask' is the events + mask and 'ignored_mask' is the mask of events which are to be ignored. + All in hex format. Incorporation of 'mflags', 'mask' and 'ignored_mask' + does provide information about flags and mask used in fanotify_mark + call [see fsnotify manpage for details]. + + While the first three lines are mandatory and always printed, the rest is + optional and may be omitted if no marks created yet. + + +------------------------------------------------------------------------------ +Configuring procfs ------------------------------------------------------------------------------ + +4.1 Mount options +--------------------- + +The following mount options are supported: + + hidepid= Set /proc/<pid>/ access mode. + gid= Set the group authorized to learn processes information. + +hidepid=0 means classic mode - everybody may access all /proc/<pid>/ directories +(default). + +hidepid=1 means users may not access any /proc/<pid>/ directories but their +own. Sensitive files like cmdline, sched*, status are now protected against +other users. This makes it impossible to learn whether any user runs +specific program (given the program doesn't reveal itself by its behaviour). +As an additional bonus, as /proc/<pid>/cmdline is unaccessible for other users, +poorly written programs passing sensitive information via program arguments are +now protected against local eavesdroppers. + +hidepid=2 means hidepid=1 plus all /proc/<pid>/ will be fully invisible to other +users. It doesn't mean that it hides a fact whether a process with a specific +pid value exists (it can be learned by other means, e.g. by "kill -0 $PID"), +but it hides process' uid and gid, which may be learned by stat()'ing +/proc/<pid>/ otherwise. It greatly complicates an intruder's task of gathering +information about running processes, whether some daemon runs with elevated +privileges, whether other user runs some sensitive program, whether other users +run any program at all, etc. + +gid= defines a group authorized to learn processes information otherwise +prohibited by hidepid=. If you use some daemon like identd which needs to learn +information about processes information, just add identd to this group. diff --git a/Documentation/filesystems/qnx6.txt b/Documentation/filesystems/qnx6.txt new file mode 100644 index 00000000000..40867978913 --- /dev/null +++ b/Documentation/filesystems/qnx6.txt @@ -0,0 +1,174 @@ +The QNX6 Filesystem +=================== + +The qnx6fs is used by newer QNX operating system versions. (e.g. Neutrino) +It got introduced in QNX 6.4.0 and is used default since 6.4.1. + +Option +====== + +mmi_fs Mount filesystem as used for example by Audi MMI 3G system + +Specification +============= + +qnx6fs shares many properties with traditional Unix filesystems. It has the +concepts of blocks, inodes and directories. +On QNX it is possible to create little endian and big endian qnx6 filesystems. +This feature makes it possible to create and use a different endianness fs +for the target (QNX is used on quite a range of embedded systems) plattform +running on a different endianness. +The Linux driver handles endianness transparently. (LE and BE) + +Blocks +------ + +The space in the device or file is split up into blocks. These are a fixed +size of 512, 1024, 2048 or 4096, which is decided when the filesystem is +created. +Blockpointers are 32bit, so the maximum space that can be addressed is +2^32 * 4096 bytes or 16TB + +The superblocks +--------------- + +The superblock contains all global information about the filesystem. +Each qnx6fs got two superblocks, each one having a 64bit serial number. +That serial number is used to identify the "active" superblock. +In write mode with reach new snapshot (after each synchronous write), the +serial of the new master superblock is increased (old superblock serial + 1) + +So basically the snapshot functionality is realized by an atomic final +update of the serial number. Before updating that serial, all modifications +are done by copying all modified blocks during that specific write request +(or period) and building up a new (stable) filesystem structure under the +inactive superblock. + +Each superblock holds a set of root inodes for the different filesystem +parts. (Inode, Bitmap and Longfilenames) +Each of these root nodes holds information like total size of the stored +data and the addressing levels in that specific tree. +If the level value is 0, up to 16 direct blocks can be addressed by each +node. +Level 1 adds an additional indirect addressing level where each indirect +addressing block holds up to blocksize / 4 bytes pointers to data blocks. +Level 2 adds an additional indirect addressing block level (so, already up +to 16 * 256 * 256 = 1048576 blocks that can be addressed by such a tree). + +Unused block pointers are always set to ~0 - regardless of root node, +indirect addressing blocks or inodes. +Data leaves are always on the lowest level. So no data is stored on upper +tree levels. + +The first Superblock is located at 0x2000. (0x2000 is the bootblock size) +The Audi MMI 3G first superblock directly starts at byte 0. +Second superblock position can either be calculated from the superblock +information (total number of filesystem blocks) or by taking the highest +device address, zeroing the last 3 bytes and then subtracting 0x1000 from +that address. + +0x1000 is the size reserved for each superblock - regardless of the +blocksize of the filesystem. + +Inodes +------ + +Each object in the filesystem is represented by an inode. (index node) +The inode structure contains pointers to the filesystem blocks which contain +the data held in the object and all of the metadata about an object except +its longname. (filenames longer than 27 characters) +The metadata about an object includes the permissions, owner, group, flags, +size, number of blocks used, access time, change time and modification time. + +Object mode field is POSIX format. (which makes things easier) + +There are also pointers to the first 16 blocks, if the object data can be +addressed with 16 direct blocks. +For more than 16 blocks an indirect addressing in form of another tree is +used. (scheme is the same as the one used for the superblock root nodes) + +The filesize is stored 64bit. Inode counting starts with 1. (whilst long +filename inodes start with 0) + +Directories +----------- + +A directory is a filesystem object and has an inode just like a file. +It is a specially formatted file containing records which associate each +name with an inode number. +'.' inode number points to the directory inode +'..' inode number points to the parent directory inode +Eeach filename record additionally got a filename length field. + +One special case are long filenames or subdirectory names. +These got set a filename length field of 0xff in the corresponding directory +record plus the longfile inode number also stored in that record. +With that longfilename inode number, the longfilename tree can be walked +starting with the superblock longfilename root node pointers. + +Special files +------------- + +Symbolic links are also filesystem objects with inodes. They got a specific +bit in the inode mode field identifying them as symbolic link. +The directory entry file inode pointer points to the target file inode. + +Hard links got an inode, a directory entry, but a specific mode bit set, +no block pointers and the directory file record pointing to the target file +inode. + +Character and block special devices do not exist in QNX as those files +are handled by the QNX kernel/drivers and created in /dev independent of the +underlaying filesystem. + +Long filenames +-------------- + +Long filenames are stored in a separate addressing tree. The staring point +is the longfilename root node in the active superblock. +Each data block (tree leaves) holds one long filename. That filename is +limited to 510 bytes. The first two starting bytes are used as length field +for the actual filename. +If that structure shall fit for all allowed blocksizes, it is clear why there +is a limit of 510 bytes for the actual filename stored. + +Bitmap +------ + +The qnx6fs filesystem allocation bitmap is stored in a tree under bitmap +root node in the superblock and each bit in the bitmap represents one +filesystem block. +The first block is block 0, which starts 0x1000 after superblock start. +So for a normal qnx6fs 0x3000 (bootblock + superblock) is the physical +address at which block 0 is located. + +Bits at the end of the last bitmap block are set to 1, if the device is +smaller than addressing space in the bitmap. + +Bitmap system area +------------------ + +The bitmap itself is divided into three parts. +First the system area, that is split into two halves. +Then userspace. + +The requirement for a static, fixed preallocated system area comes from how +qnx6fs deals with writes. +Each superblock got it's own half of the system area. So superblock #1 +always uses blocks from the lower half whilst superblock #2 just writes to +blocks represented by the upper half bitmap system area bits. + +Bitmap blocks, Inode blocks and indirect addressing blocks for those two +tree structures are treated as system blocks. + +The rational behind that is that a write request can work on a new snapshot +(system area of the inactive - resp. lower serial numbered superblock) while +at the same time there is still a complete stable filesystem structer in the +other half of the system area. + +When finished with writing (a sync write is completed, the maximum sync leap +time or a filesystem sync is requested), serial of the previously inactive +superblock atomically is increased and the fs switches over to that - then +stable declared - superblock. + +For all data outside the system area, blocks are just copied while writing. diff --git a/Documentation/filesystems/quota.txt b/Documentation/filesystems/quota.txt index a590c4093ef..5e8de25bf0f 100644 --- a/Documentation/filesystems/quota.txt +++ b/Documentation/filesystems/quota.txt @@ -3,14 +3,14 @@ Quota subsystem =============== Quota subsystem allows system administrator to set limits on used space and -number of used inodes (inode is a filesystem structure which is associated -with each file or directory) for users and/or groups. For both used space and -number of used inodes there are actually two limits. The first one is called -softlimit and the second one hardlimit. An user can never exceed a hardlimit -for any resource. User is allowed to exceed softlimit but only for limited -period of time. This period is called "grace period" or "grace time". When -grace time is over, user is not able to allocate more space/inodes until he -frees enough of them to get below softlimit. +number of used inodes (inode is a filesystem structure which is associated with +each file or directory) for users and/or groups. For both used space and number +of used inodes there are actually two limits. The first one is called softlimit +and the second one hardlimit. An user can never exceed a hardlimit for any +resource (unless he has CAP_SYS_RESOURCE capability). User is allowed to exceed +softlimit but only for limited period of time. This period is called "grace +period" or "grace time". When grace time is over, user is not able to allocate +more space/inodes until he frees enough of them to get below softlimit. Quota limits (and amount of grace time) are set independently for each filesystem. @@ -53,6 +53,12 @@ in parentheses): QUOTA_NL_BSOFTLONGWARN - space (block) softlimit is exceeded longer than given grace period. QUOTA_NL_BSOFTWARN - space (block) softlimit + - four warnings are also defined for the event when user stops + exceeding some limit: + QUOTA_NL_IHARDBELOW - inode hardlimit + QUOTA_NL_ISOFTBELOW - inode softlimit + QUOTA_NL_BHARDBELOW - space (block) hardlimit + QUOTA_NL_BSOFTBELOW - space (block) softlimit QUOTA_NL_A_DEV_MAJOR (u32) - major number of a device with the affected filesystem QUOTA_NL_A_DEV_MINOR (u32) diff --git a/Documentation/filesystems/ramfs-rootfs-initramfs.txt b/Documentation/filesystems/ramfs-rootfs-initramfs.txt index 7be232b44ee..b176928e696 100644 --- a/Documentation/filesystems/ramfs-rootfs-initramfs.txt +++ b/Documentation/filesystems/ramfs-rootfs-initramfs.txt @@ -79,6 +79,10 @@ to just make sure certain lists can't become empty. Most systems just mount another filesystem over rootfs and ignore it. The amount of space an empty instance of ramfs takes up is tiny. +If CONFIG_TMPFS is enabled, rootfs will use tmpfs instead of ramfs by +default. To force ramfs, add "rootfstype=ramfs" to the kernel command +line. + What is initramfs? ------------------ @@ -130,12 +134,12 @@ The 2.6 kernel build process always creates a gzipped cpio format initramfs archive and links it into the resulting kernel binary. By default, this archive is empty (consuming 134 bytes on x86). -The config option CONFIG_INITRAMFS_SOURCE (for some reason buried under -devices->block devices in menuconfig, and living in usr/Kconfig) can be used -to specify a source for the initramfs archive, which will automatically be -incorporated into the resulting binary. This option can point to an existing -gzipped cpio archive, a directory containing files to be archived, or a text -file specification such as the following example: +The config option CONFIG_INITRAMFS_SOURCE (in General Setup in menuconfig, +and living in usr/Kconfig) can be used to specify a source for the +initramfs archive, which will automatically be incorporated into the +resulting binary. This option can point to an existing gzipped cpio +archive, a directory containing files to be archived, or a text file +specification such as the following example: dir /dev 755 0 0 nod /dev/console 644 0 0 c 5 1 @@ -263,7 +267,7 @@ User Mode Linux, like so: sleep(999999999); } EOF - gcc -static hello2.c -o init + gcc -static hello.c -o init echo init | cpio -o -H newc | gzip > test.cpio.gz # Testing external initramfs using the initrd loading mechanism. qemu -kernel /boot/vmlinuz -initrd test.cpio.gz /dev/zero @@ -297,7 +301,7 @@ the above threads) is: either way about the archive format, and there are alternative tools, such as: - http://freshmeat.net/projects/afio/ + http://freecode.com/projects/afio 2) The cpio archive format chosen by the kernel is simpler and cleaner (and thus easier to create and parse) than any of the (literally dozens of) diff --git a/Documentation/filesystems/relay.txt b/Documentation/filesystems/relay.txt index 094f2d2f38b..33e2f369473 100644 --- a/Documentation/filesystems/relay.txt +++ b/Documentation/filesystems/relay.txt @@ -31,7 +31,7 @@ Semantics Each relay channel has one buffer per CPU, each buffer has one or more sub-buffers. Messages are written to the first sub-buffer until it is -too full to contain a new message, in which case it it is written to +too full to contain a new message, in which case it is written to the next (if available). Messages are never split across sub-buffers. At this point, userspace can be notified so it empties the first sub-buffer, while the kernel continues writing to the next. @@ -294,6 +294,16 @@ user-defined data with a channel, and is immediately available (including in create_buf_file()) via chan->private_data or buf->chan->private_data. +Buffer-only channels +-------------------- + +These channels have no files associated and can be created with +relay_open(NULL, NULL, ...). Such channels are useful in scenarios such +as when doing early tracing in the kernel, before the VFS is up. In these +cases, one may open a buffer-only channel and then call +relay_late_setup_files() when the kernel is ready to handle files, +to expose the buffered data to the userspace. + Channel 'modes' --------------- diff --git a/Documentation/filesystems/romfs.txt b/Documentation/filesystems/romfs.txt index 2d2a7b2a16b..e2b07cc9120 100644 --- a/Documentation/filesystems/romfs.txt +++ b/Documentation/filesystems/romfs.txt @@ -17,8 +17,7 @@ comparison, an actual rescue disk used up 3202 blocks with ext2, while with romfs, it needed 3079 blocks. To create such a file system, you'll need a user program named -genromfs. It is available via anonymous ftp on sunsite.unc.edu and -its mirrors, in the /pub/Linux/system/recovery/ directory. +genromfs. It is available on http://romfs.sourceforge.net/ As the name suggests, romfs could be also used (space-efficiently) on various read-only media, like (E)EPROM disks if someone will have the diff --git a/Documentation/filesystems/seq_file.txt b/Documentation/filesystems/seq_file.txt new file mode 100644 index 00000000000..1fe0ccb1af5 --- /dev/null +++ b/Documentation/filesystems/seq_file.txt @@ -0,0 +1,301 @@ +The seq_file interface + + Copyright 2003 Jonathan Corbet <corbet@lwn.net> + This file is originally from the LWN.net Driver Porting series at + http://lwn.net/Articles/driver-porting/ + + +There are numerous ways for a device driver (or other kernel component) to +provide information to the user or system administrator. One useful +technique is the creation of virtual files, in debugfs, /proc or elsewhere. +Virtual files can provide human-readable output that is easy to get at +without any special utility programs; they can also make life easier for +script writers. It is not surprising that the use of virtual files has +grown over the years. + +Creating those files correctly has always been a bit of a challenge, +however. It is not that hard to make a virtual file which returns a +string. But life gets trickier if the output is long - anything greater +than an application is likely to read in a single operation. Handling +multiple reads (and seeks) requires careful attention to the reader's +position within the virtual file - that position is, likely as not, in the +middle of a line of output. The kernel has traditionally had a number of +implementations that got this wrong. + +The 2.6 kernel contains a set of functions (implemented by Alexander Viro) +which are designed to make it easy for virtual file creators to get it +right. + +The seq_file interface is available via <linux/seq_file.h>. There are +three aspects to seq_file: + + * An iterator interface which lets a virtual file implementation + step through the objects it is presenting. + + * Some utility functions for formatting objects for output without + needing to worry about things like output buffers. + + * A set of canned file_operations which implement most operations on + the virtual file. + +We'll look at the seq_file interface via an extremely simple example: a +loadable module which creates a file called /proc/sequence. The file, when +read, simply produces a set of increasing integer values, one per line. The +sequence will continue until the user loses patience and finds something +better to do. The file is seekable, in that one can do something like the +following: + + dd if=/proc/sequence of=out1 count=1 + dd if=/proc/sequence skip=1 of=out2 count=1 + +Then concatenate the output files out1 and out2 and get the right +result. Yes, it is a thoroughly useless module, but the point is to show +how the mechanism works without getting lost in other details. (Those +wanting to see the full source for this module can find it at +http://lwn.net/Articles/22359/). + +Deprecated create_proc_entry + +Note that the above article uses create_proc_entry which was removed in +kernel 3.10. Current versions require the following update + +- entry = create_proc_entry("sequence", 0, NULL); +- if (entry) +- entry->proc_fops = &ct_file_ops; ++ entry = proc_create("sequence", 0, NULL, &ct_file_ops); + +The iterator interface + +Modules implementing a virtual file with seq_file must implement a simple +iterator object that allows stepping through the data of interest. +Iterators must be able to move to a specific position - like the file they +implement - but the interpretation of that position is up to the iterator +itself. A seq_file implementation that is formatting firewall rules, for +example, could interpret position N as the Nth rule in the chain. +Positioning can thus be done in whatever way makes the most sense for the +generator of the data, which need not be aware of how a position translates +to an offset in the virtual file. The one obvious exception is that a +position of zero should indicate the beginning of the file. + +The /proc/sequence iterator just uses the count of the next number it +will output as its position. + +Four functions must be implemented to make the iterator work. The first, +called start() takes a position as an argument and returns an iterator +which will start reading at that position. For our simple sequence example, +the start() function looks like: + + static void *ct_seq_start(struct seq_file *s, loff_t *pos) + { + loff_t *spos = kmalloc(sizeof(loff_t), GFP_KERNEL); + if (! spos) + return NULL; + *spos = *pos; + return spos; + } + +The entire data structure for this iterator is a single loff_t value +holding the current position. There is no upper bound for the sequence +iterator, but that will not be the case for most other seq_file +implementations; in most cases the start() function should check for a +"past end of file" condition and return NULL if need be. + +For more complicated applications, the private field of the seq_file +structure can be used. There is also a special value which can be returned +by the start() function called SEQ_START_TOKEN; it can be used if you wish +to instruct your show() function (described below) to print a header at the +top of the output. SEQ_START_TOKEN should only be used if the offset is +zero, however. + +The next function to implement is called, amazingly, next(); its job is to +move the iterator forward to the next position in the sequence. The +example module can simply increment the position by one; more useful +modules will do what is needed to step through some data structure. The +next() function returns a new iterator, or NULL if the sequence is +complete. Here's the example version: + + static void *ct_seq_next(struct seq_file *s, void *v, loff_t *pos) + { + loff_t *spos = v; + *pos = ++*spos; + return spos; + } + +The stop() function is called when iteration is complete; its job, of +course, is to clean up. If dynamic memory is allocated for the iterator, +stop() is the place to free it. + + static void ct_seq_stop(struct seq_file *s, void *v) + { + kfree(v); + } + +Finally, the show() function should format the object currently pointed to +by the iterator for output. The example module's show() function is: + + static int ct_seq_show(struct seq_file *s, void *v) + { + loff_t *spos = v; + seq_printf(s, "%lld\n", (long long)*spos); + return 0; + } + +If all is well, the show() function should return zero. A negative error +code in the usual manner indicates that something went wrong; it will be +passed back to user space. This function can also return SEQ_SKIP, which +causes the current item to be skipped; if the show() function has already +generated output before returning SEQ_SKIP, that output will be dropped. + +We will look at seq_printf() in a moment. But first, the definition of the +seq_file iterator is finished by creating a seq_operations structure with +the four functions we have just defined: + + static const struct seq_operations ct_seq_ops = { + .start = ct_seq_start, + .next = ct_seq_next, + .stop = ct_seq_stop, + .show = ct_seq_show + }; + +This structure will be needed to tie our iterator to the /proc file in +a little bit. + +It's worth noting that the iterator value returned by start() and +manipulated by the other functions is considered to be completely opaque by +the seq_file code. It can thus be anything that is useful in stepping +through the data to be output. Counters can be useful, but it could also be +a direct pointer into an array or linked list. Anything goes, as long as +the programmer is aware that things can happen between calls to the +iterator function. However, the seq_file code (by design) will not sleep +between the calls to start() and stop(), so holding a lock during that time +is a reasonable thing to do. The seq_file code will also avoid taking any +other locks while the iterator is active. + + +Formatted output + +The seq_file code manages positioning within the output created by the +iterator and getting it into the user's buffer. But, for that to work, that +output must be passed to the seq_file code. Some utility functions have +been defined which make this task easy. + +Most code will simply use seq_printf(), which works pretty much like +printk(), but which requires the seq_file pointer as an argument. It is +common to ignore the return value from seq_printf(), but a function +producing complicated output may want to check that value and quit if +something non-zero is returned; an error return means that the seq_file +buffer has been filled and further output will be discarded. + +For straight character output, the following functions may be used: + + int seq_putc(struct seq_file *m, char c); + int seq_puts(struct seq_file *m, const char *s); + int seq_escape(struct seq_file *m, const char *s, const char *esc); + +The first two output a single character and a string, just like one would +expect. seq_escape() is like seq_puts(), except that any character in s +which is in the string esc will be represented in octal form in the output. + +There is also a pair of functions for printing filenames: + + int seq_path(struct seq_file *m, struct path *path, char *esc); + int seq_path_root(struct seq_file *m, struct path *path, + struct path *root, char *esc) + +Here, path indicates the file of interest, and esc is a set of characters +which should be escaped in the output. A call to seq_path() will output +the path relative to the current process's filesystem root. If a different +root is desired, it can be used with seq_path_root(). Note that, if it +turns out that path cannot be reached from root, the value of root will be +changed in seq_file_root() to a root which *does* work. + + +Making it all work + +So far, we have a nice set of functions which can produce output within the +seq_file system, but we have not yet turned them into a file that a user +can see. Creating a file within the kernel requires, of course, the +creation of a set of file_operations which implement the operations on that +file. The seq_file interface provides a set of canned operations which do +most of the work. The virtual file author still must implement the open() +method, however, to hook everything up. The open function is often a single +line, as in the example module: + + static int ct_open(struct inode *inode, struct file *file) + { + return seq_open(file, &ct_seq_ops); + } + +Here, the call to seq_open() takes the seq_operations structure we created +before, and gets set up to iterate through the virtual file. + +On a successful open, seq_open() stores the struct seq_file pointer in +file->private_data. If you have an application where the same iterator can +be used for more than one file, you can store an arbitrary pointer in the +private field of the seq_file structure; that value can then be retrieved +by the iterator functions. + +The other operations of interest - read(), llseek(), and release() - are +all implemented by the seq_file code itself. So a virtual file's +file_operations structure will look like: + + static const struct file_operations ct_file_ops = { + .owner = THIS_MODULE, + .open = ct_open, + .read = seq_read, + .llseek = seq_lseek, + .release = seq_release + }; + +There is also a seq_release_private() which passes the contents of the +seq_file private field to kfree() before releasing the structure. + +The final step is the creation of the /proc file itself. In the example +code, that is done in the initialization code in the usual way: + + static int ct_init(void) + { + struct proc_dir_entry *entry; + + proc_create("sequence", 0, NULL, &ct_file_ops); + return 0; + } + + module_init(ct_init); + +And that is pretty much it. + + +seq_list + +If your file will be iterating through a linked list, you may find these +routines useful: + + struct list_head *seq_list_start(struct list_head *head, + loff_t pos); + struct list_head *seq_list_start_head(struct list_head *head, + loff_t pos); + struct list_head *seq_list_next(void *v, struct list_head *head, + loff_t *ppos); + +These helpers will interpret pos as a position within the list and iterate +accordingly. Your start() and next() functions need only invoke the +seq_list_* helpers with a pointer to the appropriate list_head structure. + + +The extra-simple version + +For extremely simple virtual files, there is an even easier interface. A +module can define only the show() function, which should create all the +output that the virtual file will contain. The file's open() method then +calls: + + int single_open(struct file *file, + int (*show)(struct seq_file *m, void *p), + void *data); + +When output time comes, the show() function will be called once. The data +value given to single_open() can be found in the private field of the +seq_file structure. When using single_open(), the programmer should use +single_release() instead of seq_release() in the file_operations structure +to avoid a memory leak. diff --git a/Documentation/filesystems/sharedsubtree.txt b/Documentation/filesystems/sharedsubtree.txt index 736540045dc..32a173dd315 100644 --- a/Documentation/filesystems/sharedsubtree.txt +++ b/Documentation/filesystems/sharedsubtree.txt @@ -4,7 +4,7 @@ Shared Subtrees Contents: 1) Overview 2) Features - 3) smount command + 3) Setting mount states 4) Use-case 5) Detailed semantics 6) Quiz @@ -41,14 +41,14 @@ replicas continue to be exactly same. Here is an example: - Lets say /mnt has a mount that is shared. + Let's say /mnt has a mount that is shared. mount --make-shared /mnt - note: mount command does not yet support the --make-shared flag. - I have included a small C program which does the same by executing - 'smount /mnt shared' + Note: mount(8) command now supports the --make-shared flag, + so the sample 'smount' program is no longer needed and has been + removed. - #mount --bind /mnt /tmp + # mount --bind /mnt /tmp The above command replicates the mount at /mnt to the mountpoint /tmp and the contents of both the mounts remain identical. @@ -58,14 +58,14 @@ replicas continue to be exactly same. #ls /tmp a b c - Now lets say we mount a device at /tmp/a - #mount /dev/sd0 /tmp/a + Now let's say we mount a device at /tmp/a + # mount /dev/sd0 /tmp/a #ls /tmp/a - t1 t2 t2 + t1 t2 t3 #ls /mnt/a - t1 t2 t2 + t1 t2 t3 Note that the mount has propagated to the mount at /mnt as well. @@ -80,21 +80,20 @@ replicas continue to be exactly same. Here is an example: - Lets say /mnt has a mount which is shared. - #mount --make-shared /mnt + Let's say /mnt has a mount which is shared. + # mount --make-shared /mnt - Lets bind mount /mnt to /tmp - #mount --bind /mnt /tmp + Let's bind mount /mnt to /tmp + # mount --bind /mnt /tmp the new mount at /tmp becomes a shared mount and it is a replica of the mount at /mnt. - Now lets make the mount at /tmp; a slave of /mnt - #mount --make-slave /tmp - [or smount /tmp slave] + Now let's make the mount at /tmp; a slave of /mnt + # mount --make-slave /tmp - lets mount /dev/sd0 on /mnt/a - #mount /dev/sd0 /mnt/a + let's mount /dev/sd0 on /mnt/a + # mount /dev/sd0 /mnt/a #ls /mnt/a t1 t2 t3 @@ -104,9 +103,9 @@ replicas continue to be exactly same. Note the mount event has propagated to the mount at /tmp - However lets see what happens if we mount something on the mount at /tmp + However let's see what happens if we mount something on the mount at /tmp - #mount /dev/sd1 /tmp/b + # mount /dev/sd1 /tmp/b #ls /tmp/b s1 s2 s3 @@ -124,12 +123,11 @@ replicas continue to be exactly same. 2d) A unbindable mount is a unbindable private mount - lets say we have a mount at /mnt and we make is unbindable + let's say we have a mount at /mnt and we make is unbindable - #mount --make-unbindable /mnt - [ smount /mnt unbindable ] + # mount --make-unbindable /mnt - Lets try to bind mount this mount somewhere else. + Let's try to bind mount this mount somewhere else. # mount --bind /mnt /tmp mount: wrong fs type, bad option, bad superblock on /mnt, or too many mounted file systems @@ -137,149 +135,15 @@ replicas continue to be exactly same. Binding a unbindable mount is a invalid operation. -3) smount command - - Currently the mount command is not aware of shared subtree features. - Work is in progress to add the support in mount ( util-linux package ). - Till then use the following program. - - ------------------------------------------------------------------------ - // - //this code was developed my Miklos Szeredi <miklos@szeredi.hu> - //and modified by Ram Pai <linuxram@us.ibm.com> - // sample usage: - // smount /tmp shared - // - #include <stdio.h> - #include <stdlib.h> - #include <unistd.h> - #include <string.h> - #include <sys/mount.h> - #include <sys/fsuid.h> - - #ifndef MS_REC - #define MS_REC 0x4000 /* 16384: Recursive loopback */ - #endif - - #ifndef MS_SHARED - #define MS_SHARED 1<<20 /* Shared */ - #endif - - #ifndef MS_PRIVATE - #define MS_PRIVATE 1<<18 /* Private */ - #endif - - #ifndef MS_SLAVE - #define MS_SLAVE 1<<19 /* Slave */ - #endif - - #ifndef MS_UNBINDABLE - #define MS_UNBINDABLE 1<<17 /* Unbindable */ - #endif - - int main(int argc, char *argv[]) - { - int type; - if(argc != 3) { - fprintf(stderr, "usage: %s dir " - "<rshared|rslave|rprivate|runbindable|shared|slave" - "|private|unbindable>\n" , argv[0]); - return 1; - } - - fprintf(stdout, "%s %s %s\n", argv[0], argv[1], argv[2]); - - if (strcmp(argv[2],"rshared")==0) - type=(MS_SHARED|MS_REC); - else if (strcmp(argv[2],"rslave")==0) - type=(MS_SLAVE|MS_REC); - else if (strcmp(argv[2],"rprivate")==0) - type=(MS_PRIVATE|MS_REC); - else if (strcmp(argv[2],"runbindable")==0) - type=(MS_UNBINDABLE|MS_REC); - else if (strcmp(argv[2],"shared")==0) - type=MS_SHARED; - else if (strcmp(argv[2],"slave")==0) - type=MS_SLAVE; - else if (strcmp(argv[2],"private")==0) - type=MS_PRIVATE; - else if (strcmp(argv[2],"unbindable")==0) - type=MS_UNBINDABLE; - else { - fprintf(stderr, "invalid operation: %s\n", argv[2]); - return 1; - } - setfsuid(getuid()); - - if(mount("", argv[1], "dontcare", type, "") == -1) { - perror("mount"); - return 1; - } - return 0; - } - ----------------------------------------------------------------------- - - Copy the above code snippet into smount.c - gcc -o smount smount.c - - - (i) To mark all the mounts under /mnt as shared execute the following - command: - - smount /mnt rshared - the corresponding syntax planned for mount command is - mount --make-rshared /mnt - - just to mark a mount /mnt as shared, execute the following - command: - smount /mnt shared - the corresponding syntax planned for mount command is - mount --make-shared /mnt - - (ii) To mark all the shared mounts under /mnt as slave execute the - following - - command: - smount /mnt rslave - the corresponding syntax planned for mount command is - mount --make-rslave /mnt - - just to mark a mount /mnt as slave, execute the following - command: - smount /mnt slave - the corresponding syntax planned for mount command is - mount --make-slave /mnt - - (iii) To mark all the mounts under /mnt as private execute the - following command: - - smount /mnt rprivate - the corresponding syntax planned for mount command is - mount --make-rprivate /mnt - - just to mark a mount /mnt as private, execute the following - command: - smount /mnt private - the corresponding syntax planned for mount command is - mount --make-private /mnt - - NOTE: by default all the mounts are created as private. But if - you want to change some shared/slave/unbindable mount as - private at a later point in time, this command can help. +3) Setting mount states - (iv) To mark all the mounts under /mnt as unbindable execute the - following + The mount command (util-linux package) can be used to set mount + states: - command: - smount /mnt runbindable - the corresponding syntax planned for mount command is - mount --make-runbindable /mnt - - just to mark a mount /mnt as unbindable, execute the following - command: - smount /mnt unbindable - the corresponding syntax planned for mount command is - mount --make-unbindable /mnt + mount --make-shared mountpoint + mount --make-slave mountpoint + mount --make-private mountpoint + mount --make-unbindable mountpoint 4) Use cases @@ -350,7 +214,7 @@ replicas continue to be exactly same. mount --rbind / /view/v3 mount --rbind / /view/v4 - and if /usr has a versioning filesystem mounted, than that + and if /usr has a versioning filesystem mounted, then that mount appears at /view/v1/usr, /view/v2/usr, /view/v3/usr and /view/v4/usr too @@ -390,7 +254,7 @@ replicas continue to be exactly same. For example: mount --make-shared /mnt - mount --bin /mnt /tmp + mount --bind /mnt /tmp The mount at /mnt and that at /tmp are both shared and belong to the same peer group. Anything mounted or unmounted under @@ -558,7 +422,7 @@ replicas continue to be exactly same. then the subtree under the unbindable mount is pruned in the new location. - eg: lets say we have the following mount tree. + eg: let's say we have the following mount tree. A / \ @@ -566,7 +430,7 @@ replicas continue to be exactly same. / \ / \ D E F G - Lets say all the mount except the mount C in the tree are + Let's say all the mount except the mount C in the tree are of a type other than unbindable. If this tree is rbound to say Z @@ -683,13 +547,13 @@ replicas continue to be exactly same. 'b' on mounts that receive propagation from mount 'B' and does not have sub-mounts within them are unmounted. - Example: Lets say 'B1', 'B2', 'B3' are shared mounts that propagate to + Example: Let's say 'B1', 'B2', 'B3' are shared mounts that propagate to each other. - lets say 'A1', 'A2', 'A3' are first mounted at dentry 'b' on mount + let's say 'A1', 'A2', 'A3' are first mounted at dentry 'b' on mount 'B1', 'B2' and 'B3' respectively. - lets say 'C1', 'C2', 'C3' are next mounted at the same dentry 'b' on + let's say 'C1', 'C2', 'C3' are next mounted at the same dentry 'b' on mount 'B1', 'B2' and 'B3' respectively. if 'C1' is unmounted, all the mounts that are most-recently-mounted on @@ -710,7 +574,7 @@ replicas continue to be exactly same. A cloned namespace contains all the mounts as that of the parent namespace. - Lets say 'A' and 'B' are the corresponding mounts in the parent and the + Let's say 'A' and 'B' are the corresponding mounts in the parent and the child namespace. If 'A' is shared, then 'B' is also shared and 'A' and 'B' propagate to @@ -759,11 +623,11 @@ replicas continue to be exactly same. mount --make-slave /mnt At this point we have the first mount at /tmp and - its root dentry is 1. Lets call this mount 'A' + its root dentry is 1. Let's call this mount 'A' And then we have a second mount at /tmp1 with root - dentry 2. Lets call this mount 'B' + dentry 2. Let's call this mount 'B' Next we have a third mount at /mnt with root dentry - mnt. Lets call this mount 'C' + mnt. Let's call this mount 'C' 'B' is the slave of 'A' and 'C' is a slave of 'B' A -> B -> C @@ -794,7 +658,7 @@ replicas continue to be exactly same. Q3 Why is unbindable mount needed? - Lets say we want to replicate the mount tree at multiple + Let's say we want to replicate the mount tree at multiple locations within the same subtree. if one rbind mounts a tree within the same subtree 'n' times @@ -803,7 +667,7 @@ replicas continue to be exactly same. mounts. Here is a example. step 1: - lets say the root tree has just two directories with + let's say the root tree has just two directories with one vfsmount. root / \ @@ -863,7 +727,7 @@ replicas continue to be exactly same. mkdir -p /tmp/m3 mount --rbind /root /tmp/m3 - I wont' draw the tree..but it has 24 vfsmounts + I won't draw the tree..but it has 24 vfsmounts at step i the number of vfsmounts is V[i] = i*V[i-1]. @@ -875,7 +739,7 @@ replicas continue to be exactly same. Unclonable mounts come in handy here. step 1: - lets say the root tree has just two directories with + let's say the root tree has just two directories with one vfsmount. root / \ @@ -973,6 +837,9 @@ replicas continue to be exactly same. individual lists does not affect propagation or the way propagation tree is modified by operations. + All vfsmounts in a peer group have the same ->mnt_master. If it is + non-NULL, they form a contiguous (ordered) segment of slave list. + A example propagation tree looks as shown in the figure below. [ NOTE: Though it looks like a forest, if we consider all the shared mounts as a conceptual entity called 'pnode', it becomes a tree] @@ -1010,8 +877,19 @@ replicas continue to be exactly same. NOTE: The propagation tree is orthogonal to the mount tree. +8B Locking: + + ->mnt_share, ->mnt_slave, ->mnt_slave_list, ->mnt_master are protected + by namespace_sem (exclusive for modifications, shared for reading). + + Normally we have ->mnt_flags modifications serialized by vfsmount_lock. + There are two exceptions: do_add_mount() and clone_mnt(). + The former modifies a vfsmount that has not been visible in any shared + data structures yet. + The latter holds namespace_sem and the only references to vfsmount + are in lists that can't be traversed without namespace_sem. -8B Algorithm: +8C Algorithm: The crux of the implementation resides in rbind/move operation. diff --git a/Documentation/filesystems/smbfs.txt b/Documentation/filesystems/smbfs.txt deleted file mode 100644 index f673ef0de0f..00000000000 --- a/Documentation/filesystems/smbfs.txt +++ /dev/null @@ -1,8 +0,0 @@ -Smbfs is a filesystem that implements the SMB protocol, which is the -protocol used by Windows for Workgroups, Windows 95 and Windows NT. -Smbfs was inspired by Samba, the program written by Andrew Tridgell -that turns any Unix host into a file server for DOS or Windows clients. - -Smbfs is a SMB client, but uses parts of samba for it's operation. For -more info on samba, including documentation, please go to -http://www.samba.org/ and then on to your nearest mirror. diff --git a/Documentation/filesystems/squashfs.txt b/Documentation/filesystems/squashfs.txt new file mode 100644 index 00000000000..403c090aca3 --- /dev/null +++ b/Documentation/filesystems/squashfs.txt @@ -0,0 +1,259 @@ +SQUASHFS 4.0 FILESYSTEM +======================= + +Squashfs is a compressed read-only filesystem for Linux. +It uses zlib/lzo/xz compression to compress files, inodes and directories. +Inodes in the system are very small and all blocks are packed to minimise +data overhead. Block sizes greater than 4K are supported up to a maximum +of 1Mbytes (default block size 128K). + +Squashfs is intended for general read-only filesystem use, for archival +use (i.e. in cases where a .tar.gz file may be used), and in constrained +block device/memory systems (e.g. embedded systems) where low overhead is +needed. + +Mailing list: squashfs-devel@lists.sourceforge.net +Web site: www.squashfs.org + +1. FILESYSTEM FEATURES +---------------------- + +Squashfs filesystem features versus Cramfs: + + Squashfs Cramfs + +Max filesystem size: 2^64 256 MiB +Max file size: ~ 2 TiB 16 MiB +Max files: unlimited unlimited +Max directories: unlimited unlimited +Max entries per directory: unlimited unlimited +Max block size: 1 MiB 4 KiB +Metadata compression: yes no +Directory indexes: yes no +Sparse file support: yes no +Tail-end packing (fragments): yes no +Exportable (NFS etc.): yes no +Hard link support: yes no +"." and ".." in readdir: yes no +Real inode numbers: yes no +32-bit uids/gids: yes no +File creation time: yes no +Xattr support: yes no +ACL support: no no + +Squashfs compresses data, inodes and directories. In addition, inode and +directory data are highly compacted, and packed on byte boundaries. Each +compressed inode is on average 8 bytes in length (the exact length varies on +file type, i.e. regular file, directory, symbolic link, and block/char device +inodes have different sizes). + +2. USING SQUASHFS +----------------- + +As squashfs is a read-only filesystem, the mksquashfs program must be used to +create populated squashfs filesystems. This and other squashfs utilities +can be obtained from http://www.squashfs.org. Usage instructions can be +obtained from this site also. + +The squashfs-tools development tree is now located on kernel.org + git://git.kernel.org/pub/scm/fs/squashfs/squashfs-tools.git + +3. SQUASHFS FILESYSTEM DESIGN +----------------------------- + +A squashfs filesystem consists of a maximum of nine parts, packed together on a +byte alignment: + + --------------- + | superblock | + |---------------| + | compression | + | options | + |---------------| + | datablocks | + | & fragments | + |---------------| + | inode table | + |---------------| + | directory | + | table | + |---------------| + | fragment | + | table | + |---------------| + | export | + | table | + |---------------| + | uid/gid | + | lookup table | + |---------------| + | xattr | + | table | + --------------- + +Compressed data blocks are written to the filesystem as files are read from +the source directory, and checked for duplicates. Once all file data has been +written the completed inode, directory, fragment, export, uid/gid lookup and +xattr tables are written. + +3.1 Compression options +----------------------- + +Compressors can optionally support compression specific options (e.g. +dictionary size). If non-default compression options have been used, then +these are stored here. + +3.2 Inodes +---------- + +Metadata (inodes and directories) are compressed in 8Kbyte blocks. Each +compressed block is prefixed by a two byte length, the top bit is set if the +block is uncompressed. A block will be uncompressed if the -noI option is set, +or if the compressed block was larger than the uncompressed block. + +Inodes are packed into the metadata blocks, and are not aligned to block +boundaries, therefore inodes overlap compressed blocks. Inodes are identified +by a 48-bit number which encodes the location of the compressed metadata block +containing the inode, and the byte offset into that block where the inode is +placed (<block, offset>). + +To maximise compression there are different inodes for each file type +(regular file, directory, device, etc.), the inode contents and length +varying with the type. + +To further maximise compression, two types of regular file inode and +directory inode are defined: inodes optimised for frequently occurring +regular files and directories, and extended types where extra +information has to be stored. + +3.3 Directories +--------------- + +Like inodes, directories are packed into compressed metadata blocks, stored +in a directory table. Directories are accessed using the start address of +the metablock containing the directory and the offset into the +decompressed block (<block, offset>). + +Directories are organised in a slightly complex way, and are not simply +a list of file names. The organisation takes advantage of the +fact that (in most cases) the inodes of the files will be in the same +compressed metadata block, and therefore, can share the start block. +Directories are therefore organised in a two level list, a directory +header containing the shared start block value, and a sequence of directory +entries, each of which share the shared start block. A new directory header +is written once/if the inode start block changes. The directory +header/directory entry list is repeated as many times as necessary. + +Directories are sorted, and can contain a directory index to speed up +file lookup. Directory indexes store one entry per metablock, each entry +storing the index/filename mapping to the first directory header +in each metadata block. Directories are sorted in alphabetical order, +and at lookup the index is scanned linearly looking for the first filename +alphabetically larger than the filename being looked up. At this point the +location of the metadata block the filename is in has been found. +The general idea of the index is to ensure only one metadata block needs to be +decompressed to do a lookup irrespective of the length of the directory. +This scheme has the advantage that it doesn't require extra memory overhead +and doesn't require much extra storage on disk. + +3.4 File data +------------- + +Regular files consist of a sequence of contiguous compressed blocks, and/or a +compressed fragment block (tail-end packed block). The compressed size +of each datablock is stored in a block list contained within the +file inode. + +To speed up access to datablocks when reading 'large' files (256 Mbytes or +larger), the code implements an index cache that caches the mapping from +block index to datablock location on disk. + +The index cache allows Squashfs to handle large files (up to 1.75 TiB) while +retaining a simple and space-efficient block list on disk. The cache +is split into slots, caching up to eight 224 GiB files (128 KiB blocks). +Larger files use multiple slots, with 1.75 TiB files using all 8 slots. +The index cache is designed to be memory efficient, and by default uses +16 KiB. + +3.5 Fragment lookup table +------------------------- + +Regular files can contain a fragment index which is mapped to a fragment +location on disk and compressed size using a fragment lookup table. This +fragment lookup table is itself stored compressed into metadata blocks. +A second index table is used to locate these. This second index table for +speed of access (and because it is small) is read at mount time and cached +in memory. + +3.6 Uid/gid lookup table +------------------------ + +For space efficiency regular files store uid and gid indexes, which are +converted to 32-bit uids/gids using an id look up table. This table is +stored compressed into metadata blocks. A second index table is used to +locate these. This second index table for speed of access (and because it +is small) is read at mount time and cached in memory. + +3.7 Export table +---------------- + +To enable Squashfs filesystems to be exportable (via NFS etc.) filesystems +can optionally (disabled with the -no-exports Mksquashfs option) contain +an inode number to inode disk location lookup table. This is required to +enable Squashfs to map inode numbers passed in filehandles to the inode +location on disk, which is necessary when the export code reinstantiates +expired/flushed inodes. + +This table is stored compressed into metadata blocks. A second index table is +used to locate these. This second index table for speed of access (and because +it is small) is read at mount time and cached in memory. + +3.8 Xattr table +--------------- + +The xattr table contains extended attributes for each inode. The xattrs +for each inode are stored in a list, each list entry containing a type, +name and value field. The type field encodes the xattr prefix +("user.", "trusted." etc) and it also encodes how the name/value fields +should be interpreted. Currently the type indicates whether the value +is stored inline (in which case the value field contains the xattr value), +or if it is stored out of line (in which case the value field stores a +reference to where the actual value is stored). This allows large values +to be stored out of line improving scanning and lookup performance and it +also allows values to be de-duplicated, the value being stored once, and +all other occurrences holding an out of line reference to that value. + +The xattr lists are packed into compressed 8K metadata blocks. +To reduce overhead in inodes, rather than storing the on-disk +location of the xattr list inside each inode, a 32-bit xattr id +is stored. This xattr id is mapped into the location of the xattr +list using a second xattr id lookup table. + +4. TODOS AND OUTSTANDING ISSUES +------------------------------- + +4.1 Todo list +------------- + +Implement ACL support. + +4.2 Squashfs internal cache +--------------------------- + +Blocks in Squashfs are compressed. To avoid repeatedly decompressing +recently accessed data Squashfs uses two small metadata and fragment caches. + +The cache is not used for file datablocks, these are decompressed and cached in +the page-cache in the normal way. The cache is used to temporarily cache +fragment and metadata blocks which have been read as a result of a metadata +(i.e. inode or directory) or fragment access. Because metadata and fragments +are packed together into blocks (to gain greater compression) the read of a +particular piece of metadata or fragment will retrieve other metadata/fragments +which have been packed with it, these because of locality-of-reference may be +read in the near future. Temporarily caching them ensures they are available +for near future access without requiring an additional read and decompress. + +In the future this internal cache may be replaced with an implementation which +uses the kernel page cache. Because the page cache operates on page sized +units this may introduce additional complexity in terms of locking and +associated race conditions. diff --git a/Documentation/filesystems/sysfs-pci.txt b/Documentation/filesystems/sysfs-pci.txt index 5daa2aaec2c..74eaac26f8b 100644 --- a/Documentation/filesystems/sysfs-pci.txt +++ b/Documentation/filesystems/sysfs-pci.txt @@ -9,8 +9,10 @@ that support it. For example, a given bus might look like this: | |-- class | |-- config | |-- device + | |-- enable | |-- irq | |-- local_cpus + | |-- remove | |-- resource | |-- resource0 | |-- resource1 @@ -32,10 +34,13 @@ files, each with their own function. class PCI class (ascii, ro) config PCI config space (binary, rw) device PCI device (ascii, ro) + enable Whether the device is enabled (ascii, rw) irq IRQ number (ascii, ro) local_cpus nearby CPU mask (cpumask, ro) + remove remove device from kernel's list (ascii, wo) resource PCI resource host addresses (ascii, ro) - resource0..N PCI resource N, if present (binary, mmap) + resource0..N PCI resource N, if present (binary, mmap, rw[1]) + resource0_wc..N_wc PCI WC map resource N, if prefetchable (binary, mmap) rom PCI ROM resource, if present (binary, ro) subsystem_device PCI subsystem device (ascii, ro) subsystem_vendor PCI subsystem vendor (ascii, ro) @@ -43,23 +48,43 @@ files, each with their own function. ro - read only file rw - file is readable and writable + wo - write only file mmap - file is mmapable ascii - file contains ascii text binary - file contains binary data cpumask - file contains a cpumask type +[1] rw for RESOURCE_IO (I/O port) regions only + The read only files are informational, writes to them will be ignored, with the exception of the 'rom' file. Writable files can be used to perform actions on the device (e.g. changing config space, detaching a device). mmapable files are available via an mmap of the file at offset 0 and can be used to do actual device programming from userspace. Note that some platforms don't support mmapping of certain resources, so be sure to check the return -value from any attempted mmap. +value from any attempted mmap. The most notable of these are I/O port +resources, which also provide read/write access. + +The 'enable' file provides a counter that indicates how many times the device +has been enabled. If the 'enable' file currently returns '4', and a '1' is +echoed into it, it will then return '5'. Echoing a '0' into it will decrease +the count. Even when it returns to 0, though, some of the initialisation +may not be reversed. The 'rom' file is special in that it provides read-only access to the device's ROM file, if available. It's disabled by default, however, so applications should write the string "1" to the file to enable it before attempting a read -call, and disable it following the access by writing "0" to the file. +call, and disable it following the access by writing "0" to the file. Note +that the device must be enabled for a rom read to return data successfully. +In the event a driver is not bound to the device, it can be enabled using the +'enable' file, documented above. + +The 'remove' file is used to remove the PCI device, by writing a non-zero +integer to the file. This does not involve any kind of hot-plug functionality, +e.g. powering off the device. The device is removed from the kernel's list of +PCI devices, the sysfs directory for it is removed, and the device will be +removed from any drivers attached to it. Removal of PCI root buses is +disallowed. Accessing legacy resources through sysfs ---------------------------------------- diff --git a/Documentation/filesystems/sysfs-tagging.txt b/Documentation/filesystems/sysfs-tagging.txt new file mode 100644 index 00000000000..eb843e49c5a --- /dev/null +++ b/Documentation/filesystems/sysfs-tagging.txt @@ -0,0 +1,42 @@ +Sysfs tagging +------------- + +(Taken almost verbatim from Eric Biederman's netns tagging patch +commit msg) + +The problem. Network devices show up in sysfs and with the network +namespace active multiple devices with the same name can show up in +the same directory, ouch! + +To avoid that problem and allow existing applications in network +namespaces to see the same interface that is currently presented in +sysfs, sysfs now has tagging directory support. + +By using the network namespace pointers as tags to separate out the +the sysfs directory entries we ensure that we don't have conflicts +in the directories and applications only see a limited set of +the network devices. + +Each sysfs directory entry may be tagged with zero or one +namespaces. A sysfs_dirent is augmented with a void *s_ns. If a +directory entry is tagged, then sysfs_dirent->s_flags will have a +flag between KOBJ_NS_TYPE_NONE and KOBJ_NS_TYPES, and s_ns will +point to the namespace to which it belongs. + +Each sysfs superblock's sysfs_super_info contains an array void +*ns[KOBJ_NS_TYPES]. When a task in a tagging namespace +kobj_nstype first mounts sysfs, a new superblock is created. It +will be differentiated from other sysfs mounts by having its +s_fs_info->ns[kobj_nstype] set to the new namespace. Note that +through bind mounting and mounts propagation, a task can easily view +the contents of other namespaces' sysfs mounts. Therefore, when a +namespace exits, it will call kobj_ns_exit() to invalidate any +sysfs_dirent->s_ns pointers pointing to it. + +Users of this interface: +- define a type in the kobj_ns_type enumeration. +- call kobj_ns_type_register() with its kobj_ns_type_operations which has + - current_ns() which returns current's namespace + - netlink_ns() which returns a socket's namespace + - initial_ns() which returns the initial namesapce +- call kobj_ns_exit() when an individual tag is no longer valid diff --git a/Documentation/filesystems/sysfs.txt b/Documentation/filesystems/sysfs.txt index 4598ef7b622..b35a64b82f9 100644 --- a/Documentation/filesystems/sysfs.txt +++ b/Documentation/filesystems/sysfs.txt @@ -2,8 +2,10 @@ sysfs - _The_ filesystem for exporting kernel objects. Patrick Mochel <mochel@osdl.org> +Mike Murphy <mamurph@cs.clemson.edu> -10 January 2003 +Revised: 16 August 2011 +Original: 10 January 2003 What it is: @@ -21,7 +23,8 @@ interface. Using sysfs ~~~~~~~~~~~ -sysfs is always compiled in. You can access it by doing: +sysfs is always compiled in if CONFIG_SYSFS is defined. You can access +it by doing: mount -t sysfs sysfs /sys @@ -36,10 +39,12 @@ userspace. Top-level directories in sysfs represent the common ancestors of object hierarchies; i.e. the subsystems the objects belong to. -Sysfs internally stores the kobject that owns the directory in the -->d_fsdata pointer of the directory's dentry. This allows sysfs to do -reference counting directly on the kobject when the file is opened and -closed. +Sysfs internally stores a pointer to the kobject that implements a +directory in the sysfs_dirent object associated with the directory. In +the past this kobject pointer has been used by sysfs to do reference +counting directly on the kobject whenever the file is opened or closed. +With the current sysfs implementation the kobject reference count is +only modified directly by the function sysfs_schedule_callback(). Attributes @@ -57,19 +62,20 @@ values of the same type. Mixing types, expressing multiple lines of data, and doing fancy formatting of data is heavily frowned upon. Doing these things may get -you publically humiliated and your code rewritten without notice. +you publicly humiliated and your code rewritten without notice. An attribute definition is simply: struct attribute { char * name; - mode_t mode; + struct module *owner; + umode_t mode; }; -int sysfs_create_file(struct kobject * kobj, struct attribute * attr); -void sysfs_remove_file(struct kobject * kobj, struct attribute * attr); +int sysfs_create_file(struct kobject * kobj, const struct attribute * attr); +void sysfs_remove_file(struct kobject * kobj, const struct attribute * attr); A bare attribute contains no means to read or write the value of the @@ -80,22 +86,20 @@ a specific object type. For example, the driver model defines struct device_attribute like: struct device_attribute { - struct attribute attr; - ssize_t (*show)(struct device * dev, char * buf); - ssize_t (*store)(struct device * dev, const char * buf); + struct attribute attr; + ssize_t (*show)(struct device *dev, struct device_attribute *attr, + char *buf); + ssize_t (*store)(struct device *dev, struct device_attribute *attr, + const char *buf, size_t count); }; -int device_create_file(struct device *, struct device_attribute *); -void device_remove_file(struct device *, struct device_attribute *); +int device_create_file(struct device *, const struct device_attribute *); +void device_remove_file(struct device *, const struct device_attribute *); It also defines this helper for defining device attributes: -#define DEVICE_ATTR(_name, _mode, _show, _store) \ -struct device_attribute dev_attr_##_name = { \ - .attr = {.name = __stringify(_name) , .mode = _mode }, \ - .show = _show, \ - .store = _store, \ -}; +#define DEVICE_ATTR(_name, _mode, _show, _store) \ +struct device_attribute dev_attr_##_name = __ATTR(_name, _mode, _show, _store) For example, declaring @@ -104,7 +108,7 @@ static DEVICE_ATTR(foo, S_IWUSR | S_IRUGO, show_foo, store_foo); is equivalent to doing: static struct device_attribute dev_attr_foo = { - .attr = { + .attr = { .name = "foo", .mode = S_IWUSR | S_IRUGO, }, @@ -122,7 +126,7 @@ show and store methods of the attribute owners. struct sysfs_ops { ssize_t (*show)(struct kobject *, struct attribute *, char *); - ssize_t (*store)(struct kobject *, struct attribute *, const char *); + ssize_t (*store)(struct kobject *, struct attribute *, const char *, size_t); }; [ Subsystems should have already defined a struct kobj_type as a @@ -137,18 +141,22 @@ calls the associated methods. To illustrate: +#define to_dev(obj) container_of(obj, struct device, kobj) #define to_dev_attr(_attr) container_of(_attr, struct device_attribute, attr) -#define to_dev(d) container_of(d, struct device, kobj) -static ssize_t -dev_attr_show(struct kobject * kobj, struct attribute * attr, char * buf) +static ssize_t dev_attr_show(struct kobject *kobj, struct attribute *attr, + char *buf) { - struct device_attribute * dev_attr = to_dev_attr(attr); - struct device * dev = to_dev(kobj); - ssize_t ret = 0; + struct device_attribute *dev_attr = to_dev_attr(attr); + struct device *dev = to_dev(kobj); + ssize_t ret = -EIO; if (dev_attr->show) - ret = dev_attr->show(dev, buf); + ret = dev_attr->show(dev, dev_attr, buf); + if (ret >= (ssize_t)PAGE_SIZE) { + print_symbol("dev_attr_show: %s returned bad count\n", + (unsigned long)dev_attr->show); + } return ret; } @@ -161,10 +169,11 @@ To read or write attributes, show() or store() methods must be specified when declaring the attribute. The method types should be as simple as those defined for device attributes: - ssize_t (*show)(struct device * dev, char * buf); - ssize_t (*store)(struct device * dev, const char * buf); +ssize_t (*show)(struct device *dev, struct device_attribute *attr, char *buf); +ssize_t (*store)(struct device *dev, struct device_attribute *attr, + const char *buf, size_t count); -IOW, they should take only an object and a buffer as parameters. +IOW, they should take only an object, an attribute, and a buffer as parameters. sysfs allocates a buffer of size (PAGE_SIZE) and passes it to the @@ -176,8 +185,10 @@ implementations: Recall that an attribute should only be exporting one value, or an array of similar values, so this shouldn't be that expensive. - This allows userspace to do partial reads and seeks arbitrarily over - the entire file at will. + This allows userspace to do partial reads and forward seeks + arbitrarily over the entire file at will. If userspace seeks back to + zero or does a pread(2) with an offset of '0' the show() method will + be called again, rearmed, to fill the buffer. - On write(2), sysfs expects the entire buffer to be passed during the first write. Sysfs then passes the entire buffer to the store() @@ -192,16 +203,19 @@ implementations: Other notes: +- Writing causes the show() method to be rearmed regardless of current + file position. + - The buffer will always be PAGE_SIZE bytes in length. On i386, this is 4096. - show() methods should return the number of bytes printed into the - buffer. This is the return value of snprintf(). + buffer. This is the return value of scnprintf(). -- show() should always use snprintf(). +- show() should always use scnprintf(). -- store() should return the number of bytes used from the buffer. This - can be done using strlen(). +- store() should return the number of bytes used from the buffer. If the + entire buffer has been used, just return the count argument. - show() or store() can always return errors. If a bad value comes through, be sure to return an error. @@ -214,15 +228,18 @@ Other notes: A very simple (and naive) implementation of a device attribute is: -static ssize_t show_name(struct device *dev, struct device_attribute *attr, char *buf) +static ssize_t show_name(struct device *dev, struct device_attribute *attr, + char *buf) { - return snprintf(buf, PAGE_SIZE, "%s\n", dev->name); + return scnprintf(buf, PAGE_SIZE, "%s\n", dev->name); } -static ssize_t store_name(struct device * dev, const char * buf) +static ssize_t store_name(struct device *dev, struct device_attribute *attr, + const char *buf, size_t count) { - sscanf(buf, "%20s", dev->name); - return strnlen(buf, PAGE_SIZE); + snprintf(dev->name, sizeof(dev->name), "%.*s", + (int)min(count, sizeof(dev->name) - 1), buf); + return count; } static DEVICE_ATTR(name, S_IRUGO, show_name, store_name); @@ -243,6 +260,7 @@ The top level sysfs directory looks like: block/ bus/ class/ +dev/ devices/ firmware/ net/ @@ -269,6 +287,11 @@ fs/ contains a directory for some filesystems. Currently each filesystem wanting to export attributes must create its own hierarchy below fs/ (see ./fuse.txt for an example). +dev/ contains two directories char/ and block/. Inside these two +directories there are symlinks named <major>:<minor>. These symlinks +point to the sysfs directory for the given device. /sys/dev provides a +quick way to lookup the sysfs interface for a device from the result of +a stat(2) operation. More information can driver-model specific features can be found in Documentation/driver-model/. @@ -288,19 +311,21 @@ The following interface layers currently exist in sysfs: Structure: struct device_attribute { - struct attribute attr; - ssize_t (*show)(struct device * dev, char * buf); - ssize_t (*store)(struct device * dev, const char * buf); + struct attribute attr; + ssize_t (*show)(struct device *dev, struct device_attribute *attr, + char *buf); + ssize_t (*store)(struct device *dev, struct device_attribute *attr, + const char *buf, size_t count); }; Declaring: -DEVICE_ATTR(_name, _str, _mode, _show, _store); +DEVICE_ATTR(_name, _mode, _show, _store); Creation/Removal: -int device_create_file(struct device *device, struct device_attribute * attr); -void device_remove_file(struct device * dev, struct device_attribute * attr); +int device_create_file(struct device *dev, const struct device_attribute * attr); +void device_remove_file(struct device *dev, const struct device_attribute * attr); - bus drivers (include/linux/device.h) @@ -310,7 +335,7 @@ Structure: struct bus_attribute { struct attribute attr; ssize_t (*show)(struct bus_type *, char * buf); - ssize_t (*store)(struct bus_type *, const char * buf); + ssize_t (*store)(struct bus_type *, const char * buf, size_t count); }; Declaring: @@ -331,7 +356,8 @@ Structure: struct driver_attribute { struct attribute attr; ssize_t (*show)(struct device_driver *, char * buf); - ssize_t (*store)(struct device_driver *, const char * buf); + ssize_t (*store)(struct device_driver *, const char * buf, + size_t count); }; Declaring: @@ -340,7 +366,15 @@ DRIVER_ATTR(_name, _mode, _show, _store) Creation/Removal: -int driver_create_file(struct device_driver *, struct driver_attribute *); -void driver_remove_file(struct device_driver *, struct driver_attribute *); +int driver_create_file(struct device_driver *, const struct driver_attribute *); +void driver_remove_file(struct device_driver *, const struct driver_attribute *); + +Documentation +~~~~~~~~~~~~~ +The sysfs directory structure and the attributes in each directory define an +ABI between the kernel and user space. As for any ABI, it is important that +this ABI is stable and properly documented. All new sysfs attributes must be +documented in Documentation/ABI. See also Documentation/ABI/README for more +information. diff --git a/Documentation/filesystems/tmpfs.txt b/Documentation/filesystems/tmpfs.txt index 145e4408635..98ef5512415 100644 --- a/Documentation/filesystems/tmpfs.txt +++ b/Documentation/filesystems/tmpfs.txt @@ -82,16 +82,38 @@ tmpfs has a mount option to set the NUMA memory allocation policy for all files in that instance (if CONFIG_NUMA is enabled) - which can be adjusted on the fly via 'mount -o remount ...' -mpol=default prefers to allocate memory from the local node +mpol=default use the process allocation policy + (see set_mempolicy(2)) mpol=prefer:Node prefers to allocate memory from the given Node mpol=bind:NodeList allocates memory only from nodes in NodeList mpol=interleave prefers to allocate from each node in turn mpol=interleave:NodeList allocates from each node of NodeList in turn +mpol=local prefers to allocate memory from the local node NodeList format is a comma-separated list of decimal numbers and ranges, a range being two hyphen-separated decimal numbers, the smallest and largest node numbers in the range. For example, mpol=bind:0-3,5,7,9-15 +A memory policy with a valid NodeList will be saved, as specified, for +use at file creation time. When a task allocates a file in the file +system, the mount option memory policy will be applied with a NodeList, +if any, modified by the calling task's cpuset constraints +[See Documentation/cgroups/cpusets.txt] and any optional flags, listed +below. If the resulting NodeLists is the empty set, the effective memory +policy for the file will revert to "default" policy. + +NUMA memory allocation policies have optional flags that can be used in +conjunction with their modes. These optional flags can be specified +when tmpfs is mounted by appending them to the mode before the NodeList. +See Documentation/vm/numa_memory_policy.txt for a list of all available +memory allocation policy mode flags and their effect on memory policy. + + =static is equivalent to MPOL_F_STATIC_NODES + =relative is equivalent to MPOL_F_RELATIVE_NODES + +For example, mpol=bind=static:NodeList, is the equivalent of an +allocation policy of MPOL_BIND | MPOL_F_STATIC_NODES. + Note that trying to mount a tmpfs with an mpol option will fail if the running kernel does not support NUMA; and will fail if its nodelist specifies a node which is not online. If your system relies on that @@ -121,4 +143,6 @@ RAM/SWAP in 10240 inodes and it is only accessible by root. Author: Christoph Rohland <cr@sap.com>, 1.12.01 Updated: - Hugh Dickins <hugh@veritas.com>, 4 June 2007 + Hugh Dickins, 4 June 2007 +Updated: + KOSAKI Motohiro, 16 Mar 2010 diff --git a/Documentation/filesystems/ubifs.txt b/Documentation/filesystems/ubifs.txt new file mode 100644 index 00000000000..a0a61d2f389 --- /dev/null +++ b/Documentation/filesystems/ubifs.txt @@ -0,0 +1,119 @@ +Introduction +============= + +UBIFS file-system stands for UBI File System. UBI stands for "Unsorted +Block Images". UBIFS is a flash file system, which means it is designed +to work with flash devices. It is important to understand, that UBIFS +is completely different to any traditional file-system in Linux, like +Ext2, XFS, JFS, etc. UBIFS represents a separate class of file-systems +which work with MTD devices, not block devices. The other Linux +file-system of this class is JFFS2. + +To make it more clear, here is a small comparison of MTD devices and +block devices. + +1 MTD devices represent flash devices and they consist of eraseblocks of + rather large size, typically about 128KiB. Block devices consist of + small blocks, typically 512 bytes. +2 MTD devices support 3 main operations - read from some offset within an + eraseblock, write to some offset within an eraseblock, and erase a whole + eraseblock. Block devices support 2 main operations - read a whole + block and write a whole block. +3 The whole eraseblock has to be erased before it becomes possible to + re-write its contents. Blocks may be just re-written. +4 Eraseblocks become worn out after some number of erase cycles - + typically 100K-1G for SLC NAND and NOR flashes, and 1K-10K for MLC + NAND flashes. Blocks do not have the wear-out property. +5 Eraseblocks may become bad (only on NAND flashes) and software should + deal with this. Blocks on hard drives typically do not become bad, + because hardware has mechanisms to substitute bad blocks, at least in + modern LBA disks. + +It should be quite obvious why UBIFS is very different to traditional +file-systems. + +UBIFS works on top of UBI. UBI is a separate software layer which may be +found in drivers/mtd/ubi. UBI is basically a volume management and +wear-leveling layer. It provides so called UBI volumes which is a higher +level abstraction than a MTD device. The programming model of UBI devices +is very similar to MTD devices - they still consist of large eraseblocks, +they have read/write/erase operations, but UBI devices are devoid of +limitations like wear and bad blocks (items 4 and 5 in the above list). + +In a sense, UBIFS is a next generation of JFFS2 file-system, but it is +very different and incompatible to JFFS2. The following are the main +differences. + +* JFFS2 works on top of MTD devices, UBIFS depends on UBI and works on + top of UBI volumes. +* JFFS2 does not have on-media index and has to build it while mounting, + which requires full media scan. UBIFS maintains the FS indexing + information on the flash media and does not require full media scan, + so it mounts many times faster than JFFS2. +* JFFS2 is a write-through file-system, while UBIFS supports write-back, + which makes UBIFS much faster on writes. + +Similarly to JFFS2, UBIFS supports on-the-flight compression which makes +it possible to fit quite a lot of data to the flash. + +Similarly to JFFS2, UBIFS is tolerant of unclean reboots and power-cuts. +It does not need stuff like fsck.ext2. UBIFS automatically replays its +journal and recovers from crashes, ensuring that the on-flash data +structures are consistent. + +UBIFS scales logarithmically (most of the data structures it uses are +trees), so the mount time and memory consumption do not linearly depend +on the flash size, like in case of JFFS2. This is because UBIFS +maintains the FS index on the flash media. However, UBIFS depends on +UBI, which scales linearly. So overall UBI/UBIFS stack scales linearly. +Nevertheless, UBI/UBIFS scales considerably better than JFFS2. + +The authors of UBIFS believe, that it is possible to develop UBI2 which +would scale logarithmically as well. UBI2 would support the same API as UBI, +but it would be binary incompatible to UBI. So UBIFS would not need to be +changed to use UBI2 + + +Mount options +============= + +(*) == default. + +bulk_read read more in one go to take advantage of flash + media that read faster sequentially +no_bulk_read (*) do not bulk-read +no_chk_data_crc (*) skip checking of CRCs on data nodes in order to + improve read performance. Use this option only + if the flash media is highly reliable. The effect + of this option is that corruption of the contents + of a file can go unnoticed. +chk_data_crc do not skip checking CRCs on data nodes +compr=none override default compressor and set it to "none" +compr=lzo override default compressor and set it to "lzo" +compr=zlib override default compressor and set it to "zlib" + + +Quick usage instructions +======================== + +The UBI volume to mount is specified using "ubiX_Y" or "ubiX:NAME" syntax, +where "X" is UBI device number, "Y" is UBI volume number, and "NAME" is +UBI volume name. + +Mount volume 0 on UBI device 0 to /mnt/ubifs: +$ mount -t ubifs ubi0_0 /mnt/ubifs + +Mount "rootfs" volume of UBI device 0 to /mnt/ubifs ("rootfs" is volume +name): +$ mount -t ubifs ubi0:rootfs /mnt/ubifs + +The following is an example of the kernel boot arguments to attach mtd0 +to UBI and mount volume "rootfs": +ubi.mtd=0 root=ubi0:rootfs rootfstype=ubifs + +References +========== + +UBIFS documentation and FAQ/HOWTO at the MTD web site: +http://www.linux-mtd.infradead.org/doc/ubifs.html +http://www.linux-mtd.infradead.org/faq/ubifs.html diff --git a/Documentation/filesystems/udf.txt b/Documentation/filesystems/udf.txt index fde829a756e..902b95d0ee5 100644 --- a/Documentation/filesystems/udf.txt +++ b/Documentation/filesystems/udf.txt @@ -24,6 +24,8 @@ The following mount options are supported: gid= Set the default group. umask= Set the default umask. + mode= Set the default file permissions. + dmode= Set the default directory permissions. uid= Set the default user. bs= Set the block size. unhide Show otherwise hidden files. diff --git a/Documentation/filesystems/vfat.txt b/Documentation/filesystems/vfat.txt index fcc123ffa25..ce1126aceed 100644 --- a/Documentation/filesystems/vfat.txt +++ b/Documentation/filesystems/vfat.txt @@ -8,6 +8,12 @@ if you want to format from within Linux. VFAT MOUNT OPTIONS ---------------------------------------------------------------------- +uid=### -- Set the owner of all files on this filesystem. + The default is the uid of current process. + +gid=### -- Set the group of all files on this filesystem. + The default is the gid of current process. + umask=### -- The permission mask (for files and directories, see umask(1)). The default is the umask of current process. @@ -17,11 +23,26 @@ dmask=### -- The permission mask for the directory. fmask=### -- The permission mask for files. The default is the umask of current process. +allow_utime=### -- This option controls the permission check of mtime/atime. + + 20 - If current process is in group of file's group ID, + you can change timestamp. + 2 - Other users can change timestamp. + + The default is set from `dmask' option. (If the directory is + writable, utime(2) is also allowed. I.e. ~dmask & 022) + + Normally utime(2) checks current process is owner of + the file, or it has CAP_FOWNER capability. But FAT + filesystem doesn't have uid/gid on disk, so normal + check is too unflexible. With this option you can + relax it. + codepage=### -- Sets the codepage number for converting to shortname characters on FAT filesystem. By default, FAT_DEFAULT_CODEPAGE setting is used. -iocharset=name -- Character set to use for converting between the +iocharset=<name> -- Character set to use for converting between the encoding is used for user visible filename and 16 bit Unicode characters. Long filenames are stored on disk in Unicode format, but Unix for the most part doesn't @@ -71,6 +92,8 @@ check=s|r|n -- Case sensitivity checking setting. r: relaxed, case insensitive n: normal, default setting, currently case insensitive +nocase -- This was deprecated for vfat. Use shortname=win95 instead. + shortname=lower|win95|winnt|mixed -- Shortname display/create setting. lower: convert to lowercase for display, @@ -79,7 +102,81 @@ shortname=lower|win95|winnt|mixed winnt: emulate the Windows NT rule for display/create. mixed: emulate the Windows NT rule for display, emulate the Windows 95 rule for create. - Default setting is `lower'. + Default setting is `mixed'. + +tz=UTC -- Interpret timestamps as UTC rather than local time. + This option disables the conversion of timestamps + between local time (as used by Windows on FAT) and UTC + (which Linux uses internally). This is particularly + useful when mounting devices (like digital cameras) + that are set to UTC in order to avoid the pitfalls of + local time. +time_offset=minutes + -- Set offset for conversion of timestamps from local time + used by FAT to UTC. I.e. <minutes> minutes will be subtracted + from each timestamp to convert it to UTC used internally by + Linux. This is useful when time zone set in sys_tz is + not the time zone used by the filesystem. Note that this + option still does not provide correct time stamps in all + cases in presence of DST - time stamps in a different DST + setting will be off by one hour. + +showexec -- If set, the execute permission bits of the file will be + allowed only if the extension part of the name is .EXE, + .COM, or .BAT. Not set by default. + +debug -- Can be set, but unused by the current implementation. + +sys_immutable -- If set, ATTR_SYS attribute on FAT is handled as + IMMUTABLE flag on Linux. Not set by default. + +flush -- If set, the filesystem will try to flush to disk more + early than normal. Not set by default. + +rodir -- FAT has the ATTR_RO (read-only) attribute. On Windows, + the ATTR_RO of the directory will just be ignored, + and is used only by applications as a flag (e.g. it's set + for the customized folder). + + If you want to use ATTR_RO as read-only flag even for + the directory, set this option. + +errors=panic|continue|remount-ro + -- specify FAT behavior on critical errors: panic, continue + without doing anything or remount the partition in + read-only mode (default behavior). + +discard -- If set, issues discard/TRIM commands to the block + device when blocks are freed. This is useful for SSD devices + and sparse/thinly-provisoned LUNs. + +nfs=stale_rw|nostale_ro + Enable this only if you want to export the FAT filesystem + over NFS. + + stale_rw: This option maintains an index (cache) of directory + inodes by i_logstart which is used by the nfs-related code to + improve look-ups. Full file operations (read/write) over NFS is + supported but with cache eviction at NFS server, this could + result in ESTALE issues. + + nostale_ro: This option bases the inode number and filehandle + on the on-disk location of a file in the MS-DOS directory entry. + This ensures that ESTALE will not be returned after a file is + evicted from the inode cache. However, it means that operations + such as rename, create and unlink could cause filehandles that + previously pointed at one file to point at a different file, + potentially causing data corruption. For this reason, this + option also mounts the filesystem readonly. + + To maintain backward compatibility, '-o nfs' is also accepted, + defaulting to stale_rw + +dos1xfloppy -- If set, use a fallback default BIOS Parameter Block + configuration, determined by backing device size. These static + parameters match defaults assumed by DOS 1.x for 160 kiB, + 180 kiB, 320 kiB, and 360 kiB floppies and floppy images. + <bool>: 0,1,yes,no,true,false @@ -109,7 +206,8 @@ TEST SUITE If you plan to make any modifications to the vfat filesystem, please get the test suite that comes with the vfat distribution at - http://bmrc.berkeley.edu/people/chaffee/vfat.html + http://web.archive.org/web/*/http://bmrc.berkeley.edu/ + people/chaffee/vfat.html This tests quite a few parts of the vfat filesystem and additional tests for new features or untested features would be appreciated. diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt index 81e5be6e6e3..a1d0d7a3016 100644 --- a/Documentation/filesystems/vfs.txt +++ b/Documentation/filesystems/vfs.txt @@ -72,7 +72,7 @@ structure (this is the kernel-side implementation of file descriptors). The freshly allocated file structure is initialized with a pointer to the dentry and a set of file operation member functions. These are taken from the inode data. The open() file method is then -called so the specific filesystem implementation can do it's work. You +called so the specific filesystem implementation can do its work. You can see that this is another switch performed by the VFS. The file structure is placed into the file descriptor table for the process. @@ -95,10 +95,11 @@ functions: extern int unregister_filesystem(struct file_system_type *); The passed struct file_system_type describes your filesystem. When a -request is made to mount a device onto a directory in your filespace, -the VFS will call the appropriate get_sb() method for the specific -filesystem. The dentry for the mount point will then be updated to -point to the root inode for the new filesystem. +request is made to mount a filesystem onto a directory in your namespace, +the VFS will call the appropriate mount() method for the specific +filesystem. New vfsmount referring to the tree returned by ->mount() +will be attached to the mountpoint, so that when pathname resolution +reaches the mountpoint it will jump into the root of that vfsmount. You can see all filesystems that are registered to the kernel in the file /proc/filesystems. @@ -107,14 +108,14 @@ file /proc/filesystems. struct file_system_type ----------------------- -This describes the filesystem. As of kernel 2.6.22, the following +This describes the filesystem. As of kernel 2.6.39, the following members are defined: struct file_system_type { const char *name; int fs_flags; - int (*get_sb) (struct file_system_type *, int, - const char *, void *, struct vfsmount *); + struct dentry *(*mount) (struct file_system_type *, int, + const char *, void *); void (*kill_sb) (struct super_block *); struct module *owner; struct file_system_type * next; @@ -128,11 +129,11 @@ struct file_system_type { fs_flags: various flags (i.e. FS_REQUIRES_DEV, FS_NO_DCACHE, etc.) - get_sb: the method to call when a new instance of this + mount: the method to call when a new instance of this filesystem should be mounted kill_sb: the method to call when an instance of this filesystem - should be unmounted + should be shut down owner: for internal VFS use: you should initialize this to THIS_MODULE in most cases. @@ -141,9 +142,9 @@ struct file_system_type { s_lock_key, s_umount_key: lockdep-specific -The get_sb() method has the following arguments: +The mount() method has the following arguments: - struct file_system_type *fs_type: decribes the filesystem, partly initialized + struct file_system_type *fs_type: describes the filesystem, partly initialized by the specific filesystem code int flags: mount flags @@ -153,32 +154,39 @@ The get_sb() method has the following arguments: void *data: arbitrary mount options, usually comes as an ASCII string (see "Mount Options" section) - struct vfsmount *mnt: a vfs-internal representation of a mount point +The mount() method must return the root dentry of the tree requested by +caller. An active reference to its superblock must be grabbed and the +superblock must be locked. On failure it should return ERR_PTR(error). -The get_sb() method must determine if the block device specified -in the dev_name and fs_type contains a filesystem of the type the method -supports. If it succeeds in opening the named block device, it initializes a -struct super_block descriptor for the filesystem contained by the block device. -On failure it returns an error. +The arguments match those of mount(2) and their interpretation +depends on filesystem type. E.g. for block filesystems, dev_name is +interpreted as block device name, that device is opened and if it +contains a suitable filesystem image the method creates and initializes +struct super_block accordingly, returning its root dentry to caller. + +->mount() may choose to return a subtree of existing filesystem - it +doesn't have to create a new one. The main result from the caller's +point of view is a reference to dentry at the root of (sub)tree to +be attached; creation of new superblock is a common side effect. The most interesting member of the superblock structure that the -get_sb() method fills in is the "s_op" field. This is a pointer to +mount() method fills in is the "s_op" field. This is a pointer to a "struct super_operations" which describes the next level of the filesystem implementation. -Usually, a filesystem uses one of the generic get_sb() implementations -and provides a fill_super() method instead. The generic methods are: +Usually, a filesystem uses one of the generic mount() implementations +and provides a fill_super() callback instead. The generic variants are: - get_sb_bdev: mount a filesystem residing on a block device + mount_bdev: mount a filesystem residing on a block device - get_sb_nodev: mount a filesystem that is not backed by a device + mount_nodev: mount a filesystem that is not backed by a device - get_sb_single: mount a filesystem which shares the instance between + mount_single: mount a filesystem which shares the instance between all mounts -A fill_super() method implementation has the following arguments: +A fill_super() callback implementation has the following arguments: - struct super_block *sb: the superblock structure. The method fill_super() + struct super_block *sb: the superblock structure. The callback must initialize this properly. void *data: arbitrary mount options, usually comes as an ASCII @@ -203,25 +211,25 @@ struct super_operations { struct inode *(*alloc_inode)(struct super_block *sb); void (*destroy_inode)(struct inode *); - void (*dirty_inode) (struct inode *); + void (*dirty_inode) (struct inode *, int flags); int (*write_inode) (struct inode *, int); - void (*put_inode) (struct inode *); void (*drop_inode) (struct inode *); void (*delete_inode) (struct inode *); void (*put_super) (struct super_block *); - void (*write_super) (struct super_block *); int (*sync_fs)(struct super_block *sb, int wait); - void (*write_super_lockfs) (struct super_block *); - void (*unlockfs) (struct super_block *); + int (*freeze_fs) (struct super_block *); + int (*unfreeze_fs) (struct super_block *); int (*statfs) (struct dentry *, struct kstatfs *); int (*remount_fs) (struct super_block *, int *, char *); void (*clear_inode) (struct inode *); void (*umount_begin) (struct super_block *); - int (*show_options)(struct seq_file *, struct vfsmount *); + int (*show_options)(struct seq_file *, struct dentry *); ssize_t (*quota_read)(struct super_block *, int, char *, size_t, loff_t); ssize_t (*quota_write)(struct super_block *, int, const char *, size_t, loff_t); + int (*nr_cached_objects)(struct super_block *); + void (*free_cached_objects)(struct super_block *, int); }; All methods are called without any locks being held, unless otherwise @@ -246,11 +254,8 @@ or bottom half). inode to disc. The second parameter indicates whether the write should be synchronous or not, not all filesystems check this flag. - put_inode: called when the VFS inode is removed from the inode - cache. - drop_inode: called when the last access to the inode is dropped, - with the inode_lock spinlock held. + with the inode->i_lock spinlock held. This method should be either NULL (normal UNIX filesystem semantics) or "generic_delete_inode" (for filesystems that do not @@ -267,22 +272,18 @@ or bottom half). put_super: called when the VFS wishes to free the superblock (i.e. unmount). This is called with the superblock lock held - write_super: called when the VFS superblock needs to be written to - disc. This method is optional - sync_fs: called when VFS is writing out all dirty data associated with a superblock. The second parameter indicates whether the method should wait until the write out has been completed. Optional. - write_super_lockfs: called when VFS is locking a filesystem and + freeze_fs: called when VFS is locking a filesystem and forcing it into a consistent state. This method is currently used by the Logical Volume Manager (LVM). - unlockfs: called when VFS is unlocking a filesystem and making it writable + unfreeze_fs: called when VFS is unlocking a filesystem and making it writable again. - statfs: called when the VFS needs to get filesystem statistics. This - is called with the kernel lock held + statfs: called when the VFS needs to get filesystem statistics. remount_fs: called when the filesystem is remounted. This is called with the kernel lock held @@ -298,6 +299,26 @@ or bottom half). quota_write: called by the VFS to write to filesystem quota file. + nr_cached_objects: called by the sb cache shrinking function for the + filesystem to return the number of freeable cached objects it contains. + Optional. + + free_cache_objects: called by the sb cache shrinking function for the + filesystem to scan the number of objects indicated to try to free them. + Optional, but any filesystem implementing this method needs to also + implement ->nr_cached_objects for it to be called correctly. + + We can't do anything with any errors that the filesystem might + encountered, hence the void return type. This will never be called if + the VM is trying to reclaim under GFP_NOFS conditions, hence this + method does not need to handle that situation itself. + + Implementations must include conditional reschedule calls inside any + scanning loop that is done. This allows the VFS to determine + appropriate scan batch sizes without having to worry about whether + implementations will cause holdoff problems due to large scan batch + sizes. + Whoever sets up the inode is responsible for filling in the "i_op" field. This is a pointer to a "struct inode_operations" which describes the methods that can be performed on individual inodes. @@ -316,28 +337,33 @@ This describes how the VFS can manipulate an inode in your filesystem. As of kernel 2.6.22, the following members are defined: struct inode_operations { - int (*create) (struct inode *,struct dentry *,int, struct nameidata *); - struct dentry * (*lookup) (struct inode *,struct dentry *, struct nameidata *); + int (*create) (struct inode *,struct dentry *, umode_t, bool); + struct dentry * (*lookup) (struct inode *,struct dentry *, unsigned int); int (*link) (struct dentry *,struct inode *,struct dentry *); int (*unlink) (struct inode *,struct dentry *); int (*symlink) (struct inode *,struct dentry *,const char *); - int (*mkdir) (struct inode *,struct dentry *,int); + int (*mkdir) (struct inode *,struct dentry *,umode_t); int (*rmdir) (struct inode *,struct dentry *); - int (*mknod) (struct inode *,struct dentry *,int,dev_t); + int (*mknod) (struct inode *,struct dentry *,umode_t,dev_t); int (*rename) (struct inode *, struct dentry *, struct inode *, struct dentry *); + int (*rename2) (struct inode *, struct dentry *, + struct inode *, struct dentry *, unsigned int); int (*readlink) (struct dentry *, char __user *,int); void * (*follow_link) (struct dentry *, struct nameidata *); void (*put_link) (struct dentry *, struct nameidata *, void *); - void (*truncate) (struct inode *); - int (*permission) (struct inode *, int, struct nameidata *); + int (*permission) (struct inode *, int); + int (*get_acl)(struct inode *, int); int (*setattr) (struct dentry *, struct iattr *); int (*getattr) (struct vfsmount *mnt, struct dentry *, struct kstat *); int (*setxattr) (struct dentry *, const char *,const void *,size_t,int); ssize_t (*getxattr) (struct dentry *, const char *, void *, size_t); ssize_t (*listxattr) (struct dentry *, char *, size_t); int (*removexattr) (struct dentry *, const char *); - void (*truncate_range)(struct inode *, loff_t, loff_t); + void (*update_time)(struct inode *, struct timespec *, int); + int (*atomic_open)(struct inode *, struct dentry *, struct file *, + unsigned open_flag, umode_t create_mode, int *opened); + int (*tmpfile) (struct inode *, struct dentry *, umode_t); }; Again, all methods are called without any locks being held, unless @@ -390,6 +416,20 @@ otherwise noted. rename: called by the rename(2) system call to rename the object to have the parent and name given by the second inode and dentry. + rename2: this has an additional flags argument compared to rename. + If no flags are supported by the filesystem then this method + need not be implemented. If some flags are supported then the + filesystem must return -EINVAL for any unsupported or unknown + flags. Currently the following flags are implemented: + (1) RENAME_NOREPLACE: this flag indicates that if the target + of the rename exists the rename should fail with -EEXIST + instead of replacing the target. The VFS already checks for + existence, so for local filesystems the RENAME_NOREPLACE + implementation is equivalent to plain rename. + (2) RENAME_EXCHANGE: exchange source and target. Both must + exist; this is checked by the VFS. Unlike plain rename, + source and target may be of different type. + readlink: called by the readlink(2) system call. Only required if you want to support reading symbolic links @@ -406,14 +446,16 @@ otherwise noted. started might not be in the page cache at the end of the walk). - truncate: called by the VFS to change the size of a file. The - i_size field of the inode is set to the desired size by the - VFS before this method is called. This method is called by - the truncate(2) system call and related functionality. - permission: called by the VFS to check for access rights on a POSIX-like filesystem. + May be called in rcu-walk mode (mask & MAY_NOT_BLOCK). If in rcu-walk + mode, the filesystem must check the permission without blocking or + storing to the inode. + + If a situation is encountered that rcu-walk cannot handle, return + -ECHILD and it will be called again in ref-walk mode. + setattr: called by the VFS to set attributes for a file. This method is called by chmod(2) and related system calls. @@ -434,9 +476,22 @@ otherwise noted. removexattr: called by the VFS to remove an extended attribute from a file. This method is called by removexattr(2) system call. - truncate_range: a method provided by the underlying filesystem to truncate a - range of blocks , i.e. punch a hole somewhere in a file. + update_time: called by the VFS to update a specific time or the i_version of + an inode. If this is not defined the VFS will update the inode itself + and call mark_inode_dirty_sync. + + atomic_open: called on the last component of an open. Using this optional + method the filesystem can look up, possibly create and open the file in + one atomic operation. If it cannot perform this (e.g. the file type + turned out to be wrong) it may signal this by returning 1 instead of + usual 0 or -ve . This method is only called if the last component is + negative or needs lookup. Cached positive dentries are still handled by + f_op->open(). If the file was created, the FILE_CREATED flag should be + set in "opened". In case of O_EXCL the method must only succeed if the + file didn't exist and hence FILE_CREATED shall always be set on success. + tmpfile: called in the end of O_TMPFILE open(). Optional, equivalent to + atomically creating, opening and unlinking a file in given directory. The Address Space Object ======================== @@ -477,7 +532,7 @@ __sync_single_inode) to check if ->writepages has been successful in writing out the whole address_space. The Writeback tag is used by filemap*wait* and sync_page* functions, -via wait_on_page_writeback_range, to wait for all writeback to +via filemap_fdatawait_range, to wait for all writeback to complete. While waiting ->sync_page (if defined) will be called on each page that is found to require writeback. @@ -496,7 +551,7 @@ written-back to storage typically in whole pages, however the address_space has finer control of write sizes. The read process essentially only requires 'readpage'. The write -process is more complicated and uses prepare_write/commit_write or +process is more complicated and uses write_begin/write_end or set_page_dirty to write data into the address_space, and writepage, sync_page, and writepages to writeback data to storage. @@ -515,18 +570,15 @@ struct address_space_operations ------------------------------- This describes how the VFS can manipulate mapping of a file to page cache in -your filesystem. As of kernel 2.6.22, the following members are defined: +your filesystem. The following members are defined: struct address_space_operations { int (*writepage)(struct page *page, struct writeback_control *wbc); int (*readpage)(struct file *, struct page *); - int (*sync_page)(struct page *); int (*writepages)(struct address_space *, struct writeback_control *); int (*set_page_dirty)(struct page *page); int (*readpages)(struct file *filp, struct address_space *mapping, struct list_head *pages, unsigned nr_pages); - int (*prepare_write)(struct file *, struct page *, unsigned, unsigned); - int (*commit_write)(struct file *, struct page *, unsigned, unsigned); int (*write_begin)(struct file *, struct address_space *mapping, loff_t pos, unsigned len, unsigned flags, struct page **pagep, void **fsdata); @@ -534,15 +586,21 @@ struct address_space_operations { loff_t pos, unsigned len, unsigned copied, struct page *page, void *fsdata); sector_t (*bmap)(struct address_space *, sector_t); - int (*invalidatepage) (struct page *, unsigned long); + void (*invalidatepage) (struct page *, unsigned int, unsigned int); int (*releasepage) (struct page *, int); - ssize_t (*direct_IO)(int, struct kiocb *, const struct iovec *iov, - loff_t offset, unsigned long nr_segs); + void (*freepage)(struct page *); + ssize_t (*direct_IO)(int, struct kiocb *, struct iov_iter *iter, loff_t offset); struct page* (*get_xip_page)(struct address_space *, sector_t, int); /* migrate the contents of a page to the specified target */ int (*migratepage) (struct page *, struct page *); int (*launder_page) (struct page *); + int (*is_partially_uptodate) (struct page *, unsigned long, + unsigned long); + void (*is_dirty_writeback) (struct page *, bool *, bool *); + int (*error_remove_page) (struct mapping *mapping, struct page *page); + int (*swap_activate)(struct file *); + int (*swap_deactivate)(struct file *); }; writepage: called by the VM to write a dirty page to backing store. @@ -571,13 +629,6 @@ struct address_space_operations { In this case, the page will be relocated, relocked and if that all succeeds, ->readpage will be called again. - sync_page: called by the VM to notify the backing store to perform all - queued I/O operations for a page. I/O operations for other pages - associated with this address_space object may also be performed. - - This function is optional and is called only for pages with - PG_Writeback set while waiting for the writeback to complete. - writepages: called by the VM to write out pages associated with the address_space object. If wbc->sync_mode is WBC_SYNC_ALL, then the writeback_control will specify a range of pages that must be @@ -602,37 +653,7 @@ struct address_space_operations { readpages is only used for read-ahead, so read errors are ignored. If anything goes wrong, feel free to give up. - prepare_write: called by the generic write path in VM to set up a write - request for a page. This indicates to the address space that - the given range of bytes is about to be written. The - address_space should check that the write will be able to - complete, by allocating space if necessary and doing any other - internal housekeeping. If the write will update parts of - any basic-blocks on storage, then those blocks should be - pre-read (if they haven't been read already) so that the - updated blocks can be written out properly. - The page will be locked. - - Note: the page _must not_ be marked uptodate in this function - (or anywhere else) unless it actually is uptodate right now. As - soon as a page is marked uptodate, it is possible for a concurrent - read(2) to copy it to userspace. - - commit_write: If prepare_write succeeds, new data will be copied - into the page and then commit_write will be called. It will - typically update the size of the file (if appropriate) and - mark the inode as dirty, and do any other related housekeeping - operations. It should avoid returning an error if possible - - errors should have been handled by prepare_write. - - write_begin: This is intended as a replacement for prepare_write. The - key differences being that: - - it returns a locked page (in *pagep) rather than being - given a pre locked page; - - it must be able to cope with short writes (where the - length passed to write_begin is greater than the number - of bytes copied into the page). - + write_begin: Called by the generic buffered write code to ask the filesystem to prepare to write len bytes at the given offset in the file. The address_space should check that the write will be able to complete, @@ -644,6 +665,9 @@ struct address_space_operations { The filesystem must return the locked pagecache page for the specified offset, in *pagep, for the caller to write into. + It must be able to cope with short writes (where the length passed to + write_begin is greater than the number of bytes copied into the page). + flags is a field for AOP_FLAG_xxx flags, described in include/linux/fs.h. @@ -676,23 +700,22 @@ struct address_space_operations { invalidatepage: If a page has PagePrivate set, then invalidatepage will be called when part or all of the page is to be removed from the address space. This generally corresponds to either a - truncation or a complete invalidation of the address space - (in the latter case 'offset' will always be 0). - Any private data associated with the page should be updated - to reflect this truncation. If offset is 0, then - the private data should be released, because the page - must be able to be completely discarded. This may be done by - calling the ->releasepage function, but in this case the - release MUST succeed. + truncation, punch hole or a complete invalidation of the address + space (in the latter case 'offset' will always be 0 and 'length' + will be PAGE_CACHE_SIZE). Any private data associated with the page + should be updated to reflect this truncation. If offset is 0 and + length is PAGE_CACHE_SIZE, then the private data should be released, + because the page must be able to be completely discarded. This may + be done by calling the ->releasepage function, but in this case the + release MUST succeed. releasepage: releasepage is called on PagePrivate pages to indicate that the page should be freed if possible. ->releasepage should remove any private data from the page and clear the - PagePrivate flag. It may also remove the page from the - address_space. If this fails for some reason, it may indicate - failure with a 0 return value. - This is used in two distinct though related cases. The first - is when the VM finds a clean page with no active users and + PagePrivate flag. If releasepage() fails for some reason, it must + indicate failure with a 0 return value. + releasepage() is used in two distinct though related cases. The + first is when the VM finds a clean page with no active users and wants to make it a free page. If ->releasepage succeeds, the page will be removed from the address_space and become free. @@ -707,6 +730,12 @@ struct address_space_operations { need to ensure this. Possibly it can clear the PageUptodate bit if it cannot free private data yet. + freepage: freepage is called once the page is no longer visible in + the page cache in order to allow the cleanup of any private + data. Since it may be called by the memory reclaimer, it + should not assume that the original address_space mapping still + exists, and it should not block. + direct_IO: called by the generic read/write routines to perform direct_IO - that is IO requests which bypass the page cache and transfer data directly between the storage and the @@ -728,6 +757,36 @@ struct address_space_operations { prevent redirtying the page, it is kept locked during the whole operation. + is_partially_uptodate: Called by the VM when reading a file through the + pagecache when the underlying blocksize != pagesize. If the required + block is up to date then the read can complete without needing the IO + to bring the whole page up to date. + + is_dirty_writeback: Called by the VM when attempting to reclaim a page. + The VM uses dirty and writeback information to determine if it needs + to stall to allow flushers a chance to complete some IO. Ordinarily + it can use PageDirty and PageWriteback but some filesystems have + more complex state (unstable pages in NFS prevent reclaim) or + do not set those flags due to locking problems (jbd). This callback + allows a filesystem to indicate to the VM if a page should be + treated as dirty or writeback for the purposes of stalling. + + error_remove_page: normally set to generic_error_remove_page if truncation + is ok for this address space. Used for memory failure handling. + Setting this implies you deal with pages going away under you, + unless you have them locked or reference counts increased. + + swap_activate: Called when swapon is used on a file to allocate + space if necessary and pin the block lookup information in + memory. A return value of zero indicates success, + in which case this file can be used to back swapspace. The + swapspace operations will be proxied to this address space's + ->swap_{out,in} methods. + + swap_deactivate: Called during swapoff on files where swap_activate + was successful. + + The File Object =============== @@ -738,7 +797,7 @@ struct file_operations ---------------------- This describes how the VFS can manipulate an open file. As of kernel -2.6.22, the following members are defined: +3.12, the following members are defined: struct file_operations { struct module *owner; @@ -747,29 +806,29 @@ struct file_operations { ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *); ssize_t (*aio_read) (struct kiocb *, const struct iovec *, unsigned long, loff_t); ssize_t (*aio_write) (struct kiocb *, const struct iovec *, unsigned long, loff_t); - int (*readdir) (struct file *, void *, filldir_t); + ssize_t (*read_iter) (struct kiocb *, struct iov_iter *); + ssize_t (*write_iter) (struct kiocb *, struct iov_iter *); + int (*iterate) (struct file *, struct dir_context *); unsigned int (*poll) (struct file *, struct poll_table_struct *); - int (*ioctl) (struct inode *, struct file *, unsigned int, unsigned long); long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long); long (*compat_ioctl) (struct file *, unsigned int, unsigned long); int (*mmap) (struct file *, struct vm_area_struct *); int (*open) (struct inode *, struct file *); int (*flush) (struct file *); int (*release) (struct inode *, struct file *); - int (*fsync) (struct file *, struct dentry *, int datasync); + int (*fsync) (struct file *, loff_t, loff_t, int datasync); int (*aio_fsync) (struct kiocb *, int datasync); int (*fasync) (int, struct file *, int); int (*lock) (struct file *, int, struct file_lock *); - ssize_t (*readv) (struct file *, const struct iovec *, unsigned long, loff_t *); - ssize_t (*writev) (struct file *, const struct iovec *, unsigned long, loff_t *); - ssize_t (*sendfile) (struct file *, loff_t *, size_t, read_actor_t, void *); ssize_t (*sendpage) (struct file *, struct page *, int, size_t, loff_t *, int); unsigned long (*get_unmapped_area)(struct file *, unsigned long, unsigned long, unsigned long, unsigned long); int (*check_flags)(int); - int (*dir_notify)(struct file *filp, unsigned long arg); int (*flock) (struct file *, int, struct file_lock *); ssize_t (*splice_write)(struct pipe_inode_info *, struct file *, size_t, unsigned int); ssize_t (*splice_read)(struct file *, struct pipe_inode_info *, size_t, unsigned int); + int (*setlease)(struct file *, long arg, struct file_lock **); + long (*fallocate)(struct file *, int mode, loff_t offset, loff_t len); + int (*show_fdinfo)(struct seq_file *m, struct file *f); }; Again, all methods are called without any locks being held, unless @@ -779,22 +838,23 @@ otherwise noted. read: called by read(2) and related system calls - aio_read: called by io_submit(2) and other asynchronous I/O operations + aio_read: vectored, possibly asynchronous read + + read_iter: possibly asynchronous read with iov_iter as destination write: called by write(2) and related system calls - aio_write: called by io_submit(2) and other asynchronous I/O operations + aio_write: vectored, possibly asynchronous write + + write_iter: possibly asynchronous write with iov_iter as source - readdir: called when the VFS needs to read the directory contents + iterate: called when the VFS needs to read the directory contents poll: called by the VFS when a process wants to check if there is activity on this file and (optionally) go to sleep until there is activity. Called by the select(2) and poll(2) system calls - ioctl: called by the ioctl(2) system call - - unlocked_ioctl: called by the ioctl(2) system call. Filesystems that do not - require the BKL should use this method instead of the ioctl() above. + unlocked_ioctl: called by the ioctl(2) system call. compat_ioctl: called by the ioctl(2) system call when 32 bit system calls are used on 64 bit kernels. @@ -823,18 +883,10 @@ otherwise noted. lock: called by the fcntl(2) system call for F_GETLK, F_SETLK, and F_SETLKW commands - readv: called by the readv(2) system call - - writev: called by the writev(2) system call - - sendfile: called by the sendfile(2) system call - get_unmapped_area: called by the mmap(2) system call check_flags: called by the fcntl(2) system call for F_SETFL command - dir_notify: called by the fcntl(2) system call for F_NOTIFY command - flock: called by the flock(2) system call splice_write: called by the VFS to splice data from a pipe to a file. This @@ -843,6 +895,11 @@ otherwise noted. splice_read: called by the VFS to splice data from file to a pipe. This method is used by the splice(2) system call + setlease: called by the VFS to set or release a file lock lease. + setlease has the file_lock_lock held and must not sleep. + + fallocate: called by the VFS to preallocate blocks or punch a hole. + Note that the file operations are implemented by the specific filesystem in which the inode resides. When opening a device node (character or block special) most filesystems will call special @@ -869,27 +926,81 @@ the VFS uses a default. As of kernel 2.6.22, the following members are defined: struct dentry_operations { - int (*d_revalidate)(struct dentry *, struct nameidata *); - int (*d_hash) (struct dentry *, struct qstr *); - int (*d_compare) (struct dentry *, struct qstr *, struct qstr *); - int (*d_delete)(struct dentry *); + int (*d_revalidate)(struct dentry *, unsigned int); + int (*d_weak_revalidate)(struct dentry *, unsigned int); + int (*d_hash)(const struct dentry *, struct qstr *); + int (*d_compare)(const struct dentry *, const struct dentry *, + unsigned int, const char *, const struct qstr *); + int (*d_delete)(const struct dentry *); void (*d_release)(struct dentry *); void (*d_iput)(struct dentry *, struct inode *); char *(*d_dname)(struct dentry *, char *, int); + struct vfsmount *(*d_automount)(struct path *); + int (*d_manage)(struct dentry *, bool); }; d_revalidate: called when the VFS needs to revalidate a dentry. This is called whenever a name look-up finds a dentry in the - dcache. Most filesystems leave this as NULL, because all their - dentries in the dcache are valid + dcache. Most local filesystems leave this as NULL, because all their + dentries in the dcache are valid. Network filesystems are different + since things can change on the server without the client necessarily + being aware of it. + + This function should return a positive value if the dentry is still + valid, and zero or a negative error code if it isn't. + + d_revalidate may be called in rcu-walk mode (flags & LOOKUP_RCU). + If in rcu-walk mode, the filesystem must revalidate the dentry without + blocking or storing to the dentry, d_parent and d_inode should not be + used without care (because they can change and, in d_inode case, even + become NULL under us). - d_hash: called when the VFS adds a dentry to the hash table + If a situation is encountered that rcu-walk cannot handle, return + -ECHILD and it will be called again in ref-walk mode. - d_compare: called when a dentry should be compared with another + d_weak_revalidate: called when the VFS needs to revalidate a "jumped" dentry. + This is called when a path-walk ends at dentry that was not acquired by + doing a lookup in the parent directory. This includes "/", "." and "..", + as well as procfs-style symlinks and mountpoint traversal. - d_delete: called when the last reference to a dentry is - deleted. This means no-one is using the dentry, however it is - still valid and in the dcache + In this case, we are less concerned with whether the dentry is still + fully correct, but rather that the inode is still valid. As with + d_revalidate, most local filesystems will set this to NULL since their + dcache entries are always valid. + + This function has the same return code semantics as d_revalidate. + + d_weak_revalidate is only called after leaving rcu-walk mode. + + d_hash: called when the VFS adds a dentry to the hash table. The first + dentry passed to d_hash is the parent directory that the name is + to be hashed into. + + Same locking and synchronisation rules as d_compare regarding + what is safe to dereference etc. + + d_compare: called to compare a dentry name with a given name. The first + dentry is the parent of the dentry to be compared, the second is + the child dentry. len and name string are properties of the dentry + to be compared. qstr is the name to compare it with. + + Must be constant and idempotent, and should not take locks if + possible, and should not or store into the dentry. + Should not dereference pointers outside the dentry without + lots of care (eg. d_parent, d_inode, d_name should not be used). + + However, our vfsmount is pinned, and RCU held, so the dentries and + inodes won't disappear, neither will our sb or filesystem module. + ->d_sb may be used. + + It is a tricky calling convention because it needs to be called under + "rcu-walk", ie. without any locks or references on things. + + d_delete: called when the last reference to a dentry is dropped and the + dcache is deciding whether or not to cache it. Return 1 to delete + immediately, or 0 to cache the dentry. Default is NULL which means to + always cache a reachable dentry. d_delete must be constant and + idempotent. d_release: called when a dentry is really deallocated @@ -899,9 +1010,9 @@ struct dentry_operations { iput() yourself d_dname: called when the pathname of a dentry should be generated. - Usefull for some pseudo filesystems (sockfs, pipefs, ...) to delay + Useful for some pseudo filesystems (sockfs, pipefs, ...) to delay pathname generation. (Instead of doing it when dentry is created, - its done only when the path is needed.). Real filesystems probably + it's done only when the path is needed.). Real filesystems probably dont want to use it, because their dentries are present in global dcache hash, so their hash should be an invariant. As no lock is held, d_dname() should not try to modify the dentry itself, unless @@ -910,6 +1021,43 @@ struct dentry_operations { at the end of the buffer, and returns a pointer to the first char. dynamic_dname() helper function is provided to take care of this. + d_automount: called when an automount dentry is to be traversed (optional). + This should create a new VFS mount record and return the record to the + caller. The caller is supplied with a path parameter giving the + automount directory to describe the automount target and the parent + VFS mount record to provide inheritable mount parameters. NULL should + be returned if someone else managed to make the automount first. If + the vfsmount creation failed, then an error code should be returned. + If -EISDIR is returned, then the directory will be treated as an + ordinary directory and returned to pathwalk to continue walking. + + If a vfsmount is returned, the caller will attempt to mount it on the + mountpoint and will remove the vfsmount from its expiration list in + the case of failure. The vfsmount should be returned with 2 refs on + it to prevent automatic expiration - the caller will clean up the + additional ref. + + This function is only used if DCACHE_NEED_AUTOMOUNT is set on the + dentry. This is set by __d_instantiate() if S_AUTOMOUNT is set on the + inode being added. + + d_manage: called to allow the filesystem to manage the transition from a + dentry (optional). This allows autofs, for example, to hold up clients + waiting to explore behind a 'mountpoint' whilst letting the daemon go + past and construct the subtree there. 0 should be returned to let the + calling process continue. -EISDIR can be returned to tell pathwalk to + use this directory as an ordinary directory and to ignore anything + mounted on it and not to check the automount flag. Any other error + code will abort pathwalk completely. + + If the 'rcu_walk' parameter is true, then the caller is doing a + pathwalk in RCU-walk mode. Sleeping is not permitted in this mode, + and the caller can be asked to leave it and call again by returning + -ECHILD. + + This function is only used if DCACHE_MANAGE_TRANSIT is set on the + dentry being transited from. + Example : static char *pipefs_dname(struct dentry *dent, char *buffer, int buflen) @@ -933,14 +1081,11 @@ manipulate dentries: the usage count) dput: close a handle for a dentry (decrements the usage count). If - the usage count drops to 0, the "d_delete" method is called - and the dentry is placed on the unused list if the dentry is - still in its parents hash list. Putting the dentry on the - unused list just means that if the system needs some RAM, it - goes through the unused list of dentries and deallocates them. - If the dentry has already been unhashed and the usage count - drops to 0, in this case the dentry is deallocated after the - "d_delete" method is called + the usage count drops to 0, and the dentry is still in its + parent's hash, the "d_delete" method is called to check whether + it should be cached. If it should not be cached, or if the dentry + is not hashed, it is deleted. Otherwise cached dentries are put + into an LRU list to be reclaimed on memory shortage. d_drop: this unhashes a dentry from its parents hash list. A subsequent call to dput() will deallocate the dentry if its @@ -964,12 +1109,9 @@ manipulate dentries: d_lookup: look up a dentry given its parent and path name component It looks up the child of that given name from the dcache hash table. If it is found, the reference count is incremented - and the dentry is returned. The caller must use d_put() + and the dentry is returned. The caller must use dput() to free the dentry when it finishes using it. -For further information on dentry locking, please refer to the document -Documentation/filesystems/dentry-locking.txt. - Mount Options ============= diff --git a/Documentation/filesystems/xfs-delayed-logging-design.txt b/Documentation/filesystems/xfs-delayed-logging-design.txt new file mode 100644 index 00000000000..2ce36439c09 --- /dev/null +++ b/Documentation/filesystems/xfs-delayed-logging-design.txt @@ -0,0 +1,793 @@ +XFS Delayed Logging Design +-------------------------- + +Introduction to Re-logging in XFS +--------------------------------- + +XFS logging is a combination of logical and physical logging. Some objects, +such as inodes and dquots, are logged in logical format where the details +logged are made up of the changes to in-core structures rather than on-disk +structures. Other objects - typically buffers - have their physical changes +logged. The reason for these differences is to reduce the amount of log space +required for objects that are frequently logged. Some parts of inodes are more +frequently logged than others, and inodes are typically more frequently logged +than any other object (except maybe the superblock buffer) so keeping the +amount of metadata logged low is of prime importance. + +The reason that this is such a concern is that XFS allows multiple separate +modifications to a single object to be carried in the log at any given time. +This allows the log to avoid needing to flush each change to disk before +recording a new change to the object. XFS does this via a method called +"re-logging". Conceptually, this is quite simple - all it requires is that any +new change to the object is recorded with a *new copy* of all the existing +changes in the new transaction that is written to the log. + +That is, if we have a sequence of changes A through to F, and the object was +written to disk after change D, we would see in the log the following series +of transactions, their contents and the log sequence number (LSN) of the +transaction: + + Transaction Contents LSN + A A X + B A+B X+n + C A+B+C X+n+m + D A+B+C+D X+n+m+o + <object written to disk> + E E Y (> X+n+m+o) + F E+F Yٍ+p + +In other words, each time an object is relogged, the new transaction contains +the aggregation of all the previous changes currently held only in the log. + +This relogging technique also allows objects to be moved forward in the log so +that an object being relogged does not prevent the tail of the log from ever +moving forward. This can be seen in the table above by the changing +(increasing) LSN of each subsequent transaction - the LSN is effectively a +direct encoding of the location in the log of the transaction. + +This relogging is also used to implement long-running, multiple-commit +transactions. These transaction are known as rolling transactions, and require +a special log reservation known as a permanent transaction reservation. A +typical example of a rolling transaction is the removal of extents from an +inode which can only be done at a rate of two extents per transaction because +of reservation size limitations. Hence a rolling extent removal transaction +keeps relogging the inode and btree buffers as they get modified in each +removal operation. This keeps them moving forward in the log as the operation +progresses, ensuring that current operation never gets blocked by itself if the +log wraps around. + +Hence it can be seen that the relogging operation is fundamental to the correct +working of the XFS journalling subsystem. From the above description, most +people should be able to see why the XFS metadata operations writes so much to +the log - repeated operations to the same objects write the same changes to +the log over and over again. Worse is the fact that objects tend to get +dirtier as they get relogged, so each subsequent transaction is writing more +metadata into the log. + +Another feature of the XFS transaction subsystem is that most transactions are +asynchronous. That is, they don't commit to disk until either a log buffer is +filled (a log buffer can hold multiple transactions) or a synchronous operation +forces the log buffers holding the transactions to disk. This means that XFS is +doing aggregation of transactions in memory - batching them, if you like - to +minimise the impact of the log IO on transaction throughput. + +The limitation on asynchronous transaction throughput is the number and size of +log buffers made available by the log manager. By default there are 8 log +buffers available and the size of each is 32kB - the size can be increased up +to 256kB by use of a mount option. + +Effectively, this gives us the maximum bound of outstanding metadata changes +that can be made to the filesystem at any point in time - if all the log +buffers are full and under IO, then no more transactions can be committed until +the current batch completes. It is now common for a single current CPU core to +be to able to issue enough transactions to keep the log buffers full and under +IO permanently. Hence the XFS journalling subsystem can be considered to be IO +bound. + +Delayed Logging: Concepts +------------------------- + +The key thing to note about the asynchronous logging combined with the +relogging technique XFS uses is that we can be relogging changed objects +multiple times before they are committed to disk in the log buffers. If we +return to the previous relogging example, it is entirely possible that +transactions A through D are committed to disk in the same log buffer. + +That is, a single log buffer may contain multiple copies of the same object, +but only one of those copies needs to be there - the last one "D", as it +contains all the changes from the previous changes. In other words, we have one +necessary copy in the log buffer, and three stale copies that are simply +wasting space. When we are doing repeated operations on the same set of +objects, these "stale objects" can be over 90% of the space used in the log +buffers. It is clear that reducing the number of stale objects written to the +log would greatly reduce the amount of metadata we write to the log, and this +is the fundamental goal of delayed logging. + +From a conceptual point of view, XFS is already doing relogging in memory (where +memory == log buffer), only it is doing it extremely inefficiently. It is using +logical to physical formatting to do the relogging because there is no +infrastructure to keep track of logical changes in memory prior to physically +formatting the changes in a transaction to the log buffer. Hence we cannot avoid +accumulating stale objects in the log buffers. + +Delayed logging is the name we've given to keeping and tracking transactional +changes to objects in memory outside the log buffer infrastructure. Because of +the relogging concept fundamental to the XFS journalling subsystem, this is +actually relatively easy to do - all the changes to logged items are already +tracked in the current infrastructure. The big problem is how to accumulate +them and get them to the log in a consistent, recoverable manner. +Describing the problems and how they have been solved is the focus of this +document. + +One of the key changes that delayed logging makes to the operation of the +journalling subsystem is that it disassociates the amount of outstanding +metadata changes from the size and number of log buffers available. In other +words, instead of there only being a maximum of 2MB of transaction changes not +written to the log at any point in time, there may be a much greater amount +being accumulated in memory. Hence the potential for loss of metadata on a +crash is much greater than for the existing logging mechanism. + +It should be noted that this does not change the guarantee that log recovery +will result in a consistent filesystem. What it does mean is that as far as the +recovered filesystem is concerned, there may be many thousands of transactions +that simply did not occur as a result of the crash. This makes it even more +important that applications that care about their data use fsync() where they +need to ensure application level data integrity is maintained. + +It should be noted that delayed logging is not an innovative new concept that +warrants rigorous proofs to determine whether it is correct or not. The method +of accumulating changes in memory for some period before writing them to the +log is used effectively in many filesystems including ext3 and ext4. Hence +no time is spent in this document trying to convince the reader that the +concept is sound. Instead it is simply considered a "solved problem" and as +such implementing it in XFS is purely an exercise in software engineering. + +The fundamental requirements for delayed logging in XFS are simple: + + 1. Reduce the amount of metadata written to the log by at least + an order of magnitude. + 2. Supply sufficient statistics to validate Requirement #1. + 3. Supply sufficient new tracing infrastructure to be able to debug + problems with the new code. + 4. No on-disk format change (metadata or log format). + 5. Enable and disable with a mount option. + 6. No performance regressions for synchronous transaction workloads. + +Delayed Logging: Design +----------------------- + +Storing Changes + +The problem with accumulating changes at a logical level (i.e. just using the +existing log item dirty region tracking) is that when it comes to writing the +changes to the log buffers, we need to ensure that the object we are formatting +is not changing while we do this. This requires locking the object to prevent +concurrent modification. Hence flushing the logical changes to the log would +require us to lock every object, format them, and then unlock them again. + +This introduces lots of scope for deadlocks with transactions that are already +running. For example, a transaction has object A locked and modified, but needs +the delayed logging tracking lock to commit the transaction. However, the +flushing thread has the delayed logging tracking lock already held, and is +trying to get the lock on object A to flush it to the log buffer. This appears +to be an unsolvable deadlock condition, and it was solving this problem that +was the barrier to implementing delayed logging for so long. + +The solution is relatively simple - it just took a long time to recognise it. +Put simply, the current logging code formats the changes to each item into an +vector array that points to the changed regions in the item. The log write code +simply copies the memory these vectors point to into the log buffer during +transaction commit while the item is locked in the transaction. Instead of +using the log buffer as the destination of the formatting code, we can use an +allocated memory buffer big enough to fit the formatted vector. + +If we then copy the vector into the memory buffer and rewrite the vector to +point to the memory buffer rather than the object itself, we now have a copy of +the changes in a format that is compatible with the log buffer writing code. +that does not require us to lock the item to access. This formatting and +rewriting can all be done while the object is locked during transaction commit, +resulting in a vector that is transactionally consistent and can be accessed +without needing to lock the owning item. + +Hence we avoid the need to lock items when we need to flush outstanding +asynchronous transactions to the log. The differences between the existing +formatting method and the delayed logging formatting can be seen in the +diagram below. + +Current format log vector: + +Object +---------------------------------------------+ +Vector 1 +----+ +Vector 2 +----+ +Vector 3 +----------+ + +After formatting: + +Log Buffer +-V1-+-V2-+----V3----+ + +Delayed logging vector: + +Object +---------------------------------------------+ +Vector 1 +----+ +Vector 2 +----+ +Vector 3 +----------+ + +After formatting: + +Memory Buffer +-V1-+-V2-+----V3----+ +Vector 1 +----+ +Vector 2 +----+ +Vector 3 +----------+ + +The memory buffer and associated vector need to be passed as a single object, +but still need to be associated with the parent object so if the object is +relogged we can replace the current memory buffer with a new memory buffer that +contains the latest changes. + +The reason for keeping the vector around after we've formatted the memory +buffer is to support splitting vectors across log buffer boundaries correctly. +If we don't keep the vector around, we do not know where the region boundaries +are in the item, so we'd need a new encapsulation method for regions in the log +buffer writing (i.e. double encapsulation). This would be an on-disk format +change and as such is not desirable. It also means we'd have to write the log +region headers in the formatting stage, which is problematic as there is per +region state that needs to be placed into the headers during the log write. + +Hence we need to keep the vector, but by attaching the memory buffer to it and +rewriting the vector addresses to point at the memory buffer we end up with a +self-describing object that can be passed to the log buffer write code to be +handled in exactly the same manner as the existing log vectors are handled. +Hence we avoid needing a new on-disk format to handle items that have been +relogged in memory. + + +Tracking Changes + +Now that we can record transactional changes in memory in a form that allows +them to be used without limitations, we need to be able to track and accumulate +them so that they can be written to the log at some later point in time. The +log item is the natural place to store this vector and buffer, and also makes sense +to be the object that is used to track committed objects as it will always +exist once the object has been included in a transaction. + +The log item is already used to track the log items that have been written to +the log but not yet written to disk. Such log items are considered "active" +and as such are stored in the Active Item List (AIL) which is a LSN-ordered +double linked list. Items are inserted into this list during log buffer IO +completion, after which they are unpinned and can be written to disk. An object +that is in the AIL can be relogged, which causes the object to be pinned again +and then moved forward in the AIL when the log buffer IO completes for that +transaction. + +Essentially, this shows that an item that is in the AIL can still be modified +and relogged, so any tracking must be separate to the AIL infrastructure. As +such, we cannot reuse the AIL list pointers for tracking committed items, nor +can we store state in any field that is protected by the AIL lock. Hence the +committed item tracking needs it's own locks, lists and state fields in the log +item. + +Similar to the AIL, tracking of committed items is done through a new list +called the Committed Item List (CIL). The list tracks log items that have been +committed and have formatted memory buffers attached to them. It tracks objects +in transaction commit order, so when an object is relogged it is removed from +it's place in the list and re-inserted at the tail. This is entirely arbitrary +and done to make it easy for debugging - the last items in the list are the +ones that are most recently modified. Ordering of the CIL is not necessary for +transactional integrity (as discussed in the next section) so the ordering is +done for convenience/sanity of the developers. + + +Delayed Logging: Checkpoints + +When we have a log synchronisation event, commonly known as a "log force", +all the items in the CIL must be written into the log via the log buffers. +We need to write these items in the order that they exist in the CIL, and they +need to be written as an atomic transaction. The need for all the objects to be +written as an atomic transaction comes from the requirements of relogging and +log replay - all the changes in all the objects in a given transaction must +either be completely replayed during log recovery, or not replayed at all. If +a transaction is not replayed because it is not complete in the log, then +no later transactions should be replayed, either. + +To fulfill this requirement, we need to write the entire CIL in a single log +transaction. Fortunately, the XFS log code has no fixed limit on the size of a +transaction, nor does the log replay code. The only fundamental limit is that +the transaction cannot be larger than just under half the size of the log. The +reason for this limit is that to find the head and tail of the log, there must +be at least one complete transaction in the log at any given time. If a +transaction is larger than half the log, then there is the possibility that a +crash during the write of a such a transaction could partially overwrite the +only complete previous transaction in the log. This will result in a recovery +failure and an inconsistent filesystem and hence we must enforce the maximum +size of a checkpoint to be slightly less than a half the log. + +Apart from this size requirement, a checkpoint transaction looks no different +to any other transaction - it contains a transaction header, a series of +formatted log items and a commit record at the tail. From a recovery +perspective, the checkpoint transaction is also no different - just a lot +bigger with a lot more items in it. The worst case effect of this is that we +might need to tune the recovery transaction object hash size. + +Because the checkpoint is just another transaction and all the changes to log +items are stored as log vectors, we can use the existing log buffer writing +code to write the changes into the log. To do this efficiently, we need to +minimise the time we hold the CIL locked while writing the checkpoint +transaction. The current log write code enables us to do this easily with the +way it separates the writing of the transaction contents (the log vectors) from +the transaction commit record, but tracking this requires us to have a +per-checkpoint context that travels through the log write process through to +checkpoint completion. + +Hence a checkpoint has a context that tracks the state of the current +checkpoint from initiation to checkpoint completion. A new context is initiated +at the same time a checkpoint transaction is started. That is, when we remove +all the current items from the CIL during a checkpoint operation, we move all +those changes into the current checkpoint context. We then initialise a new +context and attach that to the CIL for aggregation of new transactions. + +This allows us to unlock the CIL immediately after transfer of all the +committed items and effectively allow new transactions to be issued while we +are formatting the checkpoint into the log. It also allows concurrent +checkpoints to be written into the log buffers in the case of log force heavy +workloads, just like the existing transaction commit code does. This, however, +requires that we strictly order the commit records in the log so that +checkpoint sequence order is maintained during log replay. + +To ensure that we can be writing an item into a checkpoint transaction at +the same time another transaction modifies the item and inserts the log item +into the new CIL, then checkpoint transaction commit code cannot use log items +to store the list of log vectors that need to be written into the transaction. +Hence log vectors need to be able to be chained together to allow them to be +detached from the log items. That is, when the CIL is flushed the memory +buffer and log vector attached to each log item needs to be attached to the +checkpoint context so that the log item can be released. In diagrammatic form, +the CIL would look like this before the flush: + + CIL Head + | + V + Log Item <-> log vector 1 -> memory buffer + | -> vector array + V + Log Item <-> log vector 2 -> memory buffer + | -> vector array + V + ...... + | + V + Log Item <-> log vector N-1 -> memory buffer + | -> vector array + V + Log Item <-> log vector N -> memory buffer + -> vector array + +And after the flush the CIL head is empty, and the checkpoint context log +vector list would look like: + + Checkpoint Context + | + V + log vector 1 -> memory buffer + | -> vector array + | -> Log Item + V + log vector 2 -> memory buffer + | -> vector array + | -> Log Item + V + ...... + | + V + log vector N-1 -> memory buffer + | -> vector array + | -> Log Item + V + log vector N -> memory buffer + -> vector array + -> Log Item + +Once this transfer is done, the CIL can be unlocked and new transactions can +start, while the checkpoint flush code works over the log vector chain to +commit the checkpoint. + +Once the checkpoint is written into the log buffers, the checkpoint context is +attached to the log buffer that the commit record was written to along with a +completion callback. Log IO completion will call that callback, which can then +run transaction committed processing for the log items (i.e. insert into AIL +and unpin) in the log vector chain and then free the log vector chain and +checkpoint context. + +Discussion Point: I am uncertain as to whether the log item is the most +efficient way to track vectors, even though it seems like the natural way to do +it. The fact that we walk the log items (in the CIL) just to chain the log +vectors and break the link between the log item and the log vector means that +we take a cache line hit for the log item list modification, then another for +the log vector chaining. If we track by the log vectors, then we only need to +break the link between the log item and the log vector, which means we should +dirty only the log item cachelines. Normally I wouldn't be concerned about one +vs two dirty cachelines except for the fact I've seen upwards of 80,000 log +vectors in one checkpoint transaction. I'd guess this is a "measure and +compare" situation that can be done after a working and reviewed implementation +is in the dev tree.... + +Delayed Logging: Checkpoint Sequencing + +One of the key aspects of the XFS transaction subsystem is that it tags +committed transactions with the log sequence number of the transaction commit. +This allows transactions to be issued asynchronously even though there may be +future operations that cannot be completed until that transaction is fully +committed to the log. In the rare case that a dependent operation occurs (e.g. +re-using a freed metadata extent for a data extent), a special, optimised log +force can be issued to force the dependent transaction to disk immediately. + +To do this, transactions need to record the LSN of the commit record of the +transaction. This LSN comes directly from the log buffer the transaction is +written into. While this works just fine for the existing transaction +mechanism, it does not work for delayed logging because transactions are not +written directly into the log buffers. Hence some other method of sequencing +transactions is required. + +As discussed in the checkpoint section, delayed logging uses per-checkpoint +contexts, and as such it is simple to assign a sequence number to each +checkpoint. Because the switching of checkpoint contexts must be done +atomically, it is simple to ensure that each new context has a monotonically +increasing sequence number assigned to it without the need for an external +atomic counter - we can just take the current context sequence number and add +one to it for the new context. + +Then, instead of assigning a log buffer LSN to the transaction commit LSN +during the commit, we can assign the current checkpoint sequence. This allows +operations that track transactions that have not yet completed know what +checkpoint sequence needs to be committed before they can continue. As a +result, the code that forces the log to a specific LSN now needs to ensure that +the log forces to a specific checkpoint. + +To ensure that we can do this, we need to track all the checkpoint contexts +that are currently committing to the log. When we flush a checkpoint, the +context gets added to a "committing" list which can be searched. When a +checkpoint commit completes, it is removed from the committing list. Because +the checkpoint context records the LSN of the commit record for the checkpoint, +we can also wait on the log buffer that contains the commit record, thereby +using the existing log force mechanisms to execute synchronous forces. + +It should be noted that the synchronous forces may need to be extended with +mitigation algorithms similar to the current log buffer code to allow +aggregation of multiple synchronous transactions if there are already +synchronous transactions being flushed. Investigation of the performance of the +current design is needed before making any decisions here. + +The main concern with log forces is to ensure that all the previous checkpoints +are also committed to disk before the one we need to wait for. Therefore we +need to check that all the prior contexts in the committing list are also +complete before waiting on the one we need to complete. We do this +synchronisation in the log force code so that we don't need to wait anywhere +else for such serialisation - it only matters when we do a log force. + +The only remaining complexity is that a log force now also has to handle the +case where the forcing sequence number is the same as the current context. That +is, we need to flush the CIL and potentially wait for it to complete. This is a +simple addition to the existing log forcing code to check the sequence numbers +and push if required. Indeed, placing the current sequence checkpoint flush in +the log force code enables the current mechanism for issuing synchronous +transactions to remain untouched (i.e. commit an asynchronous transaction, then +force the log at the LSN of that transaction) and so the higher level code +behaves the same regardless of whether delayed logging is being used or not. + +Delayed Logging: Checkpoint Log Space Accounting + +The big issue for a checkpoint transaction is the log space reservation for the +transaction. We don't know how big a checkpoint transaction is going to be +ahead of time, nor how many log buffers it will take to write out, nor the +number of split log vector regions are going to be used. We can track the +amount of log space required as we add items to the commit item list, but we +still need to reserve the space in the log for the checkpoint. + +A typical transaction reserves enough space in the log for the worst case space +usage of the transaction. The reservation accounts for log record headers, +transaction and region headers, headers for split regions, buffer tail padding, +etc. as well as the actual space for all the changed metadata in the +transaction. While some of this is fixed overhead, much of it is dependent on +the size of the transaction and the number of regions being logged (the number +of log vectors in the transaction). + +An example of the differences would be logging directory changes versus logging +inode changes. If you modify lots of inode cores (e.g. chmod -R g+w *), then +there are lots of transactions that only contain an inode core and an inode log +format structure. That is, two vectors totaling roughly 150 bytes. If we modify +10,000 inodes, we have about 1.5MB of metadata to write in 20,000 vectors. Each +vector is 12 bytes, so the total to be logged is approximately 1.75MB. In +comparison, if we are logging full directory buffers, they are typically 4KB +each, so we in 1.5MB of directory buffers we'd have roughly 400 buffers and a +buffer format structure for each buffer - roughly 800 vectors or 1.51MB total +space. From this, it should be obvious that a static log space reservation is +not particularly flexible and is difficult to select the "optimal value" for +all workloads. + +Further, if we are going to use a static reservation, which bit of the entire +reservation does it cover? We account for space used by the transaction +reservation by tracking the space currently used by the object in the CIL and +then calculating the increase or decrease in space used as the object is +relogged. This allows for a checkpoint reservation to only have to account for +log buffer metadata used such as log header records. + +However, even using a static reservation for just the log metadata is +problematic. Typically log record headers use at least 16KB of log space per +1MB of log space consumed (512 bytes per 32k) and the reservation needs to be +large enough to handle arbitrary sized checkpoint transactions. This +reservation needs to be made before the checkpoint is started, and we need to +be able to reserve the space without sleeping. For a 8MB checkpoint, we need a +reservation of around 150KB, which is a non-trivial amount of space. + +A static reservation needs to manipulate the log grant counters - we can take a +permanent reservation on the space, but we still need to make sure we refresh +the write reservation (the actual space available to the transaction) after +every checkpoint transaction completion. Unfortunately, if this space is not +available when required, then the regrant code will sleep waiting for it. + +The problem with this is that it can lead to deadlocks as we may need to commit +checkpoints to be able to free up log space (refer back to the description of +rolling transactions for an example of this). Hence we *must* always have +space available in the log if we are to use static reservations, and that is +very difficult and complex to arrange. It is possible to do, but there is a +simpler way. + +The simpler way of doing this is tracking the entire log space used by the +items in the CIL and using this to dynamically calculate the amount of log +space required by the log metadata. If this log metadata space changes as a +result of a transaction commit inserting a new memory buffer into the CIL, then +the difference in space required is removed from the transaction that causes +the change. Transactions at this level will *always* have enough space +available in their reservation for this as they have already reserved the +maximal amount of log metadata space they require, and such a delta reservation +will always be less than or equal to the maximal amount in the reservation. + +Hence we can grow the checkpoint transaction reservation dynamically as items +are added to the CIL and avoid the need for reserving and regranting log space +up front. This avoids deadlocks and removes a blocking point from the +checkpoint flush code. + +As mentioned early, transactions can't grow to more than half the size of the +log. Hence as part of the reservation growing, we need to also check the size +of the reservation against the maximum allowed transaction size. If we reach +the maximum threshold, we need to push the CIL to the log. This is effectively +a "background flush" and is done on demand. This is identical to +a CIL push triggered by a log force, only that there is no waiting for the +checkpoint commit to complete. This background push is checked and executed by +transaction commit code. + +If the transaction subsystem goes idle while we still have items in the CIL, +they will be flushed by the periodic log force issued by the xfssyncd. This log +force will push the CIL to disk, and if the transaction subsystem stays idle, +allow the idle log to be covered (effectively marked clean) in exactly the same +manner that is done for the existing logging method. A discussion point is +whether this log force needs to be done more frequently than the current rate +which is once every 30s. + + +Delayed Logging: Log Item Pinning + +Currently log items are pinned during transaction commit while the items are +still locked. This happens just after the items are formatted, though it could +be done any time before the items are unlocked. The result of this mechanism is +that items get pinned once for every transaction that is committed to the log +buffers. Hence items that are relogged in the log buffers will have a pin count +for every outstanding transaction they were dirtied in. When each of these +transactions is completed, they will unpin the item once. As a result, the item +only becomes unpinned when all the transactions complete and there are no +pending transactions. Thus the pinning and unpinning of a log item is symmetric +as there is a 1:1 relationship with transaction commit and log item completion. + +For delayed logging, however, we have an asymmetric transaction commit to +completion relationship. Every time an object is relogged in the CIL it goes +through the commit process without a corresponding completion being registered. +That is, we now have a many-to-one relationship between transaction commit and +log item completion. The result of this is that pinning and unpinning of the +log items becomes unbalanced if we retain the "pin on transaction commit, unpin +on transaction completion" model. + +To keep pin/unpin symmetry, the algorithm needs to change to a "pin on +insertion into the CIL, unpin on checkpoint completion". In other words, the +pinning and unpinning becomes symmetric around a checkpoint context. We have to +pin the object the first time it is inserted into the CIL - if it is already in +the CIL during a transaction commit, then we do not pin it again. Because there +can be multiple outstanding checkpoint contexts, we can still see elevated pin +counts, but as each checkpoint completes the pin count will retain the correct +value according to it's context. + +Just to make matters more slightly more complex, this checkpoint level context +for the pin count means that the pinning of an item must take place under the +CIL commit/flush lock. If we pin the object outside this lock, we cannot +guarantee which context the pin count is associated with. This is because of +the fact pinning the item is dependent on whether the item is present in the +current CIL or not. If we don't pin the CIL first before we check and pin the +object, we have a race with CIL being flushed between the check and the pin +(or not pinning, as the case may be). Hence we must hold the CIL flush/commit +lock to guarantee that we pin the items correctly. + +Delayed Logging: Concurrent Scalability + +A fundamental requirement for the CIL is that accesses through transaction +commits must scale to many concurrent commits. The current transaction commit +code does not break down even when there are transactions coming from 2048 +processors at once. The current transaction code does not go any faster than if +there was only one CPU using it, but it does not slow down either. + +As a result, the delayed logging transaction commit code needs to be designed +for concurrency from the ground up. It is obvious that there are serialisation +points in the design - the three important ones are: + + 1. Locking out new transaction commits while flushing the CIL + 2. Adding items to the CIL and updating item space accounting + 3. Checkpoint commit ordering + +Looking at the transaction commit and CIL flushing interactions, it is clear +that we have a many-to-one interaction here. That is, the only restriction on +the number of concurrent transactions that can be trying to commit at once is +the amount of space available in the log for their reservations. The practical +limit here is in the order of several hundred concurrent transactions for a +128MB log, which means that it is generally one per CPU in a machine. + +The amount of time a transaction commit needs to hold out a flush is a +relatively long period of time - the pinning of log items needs to be done +while we are holding out a CIL flush, so at the moment that means it is held +across the formatting of the objects into memory buffers (i.e. while memcpy()s +are in progress). Ultimately a two pass algorithm where the formatting is done +separately to the pinning of objects could be used to reduce the hold time of +the transaction commit side. + +Because of the number of potential transaction commit side holders, the lock +really needs to be a sleeping lock - if the CIL flush takes the lock, we do not +want every other CPU in the machine spinning on the CIL lock. Given that +flushing the CIL could involve walking a list of tens of thousands of log +items, it will get held for a significant time and so spin contention is a +significant concern. Preventing lots of CPUs spinning doing nothing is the +main reason for choosing a sleeping lock even though nothing in either the +transaction commit or CIL flush side sleeps with the lock held. + +It should also be noted that CIL flushing is also a relatively rare operation +compared to transaction commit for asynchronous transaction workloads - only +time will tell if using a read-write semaphore for exclusion will limit +transaction commit concurrency due to cache line bouncing of the lock on the +read side. + +The second serialisation point is on the transaction commit side where items +are inserted into the CIL. Because transactions can enter this code +concurrently, the CIL needs to be protected separately from the above +commit/flush exclusion. It also needs to be an exclusive lock but it is only +held for a very short time and so a spin lock is appropriate here. It is +possible that this lock will become a contention point, but given the short +hold time once per transaction I think that contention is unlikely. + +The final serialisation point is the checkpoint commit record ordering code +that is run as part of the checkpoint commit and log force sequencing. The code +path that triggers a CIL flush (i.e. whatever triggers the log force) will enter +an ordering loop after writing all the log vectors into the log buffers but +before writing the commit record. This loop walks the list of committing +checkpoints and needs to block waiting for checkpoints to complete their commit +record write. As a result it needs a lock and a wait variable. Log force +sequencing also requires the same lock, list walk, and blocking mechanism to +ensure completion of checkpoints. + +These two sequencing operations can use the mechanism even though the +events they are waiting for are different. The checkpoint commit record +sequencing needs to wait until checkpoint contexts contain a commit LSN +(obtained through completion of a commit record write) while log force +sequencing needs to wait until previous checkpoint contexts are removed from +the committing list (i.e. they've completed). A simple wait variable and +broadcast wakeups (thundering herds) has been used to implement these two +serialisation queues. They use the same lock as the CIL, too. If we see too +much contention on the CIL lock, or too many context switches as a result of +the broadcast wakeups these operations can be put under a new spinlock and +given separate wait lists to reduce lock contention and the number of processes +woken by the wrong event. + + +Lifecycle Changes + +The existing log item life cycle is as follows: + + 1. Transaction allocate + 2. Transaction reserve + 3. Lock item + 4. Join item to transaction + If not already attached, + Allocate log item + Attach log item to owner item + Attach log item to transaction + 5. Modify item + Record modifications in log item + 6. Transaction commit + Pin item in memory + Format item into log buffer + Write commit LSN into transaction + Unlock item + Attach transaction to log buffer + + <log buffer IO dispatched> + <log buffer IO completes> + + 7. Transaction completion + Mark log item committed + Insert log item into AIL + Write commit LSN into log item + Unpin log item + 8. AIL traversal + Lock item + Mark log item clean + Flush item to disk + + <item IO completion> + + 9. Log item removed from AIL + Moves log tail + Item unlocked + +Essentially, steps 1-6 operate independently from step 7, which is also +independent of steps 8-9. An item can be locked in steps 1-6 or steps 8-9 +at the same time step 7 is occurring, but only steps 1-6 or 8-9 can occur +at the same time. If the log item is in the AIL or between steps 6 and 7 +and steps 1-6 are re-entered, then the item is relogged. Only when steps 8-9 +are entered and completed is the object considered clean. + +With delayed logging, there are new steps inserted into the life cycle: + + 1. Transaction allocate + 2. Transaction reserve + 3. Lock item + 4. Join item to transaction + If not already attached, + Allocate log item + Attach log item to owner item + Attach log item to transaction + 5. Modify item + Record modifications in log item + 6. Transaction commit + Pin item in memory if not pinned in CIL + Format item into log vector + buffer + Attach log vector and buffer to log item + Insert log item into CIL + Write CIL context sequence into transaction + Unlock item + + <next log force> + + 7. CIL push + lock CIL flush + Chain log vectors and buffers together + Remove items from CIL + unlock CIL flush + write log vectors into log + sequence commit records + attach checkpoint context to log buffer + + <log buffer IO dispatched> + <log buffer IO completes> + + 8. Checkpoint completion + Mark log item committed + Insert item into AIL + Write commit LSN into log item + Unpin log item + 9. AIL traversal + Lock item + Mark log item clean + Flush item to disk + <item IO completion> + 10. Log item removed from AIL + Moves log tail + Item unlocked + +From this, it can be seen that the only life cycle differences between the two +logging methods are in the middle of the life cycle - they still have the same +beginning and end and execution constraints. The only differences are in the +committing of the log items to the log itself and the completion processing. +Hence delayed logging should not introduce any constraints on log item +behaviour, allocation or freeing that don't already exist. + +As a result of this zero-impact "insertion" of delayed logging infrastructure +and the design of the internal structures to avoid on disk format changes, we +can basically switch between delayed logging and the existing mechanism with a +mount option. Fundamentally, there is no reason why the log manager would not +be able to swap methods automatically and transparently depending on load +characteristics, but this should not be necessary if delayed logging works as +designed. diff --git a/Documentation/filesystems/xfs-self-describing-metadata.txt b/Documentation/filesystems/xfs-self-describing-metadata.txt new file mode 100644 index 00000000000..05aa455163e --- /dev/null +++ b/Documentation/filesystems/xfs-self-describing-metadata.txt @@ -0,0 +1,350 @@ +XFS Self Describing Metadata +---------------------------- + +Introduction +------------ + +The largest scalability problem facing XFS is not one of algorithmic +scalability, but of verification of the filesystem structure. Scalabilty of the +structures and indexes on disk and the algorithms for iterating them are +adequate for supporting PB scale filesystems with billions of inodes, however it +is this very scalability that causes the verification problem. + +Almost all metadata on XFS is dynamically allocated. The only fixed location +metadata is the allocation group headers (SB, AGF, AGFL and AGI), while all +other metadata structures need to be discovered by walking the filesystem +structure in different ways. While this is already done by userspace tools for +validating and repairing the structure, there are limits to what they can +verify, and this in turn limits the supportable size of an XFS filesystem. + +For example, it is entirely possible to manually use xfs_db and a bit of +scripting to analyse the structure of a 100TB filesystem when trying to +determine the root cause of a corruption problem, but it is still mainly a +manual task of verifying that things like single bit errors or misplaced writes +weren't the ultimate cause of a corruption event. It may take a few hours to a +few days to perform such forensic analysis, so for at this scale root cause +analysis is entirely possible. + +However, if we scale the filesystem up to 1PB, we now have 10x as much metadata +to analyse and so that analysis blows out towards weeks/months of forensic work. +Most of the analysis work is slow and tedious, so as the amount of analysis goes +up, the more likely that the cause will be lost in the noise. Hence the primary +concern for supporting PB scale filesystems is minimising the time and effort +required for basic forensic analysis of the filesystem structure. + + +Self Describing Metadata +------------------------ + +One of the problems with the current metadata format is that apart from the +magic number in the metadata block, we have no other way of identifying what it +is supposed to be. We can't even identify if it is the right place. Put simply, +you can't look at a single metadata block in isolation and say "yes, it is +supposed to be there and the contents are valid". + +Hence most of the time spent on forensic analysis is spent doing basic +verification of metadata values, looking for values that are in range (and hence +not detected by automated verification checks) but are not correct. Finding and +understanding how things like cross linked block lists (e.g. sibling +pointers in a btree end up with loops in them) are the key to understanding what +went wrong, but it is impossible to tell what order the blocks were linked into +each other or written to disk after the fact. + +Hence we need to record more information into the metadata to allow us to +quickly determine if the metadata is intact and can be ignored for the purpose +of analysis. We can't protect against every possible type of error, but we can +ensure that common types of errors are easily detectable. Hence the concept of +self describing metadata. + +The first, fundamental requirement of self describing metadata is that the +metadata object contains some form of unique identifier in a well known +location. This allows us to identify the expected contents of the block and +hence parse and verify the metadata object. IF we can't independently identify +the type of metadata in the object, then the metadata doesn't describe itself +very well at all! + +Luckily, almost all XFS metadata has magic numbers embedded already - only the +AGFL, remote symlinks and remote attribute blocks do not contain identifying +magic numbers. Hence we can change the on-disk format of all these objects to +add more identifying information and detect this simply by changing the magic +numbers in the metadata objects. That is, if it has the current magic number, +the metadata isn't self identifying. If it contains a new magic number, it is +self identifying and we can do much more expansive automated verification of the +metadata object at runtime, during forensic analysis or repair. + +As a primary concern, self describing metadata needs some form of overall +integrity checking. We cannot trust the metadata if we cannot verify that it has +not been changed as a result of external influences. Hence we need some form of +integrity check, and this is done by adding CRC32c validation to the metadata +block. If we can verify the block contains the metadata it was intended to +contain, a large amount of the manual verification work can be skipped. + +CRC32c was selected as metadata cannot be more than 64k in length in XFS and +hence a 32 bit CRC is more than sufficient to detect multi-bit errors in +metadata blocks. CRC32c is also now hardware accelerated on common CPUs so it is +fast. So while CRC32c is not the strongest of possible integrity checks that +could be used, it is more than sufficient for our needs and has relatively +little overhead. Adding support for larger integrity fields and/or algorithms +does really provide any extra value over CRC32c, but it does add a lot of +complexity and so there is no provision for changing the integrity checking +mechanism. + +Self describing metadata needs to contain enough information so that the +metadata block can be verified as being in the correct place without needing to +look at any other metadata. This means it needs to contain location information. +Just adding a block number to the metadata is not sufficient to protect against +mis-directed writes - a write might be misdirected to the wrong LUN and so be +written to the "correct block" of the wrong filesystem. Hence location +information must contain a filesystem identifier as well as a block number. + +Another key information point in forensic analysis is knowing who the metadata +block belongs to. We already know the type, the location, that it is valid +and/or corrupted, and how long ago that it was last modified. Knowing the owner +of the block is important as it allows us to find other related metadata to +determine the scope of the corruption. For example, if we have a extent btree +object, we don't know what inode it belongs to and hence have to walk the entire +filesystem to find the owner of the block. Worse, the corruption could mean that +no owner can be found (i.e. it's an orphan block), and so without an owner field +in the metadata we have no idea of the scope of the corruption. If we have an +owner field in the metadata object, we can immediately do top down validation to +determine the scope of the problem. + +Different types of metadata have different owner identifiers. For example, +directory, attribute and extent tree blocks are all owned by an inode, whilst +freespace btree blocks are owned by an allocation group. Hence the size and +contents of the owner field are determined by the type of metadata object we are +looking at. The owner information can also identify misplaced writes (e.g. +freespace btree block written to the wrong AG). + +Self describing metadata also needs to contain some indication of when it was +written to the filesystem. One of the key information points when doing forensic +analysis is how recently the block was modified. Correlation of set of corrupted +metadata blocks based on modification times is important as it can indicate +whether the corruptions are related, whether there's been multiple corruption +events that lead to the eventual failure, and even whether there are corruptions +present that the run-time verification is not detecting. + +For example, we can determine whether a metadata object is supposed to be free +space or still allocated if it is still referenced by its owner by looking at +when the free space btree block that contains the block was last written +compared to when the metadata object itself was last written. If the free space +block is more recent than the object and the object's owner, then there is a +very good chance that the block should have been removed from the owner. + +To provide this "written timestamp", each metadata block gets the Log Sequence +Number (LSN) of the most recent transaction it was modified on written into it. +This number will always increase over the life of the filesystem, and the only +thing that resets it is running xfs_repair on the filesystem. Further, by use of +the LSN we can tell if the corrupted metadata all belonged to the same log +checkpoint and hence have some idea of how much modification occurred between +the first and last instance of corrupt metadata on disk and, further, how much +modification occurred between the corruption being written and when it was +detected. + +Runtime Validation +------------------ + +Validation of self-describing metadata takes place at runtime in two places: + + - immediately after a successful read from disk + - immediately prior to write IO submission + +The verification is completely stateless - it is done independently of the +modification process, and seeks only to check that the metadata is what it says +it is and that the metadata fields are within bounds and internally consistent. +As such, we cannot catch all types of corruption that can occur within a block +as there may be certain limitations that operational state enforces of the +metadata, or there may be corruption of interblock relationships (e.g. corrupted +sibling pointer lists). Hence we still need stateful checking in the main code +body, but in general most of the per-field validation is handled by the +verifiers. + +For read verification, the caller needs to specify the expected type of metadata +that it should see, and the IO completion process verifies that the metadata +object matches what was expected. If the verification process fails, then it +marks the object being read as EFSCORRUPTED. The caller needs to catch this +error (same as for IO errors), and if it needs to take special action due to a +verification error it can do so by catching the EFSCORRUPTED error value. If we +need more discrimination of error type at higher levels, we can define new +error numbers for different errors as necessary. + +The first step in read verification is checking the magic number and determining +whether CRC validating is necessary. If it is, the CRC32c is calculated and +compared against the value stored in the object itself. Once this is validated, +further checks are made against the location information, followed by extensive +object specific metadata validation. If any of these checks fail, then the +buffer is considered corrupt and the EFSCORRUPTED error is set appropriately. + +Write verification is the opposite of the read verification - first the object +is extensively verified and if it is OK we then update the LSN from the last +modification made to the object, After this, we calculate the CRC and insert it +into the object. Once this is done the write IO is allowed to continue. If any +error occurs during this process, the buffer is again marked with a EFSCORRUPTED +error for the higher layers to catch. + +Structures +---------- + +A typical on-disk structure needs to contain the following information: + +struct xfs_ondisk_hdr { + __be32 magic; /* magic number */ + __be32 crc; /* CRC, not logged */ + uuid_t uuid; /* filesystem identifier */ + __be64 owner; /* parent object */ + __be64 blkno; /* location on disk */ + __be64 lsn; /* last modification in log, not logged */ +}; + +Depending on the metadata, this information may be part of a header structure +separate to the metadata contents, or may be distributed through an existing +structure. The latter occurs with metadata that already contains some of this +information, such as the superblock and AG headers. + +Other metadata may have different formats for the information, but the same +level of information is generally provided. For example: + + - short btree blocks have a 32 bit owner (ag number) and a 32 bit block + number for location. The two of these combined provide the same + information as @owner and @blkno in eh above structure, but using 8 + bytes less space on disk. + + - directory/attribute node blocks have a 16 bit magic number, and the + header that contains the magic number has other information in it as + well. hence the additional metadata headers change the overall format + of the metadata. + +A typical buffer read verifier is structured as follows: + +#define XFS_FOO_CRC_OFF offsetof(struct xfs_ondisk_hdr, crc) + +static void +xfs_foo_read_verify( + struct xfs_buf *bp) +{ + struct xfs_mount *mp = bp->b_target->bt_mount; + + if ((xfs_sb_version_hascrc(&mp->m_sb) && + !xfs_verify_cksum(bp->b_addr, BBTOB(bp->b_length), + XFS_FOO_CRC_OFF)) || + !xfs_foo_verify(bp)) { + XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp, bp->b_addr); + xfs_buf_ioerror(bp, EFSCORRUPTED); + } +} + +The code ensures that the CRC is only checked if the filesystem has CRCs enabled +by checking the superblock of the feature bit, and then if the CRC verifies OK +(or is not needed) it verifies the actual contents of the block. + +The verifier function will take a couple of different forms, depending on +whether the magic number can be used to determine the format of the block. In +the case it can't, the code is structured as follows: + +static bool +xfs_foo_verify( + struct xfs_buf *bp) +{ + struct xfs_mount *mp = bp->b_target->bt_mount; + struct xfs_ondisk_hdr *hdr = bp->b_addr; + + if (hdr->magic != cpu_to_be32(XFS_FOO_MAGIC)) + return false; + + if (!xfs_sb_version_hascrc(&mp->m_sb)) { + if (!uuid_equal(&hdr->uuid, &mp->m_sb.sb_uuid)) + return false; + if (bp->b_bn != be64_to_cpu(hdr->blkno)) + return false; + if (hdr->owner == 0) + return false; + } + + /* object specific verification checks here */ + + return true; +} + +If there are different magic numbers for the different formats, the verifier +will look like: + +static bool +xfs_foo_verify( + struct xfs_buf *bp) +{ + struct xfs_mount *mp = bp->b_target->bt_mount; + struct xfs_ondisk_hdr *hdr = bp->b_addr; + + if (hdr->magic == cpu_to_be32(XFS_FOO_CRC_MAGIC)) { + if (!uuid_equal(&hdr->uuid, &mp->m_sb.sb_uuid)) + return false; + if (bp->b_bn != be64_to_cpu(hdr->blkno)) + return false; + if (hdr->owner == 0) + return false; + } else if (hdr->magic != cpu_to_be32(XFS_FOO_MAGIC)) + return false; + + /* object specific verification checks here */ + + return true; +} + +Write verifiers are very similar to the read verifiers, they just do things in +the opposite order to the read verifiers. A typical write verifier: + +static void +xfs_foo_write_verify( + struct xfs_buf *bp) +{ + struct xfs_mount *mp = bp->b_target->bt_mount; + struct xfs_buf_log_item *bip = bp->b_fspriv; + + if (!xfs_foo_verify(bp)) { + XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp, bp->b_addr); + xfs_buf_ioerror(bp, EFSCORRUPTED); + return; + } + + if (!xfs_sb_version_hascrc(&mp->m_sb)) + return; + + + if (bip) { + struct xfs_ondisk_hdr *hdr = bp->b_addr; + hdr->lsn = cpu_to_be64(bip->bli_item.li_lsn); + } + xfs_update_cksum(bp->b_addr, BBTOB(bp->b_length), XFS_FOO_CRC_OFF); +} + +This will verify the internal structure of the metadata before we go any +further, detecting corruptions that have occurred as the metadata has been +modified in memory. If the metadata verifies OK, and CRCs are enabled, we then +update the LSN field (when it was last modified) and calculate the CRC on the +metadata. Once this is done, we can issue the IO. + +Inodes and Dquots +----------------- + +Inodes and dquots are special snowflakes. They have per-object CRC and +self-identifiers, but they are packed so that there are multiple objects per +buffer. Hence we do not use per-buffer verifiers to do the work of per-object +verification and CRC calculations. The per-buffer verifiers simply perform basic +identification of the buffer - that they contain inodes or dquots, and that +there are magic numbers in all the expected spots. All further CRC and +verification checks are done when each inode is read from or written back to the +buffer. + +The structure of the verifiers and the identifiers checks is very similar to the +buffer code described above. The only difference is where they are called. For +example, inode read verification is done in xfs_iread() when the inode is first +read out of the buffer and the struct xfs_inode is instantiated. The inode is +already extensively verified during writeback in xfs_iflush_int, so the only +addition here is to add the LSN and CRC to the inode as it is copied back into +the buffer. + +XXX: inode unlinked list modification doesn't recalculate the inode CRC! None of +the unlinked list modifications check or update CRCs, neither during unlink nor +log recovery. So, it's gone unnoticed until now. This won't matter immediately - +repair will probably complain about it - but it needs to be fixed. + diff --git a/Documentation/filesystems/xfs.txt b/Documentation/filesystems/xfs.txt index 74aeb142ae5..5be51fd888b 100644 --- a/Documentation/filesystems/xfs.txt +++ b/Documentation/filesystems/xfs.txt @@ -18,6 +18,8 @@ Mount Options ============= When mounting an XFS filesystem, the following options are accepted. +For boolean mount options, the names with the (*) suffix is the +default behaviour. allocsize=size Sets the buffered I/O end-of-file preallocation size when @@ -25,82 +27,128 @@ When mounting an XFS filesystem, the following options are accepted. Valid values for this option are page size (typically 4KiB) through to 1GiB, inclusive, in power-of-2 increments. - attr2/noattr2 - The options enable/disable (default is disabled for backward - compatibility on-disk) an "opportunistic" improvement to be - made in the way inline extended attributes are stored on-disk. - When the new form is used for the first time (by setting or - removing extended attributes) the on-disk superblock feature - bit field will be updated to reflect this format being in use. - - barrier - Enables the use of block layer write barriers for writes into - the journal and unwritten extent conversion. This allows for - drive level write caching to be enabled, for devices that - support write barriers. - - dmapi - Enable the DMAPI (Data Management API) event callouts. - Use with the "mtpt" option. - - grpid/bsdgroups and nogrpid/sysvgroups - These options define what group ID a newly created file gets. - When grpid is set, it takes the group ID of the directory in - which it is created; otherwise (the default) it takes the fsgid - of the current process, unless the directory has the setgid bit - set, in which case it takes the gid from the parent directory, - and also gets the setgid bit set if it is a directory itself. - - ihashsize=value - Sets the number of hash buckets available for hashing the - in-memory inodes of the specified mount point. If a value - of zero is used, the value selected by the default algorithm - will be displayed in /proc/mounts. - - ikeep/noikeep - When inode clusters are emptied of inodes, keep them around - on the disk (ikeep) - this is the traditional XFS behaviour - and is still the default for now. Using the noikeep option, - inode clusters are returned to the free space pool. - - inode64 - Indicates that XFS is allowed to create inodes at any location - in the filesystem, including those which will result in inode - numbers occupying more than 32 bits of significance. This is - provided for backwards compatibility, but causes problems for - backup applications that cannot handle large inode numbers. - - largeio/nolargeio + The default behaviour is for dynamic end-of-file + preallocation size, which uses a set of heuristics to + optimise the preallocation size based on the current + allocation patterns within the file and the access patterns + to the file. Specifying a fixed allocsize value turns off + the dynamic behaviour. + + attr2 + noattr2 + The options enable/disable an "opportunistic" improvement to + be made in the way inline extended attributes are stored + on-disk. When the new form is used for the first time when + attr2 is selected (either when setting or removing extended + attributes) the on-disk superblock feature bit field will be + updated to reflect this format being in use. + + The default behaviour is determined by the on-disk feature + bit indicating that attr2 behaviour is active. If either + mount option it set, then that becomes the new default used + by the filesystem. + + CRC enabled filesystems always use the attr2 format, and so + will reject the noattr2 mount option if it is set. + + barrier (*) + nobarrier + Enables/disables the use of block layer write barriers for + writes into the journal and for data integrity operations. + This allows for drive level write caching to be enabled, for + devices that support write barriers. + + discard + nodiscard (*) + Enable/disable the issuing of commands to let the block + device reclaim space freed by the filesystem. This is + useful for SSD devices, thinly provisioned LUNs and virtual + machine images, but may have a performance impact. + + Note: It is currently recommended that you use the fstrim + application to discard unused blocks rather than the discard + mount option because the performance impact of this option + is quite severe. + + grpid/bsdgroups + nogrpid/sysvgroups (*) + These options define what group ID a newly created file + gets. When grpid is set, it takes the group ID of the + directory in which it is created; otherwise it takes the + fsgid of the current process, unless the directory has the + setgid bit set, in which case it takes the gid from the + parent directory, and also gets the setgid bit set if it is + a directory itself. + + filestreams + Make the data allocator use the filestreams allocation mode + across the entire filesystem rather than just on directories + configured to use it. + + ikeep + noikeep (*) + When ikeep is specified, XFS does not delete empty inode + clusters and keeps them around on disk. When noikeep is + specified, empty inode clusters are returned to the free + space pool. + + inode32 + inode64 (*) + When inode32 is specified, it indicates that XFS limits + inode creation to locations which will not result in inode + numbers with more than 32 bits of significance. + + When inode64 is specified, it indicates that XFS is allowed + to create inodes at any location in the filesystem, + including those which will result in inode numbers occupying + more than 32 bits of significance. + + inode32 is provided for backwards compatibility with older + systems and applications, since 64 bits inode numbers might + cause problems for some applications that cannot handle + large inode numbers. If applications are in use which do + not handle inode numbers bigger than 32 bits, the inode32 + option should be specified. + + + largeio + nolargeio (*) If "nolargeio" is specified, the optimal I/O reported in - st_blksize by stat(2) will be as small as possible to allow user - applications to avoid inefficient read/modify/write I/O. - If "largeio" specified, a filesystem that has a "swidth" specified - will return the "swidth" value (in bytes) in st_blksize. If the - filesystem does not have a "swidth" specified but does specify - an "allocsize" then "allocsize" (in bytes) will be returned - instead. - If neither of these two options are specified, then filesystem - will behave as if "nolargeio" was specified. + st_blksize by stat(2) will be as small as possible to allow + user applications to avoid inefficient read/modify/write + I/O. This is typically the page size of the machine, as + this is the granularity of the page cache. + + If "largeio" specified, a filesystem that was created with a + "swidth" specified will return the "swidth" value (in bytes) + in st_blksize. If the filesystem does not have a "swidth" + specified but does specify an "allocsize" then "allocsize" + (in bytes) will be returned instead. Otherwise the behaviour + is the same as if "nolargeio" was specified. logbufs=value - Set the number of in-memory log buffers. Valid numbers range - from 2-8 inclusive. - The default value is 8 buffers for filesystems with a - blocksize of 64KiB, 4 buffers for filesystems with a blocksize - of 32KiB, 3 buffers for filesystems with a blocksize of 16KiB - and 2 buffers for all other configurations. Increasing the - number of buffers may increase performance on some workloads - at the cost of the memory used for the additional log buffers - and their associated control structures. + Set the number of in-memory log buffers. Valid numbers + range from 2-8 inclusive. + + The default value is 8 buffers. + + If the memory cost of 8 log buffers is too high on small + systems, then it may be reduced at some cost to performance + on metadata intensive workloads. The logbsize option below + controls the size of each buffer and so is also relevant to + this case. logbsize=value - Set the size of each in-memory log buffer. - Size may be specified in bytes, or in kilobytes with a "k" suffix. - Valid sizes for version 1 and version 2 logs are 16384 (16k) and - 32768 (32k). Valid sizes for version 2 logs also include - 65536 (64k), 131072 (128k) and 262144 (256k). - The default value for machines with more than 32MiB of memory - is 32768, machines with less memory use 16384 by default. + Set the size of each in-memory log buffer. The size may be + specified in bytes, or in kilobytes with a "k" suffix. + Valid sizes for version 1 and version 2 logs are 16384 (16k) + and 32768 (32k). Valid sizes for version 2 logs also + include 65536 (64k), 131072 (128k) and 262144 (256k). The + logbsize must be an integer multiple of the log + stripe unit configured at mkfs time. + + The default value for for version 1 logs is 32768, while the + default value for version 2 logs is MAX(32768, log_sunit). logdev=device and rtdev=device Use an external log (metadata journal) and/or real-time device. @@ -109,16 +157,11 @@ When mounting an XFS filesystem, the following options are accepted. optional, and the log section can be separate from the data section or contained within it. - mtpt=mountpoint - Use with the "dmapi" option. The value specified here will be - included in the DMAPI mount event, and should be the path of - the actual mountpoint that is used. - noalign - Data allocations will not be aligned at stripe unit boundaries. - - noatime - Access timestamps are not updated when a file is read. + Data allocations will not be aligned at stripe unit + boundaries. This is only relevant to filesystems created + with non-zero data alignment parameters (sunit, swidth) by + mkfs. norecovery The filesystem will be mounted without running log recovery. @@ -129,19 +172,14 @@ When mounting an XFS filesystem, the following options are accepted. the mount will fail. nouuid - Don't check for double mounted file systems using the file system uuid. - This is useful to mount LVM snapshot volumes. + Don't check for double mounted file systems using the file + system uuid. This is useful to mount LVM snapshot volumes, + and often used in combination with "norecovery" for mounting + read-only snapshots. - osyncisosync - Make O_SYNC writes implement true O_SYNC. WITHOUT this option, - Linux XFS behaves as if an "osyncisdsync" option is used, - which will make writes to files opened with the O_SYNC flag set - behave as if the O_DSYNC flag had been used instead. - This can result in better performance without compromising - data safety. - However if this option is not in effect, timestamp updates from - O_SYNC writes can be lost if the system crashes. - If timestamp updates are critical, use the osyncisosync option. + noquota + Forcibly turns off all quota accounting and enforcement + within the filesystem. uquota/usrquota/uqnoenforce/quota User disk quota accounting enabled, and limits (optionally) @@ -156,24 +194,64 @@ When mounting an XFS filesystem, the following options are accepted. enforced. Refer to xfs_quota(8) for further details. sunit=value and swidth=value - Used to specify the stripe unit and width for a RAID device or - a stripe volume. "value" must be specified in 512-byte block - units. - If this option is not specified and the filesystem was made on - a stripe volume or the stripe width or unit were specified for - the RAID device at mkfs time, then the mount system call will - restore the value from the superblock. For filesystems that - are made directly on RAID devices, these options can be used - to override the information in the superblock if the underlying - disk layout changes after the filesystem has been created. - The "swidth" option is required if the "sunit" option has been - specified, and must be a multiple of the "sunit" value. + Used to specify the stripe unit and width for a RAID device + or a stripe volume. "value" must be specified in 512-byte + block units. These options are only relevant to filesystems + that were created with non-zero data alignment parameters. + + The sunit and swidth parameters specified must be compatible + with the existing filesystem alignment characteristics. In + general, that means the only valid changes to sunit are + increasing it by a power-of-2 multiple. Valid swidth values + are any integer multiple of a valid sunit value. + + Typically the only time these mount options are necessary if + after an underlying RAID device has had it's geometry + modified, such as adding a new disk to a RAID5 lun and + reshaping it. swalloc Data allocations will be rounded up to stripe width boundaries when the current end of file is being extended and the file size is larger than the stripe width size. + wsync + When specified, all filesystem namespace operations are + executed synchronously. This ensures that when the namespace + operation (create, unlink, etc) completes, the change to the + namespace is on stable storage. This is useful in HA setups + where failover must not result in clients seeing + inconsistent namespace presentation during or after a + failover event. + + +Deprecated Mount Options +======================== + + delaylog/nodelaylog + Delayed logging is the only logging method that XFS supports + now, so these mount options are now ignored. + + Due for removal in 3.12. + + ihashsize=value + In memory inode hashes have been removed, so this option has + no function as of August 2007. Option is deprecated. + + Due for removal in 3.12. + + irixsgid + This behaviour is now controlled by a sysctl, so the mount + option is ignored. + + Due for removal in 3.12. + + osyncisdsync + osyncisosync + O_SYNC and O_DSYNC are fully supported, so there is no need + for these options any more. + + Due for removal in 3.12. sysctls ======= @@ -185,15 +263,20 @@ The following sysctls are available for the XFS filesystem: in /proc/fs/xfs/stat. It then immediately resets to "0". fs.xfs.xfssyncd_centisecs (Min: 100 Default: 3000 Max: 720000) - The interval at which the xfssyncd thread flushes metadata - out to disk. This thread will flush log activity out, and - do some processing on unlinked inodes. + The interval at which the filesystem flushes metadata + out to disk and runs internal cache cleanup routines. - fs.xfs.xfsbufd_centisecs (Min: 50 Default: 100 Max: 3000) - The interval at which xfsbufd scans the dirty metadata buffers list. + fs.xfs.filestream_centisecs (Min: 1 Default: 3000 Max: 360000) + The interval at which the filesystem ages filestreams cache + references and returns timed-out AGs back to the free stream + pool. - fs.xfs.age_buffer_centisecs (Min: 100 Default: 1500 Max: 720000) - The age at which xfsbufd flushes dirty metadata buffers to disk. + fs.xfs.speculative_prealloc_lifetime + (Units: seconds Min: 1 Default: 300 Max: 86400) + The interval at which the background scanning for inodes + with unused speculative preallocation runs. The scan + removes unused preallocation from clean inodes and releases + the unused space back to the free pool. fs.xfs.error_level (Min: 0 Default: 3 Max: 11) A volume knob for error reporting when internal errors occur. @@ -230,10 +313,6 @@ The following sysctls are available for the XFS filesystem: ISGID bit is cleared if the irix_sgid_inherit compatibility sysctl is set. - fs.xfs.restrict_chown (Min: 0 Default: 1 Max: 1) - Controls whether unprivileged users can use chown to "give away" - a file to another user. - fs.xfs.inherit_sync (Min: 0 Default: 1 Max: 1) Setting this to "1" will cause the "sync" flag set by the xfs_io(8) chattr command on a directory to be @@ -254,9 +333,31 @@ The following sysctls are available for the XFS filesystem: by the xfs_io(8) chattr command on a directory to be inherited by files in that directory. + fs.xfs.inherit_nodefrag (Min: 0 Default: 1 Max: 1) + Setting this to "1" will cause the "nodefrag" flag set + by the xfs_io(8) chattr command on a directory to be + inherited by files in that directory. + fs.xfs.rotorstep (Min: 1 Default: 1 Max: 256) In "inode32" allocation mode, this option determines how many files the allocator attempts to allocate in the same allocation group before moving to the next allocation group. The intent is to control the rate at which the allocator moves between allocation groups when allocating extents for new files. + +Deprecated Sysctls +================== + + fs.xfs.xfsbufd_centisecs (Min: 50 Default: 100 Max: 3000) + Dirty metadata is now tracked by the log subsystem and + flushing is driven by log space and idling demands. The + xfsbufd no longer exists, so this syctl does nothing. + + Due for removal in 3.14. + + fs.xfs.age_buffer_centisecs (Min: 100 Default: 1500 Max: 720000) + Dirty metadata is now tracked by the log subsystem and + flushing is driven by log space and idling demands. The + xfsbufd no longer exists, so this syctl does nothing. + + Due for removal in 3.14. diff --git a/Documentation/filesystems/xip.txt b/Documentation/filesystems/xip.txt index 3cc4010521a..0466ee56927 100644 --- a/Documentation/filesystems/xip.txt +++ b/Documentation/filesystems/xip.txt @@ -39,10 +39,11 @@ The block device operation is optional, these block devices support it as of today: - dcssblk: s390 dcss block device driver -An address space operation named get_xip_page is used to retrieve reference -to a struct page. To address the target page, a reference to an address_space, -and a sector number is provided. A 3rd argument indicates whether the -function should allocate blocks if needed. +An address space operation named get_xip_mem is used to retrieve references +to a page frame number and a kernel address. To obtain these values a reference +to an address_space is provided. This function assigns values to the kmem and +pfn parameters. The third argument indicates whether the function should allocate +blocks if needed. This address space operation is mutually exclusive with readpage&writepage that do page cache read/write operations. |
