diff options
Diffstat (limited to 'Documentation/filesystems')
112 files changed, 21915 insertions, 7016 deletions
diff --git a/Documentation/filesystems/00-INDEX b/Documentation/filesystems/00-INDEX index bcfbab899b3..ac28149aede 100644 --- a/Documentation/filesystems/00-INDEX +++ b/Documentation/filesystems/00-INDEX @@ -2,49 +2,157 @@ - this file (info on some of the filesystems supported by linux). Locking - info on locking rules as they pertain to Linux VFS. +Makefile + - Makefile for building the filsystems-part of DocBook. +9p.txt + - 9p (v9fs) is an implementation of the Plan 9 remote fs protocol. adfs.txt - info and mount options for the Acorn Advanced Disc Filing System. +afs.txt + - info and examples for the distributed AFS (Andrew File System) fs. affs.txt - info and mount options for the Amiga Fast File System. +autofs4-mount-control.txt + - info on device control operations for autofs4 module. +automount-support.txt + - information about filesystem automount support. +befs.txt + - information about the BeOS filesystem for Linux. bfs.txt - info for the SCO UnixWare Boot Filesystem (BFS). -cifs.txt - - description of the CIFS filesystem +btrfs.txt + - info for the BTRFS filesystem. +caching/ + - directory containing filesystem cache documentation. +ceph.txt + - info for the Ceph Distributed File System. +cifs/ + - directory containing CIFS filesystem documentation and example code. coda.txt - description of the CODA filesystem. +configfs/ + - directory containing configfs documentation and example code. cramfs.txt - - info on the cram filesystem for small storage (ROMs etc) -devfs/ - - directory containing devfs documentation. + - info on the cram filesystem for small storage (ROMs etc). +debugfs.txt + - info on the debugfs filesystem. +devpts.txt + - info on the devpts filesystem. +directory-locking + - info about the locking scheme used for directory operations. +dlmfs.txt + - info on the userspace interface to the OCFS2 DLM. +dnotify.txt + - info about directory notification in Linux. +dnotify_test.c + - example program for dnotify. +ecryptfs.txt + - docs on eCryptfs: stacked cryptographic filesystem for Linux. +efivarfs.txt + - info for the efivarfs filesystem. +exofs.txt + - info, usage, mount options, design about EXOFS. ext2.txt - info, mount options and specifications for the Ext2 filesystem. -fat_cvf.txt - - info on the Compressed Volume Files extension to the FAT filesystem +ext3.txt + - info, mount options and specifications for the Ext3 filesystem. +ext4.txt + - info, mount options and specifications for the Ext4 filesystem. +f2fs.txt + - info and mount options for the F2FS filesystem. +fiemap.txt + - info on fiemap ioctl. +files.txt + - info on file management in the Linux kernel. +fuse.txt + - info on the Filesystem in User SpacE including mount options. +gfs2-glocks.txt + - info on the Global File System 2 - Glock internal locking rules. +gfs2-uevents.txt + - info on the Global File System 2 - uevents. +gfs2.txt + - info on the Global File System 2. +hfs.txt + - info on the Macintosh HFS Filesystem for Linux. +hfsplus.txt + - info on the Macintosh HFSPlus Filesystem for Linux. hpfs.txt - info and mount options for the OS/2 HPFS. +inotify.txt + - info on the powerful yet simple file change notification system. isofs.txt - info and mount options for the ISO 9660 (CDROM) filesystem. jfs.txt - info and mount options for the JFS filesystem. +locks.txt + - info on file locking implementations, flock() vs. fcntl(), etc. +logfs.txt + - info on the LogFS flash filesystem. +mandatory-locking.txt + - info on the Linux implementation of Sys V mandatory file locking. ncpfs.txt - info on Novell Netware(tm) filesystem using NCP protocol. +nfs/ + - nfs-related documentation. +nilfs2.txt + - info and mount options for the NILFS2 filesystem. ntfs.txt - info and mount options for the NTFS filesystem (Windows NT). +ocfs2.txt + - info and mount options for the OCFS2 clustered filesystem. +omfs.txt + - info on the Optimized MPEG FileSystem. +path-lookup.txt + - info on path walking and name lookup locking. +pohmelfs/ + - directory containing pohmelfs filesystem documentation. +porting + - various information on filesystem porting. proc.txt - info on Linux's /proc filesystem. +qnx6.txt + - info on the QNX6 filesystem. +quota.txt + - info on Quota subsystem. +ramfs-rootfs-initramfs.txt + - info on the 'in memory' filesystems ramfs, rootfs and initramfs. +relay.txt + - info on relay, for efficient streaming from kernel to user space. romfs.txt - - Description of the ROMFS filesystem. -smbfs.txt - - info on using filesystems with the SMB protocol (Windows 3.11 and NT) + - description of the ROMFS filesystem. +seq_file.txt + - how to use the seq_file API. +sharedsubtree.txt + - a description of shared subtrees for namespaces. +spufs.txt + - info and mount options for the SPU filesystem used on Cell. +squashfs.txt + - info on the squashfs filesystem. +sysfs-pci.txt + - info on accessing PCI device resources through sysfs. +sysfs-tagging.txt + - info on sysfs tagging to avoid duplicates. +sysfs.txt + - info on sysfs, a ram-based filesystem for exporting kernel objects. sysv-fs.txt - info on the SystemV/V7/Xenix/Coherent filesystem. +tmpfs.txt + - info on tmpfs, a filesystem that holds all files in virtual memory. +ubifs.txt + - info on the Unsorted Block Images FileSystem. udf.txt - info and mount options for the UDF filesystem. ufs.txt - info on the ufs filesystem. vfat.txt - - info on using the VFAT filesystem used in Windows NT and Windows 95 + - info on using the VFAT filesystem used in Windows NT and Windows 95. vfs.txt - - Overview of the Virtual File System + - overview of the Virtual File System. +xfs-delayed-logging-design.txt + - info on the XFS Delayed Logging Design. +xfs-self-describing-metadata.txt + - info on XFS Self Describing Metadata. xfs.txt - info and mount options for the XFS filesystem. +xip.txt + - info on execute-in-place for file mappings. diff --git a/Documentation/filesystems/9p.txt b/Documentation/filesystems/9p.txt new file mode 100644 index 00000000000..fec7144e817 --- /dev/null +++ b/Documentation/filesystems/9p.txt @@ -0,0 +1,161 @@ + v9fs: Plan 9 Resource Sharing for Linux + ======================================= + +ABOUT +===== + +v9fs is a Unix implementation of the Plan 9 9p remote filesystem protocol. + +This software was originally developed by Ron Minnich <rminnich@sandia.gov> +and Maya Gokhale. Additional development by Greg Watson +<gwatson@lanl.gov> and most recently Eric Van Hensbergen +<ericvh@gmail.com>, Latchesar Ionkov <lucho@ionkov.net> and Russ Cox +<rsc@swtch.com>. + +The best detailed explanation of the Linux implementation and applications of +the 9p client is available in the form of a USENIX paper: + http://www.usenix.org/events/usenix05/tech/freenix/hensbergen.html + +Other applications are described in the following papers: + * XCPU & Clustering + http://xcpu.org/papers/xcpu-talk.pdf + * KVMFS: control file system for KVM + http://xcpu.org/papers/kvmfs.pdf + * CellFS: A New Programming Model for the Cell BE + http://xcpu.org/papers/cellfs-talk.pdf + * PROSE I/O: Using 9p to enable Application Partitions + http://plan9.escet.urjc.es/iwp9/cready/PROSE_iwp9_2006.pdf + * VirtFS: A Virtualization Aware File System pass-through + http://goo.gl/3WPDg + +USAGE +===== + +For remote file server: + + mount -t 9p 10.10.1.2 /mnt/9 + +For Plan 9 From User Space applications (http://swtch.com/plan9) + + mount -t 9p `namespace`/acme /mnt/9 -o trans=unix,uname=$USER + +For server running on QEMU host with virtio transport: + + mount -t 9p -o trans=virtio <mount_tag> /mnt/9 + +where mount_tag is the tag associated by the server to each of the exported +mount points. Each 9P export is seen by the client as a virtio device with an +associated "mount_tag" property. Available mount tags can be +seen by reading /sys/bus/virtio/drivers/9pnet_virtio/virtio<n>/mount_tag files. + +OPTIONS +======= + + trans=name select an alternative transport. Valid options are + currently: + unix - specifying a named pipe mount point + tcp - specifying a normal TCP/IP connection + fd - used passed file descriptors for connection + (see rfdno and wfdno) + virtio - connect to the next virtio channel available + (from QEMU with trans_virtio module) + rdma - connect to a specified RDMA channel + + uname=name user name to attempt mount as on the remote server. The + server may override or ignore this value. Certain user + names may require authentication. + + aname=name aname specifies the file tree to access when the server is + offering several exported file systems. + + cache=mode specifies a caching policy. By default, no caches are used. + none = default no cache policy, metadata and data + alike are synchronous. + loose = no attempts are made at consistency, + intended for exclusive, read-only mounts + fscache = use FS-Cache for a persistent, read-only + cache backend. + mmap = minimal cache that is only used for read-write + mmap. Northing else is cached, like cache=none + + debug=n specifies debug level. The debug level is a bitmask. + 0x01 = display verbose error messages + 0x02 = developer debug (DEBUG_CURRENT) + 0x04 = display 9p trace + 0x08 = display VFS trace + 0x10 = display Marshalling debug + 0x20 = display RPC debug + 0x40 = display transport debug + 0x80 = display allocation debug + 0x100 = display protocol message debug + 0x200 = display Fid debug + 0x400 = display packet debug + 0x800 = display fscache tracing debug + + rfdno=n the file descriptor for reading with trans=fd + + wfdno=n the file descriptor for writing with trans=fd + + msize=n the number of bytes to use for 9p packet payload + + port=n port to connect to on the remote server + + noextend force legacy mode (no 9p2000.u or 9p2000.L semantics) + + version=name Select 9P protocol version. Valid options are: + 9p2000 - Legacy mode (same as noextend) + 9p2000.u - Use 9P2000.u protocol + 9p2000.L - Use 9P2000.L protocol + + dfltuid attempt to mount as a particular uid + + dfltgid attempt to mount with a particular gid + + afid security channel - used by Plan 9 authentication protocols + + nodevmap do not map special files - represent them as normal files. + This can be used to share devices/named pipes/sockets between + hosts. This functionality will be expanded in later versions. + + access there are four access modes. + user = if a user tries to access a file on v9fs + filesystem for the first time, v9fs sends an + attach command (Tattach) for that user. + This is the default mode. + <uid> = allows only user with uid=<uid> to access + the files on the mounted filesystem + any = v9fs does single attach and performs all + operations as one user + client = ACL based access check on the 9p client + side for access validation + + cachetag cache tag to use the specified persistent cache. + cache tags for existing cache sessions can be listed at + /sys/fs/9p/caches. (applies only to cache=fscache) + +RESOURCES +========= + +Protocol specifications are maintained on github: +http://ericvh.github.com/9p-rfc/ + +9p client and server implementations are listed on +http://9p.cat-v.org/implementations + +A 9p2000.L server is being developed by LLNL and can be found +at http://code.google.com/p/diod/ + +There are user and developer mailing lists available through the v9fs project +on sourceforge (http://sourceforge.net/projects/v9fs). + +News and other information is maintained on a Wiki. +(http://sf.net/apps/mediawiki/v9fs/index.php). + +Bug reports are best issued via the mailing list. + +For more information on the Plan 9 Operating System check out +http://plan9.bell-labs.com/plan9 + +For information on Plan 9 from User Space (Plan 9 applications and libraries +ported to Linux/BSD/OSX/etc) check out http://swtch.com/plan9 + diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking index 1045da582b9..b18dd177902 100644 --- a/Documentation/filesystems/Locking +++ b/Documentation/filesystems/Locking @@ -9,50 +9,68 @@ be able to use diff(1). --------------------------- dentry_operations -------------------------- prototypes: - int (*d_revalidate)(struct dentry *, int); - int (*d_hash) (struct dentry *, struct qstr *); - int (*d_compare) (struct dentry *, struct qstr *, struct qstr *); + int (*d_revalidate)(struct dentry *, unsigned int); + int (*d_weak_revalidate)(struct dentry *, unsigned int); + int (*d_hash)(const struct dentry *, struct qstr *); + int (*d_compare)(const struct dentry *, const struct dentry *, + unsigned int, const char *, const struct qstr *); int (*d_delete)(struct dentry *); void (*d_release)(struct dentry *); void (*d_iput)(struct dentry *, struct inode *); + char *(*d_dname)((struct dentry *dentry, char *buffer, int buflen); + struct vfsmount *(*d_automount)(struct path *path); + int (*d_manage)(struct dentry *, bool); locking rules: - none have BKL - dcache_lock rename_lock ->d_lock may block -d_revalidate: no no no yes -d_hash no no no yes -d_compare: no yes no no -d_delete: yes no yes no -d_release: no no no yes -d_iput: no no no yes + rename_lock ->d_lock may block rcu-walk +d_revalidate: no no yes (ref-walk) maybe +d_weak_revalidate:no no yes no +d_hash no no no maybe +d_compare: yes no no maybe +d_delete: no yes no no +d_release: no no yes no +d_prune: no yes no no +d_iput: no no yes no +d_dname: no no no no +d_automount: no no yes no +d_manage: no no yes (ref-walk) maybe --------------------------- inode_operations --------------------------- prototypes: - int (*create) (struct inode *,struct dentry *,int, struct nameidata *); - struct dentry * (*lookup) (struct inode *,struct dentry *, struct nameid -ata *); + int (*create) (struct inode *,struct dentry *,umode_t, bool); + struct dentry * (*lookup) (struct inode *,struct dentry *, unsigned int); int (*link) (struct dentry *,struct inode *,struct dentry *); int (*unlink) (struct inode *,struct dentry *); int (*symlink) (struct inode *,struct dentry *,const char *); - int (*mkdir) (struct inode *,struct dentry *,int); + int (*mkdir) (struct inode *,struct dentry *,umode_t); int (*rmdir) (struct inode *,struct dentry *); - int (*mknod) (struct inode *,struct dentry *,int,dev_t); + int (*mknod) (struct inode *,struct dentry *,umode_t,dev_t); int (*rename) (struct inode *, struct dentry *, struct inode *, struct dentry *); + int (*rename2) (struct inode *, struct dentry *, + struct inode *, struct dentry *, unsigned int); int (*readlink) (struct dentry *, char __user *,int); - int (*follow_link) (struct dentry *, struct nameidata *); + void * (*follow_link) (struct dentry *, struct nameidata *); + void (*put_link) (struct dentry *, struct nameidata *, void *); void (*truncate) (struct inode *); - int (*permission) (struct inode *, int, struct nameidata *); + int (*permission) (struct inode *, int, unsigned int); + int (*get_acl)(struct inode *, int); int (*setattr) (struct dentry *, struct iattr *); int (*getattr) (struct vfsmount *, struct dentry *, struct kstat *); int (*setxattr) (struct dentry *, const char *,const void *,size_t,int); ssize_t (*getxattr) (struct dentry *, const char *, void *, size_t); ssize_t (*listxattr) (struct dentry *, char *, size_t); int (*removexattr) (struct dentry *, const char *); + int (*fiemap)(struct inode *, struct fiemap_extent_info *, u64 start, u64 len); + void (*update_time)(struct inode *, struct timespec *, int); + int (*atomic_open)(struct inode *, struct dentry *, + struct file *, unsigned open_flag, + umode_t create_mode, int *opened); + int (*tmpfile) (struct inode *, struct dentry *, umode_t); locking rules: - all may block, none have BKL - i_sem(inode) + all may block + i_mutex(inode) lookup: yes create: yes link: yes (both) @@ -62,24 +80,27 @@ mkdir: yes unlink: yes (both) rmdir: yes (both) (see below) rename: yes (all) (see below) +rename2: yes (all) (see below) readlink: no follow_link: no -truncate: yes (see below) +put_link: no setattr: yes -permission: no +permission: no (may not block if called in rcu-walk mode) +get_acl: no getattr: no setxattr: yes getxattr: no listxattr: no removexattr: yes - Additionally, ->rmdir(), ->unlink() and ->rename() have ->i_sem on +fiemap: no +update_time: no +atomic_open: yes +tmpfile: no + + Additionally, ->rmdir(), ->unlink() and ->rename() have ->i_mutex on victim. - cross-directory ->rename() has (per-superblock) ->s_vfs_rename_sem. - ->truncate() is never called directly - it's a callback, not a -method. It's called by vmtruncate() - library function normally used by -->setattr(). Locking information above applies to that call (i.e. is -inherited from ->setattr() - vmtruncate() is used when ATTR_SIZE had been -passed). + cross-directory ->rename() and rename2() has (per-superblock) +->s_vfs_rename_sem. See Documentation/filesystems/directory-locking for more detailed discussion of the locking scheme for directory operations. @@ -88,69 +109,71 @@ of the locking scheme for directory operations. prototypes: struct inode *(*alloc_inode)(struct super_block *sb); void (*destroy_inode)(struct inode *); - void (*read_inode) (struct inode *); - void (*dirty_inode) (struct inode *); - int (*write_inode) (struct inode *, int); - void (*put_inode) (struct inode *); - void (*drop_inode) (struct inode *); - void (*delete_inode) (struct inode *); + void (*dirty_inode) (struct inode *, int flags); + int (*write_inode) (struct inode *, struct writeback_control *wbc); + int (*drop_inode) (struct inode *); + void (*evict_inode) (struct inode *); void (*put_super) (struct super_block *); - void (*write_super) (struct super_block *); int (*sync_fs)(struct super_block *sb, int wait); - void (*write_super_lockfs) (struct super_block *); - void (*unlockfs) (struct super_block *); - int (*statfs) (struct super_block *, struct kstatfs *); + int (*freeze_fs) (struct super_block *); + int (*unfreeze_fs) (struct super_block *); + int (*statfs) (struct dentry *, struct kstatfs *); int (*remount_fs) (struct super_block *, int *, char *); - void (*clear_inode) (struct inode *); void (*umount_begin) (struct super_block *); - int (*show_options)(struct seq_file *, struct vfsmount *); + int (*show_options)(struct seq_file *, struct dentry *); ssize_t (*quota_read)(struct super_block *, int, char *, size_t, loff_t); ssize_t (*quota_write)(struct super_block *, int, const char *, size_t, loff_t); + int (*bdev_try_to_free_page)(struct super_block*, struct page*, gfp_t); locking rules: - All may block. - BKL s_lock s_umount -alloc_inode: no no no -destroy_inode: no -read_inode: no (see below) -dirty_inode: no (must not sleep) -write_inode: no -put_inode: no -drop_inode: no !!!inode_lock!!! -delete_inode: no -put_super: yes yes no -write_super: no yes read -sync_fs: no no read -write_super_lockfs: ? -unlockfs: ? -statfs: no no no -remount_fs: no yes maybe (see below) -clear_inode: no -umount_begin: yes no no -show_options: no (vfsmount->sem) -quota_read: no no no (see below) -quota_write: no no no (see below) - -->read_inode() is not a method - it's a callback used in iget(). -->remount_fs() will have the s_umount lock if it's already mounted. -When called from get_sb_single, it does NOT have the s_umount lock. + All may block [not true, see below] + s_umount +alloc_inode: +destroy_inode: +dirty_inode: +write_inode: +drop_inode: !!!inode->i_lock!!! +evict_inode: +put_super: write +sync_fs: read +freeze_fs: write +unfreeze_fs: write +statfs: maybe(read) (see below) +remount_fs: write +umount_begin: no +show_options: no (namespace_sem) +quota_read: no (see below) +quota_write: no (see below) +bdev_try_to_free_page: no (see below) + +->statfs() has s_umount (shared) when called by ustat(2) (native or +compat), but that's an accident of bad API; s_umount is used to pin +the superblock down when we only have dev_t given us by userland to +identify the superblock. Everything else (statfs(), fstatfs(), etc.) +doesn't hold it when calling ->statfs() - superblock is pinned down +by resolving the pathname passed to syscall. ->quota_read() and ->quota_write() functions are both guaranteed to be the only ones operating on the quota file by the quota code (via dqio_sem) (unless an admin really wants to screw up something and writes to quota files with quotas on). For other details about locking see also dquot_operations section. +->bdev_try_to_free_page is called from the ->releasepage handler of +the block device inode. See there for more details. --------------------------- file_system_type --------------------------- prototypes: - struct super_block *(*get_sb) (struct file_system_type *, int, - const char *, void *); + int (*get_sb) (struct file_system_type *, int, + const char *, void *, struct vfsmount *); + struct dentry *(*mount) (struct file_system_type *, int, + const char *, void *); void (*kill_sb) (struct super_block *); locking rules: - may block BKL -get_sb yes yes -kill_sb yes yes + may block +mount yes +kill_sb yes -->get_sb() returns error or a locked superblock (exclusive on ->s_umount). +->mount() returns ERR_PTR or the root dentry; its superblock should be locked +on return. ->kill_sb() takes a write-locked superblock, does all shutdown work on it, unlocks and drops the reference. @@ -163,32 +186,52 @@ prototypes: int (*set_page_dirty)(struct page *page); int (*readpages)(struct file *filp, struct address_space *mapping, struct list_head *pages, unsigned nr_pages); - int (*prepare_write)(struct file *, struct page *, unsigned, unsigned); - int (*commit_write)(struct file *, struct page *, unsigned, unsigned); + int (*write_begin)(struct file *, struct address_space *mapping, + loff_t pos, unsigned len, unsigned flags, + struct page **pagep, void **fsdata); + int (*write_end)(struct file *, struct address_space *mapping, + loff_t pos, unsigned len, unsigned copied, + struct page *page, void *fsdata); sector_t (*bmap)(struct address_space *, sector_t); - int (*invalidatepage) (struct page *, unsigned long); + void (*invalidatepage) (struct page *, unsigned int, unsigned int); int (*releasepage) (struct page *, int); - int (*direct_IO)(int, struct kiocb *, const struct iovec *iov, - loff_t offset, unsigned long nr_segs); + void (*freepage)(struct page *); + int (*direct_IO)(int, struct kiocb *, struct iov_iter *iter, loff_t offset); + int (*get_xip_mem)(struct address_space *, pgoff_t, int, void **, + unsigned long *); + int (*migratepage)(struct address_space *, struct page *, struct page *); + int (*launder_page)(struct page *); + int (*is_partially_uptodate)(struct page *, unsigned long, unsigned long); + int (*error_remove_page)(struct address_space *, struct page *); + int (*swap_activate)(struct file *); + int (*swap_deactivate)(struct file *); locking rules: - All except set_page_dirty may block - - BKL PageLocked(page) -writepage: no yes, unlocks (see below) -readpage: no yes, unlocks -sync_page: no maybe -writepages: no -set_page_dirty no no -readpages: no -prepare_write: no yes -commit_write: no yes -bmap: yes -invalidatepage: no yes -releasepage: no yes -direct_IO: no - - ->prepare_write(), ->commit_write(), ->sync_page() and ->readpage() + All except set_page_dirty and freepage may block + + PageLocked(page) i_mutex +writepage: yes, unlocks (see below) +readpage: yes, unlocks +sync_page: maybe +writepages: +set_page_dirty no +readpages: +write_begin: locks the page yes +write_end: yes, unlocks yes +bmap: +invalidatepage: yes +releasepage: yes +freepage: yes +direct_IO: +get_xip_mem: maybe +migratepage: yes (both) +launder_page: yes +is_partially_uptodate: yes +error_remove_page: yes +swap_activate: no +swap_deactivate: no + + ->write_begin(), ->write_end(), ->sync_page() and ->readpage() may be called from the request handler (/dev/loop). ->readpage() unlocks the page, either synchronously or via I/O @@ -216,7 +259,7 @@ against the page the filesystem should redirty the page with redirty_page_for_writepage(), then unlock the page and return zero. This may also be done to avoid internal deadlocks, but rarely. -If the filesytem is called for sync then it must wait on any +If the filesystem is called for sync then it must wait on any in-progress I/O and then start new I/O. The filesystem should unlock the page synchronously, before returning to the @@ -266,13 +309,12 @@ under spinlock (it cannot block) and is sometimes called with the page not locked. ->bmap() is currently used by legacy ioctl() (FIBMAP) provided by some -filesystems and by the swapper. The latter will eventually go away. All -instances do not actually need the BKL. Please, keep it that way and don't -breed new callers. +filesystems and by the swapper. The latter will eventually go away. Please, +keep it that way and don't breed new callers. ->invalidatepage() is called when the filesystem must attempt to drop -some or all of the buffers from the page when it is being truncated. It -returns zero on success. If ->invalidatepage is zero, the kernel uses +some or all of the buffers from the page when it is being truncated. It +returns zero on success. If ->invalidatepage is zero, the kernel uses block_invalidatepage() instead. ->releasepage() is called when the kernel is about to try to drop the @@ -280,49 +322,64 @@ buffers from the page in preparation for freeing it. It returns zero to indicate that the buffers are (or may be) freeable. If ->releasepage is zero, the kernel assumes that the fs has no private interest in the buffers. - Note: currently almost all instances of address_space methods are -using BKL for internal serialization and that's one of the worst sources -of contention. Normally they are calling library functions (in fs/buffer.c) -and pass foo_get_block() as a callback (on local block-based filesystems, -indeed). BKL is not needed for library stuff and is usually taken by -foo_get_block(). It's an overkill, since block bitmaps can be protected by -internal fs locking and real critical areas are much smaller than the areas -filesystems protect now. + ->freepage() is called when the kernel is done dropping the page +from the page cache. + + ->launder_page() may be called prior to releasing a page if +it is still found to be dirty. It returns zero if the page was successfully +cleaned, or an error value if not. Note that in order to prevent the page +getting mapped back in and redirtied, it needs to be kept locked +across the entire operation. + + ->swap_activate will be called with a non-zero argument on +files backing (non block device backed) swapfiles. A return value +of zero indicates success, in which case this file can be used for +backing swapspace. The swapspace operations will be proxied to the +address space operations. + + ->swap_deactivate() will be called in the sys_swapoff() +path after ->swap_activate() returned success. ----------------------- file_lock_operations ------------------------------ prototypes: - void (*fl_insert)(struct file_lock *); /* lock insertion callback */ - void (*fl_remove)(struct file_lock *); /* lock removal callback */ void (*fl_copy_lock)(struct file_lock *, struct file_lock *); void (*fl_release_private)(struct file_lock *); locking rules: - BKL may block -fl_insert: yes no -fl_remove: yes no -fl_copy_lock: yes no -fl_release_private: yes yes + inode->i_lock may block +fl_copy_lock: yes no +fl_release_private: maybe no ----------------------- lock_manager_operations --------------------------- prototypes: - int (*fl_compare_owner)(struct file_lock *, struct file_lock *); - void (*fl_notify)(struct file_lock *); /* unblock callback */ - void (*fl_copy_lock)(struct file_lock *, struct file_lock *); - void (*fl_release_private)(struct file_lock *); - void (*fl_break)(struct file_lock *); /* break_lease callback */ + int (*lm_compare_owner)(struct file_lock *, struct file_lock *); + unsigned long (*lm_owner_key)(struct file_lock *); + void (*lm_notify)(struct file_lock *); /* unblock callback */ + int (*lm_grant)(struct file_lock *, struct file_lock *, int); + void (*lm_break)(struct file_lock *); /* break_lease callback */ + int (*lm_change)(struct file_lock **, int); locking rules: - BKL may block -fl_compare_owner: yes no -fl_notify: yes no -fl_copy_lock: yes no -fl_release_private: yes yes -fl_break: yes no - - Currently only NFSD and NLM provide instances of this class. None of the -them block. If you have out-of-tree instances - please, show up. Locking -in that area will change. + + inode->i_lock blocked_lock_lock may block +lm_compare_owner: yes[1] maybe no +lm_owner_key yes[1] yes no +lm_notify: yes yes no +lm_grant: no no no +lm_break: yes no no +lm_change yes no no + +[1]: ->lm_compare_owner and ->lm_owner_key are generally called with +*an* inode->i_lock held. It may not be the i_lock of the inode +associated with either file_lock argument! This is the case with deadlock +detection, since the code has to chase down the owners of locks that may +be entirely unrelated to the one on which the lock is being acquired. +For deadlock detection however, the blocked_lock_lock is also held. The +fact that these locks are held ensures that the file_locks do not +disappear out from under you while doing the comparison or generating an +owner key. + --------------------------- buffer_head ----------------------------------- prototypes: void (*b_end_io)(struct buffer_head *bh, int uptodate); @@ -335,41 +392,55 @@ call this method upon the IO completion. --------------------------- block_device_operations ----------------------- prototypes: - int (*open) (struct inode *, struct file *); - int (*release) (struct inode *, struct file *); - int (*ioctl) (struct inode *, struct file *, unsigned, unsigned long); + int (*open) (struct block_device *, fmode_t); + int (*release) (struct gendisk *, fmode_t); + int (*ioctl) (struct block_device *, fmode_t, unsigned, unsigned long); + int (*compat_ioctl) (struct block_device *, fmode_t, unsigned, unsigned long); + int (*direct_access) (struct block_device *, sector_t, void **, unsigned long *); int (*media_changed) (struct gendisk *); + void (*unlock_native_capacity) (struct gendisk *); int (*revalidate_disk) (struct gendisk *); + int (*getgeo)(struct block_device *, struct hd_geometry *); + void (*swap_slot_free_notify) (struct block_device *, unsigned long); locking rules: - BKL bd_sem -open: yes yes -release: yes yes -ioctl: yes no -media_changed: no no -revalidate_disk: no no + bd_mutex +open: yes +release: yes +ioctl: no +compat_ioctl: no +direct_access: no +media_changed: no +unlock_native_capacity: no +revalidate_disk: no +getgeo: no +swap_slot_free_notify: no (see below) + +media_changed, unlock_native_capacity and revalidate_disk are called only from +check_disk_change(). + +swap_slot_free_notify is called with swap_lock and sometimes the page lock +held. -The last two are called only from check_disk_change(). --------------------------- file_operations ------------------------------- prototypes: loff_t (*llseek) (struct file *, loff_t, int); ssize_t (*read) (struct file *, char __user *, size_t, loff_t *); - ssize_t (*aio_read) (struct kiocb *, char __user *, size_t, loff_t); ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *); - ssize_t (*aio_write) (struct kiocb *, const char __user *, size_t, - loff_t); - int (*readdir) (struct file *, void *, filldir_t); + ssize_t (*aio_read) (struct kiocb *, const struct iovec *, unsigned long, loff_t); + ssize_t (*aio_write) (struct kiocb *, const struct iovec *, unsigned long, loff_t); + ssize_t (*read_iter) (struct kiocb *, struct iov_iter *); + ssize_t (*write_iter) (struct kiocb *, struct iov_iter *); + int (*iterate) (struct file *, struct dir_context *); unsigned int (*poll) (struct file *, struct poll_table_struct *); - int (*ioctl) (struct inode *, struct file *, unsigned int, - unsigned long); long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long); long (*compat_ioctl) (struct file *, unsigned int, unsigned long); int (*mmap) (struct file *, struct vm_area_struct *); int (*open) (struct inode *, struct file *); int (*flush) (struct file *); int (*release) (struct inode *, struct file *); - int (*fsync) (struct file *, struct dentry *, int datasync); + int (*fsync) (struct file *, loff_t start, loff_t end, int datasync); int (*aio_fsync) (struct kiocb *, int datasync); int (*fasync) (int, struct file *, int); int (*lock) (struct file *, int, struct file_lock *); @@ -384,60 +455,33 @@ prototypes: unsigned long (*get_unmapped_area)(struct file *, unsigned long, unsigned long, unsigned long, unsigned long); int (*check_flags)(int); - int (*dir_notify)(struct file *, unsigned long); + int (*flock) (struct file *, int, struct file_lock *); + ssize_t (*splice_write)(struct pipe_inode_info *, struct file *, loff_t *, + size_t, unsigned int); + ssize_t (*splice_read)(struct file *, loff_t *, struct pipe_inode_info *, + size_t, unsigned int); + int (*setlease)(struct file *, long, struct file_lock **); + long (*fallocate)(struct file *, int, loff_t, loff_t); }; locking rules: - All except ->poll() may block. - BKL -llseek: no (see below) -read: no -aio_read: no -write: no -aio_write: no -readdir: no -poll: no -ioctl: yes (see below) -unlocked_ioctl: no (see below) -compat_ioctl: no -mmap: no -open: maybe (see below) -flush: no -release: no -fsync: no (see below) -aio_fsync: no -fasync: yes (see below) -lock: yes -readv: no -writev: no -sendfile: no -sendpage: no -get_unmapped_area: no -check_flags: no -dir_notify: no + All may block except for ->setlease. + No VFS locks held on entry except for ->setlease. + +->setlease has the file_list_lock held and must not sleep. ->llseek() locking has moved from llseek to the individual llseek implementations. If your fs is not using generic_file_llseek, you need to acquire and release the appropriate locks in your ->llseek(). For many filesystems, it is probably safe to acquire the inode -semaphore. Note some filesystems (i.e. remote ones) provide no -protection for i_size so you will need to use the BKL. - -->open() locking is in-transit: big lock partially moved into the methods. -The only exception is ->open() in the instances of file_operations that never -end up in ->i_fop/->proc_fops, i.e. ones that belong to character devices -(chrdev_open() takes lock before replacing ->f_op and calling the secondary -method. As soon as we fix the handling of module reference counters all -instances of ->open() will be called without the BKL. +mutex or just to use i_size_read() instead. +Note: this does not protect the file->f_pos against concurrent modifications +since this is something the userspace has to take care about. -Note: ext2_release() was *the* source of contention on fs-intensive -loads and dropping BKL on ->release() helps to get rid of that (we still -grab BKL for cases when we close a file that had been opened r/w, but that -can and should be done using the internal locking with smaller critical areas). -Current worst offender is ext2_get_block()... - -->fasync() is a mess. This area needs a big cleanup and that will probably -affect locking. +->fasync() is responsible for maintaining the FASYNC bit in filp->f_flags. +Most instances call fasync_helper(), which does that maintenance, so it's +not normally something one needs to worry about. Return values > 0 will be +mapped to zero in the VFS layer. ->readdir() and ->ioctl() on directories must be changed. Ideally we would move ->readdir() to inode_operations and use a separate method for directory @@ -445,23 +489,11 @@ move ->readdir() to inode_operations and use a separate method for directory anything that resembles union-mount we won't have a struct file for all components. And there are other reasons why the current interface is a mess... -->ioctl() on regular files is superceded by the ->unlocked_ioctl() that -doesn't take the BKL. - ->read on directories probably must go away - we should just enforce -EISDIR in sys_read() and friends. -->fsync() has i_sem on inode. - --------------------------- dquot_operations ------------------------------- prototypes: - int (*initialize) (struct inode *, int); - int (*drop) (struct inode *); - int (*alloc_space) (struct inode *, qsize_t, int); - int (*alloc_inode) (const struct inode *, unsigned long); - int (*free_space) (struct inode *, qsize_t); - int (*free_inode) (const struct inode *, unsigned long); - int (*transfer) (struct inode *, struct iattr *); int (*write_dquot) (struct dquot *); int (*acquire_dquot) (struct dquot *); int (*release_dquot) (struct dquot *); @@ -474,13 +506,6 @@ a proper locking wrt the filesystem and call the generic quota operations. What filesystem should expect from the generic quota functions: FS recursion Held locks when called -initialize: yes maybe dqonoff_sem -drop: yes - -alloc_space: ->mark_dirty() - -alloc_inode: ->mark_dirty() - -free_space: ->mark_dirty() - -free_inode: ->mark_dirty() - -transfer: yes - write_dquot: yes dqonoff_sem or dqptr_sem acquire_dquot: yes dqonoff_sem or dqptr_sem release_dquot: yes dqonoff_sem or dqptr_sem @@ -490,30 +515,56 @@ write_info: yes dqonoff_sem FS recursion means calling ->quota_read() and ->quota_write() from superblock operations. -->alloc_space(), ->alloc_inode(), ->free_space(), ->free_inode() are called -only directly by the filesystem and do not call any fs functions only -the ->mark_dirty() operation. - More details about quota locking can be found in fs/dquot.c. --------------------------- vm_operations_struct ----------------------------- prototypes: void (*open)(struct vm_area_struct*); void (*close)(struct vm_area_struct*); - struct page *(*nopage)(struct vm_area_struct*, unsigned long, int *); + int (*fault)(struct vm_area_struct*, struct vm_fault *); + int (*page_mkwrite)(struct vm_area_struct *, struct vm_fault *); + int (*access)(struct vm_area_struct *, unsigned long, void*, int, int); locking rules: - BKL mmap_sem -open: no yes -close: no yes -nopage: no yes + mmap_sem PageLocked(page) +open: yes +close: yes +fault: yes can return with page locked +map_pages: yes +page_mkwrite: yes can return with page locked +access: yes + + ->fault() is called when a previously not present pte is about +to be faulted in. The filesystem must find and return the page associated +with the passed in "pgoff" in the vm_fault structure. If it is possible that +the page may be truncated and/or invalidated, then the filesystem must lock +the page, then ensure it is not already truncated (the page lock will block +subsequent truncate), and then return with VM_FAULT_LOCKED, and the page +locked. The VM will unlock the page. + + ->map_pages() is called when VM asks to map easy accessible pages. +Filesystem should find and map pages associated with offsets from "pgoff" +till "max_pgoff". ->map_pages() is called with page table locked and must +not block. If it's not possible to reach a page without blocking, +filesystem should skip it. Filesystem should use do_set_pte() to setup +page table entry. Pointer to entry associated with offset "pgoff" is +passed in "pte" field in vm_fault structure. Pointers to entries for other +offsets should be calculated relative to "pte". + + ->page_mkwrite() is called when a previously read-only pte is +about to become writeable. The filesystem again must ensure that there are +no truncate/invalidate races, and then return with the page locked. If +the page has been truncated, the filesystem should not look up a new page +like the ->fault() handler, but simply return with VM_FAULT_NOPAGE, which +will cause the VM to retry the fault. + + ->access() is called when get_user_pages() fails in +access_process_vm(), typically used to debug a process through +/proc/pid/mem or ptrace. This function is needed only for +VM_IO | VM_PFNMAP VMAs. ================================================================================ Dubious stuff (if you break something or notice that it is broken and do not fix it yourself - at least put it here) - -ipc/shm.c::shm_delete() - may need BKL. -->read() and ->write() in many drivers are (probably) missing BKL. -drivers/sgi/char/graphics.c::sgi_graphics_nopage() - may need BKL. diff --git a/Documentation/filesystems/Makefile b/Documentation/filesystems/Makefile new file mode 100644 index 00000000000..a5dd114da14 --- /dev/null +++ b/Documentation/filesystems/Makefile @@ -0,0 +1,8 @@ +# kbuild trick to avoid linker error. Can be omitted if a module is built. +obj- := dummy.o + +# List of programs to build +hostprogs-y := dnotify_test + +# Tell kbuild to always build the programs +always := $(hostprogs-y) diff --git a/Documentation/filesystems/adfs.txt b/Documentation/filesystems/adfs.txt index 060abb0c700..5949766353f 100644 --- a/Documentation/filesystems/adfs.txt +++ b/Documentation/filesystems/adfs.txt @@ -3,12 +3,15 @@ Mount options for ADFS uid=nnn All files in the partition will be owned by user id nnn. Default 0 (root). - gid=nnn All files in the partition willbe in group + gid=nnn All files in the partition will be in group nnn. Default 0 (root). ownmask=nnn The permission mask for ADFS 'owner' permissions will be nnn. Default 0700. othmask=nnn The permission mask for ADFS 'other' permissions will be nnn. Default 0077. + ftsuffix=n When ftsuffix=0, no file type suffix will be applied. + When ftsuffix=1, a hexadecimal suffix corresponding to + the RISC OS file type will be added. Default 0. Mapping of ADFS permissions to Linux permissions ------------------------------------------------ @@ -55,3 +58,18 @@ Mapping of ADFS permissions to Linux permissions You can therefore tailor the permission translation to whatever you desire the permissions should be under Linux. + +RISC OS file type suffix +------------------------ + + RISC OS file types are stored in bits 19..8 of the file load address. + + To enable non-RISC OS systems to be used to store files without losing + file type information, a file naming convention was devised (initially + for use with NFS) such that a hexadecimal suffix of the form ,xyz + denoted the file type: e.g. BasicFile,ffb is a BASIC (0xffb) file. This + naming convention is now also used by RISC OS emulators such as RPCEmu. + + Mounting an ADFS disc with option ftsuffix=1 will cause appropriate file + type suffixes to be appended to file names read from a directory. If the + ftsuffix option is zero or omitted, no file type suffixes will be added. diff --git a/Documentation/filesystems/affs.txt b/Documentation/filesystems/affs.txt index 30c9738590f..71b63c2b984 100644 --- a/Documentation/filesystems/affs.txt +++ b/Documentation/filesystems/affs.txt @@ -49,6 +49,10 @@ mode=mode Sets the mode flags to the given (octal) value, regardless This is useful since most of the plain AmigaOS files will map to 600. +nofilenametruncate + The file system will return an error when filename exceeds + standard maximum filename length (30 characters). + reserved=num Sets the number of reserved blocks at the start of the partition to num. You should never need this option. Default is 2. @@ -181,9 +185,8 @@ tested, though several hundred MB have been read and written using this fs. For a most up-to-date list of bugs please consult fs/affs/Changes. -Filenames are truncated to 30 characters without warning (this -can be changed by setting the compile-time option AFFS_NO_TRUNCATE -in include/linux/amigaffs.h). +By default, filenames are truncated to 30 characters without warning. +'nofilenametruncate' mount option can change that behavior. Case is ignored by the affs in filename matching, but Linux shells do care about the case. Example (with /wb being an affs mounted fs): @@ -216,4 +219,4 @@ due to an incompatibility with the Amiga floppy controller. If you are interested in an Amiga Emulator for Linux, look at -http://www-users.informatik.rwth-aachen.de/~crux/uae.html +http://web.archive.org/web/*/http://www.freiburg.linux.de/~uae/ diff --git a/Documentation/filesystems/afs.txt b/Documentation/filesystems/afs.txt index 2f4237dfb8c..ffef91c4e0d 100644 --- a/Documentation/filesystems/afs.txt +++ b/Documentation/filesystems/afs.txt @@ -1,39 +1,88 @@ + ==================== kAFS: AFS FILESYSTEM ==================== -ABOUT -===== +Contents: + + - Overview. + - Usage. + - Mountpoints. + - Proc filesystem. + - The cell database. + - Security. + - Examples. + + +======== +OVERVIEW +======== + +This filesystem provides a fairly simple secure AFS filesystem driver. It is +under development and does not yet provide the full feature set. The features +it does support include: + + (*) Security (currently only AFS kaserver and KerberosIV tickets). + + (*) File reading and writing. + + (*) Automounting. + + (*) Local caching (via fscache). + +It does not yet support the following AFS features: -This filesystem provides a fairly simple AFS filesystem driver. It is under -development and only provides very basic facilities. It does not yet support -the following AFS features: + (*) pioctl() system call. - (*) Write support. - (*) Communications security. - (*) Local caching. - (*) pioctl() system call. - (*) Automatic mounting of embedded mountpoints. +=========== +COMPILATION +=========== + +The filesystem should be enabled by turning on the kernel configuration +options: + + CONFIG_AF_RXRPC - The RxRPC protocol transport + CONFIG_RXKAD - The RxRPC Kerberos security handler + CONFIG_AFS - The AFS filesystem + +Additionally, the following can be turned on to aid debugging: + + CONFIG_AF_RXRPC_DEBUG - Permit AF_RXRPC debugging to be enabled + CONFIG_AFS_DEBUG - Permit AFS debugging to be enabled + +They permit the debugging messages to be turned on dynamically by manipulating +the masks in the following files: + /sys/module/af_rxrpc/parameters/debug + /sys/module/kafs/parameters/debug + + +===== USAGE ===== When inserting the driver modules the root cell must be specified along with a list of volume location server IP addresses: - insmod rxrpc.o - insmod kafs.o rootcell=cambridge.redhat.com:172.16.18.73:172.16.18.91 + modprobe af_rxrpc + modprobe rxkad + modprobe kafs rootcell=cambridge.redhat.com:172.16.18.73:172.16.18.91 -The first module is a driver for the RxRPC remote operation protocol, and the -second is the actual filesystem driver for the AFS filesystem. +The first module is the AF_RXRPC network protocol driver. This provides the +RxRPC remote operation protocol and may also be accessed from userspace. See: + + Documentation/networking/rxrpc.txt + +The second module is the kerberos RxRPC security driver, and the third module +is the actual filesystem driver for the AFS filesystem. Once the module has been loaded, more modules can be added by the following procedure: - echo add grand.central.org 18.7.14.88:128.2.191.224 >/proc/fs/afs/cells + echo add grand.central.org 18.9.48.14:128.2.203.61:130.237.48.87 >/proc/fs/afs/cells Where the parameters to the "add" command are the name of a cell and a list of -volume location servers within that cell. +volume location servers within that cell, with the latter separated by colons. Filesystems can be mounted anywhere by commands similar to the following: @@ -42,11 +91,6 @@ Filesystems can be mounted anywhere by commands similar to the following: mount -t afs "#root.afs." /afs mount -t afs "#root.cell." /afs/cambridge - NB: When using this on Linux 2.4, the mount command has to be different, - since the filesystem doesn't have access to the device name argument: - - mount -t afs none /afs -ovol="#root.afs." - Where the initial character is either a hash or a percent symbol depending on whether you definitely want a R/W volume (hash) or whether you'd prefer a R/O volume, but are willing to use a R/W volume instead (percent). @@ -55,83 +99,139 @@ The name of the volume can be suffixes with ".backup" or ".readonly" to specify connection to only volumes of those types. The name of the cell is optional, and if not given during a mount, then the -named volume will be looked up in the cell specified during insmod. +named volume will be looked up in the cell specified during modprobe. Additional cells can be added through /proc (see later section). +=========== MOUNTPOINTS =========== -AFS has a concept of mountpoints. These are specially formatted symbolic links -(of the same form as the "device name" passed to mount). kAFS presents these -to the user as directories that have special properties: +AFS has a concept of mountpoints. In AFS terms, these are specially formatted +symbolic links (of the same form as the "device name" passed to mount). kAFS +presents these to the user as directories that have a follow-link capability +(ie: symbolic link semantics). If anyone attempts to access them, they will +automatically cause the target volume to be mounted (if possible) on that site. - (*) They cannot be listed. Running a program like "ls" on them will incur an - EREMOTE error (Object is remote). +Automatically mounted filesystems will be automatically unmounted approximately +twenty minutes after they were last used. Alternatively they can be unmounted +directly with the umount() system call. - (*) Other objects can't be looked up inside of them. This also incurs an - EREMOTE error. +Manually unmounting an AFS volume will cause any idle submounts upon it to be +culled first. If all are culled, then the requested volume will also be +unmounted, otherwise error EBUSY will be returned. - (*) They can be queried with the readlink() system call, which will return - the name of the mountpoint to which they point. The "readlink" program - will also work. +This can be used by the administrator to attempt to unmount the whole AFS tree +mounted on /afs in one go by doing: - (*) They can be mounted on (which symbolic links can't). + umount /afs +=============== PROC FILESYSTEM =============== -The rxrpc module creates a number of files in various places in the /proc -filesystem: - - (*) Firstly, some information files are made available in a directory called - "/proc/net/rxrpc/". These list the extant transport endpoint, peer, - connection and call records. - - (*) Secondly, some control files are made available in a directory called - "/proc/sys/rxrpc/". Currently, all these files can be used for is to - turn on various levels of tracing. - The AFS modules creates a "/proc/fs/afs/" directory and populates it: - (*) A "cells" file that lists cells currently known to the afs module. + (*) A "cells" file that lists cells currently known to the afs module and + their usage counts: + + [root@andromeda ~]# cat /proc/fs/afs/cells + USE NAME + 3 cambridge.redhat.com (*) A directory per cell that contains files that list volume location servers, volumes, and active servers known within that cell. + [root@andromeda ~]# cat /proc/fs/afs/cambridge.redhat.com/servers + USE ADDR STATE + 4 172.16.18.91 0 + [root@andromeda ~]# cat /proc/fs/afs/cambridge.redhat.com/vlservers + ADDRESS + 172.16.18.91 + [root@andromeda ~]# cat /proc/fs/afs/cambridge.redhat.com/volumes + USE STT VLID[0] VLID[1] VLID[2] NAME + 1 Val 20000000 20000001 20000002 root.afs + +================= THE CELL DATABASE ================= -The filesystem maintains an internal database of all the cells it knows and -the IP addresses of the volume location servers for those cells. The cell to -which the computer belongs is added to the database when insmod is performed -by the "rootcell=" argument. +The filesystem maintains an internal database of all the cells it knows and the +IP addresses of the volume location servers for those cells. The cell to which +the system belongs is added to the database when modprobe is performed by the +"rootcell=" argument or, if compiled in, using a "kafs.rootcell=" argument on +the kernel command line. Further cells can be added by commands similar to the following: echo add CELLNAME VLADDR[:VLADDR][:VLADDR]... >/proc/fs/afs/cells - echo add grand.central.org 18.7.14.88:128.2.191.224 >/proc/fs/afs/cells + echo add grand.central.org 18.9.48.14:128.2.203.61:130.237.48.87 >/proc/fs/afs/cells No other cell database operations are available at this time. +======== +SECURITY +======== + +Secure operations are initiated by acquiring a key using the klog program. A +very primitive klog program is available at: + + http://people.redhat.com/~dhowells/rxrpc/klog.c + +This should be compiled by: + + make klog LDLIBS="-lcrypto -lcrypt -lkrb4 -lkeyutils" + +And then run as: + + ./klog + +Assuming it's successful, this adds a key of type RxRPC, named for the service +and cell, eg: "afs@<cellname>". This can be viewed with the keyctl program or +by cat'ing /proc/keys: + + [root@andromeda ~]# keyctl show + Session Keyring + -3 --alswrv 0 0 keyring: _ses.3268 + 2 --alswrv 0 0 \_ keyring: _uid.0 + 111416553 --als--v 0 0 \_ rxrpc: afs@CAMBRIDGE.REDHAT.COM + +Currently the username, realm, password and proposed ticket lifetime are +compiled in to the program. + +It is not required to acquire a key before using AFS facilities, but if one is +not acquired then all operations will be governed by the anonymous user parts +of the ACLs. + +If a key is acquired, then all AFS operations, including mounts and automounts, +made by a possessor of that key will be secured with that key. + +If a file is opened with a particular key and then the file descriptor is +passed to a process that doesn't have that key (perhaps over an AF_UNIX +socket), then the operations on the file will be made with key that was used to +open the file. + + +======== EXAMPLES ======== -Here's what I use to test this. Some of the names and IP addresses are local -to my internal DNS. My "root.afs" partition has a mount point within it for +Here's what I use to test this. Some of the names and IP addresses are local +to my internal DNS. My "root.afs" partition has a mount point within it for some public volumes volumes. -insmod -S /tmp/rxrpc.o -insmod -S /tmp/kafs.o rootcell=cambridge.redhat.com:172.16.18.73:172.16.18.91 +insmod /tmp/rxrpc.o +insmod /tmp/rxkad.o +insmod /tmp/kafs.o rootcell=cambridge.redhat.com:172.16.18.91 mount -t afs \%root.afs. /afs mount -t afs \%cambridge.redhat.com:root.cell. /afs/cambridge.redhat.com/ -echo add grand.central.org 18.7.14.88:128.2.191.224 > /proc/fs/afs/cells +echo add grand.central.org 18.9.48.14:128.2.203.61:130.237.48.87 > /proc/fs/afs/cells mount -t afs "#grand.central.org:root.cell." /afs/grand.central.org/ mount -t afs "#grand.central.org:root.archive." /afs/grand.central.org/archive mount -t afs "#grand.central.org:root.contrib." /afs/grand.central.org/contrib @@ -141,15 +241,7 @@ mount -t afs "#grand.central.org:root.service." /afs/grand.central.org/service mount -t afs "#grand.central.org:root.software." /afs/grand.central.org/software mount -t afs "#grand.central.org:root.user." /afs/grand.central.org/user -umount /afs/grand.central.org/user -umount /afs/grand.central.org/software -umount /afs/grand.central.org/service -umount /afs/grand.central.org/project -umount /afs/grand.central.org/doc -umount /afs/grand.central.org/contrib -umount /afs/grand.central.org/archive -umount /afs/grand.central.org -umount /afs/cambridge.redhat.com umount /afs rmmod kafs +rmmod rxkad rmmod rxrpc diff --git a/Documentation/filesystems/autofs4-mount-control.txt b/Documentation/filesystems/autofs4-mount-control.txt new file mode 100644 index 00000000000..aff22113a98 --- /dev/null +++ b/Documentation/filesystems/autofs4-mount-control.txt @@ -0,0 +1,393 @@ + +Miscellaneous Device control operations for the autofs4 kernel module +==================================================================== + +The problem +=========== + +There is a problem with active restarts in autofs (that is to say +restarting autofs when there are busy mounts). + +During normal operation autofs uses a file descriptor opened on the +directory that is being managed in order to be able to issue control +operations. Using a file descriptor gives ioctl operations access to +autofs specific information stored in the super block. The operations +are things such as setting an autofs mount catatonic, setting the +expire timeout and requesting expire checks. As is explained below, +certain types of autofs triggered mounts can end up covering an autofs +mount itself which prevents us being able to use open(2) to obtain a +file descriptor for these operations if we don't already have one open. + +Currently autofs uses "umount -l" (lazy umount) to clear active mounts +at restart. While using lazy umount works for most cases, anything that +needs to walk back up the mount tree to construct a path, such as +getcwd(2) and the proc file system /proc/<pid>/cwd, no longer works +because the point from which the path is constructed has been detached +from the mount tree. + +The actual problem with autofs is that it can't reconnect to existing +mounts. Immediately one thinks of just adding the ability to remount +autofs file systems would solve it, but alas, that can't work. This is +because autofs direct mounts and the implementation of "on demand mount +and expire" of nested mount trees have the file system mounted directly +on top of the mount trigger directory dentry. + +For example, there are two types of automount maps, direct (in the kernel +module source you will see a third type called an offset, which is just +a direct mount in disguise) and indirect. + +Here is a master map with direct and indirect map entries: + +/- /etc/auto.direct +/test /etc/auto.indirect + +and the corresponding map files: + +/etc/auto.direct: + +/automount/dparse/g6 budgie:/autofs/export1 +/automount/dparse/g1 shark:/autofs/export1 +and so on. + +/etc/auto.indirect: + +g1 shark:/autofs/export1 +g6 budgie:/autofs/export1 +and so on. + +For the above indirect map an autofs file system is mounted on /test and +mounts are triggered for each sub-directory key by the inode lookup +operation. So we see a mount of shark:/autofs/export1 on /test/g1, for +example. + +The way that direct mounts are handled is by making an autofs mount on +each full path, such as /automount/dparse/g1, and using it as a mount +trigger. So when we walk on the path we mount shark:/autofs/export1 "on +top of this mount point". Since these are always directories we can +use the follow_link inode operation to trigger the mount. + +But, each entry in direct and indirect maps can have offsets (making +them multi-mount map entries). + +For example, an indirect mount map entry could also be: + +g1 \ + / shark:/autofs/export5/testing/test \ + /s1 shark:/autofs/export/testing/test/s1 \ + /s2 shark:/autofs/export5/testing/test/s2 \ + /s1/ss1 shark:/autofs/export1 \ + /s2/ss2 shark:/autofs/export2 + +and a similarly a direct mount map entry could also be: + +/automount/dparse/g1 \ + / shark:/autofs/export5/testing/test \ + /s1 shark:/autofs/export/testing/test/s1 \ + /s2 shark:/autofs/export5/testing/test/s2 \ + /s1/ss1 shark:/autofs/export2 \ + /s2/ss2 shark:/autofs/export2 + +One of the issues with version 4 of autofs was that, when mounting an +entry with a large number of offsets, possibly with nesting, we needed +to mount and umount all of the offsets as a single unit. Not really a +problem, except for people with a large number of offsets in map entries. +This mechanism is used for the well known "hosts" map and we have seen +cases (in 2.4) where the available number of mounts are exhausted or +where the number of privileged ports available is exhausted. + +In version 5 we mount only as we go down the tree of offsets and +similarly for expiring them which resolves the above problem. There is +somewhat more detail to the implementation but it isn't needed for the +sake of the problem explanation. The one important detail is that these +offsets are implemented using the same mechanism as the direct mounts +above and so the mount points can be covered by a mount. + +The current autofs implementation uses an ioctl file descriptor opened +on the mount point for control operations. The references held by the +descriptor are accounted for in checks made to determine if a mount is +in use and is also used to access autofs file system information held +in the mount super block. So the use of a file handle needs to be +retained. + + +The Solution +============ + +To be able to restart autofs leaving existing direct, indirect and +offset mounts in place we need to be able to obtain a file handle +for these potentially covered autofs mount points. Rather than just +implement an isolated operation it was decided to re-implement the +existing ioctl interface and add new operations to provide this +functionality. + +In addition, to be able to reconstruct a mount tree that has busy mounts, +the uid and gid of the last user that triggered the mount needs to be +available because these can be used as macro substitution variables in +autofs maps. They are recorded at mount request time and an operation +has been added to retrieve them. + +Since we're re-implementing the control interface, a couple of other +problems with the existing interface have been addressed. First, when +a mount or expire operation completes a status is returned to the +kernel by either a "send ready" or a "send fail" operation. The +"send fail" operation of the ioctl interface could only ever send +ENOENT so the re-implementation allows user space to send an actual +status. Another expensive operation in user space, for those using +very large maps, is discovering if a mount is present. Usually this +involves scanning /proc/mounts and since it needs to be done quite +often it can introduce significant overhead when there are many entries +in the mount table. An operation to lookup the mount status of a mount +point dentry (covered or not) has also been added. + +Current kernel development policy recommends avoiding the use of the +ioctl mechanism in favor of systems such as Netlink. An implementation +using this system was attempted to evaluate its suitability and it was +found to be inadequate, in this case. The Generic Netlink system was +used for this as raw Netlink would lead to a significant increase in +complexity. There's no question that the Generic Netlink system is an +elegant solution for common case ioctl functions but it's not a complete +replacement probably because its primary purpose in life is to be a +message bus implementation rather than specifically an ioctl replacement. +While it would be possible to work around this there is one concern +that lead to the decision to not use it. This is that the autofs +expire in the daemon has become far to complex because umount +candidates are enumerated, almost for no other reason than to "count" +the number of times to call the expire ioctl. This involves scanning +the mount table which has proved to be a big overhead for users with +large maps. The best way to improve this is try and get back to the +way the expire was done long ago. That is, when an expire request is +issued for a mount (file handle) we should continually call back to +the daemon until we can't umount any more mounts, then return the +appropriate status to the daemon. At the moment we just expire one +mount at a time. A Generic Netlink implementation would exclude this +possibility for future development due to the requirements of the +message bus architecture. + + +autofs4 Miscellaneous Device mount control interface +==================================================== + +The control interface is opening a device node, typically /dev/autofs. + +All the ioctls use a common structure to pass the needed parameter +information and return operation results: + +struct autofs_dev_ioctl { + __u32 ver_major; + __u32 ver_minor; + __u32 size; /* total size of data passed in + * including this struct */ + __s32 ioctlfd; /* automount command fd */ + + __u32 arg1; /* Command parameters */ + __u32 arg2; + + char path[0]; +}; + +The ioctlfd field is a mount point file descriptor of an autofs mount +point. It is returned by the open call and is used by all calls except +the check for whether a given path is a mount point, where it may +optionally be used to check a specific mount corresponding to a given +mount point file descriptor, and when requesting the uid and gid of the +last successful mount on a directory within the autofs file system. + +The fields arg1 and arg2 are used to communicate parameters and results of +calls made as described below. + +The path field is used to pass a path where it is needed and the size field +is used account for the increased structure length when translating the +structure sent from user space. + +This structure can be initialized before setting specific fields by using +the void function call init_autofs_dev_ioctl(struct autofs_dev_ioctl *). + +All of the ioctls perform a copy of this structure from user space to +kernel space and return -EINVAL if the size parameter is smaller than +the structure size itself, -ENOMEM if the kernel memory allocation fails +or -EFAULT if the copy itself fails. Other checks include a version check +of the compiled in user space version against the module version and a +mismatch results in a -EINVAL return. If the size field is greater than +the structure size then a path is assumed to be present and is checked to +ensure it begins with a "/" and is NULL terminated, otherwise -EINVAL is +returned. Following these checks, for all ioctl commands except +AUTOFS_DEV_IOCTL_VERSION_CMD, AUTOFS_DEV_IOCTL_OPENMOUNT_CMD and +AUTOFS_DEV_IOCTL_CLOSEMOUNT_CMD the ioctlfd is validated and if it is +not a valid descriptor or doesn't correspond to an autofs mount point +an error of -EBADF, -ENOTTY or -EINVAL (not an autofs descriptor) is +returned. + + +The ioctls +========== + +An example of an implementation which uses this interface can be seen +in autofs version 5.0.4 and later in file lib/dev-ioctl-lib.c of the +distribution tar available for download from kernel.org in directory +/pub/linux/daemons/autofs/v5. + +The device node ioctl operations implemented by this interface are: + + +AUTOFS_DEV_IOCTL_VERSION +------------------------ + +Get the major and minor version of the autofs4 device ioctl kernel module +implementation. It requires an initialized struct autofs_dev_ioctl as an +input parameter and sets the version information in the passed in structure. +It returns 0 on success or the error -EINVAL if a version mismatch is +detected. + + +AUTOFS_DEV_IOCTL_PROTOVER_CMD and AUTOFS_DEV_IOCTL_PROTOSUBVER_CMD +------------------------------------------------------------------ + +Get the major and minor version of the autofs4 protocol version understood +by loaded module. This call requires an initialized struct autofs_dev_ioctl +with the ioctlfd field set to a valid autofs mount point descriptor +and sets the requested version number in structure field arg1. These +commands return 0 on success or one of the negative error codes if +validation fails. + + +AUTOFS_DEV_IOCTL_OPENMOUNT and AUTOFS_DEV_IOCTL_CLOSEMOUNT +---------------------------------------------------------- + +Obtain and release a file descriptor for an autofs managed mount point +path. The open call requires an initialized struct autofs_dev_ioctl with +the path field set and the size field adjusted appropriately as well +as the arg1 field set to the device number of the autofs mount. The +device number can be obtained from the mount options shown in +/proc/mounts. The close call requires an initialized struct +autofs_dev_ioct with the ioctlfd field set to the descriptor obtained +from the open call. The release of the file descriptor can also be done +with close(2) so any open descriptors will also be closed at process exit. +The close call is included in the implemented operations largely for +completeness and to provide for a consistent user space implementation. + + +AUTOFS_DEV_IOCTL_READY_CMD and AUTOFS_DEV_IOCTL_FAIL_CMD +-------------------------------------------------------- + +Return mount and expire result status from user space to the kernel. +Both of these calls require an initialized struct autofs_dev_ioctl +with the ioctlfd field set to the descriptor obtained from the open +call and the arg1 field set to the wait queue token number, received +by user space in the foregoing mount or expire request. The arg2 field +is set to the status to be returned. For the ready call this is always +0 and for the fail call it is set to the errno of the operation. + + +AUTOFS_DEV_IOCTL_SETPIPEFD_CMD +------------------------------ + +Set the pipe file descriptor used for kernel communication to the daemon. +Normally this is set at mount time using an option but when reconnecting +to a existing mount we need to use this to tell the autofs mount about +the new kernel pipe descriptor. In order to protect mounts against +incorrectly setting the pipe descriptor we also require that the autofs +mount be catatonic (see next call). + +The call requires an initialized struct autofs_dev_ioctl with the +ioctlfd field set to the descriptor obtained from the open call and +the arg1 field set to descriptor of the pipe. On success the call +also sets the process group id used to identify the controlling process +(eg. the owning automount(8) daemon) to the process group of the caller. + + +AUTOFS_DEV_IOCTL_CATATONIC_CMD +------------------------------ + +Make the autofs mount point catatonic. The autofs mount will no longer +issue mount requests, the kernel communication pipe descriptor is released +and any remaining waits in the queue released. + +The call requires an initialized struct autofs_dev_ioctl with the +ioctlfd field set to the descriptor obtained from the open call. + + +AUTOFS_DEV_IOCTL_TIMEOUT_CMD +---------------------------- + +Set the expire timeout for mounts within an autofs mount point. + +The call requires an initialized struct autofs_dev_ioctl with the +ioctlfd field set to the descriptor obtained from the open call. + + +AUTOFS_DEV_IOCTL_REQUESTER_CMD +------------------------------ + +Return the uid and gid of the last process to successfully trigger a the +mount on the given path dentry. + +The call requires an initialized struct autofs_dev_ioctl with the path +field set to the mount point in question and the size field adjusted +appropriately as well as the arg1 field set to the device number of the +containing autofs mount. Upon return the struct field arg1 contains the +uid and arg2 the gid. + +When reconstructing an autofs mount tree with active mounts we need to +re-connect to mounts that may have used the original process uid and +gid (or string variations of them) for mount lookups within the map entry. +This call provides the ability to obtain this uid and gid so they may be +used by user space for the mount map lookups. + + +AUTOFS_DEV_IOCTL_EXPIRE_CMD +--------------------------- + +Issue an expire request to the kernel for an autofs mount. Typically +this ioctl is called until no further expire candidates are found. + +The call requires an initialized struct autofs_dev_ioctl with the +ioctlfd field set to the descriptor obtained from the open call. In +addition an immediate expire, independent of the mount timeout, can be +requested by setting the arg1 field to 1. If no expire candidates can +be found the ioctl returns -1 with errno set to EAGAIN. + +This call causes the kernel module to check the mount corresponding +to the given ioctlfd for mounts that can be expired, issues an expire +request back to the daemon and waits for completion. + +AUTOFS_DEV_IOCTL_ASKUMOUNT_CMD +------------------------------ + +Checks if an autofs mount point is in use. + +The call requires an initialized struct autofs_dev_ioctl with the +ioctlfd field set to the descriptor obtained from the open call and +it returns the result in the arg1 field, 1 for busy and 0 otherwise. + + +AUTOFS_DEV_IOCTL_ISMOUNTPOINT_CMD +--------------------------------- + +Check if the given path is a mountpoint. + +The call requires an initialized struct autofs_dev_ioctl. There are two +possible variations. Both use the path field set to the path of the mount +point to check and the size field adjusted appropriately. One uses the +ioctlfd field to identify a specific mount point to check while the other +variation uses the path and optionally arg1 set to an autofs mount type. +The call returns 1 if this is a mount point and sets arg1 to the device +number of the mount and field arg2 to the relevant super block magic +number (described below) or 0 if it isn't a mountpoint. In both cases +the the device number (as returned by new_encode_dev()) is returned +in field arg1. + +If supplied with a file descriptor we're looking for a specific mount, +not necessarily at the top of the mounted stack. In this case the path +the descriptor corresponds to is considered a mountpoint if it is itself +a mountpoint or contains a mount, such as a multi-mount without a root +mount. In this case we return 1 if the descriptor corresponds to a mount +point and and also returns the super magic of the covering mount if there +is one or 0 if it isn't a mountpoint. + +If a path is supplied (and the ioctlfd field is set to -1) then the path +is looked up and is checked to see if it is the root of a mount. If a +type is also given we are looking for a particular autofs mount and if +a match isn't found a fail is returned. If the the located path is the +root of a mount 1 is returned along with the super magic of the mount +or 0 otherwise. + diff --git a/Documentation/filesystems/automount-support.txt b/Documentation/filesystems/automount-support.txt index 58c65a1713e..7cac200e2a8 100644 --- a/Documentation/filesystems/automount-support.txt +++ b/Documentation/filesystems/automount-support.txt @@ -19,7 +19,7 @@ following procedure: (2) Have the follow_link() op do the following steps: - (a) Call do_kern_mount() to call the appropriate filesystem to set up a + (a) Call vfs_kern_mount() to call the appropriate filesystem to set up a superblock and gain a vfsmount structure representing it. (b) Copy the nameidata provided as an argument and substitute the dentry diff --git a/Documentation/filesystems/befs.txt b/Documentation/filesystems/befs.txt index 877a7b1d46e..da45e6c842b 100644 --- a/Documentation/filesystems/befs.txt +++ b/Documentation/filesystems/befs.txt @@ -7,7 +7,7 @@ WARNING Make sure you understand that this is alpha software. This means that the implementation is neither complete nor well-tested. -I DISCLAIM ALL RESPONSIBILTY FOR ANY POSSIBLE BAD EFFECTS OF THIS CODE! +I DISCLAIM ALL RESPONSIBILITY FOR ANY POSSIBLE BAD EFFECTS OF THIS CODE! LICENSE ===== @@ -22,16 +22,16 @@ He has been working on the code since Aug 13, 2001. See the changelog for details. Original Author: Makoto Kato <m_kato@ga2.so-net.ne.jp> -His orriginal code can still be found at: +His original code can still be found at: <http://hp.vector.co.jp/authors/VA008030/bfs/> Does anyone know of a more current email address for Makoto? He doesn't respond to the address given above... -Current maintainer: Sergey S. Kostyliov <rathamahata@php4.ru> +This filesystem doesn't have a maintainer. WHAT IS THIS DRIVER? ================== -This module implements the native filesystem of BeOS <http://www.be.com/> +This module implements the native filesystem of BeOS http://www.beincorporated.com/ for the linux 2.4.1 and later kernels. Currently it is a read-only implementation. @@ -39,7 +39,7 @@ Which is it, BFS or BEFS? ================ Be, Inc said, "BeOS Filesystem is officially called BFS, not BeFS". But Unixware Boot Filesystem is called bfs, too. And they are already in -the kernel. Because of this nameing conflict, on Linux the BeOS +the kernel. Because of this naming conflict, on Linux the BeOS filesystem is called befs. HOW TO INSTALL @@ -57,11 +57,11 @@ if the patching step fails (i.e. there are rejected hunks), you can try to figure it out yourself (it shouldn't be hard), or mail the maintainer (Will Dyson <will_dyson@pobox.com>) for help. -step 2. Configuretion & make kernel +step 2. Configuration & make kernel The linux kernel has many compile-time options. Most of them are beyond the scope of this document. I suggest the Kernel-HOWTO document as a good general -reference on this topic. <http://www.linux.com/howto/Kernel-HOWTO.html> +reference on this topic. http://www.linuxdocs.org/HOWTOs/Kernel-HOWTO-4.html However, to use the BeFS module, you must enable it at configure time. diff --git a/Documentation/filesystems/bfs.txt b/Documentation/filesystems/bfs.txt index d2841e0bcf0..78043d5a8fc 100644 --- a/Documentation/filesystems/bfs.txt +++ b/Documentation/filesystems/bfs.txt @@ -26,11 +26,11 @@ You can simplify mounting by just typing: this will allocate the first available loopback device (and load loop.o kernel module if necessary) automatically. If the loopback driver is not -loaded automatically, make sure that your kernel is compiled with kmod -support (CONFIG_KMOD) enabled. Beware that umount will not -deallocate /dev/loopN device if /etc/mtab file on your system is a -symbolic link to /proc/mounts. You will need to do it manually using -"-d" switch of losetup(8). Read losetup(8) manpage for more info. +loaded automatically, make sure that you have compiled the module and +that modprobe is functioning. Beware that umount will not deallocate +/dev/loopN device if /etc/mtab file on your system is a symbolic link to +/proc/mounts. You will need to do it manually using "-d" switch of +losetup(8). Read losetup(8) manpage for more info. To create the BFS image under UnixWare you need to find out first which slice contains it. The command prtvtoc(1M) is your friend: @@ -54,4 +54,4 @@ The first 4 bytes should be 0x1badface. If you have any patches, questions or suggestions regarding this BFS implementation please contact the author: -Tigran A. Aivazian <tigran@veritas.com> +Tigran Aivazian <tigran@aivazian.fsnet.co.uk> diff --git a/Documentation/filesystems/btrfs.txt b/Documentation/filesystems/btrfs.txt new file mode 100644 index 00000000000..d11cc2f8077 --- /dev/null +++ b/Documentation/filesystems/btrfs.txt @@ -0,0 +1,270 @@ + +BTRFS +===== + +Btrfs is a copy on write filesystem for Linux aimed at +implementing advanced features while focusing on fault tolerance, +repair and easy administration. Initially developed by Oracle, Btrfs +is licensed under the GPL and open for contribution from anyone. + +Linux has a wealth of filesystems to choose from, but we are facing a +number of challenges with scaling to the large storage subsystems that +are becoming common in today's data centers. Filesystems need to scale +in their ability to address and manage large storage, and also in +their ability to detect, repair and tolerate errors in the data stored +on disk. Btrfs is under heavy development, and is not suitable for +any uses other than benchmarking and review. The Btrfs disk format is +not yet finalized. + +The main Btrfs features include: + + * Extent based file storage (2^64 max file size) + * Space efficient packing of small files + * Space efficient indexed directories + * Dynamic inode allocation + * Writable snapshots + * Subvolumes (separate internal filesystem roots) + * Object level mirroring and striping + * Checksums on data and metadata (multiple algorithms available) + * Compression + * Integrated multiple device support, with several raid algorithms + * Online filesystem check (not yet implemented) + * Very fast offline filesystem check + * Efficient incremental backup and FS mirroring (not yet implemented) + * Online filesystem defragmentation + + +Mount Options +============= + +When mounting a btrfs filesystem, the following option are accepted. +Options with (*) are default options and will not show in the mount options. + + alloc_start=<bytes> + Debugging option to force all block allocations above a certain + byte threshold on each block device. The value is specified in + bytes, optionally with a K, M, or G suffix, case insensitive. + Default is 1MB. + + noautodefrag(*) + autodefrag + Disable/enable auto defragmentation. + Auto defragmentation detects small random writes into files and queue + them up for the defrag process. Works best for small files; + Not well suited for large database workloads. + + check_int + check_int_data + check_int_print_mask=<value> + These debugging options control the behavior of the integrity checking + module (the BTRFS_FS_CHECK_INTEGRITY config option required). + + check_int enables the integrity checker module, which examines all + block write requests to ensure on-disk consistency, at a large + memory and CPU cost. + + check_int_data includes extent data in the integrity checks, and + implies the check_int option. + + check_int_print_mask takes a bitmask of BTRFSIC_PRINT_MASK_* values + as defined in fs/btrfs/check-integrity.c, to control the integrity + checker module behavior. + + See comments at the top of fs/btrfs/check-integrity.c for more info. + + commit=<seconds> + Set the interval of periodic commit, 30 seconds by default. Higher + values defer data being synced to permanent storage with obvious + consequences when the system crashes. The upper bound is not forced, + but a warning is printed if it's more than 300 seconds (5 minutes). + + compress + compress=<type> + compress-force + compress-force=<type> + Control BTRFS file data compression. Type may be specified as "zlib" + "lzo" or "no" (for no compression, used for remounting). If no type + is specified, zlib is used. If compress-force is specified, + all files will be compressed, whether or not they compress well. + If compression is enabled, nodatacow and nodatasum are disabled. + + degraded + Allow mounts to continue with missing devices. A read-write mount may + fail with too many devices missing, for example if a stripe member + is completely missing. + + device=<devicepath> + Specify a device during mount so that ioctls on the control device + can be avoided. Especially useful when trying to mount a multi-device + setup as root. May be specified multiple times for multiple devices. + + nodiscard(*) + discard + Disable/enable discard mount option. + Discard issues frequent commands to let the block device reclaim space + freed by the filesystem. + This is useful for SSD devices, thinly provisioned + LUNs and virtual machine images, but may have a significant + performance impact. (The fstrim command is also available to + initiate batch trims from userspace). + + noenospc_debug(*) + enospc_debug + Disable/enable debugging option to be more verbose in some ENOSPC conditions. + + fatal_errors=<action> + Action to take when encountering a fatal error: + "bug" - BUG() on a fatal error. This is the default. + "panic" - panic() on a fatal error. + + noflushoncommit(*) + flushoncommit + The 'flushoncommit' mount option forces any data dirtied by a write in a + prior transaction to commit as part of the current commit. This makes + the committed state a fully consistent view of the file system from the + application's perspective (i.e., it includes all completed file system + operations). This was previously the behavior only when a snapshot is + created. + + inode_cache + Enable free inode number caching. Defaults to off due to an overflow + problem when the free space crcs don't fit inside a single page. + + max_inline=<bytes> + Specify the maximum amount of space, in bytes, that can be inlined in + a metadata B-tree leaf. The value is specified in bytes, optionally + with a K, M, or G suffix, case insensitive. In practice, this value + is limited by the root sector size, with some space unavailable due + to leaf headers. For a 4k sectorsize, max inline data is ~3900 bytes. + + metadata_ratio=<value> + Specify that 1 metadata chunk should be allocated after every <value> + data chunks. Off by default. + + acl(*) + noacl + Enable/disable support for Posix Access Control Lists (ACLs). See the + acl(5) manual page for more information about ACLs. + + barrier(*) + nobarrier + Enable/disable the use of block layer write barriers. Write barriers + ensure that certain IOs make it through the device cache and are on + persistent storage. If disabled on a device with a volatile + (non-battery-backed) write-back cache, nobarrier option will lead to + filesystem corruption on a system crash or power loss. + + datacow(*) + nodatacow + Enable/disable data copy-on-write for newly created files. + Nodatacow implies nodatasum, and disables all compression. + + datasum(*) + nodatasum + Enable/disable data checksumming for newly created files. + Datasum implies datacow. + + treelog(*) + notreelog + Enable/disable the tree logging used for fsync and O_SYNC writes. + + recovery + Enable autorecovery attempts if a bad tree root is found at mount time. + Currently this scans a list of several previous tree roots and tries to + use the first readable. + + rescan_uuid_tree + Force check and rebuild procedure of the UUID tree. This should not + normally be needed. + + skip_balance + Skip automatic resume of interrupted balance operation after mount. + May be resumed with "btrfs balance resume." + + space_cache (*) + Enable the on-disk freespace cache. + nospace_cache + Disable freespace cache loading without clearing the cache. + clear_cache + Force clearing and rebuilding of the disk space cache if something + has gone wrong. + + ssd + nossd + ssd_spread + Options to control ssd allocation schemes. By default, BTRFS will + enable or disable ssd allocation heuristics depending on whether a + rotational or nonrotational disk is in use. The ssd and nossd options + can override this autodetection. + + The ssd_spread mount option attempts to allocate into big chunks + of unused space, and may perform better on low-end ssds. ssd_spread + implies ssd, enabling all other ssd heuristics as well. + + subvol=<path> + Mount subvolume at <path> rather than the root subvolume. <path> is + relative to the top level subvolume. + + subvolid=<ID> + Mount subvolume specified by an ID number rather than the root subvolume. + This allows mounting of subvolumes which are not in the root of the mounted + filesystem. + You can use "btrfs subvolume list" to see subvolume ID numbers. + + subvolrootid=<objectid> (deprecated) + Mount subvolume specified by <objectid> rather than the root subvolume. + This allows mounting of subvolumes which are not in the root of the mounted + filesystem. + You can use "btrfs subvolume show " to see the object ID for a subvolume. + + thread_pool=<number> + The number of worker threads to allocate. The default number is equal + to the number of CPUs + 2, or 8, whichever is smaller. + + user_subvol_rm_allowed + Allow subvolumes to be deleted by a non-root user. Use with caution. + +MAILING LIST +============ + +There is a Btrfs mailing list hosted on vger.kernel.org. You can +find details on how to subscribe here: + +http://vger.kernel.org/vger-lists.html#linux-btrfs + +Mailing list archives are available from gmane: + +http://dir.gmane.org/gmane.comp.file-systems.btrfs + + + +IRC +=== + +Discussion of Btrfs also occurs on the #btrfs channel of the Freenode +IRC network. + + + + UTILITIES + ========= + +Userspace tools for creating and manipulating Btrfs file systems are +available from the git repository at the following location: + + http://git.kernel.org/?p=linux/kernel/git/mason/btrfs-progs.git + git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-progs.git + +These include the following tools: + +* mkfs.btrfs: create a filesystem + +* btrfs: a single tool to manage the filesystems, refer to the manpage for more details + +* 'btrfsck' or 'btrfs check': do a consistency check of the filesystem + +Other tools for specific tasks: + +* btrfs-convert: in-place conversion from ext2/3/4 filesystems + +* btrfs-image: dump filesystem metadata for debugging diff --git a/Documentation/filesystems/caching/backend-api.txt b/Documentation/filesystems/caching/backend-api.txt new file mode 100644 index 00000000000..277d1e81067 --- /dev/null +++ b/Documentation/filesystems/caching/backend-api.txt @@ -0,0 +1,703 @@ + ========================== + FS-CACHE CACHE BACKEND API + ========================== + +The FS-Cache system provides an API by which actual caches can be supplied to +FS-Cache for it to then serve out to network filesystems and other interested +parties. + +This API is declared in <linux/fscache-cache.h>. + + +==================================== +INITIALISING AND REGISTERING A CACHE +==================================== + +To start off, a cache definition must be initialised and registered for each +cache the backend wants to make available. For instance, CacheFS does this in +the fill_super() operation on mounting. + +The cache definition (struct fscache_cache) should be initialised by calling: + + void fscache_init_cache(struct fscache_cache *cache, + struct fscache_cache_ops *ops, + const char *idfmt, + ...); + +Where: + + (*) "cache" is a pointer to the cache definition; + + (*) "ops" is a pointer to the table of operations that the backend supports on + this cache; and + + (*) "idfmt" is a format and printf-style arguments for constructing a label + for the cache. + + +The cache should then be registered with FS-Cache by passing a pointer to the +previously initialised cache definition to: + + int fscache_add_cache(struct fscache_cache *cache, + struct fscache_object *fsdef, + const char *tagname); + +Two extra arguments should also be supplied: + + (*) "fsdef" which should point to the object representation for the FS-Cache + master index in this cache. Netfs primary index entries will be created + here. FS-Cache keeps the caller's reference to the index object if + successful and will release it upon withdrawal of the cache. + + (*) "tagname" which, if given, should be a text string naming this cache. If + this is NULL, the identifier will be used instead. For CacheFS, the + identifier is set to name the underlying block device and the tag can be + supplied by mount. + +This function may return -ENOMEM if it ran out of memory or -EEXIST if the tag +is already in use. 0 will be returned on success. + + +===================== +UNREGISTERING A CACHE +===================== + +A cache can be withdrawn from the system by calling this function with a +pointer to the cache definition: + + void fscache_withdraw_cache(struct fscache_cache *cache); + +In CacheFS's case, this is called by put_super(). + + +======== +SECURITY +======== + +The cache methods are executed one of two contexts: + + (1) that of the userspace process that issued the netfs operation that caused + the cache method to be invoked, or + + (2) that of one of the processes in the FS-Cache thread pool. + +In either case, this may not be an appropriate context in which to access the +cache. + +The calling process's fsuid, fsgid and SELinux security identities may need to +be masqueraded for the duration of the cache driver's access to the cache. +This is left to the cache to handle; FS-Cache makes no effort in this regard. + + +=================================== +CONTROL AND STATISTICS PRESENTATION +=================================== + +The cache may present data to the outside world through FS-Cache's interfaces +in sysfs and procfs - the former for control and the latter for statistics. + +A sysfs directory called /sys/fs/fscache/<cachetag>/ is created if CONFIG_SYSFS +is enabled. This is accessible through the kobject struct fscache_cache::kobj +and is for use by the cache as it sees fit. + + +======================== +RELEVANT DATA STRUCTURES +======================== + + (*) Index/Data file FS-Cache representation cookie: + + struct fscache_cookie { + struct fscache_object_def *def; + struct fscache_netfs *netfs; + void *netfs_data; + ... + }; + + The fields that might be of use to the backend describe the object + definition, the netfs definition and the netfs's data for this cookie. + The object definition contain functions supplied by the netfs for loading + and matching index entries; these are required to provide some of the + cache operations. + + + (*) In-cache object representation: + + struct fscache_object { + int debug_id; + enum { + FSCACHE_OBJECT_RECYCLING, + ... + } state; + spinlock_t lock + struct fscache_cache *cache; + struct fscache_cookie *cookie; + ... + }; + + Structures of this type should be allocated by the cache backend and + passed to FS-Cache when requested by the appropriate cache operation. In + the case of CacheFS, they're embedded in CacheFS's internal object + structures. + + The debug_id is a simple integer that can be used in debugging messages + that refer to a particular object. In such a case it should be printed + using "OBJ%x" to be consistent with FS-Cache. + + Each object contains a pointer to the cookie that represents the object it + is backing. An object should retired when put_object() is called if it is + in state FSCACHE_OBJECT_RECYCLING. The fscache_object struct should be + initialised by calling fscache_object_init(object). + + + (*) FS-Cache operation record: + + struct fscache_operation { + atomic_t usage; + struct fscache_object *object; + unsigned long flags; + #define FSCACHE_OP_EXCLUSIVE + void (*processor)(struct fscache_operation *op); + void (*release)(struct fscache_operation *op); + ... + }; + + FS-Cache has a pool of threads that it uses to give CPU time to the + various asynchronous operations that need to be done as part of driving + the cache. These are represented by the above structure. The processor + method is called to give the op CPU time, and the release method to get + rid of it when its usage count reaches 0. + + An operation can be made exclusive upon an object by setting the + appropriate flag before enqueuing it with fscache_enqueue_operation(). If + an operation needs more processing time, it should be enqueued again. + + + (*) FS-Cache retrieval operation record: + + struct fscache_retrieval { + struct fscache_operation op; + struct address_space *mapping; + struct list_head *to_do; + ... + }; + + A structure of this type is allocated by FS-Cache to record retrieval and + allocation requests made by the netfs. This struct is then passed to the + backend to do the operation. The backend may get extra refs to it by + calling fscache_get_retrieval() and refs may be discarded by calling + fscache_put_retrieval(). + + A retrieval operation can be used by the backend to do retrieval work. To + do this, the retrieval->op.processor method pointer should be set + appropriately by the backend and fscache_enqueue_retrieval() called to + submit it to the thread pool. CacheFiles, for example, uses this to queue + page examination when it detects PG_lock being cleared. + + The to_do field is an empty list available for the cache backend to use as + it sees fit. + + + (*) FS-Cache storage operation record: + + struct fscache_storage { + struct fscache_operation op; + pgoff_t store_limit; + ... + }; + + A structure of this type is allocated by FS-Cache to record outstanding + writes to be made. FS-Cache itself enqueues this operation and invokes + the write_page() method on the object at appropriate times to effect + storage. + + +================ +CACHE OPERATIONS +================ + +The cache backend provides FS-Cache with a table of operations that can be +performed on the denizens of the cache. These are held in a structure of type: + + struct fscache_cache_ops + + (*) Name of cache provider [mandatory]: + + const char *name + + This isn't strictly an operation, but should be pointed at a string naming + the backend. + + + (*) Allocate a new object [mandatory]: + + struct fscache_object *(*alloc_object)(struct fscache_cache *cache, + struct fscache_cookie *cookie) + + This method is used to allocate a cache object representation to back a + cookie in a particular cache. fscache_object_init() should be called on + the object to initialise it prior to returning. + + This function may also be used to parse the index key to be used for + multiple lookup calls to turn it into a more convenient form. FS-Cache + will call the lookup_complete() method to allow the cache to release the + form once lookup is complete or aborted. + + + (*) Look up and create object [mandatory]: + + void (*lookup_object)(struct fscache_object *object) + + This method is used to look up an object, given that the object is already + allocated and attached to the cookie. This should instantiate that object + in the cache if it can. + + The method should call fscache_object_lookup_negative() as soon as + possible if it determines the object doesn't exist in the cache. If the + object is found to exist and the netfs indicates that it is valid then + fscache_obtained_object() should be called once the object is in a + position to have data stored in it. Similarly, fscache_obtained_object() + should also be called once a non-present object has been created. + + If a lookup error occurs, fscache_object_lookup_error() should be called + to abort the lookup of that object. + + + (*) Release lookup data [mandatory]: + + void (*lookup_complete)(struct fscache_object *object) + + This method is called to ask the cache to release any resources it was + using to perform a lookup. + + + (*) Increment object refcount [mandatory]: + + struct fscache_object *(*grab_object)(struct fscache_object *object) + + This method is called to increment the reference count on an object. It + may fail (for instance if the cache is being withdrawn) by returning NULL. + It should return the object pointer if successful. + + + (*) Lock/Unlock object [mandatory]: + + void (*lock_object)(struct fscache_object *object) + void (*unlock_object)(struct fscache_object *object) + + These methods are used to exclusively lock an object. It must be possible + to schedule with the lock held, so a spinlock isn't sufficient. + + + (*) Pin/Unpin object [optional]: + + int (*pin_object)(struct fscache_object *object) + void (*unpin_object)(struct fscache_object *object) + + These methods are used to pin an object into the cache. Once pinned an + object cannot be reclaimed to make space. Return -ENOSPC if there's not + enough space in the cache to permit this. + + + (*) Check coherency state of an object [mandatory]: + + int (*check_consistency)(struct fscache_object *object) + + This method is called to have the cache check the saved auxiliary data of + the object against the netfs's idea of the state. 0 should be returned + if they're consistent and -ESTALE otherwise. -ENOMEM and -ERESTARTSYS + may also be returned. + + (*) Update object [mandatory]: + + int (*update_object)(struct fscache_object *object) + + This is called to update the index entry for the specified object. The + new information should be in object->cookie->netfs_data. This can be + obtained by calling object->cookie->def->get_aux()/get_attr(). + + + (*) Invalidate data object [mandatory]: + + int (*invalidate_object)(struct fscache_operation *op) + + This is called to invalidate a data object (as pointed to by op->object). + All the data stored for this object should be discarded and an + attr_changed operation should be performed. The caller will follow up + with an object update operation. + + fscache_op_complete() must be called on op before returning. + + + (*) Discard object [mandatory]: + + void (*drop_object)(struct fscache_object *object) + + This method is called to indicate that an object has been unbound from its + cookie, and that the cache should release the object's resources and + retire it if it's in state FSCACHE_OBJECT_RECYCLING. + + This method should not attempt to release any references held by the + caller. The caller will invoke the put_object() method as appropriate. + + + (*) Release object reference [mandatory]: + + void (*put_object)(struct fscache_object *object) + + This method is used to discard a reference to an object. The object may + be freed when all the references to it are released. + + + (*) Synchronise a cache [mandatory]: + + void (*sync)(struct fscache_cache *cache) + + This is called to ask the backend to synchronise a cache with its backing + device. + + + (*) Dissociate a cache [mandatory]: + + void (*dissociate_pages)(struct fscache_cache *cache) + + This is called to ask a cache to perform any page dissociations as part of + cache withdrawal. + + + (*) Notification that the attributes on a netfs file changed [mandatory]: + + int (*attr_changed)(struct fscache_object *object); + + This is called to indicate to the cache that certain attributes on a netfs + file have changed (for example the maximum size a file may reach). The + cache can read these from the netfs by calling the cookie's get_attr() + method. + + The cache may use the file size information to reserve space on the cache. + It should also call fscache_set_store_limit() to indicate to FS-Cache the + highest byte it's willing to store for an object. + + This method may return -ve if an error occurred or the cache object cannot + be expanded. In such a case, the object will be withdrawn from service. + + This operation is run asynchronously from FS-Cache's thread pool, and + storage and retrieval operations from the netfs are excluded during the + execution of this operation. + + + (*) Reserve cache space for an object's data [optional]: + + int (*reserve_space)(struct fscache_object *object, loff_t size); + + This is called to request that cache space be reserved to hold the data + for an object and the metadata used to track it. Zero size should be + taken as request to cancel a reservation. + + This should return 0 if successful, -ENOSPC if there isn't enough space + available, or -ENOMEM or -EIO on other errors. + + The reservation may exceed the current size of the object, thus permitting + future expansion. If the amount of space consumed by an object would + exceed the reservation, it's permitted to refuse requests to allocate + pages, but not required. An object may be pruned down to its reservation + size if larger than that already. + + + (*) Request page be read from cache [mandatory]: + + int (*read_or_alloc_page)(struct fscache_retrieval *op, + struct page *page, + gfp_t gfp) + + This is called to attempt to read a netfs page from the cache, or to + reserve a backing block if not. FS-Cache will have done as much checking + as it can before calling, but most of the work belongs to the backend. + + If there's no page in the cache, then -ENODATA should be returned if the + backend managed to reserve a backing block; -ENOBUFS or -ENOMEM if it + didn't. + + If there is suitable data in the cache, then a read operation should be + queued and 0 returned. When the read finishes, fscache_end_io() should be + called. + + The fscache_mark_pages_cached() should be called for the page if any cache + metadata is retained. This will indicate to the netfs that the page needs + explicit uncaching. This operation takes a pagevec, thus allowing several + pages to be marked at once. + + The retrieval record pointed to by op should be retained for each page + queued and released when I/O on the page has been formally ended. + fscache_get/put_retrieval() are available for this purpose. + + The retrieval record may be used to get CPU time via the FS-Cache thread + pool. If this is desired, the op->op.processor should be set to point to + the appropriate processing routine, and fscache_enqueue_retrieval() should + be called at an appropriate point to request CPU time. For instance, the + retrieval routine could be enqueued upon the completion of a disk read. + The to_do field in the retrieval record is provided to aid in this. + + If an I/O error occurs, fscache_io_error() should be called and -ENOBUFS + returned if possible or fscache_end_io() called with a suitable error + code. + + fscache_put_retrieval() should be called after a page or pages are dealt + with. This will complete the operation when all pages are dealt with. + + + (*) Request pages be read from cache [mandatory]: + + int (*read_or_alloc_pages)(struct fscache_retrieval *op, + struct list_head *pages, + unsigned *nr_pages, + gfp_t gfp) + + This is like the read_or_alloc_page() method, except it is handed a list + of pages instead of one page. Any pages on which a read operation is + started must be added to the page cache for the specified mapping and also + to the LRU. Such pages must also be removed from the pages list and + *nr_pages decremented per page. + + If there was an error such as -ENOMEM, then that should be returned; else + if one or more pages couldn't be read or allocated, then -ENOBUFS should + be returned; else if one or more pages couldn't be read, then -ENODATA + should be returned. If all the pages are dispatched then 0 should be + returned. + + + (*) Request page be allocated in the cache [mandatory]: + + int (*allocate_page)(struct fscache_retrieval *op, + struct page *page, + gfp_t gfp) + + This is like the read_or_alloc_page() method, except that it shouldn't + read from the cache, even if there's data there that could be retrieved. + It should, however, set up any internal metadata required such that + the write_page() method can write to the cache. + + If there's no backing block available, then -ENOBUFS should be returned + (or -ENOMEM if there were other problems). If a block is successfully + allocated, then the netfs page should be marked and 0 returned. + + + (*) Request pages be allocated in the cache [mandatory]: + + int (*allocate_pages)(struct fscache_retrieval *op, + struct list_head *pages, + unsigned *nr_pages, + gfp_t gfp) + + This is an multiple page version of the allocate_page() method. pages and + nr_pages should be treated as for the read_or_alloc_pages() method. + + + (*) Request page be written to cache [mandatory]: + + int (*write_page)(struct fscache_storage *op, + struct page *page); + + This is called to write from a page on which there was a previously + successful read_or_alloc_page() call or similar. FS-Cache filters out + pages that don't have mappings. + + This method is called asynchronously from the FS-Cache thread pool. It is + not required to actually store anything, provided -ENODATA is then + returned to the next read of this page. + + If an error occurred, then a negative error code should be returned, + otherwise zero should be returned. FS-Cache will take appropriate action + in response to an error, such as withdrawing this object. + + If this method returns success then FS-Cache will inform the netfs + appropriately. + + + (*) Discard retained per-page metadata [mandatory]: + + void (*uncache_page)(struct fscache_object *object, struct page *page) + + This is called when a netfs page is being evicted from the pagecache. The + cache backend should tear down any internal representation or tracking it + maintains for this page. + + +================== +FS-CACHE UTILITIES +================== + +FS-Cache provides some utilities that a cache backend may make use of: + + (*) Note occurrence of an I/O error in a cache: + + void fscache_io_error(struct fscache_cache *cache) + + This tells FS-Cache that an I/O error occurred in the cache. After this + has been called, only resource dissociation operations (object and page + release) will be passed from the netfs to the cache backend for the + specified cache. + + This does not actually withdraw the cache. That must be done separately. + + + (*) Invoke the retrieval I/O completion function: + + void fscache_end_io(struct fscache_retrieval *op, struct page *page, + int error); + + This is called to note the end of an attempt to retrieve a page. The + error value should be 0 if successful and an error otherwise. + + + (*) Record that one or more pages being retrieved or allocated have been dealt + with: + + void fscache_retrieval_complete(struct fscache_retrieval *op, + int n_pages); + + This is called to record the fact that one or more pages have been dealt + with and are no longer the concern of this operation. When the number of + pages remaining in the operation reaches 0, the operation will be + completed. + + + (*) Record operation completion: + + void fscache_op_complete(struct fscache_operation *op); + + This is called to record the completion of an operation. This deducts + this operation from the parent object's run state, potentially permitting + one or more pending operations to start running. + + + (*) Set highest store limit: + + void fscache_set_store_limit(struct fscache_object *object, + loff_t i_size); + + This sets the limit FS-Cache imposes on the highest byte it's willing to + try and store for a netfs. Any page over this limit is automatically + rejected by fscache_read_alloc_page() and co with -ENOBUFS. + + + (*) Mark pages as being cached: + + void fscache_mark_pages_cached(struct fscache_retrieval *op, + struct pagevec *pagevec); + + This marks a set of pages as being cached. After this has been called, + the netfs must call fscache_uncache_page() to unmark the pages. + + + (*) Perform coherency check on an object: + + enum fscache_checkaux fscache_check_aux(struct fscache_object *object, + const void *data, + uint16_t datalen); + + This asks the netfs to perform a coherency check on an object that has + just been looked up. The cookie attached to the object will determine the + netfs to use. data and datalen should specify where the auxiliary data + retrieved from the cache can be found. + + One of three values will be returned: + + (*) FSCACHE_CHECKAUX_OKAY + + The coherency data indicates the object is valid as is. + + (*) FSCACHE_CHECKAUX_NEEDS_UPDATE + + The coherency data needs updating, but otherwise the object is + valid. + + (*) FSCACHE_CHECKAUX_OBSOLETE + + The coherency data indicates that the object is obsolete and should + be discarded. + + + (*) Initialise a freshly allocated object: + + void fscache_object_init(struct fscache_object *object); + + This initialises all the fields in an object representation. + + + (*) Indicate the destruction of an object: + + void fscache_object_destroyed(struct fscache_cache *cache); + + This must be called to inform FS-Cache that an object that belonged to a + cache has been destroyed and deallocated. This will allow continuation + of the cache withdrawal process when it is stopped pending destruction of + all the objects. + + + (*) Indicate negative lookup on an object: + + void fscache_object_lookup_negative(struct fscache_object *object); + + This is called to indicate to FS-Cache that a lookup process for an object + found a negative result. + + This changes the state of an object to permit reads pending on lookup + completion to go off and start fetching data from the netfs server as it's + known at this point that there can't be any data in the cache. + + This may be called multiple times on an object. Only the first call is + significant - all subsequent calls are ignored. + + + (*) Indicate an object has been obtained: + + void fscache_obtained_object(struct fscache_object *object); + + This is called to indicate to FS-Cache that a lookup process for an object + produced a positive result, or that an object was created. This should + only be called once for any particular object. + + This changes the state of an object to indicate: + + (1) if no call to fscache_object_lookup_negative() has been made on + this object, that there may be data available, and that reads can + now go and look for it; and + + (2) that writes may now proceed against this object. + + + (*) Indicate that object lookup failed: + + void fscache_object_lookup_error(struct fscache_object *object); + + This marks an object as having encountered a fatal error (usually EIO) + and causes it to move into a state whereby it will be withdrawn as soon + as possible. + + + (*) Get and release references on a retrieval record: + + void fscache_get_retrieval(struct fscache_retrieval *op); + void fscache_put_retrieval(struct fscache_retrieval *op); + + These two functions are used to retain a retrieval record whilst doing + asynchronous data retrieval and block allocation. + + + (*) Enqueue a retrieval record for processing. + + void fscache_enqueue_retrieval(struct fscache_retrieval *op); + + This enqueues a retrieval record for processing by the FS-Cache thread + pool. One of the threads in the pool will invoke the retrieval record's + op->op.processor callback function. This function may be called from + within the callback function. + + + (*) List of object state names: + + const char *fscache_object_states[]; + + For debugging purposes, this may be used to turn the state that an object + is in into a text string for display purposes. diff --git a/Documentation/filesystems/caching/cachefiles.txt b/Documentation/filesystems/caching/cachefiles.txt new file mode 100644 index 00000000000..748a1ae49e1 --- /dev/null +++ b/Documentation/filesystems/caching/cachefiles.txt @@ -0,0 +1,501 @@ + =============================================== + CacheFiles: CACHE ON ALREADY MOUNTED FILESYSTEM + =============================================== + +Contents: + + (*) Overview. + + (*) Requirements. + + (*) Configuration. + + (*) Starting the cache. + + (*) Things to avoid. + + (*) Cache culling. + + (*) Cache structure. + + (*) Security model and SELinux. + + (*) A note on security. + + (*) Statistical information. + + (*) Debugging. + + +======== +OVERVIEW +======== + +CacheFiles is a caching backend that's meant to use as a cache a directory on +an already mounted filesystem of a local type (such as Ext3). + +CacheFiles uses a userspace daemon to do some of the cache management - such as +reaping stale nodes and culling. This is called cachefilesd and lives in +/sbin. + +The filesystem and data integrity of the cache are only as good as those of the +filesystem providing the backing services. Note that CacheFiles does not +attempt to journal anything since the journalling interfaces of the various +filesystems are very specific in nature. + +CacheFiles creates a misc character device - "/dev/cachefiles" - that is used +to communication with the daemon. Only one thing may have this open at once, +and whilst it is open, a cache is at least partially in existence. The daemon +opens this and sends commands down it to control the cache. + +CacheFiles is currently limited to a single cache. + +CacheFiles attempts to maintain at least a certain percentage of free space on +the filesystem, shrinking the cache by culling the objects it contains to make +space if necessary - see the "Cache Culling" section. This means it can be +placed on the same medium as a live set of data, and will expand to make use of +spare space and automatically contract when the set of data requires more +space. + + +============ +REQUIREMENTS +============ + +The use of CacheFiles and its daemon requires the following features to be +available in the system and in the cache filesystem: + + - dnotify. + + - extended attributes (xattrs). + + - openat() and friends. + + - bmap() support on files in the filesystem (FIBMAP ioctl). + + - The use of bmap() to detect a partial page at the end of the file. + +It is strongly recommended that the "dir_index" option is enabled on Ext3 +filesystems being used as a cache. + + +============= +CONFIGURATION +============= + +The cache is configured by a script in /etc/cachefilesd.conf. These commands +set up cache ready for use. The following script commands are available: + + (*) brun <N>% + (*) bcull <N>% + (*) bstop <N>% + (*) frun <N>% + (*) fcull <N>% + (*) fstop <N>% + + Configure the culling limits. Optional. See the section on culling + The defaults are 7% (run), 5% (cull) and 1% (stop) respectively. + + The commands beginning with a 'b' are file space (block) limits, those + beginning with an 'f' are file count limits. + + (*) dir <path> + + Specify the directory containing the root of the cache. Mandatory. + + (*) tag <name> + + Specify a tag to FS-Cache to use in distinguishing multiple caches. + Optional. The default is "CacheFiles". + + (*) debug <mask> + + Specify a numeric bitmask to control debugging in the kernel module. + Optional. The default is zero (all off). The following values can be + OR'd into the mask to collect various information: + + 1 Turn on trace of function entry (_enter() macros) + 2 Turn on trace of function exit (_leave() macros) + 4 Turn on trace of internal debug points (_debug()) + + This mask can also be set through sysfs, eg: + + echo 5 >/sys/modules/cachefiles/parameters/debug + + +================== +STARTING THE CACHE +================== + +The cache is started by running the daemon. The daemon opens the cache device, +configures the cache and tells it to begin caching. At that point the cache +binds to fscache and the cache becomes live. + +The daemon is run as follows: + + /sbin/cachefilesd [-d]* [-s] [-n] [-f <configfile>] + +The flags are: + + (*) -d + + Increase the debugging level. This can be specified multiple times and + is cumulative with itself. + + (*) -s + + Send messages to stderr instead of syslog. + + (*) -n + + Don't daemonise and go into background. + + (*) -f <configfile> + + Use an alternative configuration file rather than the default one. + + +=============== +THINGS TO AVOID +=============== + +Do not mount other things within the cache as this will cause problems. The +kernel module contains its own very cut-down path walking facility that ignores +mountpoints, but the daemon can't avoid them. + +Do not create, rename or unlink files and directories in the cache whilst the +cache is active, as this may cause the state to become uncertain. + +Renaming files in the cache might make objects appear to be other objects (the +filename is part of the lookup key). + +Do not change or remove the extended attributes attached to cache files by the +cache as this will cause the cache state management to get confused. + +Do not create files or directories in the cache, lest the cache get confused or +serve incorrect data. + +Do not chmod files in the cache. The module creates things with minimal +permissions to prevent random users being able to access them directly. + + +============= +CACHE CULLING +============= + +The cache may need culling occasionally to make space. This involves +discarding objects from the cache that have been used less recently than +anything else. Culling is based on the access time of data objects. Empty +directories are culled if not in use. + +Cache culling is done on the basis of the percentage of blocks and the +percentage of files available in the underlying filesystem. There are six +"limits": + + (*) brun + (*) frun + + If the amount of free space and the number of available files in the cache + rises above both these limits, then culling is turned off. + + (*) bcull + (*) fcull + + If the amount of available space or the number of available files in the + cache falls below either of these limits, then culling is started. + + (*) bstop + (*) fstop + + If the amount of available space or the number of available files in the + cache falls below either of these limits, then no further allocation of + disk space or files is permitted until culling has raised things above + these limits again. + +These must be configured thusly: + + 0 <= bstop < bcull < brun < 100 + 0 <= fstop < fcull < frun < 100 + +Note that these are percentages of available space and available files, and do +_not_ appear as 100 minus the percentage displayed by the "df" program. + +The userspace daemon scans the cache to build up a table of cullable objects. +These are then culled in least recently used order. A new scan of the cache is +started as soon as space is made in the table. Objects will be skipped if +their atimes have changed or if the kernel module says it is still using them. + + +=============== +CACHE STRUCTURE +=============== + +The CacheFiles module will create two directories in the directory it was +given: + + (*) cache/ + + (*) graveyard/ + +The active cache objects all reside in the first directory. The CacheFiles +kernel module moves any retired or culled objects that it can't simply unlink +to the graveyard from which the daemon will actually delete them. + +The daemon uses dnotify to monitor the graveyard directory, and will delete +anything that appears therein. + + +The module represents index objects as directories with the filename "I..." or +"J...". Note that the "cache/" directory is itself a special index. + +Data objects are represented as files if they have no children, or directories +if they do. Their filenames all begin "D..." or "E...". If represented as a +directory, data objects will have a file in the directory called "data" that +actually holds the data. + +Special objects are similar to data objects, except their filenames begin +"S..." or "T...". + + +If an object has children, then it will be represented as a directory. +Immediately in the representative directory are a collection of directories +named for hash values of the child object keys with an '@' prepended. Into +this directory, if possible, will be placed the representations of the child +objects: + + INDEX INDEX INDEX DATA FILES + ========= ========== ================================= ================ + cache/@4a/I03nfs/@30/Ji000000000000000--fHg8hi8400 + cache/@4a/I03nfs/@30/Ji000000000000000--fHg8hi8400/@75/Es0g000w...DB1ry + cache/@4a/I03nfs/@30/Ji000000000000000--fHg8hi8400/@75/Es0g000w...N22ry + cache/@4a/I03nfs/@30/Ji000000000000000--fHg8hi8400/@75/Es0g000w...FP1ry + + +If the key is so long that it exceeds NAME_MAX with the decorations added on to +it, then it will be cut into pieces, the first few of which will be used to +make a nest of directories, and the last one of which will be the objects +inside the last directory. The names of the intermediate directories will have +'+' prepended: + + J1223/@23/+xy...z/+kl...m/Epqr + + +Note that keys are raw data, and not only may they exceed NAME_MAX in size, +they may also contain things like '/' and NUL characters, and so they may not +be suitable for turning directly into a filename. + +To handle this, CacheFiles will use a suitably printable filename directly and +"base-64" encode ones that aren't directly suitable. The two versions of +object filenames indicate the encoding: + + OBJECT TYPE PRINTABLE ENCODED + =============== =============== =============== + Index "I..." "J..." + Data "D..." "E..." + Special "S..." "T..." + +Intermediate directories are always "@" or "+" as appropriate. + + +Each object in the cache has an extended attribute label that holds the object +type ID (required to distinguish special objects) and the auxiliary data from +the netfs. The latter is used to detect stale objects in the cache and update +or retire them. + + +Note that CacheFiles will erase from the cache any file it doesn't recognise or +any file of an incorrect type (such as a FIFO file or a device file). + + +========================== +SECURITY MODEL AND SELINUX +========================== + +CacheFiles is implemented to deal properly with the LSM security features of +the Linux kernel and the SELinux facility. + +One of the problems that CacheFiles faces is that it is generally acting on +behalf of a process, and running in that process's context, and that includes a +security context that is not appropriate for accessing the cache - either +because the files in the cache are inaccessible to that process, or because if +the process creates a file in the cache, that file may be inaccessible to other +processes. + +The way CacheFiles works is to temporarily change the security context (fsuid, +fsgid and actor security label) that the process acts as - without changing the +security context of the process when it the target of an operation performed by +some other process (so signalling and suchlike still work correctly). + + +When the CacheFiles module is asked to bind to its cache, it: + + (1) Finds the security label attached to the root cache directory and uses + that as the security label with which it will create files. By default, + this is: + + cachefiles_var_t + + (2) Finds the security label of the process which issued the bind request + (presumed to be the cachefilesd daemon), which by default will be: + + cachefilesd_t + + and asks LSM to supply a security ID as which it should act given the + daemon's label. By default, this will be: + + cachefiles_kernel_t + + SELinux transitions the daemon's security ID to the module's security ID + based on a rule of this form in the policy. + + type_transition <daemon's-ID> kernel_t : process <module's-ID>; + + For instance: + + type_transition cachefilesd_t kernel_t : process cachefiles_kernel_t; + + +The module's security ID gives it permission to create, move and remove files +and directories in the cache, to find and access directories and files in the +cache, to set and access extended attributes on cache objects, and to read and +write files in the cache. + +The daemon's security ID gives it only a very restricted set of permissions: it +may scan directories, stat files and erase files and directories. It may +not read or write files in the cache, and so it is precluded from accessing the +data cached therein; nor is it permitted to create new files in the cache. + + +There are policy source files available in: + + http://people.redhat.com/~dhowells/fscache/cachefilesd-0.8.tar.bz2 + +and later versions. In that tarball, see the files: + + cachefilesd.te + cachefilesd.fc + cachefilesd.if + +They are built and installed directly by the RPM. + +If a non-RPM based system is being used, then copy the above files to their own +directory and run: + + make -f /usr/share/selinux/devel/Makefile + semodule -i cachefilesd.pp + +You will need checkpolicy and selinux-policy-devel installed prior to the +build. + + +By default, the cache is located in /var/fscache, but if it is desirable that +it should be elsewhere, than either the above policy files must be altered, or +an auxiliary policy must be installed to label the alternate location of the +cache. + +For instructions on how to add an auxiliary policy to enable the cache to be +located elsewhere when SELinux is in enforcing mode, please see: + + /usr/share/doc/cachefilesd-*/move-cache.txt + +When the cachefilesd rpm is installed; alternatively, the document can be found +in the sources. + + +================== +A NOTE ON SECURITY +================== + +CacheFiles makes use of the split security in the task_struct. It allocates +its own task_security structure, and redirects current->cred to point to it +when it acts on behalf of another process, in that process's context. + +The reason it does this is that it calls vfs_mkdir() and suchlike rather than +bypassing security and calling inode ops directly. Therefore the VFS and LSM +may deny the CacheFiles access to the cache data because under some +circumstances the caching code is running in the security context of whatever +process issued the original syscall on the netfs. + +Furthermore, should CacheFiles create a file or directory, the security +parameters with that object is created (UID, GID, security label) would be +derived from that process that issued the system call, thus potentially +preventing other processes from accessing the cache - including CacheFiles's +cache management daemon (cachefilesd). + +What is required is to temporarily override the security of the process that +issued the system call. We can't, however, just do an in-place change of the +security data as that affects the process as an object, not just as a subject. +This means it may lose signals or ptrace events for example, and affects what +the process looks like in /proc. + +So CacheFiles makes use of a logical split in the security between the +objective security (task->real_cred) and the subjective security (task->cred). +The objective security holds the intrinsic security properties of a process and +is never overridden. This is what appears in /proc, and is what is used when a +process is the target of an operation by some other process (SIGKILL for +example). + +The subjective security holds the active security properties of a process, and +may be overridden. This is not seen externally, and is used whan a process +acts upon another object, for example SIGKILLing another process or opening a +file. + +LSM hooks exist that allow SELinux (or Smack or whatever) to reject a request +for CacheFiles to run in a context of a specific security label, or to create +files and directories with another security label. + + +======================= +STATISTICAL INFORMATION +======================= + +If FS-Cache is compiled with the following option enabled: + + CONFIG_CACHEFILES_HISTOGRAM=y + +then it will gather certain statistics and display them through a proc file. + + (*) /proc/fs/cachefiles/histogram + + cat /proc/fs/cachefiles/histogram + JIFS SECS LOOKUPS MKDIRS CREATES + ===== ===== ========= ========= ========= + + This shows the breakdown of the number of times each amount of time + between 0 jiffies and HZ-1 jiffies a variety of tasks took to run. The + columns are as follows: + + COLUMN TIME MEASUREMENT + ======= ======================================================= + LOOKUPS Length of time to perform a lookup on the backing fs + MKDIRS Length of time to perform a mkdir on the backing fs + CREATES Length of time to perform a create on the backing fs + + Each row shows the number of events that took a particular range of times. + Each step is 1 jiffy in size. The JIFS column indicates the particular + jiffy range covered, and the SECS field the equivalent number of seconds. + + +========= +DEBUGGING +========= + +If CONFIG_CACHEFILES_DEBUG is enabled, the CacheFiles facility can have runtime +debugging enabled by adjusting the value in: + + /sys/module/cachefiles/parameters/debug + +This is a bitmask of debugging streams to enable: + + BIT VALUE STREAM POINT + ======= ======= =============================== ======================= + 0 1 General Function entry trace + 1 2 Function exit trace + 2 4 General + +The appropriate set of values should be OR'd together and the result written to +the control file. For example: + + echo $((1|4|8)) >/sys/module/cachefiles/parameters/debug + +will turn on all function entry debugging. diff --git a/Documentation/filesystems/caching/fscache.txt b/Documentation/filesystems/caching/fscache.txt new file mode 100644 index 00000000000..770267af5b3 --- /dev/null +++ b/Documentation/filesystems/caching/fscache.txt @@ -0,0 +1,443 @@ + ========================== + General Filesystem Caching + ========================== + +======== +OVERVIEW +======== + +This facility is a general purpose cache for network filesystems, though it +could be used for caching other things such as ISO9660 filesystems too. + +FS-Cache mediates between cache backends (such as CacheFS) and network +filesystems: + + +---------+ + | | +--------------+ + | NFS |--+ | | + | | | +-->| CacheFS | + +---------+ | +----------+ | | /dev/hda5 | + | | | | +--------------+ + +---------+ +-->| | | + | | | |--+ + | AFS |----->| FS-Cache | + | | | |--+ + +---------+ +-->| | | + | | | | +--------------+ + +---------+ | +----------+ | | | + | | | +-->| CacheFiles | + | ISOFS |--+ | /var/cache | + | | +--------------+ + +---------+ + +Or to look at it another way, FS-Cache is a module that provides a caching +facility to a network filesystem such that the cache is transparent to the +user: + + +---------+ + | | + | Server | + | | + +---------+ + | NETWORK + ~~~~~|~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + | + | +----------+ + V | | + +---------+ | | + | | | | + | NFS |----->| FS-Cache | + | | | |--+ + +---------+ | | | +--------------+ +--------------+ + | | | | | | | | + V +----------+ +-->| CacheFiles |-->| Ext3 | + +---------+ | /var/cache | | /dev/sda6 | + | | +--------------+ +--------------+ + | VFS | ^ ^ + | | | | + +---------+ +--------------+ | + | KERNEL SPACE | | + ~~~~~|~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~|~~~~~~|~~~~ + | USER SPACE | | + V | | + +---------+ +--------------+ + | | | | + | Process | | cachefilesd | + | | | | + +---------+ +--------------+ + + +FS-Cache does not follow the idea of completely loading every netfs file +opened in its entirety into a cache before permitting it to be accessed and +then serving the pages out of that cache rather than the netfs inode because: + + (1) It must be practical to operate without a cache. + + (2) The size of any accessible file must not be limited to the size of the + cache. + + (3) The combined size of all opened files (this includes mapped libraries) + must not be limited to the size of the cache. + + (4) The user should not be forced to download an entire file just to do a + one-off access of a small portion of it (such as might be done with the + "file" program). + +It instead serves the cache out in PAGE_SIZE chunks as and when requested by +the netfs('s) using it. + + +FS-Cache provides the following facilities: + + (1) More than one cache can be used at once. Caches can be selected + explicitly by use of tags. + + (2) Caches can be added / removed at any time. + + (3) The netfs is provided with an interface that allows either party to + withdraw caching facilities from a file (required for (2)). + + (4) The interface to the netfs returns as few errors as possible, preferring + rather to let the netfs remain oblivious. + + (5) Cookies are used to represent indices, files and other objects to the + netfs. The simplest cookie is just a NULL pointer - indicating nothing + cached there. + + (6) The netfs is allowed to propose - dynamically - any index hierarchy it + desires, though it must be aware that the index search function is + recursive, stack space is limited, and indices can only be children of + indices. + + (7) Data I/O is done direct to and from the netfs's pages. The netfs + indicates that page A is at index B of the data-file represented by cookie + C, and that it should be read or written. The cache backend may or may + not start I/O on that page, but if it does, a netfs callback will be + invoked to indicate completion. The I/O may be either synchronous or + asynchronous. + + (8) Cookies can be "retired" upon release. At this point FS-Cache will mark + them as obsolete and the index hierarchy rooted at that point will get + recycled. + + (9) The netfs provides a "match" function for index searches. In addition to + saying whether a match was made or not, this can also specify that an + entry should be updated or deleted. + +(10) As much as possible is done asynchronously. + + +FS-Cache maintains a virtual indexing tree in which all indices, files, objects +and pages are kept. Bits of this tree may actually reside in one or more +caches. + + FSDEF + | + +------------------------------------+ + | | + NFS AFS + | | + +--------------------------+ +-----------+ + | | | | + homedir mirror afs.org redhat.com + | | | + +------------+ +---------------+ +----------+ + | | | | | | + 00001 00002 00007 00125 vol00001 vol00002 + | | | | | + +---+---+ +-----+ +---+ +------+------+ +-----+----+ + | | | | | | | | | | | | | +PG0 PG1 PG2 PG0 XATTR PG0 PG1 DIRENT DIRENT DIRENT R/W R/O Bak + | | + PG0 +-------+ + | | + 00001 00003 + | + +---+---+ + | | | + PG0 PG1 PG2 + +In the example above, you can see two netfs's being backed: NFS and AFS. These +have different index hierarchies: + + (*) The NFS primary index contains per-server indices. Each server index is + indexed by NFS file handles to get data file objects. Each data file + objects can have an array of pages, but may also have further child + objects, such as extended attributes and directory entries. Extended + attribute objects themselves have page-array contents. + + (*) The AFS primary index contains per-cell indices. Each cell index contains + per-logical-volume indices. Each of volume index contains up to three + indices for the read-write, read-only and backup mirrors of those volumes. + Each of these contains vnode data file objects, each of which contains an + array of pages. + +The very top index is the FS-Cache master index in which individual netfs's +have entries. + +Any index object may reside in more than one cache, provided it only has index +children. Any index with non-index object children will be assumed to only +reside in one cache. + + +The netfs API to FS-Cache can be found in: + + Documentation/filesystems/caching/netfs-api.txt + +The cache backend API to FS-Cache can be found in: + + Documentation/filesystems/caching/backend-api.txt + +A description of the internal representations and object state machine can be +found in: + + Documentation/filesystems/caching/object.txt + + +======================= +STATISTICAL INFORMATION +======================= + +If FS-Cache is compiled with the following options enabled: + + CONFIG_FSCACHE_STATS=y + CONFIG_FSCACHE_HISTOGRAM=y + +then it will gather certain statistics and display them through a number of +proc files. + + (*) /proc/fs/fscache/stats + + This shows counts of a number of events that can happen in FS-Cache: + + CLASS EVENT MEANING + ======= ======= ======================================================= + Cookies idx=N Number of index cookies allocated + dat=N Number of data storage cookies allocated + spc=N Number of special cookies allocated + Objects alc=N Number of objects allocated + nal=N Number of object allocation failures + avl=N Number of objects that reached the available state + ded=N Number of objects that reached the dead state + ChkAux non=N Number of objects that didn't have a coherency check + ok=N Number of objects that passed a coherency check + upd=N Number of objects that needed a coherency data update + obs=N Number of objects that were declared obsolete + Pages mrk=N Number of pages marked as being cached + unc=N Number of uncache page requests seen + Acquire n=N Number of acquire cookie requests seen + nul=N Number of acq reqs given a NULL parent + noc=N Number of acq reqs rejected due to no cache available + ok=N Number of acq reqs succeeded + nbf=N Number of acq reqs rejected due to error + oom=N Number of acq reqs failed on ENOMEM + Lookups n=N Number of lookup calls made on cache backends + neg=N Number of negative lookups made + pos=N Number of positive lookups made + crt=N Number of objects created by lookup + tmo=N Number of lookups timed out and requeued + Updates n=N Number of update cookie requests seen + nul=N Number of upd reqs given a NULL parent + run=N Number of upd reqs granted CPU time + Relinqs n=N Number of relinquish cookie requests seen + nul=N Number of rlq reqs given a NULL parent + wcr=N Number of rlq reqs waited on completion of creation + AttrChg n=N Number of attribute changed requests seen + ok=N Number of attr changed requests queued + nbf=N Number of attr changed rejected -ENOBUFS + oom=N Number of attr changed failed -ENOMEM + run=N Number of attr changed ops given CPU time + Allocs n=N Number of allocation requests seen + ok=N Number of successful alloc reqs + wt=N Number of alloc reqs that waited on lookup completion + nbf=N Number of alloc reqs rejected -ENOBUFS + int=N Number of alloc reqs aborted -ERESTARTSYS + ops=N Number of alloc reqs submitted + owt=N Number of alloc reqs waited for CPU time + abt=N Number of alloc reqs aborted due to object death + Retrvls n=N Number of retrieval (read) requests seen + ok=N Number of successful retr reqs + wt=N Number of retr reqs that waited on lookup completion + nod=N Number of retr reqs returned -ENODATA + nbf=N Number of retr reqs rejected -ENOBUFS + int=N Number of retr reqs aborted -ERESTARTSYS + oom=N Number of retr reqs failed -ENOMEM + ops=N Number of retr reqs submitted + owt=N Number of retr reqs waited for CPU time + abt=N Number of retr reqs aborted due to object death + Stores n=N Number of storage (write) requests seen + ok=N Number of successful store reqs + agn=N Number of store reqs on a page already pending storage + nbf=N Number of store reqs rejected -ENOBUFS + oom=N Number of store reqs failed -ENOMEM + ops=N Number of store reqs submitted + run=N Number of store reqs granted CPU time + pgs=N Number of pages given store req processing time + rxd=N Number of store reqs deleted from tracking tree + olm=N Number of store reqs over store limit + VmScan nos=N Number of release reqs against pages with no pending store + gon=N Number of release reqs against pages stored by time lock granted + bsy=N Number of release reqs ignored due to in-progress store + can=N Number of page stores cancelled due to release req + Ops pend=N Number of times async ops added to pending queues + run=N Number of times async ops given CPU time + enq=N Number of times async ops queued for processing + can=N Number of async ops cancelled + rej=N Number of async ops rejected due to object lookup/create failure + dfr=N Number of async ops queued for deferred release + rel=N Number of async ops released + gc=N Number of deferred-release async ops garbage collected + CacheOp alo=N Number of in-progress alloc_object() cache ops + luo=N Number of in-progress lookup_object() cache ops + luc=N Number of in-progress lookup_complete() cache ops + gro=N Number of in-progress grab_object() cache ops + upo=N Number of in-progress update_object() cache ops + dro=N Number of in-progress drop_object() cache ops + pto=N Number of in-progress put_object() cache ops + syn=N Number of in-progress sync_cache() cache ops + atc=N Number of in-progress attr_changed() cache ops + rap=N Number of in-progress read_or_alloc_page() cache ops + ras=N Number of in-progress read_or_alloc_pages() cache ops + alp=N Number of in-progress allocate_page() cache ops + als=N Number of in-progress allocate_pages() cache ops + wrp=N Number of in-progress write_page() cache ops + ucp=N Number of in-progress uncache_page() cache ops + dsp=N Number of in-progress dissociate_pages() cache ops + + + (*) /proc/fs/fscache/histogram + + cat /proc/fs/fscache/histogram + JIFS SECS OBJ INST OP RUNS OBJ RUNS RETRV DLY RETRIEVLS + ===== ===== ========= ========= ========= ========= ========= + + This shows the breakdown of the number of times each amount of time + between 0 jiffies and HZ-1 jiffies a variety of tasks took to run. The + columns are as follows: + + COLUMN TIME MEASUREMENT + ======= ======================================================= + OBJ INST Length of time to instantiate an object + OP RUNS Length of time a call to process an operation took + OBJ RUNS Length of time a call to process an object event took + RETRV DLY Time between an requesting a read and lookup completing + RETRIEVLS Time between beginning and end of a retrieval + + Each row shows the number of events that took a particular range of times. + Each step is 1 jiffy in size. The JIFS column indicates the particular + jiffy range covered, and the SECS field the equivalent number of seconds. + + +=========== +OBJECT LIST +=========== + +If CONFIG_FSCACHE_OBJECT_LIST is enabled, the FS-Cache facility will maintain a +list of all the objects currently allocated and allow them to be viewed +through: + + /proc/fs/fscache/objects + +This will look something like: + + [root@andromeda ~]# head /proc/fs/fscache/objects + OBJECT PARENT STAT CHLDN OPS OOP IPR EX READS EM EV F S | NETFS_COOKIE_DEF TY FL NETFS_DATA OBJECT_KEY, AUX_DATA + ======== ======== ==== ===== === === === == ===== == == = = | ================ == == ================ ================ + 17e4b 2 ACTV 0 0 0 0 0 0 7b 4 0 0 | NFS.fh DT 0 ffff88001dd82820 010006017edcf8bbc93b43298fdfbe71e50b57b13a172c0117f38472, e567634700000000000000000000000063f2404a000000000000000000000000c9030000000000000000000063f2404a + 1693a 2 ACTV 0 0 0 0 0 0 7b 4 0 0 | NFS.fh DT 0 ffff88002db23380 010006017edcf8bbc93b43298fdfbe71e50b57b1e0162c01a2df0ea6, 420ebc4a000000000000000000000000420ebc4a0000000000000000000000000e1801000000000000000000420ebc4a + +where the first set of columns before the '|' describe the object: + + COLUMN DESCRIPTION + ======= =============================================================== + OBJECT Object debugging ID (appears as OBJ%x in some debug messages) + PARENT Debugging ID of parent object + STAT Object state + CHLDN Number of child objects of this object + OPS Number of outstanding operations on this object + OOP Number of outstanding child object management operations + IPR + EX Number of outstanding exclusive operations + READS Number of outstanding read operations + EM Object's event mask + EV Events raised on this object + F Object flags + S Object work item busy state mask (1:pending 2:running) + +and the second set of columns describe the object's cookie, if present: + + COLUMN DESCRIPTION + =============== ======================================================= + NETFS_COOKIE_DEF Name of netfs cookie definition + TY Cookie type (IX - index, DT - data, hex - special) + FL Cookie flags + NETFS_DATA Netfs private data stored in the cookie + OBJECT_KEY Object key } 1 column, with separating comma + AUX_DATA Object aux data } presence may be configured + +The data shown may be filtered by attaching the a key to an appropriate keyring +before viewing the file. Something like: + + keyctl add user fscache:objlist <restrictions> @s + +where <restrictions> are a selection of the following letters: + + K Show hexdump of object key (don't show if not given) + A Show hexdump of object aux data (don't show if not given) + +and the following paired letters: + + C Show objects that have a cookie + c Show objects that don't have a cookie + B Show objects that are busy + b Show objects that aren't busy + W Show objects that have pending writes + w Show objects that don't have pending writes + R Show objects that have outstanding reads + r Show objects that don't have outstanding reads + S Show objects that have work queued + s Show objects that don't have work queued + +If neither side of a letter pair is given, then both are implied. For example: + + keyctl add user fscache:objlist KB @s + +shows objects that are busy, and lists their object keys, but does not dump +their auxiliary data. It also implies "CcWwRrSs", but as 'B' is given, 'b' is +not implied. + +By default all objects and all fields will be shown. + + +========= +DEBUGGING +========= + +If CONFIG_FSCACHE_DEBUG is enabled, the FS-Cache facility can have runtime +debugging enabled by adjusting the value in: + + /sys/module/fscache/parameters/debug + +This is a bitmask of debugging streams to enable: + + BIT VALUE STREAM POINT + ======= ======= =============================== ======================= + 0 1 Cache management Function entry trace + 1 2 Function exit trace + 2 4 General + 3 8 Cookie management Function entry trace + 4 16 Function exit trace + 5 32 General + 6 64 Page handling Function entry trace + 7 128 Function exit trace + 8 256 General + 9 512 Operation management Function entry trace + 10 1024 Function exit trace + 11 2048 General + +The appropriate set of values should be OR'd together and the result written to +the control file. For example: + + echo $((1|8|64)) >/sys/module/fscache/parameters/debug + +will turn on all function entry debugging. diff --git a/Documentation/filesystems/caching/netfs-api.txt b/Documentation/filesystems/caching/netfs-api.txt new file mode 100644 index 00000000000..aed6b94160b --- /dev/null +++ b/Documentation/filesystems/caching/netfs-api.txt @@ -0,0 +1,915 @@ + =============================== + FS-CACHE NETWORK FILESYSTEM API + =============================== + +There's an API by which a network filesystem can make use of the FS-Cache +facilities. This is based around a number of principles: + + (1) Caches can store a number of different object types. There are two main + object types: indices and files. The first is a special type used by + FS-Cache to make finding objects faster and to make retiring of groups of + objects easier. + + (2) Every index, file or other object is represented by a cookie. This cookie + may or may not have anything associated with it, but the netfs doesn't + need to care. + + (3) Barring the top-level index (one entry per cached netfs), the index + hierarchy for each netfs is structured according the whim of the netfs. + +This API is declared in <linux/fscache.h>. + +This document contains the following sections: + + (1) Network filesystem definition + (2) Index definition + (3) Object definition + (4) Network filesystem (un)registration + (5) Cache tag lookup + (6) Index registration + (7) Data file registration + (8) Miscellaneous object registration + (9) Setting the data file size + (10) Page alloc/read/write + (11) Page uncaching + (12) Index and data file consistency + (13) Cookie enablement + (14) Miscellaneous cookie operations + (15) Cookie unregistration + (16) Index invalidation + (17) Data file invalidation + (18) FS-Cache specific page flags. + + +============================= +NETWORK FILESYSTEM DEFINITION +============================= + +FS-Cache needs a description of the network filesystem. This is specified +using a record of the following structure: + + struct fscache_netfs { + uint32_t version; + const char *name; + struct fscache_cookie *primary_index; + ... + }; + +This first two fields should be filled in before registration, and the third +will be filled in by the registration function; any other fields should just be +ignored and are for internal use only. + +The fields are: + + (1) The name of the netfs (used as the key in the toplevel index). + + (2) The version of the netfs (if the name matches but the version doesn't, the + entire in-cache hierarchy for this netfs will be scrapped and begun + afresh). + + (3) The cookie representing the primary index will be allocated according to + another parameter passed into the registration function. + +For example, kAFS (linux/fs/afs/) uses the following definitions to describe +itself: + + struct fscache_netfs afs_cache_netfs = { + .version = 0, + .name = "afs", + }; + + +================ +INDEX DEFINITION +================ + +Indices are used for two purposes: + + (1) To aid the finding of a file based on a series of keys (such as AFS's + "cell", "volume ID", "vnode ID"). + + (2) To make it easier to discard a subset of all the files cached based around + a particular key - for instance to mirror the removal of an AFS volume. + +However, since it's unlikely that any two netfs's are going to want to define +their index hierarchies in quite the same way, FS-Cache tries to impose as few +restraints as possible on how an index is structured and where it is placed in +the tree. The netfs can even mix indices and data files at the same level, but +it's not recommended. + +Each index entry consists of a key of indeterminate length plus some auxiliary +data, also of indeterminate length. + +There are some limits on indices: + + (1) Any index containing non-index objects should be restricted to a single + cache. Any such objects created within an index will be created in the + first cache only. The cache in which an index is created can be + controlled by cache tags (see below). + + (2) The entry data must be atomically journallable, so it is limited to about + 400 bytes at present. At least 400 bytes will be available. + + (3) The depth of the index tree should be judged with care as the search + function is recursive. Too many layers will run the kernel out of stack. + + +================= +OBJECT DEFINITION +================= + +To define an object, a structure of the following type should be filled out: + + struct fscache_cookie_def + { + uint8_t name[16]; + uint8_t type; + + struct fscache_cache_tag *(*select_cache)( + const void *parent_netfs_data, + const void *cookie_netfs_data); + + uint16_t (*get_key)(const void *cookie_netfs_data, + void *buffer, + uint16_t bufmax); + + void (*get_attr)(const void *cookie_netfs_data, + uint64_t *size); + + uint16_t (*get_aux)(const void *cookie_netfs_data, + void *buffer, + uint16_t bufmax); + + enum fscache_checkaux (*check_aux)(void *cookie_netfs_data, + const void *data, + uint16_t datalen); + + void (*get_context)(void *cookie_netfs_data, void *context); + + void (*put_context)(void *cookie_netfs_data, void *context); + + void (*mark_pages_cached)(void *cookie_netfs_data, + struct address_space *mapping, + struct pagevec *cached_pvec); + + void (*now_uncached)(void *cookie_netfs_data); + }; + +This has the following fields: + + (1) The type of the object [mandatory]. + + This is one of the following values: + + (*) FSCACHE_COOKIE_TYPE_INDEX + + This defines an index, which is a special FS-Cache type. + + (*) FSCACHE_COOKIE_TYPE_DATAFILE + + This defines an ordinary data file. + + (*) Any other value between 2 and 255 + + This defines an extraordinary object such as an XATTR. + + (2) The name of the object type (NUL terminated unless all 16 chars are used) + [optional]. + + (3) A function to select the cache in which to store an index [optional]. + + This function is invoked when an index needs to be instantiated in a cache + during the instantiation of a non-index object. Only the immediate index + parent for the non-index object will be queried. Any indices above that + in the hierarchy may be stored in multiple caches. This function does not + need to be supplied for any non-index object or any index that will only + have index children. + + If this function is not supplied or if it returns NULL then the first + cache in the parent's list will be chosen, or failing that, the first + cache in the master list. + + (4) A function to retrieve an object's key from the netfs [mandatory]. + + This function will be called with the netfs data that was passed to the + cookie acquisition function and the maximum length of key data that it may + provide. It should write the required key data into the given buffer and + return the quantity it wrote. + + (5) A function to retrieve attribute data from the netfs [optional]. + + This function will be called with the netfs data that was passed to the + cookie acquisition function. It should return the size of the file if + this is a data file. The size may be used to govern how much cache must + be reserved for this file in the cache. + + If the function is absent, a file size of 0 is assumed. + + (6) A function to retrieve auxiliary data from the netfs [optional]. + + This function will be called with the netfs data that was passed to the + cookie acquisition function and the maximum length of auxiliary data that + it may provide. It should write the auxiliary data into the given buffer + and return the quantity it wrote. + + If this function is absent, the auxiliary data length will be set to 0. + + The length of the auxiliary data buffer may be dependent on the key + length. A netfs mustn't rely on being able to provide more than 400 bytes + for both. + + (7) A function to check the auxiliary data [optional]. + + This function will be called to check that a match found in the cache for + this object is valid. For instance with AFS it could check the auxiliary + data against the data version number returned by the server to determine + whether the index entry in a cache is still valid. + + If this function is absent, it will be assumed that matching objects in a + cache are always valid. + + If present, the function should return one of the following values: + + (*) FSCACHE_CHECKAUX_OKAY - the entry is okay as is + (*) FSCACHE_CHECKAUX_NEEDS_UPDATE - the entry requires update + (*) FSCACHE_CHECKAUX_OBSOLETE - the entry should be deleted + + This function can also be used to extract data from the auxiliary data in + the cache and copy it into the netfs's structures. + + (8) A pair of functions to manage contexts for the completion callback + [optional]. + + The cache read/write functions are passed a context which is then passed + to the I/O completion callback function. To ensure this context remains + valid until after the I/O completion is called, two functions may be + provided: one to get an extra reference on the context, and one to drop a + reference to it. + + If the context is not used or is a type of object that won't go out of + scope, then these functions are not required. These functions are not + required for indices as indices may not contain data. These functions may + be called in interrupt context and so may not sleep. + + (9) A function to mark a page as retaining cache metadata [optional]. + + This is called by the cache to indicate that it is retaining in-memory + information for this page and that the netfs should uncache the page when + it has finished. This does not indicate whether there's data on the disk + or not. Note that several pages at once may be presented for marking. + + The PG_fscache bit is set on the pages before this function would be + called, so the function need not be provided if this is sufficient. + + This function is not required for indices as they're not permitted data. + +(10) A function to unmark all the pages retaining cache metadata [mandatory]. + + This is called by FS-Cache to indicate that a backing store is being + unbound from a cookie and that all the marks on the pages should be + cleared to prevent confusion. Note that the cache will have torn down all + its tracking information so that the pages don't need to be explicitly + uncached. + + This function is not required for indices as they're not permitted data. + + +=================================== +NETWORK FILESYSTEM (UN)REGISTRATION +=================================== + +The first step is to declare the network filesystem to the cache. This also +involves specifying the layout of the primary index (for AFS, this would be the +"cell" level). + +The registration function is: + + int fscache_register_netfs(struct fscache_netfs *netfs); + +It just takes a pointer to the netfs definition. It returns 0 or an error as +appropriate. + +For kAFS, registration is done as follows: + + ret = fscache_register_netfs(&afs_cache_netfs); + +The last step is, of course, unregistration: + + void fscache_unregister_netfs(struct fscache_netfs *netfs); + + +================ +CACHE TAG LOOKUP +================ + +FS-Cache permits the use of more than one cache. To permit particular index +subtrees to be bound to particular caches, the second step is to look up cache +representation tags. This step is optional; it can be left entirely up to +FS-Cache as to which cache should be used. The problem with doing that is that +FS-Cache will always pick the first cache that was registered. + +To get the representation for a named tag: + + struct fscache_cache_tag *fscache_lookup_cache_tag(const char *name); + +This takes a text string as the name and returns a representation of a tag. It +will never return an error. It may return a dummy tag, however, if it runs out +of memory; this will inhibit caching with this tag. + +Any representation so obtained must be released by passing it to this function: + + void fscache_release_cache_tag(struct fscache_cache_tag *tag); + +The tag will be retrieved by FS-Cache when it calls the object definition +operation select_cache(). + + +================== +INDEX REGISTRATION +================== + +The third step is to inform FS-Cache about part of an index hierarchy that can +be used to locate files. This is done by requesting a cookie for each index in +the path to the file: + + struct fscache_cookie * + fscache_acquire_cookie(struct fscache_cookie *parent, + const struct fscache_object_def *def, + void *netfs_data, + bool enable); + +This function creates an index entry in the index represented by parent, +filling in the index entry by calling the operations pointed to by def. + +Note that this function never returns an error - all errors are handled +internally. It may, however, return NULL to indicate no cookie. It is quite +acceptable to pass this token back to this function as the parent to another +acquisition (or even to the relinquish cookie, read page and write page +functions - see below). + +Note also that no indices are actually created in a cache until a non-index +object needs to be created somewhere down the hierarchy. Furthermore, an index +may be created in several different caches independently at different times. +This is all handled transparently, and the netfs doesn't see any of it. + +A cookie will be created in the disabled state if enabled is false. A cookie +must be enabled to do anything with it. A disabled cookie can be enabled by +calling fscache_enable_cookie() (see below). + +For example, with AFS, a cell would be added to the primary index. This index +entry would have a dependent inode containing a volume location index for the +volume mappings within this cell: + + cell->cache = + fscache_acquire_cookie(afs_cache_netfs.primary_index, + &afs_cell_cache_index_def, + cell, true); + +Then when a volume location was accessed, it would be entered into the cell's +index and an inode would be allocated that acts as a volume type and hash chain +combination: + + vlocation->cache = + fscache_acquire_cookie(cell->cache, + &afs_vlocation_cache_index_def, + vlocation, true); + +And then a particular flavour of volume (R/O for example) could be added to +that index, creating another index for vnodes (AFS inode equivalents): + + volume->cache = + fscache_acquire_cookie(vlocation->cache, + &afs_volume_cache_index_def, + volume, true); + + +====================== +DATA FILE REGISTRATION +====================== + +The fourth step is to request a data file be created in the cache. This is +identical to index cookie acquisition. The only difference is that the type in +the object definition should be something other than index type. + + vnode->cache = + fscache_acquire_cookie(volume->cache, + &afs_vnode_cache_object_def, + vnode, true); + + +================================= +MISCELLANEOUS OBJECT REGISTRATION +================================= + +An optional step is to request an object of miscellaneous type be created in +the cache. This is almost identical to index cookie acquisition. The only +difference is that the type in the object definition should be something other +than index type. Whilst the parent object could be an index, it's more likely +it would be some other type of object such as a data file. + + xattr->cache = + fscache_acquire_cookie(vnode->cache, + &afs_xattr_cache_object_def, + xattr, true); + +Miscellaneous objects might be used to store extended attributes or directory +entries for example. + + +========================== +SETTING THE DATA FILE SIZE +========================== + +The fifth step is to set the physical attributes of the file, such as its size. +This doesn't automatically reserve any space in the cache, but permits the +cache to adjust its metadata for data tracking appropriately: + + int fscache_attr_changed(struct fscache_cookie *cookie); + +The cache will return -ENOBUFS if there is no backing cache or if there is no +space to allocate any extra metadata required in the cache. The attributes +will be accessed with the get_attr() cookie definition operation. + +Note that attempts to read or write data pages in the cache over this size may +be rebuffed with -ENOBUFS. + +This operation schedules an attribute adjustment to happen asynchronously at +some point in the future, and as such, it may happen after the function returns +to the caller. The attribute adjustment excludes read and write operations. + + +===================== +PAGE ALLOC/READ/WRITE +===================== + +And the sixth step is to store and retrieve pages in the cache. There are +three functions that are used to do this. + +Note: + + (1) A page should not be re-read or re-allocated without uncaching it first. + + (2) A read or allocated page must be uncached when the netfs page is released + from the pagecache. + + (3) A page should only be written to the cache if previous read or allocated. + +This permits the cache to maintain its page tracking in proper order. + + +PAGE READ +--------- + +Firstly, the netfs should ask FS-Cache to examine the caches and read the +contents cached for a particular page of a particular file if present, or else +allocate space to store the contents if not: + + typedef + void (*fscache_rw_complete_t)(struct page *page, + void *context, + int error); + + int fscache_read_or_alloc_page(struct fscache_cookie *cookie, + struct page *page, + fscache_rw_complete_t end_io_func, + void *context, + gfp_t gfp); + +The cookie argument must specify a cookie for an object that isn't an index, +the page specified will have the data loaded into it (and is also used to +specify the page number), and the gfp argument is used to control how any +memory allocations made are satisfied. + +If the cookie indicates the inode is not cached: + + (1) The function will return -ENOBUFS. + +Else if there's a copy of the page resident in the cache: + + (1) The mark_pages_cached() cookie operation will be called on that page. + + (2) The function will submit a request to read the data from the cache's + backing device directly into the page specified. + + (3) The function will return 0. + + (4) When the read is complete, end_io_func() will be invoked with: + + (*) The netfs data supplied when the cookie was created. + + (*) The page descriptor. + + (*) The context argument passed to the above function. This will be + maintained with the get_context/put_context functions mentioned above. + + (*) An argument that's 0 on success or negative for an error code. + + If an error occurs, it should be assumed that the page contains no usable + data. fscache_readpages_cancel() may need to be called. + + end_io_func() will be called in process context if the read is results in + an error, but it might be called in interrupt context if the read is + successful. + +Otherwise, if there's not a copy available in cache, but the cache may be able +to store the page: + + (1) The mark_pages_cached() cookie operation will be called on that page. + + (2) A block may be reserved in the cache and attached to the object at the + appropriate place. + + (3) The function will return -ENODATA. + +This function may also return -ENOMEM or -EINTR, in which case it won't have +read any data from the cache. + + +PAGE ALLOCATE +------------- + +Alternatively, if there's not expected to be any data in the cache for a page +because the file has been extended, a block can simply be allocated instead: + + int fscache_alloc_page(struct fscache_cookie *cookie, + struct page *page, + gfp_t gfp); + +This is similar to the fscache_read_or_alloc_page() function, except that it +never reads from the cache. It will return 0 if a block has been allocated, +rather than -ENODATA as the other would. One or the other must be performed +before writing to the cache. + +The mark_pages_cached() cookie operation will be called on the page if +successful. + + +PAGE WRITE +---------- + +Secondly, if the netfs changes the contents of the page (either due to an +initial download or if a user performs a write), then the page should be +written back to the cache: + + int fscache_write_page(struct fscache_cookie *cookie, + struct page *page, + gfp_t gfp); + +The cookie argument must specify a data file cookie, the page specified should +contain the data to be written (and is also used to specify the page number), +and the gfp argument is used to control how any memory allocations made are +satisfied. + +The page must have first been read or allocated successfully and must not have +been uncached before writing is performed. + +If the cookie indicates the inode is not cached then: + + (1) The function will return -ENOBUFS. + +Else if space can be allocated in the cache to hold this page: + + (1) PG_fscache_write will be set on the page. + + (2) The function will submit a request to write the data to cache's backing + device directly from the page specified. + + (3) The function will return 0. + + (4) When the write is complete PG_fscache_write is cleared on the page and + anyone waiting for that bit will be woken up. + +Else if there's no space available in the cache, -ENOBUFS will be returned. It +is also possible for the PG_fscache_write bit to be cleared when no write took +place if unforeseen circumstances arose (such as a disk error). + +Writing takes place asynchronously. + + +MULTIPLE PAGE READ +------------------ + +A facility is provided to read several pages at once, as requested by the +readpages() address space operation: + + int fscache_read_or_alloc_pages(struct fscache_cookie *cookie, + struct address_space *mapping, + struct list_head *pages, + int *nr_pages, + fscache_rw_complete_t end_io_func, + void *context, + gfp_t gfp); + +This works in a similar way to fscache_read_or_alloc_page(), except: + + (1) Any page it can retrieve data for is removed from pages and nr_pages and + dispatched for reading to the disk. Reads of adjacent pages on disk may + be merged for greater efficiency. + + (2) The mark_pages_cached() cookie operation will be called on several pages + at once if they're being read or allocated. + + (3) If there was an general error, then that error will be returned. + + Else if some pages couldn't be allocated or read, then -ENOBUFS will be + returned. + + Else if some pages couldn't be read but were allocated, then -ENODATA will + be returned. + + Otherwise, if all pages had reads dispatched, then 0 will be returned, the + list will be empty and *nr_pages will be 0. + + (4) end_io_func will be called once for each page being read as the reads + complete. It will be called in process context if error != 0, but it may + be called in interrupt context if there is no error. + +Note that a return of -ENODATA, -ENOBUFS or any other error does not preclude +some of the pages being read and some being allocated. Those pages will have +been marked appropriately and will need uncaching. + + +CANCELLATION OF UNREAD PAGES +---------------------------- + +If one or more pages are passed to fscache_read_or_alloc_pages() but not then +read from the cache and also not read from the underlying filesystem then +those pages will need to have any marks and reservations removed. This can be +done by calling: + + void fscache_readpages_cancel(struct fscache_cookie *cookie, + struct list_head *pages); + +prior to returning to the caller. The cookie argument should be as passed to +fscache_read_or_alloc_pages(). Every page in the pages list will be examined +and any that have PG_fscache set will be uncached. + + +============== +PAGE UNCACHING +============== + +To uncache a page, this function should be called: + + void fscache_uncache_page(struct fscache_cookie *cookie, + struct page *page); + +This function permits the cache to release any in-memory representation it +might be holding for this netfs page. This function must be called once for +each page on which the read or write page functions above have been called to +make sure the cache's in-memory tracking information gets torn down. + +Note that pages can't be explicitly deleted from the a data file. The whole +data file must be retired (see the relinquish cookie function below). + +Furthermore, note that this does not cancel the asynchronous read or write +operation started by the read/alloc and write functions, so the page +invalidation functions must use: + + bool fscache_check_page_write(struct fscache_cookie *cookie, + struct page *page); + +to see if a page is being written to the cache, and: + + void fscache_wait_on_page_write(struct fscache_cookie *cookie, + struct page *page); + +to wait for it to finish if it is. + + +When releasepage() is being implemented, a special FS-Cache function exists to +manage the heuristics of coping with vmscan trying to eject pages, which may +conflict with the cache trying to write pages to the cache (which may itself +need to allocate memory): + + bool fscache_maybe_release_page(struct fscache_cookie *cookie, + struct page *page, + gfp_t gfp); + +This takes the netfs cookie, and the page and gfp arguments as supplied to +releasepage(). It will return false if the page cannot be released yet for +some reason and if it returns true, the page has been uncached and can now be +released. + +To make a page available for release, this function may wait for an outstanding +storage request to complete, or it may attempt to cancel the storage request - +in which case the page will not be stored in the cache this time. + + +BULK INODE PAGE UNCACHE +----------------------- + +A convenience routine is provided to perform an uncache on all the pages +attached to an inode. This assumes that the pages on the inode correspond on a +1:1 basis with the pages in the cache. + + void fscache_uncache_all_inode_pages(struct fscache_cookie *cookie, + struct inode *inode); + +This takes the netfs cookie that the pages were cached with and the inode that +the pages are attached to. This function will wait for pages to finish being +written to the cache and for the cache to finish with the page generally. No +error is returned. + + +=============================== +INDEX AND DATA FILE CONSISTENCY +=============================== + +To find out whether auxiliary data for an object is up to data within the +cache, the following function can be called: + + int fscache_check_consistency(struct fscache_cookie *cookie) + +This will call back to the netfs to check whether the auxiliary data associated +with a cookie is correct. It returns 0 if it is and -ESTALE if it isn't; it +may also return -ENOMEM and -ERESTARTSYS. + +To request an update of the index data for an index or other object, the +following function should be called: + + void fscache_update_cookie(struct fscache_cookie *cookie); + +This function will refer back to the netfs_data pointer stored in the cookie by +the acquisition function to obtain the data to write into each revised index +entry. The update method in the parent index definition will be called to +transfer the data. + +Note that partial updates may happen automatically at other times, such as when +data blocks are added to a data file object. + + +================= +COOKIE ENABLEMENT +================= + +Cookies exist in one of two states: enabled and disabled. If a cookie is +disabled, it ignores all attempts to acquire child cookies; check, update or +invalidate its state; allocate, read or write backing pages - though it is +still possible to uncache pages and relinquish the cookie. + +The initial enablement state is set by fscache_acquire_cookie(), but the cookie +can be enabled or disabled later. To disable a cookie, call: + + void fscache_disable_cookie(struct fscache_cookie *cookie, + bool invalidate); + +If the cookie is not already disabled, this locks the cookie against other +enable and disable ops, marks the cookie as being disabled, discards or +invalidates any backing objects and waits for cessation of activity on any +associated object before unlocking the cookie. + +All possible failures are handled internally. The caller should consider +calling fscache_uncache_all_inode_pages() afterwards to make sure all page +markings are cleared up. + +Cookies can be enabled or reenabled with: + + void fscache_enable_cookie(struct fscache_cookie *cookie, + bool (*can_enable)(void *data), + void *data) + +If the cookie is not already enabled, this locks the cookie against other +enable and disable ops, invokes can_enable() and, if the cookie is not an index +cookie, will begin the procedure of acquiring backing objects. + +The optional can_enable() function is passed the data argument and returns a +ruling as to whether or not enablement should actually be permitted to begin. + +All possible failures are handled internally. The cookie will only be marked +as enabled if provisional backing objects are allocated. + + +=============================== +MISCELLANEOUS COOKIE OPERATIONS +=============================== + +There are a number of operations that can be used to control cookies: + + (*) Cookie pinning: + + int fscache_pin_cookie(struct fscache_cookie *cookie); + void fscache_unpin_cookie(struct fscache_cookie *cookie); + + These operations permit data cookies to be pinned into the cache and to + have the pinning removed. They are not permitted on index cookies. + + The pinning function will return 0 if successful, -ENOBUFS in the cookie + isn't backed by a cache, -EOPNOTSUPP if the cache doesn't support pinning, + -ENOSPC if there isn't enough space to honour the operation, -ENOMEM or + -EIO if there's any other problem. + + (*) Data space reservation: + + int fscache_reserve_space(struct fscache_cookie *cookie, loff_t size); + + This permits a netfs to request cache space be reserved to store up to the + given amount of a file. It is permitted to ask for more than the current + size of the file to allow for future file expansion. + + If size is given as zero then the reservation will be cancelled. + + The function will return 0 if successful, -ENOBUFS in the cookie isn't + backed by a cache, -EOPNOTSUPP if the cache doesn't support reservations, + -ENOSPC if there isn't enough space to honour the operation, -ENOMEM or + -EIO if there's any other problem. + + Note that this doesn't pin an object in a cache; it can still be culled to + make space if it's not in use. + + +===================== +COOKIE UNREGISTRATION +===================== + +To get rid of a cookie, this function should be called. + + void fscache_relinquish_cookie(struct fscache_cookie *cookie, + bool retire); + +If retire is non-zero, then the object will be marked for recycling, and all +copies of it will be removed from all active caches in which it is present. +Not only that but all child objects will also be retired. + +If retire is zero, then the object may be available again when next the +acquisition function is called. Retirement here will overrule the pinning on a +cookie. + +One very important note - relinquish must NOT be called for a cookie unless all +the cookies for "child" indices, objects and pages have been relinquished +first. + + +================== +INDEX INVALIDATION +================== + +There is no direct way to invalidate an index subtree. To do this, the caller +should relinquish and retire the cookie they have, and then acquire a new one. + + +====================== +DATA FILE INVALIDATION +====================== + +Sometimes it will be necessary to invalidate an object that contains data. +Typically this will be necessary when the server tells the netfs of a foreign +change - at which point the netfs has to throw away all the state it had for an +inode and reload from the server. + +To indicate that a cache object should be invalidated, the following function +can be called: + + void fscache_invalidate(struct fscache_cookie *cookie); + +This can be called with spinlocks held as it defers the work to a thread pool. +All extant storage, retrieval and attribute change ops at this point are +cancelled and discarded. Some future operations will be rejected until the +cache has had a chance to insert a barrier in the operations queue. After +that, operations will be queued again behind the invalidation operation. + +The invalidation operation will perform an attribute change operation and an +auxiliary data update operation as it is very likely these will have changed. + +Using the following function, the netfs can wait for the invalidation operation +to have reached a point at which it can start submitting ordinary operations +once again: + + void fscache_wait_on_invalidate(struct fscache_cookie *cookie); + + +=========================== +FS-CACHE SPECIFIC PAGE FLAG +=========================== + +FS-Cache makes use of a page flag, PG_private_2, for its own purpose. This is +given the alternative name PG_fscache. + +PG_fscache is used to indicate that the page is known by the cache, and that +the cache must be informed if the page is going to go away. It's an indication +to the netfs that the cache has an interest in this page, where an interest may +be a pointer to it, resources allocated or reserved for it, or I/O in progress +upon it. + +The netfs can use this information in methods such as releasepage() to +determine whether it needs to uncache a page or update it. + +Furthermore, if this bit is set, releasepage() and invalidatepage() operations +will be called on a page to get rid of it, even if PG_private is not set. This +allows caching to attempted on a page before read_cache_pages() to be called +after fscache_read_or_alloc_pages() as the former will try and release pages it +was given under certain circumstances. + +This bit does not overlap with such as PG_private. This means that FS-Cache +can be used with a filesystem that uses the block buffering code. + +There are a number of operations defined on this flag: + + int PageFsCache(struct page *page); + void SetPageFsCache(struct page *page) + void ClearPageFsCache(struct page *page) + int TestSetPageFsCache(struct page *page) + int TestClearPageFsCache(struct page *page) + +These functions are bit test, bit set, bit clear, bit test and set and bit +test and clear operations on PG_fscache. diff --git a/Documentation/filesystems/caching/object.txt b/Documentation/filesystems/caching/object.txt new file mode 100644 index 00000000000..100ff41127e --- /dev/null +++ b/Documentation/filesystems/caching/object.txt @@ -0,0 +1,320 @@ + ==================================================== + IN-KERNEL CACHE OBJECT REPRESENTATION AND MANAGEMENT + ==================================================== + +By: David Howells <dhowells@redhat.com> + +Contents: + + (*) Representation + + (*) Object management state machine. + + - Provision of cpu time. + - Locking simplification. + + (*) The set of states. + + (*) The set of events. + + +============== +REPRESENTATION +============== + +FS-Cache maintains an in-kernel representation of each object that a netfs is +currently interested in. Such objects are represented by the fscache_cookie +struct and are referred to as cookies. + +FS-Cache also maintains a separate in-kernel representation of the objects that +a cache backend is currently actively caching. Such objects are represented by +the fscache_object struct. The cache backends allocate these upon request, and +are expected to embed them in their own representations. These are referred to +as objects. + +There is a 1:N relationship between cookies and objects. A cookie may be +represented by multiple objects - an index may exist in more than one cache - +or even by no objects (it may not be cached). + +Furthermore, both cookies and objects are hierarchical. The two hierarchies +correspond, but the cookies tree is a superset of the union of the object trees +of multiple caches: + + NETFS INDEX TREE : CACHE 1 : CACHE 2 + : : + : +-----------+ : + +----------->| IObject | : + +-----------+ | : +-----------+ : + | ICookie |-------+ : | : + +-----------+ | : | : +-----------+ + | +------------------------------>| IObject | + | : | : +-----------+ + | : V : | + | : +-----------+ : | + V +----------->| IObject | : | + +-----------+ | : +-----------+ : | + | ICookie |-------+ : | : V + +-----------+ | : | : +-----------+ + | +------------------------------>| IObject | + +-----+-----+ : | : +-----------+ + | | : | : | + V | : V : | + +-----------+ | : +-----------+ : | + | ICookie |------------------------->| IObject | : | + +-----------+ | : +-----------+ : | + | V : | : V + | +-----------+ : | : +-----------+ + | | ICookie |-------------------------------->| IObject | + | +-----------+ : | : +-----------+ + V | : V : | + +-----------+ | : +-----------+ : | + | DCookie |------------------------->| DObject | : | + +-----------+ | : +-----------+ : | + | : : | + +-------+-------+ : : | + | | : : | + V V : : V + +-----------+ +-----------+ : : +-----------+ + | DCookie | | DCookie |------------------------>| DObject | + +-----------+ +-----------+ : : +-----------+ + : : + +In the above illustration, ICookie and IObject represent indices and DCookie +and DObject represent data storage objects. Indices may have representation in +multiple caches, but currently, non-index objects may not. Objects of any type +may also be entirely unrepresented. + +As far as the netfs API goes, the netfs is only actually permitted to see +pointers to the cookies. The cookies themselves and any objects attached to +those cookies are hidden from it. + + +=============================== +OBJECT MANAGEMENT STATE MACHINE +=============================== + +Within FS-Cache, each active object is managed by its own individual state +machine. The state for an object is kept in the fscache_object struct, in +object->state. A cookie may point to a set of objects that are in different +states. + +Each state has an action associated with it that is invoked when the machine +wakes up in that state. There are four logical sets of states: + + (1) Preparation: states that wait for the parent objects to become ready. The + representations are hierarchical, and it is expected that an object must + be created or accessed with respect to its parent object. + + (2) Initialisation: states that perform lookups in the cache and validate + what's found and that create on disk any missing metadata. + + (3) Normal running: states that allow netfs operations on objects to proceed + and that update the state of objects. + + (4) Termination: states that detach objects from their netfs cookies, that + delete objects from disk, that handle disk and system errors and that free + up in-memory resources. + + +In most cases, transitioning between states is in response to signalled events. +When a state has finished processing, it will usually set the mask of events in +which it is interested (object->event_mask) and relinquish the worker thread. +Then when an event is raised (by calling fscache_raise_event()), if the event +is not masked, the object will be queued for processing (by calling +fscache_enqueue_object()). + + +PROVISION OF CPU TIME +--------------------- + +The work to be done by the various states was given CPU time by the threads of +the slow work facility. This was used in preference to the workqueue facility +because: + + (1) Threads may be completely occupied for very long periods of time by a + particular work item. These state actions may be doing sequences of + synchronous, journalled disk accesses (lookup, mkdir, create, setxattr, + getxattr, truncate, unlink, rmdir, rename). + + (2) Threads may do little actual work, but may rather spend a lot of time + sleeping on I/O. This means that single-threaded and 1-per-CPU-threaded + workqueues don't necessarily have the right numbers of threads. + + +LOCKING SIMPLIFICATION +---------------------- + +Because only one worker thread may be operating on any particular object's +state machine at once, this simplifies the locking, particularly with respect +to disconnecting the netfs's representation of a cache object (fscache_cookie) +from the cache backend's representation (fscache_object) - which may be +requested from either end. + + +================= +THE SET OF STATES +================= + +The object state machine has a set of states that it can be in. There are +preparation states in which the object sets itself up and waits for its parent +object to transit to a state that allows access to its children: + + (1) State FSCACHE_OBJECT_INIT. + + Initialise the object and wait for the parent object to become active. In + the cache, it is expected that it will not be possible to look an object + up from the parent object, until that parent object itself has been looked + up. + +There are initialisation states in which the object sets itself up and accesses +disk for the object metadata: + + (2) State FSCACHE_OBJECT_LOOKING_UP. + + Look up the object on disk, using the parent as a starting point. + FS-Cache expects the cache backend to probe the cache to see whether this + object is represented there, and if it is, to see if it's valid (coherency + management). + + The cache should call fscache_object_lookup_negative() to indicate lookup + failure for whatever reason, and should call fscache_obtained_object() to + indicate success. + + At the completion of lookup, FS-Cache will let the netfs go ahead with + read operations, no matter whether the file is yet cached. If not yet + cached, read operations will be immediately rejected with ENODATA until + the first known page is uncached - as to that point there can be no data + to be read out of the cache for that file that isn't currently also held + in the pagecache. + + (3) State FSCACHE_OBJECT_CREATING. + + Create an object on disk, using the parent as a starting point. This + happens if the lookup failed to find the object, or if the object's + coherency data indicated what's on disk is out of date. In this state, + FS-Cache expects the cache to create + + The cache should call fscache_obtained_object() if creation completes + successfully, fscache_object_lookup_negative() otherwise. + + At the completion of creation, FS-Cache will start processing write + operations the netfs has queued for an object. If creation failed, the + write ops will be transparently discarded, and nothing recorded in the + cache. + +There are some normal running states in which the object spends its time +servicing netfs requests: + + (4) State FSCACHE_OBJECT_AVAILABLE. + + A transient state in which pending operations are started, child objects + are permitted to advance from FSCACHE_OBJECT_INIT state, and temporary + lookup data is freed. + + (5) State FSCACHE_OBJECT_ACTIVE. + + The normal running state. In this state, requests the netfs makes will be + passed on to the cache. + + (6) State FSCACHE_OBJECT_INVALIDATING. + + The object is undergoing invalidation. When the state comes here, it + discards all pending read, write and attribute change operations as it is + going to clear out the cache entirely and reinitialise it. It will then + continue to the FSCACHE_OBJECT_UPDATING state. + + (7) State FSCACHE_OBJECT_UPDATING. + + The state machine comes here to update the object in the cache from the + netfs's records. This involves updating the auxiliary data that is used + to maintain coherency. + +And there are terminal states in which an object cleans itself up, deallocates +memory and potentially deletes stuff from disk: + + (8) State FSCACHE_OBJECT_LC_DYING. + + The object comes here if it is dying because of a lookup or creation + error. This would be due to a disk error or system error of some sort. + Temporary data is cleaned up, and the parent is released. + + (9) State FSCACHE_OBJECT_DYING. + + The object comes here if it is dying due to an error, because its parent + cookie has been relinquished by the netfs or because the cache is being + withdrawn. + + Any child objects waiting on this one are given CPU time so that they too + can destroy themselves. This object waits for all its children to go away + before advancing to the next state. + +(10) State FSCACHE_OBJECT_ABORT_INIT. + + The object comes to this state if it was waiting on its parent in + FSCACHE_OBJECT_INIT, but its parent died. The object will destroy itself + so that the parent may proceed from the FSCACHE_OBJECT_DYING state. + +(11) State FSCACHE_OBJECT_RELEASING. +(12) State FSCACHE_OBJECT_RECYCLING. + + The object comes to one of these two states when dying once it is rid of + all its children, if it is dying because the netfs relinquished its + cookie. In the first state, the cached data is expected to persist, and + in the second it will be deleted. + +(13) State FSCACHE_OBJECT_WITHDRAWING. + + The object transits to this state if the cache decides it wants to + withdraw the object from service, perhaps to make space, but also due to + error or just because the whole cache is being withdrawn. + +(14) State FSCACHE_OBJECT_DEAD. + + The object transits to this state when the in-memory object record is + ready to be deleted. The object processor shouldn't ever see an object in + this state. + + +THE SET OF EVENTS +----------------- + +There are a number of events that can be raised to an object state machine: + + (*) FSCACHE_OBJECT_EV_UPDATE + + The netfs requested that an object be updated. The state machine will ask + the cache backend to update the object, and the cache backend will ask the + netfs for details of the change through its cookie definition ops. + + (*) FSCACHE_OBJECT_EV_CLEARED + + This is signalled in two circumstances: + + (a) when an object's last child object is dropped and + + (b) when the last operation outstanding on an object is completed. + + This is used to proceed from the dying state. + + (*) FSCACHE_OBJECT_EV_ERROR + + This is signalled when an I/O error occurs during the processing of some + object. + + (*) FSCACHE_OBJECT_EV_RELEASE + (*) FSCACHE_OBJECT_EV_RETIRE + + These are signalled when the netfs relinquishes a cookie it was using. + The event selected depends on whether the netfs asks for the backing + object to be retired (deleted) or retained. + + (*) FSCACHE_OBJECT_EV_WITHDRAW + + This is signalled when the cache backend wants to withdraw an object. + This means that the object will have to be detached from the netfs's + cookie. + +Because the withdrawing releasing/retiring events are all handled by the object +state machine, it doesn't matter if there's a collision with both ends trying +to sever the connection at the same time. The state machine can just pick +which one it wants to honour, and that effects the other. diff --git a/Documentation/filesystems/caching/operations.txt b/Documentation/filesystems/caching/operations.txt new file mode 100644 index 00000000000..bee2a5f93d6 --- /dev/null +++ b/Documentation/filesystems/caching/operations.txt @@ -0,0 +1,213 @@ + ================================ + ASYNCHRONOUS OPERATIONS HANDLING + ================================ + +By: David Howells <dhowells@redhat.com> + +Contents: + + (*) Overview. + + (*) Operation record initialisation. + + (*) Parameters. + + (*) Procedure. + + (*) Asynchronous callback. + + +======== +OVERVIEW +======== + +FS-Cache has an asynchronous operations handling facility that it uses for its +data storage and retrieval routines. Its operations are represented by +fscache_operation structs, though these are usually embedded into some other +structure. + +This facility is available to and expected to be be used by the cache backends, +and FS-Cache will create operations and pass them off to the appropriate cache +backend for completion. + +To make use of this facility, <linux/fscache-cache.h> should be #included. + + +=============================== +OPERATION RECORD INITIALISATION +=============================== + +An operation is recorded in an fscache_operation struct: + + struct fscache_operation { + union { + struct work_struct fast_work; + struct slow_work slow_work; + }; + unsigned long flags; + fscache_operation_processor_t processor; + ... + }; + +Someone wanting to issue an operation should allocate something with this +struct embedded in it. They should initialise it by calling: + + void fscache_operation_init(struct fscache_operation *op, + fscache_operation_release_t release); + +with the operation to be initialised and the release function to use. + +The op->flags parameter should be set to indicate the CPU time provision and +the exclusivity (see the Parameters section). + +The op->fast_work, op->slow_work and op->processor flags should be set as +appropriate for the CPU time provision (see the Parameters section). + +FSCACHE_OP_WAITING may be set in op->flags prior to each submission of the +operation and waited for afterwards. + + +========== +PARAMETERS +========== + +There are a number of parameters that can be set in the operation record's flag +parameter. There are three options for the provision of CPU time in these +operations: + + (1) The operation may be done synchronously (FSCACHE_OP_MYTHREAD). A thread + may decide it wants to handle an operation itself without deferring it to + another thread. + + This is, for example, used in read operations for calling readpages() on + the backing filesystem in CacheFiles. Although readpages() does an + asynchronous data fetch, the determination of whether pages exist is done + synchronously - and the netfs does not proceed until this has been + determined. + + If this option is to be used, FSCACHE_OP_WAITING must be set in op->flags + before submitting the operation, and the operating thread must wait for it + to be cleared before proceeding: + + wait_on_bit(&op->flags, FSCACHE_OP_WAITING, + fscache_wait_bit, TASK_UNINTERRUPTIBLE); + + + (2) The operation may be fast asynchronous (FSCACHE_OP_FAST), in which case it + will be given to keventd to process. Such an operation is not permitted + to sleep on I/O. + + This is, for example, used by CacheFiles to copy data from a backing fs + page to a netfs page after the backing fs has read the page in. + + If this option is used, op->fast_work and op->processor must be + initialised before submitting the operation: + + INIT_WORK(&op->fast_work, do_some_work); + + + (3) The operation may be slow asynchronous (FSCACHE_OP_SLOW), in which case it + will be given to the slow work facility to process. Such an operation is + permitted to sleep on I/O. + + This is, for example, used by FS-Cache to handle background writes of + pages that have just been fetched from a remote server. + + If this option is used, op->slow_work and op->processor must be + initialised before submitting the operation: + + fscache_operation_init_slow(op, processor) + + +Furthermore, operations may be one of two types: + + (1) Exclusive (FSCACHE_OP_EXCLUSIVE). Operations of this type may not run in + conjunction with any other operation on the object being operated upon. + + An example of this is the attribute change operation, in which the file + being written to may need truncation. + + (2) Shareable. Operations of this type may be running simultaneously. It's + up to the operation implementation to prevent interference between other + operations running at the same time. + + +========= +PROCEDURE +========= + +Operations are used through the following procedure: + + (1) The submitting thread must allocate the operation and initialise it + itself. Normally this would be part of a more specific structure with the + generic op embedded within. + + (2) The submitting thread must then submit the operation for processing using + one of the following two functions: + + int fscache_submit_op(struct fscache_object *object, + struct fscache_operation *op); + + int fscache_submit_exclusive_op(struct fscache_object *object, + struct fscache_operation *op); + + The first function should be used to submit non-exclusive ops and the + second to submit exclusive ones. The caller must still set the + FSCACHE_OP_EXCLUSIVE flag. + + If successful, both functions will assign the operation to the specified + object and return 0. -ENOBUFS will be returned if the object specified is + permanently unavailable. + + The operation manager will defer operations on an object that is still + undergoing lookup or creation. The operation will also be deferred if an + operation of conflicting exclusivity is in progress on the object. + + If the operation is asynchronous, the manager will retain a reference to + it, so the caller should put their reference to it by passing it to: + + void fscache_put_operation(struct fscache_operation *op); + + (3) If the submitting thread wants to do the work itself, and has marked the + operation with FSCACHE_OP_MYTHREAD, then it should monitor + FSCACHE_OP_WAITING as described above and check the state of the object if + necessary (the object might have died whilst the thread was waiting). + + When it has finished doing its processing, it should call + fscache_op_complete() and fscache_put_operation() on it. + + (4) The operation holds an effective lock upon the object, preventing other + exclusive ops conflicting until it is released. The operation can be + enqueued for further immediate asynchronous processing by adjusting the + CPU time provisioning option if necessary, eg: + + op->flags &= ~FSCACHE_OP_TYPE; + op->flags |= ~FSCACHE_OP_FAST; + + and calling: + + void fscache_enqueue_operation(struct fscache_operation *op) + + This can be used to allow other things to have use of the worker thread + pools. + + +===================== +ASYNCHRONOUS CALLBACK +===================== + +When used in asynchronous mode, the worker thread pool will invoke the +processor method with a pointer to the operation. This should then get at the +container struct by using container_of(): + + static void fscache_write_op(struct fscache_operation *_op) + { + struct fscache_storage *op = + container_of(_op, struct fscache_storage, op); + ... + } + +The caller holds a reference on the operation, and will invoke +fscache_put_operation() when the processor function returns. The processor +function is at liberty to call fscache_enqueue_operation() or to take extra +references. diff --git a/Documentation/filesystems/ceph.txt b/Documentation/filesystems/ceph.txt new file mode 100644 index 00000000000..d6030aa3337 --- /dev/null +++ b/Documentation/filesystems/ceph.txt @@ -0,0 +1,148 @@ +Ceph Distributed File System +============================ + +Ceph is a distributed network file system designed to provide good +performance, reliability, and scalability. + +Basic features include: + + * POSIX semantics + * Seamless scaling from 1 to many thousands of nodes + * High availability and reliability. No single point of failure. + * N-way replication of data across storage nodes + * Fast recovery from node failures + * Automatic rebalancing of data on node addition/removal + * Easy deployment: most FS components are userspace daemons + +Also, + * Flexible snapshots (on any directory) + * Recursive accounting (nested files, directories, bytes) + +In contrast to cluster filesystems like GFS, OCFS2, and GPFS that rely +on symmetric access by all clients to shared block devices, Ceph +separates data and metadata management into independent server +clusters, similar to Lustre. Unlike Lustre, however, metadata and +storage nodes run entirely as user space daemons. Storage nodes +utilize btrfs to store data objects, leveraging its advanced features +(checksumming, metadata replication, etc.). File data is striped +across storage nodes in large chunks to distribute workload and +facilitate high throughputs. When storage nodes fail, data is +re-replicated in a distributed fashion by the storage nodes themselves +(with some minimal coordination from a cluster monitor), making the +system extremely efficient and scalable. + +Metadata servers effectively form a large, consistent, distributed +in-memory cache above the file namespace that is extremely scalable, +dynamically redistributes metadata in response to workload changes, +and can tolerate arbitrary (well, non-Byzantine) node failures. The +metadata server takes a somewhat unconventional approach to metadata +storage to significantly improve performance for common workloads. In +particular, inodes with only a single link are embedded in +directories, allowing entire directories of dentries and inodes to be +loaded into its cache with a single I/O operation. The contents of +extremely large directories can be fragmented and managed by +independent metadata servers, allowing scalable concurrent access. + +The system offers automatic data rebalancing/migration when scaling +from a small cluster of just a few nodes to many hundreds, without +requiring an administrator carve the data set into static volumes or +go through the tedious process of migrating data between servers. +When the file system approaches full, new nodes can be easily added +and things will "just work." + +Ceph includes flexible snapshot mechanism that allows a user to create +a snapshot on any subdirectory (and its nested contents) in the +system. Snapshot creation and deletion are as simple as 'mkdir +.snap/foo' and 'rmdir .snap/foo'. + +Ceph also provides some recursive accounting on directories for nested +files and bytes. That is, a 'getfattr -d foo' on any directory in the +system will reveal the total number of nested regular files and +subdirectories, and a summation of all nested file sizes. This makes +the identification of large disk space consumers relatively quick, as +no 'du' or similar recursive scan of the file system is required. + + +Mount Syntax +============ + +The basic mount syntax is: + + # mount -t ceph monip[:port][,monip2[:port]...]:/[subdir] mnt + +You only need to specify a single monitor, as the client will get the +full list when it connects. (However, if the monitor you specify +happens to be down, the mount won't succeed.) The port can be left +off if the monitor is using the default. So if the monitor is at +1.2.3.4, + + # mount -t ceph 1.2.3.4:/ /mnt/ceph + +is sufficient. If /sbin/mount.ceph is installed, a hostname can be +used instead of an IP address. + + + +Mount Options +============= + + ip=A.B.C.D[:N] + Specify the IP and/or port the client should bind to locally. + There is normally not much reason to do this. If the IP is not + specified, the client's IP address is determined by looking at the + address its connection to the monitor originates from. + + wsize=X + Specify the maximum write size in bytes. By default there is no + maximum. Ceph will normally size writes based on the file stripe + size. + + rsize=X + Specify the maximum readahead. + + mount_timeout=X + Specify the timeout value for mount (in seconds), in the case + of a non-responsive Ceph file system. The default is 30 + seconds. + + rbytes + When stat() is called on a directory, set st_size to 'rbytes', + the summation of file sizes over all files nested beneath that + directory. This is the default. + + norbytes + When stat() is called on a directory, set st_size to the + number of entries in that directory. + + nocrc + Disable CRC32C calculation for data writes. If set, the storage node + must rely on TCP's error correction to detect data corruption + in the data payload. + + dcache + Use the dcache contents to perform negative lookups and + readdir when the client has the entire directory contents in + its cache. (This does not change correctness; the client uses + cached metadata only when a lease or capability ensures it is + valid.) + + nodcache + Do not use the dcache as above. This avoids a significant amount of + complex code, sacrificing performance without affecting correctness, + and is useful for tracking down bugs. + + noasyncreaddir + Do not use the dcache as above for readdir. + +More Information +================ + +For more information on Ceph, see the home page at + http://ceph.newdream.net/ + +The Linux kernel client source tree is available at + git://ceph.newdream.net/git/ceph-client.git + git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client.git + +and the source for the full system is at + git://ceph.newdream.net/git/ceph.git diff --git a/Documentation/filesystems/cifs.txt b/Documentation/filesystems/cifs.txt deleted file mode 100644 index 49cc923a93e..00000000000 --- a/Documentation/filesystems/cifs.txt +++ /dev/null @@ -1,51 +0,0 @@ - This is the client VFS module for the Common Internet File System - (CIFS) protocol which is the successor to the Server Message Block - (SMB) protocol, the native file sharing mechanism for most early - PC operating systems. CIFS is fully supported by current network - file servers such as Windows 2000, Windows 2003 (including - Windows XP) as well by Samba (which provides excellent CIFS - server support for Linux and many other operating systems), so - this network filesystem client can mount to a wide variety of - servers. The smbfs module should be used instead of this cifs module - for mounting to older SMB servers such as OS/2. The smbfs and cifs - modules can coexist and do not conflict. The CIFS VFS filesystem - module is designed to work well with servers that implement the - newer versions (dialects) of the SMB/CIFS protocol such as Samba, - the program written by Andrew Tridgell that turns any Unix host - into a SMB/CIFS file server. - - The intent of this module is to provide the most advanced network - file system function for CIFS compliant servers, including better - POSIX compliance, secure per-user session establishment, high - performance safe distributed caching (oplock), optional packet - signing, large files, Unicode support and other internationalization - improvements. Since both Samba server and this filesystem client support - the CIFS Unix extensions, the combination can provide a reasonable - alternative to NFSv4 for fileserving in some Linux to Linux environments, - not just in Linux to Windows environments. - - This filesystem has an optional mount utility (mount.cifs) that can - be obtained from the project page and installed in the path in the same - directory with the other mount helpers (such as mount.smbfs). - Mounting using the cifs filesystem without installing the mount helper - requires specifying the server's ip address. - - For Linux 2.4: - mount //anything/here /mnt_target -o - user=username,pass=password,unc=//ip_address_of_server/sharename - - For Linux 2.5: - mount //ip_address_of_server/sharename /mnt_target -o user=username, pass=password - - - For more information on the module see the project page at - - http://us1.samba.org/samba/Linux_CIFS_client.html - - For more information on CIFS see: - - http://www.snia.org/tech_activities/CIFS - - or the Samba site: - - http://www.samba.org diff --git a/Documentation/filesystems/cifs/AUTHORS b/Documentation/filesystems/cifs/AUTHORS new file mode 100644 index 00000000000..ca4a67a0bb1 --- /dev/null +++ b/Documentation/filesystems/cifs/AUTHORS @@ -0,0 +1,56 @@ +Original Author +=============== +Steve French (sfrench@samba.org) + +The author wishes to express his appreciation and thanks to: +Andrew Tridgell (Samba team) for his early suggestions about smb/cifs VFS +improvements. Thanks to IBM for allowing me time and test resources to pursue +this project, to Jim McDonough from IBM (and the Samba Team) for his help, to +the IBM Linux JFS team for explaining many esoteric Linux filesystem features. +Jeremy Allison of the Samba team has done invaluable work in adding the server +side of the original CIFS Unix extensions and reviewing and implementing +portions of the newer CIFS POSIX extensions into the Samba 3 file server. Thank +Dave Boutcher of IBM Rochester (author of the OS/400 smb/cifs filesystem client) +for proving years ago that very good smb/cifs clients could be done on Unix-like +operating systems. Volker Lendecke, Andrew Tridgell, Urban Widmark, John +Newbigin and others for their work on the Linux smbfs module. Thanks to +the other members of the Storage Network Industry Association CIFS Technical +Workgroup for their work specifying this highly complex protocol and finally +thanks to the Samba team for their technical advice and encouragement. + +Patch Contributors +------------------ +Zwane Mwaikambo +Andi Kleen +Amrut Joshi +Shobhit Dayal +Sergey Vlasov +Richard Hughes +Yury Umanets +Mark Hamzy (for some of the early cifs IPv6 work) +Domen Puncer +Jesper Juhl (in particular for lots of whitespace/formatting cleanup) +Vince Negri and Dave Stahl (for finding an important caching bug) +Adrian Bunk (kcalloc cleanups) +Miklos Szeredi +Kazeon team for various fixes especially for 2.4 version. +Asser Ferno (Change Notify support) +Shaggy (Dave Kleikamp) for innumerable small fs suggestions and some good cleanup +Gunter Kukkukk (testing and suggestions for support of old servers) +Igor Mammedov (DFS support) +Jeff Layton (many, many fixes, as well as great work on the cifs Kerberos code) +Scott Lovenberg + +Test case and Bug Report contributors +------------------------------------- +Thanks to those in the community who have submitted detailed bug reports +and debug of problems they have found: Jochen Dolze, David Blaine, +Rene Scharfe, Martin Josefsson, Alexander Wild, Anthony Liguori, +Lars Muller, Urban Widmark, Massimiliano Ferrero, Howard Owen, +Olaf Kirch, Kieron Briggs, Nick Millington and others. Also special +mention to the Stanford Checker (SWAT) which pointed out many minor +bugs in error paths. Valuable suggestions also have come from Al Viro +and Dave Miller. + +And thanks to the IBM LTC and Power test teams and SuSE testers for +finding multiple bugs during excellent stress test runs. diff --git a/Documentation/filesystems/cifs/CHANGES b/Documentation/filesystems/cifs/CHANGES new file mode 100644 index 00000000000..bc0025cdd1c --- /dev/null +++ b/Documentation/filesystems/cifs/CHANGES @@ -0,0 +1,1065 @@ +Version 1.62 +------------ +Add sockopt=TCP_NODELAY mount option. EA (xattr) routines hardened +to more strictly handle corrupt frames. + +Version 1.61 +------------ +Fix append problem to Samba servers (files opened with O_APPEND could +have duplicated data). Fix oops in cifs_lookup. Workaround problem +mounting to OS/400 Netserve. Fix oops in cifs_get_tcp_session. +Disable use of server inode numbers when server only +partially supports them (e.g. for one server querying inode numbers on +FindFirst fails but QPathInfo queries works). Fix oops with dfs in +cifs_put_smb_ses. Fix mmap to work on directio mounts (needed +for OpenOffice when on forcedirectio mount e.g.) + +Version 1.60 +------------- +Fix memory leak in reconnect. Fix oops in DFS mount error path. +Set s_maxbytes to smaller (the max that vfs can handle) so that +sendfile will now work over cifs mounts again. Add noforcegid +and noforceuid mount parameters. Fix small mem leak when using +ntlmv2. Fix 2nd mount to same server but with different port to +be allowed (rather than reusing the 1st port) - only when the +user explicitly overrides the port on the 2nd mount. + +Version 1.59 +------------ +Client uses server inode numbers (which are persistent) rather than +client generated ones by default (mount option "serverino" turned +on by default if server supports it). Add forceuid and forcegid +mount options (so that when negotiating unix extensions specifying +which uid mounted does not immediately force the server's reported +uids to be overridden). Add support for scope mount parm. Improve +hard link detection to use same inode for both. Do not set +read-only dos attribute on directories (for chmod) since Windows +explorer special cases this attribute bit for directories for +a different purpose. + +Version 1.58 +------------ +Guard against buffer overruns in various UCS-2 to UTF-8 string conversions +when the UTF-8 string is composed of unusually long (more than 4 byte) converted +characters. Add support for mounting root of a share which redirects immediately +to DFS target. Convert string conversion functions from Unicode to more +accurately mark string length before allocating memory (which may help the +rare cases where a UTF-8 string is much larger than the UCS2 string that +we converted from). Fix endianness of the vcnum field used during +session setup to distinguish multiple mounts to same server from different +userids. Raw NTLMSSP fixed (it requires /proc/fs/cifs/experimental +flag to be set to 2, and mount must enable krb5 to turn on extended security). +Performance of file create to Samba improved (posix create on lookup +removes 1 of 2 network requests sent on file create) + +Version 1.57 +------------ +Improve support for multiple security contexts to the same server. We +used to use the same "vcnumber" for all connections which could cause +the server to treat subsequent connections, especially those that +are authenticated as guest, as reconnections, invalidating the earlier +user's smb session. This fix allows cifs to mount multiple times to the +same server with different userids without risking invalidating earlier +established security contexts. fsync now sends SMB Flush operation +to better ensure that we wait for server to write all of the data to +server disk (not just write it over the network). Add new mount +parameter to allow user to disable sending the (slow) SMB flush on +fsync if desired (fsync still flushes all cached write data to the server). +Posix file open support added (turned off after one attempt if server +fails to support it properly, as with Samba server versions prior to 3.3.2) +Fix "redzone overwritten" bug in cifs_put_tcon (CIFSTcon may allocate too +little memory for the "nativeFileSystem" field returned by the server +during mount). Endian convert inode numbers if necessary (makes it easier +to compare inode numbers on network files from big endian systems). + +Version 1.56 +------------ +Add "forcemandatorylock" mount option to allow user to use mandatory +rather than posix (advisory) byte range locks, even though server would +support posix byte range locks. Fix query of root inode when prefixpath +specified and user does not have access to query information about the +top of the share. Fix problem in 2.6.28 resolving DFS paths to +Samba servers (worked to Windows). Fix rmdir so that pending search +(readdir) requests do not get invalid results which include the now +removed directory. Fix oops in cifs_dfs_ref.c when prefixpath is not reachable +when using DFS. Add better file create support to servers which support +the CIFS POSIX protocol extensions (this adds support for new flags +on create, and improves semantics for write of locked ranges). + +Version 1.55 +------------ +Various fixes to make delete of open files behavior more predictable +(when delete of an open file fails we mark the file as "delete-on-close" +in a way that more servers accept, but only if we can first rename the +file to a temporary name). Add experimental support for more safely +handling fcntl(F_SETLEASE). Convert cifs to using blocking tcp +sends, and also let tcp autotune the socket send and receive buffers. +This reduces the number of EAGAIN errors returned by TCP/IP in +high stress workloads (and the number of retries on socket writes +when sending large SMBWriteX requests). Fix case in which a portion of +data can in some cases not get written to the file on the server before the +file is closed. Fix DFS parsing to properly handle path consumed field, +and to handle certain codepage conversions better. Fix mount and +umount race that can cause oops in mount or umount or reconnect. + +Version 1.54 +------------ +Fix premature write failure on congested networks (we would give up +on EAGAIN from the socket too quickly on large writes). +Cifs_mkdir and cifs_create now respect the setgid bit on parent dir. +Fix endian problems in acl (mode from/to cifs acl) on bigendian +architectures. Fix problems with preserving timestamps on copying open +files (e.g. "cp -a") to Windows servers. For mkdir and create honor setgid bit +on parent directory when server supports Unix Extensions but not POSIX +create. Update cifs.upcall version to handle new Kerberos sec flags +(this requires update of cifs.upcall program from Samba). Fix memory leak +on dns_upcall (resolving DFS referralls). Fix plain text password +authentication (requires setting SecurityFlags to 0x30030 to enable +lanman and plain text though). Fix writes to be at correct offset when +file is open with O_APPEND and file is on a directio (forcediretio) mount. +Fix bug in rewinding readdir directory searches. Add nodfs mount option. + +Version 1.53 +------------ +DFS support added (Microsoft Distributed File System client support needed +for referrals which enable a hierarchical name space among servers). +Disable temporary caching of mode bits to servers which do not support +storing of mode (e.g. Windows servers, when client mounts without cifsacl +mount option) and add new "dynperm" mount option to enable temporary caching +of mode (enable old behavior). Fix hang on mount caused when server crashes +tcp session during negotiate protocol. + +Version 1.52 +------------ +Fix oops on second mount to server when null auth is used. +Enable experimental Kerberos support. Return writebehind errors on flush +and sync so that events like out of disk space get reported properly on +cached files. Fix setxattr failure to certain Samba versions. Fix mount +of second share to disconnected server session (autoreconnect on this). +Add ability to modify cifs acls for handling chmod (when mounted with +cifsacl flag). Fix prefixpath path separator so we can handle mounts +with prefixpaths longer than one directory (one path component) when +mounted to Windows servers. Fix slow file open when cifsacl +enabled. Fix memory leak in FindNext when the SMB call returns -EBADF. + + +Version 1.51 +------------ +Fix memory leak in statfs when mounted to very old servers (e.g. +Windows 9x). Add new feature "POSIX open" which allows servers +which support the current POSIX Extensions to provide better semantics +(e.g. delete for open files opened with posix open). Take into +account umask on posix mkdir not just older style mkdir. Add +ability to mount to IPC$ share (which allows CIFS named pipes to be +opened, read and written as if they were files). When 1st tree +connect fails (e.g. due to signing negotiation failure) fix +leak that causes cifsd not to stop and rmmod to fail to cleanup +cifs_request_buffers pool. Fix problem with POSIX Open/Mkdir on +bigendian architectures. Fix possible memory corruption when +EAGAIN returned on kern_recvmsg. Return better error if server +requires packet signing but client has disabled it. When mounted +with cifsacl mount option - mode bits are approximated based +on the contents of the ACL of the file or directory. When cifs +mount helper is missing convert make sure that UNC name +has backslash (not forward slash) between ip address of server +and the share name. + +Version 1.50 +------------ +Fix NTLMv2 signing. NFS server mounted over cifs works (if cifs mount is +done with "serverino" mount option). Add support for POSIX Unlink +(helps with certain sharing violation cases when server such as +Samba supports newer POSIX CIFS Protocol Extensions). Add "nounix" +mount option to allow disabling the CIFS Unix Extensions for just +that mount. Fix hang on spinlock in find_writable_file (race when +reopening file after session crash). Byte range unlock request to +windows server could unlock more bytes (on server copy of file) +than intended if start of unlock request is well before start of +a previous byte range lock that we issued. + +Version 1.49 +------------ +IPv6 support. Enable ipv6 addresses to be passed on mount (put the ipv6 +address after the "ip=" mount option, at least until mount.cifs is fixed to +handle DNS host to ipv6 name translation). Accept override of uid or gid +on mount even when Unix Extensions are negotiated (it used to be ignored +when Unix Extensions were ignored). This allows users to override the +default uid and gid for files when they are certain that the uids or +gids on the server do not match those of the client. Make "sec=none" +mount override username (so that null user connection is attempted) +to match what documentation said. Support for very large reads, over 127K, +available to some newer servers (such as Samba 3.0.26 and later but +note that it also requires setting CIFSMaxBufSize at module install +time to a larger value which may hurt performance in some cases). +Make sign option force signing (or fail if server does not support it). + +Version 1.48 +------------ +Fix mtime bouncing around from local idea of last write times to remote time. +Fix hang (in i_size_read) when simultaneous size update of same remote file +on smp system corrupts sequence number. Do not reread unnecessarily partial page +(which we are about to overwrite anyway) when writing out file opened rw. +When DOS attribute of file on non-Unix server's file changes on the server side +from read-only back to read-write, reflect this change in default file mode +(we had been leaving a file's mode read-only until the inode were reloaded). +Allow setting of attribute back to ATTR_NORMAL (removing readonly dos attribute +when archive dos attribute not set and we are changing mode back to writeable +on server which does not support the Unix Extensions). Remove read only dos +attribute on chmod when adding any write permission (ie on any of +user/group/other (not all of user/group/other ie 0222) when +mounted to windows. Add support for POSIX MkDir (slight performance +enhancement and eliminates the network race between the mkdir and set +path info of the mode). + + +Version 1.47 +------------ +Fix oops in list_del during mount caused by unaligned string. +Fix file corruption which could occur on some large file +copies caused by writepages page i/o completion bug. +Seek to SEEK_END forces check for update of file size for non-cached +files. Allow file size to be updated on remote extend of locally open, +non-cached file. Fix reconnect to newer Samba servers (or other servers +which support the CIFS Unix/POSIX extensions) so that we again tell the +server the Unix/POSIX cifs capabilities which we support (SetFSInfo). +Add experimental support for new POSIX Open/Mkdir (which returns +stat information on the open, and allows setting the mode). + +Version 1.46 +------------ +Support deep tree mounts. Better support OS/2, Win9x (DOS) time stamps. +Allow null user to be specified on mount ("username="). Do not return +EINVAL on readdir when filldir fails due to overwritten blocksize +(fixes FC problem). Return error in rename 2nd attempt retry (ie report +if rename by handle also fails, after rename by path fails, we were +not reporting whether the retry worked or not). Fix NTLMv2 to +work to Windows servers (mount with option "sec=ntlmv2"). + +Version 1.45 +------------ +Do not time out lockw calls when using posix extensions. Do not +time out requests if server still responding reasonably fast +on requests on other threads. Improve POSIX locking emulation, +(lock cancel now works, and unlock of merged range works even +to Windows servers now). Fix oops on mount to lanman servers +(win9x, os/2 etc.) when null password. Do not send listxattr +(SMB to query all EAs) if nouser_xattr specified. Fix SE Linux +problem (instantiate inodes/dentries in right order for readdir). + +Version 1.44 +------------ +Rewritten sessionsetup support, including support for legacy SMB +session setup needed for OS/2 and older servers such as Windows 95 and 98. +Fix oops on ls to OS/2 servers. Add support for level 1 FindFirst +so we can do search (ls etc.) to OS/2. Do not send NTCreateX +or recent levels of FindFirst unless server says it supports NT SMBs +(instead use legacy equivalents from LANMAN dialect). Fix to allow +NTLMv2 authentication support (now can use stronger password hashing +on mount if corresponding /proc/fs/cifs/SecurityFlags is set (0x4004). +Allow override of global cifs security flags on mount via "sec=" option(s). + +Version 1.43 +------------ +POSIX locking to servers which support CIFS POSIX Extensions +(disabled by default controlled by proc/fs/cifs/Experimental). +Handle conversion of long share names (especially Asian languages) +to Unicode during mount. Fix memory leak in sess struct on reconnect. +Fix rare oops after acpi suspend. Fix O_TRUNC opens to overwrite on +cifs open which helps rare case when setpathinfo fails or server does +not support it. + +Version 1.42 +------------ +Fix slow oplock break when mounted to different servers at the same time and +the tids match and we try to find matching fid on wrong server. Fix read +looping when signing required by server (2.6.16 kernel only). Fix readdir +vs. rename race which could cause each to hang. Return . and .. even +if server does not. Allow searches to skip first three entries and +begin at any location. Fix oops in find_writeable_file. + +Version 1.41 +------------ +Fix NTLMv2 security (can be enabled in /proc/fs/cifs) so customers can +configure stronger authentication. Fix sfu symlinks so they can +be followed (not just recognized). Fix wraparound of bcc on +read responses when buffer size over 64K and also fix wrap of +max smb buffer size when CIFSMaxBufSize over 64K. Fix oops in +cifs_user_read and cifs_readpages (when EAGAIN on send of smb +on socket is returned over and over). Add POSIX (advisory) byte range +locking support (requires server with newest CIFS UNIX Extensions +to the protocol implemented). Slow down negprot slightly in port 139 +RFC1001 case to give session_init time on buggy servers. + +Version 1.40 +------------ +Use fsuid (fsgid) more consistently instead of uid (gid). Improve performance +of readpages by eliminating one extra memcpy. Allow update of file size +from remote server even if file is open for write as long as mount is +directio. Recognize share mode security and send NTLM encrypted password +on tree connect if share mode negotiated. + +Version 1.39 +------------ +Defer close of a file handle slightly if pending writes depend on that handle +(this reduces the EBADF bad file handle errors that can be logged under heavy +stress on writes). Modify cifs Kconfig options to expose CONFIG_CIFS_STATS2 +Fix SFU style symlinks and mknod needed for servers which do not support the +CIFS Unix Extensions. Fix setfacl/getfacl on bigendian. Timeout negative +dentries so files that the client sees as deleted but that later get created +on the server will be recognized. Add client side permission check on setattr. +Timeout stuck requests better (where server has never responded or sent corrupt +responses) + +Version 1.38 +------------ +Fix tcp socket retransmission timeouts (e.g. on ENOSPACE from the socket) +to be smaller at first (but increasing) so large write performance performance +over GigE is better. Do not hang thread on illegal byte range lock response +from Windows (Windows can send an RFC1001 size which does not match smb size) by +allowing an SMBs TCP length to be up to a few bytes longer than it should be. +wsize and rsize can now be larger than negotiated buffer size if server +supports large readx/writex, even when directio mount flag not specified. +Write size will in many cases now be 16K instead of 4K which greatly helps +file copy performance on lightly loaded networks. Fix oops in dnotify +when experimental config flag enabled. Make cifsFYI more granular. + +Version 1.37 +------------ +Fix readdir caching when unlink removes file in current search buffer, +and this is followed by a rewind search to just before the deleted entry. +Do not attempt to set ctime unless atime and/or mtime change requested +(most servers throw it away anyway). Fix length check of received smbs +to be more accurate. Fix big endian problem with mapchars mount option, +and with a field returned by statfs. + +Version 1.36 +------------ +Add support for mounting to older pre-CIFS servers such as Windows9x and ME. +For these older servers, add option for passing netbios name of server in +on mount (servernetbiosname). Add suspend support for power management, to +avoid cifsd thread preventing software suspend from working. +Add mount option for disabling the default behavior of sending byte range lock +requests to the server (necessary for certain applications which break with +mandatory lock behavior such as Evolution), and also mount option for +requesting case insensitive matching for path based requests (requesting +case sensitive is the default). + +Version 1.35 +------------ +Add writepage performance improvements. Fix path name conversions +for long filenames on mounts which were done with "mapchars" mount option +specified. Ensure multiplex ids do not collide. Fix case in which +rmmod can oops if done soon after last unmount. Fix truncated +search (readdir) output when resume filename was a long filename. +Fix filename conversion when mapchars mount option was specified and +filename was a long filename. + +Version 1.34 +------------ +Fix error mapping of the TOO_MANY_LINKS (hardlinks) case. +Do not oops if root user kills cifs oplock kernel thread or +kills the cifsd thread (NB: killing the cifs kernel threads is not +recommended, unmount and rmmod cifs will kill them when they are +no longer needed). Fix readdir to ASCII servers (ie older servers +which do not support Unicode) and also require asterisk. +Fix out of memory case in which data could be written one page +off in the page cache. + +Version 1.33 +------------ +Fix caching problem, in which readdir of directory containing a file +which was cached could cause the file's time stamp to be updated +without invalidating the readahead data (so we could get stale +file data on the client for that file even as the server copy changed). +Cleanup response processing so cifsd can not loop when abnormally +terminated. + + +Version 1.32 +------------ +Fix oops in ls when Transact2 FindFirst (or FindNext) returns more than one +transact response for an SMB request and search entry split across two frames. +Add support for lsattr (getting ext2/ext3/reiserfs attr flags from the server) +as new protocol extensions. Do not send Get/Set calls for POSIX ACLs +unless server explicitly claims to support them in CIFS Unix extensions +POSIX ACL capability bit. Fix packet signing when multiuser mounting with +different users from the same client to the same server. Fix oops in +cifs_close. Add mount option for remapping reserved characters in +filenames (also allow recognizing files with created by SFU which have any +of these seven reserved characters, except backslash, to be recognized). +Fix invalid transact2 message (we were sometimes trying to interpret +oplock breaks as SMB responses). Add ioctl for checking that the +current uid matches the uid of the mounter (needed by umount.cifs). +Reduce the number of large buffer allocations in cifs response processing +(significantly reduces memory pressure under heavy stress with multiple +processes accessing the same server at the same time). + +Version 1.31 +------------ +Fix updates of DOS attributes and time fields so that files on NT4 servers +do not get marked delete on close. Display sizes of cifs buffer pools in +cifs stats. Fix oops in unmount when cifsd thread being killed by +shutdown. Add generic readv/writev and aio support. Report inode numbers +consistently in readdir and lookup (when serverino mount option is +specified use the inode number that the server reports - for both lookup +and readdir, otherwise by default the locally generated inode number is used +for inodes created in either path since servers are not always able to +provide unique inode numbers when exporting multiple volumes from under one +sharename). + +Version 1.30 +------------ +Allow new nouser_xattr mount parm to disable xattr support for user namespace. +Do not flag user_xattr mount parm in dmesg. Retry failures setting file time +(mostly affects NT4 servers) by retry with handle based network operation. +Add new POSIX Query FS Info for returning statfs info more accurately. +Handle passwords with multiple commas in them. + +Version 1.29 +------------ +Fix default mode in sysfs of cifs module parms. Remove old readdir routine. +Fix capabilities flags for large readx so as to allow reads larger than 64K. + +Version 1.28 +------------ +Add module init parm for large SMB buffer size (to allow it to be changed +from its default of 16K) which is especially useful for large file copy +when mounting with the directio mount option. Fix oops after +returning from mount when experimental ExtendedSecurity enabled and +SpnegoNegotiated returning invalid error. Fix case to retry better when +peek returns from 1 to 3 bytes on socket which should have more data. +Fixed path based calls (such as cifs lookup) to handle path names +longer than 530 (now can handle PATH_MAX). Fix pass through authentication +from Samba server to DC (Samba required dummy LM password). + +Version 1.27 +------------ +Turn off DNOTIFY (directory change notification support) by default +(unless built with the experimental flag) to fix hang with KDE +file browser. Fix DNOTIFY flag mappings. Fix hang (in wait_event +waiting on an SMB response) in SendReceive when session dies but +reconnects quickly from another task. Add module init parms for +minimum number of large and small network buffers in the buffer pools, +and for the maximum number of simultaneous requests. + +Version 1.26 +------------ +Add setfacl support to allow setting of ACLs remotely to Samba 3.10 and later +and other POSIX CIFS compliant servers. Fix error mapping for getfacl +to EOPNOTSUPP when server does not support posix acls on the wire. Fix +improperly zeroed buffer in CIFS Unix extensions set times call. + +Version 1.25 +------------ +Fix internationalization problem in cifs readdir with filenames that map to +longer UTF-8 strings than the string on the wire was in Unicode. Add workaround +for readdir to netapp servers. Fix search rewind (seek into readdir to return +non-consecutive entries). Do not do readdir when server negotiates +buffer size to small to fit filename. Add support for reading POSIX ACLs from +the server (add also acl and noacl mount options). + +Version 1.24 +------------ +Optionally allow using server side inode numbers, rather than client generated +ones by specifying mount option "serverino" - this is required for some apps +to work which double check hardlinked files and have persistent inode numbers. + +Version 1.23 +------------ +Multiple bigendian fixes. On little endian systems (for reconnect after +network failure) fix tcp session reconnect code so we do not try first +to reconnect on reverse of port 445. Treat reparse points (NTFS junctions) +as directories rather than symlinks because we can do follow link on them. + +Version 1.22 +------------ +Add config option to enable XATTR (extended attribute) support, mapping +xattr names in the "user." namespace space to SMB/CIFS EAs. Lots of +minor fixes pointed out by the Stanford SWAT checker (mostly missing +or out of order NULL pointer checks in little used error paths). + +Version 1.21 +------------ +Add new mount parm to control whether mode check (generic_permission) is done +on the client. If Unix extensions are enabled and the uids on the client +and server do not match, client permission checks are meaningless on +server uids that do not exist on the client (this does not affect the +normal ACL check which occurs on the server). Fix default uid +on mknod to match create and mkdir. Add optional mount parm to allow +override of the default uid behavior (in which the server sets the uid +and gid of newly created files). Normally for network filesystem mounts +user want the server to set the uid/gid on newly created files (rather than +using uid of the client processes you would in a local filesystem). + +Version 1.20 +------------ +Make transaction counts more consistent. Merge /proc/fs/cifs/SimultaneousOps +info into /proc/fs/cifs/DebugData. Fix oops in rare oops in readdir +(in build_wildcard_path_from_dentry). Fix mknod to pass type field +(block/char/fifo) properly. Remove spurious mount warning log entry when +credentials passed as mount argument. Set major/minor device number in +inode for block and char devices when unix extensions enabled. + +Version 1.19 +------------ +Fix /proc/fs/cifs/Stats and DebugData display to handle larger +amounts of return data. Properly limit requests to MAX_REQ (50 +is the usual maximum active multiplex SMB/CIFS requests per server). +Do not kill cifsd (and thus hurt the other SMB session) when more than one +session to the same server (but with different userids) exists and one +of the two user's smb sessions is being removed while leaving the other. +Do not loop reconnecting in cifsd demultiplex thread when admin +kills the thread without going through unmount. + +Version 1.18 +------------ +Do not rename hardlinked files (since that should be a noop). Flush +cached write behind data when reopening a file after session abend, +except when already in write. Grab per socket sem during reconnect +to avoid oops in sendmsg if overlapping with reconnect. Do not +reset cached inode file size on readdir for files open for write on +client. + + +Version 1.17 +------------ +Update number of blocks in file so du command is happier (in Linux a fake +blocksize of 512 is required for calculating number of blocks in inode). +Fix prepare write of partial pages to read in data from server if possible. +Fix race on tcpStatus field between unmount and reconnection code, causing +cifsd process sometimes to hang around forever. Improve out of memory +checks in cifs_filldir + +Version 1.16 +------------ +Fix incorrect file size in file handle based setattr on big endian hardware. +Fix oops in build_path_from_dentry when out of memory. Add checks for invalid +and closing file structs in writepage/partialpagewrite. Add statistics +for each mounted share (new menuconfig option). Fix endianness problem in +volume information displayed in /proc/fs/cifs/DebugData (only affects +affects big endian architectures). Prevent renames while constructing +path names for open, mkdir and rmdir. + +Version 1.15 +------------ +Change to mempools for alloc smb request buffers and multiplex structs +to better handle low memory problems (and potential deadlocks). + +Version 1.14 +------------ +Fix incomplete listings of large directories on Samba servers when Unix +extensions enabled. Fix oops when smb_buffer can not be allocated. Fix +rename deadlock when writing out dirty pages at same time. + +Version 1.13 +------------ +Fix open of files in which O_CREATE can cause the mode to change in +some cases. Fix case in which retry of write overlaps file close. +Fix PPC64 build error. Reduce excessive stack usage in smb password +hashing. Fix overwrite of Linux user's view of file mode to Windows servers. + +Version 1.12 +------------ +Fixes for large file copy, signal handling, socket retry, buffer +allocation and low memory situations. + +Version 1.11 +------------ +Better port 139 support to Windows servers (RFC1001/RFC1002 Session_Initialize) +also now allowing support for specifying client netbiosname. NT4 support added. + +Version 1.10 +------------ +Fix reconnection (and certain failed mounts) to properly wake up the +blocked users thread so it does not seem hung (in some cases was blocked +until the cifs receive timeout expired). Fix spurious error logging +to kernel log when application with open network files killed. + +Version 1.09 +------------ +Fix /proc/fs module unload warning message (that could be logged +to the kernel log). Fix intermittent failure in connectathon +test7 (hardlink count not immediately refreshed in case in which +inode metadata can be incorrectly kept cached when time near zero) + +Version 1.08 +------------ +Allow file_mode and dir_mode (specified at mount time) to be enforced +locally (the server already enforced its own ACLs too) for servers +that do not report the correct mode (do not support the +CIFS Unix Extensions). + +Version 1.07 +------------ +Fix some small memory leaks in some unmount error paths. Fix major leak +of cache pages in readpages causing multiple read oriented stress +testcases (including fsx, and even large file copy) to fail over time. + +Version 1.06 +------------ +Send NTCreateX with ATTR_POSIX if Linux/Unix extensions negotiated with server. +This allows files that differ only in case and improves performance of file +creation and file open to such servers. Fix semaphore conflict which causes +slow delete of open file to Samba (which unfortunately can cause an oplock +break to self while vfs_unlink held i_sem) which can hang for 20 seconds. + +Version 1.05 +------------ +fixes to cifs_readpages for fsx test case + +Version 1.04 +------------ +Fix caching data integrity bug when extending file size especially when no +oplock on file. Fix spurious logging of valid already parsed mount options +that are parsed outside of the cifs vfs such as nosuid. + + +Version 1.03 +------------ +Connect to server when port number override not specified, and tcp port +unitialized. Reset search to restart at correct file when kernel routine +filldir returns error during large directory searches (readdir). + +Version 1.02 +------------ +Fix caching problem when files opened by multiple clients in which +page cache could contain stale data, and write through did +not occur often enough while file was still open when read ahead +(read oplock) not allowed. Treat "sep=" when first mount option +as an override of comma as the default separator between mount +options. + +Version 1.01 +------------ +Allow passwords longer than 16 bytes. Allow null password string. + +Version 1.00 +------------ +Gracefully clean up failed mounts when attempting to mount to servers such as +Windows 98 that terminate tcp sessions during protocol negotiation. Handle +embedded commas in mount parsing of passwords. + +Version 0.99 +------------ +Invalidate local inode cached pages on oplock break and when last file +instance is closed so that the client does not continue using stale local +copy rather than later modified server copy of file. Do not reconnect +when server drops the tcp session prematurely before negotiate +protocol response. Fix oops in reopen_file when dentry freed. Allow +the support for CIFS Unix Extensions to be disabled via proc interface. + +Version 0.98 +------------ +Fix hang in commit_write during reconnection of open files under heavy load. +Fix unload_nls oops in a mount failure path. Serialize writes to same socket +which also fixes any possible races when cifs signatures are enabled in SMBs +being sent out of signature sequence number order. + +Version 0.97 +------------ +Fix byte range locking bug (endian problem) causing bad offset and +length. + +Version 0.96 +------------ +Fix oops (in send_sig) caused by CIFS unmount code trying to +wake up the demultiplex thread after it had exited. Do not log +error on harmless oplock release of closed handle. + +Version 0.95 +------------ +Fix unsafe global variable usage and password hash failure on gcc 3.3.1 +Fix problem reconnecting secondary mounts to same server after session +failure. Fix invalid dentry - race in mkdir when directory gets created +by another client between the lookup and mkdir. + +Version 0.94 +------------ +Fix to list processing in reopen_files. Fix reconnection when server hung +but tcpip session still alive. Set proper timeout on socket read. + +Version 0.93 +------------ +Add missing mount options including iocharset. SMP fixes in write and open. +Fix errors in reconnecting after TCP session failure. Fix module unloading +of default nls codepage + +Version 0.92 +------------ +Active smb transactions should never go negative (fix double FreeXid). Fix +list processing in file routines. Check return code on kmalloc in open. +Fix spinlock usage for SMP. + +Version 0.91 +------------ +Fix oops in reopen_files when invalid dentry. drop dentry on server rename +and on revalidate errors. Fix cases where pid is now tgid. Fix return code +on create hard link when server does not support them. + +Version 0.90 +------------ +Fix scheduling while atomic error in getting inode info on newly created file. +Fix truncate of existing files opened with O_CREAT but not O_TRUNC set. + +Version 0.89 +------------ +Fix oops on write to dead tcp session. Remove error log write for case when file open +O_CREAT but not O_EXCL + +Version 0.88 +------------ +Fix non-POSIX behavior on rename of open file and delete of open file by taking +advantage of trans2 SetFileInfo rename facility if available on target server. +Retry on ENOSPC and EAGAIN socket errors. + +Version 0.87 +------------ +Fix oops on big endian readdir. Set blksize to be even power of two (2**blkbits) to fix +allocation size miscalculation. After oplock token lost do not read through +cache. + +Version 0.86 +------------ +Fix oops on empty file readahead. Fix for file size handling for locally cached files. + +Version 0.85 +------------ +Fix oops in mkdir when server fails to return inode info. Fix oops in reopen_files +during auto reconnection to server after server recovered from failure. + +Version 0.84 +------------ +Finish support for Linux 2.5 open/create changes, which removes the +redundant NTCreate/QPathInfo/close that was sent during file create. +Enable oplock by default. Enable packet signing by default (needed to +access many recent Windows servers) + +Version 0.83 +------------ +Fix oops when mounting to long server names caused by inverted parms to kmalloc. +Fix MultiuserMount (/proc/fs/cifs configuration setting) so that when enabled +we will choose a cifs user session (smb uid) that better matches the local +uid if a) the mount uid does not match the current uid and b) we have another +session to the same server (ip address) for a different mount which +matches the current local uid. + +Version 0.82 +------------ +Add support for mknod of block or character devices. Fix oplock +code (distributed caching) to properly send response to oplock +break from server. + +Version 0.81 +------------ +Finish up CIFS packet digital signing for the default +NTLM security case. This should help Windows 2003 +network interoperability since it is common for +packet signing to be required now. Fix statfs (stat -f) +which recently started returning errors due to +invalid value (-1 instead of 0) being set in the +struct kstatfs f_ffiles field. + +Version 0.80 +----------- +Fix oops on stopping oplock thread when removing cifs when +built as module. + +Version 0.79 +------------ +Fix mount options for ro (readonly), uid, gid and file and directory mode. + +Version 0.78 +------------ +Fix errors displayed on failed mounts to be more understandable. +Fixed various incorrect or misleading smb to posix error code mappings. + +Version 0.77 +------------ +Fix display of NTFS DFS junctions to display as symlinks. +They are the network equivalent. Fix oops in +cifs_partialpagewrite caused by missing spinlock protection +of openfile linked list. Allow writebehind caching errors to +be returned to the application at file close. + +Version 0.76 +------------ +Clean up options displayed in /proc/mounts by show_options to +be more consistent with other filesystems. + +Version 0.75 +------------ +Fix delete of readonly file to Windows servers. Reflect +presence or absence of read only dos attribute in mode +bits for servers that do not support CIFS Unix extensions. +Fix shortened results on readdir of large directories to +servers supporting CIFS Unix extensions (caused by +incorrect resume key). + +Version 0.74 +------------ +Fix truncate bug (set file size) that could cause hangs e.g. running fsx + +Version 0.73 +------------ +unload nls if mount fails. + +Version 0.72 +------------ +Add resume key support to search (readdir) code to workaround +Windows bug. Add /proc/fs/cifs/LookupCacheEnable which +allows disabling caching of attribute information for +lookups. + +Version 0.71 +------------ +Add more oplock handling (distributed caching code). Remove +dead code. Remove excessive stack space utilization from +symlink routines. + +Version 0.70 +------------ +Fix oops in get dfs referral (triggered when null path sent in to +mount). Add support for overriding rsize at mount time. + +Version 0.69 +------------ +Fix buffer overrun in readdir which caused intermittent kernel oopses. +Fix writepage code to release kmap on write data. Allow "-ip=" new +mount option to be passed in on parameter distinct from the first part +(server name portion of) the UNC name. Allow override of the +tcp port of the target server via new mount option "-port=" + +Version 0.68 +------------ +Fix search handle leak on rewind. Fix setuid and gid so that they are +reflected in the local inode immediately. Cleanup of whitespace +to make 2.4 and 2.5 versions more consistent. + + +Version 0.67 +------------ +Fix signal sending so that captive thread (cifsd) exits on umount +(which was causing the warning in kmem_cache_free of the request buffers +at rmmod time). This had broken as a sideeffect of the recent global +kernel change to daemonize. Fix memory leak in readdir code which +showed up in "ls -R" (and applications that did search rewinding). + +Version 0.66 +------------ +Reconnect tids and fids after session reconnection (still do not +reconnect byte range locks though). Fix problem caching +lookup information for directory inodes, improving performance, +especially in deep directory trees. Fix various build warnings. + +Version 0.65 +------------ +Finish fixes to commit write for caching/readahead consistency. fsx +now works to Samba servers. Fix oops caused when readahead +was interrupted by a signal. + +Version 0.64 +------------ +Fix data corruption (in partial page after truncate) that caused fsx to +fail to Windows servers. Cleaned up some extraneous error logging in +common error paths. Add generic sendfile support. + +Version 0.63 +------------ +Fix memory leak in AllocMidQEntry. +Finish reconnection logic, so connection with server can be dropped +(or server rebooted) and the cifs client will reconnect. + +Version 0.62 +------------ +Fix temporary socket leak when bad userid or password specified +(or other SMBSessSetup failure). Increase maximum buffer size to slightly +over 16K to allow negotiation of up to Samba and Windows server default read +sizes. Add support for readpages + +Version 0.61 +------------ +Fix oops when username not passed in on mount. Extensive fixes and improvements +to error logging (strip redundant newlines, change debug macros to ensure newline +passed in and to be more consistent). Fix writepage wrong file handle problem, +a readonly file handle could be incorrectly used to attempt to write out +file updates through the page cache to multiply open files. This could cause +the iozone benchmark to fail on the fwrite test. Fix bug mounting two different +shares to the same Windows server when using different usernames +(doing this to Samba servers worked but Windows was rejecting it) - now it is +possible to use different userids when connecting to the same server from a +Linux client. Fix oops when treeDisconnect called during unmount on +previously freed socket. + +Version 0.60 +------------ +Fix oops in readpages caused by not setting address space operations in inode in +rare code path. + +Version 0.59 +------------ +Includes support for deleting of open files and renaming over existing files (per POSIX +requirement). Add readlink support for Windows junction points (directory symlinks). + +Version 0.58 +------------ +Changed read and write to go through pagecache. Added additional address space operations. +Memory mapped operations now working. + +Version 0.57 +------------ +Added writepage code for additional memory mapping support. Fixed leak in xids causing +the simultaneous operations counter (/proc/fs/cifs/SimultaneousOps) to increase on +every stat call. Additional formatting cleanup. + +Version 0.56 +------------ +Fix bigendian bug in order of time conversion. Merge 2.5 to 2.4 version. Formatting cleanup. + +Version 0.55 +------------ +Fixes from Zwane Mwaikambo for adding missing return code checking in a few places. +Also included a modified version of his fix to protect global list manipulation of +the smb session and tree connection and mid related global variables. + +Version 0.54 +------------ +Fix problem with captive thread hanging around at unmount time. Adjust to 2.5.42-pre +changes to superblock layout. Remove wasteful allocation of smb buffers (now the send +buffer is reused for responses). Add more oplock handling. Additional minor cleanup. + +Version 0.53 +------------ +More stylistic updates to better match kernel style. Add additional statistics +for filesystem which can be viewed via /proc/fs/cifs. Add more pieces of NTLMv2 +and CIFS Packet Signing enablement. + +Version 0.52 +------------ +Replace call to sleep_on with safer wait_on_event. +Make stylistic changes to better match kernel style recommendations. +Remove most typedef usage (except for the PDUs themselves). + +Version 0.51 +------------ +Update mount so the -unc mount option is no longer required (the ip address can be specified +in a UNC style device name. Implementation of readpage/writepage started. + +Version 0.50 +------------ +Fix intermittent problem with incorrect smb header checking on badly +fragmented tcp responses + +Version 0.49 +------------ +Fixes to setting of allocation size and file size. + +Version 0.48 +------------ +Various 2.5.38 fixes. Now works on 2.5.38 + +Version 0.47 +------------ +Prepare for 2.5 kernel merge. Remove ifdefs. + +Version 0.46 +------------ +Socket buffer management fixes. Fix dual free. + +Version 0.45 +------------ +Various big endian fixes for hardlinks and symlinks and also for dfs. + +Version 0.44 +------------ +Various big endian fixes for servers with Unix extensions such as Samba + +Version 0.43 +------------ +Various FindNext fixes for incorrect filenames on large directory searches on big endian +clients. basic posix file i/o tests now work on big endian machines, not just le + +Version 0.42 +------------ +SessionSetup and NegotiateProtocol now work from Big Endian machines. +Various Big Endian fixes found during testing on the Linux on 390. Various fixes for compatibility with older +versions of 2.4 kernel (now builds and works again on kernels at least as early as 2.4.7). + +Version 0.41 +------------ +Various minor fixes for Connectathon Posix "basic" file i/o test suite. Directory caching fixed so hardlinked +files now return the correct number of links on fstat as they are repeatedly linked and unlinked. + +Version 0.40 +------------ +Implemented "Raw" (i.e. not encapsulated in SPNEGO) NTLMSSP (i.e. the Security Provider Interface used to negotiate +session advanced session authentication). Raw NTLMSSP is preferred by Windows 2000 Professional and Windows XP. +Began implementing support for SPNEGO encapsulation of NTLMSSP based session authentication blobs +(which is the mechanism preferred by Windows 2000 server in the absence of Kerberos). + +Version 0.38 +------------ +Introduced optional mount helper utility mount.cifs and made coreq changes to cifs vfs to enable +it. Fixed a few bugs in the DFS code (e.g. bcc two bytes too short and incorrect uid in PDU). + +Version 0.37 +------------ +Rewrote much of connection and mount/unmount logic to handle bugs with +multiple uses to same share, multiple users to same server etc. + +Version 0.36 +------------ +Fixed major problem with dentry corruption (missing call to dput) + +Version 0.35 +------------ +Rewrite of readdir code to fix bug. Various fixes for bigendian machines. +Begin adding oplock support. Multiusermount and oplockEnabled flags added to /proc/fs/cifs +although corresponding function not fully implemented in the vfs yet + +Version 0.34 +------------ +Fixed dentry caching bug, misc. cleanup + +Version 0.33 +------------ +Fixed 2.5 support to handle build and configure changes as well as misc. 2.5 changes. Now can build +on current 2.5 beta version (2.5.24) of the Linux kernel as well as on 2.4 Linux kernels. +Support for STATUS codes (newer 32 bit NT error codes) added. DFS support begun to be added. + +Version 0.32 +------------ +Unix extensions (symlink, readlink, hardlink, chmod and some chgrp and chown) implemented +and tested against Samba 2.2.5 + + +Version 0.31 +------------ +1) Fixed lockrange to be correct (it was one byte too short) + +2) Fixed GETLK (i.e. the fcntl call to test a range of bytes in a file to see if locked) to correctly +show range as locked when there is a conflict with an existing lock. + +3) default file perms are now 2767 (indicating support for mandatory locks) instead of 777 for directories +in most cases. Eventually will offer optional ability to query server for the correct perms. + +3) Fixed eventual trap when mounting twice to different shares on the same server when the first succeeded +but the second one was invalid and failed (the second one was incorrectly disconnecting the tcp and smb +session) + +4) Fixed error logging of valid mount options + +5) Removed logging of password field. + +6) Moved negotiate, treeDisconnect and uloggoffX (only tConx and SessSetup remain in connect.c) to cifssmb.c +and cleaned them up and made them more consistent with other cifs functions. + +7) Server support for Unix extensions is now fully detected and FindFirst is implemented both ways +(with or without Unix extensions) but FindNext and QueryPathInfo with the Unix extensions are not completed, +nor is the symlink support using the Unix extensions + +8) Started adding the readlink and follow_link code + +Version 0.3 +----------- +Initial drop + diff --git a/Documentation/filesystems/cifs/README b/Documentation/filesystems/cifs/README new file mode 100644 index 00000000000..2d5622f60e1 --- /dev/null +++ b/Documentation/filesystems/cifs/README @@ -0,0 +1,753 @@ +The CIFS VFS support for Linux supports many advanced network filesystem +features such as hierarchical dfs like namespace, hardlinks, locking and more. +It was designed to comply with the SNIA CIFS Technical Reference (which +supersedes the 1992 X/Open SMB Standard) as well as to perform best practice +practical interoperability with Windows 2000, Windows XP, Samba and equivalent +servers. This code was developed in participation with the Protocol Freedom +Information Foundation. + +Please see + http://protocolfreedom.org/ and + http://samba.org/samba/PFIF/ +for more details. + + +For questions or bug reports please contact: + sfrench@samba.org (sfrench@us.ibm.com) + +Build instructions: +================== +For Linux 2.4: +1) Get the kernel source (e.g.from http://www.kernel.org) +and download the cifs vfs source (see the project page +at http://us1.samba.org/samba/Linux_CIFS_client.html) +and change directory into the top of the kernel directory +then patch the kernel (e.g. "patch -p1 < cifs_24.patch") +to add the cifs vfs to your kernel configure options if +it has not already been added (e.g. current SuSE and UL +users do not need to apply the cifs_24.patch since the cifs vfs is +already in the kernel configure menu) and then +mkdir linux/fs/cifs and then copy the current cifs vfs files from +the cifs download to your kernel build directory e.g. + + cp <cifs_download_dir>/fs/cifs/* to <kernel_download_dir>/fs/cifs + +2) make menuconfig (or make xconfig) +3) select cifs from within the network filesystem choices +4) save and exit +5) make dep +6) make modules (or "make" if CIFS VFS not to be built as a module) + +For Linux 2.6: +1) Download the kernel (e.g. from http://www.kernel.org) +and change directory into the top of the kernel directory tree +(e.g. /usr/src/linux-2.5.73) +2) make menuconfig (or make xconfig) +3) select cifs from within the network filesystem choices +4) save and exit +5) make + + +Installation instructions: +========================= +If you have built the CIFS vfs as module (successfully) simply +type "make modules_install" (or if you prefer, manually copy the file to +the modules directory e.g. /lib/modules/2.4.10-4GB/kernel/fs/cifs/cifs.o). + +If you have built the CIFS vfs into the kernel itself, follow the instructions +for your distribution on how to install a new kernel (usually you +would simply type "make install"). + +If you do not have the utility mount.cifs (in the Samba 3.0 source tree and on +the CIFS VFS web site) copy it to the same directory in which mount.smbfs and +similar files reside (usually /sbin). Although the helper software is not +required, mount.cifs is recommended. Eventually the Samba 3.0 utility program +"net" may also be helpful since it may someday provide easier mount syntax for +users who are used to Windows e.g. + net use <mount point> <UNC name or cifs URL> +Note that running the Winbind pam/nss module (logon service) on all of your +Linux clients is useful in mapping Uids and Gids consistently across the +domain to the proper network user. The mount.cifs mount helper can be +trivially built from Samba 3.0 or later source e.g. by executing: + + gcc samba/source/client/mount.cifs.c -o mount.cifs + +If cifs is built as a module, then the size and number of network buffers +and maximum number of simultaneous requests to one server can be configured. +Changing these from their defaults is not recommended. By executing modinfo + modinfo kernel/fs/cifs/cifs.ko +on kernel/fs/cifs/cifs.ko the list of configuration changes that can be made +at module initialization time (by running insmod cifs.ko) can be seen. + +Allowing User Mounts +==================== +To permit users to mount and unmount over directories they own is possible +with the cifs vfs. A way to enable such mounting is to mark the mount.cifs +utility as suid (e.g. "chmod +s /sbin/mount.cifs). To enable users to +umount shares they mount requires +1) mount.cifs version 1.4 or later +2) an entry for the share in /etc/fstab indicating that a user may +unmount it e.g. +//server/usersharename /mnt/username cifs user 0 0 + +Note that when the mount.cifs utility is run suid (allowing user mounts), +in order to reduce risks, the "nosuid" mount flag is passed in on mount to +disallow execution of an suid program mounted on the remote target. +When mount is executed as root, nosuid is not passed in by default, +and execution of suid programs on the remote target would be enabled +by default. This can be changed, as with nfs and other filesystems, +by simply specifying "nosuid" among the mount options. For user mounts +though to be able to pass the suid flag to mount requires rebuilding +mount.cifs with the following flag: + + gcc samba/source/client/mount.cifs.c -DCIFS_ALLOW_USR_SUID -o mount.cifs + +There is a corresponding manual page for cifs mounting in the Samba 3.0 and +later source tree in docs/manpages/mount.cifs.8 + +Allowing User Unmounts +====================== +To permit users to ummount directories that they have user mounted (see above), +the utility umount.cifs may be used. It may be invoked directly, or if +umount.cifs is placed in /sbin, umount can invoke the cifs umount helper +(at least for most versions of the umount utility) for umount of cifs +mounts, unless umount is invoked with -i (which will avoid invoking a umount +helper). As with mount.cifs, to enable user unmounts umount.cifs must be marked +as suid (e.g. "chmod +s /sbin/umount.cifs") or equivalent (some distributions +allow adding entries to a file to the /etc/permissions file to achieve the +equivalent suid effect). For this utility to succeed the target path +must be a cifs mount, and the uid of the current user must match the uid +of the user who mounted the resource. + +Also note that the customary way of allowing user mounts and unmounts is +(instead of using mount.cifs and unmount.cifs as suid) to add a line +to the file /etc/fstab for each //server/share you wish to mount, but +this can become unwieldy when potential mount targets include many +or unpredictable UNC names. + +Samba Considerations +==================== +To get the maximum benefit from the CIFS VFS, we recommend using a server that +supports the SNIA CIFS Unix Extensions standard (e.g. Samba 2.2.5 or later or +Samba 3.0) but the CIFS vfs works fine with a wide variety of CIFS servers. +Note that uid, gid and file permissions will display default values if you do +not have a server that supports the Unix extensions for CIFS (such as Samba +2.2.5 or later). To enable the Unix CIFS Extensions in the Samba server, add +the line: + + unix extensions = yes + +to your smb.conf file on the server. Note that the following smb.conf settings +are also useful (on the Samba server) when the majority of clients are Unix or +Linux: + + case sensitive = yes + delete readonly = yes + ea support = yes + +Note that server ea support is required for supporting xattrs from the Linux +cifs client, and that EA support is present in later versions of Samba (e.g. +3.0.6 and later (also EA support works in all versions of Windows, at least to +shares on NTFS filesystems). Extended Attribute (xattr) support is an optional +feature of most Linux filesystems which may require enabling via +make menuconfig. Client support for extended attributes (user xattr) can be +disabled on a per-mount basis by specifying "nouser_xattr" on mount. + +The CIFS client can get and set POSIX ACLs (getfacl, setfacl) to Samba servers +version 3.10 and later. Setting POSIX ACLs requires enabling both XATTR and +then POSIX support in the CIFS configuration options when building the cifs +module. POSIX ACL support can be disabled on a per mount basic by specifying +"noacl" on mount. + +Some administrators may want to change Samba's smb.conf "map archive" and +"create mask" parameters from the default. Unless the create mask is changed +newly created files can end up with an unnecessarily restrictive default mode, +which may not be what you want, although if the CIFS Unix extensions are +enabled on the server and client, subsequent setattr calls (e.g. chmod) can +fix the mode. Note that creating special devices (mknod) remotely +may require specifying a mkdev function to Samba if you are not using +Samba 3.0.6 or later. For more information on these see the manual pages +("man smb.conf") on the Samba server system. Note that the cifs vfs, +unlike the smbfs vfs, does not read the smb.conf on the client system +(the few optional settings are passed in on mount via -o parameters instead). +Note that Samba 2.2.7 or later includes a fix that allows the CIFS VFS to delete +open files (required for strict POSIX compliance). Windows Servers already +supported this feature. Samba server does not allow symlinks that refer to files +outside of the share, so in Samba versions prior to 3.0.6, most symlinks to +files with absolute paths (ie beginning with slash) such as: + ln -s /mnt/foo bar +would be forbidden. Samba 3.0.6 server or later includes the ability to create +such symlinks safely by converting unsafe symlinks (ie symlinks to server +files that are outside of the share) to a samba specific format on the server +that is ignored by local server applications and non-cifs clients and that will +not be traversed by the Samba server). This is opaque to the Linux client +application using the cifs vfs. Absolute symlinks will work to Samba 3.0.5 or +later, but only for remote clients using the CIFS Unix extensions, and will +be invisbile to Windows clients and typically will not affect local +applications running on the same server as Samba. + +Use instructions: +================ +Once the CIFS VFS support is built into the kernel or installed as a module +(cifs.o), you can use mount syntax like the following to access Samba or Windows +servers: + + mount -t cifs //9.53.216.11/e$ /mnt -o user=myname,pass=mypassword + +Before -o the option -v may be specified to make the mount.cifs +mount helper display the mount steps more verbosely. +After -o the following commonly used cifs vfs specific options +are supported: + + user=<username> + pass=<password> + domain=<domain name> + +Other cifs mount options are described below. Use of TCP names (in addition to +ip addresses) is available if the mount helper (mount.cifs) is installed. If +you do not trust the server to which are mounted, or if you do not have +cifs signing enabled (and the physical network is insecure), consider use +of the standard mount options "noexec" and "nosuid" to reduce the risk of +running an altered binary on your local system (downloaded from a hostile server +or altered by a hostile router). + +Although mounting using format corresponding to the CIFS URL specification is +not possible in mount.cifs yet, it is possible to use an alternate format +for the server and sharename (which is somewhat similar to NFS style mount +syntax) instead of the more widely used UNC format (i.e. \\server\share): + mount -t cifs tcp_name_of_server:share_name /mnt -o user=myname,pass=mypasswd + +When using the mount helper mount.cifs, passwords may be specified via alternate +mechanisms, instead of specifying it after -o using the normal "pass=" syntax +on the command line: +1) By including it in a credential file. Specify credentials=filename as one +of the mount options. Credential files contain two lines + username=someuser + password=your_password +2) By specifying the password in the PASSWD environment variable (similarly +the user name can be taken from the USER environment variable). +3) By specifying the password in a file by name via PASSWD_FILE +4) By specifying the password in a file by file descriptor via PASSWD_FD + +If no password is provided, mount.cifs will prompt for password entry + +Restrictions +============ +Servers must support either "pure-TCP" (port 445 TCP/IP CIFS connections) or RFC +1001/1002 support for "Netbios-Over-TCP/IP." This is not likely to be a +problem as most servers support this. + +Valid filenames differ between Windows and Linux. Windows typically restricts +filenames which contain certain reserved characters (e.g.the character : +which is used to delimit the beginning of a stream name by Windows), while +Linux allows a slightly wider set of valid characters in filenames. Windows +servers can remap such characters when an explicit mapping is specified in +the Server's registry. Samba starting with version 3.10 will allow such +filenames (ie those which contain valid Linux characters, which normally +would be forbidden for Windows/CIFS semantics) as long as the server is +configured for Unix Extensions (and the client has not disabled +/proc/fs/cifs/LinuxExtensionsEnabled). + + +CIFS VFS Mount Options +====================== +A partial list of the supported mount options follows: + user The user name to use when trying to establish + the CIFS session. + password The user password. If the mount helper is + installed, the user will be prompted for password + if not supplied. + ip The ip address of the target server + unc The target server Universal Network Name (export) to + mount. + domain Set the SMB/CIFS workgroup name prepended to the + username during CIFS session establishment + forceuid Set the default uid for inodes to the uid + passed in on mount. For mounts to servers + which do support the CIFS Unix extensions, such as a + properly configured Samba server, the server provides + the uid, gid and mode so this parameter should not be + specified unless the server and clients uid and gid + numbering differ. If the server and client are in the + same domain (e.g. running winbind or nss_ldap) and + the server supports the Unix Extensions then the uid + and gid can be retrieved from the server (and uid + and gid would not have to be specifed on the mount. + For servers which do not support the CIFS Unix + extensions, the default uid (and gid) returned on lookup + of existing files will be the uid (gid) of the person + who executed the mount (root, except when mount.cifs + is configured setuid for user mounts) unless the "uid=" + (gid) mount option is specified. Also note that permission + checks (authorization checks) on accesses to a file occur + at the server, but there are cases in which an administrator + may want to restrict at the client as well. For those + servers which do not report a uid/gid owner + (such as Windows), permissions can also be checked at the + client, and a crude form of client side permission checking + can be enabled by specifying file_mode and dir_mode on + the client. (default) + forcegid (similar to above but for the groupid instead of uid) (default) + noforceuid Fill in file owner information (uid) by requesting it from + the server if possible. With this option, the value given in + the uid= option (on mount) will only be used if the server + can not support returning uids on inodes. + noforcegid (similar to above but for the group owner, gid, instead of uid) + uid Set the default uid for inodes, and indicate to the + cifs kernel driver which local user mounted. If the server + supports the unix extensions the default uid is + not used to fill in the owner fields of inodes (files) + unless the "forceuid" parameter is specified. + gid Set the default gid for inodes (similar to above). + file_mode If CIFS Unix extensions are not supported by the server + this overrides the default mode for file inodes. + fsc Enable local disk caching using FS-Cache (off by default). This + option could be useful to improve performance on a slow link, + heavily loaded server and/or network where reading from the + disk is faster than reading from the server (over the network). + This could also impact scalability positively as the + number of calls to the server are reduced. However, local + caching is not suitable for all workloads for e.g. read-once + type workloads. So, you need to consider carefully your + workload/scenario before using this option. Currently, local + disk caching is functional for CIFS files opened as read-only. + dir_mode If CIFS Unix extensions are not supported by the server + this overrides the default mode for directory inodes. + port attempt to contact the server on this tcp port, before + trying the usual ports (port 445, then 139). + iocharset Codepage used to convert local path names to and from + Unicode. Unicode is used by default for network path + names if the server supports it. If iocharset is + not specified then the nls_default specified + during the local client kernel build will be used. + If server does not support Unicode, this parameter is + unused. + rsize default read size (usually 16K). The client currently + can not use rsize larger than CIFSMaxBufSize. CIFSMaxBufSize + defaults to 16K and may be changed (from 8K to the maximum + kmalloc size allowed by your kernel) at module install time + for cifs.ko. Setting CIFSMaxBufSize to a very large value + will cause cifs to use more memory and may reduce performance + in some cases. To use rsize greater than 127K (the original + cifs protocol maximum) also requires that the server support + a new Unix Capability flag (for very large read) which some + newer servers (e.g. Samba 3.0.26 or later) do. rsize can be + set from a minimum of 2048 to a maximum of 130048 (127K or + CIFSMaxBufSize, whichever is smaller) + wsize default write size (default 57344) + maximum wsize currently allowed by CIFS is 57344 (fourteen + 4096 byte pages) + actimeo=n attribute cache timeout in seconds (default 1 second). + After this timeout, the cifs client requests fresh attribute + information from the server. This option allows to tune the + attribute cache timeout to suit the workload needs. Shorter + timeouts mean better the cache coherency, but increased number + of calls to the server. Longer timeouts mean reduced number + of calls to the server at the expense of less stricter cache + coherency checks (i.e. incorrect attribute cache for a short + period of time). + rw mount the network share read-write (note that the + server may still consider the share read-only) + ro mount network share read-only + version used to distinguish different versions of the + mount helper utility (not typically needed) + sep if first mount option (after the -o), overrides + the comma as the separator between the mount + parms. e.g. + -o user=myname,password=mypassword,domain=mydom + could be passed instead with period as the separator by + -o sep=.user=myname.password=mypassword.domain=mydom + this might be useful when comma is contained within username + or password or domain. This option is less important + when the cifs mount helper cifs.mount (version 1.1 or later) + is used. + nosuid Do not allow remote executables with the suid bit + program to be executed. This is only meaningful for mounts + to servers such as Samba which support the CIFS Unix Extensions. + If you do not trust the servers in your network (your mount + targets) it is recommended that you specify this option for + greater security. + exec Permit execution of binaries on the mount. + noexec Do not permit execution of binaries on the mount. + dev Recognize block devices on the remote mount. + nodev Do not recognize devices on the remote mount. + suid Allow remote files on this mountpoint with suid enabled to + be executed (default for mounts when executed as root, + nosuid is default for user mounts). + credentials Although ignored by the cifs kernel component, it is used by + the mount helper, mount.cifs. When mount.cifs is installed it + opens and reads the credential file specified in order + to obtain the userid and password arguments which are passed to + the cifs vfs. + guest Although ignored by the kernel component, the mount.cifs + mount helper will not prompt the user for a password + if guest is specified on the mount options. If no + password is specified a null password will be used. + perm Client does permission checks (vfs_permission check of uid + and gid of the file against the mode and desired operation), + Note that this is in addition to the normal ACL check on the + target machine done by the server software. + Client permission checking is enabled by default. + noperm Client does not do permission checks. This can expose + files on this mount to access by other users on the local + client system. It is typically only needed when the server + supports the CIFS Unix Extensions but the UIDs/GIDs on the + client and server system do not match closely enough to allow + access by the user doing the mount, but it may be useful with + non CIFS Unix Extension mounts for cases in which the default + mode is specified on the mount but is not to be enforced on the + client (e.g. perhaps when MultiUserMount is enabled) + Note that this does not affect the normal ACL check on the + target machine done by the server software (of the server + ACL against the user name provided at mount time). + serverino Use server's inode numbers instead of generating automatically + incrementing inode numbers on the client. Although this will + make it easier to spot hardlinked files (as they will have + the same inode numbers) and inode numbers may be persistent, + note that the server does not guarantee that the inode numbers + are unique if multiple server side mounts are exported under a + single share (since inode numbers on the servers might not + be unique if multiple filesystems are mounted under the same + shared higher level directory). Note that some older + (e.g. pre-Windows 2000) do not support returning UniqueIDs + or the CIFS Unix Extensions equivalent and for those + this mount option will have no effect. Exporting cifs mounts + under nfsd requires this mount option on the cifs mount. + This is now the default if server supports the + required network operation. + noserverino Client generates inode numbers (rather than using the actual one + from the server). These inode numbers will vary after + unmount or reboot which can confuse some applications, + but not all server filesystems support unique inode + numbers. + setuids If the CIFS Unix extensions are negotiated with the server + the client will attempt to set the effective uid and gid of + the local process on newly created files, directories, and + devices (create, mkdir, mknod). If the CIFS Unix Extensions + are not negotiated, for newly created files and directories + instead of using the default uid and gid specified on + the mount, cache the new file's uid and gid locally which means + that the uid for the file can change when the inode is + reloaded (or the user remounts the share). + nosetuids The client will not attempt to set the uid and gid on + on newly created files, directories, and devices (create, + mkdir, mknod) which will result in the server setting the + uid and gid to the default (usually the server uid of the + user who mounted the share). Letting the server (rather than + the client) set the uid and gid is the default. If the CIFS + Unix Extensions are not negotiated then the uid and gid for + new files will appear to be the uid (gid) of the mounter or the + uid (gid) parameter specified on the mount. + netbiosname When mounting to servers via port 139, specifies the RFC1001 + source name to use to represent the client netbios machine + name when doing the RFC1001 netbios session initialize. + direct Do not do inode data caching on files opened on this mount. + This precludes mmapping files on this mount. In some cases + with fast networks and little or no caching benefits on the + client (e.g. when the application is doing large sequential + reads bigger than page size without rereading the same data) + this can provide better performance than the default + behavior which caches reads (readahead) and writes + (writebehind) through the local Linux client pagecache + if oplock (caching token) is granted and held. Note that + direct allows write operations larger than page size + to be sent to the server. + strictcache Use for switching on strict cache mode. In this mode the + client read from the cache all the time it has Oplock Level II, + otherwise - read from the server. All written data are stored + in the cache, but if the client doesn't have Exclusive Oplock, + it writes the data to the server. + rwpidforward Forward pid of a process who opened a file to any read or write + operation on that file. This prevent applications like WINE + from failing on read and write if we use mandatory brlock style. + acl Allow setfacl and getfacl to manage posix ACLs if server + supports them. (default) + noacl Do not allow setfacl and getfacl calls on this mount + user_xattr Allow getting and setting user xattrs (those attributes whose + name begins with "user." or "os2.") as OS/2 EAs (extended + attributes) to the server. This allows support of the + setfattr and getfattr utilities. (default) + nouser_xattr Do not allow getfattr/setfattr to get/set/list xattrs + mapchars Translate six of the seven reserved characters (not backslash) + *?<>|: + to the remap range (above 0xF000), which also + allows the CIFS client to recognize files created with + such characters by Windows's POSIX emulation. This can + also be useful when mounting to most versions of Samba + (which also forbids creating and opening files + whose names contain any of these seven characters). + This has no effect if the server does not support + Unicode on the wire. + nomapchars Do not translate any of these seven characters (default). + nocase Request case insensitive path name matching (case + sensitive is the default if the server supports it). + (mount option "ignorecase" is identical to "nocase") + posixpaths If CIFS Unix extensions are supported, attempt to + negotiate posix path name support which allows certain + characters forbidden in typical CIFS filenames, without + requiring remapping. (default) + noposixpaths If CIFS Unix extensions are supported, do not request + posix path name support (this may cause servers to + reject creatingfile with certain reserved characters). + nounix Disable the CIFS Unix Extensions for this mount (tree + connection). This is rarely needed, but it may be useful + in order to turn off multiple settings all at once (ie + posix acls, posix locks, posix paths, symlink support + and retrieving uids/gids/mode from the server) or to + work around a bug in server which implement the Unix + Extensions. + nobrl Do not send byte range lock requests to the server. + This is necessary for certain applications that break + with cifs style mandatory byte range locks (and most + cifs servers do not yet support requesting advisory + byte range locks). + forcemandatorylock Even if the server supports posix (advisory) byte range + locking, send only mandatory lock requests. For some + (presumably rare) applications, originally coded for + DOS/Windows, which require Windows style mandatory byte range + locking, they may be able to take advantage of this option, + forcing the cifs client to only send mandatory locks + even if the cifs server would support posix advisory locks. + "forcemand" is accepted as a shorter form of this mount + option. + nostrictsync If this mount option is set, when an application does an + fsync call then the cifs client does not send an SMB Flush + to the server (to force the server to write all dirty data + for this file immediately to disk), although cifs still sends + all dirty (cached) file data to the server and waits for the + server to respond to the write. Since SMB Flush can be + very slow, and some servers may be reliable enough (to risk + delaying slightly flushing the data to disk on the server), + turning on this option may be useful to improve performance for + applications that fsync too much, at a small risk of server + crash. If this mount option is not set, by default cifs will + send an SMB flush request (and wait for a response) on every + fsync call. + nodfs Disable DFS (global name space support) even if the + server claims to support it. This can help work around + a problem with parsing of DFS paths with Samba server + versions 3.0.24 and 3.0.25. + remount remount the share (often used to change from ro to rw mounts + or vice versa) + cifsacl Report mode bits (e.g. on stat) based on the Windows ACL for + the file. (EXPERIMENTAL) + servern Specify the server 's netbios name (RFC1001 name) to use + when attempting to setup a session to the server. + This is needed for mounting to some older servers (such + as OS/2 or Windows 98 and Windows ME) since they do not + support a default server name. A server name can be up + to 15 characters long and is usually uppercased. + sfu When the CIFS Unix Extensions are not negotiated, attempt to + create device files and fifos in a format compatible with + Services for Unix (SFU). In addition retrieve bits 10-12 + of the mode via the SETFILEBITS extended attribute (as + SFU does). In the future the bottom 9 bits of the + mode also will be emulated using queries of the security + descriptor (ACL). + mfsymlinks Enable support for Minshall+French symlinks + (see http://wiki.samba.org/index.php/UNIX_Extensions#Minshall.2BFrench_symlinks) + This option is ignored when specified together with the + 'sfu' option. Minshall+French symlinks are used even if + the server supports the CIFS Unix Extensions. + sign Must use packet signing (helps avoid unwanted data modification + by intermediate systems in the route). Note that signing + does not work with lanman or plaintext authentication. + seal Must seal (encrypt) all data on this mounted share before + sending on the network. Requires support for Unix Extensions. + Note that this differs from the sign mount option in that it + causes encryption of data sent over this mounted share but other + shares mounted to the same server are unaffected. + locallease This option is rarely needed. Fcntl F_SETLEASE is + used by some applications such as Samba and NFSv4 server to + check to see whether a file is cacheable. CIFS has no way + to explicitly request a lease, but can check whether a file + is cacheable (oplocked). Unfortunately, even if a file + is not oplocked, it could still be cacheable (ie cifs client + could grant fcntl leases if no other local processes are using + the file) for cases for example such as when the server does not + support oplocks and the user is sure that the only updates to + the file will be from this client. Specifying this mount option + will allow the cifs client to check for leases (only) locally + for files which are not oplocked instead of denying leases + in that case. (EXPERIMENTAL) + sec Security mode. Allowed values are: + none attempt to connection as a null user (no name) + krb5 Use Kerberos version 5 authentication + krb5i Use Kerberos authentication and packet signing + ntlm Use NTLM password hashing (default) + ntlmi Use NTLM password hashing with signing (if + /proc/fs/cifs/PacketSigningEnabled on or if + server requires signing also can be the default) + ntlmv2 Use NTLMv2 password hashing + ntlmv2i Use NTLMv2 password hashing with packet signing + lanman (if configured in kernel config) use older + lanman hash +hard Retry file operations if server is not responding +soft Limit retries to unresponsive servers (usually only + one retry) before returning an error. (default) + +The mount.cifs mount helper also accepts a few mount options before -o +including: + + -S take password from stdin (equivalent to setting the environment + variable "PASSWD_FD=0" + -V print mount.cifs version + -? display simple usage information + +With most 2.6 kernel versions of modutils, the version of the cifs kernel +module can be displayed via modinfo. + +Misc /proc/fs/cifs Flags and Debug Info +======================================= +Informational pseudo-files: +DebugData Displays information about active CIFS sessions and + shares, features enabled as well as the cifs.ko + version. +Stats Lists summary resource usage information as well as per + share statistics, if CONFIG_CIFS_STATS in enabled + in the kernel configuration. + +Configuration pseudo-files: +PacketSigningEnabled If set to one, cifs packet signing is enabled + and will be used if the server requires + it. If set to two, cifs packet signing is + required even if the server considers packet + signing optional. (default 1) +SecurityFlags Flags which control security negotiation and + also packet signing. Authentication (may/must) + flags (e.g. for NTLM and/or NTLMv2) may be combined with + the signing flags. Specifying two different password + hashing mechanisms (as "must use") on the other hand + does not make much sense. Default flags are + 0x07007 + (NTLM, NTLMv2 and packet signing allowed). The maximum + allowable flags if you want to allow mounts to servers + using weaker password hashes is 0x37037 (lanman, + plaintext, ntlm, ntlmv2, signing allowed). Some + SecurityFlags require the corresponding menuconfig + options to be enabled (lanman and plaintext require + CONFIG_CIFS_WEAK_PW_HASH for example). Enabling + plaintext authentication currently requires also + enabling lanman authentication in the security flags + because the cifs module only supports sending + laintext passwords using the older lanman dialect + form of the session setup SMB. (e.g. for authentication + using plain text passwords, set the SecurityFlags + to 0x30030): + + may use packet signing 0x00001 + must use packet signing 0x01001 + may use NTLM (most common password hash) 0x00002 + must use NTLM 0x02002 + may use NTLMv2 0x00004 + must use NTLMv2 0x04004 + may use Kerberos security 0x00008 + must use Kerberos 0x08008 + may use lanman (weak) password hash 0x00010 + must use lanman password hash 0x10010 + may use plaintext passwords 0x00020 + must use plaintext passwords 0x20020 + (reserved for future packet encryption) 0x00040 + +cifsFYI If set to non-zero value, additional debug information + will be logged to the system error log. This field + contains three flags controlling different classes of + debugging entries. The maximum value it can be set + to is 7 which enables all debugging points (default 0). + Some debugging statements are not compiled into the + cifs kernel unless CONFIG_CIFS_DEBUG2 is enabled in the + kernel configuration. cifsFYI may be set to one or + nore of the following flags (7 sets them all): + + log cifs informational messages 0x01 + log return codes from cifs entry points 0x02 + log slow responses (ie which take longer than 1 second) + CONFIG_CIFS_STATS2 must be enabled in .config 0x04 + + +traceSMB If set to one, debug information is logged to the + system error log with the start of smb requests + and responses (default 0) +LookupCacheEnable If set to one, inode information is kept cached + for one second improving performance of lookups + (default 1) +OplockEnabled If set to one, safe distributed caching enabled. + (default 1) +LinuxExtensionsEnabled If set to one then the client will attempt to + use the CIFS "UNIX" extensions which are optional + protocol enhancements that allow CIFS servers + to return accurate UID/GID information as well + as support symbolic links. If you use servers + such as Samba that support the CIFS Unix + extensions but do not want to use symbolic link + support and want to map the uid and gid fields + to values supplied at mount (rather than the + actual values, then set this to zero. (default 1) + +These experimental features and tracing can be enabled by changing flags in +/proc/fs/cifs (after the cifs module has been installed or built into the +kernel, e.g. insmod cifs). To enable a feature set it to 1 e.g. to enable +tracing to the kernel message log type: + + echo 7 > /proc/fs/cifs/cifsFYI + +cifsFYI functions as a bit mask. Setting it to 1 enables additional kernel +logging of various informational messages. 2 enables logging of non-zero +SMB return codes while 4 enables logging of requests that take longer +than one second to complete (except for byte range lock requests). +Setting it to 4 requires defining CONFIG_CIFS_STATS2 manually in the +source code (typically by setting it in the beginning of cifsglob.h), +and setting it to seven enables all three. Finally, tracing +the start of smb requests and responses can be enabled via: + + echo 1 > /proc/fs/cifs/traceSMB + +Per share (per client mount) statistics are available in /proc/fs/cifs/Stats +if the kernel was configured with cifs statistics enabled. The statistics +represent the number of successful (ie non-zero return code from the server) +SMB responses to some of the more common commands (open, delete, mkdir etc.). +Also recorded is the total bytes read and bytes written to the server for +that share. Note that due to client caching effects this can be less than the +number of bytes read and written by the application running on the client. +The statistics for the number of total SMBs and oplock breaks are different in +that they represent all for that share, not just those for which the server +returned success. + +Also note that "cat /proc/fs/cifs/DebugData" will display information about +the active sessions and the shares that are mounted. + +Enabling Kerberos (extended security) works but requires version 1.2 or later +of the helper program cifs.upcall to be present and to be configured in the +/etc/request-key.conf file. The cifs.upcall helper program is from the Samba +project(http://www.samba.org). NTLM and NTLMv2 and LANMAN support do not +require this helper. Note that NTLMv2 security (which does not require the +cifs.upcall helper program), instead of using Kerberos, is sufficient for +some use cases. + +DFS support allows transparent redirection to shares in an MS-DFS name space. +In addition, DFS support for target shares which are specified as UNC +names which begin with host names (rather than IP addresses) requires +a user space helper (such as cifs.upcall) to be present in order to +translate host names to ip address, and the user space helper must also +be configured in the file /etc/request-key.conf. Samba, Windows servers and +many NAS appliances support DFS as a way of constructing a global name +space to ease network configuration and improve reliability. + +To use cifs Kerberos and DFS support, the Linux keyutils package should be +installed and something like the following lines should be added to the +/etc/request-key.conf file: + +create cifs.spnego * * /usr/local/sbin/cifs.upcall %k +create dns_resolver * * /usr/local/sbin/cifs.upcall %k + +CIFS kernel module parameters +============================= +These module parameters can be specified or modified either during the time of +module loading or during the runtime by using the interface + /proc/module/cifs/parameters/<param> + +i.e. echo "value" > /sys/module/cifs/parameters/<param> + +1. enable_oplocks - Enable or disable oplocks. Oplocks are enabled by default. + [Y/y/1]. To disable use any of [N/n/0]. + diff --git a/Documentation/filesystems/cifs/TODO b/Documentation/filesystems/cifs/TODO new file mode 100644 index 00000000000..355abcdcda9 --- /dev/null +++ b/Documentation/filesystems/cifs/TODO @@ -0,0 +1,129 @@ +Version 1.53 May 20, 2008 + +A Partial List of Missing Features +================================== + +Contributions are welcome. There are plenty of opportunities +for visible, important contributions to this module. Here +is a partial list of the known problems and missing features: + +a) Support for SecurityDescriptors(Windows/CIFS ACLs) for chmod/chgrp/chown +so that these operations can be supported to Windows servers + +b) Mapping POSIX ACLs (and eventually NFSv4 ACLs) to CIFS +SecurityDescriptors + +c) Better pam/winbind integration (e.g. to handle uid mapping +better) + +d) Cleanup now unneeded SessSetup code in +fs/cifs/connect.c and add back in NTLMSSP code if any servers +need it + +e) fix NTLMv2 signing when two mounts with different users to same +server. + +f) Directory entry caching relies on a 1 second timer, rather than +using FindNotify or equivalent. - (started) + +g) quota support (needs minor kernel change since quota calls +to make it to network filesystems or deviceless filesystems) + +h) investigate sync behavior (including syncpage) and check +for proper behavior of intr/nointr + +i) improve support for very old servers (OS/2 and Win9x for example) +Including support for changing the time remotely (utimes command). + +j) hook lower into the sockets api (as NFS/SunRPC does) to avoid the +extra copy in/out of the socket buffers in some cases. + +k) Better optimize open (and pathbased setfilesize) to reduce the +oplock breaks coming from windows srv. Piggyback identical file +opens on top of each other by incrementing reference count rather +than resending (helps reduce server resource utilization and avoid +spurious oplock breaks). + +l) Improve performance of readpages by sending more than one read +at a time when 8 pages or more are requested. In conjuntion +add support for async_cifs_readpages. + +m) Add support for storing symlink info to Windows servers +in the Extended Attribute format their SFU clients would recognize. + +n) Finish fcntl D_NOTIFY support so kde and gnome file list windows +will autorefresh (partially complete by Asser). Needs minor kernel +vfs change to support removing D_NOTIFY on a file. + +o) Add GUI tool to configure /proc/fs/cifs settings and for display of +the CIFS statistics (started) + +p) implement support for security and trusted categories of xattrs +(requires minor protocol extension) to enable better support for SELINUX + +q) Implement O_DIRECT flag on open (already supported on mount) + +r) Create UID mapping facility so server UIDs can be mapped on a per +mount or a per server basis to client UIDs or nobody if no mapping +exists. This is helpful when Unix extensions are negotiated to +allow better permission checking when UIDs differ on the server +and client. Add new protocol request to the CIFS protocol +standard for asking the server for the corresponding name of a +particular uid. + +s) Add support for CIFS Unix and also the newer POSIX extensions to the +server side for Samba 4. + +t) In support for OS/2 (LANMAN 1.2 and LANMAN2.1 based SMB servers) +need to add ability to set time to server (utimes command) + +u) DOS attrs - returned as pseudo-xattr in Samba format (check VFAT and NTFS for this too) + +v) mount check for unmatched uids + +w) Add support for new vfs entry point for fallocate + +x) Fix Samba 3 server to handle Linux kernel aio so dbench with lots of +processes can proceed better in parallel (on the server) + +y) Fix Samba 3 to handle reads/writes over 127K (and remove the cifs mount +restriction of wsize max being 127K) + +KNOWN BUGS (updated April 24, 2007) +==================================== +See http://bugzilla.samba.org - search on product "CifsVFS" for +current bug list. + +1) existing symbolic links (Windows reparse points) are recognized but +can not be created remotely. They are implemented for Samba and those that +support the CIFS Unix extensions, although earlier versions of Samba +overly restrict the pathnames. +2) follow_link and readdir code does not follow dfs junctions +but recognizes them +3) create of new files to FAT partitions on Windows servers can +succeed but still return access denied (appears to be Windows +server not cifs client problem) and has not been reproduced recently. +NTFS partitions do not have this problem. +4) Unix/POSIX capabilities are reset after reconnection, and affect +a few fields in the tree connection but we do do not know which +superblocks to apply these changes to. We should probably walk +the list of superblocks to set these. Also need to check the +flags on the second mount to the same share, and see if we +can do the same trick that NFS does to remount duplicate shares. + +Misc testing to do +================== +1) check out max path names and max path name components against various server +types. Try nested symlinks (8 deep). Return max path name in stat -f information + +2) Modify file portion of ltp so it can run against a mounted network +share and run it against cifs vfs in automated fashion. + +3) Additional performance testing and optimization using iozone and similar - +there are some easy changes that can be done to parallelize sequential writes, +and when signing is disabled to request larger read sizes (larger than +negotiated size) and send larger write sizes to modern servers. + +4) More exhaustively test against less common servers. More testing +against Windows 9x, Windows ME servers. + diff --git a/Documentation/filesystems/cifs/cifs.txt b/Documentation/filesystems/cifs/cifs.txt new file mode 100644 index 00000000000..2fac91ac96c --- /dev/null +++ b/Documentation/filesystems/cifs/cifs.txt @@ -0,0 +1,31 @@ + This is the client VFS module for the Common Internet File System + (CIFS) protocol which is the successor to the Server Message Block + (SMB) protocol, the native file sharing mechanism for most early + PC operating systems. New and improved versions of CIFS are now + called SMB2 and SMB3. These dialects are also supported by the + CIFS VFS module. CIFS is fully supported by network + file servers such as Windows 2000, 2003, 2008 and 2012 + as well by Samba (which provides excellent CIFS + server support for Linux and many other operating systems), so + this network filesystem client can mount to a wide variety of + servers. + + The intent of this module is to provide the most advanced network + file system function for CIFS compliant servers, including better + POSIX compliance, secure per-user session establishment, high + performance safe distributed caching (oplock), optional packet + signing, large files, Unicode support and other internationalization + improvements. Since both Samba server and this filesystem client support + the CIFS Unix extensions, the combination can provide a reasonable + alternative to NFSv4 for fileserving in some Linux to Linux environments, + not just in Linux to Windows environments. + + This filesystem has an mount utility (mount.cifs) that can be obtained from + + https://ftp.samba.org/pub/linux-cifs/cifs-utils/ + + It must be installed in the directory with the other mount helpers. + + For more information on the module see the project wiki page at + + https://wiki.samba.org/index.php/LinuxCIFS_utils diff --git a/Documentation/filesystems/cifs/winucase_convert.pl b/Documentation/filesystems/cifs/winucase_convert.pl new file mode 100755 index 00000000000..322a9c833f2 --- /dev/null +++ b/Documentation/filesystems/cifs/winucase_convert.pl @@ -0,0 +1,62 @@ +#!/usr/bin/perl -w +# +# winucase_convert.pl -- convert "Windows 8 Upper Case Mapping Table.txt" to +# a two-level set of C arrays. +# +# Copyright 2013: Jeff Layton <jlayton@redhat.com> +# +# This program is free software: you can redistribute it and/or modify +# it under the terms of the GNU General Public License as published by +# the Free Software Foundation, either version 3 of the License, or +# (at your option) any later version. +# +# This program is distributed in the hope that it will be useful, +# but WITHOUT ANY WARRANTY; without even the implied warranty of +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +# GNU General Public License for more details. +# +# You should have received a copy of the GNU General Public License +# along with this program. If not, see <http://www.gnu.org/licenses/>. +# + +while(<>) { + next if (!/^0x(..)(..)\t0x(....)\t/); + $firstchar = hex($1); + $secondchar = hex($2); + $uppercase = hex($3); + + $top[$firstchar][$secondchar] = $uppercase; +} + +for ($i = 0; $i < 256; $i++) { + next if (!$top[$i]); + + printf("static const wchar_t t2_%2.2x[256] = {", $i); + for ($j = 0; $j < 256; $j++) { + if (($j % 8) == 0) { + print "\n\t"; + } else { + print " "; + } + printf("0x%4.4x,", $top[$i][$j] ? $top[$i][$j] : 0); + } + print "\n};\n\n"; +} + +printf("static const wchar_t *const toplevel[256] = {", $i); +for ($i = 0; $i < 256; $i++) { + if (($i % 8) == 0) { + print "\n\t"; + } elsif ($top[$i]) { + print " "; + } else { + print " "; + } + + if ($top[$i]) { + printf("t2_%2.2x,", $i); + } else { + print "NULL,"; + } +} +print "\n};\n\n"; diff --git a/Documentation/filesystems/configfs/Makefile b/Documentation/filesystems/configfs/Makefile new file mode 100644 index 00000000000..be7ec5e67db --- /dev/null +++ b/Documentation/filesystems/configfs/Makefile @@ -0,0 +1,3 @@ +ifneq ($(CONFIG_CONFIGFS_FS),) +obj-m += configfs_example_explicit.o configfs_example_macros.o +endif diff --git a/Documentation/filesystems/configfs/configfs.txt b/Documentation/filesystems/configfs/configfs.txt new file mode 100644 index 00000000000..b40fec9d3f5 --- /dev/null +++ b/Documentation/filesystems/configfs/configfs.txt @@ -0,0 +1,484 @@ + +configfs - Userspace-driven kernel object configuration. + +Joel Becker <joel.becker@oracle.com> + +Updated: 31 March 2005 + +Copyright (c) 2005 Oracle Corporation, + Joel Becker <joel.becker@oracle.com> + + +[What is configfs?] + +configfs is a ram-based filesystem that provides the converse of +sysfs's functionality. Where sysfs is a filesystem-based view of +kernel objects, configfs is a filesystem-based manager of kernel +objects, or config_items. + +With sysfs, an object is created in kernel (for example, when a device +is discovered) and it is registered with sysfs. Its attributes then +appear in sysfs, allowing userspace to read the attributes via +readdir(3)/read(2). It may allow some attributes to be modified via +write(2). The important point is that the object is created and +destroyed in kernel, the kernel controls the lifecycle of the sysfs +representation, and sysfs is merely a window on all this. + +A configfs config_item is created via an explicit userspace operation: +mkdir(2). It is destroyed via rmdir(2). The attributes appear at +mkdir(2) time, and can be read or modified via read(2) and write(2). +As with sysfs, readdir(3) queries the list of items and/or attributes. +symlink(2) can be used to group items together. Unlike sysfs, the +lifetime of the representation is completely driven by userspace. The +kernel modules backing the items must respond to this. + +Both sysfs and configfs can and should exist together on the same +system. One is not a replacement for the other. + +[Using configfs] + +configfs can be compiled as a module or into the kernel. You can access +it by doing + + mount -t configfs none /config + +The configfs tree will be empty unless client modules are also loaded. +These are modules that register their item types with configfs as +subsystems. Once a client subsystem is loaded, it will appear as a +subdirectory (or more than one) under /config. Like sysfs, the +configfs tree is always there, whether mounted on /config or not. + +An item is created via mkdir(2). The item's attributes will also +appear at this time. readdir(3) can determine what the attributes are, +read(2) can query their default values, and write(2) can store new +values. Like sysfs, attributes should be ASCII text files, preferably +with only one value per file. The same efficiency caveats from sysfs +apply. Don't mix more than one attribute in one attribute file. + +Like sysfs, configfs expects write(2) to store the entire buffer at +once. When writing to configfs attributes, userspace processes should +first read the entire file, modify the portions they wish to change, and +then write the entire buffer back. Attribute files have a maximum size +of one page (PAGE_SIZE, 4096 on i386). + +When an item needs to be destroyed, remove it with rmdir(2). An +item cannot be destroyed if any other item has a link to it (via +symlink(2)). Links can be removed via unlink(2). + +[Configuring FakeNBD: an Example] + +Imagine there's a Network Block Device (NBD) driver that allows you to +access remote block devices. Call it FakeNBD. FakeNBD uses configfs +for its configuration. Obviously, there will be a nice program that +sysadmins use to configure FakeNBD, but somehow that program has to tell +the driver about it. Here's where configfs comes in. + +When the FakeNBD driver is loaded, it registers itself with configfs. +readdir(3) sees this just fine: + + # ls /config + fakenbd + +A fakenbd connection can be created with mkdir(2). The name is +arbitrary, but likely the tool will make some use of the name. Perhaps +it is a uuid or a disk name: + + # mkdir /config/fakenbd/disk1 + # ls /config/fakenbd/disk1 + target device rw + +The target attribute contains the IP address of the server FakeNBD will +connect to. The device attribute is the device on the server. +Predictably, the rw attribute determines whether the connection is +read-only or read-write. + + # echo 10.0.0.1 > /config/fakenbd/disk1/target + # echo /dev/sda1 > /config/fakenbd/disk1/device + # echo 1 > /config/fakenbd/disk1/rw + +That's it. That's all there is. Now the device is configured, via the +shell no less. + +[Coding With configfs] + +Every object in configfs is a config_item. A config_item reflects an +object in the subsystem. It has attributes that match values on that +object. configfs handles the filesystem representation of that object +and its attributes, allowing the subsystem to ignore all but the +basic show/store interaction. + +Items are created and destroyed inside a config_group. A group is a +collection of items that share the same attributes and operations. +Items are created by mkdir(2) and removed by rmdir(2), but configfs +handles that. The group has a set of operations to perform these tasks + +A subsystem is the top level of a client module. During initialization, +the client module registers the subsystem with configfs, the subsystem +appears as a directory at the top of the configfs filesystem. A +subsystem is also a config_group, and can do everything a config_group +can. + +[struct config_item] + + struct config_item { + char *ci_name; + char ci_namebuf[UOBJ_NAME_LEN]; + struct kref ci_kref; + struct list_head ci_entry; + struct config_item *ci_parent; + struct config_group *ci_group; + struct config_item_type *ci_type; + struct dentry *ci_dentry; + }; + + void config_item_init(struct config_item *); + void config_item_init_type_name(struct config_item *, + const char *name, + struct config_item_type *type); + struct config_item *config_item_get(struct config_item *); + void config_item_put(struct config_item *); + +Generally, struct config_item is embedded in a container structure, a +structure that actually represents what the subsystem is doing. The +config_item portion of that structure is how the object interacts with +configfs. + +Whether statically defined in a source file or created by a parent +config_group, a config_item must have one of the _init() functions +called on it. This initializes the reference count and sets up the +appropriate fields. + +All users of a config_item should have a reference on it via +config_item_get(), and drop the reference when they are done via +config_item_put(). + +By itself, a config_item cannot do much more than appear in configfs. +Usually a subsystem wants the item to display and/or store attributes, +among other things. For that, it needs a type. + +[struct config_item_type] + + struct configfs_item_operations { + void (*release)(struct config_item *); + ssize_t (*show_attribute)(struct config_item *, + struct configfs_attribute *, + char *); + ssize_t (*store_attribute)(struct config_item *, + struct configfs_attribute *, + const char *, size_t); + int (*allow_link)(struct config_item *src, + struct config_item *target); + int (*drop_link)(struct config_item *src, + struct config_item *target); + }; + + struct config_item_type { + struct module *ct_owner; + struct configfs_item_operations *ct_item_ops; + struct configfs_group_operations *ct_group_ops; + struct configfs_attribute **ct_attrs; + }; + +The most basic function of a config_item_type is to define what +operations can be performed on a config_item. All items that have been +allocated dynamically will need to provide the ct_item_ops->release() +method. This method is called when the config_item's reference count +reaches zero. Items that wish to display an attribute need to provide +the ct_item_ops->show_attribute() method. Similarly, storing a new +attribute value uses the store_attribute() method. + +[struct configfs_attribute] + + struct configfs_attribute { + char *ca_name; + struct module *ca_owner; + umode_t ca_mode; + }; + +When a config_item wants an attribute to appear as a file in the item's +configfs directory, it must define a configfs_attribute describing it. +It then adds the attribute to the NULL-terminated array +config_item_type->ct_attrs. When the item appears in configfs, the +attribute file will appear with the configfs_attribute->ca_name +filename. configfs_attribute->ca_mode specifies the file permissions. + +If an attribute is readable and the config_item provides a +ct_item_ops->show_attribute() method, that method will be called +whenever userspace asks for a read(2) on the attribute. The converse +will happen for write(2). + +[struct config_group] + +A config_item cannot live in a vacuum. The only way one can be created +is via mkdir(2) on a config_group. This will trigger creation of a +child item. + + struct config_group { + struct config_item cg_item; + struct list_head cg_children; + struct configfs_subsystem *cg_subsys; + struct config_group **default_groups; + }; + + void config_group_init(struct config_group *group); + void config_group_init_type_name(struct config_group *group, + const char *name, + struct config_item_type *type); + + +The config_group structure contains a config_item. Properly configuring +that item means that a group can behave as an item in its own right. +However, it can do more: it can create child items or groups. This is +accomplished via the group operations specified on the group's +config_item_type. + + struct configfs_group_operations { + struct config_item *(*make_item)(struct config_group *group, + const char *name); + struct config_group *(*make_group)(struct config_group *group, + const char *name); + int (*commit_item)(struct config_item *item); + void (*disconnect_notify)(struct config_group *group, + struct config_item *item); + void (*drop_item)(struct config_group *group, + struct config_item *item); + }; + +A group creates child items by providing the +ct_group_ops->make_item() method. If provided, this method is called from mkdir(2) in the group's directory. The subsystem allocates a new +config_item (or more likely, its container structure), initializes it, +and returns it to configfs. Configfs will then populate the filesystem +tree to reflect the new item. + +If the subsystem wants the child to be a group itself, the subsystem +provides ct_group_ops->make_group(). Everything else behaves the same, +using the group _init() functions on the group. + +Finally, when userspace calls rmdir(2) on the item or group, +ct_group_ops->drop_item() is called. As a config_group is also a +config_item, it is not necessary for a separate drop_group() method. +The subsystem must config_item_put() the reference that was initialized +upon item allocation. If a subsystem has no work to do, it may omit +the ct_group_ops->drop_item() method, and configfs will call +config_item_put() on the item on behalf of the subsystem. + +IMPORTANT: drop_item() is void, and as such cannot fail. When rmdir(2) +is called, configfs WILL remove the item from the filesystem tree +(assuming that it has no children to keep it busy). The subsystem is +responsible for responding to this. If the subsystem has references to +the item in other threads, the memory is safe. It may take some time +for the item to actually disappear from the subsystem's usage. But it +is gone from configfs. + +When drop_item() is called, the item's linkage has already been torn +down. It no longer has a reference on its parent and has no place in +the item hierarchy. If a client needs to do some cleanup before this +teardown happens, the subsystem can implement the +ct_group_ops->disconnect_notify() method. The method is called after +configfs has removed the item from the filesystem view but before the +item is removed from its parent group. Like drop_item(), +disconnect_notify() is void and cannot fail. Client subsystems should +not drop any references here, as they still must do it in drop_item(). + +A config_group cannot be removed while it still has child items. This +is implemented in the configfs rmdir(2) code. ->drop_item() will not be +called, as the item has not been dropped. rmdir(2) will fail, as the +directory is not empty. + +[struct configfs_subsystem] + +A subsystem must register itself, usually at module_init time. This +tells configfs to make the subsystem appear in the file tree. + + struct configfs_subsystem { + struct config_group su_group; + struct mutex su_mutex; + }; + + int configfs_register_subsystem(struct configfs_subsystem *subsys); + void configfs_unregister_subsystem(struct configfs_subsystem *subsys); + + A subsystem consists of a toplevel config_group and a mutex. +The group is where child config_items are created. For a subsystem, +this group is usually defined statically. Before calling +configfs_register_subsystem(), the subsystem must have initialized the +group via the usual group _init() functions, and it must also have +initialized the mutex. + When the register call returns, the subsystem is live, and it +will be visible via configfs. At that point, mkdir(2) can be called and +the subsystem must be ready for it. + +[An Example] + +The best example of these basic concepts is the simple_children +subsystem/group and the simple_child item in configfs_example_explicit.c +and configfs_example_macros.c. It shows a trivial object displaying and +storing an attribute, and a simple group creating and destroying these +children. + +The only difference between configfs_example_explicit.c and +configfs_example_macros.c is how the attributes of the childless item +are defined. The childless item has extended attributes, each with +their own show()/store() operation. This follows a convention commonly +used in sysfs. configfs_example_explicit.c creates these attributes +by explicitly defining the structures involved. Conversely +configfs_example_macros.c uses some convenience macros from configfs.h +to define the attributes. These macros are similar to their sysfs +counterparts. + +[Hierarchy Navigation and the Subsystem Mutex] + +There is an extra bonus that configfs provides. The config_groups and +config_items are arranged in a hierarchy due to the fact that they +appear in a filesystem. A subsystem is NEVER to touch the filesystem +parts, but the subsystem might be interested in this hierarchy. For +this reason, the hierarchy is mirrored via the config_group->cg_children +and config_item->ci_parent structure members. + +A subsystem can navigate the cg_children list and the ci_parent pointer +to see the tree created by the subsystem. This can race with configfs' +management of the hierarchy, so configfs uses the subsystem mutex to +protect modifications. Whenever a subsystem wants to navigate the +hierarchy, it must do so under the protection of the subsystem +mutex. + +A subsystem will be prevented from acquiring the mutex while a newly +allocated item has not been linked into this hierarchy. Similarly, it +will not be able to acquire the mutex while a dropping item has not +yet been unlinked. This means that an item's ci_parent pointer will +never be NULL while the item is in configfs, and that an item will only +be in its parent's cg_children list for the same duration. This allows +a subsystem to trust ci_parent and cg_children while they hold the +mutex. + +[Item Aggregation Via symlink(2)] + +configfs provides a simple group via the group->item parent/child +relationship. Often, however, a larger environment requires aggregation +outside of the parent/child connection. This is implemented via +symlink(2). + +A config_item may provide the ct_item_ops->allow_link() and +ct_item_ops->drop_link() methods. If the ->allow_link() method exists, +symlink(2) may be called with the config_item as the source of the link. +These links are only allowed between configfs config_items. Any +symlink(2) attempt outside the configfs filesystem will be denied. + +When symlink(2) is called, the source config_item's ->allow_link() +method is called with itself and a target item. If the source item +allows linking to target item, it returns 0. A source item may wish to +reject a link if it only wants links to a certain type of object (say, +in its own subsystem). + +When unlink(2) is called on the symbolic link, the source item is +notified via the ->drop_link() method. Like the ->drop_item() method, +this is a void function and cannot return failure. The subsystem is +responsible for responding to the change. + +A config_item cannot be removed while it links to any other item, nor +can it be removed while an item links to it. Dangling symlinks are not +allowed in configfs. + +[Automatically Created Subgroups] + +A new config_group may want to have two types of child config_items. +While this could be codified by magic names in ->make_item(), it is much +more explicit to have a method whereby userspace sees this divergence. + +Rather than have a group where some items behave differently than +others, configfs provides a method whereby one or many subgroups are +automatically created inside the parent at its creation. Thus, +mkdir("parent") results in "parent", "parent/subgroup1", up through +"parent/subgroupN". Items of type 1 can now be created in +"parent/subgroup1", and items of type N can be created in +"parent/subgroupN". + +These automatic subgroups, or default groups, do not preclude other +children of the parent group. If ct_group_ops->make_group() exists, +other child groups can be created on the parent group directly. + +A configfs subsystem specifies default groups by filling in the +NULL-terminated array default_groups on the config_group structure. +Each group in that array is populated in the configfs tree at the same +time as the parent group. Similarly, they are removed at the same time +as the parent. No extra notification is provided. When a ->drop_item() +method call notifies the subsystem the parent group is going away, it +also means every default group child associated with that parent group. + +As a consequence of this, default_groups cannot be removed directly via +rmdir(2). They also are not considered when rmdir(2) on the parent +group is checking for children. + +[Dependent Subsystems] + +Sometimes other drivers depend on particular configfs items. For +example, ocfs2 mounts depend on a heartbeat region item. If that +region item is removed with rmdir(2), the ocfs2 mount must BUG or go +readonly. Not happy. + +configfs provides two additional API calls: configfs_depend_item() and +configfs_undepend_item(). A client driver can call +configfs_depend_item() on an existing item to tell configfs that it is +depended on. configfs will then return -EBUSY from rmdir(2) for that +item. When the item is no longer depended on, the client driver calls +configfs_undepend_item() on it. + +These API cannot be called underneath any configfs callbacks, as +they will conflict. They can block and allocate. A client driver +probably shouldn't calling them of its own gumption. Rather it should +be providing an API that external subsystems call. + +How does this work? Imagine the ocfs2 mount process. When it mounts, +it asks for a heartbeat region item. This is done via a call into the +heartbeat code. Inside the heartbeat code, the region item is looked +up. Here, the heartbeat code calls configfs_depend_item(). If it +succeeds, then heartbeat knows the region is safe to give to ocfs2. +If it fails, it was being torn down anyway, and heartbeat can gracefully +pass up an error. + +[Committable Items] + +NOTE: Committable items are currently unimplemented. + +Some config_items cannot have a valid initial state. That is, no +default values can be specified for the item's attributes such that the +item can do its work. Userspace must configure one or more attributes, +after which the subsystem can start whatever entity this item +represents. + +Consider the FakeNBD device from above. Without a target address *and* +a target device, the subsystem has no idea what block device to import. +The simple example assumes that the subsystem merely waits until all the +appropriate attributes are configured, and then connects. This will, +indeed, work, but now every attribute store must check if the attributes +are initialized. Every attribute store must fire off the connection if +that condition is met. + +Far better would be an explicit action notifying the subsystem that the +config_item is ready to go. More importantly, an explicit action allows +the subsystem to provide feedback as to whether the attributes are +initialized in a way that makes sense. configfs provides this as +committable items. + +configfs still uses only normal filesystem operations. An item is +committed via rename(2). The item is moved from a directory where it +can be modified to a directory where it cannot. + +Any group that provides the ct_group_ops->commit_item() method has +committable items. When this group appears in configfs, mkdir(2) will +not work directly in the group. Instead, the group will have two +subdirectories: "live" and "pending". The "live" directory does not +support mkdir(2) or rmdir(2) either. It only allows rename(2). The +"pending" directory does allow mkdir(2) and rmdir(2). An item is +created in the "pending" directory. Its attributes can be modified at +will. Userspace commits the item by renaming it into the "live" +directory. At this point, the subsystem receives the ->commit_item() +callback. If all required attributes are filled to satisfaction, the +method returns zero and the item is moved to the "live" directory. + +As rmdir(2) does not work in the "live" directory, an item must be +shutdown, or "uncommitted". Again, this is done via rename(2), this +time from the "live" directory back to the "pending" one. The subsystem +is notified by the ct_group_ops->uncommit_object() method. + + diff --git a/Documentation/filesystems/configfs/configfs_example_explicit.c b/Documentation/filesystems/configfs/configfs_example_explicit.c new file mode 100644 index 00000000000..1420233dfa5 --- /dev/null +++ b/Documentation/filesystems/configfs/configfs_example_explicit.c @@ -0,0 +1,483 @@ +/* + * vim: noexpandtab ts=8 sts=0 sw=8: + * + * configfs_example_explicit.c - This file is a demonstration module + * containing a number of configfs subsystems. It explicitly defines + * each structure without using the helper macros defined in + * configfs.h. + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public + * License as published by the Free Software Foundation; either + * version 2 of the License, or (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * General Public License for more details. + * + * You should have received a copy of the GNU General Public + * License along with this program; if not, write to the + * Free Software Foundation, Inc., 59 Temple Place - Suite 330, + * Boston, MA 021110-1307, USA. + * + * Based on sysfs: + * sysfs is Copyright (C) 2001, 2002, 2003 Patrick Mochel + * + * configfs Copyright (C) 2005 Oracle. All rights reserved. + */ + +#include <linux/init.h> +#include <linux/module.h> +#include <linux/slab.h> + +#include <linux/configfs.h> + + + +/* + * 01-childless + * + * This first example is a childless subsystem. It cannot create + * any config_items. It just has attributes. + * + * Note that we are enclosing the configfs_subsystem inside a container. + * This is not necessary if a subsystem has no attributes directly + * on the subsystem. See the next example, 02-simple-children, for + * such a subsystem. + */ + +struct childless { + struct configfs_subsystem subsys; + int showme; + int storeme; +}; + +struct childless_attribute { + struct configfs_attribute attr; + ssize_t (*show)(struct childless *, char *); + ssize_t (*store)(struct childless *, const char *, size_t); +}; + +static inline struct childless *to_childless(struct config_item *item) +{ + return item ? container_of(to_configfs_subsystem(to_config_group(item)), struct childless, subsys) : NULL; +} + +static ssize_t childless_showme_read(struct childless *childless, + char *page) +{ + ssize_t pos; + + pos = sprintf(page, "%d\n", childless->showme); + childless->showme++; + + return pos; +} + +static ssize_t childless_storeme_read(struct childless *childless, + char *page) +{ + return sprintf(page, "%d\n", childless->storeme); +} + +static ssize_t childless_storeme_write(struct childless *childless, + const char *page, + size_t count) +{ + unsigned long tmp; + char *p = (char *) page; + + tmp = simple_strtoul(p, &p, 10); + if ((*p != '\0') && (*p != '\n')) + return -EINVAL; + + if (tmp > INT_MAX) + return -ERANGE; + + childless->storeme = tmp; + + return count; +} + +static ssize_t childless_description_read(struct childless *childless, + char *page) +{ + return sprintf(page, +"[01-childless]\n" +"\n" +"The childless subsystem is the simplest possible subsystem in\n" +"configfs. It does not support the creation of child config_items.\n" +"It only has a few attributes. In fact, it isn't much different\n" +"than a directory in /proc.\n"); +} + +static struct childless_attribute childless_attr_showme = { + .attr = { .ca_owner = THIS_MODULE, .ca_name = "showme", .ca_mode = S_IRUGO }, + .show = childless_showme_read, +}; +static struct childless_attribute childless_attr_storeme = { + .attr = { .ca_owner = THIS_MODULE, .ca_name = "storeme", .ca_mode = S_IRUGO | S_IWUSR }, + .show = childless_storeme_read, + .store = childless_storeme_write, +}; +static struct childless_attribute childless_attr_description = { + .attr = { .ca_owner = THIS_MODULE, .ca_name = "description", .ca_mode = S_IRUGO }, + .show = childless_description_read, +}; + +static struct configfs_attribute *childless_attrs[] = { + &childless_attr_showme.attr, + &childless_attr_storeme.attr, + &childless_attr_description.attr, + NULL, +}; + +static ssize_t childless_attr_show(struct config_item *item, + struct configfs_attribute *attr, + char *page) +{ + struct childless *childless = to_childless(item); + struct childless_attribute *childless_attr = + container_of(attr, struct childless_attribute, attr); + ssize_t ret = 0; + + if (childless_attr->show) + ret = childless_attr->show(childless, page); + return ret; +} + +static ssize_t childless_attr_store(struct config_item *item, + struct configfs_attribute *attr, + const char *page, size_t count) +{ + struct childless *childless = to_childless(item); + struct childless_attribute *childless_attr = + container_of(attr, struct childless_attribute, attr); + ssize_t ret = -EINVAL; + + if (childless_attr->store) + ret = childless_attr->store(childless, page, count); + return ret; +} + +static struct configfs_item_operations childless_item_ops = { + .show_attribute = childless_attr_show, + .store_attribute = childless_attr_store, +}; + +static struct config_item_type childless_type = { + .ct_item_ops = &childless_item_ops, + .ct_attrs = childless_attrs, + .ct_owner = THIS_MODULE, +}; + +static struct childless childless_subsys = { + .subsys = { + .su_group = { + .cg_item = { + .ci_namebuf = "01-childless", + .ci_type = &childless_type, + }, + }, + }, +}; + + +/* ----------------------------------------------------------------- */ + +/* + * 02-simple-children + * + * This example merely has a simple one-attribute child. Note that + * there is no extra attribute structure, as the child's attribute is + * known from the get-go. Also, there is no container for the + * subsystem, as it has no attributes of its own. + */ + +struct simple_child { + struct config_item item; + int storeme; +}; + +static inline struct simple_child *to_simple_child(struct config_item *item) +{ + return item ? container_of(item, struct simple_child, item) : NULL; +} + +static struct configfs_attribute simple_child_attr_storeme = { + .ca_owner = THIS_MODULE, + .ca_name = "storeme", + .ca_mode = S_IRUGO | S_IWUSR, +}; + +static struct configfs_attribute *simple_child_attrs[] = { + &simple_child_attr_storeme, + NULL, +}; + +static ssize_t simple_child_attr_show(struct config_item *item, + struct configfs_attribute *attr, + char *page) +{ + ssize_t count; + struct simple_child *simple_child = to_simple_child(item); + + count = sprintf(page, "%d\n", simple_child->storeme); + + return count; +} + +static ssize_t simple_child_attr_store(struct config_item *item, + struct configfs_attribute *attr, + const char *page, size_t count) +{ + struct simple_child *simple_child = to_simple_child(item); + unsigned long tmp; + char *p = (char *) page; + + tmp = simple_strtoul(p, &p, 10); + if (!p || (*p && (*p != '\n'))) + return -EINVAL; + + if (tmp > INT_MAX) + return -ERANGE; + + simple_child->storeme = tmp; + + return count; +} + +static void simple_child_release(struct config_item *item) +{ + kfree(to_simple_child(item)); +} + +static struct configfs_item_operations simple_child_item_ops = { + .release = simple_child_release, + .show_attribute = simple_child_attr_show, + .store_attribute = simple_child_attr_store, +}; + +static struct config_item_type simple_child_type = { + .ct_item_ops = &simple_child_item_ops, + .ct_attrs = simple_child_attrs, + .ct_owner = THIS_MODULE, +}; + + +struct simple_children { + struct config_group group; +}; + +static inline struct simple_children *to_simple_children(struct config_item *item) +{ + return item ? container_of(to_config_group(item), struct simple_children, group) : NULL; +} + +static struct config_item *simple_children_make_item(struct config_group *group, const char *name) +{ + struct simple_child *simple_child; + + simple_child = kzalloc(sizeof(struct simple_child), GFP_KERNEL); + if (!simple_child) + return ERR_PTR(-ENOMEM); + + config_item_init_type_name(&simple_child->item, name, + &simple_child_type); + + simple_child->storeme = 0; + + return &simple_child->item; +} + +static struct configfs_attribute simple_children_attr_description = { + .ca_owner = THIS_MODULE, + .ca_name = "description", + .ca_mode = S_IRUGO, +}; + +static struct configfs_attribute *simple_children_attrs[] = { + &simple_children_attr_description, + NULL, +}; + +static ssize_t simple_children_attr_show(struct config_item *item, + struct configfs_attribute *attr, + char *page) +{ + return sprintf(page, +"[02-simple-children]\n" +"\n" +"This subsystem allows the creation of child config_items. These\n" +"items have only one attribute that is readable and writeable.\n"); +} + +static void simple_children_release(struct config_item *item) +{ + kfree(to_simple_children(item)); +} + +static struct configfs_item_operations simple_children_item_ops = { + .release = simple_children_release, + .show_attribute = simple_children_attr_show, +}; + +/* + * Note that, since no extra work is required on ->drop_item(), + * no ->drop_item() is provided. + */ +static struct configfs_group_operations simple_children_group_ops = { + .make_item = simple_children_make_item, +}; + +static struct config_item_type simple_children_type = { + .ct_item_ops = &simple_children_item_ops, + .ct_group_ops = &simple_children_group_ops, + .ct_attrs = simple_children_attrs, + .ct_owner = THIS_MODULE, +}; + +static struct configfs_subsystem simple_children_subsys = { + .su_group = { + .cg_item = { + .ci_namebuf = "02-simple-children", + .ci_type = &simple_children_type, + }, + }, +}; + + +/* ----------------------------------------------------------------- */ + +/* + * 03-group-children + * + * This example reuses the simple_children group from above. However, + * the simple_children group is not the subsystem itself, it is a + * child of the subsystem. Creation of a group in the subsystem creates + * a new simple_children group. That group can then have simple_child + * children of its own. + */ + +static struct config_group *group_children_make_group(struct config_group *group, const char *name) +{ + struct simple_children *simple_children; + + simple_children = kzalloc(sizeof(struct simple_children), + GFP_KERNEL); + if (!simple_children) + return ERR_PTR(-ENOMEM); + + config_group_init_type_name(&simple_children->group, name, + &simple_children_type); + + return &simple_children->group; +} + +static struct configfs_attribute group_children_attr_description = { + .ca_owner = THIS_MODULE, + .ca_name = "description", + .ca_mode = S_IRUGO, +}; + +static struct configfs_attribute *group_children_attrs[] = { + &group_children_attr_description, + NULL, +}; + +static ssize_t group_children_attr_show(struct config_item *item, + struct configfs_attribute *attr, + char *page) +{ + return sprintf(page, +"[03-group-children]\n" +"\n" +"This subsystem allows the creation of child config_groups. These\n" +"groups are like the subsystem simple-children.\n"); +} + +static struct configfs_item_operations group_children_item_ops = { + .show_attribute = group_children_attr_show, +}; + +/* + * Note that, since no extra work is required on ->drop_item(), + * no ->drop_item() is provided. + */ +static struct configfs_group_operations group_children_group_ops = { + .make_group = group_children_make_group, +}; + +static struct config_item_type group_children_type = { + .ct_item_ops = &group_children_item_ops, + .ct_group_ops = &group_children_group_ops, + .ct_attrs = group_children_attrs, + .ct_owner = THIS_MODULE, +}; + +static struct configfs_subsystem group_children_subsys = { + .su_group = { + .cg_item = { + .ci_namebuf = "03-group-children", + .ci_type = &group_children_type, + }, + }, +}; + +/* ----------------------------------------------------------------- */ + +/* + * We're now done with our subsystem definitions. + * For convenience in this module, here's a list of them all. It + * allows the init function to easily register them. Most modules + * will only have one subsystem, and will only call register_subsystem + * on it directly. + */ +static struct configfs_subsystem *example_subsys[] = { + &childless_subsys.subsys, + &simple_children_subsys, + &group_children_subsys, + NULL, +}; + +static int __init configfs_example_init(void) +{ + int ret; + int i; + struct configfs_subsystem *subsys; + + for (i = 0; example_subsys[i]; i++) { + subsys = example_subsys[i]; + + config_group_init(&subsys->su_group); + mutex_init(&subsys->su_mutex); + ret = configfs_register_subsystem(subsys); + if (ret) { + printk(KERN_ERR "Error %d while registering subsystem %s\n", + ret, + subsys->su_group.cg_item.ci_namebuf); + goto out_unregister; + } + } + + return 0; + +out_unregister: + for (i--; i >= 0; i--) + configfs_unregister_subsystem(example_subsys[i]); + + return ret; +} + +static void __exit configfs_example_exit(void) +{ + int i; + + for (i = 0; example_subsys[i]; i++) + configfs_unregister_subsystem(example_subsys[i]); +} + +module_init(configfs_example_init); +module_exit(configfs_example_exit); +MODULE_LICENSE("GPL"); diff --git a/Documentation/filesystems/configfs/configfs_example_macros.c b/Documentation/filesystems/configfs/configfs_example_macros.c new file mode 100644 index 00000000000..327dfbc640a --- /dev/null +++ b/Documentation/filesystems/configfs/configfs_example_macros.c @@ -0,0 +1,446 @@ +/* + * vim: noexpandtab ts=8 sts=0 sw=8: + * + * configfs_example_macros.c - This file is a demonstration module + * containing a number of configfs subsystems. It uses the helper + * macros defined by configfs.h + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public + * License as published by the Free Software Foundation; either + * version 2 of the License, or (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + * General Public License for more details. + * + * You should have received a copy of the GNU General Public + * License along with this program; if not, write to the + * Free Software Foundation, Inc., 59 Temple Place - Suite 330, + * Boston, MA 021110-1307, USA. + * + * Based on sysfs: + * sysfs is Copyright (C) 2001, 2002, 2003 Patrick Mochel + * + * configfs Copyright (C) 2005 Oracle. All rights reserved. + */ + +#include <linux/init.h> +#include <linux/module.h> +#include <linux/slab.h> + +#include <linux/configfs.h> + + + +/* + * 01-childless + * + * This first example is a childless subsystem. It cannot create + * any config_items. It just has attributes. + * + * Note that we are enclosing the configfs_subsystem inside a container. + * This is not necessary if a subsystem has no attributes directly + * on the subsystem. See the next example, 02-simple-children, for + * such a subsystem. + */ + +struct childless { + struct configfs_subsystem subsys; + int showme; + int storeme; +}; + +static inline struct childless *to_childless(struct config_item *item) +{ + return item ? container_of(to_configfs_subsystem(to_config_group(item)), struct childless, subsys) : NULL; +} + +CONFIGFS_ATTR_STRUCT(childless); +#define CHILDLESS_ATTR(_name, _mode, _show, _store) \ +struct childless_attribute childless_attr_##_name = __CONFIGFS_ATTR(_name, _mode, _show, _store) +#define CHILDLESS_ATTR_RO(_name, _show) \ +struct childless_attribute childless_attr_##_name = __CONFIGFS_ATTR_RO(_name, _show); + +static ssize_t childless_showme_read(struct childless *childless, + char *page) +{ + ssize_t pos; + + pos = sprintf(page, "%d\n", childless->showme); + childless->showme++; + + return pos; +} + +static ssize_t childless_storeme_read(struct childless *childless, + char *page) +{ + return sprintf(page, "%d\n", childless->storeme); +} + +static ssize_t childless_storeme_write(struct childless *childless, + const char *page, + size_t count) +{ + unsigned long tmp; + char *p = (char *) page; + + tmp = simple_strtoul(p, &p, 10); + if (!p || (*p && (*p != '\n'))) + return -EINVAL; + + if (tmp > INT_MAX) + return -ERANGE; + + childless->storeme = tmp; + + return count; +} + +static ssize_t childless_description_read(struct childless *childless, + char *page) +{ + return sprintf(page, +"[01-childless]\n" +"\n" +"The childless subsystem is the simplest possible subsystem in\n" +"configfs. It does not support the creation of child config_items.\n" +"It only has a few attributes. In fact, it isn't much different\n" +"than a directory in /proc.\n"); +} + +CHILDLESS_ATTR_RO(showme, childless_showme_read); +CHILDLESS_ATTR(storeme, S_IRUGO | S_IWUSR, childless_storeme_read, + childless_storeme_write); +CHILDLESS_ATTR_RO(description, childless_description_read); + +static struct configfs_attribute *childless_attrs[] = { + &childless_attr_showme.attr, + &childless_attr_storeme.attr, + &childless_attr_description.attr, + NULL, +}; + +CONFIGFS_ATTR_OPS(childless); +static struct configfs_item_operations childless_item_ops = { + .show_attribute = childless_attr_show, + .store_attribute = childless_attr_store, +}; + +static struct config_item_type childless_type = { + .ct_item_ops = &childless_item_ops, + .ct_attrs = childless_attrs, + .ct_owner = THIS_MODULE, +}; + +static struct childless childless_subsys = { + .subsys = { + .su_group = { + .cg_item = { + .ci_namebuf = "01-childless", + .ci_type = &childless_type, + }, + }, + }, +}; + + +/* ----------------------------------------------------------------- */ + +/* + * 02-simple-children + * + * This example merely has a simple one-attribute child. Note that + * there is no extra attribute structure, as the child's attribute is + * known from the get-go. Also, there is no container for the + * subsystem, as it has no attributes of its own. + */ + +struct simple_child { + struct config_item item; + int storeme; +}; + +static inline struct simple_child *to_simple_child(struct config_item *item) +{ + return item ? container_of(item, struct simple_child, item) : NULL; +} + +static struct configfs_attribute simple_child_attr_storeme = { + .ca_owner = THIS_MODULE, + .ca_name = "storeme", + .ca_mode = S_IRUGO | S_IWUSR, +}; + +static struct configfs_attribute *simple_child_attrs[] = { + &simple_child_attr_storeme, + NULL, +}; + +static ssize_t simple_child_attr_show(struct config_item *item, + struct configfs_attribute *attr, + char *page) +{ + ssize_t count; + struct simple_child *simple_child = to_simple_child(item); + + count = sprintf(page, "%d\n", simple_child->storeme); + + return count; +} + +static ssize_t simple_child_attr_store(struct config_item *item, + struct configfs_attribute *attr, + const char *page, size_t count) +{ + struct simple_child *simple_child = to_simple_child(item); + unsigned long tmp; + char *p = (char *) page; + + tmp = simple_strtoul(p, &p, 10); + if (!p || (*p && (*p != '\n'))) + return -EINVAL; + + if (tmp > INT_MAX) + return -ERANGE; + + simple_child->storeme = tmp; + + return count; +} + +static void simple_child_release(struct config_item *item) +{ + kfree(to_simple_child(item)); +} + +static struct configfs_item_operations simple_child_item_ops = { + .release = simple_child_release, + .show_attribute = simple_child_attr_show, + .store_attribute = simple_child_attr_store, +}; + +static struct config_item_type simple_child_type = { + .ct_item_ops = &simple_child_item_ops, + .ct_attrs = simple_child_attrs, + .ct_owner = THIS_MODULE, +}; + + +struct simple_children { + struct config_group group; +}; + +static inline struct simple_children *to_simple_children(struct config_item *item) +{ + return item ? container_of(to_config_group(item), struct simple_children, group) : NULL; +} + +static struct config_item *simple_children_make_item(struct config_group *group, const char *name) +{ + struct simple_child *simple_child; + + simple_child = kzalloc(sizeof(struct simple_child), GFP_KERNEL); + if (!simple_child) + return ERR_PTR(-ENOMEM); + + config_item_init_type_name(&simple_child->item, name, + &simple_child_type); + + simple_child->storeme = 0; + + return &simple_child->item; +} + +static struct configfs_attribute simple_children_attr_description = { + .ca_owner = THIS_MODULE, + .ca_name = "description", + .ca_mode = S_IRUGO, +}; + +static struct configfs_attribute *simple_children_attrs[] = { + &simple_children_attr_description, + NULL, +}; + +static ssize_t simple_children_attr_show(struct config_item *item, + struct configfs_attribute *attr, + char *page) +{ + return sprintf(page, +"[02-simple-children]\n" +"\n" +"This subsystem allows the creation of child config_items. These\n" +"items have only one attribute that is readable and writeable.\n"); +} + +static void simple_children_release(struct config_item *item) +{ + kfree(to_simple_children(item)); +} + +static struct configfs_item_operations simple_children_item_ops = { + .release = simple_children_release, + .show_attribute = simple_children_attr_show, +}; + +/* + * Note that, since no extra work is required on ->drop_item(), + * no ->drop_item() is provided. + */ +static struct configfs_group_operations simple_children_group_ops = { + .make_item = simple_children_make_item, +}; + +static struct config_item_type simple_children_type = { + .ct_item_ops = &simple_children_item_ops, + .ct_group_ops = &simple_children_group_ops, + .ct_attrs = simple_children_attrs, + .ct_owner = THIS_MODULE, +}; + +static struct configfs_subsystem simple_children_subsys = { + .su_group = { + .cg_item = { + .ci_namebuf = "02-simple-children", + .ci_type = &simple_children_type, + }, + }, +}; + + +/* ----------------------------------------------------------------- */ + +/* + * 03-group-children + * + * This example reuses the simple_children group from above. However, + * the simple_children group is not the subsystem itself, it is a + * child of the subsystem. Creation of a group in the subsystem creates + * a new simple_children group. That group can then have simple_child + * children of its own. + */ + +static struct config_group *group_children_make_group(struct config_group *group, const char *name) +{ + struct simple_children *simple_children; + + simple_children = kzalloc(sizeof(struct simple_children), + GFP_KERNEL); + if (!simple_children) + return ERR_PTR(-ENOMEM); + + config_group_init_type_name(&simple_children->group, name, + &simple_children_type); + + return &simple_children->group; +} + +static struct configfs_attribute group_children_attr_description = { + .ca_owner = THIS_MODULE, + .ca_name = "description", + .ca_mode = S_IRUGO, +}; + +static struct configfs_attribute *group_children_attrs[] = { + &group_children_attr_description, + NULL, +}; + +static ssize_t group_children_attr_show(struct config_item *item, + struct configfs_attribute *attr, + char *page) +{ + return sprintf(page, +"[03-group-children]\n" +"\n" +"This subsystem allows the creation of child config_groups. These\n" +"groups are like the subsystem simple-children.\n"); +} + +static struct configfs_item_operations group_children_item_ops = { + .show_attribute = group_children_attr_show, +}; + +/* + * Note that, since no extra work is required on ->drop_item(), + * no ->drop_item() is provided. + */ +static struct configfs_group_operations group_children_group_ops = { + .make_group = group_children_make_group, +}; + +static struct config_item_type group_children_type = { + .ct_item_ops = &group_children_item_ops, + .ct_group_ops = &group_children_group_ops, + .ct_attrs = group_children_attrs, + .ct_owner = THIS_MODULE, +}; + +static struct configfs_subsystem group_children_subsys = { + .su_group = { + .cg_item = { + .ci_namebuf = "03-group-children", + .ci_type = &group_children_type, + }, + }, +}; + +/* ----------------------------------------------------------------- */ + +/* + * We're now done with our subsystem definitions. + * For convenience in this module, here's a list of them all. It + * allows the init function to easily register them. Most modules + * will only have one subsystem, and will only call register_subsystem + * on it directly. + */ +static struct configfs_subsystem *example_subsys[] = { + &childless_subsys.subsys, + &simple_children_subsys, + &group_children_subsys, + NULL, +}; + +static int __init configfs_example_init(void) +{ + int ret; + int i; + struct configfs_subsystem *subsys; + + for (i = 0; example_subsys[i]; i++) { + subsys = example_subsys[i]; + + config_group_init(&subsys->su_group); + mutex_init(&subsys->su_mutex); + ret = configfs_register_subsystem(subsys); + if (ret) { + printk(KERN_ERR "Error %d while registering subsystem %s\n", + ret, + subsys->su_group.cg_item.ci_namebuf); + goto out_unregister; + } + } + + return 0; + +out_unregister: + for (i--; i >= 0; i--) + configfs_unregister_subsystem(example_subsys[i]); + + return ret; +} + +static void __exit configfs_example_exit(void) +{ + int i; + + for (i = 0; example_subsys[i]; i++) + configfs_unregister_subsystem(example_subsys[i]); +} + +module_init(configfs_example_init); +module_exit(configfs_example_exit); +MODULE_LICENSE("GPL"); diff --git a/Documentation/filesystems/debugfs.txt b/Documentation/filesystems/debugfs.txt new file mode 100644 index 00000000000..3a863f69272 --- /dev/null +++ b/Documentation/filesystems/debugfs.txt @@ -0,0 +1,191 @@ +Copyright 2009 Jonathan Corbet <corbet@lwn.net> + +Debugfs exists as a simple way for kernel developers to make information +available to user space. Unlike /proc, which is only meant for information +about a process, or sysfs, which has strict one-value-per-file rules, +debugfs has no rules at all. Developers can put any information they want +there. The debugfs filesystem is also intended to not serve as a stable +ABI to user space; in theory, there are no stability constraints placed on +files exported there. The real world is not always so simple, though [1]; +even debugfs interfaces are best designed with the idea that they will need +to be maintained forever. + +Debugfs is typically mounted with a command like: + + mount -t debugfs none /sys/kernel/debug + +(Or an equivalent /etc/fstab line). +The debugfs root directory is accessible only to the root user by +default. To change access to the tree the "uid", "gid" and "mode" mount +options can be used. + +Note that the debugfs API is exported GPL-only to modules. + +Code using debugfs should include <linux/debugfs.h>. Then, the first order +of business will be to create at least one directory to hold a set of +debugfs files: + + struct dentry *debugfs_create_dir(const char *name, struct dentry *parent); + +This call, if successful, will make a directory called name underneath the +indicated parent directory. If parent is NULL, the directory will be +created in the debugfs root. On success, the return value is a struct +dentry pointer which can be used to create files in the directory (and to +clean it up at the end). A NULL return value indicates that something went +wrong. If ERR_PTR(-ENODEV) is returned, that is an indication that the +kernel has been built without debugfs support and none of the functions +described below will work. + +The most general way to create a file within a debugfs directory is with: + + struct dentry *debugfs_create_file(const char *name, umode_t mode, + struct dentry *parent, void *data, + const struct file_operations *fops); + +Here, name is the name of the file to create, mode describes the access +permissions the file should have, parent indicates the directory which +should hold the file, data will be stored in the i_private field of the +resulting inode structure, and fops is a set of file operations which +implement the file's behavior. At a minimum, the read() and/or write() +operations should be provided; others can be included as needed. Again, +the return value will be a dentry pointer to the created file, NULL for +error, or ERR_PTR(-ENODEV) if debugfs support is missing. + +In a number of cases, the creation of a set of file operations is not +actually necessary; the debugfs code provides a number of helper functions +for simple situations. Files containing a single integer value can be +created with any of: + + struct dentry *debugfs_create_u8(const char *name, umode_t mode, + struct dentry *parent, u8 *value); + struct dentry *debugfs_create_u16(const char *name, umode_t mode, + struct dentry *parent, u16 *value); + struct dentry *debugfs_create_u32(const char *name, umode_t mode, + struct dentry *parent, u32 *value); + struct dentry *debugfs_create_u64(const char *name, umode_t mode, + struct dentry *parent, u64 *value); + +These files support both reading and writing the given value; if a specific +file should not be written to, simply set the mode bits accordingly. The +values in these files are in decimal; if hexadecimal is more appropriate, +the following functions can be used instead: + + struct dentry *debugfs_create_x8(const char *name, umode_t mode, + struct dentry *parent, u8 *value); + struct dentry *debugfs_create_x16(const char *name, umode_t mode, + struct dentry *parent, u16 *value); + struct dentry *debugfs_create_x32(const char *name, umode_t mode, + struct dentry *parent, u32 *value); + struct dentry *debugfs_create_x64(const char *name, umode_t mode, + struct dentry *parent, u64 *value); + +These functions are useful as long as the developer knows the size of the +value to be exported. Some types can have different widths on different +architectures, though, complicating the situation somewhat. There is a +function meant to help out in one special case: + + struct dentry *debugfs_create_size_t(const char *name, umode_t mode, + struct dentry *parent, + size_t *value); + +As might be expected, this function will create a debugfs file to represent +a variable of type size_t. + +Boolean values can be placed in debugfs with: + + struct dentry *debugfs_create_bool(const char *name, umode_t mode, + struct dentry *parent, u32 *value); + +A read on the resulting file will yield either Y (for non-zero values) or +N, followed by a newline. If written to, it will accept either upper- or +lower-case values, or 1 or 0. Any other input will be silently ignored. + +Another option is exporting a block of arbitrary binary data, with +this structure and function: + + struct debugfs_blob_wrapper { + void *data; + unsigned long size; + }; + + struct dentry *debugfs_create_blob(const char *name, umode_t mode, + struct dentry *parent, + struct debugfs_blob_wrapper *blob); + +A read of this file will return the data pointed to by the +debugfs_blob_wrapper structure. Some drivers use "blobs" as a simple way +to return several lines of (static) formatted text output. This function +can be used to export binary information, but there does not appear to be +any code which does so in the mainline. Note that all files created with +debugfs_create_blob() are read-only. + +If you want to dump a block of registers (something that happens quite +often during development, even if little such code reaches mainline. +Debugfs offers two functions: one to make a registers-only file, and +another to insert a register block in the middle of another sequential +file. + + struct debugfs_reg32 { + char *name; + unsigned long offset; + }; + + struct debugfs_regset32 { + struct debugfs_reg32 *regs; + int nregs; + void __iomem *base; + }; + + struct dentry *debugfs_create_regset32(const char *name, umode_t mode, + struct dentry *parent, + struct debugfs_regset32 *regset); + + int debugfs_print_regs32(struct seq_file *s, struct debugfs_reg32 *regs, + int nregs, void __iomem *base, char *prefix); + +The "base" argument may be 0, but you may want to build the reg32 array +using __stringify, and a number of register names (macros) are actually +byte offsets over a base for the register block. + + +There are a couple of other directory-oriented helper functions: + + struct dentry *debugfs_rename(struct dentry *old_dir, + struct dentry *old_dentry, + struct dentry *new_dir, + const char *new_name); + + struct dentry *debugfs_create_symlink(const char *name, + struct dentry *parent, + const char *target); + +A call to debugfs_rename() will give a new name to an existing debugfs +file, possibly in a different directory. The new_name must not exist prior +to the call; the return value is old_dentry with updated information. +Symbolic links can be created with debugfs_create_symlink(). + +There is one important thing that all debugfs users must take into account: +there is no automatic cleanup of any directories created in debugfs. If a +module is unloaded without explicitly removing debugfs entries, the result +will be a lot of stale pointers and no end of highly antisocial behavior. +So all debugfs users - at least those which can be built as modules - must +be prepared to remove all files and directories they create there. A file +can be removed with: + + void debugfs_remove(struct dentry *dentry); + +The dentry value can be NULL, in which case nothing will be removed. + +Once upon a time, debugfs users were required to remember the dentry +pointer for every debugfs file they created so that all files could be +cleaned up. We live in more civilized times now, though, and debugfs users +can call: + + void debugfs_remove_recursive(struct dentry *dentry); + +If this function is passed a pointer for the dentry corresponding to the +top-level directory, the entire hierarchy below that directory will be +removed. + +Notes: + [1] http://lwn.net/Articles/309298/ diff --git a/Documentation/filesystems/devfs/ChangeLog b/Documentation/filesystems/devfs/ChangeLog deleted file mode 100644 index e5aba5246d7..00000000000 --- a/Documentation/filesystems/devfs/ChangeLog +++ /dev/null @@ -1,1977 +0,0 @@ -/* -*- auto-fill -*- */ -=============================================================================== -Changes for patch v1 - -- creation of devfs - -- modified miscellaneous character devices to support devfs -=============================================================================== -Changes for patch v2 - -- bug fix with manual inode creation -=============================================================================== -Changes for patch v3 - -- bugfixes - -- documentation improvements - -- created a couple of scripts (one to save&restore a devfs and the - other to set up compatibility symlinks) - -- devfs support for SCSI discs. New name format is: sd_hHcCiIlL -=============================================================================== -Changes for patch v4 - -- bugfix for the directory reading code - -- bugfix for compilation with kerneld - -- devfs support for generic hard discs - -- rationalisation of the various watchdog drivers -=============================================================================== -Changes for patch v5 - -- support for mounting directly from entries in the devfs (it doesn't - need to be mounted to do this), including the root filesystem. - Mounting of swap partitions also works. Hence, now if you set - CONFIG_DEVFS_ONLY to 'Y' then you won't be able to access your discs - via ordinary device nodes. Naturally, the default is 'N' so that you - can still use your old device nodes. If you want to mount from devfs - entries, make sure you use: append = "root=/dev/sd_..." in your - lilo.conf. It seems LILO looks for the device number (major&minor) - and writes that into the kernel image :-( - -- support for character memory devices (/dev/null, /dev/zero, /dev/full - and so on). Thanks to C. Scott Ananian <cananian@alumni.princeton.edu> -=============================================================================== -Changes for patch v6 - -- support for subdirectories - -- support for symbolic links (created by devfs_mk_symlink(), no - support yet for creation via symlink(2)) - -- SCSI disc naming now cast in stone, with the format: - /dev/sd/c0b1t2u3 controller=0, bus=1, ID=2, LUN=3, whole disc - /dev/sd/c0b1t2u3p4 controller=0, bus=1, ID=2, LUN=3, 4th partition - -- loop devices now appear in devfs - -- tty devices, console, serial ports, etc. now appear in devfs - Thanks to C. Scott Ananian <cananian@alumni.princeton.edu> - -- bugs with mounting devfs-only devices now fixed -=============================================================================== -Changes for patch v7 - -- SCSI CD-ROMS, tapes and generic devices now appear in devfs -=============================================================================== -Changes for patch v8 - -- bugfix with no-rewind SCSI tapes - -- RAMDISCs now appear in devfs - -- better cleaning up of devfs entries created by various modules - -- interface change to <devfs_register> -=============================================================================== -Changes for patch v9 - -- the v8 patch was corrupted somehow, which would affect the patch for - linux/fs/filesystems.c - I've also fixed the v8 patch file on the WWW - -- MetaDevices (/dev/md*) should now appear in devfs -=============================================================================== -Changes for patch v10 - -- bugfix in meta device support for devfs - -- created this ChangeLog file - -- added devfs support to the floppy driver - -- added support for creating sockets in a devfs -=============================================================================== -Changes for patch v11 - -- added DEVFS_FL_HIDE_UNREG flag - -- incorporated better patch for ttyname() in libc 5.4.43 from H.J. Lu. - -- interface change to <devfs_mk_symlink> - -- support for creating symlinks with symlink(2) - -- parallel port printer (/dev/lp*) now appears in devfs -=============================================================================== -Changes for patch v12 - -- added inode check to <devfs_fill_file> function - -- improved devfs support when mounting from devfs - -- added call to <<release>> operation when removing swap areas on - devfs devices - -- increased NR_SUPER to 128 to support large numbers of devfs mounts - (for chroot(2) gaols) - -- fixed bug in SCSI disc support: was generating incorrect minors if - SCSI ID's did not start at 0 and increase by 1 - -- support symlink traversal when mounting root -=============================================================================== -Changes for patch v13 - -- added devfs support to soundcard driver - Thanks to Eric Dumas <dumas@linux.eu.org> and - C. Scott Ananian <cananian@alumni.princeton.edu> - -- added devfs support to the joystick driver - -- loop driver now has it's own subdirectory "/dev/loop/" - -- created <devfs_get_flags> and <devfs_set_flags> functions - -- fix problem with SCSI disc compatibility names (sd{a,b,c,d,e,f}) - which assumes ID's start at 0 and increase by 1. Also only create - devfs entries for SCSI disc partitions which actually exist - Show new names in partition check - Thanks to Jakub Jelinek <jj@sunsite.ms.mff.cuni.cz> -=============================================================================== -Changes for patch v14 - -- bug fix in floppy driver: would not compile without - CONFIG_DEVFS_FS='Y' - Thanks to Jurgen Botz <jbotz@nova.botz.org> - -- bug fix in loop driver - Thanks to C. Scott Ananian <cananian@alumni.princeton.edu> - -- do not create devfs entries for printers not configured - Thanks to C. Scott Ananian <cananian@alumni.princeton.edu> - -- do not create devfs entries for serial ports not present - Thanks to C. Scott Ananian <cananian@alumni.princeton.edu> - -- ensure <tty_register_devfs> is exported from tty_io.c - Thanks to C. Scott Ananian <cananian@alumni.princeton.edu> - -- allow unregistering of devfs symlink entries - -- fixed bug in SCSI disc naming introduced in last patch version -=============================================================================== -Changes for patch v15 - -- ported to kernel 2.1.81 -=============================================================================== -Changes for patch v16 - -- created <devfs_set_symlink_destination> function - -- moved DEVFS_SUPER_MAGIC into header file - -- added DEVFS_FL_HIDE flag - -- created <devfs_get_maj_min> - -- created <devfs_get_handle_from_inode> - -- fixed bugs in searching by major&minor - -- changed interface to <devfs_unregister>, <devfs_fill_file> and - <devfs_find_handle> - -- fixed inode times when symlink created with symlink(2) - -- change tty driver to do auto-creation of devfs entries - Thanks to C. Scott Ananian <cananian@alumni.princeton.edu> - -- fixed bug in genhd.c: whole disc (non-SCSI) was not registered to - devfs - -- updated libc 5.4.43 patch for ttyname() -=============================================================================== -Changes for patch v17 - -- added CONFIG_DEVFS_TTY_COMPAT - Thanks to C. Scott Ananian <cananian@alumni.princeton.edu> - -- bugfix in devfs support for drivers/char/lp.c - Thanks to C. Scott Ananian <cananian@alumni.princeton.edu> - -- clean up serial driver so that PCMCIA devices unregister correctly - Thanks to C. Scott Ananian <cananian@alumni.princeton.edu> - -- fixed bug in genhd.c: whole disc (non-SCSI) was not registered to - devfs [was missing in patch v16] - -- updated libc 5.4.43 patch for ttyname() [was missing in patch v16] - -- all SCSI devices now registered in /dev/sg - -- support removal of devfs entries via unlink(2) -=============================================================================== -Changes for patch v18 - -- added floppy/?u720 floppy entry - -- fixed kerneld support for entries in devfs subdirectories - -- incorporated latest patch for ttyname() in libc 5.4.43 from H.J. Lu. -=============================================================================== -Changes for patch v19 - -- bug fix when looking up unregistered entries: kerneld was not called - -- fixes for kernel 2.1.86 (now requires 2.1.86) -=============================================================================== -Changes for patch v20 - -- only create available floppy entries - Thanks to Andrzej Krzysztofowicz <ankry@green.mif.pg.gda.pl> - -- new IDE naming scheme following SCSI format (i.e. /dev/id/c0b0t0u0p1 - instead of /dev/hda1) - Thanks to Andrzej Krzysztofowicz <ankry@green.mif.pg.gda.pl> - -- new XT disc naming scheme following SCSI format (i.e. /dev/xd/c0t0p1 - instead of /dev/xda1) - Thanks to Andrzej Krzysztofowicz <ankry@green.mif.pg.gda.pl> - -- new non-standard CD-ROM names (i.e. /dev/sbp/c#t#) - Thanks to Andrzej Krzysztofowicz <ankry@green.mif.pg.gda.pl> - -- allow symlink traversal when mounting the root filesystem - -- Create entries for MD devices at MD init - Thanks to Christophe Leroy <christophe.leroy5@capway.com> -=============================================================================== -Changes for patch v21 - -- ported to kernel 2.1.91 -=============================================================================== -Changes for patch v22 - -- SCSI host number patch ("scsihosts=" kernel option) - Thanks to Andrzej Krzysztofowicz <ankry@green.mif.pg.gda.pl> -=============================================================================== -Changes for patch v23 - -- Fixed persistence bug with device numbers for manually created - device files - -- Fixed problem with recreating symlinks with different content - -- Added CONFIG_DEVFS_MOUNT (mount devfs on /dev at boot time) -=============================================================================== -Changes for patch v24 - -- Switched from CONFIG_KERNELD to CONFIG_KMOD: module autoloading - should now work again - -- Hide entries which are manually unlinked - -- Always invalidate devfs dentry cache when registering entries - -- Support removal of devfs directories via rmdir(2) - -- Ensure directories created by <devfs_mk_dir> are visible - -- Default no access for "other" for floppy device -=============================================================================== -Changes for patch v25 - -- Updates to CREDITS file and minor IDE numbering change - Thanks to Andrzej Krzysztofowicz <ankry@green.mif.pg.gda.pl> - -- Invalidate devfs dentry cache when making directories - -- Invalidate devfs dentry cache when removing entries - -- More informative message if root FS mount fails when devfs - configured - -- Fixed persistence bug with fifos -=============================================================================== -Changes for patch v26 - -- ported to kernel 2.1.97 - -- Changed serial directory from "/dev/serial" to "/dev/tts" and - "/dev/consoles" to "/dev/vc" to be more friendly to new procps -=============================================================================== -Changes for patch v27 - -- Added support for IDE4 and IDE5 - Thanks to Andrzej Krzysztofowicz <ankry@green.mif.pg.gda.pl> - -- Documented "scsihosts=" boot parameter - -- Print process command when debugging kerneld/kmod - -- Added debugging for register/unregister/change operations - -- Added "devfs=" boot options - -- Hide unregistered entries by default -=============================================================================== -Changes for patch v28 - -- No longer lock/unlock superblock in <devfs_put_super> (cope with - recent VFS interface change) - -- Do not automatically change ownership/protection of /dev/tty - -- Drop negative dentries when they are released - -- Manage dcache more efficiently -=============================================================================== -Changes for patch v29 - -- Added DEVFS_FL_AUTO_DEVNUM flag -=============================================================================== -Changes for patch v30 - -- No longer set unnecessary methods - -- Ported to kernel 2.1.99-pre3 -=============================================================================== -Changes for patch v31 - -- Added PID display to <call_kerneld> debugging message - -- Added "diread" and "diwrite" options - -- Ported to kernel 2.1.102 - -- Fixed persistence problem with permissions -=============================================================================== -Changes for patch v32 - -- Fixed devfs support in drivers/block/md.c -=============================================================================== -Changes for patch v33 - -- Support legacy device nodes - -- Fixed bug where recreated inodes were hidden - -- New IDE naming scheme: everything is under /dev/ide -=============================================================================== -Changes for patch v34 - -- Improved debugging in <get_vfs_inode> - -- Prevent duplicate calls to <devfs_mk_dir> in SCSI layer - -- No longer free old dentries in <devfs_mk_dir> - -- Free all dentries for a given entry when deleting inodes -=============================================================================== -Changes for patch v35 - -- Ported to kernel 2.1.105 (sound driver changes) -=============================================================================== -Changes for patch v36 - -- Fixed sound driver port -=============================================================================== -Changes for patch v37 - -- Minor documentation tweaks -=============================================================================== -Changes for patch v38 - -- More documentation tweaks - -- Fix for sound driver port - -- Removed ttyname-patch (grab libc 5.4.44 instead) - -- Ported to kernel 2.1.107-pre2 (loop driver fix) -=============================================================================== -Changes for patch v39 - -- Ported to kernel 2.1.107 (hd.c hunk broke due to spelling "fixes"). Sigh - -- Removed many #ifdef's, replaced with trickery in include/devfs_fs.h -=============================================================================== -Changes for patch v40 - -- Fix for sound driver port - -- Limit auto-device numbering to majors 128 to 239 -=============================================================================== -Changes for patch v41 - -- Fixed inode times persistence problem -=============================================================================== -Changes for patch v42 - -- Ported to kernel 2.1.108 (drivers/scsi/hosts.c hunk broke) -=============================================================================== -Changes for patch v43 - -- Fixed spelling in <devfs_readlink> debug - -- Fixed bug in <devfs_setup> parsing "dilookup" - -- More #ifdef's removed - -- Supported Sparc keyboard (/dev/kbd) - -- Supported DSP56001 digital signal processor (/dev/dsp56k) - -- Supported Apple Desktop Bus (/dev/adb) - -- Supported Coda network file system (/dev/cfs*) -=============================================================================== -Changes for patch v44 - -- Fixed devfs inode leak when manually recreating inodes - -- Fixed permission persistence problem when recreating inodes -=============================================================================== -Changes for patch v45 - -- Ported to kernel 2.1.110 -=============================================================================== -Changes for patch v46 - -- Ported to kernel 2.1.112-pre1 - -- Removed harmless "unused variable" compiler warning - -- Fixed modes for manually recreated device nodes -=============================================================================== -Changes for patch v47 - -- Added NULL devfs inode warning in <devfs_read_inode> - -- Force all inode nlink values to 1 -=============================================================================== -Changes for patch v48 - -- Added "dimknod" option - -- Set inode nlink to 0 when freeing dentries - -- Added support for virtual console capture devices (/dev/vcs*) - Thanks to Dennis Hou <smilax@mindmeld.yi.org> - -- Fixed modes for manually recreated symlinks -=============================================================================== -Changes for patch v49 - -- Ported to kernel 2.1.113 -=============================================================================== -Changes for patch v50 - -- Fixed bugs in recreated directories and symlinks -=============================================================================== -Changes for patch v51 - -- Improved robustness of rc.devfs script - Thanks to Roderich Schupp <rsch@experteam.de> - -- Fixed bugs in recreated device nodes - -- Fixed bug in currently unused <devfs_get_handle_from_inode> - -- Defined new <devfs_handle_t> type - -- Improved debugging when getting entries - -- Fixed bug where directories could be emptied - -- Ported to kernel 2.1.115 -=============================================================================== -Changes for patch v52 - -- Replaced dummy .epoch inode with .devfsd character device - -- Modified rc.devfs to take account of above change - -- Removed spurious driver warning messages when CONFIG_DEVFS_FS=n - -- Implemented devfsd protocol revision 0 -=============================================================================== -Changes for patch v53 - -- Ported to kernel 2.1.116 (kmod change broke hunk) - -- Updated Documentation/Configure.help - -- Test and tty pattern patch for rc.devfs script - Thanks to Roderich Schupp <rsch@experteam.de> - -- Added soothing message to warning in <devfs_d_iput> -=============================================================================== -Changes for patch v54 - -- Ported to kernel 2.1.117 - -- Fixed default permissions in sound driver - -- Added support for frame buffer devices (/dev/fb*) -=============================================================================== -Changes for patch v55 - -- Ported to kernel 2.1.119 - -- Use GCC extensions for structure initialisations - -- Implemented async open notification - -- Incremented devfsd protocol revision to 1 -=============================================================================== -Changes for patch v56 - -- Ported to kernel 2.1.120-pre3 - -- Moved async open notification to end of <devfs_open> -=============================================================================== -Changes for patch v57 - -- Ported to kernel 2.1.121 - -- Prepended "/dev/" to module load request - -- Renamed <call_kerneld> to <call_kmod> - -- Created sample modules.conf file -=============================================================================== -Changes for patch v58 - -- Fixed typo "AYSNC" -> "ASYNC" -=============================================================================== -Changes for patch v59 - -- Added open flag for files -=============================================================================== -Changes for patch v60 - -- Ported to kernel 2.1.123-pre2 -=============================================================================== -Changes for patch v61 - -- Set i_blocks=0 and i_blksize=1024 in <devfs_read_inode> -=============================================================================== -Changes for patch v62 - -- Ported to kernel 2.1.123 -=============================================================================== -Changes for patch v63 - -- Ported to kernel 2.1.124-pre2 -=============================================================================== -Changes for patch v64 - -- Fixed Unix98 pty support - -- Increased buffer size in <get_partition_list> to avoid crash and - burn -=============================================================================== -Changes for patch v65 - -- More Unix98 pty support fixes - -- Added test for empty <<name>> in <devfs_find_handle> - -- Renamed <generate_path> to <devfs_generate_path> and published - -- Created /dev/root symlink - Thanks to Roderich Schupp <rsch@ExperTeam.de> - with further modifications by me -=============================================================================== -Changes for patch v66 - -- Yet more Unix98 pty support fixes (now tested) - -- Created <devfs_get_fops> - -- Support media change checks when CONFIG_DEVFS_ONLY=y - -- Abolished Unix98-style PTY names for old PTY devices -=============================================================================== -Changes for patch v67 - -- Added inline declaration for dummy <devfs_generate_path> - -- Removed spurious "unable to register... in devfs" messages when - CONFIG_DEVFS_FS=n - -- Fixed misc. devices when CONFIG_DEVFS_FS=n - -- Limit auto-device numbering to majors 144 to 239 -=============================================================================== -Changes for patch v68 - -- Hide unopened virtual consoles from directory listings - -- Added support for video capture devices - -- Ported to kernel 2.1.125 -=============================================================================== -Changes for patch v69 - -- Fix for CONFIG_VT=n -=============================================================================== -Changes for patch v70 - -- Added support for non-OSS/Free sound cards -=============================================================================== -Changes for patch v71 - -- Ported to kernel 2.1.126-pre2 -=============================================================================== -Changes for patch v72 - -- #ifdef's for CONFIG_DEVFS_DISABLE_OLD_NAMES removed -=============================================================================== -Changes for patch v73 - -- CONFIG_DEVFS_DISABLE_OLD_NAMES replaced with "nocompat" boot option - -- CONFIG_DEVFS_BOOT_OPTIONS removed: boot options always available -=============================================================================== -Changes for patch v74 - -- Removed CONFIG_DEVFS_MOUNT and "mount" boot option and replaced with - "nomount" boot option - -- Documentation updates - -- Updated sample modules.conf -=============================================================================== -Changes for patch v75 - -- Updated sample modules.conf - -- Remount devfs after initrd finishes - -- Ported to kernel 2.1.127 - -- Added support for ISDN - Thanks to Christophe Leroy <christophe.leroy5@capway.com> -=============================================================================== -Changes for patch v76 - -- Updated an email address in ChangeLog - -- CONFIG_DEVFS_ONLY replaced with "only" boot option -=============================================================================== -Changes for patch v77 - -- Added DEVFS_FL_REMOVABLE flag - -- Check for disc change when listing directories with removable media - devices - -- Use DEVFS_FL_REMOVABLE in sd.c - -- Ported to kernel 2.1.128 -=============================================================================== -Changes for patch v78 - -- Only call <scan_dir_for_removable> on first call to <devfs_readdir> - -- Ported to kernel 2.1.129-pre5 - -- ISDN support improvements - Thanks to Christophe Leroy <christophe.leroy5@capway.com> -=============================================================================== -Changes for patch v79 - -- Ported to kernel 2.1.130 - -- Renamed miscdevice "apm" to "apm_bios" to be consistent with - devices.txt -=============================================================================== -Changes for patch v80 - -- Ported to kernel 2.1.131 - -- Updated <devfs_rmdir> for VFS change in 2.1.131 -=============================================================================== -Changes for patch v81 - -- Fixed permissions on /dev/ptmx -=============================================================================== -Changes for patch v82 - -- Ported to kernel 2.1.132-pre4 - -- Changed initial permissions on /dev/pts/* - -- Created <devfs_mk_compat> - -- Added "symlinks" boot option - -- Changed devfs_register_blkdev() back to register_blkdev() for IDE - -- Check for partitions on removable media in <devfs_lookup> -=============================================================================== -Changes for patch v83 - -- Fixed support for ramdisc when using string-based root FS name - -- Ported to kernel 2.2.0-pre1 -=============================================================================== -Changes for patch v84 - -- Ported to kernel 2.2.0-pre7 -=============================================================================== -Changes for patch v85 - -- Compile fixes for driver/sound/sound_common.c (non-module) and - drivers/isdn/isdn_common.c - Thanks to Christophe Leroy <christophe.leroy5@capway.com> - -- Added support for registering regular files - -- Created <devfs_set_file_size> - -- Added /dev/cpu/mtrr as an alternative interface to /proc/mtrr - -- Update devfs inodes from entries if not changed through FS -=============================================================================== -Changes for patch v86 - -- Ported to kernel 2.2.0-pre9 -=============================================================================== -Changes for patch v87 - -- Fixed bug when mounting non-devfs devices in a devfs -=============================================================================== -Changes for patch v88 - -- Fixed <devfs_fill_file> to only initialise temporary inodes - -- Trap for NULL fops in <devfs_register> - -- Return -ENODEV in <devfs_fill_file> for non-driver inodes - -- Fixed bug when unswapping non-devfs devices in a devfs -=============================================================================== -Changes for patch v89 - -- Switched to C data types in include/linux/devfs_fs.h - -- Switched from PATH_MAX to DEVFS_PATHLEN - -- Updated Documentation/filesystems/devfs/modules.conf to take account - of reverse scanning (!) by modprobe - -- Ported to kernel 2.2.0 -=============================================================================== -Changes for patch v90 - -- CONFIG_DEVFS_DISABLE_OLD_TTY_NAMES replaced with "nottycompat" boot - option - -- CONFIG_DEVFS_TTY_COMPAT removed: existing "symlinks" boot option now - controls this. This means you must have libc 5.4.44 or later, or a - recent version of libc 6 if you use the "symlinks" option -=============================================================================== -Changes for patch v91 - -- Switch from <devfs_mk_symlink> to <devfs_mk_compat> in - drivers/char/vc_screen.c to fix problems with Midnight Commander -=============================================================================== -Changes for patch v92 - -- Ported to kernel 2.2.2-pre5 -=============================================================================== -Changes for patch v93 - -- Modified <sd_name> in drivers/scsi/sd.c to cope with devices that - don't exist (which happens with new RAID autostart code printk()s) -=============================================================================== -Changes for patch v94 - -- Fixed bug in joystick driver: only first joystick was registered -=============================================================================== -Changes for patch v95 - -- Fixed another bug in joystick driver - -- Fixed <devfsd_read> to not overrun event buffer -=============================================================================== -Changes for patch v96 - -- Ported to kernel 2.2.5-2 - -- Created <devfs_auto_unregister> - -- Fixed bugs: compatibility entries were not unregistered for: - loop driver - floppy driver - RAMDISC driver - IDE tape driver - SCSI CD-ROM driver - SCSI HDD driver -=============================================================================== -Changes for patch v97 - -- Fixed bugs: compatibility entries were not unregistered for: - ALSA sound driver - partitions in generic disc driver - -- Don't return unregistred entries in <devfs_find_handle> - -- Panic in <devfs_unregister> if entry unregistered - -- Don't panic in <devfs_auto_unregister> for duplicates -=============================================================================== -Changes for patch v98 - -- Don't unregister already unregistered entries in <unregister> - -- Register entry in <sd_detect> - -- Unregister entry in <sd_detach> - -- Changed to <devfs_*register_chrdev> in drivers/char/tty_io.c - -- Ported to kernel 2.2.7 -=============================================================================== -Changes for patch v99 - -- Ported to kernel 2.2.8 - -- Fixed bug in drivers/scsi/sd.c when >16 SCSI discs - -- Disable warning messages when unable to read partition table for - removable media -=============================================================================== -Changes for patch v100 - -- Ported to kernel 2.3.1-pre5 - -- Added "oops-on-panic" boot option - -- Improved debugging in <devfs_register> and <devfs_unregister> - -- Register entry in <sr_detect> - -- Unregister entry in <sr_detach> - -- Register entry in <sg_detect> - -- Unregister entry in <sg_detach> - -- Added support for ALSA drivers -=============================================================================== -Changes for patch v101 - -- Ported to kernel 2.3.2 -=============================================================================== -Changes for patch v102 - -- Update serial driver to register PCMCIA entries - Thanks to Roch-Alexandre Nomine-Beguin <roch@samarkand.infini.fr> - -- Updated an email address in ChangeLog - -- Hide virtual console capture entries from directory listings when - corresponding console device is not open -=============================================================================== -Changes for patch v103 - -- Ported to kernel 2.3.3 -=============================================================================== -Changes for patch v104 - -- Added documentation for some functions - -- Added "doc" target to fs/devfs/Makefile - -- Added "v4l" directory for video4linux devices - -- Replaced call to <devfs_unregister> in <sd_detach> with call to - <devfs_register_partitions> - -- Moved registration for sr and sg drivers from detect() to attach() - methods - -- Register entries in <st_attach> and unregister in <st_detach> - -- Work around IDE driver treating CD-ROM as gendisk - -- Use <sed> instead of <tr> in rc.devfs - -- Updated ToDo list - -- Removed "oops-on-panic" boot option: now always Oops -=============================================================================== -Changes for patch v105 - -- Unregister SCSI host from <scsi_host_no_list> in <scsi_unregister> - Thanks to Zoltán Böszörményi <zboszor@mail.externet.hu> - -- Don't save /dev/log in rc.devfs - -- Ported to kernel 2.3.4-pre1 -=============================================================================== -Changes for patch v106 - -- Fixed silly typo in drivers/scsi/st.c - -- Improved debugging in <devfs_register> -=============================================================================== -Changes for patch v107 - -- Added "diunlink" and "nokmod" boot options - -- Removed superfluous warning message in <devfs_d_iput> -=============================================================================== -Changes for patch v108 - -- Remove entries when unloading sound module -=============================================================================== -Changes for patch v109 - -- Ported to kernel 2.3.6-pre2 -=============================================================================== -Changes for patch v110 - -- Took account of change to <d_alloc_root> -=============================================================================== -Changes for patch v111 - -- Created separate event queue for each mounted devfs - -- Removed <devfs_invalidate_dcache> - -- Created new ioctl()s for devfsd - -- Incremented devfsd protocol revision to 3 - -- Fixed bug when re-creating directories: contents were lost - -- Block access to inodes until devfsd updates permissions -=============================================================================== -Changes for patch v112 - -- Modified patch so it applies against 2.3.5 and 2.3.6 - -- Updated an email address in ChangeLog - -- Do not automatically change ownership/protection of /dev/tty<n> - -- Updated sample modules.conf - -- Switched to sending process uid/gid to devfsd - -- Renamed <call_kmod> to <try_modload> - -- Added DEVFSD_NOTIFY_LOOKUP event - -- Added DEVFSD_NOTIFY_CHANGE event - -- Added DEVFSD_NOTIFY_CREATE event - -- Incremented devfsd protocol revision to 4 - -- Moved kernel-specific stuff to include/linux/devfs_fs_kernel.h -=============================================================================== -Changes for patch v113 - -- Ported to kernel 2.3.9 - -- Restricted permissions on some block devices -=============================================================================== -Changes for patch v114 - -- Added support for /dev/netlink - Thanks to Dennis Hou <smilax@mindmeld.yi.org> - -- Return EISDIR rather than EINVAL for read(2) on directories - -- Ported to kernel 2.3.10 -=============================================================================== -Changes for patch v115 - -- Added support for all remaining character devices - Thanks to Dennis Hou <smilax@mindmeld.yi.org> - -- Cleaned up netlink support -=============================================================================== -Changes for patch v116 - -- Added support for /dev/parport%d - Thanks to Tim Waugh <tim@cyberelk.demon.co.uk> - -- Fixed parallel port ATAPI tape driver - -- Fixed Atari SLM laser printer driver -=============================================================================== -Changes for patch v117 - -- Added support for COSA card - Thanks to Dennis Hou <smilax@mindmeld.yi.org> - -- Fixed drivers/char/ppdev.c: missing #include <linux/init.h> - -- Fixed drivers/char/ftape/zftape/zftape-init.c - Thanks to Vladimir Popov <mashgrad@usa.net> -=============================================================================== -Changes for patch v118 - -- Ported to kernel 2.3.15-pre3 - -- Fixed bug in loop driver - -- Unregister /dev/lp%d entries in drivers/char/lp.c - Thanks to Maciej W. Rozycki <macro@ds2.pg.gda.pl> -=============================================================================== -Changes for patch v119 - -- Ported to kernel 2.3.16 -=============================================================================== -Changes for patch v120 - -- Fixed bug in drivers/scsi/scsi.c - -- Added /dev/ppp - Thanks to Dennis Hou <smilax@mindmeld.yi.org> - -- Ported to kernel 2.3.17 -=============================================================================== -Changes for patch v121 - -- Fixed bug in drivers/block/loop.c - -- Ported to kernel 2.3.18 -=============================================================================== -Changes for patch v122 - -- Ported to kernel 2.3.19 -=============================================================================== -Changes for patch v123 - -- Ported to kernel 2.3.20 -=============================================================================== -Changes for patch v124 - -- Ported to kernel 2.3.21 -=============================================================================== -Changes for patch v125 - -- Created <devfs_get_info>, <devfs_set_info>, - <devfs_get_first_child> and <devfs_get_next_sibling> - Added <<dir>> parameter to <devfs_register>, <devfs_mk_compat>, - <devfs_mk_dir> and <devfs_find_handle> - Work sponsored by SGI - -- Fixed apparent bug in COSA driver - -- Re-instated "scsihosts=" boot option -=============================================================================== -Changes for patch v126 - -- Always create /dev/pts if CONFIG_UNIX98_PTYS=y - -- Fixed call to <devfs_mk_dir> in drivers/block/ide-disk.c - Thanks to Dennis Hou <smilax@mindmeld.yi.org> - -- Allow multiple unregistrations - -- Created /dev/scsi hierarchy - Work sponsored by SGI -=============================================================================== -Changes for patch v127 - -Work sponsored by SGI - -- No longer disable devpts if devfs enabled (caveat emptor) - -- Added flags array to struct gendisk and removed code from - drivers/scsi/sd.c - -- Created /dev/discs hierarchy -=============================================================================== -Changes for patch v128 - -Work sponsored by SGI - -- Created /dev/cdroms hierarchy -=============================================================================== -Changes for patch v129 - -Work sponsored by SGI - -- Removed compatibility entries for sound devices - -- Removed compatibility entries for printer devices - -- Removed compatibility entries for video4linux devices - -- Removed compatibility entries for parallel port devices - -- Removed compatibility entries for frame buffer devices -=============================================================================== -Changes for patch v130 - -Work sponsored by SGI - -- Added major and minor number to devfsd protocol - -- Incremented devfsd protocol revision to 5 - -- Removed compatibility entries for SoundBlaster CD-ROMs - -- Removed compatibility entries for netlink devices - -- Removed compatibility entries for SCSI generic devices - -- Removed compatibility entries for SCSI tape devices -=============================================================================== -Changes for patch v131 - -Work sponsored by SGI - -- Support info pointer for all devfs entry types - -- Added <<info>> parameter to <devfs_mk_dir> and <devfs_mk_symlink> - -- Removed /dev/st hierarchy - -- Removed /dev/sg hierarchy - -- Removed compatibility entries for loop devices - -- Removed compatibility entries for IDE tape devices - -- Removed compatibility entries for SCSI CD-ROMs - -- Removed /dev/sr hierarchy -=============================================================================== -Changes for patch v132 - -Work sponsored by SGI - -- Removed compatibility entries for floppy devices - -- Removed compatibility entries for RAMDISCs - -- Removed compatibility entries for meta-devices - -- Removed compatibility entries for SCSI discs - -- Created <devfs_make_root> - -- Removed /dev/sd hierarchy - -- Support "../" when searching devfs namespace - -- Created /dev/ide/host* hierarchy - -- Supported IDE hard discs in /dev/ide/host* hierarchy - -- Removed compatibility entries for IDE discs - -- Removed /dev/ide/hd hierarchy - -- Supported IDE CD-ROMs in /dev/ide/host* hierarchy - -- Removed compatibility entries for IDE CD-ROMs - -- Removed /dev/ide/cd hierarchy -=============================================================================== -Changes for patch v133 - -Work sponsored by SGI - -- Created <devfs_get_unregister_slave> - -- Fixed bug in fs/partitions/check.c when rescanning -=============================================================================== -Changes for patch v134 - -Work sponsored by SGI - -- Removed /dev/sd, /dev/sr, /dev/st and /dev/sg directories - -- Removed /dev/ide/hd directory - -- Exported <devfs_get_parent> - -- Created <devfs_register_tape> and /dev/tapes hierarchy - -- Removed /dev/ide/mt hierarchy - -- Removed /dev/ide/fd hierarchy - -- Ported to kernel 2.3.25 -=============================================================================== -Changes for patch v135 - -Work sponsored by SGI - -- Removed compatibility entries for virtual console capture devices - -- Removed unused <devfs_set_symlink_destination> - -- Removed compatibility entries for serial devices - -- Removed compatibility entries for console devices - -- Do not hide entries from devfsd or children - -- Removed DEVFS_FL_TTY_COMPAT flag - -- Removed "nottycompat" boot option - -- Removed <devfs_mk_compat> -=============================================================================== -Changes for patch v136 - -Work sponsored by SGI - -- Moved BSD pty devices to /dev/pty - -- Added DEVFS_FL_WAIT flag -=============================================================================== -Changes for patch v137 - -Work sponsored by SGI - -- Really fixed bug in fs/partitions/check.c when rescanning - -- Support new "disc" naming scheme in <get_removable_partition> - -- Allow NULL fops in <devfs_register> - -- Removed redundant name functions in SCSI disc and IDE drivers -=============================================================================== -Changes for patch v138 - -Work sponsored by SGI - -- Fixed old bugs in drivers/block/paride/pt.c, drivers/char/tpqic02.c, - drivers/net/wan/cosa.c and drivers/scsi/scsi.c - Thanks to Sergey Kubushin <ksi@ksi-linux.com> - -- Fall back to major table if NULL fops given to <devfs_register> -=============================================================================== -Changes for patch v139 - -Work sponsored by SGI - -- Corrected and moved <get_blkfops> and <get_chrfops> declarations - from arch/alpha/kernel/osf_sys.c to include/linux/fs.h - -- Removed name function from struct gendisk - -- Updated devfs FAQ -=============================================================================== -Changes for patch v140 - -Work sponsored by SGI - -- Ported to kernel 2.3.27 -=============================================================================== -Changes for patch v141 - -Work sponsored by SGI - -- Bug fix in arch/m68k/atari/joystick.c - -- Moved ISDN and capi devices to /dev/isdn -=============================================================================== -Changes for patch v142 - -Work sponsored by SGI - -- Bug fix in drivers/block/ide-probe.c (patch confusion) -=============================================================================== -Changes for patch v143 - -Work sponsored by SGI - -- Bug fix in drivers/block/blkpg.c:partition_name() -=============================================================================== -Changes for patch v144 - -Work sponsored by SGI - -- Ported to kernel 2.3.29 - -- Removed calls to <devfs_register> from cdu31a, cm206, mcd and mcdx - CD-ROM drivers: generic driver handles this now - -- Moved joystick devices to /dev/joysticks -=============================================================================== -Changes for patch v145 - -Work sponsored by SGI - -- Ported to kernel 2.3.30-pre3 - -- Register whole-disc entry even for invalid partition tables - -- Fixed bug in mounting root FS when initrd enabled - -- Fixed device entry leak with IDE CD-ROMs - -- Fixed compile problem with drivers/isdn/isdn_common.c - -- Moved COSA devices to /dev/cosa - -- Support fifos when unregistering - -- Created <devfs_register_series> and used in many drivers - -- Moved Coda devices to /dev/coda - -- Moved parallel port IDE tapes to /dev/pt - -- Moved parallel port IDE generic devices to /dev/pg -=============================================================================== -Changes for patch v146 - -Work sponsored by SGI - -- Removed obsolete DEVFS_FL_COMPAT and DEVFS_FL_TOLERANT flags - -- Fixed compile problem with fs/coda/psdev.c - -- Reinstate change to <devfs_register_blkdev> in - drivers/block/ide-probe.c now that fs/isofs/inode.c is fixed - -- Switched to <devfs_register_blkdev> in drivers/block/floppy.c, - drivers/scsi/sr.c and drivers/block/md.c - -- Moved DAC960 devices to /dev/dac960 -=============================================================================== -Changes for patch v147 - -Work sponsored by SGI - -- Ported to kernel 2.3.32-pre4 -=============================================================================== -Changes for patch v148 - -Work sponsored by SGI - -- Removed kmod support: use devfsd instead - -- Moved miscellaneous character devices to /dev/misc -=============================================================================== -Changes for patch v149 - -Work sponsored by SGI - -- Ensure include/linux/joystick.h is OK for user-space - -- Improved debugging in <get_vfs_inode> - -- Ensure dentries created by devfsd will be cleaned up -=============================================================================== -Changes for patch v150 - -Work sponsored by SGI - -- Ported to kernel 2.3.34 -=============================================================================== -Changes for patch v151 - -Work sponsored by SGI - -- Ported to kernel 2.3.35-pre1 - -- Created <devfs_get_name> -=============================================================================== -Changes for patch v152 - -Work sponsored by SGI - -- Updated sample modules.conf - -- Ported to kernel 2.3.36-pre1 -=============================================================================== -Changes for patch v153 - -Work sponsored by SGI - -- Ported to kernel 2.3.42 - -- Removed <devfs_fill_file> -=============================================================================== -Changes for patch v154 - -Work sponsored by SGI - -- Took account of device number changes for /dev/fb* -=============================================================================== -Changes for patch v155 - -Work sponsored by SGI - -- Ported to kernel 2.3.43-pre8 - -- Moved /dev/tty0 to /dev/vc/0 - -- Moved sequence number formatting from <_tty_make_name> to drivers -=============================================================================== -Changes for patch v156 - -Work sponsored by SGI - -- Fixed breakage in drivers/scsi/sd.c due to recent SCSI changes -=============================================================================== -Changes for patch v157 - -Work sponsored by SGI - -- Ported to kernel 2.3.45 -=============================================================================== -Changes for patch v158 - -Work sponsored by SGI - -- Ported to kernel 2.3.46-pre2 -=============================================================================== -Changes for patch v159 - -Work sponsored by SGI - -- Fixed drivers/block/md.c - Thanks to Mike Galbraith <mikeg@weiden.de> - -- Documentation fixes - -- Moved device registration from <lp_init> to <lp_register> - Thanks to Tim Waugh <twaugh@redhat.com> -=============================================================================== -Changes for patch v160 - -Work sponsored by SGI - -- Fixed drivers/char/joystick/joystick.c - Thanks to Vojtech Pavlik <vojtech@suse.cz> - -- Documentation updates - -- Fixed arch/i386/kernel/mtrr.c if procfs and devfs not enabled - -- Fixed drivers/char/stallion.c -=============================================================================== -Changes for patch v161 - -Work sponsored by SGI - -- Remove /dev/ide when ide-mod is unloaded - -- Fixed bug in drivers/block/ide-probe.c when secondary but no primary - -- Added DEVFS_FL_NO_PERSISTENCE flag - -- Used new DEVFS_FL_NO_PERSISTENCE flag for Unix98 pty slaves - -- Removed unnecessary call to <update_devfs_inode_from_entry> in - <devfs_readdir> - -- Only set auto-ownership for /dev/pty/s* -=============================================================================== -Changes for patch v162 - -Work sponsored by SGI - -- Set inode->i_size to correct size for symlinks - Thanks to Jeremy Fitzhardinge <jeremy@goop.org> - -- Only give lookup() method to directories to comply with new VFS - assumptions - -- Remove unnecessary tests in symlink methods - -- Don't kill existing block ops in <devfs_read_inode> - -- Restore auto-ownership for /dev/pty/m* -=============================================================================== -Changes for patch v163 - -Work sponsored by SGI - -- Don't create missing directories in <devfs_find_handle> - -- Removed Documentation/filesystems/devfs/mk-devlinks - -- Updated Documentation/filesystems/devfs/README -=============================================================================== -Changes for patch v164 - -Work sponsored by SGI - -- Fixed CONFIG_DEVFS breakage in drivers/char/serial.c introduced in - linux-2.3.99-pre6-7 -=============================================================================== -Changes for patch v165 - -Work sponsored by SGI - -- Ported to kernel 2.3.99-pre6 -=============================================================================== -Changes for patch v166 - -Work sponsored by SGI - -- Added CONFIG_DEVFS_MOUNT -=============================================================================== -Changes for patch v167 - -Work sponsored by SGI - -- Updated Documentation/filesystems/devfs/README - -- Updated sample modules.conf -=============================================================================== -Changes for patch v168 - -Work sponsored by SGI - -- Disabled multi-mount capability (use VFS bindings instead) - -- Updated README from master HTML file -=============================================================================== -Changes for patch v169 - -Work sponsored by SGI - -- Removed multi-mount code - -- Removed compatibility macros: VFS has changed too much -=============================================================================== -Changes for patch v170 - -Work sponsored by SGI - -- Updated README from master HTML file - -- Merged devfs inode into devfs entry -=============================================================================== -Changes for patch v171 - -Work sponsored by SGI - -- Updated sample modules.conf - -- Removed dead code in <devfs_register> which used to call - <free_dentries> - -- Ported to kernel 2.4.0-test2-pre3 -=============================================================================== -Changes for patch v172 - -Work sponsored by SGI - -- Changed interface to <devfs_register> - -- Changed interface to <devfs_register_series> -=============================================================================== -Changes for patch v173 - -Work sponsored by SGI - -- Simplified interface to <devfs_mk_symlink> - -- Simplified interface to <devfs_mk_dir> - -- Simplified interface to <devfs_find_handle> -=============================================================================== -Changes for patch v174 - -Work sponsored by SGI - -- Updated README from master HTML file -=============================================================================== -Changes for patch v175 - -Work sponsored by SGI - -- DocBook update for fs/devfs/base.c - Thanks to Tim Waugh <twaugh@redhat.com> - -- Removed stale fs/tunnel.c (was never used or completed) -=============================================================================== -Changes for patch v176 - -Work sponsored by SGI - -- Updated ToDo list - -- Removed sample modules.conf: now distributed with devfsd - -- Updated README from master HTML file - -- Ported to kernel 2.4.0-test3-pre4 (which had devfs-patch-v174) -=============================================================================== -Changes for patch v177 - -- Updated README from master HTML file - -- Documentation cleanups - -- Ensure <devfs_generate_path> terminates string for root entry - Thanks to Tim Jansen <tim@tjansen.de> - -- Exported <devfs_get_name> to modules - -- Make <devfs_mk_symlink> send events to devfsd - -- Cleaned up option processing in <devfs_setup> - -- Fixed bugs in handling symlinks: could leak or cause Oops - -- Cleaned up directory handling by separating fops - Thanks to Alexander Viro <viro@parcelfarce.linux.theplanet.co.uk> -=============================================================================== -Changes for patch v178 - -- Fixed handling of inverted options in <devfs_setup> -=============================================================================== -Changes for patch v179 - -- Adjusted <try_modload> to account for <devfs_generate_path> fix -=============================================================================== -Changes for patch v180 - -- Fixed !CONFIG_DEVFS_FS stub declaration of <devfs_get_info> -=============================================================================== -Changes for patch v181 - -- Answered question posed by Al Viro and removed his comments from <devfs_open> - -- Moved setting of registered flag after other fields are changed - -- Fixed race between <devfsd_close> and <devfsd_notify_one> - -- Global VFS changes added bogus BKL to devfsd_close(): removed - -- Widened locking in <devfs_readlink> and <devfs_follow_link> - -- Replaced <devfsd_read> stack usage with <devfsd_ioctl> kmalloc - -- Simplified locking in <devfsd_ioctl> and fixed memory leak -=============================================================================== -Changes for patch v182 - -- Created <devfs_*alloc_major> and <devfs_*alloc_devnum> - -- Removed broken devnum allocation and use <devfs_alloc_devnum> - -- Fixed old devnum leak by calling new <devfs_dealloc_devnum> - -- Created <devfs_*alloc_unique_number> - -- Fixed number leak for /dev/cdroms/cdrom%d - -- Fixed number leak for /dev/discs/disc%d -=============================================================================== -Changes for patch v183 - -- Fixed bug in <devfs_setup> which could hang boot process -=============================================================================== -Changes for patch v184 - -- Documentation typo fix for fs/devfs/util.c - -- Fixed drivers/char/stallion.c for devfs - -- Added DEVFSD_NOTIFY_DELETE event - -- Updated README from master HTML file - -- Removed #include <asm/segment.h> from fs/devfs/base.c -=============================================================================== -Changes for patch v185 - -- Made <block_semaphore> and <char_semaphore> in fs/devfs/util.c - private - -- Fixed inode table races by removing it and using inode->u.generic_ip - instead - -- Moved <devfs_read_inode> into <get_vfs_inode> - -- Moved <devfs_write_inode> into <devfs_notify_change> -=============================================================================== -Changes for patch v186 - -- Fixed race in <devfs_do_symlink> for uni-processor - -- Updated README from master HTML file -=============================================================================== -Changes for patch v187 - -- Fixed drivers/char/stallion.c for devfs - -- Fixed drivers/char/rocket.c for devfs - -- Fixed bug in <devfs_alloc_unique_number>: limited to 128 numbers -=============================================================================== -Changes for patch v188 - -- Updated major masks in fs/devfs/util.c up to Linus' "no new majors" - proclamation. Block: were 126 now 122 free, char: were 26 now 19 free - -- Updated README from master HTML file - -- Removed remnant of multi-mount support in <devfs_mknod> - -- Removed unused DEVFS_FL_SHOW_UNREG flag -=============================================================================== -Changes for patch v189 - -- Removed nlink field from struct devfs_inode - -- Removed auto-ownership for /dev/pty/* (BSD ptys) and used - DEVFS_FL_CURRENT_OWNER|DEVFS_FL_NO_PERSISTENCE for /dev/pty/s* (just - like Unix98 pty slaves) and made /dev/pty/m* rw-rw-rw- access -=============================================================================== -Changes for patch v190 - -- Updated README from master HTML file - -- Replaced BKL with global rwsem to protect symlink data (quick and - dirty hack) -=============================================================================== -Changes for patch v191 - -- Replaced global rwsem for symlink with per-link refcount -=============================================================================== -Changes for patch v192 - -- Removed unnecessary #ifdef CONFIG_DEVFS_FS from arch/i386/kernel/mtrr.c - -- Ported to kernel 2.4.10-pre11 - -- Set inode->i_mapping->a_ops for block nodes in <get_vfs_inode> -=============================================================================== -Changes for patch v193 - -- Went back to global rwsem for symlinks (refcount scheme no good) -=============================================================================== -Changes for patch v194 - -- Fixed overrun in <devfs_link> by removing function (not needed) - -- Updated README from master HTML file -=============================================================================== -Changes for patch v195 - -- Fixed buffer underrun in <try_modload> - -- Moved down_read() from <search_for_entry_in_dir> to <find_entry> -=============================================================================== -Changes for patch v196 - -- Fixed race in <devfsd_ioctl> when setting event mask - Thanks to Kari Hurtta <hurtta@leija.mh.fmi.fi> - -- Avoid deadlock in <devfs_follow_link> by using temporary buffer -=============================================================================== -Changes for patch v197 - -- First release of new locking code for devfs core (v1.0) - -- Fixed bug in drivers/cdrom/cdrom.c -=============================================================================== -Changes for patch v198 - -- Discard temporary buffer, now use "%s" for dentry names - -- Don't generate path in <try_modload>: use fake entry instead - -- Use "existing" directory in <_devfs_make_parent_for_leaf> - -- Use slab cache rather than fixed buffer for devfsd events -=============================================================================== -Changes for patch v199 - -- Removed obsolete usage of DEVFS_FL_NO_PERSISTENCE - -- Send DEVFSD_NOTIFY_REGISTERED events in <devfs_mk_dir> - -- Fixed locking bug in <devfs_d_revalidate_wait> due to typo - -- Do not send CREATE, CHANGE, ASYNC_OPEN or DELETE events from devfsd - or children -=============================================================================== -Changes for patch v200 - -- Ported to kernel 2.5.1-pre2 -=============================================================================== -Changes for patch v201 - -- Fixed bug in <devfsd_read>: was dereferencing freed pointer -=============================================================================== -Changes for patch v202 - -- Fixed bug in <devfsd_close>: was dereferencing freed pointer - -- Added process group check for devfsd privileges -=============================================================================== -Changes for patch v203 - -- Use SLAB_ATOMIC in <devfsd_notify_de> from <devfs_d_delete> -=============================================================================== -Changes for patch v204 - -- Removed long obsolete rc.devfs - -- Return old entry in <devfs_mk_dir> for 2.4.x kernels - -- Updated README from master HTML file - -- Increment refcount on module in <check_disc_changed> - -- Created <devfs_get_handle> and exported <devfs_put> - -- Increment refcount on module in <devfs_get_ops> - -- Created <devfs_put_ops> and used where needed to fix races - -- Added clarifying comments in response to preliminary EMC code review - -- Added poisoning to <devfs_put> - -- Improved debugging messages - -- Fixed unregister bugs in drivers/md/lvm-fs.c -=============================================================================== -Changes for patch v205 - -- Corrected (made useful) debugging message in <unregister> - -- Moved <kmem_cache_create> in <mount_devfs_fs> to <init_devfs_fs> - -- Fixed drivers/md/lvm-fs.c to create "lvm" entry - -- Added magic number to guard against scribbling drivers - -- Only return old entry in <devfs_mk_dir> if a directory - -- Defined macros for error and debug messages - -- Updated README from master HTML file -=============================================================================== -Changes for patch v206 - -- Added support for multiple Compaq cpqarray controllers - -- Fixed (rare, old) race in <devfs_lookup> -=============================================================================== -Changes for patch v207 - -- Fixed deadlock bug in <devfs_d_revalidate_wait> - -- Tag VFS deletable in <devfs_mk_symlink> if handle ignored - -- Updated README from master HTML file -=============================================================================== -Changes for patch v208 - -- Added KERN_* to remaining messages - -- Cleaned up declaration of <stat_read> - -- Updated README from master HTML file -=============================================================================== -Changes for patch v209 - -- Updated README from master HTML file - -- Removed silently introduced calls to lock_kernel() and - unlock_kernel() due to recent VFS locking changes. BKL isn't - required in devfs - -- Changed <devfs_rmdir> to allow later additions if not yet empty - -- Added calls to <devfs_register_partitions> in drivers/block/blkpc.c - <add_partition> and <del_partition> - -- Fixed bug in <devfs_alloc_unique_number>: was clearing beyond - bitfield - -- Fixed bitfield data type for <devfs_*alloc_devnum> - -- Made major bitfield type and initialiser 64 bit safe -=============================================================================== -Changes for patch v210 - -- Updated fs/devfs/util.c to fix shift warning on 64 bit machines - Thanks to Anton Blanchard <anton@samba.org> - -- Updated README from master HTML file -=============================================================================== -Changes for patch v211 - -- Do not put miscellaneous character devices in /dev/misc if they - specify their own directory (i.e. contain a '/' character) - -- Copied macro for error messages from fs/devfs/base.c to - fs/devfs/util.c and made use of this macro - -- Removed 2.4.x compatibility code from fs/devfs/base.c -=============================================================================== -Changes for patch v212 - -- Added BKL to <devfs_open> because drivers still need it -=============================================================================== -Changes for patch v213 - -- Protected <scan_dir_for_removable> and <get_removable_partition> - from changing directory contents -=============================================================================== -Changes for patch v214 - -- Switched to ISO C structure field initialisers - -- Switch to set_current_state() and move before add_wait_queue() - -- Updated README from master HTML file - -- Fixed devfs entry leak in <devfs_readdir> when *readdir fails -=============================================================================== -Changes for patch v215 - -- Created <devfs_find_and_unregister> - -- Switched many functions from <devfs_find_handle> to - <devfs_find_and_unregister> - -- Switched many functions from <devfs_find_handle> to <devfs_get_handle> -=============================================================================== -Changes for patch v216 - -- Switched arch/ia64/sn/io/hcl.c from <devfs_find_handle> to - <devfs_get_handle> - -- Removed deprecated <devfs_find_handle> -=============================================================================== -Changes for patch v217 - -- Exported <devfs_find_and_unregister> and <devfs_only> to modules - -- Updated README from master HTML file - -- Fixed module unload race in <devfs_open> -=============================================================================== -Changes for patch v218 - -- Removed DEVFS_FL_AUTO_OWNER flag - -- Switched lingering structure field initialiser to ISO C - -- Added locking when setting/clearing flags - -- Documentation fix in fs/devfs/util.c diff --git a/Documentation/filesystems/devfs/README b/Documentation/filesystems/devfs/README deleted file mode 100644 index 54366ecc241..00000000000 --- a/Documentation/filesystems/devfs/README +++ /dev/null @@ -1,1964 +0,0 @@ -Devfs (Device File System) FAQ - - -Linux Devfs (Device File System) FAQ -Richard Gooch -20-AUG-2002 - - -Document languages: - - - - - - - ------------------------------------------------------------------------------ - -NOTE: the master copy of this document is available online at: - -http://www.atnf.csiro.au/~rgooch/linux/docs/devfs.html -and looks much better than the text version distributed with the -kernel sources. A mirror site is available at: - -http://www.ras.ucalgary.ca/~rgooch/linux/docs/devfs.html - -There is also an optional daemon that may be used with devfs. You can -find out more about it at: - -http://www.atnf.csiro.au/~rgooch/linux/ - -A mailing list is available which you may subscribe to. Send -email -to majordomo@oss.sgi.com with the following line in the -body of the message: -subscribe devfs -To unsubscribe, send the message body: -unsubscribe devfs -instead. The list is archived at - -http://oss.sgi.com/projects/devfs/archive/. - ------------------------------------------------------------------------------ - -Contents - - -What is it? - -Why do it? - -Who else does it? - -How it works - -Operational issues (essential reading) - -Instructions for the impatient -Permissions persistence across reboots -Dealing with drivers without devfs support -All the way with Devfs -Other Issues -Kernel Naming Scheme -Devfsd Naming Scheme -Old Compatibility Names -SCSI Host Probing Issues - - - -Device drivers currently ported - -Allocation of Device Numbers - -Questions and Answers - -Making things work -Alternatives to devfs -What I don't like about devfs -How to report bugs -Strange kernel messages -Compilation problems with devfsd - - -Other resources - -Translations of this document - - ------------------------------------------------------------------------------ - - -What is it? - -Devfs is an alternative to "real" character and block special devices -on your root filesystem. Kernel device drivers can register devices by -name rather than major and minor numbers. These devices will appear in -devfs automatically, with whatever default ownership and -protection the driver specified. A daemon (devfsd) can be used to -override these defaults. Devfs has been in the kernel since 2.3.46. - -NOTE that devfs is entirely optional. If you prefer the old -disc-based device nodes, then simply leave CONFIG_DEVFS_FS=n (the -default). In this case, nothing will change. ALSO NOTE that if you do -enable devfs, the defaults are such that full compatibility is -maintained with the old devices names. - -There are two aspects to devfs: one is the underlying device -namespace, which is a namespace just like any mounted filesystem. The -other aspect is the filesystem code which provides a view of the -device namespace. The reason I make a distinction is because devfs -can be mounted many times, with each mount showing the same device -namespace. Changes made are global to all mounted devfs filesystems. -Also, because the devfs namespace exists without any devfs mounts, you -can easily mount the root filesystem by referring to an entry in the -devfs namespace. - - -The cost of devfs is a small increase in kernel code size and memory -usage. About 7 pages of code (some of that in __init sections) and 72 -bytes for each entry in the namespace. A modest system has only a -couple of hundred device entries, so this costs a few more -pages. Compare this with the suggestion to put /dev on a <a -href="#why-faq-ramdisc">ramdisc. - -On a typical machine, the cost is under 0.2 percent. On a modest -system with 64 MBytes of RAM, the cost is under 0.1 percent. The -accusations of "bloatware" levelled at devfs are not justified. - ------------------------------------------------------------------------------ - - -Why do it? - -There are several problems that devfs addresses. Some of these -problems are more serious than others (depending on your point of -view), and some can be solved without devfs. However, the totality of -these problems really calls out for devfs. - -The choice is a patchwork of inefficient user space solutions, which -are complex and likely to be fragile, or to use a simple and efficient -devfs which is robust. - -There have been many counter-proposals to devfs, all seeking to -provide some of the benefits without actually implementing devfs. So -far there has been an absence of code and no proposed alternative has -been able to provide all the features that devfs does. Further, -alternative proposals require far more complexity in user-space (and -still deliver less functionality than devfs). Some people have the -mantra of reducing "kernel bloat", but don't consider the effects on -user-space. - -A good solution limits the total complexity of kernel-space and -user-space. - - -Major&minor allocation - -The existing scheme requires the allocation of major and minor device -numbers for each and every device. This means that a central -co-ordinating authority is required to issue these device numbers -(unless you're developing a "private" device driver), in order to -preserve uniqueness. Devfs shifts the burden to a namespace. This may -not seem like a huge benefit, but actually it is. Since driver authors -will naturally choose a device name which reflects the functionality -of the device, there is far less potential for namespace conflict. -Solving this requires a kernel change. - -/dev management - -Because you currently access devices through device nodes, these must -be created by the system administrator. For standard devices you can -usually find a MAKEDEV programme which creates all these (hundreds!) -of nodes. This means that changes in the kernel must be reflected by -changes in the MAKEDEV programme, or else the system administrator -creates device nodes by hand. - -The basic problem is that there are two separate databases of -major and minor numbers. One is in the kernel and one is in /dev (or -in a MAKEDEV programme, if you want to look at it that way). This is -duplication of information, which is not good practice. -Solving this requires a kernel change. - -/dev growth - -A typical /dev has over 1200 nodes! Most of these devices simply don't -exist because the hardware is not available. A huge /dev increases the -time to access devices (I'm just referring to the dentry lookup times -and the time taken to read inodes off disc: the next subsection shows -some more horrors). - -An example of how big /dev can grow is if we consider SCSI devices: - -host 6 bits (say up to 64 hosts on a really big machine) -channel 4 bits (say up to 16 SCSI buses per host) -id 4 bits -lun 3 bits -partition 6 bits -TOTAL 23 bits - - -This requires 8 Mega (1024*1024) inodes if we want to store all -possible device nodes. Even if we scrap everything but id,partition -and assume a single host adapter with a single SCSI bus and only one -logical unit per SCSI target (id), that's still 10 bits or 1024 -inodes. Each VFS inode takes around 256 bytes (kernel 2.1.78), so -that's 256 kBytes of inode storage on disc (assuming real inodes take -a similar amount of space as VFS inodes). This is actually not so bad, -because disc is cheap these days. Embedded systems would care about -256 kBytes of /dev inodes, but you could argue that embedded systems -would have hand-tuned /dev directories. I've had to do just that on my -embedded systems, but I would rather just leave it to devfs. - -Another issue is the time taken to lookup an inode when first -referenced. Not only does this take time in scanning through a list in -memory, but also the seek times to read the inodes off disc. -This could be solved in user-space using a clever programme which -scanned the kernel logs and deleted /dev entries which are not -available and created them when they were available. This programme -would need to be run every time a new module was loaded, which would -slow things down a lot. - -There is an existing programme called scsidev which will automatically -create device nodes for SCSI devices. It can do this by scanning files -in /proc/scsi. Unfortunately, to extend this idea to other device -nodes would require significant modifications to existing drivers (so -they too would provide information in /proc). This is a non-trivial -change (I should know: devfs has had to do something similar). Once -you go to this much effort, you may as well use devfs itself (which -also provides this information). Furthermore, such a system would -likely be implemented in an ad-hoc fashion, as different drivers will -provide their information in different ways. - -Devfs is much cleaner, because it (naturally) has a uniform mechanism -to provide this information: the device nodes themselves! - - -Node to driver file_operations translation - -There is an important difference between the way disc-based character -and block nodes and devfs entries make the connection between an entry -in /dev and the actual device driver. - -With the current 8 bit major and minor numbers the connection between -disc-based c&b nodes and per-major drivers is done through a -fixed-length table of 128 entries. The various filesystem types set -the inode operations for c&b nodes to {chr,blk}dev_inode_operations, -so when a device is opened a few quick levels of indirection bring us -to the driver file_operations. - -For miscellaneous character devices a second step is required: there -is a scan for the driver entry with the same minor number as the file -that was opened, and the appropriate minor open method is called. This -scanning is done *every time* you open a device node. Potentially, you -may be searching through dozens of misc. entries before you find your -open method. While not an enormous performance overhead, this does -seem pointless. - -Linux *must* move beyond the 8 bit major and minor barrier, -somehow. If we simply increase each to 16 bits, then the indexing -scheme used for major driver lookup becomes untenable, because the -major tables (one each for character and block devices) would need to -be 64 k entries long (512 kBytes on x86, 1 MByte for 64 bit -systems). So we would have to use a scheme like that used for -miscellaneous character devices, which means the search time goes up -linearly with the average number of major device drivers on your -system. Not all "devices" are hardware, some are higher-level drivers -like KGI, so you can get more "devices" without adding hardware -You can improve this by creating an ordered (balanced:-) -binary tree, in which case your search time becomes log(N). -Alternatively, you can use hashing to speed up the search. -But why do that search at all if you don't have to? Once again, it -seems pointless. - -Note that devfs doesn't use the major&minor system. For devfs -entries, the connection is done when you lookup the /dev entry. When -devfs_register() is called, an internal table is appended which has -the entry name and the file_operations. If the dentry cache doesn't -have the /dev entry already, this internal table is scanned to get the -file_operations, and an inode is created. If the dentry cache already -has the entry, there is *no lookup time* (other than the dentry scan -itself, but we can't avoid that anyway, and besides Linux dentries -cream other OS's which don't have them:-). Furthermore, the number of -node entries in a devfs is only the number of available device -entries, not the number of *conceivable* entries. Even if you remove -unnecessary entries in a disc-based /dev, the number of conceivable -entries remains the same: you just limit yourself in order to save -space. - -Devfs provides a fast connection between a VFS node and the device -driver, in a scalable way. - -/dev as a system administration tool - -Right now /dev contains a list of conceivable devices, most of which I -don't have. Devfs only shows those devices available on my -system. This means that listing /dev is a handy way of checking what -devices are available. - -Major&minor size - -Existing major and minor numbers are limited to 8 bits each. This is -now a limiting factor for some drivers, particularly the SCSI disc -driver, which consumes a single major number. Only 16 discs are -supported, and each disc may have only 15 partitions. Maybe this isn't -a problem for you, but some of us are building huge Linux systems with -disc arrays. With devfs an arbitrary pointer can be associated with -each device entry, which can be used to give an effective 32 bit -device identifier (i.e. that's like having a 32 bit minor -number). Since this is private to the kernel, there are no C library -compatibility issues which you would have with increasing major and -minor number sizes. See the section on "Allocation of Device Numbers" -for details on maintaining compatibility with userspace. - -Solving this requires a kernel change. - -Since writing this, the kernel has been modified so that the SCSI disc -driver has more major numbers allocated to it and now supports up to -128 discs. Since these major numbers are non-contiguous (a result of -unplanned expansion), the implementation is a little more cumbersome -than originally. - -Just like the changes to IPv4 to fix impending limitations in the -address space, people find ways around the limitations. In the long -run, however, solutions like IPv6 or devfs can't be put off forever. - -Read-only root filesystem - -Having your device nodes on the root filesystem means that you can't -operate properly with a read-only root filesystem. This is because you -want to change ownerships and protections of tty devices. Existing -practice prevents you using a CD-ROM as your root filesystem for a -*real* system. Sure, you can boot off a CD-ROM, but you can't change -tty ownerships, so it's only good for installing. - -Also, you can't use a shared NFS root filesystem for a cluster of -discless Linux machines (having tty ownerships changed on a common -/dev is not good). Nor can you embed your root filesystem in a -ROM-FS. - -You can get around this by creating a RAMDISC at boot time, making -an ext2 filesystem in it, mounting it somewhere and copying the -contents of /dev into it, then unmounting it and mounting it over -/dev. - -A devfs is a cleaner way of solving this. - -Non-Unix root filesystem - -Non-Unix filesystems (such as NTFS) can't be used for a root -filesystem because they variously don't support character and block -special files or symbolic links. You can't have a separate disc-based -or RAMDISC-based filesystem mounted on /dev because you need device -nodes before you can mount these. Devfs can be mounted without any -device nodes. Devlinks won't work because symlinks aren't supported. -An alternative solution is to use initrd to mount a RAMDISC initial -root filesystem (which is populated with a minimal set of device -nodes), and then construct a new /dev in another RAMDISC, and finally -switch to your non-Unix root filesystem. This requires clever boot -scripts and a fragile and conceptually complex boot procedure. - -Devfs solves this in a robust and conceptually simple way. - -PTY security - -Current pseudo-tty (pty) devices are owned by root and read-writable -by everyone. The user of a pty-pair cannot change -ownership/protections without being suid-root. - -This could be solved with a secure user-space daemon which runs as -root and does the actual creation of pty-pairs. Such a daemon would -require modification to *every* programme that wants to use this new -mechanism. It also slows down creation of pty-pairs. - -An alternative is to create a new open_pty() syscall which does much -the same thing as the user-space daemon. Once again, this requires -modifications to pty-handling programmes. - -The devfs solution allows a device driver to "tag" certain device -files so that when an unopened device is opened, the ownerships are -changed to the current euid and egid of the opening process, and the -protections are changed to the default registered by the driver. When -the device is closed ownership is set back to root and protections are -set back to read-write for everybody. No programme need be changed. -The devpts filesystem provides this auto-ownership feature for Unix98 -ptys. It doesn't support old-style pty devices, nor does it have all -the other features of devfs. - -Intelligent device management - -Devfs implements a simple yet powerful protocol for communication with -a device management daemon (devfsd) which runs in user space. It is -possible to send a message (either synchronously or asynchronously) to -devfsd on any event, such as registration/unregistration of device -entries, opening and closing devices, looking up inodes, scanning -directories and more. This has many possibilities. Some of these are -already implemented. See: - - -http://www.atnf.csiro.au/~rgooch/linux/ - -Device entry registration events can be used by devfsd to change -permissions of newly-created device nodes. This is one mechanism to -control device permissions. - -Device entry registration/unregistration events can be used to run -programmes or scripts. This can be used to provide automatic mounting -of filesystems when a new block device media is inserted into the -drive. - -Asynchronous device open and close events can be used to implement -clever permissions management. For example, the default permissions on -/dev/dsp do not allow everybody to read from the device. This is -sensible, as you don't want some remote user recording what you say at -your console. However, the console user is also prevented from -recording. This behaviour is not desirable. With asynchronous device -open and close events, you can have devfsd run a programme or script -when console devices are opened to change the ownerships for *other* -device nodes (such as /dev/dsp). On closure, you can run a different -script to restore permissions. An advantage of this scheme over -modifying the C library tty handling is that this works even if your -programme crashes (how many times have you seen the utmp database with -lingering entries for non-existent logins?). - -Synchronous device open events can be used to perform intelligent -device access protections. Before the device driver open() method is -called, the daemon must first validate the open attempt, by running an -external programme or script. This is far more flexible than access -control lists, as access can be determined on the basis of other -system conditions instead of just the UID and GID. - -Inode lookup events can be used to authenticate module autoload -requests. Instead of using kmod directly, the event is sent to -devfsd which can implement an arbitrary authentication before loading -the module itself. - -Inode lookup events can also be used to construct arbitrary -namespaces, without having to resort to populating devfs with symlinks -to devices that don't exist. - -Speculative Device Scanning - -Consider an application (like cdparanoia) that wants to find all -CD-ROM devices on the system (SCSI, IDE and other types), whether or -not their respective modules are loaded. The application must -speculatively open certain device nodes (such as /dev/sr0 for the SCSI -CD-ROMs) in order to make sure the module is loaded. This requires -that all Linux distributions follow the standard device naming scheme -(last time I looked RedHat did things differently). Devfs solves the -naming problem. - -The same application also wants to see which devices are actually -available on the system. With the existing system it needs to read the -/dev directory and speculatively open each /dev/sr* device to -determine if the device exists or not. With a large /dev this is an -inefficient operation, especially if there are many /dev/sr* nodes. A -solution like scsidev could reduce the number of /dev/sr* entries (but -of course that also requires all that inefficient directory scanning). - -With devfs, the application can open the /dev/sr directory -(which triggers the module autoloading if required), and proceed to -read /dev/sr. Since only the available devices will have -entries, there are no inefficencies in directory scanning or device -openings. - ------------------------------------------------------------------------------ - -Who else does it? - -FreeBSD has a devfs implementation. Solaris and AIX each have a -pseudo-devfs (something akin to scsidev but for all devices, with some -unspecified kernel support). BeOS, Plan9 and QNX also have it. SGI's -IRIX 6.4 and above also have a device filesystem. - -While we shouldn't just automatically do something because others do -it, we should not ignore the work of others either. FreeBSD has a lot -of competent people working on it, so their opinion should not be -blithely ignored. - ------------------------------------------------------------------------------ - - -How it works - -Registering device entries - -For every entry (device node) in a devfs-based /dev a driver must call -devfs_register(). This adds the name of the device entry, the -file_operations structure pointer and a few other things to an -internal table. Device entries may be added and removed at any -time. When a device entry is registered, it automagically appears in -any mounted devfs'. - -Inode lookup - -When a lookup operation on an entry is performed and if there is no -driver information for that entry devfs will attempt to call -devfsd. If still no driver information can be found then a negative -dentry is yielded and the next stage operation will be called by the -VFS (such as create() or mknod() inode methods). If driver information -can be found, an inode is created (if one does not exist already) and -all is well. - -Manually creating device nodes - -The mknod() method allows you to create an ordinary named pipe in the -devfs, or you can create a character or block special inode if one -does not already exist. You may wish to create a character or block -special inode so that you can set permissions and ownership. Later, if -a device driver registers an entry with the same name, the -permissions, ownership and times are retained. This is how you can set -the protections on a device even before the driver is loaded. Once you -create an inode it appears in the directory listing. - -Unregistering device entries - -A device driver calls devfs_unregister() to unregister an entry. - -Chroot() gaols - -2.2.x kernels - -The semantics of inode creation are different when devfs is mounted -with the "explicit" option. Now, when a device entry is registered, it -will not appear until you use mknod() to create the device. It doesn't -matter if you mknod() before or after the device is registered with -devfs_register(). The purpose of this behaviour is to support -chroot(2) gaols, where you want to mount a minimal devfs inside the -gaol. Only the devices you specifically want to be available (through -your mknod() setup) will be accessible. - -2.4.x kernels - -As of kernel 2.3.99, the VFS has had the ability to rebind parts of -the global filesystem namespace into another part of the namespace. -This now works even at the leaf-node level, which means that -individual files and device nodes may be bound into other parts of the -namespace. This is like making links, but better, because it works -across filesystems (unlike hard links) and works through chroot() -gaols (unlike symbolic links). - -Because of these improvements to the VFS, the multi-mount capability -in devfs is no longer needed. The administrator may create a minimal -device tree inside a chroot(2) gaol by using VFS bindings. As this -provides most of the features of the devfs multi-mount capability, I -removed the multi-mount support code (after issuing an RFC). This -yielded code size reductions and simplifications. - -If you want to construct a minimal chroot() gaol, the following -command should suffice: - -mount --bind /dev/null /gaol/dev/null - - -Repeat for other device nodes you want to expose. Simple! - ------------------------------------------------------------------------------ - - -Operational issues - - -Instructions for the impatient - -Nobody likes reading documentation. People just want to get in there -and play. So this section tells you quickly the steps you need to take -to run with devfs mounted over /dev. Skip these steps and you will end -up with a nearly unbootable system. Subsequent sections describe the -issues in more detail, and discuss non-essential configuration -options. - -Devfsd -OK, if you're reading this, I assume you want to play with -devfs. First you should ensure that /usr/src/linux contains a -recent kernel source tree. Then you need to compile devfsd, the device -management daemon, available at - -http://www.atnf.csiro.au/~rgooch/linux/. -Because the kernel has a naming scheme -which is quite different from the old naming scheme, you need to -install devfsd so that software and configuration files that use the -old naming scheme will not break. - -Compile and install devfsd. You will be provided with a default -configuration file /etc/devfsd.conf which will provide -compatibility symlinks for the old naming scheme. Don't change this -config file unless you know what you're doing. Even if you think you -do know what you're doing, don't change it until you've followed all -the steps below and booted a devfs-enabled system and verified that it -works. - -Now edit your main system boot script so that devfsd is started at the -very beginning (before any filesystem -checks). /etc/rc.d/rc.sysinit is often the main boot script -on systems with SysV-style boot scripts. On systems with BSD-style -boot scripts it is often /etc/rc. Also check -/sbin/rc. - -NOTE that the line you put into the boot -script should be exactly: - -/sbin/devfsd /dev - -DO NOT use some special daemon-launching -programme, otherwise the boot script may not wait for devfsd to finish -initialising. - -System Libraries -There may still be some problems because of broken software making -assumptions about device names. In particular, some software does not -handle devices which are symbolic links. If you are running a libc 5 -based system, install libc 5.4.44 (if you have libc 5.4.46, go back to -libc 5.4.44, which is actually correct). If you are running a glibc -based system, make sure you have glibc 2.1.3 or later. - -/etc/securetty -PAM (Pluggable Authentication Modules) is supposed to be a flexible -mechanism for providing better user authentication and access to -services. Unfortunately, it's also fragile, complex and undocumented -(check out RedHat 6.1, and probably other distributions as well). PAM -has problems with symbolic links. Append the following lines to your -/etc/securetty file: - -vc/1 -vc/2 -vc/3 -vc/4 -vc/5 -vc/6 -vc/7 -vc/8 - -This will not weaken security. If you have a version of util-linux -earlier than 2.10.h, please upgrade to 2.10.h or later. If you -absolutely cannot upgrade, then also append the following lines to -your /etc/securetty file: - -1 -2 -3 -4 -5 -6 -7 -8 - -This may potentially weaken security by allowing root logins over the -network (a password is still required, though). However, since there -are problems with dealing with symlinks, I'm suspicious of the level -of security offered in any case. - -XFree86 -While not essential, it's probably a good idea to upgrade to XFree86 -4.0, as patches went in to make it more devfs-friendly. If you don't, -you'll probably need to apply the following patch to -/etc/security/console.perms so that ordinary users can run -startx. Note that not all distributions have this file (e.g. Debian), -so if it's not present, don't worry about it. - ---- /etc/security/console.perms.orig Sat Apr 17 16:26:47 1999 -+++ /etc/security/console.perms Fri Feb 25 23:53:55 2000 -@@ -14,7 +14,7 @@ - # man 5 console.perms - - # file classes -- these are regular expressions --<console>=tty[0-9][0-9]* :[0-9]\.[0-9] :[0-9] -+<console>=tty[0-9][0-9]* vc/[0-9][0-9]* :[0-9]\.[0-9] :[0-9] - - # device classes -- these are shell-style globs - <floppy>=/dev/fd[0-1]* - -If the patch does not apply, then change the line: - -<console>=tty[0-9][0-9]* :[0-9]\.[0-9] :[0-9] - -with: - -<console>=tty[0-9][0-9]* vc/[0-9][0-9]* :[0-9]\.[0-9] :[0-9] - - -Disable devpts -I've had a report of devpts mounted on /dev/pts not working -correctly. Since devfs will also manage /dev/pts, there is no -need to mount devpts as well. You should either edit your -/etc/fstab so devpts is not mounted, or disable devpts from -your kernel configuration. - -Unsupported drivers -Not all drivers have devfs support. If you depend on one of these -drivers, you will need to create a script or tarfile that you can use -at boot time to create device nodes as appropriate. There is a -section which describes this. Another -section lists the drivers which have -devfs support. - -/dev/mouse - -Many disributions configure /dev/mouse to be the mouse device -for XFree86 and GPM. I actually think this is a bad idea, because it -adds another level of indirection. When looking at a config file, if -you see /dev/mouse you're left wondering which mouse -is being referred to. Hence I recommend putting the actual mouse -device (for example /dev/psaux) into your -/etc/X11/XF86Config file (and similarly for the GPM -configuration file). - -Alternatively, use the same technique used for unsupported drivers -described above. - -The Kernel -Finally, you need to make sure devfs is compiled into your kernel. Set -CONFIG_EXPERIMENTAL=y, CONFIG_DEVFS_FS=y and CONFIG_DEVFS_MOUNT=y by -using favourite configuration tool (i.e. make config or -make xconfig) and then make clean and then recompile your kernel and -modules. At boot, devfs will be mounted onto /dev. - -If you encounter problems booting (for example if you forgot a -configuration step), you can pass devfs=nomount at the kernel -boot command line. This will prevent the kernel from mounting devfs at -boot time onto /dev. - -In general, a kernel built with CONFIG_DEVFS_FS=y but without mounting -devfs onto /dev is completely safe, and requires no -configuration changes. One exception to take note of is when -LABEL= directives are used in /etc/fstab. In this -case you will be unable to boot properly. This is because the -mount(8) programme uses /proc/partitions as part of -the volume label search process, and the device names it finds are not -available, because setting CONFIG_DEVFS_FS=y changes the names in -/proc/partitions, irrespective of whether devfs is mounted. - -Now you've finished all the steps required. You're now ready to boot -your shiny new kernel. Enjoy. - -Changing the configuration - -OK, you've now booted a devfs-enabled system, and everything works. -Now you may feel like changing the configuration (common targets are -/etc/fstab and /etc/devfsd.conf). Since you have a -system that works, if you make any changes and it doesn't work, you -now know that you only have to restore your configuration files to the -default and it will work again. - - -Permissions persistence across reboots - -If you don't use mknod(2) to create a device file, nor use chmod(2) or -chown(2) to change the ownerships/permissions, the inode ctime will -remain at 0 (the epoch, 12 am, 1-JAN-1970, GMT). Anything with a ctime -later than this has had it's ownership/permissions changed. Hence, a -simple script or programme may be used to tar up all changed inodes, -prior to shutdown. Although effective, many consider this approach a -kludge. - -A much better approach is to use devfsd to save and restore -permissions. It may be configured to record changes in permissions and -will save them in a database (in fact a directory tree), and restore -these upon boot. This is an efficient method and results in immediate -saving of current permissions (unlike the tar approach, which saves -permissions at some unspecified future time). - -The default configuration file supplied with devfsd has config entries -which you may uncomment to enable persistence management. - -If you decide to use the tar approach anyway, be aware that tar will -first unlink(2) an inode before creating a new device node. The -unlink(2) has the effect of breaking the connection between a devfs -entry and the device driver. If you use the "devfs=only" boot option, -you lose access to the device driver, requiring you to reload the -module. I consider this a bug in tar (there is no real need to -unlink(2) the inode first). - -Alternatively, you can use devfsd to provide more sophisticated -management of device permissions. You can use devfsd to store -permissions for whole groups of devices with a single configuration -entry, rather than the conventional single entry per device entry. - -Permissions database stored in mounted-over /dev - -If you wish to save and restore your device permissions into the -disc-based /dev while still mounting devfs onto /dev -you may do so. This requires a 2.4.x kernel (in fact, 2.3.99 or -later), which has the VFS binding facility. You need to do the -following to set this up: - - - -make sure the kernel does not mount devfs at boot time - - -make sure you have a correct /dev/console entry in your -root file-system (where your disc-based /dev lives) - -create the /dev-state directory - - -add the following lines near the very beginning of your boot -scripts: - -mount --bind /dev /dev-state -mount -t devfs none /dev -devfsd /dev - - - - -add the following lines to your /etc/devfsd.conf file: - -REGISTER ^pt[sy] IGNORE -CREATE ^pt[sy] IGNORE -CHANGE ^pt[sy] IGNORE -DELETE ^pt[sy] IGNORE -REGISTER .* COPY /dev-state/$devname $devpath -CREATE .* COPY $devpath /dev-state/$devname -CHANGE .* COPY $devpath /dev-state/$devname -DELETE .* CFUNCTION GLOBAL unlink /dev-state/$devname -RESTORE /dev-state - -Note that the sample devfsd.conf file contains these lines, -as well as other sample configurations you may find useful. See the -devfsd distribution - - -reboot. - - - - -Permissions database stored in normal directory - -If you are using an older kernel which doesn't support VFS binding, -then you won't be able to have the permissions database in a -mounted-over /dev. However, you can still use a regular -directory to store the database. The sample /etc/devfsd.conf -file above may still be used. You will need to create the -/dev-state directory prior to installing devfsd. If you have -old permissions in /dev, then just copy (or move) the device -nodes over to the new directory. - -Which method is better? - -The best method is to have the permissions database stored in the -mounted-over /dev. This is because you will not need to copy -device nodes over to /dev-state, and because it allows you to -switch between devfs and non-devfs kernels, without requiring you to -copy permissions between /dev-state (for devfs) and -/dev (for non-devfs). - - -Dealing with drivers without devfs support - -Currently, not all device drivers in the kernel have been modified to -use devfs. Device drivers which do not yet have devfs support will not -automagically appear in devfs. The simplest way to create device nodes -for these drivers is to unpack a tarfile containing the required -device nodes. You can do this in your boot scripts. All your drivers -will now work as before. - -Hopefully for most people devfs will have enough support so that they -can mount devfs directly over /dev without losing most functionality -(i.e. losing access to various devices). As of 22-JAN-1998 (devfs -patch version 10) I am now running this way. All the devices I have -are available in devfs, so I don't lose anything. - -WARNING: if your configuration requires the old-style device names -(i.e. /dev/hda1 or /dev/sda1), you must install devfsd and configure -it to maintain compatibility entries. It is almost certain that you -will require this. Note that the kernel creates a compatibility entry -for the root device, so you don't need initrd. - -Note that you no longer need to mount devpts if you use Unix98 PTYs, -as devfs can manage /dev/pts itself. This saves you some RAM, as you -don't need to compile and install devpts. Note that some versions of -glibc have a bug with Unix98 pty handling on devfs systems. Contact -the glibc maintainers for a fix. Glibc 2.1.3 has the fix. - -Note also that apart from editing /etc/fstab, other things will need -to be changed if you *don't* install devfsd. Some software (like the X -server) hard-wire device names in their source. It really is much -easier to install devfsd so that compatibility entries are created. -You can then slowly migrate your system to using the new device names -(for example, by starting with /etc/fstab), and then limiting the -compatibility entries that devfsd creates. - -IF YOU CONFIGURE TO MOUNT DEVFS AT BOOT, MAKE SURE YOU INSTALL DEVFSD -BEFORE YOU BOOT A DEVFS-ENABLED KERNEL! - -Now that devfs has gone into the 2.3.46 kernel, I'm getting a lot of -reports back. Many of these are because people are trying to run -without devfsd, and hence some things break. Please just run devfsd if -things break. I want to concentrate on real bugs rather than -misconfiguration problems at the moment. If people are willing to fix -bugs/false assumptions in other code (i.e. glibc, X server) and submit -that to the respective maintainers, that would be great. - - -All the way with Devfs - -The devfs kernel patch creates a rationalised device tree. As stated -above, if you want to keep using the old /dev naming scheme, -you just need to configure devfsd appopriately (see the man -page). People who prefer the old names can ignore this section. For -those of us who like the rationalised names and an uncluttered -/dev, read on. - -If you don't run devfsd, or don't enable compatibility entry -management, then you will have to configure your system to use the new -names. For example, you will then need to edit your -/etc/fstab to use the new disc naming scheme. If you want to -be able to boot non-devfs kernels, you will need compatibility -symlinks in the underlying disc-based /dev pointing back to -the old-style names for when you boot a kernel without devfs. - -You can selectively decide which devices you want compatibility -entries for. For example, you may only want compatibility entries for -BSD pseudo-terminal devices (otherwise you'll have to patch you C -library or use Unix98 ptys instead). It's just a matter of putting in -the correct regular expression into /dev/devfsd.conf. - -There are other choices of naming schemes that you may prefer. For -example, I don't use the kernel-supplied -names, because they are too verbose. A common misconception is -that the kernel-supplied names are meant to be used directly in -configuration files. This is not the case. They are designed to -reflect the layout of the devices attached and to provide easy -classification. - -If you like the kernel-supplied names, that's fine. If you don't then -you should be using devfsd to construct a namespace more to your -liking. Devfsd has built-in code to construct a -namespace that is both logical and easy to -manage. In essence, it creates a convenient abbreviation of the -kernel-supplied namespace. - -You are of course free to build your own namespace. Devfsd has all the -infrastructure required to make this easy for you. All you need do is -write a script. You can even write some C code and devfsd can load the -shared object as a callable extension. - - -Other Issues - -The init programme -Another thing to take note of is whether your init programme -creates a Unix socket /dev/telinit. Some versions of init -create /dev/telinit so that the telinit programme can -communicate with the init process. If you have such a system you need -to make sure that devfs is mounted over /dev *before* init -starts. In other words, you can't leave the mounting of devfs to -/etc/rc, since this is executed after init. Other -versions of init require a named pipe /dev/initctl -which must exist *before* init starts. Once again, you need to -mount devfs and then create the named pipe *before* init -starts. - -The default behaviour now is not to mount devfs onto /dev at -boot time for 2.3.x and later kernels. You can correct this with the -"devfs=mount" boot option. This solves any problems with init, -and also prevents the dreaded: - -Cannot open initial console - -message. For 2.2.x kernels where you need to apply the devfs patch, -the default is to mount. - -If you have automatic mounting of devfs onto /dev then you -may need to create /dev/initctl in your boot scripts. The -following lines should suffice: - -mknod /dev/initctl p -kill -SIGUSR1 1 # tell init that /dev/initctl now exists - -Alternatively, if you don't want the kernel to mount devfs onto -/dev then you could use the following procedure is a -guideline for how to get around /dev/initctl problems: - -# cd /sbin -# mv init init.real -# cat > init -#! /bin/sh -mount -n -t devfs none /dev -mknod /dev/initctl p -exec /sbin/init.real $* -[control-D] -# chmod a+x init - -Note that newer versions of init create /dev/initctl -automatically, so you don't have to worry about this. - -Module autoloading -You will need to configure devfsd to enable module -autoloading. The following lines should be placed in your -/etc/devfsd.conf file: - -LOOKUP .* MODLOAD - - -As of devfsd-v1.3.10, a generic /etc/modules.devfs -configuration file is installed, which is used by the MODLOAD -action. This should be sufficient for most configurations. If you -require further configuration, edit your /etc/modules.conf -file. The way module autoloading work with devfs is: - - -a process attempts to lookup a device node (e.g. /dev/fred) - - -if that device node does not exist, the full pathname is passed to -devfsd as a string - - -devfsd will pass the string to the modprobe programme (provided the -configuration line shown above is present), and specifies that -/etc/modules.devfs is the configuration file - - -/etc/modules.devfs includes /etc/modules.conf to -access local configurations - -modprobe will search it's configuration files, looking for an alias -that translates the pathname into a module name - - -the translated pathname is then used to load the module. - - -If you wanted a lookup of /dev/fred to load the -mymod module, you would require the following configuration -line in /etc/modules.conf: - -alias /dev/fred mymod - -The /etc/modules.devfs configuration file provides many such -aliases for standard device names. If you look closely at this file, -you will note that some modules require multiple alias configuration -lines. This is required to support module autoloading for old and new -device names. - -Mounting root off a devfs device -If you wish to mount root off a devfs device when you pass the -"devfs=only" boot option, then you need to pass in the -"root=<device>" option to the kernel when booting. If you use -LILO, then you must have this in lilo.conf: - -append = "root=<device>" - -Surprised? Yep, so was I. It turns out if you have (as most people -do): - -root = <device> - - -then LILO will determine the device number of <device> and will -write that device number into a special place in the kernel image -before starting the kernel, and the kernel will use that device number -to mount the root filesystem. So, using the "append" variety ensures -that LILO passes the root filesystem device as a string, which devfs -can then use. - -Note that this isn't an issue if you don't pass "devfs=only". - -TTY issues -The ttyname(3) function in some versions of the C library makes -false assumptions about device entries which are symbolic links. The -tty(1) programme is one that depends on this function. I've -written a patch to libc 5.4.43 which fixes this. This has been -included in libc 5.4.44 and a similar fix is in glibc 2.1.3. - - -Kernel Naming Scheme - -The kernel provides a default naming scheme. This scheme is designed -to make it easy to search for specific devices or device types, and to -view the available devices. Some device types (such as hard discs), -have a directory of entries, making it easy to see what devices of -that class are available. Often, the entries are symbolic links into a -directory tree that reflects the topology of available devices. The -topological tree is useful for finding how your devices are arranged. - -Below is a list of the naming schemes for the most common drivers. A -list of reserved device names is -available for reference. Please send email to -rgooch@atnf.csiro.au to obtain an allocation. Please be -patient (the maintainer is busy). An alternative name may be allocated -instead of the requested name, at the discretion of the maintainer. - -Disc Devices - -All discs, whether SCSI, IDE or whatever, are placed under the -/dev/discs hierarchy: - - /dev/discs/disc0 first disc - /dev/discs/disc1 second disc - - -Each of these entries is a symbolic link to the directory for that -device. The device directory contains: - - disc for the whole disc - part* for individual partitions - - -CD-ROM Devices - -All CD-ROMs, whether SCSI, IDE or whatever, are placed under the -/dev/cdroms hierarchy: - - /dev/cdroms/cdrom0 first CD-ROM - /dev/cdroms/cdrom1 second CD-ROM - - -Each of these entries is a symbolic link to the real device entry for -that device. - -Tape Devices - -All tapes, whether SCSI, IDE or whatever, are placed under the -/dev/tapes hierarchy: - - /dev/tapes/tape0 first tape - /dev/tapes/tape1 second tape - - -Each of these entries is a symbolic link to the directory for that -device. The device directory contains: - - mt for mode 0 - mtl for mode 1 - mtm for mode 2 - mta for mode 3 - mtn for mode 0, no rewind - mtln for mode 1, no rewind - mtmn for mode 2, no rewind - mtan for mode 3, no rewind - - -SCSI Devices - -To uniquely identify any SCSI device requires the following -information: - - controller (host adapter) - bus (SCSI channel) - target (SCSI ID) - unit (Logical Unit Number) - - -All SCSI devices are placed under /dev/scsi (assuming devfs -is mounted on /dev). Hence, a SCSI device with the following -parameters: c=1,b=2,t=3,u=4 would appear as: - - /dev/scsi/host1/bus2/target3/lun4 device directory - - -Inside this directory, a number of device entries may be created, -depending on which SCSI device-type drivers were installed. - -See the section on the disc naming scheme to see what entries the SCSI -disc driver creates. - -See the section on the tape naming scheme to see what entries the SCSI -tape driver creates. - -The SCSI CD-ROM driver creates: - - cd - - -The SCSI generic driver creates: - - generic - - -IDE Devices - -To uniquely identify any IDE device requires the following -information: - - controller - bus (aka. primary/secondary) - target (aka. master/slave) - unit - - -All IDE devices are placed under /dev/ide, and uses a similar -naming scheme to the SCSI subsystem. - -XT Hard Discs - -All XT discs are placed under /dev/xd. The first XT disc has -the directory /dev/xd/disc0. - -TTY devices - -The tty devices now appear as: - - New name Old-name Device Type - -------- -------- ----------- - /dev/tts/{0,1,...} /dev/ttyS{0,1,...} Serial ports - /dev/cua/{0,1,...} /dev/cua{0,1,...} Call out devices - /dev/vc/0 /dev/tty Current virtual console - /dev/vc/{1,2,...} /dev/tty{1...63} Virtual consoles - /dev/vcc/{0,1,...} /dev/vcs{1...63} Virtual consoles - /dev/pty/m{0,1,...} /dev/ptyp?? PTY masters - /dev/pty/s{0,1,...} /dev/ttyp?? PTY slaves - - -RAMDISCS - -The RAMDISCS are placed in their own directory, and are named thus: - - /dev/rd/{0,1,2,...} - - -Meta Devices - -The meta devices are placed in their own directory, and are named -thus: - - /dev/md/{0,1,2,...} - - -Floppy discs - -Floppy discs are placed in the /dev/floppy directory. - -Loop devices - -Loop devices are placed in the /dev/loop directory. - -Sound devices - -Sound devices are placed in the /dev/sound directory -(audio, sequencer, ...). - - -Devfsd Naming Scheme - -Devfsd provides a naming scheme which is a convenient abbreviation of -the kernel-supplied namespace. In some -cases, the kernel-supplied naming scheme is quite convenient, so -devfsd does not provide another naming scheme. The convenience names -that devfsd creates are in fact the same names as the original devfs -kernel patch created (before Linus mandated the Big Name -Change). These are referred to as "new compatibility entries". - -In order to configure devfsd to create these convenience names, the -following lines should be placed in your /etc/devfsd.conf: - -REGISTER .* MKNEWCOMPAT -UNREGISTER .* RMNEWCOMPAT - -This will cause devfsd to create (and destroy) symbolic links which -point to the kernel-supplied names. - -SCSI Hard Discs - -All SCSI discs are placed under /dev/sd (assuming devfs is -mounted on /dev). Hence, a SCSI disc with the following -parameters: c=1,b=2,t=3,u=4 would appear as: - - /dev/sd/c1b2t3u4 for the whole disc - /dev/sd/c1b2t3u4p5 for the 5th partition - /dev/sd/c1b2t3u4p5s6 for the 6th slice in the 5th partition - - -SCSI Tapes - -All SCSI tapes are placed under /dev/st. A similar naming -scheme is used as for SCSI discs. A SCSI tape with the -parameters:c=1,b=2,t=3,u=4 would appear as: - - /dev/st/c1b2t3u4m0 for mode 0 - /dev/st/c1b2t3u4m1 for mode 1 - /dev/st/c1b2t3u4m2 for mode 2 - /dev/st/c1b2t3u4m3 for mode 3 - /dev/st/c1b2t3u4m0n for mode 0, no rewind - /dev/st/c1b2t3u4m1n for mode 1, no rewind - /dev/st/c1b2t3u4m2n for mode 2, no rewind - /dev/st/c1b2t3u4m3n for mode 3, no rewind - - -SCSI CD-ROMs - -All SCSI CD-ROMs are placed under /dev/sr. A similar naming -scheme is used as for SCSI discs. A SCSI CD-ROM with the -parameters:c=1,b=2,t=3,u=4 would appear as: - - /dev/sr/c1b2t3u4 - - -SCSI Generic Devices - -The generic (aka. raw) interface for all SCSI devices are placed under -/dev/sg. A similar naming scheme is used as for SCSI discs. A -SCSI generic device with the parameters:c=1,b=2,t=3,u=4 would appear -as: - - /dev/sg/c1b2t3u4 - - -IDE Hard Discs - -All IDE discs are placed under /dev/ide/hd, using a similar -convention to SCSI discs. The following mappings exist between the new -and the old names: - - /dev/hda /dev/ide/hd/c0b0t0u0 - /dev/hdb /dev/ide/hd/c0b0t1u0 - /dev/hdc /dev/ide/hd/c0b1t0u0 - /dev/hdd /dev/ide/hd/c0b1t1u0 - - -IDE Tapes - -A similar naming scheme is used as for IDE discs. The entries will -appear in the /dev/ide/mt directory. - -IDE CD-ROM - -A similar naming scheme is used as for IDE discs. The entries will -appear in the /dev/ide/cd directory. - -IDE Floppies - -A similar naming scheme is used as for IDE discs. The entries will -appear in the /dev/ide/fd directory. - -XT Hard Discs - -All XT discs are placed under /dev/xd. The first XT disc -would appear as /dev/xd/c0t0. - - -Old Compatibility Names - -The old compatibility names are the legacy device names, such as -/dev/hda, /dev/sda, /dev/rtc and so on. -Devfsd can be configured to create compatibility symlinks so that you -may continue to use the old names in your configuration files and so -that old applications will continue to function correctly. - -In order to configure devfsd to create these legacy names, the -following lines should be placed in your /etc/devfsd.conf: - -REGISTER .* MKOLDCOMPAT -UNREGISTER .* RMOLDCOMPAT - -This will cause devfsd to create (and destroy) symbolic links which -point to the kernel-supplied names. - - ------------------------------------------------------------------------------ - - -Device drivers currently ported - -- All miscellaneous character devices support devfs (this is done - transparently through misc_register()) - -- SCSI discs and generic hard discs - -- Character memory devices (null, zero, full and so on) - Thanks to C. Scott Ananian <cananian@alumni.princeton.edu> - -- Loop devices (/dev/loop?) - -- TTY devices (console, serial ports, terminals and pseudo-terminals) - Thanks to C. Scott Ananian <cananian@alumni.princeton.edu> - -- SCSI tapes (/dev/scsi and /dev/tapes) - -- SCSI CD-ROMs (/dev/scsi and /dev/cdroms) - -- SCSI generic devices (/dev/scsi) - -- RAMDISCS (/dev/ram?) - -- Meta Devices (/dev/md*) - -- Floppy discs (/dev/floppy) - -- Parallel port printers (/dev/printers) - -- Sound devices (/dev/sound) - Thanks to Eric Dumas <dumas@linux.eu.org> and - C. Scott Ananian <cananian@alumni.princeton.edu> - -- Joysticks (/dev/joysticks) - -- Sparc keyboard (/dev/kbd) - -- DSP56001 digital signal processor (/dev/dsp56k) - -- Apple Desktop Bus (/dev/adb) - -- Coda network file system (/dev/cfs*) - -- Virtual console capture devices (/dev/vcc) - Thanks to Dennis Hou <smilax@mindmeld.yi.org> - -- Frame buffer devices (/dev/fb) - -- Video capture devices (/dev/v4l) - - ------------------------------------------------------------------------------ - - -Allocation of Device Numbers - -Devfs allows you to write a driver which doesn't need to allocate a -device number (major&minor numbers) for the internal operation of the -kernel. However, there are a number of userspace programmes that use -the device number as a unique handle for a device. An example is the -find programme, which uses device numbers to determine whether -an inode is on a different filesystem than another inode. The device -number used is the one for the block device which a filesystem is -using. To preserve compatibility with userspace programmes, block -devices using devfs need to have unique device numbers allocated to -them. Furthermore, POSIX specifies device numbers, so some kind of -device number needs to be presented to userspace. - -The simplest option (especially when porting drivers to devfs) is to -keep using the old major and minor numbers. Devfs will take whatever -values are given for major&minor and pass them onto userspace. - -This device number is a 16 bit number, so this leaves plenty of space -for large numbers of discs and partitions. This scheme can also be -used for character devices, in particular the tty devices, which are -currently limited to 256 pseudo-ttys (this limits the total number of -simultaneous xterms and remote logins). Note that the device number -is limited to the range 36864-61439 (majors 144-239), in order to -avoid any possible conflicts with existing official allocations. - -Please note that using dynamically allocated block device numbers may -break the NFS daemons (both user and kernel mode), which expect dev_t -for a given device to be constant over the lifetime of remote mounts. - -A final note on this scheme: since it doesn't increase the size of -device numbers, there are no compatibility issues with userspace. - ------------------------------------------------------------------------------ - - -Questions and Answers - - -Making things work -Alternatives to devfs -What I don't like about devfs -How to report bugs -Strange kernel messages -Compilation problems with devfsd - - - -Making things work - -Here are some common questions and answers. - - - -Devfsd doesn't start - -Make sure you have compiled and installed devfsd -Make sure devfsd is being started from your boot -scripts -Make sure you have configured your kernel to enable devfs (see -below) -Make sure devfs is mounted (see below) - - -Devfsd is not managing all my permissions - -Make sure you are capturing the appropriate events. For example, -device entries created by the kernel generate REGISTER events, -but those created by devfsd generate CREATE events. - - -Devfsd is not capturing all REGISTER events - -See the previous entry: you may need to capture CREATE events. - - -X will not start - -Make sure you followed the steps -outlined above. - - -Why don't my network devices appear in devfs? - -This is not a bug. Network devices have their own, completely separate -namespace. They are accessed via socket(2) and -setsockopt(2) calls, and thus require no device nodes. I have -raised the possibilty of moving network devices into the device -namespace, but have had no response. - - -How can I test if I have devfs compiled into my kernel? - -All filesystems built-in or currently loaded are listed in -/proc/filesystems. If you see a devfs entry, then -you know that devfs was compiled into your kernel. If you have -correctly configured and rebuilt your kernel, then devfs will be -built-in. If you think you've configured it in, but -/proc/filesystems doesn't show it, you've made a mistake. -Common mistakes include: - -Using a 2.2.x kernel without applying the devfs patch (if you -don't know how to patch your kernel, use 2.4.x instead, don't bother -asking me how to patch) -Forgetting to set CONFIG_EXPERIMENTAL=y -Forgetting to set CONFIG_DEVFS_FS=y -Forgetting to set CONFIG_DEVFS_MOUNT=y (if you want devfs -to be automatically mounted at boot) -Editing your .config manually, instead of using make -config or make xconfig -Forgetting to run make dep; make clean after changing the -configuration and before compiling -Forgetting to compile your kernel and modules -Forgetting to install your kernel -Forgetting to install your modules - -Please check twice that you've done all these steps before sending in -a bug report. - - - -How can I test if devfs is mounted on /dev? - -The device filesystem will always create an entry called -".devfsd", which is used to communicate with the daemon. Even -if the daemon is not running, this entry will exist. Testing for the -existence of this entry is the approved method of determining if devfs -is mounted or not. Note that the type of entry (i.e. regular file, -character device, named pipe, etc.) may change without notice. Only -the existence of the entry should be relied upon. - - -When I start devfsd, I see the error: -Error opening file: ".devfsd" No such file or directory? - -This means that devfs is not mounted. Make sure you have devfs mounted. - - -How do I mount devfs? - -First make sure you have devfs compiled into your kernel (see -above). Then you will either need to: - -set CONFIG_DEVFS_MOUNT=y in your kernel config -pass devfs=mount to your boot loader -mount devfs manually in your boot scripts with: -mount -t none devfs /dev - - - -Mount by volume LABEL=<label> doesn't work with -devfs - -Most probably you are not mounting devfs onto /dev. What -happens is that if your kernel config has CONFIG_DEVFS_FS=y -then the contents of /proc/partitions will have the devfs -names (such as scsi/host0/bus0/target0/lun0/part1). The -contents of /proc/partitions are used by mount(8) when -mounting by volume label. If devfs is not mounted on /dev, -then mount(8) will fail to find devices. The solution is to -make sure that devfs is mounted on /dev. See above for how to -do that. - - -I have extra or incorrect entries in /dev - -You may have stale entries in your dev-state area. Check for a -RESTORE configuration line in your devfsd configuration -(typically /etc/devfsd.conf). If you have this line, check -the contents of the specified directory for stale entries. Remove -any entries which are incorrect, then reboot. - - -I get "Unable to open initial console" messages at boot - -This usually happens when you don't have devfs automounted onto -/dev at boot time, and there is no valid -/dev/console entry on your root file-system. Create a valid -/dev/console device node. - - - - - -Alternatives to devfs - -I've attempted to collate all the anti-devfs proposals and explain -their limitations. Under construction. - - -Why not just pass device create/remove events to a daemon? - -Here the suggestion is to develop an API in the kernel so that devices -can register create and remove events, and a daemon listens for those -events. The daemon would then populate/depopulate /dev (which -resides on disc). - -This has several limitations: - - -it only works for modules loaded and unloaded (or devices inserted -and removed) after the kernel has finished booting. Without a database -of events, there is no way the daemon could fully populate -/dev - - -if you add a database to this scheme, the question is then how to -present that database to user-space. If you make it a list of strings -with embedded event codes which are passed through a pipe to the -daemon, then this is only of use to the daemon. I would argue that the -natural way to present this data is via a filesystem (since many of -the events will be of a hierarchical nature), such as devfs. -Presenting the data as a filesystem makes it easy for the user to see -what is available and also makes it easy to write scripts to scan the -"database" - - -the tight binding between device nodes and drivers is no longer -possible (requiring the otherwise perfectly avoidable -table lookups) - - -you cannot catch inode lookup events on /dev which means -that module autoloading requires device nodes to be created. This is a -problem, particularly for drivers where only a few inodes are created -from a potentially large set - - -this technique can't be used when the root FS is mounted -read-only - - - - -Just implement a better scsidev - -This suggestion involves taking the scsidev programme and -extending it to scan for all devices, not just SCSI devices. The -scsidev programme works by scanning /proc/scsi - -Problems: - - -the kernel does not currently provide a list of all devices -available. Not all drivers register entries in /proc or -generate kernel messages - - -there is no uniform mechanism to register devices other than the -devfs API - - -implementing such an API is then the same as the -proposal above - - - - -Put /dev on a ramdisc - -This suggestion involves creating a ramdisc and populating it with -device nodes and then mounting it over /dev. - -Problems: - - - -this doesn't help when mounting the root filesystem, since you -still need a device node to do that - - -if you want to use this technique for the root device node as -well, you need to use initrd. This complicates the booting sequence -and makes it significantly harder to administer and configure. The -initrd is essentially opaque, robbing the system administrator of easy -configuration - - -insufficient information is available to correctly populate the -ramdisc. So we come back to the -proposal above to "solve" this - - -a ramdisc-based solution would take more kernel memory, since the -backing store would be (at best) normal VFS inodes and dentries, which -take 284 bytes and 112 bytes, respectively, for each entry. Compare -that to 72 bytes for devfs - - - - -Do nothing: there's no problem - -Sometimes people can be heard to claim that the existing scheme is -fine. This is what they're ignoring: - - -device number size (8 bits each for major and minor) is a real -limitation, and must be fixed somehow. Systems with large numbers of -SCSI devices, for example, will continue to consume the remaining -unallocated major numbers. USB will also need to push beyond the 8 bit -minor limitation - - -simply increasing the device number size is insufficient. Apart -from causing a lot of pain, it doesn't solve the management issues -of a /dev with thousands or more device nodes - - -ignoring the problem of a huge /dev will not make it go -away, and dismisses the legitimacy of a large number of people who -want a dynamic /dev - - -the standard response then becomes: "write a device management -daemon", which brings us back to the -proposal above - - - - -What I don't like about devfs - -Here are some common complaints about devfs, and some suggestions and -solutions that may make it more palatable for you. I can't please -everybody, but I do try :-) - -I hate the naming scheme - -First, remember that no naming scheme will please everybody. You hate -the scheme, others love it. Who's to say who's right and who's wrong? -Ultimately, the person who writes the code gets to choose, and what -exists now is a combination of the choices made by the -devfs author and the -kernel maintainer (Linus). - -However, not all is lost. If you want to create your own naming -scheme, it is a simple matter to write a standalone script, hack -devfsd, or write a script called by devfsd. You can create whatever -naming scheme you like. - -Further, if you want to remove all traces of the devfs naming scheme -from /dev, you can mount devfs elsewhere (say -/devfs) and populate /dev with links into -/devfs. This population can be automated using devfsd if you -wish. - -You can even use the VFS binding facility to make the links, rather -than using symbolic links. This way, you don't even have to see the -"destination" of these symbolic links. - -Devfs puts policy into the kernel - -There's already policy in the kernel. Device numbers are in fact -policy (why should the kernel dictate what device numbers I use?). -Face it, some policy has to be in the kernel. The real difference -between device names as policy and device numbers as policy is that -no one will use device numbers directly, because device -numbers are devoid of meaning to humans and are ugly. At least with -the devfs device names, (even though you can add your own naming -scheme) some people will use the devfs-supplied names directly. This -offends some people :-) - -Devfs is bloatware - -This is not even remotely true. As shown above, -both code and data size are quite modest. - - -How to report bugs - -If you have (or think you have) a bug with devfs, please follow the -steps below: - - - -make sure you have enabled debugging output when configuring your -kernel. You will need to set (at least) the following config options: - -CONFIG_DEVFS_DEBUG=y -CONFIG_DEBUG_KERNEL=y -CONFIG_DEBUG_SLAB=y - - - -please make sure you have the latest devfs patches applied. The -latest kernel version might not have the latest devfs patches applied -yet (Linus is very busy) - - -save a copy of your complete kernel logs (preferably by -using the dmesg programme) for later inclusion in your bug -report. You may need to use the -s switch to increase the -internal buffer size so you can capture all the boot messages. -Don't edit or trim the dmesg output - - - - -try booting with devfs=dall passed to the kernel boot -command line (read the documentation on your bootloader on how to do -this), and save the result to a file. This may be quite verbose, and -it may overflow the messages buffer, but try to get as much of it as -you can - - -if you get an Oops, run ksymoops to decode it so that the -names of the offending functions are provided. A non-decoded Oops is -pretty useless - - -send a copy of your devfsd configuration file(s) - -send the bug report to me first. -Don't expect that I will see it if you post it to the linux-kernel -mailing list. Include all the information listed above, plus -anything else that you think might be relevant. Put the string -devfs somewhere in the subject line, so my mail filters mark -it as urgent - - - - -Here is a general guide on how to ask questions in a way that greatly -improves your chances of getting a reply: - -http://www.tuxedo.org/~esr/faqs/smart-questions.html. If you have -a bug to report, you should also read - -http://www.chiark.greenend.org.uk/~sgtatham/bugs.html. - - -Strange kernel messages - -You may see devfs-related messages in your kernel logs. Below are some -messages and what they mean (and what you should do about them, if -anything). - - - -devfs_register(fred): could not append to parent, err: -17 - -You need to check what the error code means, but usually 17 means -EEXIST. This means that a driver attempted to create an entry -fred in a directory, but there already was an entry with that -name. This is often caused by flawed boot scripts which untar a bunch -of inodes into /dev, as a way to restore permissions. This -message is harmless, as the device nodes will still -provide access to the driver (unless you use the devfs=only -boot option, which is only for dedicated souls:-). If you want to get -rid of these annoying messages, upgrade to devfsd-v1.3.20 and use the -recommended RESTORE directive to restore permissions. - - -devfs_mk_dir(bill): using old entry in dir: c1808724 "" - -This is similar to the message above, except that a driver attempted -to create a directory named bill, and the parent directory -has an entry with the same name. In this case, to ensure that drivers -continue to work properly, the old entry is re-used and given to the -driver. In 2.5 kernels, the driver is given a NULL entry, and thus, -under rare circumstances, may not create the require device nodes. -The solution is the same as above. - - - - - -Compilation problems with devfsd - -Usually, you can compile devfsd just by typing in -make in the source directory, followed by a make -install (as root). Sometimes, you may have problems, particularly -on broken configurations. - - - -error messages relating to DEVFSD_NOTIFY_DELETE - -This happened because you have an ancient set of kernel headers -installed in /usr/include/linux or /usr/src/linux. -Install kernel 2.4.10 or later. You may need to pass the -KERNEL_DIR variable to make (if you did not install -the new kernel sources as /usr/src/linux), or you may copy -the devfs_fs.h file in the kernel source tree into -/usr/include/linux. - - - - ------------------------------------------------------------------------------ - - -Other resources - - - -Douglas Gilbert has written a useful document at - -http://www.torque.net/sg/devfs_scsi.html which -explores the SCSI subsystem and how it interacts with devfs - - -Douglas Gilbert has written another useful document at - -http://www.torque.net/scsi/SCSI-2.4-HOWTO/ which -discusses the Linux SCSI subsystem in 2.4. - - -Johannes Erdfelt has started a discussion paper on Linux and -hot-swap devices, describing what the requirements are for a scalable -solution and how and why he's used devfs+devfsd. Note that this is an -early draft only, available in plain text form at: - -http://johannes.erdfelt.com/hotswap.txt. -Johannes has promised a HTML version will follow. - - -I presented an invited -paper -at the - -2nd Annual Storage Management Workshop held in Miamia, Florida, -U.S.A. in October 2000. - - - - ------------------------------------------------------------------------------ - - -Translations of this document - -This document has been translated into other languages. - - - - -The document master (in English) by rgooch@atnf.csiro.au is -available at - -http://www.atnf.csiro.au/~rgooch/linux/docs/devfs.html - - - -A Korean translation by viatoris@nownuri.net is available at - -http://your.destiny.pe.kr/devfs/devfs.html - - - - ------------------------------------------------------------------------------ -Most flags courtesy of ITA's -Flags of All Countries -used with permission. diff --git a/Documentation/filesystems/devfs/ToDo b/Documentation/filesystems/devfs/ToDo deleted file mode 100644 index afd5a8f2c19..00000000000 --- a/Documentation/filesystems/devfs/ToDo +++ /dev/null @@ -1,40 +0,0 @@ - Device File System (devfs) ToDo List - - Richard Gooch <rgooch@atnf.csiro.au> - - 3-JUL-2000 - -This is a list of things to be done for better devfs support in the -Linux kernel. If you'd like to contribute to the devfs, please have a -look at this list for anything that is unallocated. Also, if there are -items missing (surely), please contact me so I can add them to the -list (preferably with your name attached to them:-). - - -- >256 ptys - Thanks to C. Scott Ananian <cananian@alumni.princeton.edu> - -- Amiga floppy driver (drivers/block/amiflop.c) - -- Atari floppy driver (drivers/block/ataflop.c) - -- SWIM3 (Super Woz Integrated Machine 3) floppy driver (drivers/block/swim3.c) - -- Amiga ZorroII ramdisc driver (drivers/block/z2ram.c) - -- Parallel port ATAPI CD-ROM (drivers/block/paride/pcd.c) - -- Parallel port ATAPI floppy (drivers/block/paride/pf.c) - -- AP1000 block driver (drivers/ap1000/ap.c, drivers/ap1000/ddv.c) - -- Archimedes floppy (drivers/acorn/block/fd1772.c) - -- MFM hard drive (drivers/acorn/block/mfmhd.c) - -- I2O block device (drivers/message/i2o/i2o_block.c) - -- ST-RAM device (arch/m68k/atari/stram.c) - -- Raw devices - diff --git a/Documentation/filesystems/devfs/boot-options b/Documentation/filesystems/devfs/boot-options deleted file mode 100644 index df3d33b03e0..00000000000 --- a/Documentation/filesystems/devfs/boot-options +++ /dev/null @@ -1,65 +0,0 @@ -/* -*- auto-fill -*- */ - - Device File System (devfs) Boot Options - - Richard Gooch <rgooch@atnf.csiro.au> - - 18-AUG-2001 - - -When CONFIG_DEVFS_DEBUG is enabled, you can pass several boot options -to the kernel to debug devfs. The boot options are prefixed by -"devfs=", and are separated by commas. Spaces are not allowed. The -syntax looks like this: - -devfs=<option1>,<option2>,<option3> - -and so on. For example, if you wanted to turn on debugging for module -load requests and device registration, you would do: - -devfs=dmod,dreg - -You may prefix "no" to any option. This will invert the option. - - -Debugging Options -================= - -These requires CONFIG_DEVFS_DEBUG to be enabled. -Note that all debugging options have 'd' as the first character. By -default all options are off. All debugging output is sent to the -kernel logs. The debugging options do not take effect until the devfs -version message appears (just prior to the root filesystem being -mounted). - -These are the options: - -dmod print module load requests to <request_module> - -dreg print device register requests to <devfs_register> - -dunreg print device unregister requests to <devfs_unregister> - -dchange print device change requests to <devfs_set_flags> - -dilookup print inode lookup requests - -diget print VFS inode allocations - -diunlink print inode unlinks - -dichange print inode changes - -dimknod print calls to mknod(2) - -dall some debugging turned on - - -Other Options -============= - -These control the default behaviour of devfs. The options are: - -mount mount devfs onto /dev at boot time - -only disable non-devfs device nodes for devfs-capable drivers diff --git a/Documentation/filesystems/devpts.txt b/Documentation/filesystems/devpts.txt new file mode 100644 index 00000000000..68dffd87f9b --- /dev/null +++ b/Documentation/filesystems/devpts.txt @@ -0,0 +1,132 @@ + +To support containers, we now allow multiple instances of devpts filesystem, +such that indices of ptys allocated in one instance are independent of indices +allocated in other instances of devpts. + +To preserve backward compatibility, this support for multiple instances is +enabled only if: + + - CONFIG_DEVPTS_MULTIPLE_INSTANCES=y, and + - '-o newinstance' mount option is specified while mounting devpts + +IOW, devpts now supports both single-instance and multi-instance semantics. + +If CONFIG_DEVPTS_MULTIPLE_INSTANCES=n, there is no change in behavior and +this referred to as the "legacy" mode. In this mode, the new mount options +(-o newinstance and -o ptmxmode) will be ignored with a 'bogus option' message +on console. + +If CONFIG_DEVPTS_MULTIPLE_INSTANCES=y and devpts is mounted without the +'newinstance' option (as in current start-up scripts) the new mount binds +to the initial kernel mount of devpts. This mode is referred to as the +'single-instance' mode and the current, single-instance semantics are +preserved, i.e PTYs are common across the system. + +The only difference between this single-instance mode and the legacy mode +is the presence of new, '/dev/pts/ptmx' node with permissions 0000, which +can safely be ignored. + +If CONFIG_DEVPTS_MULTIPLE_INSTANCES=y and 'newinstance' option is specified, +the mount is considered to be in the multi-instance mode and a new instance +of the devpts fs is created. Any ptys created in this instance are independent +of ptys in other instances of devpts. Like in the single-instance mode, the +/dev/pts/ptmx node is present. To effectively use the multi-instance mode, +open of /dev/ptmx must be a redirected to '/dev/pts/ptmx' using a symlink or +bind-mount. + +Eg: A container startup script could do the following: + + $ chmod 0666 /dev/pts/ptmx + $ rm /dev/ptmx + $ ln -s pts/ptmx /dev/ptmx + $ ns_exec -cm /bin/bash + + # We are now in new container + + $ umount /dev/pts + $ mount -t devpts -o newinstance lxcpts /dev/pts + $ sshd -p 1234 + +where 'ns_exec -cm /bin/bash' calls clone() with CLONE_NEWNS flag and execs +/bin/bash in the child process. A pty created by the sshd is not visible in +the original mount of /dev/pts. + +User-space changes +------------------ + +In multi-instance mode (i.e '-o newinstance' mount option is specified at least +once), following user-space issues should be noted. + +1. If -o newinstance mount option is never used, /dev/pts/ptmx can be ignored + and no change is needed to system-startup scripts. + +2. To effectively use multi-instance mode (i.e -o newinstance is specified) + administrators or startup scripts should "redirect" open of /dev/ptmx to + /dev/pts/ptmx using either a bind mount or symlink. + + $ mount -t devpts -o newinstance devpts /dev/pts + + followed by either + + $ rm /dev/ptmx + $ ln -s pts/ptmx /dev/ptmx + $ chmod 666 /dev/pts/ptmx + or + $ mount -o bind /dev/pts/ptmx /dev/ptmx + +3. The '/dev/ptmx -> pts/ptmx' symlink is the preferred method since it + enables better error-reporting and treats both single-instance and + multi-instance mounts similarly. + + But this method requires that system-startup scripts set the mode of + /dev/pts/ptmx correctly (default mode is 0000). The scripts can set the + mode by, either + + - adding ptmxmode mount option to devpts entry in /etc/fstab, or + - using 'chmod 0666 /dev/pts/ptmx' + +4. If multi-instance mode mount is needed for containers, but the system + startup scripts have not yet been updated, container-startup scripts + should bind mount /dev/ptmx to /dev/pts/ptmx to avoid breaking single- + instance mounts. + + Or, in general, container-startup scripts should use: + + mount -t devpts -o newinstance -o ptmxmode=0666 devpts /dev/pts + if [ ! -L /dev/ptmx ]; then + mount -o bind /dev/pts/ptmx /dev/ptmx + fi + + When all devpts mounts are multi-instance, /dev/ptmx can permanently be + a symlink to pts/ptmx and the bind mount can be ignored. + +5. A multi-instance mount that is not accompanied by the /dev/ptmx to + /dev/pts/ptmx redirection would result in an unusable/unreachable pty. + + mount -t devpts -o newinstance lxcpts /dev/pts + + immediately followed by: + + open("/dev/ptmx") + + would create a pty, say /dev/pts/7, in the initial kernel mount. + But /dev/pts/7 would be invisible in the new mount. + +6. The permissions for /dev/pts/ptmx node should be specified when mounting + /dev/pts, using the '-o ptmxmode=%o' mount option (default is 0000). + + mount -t devpts -o newinstance -o ptmxmode=0644 devpts /dev/pts + + The permissions can be later be changed as usual with 'chmod'. + + chmod 666 /dev/pts/ptmx + +7. A mount of devpts without the 'newinstance' option results in binding to + initial kernel mount. This behavior while preserving legacy semantics, + does not provide strict isolation in a container environment. i.e by + mounting devpts without the 'newinstance' option, a container could + get visibility into the 'host' or root container's devpts. + + To workaround this and have strict isolation, all mounts of devpts, + including the mount in the root container, should use the newinstance + option. diff --git a/Documentation/filesystems/directory-locking b/Documentation/filesystems/directory-locking index 34380d4fbce..09bbf9a54f8 100644 --- a/Documentation/filesystems/directory-locking +++ b/Documentation/filesystems/directory-locking @@ -1,5 +1,10 @@ Locking scheme used for directory operations is based on two -kinds of locks - per-inode (->i_sem) and per-filesystem (->s_vfs_rename_sem). +kinds of locks - per-inode (->i_mutex) and per-filesystem +(->s_vfs_rename_mutex). + + When taking the i_mutex on multiple non-directory objects, we +always acquire the locks in order by increasing address. We'll call +that "inode pointer" order in the following. For our purposes all operations fall in 5 classes: @@ -11,8 +16,9 @@ kinds of locks - per-inode (->i_sem) and per-filesystem (->s_vfs_rename_sem). locks victim and calls the method. 4) rename() that is _not_ cross-directory. Locking rules: caller locks -the parent, finds source and target, if target already exists - locks it -and then calls the method. +the parent and finds source and target. If target already exists, lock +it. If source is a non-directory, lock it. If that means we need to +lock both, lock them in inode pointer order. 5) link creation. Locking rules: * lock parent @@ -29,7 +35,9 @@ rules: fail with -ENOTEMPTY * if new parent is equal to or is a descendent of source fail with -ELOOP - * if target exists - lock it. + * If target exists, lock it. If source is a non-directory, lock + it. In case that means we need to lock both source and target, + do so in inode pointer order. * call the method. @@ -55,19 +63,25 @@ objects - A < B iff A is an ancestor of B. renames will be blocked on filesystem lock and we don't start changing the order until we had acquired all locks). -(3) any operation holds at most one lock on non-directory object and - that lock is acquired after all other locks. (Proof: see descriptions - of operations). +(3) locks on non-directory objects are acquired only after locks on + directory objects, and are acquired in inode pointer order. + (Proof: all operations but renames take lock on at most one + non-directory object, except renames, which take locks on source and + target in inode pointer order in the case they are not directories.) Now consider the minimal deadlock. Each process is blocked on attempt to acquire some lock and already holds at least one lock. Let's consider the set of contended locks. First of all, filesystem lock is not contended, since any process blocked on it is not holding any locks. -Thus all processes are blocked on ->i_sem. +Thus all processes are blocked on ->i_mutex. + + By (3), any process holding a non-directory lock can only be +waiting on another non-directory lock with a larger address. Therefore +the process holding the "largest" such lock can always make progress, and +non-directory objects are not included in the set of contended locks. - Non-directory objects are not contended due to (3). Thus link -creation can't be a part of deadlock - it can't be blocked on source -and it means that it doesn't hold any locks. + Thus link creation can't be a part of deadlock - it can't be +blocked on source and it means that it doesn't hold any locks. Any contended object is either held by cross-directory rename or has a child that is also contended. Indeed, suppose that it is held by @@ -82,7 +96,7 @@ own descendent. Moreover, there is exactly one cross-directory rename Consider the object blocking the cross-directory rename. One of its descendents is locked by cross-directory rename (otherwise we -would again have an infinite set of of contended objects). But that +would again have an infinite set of contended objects). But that means that cross-directory rename is taking locks out of order. Due to (2) the order hadn't changed since we had acquired filesystem lock. But locking rules for cross-directory rename guarantee that we do not diff --git a/Documentation/filesystems/dlmfs.txt b/Documentation/filesystems/dlmfs.txt new file mode 100644 index 00000000000..1b528b2ad80 --- /dev/null +++ b/Documentation/filesystems/dlmfs.txt @@ -0,0 +1,130 @@ +dlmfs +================== +A minimal DLM userspace interface implemented via a virtual file +system. + +dlmfs is built with OCFS2 as it requires most of its infrastructure. + +Project web page: http://oss.oracle.com/projects/ocfs2 +Tools web page: http://oss.oracle.com/projects/ocfs2-tools +OCFS2 mailing lists: http://oss.oracle.com/projects/ocfs2/mailman/ + +All code copyright 2005 Oracle except when otherwise noted. + +CREDITS +======= + +Some code taken from ramfs which is Copyright (C) 2000 Linus Torvalds +and Transmeta Corp. + +Mark Fasheh <mark.fasheh@oracle.com> + +Caveats +======= +- Right now it only works with the OCFS2 DLM, though support for other + DLM implementations should not be a major issue. + +Mount options +============= +None + +Usage +===== + +If you're just interested in OCFS2, then please see ocfs2.txt. The +rest of this document will be geared towards those who want to use +dlmfs for easy to setup and easy to use clustered locking in +userspace. + +Setup +===== + +dlmfs requires that the OCFS2 cluster infrastructure be in +place. Please download ocfs2-tools from the above url and configure a +cluster. + +You'll want to start heartbeating on a volume which all the nodes in +your lockspace can access. The easiest way to do this is via +ocfs2_hb_ctl (distributed with ocfs2-tools). Right now it requires +that an OCFS2 file system be in place so that it can automatically +find its heartbeat area, though it will eventually support heartbeat +against raw disks. + +Please see the ocfs2_hb_ctl and mkfs.ocfs2 manual pages distributed +with ocfs2-tools. + +Once you're heartbeating, DLM lock 'domains' can be easily created / +destroyed and locks within them accessed. + +Locking +======= + +Users may access dlmfs via standard file system calls, or they can use +'libo2dlm' (distributed with ocfs2-tools) which abstracts the file +system calls and presents a more traditional locking api. + +dlmfs handles lock caching automatically for the user, so a lock +request for an already acquired lock will not generate another DLM +call. Userspace programs are assumed to handle their own local +locking. + +Two levels of locks are supported - Shared Read, and Exclusive. +Also supported is a Trylock operation. + +For information on the libo2dlm interface, please see o2dlm.h, +distributed with ocfs2-tools. + +Lock value blocks can be read and written to a resource via read(2) +and write(2) against the fd obtained via your open(2) call. The +maximum currently supported LVB length is 64 bytes (though that is an +OCFS2 DLM limitation). Through this mechanism, users of dlmfs can share +small amounts of data amongst their nodes. + +mkdir(2) signals dlmfs to join a domain (which will have the same name +as the resulting directory) + +rmdir(2) signals dlmfs to leave the domain + +Locks for a given domain are represented by regular inodes inside the +domain directory. Locking against them is done via the open(2) system +call. + +The open(2) call will not return until your lock has been granted or +an error has occurred, unless it has been instructed to do a trylock +operation. If the lock succeeds, you'll get an fd. + +open(2) with O_CREAT to ensure the resource inode is created - dlmfs does +not automatically create inodes for existing lock resources. + +Open Flag Lock Request Type +--------- ----------------- +O_RDONLY Shared Read +O_RDWR Exclusive + +Open Flag Resulting Locking Behavior +--------- -------------------------- +O_NONBLOCK Trylock operation + +You must provide exactly one of O_RDONLY or O_RDWR. + +If O_NONBLOCK is also provided and the trylock operation was valid but +could not lock the resource then open(2) will return ETXTBUSY. + +close(2) drops the lock associated with your fd. + +Modes passed to mkdir(2) or open(2) are adhered to locally. Chown is +supported locally as well. This means you can use them to restrict +access to the resources via dlmfs on your local node only. + +The resource LVB may be read from the fd in either Shared Read or +Exclusive modes via the read(2) system call. It can be written via +write(2) only when open in Exclusive mode. + +Once written, an LVB will be visible to other nodes who obtain Read +Only or higher level locks on the resource. + +See Also +======== +http://opendlm.sourceforge.net/cvsmirror/opendlm/docs/dlmbook_final.pdf + +For more information on the VMS distributed locking API. diff --git a/Documentation/filesystems/dnotify.txt b/Documentation/filesystems/dnotify.txt new file mode 100644 index 00000000000..6baf88f4685 --- /dev/null +++ b/Documentation/filesystems/dnotify.txt @@ -0,0 +1,70 @@ + Linux Directory Notification + ============================ + + Stephen Rothwell <sfr@canb.auug.org.au> + +The intention of directory notification is to allow user applications +to be notified when a directory, or any of the files in it, are changed. +The basic mechanism involves the application registering for notification +on a directory using a fcntl(2) call and the notifications themselves +being delivered using signals. + +The application decides which "events" it wants to be notified about. +The currently defined events are: + + DN_ACCESS A file in the directory was accessed (read) + DN_MODIFY A file in the directory was modified (write,truncate) + DN_CREATE A file was created in the directory + DN_DELETE A file was unlinked from directory + DN_RENAME A file in the directory was renamed + DN_ATTRIB A file in the directory had its attributes + changed (chmod,chown) + +Usually, the application must reregister after each notification, but +if DN_MULTISHOT is or'ed with the event mask, then the registration will +remain until explicitly removed (by registering for no events). + +By default, SIGIO will be delivered to the process and no other useful +information. However, if the F_SETSIG fcntl(2) call is used to let the +kernel know which signal to deliver, a siginfo structure will be passed to +the signal handler and the si_fd member of that structure will contain the +file descriptor associated with the directory in which the event occurred. + +Preferably the application will choose one of the real time signals +(SIGRTMIN + <n>) so that the notifications may be queued. This is +especially important if DN_MULTISHOT is specified. Note that SIGRTMIN +is often blocked, so it is better to use (at least) SIGRTMIN + 1. + +Implementation expectations (features and bugs :-)) +--------------------------- + +The notification should work for any local access to files even if the +actual file system is on a remote server. This implies that remote +access to files served by local user mode servers should be notified. +Also, remote accesses to files served by a local kernel NFS server should +be notified. + +In order to make the impact on the file system code as small as possible, +the problem of hard links to files has been ignored. So if a file (x) +exists in two directories (a and b) then a change to the file using the +name "a/x" should be notified to a program expecting notifications on +directory "a", but will not be notified to one expecting notifications on +directory "b". + +Also, files that are unlinked, will still cause notifications in the +last directory that they were linked to. + +Configuration +------------- + +Dnotify is controlled via the CONFIG_DNOTIFY configuration option. When +disabled, fcntl(fd, F_NOTIFY, ...) will return -EINVAL. + +Example +------- +See Documentation/filesystems/dnotify_test.c for an example. + +NOTE +---- +Beginning with Linux 2.6.13, dnotify has been replaced by inotify. +See Documentation/filesystems/inotify.txt for more information on it. diff --git a/Documentation/filesystems/dnotify_test.c b/Documentation/filesystems/dnotify_test.c new file mode 100644 index 00000000000..8b37b4a1e18 --- /dev/null +++ b/Documentation/filesystems/dnotify_test.c @@ -0,0 +1,34 @@ +#define _GNU_SOURCE /* needed to get the defines */ +#include <fcntl.h> /* in glibc 2.2 this has the needed + values defined */ +#include <signal.h> +#include <stdio.h> +#include <unistd.h> + +static volatile int event_fd; + +static void handler(int sig, siginfo_t *si, void *data) +{ + event_fd = si->si_fd; +} + +int main(void) +{ + struct sigaction act; + int fd; + + act.sa_sigaction = handler; + sigemptyset(&act.sa_mask); + act.sa_flags = SA_SIGINFO; + sigaction(SIGRTMIN + 1, &act, NULL); + + fd = open(".", O_RDONLY); + fcntl(fd, F_SETSIG, SIGRTMIN + 1); + fcntl(fd, F_NOTIFY, DN_MODIFY|DN_CREATE|DN_MULTISHOT); + /* we will now be notified if any of the files + in "." is modified or new files are created */ + while (1) { + pause(); + printf("Got event on fd=%d\n", event_fd); + } +} diff --git a/Documentation/filesystems/ecryptfs.txt b/Documentation/filesystems/ecryptfs.txt new file mode 100644 index 00000000000..01d8a08351a --- /dev/null +++ b/Documentation/filesystems/ecryptfs.txt @@ -0,0 +1,77 @@ +eCryptfs: A stacked cryptographic filesystem for Linux + +eCryptfs is free software. Please see the file COPYING for details. +For documentation, please see the files in the doc/ subdirectory. For +building and installation instructions please see the INSTALL file. + +Maintainer: Phillip Hellewell +Lead developer: Michael A. Halcrow <mhalcrow@us.ibm.com> +Developers: Michael C. Thompson + Kent Yoder +Web Site: http://ecryptfs.sf.net + +This software is currently undergoing development. Make sure to +maintain a backup copy of any data you write into eCryptfs. + +eCryptfs requires the userspace tools downloadable from the +SourceForge site: + +http://sourceforge.net/projects/ecryptfs/ + +Userspace requirements include: + - David Howells' userspace keyring headers and libraries (version + 1.0 or higher), obtainable from + http://people.redhat.com/~dhowells/keyutils/ + - Libgcrypt + + +NOTES + +In the beta/experimental releases of eCryptfs, when you upgrade +eCryptfs, you should copy the files to an unencrypted location and +then copy the files back into the new eCryptfs mount to migrate the +files. + + +MOUNT-WIDE PASSPHRASE + +Create a new directory into which eCryptfs will write its encrypted +files (i.e., /root/crypt). Then, create the mount point directory +(i.e., /mnt/crypt). Now it's time to mount eCryptfs: + +mount -t ecryptfs /root/crypt /mnt/crypt + +You should be prompted for a passphrase and a salt (the salt may be +blank). + +Try writing a new file: + +echo "Hello, World" > /mnt/crypt/hello.txt + +The operation will complete. Notice that there is a new file in +/root/crypt that is at least 12288 bytes in size (depending on your +host page size). This is the encrypted underlying file for what you +just wrote. To test reading, from start to finish, you need to clear +the user session keyring: + +keyctl clear @u + +Then umount /mnt/crypt and mount again per the instructions given +above. + +cat /mnt/crypt/hello.txt + + +NOTES + +eCryptfs version 0.1 should only be mounted on (1) empty directories +or (2) directories containing files only created by eCryptfs. If you +mount a directory that has pre-existing files not created by eCryptfs, +then behavior is undefined. Do not run eCryptfs in higher verbosity +levels unless you are doing so for the sole purpose of debugging or +development, since secret values will be written out to the system log +in that case. + + +Mike Halcrow +mhalcrow@us.ibm.com diff --git a/Documentation/filesystems/efivarfs.txt b/Documentation/filesystems/efivarfs.txt new file mode 100644 index 00000000000..c477af086e6 --- /dev/null +++ b/Documentation/filesystems/efivarfs.txt @@ -0,0 +1,16 @@ + +efivarfs - a (U)EFI variable filesystem + +The efivarfs filesystem was created to address the shortcomings of +using entries in sysfs to maintain EFI variables. The old sysfs EFI +variables code only supported variables of up to 1024 bytes. This +limitation existed in version 0.99 of the EFI specification, but was +removed before any full releases. Since variables can now be larger +than a single page, sysfs isn't the best interface for this. + +Variables can be created, deleted and modified with the efivarfs +filesystem. + +efivarfs is typically mounted like this, + + mount -t efivarfs none /sys/firmware/efi/efivars diff --git a/Documentation/filesystems/exofs.txt b/Documentation/filesystems/exofs.txt new file mode 100644 index 00000000000..23583a13697 --- /dev/null +++ b/Documentation/filesystems/exofs.txt @@ -0,0 +1,185 @@ +=============================================================================== +WHAT IS EXOFS? +=============================================================================== + +exofs is a file system that uses an OSD and exports the API of a normal Linux +file system. Users access exofs like any other local file system, and exofs +will in turn issue commands to the local OSD initiator. + +OSD is a new T10 command set that views storage devices not as a large/flat +array of sectors but as a container of objects, each having a length, quota, +time attributes and more. Each object is addressed by a 64bit ID, and is +contained in a 64bit ID partition. Each object has associated attributes +attached to it, which are integral part of the object and provide metadata about +the object. The standard defines some common obligatory attributes, but user +attributes can be added as needed. + +=============================================================================== +ENVIRONMENT +=============================================================================== + +To use this file system, you need to have an object store to run it on. You +may download a target from: +http://open-osd.org + +See Documentation/scsi/osd.txt for how to setup a working osd environment. + +=============================================================================== +USAGE +=============================================================================== + +1. Download and compile exofs and open-osd initiator: + You need an external Kernel source tree or kernel headers from your + distribution. (anything based on 2.6.26 or later). + + a. download open-osd including exofs source using: + [parent-directory]$ git clone git://git.open-osd.org/open-osd.git + + b. Build the library module like this: + [parent-directory]$ make -C KSRC=$(KER_DIR) open-osd + + This will build both the open-osd initiator as well as the exofs kernel + module. Use whatever parameters you compiled your Kernel with and + $(KER_DIR) above pointing to the Kernel you compile against. See the file + open-osd/top-level-Makefile for an example. + +2. Get the OSD initiator and target set up properly, and login to the target. + See Documentation/scsi/osd.txt for farther instructions. Also see ./do-osd + for example script that does all these steps. + +3. Insmod the exofs.ko module: + [exofs]$ insmod exofs.ko + +4. Make sure the directory where you want to mount exists. If not, create it. + (For example, mkdir /mnt/exofs) + +5. At first run you will need to invoke the mkfs.exofs application + + As an example, this will create the file system on: + /dev/osd0 partition ID 65536 + + mkfs.exofs --pid=65536 --format /dev/osd0 + + The --format is optional. If not specified, no OSD_FORMAT will be + performed and a clean file system will be created in the specified pid, + in the available space of the target. (Use --format=size_in_meg to limit + the total LUN space available) + + If pid already exists, it will be deleted and a new one will be created in + its place. Be careful. + + An exofs lives inside a single OSD partition. You can create multiple exofs + filesystems on the same device using multiple pids. + + (run mkfs.exofs without any parameters for usage help message) + +6. Mount the file system. + + For example, to mount /dev/osd0, partition ID 0x10000 on /mnt/exofs: + + mount -t exofs -o pid=65536 /dev/osd0 /mnt/exofs/ + +7. For reference (See do-exofs example script): + do-exofs start - an example of how to perform the above steps. + do-exofs stop - an example of how to unmount the file system. + do-exofs format - an example of how to format and mkfs a new exofs. + +8. Extra compilation flags (uncomment in fs/exofs/Kbuild): + CONFIG_EXOFS_DEBUG - for debug messages and extra checks. + +=============================================================================== +exofs mount options +=============================================================================== +Similar to any mount command: + mount -t exofs -o exofs_options /dev/osdX mount_exofs_directory + +Where: + -t exofs: specifies the exofs file system + + /dev/osdX: X is a decimal number. /dev/osdX was created after a successful + login into an OSD target. + + mount_exofs_directory: The directory to mount the file system on + + exofs specific options: Options are separated by commas (,) + pid=<integer> - The partition number to mount/create as + container of the filesystem. + This option is mandatory. integer can be + Hex by pre-pending an 0x to the number. + osdname=<id> - Mount by a device's osdname. + osdname is usually a 36 character uuid of the + form "d2683732-c906-4ee1-9dbd-c10c27bb40df". + It is one of the device's uuid specified in the + mkfs.exofs format command. + If this option is specified then the /dev/osdX + above can be empty and is ignored. + to=<integer> - Timeout in ticks for a single command. + default is (60 * HZ) [for debugging only] + +=============================================================================== +DESIGN +=============================================================================== + +* The file system control block (AKA on-disk superblock) resides in an object + with a special ID (defined in common.h). + Information included in the file system control block is used to fill the + in-memory superblock structure at mount time. This object is created before + the file system is used by mkexofs.c. It contains information such as: + - The file system's magic number + - The next inode number to be allocated + +* Each file resides in its own object and contains the data (and it will be + possible to extend the file over multiple objects, though this has not been + implemented yet). + +* A directory is treated as a file, and essentially contains a list of <file + name, inode #> pairs for files that are found in that directory. The object + IDs correspond to the files' inode numbers and will be allocated according to + a bitmap (stored in a separate object). Now they are allocated using a + counter. + +* Each file's control block (AKA on-disk inode) is stored in its object's + attributes. This applies to both regular files and other types (directories, + device files, symlinks, etc.). + +* Credentials are generated per object (inode and superblock) when they are + created in memory (read from disk or created). The credential works for all + operations and is used as long as the object remains in memory. + +* Async OSD operations are used whenever possible, but the target may execute + them out of order. The operations that concern us are create, delete, + readpage, writepage, update_inode, and truncate. The following pairs of + operations should execute in the order written, and we need to prevent them + from executing in reverse order: + - The following are handled with the OBJ_CREATED and OBJ_2BCREATED + flags. OBJ_CREATED is set when we know the object exists on the OSD - + in create's callback function, and when we successfully do a + read_inode. + OBJ_2BCREATED is set in the beginning of the create function, so we + know that we should wait. + - create/delete: delete should wait until the object is created + on the OSD. + - create/readpage: readpage should be able to return a page + full of zeroes in this case. If there was a write already + en-route (i.e. create, writepage, readpage) then the page + would be locked, and so it would really be the same as + create/writepage. + - create/writepage: if writepage is called for a sync write, it + should wait until the object is created on the OSD. + Otherwise, it should just return. + - create/truncate: truncate should wait until the object is + created on the OSD. + - create/update_inode: update_inode should wait until the + object is created on the OSD. + - Handled by VFS locks: + - readpage/delete: shouldn't happen because of page lock. + - writepage/delete: shouldn't happen because of page lock. + - readpage/writepage: shouldn't happen because of page lock. + +=============================================================================== +LICENSE/COPYRIGHT +=============================================================================== +The exofs file system is based on ext2 v0.5b (distributed with the Linux kernel +version 2.6.10). All files include the original copyrights, and the license +is GPL version 2 (only version 2, as is true for the Linux kernel). The +Linux kernel can be downloaded from www.kernel.org. diff --git a/Documentation/filesystems/ext2.txt b/Documentation/filesystems/ext2.txt index d16334ec48b..67639f905f1 100644 --- a/Documentation/filesystems/ext2.txt +++ b/Documentation/filesystems/ext2.txt @@ -17,8 +17,6 @@ set using tune2fs(8). Kernel-determined defaults are indicated by (*). bsddf (*) Makes `df' act like BSD. minixdf Makes `df' act like Minix. -check Check block and inode bitmaps at mount time - (requires CONFIG_EXT2_CHECK). check=none, nocheck (*) Don't do extra checking of bitmaps on mount (check=normal and check=strict options removed) @@ -207,7 +205,7 @@ Reserved Space In ext2, there is a mechanism for reserving a certain number of blocks for a particular user (normally the super-user). This is intended to -allow for the system to continue functioning even if non-priveleged users +allow for the system to continue functioning even if non-privileged users fill up all the space available to them (this is independent of filesystem quotas). It also keeps the filesystem from filling up entirely which helps combat fragmentation. @@ -324,7 +322,7 @@ an upper limit on the block size imposed by the page size of the kernel, so 8kB blocks are only allowed on Alpha systems (and other architectures which support larger pages). -There is an upper limit of 32768 subdirectories in a single directory. +There is an upper limit of 32000 subdirectories in a single directory. There is a "soft" upper limit of about 10-15k files in a single directory with the current linear linked-list directory implementation. This limit @@ -371,15 +369,15 @@ The kernel source file:/usr/src/linux/fs/ext2/ e2fsprogs (e2fsck) http://e2fsprogs.sourceforge.net/ Design & Implementation http://e2fsprogs.sourceforge.net/ext2intro.html Journaling (ext3) ftp://ftp.uk.linux.org/pub/linux/sct/fs/jfs/ -Hashed Directories http://kernelnewbies.org/~phillips/htree/ Filesystem Resizing http://ext2resize.sourceforge.net/ -Compression (*) http://www.netspace.net.au/~reiter/e2compr/ +Compression (*) http://e2compr.sourceforge.net/ Implementations for: -Windows 95/98/NT/2000 http://uranus.it.swin.edu.au/~jn/linux/Explore2fs.htm -Windows 95 (*) http://www.yipton.demon.co.uk/content.html#FSDEXT2 +Windows 95/98/NT/2000 http://www.chrysocome.net/explore2fs +Windows 95 (*) http://www.yipton.net/content.html#FSDEXT2 DOS client (*) ftp://metalab.unc.edu/pub/Linux/system/filesystems/ext2/ -OS/2 http://perso.wanadoo.fr/matthieu.willm/ext2-os2/ -RISC OS client ftp://ftp.barnet.ac.uk/pub/acorn/armlinux/iscafs/ +OS/2 (+) ftp://metalab.unc.edu/pub/Linux/system/filesystems/ext2/ +RISC OS client http://www.esw-heim.tu-clausthal.de/~marco/smorbrod/IscaFS/ (*) no longer actively developed/supported (as of Apr 2001) +(+) no longer actively developed/supported (as of Mar 2009) diff --git a/Documentation/filesystems/ext3.txt b/Documentation/filesystems/ext3.txt index 9ab7f446f7a..7ed0d17d672 100644 --- a/Documentation/filesystems/ext3.txt +++ b/Documentation/filesystems/ext3.txt @@ -2,11 +2,11 @@ Ext3 Filesystem =============== -ext3 was originally released in September 1999. Written by Stephen Tweedie -for 2.2 branch, and ported to 2.4 kernels by Peter Braam, Andreas Dilger, +Ext3 was originally released in September 1999. Written by Stephen Tweedie +for the 2.2 branch, and ported to 2.4 kernels by Peter Braam, Andreas Dilger, Andrew Morton, Alexander Viro, Ted Ts'o and Stephen Tweedie. -ext3 is ext2 filesystem enhanced with journalling capabilities. +Ext3 is the ext2 filesystem enhanced with journalling capabilities. Options ======= @@ -14,85 +14,109 @@ Options When mounting an ext3 filesystem, the following option are accepted: (*) == default -jounal=update Update the ext3 file system's journal to the - current format. +ro Mount filesystem read only. Note that ext3 will replay + the journal (and thus write to the partition) even when + mounted "read only". Mount options "ro,noload" can be + used to prevent writes to the filesystem. -journal=inum When a journal already exists, this option is - ignored. Otherwise, it specifies the number of - the inode which will represent the ext3 file - system's journal file. +journal=update Update the ext3 file system's journal to the current + format. -noload Don't load the journal on mounting. +journal=inum When a journal already exists, this option is ignored. + Otherwise, it specifies the number of the inode which + will represent the ext3 file system's journal file. -data=journal All data are committed into the journal prior - to being written into the main file system. +journal_path=path +journal_dev=devnum When the external journal device's major/minor numbers + have changed, these options allow the user to specify + the new journal location. The journal device is + identified through either its new major/minor numbers + encoded in devnum, or via a path to the device. + +norecovery Don't load the journal on mounting. Note that this forces +noload mount of inconsistent filesystem, which can lead to + various problems. + +data=journal All data are committed into the journal prior to being + written into the main file system. data=ordered (*) All data are forced directly out to the main file - system prior to its metadata being committed to - the journal. + system prior to its metadata being committed to the + journal. -data=writeback Data ordering is not preserved, data may be - written into the main file system after its - metadata has been committed to the journal. +data=writeback Data ordering is not preserved, data may be written + into the main file system after its metadata has been + committed to the journal. commit=nrsec (*) Ext3 can be told to sync all its data and metadata every 'nrsec' seconds. The default value is 5 seconds. - This means that if you lose your power, you will lose, - as much, the latest 5 seconds of work (your filesystem - will not be damaged though, thanks to journaling). This - default value (or any low value) will hurt performance, - but it's good for data-safety. Setting it to 0 will - have the same effect than leaving the default 5 sec. + This means that if you lose your power, you will lose + as much as the latest 5 seconds of work (your + filesystem will not be damaged though, thanks to the + journaling). This default value (or any low value) + will hurt performance, but it's good for data-safety. + Setting it to 0 will have the same effect as leaving + it at the default (5 seconds). Setting it to very large values will improve performance. -barrier=1 This enables/disables barriers. barrier=0 disables it, - barrier=1 enables it. - -orlov (*) This enables the new Orlov block allocator. It's enabled - by default. - -oldalloc This disables the Orlov block allocator and enables the - old block allocator. Orlov should have better performance, - we'd like to get some feedback if it's the contrary for - you. - -user_xattr (*) Enables POSIX Extended Attributes. It's enabled by - default, however you need to confifure its support - (CONFIG_EXT3_FS_XATTR). This is neccesary if you want - to use POSIX Acces Control Lists support. You can visit - http://acl.bestbits.at to know more about POSIX Extended - attributes. - -nouser_xattr Disables POSIX Extended Attributes. - -acl (*) Enables POSIX Access Control Lists support. This is - enabled by default, however you need to configure - its support (CONFIG_EXT3_FS_POSIX_ACL). If you want - to know more about ACLs visit http://acl.bestbits.at - -noacl This option disables POSIX Access Control List support. +barrier=<0|1(*)> This enables/disables the use of write barriers in +barrier (*) the jbd code. barrier=0 disables, barrier=1 enables. +nobarrier This also requires an IO stack which can support + barriers, and if jbd gets an error on a barrier + write, it will disable again with a warning. + Write barriers enforce proper on-disk ordering + of journal commits, making volatile disk write caches + safe to use, at some performance penalty. If + your disks are battery-backed in one way or another, + disabling barriers may safely improve performance. + The mount options "barrier" and "nobarrier" can + also be used to enable or disable barriers, for + consistency with other ext3 mount options. + +user_xattr Enables Extended User Attributes. Additionally, you + need to have extended attribute support enabled in the + kernel configuration (CONFIG_EXT3_FS_XATTR). See the + attr(5) manual page and http://acl.bestbits.at/ to + learn more about extended attributes. + +nouser_xattr Disables Extended User Attributes. + +acl Enables POSIX Access Control Lists support. + Additionally, you need to have ACL support enabled in + the kernel configuration (CONFIG_EXT3_FS_POSIX_ACL). + See the acl(5) manual page and http://acl.bestbits.at/ + for more information. + +noacl This option disables POSIX Access Control List + support. reservation noreservation -resize= - bsddf (*) Make 'df' act like BSD. minixdf Make 'df' act like Minix. check=none Don't do extra checking of bitmaps on mount. -nocheck +nocheck debug Extra debugging information is sent to syslog. -errors=remount-ro(*) Remount the filesystem read-only on an error. +errors=remount-ro Remount the filesystem read-only on an error. errors=continue Keep going on a filesystem error. errors=panic Panic and halt the machine if an error occurs. + (These mount options override the errors behavior + specified in the superblock, which can be + configured using tune2fs.) + +data_err=ignore(*) Just print an error message if an error occurs + in a file data buffer in ordered mode. +data_err=abort Abort the journal if an error occurs in a file + data buffer in ordered mode. grpid Give objects the same group ID as their creator. -bsdgroups +bsdgroups nogrpid (*) New objects have the group ID of their creator. sysvgroups @@ -103,81 +127,89 @@ resuid=n The user ID which may use the reserved blocks. sb=n Use alternate superblock at this location. -quota Quota options are currently silently ignored. -noquota (see fs/ext3/super.c, line 594) -grpquota -usrquota +quota These options are ignored by the filesystem. They +noquota are used only by quota tools to recognize volumes +grpquota where quota should be turned on. See documentation +usrquota in the quota-tools package for more details + (http://sourceforge.net/projects/linuxquota). +jqfmt=<quota type> These options tell filesystem details about quota +usrjquota=<file> so that quota information can be properly updated +grpjquota=<file> during journal replay. They replace the above + quota options. See documentation in the quota-tools + package for more details + (http://sourceforge.net/projects/linuxquota). Specification ============= -ext3 shares all disk implementation with ext2 filesystem, and add -transactions capabilities to ext2. Journaling is done by the -Journaling block device layer. +Ext3 shares all disk implementation with the ext2 filesystem, and adds +transactions capabilities to ext2. Journaling is done by the Journaling Block +Device layer. Journaling Block Device layer ----------------------------- -The Journaling Block Device layer (JBD) isn't ext3 specific. It was -design to add journaling capabilities on a block device. The ext3 -filesystem code will inform the JBD of modifications it is performing -(Call a transaction). the journal support the transactions start and -stop, and in case of crash, the journal can replayed the transactions -to put the partition on a consistent state fastly. +The Journaling Block Device layer (JBD) isn't ext3 specific. It was designed +to add journaling capabilities to a block device. The ext3 filesystem code +will inform the JBD of modifications it is performing (called a transaction). +The journal supports the transactions start and stop, and in case of a crash, +the journal can replay the transactions to quickly put the partition back into +a consistent state. -handles represent a single atomic update to a filesystem. JBD can -handle external journal on a block device. +Handles represent a single atomic update to a filesystem. JBD can handle an +external journal on a block device. Data Mode --------- -There's 3 different data modes: +There are 3 different data modes: * writeback mode -In data=writeback mode, ext3 does not journal data at all. This mode -provides a similar level of journaling as XFS, JFS, and ReiserFS in its -default mode - metadata journaling. A crash+recovery can cause -incorrect data to appear in files which were written shortly before the -crash. This mode will typically provide the best ext3 performance. +In data=writeback mode, ext3 does not journal data at all. This mode provides +a similar level of journaling as that of XFS, JFS, and ReiserFS in its default +mode - metadata journaling. A crash+recovery can cause incorrect data to +appear in files which were written shortly before the crash. This mode will +typically provide the best ext3 performance. * ordered mode -In data=ordered mode, ext3 only officially journals metadata, but it -logically groups metadata and data blocks into a single unit called a -transaction. When it's time to write the new metadata out to disk, the -associated data blocks are written first. In general, this mode -perform slightly slower than writeback but significantly faster than -journal mode. +In data=ordered mode, ext3 only officially journals metadata, but it logically +groups metadata and data blocks into a single unit called a transaction. When +it's time to write the new metadata out to disk, the associated data blocks +are written first. In general, this mode performs slightly slower than +writeback but significantly faster than journal mode. * journal mode -data=journal mode provides full data and metadata journaling. All new -data is written to the journal first, and then to its final location. -In the event of a crash, the journal can be replayed, bringing both -data and metadata into a consistent state. This mode is the slowest -except when data needs to be read from and written to disk at the same -time where it outperform all others mode. +data=journal mode provides full data and metadata journaling. All new data is +written to the journal first, and then to its final location. +In the event of a crash, the journal can be replayed, bringing both data and +metadata into a consistent state. This mode is the slowest except when data +needs to be read from and written to disk at the same time where it +outperforms all other modes. Compatibility ------------- Ext2 partitions can be easily convert to ext3, with `tune2fs -j <dev>`. -Ext3 is fully compatible with Ext2. Ext3 partitions can easily be -mounted as Ext2. +Ext3 is fully compatible with Ext2. Ext3 partitions can easily be mounted as +Ext2. + External Tools ============== -see manual pages to know more. +See manual pages to learn more. + +tune2fs: create a ext3 journal on a ext2 partition with the -j flag. +mke2fs: create a ext3 partition with the -j flag. +debugfs: ext2 and ext3 file system debugger. +ext2online: online (mounted) ext2 and ext3 filesystem resizer -tune2fs: create a ext3 journal on a ext2 partition with the -j flags -mke2fs: create a ext3 partition with the -j flags -debugfs: ext2 and ext3 file system debugger References ========== -kernel source: file:/usr/src/linux/fs/ext3 - file:/usr/src/linux/fs/jbd +kernel source: <file:fs/ext3/> + <file:fs/jbd/> -programs: http://e2fsprogs.sourceforge.net +programs: http://e2fsprogs.sourceforge.net/ + http://ext2resize.sourceforge.net -useful link: - http://www.zip.com.au/~akpm/linux/ext3/ext3-usage.html - http://www-106.ibm.com/developerworks/linux/library/l-fs7/ - http://www-106.ibm.com/developerworks/linux/library/l-fs8/ +useful links: http://www.ibm.com/developerworks/library/l-fs7/index.html + http://www.ibm.com/developerworks/library/l-fs8/index.html diff --git a/Documentation/filesystems/ext4.txt b/Documentation/filesystems/ext4.txt new file mode 100644 index 00000000000..919a3293aaa --- /dev/null +++ b/Documentation/filesystems/ext4.txt @@ -0,0 +1,625 @@ + +Ext4 Filesystem +=============== + +Ext4 is an advanced level of the ext3 filesystem which incorporates +scalability and reliability enhancements for supporting large filesystems +(64 bit) in keeping with increasing disk capacities and state-of-the-art +feature requirements. + +Mailing list: linux-ext4@vger.kernel.org +Web site: http://ext4.wiki.kernel.org + + +1. Quick usage instructions: +=========================== + +Note: More extensive information for getting started with ext4 can be + found at the ext4 wiki site at the URL: + http://ext4.wiki.kernel.org/index.php/Ext4_Howto + + - Compile and install the latest version of e2fsprogs (as of this + writing version 1.41.3) from: + + http://sourceforge.net/project/showfiles.php?group_id=2406 + + or + + ftp://ftp.kernel.org/pub/linux/kernel/people/tytso/e2fsprogs/ + + or grab the latest git repository from: + + git://git.kernel.org/pub/scm/fs/ext2/e2fsprogs.git + + - Note that it is highly important to install the mke2fs.conf file + that comes with the e2fsprogs 1.41.x sources in /etc/mke2fs.conf. If + you have edited the /etc/mke2fs.conf file installed on your system, + you will need to merge your changes with the version from e2fsprogs + 1.41.x. + + - Create a new filesystem using the ext4 filesystem type: + + # mke2fs -t ext4 /dev/hda1 + + Or to configure an existing ext3 filesystem to support extents: + + # tune2fs -O extents /dev/hda1 + + If the filesystem was created with 128 byte inodes, it can be + converted to use 256 byte for greater efficiency via: + + # tune2fs -I 256 /dev/hda1 + + (Note: we currently do not have tools to convert an ext4 + filesystem back to ext3; so please do not do try this on production + filesystems.) + + - Mounting: + + # mount -t ext4 /dev/hda1 /wherever + + - When comparing performance with other filesystems, it's always + important to try multiple workloads; very often a subtle change in a + workload parameter can completely change the ranking of which + filesystems do well compared to others. When comparing versus ext3, + note that ext4 enables write barriers by default, while ext3 does + not enable write barriers by default. So it is useful to use + explicitly specify whether barriers are enabled or not when via the + '-o barriers=[0|1]' mount option for both ext3 and ext4 filesystems + for a fair comparison. When tuning ext3 for best benchmark numbers, + it is often worthwhile to try changing the data journaling mode; '-o + data=writeback' can be faster for some workloads. (Note however that + running mounted with data=writeback can potentially leave stale data + exposed in recently written files in case of an unclean shutdown, + which could be a security exposure in some situations.) Configuring + the filesystem with a large journal can also be helpful for + metadata-intensive workloads. + +2. Features +=========== + +2.1 Currently available + +* ability to use filesystems > 16TB (e2fsprogs support not available yet) +* extent format reduces metadata overhead (RAM, IO for access, transactions) +* extent format more robust in face of on-disk corruption due to magics, +* internal redundancy in tree +* improved file allocation (multi-block alloc) +* lift 32000 subdirectory limit imposed by i_links_count[1] +* nsec timestamps for mtime, atime, ctime, create time +* inode version field on disk (NFSv4, Lustre) +* reduced e2fsck time via uninit_bg feature +* journal checksumming for robustness, performance +* persistent file preallocation (e.g for streaming media, databases) +* ability to pack bitmaps and inode tables into larger virtual groups via the + flex_bg feature +* large file support +* Inode allocation using large virtual block groups via flex_bg +* delayed allocation +* large block (up to pagesize) support +* efficient new ordered mode in JBD2 and ext4(avoid using buffer head to force + the ordering) + +[1] Filesystems with a block size of 1k may see a limit imposed by the +directory hash tree having a maximum depth of two. + +2.2 Candidate features for future inclusion + +* Online defrag (patches available but not well tested) +* reduced mke2fs time via lazy itable initialization in conjunction with + the uninit_bg feature (capability to do this is available in e2fsprogs + but a kernel thread to do lazy zeroing of unused inode table blocks + after filesystem is first mounted is required for safety) + +There are several others under discussion, whether they all make it in is +partly a function of how much time everyone has to work on them. Features like +metadata checksumming have been discussed and planned for a bit but no patches +exist yet so I'm not sure they're in the near-term roadmap. + +The big performance win will come with mballoc, delalloc and flex_bg +grouping of bitmaps and inode tables. Some test results available here: + + - http://www.bullopensource.org/ext4/20080818-ffsb/ffsb-write-2.6.27-rc1.html + - http://www.bullopensource.org/ext4/20080818-ffsb/ffsb-readwrite-2.6.27-rc1.html + +3. Options +========== + +When mounting an ext4 filesystem, the following option are accepted: +(*) == default + +ro Mount filesystem read only. Note that ext4 will + replay the journal (and thus write to the + partition) even when mounted "read only". The + mount options "ro,noload" can be used to prevent + writes to the filesystem. + +journal_checksum Enable checksumming of the journal transactions. + This will allow the recovery code in e2fsck and the + kernel to detect corruption in the kernel. It is a + compatible change and will be ignored by older kernels. + +journal_async_commit Commit block can be written to disk without waiting + for descriptor blocks. If enabled older kernels cannot + mount the device. This will enable 'journal_checksum' + internally. + +journal_path=path +journal_dev=devnum When the external journal device's major/minor numbers + have changed, these options allow the user to specify + the new journal location. The journal device is + identified through either its new major/minor numbers + encoded in devnum, or via a path to the device. + +norecovery Don't load the journal on mounting. Note that +noload if the filesystem was not unmounted cleanly, + skipping the journal replay will lead to the + filesystem containing inconsistencies that can + lead to any number of problems. + +data=journal All data are committed into the journal prior to being + written into the main file system. Enabling + this mode will disable delayed allocation and + O_DIRECT support. + +data=ordered (*) All data are forced directly out to the main file + system prior to its metadata being committed to the + journal. + +data=writeback Data ordering is not preserved, data may be written + into the main file system after its metadata has been + committed to the journal. + +commit=nrsec (*) Ext4 can be told to sync all its data and metadata + every 'nrsec' seconds. The default value is 5 seconds. + This means that if you lose your power, you will lose + as much as the latest 5 seconds of work (your + filesystem will not be damaged though, thanks to the + journaling). This default value (or any low value) + will hurt performance, but it's good for data-safety. + Setting it to 0 will have the same effect as leaving + it at the default (5 seconds). + Setting it to very large values will improve + performance. + +barrier=<0|1(*)> This enables/disables the use of write barriers in +barrier(*) the jbd code. barrier=0 disables, barrier=1 enables. +nobarrier This also requires an IO stack which can support + barriers, and if jbd gets an error on a barrier + write, it will disable again with a warning. + Write barriers enforce proper on-disk ordering + of journal commits, making volatile disk write caches + safe to use, at some performance penalty. If + your disks are battery-backed in one way or another, + disabling barriers may safely improve performance. + The mount options "barrier" and "nobarrier" can + also be used to enable or disable barriers, for + consistency with other ext4 mount options. + +inode_readahead_blks=n This tuning parameter controls the maximum + number of inode table blocks that ext4's inode + table readahead algorithm will pre-read into + the buffer cache. The default value is 32 blocks. + +nouser_xattr Disables Extended User Attributes. See the + attr(5) manual page and http://acl.bestbits.at/ + for more information about extended attributes. + +noacl This option disables POSIX Access Control List + support. If ACL support is enabled in the kernel + configuration (CONFIG_EXT4_FS_POSIX_ACL), ACL is + enabled by default on mount. See the acl(5) manual + page and http://acl.bestbits.at/ for more information + about acl. + +bsddf (*) Make 'df' act like BSD. +minixdf Make 'df' act like Minix. + +debug Extra debugging information is sent to syslog. + +abort Simulate the effects of calling ext4_abort() for + debugging purposes. This is normally used while + remounting a filesystem which is already mounted. + +errors=remount-ro Remount the filesystem read-only on an error. +errors=continue Keep going on a filesystem error. +errors=panic Panic and halt the machine if an error occurs. + (These mount options override the errors behavior + specified in the superblock, which can be configured + using tune2fs) + +data_err=ignore(*) Just print an error message if an error occurs + in a file data buffer in ordered mode. +data_err=abort Abort the journal if an error occurs in a file + data buffer in ordered mode. + +grpid Give objects the same group ID as their creator. +bsdgroups + +nogrpid (*) New objects have the group ID of their creator. +sysvgroups + +resgid=n The group ID which may use the reserved blocks. + +resuid=n The user ID which may use the reserved blocks. + +sb=n Use alternate superblock at this location. + +quota These options are ignored by the filesystem. They +noquota are used only by quota tools to recognize volumes +grpquota where quota should be turned on. See documentation +usrquota in the quota-tools package for more details + (http://sourceforge.net/projects/linuxquota). + +jqfmt=<quota type> These options tell filesystem details about quota +usrjquota=<file> so that quota information can be properly updated +grpjquota=<file> during journal replay. They replace the above + quota options. See documentation in the quota-tools + package for more details + (http://sourceforge.net/projects/linuxquota). + +stripe=n Number of filesystem blocks that mballoc will try + to use for allocation size and alignment. For RAID5/6 + systems this should be the number of data + disks * RAID chunk size in file system blocks. + +delalloc (*) Defer block allocation until just before ext4 + writes out the block(s) in question. This + allows ext4 to better allocation decisions + more efficiently. +nodelalloc Disable delayed allocation. Blocks are allocated + when the data is copied from userspace to the + page cache, either via the write(2) system call + or when an mmap'ed page which was previously + unallocated is written for the first time. + +max_batch_time=usec Maximum amount of time ext4 should wait for + additional filesystem operations to be batch + together with a synchronous write operation. + Since a synchronous write operation is going to + force a commit and then a wait for the I/O + complete, it doesn't cost much, and can be a + huge throughput win, we wait for a small amount + of time to see if any other transactions can + piggyback on the synchronous write. The + algorithm used is designed to automatically tune + for the speed of the disk, by measuring the + amount of time (on average) that it takes to + finish committing a transaction. Call this time + the "commit time". If the time that the + transaction has been running is less than the + commit time, ext4 will try sleeping for the + commit time to see if other operations will join + the transaction. The commit time is capped by + the max_batch_time, which defaults to 15000us + (15ms). This optimization can be turned off + entirely by setting max_batch_time to 0. + +min_batch_time=usec This parameter sets the commit time (as + described above) to be at least min_batch_time. + It defaults to zero microseconds. Increasing + this parameter may improve the throughput of + multi-threaded, synchronous workloads on very + fast disks, at the cost of increasing latency. + +journal_ioprio=prio The I/O priority (from 0 to 7, where 0 is the + highest priority) which should be used for I/O + operations submitted by kjournald2 during a + commit operation. This defaults to 3, which is + a slightly higher priority than the default I/O + priority. + +auto_da_alloc(*) Many broken applications don't use fsync() when +noauto_da_alloc replacing existing files via patterns such as + fd = open("foo.new")/write(fd,..)/close(fd)/ + rename("foo.new", "foo"), or worse yet, + fd = open("foo", O_TRUNC)/write(fd,..)/close(fd). + If auto_da_alloc is enabled, ext4 will detect + the replace-via-rename and replace-via-truncate + patterns and force that any delayed allocation + blocks are allocated such that at the next + journal commit, in the default data=ordered + mode, the data blocks of the new file are forced + to disk before the rename() operation is + committed. This provides roughly the same level + of guarantees as ext3, and avoids the + "zero-length" problem that can happen when a + system crashes before the delayed allocation + blocks are forced to disk. + +noinit_itable Do not initialize any uninitialized inode table + blocks in the background. This feature may be + used by installation CD's so that the install + process can complete as quickly as possible; the + inode table initialization process would then be + deferred until the next time the file system + is unmounted. + +init_itable=n The lazy itable init code will wait n times the + number of milliseconds it took to zero out the + previous block group's inode table. This + minimizes the impact on the system performance + while file system's inode table is being initialized. + +discard Controls whether ext4 should issue discard/TRIM +nodiscard(*) commands to the underlying block device when + blocks are freed. This is useful for SSD devices + and sparse/thinly-provisioned LUNs, but it is off + by default until sufficient testing has been done. + +nouid32 Disables 32-bit UIDs and GIDs. This is for + interoperability with older kernels which only + store and expect 16-bit values. + +block_validity This options allows to enables/disables the in-kernel +noblock_validity facility for tracking filesystem metadata blocks + within internal data structures. This allows multi- + block allocator and other routines to quickly locate + extents which might overlap with filesystem metadata + blocks. This option is intended for debugging + purposes and since it negatively affects the + performance, it is off by default. + +dioread_lock Controls whether or not ext4 should use the DIO read +dioread_nolock locking. If the dioread_nolock option is specified + ext4 will allocate uninitialized extent before buffer + write and convert the extent to initialized after IO + completes. This approach allows ext4 code to avoid + using inode mutex, which improves scalability on high + speed storages. However this does not work with + data journaling and dioread_nolock option will be + ignored with kernel warning. Note that dioread_nolock + code path is only used for extent-based files. + Because of the restrictions this options comprises + it is off by default (e.g. dioread_lock). + +max_dir_size_kb=n This limits the size of directories so that any + attempt to expand them beyond the specified + limit in kilobytes will cause an ENOSPC error. + This is useful in memory constrained + environments, where a very large directory can + cause severe performance problems or even + provoke the Out Of Memory killer. (For example, + if there is only 512mb memory available, a 176mb + directory may seriously cramp the system's style.) + +i_version Enable 64-bit inode version support. This option is + off by default. + +Data Mode +========= +There are 3 different data modes: + +* writeback mode +In data=writeback mode, ext4 does not journal data at all. This mode provides +a similar level of journaling as that of XFS, JFS, and ReiserFS in its default +mode - metadata journaling. A crash+recovery can cause incorrect data to +appear in files which were written shortly before the crash. This mode will +typically provide the best ext4 performance. + +* ordered mode +In data=ordered mode, ext4 only officially journals metadata, but it logically +groups metadata information related to data changes with the data blocks into a +single unit called a transaction. When it's time to write the new metadata +out to disk, the associated data blocks are written first. In general, +this mode performs slightly slower than writeback but significantly faster than journal mode. + +* journal mode +data=journal mode provides full data and metadata journaling. All new data is +written to the journal first, and then to its final location. +In the event of a crash, the journal can be replayed, bringing both data and +metadata into a consistent state. This mode is the slowest except when data +needs to be read from and written to disk at the same time where it +outperforms all others modes. Enabling this mode will disable delayed +allocation and O_DIRECT support. + +/proc entries +============= + +Information about mounted ext4 file systems can be found in +/proc/fs/ext4. Each mounted filesystem will have a directory in +/proc/fs/ext4 based on its device name (i.e., /proc/fs/ext4/hdc or +/proc/fs/ext4/dm-0). The files in each per-device directory are shown +in table below. + +Files in /proc/fs/ext4/<devname> +.............................................................................. + File Content + mb_groups details of multiblock allocator buddy cache of free blocks +.............................................................................. + +/sys entries +============ + +Information about mounted ext4 file systems can be found in +/sys/fs/ext4. Each mounted filesystem will have a directory in +/sys/fs/ext4 based on its device name (i.e., /sys/fs/ext4/hdc or +/sys/fs/ext4/dm-0). The files in each per-device directory are shown +in table below. + +Files in /sys/fs/ext4/<devname> +(see also Documentation/ABI/testing/sysfs-fs-ext4) +.............................................................................. + File Content + + delayed_allocation_blocks This file is read-only and shows the number of + blocks that are dirty in the page cache, but + which do not have their location in the + filesystem allocated yet. + + inode_goal Tuning parameter which (if non-zero) controls + the goal inode used by the inode allocator in + preference to all other allocation heuristics. + This is intended for debugging use only, and + should be 0 on production systems. + + inode_readahead_blks Tuning parameter which controls the maximum + number of inode table blocks that ext4's inode + table readahead algorithm will pre-read into + the buffer cache + + lifetime_write_kbytes This file is read-only and shows the number of + kilobytes of data that have been written to this + filesystem since it was created. + + max_writeback_mb_bump The maximum number of megabytes the writeback + code will try to write out before move on to + another inode. + + mb_group_prealloc The multiblock allocator will round up allocation + requests to a multiple of this tuning parameter if + the stripe size is not set in the ext4 superblock + + mb_max_to_scan The maximum number of extents the multiblock + allocator will search to find the best extent + + mb_min_to_scan The minimum number of extents the multiblock + allocator will search to find the best extent + + mb_order2_req Tuning parameter which controls the minimum size + for requests (as a power of 2) where the buddy + cache is used + + mb_stats Controls whether the multiblock allocator should + collect statistics, which are shown during the + unmount. 1 means to collect statistics, 0 means + not to collect statistics + + mb_stream_req Files which have fewer blocks than this tunable + parameter will have their blocks allocated out + of a block group specific preallocation pool, so + that small files are packed closely together. + Each large file will have its blocks allocated + out of its own unique preallocation pool. + + session_write_kbytes This file is read-only and shows the number of + kilobytes of data that have been written to this + filesystem since it was mounted. + + reserved_clusters This is RW file and contains number of reserved + clusters in the file system which will be used + in the specific situations to avoid costly + zeroout, unexpected ENOSPC, or possible data + loss. The default is 2% or 4096 clusters, + whichever is smaller and this can be changed + however it can never exceed number of clusters + in the file system. If there is not enough space + for the reserved space when mounting the file + mount will _not_ fail. +.............................................................................. + +Ioctls +====== + +There is some Ext4 specific functionality which can be accessed by applications +through the system call interfaces. The list of all Ext4 specific ioctls are +shown in the table below. + +Table of Ext4 specific ioctls +.............................................................................. + Ioctl Description + EXT4_IOC_GETFLAGS Get additional attributes associated with inode. + The ioctl argument is an integer bitfield, with + bit values described in ext4.h. This ioctl is an + alias for FS_IOC_GETFLAGS. + + EXT4_IOC_SETFLAGS Set additional attributes associated with inode. + The ioctl argument is an integer bitfield, with + bit values described in ext4.h. This ioctl is an + alias for FS_IOC_SETFLAGS. + + EXT4_IOC_GETVERSION + EXT4_IOC_GETVERSION_OLD + Get the inode i_generation number stored for + each inode. The i_generation number is normally + changed only when new inode is created and it is + particularly useful for network filesystems. The + '_OLD' version of this ioctl is an alias for + FS_IOC_GETVERSION. + + EXT4_IOC_SETVERSION + EXT4_IOC_SETVERSION_OLD + Set the inode i_generation number stored for + each inode. The '_OLD' version of this ioctl + is an alias for FS_IOC_SETVERSION. + + EXT4_IOC_GROUP_EXTEND This ioctl has the same purpose as the resize + mount option. It allows to resize filesystem + to the end of the last existing block group, + further resize has to be done with resize2fs, + either online, or offline. The argument points + to the unsigned logn number representing the + filesystem new block count. + + EXT4_IOC_MOVE_EXT Move the block extents from orig_fd (the one + this ioctl is pointing to) to the donor_fd (the + one specified in move_extent structure passed + as an argument to this ioctl). Then, exchange + inode metadata between orig_fd and donor_fd. + This is especially useful for online + defragmentation, because the allocator has the + opportunity to allocate moved blocks better, + ideally into one contiguous extent. + + EXT4_IOC_GROUP_ADD Add a new group descriptor to an existing or + new group descriptor block. The new group + descriptor is described by ext4_new_group_input + structure, which is passed as an argument to + this ioctl. This is especially useful in + conjunction with EXT4_IOC_GROUP_EXTEND, + which allows online resize of the filesystem + to the end of the last existing block group. + Those two ioctls combined is used in userspace + online resize tool (e.g. resize2fs). + + EXT4_IOC_MIGRATE This ioctl operates on the filesystem itself. + It converts (migrates) ext3 indirect block mapped + inode to ext4 extent mapped inode by walking + through indirect block mapping of the original + inode and converting contiguous block ranges + into ext4 extents of the temporary inode. Then, + inodes are swapped. This ioctl might help, when + migrating from ext3 to ext4 filesystem, however + suggestion is to create fresh ext4 filesystem + and copy data from the backup. Note, that + filesystem has to support extents for this ioctl + to work. + + EXT4_IOC_ALLOC_DA_BLKS Force all of the delay allocated blocks to be + allocated to preserve application-expected ext3 + behaviour. Note that this will also start + triggering a write of the data blocks, but this + behaviour may change in the future as it is + not necessary and has been done this way only + for sake of simplicity. + + EXT4_IOC_RESIZE_FS Resize the filesystem to a new size. The number + of blocks of resized filesystem is passed in via + 64 bit integer argument. The kernel allocates + bitmaps and inode table, the userspace tool thus + just passes the new number of blocks. + +EXT4_IOC_SWAP_BOOT Swap i_blocks and associated attributes + (like i_blocks, i_size, i_flags, ...) from + the specified inode with inode + EXT4_BOOT_LOADER_INO (#5). This is typically + used to store a boot loader in a secure part of + the filesystem, where it can't be changed by a + normal user by accident. + The data blocks of the previous boot loader + will be associated with the given inode. + +.............................................................................. + +References +========== + +kernel source: <file:fs/ext4/> + <file:fs/jbd2/> + +programs: http://e2fsprogs.sourceforge.net/ + +useful links: http://fedoraproject.org/wiki/ext3-devel + http://www.bullopensource.org/ext4/ + http://ext4.wiki.kernel.org/index.php/Main_Page + http://fedoraproject.org/wiki/Features/Ext4 diff --git a/Documentation/filesystems/f2fs.txt b/Documentation/filesystems/f2fs.txt new file mode 100644 index 00000000000..51afba17bba --- /dev/null +++ b/Documentation/filesystems/f2fs.txt @@ -0,0 +1,545 @@ +================================================================================ +WHAT IS Flash-Friendly File System (F2FS)? +================================================================================ + +NAND flash memory-based storage devices, such as SSD, eMMC, and SD cards, have +been equipped on a variety systems ranging from mobile to server systems. Since +they are known to have different characteristics from the conventional rotating +disks, a file system, an upper layer to the storage device, should adapt to the +changes from the sketch in the design level. + +F2FS is a file system exploiting NAND flash memory-based storage devices, which +is based on Log-structured File System (LFS). The design has been focused on +addressing the fundamental issues in LFS, which are snowball effect of wandering +tree and high cleaning overhead. + +Since a NAND flash memory-based storage device shows different characteristic +according to its internal geometry or flash memory management scheme, namely FTL, +F2FS and its tools support various parameters not only for configuring on-disk +layout, but also for selecting allocation and cleaning algorithms. + +The following git tree provides the file system formatting tool (mkfs.f2fs), +a consistency checking tool (fsck.f2fs), and a debugging tool (dump.f2fs). +>> git://git.kernel.org/pub/scm/linux/kernel/git/jaegeuk/f2fs-tools.git + +For reporting bugs and sending patches, please use the following mailing list: +>> linux-f2fs-devel@lists.sourceforge.net + +================================================================================ +BACKGROUND AND DESIGN ISSUES +================================================================================ + +Log-structured File System (LFS) +-------------------------------- +"A log-structured file system writes all modifications to disk sequentially in +a log-like structure, thereby speeding up both file writing and crash recovery. +The log is the only structure on disk; it contains indexing information so that +files can be read back from the log efficiently. In order to maintain large free +areas on disk for fast writing, we divide the log into segments and use a +segment cleaner to compress the live information from heavily fragmented +segments." from Rosenblum, M. and Ousterhout, J. K., 1992, "The design and +implementation of a log-structured file system", ACM Trans. Computer Systems +10, 1, 26–52. + +Wandering Tree Problem +---------------------- +In LFS, when a file data is updated and written to the end of log, its direct +pointer block is updated due to the changed location. Then the indirect pointer +block is also updated due to the direct pointer block update. In this manner, +the upper index structures such as inode, inode map, and checkpoint block are +also updated recursively. This problem is called as wandering tree problem [1], +and in order to enhance the performance, it should eliminate or relax the update +propagation as much as possible. + +[1] Bityutskiy, A. 2005. JFFS3 design issues. http://www.linux-mtd.infradead.org/ + +Cleaning Overhead +----------------- +Since LFS is based on out-of-place writes, it produces so many obsolete blocks +scattered across the whole storage. In order to serve new empty log space, it +needs to reclaim these obsolete blocks seamlessly to users. This job is called +as a cleaning process. + +The process consists of three operations as follows. +1. A victim segment is selected through referencing segment usage table. +2. It loads parent index structures of all the data in the victim identified by + segment summary blocks. +3. It checks the cross-reference between the data and its parent index structure. +4. It moves valid data selectively. + +This cleaning job may cause unexpected long delays, so the most important goal +is to hide the latencies to users. And also definitely, it should reduce the +amount of valid data to be moved, and move them quickly as well. + +================================================================================ +KEY FEATURES +================================================================================ + +Flash Awareness +--------------- +- Enlarge the random write area for better performance, but provide the high + spatial locality +- Align FS data structures to the operational units in FTL as best efforts + +Wandering Tree Problem +---------------------- +- Use a term, “nodeâ€, that represents inodes as well as various pointer blocks +- Introduce Node Address Table (NAT) containing the locations of all the “node†+ blocks; this will cut off the update propagation. + +Cleaning Overhead +----------------- +- Support a background cleaning process +- Support greedy and cost-benefit algorithms for victim selection policies +- Support multi-head logs for static/dynamic hot and cold data separation +- Introduce adaptive logging for efficient block allocation + +================================================================================ +MOUNT OPTIONS +================================================================================ + +background_gc=%s Turn on/off cleaning operations, namely garbage + collection, triggered in background when I/O subsystem is + idle. If background_gc=on, it will turn on the garbage + collection and if background_gc=off, garbage collection + will be truned off. + Default value for this option is on. So garbage + collection is on by default. +disable_roll_forward Disable the roll-forward recovery routine +discard Issue discard/TRIM commands when a segment is cleaned. +no_heap Disable heap-style segment allocation which finds free + segments for data from the beginning of main area, while + for node from the end of main area. +nouser_xattr Disable Extended User Attributes. Note: xattr is enabled + by default if CONFIG_F2FS_FS_XATTR is selected. +noacl Disable POSIX Access Control List. Note: acl is enabled + by default if CONFIG_F2FS_FS_POSIX_ACL is selected. +active_logs=%u Support configuring the number of active logs. In the + current design, f2fs supports only 2, 4, and 6 logs. + Default number is 6. +disable_ext_identify Disable the extension list configured by mkfs, so f2fs + does not aware of cold files such as media files. +inline_xattr Enable the inline xattrs feature. +inline_data Enable the inline data feature: New created small(<~3.4k) + files can be written into inode block. +flush_merge Merge concurrent cache_flush commands as much as possible + to eliminate redundant command issues. If the underlying + device handles the cache_flush command relatively slowly, + recommend to enable this option. + +================================================================================ +DEBUGFS ENTRIES +================================================================================ + +/sys/kernel/debug/f2fs/ contains information about all the partitions mounted as +f2fs. Each file shows the whole f2fs information. + +/sys/kernel/debug/f2fs/status includes: + - major file system information managed by f2fs currently + - average SIT information about whole segments + - current memory footprint consumed by f2fs. + +================================================================================ +SYSFS ENTRIES +================================================================================ + +Information about mounted f2f2 file systems can be found in +/sys/fs/f2fs. Each mounted filesystem will have a directory in +/sys/fs/f2fs based on its device name (i.e., /sys/fs/f2fs/sda). +The files in each per-device directory are shown in table below. + +Files in /sys/fs/f2fs/<devname> +(see also Documentation/ABI/testing/sysfs-fs-f2fs) +.............................................................................. + File Content + + gc_max_sleep_time This tuning parameter controls the maximum sleep + time for the garbage collection thread. Time is + in milliseconds. + + gc_min_sleep_time This tuning parameter controls the minimum sleep + time for the garbage collection thread. Time is + in milliseconds. + + gc_no_gc_sleep_time This tuning parameter controls the default sleep + time for the garbage collection thread. Time is + in milliseconds. + + gc_idle This parameter controls the selection of victim + policy for garbage collection. Setting gc_idle = 0 + (default) will disable this option. Setting + gc_idle = 1 will select the Cost Benefit approach + & setting gc_idle = 2 will select the greedy aproach. + + reclaim_segments This parameter controls the number of prefree + segments to be reclaimed. If the number of prefree + segments is larger than the number of segments + in the proportion to the percentage over total + volume size, f2fs tries to conduct checkpoint to + reclaim the prefree segments to free segments. + By default, 5% over total # of segments. + + max_small_discards This parameter controls the number of discard + commands that consist small blocks less than 2MB. + The candidates to be discarded are cached until + checkpoint is triggered, and issued during the + checkpoint. By default, it is disabled with 0. + + ipu_policy This parameter controls the policy of in-place + updates in f2fs. There are five policies: + 0: F2FS_IPU_FORCE, 1: F2FS_IPU_SSR, + 2: F2FS_IPU_UTIL, 3: F2FS_IPU_SSR_UTIL, + 4: F2FS_IPU_DISABLE. + + min_ipu_util This parameter controls the threshold to trigger + in-place-updates. The number indicates percentage + of the filesystem utilization, and used by + F2FS_IPU_UTIL and F2FS_IPU_SSR_UTIL policies. + + max_victim_search This parameter controls the number of trials to + find a victim segment when conducting SSR and + cleaning operations. The default value is 4096 + which covers 8GB block address range. + + dir_level This parameter controls the directory level to + support large directory. If a directory has a + number of files, it can reduce the file lookup + latency by increasing this dir_level value. + Otherwise, it needs to decrease this value to + reduce the space overhead. The default value is 0. + + ram_thresh This parameter controls the memory footprint used + by free nids and cached nat entries. By default, + 10 is set, which indicates 10 MB / 1 GB RAM. + +================================================================================ +USAGE +================================================================================ + +1. Download userland tools and compile them. + +2. Skip, if f2fs was compiled statically inside kernel. + Otherwise, insert the f2fs.ko module. + # insmod f2fs.ko + +3. Create a directory trying to mount + # mkdir /mnt/f2fs + +4. Format the block device, and then mount as f2fs + # mkfs.f2fs -l label /dev/block_device + # mount -t f2fs /dev/block_device /mnt/f2fs + +mkfs.f2fs +--------- +The mkfs.f2fs is for the use of formatting a partition as the f2fs filesystem, +which builds a basic on-disk layout. + +The options consist of: +-l [label] : Give a volume label, up to 512 unicode name. +-a [0 or 1] : Split start location of each area for heap-based allocation. + 1 is set by default, which performs this. +-o [int] : Set overprovision ratio in percent over volume size. + 5 is set by default. +-s [int] : Set the number of segments per section. + 1 is set by default. +-z [int] : Set the number of sections per zone. + 1 is set by default. +-e [str] : Set basic extension list. e.g. "mp3,gif,mov" +-t [0 or 1] : Disable discard command or not. + 1 is set by default, which conducts discard. + +fsck.f2fs +--------- +The fsck.f2fs is a tool to check the consistency of an f2fs-formatted +partition, which examines whether the filesystem metadata and user-made data +are cross-referenced correctly or not. +Note that, initial version of the tool does not fix any inconsistency. + +The options consist of: + -d debug level [default:0] + +dump.f2fs +--------- +The dump.f2fs shows the information of specific inode and dumps SSA and SIT to +file. Each file is dump_ssa and dump_sit. + +The dump.f2fs is used to debug on-disk data structures of the f2fs filesystem. +It shows on-disk inode information reconized by a given inode number, and is +able to dump all the SSA and SIT entries into predefined files, ./dump_ssa and +./dump_sit respectively. + +The options consist of: + -d debug level [default:0] + -i inode no (hex) + -s [SIT dump segno from #1~#2 (decimal), for all 0~-1] + -a [SSA dump segno from #1~#2 (decimal), for all 0~-1] + +Examples: +# dump.f2fs -i [ino] /dev/sdx +# dump.f2fs -s 0~-1 /dev/sdx (SIT dump) +# dump.f2fs -a 0~-1 /dev/sdx (SSA dump) + +================================================================================ +DESIGN +================================================================================ + +On-disk Layout +-------------- + +F2FS divides the whole volume into a number of segments, each of which is fixed +to 2MB in size. A section is composed of consecutive segments, and a zone +consists of a set of sections. By default, section and zone sizes are set to one +segment size identically, but users can easily modify the sizes by mkfs. + +F2FS splits the entire volume into six areas, and all the areas except superblock +consists of multiple segments as described below. + + align with the zone size <-| + |-> align with the segment size + _________________________________________________________________________ + | | | Segment | Node | Segment | | + | Superblock | Checkpoint | Info. | Address | Summary | Main | + | (SB) | (CP) | Table (SIT) | Table (NAT) | Area (SSA) | | + |____________|_____2______|______N______|______N______|______N_____|__N___| + . . + . . + . . + ._________________________________________. + |_Segment_|_..._|_Segment_|_..._|_Segment_| + . . + ._________._________ + |_section_|__...__|_ + . . + .________. + |__zone__| + +- Superblock (SB) + : It is located at the beginning of the partition, and there exist two copies + to avoid file system crash. It contains basic partition information and some + default parameters of f2fs. + +- Checkpoint (CP) + : It contains file system information, bitmaps for valid NAT/SIT sets, orphan + inode lists, and summary entries of current active segments. + +- Segment Information Table (SIT) + : It contains segment information such as valid block count and bitmap for the + validity of all the blocks. + +- Node Address Table (NAT) + : It is composed of a block address table for all the node blocks stored in + Main area. + +- Segment Summary Area (SSA) + : It contains summary entries which contains the owner information of all the + data and node blocks stored in Main area. + +- Main Area + : It contains file and directory data including their indices. + +In order to avoid misalignment between file system and flash-based storage, F2FS +aligns the start block address of CP with the segment size. Also, it aligns the +start block address of Main area with the zone size by reserving some segments +in SSA area. + +Reference the following survey for additional technical details. +https://wiki.linaro.org/WorkingGroups/Kernel/Projects/FlashCardSurvey + +File System Metadata Structure +------------------------------ + +F2FS adopts the checkpointing scheme to maintain file system consistency. At +mount time, F2FS first tries to find the last valid checkpoint data by scanning +CP area. In order to reduce the scanning time, F2FS uses only two copies of CP. +One of them always indicates the last valid data, which is called as shadow copy +mechanism. In addition to CP, NAT and SIT also adopt the shadow copy mechanism. + +For file system consistency, each CP points to which NAT and SIT copies are +valid, as shown as below. + + +--------+----------+---------+ + | CP | SIT | NAT | + +--------+----------+---------+ + . . . . + . . . . + . . . . + +-------+-------+--------+--------+--------+--------+ + | CP #0 | CP #1 | SIT #0 | SIT #1 | NAT #0 | NAT #1 | + +-------+-------+--------+--------+--------+--------+ + | ^ ^ + | | | + `----------------------------------------' + +Index Structure +--------------- + +The key data structure to manage the data locations is a "node". Similar to +traditional file structures, F2FS has three types of node: inode, direct node, +indirect node. F2FS assigns 4KB to an inode block which contains 923 data block +indices, two direct node pointers, two indirect node pointers, and one double +indirect node pointer as described below. One direct node block contains 1018 +data blocks, and one indirect node block contains also 1018 node blocks. Thus, +one inode block (i.e., a file) covers: + + 4KB * (923 + 2 * 1018 + 2 * 1018 * 1018 + 1018 * 1018 * 1018) := 3.94TB. + + Inode block (4KB) + |- data (923) + |- direct node (2) + | `- data (1018) + |- indirect node (2) + | `- direct node (1018) + | `- data (1018) + `- double indirect node (1) + `- indirect node (1018) + `- direct node (1018) + `- data (1018) + +Note that, all the node blocks are mapped by NAT which means the location of +each node is translated by the NAT table. In the consideration of the wandering +tree problem, F2FS is able to cut off the propagation of node updates caused by +leaf data writes. + +Directory Structure +------------------- + +A directory entry occupies 11 bytes, which consists of the following attributes. + +- hash hash value of the file name +- ino inode number +- len the length of file name +- type file type such as directory, symlink, etc + +A dentry block consists of 214 dentry slots and file names. Therein a bitmap is +used to represent whether each dentry is valid or not. A dentry block occupies +4KB with the following composition. + + Dentry Block(4 K) = bitmap (27 bytes) + reserved (3 bytes) + + dentries(11 * 214 bytes) + file name (8 * 214 bytes) + + [Bucket] + +--------------------------------+ + |dentry block 1 | dentry block 2 | + +--------------------------------+ + . . + . . + . [Dentry Block Structure: 4KB] . + +--------+----------+----------+------------+ + | bitmap | reserved | dentries | file names | + +--------+----------+----------+------------+ + [Dentry Block: 4KB] . . + . . + . . + +------+------+-----+------+ + | hash | ino | len | type | + +------+------+-----+------+ + [Dentry Structure: 11 bytes] + +F2FS implements multi-level hash tables for directory structure. Each level has +a hash table with dedicated number of hash buckets as shown below. Note that +"A(2B)" means a bucket includes 2 data blocks. + +---------------------- +A : bucket +B : block +N : MAX_DIR_HASH_DEPTH +---------------------- + +level #0 | A(2B) + | +level #1 | A(2B) - A(2B) + | +level #2 | A(2B) - A(2B) - A(2B) - A(2B) + . | . . . . +level #N/2 | A(2B) - A(2B) - A(2B) - A(2B) - A(2B) - ... - A(2B) + . | . . . . +level #N | A(4B) - A(4B) - A(4B) - A(4B) - A(4B) - ... - A(4B) + +The number of blocks and buckets are determined by, + + ,- 2, if n < MAX_DIR_HASH_DEPTH / 2, + # of blocks in level #n = | + `- 4, Otherwise + + ,- 2^(n + dir_level), + | if n + dir_level < MAX_DIR_HASH_DEPTH / 2, + # of buckets in level #n = | + `- 2^((MAX_DIR_HASH_DEPTH / 2) - 1), + Otherwise + +When F2FS finds a file name in a directory, at first a hash value of the file +name is calculated. Then, F2FS scans the hash table in level #0 to find the +dentry consisting of the file name and its inode number. If not found, F2FS +scans the next hash table in level #1. In this way, F2FS scans hash tables in +each levels incrementally from 1 to N. In each levels F2FS needs to scan only +one bucket determined by the following equation, which shows O(log(# of files)) +complexity. + + bucket number to scan in level #n = (hash value) % (# of buckets in level #n) + +In the case of file creation, F2FS finds empty consecutive slots that cover the +file name. F2FS searches the empty slots in the hash tables of whole levels from +1 to N in the same way as the lookup operation. + +The following figure shows an example of two cases holding children. + --------------> Dir <-------------- + | | + child child + + child - child [hole] - child + + child - child - child [hole] - [hole] - child + + Case 1: Case 2: + Number of children = 6, Number of children = 3, + File size = 7 File size = 7 + +Default Block Allocation +------------------------ + +At runtime, F2FS manages six active logs inside "Main" area: Hot/Warm/Cold node +and Hot/Warm/Cold data. + +- Hot node contains direct node blocks of directories. +- Warm node contains direct node blocks except hot node blocks. +- Cold node contains indirect node blocks +- Hot data contains dentry blocks +- Warm data contains data blocks except hot and cold data blocks +- Cold data contains multimedia data or migrated data blocks + +LFS has two schemes for free space management: threaded log and copy-and-compac- +tion. The copy-and-compaction scheme which is known as cleaning, is well-suited +for devices showing very good sequential write performance, since free segments +are served all the time for writing new data. However, it suffers from cleaning +overhead under high utilization. Contrarily, the threaded log scheme suffers +from random writes, but no cleaning process is needed. F2FS adopts a hybrid +scheme where the copy-and-compaction scheme is adopted by default, but the +policy is dynamically changed to the threaded log scheme according to the file +system status. + +In order to align F2FS with underlying flash-based storage, F2FS allocates a +segment in a unit of section. F2FS expects that the section size would be the +same as the unit size of garbage collection in FTL. Furthermore, with respect +to the mapping granularity in FTL, F2FS allocates each section of the active +logs from different zones as much as possible, since FTL can write the data in +the active logs into one allocation unit according to its mapping granularity. + +Cleaning process +---------------- + +F2FS does cleaning both on demand and in the background. On-demand cleaning is +triggered when there are not enough free segments to serve VFS calls. Background +cleaner is operated by a kernel thread, and triggers the cleaning job when the +system is idle. + +F2FS supports two victim selection policies: greedy and cost-benefit algorithms. +In the greedy algorithm, F2FS selects a victim segment having the smallest number +of valid blocks. In the cost-benefit algorithm, F2FS selects a victim segment +according to the segment age and the number of valid blocks in order to address +log block thrashing problem in the greedy algorithm. F2FS adopts the greedy +algorithm for on-demand cleaner, while background cleaner adopts cost-benefit +algorithm. + +In order to identify whether the data in the victim segment are valid or not, +F2FS manages a bitmap. Each bit represents the validity of a block, and the +bitmap is composed of a bit stream covering whole blocks in main area. diff --git a/Documentation/filesystems/fiemap.txt b/Documentation/filesystems/fiemap.txt new file mode 100644 index 00000000000..1b805a0efbb --- /dev/null +++ b/Documentation/filesystems/fiemap.txt @@ -0,0 +1,228 @@ +============ +Fiemap Ioctl +============ + +The fiemap ioctl is an efficient method for userspace to get file +extent mappings. Instead of block-by-block mapping (such as bmap), fiemap +returns a list of extents. + + +Request Basics +-------------- + +A fiemap request is encoded within struct fiemap: + +struct fiemap { + __u64 fm_start; /* logical offset (inclusive) at + * which to start mapping (in) */ + __u64 fm_length; /* logical length of mapping which + * userspace cares about (in) */ + __u32 fm_flags; /* FIEMAP_FLAG_* flags for request (in/out) */ + __u32 fm_mapped_extents; /* number of extents that were + * mapped (out) */ + __u32 fm_extent_count; /* size of fm_extents array (in) */ + __u32 fm_reserved; + struct fiemap_extent fm_extents[0]; /* array of mapped extents (out) */ +}; + + +fm_start, and fm_length specify the logical range within the file +which the process would like mappings for. Extents returned mirror +those on disk - that is, the logical offset of the 1st returned extent +may start before fm_start, and the range covered by the last returned +extent may end after fm_length. All offsets and lengths are in bytes. + +Certain flags to modify the way in which mappings are looked up can be +set in fm_flags. If the kernel doesn't understand some particular +flags, it will return EBADR and the contents of fm_flags will contain +the set of flags which caused the error. If the kernel is compatible +with all flags passed, the contents of fm_flags will be unmodified. +It is up to userspace to determine whether rejection of a particular +flag is fatal to its operation. This scheme is intended to allow the +fiemap interface to grow in the future but without losing +compatibility with old software. + +fm_extent_count specifies the number of elements in the fm_extents[] array +that can be used to return extents. If fm_extent_count is zero, then the +fm_extents[] array is ignored (no extents will be returned), and the +fm_mapped_extents count will hold the number of extents needed in +fm_extents[] to hold the file's current mapping. Note that there is +nothing to prevent the file from changing between calls to FIEMAP. + +The following flags can be set in fm_flags: + +* FIEMAP_FLAG_SYNC +If this flag is set, the kernel will sync the file before mapping extents. + +* FIEMAP_FLAG_XATTR +If this flag is set, the extents returned will describe the inodes +extended attribute lookup tree, instead of its data tree. + + +Extent Mapping +-------------- + +Extent information is returned within the embedded fm_extents array +which userspace must allocate along with the fiemap structure. The +number of elements in the fiemap_extents[] array should be passed via +fm_extent_count. The number of extents mapped by kernel will be +returned via fm_mapped_extents. If the number of fiemap_extents +allocated is less than would be required to map the requested range, +the maximum number of extents that can be mapped in the fm_extent[] +array will be returned and fm_mapped_extents will be equal to +fm_extent_count. In that case, the last extent in the array will not +complete the requested range and will not have the FIEMAP_EXTENT_LAST +flag set (see the next section on extent flags). + +Each extent is described by a single fiemap_extent structure as +returned in fm_extents. + +struct fiemap_extent { + __u64 fe_logical; /* logical offset in bytes for the start of + * the extent */ + __u64 fe_physical; /* physical offset in bytes for the start + * of the extent */ + __u64 fe_length; /* length in bytes for the extent */ + __u64 fe_reserved64[2]; + __u32 fe_flags; /* FIEMAP_EXTENT_* flags for this extent */ + __u32 fe_reserved[3]; +}; + +All offsets and lengths are in bytes and mirror those on disk. It is valid +for an extents logical offset to start before the request or its logical +length to extend past the request. Unless FIEMAP_EXTENT_NOT_ALIGNED is +returned, fe_logical, fe_physical, and fe_length will be aligned to the +block size of the file system. With the exception of extents flagged as +FIEMAP_EXTENT_MERGED, adjacent extents will not be merged. + +The fe_flags field contains flags which describe the extent returned. +A special flag, FIEMAP_EXTENT_LAST is always set on the last extent in +the file so that the process making fiemap calls can determine when no +more extents are available, without having to call the ioctl again. + +Some flags are intentionally vague and will always be set in the +presence of other more specific flags. This way a program looking for +a general property does not have to know all existing and future flags +which imply that property. + +For example, if FIEMAP_EXTENT_DATA_INLINE or FIEMAP_EXTENT_DATA_TAIL +are set, FIEMAP_EXTENT_NOT_ALIGNED will also be set. A program looking +for inline or tail-packed data can key on the specific flag. Software +which simply cares not to try operating on non-aligned extents +however, can just key on FIEMAP_EXTENT_NOT_ALIGNED, and not have to +worry about all present and future flags which might imply unaligned +data. Note that the opposite is not true - it would be valid for +FIEMAP_EXTENT_NOT_ALIGNED to appear alone. + +* FIEMAP_EXTENT_LAST +This is the last extent in the file. A mapping attempt past this +extent will return nothing. + +* FIEMAP_EXTENT_UNKNOWN +The location of this extent is currently unknown. This may indicate +the data is stored on an inaccessible volume or that no storage has +been allocated for the file yet. + +* FIEMAP_EXTENT_DELALLOC + - This will also set FIEMAP_EXTENT_UNKNOWN. +Delayed allocation - while there is data for this extent, its +physical location has not been allocated yet. + +* FIEMAP_EXTENT_ENCODED +This extent does not consist of plain filesystem blocks but is +encoded (e.g. encrypted or compressed). Reading the data in this +extent via I/O to the block device will have undefined results. + +Note that it is *always* undefined to try to update the data +in-place by writing to the indicated location without the +assistance of the filesystem, or to access the data using the +information returned by the FIEMAP interface while the filesystem +is mounted. In other words, user applications may only read the +extent data via I/O to the block device while the filesystem is +unmounted, and then only if the FIEMAP_EXTENT_ENCODED flag is +clear; user applications must not try reading or writing to the +filesystem via the block device under any other circumstances. + +* FIEMAP_EXTENT_DATA_ENCRYPTED + - This will also set FIEMAP_EXTENT_ENCODED +The data in this extent has been encrypted by the file system. + +* FIEMAP_EXTENT_NOT_ALIGNED +Extent offsets and length are not guaranteed to be block aligned. + +* FIEMAP_EXTENT_DATA_INLINE + This will also set FIEMAP_EXTENT_NOT_ALIGNED +Data is located within a meta data block. + +* FIEMAP_EXTENT_DATA_TAIL + This will also set FIEMAP_EXTENT_NOT_ALIGNED +Data is packed into a block with data from other files. + +* FIEMAP_EXTENT_UNWRITTEN +Unwritten extent - the extent is allocated but its data has not been +initialized. This indicates the extent's data will be all zero if read +through the filesystem but the contents are undefined if read directly from +the device. + +* FIEMAP_EXTENT_MERGED +This will be set when a file does not support extents, i.e., it uses a block +based addressing scheme. Since returning an extent for each block back to +userspace would be highly inefficient, the kernel will try to merge most +adjacent blocks into 'extents'. + + +VFS -> File System Implementation +--------------------------------- + +File systems wishing to support fiemap must implement a ->fiemap callback on +their inode_operations structure. The fs ->fiemap call is responsible for +defining its set of supported fiemap flags, and calling a helper function on +each discovered extent: + +struct inode_operations { + ... + + int (*fiemap)(struct inode *, struct fiemap_extent_info *, u64 start, + u64 len); + +->fiemap is passed struct fiemap_extent_info which describes the +fiemap request: + +struct fiemap_extent_info { + unsigned int fi_flags; /* Flags as passed from user */ + unsigned int fi_extents_mapped; /* Number of mapped extents */ + unsigned int fi_extents_max; /* Size of fiemap_extent array */ + struct fiemap_extent *fi_extents_start; /* Start of fiemap_extent array */ +}; + +It is intended that the file system should not need to access any of this +structure directly. + + +Flag checking should be done at the beginning of the ->fiemap callback via the +fiemap_check_flags() helper: + +int fiemap_check_flags(struct fiemap_extent_info *fieinfo, u32 fs_flags); + +The struct fieinfo should be passed in as received from ioctl_fiemap(). The +set of fiemap flags which the fs understands should be passed via fs_flags. If +fiemap_check_flags finds invalid user flags, it will place the bad values in +fieinfo->fi_flags and return -EBADR. If the file system gets -EBADR, from +fiemap_check_flags(), it should immediately exit, returning that error back to +ioctl_fiemap(). + + +For each extent in the request range, the file system should call +the helper function, fiemap_fill_next_extent(): + +int fiemap_fill_next_extent(struct fiemap_extent_info *info, u64 logical, + u64 phys, u64 len, u32 flags, u32 dev); + +fiemap_fill_next_extent() will use the passed values to populate the +next free extent in the fm_extents array. 'General' extent flags will +automatically be set from specific flags on behalf of the calling file +system so that the userspace API is not broken. + +fiemap_fill_next_extent() returns 0 on success, and 1 when the +user-supplied fm_extents array is full. If an error is encountered +while copying the extent to user memory, -EFAULT will be returned. diff --git a/Documentation/filesystems/files.txt b/Documentation/filesystems/files.txt index 8c206f4e025..46dfc6b038c 100644 --- a/Documentation/filesystems/files.txt +++ b/Documentation/filesystems/files.txt @@ -55,7 +55,7 @@ the fdtable structure - 2. Reading of the fdtable as described above must be protected by rcu_read_lock()/rcu_read_unlock(). -3. For any update to the the fd table, files->file_lock must +3. For any update to the fd table, files->file_lock must be held. 4. To look up the file structure given an fd, a reader @@ -76,13 +76,13 @@ the fdtable structure - 5. Handling of the file structures is special. Since the look-up of the fd (fget()/fget_light()) are lock-free, it is possible that look-up may race with the last put() operation on the - file structure. This is avoided using the rcuref APIs + file structure. This is avoided using atomic_long_inc_not_zero() on ->f_count : rcu_read_lock(); file = fcheck_files(files, fd); if (file) { - if (rcuref_inc_lf(&file->f_count)) + if (atomic_long_inc_not_zero(&file->f_count)) *fput_needed = 1; else /* Didn't get the reference, someone's freed */ @@ -92,7 +92,7 @@ the fdtable structure - .... return file; - rcuref_inc_lf() detects if refcounts is already zero or + atomic_long_inc_not_zero() detects if refcounts is already zero or goes to zero during increment. If it does, we fail fget()/fget_light(). @@ -113,8 +113,8 @@ the fdtable structure - if (fd >= 0) { /* locate_fd() may have expanded fdtable, load the ptr */ fdt = files_fdtable(files); - FD_SET(fd, fdt->open_fds); - FD_CLR(fd, fdt->close_on_exec); + __set_open_fd(fd, fdt); + __clear_close_on_exec(fd, fdt); spin_unlock(&files->file_lock); ..... diff --git a/Documentation/filesystems/fuse.txt b/Documentation/filesystems/fuse.txt index 6b5741e651a..13af4a49e7d 100644 --- a/Documentation/filesystems/fuse.txt +++ b/Documentation/filesystems/fuse.txt @@ -18,6 +18,14 @@ Non-privileged mount (or user mount): user. NOTE: this is not the same as mounts allowed with the "user" option in /etc/fstab, which is not discussed here. +Filesystem connection: + + A connection between the filesystem daemon and the kernel. The + connection exists until either the daemon dies, or the filesystem is + umounted. Note that detaching (or lazy umounting) the filesystem + does _not_ break the connection, in this case it will exist until + the last reference to the filesystem is released. + Mount owner: The user who does the mounting. @@ -43,6 +51,22 @@ homepage: http://fuse.sourceforge.net/ +Filesystem type +~~~~~~~~~~~~~~~ + +The filesystem type given to mount(2) can be one of the following: + +'fuse' + + This is the usual way to mount a FUSE filesystem. The first + argument of the mount system call may contain an arbitrary string, + which is not interpreted by the kernel. + +'fuseblk' + + The filesystem is block device based. The first argument of the + mount system call is interpreted as the name of the device. + Mount options ~~~~~~~~~~~~~ @@ -67,11 +91,11 @@ Mount options 'default_permissions' By default FUSE doesn't check file access permissions, the - filesystem is free to implement it's access policy or leave it to + filesystem is free to implement its access policy or leave it to the underlying file access mechanism (e.g. in case of network filesystems). This option enables permission checking, restricting - access based on file mode. This is option is usually useful - together with the 'allow_other' mount option. + access based on file mode. It is usually useful together with the + 'allow_other' mount option. 'allow_other' @@ -86,6 +110,111 @@ Mount options The default is infinite. Note that the size of read requests is limited anyway to 32 pages (which is 128kbyte on i386). +'blksize=N' + + Set the block size for the filesystem. The default is 512. This + option is only valid for 'fuseblk' type mounts. + +Control filesystem +~~~~~~~~~~~~~~~~~~ + +There's a control filesystem for FUSE, which can be mounted by: + + mount -t fusectl none /sys/fs/fuse/connections + +Mounting it under the '/sys/fs/fuse/connections' directory makes it +backwards compatible with earlier versions. + +Under the fuse control filesystem each connection has a directory +named by a unique number. + +For each connection the following files exist within this directory: + + 'waiting' + + The number of requests which are waiting to be transferred to + userspace or being processed by the filesystem daemon. If there is + no filesystem activity and 'waiting' is non-zero, then the + filesystem is hung or deadlocked. + + 'abort' + + Writing anything into this file will abort the filesystem + connection. This means that all waiting requests will be aborted an + error returned for all aborted and new requests. + +Only the owner of the mount may read or write these files. + +Interrupting filesystem operations +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +If a process issuing a FUSE filesystem request is interrupted, the +following will happen: + + 1) If the request is not yet sent to userspace AND the signal is + fatal (SIGKILL or unhandled fatal signal), then the request is + dequeued and returns immediately. + + 2) If the request is not yet sent to userspace AND the signal is not + fatal, then an 'interrupted' flag is set for the request. When + the request has been successfully transferred to userspace and + this flag is set, an INTERRUPT request is queued. + + 3) If the request is already sent to userspace, then an INTERRUPT + request is queued. + +INTERRUPT requests take precedence over other requests, so the +userspace filesystem will receive queued INTERRUPTs before any others. + +The userspace filesystem may ignore the INTERRUPT requests entirely, +or may honor them by sending a reply to the _original_ request, with +the error set to EINTR. + +It is also possible that there's a race between processing the +original request and its INTERRUPT request. There are two possibilities: + + 1) The INTERRUPT request is processed before the original request is + processed + + 2) The INTERRUPT request is processed after the original request has + been answered + +If the filesystem cannot find the original request, it should wait for +some timeout and/or a number of new requests to arrive, after which it +should reply to the INTERRUPT request with an EAGAIN error. In case +1) the INTERRUPT request will be requeued. In case 2) the INTERRUPT +reply will be ignored. + +Aborting a filesystem connection +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +It is possible to get into certain situations where the filesystem is +not responding. Reasons for this may be: + + a) Broken userspace filesystem implementation + + b) Network connection down + + c) Accidental deadlock + + d) Malicious deadlock + +(For more on c) and d) see later sections) + +In either of these cases it may be useful to abort the connection to +the filesystem. There are several ways to do this: + + - Kill the filesystem daemon. Works in case of a) and b) + + - Kill the filesystem daemon and all users of the filesystem. Works + in all cases except some malicious deadlocks + + - Use forced umount (umount -f). Works in all cases but only if + filesystem is still attached (it hasn't been lazy unmounted) + + - Abort filesystem through the FUSE control filesystem. Most + powerful method, always works. + How do non-privileged mounts work? ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ @@ -248,25 +377,7 @@ Scenario 1 - Simple deadlock | | for "file"] | | *DEADLOCK* -The solution for this is to allow requests to be interrupted while -they are in userspace: - - | [interrupted by signal] | - | <fuse_unlink() | - | [release semaphore] | [semaphore acquired] - | <sys_unlink() | - | | >fuse_unlink() - | | [queue req on fc->pending] - | | [wake up fc->waitq] - | | [sleep on req->waitq] - -If the filesystem daemon was single threaded, this will stop here, -since there's no other thread to dequeue and execute the request. -In this case the solution is to kill the FUSE daemon as well. If -there are multiple serving threads, you just have to kill them as -long as any remain. - -Moral: a filesystem which deadlocks, can soon find itself dead. +The solution for this is to allow the filesystem to be aborted. Scenario 2 - Tricky deadlock ---------------------------- @@ -299,17 +410,14 @@ but is caused by a pagefault. | | [lock page] | | * DEADLOCK * -Solution is again to let the the request be interrupted (not -elaborated further). - -An additional problem is that while the write buffer is being -copied to the request, the request must not be interrupted. This -is because the destination address of the copy may not be valid -after the request is interrupted. +Solution is basically the same as above. -This is solved with doing the copy atomically, and allowing -interruption while the page(s) belonging to the write buffer are -faulted with get_user_pages(). The 'req->locked' flag indicates -when the copy is taking place, and interruption is delayed until -this flag is unset. +An additional problem is that while the write buffer is being copied +to the request, the request must not be interrupted/aborted. This is +because the destination address of the copy may not be valid after the +request has returned. +This is solved with doing the copy atomically, and allowing abort +while the page(s) belonging to the write buffer are faulted with +get_user_pages(). The 'req->locked' flag indicates when the copy is +taking place, and abort is delayed until this flag is unset. diff --git a/Documentation/filesystems/gfs2-glocks.txt b/Documentation/filesystems/gfs2-glocks.txt new file mode 100644 index 00000000000..fcc79957be6 --- /dev/null +++ b/Documentation/filesystems/gfs2-glocks.txt @@ -0,0 +1,231 @@ + Glock internal locking rules + ------------------------------ + +This documents the basic principles of the glock state machine +internals. Each glock (struct gfs2_glock in fs/gfs2/incore.h) +has two main (internal) locks: + + 1. A spinlock (gl_spin) which protects the internal state such + as gl_state, gl_target and the list of holders (gl_holders) + 2. A non-blocking bit lock, GLF_LOCK, which is used to prevent other + threads from making calls to the DLM, etc. at the same time. If a + thread takes this lock, it must then call run_queue (usually via the + workqueue) when it releases it in order to ensure any pending tasks + are completed. + +The gl_holders list contains all the queued lock requests (not +just the holders) associated with the glock. If there are any +held locks, then they will be contiguous entries at the head +of the list. Locks are granted in strictly the order that they +are queued, except for those marked LM_FLAG_PRIORITY which are +used only during recovery, and even then only for journal locks. + +There are three lock states that users of the glock layer can request, +namely shared (SH), deferred (DF) and exclusive (EX). Those translate +to the following DLM lock modes: + +Glock mode | DLM lock mode +------------------------------ + UN | IV/NL Unlocked (no DLM lock associated with glock) or NL + SH | PR (Protected read) + DF | CW (Concurrent write) + EX | EX (Exclusive) + +Thus DF is basically a shared mode which is incompatible with the "normal" +shared lock mode, SH. In GFS2 the DF mode is used exclusively for direct I/O +operations. The glocks are basically a lock plus some routines which deal +with cache management. The following rules apply for the cache: + +Glock mode | Cache data | Cache Metadata | Dirty Data | Dirty Metadata +-------------------------------------------------------------------------- + UN | No | No | No | No + SH | Yes | Yes | No | No + DF | No | Yes | No | No + EX | Yes | Yes | Yes | Yes + +These rules are implemented using the various glock operations which +are defined for each type of glock. Not all types of glocks use +all the modes. Only inode glocks use the DF mode for example. + +Table of glock operations and per type constants: + +Field | Purpose +---------------------------------------------------------------------------- +go_xmote_th | Called before remote state change (e.g. to sync dirty data) +go_xmote_bh | Called after remote state change (e.g. to refill cache) +go_inval | Called if remote state change requires invalidating the cache +go_demote_ok | Returns boolean value of whether its ok to demote a glock + | (e.g. checks timeout, and that there is no cached data) +go_lock | Called for the first local holder of a lock +go_unlock | Called on the final local unlock of a lock +go_dump | Called to print content of object for debugfs file, or on + | error to dump glock to the log. +go_type | The type of the glock, LM_TYPE_..... +go_callback | Called if the DLM sends a callback to drop this lock +go_flags | GLOF_ASPACE is set, if the glock has an address space + | associated with it + +The minimum hold time for each lock is the time after a remote lock +grant for which we ignore remote demote requests. This is in order to +prevent a situation where locks are being bounced around the cluster +from node to node with none of the nodes making any progress. This +tends to show up most with shared mmaped files which are being written +to by multiple nodes. By delaying the demotion in response to a +remote callback, that gives the userspace program time to make +some progress before the pages are unmapped. + +There is a plan to try and remove the go_lock and go_unlock callbacks +if possible, in order to try and speed up the fast path though the locking. +Also, eventually we hope to make the glock "EX" mode locally shared +such that any local locking will be done with the i_mutex as required +rather than via the glock. + +Locking rules for glock operations: + +Operation | GLF_LOCK bit lock held | gl_spin spinlock held +----------------------------------------------------------------- +go_xmote_th | Yes | No +go_xmote_bh | Yes | No +go_inval | Yes | No +go_demote_ok | Sometimes | Yes +go_lock | Yes | No +go_unlock | Yes | No +go_dump | Sometimes | Yes +go_callback | Sometimes (N/A) | Yes + +N.B. Operations must not drop either the bit lock or the spinlock +if its held on entry. go_dump and do_demote_ok must never block. +Note that go_dump will only be called if the glock's state +indicates that it is caching uptodate data. + +Glock locking order within GFS2: + + 1. i_mutex (if required) + 2. Rename glock (for rename only) + 3. Inode glock(s) + (Parents before children, inodes at "same level" with same parent in + lock number order) + 4. Rgrp glock(s) (for (de)allocation operations) + 5. Transaction glock (via gfs2_trans_begin) for non-read operations + 6. Page lock (always last, very important!) + +There are two glocks per inode. One deals with access to the inode +itself (locking order as above), and the other, known as the iopen +glock is used in conjunction with the i_nlink field in the inode to +determine the lifetime of the inode in question. Locking of inodes +is on a per-inode basis. Locking of rgrps is on a per rgrp basis. +In general we prefer to lock local locks prior to cluster locks. + + Glock Statistics + ------------------ + +The stats are divided into two sets: those relating to the +super block and those relating to an individual glock. The +super block stats are done on a per cpu basis in order to +try and reduce the overhead of gathering them. They are also +further divided by glock type. All timings are in nanoseconds. + +In the case of both the super block and glock statistics, +the same information is gathered in each case. The super +block timing statistics are used to provide default values for +the glock timing statistics, so that newly created glocks +should have, as far as possible, a sensible starting point. +The per-glock counters are initialised to zero when the +glock is created. The per-glock statistics are lost when +the glock is ejected from memory. + +The statistics are divided into three pairs of mean and +variance, plus two counters. The mean/variance pairs are +smoothed exponential estimates and the algorithm used is +one which will be very familiar to those used to calculation +of round trip times in network code. See "TCP/IP Illustrated, +Volume 1", W. Richard Stevens, sect 21.3, "Round-Trip Time Measurement", +p. 299 and onwards. Also, Volume 2, Sect. 25.10, p. 838 and onwards. +Unlike the TCP/IP Illustrated case, the mean and variance are +not scaled, but are in units of integer nanoseconds. + +The three pairs of mean/variance measure the following +things: + + 1. DLM lock time (non-blocking requests) + 2. DLM lock time (blocking requests) + 3. Inter-request time (again to the DLM) + +A non-blocking request is one which will complete right +away, whatever the state of the DLM lock in question. That +currently means any requests when (a) the current state of +the lock is exclusive, i.e. a lock demotion (b) the requested +state is either null or unlocked (again, a demotion) or (c) the +"try lock" flag is set. A blocking request covers all the other +lock requests. + +There are two counters. The first is there primarily to show +how many lock requests have been made, and thus how much data +has gone into the mean/variance calculations. The other counter +is counting queuing of holders at the top layer of the glock +code. Hopefully that number will be a lot larger than the number +of dlm lock requests issued. + +So why gather these statistics? There are several reasons +we'd like to get a better idea of these timings: + +1. To be able to better set the glock "min hold time" +2. To spot performance issues more easily +3. To improve the algorithm for selecting resource groups for +allocation (to base it on lock wait time, rather than blindly +using a "try lock") + +Due to the smoothing action of the updates, a step change in +some input quantity being sampled will only fully be taken +into account after 8 samples (or 4 for the variance) and this +needs to be carefully considered when interpreting the +results. + +Knowing both the time it takes a lock request to complete and +the average time between lock requests for a glock means we +can compute the total percentage of the time for which the +node is able to use a glock vs. time that the rest of the +cluster has its share. That will be very useful when setting +the lock min hold time. + +Great care has been taken to ensure that we +measure exactly the quantities that we want, as accurately +as possible. There are always inaccuracies in any +measuring system, but I hope this is as accurate as we +can reasonably make it. + +Per sb stats can be found here: +/sys/kernel/debug/gfs2/<fsname>/sbstats +Per glock stats can be found here: +/sys/kernel/debug/gfs2/<fsname>/glstats + +Assuming that debugfs is mounted on /sys/kernel/debug and also +that <fsname> is replaced with the name of the gfs2 filesystem +in question. + +The abbreviations used in the output as are follows: + +srtt - Smoothed round trip time for non-blocking dlm requests +srttvar - Variance estimate for srtt +srttb - Smoothed round trip time for (potentially) blocking dlm requests +srttvarb - Variance estimate for srttb +sirt - Smoothed inter-request time (for dlm requests) +sirtvar - Variance estimate for sirt +dlm - Number of dlm requests made (dcnt in glstats file) +queue - Number of glock requests queued (qcnt in glstats file) + +The sbstats file contains a set of these stats for each glock type (so 8 lines +for each type) and for each cpu (one column per cpu). The glstats file contains +a set of these stats for each glock in a similar format to the glocks file, but +using the format mean/variance for each of the timing stats. + +The gfs2_glock_lock_time tracepoint prints out the current values of the stats +for the glock in question, along with some addition information on each dlm +reply that is received: + +status - The status of the dlm request +flags - The dlm request flags +tdiff - The time taken by this specific request +(remaining fields as per above list) + + diff --git a/Documentation/filesystems/gfs2-uevents.txt b/Documentation/filesystems/gfs2-uevents.txt new file mode 100644 index 00000000000..19a19ebebc3 --- /dev/null +++ b/Documentation/filesystems/gfs2-uevents.txt @@ -0,0 +1,100 @@ + uevents and GFS2 + ================== + +During the lifetime of a GFS2 mount, a number of uevents are generated. +This document explains what the events are and what they are used +for (by gfs_controld in gfs2-utils). + +A list of GFS2 uevents +----------------------- + +1. ADD + +The ADD event occurs at mount time. It will always be the first +uevent generated by the newly created filesystem. If the mount +is successful, an ONLINE uevent will follow. If it is not successful +then a REMOVE uevent will follow. + +The ADD uevent has two environment variables: SPECTATOR=[0|1] +and RDONLY=[0|1] that specify the spectator status (a read-only mount +with no journal assigned), and read-only (with journal assigned) status +of the filesystem respectively. + +2. ONLINE + +The ONLINE uevent is generated after a successful mount or remount. It +has the same environment variables as the ADD uevent. The ONLINE +uevent, along with the two environment variables for spectator and +RDONLY are a relatively recent addition (2.6.32-rc+) and will not +be generated by older kernels. + +3. CHANGE + +The CHANGE uevent is used in two places. One is when reporting the +successful mount of the filesystem by the first node (FIRSTMOUNT=Done). +This is used as a signal by gfs_controld that it is then ok for other +nodes in the cluster to mount the filesystem. + +The other CHANGE uevent is used to inform of the completion +of journal recovery for one of the filesystems journals. It has +two environment variables, JID= which specifies the journal id which +has just been recovered, and RECOVERY=[Done|Failed] to indicate the +success (or otherwise) of the operation. These uevents are generated +for every journal recovered, whether it is during the initial mount +process or as the result of gfs_controld requesting a specific journal +recovery via the /sys/fs/gfs2/<fsname>/lock_module/recovery file. + +Because the CHANGE uevent was used (in early versions of gfs_controld) +without checking the environment variables to discover the state, we +cannot add any more functions to it without running the risk of +someone using an older version of the user tools and breaking their +cluster. For this reason the ONLINE uevent was used when adding a new +uevent for a successful mount or remount. + +4. OFFLINE + +The OFFLINE uevent is only generated due to filesystem errors and is used +as part of the "withdraw" mechanism. Currently this doesn't give any +information about what the error is, which is something that needs to +be fixed. + +5. REMOVE + +The REMOVE uevent is generated at the end of an unsuccessful mount +or at the end of a umount of the filesystem. All REMOVE uevents will +have been preceded by at least an ADD uevent for the same filesystem, +and unlike the other uevents is generated automatically by the kernel's +kobject subsystem. + + +Information common to all GFS2 uevents (uevent environment variables) +---------------------------------------------------------------------- + +1. LOCKTABLE= + +The LOCKTABLE is a string, as supplied on the mount command +line (locktable=) or via fstab. It is used as a filesystem label +as well as providing the information for a lock_dlm mount to be +able to join the cluster. + +2. LOCKPROTO= + +The LOCKPROTO is a string, and its value depends on what is set +on the mount command line, or via fstab. It will be either +lock_nolock or lock_dlm. In the future other lock managers +may be supported. + +3. JOURNALID= + +If a journal is in use by the filesystem (journals are not +assigned for spectator mounts) then this will give the +numeric journal id in all GFS2 uevents. + +4. UUID= + +With recent versions of gfs2-utils, mkfs.gfs2 writes a UUID +into the filesystem superblock. If it exists, this will +be included in every uevent relating to the filesystem. + + + diff --git a/Documentation/filesystems/gfs2.txt b/Documentation/filesystems/gfs2.txt new file mode 100644 index 00000000000..cc4f2306609 --- /dev/null +++ b/Documentation/filesystems/gfs2.txt @@ -0,0 +1,45 @@ +Global File System +------------------ + +https://fedorahosted.org/cluster/wiki/HomePage + +GFS is a cluster file system. It allows a cluster of computers to +simultaneously use a block device that is shared between them (with FC, +iSCSI, NBD, etc). GFS reads and writes to the block device like a local +file system, but also uses a lock module to allow the computers coordinate +their I/O so file system consistency is maintained. One of the nifty +features of GFS is perfect consistency -- changes made to the file system +on one machine show up immediately on all other machines in the cluster. + +GFS uses interchangeable inter-node locking mechanisms, the currently +supported mechanisms are: + + lock_nolock -- allows gfs to be used as a local file system + + lock_dlm -- uses a distributed lock manager (dlm) for inter-node locking + The dlm is found at linux/fs/dlm/ + +Lock_dlm depends on user space cluster management systems found +at the URL above. + +To use gfs as a local file system, no external clustering systems are +needed, simply: + + $ mkfs -t gfs2 -p lock_nolock -j 1 /dev/block_device + $ mount -t gfs2 /dev/block_device /dir + +If you are using Fedora, you need to install the gfs2-utils package +and, for lock_dlm, you will also need to install the cman package +and write a cluster.conf as per the documentation. For F17 and above +cman has been replaced by the dlm package. + +GFS2 is not on-disk compatible with previous versions of GFS, but it +is pretty close. + +The following man pages can be found at the URL above: + fsck.gfs2 to repair a filesystem + gfs2_grow to expand a filesystem online + gfs2_jadd to add journals to a filesystem online + tunegfs2 to manipulate, examine and tune a filesystem + gfs2_convert to convert a gfs filesystem to gfs2 in-place + mkfs.gfs2 to make a filesystem diff --git a/Documentation/filesystems/hfs.txt b/Documentation/filesystems/hfs.txt index bd0fa770403..d096df6db07 100644 --- a/Documentation/filesystems/hfs.txt +++ b/Documentation/filesystems/hfs.txt @@ -1,3 +1,4 @@ +Note: This filesystem doesn't have a maintainer. Macintosh HFS Filesystem for Linux ================================== @@ -76,8 +77,6 @@ hformat that can be used to create HFS filesystem. See Credits ======= -The HFS drivers was written by Paul H. Hargrovea (hargrove@sccm.Stanford.EDU) -and is now maintained by Roman Zippel (roman@ardistech.com) at Ardis -Technologies. -Roman rewrote large parts of the code and brought in btree routines derived -from Brad Boyer's hfsplus driver (also maintained by Roman now). +The HFS drivers was written by Paul H. Hargrovea (hargrove@sccm.Stanford.EDU). +Roman Zippel (roman@ardistech.com) rewrote large parts of the code and brought +in btree routines derived from Brad Boyer's hfsplus driver. diff --git a/Documentation/filesystems/hfsplus.txt b/Documentation/filesystems/hfsplus.txt new file mode 100644 index 00000000000..59f7569fc9e --- /dev/null +++ b/Documentation/filesystems/hfsplus.txt @@ -0,0 +1,59 @@ + +Macintosh HFSPlus Filesystem for Linux +====================================== + +HFSPlus is a filesystem first introduced in MacOS 8.1. +HFSPlus has several extensions to HFS, including 32-bit allocation +blocks, 255-character unicode filenames, and file sizes of 2^63 bytes. + + +Mount options +============= + +When mounting an HFSPlus filesystem, the following options are accepted: + + creator=cccc, type=cccc + Specifies the creator/type values as shown by the MacOS finder + used for creating new files. Default values: '????'. + + uid=n, gid=n + Specifies the user/group that owns all files on the filesystem + that have uninitialized permissions structures. + Default: user/group id of the mounting process. + + umask=n + Specifies the umask (in octal) used for files and directories + that have uninitialized permissions structures. + Default: umask of the mounting process. + + session=n + Select the CDROM session to mount as HFSPlus filesystem. Defaults to + leaving that decision to the CDROM driver. This option will fail + with anything but a CDROM as underlying devices. + + part=n + Select partition number n from the devices. This option only makes + sense for CDROMs because they can't be partitioned under Linux. + For disk devices the generic partition parsing code does this + for us. Defaults to not parsing the partition table at all. + + decompose + Decompose file name characters. + + nodecompose + Do not decompose file name characters. + + force + Used to force write access to volumes that are marked as journalled + or locked. Use at your own risk. + + nls=cccc + Encoding to use when presenting file names. + + +References +========== + +kernel source: <file:fs/hfsplus> + +Apple Technote 1150 https://developer.apple.com/legacy/library/technotes/tn/tn1150.html diff --git a/Documentation/filesystems/hpfs.txt b/Documentation/filesystems/hpfs.txt index 33dc360c8e8..74630bd504f 100644 --- a/Documentation/filesystems/hpfs.txt +++ b/Documentation/filesystems/hpfs.txt @@ -103,7 +103,7 @@ to analyze or change OS2SYS.INI. Codepages HPFS can contain several uppercasing tables for several codepages and each -file has a pointer to codepage it's name is in. However OS/2 was created in +file has a pointer to codepage its name is in. However OS/2 was created in America where people don't care much about codepages and so multiple codepages support is quite buggy. I have Czech OS/2 working in codepage 852 on my disk. Once I booted English OS/2 working in cp 850 and I created a file on my 852 @@ -274,7 +274,7 @@ History Fixed race-condition in buffer code - it is in all filesystems in Linux; when reading device (cat /dev/hda) while creating files on it, files could be damaged -2.02 Woraround for bug in breada in Linux. breada could cause accesses beyond +2.02 Workaround for bug in breada in Linux. breada could cause accesses beyond end of partition 2.03 Char, block devices and pipes are correctly created Fixed non-crashing race in unlink (Alexander Viro) @@ -290,7 +290,7 @@ History 2.07 More fixes for Warp Server. Now it really works 2.08 Creating new files is not so slow on large disks An attempt to sync deleted file does not generate filesystem error -2.09 Fixed error on extremly fragmented files +2.09 Fixed error on extremely fragmented files vim: set textwidth=80: diff --git a/Documentation/filesystems/inotify.txt b/Documentation/filesystems/inotify.txt index 6d501903f68..cfd02712b83 100644 --- a/Documentation/filesystems/inotify.txt +++ b/Documentation/filesystems/inotify.txt @@ -69,17 +69,136 @@ Prototypes: int inotify_rm_watch (int fd, __u32 mask); -(iii) Internal Kernel Implementation +(iii) Kernel Interface -Each inotify instance is associated with an inotify_device structure. +Inotify's kernel API consists a set of functions for managing watches and an +event callback. + +To use the kernel API, you must first initialize an inotify instance with a set +of inotify_operations. You are given an opaque inotify_handle, which you use +for any further calls to inotify. + + struct inotify_handle *ih = inotify_init(my_event_handler); + +You must provide a function for processing events and a function for destroying +the inotify watch. + + void handle_event(struct inotify_watch *watch, u32 wd, u32 mask, + u32 cookie, const char *name, struct inode *inode) + + watch - the pointer to the inotify_watch that triggered this call + wd - the watch descriptor + mask - describes the event that occurred + cookie - an identifier for synchronizing events + name - the dentry name for affected files in a directory-based event + inode - the affected inode in a directory-based event + + void destroy_watch(struct inotify_watch *watch) + +You may add watches by providing a pre-allocated and initialized inotify_watch +structure and specifying the inode to watch along with an inotify event mask. +You must pin the inode during the call. You will likely wish to embed the +inotify_watch structure in a structure of your own which contains other +information about the watch. Once you add an inotify watch, it is immediately +subject to removal depending on filesystem events. You must grab a reference if +you depend on the watch hanging around after the call. + + inotify_init_watch(&my_watch->iwatch); + inotify_get_watch(&my_watch->iwatch); // optional + s32 wd = inotify_add_watch(ih, &my_watch->iwatch, inode, mask); + inotify_put_watch(&my_watch->iwatch); // optional + +You may use the watch descriptor (wd) or the address of the inotify_watch for +other inotify operations. You must not directly read or manipulate data in the +inotify_watch. Additionally, you must not call inotify_add_watch() more than +once for a given inotify_watch structure, unless you have first called either +inotify_rm_watch() or inotify_rm_wd(). + +To determine if you have already registered a watch for a given inode, you may +call inotify_find_watch(), which gives you both the wd and the watch pointer for +the inotify_watch, or an error if the watch does not exist. + + wd = inotify_find_watch(ih, inode, &watchp); + +You may use container_of() on the watch pointer to access your own data +associated with a given watch. When an existing watch is found, +inotify_find_watch() bumps the refcount before releasing its locks. You must +put that reference with: + + put_inotify_watch(watchp); + +Call inotify_find_update_watch() to update the event mask for an existing watch. +inotify_find_update_watch() returns the wd of the updated watch, or an error if +the watch does not exist. + + wd = inotify_find_update_watch(ih, inode, mask); + +An existing watch may be removed by calling either inotify_rm_watch() or +inotify_rm_wd(). + + int ret = inotify_rm_watch(ih, &my_watch->iwatch); + int ret = inotify_rm_wd(ih, wd); + +A watch may be removed while executing your event handler with the following: + + inotify_remove_watch_locked(ih, iwatch); + +Call inotify_destroy() to remove all watches from your inotify instance and +release it. If there are no outstanding references, inotify_destroy() will call +your destroy_watch op for each watch. + + inotify_destroy(ih); + +When inotify removes a watch, it sends an IN_IGNORED event to your callback. +You may use this event as an indication to free the watch memory. Note that +inotify may remove a watch due to filesystem events, as well as by your request. +If you use IN_ONESHOT, inotify will remove the watch after the first event, at +which point you may call the final inotify_put_watch. + +(iv) Kernel Interface Prototypes + + struct inotify_handle *inotify_init(struct inotify_operations *ops); + + inotify_init_watch(struct inotify_watch *watch); + + s32 inotify_add_watch(struct inotify_handle *ih, + struct inotify_watch *watch, + struct inode *inode, u32 mask); + + s32 inotify_find_watch(struct inotify_handle *ih, struct inode *inode, + struct inotify_watch **watchp); + + s32 inotify_find_update_watch(struct inotify_handle *ih, + struct inode *inode, u32 mask); + + int inotify_rm_wd(struct inotify_handle *ih, u32 wd); + + int inotify_rm_watch(struct inotify_handle *ih, + struct inotify_watch *watch); + + void inotify_remove_watch_locked(struct inotify_handle *ih, + struct inotify_watch *watch); + + void inotify_destroy(struct inotify_handle *ih); + + void get_inotify_watch(struct inotify_watch *watch); + void put_inotify_watch(struct inotify_watch *watch); + + +(v) Internal Kernel Implementation + +Each inotify instance is represented by an inotify_handle structure. +Inotify's userspace consumers also have an inotify_device which is +associated with the inotify_handle, and on which events are queued. Each watch is associated with an inotify_watch structure. Watches are chained -off of each associated device and each associated inode. +off of each associated inotify_handle and each associated inode. -See fs/inotify.c for the locking and lifetime rules. +See fs/notify/inotify/inotify_fsnotify.c and fs/notify/inotify/inotify_user.c +for the locking and lifetime rules. -(iv) Rationale +(vi) Rationale Q: What is the design decision behind not tying the watch to the open fd of the watched object? @@ -145,7 +264,7 @@ A: The poor user-space interface is the second biggest problem with dnotify. file descriptor-based one that allows basic file I/O and poll/select. Obtaining the fd and managing the watches could have been done either via a device file or a family of new system calls. We decided to implement a - family of system calls because that is the preffered approach for new kernel + family of system calls because that is the preferred approach for new kernel interfaces. The only real difference was whether we wanted to use open(2) and ioctl(2) or a couple of new system calls. System calls beat ioctls. diff --git a/Documentation/filesystems/isofs.txt b/Documentation/filesystems/isofs.txt index 424585ff6ea..ba0a93384de 100644 --- a/Documentation/filesystems/isofs.txt +++ b/Documentation/filesystems/isofs.txt @@ -9,9 +9,9 @@ when using discs encoded using Microsoft's Joliet extensions. iocharset=name Character set to use for converting from Unicode to ASCII. Joliet filenames are stored in Unicode format, but Unix for the most part doesn't know how to deal with Unicode. - There is also an option of doing UTF8 translations with the + There is also an option of doing UTF-8 translations with the utf8 option. - utf8 Encode Unicode names in UTF8 format. Default is no. + utf8 Encode Unicode names in UTF-8 format. Default is no. Mount options unique to the isofs filesystem. block=512 Set the block size for the disk to 512 bytes @@ -23,7 +23,13 @@ Mount options unique to the isofs filesystem. map=off Do not map non-Rock Ridge filenames to lower case map=normal Map non-Rock Ridge filenames to lower case map=acorn As map=normal but also apply Acorn extensions if present - mode=xxx Sets the permissions on files to xxx + mode=xxx Sets the permissions on files to xxx unless Rock Ridge + extensions set the permissions otherwise + dmode=xxx Sets the permissions on directories to xxx unless Rock Ridge + extensions set the permissions otherwise + overriderockperm Set permissions on files and directories according to + 'mode' and 'dmode' even though Rock Ridge extensions are + present. nojoliet Ignore Joliet extensions if they are present. norock Ignore Rock Ridge extensions if they are present. hide Completely strip hidden files from the file system. @@ -35,7 +41,7 @@ Mount options unique to the isofs filesystem. sbsector=xxx Session begins from sector xxx Recommended documents about ISO 9660 standard are located at: -http://www.y-adagio.com/public/standards/iso_cdromr/tocont.htm +http://www.y-adagio.com/ ftp://ftp.ecma.ch/ecma-st/Ecma-119.pdf Quoting from the PDF "This 2nd Edition of Standard ECMA-119 is technically identical with ISO 9660.", so it is a valid and gratis substitute of the diff --git a/Documentation/filesystems/jfs.txt b/Documentation/filesystems/jfs.txt index 3e992daf99a..41fd757997b 100644 --- a/Documentation/filesystems/jfs.txt +++ b/Documentation/filesystems/jfs.txt @@ -3,10 +3,11 @@ IBM's Journaled File System (JFS) for Linux JFS Homepage: http://jfs.sourceforge.net/ The following mount options are supported: +(*) == default iocharset=name Character set to use for converting from Unicode to ASCII. The default is to do no conversion. Use - iocharset=utf8 for UTF8 translations. This requires + iocharset=utf8 for UTF-8 translations. This requires CONFIG_NLS_UTF8 to be set in the kernel .config file. iocharset=none specifies the default behavior explicitly. @@ -21,15 +22,31 @@ nointegrity Do not write to the journal. The primary use of this option from backup media. The integrity of the volume is not guaranteed if the system abnormally abends. -integrity Default. Commit metadata changes to the journal. Use this - option to remount a volume where the nointegrity option was +integrity(*) Commit metadata changes to the journal. Use this option to + remount a volume where the nointegrity option was previously specified in order to restore normal behavior. errors=continue Keep going on a filesystem error. -errors=remount-ro Default. Remount the filesystem read-only on an error. +errors=remount-ro(*) Remount the filesystem read-only on an error. errors=panic Panic and halt the machine if an error occurs. -Please send bugs, comments, cards and letters to shaggy@austin.ibm.com. +uid=value Override on-disk uid with specified value +gid=value Override on-disk gid with specified value +umask=value Override on-disk umask with specified octal value. For + directories, the execute bit will be set if the corresponding + read bit is set. + +discard=minlen This enables/disables the use of discard/TRIM commands. +discard The discard/TRIM commands are sent to the underlying +nodiscard(*) block device when blocks are freed. This is useful for SSD + devices and sparse/thinly-provisioned LUNs. The FITRIM ioctl + command is also available together with the nodiscard option. + The value of minlen specifies the minimum blockcount, when + a TRIM command to the block device is considered useful. + When no value is given to the discard option, it defaults to + 64 blocks, which means 256KiB in JFS. + The minlen value of discard overrides the minlen value given + on an FITRIM ioctl(). The JFS mailing list can be subscribed to by using the link labeled "Mail list Subscribe" at our web page http://jfs.sourceforge.net/ diff --git a/Documentation/filesystems/locks.txt b/Documentation/filesystems/locks.txt new file mode 100644 index 00000000000..2cf81082581 --- /dev/null +++ b/Documentation/filesystems/locks.txt @@ -0,0 +1,68 @@ + File Locking Release Notes + + Andy Walker <andy@lysaker.kvaerner.no> + + 12 May 1997 + + +1. What's New? +-------------- + +1.1 Broken Flock Emulation +-------------------------- + +The old flock(2) emulation in the kernel was swapped for proper BSD +compatible flock(2) support in the 1.3.x series of kernels. With the +release of the 2.1.x kernel series, support for the old emulation has +been totally removed, so that we don't need to carry this baggage +forever. + +This should not cause problems for anybody, since everybody using a +2.1.x kernel should have updated their C library to a suitable version +anyway (see the file "Documentation/Changes".) + +1.2 Allow Mixed Locks Again +--------------------------- + +1.2.1 Typical Problems - Sendmail +--------------------------------- +Because sendmail was unable to use the old flock() emulation, many sendmail +installations use fcntl() instead of flock(). This is true of Slackware 3.0 +for example. This gave rise to some other subtle problems if sendmail was +configured to rebuild the alias file. Sendmail tried to lock the aliases.dir +file with fcntl() at the same time as the GDBM routines tried to lock this +file with flock(). With pre 1.3.96 kernels this could result in deadlocks that, +over time, or under a very heavy mail load, would eventually cause the kernel +to lock solid with deadlocked processes. + + +1.2.2 The Solution +------------------ +The solution I have chosen, after much experimentation and discussion, +is to make flock() and fcntl() locks oblivious to each other. Both can +exists, and neither will have any effect on the other. + +I wanted the two lock styles to be cooperative, but there were so many +race and deadlock conditions that the current solution was the only +practical one. It puts us in the same position as, for example, SunOS +4.1.x and several other commercial Unices. The only OS's that support +cooperative flock()/fcntl() are those that emulate flock() using +fcntl(), with all the problems that implies. + + +1.3 Mandatory Locking As A Mount Option +--------------------------------------- + +Mandatory locking, as described in +'Documentation/filesystems/mandatory-locking.txt' was prior to this release a +general configuration option that was valid for all mounted filesystems. This +had a number of inherent dangers, not the least of which was the ability to +freeze an NFS server by asking it to read a file for which a mandatory lock +existed. + +From this release of the kernel, mandatory locking can be turned on and off +on a per-filesystem basis, using the mount options 'mand' and 'nomand'. +The default is to disallow mandatory locking. The intention is that +mandatory locking only be enabled on a local filesystem as the specific need +arises. + diff --git a/Documentation/filesystems/logfs.txt b/Documentation/filesystems/logfs.txt new file mode 100644 index 00000000000..bca42c22a14 --- /dev/null +++ b/Documentation/filesystems/logfs.txt @@ -0,0 +1,241 @@ + +The LogFS Flash Filesystem +========================== + +Specification +============= + +Superblocks +----------- + +Two superblocks exist at the beginning and end of the filesystem. +Each superblock is 256 Bytes large, with another 3840 Bytes reserved +for future purposes, making a total of 4096 Bytes. + +Superblock locations may differ for MTD and block devices. On MTD the +first non-bad block contains a superblock in the first 4096 Bytes and +the last non-bad block contains a superblock in the last 4096 Bytes. +On block devices, the first 4096 Bytes of the device contain the first +superblock and the last aligned 4096 Byte-block contains the second +superblock. + +For the most part, the superblocks can be considered read-only. They +are written only to correct errors detected within the superblocks, +move the journal and change the filesystem parameters through tunefs. +As a result, the superblock does not contain any fields that require +constant updates, like the amount of free space, etc. + +Segments +-------- + +The space in the device is split up into equal-sized segments. +Segments are the primary write unit of LogFS. Within each segments, +writes happen from front (low addresses) to back (high addresses. If +only a partial segment has been written, the segment number, the +current position within and optionally a write buffer are stored in +the journal. + +Segments are erased as a whole. Therefore Garbage Collection may be +required to completely free a segment before doing so. + +Journal +-------- + +The journal contains all global information about the filesystem that +is subject to frequent change. At mount time, it has to be scanned +for the most recent commit entry, which contains a list of pointers to +all currently valid entries. + +Object Store +------------ + +All space except for the superblocks and journal is part of the object +store. Each segment contains a segment header and a number of +objects, each consisting of the object header and the payload. +Objects are either inodes, directory entries (dentries), file data +blocks or indirect blocks. + +Levels +------ + +Garbage collection (GC) may fail if all data is written +indiscriminately. One requirement of GC is that data is separated +roughly according to the distance between the tree root and the data. +Effectively that means all file data is on level 0, indirect blocks +are on levels 1, 2, 3 4 or 5 for 1x, 2x, 3x, 4x or 5x indirect blocks, +respectively. Inode file data is on level 6 for the inodes and 7-11 +for indirect blocks. + +Each segment contains objects of a single level only. As a result, +each level requires its own separate segment to be open for writing. + +Inode File +---------- + +All inodes are stored in a special file, the inode file. Single +exception is the inode file's inode (master inode) which for obvious +reasons is stored in the journal instead. Instead of data blocks, the +leaf nodes of the inode files are inodes. + +Aliases +------- + +Writes in LogFS are done by means of a wandering tree. A naïve +implementation would require that for each write or a block, all +parent blocks are written as well, since the block pointers have +changed. Such an implementation would not be very efficient. + +In LogFS, the block pointer changes are cached in the journal by means +of alias entries. Each alias consists of its logical address - inode +number, block index, level and child number (index into block) - and +the changed data. Any 8-byte word can be changes in this manner. + +Currently aliases are used for block pointers, file size, file used +bytes and the height of an inodes indirect tree. + +Segment Aliases +--------------- + +Related to regular aliases, these are used to handle bad blocks. +Initially, bad blocks are handled by moving the affected segment +content to a spare segment and noting this move in the journal with a +segment alias, a simple (to, from) tupel. GC will later empty this +segment and the alias can be removed again. This is used on MTD only. + +Vim +--- + +By cleverly predicting the life time of data, it is possible to +separate long-living data from short-living data and thereby reduce +the GC overhead later. Each type of distinc life expectency (vim) can +have a separate segment open for writing. Each (level, vim) tupel can +be open just once. If an open segment with unknown vim is encountered +at mount time, it is closed and ignored henceforth. + +Indirect Tree +------------- + +Inodes in LogFS are similar to FFS-style filesystems with direct and +indirect block pointers. One difference is that LogFS uses a single +indirect pointer that can be either a 1x, 2x, etc. indirect pointer. +A height field in the inode defines the height of the indirect tree +and thereby the indirection of the pointer. + +Another difference is the addressing of indirect blocks. In LogFS, +the first 16 pointers in the first indirect block are left empty, +corresponding to the 16 direct pointers in the inode. In ext2 (maybe +others as well) the first pointer in the first indirect block +corresponds to logical block 12, skipping the 12 direct pointers. +So where ext2 is using arithmetic to better utilize space, LogFS keeps +arithmetic simple and uses compression to save space. + +Compression +----------- + +Both file data and metadata can be compressed. Compression for file +data can be enabled with chattr +c and disabled with chattr -c. Doing +so has no effect on existing data, but new data will be stored +accordingly. New inodes will inherit the compression flag of the +parent directory. + +Metadata is always compressed. However, the space accounting ignores +this and charges for the uncompressed size. Failing to do so could +result in GC failures when, after moving some data, indirect blocks +compress worse than previously. Even on a 100% full medium, GC may +not consume any extra space, so the compression gains are lost space +to the user. + +However, they are not lost space to the filesystem internals. By +cheating the user for those bytes, the filesystem gained some slack +space and GC will run less often and faster. + +Garbage Collection and Wear Leveling +------------------------------------ + +Garbage collection is invoked whenever the number of free segments +falls below a threshold. The best (known) candidate is picked based +on the least amount of valid data contained in the segment. All +remaining valid data is copied elsewhere, thereby invalidating it. + +The GC code also checks for aliases and writes then back if their +number gets too large. + +Wear leveling is done by occasionally picking a suboptimal segment for +garbage collection. If a stale segments erase count is significantly +lower than the active segments' erase counts, it will be picked. Wear +leveling is rate limited, so it will never monopolize the device for +more than one segment worth at a time. + +Values for "occasionally", "significantly lower" are compile time +constants. + +Hashed directories +------------------ + +To satisfy efficient lookup(), directory entries are hashed and +located based on the hash. In order to both support large directories +and not be overly inefficient for small directories, several hash +tables of increasing size are used. For each table, the hash value +modulo the table size gives the table index. + +Tables sizes are chosen to limit the number of indirect blocks with a +fully populated table to 0, 1, 2 or 3 respectively. So the first +table contains 16 entries, the second 512-16, etc. + +The last table is special in several ways. First its size depends on +the effective 32bit limit on telldir/seekdir cookies. Since logfs +uses the upper half of the address space for indirect blocks, the size +is limited to 2^31. Secondly the table contains hash buckets with 16 +entries each. + +Using single-entry buckets would result in birthday "attacks". At +just 2^16 used entries, hash collisions would be likely (P >= 0.5). +My math skills are insufficient to do the combinatorics for the 17x +collisions necessary to overflow a bucket, but testing showed that in +10,000 runs the lowest directory fill before a bucket overflow was +188,057,130 entries with an average of 315,149,915 entries. So for +directory sizes of up to a million, bucket overflows should be +virtually impossible under normal circumstances. + +With carefully chosen filenames, it is obviously possible to cause an +overflow with just 21 entries (4 higher tables + 16 entries + 1). So +there may be a security concern if a malicious user has write access +to a directory. + +Open For Discussion +=================== + +Device Address Space +-------------------- + +A device address space is used for caching. Both block devices and +MTD provide functions to either read a single page or write a segment. +Partial segments may be written for data integrity, but where possible +complete segments are written for performance on simple block device +flash media. + +Meta Inodes +----------- + +Inodes are stored in the inode file, which is just a regular file for +most purposes. At umount time, however, the inode file needs to +remain open until all dirty inodes are written. So +generic_shutdown_super() may not close this inode, but shouldn't +complain about remaining inodes due to the inode file either. Same +goes for mapping inode of the device address space. + +Currently logfs uses a hack that essentially copies part of fs/inode.c +code over. A general solution would be preferred. + +Indirect block mapping +---------------------- + +With compression, the block device (or mapping inode) cannot be used +to cache indirect blocks. Some other place is required. Currently +logfs uses the top half of each inode's address space. The low 8TB +(on 32bit) are filled with file data, the high 8TB are used for +indirect blocks. + +One problem is that 16TB files created on 64bit systems actually have +data in the top 8TB. But files >16TB would cause problems anyway, so +only the limit has changed. diff --git a/Documentation/filesystems/mandatory-locking.txt b/Documentation/filesystems/mandatory-locking.txt new file mode 100644 index 00000000000..0979d1d2ca8 --- /dev/null +++ b/Documentation/filesystems/mandatory-locking.txt @@ -0,0 +1,171 @@ + Mandatory File Locking For The Linux Operating System + + Andy Walker <andy@lysaker.kvaerner.no> + + 15 April 1996 + (Updated September 2007) + +0. Why you should avoid mandatory locking +----------------------------------------- + +The Linux implementation is prey to a number of difficult-to-fix race +conditions which in practice make it not dependable: + + - The write system call checks for a mandatory lock only once + at its start. It is therefore possible for a lock request to + be granted after this check but before the data is modified. + A process may then see file data change even while a mandatory + lock was held. + - Similarly, an exclusive lock may be granted on a file after + the kernel has decided to proceed with a read, but before the + read has actually completed, and the reading process may see + the file data in a state which should not have been visible + to it. + - Similar races make the claimed mutual exclusion between lock + and mmap similarly unreliable. + +1. What is mandatory locking? +------------------------------ + +Mandatory locking is kernel enforced file locking, as opposed to the more usual +cooperative file locking used to guarantee sequential access to files among +processes. File locks are applied using the flock() and fcntl() system calls +(and the lockf() library routine which is a wrapper around fcntl().) It is +normally a process' responsibility to check for locks on a file it wishes to +update, before applying its own lock, updating the file and unlocking it again. +The most commonly used example of this (and in the case of sendmail, the most +troublesome) is access to a user's mailbox. The mail user agent and the mail +transfer agent must guard against updating the mailbox at the same time, and +prevent reading the mailbox while it is being updated. + +In a perfect world all processes would use and honour a cooperative, or +"advisory" locking scheme. However, the world isn't perfect, and there's +a lot of poorly written code out there. + +In trying to address this problem, the designers of System V UNIX came up +with a "mandatory" locking scheme, whereby the operating system kernel would +block attempts by a process to write to a file that another process holds a +"read" -or- "shared" lock on, and block attempts to both read and write to a +file that a process holds a "write " -or- "exclusive" lock on. + +The System V mandatory locking scheme was intended to have as little impact as +possible on existing user code. The scheme is based on marking individual files +as candidates for mandatory locking, and using the existing fcntl()/lockf() +interface for applying locks just as if they were normal, advisory locks. + +Note 1: In saying "file" in the paragraphs above I am actually not telling +the whole truth. System V locking is based on fcntl(). The granularity of +fcntl() is such that it allows the locking of byte ranges in files, in addition +to entire files, so the mandatory locking rules also have byte level +granularity. + +Note 2: POSIX.1 does not specify any scheme for mandatory locking, despite +borrowing the fcntl() locking scheme from System V. The mandatory locking +scheme is defined by the System V Interface Definition (SVID) Version 3. + +2. Marking a file for mandatory locking +--------------------------------------- + +A file is marked as a candidate for mandatory locking by setting the group-id +bit in its file mode but removing the group-execute bit. This is an otherwise +meaningless combination, and was chosen by the System V implementors so as not +to break existing user programs. + +Note that the group-id bit is usually automatically cleared by the kernel when +a setgid file is written to. This is a security measure. The kernel has been +modified to recognize the special case of a mandatory lock candidate and to +refrain from clearing this bit. Similarly the kernel has been modified not +to run mandatory lock candidates with setgid privileges. + +3. Available implementations +---------------------------- + +I have considered the implementations of mandatory locking available with +SunOS 4.1.x, Solaris 2.x and HP-UX 9.x. + +Generally I have tried to make the most sense out of the behaviour exhibited +by these three reference systems. There are many anomalies. + +All the reference systems reject all calls to open() for a file on which +another process has outstanding mandatory locks. This is in direct +contravention of SVID 3, which states that only calls to open() with the +O_TRUNC flag set should be rejected. The Linux implementation follows the SVID +definition, which is the "Right Thing", since only calls with O_TRUNC can +modify the contents of the file. + +HP-UX even disallows open() with O_TRUNC for a file with advisory locks, not +just mandatory locks. That would appear to contravene POSIX.1. + +mmap() is another interesting case. All the operating systems mentioned +prevent mandatory locks from being applied to an mmap()'ed file, but HP-UX +also disallows advisory locks for such a file. SVID actually specifies the +paranoid HP-UX behaviour. + +In my opinion only MAP_SHARED mappings should be immune from locking, and then +only from mandatory locks - that is what is currently implemented. + +SunOS is so hopeless that it doesn't even honour the O_NONBLOCK flag for +mandatory locks, so reads and writes to locked files always block when they +should return EAGAIN. + +I'm afraid that this is such an esoteric area that the semantics described +below are just as valid as any others, so long as the main points seem to +agree. + +4. Semantics +------------ + +1. Mandatory locks can only be applied via the fcntl()/lockf() locking + interface - in other words the System V/POSIX interface. BSD style + locks using flock() never result in a mandatory lock. + +2. If a process has locked a region of a file with a mandatory read lock, then + other processes are permitted to read from that region. If any of these + processes attempts to write to the region it will block until the lock is + released, unless the process has opened the file with the O_NONBLOCK + flag in which case the system call will return immediately with the error + status EAGAIN. + +3. If a process has locked a region of a file with a mandatory write lock, all + attempts to read or write to that region block until the lock is released, + unless a process has opened the file with the O_NONBLOCK flag in which case + the system call will return immediately with the error status EAGAIN. + +4. Calls to open() with O_TRUNC, or to creat(), on a existing file that has + any mandatory locks owned by other processes will be rejected with the + error status EAGAIN. + +5. Attempts to apply a mandatory lock to a file that is memory mapped and + shared (via mmap() with MAP_SHARED) will be rejected with the error status + EAGAIN. + +6. Attempts to create a shared memory map of a file (via mmap() with MAP_SHARED) + that has any mandatory locks in effect will be rejected with the error status + EAGAIN. + +5. Which system calls are affected? +----------------------------------- + +Those which modify a file's contents, not just the inode. That gives read(), +write(), readv(), writev(), open(), creat(), mmap(), truncate() and +ftruncate(). truncate() and ftruncate() are considered to be "write" actions +for the purposes of mandatory locking. + +The affected region is usually defined as stretching from the current position +for the total number of bytes read or written. For the truncate calls it is +defined as the bytes of a file removed or added (we must also consider bytes +added, as a lock can specify just "the whole file", rather than a specific +range of bytes.) + +Note 3: I may have overlooked some system calls that need mandatory lock +checking in my eagerness to get this code out the door. Please let me know, or +better still fix the system calls yourself and submit a patch to me or Linus. + +6. Warning! +----------- + +Not even root can override a mandatory lock, so runaway processes can wreak +havoc if they lock crucial files. The way around it is to change the file +permissions (remove the setgid bit) before trying to read or write to it. +Of course, that might be a bit tricky if the system is hung :-( + diff --git a/Documentation/filesystems/ncpfs.txt b/Documentation/filesystems/ncpfs.txt index f12c30c93f2..5af164f4b37 100644 --- a/Documentation/filesystems/ncpfs.txt +++ b/Documentation/filesystems/ncpfs.txt @@ -7,6 +7,6 @@ ftp.gwdg.de/pub/linux/misc/ncpfs, but sunsite and its many mirrors will have it as well. Related products are linware and mars_nwe, which will give Linux partial -NetWare server functionality. Linware's home site is -klokan.sh.cvut.cz/pub/linux/linware; mars_nwe can be found on -ftp.gwdg.de/pub/linux/misc/ncpfs. +NetWare server functionality. + +mars_nwe can be found on ftp.gwdg.de/pub/linux/misc/ncpfs. diff --git a/Documentation/filesystems/nfs/00-INDEX b/Documentation/filesystems/nfs/00-INDEX new file mode 100644 index 00000000000..53f3b596ac0 --- /dev/null +++ b/Documentation/filesystems/nfs/00-INDEX @@ -0,0 +1,26 @@ +00-INDEX + - this file (nfs-related documentation). +Exporting + - explanation of how to make filesystems exportable. +fault_injection.txt + - information for using fault injection on the server +knfsd-stats.txt + - statistics which the NFS server makes available to user space. +nfs.txt + - nfs client, and DNS resolution for fs_locations. +nfs41-server.txt + - info on the Linux server implementation of NFSv4 minor version 1. +nfs-rdma.txt + - how to install and setup the Linux NFS/RDMA client and server software +nfsd-admin-interfaces.txt + - Administrative interfaces for nfsd. +nfsroot.txt + - short guide on setting up a diskless box with NFS root filesystem. +pnfs.txt + - short explanation of some of the internals of the pnfs client code +rpc-cache.txt + - introduction to the caching mechanisms in the sunrpc layer. +idmapper.txt + - information for configuring request-keys to be used by idmapper +rpc-server-gss.txt + - Information on GSS authentication support in the NFS Server diff --git a/Documentation/filesystems/Exporting b/Documentation/filesystems/nfs/Exporting index 31047e0fe14..e543b1a619c 100644 --- a/Documentation/filesystems/Exporting +++ b/Documentation/filesystems/nfs/Exporting @@ -2,9 +2,12 @@ Making Filesystems Exportable ============================= -Most filesystem operations require a dentry (or two) as a starting +Overview +-------- + +All filesystem operations require a dentry (or two) as a starting point. Local applications have a reference-counted hold on suitable -dentrys via open file descriptors or cwd/root. However remote +dentries via open file descriptors or cwd/root. However remote applications that access a filesystem via a remote filesystem protocol such as NFS may not be able to hold such a reference, and so need a different way to refer to a particular dentry. As the alternative @@ -13,14 +16,14 @@ server-reboot (among other things, though these tend to be the most problematic), there is no simple answer like 'filename'. The mechanism discussed here allows each filesystem implementation to -specify how to generate an opaque (out side of the filesystem) byte +specify how to generate an opaque (outside of the filesystem) byte string for any dentry, and how to find an appropriate dentry for any given opaque byte string. This byte string will be called a "filehandle fragment" as it corresponds to part of an NFS filehandle. A filesystem which supports the mapping between filehandle fragments -and dentrys will be termed "exportable". +and dentries will be termed "exportable". @@ -89,11 +92,16 @@ For a filesystem to be exportable it must: 1/ provide the filehandle fragment routines described below. 2/ make sure that d_splice_alias is used rather than d_add when ->lookup finds an inode for a given parent and name. - Typically the ->lookup routine will end: - if (inode) - return d_splice(inode, dentry); - d_add(dentry, inode); - return NULL; + + If inode is NULL, d_splice_alias(inode, dentry) is equivalent to + + d_add(dentry, inode), NULL + + Similarly, d_splice_alias(ERR_PTR(err), dentry) = ERR_PTR(err) + + Typically the ->lookup routine will simply end with a: + + return d_splice_alias(inode, dentry); } @@ -101,67 +109,39 @@ For a filesystem to be exportable it must: A file system implementation declares that instances of the filesystem are exportable by setting the s_export_op field in the struct super_block. This field must point to a "struct export_operations" -struct which could potentially be full of NULLs, though normally at -least get_parent will be set. - - The primary operations are decode_fh and encode_fh. -decode_fh takes a filehandle fragment and tries to find or create a -dentry for the object referred to by the filehandle. -encode_fh takes a dentry and creates a filehandle fragment which can -later be used to find/create a dentry for the same object. - -decode_fh will probably make use of "find_exported_dentry". -This function lives in the "exportfs" module which a filesystem does -not need unless it is being exported. So rather that calling -find_exported_dentry directly, each filesystem should call it through -the find_exported_dentry pointer in it's export_operations table. -This field is set correctly by the exporting agent (e.g. nfsd) when a -filesystem is exported, and before any export operations are called. - -find_exported_dentry needs three support functions from the -filesystem: - get_name. When given a parent dentry and a child dentry, this - should find a name in the directory identified by the parent - dentry, which leads to the object identified by the child dentry. - If no get_name function is supplied, a default implementation is - provided which uses vfs_readdir to find potential names, and - matches inode numbers to find the correct match. - - get_parent. When given a dentry for a directory, this should return - a dentry for the parent. Quite possibly the parent dentry will - have been allocated by d_alloc_anon. - The default get_parent function just returns an error so any - filehandle lookup that requires finding a parent will fail. - ->lookup("..") is *not* used as a default as it can leave ".." - entries in the dcache which are too messy to work with. - - get_dentry. When given an opaque datum, this should find the - implied object and create a dentry for it (possibly with - d_alloc_anon). - The opaque datum is whatever is passed down by the decode_fh - function, and is often simply a fragment of the filehandle - fragment. - decode_fh passes two datums through find_exported_dentry. One that - should be used to identify the target object, and one that can be - used to identify the object's parent, should that be necessary. - The default get_dentry function assumes that the datum contains an - inode number and a generation number, and it attempts to get the - inode using "iget" and check it's validity by matching the - generation number. A filesystem should only depend on the default - if iget can safely be used this way. - -If decode_fh and/or encode_fh are left as NULL, then default -implementations are used. These defaults are suitable for ext2 and -extremely similar filesystems (like ext3). - -The default encode_fh creates a filehandle fragment from the inode -number and generation number of the target together with the inode -number and generation number of the parent (if the parent is -required). - -The default decode_fh extract the target and parent datums from the -filehandle assuming the format used by the default encode_fh and -passed them to find_exported_dentry. +struct which has the following members: + + encode_fh (optional) + Takes a dentry and creates a filehandle fragment which can later be used + to find or create a dentry for the same object. The default + implementation creates a filehandle fragment that encodes a 32bit inode + and generation number for the inode encoded, and if necessary the + same information for the parent. + + fh_to_dentry (mandatory) + Given a filehandle fragment, this should find the implied object and + create a dentry for it (possibly with d_alloc_anon). + + fh_to_parent (optional but strongly recommended) + Given a filehandle fragment, this should find the parent of the + implied object and create a dentry for it (possibly with d_alloc_anon). + May fail if the filehandle fragment is too small. + + get_parent (optional but strongly recommended) + When given a dentry for a directory, this should return a dentry for + the parent. Quite possibly the parent dentry will have been allocated + by d_alloc_anon. The default get_parent function just returns an error + so any filehandle lookup that requires finding a parent will fail. + ->lookup("..") is *not* used as a default as it can leave ".." entries + in the dcache which are too messy to work with. + + get_name (optional) + When given a parent dentry and a child dentry, this should find a name + in the directory identified by the parent dentry, which leads to the + object identified by the child dentry. If no get_name function is + supplied, a default implementation is provided which uses vfs_readdir + to find potential names, and matches inode numbers to find the correct + match. A filehandle fragment consists of an array of 1 or more 4byte words, @@ -172,5 +152,3 @@ generated by encode_fh, in which case it will have been padded with nuls. Rather, the encode_fh routine should choose a "type" which indicates the decode_fh how much of the filehandle is valid, and how it should be interpreted. - - diff --git a/Documentation/filesystems/nfs/fault_injection.txt b/Documentation/filesystems/nfs/fault_injection.txt new file mode 100644 index 00000000000..426d166089a --- /dev/null +++ b/Documentation/filesystems/nfs/fault_injection.txt @@ -0,0 +1,69 @@ + +Fault Injection +=============== +Fault injection is a method for forcing errors that may not normally occur, or +may be difficult to reproduce. Forcing these errors in a controlled environment +can help the developer find and fix bugs before their code is shipped in a +production system. Injecting an error on the Linux NFS server will allow us to +observe how the client reacts and if it manages to recover its state correctly. + +NFSD_FAULT_INJECTION must be selected when configuring the kernel to use this +feature. + + +Using Fault Injection +===================== +On the client, mount the fault injection server through NFS v4.0+ and do some +work over NFS (open files, take locks, ...). + +On the server, mount the debugfs filesystem to <debug_dir> and ls +<debug_dir>/nfsd. This will show a list of files that will be used for +injecting faults on the NFS server. As root, write a number n to the file +corresponding to the action you want the server to take. The server will then +process the first n items it finds. So if you want to forget 5 locks, echo '5' +to <debug_dir>/nfsd/forget_locks. A value of 0 will tell the server to forget +all corresponding items. A log message will be created containing the number +of items forgotten (check dmesg). + +Go back to work on the client and check if the client recovered from the error +correctly. + + +Available Faults +================ +forget_clients: + The NFS server keeps a list of clients that have placed a mount call. If + this list is cleared, the server will have no knowledge of who the client + is, forcing the client to reauthenticate with the server. + +forget_openowners: + The NFS server keeps a list of what files are currently opened and who + they were opened by. Clearing this list will force the client to reopen + its files. + +forget_locks: + The NFS server keeps a list of what files are currently locked in the VFS. + Clearing this list will force the client to reclaim its locks (files are + unlocked through the VFS as they are cleared from this list). + +forget_delegations: + A delegation is used to assure the client that a file, or part of a file, + has not changed since the delegation was awarded. Clearing this list will + force the client to reaquire its delegation before accessing the file + again. + +recall_delegations: + Delegations can be recalled by the server when another client attempts to + access a file. This test will notify the client that its delegation has + been revoked, forcing the client to reaquire the delegation before using + the file again. + + +tools/nfs/inject_faults.sh script +================================= +This script has been created to ease the fault injection process. This script +will detect the mounted debugfs directory and write to the files located there +based on the arguments passed by the user. For example, running +`inject_faults.sh forget_locks 1` as root will instruct the server to forget +one lock. Running `inject_faults forget_locks` will instruct the server to +forgetall locks. diff --git a/Documentation/filesystems/nfs/idmapper.txt b/Documentation/filesystems/nfs/idmapper.txt new file mode 100644 index 00000000000..fe03d10bb79 --- /dev/null +++ b/Documentation/filesystems/nfs/idmapper.txt @@ -0,0 +1,75 @@ + +========= +ID Mapper +========= +Id mapper is used by NFS to translate user and group ids into names, and to +translate user and group names into ids. Part of this translation involves +performing an upcall to userspace to request the information. There are two +ways NFS could obtain this information: placing a call to /sbin/request-key +or by placing a call to the rpc.idmap daemon. + +NFS will attempt to call /sbin/request-key first. If this succeeds, the +result will be cached using the generic request-key cache. This call should +only fail if /etc/request-key.conf is not configured for the id_resolver key +type, see the "Configuring" section below if you wish to use the request-key +method. + +If the call to /sbin/request-key fails (if /etc/request-key.conf is not +configured with the id_resolver key type), then the idmapper will ask the +legacy rpc.idmap daemon for the id mapping. This result will be stored +in a custom NFS idmap cache. + + +=========== +Configuring +=========== +The file /etc/request-key.conf will need to be modified so /sbin/request-key can +direct the upcall. The following line should be added: + +#OP TYPE DESCRIPTION CALLOUT INFO PROGRAM ARG1 ARG2 ARG3 ... +#====== ======= =============== =============== =============================== +create id_resolver * * /usr/sbin/nfs.idmap %k %d 600 + +This will direct all id_resolver requests to the program /usr/sbin/nfs.idmap. +The last parameter, 600, defines how many seconds into the future the key will +expire. This parameter is optional for /usr/sbin/nfs.idmap. When the timeout +is not specified, nfs.idmap will default to 600 seconds. + +id mapper uses for key descriptions: + uid: Find the UID for the given user + gid: Find the GID for the given group + user: Find the user name for the given UID + group: Find the group name for the given GID + +You can handle any of these individually, rather than using the generic upcall +program. If you would like to use your own program for a uid lookup then you +would edit your request-key.conf so it look similar to this: + +#OP TYPE DESCRIPTION CALLOUT INFO PROGRAM ARG1 ARG2 ARG3 ... +#====== ======= =============== =============== =============================== +create id_resolver uid:* * /some/other/program %k %d 600 +create id_resolver * * /usr/sbin/nfs.idmap %k %d 600 + +Notice that the new line was added above the line for the generic program. +request-key will find the first matching line and corresponding program. In +this case, /some/other/program will handle all uid lookups and +/usr/sbin/nfs.idmap will handle gid, user, and group lookups. + +See <file:Documentation/security/keys-request-key.txt> for more information +about the request-key function. + + +========= +nfs.idmap +========= +nfs.idmap is designed to be called by request-key, and should not be run "by +hand". This program takes two arguments, a serialized key and a key +description. The serialized key is first converted into a key_serial_t, and +then passed as an argument to keyctl_instantiate (both are part of keyutils.h). + +The actual lookups are performed by functions found in nfsidmap.h. nfs.idmap +determines the correct function to call by looking at the first part of the +description string. For example, a uid lookup description will appear as +"uid:user@domain". + +nfs.idmap will return 0 if the key was instantiated, and non-zero otherwise. diff --git a/Documentation/filesystems/nfs/knfsd-stats.txt b/Documentation/filesystems/nfs/knfsd-stats.txt new file mode 100644 index 00000000000..64ced5149d3 --- /dev/null +++ b/Documentation/filesystems/nfs/knfsd-stats.txt @@ -0,0 +1,159 @@ + +Kernel NFS Server Statistics +============================ + +This document describes the format and semantics of the statistics +which the kernel NFS server makes available to userspace. These +statistics are available in several text form pseudo files, each of +which is described separately below. + +In most cases you don't need to know these formats, as the nfsstat(8) +program from the nfs-utils distribution provides a helpful command-line +interface for extracting and printing them. + +All the files described here are formatted as a sequence of text lines, +separated by newline '\n' characters. Lines beginning with a hash +'#' character are comments intended for humans and should be ignored +by parsing routines. All other lines contain a sequence of fields +separated by whitespace. + +/proc/fs/nfsd/pool_stats +------------------------ + +This file is available in kernels from 2.6.30 onwards, if the +/proc/fs/nfsd filesystem is mounted (it almost always should be). + +The first line is a comment which describes the fields present in +all the other lines. The other lines present the following data as +a sequence of unsigned decimal numeric fields. One line is shown +for each NFS thread pool. + +All counters are 64 bits wide and wrap naturally. There is no way +to zero these counters, instead applications should do their own +rate conversion. + +pool + The id number of the NFS thread pool to which this line applies. + This number does not change. + + Thread pool ids are a contiguous set of small integers starting + at zero. The maximum value depends on the thread pool mode, but + currently cannot be larger than the number of CPUs in the system. + Note that in the default case there will be a single thread pool + which contains all the nfsd threads and all the CPUs in the system, + and thus this file will have a single line with a pool id of "0". + +packets-arrived + Counts how many NFS packets have arrived. More precisely, this + is the number of times that the network stack has notified the + sunrpc server layer that new data may be available on a transport + (e.g. an NFS or UDP socket or an NFS/RDMA endpoint). + + Depending on the NFS workload patterns and various network stack + effects (such as Large Receive Offload) which can combine packets + on the wire, this may be either more or less than the number + of NFS calls received (which statistic is available elsewhere). + However this is a more accurate and less workload-dependent measure + of how much CPU load is being placed on the sunrpc server layer + due to NFS network traffic. + +sockets-enqueued + Counts how many times an NFS transport is enqueued to wait for + an nfsd thread to service it, i.e. no nfsd thread was considered + available. + + The circumstance this statistic tracks indicates that there was NFS + network-facing work to be done but it couldn't be done immediately, + thus introducing a small delay in servicing NFS calls. The ideal + rate of change for this counter is zero; significantly non-zero + values may indicate a performance limitation. + + This can happen either because there are too few nfsd threads in the + thread pool for the NFS workload (the workload is thread-limited), + or because the NFS workload needs more CPU time than is available in + the thread pool (the workload is CPU-limited). In the former case, + configuring more nfsd threads will probably improve the performance + of the NFS workload. In the latter case, the sunrpc server layer is + already choosing not to wake idle nfsd threads because there are too + many nfsd threads which want to run but cannot, so configuring more + nfsd threads will make no difference whatsoever. The overloads-avoided + statistic (see below) can be used to distinguish these cases. + +threads-woken + Counts how many times an idle nfsd thread is woken to try to + receive some data from an NFS transport. + + This statistic tracks the circumstance where incoming + network-facing NFS work is being handled quickly, which is a good + thing. The ideal rate of change for this counter will be close + to but less than the rate of change of the packets-arrived counter. + +overloads-avoided + Counts how many times the sunrpc server layer chose not to wake an + nfsd thread, despite the presence of idle nfsd threads, because + too many nfsd threads had been recently woken but could not get + enough CPU time to actually run. + + This statistic counts a circumstance where the sunrpc layer + heuristically avoids overloading the CPU scheduler with too many + runnable nfsd threads. The ideal rate of change for this counter + is zero. Significant non-zero values indicate that the workload + is CPU limited. Usually this is associated with heavy CPU usage + on all the CPUs in the nfsd thread pool. + + If a sustained large overloads-avoided rate is detected on a pool, + the top(1) utility should be used to check for the following + pattern of CPU usage on all the CPUs associated with the given + nfsd thread pool. + + - %us ~= 0 (as you're *NOT* running applications on your NFS server) + + - %wa ~= 0 + + - %id ~= 0 + + - %sy + %hi + %si ~= 100 + + If this pattern is seen, configuring more nfsd threads will *not* + improve the performance of the workload. If this patten is not + seen, then something more subtle is wrong. + +threads-timedout + Counts how many times an nfsd thread triggered an idle timeout, + i.e. was not woken to handle any incoming network packets for + some time. + + This statistic counts a circumstance where there are more nfsd + threads configured than can be used by the NFS workload. This is + a clue that the number of nfsd threads can be reduced without + affecting performance. Unfortunately, it's only a clue and not + a strong indication, for a couple of reasons: + + - Currently the rate at which the counter is incremented is quite + slow; the idle timeout is 60 minutes. Unless the NFS workload + remains constant for hours at a time, this counter is unlikely + to be providing information that is still useful. + + - It is usually a wise policy to provide some slack, + i.e. configure a few more nfsds than are currently needed, + to allow for future spikes in load. + + +Note that incoming packets on NFS transports will be dealt with in +one of three ways. An nfsd thread can be woken (threads-woken counts +this case), or the transport can be enqueued for later attention +(sockets-enqueued counts this case), or the packet can be temporarily +deferred because the transport is currently being used by an nfsd +thread. This last case is not very interesting and is not explicitly +counted, but can be inferred from the other counters thus: + +packets-deferred = packets-arrived - ( sockets-enqueued + threads-woken ) + + +More +---- +Descriptions of the other statistics file should go here. + + +Greg Banks <gnb@sgi.com> +26 Mar 2009 diff --git a/Documentation/filesystems/nfs/nfs-rdma.txt b/Documentation/filesystems/nfs/nfs-rdma.txt new file mode 100644 index 00000000000..e386f7e4bce --- /dev/null +++ b/Documentation/filesystems/nfs/nfs-rdma.txt @@ -0,0 +1,271 @@ +################################################################################ +# # +# NFS/RDMA README # +# # +################################################################################ + + Author: NetApp and Open Grid Computing + Date: May 29, 2008 + +Table of Contents +~~~~~~~~~~~~~~~~~ + - Overview + - Getting Help + - Installation + - Check RDMA and NFS Setup + - NFS/RDMA Setup + +Overview +~~~~~~~~ + + This document describes how to install and setup the Linux NFS/RDMA client + and server software. + + The NFS/RDMA client was first included in Linux 2.6.24. The NFS/RDMA server + was first included in the following release, Linux 2.6.25. + + In our testing, we have obtained excellent performance results (full 10Gbit + wire bandwidth at minimal client CPU) under many workloads. The code passes + the full Connectathon test suite and operates over both Infiniband and iWARP + RDMA adapters. + +Getting Help +~~~~~~~~~~~~ + + If you get stuck, you can ask questions on the + + nfs-rdma-devel@lists.sourceforge.net + + mailing list. + +Installation +~~~~~~~~~~~~ + + These instructions are a step by step guide to building a machine for + use with NFS/RDMA. + + - Install an RDMA device + + Any device supported by the drivers in drivers/infiniband/hw is acceptable. + + Testing has been performed using several Mellanox-based IB cards, the + Ammasso AMS1100 iWARP adapter, and the Chelsio cxgb3 iWARP adapter. + + - Install a Linux distribution and tools + + The first kernel release to contain both the NFS/RDMA client and server was + Linux 2.6.25 Therefore, a distribution compatible with this and subsequent + Linux kernel release should be installed. + + The procedures described in this document have been tested with + distributions from Red Hat's Fedora Project (http://fedora.redhat.com/). + + - Install nfs-utils-1.1.2 or greater on the client + + An NFS/RDMA mount point can be obtained by using the mount.nfs command in + nfs-utils-1.1.2 or greater (nfs-utils-1.1.1 was the first nfs-utils + version with support for NFS/RDMA mounts, but for various reasons we + recommend using nfs-utils-1.1.2 or greater). To see which version of + mount.nfs you are using, type: + + $ /sbin/mount.nfs -V + + If the version is less than 1.1.2 or the command does not exist, + you should install the latest version of nfs-utils. + + Download the latest package from: + + http://www.kernel.org/pub/linux/utils/nfs + + Uncompress the package and follow the installation instructions. + + If you will not need the idmapper and gssd executables (you do not need + these to create an NFS/RDMA enabled mount command), the installation + process can be simplified by disabling these features when running + configure: + + $ ./configure --disable-gss --disable-nfsv4 + + To build nfs-utils you will need the tcp_wrappers package installed. For + more information on this see the package's README and INSTALL files. + + After building the nfs-utils package, there will be a mount.nfs binary in + the utils/mount directory. This binary can be used to initiate NFS v2, v3, + or v4 mounts. To initiate a v4 mount, the binary must be called + mount.nfs4. The standard technique is to create a symlink called + mount.nfs4 to mount.nfs. + + This mount.nfs binary should be installed at /sbin/mount.nfs as follows: + + $ sudo cp utils/mount/mount.nfs /sbin/mount.nfs + + In this location, mount.nfs will be invoked automatically for NFS mounts + by the system mount command. + + NOTE: mount.nfs and therefore nfs-utils-1.1.2 or greater is only needed + on the NFS client machine. You do not need this specific version of + nfs-utils on the server. Furthermore, only the mount.nfs command from + nfs-utils-1.1.2 is needed on the client. + + - Install a Linux kernel with NFS/RDMA + + The NFS/RDMA client and server are both included in the mainline Linux + kernel version 2.6.25 and later. This and other versions of the 2.6 Linux + kernel can be found at: + + ftp://ftp.kernel.org/pub/linux/kernel/v2.6/ + + Download the sources and place them in an appropriate location. + + - Configure the RDMA stack + + Make sure your kernel configuration has RDMA support enabled. Under + Device Drivers -> InfiniBand support, update the kernel configuration + to enable InfiniBand support [NOTE: the option name is misleading. Enabling + InfiniBand support is required for all RDMA devices (IB, iWARP, etc.)]. + + Enable the appropriate IB HCA support (mlx4, mthca, ehca, ipath, etc.) or + iWARP adapter support (amso, cxgb3, etc.). + + If you are using InfiniBand, be sure to enable IP-over-InfiniBand support. + + - Configure the NFS client and server + + Your kernel configuration must also have NFS file system support and/or + NFS server support enabled. These and other NFS related configuration + options can be found under File Systems -> Network File Systems. + + - Build, install, reboot + + The NFS/RDMA code will be enabled automatically if NFS and RDMA + are turned on. The NFS/RDMA client and server are configured via the hidden + SUNRPC_XPRT_RDMA config option that depends on SUNRPC and INFINIBAND. The + value of SUNRPC_XPRT_RDMA will be: + + - N if either SUNRPC or INFINIBAND are N, in this case the NFS/RDMA client + and server will not be built + - M if both SUNRPC and INFINIBAND are on (M or Y) and at least one is M, + in this case the NFS/RDMA client and server will be built as modules + - Y if both SUNRPC and INFINIBAND are Y, in this case the NFS/RDMA client + and server will be built into the kernel + + Therefore, if you have followed the steps above and turned no NFS and RDMA, + the NFS/RDMA client and server will be built. + + Build a new kernel, install it, boot it. + +Check RDMA and NFS Setup +~~~~~~~~~~~~~~~~~~~~~~~~ + + Before configuring the NFS/RDMA software, it is a good idea to test + your new kernel to ensure that the kernel is working correctly. + In particular, it is a good idea to verify that the RDMA stack + is functioning as expected and standard NFS over TCP/IP and/or UDP/IP + is working properly. + + - Check RDMA Setup + + If you built the RDMA components as modules, load them at + this time. For example, if you are using a Mellanox Tavor/Sinai/Arbel + card: + + $ modprobe ib_mthca + $ modprobe ib_ipoib + + If you are using InfiniBand, make sure there is a Subnet Manager (SM) + running on the network. If your IB switch has an embedded SM, you can + use it. Otherwise, you will need to run an SM, such as OpenSM, on one + of your end nodes. + + If an SM is running on your network, you should see the following: + + $ cat /sys/class/infiniband/driverX/ports/1/state + 4: ACTIVE + + where driverX is mthca0, ipath5, ehca3, etc. + + To further test the InfiniBand software stack, use IPoIB (this + assumes you have two IB hosts named host1 and host2): + + host1$ ifconfig ib0 a.b.c.x + host2$ ifconfig ib0 a.b.c.y + host1$ ping a.b.c.y + host2$ ping a.b.c.x + + For other device types, follow the appropriate procedures. + + - Check NFS Setup + + For the NFS components enabled above (client and/or server), + test their functionality over standard Ethernet using TCP/IP or UDP/IP. + +NFS/RDMA Setup +~~~~~~~~~~~~~~ + + We recommend that you use two machines, one to act as the client and + one to act as the server. + + One time configuration: + + - On the server system, configure the /etc/exports file and + start the NFS/RDMA server. + + Exports entries with the following formats have been tested: + + /vol0 192.168.0.47(fsid=0,rw,async,insecure,no_root_squash) + /vol0 192.168.0.0/255.255.255.0(fsid=0,rw,async,insecure,no_root_squash) + + The IP address(es) is(are) the client's IPoIB address for an InfiniBand + HCA or the cleint's iWARP address(es) for an RNIC. + + NOTE: The "insecure" option must be used because the NFS/RDMA client does + not use a reserved port. + + Each time a machine boots: + + - Load and configure the RDMA drivers + + For InfiniBand using a Mellanox adapter: + + $ modprobe ib_mthca + $ modprobe ib_ipoib + $ ifconfig ib0 a.b.c.d + + NOTE: use unique addresses for the client and server + + - Start the NFS server + + If the NFS/RDMA server was built as a module (CONFIG_SUNRPC_XPRT_RDMA=m in + kernel config), load the RDMA transport module: + + $ modprobe svcrdma + + Regardless of how the server was built (module or built-in), start the + server: + + $ /etc/init.d/nfs start + + or + + $ service nfs start + + Instruct the server to listen on the RDMA transport: + + $ echo rdma 20049 > /proc/fs/nfsd/portlist + + - On the client system + + If the NFS/RDMA client was built as a module (CONFIG_SUNRPC_XPRT_RDMA=m in + kernel config), load the RDMA client module: + + $ modprobe xprtrdma.ko + + Regardless of how the client was built (module or built-in), use this + command to mount the NFS/RDMA server: + + $ mount -o rdma,port=20049 <IPoIB-server-name-or-address>:/<export> /mnt + + To verify that the mount is using RDMA, run "cat /proc/mounts" and check + the "proto" field for the given mount. + + Congratulations! You're using NFS/RDMA! diff --git a/Documentation/filesystems/nfs/nfs.txt b/Documentation/filesystems/nfs/nfs.txt new file mode 100644 index 00000000000..f2571c8bef7 --- /dev/null +++ b/Documentation/filesystems/nfs/nfs.txt @@ -0,0 +1,136 @@ + +The NFS client +============== + +The NFS version 2 protocol was first documented in RFC1094 (March 1989). +Since then two more major releases of NFS have been published, with NFSv3 +being documented in RFC1813 (June 1995), and NFSv4 in RFC3530 (April +2003). + +The Linux NFS client currently supports all the above published versions, +and work is in progress on adding support for minor version 1 of the NFSv4 +protocol. + +The purpose of this document is to provide information on some of the +special features of the NFS client that can be configured by system +administrators. + + +The nfs4_unique_id parameter +============================ + +NFSv4 requires clients to identify themselves to servers with a unique +string. File open and lock state shared between one client and one server +is associated with this identity. To support robust NFSv4 state recovery +and transparent state migration, this identity string must not change +across client reboots. + +Without any other intervention, the Linux client uses a string that contains +the local system's node name. System administrators, however, often do not +take care to ensure that node names are fully qualified and do not change +over the lifetime of a client system. Node names can have other +administrative requirements that require particular behavior that does not +work well as part of an nfs_client_id4 string. + +The nfs.nfs4_unique_id boot parameter specifies a unique string that can be +used instead of a system's node name when an NFS client identifies itself to +a server. Thus, if the system's node name is not unique, or it changes, its +nfs.nfs4_unique_id stays the same, preventing collision with other clients +or loss of state during NFS reboot recovery or transparent state migration. + +The nfs.nfs4_unique_id string is typically a UUID, though it can contain +anything that is believed to be unique across all NFS clients. An +nfs4_unique_id string should be chosen when a client system is installed, +just as a system's root file system gets a fresh UUID in its label at +install time. + +The string should remain fixed for the lifetime of the client. It can be +changed safely if care is taken that the client shuts down cleanly and all +outstanding NFSv4 state has expired, to prevent loss of NFSv4 state. + +This string can be stored in an NFS client's grub.conf, or it can be provided +via a net boot facility such as PXE. It may also be specified as an nfs.ko +module parameter. Specifying a uniquifier string is not support for NFS +clients running in containers. + + +The DNS resolver +================ + +NFSv4 allows for one server to refer the NFS client to data that has been +migrated onto another server by means of the special "fs_locations" +attribute. See + http://tools.ietf.org/html/rfc3530#section-6 +and + http://tools.ietf.org/html/draft-ietf-nfsv4-referrals-00 + +The fs_locations information can take the form of either an ip address and +a path, or a DNS hostname and a path. The latter requires the NFS client to +do a DNS lookup in order to mount the new volume, and hence the need for an +upcall to allow userland to provide this service. + +Assuming that the user has the 'rpc_pipefs' filesystem mounted in the usual +/var/lib/nfs/rpc_pipefs, the upcall consists of the following steps: + + (1) The process checks the dns_resolve cache to see if it contains a + valid entry. If so, it returns that entry and exits. + + (2) If no valid entry exists, the helper script '/sbin/nfs_cache_getent' + (may be changed using the 'nfs.cache_getent' kernel boot parameter) + is run, with two arguments: + - the cache name, "dns_resolve" + - the hostname to resolve + + (3) After looking up the corresponding ip address, the helper script + writes the result into the rpc_pipefs pseudo-file + '/var/lib/nfs/rpc_pipefs/cache/dns_resolve/channel' + in the following (text) format: + + "<ip address> <hostname> <ttl>\n" + + Where <ip address> is in the usual IPv4 (123.456.78.90) or IPv6 + (ffee:ddcc:bbaa:9988:7766:5544:3322:1100, ffee::1100, ...) format. + <hostname> is identical to the second argument of the helper + script, and <ttl> is the 'time to live' of this cache entry (in + units of seconds). + + Note: If <ip address> is invalid, say the string "0", then a negative + entry is created, which will cause the kernel to treat the hostname + as having no valid DNS translation. + + + + +A basic sample /sbin/nfs_cache_getent +===================================== + +#!/bin/bash +# +ttl=600 +# +cut=/usr/bin/cut +getent=/usr/bin/getent +rpc_pipefs=/var/lib/nfs/rpc_pipefs +# +die() +{ + echo "Usage: $0 cache_name entry_name" + exit 1 +} + +[ $# -lt 2 ] && die +cachename="$1" +cache_path=${rpc_pipefs}/cache/${cachename}/channel + +case "${cachename}" in + dns_resolve) + name="$2" + result="$(${getent} hosts ${name} | ${cut} -f1 -d\ )" + [ -z "${result}" ] && result="0" + ;; + *) + die + ;; +esac +echo "${result} ${name} ${ttl}" >${cache_path} + diff --git a/Documentation/filesystems/nfs/nfs41-server.txt b/Documentation/filesystems/nfs/nfs41-server.txt new file mode 100644 index 00000000000..c49cd7e796e --- /dev/null +++ b/Documentation/filesystems/nfs/nfs41-server.txt @@ -0,0 +1,180 @@ +NFSv4.1 Server Implementation + +Server support for minorversion 1 can be controlled using the +/proc/fs/nfsd/versions control file. The string output returned +by reading this file will contain either "+4.1" or "-4.1" +correspondingly. + +Currently, server support for minorversion 1 is enabled by default. +It can be disabled at run time by writing the string "-4.1" to +the /proc/fs/nfsd/versions control file. Note that to write this +control file, the nfsd service must be taken down. You can use rpc.nfsd +for this; see rpc.nfsd(8). + +(Warning: older servers will interpret "+4.1" and "-4.1" as "+4" and +"-4", respectively. Therefore, code meant to work on both new and old +kernels must turn 4.1 on or off *before* turning support for version 4 +on or off; rpc.nfsd does this correctly.) + +The NFSv4 minorversion 1 (NFSv4.1) implementation in nfsd is based +on RFC 5661. + +From the many new features in NFSv4.1 the current implementation +focuses on the mandatory-to-implement NFSv4.1 Sessions, providing +"exactly once" semantics and better control and throttling of the +resources allocated for each client. + +Other NFSv4.1 features, Parallel NFS operations in particular, +are still under development out of tree. +See http://wiki.linux-nfs.org/wiki/index.php/PNFS_prototype_design +for more information. + +The table below, taken from the NFSv4.1 document, lists +the operations that are mandatory to implement (REQ), optional +(OPT), and NFSv4.0 operations that are required not to implement (MNI) +in minor version 1. The first column indicates the operations that +are not supported yet by the linux server implementation. + +The OPTIONAL features identified and their abbreviations are as follows: + pNFS Parallel NFS + FDELG File Delegations + DDELG Directory Delegations + +The following abbreviations indicate the linux server implementation status. + I Implemented NFSv4.1 operations. + NS Not Supported. + NS* unimplemented optional feature. + P pNFS features implemented out of tree. + PNS pNFS features that are not supported yet (out of tree). + +Operations + + +----------------------+------------+--------------+----------------+ + | Operation | REQ, REC, | Feature | Definition | + | | OPT, or | (REQ, REC, | | + | | MNI | or OPT) | | + +----------------------+------------+--------------+----------------+ + | ACCESS | REQ | | Section 18.1 | +I | BACKCHANNEL_CTL | REQ | | Section 18.33 | +I | BIND_CONN_TO_SESSION | REQ | | Section 18.34 | + | CLOSE | REQ | | Section 18.2 | + | COMMIT | REQ | | Section 18.3 | + | CREATE | REQ | | Section 18.4 | +I | CREATE_SESSION | REQ | | Section 18.36 | +NS*| DELEGPURGE | OPT | FDELG (REQ) | Section 18.5 | + | DELEGRETURN | OPT | FDELG, | Section 18.6 | + | | | DDELG, pNFS | | + | | | (REQ) | | +I | DESTROY_CLIENTID | REQ | | Section 18.50 | +I | DESTROY_SESSION | REQ | | Section 18.37 | +I | EXCHANGE_ID | REQ | | Section 18.35 | +I | FREE_STATEID | REQ | | Section 18.38 | + | GETATTR | REQ | | Section 18.7 | +P | GETDEVICEINFO | OPT | pNFS (REQ) | Section 18.40 | +P | GETDEVICELIST | OPT | pNFS (OPT) | Section 18.41 | + | GETFH | REQ | | Section 18.8 | +NS*| GET_DIR_DELEGATION | OPT | DDELG (REQ) | Section 18.39 | +P | LAYOUTCOMMIT | OPT | pNFS (REQ) | Section 18.42 | +P | LAYOUTGET | OPT | pNFS (REQ) | Section 18.43 | +P | LAYOUTRETURN | OPT | pNFS (REQ) | Section 18.44 | + | LINK | OPT | | Section 18.9 | + | LOCK | REQ | | Section 18.10 | + | LOCKT | REQ | | Section 18.11 | + | LOCKU | REQ | | Section 18.12 | + | LOOKUP | REQ | | Section 18.13 | + | LOOKUPP | REQ | | Section 18.14 | + | NVERIFY | REQ | | Section 18.15 | + | OPEN | REQ | | Section 18.16 | +NS*| OPENATTR | OPT | | Section 18.17 | + | OPEN_CONFIRM | MNI | | N/A | + | OPEN_DOWNGRADE | REQ | | Section 18.18 | + | PUTFH | REQ | | Section 18.19 | + | PUTPUBFH | REQ | | Section 18.20 | + | PUTROOTFH | REQ | | Section 18.21 | + | READ | REQ | | Section 18.22 | + | READDIR | REQ | | Section 18.23 | + | READLINK | OPT | | Section 18.24 | + | RECLAIM_COMPLETE | REQ | | Section 18.51 | + | RELEASE_LOCKOWNER | MNI | | N/A | + | REMOVE | REQ | | Section 18.25 | + | RENAME | REQ | | Section 18.26 | + | RENEW | MNI | | N/A | + | RESTOREFH | REQ | | Section 18.27 | + | SAVEFH | REQ | | Section 18.28 | + | SECINFO | REQ | | Section 18.29 | +I | SECINFO_NO_NAME | REC | pNFS files | Section 18.45, | + | | | layout (REQ) | Section 13.12 | +I | SEQUENCE | REQ | | Section 18.46 | + | SETATTR | REQ | | Section 18.30 | + | SETCLIENTID | MNI | | N/A | + | SETCLIENTID_CONFIRM | MNI | | N/A | +NS | SET_SSV | REQ | | Section 18.47 | +I | TEST_STATEID | REQ | | Section 18.48 | + | VERIFY | REQ | | Section 18.31 | +NS*| WANT_DELEGATION | OPT | FDELG (OPT) | Section 18.49 | + | WRITE | REQ | | Section 18.32 | + +Callback Operations + + +-------------------------+-----------+-------------+---------------+ + | Operation | REQ, REC, | Feature | Definition | + | | OPT, or | (REQ, REC, | | + | | MNI | or OPT) | | + +-------------------------+-----------+-------------+---------------+ + | CB_GETATTR | OPT | FDELG (REQ) | Section 20.1 | +P | CB_LAYOUTRECALL | OPT | pNFS (REQ) | Section 20.3 | +NS*| CB_NOTIFY | OPT | DDELG (REQ) | Section 20.4 | +P | CB_NOTIFY_DEVICEID | OPT | pNFS (OPT) | Section 20.12 | +NS*| CB_NOTIFY_LOCK | OPT | | Section 20.11 | +NS*| CB_PUSH_DELEG | OPT | FDELG (OPT) | Section 20.5 | + | CB_RECALL | OPT | FDELG, | Section 20.2 | + | | | DDELG, pNFS | | + | | | (REQ) | | +NS*| CB_RECALL_ANY | OPT | FDELG, | Section 20.6 | + | | | DDELG, pNFS | | + | | | (REQ) | | +NS | CB_RECALL_SLOT | REQ | | Section 20.8 | +NS*| CB_RECALLABLE_OBJ_AVAIL | OPT | DDELG, pNFS | Section 20.7 | + | | | (REQ) | | +I | CB_SEQUENCE | OPT | FDELG, | Section 20.9 | + | | | DDELG, pNFS | | + | | | (REQ) | | +NS*| CB_WANTS_CANCELLED | OPT | FDELG, | Section 20.10 | + | | | DDELG, pNFS | | + | | | (REQ) | | + +-------------------------+-----------+-------------+---------------+ + +Implementation notes: + +SSV: +* The spec claims this is mandatory, but we don't actually know of any + implementations, so we're ignoring it for now. The server returns + NFS4ERR_ENCR_ALG_UNSUPP on EXCHANGE_ID, which should be future-proof. + +GSS on the backchannel: +* Again, theoretically required but not widely implemented (in + particular, the current Linux client doesn't request it). We return + NFS4ERR_ENCR_ALG_UNSUPP on CREATE_SESSION. + +DELEGPURGE: +* mandatory only for servers that support CLAIM_DELEGATE_PREV and/or + CLAIM_DELEG_PREV_FH (which allows clients to keep delegations that + persist across client reboots). Thus we need not implement this for + now. + +EXCHANGE_ID: +* implementation ids are ignored + +CREATE_SESSION: +* backchannel attributes are ignored + +SEQUENCE: +* no support for dynamic slot table renegotiation (optional) + +Nonstandard compound limitations: +* No support for a sessions fore channel RPC compound that requires both a + ca_maxrequestsize request and a ca_maxresponsesize reply, so we may + fail to live up to the promise we made in CREATE_SESSION fore channel + negotiation. + +See also http://wiki.linux-nfs.org/wiki/index.php/Server_4.0_and_4.1_issues. diff --git a/Documentation/filesystems/nfs/nfsd-admin-interfaces.txt b/Documentation/filesystems/nfs/nfsd-admin-interfaces.txt new file mode 100644 index 00000000000..56a96fb08a7 --- /dev/null +++ b/Documentation/filesystems/nfs/nfsd-admin-interfaces.txt @@ -0,0 +1,41 @@ +Administrative interfaces for nfsd +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Note that normally these interfaces are used only by the utilities in +nfs-utils. + +nfsd is controlled mainly by pseudofiles under the "nfsd" filesystem, +which is normally mounted at /proc/fs/nfsd/. + +The server is always started by the first write of a nonzero value to +nfsd/threads. + +Before doing that, NFSD can be told which sockets to listen on by +writing to nfsd/portlist; that write may be: + + - an ascii-encoded file descriptor, which should refer to a + bound (and listening, for tcp) socket, or + - "transportname port", where transportname is currently either + "udp", "tcp", or "rdma". + +If nfsd is started without doing any of these, then it will create one +udp and one tcp listener at port 2049 (see nfsd_init_socks). + +On startup, nfsd and lockd grace periods start. + +nfsd is shut down by a write of 0 to nfsd/threads. All locks and state +are thrown away at that point. + +Between startup and shutdown, the number of threads may be adjusted up +or down by additional writes to nfsd/threads or by writes to +nfsd/pool_threads. + +For more detail about files under nfsd/ and what they control, see +fs/nfsd/nfsctl.c; most of them have detailed comments. + +Implementation notes +^^^^^^^^^^^^^^^^^^^^ + +Note that the rpc server requires the caller to serialize addition and +removal of listening sockets, and startup and shutdown of the server. +For nfsd this is done using nfsd_mutex. diff --git a/Documentation/filesystems/nfs/nfsroot.txt b/Documentation/filesystems/nfs/nfsroot.txt new file mode 100644 index 00000000000..2d66ed68812 --- /dev/null +++ b/Documentation/filesystems/nfs/nfsroot.txt @@ -0,0 +1,302 @@ +Mounting the root filesystem via NFS (nfsroot) +=============================================== + +Written 1996 by Gero Kuhlmann <gero@gkminix.han.de> +Updated 1997 by Martin Mares <mj@atrey.karlin.mff.cuni.cz> +Updated 2006 by Nico Schottelius <nico-kernel-nfsroot@schottelius.org> +Updated 2006 by Horms <horms@verge.net.au> + + + +In order to use a diskless system, such as an X-terminal or printer server +for example, it is necessary for the root filesystem to be present on a +non-disk device. This may be an initramfs (see Documentation/filesystems/ +ramfs-rootfs-initramfs.txt), a ramdisk (see Documentation/initrd.txt) or a +filesystem mounted via NFS. The following text describes on how to use NFS +for the root filesystem. For the rest of this text 'client' means the +diskless system, and 'server' means the NFS server. + + + + +1.) Enabling nfsroot capabilities + ----------------------------- + +In order to use nfsroot, NFS client support needs to be selected as +built-in during configuration. Once this has been selected, the nfsroot +option will become available, which should also be selected. + +In the networking options, kernel level autoconfiguration can be selected, +along with the types of autoconfiguration to support. Selecting all of +DHCP, BOOTP and RARP is safe. + + + + +2.) Kernel command line + ------------------- + +When the kernel has been loaded by a boot loader (see below) it needs to be +told what root fs device to use. And in the case of nfsroot, where to find +both the server and the name of the directory on the server to mount as root. +This can be established using the following kernel command line parameters: + + +root=/dev/nfs + + This is necessary to enable the pseudo-NFS-device. Note that it's not a + real device but just a synonym to tell the kernel to use NFS instead of + a real device. + + +nfsroot=[<server-ip>:]<root-dir>[,<nfs-options>] + + If the `nfsroot' parameter is NOT given on the command line, + the default "/tftpboot/%s" will be used. + + <server-ip> Specifies the IP address of the NFS server. + The default address is determined by the `ip' parameter + (see below). This parameter allows the use of different + servers for IP autoconfiguration and NFS. + + <root-dir> Name of the directory on the server to mount as root. + If there is a "%s" token in the string, it will be + replaced by the ASCII-representation of the client's + IP address. + + <nfs-options> Standard NFS options. All options are separated by commas. + The following defaults are used: + port = as given by server portmap daemon + rsize = 4096 + wsize = 4096 + timeo = 7 + retrans = 3 + acregmin = 3 + acregmax = 60 + acdirmin = 30 + acdirmax = 60 + flags = hard, nointr, noposix, cto, ac + + +ip=<client-ip>:<server-ip>:<gw-ip>:<netmask>:<hostname>:<device>:<autoconf>: + <dns0-ip>:<dns1-ip> + + This parameter tells the kernel how to configure IP addresses of devices + and also how to set up the IP routing table. It was originally called + `nfsaddrs', but now the boot-time IP configuration works independently of + NFS, so it was renamed to `ip' and the old name remained as an alias for + compatibility reasons. + + If this parameter is missing from the kernel command line, all fields are + assumed to be empty, and the defaults mentioned below apply. In general + this means that the kernel tries to configure everything using + autoconfiguration. + + The <autoconf> parameter can appear alone as the value to the `ip' + parameter (without all the ':' characters before). If the value is + "ip=off" or "ip=none", no autoconfiguration will take place, otherwise + autoconfiguration will take place. The most common way to use this + is "ip=dhcp". + + <client-ip> IP address of the client. + + Default: Determined using autoconfiguration. + + <server-ip> IP address of the NFS server. If RARP is used to determine + the client address and this parameter is NOT empty only + replies from the specified server are accepted. + + Only required for NFS root. That is autoconfiguration + will not be triggered if it is missing and NFS root is not + in operation. + + Default: Determined using autoconfiguration. + The address of the autoconfiguration server is used. + + <gw-ip> IP address of a gateway if the server is on a different subnet. + + Default: Determined using autoconfiguration. + + <netmask> Netmask for local network interface. If unspecified + the netmask is derived from the client IP address assuming + classful addressing. + + Default: Determined using autoconfiguration. + + <hostname> Name of the client. May be supplied by autoconfiguration, + but its absence will not trigger autoconfiguration. + If specified and DHCP is used, the user provided hostname will + be carried in the DHCP request to hopefully update DNS record. + + Default: Client IP address is used in ASCII notation. + + <device> Name of network device to use. + + Default: If the host only has one device, it is used. + Otherwise the device is determined using + autoconfiguration. This is done by sending + autoconfiguration requests out of all devices, + and using the device that received the first reply. + + <autoconf> Method to use for autoconfiguration. In the case of options + which specify multiple autoconfiguration protocols, + requests are sent using all protocols, and the first one + to reply is used. + + Only autoconfiguration protocols that have been compiled + into the kernel will be used, regardless of the value of + this option. + + off or none: don't use autoconfiguration + (do static IP assignment instead) + on or any: use any protocol available in the kernel + (default) + dhcp: use DHCP + bootp: use BOOTP + rarp: use RARP + both: use both BOOTP and RARP but not DHCP + (old option kept for backwards compatibility) + + Default: any + + <dns0-ip> IP address of first nameserver. + Value gets exported by /proc/net/pnp which is often linked + on embedded systems by /etc/resolv.conf. + + <dns1-ip> IP address of secound nameserver. + Same as above. + + +nfsrootdebug + + This parameter enables debugging messages to appear in the kernel + log at boot time so that administrators can verify that the correct + NFS mount options, server address, and root path are passed to the + NFS client. + + +rdinit=<executable file> + + To specify which file contains the program that starts system + initialization, administrators can use this command line parameter. + The default value of this parameter is "/init". If the specified + file exists and the kernel can execute it, root filesystem related + kernel command line parameters, including `nfsroot=', are ignored. + + A description of the process of mounting the root file system can be + found in: + + Documentation/early-userspace/README + + + + +3.) Boot Loader + ---------- + +To get the kernel into memory different approaches can be used. +They depend on various facilities being available: + + +3.1) Booting from a floppy using syslinux + + When building kernels, an easy way to create a boot floppy that uses + syslinux is to use the zdisk or bzdisk make targets which use zimage + and bzimage images respectively. Both targets accept the + FDARGS parameter which can be used to set the kernel command line. + + e.g. + make bzdisk FDARGS="root=/dev/nfs" + + Note that the user running this command will need to have + access to the floppy drive device, /dev/fd0 + + For more information on syslinux, including how to create bootdisks + for prebuilt kernels, see http://syslinux.zytor.com/ + + N.B: Previously it was possible to write a kernel directly to + a floppy using dd, configure the boot device using rdev, and + boot using the resulting floppy. Linux no longer supports this + method of booting. + +3.2) Booting from a cdrom using isolinux + + When building kernels, an easy way to create a bootable cdrom that + uses isolinux is to use the isoimage target which uses a bzimage + image. Like zdisk and bzdisk, this target accepts the FDARGS + parameter which can be used to set the kernel command line. + + e.g. + make isoimage FDARGS="root=/dev/nfs" + + The resulting iso image will be arch/<ARCH>/boot/image.iso + This can be written to a cdrom using a variety of tools including + cdrecord. + + e.g. + cdrecord dev=ATAPI:1,0,0 arch/x86/boot/image.iso + + For more information on isolinux, including how to create bootdisks + for prebuilt kernels, see http://syslinux.zytor.com/ + +3.2) Using LILO + When using LILO all the necessary command line parameters may be + specified using the 'append=' directive in the LILO configuration + file. + + However, to use the 'root=' directive you also need to create + a dummy root device, which may be removed after LILO is run. + + mknod /dev/boot255 c 0 255 + + For information on configuring LILO, please refer to its documentation. + +3.3) Using GRUB + When using GRUB, kernel parameter are simply appended after the kernel + specification: kernel <kernel> <parameters> + +3.4) Using loadlin + loadlin may be used to boot Linux from a DOS command prompt without + requiring a local hard disk to mount as root. This has not been + thoroughly tested by the authors of this document, but in general + it should be possible configure the kernel command line similarly + to the configuration of LILO. + + Please refer to the loadlin documentation for further information. + +3.5) Using a boot ROM + This is probably the most elegant way of booting a diskless client. + With a boot ROM the kernel is loaded using the TFTP protocol. The + authors of this document are not aware of any no commercial boot + ROMs that support booting Linux over the network. However, there + are two free implementations of a boot ROM, netboot-nfs and + etherboot, both of which are available on sunsite.unc.edu, and both + of which contain everything you need to boot a diskless Linux client. + +3.6) Using pxelinux + Pxelinux may be used to boot linux using the PXE boot loader + which is present on many modern network cards. + + When using pxelinux, the kernel image is specified using + "kernel <relative-path-below /tftpboot>". The nfsroot parameters + are passed to the kernel by adding them to the "append" line. + It is common to use serial console in conjunction with pxeliunx, + see Documentation/serial-console.txt for more information. + + For more information on isolinux, including how to create bootdisks + for prebuilt kernels, see http://syslinux.zytor.com/ + + + + +4.) Credits + ------- + + The nfsroot code in the kernel and the RARP support have been written + by Gero Kuhlmann <gero@gkminix.han.de>. + + The rest of the IP layer autoconfiguration code has been written + by Martin Mares <mj@atrey.karlin.mff.cuni.cz>. + + In order to write the initial version of nfsroot I would like to thank + Jens-Uwe Mager <jum@anubis.han.de> for his help. diff --git a/Documentation/filesystems/nfs/pnfs.txt b/Documentation/filesystems/nfs/pnfs.txt new file mode 100644 index 00000000000..adc81a35fe2 --- /dev/null +++ b/Documentation/filesystems/nfs/pnfs.txt @@ -0,0 +1,109 @@ +Reference counting in pnfs: +========================== + +The are several inter-related caches. We have layouts which can +reference multiple devices, each of which can reference multiple data servers. +Each data server can be referenced by multiple devices. Each device +can be referenced by multiple layouts. To keep all of this straight, +we need to reference count. + + +struct pnfs_layout_hdr +---------------------- +The on-the-wire command LAYOUTGET corresponds to struct +pnfs_layout_segment, usually referred to by the variable name lseg. +Each nfs_inode may hold a pointer to a cache of these layout +segments in nfsi->layout, of type struct pnfs_layout_hdr. + +We reference the header for the inode pointing to it, across each +outstanding RPC call that references it (LAYOUTGET, LAYOUTRETURN, +LAYOUTCOMMIT), and for each lseg held within. + +Each header is also (when non-empty) put on a list associated with +struct nfs_client (cl_layouts). Being put on this list does not bump +the reference count, as the layout is kept around by the lseg that +keeps it in the list. + +deviceid_cache +-------------- +lsegs reference device ids, which are resolved per nfs_client and +layout driver type. The device ids are held in a RCU cache (struct +nfs4_deviceid_cache). The cache itself is referenced across each +mount. The entries (struct nfs4_deviceid) themselves are held across +the lifetime of each lseg referencing them. + +RCU is used because the deviceid is basically a write once, read many +data structure. The hlist size of 32 buckets needs better +justification, but seems reasonable given that we can have multiple +deviceid's per filesystem, and multiple filesystems per nfs_client. + +The hash code is copied from the nfsd code base. A discussion of +hashing and variations of this algorithm can be found at: +http://groups.google.com/group/comp.lang.c/browse_thread/thread/9522965e2b8d3809 + +data server cache +----------------- +file driver devices refer to data servers, which are kept in a module +level cache. Its reference is held over the lifetime of the deviceid +pointing to it. + +lseg +---- +lseg maintains an extra reference corresponding to the NFS_LSEG_VALID +bit which holds it in the pnfs_layout_hdr's list. When the final lseg +is removed from the pnfs_layout_hdr's list, the NFS_LAYOUT_DESTROYED +bit is set, preventing any new lsegs from being added. + +layout drivers +-------------- + +PNFS utilizes what is called layout drivers. The STD defines 3 basic +layout types: "files" "objects" and "blocks". For each of these types +there is a layout-driver with a common function-vectors table which +are called by the nfs-client pnfs-core to implement the different layout +types. + +Files-layout-driver code is in: fs/nfs/nfs4filelayout.c && nfs4filelayoutdev.c +Objects-layout-deriver code is in: fs/nfs/objlayout/.. directory +Blocks-layout-deriver code is in: fs/nfs/blocklayout/.. directory + +objects-layout setup +-------------------- + +As part of the full STD implementation the objlayoutdriver.ko needs, at times, +to automatically login to yet undiscovered iscsi/osd devices. For this the +driver makes up-calles to a user-mode script called *osd_login* + +The path_name of the script to use is by default: + /sbin/osd_login. +This name can be overridden by the Kernel module parameter: + objlayoutdriver.osd_login_prog + +If Kernel does not find the osd_login_prog path it will zero it out +and will not attempt farther logins. An admin can then write new value +to the objlayoutdriver.osd_login_prog Kernel parameter to re-enable it. + +The /sbin/osd_login is part of the nfs-utils package, and should usually +be installed on distributions that support this Kernel version. + +The API to the login script is as follows: + Usage: $0 -u <URI> -o <OSDNAME> -s <SYSTEMID> + Options: + -u target uri e.g. iscsi://<ip>:<port> + (allways exists) + (More protocols can be defined in the future. + The client does not interpret this string it is + passed unchanged as received from the Server) + -o osdname of the requested target OSD + (Might be empty) + (A string which denotes the OSD name, there is a + limit of 64 chars on this string) + -s systemid of the requested target OSD + (Might be empty) + (This string, if not empty is always an hex + representation of the 20 bytes osd_system_id) + +blocks-layout setup +------------------- + +TODO: Document the setup needs of the blocks layout driver diff --git a/Documentation/filesystems/nfs/rpc-cache.txt b/Documentation/filesystems/nfs/rpc-cache.txt new file mode 100644 index 00000000000..ebcaaee2161 --- /dev/null +++ b/Documentation/filesystems/nfs/rpc-cache.txt @@ -0,0 +1,202 @@ + This document gives a brief introduction to the caching +mechanisms in the sunrpc layer that is used, in particular, +for NFS authentication. + +CACHES +====== +The caching replaces the old exports table and allows for +a wide variety of values to be caches. + +There are a number of caches that are similar in structure though +quite possibly very different in content and use. There is a corpus +of common code for managing these caches. + +Examples of caches that are likely to be needed are: + - mapping from IP address to client name + - mapping from client name and filesystem to export options + - mapping from UID to list of GIDs, to work around NFS's limitation + of 16 gids. + - mappings between local UID/GID and remote UID/GID for sites that + do not have uniform uid assignment + - mapping from network identify to public key for crypto authentication. + +The common code handles such things as: + - general cache lookup with correct locking + - supporting 'NEGATIVE' as well as positive entries + - allowing an EXPIRED time on cache items, and removing + items after they expire, and are no longer in-use. + - making requests to user-space to fill in cache entries + - allowing user-space to directly set entries in the cache + - delaying RPC requests that depend on as-yet incomplete + cache entries, and replaying those requests when the cache entry + is complete. + - clean out old entries as they expire. + +Creating a Cache +---------------- + +1/ A cache needs a datum to store. This is in the form of a + structure definition that must contain a + struct cache_head + as an element, usually the first. + It will also contain a key and some content. + Each cache element is reference counted and contains + expiry and update times for use in cache management. +2/ A cache needs a "cache_detail" structure that + describes the cache. This stores the hash table, some + parameters for cache management, and some operations detailing how + to work with particular cache items. + The operations requires are: + struct cache_head *alloc(void) + This simply allocates appropriate memory and returns + a pointer to the cache_detail embedded within the + structure + void cache_put(struct kref *) + This is called when the last reference to an item is + dropped. The pointer passed is to the 'ref' field + in the cache_head. cache_put should release any + references create by 'cache_init' and, if CACHE_VALID + is set, any references created by cache_update. + It should then release the memory allocated by + 'alloc'. + int match(struct cache_head *orig, struct cache_head *new) + test if the keys in the two structures match. Return + 1 if they do, 0 if they don't. + void init(struct cache_head *orig, struct cache_head *new) + Set the 'key' fields in 'new' from 'orig'. This may + include taking references to shared objects. + void update(struct cache_head *orig, struct cache_head *new) + Set the 'content' fileds in 'new' from 'orig'. + int cache_show(struct seq_file *m, struct cache_detail *cd, + struct cache_head *h) + Optional. Used to provide a /proc file that lists the + contents of a cache. This should show one item, + usually on just one line. + int cache_request(struct cache_detail *cd, struct cache_head *h, + char **bpp, int *blen) + Format a request to be send to user-space for an item + to be instantiated. *bpp is a buffer of size *blen. + bpp should be moved forward over the encoded message, + and *blen should be reduced to show how much free + space remains. Return 0 on success or <0 if not + enough room or other problem. + int cache_parse(struct cache_detail *cd, char *buf, int len) + A message from user space has arrived to fill out a + cache entry. It is in 'buf' of length 'len'. + cache_parse should parse this, find the item in the + cache with sunrpc_cache_lookup, and update the item + with sunrpc_cache_update. + + +3/ A cache needs to be registered using cache_register(). This + includes it on a list of caches that will be regularly + cleaned to discard old data. + +Using a cache +------------- + +To find a value in a cache, call sunrpc_cache_lookup passing a pointer +to the cache_head in a sample item with the 'key' fields filled in. +This will be passed to ->match to identify the target entry. If no +entry is found, a new entry will be create, added to the cache, and +marked as not containing valid data. + +The item returned is typically passed to cache_check which will check +if the data is valid, and may initiate an up-call to get fresh data. +cache_check will return -ENOENT in the entry is negative or if an up +call is needed but not possible, -EAGAIN if an upcall is pending, +or 0 if the data is valid; + +cache_check can be passed a "struct cache_req *". This structure is +typically embedded in the actual request and can be used to create a +deferred copy of the request (struct cache_deferred_req). This is +done when the found cache item is not uptodate, but the is reason to +believe that userspace might provide information soon. When the cache +item does become valid, the deferred copy of the request will be +revisited (->revisit). It is expected that this method will +reschedule the request for processing. + +The value returned by sunrpc_cache_lookup can also be passed to +sunrpc_cache_update to set the content for the item. A second item is +passed which should hold the content. If the item found by _lookup +has valid data, then it is discarded and a new item is created. This +saves any user of an item from worrying about content changing while +it is being inspected. If the item found by _lookup does not contain +valid data, then the content is copied across and CACHE_VALID is set. + +Populating a cache +------------------ + +Each cache has a name, and when the cache is registered, a directory +with that name is created in /proc/net/rpc + +This directory contains a file called 'channel' which is a channel +for communicating between kernel and user for populating the cache. +This directory may later contain other files of interacting +with the cache. + +The 'channel' works a bit like a datagram socket. Each 'write' is +passed as a whole to the cache for parsing and interpretation. +Each cache can treat the write requests differently, but it is +expected that a message written will contain: + - a key + - an expiry time + - a content. +with the intention that an item in the cache with the give key +should be create or updated to have the given content, and the +expiry time should be set on that item. + +Reading from a channel is a bit more interesting. When a cache +lookup fails, or when it succeeds but finds an entry that may soon +expire, a request is lodged for that cache item to be updated by +user-space. These requests appear in the channel file. + +Successive reads will return successive requests. +If there are no more requests to return, read will return EOF, but a +select or poll for read will block waiting for another request to be +added. + +Thus a user-space helper is likely to: + open the channel. + select for readable + read a request + write a response + loop. + +If it dies and needs to be restarted, any requests that have not been +answered will still appear in the file and will be read by the new +instance of the helper. + +Each cache should define a "cache_parse" method which takes a message +written from user-space and processes it. It should return an error +(which propagates back to the write syscall) or 0. + +Each cache should also define a "cache_request" method which +takes a cache item and encodes a request into the buffer +provided. + +Note: If a cache has no active readers on the channel, and has had not +active readers for more than 60 seconds, further requests will not be +added to the channel but instead all lookups that do not find a valid +entry will fail. This is partly for backward compatibility: The +previous nfs exports table was deemed to be authoritative and a +failed lookup meant a definite 'no'. + +request/response format +----------------------- + +While each cache is free to use its own format for requests +and responses over channel, the following is recommended as +appropriate and support routines are available to help: +Each request or response record should be printable ASCII +with precisely one newline character which should be at the end. +Fields within the record should be separated by spaces, normally one. +If spaces, newlines, or nul characters are needed in a field they +much be quoted. two mechanisms are available: +1/ If a field begins '\x' then it must contain an even number of + hex digits, and pairs of these digits provide the bytes in the + field. +2/ otherwise a \ in the field must be followed by 3 octal digits + which give the code for a byte. Other characters are treated + as them selves. At the very least, space, newline, nul, and + '\' must be quoted in this way. diff --git a/Documentation/filesystems/nfs/rpc-server-gss.txt b/Documentation/filesystems/nfs/rpc-server-gss.txt new file mode 100644 index 00000000000..716f4be8e8b --- /dev/null +++ b/Documentation/filesystems/nfs/rpc-server-gss.txt @@ -0,0 +1,91 @@ + +rpcsec_gss support for kernel RPC servers +========================================= + +This document gives references to the standards and protocols used to +implement RPCGSS authentication in kernel RPC servers such as the NFS +server and the NFS client's NFSv4.0 callback server. (But note that +NFSv4.1 and higher don't require the client to act as a server for the +purposes of authentication.) + +RPCGSS is specified in a few IETF documents: + - RFC2203 v1: http://tools.ietf.org/rfc/rfc2203.txt + - RFC5403 v2: http://tools.ietf.org/rfc/rfc5403.txt +and there is a 3rd version being proposed: + - http://tools.ietf.org/id/draft-williams-rpcsecgssv3.txt + (At draft n. 02 at the time of writing) + +Background +---------- + +The RPCGSS Authentication method describes a way to perform GSSAPI +Authentication for NFS. Although GSSAPI is itself completely mechanism +agnostic, in many cases only the KRB5 mechanism is supported by NFS +implementations. + +The Linux kernel, at the moment, supports only the KRB5 mechanism, and +depends on GSSAPI extensions that are KRB5 specific. + +GSSAPI is a complex library, and implementing it completely in kernel is +unwarranted. However GSSAPI operations are fundementally separable in 2 +parts: +- initial context establishment +- integrity/privacy protection (signing and encrypting of individual + packets) + +The former is more complex and policy-independent, but less +performance-sensitive. The latter is simpler and needs to be very fast. + +Therefore, we perform per-packet integrity and privacy protection in the +kernel, but leave the initial context establishment to userspace. We +need upcalls to request userspace to perform context establishment. + +NFS Server Legacy Upcall Mechanism +---------------------------------- + +The classic upcall mechanism uses a custom text based upcall mechanism +to talk to a custom daemon called rpc.svcgssd that is provide by the +nfs-utils package. + +This upcall mechanism has 2 limitations: + +A) It can handle tokens that are no bigger than 2KiB + +In some Kerberos deployment GSSAPI tokens can be quite big, up and +beyond 64KiB in size due to various authorization extensions attacked to +the Kerberos tickets, that needs to be sent through the GSS layer in +order to perform context establishment. + +B) It does not properly handle creds where the user is member of more +than a few housand groups (the current hard limit in the kernel is 65K +groups) due to limitation on the size of the buffer that can be send +back to the kernel (4KiB). + +NFS Server New RPC Upcall Mechanism +----------------------------------- + +The newer upcall mechanism uses RPC over a unix socket to a daemon +called gss-proxy, implemented by a userspace program called Gssproxy. + +The gss_proxy RPC protocol is currently documented here: + + https://fedorahosted.org/gss-proxy/wiki/ProtocolDocumentation + +This upcall mechanism uses the kernel rpc client and connects to the gssproxy +userspace program over a regular unix socket. The gssproxy protocol does not +suffer from the size limitations of the legacy protocol. + +Negotiating Upcall Mechanisms +----------------------------- + +To provide backward compatibility, the kernel defaults to using the +legacy mechanism. To switch to the new mechanism, gss-proxy must bind +to /var/run/gssproxy.sock and then write "1" to +/proc/net/rpc/use-gss-proxy. If gss-proxy dies, it must repeat both +steps. + +Once the upcall mechanism is chosen, it cannot be changed. To prevent +locking into the legacy mechanisms, the above steps must be performed +before starting nfsd. Whoever starts nfsd can guarantee this by reading +from /proc/net/rpc/use-gss-proxy and checking that it contains a +"1"--the read will block until gss-proxy has done its write to the file. diff --git a/Documentation/filesystems/nilfs2.txt b/Documentation/filesystems/nilfs2.txt new file mode 100644 index 00000000000..41c3d332acc --- /dev/null +++ b/Documentation/filesystems/nilfs2.txt @@ -0,0 +1,270 @@ +NILFS2 +------ + +NILFS2 is a log-structured file system (LFS) supporting continuous +snapshotting. In addition to versioning capability of the entire file +system, users can even restore files mistakenly overwritten or +destroyed just a few seconds ago. Since NILFS2 can keep consistency +like conventional LFS, it achieves quick recovery after system +crashes. + +NILFS2 creates a number of checkpoints every few seconds or per +synchronous write basis (unless there is no change). Users can select +significant versions among continuously created checkpoints, and can +change them into snapshots which will be preserved until they are +changed back to checkpoints. + +There is no limit on the number of snapshots until the volume gets +full. Each snapshot is mountable as a read-only file system +concurrently with its writable mount, and this feature is convenient +for online backup. + +The userland tools are included in nilfs-utils package, which is +available from the following download page. At least "mkfs.nilfs2", +"mount.nilfs2", "umount.nilfs2", and "nilfs_cleanerd" (so called +cleaner or garbage collector) are required. Details on the tools are +described in the man pages included in the package. + +Project web page: http://nilfs.sourceforge.net/ +Download page: http://nilfs.sourceforge.net/en/download.html +List info: http://vger.kernel.org/vger-lists.html#linux-nilfs + +Caveats +======= + +Features which NILFS2 does not support yet: + + - atime + - extended attributes + - POSIX ACLs + - quotas + - fsck + - defragmentation + +Mount options +============= + +NILFS2 supports the following mount options: +(*) == default + +barrier(*) This enables/disables the use of write barriers. This +nobarrier requires an IO stack which can support barriers, and + if nilfs gets an error on a barrier write, it will + disable again with a warning. +errors=continue Keep going on a filesystem error. +errors=remount-ro(*) Remount the filesystem read-only on an error. +errors=panic Panic and halt the machine if an error occurs. +cp=n Specify the checkpoint-number of the snapshot to be + mounted. Checkpoints and snapshots are listed by lscp + user command. Only the checkpoints marked as snapshot + are mountable with this option. Snapshot is read-only, + so a read-only mount option must be specified together. +order=relaxed(*) Apply relaxed order semantics that allows modified data + blocks to be written to disk without making a + checkpoint if no metadata update is going. This mode + is equivalent to the ordered data mode of the ext3 + filesystem except for the updates on data blocks still + conserve atomicity. This will improve synchronous + write performance for overwriting. +order=strict Apply strict in-order semantics that preserves sequence + of all file operations including overwriting of data + blocks. That means, it is guaranteed that no + overtaking of events occurs in the recovered file + system after a crash. +norecovery Disable recovery of the filesystem on mount. + This disables every write access on the device for + read-only mounts or snapshots. This option will fail + for r/w mounts on an unclean volume. +discard This enables/disables the use of discard/TRIM commands. +nodiscard(*) The discard/TRIM commands are sent to the underlying + block device when blocks are freed. This is useful + for SSD devices and sparse/thinly-provisioned LUNs. + +Ioctls +====== + +There is some NILFS2 specific functionality which can be accessed by applications +through the system call interfaces. The list of all NILFS2 specific ioctls are +shown in the table below. + +Table of NILFS2 specific ioctls +.............................................................................. + Ioctl Description + NILFS_IOCTL_CHANGE_CPMODE Change mode of given checkpoint between + checkpoint and snapshot state. This ioctl is + used in chcp and mkcp utilities. + + NILFS_IOCTL_DELETE_CHECKPOINT Remove checkpoint from NILFS2 file system. + This ioctl is used in rmcp utility. + + NILFS_IOCTL_GET_CPINFO Return info about requested checkpoints. This + ioctl is used in lscp utility and by + nilfs_cleanerd daemon. + + NILFS_IOCTL_GET_CPSTAT Return checkpoints statistics. This ioctl is + used by lscp, rmcp utilities and by + nilfs_cleanerd daemon. + + NILFS_IOCTL_GET_SUINFO Return segment usage info about requested + segments. This ioctl is used in lssu, + nilfs_resize utilities and by nilfs_cleanerd + daemon. + + NILFS_IOCTL_SET_SUINFO Modify segment usage info of requested + segments. This ioctl is used by + nilfs_cleanerd daemon to skip unnecessary + cleaning operation of segments and reduce + performance penalty or wear of flash device + due to redundant move of in-use blocks. + + NILFS_IOCTL_GET_SUSTAT Return segment usage statistics. This ioctl + is used in lssu, nilfs_resize utilities and + by nilfs_cleanerd daemon. + + NILFS_IOCTL_GET_VINFO Return information on virtual block addresses. + This ioctl is used by nilfs_cleanerd daemon. + + NILFS_IOCTL_GET_BDESCS Return information about descriptors of disk + block numbers. This ioctl is used by + nilfs_cleanerd daemon. + + NILFS_IOCTL_CLEAN_SEGMENTS Do garbage collection operation in the + environment of requested parameters from + userspace. This ioctl is used by + nilfs_cleanerd daemon. + + NILFS_IOCTL_SYNC Make a checkpoint. This ioctl is used in + mkcp utility. + + NILFS_IOCTL_RESIZE Resize NILFS2 volume. This ioctl is used + by nilfs_resize utility. + + NILFS_IOCTL_SET_ALLOC_RANGE Define lower limit of segments in bytes and + upper limit of segments in bytes. This ioctl + is used by nilfs_resize utility. + +NILFS2 usage +============ + +To use nilfs2 as a local file system, simply: + + # mkfs -t nilfs2 /dev/block_device + # mount -t nilfs2 /dev/block_device /dir + +This will also invoke the cleaner through the mount helper program +(mount.nilfs2). + +Checkpoints and snapshots are managed by the following commands. +Their manpages are included in the nilfs-utils package above. + + lscp list checkpoints or snapshots. + mkcp make a checkpoint or a snapshot. + chcp change an existing checkpoint to a snapshot or vice versa. + rmcp invalidate specified checkpoint(s). + +To mount a snapshot, + + # mount -t nilfs2 -r -o cp=<cno> /dev/block_device /snap_dir + +where <cno> is the checkpoint number of the snapshot. + +To unmount the NILFS2 mount point or snapshot, simply: + + # umount /dir + +Then, the cleaner daemon is automatically shut down by the umount +helper program (umount.nilfs2). + +Disk format +=========== + +A nilfs2 volume is equally divided into a number of segments except +for the super block (SB) and segment #0. A segment is the container +of logs. Each log is composed of summary information blocks, payload +blocks, and an optional super root block (SR): + + ______________________________________________________ + | |SB| | Segment | Segment | Segment | ... | Segment | | + |_|__|_|____0____|____1____|____2____|_____|____N____|_| + 0 +1K +4K +8M +16M +24M +(8MB x N) + . . (Typical offsets for 4KB-block) + . . + .______________________. + | log | log |... | log | + |__1__|__2__|____|__m__| + . . + . . + . . + .______________________________. + | Summary | Payload blocks |SR| + |_blocks__|_________________|__| + +The payload blocks are organized per file, and each file consists of +data blocks and B-tree node blocks: + + |<--- File-A --->|<--- File-B --->| + _______________________________________________________________ + | Data blocks | B-tree blocks | Data blocks | B-tree blocks | ... + _|_____________|_______________|_____________|_______________|_ + + +Since only the modified blocks are written in the log, it may have +files without data blocks or B-tree node blocks. + +The organization of the blocks is recorded in the summary information +blocks, which contains a header structure (nilfs_segment_summary), per +file structures (nilfs_finfo), and per block structures (nilfs_binfo): + + _________________________________________________________________________ + | Summary | finfo | binfo | ... | binfo | finfo | binfo | ... | binfo |... + |_blocks__|___A___|_(A,1)_|_____|(A,Na)_|___B___|_(B,1)_|_____|(B,Nb)_|___ + + +The logs include regular files, directory files, symbolic link files +and several meta data files. The mata data files are the files used +to maintain file system meta data. The current version of NILFS2 uses +the following meta data files: + + 1) Inode file (ifile) -- Stores on-disk inodes + 2) Checkpoint file (cpfile) -- Stores checkpoints + 3) Segment usage file (sufile) -- Stores allocation state of segments + 4) Data address translation file -- Maps virtual block numbers to usual + (DAT) block numbers. This file serves to + make on-disk blocks relocatable. + +The following figure shows a typical organization of the logs: + + _________________________________________________________________________ + | Summary | regular file | file | ... | ifile | cpfile | sufile | DAT |SR| + |_blocks__|_or_directory_|_______|_____|_______|________|________|_____|__| + + +To stride over segment boundaries, this sequence of files may be split +into multiple logs. The sequence of logs that should be treated as +logically one log, is delimited with flags marked in the segment +summary. The recovery code of nilfs2 looks this boundary information +to ensure atomicity of updates. + +The super root block is inserted for every checkpoints. It includes +three special inodes, inodes for the DAT, cpfile, and sufile. Inodes +of regular files, directories, symlinks and other special files, are +included in the ifile. The inode of ifile itself is included in the +corresponding checkpoint entry in the cpfile. Thus, the hierarchy +among NILFS2 files can be depicted as follows: + + Super block (SB) + | + v + Super root block (the latest cno=xx) + |-- DAT + |-- sufile + `-- cpfile + |-- ifile (cno=c1) + |-- ifile (cno=c2) ---- file (ino=i1) + : : |-- file (ino=i2) + `-- ifile (cno=xx) |-- file (ino=i3) + : : + `-- file (ino=yy) + ( regular file, directory, or symlink ) + +For detail on the format of each file, please see include/linux/nilfs2_fs.h. diff --git a/Documentation/filesystems/ntfs.txt b/Documentation/filesystems/ntfs.txt index 614de312490..61947facfc0 100644 --- a/Documentation/filesystems/ntfs.txt +++ b/Documentation/filesystems/ntfs.txt @@ -13,7 +13,7 @@ Table of contents - Using NTFS volume and stripe sets - The Device-Mapper driver - The Software RAID / MD driver - - Limitiations when using the MD driver + - Limitations when using the MD driver - ChangeLog @@ -40,10 +40,10 @@ Web site ======== There is plenty of additional information on the linux-ntfs web site -at http://linux-ntfs.sourceforge.net/ +at http://www.linux-ntfs.org/ The web site has a lot of additional information, such as a comprehensive -FAQ, documentation on the NTFS on-disk format, informaiton on the Linux-NTFS +FAQ, documentation on the NTFS on-disk format, information on the Linux-NTFS userspace utilities, etc. @@ -272,7 +272,7 @@ And you would know that /dev/hda2 has a size of 37768814 - 4209030 + 1 = For Win2k and later dynamic disks, you can for example use the ldminfo utility which is part of the Linux LDM tools (the latest version at the time of writing is linux-ldm-0.0.8.tar.bz2). You can download it from: - http://linux-ntfs.sourceforge.net/downloads.html + http://www.linux-ntfs.org/ Simply extract the downloaded archive (tar xvjf linux-ldm-0.0.8.tar.bz2), go into it (cd linux-ldm-0.0.8) and change to the test directory (cd test). You will find the precompiled (i386) ldminfo utility there. NOTE: You will not be @@ -337,7 +337,7 @@ Finally, for a mirrored volume, i.e. raid level 1, the table would look like this (note all values are in 512-byte sectors): --- cut here --- -# Ofs Size Raid Log Number Region Should Number Source Start Taget Start +# Ofs Size Raid Log Number Region Should Number Source Start Target Start # in of the type type of log size sync? of Device in Device in # vol volume params mirrors Device Device 0 2056320 mirror core 2 16 nosync 2 /dev/hda1 0 /dev/hdb1 0 @@ -349,8 +349,8 @@ end of the line. Note the "Should sync?" parameter "nosync" means that the two mirrors are already in sync which will be the case on a clean shutdown of Windows. If the mirrors are not clean, you can specify the "sync" option instead of "nosync" -and the Device-Mapper driver will then copy the entirey of the "Source Device" -to the "Target Device" or if you specified multipled target devices to all of +and the Device-Mapper driver will then copy the entirety of the "Source Device" +to the "Target Device" or if you specified multiple target devices to all of them. Once you have your table, save it in a file somewhere (e.g. /etc/ntfsvolume1), @@ -383,14 +383,14 @@ Software RAID / MD driver. For which you need to set up your /etc/raidtab appropriately (see man 5 raidtab). Linear volume sets, i.e. linear raid, as well as stripe sets, i.e. raid level -0, have been tested and work fine (though see section "Limitiations when using +0, have been tested and work fine (though see section "Limitations when using the MD driver with NTFS volumes" especially if you want to use linear raid). Even though untested, there is no reason why mirrors, i.e. raid level 1, and stripes with parity, i.e. raid level 5, should not work, too. You have to use the "persistent-superblock 0" option for each raid-disk in the NTFS volume/stripe you are configuring in /etc/raidtab as the persistent -superblock used by the MD driver would damange the NTFS volume. +superblock used by the MD driver would damage the NTFS volume. Windows by default uses a stripe chunk size of 64k, so you probably want the "chunk-size 64k" option for each raid-disk, too. @@ -407,7 +407,7 @@ raiddev /dev/md0 device /dev/hda5 raid-disk 0 device /dev/hdb1 - raid-disl 1 + raid-disk 1 For linear raid, just change the raid-level above to "raid-level linear", for mirrors, change it to "raid-level 1", and for stripe sets with parity, change @@ -435,7 +435,7 @@ setup correctly to avoid the possibility of causing damage to the data on the ntfs volume. -Limitiations when using the Software RAID / MD driver +Limitations when using the Software RAID / MD driver ----------------------------------------------------- Using the md driver will not work properly if any of your NTFS partitions have @@ -455,8 +455,26 @@ not have this problem with odd numbers of sectors. ChangeLog ========= -Note, a technical ChangeLog aimed at kernel hackers is in fs/ntfs/ChangeLog. - +2.1.30: + - Fix writev() (it kept writing the first segment over and over again + instead of moving onto subsequent segments). + - Fix crash in ntfs_mft_record_alloc() when mapping the new extent mft + record failed. +2.1.29: + - Fix a deadlock when mounting read-write. +2.1.28: + - Fix a deadlock. +2.1.27: + - Implement page migration support so the kernel can move memory used + by NTFS files and directories around for management purposes. + - Add support for writing to sparse files created with Windows XP SP2. + - Many minor improvements and bug fixes. +2.1.26: + - Implement support for sector sizes above 512 bytes (up to the maximum + supported by NTFS which is 4096 bytes). + - Enhance support for NTFS volumes which were supported by Windows but + not by Linux due to invalid attribute list attribute flags. + - A few minor updates and bug fixes. 2.1.25: - Write support is now extended with write(2) being able to both overwrite existing file data and to extend files. Also, if a write @@ -588,7 +606,7 @@ Note, a technical ChangeLog aimed at kernel hackers is in fs/ntfs/ChangeLog. - Major bug fixes for reading files and volumes in corner cases which were being hit by Windows 2k/XP users. 2.1.2: - - Major bug fixes aleviating the hangs in statfs experienced by some + - Major bug fixes alleviating the hangs in statfs experienced by some users. 2.1.1: - Update handling of compressed files so people no longer get the diff --git a/Documentation/filesystems/ocfs2.txt b/Documentation/filesystems/ocfs2.txt new file mode 100644 index 00000000000..7618a287aa4 --- /dev/null +++ b/Documentation/filesystems/ocfs2.txt @@ -0,0 +1,102 @@ +OCFS2 filesystem +================== +OCFS2 is a general purpose extent based shared disk cluster file +system with many similarities to ext3. It supports 64 bit inode +numbers, and has automatically extending metadata groups which may +also make it attractive for non-clustered use. + +You'll want to install the ocfs2-tools package in order to at least +get "mount.ocfs2" and "ocfs2_hb_ctl". + +Project web page: http://oss.oracle.com/projects/ocfs2 +Tools web page: http://oss.oracle.com/projects/ocfs2-tools +OCFS2 mailing lists: http://oss.oracle.com/projects/ocfs2/mailman/ + +All code copyright 2005 Oracle except when otherwise noted. + +CREDITS: +Lots of code taken from ext3 and other projects. + +Authors in alphabetical order: +Joel Becker <joel.becker@oracle.com> +Zach Brown <zach.brown@oracle.com> +Mark Fasheh <mfasheh@suse.com> +Kurt Hackel <kurt.hackel@oracle.com> +Tao Ma <tao.ma@oracle.com> +Sunil Mushran <sunil.mushran@oracle.com> +Manish Singh <manish.singh@oracle.com> +Tiger Yang <tiger.yang@oracle.com> + +Caveats +======= +Features which OCFS2 does not support yet: + - Directory change notification (F_NOTIFY) + - Distributed Caching (F_SETLEASE/F_GETLEASE/break_lease) + +Mount options +============= + +OCFS2 supports the following mount options: +(*) == default + +barrier=1 This enables/disables barriers. barrier=0 disables it, + barrier=1 enables it. +errors=remount-ro(*) Remount the filesystem read-only on an error. +errors=panic Panic and halt the machine if an error occurs. +intr (*) Allow signals to interrupt cluster operations. +nointr Do not allow signals to interrupt cluster + operations. +noatime Do not update access time. +relatime(*) Update atime if the previous atime is older than + mtime or ctime +strictatime Always update atime, but the minimum update interval + is specified by atime_quantum. +atime_quantum=60(*) OCFS2 will not update atime unless this number + of seconds has passed since the last update. + Set to zero to always update atime. This option need + work with strictatime. +data=ordered (*) All data are forced directly out to the main file + system prior to its metadata being committed to the + journal. +data=writeback Data ordering is not preserved, data may be written + into the main file system after its metadata has been + committed to the journal. +preferred_slot=0(*) During mount, try to use this filesystem slot first. If + it is in use by another node, the first empty one found + will be chosen. Invalid values will be ignored. +commit=nrsec (*) Ocfs2 can be told to sync all its data and metadata + every 'nrsec' seconds. The default value is 5 seconds. + This means that if you lose your power, you will lose + as much as the latest 5 seconds of work (your + filesystem will not be damaged though, thanks to the + journaling). This default value (or any low value) + will hurt performance, but it's good for data-safety. + Setting it to 0 will have the same effect as leaving + it at the default (5 seconds). + Setting it to very large values will improve + performance. +localalloc=8(*) Allows custom localalloc size in MB. If the value is too + large, the fs will silently revert it to the default. +localflocks This disables cluster aware flock. +inode64 Indicates that Ocfs2 is allowed to create inodes at + any location in the filesystem, including those which + will result in inode numbers occupying more than 32 + bits of significance. +user_xattr (*) Enables Extended User Attributes. +nouser_xattr Disables Extended User Attributes. +acl Enables POSIX Access Control Lists support. +noacl (*) Disables POSIX Access Control Lists support. +resv_level=2 (*) Set how aggressive allocation reservations will be. + Valid values are between 0 (reservations off) to 8 + (maximum space for reservations). +dir_resv_level= (*) By default, directory reservations will scale with file + reservations - users should rarely need to change this + value. If allocation reservations are turned off, this + option will have no effect. +coherency=full (*) Disallow concurrent O_DIRECT writes, cluster inode + lock will be taken to force other nodes drop cache, + therefore full cluster coherency is guaranteed even + for O_DIRECT writes. +coherency=buffered Allow concurrent O_DIRECT writes without EX lock among + nodes, which gains high performance at risk of getting + stale data on other nodes. diff --git a/Documentation/filesystems/omfs.txt b/Documentation/filesystems/omfs.txt new file mode 100644 index 00000000000..1d0d41ff5c6 --- /dev/null +++ b/Documentation/filesystems/omfs.txt @@ -0,0 +1,106 @@ +Optimized MPEG Filesystem (OMFS) + +Overview +======== + +OMFS is a filesystem created by SonicBlue for use in the ReplayTV DVR +and Rio Karma MP3 player. The filesystem is extent-based, utilizing +block sizes from 2k to 8k, with hash-based directories. This +filesystem driver may be used to read and write disks from these +devices. + +Note, it is not recommended that this FS be used in place of a general +filesystem for your own streaming media device. Native Linux filesystems +will likely perform better. + +More information is available at: + + http://linux-karma.sf.net/ + +Various utilities, including mkomfs and omfsck, are included with +omfsprogs, available at: + + http://bobcopeland.com/karma/ + +Instructions are included in its README. + +Options +======= + +OMFS supports the following mount-time options: + + uid=n - make all files owned by specified user + gid=n - make all files owned by specified group + umask=xxx - set permission umask to xxx + fmask=xxx - set umask to xxx for files + dmask=xxx - set umask to xxx for directories + +Disk format +=========== + +OMFS discriminates between "sysblocks" and normal data blocks. The sysblock +group consists of super block information, file metadata, directory structures, +and extents. Each sysblock has a header containing CRCs of the entire +sysblock, and may be mirrored in successive blocks on the disk. A sysblock may +have a smaller size than a data block, but since they are both addressed by the +same 64-bit block number, any remaining space in the smaller sysblock is +unused. + +Sysblock header information: + +struct omfs_header { + __be64 h_self; /* FS block where this is located */ + __be32 h_body_size; /* size of useful data after header */ + __be16 h_crc; /* crc-ccitt of body_size bytes */ + char h_fill1[2]; + u8 h_version; /* version, always 1 */ + char h_type; /* OMFS_INODE_X */ + u8 h_magic; /* OMFS_IMAGIC */ + u8 h_check_xor; /* XOR of header bytes before this */ + __be32 h_fill2; +}; + +Files and directories are both represented by omfs_inode: + +struct omfs_inode { + struct omfs_header i_head; /* header */ + __be64 i_parent; /* parent containing this inode */ + __be64 i_sibling; /* next inode in hash bucket */ + __be64 i_ctime; /* ctime, in milliseconds */ + char i_fill1[35]; + char i_type; /* OMFS_[DIR,FILE] */ + __be32 i_fill2; + char i_fill3[64]; + char i_name[OMFS_NAMELEN]; /* filename */ + __be64 i_size; /* size of file, in bytes */ +}; + +Directories in OMFS are implemented as a large hash table. Filenames are +hashed then prepended into the bucket list beginning at OMFS_DIR_START. +Lookup requires hashing the filename, then seeking across i_sibling pointers +until a match is found on i_name. Empty buckets are represented by block +pointers with all-1s (~0). + +A file is an omfs_inode structure followed by an extent table beginning at +OMFS_EXTENT_START: + +struct omfs_extent_entry { + __be64 e_cluster; /* start location of a set of blocks */ + __be64 e_blocks; /* number of blocks after e_cluster */ +}; + +struct omfs_extent { + __be64 e_next; /* next extent table location */ + __be32 e_extent_count; /* total # extents in this table */ + __be32 e_fill; + struct omfs_extent_entry e_entry; /* start of extent entries */ +}; + +Each extent holds the block offset followed by number of blocks allocated to +the extent. The final extent in each table is a terminator with e_cluster +being ~0 and e_blocks being ones'-complement of the total number of blocks +in the table. + +If this table overflows, a continuation inode is written and pointed to by +e_next. These have a header but lack the rest of the inode structure. + diff --git a/Documentation/filesystems/path-lookup.txt b/Documentation/filesystems/path-lookup.txt new file mode 100644 index 00000000000..3571667c710 --- /dev/null +++ b/Documentation/filesystems/path-lookup.txt @@ -0,0 +1,382 @@ +Path walking and name lookup locking +==================================== + +Path resolution is the finding a dentry corresponding to a path name string, by +performing a path walk. Typically, for every open(), stat() etc., the path name +will be resolved. Paths are resolved by walking the namespace tree, starting +with the first component of the pathname (eg. root or cwd) with a known dentry, +then finding the child of that dentry, which is named the next component in the +path string. Then repeating the lookup from the child dentry and finding its +child with the next element, and so on. + +Since it is a frequent operation for workloads like multiuser environments and +web servers, it is important to optimize this code. + +Path walking synchronisation history: +Prior to 2.5.10, dcache_lock was acquired in d_lookup (dcache hash lookup) and +thus in every component during path look-up. Since 2.5.10 onwards, fast-walk +algorithm changed this by holding the dcache_lock at the beginning and walking +as many cached path component dentries as possible. This significantly +decreases the number of acquisition of dcache_lock. However it also increases +the lock hold time significantly and affects performance in large SMP machines. +Since 2.5.62 kernel, dcache has been using a new locking model that uses RCU to +make dcache look-up lock-free. + +All the above algorithms required taking a lock and reference count on the +dentry that was looked up, so that may be used as the basis for walking the +next path element. This is inefficient and unscalable. It is inefficient +because of the locks and atomic operations required for every dentry element +slows things down. It is not scalable because many parallel applications that +are path-walk intensive tend to do path lookups starting from a common dentry +(usually, the root "/" or current working directory). So contention on these +common path elements causes lock and cacheline queueing. + +Since 2.6.38, RCU is used to make a significant part of the entire path walk +(including dcache look-up) completely "store-free" (so, no locks, atomics, or +even stores into cachelines of common dentries). This is known as "rcu-walk" +path walking. + +Path walking overview +===================== + +A name string specifies a start (root directory, cwd, fd-relative) and a +sequence of elements (directory entry names), which together refer to a path in +the namespace. A path is represented as a (dentry, vfsmount) tuple. The name +elements are sub-strings, separated by '/'. + +Name lookups will want to find a particular path that a name string refers to +(usually the final element, or parent of final element). This is done by taking +the path given by the name's starting point (which we know in advance -- eg. +current->fs->cwd or current->fs->root) as the first parent of the lookup. Then +iteratively for each subsequent name element, look up the child of the current +parent with the given name and if it is not the desired entry, make it the +parent for the next lookup. + +A parent, of course, must be a directory, and we must have appropriate +permissions on the parent inode to be able to walk into it. + +Turning the child into a parent for the next lookup requires more checks and +procedures. Symlinks essentially substitute the symlink name for the target +name in the name string, and require some recursive path walking. Mount points +must be followed into (thus changing the vfsmount that subsequent path elements +refer to), switching from the mount point path to the root of the particular +mounted vfsmount. These behaviours are variously modified depending on the +exact path walking flags. + +Path walking then must, broadly, do several particular things: +- find the start point of the walk; +- perform permissions and validity checks on inodes; +- perform dcache hash name lookups on (parent, name element) tuples; +- traverse mount points; +- traverse symlinks; +- lookup and create missing parts of the path on demand. + +Safe store-free look-up of dcache hash table +============================================ + +Dcache name lookup +------------------ +In order to lookup a dcache (parent, name) tuple, we take a hash on the tuple +and use that to select a bucket in the dcache-hash table. The list of entries +in that bucket is then walked, and we do a full comparison of each entry +against our (parent, name) tuple. + +The hash lists are RCU protected, so list walking is not serialised with +concurrent updates (insertion, deletion from the hash). This is a standard RCU +list application with the exception of renames, which will be covered below. + +Parent and name members of a dentry, as well as its membership in the dcache +hash, and its inode are protected by the per-dentry d_lock spinlock. A +reference is taken on the dentry (while the fields are verified under d_lock), +and this stabilises its d_inode pointer and actual inode. This gives a stable +point to perform the next step of our path walk against. + +These members are also protected by d_seq seqlock, although this offers +read-only protection and no durability of results, so care must be taken when +using d_seq for synchronisation (see seqcount based lookups, below). + +Renames +------- +Back to the rename case. In usual RCU protected lists, the only operations that +will happen to an object is insertion, and then eventually removal from the +list. The object will not be reused until an RCU grace period is complete. +This ensures the RCU list traversal primitives can run over the object without +problems (see RCU documentation for how this works). + +However when a dentry is renamed, its hash value can change, requiring it to be +moved to a new hash list. Allocating and inserting a new alias would be +expensive and also problematic for directory dentries. Latency would be far to +high to wait for a grace period after removing the dentry and before inserting +it in the new hash bucket. So what is done is to insert the dentry into the +new list immediately. + +However, when the dentry's list pointers are updated to point to objects in the +new list before waiting for a grace period, this can result in a concurrent RCU +lookup of the old list veering off into the new (incorrect) list and missing +the remaining dentries on the list. + +There is no fundamental problem with walking down the wrong list, because the +dentry comparisons will never match. However it is fatal to miss a matching +dentry. So a seqlock is used to detect when a rename has occurred, and so the +lookup can be retried. + + 1 2 3 + +---+ +---+ +---+ +hlist-->| N-+->| N-+->| N-+-> +head <--+-P |<-+-P |<-+-P | + +---+ +---+ +---+ + +Rename of dentry 2 may require it deleted from the above list, and inserted +into a new list. Deleting 2 gives the following list. + + 1 3 + +---+ +---+ (don't worry, the longer pointers do not +hlist-->| N-+-------->| N-+-> impose a measurable performance overhead +head <--+-P |<--------+-P | on modern CPUs) + +---+ +---+ + ^ 2 ^ + | +---+ | + | | N-+----+ + +----+-P | + +---+ + +This is a standard RCU-list deletion, which leaves the deleted object's +pointers intact, so a concurrent list walker that is currently looking at +object 2 will correctly continue to object 3 when it is time to traverse the +next object. + +However, when inserting object 2 onto a new list, we end up with this: + + 1 3 + +---+ +---+ +hlist-->| N-+-------->| N-+-> +head <--+-P |<--------+-P | + +---+ +---+ + 2 + +---+ + | N-+----> + <----+-P | + +---+ + +Because we didn't wait for a grace period, there may be a concurrent lookup +still at 2. Now when it follows 2's 'next' pointer, it will walk off into +another list without ever having checked object 3. + +A related, but distinctly different, issue is that of rename atomicity versus +lookup operations. If a file is renamed from 'A' to 'B', a lookup must only +find either 'A' or 'B'. So if a lookup of 'A' returns NULL, a subsequent lookup +of 'B' must succeed (note the reverse is not true). + +Between deleting the dentry from the old hash list, and inserting it on the new +hash list, a lookup may find neither 'A' nor 'B' matching the dentry. The same +rename seqlock is also used to cover this race in much the same way, by +retrying a negative lookup result if a rename was in progress. + +Seqcount based lookups +---------------------- +In refcount based dcache lookups, d_lock is used to serialise access to +the dentry, stabilising it while comparing its name and parent and then +taking a reference count (the reference count then gives a stable place to +start the next part of the path walk from). + +As explained above, we would like to do path walking without taking locks or +reference counts on intermediate dentries along the path. To do this, a per +dentry seqlock (d_seq) is used to take a "coherent snapshot" of what the dentry +looks like (its name, parent, and inode). That snapshot is then used to start +the next part of the path walk. When loading the coherent snapshot under d_seq, +care must be taken to load the members up-front, and use those pointers rather +than reloading from the dentry later on (otherwise we'd have interesting things +like d_inode going NULL underneath us, if the name was unlinked). + +Also important is to avoid performing any destructive operations (pretty much: +no non-atomic stores to shared data), and to recheck the seqcount when we are +"done" with the operation. Retry or abort if the seqcount does not match. +Avoiding destructive or changing operations means we can easily unwind from +failure. + +What this means is that a caller, provided they are holding RCU lock to +protect the dentry object from disappearing, can perform a seqcount based +lookup which does not increment the refcount on the dentry or write to +it in any way. This returned dentry can be used for subsequent operations, +provided that d_seq is rechecked after that operation is complete. + +Inodes are also rcu freed, so the seqcount lookup dentry's inode may also be +queried for permissions. + +With this two parts of the puzzle, we can do path lookups without taking +locks or refcounts on dentry elements. + +RCU-walk path walking design +============================ + +Path walking code now has two distinct modes, ref-walk and rcu-walk. ref-walk +is the traditional[*] way of performing dcache lookups using d_lock to +serialise concurrent modifications to the dentry and take a reference count on +it. ref-walk is simple and obvious, and may sleep, take locks, etc while path +walking is operating on each dentry. rcu-walk uses seqcount based dentry +lookups, and can perform lookup of intermediate elements without any stores to +shared data in the dentry or inode. rcu-walk can not be applied to all cases, +eg. if the filesystem must sleep or perform non trivial operations, rcu-walk +must be switched to ref-walk mode. + +[*] RCU is still used for the dentry hash lookup in ref-walk, but not the full + path walk. + +Where ref-walk uses a stable, refcounted ``parent'' to walk the remaining +path string, rcu-walk uses a d_seq protected snapshot. When looking up a +child of this parent snapshot, we open d_seq critical section on the child +before closing d_seq critical section on the parent. This gives an interlocking +ladder of snapshots to walk down. + + + proc 101 + /----------------\ + / comm: "vi" \ + / fs.root: dentry0 \ + \ fs.cwd: dentry2 / + \ / + \----------------/ + +So when vi wants to open("/home/npiggin/test.c", O_RDWR), then it will +start from current->fs->root, which is a pinned dentry. Alternatively, +"./test.c" would start from cwd; both names refer to the same path in +the context of proc101. + + dentry 0 + +---------------------+ rcu-walk begins here, we note d_seq, check the + | name: "/" | inode's permission, and then look up the next + | inode: 10 | path element which is "home"... + | children:"home", ...| + +---------------------+ + | + dentry 1 V + +---------------------+ ... which brings us here. We find dentry1 via + | name: "home" | hash lookup, then note d_seq and compare name + | inode: 678 | string and parent pointer. When we have a match, + | children:"npiggin" | we now recheck the d_seq of dentry0. Then we + +---------------------+ check inode and look up the next element. + | + dentry2 V + +---------------------+ Note: if dentry0 is now modified, lookup is + | name: "npiggin" | not necessarily invalid, so we need only keep a + | inode: 543 | parent for d_seq verification, and grandparents + | children:"a.c", ... | can be forgotten. + +---------------------+ + | + dentry3 V + +---------------------+ At this point we have our destination dentry. + | name: "a.c" | We now take its d_lock, verify d_seq of this + | inode: 14221 | dentry. If that checks out, we can increment + | children:NULL | its refcount because we're holding d_lock. + +---------------------+ + +Taking a refcount on a dentry from rcu-walk mode, by taking its d_lock, +re-checking its d_seq, and then incrementing its refcount is called +"dropping rcu" or dropping from rcu-walk into ref-walk mode. + +It is, in some sense, a bit of a house of cards. If the seqcount check of the +parent snapshot fails, the house comes down, because we had closed the d_seq +section on the grandparent, so we have nothing left to stand on. In that case, +the path walk must be fully restarted (which we do in ref-walk mode, to avoid +live locks). It is costly to have a full restart, but fortunately they are +quite rare. + +When we reach a point where sleeping is required, or a filesystem callout +requires ref-walk, then instead of restarting the walk, we attempt to drop rcu +at the last known good dentry we have. Avoiding a full restart in ref-walk in +these cases is fundamental for performance and scalability because blocking +operations such as creates and unlinks are not uncommon. + +The detailed design for rcu-walk is like this: +* LOOKUP_RCU is set in nd->flags, which distinguishes rcu-walk from ref-walk. +* Take the RCU lock for the entire path walk, starting with the acquiring + of the starting path (eg. root/cwd/fd-path). So now dentry refcounts are + not required for dentry persistence. +* synchronize_rcu is called when unregistering a filesystem, so we can + access d_ops and i_ops during rcu-walk. +* Similarly take the vfsmount lock for the entire path walk. So now mnt + refcounts are not required for persistence. Also we are free to perform mount + lookups, and to assume dentry mount points and mount roots are stable up and + down the path. +* Have a per-dentry seqlock to protect the dentry name, parent, and inode, + so we can load this tuple atomically, and also check whether any of its + members have changed. +* Dentry lookups (based on parent, candidate string tuple) recheck the parent + sequence after the child is found in case anything changed in the parent + during the path walk. +* inode is also RCU protected so we can load d_inode and use the inode for + limited things. +* i_mode, i_uid, i_gid can be tested for exec permissions during path walk. +* i_op can be loaded. +* When the destination dentry is reached, drop rcu there (ie. take d_lock, + verify d_seq, increment refcount). +* If seqlock verification fails anywhere along the path, do a full restart + of the path lookup in ref-walk mode. -ECHILD tends to be used (for want of + a better errno) to signal an rcu-walk failure. + +The cases where rcu-walk cannot continue are: +* NULL dentry (ie. any uncached path element) +* Following links + +It may be possible eventually to make following links rcu-walk aware. + +Uncached path elements will always require dropping to ref-walk mode, at the +very least because i_mutex needs to be grabbed, and objects allocated. + +Final note: +"store-free" path walking is not strictly store free. We take vfsmount lock +and refcounts (both of which can be made per-cpu), and we also store to the +stack (which is essentially CPU-local), and we also have to take locks and +refcount on final dentry. + +The point is that shared data, where practically possible, is not locked +or stored into. The result is massive improvements in performance and +scalability of path resolution. + + +Interesting statistics +====================== + +The following table gives rcu lookup statistics for a few simple workloads +(2s12c24t Westmere, debian non-graphical system). Ungraceful are attempts to +drop rcu that fail due to d_seq failure and requiring the entire path lookup +again. Other cases are successful rcu-drops that are required before the final +element, nodentry for missing dentry, revalidate for filesystem revalidate +routine requiring rcu drop, permission for permission check requiring drop, +and link for symlink traversal requiring drop. + + rcu-lookups restart nodentry link revalidate permission +bootup 47121 0 4624 1010 10283 7852 +dbench 25386793 0 6778659(26.7%) 55 549 1156 +kbuild 2696672 10 64442(2.3%) 108764(4.0%) 1 1590 +git diff 39605 0 28 2 0 106 +vfstest 24185492 4945 708725(2.9%) 1076136(4.4%) 0 2651 + +What this shows is that failed rcu-walk lookups, ie. ones that are restarted +entirely with ref-walk, are quite rare. Even the "vfstest" case which +specifically has concurrent renames/mkdir/rmdir/ creat/unlink/etc to exercise +such races is not showing a huge amount of restarts. + +Dropping from rcu-walk to ref-walk mean that we have encountered a dentry where +the reference count needs to be taken for some reason. This is either because +we have reached the target of the path walk, or because we have encountered a +condition that can't be resolved in rcu-walk mode. Ideally, we drop rcu-walk +only when we have reached the target dentry, so the other statistics show where +this does not happen. + +Note that a graceful drop from rcu-walk mode due to something such as the +dentry not existing (which can be common) is not necessarily a failure of +rcu-walk scheme, because some elements of the path may have been walked in +rcu-walk mode. The further we get from common path elements (such as cwd or +root), the less contended the dentry is likely to be. The closer we are to +common path elements, the more likely they will exist in dentry cache. + + +Papers and other documentation on dcache locking +================================================ + +1. Scaling dcache with RCU (http://linuxjournal.com/article.php?sid=7124). + +2. http://lse.sourceforge.net/locking/dcache/dcache.html + + diff --git a/Documentation/filesystems/pohmelfs/design_notes.txt b/Documentation/filesystems/pohmelfs/design_notes.txt new file mode 100644 index 00000000000..8aef9133570 --- /dev/null +++ b/Documentation/filesystems/pohmelfs/design_notes.txt @@ -0,0 +1,72 @@ +POHMELFS: Parallel Optimized Host Message Exchange Layered File System. + + Evgeniy Polyakov <zbr@ioremap.net> + +Homepage: http://www.ioremap.net/projects/pohmelfs + +POHMELFS first began as a network filesystem with coherent local data and +metadata caches but is now evolving into a parallel distributed filesystem. + +Main features of this FS include: + * Locally coherent cache for data and metadata with (potentially) byte-range locks. + Since all Linux filesystems lock the whole inode during writing, algorithm + is very simple and does not use byte-ranges, although they are sent in + locking messages. + * Completely async processing of all events except creation of hard and symbolic + links, and rename events. + Object creation and data reading and writing are processed asynchronously. + * Flexible object architecture optimized for network processing. + Ability to create long paths to objects and remove arbitrarily huge + directories with a single network command. + (like removing the whole kernel tree via a single network command). + * Very high performance. + * Fast and scalable multithreaded userspace server. Being in userspace it works + with any underlying filesystem and still is much faster than async in-kernel NFS one. + * Client is able to switch between different servers (if one goes down, client + automatically reconnects to second and so on). + * Transactions support. Full failover for all operations. + Resending transactions to different servers on timeout or error. + * Read request (data read, directory listing, lookup requests) balancing between multiple servers. + * Write requests are replicated to multiple servers and completed only when all of them are acked. + * Ability to add and/or remove servers from the working set at run-time. + * Strong authentification and possible data encryption in network channel. + * Extended attributes support. + +POHMELFS is based on transactions, which are potentially long-standing objects that live +in the client's memory. Each transaction contains all the information needed to process a given +command (or set of commands, which is frequently used during data writing: single transactions +can contain creation and data writing commands). Transactions are committed by all the servers +to which they are sent and, in case of failures, are eventually resent or dropped with an error. +For example, reading will return an error if no servers are available. + +POHMELFS uses a asynchronous approach to data processing. Courtesy of transactions, it is +possible to detach replies from requests and, if the command requires data to be received, the +caller sleeps waiting for it. Thus, it is possible to issue multiple read commands to different +servers and async threads will pick up replies in parallel, find appropriate transactions in the +system and put the data where it belongs (like the page or inode cache). + +The main feature of POHMELFS is writeback data and the metadata cache. +Only a few non-performance critical operations use the write-through cache and +are synchronous: hard and symbolic link creation, and object rename. Creation, +removal of objects and data writing are asynchronous and are sent to +the server during system writeback. Only one writer at a time is allowed for any +given inode, which is guarded by an appropriate locking protocol. +Because of this feature, POHMELFS is extremely fast at metadata intensive +workloads and can fully utilize the bandwidth to the servers when doing bulk +data transfers. + +POHMELFS clients operate with a working set of servers and are capable of balancing read-only +operations (like lookups or directory listings) between them according to IO priorities. +Administrators can add or remove servers from the set at run-time via special commands (described +in Documentation/filesystems/pohmelfs/info.txt file). Writes are replicated to all servers, which +are connected with write permission turned on. IO priority and permissions can be changed in +run-time. + +POHMELFS is capable of full data channel encryption and/or strong crypto hashing. +One can select any kernel supported cipher, encryption mode, hash type and operation mode +(hmac or digest). It is also possible to use both or neither (default). Crypto configuration +is checked during mount time and, if the server does not support it, appropriate capabilities +will be disabled or mount will fail (if 'crypto_fail_unsupported' mount option is specified). +Crypto performance heavily depends on the number of crypto threads, which asynchronously perform +crypto operations and send the resulting data to server or submit it up the stack. This number +can be controlled via a mount option. diff --git a/Documentation/filesystems/pohmelfs/info.txt b/Documentation/filesystems/pohmelfs/info.txt new file mode 100644 index 00000000000..db2e4139362 --- /dev/null +++ b/Documentation/filesystems/pohmelfs/info.txt @@ -0,0 +1,99 @@ +POHMELFS usage information. + +Mount options. +All but index, number of crypto threads and maximum IO size can changed via remount. + +idx=%u + Each mountpoint is associated with a special index via this option. + Administrator can add or remove servers from the given index, so all mounts, + which were attached to it, are updated. + Default it is 0. + +trans_scan_timeout=%u + This timeout, expressed in milliseconds, specifies time to scan transaction + trees looking for stale requests, which have to be resent, or if number of + retries exceed specified limit, dropped with error. + Default is 5 seconds. + +drop_scan_timeout=%u + Internal timeout, expressed in milliseconds, which specifies how frequently + inodes marked to be dropped are freed. It also specifies how frequently + the system checks that servers have to be added or removed from current working set. + Default is 1 second. + +wait_on_page_timeout=%u + Number of milliseconds to wait for reply from remote server for data reading command. + If this timeout is exceeded, reading returns an error. + Default is 5 seconds. + +trans_retries=%u + This is the number of times that a transaction will be resent to a server that did + not answer for the last @trans_scan_timeout milliseconds. + When the number of resends exceeds this limit, the transaction is completed with error. + Default is 5 resends. + +crypto_thread_num=%u + Number of crypto processing threads. Threads are used both for RX and TX traffic. + Default is 2, or no threads if crypto operations are not supported. + +trans_max_pages=%u + Maximum number of pages in a single transaction. This parameter also controls + the number of pages, allocated for crypto processing (each crypto thread has + pool of pages, the number of which is equal to 'trans_max_pages'. + Default is 100 pages. + +crypto_fail_unsupported + If specified, mount will fail if the server does not support requested crypto operations. + By default mount will disable non-matching crypto operations. + +mcache_timeout=%u + Maximum number of milliseconds to wait for the mcache objects to be processed. + Mcache includes locks (given lock should be granted by server), attributes (they should be + fully received in the given timeframe). + Default is 5 seconds. + +Usage examples. + +Add server server1.net:1025 into the working set with index $idx +with appropriate hash algorithm and key file and cipher algorithm, mode and key file: +$cfg A add -a server1.net -p 1025 -i $idx -K $hash_key -k $cipher_key + +Mount filesystem with given index $idx to /mnt mountpoint. +Client will connect to all servers specified in the working set via previous command: +mount -t pohmel -o idx=$idx q /mnt + +Change permissions to read-only (-I 1 option, '-I 2' - write-only, 3 - rw): +$cfg A modify -a server1.net -p 1025 -i $idx -I 1 + +Change IO priority to 123 (node with the highest priority gets read requests). +$cfg A modify -a server1.net -p 1025 -i $idx -P 123 + +One can check currect status of all connections in the mountstats file: +# cat /proc/$PID/mountstats +... +device none mounted on /mnt with fstype pohmel +idx addr(:port) socket_type protocol active priority permissions +0 server1.net:1026 1 6 1 250 1 +0 server2.net:1025 1 6 1 123 3 + +Server installation. + +Creating a server, which listens at port 1025 and 0.0.0.0 address. +Working root directory (note, that server chroots there, so you have to have appropriate permissions) +is set to /mnt, server will negotiate hash/cipher with client, in case client requested it, there +are appropriate key files. +Number of working threads is set to 10. + +# ./fserver -a 0.0.0.0 -p 1025 -r /mnt -w 10 -K hash_key -k cipher_key + + -A 6 - listen on ipv6 address. Default: Disabled. + -r root - path to root directory. Default: /tmp. + -a addr - listen address. Default: 0.0.0.0. + -p port - listen port. Default: 1025. + -w workers - number of workers per connected client. Default: 1. + -K file - hash key size. Default: none. + -k file - cipher key size. Default: none. + -h - this help. + +Number of worker threads specifies how many workers will be created for each client. +Bulk single-client transafers usually are better handled with smaller number (like 1-3). diff --git a/Documentation/filesystems/pohmelfs/network_protocol.txt b/Documentation/filesystems/pohmelfs/network_protocol.txt new file mode 100644 index 00000000000..c680b4b5353 --- /dev/null +++ b/Documentation/filesystems/pohmelfs/network_protocol.txt @@ -0,0 +1,227 @@ +POHMELFS network protocol. + +Basic structure used in network communication is following command: + +struct netfs_cmd +{ + __u16 cmd; /* Command number */ + __u16 csize; /* Attached crypto information size */ + __u16 cpad; /* Attached padding size */ + __u16 ext; /* External flags */ + __u32 size; /* Size of the attached data */ + __u32 trans; /* Transaction id */ + __u64 id; /* Object ID to operate on. Used for feedback.*/ + __u64 start; /* Start of the object. */ + __u64 iv; /* IV sequence */ + __u8 data[0]; +}; + +Commands can be embedded into transaction command (which in turn has own command), +so one can extend protocol as needed without breaking backward compatibility as long +as old commands are supported. All string lengths include tail 0 byte. + +All commands are transferred over the network in big-endian. CPU endianness is used at the end peers. + +@cmd - command number, which specifies command to be processed. Following + commands are used currently: + + NETFS_READDIR = 1, /* Read directory for given inode number */ + NETFS_READ_PAGE, /* Read data page from the server */ + NETFS_WRITE_PAGE, /* Write data page to the server */ + NETFS_CREATE, /* Create directory entry */ + NETFS_REMOVE, /* Remove directory entry */ + NETFS_LOOKUP, /* Lookup single object */ + NETFS_LINK, /* Create a link */ + NETFS_TRANS, /* Transaction */ + NETFS_OPEN, /* Open intent */ + NETFS_INODE_INFO, /* Metadata cache coherency synchronization message */ + NETFS_PAGE_CACHE, /* Page cache invalidation message */ + NETFS_READ_PAGES, /* Read multiple contiguous pages in one go */ + NETFS_RENAME, /* Rename object */ + NETFS_CAPABILITIES, /* Capabilities of the client, for example supported crypto */ + NETFS_LOCK, /* Distributed lock message */ + NETFS_XATTR_SET, /* Set extended attribute */ + NETFS_XATTR_GET, /* Get extended attribute */ + +@ext - external flags. Used by different commands to specify some extra arguments + like partial size of the embedded objects or creation flags. + +@size - size of the attached data. For NETFS_READ_PAGE and NETFS_READ_PAGES no data is attached, + but size of the requested data is incorporated here. It does not include size of the command + header (struct netfs_cmd) itself. + +@id - id of the object this command operates on. Each command can use it for own purpose. + +@start - start of the object this command operates on. Each command can use it for own purpose. + +@csize, @cpad - size and padding size of the (attached if needed) crypto information. + +Command specifications. + +@NETFS_READDIR +This command is used to sync content of the remote dir to the client. + +@ext - length of the path to object. +@size - the same. +@id - local inode number of the directory to read. +@start - zero. + + +@NETFS_READ_PAGE +This command is used to read data from remote server. +Data size does not exceed local page cache size. + +@id - inode number. +@start - first byte offset. +@size - number of bytes to read plus length of the path to object. +@ext - object path length. + + +@NETFS_CREATE +Used to create object. +It does not require that all directories on top of the object were +already created, it will create them automatically. Each object has +associated @netfs_path_entry data structure, which contains creation +mode (permissions and type) and length of the name as long as name itself. + +@start - 0 +@size - size of the all data structures needed to create a path +@id - local inode number +@ext - 0 + + +@NETFS_REMOVE +Used to remove object. + +@ext - length of the path to object. +@size - the same. +@id - local inode number. +@start - zero. + + +@NETFS_LOOKUP +Lookup information about object on server. + +@ext - length of the path to object. +@size - the same. +@id - local inode number of the directory to look object in. +@start - local inode number of the object to look at. + + +@NETFS_LINK +Create hard of symlink. +Command is sent as "object_path|target_path". + +@size - size of the above string. +@id - parent local inode number. +@start - 1 for symlink, 0 for hardlink. +@ext - size of the "object_path" above. + + +@NETFS_TRANS +Transaction header. + +@size - incorporates all embedded command sizes including theirs header sizes. +@start - transaction generation number - unique id used to find transaction. +@ext - transaction flags. Unused at the moment. +@id - 0. + + +@NETFS_OPEN +Open intent for given transaction. + +@id - local inode number. +@start - 0. +@size - path length to the object. +@ext - open flags (O_RDWR and so on). + + +@NETFS_INODE_INFO +Metadata update command. +It is sent to servers when attributes of the object are changed and received +when data or metadata were updated. It operates with the following structure: + +struct netfs_inode_info +{ + unsigned int mode; + unsigned int nlink; + unsigned int uid; + unsigned int gid; + unsigned int blocksize; + unsigned int padding; + __u64 ino; + __u64 blocks; + __u64 rdev; + __u64 size; + __u64 version; +}; + +It effectively mirrors stat(2) returned data. + + +@ext - path length to the object. +@size - the same plus size of the netfs_inode_info structure. +@id - local inode number. +@start - 0. + + +@NETFS_PAGE_CACHE +Command is only received by clients. It contains information about +page to be marked as not up-to-date. + +@id - client's inode number. +@start - last byte of the page to be invalidated. If it is not equal to + current inode size, it will be vmtruncated(). +@size - 0 +@ext - 0 + + +@NETFS_READ_PAGES +Used to read multiple contiguous pages in one go. + +@start - first byte of the contiguous region to read. +@size - contains of two fields: lower 8 bits are used to represent page cache shift + used by client, another 3 bytes are used to get number of pages. +@id - local inode number. +@ext - path length to the object. + + +@NETFS_RENAME +Used to rename object. +Attached data is formed into following string: "old_path|new_path". + +@id - local inode number. +@start - parent inode number. +@size - length of the above string. +@ext - length of the old path part. + + +@NETFS_CAPABILITIES +Used to exchange crypto capabilities with server. +If crypto capabilities are not supported by server, then client will disable it +or fail (if 'crypto_fail_unsupported' mount options was specified). + +@id - superblock index. Used to specify crypto information for group of servers. +@size - size of the attached capabilities structure. +@start - 0. +@size - 0. +@scsize - 0. + +@NETFS_LOCK +Used to send lock request/release messages. Although it sends byte range request +and is capable of flushing pages based on that, it is not used, since all Linux +filesystems lock the whole inode. + +@id - lock generation number. +@start - start of the locked range. +@size - size of the locked range. +@ext - lock type: read/write. Not used actually. 15'th bit is used to determine, + if it is lock request (1) or release (0). + +@NETFS_XATTR_SET +@NETFS_XATTR_GET +Used to set/get extended attributes for given inode. +@id - attribute generation number or xattr setting type +@start - size of the attribute (request or attached) +@size - name length, path len and data size for given attribute +@ext - path length for given object diff --git a/Documentation/filesystems/porting b/Documentation/filesystems/porting index 2f388460cbe..0f3a1390bf0 100644 --- a/Documentation/filesystems/porting +++ b/Documentation/filesystems/porting @@ -1,6 +1,6 @@ Changes since 2.5.0: ---- +--- [recommended] New helpers: sb_bread(), sb_getblk(), sb_find_get_block(), set_bh(), @@ -10,7 +10,7 @@ Use them. (sb_find_get_block() replaces 2.4's get_hash_table()) ---- +--- [recommended] New methods: ->alloc_inode() and ->destroy_inode(). @@ -28,14 +28,14 @@ Declare Use FOO_I(inode) instead of &inode->u.foo_inode_i; -Add foo_alloc_inode() and foo_destory_inode() - the former should allocate +Add foo_alloc_inode() and foo_destroy_inode() - the former should allocate foo_inode_info and return the address of ->vfs_inode, the latter should free FOO_I(inode) (see in-tree filesystems for examples). Make them ->alloc_inode and ->destroy_inode in your super_operations. -Keep in mind that now you need explicit initialization of private data - -typically in ->read_inode() and after getting an inode from new_inode(). +Keep in mind that now you need explicit initialization of private data +typically between calling iget_locked() and unlocking the inode. At some point that will become mandatory. @@ -50,10 +50,11 @@ Turn your foo_read_super() into a function that would return 0 in case of success and negative number in case of error (-EINVAL unless you have more informative error value to report). Call it foo_fill_super(). Now declare -struct super_block foo_get_sb(struct file_system_type *fs_type, - int flags, const char *dev_name, void *data) +int foo_get_sb(struct file_system_type *fs_type, + int flags, const char *dev_name, void *data, struct vfsmount *mnt) { - return get_sb_bdev(fs_type, flags, dev_name, data, ext2_fill_super); + return get_sb_bdev(fs_type, flags, dev_name, data, foo_fill_super, + mnt); } (or similar with s/bdev/nodev/ or s/bdev/single/, depending on the kind of @@ -93,9 +94,8 @@ protected. --- [mandatory] -BKL is also moved from around sb operations. ->write_super() Is now called -without BKL held. BKL should have been shifted into individual fs sb_op -functions. If you don't need it, remove it. +BKL is also moved from around sb operations. BKL should have been shifted into +individual fs sb_op functions. If you don't need it, remove it. --- [informational] @@ -106,7 +106,7 @@ free to drop it... --- [informational] -->link() callers hold ->i_sem on the object we are linking to. Some of your +->link() callers hold ->i_mutex on the object we are linking to. Some of your problems might be over... --- @@ -129,9 +129,9 @@ went in - and hadn't been documented ;-/). Just remove it from fs_flags --- [mandatory] -->setattr() is called without BKL now. Caller _always_ holds ->i_sem, so -watch for ->i_sem-grabbing code that might be used by your ->setattr(). -Callers of notify_change() need ->i_sem now. +->setattr() is called without BKL now. Caller _always_ holds ->i_mutex, so +watch for ->i_mutex-grabbing code that might be used by your ->setattr(). +Callers of notify_change() need ->i_mutex now. --- [recommended] @@ -139,7 +139,7 @@ Callers of notify_change() need ->i_sem now. New super_block field "struct export_operations *s_export_op" for explicit support for exporting, e.g. via NFS. The structure is fully documented at its declaration in include/linux/fs.h, and in -Documentation/filesystems/Exporting. +Documentation/filesystems/nfs/Exporting. Briefly it allows for the definition of decode_fh and encode_fh operations to encode and decode filehandles, and allows the filesystem to use @@ -172,10 +172,10 @@ should be a non-blocking function that initializes those parts of a newly created inode to allow the test function to succeed. 'data' is passed as an opaque value to both test and set functions. -When the inode has been created by iget5_locked(), it will be returned with -the I_NEW flag set and will still be locked. read_inode has not been -called so the file system still has to finalize the initialization. Once -the inode is initialized it must be unlocked by calling unlock_new_inode(). +When the inode has been created by iget5_locked(), it will be returned with the +I_NEW flag set and will still be locked. The filesystem then needs to finalize +the initialization. Once the inode is initialized it must be unlocked by +calling unlock_new_inode(). The filesystem is responsible for setting (and possibly testing) i_ino when appropriate. There is also a simpler iget_locked function that @@ -183,11 +183,19 @@ just takes the superblock and inode number as arguments and does the test and set for you. e.g. - inode = iget_locked(sb, ino); - if (inode->i_state & I_NEW) { - read_inode_from_disk(inode); - unlock_new_inode(inode); - } + inode = iget_locked(sb, ino); + if (inode->i_state & I_NEW) { + err = read_inode_from_disk(inode); + if (err < 0) { + iget_failed(inode); + return err; + } + unlock_new_inode(inode); + } + +Note that if the process of setting up a new inode fails, then iget_failed() +should be called on the inode to render it dead, and an appropriate error +should be passed back to the caller. --- [recommended] @@ -207,7 +215,6 @@ had ->revalidate()) add calls in ->follow_link()/->readlink(). ->d_parent changes are not protected by BKL anymore. Read access is safe if at least one of the following is true: * filesystem has no cross-directory rename() - * dcache_lock is held * we know that parent had been locked (e.g. we are looking at ->d_parent of ->lookup() argument). * we are called from ->rename(). @@ -264,3 +271,195 @@ it's safe to remove it. If you don't need it, remove it. deliberate; as soon as struct block_device * is propagated in a reasonable way by that code fixing will become trivial; until then nothing can be done. + +[mandatory] + + block truncatation on error exit from ->write_begin, and ->direct_IO +moved from generic methods (block_write_begin, cont_write_begin, +nobh_write_begin, blockdev_direct_IO*) to callers. Take a look at +ext2_write_failed and callers for an example. + +[mandatory] + + ->truncate is gone. The whole truncate sequence needs to be +implemented in ->setattr, which is now mandatory for filesystems +implementing on-disk size changes. Start with a copy of the old inode_setattr +and vmtruncate, and the reorder the vmtruncate + foofs_vmtruncate sequence to +be in order of zeroing blocks using block_truncate_page or similar helpers, +size update and on finally on-disk truncation which should not fail. +inode_change_ok now includes the size checks for ATTR_SIZE and must be called +in the beginning of ->setattr unconditionally. + +[mandatory] + + ->clear_inode() and ->delete_inode() are gone; ->evict_inode() should +be used instead. It gets called whenever the inode is evicted, whether it has +remaining links or not. Caller does *not* evict the pagecache or inode-associated +metadata buffers; the method has to use truncate_inode_pages_final() to get rid +of those. Caller makes sure async writeback cannot be running for the inode while +(or after) ->evict_inode() is called. + + ->drop_inode() returns int now; it's called on final iput() with +inode->i_lock held and it returns true if filesystems wants the inode to be +dropped. As before, generic_drop_inode() is still the default and it's been +updated appropriately. generic_delete_inode() is also alive and it consists +simply of return 1. Note that all actual eviction work is done by caller after +->drop_inode() returns. + + As before, clear_inode() must be called exactly once on each call of +->evict_inode() (as it used to be for each call of ->delete_inode()). Unlike +before, if you are using inode-associated metadata buffers (i.e. +mark_buffer_dirty_inode()), it's your responsibility to call +invalidate_inode_buffers() before clear_inode(). + + NOTE: checking i_nlink in the beginning of ->write_inode() and bailing out +if it's zero is not *and* *never* *had* *been* enough. Final unlink() and iput() +may happen while the inode is in the middle of ->write_inode(); e.g. if you blindly +free the on-disk inode, you may end up doing that while ->write_inode() is writing +to it. + +--- +[mandatory] + + .d_delete() now only advises the dcache as to whether or not to cache +unreferenced dentries, and is now only called when the dentry refcount goes to +0. Even on 0 refcount transition, it must be able to tolerate being called 0, +1, or more times (eg. constant, idempotent). + +--- +[mandatory] + + .d_compare() calling convention and locking rules are significantly +changed. Read updated documentation in Documentation/filesystems/vfs.txt (and +look at examples of other filesystems) for guidance. + +--- +[mandatory] + + .d_hash() calling convention and locking rules are significantly +changed. Read updated documentation in Documentation/filesystems/vfs.txt (and +look at examples of other filesystems) for guidance. + +--- +[mandatory] + dcache_lock is gone, replaced by fine grained locks. See fs/dcache.c +for details of what locks to replace dcache_lock with in order to protect +particular things. Most of the time, a filesystem only needs ->d_lock, which +protects *all* the dcache state of a given dentry. + +-- +[mandatory] + + Filesystems must RCU-free their inodes, if they can have been accessed +via rcu-walk path walk (basically, if the file can have had a path name in the +vfs namespace). + + Even though i_dentry and i_rcu share storage in a union, we will +initialize the former in inode_init_always(), so just leave it alone in +the callback. It used to be necessary to clean it there, but not anymore +(starting at 3.2). + +-- +[recommended] + vfs now tries to do path walking in "rcu-walk mode", which avoids +atomic operations and scalability hazards on dentries and inodes (see +Documentation/filesystems/path-lookup.txt). d_hash and d_compare changes +(above) are examples of the changes required to support this. For more complex +filesystem callbacks, the vfs drops out of rcu-walk mode before the fs call, so +no changes are required to the filesystem. However, this is costly and loses +the benefits of rcu-walk mode. We will begin to add filesystem callbacks that +are rcu-walk aware, shown below. Filesystems should take advantage of this +where possible. + +-- +[mandatory] + d_revalidate is a callback that is made on every path element (if +the filesystem provides it), which requires dropping out of rcu-walk mode. This +may now be called in rcu-walk mode (nd->flags & LOOKUP_RCU). -ECHILD should be +returned if the filesystem cannot handle rcu-walk. See +Documentation/filesystems/vfs.txt for more details. + + permission and check_acl are inode permission checks that are called +on many or all directory inodes on the way down a path walk (to check for +exec permission). These must now be rcu-walk aware (flags & IPERM_FLAG_RCU). +See Documentation/filesystems/vfs.txt for more details. + +-- +[mandatory] + In ->fallocate() you must check the mode option passed in. If your +filesystem does not support hole punching (deallocating space in the middle of a +file) you must return -EOPNOTSUPP if FALLOC_FL_PUNCH_HOLE is set in mode. +Currently you can only have FALLOC_FL_PUNCH_HOLE with FALLOC_FL_KEEP_SIZE set, +so the i_size should not change when hole punching, even when puching the end of +a file off. + +-- +[mandatory] + ->get_sb() is gone. Switch to use of ->mount(). Typically it's just +a matter of switching from calling get_sb_... to mount_... and changing the +function type. If you were doing it manually, just switch from setting ->mnt_root +to some pointer to returning that pointer. On errors return ERR_PTR(...). + +-- +[mandatory] + ->permission() and generic_permission()have lost flags +argument; instead of passing IPERM_FLAG_RCU we add MAY_NOT_BLOCK into mask. + generic_permission() has also lost the check_acl argument; ACL checking +has been taken to VFS and filesystems need to provide a non-NULL ->i_op->get_acl +to read an ACL from disk. + +-- +[mandatory] + If you implement your own ->llseek() you must handle SEEK_HOLE and +SEEK_DATA. You can hanle this by returning -EINVAL, but it would be nicer to +support it in some way. The generic handler assumes that the entire file is +data and there is a virtual hole at the end of the file. So if the provided +offset is less than i_size and SEEK_DATA is specified, return the same offset. +If the above is true for the offset and you are given SEEK_HOLE, return the end +of the file. If the offset is i_size or greater return -ENXIO in either case. + +[mandatory] + If you have your own ->fsync() you must make sure to call +filemap_write_and_wait_range() so that all dirty pages are synced out properly. +You must also keep in mind that ->fsync() is not called with i_mutex held +anymore, so if you require i_mutex locking you must make sure to take it and +release it yourself. + +-- +[mandatory] + d_alloc_root() is gone, along with a lot of bugs caused by code +misusing it. Replacement: d_make_root(inode). The difference is, +d_make_root() drops the reference to inode if dentry allocation fails. + +-- +[mandatory] + The witch is dead! Well, 2/3 of it, anyway. ->d_revalidate() and +->lookup() do *not* take struct nameidata anymore; just the flags. +-- +[mandatory] + ->create() doesn't take struct nameidata *; unlike the previous +two, it gets "is it an O_EXCL or equivalent?" boolean argument. Note that +local filesystems can ignore tha argument - they are guaranteed that the +object doesn't exist. It's remote/distributed ones that might care... +-- +[mandatory] + FS_REVAL_DOT is gone; if you used to have it, add ->d_weak_revalidate() +in your dentry operations instead. +-- +[mandatory] + vfs_readdir() is gone; switch to iterate_dir() instead +-- +[mandatory] + ->readdir() is gone now; switch to ->iterate() +[mandatory] + vfs_follow_link has been removed. Filesystems must use nd_set_link + from ->follow_link for normal symlinks, or nd_jump_link for magic + /proc/<pid> style links. +-- +[mandatory] + iget5_locked()/ilookup5()/ilookup5_nowait() test() callback used to be + called with both ->i_lock and inode_hash_lock held; the former is *not* + taken anymore, so verify that your callbacks do not rely on it (none + of the in-tree instances did). inode_hash_lock is still held, + of course, so they are still serialized wrt removal from inode hash, + as well as wrt set() callback of iget5_locked(). diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt index d4773565ea2..ddc531a74d0 100644 --- a/Documentation/filesystems/proc.txt +++ b/Documentation/filesystems/proc.txt @@ -5,10 +5,12 @@ Bodo Bauer <bb@ricochet.net> 2.4.x update Jorge Nerin <comandante@zaralinux.com> November 14 2000 +move /proc/sys Shen Feng <shen@cn.fujitsu.com> April 1 2009 ------------------------------------------------------------------------------ Version 1.3 Kernel version 2.2.12 Kernel version 2.4.0-test11-pre4 ------------------------------------------------------------------------------ +fixes/update part 1.1 Stefani Seibold <stefani@seibold.net> June 9 2009 Table of Contents ----------------- @@ -26,19 +28,23 @@ Table of Contents 1.6 Parallel port info in /proc/parport 1.7 TTY info in /proc/tty 1.8 Miscellaneous kernel statistics in /proc/stat + 1.9 Ext4 file system parameters 2 Modifying System Parameters - 2.1 /proc/sys/fs - File system data - 2.2 /proc/sys/fs/binfmt_misc - Miscellaneous binary formats - 2.3 /proc/sys/kernel - general kernel parameters - 2.4 /proc/sys/vm - The virtual memory subsystem - 2.5 /proc/sys/dev - Device specific parameters - 2.6 /proc/sys/sunrpc - Remote procedure calls - 2.7 /proc/sys/net - Networking stuff - 2.8 /proc/sys/net/ipv4 - IPV4 settings - 2.9 Appletalk - 2.10 IPX - 2.11 /proc/sys/fs/mqueue - POSIX message queues filesystem + + 3 Per-Process Parameters + 3.1 /proc/<pid>/oom_adj & /proc/<pid>/oom_score_adj - Adjust the oom-killer + score + 3.2 /proc/<pid>/oom_score - Display current oom-killer score + 3.3 /proc/<pid>/io - Display the IO accounting fields + 3.4 /proc/<pid>/coredump_filter - Core dump filtering settings + 3.5 /proc/<pid>/mountinfo - Information about mounts + 3.6 /proc/<pid>/comm & /proc/<pid>/task/<tid>/comm + 3.7 /proc/<pid>/task/<tid>/children - Information about task children + 3.8 /proc/<pid>/fdinfo/<fd> - Information about opened file + + 4 Configuring procfs + 4.1 Mount options ------------------------------------------------------------------------------ Preface @@ -72,9 +78,9 @@ contact Bodo Bauer at bb@ricochet.net. We'll be happy to add them to this document. The latest version of this document is available online at -http://skaro.nightcrawler.com/~bb/Docs/Proc as HTML version. +http://tldp.org/LDP/Linux-Filesystem-Hierarchy/html/proc.html -If the above direction does not works for you, ypu could try the kernel +If the above direction does not works for you, you could try the kernel mailing list at linux-kernel@vger.kernel.org and/or try to reach me at comandante@zaralinux.com. @@ -117,60 +123,132 @@ The link self points to the process reading the file system. Each process subdirectory has the entries listed in Table 1-1. -Table 1-1: Process specific entries in /proc +Table 1-1: Process specific entries in /proc .............................................................................. - File Content - cmdline Command line arguments - cpu Current and last cpu in wich it was executed (2.4)(smp) - cwd Link to the current working directory - environ Values of environment variables - exe Link to the executable of this process - fd Directory, which contains all file descriptors - maps Memory maps to executables and library files (2.4) - mem Memory held by this process - root Link to the root directory of this process - stat Process status - statm Process memory status information - status Process status in human readable form - wchan If CONFIG_KALLSYMS is set, a pre-decoded wchan - smaps Extension based on maps, presenting the rss size for each mapped file + File Content + clear_refs Clears page referenced bits shown in smaps output + cmdline Command line arguments + cpu Current and last cpu in which it was executed (2.4)(smp) + cwd Link to the current working directory + environ Values of environment variables + exe Link to the executable of this process + fd Directory, which contains all file descriptors + maps Memory maps to executables and library files (2.4) + mem Memory held by this process + root Link to the root directory of this process + stat Process status + statm Process memory status information + status Process status in human readable form + wchan If CONFIG_KALLSYMS is set, a pre-decoded wchan + pagemap Page table + stack Report full stack trace, enable via CONFIG_STACKTRACE + smaps a extension based on maps, showing the memory consumption of + each mapping and flags associated with it .............................................................................. For example, to get the status information of a process, all you have to do is read the file /proc/PID/status: - >cat /proc/self/status - Name: cat - State: R (running) - Pid: 5452 - PPid: 743 + >cat /proc/self/status + Name: cat + State: R (running) + Tgid: 5452 + Pid: 5452 + PPid: 743 TracerPid: 0 (2.4) - Uid: 501 501 501 501 - Gid: 100 100 100 100 - Groups: 100 14 16 - VmSize: 1112 kB - VmLck: 0 kB - VmRSS: 348 kB - VmData: 24 kB - VmStk: 12 kB - VmExe: 8 kB - VmLib: 1044 kB - SigPnd: 0000000000000000 - SigBlk: 0000000000000000 - SigIgn: 0000000000000000 - SigCgt: 0000000000000000 - CapInh: 00000000fffffeff - CapPrm: 0000000000000000 - CapEff: 0000000000000000 - + Uid: 501 501 501 501 + Gid: 100 100 100 100 + FDSize: 256 + Groups: 100 14 16 + VmPeak: 5004 kB + VmSize: 5004 kB + VmLck: 0 kB + VmHWM: 476 kB + VmRSS: 476 kB + VmData: 156 kB + VmStk: 88 kB + VmExe: 68 kB + VmLib: 1412 kB + VmPTE: 20 kb + VmSwap: 0 kB + Threads: 1 + SigQ: 0/28578 + SigPnd: 0000000000000000 + ShdPnd: 0000000000000000 + SigBlk: 0000000000000000 + SigIgn: 0000000000000000 + SigCgt: 0000000000000000 + CapInh: 00000000fffffeff + CapPrm: 0000000000000000 + CapEff: 0000000000000000 + CapBnd: ffffffffffffffff + Seccomp: 0 + voluntary_ctxt_switches: 0 + nonvoluntary_ctxt_switches: 1 This shows you nearly the same information you would get if you viewed it with the ps command. In fact, ps uses the proc file system to obtain its -information. The statm file contains more detailed information about the -process memory usage. Its seven fields are explained in Table 1-2. +information. But you get a more detailed view of the process by reading the +file /proc/PID/status. It fields are described in table 1-2. + +The statm file contains more detailed information about the process +memory usage. Its seven fields are explained in Table 1-3. The stat file +contains details information about the process itself. Its fields are +explained in Table 1-4. +(for SMP CONFIG users) +For making accounting scalable, RSS related information are handled in +asynchronous manner and the vaule may not be very precise. To see a precise +snapshot of a moment, you can see /proc/<pid>/smaps file and scan page table. +It's slow but very precise. + +Table 1-2: Contents of the status files (as of 2.6.30-rc7) +.............................................................................. + Field Content + Name filename of the executable + State state (R is running, S is sleeping, D is sleeping + in an uninterruptible wait, Z is zombie, + T is traced or stopped) + Tgid thread group ID + Pid process id + PPid process id of the parent process + TracerPid PID of process tracing this process (0 if not) + Uid Real, effective, saved set, and file system UIDs + Gid Real, effective, saved set, and file system GIDs + FDSize number of file descriptor slots currently allocated + Groups supplementary group list + VmPeak peak virtual memory size + VmSize total program size + VmLck locked memory size + VmHWM peak resident set size ("high water mark") + VmRSS size of memory portions + VmData size of data, stack, and text segments + VmStk size of data, stack, and text segments + VmExe size of text segment + VmLib size of shared library code + VmPTE size of page table entries + VmSwap size of swap usage (the number of referred swapents) + Threads number of threads + SigQ number of signals queued/max. number for queue + SigPnd bitmap of pending signals for the thread + ShdPnd bitmap of shared pending signals for the process + SigBlk bitmap of blocked signals + SigIgn bitmap of ignored signals + SigCgt bitmap of caught signals + CapInh bitmap of inheritable capabilities + CapPrm bitmap of permitted capabilities + CapEff bitmap of effective capabilities + CapBnd bitmap of capabilities bounding set + Seccomp seccomp mode, like prctl(PR_GET_SECCOMP, ...) + Cpus_allowed mask of CPUs on which this process may run + Cpus_allowed_list Same as previous, but in "list format" + Mems_allowed mask of memory nodes allowed to this process + Mems_allowed_list Same as previous, but in "list format" + voluntary_ctxt_switches number of voluntary context switches + nonvoluntary_ctxt_switches number of non voluntary context switches +.............................................................................. -Table 1-2: Contents of the statm files (as of 2.6.8-rc3) +Table 1-3: Contents of the statm files (as of 2.6.8-rc3) .............................................................................. Field Content size total program size (pages) (same as VmSize in status) @@ -184,16 +262,248 @@ Table 1-2: Contents of the statm files (as of 2.6.8-rc3) dt number of dirty pages (always 0 on 2.6) .............................................................................. + +Table 1-4: Contents of the stat files (as of 2.6.30-rc7) +.............................................................................. + Field Content + pid process id + tcomm filename of the executable + state state (R is running, S is sleeping, D is sleeping in an + uninterruptible wait, Z is zombie, T is traced or stopped) + ppid process id of the parent process + pgrp pgrp of the process + sid session id + tty_nr tty the process uses + tty_pgrp pgrp of the tty + flags task flags + min_flt number of minor faults + cmin_flt number of minor faults with child's + maj_flt number of major faults + cmaj_flt number of major faults with child's + utime user mode jiffies + stime kernel mode jiffies + cutime user mode jiffies with child's + cstime kernel mode jiffies with child's + priority priority level + nice nice level + num_threads number of threads + it_real_value (obsolete, always 0) + start_time time the process started after system boot + vsize virtual memory size + rss resident set memory size + rsslim current limit in bytes on the rss + start_code address above which program text can run + end_code address below which program text can run + start_stack address of the start of the main process stack + esp current value of ESP + eip current value of EIP + pending bitmap of pending signals + blocked bitmap of blocked signals + sigign bitmap of ignored signals + sigcatch bitmap of caught signals + wchan address where process went to sleep + 0 (place holder) + 0 (place holder) + exit_signal signal to send to parent thread on exit + task_cpu which CPU the task is scheduled on + rt_priority realtime priority + policy scheduling policy (man sched_setscheduler) + blkio_ticks time spent waiting for block IO + gtime guest time of the task in jiffies + cgtime guest time of the task children in jiffies + start_data address above which program data+bss is placed + end_data address below which program data+bss is placed + start_brk address above which program heap can be expanded with brk() + arg_start address above which program command line is placed + arg_end address below which program command line is placed + env_start address above which program environment is placed + env_end address below which program environment is placed + exit_code the thread's exit_code in the form reported by the waitpid system call +.............................................................................. + +The /proc/PID/maps file containing the currently mapped memory regions and +their access permissions. + +The format is: + +address perms offset dev inode pathname + +08048000-08049000 r-xp 00000000 03:00 8312 /opt/test +08049000-0804a000 rw-p 00001000 03:00 8312 /opt/test +0804a000-0806b000 rw-p 00000000 00:00 0 [heap] +a7cb1000-a7cb2000 ---p 00000000 00:00 0 +a7cb2000-a7eb2000 rw-p 00000000 00:00 0 +a7eb2000-a7eb3000 ---p 00000000 00:00 0 +a7eb3000-a7ed5000 rw-p 00000000 00:00 0 [stack:1001] +a7ed5000-a8008000 r-xp 00000000 03:00 4222 /lib/libc.so.6 +a8008000-a800a000 r--p 00133000 03:00 4222 /lib/libc.so.6 +a800a000-a800b000 rw-p 00135000 03:00 4222 /lib/libc.so.6 +a800b000-a800e000 rw-p 00000000 00:00 0 +a800e000-a8022000 r-xp 00000000 03:00 14462 /lib/libpthread.so.0 +a8022000-a8023000 r--p 00013000 03:00 14462 /lib/libpthread.so.0 +a8023000-a8024000 rw-p 00014000 03:00 14462 /lib/libpthread.so.0 +a8024000-a8027000 rw-p 00000000 00:00 0 +a8027000-a8043000 r-xp 00000000 03:00 8317 /lib/ld-linux.so.2 +a8043000-a8044000 r--p 0001b000 03:00 8317 /lib/ld-linux.so.2 +a8044000-a8045000 rw-p 0001c000 03:00 8317 /lib/ld-linux.so.2 +aff35000-aff4a000 rw-p 00000000 00:00 0 [stack] +ffffe000-fffff000 r-xp 00000000 00:00 0 [vdso] + +where "address" is the address space in the process that it occupies, "perms" +is a set of permissions: + + r = read + w = write + x = execute + s = shared + p = private (copy on write) + +"offset" is the offset into the mapping, "dev" is the device (major:minor), and +"inode" is the inode on that device. 0 indicates that no inode is associated +with the memory region, as the case would be with BSS (uninitialized data). +The "pathname" shows the name associated file for this mapping. If the mapping +is not associated with a file: + + [heap] = the heap of the program + [stack] = the stack of the main process + [stack:1001] = the stack of the thread with tid 1001 + [vdso] = the "virtual dynamic shared object", + the kernel system call handler + + or if empty, the mapping is anonymous. + +The /proc/PID/task/TID/maps is a view of the virtual memory from the viewpoint +of the individual tasks of a process. In this file you will see a mapping marked +as [stack] if that task sees it as a stack. This is a key difference from the +content of /proc/PID/maps, where you will see all mappings that are being used +as stack by all of those tasks. Hence, for the example above, the task-level +map, i.e. /proc/PID/task/TID/maps for thread 1001 will look like this: + +08048000-08049000 r-xp 00000000 03:00 8312 /opt/test +08049000-0804a000 rw-p 00001000 03:00 8312 /opt/test +0804a000-0806b000 rw-p 00000000 00:00 0 [heap] +a7cb1000-a7cb2000 ---p 00000000 00:00 0 +a7cb2000-a7eb2000 rw-p 00000000 00:00 0 +a7eb2000-a7eb3000 ---p 00000000 00:00 0 +a7eb3000-a7ed5000 rw-p 00000000 00:00 0 [stack] +a7ed5000-a8008000 r-xp 00000000 03:00 4222 /lib/libc.so.6 +a8008000-a800a000 r--p 00133000 03:00 4222 /lib/libc.so.6 +a800a000-a800b000 rw-p 00135000 03:00 4222 /lib/libc.so.6 +a800b000-a800e000 rw-p 00000000 00:00 0 +a800e000-a8022000 r-xp 00000000 03:00 14462 /lib/libpthread.so.0 +a8022000-a8023000 r--p 00013000 03:00 14462 /lib/libpthread.so.0 +a8023000-a8024000 rw-p 00014000 03:00 14462 /lib/libpthread.so.0 +a8024000-a8027000 rw-p 00000000 00:00 0 +a8027000-a8043000 r-xp 00000000 03:00 8317 /lib/ld-linux.so.2 +a8043000-a8044000 r--p 0001b000 03:00 8317 /lib/ld-linux.so.2 +a8044000-a8045000 rw-p 0001c000 03:00 8317 /lib/ld-linux.so.2 +aff35000-aff4a000 rw-p 00000000 00:00 0 +ffffe000-fffff000 r-xp 00000000 00:00 0 [vdso] + +The /proc/PID/smaps is an extension based on maps, showing the memory +consumption for each of the process's mappings. For each of mappings there +is a series of lines such as the following: + +08048000-080bc000 r-xp 00000000 03:02 13130 /bin/bash +Size: 1084 kB +Rss: 892 kB +Pss: 374 kB +Shared_Clean: 892 kB +Shared_Dirty: 0 kB +Private_Clean: 0 kB +Private_Dirty: 0 kB +Referenced: 892 kB +Anonymous: 0 kB +Swap: 0 kB +KernelPageSize: 4 kB +MMUPageSize: 4 kB +Locked: 374 kB +VmFlags: rd ex mr mw me de + +the first of these lines shows the same information as is displayed for the +mapping in /proc/PID/maps. The remaining lines show the size of the mapping +(size), the amount of the mapping that is currently resident in RAM (RSS), the +process' proportional share of this mapping (PSS), the number of clean and +dirty private pages in the mapping. Note that even a page which is part of a +MAP_SHARED mapping, but has only a single pte mapped, i.e. is currently used +by only one process, is accounted as private and not as shared. "Referenced" +indicates the amount of memory currently marked as referenced or accessed. +"Anonymous" shows the amount of memory that does not belong to any file. Even +a mapping associated with a file may contain anonymous pages: when MAP_PRIVATE +and a page is modified, the file page is replaced by a private anonymous copy. +"Swap" shows how much would-be-anonymous memory is also used, but out on +swap. + +"VmFlags" field deserves a separate description. This member represents the kernel +flags associated with the particular virtual memory area in two letter encoded +manner. The codes are the following: + rd - readable + wr - writeable + ex - executable + sh - shared + mr - may read + mw - may write + me - may execute + ms - may share + gd - stack segment growns down + pf - pure PFN range + dw - disabled write to the mapped file + lo - pages are locked in memory + io - memory mapped I/O area + sr - sequential read advise provided + rr - random read advise provided + dc - do not copy area on fork + de - do not expand area on remapping + ac - area is accountable + nr - swap space is not reserved for the area + ht - area uses huge tlb pages + nl - non-linear mapping + ar - architecture specific flag + dd - do not include area into core dump + sd - soft-dirty flag + mm - mixed map area + hg - huge page advise flag + nh - no-huge page advise flag + mg - mergable advise flag + +Note that there is no guarantee that every flag and associated mnemonic will +be present in all further kernel releases. Things get changed, the flags may +be vanished or the reverse -- new added. + +This file is only present if the CONFIG_MMU kernel configuration option is +enabled. + +The /proc/PID/clear_refs is used to reset the PG_Referenced and ACCESSED/YOUNG +bits on both physical and virtual pages associated with a process, and the +soft-dirty bit on pte (see Documentation/vm/soft-dirty.txt for details). +To clear the bits for all the pages associated with the process + > echo 1 > /proc/PID/clear_refs + +To clear the bits for the anonymous pages associated with the process + > echo 2 > /proc/PID/clear_refs + +To clear the bits for the file mapped pages associated with the process + > echo 3 > /proc/PID/clear_refs + +To clear the soft-dirty bit + > echo 4 > /proc/PID/clear_refs + +Any other value written to /proc/PID/clear_refs will have no effect. + +The /proc/pid/pagemap gives the PFN, which can be used to find the pageflags +using /proc/kpageflags and number of times a page is mapped using +/proc/kpagecount. For detailed explanation, see Documentation/vm/pagemap.txt. + 1.2 Kernel data --------------- Similar to the process entries, the kernel data files give information about the running kernel. The files used to obtain this information are contained in -/proc and are listed in Table 1-3. Not all of these will be present in your +/proc and are listed in Table 1-5. Not all of these will be present in your system. It depends on the kernel configuration and the loaded modules, which files are there, and which are missing. -Table 1-3: Kernel info in /proc +Table 1-5: Kernel info in /proc .............................................................................. File Content apm Advanced power management info @@ -224,20 +534,23 @@ Table 1-3: Kernel info in /proc modules List of loaded modules mounts Mounted filesystems net Networking info (see text) + pagetypeinfo Additional page allocator information (see text) (2.5) partitions Table of partitions known to the system - pci Depreciated info of PCI bus (new way -> /proc/bus/pci/, + pci Deprecated info of PCI bus (new way -> /proc/bus/pci/, decoupled by lspci (2.4) rtc Real time clock scsi SCSI info (see text) slabinfo Slab pool info + softirqs softirq usage stat Overall statistics swaps Swap space utilization sys See chapter 2 sysvipc Info of SysVIPC Resources (msg, sem, shm) (2.4) tty Info of tty drivers - uptime System uptime + uptime Wall clock since boot, combined idle time of all cpus version Kernel version video bttv info of video resources (2.4) + vmallocinfo Show vmalloced areas .............................................................................. You can, for example, check which interrupts are currently in use and what @@ -291,36 +604,82 @@ connects the CPUs in a SMP system. This means that an error has been detected, the IO-APIC automatically retry the transmission, so it should not be a big problem, but you should read the SMP-FAQ. -In this context it could be interesting to note the new irq directory in 2.4. +In 2.6.2* /proc/interrupts was expanded again. This time the goal was for +/proc/interrupts to display every IRQ vector in use by the system, not +just those considered 'most important'. The new vectors are: + + THR -- interrupt raised when a machine check threshold counter + (typically counting ECC corrected errors of memory or cache) exceeds + a configurable threshold. Only available on some systems. + + TRM -- a thermal event interrupt occurs when a temperature threshold + has been exceeded for the CPU. This interrupt may also be generated + when the temperature drops back to normal. + + SPU -- a spurious interrupt is some interrupt that was raised then lowered + by some IO device before it could be fully processed by the APIC. Hence + the APIC sees the interrupt but does not know what device it came from. + For this case the APIC will generate the interrupt with a IRQ vector + of 0xff. This might also be generated by chipset bugs. + + RES, CAL, TLB -- rescheduling, call and TLB flush interrupts are + sent from one CPU to another per the needs of the OS. Typically, + their statistics are used by kernel developers and interested users to + determine the occurrence of interrupts of the given type. + +The above IRQ vectors are displayed only when relevant. For example, +the threshold vector does not exist on x86_64 platforms. Others are +suppressed when the system is a uniprocessor. As of this writing, only +i386 and x86_64 platforms support the new IRQ vector displays. + +Of some interest is the introduction of the /proc/irq directory to 2.4. It could be used to set IRQ to CPU affinity, this means that you can "hook" an IRQ to only one CPU, or to exclude a CPU of handling IRQs. The contents of the -irq subdir is one subdir for each IRQ, and one file; prof_cpu_mask +irq subdir is one subdir for each IRQ, and two files; default_smp_affinity and +prof_cpu_mask. For example > ls /proc/irq/ 0 10 12 14 16 18 2 4 6 8 prof_cpu_mask - 1 11 13 15 17 19 3 5 7 9 + 1 11 13 15 17 19 3 5 7 9 default_smp_affinity > ls /proc/irq/0/ smp_affinity -The contents of the prof_cpu_mask file and each smp_affinity file for each IRQ -is the same by default: +smp_affinity is a bitmask, in which you can specify which CPUs can handle the +IRQ, you can set it by doing: + + > echo 1 > /proc/irq/10/smp_affinity + +This means that only the first CPU will handle the IRQ, but you can also echo +5 which means that only the first and fourth CPU can handle the IRQ. - > cat /proc/irq/0/smp_affinity +The contents of each smp_affinity file is the same by default: + + > cat /proc/irq/0/smp_affinity ffffffff -It's a bitmask, in wich you can specify wich CPUs can handle the IRQ, you can -set it by doing: +There is an alternate interface, smp_affinity_list which allows specifying +a cpu range instead of a bitmask: + + > cat /proc/irq/0/smp_affinity_list + 1024-1031 + +The default_smp_affinity mask applies to all non-active IRQs, which are the +IRQs which have not yet been allocated/activated, and hence which lack a +/proc/irq/[0-9]* directory. - > echo 1 > /proc/irq/prof_cpu_mask +The node file on an SMP system shows the node to which the device using the IRQ +reports itself as being attached. This hardware locality information does not +include information about any possible driver locality preference. -This means that only the first CPU will handle the IRQ, but you can also echo 5 -wich means that only the first and fourth CPU can handle the IRQ. +prof_cpu_mask specifies which CPUs are to be profiled by the system wide +profiler. Default value is ffffffff (all cpus if there are only 32 of them). The way IRQs are routed is handled by the IO-APIC, and it's Round Robin between all the CPUs which are allowed to handle it. As usual the kernel has more info than you and does a better job than you, so the defaults are the -best choice for almost everyone. +best choice for almost everyone. [Note this applies only to those IO-APIC's +that support "Round Robin" interrupt distribution.] There are three more important subdirectories in /proc: net, scsi, and sys. The general rule is that the contents, or even the existence of these @@ -341,7 +700,7 @@ Node 0, zone DMA 0 4 5 4 4 3 ... Node 0, zone Normal 1 0 0 1 101 8 ... Node 0, zone HighMem 2 0 0 1 1 0 ... -Memory fragmentation is a problem under some workloads, and buddyinfo is a +External fragmentation is a problem under some workloads, and buddyinfo is a useful tool for helping diagnose these problems. Buddyinfo will give you a clue as to how big an area you can safely allocate, or why a previous allocation failed. @@ -351,6 +710,48 @@ available. In this case, there are 0 chunks of 2^0*PAGE_SIZE available in ZONE_DMA, 4 chunks of 2^1*PAGE_SIZE in ZONE_DMA, 101 chunks of 2^4*PAGE_SIZE available in ZONE_NORMAL, etc... +More information relevant to external fragmentation can be found in +pagetypeinfo. + +> cat /proc/pagetypeinfo +Page block order: 9 +Pages per block: 512 + +Free pages count per migrate type at order 0 1 2 3 4 5 6 7 8 9 10 +Node 0, zone DMA, type Unmovable 0 0 0 1 1 1 1 1 1 1 0 +Node 0, zone DMA, type Reclaimable 0 0 0 0 0 0 0 0 0 0 0 +Node 0, zone DMA, type Movable 1 1 2 1 2 1 1 0 1 0 2 +Node 0, zone DMA, type Reserve 0 0 0 0 0 0 0 0 0 1 0 +Node 0, zone DMA, type Isolate 0 0 0 0 0 0 0 0 0 0 0 +Node 0, zone DMA32, type Unmovable 103 54 77 1 1 1 11 8 7 1 9 +Node 0, zone DMA32, type Reclaimable 0 0 2 1 0 0 0 0 1 0 0 +Node 0, zone DMA32, type Movable 169 152 113 91 77 54 39 13 6 1 452 +Node 0, zone DMA32, type Reserve 1 2 2 2 2 0 1 1 1 1 0 +Node 0, zone DMA32, type Isolate 0 0 0 0 0 0 0 0 0 0 0 + +Number of blocks type Unmovable Reclaimable Movable Reserve Isolate +Node 0, zone DMA 2 0 5 1 0 +Node 0, zone DMA32 41 6 967 2 0 + +Fragmentation avoidance in the kernel works by grouping pages of different +migrate types into the same contiguous regions of memory called page blocks. +A page block is typically the size of the default hugepage size e.g. 2MB on +X86-64. By keeping pages grouped based on their ability to move, the kernel +can reclaim pages within a page block to satisfy a high-order allocation. + +The pagetypinfo begins with information on the size of a page block. It +then gives the same type of information as buddyinfo except broken down +by migrate-type and finishes with details on how many page blocks of each +type exist. + +If min_free_kbytes has been tuned correctly (recommendations made by hugeadm +from libhugetlbfs http://sourceforge.net/projects/libhugetlbfs/), one can +make an estimate of the likely number of huge pages that can be allocated +at a given point in time. All the "Movable" blocks should be allocatable +unless memory has been mlock()'d. Some of the Reclaimable blocks should +also be allocatable although a lot of filesystem metadata may have to be +reclaimed to achieve this. + .............................................................................. meminfo: @@ -361,9 +762,12 @@ varies by architecture and compile options. The following is from a > cat /proc/meminfo +The "Locked" indicates whether the mapping is locked in memory or not. + MemTotal: 16344972 kB MemFree: 13634064 kB +MemAvailable: 14836172 kB Buffers: 3656 kB Cached: 1195708 kB SwapCached: 0 kB @@ -377,18 +781,33 @@ SwapTotal: 0 kB SwapFree: 0 kB Dirty: 968 kB Writeback: 0 kB +AnonPages: 861800 kB Mapped: 280372 kB -Slab: 684068 kB +Slab: 284364 kB +SReclaimable: 159856 kB +SUnreclaim: 124508 kB +PageTables: 24448 kB +NFS_Unstable: 0 kB +Bounce: 0 kB +WritebackTmp: 0 kB CommitLimit: 7669796 kB Committed_AS: 100056 kB -PageTables: 24448 kB VmallocTotal: 112216 kB VmallocUsed: 428 kB VmallocChunk: 111088 kB +AnonHugePages: 49152 kB MemTotal: Total usable ram (i.e. physical ram minus a few reserved bits and the kernel binary code) MemFree: The sum of LowFree+HighFree +MemAvailable: An estimate of how much memory is available for starting new + applications, without swapping. Calculated from MemFree, + SReclaimable, the size of the file LRU lists, and the low + watermarks in each zone. + The estimate takes into account that the system needs some + page cache to function well, and that not all reclaimable + slab will be reclaimable, due to items being in use. The + impact of those factors will vary from system to system. Buffers: Relatively temporary storage for raw disk blocks shouldn't get tremendously large (20MB or so) Cached: in-memory cache for files read from the disk (the @@ -408,7 +827,7 @@ VmallocChunk: 111088 kB this memory, making it slower to access than lowmem. LowTotal: LowFree: Lowmem is memory which can be used for everything that - highmem can be used for, but it is also availble for the + highmem can be used for, but it is also available for the kernel's use for its own data structures. Among many other things, it is where everything from the Slab is allocated. Bad things happen when you're out of lowmem. @@ -417,15 +836,26 @@ VmallocChunk: 111088 kB on the disk Dirty: Memory which is waiting to get written back to the disk Writeback: Memory which is actively being written back to the disk + AnonPages: Non-file backed pages mapped into userspace page tables +AnonHugePages: Non-file backed huge pages mapped into userspace page tables Mapped: files which have been mmaped, such as libraries - Slab: in-kernel data structures cache + Slab: in-kernel data structures cache +SReclaimable: Part of Slab, that might be reclaimed, such as caches + SUnreclaim: Part of Slab, that cannot be reclaimed on memory pressure + PageTables: amount of memory dedicated to the lowest level of page + tables. +NFS_Unstable: NFS pages sent to the server, but not yet committed to stable + storage + Bounce: Memory used for block device "bounce buffers" +WritebackTmp: Memory used by FUSE for temporary writeback buffers CommitLimit: Based on the overcommit ratio ('vm.overcommit_ratio'), this is the total amount of memory currently available to be allocated on the system. This limit is only adhered to if strict overcommit accounting is enabled (mode 2 in 'vm.overcommit_memory'). The CommitLimit is calculated with the following formula: - CommitLimit = ('vm.overcommit_ratio' * Physical RAM) + Swap + CommitLimit = ([total RAM pages] - [total huge TLB pages]) * + overcommit_ratio / 100 + [total swap pages] For example, on a system with 1G of physical RAM and 7G of swap with a `vm.overcommit_ratio` of 30 it would yield a CommitLimit of 7.3G. @@ -435,21 +865,80 @@ Committed_AS: The amount of memory presently allocated on the system. The committed memory is a sum of all of the memory which has been allocated by processes, even if it has not been "used" by them as of yet. A process which malloc()'s 1G - of memory, but only touches 300M of it will only show up - as using 300M of memory even if it has the address space - allocated for the entire 1G. This 1G is memory which has - been "committed" to by the VM and can be used at any time - by the allocating application. With strict overcommit - enabled on the system (mode 2 in 'vm.overcommit_memory'), - allocations which would exceed the CommitLimit (detailed - above) will not be permitted. This is useful if one needs - to guarantee that processes will not fail due to lack of - memory once that memory has been successfully allocated. - PageTables: amount of memory dedicated to the lowest level of page - tables. + of memory, but only touches 300M of it will show up as + using 1G. This 1G is memory which has been "committed" to + by the VM and can be used at any time by the allocating + application. With strict overcommit enabled on the system + (mode 2 in 'vm.overcommit_memory'),allocations which would + exceed the CommitLimit (detailed above) will not be permitted. + This is useful if one needs to guarantee that processes will + not fail due to lack of memory once that memory has been + successfully allocated. VmallocTotal: total size of vmalloc memory area VmallocUsed: amount of vmalloc area which is used -VmallocChunk: largest contigious block of vmalloc area which is free +VmallocChunk: largest contiguous block of vmalloc area which is free + +.............................................................................. + +vmallocinfo: + +Provides information about vmalloced/vmaped areas. One line per area, +containing the virtual address range of the area, size in bytes, +caller information of the creator, and optional information depending +on the kind of area : + + pages=nr number of pages + phys=addr if a physical address was specified + ioremap I/O mapping (ioremap() and friends) + vmalloc vmalloc() area + vmap vmap()ed pages + user VM_USERMAP area + vpages buffer for pages pointers was vmalloced (huge area) + N<node>=nr (Only on NUMA kernels) + Number of pages allocated on memory node <node> + +> cat /proc/vmallocinfo +0xffffc20000000000-0xffffc20000201000 2101248 alloc_large_system_hash+0x204 ... + /0x2c0 pages=512 vmalloc N0=128 N1=128 N2=128 N3=128 +0xffffc20000201000-0xffffc20000302000 1052672 alloc_large_system_hash+0x204 ... + /0x2c0 pages=256 vmalloc N0=64 N1=64 N2=64 N3=64 +0xffffc20000302000-0xffffc20000304000 8192 acpi_tb_verify_table+0x21/0x4f... + phys=7fee8000 ioremap +0xffffc20000304000-0xffffc20000307000 12288 acpi_tb_verify_table+0x21/0x4f... + phys=7fee7000 ioremap +0xffffc2000031d000-0xffffc2000031f000 8192 init_vdso_vars+0x112/0x210 +0xffffc2000031f000-0xffffc2000032b000 49152 cramfs_uncompress_init+0x2e ... + /0x80 pages=11 vmalloc N0=3 N1=3 N2=2 N3=3 +0xffffc2000033a000-0xffffc2000033d000 12288 sys_swapon+0x640/0xac0 ... + pages=2 vmalloc N1=2 +0xffffc20000347000-0xffffc2000034c000 20480 xt_alloc_table_info+0xfe ... + /0x130 [x_tables] pages=4 vmalloc N0=4 +0xffffffffa0000000-0xffffffffa000f000 61440 sys_init_module+0xc27/0x1d00 ... + pages=14 vmalloc N2=14 +0xffffffffa000f000-0xffffffffa0014000 20480 sys_init_module+0xc27/0x1d00 ... + pages=4 vmalloc N1=4 +0xffffffffa0014000-0xffffffffa0017000 12288 sys_init_module+0xc27/0x1d00 ... + pages=2 vmalloc N1=2 +0xffffffffa0017000-0xffffffffa0022000 45056 sys_init_module+0xc27/0x1d00 ... + pages=10 vmalloc N0=10 + +.............................................................................. + +softirqs: + +Provides counts of softirq handlers serviced since boot time, for each cpu. + +> cat /proc/softirqs + CPU0 CPU1 CPU2 CPU3 + HI: 0 0 0 0 + TIMER: 27166 27120 27097 27034 + NET_TX: 0 0 0 17 + NET_RX: 42 0 0 39 + BLOCK: 0 0 107 1121 + TASKLET: 0 0 0 290 + SCHED: 27035 26983 26971 26746 + HRTIMER: 0 0 0 0 + RCU: 1678 1769 2178 2250 1.3 IDE devices in /proc/ide @@ -469,10 +958,10 @@ IDE devices: More detailed information can be found in the controller specific subdirectories. These are named ide0, ide1 and so on. Each of these -directories contains the files shown in table 1-4. +directories contains the files shown in table 1-6. -Table 1-4: IDE controller info in /proc/ide/ide? +Table 1-6: IDE controller info in /proc/ide/ide? .............................................................................. File Content channel IDE channel (0 or 1) @@ -482,11 +971,11 @@ Table 1-4: IDE controller info in /proc/ide/ide? .............................................................................. Each device connected to a controller has a separate subdirectory in the -controllers directory. The files listed in table 1-5 are contained in these +controllers directory. The files listed in table 1-7 are contained in these directories. -Table 1-5: IDE device information +Table 1-7: IDE device information .............................................................................. File Content cache The cache @@ -528,12 +1017,12 @@ the drive parameters: 1.4 Networking info in /proc/net -------------------------------- -The subdirectory /proc/net follows the usual pattern. Table 1-6 shows the +The subdirectory /proc/net follows the usual pattern. Table 1-8 shows the additional values you get for IP version 6 if you configure the kernel to -support this. Table 1-7 lists the files and their meaning. +support this. Table 1-9 lists the files and their meaning. -Table 1-6: IPv6 info in /proc/net +Table 1-8: IPv6 info in /proc/net .............................................................................. File Content udp6 UDP sockets (IPv6) @@ -548,7 +1037,7 @@ Table 1-6: IPv6 info in /proc/net .............................................................................. -Table 1-7: Network info in /proc/net +Table 1-9: Network info in /proc/net .............................................................................. File Content arp Kernel ARP table @@ -569,7 +1058,6 @@ Table 1-7: Network info in /proc/net snmp SNMP data sockstat Socket statistics tcp TCP sockets - tr_rif Token ring RIF routing table udp UDP sockets unix UNIX domain sockets wireless Wireless interface data (Wavelan etc) @@ -596,7 +1084,7 @@ your system and how much traffic was routed over those devices: ...] 1375103 17405 0 0 0 0 0 0 ...] 1703981 5535 0 0 0 3 0 0 -In addition, each Channel Bond interface has it's own directory. For +In addition, each Channel Bond interface has its own directory. For example, the bond0 device will have a directory called /proc/net/bond0/. It will contain information that is specific to that bond, such as the current slaves of the bond, the link status of the slaves, and how @@ -672,10 +1160,10 @@ The directory /proc/parport contains information about the parallel ports of your system. It has one subdirectory for each port, named after the port number (0,1,2,...). -These directories contain the four files shown in Table 1-8. +These directories contain the four files shown in Table 1-10. -Table 1-8: Files in /proc/parport +Table 1-10: Files in /proc/parport .............................................................................. File Content autoprobe Any IEEE-1284 device ID information that has been acquired. @@ -693,10 +1181,10 @@ Table 1-8: Files in /proc/parport Information about the available and actually used tty's can be found in the directory /proc/tty.You'll find entries for drivers and line disciplines in -this directory, as shown in Table 1-9. +this directory, as shown in Table 1-11. -Table 1-9: Files in /proc/tty +Table 1-11: Files in /proc/tty .............................................................................. File Content drivers list of drivers and their usage @@ -729,15 +1217,16 @@ Various pieces of information about kernel activity are available in the since the system first booted. For a quick look, simply cat the file: > cat /proc/stat - cpu 2255 34 2290 22625563 6290 127 456 - cpu0 1132 34 1441 11311718 3675 127 438 - cpu1 1123 0 849 11313845 2614 0 18 + cpu 2255 34 2290 22625563 6290 127 456 0 0 + cpu0 1132 34 1441 11311718 3675 127 438 0 0 + cpu1 1123 0 849 11313845 2614 0 18 0 0 intr 114930548 113199788 3 0 5 263 0 4 [... lots more numbers ...] ctxt 1990473 btime 1062191376 processes 2915 procs_running 1 procs_blocked 0 + softirq 183433 0 21755 12 39 1137 231 21459 2263 The very first "cpu" line aggregates the numbers in all of the other "cpuN" lines. These numbers identify the amount of time the CPU has spent performing @@ -751,11 +1240,15 @@ second). The meanings of the columns are as follows, from left to right: - iowait: waiting for I/O to complete - irq: servicing interrupts - softirq: servicing softirqs +- steal: involuntary wait +- guest: running a normal guest +- guest_nice: running a niced guest The "intr" line gives counts of interrupts serviced since boot time, for each of the possible system interrupts. The first column is the total of all -interrupts serviced; each subsequent column is the total for that particular -interrupt. +interrupts serviced including unnumbered architecture specific interrupts; +each subsequent column is the total for that particular numbered interrupt. +Unnumbered interrupts are not shown, only summed into the total. The "ctxt" line gives the total number of context switches across all CPUs. @@ -766,12 +1259,57 @@ The "processes" line gives the number of processes and threads created, which includes (but is not limited to) those created by calls to the fork() and clone() system calls. -The "procs_running" line gives the number of processes currently running on -CPUs. +The "procs_running" line gives the total number of threads that are +running or ready to run (i.e., the total number of runnable threads). The "procs_blocked" line gives the number of processes currently blocked, waiting for I/O to complete. +The "softirq" line gives counts of softirqs serviced since boot time, for each +of the possible system softirqs. The first column is the total of all +softirqs serviced; each subsequent column is the total for that particular +softirq. + + +1.9 Ext4 file system parameters +------------------------------ + +Information about mounted ext4 file systems can be found in +/proc/fs/ext4. Each mounted filesystem will have a directory in +/proc/fs/ext4 based on its device name (i.e., /proc/fs/ext4/hdc or +/proc/fs/ext4/dm-0). The files in each per-device directory are shown +in Table 1-12, below. + +Table 1-12: Files in /proc/fs/ext4/<devname> +.............................................................................. + File Content + mb_groups details of multiblock allocator buddy cache of free blocks +.............................................................................. + +2.0 /proc/consoles +------------------ +Shows registered system console lines. + +To see which character device lines are currently used for the system console +/dev/console, you may simply look into the file /proc/consoles: + + > cat /proc/consoles + tty0 -WU (ECp) 4:7 + ttyS0 -W- (Ep) 4:64 + +The columns are: + + device name of the device + operations R = can do read operations + W = can do write operations + U = can do unblank + flags E = it is enabled + C = it is preferred console + B = it is primary boot console + p = it is used for printk buffer + b = it is not a TTY but a Braille device + a = it is safe to use when cpu is offline + major:minor major and minor number of the device separated by a colon ------------------------------------------------------------------------------ Summary @@ -820,1134 +1358,424 @@ review the kernel documentation in the directory /usr/src/linux/Documentation. This chapter is heavily based on the documentation included in the pre 2.2 kernels, and became part of it in version 2.2.1 of the Linux kernel. -2.1 /proc/sys/fs - File system data ------------------------------------ - -This subdirectory contains specific file system, file handle, inode, dentry -and quota information. - -Currently, these files are in /proc/sys/fs: - -dentry-state ------------- - -Status of the directory cache. Since directory entries are dynamically -allocated and deallocated, this file indicates the current status. It holds -six values, in which the last two are not used and are always zero. The others -are listed in table 2-1. - - -Table 2-1: Status files of the directory cache -.............................................................................. - File Content - nr_dentry Almost always zero - nr_unused Number of unused cache entries - age_limit - in seconds after the entry may be reclaimed, when memory is short - want_pages internally -.............................................................................. - -dquot-nr and dquot-max ----------------------- - -The file dquot-max shows the maximum number of cached disk quota entries. - -The file dquot-nr shows the number of allocated disk quota entries and the -number of free disk quota entries. - -If the number of available cached disk quotas is very low and you have a large -number of simultaneous system users, you might want to raise the limit. - -file-nr and file-max --------------------- - -The kernel allocates file handles dynamically, but doesn't free them again at -this time. - -The value in file-max denotes the maximum number of file handles that the -Linux kernel will allocate. When you get a lot of error messages about running -out of file handles, you might want to raise this limit. The default value is -10% of RAM in kilobytes. To change it, just write the new number into the -file: - - # cat /proc/sys/fs/file-max - 4096 - # echo 8192 > /proc/sys/fs/file-max - # cat /proc/sys/fs/file-max - 8192 - - -This method of revision is useful for all customizable parameters of the -kernel - simply echo the new value to the corresponding file. - -Historically, the three values in file-nr denoted the number of allocated file -handles, the number of allocated but unused file handles, and the maximum -number of file handles. Linux 2.6 always reports 0 as the number of free file -handles -- this is not an error, it just means that the number of allocated -file handles exactly matches the number of used file handles. - -Attempts to allocate more file descriptors than file-max are reported with -printk, look for "VFS: file-max limit <number> reached". - -inode-state and inode-nr ------------------------- - -The file inode-nr contains the first two items from inode-state, so we'll skip -to that file... - -inode-state contains two actual numbers and five dummy values. The numbers -are nr_inodes and nr_free_inodes (in order of appearance). - -nr_inodes -~~~~~~~~~ - -Denotes the number of inodes the system has allocated. This number will -grow and shrink dynamically. - -nr_free_inodes --------------- - -Represents the number of free inodes. Ie. The number of inuse inodes is -(nr_inodes - nr_free_inodes). - -aio-nr and aio-max-nr ---------------------- - -aio-nr is the running total of the number of events specified on the -io_setup system call for all currently active aio contexts. If aio-nr -reaches aio-max-nr then io_setup will fail with EAGAIN. Note that -raising aio-max-nr does not result in the pre-allocation or re-sizing -of any kernel data structures. - -2.2 /proc/sys/fs/binfmt_misc - Miscellaneous binary formats ------------------------------------------------------------ - -Besides these files, there is the subdirectory /proc/sys/fs/binfmt_misc. This -handles the kernel support for miscellaneous binary formats. - -Binfmt_misc provides the ability to register additional binary formats to the -Kernel without compiling an additional module/kernel. Therefore, binfmt_misc -needs to know magic numbers at the beginning or the filename extension of the -binary. - -It works by maintaining a linked list of structs that contain a description of -a binary format, including a magic with size (or the filename extension), -offset and mask, and the interpreter name. On request it invokes the given -interpreter with the original program as argument, as binfmt_java and -binfmt_em86 and binfmt_mz do. Since binfmt_misc does not define any default -binary-formats, you have to register an additional binary-format. - -There are two general files in binfmt_misc and one file per registered format. -The two general files are register and status. - -Registering a new binary format -------------------------------- - -To register a new binary format you have to issue the command - - echo :name:type:offset:magic:mask:interpreter: > /proc/sys/fs/binfmt_misc/register - - - -with appropriate name (the name for the /proc-dir entry), offset (defaults to -0, if omitted), magic, mask (which can be omitted, defaults to all 0xff) and -last but not least, the interpreter that is to be invoked (for example and -testing /bin/echo). Type can be M for usual magic matching or E for filename -extension matching (give extension in place of magic). - -Check or reset the status of the binary format handler ------------------------------------------------------- - -If you do a cat on the file /proc/sys/fs/binfmt_misc/status, you will get the -current status (enabled/disabled) of binfmt_misc. Change the status by echoing -0 (disables) or 1 (enables) or -1 (caution: this clears all previously -registered binary formats) to status. For example echo 0 > status to disable -binfmt_misc (temporarily). - -Status of a single handler --------------------------- - -Each registered handler has an entry in /proc/sys/fs/binfmt_misc. These files -perform the same function as status, but their scope is limited to the actual -binary format. By cating this file, you also receive all related information -about the interpreter/magic of the binfmt. - -Example usage of binfmt_misc (emulate binfmt_java) --------------------------------------------------- - - cd /proc/sys/fs/binfmt_misc - echo ':Java:M::\xca\xfe\xba\xbe::/usr/local/java/bin/javawrapper:' > register - echo ':HTML:E::html::/usr/local/java/bin/appletviewer:' > register - echo ':Applet:M::<!--applet::/usr/local/java/bin/appletviewer:' > register - echo ':DEXE:M::\x0eDEX::/usr/bin/dosexec:' > register - - -These four lines add support for Java executables and Java applets (like -binfmt_java, additionally recognizing the .html extension with no need to put -<!--applet> to every applet file). You have to install the JDK and the -shell-script /usr/local/java/bin/javawrapper too. It works around the -brokenness of the Java filename handling. To add a Java binary, just create a -link to the class-file somewhere in the path. - -2.3 /proc/sys/kernel - general kernel parameters ------------------------------------------------- - -This directory reflects general kernel behaviors. As I've said before, the -contents depend on your configuration. Here you'll find the most important -files, along with descriptions of what they mean and how to use them. - -acct ----- - -The file contains three values; highwater, lowwater, and frequency. - -It exists only when BSD-style process accounting is enabled. These values -control its behavior. If the free space on the file system where the log lives -goes below lowwater percentage, accounting suspends. If it goes above -highwater percentage, accounting resumes. Frequency determines how often you -check the amount of free space (value is in seconds). Default settings are: 4, -2, and 30. That is, suspend accounting if there is less than 2 percent free; -resume it if we have a value of 3 or more percent; consider information about -the amount of free space valid for 30 seconds - -ctrl-alt-del ------------- - -When the value in this file is 0, ctrl-alt-del is trapped and sent to the init -program to handle a graceful restart. However, when the value is greater that -zero, Linux's reaction to this key combination will be an immediate reboot, -without syncing its dirty buffers. - -[NOTE] - When a program (like dosemu) has the keyboard in raw mode, the - ctrl-alt-del is intercepted by the program before it ever reaches the - kernel tty layer, and it is up to the program to decide what to do with - it. - -domainname and hostname ------------------------ - -These files can be controlled to set the NIS domainname and hostname of your -box. For the classic darkstar.frop.org a simple: - - # echo "darkstar" > /proc/sys/kernel/hostname - # echo "frop.org" > /proc/sys/kernel/domainname - - -would suffice to set your hostname and NIS domainname. - -osrelease, ostype and version ------------------------------ - -The names make it pretty obvious what these fields contain: - - > cat /proc/sys/kernel/osrelease - 2.2.12 - - > cat /proc/sys/kernel/ostype - Linux - - > cat /proc/sys/kernel/version - #4 Fri Oct 1 12:41:14 PDT 1999 - - -The files osrelease and ostype should be clear enough. Version needs a little -more clarification. The #4 means that this is the 4th kernel built from this -source base and the date after it indicates the time the kernel was built. The -only way to tune these values is to rebuild the kernel. - -panic ------ - -The value in this file represents the number of seconds the kernel waits -before rebooting on a panic. When you use the software watchdog, the -recommended setting is 60. If set to 0, the auto reboot after a kernel panic -is disabled, which is the default setting. - -printk ------- - -The four values in printk denote -* console_loglevel, -* default_message_loglevel, -* minimum_console_loglevel and -* default_console_loglevel -respectively. - -These values influence printk() behavior when printing or logging error -messages, which come from inside the kernel. See syslog(2) for more -information on the different log levels. - -console_loglevel ----------------- - -Messages with a higher priority than this will be printed to the console. - -default_message_level ---------------------- - -Messages without an explicit priority will be printed with this priority. - -minimum_console_loglevel ------------------------- - -Minimum (highest) value to which the console_loglevel can be set. - -default_console_loglevel ------------------------- - -Default value for console_loglevel. - -sg-big-buff ------------ - -This file shows the size of the generic SCSI (sg) buffer. At this point, you -can't tune it yet, but you can change it at compile time by editing -include/scsi/sg.h and changing the value of SG_BIG_BUFF. - -If you use a scanner with SANE (Scanner Access Now Easy) you might want to set -this to a higher value. Refer to the SANE documentation on this issue. - -modprobe --------- - -The location where the modprobe binary is located. The kernel uses this -program to load modules on demand. - -unknown_nmi_panic ------------------ - -The value in this file affects behavior of handling NMI. When the value is -non-zero, unknown NMI is trapped and then panic occurs. At that time, kernel -debugging information is displayed on console. - -NMI switch that most IA32 servers have fires unknown NMI up, for example. -If a system hangs up, try pressing the NMI switch. - -[NOTE] - This function and oprofile share a NMI callback. Therefore this function - cannot be enabled when oprofile is activated. - And NMI watchdog will be disabled when the value in this file is set to - non-zero. - - -2.4 /proc/sys/vm - The virtual memory subsystem ------------------------------------------------ - -The files in this directory can be used to tune the operation of the virtual -memory (VM) subsystem of the Linux kernel. - -vfs_cache_pressure ------------------- - -Controls the tendency of the kernel to reclaim the memory which is used for -caching of directory and inode objects. - -At the default value of vfs_cache_pressure=100 the kernel will attempt to -reclaim dentries and inodes at a "fair" rate with respect to pagecache and -swapcache reclaim. Decreasing vfs_cache_pressure causes the kernel to prefer -to retain dentry and inode caches. Increasing vfs_cache_pressure beyond 100 -causes the kernel to prefer to reclaim dentries and inodes. - -dirty_background_ratio ----------------------- - -Contains, as a percentage of total system memory, the number of pages at which -the pdflush background writeback daemon will start writing out dirty data. - -dirty_ratio ------------------ - -Contains, as a percentage of total system memory, the number of pages at which -a process which is generating disk writes will itself start writing out dirty -data. - -dirty_writeback_centisecs -------------------------- - -The pdflush writeback daemons will periodically wake up and write `old' data -out to disk. This tunable expresses the interval between those wakeups, in -100'ths of a second. - -Setting this to zero disables periodic writeback altogether. - -dirty_expire_centisecs ----------------------- - -This tunable is used to define when dirty data is old enough to be eligible -for writeout by the pdflush daemons. It is expressed in 100'ths of a second. -Data which has been dirty in-memory for longer than this interval will be -written out next time a pdflush daemon wakes up. - -legacy_va_layout ----------------- - -If non-zero, this sysctl disables the new 32-bit mmap mmap layout - the kernel -will use the legacy (2.4) layout for all processes. - -lower_zone_protection ---------------------- - -For some specialised workloads on highmem machines it is dangerous for -the kernel to allow process memory to be allocated from the "lowmem" -zone. This is because that memory could then be pinned via the mlock() -system call, or by unavailability of swapspace. - -And on large highmem machines this lack of reclaimable lowmem memory -can be fatal. - -So the Linux page allocator has a mechanism which prevents allocations -which _could_ use highmem from using too much lowmem. This means that -a certain amount of lowmem is defended from the possibility of being -captured into pinned user memory. - -(The same argument applies to the old 16 megabyte ISA DMA region. This -mechanism will also defend that region from allocations which could use -highmem or lowmem). - -The `lower_zone_protection' tunable determines how aggressive the kernel is -in defending these lower zones. The default value is zero - no -protection at all. - -If you have a machine which uses highmem or ISA DMA and your -applications are using mlock(), or if you are running with no swap then -you probably should increase the lower_zone_protection setting. - -The units of this tunable are fairly vague. It is approximately equal -to "megabytes". So setting lower_zone_protection=100 will protect around 100 -megabytes of the lowmem zone from user allocations. It will also make -those 100 megabytes unavaliable for use by applications and by -pagecache, so there is a cost. - -The effects of this tunable may be observed by monitoring -/proc/meminfo:LowFree. Write a single huge file and observe the point -at which LowFree ceases to fall. - -A reasonable value for lower_zone_protection is 100. - -page-cluster ------------- - -page-cluster controls the number of pages which are written to swap in -a single attempt. The swap I/O size. - -It is a logarithmic value - setting it to zero means "1 page", setting -it to 1 means "2 pages", setting it to 2 means "4 pages", etc. - -The default value is three (eight pages at a time). There may be some -small benefits in tuning this to a different value if your workload is -swap-intensive. - -overcommit_memory ------------------ - -Controls overcommit of system memory, possibly allowing processes -to allocate (but not use) more memory than is actually available. - - -0 - Heuristic overcommit handling. Obvious overcommits of - address space are refused. Used for a typical system. It - ensures a seriously wild allocation fails while allowing - overcommit to reduce swap usage. root is allowed to - allocate slighly more memory in this mode. This is the - default. - -1 - Always overcommit. Appropriate for some scientific - applications. - -2 - Don't overcommit. The total address space commit - for the system is not permitted to exceed swap plus a - configurable percentage (default is 50) of physical RAM. - Depending on the percentage you use, in most situations - this means a process will not be killed while attempting - to use already-allocated memory but will receive errors - on memory allocation as appropriate. - -overcommit_ratio ----------------- - -Percentage of physical memory size to include in overcommit calculations -(see above.) - -Memory allocation limit = swapspace + physmem * (overcommit_ratio / 100) - - swapspace = total size of all swap areas - physmem = size of physical memory in system - -nr_hugepages and hugetlb_shm_group ----------------------------------- - -nr_hugepages configures number of hugetlb page reserved for the system. - -hugetlb_shm_group contains group id that is allowed to create SysV shared -memory segment using hugetlb page. - -laptop_mode ------------ - -laptop_mode is a knob that controls "laptop mode". All the things that are -controlled by this knob are discussed in Documentation/laptop-mode.txt. - -block_dump ----------- - -block_dump enables block I/O debugging when set to a nonzero value. More -information on block I/O debugging is in Documentation/laptop-mode.txt. +Please see: Documentation/sysctl/ directory for descriptions of these +entries. -swap_token_timeout ------------------- - -This file contains valid hold time of swap out protection token. The Linux -VM has token based thrashing control mechanism and uses the token to prevent -unnecessary page faults in thrashing situation. The unit of the value is -second. The value would be useful to tune thrashing behavior. - -2.5 /proc/sys/dev - Device specific parameters ----------------------------------------------- - -Currently there is only support for CDROM drives, and for those, there is only -one read-only file containing information about the CD-ROM drives attached to -the system: - - >cat /proc/sys/dev/cdrom/info - CD-ROM information, Id: cdrom.c 2.55 1999/04/25 - - drive name: sr0 hdb - drive speed: 32 40 - drive # of slots: 1 0 - Can close tray: 1 1 - Can open tray: 1 1 - Can lock tray: 1 1 - Can change speed: 1 1 - Can select disk: 0 1 - Can read multisession: 1 1 - Can read MCN: 1 1 - Reports media changed: 1 1 - Can play audio: 1 1 - - -You see two drives, sr0 and hdb, along with a list of their features. - -2.6 /proc/sys/sunrpc - Remote procedure calls ---------------------------------------------- - -This directory contains four files, which enable or disable debugging for the -RPC functions NFS, NFS-daemon, RPC and NLM. The default values are 0. They can -be set to one to turn debugging on. (The default value is 0 for each) - -2.7 /proc/sys/net - Networking stuff ------------------------------------- - -The interface to the networking parts of the kernel is located in -/proc/sys/net. Table 2-3 shows all possible subdirectories. You may see only -some of them, depending on your kernel's configuration. - - -Table 2-3: Subdirectories in /proc/sys/net -.............................................................................. - Directory Content Directory Content - core General parameter appletalk Appletalk protocol - unix Unix domain sockets netrom NET/ROM - 802 E802 protocol ax25 AX25 - ethernet Ethernet protocol rose X.25 PLP layer - ipv4 IP version 4 x25 X.25 protocol - ipx IPX token-ring IBM token ring - bridge Bridging decnet DEC net - ipv6 IP version 6 -.............................................................................. - -We will concentrate on IP networking here. Since AX15, X.25, and DEC Net are -only minor players in the Linux world, we'll skip them in this chapter. You'll -find some short info on Appletalk and IPX further on in this chapter. Review -the online documentation and the kernel source to get a detailed view of the -parameters for those protocols. In this section we'll discuss the -subdirectories printed in bold letters in the table above. As default values -are suitable for most needs, there is no need to change these values. - -/proc/sys/net/core - Network core options ------------------------------------------ - -rmem_default ------------- - -The default setting of the socket receive buffer in bytes. - -rmem_max --------- - -The maximum receive socket buffer size in bytes. - -wmem_default ------------- - -The default setting (in bytes) of the socket send buffer. - -wmem_max --------- - -The maximum send socket buffer size in bytes. - -message_burst and message_cost ------------------------------- - -These parameters are used to limit the warning messages written to the kernel -log from the networking code. They enforce a rate limit to make a -denial-of-service attack impossible. A higher message_cost factor, results in -fewer messages that will be written. Message_burst controls when messages will -be dropped. The default settings limit warning messages to one every five -seconds. - -netdev_max_backlog ------------------- - -Maximum number of packets, queued on the INPUT side, when the interface -receives packets faster than kernel can process them. - -optmem_max ----------- +------------------------------------------------------------------------------ +Summary +------------------------------------------------------------------------------ +Certain aspects of kernel behavior can be modified at runtime, without the +need to recompile the kernel, or even to reboot the system. The files in the +/proc/sys tree can not only be read, but also modified. You can use the echo +command to write value into these files, thereby changing the default settings +of the kernel. +------------------------------------------------------------------------------ -Maximum ancillary buffer size allowed per socket. Ancillary data is a sequence -of struct cmsghdr structures with appended data. +------------------------------------------------------------------------------ +CHAPTER 3: PER-PROCESS PARAMETERS +------------------------------------------------------------------------------ -/proc/sys/net/unix - Parameters for Unix domain sockets +3.1 /proc/<pid>/oom_adj & /proc/<pid>/oom_score_adj- Adjust the oom-killer score +-------------------------------------------------------------------------------- + +These file can be used to adjust the badness heuristic used to select which +process gets killed in out of memory conditions. + +The badness heuristic assigns a value to each candidate task ranging from 0 +(never kill) to 1000 (always kill) to determine which process is targeted. The +units are roughly a proportion along that range of allowed memory the process +may allocate from based on an estimation of its current memory and swap use. +For example, if a task is using all allowed memory, its badness score will be +1000. If it is using half of its allowed memory, its score will be 500. + +There is an additional factor included in the badness score: the current memory +and swap usage is discounted by 3% for root processes. + +The amount of "allowed" memory depends on the context in which the oom killer +was called. If it is due to the memory assigned to the allocating task's cpuset +being exhausted, the allowed memory represents the set of mems assigned to that +cpuset. If it is due to a mempolicy's node(s) being exhausted, the allowed +memory represents the set of mempolicy nodes. If it is due to a memory +limit (or swap limit) being reached, the allowed memory is that configured +limit. Finally, if it is due to the entire system being out of memory, the +allowed memory represents all allocatable resources. + +The value of /proc/<pid>/oom_score_adj is added to the badness score before it +is used to determine which task to kill. Acceptable values range from -1000 +(OOM_SCORE_ADJ_MIN) to +1000 (OOM_SCORE_ADJ_MAX). This allows userspace to +polarize the preference for oom killing either by always preferring a certain +task or completely disabling it. The lowest possible value, -1000, is +equivalent to disabling oom killing entirely for that task since it will always +report a badness score of 0. + +Consequently, it is very simple for userspace to define the amount of memory to +consider for each task. Setting a /proc/<pid>/oom_score_adj value of +500, for +example, is roughly equivalent to allowing the remainder of tasks sharing the +same system, cpuset, mempolicy, or memory controller resources to use at least +50% more memory. A value of -500, on the other hand, would be roughly +equivalent to discounting 50% of the task's allowed memory from being considered +as scoring against the task. + +For backwards compatibility with previous kernels, /proc/<pid>/oom_adj may also +be used to tune the badness score. Its acceptable values range from -16 +(OOM_ADJUST_MIN) to +15 (OOM_ADJUST_MAX) and a special value of -17 +(OOM_DISABLE) to disable oom killing entirely for that task. Its value is +scaled linearly with /proc/<pid>/oom_score_adj. + +The value of /proc/<pid>/oom_score_adj may be reduced no lower than the last +value set by a CAP_SYS_RESOURCE process. To reduce the value any lower +requires CAP_SYS_RESOURCE. + +Caveat: when a parent task is selected, the oom killer will sacrifice any first +generation children with separate address spaces instead, if possible. This +avoids servers and important system daemons from being killed and loses the +minimal amount of work. + + +3.2 /proc/<pid>/oom_score - Display current oom-killer score +------------------------------------------------------------- + +This file can be used to check the current score used by the oom-killer is for +any given <pid>. Use it together with /proc/<pid>/oom_score_adj to tune which +process should be killed in an out-of-memory situation. + + +3.3 /proc/<pid>/io - Display the IO accounting fields ------------------------------------------------------- -There are only two files in this subdirectory. They control the delays for -deleting and destroying socket descriptors. - -2.8 /proc/sys/net/ipv4 - IPV4 settings --------------------------------------- - -IP version 4 is still the most used protocol in Unix networking. It will be -replaced by IP version 6 in the next couple of years, but for the moment it's -the de facto standard for the internet and is used in most networking -environments around the world. Because of the importance of this protocol, -we'll have a deeper look into the subtree controlling the behavior of the IPv4 -subsystem of the Linux kernel. - -Let's start with the entries in /proc/sys/net/ipv4. - -ICMP settings -------------- - -icmp_echo_ignore_all and icmp_echo_ignore_broadcasts ----------------------------------------------------- - -Turn on (1) or off (0), if the kernel should ignore all ICMP ECHO requests, or -just those to broadcast and multicast addresses. +This file contains IO statistics for each running process -Please note that if you accept ICMP echo requests with a broadcast/multi\-cast -destination address your network may be used as an exploder for denial of -service packet flooding attacks to other hosts. - -icmp_destunreach_rate, icmp_echoreply_rate, icmp_paramprob_rate and icmp_timeexeed_rate ---------------------------------------------------------------------------------------- - -Sets limits for sending ICMP packets to specific targets. A value of zero -disables all limiting. Any positive value sets the maximum package rate in -hundredth of a second (on Intel systems). - -IP settings ------------ - -ip_autoconfig -------------- - -This file contains the number one if the host received its IP configuration by -RARP, BOOTP, DHCP or a similar mechanism. Otherwise it is zero. - -ip_default_ttl --------------- - -TTL (Time To Live) for IPv4 interfaces. This is simply the maximum number of -hops a packet may travel. - -ip_dynaddr ----------- - -Enable dynamic socket address rewriting on interface address change. This is -useful for dialup interface with changing IP addresses. - -ip_forward ----------- - -Enable or disable forwarding of IP packages between interfaces. Changing this -value resets all other parameters to their default values. They differ if the -kernel is configured as host or router. - -ip_local_port_range -------------------- - -Range of ports used by TCP and UDP to choose the local port. Contains two -numbers, the first number is the lowest port, the second number the highest -local port. Default is 1024-4999. Should be changed to 32768-61000 for -high-usage systems. - -ip_no_pmtu_disc ---------------- - -Global switch to turn path MTU discovery off. It can also be set on a per -socket basis by the applications or on a per route basis. - -ip_masq_debug -------------- - -Enable/disable debugging of IP masquerading. - -IP fragmentation settings -------------------------- - -ipfrag_high_trash and ipfrag_low_trash --------------------------------------- - -Maximum memory used to reassemble IP fragments. When ipfrag_high_thresh bytes -of memory is allocated for this purpose, the fragment handler will toss -packets until ipfrag_low_thresh is reached. - -ipfrag_time ------------ - -Time in seconds to keep an IP fragment in memory. - -TCP settings ------------- - -tcp_ecn +Example ------- -This file controls the use of the ECN bit in the IPv4 headers, this is a new -feature about Explicit Congestion Notification, but some routers and firewalls -block trafic that has this bit set, so it could be necessary to echo 0 to -/proc/sys/net/ipv4/tcp_ecn, if you want to talk to this sites. For more info -you could read RFC2481. - -tcp_retrans_collapse --------------------- - -Bug-to-bug compatibility with some broken printers. On retransmit, try to send -larger packets to work around bugs in certain TCP stacks. Can be turned off by -setting it to zero. - -tcp_keepalive_probes --------------------- - -Number of keep alive probes TCP sends out, until it decides that the -connection is broken. - -tcp_keepalive_time ------------------- - -How often TCP sends out keep alive messages, when keep alive is enabled. The -default is 2 hours. +test:/tmp # dd if=/dev/zero of=/tmp/test.dat & +[1] 3828 -tcp_syn_retries ---------------- - -Number of times initial SYNs for a TCP connection attempt will be -retransmitted. Should not be higher than 255. This is only the timeout for -outgoing connections, for incoming connections the number of retransmits is -defined by tcp_retries1. - -tcp_sack --------- - -Enable select acknowledgments after RFC2018. +test:/tmp # cat /proc/3828/io +rchar: 323934931 +wchar: 323929600 +syscr: 632687 +syscw: 632675 +read_bytes: 0 +write_bytes: 323932160 +cancelled_write_bytes: 0 -tcp_timestamps --------------- - -Enable timestamps as defined in RFC1323. - -tcp_stdurg ----------- - -Enable the strict RFC793 interpretation of the TCP urgent pointer field. The -default is to use the BSD compatible interpretation of the urgent pointer -pointing to the first byte after the urgent data. The RFC793 interpretation is -to have it point to the last byte of urgent data. Enabling this option may -lead to interoperatibility problems. Disabled by default. - -tcp_syncookies --------------- - -Only valid when the kernel was compiled with CONFIG_SYNCOOKIES. Send out -syncookies when the syn backlog queue of a socket overflows. This is to ward -off the common 'syn flood attack'. Disabled by default. - -Note that the concept of a socket backlog is abandoned. This means the peer -may not receive reliable error messages from an over loaded server with -syncookies enabled. - -tcp_window_scaling ------------------- -Enable window scaling as defined in RFC1323. - -tcp_fin_timeout ---------------- - -The length of time in seconds it takes to receive a final FIN before the -socket is always closed. This is strictly a violation of the TCP -specification, but required to prevent denial-of-service attacks. - -tcp_max_ka_probes ------------------ - -Indicates how many keep alive probes are sent per slow timer run. Should not -be set too high to prevent bursts. - -tcp_max_syn_backlog -------------------- - -Length of the per socket backlog queue. Since Linux 2.2 the backlog specified -in listen(2) only specifies the length of the backlog queue of already -established sockets. When more connection requests arrive Linux starts to drop -packets. When syncookies are enabled the packets are still answered and the -maximum queue is effectively ignored. - -tcp_retries1 ------------- - -Defines how often an answer to a TCP connection request is retransmitted -before giving up. - -tcp_retries2 ------------- - -Defines how often a TCP packet is retransmitted before giving up. - -Interface specific settings ---------------------------- - -In the directory /proc/sys/net/ipv4/conf you'll find one subdirectory for each -interface the system knows about and one directory calls all. Changes in the -all subdirectory affect all interfaces, whereas changes in the other -subdirectories affect only one interface. All directories have the same -entries: - -accept_redirects ----------------- - -This switch decides if the kernel accepts ICMP redirect messages or not. The -default is 'yes' if the kernel is configured for a regular host and 'no' for a -router configuration. - -accept_source_route -------------------- - -Should source routed packages be accepted or declined. The default is -dependent on the kernel configuration. It's 'yes' for routers and 'no' for -hosts. - -bootp_relay -~~~~~~~~~~~ - -Accept packets with source address 0.b.c.d with destinations not to this host -as local ones. It is supposed that a BOOTP relay daemon will catch and forward -such packets. - -The default is 0, since this feature is not implemented yet (kernel version -2.2.12). - -forwarding ----------- - -Enable or disable IP forwarding on this interface. - -log_martians ------------- - -Log packets with source addresses with no known route to kernel log. - -mc_forwarding -------------- - -Do multicast routing. The kernel needs to be compiled with CONFIG_MROUTE and a -multicast routing daemon is required. - -proxy_arp ---------- - -Does (1) or does not (0) perform proxy ARP. - -rp_filter ---------- - -Integer value determines if a source validation should be made. 1 means yes, 0 -means no. Disabled by default, but local/broadcast address spoofing is always -on. - -If you set this to 1 on a router that is the only connection for a network to -the net, it will prevent spoofing attacks against your internal networks -(external addresses can still be spoofed), without the need for additional -firewall rules. - -secure_redirects ----------------- - -Accept ICMP redirect messages only for gateways, listed in default gateway -list. Enabled by default. - -shared_media ------------- - -If it is not set the kernel does not assume that different subnets on this -device can communicate directly. Default setting is 'yes'. - -send_redirects --------------- - -Determines whether to send ICMP redirects to other hosts. - -Routing settings ----------------- - -The directory /proc/sys/net/ipv4/route contains several file to control -routing issues. - -error_burst and error_cost --------------------------- - -These parameters are used to limit how many ICMP destination unreachable to -send from the host in question. ICMP destination unreachable messages are -sent when we can not reach the next hop, while trying to transmit a packet. -It will also print some error messages to kernel logs if someone is ignoring -our ICMP redirects. The higher the error_cost factor is, the fewer -destination unreachable and error messages will be let through. Error_burst -controls when destination unreachable messages and error messages will be -dropped. The default settings limit warning messages to five every second. +Description +----------- -flush +rchar ----- -Writing to this file results in a flush of the routing cache. - -gc_elasticity, gc_interval, gc_min_interval_ms, gc_timeout, gc_thresh ---------------------------------------------------------------------- +I/O counter: chars read +The number of bytes which this task has caused to be read from storage. This +is simply the sum of bytes which this process passed to read() and pread(). +It includes things like tty IO and it is unaffected by whether or not actual +physical disk IO was required (the read might have been satisfied from +pagecache) -Values to control the frequency and behavior of the garbage collection -algorithm for the routing cache. gc_min_interval is deprecated and replaced -by gc_min_interval_ms. +wchar +----- -max_size --------- - -Maximum size of the routing cache. Old entries will be purged once the cache -reached has this size. - -max_delay, min_delay --------------------- - -Delays for flushing the routing cache. - -redirect_load, redirect_number ------------------------------- - -Factors which determine if more ICPM redirects should be sent to a specific -host. No redirects will be sent once the load limit or the maximum number of -redirects has been reached. - -redirect_silence ----------------- - -Timeout for redirects. After this period redirects will be sent again, even if -this has been stopped, because the load or number limit has been reached. - -Network Neighbor handling -------------------------- - -Settings about how to handle connections with direct neighbors (nodes attached -to the same link) can be found in the directory /proc/sys/net/ipv4/neigh. - -As we saw it in the conf directory, there is a default subdirectory which -holds the default values, and one directory for each interface. The contents -of the directories are identical, with the single exception that the default -settings contain additional options to set garbage collection parameters. +I/O counter: chars written +The number of bytes which this task has caused, or shall cause to be written +to disk. Similar caveats apply here as with rchar. -In the interface directories you'll find the following entries: -base_reachable_time, base_reachable_time_ms -------------------------------------------- +syscr +----- -A base value used for computing the random reachable time value as specified -in RFC2461. +I/O counter: read syscalls +Attempt to count the number of read I/O operations, i.e. syscalls like read() +and pread(). -Expression of base_reachable_time, which is deprecated, is in seconds. -Expression of base_reachable_time_ms is in milliseconds. -retrans_time, retrans_time_ms ------------------------------ +syscw +----- -The time between retransmitted Neighbor Solicitation messages. -Used for address resolution and to determine if a neighbor is -unreachable. +I/O counter: write syscalls +Attempt to count the number of write I/O operations, i.e. syscalls like +write() and pwrite(). -Expression of retrans_time, which is deprecated, is in 1/100 seconds (for -IPv4) or in jiffies (for IPv6). -Expression of retrans_time_ms is in milliseconds. -unres_qlen +read_bytes ---------- -Maximum queue length for a pending arp request - the number of packets which -are accepted from other layers while the ARP address is still resolved. - -anycast_delay -------------- - -Maximum for random delay of answers to neighbor solicitation messages in -jiffies (1/100 sec). Not yet implemented (Linux does not have anycast support -yet). - -ucast_solicit -------------- - -Maximum number of retries for unicast solicitation. - -mcast_solicit -------------- - -Maximum number of retries for multicast solicitation. - -delay_first_probe_time ----------------------- - -Delay for the first time probe if the neighbor is reachable. (see -gc_stale_time) +I/O counter: bytes read +Attempt to count the number of bytes which this process really did cause to +be fetched from the storage layer. Done at the submit_bio() level, so it is +accurate for block-backed filesystems. <please add status regarding NFS and +CIFS at a later time> -locktime --------- -An ARP/neighbor entry is only replaced with a new one if the old is at least -locktime old. This prevents ARP cache thrashing. - -proxy_delay +write_bytes ----------- -Maximum time (real time is random [0..proxytime]) before answering to an ARP -request for which we have an proxy ARP entry. In some cases, this is used to -prevent network flooding. - -proxy_qlen ----------- - -Maximum queue length of the delayed proxy arp timer. (see proxy_delay). - -app_solcit ----------- +I/O counter: bytes written +Attempt to count the number of bytes which this process caused to be sent to +the storage layer. This is done at page-dirtying time. -Determines the number of requests to send to the user level ARP daemon. Use 0 -to turn off. -gc_stale_time -------------- - -Determines how often to check for stale ARP entries. After an ARP entry is -stale it will be resolved again (which is useful when an IP address migrates -to another machine). When ucast_solicit is greater than 0 it first tries to -send an ARP packet directly to the known host When that fails and -mcast_solicit is greater than 0, an ARP request is broadcasted. - -2.9 Appletalk -------------- - -The /proc/sys/net/appletalk directory holds the Appletalk configuration data -when Appletalk is loaded. The configurable parameters are: - -aarp-expiry-time ----------------- - -The amount of time we keep an ARP entry before expiring it. Used to age out -old hosts. - -aarp-resolve-time ------------------ - -The amount of time we will spend trying to resolve an Appletalk address. - -aarp-retransmit-limit +cancelled_write_bytes --------------------- -The number of times we will retransmit a query before giving up. - -aarp-tick-time --------------- - -Controls the rate at which expires are checked. +The big inaccuracy here is truncate. If a process writes 1MB to a file and +then deletes the file, it will in fact perform no writeout. But it will have +been accounted as having caused 1MB of write. +In other words: The number of bytes which this process caused to not happen, +by truncating pagecache. A task can cause "negative" IO too. If this task +truncates some dirty pagecache, some IO which another task has been accounted +for (in its write_bytes) will not be happening. We _could_ just subtract that +from the truncating task's write_bytes, but there is information loss in doing +that. -The directory /proc/net/appletalk holds the list of active Appletalk sockets -on a machine. -The fields indicate the DDP type, the local address (in network:node format) -the remote address, the size of the transmit pending queue, the size of the -received queue (bytes waiting for applications to read) the state and the uid -owning the socket. - -/proc/net/atalk_iface lists all the interfaces configured for appletalk.It -shows the name of the interface, its Appletalk address, the network range on -that address (or network number for phase 1 networks), and the status of the -interface. +Note +---- -/proc/net/atalk_route lists each known network route. It lists the target -(network) that the route leads to, the router (may be directly connected), the -route flags, and the device the route is using. +At its current implementation state, this is a bit racy on 32-bit machines: if +process A reads process B's /proc/pid/io while process B is updating one of +those 64-bit counters, process A could see an intermediate result. -2.10 IPX --------- -The IPX protocol has no tunable values in proc/sys/net. +More information about this can be found within the taskstats documentation in +Documentation/accounting. -The IPX protocol does, however, provide proc/net/ipx. This lists each IPX -socket giving the local and remote addresses in Novell format (that is -network:node:port). In accordance with the strange Novell tradition, -everything but the port is in hex. Not_Connected is displayed for sockets that -are not tied to a specific remote address. The Tx and Rx queue sizes indicate -the number of bytes pending for transmission and reception. The state -indicates the state the socket is in and the uid is the owning uid of the -socket. +3.4 /proc/<pid>/coredump_filter - Core dump filtering settings +--------------------------------------------------------------- +When a process is dumped, all anonymous memory is written to a core file as +long as the size of the core file isn't limited. But sometimes we don't want +to dump some memory segments, for example, huge shared memory. Conversely, +sometimes we want to save file-backed memory segments into a core file, not +only the individual files. -The /proc/net/ipx_interface file lists all IPX interfaces. For each interface -it gives the network number, the node number, and indicates if the network is -the primary network. It also indicates which device it is bound to (or -Internal for internal networks) and the Frame Type if appropriate. Linux -supports 802.3, 802.2, 802.2 SNAP and DIX (Blue Book) ethernet framing for -IPX. +/proc/<pid>/coredump_filter allows you to customize which memory segments +will be dumped when the <pid> process is dumped. coredump_filter is a bitmask +of memory types. If a bit of the bitmask is set, memory segments of the +corresponding memory type are dumped, otherwise they are not dumped. -The /proc/net/ipx_route table holds a list of IPX routes. For each route it -gives the destination network, the router node (or Directly) and the network -address of the router (or Connected) for internal networks. +The following 7 memory types are supported: + - (bit 0) anonymous private memory + - (bit 1) anonymous shared memory + - (bit 2) file-backed private memory + - (bit 3) file-backed shared memory + - (bit 4) ELF header pages in file-backed private memory areas (it is + effective only if the bit 2 is cleared) + - (bit 5) hugetlb private memory + - (bit 6) hugetlb shared memory + + Note that MMIO pages such as frame buffer are never dumped and vDSO pages + are always dumped regardless of the bitmask status. -2.11 /proc/sys/fs/mqueue - POSIX message queues filesystem ----------------------------------------------------------- + Note bit 0-4 doesn't effect any hugetlb memory. hugetlb memory are only + effected by bit 5-6. -The "mqueue" filesystem provides the necessary kernel features to enable the -creation of a user space library that implements the POSIX message queues -API (as noted by the MSG tag in the POSIX 1003.1-2001 version of the System -Interfaces specification.) +Default value of coredump_filter is 0x23; this means all anonymous memory +segments and hugetlb private memory are dumped. + +If you don't want to dump all shared memory segments attached to pid 1234, +write 0x21 to the process's proc file. + + $ echo 0x21 > /proc/1234/coredump_filter + +When a new process is created, the process inherits the bitmask status from its +parent. It is useful to set up coredump_filter before the program runs. +For example: + + $ echo 0x7 > /proc/self/coredump_filter + $ ./some_program + +3.5 /proc/<pid>/mountinfo - Information about mounts +-------------------------------------------------------- + +This file contains lines of the form: + +36 35 98:0 /mnt1 /mnt2 rw,noatime master:1 - ext3 /dev/root rw,errors=continue +(1)(2)(3) (4) (5) (6) (7) (8) (9) (10) (11) + +(1) mount ID: unique identifier of the mount (may be reused after umount) +(2) parent ID: ID of parent (or of self for the top of the mount tree) +(3) major:minor: value of st_dev for files on filesystem +(4) root: root of the mount within the filesystem +(5) mount point: mount point relative to the process's root +(6) mount options: per mount options +(7) optional fields: zero or more fields of the form "tag[:value]" +(8) separator: marks the end of the optional fields +(9) filesystem type: name of filesystem of the form "type[.subtype]" +(10) mount source: filesystem specific information or "none" +(11) super options: per super block options + +Parsers should ignore all unrecognised optional fields. Currently the +possible optional fields are: + +shared:X mount is shared in peer group X +master:X mount is slave to peer group X +propagate_from:X mount is slave and receives propagation from peer group X (*) +unbindable mount is unbindable + +(*) X is the closest dominant peer group under the process's root. If +X is the immediate master of the mount, or if there's no dominant peer +group under the same root, then only the "master:X" field is present +and not the "propagate_from:X" field. + +For more information on mount propagation see: + + Documentation/filesystems/sharedsubtree.txt -The "mqueue" filesystem contains values for determining/setting the amount of -resources used by the file system. -/proc/sys/fs/mqueue/queues_max is a read/write file for setting/getting the -maximum number of message queues allowed on the system. +3.6 /proc/<pid>/comm & /proc/<pid>/task/<tid>/comm +-------------------------------------------------------- +These files provide a method to access a tasks comm value. It also allows for +a task to set its own or one of its thread siblings comm value. The comm value +is limited in size compared to the cmdline value, so writing anything longer +then the kernel's TASK_COMM_LEN (currently 16 chars) will result in a truncated +comm value. + + +3.7 /proc/<pid>/task/<tid>/children - Information about task children +------------------------------------------------------------------------- +This file provides a fast way to retrieve first level children pids +of a task pointed by <pid>/<tid> pair. The format is a space separated +stream of pids. + +Note the "first level" here -- if a child has own children they will +not be listed here, one needs to read /proc/<children-pid>/task/<tid>/children +to obtain the descendants. + +Since this interface is intended to be fast and cheap it doesn't +guarantee to provide precise results and some children might be +skipped, especially if they've exited right after we printed their +pids, so one need to either stop or freeze processes being inspected +if precise results are needed. + + +3.8 /proc/<pid>/fdinfo/<fd> - Information about opened file +--------------------------------------------------------------- +This file provides information associated with an opened file. The regular +files have at least three fields -- 'pos', 'flags' and mnt_id. The 'pos' +represents the current offset of the opened file in decimal form [see lseek(2) +for details], 'flags' denotes the octal O_xxx mask the file has been +created with [see open(2) for details] and 'mnt_id' represents mount ID of +the file system containing the opened file [see 3.5 /proc/<pid>/mountinfo +for details]. + +A typical output is -/proc/sys/fs/mqueue/msg_max is a read/write file for setting/getting the -maximum number of messages in a queue value. In fact it is the limiting value -for another (user) limit which is set in mq_open invocation. This attribute of -a queue must be less or equal then msg_max. + pos: 0 + flags: 0100002 + mnt_id: 19 -/proc/sys/fs/mqueue/msgsize_max is a read/write file for setting/getting the -maximum message size value (it is every message queue's attribute set during -its creation). +The files such as eventfd, fsnotify, signalfd, epoll among the regular pos/flags +pair provide additional information particular to the objects they represent. + + Eventfd files + ~~~~~~~~~~~~~ + pos: 0 + flags: 04002 + mnt_id: 9 + eventfd-count: 5a + + where 'eventfd-count' is hex value of a counter. + + Signalfd files + ~~~~~~~~~~~~~~ + pos: 0 + flags: 04002 + mnt_id: 9 + sigmask: 0000000000000200 + + where 'sigmask' is hex value of the signal mask associated + with a file. + + Epoll files + ~~~~~~~~~~~ + pos: 0 + flags: 02 + mnt_id: 9 + tfd: 5 events: 1d data: ffffffffffffffff + + where 'tfd' is a target file descriptor number in decimal form, + 'events' is events mask being watched and the 'data' is data + associated with a target [see epoll(7) for more details]. + + Fsnotify files + ~~~~~~~~~~~~~~ + For inotify files the format is the following + + pos: 0 + flags: 02000000 + inotify wd:3 ino:9e7e sdev:800013 mask:800afce ignored_mask:0 fhandle-bytes:8 fhandle-type:1 f_handle:7e9e0000640d1b6d + + where 'wd' is a watch descriptor in decimal form, ie a target file + descriptor number, 'ino' and 'sdev' are inode and device where the + target file resides and the 'mask' is the mask of events, all in hex + form [see inotify(7) for more details]. + + If the kernel was built with exportfs support, the path to the target + file is encoded as a file handle. The file handle is provided by three + fields 'fhandle-bytes', 'fhandle-type' and 'f_handle', all in hex + format. + + If the kernel is built without exportfs support the file handle won't be + printed out. + + If there is no inotify mark attached yet the 'inotify' line will be omitted. + + For fanotify files the format is + + pos: 0 + flags: 02 + mnt_id: 9 + fanotify flags:10 event-flags:0 + fanotify mnt_id:12 mflags:40 mask:38 ignored_mask:40000003 + fanotify ino:4f969 sdev:800013 mflags:0 mask:3b ignored_mask:40000000 fhandle-bytes:8 fhandle-type:1 f_handle:69f90400c275b5b4 + + where fanotify 'flags' and 'event-flags' are values used in fanotify_init + call, 'mnt_id' is the mount point identifier, 'mflags' is the value of + flags associated with mark which are tracked separately from events + mask. 'ino', 'sdev' are target inode and device, 'mask' is the events + mask and 'ignored_mask' is the mask of events which are to be ignored. + All in hex format. Incorporation of 'mflags', 'mask' and 'ignored_mask' + does provide information about flags and mask used in fanotify_mark + call [see fsnotify manpage for details]. + + While the first three lines are mandatory and always printed, the rest is + optional and may be omitted if no marks created yet. ------------------------------------------------------------------------------ -Summary ------------------------------------------------------------------------------- -Certain aspects of kernel behavior can be modified at runtime, without the -need to recompile the kernel, or even to reboot the system. The files in the -/proc/sys tree can not only be read, but also modified. You can use the echo -command to write value into these files, thereby changing the default settings -of the kernel. +Configuring procfs ------------------------------------------------------------------------------ + +4.1 Mount options +--------------------- + +The following mount options are supported: + + hidepid= Set /proc/<pid>/ access mode. + gid= Set the group authorized to learn processes information. + +hidepid=0 means classic mode - everybody may access all /proc/<pid>/ directories +(default). + +hidepid=1 means users may not access any /proc/<pid>/ directories but their +own. Sensitive files like cmdline, sched*, status are now protected against +other users. This makes it impossible to learn whether any user runs +specific program (given the program doesn't reveal itself by its behaviour). +As an additional bonus, as /proc/<pid>/cmdline is unaccessible for other users, +poorly written programs passing sensitive information via program arguments are +now protected against local eavesdroppers. + +hidepid=2 means hidepid=1 plus all /proc/<pid>/ will be fully invisible to other +users. It doesn't mean that it hides a fact whether a process with a specific +pid value exists (it can be learned by other means, e.g. by "kill -0 $PID"), +but it hides process' uid and gid, which may be learned by stat()'ing +/proc/<pid>/ otherwise. It greatly complicates an intruder's task of gathering +information about running processes, whether some daemon runs with elevated +privileges, whether other user runs some sensitive program, whether other users +run any program at all, etc. + +gid= defines a group authorized to learn processes information otherwise +prohibited by hidepid=. If you use some daemon like identd which needs to learn +information about processes information, just add identd to this group. diff --git a/Documentation/filesystems/qnx6.txt b/Documentation/filesystems/qnx6.txt new file mode 100644 index 00000000000..40867978913 --- /dev/null +++ b/Documentation/filesystems/qnx6.txt @@ -0,0 +1,174 @@ +The QNX6 Filesystem +=================== + +The qnx6fs is used by newer QNX operating system versions. (e.g. Neutrino) +It got introduced in QNX 6.4.0 and is used default since 6.4.1. + +Option +====== + +mmi_fs Mount filesystem as used for example by Audi MMI 3G system + +Specification +============= + +qnx6fs shares many properties with traditional Unix filesystems. It has the +concepts of blocks, inodes and directories. +On QNX it is possible to create little endian and big endian qnx6 filesystems. +This feature makes it possible to create and use a different endianness fs +for the target (QNX is used on quite a range of embedded systems) plattform +running on a different endianness. +The Linux driver handles endianness transparently. (LE and BE) + +Blocks +------ + +The space in the device or file is split up into blocks. These are a fixed +size of 512, 1024, 2048 or 4096, which is decided when the filesystem is +created. +Blockpointers are 32bit, so the maximum space that can be addressed is +2^32 * 4096 bytes or 16TB + +The superblocks +--------------- + +The superblock contains all global information about the filesystem. +Each qnx6fs got two superblocks, each one having a 64bit serial number. +That serial number is used to identify the "active" superblock. +In write mode with reach new snapshot (after each synchronous write), the +serial of the new master superblock is increased (old superblock serial + 1) + +So basically the snapshot functionality is realized by an atomic final +update of the serial number. Before updating that serial, all modifications +are done by copying all modified blocks during that specific write request +(or period) and building up a new (stable) filesystem structure under the +inactive superblock. + +Each superblock holds a set of root inodes for the different filesystem +parts. (Inode, Bitmap and Longfilenames) +Each of these root nodes holds information like total size of the stored +data and the addressing levels in that specific tree. +If the level value is 0, up to 16 direct blocks can be addressed by each +node. +Level 1 adds an additional indirect addressing level where each indirect +addressing block holds up to blocksize / 4 bytes pointers to data blocks. +Level 2 adds an additional indirect addressing block level (so, already up +to 16 * 256 * 256 = 1048576 blocks that can be addressed by such a tree). + +Unused block pointers are always set to ~0 - regardless of root node, +indirect addressing blocks or inodes. +Data leaves are always on the lowest level. So no data is stored on upper +tree levels. + +The first Superblock is located at 0x2000. (0x2000 is the bootblock size) +The Audi MMI 3G first superblock directly starts at byte 0. +Second superblock position can either be calculated from the superblock +information (total number of filesystem blocks) or by taking the highest +device address, zeroing the last 3 bytes and then subtracting 0x1000 from +that address. + +0x1000 is the size reserved for each superblock - regardless of the +blocksize of the filesystem. + +Inodes +------ + +Each object in the filesystem is represented by an inode. (index node) +The inode structure contains pointers to the filesystem blocks which contain +the data held in the object and all of the metadata about an object except +its longname. (filenames longer than 27 characters) +The metadata about an object includes the permissions, owner, group, flags, +size, number of blocks used, access time, change time and modification time. + +Object mode field is POSIX format. (which makes things easier) + +There are also pointers to the first 16 blocks, if the object data can be +addressed with 16 direct blocks. +For more than 16 blocks an indirect addressing in form of another tree is +used. (scheme is the same as the one used for the superblock root nodes) + +The filesize is stored 64bit. Inode counting starts with 1. (whilst long +filename inodes start with 0) + +Directories +----------- + +A directory is a filesystem object and has an inode just like a file. +It is a specially formatted file containing records which associate each +name with an inode number. +'.' inode number points to the directory inode +'..' inode number points to the parent directory inode +Eeach filename record additionally got a filename length field. + +One special case are long filenames or subdirectory names. +These got set a filename length field of 0xff in the corresponding directory +record plus the longfile inode number also stored in that record. +With that longfilename inode number, the longfilename tree can be walked +starting with the superblock longfilename root node pointers. + +Special files +------------- + +Symbolic links are also filesystem objects with inodes. They got a specific +bit in the inode mode field identifying them as symbolic link. +The directory entry file inode pointer points to the target file inode. + +Hard links got an inode, a directory entry, but a specific mode bit set, +no block pointers and the directory file record pointing to the target file +inode. + +Character and block special devices do not exist in QNX as those files +are handled by the QNX kernel/drivers and created in /dev independent of the +underlaying filesystem. + +Long filenames +-------------- + +Long filenames are stored in a separate addressing tree. The staring point +is the longfilename root node in the active superblock. +Each data block (tree leaves) holds one long filename. That filename is +limited to 510 bytes. The first two starting bytes are used as length field +for the actual filename. +If that structure shall fit for all allowed blocksizes, it is clear why there +is a limit of 510 bytes for the actual filename stored. + +Bitmap +------ + +The qnx6fs filesystem allocation bitmap is stored in a tree under bitmap +root node in the superblock and each bit in the bitmap represents one +filesystem block. +The first block is block 0, which starts 0x1000 after superblock start. +So for a normal qnx6fs 0x3000 (bootblock + superblock) is the physical +address at which block 0 is located. + +Bits at the end of the last bitmap block are set to 1, if the device is +smaller than addressing space in the bitmap. + +Bitmap system area +------------------ + +The bitmap itself is divided into three parts. +First the system area, that is split into two halves. +Then userspace. + +The requirement for a static, fixed preallocated system area comes from how +qnx6fs deals with writes. +Each superblock got it's own half of the system area. So superblock #1 +always uses blocks from the lower half whilst superblock #2 just writes to +blocks represented by the upper half bitmap system area bits. + +Bitmap blocks, Inode blocks and indirect addressing blocks for those two +tree structures are treated as system blocks. + +The rational behind that is that a write request can work on a new snapshot +(system area of the inactive - resp. lower serial numbered superblock) while +at the same time there is still a complete stable filesystem structer in the +other half of the system area. + +When finished with writing (a sync write is completed, the maximum sync leap +time or a filesystem sync is requested), serial of the previously inactive +superblock atomically is increased and the fs switches over to that - then +stable declared - superblock. + +For all data outside the system area, blocks are just copied while writing. diff --git a/Documentation/filesystems/quota.txt b/Documentation/filesystems/quota.txt new file mode 100644 index 00000000000..5e8de25bf0f --- /dev/null +++ b/Documentation/filesystems/quota.txt @@ -0,0 +1,65 @@ + +Quota subsystem +=============== + +Quota subsystem allows system administrator to set limits on used space and +number of used inodes (inode is a filesystem structure which is associated with +each file or directory) for users and/or groups. For both used space and number +of used inodes there are actually two limits. The first one is called softlimit +and the second one hardlimit. An user can never exceed a hardlimit for any +resource (unless he has CAP_SYS_RESOURCE capability). User is allowed to exceed +softlimit but only for limited period of time. This period is called "grace +period" or "grace time". When grace time is over, user is not able to allocate +more space/inodes until he frees enough of them to get below softlimit. + +Quota limits (and amount of grace time) are set independently for each +filesystem. + +For more details about quota design, see the documentation in quota-tools package +(http://sourceforge.net/projects/linuxquota). + +Quota netlink interface +======================= +When user exceeds a softlimit, runs out of grace time or reaches hardlimit, +quota subsystem traditionally printed a message to the controlling terminal of +the process which caused the excess. This method has the disadvantage that +when user is using a graphical desktop he usually cannot see the message. +Thus quota netlink interface has been designed to pass information about +the above events to userspace. There they can be captured by an application +and processed accordingly. + +The interface uses generic netlink framework (see +http://lwn.net/Articles/208755/ and http://people.suug.ch/~tgr/libnl/ for more +details about this layer). The name of the quota generic netlink interface +is "VFS_DQUOT". Definitions of constants below are in <linux/quota.h>. + Currently, the interface supports only one message type QUOTA_NL_C_WARNING. +This command is used to send a notification about any of the above mentioned +events. Each message has six attributes. These are (type of the argument is +in parentheses): + QUOTA_NL_A_QTYPE (u32) + - type of quota being exceeded (one of USRQUOTA, GRPQUOTA) + QUOTA_NL_A_EXCESS_ID (u64) + - UID/GID (depends on quota type) of user / group whose limit + is being exceeded. + QUOTA_NL_A_CAUSED_ID (u64) + - UID of a user who caused the event + QUOTA_NL_A_WARNING (u32) + - what kind of limit is exceeded: + QUOTA_NL_IHARDWARN - inode hardlimit + QUOTA_NL_ISOFTLONGWARN - inode softlimit is exceeded longer + than given grace period + QUOTA_NL_ISOFTWARN - inode softlimit + QUOTA_NL_BHARDWARN - space (block) hardlimit + QUOTA_NL_BSOFTLONGWARN - space (block) softlimit is exceeded + longer than given grace period. + QUOTA_NL_BSOFTWARN - space (block) softlimit + - four warnings are also defined for the event when user stops + exceeding some limit: + QUOTA_NL_IHARDBELOW - inode hardlimit + QUOTA_NL_ISOFTBELOW - inode softlimit + QUOTA_NL_BHARDBELOW - space (block) hardlimit + QUOTA_NL_BSOFTBELOW - space (block) softlimit + QUOTA_NL_A_DEV_MAJOR (u32) + - major number of a device with the affected filesystem + QUOTA_NL_A_DEV_MINOR (u32) + - minor number of a device with the affected filesystem diff --git a/Documentation/filesystems/ramfs-rootfs-initramfs.txt b/Documentation/filesystems/ramfs-rootfs-initramfs.txt new file mode 100644 index 00000000000..b176928e696 --- /dev/null +++ b/Documentation/filesystems/ramfs-rootfs-initramfs.txt @@ -0,0 +1,359 @@ +ramfs, rootfs and initramfs +October 17, 2005 +Rob Landley <rob@landley.net> +============================= + +What is ramfs? +-------------- + +Ramfs is a very simple filesystem that exports Linux's disk caching +mechanisms (the page cache and dentry cache) as a dynamically resizable +RAM-based filesystem. + +Normally all files are cached in memory by Linux. Pages of data read from +backing store (usually the block device the filesystem is mounted on) are kept +around in case it's needed again, but marked as clean (freeable) in case the +Virtual Memory system needs the memory for something else. Similarly, data +written to files is marked clean as soon as it has been written to backing +store, but kept around for caching purposes until the VM reallocates the +memory. A similar mechanism (the dentry cache) greatly speeds up access to +directories. + +With ramfs, there is no backing store. Files written into ramfs allocate +dentries and page cache as usual, but there's nowhere to write them to. +This means the pages are never marked clean, so they can't be freed by the +VM when it's looking to recycle memory. + +The amount of code required to implement ramfs is tiny, because all the +work is done by the existing Linux caching infrastructure. Basically, +you're mounting the disk cache as a filesystem. Because of this, ramfs is not +an optional component removable via menuconfig, since there would be negligible +space savings. + +ramfs and ramdisk: +------------------ + +The older "ram disk" mechanism created a synthetic block device out of +an area of RAM and used it as backing store for a filesystem. This block +device was of fixed size, so the filesystem mounted on it was of fixed +size. Using a ram disk also required unnecessarily copying memory from the +fake block device into the page cache (and copying changes back out), as well +as creating and destroying dentries. Plus it needed a filesystem driver +(such as ext2) to format and interpret this data. + +Compared to ramfs, this wastes memory (and memory bus bandwidth), creates +unnecessary work for the CPU, and pollutes the CPU caches. (There are tricks +to avoid this copying by playing with the page tables, but they're unpleasantly +complicated and turn out to be about as expensive as the copying anyway.) +More to the point, all the work ramfs is doing has to happen _anyway_, +since all file access goes through the page and dentry caches. The RAM +disk is simply unnecessary; ramfs is internally much simpler. + +Another reason ramdisks are semi-obsolete is that the introduction of +loopback devices offered a more flexible and convenient way to create +synthetic block devices, now from files instead of from chunks of memory. +See losetup (8) for details. + +ramfs and tmpfs: +---------------- + +One downside of ramfs is you can keep writing data into it until you fill +up all memory, and the VM can't free it because the VM thinks that files +should get written to backing store (rather than swap space), but ramfs hasn't +got any backing store. Because of this, only root (or a trusted user) should +be allowed write access to a ramfs mount. + +A ramfs derivative called tmpfs was created to add size limits, and the ability +to write the data to swap space. Normal users can be allowed write access to +tmpfs mounts. See Documentation/filesystems/tmpfs.txt for more information. + +What is rootfs? +--------------- + +Rootfs is a special instance of ramfs (or tmpfs, if that's enabled), which is +always present in 2.6 systems. You can't unmount rootfs for approximately the +same reason you can't kill the init process; rather than having special code +to check for and handle an empty list, it's smaller and simpler for the kernel +to just make sure certain lists can't become empty. + +Most systems just mount another filesystem over rootfs and ignore it. The +amount of space an empty instance of ramfs takes up is tiny. + +If CONFIG_TMPFS is enabled, rootfs will use tmpfs instead of ramfs by +default. To force ramfs, add "rootfstype=ramfs" to the kernel command +line. + +What is initramfs? +------------------ + +All 2.6 Linux kernels contain a gzipped "cpio" format archive, which is +extracted into rootfs when the kernel boots up. After extracting, the kernel +checks to see if rootfs contains a file "init", and if so it executes it as PID +1. If found, this init process is responsible for bringing the system the +rest of the way up, including locating and mounting the real root device (if +any). If rootfs does not contain an init program after the embedded cpio +archive is extracted into it, the kernel will fall through to the older code +to locate and mount a root partition, then exec some variant of /sbin/init +out of that. + +All this differs from the old initrd in several ways: + + - The old initrd was always a separate file, while the initramfs archive is + linked into the linux kernel image. (The directory linux-*/usr is devoted + to generating this archive during the build.) + + - The old initrd file was a gzipped filesystem image (in some file format, + such as ext2, that needed a driver built into the kernel), while the new + initramfs archive is a gzipped cpio archive (like tar only simpler, + see cpio(1) and Documentation/early-userspace/buffer-format.txt). The + kernel's cpio extraction code is not only extremely small, it's also + __init text and data that can be discarded during the boot process. + + - The program run by the old initrd (which was called /initrd, not /init) did + some setup and then returned to the kernel, while the init program from + initramfs is not expected to return to the kernel. (If /init needs to hand + off control it can overmount / with a new root device and exec another init + program. See the switch_root utility, below.) + + - When switching another root device, initrd would pivot_root and then + umount the ramdisk. But initramfs is rootfs: you can neither pivot_root + rootfs, nor unmount it. Instead delete everything out of rootfs to + free up the space (find -xdev / -exec rm '{}' ';'), overmount rootfs + with the new root (cd /newmount; mount --move . /; chroot .), attach + stdin/stdout/stderr to the new /dev/console, and exec the new init. + + Since this is a remarkably persnickety process (and involves deleting + commands before you can run them), the klibc package introduced a helper + program (utils/run_init.c) to do all this for you. Most other packages + (such as busybox) have named this command "switch_root". + +Populating initramfs: +--------------------- + +The 2.6 kernel build process always creates a gzipped cpio format initramfs +archive and links it into the resulting kernel binary. By default, this +archive is empty (consuming 134 bytes on x86). + +The config option CONFIG_INITRAMFS_SOURCE (in General Setup in menuconfig, +and living in usr/Kconfig) can be used to specify a source for the +initramfs archive, which will automatically be incorporated into the +resulting binary. This option can point to an existing gzipped cpio +archive, a directory containing files to be archived, or a text file +specification such as the following example: + + dir /dev 755 0 0 + nod /dev/console 644 0 0 c 5 1 + nod /dev/loop0 644 0 0 b 7 0 + dir /bin 755 1000 1000 + slink /bin/sh busybox 777 0 0 + file /bin/busybox initramfs/busybox 755 0 0 + dir /proc 755 0 0 + dir /sys 755 0 0 + dir /mnt 755 0 0 + file /init initramfs/init.sh 755 0 0 + +Run "usr/gen_init_cpio" (after the kernel build) to get a usage message +documenting the above file format. + +One advantage of the configuration file is that root access is not required to +set permissions or create device nodes in the new archive. (Note that those +two example "file" entries expect to find files named "init.sh" and "busybox" in +a directory called "initramfs", under the linux-2.6.* directory. See +Documentation/early-userspace/README for more details.) + +The kernel does not depend on external cpio tools. If you specify a +directory instead of a configuration file, the kernel's build infrastructure +creates a configuration file from that directory (usr/Makefile calls +scripts/gen_initramfs_list.sh), and proceeds to package up that directory +using the config file (by feeding it to usr/gen_init_cpio, which is created +from usr/gen_init_cpio.c). The kernel's build-time cpio creation code is +entirely self-contained, and the kernel's boot-time extractor is also +(obviously) self-contained. + +The one thing you might need external cpio utilities installed for is creating +or extracting your own preprepared cpio files to feed to the kernel build +(instead of a config file or directory). + +The following command line can extract a cpio image (either by the above script +or by the kernel build) back into its component files: + + cpio -i -d -H newc -F initramfs_data.cpio --no-absolute-filenames + +The following shell script can create a prebuilt cpio archive you can +use in place of the above config file: + + #!/bin/sh + + # Copyright 2006 Rob Landley <rob@landley.net> and TimeSys Corporation. + # Licensed under GPL version 2 + + if [ $# -ne 2 ] + then + echo "usage: mkinitramfs directory imagename.cpio.gz" + exit 1 + fi + + if [ -d "$1" ] + then + echo "creating $2 from $1" + (cd "$1"; find . | cpio -o -H newc | gzip) > "$2" + else + echo "First argument must be a directory" + exit 1 + fi + +Note: The cpio man page contains some bad advice that will break your initramfs +archive if you follow it. It says "A typical way to generate the list +of filenames is with the find command; you should give find the -depth option +to minimize problems with permissions on directories that are unwritable or not +searchable." Don't do this when creating initramfs.cpio.gz images, it won't +work. The Linux kernel cpio extractor won't create files in a directory that +doesn't exist, so the directory entries must go before the files that go in +those directories. The above script gets them in the right order. + +External initramfs images: +-------------------------- + +If the kernel has initrd support enabled, an external cpio.gz archive can also +be passed into a 2.6 kernel in place of an initrd. In this case, the kernel +will autodetect the type (initramfs, not initrd) and extract the external cpio +archive into rootfs before trying to run /init. + +This has the memory efficiency advantages of initramfs (no ramdisk block +device) but the separate packaging of initrd (which is nice if you have +non-GPL code you'd like to run from initramfs, without conflating it with +the GPL licensed Linux kernel binary). + +It can also be used to supplement the kernel's built-in initramfs image. The +files in the external archive will overwrite any conflicting files in +the built-in initramfs archive. Some distributors also prefer to customize +a single kernel image with task-specific initramfs images, without recompiling. + +Contents of initramfs: +---------------------- + +An initramfs archive is a complete self-contained root filesystem for Linux. +If you don't already understand what shared libraries, devices, and paths +you need to get a minimal root filesystem up and running, here are some +references: +http://www.tldp.org/HOWTO/Bootdisk-HOWTO/ +http://www.tldp.org/HOWTO/From-PowerUp-To-Bash-Prompt-HOWTO.html +http://www.linuxfromscratch.org/lfs/view/stable/ + +The "klibc" package (http://www.kernel.org/pub/linux/libs/klibc) is +designed to be a tiny C library to statically link early userspace +code against, along with some related utilities. It is BSD licensed. + +I use uClibc (http://www.uclibc.org) and busybox (http://www.busybox.net) +myself. These are LGPL and GPL, respectively. (A self-contained initramfs +package is planned for the busybox 1.3 release.) + +In theory you could use glibc, but that's not well suited for small embedded +uses like this. (A "hello world" program statically linked against glibc is +over 400k. With uClibc it's 7k. Also note that glibc dlopens libnss to do +name lookups, even when otherwise statically linked.) + +A good first step is to get initramfs to run a statically linked "hello world" +program as init, and test it under an emulator like qemu (www.qemu.org) or +User Mode Linux, like so: + + cat > hello.c << EOF + #include <stdio.h> + #include <unistd.h> + + int main(int argc, char *argv[]) + { + printf("Hello world!\n"); + sleep(999999999); + } + EOF + gcc -static hello.c -o init + echo init | cpio -o -H newc | gzip > test.cpio.gz + # Testing external initramfs using the initrd loading mechanism. + qemu -kernel /boot/vmlinuz -initrd test.cpio.gz /dev/zero + +When debugging a normal root filesystem, it's nice to be able to boot with +"init=/bin/sh". The initramfs equivalent is "rdinit=/bin/sh", and it's +just as useful. + +Why cpio rather than tar? +------------------------- + +This decision was made back in December, 2001. The discussion started here: + + http://www.uwsg.iu.edu/hypermail/linux/kernel/0112.2/1538.html + +And spawned a second thread (specifically on tar vs cpio), starting here: + + http://www.uwsg.iu.edu/hypermail/linux/kernel/0112.2/1587.html + +The quick and dirty summary version (which is no substitute for reading +the above threads) is: + +1) cpio is a standard. It's decades old (from the AT&T days), and already + widely used on Linux (inside RPM, Red Hat's device driver disks). Here's + a Linux Journal article about it from 1996: + + http://www.linuxjournal.com/article/1213 + + It's not as popular as tar because the traditional cpio command line tools + require _truly_hideous_ command line arguments. But that says nothing + either way about the archive format, and there are alternative tools, + such as: + + http://freecode.com/projects/afio + +2) The cpio archive format chosen by the kernel is simpler and cleaner (and + thus easier to create and parse) than any of the (literally dozens of) + various tar archive formats. The complete initramfs archive format is + explained in buffer-format.txt, created in usr/gen_init_cpio.c, and + extracted in init/initramfs.c. All three together come to less than 26k + total of human-readable text. + +3) The GNU project standardizing on tar is approximately as relevant as + Windows standardizing on zip. Linux is not part of either, and is free + to make its own technical decisions. + +4) Since this is a kernel internal format, it could easily have been + something brand new. The kernel provides its own tools to create and + extract this format anyway. Using an existing standard was preferable, + but not essential. + +5) Al Viro made the decision (quote: "tar is ugly as hell and not going to be + supported on the kernel side"): + + http://www.uwsg.iu.edu/hypermail/linux/kernel/0112.2/1540.html + + explained his reasoning: + + http://www.uwsg.iu.edu/hypermail/linux/kernel/0112.2/1550.html + http://www.uwsg.iu.edu/hypermail/linux/kernel/0112.2/1638.html + + and, most importantly, designed and implemented the initramfs code. + +Future directions: +------------------ + +Today (2.6.16), initramfs is always compiled in, but not always used. The +kernel falls back to legacy boot code that is reached only if initramfs does +not contain an /init program. The fallback is legacy code, there to ensure a +smooth transition and allowing early boot functionality to gradually move to +"early userspace" (I.E. initramfs). + +The move to early userspace is necessary because finding and mounting the real +root device is complex. Root partitions can span multiple devices (raid or +separate journal). They can be out on the network (requiring dhcp, setting a +specific MAC address, logging into a server, etc). They can live on removable +media, with dynamically allocated major/minor numbers and persistent naming +issues requiring a full udev implementation to sort out. They can be +compressed, encrypted, copy-on-write, loopback mounted, strangely partitioned, +and so on. + +This kind of complexity (which inevitably includes policy) is rightly handled +in userspace. Both klibc and busybox/uClibc are working on simple initramfs +packages to drop into a kernel build. + +The klibc package has now been accepted into Andrew Morton's 2.6.17-mm tree. +The kernel's current early boot code (partition detection, etc) will probably +be migrated into a default initramfs, automatically created and used by the +kernel build. diff --git a/Documentation/filesystems/relay.txt b/Documentation/filesystems/relay.txt new file mode 100644 index 00000000000..33e2f369473 --- /dev/null +++ b/Documentation/filesystems/relay.txt @@ -0,0 +1,494 @@ +relay interface (formerly relayfs) +================================== + +The relay interface provides a means for kernel applications to +efficiently log and transfer large quantities of data from the kernel +to userspace via user-defined 'relay channels'. + +A 'relay channel' is a kernel->user data relay mechanism implemented +as a set of per-cpu kernel buffers ('channel buffers'), each +represented as a regular file ('relay file') in user space. Kernel +clients write into the channel buffers using efficient write +functions; these automatically log into the current cpu's channel +buffer. User space applications mmap() or read() from the relay files +and retrieve the data as it becomes available. The relay files +themselves are files created in a host filesystem, e.g. debugfs, and +are associated with the channel buffers using the API described below. + +The format of the data logged into the channel buffers is completely +up to the kernel client; the relay interface does however provide +hooks which allow kernel clients to impose some structure on the +buffer data. The relay interface doesn't implement any form of data +filtering - this also is left to the kernel client. The purpose is to +keep things as simple as possible. + +This document provides an overview of the relay interface API. The +details of the function parameters are documented along with the +functions in the relay interface code - please see that for details. + +Semantics +========= + +Each relay channel has one buffer per CPU, each buffer has one or more +sub-buffers. Messages are written to the first sub-buffer until it is +too full to contain a new message, in which case it is written to +the next (if available). Messages are never split across sub-buffers. +At this point, userspace can be notified so it empties the first +sub-buffer, while the kernel continues writing to the next. + +When notified that a sub-buffer is full, the kernel knows how many +bytes of it are padding i.e. unused space occurring because a complete +message couldn't fit into a sub-buffer. Userspace can use this +knowledge to copy only valid data. + +After copying it, userspace can notify the kernel that a sub-buffer +has been consumed. + +A relay channel can operate in a mode where it will overwrite data not +yet collected by userspace, and not wait for it to be consumed. + +The relay channel itself does not provide for communication of such +data between userspace and kernel, allowing the kernel side to remain +simple and not impose a single interface on userspace. It does +provide a set of examples and a separate helper though, described +below. + +The read() interface both removes padding and internally consumes the +read sub-buffers; thus in cases where read(2) is being used to drain +the channel buffers, special-purpose communication between kernel and +user isn't necessary for basic operation. + +One of the major goals of the relay interface is to provide a low +overhead mechanism for conveying kernel data to userspace. While the +read() interface is easy to use, it's not as efficient as the mmap() +approach; the example code attempts to make the tradeoff between the +two approaches as small as possible. + +klog and relay-apps example code +================================ + +The relay interface itself is ready to use, but to make things easier, +a couple simple utility functions and a set of examples are provided. + +The relay-apps example tarball, available on the relay sourceforge +site, contains a set of self-contained examples, each consisting of a +pair of .c files containing boilerplate code for each of the user and +kernel sides of a relay application. When combined these two sets of +boilerplate code provide glue to easily stream data to disk, without +having to bother with mundane housekeeping chores. + +The 'klog debugging functions' patch (klog.patch in the relay-apps +tarball) provides a couple of high-level logging functions to the +kernel which allow writing formatted text or raw data to a channel, +regardless of whether a channel to write into exists or not, or even +whether the relay interface is compiled into the kernel or not. These +functions allow you to put unconditional 'trace' statements anywhere +in the kernel or kernel modules; only when there is a 'klog handler' +registered will data actually be logged (see the klog and kleak +examples for details). + +It is of course possible to use the relay interface from scratch, +i.e. without using any of the relay-apps example code or klog, but +you'll have to implement communication between userspace and kernel, +allowing both to convey the state of buffers (full, empty, amount of +padding). The read() interface both removes padding and internally +consumes the read sub-buffers; thus in cases where read(2) is being +used to drain the channel buffers, special-purpose communication +between kernel and user isn't necessary for basic operation. Things +such as buffer-full conditions would still need to be communicated via +some channel though. + +klog and the relay-apps examples can be found in the relay-apps +tarball on http://relayfs.sourceforge.net + +The relay interface user space API +================================== + +The relay interface implements basic file operations for user space +access to relay channel buffer data. Here are the file operations +that are available and some comments regarding their behavior: + +open() enables user to open an _existing_ channel buffer. + +mmap() results in channel buffer being mapped into the caller's + memory space. Note that you can't do a partial mmap - you + must map the entire file, which is NRBUF * SUBBUFSIZE. + +read() read the contents of a channel buffer. The bytes read are + 'consumed' by the reader, i.e. they won't be available + again to subsequent reads. If the channel is being used + in no-overwrite mode (the default), it can be read at any + time even if there's an active kernel writer. If the + channel is being used in overwrite mode and there are + active channel writers, results may be unpredictable - + users should make sure that all logging to the channel has + ended before using read() with overwrite mode. Sub-buffer + padding is automatically removed and will not be seen by + the reader. + +sendfile() transfer data from a channel buffer to an output file + descriptor. Sub-buffer padding is automatically removed + and will not be seen by the reader. + +poll() POLLIN/POLLRDNORM/POLLERR supported. User applications are + notified when sub-buffer boundaries are crossed. + +close() decrements the channel buffer's refcount. When the refcount + reaches 0, i.e. when no process or kernel client has the + buffer open, the channel buffer is freed. + +In order for a user application to make use of relay files, the +host filesystem must be mounted. For example, + + mount -t debugfs debugfs /sys/kernel/debug + +NOTE: the host filesystem doesn't need to be mounted for kernel + clients to create or use channels - it only needs to be + mounted when user space applications need access to the buffer + data. + + +The relay interface kernel API +============================== + +Here's a summary of the API the relay interface provides to in-kernel clients: + +TBD(curr. line MT:/API/) + channel management functions: + + relay_open(base_filename, parent, subbuf_size, n_subbufs, + callbacks, private_data) + relay_close(chan) + relay_flush(chan) + relay_reset(chan) + + channel management typically called on instigation of userspace: + + relay_subbufs_consumed(chan, cpu, subbufs_consumed) + + write functions: + + relay_write(chan, data, length) + __relay_write(chan, data, length) + relay_reserve(chan, length) + + callbacks: + + subbuf_start(buf, subbuf, prev_subbuf, prev_padding) + buf_mapped(buf, filp) + buf_unmapped(buf, filp) + create_buf_file(filename, parent, mode, buf, is_global) + remove_buf_file(dentry) + + helper functions: + + relay_buf_full(buf) + subbuf_start_reserve(buf, length) + + +Creating a channel +------------------ + +relay_open() is used to create a channel, along with its per-cpu +channel buffers. Each channel buffer will have an associated file +created for it in the host filesystem, which can be and mmapped or +read from in user space. The files are named basename0...basenameN-1 +where N is the number of online cpus, and by default will be created +in the root of the filesystem (if the parent param is NULL). If you +want a directory structure to contain your relay files, you should +create it using the host filesystem's directory creation function, +e.g. debugfs_create_dir(), and pass the parent directory to +relay_open(). Users are responsible for cleaning up any directory +structure they create, when the channel is closed - again the host +filesystem's directory removal functions should be used for that, +e.g. debugfs_remove(). + +In order for a channel to be created and the host filesystem's files +associated with its channel buffers, the user must provide definitions +for two callback functions, create_buf_file() and remove_buf_file(). +create_buf_file() is called once for each per-cpu buffer from +relay_open() and allows the user to create the file which will be used +to represent the corresponding channel buffer. The callback should +return the dentry of the file created to represent the channel buffer. +remove_buf_file() must also be defined; it's responsible for deleting +the file(s) created in create_buf_file() and is called during +relay_close(). + +Here are some typical definitions for these callbacks, in this case +using debugfs: + +/* + * create_buf_file() callback. Creates relay file in debugfs. + */ +static struct dentry *create_buf_file_handler(const char *filename, + struct dentry *parent, + int mode, + struct rchan_buf *buf, + int *is_global) +{ + return debugfs_create_file(filename, mode, parent, buf, + &relay_file_operations); +} + +/* + * remove_buf_file() callback. Removes relay file from debugfs. + */ +static int remove_buf_file_handler(struct dentry *dentry) +{ + debugfs_remove(dentry); + + return 0; +} + +/* + * relay interface callbacks + */ +static struct rchan_callbacks relay_callbacks = +{ + .create_buf_file = create_buf_file_handler, + .remove_buf_file = remove_buf_file_handler, +}; + +And an example relay_open() invocation using them: + + chan = relay_open("cpu", NULL, SUBBUF_SIZE, N_SUBBUFS, &relay_callbacks, NULL); + +If the create_buf_file() callback fails, or isn't defined, channel +creation and thus relay_open() will fail. + +The total size of each per-cpu buffer is calculated by multiplying the +number of sub-buffers by the sub-buffer size passed into relay_open(). +The idea behind sub-buffers is that they're basically an extension of +double-buffering to N buffers, and they also allow applications to +easily implement random-access-on-buffer-boundary schemes, which can +be important for some high-volume applications. The number and size +of sub-buffers is completely dependent on the application and even for +the same application, different conditions will warrant different +values for these parameters at different times. Typically, the right +values to use are best decided after some experimentation; in general, +though, it's safe to assume that having only 1 sub-buffer is a bad +idea - you're guaranteed to either overwrite data or lose events +depending on the channel mode being used. + +The create_buf_file() implementation can also be defined in such a way +as to allow the creation of a single 'global' buffer instead of the +default per-cpu set. This can be useful for applications interested +mainly in seeing the relative ordering of system-wide events without +the need to bother with saving explicit timestamps for the purpose of +merging/sorting per-cpu files in a postprocessing step. + +To have relay_open() create a global buffer, the create_buf_file() +implementation should set the value of the is_global outparam to a +non-zero value in addition to creating the file that will be used to +represent the single buffer. In the case of a global buffer, +create_buf_file() and remove_buf_file() will be called only once. The +normal channel-writing functions, e.g. relay_write(), can still be +used - writes from any cpu will transparently end up in the global +buffer - but since it is a global buffer, callers should make sure +they use the proper locking for such a buffer, either by wrapping +writes in a spinlock, or by copying a write function from relay.h and +creating a local version that internally does the proper locking. + +The private_data passed into relay_open() allows clients to associate +user-defined data with a channel, and is immediately available +(including in create_buf_file()) via chan->private_data or +buf->chan->private_data. + +Buffer-only channels +-------------------- + +These channels have no files associated and can be created with +relay_open(NULL, NULL, ...). Such channels are useful in scenarios such +as when doing early tracing in the kernel, before the VFS is up. In these +cases, one may open a buffer-only channel and then call +relay_late_setup_files() when the kernel is ready to handle files, +to expose the buffered data to the userspace. + +Channel 'modes' +--------------- + +relay channels can be used in either of two modes - 'overwrite' or +'no-overwrite'. The mode is entirely determined by the implementation +of the subbuf_start() callback, as described below. The default if no +subbuf_start() callback is defined is 'no-overwrite' mode. If the +default mode suits your needs, and you plan to use the read() +interface to retrieve channel data, you can ignore the details of this +section, as it pertains mainly to mmap() implementations. + +In 'overwrite' mode, also known as 'flight recorder' mode, writes +continuously cycle around the buffer and will never fail, but will +unconditionally overwrite old data regardless of whether it's actually +been consumed. In no-overwrite mode, writes will fail, i.e. data will +be lost, if the number of unconsumed sub-buffers equals the total +number of sub-buffers in the channel. It should be clear that if +there is no consumer or if the consumer can't consume sub-buffers fast +enough, data will be lost in either case; the only difference is +whether data is lost from the beginning or the end of a buffer. + +As explained above, a relay channel is made of up one or more +per-cpu channel buffers, each implemented as a circular buffer +subdivided into one or more sub-buffers. Messages are written into +the current sub-buffer of the channel's current per-cpu buffer via the +write functions described below. Whenever a message can't fit into +the current sub-buffer, because there's no room left for it, the +client is notified via the subbuf_start() callback that a switch to a +new sub-buffer is about to occur. The client uses this callback to 1) +initialize the next sub-buffer if appropriate 2) finalize the previous +sub-buffer if appropriate and 3) return a boolean value indicating +whether or not to actually move on to the next sub-buffer. + +To implement 'no-overwrite' mode, the userspace client would provide +an implementation of the subbuf_start() callback something like the +following: + +static int subbuf_start(struct rchan_buf *buf, + void *subbuf, + void *prev_subbuf, + unsigned int prev_padding) +{ + if (prev_subbuf) + *((unsigned *)prev_subbuf) = prev_padding; + + if (relay_buf_full(buf)) + return 0; + + subbuf_start_reserve(buf, sizeof(unsigned int)); + + return 1; +} + +If the current buffer is full, i.e. all sub-buffers remain unconsumed, +the callback returns 0 to indicate that the buffer switch should not +occur yet, i.e. until the consumer has had a chance to read the +current set of ready sub-buffers. For the relay_buf_full() function +to make sense, the consumer is responsible for notifying the relay +interface when sub-buffers have been consumed via +relay_subbufs_consumed(). Any subsequent attempts to write into the +buffer will again invoke the subbuf_start() callback with the same +parameters; only when the consumer has consumed one or more of the +ready sub-buffers will relay_buf_full() return 0, in which case the +buffer switch can continue. + +The implementation of the subbuf_start() callback for 'overwrite' mode +would be very similar: + +static int subbuf_start(struct rchan_buf *buf, + void *subbuf, + void *prev_subbuf, + unsigned int prev_padding) +{ + if (prev_subbuf) + *((unsigned *)prev_subbuf) = prev_padding; + + subbuf_start_reserve(buf, sizeof(unsigned int)); + + return 1; +} + +In this case, the relay_buf_full() check is meaningless and the +callback always returns 1, causing the buffer switch to occur +unconditionally. It's also meaningless for the client to use the +relay_subbufs_consumed() function in this mode, as it's never +consulted. + +The default subbuf_start() implementation, used if the client doesn't +define any callbacks, or doesn't define the subbuf_start() callback, +implements the simplest possible 'no-overwrite' mode, i.e. it does +nothing but return 0. + +Header information can be reserved at the beginning of each sub-buffer +by calling the subbuf_start_reserve() helper function from within the +subbuf_start() callback. This reserved area can be used to store +whatever information the client wants. In the example above, room is +reserved in each sub-buffer to store the padding count for that +sub-buffer. This is filled in for the previous sub-buffer in the +subbuf_start() implementation; the padding value for the previous +sub-buffer is passed into the subbuf_start() callback along with a +pointer to the previous sub-buffer, since the padding value isn't +known until a sub-buffer is filled. The subbuf_start() callback is +also called for the first sub-buffer when the channel is opened, to +give the client a chance to reserve space in it. In this case the +previous sub-buffer pointer passed into the callback will be NULL, so +the client should check the value of the prev_subbuf pointer before +writing into the previous sub-buffer. + +Writing to a channel +-------------------- + +Kernel clients write data into the current cpu's channel buffer using +relay_write() or __relay_write(). relay_write() is the main logging +function - it uses local_irqsave() to protect the buffer and should be +used if you might be logging from interrupt context. If you know +you'll never be logging from interrupt context, you can use +__relay_write(), which only disables preemption. These functions +don't return a value, so you can't determine whether or not they +failed - the assumption is that you wouldn't want to check a return +value in the fast logging path anyway, and that they'll always succeed +unless the buffer is full and no-overwrite mode is being used, in +which case you can detect a failed write in the subbuf_start() +callback by calling the relay_buf_full() helper function. + +relay_reserve() is used to reserve a slot in a channel buffer which +can be written to later. This would typically be used in applications +that need to write directly into a channel buffer without having to +stage data in a temporary buffer beforehand. Because the actual write +may not happen immediately after the slot is reserved, applications +using relay_reserve() can keep a count of the number of bytes actually +written, either in space reserved in the sub-buffers themselves or as +a separate array. See the 'reserve' example in the relay-apps tarball +at http://relayfs.sourceforge.net for an example of how this can be +done. Because the write is under control of the client and is +separated from the reserve, relay_reserve() doesn't protect the buffer +at all - it's up to the client to provide the appropriate +synchronization when using relay_reserve(). + +Closing a channel +----------------- + +The client calls relay_close() when it's finished using the channel. +The channel and its associated buffers are destroyed when there are no +longer any references to any of the channel buffers. relay_flush() +forces a sub-buffer switch on all the channel buffers, and can be used +to finalize and process the last sub-buffers before the channel is +closed. + +Misc +---- + +Some applications may want to keep a channel around and re-use it +rather than open and close a new channel for each use. relay_reset() +can be used for this purpose - it resets a channel to its initial +state without reallocating channel buffer memory or destroying +existing mappings. It should however only be called when it's safe to +do so, i.e. when the channel isn't currently being written to. + +Finally, there are a couple of utility callbacks that can be used for +different purposes. buf_mapped() is called whenever a channel buffer +is mmapped from user space and buf_unmapped() is called when it's +unmapped. The client can use this notification to trigger actions +within the kernel application, such as enabling/disabling logging to +the channel. + + +Resources +========= + +For news, example code, mailing list, etc. see the relay interface homepage: + + http://relayfs.sourceforge.net + + +Credits +======= + +The ideas and specs for the relay interface came about as a result of +discussions on tracing involving the following: + +Michel Dagenais <michel.dagenais@polymtl.ca> +Richard Moore <richardj_moore@uk.ibm.com> +Bob Wisniewski <bob@watson.ibm.com> +Karim Yaghmour <karim@opersys.com> +Tom Zanussi <zanussi@us.ibm.com> + +Also thanks to Hubertus Franke for a lot of useful suggestions and bug +reports. diff --git a/Documentation/filesystems/relayfs.txt b/Documentation/filesystems/relayfs.txt deleted file mode 100644 index d803abed29f..00000000000 --- a/Documentation/filesystems/relayfs.txt +++ /dev/null @@ -1,362 +0,0 @@ - -relayfs - a high-speed data relay filesystem -============================================ - -relayfs is a filesystem designed to provide an efficient mechanism for -tools and facilities to relay large and potentially sustained streams -of data from kernel space to user space. - -The main abstraction of relayfs is the 'channel'. A channel consists -of a set of per-cpu kernel buffers each represented by a file in the -relayfs filesystem. Kernel clients write into a channel using -efficient write functions which automatically log to the current cpu's -channel buffer. User space applications mmap() the per-cpu files and -retrieve the data as it becomes available. - -The format of the data logged into the channel buffers is completely -up to the relayfs client; relayfs does however provide hooks which -allow clients to impose some structure on the buffer data. Nor does -relayfs implement any form of data filtering - this also is left to -the client. The purpose is to keep relayfs as simple as possible. - -This document provides an overview of the relayfs API. The details of -the function parameters are documented along with the functions in the -filesystem code - please see that for details. - -Semantics -========= - -Each relayfs channel has one buffer per CPU, each buffer has one or -more sub-buffers. Messages are written to the first sub-buffer until -it is too full to contain a new message, in which case it it is -written to the next (if available). Messages are never split across -sub-buffers. At this point, userspace can be notified so it empties -the first sub-buffer, while the kernel continues writing to the next. - -When notified that a sub-buffer is full, the kernel knows how many -bytes of it are padding i.e. unused. Userspace can use this knowledge -to copy only valid data. - -After copying it, userspace can notify the kernel that a sub-buffer -has been consumed. - -relayfs can operate in a mode where it will overwrite data not yet -collected by userspace, and not wait for it to consume it. - -relayfs itself does not provide for communication of such data between -userspace and kernel, allowing the kernel side to remain simple and not -impose a single interface on userspace. It does provide a separate -helper though, described below. - -klog, relay-app & librelay -========================== - -relayfs itself is ready to use, but to make things easier, two -additional systems are provided. klog is a simple wrapper to make -writing formatted text or raw data to a channel simpler, regardless of -whether a channel to write into exists or not, or whether relayfs is -compiled into the kernel or is configured as a module. relay-app is -the kernel counterpart of userspace librelay.c, combined these two -files provide glue to easily stream data to disk, without having to -bother with housekeeping. klog and relay-app can be used together, -with klog providing high-level logging functions to the kernel and -relay-app taking care of kernel-user control and disk-logging chores. - -It is possible to use relayfs without relay-app & librelay, but you'll -have to implement communication between userspace and kernel, allowing -both to convey the state of buffers (full, empty, amount of padding). - -klog, relay-app and librelay can be found in the relay-apps tarball on -http://relayfs.sourceforge.net - -The relayfs user space API -========================== - -relayfs implements basic file operations for user space access to -relayfs channel buffer data. Here are the file operations that are -available and some comments regarding their behavior: - -open() enables user to open an _existing_ buffer. - -mmap() results in channel buffer being mapped into the caller's - memory space. Note that you can't do a partial mmap - you must - map the entire file, which is NRBUF * SUBBUFSIZE. - -read() read the contents of a channel buffer. The bytes read are - 'consumed' by the reader i.e. they won't be available again - to subsequent reads. If the channel is being used in - no-overwrite mode (the default), it can be read at any time - even if there's an active kernel writer. If the channel is - being used in overwrite mode and there are active channel - writers, results may be unpredictable - users should make - sure that all logging to the channel has ended before using - read() with overwrite mode. - -poll() POLLIN/POLLRDNORM/POLLERR supported. User applications are - notified when sub-buffer boundaries are crossed. - -close() decrements the channel buffer's refcount. When the refcount - reaches 0 i.e. when no process or kernel client has the buffer - open, the channel buffer is freed. - - -In order for a user application to make use of relayfs files, the -relayfs filesystem must be mounted. For example, - - mount -t relayfs relayfs /mnt/relay - -NOTE: relayfs doesn't need to be mounted for kernel clients to create - or use channels - it only needs to be mounted when user space - applications need access to the buffer data. - - -The relayfs kernel API -====================== - -Here's a summary of the API relayfs provides to in-kernel clients: - - - channel management functions: - - relay_open(base_filename, parent, subbuf_size, n_subbufs, - callbacks) - relay_close(chan) - relay_flush(chan) - relay_reset(chan) - relayfs_create_dir(name, parent) - relayfs_remove_dir(dentry) - - channel management typically called on instigation of userspace: - - relay_subbufs_consumed(chan, cpu, subbufs_consumed) - - write functions: - - relay_write(chan, data, length) - __relay_write(chan, data, length) - relay_reserve(chan, length) - - callbacks: - - subbuf_start(buf, subbuf, prev_subbuf, prev_padding) - buf_mapped(buf, filp) - buf_unmapped(buf, filp) - - helper functions: - - relay_buf_full(buf) - subbuf_start_reserve(buf, length) - - -Creating a channel ------------------- - -relay_open() is used to create a channel, along with its per-cpu -channel buffers. Each channel buffer will have an associated file -created for it in the relayfs filesystem, which can be opened and -mmapped from user space if desired. The files are named -basename0...basenameN-1 where N is the number of online cpus, and by -default will be created in the root of the filesystem. If you want a -directory structure to contain your relayfs files, you can create it -with relayfs_create_dir() and pass the parent directory to -relay_open(). Clients are responsible for cleaning up any directory -structure they create when the channel is closed - use -relayfs_remove_dir() for that. - -The total size of each per-cpu buffer is calculated by multiplying the -number of sub-buffers by the sub-buffer size passed into relay_open(). -The idea behind sub-buffers is that they're basically an extension of -double-buffering to N buffers, and they also allow applications to -easily implement random-access-on-buffer-boundary schemes, which can -be important for some high-volume applications. The number and size -of sub-buffers is completely dependent on the application and even for -the same application, different conditions will warrant different -values for these parameters at different times. Typically, the right -values to use are best decided after some experimentation; in general, -though, it's safe to assume that having only 1 sub-buffer is a bad -idea - you're guaranteed to either overwrite data or lose events -depending on the channel mode being used. - -Channel 'modes' ---------------- - -relayfs channels can be used in either of two modes - 'overwrite' or -'no-overwrite'. The mode is entirely determined by the implementation -of the subbuf_start() callback, as described below. In 'overwrite' -mode, also known as 'flight recorder' mode, writes continuously cycle -around the buffer and will never fail, but will unconditionally -overwrite old data regardless of whether it's actually been consumed. -In no-overwrite mode, writes will fail i.e. data will be lost, if the -number of unconsumed sub-buffers equals the total number of -sub-buffers in the channel. It should be clear that if there is no -consumer or if the consumer can't consume sub-buffers fast enought, -data will be lost in either case; the only difference is whether data -is lost from the beginning or the end of a buffer. - -As explained above, a relayfs channel is made of up one or more -per-cpu channel buffers, each implemented as a circular buffer -subdivided into one or more sub-buffers. Messages are written into -the current sub-buffer of the channel's current per-cpu buffer via the -write functions described below. Whenever a message can't fit into -the current sub-buffer, because there's no room left for it, the -client is notified via the subbuf_start() callback that a switch to a -new sub-buffer is about to occur. The client uses this callback to 1) -initialize the next sub-buffer if appropriate 2) finalize the previous -sub-buffer if appropriate and 3) return a boolean value indicating -whether or not to actually go ahead with the sub-buffer switch. - -To implement 'no-overwrite' mode, the userspace client would provide -an implementation of the subbuf_start() callback something like the -following: - -static int subbuf_start(struct rchan_buf *buf, - void *subbuf, - void *prev_subbuf, - unsigned int prev_padding) -{ - if (prev_subbuf) - *((unsigned *)prev_subbuf) = prev_padding; - - if (relay_buf_full(buf)) - return 0; - - subbuf_start_reserve(buf, sizeof(unsigned int)); - - return 1; -} - -If the current buffer is full i.e. all sub-buffers remain unconsumed, -the callback returns 0 to indicate that the buffer switch should not -occur yet i.e. until the consumer has had a chance to read the current -set of ready sub-buffers. For the relay_buf_full() function to make -sense, the consumer is reponsible for notifying relayfs when -sub-buffers have been consumed via relay_subbufs_consumed(). Any -subsequent attempts to write into the buffer will again invoke the -subbuf_start() callback with the same parameters; only when the -consumer has consumed one or more of the ready sub-buffers will -relay_buf_full() return 0, in which case the buffer switch can -continue. - -The implementation of the subbuf_start() callback for 'overwrite' mode -would be very similar: - -static int subbuf_start(struct rchan_buf *buf, - void *subbuf, - void *prev_subbuf, - unsigned int prev_padding) -{ - if (prev_subbuf) - *((unsigned *)prev_subbuf) = prev_padding; - - subbuf_start_reserve(buf, sizeof(unsigned int)); - - return 1; -} - -In this case, the relay_buf_full() check is meaningless and the -callback always returns 1, causing the buffer switch to occur -unconditionally. It's also meaningless for the client to use the -relay_subbufs_consumed() function in this mode, as it's never -consulted. - -The default subbuf_start() implementation, used if the client doesn't -define any callbacks, or doesn't define the subbuf_start() callback, -implements the simplest possible 'no-overwrite' mode i.e. it does -nothing but return 0. - -Header information can be reserved at the beginning of each sub-buffer -by calling the subbuf_start_reserve() helper function from within the -subbuf_start() callback. This reserved area can be used to store -whatever information the client wants. In the example above, room is -reserved in each sub-buffer to store the padding count for that -sub-buffer. This is filled in for the previous sub-buffer in the -subbuf_start() implementation; the padding value for the previous -sub-buffer is passed into the subbuf_start() callback along with a -pointer to the previous sub-buffer, since the padding value isn't -known until a sub-buffer is filled. The subbuf_start() callback is -also called for the first sub-buffer when the channel is opened, to -give the client a chance to reserve space in it. In this case the -previous sub-buffer pointer passed into the callback will be NULL, so -the client should check the value of the prev_subbuf pointer before -writing into the previous sub-buffer. - -Writing to a channel --------------------- - -kernel clients write data into the current cpu's channel buffer using -relay_write() or __relay_write(). relay_write() is the main logging -function - it uses local_irqsave() to protect the buffer and should be -used if you might be logging from interrupt context. If you know -you'll never be logging from interrupt context, you can use -__relay_write(), which only disables preemption. These functions -don't return a value, so you can't determine whether or not they -failed - the assumption is that you wouldn't want to check a return -value in the fast logging path anyway, and that they'll always succeed -unless the buffer is full and no-overwrite mode is being used, in -which case you can detect a failed write in the subbuf_start() -callback by calling the relay_buf_full() helper function. - -relay_reserve() is used to reserve a slot in a channel buffer which -can be written to later. This would typically be used in applications -that need to write directly into a channel buffer without having to -stage data in a temporary buffer beforehand. Because the actual write -may not happen immediately after the slot is reserved, applications -using relay_reserve() can keep a count of the number of bytes actually -written, either in space reserved in the sub-buffers themselves or as -a separate array. See the 'reserve' example in the relay-apps tarball -at http://relayfs.sourceforge.net for an example of how this can be -done. Because the write is under control of the client and is -separated from the reserve, relay_reserve() doesn't protect the buffer -at all - it's up to the client to provide the appropriate -synchronization when using relay_reserve(). - -Closing a channel ------------------ - -The client calls relay_close() when it's finished using the channel. -The channel and its associated buffers are destroyed when there are no -longer any references to any of the channel buffers. relay_flush() -forces a sub-buffer switch on all the channel buffers, and can be used -to finalize and process the last sub-buffers before the channel is -closed. - -Misc ----- - -Some applications may want to keep a channel around and re-use it -rather than open and close a new channel for each use. relay_reset() -can be used for this purpose - it resets a channel to its initial -state without reallocating channel buffer memory or destroying -existing mappings. It should however only be called when it's safe to -do so i.e. when the channel isn't currently being written to. - -Finally, there are a couple of utility callbacks that can be used for -different purposes. buf_mapped() is called whenever a channel buffer -is mmapped from user space and buf_unmapped() is called when it's -unmapped. The client can use this notification to trigger actions -within the kernel application, such as enabling/disabling logging to -the channel. - - -Resources -========= - -For news, example code, mailing list, etc. see the relayfs homepage: - - http://relayfs.sourceforge.net - - -Credits -======= - -The ideas and specs for relayfs came about as a result of discussions -on tracing involving the following: - -Michel Dagenais <michel.dagenais@polymtl.ca> -Richard Moore <richardj_moore@uk.ibm.com> -Bob Wisniewski <bob@watson.ibm.com> -Karim Yaghmour <karim@opersys.com> -Tom Zanussi <zanussi@us.ibm.com> - -Also thanks to Hubertus Franke for a lot of useful suggestions and bug -reports. diff --git a/Documentation/filesystems/romfs.txt b/Documentation/filesystems/romfs.txt index 2d2a7b2a16b..e2b07cc9120 100644 --- a/Documentation/filesystems/romfs.txt +++ b/Documentation/filesystems/romfs.txt @@ -17,8 +17,7 @@ comparison, an actual rescue disk used up 3202 blocks with ext2, while with romfs, it needed 3079 blocks. To create such a file system, you'll need a user program named -genromfs. It is available via anonymous ftp on sunsite.unc.edu and -its mirrors, in the /pub/Linux/system/recovery/ directory. +genromfs. It is available on http://romfs.sourceforge.net/ As the name suggests, romfs could be also used (space-efficiently) on various read-only media, like (E)EPROM disks if someone will have the diff --git a/Documentation/filesystems/seq_file.txt b/Documentation/filesystems/seq_file.txt new file mode 100644 index 00000000000..1fe0ccb1af5 --- /dev/null +++ b/Documentation/filesystems/seq_file.txt @@ -0,0 +1,301 @@ +The seq_file interface + + Copyright 2003 Jonathan Corbet <corbet@lwn.net> + This file is originally from the LWN.net Driver Porting series at + http://lwn.net/Articles/driver-porting/ + + +There are numerous ways for a device driver (or other kernel component) to +provide information to the user or system administrator. One useful +technique is the creation of virtual files, in debugfs, /proc or elsewhere. +Virtual files can provide human-readable output that is easy to get at +without any special utility programs; they can also make life easier for +script writers. It is not surprising that the use of virtual files has +grown over the years. + +Creating those files correctly has always been a bit of a challenge, +however. It is not that hard to make a virtual file which returns a +string. But life gets trickier if the output is long - anything greater +than an application is likely to read in a single operation. Handling +multiple reads (and seeks) requires careful attention to the reader's +position within the virtual file - that position is, likely as not, in the +middle of a line of output. The kernel has traditionally had a number of +implementations that got this wrong. + +The 2.6 kernel contains a set of functions (implemented by Alexander Viro) +which are designed to make it easy for virtual file creators to get it +right. + +The seq_file interface is available via <linux/seq_file.h>. There are +three aspects to seq_file: + + * An iterator interface which lets a virtual file implementation + step through the objects it is presenting. + + * Some utility functions for formatting objects for output without + needing to worry about things like output buffers. + + * A set of canned file_operations which implement most operations on + the virtual file. + +We'll look at the seq_file interface via an extremely simple example: a +loadable module which creates a file called /proc/sequence. The file, when +read, simply produces a set of increasing integer values, one per line. The +sequence will continue until the user loses patience and finds something +better to do. The file is seekable, in that one can do something like the +following: + + dd if=/proc/sequence of=out1 count=1 + dd if=/proc/sequence skip=1 of=out2 count=1 + +Then concatenate the output files out1 and out2 and get the right +result. Yes, it is a thoroughly useless module, but the point is to show +how the mechanism works without getting lost in other details. (Those +wanting to see the full source for this module can find it at +http://lwn.net/Articles/22359/). + +Deprecated create_proc_entry + +Note that the above article uses create_proc_entry which was removed in +kernel 3.10. Current versions require the following update + +- entry = create_proc_entry("sequence", 0, NULL); +- if (entry) +- entry->proc_fops = &ct_file_ops; ++ entry = proc_create("sequence", 0, NULL, &ct_file_ops); + +The iterator interface + +Modules implementing a virtual file with seq_file must implement a simple +iterator object that allows stepping through the data of interest. +Iterators must be able to move to a specific position - like the file they +implement - but the interpretation of that position is up to the iterator +itself. A seq_file implementation that is formatting firewall rules, for +example, could interpret position N as the Nth rule in the chain. +Positioning can thus be done in whatever way makes the most sense for the +generator of the data, which need not be aware of how a position translates +to an offset in the virtual file. The one obvious exception is that a +position of zero should indicate the beginning of the file. + +The /proc/sequence iterator just uses the count of the next number it +will output as its position. + +Four functions must be implemented to make the iterator work. The first, +called start() takes a position as an argument and returns an iterator +which will start reading at that position. For our simple sequence example, +the start() function looks like: + + static void *ct_seq_start(struct seq_file *s, loff_t *pos) + { + loff_t *spos = kmalloc(sizeof(loff_t), GFP_KERNEL); + if (! spos) + return NULL; + *spos = *pos; + return spos; + } + +The entire data structure for this iterator is a single loff_t value +holding the current position. There is no upper bound for the sequence +iterator, but that will not be the case for most other seq_file +implementations; in most cases the start() function should check for a +"past end of file" condition and return NULL if need be. + +For more complicated applications, the private field of the seq_file +structure can be used. There is also a special value which can be returned +by the start() function called SEQ_START_TOKEN; it can be used if you wish +to instruct your show() function (described below) to print a header at the +top of the output. SEQ_START_TOKEN should only be used if the offset is +zero, however. + +The next function to implement is called, amazingly, next(); its job is to +move the iterator forward to the next position in the sequence. The +example module can simply increment the position by one; more useful +modules will do what is needed to step through some data structure. The +next() function returns a new iterator, or NULL if the sequence is +complete. Here's the example version: + + static void *ct_seq_next(struct seq_file *s, void *v, loff_t *pos) + { + loff_t *spos = v; + *pos = ++*spos; + return spos; + } + +The stop() function is called when iteration is complete; its job, of +course, is to clean up. If dynamic memory is allocated for the iterator, +stop() is the place to free it. + + static void ct_seq_stop(struct seq_file *s, void *v) + { + kfree(v); + } + +Finally, the show() function should format the object currently pointed to +by the iterator for output. The example module's show() function is: + + static int ct_seq_show(struct seq_file *s, void *v) + { + loff_t *spos = v; + seq_printf(s, "%lld\n", (long long)*spos); + return 0; + } + +If all is well, the show() function should return zero. A negative error +code in the usual manner indicates that something went wrong; it will be +passed back to user space. This function can also return SEQ_SKIP, which +causes the current item to be skipped; if the show() function has already +generated output before returning SEQ_SKIP, that output will be dropped. + +We will look at seq_printf() in a moment. But first, the definition of the +seq_file iterator is finished by creating a seq_operations structure with +the four functions we have just defined: + + static const struct seq_operations ct_seq_ops = { + .start = ct_seq_start, + .next = ct_seq_next, + .stop = ct_seq_stop, + .show = ct_seq_show + }; + +This structure will be needed to tie our iterator to the /proc file in +a little bit. + +It's worth noting that the iterator value returned by start() and +manipulated by the other functions is considered to be completely opaque by +the seq_file code. It can thus be anything that is useful in stepping +through the data to be output. Counters can be useful, but it could also be +a direct pointer into an array or linked list. Anything goes, as long as +the programmer is aware that things can happen between calls to the +iterator function. However, the seq_file code (by design) will not sleep +between the calls to start() and stop(), so holding a lock during that time +is a reasonable thing to do. The seq_file code will also avoid taking any +other locks while the iterator is active. + + +Formatted output + +The seq_file code manages positioning within the output created by the +iterator and getting it into the user's buffer. But, for that to work, that +output must be passed to the seq_file code. Some utility functions have +been defined which make this task easy. + +Most code will simply use seq_printf(), which works pretty much like +printk(), but which requires the seq_file pointer as an argument. It is +common to ignore the return value from seq_printf(), but a function +producing complicated output may want to check that value and quit if +something non-zero is returned; an error return means that the seq_file +buffer has been filled and further output will be discarded. + +For straight character output, the following functions may be used: + + int seq_putc(struct seq_file *m, char c); + int seq_puts(struct seq_file *m, const char *s); + int seq_escape(struct seq_file *m, const char *s, const char *esc); + +The first two output a single character and a string, just like one would +expect. seq_escape() is like seq_puts(), except that any character in s +which is in the string esc will be represented in octal form in the output. + +There is also a pair of functions for printing filenames: + + int seq_path(struct seq_file *m, struct path *path, char *esc); + int seq_path_root(struct seq_file *m, struct path *path, + struct path *root, char *esc) + +Here, path indicates the file of interest, and esc is a set of characters +which should be escaped in the output. A call to seq_path() will output +the path relative to the current process's filesystem root. If a different +root is desired, it can be used with seq_path_root(). Note that, if it +turns out that path cannot be reached from root, the value of root will be +changed in seq_file_root() to a root which *does* work. + + +Making it all work + +So far, we have a nice set of functions which can produce output within the +seq_file system, but we have not yet turned them into a file that a user +can see. Creating a file within the kernel requires, of course, the +creation of a set of file_operations which implement the operations on that +file. The seq_file interface provides a set of canned operations which do +most of the work. The virtual file author still must implement the open() +method, however, to hook everything up. The open function is often a single +line, as in the example module: + + static int ct_open(struct inode *inode, struct file *file) + { + return seq_open(file, &ct_seq_ops); + } + +Here, the call to seq_open() takes the seq_operations structure we created +before, and gets set up to iterate through the virtual file. + +On a successful open, seq_open() stores the struct seq_file pointer in +file->private_data. If you have an application where the same iterator can +be used for more than one file, you can store an arbitrary pointer in the +private field of the seq_file structure; that value can then be retrieved +by the iterator functions. + +The other operations of interest - read(), llseek(), and release() - are +all implemented by the seq_file code itself. So a virtual file's +file_operations structure will look like: + + static const struct file_operations ct_file_ops = { + .owner = THIS_MODULE, + .open = ct_open, + .read = seq_read, + .llseek = seq_lseek, + .release = seq_release + }; + +There is also a seq_release_private() which passes the contents of the +seq_file private field to kfree() before releasing the structure. + +The final step is the creation of the /proc file itself. In the example +code, that is done in the initialization code in the usual way: + + static int ct_init(void) + { + struct proc_dir_entry *entry; + + proc_create("sequence", 0, NULL, &ct_file_ops); + return 0; + } + + module_init(ct_init); + +And that is pretty much it. + + +seq_list + +If your file will be iterating through a linked list, you may find these +routines useful: + + struct list_head *seq_list_start(struct list_head *head, + loff_t pos); + struct list_head *seq_list_start_head(struct list_head *head, + loff_t pos); + struct list_head *seq_list_next(void *v, struct list_head *head, + loff_t *ppos); + +These helpers will interpret pos as a position within the list and iterate +accordingly. Your start() and next() functions need only invoke the +seq_list_* helpers with a pointer to the appropriate list_head structure. + + +The extra-simple version + +For extremely simple virtual files, there is an even easier interface. A +module can define only the show() function, which should create all the +output that the virtual file will contain. The file's open() method then +calls: + + int single_open(struct file *file, + int (*show)(struct seq_file *m, void *p), + void *data); + +When output time comes, the show() function will be called once. The data +value given to single_open() can be found in the private field of the +seq_file structure. When using single_open(), the programmer should use +single_release() instead of seq_release() in the file_operations structure +to avoid a memory leak. diff --git a/Documentation/filesystems/sharedsubtree.txt b/Documentation/filesystems/sharedsubtree.txt new file mode 100644 index 00000000000..32a173dd315 --- /dev/null +++ b/Documentation/filesystems/sharedsubtree.txt @@ -0,0 +1,939 @@ +Shared Subtrees +--------------- + +Contents: + 1) Overview + 2) Features + 3) Setting mount states + 4) Use-case + 5) Detailed semantics + 6) Quiz + 7) FAQ + 8) Implementation + + +1) Overview +----------- + +Consider the following situation: + +A process wants to clone its own namespace, but still wants to access the CD +that got mounted recently. Shared subtree semantics provide the necessary +mechanism to accomplish the above. + +It provides the necessary building blocks for features like per-user-namespace +and versioned filesystem. + +2) Features +----------- + +Shared subtree provides four different flavors of mounts; struct vfsmount to be +precise + + a. shared mount + b. slave mount + c. private mount + d. unbindable mount + + +2a) A shared mount can be replicated to as many mountpoints and all the +replicas continue to be exactly same. + + Here is an example: + + Let's say /mnt has a mount that is shared. + mount --make-shared /mnt + + Note: mount(8) command now supports the --make-shared flag, + so the sample 'smount' program is no longer needed and has been + removed. + + # mount --bind /mnt /tmp + The above command replicates the mount at /mnt to the mountpoint /tmp + and the contents of both the mounts remain identical. + + #ls /mnt + a b c + + #ls /tmp + a b c + + Now let's say we mount a device at /tmp/a + # mount /dev/sd0 /tmp/a + + #ls /tmp/a + t1 t2 t3 + + #ls /mnt/a + t1 t2 t3 + + Note that the mount has propagated to the mount at /mnt as well. + + And the same is true even when /dev/sd0 is mounted on /mnt/a. The + contents will be visible under /tmp/a too. + + +2b) A slave mount is like a shared mount except that mount and umount events + only propagate towards it. + + All slave mounts have a master mount which is a shared. + + Here is an example: + + Let's say /mnt has a mount which is shared. + # mount --make-shared /mnt + + Let's bind mount /mnt to /tmp + # mount --bind /mnt /tmp + + the new mount at /tmp becomes a shared mount and it is a replica of + the mount at /mnt. + + Now let's make the mount at /tmp; a slave of /mnt + # mount --make-slave /tmp + + let's mount /dev/sd0 on /mnt/a + # mount /dev/sd0 /mnt/a + + #ls /mnt/a + t1 t2 t3 + + #ls /tmp/a + t1 t2 t3 + + Note the mount event has propagated to the mount at /tmp + + However let's see what happens if we mount something on the mount at /tmp + + # mount /dev/sd1 /tmp/b + + #ls /tmp/b + s1 s2 s3 + + #ls /mnt/b + + Note how the mount event has not propagated to the mount at + /mnt + + +2c) A private mount does not forward or receive propagation. + + This is the mount we are familiar with. Its the default type. + + +2d) A unbindable mount is a unbindable private mount + + let's say we have a mount at /mnt and we make is unbindable + + # mount --make-unbindable /mnt + + Let's try to bind mount this mount somewhere else. + # mount --bind /mnt /tmp + mount: wrong fs type, bad option, bad superblock on /mnt, + or too many mounted file systems + + Binding a unbindable mount is a invalid operation. + + +3) Setting mount states + + The mount command (util-linux package) can be used to set mount + states: + + mount --make-shared mountpoint + mount --make-slave mountpoint + mount --make-private mountpoint + mount --make-unbindable mountpoint + + +4) Use cases +------------ + + A) A process wants to clone its own namespace, but still wants to + access the CD that got mounted recently. + + Solution: + + The system administrator can make the mount at /cdrom shared + mount --bind /cdrom /cdrom + mount --make-shared /cdrom + + Now any process that clones off a new namespace will have a + mount at /cdrom which is a replica of the same mount in the + parent namespace. + + So when a CD is inserted and mounted at /cdrom that mount gets + propagated to the other mount at /cdrom in all the other clone + namespaces. + + B) A process wants its mounts invisible to any other process, but + still be able to see the other system mounts. + + Solution: + + To begin with, the administrator can mark the entire mount tree + as shareable. + + mount --make-rshared / + + A new process can clone off a new namespace. And mark some part + of its namespace as slave + + mount --make-rslave /myprivatetree + + Hence forth any mounts within the /myprivatetree done by the + process will not show up in any other namespace. However mounts + done in the parent namespace under /myprivatetree still shows + up in the process's namespace. + + + Apart from the above semantics this feature provides the + building blocks to solve the following problems: + + C) Per-user namespace + + The above semantics allows a way to share mounts across + namespaces. But namespaces are associated with processes. If + namespaces are made first class objects with user API to + associate/disassociate a namespace with userid, then each user + could have his/her own namespace and tailor it to his/her + requirements. Offcourse its needs support from PAM. + + D) Versioned files + + If the entire mount tree is visible at multiple locations, then + a underlying versioning file system can return different + version of the file depending on the path used to access that + file. + + An example is: + + mount --make-shared / + mount --rbind / /view/v1 + mount --rbind / /view/v2 + mount --rbind / /view/v3 + mount --rbind / /view/v4 + + and if /usr has a versioning filesystem mounted, then that + mount appears at /view/v1/usr, /view/v2/usr, /view/v3/usr and + /view/v4/usr too + + A user can request v3 version of the file /usr/fs/namespace.c + by accessing /view/v3/usr/fs/namespace.c . The underlying + versioning filesystem can then decipher that v3 version of the + filesystem is being requested and return the corresponding + inode. + +5) Detailed semantics: +------------------- + The section below explains the detailed semantics of + bind, rbind, move, mount, umount and clone-namespace operations. + + Note: the word 'vfsmount' and the noun 'mount' have been used + to mean the same thing, throughout this document. + +5a) Mount states + + A given mount can be in one of the following states + 1) shared + 2) slave + 3) shared and slave + 4) private + 5) unbindable + + A 'propagation event' is defined as event generated on a vfsmount + that leads to mount or unmount actions in other vfsmounts. + + A 'peer group' is defined as a group of vfsmounts that propagate + events to each other. + + (1) Shared mounts + + A 'shared mount' is defined as a vfsmount that belongs to a + 'peer group'. + + For example: + mount --make-shared /mnt + mount --bind /mnt /tmp + + The mount at /mnt and that at /tmp are both shared and belong + to the same peer group. Anything mounted or unmounted under + /mnt or /tmp reflect in all the other mounts of its peer + group. + + + (2) Slave mounts + + A 'slave mount' is defined as a vfsmount that receives + propagation events and does not forward propagation events. + + A slave mount as the name implies has a master mount from which + mount/unmount events are received. Events do not propagate from + the slave mount to the master. Only a shared mount can be made + a slave by executing the following command + + mount --make-slave mount + + A shared mount that is made as a slave is no more shared unless + modified to become shared. + + (3) Shared and Slave + + A vfsmount can be both shared as well as slave. This state + indicates that the mount is a slave of some vfsmount, and + has its own peer group too. This vfsmount receives propagation + events from its master vfsmount, and also forwards propagation + events to its 'peer group' and to its slave vfsmounts. + + Strictly speaking, the vfsmount is shared having its own + peer group, and this peer-group is a slave of some other + peer group. + + Only a slave vfsmount can be made as 'shared and slave' by + either executing the following command + mount --make-shared mount + or by moving the slave vfsmount under a shared vfsmount. + + (4) Private mount + + A 'private mount' is defined as vfsmount that does not + receive or forward any propagation events. + + (5) Unbindable mount + + A 'unbindable mount' is defined as vfsmount that does not + receive or forward any propagation events and cannot + be bind mounted. + + + State diagram: + The state diagram below explains the state transition of a mount, + in response to various commands. + ------------------------------------------------------------------------ + | |make-shared | make-slave | make-private |make-unbindab| + --------------|------------|--------------|--------------|-------------| + |shared |shared |*slave/private| private | unbindable | + | | | | | | + |-------------|------------|--------------|--------------|-------------| + |slave |shared | **slave | private | unbindable | + | |and slave | | | | + |-------------|------------|--------------|--------------|-------------| + |shared |shared | slave | private | unbindable | + |and slave |and slave | | | | + |-------------|------------|--------------|--------------|-------------| + |private |shared | **private | private | unbindable | + |-------------|------------|--------------|--------------|-------------| + |unbindable |shared |**unbindable | private | unbindable | + ------------------------------------------------------------------------ + + * if the shared mount is the only mount in its peer group, making it + slave, makes it private automatically. Note that there is no master to + which it can be slaved to. + + ** slaving a non-shared mount has no effect on the mount. + + Apart from the commands listed below, the 'move' operation also changes + the state of a mount depending on type of the destination mount. Its + explained in section 5d. + +5b) Bind semantics + + Consider the following command + + mount --bind A/a B/b + + where 'A' is the source mount, 'a' is the dentry in the mount 'A', 'B' + is the destination mount and 'b' is the dentry in the destination mount. + + The outcome depends on the type of mount of 'A' and 'B'. The table + below contains quick reference. + --------------------------------------------------------------------------- + | BIND MOUNT OPERATION | + |************************************************************************** + |source(A)->| shared | private | slave | unbindable | + | dest(B) | | | | | + | | | | | | | + | v | | | | | + |************************************************************************** + | shared | shared | shared | shared & slave | invalid | + | | | | | | + |non-shared| shared | private | slave | invalid | + *************************************************************************** + + Details: + + 1. 'A' is a shared mount and 'B' is a shared mount. A new mount 'C' + which is clone of 'A', is created. Its root dentry is 'a' . 'C' is + mounted on mount 'B' at dentry 'b'. Also new mount 'C1', 'C2', 'C3' ... + are created and mounted at the dentry 'b' on all mounts where 'B' + propagates to. A new propagation tree containing 'C1',..,'Cn' is + created. This propagation tree is identical to the propagation tree of + 'B'. And finally the peer-group of 'C' is merged with the peer group + of 'A'. + + 2. 'A' is a private mount and 'B' is a shared mount. A new mount 'C' + which is clone of 'A', is created. Its root dentry is 'a'. 'C' is + mounted on mount 'B' at dentry 'b'. Also new mount 'C1', 'C2', 'C3' ... + are created and mounted at the dentry 'b' on all mounts where 'B' + propagates to. A new propagation tree is set containing all new mounts + 'C', 'C1', .., 'Cn' with exactly the same configuration as the + propagation tree for 'B'. + + 3. 'A' is a slave mount of mount 'Z' and 'B' is a shared mount. A new + mount 'C' which is clone of 'A', is created. Its root dentry is 'a' . + 'C' is mounted on mount 'B' at dentry 'b'. Also new mounts 'C1', 'C2', + 'C3' ... are created and mounted at the dentry 'b' on all mounts where + 'B' propagates to. A new propagation tree containing the new mounts + 'C','C1',.. 'Cn' is created. This propagation tree is identical to the + propagation tree for 'B'. And finally the mount 'C' and its peer group + is made the slave of mount 'Z'. In other words, mount 'C' is in the + state 'slave and shared'. + + 4. 'A' is a unbindable mount and 'B' is a shared mount. This is a + invalid operation. + + 5. 'A' is a private mount and 'B' is a non-shared(private or slave or + unbindable) mount. A new mount 'C' which is clone of 'A', is created. + Its root dentry is 'a'. 'C' is mounted on mount 'B' at dentry 'b'. + + 6. 'A' is a shared mount and 'B' is a non-shared mount. A new mount 'C' + which is a clone of 'A' is created. Its root dentry is 'a'. 'C' is + mounted on mount 'B' at dentry 'b'. 'C' is made a member of the + peer-group of 'A'. + + 7. 'A' is a slave mount of mount 'Z' and 'B' is a non-shared mount. A + new mount 'C' which is a clone of 'A' is created. Its root dentry is + 'a'. 'C' is mounted on mount 'B' at dentry 'b'. Also 'C' is set as a + slave mount of 'Z'. In other words 'A' and 'C' are both slave mounts of + 'Z'. All mount/unmount events on 'Z' propagates to 'A' and 'C'. But + mount/unmount on 'A' do not propagate anywhere else. Similarly + mount/unmount on 'C' do not propagate anywhere else. + + 8. 'A' is a unbindable mount and 'B' is a non-shared mount. This is a + invalid operation. A unbindable mount cannot be bind mounted. + +5c) Rbind semantics + + rbind is same as bind. Bind replicates the specified mount. Rbind + replicates all the mounts in the tree belonging to the specified mount. + Rbind mount is bind mount applied to all the mounts in the tree. + + If the source tree that is rbind has some unbindable mounts, + then the subtree under the unbindable mount is pruned in the new + location. + + eg: let's say we have the following mount tree. + + A + / \ + B C + / \ / \ + D E F G + + Let's say all the mount except the mount C in the tree are + of a type other than unbindable. + + If this tree is rbound to say Z + + We will have the following tree at the new location. + + Z + | + A' + / + B' Note how the tree under C is pruned + / \ in the new location. + D' E' + + + +5d) Move semantics + + Consider the following command + + mount --move A B/b + + where 'A' is the source mount, 'B' is the destination mount and 'b' is + the dentry in the destination mount. + + The outcome depends on the type of the mount of 'A' and 'B'. The table + below is a quick reference. + --------------------------------------------------------------------------- + | MOVE MOUNT OPERATION | + |************************************************************************** + | source(A)->| shared | private | slave | unbindable | + | dest(B) | | | | | + | | | | | | | + | v | | | | | + |************************************************************************** + | shared | shared | shared |shared and slave| invalid | + | | | | | | + |non-shared| shared | private | slave | unbindable | + *************************************************************************** + NOTE: moving a mount residing under a shared mount is invalid. + + Details follow: + + 1. 'A' is a shared mount and 'B' is a shared mount. The mount 'A' is + mounted on mount 'B' at dentry 'b'. Also new mounts 'A1', 'A2'...'An' + are created and mounted at dentry 'b' on all mounts that receive + propagation from mount 'B'. A new propagation tree is created in the + exact same configuration as that of 'B'. This new propagation tree + contains all the new mounts 'A1', 'A2'... 'An'. And this new + propagation tree is appended to the already existing propagation tree + of 'A'. + + 2. 'A' is a private mount and 'B' is a shared mount. The mount 'A' is + mounted on mount 'B' at dentry 'b'. Also new mount 'A1', 'A2'... 'An' + are created and mounted at dentry 'b' on all mounts that receive + propagation from mount 'B'. The mount 'A' becomes a shared mount and a + propagation tree is created which is identical to that of + 'B'. This new propagation tree contains all the new mounts 'A1', + 'A2'... 'An'. + + 3. 'A' is a slave mount of mount 'Z' and 'B' is a shared mount. The + mount 'A' is mounted on mount 'B' at dentry 'b'. Also new mounts 'A1', + 'A2'... 'An' are created and mounted at dentry 'b' on all mounts that + receive propagation from mount 'B'. A new propagation tree is created + in the exact same configuration as that of 'B'. This new propagation + tree contains all the new mounts 'A1', 'A2'... 'An'. And this new + propagation tree is appended to the already existing propagation tree of + 'A'. Mount 'A' continues to be the slave mount of 'Z' but it also + becomes 'shared'. + + 4. 'A' is a unbindable mount and 'B' is a shared mount. The operation + is invalid. Because mounting anything on the shared mount 'B' can + create new mounts that get mounted on the mounts that receive + propagation from 'B'. And since the mount 'A' is unbindable, cloning + it to mount at other mountpoints is not possible. + + 5. 'A' is a private mount and 'B' is a non-shared(private or slave or + unbindable) mount. The mount 'A' is mounted on mount 'B' at dentry 'b'. + + 6. 'A' is a shared mount and 'B' is a non-shared mount. The mount 'A' + is mounted on mount 'B' at dentry 'b'. Mount 'A' continues to be a + shared mount. + + 7. 'A' is a slave mount of mount 'Z' and 'B' is a non-shared mount. + The mount 'A' is mounted on mount 'B' at dentry 'b'. Mount 'A' + continues to be a slave mount of mount 'Z'. + + 8. 'A' is a unbindable mount and 'B' is a non-shared mount. The mount + 'A' is mounted on mount 'B' at dentry 'b'. Mount 'A' continues to be a + unbindable mount. + +5e) Mount semantics + + Consider the following command + + mount device B/b + + 'B' is the destination mount and 'b' is the dentry in the destination + mount. + + The above operation is the same as bind operation with the exception + that the source mount is always a private mount. + + +5f) Unmount semantics + + Consider the following command + + umount A + + where 'A' is a mount mounted on mount 'B' at dentry 'b'. + + If mount 'B' is shared, then all most-recently-mounted mounts at dentry + 'b' on mounts that receive propagation from mount 'B' and does not have + sub-mounts within them are unmounted. + + Example: Let's say 'B1', 'B2', 'B3' are shared mounts that propagate to + each other. + + let's say 'A1', 'A2', 'A3' are first mounted at dentry 'b' on mount + 'B1', 'B2' and 'B3' respectively. + + let's say 'C1', 'C2', 'C3' are next mounted at the same dentry 'b' on + mount 'B1', 'B2' and 'B3' respectively. + + if 'C1' is unmounted, all the mounts that are most-recently-mounted on + 'B1' and on the mounts that 'B1' propagates-to are unmounted. + + 'B1' propagates to 'B2' and 'B3'. And the most recently mounted mount + on 'B2' at dentry 'b' is 'C2', and that of mount 'B3' is 'C3'. + + So all 'C1', 'C2' and 'C3' should be unmounted. + + If any of 'C2' or 'C3' has some child mounts, then that mount is not + unmounted, but all other mounts are unmounted. However if 'C1' is told + to be unmounted and 'C1' has some sub-mounts, the umount operation is + failed entirely. + +5g) Clone Namespace + + A cloned namespace contains all the mounts as that of the parent + namespace. + + Let's say 'A' and 'B' are the corresponding mounts in the parent and the + child namespace. + + If 'A' is shared, then 'B' is also shared and 'A' and 'B' propagate to + each other. + + If 'A' is a slave mount of 'Z', then 'B' is also the slave mount of + 'Z'. + + If 'A' is a private mount, then 'B' is a private mount too. + + If 'A' is unbindable mount, then 'B' is a unbindable mount too. + + +6) Quiz + + A. What is the result of the following command sequence? + + mount --bind /mnt /mnt + mount --make-shared /mnt + mount --bind /mnt /tmp + mount --move /tmp /mnt/1 + + what should be the contents of /mnt /mnt/1 /mnt/1/1 should be? + Should they all be identical? or should /mnt and /mnt/1 be + identical only? + + + B. What is the result of the following command sequence? + + mount --make-rshared / + mkdir -p /v/1 + mount --rbind / /v/1 + + what should be the content of /v/1/v/1 be? + + + C. What is the result of the following command sequence? + + mount --bind /mnt /mnt + mount --make-shared /mnt + mkdir -p /mnt/1/2/3 /mnt/1/test + mount --bind /mnt/1 /tmp + mount --make-slave /mnt + mount --make-shared /mnt + mount --bind /mnt/1/2 /tmp1 + mount --make-slave /mnt + + At this point we have the first mount at /tmp and + its root dentry is 1. Let's call this mount 'A' + And then we have a second mount at /tmp1 with root + dentry 2. Let's call this mount 'B' + Next we have a third mount at /mnt with root dentry + mnt. Let's call this mount 'C' + + 'B' is the slave of 'A' and 'C' is a slave of 'B' + A -> B -> C + + at this point if we execute the following command + + mount --bind /bin /tmp/test + + The mount is attempted on 'A' + + will the mount propagate to 'B' and 'C' ? + + what would be the contents of + /mnt/1/test be? + +7) FAQ + + Q1. Why is bind mount needed? How is it different from symbolic links? + symbolic links can get stale if the destination mount gets + unmounted or moved. Bind mounts continue to exist even if the + other mount is unmounted or moved. + + Q2. Why can't the shared subtree be implemented using exportfs? + + exportfs is a heavyweight way of accomplishing part of what + shared subtree can do. I cannot imagine a way to implement the + semantics of slave mount using exportfs? + + Q3 Why is unbindable mount needed? + + Let's say we want to replicate the mount tree at multiple + locations within the same subtree. + + if one rbind mounts a tree within the same subtree 'n' times + the number of mounts created is an exponential function of 'n'. + Having unbindable mount can help prune the unneeded bind + mounts. Here is a example. + + step 1: + let's say the root tree has just two directories with + one vfsmount. + root + / \ + tmp usr + + And we want to replicate the tree at multiple + mountpoints under /root/tmp + + step2: + mount --make-shared /root + + mkdir -p /tmp/m1 + + mount --rbind /root /tmp/m1 + + the new tree now looks like this: + + root + / \ + tmp usr + / + m1 + / \ + tmp usr + / + m1 + + it has two vfsmounts + + step3: + mkdir -p /tmp/m2 + mount --rbind /root /tmp/m2 + + the new tree now looks like this: + + root + / \ + tmp usr + / \ + m1 m2 + / \ / \ + tmp usr tmp usr + / \ / + m1 m2 m1 + / \ / \ + tmp usr tmp usr + / / \ + m1 m1 m2 + / \ + tmp usr + / \ + m1 m2 + + it has 6 vfsmounts + + step 4: + mkdir -p /tmp/m3 + mount --rbind /root /tmp/m3 + + I won't draw the tree..but it has 24 vfsmounts + + + at step i the number of vfsmounts is V[i] = i*V[i-1]. + This is an exponential function. And this tree has way more + mounts than what we really needed in the first place. + + One could use a series of umount at each step to prune + out the unneeded mounts. But there is a better solution. + Unclonable mounts come in handy here. + + step 1: + let's say the root tree has just two directories with + one vfsmount. + root + / \ + tmp usr + + How do we set up the same tree at multiple locations under + /root/tmp + + step2: + mount --bind /root/tmp /root/tmp + + mount --make-rshared /root + mount --make-unbindable /root/tmp + + mkdir -p /tmp/m1 + + mount --rbind /root /tmp/m1 + + the new tree now looks like this: + + root + / \ + tmp usr + / + m1 + / \ + tmp usr + + step3: + mkdir -p /tmp/m2 + mount --rbind /root /tmp/m2 + + the new tree now looks like this: + + root + / \ + tmp usr + / \ + m1 m2 + / \ / \ + tmp usr tmp usr + + step4: + + mkdir -p /tmp/m3 + mount --rbind /root /tmp/m3 + + the new tree now looks like this: + + root + / \ + tmp usr + / \ \ + m1 m2 m3 + / \ / \ / \ + tmp usr tmp usr tmp usr + +8) Implementation + +8A) Datastructure + + 4 new fields are introduced to struct vfsmount + ->mnt_share + ->mnt_slave_list + ->mnt_slave + ->mnt_master + + ->mnt_share links together all the mount to/from which this vfsmount + send/receives propagation events. + + ->mnt_slave_list links all the mounts to which this vfsmount propagates + to. + + ->mnt_slave links together all the slaves that its master vfsmount + propagates to. + + ->mnt_master points to the master vfsmount from which this vfsmount + receives propagation. + + ->mnt_flags takes two more flags to indicate the propagation status of + the vfsmount. MNT_SHARE indicates that the vfsmount is a shared + vfsmount. MNT_UNCLONABLE indicates that the vfsmount cannot be + replicated. + + All the shared vfsmounts in a peer group form a cyclic list through + ->mnt_share. + + All vfsmounts with the same ->mnt_master form on a cyclic list anchored + in ->mnt_master->mnt_slave_list and going through ->mnt_slave. + + ->mnt_master can point to arbitrary (and possibly different) members + of master peer group. To find all immediate slaves of a peer group + you need to go through _all_ ->mnt_slave_list of its members. + Conceptually it's just a single set - distribution among the + individual lists does not affect propagation or the way propagation + tree is modified by operations. + + All vfsmounts in a peer group have the same ->mnt_master. If it is + non-NULL, they form a contiguous (ordered) segment of slave list. + + A example propagation tree looks as shown in the figure below. + [ NOTE: Though it looks like a forest, if we consider all the shared + mounts as a conceptual entity called 'pnode', it becomes a tree] + + + A <--> B <--> C <---> D + /|\ /| |\ + / F G J K H I + / + E<-->K + /|\ + M L N + + In the above figure A,B,C and D all are shared and propagate to each + other. 'A' has got 3 slave mounts 'E' 'F' and 'G' 'C' has got 2 slave + mounts 'J' and 'K' and 'D' has got two slave mounts 'H' and 'I'. + 'E' is also shared with 'K' and they propagate to each other. And + 'K' has 3 slaves 'M', 'L' and 'N' + + A's ->mnt_share links with the ->mnt_share of 'B' 'C' and 'D' + + A's ->mnt_slave_list links with ->mnt_slave of 'E', 'K', 'F' and 'G' + + E's ->mnt_share links with ->mnt_share of K + 'E', 'K', 'F', 'G' have their ->mnt_master point to struct + vfsmount of 'A' + 'M', 'L', 'N' have their ->mnt_master point to struct vfsmount of 'K' + K's ->mnt_slave_list links with ->mnt_slave of 'M', 'L' and 'N' + + C's ->mnt_slave_list links with ->mnt_slave of 'J' and 'K' + J and K's ->mnt_master points to struct vfsmount of C + and finally D's ->mnt_slave_list links with ->mnt_slave of 'H' and 'I' + 'H' and 'I' have their ->mnt_master pointing to struct vfsmount of 'D'. + + + NOTE: The propagation tree is orthogonal to the mount tree. + +8B Locking: + + ->mnt_share, ->mnt_slave, ->mnt_slave_list, ->mnt_master are protected + by namespace_sem (exclusive for modifications, shared for reading). + + Normally we have ->mnt_flags modifications serialized by vfsmount_lock. + There are two exceptions: do_add_mount() and clone_mnt(). + The former modifies a vfsmount that has not been visible in any shared + data structures yet. + The latter holds namespace_sem and the only references to vfsmount + are in lists that can't be traversed without namespace_sem. + +8C Algorithm: + + The crux of the implementation resides in rbind/move operation. + + The overall algorithm breaks the operation into 3 phases: (look at + attach_recursive_mnt() and propagate_mnt()) + + 1. prepare phase. + 2. commit phases. + 3. abort phases. + + Prepare phase: + + for each mount in the source tree: + a) Create the necessary number of mount trees to + be attached to each of the mounts that receive + propagation from the destination mount. + b) Do not attach any of the trees to its destination. + However note down its ->mnt_parent and ->mnt_mountpoint + c) Link all the new mounts to form a propagation tree that + is identical to the propagation tree of the destination + mount. + + If this phase is successful, there should be 'n' new + propagation trees; where 'n' is the number of mounts in the + source tree. Go to the commit phase + + Also there should be 'm' new mount trees, where 'm' is + the number of mounts to which the destination mount + propagates to. + + if any memory allocations fail, go to the abort phase. + + Commit phase + attach each of the mount trees to their corresponding + destination mounts. + + Abort phase + delete all the newly created trees. + + NOTE: all the propagation related functionality resides in the file + pnode.c + + +------------------------------------------------------------------------ + +version 0.1 (created the initial document, Ram Pai linuxram@us.ibm.com) +version 0.2 (Incorporated comments from Al Viro) diff --git a/Documentation/filesystems/smbfs.txt b/Documentation/filesystems/smbfs.txt deleted file mode 100644 index f673ef0de0f..00000000000 --- a/Documentation/filesystems/smbfs.txt +++ /dev/null @@ -1,8 +0,0 @@ -Smbfs is a filesystem that implements the SMB protocol, which is the -protocol used by Windows for Workgroups, Windows 95 and Windows NT. -Smbfs was inspired by Samba, the program written by Andrew Tridgell -that turns any Unix host into a file server for DOS or Windows clients. - -Smbfs is a SMB client, but uses parts of samba for it's operation. For -more info on samba, including documentation, please go to -http://www.samba.org/ and then on to your nearest mirror. diff --git a/Documentation/filesystems/spufs.txt b/Documentation/filesystems/spufs.txt new file mode 100644 index 00000000000..1343d118a9b --- /dev/null +++ b/Documentation/filesystems/spufs.txt @@ -0,0 +1,521 @@ +SPUFS(2) Linux Programmer's Manual SPUFS(2) + + + +NAME + spufs - the SPU file system + + +DESCRIPTION + The SPU file system is used on PowerPC machines that implement the Cell + Broadband Engine Architecture in order to access Synergistic Processor + Units (SPUs). + + The file system provides a name space similar to posix shared memory or + message queues. Users that have write permissions on the file system + can use spu_create(2) to establish SPU contexts in the spufs root. + + Every SPU context is represented by a directory containing a predefined + set of files. These files can be used for manipulating the state of the + logical SPU. Users can change permissions on those files, but not actu- + ally add or remove files. + + +MOUNT OPTIONS + uid=<uid> + set the user owning the mount point, the default is 0 (root). + + gid=<gid> + set the group owning the mount point, the default is 0 (root). + + +FILES + The files in spufs mostly follow the standard behavior for regular sys- + tem calls like read(2) or write(2), but often support only a subset of + the operations supported on regular file systems. This list details the + supported operations and the deviations from the behaviour in the + respective man pages. + + All files that support the read(2) operation also support readv(2) and + all files that support the write(2) operation also support writev(2). + All files support the access(2) and stat(2) family of operations, but + only the st_mode, st_nlink, st_uid and st_gid fields of struct stat + contain reliable information. + + All files support the chmod(2)/fchmod(2) and chown(2)/fchown(2) opera- + tions, but will not be able to grant permissions that contradict the + possible operations, e.g. read access on the wbox file. + + The current set of files is: + + + /mem + the contents of the local storage memory of the SPU. This can be + accessed like a regular shared memory file and contains both code and + data in the address space of the SPU. The possible operations on an + open mem file are: + + read(2), pread(2), write(2), pwrite(2), lseek(2) + These operate as documented, with the exception that seek(2), + write(2) and pwrite(2) are not supported beyond the end of the + file. The file size is the size of the local storage of the SPU, + which normally is 256 kilobytes. + + mmap(2) + Mapping mem into the process address space gives access to the + SPU local storage within the process address space. Only + MAP_SHARED mappings are allowed. + + + /mbox + The first SPU to CPU communication mailbox. This file is read-only and + can be read in units of 32 bits. The file can only be used in non- + blocking mode and it even poll() will not block on it. The possible + operations on an open mbox file are: + + read(2) + If a count smaller than four is requested, read returns -1 and + sets errno to EINVAL. If there is no data available in the mail + box, the return value is set to -1 and errno becomes EAGAIN. + When data has been read successfully, four bytes are placed in + the data buffer and the value four is returned. + + + /ibox + The second SPU to CPU communication mailbox. This file is similar to + the first mailbox file, but can be read in blocking I/O mode, and the + poll family of system calls can be used to wait for it. The possible + operations on an open ibox file are: + + read(2) + If a count smaller than four is requested, read returns -1 and + sets errno to EINVAL. If there is no data available in the mail + box and the file descriptor has been opened with O_NONBLOCK, the + return value is set to -1 and errno becomes EAGAIN. + + If there is no data available in the mail box and the file + descriptor has been opened without O_NONBLOCK, the call will + block until the SPU writes to its interrupt mailbox channel. + When data has been read successfully, four bytes are placed in + the data buffer and the value four is returned. + + poll(2) + Poll on the ibox file returns (POLLIN | POLLRDNORM) whenever + data is available for reading. + + + /wbox + The CPU to SPU communation mailbox. It is write-only and can be written + in units of 32 bits. If the mailbox is full, write() will block and + poll can be used to wait for it becoming empty again. The possible + operations on an open wbox file are: write(2) If a count smaller than + four is requested, write returns -1 and sets errno to EINVAL. If there + is no space available in the mail box and the file descriptor has been + opened with O_NONBLOCK, the return value is set to -1 and errno becomes + EAGAIN. + + If there is no space available in the mail box and the file descriptor + has been opened without O_NONBLOCK, the call will block until the SPU + reads from its PPE mailbox channel. When data has been read success- + fully, four bytes are placed in the data buffer and the value four is + returned. + + poll(2) + Poll on the ibox file returns (POLLOUT | POLLWRNORM) whenever + space is available for writing. + + + /mbox_stat + /ibox_stat + /wbox_stat + Read-only files that contain the length of the current queue, i.e. how + many words can be read from mbox or ibox or how many words can be + written to wbox without blocking. The files can be read only in 4-byte + units and return a big-endian binary integer number. The possible + operations on an open *box_stat file are: + + read(2) + If a count smaller than four is requested, read returns -1 and + sets errno to EINVAL. Otherwise, a four byte value is placed in + the data buffer, containing the number of elements that can be + read from (for mbox_stat and ibox_stat) or written to (for + wbox_stat) the respective mail box without blocking or resulting + in EAGAIN. + + + /npc + /decr + /decr_status + /spu_tag_mask + /event_mask + /srr0 + Internal registers of the SPU. The representation is an ASCII string + with the numeric value of the next instruction to be executed. These + can be used in read/write mode for debugging, but normal operation of + programs should not rely on them because access to any of them except + npc requires an SPU context save and is therefore very inefficient. + + The contents of these files are: + + npc Next Program Counter + + decr SPU Decrementer + + decr_status Decrementer Status + + spu_tag_mask MFC tag mask for SPU DMA + + event_mask Event mask for SPU interrupts + + srr0 Interrupt Return address register + + + The possible operations on an open npc, decr, decr_status, + spu_tag_mask, event_mask or srr0 file are: + + read(2) + When the count supplied to the read call is shorter than the + required length for the pointer value plus a newline character, + subsequent reads from the same file descriptor will result in + completing the string, regardless of changes to the register by + a running SPU task. When a complete string has been read, all + subsequent read operations will return zero bytes and a new file + descriptor needs to be opened to read the value again. + + write(2) + A write operation on the file results in setting the register to + the value given in the string. The string is parsed from the + beginning to the first non-numeric character or the end of the + buffer. Subsequent writes to the same file descriptor overwrite + the previous setting. + + + /fpcr + This file gives access to the Floating Point Status and Control Regis- + ter as a four byte long file. The operations on the fpcr file are: + + read(2) + If a count smaller than four is requested, read returns -1 and + sets errno to EINVAL. Otherwise, a four byte value is placed in + the data buffer, containing the current value of the fpcr regis- + ter. + + write(2) + If a count smaller than four is requested, write returns -1 and + sets errno to EINVAL. Otherwise, a four byte value is copied + from the data buffer, updating the value of the fpcr register. + + + /signal1 + /signal2 + The two signal notification channels of an SPU. These are read-write + files that operate on a 32 bit word. Writing to one of these files + triggers an interrupt on the SPU. The value written to the signal + files can be read from the SPU through a channel read or from host user + space through the file. After the value has been read by the SPU, it + is reset to zero. The possible operations on an open signal1 or sig- + nal2 file are: + + read(2) + If a count smaller than four is requested, read returns -1 and + sets errno to EINVAL. Otherwise, a four byte value is placed in + the data buffer, containing the current value of the specified + signal notification register. + + write(2) + If a count smaller than four is requested, write returns -1 and + sets errno to EINVAL. Otherwise, a four byte value is copied + from the data buffer, updating the value of the specified signal + notification register. The signal notification register will + either be replaced with the input data or will be updated to the + bitwise OR or the old value and the input data, depending on the + contents of the signal1_type, or signal2_type respectively, + file. + + + /signal1_type + /signal2_type + These two files change the behavior of the signal1 and signal2 notifi- + cation files. The contain a numerical ASCII string which is read as + either "1" or "0". In mode 0 (overwrite), the hardware replaces the + contents of the signal channel with the data that is written to it. in + mode 1 (logical OR), the hardware accumulates the bits that are subse- + quently written to it. The possible operations on an open signal1_type + or signal2_type file are: + + read(2) + When the count supplied to the read call is shorter than the + required length for the digit plus a newline character, subse- + quent reads from the same file descriptor will result in com- + pleting the string. When a complete string has been read, all + subsequent read operations will return zero bytes and a new file + descriptor needs to be opened to read the value again. + + write(2) + A write operation on the file results in setting the register to + the value given in the string. The string is parsed from the + beginning to the first non-numeric character or the end of the + buffer. Subsequent writes to the same file descriptor overwrite + the previous setting. + + +EXAMPLES + /etc/fstab entry + none /spu spufs gid=spu 0 0 + + +AUTHORS + Arnd Bergmann <arndb@de.ibm.com>, Mark Nutter <mnutter@us.ibm.com>, + Ulrich Weigand <Ulrich.Weigand@de.ibm.com> + +SEE ALSO + capabilities(7), close(2), spu_create(2), spu_run(2), spufs(7) + + + +Linux 2005-09-28 SPUFS(2) + +------------------------------------------------------------------------------ + +SPU_RUN(2) Linux Programmer's Manual SPU_RUN(2) + + + +NAME + spu_run - execute an spu context + + +SYNOPSIS + #include <sys/spu.h> + + int spu_run(int fd, unsigned int *npc, unsigned int *event); + +DESCRIPTION + The spu_run system call is used on PowerPC machines that implement the + Cell Broadband Engine Architecture in order to access Synergistic Pro- + cessor Units (SPUs). It uses the fd that was returned from spu_cre- + ate(2) to address a specific SPU context. When the context gets sched- + uled to a physical SPU, it starts execution at the instruction pointer + passed in npc. + + Execution of SPU code happens synchronously, meaning that spu_run does + not return while the SPU is still running. If there is a need to exe- + cute SPU code in parallel with other code on either the main CPU or + other SPUs, you need to create a new thread of execution first, e.g. + using the pthread_create(3) call. + + When spu_run returns, the current value of the SPU instruction pointer + is written back to npc, so you can call spu_run again without updating + the pointers. + + event can be a NULL pointer or point to an extended status code that + gets filled when spu_run returns. It can be one of the following con- + stants: + + SPE_EVENT_DMA_ALIGNMENT + A DMA alignment error + + SPE_EVENT_SPE_DATA_SEGMENT + A DMA segmentation error + + SPE_EVENT_SPE_DATA_STORAGE + A DMA storage error + + If NULL is passed as the event argument, these errors will result in a + signal delivered to the calling process. + +RETURN VALUE + spu_run returns the value of the spu_status register or -1 to indicate + an error and set errno to one of the error codes listed below. The + spu_status register value contains a bit mask of status codes and + optionally a 14 bit code returned from the stop-and-signal instruction + on the SPU. The bit masks for the status codes are: + + 0x02 SPU was stopped by stop-and-signal. + + 0x04 SPU was stopped by halt. + + 0x08 SPU is waiting for a channel. + + 0x10 SPU is in single-step mode. + + 0x20 SPU has tried to execute an invalid instruction. + + 0x40 SPU has tried to access an invalid channel. + + 0x3fff0000 + The bits masked with this value contain the code returned from + stop-and-signal. + + There are always one or more of the lower eight bits set or an error + code is returned from spu_run. + +ERRORS + EAGAIN or EWOULDBLOCK + fd is in non-blocking mode and spu_run would block. + + EBADF fd is not a valid file descriptor. + + EFAULT npc is not a valid pointer or status is neither NULL nor a valid + pointer. + + EINTR A signal occurred while spu_run was in progress. The npc value + has been updated to the new program counter value if necessary. + + EINVAL fd is not a file descriptor returned from spu_create(2). + + ENOMEM Insufficient memory was available to handle a page fault result- + ing from an MFC direct memory access. + + ENOSYS the functionality is not provided by the current system, because + either the hardware does not provide SPUs or the spufs module is + not loaded. + + +NOTES + spu_run is meant to be used from libraries that implement a more + abstract interface to SPUs, not to be used from regular applications. + See http://www.bsc.es/projects/deepcomputing/linuxoncell/ for the rec- + ommended libraries. + + +CONFORMING TO + This call is Linux specific and only implemented by the ppc64 architec- + ture. Programs using this system call are not portable. + + +BUGS + The code does not yet fully implement all features lined out here. + + +AUTHOR + Arnd Bergmann <arndb@de.ibm.com> + +SEE ALSO + capabilities(7), close(2), spu_create(2), spufs(7) + + + +Linux 2005-09-28 SPU_RUN(2) + +------------------------------------------------------------------------------ + +SPU_CREATE(2) Linux Programmer's Manual SPU_CREATE(2) + + + +NAME + spu_create - create a new spu context + + +SYNOPSIS + #include <sys/types.h> + #include <sys/spu.h> + + int spu_create(const char *pathname, int flags, mode_t mode); + +DESCRIPTION + The spu_create system call is used on PowerPC machines that implement + the Cell Broadband Engine Architecture in order to access Synergistic + Processor Units (SPUs). It creates a new logical context for an SPU in + pathname and returns a handle to associated with it. pathname must + point to a non-existing directory in the mount point of the SPU file + system (spufs). When spu_create is successful, a directory gets cre- + ated on pathname and it is populated with files. + + The returned file handle can only be passed to spu_run(2) or closed, + other operations are not defined on it. When it is closed, all associ- + ated directory entries in spufs are removed. When the last file handle + pointing either inside of the context directory or to this file + descriptor is closed, the logical SPU context is destroyed. + + The parameter flags can be zero or any bitwise or'd combination of the + following constants: + + SPU_RAWIO + Allow mapping of some of the hardware registers of the SPU into + user space. This flag requires the CAP_SYS_RAWIO capability, see + capabilities(7). + + The mode parameter specifies the permissions used for creating the new + directory in spufs. mode is modified with the user's umask(2) value + and then used for both the directory and the files contained in it. The + file permissions mask out some more bits of mode because they typically + support only read or write access. See stat(2) for a full list of the + possible mode values. + + +RETURN VALUE + spu_create returns a new file descriptor. It may return -1 to indicate + an error condition and set errno to one of the error codes listed + below. + + +ERRORS + EACCESS + The current user does not have write access on the spufs mount + point. + + EEXIST An SPU context already exists at the given path name. + + EFAULT pathname is not a valid string pointer in the current address + space. + + EINVAL pathname is not a directory in the spufs mount point. + + ELOOP Too many symlinks were found while resolving pathname. + + EMFILE The process has reached its maximum open file limit. + + ENAMETOOLONG + pathname was too long. + + ENFILE The system has reached the global open file limit. + + ENOENT Part of pathname could not be resolved. + + ENOMEM The kernel could not allocate all resources required. + + ENOSPC There are not enough SPU resources available to create a new + context or the user specific limit for the number of SPU con- + texts has been reached. + + ENOSYS the functionality is not provided by the current system, because + either the hardware does not provide SPUs or the spufs module is + not loaded. + + ENOTDIR + A part of pathname is not a directory. + + + +NOTES + spu_create is meant to be used from libraries that implement a more + abstract interface to SPUs, not to be used from regular applications. + See http://www.bsc.es/projects/deepcomputing/linuxoncell/ for the rec- + ommended libraries. + + +FILES + pathname must point to a location beneath the mount point of spufs. By + convention, it gets mounted in /spu. + + +CONFORMING TO + This call is Linux specific and only implemented by the ppc64 architec- + ture. Programs using this system call are not portable. + + +BUGS + The code does not yet fully implement all features lined out here. + + +AUTHOR + Arnd Bergmann <arndb@de.ibm.com> + +SEE ALSO + capabilities(7), close(2), spu_run(2), spufs(7) + + + +Linux 2005-09-28 SPU_CREATE(2) diff --git a/Documentation/filesystems/squashfs.txt b/Documentation/filesystems/squashfs.txt new file mode 100644 index 00000000000..403c090aca3 --- /dev/null +++ b/Documentation/filesystems/squashfs.txt @@ -0,0 +1,259 @@ +SQUASHFS 4.0 FILESYSTEM +======================= + +Squashfs is a compressed read-only filesystem for Linux. +It uses zlib/lzo/xz compression to compress files, inodes and directories. +Inodes in the system are very small and all blocks are packed to minimise +data overhead. Block sizes greater than 4K are supported up to a maximum +of 1Mbytes (default block size 128K). + +Squashfs is intended for general read-only filesystem use, for archival +use (i.e. in cases where a .tar.gz file may be used), and in constrained +block device/memory systems (e.g. embedded systems) where low overhead is +needed. + +Mailing list: squashfs-devel@lists.sourceforge.net +Web site: www.squashfs.org + +1. FILESYSTEM FEATURES +---------------------- + +Squashfs filesystem features versus Cramfs: + + Squashfs Cramfs + +Max filesystem size: 2^64 256 MiB +Max file size: ~ 2 TiB 16 MiB +Max files: unlimited unlimited +Max directories: unlimited unlimited +Max entries per directory: unlimited unlimited +Max block size: 1 MiB 4 KiB +Metadata compression: yes no +Directory indexes: yes no +Sparse file support: yes no +Tail-end packing (fragments): yes no +Exportable (NFS etc.): yes no +Hard link support: yes no +"." and ".." in readdir: yes no +Real inode numbers: yes no +32-bit uids/gids: yes no +File creation time: yes no +Xattr support: yes no +ACL support: no no + +Squashfs compresses data, inodes and directories. In addition, inode and +directory data are highly compacted, and packed on byte boundaries. Each +compressed inode is on average 8 bytes in length (the exact length varies on +file type, i.e. regular file, directory, symbolic link, and block/char device +inodes have different sizes). + +2. USING SQUASHFS +----------------- + +As squashfs is a read-only filesystem, the mksquashfs program must be used to +create populated squashfs filesystems. This and other squashfs utilities +can be obtained from http://www.squashfs.org. Usage instructions can be +obtained from this site also. + +The squashfs-tools development tree is now located on kernel.org + git://git.kernel.org/pub/scm/fs/squashfs/squashfs-tools.git + +3. SQUASHFS FILESYSTEM DESIGN +----------------------------- + +A squashfs filesystem consists of a maximum of nine parts, packed together on a +byte alignment: + + --------------- + | superblock | + |---------------| + | compression | + | options | + |---------------| + | datablocks | + | & fragments | + |---------------| + | inode table | + |---------------| + | directory | + | table | + |---------------| + | fragment | + | table | + |---------------| + | export | + | table | + |---------------| + | uid/gid | + | lookup table | + |---------------| + | xattr | + | table | + --------------- + +Compressed data blocks are written to the filesystem as files are read from +the source directory, and checked for duplicates. Once all file data has been +written the completed inode, directory, fragment, export, uid/gid lookup and +xattr tables are written. + +3.1 Compression options +----------------------- + +Compressors can optionally support compression specific options (e.g. +dictionary size). If non-default compression options have been used, then +these are stored here. + +3.2 Inodes +---------- + +Metadata (inodes and directories) are compressed in 8Kbyte blocks. Each +compressed block is prefixed by a two byte length, the top bit is set if the +block is uncompressed. A block will be uncompressed if the -noI option is set, +or if the compressed block was larger than the uncompressed block. + +Inodes are packed into the metadata blocks, and are not aligned to block +boundaries, therefore inodes overlap compressed blocks. Inodes are identified +by a 48-bit number which encodes the location of the compressed metadata block +containing the inode, and the byte offset into that block where the inode is +placed (<block, offset>). + +To maximise compression there are different inodes for each file type +(regular file, directory, device, etc.), the inode contents and length +varying with the type. + +To further maximise compression, two types of regular file inode and +directory inode are defined: inodes optimised for frequently occurring +regular files and directories, and extended types where extra +information has to be stored. + +3.3 Directories +--------------- + +Like inodes, directories are packed into compressed metadata blocks, stored +in a directory table. Directories are accessed using the start address of +the metablock containing the directory and the offset into the +decompressed block (<block, offset>). + +Directories are organised in a slightly complex way, and are not simply +a list of file names. The organisation takes advantage of the +fact that (in most cases) the inodes of the files will be in the same +compressed metadata block, and therefore, can share the start block. +Directories are therefore organised in a two level list, a directory +header containing the shared start block value, and a sequence of directory +entries, each of which share the shared start block. A new directory header +is written once/if the inode start block changes. The directory +header/directory entry list is repeated as many times as necessary. + +Directories are sorted, and can contain a directory index to speed up +file lookup. Directory indexes store one entry per metablock, each entry +storing the index/filename mapping to the first directory header +in each metadata block. Directories are sorted in alphabetical order, +and at lookup the index is scanned linearly looking for the first filename +alphabetically larger than the filename being looked up. At this point the +location of the metadata block the filename is in has been found. +The general idea of the index is to ensure only one metadata block needs to be +decompressed to do a lookup irrespective of the length of the directory. +This scheme has the advantage that it doesn't require extra memory overhead +and doesn't require much extra storage on disk. + +3.4 File data +------------- + +Regular files consist of a sequence of contiguous compressed blocks, and/or a +compressed fragment block (tail-end packed block). The compressed size +of each datablock is stored in a block list contained within the +file inode. + +To speed up access to datablocks when reading 'large' files (256 Mbytes or +larger), the code implements an index cache that caches the mapping from +block index to datablock location on disk. + +The index cache allows Squashfs to handle large files (up to 1.75 TiB) while +retaining a simple and space-efficient block list on disk. The cache +is split into slots, caching up to eight 224 GiB files (128 KiB blocks). +Larger files use multiple slots, with 1.75 TiB files using all 8 slots. +The index cache is designed to be memory efficient, and by default uses +16 KiB. + +3.5 Fragment lookup table +------------------------- + +Regular files can contain a fragment index which is mapped to a fragment +location on disk and compressed size using a fragment lookup table. This +fragment lookup table is itself stored compressed into metadata blocks. +A second index table is used to locate these. This second index table for +speed of access (and because it is small) is read at mount time and cached +in memory. + +3.6 Uid/gid lookup table +------------------------ + +For space efficiency regular files store uid and gid indexes, which are +converted to 32-bit uids/gids using an id look up table. This table is +stored compressed into metadata blocks. A second index table is used to +locate these. This second index table for speed of access (and because it +is small) is read at mount time and cached in memory. + +3.7 Export table +---------------- + +To enable Squashfs filesystems to be exportable (via NFS etc.) filesystems +can optionally (disabled with the -no-exports Mksquashfs option) contain +an inode number to inode disk location lookup table. This is required to +enable Squashfs to map inode numbers passed in filehandles to the inode +location on disk, which is necessary when the export code reinstantiates +expired/flushed inodes. + +This table is stored compressed into metadata blocks. A second index table is +used to locate these. This second index table for speed of access (and because +it is small) is read at mount time and cached in memory. + +3.8 Xattr table +--------------- + +The xattr table contains extended attributes for each inode. The xattrs +for each inode are stored in a list, each list entry containing a type, +name and value field. The type field encodes the xattr prefix +("user.", "trusted." etc) and it also encodes how the name/value fields +should be interpreted. Currently the type indicates whether the value +is stored inline (in which case the value field contains the xattr value), +or if it is stored out of line (in which case the value field stores a +reference to where the actual value is stored). This allows large values +to be stored out of line improving scanning and lookup performance and it +also allows values to be de-duplicated, the value being stored once, and +all other occurrences holding an out of line reference to that value. + +The xattr lists are packed into compressed 8K metadata blocks. +To reduce overhead in inodes, rather than storing the on-disk +location of the xattr list inside each inode, a 32-bit xattr id +is stored. This xattr id is mapped into the location of the xattr +list using a second xattr id lookup table. + +4. TODOS AND OUTSTANDING ISSUES +------------------------------- + +4.1 Todo list +------------- + +Implement ACL support. + +4.2 Squashfs internal cache +--------------------------- + +Blocks in Squashfs are compressed. To avoid repeatedly decompressing +recently accessed data Squashfs uses two small metadata and fragment caches. + +The cache is not used for file datablocks, these are decompressed and cached in +the page-cache in the normal way. The cache is used to temporarily cache +fragment and metadata blocks which have been read as a result of a metadata +(i.e. inode or directory) or fragment access. Because metadata and fragments +are packed together into blocks (to gain greater compression) the read of a +particular piece of metadata or fragment will retrieve other metadata/fragments +which have been packed with it, these because of locality-of-reference may be +read in the near future. Temporarily caching them ensures they are available +for near future access without requiring an additional read and decompress. + +In the future this internal cache may be replaced with an implementation which +uses the kernel page cache. Because the page cache operates on page sized +units this may introduce additional complexity in terms of locking and +associated race conditions. diff --git a/Documentation/filesystems/sysfs-pci.txt b/Documentation/filesystems/sysfs-pci.txt index 988a62fae11..74eaac26f8b 100644 --- a/Documentation/filesystems/sysfs-pci.txt +++ b/Documentation/filesystems/sysfs-pci.txt @@ -1,4 +1,5 @@ Accessing PCI device resources through sysfs +-------------------------------------------- sysfs, usually mounted at /sys, provides access to PCI resources on platforms that support it. For example, a given bus might look like this: @@ -8,8 +9,10 @@ that support it. For example, a given bus might look like this: | |-- class | |-- config | |-- device + | |-- enable | |-- irq | |-- local_cpus + | |-- remove | |-- resource | |-- resource0 | |-- resource1 @@ -31,10 +34,13 @@ files, each with their own function. class PCI class (ascii, ro) config PCI config space (binary, rw) device PCI device (ascii, ro) + enable Whether the device is enabled (ascii, rw) irq IRQ number (ascii, ro) local_cpus nearby CPU mask (cpumask, ro) + remove remove device from kernel's list (ascii, wo) resource PCI resource host addresses (ascii, ro) - resource0..N PCI resource N, if present (binary, mmap) + resource0..N PCI resource N, if present (binary, mmap, rw[1]) + resource0_wc..N_wc PCI WC map resource N, if prefetchable (binary, mmap) rom PCI ROM resource, if present (binary, ro) subsystem_device PCI subsystem device (ascii, ro) subsystem_vendor PCI subsystem vendor (ascii, ro) @@ -42,22 +48,49 @@ files, each with their own function. ro - read only file rw - file is readable and writable + wo - write only file mmap - file is mmapable ascii - file contains ascii text binary - file contains binary data cpumask - file contains a cpumask type -The read only files are informational, writes to them will be ignored. -Writable files can be used to perform actions on the device (e.g. changing -config space, detaching a device). mmapable files are available via an -mmap of the file at offset 0 and can be used to do actual device programming -from userspace. Note that some platforms don't support mmapping of certain -resources, so be sure to check the return value from any attempted mmap. +[1] rw for RESOURCE_IO (I/O port) regions only + +The read only files are informational, writes to them will be ignored, with +the exception of the 'rom' file. Writable files can be used to perform +actions on the device (e.g. changing config space, detaching a device). +mmapable files are available via an mmap of the file at offset 0 and can be +used to do actual device programming from userspace. Note that some platforms +don't support mmapping of certain resources, so be sure to check the return +value from any attempted mmap. The most notable of these are I/O port +resources, which also provide read/write access. + +The 'enable' file provides a counter that indicates how many times the device +has been enabled. If the 'enable' file currently returns '4', and a '1' is +echoed into it, it will then return '5'. Echoing a '0' into it will decrease +the count. Even when it returns to 0, though, some of the initialisation +may not be reversed. + +The 'rom' file is special in that it provides read-only access to the device's +ROM file, if available. It's disabled by default, however, so applications +should write the string "1" to the file to enable it before attempting a read +call, and disable it following the access by writing "0" to the file. Note +that the device must be enabled for a rom read to return data successfully. +In the event a driver is not bound to the device, it can be enabled using the +'enable' file, documented above. + +The 'remove' file is used to remove the PCI device, by writing a non-zero +integer to the file. This does not involve any kind of hot-plug functionality, +e.g. powering off the device. The device is removed from the kernel's list of +PCI devices, the sysfs directory for it is removed, and the device will be +removed from any drivers attached to it. Removal of PCI root buses is +disallowed. Accessing legacy resources through sysfs +---------------------------------------- Legacy I/O port and ISA memory resources are also provided in sysfs if the -underlying platform supports them. They're located in the PCI class heirarchy, +underlying platform supports them. They're located in the PCI class hierarchy, e.g. /sys/class/pci_bus/0000:17/ @@ -75,6 +108,7 @@ simply dereference the returned pointer (after checking for errors of course) to access legacy memory space. Supporting PCI access on new platforms +-------------------------------------- In order to support PCI resource mapping as described above, Linux platform code must define HAVE_PCI_MMAP and provide a pci_mmap_page_range function. diff --git a/Documentation/filesystems/sysfs-tagging.txt b/Documentation/filesystems/sysfs-tagging.txt new file mode 100644 index 00000000000..eb843e49c5a --- /dev/null +++ b/Documentation/filesystems/sysfs-tagging.txt @@ -0,0 +1,42 @@ +Sysfs tagging +------------- + +(Taken almost verbatim from Eric Biederman's netns tagging patch +commit msg) + +The problem. Network devices show up in sysfs and with the network +namespace active multiple devices with the same name can show up in +the same directory, ouch! + +To avoid that problem and allow existing applications in network +namespaces to see the same interface that is currently presented in +sysfs, sysfs now has tagging directory support. + +By using the network namespace pointers as tags to separate out the +the sysfs directory entries we ensure that we don't have conflicts +in the directories and applications only see a limited set of +the network devices. + +Each sysfs directory entry may be tagged with zero or one +namespaces. A sysfs_dirent is augmented with a void *s_ns. If a +directory entry is tagged, then sysfs_dirent->s_flags will have a +flag between KOBJ_NS_TYPE_NONE and KOBJ_NS_TYPES, and s_ns will +point to the namespace to which it belongs. + +Each sysfs superblock's sysfs_super_info contains an array void +*ns[KOBJ_NS_TYPES]. When a task in a tagging namespace +kobj_nstype first mounts sysfs, a new superblock is created. It +will be differentiated from other sysfs mounts by having its +s_fs_info->ns[kobj_nstype] set to the new namespace. Note that +through bind mounting and mounts propagation, a task can easily view +the contents of other namespaces' sysfs mounts. Therefore, when a +namespace exits, it will call kobj_ns_exit() to invalidate any +sysfs_dirent->s_ns pointers pointing to it. + +Users of this interface: +- define a type in the kobj_ns_type enumeration. +- call kobj_ns_type_register() with its kobj_ns_type_operations which has + - current_ns() which returns current's namespace + - netlink_ns() which returns a socket's namespace + - initial_ns() which returns the initial namesapce +- call kobj_ns_exit() when an individual tag is no longer valid diff --git a/Documentation/filesystems/sysfs.txt b/Documentation/filesystems/sysfs.txt index c8bce82ddca..b35a64b82f9 100644 --- a/Documentation/filesystems/sysfs.txt +++ b/Documentation/filesystems/sysfs.txt @@ -2,8 +2,10 @@ sysfs - _The_ filesystem for exporting kernel objects. Patrick Mochel <mochel@osdl.org> +Mike Murphy <mamurph@cs.clemson.edu> -10 January 2003 +Revised: 16 August 2011 +Original: 10 January 2003 What it is: @@ -21,7 +23,8 @@ interface. Using sysfs ~~~~~~~~~~~ -sysfs is always compiled in. You can access it by doing: +sysfs is always compiled in if CONFIG_SYSFS is defined. You can access +it by doing: mount -t sysfs sysfs /sys @@ -36,10 +39,12 @@ userspace. Top-level directories in sysfs represent the common ancestors of object hierarchies; i.e. the subsystems the objects belong to. -Sysfs internally stores the kobject that owns the directory in the -->d_fsdata pointer of the directory's dentry. This allows sysfs to do -reference counting directly on the kobject when the file is opened and -closed. +Sysfs internally stores a pointer to the kobject that implements a +directory in the sysfs_dirent object associated with the directory. In +the past this kobject pointer has been used by sysfs to do reference +counting directly on the kobject whenever the file is opened or closed. +With the current sysfs implementation the kobject reference count is +only modified directly by the function sysfs_schedule_callback(). Attributes @@ -51,25 +56,26 @@ for the attributes, providing a means to read and write kernel attributes. Attributes should be ASCII text files, preferably with only one value -per file. It is noted that it may not be efficient to contain only +per file. It is noted that it may not be efficient to contain only one value per file, so it is socially acceptable to express an array of values of the same type. Mixing types, expressing multiple lines of data, and doing fancy formatting of data is heavily frowned upon. Doing these things may get -you publically humiliated and your code rewritten without notice. +you publicly humiliated and your code rewritten without notice. An attribute definition is simply: struct attribute { char * name; - mode_t mode; + struct module *owner; + umode_t mode; }; -int sysfs_create_file(struct kobject * kobj, struct attribute * attr); -void sysfs_remove_file(struct kobject * kobj, struct attribute * attr); +int sysfs_create_file(struct kobject * kobj, const struct attribute * attr); +void sysfs_remove_file(struct kobject * kobj, const struct attribute * attr); A bare attribute contains no means to read or write the value of the @@ -80,22 +86,20 @@ a specific object type. For example, the driver model defines struct device_attribute like: struct device_attribute { - struct attribute attr; - ssize_t (*show)(struct device * dev, char * buf); - ssize_t (*store)(struct device * dev, const char * buf); + struct attribute attr; + ssize_t (*show)(struct device *dev, struct device_attribute *attr, + char *buf); + ssize_t (*store)(struct device *dev, struct device_attribute *attr, + const char *buf, size_t count); }; -int device_create_file(struct device *, struct device_attribute *); -void device_remove_file(struct device *, struct device_attribute *); +int device_create_file(struct device *, const struct device_attribute *); +void device_remove_file(struct device *, const struct device_attribute *); It also defines this helper for defining device attributes: -#define DEVICE_ATTR(_name, _mode, _show, _store) \ -struct device_attribute dev_attr_##_name = { \ - .attr = {.name = __stringify(_name) , .mode = _mode }, \ - .show = _show, \ - .store = _store, \ -}; +#define DEVICE_ATTR(_name, _mode, _show, _store) \ +struct device_attribute dev_attr_##_name = __ATTR(_name, _mode, _show, _store) For example, declaring @@ -104,7 +108,7 @@ static DEVICE_ATTR(foo, S_IWUSR | S_IRUGO, show_foo, store_foo); is equivalent to doing: static struct device_attribute dev_attr_foo = { - .attr = { + .attr = { .name = "foo", .mode = S_IWUSR | S_IRUGO, }, @@ -122,7 +126,7 @@ show and store methods of the attribute owners. struct sysfs_ops { ssize_t (*show)(struct kobject *, struct attribute *, char *); - ssize_t (*store)(struct kobject *, struct attribute *, const char *); + ssize_t (*store)(struct kobject *, struct attribute *, const char *, size_t); }; [ Subsystems should have already defined a struct kobj_type as a @@ -137,18 +141,22 @@ calls the associated methods. To illustrate: +#define to_dev(obj) container_of(obj, struct device, kobj) #define to_dev_attr(_attr) container_of(_attr, struct device_attribute, attr) -#define to_dev(d) container_of(d, struct device, kobj) -static ssize_t -dev_attr_show(struct kobject * kobj, struct attribute * attr, char * buf) +static ssize_t dev_attr_show(struct kobject *kobj, struct attribute *attr, + char *buf) { - struct device_attribute * dev_attr = to_dev_attr(attr); - struct device * dev = to_dev(kobj); - ssize_t ret = 0; + struct device_attribute *dev_attr = to_dev_attr(attr); + struct device *dev = to_dev(kobj); + ssize_t ret = -EIO; if (dev_attr->show) - ret = dev_attr->show(dev, buf); + ret = dev_attr->show(dev, dev_attr, buf); + if (ret >= (ssize_t)PAGE_SIZE) { + print_symbol("dev_attr_show: %s returned bad count\n", + (unsigned long)dev_attr->show); + } return ret; } @@ -161,10 +169,11 @@ To read or write attributes, show() or store() methods must be specified when declaring the attribute. The method types should be as simple as those defined for device attributes: - ssize_t (*show)(struct device * dev, char * buf); - ssize_t (*store)(struct device * dev, const char * buf); +ssize_t (*show)(struct device *dev, struct device_attribute *attr, char *buf); +ssize_t (*store)(struct device *dev, struct device_attribute *attr, + const char *buf, size_t count); -IOW, they should take only an object and a buffer as parameters. +IOW, they should take only an object, an attribute, and a buffer as parameters. sysfs allocates a buffer of size (PAGE_SIZE) and passes it to the @@ -176,8 +185,10 @@ implementations: Recall that an attribute should only be exporting one value, or an array of similar values, so this shouldn't be that expensive. - This allows userspace to do partial reads and seeks arbitrarily over - the entire file at will. + This allows userspace to do partial reads and forward seeks + arbitrarily over the entire file at will. If userspace seeks back to + zero or does a pread(2) with an offset of '0' the show() method will + be called again, rearmed, to fill the buffer. - On write(2), sysfs expects the entire buffer to be passed during the first write. Sysfs then passes the entire buffer to the store() @@ -192,16 +203,19 @@ implementations: Other notes: +- Writing causes the show() method to be rearmed regardless of current + file position. + - The buffer will always be PAGE_SIZE bytes in length. On i386, this is 4096. - show() methods should return the number of bytes printed into the - buffer. This is the return value of snprintf(). + buffer. This is the return value of scnprintf(). -- show() should always use snprintf(). +- show() should always use scnprintf(). -- store() should return the number of bytes used from the buffer. This - can be done using strlen(). +- store() should return the number of bytes used from the buffer. If the + entire buffer has been used, just return the count argument. - show() or store() can always return errors. If a bad value comes through, be sure to return an error. @@ -214,15 +228,18 @@ Other notes: A very simple (and naive) implementation of a device attribute is: -static ssize_t show_name(struct device *dev, struct device_attribute *attr, char *buf) +static ssize_t show_name(struct device *dev, struct device_attribute *attr, + char *buf) { - return snprintf(buf, PAGE_SIZE, "%s\n", dev->name); + return scnprintf(buf, PAGE_SIZE, "%s\n", dev->name); } -static ssize_t store_name(struct device * dev, const char * buf) +static ssize_t store_name(struct device *dev, struct device_attribute *attr, + const char *buf, size_t count) { - sscanf(buf, "%20s", dev->name); - return strnlen(buf, PAGE_SIZE); + snprintf(dev->name, sizeof(dev->name), "%.*s", + (int)min(count, sizeof(dev->name) - 1), buf); + return count; } static DEVICE_ATTR(name, S_IRUGO, show_name, store_name); @@ -238,14 +255,16 @@ Top Level Directory Layout The sysfs directory arrangement exposes the relationship of kernel data structures. -The top level sysfs diretory looks like: +The top level sysfs directory looks like: block/ bus/ class/ +dev/ devices/ firmware/ net/ +fs/ devices/ contains a filesystem representation of the device tree. It maps directly to the internal kernel device tree, which is a hierarchy of @@ -264,6 +283,15 @@ drivers/ contains a directory for each device driver that is loaded for devices on that particular bus (this assumes that drivers do not span multiple bus types). +fs/ contains a directory for some filesystems. Currently each +filesystem wanting to export attributes must create its own hierarchy +below fs/ (see ./fuse.txt for an example). + +dev/ contains two directories char/ and block/. Inside these two +directories there are symlinks named <major>:<minor>. These symlinks +point to the sysfs directory for the given device. /sys/dev provides a +quick way to lookup the sysfs interface for a device from the result of +a stat(2) operation. More information can driver-model specific features can be found in Documentation/driver-model/. @@ -283,19 +311,21 @@ The following interface layers currently exist in sysfs: Structure: struct device_attribute { - struct attribute attr; - ssize_t (*show)(struct device * dev, char * buf); - ssize_t (*store)(struct device * dev, const char * buf); + struct attribute attr; + ssize_t (*show)(struct device *dev, struct device_attribute *attr, + char *buf); + ssize_t (*store)(struct device *dev, struct device_attribute *attr, + const char *buf, size_t count); }; Declaring: -DEVICE_ATTR(_name, _str, _mode, _show, _store); +DEVICE_ATTR(_name, _mode, _show, _store); Creation/Removal: -int device_create_file(struct device *device, struct device_attribute * attr); -void device_remove_file(struct device * dev, struct device_attribute * attr); +int device_create_file(struct device *dev, const struct device_attribute * attr); +void device_remove_file(struct device *dev, const struct device_attribute * attr); - bus drivers (include/linux/device.h) @@ -305,7 +335,7 @@ Structure: struct bus_attribute { struct attribute attr; ssize_t (*show)(struct bus_type *, char * buf); - ssize_t (*store)(struct bus_type *, const char * buf); + ssize_t (*store)(struct bus_type *, const char * buf, size_t count); }; Declaring: @@ -326,7 +356,8 @@ Structure: struct driver_attribute { struct attribute attr; ssize_t (*show)(struct device_driver *, char * buf); - ssize_t (*store)(struct device_driver *, const char * buf); + ssize_t (*store)(struct device_driver *, const char * buf, + size_t count); }; Declaring: @@ -335,7 +366,15 @@ DRIVER_ATTR(_name, _mode, _show, _store) Creation/Removal: -int driver_create_file(struct device_driver *, struct driver_attribute *); -void driver_remove_file(struct device_driver *, struct driver_attribute *); +int driver_create_file(struct device_driver *, const struct driver_attribute *); +void driver_remove_file(struct device_driver *, const struct driver_attribute *); + +Documentation +~~~~~~~~~~~~~ +The sysfs directory structure and the attributes in each directory define an +ABI between the kernel and user space. As for any ABI, it is important that +this ABI is stable and properly documented. All new sysfs attributes must be +documented in Documentation/ABI. See also Documentation/ABI/README for more +information. diff --git a/Documentation/filesystems/sysv-fs.txt b/Documentation/filesystems/sysv-fs.txt index d8172241801..253b50d1328 100644 --- a/Documentation/filesystems/sysv-fs.txt +++ b/Documentation/filesystems/sysv-fs.txt @@ -1,11 +1,8 @@ -This is the implementation of the SystemV/Coherent filesystem for Linux. It implements all of - Xenix FS, - SystemV/386 FS, - Coherent FS. -This is version beta 4. - To install: * Answer the 'System V and Coherent filesystem support' question with 'y' when configuring the kernel. @@ -28,11 +25,173 @@ Bugs in the present implementation: for this FS on hard disk yet. -Please report any bugs and suggestions to - Bruno Haible <haible@ma2s2.mathematik.uni-karlsruhe.de> - Pascal Haible <haible@izfm.uni-stuttgart.de> - Krzysztof G. Baranowski <kgb@manjak.knm.org.pl> +These filesystems are rather similar. Here is a comparison with Minix FS: + +* Linux fdisk reports on partitions + - Minix FS 0x81 Linux/Minix + - Xenix FS ?? + - SystemV FS ?? + - Coherent FS 0x08 AIX bootable + +* Size of a block or zone (data allocation unit on disk) + - Minix FS 1024 + - Xenix FS 1024 (also 512 ??) + - SystemV FS 1024 (also 512 and 2048) + - Coherent FS 512 + +* General layout: all have one boot block, one super block and + separate areas for inodes and for directories/data. + On SystemV Release 2 FS (e.g. Microport) the first track is reserved and + all the block numbers (including the super block) are offset by one track. + +* Byte ordering of "short" (16 bit entities) on disk: + - Minix FS little endian 0 1 + - Xenix FS little endian 0 1 + - SystemV FS little endian 0 1 + - Coherent FS little endian 0 1 + Of course, this affects only the file system, not the data of files on it! + +* Byte ordering of "long" (32 bit entities) on disk: + - Minix FS little endian 0 1 2 3 + - Xenix FS little endian 0 1 2 3 + - SystemV FS little endian 0 1 2 3 + - Coherent FS PDP-11 2 3 0 1 + Of course, this affects only the file system, not the data of files on it! + +* Inode on disk: "short", 0 means non-existent, the root dir ino is: + - Minix FS 1 + - Xenix FS, SystemV FS, Coherent FS 2 + +* Maximum number of hard links to a file: + - Minix FS 250 + - Xenix FS ?? + - SystemV FS ?? + - Coherent FS >=10000 + +* Free inode management: + - Minix FS a bitmap + - Xenix FS, SystemV FS, Coherent FS + There is a cache of a certain number of free inodes in the super-block. + When it is exhausted, new free inodes are found using a linear search. + +* Free block management: + - Minix FS a bitmap + - Xenix FS, SystemV FS, Coherent FS + Free blocks are organized in a "free list". Maybe a misleading term, + since it is not true that every free block contains a pointer to + the next free block. Rather, the free blocks are organized in chunks + of limited size, and every now and then a free block contains pointers + to the free blocks pertaining to the next chunk; the first of these + contains pointers and so on. The list terminates with a "block number" + 0 on Xenix FS and SystemV FS, with a block zeroed out on Coherent FS. + +* Super-block location: + - Minix FS block 1 = bytes 1024..2047 + - Xenix FS block 1 = bytes 1024..2047 + - SystemV FS bytes 512..1023 + - Coherent FS block 1 = bytes 512..1023 + +* Super-block layout: + - Minix FS + unsigned short s_ninodes; + unsigned short s_nzones; + unsigned short s_imap_blocks; + unsigned short s_zmap_blocks; + unsigned short s_firstdatazone; + unsigned short s_log_zone_size; + unsigned long s_max_size; + unsigned short s_magic; + - Xenix FS, SystemV FS, Coherent FS + unsigned short s_firstdatazone; + unsigned long s_nzones; + unsigned short s_fzone_count; + unsigned long s_fzones[NICFREE]; + unsigned short s_finode_count; + unsigned short s_finodes[NICINOD]; + char s_flock; + char s_ilock; + char s_modified; + char s_rdonly; + unsigned long s_time; + short s_dinfo[4]; -- SystemV FS only + unsigned long s_free_zones; + unsigned short s_free_inodes; + short s_dinfo[4]; -- Xenix FS only + unsigned short s_interleave_m,s_interleave_n; -- Coherent FS only + char s_fname[6]; + char s_fpack[6]; + then they differ considerably: + Xenix FS + char s_clean; + char s_fill[371]; + long s_magic; + long s_type; + SystemV FS + long s_fill[12 or 14]; + long s_state; + long s_magic; + long s_type; + Coherent FS + unsigned long s_unique; + Note that Coherent FS has no magic. + +* Inode layout: + - Minix FS + unsigned short i_mode; + unsigned short i_uid; + unsigned long i_size; + unsigned long i_time; + unsigned char i_gid; + unsigned char i_nlinks; + unsigned short i_zone[7+1+1]; + - Xenix FS, SystemV FS, Coherent FS + unsigned short i_mode; + unsigned short i_nlink; + unsigned short i_uid; + unsigned short i_gid; + unsigned long i_size; + unsigned char i_zone[3*(10+1+1+1)]; + unsigned long i_atime; + unsigned long i_mtime; + unsigned long i_ctime; + +* Regular file data blocks are organized as + - Minix FS + 7 direct blocks + 1 indirect block (pointers to blocks) + 1 double-indirect block (pointer to pointers to blocks) + - Xenix FS, SystemV FS, Coherent FS + 10 direct blocks + 1 indirect block (pointers to blocks) + 1 double-indirect block (pointer to pointers to blocks) + 1 triple-indirect block (pointer to pointers to pointers to blocks) + +* Inode size, inodes per block + - Minix FS 32 32 + - Xenix FS 64 16 + - SystemV FS 64 16 + - Coherent FS 64 8 + +* Directory entry on disk + - Minix FS + unsigned short inode; + char name[14/30]; + - Xenix FS, SystemV FS, Coherent FS + unsigned short inode; + char name[14]; + +* Dir entry size, dir entries per block + - Minix FS 16/32 64/32 + - Xenix FS 16 64 + - SystemV FS 16 64 + - Coherent FS 16 32 + +* How to implement symbolic links such that the host fsck doesn't scream: + - Minix FS normal + - Xenix FS kludge: as regular files with chmod 1000 + - SystemV FS ?? + - Coherent FS kludge: as regular files with chmod 1000 -Bruno Haible -<haible@ma2s2.mathematik.uni-karlsruhe.de> +Notation: We often speak of a "block" but mean a zone (the allocation unit) +and not the disk driver's notion of "block". diff --git a/Documentation/filesystems/tmpfs.txt b/Documentation/filesystems/tmpfs.txt index 0d783c504ea..98ef5512415 100644 --- a/Documentation/filesystems/tmpfs.txt +++ b/Documentation/filesystems/tmpfs.txt @@ -39,7 +39,7 @@ tmpfs has the following uses: tmpfs /dev/shm tmpfs defaults 0 0 Remember to create the directory that you intend to mount tmpfs on - if necessary (/dev/shm is automagically created if you use devfs). + if necessary. This mount is _not_ needed for SYSV shared memory. The internal mount is used for that. (In the 2.3 kernel versions it was @@ -63,7 +63,7 @@ size: The limit of allocated bytes for this tmpfs instance. The nr_blocks: The same as size, but in blocks of PAGE_CACHE_SIZE. nr_inodes: The maximum number of inodes for this instance. The default is half of the number of your physical RAM pages, or (on a - a machine with highmem) the number of lowmem RAM pages, + machine with highmem) the number of lowmem RAM pages, whichever is the lower. These parameters accept a suffix k, m or g for kilo, mega and giga and @@ -78,6 +78,52 @@ use up all the memory on the machine; but enhances the scalability of that instance in a system with many cpus making intensive use of it. +tmpfs has a mount option to set the NUMA memory allocation policy for +all files in that instance (if CONFIG_NUMA is enabled) - which can be +adjusted on the fly via 'mount -o remount ...' + +mpol=default use the process allocation policy + (see set_mempolicy(2)) +mpol=prefer:Node prefers to allocate memory from the given Node +mpol=bind:NodeList allocates memory only from nodes in NodeList +mpol=interleave prefers to allocate from each node in turn +mpol=interleave:NodeList allocates from each node of NodeList in turn +mpol=local prefers to allocate memory from the local node + +NodeList format is a comma-separated list of decimal numbers and ranges, +a range being two hyphen-separated decimal numbers, the smallest and +largest node numbers in the range. For example, mpol=bind:0-3,5,7,9-15 + +A memory policy with a valid NodeList will be saved, as specified, for +use at file creation time. When a task allocates a file in the file +system, the mount option memory policy will be applied with a NodeList, +if any, modified by the calling task's cpuset constraints +[See Documentation/cgroups/cpusets.txt] and any optional flags, listed +below. If the resulting NodeLists is the empty set, the effective memory +policy for the file will revert to "default" policy. + +NUMA memory allocation policies have optional flags that can be used in +conjunction with their modes. These optional flags can be specified +when tmpfs is mounted by appending them to the mode before the NodeList. +See Documentation/vm/numa_memory_policy.txt for a list of all available +memory allocation policy mode flags and their effect on memory policy. + + =static is equivalent to MPOL_F_STATIC_NODES + =relative is equivalent to MPOL_F_RELATIVE_NODES + +For example, mpol=bind=static:NodeList, is the equivalent of an +allocation policy of MPOL_BIND | MPOL_F_STATIC_NODES. + +Note that trying to mount a tmpfs with an mpol option will fail if the +running kernel does not support NUMA; and will fail if its nodelist +specifies a node which is not online. If your system relies on that +tmpfs being mounted, but from time to time runs a kernel built without +NUMA capability (perhaps a safe recovery kernel), or with fewer nodes +online, then it is advisable to omit the mpol option from automatic +mount options. It can be added later, when the tmpfs is already mounted +on MountPoint, by 'mount -o remount,mpol=Policy:NodeList MountPoint'. + + To specify the initial root directory you can use the following mount options: @@ -97,4 +143,6 @@ RAM/SWAP in 10240 inodes and it is only accessible by root. Author: Christoph Rohland <cr@sap.com>, 1.12.01 Updated: - Hugh Dickins <hugh@veritas.com>, 13 March 2005 + Hugh Dickins, 4 June 2007 +Updated: + KOSAKI Motohiro, 16 Mar 2010 diff --git a/Documentation/filesystems/ubifs.txt b/Documentation/filesystems/ubifs.txt new file mode 100644 index 00000000000..a0a61d2f389 --- /dev/null +++ b/Documentation/filesystems/ubifs.txt @@ -0,0 +1,119 @@ +Introduction +============= + +UBIFS file-system stands for UBI File System. UBI stands for "Unsorted +Block Images". UBIFS is a flash file system, which means it is designed +to work with flash devices. It is important to understand, that UBIFS +is completely different to any traditional file-system in Linux, like +Ext2, XFS, JFS, etc. UBIFS represents a separate class of file-systems +which work with MTD devices, not block devices. The other Linux +file-system of this class is JFFS2. + +To make it more clear, here is a small comparison of MTD devices and +block devices. + +1 MTD devices represent flash devices and they consist of eraseblocks of + rather large size, typically about 128KiB. Block devices consist of + small blocks, typically 512 bytes. +2 MTD devices support 3 main operations - read from some offset within an + eraseblock, write to some offset within an eraseblock, and erase a whole + eraseblock. Block devices support 2 main operations - read a whole + block and write a whole block. +3 The whole eraseblock has to be erased before it becomes possible to + re-write its contents. Blocks may be just re-written. +4 Eraseblocks become worn out after some number of erase cycles - + typically 100K-1G for SLC NAND and NOR flashes, and 1K-10K for MLC + NAND flashes. Blocks do not have the wear-out property. +5 Eraseblocks may become bad (only on NAND flashes) and software should + deal with this. Blocks on hard drives typically do not become bad, + because hardware has mechanisms to substitute bad blocks, at least in + modern LBA disks. + +It should be quite obvious why UBIFS is very different to traditional +file-systems. + +UBIFS works on top of UBI. UBI is a separate software layer which may be +found in drivers/mtd/ubi. UBI is basically a volume management and +wear-leveling layer. It provides so called UBI volumes which is a higher +level abstraction than a MTD device. The programming model of UBI devices +is very similar to MTD devices - they still consist of large eraseblocks, +they have read/write/erase operations, but UBI devices are devoid of +limitations like wear and bad blocks (items 4 and 5 in the above list). + +In a sense, UBIFS is a next generation of JFFS2 file-system, but it is +very different and incompatible to JFFS2. The following are the main +differences. + +* JFFS2 works on top of MTD devices, UBIFS depends on UBI and works on + top of UBI volumes. +* JFFS2 does not have on-media index and has to build it while mounting, + which requires full media scan. UBIFS maintains the FS indexing + information on the flash media and does not require full media scan, + so it mounts many times faster than JFFS2. +* JFFS2 is a write-through file-system, while UBIFS supports write-back, + which makes UBIFS much faster on writes. + +Similarly to JFFS2, UBIFS supports on-the-flight compression which makes +it possible to fit quite a lot of data to the flash. + +Similarly to JFFS2, UBIFS is tolerant of unclean reboots and power-cuts. +It does not need stuff like fsck.ext2. UBIFS automatically replays its +journal and recovers from crashes, ensuring that the on-flash data +structures are consistent. + +UBIFS scales logarithmically (most of the data structures it uses are +trees), so the mount time and memory consumption do not linearly depend +on the flash size, like in case of JFFS2. This is because UBIFS +maintains the FS index on the flash media. However, UBIFS depends on +UBI, which scales linearly. So overall UBI/UBIFS stack scales linearly. +Nevertheless, UBI/UBIFS scales considerably better than JFFS2. + +The authors of UBIFS believe, that it is possible to develop UBI2 which +would scale logarithmically as well. UBI2 would support the same API as UBI, +but it would be binary incompatible to UBI. So UBIFS would not need to be +changed to use UBI2 + + +Mount options +============= + +(*) == default. + +bulk_read read more in one go to take advantage of flash + media that read faster sequentially +no_bulk_read (*) do not bulk-read +no_chk_data_crc (*) skip checking of CRCs on data nodes in order to + improve read performance. Use this option only + if the flash media is highly reliable. The effect + of this option is that corruption of the contents + of a file can go unnoticed. +chk_data_crc do not skip checking CRCs on data nodes +compr=none override default compressor and set it to "none" +compr=lzo override default compressor and set it to "lzo" +compr=zlib override default compressor and set it to "zlib" + + +Quick usage instructions +======================== + +The UBI volume to mount is specified using "ubiX_Y" or "ubiX:NAME" syntax, +where "X" is UBI device number, "Y" is UBI volume number, and "NAME" is +UBI volume name. + +Mount volume 0 on UBI device 0 to /mnt/ubifs: +$ mount -t ubifs ubi0_0 /mnt/ubifs + +Mount "rootfs" volume of UBI device 0 to /mnt/ubifs ("rootfs" is volume +name): +$ mount -t ubifs ubi0:rootfs /mnt/ubifs + +The following is an example of the kernel boot arguments to attach mtd0 +to UBI and mount volume "rootfs": +ubi.mtd=0 root=ubi0:rootfs rootfstype=ubifs + +References +========== + +UBIFS documentation and FAQ/HOWTO at the MTD web site: +http://www.linux-mtd.infradead.org/doc/ubifs.html +http://www.linux-mtd.infradead.org/faq/ubifs.html diff --git a/Documentation/filesystems/udf.txt b/Documentation/filesystems/udf.txt index e5213bc301f..902b95d0ee5 100644 --- a/Documentation/filesystems/udf.txt +++ b/Documentation/filesystems/udf.txt @@ -7,14 +7,25 @@ If you encounter problems with reading UDF discs using this driver, please report them to linux_udf@hpesjro.fc.hp.com, which is the developer's list. -Write support requires a block driver which supports writing. The current -scsi and ide cdrom drivers do not support writing. +Write support requires a block driver which supports writing. Currently +dvd+rw drives and media support true random sector writes, and so a udf +filesystem on such devices can be directly mounted read/write. CD-RW +media however, does not support this. Instead the media can be formatted +for packet mode using the utility cdrwtool, then the pktcdvd driver can +be bound to the underlying cd device to provide the required buffering +and read-modify-write cycles to allow the filesystem random sector writes +while providing the hardware with only full packet writes. While not +required for dvd+rw media, use of the pktcdvd driver often enhances +performance due to very poor read-modify-write support supplied internally +by drive firmware. ------------------------------------------------------------------------------- The following mount options are supported: gid= Set the default group. umask= Set the default umask. + mode= Set the default file permissions. + dmode= Set the default directory permissions. uid= Set the default user. bs= Set the block size. unhide Show otherwise hidden files. @@ -26,6 +37,20 @@ The following mount options are supported: nostrict Unset strict conformance iocharset= Set the NLS character set +The uid= and gid= options need a bit more explaining. They will accept a +decimal numeric value which will be used as the default ID for that mount. +They will also accept the string "ignore" and "forget". For files on the disk +that are owned by nobody ( -1 ), they will instead look as if they are owned +by the default ID. The ignore option causes the default ID to override all +IDs on the disk, not just -1. The forget option causes all IDs to be written +to disk as -1, so when the media is later remounted, they will appear to be +owned by whatever default ID it is mounted with at that time. + +For typical desktop use of removable media, you should set the ID to that +of the interactively logged on user, and also specify both the forget and +ignore options. This way the interactive user will always see the files +on the disk as belonging to him. + The remaining are for debugging and disaster recovery: novrs Skip volume sequence recognition diff --git a/Documentation/filesystems/ufs.txt b/Documentation/filesystems/ufs.txt index 2b5a56a6a55..7a602adeca2 100644 --- a/Documentation/filesystems/ufs.txt +++ b/Documentation/filesystems/ufs.txt @@ -21,7 +21,7 @@ ufstype=type_of_ufs supported as read-write ufs2 used in FreeBSD 5.x - supported as read-only + supported as read-write 5xbsd synonym for ufs2 @@ -50,12 +50,11 @@ ufstype=type_of_ufs POSSIBLE PROBLEMS ================= -There is still bug in reallocation of fragment, in file fs/ufs/balloc.c, -line 364. But it seems working on current buffer cache configuration. +See next section, if you have any. BUG REPORTS =========== -Any ufs bug report you can send to daniel.pirkl@email.cz (do not send -partition tables bug reports.) +Any ufs bug report you can send to daniel.pirkl@email.cz or +to dushistov@mail.ru (do not send partition tables bug reports). diff --git a/Documentation/filesystems/v9fs.txt b/Documentation/filesystems/v9fs.txt deleted file mode 100644 index 4e92feb6b50..00000000000 --- a/Documentation/filesystems/v9fs.txt +++ /dev/null @@ -1,95 +0,0 @@ - V9FS: 9P2000 for Linux - ====================== - -ABOUT -===== - -v9fs is a Unix implementation of the Plan 9 9p remote filesystem protocol. - -This software was originally developed by Ron Minnich <rminnich@lanl.gov> -and Maya Gokhale <maya@lanl.gov>. Additional development by Greg Watson -<gwatson@lanl.gov> and most recently Eric Van Hensbergen -<ericvh@gmail.com> and Latchesar Ionkov <lucho@ionkov.net>. - -USAGE -===== - -For remote file server: - - mount -t 9P 10.10.1.2 /mnt/9 - -For Plan 9 From User Space applications (http://swtch.com/plan9) - - mount -t 9P `namespace`/acme /mnt/9 -o proto=unix,name=$USER - -OPTIONS -======= - - proto=name select an alternative transport. Valid options are - currently: - unix - specifying a named pipe mount point - tcp - specifying a normal TCP/IP connection - fd - used passed file descriptors for connection - (see rfdno and wfdno) - - name=name user name to attempt mount as on the remote server. The - server may override or ignore this value. Certain user - names may require authentication. - - aname=name aname specifies the file tree to access when the server is - offering several exported file systems. - - debug=n specifies debug level. The debug level is a bitmask. - 0x01 = display verbose error messages - 0x02 = developer debug (DEBUG_CURRENT) - 0x04 = display 9P trace - 0x08 = display VFS trace - 0x10 = display Marshalling debug - 0x20 = display RPC debug - 0x40 = display transport debug - 0x80 = display allocation debug - - rfdno=n the file descriptor for reading with proto=fd - - wfdno=n the file descriptor for writing with proto=fd - - maxdata=n the number of bytes to use for 9P packet payload (msize) - - port=n port to connect to on the remote server - - timeout=n request timeouts (in ms) (default 60000ms) - - noextend force legacy mode (no 9P2000.u semantics) - - uid attempt to mount as a particular uid - - gid attempt to mount with a particular gid - - afid security channel - used by Plan 9 authentication protocols - - nodevmap do not map special files - represent them as normal files. - This can be used to share devices/named pipes/sockets between - hosts. This functionality will be expanded in later versions. - -RESOURCES -========= - -The Linux version of the 9P server, along with some client-side utilities -can be found at http://v9fs.sf.net (along with a CVS repository of the -development branch of this module). There are user and developer mailing -lists here, as well as a bug-tracker. - -For more information on the Plan 9 Operating System check out -http://plan9.bell-labs.com/plan9 - -For information on Plan 9 from User Space (Plan 9 applications and libraries -ported to Linux/BSD/OSX/etc) check out http://swtch.com/plan9 - - -STATUS -====== - -The 2.6 kernel support is working on PPC and x86. - -PLEASE USE THE SOURCEFORGE BUG-TRACKER TO REPORT PROBLEMS. - diff --git a/Documentation/filesystems/vfat.txt b/Documentation/filesystems/vfat.txt index 5ead20c6c74..ce1126aceed 100644 --- a/Documentation/filesystems/vfat.txt +++ b/Documentation/filesystems/vfat.txt @@ -8,6 +8,12 @@ if you want to format from within Linux. VFAT MOUNT OPTIONS ---------------------------------------------------------------------- +uid=### -- Set the owner of all files on this filesystem. + The default is the uid of current process. + +gid=### -- Set the group of all files on this filesystem. + The default is the gid of current process. + umask=### -- The permission mask (for files and directories, see umask(1)). The default is the umask of current process. @@ -17,27 +23,42 @@ dmask=### -- The permission mask for the directory. fmask=### -- The permission mask for files. The default is the umask of current process. +allow_utime=### -- This option controls the permission check of mtime/atime. + + 20 - If current process is in group of file's group ID, + you can change timestamp. + 2 - Other users can change timestamp. + + The default is set from `dmask' option. (If the directory is + writable, utime(2) is also allowed. I.e. ~dmask & 022) + + Normally utime(2) checks current process is owner of + the file, or it has CAP_FOWNER capability. But FAT + filesystem doesn't have uid/gid on disk, so normal + check is too unflexible. With this option you can + relax it. + codepage=### -- Sets the codepage number for converting to shortname characters on FAT filesystem. By default, FAT_DEFAULT_CODEPAGE setting is used. -iocharset=name -- Character set to use for converting between the +iocharset=<name> -- Character set to use for converting between the encoding is used for user visible filename and 16 bit Unicode characters. Long filenames are stored on disk in Unicode format, but Unix for the most part doesn't know how to deal with Unicode. By default, FAT_DEFAULT_IOCHARSET setting is used. - There is also an option of doing UTF8 translations + There is also an option of doing UTF-8 translations with the utf8 option. NOTE: "iocharset=utf8" is not recommended. If unsure, you should consider the following option instead. -utf8=<bool> -- UTF8 is the filesystem safe version of Unicode that - is used by the console. It can be be enabled for the +utf8=<bool> -- UTF-8 is the filesystem safe version of Unicode that + is used by the console. It can be enabled for the filesystem with this option. If 'uni_xlate' gets set, - UTF8 gets disabled. + UTF-8 gets disabled. uni_xlate=<bool> -- Translate unhandled Unicode characters to special escaped sequences. This would let you backup and @@ -57,6 +78,13 @@ nonumtail=<bool> -- When creating 8.3 aliases, normally the alias will currently exist in the directory, 'longfile.txt' will be the short alias instead of 'longfi~1.txt'. +usefree -- Use the "free clusters" value stored on FSINFO. It'll + be used to determine number of free clusters without + scanning disk. But it's not used by default, because + recent Windows don't update it correctly in some + case. If you are sure the "free clusters" on FSINFO is + correct, by this option you can avoid scanning disk. + quiet -- Stops printing certain warning messages. check=s|r|n -- Case sensitivity checking setting. @@ -64,6 +92,8 @@ check=s|r|n -- Case sensitivity checking setting. r: relaxed, case insensitive n: normal, default setting, currently case insensitive +nocase -- This was deprecated for vfat. Use shortname=win95 instead. + shortname=lower|win95|winnt|mixed -- Shortname display/create setting. lower: convert to lowercase for display, @@ -72,7 +102,81 @@ shortname=lower|win95|winnt|mixed winnt: emulate the Windows NT rule for display/create. mixed: emulate the Windows NT rule for display, emulate the Windows 95 rule for create. - Default setting is `lower'. + Default setting is `mixed'. + +tz=UTC -- Interpret timestamps as UTC rather than local time. + This option disables the conversion of timestamps + between local time (as used by Windows on FAT) and UTC + (which Linux uses internally). This is particularly + useful when mounting devices (like digital cameras) + that are set to UTC in order to avoid the pitfalls of + local time. +time_offset=minutes + -- Set offset for conversion of timestamps from local time + used by FAT to UTC. I.e. <minutes> minutes will be subtracted + from each timestamp to convert it to UTC used internally by + Linux. This is useful when time zone set in sys_tz is + not the time zone used by the filesystem. Note that this + option still does not provide correct time stamps in all + cases in presence of DST - time stamps in a different DST + setting will be off by one hour. + +showexec -- If set, the execute permission bits of the file will be + allowed only if the extension part of the name is .EXE, + .COM, or .BAT. Not set by default. + +debug -- Can be set, but unused by the current implementation. + +sys_immutable -- If set, ATTR_SYS attribute on FAT is handled as + IMMUTABLE flag on Linux. Not set by default. + +flush -- If set, the filesystem will try to flush to disk more + early than normal. Not set by default. + +rodir -- FAT has the ATTR_RO (read-only) attribute. On Windows, + the ATTR_RO of the directory will just be ignored, + and is used only by applications as a flag (e.g. it's set + for the customized folder). + + If you want to use ATTR_RO as read-only flag even for + the directory, set this option. + +errors=panic|continue|remount-ro + -- specify FAT behavior on critical errors: panic, continue + without doing anything or remount the partition in + read-only mode (default behavior). + +discard -- If set, issues discard/TRIM commands to the block + device when blocks are freed. This is useful for SSD devices + and sparse/thinly-provisoned LUNs. + +nfs=stale_rw|nostale_ro + Enable this only if you want to export the FAT filesystem + over NFS. + + stale_rw: This option maintains an index (cache) of directory + inodes by i_logstart which is used by the nfs-related code to + improve look-ups. Full file operations (read/write) over NFS is + supported but with cache eviction at NFS server, this could + result in ESTALE issues. + + nostale_ro: This option bases the inode number and filehandle + on the on-disk location of a file in the MS-DOS directory entry. + This ensures that ESTALE will not be returned after a file is + evicted from the inode cache. However, it means that operations + such as rename, create and unlink could cause filehandles that + previously pointed at one file to point at a different file, + potentially causing data corruption. For this reason, this + option also mounts the filesystem readonly. + + To maintain backward compatibility, '-o nfs' is also accepted, + defaulting to stale_rw + +dos1xfloppy -- If set, use a fallback default BIOS Parameter Block + configuration, determined by backing device size. These static + parameters match defaults assumed by DOS 1.x for 160 kiB, + 180 kiB, 320 kiB, and 360 kiB floppies and floppy images. + <bool>: 0,1,yes,no,true,false @@ -102,7 +206,8 @@ TEST SUITE If you plan to make any modifications to the vfat filesystem, please get the test suite that comes with the vfat distribution at - http://bmrc.berkeley.edu/people/chaffee/vfat.html + http://web.archive.org/web/*/http://bmrc.berkeley.edu/ + people/chaffee/vfat.html This tests quite a few parts of the vfat filesystem and additional tests for new features or untested features would be appreciated. diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt index f042c12e0ed..a1d0d7a3016 100644 --- a/Documentation/filesystems/vfs.txt +++ b/Documentation/filesystems/vfs.txt @@ -3,7 +3,7 @@ Original author: Richard Gooch <rgooch@atnf.csiro.au> - Last updated on August 25, 2005 + Last updated on June 24, 2007. Copyright (C) 1999 Richard Gooch Copyright (C) 2005 Pekka Enberg @@ -11,127 +11,117 @@ This file is released under the GPLv2. -What is it? -=========== +Introduction +============ -The Virtual File System (otherwise known as the Virtual Filesystem -Switch) is the software layer in the kernel that provides the -filesystem interface to userspace programs. It also provides an -abstraction within the kernel which allows different filesystem -implementations to coexist. +The Virtual File System (also known as the Virtual Filesystem Switch) +is the software layer in the kernel that provides the filesystem +interface to userspace programs. It also provides an abstraction +within the kernel which allows different filesystem implementations to +coexist. +VFS system calls open(2), stat(2), read(2), write(2), chmod(2) and so +on are called from a process context. Filesystem locking is described +in the document Documentation/filesystems/Locking. -A Quick Look At How It Works -============================ -In this section I'll briefly describe how things work, before -launching into the details. I'll start with describing what happens -when user programs open and manipulate files, and then look from the -other view which is how a filesystem is supported and subsequently -mounted. - - -Opening a File --------------- - -The VFS implements the open(2), stat(2), chmod(2) and similar system -calls. The pathname argument is used by the VFS to search through the -directory entry cache (dentry cache or "dcache"). This provides a very -fast look-up mechanism to translate a pathname (filename) into a -specific dentry. - -An individual dentry usually has a pointer to an inode. Inodes are the -things that live on disc drives, and can be regular files (you know: -those things that you write data into), directories, FIFOs and other -beasts. Dentries live in RAM and are never saved to disc: they exist -only for performance. Inodes live on disc and are copied into memory -when required. Later any changes are written back to disc. The inode -that lives in RAM is a VFS inode, and it is this which the dentry -points to. A single inode can be pointed to by multiple dentries -(think about hardlinks). - -The dcache is meant to be a view into your entire filespace. Unlike -Linus, most of us losers can't fit enough dentries into RAM to cover -all of our filespace, so the dcache has bits missing. In order to -resolve your pathname into a dentry, the VFS may have to resort to -creating dentries along the way, and then loading the inode. This is -done by looking up the inode. - -To look up an inode (usually read from disc) requires that the VFS -calls the lookup() method of the parent directory inode. This method -is installed by the specific filesystem implementation that the inode -lives in. There will be more on this later. +Directory Entry Cache (dcache) +------------------------------ -Once the VFS has the required dentry (and hence the inode), we can do -all those boring things like open(2) the file, or stat(2) it to peek -at the inode data. The stat(2) operation is fairly simple: once the -VFS has the dentry, it peeks at the inode data and passes some of it -back to userspace. +The VFS implements the open(2), stat(2), chmod(2), and similar system +calls. The pathname argument that is passed to them is used by the VFS +to search through the directory entry cache (also known as the dentry +cache or dcache). This provides a very fast look-up mechanism to +translate a pathname (filename) into a specific dentry. Dentries live +in RAM and are never saved to disc: they exist only for performance. + +The dentry cache is meant to be a view into your entire filespace. As +most computers cannot fit all dentries in the RAM at the same time, +some bits of the cache are missing. In order to resolve your pathname +into a dentry, the VFS may have to resort to creating dentries along +the way, and then loading the inode. This is done by looking up the +inode. + + +The Inode Object +---------------- + +An individual dentry usually has a pointer to an inode. Inodes are +filesystem objects such as regular files, directories, FIFOs and other +beasts. They live either on the disc (for block device filesystems) +or in the memory (for pseudo filesystems). Inodes that live on the +disc are copied into the memory when required and changes to the inode +are written back to disc. A single inode can be pointed to by multiple +dentries (hard links, for example, do this). + +To look up an inode requires that the VFS calls the lookup() method of +the parent directory inode. This method is installed by the specific +filesystem implementation that the inode lives in. Once the VFS has +the required dentry (and hence the inode), we can do all those boring +things like open(2) the file, or stat(2) it to peek at the inode +data. The stat(2) operation is fairly simple: once the VFS has the +dentry, it peeks at the inode data and passes some of it back to +userspace. + + +The File Object +--------------- Opening a file requires another operation: allocation of a file structure (this is the kernel-side implementation of file descriptors). The freshly allocated file structure is initialized with a pointer to the dentry and a set of file operation member functions. These are taken from the inode data. The open() file method is then -called so the specific filesystem implementation can do it's work. You -can see that this is another switch performed by the VFS. - -The file structure is placed into the file descriptor table for the -process. +called so the specific filesystem implementation can do its work. You +can see that this is another switch performed by the VFS. The file +structure is placed into the file descriptor table for the process. Reading, writing and closing files (and other assorted VFS operations) is done by using the userspace file descriptor to grab the appropriate -file structure, and then calling the required file structure method -function to do whatever is required. - -For as long as the file is open, it keeps the dentry "open" (in use), -which in turn means that the VFS inode is still in use. - -All VFS system calls (i.e. open(2), stat(2), read(2), write(2), -chmod(2) and so on) are called from a process context. You should -assume that these calls are made without any kernel locks being -held. This means that the processes may be executing the same piece of -filesystem or driver code at the same time, on different -processors. You should ensure that access to shared resources is -protected by appropriate locks. +file structure, and then calling the required file structure method to +do whatever is required. For as long as the file is open, it keeps the +dentry in use, which in turn means that the VFS inode is still in use. Registering and Mounting a Filesystem -------------------------------------- +===================================== -If you want to support a new kind of filesystem in the kernel, all you -need to do is call register_filesystem(). You pass a structure -describing the filesystem implementation (struct file_system_type) -which is then added to an internal table of supported filesystems. You -can do: +To register and unregister a filesystem, use the following API +functions: -% cat /proc/filesystems + #include <linux/fs.h> -to see what filesystems are currently available on your system. + extern int register_filesystem(struct file_system_type *); + extern int unregister_filesystem(struct file_system_type *); -When a request is made to mount a block device onto a directory in -your filespace the VFS will call the appropriate method for the -specific filesystem. The dentry for the mount point will then be -updated to point to the root inode for the new filesystem. +The passed struct file_system_type describes your filesystem. When a +request is made to mount a filesystem onto a directory in your namespace, +the VFS will call the appropriate mount() method for the specific +filesystem. New vfsmount referring to the tree returned by ->mount() +will be attached to the mountpoint, so that when pathname resolution +reaches the mountpoint it will jump into the root of that vfsmount. -It's now time to look at things in more detail. +You can see all filesystems that are registered to the kernel in the +file /proc/filesystems. struct file_system_type -======================= +----------------------- -This describes the filesystem. As of kernel 2.6.13, the following +This describes the filesystem. As of kernel 2.6.39, the following members are defined: struct file_system_type { const char *name; int fs_flags; - struct super_block *(*get_sb) (struct file_system_type *, int, - const char *, void *); + struct dentry *(*mount) (struct file_system_type *, int, + const char *, void *); void (*kill_sb) (struct super_block *); struct module *owner; struct file_system_type * next; struct list_head fs_supers; + struct lock_class_key s_lock_key; + struct lock_class_key s_umount_key; }; name: the name of the filesystem type, such as "ext2", "iso9660", @@ -139,97 +129,107 @@ struct file_system_type { fs_flags: various flags (i.e. FS_REQUIRES_DEV, FS_NO_DCACHE, etc.) - get_sb: the method to call when a new instance of this + mount: the method to call when a new instance of this filesystem should be mounted kill_sb: the method to call when an instance of this filesystem - should be unmounted + should be shut down owner: for internal VFS use: you should initialize this to THIS_MODULE in most cases. next: for internal VFS use: you should initialize this to NULL -The get_sb() method has the following arguments: + s_lock_key, s_umount_key: lockdep-specific + +The mount() method has the following arguments: - struct super_block *sb: the superblock structure. This is partially - initialized by the VFS and the rest must be initialized by the - get_sb() method + struct file_system_type *fs_type: describes the filesystem, partly initialized + by the specific filesystem code int flags: mount flags const char *dev_name: the device name we are mounting. void *data: arbitrary mount options, usually comes as an ASCII - string + string (see "Mount Options" section) - int silent: whether or not to be silent on error +The mount() method must return the root dentry of the tree requested by +caller. An active reference to its superblock must be grabbed and the +superblock must be locked. On failure it should return ERR_PTR(error). + +The arguments match those of mount(2) and their interpretation +depends on filesystem type. E.g. for block filesystems, dev_name is +interpreted as block device name, that device is opened and if it +contains a suitable filesystem image the method creates and initializes +struct super_block accordingly, returning its root dentry to caller. -The get_sb() method must determine if the block device specified -in the superblock contains a filesystem of the type the method -supports. On success the method returns the superblock pointer, on -failure it returns NULL. +->mount() may choose to return a subtree of existing filesystem - it +doesn't have to create a new one. The main result from the caller's +point of view is a reference to dentry at the root of (sub)tree to +be attached; creation of new superblock is a common side effect. The most interesting member of the superblock structure that the -get_sb() method fills in is the "s_op" field. This is a pointer to +mount() method fills in is the "s_op" field. This is a pointer to a "struct super_operations" which describes the next level of the filesystem implementation. -Usually, a filesystem uses generic one of the generic get_sb() -implementations and provides a fill_super() method instead. The -generic methods are: +Usually, a filesystem uses one of the generic mount() implementations +and provides a fill_super() callback instead. The generic variants are: - get_sb_bdev: mount a filesystem residing on a block device + mount_bdev: mount a filesystem residing on a block device - get_sb_nodev: mount a filesystem that is not backed by a device + mount_nodev: mount a filesystem that is not backed by a device - get_sb_single: mount a filesystem which shares the instance between + mount_single: mount a filesystem which shares the instance between all mounts -A fill_super() method implementation has the following arguments: +A fill_super() callback implementation has the following arguments: - struct super_block *sb: the superblock structure. The method fill_super() + struct super_block *sb: the superblock structure. The callback must initialize this properly. void *data: arbitrary mount options, usually comes as an ASCII - string + string (see "Mount Options" section) int silent: whether or not to be silent on error +The Superblock Object +===================== + +A superblock object represents a mounted filesystem. + + struct super_operations -======================= +----------------------- This describes how the VFS can manipulate the superblock of your -filesystem. As of kernel 2.6.13, the following members are defined: +filesystem. As of kernel 2.6.22, the following members are defined: struct super_operations { struct inode *(*alloc_inode)(struct super_block *sb); void (*destroy_inode)(struct inode *); - void (*read_inode) (struct inode *); - - void (*dirty_inode) (struct inode *); + void (*dirty_inode) (struct inode *, int flags); int (*write_inode) (struct inode *, int); - void (*put_inode) (struct inode *); void (*drop_inode) (struct inode *); void (*delete_inode) (struct inode *); void (*put_super) (struct super_block *); - void (*write_super) (struct super_block *); int (*sync_fs)(struct super_block *sb, int wait); - void (*write_super_lockfs) (struct super_block *); - void (*unlockfs) (struct super_block *); - int (*statfs) (struct super_block *, struct kstatfs *); + int (*freeze_fs) (struct super_block *); + int (*unfreeze_fs) (struct super_block *); + int (*statfs) (struct dentry *, struct kstatfs *); int (*remount_fs) (struct super_block *, int *, char *); void (*clear_inode) (struct inode *); void (*umount_begin) (struct super_block *); - void (*sync_inodes) (struct super_block *sb, - struct writeback_control *wbc); - int (*show_options)(struct seq_file *, struct vfsmount *); + int (*show_options)(struct seq_file *, struct dentry *); ssize_t (*quota_read)(struct super_block *, int, char *, size_t, loff_t); ssize_t (*quota_write)(struct super_block *, int, const char *, size_t, loff_t); + int (*nr_cached_objects)(struct super_block *); + void (*free_cached_objects)(struct super_block *, int); }; All methods are called without any locks being held, unless otherwise @@ -238,19 +238,15 @@ only called from a process context (i.e. not from an interrupt handler or bottom half). alloc_inode: this method is called by inode_alloc() to allocate memory - for struct inode and initialize it. + for struct inode and initialize it. If this function is not + defined, a simple 'struct inode' is allocated. Normally + alloc_inode will be used to allocate a larger structure which + contains a 'struct inode' embedded within it. destroy_inode: this method is called by destroy_inode() to release - resources allocated for struct inode. - - read_inode: this method is called to read a specific inode from the - mounted filesystem. The i_ino member in the struct inode is - initialized by the VFS to indicate which inode to read. Other - members are filled in by this method. - - You can set this to NULL and use iget5_locked() instead of iget() - to read inodes. This is necessary for filesystems for which the - inode number is not sufficient to identify an inode. + resources allocated for struct inode. It is only required if + ->alloc_inode was defined and simply undoes anything done by + ->alloc_inode. dirty_inode: this method is called by the VFS to mark an inode dirty. @@ -258,11 +254,8 @@ or bottom half). inode to disc. The second parameter indicates whether the write should be synchronous or not, not all filesystems check this flag. - put_inode: called when the VFS inode is removed from the inode - cache. - drop_inode: called when the last access to the inode is dropped, - with the inode_lock spinlock held. + with the inode->i_lock spinlock held. This method should be either NULL (normal UNIX filesystem semantics) or "generic_delete_inode" (for filesystems that do not @@ -279,22 +272,18 @@ or bottom half). put_super: called when the VFS wishes to free the superblock (i.e. unmount). This is called with the superblock lock held - write_super: called when the VFS superblock needs to be written to - disc. This method is optional - sync_fs: called when VFS is writing out all dirty data associated with a superblock. The second parameter indicates whether the method should wait until the write out has been completed. Optional. - write_super_lockfs: called when VFS is locking a filesystem and forcing - it into a consistent state. This function is currently used by the - Logical Volume Manager (LVM). + freeze_fs: called when VFS is locking a filesystem and + forcing it into a consistent state. This method is currently + used by the Logical Volume Manager (LVM). - unlockfs: called when VFS is unlocking a filesystem and making it writable + unfreeze_fs: called when VFS is unlocking a filesystem and making it writable again. - statfs: called when the VFS needs to get filesystem statistics. This - is called with the kernel lock held + statfs: called when the VFS needs to get filesystem statistics. remount_fs: called when the filesystem is remounted. This is called with the kernel lock held @@ -303,48 +292,78 @@ or bottom half). umount_begin: called when the VFS is unmounting a filesystem. - sync_inodes: called when the VFS is writing out dirty data associated with - a superblock. - - show_options: called by the VFS to show mount options for /proc/<pid>/mounts. + show_options: called by the VFS to show mount options for + /proc/<pid>/mounts. (see "Mount Options" section) quota_read: called by the VFS to read from filesystem quota file. quota_write: called by the VFS to write to filesystem quota file. -The read_inode() method is responsible for filling in the "i_op" -field. This is a pointer to a "struct inode_operations" which -describes the methods that can be performed on individual inodes. + nr_cached_objects: called by the sb cache shrinking function for the + filesystem to return the number of freeable cached objects it contains. + Optional. + + free_cache_objects: called by the sb cache shrinking function for the + filesystem to scan the number of objects indicated to try to free them. + Optional, but any filesystem implementing this method needs to also + implement ->nr_cached_objects for it to be called correctly. + + We can't do anything with any errors that the filesystem might + encountered, hence the void return type. This will never be called if + the VM is trying to reclaim under GFP_NOFS conditions, hence this + method does not need to handle that situation itself. + + Implementations must include conditional reschedule calls inside any + scanning loop that is done. This allows the VFS to determine + appropriate scan batch sizes without having to worry about whether + implementations will cause holdoff problems due to large scan batch + sizes. + +Whoever sets up the inode is responsible for filling in the "i_op" field. This +is a pointer to a "struct inode_operations" which describes the methods that +can be performed on individual inodes. + + +The Inode Object +================ + +An inode object represents an object within the filesystem. struct inode_operations -======================= +----------------------- This describes how the VFS can manipulate an inode in your -filesystem. As of kernel 2.6.13, the following members are defined: +filesystem. As of kernel 2.6.22, the following members are defined: struct inode_operations { - int (*create) (struct inode *,struct dentry *,int, struct nameidata *); - struct dentry * (*lookup) (struct inode *,struct dentry *, struct nameidata *); + int (*create) (struct inode *,struct dentry *, umode_t, bool); + struct dentry * (*lookup) (struct inode *,struct dentry *, unsigned int); int (*link) (struct dentry *,struct inode *,struct dentry *); int (*unlink) (struct inode *,struct dentry *); int (*symlink) (struct inode *,struct dentry *,const char *); - int (*mkdir) (struct inode *,struct dentry *,int); + int (*mkdir) (struct inode *,struct dentry *,umode_t); int (*rmdir) (struct inode *,struct dentry *); - int (*mknod) (struct inode *,struct dentry *,int,dev_t); + int (*mknod) (struct inode *,struct dentry *,umode_t,dev_t); int (*rename) (struct inode *, struct dentry *, struct inode *, struct dentry *); + int (*rename2) (struct inode *, struct dentry *, + struct inode *, struct dentry *, unsigned int); int (*readlink) (struct dentry *, char __user *,int); void * (*follow_link) (struct dentry *, struct nameidata *); void (*put_link) (struct dentry *, struct nameidata *, void *); - void (*truncate) (struct inode *); - int (*permission) (struct inode *, int, struct nameidata *); + int (*permission) (struct inode *, int); + int (*get_acl)(struct inode *, int); int (*setattr) (struct dentry *, struct iattr *); int (*getattr) (struct vfsmount *mnt, struct dentry *, struct kstat *); int (*setxattr) (struct dentry *, const char *,const void *,size_t,int); ssize_t (*getxattr) (struct dentry *, const char *, void *, size_t); ssize_t (*listxattr) (struct dentry *, char *, size_t); int (*removexattr) (struct dentry *, const char *); + void (*update_time)(struct inode *, struct timespec *, int); + int (*atomic_open)(struct inode *, struct dentry *, struct file *, + unsigned open_flag, umode_t create_mode, int *opened); + int (*tmpfile) (struct inode *, struct dentry *, umode_t); }; Again, all methods are called without any locks being held, unless @@ -394,147 +413,422 @@ otherwise noted. will probably need to call d_instantiate() just as you would in the create() method + rename: called by the rename(2) system call to rename the object to + have the parent and name given by the second inode and dentry. + + rename2: this has an additional flags argument compared to rename. + If no flags are supported by the filesystem then this method + need not be implemented. If some flags are supported then the + filesystem must return -EINVAL for any unsupported or unknown + flags. Currently the following flags are implemented: + (1) RENAME_NOREPLACE: this flag indicates that if the target + of the rename exists the rename should fail with -EEXIST + instead of replacing the target. The VFS already checks for + existence, so for local filesystems the RENAME_NOREPLACE + implementation is equivalent to plain rename. + (2) RENAME_EXCHANGE: exchange source and target. Both must + exist; this is checked by the VFS. Unlike plain rename, + source and target may be of different type. + readlink: called by the readlink(2) system call. Only required if you want to support reading symbolic links follow_link: called by the VFS to follow a symbolic link to the inode it points to. Only required if you want to support - symbolic links. This function returns a void pointer cookie + symbolic links. This method returns a void pointer cookie that is passed to put_link(). put_link: called by the VFS to release resources allocated by - follow_link(). The cookie returned by follow_link() is passed to - to this function as the last parameter. It is used by filesystems - such as NFS where page cache is not stable (i.e. page that was - installed when the symbolic link walk started might not be in the - page cache at the end of the walk). - - truncate: called by the VFS to change the size of a file. The i_size - field of the inode is set to the desired size by the VFS before - this function is called. This function is called by the truncate(2) - system call and related functionality. + follow_link(). The cookie returned by follow_link() is passed + to this method as the last parameter. It is used by + filesystems such as NFS where page cache is not stable + (i.e. page that was installed when the symbolic link walk + started might not be in the page cache at the end of the + walk). permission: called by the VFS to check for access rights on a POSIX-like filesystem. - setattr: called by the VFS to set attributes for a file. This function is - called by chmod(2) and related system calls. - - getattr: called by the VFS to get attributes of a file. This function is - called by stat(2) and related system calls. + May be called in rcu-walk mode (mask & MAY_NOT_BLOCK). If in rcu-walk + mode, the filesystem must check the permission without blocking or + storing to the inode. - setxattr: called by the VFS to set an extended attribute for a file. - Extended attribute is a name:value pair associated with an inode. This - function is called by setxattr(2) system call. - - getxattr: called by the VFS to retrieve the value of an extended attribute - name. This function is called by getxattr(2) function call. + If a situation is encountered that rcu-walk cannot handle, return + -ECHILD and it will be called again in ref-walk mode. - listxattr: called by the VFS to list all extended attributes for a given - file. This function is called by listxattr(2) system call. + setattr: called by the VFS to set attributes for a file. This method + is called by chmod(2) and related system calls. - removexattr: called by the VFS to remove an extended attribute from a file. - This function is called by removexattr(2) system call. + getattr: called by the VFS to get attributes of a file. This method + is called by stat(2) and related system calls. + setxattr: called by the VFS to set an extended attribute for a file. + Extended attribute is a name:value pair associated with an + inode. This method is called by setxattr(2) system call. + + getxattr: called by the VFS to retrieve the value of an extended + attribute name. This method is called by getxattr(2) function + call. + + listxattr: called by the VFS to list all extended attributes for a + given file. This method is called by listxattr(2) system call. + + removexattr: called by the VFS to remove an extended attribute from + a file. This method is called by removexattr(2) system call. + + update_time: called by the VFS to update a specific time or the i_version of + an inode. If this is not defined the VFS will update the inode itself + and call mark_inode_dirty_sync. + + atomic_open: called on the last component of an open. Using this optional + method the filesystem can look up, possibly create and open the file in + one atomic operation. If it cannot perform this (e.g. the file type + turned out to be wrong) it may signal this by returning 1 instead of + usual 0 or -ve . This method is only called if the last component is + negative or needs lookup. Cached positive dentries are still handled by + f_op->open(). If the file was created, the FILE_CREATED flag should be + set in "opened". In case of O_EXCL the method must only succeed if the + file didn't exist and hence FILE_CREATED shall always be set on success. + + tmpfile: called in the end of O_TMPFILE open(). Optional, equivalent to + atomically creating, opening and unlinking a file in given directory. + +The Address Space Object +======================== + +The address space object is used to group and manage pages in the page +cache. It can be used to keep track of the pages in a file (or +anything else) and also track the mapping of sections of the file into +process address spaces. + +There are a number of distinct yet related services that an +address-space can provide. These include communicating memory +pressure, page lookup by address, and keeping track of pages tagged as +Dirty or Writeback. + +The first can be used independently to the others. The VM can try to +either write dirty pages in order to clean them, or release clean +pages in order to reuse them. To do this it can call the ->writepage +method on dirty pages, and ->releasepage on clean pages with +PagePrivate set. Clean pages without PagePrivate and with no external +references will be released without notice being given to the +address_space. + +To achieve this functionality, pages need to be placed on an LRU with +lru_cache_add and mark_page_active needs to be called whenever the +page is used. + +Pages are normally kept in a radix tree index by ->index. This tree +maintains information about the PG_Dirty and PG_Writeback status of +each page, so that pages with either of these flags can be found +quickly. + +The Dirty tag is primarily used by mpage_writepages - the default +->writepages method. It uses the tag to find dirty pages to call +->writepage on. If mpage_writepages is not used (i.e. the address +provides its own ->writepages) , the PAGECACHE_TAG_DIRTY tag is +almost unused. write_inode_now and sync_inode do use it (through +__sync_single_inode) to check if ->writepages has been successful in +writing out the whole address_space. + +The Writeback tag is used by filemap*wait* and sync_page* functions, +via filemap_fdatawait_range, to wait for all writeback to +complete. While waiting ->sync_page (if defined) will be called on +each page that is found to require writeback. + +An address_space handler may attach extra information to a page, +typically using the 'private' field in the 'struct page'. If such +information is attached, the PG_Private flag should be set. This will +cause various VM routines to make extra calls into the address_space +handler to deal with that data. + +An address space acts as an intermediate between storage and +application. Data is read into the address space a whole page at a +time, and provided to the application either by copying of the page, +or by memory-mapping the page. +Data is written into the address space by the application, and then +written-back to storage typically in whole pages, however the +address_space has finer control of write sizes. + +The read process essentially only requires 'readpage'. The write +process is more complicated and uses write_begin/write_end or +set_page_dirty to write data into the address_space, and writepage, +sync_page, and writepages to writeback data to storage. + +Adding and removing pages to/from an address_space is protected by the +inode's i_mutex. + +When data is written to a page, the PG_Dirty flag should be set. It +typically remains set until writepage asks for it to be written. This +should clear PG_Dirty and set PG_Writeback. It can be actually +written at any point after PG_Dirty is clear. Once it is known to be +safe, PG_Writeback is cleared. + +Writeback makes use of a writeback_control structure... struct address_space_operations -=============================== +------------------------------- This describes how the VFS can manipulate mapping of a file to page cache in -your filesystem. As of kernel 2.6.13, the following members are defined: +your filesystem. The following members are defined: struct address_space_operations { int (*writepage)(struct page *page, struct writeback_control *wbc); int (*readpage)(struct file *, struct page *); - int (*sync_page)(struct page *); int (*writepages)(struct address_space *, struct writeback_control *); int (*set_page_dirty)(struct page *page); int (*readpages)(struct file *filp, struct address_space *mapping, struct list_head *pages, unsigned nr_pages); - int (*prepare_write)(struct file *, struct page *, unsigned, unsigned); - int (*commit_write)(struct file *, struct page *, unsigned, unsigned); + int (*write_begin)(struct file *, struct address_space *mapping, + loff_t pos, unsigned len, unsigned flags, + struct page **pagep, void **fsdata); + int (*write_end)(struct file *, struct address_space *mapping, + loff_t pos, unsigned len, unsigned copied, + struct page *page, void *fsdata); sector_t (*bmap)(struct address_space *, sector_t); - int (*invalidatepage) (struct page *, unsigned long); + void (*invalidatepage) (struct page *, unsigned int, unsigned int); int (*releasepage) (struct page *, int); - ssize_t (*direct_IO)(int, struct kiocb *, const struct iovec *iov, - loff_t offset, unsigned long nr_segs); + void (*freepage)(struct page *); + ssize_t (*direct_IO)(int, struct kiocb *, struct iov_iter *iter, loff_t offset); struct page* (*get_xip_page)(struct address_space *, sector_t, int); + /* migrate the contents of a page to the specified target */ + int (*migratepage) (struct page *, struct page *); + int (*launder_page) (struct page *); + int (*is_partially_uptodate) (struct page *, unsigned long, + unsigned long); + void (*is_dirty_writeback) (struct page *, bool *, bool *); + int (*error_remove_page) (struct mapping *mapping, struct page *page); + int (*swap_activate)(struct file *); + int (*swap_deactivate)(struct file *); }; - writepage: called by the VM write a dirty page to backing store. + writepage: called by the VM to write a dirty page to backing store. + This may happen for data integrity reasons (i.e. 'sync'), or + to free up memory (flush). The difference can be seen in + wbc->sync_mode. + The PG_Dirty flag has been cleared and PageLocked is true. + writepage should start writeout, should set PG_Writeback, + and should make sure the page is unlocked, either synchronously + or asynchronously when the write operation completes. - readpage: called by the VM to read a page from backing store. + If wbc->sync_mode is WB_SYNC_NONE, ->writepage doesn't have to + try too hard if there are problems, and may choose to write out + other pages from the mapping if that is easier (e.g. due to + internal dependencies). If it chooses not to start writeout, it + should return AOP_WRITEPAGE_ACTIVATE so that the VM will not keep + calling ->writepage on that page. + + See the file "Locking" for more details. - sync_page: called by the VM to notify the backing store to perform all - queued I/O operations for a page. I/O operations for other pages - associated with this address_space object may also be performed. + readpage: called by the VM to read a page from backing store. + The page will be Locked when readpage is called, and should be + unlocked and marked uptodate once the read completes. + If ->readpage discovers that it needs to unlock the page for + some reason, it can do so, and then return AOP_TRUNCATED_PAGE. + In this case, the page will be relocated, relocked and if + that all succeeds, ->readpage will be called again. writepages: called by the VM to write out pages associated with the - address_space object. + address_space object. If wbc->sync_mode is WBC_SYNC_ALL, then + the writeback_control will specify a range of pages that must be + written out. If it is WBC_SYNC_NONE, then a nr_to_write is given + and that many pages should be written if possible. + If no ->writepages is given, then mpage_writepages is used + instead. This will choose pages from the address space that are + tagged as DIRTY and will pass them to ->writepage. set_page_dirty: called by the VM to set a page dirty. + This is particularly needed if an address space attaches + private data to a page, and that data needs to be updated when + a page is dirtied. This is called, for example, when a memory + mapped page gets modified. + If defined, it should set the PageDirty flag, and the + PAGECACHE_TAG_DIRTY tag in the radix tree. readpages: called by the VM to read pages associated with the address_space - object. + object. This is essentially just a vector version of + readpage. Instead of just one page, several pages are + requested. + readpages is only used for read-ahead, so read errors are + ignored. If anything goes wrong, feel free to give up. - prepare_write: called by the generic write path in VM to set up a write - request for a page. + write_begin: + Called by the generic buffered write code to ask the filesystem to + prepare to write len bytes at the given offset in the file. The + address_space should check that the write will be able to complete, + by allocating space if necessary and doing any other internal + housekeeping. If the write will update parts of any basic-blocks on + storage, then those blocks should be pre-read (if they haven't been + read already) so that the updated blocks can be written out properly. - commit_write: called by the generic write path in VM to write page to - its backing store. + The filesystem must return the locked pagecache page for the specified + offset, in *pagep, for the caller to write into. - bmap: called by the VFS to map a logical block offset within object to - physical block number. This method is use by for the legacy FIBMAP - ioctl. Other uses are discouraged. + It must be able to cope with short writes (where the length passed to + write_begin is greater than the number of bytes copied into the page). + + flags is a field for AOP_FLAG_xxx flags, described in + include/linux/fs.h. + + A void * may be returned in fsdata, which then gets passed into + write_end. - invalidatepage: called by the VM on truncate to disassociate a page from its - address_space mapping. + Returns 0 on success; < 0 on failure (which is the error code), in + which case write_end is not called. - releasepage: called by the VFS to release filesystem specific metadata from - a page. + write_end: After a successful write_begin, and data copy, write_end must + be called. len is the original len passed to write_begin, and copied + is the amount that was able to be copied (copied == len is always true + if write_begin was called with the AOP_FLAG_UNINTERRUPTIBLE flag). - direct_IO: called by the VM for direct I/O writes and reads. + The filesystem must take care of unlocking the page and releasing it + refcount, and updating i_size. + + Returns < 0 on failure, otherwise the number of bytes (<= 'copied') + that were able to be copied into pagecache. + + bmap: called by the VFS to map a logical block offset within object to + physical block number. This method is used by the FIBMAP + ioctl and for working with swap-files. To be able to swap to + a file, the file must have a stable mapping to a block + device. The swap system does not go through the filesystem + but instead uses bmap to find out where the blocks in the file + are and uses those addresses directly. + + + invalidatepage: If a page has PagePrivate set, then invalidatepage + will be called when part or all of the page is to be removed + from the address space. This generally corresponds to either a + truncation, punch hole or a complete invalidation of the address + space (in the latter case 'offset' will always be 0 and 'length' + will be PAGE_CACHE_SIZE). Any private data associated with the page + should be updated to reflect this truncation. If offset is 0 and + length is PAGE_CACHE_SIZE, then the private data should be released, + because the page must be able to be completely discarded. This may + be done by calling the ->releasepage function, but in this case the + release MUST succeed. + + releasepage: releasepage is called on PagePrivate pages to indicate + that the page should be freed if possible. ->releasepage + should remove any private data from the page and clear the + PagePrivate flag. If releasepage() fails for some reason, it must + indicate failure with a 0 return value. + releasepage() is used in two distinct though related cases. The + first is when the VM finds a clean page with no active users and + wants to make it a free page. If ->releasepage succeeds, the + page will be removed from the address_space and become free. + + The second case is when a request has been made to invalidate + some or all pages in an address_space. This can happen + through the fadvice(POSIX_FADV_DONTNEED) system call or by the + filesystem explicitly requesting it as nfs and 9fs do (when + they believe the cache may be out of date with storage) by + calling invalidate_inode_pages2(). + If the filesystem makes such a call, and needs to be certain + that all pages are invalidated, then its releasepage will + need to ensure this. Possibly it can clear the PageUptodate + bit if it cannot free private data yet. + + freepage: freepage is called once the page is no longer visible in + the page cache in order to allow the cleanup of any private + data. Since it may be called by the memory reclaimer, it + should not assume that the original address_space mapping still + exists, and it should not block. + + direct_IO: called by the generic read/write routines to perform + direct_IO - that is IO requests which bypass the page cache + and transfer data directly between the storage and the + application's address space. get_xip_page: called by the VM to translate a block number to a page. The page is valid until the corresponding filesystem is unmounted. Filesystems that want to use execute-in-place (XIP) need to implement it. An example implementation can be found in fs/ext2/xip.c. + migrate_page: This is used to compact the physical memory usage. + If the VM wants to relocate a page (maybe off a memory card + that is signalling imminent failure) it will pass a new page + and an old page to this function. migrate_page should + transfer any private data across and update any references + that it has to the page. + + launder_page: Called before freeing a page - it writes back the dirty page. To + prevent redirtying the page, it is kept locked during the whole + operation. + + is_partially_uptodate: Called by the VM when reading a file through the + pagecache when the underlying blocksize != pagesize. If the required + block is up to date then the read can complete without needing the IO + to bring the whole page up to date. + + is_dirty_writeback: Called by the VM when attempting to reclaim a page. + The VM uses dirty and writeback information to determine if it needs + to stall to allow flushers a chance to complete some IO. Ordinarily + it can use PageDirty and PageWriteback but some filesystems have + more complex state (unstable pages in NFS prevent reclaim) or + do not set those flags due to locking problems (jbd). This callback + allows a filesystem to indicate to the VM if a page should be + treated as dirty or writeback for the purposes of stalling. + + error_remove_page: normally set to generic_error_remove_page if truncation + is ok for this address space. Used for memory failure handling. + Setting this implies you deal with pages going away under you, + unless you have them locked or reference counts increased. + + swap_activate: Called when swapon is used on a file to allocate + space if necessary and pin the block lookup information in + memory. A return value of zero indicates success, + in which case this file can be used to back swapspace. The + swapspace operations will be proxied to this address space's + ->swap_{out,in} methods. + + swap_deactivate: Called during swapoff on files where swap_activate + was successful. + + +The File Object +=============== + +A file object represents a file opened by a process. + struct file_operations -====================== +---------------------- This describes how the VFS can manipulate an open file. As of kernel -2.6.13, the following members are defined: +3.12, the following members are defined: struct file_operations { + struct module *owner; loff_t (*llseek) (struct file *, loff_t, int); ssize_t (*read) (struct file *, char __user *, size_t, loff_t *); - ssize_t (*aio_read) (struct kiocb *, char __user *, size_t, loff_t); ssize_t (*write) (struct file *, const char __user *, size_t, loff_t *); - ssize_t (*aio_write) (struct kiocb *, const char __user *, size_t, loff_t); - int (*readdir) (struct file *, void *, filldir_t); + ssize_t (*aio_read) (struct kiocb *, const struct iovec *, unsigned long, loff_t); + ssize_t (*aio_write) (struct kiocb *, const struct iovec *, unsigned long, loff_t); + ssize_t (*read_iter) (struct kiocb *, struct iov_iter *); + ssize_t (*write_iter) (struct kiocb *, struct iov_iter *); + int (*iterate) (struct file *, struct dir_context *); unsigned int (*poll) (struct file *, struct poll_table_struct *); - int (*ioctl) (struct inode *, struct file *, unsigned int, unsigned long); long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long); long (*compat_ioctl) (struct file *, unsigned int, unsigned long); int (*mmap) (struct file *, struct vm_area_struct *); int (*open) (struct inode *, struct file *); int (*flush) (struct file *); int (*release) (struct inode *, struct file *); - int (*fsync) (struct file *, struct dentry *, int datasync); + int (*fsync) (struct file *, loff_t, loff_t, int datasync); int (*aio_fsync) (struct kiocb *, int datasync); int (*fasync) (int, struct file *, int); int (*lock) (struct file *, int, struct file_lock *); - ssize_t (*readv) (struct file *, const struct iovec *, unsigned long, loff_t *); - ssize_t (*writev) (struct file *, const struct iovec *, unsigned long, loff_t *); - ssize_t (*sendfile) (struct file *, loff_t *, size_t, read_actor_t, void *); ssize_t (*sendpage) (struct file *, struct page *, int, size_t, loff_t *, int); unsigned long (*get_unmapped_area)(struct file *, unsigned long, unsigned long, unsigned long, unsigned long); int (*check_flags)(int); - int (*dir_notify)(struct file *filp, unsigned long arg); int (*flock) (struct file *, int, struct file_lock *); + ssize_t (*splice_write)(struct pipe_inode_info *, struct file *, size_t, unsigned int); + ssize_t (*splice_read)(struct file *, struct pipe_inode_info *, size_t, unsigned int); + int (*setlease)(struct file *, long arg, struct file_lock **); + long (*fallocate)(struct file *, int mode, loff_t offset, loff_t len); + int (*show_fdinfo)(struct seq_file *m, struct file *f); }; Again, all methods are called without any locks being held, unless @@ -544,22 +838,23 @@ otherwise noted. read: called by read(2) and related system calls - aio_read: called by io_submit(2) and other asynchronous I/O operations + aio_read: vectored, possibly asynchronous read + + read_iter: possibly asynchronous read with iov_iter as destination write: called by write(2) and related system calls - aio_write: called by io_submit(2) and other asynchronous I/O operations + aio_write: vectored, possibly asynchronous write + + write_iter: possibly asynchronous write with iov_iter as source - readdir: called when the VFS needs to read the directory contents + iterate: called when the VFS needs to read the directory contents poll: called by the VFS when a process wants to check if there is activity on this file and (optionally) go to sleep until there is activity. Called by the select(2) and poll(2) system calls - ioctl: called by the ioctl(2) system call - - unlocked_ioctl: called by the ioctl(2) system call. Filesystems that do not - require the BKL should use this method instead of the ioctl() above. + unlocked_ioctl: called by the ioctl(2) system call. compat_ioctl: called by the ioctl(2) system call when 32 bit system calls are used on 64 bit kernels. @@ -588,19 +883,22 @@ otherwise noted. lock: called by the fcntl(2) system call for F_GETLK, F_SETLK, and F_SETLKW commands - readv: called by the readv(2) system call + get_unmapped_area: called by the mmap(2) system call - writev: called by the writev(2) system call + check_flags: called by the fcntl(2) system call for F_SETFL command - sendfile: called by the sendfile(2) system call + flock: called by the flock(2) system call - get_unmapped_area: called by the mmap(2) system call + splice_write: called by the VFS to splice data from a pipe to a file. This + method is used by the splice(2) system call - check_flags: called by the fcntl(2) system call for F_SETFL command + splice_read: called by the VFS to splice data from file to a pipe. This + method is used by the splice(2) system call - dir_notify: called by the fcntl(2) system call for F_NOTIFY command + setlease: called by the VFS to set or release a file lock lease. + setlease has the file_lock_lock held and must not sleep. - flock: called by the flock(2) system call + fallocate: called by the VFS to preallocate blocks or punch a hole. Note that the file operations are implemented by the specific filesystem in which the inode resides. When opening a device node @@ -624,30 +922,85 @@ This describes how a filesystem can overload the standard dentry operations. Dentries and the dcache are the domain of the VFS and the individual filesystem implementations. Device drivers have no business here. These methods may be set to NULL, as they are either optional or -the VFS uses a default. As of kernel 2.6.13, the following members are +the VFS uses a default. As of kernel 2.6.22, the following members are defined: struct dentry_operations { - int (*d_revalidate)(struct dentry *, struct nameidata *); - int (*d_hash) (struct dentry *, struct qstr *); - int (*d_compare) (struct dentry *, struct qstr *, struct qstr *); - int (*d_delete)(struct dentry *); + int (*d_revalidate)(struct dentry *, unsigned int); + int (*d_weak_revalidate)(struct dentry *, unsigned int); + int (*d_hash)(const struct dentry *, struct qstr *); + int (*d_compare)(const struct dentry *, const struct dentry *, + unsigned int, const char *, const struct qstr *); + int (*d_delete)(const struct dentry *); void (*d_release)(struct dentry *); void (*d_iput)(struct dentry *, struct inode *); + char *(*d_dname)(struct dentry *, char *, int); + struct vfsmount *(*d_automount)(struct path *); + int (*d_manage)(struct dentry *, bool); }; d_revalidate: called when the VFS needs to revalidate a dentry. This is called whenever a name look-up finds a dentry in the - dcache. Most filesystems leave this as NULL, because all their - dentries in the dcache are valid + dcache. Most local filesystems leave this as NULL, because all their + dentries in the dcache are valid. Network filesystems are different + since things can change on the server without the client necessarily + being aware of it. + + This function should return a positive value if the dentry is still + valid, and zero or a negative error code if it isn't. + + d_revalidate may be called in rcu-walk mode (flags & LOOKUP_RCU). + If in rcu-walk mode, the filesystem must revalidate the dentry without + blocking or storing to the dentry, d_parent and d_inode should not be + used without care (because they can change and, in d_inode case, even + become NULL under us). + + If a situation is encountered that rcu-walk cannot handle, return + -ECHILD and it will be called again in ref-walk mode. + + d_weak_revalidate: called when the VFS needs to revalidate a "jumped" dentry. + This is called when a path-walk ends at dentry that was not acquired by + doing a lookup in the parent directory. This includes "/", "." and "..", + as well as procfs-style symlinks and mountpoint traversal. + + In this case, we are less concerned with whether the dentry is still + fully correct, but rather that the inode is still valid. As with + d_revalidate, most local filesystems will set this to NULL since their + dcache entries are always valid. - d_hash: called when the VFS adds a dentry to the hash table + This function has the same return code semantics as d_revalidate. - d_compare: called when a dentry should be compared with another + d_weak_revalidate is only called after leaving rcu-walk mode. - d_delete: called when the last reference to a dentry is - deleted. This means no-one is using the dentry, however it is - still valid and in the dcache + d_hash: called when the VFS adds a dentry to the hash table. The first + dentry passed to d_hash is the parent directory that the name is + to be hashed into. + + Same locking and synchronisation rules as d_compare regarding + what is safe to dereference etc. + + d_compare: called to compare a dentry name with a given name. The first + dentry is the parent of the dentry to be compared, the second is + the child dentry. len and name string are properties of the dentry + to be compared. qstr is the name to compare it with. + + Must be constant and idempotent, and should not take locks if + possible, and should not or store into the dentry. + Should not dereference pointers outside the dentry without + lots of care (eg. d_parent, d_inode, d_name should not be used). + + However, our vfsmount is pinned, and RCU held, so the dentries and + inodes won't disappear, neither will our sb or filesystem module. + ->d_sb may be used. + + It is a tricky calling convention because it needs to be called under + "rcu-walk", ie. without any locks or references on things. + + d_delete: called when the last reference to a dentry is dropped and the + dcache is deciding whether or not to cache it. Return 1 to delete + immediately, or 0 to cache the dentry. Default is NULL which means to + always cache a reachable dentry. d_delete must be constant and + idempotent. d_release: called when a dentry is really deallocated @@ -656,12 +1009,69 @@ struct dentry_operations { VFS calls iput(). If you define this method, you must call iput() yourself + d_dname: called when the pathname of a dentry should be generated. + Useful for some pseudo filesystems (sockfs, pipefs, ...) to delay + pathname generation. (Instead of doing it when dentry is created, + it's done only when the path is needed.). Real filesystems probably + dont want to use it, because their dentries are present in global + dcache hash, so their hash should be an invariant. As no lock is + held, d_dname() should not try to modify the dentry itself, unless + appropriate SMP safety is used. CAUTION : d_path() logic is quite + tricky. The correct way to return for example "Hello" is to put it + at the end of the buffer, and returns a pointer to the first char. + dynamic_dname() helper function is provided to take care of this. + + d_automount: called when an automount dentry is to be traversed (optional). + This should create a new VFS mount record and return the record to the + caller. The caller is supplied with a path parameter giving the + automount directory to describe the automount target and the parent + VFS mount record to provide inheritable mount parameters. NULL should + be returned if someone else managed to make the automount first. If + the vfsmount creation failed, then an error code should be returned. + If -EISDIR is returned, then the directory will be treated as an + ordinary directory and returned to pathwalk to continue walking. + + If a vfsmount is returned, the caller will attempt to mount it on the + mountpoint and will remove the vfsmount from its expiration list in + the case of failure. The vfsmount should be returned with 2 refs on + it to prevent automatic expiration - the caller will clean up the + additional ref. + + This function is only used if DCACHE_NEED_AUTOMOUNT is set on the + dentry. This is set by __d_instantiate() if S_AUTOMOUNT is set on the + inode being added. + + d_manage: called to allow the filesystem to manage the transition from a + dentry (optional). This allows autofs, for example, to hold up clients + waiting to explore behind a 'mountpoint' whilst letting the daemon go + past and construct the subtree there. 0 should be returned to let the + calling process continue. -EISDIR can be returned to tell pathwalk to + use this directory as an ordinary directory and to ignore anything + mounted on it and not to check the automount flag. Any other error + code will abort pathwalk completely. + + If the 'rcu_walk' parameter is true, then the caller is doing a + pathwalk in RCU-walk mode. Sleeping is not permitted in this mode, + and the caller can be asked to leave it and call again by returning + -ECHILD. + + This function is only used if DCACHE_MANAGE_TRANSIT is set on the + dentry being transited from. + +Example : + +static char *pipefs_dname(struct dentry *dent, char *buffer, int buflen) +{ + return dynamic_dname(dentry, buffer, buflen, "pipe:[%lu]", + dentry->d_inode->i_ino); +} + Each dentry has a pointer to its parent dentry, as well as a hash list of child dentries. Child dentries are basically like files in a directory. -Directory Entry Cache APIs +Directory Entry Cache API -------------------------- There are a number of functions defined which permit a filesystem to @@ -671,14 +1081,11 @@ manipulate dentries: the usage count) dput: close a handle for a dentry (decrements the usage count). If - the usage count drops to 0, the "d_delete" method is called - and the dentry is placed on the unused list if the dentry is - still in its parents hash list. Putting the dentry on the - unused list just means that if the system needs some RAM, it - goes through the unused list of dentries and deallocates them. - If the dentry has already been unhashed and the usage count - drops to 0, in this case the dentry is deallocated after the - "d_delete" method is called + the usage count drops to 0, and the dentry is still in its + parent's hash, the "d_delete" method is called to check whether + it should be cached. If it should not be cached, or if the dentry + is not hashed, it is deleted. Otherwise cached dentries are put + into an LRU list to be reclaimed on memory shortage. d_drop: this unhashes a dentry from its parents hash list. A subsequent call to dput() will deallocate the dentry if its @@ -702,181 +1109,67 @@ manipulate dentries: d_lookup: look up a dentry given its parent and path name component It looks up the child of that given name from the dcache hash table. If it is found, the reference count is incremented - and the dentry is returned. The caller must use d_put() + and the dentry is returned. The caller must use dput() to free the dentry when it finishes using it. +Mount Options +============= -RCU-based dcache locking model ------------------------------- +Parsing options +--------------- -On many workloads, the most common operation on dcache is -to look up a dentry, given a parent dentry and the name -of the child. Typically, for every open(), stat() etc., -the dentry corresponding to the pathname will be looked -up by walking the tree starting with the first component -of the pathname and using that dentry along with the next -component to look up the next level and so on. Since it -is a frequent operation for workloads like multiuser -environments and web servers, it is important to optimize -this path. - -Prior to 2.5.10, dcache_lock was acquired in d_lookup and thus -in every component during path look-up. Since 2.5.10 onwards, -fast-walk algorithm changed this by holding the dcache_lock -at the beginning and walking as many cached path component -dentries as possible. This significantly decreases the number -of acquisition of dcache_lock. However it also increases the -lock hold time significantly and affects performance in large -SMP machines. Since 2.5.62 kernel, dcache has been using -a new locking model that uses RCU to make dcache look-up -lock-free. - -The current dcache locking model is not very different from the existing -dcache locking model. Prior to 2.5.62 kernel, dcache_lock -protected the hash chain, d_child, d_alias, d_lru lists as well -as d_inode and several other things like mount look-up. RCU-based -changes affect only the way the hash chain is protected. For everything -else the dcache_lock must be taken for both traversing as well as -updating. The hash chain updates too take the dcache_lock. -The significant change is the way d_lookup traverses the hash chain, -it doesn't acquire the dcache_lock for this and rely on RCU to -ensure that the dentry has not been *freed*. - - -Dcache locking details ----------------------- +On mount and remount the filesystem is passed a string containing a +comma separated list of mount options. The options can have either of +these forms: + + option + option=value + +The <linux/parser.h> header defines an API that helps parse these +options. There are plenty of examples on how to use it in existing +filesystems. + +Showing options +--------------- + +If a filesystem accepts mount options, it must define show_options() +to show all the currently active options. The rules are: + + - options MUST be shown which are not default or their values differ + from the default + + - options MAY be shown which are enabled by default or have their + default value + +Options used only internally between a mount helper and the kernel +(such as file descriptors), or which only have an effect during the +mounting (such as ones controlling the creation of a journal) are exempt +from the above rules. + +The underlying reason for the above rules is to make sure, that a +mount can be accurately replicated (e.g. umounting and mounting again) +based on the information found in /proc/mounts. + +A simple method of saving options at mount/remount time and showing +them is provided with the save_mount_options() and +generic_show_options() helper functions. Please note, that using +these may have drawbacks. For more info see header comments for these +functions in fs/namespace.c. + +Resources +========= + +(Note some of these resources are not up-to-date with the latest kernel + version.) + +Creating Linux virtual filesystems. 2002 + <http://lwn.net/Articles/13325/> + +The Linux Virtual File-system Layer by Neil Brown. 1999 + <http://www.cse.unsw.edu.au/~neilb/oss/linux-commentary/vfs.html> + +A tour of the Linux VFS by Michael K. Johnson. 1996 + <http://www.tldp.org/LDP/khg/HyperNews/get/fs/vfstour.html> -For many multi-user workloads, open() and stat() on files are -very frequently occurring operations. Both involve walking -of path names to find the dentry corresponding to the -concerned file. In 2.4 kernel, dcache_lock was held -during look-up of each path component. Contention and -cache-line bouncing of this global lock caused significant -scalability problems. With the introduction of RCU -in Linux kernel, this was worked around by making -the look-up of path components during path walking lock-free. - - -Safe lock-free look-up of dcache hash table -=========================================== - -Dcache is a complex data structure with the hash table entries -also linked together in other lists. In 2.4 kernel, dcache_lock -protected all the lists. We applied RCU only on hash chain -walking. The rest of the lists are still protected by dcache_lock. -Some of the important changes are : - -1. The deletion from hash chain is done using hlist_del_rcu() macro which - doesn't initialize next pointer of the deleted dentry and this - allows us to walk safely lock-free while a deletion is happening. - -2. Insertion of a dentry into the hash table is done using - hlist_add_head_rcu() which take care of ordering the writes - - the writes to the dentry must be visible before the dentry - is inserted. This works in conjunction with hlist_for_each_rcu() - while walking the hash chain. The only requirement is that - all initialization to the dentry must be done before hlist_add_head_rcu() - since we don't have dcache_lock protection while traversing - the hash chain. This isn't different from the existing code. - -3. The dentry looked up without holding dcache_lock by cannot be - returned for walking if it is unhashed. It then may have a NULL - d_inode or other bogosity since RCU doesn't protect the other - fields in the dentry. We therefore use a flag DCACHE_UNHASHED to - indicate unhashed dentries and use this in conjunction with a - per-dentry lock (d_lock). Once looked up without the dcache_lock, - we acquire the per-dentry lock (d_lock) and check if the - dentry is unhashed. If so, the look-up is failed. If not, the - reference count of the dentry is increased and the dentry is returned. - -4. Once a dentry is looked up, it must be ensured during the path - walk for that component it doesn't go away. In pre-2.5.10 code, - this was done holding a reference to the dentry. dcache_rcu does - the same. In some sense, dcache_rcu path walking looks like - the pre-2.5.10 version. - -5. All dentry hash chain updates must take the dcache_lock as well as - the per-dentry lock in that order. dput() does this to ensure - that a dentry that has just been looked up in another CPU - doesn't get deleted before dget() can be done on it. - -6. There are several ways to do reference counting of RCU protected - objects. One such example is in ipv4 route cache where - deferred freeing (using call_rcu()) is done as soon as - the reference count goes to zero. This cannot be done in - the case of dentries because tearing down of dentries - require blocking (dentry_iput()) which isn't supported from - RCU callbacks. Instead, tearing down of dentries happen - synchronously in dput(), but actual freeing happens later - when RCU grace period is over. This allows safe lock-free - walking of the hash chains, but a matched dentry may have - been partially torn down. The checking of DCACHE_UNHASHED - flag with d_lock held detects such dentries and prevents - them from being returned from look-up. - - -Maintaining POSIX rename semantics -================================== - -Since look-up of dentries is lock-free, it can race against -a concurrent rename operation. For example, during rename -of file A to B, look-up of either A or B must succeed. -So, if look-up of B happens after A has been removed from the -hash chain but not added to the new hash chain, it may fail. -Also, a comparison while the name is being written concurrently -by a rename may result in false positive matches violating -rename semantics. Issues related to race with rename are -handled as described below : - -1. Look-up can be done in two ways - d_lookup() which is safe - from simultaneous renames and __d_lookup() which is not. - If __d_lookup() fails, it must be followed up by a d_lookup() - to correctly determine whether a dentry is in the hash table - or not. d_lookup() protects look-ups using a sequence - lock (rename_lock). - -2. The name associated with a dentry (d_name) may be changed if - a rename is allowed to happen simultaneously. To avoid memcmp() - in __d_lookup() go out of bounds due to a rename and false - positive comparison, the name comparison is done while holding the - per-dentry lock. This prevents concurrent renames during this - operation. - -3. Hash table walking during look-up may move to a different bucket as - the current dentry is moved to a different bucket due to rename. - But we use hlists in dcache hash table and they are null-terminated. - So, even if a dentry moves to a different bucket, hash chain - walk will terminate. [with a list_head list, it may not since - termination is when the list_head in the original bucket is reached]. - Since we redo the d_parent check and compare name while holding - d_lock, lock-free look-up will not race against d_move(). - -4. There can be a theoretical race when a dentry keeps coming back - to original bucket due to double moves. Due to this look-up may - consider that it has never moved and can end up in a infinite loop. - But this is not any worse that theoretical livelocks we already - have in the kernel. - - -Important guidelines for filesystem developers related to dcache_rcu -==================================================================== - -1. Existing dcache interfaces (pre-2.5.62) exported to filesystem - don't change. Only dcache internal implementation changes. However - filesystems *must not* delete from the dentry hash chains directly - using the list macros like allowed earlier. They must use dcache - APIs like d_drop() or __d_drop() depending on the situation. - -2. d_flags is now protected by a per-dentry lock (d_lock). All - access to d_flags must be protected by it. - -3. For a hashed dentry, checking of d_count needs to be protected - by d_lock. - - -Papers and other documentation on dcache locking -================================================ - -1. Scaling dcache with RCU (http://linuxjournal.com/article.php?sid=7124). - -2. http://lse.sourceforge.net/locking/dcache/dcache.html +A small trail through the Linux kernel by Andries Brouwer. 2001 + <http://www.win.tue.nl/~aeb/linux/vfs/trail.html> diff --git a/Documentation/filesystems/xfs-delayed-logging-design.txt b/Documentation/filesystems/xfs-delayed-logging-design.txt new file mode 100644 index 00000000000..2ce36439c09 --- /dev/null +++ b/Documentation/filesystems/xfs-delayed-logging-design.txt @@ -0,0 +1,793 @@ +XFS Delayed Logging Design +-------------------------- + +Introduction to Re-logging in XFS +--------------------------------- + +XFS logging is a combination of logical and physical logging. Some objects, +such as inodes and dquots, are logged in logical format where the details +logged are made up of the changes to in-core structures rather than on-disk +structures. Other objects - typically buffers - have their physical changes +logged. The reason for these differences is to reduce the amount of log space +required for objects that are frequently logged. Some parts of inodes are more +frequently logged than others, and inodes are typically more frequently logged +than any other object (except maybe the superblock buffer) so keeping the +amount of metadata logged low is of prime importance. + +The reason that this is such a concern is that XFS allows multiple separate +modifications to a single object to be carried in the log at any given time. +This allows the log to avoid needing to flush each change to disk before +recording a new change to the object. XFS does this via a method called +"re-logging". Conceptually, this is quite simple - all it requires is that any +new change to the object is recorded with a *new copy* of all the existing +changes in the new transaction that is written to the log. + +That is, if we have a sequence of changes A through to F, and the object was +written to disk after change D, we would see in the log the following series +of transactions, their contents and the log sequence number (LSN) of the +transaction: + + Transaction Contents LSN + A A X + B A+B X+n + C A+B+C X+n+m + D A+B+C+D X+n+m+o + <object written to disk> + E E Y (> X+n+m+o) + F E+F YÙ+p + +In other words, each time an object is relogged, the new transaction contains +the aggregation of all the previous changes currently held only in the log. + +This relogging technique also allows objects to be moved forward in the log so +that an object being relogged does not prevent the tail of the log from ever +moving forward. This can be seen in the table above by the changing +(increasing) LSN of each subsequent transaction - the LSN is effectively a +direct encoding of the location in the log of the transaction. + +This relogging is also used to implement long-running, multiple-commit +transactions. These transaction are known as rolling transactions, and require +a special log reservation known as a permanent transaction reservation. A +typical example of a rolling transaction is the removal of extents from an +inode which can only be done at a rate of two extents per transaction because +of reservation size limitations. Hence a rolling extent removal transaction +keeps relogging the inode and btree buffers as they get modified in each +removal operation. This keeps them moving forward in the log as the operation +progresses, ensuring that current operation never gets blocked by itself if the +log wraps around. + +Hence it can be seen that the relogging operation is fundamental to the correct +working of the XFS journalling subsystem. From the above description, most +people should be able to see why the XFS metadata operations writes so much to +the log - repeated operations to the same objects write the same changes to +the log over and over again. Worse is the fact that objects tend to get +dirtier as they get relogged, so each subsequent transaction is writing more +metadata into the log. + +Another feature of the XFS transaction subsystem is that most transactions are +asynchronous. That is, they don't commit to disk until either a log buffer is +filled (a log buffer can hold multiple transactions) or a synchronous operation +forces the log buffers holding the transactions to disk. This means that XFS is +doing aggregation of transactions in memory - batching them, if you like - to +minimise the impact of the log IO on transaction throughput. + +The limitation on asynchronous transaction throughput is the number and size of +log buffers made available by the log manager. By default there are 8 log +buffers available and the size of each is 32kB - the size can be increased up +to 256kB by use of a mount option. + +Effectively, this gives us the maximum bound of outstanding metadata changes +that can be made to the filesystem at any point in time - if all the log +buffers are full and under IO, then no more transactions can be committed until +the current batch completes. It is now common for a single current CPU core to +be to able to issue enough transactions to keep the log buffers full and under +IO permanently. Hence the XFS journalling subsystem can be considered to be IO +bound. + +Delayed Logging: Concepts +------------------------- + +The key thing to note about the asynchronous logging combined with the +relogging technique XFS uses is that we can be relogging changed objects +multiple times before they are committed to disk in the log buffers. If we +return to the previous relogging example, it is entirely possible that +transactions A through D are committed to disk in the same log buffer. + +That is, a single log buffer may contain multiple copies of the same object, +but only one of those copies needs to be there - the last one "D", as it +contains all the changes from the previous changes. In other words, we have one +necessary copy in the log buffer, and three stale copies that are simply +wasting space. When we are doing repeated operations on the same set of +objects, these "stale objects" can be over 90% of the space used in the log +buffers. It is clear that reducing the number of stale objects written to the +log would greatly reduce the amount of metadata we write to the log, and this +is the fundamental goal of delayed logging. + +From a conceptual point of view, XFS is already doing relogging in memory (where +memory == log buffer), only it is doing it extremely inefficiently. It is using +logical to physical formatting to do the relogging because there is no +infrastructure to keep track of logical changes in memory prior to physically +formatting the changes in a transaction to the log buffer. Hence we cannot avoid +accumulating stale objects in the log buffers. + +Delayed logging is the name we've given to keeping and tracking transactional +changes to objects in memory outside the log buffer infrastructure. Because of +the relogging concept fundamental to the XFS journalling subsystem, this is +actually relatively easy to do - all the changes to logged items are already +tracked in the current infrastructure. The big problem is how to accumulate +them and get them to the log in a consistent, recoverable manner. +Describing the problems and how they have been solved is the focus of this +document. + +One of the key changes that delayed logging makes to the operation of the +journalling subsystem is that it disassociates the amount of outstanding +metadata changes from the size and number of log buffers available. In other +words, instead of there only being a maximum of 2MB of transaction changes not +written to the log at any point in time, there may be a much greater amount +being accumulated in memory. Hence the potential for loss of metadata on a +crash is much greater than for the existing logging mechanism. + +It should be noted that this does not change the guarantee that log recovery +will result in a consistent filesystem. What it does mean is that as far as the +recovered filesystem is concerned, there may be many thousands of transactions +that simply did not occur as a result of the crash. This makes it even more +important that applications that care about their data use fsync() where they +need to ensure application level data integrity is maintained. + +It should be noted that delayed logging is not an innovative new concept that +warrants rigorous proofs to determine whether it is correct or not. The method +of accumulating changes in memory for some period before writing them to the +log is used effectively in many filesystems including ext3 and ext4. Hence +no time is spent in this document trying to convince the reader that the +concept is sound. Instead it is simply considered a "solved problem" and as +such implementing it in XFS is purely an exercise in software engineering. + +The fundamental requirements for delayed logging in XFS are simple: + + 1. Reduce the amount of metadata written to the log by at least + an order of magnitude. + 2. Supply sufficient statistics to validate Requirement #1. + 3. Supply sufficient new tracing infrastructure to be able to debug + problems with the new code. + 4. No on-disk format change (metadata or log format). + 5. Enable and disable with a mount option. + 6. No performance regressions for synchronous transaction workloads. + +Delayed Logging: Design +----------------------- + +Storing Changes + +The problem with accumulating changes at a logical level (i.e. just using the +existing log item dirty region tracking) is that when it comes to writing the +changes to the log buffers, we need to ensure that the object we are formatting +is not changing while we do this. This requires locking the object to prevent +concurrent modification. Hence flushing the logical changes to the log would +require us to lock every object, format them, and then unlock them again. + +This introduces lots of scope for deadlocks with transactions that are already +running. For example, a transaction has object A locked and modified, but needs +the delayed logging tracking lock to commit the transaction. However, the +flushing thread has the delayed logging tracking lock already held, and is +trying to get the lock on object A to flush it to the log buffer. This appears +to be an unsolvable deadlock condition, and it was solving this problem that +was the barrier to implementing delayed logging for so long. + +The solution is relatively simple - it just took a long time to recognise it. +Put simply, the current logging code formats the changes to each item into an +vector array that points to the changed regions in the item. The log write code +simply copies the memory these vectors point to into the log buffer during +transaction commit while the item is locked in the transaction. Instead of +using the log buffer as the destination of the formatting code, we can use an +allocated memory buffer big enough to fit the formatted vector. + +If we then copy the vector into the memory buffer and rewrite the vector to +point to the memory buffer rather than the object itself, we now have a copy of +the changes in a format that is compatible with the log buffer writing code. +that does not require us to lock the item to access. This formatting and +rewriting can all be done while the object is locked during transaction commit, +resulting in a vector that is transactionally consistent and can be accessed +without needing to lock the owning item. + +Hence we avoid the need to lock items when we need to flush outstanding +asynchronous transactions to the log. The differences between the existing +formatting method and the delayed logging formatting can be seen in the +diagram below. + +Current format log vector: + +Object +---------------------------------------------+ +Vector 1 +----+ +Vector 2 +----+ +Vector 3 +----------+ + +After formatting: + +Log Buffer +-V1-+-V2-+----V3----+ + +Delayed logging vector: + +Object +---------------------------------------------+ +Vector 1 +----+ +Vector 2 +----+ +Vector 3 +----------+ + +After formatting: + +Memory Buffer +-V1-+-V2-+----V3----+ +Vector 1 +----+ +Vector 2 +----+ +Vector 3 +----------+ + +The memory buffer and associated vector need to be passed as a single object, +but still need to be associated with the parent object so if the object is +relogged we can replace the current memory buffer with a new memory buffer that +contains the latest changes. + +The reason for keeping the vector around after we've formatted the memory +buffer is to support splitting vectors across log buffer boundaries correctly. +If we don't keep the vector around, we do not know where the region boundaries +are in the item, so we'd need a new encapsulation method for regions in the log +buffer writing (i.e. double encapsulation). This would be an on-disk format +change and as such is not desirable. It also means we'd have to write the log +region headers in the formatting stage, which is problematic as there is per +region state that needs to be placed into the headers during the log write. + +Hence we need to keep the vector, but by attaching the memory buffer to it and +rewriting the vector addresses to point at the memory buffer we end up with a +self-describing object that can be passed to the log buffer write code to be +handled in exactly the same manner as the existing log vectors are handled. +Hence we avoid needing a new on-disk format to handle items that have been +relogged in memory. + + +Tracking Changes + +Now that we can record transactional changes in memory in a form that allows +them to be used without limitations, we need to be able to track and accumulate +them so that they can be written to the log at some later point in time. The +log item is the natural place to store this vector and buffer, and also makes sense +to be the object that is used to track committed objects as it will always +exist once the object has been included in a transaction. + +The log item is already used to track the log items that have been written to +the log but not yet written to disk. Such log items are considered "active" +and as such are stored in the Active Item List (AIL) which is a LSN-ordered +double linked list. Items are inserted into this list during log buffer IO +completion, after which they are unpinned and can be written to disk. An object +that is in the AIL can be relogged, which causes the object to be pinned again +and then moved forward in the AIL when the log buffer IO completes for that +transaction. + +Essentially, this shows that an item that is in the AIL can still be modified +and relogged, so any tracking must be separate to the AIL infrastructure. As +such, we cannot reuse the AIL list pointers for tracking committed items, nor +can we store state in any field that is protected by the AIL lock. Hence the +committed item tracking needs it's own locks, lists and state fields in the log +item. + +Similar to the AIL, tracking of committed items is done through a new list +called the Committed Item List (CIL). The list tracks log items that have been +committed and have formatted memory buffers attached to them. It tracks objects +in transaction commit order, so when an object is relogged it is removed from +it's place in the list and re-inserted at the tail. This is entirely arbitrary +and done to make it easy for debugging - the last items in the list are the +ones that are most recently modified. Ordering of the CIL is not necessary for +transactional integrity (as discussed in the next section) so the ordering is +done for convenience/sanity of the developers. + + +Delayed Logging: Checkpoints + +When we have a log synchronisation event, commonly known as a "log force", +all the items in the CIL must be written into the log via the log buffers. +We need to write these items in the order that they exist in the CIL, and they +need to be written as an atomic transaction. The need for all the objects to be +written as an atomic transaction comes from the requirements of relogging and +log replay - all the changes in all the objects in a given transaction must +either be completely replayed during log recovery, or not replayed at all. If +a transaction is not replayed because it is not complete in the log, then +no later transactions should be replayed, either. + +To fulfill this requirement, we need to write the entire CIL in a single log +transaction. Fortunately, the XFS log code has no fixed limit on the size of a +transaction, nor does the log replay code. The only fundamental limit is that +the transaction cannot be larger than just under half the size of the log. The +reason for this limit is that to find the head and tail of the log, there must +be at least one complete transaction in the log at any given time. If a +transaction is larger than half the log, then there is the possibility that a +crash during the write of a such a transaction could partially overwrite the +only complete previous transaction in the log. This will result in a recovery +failure and an inconsistent filesystem and hence we must enforce the maximum +size of a checkpoint to be slightly less than a half the log. + +Apart from this size requirement, a checkpoint transaction looks no different +to any other transaction - it contains a transaction header, a series of +formatted log items and a commit record at the tail. From a recovery +perspective, the checkpoint transaction is also no different - just a lot +bigger with a lot more items in it. The worst case effect of this is that we +might need to tune the recovery transaction object hash size. + +Because the checkpoint is just another transaction and all the changes to log +items are stored as log vectors, we can use the existing log buffer writing +code to write the changes into the log. To do this efficiently, we need to +minimise the time we hold the CIL locked while writing the checkpoint +transaction. The current log write code enables us to do this easily with the +way it separates the writing of the transaction contents (the log vectors) from +the transaction commit record, but tracking this requires us to have a +per-checkpoint context that travels through the log write process through to +checkpoint completion. + +Hence a checkpoint has a context that tracks the state of the current +checkpoint from initiation to checkpoint completion. A new context is initiated +at the same time a checkpoint transaction is started. That is, when we remove +all the current items from the CIL during a checkpoint operation, we move all +those changes into the current checkpoint context. We then initialise a new +context and attach that to the CIL for aggregation of new transactions. + +This allows us to unlock the CIL immediately after transfer of all the +committed items and effectively allow new transactions to be issued while we +are formatting the checkpoint into the log. It also allows concurrent +checkpoints to be written into the log buffers in the case of log force heavy +workloads, just like the existing transaction commit code does. This, however, +requires that we strictly order the commit records in the log so that +checkpoint sequence order is maintained during log replay. + +To ensure that we can be writing an item into a checkpoint transaction at +the same time another transaction modifies the item and inserts the log item +into the new CIL, then checkpoint transaction commit code cannot use log items +to store the list of log vectors that need to be written into the transaction. +Hence log vectors need to be able to be chained together to allow them to be +detached from the log items. That is, when the CIL is flushed the memory +buffer and log vector attached to each log item needs to be attached to the +checkpoint context so that the log item can be released. In diagrammatic form, +the CIL would look like this before the flush: + + CIL Head + | + V + Log Item <-> log vector 1 -> memory buffer + | -> vector array + V + Log Item <-> log vector 2 -> memory buffer + | -> vector array + V + ...... + | + V + Log Item <-> log vector N-1 -> memory buffer + | -> vector array + V + Log Item <-> log vector N -> memory buffer + -> vector array + +And after the flush the CIL head is empty, and the checkpoint context log +vector list would look like: + + Checkpoint Context + | + V + log vector 1 -> memory buffer + | -> vector array + | -> Log Item + V + log vector 2 -> memory buffer + | -> vector array + | -> Log Item + V + ...... + | + V + log vector N-1 -> memory buffer + | -> vector array + | -> Log Item + V + log vector N -> memory buffer + -> vector array + -> Log Item + +Once this transfer is done, the CIL can be unlocked and new transactions can +start, while the checkpoint flush code works over the log vector chain to +commit the checkpoint. + +Once the checkpoint is written into the log buffers, the checkpoint context is +attached to the log buffer that the commit record was written to along with a +completion callback. Log IO completion will call that callback, which can then +run transaction committed processing for the log items (i.e. insert into AIL +and unpin) in the log vector chain and then free the log vector chain and +checkpoint context. + +Discussion Point: I am uncertain as to whether the log item is the most +efficient way to track vectors, even though it seems like the natural way to do +it. The fact that we walk the log items (in the CIL) just to chain the log +vectors and break the link between the log item and the log vector means that +we take a cache line hit for the log item list modification, then another for +the log vector chaining. If we track by the log vectors, then we only need to +break the link between the log item and the log vector, which means we should +dirty only the log item cachelines. Normally I wouldn't be concerned about one +vs two dirty cachelines except for the fact I've seen upwards of 80,000 log +vectors in one checkpoint transaction. I'd guess this is a "measure and +compare" situation that can be done after a working and reviewed implementation +is in the dev tree.... + +Delayed Logging: Checkpoint Sequencing + +One of the key aspects of the XFS transaction subsystem is that it tags +committed transactions with the log sequence number of the transaction commit. +This allows transactions to be issued asynchronously even though there may be +future operations that cannot be completed until that transaction is fully +committed to the log. In the rare case that a dependent operation occurs (e.g. +re-using a freed metadata extent for a data extent), a special, optimised log +force can be issued to force the dependent transaction to disk immediately. + +To do this, transactions need to record the LSN of the commit record of the +transaction. This LSN comes directly from the log buffer the transaction is +written into. While this works just fine for the existing transaction +mechanism, it does not work for delayed logging because transactions are not +written directly into the log buffers. Hence some other method of sequencing +transactions is required. + +As discussed in the checkpoint section, delayed logging uses per-checkpoint +contexts, and as such it is simple to assign a sequence number to each +checkpoint. Because the switching of checkpoint contexts must be done +atomically, it is simple to ensure that each new context has a monotonically +increasing sequence number assigned to it without the need for an external +atomic counter - we can just take the current context sequence number and add +one to it for the new context. + +Then, instead of assigning a log buffer LSN to the transaction commit LSN +during the commit, we can assign the current checkpoint sequence. This allows +operations that track transactions that have not yet completed know what +checkpoint sequence needs to be committed before they can continue. As a +result, the code that forces the log to a specific LSN now needs to ensure that +the log forces to a specific checkpoint. + +To ensure that we can do this, we need to track all the checkpoint contexts +that are currently committing to the log. When we flush a checkpoint, the +context gets added to a "committing" list which can be searched. When a +checkpoint commit completes, it is removed from the committing list. Because +the checkpoint context records the LSN of the commit record for the checkpoint, +we can also wait on the log buffer that contains the commit record, thereby +using the existing log force mechanisms to execute synchronous forces. + +It should be noted that the synchronous forces may need to be extended with +mitigation algorithms similar to the current log buffer code to allow +aggregation of multiple synchronous transactions if there are already +synchronous transactions being flushed. Investigation of the performance of the +current design is needed before making any decisions here. + +The main concern with log forces is to ensure that all the previous checkpoints +are also committed to disk before the one we need to wait for. Therefore we +need to check that all the prior contexts in the committing list are also +complete before waiting on the one we need to complete. We do this +synchronisation in the log force code so that we don't need to wait anywhere +else for such serialisation - it only matters when we do a log force. + +The only remaining complexity is that a log force now also has to handle the +case where the forcing sequence number is the same as the current context. That +is, we need to flush the CIL and potentially wait for it to complete. This is a +simple addition to the existing log forcing code to check the sequence numbers +and push if required. Indeed, placing the current sequence checkpoint flush in +the log force code enables the current mechanism for issuing synchronous +transactions to remain untouched (i.e. commit an asynchronous transaction, then +force the log at the LSN of that transaction) and so the higher level code +behaves the same regardless of whether delayed logging is being used or not. + +Delayed Logging: Checkpoint Log Space Accounting + +The big issue for a checkpoint transaction is the log space reservation for the +transaction. We don't know how big a checkpoint transaction is going to be +ahead of time, nor how many log buffers it will take to write out, nor the +number of split log vector regions are going to be used. We can track the +amount of log space required as we add items to the commit item list, but we +still need to reserve the space in the log for the checkpoint. + +A typical transaction reserves enough space in the log for the worst case space +usage of the transaction. The reservation accounts for log record headers, +transaction and region headers, headers for split regions, buffer tail padding, +etc. as well as the actual space for all the changed metadata in the +transaction. While some of this is fixed overhead, much of it is dependent on +the size of the transaction and the number of regions being logged (the number +of log vectors in the transaction). + +An example of the differences would be logging directory changes versus logging +inode changes. If you modify lots of inode cores (e.g. chmod -R g+w *), then +there are lots of transactions that only contain an inode core and an inode log +format structure. That is, two vectors totaling roughly 150 bytes. If we modify +10,000 inodes, we have about 1.5MB of metadata to write in 20,000 vectors. Each +vector is 12 bytes, so the total to be logged is approximately 1.75MB. In +comparison, if we are logging full directory buffers, they are typically 4KB +each, so we in 1.5MB of directory buffers we'd have roughly 400 buffers and a +buffer format structure for each buffer - roughly 800 vectors or 1.51MB total +space. From this, it should be obvious that a static log space reservation is +not particularly flexible and is difficult to select the "optimal value" for +all workloads. + +Further, if we are going to use a static reservation, which bit of the entire +reservation does it cover? We account for space used by the transaction +reservation by tracking the space currently used by the object in the CIL and +then calculating the increase or decrease in space used as the object is +relogged. This allows for a checkpoint reservation to only have to account for +log buffer metadata used such as log header records. + +However, even using a static reservation for just the log metadata is +problematic. Typically log record headers use at least 16KB of log space per +1MB of log space consumed (512 bytes per 32k) and the reservation needs to be +large enough to handle arbitrary sized checkpoint transactions. This +reservation needs to be made before the checkpoint is started, and we need to +be able to reserve the space without sleeping. For a 8MB checkpoint, we need a +reservation of around 150KB, which is a non-trivial amount of space. + +A static reservation needs to manipulate the log grant counters - we can take a +permanent reservation on the space, but we still need to make sure we refresh +the write reservation (the actual space available to the transaction) after +every checkpoint transaction completion. Unfortunately, if this space is not +available when required, then the regrant code will sleep waiting for it. + +The problem with this is that it can lead to deadlocks as we may need to commit +checkpoints to be able to free up log space (refer back to the description of +rolling transactions for an example of this). Hence we *must* always have +space available in the log if we are to use static reservations, and that is +very difficult and complex to arrange. It is possible to do, but there is a +simpler way. + +The simpler way of doing this is tracking the entire log space used by the +items in the CIL and using this to dynamically calculate the amount of log +space required by the log metadata. If this log metadata space changes as a +result of a transaction commit inserting a new memory buffer into the CIL, then +the difference in space required is removed from the transaction that causes +the change. Transactions at this level will *always* have enough space +available in their reservation for this as they have already reserved the +maximal amount of log metadata space they require, and such a delta reservation +will always be less than or equal to the maximal amount in the reservation. + +Hence we can grow the checkpoint transaction reservation dynamically as items +are added to the CIL and avoid the need for reserving and regranting log space +up front. This avoids deadlocks and removes a blocking point from the +checkpoint flush code. + +As mentioned early, transactions can't grow to more than half the size of the +log. Hence as part of the reservation growing, we need to also check the size +of the reservation against the maximum allowed transaction size. If we reach +the maximum threshold, we need to push the CIL to the log. This is effectively +a "background flush" and is done on demand. This is identical to +a CIL push triggered by a log force, only that there is no waiting for the +checkpoint commit to complete. This background push is checked and executed by +transaction commit code. + +If the transaction subsystem goes idle while we still have items in the CIL, +they will be flushed by the periodic log force issued by the xfssyncd. This log +force will push the CIL to disk, and if the transaction subsystem stays idle, +allow the idle log to be covered (effectively marked clean) in exactly the same +manner that is done for the existing logging method. A discussion point is +whether this log force needs to be done more frequently than the current rate +which is once every 30s. + + +Delayed Logging: Log Item Pinning + +Currently log items are pinned during transaction commit while the items are +still locked. This happens just after the items are formatted, though it could +be done any time before the items are unlocked. The result of this mechanism is +that items get pinned once for every transaction that is committed to the log +buffers. Hence items that are relogged in the log buffers will have a pin count +for every outstanding transaction they were dirtied in. When each of these +transactions is completed, they will unpin the item once. As a result, the item +only becomes unpinned when all the transactions complete and there are no +pending transactions. Thus the pinning and unpinning of a log item is symmetric +as there is a 1:1 relationship with transaction commit and log item completion. + +For delayed logging, however, we have an asymmetric transaction commit to +completion relationship. Every time an object is relogged in the CIL it goes +through the commit process without a corresponding completion being registered. +That is, we now have a many-to-one relationship between transaction commit and +log item completion. The result of this is that pinning and unpinning of the +log items becomes unbalanced if we retain the "pin on transaction commit, unpin +on transaction completion" model. + +To keep pin/unpin symmetry, the algorithm needs to change to a "pin on +insertion into the CIL, unpin on checkpoint completion". In other words, the +pinning and unpinning becomes symmetric around a checkpoint context. We have to +pin the object the first time it is inserted into the CIL - if it is already in +the CIL during a transaction commit, then we do not pin it again. Because there +can be multiple outstanding checkpoint contexts, we can still see elevated pin +counts, but as each checkpoint completes the pin count will retain the correct +value according to it's context. + +Just to make matters more slightly more complex, this checkpoint level context +for the pin count means that the pinning of an item must take place under the +CIL commit/flush lock. If we pin the object outside this lock, we cannot +guarantee which context the pin count is associated with. This is because of +the fact pinning the item is dependent on whether the item is present in the +current CIL or not. If we don't pin the CIL first before we check and pin the +object, we have a race with CIL being flushed between the check and the pin +(or not pinning, as the case may be). Hence we must hold the CIL flush/commit +lock to guarantee that we pin the items correctly. + +Delayed Logging: Concurrent Scalability + +A fundamental requirement for the CIL is that accesses through transaction +commits must scale to many concurrent commits. The current transaction commit +code does not break down even when there are transactions coming from 2048 +processors at once. The current transaction code does not go any faster than if +there was only one CPU using it, but it does not slow down either. + +As a result, the delayed logging transaction commit code needs to be designed +for concurrency from the ground up. It is obvious that there are serialisation +points in the design - the three important ones are: + + 1. Locking out new transaction commits while flushing the CIL + 2. Adding items to the CIL and updating item space accounting + 3. Checkpoint commit ordering + +Looking at the transaction commit and CIL flushing interactions, it is clear +that we have a many-to-one interaction here. That is, the only restriction on +the number of concurrent transactions that can be trying to commit at once is +the amount of space available in the log for their reservations. The practical +limit here is in the order of several hundred concurrent transactions for a +128MB log, which means that it is generally one per CPU in a machine. + +The amount of time a transaction commit needs to hold out a flush is a +relatively long period of time - the pinning of log items needs to be done +while we are holding out a CIL flush, so at the moment that means it is held +across the formatting of the objects into memory buffers (i.e. while memcpy()s +are in progress). Ultimately a two pass algorithm where the formatting is done +separately to the pinning of objects could be used to reduce the hold time of +the transaction commit side. + +Because of the number of potential transaction commit side holders, the lock +really needs to be a sleeping lock - if the CIL flush takes the lock, we do not +want every other CPU in the machine spinning on the CIL lock. Given that +flushing the CIL could involve walking a list of tens of thousands of log +items, it will get held for a significant time and so spin contention is a +significant concern. Preventing lots of CPUs spinning doing nothing is the +main reason for choosing a sleeping lock even though nothing in either the +transaction commit or CIL flush side sleeps with the lock held. + +It should also be noted that CIL flushing is also a relatively rare operation +compared to transaction commit for asynchronous transaction workloads - only +time will tell if using a read-write semaphore for exclusion will limit +transaction commit concurrency due to cache line bouncing of the lock on the +read side. + +The second serialisation point is on the transaction commit side where items +are inserted into the CIL. Because transactions can enter this code +concurrently, the CIL needs to be protected separately from the above +commit/flush exclusion. It also needs to be an exclusive lock but it is only +held for a very short time and so a spin lock is appropriate here. It is +possible that this lock will become a contention point, but given the short +hold time once per transaction I think that contention is unlikely. + +The final serialisation point is the checkpoint commit record ordering code +that is run as part of the checkpoint commit and log force sequencing. The code +path that triggers a CIL flush (i.e. whatever triggers the log force) will enter +an ordering loop after writing all the log vectors into the log buffers but +before writing the commit record. This loop walks the list of committing +checkpoints and needs to block waiting for checkpoints to complete their commit +record write. As a result it needs a lock and a wait variable. Log force +sequencing also requires the same lock, list walk, and blocking mechanism to +ensure completion of checkpoints. + +These two sequencing operations can use the mechanism even though the +events they are waiting for are different. The checkpoint commit record +sequencing needs to wait until checkpoint contexts contain a commit LSN +(obtained through completion of a commit record write) while log force +sequencing needs to wait until previous checkpoint contexts are removed from +the committing list (i.e. they've completed). A simple wait variable and +broadcast wakeups (thundering herds) has been used to implement these two +serialisation queues. They use the same lock as the CIL, too. If we see too +much contention on the CIL lock, or too many context switches as a result of +the broadcast wakeups these operations can be put under a new spinlock and +given separate wait lists to reduce lock contention and the number of processes +woken by the wrong event. + + +Lifecycle Changes + +The existing log item life cycle is as follows: + + 1. Transaction allocate + 2. Transaction reserve + 3. Lock item + 4. Join item to transaction + If not already attached, + Allocate log item + Attach log item to owner item + Attach log item to transaction + 5. Modify item + Record modifications in log item + 6. Transaction commit + Pin item in memory + Format item into log buffer + Write commit LSN into transaction + Unlock item + Attach transaction to log buffer + + <log buffer IO dispatched> + <log buffer IO completes> + + 7. Transaction completion + Mark log item committed + Insert log item into AIL + Write commit LSN into log item + Unpin log item + 8. AIL traversal + Lock item + Mark log item clean + Flush item to disk + + <item IO completion> + + 9. Log item removed from AIL + Moves log tail + Item unlocked + +Essentially, steps 1-6 operate independently from step 7, which is also +independent of steps 8-9. An item can be locked in steps 1-6 or steps 8-9 +at the same time step 7 is occurring, but only steps 1-6 or 8-9 can occur +at the same time. If the log item is in the AIL or between steps 6 and 7 +and steps 1-6 are re-entered, then the item is relogged. Only when steps 8-9 +are entered and completed is the object considered clean. + +With delayed logging, there are new steps inserted into the life cycle: + + 1. Transaction allocate + 2. Transaction reserve + 3. Lock item + 4. Join item to transaction + If not already attached, + Allocate log item + Attach log item to owner item + Attach log item to transaction + 5. Modify item + Record modifications in log item + 6. Transaction commit + Pin item in memory if not pinned in CIL + Format item into log vector + buffer + Attach log vector and buffer to log item + Insert log item into CIL + Write CIL context sequence into transaction + Unlock item + + <next log force> + + 7. CIL push + lock CIL flush + Chain log vectors and buffers together + Remove items from CIL + unlock CIL flush + write log vectors into log + sequence commit records + attach checkpoint context to log buffer + + <log buffer IO dispatched> + <log buffer IO completes> + + 8. Checkpoint completion + Mark log item committed + Insert item into AIL + Write commit LSN into log item + Unpin log item + 9. AIL traversal + Lock item + Mark log item clean + Flush item to disk + <item IO completion> + 10. Log item removed from AIL + Moves log tail + Item unlocked + +From this, it can be seen that the only life cycle differences between the two +logging methods are in the middle of the life cycle - they still have the same +beginning and end and execution constraints. The only differences are in the +committing of the log items to the log itself and the completion processing. +Hence delayed logging should not introduce any constraints on log item +behaviour, allocation or freeing that don't already exist. + +As a result of this zero-impact "insertion" of delayed logging infrastructure +and the design of the internal structures to avoid on disk format changes, we +can basically switch between delayed logging and the existing mechanism with a +mount option. Fundamentally, there is no reason why the log manager would not +be able to swap methods automatically and transparently depending on load +characteristics, but this should not be necessary if delayed logging works as +designed. diff --git a/Documentation/filesystems/xfs-self-describing-metadata.txt b/Documentation/filesystems/xfs-self-describing-metadata.txt new file mode 100644 index 00000000000..05aa455163e --- /dev/null +++ b/Documentation/filesystems/xfs-self-describing-metadata.txt @@ -0,0 +1,350 @@ +XFS Self Describing Metadata +---------------------------- + +Introduction +------------ + +The largest scalability problem facing XFS is not one of algorithmic +scalability, but of verification of the filesystem structure. Scalabilty of the +structures and indexes on disk and the algorithms for iterating them are +adequate for supporting PB scale filesystems with billions of inodes, however it +is this very scalability that causes the verification problem. + +Almost all metadata on XFS is dynamically allocated. The only fixed location +metadata is the allocation group headers (SB, AGF, AGFL and AGI), while all +other metadata structures need to be discovered by walking the filesystem +structure in different ways. While this is already done by userspace tools for +validating and repairing the structure, there are limits to what they can +verify, and this in turn limits the supportable size of an XFS filesystem. + +For example, it is entirely possible to manually use xfs_db and a bit of +scripting to analyse the structure of a 100TB filesystem when trying to +determine the root cause of a corruption problem, but it is still mainly a +manual task of verifying that things like single bit errors or misplaced writes +weren't the ultimate cause of a corruption event. It may take a few hours to a +few days to perform such forensic analysis, so for at this scale root cause +analysis is entirely possible. + +However, if we scale the filesystem up to 1PB, we now have 10x as much metadata +to analyse and so that analysis blows out towards weeks/months of forensic work. +Most of the analysis work is slow and tedious, so as the amount of analysis goes +up, the more likely that the cause will be lost in the noise. Hence the primary +concern for supporting PB scale filesystems is minimising the time and effort +required for basic forensic analysis of the filesystem structure. + + +Self Describing Metadata +------------------------ + +One of the problems with the current metadata format is that apart from the +magic number in the metadata block, we have no other way of identifying what it +is supposed to be. We can't even identify if it is the right place. Put simply, +you can't look at a single metadata block in isolation and say "yes, it is +supposed to be there and the contents are valid". + +Hence most of the time spent on forensic analysis is spent doing basic +verification of metadata values, looking for values that are in range (and hence +not detected by automated verification checks) but are not correct. Finding and +understanding how things like cross linked block lists (e.g. sibling +pointers in a btree end up with loops in them) are the key to understanding what +went wrong, but it is impossible to tell what order the blocks were linked into +each other or written to disk after the fact. + +Hence we need to record more information into the metadata to allow us to +quickly determine if the metadata is intact and can be ignored for the purpose +of analysis. We can't protect against every possible type of error, but we can +ensure that common types of errors are easily detectable. Hence the concept of +self describing metadata. + +The first, fundamental requirement of self describing metadata is that the +metadata object contains some form of unique identifier in a well known +location. This allows us to identify the expected contents of the block and +hence parse and verify the metadata object. IF we can't independently identify +the type of metadata in the object, then the metadata doesn't describe itself +very well at all! + +Luckily, almost all XFS metadata has magic numbers embedded already - only the +AGFL, remote symlinks and remote attribute blocks do not contain identifying +magic numbers. Hence we can change the on-disk format of all these objects to +add more identifying information and detect this simply by changing the magic +numbers in the metadata objects. That is, if it has the current magic number, +the metadata isn't self identifying. If it contains a new magic number, it is +self identifying and we can do much more expansive automated verification of the +metadata object at runtime, during forensic analysis or repair. + +As a primary concern, self describing metadata needs some form of overall +integrity checking. We cannot trust the metadata if we cannot verify that it has +not been changed as a result of external influences. Hence we need some form of +integrity check, and this is done by adding CRC32c validation to the metadata +block. If we can verify the block contains the metadata it was intended to +contain, a large amount of the manual verification work can be skipped. + +CRC32c was selected as metadata cannot be more than 64k in length in XFS and +hence a 32 bit CRC is more than sufficient to detect multi-bit errors in +metadata blocks. CRC32c is also now hardware accelerated on common CPUs so it is +fast. So while CRC32c is not the strongest of possible integrity checks that +could be used, it is more than sufficient for our needs and has relatively +little overhead. Adding support for larger integrity fields and/or algorithms +does really provide any extra value over CRC32c, but it does add a lot of +complexity and so there is no provision for changing the integrity checking +mechanism. + +Self describing metadata needs to contain enough information so that the +metadata block can be verified as being in the correct place without needing to +look at any other metadata. This means it needs to contain location information. +Just adding a block number to the metadata is not sufficient to protect against +mis-directed writes - a write might be misdirected to the wrong LUN and so be +written to the "correct block" of the wrong filesystem. Hence location +information must contain a filesystem identifier as well as a block number. + +Another key information point in forensic analysis is knowing who the metadata +block belongs to. We already know the type, the location, that it is valid +and/or corrupted, and how long ago that it was last modified. Knowing the owner +of the block is important as it allows us to find other related metadata to +determine the scope of the corruption. For example, if we have a extent btree +object, we don't know what inode it belongs to and hence have to walk the entire +filesystem to find the owner of the block. Worse, the corruption could mean that +no owner can be found (i.e. it's an orphan block), and so without an owner field +in the metadata we have no idea of the scope of the corruption. If we have an +owner field in the metadata object, we can immediately do top down validation to +determine the scope of the problem. + +Different types of metadata have different owner identifiers. For example, +directory, attribute and extent tree blocks are all owned by an inode, whilst +freespace btree blocks are owned by an allocation group. Hence the size and +contents of the owner field are determined by the type of metadata object we are +looking at. The owner information can also identify misplaced writes (e.g. +freespace btree block written to the wrong AG). + +Self describing metadata also needs to contain some indication of when it was +written to the filesystem. One of the key information points when doing forensic +analysis is how recently the block was modified. Correlation of set of corrupted +metadata blocks based on modification times is important as it can indicate +whether the corruptions are related, whether there's been multiple corruption +events that lead to the eventual failure, and even whether there are corruptions +present that the run-time verification is not detecting. + +For example, we can determine whether a metadata object is supposed to be free +space or still allocated if it is still referenced by its owner by looking at +when the free space btree block that contains the block was last written +compared to when the metadata object itself was last written. If the free space +block is more recent than the object and the object's owner, then there is a +very good chance that the block should have been removed from the owner. + +To provide this "written timestamp", each metadata block gets the Log Sequence +Number (LSN) of the most recent transaction it was modified on written into it. +This number will always increase over the life of the filesystem, and the only +thing that resets it is running xfs_repair on the filesystem. Further, by use of +the LSN we can tell if the corrupted metadata all belonged to the same log +checkpoint and hence have some idea of how much modification occurred between +the first and last instance of corrupt metadata on disk and, further, how much +modification occurred between the corruption being written and when it was +detected. + +Runtime Validation +------------------ + +Validation of self-describing metadata takes place at runtime in two places: + + - immediately after a successful read from disk + - immediately prior to write IO submission + +The verification is completely stateless - it is done independently of the +modification process, and seeks only to check that the metadata is what it says +it is and that the metadata fields are within bounds and internally consistent. +As such, we cannot catch all types of corruption that can occur within a block +as there may be certain limitations that operational state enforces of the +metadata, or there may be corruption of interblock relationships (e.g. corrupted +sibling pointer lists). Hence we still need stateful checking in the main code +body, but in general most of the per-field validation is handled by the +verifiers. + +For read verification, the caller needs to specify the expected type of metadata +that it should see, and the IO completion process verifies that the metadata +object matches what was expected. If the verification process fails, then it +marks the object being read as EFSCORRUPTED. The caller needs to catch this +error (same as for IO errors), and if it needs to take special action due to a +verification error it can do so by catching the EFSCORRUPTED error value. If we +need more discrimination of error type at higher levels, we can define new +error numbers for different errors as necessary. + +The first step in read verification is checking the magic number and determining +whether CRC validating is necessary. If it is, the CRC32c is calculated and +compared against the value stored in the object itself. Once this is validated, +further checks are made against the location information, followed by extensive +object specific metadata validation. If any of these checks fail, then the +buffer is considered corrupt and the EFSCORRUPTED error is set appropriately. + +Write verification is the opposite of the read verification - first the object +is extensively verified and if it is OK we then update the LSN from the last +modification made to the object, After this, we calculate the CRC and insert it +into the object. Once this is done the write IO is allowed to continue. If any +error occurs during this process, the buffer is again marked with a EFSCORRUPTED +error for the higher layers to catch. + +Structures +---------- + +A typical on-disk structure needs to contain the following information: + +struct xfs_ondisk_hdr { + __be32 magic; /* magic number */ + __be32 crc; /* CRC, not logged */ + uuid_t uuid; /* filesystem identifier */ + __be64 owner; /* parent object */ + __be64 blkno; /* location on disk */ + __be64 lsn; /* last modification in log, not logged */ +}; + +Depending on the metadata, this information may be part of a header structure +separate to the metadata contents, or may be distributed through an existing +structure. The latter occurs with metadata that already contains some of this +information, such as the superblock and AG headers. + +Other metadata may have different formats for the information, but the same +level of information is generally provided. For example: + + - short btree blocks have a 32 bit owner (ag number) and a 32 bit block + number for location. The two of these combined provide the same + information as @owner and @blkno in eh above structure, but using 8 + bytes less space on disk. + + - directory/attribute node blocks have a 16 bit magic number, and the + header that contains the magic number has other information in it as + well. hence the additional metadata headers change the overall format + of the metadata. + +A typical buffer read verifier is structured as follows: + +#define XFS_FOO_CRC_OFF offsetof(struct xfs_ondisk_hdr, crc) + +static void +xfs_foo_read_verify( + struct xfs_buf *bp) +{ + struct xfs_mount *mp = bp->b_target->bt_mount; + + if ((xfs_sb_version_hascrc(&mp->m_sb) && + !xfs_verify_cksum(bp->b_addr, BBTOB(bp->b_length), + XFS_FOO_CRC_OFF)) || + !xfs_foo_verify(bp)) { + XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp, bp->b_addr); + xfs_buf_ioerror(bp, EFSCORRUPTED); + } +} + +The code ensures that the CRC is only checked if the filesystem has CRCs enabled +by checking the superblock of the feature bit, and then if the CRC verifies OK +(or is not needed) it verifies the actual contents of the block. + +The verifier function will take a couple of different forms, depending on +whether the magic number can be used to determine the format of the block. In +the case it can't, the code is structured as follows: + +static bool +xfs_foo_verify( + struct xfs_buf *bp) +{ + struct xfs_mount *mp = bp->b_target->bt_mount; + struct xfs_ondisk_hdr *hdr = bp->b_addr; + + if (hdr->magic != cpu_to_be32(XFS_FOO_MAGIC)) + return false; + + if (!xfs_sb_version_hascrc(&mp->m_sb)) { + if (!uuid_equal(&hdr->uuid, &mp->m_sb.sb_uuid)) + return false; + if (bp->b_bn != be64_to_cpu(hdr->blkno)) + return false; + if (hdr->owner == 0) + return false; + } + + /* object specific verification checks here */ + + return true; +} + +If there are different magic numbers for the different formats, the verifier +will look like: + +static bool +xfs_foo_verify( + struct xfs_buf *bp) +{ + struct xfs_mount *mp = bp->b_target->bt_mount; + struct xfs_ondisk_hdr *hdr = bp->b_addr; + + if (hdr->magic == cpu_to_be32(XFS_FOO_CRC_MAGIC)) { + if (!uuid_equal(&hdr->uuid, &mp->m_sb.sb_uuid)) + return false; + if (bp->b_bn != be64_to_cpu(hdr->blkno)) + return false; + if (hdr->owner == 0) + return false; + } else if (hdr->magic != cpu_to_be32(XFS_FOO_MAGIC)) + return false; + + /* object specific verification checks here */ + + return true; +} + +Write verifiers are very similar to the read verifiers, they just do things in +the opposite order to the read verifiers. A typical write verifier: + +static void +xfs_foo_write_verify( + struct xfs_buf *bp) +{ + struct xfs_mount *mp = bp->b_target->bt_mount; + struct xfs_buf_log_item *bip = bp->b_fspriv; + + if (!xfs_foo_verify(bp)) { + XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp, bp->b_addr); + xfs_buf_ioerror(bp, EFSCORRUPTED); + return; + } + + if (!xfs_sb_version_hascrc(&mp->m_sb)) + return; + + + if (bip) { + struct xfs_ondisk_hdr *hdr = bp->b_addr; + hdr->lsn = cpu_to_be64(bip->bli_item.li_lsn); + } + xfs_update_cksum(bp->b_addr, BBTOB(bp->b_length), XFS_FOO_CRC_OFF); +} + +This will verify the internal structure of the metadata before we go any +further, detecting corruptions that have occurred as the metadata has been +modified in memory. If the metadata verifies OK, and CRCs are enabled, we then +update the LSN field (when it was last modified) and calculate the CRC on the +metadata. Once this is done, we can issue the IO. + +Inodes and Dquots +----------------- + +Inodes and dquots are special snowflakes. They have per-object CRC and +self-identifiers, but they are packed so that there are multiple objects per +buffer. Hence we do not use per-buffer verifiers to do the work of per-object +verification and CRC calculations. The per-buffer verifiers simply perform basic +identification of the buffer - that they contain inodes or dquots, and that +there are magic numbers in all the expected spots. All further CRC and +verification checks are done when each inode is read from or written back to the +buffer. + +The structure of the verifiers and the identifiers checks is very similar to the +buffer code described above. The only difference is where they are called. For +example, inode read verification is done in xfs_iread() when the inode is first +read out of the buffer and the struct xfs_inode is instantiated. The inode is +already extensively verified during writeback in xfs_iflush_int, so the only +addition here is to add the LSN and CRC to the inode as it is copied back into +the buffer. + +XXX: inode unlinked list modification doesn't recalculate the inode CRC! None of +the unlinked list modifications check or update CRCs, neither during unlink nor +log recovery. So, it's gone unnoticed until now. This won't matter immediately - +repair will probably complain about it - but it needs to be fixed. + diff --git a/Documentation/filesystems/xfs.txt b/Documentation/filesystems/xfs.txt index 74aeb142ae5..5be51fd888b 100644 --- a/Documentation/filesystems/xfs.txt +++ b/Documentation/filesystems/xfs.txt @@ -18,6 +18,8 @@ Mount Options ============= When mounting an XFS filesystem, the following options are accepted. +For boolean mount options, the names with the (*) suffix is the +default behaviour. allocsize=size Sets the buffered I/O end-of-file preallocation size when @@ -25,82 +27,128 @@ When mounting an XFS filesystem, the following options are accepted. Valid values for this option are page size (typically 4KiB) through to 1GiB, inclusive, in power-of-2 increments. - attr2/noattr2 - The options enable/disable (default is disabled for backward - compatibility on-disk) an "opportunistic" improvement to be - made in the way inline extended attributes are stored on-disk. - When the new form is used for the first time (by setting or - removing extended attributes) the on-disk superblock feature - bit field will be updated to reflect this format being in use. - - barrier - Enables the use of block layer write barriers for writes into - the journal and unwritten extent conversion. This allows for - drive level write caching to be enabled, for devices that - support write barriers. - - dmapi - Enable the DMAPI (Data Management API) event callouts. - Use with the "mtpt" option. - - grpid/bsdgroups and nogrpid/sysvgroups - These options define what group ID a newly created file gets. - When grpid is set, it takes the group ID of the directory in - which it is created; otherwise (the default) it takes the fsgid - of the current process, unless the directory has the setgid bit - set, in which case it takes the gid from the parent directory, - and also gets the setgid bit set if it is a directory itself. - - ihashsize=value - Sets the number of hash buckets available for hashing the - in-memory inodes of the specified mount point. If a value - of zero is used, the value selected by the default algorithm - will be displayed in /proc/mounts. - - ikeep/noikeep - When inode clusters are emptied of inodes, keep them around - on the disk (ikeep) - this is the traditional XFS behaviour - and is still the default for now. Using the noikeep option, - inode clusters are returned to the free space pool. - - inode64 - Indicates that XFS is allowed to create inodes at any location - in the filesystem, including those which will result in inode - numbers occupying more than 32 bits of significance. This is - provided for backwards compatibility, but causes problems for - backup applications that cannot handle large inode numbers. - - largeio/nolargeio + The default behaviour is for dynamic end-of-file + preallocation size, which uses a set of heuristics to + optimise the preallocation size based on the current + allocation patterns within the file and the access patterns + to the file. Specifying a fixed allocsize value turns off + the dynamic behaviour. + + attr2 + noattr2 + The options enable/disable an "opportunistic" improvement to + be made in the way inline extended attributes are stored + on-disk. When the new form is used for the first time when + attr2 is selected (either when setting or removing extended + attributes) the on-disk superblock feature bit field will be + updated to reflect this format being in use. + + The default behaviour is determined by the on-disk feature + bit indicating that attr2 behaviour is active. If either + mount option it set, then that becomes the new default used + by the filesystem. + + CRC enabled filesystems always use the attr2 format, and so + will reject the noattr2 mount option if it is set. + + barrier (*) + nobarrier + Enables/disables the use of block layer write barriers for + writes into the journal and for data integrity operations. + This allows for drive level write caching to be enabled, for + devices that support write barriers. + + discard + nodiscard (*) + Enable/disable the issuing of commands to let the block + device reclaim space freed by the filesystem. This is + useful for SSD devices, thinly provisioned LUNs and virtual + machine images, but may have a performance impact. + + Note: It is currently recommended that you use the fstrim + application to discard unused blocks rather than the discard + mount option because the performance impact of this option + is quite severe. + + grpid/bsdgroups + nogrpid/sysvgroups (*) + These options define what group ID a newly created file + gets. When grpid is set, it takes the group ID of the + directory in which it is created; otherwise it takes the + fsgid of the current process, unless the directory has the + setgid bit set, in which case it takes the gid from the + parent directory, and also gets the setgid bit set if it is + a directory itself. + + filestreams + Make the data allocator use the filestreams allocation mode + across the entire filesystem rather than just on directories + configured to use it. + + ikeep + noikeep (*) + When ikeep is specified, XFS does not delete empty inode + clusters and keeps them around on disk. When noikeep is + specified, empty inode clusters are returned to the free + space pool. + + inode32 + inode64 (*) + When inode32 is specified, it indicates that XFS limits + inode creation to locations which will not result in inode + numbers with more than 32 bits of significance. + + When inode64 is specified, it indicates that XFS is allowed + to create inodes at any location in the filesystem, + including those which will result in inode numbers occupying + more than 32 bits of significance. + + inode32 is provided for backwards compatibility with older + systems and applications, since 64 bits inode numbers might + cause problems for some applications that cannot handle + large inode numbers. If applications are in use which do + not handle inode numbers bigger than 32 bits, the inode32 + option should be specified. + + + largeio + nolargeio (*) If "nolargeio" is specified, the optimal I/O reported in - st_blksize by stat(2) will be as small as possible to allow user - applications to avoid inefficient read/modify/write I/O. - If "largeio" specified, a filesystem that has a "swidth" specified - will return the "swidth" value (in bytes) in st_blksize. If the - filesystem does not have a "swidth" specified but does specify - an "allocsize" then "allocsize" (in bytes) will be returned - instead. - If neither of these two options are specified, then filesystem - will behave as if "nolargeio" was specified. + st_blksize by stat(2) will be as small as possible to allow + user applications to avoid inefficient read/modify/write + I/O. This is typically the page size of the machine, as + this is the granularity of the page cache. + + If "largeio" specified, a filesystem that was created with a + "swidth" specified will return the "swidth" value (in bytes) + in st_blksize. If the filesystem does not have a "swidth" + specified but does specify an "allocsize" then "allocsize" + (in bytes) will be returned instead. Otherwise the behaviour + is the same as if "nolargeio" was specified. logbufs=value - Set the number of in-memory log buffers. Valid numbers range - from 2-8 inclusive. - The default value is 8 buffers for filesystems with a - blocksize of 64KiB, 4 buffers for filesystems with a blocksize - of 32KiB, 3 buffers for filesystems with a blocksize of 16KiB - and 2 buffers for all other configurations. Increasing the - number of buffers may increase performance on some workloads - at the cost of the memory used for the additional log buffers - and their associated control structures. + Set the number of in-memory log buffers. Valid numbers + range from 2-8 inclusive. + + The default value is 8 buffers. + + If the memory cost of 8 log buffers is too high on small + systems, then it may be reduced at some cost to performance + on metadata intensive workloads. The logbsize option below + controls the size of each buffer and so is also relevant to + this case. logbsize=value - Set the size of each in-memory log buffer. - Size may be specified in bytes, or in kilobytes with a "k" suffix. - Valid sizes for version 1 and version 2 logs are 16384 (16k) and - 32768 (32k). Valid sizes for version 2 logs also include - 65536 (64k), 131072 (128k) and 262144 (256k). - The default value for machines with more than 32MiB of memory - is 32768, machines with less memory use 16384 by default. + Set the size of each in-memory log buffer. The size may be + specified in bytes, or in kilobytes with a "k" suffix. + Valid sizes for version 1 and version 2 logs are 16384 (16k) + and 32768 (32k). Valid sizes for version 2 logs also + include 65536 (64k), 131072 (128k) and 262144 (256k). The + logbsize must be an integer multiple of the log + stripe unit configured at mkfs time. + + The default value for for version 1 logs is 32768, while the + default value for version 2 logs is MAX(32768, log_sunit). logdev=device and rtdev=device Use an external log (metadata journal) and/or real-time device. @@ -109,16 +157,11 @@ When mounting an XFS filesystem, the following options are accepted. optional, and the log section can be separate from the data section or contained within it. - mtpt=mountpoint - Use with the "dmapi" option. The value specified here will be - included in the DMAPI mount event, and should be the path of - the actual mountpoint that is used. - noalign - Data allocations will not be aligned at stripe unit boundaries. - - noatime - Access timestamps are not updated when a file is read. + Data allocations will not be aligned at stripe unit + boundaries. This is only relevant to filesystems created + with non-zero data alignment parameters (sunit, swidth) by + mkfs. norecovery The filesystem will be mounted without running log recovery. @@ -129,19 +172,14 @@ When mounting an XFS filesystem, the following options are accepted. the mount will fail. nouuid - Don't check for double mounted file systems using the file system uuid. - This is useful to mount LVM snapshot volumes. + Don't check for double mounted file systems using the file + system uuid. This is useful to mount LVM snapshot volumes, + and often used in combination with "norecovery" for mounting + read-only snapshots. - osyncisosync - Make O_SYNC writes implement true O_SYNC. WITHOUT this option, - Linux XFS behaves as if an "osyncisdsync" option is used, - which will make writes to files opened with the O_SYNC flag set - behave as if the O_DSYNC flag had been used instead. - This can result in better performance without compromising - data safety. - However if this option is not in effect, timestamp updates from - O_SYNC writes can be lost if the system crashes. - If timestamp updates are critical, use the osyncisosync option. + noquota + Forcibly turns off all quota accounting and enforcement + within the filesystem. uquota/usrquota/uqnoenforce/quota User disk quota accounting enabled, and limits (optionally) @@ -156,24 +194,64 @@ When mounting an XFS filesystem, the following options are accepted. enforced. Refer to xfs_quota(8) for further details. sunit=value and swidth=value - Used to specify the stripe unit and width for a RAID device or - a stripe volume. "value" must be specified in 512-byte block - units. - If this option is not specified and the filesystem was made on - a stripe volume or the stripe width or unit were specified for - the RAID device at mkfs time, then the mount system call will - restore the value from the superblock. For filesystems that - are made directly on RAID devices, these options can be used - to override the information in the superblock if the underlying - disk layout changes after the filesystem has been created. - The "swidth" option is required if the "sunit" option has been - specified, and must be a multiple of the "sunit" value. + Used to specify the stripe unit and width for a RAID device + or a stripe volume. "value" must be specified in 512-byte + block units. These options are only relevant to filesystems + that were created with non-zero data alignment parameters. + + The sunit and swidth parameters specified must be compatible + with the existing filesystem alignment characteristics. In + general, that means the only valid changes to sunit are + increasing it by a power-of-2 multiple. Valid swidth values + are any integer multiple of a valid sunit value. + + Typically the only time these mount options are necessary if + after an underlying RAID device has had it's geometry + modified, such as adding a new disk to a RAID5 lun and + reshaping it. swalloc Data allocations will be rounded up to stripe width boundaries when the current end of file is being extended and the file size is larger than the stripe width size. + wsync + When specified, all filesystem namespace operations are + executed synchronously. This ensures that when the namespace + operation (create, unlink, etc) completes, the change to the + namespace is on stable storage. This is useful in HA setups + where failover must not result in clients seeing + inconsistent namespace presentation during or after a + failover event. + + +Deprecated Mount Options +======================== + + delaylog/nodelaylog + Delayed logging is the only logging method that XFS supports + now, so these mount options are now ignored. + + Due for removal in 3.12. + + ihashsize=value + In memory inode hashes have been removed, so this option has + no function as of August 2007. Option is deprecated. + + Due for removal in 3.12. + + irixsgid + This behaviour is now controlled by a sysctl, so the mount + option is ignored. + + Due for removal in 3.12. + + osyncisdsync + osyncisosync + O_SYNC and O_DSYNC are fully supported, so there is no need + for these options any more. + + Due for removal in 3.12. sysctls ======= @@ -185,15 +263,20 @@ The following sysctls are available for the XFS filesystem: in /proc/fs/xfs/stat. It then immediately resets to "0". fs.xfs.xfssyncd_centisecs (Min: 100 Default: 3000 Max: 720000) - The interval at which the xfssyncd thread flushes metadata - out to disk. This thread will flush log activity out, and - do some processing on unlinked inodes. + The interval at which the filesystem flushes metadata + out to disk and runs internal cache cleanup routines. - fs.xfs.xfsbufd_centisecs (Min: 50 Default: 100 Max: 3000) - The interval at which xfsbufd scans the dirty metadata buffers list. + fs.xfs.filestream_centisecs (Min: 1 Default: 3000 Max: 360000) + The interval at which the filesystem ages filestreams cache + references and returns timed-out AGs back to the free stream + pool. - fs.xfs.age_buffer_centisecs (Min: 100 Default: 1500 Max: 720000) - The age at which xfsbufd flushes dirty metadata buffers to disk. + fs.xfs.speculative_prealloc_lifetime + (Units: seconds Min: 1 Default: 300 Max: 86400) + The interval at which the background scanning for inodes + with unused speculative preallocation runs. The scan + removes unused preallocation from clean inodes and releases + the unused space back to the free pool. fs.xfs.error_level (Min: 0 Default: 3 Max: 11) A volume knob for error reporting when internal errors occur. @@ -230,10 +313,6 @@ The following sysctls are available for the XFS filesystem: ISGID bit is cleared if the irix_sgid_inherit compatibility sysctl is set. - fs.xfs.restrict_chown (Min: 0 Default: 1 Max: 1) - Controls whether unprivileged users can use chown to "give away" - a file to another user. - fs.xfs.inherit_sync (Min: 0 Default: 1 Max: 1) Setting this to "1" will cause the "sync" flag set by the xfs_io(8) chattr command on a directory to be @@ -254,9 +333,31 @@ The following sysctls are available for the XFS filesystem: by the xfs_io(8) chattr command on a directory to be inherited by files in that directory. + fs.xfs.inherit_nodefrag (Min: 0 Default: 1 Max: 1) + Setting this to "1" will cause the "nodefrag" flag set + by the xfs_io(8) chattr command on a directory to be + inherited by files in that directory. + fs.xfs.rotorstep (Min: 1 Default: 1 Max: 256) In "inode32" allocation mode, this option determines how many files the allocator attempts to allocate in the same allocation group before moving to the next allocation group. The intent is to control the rate at which the allocator moves between allocation groups when allocating extents for new files. + +Deprecated Sysctls +================== + + fs.xfs.xfsbufd_centisecs (Min: 50 Default: 100 Max: 3000) + Dirty metadata is now tracked by the log subsystem and + flushing is driven by log space and idling demands. The + xfsbufd no longer exists, so this syctl does nothing. + + Due for removal in 3.14. + + fs.xfs.age_buffer_centisecs (Min: 100 Default: 1500 Max: 720000) + Dirty metadata is now tracked by the log subsystem and + flushing is driven by log space and idling demands. The + xfsbufd no longer exists, so this syctl does nothing. + + Due for removal in 3.14. diff --git a/Documentation/filesystems/xip.txt b/Documentation/filesystems/xip.txt index 6c0cef10eb4..0466ee56927 100644 --- a/Documentation/filesystems/xip.txt +++ b/Documentation/filesystems/xip.txt @@ -19,7 +19,7 @@ completely. With execute-in-place, read&write type operations are performed directly from/to the memory backed storage device. For file mappings, the storage device itself is mapped directly into userspace. -This implementation was initialy written for shared memory segments between +This implementation was initially written for shared memory segments between different virtual machines on s390 hardware to allow multiple machines to share the same binaries and libraries. @@ -39,10 +39,11 @@ The block device operation is optional, these block devices support it as of today: - dcssblk: s390 dcss block device driver -An address space operation named get_xip_page is used to retrieve reference -to a struct page. To address the target page, a reference to an address_space, -and a sector number is provided. A 3rd argument indicates whether the -function should allocate blocks if needed. +An address space operation named get_xip_mem is used to retrieve references +to a page frame number and a kernel address. To obtain these values a reference +to an address_space is provided. This function assigns values to the kmem and +pfn parameters. The third argument indicates whether the function should allocate +blocks if needed. This address space operation is mutually exclusive with readpage&writepage that do page cache read/write operations. |
