Diffstat (limited to 'Documentation/filesystems')
-rw-r--r--  Documentation/filesystems/files.txt  123
-rw-r--r--  Documentation/filesystems/fuse.txt   315
-rw-r--r--  Documentation/filesystems/ntfs.txt    12
-rw-r--r--  Documentation/filesystems/proc.txt    42
-rw-r--r--  Documentation/filesystems/v9fs.txt    95
-rw-r--r--  Documentation/filesystems/vfs.txt    435
6 files changed, 900 insertions, 122 deletions
diff --git a/Documentation/filesystems/files.txt b/Documentation/filesystems/files.txt
new file mode 100644
index 00000000000..8c206f4e025
--- /dev/null
+++ b/Documentation/filesystems/files.txt
@@ -0,0 +1,123 @@
+File management in the Linux kernel
+-----------------------------------
+
+This document describes how locking for files (struct file)
+and the file descriptor table (struct files_struct) works.
+
+Up until 2.6.12, the file descriptor table has been protected
+with a lock (files->file_lock) and reference count (files->count).
+->file_lock protected accesses to all the file related fields
+of the table. ->count was used for sharing the file descriptor
+table between tasks cloned with CLONE_FILES flag. Typically
+this would be the case for posix threads. As with the common
+refcounting model in the kernel, the last task doing
+a put_files_struct() frees the file descriptor (fd) table.
+The files (struct file) themselves are protected using
+reference count (->f_count).
+
+In the new lock-free model of file descriptor management,
+the reference counting is similar, but the locking is
+based on RCU. The file descriptor table contains multiple
+elements - the fd sets (open_fds and close_on_exec), the
+array of file pointers, the sizes of the sets and the array,
+etc. In order for the updates to appear atomic to
+a lock-free reader, all the elements of the file descriptor
+table are in a separate structure - struct fdtable.
+files_struct contains a pointer to struct fdtable through
+which the actual fd table is accessed. Initially the
+fdtable is embedded in files_struct itself. On a subsequent
+expansion of fdtable, a new fdtable structure is allocated
+and files->fdtab points to the new structure. The fdtable
+structure is freed with RCU and lock-free readers either
+see the old fdtable or the new fdtable making the update
+appear atomic. Here are the locking rules for
+the fdtable structure -
+
+1. All references to the fdtable must be done through
+ the files_fdtable() macro :
+
+ struct fdtable *fdt;
+
+ rcu_read_lock();
+
+ fdt = files_fdtable(files);
+ ....
+ if (n < fdt->max_fds)
+ ....
+ ...
+ rcu_read_unlock();
+
+ files_fdtable() uses rcu_dereference() macro which takes care of
+ the memory barrier requirements for lock-free dereference.
+ The fdtable pointer must be read within the read-side
+ critical section.
+
+2. Reading of the fdtable as described above must be protected
+ by rcu_read_lock()/rcu_read_unlock().
+
+3. For any update to the fd table, files->file_lock must
+ be held.
+
+4. To look up the file structure given an fd, a reader
+ must use either fcheck() or fcheck_files() APIs. These
+ take care of barrier requirements due to lock-free lookup.
+ An example :
+
+ struct file *file;
+
+ rcu_read_lock();
+ file = fcheck(fd);
+ if (file) {
+ ...
+ }
+ ....
+ rcu_read_unlock();
+
+5. Handling of the file structures is special. Since the look-up
+ of the fd (fget()/fget_light()) is lock-free, it is possible
+ that look-up may race with the last put() operation on the
+ file structure. This is avoided using the rcuref APIs
+ on ->f_count :
+
+ rcu_read_lock();
+ file = fcheck_files(files, fd);
+ if (file) {
+ if (rcuref_inc_lf(&file->f_count))
+ *fput_needed = 1;
+ else
+ /* Didn't get the reference, someone's freed */
+ file = NULL;
+ }
+ rcu_read_unlock();
+ ....
+ return file;
+
+ rcuref_inc_lf() detects if the refcount is already zero or
+ goes to zero during increment. If it does, we fail
+ fget()/fget_light().
+
+6. Since both fdtable and file structures can be looked up
+ lock-free, they must be installed using rcu_assign_pointer()
+ API. If they are looked up lock-free, rcu_dereference()
+ must be used. However it is advisable to use files_fdtable()
+ and fcheck()/fcheck_files() which take care of these issues.
+
+7. While updating, the fdtable pointer must be looked up while
+ holding files->file_lock. If ->file_lock is dropped, then
+ another thread may expand the files thereby creating a new
+ fdtable and making the earlier fdtable pointer stale.
+ For example :
+
+ spin_lock(&files->file_lock);
+ fd = locate_fd(files, file, start);
+ if (fd >= 0) {
+ /* locate_fd() may have expanded fdtable, load the ptr */
+ fdt = files_fdtable(files);
+ FD_SET(fd, fdt->open_fds);
+ FD_CLR(fd, fdt->close_on_exec);
+ spin_unlock(&files->file_lock);
+ .....
+
+ Since locate_fd() can drop ->file_lock (and reacquire ->file_lock),
+ the fdtable pointer (fdt) must be loaded after locate_fd().
+
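
Note on rule 5 of files.txt above: the lookup only succeeds if the reference count can be raised while it is still non-zero. The following is a minimal userspace sketch of that "increment unless zero" idea using C11 atomics. It is not the kernel's rcuref_inc_lf() implementation, and the names struct obj, ref_get_unless_zero() and ref_put() are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for an object with a ->f_count style counter. */
struct obj {
	atomic_int count;
};

/*
 * Take a reference only if the count is still non-zero.
 * Returns false when the last reference is already gone, which is the
 * case in which a lock-free lookup has to fail.
 */
static bool ref_get_unless_zero(struct obj *o)
{
	int old = atomic_load(&o->count);

	while (old != 0) {
		/* On failure, 'old' is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&o->count, &old, old + 1))
			return true;
	}
	return false;
}

/* Drop a reference; returns true if this was the last one. */
static bool ref_put(struct obj *o)
{
	int old = atomic_fetch_sub(&o->count, 1);

	assert(old > 0);	/* dropping a reference that was never held */
	return old == 1;
}
```

In this sketch a caller that gets false back from ref_get_unless_zero() must treat the object as already freed, which mirrors the "file = NULL" branch in the documented example.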
diff --git a/Documentation/filesystems/fuse.txt b/Documentation/filesystems/fuse.txt
new file mode 100644
index 00000000000..6b5741e651a
--- /dev/null
+++ b/Documentation/filesystems/fuse.txt
@@ -0,0 +1,315 @@
+Definitions
+~~~~~~~~~~~
+
+Userspace filesystem:
+
+ A filesystem in which data and metadata are provided by an ordinary
+ userspace process. The filesystem can be accessed normally through
+ the kernel interface.
+
+Filesystem daemon:
+
+ The process(es) providing the data and metadata of the filesystem.
+
+Non-privileged mount (or user mount):
+
+ A userspace filesystem mounted by a non-privileged (non-root) user.
+ The filesystem daemon is running with the privileges of the mounting
+ user. NOTE: this is not the same as mounts allowed with the "user"
+ option in /etc/fstab, which is not discussed here.
+
+Mount owner:
+
+ The user who does the mounting.
+
+User:
+
+ The user who is performing filesystem operations.
+
+What is FUSE?
+~~~~~~~~~~~~~
+
+FUSE is a userspace filesystem framework. It consists of a kernel
+module (fuse.ko), a userspace library (libfuse.*) and a mount utility
+(fusermount).
+
+One of the most important features of FUSE is allowing secure,
+non-privileged mounts. This opens up new possibilities for the use of
+filesystems. A good example is sshfs: a secure network filesystem
+using the sftp protocol.
+
+The userspace library and utilities are available from the FUSE
+homepage:
+
+ http://fuse.sourceforge.net/
+
+Mount options
+~~~~~~~~~~~~~
+
+'fd=N'
+
+ The file descriptor to use for communication between the userspace
+ filesystem and the kernel. The file descriptor must have been
+ obtained by opening the FUSE device ('/dev/fuse').
+
+'rootmode=M'
+
+ The file mode of the filesystem's root in octal representation.
+
+'user_id=N'
+
+ The numeric user id of the mount owner.
+
+'group_id=N'
+
+ The numeric group id of the mount owner.
+
+'default_permissions'
+
+ By default FUSE doesn't check file access permissions; the
+ filesystem is free to implement its access policy or leave it to
+ the underlying file access mechanism (e.g. in case of network
+ filesystems). This option enables permission checking, restricting
+ access based on file mode. This option is usually useful
+ together with the 'allow_other' mount option.
+
+'allow_other'
+
+ This option overrides the security measure restricting file access
+ to the user mounting the filesystem. This option is by default only
+ allowed to root, but this restriction can be removed with a
+ (userspace) configuration option.
+
+'max_read=N'
+
+ With this option the maximum size of read operations can be set.
+ The default is infinite. Note that the size of read requests is
+ limited anyway to 32 pages (which is 128kbyte on i386).
+
+How do non-privileged mounts work?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Since the mount() system call is a privileged operation, a helper
+program (fusermount) is needed, which is installed setuid root.
+
+The implication of providing non-privileged mounts is that the mount
+owner must not be able to use this capability to compromise the
+system. Obvious requirements arising from this are:
+
+ A) mount owner should not be able to get elevated privileges with the
+ help of the mounted filesystem
+
+ B) mount owner should not get illegitimate access to information from
+ other users' and the super user's processes
+
+ C) mount owner should not be able to induce undesired behavior in
+ other users' or the super user's processes
+
+How are requirements fulfilled?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+ A) The mount owner could gain elevated privileges by either:
+
+ 1) creating a filesystem containing a device file, then opening
+ this device
+
+ 2) creating a filesystem containing a suid or sgid application,
+ then executing this application
+
+ The solution is not to allow opening device files and ignore
+ setuid and setgid bits when executing programs. To ensure this
+ fusermount always adds "nosuid" and "nodev" to the mount options
+ for non-privileged mounts.
+
+ B) If another user is accessing files or directories in the
+ filesystem, the filesystem daemon serving requests can record the
+ exact sequence and timing of operations performed. This
+ information is otherwise inaccessible to the mount owner, so this
+ counts as an information leak.
+
+ The solution to this problem will be presented in point 2) of C).
+
+ C) There are several ways in which the mount owner can induce
+ undesired behavior in other users' processes, such as:
+
+ 1) mounting a filesystem over a file or directory which the mount
+ owner would otherwise not be able to modify (or could only
+ make limited modifications).
+
+ This is solved in fusermount, by checking the access
+ permissions on the mountpoint and only allowing the mount if
+ the mount owner can do unlimited modification (has write
+ access to the mountpoint, and mountpoint is not a "sticky"
+ directory)
+
+ 2) Even if 1) is solved the mount owner can change the behavior
+ of other users' processes.
+
+ i) It can slow down or indefinitely delay the execution of a
+ filesystem operation creating a DoS against the user or the
+ whole system. For example a suid application locking a
+ system file, and then accessing a file on the mount owner's
+ filesystem could be stopped, and thus causing the system
+ file to be locked forever.
+
+ ii) It can present files or directories of unlimited length, or
+ directory structures of unlimited depth, possibly causing a
+ system process to eat up diskspace, memory or other
+ resources, again causing DoS.
+
+ The solution to this as well as B) is not to allow access to
+ the filesystem by processes which the mount owner could not
+ otherwise monitor or manipulate. Since if the
+ mount owner can ptrace a process, it can do all of the above
+ without using a FUSE mount, the same criteria as used in
+ ptrace can be used to check if a process is allowed to access
+ the filesystem or not.
+
+ Note that the ptrace check is not strictly necessary to
+ prevent C/2/i, it is enough to check if mount owner has enough
+ privilege to send signal to the process accessing the
+ filesystem, since SIGSTOP can be used to get a similar effect.
+
+I think these limitations are unacceptable?
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+If a sysadmin trusts the users enough, or can ensure through other
+measures, that system processes will never enter non-privileged
+mounts, it can relax the last limitation with a "user_allow_other"
+config option. If this config option is set, the mounting user can
+add the "allow_other" mount option which disables the check for other
+users' processes.
+
+Kernel - userspace interface
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The following diagram shows how a filesystem operation (in this
+example unlink) is performed in FUSE.
+
+NOTE: everything in this description is greatly simplified
+
+ | "rm /mnt/fuse/file" | FUSE filesystem daemon
+ | |
+ | | >sys_read()
+ | | >fuse_dev_read()
+ | | >request_wait()
+ | | [sleep on fc->waitq]
+ | |
+ | >sys_unlink() |
+ | >fuse_unlink() |
+ | [get request from |
+ | fc->unused_list] |
+ | >request_send() |
+ | [queue req on fc->pending] |
+ | [wake up fc->waitq] | [woken up]
+ | >request_wait_answer() |
+ | [sleep on req->waitq] |
+ | | <request_wait()
+ | | [remove req from fc->pending]
+ | | [copy req to read buffer]
+ | | [add req to fc->processing]
+ | | <fuse_dev_read()
+ | | <sys_read()
+ | |
+ | | [perform unlink]
+ | |
+ | | >sys_write()
+ | | >fuse_dev_write()
+ | | [look up req in fc->processing]
+ | | [remove from fc->processing]
+ | | [copy write buffer to req]
+ | [woken up] | [wake up req->waitq]
+ | | <fuse_dev_write()
+ | | <sys_write()
+ | <request_wait_answer() |
+ | <request_send() |
+ | [add request to |
+ | fc->unused_list] |
+ | <fuse_unlink() |
+ | <sys_unlink() |
+
+There are a couple of ways in which to deadlock a FUSE filesystem.
+Since we are talking about unprivileged userspace programs,
+something must be done about these.
+
+Scenario 1 - Simple deadlock
+-----------------------------
+
+ | "rm /mnt/fuse/file" | FUSE filesystem daemon
+ | |
+ | >sys_unlink("/mnt/fuse/file") |
+ | [acquire inode semaphore |
+ | for "file"] |
+ | >fuse_unlink() |
+ | [sleep on req->waitq] |
+ | | <sys_read()
+ | | >sys_unlink("/mnt/fuse/file")
+ | | [acquire inode semaphore
+ | | for "file"]
+ | | *DEADLOCK*
+
+The solution for this is to allow requests to be interrupted while
+they are in userspace:
+
+ | [interrupted by signal] |
+ | <fuse_unlink() |
+ | [release semaphore] | [semaphore acquired]
+ | <sys_unlink() |
+ | | >fuse_unlink()
+ | | [queue req on fc->pending]
+ | | [wake up fc->waitq]
+ | | [sleep on req->waitq]
+
+If the filesystem daemon was single threaded, this will stop here,
+since there's no other thread to dequeue and execute the request.
+In this case the solution is to kill the FUSE daemon as well. If
+there are multiple serving threads, you just have to kill them as
+long as any remain.
+
+Moral: a filesystem which deadlocks, can soon find itself dead.
+
+Scenario 2 - Tricky deadlock
+----------------------------
+
+This one needs a carefully crafted filesystem. It's a variation on
+the above, only the call back to the filesystem is not explicit,
+but is caused by a pagefault.
+
+ | Kamikaze filesystem thread 1 | Kamikaze filesystem thread 2
+ | |
+ | [fd = open("/mnt/fuse/file")] | [request served normally]
+ | [mmap fd to 'addr'] |
+ | [close fd] | [FLUSH triggers 'magic' flag]
+ | [read a byte from addr] |
+ | >do_page_fault() |
+ | [find or create page] |
+ | [lock page] |
+ | >fuse_readpage() |
+ | [queue READ request] |
+ | [sleep on req->waitq] |
+ | | [read request to buffer]
+ | | [create reply header before addr]
+ | | >sys_write(addr - headerlength)
+ | | >fuse_dev_write()
+ | | [look up req in fc->processing]
+ | | [remove from fc->processing]
+ | | [copy write buffer to req]
+ | | >do_page_fault()
+ | | [find or create page]
+ | | [lock page]
+ | | * DEADLOCK *
+
+Solution is again to let the request be interrupted (not
+elaborated further).
+
+An additional problem is that while the write buffer is being
+copied to the request, the request must not be interrupted. This
+is because the destination address of the copy may not be valid
+after the request is interrupted.
+
+This is solved by doing the copy atomically, and allowing
+interruption while the page(s) belonging to the write buffer are
+faulted with get_user_pages(). The 'req->locked' flag indicates
+when the copy is taking place, and interruption is delayed until
+this flag is unset.
+
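
Note on the "Mount options" section of fuse.txt above: the four required options can be combined into one comma-separated option string. The sketch below only formats that string and does not call mount(2) or open /dev/fuse. fuse_format_opts() is a hypothetical helper and not part of libfuse or fusermount. rootmode is printed in octal, as the text requires.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Build "fd=N,rootmode=M,user_id=N,group_id=N" in buf.
 * Returns the length of the string, or -1 if buf is too small.
 * Hypothetical helper for illustration only.
 */
static int fuse_format_opts(char *buf, size_t len, int fd,
			    unsigned int rootmode,
			    unsigned int uid, unsigned int gid)
{
	int n;

	assert(buf != NULL);
	/* %o prints the mode in octal without a leading zero. */
	n = snprintf(buf, len, "fd=%d,rootmode=%o,user_id=%u,group_id=%u",
		     fd, rootmode, uid, gid);
	if (n < 0 || (size_t)n >= len)
		return -1;
	return n;
}
```

The descriptor passed as fd would have to come from opening the FUSE device, as described for 'fd=N'; the values used when calling this helper are up to the caller.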
diff --git a/Documentation/filesystems/ntfs.txt b/Documentation/filesystems/ntfs.txt
index eef4aca0c75..a5fbc8e897f 100644
--- a/Documentation/filesystems/ntfs.txt
+++ b/Documentation/filesystems/ntfs.txt
@@ -439,6 +439,18 @@ ChangeLog
Note, a technical ChangeLog aimed at kernel hackers is in fs/ntfs/ChangeLog.
+2.1.24:
+ - Support journals ($LogFile) which have been modified by chkdsk. This
+ means users can boot into Windows after we marked the volume dirty.
+ The Windows boot will run chkdsk and then reboot. The user can then
+ immediately boot into Linux rather than having to do a full Windows
+ boot first before rebooting into Linux and we will recognize such a
+ journal and empty it as it is clean by definition.
+ - Support journals ($LogFile) with only one restart page as well as
+ journals with two different restart pages. We sanity check both and
+ either use the only sane one or the more recent one of the two in the
+ case that both are valid.
+ - Lots of bug fixes and enhancements across the board.
2.1.23:
- Stamp the user space journal, aka transaction log, aka $UsnJrnl, if
it is present and active thus telling Windows and applications using
diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index 5024ba7a592..d4773565ea2 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -1241,16 +1241,38 @@ swap-intensive.
overcommit_memory
-----------------
-This file contains one value. The following algorithm is used to decide if
-there's enough memory: if the value of overcommit_memory is positive, then
-there's always enough memory. This is a useful feature, since programs often
-malloc() huge amounts of memory 'just in case', while they only use a small
-part of it. Leaving this value at 0 will lead to the failure of such a huge
-malloc(), when in fact the system has enough memory for the program to run.
-
-On the other hand, enabling this feature can cause you to run out of memory
-and thrash the system to death, so large and/or important servers will want to
-set this value to 0.
+Controls overcommit of system memory, possibly allowing processes
+to allocate (but not use) more memory than is actually available.
+
+
+0 - Heuristic overcommit handling. Obvious overcommits of
+ address space are refused. Used for a typical system. It
+ ensures a seriously wild allocation fails while allowing
+ overcommit to reduce swap usage. root is allowed to
+ allocate slightly more memory in this mode. This is the
+ default.
+
+1 - Always overcommit. Appropriate for some scientific
+ applications.
+
+2 - Don't overcommit. The total address space commit
+ for the system is not permitted to exceed swap plus a
+ configurable percentage (default is 50) of physical RAM.
+ Depending on the percentage you use, in most situations
+ this means a process will not be killed while attempting
+ to use already-allocated memory but will receive errors
+ on memory allocation as appropriate.
+
+overcommit_ratio
+----------------
+
+Percentage of physical memory size to include in overcommit calculations
+(see above.)
+
+Memory allocation limit = swapspace + physmem * (overcommit_ratio / 100)
+
+ swapspace = total size of all swap areas
+ physmem = size of physical memory in system
nr_hugepages and hugetlb_shm_group
----------------------------------
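
Note on the overcommit_ratio text in the proc.txt hunk above: the allocation limit formula can be worked through in integer arithmetic. This is a sketch under the assumption that both sizes are given in kilobytes; commit_limit_kb() is a hypothetical name and this is not the kernel's own accounting code.

```c
#include <assert.h>
#include <stdint.h>

/*
 * limit = swapspace + physmem * (overcommit_ratio / 100)
 *
 * The multiplication is done before the division so that integer
 * arithmetic does not truncate overcommit_ratio / 100 to zero.
 */
static uint64_t commit_limit_kb(uint64_t swap_kb, uint64_t phys_kb,
				unsigned int ratio)
{
	/* Guard against overflow of phys_kb * ratio. */
	assert(ratio == 0 || phys_kb <= UINT64_MAX / ratio);
	return swap_kb + phys_kb * ratio / 100;
}
```

With 2 GiB of swap (2097152 kB), 4 GiB of RAM (4194304 kB) and the default ratio of 50, the function returns 4194304 kB, i.e. all of swap plus half of physical memory.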
diff --git a/Documentation/filesystems/v9fs.txt b/Documentation/filesystems/v9fs.txt
new file mode 100644
index 00000000000..4e92feb6b50
--- /dev/null
+++ b/Documentation/filesystems/v9fs.txt
@@ -0,0 +1,95 @@
+ V9FS: 9P2000 for Linux
+ ======================
+
+ABOUT
+=====
+
+v9fs is a Unix implementation of the Plan 9 9p remote filesystem protocol.
+
+This software was originally developed by Ron Minnich <rminnich@lanl.gov>
+and Maya Gokhale <maya@lanl.gov>. Additional development by Greg Watson
+<gwatson@lanl.gov> and most recently Eric Van Hensbergen
+<ericvh@gmail.com> and Latchesar Ionkov <lucho@ionkov.net>.
+
+USAGE
+=====
+
+For remote file server:
+
+ mount -t 9P 10.10.1.2 /mnt/9
+
+For Plan 9 From User Space applications (http://swtch.com/plan9)
+
+ mount -t 9P `namespace`/acme /mnt/9 -o proto=unix,name=$USER
+
+OPTIONS
+=======
+
+ proto=name select an alternative transport. Valid options are
+ currently:
+ unix - specifying a named pipe mount point
+ tcp - specifying a normal TCP/IP connection
+ fd - use passed file descriptors for connection
+ (see rfdno and wfdno)
+
+ name=name user name to attempt mount as on the remote server. The
+ server may override or ignore this value. Certain user
+ names may require authentication.
+
+ aname=name aname specifies the file tree to access when the server is
+ offering several exported file systems.
+
+ debug=n specifies debug level. The debug level is a bitmask.
+ 0x01 = display verbose error messages
+ 0x02 = developer debug (DEBUG_CURRENT)
+ 0x04 = display 9P trace
+ 0x08 = display VFS trace
+ 0x10 = display Marshalling debug
+ 0x20 = display RPC debug
+ 0x40 = display transport debug
+ 0x80 = display allocation debug
+
+ rfdno=n the file descriptor for reading with proto=fd
+
+ wfdno=n the file descriptor for writing with proto=fd
+
+ maxdata=n the number of bytes to use for 9P packet payload (msize)
+
+ port=n port to connect to on the remote server
+
+ timeout=n request timeouts (in ms) (default 60000ms)
+
+ noextend force legacy mode (no 9P2000.u semantics)
+
+ uid attempt to mount as a particular uid
+
+ gid attempt to mount with a particular gid
+
+ afid security channel - used by Plan 9 authentication protocols
+
+ nodevmap do not map special files - represent them as normal files.
+ This can be used to share devices/named pipes/sockets between
+ hosts. This functionality will be expanded in later versions.
+
+RESOURCES
+=========
+
+The Linux version of the 9P server, along with some client-side utilities
+can be found at http://v9fs.sf.net (along with a CVS repository of the
+development branch of this module). There are user and developer mailing
+lists here, as well as a bug-tracker.
+
+For more information on the Plan 9 Operating System check out
+http://plan9.bell-labs.com/plan9
+
+For information on Plan 9 from User Space (Plan 9 applications and libraries
+ported to Linux/BSD/OSX/etc) check out http://swtch.com/plan9
+
+
+STATUS
+======
+
+The 2.6 kernel support is working on PPC and x86.
+
+PLEASE USE THE SOURCEFORGE BUG-TRACKER TO REPORT PROBLEMS.
+
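
Note on the debug=n option of v9fs.txt above: since the level is a bitmask, several kinds of output are enabled by OR-ing the listed values together. The sketch below copies the bit values from the table; the enum names and debug_bit_set() are hypothetical and not taken from the v9fs source. How the resulting number has to be written on the mount command line (decimal or hex) is not stated in the text.

```c
#include <assert.h>

/* Bit values from the debug=n table; names are illustrative only. */
enum v9fs_debug_bits {
	DBG_ERROR   = 0x01,	/* verbose error messages */
	DBG_CURRENT = 0x02,	/* developer debug */
	DBG_9P      = 0x04,	/* 9P trace */
	DBG_VFS     = 0x08,	/* VFS trace */
	DBG_CONV    = 0x10,	/* marshalling debug */
	DBG_RPC     = 0x20,	/* RPC debug */
	DBG_TRANS   = 0x40,	/* transport debug */
	DBG_ALLOC   = 0x80	/* allocation debug */
};

/* Return non-zero if exactly the given single bit is set in mask. */
static int debug_bit_set(unsigned int mask, unsigned int bit)
{
	assert(bit != 0 && (bit & (bit - 1)) == 0);	/* one bit expected */
	return (mask & bit) != 0;
}
```

For example, DBG_ERROR | DBG_9P | DBG_RPC yields 0x25, which enables error messages, the 9P trace and RPC debug while leaving the other categories off.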
diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
index 3f318dd44c7..f042c12e0ed 100644
--- a/Documentation/filesystems/vfs.txt
+++ b/Documentation/filesystems/vfs.txt
@@ -1,35 +1,27 @@
-/* -*- auto-fill -*- */
- Overview of the Virtual File System
+ Overview of the Linux Virtual File System
- Richard Gooch <rgooch@atnf.csiro.au>
+ Original author: Richard Gooch <rgooch@atnf.csiro.au>
- 5-JUL-1999
+ Last updated on August 25, 2005
+ Copyright (C) 1999 Richard Gooch
+ Copyright (C) 2005 Pekka Enberg
-Conventions used in this document <section>
-=================================
+ This file is released under the GPLv2.
-Each section in this document will have the string "<section>" at the
-right-hand side of the section title. Each subsection will have
-"<subsection>" at the right-hand side. These strings are meant to make
-it easier to search through the document.
-NOTE that the master copy of this document is available online at:
-http://www.atnf.csiro.au/~rgooch/linux/docs/vfs.txt
-
-
-What is it? <section>
+What is it?
===========
The Virtual File System (otherwise known as the Virtual Filesystem
Switch) is the software layer in the kernel that provides the
filesystem interface to userspace programs. It also provides an
abstraction within the kernel which allows different filesystem
-implementations to co-exist.
+implementations to coexist.
-A Quick Look At How It Works <section>
+A Quick Look At How It Works
============================
In this section I'll briefly describe how things work, before
@@ -38,7 +30,8 @@ when user programs open and manipulate files, and then look from the
other view which is how a filesystem is supported and subsequently
mounted.
-Opening a File <subsection>
+
+Opening a File
--------------
The VFS implements the open(2), stat(2), chmod(2) and similar system
@@ -77,7 +70,7 @@ back to userspace.
Opening a file requires another operation: allocation of a file
structure (this is the kernel-side implementation of file
-descriptors). The freshly allocated file structure is initialised with
+descriptors). The freshly allocated file structure is initialized with
a pointer to the dentry and a set of file operation member functions.
These are taken from the inode data. The open() file method is then
called so the specific filesystem implementation can do it's work. You
@@ -102,7 +95,8 @@ filesystem or driver code at the same time, on different
processors. You should ensure that access to shared resources is
protected by appropriate locks.
-Registering and Mounting a Filesystem <subsection>
+
+Registering and Mounting a Filesystem
-------------------------------------
If you want to support a new kind of filesystem in the kernel, all you
@@ -123,17 +117,21 @@ updated to point to the root inode for the new filesystem.
It's now time to look at things in more detail.
-struct file_system_type <section>
+struct file_system_type
=======================
-This describes the filesystem. As of kernel 2.1.99, the following
+This describes the filesystem. As of kernel 2.6.13, the following
members are defined:
struct file_system_type {
const char *name;
int fs_flags;
- struct super_block *(*read_super) (struct super_block *, void *, int);
- struct file_system_type * next;
+ struct super_block *(*get_sb) (struct file_system_type *, int,
+ const char *, void *);
+ void (*kill_sb) (struct super_block *);
+ struct module *owner;
+ struct file_system_type * next;
+ struct list_head fs_supers;
};
name: the name of the filesystem type, such as "ext2", "iso9660",
@@ -141,51 +139,97 @@ struct file_system_type {
fs_flags: various flags (i.e. FS_REQUIRES_DEV, FS_NO_DCACHE, etc.)
- read_super: the method to call when a new instance of this
+ get_sb: the method to call when a new instance of this
filesystem should be mounted
- next: for internal VFS use: you should initialise this to NULL
+ kill_sb: the method to call when an instance of this filesystem
+ should be unmounted
+
+ owner: for internal VFS use: you should initialize this to THIS_MODULE in
+ most cases.
-The read_super() method has the following arguments:
+ next: for internal VFS use: you should initialize this to NULL
+
+The get_sb() method has the following arguments:
struct super_block *sb: the superblock structure. This is partially
- initialised by the VFS and the rest must be initialised by the
- read_super() method
+ initialized by the VFS and the rest must be initialized by the
+ get_sb() method
+
+ int flags: mount flags
+
+ const char *dev_name: the device name we are mounting.
void *data: arbitrary mount options, usually comes as an ASCII
string
int silent: whether or not to be silent on error
-The read_super() method must determine if the block device specified
+The get_sb() method must determine if the block device specified
in the superblock contains a filesystem of the type the method
supports. On success the method returns the superblock pointer, on
failure it returns NULL.
The most interesting member of the superblock structure that the
-read_super() method fills in is the "s_op" field. This is a pointer to
+get_sb() method fills in is the "s_op" field. This is a pointer to
a "struct super_operations" which describes the next level of the
filesystem implementation.
+Usually, a filesystem uses one of the generic get_sb()
+implementations and provides a fill_super() method instead. The
+generic methods are:
+
+ get_sb_bdev: mount a filesystem residing on a block device
-struct super_operations <section>
+ get_sb_nodev: mount a filesystem that is not backed by a device
+
+ get_sb_single: mount a filesystem which shares the instance between
+ all mounts
+
+A fill_super() method implementation has the following arguments:
+
+ struct super_block *sb: the superblock structure. The method fill_super()
+ must initialize this properly.
+
+ void *data: arbitrary mount options, usually comes as an ASCII
+ string
+
+ int silent: whether or not to be silent on error
+
+
+struct super_operations
=======================
This describes how the VFS can manipulate the superblock of your
-filesystem. As of kernel 2.1.99, the following members are defined:
+filesystem. As of kernel 2.6.13, the following members are defined:
struct super_operations {
- void (*read_inode) (struct inode *);
- int (*write_inode) (struct inode *, int);
- void (*put_inode) (struct inode *);
- void (*drop_inode) (struct inode *);
- void (*delete_inode) (struct inode *);
- int (*notify_change) (struct dentry *, struct iattr *);
- void (*put_super) (struct super_block *);
- void (*write_super) (struct super_block *);
- int (*statfs) (struct super_block *, struct statfs *, int);
- int (*remount_fs) (struct super_block *, int *, char *);
- void (*clear_inode) (struct inode *);
+ struct inode *(*alloc_inode)(struct super_block *sb);
+ void (*destroy_inode)(struct inode *);
+
+ void (*read_inode) (struct inode *);
+
+ void (*dirty_inode) (struct inode *);
+ int (*write_inode) (struct inode *, int);
+ void (*put_inode) (struct inode *);
+ void (*drop_inode) (struct inode *);
+ void (*delete_inode) (struct inode *);
+ void (*put_super) (struct super_block *);
+ void (*write_super) (struct super_block *);
+ int (*sync_fs)(struct super_block *sb, int wait);
+ void (*write_super_lockfs) (struct super_block *);
+ void (*unlockfs) (struct super_block *);
+ int (*statfs) (struct super_block *, struct kstatfs *);
+ int (*remount_fs) (struct super_block *, int *, char *);
+ void (*clear_inode) (struct inode *);
+ void (*umount_begin) (struct super_block *);
+
+ void (*sync_inodes) (struct super_block *sb,
+ struct writeback_control *wbc);
+ int (*show_options)(struct seq_file *, struct vfsmount *);
+
+ ssize_t (*quota_read)(struct super_block *, int, char *, size_t, loff_t);
+ ssize_t (*quota_write)(struct super_block *, int, const char *, size_t, loff_t);
};
All methods are called without any locks being held, unless otherwise
@@ -193,43 +237,62 @@ noted. This means that most methods can block safely. All methods are
only called from a process context (i.e. not from an interrupt handler
or bottom half).
+ alloc_inode: this method is called by alloc_inode() to allocate memory
+ for struct inode and initialize it.
+
+ destroy_inode: this method is called by destroy_inode() to release
+ resources allocated for struct inode.
+
read_inode: this method is called to read a specific inode from the
- mounted filesystem. The "i_ino" member in the "struct inode"
- will be initialised by the VFS to indicate which inode to
- read. Other members are filled in by this method
+ mounted filesystem. The i_ino member in the struct inode is
+ initialized by the VFS to indicate which inode to read. Other
+ members are filled in by this method.
+
+ You can set this to NULL and use iget5_locked() instead of iget()
+ to read inodes. This is necessary for filesystems for which the
+ inode number is not sufficient to identify an inode.
+
+ dirty_inode: this method is called by the VFS to mark an inode dirty.
write_inode: this method is called when the VFS needs to write an
inode to disc. The second parameter indicates whether the write
should be synchronous or not, not all filesystems check this flag.
put_inode: called when the VFS inode is removed from the inode
- cache. This method is optional
+ cache.
drop_inode: called when the last access to the inode is dropped,
with the inode_lock spinlock held.
- This method should be either NULL (normal unix filesystem
+ This method should be either NULL (normal UNIX filesystem
semantics) or "generic_delete_inode" (for filesystems that do not
want to cache inodes - causing "delete_inode" to always be
called regardless of the value of i_nlink)
- The "generic_delete_inode()" behaviour is equivalent to the
+ The "generic_delete_inode()" behavior is equivalent to the
old practice of using "force_delete" in the put_inode() case,
but does not have the races that the "force_delete()" approach
had.
delete_inode: called when the VFS wants to delete an inode
- notify_change: called when VFS inode attributes are changed. If this
- is NULL the VFS falls back to the write_inode() method. This
- is called with the kernel lock held
-
put_super: called when the VFS wishes to free the superblock
(i.e. unmount). This is called with the superblock lock held
write_super: called when the VFS superblock needs to be written to
disc. This method is optional
+ sync_fs: called when VFS is writing out all dirty data associated with
+ a superblock. The second parameter indicates whether the method
+ should wait until the write out has been completed. Optional.
+
+ write_super_lockfs: called when VFS is locking a filesystem and forcing
+ it into a consistent state. This function is currently used by the
+ Logical Volume Manager (LVM).
+
+ unlockfs: called when VFS is unlocking a filesystem and making it writable
+ again.
+