| author    | Grant Likely <grant.likely@secretlab.ca> | 2010-01-28 14:38:25 -0700 |
| committer | Grant Likely <grant.likely@secretlab.ca> | 2010-01-28 14:38:25 -0700 |
| commit    | 0ada0a73120c28cc432bcdbac061781465c2f48f (patch) |
| tree      | d17cadd4ea47e25d9e48e7d409a39c84268fbd27 /Documentation/filesystems/nfs |
| parent    | 6016a363f6b56b46b24655bcfc0499b715851cf3 (diff) |
| parent    | 92dcffb916d309aa01778bf8963a6932e4014d07 (diff) |
Merge commit 'v2.6.33-rc5' into secretlab/test-devicetree
Diffstat (limited to 'Documentation/filesystems/nfs')
| -rw-r--r-- | Documentation/filesystems/nfs/00-INDEX | 16 |
| -rw-r--r-- | Documentation/filesystems/nfs/Exporting | 147 |
| -rw-r--r-- | Documentation/filesystems/nfs/knfsd-stats.txt | 159 |
| -rw-r--r-- | Documentation/filesystems/nfs/nfs-rdma.txt | 271 |
| -rw-r--r-- | Documentation/filesystems/nfs/nfs.txt | 98 |
| -rw-r--r-- | Documentation/filesystems/nfs/nfs41-server.txt | 222 |
| -rw-r--r-- | Documentation/filesystems/nfs/nfsroot.txt | 270 |
| -rw-r--r-- | Documentation/filesystems/nfs/rpc-cache.txt | 202 |
8 files changed, 1385 insertions, 0 deletions
| diff --git a/Documentation/filesystems/nfs/00-INDEX b/Documentation/filesystems/nfs/00-INDEX new file mode 100644 index 00000000000..2f68cd68876 --- /dev/null +++ b/Documentation/filesystems/nfs/00-INDEX @@ -0,0 +1,16 @@ +00-INDEX +	- this file (nfs-related documentation). +Exporting +	- explanation of how to make filesystems exportable. +knfsd-stats.txt +	- statistics which the NFS server makes available to user space. +nfs.txt +	- nfs client, and DNS resolution for fs_locations. +nfs41-server.txt +	- info on the Linux server implementation of NFSv4 minor version 1. +nfs-rdma.txt +	- how to install and setup the Linux NFS/RDMA client and server software +nfsroot.txt +	- short guide on setting up a diskless box with NFS root filesystem. +rpc-cache.txt +	- introduction to the caching mechanisms in the sunrpc layer. diff --git a/Documentation/filesystems/nfs/Exporting b/Documentation/filesystems/nfs/Exporting new file mode 100644 index 00000000000..87019d2b598 --- /dev/null +++ b/Documentation/filesystems/nfs/Exporting @@ -0,0 +1,147 @@ + +Making Filesystems Exportable +============================= + +Overview +-------- + +All filesystem operations require a dentry (or two) as a starting +point.  Local applications have a reference-counted hold on suitable +dentries via open file descriptors or cwd/root.  However remote +applications that access a filesystem via a remote filesystem protocol +such as NFS may not be able to hold such a reference, and so need a +different way to refer to a particular dentry.  As the alternative +form of reference needs to be stable across renames, truncates, and +server-reboot (among other things, though these tend to be the most +problematic), there is no simple answer like 'filename'. + +The mechanism discussed here allows each filesystem implementation to +specify how to generate an opaque (outside of the filesystem) byte +string for any dentry, and how to find an appropriate dentry for any +given opaque byte string. +This byte string will be called a "filehandle fragment" as it +corresponds to part of an NFS filehandle. + +A filesystem which supports the mapping between filehandle fragments +and dentries will be termed "exportable". + + + +Dcache Issues +------------- + +The dcache normally contains a proper prefix of any given filesystem +tree.  This means that if any filesystem object is in the dcache, then +all of the ancestors of that filesystem object are also in the dcache. +As normal access is by filename this prefix is created naturally and +maintained easily (by each object maintaining a reference count on +its parent). + +However when objects are included into the dcache by interpreting a +filehandle fragment, there is no automatic creation of a path prefix +for the object.  This leads to two related but distinct features of +the dcache that are not needed for normal filesystem access. + +1/ The dcache must sometimes contain objects that are not part of the +   proper prefix. i.e that are not connected to the root. +2/ The dcache must be prepared for a newly found (via ->lookup) directory +   to already have a (non-connected) dentry, and must be able to move +   that dentry into place (based on the parent and name in the +   ->lookup).   This is particularly needed for directories as +   it is a dcache invariant that directories only have one dentry. + +To implement these features, the dcache has: + +a/ A dentry flag DCACHE_DISCONNECTED which is set on +   any dentry that might not be part of the proper prefix. 
+   This is set when anonymous dentries are created, and cleared when a +   dentry is noticed to be a child of a dentry which is in the proper +   prefix.  + +b/ A per-superblock list "s_anon" of dentries which are the roots of +   subtrees that are not in the proper prefix.  These dentries, as +   well as the proper prefix, need to be released at unmount time.  As +   these dentries will not be hashed, they are linked together on the +   d_hash list_head. + +c/ Helper routines to allocate anonymous dentries, and to help attach +   loose directory dentries at lookup time. They are: +    d_alloc_anon(inode) will return a dentry for the given inode. +      If the inode already has a dentry, one of those is returned. +      If it doesn't, a new anonymous (IS_ROOT and +        DCACHE_DISCONNECTED) dentry is allocated and attached. +      In the case of a directory, care is taken that only one dentry +      can ever be attached. +    d_splice_alias(inode, dentry) will make sure that there is a +      dentry with the same name and parent as the given dentry, and +      which refers to the given inode. +      If the inode is a directory and already has a dentry, then that +      dentry is d_moved over the given dentry. +      If the passed dentry gets attached, care is taken that this is +      mutually exclusive to a d_alloc_anon operation. +      If the passed dentry is used, NULL is returned, else the used +      dentry is returned.  This corresponds to the calling pattern of +      ->lookup. +   +  +Filesystem Issues +----------------- + +For a filesystem to be exportable it must: +  +   1/ provide the filehandle fragment routines described below. +   2/ make sure that d_splice_alias is used rather than d_add +      when ->lookup finds an inode for a given parent and name. +      Typically the ->lookup routine will end with a: + +		return d_splice_alias(inode, dentry); +	} + + + +  A file system implementation declares that instances of the filesystem +are exportable by setting the s_export_op field in the struct +super_block.  This field must point to a "struct export_operations" +struct which has the following members: + + encode_fh  (optional) +    Takes a dentry and creates a filehandle fragment which can later be used +    to find or create a dentry for the same object.  The default +    implementation creates a filehandle fragment that encodes a 32bit inode +    and generation number for the inode encoded, and if necessary the +    same information for the parent. + +  fh_to_dentry (mandatory) +    Given a filehandle fragment, this should find the implied object and +    create a dentry for it (possibly with d_alloc_anon). + +  fh_to_parent (optional but strongly recommended) +    Given a filehandle fragment, this should find the parent of the +    implied object and create a dentry for it (possibly with d_alloc_anon). +    May fail if the filehandle fragment is too small. + +  get_parent (optional but strongly recommended) +    When given a dentry for a directory, this should return  a dentry for +    the parent.  Quite possibly the parent dentry will have been allocated +    by d_alloc_anon.  The default get_parent function just returns an error +    so any filehandle lookup that requires finding a parent will fail. +    ->lookup("..") is *not* used as a default as it can leave ".." entries +    in the dcache which are too messy to work with. 
+ +  get_name (optional) +    When given a parent dentry and a child dentry, this should find a name +    in the directory identified by the parent dentry, which leads to the +    object identified by the child dentry.  If no get_name function is +    supplied, a default implementation is provided which uses vfs_readdir +    to find potential names, and matches inode numbers to find the correct +    match. + + +A filehandle fragment consists of an array of 1 or more 4byte words, +together with a one byte "type". +The decode_fh routine should not depend on the stated size that is +passed to it.  This size may be larger than the original filehandle +generated by encode_fh, in which case it will have been padded with +nuls.  Rather, the encode_fh routine should choose a "type" which +indicates the decode_fh how much of the filehandle is valid, and how +it should be interpreted. diff --git a/Documentation/filesystems/nfs/knfsd-stats.txt b/Documentation/filesystems/nfs/knfsd-stats.txt new file mode 100644 index 00000000000..64ced5149d3 --- /dev/null +++ b/Documentation/filesystems/nfs/knfsd-stats.txt @@ -0,0 +1,159 @@ + +Kernel NFS Server Statistics +============================ + +This document describes the format and semantics of the statistics +which the kernel NFS server makes available to userspace.  These +statistics are available in several text form pseudo files, each of +which is described separately below. + +In most cases you don't need to know these formats, as the nfsstat(8) +program from the nfs-utils distribution provides a helpful command-line +interface for extracting and printing them. + +All the files described here are formatted as a sequence of text lines, +separated by newline '\n' characters.  Lines beginning with a hash +'#' character are comments intended for humans and should be ignored +by parsing routines.  All other lines contain a sequence of fields +separated by whitespace. + +/proc/fs/nfsd/pool_stats +------------------------ + +This file is available in kernels from 2.6.30 onwards, if the +/proc/fs/nfsd filesystem is mounted (it almost always should be). + +The first line is a comment which describes the fields present in +all the other lines.  The other lines present the following data as +a sequence of unsigned decimal numeric fields.  One line is shown +for each NFS thread pool. + +All counters are 64 bits wide and wrap naturally.  There is no way +to zero these counters, instead applications should do their own +rate conversion. + +pool +	The id number of the NFS thread pool to which this line applies. +	This number does not change. + +	Thread pool ids are a contiguous set of small integers starting +	at zero.  The maximum value depends on the thread pool mode, but +	currently cannot be larger than the number of CPUs in the system. +	Note that in the default case there will be a single thread pool +	which contains all the nfsd threads and all the CPUs in the system, +	and thus this file will have a single line with a pool id of "0". + +packets-arrived +	Counts how many NFS packets have arrived.  More precisely, this +	is the number of times that the network stack has notified the +	sunrpc server layer that new data may be available on a transport +	(e.g. an NFS or UDP socket or an NFS/RDMA endpoint). 
+ +	Depending on the NFS workload patterns and various network stack +	effects (such as Large Receive Offload) which can combine packets +	on the wire, this may be either more or less than the number +	of NFS calls received (which statistic is available elsewhere). +	However, this is a more accurate and less workload-dependent measure +	of how much CPU load is being placed on the sunrpc server layer +	due to NFS network traffic. + +sockets-enqueued +	Counts how many times an NFS transport is enqueued to wait for +	an nfsd thread to service it, i.e. no nfsd thread was considered +	available. + +	The circumstance this statistic tracks indicates that there was NFS +	network-facing work to be done but it couldn't be done immediately, +	thus introducing a small delay in servicing NFS calls.  The ideal +	rate of change for this counter is zero; significantly non-zero +	values may indicate a performance limitation. + +	This can happen either because there are too few nfsd threads in the +	thread pool for the NFS workload (the workload is thread-limited), +	or because the NFS workload needs more CPU time than is available in +	the thread pool (the workload is CPU-limited).  In the former case, +	configuring more nfsd threads will probably improve the performance +	of the NFS workload.  In the latter case, the sunrpc server layer is +	already choosing not to wake idle nfsd threads because there are too +	many nfsd threads which want to run but cannot, so configuring more +	nfsd threads will make no difference whatsoever.  The overloads-avoided +	statistic (see below) can be used to distinguish these cases. + +threads-woken +	Counts how many times an idle nfsd thread is woken to try to +	receive some data from an NFS transport. + +	This statistic tracks the circumstance where incoming +	network-facing NFS work is being handled quickly, which is a good +	thing.  The ideal rate of change for this counter will be close +	to but less than the rate of change of the packets-arrived counter. + +overloads-avoided +	Counts how many times the sunrpc server layer chose not to wake an +	nfsd thread, despite the presence of idle nfsd threads, because +	too many nfsd threads had been recently woken but could not get +	enough CPU time to actually run. + +	This statistic counts a circumstance where the sunrpc layer +	heuristically avoids overloading the CPU scheduler with too many +	runnable nfsd threads.  The ideal rate of change for this counter +	is zero.  Significant non-zero values indicate that the workload +	is CPU limited.  Usually this is associated with heavy CPU usage +	on all the CPUs in the nfsd thread pool. + +	If a sustained large overloads-avoided rate is detected on a pool, +	the top(1) utility should be used to check for the following +	pattern of CPU usage on all the CPUs associated with the given +	nfsd thread pool. + +	 - %us ~= 0 (as you're *NOT* running applications on your NFS server) + +	 - %wa ~= 0 + +	 - %id ~= 0 + +	 - %sy + %hi + %si ~= 100 + +	If this pattern is seen, configuring more nfsd threads will *not* +	improve the performance of the workload.  If this pattern is not +	seen, then something more subtle is wrong. + +threads-timedout +	Counts how many times an nfsd thread triggered an idle timeout, +	i.e. was not woken to handle any incoming network packets for +	some time. + +	This statistic counts a circumstance where there are more nfsd +	threads configured than can be used by the NFS workload.
This is +	a clue that the number of nfsd threads can be reduced without +	affecting performance.  Unfortunately, it's only a clue and not +	a strong indication, for a couple of reasons: + +	 - Currently the rate at which the counter is incremented is quite +	   slow; the idle timeout is 60 minutes.  Unless the NFS workload +	   remains constant for hours at a time, this counter is unlikely +	   to be providing information that is still useful. + +	 - It is usually a wise policy to provide some slack, +	   i.e. configure a few more nfsds than are currently needed, +	   to allow for future spikes in load. + + +Note that incoming packets on NFS transports will be dealt with in +one of three ways.  An nfsd thread can be woken (threads-woken counts +this case), or the transport can be enqueued for later attention +(sockets-enqueued counts this case), or the packet can be temporarily +deferred because the transport is currently being used by an nfsd +thread.  This last case is not very interesting and is not explicitly +counted, but can be inferred from the other counters thus: + +packets-deferred = packets-arrived - ( sockets-enqueued + threads-woken ) + + +More +---- +Descriptions of the other statistics file should go here. + + +Greg Banks <gnb@sgi.com> +26 Mar 2009 diff --git a/Documentation/filesystems/nfs/nfs-rdma.txt b/Documentation/filesystems/nfs/nfs-rdma.txt new file mode 100644 index 00000000000..e386f7e4bce --- /dev/null +++ b/Documentation/filesystems/nfs/nfs-rdma.txt @@ -0,0 +1,271 @@ +################################################################################ +#									       # +#				NFS/RDMA README				       # +#									       # +################################################################################ + + Author: NetApp and Open Grid Computing + Date: May 29, 2008 + +Table of Contents +~~~~~~~~~~~~~~~~~ + - Overview + - Getting Help + - Installation + - Check RDMA and NFS Setup + - NFS/RDMA Setup + +Overview +~~~~~~~~ + +  This document describes how to install and setup the Linux NFS/RDMA client +  and server software. + +  The NFS/RDMA client was first included in Linux 2.6.24. The NFS/RDMA server +  was first included in the following release, Linux 2.6.25. + +  In our testing, we have obtained excellent performance results (full 10Gbit +  wire bandwidth at minimal client CPU) under many workloads. The code passes +  the full Connectathon test suite and operates over both Infiniband and iWARP +  RDMA adapters. + +Getting Help +~~~~~~~~~~~~ + +  If you get stuck, you can ask questions on the + +                nfs-rdma-devel@lists.sourceforge.net + +  mailing list. + +Installation +~~~~~~~~~~~~ + +  These instructions are a step by step guide to building a machine for +  use with NFS/RDMA. + +  - Install an RDMA device + +    Any device supported by the drivers in drivers/infiniband/hw is acceptable. + +    Testing has been performed using several Mellanox-based IB cards, the +    Ammasso AMS1100 iWARP adapter, and the Chelsio cxgb3 iWARP adapter. + +  - Install a Linux distribution and tools + +    The first kernel release to contain both the NFS/RDMA client and server was +    Linux 2.6.25  Therefore, a distribution compatible with this and subsequent +    Linux kernel release should be installed. + +    The procedures described in this document have been tested with +    distributions from Red Hat's Fedora Project (http://fedora.redhat.com/). 
+ +  - Install nfs-utils-1.1.2 or greater on the client + +    An NFS/RDMA mount point can be obtained by using the mount.nfs command in +    nfs-utils-1.1.2 or greater (nfs-utils-1.1.1 was the first nfs-utils +    version with support for NFS/RDMA mounts, but for various reasons we +    recommend using nfs-utils-1.1.2 or greater). To see which version of +    mount.nfs you are using, type: + +    $ /sbin/mount.nfs -V + +    If the version is less than 1.1.2 or the command does not exist, +    you should install the latest version of nfs-utils. + +    Download the latest package from: + +    http://www.kernel.org/pub/linux/utils/nfs + +    Uncompress the package and follow the installation instructions. + +    If you will not need the idmapper and gssd executables (you do not need +    these to create an NFS/RDMA enabled mount command), the installation +    process can be simplified by disabling these features when running +    configure: + +    $ ./configure --disable-gss --disable-nfsv4 + +    To build nfs-utils you will need the tcp_wrappers package installed. For +    more information on this see the package's README and INSTALL files. + +    After building the nfs-utils package, there will be a mount.nfs binary in +    the utils/mount directory. This binary can be used to initiate NFS v2, v3, +    or v4 mounts. To initiate a v4 mount, the binary must be called +    mount.nfs4.  The standard technique is to create a symlink called +    mount.nfs4 to mount.nfs. + +    This mount.nfs binary should be installed at /sbin/mount.nfs as follows: + +    $ sudo cp utils/mount/mount.nfs /sbin/mount.nfs + +    In this location, mount.nfs will be invoked automatically for NFS mounts +    by the system mount command. + +    NOTE: mount.nfs and therefore nfs-utils-1.1.2 or greater is only needed +    on the NFS client machine. You do not need this specific version of +    nfs-utils on the server. Furthermore, only the mount.nfs command from +    nfs-utils-1.1.2 is needed on the client. + +  - Install a Linux kernel with NFS/RDMA + +    The NFS/RDMA client and server are both included in the mainline Linux +    kernel version 2.6.25 and later. This and other versions of the 2.6 Linux +    kernel can be found at: + +    ftp://ftp.kernel.org/pub/linux/kernel/v2.6/ + +    Download the sources and place them in an appropriate location. + +  - Configure the RDMA stack + +    Make sure your kernel configuration has RDMA support enabled. Under +    Device Drivers -> InfiniBand support, update the kernel configuration +    to enable InfiniBand support [NOTE: the option name is misleading. Enabling +    InfiniBand support is required for all RDMA devices (IB, iWARP, etc.)]. + +    Enable the appropriate IB HCA support (mlx4, mthca, ehca, ipath, etc.) or +    iWARP adapter support (amso, cxgb3, etc.). + +    If you are using InfiniBand, be sure to enable IP-over-InfiniBand support. + +  - Configure the NFS client and server + +    Your kernel configuration must also have NFS file system support and/or +    NFS server support enabled. These and other NFS related configuration +    options can be found under File Systems -> Network File Systems. + +  - Build, install, reboot + +    The NFS/RDMA code will be enabled automatically if NFS and RDMA +    are turned on. The NFS/RDMA client and server are configured via the hidden +    SUNRPC_XPRT_RDMA config option that depends on SUNRPC and INFINIBAND. 
The +    value of SUNRPC_XPRT_RDMA will be: + +     - N if either SUNRPC or INFINIBAND are N, in this case the NFS/RDMA client +       and server will not be built +     - M if both SUNRPC and INFINIBAND are on (M or Y) and at least one is M, +       in this case the NFS/RDMA client and server will be built as modules +     - Y if both SUNRPC and INFINIBAND are Y, in this case the NFS/RDMA client +       and server will be built into the kernel + +    Therefore, if you have followed the steps above and turned on NFS and RDMA, +    the NFS/RDMA client and server will be built. + +    Build a new kernel, install it, boot it. + +Check RDMA and NFS Setup +~~~~~~~~~~~~~~~~~~~~~~~~ + +    Before configuring the NFS/RDMA software, it is a good idea to test +    your new kernel to ensure that the kernel is working correctly. +    In particular, it is a good idea to verify that the RDMA stack +    is functioning as expected and standard NFS over TCP/IP and/or UDP/IP +    is working properly. + +  - Check RDMA Setup + +    If you built the RDMA components as modules, load them at +    this time. For example, if you are using a Mellanox Tavor/Sinai/Arbel +    card: + +    $ modprobe ib_mthca +    $ modprobe ib_ipoib + +    If you are using InfiniBand, make sure there is a Subnet Manager (SM) +    running on the network. If your IB switch has an embedded SM, you can +    use it. Otherwise, you will need to run an SM, such as OpenSM, on one +    of your end nodes. + +    If an SM is running on your network, you should see the following: + +    $ cat /sys/class/infiniband/driverX/ports/1/state +    4: ACTIVE + +    where driverX is mthca0, ipath5, ehca3, etc. + +    To further test the InfiniBand software stack, use IPoIB (this +    assumes you have two IB hosts named host1 and host2): + +    host1$ ifconfig ib0 a.b.c.x +    host2$ ifconfig ib0 a.b.c.y +    host1$ ping a.b.c.y +    host2$ ping a.b.c.x + +    For other device types, follow the appropriate procedures. + +  - Check NFS Setup + +    For the NFS components enabled above (client and/or server), +    test their functionality over standard Ethernet using TCP/IP or UDP/IP. + +NFS/RDMA Setup +~~~~~~~~~~~~~~ + +  We recommend that you use two machines, one to act as the client and +  one to act as the server. + +  One time configuration: + +  - On the server system, configure the /etc/exports file and +    start the NFS/RDMA server. + +    Exports entries with the following formats have been tested: + +    /vol0   192.168.0.47(fsid=0,rw,async,insecure,no_root_squash) +    /vol0   192.168.0.0/255.255.255.0(fsid=0,rw,async,insecure,no_root_squash) + +    The IP address(es) is(are) the client's IPoIB address for an InfiniBand +    HCA or the client's iWARP address(es) for an RNIC. + +    NOTE: The "insecure" option must be used because the NFS/RDMA client does +    not use a reserved port.
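    As a quick sanity check of this one time configuration, the exports
    can be re-read and listed with the exportfs utility from nfs-utils
    (a sketch only; the exact listing format varies between nfs-utils
    versions):

    $ sudo exportfs -ra    # (re)export everything listed in /etc/exports
    $ sudo exportfs -v     # list the active exports and their options

    Each entry added above should show up in the exportfs -v output,
    including the "insecure" option, before proceeding.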
+ + Each time a machine boots: + +  - Load and configure the RDMA drivers + +    For InfiniBand using a Mellanox adapter: + +    $ modprobe ib_mthca +    $ modprobe ib_ipoib +    $ ifconfig ib0 a.b.c.d + +    NOTE: use unique addresses for the client and server + +  - Start the NFS server + +    If the NFS/RDMA server was built as a module (CONFIG_SUNRPC_XPRT_RDMA=m in +    kernel config), load the RDMA transport module: + +    $ modprobe svcrdma + +    Regardless of how the server was built (module or built-in), start the +    server: + +    $ /etc/init.d/nfs start + +    or + +    $ service nfs start + +    Instruct the server to listen on the RDMA transport: + +    $ echo rdma 20049 > /proc/fs/nfsd/portlist + +  - On the client system + +    If the NFS/RDMA client was built as a module (CONFIG_SUNRPC_XPRT_RDMA=m in +    kernel config), load the RDMA client module: + +    $ modprobe xprtrdma.ko + +    Regardless of how the client was built (module or built-in), use this +    command to mount the NFS/RDMA server: + +    $ mount -o rdma,port=20049 <IPoIB-server-name-or-address>:/<export> /mnt + +    To verify that the mount is using RDMA, run "cat /proc/mounts" and check +    the "proto" field for the given mount. + +  Congratulations! You're using NFS/RDMA! diff --git a/Documentation/filesystems/nfs/nfs.txt b/Documentation/filesystems/nfs/nfs.txt new file mode 100644 index 00000000000..f50f26ce6cd --- /dev/null +++ b/Documentation/filesystems/nfs/nfs.txt @@ -0,0 +1,98 @@ + +The NFS client +============== + +The NFS version 2 protocol was first documented in RFC1094 (March 1989). +Since then two more major releases of NFS have been published, with NFSv3 +being documented in RFC1813 (June 1995), and NFSv4 in RFC3530 (April +2003). + +The Linux NFS client currently supports all the above published versions, +and work is in progress on adding support for minor version 1 of the NFSv4 +protocol. + +The purpose of this document is to provide information on some of the +upcall interfaces that are used in order to provide the NFS client with +some of the information that it requires in order to fully comply with +the NFS spec. + +The DNS resolver +================ + +NFSv4 allows for one server to refer the NFS client to data that has been +migrated onto another server by means of the special "fs_locations" +attribute. See +	http://tools.ietf.org/html/rfc3530#section-6 +and +	http://tools.ietf.org/html/draft-ietf-nfsv4-referrals-00 + +The fs_locations information can take the form of either an ip address and +a path, or a DNS hostname and a path. The latter requires the NFS client to +do a DNS lookup in order to mount the new volume, and hence the need for an +upcall to allow userland to provide this service. + +Assuming that the user has the 'rpc_pipefs' filesystem mounted in the usual +/var/lib/nfs/rpc_pipefs, the upcall consists of the following steps: + +   (1) The process checks the dns_resolve cache to see if it contains a +       valid entry. If so, it returns that entry and exits. 
+ +   (2) If no valid entry exists, the helper script '/sbin/nfs_cache_getent' +       (may be changed using the 'nfs.cache_getent' kernel boot parameter) +       is run, with two arguments: +		- the cache name, "dns_resolve" +		- the hostname to resolve + +   (3) After looking up the corresponding ip address, the helper script +       writes the result into the rpc_pipefs pseudo-file +       '/var/lib/nfs/rpc_pipefs/cache/dns_resolve/channel' +       in the following (text) format: + +		"<ip address> <hostname> <ttl>\n" + +       Where <ip address> is in the usual IPv4 (123.456.78.90) or IPv6 +       (ffee:ddcc:bbaa:9988:7766:5544:3322:1100, ffee::1100, ...) format. +       <hostname> is identical to the second argument of the helper +       script, and <ttl> is the 'time to live' of this cache entry (in +       units of seconds). + +       Note: If <ip address> is invalid, say the string "0", then a negative +       entry is created, which will cause the kernel to treat the hostname +       as having no valid DNS translation. + + + + +A basic sample /sbin/nfs_cache_getent +===================================== + +#!/bin/bash +# +ttl=600 +# +cut=/usr/bin/cut +getent=/usr/bin/getent +rpc_pipefs=/var/lib/nfs/rpc_pipefs +# +die() +{ +	echo "Usage: $0 cache_name entry_name" +	exit 1 +} + +[ $# -lt 2 ] && die +cachename="$1" +cache_path=${rpc_pipefs}/cache/${cachename}/channel + +case "${cachename}" in +	dns_resolve) +		name="$2" +		result="$(${getent} hosts ${name} | ${cut} -f1 -d\ )" +		[ -z "${result}" ] && result="0" +		;; +	*) +		die +		;; +esac +echo "${result} ${name} ${ttl}" >${cache_path} + diff --git a/Documentation/filesystems/nfs/nfs41-server.txt b/Documentation/filesystems/nfs/nfs41-server.txt new file mode 100644 index 00000000000..1bd0d0c0517 --- /dev/null +++ b/Documentation/filesystems/nfs/nfs41-server.txt @@ -0,0 +1,222 @@ +NFSv4.1 Server Implementation + +Server support for minorversion 1 can be controlled using the +/proc/fs/nfsd/versions control file.  The string output returned +by reading this file will contain either "+4.1" or "-4.1" +correspondingly. + +Currently, server support for minorversion 1 is disabled by default. +It can be enabled at run time by writing the string "+4.1" to +the /proc/fs/nfsd/versions control file.  Note that to write this +control file, the nfsd service must be taken down.  Use your user-mode +nfs-utils to set this up; see rpc.nfsd(8) + +(Warning: older servers will interpret "+4.1" and "-4.1" as "+4" and +"-4", respectively.  Therefore, code meant to work on both new and old +kernels must turn 4.1 on or off *before* turning support for version 4 +on or off; rpc.nfsd does this correctly.) + +The NFSv4 minorversion 1 (NFSv4.1) implementation in nfsd is based +on the latest NFSv4.1 Internet Draft: +http://tools.ietf.org/html/draft-ietf-nfsv4-minorversion1-29 + +From the many new features in NFSv4.1 the current implementation +focuses on the mandatory-to-implement NFSv4.1 Sessions, providing +"exactly once" semantics and better control and throttling of the +resources allocated for each client. + +Other NFSv4.1 features, Parallel NFS operations in particular, +are still under development out of tree. +See http://wiki.linux-nfs.org/wiki/index.php/PNFS_prototype_design +for more information. 
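Before turning to the implementation status below, the versions control
file handling described at the top of this document can be summarised in
a short sketch (this assumes the nfsd service is managed by an init
script named "nfs", and remember that the control file can only be
written while nfsd is down):

	# cat /proc/fs/nfsd/versions		# contains "+4.1" or "-4.1"
	# /etc/init.d/nfs stop			# nfsd must be down to change it
	# echo "+4.1" > /proc/fs/nfsd/versions
	# /etc/init.d/nfs start

A sufficiently recent rpc.nfsd(8) from nfs-utils performs the equivalent
steps itself.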
+ +The current implementation is intended for developers only: while it +does support ordinary file operations on clients we have tested against +(including the linux client), it is incomplete in ways which may limit +features unexpectedly, cause known bugs in rare cases, or cause +interoperability problems with future clients.  Known issues: + +	- gss support is questionable: currently mounts with kerberos +	  from a linux client are possible, but we aren't really +	  conformant with the spec (for example, we don't use kerberos +	  on the backchannel correctly). +	- no trunking support: no clients currently take advantage of +	  trunking, but this is a mandatory feature, and its use is +	  recommended to clients in a number of places.  (E.g. to ensure +	  timely renewal in case an existing connection's retry timeouts +	  have gotten too long; see section 8.3 of the draft.) +	  Therefore, lack of this feature may cause future clients to +	  fail. +	- Incomplete backchannel support: incomplete backchannel gss +	  support and no support for BACKCHANNEL_CTL mean that +	  callbacks (hence delegations and layouts) may not be +	  available and clients confused by the incomplete +	  implementation may fail. +	- Server reboot recovery is unsupported; if the server reboots, +	  clients may fail. +	- We do not support SSV, which provides security for shared +	  client-server state (thus preventing unauthorized tampering +	  with locks and opens, for example).  It is mandatory for +	  servers to support this, though no clients use it yet. +	- Mandatory operations which we do not support, such as +	  DESTROY_CLIENTID, FREE_STATEID, SECINFO_NO_NAME, and +	  TEST_STATEID, are not currently used by clients, but will be +	  (and the spec recommends their uses in common cases), and +	  clients should not be expected to know how to recover from the +	  case where they are not supported.  This will eventually cause +	  interoperability failures. + +In addition, some limitations are inherited from the current NFSv4 +implementation: + +	- Incomplete delegation enforcement: if a file is renamed or +	  unlinked, a client holding a delegation may continue to +	  indefinitely allow opens of the file under the old name. + +The table below, taken from the NFSv4.1 document, lists +the operations that are mandatory to implement (REQ), optional +(OPT), and NFSv4.0 operations that are required not to implement (MNI) +in minor version 1.  The first column indicates the operations that +are not supported yet by the linux server implementation. + +The OPTIONAL features identified and their abbreviations are as follows: +	pNFS	Parallel NFS +	FDELG	File Delegations +	DDELG	Directory Delegations + +The following abbreviations indicate the linux server implementation status. +	I	Implemented NFSv4.1 operations. +	NS	Not Supported. +	NS*	unimplemented optional feature. +	P	pNFS features implemented out of tree. +	PNS	pNFS features that are not supported yet (out of tree). 
+ +Operations + +   +----------------------+------------+--------------+----------------+ +   | Operation            | REQ, REC,  | Feature      | Definition     | +   |                      | OPT, or    | (REQ, REC,   |                | +   |                      | MNI        | or OPT)      |                | +   +----------------------+------------+--------------+----------------+ +   | ACCESS               | REQ        |              | Section 18.1   | +NS | BACKCHANNEL_CTL      | REQ        |              | Section 18.33  | +NS | BIND_CONN_TO_SESSION | REQ        |              | Section 18.34  | +   | CLOSE                | REQ        |              | Section 18.2   | +   | COMMIT               | REQ        |              | Section 18.3   | +   | CREATE               | REQ        |              | Section 18.4   | +I  | CREATE_SESSION       | REQ        |              | Section 18.36  | +NS*| DELEGPURGE           | OPT        | FDELG (REQ)  | Section 18.5   | +   | DELEGRETURN          | OPT        | FDELG,       | Section 18.6   | +   |                      |            | DDELG, pNFS  |                | +   |                      |            | (REQ)        |                | +NS | DESTROY_CLIENTID     | REQ        |              | Section 18.50  | +I  | DESTROY_SESSION      | REQ        |              | Section 18.37  | +I  | EXCHANGE_ID          | REQ        |              | Section 18.35  | +NS | FREE_STATEID         | REQ        |              | Section 18.38  | +   | GETATTR              | REQ        |              | Section 18.7   | +P  | GETDEVICEINFO        | OPT        | pNFS (REQ)   | Section 18.40  | +P  | GETDEVICELIST        | OPT        | pNFS (OPT)   | Section 18.41  | +   | GETFH                | REQ        |              | Section 18.8   | +NS*| GET_DIR_DELEGATION   | OPT        | DDELG (REQ)  | Section 18.39  | +P  | LAYOUTCOMMIT         | OPT        | pNFS (REQ)   | Section 18.42  | +P  | LAYOUTGET            | OPT        | pNFS (REQ)   | Section 18.43  | +P  | LAYOUTRETURN         | OPT        | pNFS (REQ)   | Section 18.44  | +   | LINK                 | OPT        |              | Section 18.9   | +   | LOCK                 | REQ        |              | Section 18.10  | +   | LOCKT                | REQ        |              | Section 18.11  | +   | LOCKU                | REQ        |              | Section 18.12  | +   | LOOKUP               | REQ        |              | Section 18.13  | +   | LOOKUPP              | REQ        |              | Section 18.14  | +   | NVERIFY              | REQ        |              | Section 18.15  | +   | OPEN                 | REQ        |              | Section 18.16  | +NS*| OPENATTR             | OPT        |              | Section 18.17  | +   | OPEN_CONFIRM         | MNI        |              | N/A            | +   | OPEN_DOWNGRADE       | REQ        |              | Section 18.18  | +   | PUTFH                | REQ        |              | Section 18.19  | +   | PUTPUBFH             | REQ        |              | Section 18.20  | +   | PUTROOTFH            | REQ        |              | Section 18.21  | +   | READ                 | REQ        |              | Section 18.22  | +   | READDIR              | REQ        |              | Section 18.23  | +   | READLINK             | OPT        |              | Section 18.24  | +NS | RECLAIM_COMPLETE     | REQ        |              | Section 18.51  | +   | RELEASE_LOCKOWNER    | MNI        |              | N/A            | +   | REMOVE               | REQ        |              | 
Section 18.25  | +   | RENAME               | REQ        |              | Section 18.26  | +   | RENEW                | MNI        |              | N/A            | +   | RESTOREFH            | REQ        |              | Section 18.27  | +   | SAVEFH               | REQ        |              | Section 18.28  | +   | SECINFO              | REQ        |              | Section 18.29  | +NS | SECINFO_NO_NAME      | REC        | pNFS files   | Section 18.45, | +   |                      |            | layout (REQ) | Section 13.12  | +I  | SEQUENCE             | REQ        |              | Section 18.46  | +   | SETATTR              | REQ        |              | Section 18.30  | +   | SETCLIENTID          | MNI        |              | N/A            | +   | SETCLIENTID_CONFIRM  | MNI        |              | N/A            | +NS | SET_SSV              | REQ        |              | Section 18.47  | +NS | TEST_STATEID         | REQ        |              | Section 18.48  | +   | VERIFY               | REQ        |              | Section 18.31  | +NS*| WANT_DELEGATION      | OPT        | FDELG (OPT)  | Section 18.49  | +   | WRITE                | REQ        |              | Section 18.32  | + +Callback Operations + +   +-------------------------+-----------+-------------+---------------+ +   | Operation               | REQ, REC, | Feature     | Definition    | +   |                         | OPT, or   | (REQ, REC,  |               | +   |                         | MNI       | or OPT)     |               | +   +-------------------------+-----------+-------------+---------------+ +   | CB_GETATTR              | OPT       | FDELG (REQ) | Section 20.1  | +P  | CB_LAYOUTRECALL         | OPT       | pNFS (REQ)  | Section 20.3  | +NS*| CB_NOTIFY               | OPT       | DDELG (REQ) | Section 20.4  | +P  | CB_NOTIFY_DEVICEID      | OPT       | pNFS (OPT)  | Section 20.12 | +NS*| CB_NOTIFY_LOCK          | OPT       |             | Section 20.11 | +NS*| CB_PUSH_DELEG           | OPT       | FDELG (OPT) | Section 20.5  | +   | CB_RECALL               | OPT       | FDELG,      | Section 20.2  | +   |                         |           | DDELG, pNFS |               | +   |                         |           | (REQ)       |               | +NS*| CB_RECALL_ANY           | OPT       | FDELG,      | Section 20.6  | +   |                         |           | DDELG, pNFS |               | +   |                         |           | (REQ)       |               | +NS | CB_RECALL_SLOT          | REQ       |             | Section 20.8  | +NS*| CB_RECALLABLE_OBJ_AVAIL | OPT       | DDELG, pNFS | Section 20.7  | +   |                         |           | (REQ)       |               | +I  | CB_SEQUENCE             | OPT       | FDELG,      | Section 20.9  | +   |                         |           | DDELG, pNFS |               | +   |                         |           | (REQ)       |               | +NS*| CB_WANTS_CANCELLED      | OPT       | FDELG,      | Section 20.10 | +   |                         |           | DDELG, pNFS |               | +   |                         |           | (REQ)       |               | +   +-------------------------+-----------+-------------+---------------+ + +Implementation notes: + +DELEGPURGE: +* mandatory only for servers that support CLAIM_DELEGATE_PREV and/or +  CLAIM_DELEG_PREV_FH (which allows clients to keep delegations that +  persist across client reboots).  Thus we need not implement this for +  now. 
+ +EXCHANGE_ID: +* only SP4_NONE state protection supported +* implementation ids are ignored + +CREATE_SESSION: +* backchannel attributes are ignored +* backchannel security parameters are ignored + +SEQUENCE: +* no support for dynamic slot table renegotiation (optional) + +nfsv4.1 COMPOUND rules: +The following cases aren't supported yet: +* Enforcing of NFS4ERR_NOT_ONLY_OP for: BIND_CONN_TO_SESSION, CREATE_SESSION, +  DESTROY_CLIENTID, DESTROY_SESSION, EXCHANGE_ID. +* DESTROY_SESSION MUST be the final operation in the COMPOUND request. + +Nonstandard compound limitations: +* No support for a sessions fore channel RPC compound that requires both a +  ca_maxrequestsize request and a ca_maxresponsesize reply, so we may +  fail to live up to the promise we made in CREATE_SESSION fore channel +  negotiation. +* No more than one IO operation (read, write, readdir) allowed per +  compound. diff --git a/Documentation/filesystems/nfs/nfsroot.txt b/Documentation/filesystems/nfs/nfsroot.txt new file mode 100644 index 00000000000..3ba0b945aaf --- /dev/null +++ b/Documentation/filesystems/nfs/nfsroot.txt @@ -0,0 +1,270 @@ +Mounting the root filesystem via NFS (nfsroot) +=============================================== + +Written 1996 by Gero Kuhlmann <gero@gkminix.han.de> +Updated 1997 by Martin Mares <mj@atrey.karlin.mff.cuni.cz> +Updated 2006 by Nico Schottelius <nico-kernel-nfsroot@schottelius.org> +Updated 2006 by Horms <horms@verge.net.au> + + + +In order to use a diskless system, such as an X-terminal or printer server +for example, it is necessary for the root filesystem to be present on a +non-disk device. This may be an initramfs (see Documentation/filesystems/ +ramfs-rootfs-initramfs.txt), a ramdisk (see Documentation/initrd.txt) or a +filesystem mounted via NFS. The following text describes on how to use NFS +for the root filesystem. For the rest of this text 'client' means the +diskless system, and 'server' means the NFS server. + + + + +1.) Enabling nfsroot capabilities +    ----------------------------- + +In order to use nfsroot, NFS client support needs to be selected as +built-in during configuration. Once this has been selected, the nfsroot +option will become available, which should also be selected. + +In the networking options, kernel level autoconfiguration can be selected, +along with the types of autoconfiguration to support. Selecting all of +DHCP, BOOTP and RARP is safe. + + + + +2.) Kernel command line +    ------------------- + +When the kernel has been loaded by a boot loader (see below) it needs to be +told what root fs device to use. And in the case of nfsroot, where to find +both the server and the name of the directory on the server to mount as root. +This can be established using the following kernel command line parameters: + + +root=/dev/nfs + +  This is necessary to enable the pseudo-NFS-device. Note that it's not a +  real device but just a synonym to tell the kernel to use NFS instead of +  a real device. + + +nfsroot=[<server-ip>:]<root-dir>[,<nfs-options>] + +  If the `nfsroot' parameter is NOT given on the command line, +  the default "/tftpboot/%s" will be used. + +  <server-ip>	Specifies the IP address of the NFS server. +		The default address is determined by the `ip' parameter +		(see below). This parameter allows the use of different +		servers for IP autoconfiguration and NFS. + +  <root-dir>	Name of the directory on the server to mount as root. 
+		If there is a "%s" token in the string, it will be +		replaced by the ASCII-representation of the client's +		IP address. + +  <nfs-options>	Standard NFS options. All options are separated by commas. +		The following defaults are used: +			port		= as given by server portmap daemon +			rsize		= 4096 +			wsize		= 4096 +			timeo		= 7 +			retrans		= 3 +			acregmin	= 3 +			acregmax	= 60 +			acdirmin	= 30 +			acdirmax	= 60 +			flags		= hard, nointr, noposix, cto, ac + + +ip=<client-ip>:<server-ip>:<gw-ip>:<netmask>:<hostname>:<device>:<autoconf> + +  This parameter tells the kernel how to configure IP addresses of devices +  and also how to set up the IP routing table. It was originally called +  `nfsaddrs', but now the boot-time IP configuration works independently of +  NFS, so it was renamed to `ip' and the old name remained as an alias for +  compatibility reasons. + +  If this parameter is missing from the kernel command line, all fields are +  assumed to be empty, and the defaults mentioned below apply. In general +  this means that the kernel tries to configure everything using +  autoconfiguration. + +  The <autoconf> parameter can appear alone as the value to the `ip' +  parameter (without all the ':' characters before).  If the value is +  "ip=off" or "ip=none", no autoconfiguration will take place, otherwise +  autoconfiguration will take place.  The most common way to use this +  is "ip=dhcp". + +  <client-ip>	IP address of the client. + +  		Default:  Determined using autoconfiguration. + +  <server-ip>	IP address of the NFS server. If RARP is used to determine +		the client address and this parameter is NOT empty only +		replies from the specified server are accepted. + +		Only required for NFS root. That is autoconfiguration +		will not be triggered if it is missing and NFS root is not +		in operation. + +		Default: Determined using autoconfiguration. +		         The address of the autoconfiguration server is used. + +  <gw-ip>	IP address of a gateway if the server is on a different subnet. + +		Default: Determined using autoconfiguration. + +  <netmask>	Netmask for local network interface. If unspecified +		the netmask is derived from the client IP address assuming +		classful addressing. + +		Default:  Determined using autoconfiguration. + +  <hostname>	Name of the client. May be supplied by autoconfiguration, +  		but its absence will not trigger autoconfiguration. + +  		Default: Client IP address is used in ASCII notation. + +  <device>	Name of network device to use. + +		Default: If the host only has one device, it is used. +			 Otherwise the device is determined using +			 autoconfiguration. This is done by sending +			 autoconfiguration requests out of all devices, +			 and using the device that received the first reply. + +  <autoconf>	Method to use for autoconfiguration. In the case of options +                which specify multiple autoconfiguration protocols, +		requests are sent using all protocols, and the first one +		to reply is used. + +		Only autoconfiguration protocols that have been compiled +		into the kernel will be used, regardless of the value of +		this option. 
+ +                  off or none: don't use autoconfiguration +				(do static IP assignment instead) +		  on or any:   use any protocol available in the kernel +			       (default) +		  dhcp:        use DHCP +		  bootp:       use BOOTP +		  rarp:        use RARP +		  both:        use both BOOTP and RARP but not DHCP +		               (old option kept for backwards compatibility) + +                Default: any + + + + +3.) Boot Loader +    ---------- + +To get the kernel into memory different approaches can be used. +They depend on various facilities being available: + + +3.1)  Booting from a floppy using syslinux + +	When building kernels, an easy way to create a boot floppy that uses +	syslinux is to use the zdisk or bzdisk make targets which use zimage +      	and bzimage images respectively. Both targets accept the +     	FDARGS parameter which can be used to set the kernel command line. + +	e.g. +	   make bzdisk FDARGS="root=/dev/nfs" + +   	Note that the user running this command will need to have +     	access to the floppy drive device, /dev/fd0 + +     	For more information on syslinux, including how to create bootdisks +     	for prebuilt kernels, see http://syslinux.zytor.com/ + +	N.B: Previously it was possible to write a kernel directly to +	     a floppy using dd, configure the boot device using rdev, and +	     boot using the resulting floppy. Linux no longer supports this +	     method of booting. + +3.2) Booting from a cdrom using isolinux + +     	When building kernels, an easy way to create a bootable cdrom that +     	uses isolinux is to use the isoimage target which uses a bzimage +     	image. Like zdisk and bzdisk, this target accepts the FDARGS +     	parameter which can be used to set the kernel command line. + +	e.g. +	  make isoimage FDARGS="root=/dev/nfs" + +     	The resulting iso image will be arch/<ARCH>/boot/image.iso +     	This can be written to a cdrom using a variety of tools including +     	cdrecord. + +	e.g. +	  cdrecord dev=ATAPI:1,0,0 arch/i386/boot/image.iso + +     	For more information on isolinux, including how to create bootdisks +     	for prebuilt kernels, see http://syslinux.zytor.com/ + +3.3) Using LILO +	When using LILO all the necessary command line parameters may be +	specified using the 'append=' directive in the LILO configuration +	file. + +	However, to use the 'root=' directive you also need to create +	a dummy root device, which may be removed after LILO is run. + +	mknod /dev/boot255 c 0 255 + +	For information on configuring LILO, please refer to its documentation. + +3.4) Using GRUB +	When using GRUB, kernel parameters are simply appended after the kernel +	specification: kernel <kernel> <parameters> + +3.5) Using loadlin +	loadlin may be used to boot Linux from a DOS command prompt without +	requiring a local hard disk to mount as root. This has not been +	thoroughly tested by the authors of this document, but in general +	it should be possible to configure the kernel command line similarly +	to the configuration of LILO. + +	Please refer to the loadlin documentation for further information. + +3.6) Using a boot ROM +	This is probably the most elegant way of booting a diskless client. +	With a boot ROM the kernel is loaded using the TFTP protocol. The +	authors of this document are not aware of any non-commercial boot +	ROMs that support booting Linux over the network.
However, there +	are two free implementations of a boot ROM, netboot-nfs and +	etherboot, both of which are available on sunsite.unc.edu, and both +	of which contain everything you need to boot a diskless Linux client. + +3.7) Using pxelinux +	Pxelinux may be used to boot Linux using the PXE boot loader +	which is present on many modern network cards. + +	When using pxelinux, the kernel image is specified using +	"kernel <relative-path-below /tftpboot>". The nfsroot parameters +	are passed to the kernel by adding them to the "append" line. +	It is common to use serial console in conjunction with pxelinux; +	see Documentation/serial-console.txt for more information. + +	For more information on pxelinux, including how to create bootdisks +	for prebuilt kernels, see http://syslinux.zytor.com/ + + + + +4.) Credits +    ------- + +  The nfsroot code in the kernel and the RARP support have been written +  by Gero Kuhlmann <gero@gkminix.han.de>. + +  The rest of the IP layer autoconfiguration code has been written +  by Martin Mares <mj@atrey.karlin.mff.cuni.cz>. + +  In order to write the initial version of nfsroot I would like to thank +  Jens-Uwe Mager <jum@anubis.han.de> for his help. diff --git a/Documentation/filesystems/nfs/rpc-cache.txt b/Documentation/filesystems/nfs/rpc-cache.txt new file mode 100644 index 00000000000..8a382bea680 --- /dev/null +++ b/Documentation/filesystems/nfs/rpc-cache.txt @@ -0,0 +1,202 @@ +	This document gives a brief introduction to the caching +mechanisms in the sunrpc layer that is used, in particular, +for NFS authentication. + +CACHES +====== +The caching replaces the old exports table and allows for +a wide variety of values to be cached. + +There are a number of caches that are similar in structure though +quite possibly very different in content and use.  There is a corpus +of common code for managing these caches. + +Examples of caches that are likely to be needed are: +  - mapping from IP address to client name +  - mapping from client name and filesystem to export options +  - mapping from UID to list of GIDs, to work around NFS's limitation +    of 16 gids. +  - mappings between local UID/GID and remote UID/GID for sites that +    do not have uniform uid assignment +  - mapping from network identity to public key for crypto authentication. + +The common code handles such things as: +   - general cache lookup with correct locking +   - supporting 'NEGATIVE' as well as positive entries +   - allowing an EXPIRED time on cache items, and removing +     items after they expire, and are no longer in-use. +   - making requests to user-space to fill in cache entries +   - allowing user-space to directly set entries in the cache +   - delaying RPC requests that depend on as-yet incomplete +     cache entries, and replaying those requests when the cache entry +     is complete. +   - cleaning out old entries as they expire. + +Creating a Cache +---------------- + +1/ A cache needs a datum to store.  This is in the form of a +   structure definition that must contain a +     struct cache_head +   as an element, usually the first. +   It will also contain a key and some content. +   Each cache element is reference counted and contains +   expiry and update times for use in cache management. +2/ A cache needs a "cache_detail" structure that +   describes the cache.  This stores the hash table, some +   parameters for cache management, and some operations detailing how +   to work with particular cache items.
+   The operations required are: +   	struct cache_head *alloc(void) +		This simply allocates appropriate memory and returns +   		a pointer to the cache_head embedded within the +		structure. +	void cache_put(struct kref *) +		This is called when the last reference to an item is +		dropped.  The pointer passed is to the 'ref' field +		in the cache_head.  cache_put should release any +		references created by 'cache_init' and, if CACHE_VALID +		is set, any references created by cache_update. +		It should then release the memory allocated by +   		'alloc'. +        int match(struct cache_head *orig, struct cache_head *new) +		Test if the keys in the two structures match.  Return +		1 if they do, 0 if they don't. +	void init(struct cache_head *orig, struct cache_head *new) +		Set the 'key' fields in 'new' from 'orig'.  This may +		include taking references to shared objects. +	void update(struct cache_head *orig, struct cache_head *new) +		Set the 'content' fields in 'new' from 'orig'. +	int cache_show(struct seq_file *m, struct cache_detail *cd, +			struct cache_head *h) +		Optional.  Used to provide a /proc file that lists the +		contents of a cache.  This should show one item, +   		usually on just one line. +	int cache_request(struct cache_detail *cd, struct cache_head *h, +   		char **bpp, int *blen) +		Format a request to be sent to user-space for an item +   		to be instantiated.  *bpp is a buffer of size *blen. +		bpp should be moved forward over the encoded message, +		and  *blen should be reduced to show how much free +		space remains.  Return 0 on success or <0 if not +		enough room or other problem. +	int cache_parse(struct cache_detail *cd, char *buf, int len) +		A message from user space has arrived to fill out a +		cache entry.  It is in 'buf' of length 'len'. +		cache_parse should parse this, find the item in the +		cache with sunrpc_cache_lookup, and update the item +		with sunrpc_cache_update. + + +3/ A cache needs to be registered using cache_register().  This +   includes it on a list of caches that will be regularly +   cleaned to discard old data. + +Using a cache +------------- + +To find a value in a cache, call sunrpc_cache_lookup passing a pointer +to the cache_head in a sample item with the 'key' fields filled in. +This will be passed to ->match to identify the target entry.  If no +entry is found, a new entry will be created, added to the cache, and +marked as not containing valid data. + +The item returned is typically passed to cache_check which will check +if the data is valid, and may initiate an up-call to get fresh data. +cache_check will return -ENOENT if the entry is negative or if an up +call is needed but not possible, -EAGAIN if an upcall is pending, +or 0 if the data is valid. + +cache_check can be passed a "struct cache_req *".  This structure is +typically embedded in the actual request and can be used to create a +deferred copy of the request (struct cache_deferred_req).  This is +done when the found cache item is not uptodate, but there is reason to +believe that userspace might provide information soon.  When the cache +item does become valid, the deferred copy of the request will be +revisited (->revisit).  It is expected that this method will +reschedule the request for processing.
The value returned by sunrpc_cache_lookup can also be passed to +sunrpc_cache_update to set the content for the item.  A second item is +passed which should hold the content.  If the item found by _lookup +has valid data, then it is discarded and a new item is created. This +saves any user of an item from worrying about content changing while +it is being inspected.  If the item found by _lookup does not contain +valid data, then the content is copied across and CACHE_VALID is set. + +Populating a cache +------------------ + +Each cache has a name, and when the cache is registered, a directory +with that name is created in /proc/net/rpc. + +This directory contains a file called 'channel' which is a channel +for communicating between kernel and user for populating the cache. +This directory may later contain other files for interacting +with the cache. + +The 'channel' works a bit like a datagram socket. Each 'write' is +passed as a whole to the cache for parsing and interpretation. +Each cache can treat the write requests differently, but it is +expected that a message written will contain: +  - a key +  - an expiry time +  - a content. +with the intention that an item in the cache with the given key +should be created or updated to have the given content, and the +expiry time should be set on that item. + +Reading from a channel is a bit more interesting.  When a cache +lookup fails, or when it succeeds but finds an entry that may soon +expire, a request is lodged for that cache item to be updated by +user-space.  These requests appear in the channel file. + +Successive reads will return successive requests. +If there are no more requests to return, read will return EOF, but a +select or poll for read will block waiting for another request to be +added. + +Thus a user-space helper is likely to: +  open the channel. +    select for readable +    read a request +    write a response +  loop. + +If it dies and needs to be restarted, any requests that have not been +answered will still appear in the file and will be read by the new +instance of the helper. + +Each cache should define a "cache_parse" method which takes a message +written from user-space and processes it.  It should return an error +(which propagates back to the write syscall) or 0. + +Each cache should also define a "cache_request" method which +takes a cache item and encodes a request into the buffer +provided. + +Note: If a cache has no active readers on the channel, and has not had +active readers for more than 60 seconds, further requests will not be +added to the channel but instead all lookups that do not find a valid +entry will fail.  This is partly for backward compatibility: The +previous nfs exports table was deemed to be authoritative and a +failed lookup meant a definite 'no'. + +request/response format +----------------------- + +While each cache is free to use its own format for requests +and responses over the channel, the following is recommended as +appropriate and support routines are available to help: +Each request or response record should be printable ASCII +with precisely one newline character which should be at the end. +Fields within the record should be separated by spaces, normally one. +If spaces, newlines, or nul characters are needed in a field they +must be quoted.  Two mechanisms are available: +1/ If a field begins '\x' then it must contain an even number of +   hex digits, and pairs of these digits provide the bytes in the +   field. +2/ Otherwise, a \ in the field must be followed by 3 octal digits +   which give the code for a byte.  Other characters are treated +   as themselves.  At the very least, space, newline, nul, and +   '\' must be quoted in this way.
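As a small illustration of mechanism 1, a shell helper could hex-encode
a field containing spaces along the following lines (a sketch only,
using od(1) to produce the hex pairs):

	encode_field()
	{
		# emit '\x' followed by two hex digits per byte, so that
		# spaces, newlines and nuls survive the one-line record
		printf '\\x'
		printf '%s' "$1" | od -An -tx1 | tr -d ' \n'
	}

	encode_field "client name"	# prints \x636c69656e74206e616d65

A record is then assembled from such fields, separated by single spaces
and terminated by exactly one newline, as described above.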
