Merge master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6

* master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6: (448 commits) [IPV4] nl_fib_lookup: Initialise res.r before fib_res_put(&res) [IPV6]: Fix thinko in ipv6_rthdr_rcv() changes. [IPV4]: Add multipath cached to feature-removal-schedule.txt [WIRELESS] cfg80211: Clarify locking comment. [WIRELESS] cfg80211: Fix locking in wiphy_new. [WEXT] net_device: Don't include wext bits if not required. [WEXT]: Misc code cleanups. [WEXT]: Reduce inline abuse. [WEXT]: Move EXPORT_SYMBOL statements where they belong. [WEXT]: Cleanup early ioctl call path. [WEXT]: Remove options. [WEXT]: Remove dead debug code. [WEXT]: Clean up how wext is called. [WEXT]: Move to net/wireless [AFS]: Eliminate cmpxchg() usage in vlocation code. [RXRPC]: Fix pointers passed to bitops. [RXRPC]: Remove bogus atomic_* overrides. [AFS]: Fix u64 printing in debug logging. [AFS]: Add "directory write" support. [AFS]: Implement the CB.InitCallBackState3 operation. ...
author: Linus Torvalds <torvalds@woody.linux-foundation.org> 2007-04-27 09:26:46 -0700
committer: Linus Torvalds <torvalds@woody.linux-foundation.org> 2007-04-27 09:26:46 -0700
commit: 15c54033964a943de7b0763efd3bd0ede7326395 (patch)
tree: 840b292612d1b5396d5bab5bde537a9013db3ceb /Documentation
parent: ad5da3cf39a5b11a198929be1f2644e17ecd767e (diff)
parent: 912a41a4ab935ce8c4308428ec13fc7f8b1f18f4 (diff)
9 files changed, 1093 insertions, 118 deletions
diff --git a/Documentation/feature-removal-schedule.txt b/Documentation/feature-removal-schedule.txt
index 19b4c96b2a4..6da663607f7 100644
--- a/Documentation/feature-removal-schedule.txt
+++ b/Documentation/feature-removal-schedule.txt
@@ -211,15 +211,6 @@ Who:   Adrian Bunk <bunk@stusta.de>
 
 ---------------------------
 
-What:	IPv4 only connection tracking/NAT/helpers
-When:	2.6.22
-Why:	The new layer 3 independant connection tracking replaces the old
-	IPv4 only version. After some stabilization of the new code the
-	old one will be removed.
-Who:	Patrick McHardy <kaber@trash.net>
-
----------------------------
-
 What:	ACPI hooks (X86_SPEEDSTEP_CENTRINO_ACPI) in speedstep-centrino driver
 When:	December 2006
 Why:	Speedstep-centrino driver with ACPI hooks and acpi-cpufreq driver are
@@ -294,18 +285,6 @@ Who:	Richard Purdie <rpurdie@rpsys.net>
 
 ---------------------------
 
-What:	Wireless extensions over netlink (CONFIG_NET_WIRELESS_RTNETLINK)
-When:	with the merge of wireless-dev, 2.6.22 or later
-Why:	The option/code is
-	 * not enabled on most kernels
-	 * not required by any userspace tools (except an experimental one,
-	   and even there only for some parts, others use ioctl)
-	 * pointless since wext is no longer evolving and the ioctl
-	   interface needs to be kept
-Who:	Johannes Berg <johannes@sipsolutions.net>
-
----------------------------
-
 What:	i8xx_tco watchdog driver
 When:	in 2.6.22
 Why:	the i8xx_tco watchdog driver has been replaced by the iTCO_wdt
@@ -313,3 +292,22 @@ Why:	the i8xx_tco watchdog driver has been replaced by the iTCO_wdt
 Who:	Wim Van Sebroeck <wim@iguana.be>
 
 ---------------------------
+
+What:	Multipath cached routing support in ipv4
+When:	in 2.6.23
+Why:	Code was merged, then submitter immediately disappeared leaving
+	us with no maintainer and lots of bugs.  The code should not have
+	been merged in the first place, and many aspects of it's
+	implementation are blocking more critical core networking
+	development.  It's marked EXPERIMENTAL and no distribution
+	enables it because it cause obscure crashes due to unfixable bugs
+	(interfaces don't return errors so memory allocation can't be
+	handled, calling contexts of these interfaces make handling
+	errors impossible too because they get called after we've
+	totally commited to creating a route object, for example).
+	This problem has existed for years and no forward progress
+	has ever been made, and nobody steps up to try and salvage
+	this code, so we're going to finally just get rid of it.
+Who:	David S. Miller <davem@davemloft.net>
+
+---------------------------
diff --git a/Documentation/filesystems/afs.txt b/Documentation/filesystems/afs.txt
index 2f4237dfb8c..12ad6c7f4e5 100644
--- a/Documentation/filesystems/afs.txt
+++ b/Documentation/filesystems/afs.txt
@@ -1,31 +1,82 @@
+			     ====================
 			     kAFS: AFS FILESYSTEM
 			     ====================
 
-ABOUT
-=====
+Contents:
+
+ - Overview.
+ - Usage.
+ - Mountpoints.
+ - Proc filesystem.
+ - The cell database.
+ - Security.
+ - Examples.
+
+
+========
+OVERVIEW
+========
 
-This filesystem provides a fairly simple AFS filesystem driver. It is under
-development and only provides very basic facilities. It does not yet support
-the following AFS features:
+This filesystem provides a fairly simple secure AFS filesystem driver. It is
+under development and does not yet provide the full feature set.  The features
+it does support include:
 
-	(*) Write support.
-	(*) Communications security.
-	(*) Local caching.
-	(*) pioctl() system call.
-	(*) Automatic mounting of embedded mountpoints.
+ (*) Security (currently only AFS kaserver and KerberosIV tickets).
 
+ (*) File reading.
 
+ (*) Automounting.
+
+It does not yet support the following AFS features:
+
+ (*) Write support.
+
+ (*) Local caching.
+
+ (*) pioctl() system call.
+
+
+===========
+COMPILATION
+===========
+
+The filesystem should be enabled by turning on the kernel configuration
+options:
+
+	CONFIG_AF_RXRPC		- The RxRPC protocol transport
+	CONFIG_RXKAD		- The RxRPC Kerberos security handler
+	CONFIG_AFS		- The AFS filesystem
+
+Additionally, the following can be turned on to aid debugging:
+
+	CONFIG_AF_RXRPC_DEBUG	- Permit AF_RXRPC debugging to be enabled
+	CONFIG_AFS_DEBUG	- Permit AFS debugging to be enabled
+
+They permit the debugging messages to be turned on dynamically by manipulating
+the masks in the following files:
+
+	/sys/module/af_rxrpc/parameters/debug
+	/sys/module/afs/parameters/debug
+
+
+=====
 USAGE
 =====
 
 When inserting the driver modules the root cell must be specified along with a
 list of volume location server IP addresses:
 
-	insmod rxrpc.o
+	insmod af_rxrpc.o
+	insmod rxkad.o
 	insmod kafs.o rootcell=cambridge.redhat.com:172.16.18.73:172.16.18.91
 
-The first module is a driver for the RxRPC remote operation protocol, and the
-second is the actual filesystem driver for the AFS filesystem.
+The first module is the AF_RXRPC network protocol driver.  This provides the
+RxRPC remote operation protocol and may also be accessed from userspace.  See:
+
+	Documentation/networking/rxrpc.txt
+
+The second module is the kerberos RxRPC security driver, and the third module
+is the actual filesystem driver for the AFS filesystem.
 
 Once the module has been loaded, more modules can be added by the following
 procedure:
@@ -33,7 +84,7 @@ procedure:
 	echo add grand.central.org 18.7.14.88:128.2.191.224 >/proc/fs/afs/cells
 
 Where the parameters to the "add" command are the name of a cell and a list of
-volume location servers within that cell.
+volume location servers within that cell, with the latter separated by colons.
 
 Filesystems can be mounted anywhere by commands similar to the following:
 
@@ -42,11 +93,6 @@ Filesystems can be mounted anywhere by commands similar to the following:
 	mount -t afs "#root.afs." /afs
 	mount -t afs "#root.cell." /afs/cambridge
 
-  NB: When using this on Linux 2.4, the mount command has to be different,
-      since the filesystem doesn't have access to the device name argument:
-
-	mount -t afs none /afs -ovol="#root.afs."
-
 Where the initial character is either a hash or a percent symbol depending on
 whether you definitely want a R/W volume (hash) or whether you'd prefer a R/O
 volume, but are willing to use a R/W volume instead (percent).
@@ -60,55 +106,66 @@ named volume will be looked up in the cell specified during insmod.
 Additional cells can be added through /proc (see later section).
 
 
+===========
 MOUNTPOINTS
 ===========
 
-AFS has a concept of mountpoints. These are specially formatted symbolic links
-(of the same form as the "device name" passed to mount). kAFS presents these
-to the user as directories that have special properties:
+AFS has a concept of mountpoints. In AFS terms, these are specially formatted
+symbolic links (of the same form as the "device name" passed to mount).  kAFS
+presents these to the user as directories that have a follow-link capability
+(ie: symbolic link semantics).  If anyone attempts to access them, they will
+automatically cause the target volume to be mounted (if possible) on that site.
 
-  (*) They cannot be listed. Running a program like "ls" on them will incur an
-      EREMOTE error (Object is remote).
+Automatically mounted filesystems will be automatically unmounted approximately
+twenty minutes after they were last used.  Alternatively they can be unmounted
+directly with the umount() system call.
 
-  (*) Other objects can't be looked up inside of them. This also incurs an
-      EREMOTE error.
+Manually unmounting an AFS volume will cause any idle submounts upon it to be
+culled first.  If all are culled, then the requested volume will also be
+unmounted, otherwise error EBUSY will be returned.
 
-  (*) They can be queried with the readlink() system call, which will return
-      the name of the mountpoint to which they point. The "readlink" program
-      will also work.
+This can be used by the administrator to attempt to unmount the whole AFS tree
+mounted on /afs in one go by doing:
 
-  (*) They can be mounted on (which symbolic links can't).
+	umount /afs
 
 
+===============
 PROC FILESYSTEM
 ===============
 
-The rxrpc module creates a number of files in various places in the /proc
-filesystem:
-
-  (*) Firstly, some information files are made available in a directory called
-      "/proc/net/rxrpc/". These list the extant transport endpoint, peer,
-      connection and call records.
-
-  (*) Secondly, some control files are made available in a directory called
-      "/proc/sys/rxrpc/". Currently, all these files can be used for is to
-      turn on various levels of tracing.
-
 The AFS modules creates a "/proc/fs/afs/" directory and populates it:
 
-  (*) A "cells" file that lists cells currently known to the afs module.
+  (*) A "cells" file that lists cells currently known to the afs module and
+      their usage counts:
+
+	[root@andromeda ~]# cat /proc/fs/afs/cells
+	USE NAME
+	  3 cambridge.redhat.com
 
   (*) A directory per cell that contains files that list volume location
       servers, volumes, and active servers known within that cell.
 
+	[root@andromeda ~]# cat /proc/fs/afs/cambridge.redhat.com/servers
+	USE ADDR            STATE
+	  4 172.16.18.91        0
+	[root@andromeda ~]# cat /proc/fs/afs/cambridge.redhat.com/vlservers
+	ADDRESS
+	172.16.18.91
+	[root@andromeda ~]# cat /proc/fs/afs/cambridge.redhat.com/volumes
+	USE STT VLID[0]  VLID[1]  VLID[2]  NAME
+	  1 Val 20000000 20000001 20000002 root.afs
 
+
+=================
 THE CELL DATABASE
 =================
 
-The filesystem maintains an internal database of all the cells it knows and
-the IP addresses of the volume location servers for those cells. The cell to
-which the computer belongs is added to the database when insmod is performed
-by the "rootcell=" argument.
+The filesystem maintains an internal database of all the cells it knows and the
+IP addresses of the volume location servers for those cells.  The cell to which
+the system belongs is added to the database when insmod is performed by the
+"rootcell=" argument or, if compiled in, using a "kafs.rootcell=" argument on
+the kernel command line.
 
 Further cells can be added by commands similar to the following:
 
@@ -118,20 +175,65 @@ Further cells can be added by commands similar to the following:
 No other cell database operations are available at this time.
 
 
+========
+SECURITY
+========
+
+Secure operations are initiated by acquiring a key using the klog program.  A
+very primitive klog program is available at:
+
+	http://people.redhat.com/~dhowells/rxrpc/klog.c
+
+This should be compiled by:
+
+	make klog LDLIBS="-lcrypto -lcrypt -lkrb4 -lkeyutils"
+
+And then run as:
+
+	./klog
+
+Assuming it's successful, this adds a key of type RxRPC, named for the service
+and cell, eg: "afs@<cellname>".  This can be viewed with the keyctl program or
+by cat'ing /proc/keys:
+
+	[root@andromeda ~]# keyctl show
+	Session Keyring
+	       -3 --alswrv      0     0  keyring: _ses.3268
+		2 --alswrv      0     0   \_ keyring: _uid.0
+	111416553 --als--v      0     0   \_ rxrpc: afs@CAMBRIDGE.REDHAT.COM
+
+Currently the username, realm, password and proposed ticket lifetime are
+compiled in to the program.
+
+It is not required to acquire a key before using AFS facilities, but if one is
+not acquired then all operations will be governed by the anonymous user parts
+of the ACLs.
+
+If a key is acquired, then all AFS operations, including mounts and automounts,
+made by a possessor of that key will be secured with that key.
+
+If a file is opened with a particular key and then the file descriptor is
+passed to a process that doesn't have that key (perhaps over an AF_UNIX
+socket), then the operations on the file will be made with key that was used to
+open the file.
+
+
+========
 EXAMPLES
 ========
 
-Here's what I use to test this. Some of the names and IP addresses are local
-to my internal DNS. My "root.afs" partition has a mount point within it for
+Here's what I use to test this.  Some of the names and IP addresses are local
+to my internal DNS.  My "root.afs" partition has a mount point within it for
 some public volumes volumes.
 
-insmod -S /tmp/rxrpc.o 
-insmod -S /tmp/kafs.o rootcell=cambridge.redhat.com:172.16.18.73:172.16.18.91
+insmod /tmp/rxrpc.o
+insmod /tmp/rxkad.o
+insmod /tmp/kafs.o rootcell=cambridge.redhat.com:172.16.18.91
 
 mount -t afs \%root.afs. /afs
 mount -t afs \%cambridge.redhat.com:root.cell. /afs/cambridge.redhat.com/
 
-echo add grand.central.org 18.7.14.88:128.2.191.224 > /proc/fs/afs/cells 
+echo add grand.central.org 18.7.14.88:128.2.191.224 > /proc/fs/afs/cells
 mount -t afs "#grand.central.org:root.cell." /afs/grand.central.org/
 mount -t afs "#grand.central.org:root.archive." /afs/grand.central.org/archive
 mount -t afs "#grand.central.org:root.contrib." /afs/grand.central.org/contrib
@@ -141,15 +243,7 @@ mount -t afs "#grand.central.org:root.service." /afs/grand.central.org/service
 mount -t afs "#grand.central.org:root.software." /afs/grand.central.org/software
 mount -t afs "#grand.central.org:root.user." /afs/grand.central.org/user
 
-umount /afs/grand.central.org/user
-umount /afs/grand.central.org/software
-umount /afs/grand.central.org/service
-umount /afs/grand.central.org/project
-umount /afs/grand.central.org/doc
-umount /afs/grand.central.org/contrib
-umount /afs/grand.central.org/archive
-umount /afs/grand.central.org
-umount /afs/cambridge.redhat.com
 umount /afs
 rmmod kafs
+rmmod rxkad
 rmmod rxrpc
diff --git a/Documentation/filesystems/proc.txt b/Documentation/filesystems/proc.txt
index 5484ab5efd4..7aaf09b86a5 100644
--- a/Documentation/filesystems/proc.txt
+++ b/Documentation/filesystems/proc.txt
@@ -1421,6 +1421,15 @@ fewer messages that will be written. Message_burst controls when messages will
 be dropped.  The  default  settings  limit  warning messages to one every five
 seconds.
 
+warnings
+--------
+
+This controls console messages from the networking stack that can occur because
+of problems on the network like duplicate address or bad checksums. Normally,
+this should be enabled, but if the problem persists the messages can be
+disabled.
+
+
 netdev_max_backlog
 ------------------
 
diff --git a/Documentation/keys.txt b/Documentation/keys.txt
index 60c665d9cfa..81d9aa09729 100644
--- a/Documentation/keys.txt
+++ b/Documentation/keys.txt
@@ -859,6 +859,18 @@ payload contents" for more information.
 	void unregister_key_type(struct key_type *type);
 
 
+Under some circumstances, it may be desirable to desirable to deal with a
+bundle of keys.  The facility provides access to the keyring type for managing
+such a bundle:
+
+	struct key_type key_type_keyring;
+
+This can be used with a function such as request_key() to find a specific
+keyring in a process's keyrings.  A keyring thus found can then be searched
+with keyring_search().  Note that it is not possible to use request_key() to
+search a specific keyring, so using keyrings in this way is of limited utility.
+
+
 ===================================
 NOTES ON ACCESSING PAYLOAD CONTENTS
 ===================================
diff --git a/Documentation/networking/bonding.txt b/Documentation/networking/bonding.txt
index de809e58092..1da56663083 100644
--- a/Documentation/networking/bonding.txt
+++ b/Documentation/networking/bonding.txt
@@ -920,40 +920,9 @@ options, you may wish to use the "max_bonds" module parameter,
 documented above.
 
 	To create multiple bonding devices with differing options, it
-is necessary to load the bonding driver multiple times.  Note that
-current versions of the sysconfig network initialization scripts
-handle this automatically; if your distro uses these scripts, no
-special action is needed.  See the section Configuring Bonding
-Devices, above, if you're not sure about your network initialization
-scripts.
-
-	To load multiple instances of the module, it is necessary to
-specify a different name for each instance (the module loading system
-requires that every loaded module, even multiple instances of the same
-module, have a unique name).  This is accomplished by supplying
-multiple sets of bonding options in /etc/modprobe.conf, for example:
-	
-alias bond0 bonding
-options bond0 -o bond0 mode=balance-rr miimon=100
-
-alias bond1 bonding
-options bond1 -o bond1 mode=balance-alb miimon=50
-
-	will load the bonding module two times.  The first instance is
-named "bond0" and creates the bond0 device in balance-rr mode with an
-miimon of 100.  The second instance is named "bond1" and creates the
-bond1 device in balance-alb mode with an miimon of 50.
-
-	In some circumstances (typically with older distributions),
-the above does not work, and the second bonding instance never sees
-its options.  In that case, the second options line can be substituted
-as follows:
-
-install bond1 /sbin/modprobe --ignore-install bonding -o bond1 \
-	mode=balance-alb miimon=50
+is necessary to use bonding parameters exported by sysfs, documented
+in the section below.
 
-	This may be repeated any number of times, specifying a new and
-unique name in place of bond1 for each subsequent instance.
 
 3.4 Configuring Bonding Manually via Sysfs
 ------------------------------------------
diff --git a/Documentation/networking/dccp.txt b/Documentation/networking/dccp.txt
index 387482e46c4..4504cc59e40 100644
--- a/Documentation/networking/dccp.txt
+++ b/Documentation/networking/dccp.txt
@@ -57,6 +57,16 @@ DCCP_SOCKOPT_SEND_CSCOV is for the receiver and has a different meaning: it
 	coverage value are also acceptable. The higher the number, the more
 	restrictive this setting (see [RFC 4340, sec. 9.2.1]).
 
+The following two options apply to CCID 3 exclusively and are getsockopt()-only.
+In either case, a TFRC info struct (defined in <linux/tfrc.h>) is returned.
+DCCP_SOCKOPT_CCID_RX_INFO
+	Returns a `struct tfrc_rx_info' in optval; the buffer for optval and
+	optlen must be set to at least sizeof(struct tfrc_rx_info).
+DCCP_SOCKOPT_CCID_TX_INFO
+	Returns a `struct tfrc_tx_info' in optval; the buffer for optval and
+	optlen must be set to at least sizeof(struct tfrc_tx_info).
+
+
 Sysctl variables
 ================
 Several DCCP default parameters can be managed by the following sysctls
diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index 702d1d8dd04..af6a63ab902 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -179,11 +179,31 @@ tcp_fin_timeout - INTEGER
 	because they eat maximum 1.5K of memory, but they tend
 	to live longer.	Cf. tcp_max_orphans.
 
-tcp_frto - BOOLEAN
+tcp_frto - INTEGER
 	Enables F-RTO, an enhanced recovery algorithm for TCP retransmission
 	timeouts.  It is particularly beneficial in wireless environments
 	where packet loss is typically due to random radio interference
-	rather than intermediate router congestion.
+	rather than intermediate router congestion. If set to 1, basic
+	version is enabled. 2 enables SACK enhanced F-RTO, which is
+	EXPERIMENTAL. The basic version can be used also when SACK is
+	enabled for a flow through tcp_sack sysctl.
+
+tcp_frto_response - INTEGER
+	When F-RTO has detected that a TCP retransmission timeout was
+	spurious (i.e, the timeout would have been avoided had TCP set a
+	longer retransmission timeout), TCP has several options what to do
+	next. Possible values are:
+		0 Rate halving based; a smooth and conservative response,
+		  results in halved cwnd and ssthresh after one RTT
+		1 Very conservative response; not recommended because even
+		  though being valid, it interacts poorly with the rest of
+		  Linux TCP, halves cwnd and ssthresh immediately
+		2 Aggressive response; undoes congestion control measures
+		  that are now known to be unnecessary (ignoring the
+		  possibility of a lost retransmission that would require
+		  TCP to be more cautious), cwnd and ssthresh are restored
+		  to the values prior timeout
+	Default: 0 (rate halving based)
 
 tcp_keepalive_time - INTEGER
 	How often TCP sends out keepalive messages when keepalive is enabled.
@@ -995,7 +1015,12 @@ bridge-nf-call-ip6tables - BOOLEAN
 	Default: 1
 
 bridge-nf-filter-vlan-tagged - BOOLEAN
-	1 : pass bridged vlan-tagged ARP/IP traffic to arptables/iptables.
+	1 : pass bridged vlan-tagged ARP/IP/IPv6 traffic to {arp,ip,ip6}tables.
+	0 : disable this.
+	Default: 1
+
+bridge-nf-filter-pppoe-tagged - BOOLEAN
+	1 : pass bridged pppoe-tagged IP/IPv6 traffic to {ip,ip6}tables.
 	0 : disable this.
 	Default: 1
 
diff --git a/Documentation/networking/rxrpc.txt b/Documentation/networking/rxrpc.txt
new file mode 100644
index 00000000000..cae231b1c13
--- /dev/null
+++ b/Documentation/networking/rxrpc.txt
@@ -0,0 +1,859 @@
+			    ======================
+			    RxRPC NETWORK PROTOCOL
+			    ======================
+
+The RxRPC protocol driver provides a reliable two-phase transport on top of UDP
+that can be used to perform RxRPC remote operations.  This is done over sockets
+of AF_RXRPC family, using sendmsg() and recvmsg() with control data to send and
+receive data, aborts and errors.
+
+Contents of this document:
+
+ (*) Overview.
+
+ (*) RxRPC protocol summary.
+
+ (*) AF_RXRPC driver model.
+
+ (*) Control messages.
+
+ (*) Socket options.
+
+ (*) Security.
+
+ (*) Example client usage.
+
+ (*) Example server usage.
+
+ (*) AF_RXRPC kernel interface.
+
+
+========
+OVERVIEW
+========
+
+RxRPC is a two-layer protocol.  There is a session layer which provides
+reliable virtual connections using UDP over IPv4 (or IPv6) as the transport
+layer, but implements a real network protocol; and there's the presentation
+layer which renders structured data to binary blobs and back again using XDR
+(as does SunRPC):
+
+		+-------------+
+		| Application |
+		+-------------+
+		|     XDR     |		Presentation
+		+-------------+
+		|    RxRPC    |		Session
+		+-------------+
+		|     UDP     |		Transport
+		+-------------+
+
+
+AF_RXRPC provides:
+
+ (1) Part of an RxRPC facility for both kernel and userspace applications by
+     making the session part of it a Linux network protocol (AF_RXRPC).
+
+ (2) A two-phase protocol.  The client transmits a blob (the request) and then
+     receives a blob (the reply), and the server receives the request and then
+     transmits the reply.
+
+ (3) Retention of the reusable bits of the transport system set up for one call
+     to speed up subsequent calls.
+
+ (4) A secure protocol, using the Linux kernel's key retention facility to
+     manage security on the client end.  The server end must of necessity be
+     more active in security negotiations.
+
+AF_RXRPC does not provide XDR marshalling/presentation facilities.  That is
+left to the application.  AF_RXRPC only deals in blobs.  Even the operation ID
+is just the first four bytes of the request blob, and as such is beyond the
+kernel's interest.
+
+
+Sockets of AF_RXRPC family are:
+
+ (1) created as type SOCK_DGRAM;
+
+ (2) provided with a protocol of the type of underlying transport they're going
+     to use - currently only PF_INET is supported.
+
+
+The Andrew File System (AFS) is an example of an application that uses this and
+that has both kernel (filesystem) and userspace (utility) components.
+
+
+======================
+RXRPC PROTOCOL SUMMARY
+======================
+
+An overview of the RxRPC protocol:
+
+ (*) RxRPC sits on top of another networking protocol (UDP is the only option
+     currently), and uses this to provide network transport.  UDP ports, for
+     example, provide transport endpoints.
+
+ (*) RxRPC supports multiple virtual "connections" from any given transport
+     endpoint, thus allowing the endpoints to be shared, even to the same
+     remote endpoint.
+
+ (*) Each connection goes to a particular "service".  A connection may not go
+     to multiple services.  A service may be considered the RxRPC equivalent of
+     a port number.  AF_RXRPC permits multiple services to share an endpoint.
+
+ (*) Client-originating packets are marked, thus a transport endpoint can be
+     shared between client and server connections (connections have a
+     direction).
+
+ (*) Up to a billion connections may be supported concurrently between one
+     local transport endpoint and one service on one remote endpoint.  An RxRPC
+     connection is described by seven numbers:
+
+	Local address	}
+	Local port	} Transport (UDP) address
+	Remote address	}
+	Remote port	}
+	Direction
+	Connection ID
+	Service ID
+
+ (*) Each RxRPC operation is a "call".  A connection may make up to four
+     billion calls, but only up to four calls may be in progress on a
+     connection at any one time.
+
+ (*) Calls are two-phase and asymmetric: the client sends its request data,
+     which the service receives; then the service transmits the reply data
+     which the client receives.
+
+ (*) The data blobs are of indefinite size, the end of a phase is marked with a
+     flag in the packet.  The number of packets of data making up one blob may
+     not exceed 4 billion, however, as this would cause the sequence number to
+     wrap.
+
+ (*) The first four bytes of the request data are the service operation ID.
+
+ (*) Security is negotiated on a per-connection basis.  The connection is
+     initiated by the first data packet on it arriving.  If security is
+     requested, the server then issues a "challenge" and then the client
+     replies with a "response".  If the response is successful, the security is
+     set for the lifetime of that connection, and all subsequent calls made
+     upon it use that same security.  In the event that the server lets a
+     connection lapse before the client, the security will be renegotiated if
+     the client uses the connection again.
+
+ (*) Calls use ACK packets to handle reliability.  Data packets are also
+     explicitly sequenced per call.
+
+ (*) There are two types of positive acknowledgement: hard-ACKs and soft-ACKs.
+     A hard-ACK indicates to the far side that all the data received to a point
+     has been received and processed; a soft-ACK indicates that the data has
+     been received but may yet be discarded and re-requested.  The sender may
+     not discard any transmittable packets until they've been hard-ACK'd.
+
+ (*) Reception of a reply data packet implicitly hard-ACK's all the data
+     packets that make up the request.
+
+ (*) An call is complete when the request has been sent, the reply has been
+     received and the final hard-ACK on the last packet of the reply has
+     reached the server.
+
+ (*) An call may be aborted by either end at any time up to its completion.
+
+
+=====================
+AF_RXRPC DRIVER MODEL
+=====================
+
+About the AF_RXRPC driver:
+
+ (*) The AF_RXRPC protocol transparently uses internal sockets of the transport
+     protocol to represent transport endpoints.
+
+ (*) AF_RXRPC sockets map onto RxRPC connection bundles.  Actual RxRPC
+     connections are handled transparently.  One client socket may be used to
+     make multiple simultaneous calls to the same service.  One server socket
+     may handle calls from many clients.
+
+ (*) Additional parallel client connections will be initiated to support extra
+     concurrent calls, up to a tunable limit.
+
+ (*) Each connection is retained for a certain amount of time [tunable] after
+     the last call currently using it has completed in case a new call is made
+     that could reuse it.
+
+ (*) Each internal UDP socket is retained [tunable] for a certain amount of
+     time [tunable] after the last connection using it discarded, in case a new
+     connection is made that could use it.
+
+ (*) A client-side connection is only shared between calls if they have have
+     the same key struct describing their security (and assuming the calls
+     would otherwise share the connection).  Non-secured calls would also be
+     able to share connections with each other.
+
+ (*) A server-side connection is shared if the client says it is.
+
+ (*) ACK'ing is handled by the protocol driver automatically, including ping
+     replying.
+
+ (*) SO_KEEPALIVE automatically pings the other side to keep the connection
+     alive [TODO].
+
+ (*) If an ICMP error is received, all calls affected by that error will be
+     aborted with an appropriate network error passed through recvmsg().
+
+
+Interaction with the user of the RxRPC socket:
+
+ (*) A socket is made into a server socket by binding an address with a
+     non-zero service ID.
+
+ (*) In the client, sending a request is achieved with one or more sendmsgs,
+     followed by the reply being received with one or more recvmsgs.
+
+ (*) The first sendmsg for a request to be sent from a client contains a tag to
+     be used in all other sendmsgs or recvmsgs associated with that call.  The
+     tag is carried in the control data.
+
+ (*) connect() is used to supply a default destination address for a client
+     socket.  This may be overridden by supplying an alternate address to the
+     first sendmsg() of a call (struct msghdr::msg_name).
+
+ (*) If connect() is called on an unbound client, a random local port will
+     bound before the operation takes place.
+
+ (*) A server socket may also be used to make client calls.  To do this, the
+     first sendmsg() of the call must specify the target address.  The server's
+     transport endpoint is used to send the packets.
+
+ (*) Once the application has received the last message associated with a call,
+     the tag is guaranteed not to be seen again, and so it can be used to pin
+     client resources.  A new call can then be initiated with the same tag
+     without fear of interference.
+
+ (*) In the server, a request is received with one or more recvmsgs, then the
+     the reply is transmitted with one or more sendmsgs, and then the final ACK
+     is received with a last recvmsg.
+
+ (*) When sending data for a call, sendmsg is given MSG_MORE if there's more
+     data to come on that call.
+
+ (*) When receiving data for a call, recvmsg flags MSG_MORE if there's more
+     data to come for that call.
+
+ (*) When receiving data or messages for a call, MSG_EOR is flagged by recvmsg
+     to indicate the terminal message for that call.
+
+ (*) A call may be aborted by adding an abort control message to the control
+     data.  Issuing an abort terminates the kernel's use of that call's tag.
+     Any messages waiting in the receive queue for that call will be discarded.
+
+ (*) Aborts, busy notifications and challenge packets are delivered by recvmsg,
+     and control data messages will be set to indicate the context.  Receiving
+     an abort or a busy message terminates the kernel's use of that call's tag.
+
+ (*) The control data part of the msghdr struct is used for a number of things:
+
+     (*) The tag of the intended or affected call.
+
+     (*) Sending or receiving errors, aborts and busy notifications.
+
+     (*) Notifications of incoming calls.
+
+     (*) Sending debug requests and receiving debug replies [TODO].
+
+ (*) When the kernel has received and set up an incoming call, it sends a
+     message to server application to let it know there's a new call awaiting
+     its acceptance [recvmsg reports a special control message].  The server
+     application then uses sendmsg to assign a tag to the new call.  Once that
+     is done, the first part of the request data will be delivered by recvmsg.
+
+ (*) The server application has to provide the server socket with a keyring of
+     secret keys corresponding to the security types it permits.  When a secure
+     connection is being set up, the kernel looks up the appropriate secret key
+     in the keyring and then sends a challenge packet to the client and
+     receives a response packet.  The kernel then checks the authorisation of
+     the packet and either aborts the connection or sets up the security.
+
+ (*) The name of the key a client will use to secure its communications is
+     nominated by a socket option.
+
+
+Notes on recvmsg:
+
+ (*) If there's a sequence of data messages belonging to a particular call on
+     the receive queue, then recvmsg will keep working through them until:
+
+     (a) it meets the end of that call's received data,
+
+     (b) it meets a non-data message,
+
+     (c) it meets a message belonging to a different call, or
+
+     (d) it fills the user buffer.
+
+     If recvmsg is called in blocking mode, it will keep sleeping, awaiting the
+     reception of further data, until one of the above four conditions is met.
+
+ (2) MSG_PEEK operates similarly, but will return immediately if it has put any
+     data in the buffer rather than sleeping until it can fill the buffer.
+
+ (3) If a data message is only partially consumed in filling a user buffer,
+     then the remainder of that message will be left on the front of the queue
+     for the next taker.  MSG_TRUNC will never be flagged.
+
+ (4) If there is more data to be had on a call (it hasn't copied the last byte
+     of the last data message in that phase yet), then MSG_MORE will be
+     flagged.
+
+
+================
+CONTROL MESSAGES
+================
+
+AF_RXRPC makes use of control messages in sendmsg() and recvmsg() to multiplex
+calls, to invoke certain actions and to report certain conditions.  These are:
+
+	MESSAGE ID		SRT DATA	MEANING
+	=======================	=== ===========	===============================
+	RXRPC_USER_CALL_ID	sr- User ID	App's call specifier
+	RXRPC_ABORT		srt Abort code	Abort code to issue/received
+	RXRPC_ACK		-rt n/a		Final ACK received
+	RXRPC_NET_ERROR		-rt error num	Network error on call
+	RXRPC_BUSY		-rt n/a		Call rejected (server busy)
+	RXRPC_LOCAL_ERROR	-rt error num	Local error encountered
+	RXRPC_NEW_CALL		-r- n/a		New call received
+	RXRPC_ACCEPT		s-- n/a		Accept new call<
author	Linus Torvalds <torvalds@woody.linux-foundation.org>	2007-04-27 09:26:46 -0700
committer	Linus Torvalds <torvalds@woody.linux-foundation.org>	2007-04-27 09:26:46 -0700
commit	15c54033964a943de7b0763efd3bd0ede7326395 (patch)
tree	840b292612d1b5396d5bab5bde537a9013db3ceb /Documentation
parent	ad5da3cf39a5b11a198929be1f2644e17ecd767e (diff)
parent	912a41a4ab935ce8c4308428ec13fc7f8b1f18f4 (diff)