aboutsummaryrefslogtreecommitdiff
path: root/Documentation/networking
diff options
context:
space:
mode:
Diffstat (limited to 'Documentation/networking')
-rw-r--r--Documentation/networking/dccp.txt3
-rw-r--r--Documentation/networking/ip-sysctl.txt148
-rw-r--r--Documentation/networking/ipv6.txt35
-rw-r--r--Documentation/networking/ixgbe.txt199
-rw-r--r--Documentation/networking/rds.txt356
-rw-r--r--Documentation/networking/timestamping.txt180
-rw-r--r--Documentation/networking/timestamping/.gitignore1
-rw-r--r--Documentation/networking/timestamping/Makefile6
-rw-r--r--Documentation/networking/timestamping/timestamping.c533
-rw-r--r--Documentation/networking/vxge.txt100
10 files changed, 1493 insertions, 68 deletions
diff --git a/Documentation/networking/dccp.txt b/Documentation/networking/dccp.txt
index 7a3bb1abb83..b132e4a3cf0 100644
--- a/Documentation/networking/dccp.txt
+++ b/Documentation/networking/dccp.txt
@@ -141,7 +141,8 @@ rx_ccid = 2
Default CCID for the receiver-sender half-connection; see tx_ccid.
seq_window = 100
- The initial sequence window (sec. 7.5.2).
+ The initial sequence window (sec. 7.5.2) of the sender. This influences
+ the local ackno validity and the remote seqno validity windows (7.5.1).
tx_qlen = 5
The size of the transmit buffer in packets. A value of 0 corresponds
diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index c7712787933..ec5de02f543 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -2,7 +2,7 @@
ip_forward - BOOLEAN
0 - disabled (default)
- not 0 - enabled
+ not 0 - enabled
Forward Packets between interfaces.
@@ -36,49 +36,49 @@ rt_cache_rebuild_count - INTEGER
IP Fragmentation:
ipfrag_high_thresh - INTEGER
- Maximum memory used to reassemble IP fragments. When
+ Maximum memory used to reassemble IP fragments. When
ipfrag_high_thresh bytes of memory is allocated for this purpose,
the fragment handler will toss packets until ipfrag_low_thresh
is reached.
-
+
ipfrag_low_thresh - INTEGER
- See ipfrag_high_thresh
+ See ipfrag_high_thresh
ipfrag_time - INTEGER
- Time in seconds to keep an IP fragment in memory.
+ Time in seconds to keep an IP fragment in memory.
ipfrag_secret_interval - INTEGER
- Regeneration interval (in seconds) of the hash secret (or lifetime
+ Regeneration interval (in seconds) of the hash secret (or lifetime
for the hash secret) for IP fragments.
Default: 600
ipfrag_max_dist - INTEGER
- ipfrag_max_dist is a non-negative integer value which defines the
- maximum "disorder" which is allowed among fragments which share a
- common IP source address. Note that reordering of packets is
- not unusual, but if a large number of fragments arrive from a source
- IP address while a particular fragment queue remains incomplete, it
- probably indicates that one or more fragments belonging to that queue
- have been lost. When ipfrag_max_dist is positive, an additional check
- is done on fragments before they are added to a reassembly queue - if
- ipfrag_max_dist (or more) fragments have arrived from a particular IP
- address between additions to any IP fragment queue using that source
- address, it's presumed that one or more fragments in the queue are
- lost. The existing fragment queue will be dropped, and a new one
+ ipfrag_max_dist is a non-negative integer value which defines the
+ maximum "disorder" which is allowed among fragments which share a
+ common IP source address. Note that reordering of packets is
+ not unusual, but if a large number of fragments arrive from a source
+ IP address while a particular fragment queue remains incomplete, it
+ probably indicates that one or more fragments belonging to that queue
+ have been lost. When ipfrag_max_dist is positive, an additional check
+ is done on fragments before they are added to a reassembly queue - if
+ ipfrag_max_dist (or more) fragments have arrived from a particular IP
+ address between additions to any IP fragment queue using that source
+ address, it's presumed that one or more fragments in the queue are
+ lost. The existing fragment queue will be dropped, and a new one
started. An ipfrag_max_dist value of zero disables this check.
Using a very small value, e.g. 1 or 2, for ipfrag_max_dist can
result in unnecessarily dropping fragment queues when normal
- reordering of packets occurs, which could lead to poor application
- performance. Using a very large value, e.g. 50000, increases the
- likelihood of incorrectly reassembling IP fragments that originate
+ reordering of packets occurs, which could lead to poor application
+ performance. Using a very large value, e.g. 50000, increases the
+ likelihood of incorrectly reassembling IP fragments that originate
from different IP datagrams, which could result in data corruption.
Default: 64
INET peer storage:
inet_peer_threshold - INTEGER
- The approximate size of the storage. Starting from this threshold
+ The approximate size of the storage. Starting from this threshold
entries will be thrown aggressively. This threshold also determines
entries' time-to-live and time intervals between garbage collection
passes. More entries, less time-to-live, less GC interval.
@@ -105,7 +105,7 @@ inet_peer_gc_maxtime - INTEGER
in effect under low (or absent) memory pressure on the pool.
Measured in seconds.
-TCP variables:
+TCP variables:
somaxconn - INTEGER
Limit of socket listen() backlog, known in userspace as SOMAXCONN.
@@ -310,7 +310,7 @@ tcp_orphan_retries - INTEGER
tcp_reordering - INTEGER
Maximal reordering of packets in a TCP stream.
- Default: 3
+ Default: 3
tcp_retrans_collapse - BOOLEAN
Bug-to-bug compatibility with some broken printers.
@@ -521,7 +521,7 @@ IP Variables:
ip_local_port_range - 2 INTEGERS
Defines the local port range that is used by TCP and UDP to
- choose the local port. The first number is the first, the
+ choose the local port. The first number is the first, the
second the last local port number. Default value depends on
amount of memory available on the system:
> 128Mb 32768-61000
@@ -594,12 +594,12 @@ icmp_errors_use_inbound_ifaddr - BOOLEAN
If zero, icmp error messages are sent with the primary address of
the exiting interface.
-
+
If non-zero, the message will be sent with the primary address of
the interface that received the packet that caused the icmp error.
This is the behaviour network many administrators will expect from
a router. And it can make debugging complicated network layouts
- much easier.
+ much easier.
Note that if no primary address exists for the interface selected,
then the primary address of the first non-loopback interface that
@@ -611,7 +611,7 @@ igmp_max_memberships - INTEGER
Change the maximum number of multicast groups we can subscribe to.
Default: 20
-conf/interface/* changes special settings per interface (where "interface" is
+conf/interface/* changes special settings per interface (where "interface" is
the name of your network interface)
conf/all/* is special, changes the settings for all interfaces
@@ -625,11 +625,11 @@ log_martians - BOOLEAN
accept_redirects - BOOLEAN
Accept ICMP redirect messages.
accept_redirects for the interface will be enabled if:
- - both conf/{all,interface}/accept_redirects are TRUE in the case forwarding
- for the interface is enabled
+ - both conf/{all,interface}/accept_redirects are TRUE in the case
+ forwarding for the interface is enabled
or
- - at least one of conf/{all,interface}/accept_redirects is TRUE in the case
- forwarding for the interface is disabled
+ - at least one of conf/{all,interface}/accept_redirects is TRUE in the
+ case forwarding for the interface is disabled
accept_redirects for the interface will be disabled otherwise
default TRUE (host)
FALSE (router)
@@ -640,8 +640,8 @@ forwarding - BOOLEAN
mc_forwarding - BOOLEAN
Do multicast routing. The kernel needs to be compiled with CONFIG_MROUTE
and a multicast routing daemon is required.
- conf/all/mc_forwarding must also be set to TRUE to enable multicast routing
- for the interface
+ conf/all/mc_forwarding must also be set to TRUE to enable multicast
+ routing for the interface
medium_id - INTEGER
Integer value used to differentiate the devices by the medium they
@@ -649,7 +649,7 @@ medium_id - INTEGER
the broadcast packets are received only on one of them.
The default value 0 means that the device is the only interface
to its medium, value of -1 means that medium is not known.
-
+
Currently, it is used to change the proxy_arp behavior:
the proxy_arp feature is enabled for packets forwarded between
two devices attached to different media.
@@ -699,16 +699,22 @@ accept_source_route - BOOLEAN
default TRUE (router)
FALSE (host)
-rp_filter - BOOLEAN
- 1 - do source validation by reversed path, as specified in RFC1812
- Recommended option for single homed hosts and stub network
- routers. Could cause troubles for complicated (not loop free)
- networks running a slow unreliable protocol (sort of RIP),
- or using static routes.
-
+rp_filter - INTEGER
0 - No source validation.
-
- conf/all/rp_filter must also be set to TRUE to do source validation
+ 1 - Strict mode as defined in RFC3704 Strict Reverse Path
+ Each incoming packet is tested against the FIB and if the interface
+ is not the best reverse path the packet check will fail.
+ By default failed packets are discarded.
+ 2 - Loose mode as defined in RFC3704 Loose Reverse Path
+ Each incoming packet's source address is also tested against the FIB
+ and if the source address is not reachable via any interface
+ the packet check will fail.
+
+ Current recommended practice in RFC3704 is to enable strict mode
+ to prevent IP spoofing from DDos attacks. If using asymmetric routing
+ or other complicated routing, then loose mode is recommended.
+
+ conf/all/rp_filter must also be set to non-zero to do source validation
on the interface
Default value is 0. Note that some distributions enable it
@@ -782,6 +788,12 @@ arp_ignore - INTEGER
The max value from conf/{all,interface}/arp_ignore is used
when ARP request is received on the {interface}
+arp_notify - BOOLEAN
+ Define mode for notification of address and device changes.
+ 0 - (default): do nothing
+ 1 - Generate gratuitous arp replies when device is brought up
+ or hardware address changes.
+
arp_accept - BOOLEAN
Define behavior when gratuitous arp replies are received:
0 - drop gratuitous arp frames
@@ -823,7 +835,7 @@ apply to IPv6 [XXX?].
bindv6only - BOOLEAN
Default value for IPV6_V6ONLY socket option,
- which restricts use of the IPv6 socket to IPv6 communication
+ which restricts use of the IPv6 socket to IPv6 communication
only.
TRUE: disable IPv4-mapped address feature
FALSE: enable IPv4-mapped address feature
@@ -833,19 +845,19 @@ bindv6only - BOOLEAN
IPv6 Fragmentation:
ip6frag_high_thresh - INTEGER
- Maximum memory used to reassemble IPv6 fragments. When
+ Maximum memory used to reassemble IPv6 fragments. When
ip6frag_high_thresh bytes of memory is allocated for this purpose,
the fragment handler will toss packets until ip6frag_low_thresh
is reached.
-
+
ip6frag_low_thresh - INTEGER
- See ip6frag_high_thresh
+ See ip6frag_high_thresh
ip6frag_time - INTEGER
Time in seconds to keep an IPv6 fragment in memory.
ip6frag_secret_interval - INTEGER
- Regeneration interval (in seconds) of the hash secret (or lifetime
+ Regeneration interval (in seconds) of the hash secret (or lifetime
for the hash secret) for IPv6 fragments.
Default: 600
@@ -854,17 +866,17 @@ conf/default/*:
conf/all/*:
- Change all the interface-specific settings.
+ Change all the interface-specific settings.
[XXX: Other special features than forwarding?]
conf/all/forwarding - BOOLEAN
- Enable global IPv6 forwarding between all interfaces.
+ Enable global IPv6 forwarding between all interfaces.
- IPv4 and IPv6 work differently here; e.g. netfilter must be used
+ IPv4 and IPv6 work differently here; e.g. netfilter must be used
to control which interfaces may forward packets and which not.
- This also sets all interfaces' Host/Router setting
+ This also sets all interfaces' Host/Router setting
'forwarding' to the specified value. See below for details.
This referred to as global forwarding.
@@ -875,12 +887,12 @@ proxy_ndp - BOOLEAN
conf/interface/*:
Change special settings per interface.
- The functional behaviour for certain settings is different
+ The functional behaviour for certain settings is different
depending on whether local forwarding is enabled or not.
accept_ra - BOOLEAN
Accept Router Advertisements; autoconfigure using them.
-
+
Functional default: enabled if local forwarding is disabled.
disabled if local forwarding is enabled.
@@ -926,7 +938,7 @@ accept_source_route - INTEGER
Default: 0
autoconf - BOOLEAN
- Autoconfigure addresses using Prefix Information in Router
+ Autoconfigure addresses using Prefix Information in Router
Advertisements.
Functional default: enabled if accept_ra_pinfo is enabled.
@@ -935,11 +947,11 @@ autoconf - BOOLEAN
dad_transmits - INTEGER
The amount of Duplicate Address Detection probes to send.
Default: 1
-
+
forwarding - BOOLEAN
- Configure interface-specific Host/Router behaviour.
+ Configure interface-specific Host/Router behaviour.
- Note: It is recommended to have the same setting on all
+ Note: It is recommended to have the same setting on all
interfaces; mixed router/host scenarios are rather uncommon.
FALSE:
@@ -948,13 +960,13 @@ forwarding - BOOLEAN
1. IsRouter flag is not set in Neighbour Advertisements.
2. Router Solicitations are being sent when necessary.
- 3. If accept_ra is TRUE (default), accept Router
+ 3. If accept_ra is TRUE (default), accept Router
Advertisements (and do autoconfiguration).
4. If accept_redirects is TRUE (default), accept Redirects.
TRUE:
- If local forwarding is enabled, Router behaviour is assumed.
+ If local forwarding is enabled, Router behaviour is assumed.
This means exactly the reverse from the above:
1. IsRouter flag is set in Neighbour Advertisements.
@@ -989,7 +1001,7 @@ router_solicitation_interval - INTEGER
Default: 4
router_solicitations - INTEGER
- Number of Router Solicitations to send until assuming no
+ Number of Router Solicitations to send until assuming no
routers are present.
Default: 3
@@ -1013,11 +1025,11 @@ temp_prefered_lft - INTEGER
max_desync_factor - INTEGER
Maximum value for DESYNC_FACTOR, which is a random value
- that ensures that clients don't synchronize with each
+ that ensures that clients don't synchronize with each
other and generate new addresses at exactly the same time.
value is in seconds.
Default: 600
-
+
regen_max_retry - INTEGER
Number of attempts before give up attempting to generate
valid temporary addresses.
@@ -1025,13 +1037,15 @@ regen_max_retry - INTEGER
max_addresses - INTEGER
Number of maximum addresses per interface. 0 disables limitation.
- It is recommended not set too large value (or 0) because it would
- be too easy way to crash kernel to allow to create too much of
+ It is recommended not set too large value (or 0) because it would
+ be too easy way to crash kernel to allow to create too much of
autoconfigured addresses.
Default: 16
disable_ipv6 - BOOLEAN
- Disable IPv6 operation.
+ Disable IPv6 operation. If accept_dad is set to 2, this value
+ will be dynamically set to TRUE if DAD fails for the link-local
+ address.
Default: FALSE (enable IPv6 operation)
accept_dad - INTEGER
diff --git a/Documentation/networking/ipv6.txt b/Documentation/networking/ipv6.txt
new file mode 100644
index 00000000000..268e5c103dd
--- /dev/null
+++ b/Documentation/networking/ipv6.txt
@@ -0,0 +1,35 @@
+
+Options for the ipv6 module are supplied as parameters at load time.
+
+Module options may be given as command line arguments to the insmod
+or modprobe command, but are usually specified in either the
+/etc/modules.conf or /etc/modprobe.conf configuration file, or in a
+distro-specific configuration file.
+
+The available ipv6 module parameters are listed below. If a parameter
+is not specified the default value is used.
+
+The parameters are as follows:
+
+disable
+
+ Specifies whether to load the IPv6 module, but disable all
+ its functionality. This might be used when another module
+ has a dependency on the IPv6 module being loaded, but no
+ IPv6 addresses or operations are desired.
+
+ The possible values and their effects are:
+
+ 0
+ IPv6 is enabled.
+
+ This is the default value.
+
+ 1
+ IPv6 is disabled.
+
+ No IPv6 addresses will be added to interfaces, and
+ it will not be possible to open an IPv6 socket.
+
+ A reboot is required to enable IPv6.
+
diff --git a/Documentation/networking/ixgbe.txt b/Documentation/networking/ixgbe.txt
new file mode 100644
index 00000000000..eeb68685c78
--- /dev/null
+++ b/Documentation/networking/ixgbe.txt
@@ -0,0 +1,199 @@
+Linux Base Driver for 10 Gigabit PCI Express Intel(R) Network Connection
+========================================================================
+
+March 10, 2009
+
+
+Contents
+========
+
+- In This Release
+- Identifying Your Adapter
+- Building and Installation
+- Additional Configurations
+- Support
+
+
+
+In This Release
+===============
+
+This file describes the ixgbe Linux Base Driver for the 10 Gigabit PCI
+Express Intel(R) Network Connection. This driver includes support for
+Itanium(R)2-based systems.
+
+For questions related to hardware requirements, refer to the documentation
+supplied with your 10 Gigabit adapter. All hardware requirements listed apply
+to use with Linux.
+
+The following features are available in this kernel:
+ - Native VLANs
+ - Channel Bonding (teaming)
+ - SNMP
+ - Generic Receive Offload
+ - Data Center Bridging
+
+Channel Bonding documentation can be found in the Linux kernel source:
+/Documentation/networking/bonding.txt
+
+Ethtool, lspci, and ifconfig can be used to display device and driver
+specific information.
+
+
+Identifying Your Adapter
+========================
+
+This driver supports devices based on the 82598 controller and the 82599
+controller.
+
+For specific information on identifying which adapter you have, please visit:
+
+ http://support.intel.com/support/network/sb/CS-008441.htm
+
+
+Building and Installation
+=========================
+
+select m for "Intel(R) 10GbE PCI Express adapters support" located at:
+ Location:
+ -> Device Drivers
+ -> Network device support (NETDEVICES [=y])
+ -> Ethernet (10000 Mbit) (NETDEV_10000 [=y])
+
+1. make modules & make modules_install
+
+2. Load the module:
+
+# modprobe ixgbe
+
+ The insmod command can be used if the full
+ path to the driver module is specified. For example:
+
+ insmod /lib/modules/<KERNEL VERSION>/kernel/drivers/net/ixgbe/ixgbe.ko
+
+ With 2.6 based kernels also make sure that older ixgbe drivers are
+ removed from the kernel, before loading the new module:
+
+ rmmod ixgbe; modprobe ixgbe
+
+3. Assign an IP address to the interface by entering the following, where
+ x is the interface number:
+
+ ifconfig ethx <IP_address>
+
+4. Verify that the interface works. Enter the following, where <IP_address>
+ is the IP address for another machine on the same subnet as the interface
+ that is being tested:
+
+ ping <IP_address>
+
+
+Additional Configurations
+=========================
+
+ Viewing Link Messages
+ ---------------------
+ Link messages will not be displayed to the console if the distribution is
+ restricting system messages. In order to see network driver link messages on
+ your console, set dmesg to eight by entering the following:
+
+ dmesg -n 8
+
+ NOTE: This setting is not saved across reboots.
+
+
+ Jumbo Frames
+ ------------
+ The driver supports Jumbo Frames for all adapters. Jumbo Frames support is
+ enabled by changing the MTU to a value larger than the default of 1500.
+ The maximum value for the MTU is 16110. Use the ifconfig command to
+ increase the MTU size. For example:
+
+ ifconfig ethx mtu 9000 up
+
+ The maximum MTU setting for Jumbo Frames is 16110. This value coincides
+ with the maximum Jumbo Frames size of 16128.
+
+ Generic Receive Offload, aka GRO
+ --------------------------------
+ The driver supports the in-kernel software implementation of GRO. GRO has
+ shown that by coalescing Rx traffic into larger chunks of data, CPU
+ utilization can be significantly reduced when under large Rx load. GRO is an
+ evolution of the previously-used LRO interface. GRO is able to coalesce
+ other protocols besides TCP. It's also safe to use with configurations that
+ are problematic for LRO, namely bridging and iSCSI.
+
+ GRO is enabled by default in the driver. Future versions of ethtool will
+ support disabling and re-enabling GRO on the fly.
+
+
+ Data Center Bridging, aka DCB
+ -----------------------------
+
+ DCB is a configuration Quality of Service implementation in hardware.
+ It uses the VLAN priority tag (802.1p) to filter traffic. That means
+ that there are 8 different priorities that traffic can be filtered into.
+ It also enables priority flow control which can limit or eliminate the
+ number of dropped packets during network stress. Bandwidth can be
+ allocated to each of these priorities, which is enforced at the hardware
+ level.
+
+ To enable DCB support in ixgbe, you must enable the DCB netlink layer to
+ allow the userspace tools (see below) to communicate with the driver.
+ This can be found in the kernel configuration here:
+
+ -> Networking support
+ -> Networking options
+ -> Data Center Bridging support
+
+ Once this is selected, DCB support must be selected for ixgbe. This can
+ be found here:
+
+ -> Device Drivers
+ -> Network device support (NETDEVICES [=y])
+ -> Ethernet (10000 Mbit) (NETDEV_10000 [=y])
+ -> Intel(R) 10GbE PCI Express adapters support
+ -> Data Center Bridging (DCB) Support
+
+ After these options are selected, you must rebuild your kernel and your
+ modules.
+
+ In order to use DCB, userspace tools must be downloaded and installed.
+ The dcbd tools can be found at:
+
+ http://e1000.sf.net
+
+
+ Ethtool
+ -------
+ The driver utilizes the ethtool interface for driver configuration and
+ diagnostics, as well as displaying statistical information. Ethtool
+ version 3.0 or later is required for this functionality.
+
+ The latest release of ethtool can be found from
+ http://sourceforge.net/projects/gkernel.
+
+
+ NAPI
+ ----
+
+ NAPI (Rx polling mode) is supported in the ixgbe driver. NAPI is enabled
+ by default in the driver.
+
+ See www.cyberus.ca/~hadi/usenix-paper.tgz for more information on NAPI.
+
+
+Support
+=======
+
+For general information, go to the Intel support website at:
+
+ http://support.intel.com
+
+or the Intel Wired Networking project hosted by Sourceforge at:
+
+ http://e1000.sourceforge.net
+
+If an issue is identified with the released source code on the supported
+kernel with a supported adapter, email the specific information related
+to the issue to e1000-devel@lists.sf.net
diff --git a/Documentation/networking/rds.txt b/Documentation/networking/rds.txt
new file mode 100644
index 00000000000..c67077cbeb8
--- /dev/null
+++ b/Documentation/networking/rds.txt
@@ -0,0 +1,356 @@
+
+Overview
+========
+
+This readme tries to provide some background on the hows and whys of RDS,
+and will hopefully help you find your way around the code.
+
+In addition, please see this email about RDS origins:
+http://oss.oracle.com/pipermail/rds-devel/2007-November/000228.html
+
+RDS Architecture
+================
+
+RDS provides reliable, ordered datagram delivery by using a single
+reliable connection between any two nodes in the cluster. This allows
+applications to use a single socket to talk to any other process in the
+cluster - so in a cluster with N processes you need N sockets, in contrast
+to N*N if you use a connection-oriented socket transport like TCP.
+
+RDS is not Infiniband-specific; it was designed to support different
+transports. The current implementation used to support RDS over TCP as well
+as IB. Work is in progress to support RDS over iWARP, and using DCE to
+guarantee no dropped packets on Ethernet, it may be possible to use RDS over
+UDP in the future.
+
+The high-level semantics of RDS from the application's point of view are
+
+ * Addressing
+ RDS uses IPv4 addresses and 16bit port numbers to identify
+ the end point of a connection. All socket operations that involve
+ passing addresses between kernel and user space generally
+ use a struct sockaddr_in.
+
+ The fact that IPv4 addresses are used does not mean the underlying
+ transport has to be IP-based. In fact, RDS over IB uses a
+ reliable IB connection; the IP address is used exclusively to
+ locate the remote node's GID (by ARPing for the given IP).
+
+ The port space is entirely independent of UDP, TCP or any other
+ protocol.
+
+ * Socket interface
+ RDS sockets work *mostly* as you would expect from a BSD
+ socket. The next section will cover the details. At any rate,
+ all I/O is performed through the standard BSD socket API.
+ Some additions like zerocopy support are implemented through
+ control messages, while other extensions use the getsockopt/
+ setsockopt calls.
+
+ Sockets must be bound before you can send or receive data.
+ This is needed because binding also selects a transport and
+ attaches it to the socket. Once bound, the transport assignment
+ does not change. RDS will tolerate IPs moving around (eg in
+ a active-active HA scenario), but only as long as the address
+ doesn't move to a different transport.
+
+ * sysctls
+ RDS supports a number of sysctls in /proc/sys/net/rds
+
+
+Socket Interface
+================
+
+ AF_RDS, PF_RDS, SOL_RDS
+ These constants haven't been assigned yet, because RDS isn't in
+ mainline yet. Currently, the kernel module assigns some constant
+ and publishes it to user space through two sysctl files
+ /proc/sys/net/rds/pf_rds
+ /proc/sys/net/rds/sol_rds
+
+ fd = socket(PF_RDS, SOCK_SEQPACKET, 0);
+ This creates a new, unbound RDS socket.
+
+ setsockopt(SOL_SOCKET): send and receive buffer size
+ RDS honors the send and receive buffer size socket options.
+ You are not allowed to queue more than SO_SNDSIZE bytes to
+ a socket. A message is queued when sendmsg is called, and
+ it leaves the queue when the remote system acknowledges
+ its arrival.
+
+ The SO_RCVSIZE option controls the maximum receive queue length.
+ This is a soft limit rather than a hard limit - RDS will
+ continue to accept and queue incoming messages, even if that
+ takes the queue length over the limit. However, it will also
+ mark the port as "congested" and send a congestion update to
+ the source node. The source node is supposed to throttle any
+ processes sending to this congested port.
+
+ bind(fd, &sockaddr_in, ...)
+ This binds the socket to a local IP address and port, and a
+ transport.
+
+ sendmsg(fd, ...)
+ Sends a message to the indicated recipient. The kernel will
+ transparently establish the underlying reliable connection
+ if it isn't up yet.
+
+ An attempt to send a message that exceeds SO_SNDSIZE will
+ return with -EMSGSIZE
+
+ An attempt to send a message that would take the total number
+ of queued bytes over the SO_SNDSIZE threshold will return
+ EAGAIN.
+
+ An attempt to send a message to a destination that is marked
+ as "congested" will return ENOBUFS.
+
+ recvmsg(fd, ...)
+ Receives a message that was queued to this socket. The sockets
+ recv queue accounting is adjusted, and if the queue length
+ drops below SO_SNDSIZE, the port is marked uncongested, and
+ a congestion update is sent to all peers.
+
+ Applications can ask the RDS kernel module to receive
+ notifications via control messages (for instance, there is a
+ notification when a congestion update arrived, or when a RDMA
+ operation completes). These notifications are received through
+ the msg.msg_control buffer of struct msghdr. The format of the
+ messages is described in manpages.
+
+ poll(fd)
+ RDS supports the poll interface to allow the application
+ to implement async I/O.
+
+ POLLIN handling is pretty straightforward. When there's an
+ incoming message queued to the socket, or a pending notification,
+ we signal POLLIN.
+
+ POLLOUT is a little harder. Since you can essentially send
+ to any destination, RDS will always signal POLLOUT as long as
+ there's room on the send queue (ie the number of bytes queued
+ is less than the sendbuf size).
+
+ However, the kernel will refuse to accept messages to
+ a destination marked congested - in this case you will loop
+ forever if you rely on poll to tell you what to do.
+ This isn't a trivial problem, but applications can deal with
+ this - by using congestion notifications, and by checking for
+ ENOBUFS errors returned by sendmsg.
+
+ setsockopt(SOL_RDS, RDS_CANCEL_SENT_TO, &sockaddr_in)
+ This allows the application to discard all messages queued to a
+ specific destination on this particular socket.
+
+ This allows the application to cancel outstanding messages if
+ it detects a timeout. For instance, if it tried to send a message,
+ and the remote host is unreachable, RDS will keep trying forever.
+ The application may decide it's not worth it, and cancel the
+ operation. In this case, it would use RDS_CANCEL_SENT_TO to
+ nuke any pending messages.
+
+
+RDMA for RDS
+============
+
+ see rds-rdma(7) manpage (available in rds-tools)
+
+
+Congestion Notifications
+========================
+
+ see rds(7) manpage
+
+
+RDS Protocol
+============
+
+ Message header
+
+ The message header is a 'struct rds_header' (see rds.h):
+ Fields:
+ h_sequence:
+ per-packet sequence number
+ h_ack:
+ piggybacked acknowledgment of last packet received
+ h_len:
+ length of data, not including header
+ h_sport:
+ source port
+ h_dport:
+ destination port
+ h_flags:
+ CONG_BITMAP - this is a congestion update bitmap
+ ACK_REQUIRED - receiver must ack this packet
+ RETRANSMITTED - packet has previously been sent
+ h_credit:
+ indicate to other end of connection that
+ it has more credits available (i.e. there is
+ more send room)
+ h_padding[4]:
+ unused, for future use
+ h_csum:
+ header checksum
+ h_exthdr:
+ optional data can be passed here. This is currently used for
+ passing RDMA-related information.
+
+ ACK and retransmit handling
+
+ One might think that with reliable IB connections you wouldn't need
+ to ack messages that have been received. The problem is that IB
+ hardware generates an ack message before it has DMAed the message
+ into memory. This creates a potential message loss if the HCA is
+ disabled for any reason between when it sends the ack and before
+ the message is DMAed and processed. This is only a potential issue
+ if another HCA is available for fail-over.
+
+ Sending an ack immediately would allow the sender to free the sent
+ message from their send queue quickly, but could cause excessive
+ traffic to be used for acks. RDS piggybacks acks on sent data
+ packets. Ack-only packets are reduced by only allowing one to be
+ in flight at a time, and by the sender only asking for acks when
+ its send buffers start to fill up. All retransmissions are also
+ acked.
+
+ Flow Control
+
+ RDS's IB transport uses a credit-based mechanism to verify that
+ there is space in the peer's receive buffers for more data. This
+ eliminates the need for hardware retries on the connection.
+
+ Congestion
+
+ Messages waiting in the receive queue on the receiving socket
+ are accounted against the sockets SO_RCVBUF option value. Only
+ the payload bytes in the message are accounted for. If the
+ number of bytes queued equals or exceeds rcvbuf then the socket
+ is congested. All sends attempted to this socket's address
+ should return block or return -EWOULDBLOCK.
+
+ Applications are expected to be reasonably tuned such that this
+ situation very rarely occurs. An application encountering this
+ "back-pressure" is considered a bug.
+
+ This is implemented by having each node maintain bitmaps which
+ indicate which ports on bound addresses are congested. As the
+ bitmap changes it is sent through all the connections which
+ terminate in the local address of the bitmap which changed.
+
+ The bitmaps are allocated as connections are brought up. This
+ avoids allocation in the interrupt handling path which queues
+ sages on sockets. The dense bitmaps let transports send the
+ entire bitmap on any bitmap change reasonably efficiently. This
+ is much easier to implement than some finer-grained
+ communication of per-port congestion. The sender does a very
+ inexpensive bit test to test if the port it's about to send to
+ is congested or not.
+
+
+RDS Transport Layer
+==================
+
+ As mentioned above, RDS is not IB-specific. Its code is divided
+ into a general RDS layer and a transport layer.
+
+ The general layer handles the socket API, congestion handling,
+ loopback, stats, usermem pinning, and the connection state machine.
+
+ The transport layer handles the details of the transport. The IB
+ transport, for example, handles all the queue pairs, work requests,
+ CM event handlers, and other Infiniband details.
+
+
+RDS Kernel Structures
+=====================
+
+ struct rds_message
+ aka possibly "rds_outgoing", the generic RDS layer copies data to
+ be sent and sets header fields as needed, based on the socket API.
+ This is then queued for the individual connection and sent by the
+ connection's transport.
+ struct rds_incoming
+ a generic struct referring to incoming data that can be handed from
+ the transport to the general code and queued by the general code
+ while the socket is awoken. It is then passed back to the transport
+ code to handle the actual copy-to-user.
+ struct rds_socket
+ per-socket information
+ struct rds_connection
+ per-connection information
+ struct rds_transport
+ pointers to transport-specific functions
+ struct rds_statistics
+ non-transport-specific statistics
+ struct rds_cong_map
+ wraps the raw congestion bitmap, contains rbnode, waitq, etc.
+
+Connection management
+=====================
+
+ Connections may be in UP, DOWN, CONNECTING, DISCONNECTING, and
+ ERROR states.
+
+ The first time an attempt is made by an RDS socket to send data to
+ a node, a connection is allocated and connected. That connection is