diff options
Diffstat (limited to 'Documentation/networking')
143 files changed, 17928 insertions, 7414 deletions
diff --git a/Documentation/networking/.gitignore b/Documentation/networking/.gitignore new file mode 100644 index 00000000000..e69de29bb2d --- /dev/null +++ b/Documentation/networking/.gitignore diff --git a/Documentation/networking/00-INDEX b/Documentation/networking/00-INDEX index 5b01d5cc4e9..557b6ef70c2 100644 --- a/Documentation/networking/00-INDEX +++ b/Documentation/networking/00-INDEX @@ -2,18 +2,28 @@ - this file 3c505.txt - information on the 3Com EtherLink Plus (3c505) driver. +3c509.txt + - information on the 3Com Etherlink III Series Ethernet cards. 6pack.txt - info on the 6pack protocol, an alternative to KISS for AX.25 -Configurable - - info on some of the configurable network parameters -DLINK.txt - - info on the D-Link DE-600/DE-620 parallel port pocket adapters +LICENSE.qla3xxx + - GPLv2 for QLogic Linux Networking HBA Driver +LICENSE.qlge + - GPLv2 for QLogic Linux qlge NIC Driver +LICENSE.qlcnic + - GPLv2 for QLogic Linux qlcnic NIC Driver +Makefile + - Makefile for docsrc. PLIP.txt - PLIP: The Parallel Line Internet Protocol device driver +README.ipw2100 + - README for the Intel PRO/Wireless 2100 driver. +README.ipw2200 + - README for the Intel PRO/Wireless 2915ABG and 2200BG driver. README.sb1000 - info on General Instrument/NextLevel SURFboard1000 cable modem. alias.txt - - info on using alias network devices + - info on using alias network devices. arcnet-hardware.txt - tons of info on ARCnet, hubs, jumper settings for ARCnet cards, etc. arcnet.txt @@ -22,42 +32,76 @@ atm.txt - info on where to get ATM programs and support for Linux. ax25.txt - info on using AX.25 and NET/ROM code for Linux +batman-adv.txt + - B.A.T.M.A.N routing protocol on top of layer 2 Ethernet Frames. baycom.txt - info on the driver for Baycom style amateur radio modems +bonding.txt + - Linux Ethernet Bonding Driver HOWTO: link aggregation in Linux. bridge.txt - where to get user space programs for ethernet bridging with Linux. -comx.txt - - info on drivers for COMX line of synchronous serial adapters. +can.txt + - documentation on CAN protocol family. cops.txt - info on the COPS LocalTalk Linux driver cs89x0.txt - the Crystal LAN (CS8900/20-based) Ethernet ISA adapter driver +cxacru.txt + - Conexant AccessRunner USB ADSL Modem +cxacru-cf.py + - Conexant AccessRunner USB ADSL Modem configuration file parser +cxgb.txt + - Release Notes for the Chelsio N210 Linux device driver. +dccp.txt + - the Datagram Congestion Control Protocol (DCCP) (RFC 4340..42). de4x5.txt - the Digital EtherWORKS DE4?? and DE5?? PCI Ethernet driver decnet.txt - info on using the DECnet networking layer in Linux. -depca.txt - - the Digital DEPCA/EtherWORKS DE1?? and DE2?? LANCE Ethernet driver -dgrs.txt - - the Digi International RightSwitch SE-X Ethernet driver +dl2k.txt + - README for D-Link DL2000-based Gigabit Ethernet Adapters (dl2k.ko). +dm9000.txt + - README for the Simtec DM9000 Network driver. dmfe.txt - info on the Davicom DM9102(A)/DM9132/DM9801 fast ethernet driver. +dns_resolver.txt + - The DNS resolver module allows kernel servies to make DNS queries. +driver.txt + - Softnet driver issues. e100.txt - info on Intel's EtherExpress PRO/100 line of 10/100 boards e1000.txt - info on Intel's E1000 line of gigabit ethernet boards +e1000e.txt + - README for the Intel Gigabit Ethernet Driver (e1000e). eql.txt - serial IP load balancing -ethertap.txt - - the Ethertap user space packet reception and transmission driver -ewrk3.txt - - the Digital EtherWORKS 3 DE203/4/5 Ethernet driver +fib_trie.txt + - Level Compressed Trie (LC-trie) notes: a structure for routing. filter.txt - Linux Socket Filtering fore200e.txt - FORE Systems PCA-200E/SBA-200E ATM NIC driver info. framerelay.txt - info on using Frame Relay/Data Link Connection Identifier (DLCI). +gen_stats.txt + - Generic networking statistics for netlink users. +generic-hdlc.txt + - The generic High Level Data Link Control (HDLC) layer. +generic_netlink.txt + - info on Generic Netlink +gianfar.txt + - Gianfar Ethernet Driver. +i40e.txt + - README for the Intel Ethernet Controller XL710 Driver (i40e). +i40evf.txt + - Short note on the Driver for the Intel(R) XL710 X710 Virtual Function +ieee802154.txt + - Linux IEEE 802.15.4 implementation, API and drivers +igb.txt + - README for the Intel Gigabit Ethernet Driver (igb). +igbvf.txt + - README for the Intel Gigabit Ethernet Driver (igbvf). ip-sysctl.txt - /proc/sys/net/ipv4/* variables ip_dynaddr.txt @@ -66,60 +110,125 @@ ipddp.txt - AppleTalk-IP Decapsulation and AppleTalk-IP Encapsulation iphase.txt - Interphase PCI ATM (i)Chip IA Linux driver info. +ipsec.txt + - Note on not compressing IPSec payload and resulting failed policy check. +ipv6.txt + - Options to the ipv6 kernel module. +ipvs-sysctl.txt + - Per-inode explanation of the /proc/sys/net/ipv4/vs interface. irda.txt - where to get IrDA (infrared) utilities and info for Linux. +ixgb.txt + - README for the Intel 10 Gigabit Ethernet Driver (ixgb). +ixgbe.txt + - README for the Intel 10 Gigabit Ethernet Driver (ixgbe). +ixgbevf.txt + - README for the Intel Virtual Function (VF) Driver (ixgbevf). +l2tp.txt + - User guide to the L2TP tunnel protocol. lapb-module.txt - programming information of the LAPB module. ltpc.txt - the Apple or Farallon LocalTalk PC card driver -multicast.txt - - Behaviour of cards under Multicast -ncsa-telnet - - notes on how NCSA telnet (DOS) breaks with MTU discovery enabled. -net-modules.txt - - info and "insmod" parameters for all network driver modules. +mac80211-auth-assoc-deauth.txt + - authentication and association / deauth-disassoc with max80211 +mac80211-injection.txt + - HOWTO use packet injection with mac80211 +multiqueue.txt + - HOWTO for multiqueue network device support. +netconsole.txt + - The network console module netconsole.ko: configuration and notes. +netdev-FAQ.txt + - FAQ describing how to submit net changes to netdev mailing list. +netdev-features.txt + - Network interface features API description. netdevices.txt - info on network device driver functions exported to the kernel. -olympic.txt - - IBM PCI Pit/Pit-Phy/Olympic Token Ring driver info. +netif-msg.txt + - Design of the network interface message level setting (NETIF_MSG_*). +netlink_mmap.txt + - memory mapped I/O with netlink +nf_conntrack-sysctl.txt + - list of netfilter-sysctl knobs. +nfc.txt + - The Linux Near Field Communication (NFS) subsystem. +openvswitch.txt + - Open vSwitch developer documentation. +operstates.txt + - Overview of network interface operational states. +packet_mmap.txt + - User guide to memory mapped packet socket rings (PACKET_[RT]X_RING). +phonet.txt + - The Phonet packet protocol used in Nokia cellular modems. +phy.txt + - The PHY abstraction layer. +pktgen.txt + - User guide to the kernel packet generator (pktgen.ko). policy-routing.txt - IP policy-based routing -pt.txt - - the Gracilis Packetwin AX.25 device driver +ppp_generic.txt + - Information about the generic PPP driver. +proc_net_tcp.txt + - Per inode overview of the /proc/net/tcp and /proc/net/tcp6 interfaces. +radiotap-headers.txt + - Background on radiotap headers. ray_cs.txt - Raylink Wireless LAN card driver info. -routing.txt - - the new routing mechanism -shaper.txt - - info on the module that can shape/limit transmitted traffic. -sis900.txt - - SiS 900/7016 Fast Ethernet device driver info. -sk98lin.txt - - Marvell Yukon Chipset / SysKonnect SK-98xx compliant Gigabit - Ethernet Adapter family driver info +rds.txt + - Background on the reliable, ordered datagram delivery method RDS. +regulatory.txt + - Overview of the Linux wireless regulatory infrastructure. +rxrpc.txt + - Guide to the RxRPC protocol. +s2io.txt + - Release notes for Neterion Xframe I/II 10GbE driver. +scaling.txt + - Explanation of network scaling techniques: RSS, RPS, RFS, aRFS, XPS. +sctp.txt + - Notes on the Linux kernel implementation of the SCTP protocol. +secid.txt + - Explanation of the secid member in flow structures. skfp.txt - SysKonnect FDDI (SK-5xxx, Compaq Netelligent) driver info. smc9.txt - the driver for SMC's 9000 series of Ethernet cards -smctr.txt - - SMC TokenCard TokenRing Linux driver info. +spider_net.txt + - README for the Spidernet Driver (as found in PS3 / Cell BE). +stmmac.txt + - README for the STMicro Synopsys Ethernet driver. +tc-actions-env-rules.txt + - rules for traffic control (tc) actions. +timestamping.txt + - overview of network packet timestamping variants. tcp.txt - short blurb on how TCP output takes place. +tcp-thin.txt + - kernel tuning options for low rate 'thin' TCP streams. +team.txt + - pointer to information for ethernet teaming devices. tlan.txt - ThunderLAN (Compaq Netelligent 10/100, Olicom OC-2xxx) driver info. -tms380tr.txt - - SysKonnect Token Ring ISA/PCI adapter driver info. +tproxy.txt + - Transparent proxy support user guide. tuntap.txt - TUN/TAP device driver, allowing user space Rx/Tx of packets. +udplite.txt + - UDP-Lite protocol (RFC 3828) introduction. vortex.txt - info on using 3Com Vortex (3c590, 3c592, 3c595, 3c597) Ethernet cards. -wan-router.txt - - WAN router documentation -wavelan.txt - - AT&T GIS (nee NCR) WaveLAN card: An Ethernet-like radio transceiver +vxge.txt + - README for the Neterion X3100 PCIe Server Adapter. +vxlan.txt + - Virtual extensible LAN overview x25.txt - general info on X.25 development. x25-iface.txt - description of the X.25 Packet Layer to LAPB device interface. +xfrm_proc.txt + - description of the statistics package for XFRM. +xfrm_sync.txt + - sync patches for XFRM enable migration of an SA between hosts. +xfrm_sysctl.txt + - description of the XFRM configuration options. z8530drv.txt - info about Linux driver for Z8530 based HDLC cards for AX.25 diff --git a/Documentation/networking/3c359.txt b/Documentation/networking/3c359.txt deleted file mode 100644 index 4af8071a6d1..00000000000 --- a/Documentation/networking/3c359.txt +++ /dev/null @@ -1,58 +0,0 @@ - -3COM PCI TOKEN LINK VELOCITY XL TOKEN RING CARDS README - -Release 0.9.0 - Release - Jul 17th 2000 Mike Phillips - - 1.2.0 - Final - Feb 17th 2002 Mike Phillips - Updated for submission to the 2.4.x kernel. - -Thanks: - Terry Murphy from 3Com for tech docs and support, - Adam D. Ligas for testing the driver. - -Note: - This driver will NOT work with the 3C339 Token Ring cards, you need -to use the tms380 driver instead. - -Options: - -The driver accepts three options: ringspeed, pkt_buf_sz and message_level. - -These options can be specified differently for each card found. - -ringspeed: Has one of three settings 0 (default), 4 or 16. 0 will -make the card autosense the ringspeed and join at the appropriate speed, -this will be the default option for most people. 4 or 16 allow you to -explicitly force the card to operate at a certain speed. The card will fail -if you try to insert it at the wrong speed. (Although some hubs will allow -this so be *very* careful). The main purpose for explicitly setting the ring -speed is for when the card is first on the ring. In autosense mode, if the card -cannot detect any active monitors on the ring it will open at the same speed as -its last opening. This can be hazardous if this speed does not match the speed -you want the ring to operate at. - -pkt_buf_sz: This is this initial receive buffer allocation size. This will -default to 4096 if no value is entered. You may increase performance of the -driver by setting this to a value larger than the network packet size, although -the driver now re-sizes buffers based on MTU settings as well. - -message_level: Controls level of messages created by the driver. Defaults to 0: -which only displays start-up and critical messages. Presently any non-zero -value will display all soft messages as well. NB This does not turn -debugging messages on, that must be done by modified the source code. - -Variable MTU size: - -The driver can handle a MTU size upto either 4500 or 18000 depending upon -ring speed. The driver also changes the size of the receive buffers as part -of the mtu re-sizing, so if you set mtu = 18000, you will need to be able -to allocate 16 * (sk_buff with 18000 buffer size) call it 18500 bytes per ring -position = 296,000 bytes of memory space, plus of course anything -necessary for the tx sk_buff's. Remember this is per card, so if you are -building routers, gateway's etc, you could start to use a lot of memory -real fast. - -2/17/02 Mike Phillips - diff --git a/Documentation/networking/3c505.txt b/Documentation/networking/3c505.txt deleted file mode 100644 index b9d5b723011..00000000000 --- a/Documentation/networking/3c505.txt +++ /dev/null @@ -1,46 +0,0 @@ -The 3Com Etherlink Plus (3c505) driver. - -This driver now uses DMA. There is currently no support for PIO operation. -The default DMA channel is 6; this is _not_ autoprobed, so you must -make sure you configure it correctly. If loading the driver as a -module, you can do this with "modprobe 3c505 dma=n". If the driver is -linked statically into the kernel, you must either use an "ether=" -statement on the command line, or change the definition of ELP_DMA in 3c505.h. - -The driver will warn you if it has to fall back on the compiled in -default DMA channel. - -If no base address is given at boot time, the driver will autoprobe -ports 0x300, 0x280 and 0x310 (in that order). If no IRQ is given, the driver -will try to probe for it. - -The driver can be used as a loadable module. See net-modules.txt for details -of the parameters it can take. - -Theoretically, one instance of the driver can now run multiple cards, -in the standard way (when loading a module, say "modprobe 3c505 -io=0x300,0x340 irq=10,11 dma=6,7" or whatever). I have not tested -this, though. - -The driver may now support revision 2 hardware; the dependency on -being able to read the host control register has been removed. This -is also untested, since I don't have a suitable card. - -Known problems: - I still see "DMA upload timed out" messages from time to time. These -seem to be fairly non-fatal though. - The card is old and slow. - -To do: - Improve probe/setup code - Test multicast and promiscuous operation - -Authors: - The driver is mainly written by Craig Southeren, email - <craigs@ineluki.apana.org.au>. - Parts of the driver (adapting the driver to 1.1.4+ kernels, - IRQ/address detection, some changes) and this README by - Juha Laiho <jlaiho@ichaos.nullnet.fi>. - DMA mode, more fixes, etc, by Philip Blundell <pjb27@cam.ac.uk> - Multicard support, Software configurable DMA, etc., by - Christopher Collins <ccollins@pcug.org.au> diff --git a/Documentation/networking/3c509.txt b/Documentation/networking/3c509.txt index 867a99f88c6..fbf722e15ac 100644 --- a/Documentation/networking/3c509.txt +++ b/Documentation/networking/3c509.txt @@ -25,13 +25,12 @@ models: 3c509B (later revision of the ISA card; supports full-duplex) 3c589 (PCMCIA) 3c589B (later revision of the 3c589; supports full-duplex) - 3c529 (MCA) 3c579 (EISA) Large portions of this documentation were heavily borrowed from the guide written the original author of the 3c509 driver, Donald Becker. The master copy of that document, which contains notes on older versions of the driver, -currently resides on Scyld web server: http://www.scyld.com/network/3c509.html. +currently resides on Scyld web server: http://www.scyld.com/. (1) Special Driver Features @@ -48,11 +47,11 @@ for LILO parameters for doing this: This configures the first found 3c509 card for IRQ 10, base I/O 0x310, and transceiver type 3 (10base2). The flag "0x3c509" must be set to avoid conflicts with other card types when overriding the I/O address. When the driver is -loaded as a module, only the IRQ and transceiver setting may be overridden. -For example, setting two cards to 10base2/IRQ10 and AUI/IRQ11 is done by using -the xcvr and irq module options: +loaded as a module, only the IRQ may be overridden. For example, +setting two cards to IRQ10 and IRQ11 is done by using the irq module +option: - options 3c509 xcvr=3,1 irq=10,11 + options 3c509 irq=10,11 (2) Full-duplex mode @@ -77,6 +76,8 @@ operation. itself full-duplex capable. This is almost certainly one of two things: a full- duplex-capable Ethernet switch (*not* a hub), or a full-duplex-capable NIC on another system that's connected directly to the 3c509B via a crossover cable. + +Full-duplex mode can be enabled using 'ethtool'. /////Extremely important caution concerning full-duplex mode///// Understand that the 3c509B's hardware's full-duplex support is much more @@ -113,6 +114,8 @@ This insured that merely upgrading the driver from an earlier version would never automatically enable full-duplex mode in an existing installation; it must always be explicitly enabled via one of these code in order to be activated. + +The transceiver type can be changed using 'ethtool'. (4a) Interpretation of error messages and common problems @@ -126,7 +129,7 @@ packets faster than they can be removed from the card. This should be rare or impossible in normal operation. Possible causes of this error report are: - a "green" mode enabled that slows the processor down when there is no - keyboard activitiy. + keyboard activity. - some other device or device driver hogging the bus or disabling interrupts. Check /proc/interrupts for excessive interrupt counts. The timer tick diff --git a/Documentation/networking/6pack.txt b/Documentation/networking/6pack.txt index 48ed2b711bd..8f339428fdf 100644 --- a/Documentation/networking/6pack.txt +++ b/Documentation/networking/6pack.txt @@ -1,7 +1,7 @@ This is the 6pack-mini-HOWTO, written by -Andreas Könsgen DG3KQ -Internet: ajk@iehk.rwth-aachen.de +Andreas Könsgen DG3KQ +Internet: ajk@comnets.uni-bremen.de AMPR-net: dg3kq@db0pra.ampr.org AX.25: dg3kq@db0ach.#nrw.deu.eu diff --git a/Documentation/networking/Configurable b/Documentation/networking/Configurable deleted file mode 100644 index 69c0dd466ea..00000000000 --- a/Documentation/networking/Configurable +++ /dev/null @@ -1,34 +0,0 @@ - -There are a few network parameters that can be tuned to better match -the kernel to your system hardware and intended usage. The defaults -are usually a good choice for 99% of the people 99% of the time, but -you should be aware they do exist and can be changed. - -The current list of parameters can be found in the files: - - linux/net/TUNABLE - Documentation/networking/ip-sysctl.txt - -Some of these are accessible via the sysctl interface, and many more are -scheduled to be added in this way. For example, some parameters related -to Address Resolution Protocol (ARP) are very easily viewed and altered. - - # cat /proc/sys/net/ipv4/arp_timeout - 6000 - # echo 7000 > /proc/sys/net/ipv4/arp_timeout - # cat /proc/sys/net/ipv4/arp_timeout - 7000 - -Others are already accessible via the related user space programs. -For example, MAX_WINDOW has a default of 32 k which is a good choice for -modern hardware, but if you have a slow (8 bit) Ethernet card and/or a slow -machine, then this will be far too big for the card to keep up with fast -machines transmitting on the same net, resulting in overruns and receive errors. -A value of about 4 k would be more appropriate, which can be set via: - - # route add -net 192.168.3.0 window 4096 - -The remainder of these can only be presently changed by altering a #define -in the related header file. This means an edit and recompile cycle. - - Paul Gortmaker 06/96 diff --git a/Documentation/networking/DLINK.txt b/Documentation/networking/DLINK.txt deleted file mode 100644 index 55d24433d15..00000000000 --- a/Documentation/networking/DLINK.txt +++ /dev/null @@ -1,203 +0,0 @@ -Released 1994-06-13 - - - CONTENTS: - - 1. Introduction. - 2. License. - 3. Files in this release. - 4. Installation. - 5. Problems and tuning. - 6. Using the drivers with earlier releases. - 7. Acknowledgments. - - - 1. INTRODUCTION. - - This is a set of Ethernet drivers for the D-Link DE-600/DE-620 - pocket adapters, for the parallel port on a Linux based machine. - Some adapter "clones" will also work. Xircom is _not_ a clone... - These drivers _can_ be used as loadable modules, - and were developed for use on Linux 1.1.13 and above. - For use on Linux 1.0.X, or earlier releases, see below. - - I have used these drivers for NFS, ftp, telnet and X-clients on - remote machines. Transmissions with ftp seems to work as - good as can be expected (i.e. > 80k bytes/sec) from a - parallel port...:-) Receive speeds will be about 60-80% of this. - Depending on your machine, somewhat higher speeds can be achieved. - - All comments/fixes to Bjorn Ekwall (bj0rn@blox.se). - - - 2. LICENSE. - - This program is free software; you can redistribute it - and/or modify it under the terms of the GNU General Public - License as published by the Free Software Foundation; either - version 2, or (at your option) any later version. - - This program is distributed in the hope that it will be - useful, but WITHOUT ANY WARRANTY; without even the implied - warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR - PURPOSE. See the GNU General Public License for more - details. - - You should have received a copy of the GNU General Public - License along with this program; if not, write to the Free - Software Foundation, Inc., 675 Mass Ave, Cambridge, MA - 02139, USA. - - - 3. FILES IN THIS RELEASE. - - README.DLINK This file. - de600.c The Source (may it be with You :-) for the DE-600 - de620.c ditto for the DE-620 - de620.h Macros for de620.c - - If you are upgrading from the d-link tar release, there will - also be a "dlink-patches" file that will patch Linux 1.1.18: - linux/drivers/net/Makefile - linux/drivers/net/CONFIG - linux/drivers/net/MODULES - linux/drivers/net/Space.c - linux/config.in - Apply the patch by: - "cd /usr/src; patch -p0 < linux/drivers/net/dlink-patches" - The old source, "linux/drivers/net/d_link.c", can be removed. - - - 4. INSTALLATION. - - o Get the latest net binaries, according to current net.wisdom. - - o Read the NET-2 and Ethernet HOWTOs and modify your setup. - - o If your parallel port has a strange address or irq, - modify "linux/drivers/net/CONFIG" accordingly, or adjust - the parameters in the "tuning" section in the sources. - - If you are going to use the drivers as loadable modules, do _not_ - enable them while doing "make config", but instead make sure that - the drivers are included in "linux/drivers/net/MODULES". - - If you are _not_ going to use the driver(s) as loadable modules, - but instead have them included in the kernel, remember to enable - the drivers while doing "make config". - - o To include networking and DE600/DE620 support in your kernel: - # cd /linux - (as modules:) - # make config (answer yes on CONFIG_NET and CONFIG_INET) - (else included in the kernel:) - # make config (answer yes on CONFIG _NET, _INET and _DE600 or _DE620) - # make clean - # make zImage (or whatever magic you usually do) - - o I use lilo to boot multiple kernels, so that I at least - can have one working kernel :-). If you do too, append - these lines to /etc/lilo/config: - - image = /linux/zImage - label = newlinux - root = /dev/hda2 (or whatever YOU have...) - - # /etc/lilo/install - - o Do "sync" and reboot the new kernel with a D-Link - DE-600/DE-620 pocket adapter connected. - - o The adapter can be configured with ifconfig eth? - where the actual number is decided by the kernel - when the drivers are initialized. - - - 5. "PROBLEMS" AND TUNING, - - o If you see error messages from the driver, and if the traffic - stops on the adapter, try to do "ifconfig" and "route" once - more, just as in "rc.inet1". This should take care of most - problems, including effects from power loss, or adapters that - aren't connected to the printer port in some way or another. - You can somewhat change the behaviour by enabling/disabling - the macro SHUTDOWN_WHEN_LOST in the "tuning" section. - For the DE-600 there is another macro, CHECK_LOST_DE600, - that you might want to read about in the "tuning" section. - - o Some machines have trouble handling the parallel port and - the adapter at high speed. If you experience problems: - - DE-600: - - The adapter is not recognized at boot, i.e. an Ethernet - address of 00:80:c8:... is not shown, try to add another - "; SLOW_DOWN_IO" - at DE600_SLOW_DOWN in the "tuning" section. As a last resort, - uncomment: "#define REALLY_SLOW_IO" (see <asm/io.h> for hints). - - - You experience "timeout" messages: first try to add another - "; SLOW_DOWN_IO" - at DE600_SLOW_DOWN in the "tuning" section, _then_ try to - increase the value (original value: 5) at - "if (tickssofar < 5)" near line 422. - - DE-620: - - Your parallel port might be "sluggish". To cater for - this, there are the macros LOWSPEED and READ_DELAY/WRITE_DELAY - in the "tuning" section. Your first step should be to enable - LOWSPEED, and after that you can "tune" the XXX_DELAY values. - - o If the adapter _is_ recognized at boot but you get messages - about "Network Unreachable", then the problem is probably - _not_ with the driver. Check your net configuration instead - (ifconfig and route) in "rc.inet1". - - o There is some rudimentary support for debugging, look at - the source. Use "-DDE600_DEBUG=3" or "-DDE620_DEBUG=3" - when compiling, or include it in "linux/drivers/net/CONFIG". - IF YOU HAVE PROBLEMS YOU CAN'T SOLVE: PLEASE COMPILE THE DRIVER - WITH DEBUGGING ENABLED, AND SEND ME THE RESULTING OUTPUT! - - - 6. USING THE DRIVERS WITH EARLIER RELEASES. - - The later 1.1.X releases of the Linux kernel include some - changes in the networking layer (a.k.a. NET3). This affects - these drivers in a few places. The hints that follow are - _not_ tested by me, since I don't have the disk space to keep - all releases on-line. - Known needed changes to date: - - release patchfile: some patches will fail, but they should - be easy to apply "by hand", since they are trivial. - (Space.c: d_link_init() is now called de600_probe()) - - de600.c: change "mark_bh(NET_BH)" to "mark_bh(INET_BH)". - - de620.c: (maybe) change the code around "netif_rx(skb);" to be - similar to the code around "dev_rint(...)" in de600.c - - - 7. ACKNOWLEDGMENTS. - - These drivers wouldn't have been done without the base - (and support) from Ross Biro, and D-Link Systems Inc. - The driver relies upon GPL-ed source from D-Link Systems Inc. - and from Russel Nelson at Crynwr Software <nelson@crynwr.com>. - - Additional input also from: - Donald Becker <becker@super.org>, Alan Cox <A.Cox@swansea.ac.uk> - and Fred N. van Kempen <waltje@uWalt.NL.Mugnet.ORG> - - DE-600 alpha release primary victim^H^H^H^H^H^Htester: - - Erik Proper <erikp@cs.kun.nl>. - Good input also from several users, most notably - - Mark Burton <markb@ordern.demon.co.uk>. - - DE-620 alpha release victims^H^H^H^H^H^H^Htesters: - - J. Joshua Kopper <kopper@rtsg.mot.com> - - Olav Kvittem <Olav.Kvittem@uninett.no> - - Germano Caronni <caronni@nessie.cs.id.ethz.ch> - - Jeremy Fitzhardinge <jeremy@suite.sw.oz.au> - - - Happy hacking! - - Bjorn Ekwall == bj0rn@blox.se diff --git a/Documentation/networking/LICENSE.qla3xxx b/Documentation/networking/LICENSE.qla3xxx new file mode 100644 index 00000000000..2f2077e34d8 --- /dev/null +++ b/Documentation/networking/LICENSE.qla3xxx @@ -0,0 +1,46 @@ +Copyright (c) 2003-2006 QLogic Corporation +QLogic Linux Networking HBA Driver + +This program includes a device driver for Linux 2.6 that may be +distributed with QLogic hardware specific firmware binary file. +You may modify and redistribute the device driver code under the +GNU General Public License as published by the Free Software +Foundation (version 2 or a later version). + +You may redistribute the hardware specific firmware binary file +under the following terms: + + 1. Redistribution of source code (only if applicable), + must retain the above copyright notice, this list of + conditions and the following disclaimer. + + 2. Redistribution in binary form must reproduce the above + copyright notice, this list of conditions and the + following disclaimer in the documentation and/or other + materials provided with the distribution. + + 3. The name of QLogic Corporation may not be used to + endorse or promote products derived from this software + without specific prior written permission + +REGARDLESS OF WHAT LICENSING MECHANISM IS USED OR APPLICABLE, +THIS PROGRAM IS PROVIDED BY QLOGIC CORPORATION "AS IS'' AND ANY +EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE +IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A +PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR +BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, +EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED +TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, +DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON +ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, +OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY +OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE +POSSIBILITY OF SUCH DAMAGE. + +USER ACKNOWLEDGES AND AGREES THAT USE OF THIS PROGRAM WILL NOT +CREATE OR GIVE GROUNDS FOR A LICENSE BY IMPLICATION, ESTOPPEL, OR +OTHERWISE IN ANY INTELLECTUAL PROPERTY RIGHTS (PATENT, COPYRIGHT, +TRADE SECRET, MASK WORK, OR OTHER PROPRIETARY RIGHT) EMBODIED IN +ANY OTHER QLOGIC HARDWARE OR SOFTWARE EITHER SOLELY OR IN +COMBINATION WITH THIS PROGRAM. + diff --git a/Documentation/networking/LICENSE.qlcnic b/Documentation/networking/LICENSE.qlcnic new file mode 100644 index 00000000000..2ae3b64983a --- /dev/null +++ b/Documentation/networking/LICENSE.qlcnic @@ -0,0 +1,288 @@ +Copyright (c) 2009-2013 QLogic Corporation +QLogic Linux qlcnic NIC Driver + +You may modify and redistribute the device driver code under the +GNU General Public License (a copy of which is attached hereto as +Exhibit A) published by the Free Software Foundation (version 2). + + +EXHIBIT A + + GNU GENERAL PUBLIC LICENSE + Version 2, June 1991 + + Copyright (C) 1989, 1991 Free Software Foundation, Inc. + 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA + Everyone is permitted to copy and distribute verbatim copies + of this license document, but changing it is not allowed. + + Preamble + + The licenses for most software are designed to take away your +freedom to share and change it. By contrast, the GNU General Public +License is intended to guarantee your freedom to share and change free +software--to make sure the software is free for all its users. This +General Public License applies to most of the Free Software +Foundation's software and to any other program whose authors commit to +using it. (Some other Free Software Foundation software is covered by +the GNU Lesser General Public License instead.) You can apply it to +your programs, too. + + When we speak of free software, we are referring to freedom, not +price. Our General Public Licenses are designed to make sure that you +have the freedom to distribute copies of free software (and charge for +this service if you wish), that you receive source code or can get it +if you want it, that you can change the software or use pieces of it +in new free programs; and that you know you can do these things. + + To protect your rights, we need to make restrictions that forbid +anyone to deny you these rights or to ask you to surrender the rights. +These restrictions translate to certain responsibilities for you if you +distribute copies of the software, or if you modify it. + + For example, if you distribute copies of such a program, whether +gratis or for a fee, you must give the recipients all the rights that +you have. You must make sure that they, too, receive or can get the +source code. And you must show them these terms so they know their +rights. + + We protect your rights with two steps: (1) copyright the software, and +(2) offer you this license which gives you legal permission to copy, +distribute and/or modify the software. + + Also, for each author's protection and ours, we want to make certain +that everyone understands that there is no warranty for this free +software. If the software is modified by someone else and passed on, we +want its recipients to know that what they have is not the original, so +that any problems introduced by others will not reflect on the original +authors' reputations. + + Finally, any free program is threatened constantly by software +patents. We wish to avoid the danger that redistributors of a free +program will individually obtain patent licenses, in effect making the +program proprietary. To prevent this, we have made it clear that any +patent must be licensed for everyone's free use or not licensed at all. + + The precise terms and conditions for copying, distribution and +modification follow. + + GNU GENERAL PUBLIC LICENSE + TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION + + 0. This License applies to any program or other work which contains +a notice placed by the copyright holder saying it may be distributed +under the terms of this General Public License. The "Program", below, +refers to any such program or work, and a "work based on the Program" +means either the Program or any derivative work under copyright law: +that is to say, a work containing the Program or a portion of it, +either verbatim or with modifications and/or translated into another +language. (Hereinafter, translation is included without limitation in +the term "modification".) Each licensee is addressed as "you". + +Activities other than copying, distribution and modification are not +covered by this License; they are outside its scope. The act of +running the Program is not restricted, and the output from the Program +is covered only if its contents constitute a work based on the +Program (independent of having been made by running the Program). +Whether that is true depends on what the Program does. + + 1. You may copy and distribute verbatim copies of the Program's +source code as you receive it, in any medium, provided that you +conspicuously and appropriately publish on each copy an appropriate +copyright notice and disclaimer of warranty; keep intact all the +notices that refer to this License and to the absence of any warranty; +and give any other recipients of the Program a copy of this License +along with the Program. + +You may charge a fee for the physical act of transferring a copy, and +you may at your option offer warranty protection in exchange for a fee. + + 2. You may modify your copy or copies of the Program or any portion +of it, thus forming a work based on the Program, and copy and +distribute such modifications or work under the terms of Section 1 +above, provided that you also meet all of these conditions: + + a) You must cause the modified files to carry prominent notices + stating that you changed the files and the date of any change. + + b) You must cause any work that you distribute or publish, that in + whole or in part contains or is derived from the Program or any + part thereof, to be licensed as a whole at no charge to all third + parties under the terms of this License. + + c) If the modified program normally reads commands interactively + when run, you must cause it, when started running for such + interactive use in the most ordinary way, to print or display an + announcement including an appropriate copyright notice and a + notice that there is no warranty (or else, saying that you provide + a warranty) and that users may redistribute the program under + these conditions, and telling the user how to view a copy of this + License. (Exception: if the Program itself is interactive but + does not normally print such an announcement, your work based on + the Program is not required to print an announcement.) + +These requirements apply to the modified work as a whole. If +identifiable sections of that work are not derived from the Program, +and can be reasonably considered independent and separate works in +themselves, then this License, and its terms, do not apply to those +sections when you distribute them as separate works. But when you +distribute the same sections as part of a whole which is a work based +on the Program, the distribution of the whole must be on the terms of +this License, whose permissions for other licensees extend to the +entire whole, and thus to each and every part regardless of who wrote it. + +Thus, it is not the intent of this section to claim rights or contest +your rights to work written entirely by you; rather, the intent is to +exercise the right to control the distribution of derivative or +collective works based on the Program. + +In addition, mere aggregation of another work not based on the Program +with the Program (or with a work based on the Program) on a volume of +a storage or distribution medium does not bring the other work under +the scope of this License. + + 3. You may copy and distribute the Program (or a work based on it, +under Section 2) in object code or executable form under the terms of +Sections 1 and 2 above provided that you also do one of the following: + + a) Accompany it with the complete corresponding machine-readable + source code, which must be distributed under the terms of Sections + 1 and 2 above on a medium customarily used for software interchange; or, + + b) Accompany it with a written offer, valid for at least three + years, to give any third party, for a charge no more than your + cost of physically performing source distribution, a complete + machine-readable copy of the corresponding source code, to be + distributed under the terms of Sections 1 and 2 above on a medium + customarily used for software interchange; or, + + c) Accompany it with the information you received as to the offer + to distribute corresponding source code. (This alternative is + allowed only for noncommercial distribution and only if you + received the program in object code or executable form with such + an offer, in accord with Subsection b above.) + +The source code for a work means the preferred form of the work for +making modifications to it. For an executable work, complete source +code means all the source code for all modules it contains, plus any +associated interface definition files, plus the scripts used to +control compilation and installation of the executable. However, as a +special exception, the source code distributed need not include +anything that is normally distributed (in either source or binary +form) with the major components (compiler, kernel, and so on) of the +operating system on which the executable runs, unless that component +itself accompanies the executable. + +If distribution of executable or object code is made by offering +access to copy from a designated place, then offering equivalent +access to copy the source code from the same place counts as +distribution of the source code, even though third parties are not +compelled to copy the source along with the object code. + + 4. You may not copy, modify, sublicense, or distribute the Program +except as expressly provided under this License. Any attempt +otherwise to copy, modify, sublicense or distribute the Program is +void, and will automatically terminate your rights under this License. +However, parties who have received copies, or rights, from you under +this License will not have their licenses terminated so long as such +parties remain in full compliance. + + 5. You are not required to accept this License, since you have not +signed it. However, nothing else grants you permission to modify or +distribute the Program or its derivative works. These actions are +prohibited by law if you do not accept this License. Therefore, by +modifying or distributing the Program (or any work based on the +Program), you indicate your acceptance of this License to do so, and +all its terms and conditions for copying, distributing or modifying +the Program or works based on it. + + 6. Each time you redistribute the Program (or any work based on the +Program), the recipient automatically receives a license from the +original licensor to copy, distribute or modify the Program subject to +these terms and conditions. You may not impose any further +restrictions on the recipients' exercise of the rights granted herein. +You are not responsible for enforcing compliance by third parties to +this License. + + 7. If, as a consequence of a court judgment or allegation of patent +infringement or for any other reason (not limited to patent issues), +conditions are imposed on you (whether by court order, agreement or +otherwise) that contradict the conditions of this License, they do not +excuse you from the conditions of this License. If you cannot +distribute so as to satisfy simultaneously your obligations under this +License and any other pertinent obligations, then as a consequence you +may not distribute the Program at all. For example, if a patent +license would not permit royalty-free redistribution of the Program by +all those who receive copies directly or indirectly through you, then +the only way you could satisfy both it and this License would be to +refrain entirely from distribution of the Program. + +If any portion of this section is held invalid or unenforceable under +any particular circumstance, the balance of the section is intended to +apply and the section as a whole is intended to apply in other +circumstances. + +It is not the purpose of this section to induce you to infringe any +patents or other property right claims or to contest validity of any +such claims; this section has the sole purpose of protecting the +integrity of the free software distribution system, which is +implemented by public license practices. Many people have made +generous contributions to the wide range of software distributed +through that system in reliance on consistent application of that +system; it is up to the author/donor to decide if he or she is willing +to distribute software through any other system and a licensee cannot +impose that choice. + +This section is intended to make thoroughly clear what is believed to +be a consequence of the rest of this License. + + 8. If the distribution and/or use of the Program is restricted in +certain countries either by patents or by copyrighted interfaces, the +original copyright holder who places the Program under this License +may add an explicit geographical distribution limitation excluding +those countries, so that distribution is permitted only in or among +countries not thus excluded. In such case, this License incorporates +the limitation as if written in the body of this License. + + 9. The Free Software Foundation may publish revised and/or new versions +of the General Public License from time to time. Such new versions will +be similar in spirit to the present version, but may differ in detail to +address new problems or concerns. + +Each version is given a distinguishing version number. If the Program +specifies a version number of this License which applies to it and "any +later version", you have the option of following the terms and conditions +either of that version or of any later version published by the Free +Software Foundation. If the Program does not specify a version number of +this License, you may choose any version ever published by the Free Software +Foundation. + + 10. If you wish to incorporate parts of the Program into other free +programs whose distribution conditions are different, write to the author +to ask for permission. For software which is copyrighted by the Free +Software Foundation, write to the Free Software Foundation; we sometimes +make exceptions for this. Our decision will be guided by the two goals +of preserving the free status of all derivatives of our free software and +of promoting the sharing and reuse of software generally. + + NO WARRANTY + + 11. BECAUSE THE PROGRAM IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY +FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN +OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES +PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED +OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF +MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS +TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU. SHOULD THE +PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING, +REPAIR OR CORRECTION. + + 12. IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING +WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR +REDISTRIBUTE THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, +INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING +OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED +TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY +YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER +PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE +POSSIBILITY OF SUCH DAMAGES. diff --git a/Documentation/networking/LICENSE.qlge b/Documentation/networking/LICENSE.qlge new file mode 100644 index 00000000000..ce64e4d15b2 --- /dev/null +++ b/Documentation/networking/LICENSE.qlge @@ -0,0 +1,288 @@ +Copyright (c) 2003-2011 QLogic Corporation +QLogic Linux qlge NIC Driver + +You may modify and redistribute the device driver code under the +GNU General Public License (a copy of which is attached hereto as +Exhibit A) published by the Free Software Foundation (version 2). + + +EXHIBIT A + + GNU GENERAL PUBLIC LICENSE + Version 2, June 1991 + + Copyright (C) 1989, 1991 Free Software Foundation, Inc. + 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA + Everyone is permitted to copy and distribute verbatim copies + of this license document, but changing it is not allowed. + + Preamble + + The licenses for most software are designed to take away your +freedom to share and change it. By contrast, the GNU General Public +License is intended to guarantee your freedom to share and change free +software--to make sure the software is free for all its users. This +General Public License applies to most of the Free Software +Foundation's software and to any other program whose authors commit to +using it. (Some other Free Software Foundation software is covered by +the GNU Lesser General Public License instead.) You can apply it to +your programs, too. + + When we speak of free software, we are referring to freedom, not +price. Our General Public Licenses are designed to make sure that you +have the freedom to distribute copies of free software (and charge for +this service if you wish), that you receive source code or can get it +if you want it, that you can change the software or use pieces of it +in new free programs; and that you know you can do these things. + + To protect your rights, we need to make restrictions that forbid +anyone to deny you these rights or to ask you to surrender the rights. +These restrictions translate to certain responsibilities for you if you +distribute copies of the software, or if you modify it. + + For example, if you distribute copies of such a program, whether +gratis or for a fee, you must give the recipients all the rights that +you have. You must make sure that they, too, receive or can get the +source code. And you must show them these terms so they know their +rights. + + We protect your rights with two steps: (1) copyright the software, and +(2) offer you this license which gives you legal permission to copy, +distribute and/or modify the software. + + Also, for each author's protection and ours, we want to make certain +that everyone understands that there is no warranty for this free +software. If the software is modified by someone else and passed on, we +want its recipients to know that what they have is not the original, so +that any problems introduced by others will not reflect on the original +authors' reputations. + + Finally, any free program is threatened constantly by software +patents. We wish to avoid the danger that redistributors of a free +program will individually obtain patent licenses, in effect making the +program proprietary. To prevent this, we have made it clear that any +patent must be licensed for everyone's free use or not licensed at all. + + The precise terms and conditions for copying, distribution and +modification follow. + + GNU GENERAL PUBLIC LICENSE + TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION + + 0. This License applies to any program or other work which contains +a notice placed by the copyright holder saying it may be distributed +under the terms of this General Public License. The "Program", below, +refers to any such program or work, and a "work based on the Program" +means either the Program or any derivative work under copyright law: +that is to say, a work containing the Program or a portion of it, +either verbatim or with modifications and/or translated into another +language. (Hereinafter, translation is included without limitation in +the term "modification".) Each licensee is addressed as "you". + +Activities other than copying, distribution and modification are not +covered by this License; they are outside its scope. The act of +running the Program is not restricted, and the output from the Program +is covered only if its contents constitute a work based on the +Program (independent of having been made by running the Program). +Whether that is true depends on what the Program does. + + 1. You may copy and distribute verbatim copies of the Program's +source code as you receive it, in any medium, provided that you +conspicuously and appropriately publish on each copy an appropriate +copyright notice and disclaimer of warranty; keep intact all the +notices that refer to this License and to the absence of any warranty; +and give any other recipients of the Program a copy of this License +along with the Program. + +You may charge a fee for the physical act of transferring a copy, and +you may at your option offer warranty protection in exchange for a fee. + + 2. You may modify your copy or copies of the Program or any portion +of it, thus forming a work based on the Program, and copy and +distribute such modifications or work under the terms of Section 1 +above, provided that you also meet all of these conditions: + + a) You must cause the modified files to carry prominent notices + stating that you changed the files and the date of any change. + + b) You must cause any work that you distribute or publish, that in + whole or in part contains or is derived from the Program or any + part thereof, to be licensed as a whole at no charge to all third + parties under the terms of this License. + + c) If the modified program normally reads commands interactively + when run, you must cause it, when started running for such + interactive use in the most ordinary way, to print or display an + announcement including an appropriate copyright notice and a + notice that there is no warranty (or else, saying that you provide + a warranty) and that users may redistribute the program under + these conditions, and telling the user how to view a copy of this + License. (Exception: if the Program itself is interactive but + does not normally print such an announcement, your work based on + the Program is not required to print an announcement.) + +These requirements apply to the modified work as a whole. If +identifiable sections of that work are not derived from the Program, +and can be reasonably considered independent and separate works in +themselves, then this License, and its terms, do not apply to those +sections when you distribute them as separate works. But when you +distribute the same sections as part of a whole which is a work based +on the Program, the distribution of the whole must be on the terms of +this License, whose permissions for other licensees extend to the +entire whole, and thus to each and every part regardless of who wrote it. + +Thus, it is not the intent of this section to claim rights or contest +your rights to work written entirely by you; rather, the intent is to +exercise the right to control the distribution of derivative or +collective works based on the Program. + +In addition, mere aggregation of another work not based on the Program +with the Program (or with a work based on the Program) on a volume of +a storage or distribution medium does not bring the other work under +the scope of this License. + + 3. You may copy and distribute the Program (or a work based on it, +under Section 2) in object code or executable form under the terms of +Sections 1 and 2 above provided that you also do one of the following: + + a) Accompany it with the complete corresponding machine-readable + source code, which must be distributed under the terms of Sections + 1 and 2 above on a medium customarily used for software interchange; or, + + b) Accompany it with a written offer, valid for at least three + years, to give any third party, for a charge no more than your + cost of physically performing source distribution, a complete + machine-readable copy of the corresponding source code, to be + distributed under the terms of Sections 1 and 2 above on a medium + customarily used for software interchange; or, + + c) Accompany it with the information you received as to the offer + to distribute corresponding source code. (This alternative is + allowed only for noncommercial distribution and only if you + received the program in object code or executable form with such + an offer, in accord with Subsection b above.) + +The source code for a work means the preferred form of the work for +making modifications to it. For an executable work, complete source +code means all the source code for all modules it contains, plus any +associated interface definition files, plus the scripts used to +control compilation and installation of the executable. However, as a +special exception, the source code distributed need not include +anything that is normally distributed (in either source or binary +form) with the major components (compiler, kernel, and so on) of the +operating system on which the executable runs, unless that component +itself accompanies the executable. + +If distribution of executable or object code is made by offering +access to copy from a designated place, then offering equivalent +access to copy the source code from the same place counts as +distribution of the source code, even though third parties are not +compelled to copy the source along with the object code. + + 4. You may not copy, modify, sublicense, or distribute the Program +except as expressly provided under this License. Any attempt +otherwise to copy, modify, sublicense or distribute the Program is +void, and will automatically terminate your rights under this License. +However, parties who have received copies, or rights, from you under +this License will not have their licenses terminated so long as such +parties remain in full compliance. + + 5. You are not required to accept this License, since you have not +signed it. However, nothing else grants you permission to modify or +distribute the Program or its derivative works. These actions are +prohibited by law if you do not accept this License. Therefore, by +modifying or distributing the Program (or any work based on the +Program), you indicate your acceptance of this License to do so, and +all its terms and conditions for copying, distributing or modifying +the Program or works based on it. + + 6. Each time you redistribute the Program (or any work based on the +Program), the recipient automatically receives a license from the +original licensor to copy, distribute or modify the Program subject to +these terms and conditions. You may not impose any further +restrictions on the recipients' exercise of the rights granted herein. +You are not responsible for enforcing compliance by third parties to +this License. + + 7. If, as a consequence of a court judgment or allegation of patent +infringement or for any other reason (not limited to patent issues), +conditions are imposed on you (whether by court order, agreement or +otherwise) that contradict the conditions of this License, they do not +excuse you from the conditions of this License. If you cannot +distribute so as to satisfy simultaneously your obligations under this +License and any other pertinent obligations, then as a consequence you +may not distribute the Program at all. For example, if a patent +license would not permit royalty-free redistribution of the Program by +all those who receive copies directly or indirectly through you, then +the only way you could satisfy both it and this License would be to +refrain entirely from distribution of the Program. + +If any portion of this section is held invalid or unenforceable under +any particular circumstance, the balance of the section is intended to +apply and the section as a whole is intended to apply in other +circumstances. + +It is not the purpose of this section to induce you to infringe any +patents or other property right claims or to contest validity of any +such claims; this section has the sole purpose of protecting the +integrity of the free software distribution system, which is +implemented by public license practices. Many people have made +generous contributions to the wide range of software distributed +through that system in reliance on consistent application of that +system; it is up to the author/donor to decide if he or she is willing +to distribute software through any other system and a licensee cannot +impose that choice. + +This section is intended to make thoroughly clear what is believed to +be a consequence of the rest of this License. + + 8. If the distribution and/or use of the Program is restricted in +certain countries either by patents or by copyrighted interfaces, the +original copyright holder who places the Program under this License +may add an explicit geographical distribution limitation excluding +those countries, so that distribution is permitted only in or among +countries not thus excluded. In such case, this License incorporates +the limitation as if written in the body of this License. + + 9. The Free Software Foundation may publish revised and/or new versions +of the General Public License from time to time. Such new versions will +be similar in spirit to the present version, but may differ in detail to +address new problems or concerns. + +Each version is given a distinguishing version number. If the Program +specifies a version number of this License which applies to it and "any +later version", you have the option of following the terms and conditions +either of that version or of any later version published by the Free +Software Foundation. If the Program does not specify a version number of +this License, you may choose any version ever published by the Free Software +Foundation. + + 10. If you wish to incorporate parts of the Program into other free +programs whose distribution conditions are different, write to the author +to ask for permission. For software which is copyrighted by the Free +Software Foundation, write to the Free Software Foundation; we sometimes +make exceptions for this. Our decision will be guided by the two goals +of preserving the free status of all derivatives of our free software and +of promoting the sharing and reuse of software generally. + + NO WARRANTY + + 11. BECAUSE THE PROGRAM IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY +FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN +OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES +PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED +OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF +MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS +TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU. SHOULD THE +PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING, +REPAIR OR CORRECTION. + + 12. IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING +WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR +REDISTRIBUTE THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, +INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING +OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED +TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY +YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER +PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE +POSSIBILITY OF SUCH DAMAGES. diff --git a/Documentation/networking/Makefile b/Documentation/networking/Makefile new file mode 100644 index 00000000000..0aa1ac98fc2 --- /dev/null +++ b/Documentation/networking/Makefile @@ -0,0 +1,7 @@ +# kbuild trick to avoid linker error. Can be omitted if a module is built. +obj- := dummy.o + +# Tell kbuild to always build the programs +always := $(hostprogs-y) + +obj-m := timestamping/ diff --git a/Documentation/networking/NAPI_HOWTO.txt b/Documentation/networking/NAPI_HOWTO.txt deleted file mode 100644 index 54376e8249c..00000000000 --- a/Documentation/networking/NAPI_HOWTO.txt +++ /dev/null @@ -1,766 +0,0 @@ -HISTORY: -February 16/2002 -- revision 0.2.1: -COR typo corrected -February 10/2002 -- revision 0.2: -some spell checking ;-> -January 12/2002 -- revision 0.1 -This is still work in progress so may change. -To keep up to date please watch this space. - -Introduction to NAPI -==================== - -NAPI is a proven (www.cyberus.ca/~hadi/usenix-paper.tgz) technique -to improve network performance on Linux. For more details please -read that paper. -NAPI provides a "inherent mitigation" which is bound by system capacity -as can be seen from the following data collected by Robert on Gigabit -ethernet (e1000): - - Psize Ipps Tput Rxint Txint Done Ndone - --------------------------------------------------------------- - 60 890000 409362 17 27622 7 6823 - 128 758150 464364 21 9301 10 7738 - 256 445632 774646 42 15507 21 12906 - 512 232666 994445 241292 19147 241192 1062 - 1024 119061 1000003 872519 19258 872511 0 - 1440 85193 1000003 946576 19505 946569 0 - - -Legend: -"Ipps" stands for input packets per second. -"Tput" == packets out of total 1M that made it out. -"txint" == transmit completion interrupts seen -"Done" == The number of times that the poll() managed to pull all -packets out of the rx ring. Note from this that the lower the -load the more we could clean up the rxring -"Ndone" == is the converse of "Done". Note again, that the higher -the load the more times we couldnt clean up the rxring. - -Observe that: -when the NIC receives 890Kpackets/sec only 17 rx interrupts are generated. -The system cant handle the processing at 1 interrupt/packet at that load level. -At lower rates on the other hand, rx interrupts go up and therefore the -interrupt/packet ratio goes up (as observable from that table). So there is -possibility that under low enough input, you get one poll call for each -input packet caused by a single interrupt each time. And if the system -cant handle interrupt per packet ratio of 1, then it will just have to -chug along .... - - -0) Prerequisites: -================== -A driver MAY continue using the old 2.4 technique for interfacing -to the network stack and not benefit from the NAPI changes. -NAPI additions to the kernel do not break backward compatibility. -NAPI, however, requires the following features to be available: - -A) DMA ring or enough RAM to store packets in software devices. - -B) Ability to turn off interrupts or maybe events that send packets up -the stack. - -NAPI processes packet events in what is known as dev->poll() method. -Typically, only packet receive events are processed in dev->poll(). -The rest of the events MAY be processed by the regular interrupt handler -to reduce processing latency (justified also because there are not that -many of them). -Note, however, NAPI does not enforce that dev->poll() only processes -receive events. -Tests with the tulip driver indicated slightly increased latency if -all of the interrupt handler is moved to dev->poll(). Also MII handling -gets a little trickier. -The example used in this document is to move the receive processing only -to dev->poll(); this is shown with the patch for the tulip driver. -For an example of code that moves all the interrupt driver to -dev->poll() look at the ported e1000 code. - -There are caveats that might force you to go with moving everything to -dev->poll(). Different NICs work differently depending on their status/event -acknowledgement setup. -There are two types of event register ACK mechanisms. - I) what is known as Clear-on-read (COR). - when you read the status/event register, it clears everything! - The natsemi and sunbmac NICs are known to do this. - In this case your only choice is to move all to dev->poll() - - II) Clear-on-write (COW) - i) you clear the status by writing a 1 in the bit-location you want. - These are the majority of the NICs and work the best with NAPI. - Put only receive events in dev->poll(); leave the rest in - the old interrupt handler. - ii) whatever you write in the status register clears every thing ;-> - Cant seem to find any supported by Linux which do this. If - someone knows such a chip email us please. - Move all to dev->poll() - -C) Ability to detect new work correctly. -NAPI works by shutting down event interrupts when theres work and -turning them on when theres none. -New packets might show up in the small window while interrupts were being -re-enabled (refer to appendix 2). A packet might sneak in during the period -we are enabling interrupts. We only get to know about such a packet when the -next new packet arrives and generates an interrupt. -Essentially, there is a small window of opportunity for a race condition -which for clarity we'll refer to as the "rotting packet". - -This is a very important topic and appendix 2 is dedicated for more -discussion. - -Locking rules and environmental guarantees -========================================== - --Guarantee: Only one CPU at any time can call dev->poll(); this is because -only one CPU can pick the initial interrupt and hence the initial -netif_rx_schedule(dev); -- The core layer invokes devices to send packets in a round robin format. -This implies receive is totaly lockless because of the guarantee only that -one CPU is executing it. -- contention can only be the result of some other CPU accessing the rx -ring. This happens only in close() and suspend() (when these methods -try to clean the rx ring); -****guarantee: driver authors need not worry about this; synchronization -is taken care for them by the top net layer. --local interrupts are enabled (if you dont move all to dev->poll()). For -example link/MII and txcomplete continue functioning just same old way. -This improves the latency of processing these events. It is also assumed that -the receive interrupt is the largest cause of noise. Note this might not -always be true. -[according to Manfred Spraul, the winbond insists on sending one -txmitcomplete interrupt for each packet (although this can be mitigated)]. -For these broken drivers, move all to dev->poll(). - -For the rest of this text, we'll assume that dev->poll() only -processes receive events. - -new methods introduce by NAPI -============================= - -a) netif_rx_schedule(dev) -Called by an IRQ handler to schedule a poll for device - -b) netif_rx_schedule_prep(dev) -puts the device in a state which allows for it to be added to the -CPU polling list if it is up and running. You can look at this as -the first half of netif_rx_schedule(dev) above; the second half -being c) below. - -c) __netif_rx_schedule(dev) -Add device to the poll list for this CPU; assuming that _prep above -has already been called and returned 1. - -d) netif_rx_reschedule(dev, undo) -Called to reschedule polling for device specifically for some -deficient hardware. Read Appendix 2 for more details. - -e) netif_rx_complete(dev) - -Remove interface from the CPU poll list: it must be in the poll list -on current cpu. This primitive is called by dev->poll(), when -it completes its work. The device cannot be out of poll list at this -call, if it is then clearly it is a BUG(). You'll know ;-> - -All these above nethods are used below. So keep reading for clarity. - -Device driver changes to be made when porting NAPI -================================================== - -Below we describe what kind of changes are required for NAPI to work. - -1) introduction of dev->poll() method -===================================== - -This is the method that is invoked by the network core when it requests -for new packets from the driver. A driver is allowed to send upto -dev->quota packets by the current CPU before yielding to the network -subsystem (so other devices can also get opportunity to send to the stack). - -dev->poll() prototype looks as follows: -int my_poll(struct net_device *dev, int *budget) - -budget is the remaining number of packets the network subsystem on the -current CPU can send up the stack before yielding to other system tasks. -*Each driver is responsible for decrementing budget by the total number of -packets sent. - Total number of packets cannot exceed dev->quota. - -dev->poll() method is invoked by the top layer, the driver just sends if it -can to the stack the packet quantity requested. - -more on dev->poll() below after the interrupt changes are explained. - -2) registering dev->poll() method -=================================== - -dev->poll should be set in the dev->probe() method. -e.g: -dev->open = my_open; -. -. -/* two new additions */ -/* first register my poll method */ -dev->poll = my_poll; -/* next register my weight/quanta; can be overridden in /proc */ -dev->weight = 16; -. -. -dev->stop = my_close; - - - -3) scheduling dev->poll() -============================= -This involves modifying the interrupt handler and the code -path which takes the packet off the NIC and sends them to the -stack. - -it's important at this point to introduce the classical D Becker -interrupt processor: - ------------------- -static irqreturn_t -netdevice_interrupt(int irq, void *dev_id, struct pt_regs *regs) -{ - - struct net_device *dev = (struct net_device *)dev_instance; - struct my_private *tp = (struct my_private *)dev->priv; - - int work_count = my_work_count; - status = read_interrupt_status_reg(); - if (status == 0) - return IRQ_NONE; /* Shared IRQ: not us */ - if (status == 0xffff) - return IRQ_HANDLED; /* Hot unplug */ - if (status & error) - do_some_error_handling() - - do { - acknowledge_ints_ASAP(); - - if (status & link_interrupt) { - spin_lock(&tp->link_lock); - do_some_link_stat_stuff(); - spin_lock(&tp->link_lock); - } - - if (status & rx_interrupt) { - receive_packets(dev); - } - - if (status & rx_nobufs) { - make_rx_buffs_avail(); - } - - if (status & tx_related) { - spin_lock(&tp->lock); - tx_ring_free(dev); - if (tx_died) - restart_tx(); - spin_unlock(&tp->lock); - } - - status = read_interrupt_status_reg(); - - } while (!(status & error) || more_work_to_be_done); - return IRQ_HANDLED; -} - ----------------------------------------------------------------------- - -We now change this to what is shown below to NAPI-enable it: - ----------------------------------------------------------------------- -static irqreturn_t -netdevice_interrupt(int irq, void *dev_id, struct pt_regs *regs) -{ - struct net_device *dev = (struct net_device *)dev_instance; - struct my_private *tp = (struct my_private *)dev->priv; - - status = read_interrupt_status_reg(); - if (status == 0) - return IRQ_NONE; /* Shared IRQ: not us */ - if (status == 0xffff) - return IRQ_HANDLED; /* Hot unplug */ - if (status & error) - do_some_error_handling(); - - do { -/************************ start note *********************************/ - acknowledge_ints_ASAP(); // dont ack rx and rxnobuff here -/************************ end note *********************************/ - - if (status & link_interrupt) { - spin_lock(&tp->link_lock); - do_some_link_stat_stuff(); - spin_unlock(&tp->link_lock); - } -/************************ start note *********************************/ - if (status & rx_interrupt || (status & rx_nobuffs)) { - if (netif_rx_schedule_prep(dev)) { - - /* disable interrupts caused - * by arriving packets */ - disable_rx_and_rxnobuff_ints(); - /* tell system we have work to be done. */ - __netif_rx_schedule(dev); - } else { - printk("driver bug! interrupt while in poll\n"); - /* FIX by disabling interrupts */ - disable_rx_and_rxnobuff_ints(); - } - } -/************************ end note note *********************************/ - - if (status & tx_related) { - spin_lock(&tp->lock); - tx_ring_free(dev); - - if (tx_died) - restart_tx(); - spin_unlock(&tp->lock); - } - - status = read_interrupt_status_reg(); - -/************************ start note *********************************/ - } while (!(status & error) || more_work_to_be_done(status)); -/************************ end note note *********************************/ - return IRQ_HANDLED; -} - ---------------------------------------------------------------------- - - -We note several things from above: - -I) Any interrupt source which is caused by arriving packets is now -turned off when it occurs. Depending on the hardware, there could be -several reasons that arriving packets would cause interrupts; these are the -interrupt sources we wish to avoid. The two common ones are a) a packet -arriving (rxint) b) a packet arriving and finding no DMA buffers available -(rxnobuff) . -This means also acknowledge_ints_ASAP() will not clear the status -register for those two items above; clearing is done in the place where -proper work is done within NAPI; at the poll() and refill_rx_ring() -discussed further below. -netif_rx_schedule_prep() returns 1 if device is in running state and -gets successfully added to the core poll list. If we get a zero value -we can _almost_ assume are already added to the list (instead of not running. -Logic based on the fact that you shouldn't get interrupt if not running) -We rectify this by disabling rx and rxnobuf interrupts. - -II) that receive_packets(dev) and make_rx_buffs_avail() may have disappeared. -These functionalities are still around actually...... - -infact, receive_packets(dev) is very close to my_poll() and -make_rx_buffs_avail() is invoked from my_poll() - -4) converting receive_packets() to dev->poll() -=============================================== - -We need to convert the classical D Becker receive_packets(dev) to my_poll() - -First the typical receive_packets() below: -------------------------------------------------------------------- - -/* this is called by interrupt handler */ -static void receive_packets (struct net_device *dev) -{ - - struct my_private *tp = (struct my_private *)dev->priv; - rx_ring = tp->rx_ring; - cur_rx = tp->cur_rx; - int entry = cur_rx % RX_RING_SIZE; - int received = 0; - int rx_work_limit = tp->dirty_rx + RX_RING_SIZE - tp->cur_rx; - - while (rx_ring_not_empty) { - u32 rx_status; - unsigned int rx_size; - unsigned int pkt_size; - struct sk_buff *skb; - /* read size+status of next frame from DMA ring buffer */ - /* the number 16 and 4 are just examples */ - rx_status = le32_to_cpu (*(u32 *) (rx_ring + ring_offset)); - rx_size = rx_status >> 16; - pkt_size = rx_size - 4; - - /* process errors */ - if ((rx_size > (MAX_ETH_FRAME_SIZE+4)) || - (!(rx_status & RxStatusOK))) { - netdrv_rx_err (rx_status, dev, tp, ioaddr); - return; - } - - if (--rx_work_limit < 0) - break; - - /* grab a skb */ - skb = dev_alloc_skb (pkt_size + 2); - if (skb) { - . - . - netif_rx (skb); - . - . - } else { /* OOM */ - /*seems very driver specific ... some just pass - whatever is on the ring already. */ - } - - /* move to the next skb on the ring */ - entry = (++tp->cur_rx) % RX_RING_SIZE; - received++ ; - - } - - /* store current ring pointer state */ - tp->cur_rx = cur_rx; - - /* Refill the Rx ring buffers if they are needed */ - refill_rx_ring(); - . - . - -} -------------------------------------------------------------------- -We change it to a new one below; note the additional parameter in -the call. - -------------------------------------------------------------------- - -/* this is called by the network core */ -static int my_poll (struct net_device *dev, int *budget) -{ - - struct my_private *tp = (struct my_private *)dev->priv; - rx_ring = tp->rx_ring; - cur_rx = tp->cur_rx; - int entry = cur_rx % RX_BUF_LEN; - /* maximum packets to send to the stack */ -/************************ note note *********************************/ - int rx_work_limit = dev->quota; - -/************************ end note note *********************************/ - do { // outer beginning loop starts here - - clear_rx_status_register_bit(); - - while (rx_ring_not_empty) { - u32 rx_status; - unsigned int rx_size; - unsigned int pkt_size; - struct sk_buff *skb; - /* read size+status of next frame from DMA ring buffer */ - /* the number 16 and 4 are just examples */ - rx_status = le32_to_cpu (*(u32 *) (rx_ring + ring_offset)); - rx_size = rx_status >> 16; - pkt_size = rx_size - 4; - - /* process errors */ - if ((rx_size > (MAX_ETH_FRAME_SIZE+4)) || - (!(rx_status & RxStatusOK))) { - netdrv_rx_err (rx_status, dev, tp, ioaddr); - return 1; - } - -/************************ note note *********************************/ - if (--rx_work_limit < 0) { /* we got packets, but no quota */ - /* store current ring pointer state */ - tp->cur_rx = cur_rx; - - /* Refill the Rx ring buffers if they are needed */ - refill_rx_ring(dev); - goto not_done; - } -/********************** end note **********************************/ - - /* grab a skb */ - skb = dev_alloc_skb (pkt_size + 2); - if (skb) { - . - . -/************************ note note *********************************/ - netif_receive_skb (skb); -/********************** end note **********************************/ - . - . - } else { /* OOM */ - /*seems very driver specific ... common is just pass - whatever is on the ring already. */ - } - - /* move to the next skb on the ring */ - entry = (++tp->cur_rx) % RX_RING_SIZE; - received++ ; - - } - - /* store current ring pointer state */ - tp->cur_rx = cur_rx; - - /* Refill the Rx ring buffers if they are needed */ - refill_rx_ring(dev); - - /* no packets on ring; but new ones can arrive since we last - checked */ - status = read_interrupt_status_reg(); - if (rx status is not set) { - /* If something arrives in this narrow window, - an interrupt will be generated */ - goto done; - } - /* done! at least thats what it looks like ;-> - if new packets came in after our last check on status bits - they'll be caught by the while check and we go back and clear them - since we havent exceeded our quota */ - } while (rx_status_is_set); - -done: - -/************************ note note *********************************/ - dev->quota -= received; - *budget -= received; - - /* If RX ring is not full we are out of memory. */ - if (tp->rx_buffers[tp->dirty_rx % RX_RING_SIZE].skb == NULL) - goto oom; - - /* we are happy/done, no more packets on ring; put us back - to where we can start processing interrupts again */ - netif_rx_complete(dev); - enable_rx_and_rxnobuf_ints(); - - /* The last op happens after poll completion. Which means the following: - * 1. it can race with disabling irqs in irq handler (which are done to - * schedule polls) - * 2. it can race with dis/enabling irqs in other poll threads - * 3. if an irq raised after the begining of the outer beginning - * loop(marked in the code above), it will be immediately - * triggered here. - * - * Summarizing: the logic may results in some redundant irqs both - * due to races in masking and due to too late acking of already - * processed irqs. The good news: no events are ever lost. - */ - - return 0; /* done */ - -not_done: - if (tp->cur_rx - tp->dirty_rx > RX_RING_SIZE/2 || - tp->rx_buffers[tp->dirty_rx % RX_RING_SIZE].skb == NULL) - refill_rx_ring(dev); - - if (!received) { - printk("received==0\n"); - received = 1; - } - dev->quota -= received; - *budget -= received; - return 1; /* not_done */ - -oom: - /* Start timer, stop polling, but do not enable rx interrupts. */ - start_poll_timer(dev); - return 0; /* we'll take it from here so tell core "done"*/ - -/************************ End note note *********************************/ -} -------------------------------------------------------------------- - -From above we note that: -0) rx_work_limit = dev->quota -1) refill_rx_ring() is in charge of clearing the bit for rxnobuff when -it does the work. -2) We have a done and not_done state. -3) instead of netif_rx() we call netif_receive_skb() to pass the skb. -4) we have a new way of handling oom condition -5) A new outer for (;;) loop has been added. This serves the purpose of -ensuring that if a new packet has come in, after we are all set and done, -and we have not exceeded our quota that we continue sending packets up. - - ------------------------------------------------------------ -Poll timer code will need to do the following: - -a) - - if (tp->cur_rx - tp->dirty_rx > RX_RING_SIZE/2 || - tp->rx_buffers[tp->dirty_rx % RX_RING_SIZE].skb == NULL) - refill_rx_ring(dev); - - /* If RX ring is not full we are still out of memory. - Restart the timer again. Else we re-add ourselves - to the master poll list. - */ - - if (tp->rx_buffers[tp->dirty_rx % RX_RING_SIZE].skb == NULL) - restart_timer(); - - else netif_rx_schedule(dev); /* we are back on the poll list */ - -5) dev->close() and dev->suspend() issues -========================================== -The driver writter neednt worry about this. The top net layer takes -care of it. - -6) Adding new Stats to /proc -============================= -In order to debug some of the new features, we introduce new stats -that need to be collected. -TODO: Fill this later. - -APPENDIX 1: discussion on using ethernet HW FC -============================================== -Most chips with FC only send a pause packet when they run out of Rx buffers. -Since packets are pulled off the DMA ring by a softirq in NAPI, -if the system is slow in grabbing them and we have a high input -rate (faster than the system's capacity to remove packets), then theoretically -there will only be one rx interrupt for all packets during a given packetstorm. -Under low load, we might have a single interrupt per packet. -FC should be programmed to apply in the case when the system cant pull out -packets fast enough i.e send a pause only when you run out of rx buffers. -Note FC in itself is a good solution but we have found it to not be -much of a commodity feature (both in NICs and switches) and hence falls -under the same category as using NIC based mitigation. Also experiments -indicate that its much harder to resolve the resource allocation -issue (aka lazy receiving that NAPI offers) and hence quantify its usefullness -proved harder. In any case, FC works even better with NAPI but is not -necessary. - - -APPENDIX 2: the "rotting packet" race-window avoidance scheme -============================================================= - -There are two types of associations seen here - -1) status/int which honors level triggered IRQ - -If a status bit for receive or rxnobuff is set and the corresponding -interrupt-enable bit is not on, then no interrupts will be generated. However, -as soon as the "interrupt-enable" bit is unmasked, an immediate interrupt is -generated. [assuming the status bit was not turned off]. -Generally the concept of level triggered IRQs in association with a status and -interrupt-enable CSR register set is used to avoid the race. - -If we take the example of the tulip: -"pending work" is indicated by the status bit(CSR5 in tulip). -the corresponding interrupt bit (CSR7 in tulip) might be turned off (but -the CSR5 will continue to be turned on with new packet arrivals even if -we clear it the first time) -Very important is the fact that if we turn on the interrupt bit on when -status is set that an immediate irq is triggered. - -If we cleared the rx ring and proclaimed there was "no more work -to be done" and then went on to do a few other things; then when we enable -interrupts, there is a possibility that a new packet might sneak in during -this phase. It helps to look at the pseudo code for the tulip poll -routine: - --------------------------- - do { - ACK; - while (ring_is_not_empty()) { - work-work-work - if quota is exceeded: exit, no touching irq status/mask - } - /* No packets, but new can arrive while we are doing this*/ - CSR5 := read - if (CSR5 is not set) { - /* If something arrives in this narrow window here, - * where the comments are ;-> irq will be generated */ - unmask irqs; - exit poll; - } - } while (rx_status_is_set); ------------------------- - -CSR5 bit of interest is only the rx status. -If you look at the last if statement: -you just finished grabbing all the packets from the rx ring .. you check if -status bit says theres more packets just in ... it says none; you then -enable rx interrupts again; if a new packet just came in during this check, -we are counting that CSR5 will be set in that small window of opportunity -and that by re-enabling interrupts, we would actually triger an interrupt -to register the new packet for processing. - -[The above description nay be very verbose, if you have better wording -that will make this more understandable, please suggest it.] - -2) non-capable hardware - -These do not generally respect level triggered IRQs. Normally, -irqs may be lost while being masked and the only way to leave poll is to do -a double check for new input after netif_rx_complete() is invoked -and re-enable polling (after seeing this new input). - -Sample code: - ---------- - . - . -restart_poll: - while (ring_is_not_empty()) { - work-work-work - if quota is exceeded: exit, not touching irq status/mask - } - . - . - . - enable_rx_interrupts() - netif_rx_complete(dev); - if (ring_has_new_packet() && netif_rx_reschedule(dev, received)) { - disable_rx_and_rxnobufs() - goto restart_poll - } while (rx_status_is_set); ---------- - -Basically netif_rx_complete() removes us from the poll list, but because a -new packet which will never be caught due to the possibility of a race -might come in, we attempt to re-add ourselves to the poll list. - - - - -APPENDIX 3: Scheduling issues. -============================== -As seen NAPI moves processing to softirq level. Linux uses the ksoftirqd as the -general solution to schedule softirq's to run before next interrupt and by putting -them under scheduler control. Also this prevents consecutive softirq's from -monopolize the CPU. This also have the effect that the priority of ksoftirq needs -to be considered when running very CPU-intensive applications and networking to -get the proper balance of softirq/user balance. Increasing ksoftirq priority to 0 -(eventually more) is reported cure problems with low network performance at high -CPU load. - -Most used processes in a GIGE router: -USER PID %CPU %MEM SIZE RSS TTY STAT START TIME COMMAND -root 3 0.2 0.0 0 0 ? RWN Aug 15 602:00 (ksoftirqd_CPU0) -root 232 0.0 7.9 41400 40884 ? S Aug 15 74:12 gated - --------------------------------------------------------------------- - -relevant sites: -================== -ftp://robur.slu.se/pub/Linux/net-development/NAPI/ - - --------------------------------------------------------------------- -TODO: Write net-skeleton.c driver. -------------------------------------------------------------- - -Authors: -======== -Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> -Jamal Hadi Salim <hadi@cyberus.ca> -Robert Olsson <Robert.Olsson@data.slu.se> - -Acknowledgements: -================ -People who made this document better: - -Lennert Buytenhek <buytenh@gnu.org> -Andrew Morton <akpm@zip.com.au> -Manfred Spraul <manfred@colorfullife.com> -Donald Becker <becker@scyld.com> -Jeff Garzik <jgarzik@pobox.com> diff --git a/Documentation/networking/README.ipw2100 b/Documentation/networking/README.ipw2100 index 2046948b020..6f85e1d0603 100644 --- a/Documentation/networking/README.ipw2100 +++ b/Documentation/networking/README.ipw2100 @@ -1,27 +1,81 @@ -=========================== -Intel(R) PRO/Wireless 2100 Network Connection Driver for Linux +Intel(R) PRO/Wireless 2100 Driver for Linux in support of: + +Intel(R) PRO/Wireless 2100 Network Connection + +Copyright (C) 2003-2006, Intel Corporation + README.ipw2100 -March 14, 2005 +Version: git-1.1.5 +Date : January 25, 2006 -=========================== Index ---------------------------- -0. Introduction -1. Release 1.1.0 Current Features -2. Command Line Parameters -3. Sysfs Helper Files -4. Radio Kill Switch -5. Dynamic Firmware -6. Power Management -7. Support -8. License - - -=========================== -0. Introduction ------------- ----- ----- ---- --- -- - +----------------------------------------------- +0. IMPORTANT INFORMATION BEFORE USING THIS DRIVER +1. Introduction +2. Release git-1.1.5 Current Features +3. Command Line Parameters +4. Sysfs Helper Files +5. Radio Kill Switch +6. Dynamic Firmware +7. Power Management +8. Support +9. License + + +0. IMPORTANT INFORMATION BEFORE USING THIS DRIVER +----------------------------------------------- + +Important Notice FOR ALL USERS OR DISTRIBUTORS!!!! + +Intel wireless LAN adapters are engineered, manufactured, tested, and +quality checked to ensure that they meet all necessary local and +governmental regulatory agency requirements for the regions that they +are designated and/or marked to ship into. Since wireless LANs are +generally unlicensed devices that share spectrum with radars, +satellites, and other licensed and unlicensed devices, it is sometimes +necessary to dynamically detect, avoid, and limit usage to avoid +interference with these devices. In many instances Intel is required to +provide test data to prove regional and local compliance to regional and +governmental regulations before certification or approval to use the +product is granted. Intel's wireless LAN's EEPROM, firmware, and +software driver are designed to carefully control parameters that affect +radio operation and to ensure electromagnetic compliance (EMC). These +parameters include, without limitation, RF power, spectrum usage, +channel scanning, and human exposure. + +For these reasons Intel cannot permit any manipulation by third parties +of the software provided in binary format with the wireless WLAN +adapters (e.g., the EEPROM and firmware). Furthermore, if you use any +patches, utilities, or code with the Intel wireless LAN adapters that +have been manipulated by an unauthorized party (i.e., patches, +utilities, or code (including open source code modifications) which have +not been validated by Intel), (i) you will be solely responsible for +ensuring the regulatory compliance of the products, (ii) Intel will bear +no liability, under any theory of liability for any issues associated +with the modified products, including without limitation, claims under +the warranty and/or issues arising from regulatory non-compliance, and +(iii) Intel will not provide or be required to assist in providing +support to any third parties for such modified products. + +Note: Many regulatory agencies consider Wireless LAN adapters to be +modules, and accordingly, condition system-level regulatory approval +upon receipt and review of test data documenting that the antennas and +system configuration do not cause the EMC and radio operation to be +non-compliant. + +The drivers available for download from SourceForge are provided as a +part of a development project. Conformance to local regulatory +requirements is the responsibility of the individual developer. As +such, if you are interested in deploying or shipping a driver as part of +solution intended to be used for purposes other than development, please +obtain a tested driver from Intel Customer Support at: + +http://www.intel.com/support/wireless/sb/CS-006408.htm + +1. Introduction +----------------------------------------------- This document provides a brief overview of the features supported by the IPW2100 driver project. The main project website, where the latest @@ -34,9 +88,8 @@ potential fixes and patches, as well as links to the development mailing list for the driver project. -=========================== -1. Release 1.1.0 Current Supported Features ---------------------------- +2. Release git-1.1.5 Current Supported Features +----------------------------------------------- - Managed (BSS) and Ad-Hoc (IBSS) - WEP (shared key and open) - Wireless Tools support @@ -51,9 +104,8 @@ on the amount of validation and interoperability testing that has been performed on a given feature. -=========================== -2. Command Line Parameters ---------------------------- +3. Command Line Parameters +----------------------------------------------- If the driver is built as a module, the following optional parameters are used by entering them on the command line with the modprobe command using this @@ -75,9 +127,9 @@ associate boolean associate=0 /* Do NOT auto associate */ disable boolean disable=1 /* Do not power the HW */ -=========================== -3. Sysfs Helper Files +4. Sysfs Helper Files --------------------------- +----------------------------------------------- There are several ways to control the behavior of the driver. Many of the general capabilities are exposed through the Wireless Tools (iwconfig). There @@ -120,9 +172,8 @@ For the device level files, see /sys/bus/pci/drivers/ipw2100: based RF kill from ON -> OFF -> ON, the radio will NOT come back on -=========================== -4. Radio Kill Switch ---------------------------- +5. Radio Kill Switch +----------------------------------------------- Most laptops provide the ability for the user to physically disable the radio. Some vendors have implemented this as a physical switch that requires no software to turn the radio off and on. On other laptops, however, the switch @@ -134,9 +185,8 @@ See the Sysfs helper file 'rf_kill' for determining the state of the RF switch on your system. -=========================== -5. Dynamic Firmware ---------------------------- +6. Dynamic Firmware +----------------------------------------------- As the firmware is licensed under a restricted use license, it can not be included within the kernel sources. To enable the IPW2100 you will need a firmware image to load into the wireless NIC's processors. @@ -146,9 +196,8 @@ You can obtain these images from <http://ipw2100.sf.net/firmware.php>. See INSTALL for instructions on installing the firmware. -=========================== -6. Power Management ---------------------------- +7. Power Management +----------------------------------------------- The IPW2100 supports the configuration of the Power Save Protocol through a private wireless extension interface. The IPW2100 supports the following different modes: @@ -200,9 +249,8 @@ xxxx/yyyy will be replaced with 'off' -- the level reported will be the active level if `iwconfig eth1 power on` is invoked. -=========================== -7. Support ---------------------------- +8. Support +----------------------------------------------- For general development information and support, go to: @@ -218,11 +266,10 @@ For installation support on the ipw2100 1.1.0 driver on Linux kernels http://supportmail.intel.com -=========================== -8. License ---------------------------- +9. License +----------------------------------------------- - Copyright(c) 2003 - 2005 Intel Corporation. All rights reserved. + Copyright(c) 2003 - 2006 Intel Corporation. All rights reserved. This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License (version 2) as diff --git a/Documentation/networking/README.ipw2200 b/Documentation/networking/README.ipw2200 index 6916080c5f0..b7658bed490 100644 --- a/Documentation/networking/README.ipw2200 +++ b/Documentation/networking/README.ipw2200 @@ -1,33 +1,91 @@ Intel(R) PRO/Wireless 2915ABG Driver for Linux in support of: -Intel(R) PRO/Wireless 2200BG Network Connection -Intel(R) PRO/Wireless 2915ABG Network Connection +Intel(R) PRO/Wireless 2200BG Network Connection +Intel(R) PRO/Wireless 2915ABG Network Connection -Note: The Intel(R) PRO/Wireless 2915ABG Driver for Linux and Intel(R) -PRO/Wireless 2200BG Driver for Linux is a unified driver that works on -both hardware adapters listed above. In this document the Intel(R) -PRO/Wireless 2915ABG Driver for Linux will be used to reference the +Note: The Intel(R) PRO/Wireless 2915ABG Driver for Linux and Intel(R) +PRO/Wireless 2200BG Driver for Linux is a unified driver that works on +both hardware adapters listed above. In this document the Intel(R) +PRO/Wireless 2915ABG Driver for Linux will be used to reference the unified driver. -Copyright (C) 2004-2005, Intel Corporation +Copyright (C) 2004-2006, Intel Corporation README.ipw2200 -Version: 1.0.0 -Date : January 31, 2005 +Version: 1.1.2 +Date : March 30, 2006 Index ----------------------------------------------- +0. IMPORTANT INFORMATION BEFORE USING THIS DRIVER 1. Introduction 1.1. Overview of features 1.2. Module parameters 1.3. Wireless Extension Private Methods 1.4. Sysfs Helper Files -2. About the Version Numbers -3. Support -4. License +1.5. Supported channels +2. Ad-Hoc Networking +3. Interacting with Wireless Tools +3.1. iwconfig mode +3.2. iwconfig sens +4. About the Version Numbers +5. Firmware installation +6. Support +7. License + + +0. IMPORTANT INFORMATION BEFORE USING THIS DRIVER +----------------------------------------------- + +Important Notice FOR ALL USERS OR DISTRIBUTORS!!!! + +Intel wireless LAN adapters are engineered, manufactured, tested, and +quality checked to ensure that they meet all necessary local and +governmental regulatory agency requirements for the regions that they +are designated and/or marked to ship into. Since wireless LANs are +generally unlicensed devices that share spectrum with radars, +satellites, and other licensed and unlicensed devices, it is sometimes +necessary to dynamically detect, avoid, and limit usage to avoid +interference with these devices. In many instances Intel is required to +provide test data to prove regional and local compliance to regional and +governmental regulations before certification or approval to use the +product is granted. Intel's wireless LAN's EEPROM, firmware, and +software driver are designed to carefully control parameters that affect +radio operation and to ensure electromagnetic compliance (EMC). These +parameters include, without limitation, RF power, spectrum usage, +channel scanning, and human exposure. + +For these reasons Intel cannot permit any manipulation by third parties +of the software provided in binary format with the wireless WLAN +adapters (e.g., the EEPROM and firmware). Furthermore, if you use any +patches, utilities, or code with the Intel wireless LAN adapters that +have been manipulated by an unauthorized party (i.e., patches, +utilities, or code (including open source code modifications) which have +not been validated by Intel), (i) you will be solely responsible for +ensuring the regulatory compliance of the products, (ii) Intel will bear +no liability, under any theory of liability for any issues associated +with the modified products, including without limitation, claims under +the warranty and/or issues arising from regulatory non-compliance, and +(iii) Intel will not provide or be required to assist in providing +support to any third parties for such modified products. + +Note: Many regulatory agencies consider Wireless LAN adapters to be +modules, and accordingly, condition system-level regulatory approval +upon receipt and review of test data documenting that the antennas and +system configuration do not cause the EMC and radio operation to be +non-compliant. + +The drivers available for download from SourceForge are provided as a +part of a development project. Conformance to local regulatory +requirements is the responsibility of the individual developer. As +such, if you are interested in deploying or shipping a driver as part of +solution intended to be used for purposes other than development, please +obtain a tested driver from Intel Customer Support at: + +http://support.intel.com 1. Introduction @@ -45,7 +103,7 @@ file. 1.1. Overview of Features ----------------------------------------------- -The current release (1.0.0) supports the following features: +The current release (1.1.2) supports the following features: + BSS mode (Infrastructure, Managed) + IBSS mode (Ad-Hoc) @@ -56,17 +114,27 @@ The current release (1.0.0) supports the following features: + Full A rate support (2915 only) + Transmit power control + S state support (ACPI suspend/resume) + +The following features are currently enabled, but not officially +supported: + ++ WPA + long/short preamble support ++ Monitor mode (aka RFMon) + +The distinction between officially supported and enabled is a reflection +on the amount of validation and interoperability testing that has been +performed on a given feature. 1.2. Command Line Parameters ----------------------------------------------- -Like many modules used in the Linux kernel, the Intel(R) PRO/Wireless -2915ABG Driver for Linux allows certain configuration options to be -provided as module parameters. The most common way to specify a module -parameter is via the command line. +Like many modules used in the Linux kernel, the Intel(R) PRO/Wireless +2915ABG Driver for Linux allows configuration options to be provided +as module parameters. The most common way to specify a module parameter +is via the command line. The general form is: @@ -79,7 +147,7 @@ Where the supported parameter are: driver. If disabled, the driver will not attempt to scan for and associate to a network until it has been configured with one or more properties for the target network, for example configuring - the network SSID. Default is 1 (auto-associate) + the network SSID. Default is 0 (do not auto-associate) Example: % modprobe ipw2200 associate=0 @@ -96,14 +164,18 @@ Where the supported parameter are: debug If using a debug build, this is used to control the amount of debug - info is logged. See the 'dval' and 'load' script for more info on - how to use this (the dval and load scripts are provided as part + info is logged. See the 'dvals' and 'load' script for more info on + how to use this (the dvals and load scripts are provided as part of the ipw2200 development snapshot releases available from the SourceForge project at http://ipw2200.sf.net) + + led + Can be used to turn on experimental LED code. + 0 = Off, 1 = On. Default is 1. mode Can be used to set the default mode of the adapter. - 0 = Managed, 1 = Ad-Hoc + 0 = Managed, 1 = Ad-Hoc, 2 = Monitor 1.3. Wireless Extension Private Methods @@ -164,8 +236,8 @@ The supported private methods are: ----------------------------------------------- The Linux kernel provides a pseudo file system that can be used to -access various components of the operating system. The Intel(R) -PRO/Wireless 2915ABG Driver for Linux exposes several configuration +access various components of the operating system. The Intel(R) +PRO/Wireless 2915ABG Driver for Linux exposes several configuration parameters through this mechanism. An entry in the sysfs can support reading and/or writing. You can @@ -175,8 +247,8 @@ and can set the contents via echo. For example: % cat /sys/bus/pci/drivers/ipw2200/debug_level Will report the current debug level of the driver's logging subsystem -(only available if CONFIG_IPW_DEBUG was configured when the driver was -built). +(only available if CONFIG_IPW2200_DEBUG was configured when the driver +was built). You can set the debug level via: @@ -188,9 +260,9 @@ firmware loader used by hotplug utilizes sysfs entries for transferring the firmware image from user space into the driver. The Intel(R) PRO/Wireless 2915ABG Driver for Linux exposes sysfs entries -at two levels -- driver level, which apply to all instances of the -driver (in the event that there are more than one device installed) and -device level, which applies only to the single specific instance. +at two levels -- driver level, which apply to all instances of the driver +(in the event that there are more than one device installed) and device +level, which applies only to the single specific instance. 1.4.1 Driver Level Sysfs Helper Files @@ -203,6 +275,7 @@ For the driver level files, look in /sys/bus/pci/drivers/ipw2200/ This controls the same global as the 'debug' module parameter + 1.4.2 Device Level Sysfs Helper Files ----------------------------------------------- @@ -213,7 +286,7 @@ For the device level files, look in For example: /sys/bus/pci/drivers/ipw2200/0000:02:01.0 -For the device level files, see /sys/bus/pci/[drivers/ipw2200: +For the device level files, see /sys/bus/pci/drivers/ipw2200: rf_kill read - @@ -231,8 +304,97 @@ For the device level files, see /sys/bus/pci/[drivers/ipw2200: ucode read-only access to the ucode version number + led + read - + 0 = LED code disabled + 1 = LED code enabled + write - + 0 = Disable LED code + 1 = Enable LED code + + NOTE: The LED code has been reported to hang some systems when + running ifconfig and is therefore disabled by default. -2. About the Version Numbers + +1.5. Supported channels +----------------------------------------------- + +Upon loading the Intel(R) PRO/Wireless 2915ABG Driver for Linux, a +message stating the detected geography code and the number of 802.11 +channels supported by the card will be displayed in the log. + +The geography code corresponds to a regulatory domain as shown in the +table below. + + Supported channels +Code Geography 802.11bg 802.11a + +--- Restricted 11 0 +ZZF Custom US/Canada 11 8 +ZZD Rest of World 13 0 +ZZA Custom USA & Europe & High 11 13 +ZZB Custom NA & Europe 11 13 +ZZC Custom Japan 11 4 +ZZM Custom 11 0 +ZZE Europe 13 19 +ZZJ Custom Japan 14 4 +ZZR Rest of World 14 0 +ZZH High Band 13 4 +ZZG Custom Europe 13 4 +ZZK Europe 13 24 +ZZL Europe 11 13 + + +2. Ad-Hoc Networking +----------------------------------------------- + +When using a device in an Ad-Hoc network, it is useful to understand the +sequence and requirements for the driver to be able to create, join, or +merge networks. + +The following attempts to provide enough information so that you can +have a consistent experience while using the driver as a member of an +Ad-Hoc network. + +2.1. Joining an Ad-Hoc Network +----------------------------------------------- + +The easiest way to get onto an Ad-Hoc network is to join one that +already exists. + +2.2. Creating an Ad-Hoc Network +----------------------------------------------- + +An Ad-Hoc networks is created using the syntax of the Wireless tool. + +For Example: +iwconfig eth1 mode ad-hoc essid testing channel 2 + +2.3. Merging Ad-Hoc Networks +----------------------------------------------- + + +3. Interaction with Wireless Tools +----------------------------------------------- + +3.1 iwconfig mode +----------------------------------------------- + +When configuring the mode of the adapter, all run-time configured parameters +are reset to the value used when the module was loaded. This includes +channels, rates, ESSID, etc. + +3.2 iwconfig sens +----------------------------------------------- + +The 'iwconfig ethX sens XX' command will not set the signal sensitivity +threshold, as described in iwconfig documentation, but rather the number +of consecutive missed beacons that will trigger handover, i.e. roaming +to another access point. At the same time, it will set the disassociation +threshold to 3 times the given value. + + +4. About the Version Numbers ----------------------------------------------- Due to the nature of open source development projects, there are @@ -259,12 +421,23 @@ available as quickly as possible, unknown anomalies should be expected. The major version number will be incremented when significant changes are made to the driver. Currently, there are no major changes planned. +5. Firmware installation +---------------------------------------------- + +The driver requires a firmware image, download it and extract the +files under /lib/firmware (or wherever your hotplug's firmware.agent +will look for firmware files) -3. Support +The firmware can be downloaded from the following URL: + + http://ipw2200.sf.net/ + + +6. Support ----------------------------------------------- -For installation support of the 1.0.0 version, you can contact -http://supportmail.intel.com, or you can use the open source project +For direct support of the 1.0.0 version, you can contact +http://supportmail.intel.com, or you can use the open source project support. For general information and support, go to: @@ -272,10 +445,10 @@ For general information and support, go to: http://ipw2200.sf.net/ -4. License +7. License ----------------------------------------------- - Copyright(c) 2003 - 2005 Intel Corporation. All rights reserved. + Copyright(c) 2003 - 2006 Intel Corporation. All rights reserved. This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License version 2 as @@ -297,4 +470,3 @@ For general information and support, go to: James P. Ketrenos <ipw2100-admin@linux.intel.com> Intel Corporation, 5200 N.E. Elam Young Parkway, Hillsboro, OR 97124-6497 - diff --git a/Documentation/networking/README.sb1000 b/Documentation/networking/README.sb1000 index f82d42584e9..f92c2aac56a 100644 --- a/Documentation/networking/README.sb1000 +++ b/Documentation/networking/README.sb1000 @@ -27,8 +27,8 @@ cable modem easy. in Franco's original source code distribution .tar.gz file. Support for the sb1000 driver can be found at: - http://home.adelphia.net/~siglercm/sb1000.html - http://linuxpower.cx/~cable/ + http://web.archive.org/web/*/http://home.adelphia.net/~siglercm/sb1000.html + http://web.archive.org/web/*/http://linuxpower.cx/~cable/ along with these utilities. diff --git a/Documentation/networking/TODO b/Documentation/networking/TODO deleted file mode 100644 index 66d36ff14ba..00000000000 --- a/Documentation/networking/TODO +++ /dev/null @@ -1,18 +0,0 @@ -To-do items for network drivers -------------------------------- - -* Move ethernet crc routine to generic code - -* (for 2.5) Integrate Jamal Hadi Salim's netdev Rx polling API change - -* Audit all net drivers to make sure magic packet / wake-on-lan / - similar features are disabled in the driver by default. - -* Audit all net drivers to make sure the module always prints out a - version string when loaded as a module, but only prints a version - string when built into the kernel if a device is detected. - -* Add ETHTOOL_GDRVINFO ioctl support to all ethernet drivers. - -* dmfe PCI DMA is totally wrong and only works on x86 - diff --git a/Documentation/networking/alias.txt b/Documentation/networking/alias.txt index cd12c2ff518..85046f53fcf 100644 --- a/Documentation/networking/alias.txt +++ b/Documentation/networking/alias.txt @@ -2,13 +2,13 @@ IP-Aliasing: ============ -IP-aliases are additional IP-addresses/masks hooked up to a base -interface by adding a colon and a string when running ifconfig. -This string is usually numeric, but this is not a must. - -IP-Aliases are avail if CONFIG_INET (`standard' IPv4 networking) -is configured in the kernel. +IP-aliases are an obsolete way to manage multiple IP-addresses/masks +per interface. Newer tools such as iproute2 support multiple +address/prefixes per interface, but aliases are still supported +for backwards compatibility. +An alias is formed by adding a colon and a string when running ifconfig. +This string is usually numeric, but this is not a must. o Alias creation. Alias creation is done by 'magic' interface naming: eg. to create a @@ -38,16 +38,3 @@ o Relationship with main device If the base device is shut down the added aliases will be deleted too. - - -Contact -------- -Please finger or e-mail me: - Juan Jose Ciarlante <jjciarla@raiz.uncu.edu.ar> - -Updated by Erik Schoenfelder <schoenfr@gaertner.DE> - -; local variables: -; mode: indented-text -; mode: auto-fill -; end: diff --git a/Documentation/networking/altera_tse.txt b/Documentation/networking/altera_tse.txt new file mode 100644 index 00000000000..3f24df8c6e6 --- /dev/null +++ b/Documentation/networking/altera_tse.txt @@ -0,0 +1,263 @@ + Altera Triple-Speed Ethernet MAC driver + +Copyright (C) 2008-2014 Altera Corporation + +This is the driver for the Altera Triple-Speed Ethernet (TSE) controllers +using the SGDMA and MSGDMA soft DMA IP components. The driver uses the +platform bus to obtain component resources. The designs used to test this +driver were built for a Cyclone(R) V SOC FPGA board, a Cyclone(R) V FPGA board, +and tested with ARM and NIOS processor hosts seperately. The anticipated use +cases are simple communications between an embedded system and an external peer +for status and simple configuration of the embedded system. + +For more information visit www.altera.com and www.rocketboards.org. Support +forums for the driver may be found on www.rocketboards.org, and a design used +to test this driver may be found there as well. Support is also available from +the maintainer of this driver, found in MAINTAINERS. + +The Triple-Speed Ethernet, SGDMA, and MSGDMA components are all soft IP +components that can be assembled and built into an FPGA using the Altera +Quartus toolchain. Quartus 13.1 and 14.0 were used to build the design that +this driver was tested against. The sopc2dts tool is used to create the +device tree for the driver, and may be found at rocketboards.org. + +The driver probe function examines the device tree and determines if the +Triple-Speed Ethernet instance is using an SGDMA or MSGDMA component. The +probe function then installs the appropriate set of DMA routines to +initialize, setup transmits, receives, and interrupt handling primitives for +the respective configurations. + +The SGDMA component is to be deprecated in the near future (over the next 1-2 +years as of this writing in early 2014) in favor of the MSGDMA component. +SGDMA support is included for existing designs and reference in case a +developer wishes to support their own soft DMA logic and driver support. Any +new designs should not use the SGDMA. + +The SGDMA supports only a single transmit or receive operation at a time, and +therefore will not perform as well compared to the MSGDMA soft IP. Please +visit www.altera.com for known, documented SGDMA errata. + +Scatter-gather DMA is not supported by the SGDMA or MSGDMA at this time. +Scatter-gather DMA will be added to a future maintenance update to this +driver. + +Jumbo frames are not supported at this time. + +The driver limits PHY operations to 10/100Mbps, and has not yet been fully +tested for 1Gbps. This support will be added in a future maintenance update. + +1) Kernel Configuration +The kernel configuration option is ALTERA_TSE: + Device Drivers ---> Network device support ---> Ethernet driver support ---> + Altera Triple-Speed Ethernet MAC support (ALTERA_TSE) + +2) Driver parameters list: + debug: message level (0: no output, 16: all); + dma_rx_num: Number of descriptors in the RX list (default is 64); + dma_tx_num: Number of descriptors in the TX list (default is 64). + +3) Command line options +Driver parameters can be also passed in command line by using: + altera_tse=dma_rx_num:128,dma_tx_num:512 + +4) Driver information and notes + +4.1) Transmit process +When the driver's transmit routine is called by the kernel, it sets up a +transmit descriptor by calling the underlying DMA transmit routine (SGDMA or +MSGDMA), and initites a transmit operation. Once the transmit is complete, an +interrupt is driven by the transmit DMA logic. The driver handles the transmit +completion in the context of the interrupt handling chain by recycling +resource required to send and track the requested transmit operation. + +4.2) Receive process +The driver will post receive buffers to the receive DMA logic during driver +intialization. Receive buffers may or may not be queued depending upon the +underlying DMA logic (MSGDMA is able queue receive buffers, SGDMA is not able +to queue receive buffers to the SGDMA receive logic). When a packet is +received, the DMA logic generates an interrupt. The driver handles a receive +interrupt by obtaining the DMA receive logic status, reaping receive +completions until no more receive completions are available. + +4.3) Interrupt Mitigation +The driver is able to mitigate the number of its DMA interrupts +using NAPI for receive operations. Interrupt mitigation is not yet supported +for transmit operations, but will be added in a future maintenance release. + +4.4) Ethtool support +Ethtool is supported. Driver statistics and internal errors can be taken using: +ethtool -S ethX command. It is possible to dump registers etc. + +4.5) PHY Support +The driver is compatible with PAL to work with PHY and GPHY devices. + +4.7) List of source files: + o Kconfig + o Makefile + o altera_tse_main.c: main network device driver + o altera_tse_ethtool.c: ethtool support + o altera_tse.h: private driver structure and common definitions + o altera_msgdma.h: MSGDMA implementation function definitions + o altera_sgdma.h: SGDMA implementation function definitions + o altera_msgdma.c: MSGDMA implementation + o altera_sgdma.c: SGDMA implementation + o altera_sgdmahw.h: SGDMA register and descriptor definitions + o altera_msgdmahw.h: MSGDMA register and descriptor definitions + o altera_utils.c: Driver utility functions + o altera_utils.h: Driver utility function definitions + +5) Debug Information + +The driver exports debug information such as internal statistics, +debug information, MAC and DMA registers etc. + +A user may use the ethtool support to get statistics: +e.g. using: ethtool -S ethX (that shows the statistics counters) +or sees the MAC registers: e.g. using: ethtool -d ethX + +The developer can also use the "debug" module parameter to get +further debug information. + +6) Statistics Support + +The controller and driver support a mix of IEEE standard defined statistics, +RFC defined statistics, and driver or Altera defined statistics. The four +specifications containing the standard definitions for these statistics are +as follows: + + o IEEE 802.3-2012 - IEEE Standard for Ethernet. + o RFC 2863 found at http://www.rfc-editor.org/rfc/rfc2863.txt. + o RFC 2819 found at http://www.rfc-editor.org/rfc/rfc2819.txt. + o Altera Triple Speed Ethernet User Guide, found at http://www.altera.com + +The statistics supported by the TSE and the device driver are as follows: + +"tx_packets" is equivalent to aFramesTransmittedOK defined in IEEE 802.3-2012, +Section 5.2.2.1.2. This statistics is the count of frames that are successfully +transmitted. + +"rx_packets" is equivalent to aFramesReceivedOK defined in IEEE 802.3-2012, +Section 5.2.2.1.5. This statistic is the count of frames that are successfully +received. This count does not include any error packets such as CRC errors, +length errors, or alignment errors. + +"rx_crc_errors" is equivalent to aFrameCheckSequenceErrors defined in IEEE +802.3-2012, Section 5.2.2.1.6. This statistic is the count of frames that are +an integral number of bytes in length and do not pass the CRC test as the frame +is received. + +"rx_align_errors" is equivalent to aAlignmentErrors defined in IEEE 802.3-2012, +Section 5.2.2.1.7. This statistic is the count of frames that are not an +integral number of bytes in length and do not pass the CRC test as the frame is +received. + +"tx_bytes" is equivalent to aOctetsTransmittedOK defined in IEEE 802.3-2012, +Section 5.2.2.1.8. This statistic is the count of data and pad bytes +successfully transmitted from the interface. + +"rx_bytes" is equivalent to aOctetsReceivedOK defined in IEEE 802.3-2012, +Section 5.2.2.1.14. This statistic is the count of data and pad bytes +successfully received by the controller. + +"tx_pause" is equivalent to aPAUSEMACCtrlFramesTransmitted defined in IEEE +802.3-2012, Section 30.3.4.2. This statistic is a count of PAUSE frames +transmitted from the network controller. + +"rx_pause" is equivalent to aPAUSEMACCtrlFramesReceived defined in IEEE +802.3-2012, Section 30.3.4.3. This statistic is a count of PAUSE frames +received by the network controller. + +"rx_errors" is equivalent to ifInErrors defined in RFC 2863. This statistic is +a count of the number of packets received containing errors that prevented the +packet from being delivered to a higher level protocol. + +"tx_errors" is equivalent to ifOutErrors defined in RFC 2863. This statistic +is a count of the number of packets that could not be transmitted due to errors. + +"rx_unicast" is equivalent to ifInUcastPkts defined in RFC 2863. This +statistic is a count of the number of packets received that were not addressed +to the broadcast address or a multicast group. + +"rx_multicast" is equivalent to ifInMulticastPkts defined in RFC 2863. This +statistic is a count of the number of packets received that were addressed to +a multicast address group. + +"rx_broadcast" is equivalent to ifInBroadcastPkts defined in RFC 2863. This +statistic is a count of the number of packets received that were addressed to +the broadcast address. + +"tx_discards" is equivalent to ifOutDiscards defined in RFC 2863. This +statistic is the number of outbound packets not transmitted even though an +error was not detected. An example of a reason this might occur is to free up +internal buffer space. + +"tx_unicast" is equivalent to ifOutUcastPkts defined in RFC 2863. This +statistic counts the number of packets transmitted that were not addressed to +a multicast group or broadcast address. + +"tx_multicast" is equivalent to ifOutMulticastPkts defined in RFC 2863. This +statistic counts the number of packets transmitted that were addressed to a +multicast group. + +"tx_broadcast" is equivalent to ifOutBroadcastPkts defined in RFC 2863. This +statistic counts the number of packets transmitted that were addressed to a +broadcast address. + +"ether_drops" is equivalent to etherStatsDropEvents defined in RFC 2819. +This statistic counts the number of packets dropped due to lack of internal +controller resources. + +"rx_total_bytes" is equivalent to etherStatsOctets defined in RFC 2819. +This statistic counts the total number of bytes received by the controller, +including error and discarded packets. + +"rx_total_packets" is equivalent to etherStatsPkts defined in RFC 2819. +This statistic counts the total number of packets received by the controller, +including error, discarded, unicast, multicast, and broadcast packets. + +"rx_undersize" is equivalent to etherStatsUndersizePkts defined in RFC 2819. +This statistic counts the number of correctly formed packets received less +than 64 bytes long. + +"rx_oversize" is equivalent to etherStatsOversizePkts defined in RFC 2819. +This statistic counts the number of correctly formed packets greater than 1518 +bytes long. + +"rx_64_bytes" is equivalent to etherStatsPkts64Octets defined in RFC 2819. +This statistic counts the total number of packets received that were 64 octets +in length. + +"rx_65_127_bytes" is equivalent to etherStatsPkts65to127Octets defined in RFC +2819. This statistic counts the total number of packets received that were +between 65 and 127 octets in length inclusive. + +"rx_128_255_bytes" is equivalent to etherStatsPkts128to255Octets defined in +RFC 2819. This statistic is the total number of packets received that were +between 128 and 255 octets in length inclusive. + +"rx_256_511_bytes" is equivalent to etherStatsPkts256to511Octets defined in +RFC 2819. This statistic is the total number of packets received that were +between 256 and 511 octets in length inclusive. + +"rx_512_1023_bytes" is equivalent to etherStatsPkts512to1023Octets defined in +RFC 2819. This statistic is the total number of packets received that were +between 512 and 1023 octets in length inclusive. + +"rx_1024_1518_bytes" is equivalent to etherStatsPkts1024to1518Octets define +in RFC 2819. This statistic is the total number of packets received that were +between 1024 and 1518 octets in length inclusive. + +"rx_gte_1519_bytes" is a statistic defined specific to the behavior of the +Altera TSE. This statistics counts the number of received good and errored +frames between the length of 1519 and the maximum frame length configured +in the frm_length register. See the Altera TSE User Guide for More details. + +"rx_jabbers" is equivalent to etherStatsJabbers defined in RFC 2819. This +statistic is the total number of packets received that were longer than 1518 +octets, and had either a bad CRC with an integral number of octets (CRC Error) +or a bad CRC with a non-integral number of octets (Alignment Error). + +"rx_runts" is equivalent to etherStatsFragments defined in RFC 2819. This +statistic is the total number of packets received that were less than 64 octets +in length and had either a bad CRC with an integral number of octets (CRC +error) or a bad CRC with a non-integral number of octets (Alignment Error). diff --git a/Documentation/networking/arcnet-hardware.txt b/Documentation/networking/arcnet-hardware.txt index 30a5f01403d..731de411513 100644 --- a/Documentation/networking/arcnet-hardware.txt +++ b/Documentation/networking/arcnet-hardware.txt @@ -139,7 +139,7 @@ And now to the cabling. What you can connect together: 5. An active hub to passive hub. -Remember, that you can not connect two passive hubs together. The power loss +Remember that you cannot connect two passive hubs together. The power loss implied by such a connection is too high for the net to operate reliably. An example of a typical ARCnet network: diff --git a/Documentation/networking/arcnet.txt b/Documentation/networking/arcnet.txt index 770fc41a78e..aff97f47c05 100644 --- a/Documentation/networking/arcnet.txt +++ b/Documentation/networking/arcnet.txt @@ -46,7 +46,7 @@ These are the ARCnet drivers for Linux. This new release (2.91) has been put together by David Woodhouse -<dwmw2@cam.ac.uk>, in an attempt to tidy up the driver after adding support +<dwmw2@infradead.org>, in an attempt to tidy up the driver after adding support for yet another chipset. Now the generic support has been separated from the individual chipset drivers, and the source files aren't quite so packed with #ifdefs! I've changed this file a bit, but kept it in the first person from @@ -68,18 +68,19 @@ REAL NAME" to listserv@tichy.ch.uj.edu.pl. Then, to submit messages to the list, mail to linux-arcnet@tichy.ch.uj.edu.pl. There are archives of the mailing list at: - http://tichy.ch.uj.edu.pl/lists/linux-arcnet + http://epistolary.org/mailman/listinfo.cgi/arcnet -The people on linux-net@vger.kernel.org have also been known to be very -helpful, especially when we're talking about ALPHA Linux kernels that may or -may not work right in the first place. +The people on linux-net@vger.kernel.org (now defunct, replaced by +netdev@vger.kernel.org) have also been known to be very helpful, especially +when we're talking about ALPHA Linux kernels that may or may not work right +in the first place. Other Drivers and Info ---------------------- You can try my ARCNET page on the World Wide Web at: - http://www.worldvisions.ca/~apenwarr/arcnet/ + http://www.qis.net/~jschmitz/arcnet/ Also, SMC (one of the companies that makes ARCnet cards) has a WWW site you might be interested in, which includes several drivers for various cards diff --git a/Documentation/networking/ax25.txt b/Documentation/networking/ax25.txt index 37c25b0925f..8257dbf9be5 100644 --- a/Documentation/networking/ax25.txt +++ b/Documentation/networking/ax25.txt @@ -1,16 +1,10 @@ To use the amateur radio protocols within Linux you will need to get a -suitable copy of the AX.25 Utilities. More detailed information about these -and associated programs can be found on http://zone.pspt.fi/~jsn/. - -For more information about the AX.25, NET/ROM and ROSE protocol stacks, see -the AX25-HOWTO written by Terry Dawson <terry@perf.no.itg.telstra.com.au> -who is also the AX.25 Utilities maintainer. +suitable copy of the AX.25 Utilities. More detailed information about +AX.25, NET/ROM and ROSE, associated programs and and utilities can be +found on http://www.linux-ax25.org. There is an active mailing list for discussing Linux amateur radio matters -called linux-hams. To subscribe to it, send a message to +called linux-hams@vger.kernel.org. To subscribe to it, send a message to majordomo@vger.kernel.org with the words "subscribe linux-hams" in the body -of the message, the subject field is ignored. - -Jonathan G4KLX - -g4klx@g4klx.demon.co.uk +of the message, the subject field is ignored. You don't need to be +subscribed to post but of course that means you might miss an answer. diff --git a/Documentation/networking/batman-adv.txt b/Documentation/networking/batman-adv.txt new file mode 100644 index 00000000000..58e49042fc2 --- /dev/null +++ b/Documentation/networking/batman-adv.txt @@ -0,0 +1,202 @@ +BATMAN-ADV +---------- + +Batman advanced is a new approach to wireless networking which +does no longer operate on the IP basis. Unlike the batman daemon, +which exchanges information using UDP packets and sets routing +tables, batman-advanced operates on ISO/OSI Layer 2 only and uses +and routes (or better: bridges) Ethernet Frames. It emulates a +virtual network switch of all nodes participating. Therefore all +nodes appear to be link local, thus all higher operating proto- +cols won't be affected by any changes within the network. You can +run almost any protocol above batman advanced, prominent examples +are: IPv4, IPv6, DHCP, IPX. + +Batman advanced was implemented as a Linux kernel driver to re- +duce the overhead to a minimum. It does not depend on any (other) +network driver, and can be used on wifi as well as ethernet lan, +vpn, etc ... (anything with ethernet-style layer 2). + + +CONFIGURATION +------------- + +Load the batman-adv module into your kernel: + +# insmod batman-adv.ko + +The module is now waiting for activation. You must add some in- +terfaces on which batman can operate. After loading the module +batman advanced will scan your systems interfaces to search for +compatible interfaces. Once found, it will create subfolders in +the /sys directories of each supported interface, e.g. + +# ls /sys/class/net/eth0/batman_adv/ +# iface_status mesh_iface + +If an interface does not have the "batman_adv" subfolder it prob- +ably is not supported. Not supported interfaces are: loopback, +non-ethernet and batman's own interfaces. + +Note: After the module was loaded it will continuously watch for +new interfaces to verify the compatibility. There is no need to +reload the module if you plug your USB wifi adapter into your ma- +chine after batman advanced was initially loaded. + +To activate a given interface simply write "bat0" into its +"mesh_iface" file inside the batman_adv subfolder: + +# echo bat0 > /sys/class/net/eth0/batman_adv/mesh_iface + +Repeat this step for all interfaces you wish to add. Now batman +starts using/broadcasting on this/these interface(s). + +By reading the "iface_status" file you can check its status: + +# cat /sys/class/net/eth0/batman_adv/iface_status +# active + +To deactivate an interface you have to write "none" into its +"mesh_iface" file: + +# echo none > /sys/class/net/eth0/batman_adv/mesh_iface + + +All mesh wide settings can be found in batman's own interface +folder: + +# ls /sys/class/net/bat0/mesh/ +#aggregated_ogms distributed_arp_table gw_sel_class orig_interval +#ap_isolation fragmentation hop_penalty routing_algo +#bonding gw_bandwidth isolation_mark vlan0 +#bridge_loop_avoidance gw_mode log_level + +There is a special folder for debugging information: + +# ls /sys/kernel/debug/batman_adv/bat0/ +# bla_backbone_table log transtable_global +# bla_claim_table originators transtable_local +# gateways socket + +Some of the files contain all sort of status information regard- +ing the mesh network. For example, you can view the table of +originators (mesh participants) with: + +# cat /sys/kernel/debug/batman_adv/bat0/originators + +Other files allow to change batman's behaviour to better fit your +requirements. For instance, you can check the current originator +interval (value in milliseconds which determines how often batman +sends its broadcast packets): + +# cat /sys/class/net/bat0/mesh/orig_interval +# 1000 + +and also change its value: + +# echo 3000 > /sys/class/net/bat0/mesh/orig_interval + +In very mobile scenarios, you might want to adjust the originator +interval to a lower value. This will make the mesh more respon- +sive to topology changes, but will also increase the overhead. + + +USAGE +----- + +To make use of your newly created mesh, batman advanced provides +a new interface "bat0" which you should use from this point on. +All interfaces added to batman advanced are not relevant any +longer because batman handles them for you. Basically, one "hands +over" the data by using the batman interface and batman will make +sure it reaches its destination. + +The "bat0" interface can be used like any other regular inter- +face. It needs an IP address which can be either statically con- +figured or dynamically (by using DHCP or similar services): + +# NodeA: ifconfig bat0 192.168.0.1 +# NodeB: ifconfig bat0 192.168.0.2 +# NodeB: ping 192.168.0.1 + +Note: In order to avoid problems remove all IP addresses previ- +ously assigned to interfaces now used by batman advanced, e.g. + +# ifconfig eth0 0.0.0.0 + + +LOGGING/DEBUGGING +----------------- + +All error messages, warnings and information messages are sent to +the kernel log. Depending on your operating system distribution +this can be read in one of a number of ways. Try using the com- +mands: dmesg, logread, or looking in the files /var/log/kern.log +or /var/log/syslog. All batman-adv messages are prefixed with +"batman-adv:" So to see just these messages try + +# dmesg | grep batman-adv + +When investigating problems with your mesh network it is some- +times necessary to see more detail debug messages. This must be +enabled when compiling the batman-adv module. When building bat- +man-adv as part of kernel, use "make menuconfig" and enable the +option "B.A.T.M.A.N. debugging". + +Those additional debug messages can be accessed using a special +file in debugfs + +# cat /sys/kernel/debug/batman_adv/bat0/log + +The additional debug output is by default disabled. It can be en- +abled during run time. Following log_levels are defined: + +0 - All debug output disabled +1 - Enable messages related to routing / flooding / broadcasting +2 - Enable messages related to route added / changed / deleted +4 - Enable messages related to translation table operations +8 - Enable messages related to bridge loop avoidance +16 - Enable messaged related to DAT, ARP snooping and parsing +31 - Enable all messages + +The debug output can be changed at runtime using the file +/sys/class/net/bat0/mesh/log_level. e.g. + +# echo 6 > /sys/class/net/bat0/mesh/log_level + +will enable debug messages for when routes change. + +Counters for different types of packets entering and leaving the +batman-adv module are available through ethtool: + +# ethtool --statistics bat0 + + +BATCTL +------ + +As batman advanced operates on layer 2 all hosts participating in +the virtual switch are completely transparent for all protocols +above layer 2. Therefore the common diagnosis tools do not work +as expected. To overcome these problems batctl was created. At +the moment the batctl contains ping, traceroute, tcpdump and +interfaces to the kernel module settings. + +For more information, please see the manpage (man batctl). + +batctl is available on http://www.open-mesh.org/ + + +CONTACT +------- + +Please send us comments, experiences, questions, anything :) + +IRC: #batman on irc.freenode.org +Mailing-list: b.a.t.m.a.n@open-mesh.org (optional subscription + at https://lists.open-mesh.org/mm/listinfo/b.a.t.m.a.n) + +You can also contact the Authors: + +Marek Lindner <mareklindner@neomailbox.ch> +Simon Wunderlich <sw@simonwunderlich.de> diff --git a/Documentation/networking/baycom.txt b/Documentation/networking/baycom.txt index 4e68849d563..688f18fd446 100644 --- a/Documentation/networking/baycom.txt +++ b/Documentation/networking/baycom.txt @@ -93,7 +93,7 @@ Every time a driver is inserted into the kernel, it has to know which modems it should access at which ports. This can be done with the setbaycom utility. If you are only using one modem, you can also configure the driver from the insmod command line (or by means of an option line in -/etc/modprobe.conf). +/etc/modprobe.d/*.conf). Examples: modprobe baycom_ser_fdx mode="ser12*" iobase=0x3f8 irq=4 diff --git a/Documentation/networking/bonding.txt b/Documentation/networking/bonding.txt index b0fe41da007..9c723ecd002 100644 --- a/Documentation/networking/bonding.txt +++ b/Documentation/networking/bonding.txt @@ -1,7 +1,7 @@ Linux Ethernet Bonding Driver HOWTO - Latest update: 21 June 2005 + Latest update: 27 April 2011 Initial release : Thomas Davis <tadavis at lbl.gov> Corrections, HA extensions : 2000/10/03-15 : @@ -12,6 +12,8 @@ Corrections, HA extensions : 2000/10/03-15 : - Jay Vosburgh <fubar at us dot ibm dot com> Reorganized and updated Feb 2005 by Jay Vosburgh +Added Sysfs information: 2006/04/24 + - Mitch Williams <mitch.a.williams at intel.com> Introduction ============ @@ -38,69 +40,71 @@ Table of Contents 2. Bonding Driver Options 3. Configuring Bonding Devices -3.1 Configuration with sysconfig support -3.1.1 Using DHCP with sysconfig -3.1.2 Configuring Multiple Bonds with sysconfig -3.2 Configuration with initscripts support -3.2.1 Using DHCP with initscripts -3.2.2 Configuring Multiple Bonds with initscripts -3.3 Configuring Bonding Manually +3.1 Configuration with Sysconfig Support +3.1.1 Using DHCP with Sysconfig +3.1.2 Configuring Multiple Bonds with Sysconfig +3.2 Configuration with Initscripts Support +3.2.1 Using DHCP with Initscripts +3.2.2 Configuring Multiple Bonds with Initscripts +3.3 Configuring Bonding Manually with Ifenslave 3.3.1 Configuring Multiple Bonds Manually +3.4 Configuring Bonding Manually via Sysfs +3.5 Configuration with Interfaces Support +3.6 Overriding Configuration for Special Cases -5. Querying Bonding Configuration -5.1 Bonding Configuration -5.2 Network Configuration +4. Querying Bonding Configuration +4.1 Bonding Configuration +4.2 Network Configuration -6. Switch Configuration +5. Switch Configuration -7. 802.1q VLAN Support +6. 802.1q VLAN Support -8. Link Monitoring -8.1 ARP Monitor Operation -8.2 Configuring Multiple ARP Targets -8.3 MII Monitor Operation +7. Link Monitoring +7.1 ARP Monitor Operation +7.2 Configuring Multiple ARP Targets +7.3 MII Monitor Operation -9. Potential Trouble Sources -9.1 Adventures in Routing -9.2 Ethernet Device Renaming -9.3 Painfully Slow Or No Failed Link Detection By Miimon +8. Potential Trouble Sources +8.1 Adventures in Routing +8.2 Ethernet Device Renaming +8.3 Painfully Slow Or No Failed Link Detection By Miimon -10. SNMP agents +9. SNMP agents -11. Promiscuous mode +10. Promiscuous mode -12. Configuring Bonding for High Availability -12.1 High Availability in a Single Switch Topology -12.2 High Availability in a Multiple Switch Topology -12.2.1 HA Bonding Mode Selection for Multiple Switch Topology -12.2.2 HA Link Monitoring for Multiple Switch Topology +11. Configuring Bonding for High Availability +11.1 High Availability in a Single Switch Topology +11.2 High Availability in a Multiple Switch Topology +11.2.1 HA Bonding Mode Selection for Multiple Switch Topology +11.2.2 HA Link Monitoring for Multiple Switch Topology -13. Configuring Bonding for Maximum Throughput -13.1 Maximum Throughput in a Single Switch Topology -13.1.1 MT Bonding Mode Selection for Single Switch Topology -13.1.2 MT Link Monitoring for Single Switch Topology -13.2 Maximum Throughput in a Multiple Switch Topology -13.2.1 MT Bonding Mode Selection for Multiple Switch Topology -13.2.2 MT Link Monitoring for Multiple Switch Topology +12. Configuring Bonding for Maximum Throughput +12.1 Maximum Throughput in a Single Switch Topology +12.1.1 MT Bonding Mode Selection for Single Switch Topology +12.1.2 MT Link Monitoring for Single Switch Topology +12.2 Maximum Throughput in a Multiple Switch Topology +12.2.1 MT Bonding Mode Selection for Multiple Switch Topology +12.2.2 MT Link Monitoring for Multiple Switch Topology -14. Switch Behavior Issues -14.1 Link Establishment and Failover Delays -14.2 Duplicated Incoming Packets +13. Switch Behavior Issues +13.1 Link Establishment and Failover Delays +13.2 Duplicated Incoming Packets -15. Hardware Specific Considerations -15.1 IBM BladeCenter +14. Hardware Specific Considerations +14.1 IBM BladeCenter -16. Frequently Asked Questions +15. Frequently Asked Questions -17. Resources and Links +16. Resources and Links 1. Bonding Driver Installation ============================== Most popular distro kernels ship with the bonding driver -already available as a module and the ifenslave user level control -program installed and ready for use. If your distro does not, or you +already available as a module. If your distro does not, or you have need to compile bonding from source (e.g., configuring and installing a mainline kernel from kernel.org), you'll need to perform the following steps: @@ -119,53 +123,27 @@ device support" section. It is recommended that you configure the driver as module since it is currently the only way to pass parameters to the driver or configure more than one bonding device. - Build and install the new kernel and modules, then continue -below to install ifenslave. + Build and install the new kernel and modules. -1.2 Install ifenslave Control Utility +1.2 Bonding Control Utility ------------------------------------- - The ifenslave user level control program is included in the -kernel source tree, in the file Documentation/networking/ifenslave.c. -It is generally recommended that you use the ifenslave that -corresponds to the kernel that you are using (either from the same -source tree or supplied with the distro), however, ifenslave -executables from older kernels should function (but features newer -than the ifenslave release are not supported). Running an ifenslave -that is newer than the kernel is not supported, and may or may not -work. - - To install ifenslave, do the following: - -# gcc -Wall -O -I/usr/src/linux/include ifenslave.c -o ifenslave -# cp ifenslave /sbin/ifenslave - - If your kernel source is not in "/usr/src/linux," then replace -"/usr/src/linux/include" in the above with the location of your kernel -source include directory. - - You may wish to back up any existing /sbin/ifenslave, or, for -testing or informal use, tag the ifenslave to the kernel version -(e.g., name the ifenslave executable /sbin/ifenslave-2.6.10). - -IMPORTANT NOTE: - - If you omit the "-I" or specify an incorrect directory, you -may end up with an ifenslave that is incompatible with the kernel -you're trying to build it for. Some distros (e.g., Red Hat from 7.1 -onwards) do not have /usr/include/linux symbolically linked to the -default kernel source include directory. - + It is recommended to configure bonding via iproute2 (netlink) +or sysfs, the old ifenslave control utility is obsolete. 2. Bonding Driver Options ========================= - Options for the bonding driver are supplied as parameters to -the bonding module at load time. They may be given as command line -arguments to the insmod or modprobe command, but are usually specified -in either the /etc/modules.conf or /etc/modprobe.conf configuration -file, or in a distro-specific configuration file (some of which are -detailed in the next section). + Options for the bonding driver are supplied as parameters to the +bonding module at load time, or are specified via sysfs. + + Module options may be given as command line arguments to the +insmod or modprobe command, but are usually specified in either the +/etc/modrobe.d/*.conf configuration files, or in a distro-specific +configuration file (some of which are detailed in the next section). + + Details on bonding support for sysfs is provided in the +"Configuring Bonding Manually via Sysfs" section, below. The available bonding driver parameters are listed below. If a parameter is not specified the default value is used. When initially @@ -183,9 +161,91 @@ or, for backwards compatibility, the option value. E.g., The parameters are as follows: +active_slave + + Specifies the new active slave for modes that support it + (active-backup, balance-alb and balance-tlb). Possible values + are the name of any currently enslaved interface, or an empty + string. If a name is given, the slave and its link must be up in order + to be selected as the new active slave. If an empty string is + specified, the current active slave is cleared, and a new active + slave is selected automatically. + + Note that this is only available through the sysfs interface. No module + parameter by this name exists. + + The normal value of this option is the name of the currently + active slave, or the empty string if there is no active slave or + the current mode does not use an active slave. + +ad_select + + Specifies the 802.3ad aggregation selection logic to use. The + possible values and their effects are: + + stable or 0 + + The active aggregator is chosen by largest aggregate + bandwidth. + + Reselection of the active aggregator occurs only when all + slaves of the active aggregator are down or the active + aggregator has no slaves. + + This is the default value. + + bandwidth or 1 + + The active aggregator is chosen by largest aggregate + bandwidth. Reselection occurs if: + + - A slave is added to or removed from the bond + + - Any slave's link state changes + + - Any slave's 802.3ad association state changes + + - The bond's administrative state changes to up + + count or 2 + + The active aggregator is chosen by the largest number of + ports (slaves). Reselection occurs as described under the + "bandwidth" setting, above. + + The bandwidth and count selection policies permit failover of + 802.3ad aggregations when partial failure of the active aggregator + occurs. This keeps the aggregator with the highest availability + (either in bandwidth or in number of ports) active at all times. + + This option was added in bonding version 3.4.0. + +all_slaves_active + + Specifies that duplicate frames (received on inactive ports) should be + dropped (0) or delivered (1). + + Normally, bonding will drop duplicate frames (received on inactive + ports), which is desirable for most users. But there are some times + it is nice to allow duplicate frames to be delivered. + + The default value is 0 (drop duplicate frames received on inactive + ports). + arp_interval Specifies the ARP link monitoring frequency in milliseconds. + + The ARP monitor works by periodically checking the slave + devices to determine whether they have sent or received + traffic recently (the precise criteria depends upon the + bonding mode, and the state of the slave). Regular traffic is + generated via ARP probes issued for the addresses specified by + the arp_ip_target option. + + This behavior can be modified by the arp_validate option, + below. + If ARP monitoring is used in an etherchannel compatible mode (modes 0 and 2), the switch should be configured in a mode that evenly distributes packets across all links. If the @@ -207,6 +267,115 @@ arp_ip_target maximum number of targets that can be specified is 16. The default value is no IP addresses. +arp_validate + + Specifies whether or not ARP probes and replies should be + validated in any mode that supports arp monitoring, or whether + non-ARP traffic should be filtered (disregarded) for link + monitoring purposes. + + Possible values are: + + none or 0 + + No validation or filtering is performed. + + active or 1 + + Validation is performed only for the active slave. + + backup or 2 + + Validation is performed only for backup slaves. + + all or 3 + + Validation is performed for all slaves. + + filter or 4 + + Filtering is applied to all slaves. No validation is + performed. + + filter_active or 5 + + Filtering is applied to all slaves, validation is performed + only for the active slave. + + filter_backup or 6 + + Filtering is applied to all slaves, validation is performed + only for backup slaves. + + Validation: + + Enabling validation causes the ARP monitor to examine the incoming + ARP requests and replies, and only consider a slave to be up if it + is receiving the appropriate ARP traffic. + + For an active slave, the validation checks ARP replies to confirm + that they were generated by an arp_ip_target. Since backup slaves + do not typically receive these replies, the validation performed + for backup slaves is on the broadcast ARP request sent out via the + active slave. It is possible that some switch or network + configurations may result in situations wherein the backup slaves + do not receive the ARP requests; in such a situation, validation + of backup slaves must be disabled. + + The validation of ARP requests on backup slaves is mainly helping + bonding to decide which slaves are more likely to work in case of + the active slave failure, it doesn't really guarantee that the + backup slave will work if it's selected as the next active slave. + + Validation is useful in network configurations in which multiple + bonding hosts are concurrently issuing ARPs to one or more targets + beyond a common switch. Should the link between the switch and + target fail (but not the switch itself), the probe traffic + generated by the multiple bonding instances will fool the standard + ARP monitor into considering the links as still up. Use of + validation can resolve this, as the ARP monitor will only consider + ARP requests and replies associated with its own instance of + bonding. + + Filtering: + + Enabling filtering causes the ARP monitor to only use incoming ARP + packets for link availability purposes. Arriving packets that are + not ARPs are delivered normally, but do not count when determining + if a slave is available. + + Filtering operates by only considering the reception of ARP + packets (any ARP packet, regardless of source or destination) when + determining if a slave has received traffic for link availability + purposes. + + Filtering is useful in network configurations in which significant + levels of third party broadcast traffic would fool the standard + ARP monitor into considering the links as still up. Use of + filtering can resolve this, as only ARP traffic is considered for + link availability purposes. + + This option was added in bonding version 3.1.0. + +arp_all_targets + + Specifies the quantity of arp_ip_targets that must be reachable + in order for the ARP monitor to consider a slave as being up. + This option affects only active-backup mode for slaves with + arp_validation enabled. + + Possible values are: + + any or 0 + + consider the slave up only when any of the arp_ip_targets + is reachable + + all or 1 + + consider the slave up only when all of the arp_ip_targets + are reachable + downdelay Specifies the time, in milliseconds, to wait before disabling @@ -216,6 +385,77 @@ downdelay will be rounded down to the nearest multiple. The default value is 0. +fail_over_mac + + Specifies whether active-backup mode should set all slaves to + the same MAC address at enslavement (the traditional + behavior), or, when enabled, perform special handling of the + bond's MAC address in accordance with the selected policy. + + Possible values are: + + none or 0 + + This setting disables fail_over_mac, and causes + bonding to set all slaves of an active-backup bond to + the same MAC address at enslavement time. This is the + default. + + active or 1 + + The "active" fail_over_mac policy indicates that the + MAC address of the bond should always be the MAC + address of the currently active slave. The MAC + address of the slaves is not changed; instead, the MAC + address of the bond changes during a failover. + + This policy is useful for devices that cannot ever + alter their MAC address, or for devices that refuse + incoming broadcasts with their own source MAC (which + interferes with the ARP monitor). + + The down side of this policy is that every device on + the network must be updated via gratuitous ARP, + vs. just updating a switch or set of switches (which + often takes place for any traffic, not just ARP + traffic, if the switch snoops incoming traffic to + update its tables) for the traditional method. If the + gratuitous ARP is lost, communication may be + disrupted. + + When this policy is used in conjunction with the mii + monitor, devices which assert link up prior to being + able to actually transmit and receive are particularly + susceptible to loss of the gratuitous ARP, and an + appropriate updelay setting may be required. + + follow or 2 + + The "follow" fail_over_mac policy causes the MAC + address of the bond to be selected normally (normally + the MAC address of the first slave added to the bond). + However, the second and subsequent slaves are not set + to this MAC address while they are in a backup role; a + slave is programmed with the bond's MAC address at + failover time (and the formerly active slave receives + the newly active slave's MAC address). + + This policy is useful for multiport devices that + either become confused or incur a performance penalty + when multiple ports are programmed with the same MAC + address. + + + The default policy is none, unless the first slave cannot + change its MAC address, in which case the active policy is + selected by default. + + This option may be modified via sysfs only when no slaves are + present in the bond. + + This option was added in bonding version 3.2.0. The "follow" + policy was added in bonding version 3.3.0. + lacp_rate Option specifying the rate in which we'll ask our link partner @@ -235,7 +475,8 @@ max_bonds Specifies the number of bonding devices to create for this instance of the bonding driver. E.g., if max_bonds is 3, and the bonding driver is not already loaded, then bond0, bond1 - and bond2 will be created. The default value is 1. + and bond2 will be created. The default value is 1. Specifying + a value of 0 will load bonding, but will not create any devices. miimon @@ -247,6 +488,23 @@ miimon determined. See the High Availability section for additional information. The default value is 0. +min_links + + Specifies the minimum number of links that must be active before + asserting carrier. It is similar to the Cisco EtherChannel min-links + feature. This allows setting the minimum number of member ports that + must be up (link-up state) before marking the bond device as up + (carrier on). This is useful for situations where higher level services + such as clustering want to ensure a minimum number of low bandwidth + links are active before switchover. This option only affect 802.3ad + mode. + + The default value is 0. This will cause carrier to be asserted (for + 802.3ad mode) whenever there is an active aggregator, regardless of the + number of available links in that aggregator. Note that, because an + aggregator cannot be active without at least one available link, + setting this option to 0 or to 1 has the exact same effect. + mode Specifies one of the bonding policies. The default is @@ -270,7 +528,7 @@ mode In bonding version 2.6.2 or later, when a failover occurs in active-backup mode, bonding will issue one or more gratuitous ARPs on the newly active slave. - One gratutious ARP is issued for the bonding master + One gratuitous ARP is issued for the bonding master interface and each VLAN interfaces configured above it, provided that the interface has at least one IP address configured. Gratuitous ARPs issued for VLAN @@ -327,13 +585,19 @@ mode balance-tlb or 5 Adaptive transmit load balancing: channel bonding that - does not require any special switch support. The - outgoing traffic is distributed according to the - current load (computed relative to the speed) on each - slave. Incoming traffic is received by the current - slave. If the receiving slave fails, another slave - takes over the MAC address of the failed receiving - slave. + does not require any special switch support. + + In tlb_dynamic_lb=1 mode; the outgoing traffic is + distributed according to the current load (computed + relative to the speed) on each slave. + + In tlb_dynamic_lb=0 mode; the load balancing based on + current load is disabled and the load is distributed + only using the hash distribution. + + Incoming traffic is received by the current slave. + If the receiving slave fails, another slave takes over + the MAC address of the failed receiving slave. Prerequisite: @@ -377,7 +641,7 @@ mode When a link is reconnected or a new slave joins the bond the receive traffic is redistributed among all active slaves in the bond by initiating ARP Replies - with the selected mac address to each of the + with the selected MAC address to each of the clients. The updelay parameter (detailed below) must be set to a value equal or greater than the switch's forwarding delay so that the ARP Replies sent to the @@ -398,6 +662,34 @@ mode swapped with the new curr_active_slave that was chosen. +num_grat_arp +num_unsol_na + + Specify the number of peer notifications (gratuitous ARPs and + unsolicited IPv6 Neighbor Advertisements) to be issued after a + failover event. As soon as the link is up on the new slave + (possibly immediately) a peer notification is sent on the + bonding device and each VLAN sub-device. This is repeated at + each link monitor interval (arp_interval or miimon, whichever + is active) if the number is greater than 1. + + The valid range is 0 - 255; the default value is 1. These options + affect only the active-backup mode. These options were added for + bonding versions 3.3.0 and 3.4.0 respectively. + + From Linux 3.0 and bonding version 3.7.1, these notifications + are generated by the ipv4 and ipv6 code and the numbers of + repetitions cannot be set independently. + +packets_per_slave + + Specify the number of packets to transmit through a slave before + moving to the next one. When set to 0 then a slave is chosen at + random. + + The valid range is 0 - 65535; the default value is 1. This option + has effect only in balance-rr mode. + primary A string (eth0, eth2, etc) specifying which slave is the @@ -407,7 +699,70 @@ primary one slave is preferred over another, e.g., when one slave has higher throughput than another. - The primary option is only valid for active-backup mode. + The primary option is only valid for active-backup(1), + balance-tlb (5) and balance-alb (6) mode. + +primary_reselect + + Specifies the reselection policy for the primary slave. This + affects how the primary slave is chosen to become the active slave + when failure of the active slave or recovery of the primary slave + occurs. This option is designed to prevent flip-flopping between + the primary slave and other slaves. Possible values are: + + always or 0 (default) + + The primary slave becomes the active slave whenever it + comes back up. + + better or 1 + + The primary slave becomes the active slave when it comes + back up, if the speed and duplex of the primary slave is + better than the speed and duplex of the current active + slave. + + failure or 2 + + The primary slave becomes the active slave only if the + current active slave fails and the primary slave is up. + + The primary_reselect setting is ignored in two cases: + + If no slaves are active, the first slave to recover is + made the active slave. + + When initially enslaved, the primary slave is always made + the active slave. + + Changing the primary_reselect policy via sysfs will cause an + immediate selection of the best active slave according to the new + policy. This may or may not result in a change of the active + slave, depending upon the circumstances. + + This option was added for bonding version 3.6.0. + +tlb_dynamic_lb + + Specifies if dynamic shuffling of flows is enabled in tlb + mode. The value has no effect on any other modes. + + The default behavior of tlb mode is to shuffle active flows across + slaves based on the load in that interval. This gives nice lb + characteristics but can cause packet reordering. If re-ordering is + a concern use this variable to disable flow shuffling and rely on + load balancing provided solely by the hash distribution. + xmit-hash-policy can be used to select the appropriate hashing for + the setup. + + The sysfs entry can be used to change the setting per bond device + and the initial value is derived from the module parameter. The + sysfs entry is allowed to be changed only if the bond device is + down. + + The default value is "1" that enables flow shuffling while value "0" + disables it. This option was added in bonding driver 3.7.1 + updelay @@ -442,7 +797,7 @@ use_carrier xmit_hash_policy Selects the transmit hash policy to use for slave selection in - balance-xor and 802.3ad modes. Possible values are: + balance-xor, 802.3ad, and tlb modes. Possible values are: layer2 @@ -456,6 +811,35 @@ xmit_hash_policy This algorithm is 802.3ad compliant. + layer2+3 + + This policy uses a combination of layer2 and layer3 + protocol information to generate the hash. + + Uses XOR of hardware MAC addresses and IP addresses to + generate the hash. The formula is + + hash = source MAC XOR destination MAC + hash = hash XOR source IP XOR destination IP + hash = hash XOR (hash RSHIFT 16) + hash = hash XOR (hash RSHIFT 8) + And then hash is reduced modulo slave count. + + If the protocol is IPv6 then the source and destination + addresses are first hashed using ipv6_addr_hash. + + This algorithm will place all traffic to a particular + network peer on the same slave. For non-IP traffic, + the formula is the same as for the layer2 transmit + hash policy. + + This policy is intended to provide a more balanced + distribution of traffic than layer2 alone, especially + in environments where a layer3 gateway device is + required to reach most destinations. + + This algorithm is 802.3ad compliant. + layer3+4 This policy uses upper layer protocol information, @@ -466,20 +850,21 @@ xmit_hash_policy The formula for unfragmented TCP and UDP packets is - ((source port XOR dest port) XOR - ((source IP XOR dest IP) AND 0xffff) - modulo slave count + hash = source port, destination port (as in the header) + hash = hash XOR source IP XOR destination IP + hash = hash XOR (hash RSHIFT 16) + hash = hash XOR (hash RSHIFT 8) + And then hash is reduced modulo slave count. + + If the protocol is IPv6 then the source and destination + addresses are first hashed using ipv6_addr_hash. - For fragmented TCP or UDP packets and all other IP - protocol traffic, the source and destination port + For fragmented TCP or UDP packets and all other IPv4 and + IPv6 protocol traffic, the source and destination port information is omitted. For non-IP traffic, the formula is the same as for the layer2 transmit hash policy. - This policy is intended to mimic the behavior of - certain switches, notably Cisco switches with PFC2 as - well as some Foundry and IBM products. - This algorithm is not fully 802.3ad compliant. A single TCP or UDP conversation containing both fragmented and unfragmented packets will see packets @@ -490,31 +875,82 @@ xmit_hash_policy conversations. Other implementations of 802.3ad may or may not tolerate this noncompliance. + encap2+3 + + This policy uses the same formula as layer2+3 but it + relies on skb_flow_dissect to obtain the header fields + which might result in the use of inner headers if an + encapsulation protocol is used. For example this will + improve the performance for tunnel users because the + packets will be distributed according to the encapsulated + flows. + + encap3+4 + + This policy uses the same formula as layer3+4 but it + relies on skb_flow_dissect to obtain the header fields + which might result in the use of inner headers if an + encapsulation protocol is used. For example this will + improve the performance for tunnel users because the + packets will be distributed according to the encapsulated + flows. + The default value is layer2. This option was added in bonding -version 2.6.3. In earlier versions of bonding, this parameter does -not exist, and the layer2 policy is the only policy. + version 2.6.3. In earlier versions of bonding, this parameter + does not exist, and the layer2 policy is the only policy. The + layer2+3 value was added for bonding version 3.2.2. + +resend_igmp + + Specifies the number of IGMP membership reports to be issued after + a failover event. One membership report is issued immediately after + the failover, subsequent packets are sent in each 200ms interval. + + The valid range is 0 - 255; the default value is 1. A value of 0 + prevents the IGMP membership report from being issued in response + to the failover event. + + This option is useful for bonding modes balance-rr (0), active-backup + (1), balance-tlb (5) and balance-alb (6), in which a failover can + switch the IGMP traffic from one slave to another. Therefore a fresh + IGMP report must be issued to cause the switch to forward the incoming + IGMP traffic over the newly selected slave. + + This option was added for bonding version 3.7.0. +lp_interval + + Specifies the number of seconds between instances where the bonding + driver sends learning packets to each slaves peer switch. + + The valid range is 1 - 0x7fffffff; the default value is 1. This Option + has effect only in balance-tlb and balance-alb modes. 3. Configuring Bonding Devices ============================== - There are, essentially, two methods for configuring bonding: -with support from the distro's network initialization scripts, and -without. Distros generally use one of two packages for the network -initialization scripts: initscripts or sysconfig. Recent versions of -these packages have support for bonding, while older versions do not. + You can configure bonding using either your distro's network +initialization scripts, or manually using either iproute2 or the +sysfs interface. Distros generally use one of three packages for the +network initialization scripts: initscripts, sysconfig or interfaces. +Recent versions of these packages have support for bonding, while older +versions do not. We will first describe the options for configuring bonding for -distros using versions of initscripts and sysconfig with full or -partial support for bonding, then provide information on enabling +distros using versions of initscripts, sysconfig and interfaces with full +or partial support for bonding, then provide information on enabling bonding without support from the network initialization scripts (i.e., older versions of initscripts or sysconfig). - If you're unsure whether your distro uses sysconfig or -initscripts, or don't know if it's new enough, have no fear. + If you're unsure whether your distro uses sysconfig, +initscripts or interfaces, or don't know if it's new enough, have no fear. Determining this is fairly straightforward. - First, issue the command: + First, look for a file called interfaces in /etc/network directory. +If this file is present in your system, then your system use interfaces. See +Configuration with Interfaces Support. + + Else, issue the command: $ rpm -qf /sbin/ifup @@ -530,7 +966,7 @@ $ grep ifenslave /sbin/ifup If this returns any matches, then your initscripts or sysconfig has support for bonding. -3.1 Configuration with sysconfig support +3.1 Configuration with Sysconfig Support ---------------------------------------- This section applies to distros using a version of sysconfig @@ -538,7 +974,7 @@ with bonding support, for example, SuSE Linux Enterprise Server 9. SuSE SLES 9's networking configuration system does support bonding, however, at this writing, the YaST system configuration -frontend does not provide any means to work with bonding devices. +front end does not provide any means to work with bonding devices. Bonding devices can be managed by hand, however, as follows. First, if they have not already been configured, configure the @@ -660,7 +1096,7 @@ format can be found in an example ifcfg template file: Note that the template does not document the various BONDING_ settings described above, but does describe many of the other options. -3.1.1 Using DHCP with sysconfig +3.1.1 Using DHCP with Sysconfig ------------------------------- Under sysconfig, configuring a device with BOOTPROTO='dhcp' @@ -670,7 +1106,7 @@ attempt to obtain the device address from DHCP prior to adding any of the slave devices. Without active slaves, the DHCP requests are not sent to the network. -3.1.2 Configuring Multiple Bonds with sysconfig +3.1.2 Configuring Multiple Bonds with Sysconfig ----------------------------------------------- The sysconfig network initialization system is capable of @@ -683,16 +1119,18 @@ ifcfg-bondX files. Because the sysconfig scripts supply the bonding module options in the ifcfg-bondX file, it is not necessary to add them to -the system /etc/modules.conf or /etc/modprobe.conf configuration file. +the system /etc/modules.d/*.conf configuration files. -3.2 Configuration with initscripts support +3.2 Configuration with Initscripts Support ------------------------------------------ - This section applies to distros using a version of initscripts -with bonding support, for example, Red Hat Linux 9 or Red Hat -Enterprise Linux version 3 or 4. On these systems, the network -initialization scripts have some knowledge of bonding, and can be -configured to control bonding devices. + This section applies to distros using a recent version of +initscripts with bonding support, for example, Red Hat Enterprise Linux +version 3 or later, Fedora, etc. On these systems, the network +initialization scripts have knowledge of bonding, and can be configured to +control bonding devices. Note that older versions of the initscripts +package have lower levels of support for bonding; this will be noted where +applicable. These distros will not automatically load the network adapter driver unless the ethX device is configured with an IP address. @@ -740,11 +1178,31 @@ USERCTL=no Be sure to change the networking specific lines (IPADDR, NETMASK, NETWORK and BROADCAST) to match your network configuration. - Finally, it is necessary to edit /etc/modules.conf (or -/etc/modprobe.conf, depending upon your distro) to load the bonding -module with your desired options when the bond0 interface is brought -up. The following lines in /etc/modules.conf (or modprobe.conf) will -load the bonding module, and select its options: + For later versions of initscripts, such as that found with Fedora +7 (or later) and Red Hat Enterprise Linux version 5 (or later), it is possible, +and, indeed, preferable, to specify the bonding options in the ifcfg-bond0 +file, e.g. a line of the format: + +BONDING_OPTS="mode=active-backup arp_interval=60 arp_ip_target=192.168.1.254" + + will configure the bond with the specified options. The options +specified in BONDING_OPTS are identical to the bonding module parameters +except for the arp_ip_target field when using versions of initscripts older +than and 8.57 (Fedora 8) and 8.45.19 (Red Hat Enterprise Linux 5.2). When +using older versions each target should be included as a separate option and +should be preceded by a '+' to indicate it should be added to the list of +queried targets, e.g., + + arp_ip_target=+192.168.1.1 arp_ip_target=+192.168.1.2 + + is the proper syntax to specify multiple targets. When specifying +options via BONDING_OPTS, it is not necessary to edit /etc/modprobe.d/*.conf. + + For even older versions of initscripts that do not support +BONDING_OPTS, it is necessary to edit /etc/modprobe.d/*.conf, depending upon +your distro) to load the bonding module with your desired options when the +bond0 interface is brought up. The following lines in /etc/modprobe.d/*.conf +will load the bonding module, and select its options: alias bond0 bonding options bond0 mode=balance-alb miimon=100 @@ -756,36 +1214,33 @@ options for your configuration. will restart the networking subsystem and your bond link should be now up and running. -3.2.1 Using DHCP with initscripts +3.2.1 Using DHCP with Initscripts --------------------------------- - Recent versions of initscripts (the version supplied with -Fedora Core 3 and Red Hat Enterprise Linux 4 is reported to work) do -have support for assigning IP information to bonding devices via DHCP. + Recent versions of initscripts (the versions supplied with Fedora +Core 3 and Red Hat Enterprise Linux 4, or later versions, are reported to +work) have support for assigning IP information to bonding devices via +DHCP. To configure bonding for DHCP, configure it as described above, except replace the line "BOOTPROTO=none" with "BOOTPROTO=dhcp" and add a line consisting of "TYPE=Bonding". Note that the TYPE value is case sensitive. -3.2.2 Configuring Multiple Bonds with initscripts +3.2.2 Configuring Multiple Bonds with Initscripts ------------------------------------------------- - At this writing, the initscripts package does not directly -support loading the bonding driver multiple times, so the process for -doing so is the same as described in the "Configuring Multiple Bonds -Manually" section, below. - - NOTE: It has been observed that some Red Hat supplied kernels -are apparently unable to rename modules at load time (the "-o bond1" -part). Attempts to pass that option to modprobe will produce an -"Operation not permitted" error. This has been reported on some -Fedora Core kernels, and has been seen on RHEL 4 as well. On kernels -exhibiting this problem, it will be impossible to configure multiple -bonds with differing parameters. + Initscripts packages that are included with Fedora 7 and Red Hat +Enterprise Linux 5 support multiple bonding interfaces by simply +specifying the appropriate BONDING_OPTS= in ifcfg-bondX where X is the +number of the bond. This support requires sysfs support in the kernel, +and a bonding driver of version 3.0.0 or later. Other configurations may +not support this method for specifying multiple bonding interfaces; for +those instances, see the "Configuring Multiple Bonds Manually" section, +below. -3.3 Configuring Bonding Manually --------------------------------- +3.3 Configuring Bonding Manually with iproute2 +----------------------------------------------- This section applies to distros whose network initialization scripts (the sysconfig or initscripts package) do not have specific @@ -793,9 +1248,9 @@ knowledge of bonding. One such distro is SuSE Linux Enterprise Server version 8. The general method for these systems is to place the bonding -module parameters into /etc/modules.conf or /etc/modprobe.conf (as +module parameters into a config file in /etc/modprobe.d/ (as appropriate for the installed distro), then add modprobe and/or -ifenslave commands to the system's global init script. The name of +`ip link` commands to the system's global init script. The name of the global init script differs; for sysconfig, it is /etc/init.d/boot.local and for initscripts it is /etc/rc.d/rc.local. @@ -807,8 +1262,8 @@ reboots, edit the appropriate file (/etc/init.d/boot.local or modprobe bonding mode=balance-alb miimon=100 modprobe e100 ifconfig bond0 192.168.1.1 netmask 255.255.255.0 up -ifenslave bond0 eth0 -ifenslave bond0 eth1 +ip link set eth0 master bond0 +ip link set eth1 master bond0 Replace the example bonding module parameters and bond0 network configuration (IP address, netmask, etc) with the appropriate @@ -853,20 +1308,24 @@ initialization scripts lack support for configuring multiple bonds. options, you may wish to use the "max_bonds" module parameter, documented above. - To create multiple bonding devices with differing options, it -is necessary to load the bonding driver multiple times. Note that -current versions of the sysconfig network initialization scripts -handle this automatically; if your distro uses these scripts, no -special action is needed. See the section Configuring Bonding -Devices, above, if you're not sure about your network initialization -scripts. + To create multiple bonding devices with differing options, it is +preferable to use bonding parameters exported by sysfs, documented in the +section below. + + For versions of bonding without sysfs support, the only means to +provide multiple instances of bonding with differing options is to load +the bonding driver multiple times. Note that current versions of the +sysconfig network initialization scripts handle this automatically; if +your distro uses these scripts, no special action is needed. See the +section Configuring Bonding Devices, above, if you're not sure about your +network initialization scripts. To load multiple instances of the module, it is necessary to specify a different name for each instance (the module loading system requires that every loaded module, even multiple instances of the same -module, have a unique name). This is accomplished by supplying -multiple sets of bonding options in /etc/modprobe.conf, for example: - +module, have a unique name). This is accomplished by supplying multiple +sets of bonding options in /etc/modprobe.d/*.conf, for example: + alias bond0 bonding options bond0 -o bond0 mode=balance-rr miimon=100 @@ -889,11 +1348,283 @@ install bond1 /sbin/modprobe --ignore-install bonding -o bond1 \ This may be repeated any number of times, specifying a new and unique name in place of bond1 for each subsequent instance. + It has been observed that some Red Hat supplied kernels are unable +to rename modules at load time (the "-o bond1" part). Attempts to pass +that option to modprobe will produce an "Operation not permitted" error. +This has been reported on some Fedora Core kernels, and has been seen on +RHEL 4 as well. On kernels exhibiting this problem, it will be impossible +to configure multiple bonds with differing parameters (as they are older +kernels, and also lack sysfs support). + +3.4 Configuring Bonding Manually via Sysfs +------------------------------------------ + + Starting with version 3.0.0, Channel Bonding may be configured +via the sysfs interface. This interface allows dynamic configuration +of all bonds in the system without unloading the module. It also +allows for adding and removing bonds at runtime. Ifenslave is no +longer required, though it is still supported. + + Use of the sysfs interface allows you to use multiple bonds +with different configurations without having to reload the module. +It also allows you to use multiple, differently configured bonds when +bonding is compiled into the kernel. + + You must have the sysfs filesystem mounted to configure +bonding this way. The examples in this document assume that you +are using the standard mount point for sysfs, e.g. /sys. If your +sysfs filesystem is mounted elsewhere, you will need to adjust the +example paths accordingly. + +Creating and Destroying Bonds +----------------------------- +To add a new bond foo: +# echo +foo > /sys/class/net/bonding_masters + +To remove an existing bond bar: +# echo -bar > /sys/class/net/bonding_masters + +To show all existing bonds: +# cat /sys/class/net/bonding_masters + +NOTE: due to 4K size limitation of sysfs files, this list may be +truncated if you have more than a few hundred bonds. This is unlikely +to occur under normal operating conditions. + +Adding and Removing Slaves +-------------------------- + Interfaces may be enslaved to a bond using the file +/sys/class/net/<bond>/bonding/slaves. The semantics for this file +are the same as for the bonding_masters file. + +To enslave interface eth0 to bond bond0: +# ifconfig bond0 up +# echo +eth0 > /sys/class/net/bond0/bonding/slaves + +To free slave eth0 from bond bond0: +# echo -eth0 > /sys/class/net/bond0/bonding/slaves + + When an interface is enslaved to a bond, symlinks between the +two are created in the sysfs filesystem. In this case, you would get +/sys/class/net/bond0/slave_eth0 pointing to /sys/class/net/eth0, and +/sys/class/net/eth0/master pointing to /sys/class/net/bond0. + + This means that you can tell quickly whether or not an +interface is enslaved by looking for the master symlink. Thus: +# echo -eth0 > /sys/class/net/eth0/master/bonding/slaves +will free eth0 from whatever bond it is enslaved to, regardless of +the name of the bond interface. + +Changing a Bond's Configuration +------------------------------- + Each bond may be configured individually by manipulating the +files located in /sys/class/net/<bond name>/bonding + + The names of these files correspond directly with the command- +line parameters described elsewhere in this file, and, with the +exception of arp_ip_target, they accept the same values. To see the +current setting, simply cat the appropriate file. + + A few examples will be given here; for specific usage +guidelines for each parameter, see the appropriate section in this +document. + +To configure bond0 for balance-alb mode: +# ifconfig bond0 down +# echo 6 > /sys/class/net/bond0/bonding/mode + - or - +# echo balance-alb > /sys/class/net/bond0/bonding/mode + NOTE: The bond interface must be down before the mode can be +changed. + +To enable MII monitoring on bond0 with a 1 second interval: +# echo 1000 > /sys/class/net/bond0/bonding/miimon + NOTE: If ARP monitoring is enabled, it will disabled when MII +monitoring is enabled, and vice-versa. + +To add ARP targets: +# echo +192.168.0.100 > /sys/class/net/bond0/bonding/arp_ip_target +# echo +192.168.0.101 > /sys/class/net/bond0/bonding/arp_ip_target + NOTE: up to 16 target addresses may be specified. + +To remove an ARP target: +# echo -192.168.0.100 > /sys/class/net/bond0/bonding/arp_ip_target + +To configure the interval between learning packet transmits: +# echo 12 > /sys/class/net/bond0/bonding/lp_interval + NOTE: the lp_inteval is the number of seconds between instances where +the bonding driver sends learning packets to each slaves peer switch. The +default interval is 1 second. + +Example Configuration +--------------------- + We begin with the same example that is shown in section 3.3, +executed with sysfs, and without using ifenslave. + + To make a simple bond of two e100 devices (presumed to be eth0 +and eth1), and have it persist across reboots, edit the appropriate +file (/etc/init.d/boot.local or /etc/rc.d/rc.local), and add the +following: -5. Querying Bonding Configuration +modprobe bonding +modprobe e100 +echo balance-alb > /sys/class/net/bond0/bonding/mode +ifconfig bond0 192.168.1.1 netmask 255.255.255.0 up +echo 100 > /sys/class/net/bond0/bonding/miimon +echo +eth0 > /sys/class/net/bond0/bonding/slaves +echo +eth1 > /sys/class/net/bond0/bonding/slaves + + To add a second bond, with two e1000 interfaces in +active-backup mode, using ARP monitoring, add the following lines to +your init script: + +modprobe e1000 +echo +bond1 > /sys/class/net/bonding_masters +echo active-backup > /sys/class/net/bond1/bonding/mode +ifconfig bond1 192.168.2.1 netmask 255.255.255.0 up +echo +192.168.2.100 /sys/class/net/bond1/bonding/arp_ip_target +echo 2000 > /sys/class/net/bond1/bonding/arp_interval +echo +eth2 > /sys/class/net/bond1/bonding/slaves +echo +eth3 > /sys/class/net/bond1/bonding/slaves + +3.5 Configuration with Interfaces Support +----------------------------------------- + + This section applies to distros which use /etc/network/interfaces file +to describe network interface configuration, most notably Debian and it's +derivatives. + + The ifup and ifdown commands on Debian don't support bonding out of +the box. The ifenslave-2.6 package should be installed to provide bonding +support. Once installed, this package will provide bond-* options to be used +into /etc/network/interfaces. + + Note that ifenslave-2.6 package will load the bonding module and use +the ifenslave command when appropriate. + +Example Configurations +---------------------- + +In /etc/network/interfaces, the following stanza will configure bond0, in +active-backup mode, with eth0 and eth1 as slaves. + +auto bond0 +iface bond0 inet dhcp + bond-slaves eth0 eth1 + bond-mode active-backup + bond-miimon 100 + bond-primary eth0 eth1 + +If the above configuration doesn't work, you might have a system using +upstart for system startup. This is most notably true for recent +Ubuntu versions. The following stanza in /etc/network/interfaces will +produce the same result on those systems. + +auto bond0 +iface bond0 inet dhcp + bond-slaves none + bond-mode active-backup + bond-miimon 100 + +auto eth0 +iface eth0 inet manual + bond-master bond0 + bond-primary eth0 eth1 + +auto eth1 +iface eth1 inet manual + bond-master bond0 + bond-primary eth0 eth1 + +For a full list of bond-* supported options in /etc/network/interfaces and some +more advanced examples tailored to you particular distros, see the files in +/usr/share/doc/ifenslave-2.6. + +3.6 Overriding Configuration for Special Cases +---------------------------------------------- + +When using the bonding driver, the physical port which transmits a frame is +typically selected by the bonding driver, and is not relevant to the user or +system administrator. The output port is simply selected using the policies of +the selected bonding mode. On occasion however, it is helpful to direct certain +classes of traffic to certain physical interfaces on output to implement +slightly more complex policies. For example, to reach a web server over a +bonded interface in which eth0 connects to a private network, while eth1 +connects via a public network, it may be desirous to bias the bond to send said +traffic over eth0 first, using eth1 only as a fall back, while all other traffic +can safely be sent over either interface. Such configurations may be achieved +using the traffic control utilities inherent in linux. + +By default the bonding driver is multiqueue aware and 16 queues are created +when the driver initializes (see Documentation/networking/multiqueue.txt +for details). If more or less queues are desired the module parameter +tx_queues can be used to change this value. There is no sysfs parameter +available as the allocation is done at module init time. + +The output of the file /proc/net/bonding/bondX has changed so the output Queue +ID is now printed for each slave: + +Bonding Mode: fault-tolerance (active-backup) +Primary Slave: None +Currently Active Slave: eth0 +MII Status: up +MII Polling Interval (ms): 0 +Up Delay (ms): 0 +Down Delay (ms): 0 + +Slave Interface: eth0 +MII Status: up +Link Failure Count: 0 +Permanent HW addr: 00:1a:a0:12:8f:cb +Slave queue ID: 0 + +Slave Interface: eth1 +MII Status: up +Link Failure Count: 0 +Permanent HW addr: 00:1a:a0:12:8f:cc +Slave queue ID: 2 + +The queue_id for a slave can be set using the command: + +# echo "eth1:2" > /sys/class/net/bond0/bonding/queue_id + +Any interface that needs a queue_id set should set it with multiple calls +like the one above until proper priorities are set for all interfaces. On +distributions that allow configuration via initscripts, multiple 'queue_id' +arguments can be added to BONDING_OPTS to set all needed slave queues. + +These queue id's can be used in conjunction with the tc utility to configure +a multiqueue qdisc and filters to bias certain traffic to transmit on certain +slave devices. For instance, say we wanted, in the above configuration to +force all traffic bound to 192.168.1.100 to use eth1 in the bond as its output +device. The following commands would accomplish this: + +# tc qdisc add dev bond0 handle 1 root multiq + +# tc filter add dev bond0 protocol ip parent 1: prio 1 u32 match ip dst \ + 192.168.1.100 action skbedit queue_mapping 2 + +These commands tell the kernel to attach a multiqueue queue discipline to the +bond0 interface and filter traffic enqueued to it, such that packets with a dst +ip of 192.168.1.100 have their output queue mapping value overwritten to 2. +This value is then passed into the driver, causing the normal output path +selection policy to be overridden, selecting instead qid 2, which maps to eth1. + +Note that qid values begin at 1. Qid 0 is reserved to initiate to the driver +that normal output policy selection should take place. One benefit to simply +leaving the qid for a slave to 0 is the multiqueue awareness in the bonding +driver that is now present. This awareness allows tc filters to be placed on +slave devices as well as bond devices and the bonding driver will simply act as +a pass-through for selecting output queues on the slave device rather than +output port selection. + +This feature first appeared in bonding driver version 3.7.0 and support for +output slave selection was limited to round-robin and active-backup modes. + +4 Querying Bonding Configuration ================================= -5.1 Bonding Configuration +4.1 Bonding Configuration ------------------------- Each bonding device has a read-only file residing in the @@ -923,7 +1654,7 @@ generally as follows: The precise format and contents will change depending upon the bonding configuration, state, and version of the bonding driver. -5.2 Network configuration +4.2 Network configuration ------------------------- The network configuration can be inspected using the ifconfig @@ -945,7 +1676,6 @@ bond0 Link encap:Ethernet HWaddr 00:C0:F0:1F:37:B4 collisions:0 txqueuelen:0 eth0 Link encap:Ethernet HWaddr 00:C0:F0:1F:37:B4 - inet addr:XXX.XXX.XXX.YYY Bcast:XXX.XXX.XXX.255 Mask:255.255.252.0 UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1 RX packets:3573025 errors:0 dropped:0 overruns:0 frame:0 TX packets:1643167 errors:1 dropped:0 overruns:1 carrier:0 @@ -953,14 +1683,13 @@ eth0 Link encap:Ethernet HWaddr 00:C0:F0:1F:37:B4 Interrupt:10 Base address:0x1080 eth1 Link encap:Ethernet HWaddr 00:C0:F0:1F:37:B4 - inet addr:XXX.XXX.XXX.YYY Bcast:XXX.XXX.XXX.255 Mask:255.255.252.0 UP BROADCAST RUNNING SLAVE MULTICAST MTU:1500 Metric:1 RX packets:3651769 errors:0 dropped:0 overruns:0 frame:0 TX packets:1643480 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:100 Interrupt:9 Base address:0x1400 -6. Switch Configuration +5. Switch Configuration ======================= For this section, "switch" refers to whatever system the @@ -993,7 +1722,7 @@ transmit policy for an EtherChannel group; all three will interoperate with another EtherChannel group. -7. 802.1q VLAN Support +6. 802.1q VLAN Support ====================== It is possible to configure VLAN devices over a bond interface @@ -1044,7 +1773,7 @@ underlying device -- i.e. the bonding interface -- to promiscuous mode, which might not be what you want. -8. Link Monitoring +7. Link Monitoring ================== The bonding driver at present supports two schemes for @@ -1055,7 +1784,7 @@ monitor. bonding driver itself, it is not possible to enable both ARP and MII monitoring simultaneously. -8.1 ARP Monitor Operation +7.1 ARP Monitor Operation ------------------------- The ARP monitor operates as its name suggests: it sends ARP @@ -1073,7 +1802,7 @@ those slaves will stay down. If networking monitoring (tcpdump, etc) shows the ARP requests and replies on the network, then it may be that your device driver is not updating last_rx and trans_start. -8.2 Configuring Multiple ARP Targets +7.2 Configuring Multiple ARP Targets ------------------------------------ While ARP monitoring can be done with just one target, it can @@ -1096,7 +1825,7 @@ alias bond0 bonding options bond0 arp_interval=60 arp_ip_target=192.168.0.100 -8.3 MII Monitor Operation +7.3 MII Monitor Operation ------------------------- The MII monitor monitors only the carrier state of the local @@ -1122,14 +1851,14 @@ does not support or had some error in processing both the MII register and ethtool requests), then the MII monitor will assume the link is up. -9. Potential Sources of Trouble +8. Potential Sources of Trouble =============================== -9.1 Adventures in Routing +8.1 Adventures in Routing ------------------------- When bonding is configured, it is important that the slave -devices not have routes that supercede routes of the master (or, +devices not have routes that supersede routes of the master (or, generally, not have routes at all). For example, suppose the bonding device bond0 has two slaves, eth0 and eth1, and the routing table is as follows: @@ -1156,18 +1885,18 @@ by the state of the routing table. The solution here is simply to insure that slaves do not have routes of their own, and if for some reason they must, those routes do -not supercede routes of their master. This should generally be the +not supersede routes of their master. This should generally be the case, but unusual configurations or errant manual or automatic static route additions may cause trouble. -9.2 Ethernet Device Renaming +8.2 Ethernet Device Renaming ---------------------------- On systems with network configuration scripts that do not associate physical devices directly with network interface names (so that the same physical device always has the same "ethX" name), it may -be necessary to add some special logic to either /etc/modules.conf or -/etc/modprobe.conf (depending upon which is installed on the system). +be necessary to add some special logic to config files in +/etc/modprobe.d/. For example, given a modules.conf containing the following: @@ -1194,22 +1923,17 @@ add above bonding e1000 tg3 bonding is loaded. This command is fully documented in the modules.conf manual page. - On systems utilizing modprobe.conf (or modprobe.conf.local), -an equivalent problem can occur. In this case, the following can be -added to modprobe.conf (or modprobe.conf.local, as appropriate), as -follows (all on one line; it has been split here for clarity): + On systems utilizing modprobe an equivalent problem can occur. +In this case, the following can be added to config files in +/etc/modprobe.d/ as: -install bonding /sbin/modprobe tg3; /sbin/modprobe e1000; - /sbin/modprobe --ignore-install bonding +softdep bonding pre: tg3 e1000 - This will, when loading the bonding module, rather than -performing the normal action, instead execute the provided command. -This command loads the device drivers in the order needed, then calls -modprobe with --ignore-install to cause the normal action to then take -place. Full documentation on this can be found in the modprobe.conf -and modprobe manual pages. + This will load tg3 and e1000 modules before loading the bonding one. +Full documentation on this can be found in the modprobe.d and modprobe +manual pages. -9.3. Painfully Slow Or No Failed Link Detection By Miimon +8.3. Painfully Slow Or No Failed Link Detection By Miimon --------------------------------------------------------- By default, bonding enables the use_carrier option, which @@ -1237,7 +1961,7 @@ carrier state. It has no way to determine the state of devices on or beyond other ports of a switch, or if a switch is refusing to pass traffic while still maintaining carrier on. -10. SNMP agents +9. SNMP agents =============== If running SNMP agents, the bonding driver should be loaded @@ -1283,7 +2007,7 @@ ifDescr, the association between the IP address and IfIndex remains and SNMP functions such as Interface_Scan_Next will report that association. -11. Promiscuous mode +10. Promiscuous mode ==================== When running network monitoring tools, e.g., tcpdump, it is @@ -1310,7 +2034,7 @@ sending to peers that are unassigned or if the load is unbalanced. the active slave changes (e.g., due to a link failure), the promiscuous setting will be propagated to the new active slave. -12. Configuring Bonding for High Availability +11. Configuring Bonding for High Availability ============================================= High Availability refers to configurations that provide @@ -1320,7 +2044,7 @@ goal is to provide the maximum availability of network connectivity (i.e., the network always works), even though other configurations could provide higher throughput. -12.1 High Availability in a Single Switch Topology +11.1 High Availability in a Single Switch Topology -------------------------------------------------- If two hosts (or a host and a single switch) are directly @@ -1331,10 +2055,10 @@ access to fail over to. Additionally, the bonding load balance modes support link monitoring of their members, so if individual links fail, the load will be rebalanced across the remaining devices. - See Section 13, "Configuring Bonding for Maximum Throughput" + See Section 12, "Configuring Bonding for Maximum Throughput" for information on configuring bonding with one peer device. -12.2 High Availability in a Multiple Switch Topology +11.2 High Availability in a Multiple Switch Topology ---------------------------------------------------- With multiple switches, the configuration of bonding and the @@ -1361,7 +2085,7 @@ switches (ISL, or inter switch link), and multiple ports connecting to the outside world ("port3" on each switch). There is no technical reason that this could not be extended to a third switch. -12.2.1 HA Bonding Mode Selection for Multiple Switch Topology +11.2.1 HA Bonding Mode Selection for Multiple Switch Topology ------------------------------------------------------------- In a topology such as the example above, the active-backup and @@ -1383,7 +2107,7 @@ broadcast: This mode is really a special purpose mode, and is suitable necessary for some specific one-way traffic to reach both independent networks, then the broadcast mode may be suitable. -12.2.2 HA Link Monitoring Selection for Multiple Switch Topology +11.2.2 HA Link Monitoring Selection for Multiple Switch Topology ---------------------------------------------------------------- The choice of link monitoring ultimately depends upon your @@ -1403,11 +2127,20 @@ one for each switch in the network). This will insure that, regardless of which switch is active, the ARP monitor has a suitable target to query. - -13. Configuring Bonding for Maximum Throughput + Note, also, that of late many switches now support a functionality +generally referred to as "trunk failover." This is a feature of the +switch that causes the link state of a particular switch port to be set +down (or up) when the state of another switch port goes down (or up). +Its purpose is to propagate link failures from logically "exterior" ports +to the logically "interior" ports that bonding is able to monitor via +miimon. Availability and configuration for trunk failover varies by +switch, but this can be a viable alternative to the ARP monitor when using +suitable switches. + +12. Configuring Bonding for Maximum Throughput ============================================== -13.1 Maximizing Throughput in a Single Switch Topology +12.1 Maximizing Throughput in a Single Switch Topology ------------------------------------------------------ In a single switch configuration, the best method to maximize @@ -1478,7 +2211,7 @@ destination to make load balancing decisions. The behavior of each mode is described below. -13.1.1 MT Bonding Mode Selection for Single Switch Topology +12.1.1 MT Bonding Mode Selection for Single Switch Topology ----------------------------------------------------------- This configuration is the easiest to set up and to understand, @@ -1490,7 +2223,7 @@ balance-rr: This mode is the only mode that will permit a single interfaces. It is therefore the only mode that will allow a single TCP/IP stream to utilize more than one interface's worth of throughput. This comes at a cost, however: the - striping often results in peer systems receiving packets out + striping generally results in peer systems receiving packets out of order, causing TCP/IP's congestion control system to kick in, often by retransmitting segments. @@ -1502,22 +2235,20 @@ balance-rr: This mode is the only mode that will permit a single interface's worth of throughput, even after adjusting tcp_reordering. - Note that this out of order delivery occurs when both the - sending and receiving systems are utilizing a multiple - interface bond. Consider a configuration in which a - balance-rr bond feeds into a single higher capacity network - channel (e.g., multiple 100Mb/sec ethernets feeding a single - gigabit ethernet via an etherchannel capable switch). In this - configuration, traffic sent from the multiple 100Mb devices to - a destination connected to the gigabit device will not see - packets out of order. However, traffic sent from the gigabit - device to the multiple 100Mb devices may or may not see - traffic out of order, depending upon the balance policy of the - switch. Many switches do not support any modes that stripe - traffic (instead choosing a port based upon IP or MAC level - addresses); for those devices, traffic flowing from the - gigabit device to the many 100Mb devices will only utilize one - interface. + Note that the fraction of packets that will be delivered out of + order is highly variable, and is unlikely to be zero. The level + of reordering depends upon a variety of factors, including the + networking interfaces, the switch, and the topology of the + configuration. Speaking in general terms, higher speed network + cards produce more reordering (due to factors such as packet + coalescing), and a "many to many" topology will reorder at a + higher rate than a "many slow to one fast" configuration. + + Many switches do not support any modes that stripe traffic + (instead choosing a port based upon IP or MAC level addresses); + for those devices, traffic for a particular connection flowing + through the switch to a balance-rr bond will not utilize greater + than one interface's worth of bandwidth. If you are utilizing protocols other than TCP/IP, UDP for example, and your application can tolerate out of order @@ -1609,7 +2340,7 @@ balance-alb: This mode is everything that balance-tlb is, and more. device driver must support changing the hardware address while the device is open. -13.1.2 MT Link Monitoring for Single Switch Topology +12.1.2 MT Link Monitoring for Single Switch Topology ---------------------------------------------------- The choice of link monitoring may largely depend upon which @@ -1618,7 +2349,7 @@ support the use of the ARP monitor, and are thus restricted to using the MII monitor (which does not provide as high a level of end to end assurance as the ARP monitor). -13.2 Maximum Throughput in a Multiple Switch Topology +12.2 Maximum Throughput in a Multiple Switch Topology ----------------------------------------------------- Multiple switches may be utilized to optimize for throughput @@ -1653,7 +2384,7 @@ a single 72 port switch. can be equipped with an additional network device connected to an external network; this host then additionally acts as a gateway. -13.2.1 MT Bonding Mode Selection for Multiple Switch Topology +12.2.1 MT Bonding Mode Selection for Multiple Switch Topology ------------------------------------------------------------- In actual practice, the bonding mode typically employed in @@ -1666,7 +2397,7 @@ packets has arrived). When employed in this fashion, the balance-rr mode allows individual connections between two hosts to effectively utilize greater than one interface's bandwidth. -13.2.2 MT Link Monitoring for Multiple Switch Topology +12.2.2 MT Link Monitoring for Multiple Switch Topology ------------------------------------------------------ Again, in actual practice, the MII monitor is most often used @@ -1676,10 +2407,10 @@ advantages over the MII monitor are mitigated by the volume of probes needed as the number of systems involved grows (remember that each host in the network is configured with bonding). -14. Switch Behavior Issues +13. Switch Behavior Issues ========================== -14.1 Link Establishment and Failover Delays +13.1 Link Establishment and Failover Delays ------------------------------------------- Some switches exhibit undesirable behavior with regard to the @@ -1714,9 +2445,13 @@ switches take a long time to go into backup mode, it may be desirable to not activate a backup interface immediately after a link goes down. Failover may be delayed via the downdelay bonding module option. -14.2 Duplicated Incoming Packets +13.2 Duplicated Incoming Packets -------------------------------- + NOTE: Starting with version 3.0.2, the bonding driver has logic to +suppress duplicate packets, which should largely eliminate this problem. +The following description is kept for reference. + It is not uncommon to observe a short burst of duplicated traffic when the bonding device is first used, or after it has been idle for some period of time. This is most easily observed by issuing @@ -1753,14 +2488,14 @@ behavior, it can be induced by clearing the MAC forwarding table (on most Cisco switches, the privileged command "clear mac address-table dynamic" will accomplish this). -15. Hardware Specific Considerations +14. Hardware Specific Considerations ==================================== This section contains additional information for configuring bonding on specific hardware platforms, or for interfacing bonding with particular switches or other devices. -15.1 IBM BladeCenter +14.1 IBM BladeCenter -------------------- This applies to the JS20 and similar systems. @@ -1863,7 +2598,7 @@ bonding driver. avoid fail-over delay issues when using bonding. -16. Frequently Asked Questions +15. Frequently Asked Questions ============================== 1. Is it SMP safe? @@ -1877,6 +2612,9 @@ The new driver was designed to be SMP safe from the start. EtherExpress PRO/100 and a 3com 3c905b, for example). For most modes, devices need not be of the same speed. + Starting with version 3.2.1, bonding also supports Infiniband +slaves in active-backup mode. + 3. How many bonding devices can I have? There is no limit. @@ -1927,7 +2665,7 @@ not have special switch requirements, but do need device drivers that support specific features (described in the appropriate section under module parameters, above). - In 802.3ad mode, it works with with systems that support IEEE + In 802.3ad mode, it works with systems that support IEEE 802.3ad Dynamic Link Aggregation. Most managed and many unmanaged switches currently available support 802.3ad. @@ -1935,11 +2673,15 @@ switches currently available support 802.3ad. 8. Where does a bonding device get its MAC address from? - If not explicitly configured (with ifconfig or ip link), the -MAC address of the bonding device is taken from its first slave -device. This MAC address is then passed to all following slaves and -remains persistent (even if the first slave is removed) until the -bonding device is brought down or reconfigured. + When using slave devices that have fixed MAC addresses, or when +the fail_over_mac option is enabled, the bonding device's MAC address is +the MAC address of the active slave. + + For other configurations, if not explicitly configured (with +ifconfig or ip link), the MAC address of the bonding device is taken from +its first slave device. This MAC address is then passed to all following +slaves and remains persistent (even if the first slave is removed) until +the bonding device is brought down or reconfigured. If you wish to change the MAC address, you can set it with ifconfig or ip link: @@ -1966,18 +2708,15 @@ enslaved. 16. Resources and Links ======================= -The latest version of the bonding driver can be found in the latest + The latest version of the bonding driver can be found in the latest version of the linux kernel, found on http://kernel.org -The latest version of this document can be found in either the latest -kernel source (named Documentation/networking/bonding.txt), or on the -bonding sourceforge site: - -http://www.sourceforge.net/projects/bonding + The latest version of this document can be found in the latest kernel +source (named Documentation/networking/bonding.txt). -Discussions regarding the bonding driver take place primarily on the -bonding-devel mailing list, hosted at sourceforge.net. If you have -questions or problems, post them to the list. The list address is: + Discussions regarding the usage of the bonding driver take place on the +bonding-devel mailing list, hosted at sourceforge.net. If you have questions or +problems, post them to the list. The list address is: bonding-devel@lists.sourceforge.net @@ -1986,8 +2725,19 @@ be found at: https://lists.sourceforge.net/lists/listinfo/bonding-devel + Discussions regarding the development of the bonding driver take place +on the main Linux network mailing list, hosted at vger.kernel.org. The list +address is: + +netdev@vger.kernel.org + + The administrative interface (to subscribe or unsubscribe) can +be found at: + +http://vger.kernel.org/vger-lists.html#netdev + Donald Becker's Ethernet Drivers and diag programs may be found at : - - http://www.scyld.com/network/ + - http://web.archive.org/web/*/http://www.scyld.com/network/ You will also find a lot of information regarding Ethernet, NWay, MII, etc. at www.scyld.com. diff --git a/Documentation/networking/bridge.txt b/Documentation/networking/bridge.txt index bdae2db4119..a27cb6214ed 100644 --- a/Documentation/networking/bridge.txt +++ b/Documentation/networking/bridge.txt @@ -1,8 +1,15 @@ In order to use the Ethernet bridging functionality, you'll need the -userspace tools. These programs and documentation are available -at http://bridge.sourceforge.net. The download page is -http://prdownloads.sourceforge.net/bridge. +userspace tools. + +Documentation for Linux bridging is on: + http://www.linuxfoundation.org/collaborate/workgroups/networking/bridge + +The bridge-utilities are maintained at: + git://git.kernel.org/pub/scm/linux/kernel/git/shemminger/bridge-utils.git + +Additionally, the iproute2 utilities can be used to configure +bridge devices. If you still have questions, don't hesitate to post to the mailing list -(more info http://lists.osdl.org/mailman/listinfo/bridge). +(more info https://lists.linux-foundation.org/mailman/listinfo/bridge). diff --git a/Documentation/networking/caif/Linux-CAIF.txt b/Documentation/networking/caif/Linux-CAIF.txt new file mode 100644 index 00000000000..0aa4bd381be --- /dev/null +++ b/Documentation/networking/caif/Linux-CAIF.txt @@ -0,0 +1,175 @@ +Linux CAIF +=========== +copyright (C) ST-Ericsson AB 2010 +Author: Sjur Brendeland/ sjur.brandeland@stericsson.com +License terms: GNU General Public License (GPL) version 2 + + +Introduction +------------ +CAIF is a MUX protocol used by ST-Ericsson cellular modems for +communication between Modem and host. The host processes can open virtual AT +channels, initiate GPRS Data connections, Video channels and Utility Channels. +The Utility Channels are general purpose pipes between modem and host. + +ST-Ericsson modems support a number of transports between modem +and host. Currently, UART and Loopback are available for Linux. + + +Architecture: +------------ +The implementation of CAIF is divided into: +* CAIF Socket Layer and GPRS IP Interface. +* CAIF Core Protocol Implementation +* CAIF Link Layer, implemented as NET devices. + + + RTNL + ! + ! +------+ +------+ + ! +------+! +------+! + ! ! IP !! !Socket!! + +-------> !interf!+ ! API !+ <- CAIF Client APIs + ! +------+ +------! + ! ! ! + ! +-----------+ + ! ! + ! +------+ <- CAIF Core Protocol + ! ! CAIF ! + ! ! Core ! + ! +------+ + ! +----------!---------+ + ! ! ! ! + ! +------+ +-----+ +------+ + +--> ! HSI ! ! TTY ! ! USB ! <- Link Layer (Net Devices) + +------+ +-----+ +------+ + + + +I M P L E M E N T A T I O N +=========================== + + +CAIF Core Protocol Layer +========================================= + +CAIF Core layer implements the CAIF protocol as defined by ST-Ericsson. +It implements the CAIF protocol stack in a layered approach, where +each layer described in the specification is implemented as a separate layer. +The architecture is inspired by the design patterns "Protocol Layer" and +"Protocol Packet". + +== CAIF structure == +The Core CAIF implementation contains: + - Simple implementation of CAIF. + - Layered architecture (a la Streams), each layer in the CAIF + specification is implemented in a separate c-file. + - Clients must call configuration function to add PHY layer. + - Clients must implement CAIF layer to consume/produce + CAIF payload with receive and transmit functions. + - Clients must call configuration function to add and connect the + Client layer. + - When receiving / transmitting CAIF Packets (cfpkt), ownership is passed + to the called function (except for framing layers' receive function) + +Layered Architecture +-------------------- +The CAIF protocol can be divided into two parts: Support functions and Protocol +Implementation. The support functions include: + + - CFPKT CAIF Packet. Implementation of CAIF Protocol Packet. The + CAIF Packet has functions for creating, destroying and adding content + and for adding/extracting header and trailers to protocol packets. + +The CAIF Protocol implementation contains: + + - CFCNFG CAIF Configuration layer. Configures the CAIF Protocol + Stack and provides a Client interface for adding Link-Layer and + Driver interfaces on top of the CAIF Stack. + + - CFCTRL CAIF Control layer. Encodes and Decodes control messages + such as enumeration and channel setup. Also matches request and + response messages. + + - CFSERVL General CAIF Service Layer functionality; handles flow + control and remote shutdown requests. + + - CFVEI CAIF VEI layer. Handles CAIF AT Channels on VEI (Virtual + External Interface). This layer encodes/decodes VEI frames. + + - CFDGML CAIF Datagram layer. Handles CAIF Datagram layer (IP + traffic), encodes/decodes Datagram frames. + + - CFMUX CAIF Mux layer. Handles multiplexing between multiple + physical bearers and multiple channels such as VEI, Datagram, etc. + The MUX keeps track of the existing CAIF Channels and + Physical Instances and selects the appropriate instance based + on Channel-Id and Physical-ID. + + - CFFRML CAIF Framing layer. Handles Framing i.e. Frame length + and frame checksum. + + - CFSERL CAIF Serial layer. Handles concatenation/split of frames + into CAIF Frames with correct length. + + + + +---------+ + | Config | + | CFCNFG | + +---------+ + ! + +---------+ +---------+ +---------+ + | AT | | Control | | Datagram| + | CFVEIL | | CFCTRL | | CFDGML | + +---------+ +---------+ +---------+ + \_____________!______________/ + ! + +---------+ + | MUX | + | | + +---------+ + _____!_____ + / \ + +---------+ +---------+ + | CFFRML | | CFFRML | + | Framing | | Framing | + +---------+ +---------+ + ! ! + +---------+ +---------+ + | | | Serial | + | | | CFSERL | + +---------+ +---------+ + + +In this layered approach the following "rules" apply. + - All layers embed the same structure "struct cflayer" + - A layer does not depend on any other layer's private data. + - Layers are stacked by setting the pointers + layer->up , layer->dn + - In order to send data upwards, each layer should do + layer->up->receive(layer->up, packet); + - In order to send data downwards, each layer should do + layer->dn->transmit(layer->dn, packet); + + +CAIF Socket and IP interface +=========================== + +The IP interface and CAIF socket API are implemented on top of the +CAIF Core protocol. The IP Interface and CAIF socket have an instance of +'struct cflayer', just like the CAIF Core protocol stack. +Net device and Socket implement the 'receive()' function defined by +'struct cflayer', just like the rest of the CAIF stack. In this way, transmit and +receive of packets is handled as by the rest of the layers: the 'dn->transmit()' +function is called in order to transmit data. + +Configuration of Link Layer +--------------------------- +The Link Layer is implemented as Linux network devices (struct net_device). +Payload handling and registration is done using standard Linux mechanisms. + +The CAIF Protocol relies on a loss-less link layer without implementing +retransmission. This implies that packet drops must not happen. +Therefore a flow-control mechanism is implemented where the physical +interface can initiate flow stop for all CAIF Channels. diff --git a/Documentation/networking/caif/README b/Documentation/networking/caif/README new file mode 100644 index 00000000000..757ccfaa138 --- /dev/null +++ b/Documentation/networking/caif/README @@ -0,0 +1,109 @@ +Copyright (C) ST-Ericsson AB 2010 +Author: Sjur Brendeland/ sjur.brandeland@stericsson.com +License terms: GNU General Public License (GPL) version 2 +--------------------------------------------------------- + +=== Start === +If you have compiled CAIF for modules do: + +$modprobe crc_ccitt +$modprobe caif +$modprobe caif_socket +$modprobe chnl_net + + +=== Preparing the setup with a STE modem === + +If you are working on integration of CAIF you should make sure +that the kernel is built with module support. + +There are some things that need to be tweaked to get the host TTY correctly +set up to talk to the modem. +Since the CAIF stack is running in the kernel and we want to use the existing +TTY, we are installing our physical serial driver as a line discipline above +the TTY device. + +To achieve this we need to install the N_CAIF ldisc from user space. +The benefit is that we can hook up to any TTY. + +The use of Start-of-frame-extension (STX) must also be set as +module parameter "ser_use_stx". + +Normally Frame Checksum is always used on UART, but this is also provided as a +module parameter "ser_use_fcs". + +$ modprobe caif_serial ser_ttyname=/dev/ttyS0 ser_use_stx=yes +$ ifconfig caif_ttyS0 up + +PLEASE NOTE: There is a limitation in Android shell. + It only accepts one argument to insmod/modprobe! + +=== Trouble shooting === + +There are debugfs parameters provided for serial communication. +/sys/kernel/debug/caif_serial/<tty-name>/ + +* ser_state: Prints the bit-mask status where + - 0x02 means SENDING, this is a transient state. + - 0x10 means FLOW_OFF_SENT, i.e. the previous frame has not been sent + and is blocking further send operation. Flow OFF has been propagated + to all CAIF Channels using this TTY. + +* tty_status: Prints the bit-mask tty status information + - 0x01 - tty->warned is on. + - 0x02 - tty->low_latency is on. + - 0x04 - tty->packed is on. + - 0x08 - tty->flow_stopped is on. + - 0x10 - tty->hw_stopped is on. + - 0x20 - tty->stopped is on. + +* last_tx_msg: Binary blob Prints the last transmitted frame. + This can be printed with + $od --format=x1 /sys/kernel/debug/caif_serial/<tty>/last_rx_msg. + The first two tx messages sent look like this. Note: The initial + byte 02 is start of frame extension (STX) used for re-syncing + upon errors. + + - Enumeration: + 0000000 02 05 00 00 03 01 d2 02 + | | | | | | + STX(1) | | | | + Length(2)| | | + Control Channel(1) + Command:Enumeration(1) + Link-ID(1) + Checksum(2) + - Channel Setup: + 0000000 02 07 00 00 00 21 a1 00 48 df + | | | | | | | | + STX(1) | | | | | | + Length(2)| | | | | + Control Channel(1) + Command:Channel Setup(1) + Channel Type(1) + Priority and Link-ID(1) + Endpoint(1) + Checksum(2) + +* last_rx_msg: Prints the last transmitted frame. + The RX messages for LinkSetup look almost identical but they have the + bit 0x20 set in the command bit, and Channel Setup has added one byte + before Checksum containing Channel ID. + NOTE: Several CAIF Messages might be concatenated. The maximum debug + buffer size is 128 bytes. + +== Error Scenarios: +- last_tx_msg contains channel setup message and last_rx_msg is empty -> + The host seems to be able to send over the UART, at least the CAIF ldisc get + notified that sending is completed. + +- last_tx_msg contains enumeration message and last_rx_msg is empty -> + The host is not able to send the message from UART, the tty has not been + able to complete the transmit operation. + +- if /sys/kernel/debug/caif_serial/<tty>/tty_status is non-zero there + might be problems transmitting over UART. + E.g. host and modem wiring is not correct you will typically see + tty_status = 0x10 (hw_stopped) and ser_state = 0x10 (FLOW_OFF_SENT). + You will probably see the enumeration message in last_tx_message + and empty last_rx_message. diff --git a/Documentation/networking/caif/spi_porting.txt b/Documentation/networking/caif/spi_porting.txt new file mode 100644 index 00000000000..9efd0687dc4 --- /dev/null +++ b/Documentation/networking/caif/spi_porting.txt @@ -0,0 +1,208 @@ +- CAIF SPI porting - + +- CAIF SPI basics: + +Running CAIF over SPI needs some extra setup, owing to the nature of SPI. +Two extra GPIOs have been added in order to negotiate the transfers + between the master and the slave. The minimum requirement for running +CAIF over SPI is a SPI slave chip and two GPIOs (more details below). +Please note that running as a slave implies that you need to keep up +with the master clock. An overrun or underrun event is fatal. + +- CAIF SPI framework: + +To make porting as easy as possible, the CAIF SPI has been divided in +two parts. The first part (called the interface part) deals with all +generic functionality such as length framing, SPI frame negotiation +and SPI frame delivery and transmission. The other part is the CAIF +SPI slave device part, which is the module that you have to write if +you want to run SPI CAIF on a new hardware. This part takes care of +the physical hardware, both with regard to SPI and to GPIOs. + +- Implementing a CAIF SPI device: + + - Functionality provided by the CAIF SPI slave device: + + In order to implement a SPI device you will, as a minimum, + need to implement the following + functions: + + int (*init_xfer) (struct cfspi_xfer * xfer, struct cfspi_dev *dev): + + This function is called by the CAIF SPI interface to give + you a chance to set up your hardware to be ready to receive + a stream of data from the master. The xfer structure contains + both physical and logical addresses, as well as the total length + of the transfer in both directions.The dev parameter can be used + to map to different CAIF SPI slave devices. + + void (*sig_xfer) (bool xfer, struct cfspi_dev *dev): + + This function is called by the CAIF SPI interface when the output + (SPI_INT) GPIO needs to change state. The boolean value of the xfer + variable indicates whether the GPIO should be asserted (HIGH) or + deasserted (LOW). The dev parameter can be used to map to different CAIF + SPI slave devices. + + - Functionality provided by the CAIF SPI interface: + + void (*ss_cb) (bool assert, struct cfspi_ifc *ifc); + + This function is called by the CAIF SPI slave device in order to + signal a change of state of the input GPIO (SS) to the interface. + Only active edges are mandatory to be reported. + This function can be called from IRQ context (recommended in order + not to introduce latency). The ifc parameter should be the pointer + returned from the platform probe function in the SPI device structure. + + void (*xfer_done_cb) (struct cfspi_ifc *ifc); + + This function is called by the CAIF SPI slave device in order to + report that a transfer is completed. This function should only be + called once both the transmission and the reception are completed. + This function can be called from IRQ context (recommended in order + not to introduce latency). The ifc parameter should be the pointer + returned from the platform probe function in the SPI device structure. + + - Connecting the bits and pieces: + + - Filling in the SPI slave device structure: + + Connect the necessary callback functions. + Indicate clock speed (used to calculate toggle delays). + Chose a suitable name (helps debugging if you use several CAIF + SPI slave devices). + Assign your private data (can be used to map to your structure). + + - Filling in the SPI slave platform device structure: + Add name of driver to connect to ("cfspi_sspi"). + Assign the SPI slave device structure as platform data. + +- Padding: + +In order to optimize throughput, a number of SPI padding options are provided. +Padding can be enabled independently for uplink and downlink transfers. +Padding can be enabled for the head, the tail and for the total frame size. +The padding needs to be correctly configured on both sides of the link. +The padding can be changed via module parameters in cfspi_sspi.c or via +the sysfs directory of the cfspi_sspi driver (before device registration). + +- CAIF SPI device template: + +/* + * Copyright (C) ST-Ericsson AB 2010 + * Author: Daniel Martensson / Daniel.Martensson@stericsson.com + * License terms: GNU General Public License (GPL), version 2. + * + */ + +#include <linux/init.h> +#include <linux/module.h> +#include <linux/device.h> +#include <linux/wait.h> +#include <linux/interrupt.h> +#include <linux/dma-mapping.h> +#include <net/caif/caif_spi.h> + +MODULE_LICENSE("GPL"); + +struct sspi_struct { + struct cfspi_dev sdev; + struct cfspi_xfer *xfer; +}; + +static struct sspi_struct slave; +static struct platform_device slave_device; + +static irqreturn_t sspi_irq(int irq, void *arg) +{ + /* You only need to trigger on an edge to the active state of the + * SS signal. Once a edge is detected, the ss_cb() function should be + * called with the parameter assert set to true. It is OK + * (and even advised) to call the ss_cb() function in IRQ context in + * order not to add any delay. */ + + return IRQ_HANDLED; +} + +static void sspi_complete(void *context) +{ + /* Normally the DMA or the SPI framework will call you back + * in something similar to this. The only thing you need to + * do is to call the xfer_done_cb() function, providing the pointer + * to the CAIF SPI interface. It is OK to call this function + * from IRQ context. */ +} + +static int sspi_init_xfer(struct cfspi_xfer *xfer, struct cfspi_dev *dev) +{ + /* Store transfer info. For a normal implementation you should + * set up your DMA here and make sure that you are ready to + * receive the data from the master SPI. */ + + struct sspi_struct *sspi = (struct sspi_struct *)dev->priv; + + sspi->xfer = xfer; + + return 0; +} + +void sspi_sig_xfer(bool xfer, struct cfspi_dev *dev) +{ + /* If xfer is true then you should assert the SPI_INT to indicate to + * the master that you are ready to receive the data from the master + * SPI. If xfer is false then you should de-assert SPI_INT to indicate + * that the transfer is done. + */ + + struct sspi_struct *sspi = (struct sspi_struct *)dev->priv; +} + +static void sspi_release(struct device *dev) +{ + /* + * Here you should release your SPI device resources. + */ +} + +static int __init sspi_init(void) +{ + /* Here you should initialize your SPI device by providing the + * necessary functions, clock speed, name and private data. Once + * done, you can register your device with the + * platform_device_register() function. This function will return + * with the CAIF SPI interface initialized. This is probably also + * the place where you should set up your GPIOs, interrupts and SPI + * resources. */ + + int res = 0; + + /* Initialize slave device. */ + slave.sdev.init_xfer = sspi_init_xfer; + slave.sdev.sig_xfer = sspi_sig_xfer; + slave.sdev.clk_mhz = 13; + slave.sdev.priv = &slave; + slave.sdev.name = "spi_sspi"; + slave_device.dev.release = sspi_release; + + /* Initialize platform device. */ + slave_device.name = "cfspi_sspi"; + slave_device.dev.platform_data = &slave.sdev; + + /* Register platform device. */ + res = platform_device_register(&slave_device); + if (res) { + printk(KERN_WARNING "sspi_init: failed to register dev.\n"); + return -ENODEV; + } + + return res; +} + +static void __exit sspi_exit(void) +{ + platform_device_del(&slave_device); +} + +module_init(sspi_init); +module_exit(sspi_exit); diff --git a/Documentation/networking/can.txt b/Documentation/networking/can.txt new file mode 100644 index 00000000000..2236d6dcb7d --- /dev/null +++ b/Documentation/networking/can.txt @@ -0,0 +1,1200 @@ +============================================================================ + +can.txt + +Readme file for the Controller Area Network Protocol Family (aka SocketCAN) + +This file contains + + 1 Overview / What is SocketCAN + + 2 Motivation / Why using the socket API + + 3 SocketCAN concept + 3.1 receive lists + 3.2 local loopback of sent frames + 3.3 network problem notifications + + 4 How to use SocketCAN + 4.1 RAW protocol sockets with can_filters (SOCK_RAW) + 4.1.1 RAW socket option CAN_RAW_FILTER + 4.1.2 RAW socket option CAN_RAW_ERR_FILTER + 4.1.3 RAW socket option CAN_RAW_LOOPBACK + 4.1.4 RAW socket option CAN_RAW_RECV_OWN_MSGS + 4.1.5 RAW socket option CAN_RAW_FD_FRAMES + 4.1.6 RAW socket returned message flags + 4.2 Broadcast Manager protocol sockets (SOCK_DGRAM) + 4.2.1 Broadcast Manager operations + 4.2.2 Broadcast Manager message flags + 4.2.3 Broadcast Manager transmission timers + 4.2.4 Broadcast Manager message sequence transmission + 4.2.5 Broadcast Manager receive filter timers + 4.2.6 Broadcast Manager multiplex message receive filter + 4.3 connected transport protocols (SOCK_SEQPACKET) + 4.4 unconnected transport protocols (SOCK_DGRAM) + + 5 SocketCAN core module + 5.1 can.ko module params + 5.2 procfs content + 5.3 writing own CAN protocol modules + + 6 CAN network drivers + 6.1 general settings + 6.2 local loopback of sent frames + 6.3 CAN controller hardware filters + 6.4 The virtual CAN driver (vcan) + 6.5 The CAN network device driver interface + 6.5.1 Netlink interface to set/get devices properties + 6.5.2 Setting the CAN bit-timing + 6.5.3 Starting and stopping the CAN network device + 6.6 CAN FD (flexible data rate) driver support + 6.7 supported CAN hardware + + 7 SocketCAN resources + + 8 Credits + +============================================================================ + +1. Overview / What is SocketCAN +-------------------------------- + +The socketcan package is an implementation of CAN protocols +(Controller Area Network) for Linux. CAN is a networking technology +which has widespread use in automation, embedded devices, and +automotive fields. While there have been other CAN implementations +for Linux based on character devices, SocketCAN uses the Berkeley +socket API, the Linux network stack and implements the CAN device +drivers as network interfaces. The CAN socket API has been designed +as similar as possible to the TCP/IP protocols to allow programmers, +familiar with network programming, to easily learn how to use CAN +sockets. + +2. Motivation / Why using the socket API +---------------------------------------- + +There have been CAN implementations for Linux before SocketCAN so the +question arises, why we have started another project. Most existing +implementations come as a device driver for some CAN hardware, they +are based on character devices and provide comparatively little +functionality. Usually, there is only a hardware-specific device +driver which provides a character device interface to send and +receive raw CAN frames, directly to/from the controller hardware. +Queueing of frames and higher-level transport protocols like ISO-TP +have to be implemented in user space applications. Also, most +character-device implementations support only one single process to +open the device at a time, similar to a serial interface. Exchanging +the CAN controller requires employment of another device driver and +often the need for adaption of large parts of the application to the +new driver's API. + +SocketCAN was designed to overcome all of these limitations. A new +protocol family has been implemented which provides a socket interface +to user space applications and which builds upon the Linux network +layer, enabling use all of the provided queueing functionality. A device +driver for CAN controller hardware registers itself with the Linux +network layer as a network device, so that CAN frames from the +controller can be passed up to the network layer and on to the CAN +protocol family module and also vice-versa. Also, the protocol family +module provides an API for transport protocol modules to register, so +that any number of transport protocols can be loaded or unloaded +dynamically. In fact, the can core module alone does not provide any +protocol and cannot be used without loading at least one additional +protocol module. Multiple sockets can be opened at the same time, +on different or the same protocol module and they can listen/send +frames on different or the same CAN IDs. Several sockets listening on +the same interface for frames with the same CAN ID are all passed the +same received matching CAN frames. An application wishing to +communicate using a specific transport protocol, e.g. ISO-TP, just +selects that protocol when opening the socket, and then can read and +write application data byte streams, without having to deal with +CAN-IDs, frames, etc. + +Similar functionality visible from user-space could be provided by a +character device, too, but this would lead to a technically inelegant +solution for a couple of reasons: + +* Intricate usage. Instead of passing a protocol argument to + socket(2) and using bind(2) to select a CAN interface and CAN ID, an + application would have to do all these operations using ioctl(2)s. + +* Code duplication. A character device cannot make use of the Linux + network queueing code, so all that code would have to be duplicated + for CAN networking. + +* Abstraction. In most existing character-device implementations, the + hardware-specific device driver for a CAN controller directly + provides the character device for the application to work with. + This is at least very unusual in Unix systems for both, char and + block devices. For example you don't have a character device for a + certain UART of a serial interface, a certain sound chip in your + computer, a SCSI or IDE controller providing access to your hard + disk or tape streamer device. Instead, you have abstraction layers + which provide a unified character or block device interface to the + application on the one hand, and a interface for hardware-specific + device drivers on the other hand. These abstractions are provided + by subsystems like the tty layer, the audio subsystem or the SCSI + and IDE subsystems for the devices mentioned above. + + The easiest way to implement a CAN device driver is as a character + device without such a (complete) abstraction layer, as is done by most + existing drivers. The right way, however, would be to add such a + layer with all the functionality like registering for certain CAN + IDs, supporting several open file descriptors and (de)multiplexing + CAN frames between them, (sophisticated) queueing of CAN frames, and + providing an API for device drivers to register with. However, then + it would be no more difficult, or may be even easier, to use the + networking framework provided by the Linux kernel, and this is what + SocketCAN does. + + The use of the networking framework of the Linux kernel is just the + natural and most appropriate way to implement CAN for Linux. + +3. SocketCAN concept +--------------------- + + As described in chapter 2 it is the main goal of SocketCAN to + provide a socket interface to user space applications which builds + upon the Linux network layer. In contrast to the commonly known + TCP/IP and ethernet networking, the CAN bus is a broadcast-only(!) + medium that has no MAC-layer addressing like ethernet. The CAN-identifier + (can_id) is used for arbitration on the CAN-bus. Therefore the CAN-IDs + have to be chosen uniquely on the bus. When designing a CAN-ECU + network the CAN-IDs are mapped to be sent by a specific ECU. + For this reason a CAN-ID can be treated best as a kind of source address. + + 3.1 receive lists + + The network transparent access of multiple applications leads to the + problem that different applications may be interested in the same + CAN-IDs from the same CAN network interface. The SocketCAN core + module - which implements the protocol family CAN - provides several + high efficient receive lists for this reason. If e.g. a user space + application opens a CAN RAW socket, the raw protocol module itself + requests the (range of) CAN-IDs from the SocketCAN core that are + requested by the user. The subscription and unsubscription of + CAN-IDs can be done for specific CAN interfaces or for all(!) known + CAN interfaces with the can_rx_(un)register() functions provided to + CAN protocol modules by the SocketCAN core (see chapter 5). + To optimize the CPU usage at runtime the receive lists are split up + into several specific lists per device that match the requested + filter complexity for a given use-case. + + 3.2 local loopback of sent frames + + As known from other networking concepts the data exchanging + applications may run on the same or different nodes without any + change (except for the according addressing information): + + ___ ___ ___ _______ ___ + | _ | | _ | | _ | | _ _ | | _ | + ||A|| ||B|| ||C|| ||A| |B|| ||C|| + |___| |___| |___| |_______| |___| + | | | | | + -----------------(1)- CAN bus -(2)--------------- + + To ensure that application A receives the same information in the + example (2) as it would receive in example (1) there is need for + some kind of local loopback of the sent CAN frames on the appropriate + node. + + The Linux network devices (by default) just can handle the + transmission and reception of media dependent frames. Due to the + arbitration on the CAN bus the transmission of a low prio CAN-ID + may be delayed by the reception of a high prio CAN frame. To + reflect the correct* traffic on the node the loopback of the sent + data has to be performed right after a successful transmission. If + the CAN network interface is not capable of performing the loopback for + some reason the SocketCAN core can do this task as a fallback solution. + See chapter 6.2 for details (recommended). + + The loopback functionality is enabled by default to reflect standard + networking behaviour for CAN applications. Due to some requests from + the RT-SocketCAN group the loopback optionally may be disabled for each + separate socket. See sockopts from the CAN RAW sockets in chapter 4.1. + + * = you really like to have this when you're running analyser tools + like 'candump' or 'cansniffer' on the (same) node. + + 3.3 network problem notifications + + The use of the CAN bus may lead to several problems on the physical + and media access control layer. Detecting and logging of these lower + layer problems is a vital requirement for CAN users to identify + hardware issues on the physical transceiver layer as well as + arbitration problems and error frames caused by the different + ECUs. The occurrence of detected errors are important for diagnosis + and have to be logged together with the exact timestamp. For this + reason the CAN interface driver can generate so called Error Message + Frames that can optionally be passed to the user application in the + same way as other CAN frames. Whenever an error on the physical layer + or the MAC layer is detected (e.g. by the CAN controller) the driver + creates an appropriate error message frame. Error messages frames can + be requested by the user application using the common CAN filter + mechanisms. Inside this filter definition the (interested) type of + errors may be selected. The reception of error messages is disabled + by default. The format of the CAN error message frame is briefly + described in the Linux header file "include/linux/can/error.h". + +4. How to use SocketCAN +------------------------ + + Like TCP/IP, you first need to open a socket for communicating over a + CAN network. Since SocketCAN implements a new protocol family, you + need to pass PF_CAN as the first argument to the socket(2) system + call. Currently, there are two CAN protocols to choose from, the raw + socket protocol and the broadcast manager (BCM). So to open a socket, + you would write + + s = socket(PF_CAN, SOCK_RAW, CAN_RAW); + + and + + s = socket(PF_CAN, SOCK_DGRAM, CAN_BCM); + + respectively. After the successful creation of the socket, you would + normally use the bind(2) system call to bind the socket to a CAN + interface (which is different from TCP/IP due to different addressing + - see chapter 3). After binding (CAN_RAW) or connecting (CAN_BCM) + the socket, you can read(2) and write(2) from/to the socket or use + send(2), sendto(2), sendmsg(2) and the recv* counterpart operations + on the socket as usual. There are also CAN specific socket options + described below. + + The basic CAN frame structure and the sockaddr structure are defined + in include/linux/can.h: + + struct can_frame { + canid_t can_id; /* 32 bit CAN_ID + EFF/RTR/ERR flags */ + __u8 can_dlc; /* frame payload length in byte (0 .. 8) */ + __u8 data[8] __attribute__((aligned(8))); + }; + + The alignment of the (linear) payload data[] to a 64bit boundary + allows the user to define their own structs and unions to easily access + the CAN payload. There is no given byteorder on the CAN bus by + default. A read(2) system call on a CAN_RAW socket transfers a + struct can_frame to the user space. + + The sockaddr_can structure has an interface index like the + PF_PACKET socket, that also binds to a specific interface: + + struct sockaddr_can { + sa_family_t can_family; + int can_ifindex; + union { + /* transport protocol class address info (e.g. ISOTP) */ + struct { canid_t rx_id, tx_id; } tp; + + /* reserved for future CAN protocols address information */ + } can_addr; + }; + + To determine the interface index an appropriate ioctl() has to + be used (example for CAN_RAW sockets without error checking): + + int s; + struct sockaddr_can addr; + struct ifreq ifr; + + s = socket(PF_CAN, SOCK_RAW, CAN_RAW); + + strcpy(ifr.ifr_name, "can0" ); + ioctl(s, SIOCGIFINDEX, &ifr); + + addr.can_family = AF_CAN; + addr.can_ifindex = ifr.ifr_ifindex; + + bind(s, (struct sockaddr *)&addr, sizeof(addr)); + + (..) + + To bind a socket to all(!) CAN interfaces the interface index must + be 0 (zero). In this case the socket receives CAN frames from every + enabled CAN interface. To determine the originating CAN interface + the system call recvfrom(2) may be used instead of read(2). To send + on a socket that is bound to 'any' interface sendto(2) is needed to + specify the outgoing interface. + + Reading CAN frames from a bound CAN_RAW socket (see above) consists + of reading a struct can_frame: + + struct can_frame frame; + + nbytes = read(s, &frame, sizeof(struct can_frame)); + + if (nbytes < 0) { + perror("can raw socket read"); + return 1; + } + + /* paranoid check ... */ + if (nbytes < sizeof(struct can_frame)) { + fprintf(stderr, "read: incomplete CAN frame\n"); + return 1; + } + + /* do something with the received CAN frame */ + + Writing CAN frames can be done similarly, with the write(2) system call: + + nbytes = write(s, &frame, sizeof(struct can_frame)); + + When the CAN interface is bound to 'any' existing CAN interface + (addr.can_ifindex = 0) it is recommended to use recvfrom(2) if the + information about the originating CAN interface is needed: + + struct sockaddr_can addr; + struct ifreq ifr; + socklen_t len = sizeof(addr); + struct can_frame frame; + + nbytes = recvfrom(s, &frame, sizeof(struct can_frame), + 0, (struct sockaddr*)&addr, &len); + + /* get interface name of the received CAN frame */ + ifr.ifr_ifindex = addr.can_ifindex; + ioctl(s, SIOCGIFNAME, &ifr); + printf("Received a CAN frame from interface %s", ifr.ifr_name); + + To write CAN frames on sockets bound to 'any' CAN interface the + outgoing interface has to be defined certainly. + + strcpy(ifr.ifr_name, "can0"); + ioctl(s, SIOCGIFINDEX, &ifr); + addr.can_ifindex = ifr.ifr_ifindex; + addr.can_family = AF_CAN; + + nbytes = sendto(s, &frame, sizeof(struct can_frame), + 0, (struct sockaddr*)&addr, sizeof(addr)); + + Remark about CAN FD (flexible data rate) support: + + Generally the handling of CAN FD is very similar to the formerly described + examples. The new CAN FD capable CAN controllers support two different + bitrates for the arbitration phase and the payload phase of the CAN FD frame + and up to 64 bytes of payload. This extended payload length breaks all the + kernel interfaces (ABI) which heavily rely on the CAN frame with fixed eight + bytes of payload (struct can_frame) like the CAN_RAW socket. Therefore e.g. + the CAN_RAW socket supports a new socket option CAN_RAW_FD_FRAMES that + switches the socket into a mode that allows the handling of CAN FD frames + and (legacy) CAN frames simultaneously (see section 4.1.5). + + The struct canfd_frame is defined in include/linux/can.h: + + struct canfd_frame { + canid_t can_id; /* 32 bit CAN_ID + EFF/RTR/ERR flags */ + __u8 len; /* frame payload length in byte (0 .. 64) */ + __u8 flags; /* additional flags for CAN FD */ + __u8 __res0; /* reserved / padding */ + __u8 __res1; /* reserved / padding */ + __u8 data[64] __attribute__((aligned(8))); + }; + + The struct canfd_frame and the existing struct can_frame have the can_id, + the payload length and the payload data at the same offset inside their + structures. This allows to handle the different structures very similar. + When the content of a struct can_frame is copied into a struct canfd_frame + all structure elements can be used as-is - only the data[] becomes extended. + + When introducing the struct canfd_frame it turned out that the data length + code (DLC) of the struct can_frame was used as a length information as the + length and the DLC has a 1:1 mapping in the range of 0 .. 8. To preserve + the easy handling of the length information the canfd_frame.len element + contains a plain length value from 0 .. 64. So both canfd_frame.len and + can_frame.can_dlc are equal and contain a length information and no DLC. + For details about the distinction of CAN and CAN FD capable devices and + the mapping to the bus-relevant data length code (DLC), see chapter 6.6. + + The length of the two CAN(FD) frame structures define the maximum transfer + unit (MTU) of the CAN(FD) network interface and skbuff data length. Two + definitions are specified for CAN specific MTUs in include/linux/can.h : + + #define CAN_MTU (sizeof(struct can_frame)) == 16 => 'legacy' CAN frame + #define CANFD_MTU (sizeof(struct canfd_frame)) == 72 => CAN FD frame + + 4.1 RAW protocol sockets with can_filters (SOCK_RAW) + + Using CAN_RAW sockets is extensively comparable to the commonly + known access to CAN character devices. To meet the new possibilities + provided by the multi user SocketCAN approach, some reasonable + defaults are set at RAW socket binding time: + + - The filters are set to exactly one filter receiving everything + - The socket only receives valid data frames (=> no error message frames) + - The loopback of sent CAN frames is enabled (see chapter 3.2) + - The socket does not receive its own sent frames (in loopback mode) + + These default settings may be changed before or after binding the socket. + To use the referenced definitions of the socket options for CAN_RAW + sockets, include <linux/can/raw.h>. + + 4.1.1 RAW socket option CAN_RAW_FILTER + + The reception of CAN frames using CAN_RAW sockets can be controlled + by defining 0 .. n filters with the CAN_RAW_FILTER socket option. + + The CAN filter structure is defined in include/linux/can.h: + + struct can_filter { + canid_t can_id; + canid_t can_mask; + }; + + A filter matches, when + + <received_can_id> & mask == can_id & mask + + which is analogous to known CAN controllers hardware filter semantics. + The filter can be inverted in this semantic, when the CAN_INV_FILTER + bit is set in can_id element of the can_filter structure. In + contrast to CAN controller hardware filters the user may set 0 .. n + receive filters for each open socket separately: + + struct can_filter rfilter[2]; + + rfilter[0].can_id = 0x123; + rfilter[0].can_mask = CAN_SFF_MASK; + rfilter[1].can_id = 0x200; + rfilter[1].can_mask = 0x700; + + setsockopt(s, SOL_CAN_RAW, CAN_RAW_FILTER, &rfilter, sizeof(rfilter)); + + To disable the reception of CAN frames on the selected CAN_RAW socket: + + setsockopt(s, SOL_CAN_RAW, CAN_RAW_FILTER, NULL, 0); + + To set the filters to zero filters is quite obsolete as to not read + data causes the raw socket to discard the received CAN frames. But + having this 'send only' use-case we may remove the receive list in the + Kernel to save a little (really a very little!) CPU usage. + + 4.1.1.1 CAN filter usage optimisation + + The CAN filters are processed in per-device filter lists at CAN frame + reception time. To reduce the number of checks that need to be performed + while walking through the filter lists the CAN core provides an optimized + filter handling when the filter subscription focusses on a single CAN ID. + + For the possible 2048 SFF CAN identifiers the identifier is used as an index + to access the corresponding subscription list without any further checks. + For the 2^29 possible EFF CAN identifiers a 10 bit XOR folding is used as + hash function to retrieve the EFF table index. + + To benefit from the optimized filters for single CAN identifiers the + CAN_SFF_MASK or CAN_EFF_MASK have to be set into can_filter.mask together + with set CAN_EFF_FLAG and CAN_RTR_FLAG bits. A set CAN_EFF_FLAG bit in the + can_filter.mask makes clear that it matters whether a SFF or EFF CAN ID is + subscribed. E.g. in the example from above + + rfilter[0].can_id = 0x123; + rfilter[0].can_mask = CAN_SFF_MASK; + + both SFF frames with CAN ID 0x123 and EFF frames with 0xXXXXX123 can pass. + + To filter for only 0x123 (SFF) and 0x12345678 (EFF) CAN identifiers the + filter has to be defined in this way to benefit from the optimized filters: + + struct can_filter rfilter[2]; + + rfilter[0].can_id = 0x123; + rfilter[0].can_mask = (CAN_EFF_FLAG | CAN_RTR_FLAG | CAN_SFF_MASK); + rfilter[1].can_id = 0x12345678 | CAN_EFF_FLAG; + rfilter[1].can_mask = (CAN_EFF_FLAG | CAN_RTR_FLAG | CAN_EFF_MASK); + + setsockopt(s, SOL_CAN_RAW, CAN_RAW_FILTER, &rfilter, sizeof(rfilter)); + + 4.1.2 RAW socket option CAN_RAW_ERR_FILTER + + As described in chapter 3.4 the CAN interface driver can generate so + called Error Message Frames that can optionally be passed to the user + application in the same way as other CAN frames. The possible + errors are divided into different error classes that may be filtered + using the appropriate error mask. To register for every possible + error condition CAN_ERR_MASK can be used as value for the error mask. + The values for the error mask are defined in linux/can/error.h . + + can_err_mask_t err_mask = ( CAN_ERR_TX_TIMEOUT | CAN_ERR_BUSOFF ); + + setsockopt(s, SOL_CAN_RAW, CAN_RAW_ERR_FILTER, + &err_mask, sizeof(err_mask)); + + 4.1.3 RAW socket option CAN_RAW_LOOPBACK + + To meet multi user needs the local loopback is enabled by default + (see chapter 3.2 for details). But in some embedded use-cases + (e.g. when only one application uses the CAN bus) this loopback + functionality can be disabled (separately for each socket): + + int loopback = 0; /* 0 = disabled, 1 = enabled (default) */ + + setsockopt(s, SOL_CAN_RAW, CAN_RAW_LOOPBACK, &loopback, sizeof(loopback)); + + 4.1.4 RAW socket option CAN_RAW_RECV_OWN_MSGS + + When the local loopback is enabled, all the sent CAN frames are + looped back to the open CAN sockets that registered for the CAN + frames' CAN-ID on this given interface to meet the multi user + needs. The reception of the CAN frames on the same socket that was + sending the CAN frame is assumed to be unwanted and therefore + disabled by default. This default behaviour may be changed on + demand: + + int recv_own_msgs = 1; /* 0 = disabled (default), 1 = enabled */ + + setsockopt(s, SOL_CAN_RAW, CAN_RAW_RECV_OWN_MSGS, + &recv_own_msgs, sizeof(recv_own_msgs)); + + 4.1.5 RAW socket option CAN_RAW_FD_FRAMES + + CAN FD support in CAN_RAW sockets can be enabled with a new socket option + CAN_RAW_FD_FRAMES which is off by default. When the new socket option is + not supported by the CAN_RAW socket (e.g. on older kernels), switching the + CAN_RAW_FD_FRAMES option returns the error -ENOPROTOOPT. + + Once CAN_RAW_FD_FRAMES is enabled the application can send both CAN frames + and CAN FD frames. OTOH the application has to handle CAN and CAN FD frames + when reading from the socket. + + CAN_RAW_FD_FRAMES enabled: CAN_MTU and CANFD_MTU are allowed + CAN_RAW_FD_FRAMES disabled: only CAN_MTU is allowed (default) + + Example: + [ remember: CANFD_MTU == sizeof(struct canfd_frame) ] + + struct canfd_frame cfd; + + nbytes = read(s, &cfd, CANFD_MTU); + + if (nbytes == CANFD_MTU) { + printf("got CAN FD frame with length %d\n", cfd.len); + /* cfd.flags contains valid data */ + } else if (nbytes == CAN_MTU) { + printf("got legacy CAN frame with length %d\n", cfd.len); + /* cfd.flags is undefined */ + } else { + fprintf(stderr, "read: invalid CAN(FD) frame\n"); + return 1; + } + + /* the content can be handled independently from the received MTU size */ + + printf("can_id: %X data length: %d data: ", cfd.can_id, cfd.len); + for (i = 0; i < cfd.len; i++) + printf("%02X ", cfd.data[i]); + + When reading with size CANFD_MTU only returns CAN_MTU bytes that have + been received from the socket a legacy CAN frame has been read into the + provided CAN FD structure. Note that the canfd_frame.flags data field is + not specified in the struct can_frame and therefore it is only valid in + CANFD_MTU sized CAN FD frames. + + Implementation hint for new CAN applications: + + To build a CAN FD aware application use struct canfd_frame as basic CAN + data structure for CAN_RAW based applications. When the application is + executed on an older Linux kernel and switching the CAN_RAW_FD_FRAMES + socket option returns an error: No problem. You'll get legacy CAN frames + or CAN FD frames and can process them the same way. + + When sending to CAN devices make sure that the device is capable to handle + CAN FD frames by checking if the device maximum transfer unit is CANFD_MTU. + The CAN device MTU can be retrieved e.g. with a SIOCGIFMTU ioctl() syscall. + + 4.1.6 RAW socket returned message flags + + When using recvmsg() call, the msg->msg_flags may contain following flags: + + MSG_DONTROUTE: set when the received frame was created on the local host. + + MSG_CONFIRM: set when the frame was sent via the socket it is received on. + This flag can be interpreted as a 'transmission confirmation' when the + CAN driver supports the echo of frames on driver level, see 3.2 and 6.2. + In order to receive such messages, CAN_RAW_RECV_OWN_MSGS must be set. + + 4.2 Broadcast Manager protocol sockets (SOCK_DGRAM) + + The Broadcast Manager protocol provides a command based configuration + interface to filter and send (e.g. cyclic) CAN messages in kernel space. + + Receive filters can be used to down sample frequent messages; detect events + such as message contents changes, packet length changes, and do time-out + monitoring of received messages. + + Periodic transmission tasks of CAN frames or a sequence of CAN frames can be + created and modified at runtime; both the message content and the two + possible transmit intervals can be altered. + + A BCM socket is not intended for sending individual CAN frames using the + struct can_frame as known from the CAN_RAW socket. Instead a special BCM + configuration message is defined. The basic BCM configuration message used + to communicate with the broadcast manager and the available operations are + defined in the linux/can/bcm.h include. The BCM message consists of a + message header with a command ('opcode') followed by zero or more CAN frames. + The broadcast manager sends responses to user space in the same form: + + struct bcm_msg_head { + __u32 opcode; /* command */ + __u32 flags; /* special flags */ + __u32 count; /* run 'count' times with ival1 */ + struct timeval ival1, ival2; /* count and subsequent interval */ + canid_t can_id; /* unique can_id for task */ + __u32 nframes; /* number of can_frames following */ + struct can_frame frames[0]; + }; + + The aligned payload 'frames' uses the same basic CAN frame structure defined + at the beginning of section 4 and in the include/linux/can.h include. All + messages to the broadcast manager from user space have this structure. + + Note a CAN_BCM socket must be connected instead of bound after socket + creation (example without error checking): + + int s; + struct sockaddr_can addr; + struct ifreq ifr; + + s = socket(PF_CAN, SOCK_DGRAM, CAN_BCM); + + strcpy(ifr.ifr_name, "can0"); + ioctl(s, SIOCGIFINDEX, &ifr); + + addr.can_family = AF_CAN; + addr.can_ifindex = ifr.ifr_ifindex; + + connect(s, (struct sockaddr *)&addr, sizeof(addr)) + + (..) + + The broadcast manager socket is able to handle any number of in flight + transmissions or receive filters concurrently. The different RX/TX jobs are + distinguished by the unique can_id in each BCM message. However additional + CAN_BCM sockets are recommended to communicate on multiple CAN interfaces. + When the broadcast manager socket is bound to 'any' CAN interface (=> the + interface index is set to zero) the configured receive filters apply to any + CAN interface unless the sendto() syscall is used to overrule the 'any' CAN + interface index. When using recvfrom() instead of read() to retrieve BCM + socket messages the originating CAN interface is provided in can_ifindex. + + 4.2.1 Broadcast Manager operations + + The opcode defines the operation for the broadcast manager to carry out, + or details the broadcast managers response to several events, including + user requests. + + Transmit Operations (user space to broadcast manager): + + TX_SETUP: Create (cyclic) transmission task. + + TX_DELETE: Remove (cyclic) transmission task, requires only can_id. + + TX_READ: Read properties of (cyclic) transmission task for can_id. + + TX_SEND: Send one CAN frame. + + Transmit Responses (broadcast manager to user space): + + TX_STATUS: Reply to TX_READ request (transmission task configuration). + + TX_EXPIRED: Notification when counter finishes sending at initial interval + 'ival1'. Requires the TX_COUNTEVT flag to be set at TX_SETUP. + + Receive Operations (user space to broadcast manager): + + RX_SETUP: Create RX content filter subscription. + + RX_DELETE: Remove RX content filter subscription, requires only can_id. + + RX_READ: Read properties of RX content filter subscription for can_id. + + Receive Responses (broadcast manager to user space): + + RX_STATUS: Reply to RX_READ request (filter task configuration). + + RX_TIMEOUT: Cyclic message is detected to be absent (timer ival1 expired). + + RX_CHANGED: BCM message with updated CAN frame (detected content change). + Sent on first message received or on receipt of revised CAN messages. + + 4.2.2 Broadcast Manager message flags + + When sending a message to the broadcast manager the 'flags' element may + contain the following flag definitions which influence the behaviour: + + SETTIMER: Set the values of ival1, ival2 and count + + STARTTIMER: Start the timer with the actual values of ival1, ival2 + and count. Starting the timer leads simultaneously to emit a CAN frame. + + TX_COUNTEVT: Create the message TX_EXPIRED when count expires + + TX_ANNOUNCE: A change of data by the process is emitted immediately. + + TX_CP_CAN_ID: Copies the can_id from the message header to each + subsequent frame in frames. This is intended as usage simplification. For + TX tasks the unique can_id from the message header may differ from the + can_id(s) stored for transmission in the subsequent struct can_frame(s). + + RX_FILTER_ID: Filter by can_id alone, no frames required (nframes=0). + + RX_CHECK_DLC: A change of the DLC leads to an RX_CHANGED. + + RX_NO_AUTOTIMER: Prevent automatically starting the timeout monitor. + + RX_ANNOUNCE_RESUME: If passed at RX_SETUP and a receive timeout occurred, a + RX_CHANGED message will be generated when the (cyclic) receive restarts. + + TX_RESET_MULTI_IDX: Reset the index for the multiple frame transmission. + + RX_RTR_FRAME: Send reply for RTR-request (placed in op->frames[0]). + + 4.2.3 Broadcast Manager transmission timers + + Periodic transmission configurations may use up to two interval timers. + In this case the BCM sends a number of messages ('count') at an interval + 'ival1', then continuing to send at another given interval 'ival2'. When + only one timer is needed 'count' is set to zero and only 'ival2' is used. + When SET_TIMER and START_TIMER flag were set the timers are activated. + The timer values can be altered at runtime when only SET_TIMER is set. + + 4.2.4 Broadcast Manager message sequence transmission + + Up to 256 CAN frames can be transmitted in a sequence in the case of a cyclic + TX task configuration. The number of CAN frames is provided in the 'nframes' + element of the BCM message head. The defined number of CAN frames are added + as array to the TX_SETUP BCM configuration message. + + /* create a struct to set up a sequence of four CAN frames */ + struct { + struct bcm_msg_head msg_head; + struct can_frame frame[4]; + } mytxmsg; + + (..) + mytxmsg.nframes = 4; + (..) + + write(s, &mytxmsg, sizeof(mytxmsg)); + + With every transmission the index in the array of CAN frames is increased + and set to zero at index overflow. + + 4.2.5 Broadcast Manager receive filter timers + + The timer values ival1 or ival2 may be set to non-zero values at RX_SETUP. + When the SET_TIMER flag is set the timers are enabled: + + ival1: Send RX_TIMEOUT when a received message is not received again within + the given time. When START_TIMER is set at RX_SETUP the timeout detection + is activated directly - even without a former CAN frame reception. + + ival2: Throttle the received message rate down to the value of ival2. This + is useful to reduce messages for the application when the signal inside the + CAN frame is stateless as state changes within the ival2 periode may get + lost. + + 4.2.6 Broadcast Manager multiplex message receive filter + + To filter for content changes in multiplex message sequences an array of more + than one CAN frames can be passed in a RX_SETUP configuration message. The + data bytes of the first CAN frame contain the mask of relevant bits that + have to match in the subsequent CAN frames with the received CAN frame. + If one of the subsequent CAN frames is matching the bits in that frame data + mark the relevant content to be compared with the previous received content. + Up to 257 CAN frames (multiplex filter bit mask CAN frame plus 256 CAN + filters) can be added as array to the TX_SETUP BCM configuration message. + + /* usually used to clear CAN frame data[] - beware of endian problems! */ + #define U64_DATA(p) (*(unsigned long long*)(p)->data) + + struct { + struct bcm_msg_head msg_head; + struct can_frame frame[5]; + } msg; + + msg.msg_head.opcode = RX_SETUP; + msg.msg_head.can_id = 0x42; + msg.msg_head.flags = 0; + msg.msg_head.nframes = 5; + U64_DATA(&msg.frame[0]) = 0xFF00000000000000ULL; /* MUX mask */ + U64_DATA(&msg.frame[1]) = 0x01000000000000FFULL; /* data mask (MUX 0x01) */ + U64_DATA(&msg.frame[2]) = 0x0200FFFF000000FFULL; /* data mask (MUX 0x02) */ + U64_DATA(&msg.frame[3]) = 0x330000FFFFFF0003ULL; /* data mask (MUX 0x33) */ + U64_DATA(&msg.frame[4]) = 0x4F07FC0FF0000000ULL; /* data mask (MUX 0x4F) */ + + write(s, &msg, sizeof(msg)); + + 4.3 connected transport protocols (SOCK_SEQPACKET) + 4.4 unconnected transport protocols (SOCK_DGRAM) + + +5. SocketCAN core module +------------------------- + + The SocketCAN core module implements the protocol family + PF_CAN. CAN protocol modules are loaded by the core module at + runtime. The core module provides an interface for CAN protocol + modules to subscribe needed CAN IDs (see chapter 3.1). + + 5.1 can.ko module params + + - stats_timer: To calculate the SocketCAN core statistics + (e.g. current/maximum frames per second) this 1 second timer is + invoked at can.ko module start time by default. This timer can be + disabled by using stattimer=0 on the module commandline. + + - debug: (removed since SocketCAN SVN r546) + + 5.2 procfs content + + As described in chapter 3.1 the SocketCAN core uses several filter + lists to deliver received CAN frames to CAN protocol modules. These + receive lists, their filters and the count of filter matches can be + checked in the appropriate receive list. All entries contain the + device and a protocol module identifier: + + foo@bar:~$ cat /proc/net/can/rcvlist_all + + receive list 'rx_all': + (vcan3: no entry) + (vcan2: no entry) + (vcan1: no entry) + device can_id can_mask function userdata matches ident + vcan0 000 00000000 f88e6370 f6c6f400 0 raw + (any: no entry) + + In this example an application requests any CAN traffic from vcan0. + + rcvlist_all - list for unfiltered entries (no filter operations) + rcvlist_eff - list for single extended frame (EFF) entries + rcvlist_err - list for error message frames masks + rcvlist_fil - list for mask/value filters + rcvlist_inv - list for mask/value filters (inverse semantic) + rcvlist_sff - list for single standard frame (SFF) entries + + Additional procfs files in /proc/net/can + + stats - SocketCAN core statistics (rx/tx frames, match ratios, ...) + reset_stats - manual statistic reset + version - prints the SocketCAN core version and the ABI version + + 5.3 writing own CAN protocol modules + + To implement a new protocol in the protocol family PF_CAN a new + protocol has to be defined in include/linux/can.h . + The prototypes and definitions to use the SocketCAN core can be + accessed by including include/linux/can/core.h . + In addition to functions that register the CAN protocol and the + CAN device notifier chain there are functions to subscribe CAN + frames received by CAN interfaces and to send CAN frames: + + can_rx_register - subscribe CAN frames from a specific interface + can_rx_unregister - unsubscribe CAN frames from a specific interface + can_send - transmit a CAN frame (optional with local loopback) + + For details see the kerneldoc documentation in net/can/af_can.c or + the source code of net/can/raw.c or net/can/bcm.c . + +6. CAN network drivers +---------------------- + + Writing a CAN network device driver is much easier than writing a + CAN character device driver. Similar to other known network device + drivers you mainly have to deal with: + + - TX: Put the CAN frame from the socket buffer to the CAN controller. + - RX: Put the CAN frame from the CAN controller to the socket buffer. + + See e.g. at Documentation/networking/netdevices.txt . The differences + for writing CAN network device driver are described below: + + 6.1 general settings + + dev->type = ARPHRD_CAN; /* the netdevice hardware type */ + dev->flags = IFF_NOARP; /* CAN has no arp */ + + dev->mtu = CAN_MTU; /* sizeof(struct can_frame) -> legacy CAN interface */ + + or alternative, when the controller supports CAN with flexible data rate: + dev->mtu = CANFD_MTU; /* sizeof(struct canfd_frame) -> CAN FD interface */ + + The struct can_frame or struct canfd_frame is the payload of each socket + buffer (skbuff) in the protocol family PF_CAN. + + 6.2 local loopback of sent frames + + As described in chapter 3.2 the CAN network device driver should + support a local loopback functionality similar to the local echo + e.g. of tty devices. In this case the driver flag IFF_ECHO has to be + set to prevent the PF_CAN core from locally echoing sent frames + (aka loopback) as fallback solution: + + dev->flags = (IFF_NOARP | IFF_ECHO); + + 6.3 CAN controller hardware filters + + To reduce the interrupt load on deep embedded systems some CAN + controllers support the filtering of CAN IDs or ranges of CAN IDs. + These hardware filter capabilities vary from controller to + controller and have to be identified as not feasible in a multi-user + networking approach. The use of the very controller specific + hardware filters could make sense in a very dedicated use-case, as a + filter on driver level would affect all users in the multi-user + system. The high efficient filter sets inside the PF_CAN core allow + to set different multiple filters for each socket separately. + Therefore the use of hardware filters goes to the category 'handmade + tuning on deep embedded systems'. The author is running a MPC603e + @133MHz with four SJA1000 CAN controllers from 2002 under heavy bus + load without any problems ... + + 6.4 The virtual CAN driver (vcan) + + Similar to the network loopback devices, vcan offers a virtual local + CAN interface. A full qualified address on CAN consists of + + - a unique CAN Identifier (CAN ID) + - the CAN bus this CAN ID is transmitted on (e.g. can0) + + so in common use cases more than one virtual CAN interface is needed. + + The virtual CAN interfaces allow the transmission and reception of CAN + frames without real CAN controller hardware. Virtual CAN network + devices are usually named 'vcanX', like vcan0 vcan1 vcan2 ... + When compiled as a module the virtual CAN driver module is called vcan.ko + + Since Linux Kernel version 2.6.24 the vcan driver supports the Kernel + netlink interface to create vcan network devices. The creation and + removal of vcan network devices can be managed with the ip(8) tool: + + - Create a virtual CAN network interface: + $ ip link add type vcan + + - Create a virtual CAN network interface with a specific name 'vcan42': + $ ip link add dev vcan42 type vcan + + - Remove a (virtual CAN) network interface 'vcan42': + $ ip link del vcan42 + + 6.5 The CAN network device driver interface + + The CAN network device driver interface provides a generic interface + to setup, configure and monitor CAN network devices. The user can then + configure the CAN device, like setting the bit-timing parameters, via + the netlink interface using the program "ip" from the "IPROUTE2" + utility suite. The following chapter describes briefly how to use it. + Furthermore, the interface uses a common data structure and exports a + set of common functions, which all real CAN network device drivers + should use. Please have a look to the SJA1000 or MSCAN driver to + understand how to use them. The name of the module is can-dev.ko. + + 6.5.1 Netlink interface to set/get devices properties + + The CAN device must be configured via netlink interface. The supported + netlink message types are defined and briefly described in + "include/linux/can/netlink.h". CAN link support for the program "ip" + of the IPROUTE2 utility suite is available and it can be used as shown + below: + + - Setting CAN device properties: + + $ ip link set can0 type can help + Usage: ip link set DEVICE type can + [ bitrate BITRATE [ sample-point SAMPLE-POINT] ] | + [ tq TQ prop-seg PROP_SEG phase-seg1 PHASE-SEG1 + phase-seg2 PHASE-SEG2 [ sjw SJW ] ] + + [ loopback { on | off } ] + [ listen-only { on | off } ] + [ triple-sampling { on | off } ] + + [ restart-ms TIME-MS ] + [ restart ] + + Where: BITRATE := { 1..1000000 } + SAMPLE-POINT := { 0.000..0.999 } + TQ := { NUMBER } + PROP-SEG := { 1..8 } + PHASE-SEG1 := { 1..8 } + PHASE-SEG2 := { 1..8 } + SJW := { 1..4 } + RESTART-MS := { 0 | NUMBER } + + - Display CAN device details and statistics: + + $ ip -details -statistics link show can0 + 2: can0: <NOARP,UP,LOWER_UP,ECHO> mtu 16 qdisc pfifo_fast state UP qlen 10 + link/can + can <TRIPLE-SAMPLING> state ERROR-ACTIVE restart-ms 100 + bitrate 125000 sample_point 0.875 + tq 125 prop-seg 6 phase-seg1 7 phase-seg2 2 sjw 1 + sja1000: tseg1 1..16 tseg2 1..8 sjw 1..4 brp 1..64 brp-inc 1 + clock 8000000 + re-started bus-errors arbit-lost error-warn error-pass bus-off + 41 17457 0 41 42 41 + RX: bytes packets errors dropped overrun mcast + 140859 17608 17457 0 0 0 + TX: bytes packets errors dropped carrier collsns + 861 112 0 41 0 0 + + More info to the above output: + + "<TRIPLE-SAMPLING>" + Shows the list of selected CAN controller modes: LOOPBACK, + LISTEN-ONLY, or TRIPLE-SAMPLING. + + "state ERROR-ACTIVE" + The current state of the CAN controller: "ERROR-ACTIVE", + "ERROR-WARNING", "ERROR-PASSIVE", "BUS-OFF" or "STOPPED" + + "restart-ms 100" + Automatic restart delay time. If set to a non-zero value, a + restart of the CAN controller will be triggered automatically + in case of a bus-off condition after the specified delay time + in milliseconds. By default it's off. + + "bitrate 125000 sample-point 0.875" + Shows the real bit-rate in bits/sec and the sample-point in the + range 0.000..0.999. If the calculation of bit-timing parameters + is enabled in the kernel (CONFIG_CAN_CALC_BITTIMING=y), the + bit-timing can be defined by setting the "bitrate" argument. + Optionally the "sample-point" can be specified. By default it's + 0.000 assuming CIA-recommended sample-points. + + "tq 125 prop-seg 6 phase-seg1 7 phase-seg2 2 sjw 1" + Shows the time quanta in ns, propagation segment, phase buffer + segment 1 and 2 and the synchronisation jump width in units of + tq. They allow to define the CAN bit-timing in a hardware + independent format as proposed by the Bosch CAN 2.0 spec (see + chapter 8 of http://www.semiconductors.bosch.de/pdf/can2spec.pdf). + + "sja1000: tseg1 1..16 tseg2 1..8 sjw 1..4 brp 1..64 brp-inc 1 + clock 8000000" + Shows the bit-timing constants of the CAN controller, here the + "sja1000". The minimum and maximum values of the time segment 1 + and 2, the synchronisation jump width in units of tq, the + bitrate pre-scaler and the CAN system clock frequency in Hz. + These constants could be used for user-defined (non-standard) + bit-timing calculation algorithms in user-space. + + "re-started bus-errors arbit-lost error-warn error-pass bus-off" + Shows the number of restarts, bus and arbitration lost errors, + and the state changes to the error-warning, error-passive and + bus-off state. RX overrun errors are listed in the "overrun" + field of the standard network statistics. + + 6.5.2 Setting the CAN bit-timing + + The CAN bit-timing parameters can always be defined in a hardware + independent format as proposed in the Bosch CAN 2.0 specification + specifying the arguments "tq", "prop_seg", "phase_seg1", "phase_seg2" + and "sjw": + + $ ip link set canX type can tq 125 prop-seg 6 \ + phase-seg1 7 phase-seg2 2 sjw 1 + + If the kernel option CONFIG_CAN_CALC_BITTIMING is enabled, CIA + recommended CAN bit-timing parameters will be calculated if the bit- + rate is specified with the argument "bitrate": + + $ ip link set canX type can bitrate 125000 + + Note that this works fine for the most common CAN controllers with + standard bit-rates but may *fail* for exotic bit-rates or CAN system + clock frequencies. Disabling CONFIG_CAN_CALC_BITTIMING saves some + space and allows user-space tools to solely determine and set the + bit-timing parameters. The CAN controller specific bit-timing + constants can be used for that purpose. They are listed by the + following command: + + $ ip -details link show can0 + ... + sja1000: clock 8000000 tseg1 1..16 tseg2 1..8 sjw 1..4 brp 1..64 brp-inc 1 + + 6.5.3 Starting and stopping the CAN network device + + A CAN network device is started or stopped as usual with the command + "ifconfig canX up/down" or "ip link set canX up/down". Be aware that + you *must* define proper bit-timing parameters for real CAN devices + before you can start it to avoid error-prone default settings: + + $ ip link set canX up type can bitrate 125000 + + A device may enter the "bus-off" state if too many errors occurred on + the CAN bus. Then no more messages are received or sent. An automatic + bus-off recovery can be enabled by setting the "restart-ms" to a + non-zero value, e.g.: + + $ ip link set canX type can restart-ms 100 + + Alternatively, the application may realize the "bus-off" condition + by monitoring CAN error message frames and do a restart when + appropriate with the command: + + $ ip link set canX type can restart + + Note that a restart will also create a CAN error message frame (see + also chapter 3.4). + + 6.6 CAN FD (flexible data rate) driver support + + CAN FD capable CAN controllers support two different bitrates for the + arbitration phase and the payload phase of the CAN FD frame. Therefore a + second bit timing has to be specified in order to enable the CAN FD bitrate. + + Additionally CAN FD capable CAN controllers support up to 64 bytes of + payload. The representation of this length in can_frame.can_dlc and + canfd_frame.len for userspace applications and inside the Linux network + layer is a plain value from 0 .. 64 instead of the CAN 'data length code'. + The data length code was a 1:1 mapping to the payload length in the legacy + CAN frames anyway. The payload length to the bus-relevant DLC mapping is + only performed inside the CAN drivers, preferably with the helper + functions can_dlc2len() and can_len2dlc(). + + The CAN netdevice driver capabilities can be distinguished by the network + devices maximum transfer unit (MTU): + + MTU = 16 (CAN_MTU) => sizeof(struct can_frame) => 'legacy' CAN device + MTU = 72 (CANFD_MTU) => sizeof(struct canfd_frame) => CAN FD capable device + + The CAN device MTU can be retrieved e.g. with a SIOCGIFMTU ioctl() syscall. + N.B. CAN FD capable devices can also handle and send legacy CAN frames. + + FIXME: Add details about the CAN FD controller configuration when available. + + 6.7 Supported CAN hardware + + Please check the "Kconfig" file in "drivers/net/can" to get an actual + list of the support CAN hardware. On the SocketCAN project website + (see chapter 7) there might be further drivers available, also for + older kernel versions. + +7. SocketCAN resources +----------------------- + + The Linux CAN / SocketCAN project ressources (project site / mailing list) + are referenced in the MAINTAINERS file in the Linux source tree. + Search for CAN NETWORK [LAYERS|DRIVERS]. + +8. Credits +---------- + + Oliver Hartkopp (PF_CAN core, filters, drivers, bcm, SJA1000 driver) + Urs Thuermann (PF_CAN core, kernel integration, socket interfaces, raw, vcan) + Jan Kizka (RT-SocketCAN core, Socket-API reconciliation) + Wolfgang Grandegger (RT-SocketCAN core & drivers, Raw Socket-API reviews, + CAN device driver interface, MSCAN driver) + Robert Schwebel (design reviews, PTXdist integration) + Marc Kleine-Budde (design reviews, Kernel 2.6 cleanups, drivers) + Benedikt Spranger (reviews) + Thomas Gleixner (LKML reviews, coding style, posting hints) + Andrey Volkov (kernel subtree structure, ioctls, MSCAN driver) + Matthias Brukner (first SJA1000 CAN netdevice implementation Q2/2003) + Klaus Hitschler (PEAK driver integration) + Uwe Koppe (CAN netdevices with PF_PACKET approach) + Michael Schulze (driver layer loopback requirement, RT CAN drivers review) + Pavel Pisa (Bit-timing calculation) + Sascha Hauer (SJA1000 platform driver) + Sebastian Haas (SJA1000 EMS PCI driver) + Markus Plessing (SJA1000 EMS PCI driver) + Per Dalen (SJA1000 Kvaser PCI driver) + Sam Ravnborg (reviews, coding style, kbuild help) diff --git a/Documentation/networking/cdc_mbim.txt b/Documentation/networking/cdc_mbim.txt new file mode 100644 index 00000000000..a15ea602aa5 --- /dev/null +++ b/Documentation/networking/cdc_mbim.txt @@ -0,0 +1,339 @@ + cdc_mbim - Driver for CDC MBIM Mobile Broadband modems + ======================================================== + +The cdc_mbim driver supports USB devices conforming to the "Universal +Serial Bus Communications Class Subclass Specification for Mobile +Broadband Interface Model" [1], which is a further development of +"Universal Serial Bus Communications Class Subclass Specifications for +Network Control Model Devices" [2] optimized for Mobile Broadband +devices, aka "3G/LTE modems". + + +Command Line Parameters +======================= + +The cdc_mbim driver has no parameters of its own. But the probing +behaviour for NCM 1.0 backwards compatible MBIM functions (an +"NCM/MBIM function" as defined in section 3.2 of [1]) is affected +by a cdc_ncm driver parameter: + +prefer_mbim +----------- +Type: Boolean +Valid Range: N/Y (0-1) +Default Value: Y (MBIM is preferred) + +This parameter sets the system policy for NCM/MBIM functions. Such +functions will be handled by either the cdc_ncm driver or the cdc_mbim +driver depending on the prefer_mbim setting. Setting prefer_mbim=N +makes the cdc_mbim driver ignore these functions and lets the cdc_ncm +driver handle them instead. + +The parameter is writable, and can be changed at any time. A manual +unbind/bind is required to make the change effective for NCM/MBIM +functions bound to the "wrong" driver + + +Basic usage +=========== + +MBIM functions are inactive when unmanaged. The cdc_mbim driver only +provides an userspace interface to the MBIM control channel, and will +not participate in the management of the function. This implies that a +userspace MBIM management application always is required to enable a +MBIM function. + +Such userspace applications includes, but are not limited to: + - mbimcli (included with the libmbim [3] library), and + - ModemManager [4] + +Establishing a MBIM IP session reequires at least these actions by the +management application: + - open the control channel + - configure network connection settings + - connect to network + - configure IP interface + +Management application development +---------------------------------- +The driver <-> userspace interfaces are described below. The MBIM +control channel protocol is described in [1]. + + +MBIM control channel userspace ABI +================================== + +/dev/cdc-wdmX character device +------------------------------ +The driver creates a two-way pipe to the MBIM function control channel +using the cdc-wdm driver as a subdriver. The userspace end of the +control channel pipe is a /dev/cdc-wdmX character device. + +The cdc_mbim driver does not process or police messages on the control +channel. The channel is fully delegated to the userspace management +application. It is therefore up to this application to ensure that it +complies with all the control channel requirements in [1]. + +The cdc-wdmX device is created as a child of the MBIM control +interface USB device. The character device associated with a specific +MBIM function can be looked up using sysfs. For example: + + bjorn@nemi:~$ ls /sys/bus/usb/drivers/cdc_mbim/2-4:2.12/usbmisc + cdc-wdm0 + + bjorn@nemi:~$ grep . /sys/bus/usb/drivers/cdc_mbim/2-4:2.12/usbmisc/cdc-wdm0/dev + 180:0 + + +USB configuration descriptors +----------------------------- +The wMaxControlMessage field of the CDC MBIM functional descriptor +limits the maximum control message size. The managament application is +responsible for negotiating a control message size complying with the +requirements in section 9.3.1 of [1], taking this descriptor field +into consideration. + +The userspace application can access the CDC MBIM functional +descriptor of a MBIM function using either of the two USB +configuration descriptor kernel interfaces described in [6] or [7]. + +See also the ioctl documentation below. + + +Fragmentation +------------- +The userspace application is responsible for all control message +fragmentation and defragmentaion, as described in section 9.5 of [1]. + + +/dev/cdc-wdmX write() +--------------------- +The MBIM control messages from the management application *must not* +exceed the negotiated control message size. + + +/dev/cdc-wdmX read() +-------------------- +The management application *must* accept control messages of up the +negotiated control message size. + + +/dev/cdc-wdmX ioctl() +-------------------- +IOCTL_WDM_MAX_COMMAND: Get Maximum Command Size +This ioctl returns the wMaxControlMessage field of the CDC MBIM +functional descriptor for MBIM devices. This is intended as a +convenience, eliminating the need to parse the USB descriptors from +userspace. + + #include <stdio.h> + #include <fcntl.h> + #include <sys/ioctl.h> + #include <linux/types.h> + #include <linux/usb/cdc-wdm.h> + int main() + { + __u16 max; + int fd = open("/dev/cdc-wdm0", O_RDWR); + if (!ioctl(fd, IOCTL_WDM_MAX_COMMAND, &max)) + printf("wMaxControlMessage is %d\n", max); + } + + +Custom device services +---------------------- +The MBIM specification allows vendors to freely define additional +services. This is fully supported by the cdc_mbim driver. + +Support for new MBIM services, including vendor specified services, is +implemented entirely in userspace, like the rest of the MBIM control +protocol + +New services should be registered in the MBIM Registry [5]. + + + +MBIM data channel userspace ABI +=============================== + +wwanY network device +-------------------- +The cdc_mbim driver represents the MBIM data channel as a single +network device of the "wwan" type. This network device is initially +mapped to MBIM IP session 0. + + +Multiplexed IP sessions (IPS) +----------------------------- +MBIM allows multiplexing up to 256 IP sessions over a single USB data +channel. The cdc_mbim driver models such IP sessions as 802.1q VLAN +subdevices of the master wwanY device, mapping MBIM IP session Z to +VLAN ID Z for all values of Z greater than 0. + +The device maximum Z is given in the MBIM_DEVICE_CAPS_INFO structure +described in section 10.5.1 of [1]. + +The userspace management application is responsible for adding new +VLAN links prior to establishing MBIM IP sessions where the SessionId +is greater than 0. These links can be added by using the normal VLAN +kernel interfaces, either ioctl or netlink. + +For example, adding a link for a MBIM IP session with SessionId 3: + + ip link add link wwan0 name wwan0.3 type vlan id 3 + +The driver will automatically map the "wwan0.3" network device to MBIM +IP session 3. + + +Device Service Streams (DSS) +---------------------------- +MBIM also allows up to 256 non-IP data streams to be multiplexed over +the same shared USB data channel. The cdc_mbim driver models these +sessions as another set of 802.1q VLAN subdevices of the master wwanY +device, mapping MBIM DSS session A to VLAN ID (256 + A) for all values +of A. + +The device maximum A is given in the MBIM_DEVICE_SERVICES_INFO +structure described in section 10.5.29 of [1]. + +The DSS VLAN subdevices are used as a practical interface between the +shared MBIM data channel and a MBIM DSS aware userspace application. +It is not intended to be presented as-is to an end user. The +assumption is that an userspace application initiating a DSS session +also takes care of the necessary framing of the DSS data, presenting +the stream to the end user in an appropriate way for the stream type. + +The network device ABI requires a dummy ethernet header for every DSS +data frame being transported. The contents of this header is +arbitrary, with the following exceptions: + - TX frames using an IP protocol (0x0800 or 0x86dd) will be dropped + - RX frames will have the protocol field set to ETH_P_802_3 (but will + not be properly formatted 802.3 frames) + - RX frames will have the destination address set to the hardware + address of the master device + +The DSS supporting userspace management application is responsible for +adding the dummy ethernet header on TX and stripping it on RX. + +This is a simple example using tools commonly available, exporting +DssSessionId 5 as a pty character device pointed to by a /dev/nmea +symlink: + + ip link add link wwan0 name wwan0.dss5 type vlan id 261 + ip link set dev wwan0.dss5 up + socat INTERFACE:wwan0.dss5,type=2 PTY:,echo=0,link=/dev/nmea + +This is only an example, most suitable for testing out a DSS +service. Userspace applications supporting specific MBIM DSS services +are expected to use the tools and programming interfaces required by +that service. + +Note that adding VLAN links for DSS sessions is entirely optional. A +management application may instead choose to bind a packet socket +directly to the master network device, using the received VLAN tags to +map frames to the correct DSS session and adding 18 byte VLAN ethernet +headers with the appropriate tag on TX. In this case using a socket +filter is recommended, matching only the DSS VLAN subset. This avoid +unnecessary copying of unrelated IP session data to userspace. For +example: + + static struct sock_filter dssfilter[] = { + /* use special negative offsets to get VLAN tag */ + BPF_STMT(BPF_LD|BPF_B|BPF_ABS, SKF_AD_OFF + SKF_AD_VLAN_TAG_PRESENT), + BPF_JUMP(BPF_JMP|BPF_JEQ|BPF_K, 1, 0, 6), /* true */ + + /* verify DSS VLAN range */ + BPF_STMT(BPF_LD|BPF_H|BPF_ABS, SKF_AD_OFF + SKF_AD_VLAN_TAG), + BPF_JUMP(BPF_JMP|BPF_JGE|BPF_K, 256, 0, 4), /* 256 is first DSS VLAN */ + BPF_JUMP(BPF_JMP|BPF_JGE|BPF_K, 512, 3, 0), /* 511 is last DSS VLAN */ + + /* verify ethertype */ + BPF_STMT(BPF_LD|BPF_H|BPF_ABS, 2 * ETH_ALEN), + BPF_JUMP(BPF_JMP|BPF_JEQ|BPF_K, ETH_P_802_3, 0, 1), + + BPF_STMT(BPF_RET|BPF_K, (u_int)-1), /* accept */ + BPF_STMT(BPF_RET|BPF_K, 0), /* ignore */ + }; + + + +Tagged IP session 0 VLAN +------------------------ +As described above, MBIM IP session 0 is treated as special by the +driver. It is initially mapped to untagged frames on the wwanY +network device. + +This mapping implies a few restrictions on multiplexed IPS and DSS +sessions, which may not always be practical: + - no IPS or DSS session can use a frame size greater than the MTU on + IP session 0 + - no IPS or DSS session can be in the up state unless the network + device representing IP session 0 also is up + +These problems can be avoided by optionally making the driver map IP +session 0 to a VLAN subdevice, similar to all other IP sessions. This +behaviour is triggered by adding a VLAN link for the magic VLAN ID +4094. The driver will then immediately start mapping MBIM IP session +0 to this VLAN, and will drop untagged frames on the master wwanY +device. + +Tip: It might be less confusing to the end user to name this VLAN +subdevice after the MBIM SessionID instead of the VLAN ID. For +example: + + ip link add link wwan0 name wwan0.0 type vlan id 4094 + + +VLAN mapping +------------ + +Summarizing the cdc_mbim driver mapping described above, we have this +relationship between VLAN tags on the wwanY network device and MBIM +sessions on the shared USB data channel: + + VLAN ID MBIM type MBIM SessionID Notes + --------------------------------------------------------- + untagged IPS 0 a) + 1 - 255 IPS 1 - 255 <VLANID> + 256 - 511 DSS 0 - 255 <VLANID - 256> + 512 - 4093 b) + 4094 IPS 0 c) + + a) if no VLAN ID 4094 link exists, else dropped + b) unsupported VLAN range, unconditionally dropped + c) if a VLAN ID 4094 link exists, else dropped + + + + +References +========== + +[1] USB Implementers Forum, Inc. - "Universal Serial Bus + Communications Class Subclass Specification for Mobile Broadband + Interface Model", Revision 1.0 (Errata 1), May 1, 2013 + - http://www.usb.org/developers/docs/devclass_docs/ + +[2] USB Implementers Forum, Inc. - "Universal Serial Bus + Communications Class Subclass Specifications for Network Control + Model Devices", Revision 1.0 (Errata 1), November 24, 2010 + - http://www.usb.org/developers/docs/devclass_docs/ + +[3] libmbim - "a glib-based library for talking to WWAN modems and + devices which speak the Mobile Interface Broadband Model (MBIM) + protocol" + - http://www.freedesktop.org/wiki/Software/libmbim/ + +[4] ModemManager - "a DBus-activated daemon which controls mobile + broadband (2G/3G/4G) devices and connections" + - http://www.freedesktop.org/wiki/Software/ModemManager/ + +[5] "MBIM (Mobile Broadband Interface Model) Registry" + - http://compliance.usb.org/mbim/ + +[6] "/proc/bus/usb filesystem output" + - Documentation/usb/proc_usb_info.txt + +[7] "/sys/bus/usb/devices/.../descriptors" + - Documentation/ABI/stable/sysfs-bus-usb diff --git a/Documentation/networking/comx.txt b/Documentation/networking/comx.txt deleted file mode 100644 index d1526eba264..00000000000 --- a/Documentation/networking/comx.txt +++ /dev/null @@ -1,248 +0,0 @@ - - COMX drivers for the 2.2 kernel - -Originally written by: Tivadar Szemethy, <tiv@itc.hu> -Currently maintained by: Gergely Madarasz <gorgo@itc.hu> - -Last change: 21/06/1999. - -INTRODUCTION - -This document describes the software drivers and their use for the -COMX line of synchronous serial adapters for Linux version 2.2.0 and -above. -The cards are produced and sold by ITC-Pro Ltd. Budapest, Hungary -For further info contact <info@itc.hu> -or http://www.itc.hu (mostly in Hungarian). -The firmware files and software are available from ftp://ftp.itc.hu - -Currently, the drivers support the following cards and protocols: - -COMX (2x64 kbps intelligent board) -CMX (1x256 + 1x128 kbps intelligent board) -HiCOMX (2x2Mbps intelligent board) -LoCOMX (1x512 kbps passive board) -MixCOM (1x512 or 2x512kbps passive board with a hardware watchdog an - optional BRI interface and optional flashROM (1-32M)) -SliceCOM (1x2Mbps channelized E1 board) -PciCOM (X21) - -At the moment of writing this document, the (Cisco)-HDLC, LAPB, SyncPPP and -Frame Relay (DTE, rfc1294 IP encapsulation with partially implemented Q933a -LMI) protocols are available as link-level protocol. -X.25 support is being worked on. - -USAGE - -Load the comx.o module and the hardware-specific and protocol-specific -modules you'll need into the running kernel using the insmod utility. -This creates the /proc/comx directory. -See the example scripts in the 'etc' directory. - -/proc INTERFACE INTRO - -The COMX driver set has a new type of user interface based on the /proc -filesystem which eliminates the need for external user-land software doing -IOCTL calls. -Each network interface or device (i.e. those ones you configure with 'ifconfig' -and 'route' etc.) has a corresponding directory under /proc/comx. You can -dynamically create a new interface by saying 'mkdir /proc/comx/comx0' (or you -can name it whatever you want up to 8 characters long, comx[n] is just a -convention). -Generally the files contained in these directories are text files, which can -be viewed by 'cat filename' and you can write a string to such a file by -saying 'echo _string_ >filename'. This is very similar to the sysctl interface. -Don't use a text editor to edit these files, always use 'echo' (or 'cat' -where appropriate). -When you've created the comx[n] directory, two files are created automagically -in it: 'boardtype' and 'protocol'. You have to fill in these files correctly -for your board and protocol you intend to use (see the board and protocol -descriptions in this file below or the example scripts in the 'etc' directory). -After filling in these files, other files will appear in the directory for -setting the various hardware- and protocol-related informations (for example -irq and io addresses, keepalive values etc.) These files are set to default -values upon creation, so you don't necessarily have to change all of them. - -When you're ready with filling in the files in the comx[n] directory, you can -configure the corresponding network interface with the standard network -configuration utilities. If you're unable to bring the interfaces up, look up -the various kernel log files on your system, and consult the messages for -a probable reason. - -EXAMPLE - -To create the interface 'comx0' which is the first channel of a COMX card: - -insmod comx -# insmod comx-hw-comx ; insmod comx-proto-ppp (these are usually -autoloaded if you use the kernel module loader) - -mkdir /proc/comx/comx0 -echo comx >/proc/comx/comx0/boardtype -echo 0x360 >/proc/comx/comx0/io <- jumper-selectable I/O port -echo 0x0a >/proc/comx/comx0/irq <- jumper-selectable IRQ line -echo 0xd000 >/proc/comx/comx0/memaddr <- software-configurable memory - address. COMX uses 64 KB, and this - can be: 0xa000, 0xb000, 0xc000, - 0xd000, 0xe000. Avoid conflicts - with other hardware. -cat </etc/siol1.rom >/proc/comx/comx0/firmware <- the firmware for the card -echo HDLC >/proc/comx/comx0/protocol <- the data-link protocol -echo 10 >/proc/comx/comx0/keepalive <- the keepalive for the protocol -ifconfig comx0 1.2.3.4 pointopoint 5.6.7.8 netmask 255.255.255.255 <- - finally configure it with ifconfig -Check its status: -cat /proc/comx/comx0/status - -If you want to use the second channel of this board: - -mkdir /proc/comx/comx1 -echo comx >/proc/comx/comx1/boardtype -echo 0x360 >/proc/comx/comx1/io -echo 10 >/proc/comx/comx1/irq -echo 0xd000 >/proc/comx/comx1/memaddr -echo 1 >/proc/comx/comx1/channel <- channels are numbered - as 0 (default) and 1 - -Now, check if the driver recognized that you're going to use the other -channel of the same adapter: - -cat /proc/comx/comx0/twin -comx1 -cat /proc/comx/comx1/twin -comx0 - -You don't have to load the firmware twice, if you use both channels of -an adapter, just write it into the channel 0's /proc firmware file. - -Default values: io 0x360 for COMX, 0x320 (HICOMX), irq 10, memaddr 0xd0000 - -THE LOCOMX HARDWARE DRIVER - -The LoCOMX driver doesn't require firmware, and it doesn't use memory either, -but it uses DMA channels 1 and 3. You can set the clock rate (if enabled by -jumpers on the board) by writing the kbps value into the file named 'clock'. -Set it to 'external' (it is the default) if you have external clock source. - -(Note: currently the LoCOMX driver does not support the internal clock) - -THE COMX, CMX AND HICOMX DRIVERS - -On the HICOMX, COMX and CMX, you have to load the firmware (it is different for -the three cards!). All these adapters can share the same memory -address (we usually use 0xd0000). On the CMX you can set the internal -clock rate (if enabled by jumpers on the small adapter boards) by writing -the kbps value into the 'clock' file. You have to do this before initializing -the card. If you use both HICOMX and CMX/COMX cards, initialize the HICOMX -first. The I/O address of the HICOMX board is not configurable by any -method available to the user: it is hardwired to 0x320, and if you have to -change it, consult ITC-Pro Ltd. - -THE MIXCOM DRIVER - -The MixCOM board doesn't require firmware, the driver communicates with -it through I/O ports. You can have three of these cards in one machine. - -THE SLICECOM DRIVER - -The SliceCOM board doesn't require firmware. You can have 4 of these cards -in one machine. The driver doesn't (yet) support shared interrupts, so -you will need a separate IRQ line for every board. -Read Documentation/networking/slicecom.txt for help on configuring -this adapter. - -THE HDLC/PPP LINE PROTOCOL DRIVER - -The HDLC/SyncPPP line protocol driver uses the kernel's built-in syncppp -driver (syncppp.o). You don't have to manually select syncppp.o when building -the kernel, the dependencies compile it in automatically. - - - - -EXAMPLE -(setting up hw parameters, see above) - -# using HDLC: -echo hdlc >/proc/comx/comx0/protocol -echo 10 >/proc/comx/comx0/keepalive <- not necessary, 10 is the default -ifconfig comx0 1.2.3.4 pointopoint 5.6.7.8 netmask 255.255.255.255 - -(setting up hw parameters, see above) - -# using PPP: -echo ppp >/proc/comx/comx0/protocol -ifconfig comx0 up -ifconfig comx0 1.2.3.4 pointopoint 5.6.7.8 netmask 255.255.255.255 - - -THE LAPB LINE PROTOCOL DRIVER - -For this, you'll need to configure LAPB support (See 'LAPB Data Link Driver' in -'Network options' section) into your kernel (thanks to Jonathan Naylor for his -excellent implementation). -comx-proto-lapb.o provides the following files in the appropriate directory -(the default values in parens): t1 (5), t2 (1), n2 (20), mode (DTE, STD) and -window (7). Agree with the administrator of your peer router on these -settings (most people use defaults, but you have to know if you are DTE or -DCE). - -EXAMPLE - -(setting up hw parameters, see above) -echo lapb >/proc/comx/comx0/protocol -echo dce >/proc/comx/comx0/mode <- DCE interface in this example -ifconfig comx0 1.2.3.4 pointopoint 5.6.7.8 netmask 255.255.255.255 - - -THE FRAME RELAY PROTOCOL DRIVER - -You DON'T need any other frame relay related modules from the kernel to use -COMX-Frame Relay. This protocol is a bit more complicated than the others, -because it allows to use 'subinterfaces' or DLCIs within one physical device. -First you have to create the 'master' device (the actual physical interface) -as you would do for other protocols. Specify 'frad' as protocol type. -Now you can bring this interface up by saying 'ifconfig comx0 up' (or whatever -you've named the interface). Do not assign any IP address to this interface -and do not set any routes through it. -Then, set up your DLCIs the following way: create a comx interface for each -DLCI you intend to use (with mkdir), and write 'dlci' to the 'boardtype' file, -and 'ietf-ip' to the 'protocol' file. Currently, the only supported -encapsulation type is this (also called as RFC1294/1490 IP encapsulation). -Write the DLCI number to the 'dlci' file, and write the name of the physical -COMX device to the file called 'master'. -Now you can assign an IP address to this interface and set routes using it. -See the example file for further info and example config script. -Notes: this driver implements a DTE interface with partially implemented -Q933a LMI. -You can find an extensively commented example in the 'etc' directory. - -FURTHER /proc FILES - -boardtype: -Type of the hardware. Valid values are: - 'comx', 'hicomx', 'locomx', 'cmx', 'slicecom'. - -protocol: -Data-link protocol on this channel. Can be: HDLC, LAPB, PPP, FRAD - -status: -You can read the channel's actual status from the 'status' file, for example -'cat /proc/comx/comx3/status'. - -lineup_delay: -Interpreted in seconds (default is 1). Used to avoid line jitter: the system -will consider the line status 'UP' only if it is up for at least this number -of seconds. - -debug: -You can set various debug options through this file. Valid options are: -'comx_events', 'comx_tx', 'comx_rx', 'hw_events', 'hw_tx', 'hw_rx'. -You can enable a debug options by writing its name prepended by a '+' into -the debug file, for example 'echo +comx_rx >comx0/debug'. -Disabling an option happens similarly, use the '-' prefix -(e.g. 'echo -hw_rx >debug'). -Debug results can be read from the debug file, for example: -tail -f /proc/comx/comx2/debug - - diff --git a/Documentation/networking/cs89x0.txt b/Documentation/networking/cs89x0.txt index 188beb7d6a1..0e190180eec 100644 --- a/Documentation/networking/cs89x0.txt +++ b/Documentation/networking/cs89x0.txt @@ -3,7 +3,7 @@ NOTE ---- This document was contributed by Cirrus Logic for kernel 2.2.5. This version -has been updated for 2.3.48 by Andrew Morton <andrewm@uow.edu.au> +has been updated for 2.3.48 by Andrew Morton. Cirrus make a copy of this driver available at their website, as described below. In general, you should use the driver version which @@ -36,7 +36,6 @@ TABLE OF CONTENTS 4.1 Compiling the Driver as a Loadable Module 4.2 Compiling the driver to support memory mode 4.3 Compiling the driver to support Rx DMA - 4.4 Compiling the Driver into the Kernel 5.0 TESTING AND TROUBLESHOOTING 5.1 Known Defects and Limitations @@ -227,7 +226,7 @@ configuration options are available on the command line: * media=rj45 - specify media type or media=bnc or media=aui - or medai=auto + or media=auto * duplex=full - specify forced half/full/autonegotiate duplex or duplex=half or duplex=auto @@ -248,7 +247,7 @@ c) The driver's hardware probe routine is designed to avoid with device probing. To avoid this behaviour, add one to the `io=' module parameter. This doesn't actually change the I/O address, but it is a flag to tell the driver - topartially initialise the hardware before trying to + to partially initialise the hardware before trying to identify the card. This could be dangerous if you are not sure that there is a cs89x0 card at the provided address. @@ -364,84 +363,6 @@ The compile-time optionality for DMA was removed in the 2.3 kernel series. DMA support is now unconditionally part of the driver. It is enabled by the 'use_dma=1' module option. -4.4 COMPILING THE DRIVER INTO THE KERNEL - -If your Linux distribution already has support for the cs89x0 driver -then simply copy the source file to the /usr/src/linux/drivers/net -directory to replace the original ones and run the make utility to -rebuild the kernel. See Step 3 for rebuilding the kernel. - -If your Linux does not include the cs89x0 driver, you need to edit three -configuration files, copy the source file to the /usr/src/linux/drivers/net -directory, and then run the make utility to rebuild the kernel. - -1. Edit the following configuration files by adding the statements as -indicated. (When possible, try to locate the added text to the section of the -file containing similar statements). - - -a.) In /usr/src/linux/drivers/net/Config.in, add: - -tristate 'CS89x0 support' CONFIG_CS89x0 - -Example: - - if [ "$CONFIG_EXPERIMENTAL" = "y" ]; then - tristate 'ICL EtherTeam 16i/32 support' CONFIG_ETH16I - fi - - tristate 'CS89x0 support' CONFIG_CS89x0 - - tristate 'NE2000/NE1000 support' CONFIG_NE2000 - if [ "$CONFIG_EXPERIMENTAL" = "y" ]; then - tristate 'NI5210 support' CONFIG_NI52 - - -b.) In /usr/src/linux/drivers/net/Makefile, add the following lines: - -ifeq ($(CONFIG_CS89x0),y) -L_OBJS += cs89x0.o -else - ifeq ($(CONFIG_CS89x0),m) - M_OBJS += cs89x0.o - endif -endif - - -c.) In /linux/drivers/net/Space.c file, add the line: - -extern int cs89x0_probe(struct device *dev); - - -Example: - - extern int ultra_probe(struct device *dev); - extern int wd_probe(struct device *dev); - extern int el2_probe(struct device *dev); - - extern int cs89x0_probe(struct device *dev); - - extern int ne_probe(struct device *dev); - extern int hp_probe(struct device *dev); - extern int hp_plus_probe(struct device *dev); - - -Also add: - - #ifdef CONFIG_CS89x0 - { cs89x0_probe,0 }, - #endif - - -2.) Copy the driver source files (cs89x0.c and cs89x0.h) -into the /usr/src/linux/drivers/net directory. - - -3.) Go to /usr/src/linux directory and run 'make config' followed by 'make' -(or make bzImage) to rebuild the kernel. - -4.) Use the DOS 'setup' utility to disable plug and play on the NIC. - 5.0 TESTING AND TROUBLESHOOTING =============================================================================== @@ -584,7 +505,7 @@ of four ways after installing and or configuring the CS8900/20-based adapter: 1.) The system does not boot properly (or at all). - 2.) The driver can not communicate with the adapter, reporting an "Adapter + 2.) The driver cannot communicate with the adapter, reporting an "Adapter not found" error message. 3.) You cannot connect to the network or the driver will not load. @@ -620,8 +541,8 @@ I/O Address Device IRQ Device 12 Mouse (PS/2) Memory Address Device 13 Math Coprocessor -------------- --------------------- 14 Hard Disk controller -A000-BFFF EGA Graphics Adpater -A000-C7FF VGA Graphics Adpater +A000-BFFF EGA Graphics Adapter +A000-C7FF VGA Graphics Adapter B000-BFFF Mono Graphics Adapter B800-BFFF Color Graphics Adapter E000-FFFF AT BIOS @@ -684,13 +605,13 @@ ethernet@crystal.cirrus.com) and request that you be registered for automatic software-update notification. Cirrus Logic maintains a web page at http://www.cirrus.com with the -the latest drivers and technical publications. +latest drivers and technical publications. 6.4 Current maintainer In February 2000 the maintenance of this driver was assumed by Andrew -Morton <akpm@zip.com.au> +Morton. 6.5 Kernel module parameters diff --git a/Documentation/networking/cxacru-cf.py b/Documentation/networking/cxacru-cf.py new file mode 100644 index 00000000000..b41d298398c --- /dev/null +++ b/Documentation/networking/cxacru-cf.py @@ -0,0 +1,48 @@ +#!/usr/bin/env python +# Copyright 2009 Simon Arlott +# +# This program is free software; you can redistribute it and/or modify it +# under the terms of the GNU General Public License as published by the Free +# Software Foundation; either version 2 of the License, or (at your option) +# any later version. +# +# This program is distributed in the hope that it will be useful, but WITHOUT +# ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or +# FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for +# more details. +# +# You should have received a copy of the GNU General Public License along with +# this program; if not, write to the Free Software Foundation, Inc., 59 +# Temple Place - Suite 330, Boston, MA 02111-1307, USA. +# +# Usage: cxacru-cf.py < cxacru-cf.bin +# Output: values string suitable for the sysfs adsl_config attribute +# +# Warning: cxacru-cf.bin with MD5 hash cdbac2689969d5ed5d4850f117702110 +# contains mis-aligned values which will stop the modem from being able +# to make a connection. If the first and last two bytes are removed then +# the values become valid, but the modulation will be forced to ANSI +# T1.413 only which may not be appropriate. +# +# The original binary format is a packed list of le32 values. + +import sys +import struct + +i = 0 +while True: + buf = sys.stdin.read(4) + + if len(buf) == 0: + break + elif len(buf) != 4: + sys.stdout.write("\n") + sys.stderr.write("Error: read {0} not 4 bytes\n".format(len(buf))) + sys.exit(1) + + if i > 0: + sys.stdout.write(" ") + sys.stdout.write("{0:x}={1}".format(i, struct.unpack("<I", buf)[0])) + i += 1 + +sys.stdout.write("\n") diff --git a/Documentation/networking/cxacru.txt b/Documentation/networking/cxacru.txt new file mode 100644 index 00000000000..2cce04457b4 --- /dev/null +++ b/Documentation/networking/cxacru.txt @@ -0,0 +1,100 @@ +Firmware is required for this device: http://accessrunner.sourceforge.net/ + +While it is capable of managing/maintaining the ADSL connection without the +module loaded, the device will sometimes stop responding after unloading the +driver and it is necessary to unplug/remove power to the device to fix this. + +Note: support for cxacru-cf.bin has been removed. It was not loaded correctly +so it had no effect on the device configuration. Fixing it could have stopped +existing devices working when an invalid configuration is supplied. + +There is a script cxacru-cf.py to convert an existing file to the sysfs form. + +Detected devices will appear as ATM devices named "cxacru". In /sys/class/atm/ +these are directories named cxacruN where N is the device number. A symlink +named device points to the USB interface device's directory which contains +several sysfs attribute files for retrieving device statistics: + +* adsl_controller_version + +* adsl_headend +* adsl_headend_environment + Information about the remote headend. + +* adsl_config + Configuration writing interface. + Write parameters in hexadecimal format <index>=<value>, + separated by whitespace, e.g.: + "1=0 a=5" + Up to 7 parameters at a time will be sent and the modem will restart + the ADSL connection when any value is set. These are logged for future + reference. + +* downstream_attenuation (dB) +* downstream_bits_per_frame +* downstream_rate (kbps) +* downstream_snr_margin (dB) + Downstream stats. + +* upstream_attenuation (dB) +* upstream_bits_per_frame +* upstream_rate (kbps) +* upstream_snr_margin (dB) +* transmitter_power (dBm/Hz) + Upstream stats. + +* downstream_crc_errors +* downstream_fec_errors +* downstream_hec_errors +* upstream_crc_errors +* upstream_fec_errors +* upstream_hec_errors + Error counts. + +* line_startable + Indicates that ADSL support on the device + is/can be enabled, see adsl_start. + +* line_status + "initialising" + "down" + "attempting to activate" + "training" + "channel analysis" + "exchange" + "waiting" + "up" + + Changes between "down" and "attempting to activate" + if there is no signal. + +* link_status + "not connected" + "connected" + "lost" + +* mac_address + +* modulation + "" (when not connected) + "ANSI T1.413" + "ITU-T G.992.1 (G.DMT)" + "ITU-T G.992.2 (G.LITE)" + +* startup_attempts + Count of total attempts to initialise ADSL. + +To enable/disable ADSL, the following can be written to the adsl_state file: + "start" + "stop + "restart" (stops, waits 1.5s, then starts) + "poll" (used to resume status polling if it was disabled due to failure) + +Changes in adsl/line state are reported via kernel log messages: + [4942145.150704] ATM dev 0: ADSL state: running + [4942243.663766] ATM dev 0: ADSL line: down + [4942249.665075] ATM dev 0: ADSL line: attempting to activate + [4942253.654954] ATM dev 0: ADSL line: training + [4942255.666387] ATM dev 0: ADSL line: channel analysis + [4942259.656262] ATM dev 0: ADSL line: exchange + [2635357.696901] ATM dev 0: ADSL line: up (8128 kb/s down | 832 kb/s up) diff --git a/Documentation/networking/cxgb.txt b/Documentation/networking/cxgb.txt index 76324638626..20a887615c4 100644 --- a/Documentation/networking/cxgb.txt +++ b/Documentation/networking/cxgb.txt @@ -56,7 +56,7 @@ FEATURES ethtool -C eth0 rx-usecs 100 - You may also provide a timer latency value while disabling adpative-rx: + You may also provide a timer latency value while disabling adaptive-rx: ethtool -C <interface> adaptive-rx off rx-usecs <microseconds> @@ -172,7 +172,7 @@ PERFORMANCE smaller window prevents congestion and facilitates better pacing, especially if/when MAC level flow control does not work well or when it is not supported on the machine. Experimentation may be necessary to attain - the correct value. This method is provided as a starting point fot the + the correct value. This method is provided as a starting point for the correct receive buffer size. Setting the min, max, and default receive buffer (RX_WINDOW) size is performed in the same manner as single connection. diff --git a/Documentation/networking/dccp.txt b/Documentation/networking/dccp.txt new file mode 100644 index 00000000000..55c575fcaf1 --- /dev/null +++ b/Documentation/networking/dccp.txt @@ -0,0 +1,207 @@ +DCCP protocol +============= + + +Contents +======== +- Introduction +- Missing features +- Socket options +- Sysctl variables +- IOCTLs +- Other tunables +- Notes + + +Introduction +============ +Datagram Congestion Control Protocol (DCCP) is an unreliable, connection +oriented protocol designed to solve issues present in UDP and TCP, particularly +for real-time and multimedia (streaming) traffic. +It divides into a base protocol (RFC 4340) and pluggable congestion control +modules called CCIDs. Like pluggable TCP congestion control, at least one CCID +needs to be enabled in order for the protocol to function properly. In the Linux +implementation, this is the TCP-like CCID2 (RFC 4341). Additional CCIDs, such as +the TCP-friendly CCID3 (RFC 4342), are optional. +For a brief introduction to CCIDs and suggestions for choosing a CCID to match +given applications, see section 10 of RFC 4340. + +It has a base protocol and pluggable congestion control IDs (CCIDs). + +DCCP is a Proposed Standard (RFC 2026), and the homepage for DCCP as a protocol +is at http://www.ietf.org/html.charters/dccp-charter.html + + +Missing features +================ +The Linux DCCP implementation does not currently support all the features that are +specified in RFCs 4340...42. + +The known bugs are at: + http://www.linuxfoundation.org/collaborate/workgroups/networking/todo#DCCP + +For more up-to-date versions of the DCCP implementation, please consider using +the experimental DCCP test tree; instructions for checking this out are on: +http://www.linuxfoundation.org/collaborate/workgroups/networking/dccp_testing#Experimental_DCCP_source_tree + + +Socket options +============== +DCCP_SOCKOPT_QPOLICY_ID sets the dequeuing policy for outgoing packets. It takes +a policy ID as argument and can only be set before the connection (i.e. changes +during an established connection are not supported). Currently, two policies are +defined: the "simple" policy (DCCPQ_POLICY_SIMPLE), which does nothing special, +and a priority-based variant (DCCPQ_POLICY_PRIO). The latter allows to pass an +u32 priority value as ancillary data to sendmsg(), where higher numbers indicate +a higher packet priority (similar to SO_PRIORITY). This ancillary data needs to +be formatted using a cmsg(3) message header filled in as follows: + cmsg->cmsg_level = SOL_DCCP; + cmsg->cmsg_type = DCCP_SCM_PRIORITY; + cmsg->cmsg_len = CMSG_LEN(sizeof(uint32_t)); /* or CMSG_LEN(4) */ + +DCCP_SOCKOPT_QPOLICY_TXQLEN sets the maximum length of the output queue. A zero +value is always interpreted as unbounded queue length. If different from zero, +the interpretation of this parameter depends on the current dequeuing policy +(see above): the "simple" policy will enforce a fixed queue size by returning +EAGAIN, whereas the "prio" policy enforces a fixed queue length by dropping the +lowest-priority packet first. The default value for this parameter is +initialised from /proc/sys/net/dccp/default/tx_qlen. + +DCCP_SOCKOPT_SERVICE sets the service. The specification mandates use of +service codes (RFC 4340, sec. 8.1.2); if this socket option is not set, +the socket will fall back to 0 (which means that no meaningful service code +is present). On active sockets this is set before connect(); specifying more +than one code has no effect (all subsequent service codes are ignored). The +case is different for passive sockets, where multiple service codes (up to 32) +can be set before calling bind(). + +DCCP_SOCKOPT_GET_CUR_MPS is read-only and retrieves the current maximum packet +size (application payload size) in bytes, see RFC 4340, section 14. + +DCCP_SOCKOPT_AVAILABLE_CCIDS is also read-only and returns the list of CCIDs +supported by the endpoint. The option value is an array of type uint8_t whose +size is passed as option length. The minimum array size is 4 elements, the +value returned in the optlen argument always reflects the true number of +built-in CCIDs. + +DCCP_SOCKOPT_CCID is write-only and sets both the TX and RX CCIDs at the same +time, combining the operation of the next two socket options. This option is +preferable over the latter two, since often applications will use the same +type of CCID for both directions; and mixed use of CCIDs is not currently well +understood. This socket option takes as argument at least one uint8_t value, or +an array of uint8_t values, which must match available CCIDS (see above). CCIDs +must be registered on the socket before calling connect() or listen(). + +DCCP_SOCKOPT_TX_CCID is read/write. It returns the current CCID (if set) or sets +the preference list for the TX CCID, using the same format as DCCP_SOCKOPT_CCID. +Please note that the getsockopt argument type here is `int', not uint8_t. + +DCCP_SOCKOPT_RX_CCID is analogous to DCCP_SOCKOPT_TX_CCID, but for the RX CCID. + +DCCP_SOCKOPT_SERVER_TIMEWAIT enables the server (listening socket) to hold +timewait state when closing the connection (RFC 4340, 8.3). The usual case is +that the closing server sends a CloseReq, whereupon the client holds timewait +state. When this boolean socket option is on, the server sends a Close instead +and will enter TIMEWAIT. This option must be set after accept() returns. + +DCCP_SOCKOPT_SEND_CSCOV and DCCP_SOCKOPT_RECV_CSCOV are used for setting the +partial checksum coverage (RFC 4340, sec. 9.2). The default is that checksums +always cover the entire packet and that only fully covered application data is +accepted by the receiver. Hence, when using this feature on the sender, it must +be enabled at the receiver, too with suitable choice of CsCov. + +DCCP_SOCKOPT_SEND_CSCOV sets the sender checksum coverage. Values in the + range 0..15 are acceptable. The default setting is 0 (full coverage), + values between 1..15 indicate partial coverage. +DCCP_SOCKOPT_RECV_CSCOV is for the receiver and has a different meaning: it + sets a threshold, where again values 0..15 are acceptable. The default + of 0 means that all packets with a partial coverage will be discarded. + Values in the range 1..15 indicate that packets with minimally such a + coverage value are also acceptable. The higher the number, the more + restrictive this setting (see [RFC 4340, sec. 9.2.1]). Partial coverage + settings are inherited to the child socket after accept(). + +The following two options apply to CCID 3 exclusively and are getsockopt()-only. +In either case, a TFRC info struct (defined in <linux/tfrc.h>) is returned. +DCCP_SOCKOPT_CCID_RX_INFO + Returns a `struct tfrc_rx_info' in optval; the buffer for optval and + optlen must be set to at least sizeof(struct tfrc_rx_info). +DCCP_SOCKOPT_CCID_TX_INFO + Returns a `struct tfrc_tx_info' in optval; the buffer for optval and + optlen must be set to at least sizeof(struct tfrc_tx_info). + +On unidirectional connections it is useful to close the unused half-connection +via shutdown (SHUT_WR or SHUT_RD): this will reduce per-packet processing costs. + + +Sysctl variables +================ +Several DCCP default parameters can be managed by the following sysctls +(sysctl net.dccp.default or /proc/sys/net/dccp/default): + +request_retries + The number of active connection initiation retries (the number of + Requests minus one) before timing out. In addition, it also governs + the behaviour of the other, passive side: this variable also sets + the number of times DCCP repeats sending a Response when the initial + handshake does not progress from RESPOND to OPEN (i.e. when no Ack + is received after the initial Request). This value should be greater + than 0, suggested is less than 10. Analogue of tcp_syn_retries. + +retries1 + How often a DCCP Response is retransmitted until the listening DCCP + side considers its connecting peer dead. Analogue of tcp_retries1. + +retries2 + The number of times a general DCCP packet is retransmitted. This has + importance for retransmitted acknowledgments and feature negotiation, + data packets are never retransmitted. Analogue of tcp_retries2. + +tx_ccid = 2 + Default CCID for the sender-receiver half-connection. Depending on the + choice of CCID, the Send Ack Vector feature is enabled automatically. + +rx_ccid = 2 + Default CCID for the receiver-sender half-connection; see tx_ccid. + +seq_window = 100 + The initial sequence window (sec. 7.5.2) of the sender. This influences + the local ackno validity and the remote seqno validity windows (7.5.1). + Values in the range Wmin = 32 (RFC 4340, 7.5.2) up to 2^32-1 can be set. + +tx_qlen = 5 + The size of the transmit buffer in packets. A value of 0 corresponds + to an unbounded transmit buffer. + +sync_ratelimit = 125 ms + The timeout between subsequent DCCP-Sync packets sent in response to + sequence-invalid packets on the same socket (RFC 4340, 7.5.4). The unit + of this parameter is milliseconds; a value of 0 disables rate-limiting. + + +IOCTLS +====== +FIONREAD + Works as in udp(7): returns in the `int' argument pointer the size of + the next pending datagram in bytes, or 0 when no datagram is pending. + + +Other tunables +============== +Per-route rto_min support + CCID-2 supports the RTAX_RTO_MIN per-route setting for the minimum value + of the RTO timer. This setting can be modified via the 'rto_min' option + of iproute2; for example: + > ip route change 10.0.0.0/24 rto_min 250j dev wlan0 + > ip route add 10.0.0.254/32 rto_min 800j dev wlan0 + > ip route show dev wlan0 + CCID-3 also supports the rto_min setting: it is used to define the lower + bound for the expiry of the nofeedback timer. This can be useful on LANs + with very low RTTs (e.g., loopback, Gbit ethernet). + + +Notes +===== +DCCP does not travel through NAT successfully at present on many boxes. This is +because the checksum covers the pseudo-header as per TCP and UDP. Linux NAT +support for DCCP has been added. diff --git a/Documentation/networking/decnet.txt b/Documentation/networking/decnet.txt index c6bd25f5d61..e12a4900cf7 100644 --- a/Documentation/networking/decnet.txt +++ b/Documentation/networking/decnet.txt @@ -4,7 +4,7 @@ 1) Other documentation.... o Project Home Pages - http://www.chygwyn.com/DECnet/ - Kernel info + http://www.chygwyn.com/ - Kernel info http://linux-decnet.sourceforge.net/ - Userland tools http://www.sourceforge.net/projects/linux-decnet/ - Status page @@ -60,7 +60,7 @@ operation of the local communications in any other way though. The kernel command line takes options looking like the following: - decnet=1,2 + decnet.addr=1,2 the two numbers are the node address 1,2 = 1.2 For 2.2.xx kernels and early 2.3.xx kernels, you must use a comma when specifying the @@ -82,7 +82,7 @@ ethernet address of your ethernet card has to be set according to the DECnet address of the node in order for it to be autoconfigured (and then appear in /proc/net/decnet_dev). There is a utility available at the above FTP sites called dn2ethaddr which can compute the correct ethernet -address to use. The address can be set by ifconfig either before at +address to use. The address can be set by ifconfig either before or at the time the device is brought up. If you are using RedHat you can add the line: @@ -176,8 +176,6 @@ information (_most_ of which _is_ _essential_) includes: - Which client caused the problem ? - How much data was being transferred ? - Was the network congested ? - - If there was a kernel panic, please run the output through ksymoops - before sending it to me, otherwise its _useless_. - How can the problem be reproduced ? - Can you use tcpdump to get a trace ? (N.B. Most (all?) versions of tcpdump don't understand how to dump DECnet properly, so including diff --git a/Documentation/networking/depca.txt b/Documentation/networking/depca.txt deleted file mode 100644 index 24c6b26e565..00000000000 --- a/Documentation/networking/depca.txt +++ /dev/null @@ -1,92 +0,0 @@ - -DE10x -===== - -Memory Addresses: - - SW1 SW2 SW3 SW4 -64K on on on on d0000 dbfff - off on on on c0000 cbfff - off off on on e0000 ebfff - -32K on on off on d8000 dbfff - off on off on c8000 cbfff - off off off on e8000 ebfff - -DBR ROM on on dc000 dffff - off on cc000 cffff - off off ec000 effff - -Note that the 2K mode is set by SW3/SW4 on/off or off/off. Address -assignment is through the RBSA register. - -I/O Address: - SW5 -0x300 on -0x200 off - -Remote Boot: - SW6 -Disable on -Enable off - -Remote Boot Timeout: - SW7 -2.5min on -30s off - -IRQ: - SW8 SW9 SW10 SW11 SW12 -2 on off off off off -3 off on off off off -4 off off on off off -5 off off off on off -7 off off off off on - -DE20x -===== - -Memory Size: - - SW3 SW4 -64K on on -32K off on -2K on off -2K off off - -Start Addresses: - - SW1 SW2 SW3 SW4 -64K on on on on c0000 cffff - on off on on d0000 dffff - off on on on e0000 effff - -32K on on off off c8000 cffff - on off off off d8000 dffff - off on off off e8000 effff - -Illegal off off - - - - - -I/O Address: - SW5 -0x300 on -0x200 off - -Remote Boot: - SW6 -Disable on -Enable off - -Remote Boot Timeout: - SW7 -2.5min on -30s off - -IRQ: - SW8 SW9 SW10 SW11 SW12 -5 on off off off off -9 off on off off off -10 off off on off off -11 off off off on off -15 off off off off on - diff --git a/Documentation/networking/dgrs.txt b/Documentation/networking/dgrs.txt deleted file mode 100644 index 1aa1bb3f94a..00000000000 --- a/Documentation/networking/dgrs.txt +++ /dev/null @@ -1,52 +0,0 @@ - The Digi International RightSwitch SE-X (dgrs) Device Driver - -This is a Linux driver for the Digi International RightSwitch SE-X -EISA and PCI boards. These are 4 (EISA) or 6 (PCI) port Ethernet -switches and a NIC combined into a single board. This driver can -be compiled into the kernel statically or as a loadable module. - -There is also a companion management tool, called "xrightswitch". -The management tool lets you watch the performance graphically, -as well as set the SNMP agent IP and IPX addresses, IEEE Spanning -Tree, and Aging time. These can also be set from the command line -when the driver is loaded. The driver command line options are: - - debug=NNN Debug printing level - dma=0/1 Disable/Enable DMA on PCI card - spantree=0/1 Disable/Enable IEEE spanning tree - hashexpire=NNN Change address aging time (default 300 seconds) - ipaddr=A,B,C,D Set SNMP agent IP address i.e. 199,86,8,221 - iptrap=A,B,C,D Set SNMP agent IP trap address i.e. 199,86,8,221 - ipxnet=NNN Set SNMP agent IPX network number - nicmode=0/1 Disable/Enable multiple NIC mode - -There is also a tool for setting up input and output packet filters -on each port, called "dgrsfilt". - -Both the management tool and the filtering tool are available -separately from the following FTP site: - - ftp://ftp.dgii.com/drivers/rightswitch/linux/ - -When nicmode=1, the board and driver operate as 4 or 6 individual -NIC ports (eth0...eth5) instead of as a switch. All switching -functions are disabled. In the future, the board firmware may include -a routing cache when in this mode. - -Copyright 1995-1996 Digi International Inc. - -This software may be used and distributed according to the terms -of the GNU General Public License, incorporated herein by reference. - -For information on purchasing a RightSwitch SE-4 or SE-6 -board, please contact Digi's sales department at 1-612-912-3444 -or 1-800-DIGIBRD. Outside the U.S., please check our Web page at: - - http://www.dgii.com - -for sales offices worldwide. Tech support is also available through -the channels listed on the Web site, although as long as I am -employed on networking products at Digi I will be happy to provide -any bug fixes that may be needed. - --Rick Richardson, rick@dgii.com diff --git a/Documentation/networking/dl2k.txt b/Documentation/networking/dl2k.txt index d460492037e..cba74f7a3ab 100644 --- a/Documentation/networking/dl2k.txt +++ b/Documentation/networking/dl2k.txt @@ -45,12 +45,13 @@ Now eth0 should active, you can test it by "ping" or get more information by "ifconfig". If tested ok, continue the next step. 4. cp dl2k.ko /lib/modules/`uname -r`/kernel/drivers/net -5. Add the following line to /etc/modprobe.conf: +5. Add the following line to /etc/modprobe.d/dl2k.conf: alias eth0 dl2k -6. Run "netconfig" or "netconf" to create configuration script ifcfg-eth0 +6. Run depmod to updated module indexes. +7. Run "netconfig" or "netconf" to create configuration script ifcfg-eth0 located at /etc/sysconfig/network-scripts or create it manually. [see - Configuration Script Sample] -7. Driver will automatically load and configure at next boot time. +8. Driver will automatically load and configure at next boot time. Compiling the Driver ==================== @@ -154,8 +155,8 @@ Installing the Driver ----------------- 1. Copy dl2k.o to the network modules directory, typically /lib/modules/2.x.x-xx/net or /lib/modules/2.x.x/kernel/drivers/net. - 2. Locate the boot module configuration file, most commonly modprobe.conf - or modules.conf (for 2.4) in the /etc directory. Add the following lines: + 2. Locate the boot module configuration file, most commonly in the + /etc/modprobe.d/ directory. Add the following lines: alias ethx dl2k options dl2k <optional parameters> @@ -173,7 +174,7 @@ Installing the Driver Parameter Description ===================== -You can install this driver without any addtional parameter. However, if you +You can install this driver without any additional parameter. However, if you are going to have extensive functions then it is necessary to set extra parameter. Below is a list of the command line parameters supported by the Linux device @@ -222,7 +223,7 @@ rx_timeout=n - Rx DMA wait time for an interrupt. reach timeout of n * 640 nano seconds. Set proper rx_coalesce and rx_timeout can reduce congestion collapse and overload which - has been a bottlenect for high speed network. + has been a bottleneck for high speed network. For example, rx_coalesce=10 rx_timeout=800. that is, hardware assert only 1 interrupt diff --git a/Documentation/networking/dm9000.txt b/Documentation/networking/dm9000.txt new file mode 100644 index 00000000000..5552e2e575c --- /dev/null +++ b/Documentation/networking/dm9000.txt @@ -0,0 +1,167 @@ +DM9000 Network driver +===================== + +Copyright 2008 Simtec Electronics, + Ben Dooks <ben@simtec.co.uk> <ben-linux@fluff.org> + + +Introduction +------------ + +This file describes how to use the DM9000 platform-device based network driver +that is contained in the files drivers/net/dm9000.c and drivers/net/dm9000.h. + +The driver supports three DM9000 variants, the DM9000E which is the first chip +supported as well as the newer DM9000A and DM9000B devices. It is currently +maintained and tested by Ben Dooks, who should be CC: to any patches for this +driver. + + +Defining the platform device +---------------------------- + +The minimum set of resources attached to the platform device are as follows: + + 1) The physical address of the address register + 2) The physical address of the data register + 3) The IRQ line the device's interrupt pin is connected to. + +These resources should be specified in that order, as the ordering of the +two address regions is important (the driver expects these to be address +and then data). + +An example from arch/arm/mach-s3c2410/mach-bast.c is: + +static struct resource bast_dm9k_resource[] = { + [0] = { + .start = S3C2410_CS5 + BAST_PA_DM9000, + .end = S3C2410_CS5 + BAST_PA_DM9000 + 3, + .flags = IORESOURCE_MEM, + }, + [1] = { + .start = S3C2410_CS5 + BAST_PA_DM9000 + 0x40, + .end = S3C2410_CS5 + BAST_PA_DM9000 + 0x40 + 0x3f, + .flags = IORESOURCE_MEM, + }, + [2] = { + .start = IRQ_DM9000, + .end = IRQ_DM9000, + .flags = IORESOURCE_IRQ | IORESOURCE_IRQ_HIGHLEVEL, + } +}; + +static struct platform_device bast_device_dm9k = { + .name = "dm9000", + .id = 0, + .num_resources = ARRAY_SIZE(bast_dm9k_resource), + .resource = bast_dm9k_resource, +}; + +Note the setting of the IRQ trigger flag in bast_dm9k_resource[2].flags, +as this will generate a warning if it is not present. The trigger from +the flags field will be passed to request_irq() when registering the IRQ +handler to ensure that the IRQ is setup correctly. + +This shows a typical platform device, without the optional configuration +platform data supplied. The next example uses the same resources, but adds +the optional platform data to pass extra configuration data: + +static struct dm9000_plat_data bast_dm9k_platdata = { + .flags = DM9000_PLATF_16BITONLY, +}; + +static struct platform_device bast_device_dm9k = { + .name = "dm9000", + .id = 0, + .num_resources = ARRAY_SIZE(bast_dm9k_resource), + .resource = bast_dm9k_resource, + .dev = { + .platform_data = &bast_dm9k_platdata, + } +}; + +The platform data is defined in include/linux/dm9000.h and described below. + + +Platform data +------------- + +Extra platform data for the DM9000 can describe the IO bus width to the +device, whether or not an external PHY is attached to the device and +the availability of an external configuration EEPROM. + +The flags for the platform data .flags field are as follows: + +DM9000_PLATF_8BITONLY + + The IO should be done with 8bit operations. + +DM9000_PLATF_16BITONLY + + The IO should be done with 16bit operations. + +DM9000_PLATF_32BITONLY + + The IO should be done with 32bit operations. + +DM9000_PLATF_EXT_PHY + + The chip is connected to an external PHY. + +DM9000_PLATF_NO_EEPROM + + This can be used to signify that the board does not have an + EEPROM, or that the EEPROM should be hidden from the user. + +DM9000_PLATF_SIMPLE_PHY + + Switch to using the simpler PHY polling method which does not + try and read the MII PHY state regularly. This is only available + when using the internal PHY. See the section on link state polling + for more information. + + The config symbol DM9000_FORCE_SIMPLE_PHY_POLL, Kconfig entry + "Force simple NSR based PHY polling" allows this flag to be + forced on at build time. + + +PHY Link state polling +---------------------- + +The driver keeps track of the link state and informs the network core +about link (carrier) availability. This is managed by several methods +depending on the version of the chip and on which PHY is being used. + +For the internal PHY, the original (and currently default) method is +to read the MII state, either when the status changes if we have the +necessary interrupt support in the chip or every two seconds via a +periodic timer. + +To reduce the overhead for the internal PHY, there is now the option +of using the DM9000_FORCE_SIMPLE_PHY_POLL config, or DM9000_PLATF_SIMPLE_PHY +platform data option to read the summary information without the +expensive MII accesses. This method is faster, but does not print +as much information. + +When using an external PHY, the driver currently has to poll the MII +link status as there is no method for getting an interrupt on link change. + + +DM9000A / DM9000B +----------------- + +These chips are functionally similar to the DM9000E and are supported easily +by the same driver. The features are: + + 1) Interrupt on internal PHY state change. This means that the periodic + polling of the PHY status may be disabled on these devices when using + the internal PHY. + + 2) TCP/UDP checksum offloading, which the driver does not currently support. + + +ethtool +------- + +The driver supports the ethtool interface for access to the driver +state information, the PHY state and the EEPROM. diff --git a/Documentation/networking/dmfe.txt b/Documentation/networking/dmfe.txt index 046363552d0..25320bf19c8 100644 --- a/Documentation/networking/dmfe.txt +++ b/Documentation/networking/dmfe.txt @@ -1,3 +1,5 @@ +Note: This driver doesn't have a maintainer. + Davicom DM9102(A)/DM9132/DM9801 fast ethernet driver for Linux. This program is free software; you can redistribute it and/or @@ -34,7 +36,7 @@ Next you should configure your network interface with a command similar to : ifconfig eth0 172.22.3.18 ^^^^^^^^^^^ - Your IP Adress + Your IP Address Then you may have to modify the default routing table with command : @@ -55,11 +57,10 @@ Test and make sure PCI latency is now correct for all cases. Authors: Sten Wang <sten_wang@davicom.com.tw > : Original Author -Tobias Ringstrom <tori@unhappy.mine.nu> : Current Maintainer Contributors: Marcelo Tosatti <marcelo@conectiva.com.br> -Alan Cox <alan@redhat.com> +Alan Cox <alan@lxorguk.ukuu.org.uk> Jeff Garzik <jgarzik@pobox.com> Vojtech Pavlik <vojtech@suse.cz> diff --git a/Documentation/networking/dns_resolver.txt b/Documentation/networking/dns_resolver.txt new file mode 100644 index 00000000000..d86adcdae42 --- /dev/null +++ b/Documentation/networking/dns_resolver.txt @@ -0,0 +1,157 @@ + =================== + DNS Resolver Module + =================== + +Contents: + + - Overview. + - Compilation. + - Setting up. + - Usage. + - Mechanism. + - Debugging. + + +======== +OVERVIEW +======== + +The DNS resolver module provides a way for kernel services to make DNS queries +by way of requesting a key of key type dns_resolver. These queries are +upcalled to userspace through /sbin/request-key. + +These routines must be supported by userspace tools dns.upcall, cifs.upcall and +request-key. It is under development and does not yet provide the full feature +set. The features it does support include: + + (*) Implements the dns_resolver key_type to contact userspace. + +It does not yet support the following AFS features: + + (*) Dns query support for AFSDB resource record. + +This code is extracted from the CIFS filesystem. + + +=========== +COMPILATION +=========== + +The module should be enabled by turning on the kernel configuration options: + + CONFIG_DNS_RESOLVER - tristate "DNS Resolver support" + + +========== +SETTING UP +========== + +To set up this facility, the /etc/request-key.conf file must be altered so that +/sbin/request-key can appropriately direct the upcalls. For example, to handle +basic dname to IPv4/IPv6 address resolution, the following line should be +added: + + #OP TYPE DESC CO-INFO PROGRAM ARG1 ARG2 ARG3 ... + #====== ============ ======= ======= ========================== + create dns_resolver * * /usr/sbin/cifs.upcall %k + +To direct a query for query type 'foo', a line of the following should be added +before the more general line given above as the first match is the one taken. + + create dns_resolver foo:* * /usr/sbin/dns.foo %k + + +===== +USAGE +===== + +To make use of this facility, one of the following functions that are +implemented in the module can be called after doing: + + #include <linux/dns_resolver.h> + + (1) int dns_query(const char *type, const char *name, size_t namelen, + const char *options, char **_result, time_t *_expiry); + + This is the basic access function. It looks for a cached DNS query and if + it doesn't find it, it upcalls to userspace to make a new DNS query, which + may then be cached. The key description is constructed as a string of the + form: + + [<type>:]<name> + + where <type> optionally specifies the particular upcall program to invoke, + and thus the type of query to do, and <name> specifies the string to be + looked up. The default query type is a straight hostname to IP address + set lookup. + + The name parameter is not required to be a NUL-terminated string, and its + length should be given by the namelen argument. + + The options parameter may be NULL or it may be a set of options + appropriate to the query type. + + The return value is a string appropriate to the query type. For instance, + for the default query type it is just a list of comma-separated IPv4 and + IPv6 addresses. The caller must free the result. + + The length of the result string is returned on success, and a negative + error code is returned otherwise. -EKEYREJECTED will be returned if the + DNS lookup failed. + + If _expiry is non-NULL, the expiry time (TTL) of the result will be + returned also. + +The kernel maintains an internal keyring in which it caches looked up keys. +This can be cleared by any process that has the CAP_SYS_ADMIN capability by +the use of KEYCTL_KEYRING_CLEAR on the keyring ID. + + +=============================== +READING DNS KEYS FROM USERSPACE +=============================== + +Keys of dns_resolver type can be read from userspace using keyctl_read() or +"keyctl read/print/pipe". + + +========= +MECHANISM +========= + +The dnsresolver module registers a key type called "dns_resolver". Keys of +this type are used to transport and cache DNS lookup results from userspace. + +When dns_query() is invoked, it calls request_key() to search the local +keyrings for a cached DNS result. If that fails to find one, it upcalls to +userspace to get a new result. + +Upcalls to userspace are made through the request_key() upcall vector, and are +directed by means of configuration lines in /etc/request-key.conf that tell +/sbin/request-key what program to run to instantiate the key. + +The upcall handler program is responsible for querying the DNS, processing the +result into a form suitable for passing to the keyctl_instantiate_key() +routine. This then passes the data to dns_resolver_instantiate() which strips +off and processes any options included in the data, and then attaches the +remainder of the string to the key as its payload. + +The upcall handler program should set the expiry time on the key to that of the +lowest TTL of all the records it has extracted a result from. This means that +the key will be discarded and recreated when the data it holds has expired. + +dns_query() returns a copy of the value attached to the key, or an error if +that is indicated instead. + +See <file:Documentation/security/keys-request-key.txt> for further +information about request-key function. + + +========= +DEBUGGING +========= + +Debugging messages can be turned on dynamically by writing a 1 into the +following file: + + /sys/module/dnsresolver/parameters/debug diff --git a/Documentation/networking/driver.txt b/Documentation/networking/driver.txt index 11fd0ef5ff5..da59e288413 100644 --- a/Documentation/networking/driver.txt +++ b/Documentation/networking/driver.txt @@ -1,22 +1,19 @@ -Documents about softnet driver issues in general can be found -at: - - http://www.firstfloor.org/~andi/softnet/ +Document about softnet driver issues Transmit path guidelines: -1) The hard_start_xmit method must never return '1' under any - normal circumstances. It is considered a hard error unless +1) The ndo_start_xmit method must not return NETDEV_TX_BUSY under + any normal circumstances. It is considered a hard error unless there is no way your device can tell ahead of time when it's transmit function will become busy. Instead it must maintain the queue properly. For example, for a driver implementing scatter-gather this means: - static int drv_hard_start_xmit(struct sk_buff *skb, - struct net_device *dev) + static netdev_tx_t drv_hard_start_xmit(struct sk_buff *skb, + struct net_device *dev) { - struct drv *dp = dev->priv; + struct drv *dp = netdev_priv(dev); lock_tx(dp); ... @@ -26,7 +23,7 @@ Transmit path guidelines: unlock_tx(dp); printk(KERN_ERR PFX "%s: BUG! Tx Ring full when queue awake!\n", dev->name); - return 1; + return NETDEV_TX_BUSY; } ... queue packet to card ... @@ -38,9 +35,10 @@ Transmit path guidelines: ... unlock_tx(dp); ... + return NETDEV_TX_OK; } - And then at the end of your TX reclaimation event handling: + And then at the end of your TX reclamation event handling: if (netif_queue_stopped(dp->dev) && TX_BUFFS_AVAIL(dp) > (MAX_SKB_FRAGS + 1)) @@ -61,12 +59,12 @@ Transmit path guidelines: TX_BUFFS_AVAIL(dp) > 0) netif_wake_queue(dp->dev); -2) Do not forget to update netdev->trans_start to jiffies after - each new tx packet is given to the hardware. +2) An ndo_start_xmit method must not modify the shared parts of a + cloned SKB. -3) Do not forget that once you return 0 from your hard_start_xmit - method, it is your driver's responsibility to free up the SKB - and in some finite amount of time. +3) Do not forget that once you return NETDEV_TX_OK from your + ndo_start_xmit method, it is your driver's responsibility to free + up the SKB and in some finite amount of time. For example, this means that it is not allowed for your TX mitigation scheme to let TX packets "hang out" in the TX @@ -74,8 +72,9 @@ Transmit path guidelines: This error can deadlock sockets waiting for send buffer room to be freed up. - If you return 1 from the hard_start_xmit method, you must not keep - any reference to that SKB and you must not attempt to free it up. + If you return NETDEV_TX_BUSY from the ndo_start_xmit method, you + must not keep any reference to that SKB and you must not attempt + to free it up. Probing guidelines: @@ -85,10 +84,10 @@ Probing guidelines: Close/stop guidelines: -1) After the dev->stop routine has been called, the hardware must +1) After the ndo_stop routine has been called, the hardware must not receive or transmit any data. All in flight packets must be aborted. If necessary, poll or wait for completion of any reset commands. -2) The dev->stop routine will be called by unregister_netdevice +2) The ndo_stop routine will be called by unregister_netdevice if device is still UP. diff --git a/Documentation/networking/e100.txt b/Documentation/networking/e100.txt index 4ef9f7cd5dc..f862cf3aff3 100644 --- a/Documentation/networking/e100.txt +++ b/Documentation/networking/e100.txt @@ -1,16 +1,17 @@ Linux* Base Driver for the Intel(R) PRO/100 Family of Adapters ============================================================== -November 17, 2004 - +March 15, 2011 Contents ======== - In This Release - Identifying Your Adapter +- Building and Installation - Driver Configuration Parameters - Additional Configurations +- Known Issues - Support @@ -18,18 +19,30 @@ In This Release =============== This file describes the Linux* Base Driver for the Intel(R) PRO/100 Family of -Adapters, version 3.3.x. This driver supports 2.4.x and 2.6.x kernels. +Adapters. This driver includes support for Itanium(R)2-based systems. + +For questions related to hardware requirements, refer to the documentation +supplied with your Intel PRO/100 adapter. + +The following features are now available in supported kernels: + - Native VLANs + - Channel Bonding (teaming) + - SNMP + +Channel Bonding documentation can be found in the Linux kernel source: +/Documentation/networking/bonding.txt + Identifying Your Adapter ======================== -For more information on how to identify your adapter, go to the Adapter & +For more information on how to identify your adapter, go to the Adapter & Driver ID Guide at: http://support.intel.com/support/network/adapter/pro100/21397.htm -For the latest Intel network drivers for Linux, refer to the following -website. In the search field, enter your adapter name or type, or use the +For the latest Intel network drivers for Linux, refer to the following +website. In the search field, enter your adapter name or type, or use the networking link on the left to search for your adapter: http://downloadfinder.intel.com/scripts-df/support_intel.asp @@ -40,101 +53,92 @@ Driver Configuration Parameters The default value for each parameter is generally the recommended setting, unless otherwise noted. -Rx Descriptors: Number of receive descriptors. A receive descriptor is a data - structure that describes a receive buffer and its attributes to the network - controller. The data in the descriptor is used by the controller to write - data from the controller to host memory. In the 3.0.x driver the valid - range for this parameter is 64-256. The default value is 64. This parameter - can be changed using the command - +Rx Descriptors: Number of receive descriptors. A receive descriptor is a data + structure that describes a receive buffer and its attributes to the network + controller. The data in the descriptor is used by the controller to write + data from the controller to host memory. In the 3.x.x driver the valid range + for this parameter is 64-256. The default value is 64. This parameter can be + changed using the command: + ethtool -G eth? rx n, where n is the number of desired rx descriptors. -Tx Descriptors: Number of transmit descriptors. A transmit descriptor is a - data structure that describes a transmit buffer and its attributes to the - network controller. The data in the descriptor is used by the controller to - read data from the host memory to the controller. In the 3.0.x driver the - valid range for this parameter is 64-256. The default value is 64. This - parameter can be changed using the command +Tx Descriptors: Number of transmit descriptors. A transmit descriptor is a data + structure that describes a transmit buffer and its attributes to the network + controller. The data in the descriptor is used by the controller to read + data from the host memory to the controller. In the 3.x.x driver the valid + range for this parameter is 64-256. The default value is 64. This parameter + can be changed using the command: ethtool -G eth? tx n, where n is the number of desired tx descriptors. -Speed/Duplex: The driver auto-negotiates the link speed and duplex settings by - default. Ethtool can be used as follows to force speed/duplex. +Speed/Duplex: The driver auto-negotiates the link speed and duplex settings by + default. The ethtool utility can be used as follows to force speed/duplex. ethtool -s eth? autoneg off speed {10|100} duplex {full|half} NOTE: setting the speed/duplex to incorrect values will cause the link to fail. -Event Log Message Level: The driver uses the message level flag to log events - to syslog. The message level can be set at driver load time. It can also be - set using the command +Event Log Message Level: The driver uses the message level flag to log events + to syslog. The message level can be set at driver load time. It can also be + set using the command: ethtool -s eth? msglvl n + Additional Configurations ========================= Configuring the Driver on Different Distributions ------------------------------------------------- - Configuring a network driver to load properly when the system is started is - distribution dependent. Typically, the configuration process involves adding - an alias line to /etc/modules.conf as well as editing other system startup - scripts and/or configuration files. Many popular Linux distributions ship - with tools to make these changes for you. To learn the proper way to - configure a network device for your system, refer to your distribution - documentation. If during this process you are asked for the driver or module - name, the name for the Linux Base Driver for the Intel PRO/100 Family of - Adapters is e100. + Configuring a network driver to load properly when the system is started is + distribution dependent. Typically, the configuration process involves adding + an alias line to /etc/modprobe.d/*.conf as well as editing other system + startup scripts and/or configuration files. Many popular Linux + distributions ship with tools to make these changes for you. To learn the + proper way to configure a network device for your system, refer to your + distribution documentation. If during this process you are asked for the + driver or module name, the name for the Linux Base Driver for the Intel + PRO/100 Family of Adapters is e100. - As an example, if you install the e100 driver for two PRO/100 adapters - (eth0 and eth1), add the following to modules.conf: + As an example, if you install the e100 driver for two PRO/100 adapters + (eth0 and eth1), add the following to a configuration file in /etc/modprobe.d/ alias eth0 e100 alias eth1 e100 Viewing Link Messages --------------------- - In order to see link messages and other Intel driver information on your - console, you must set the dmesg level up to six. This can be done by - entering the following on the command line before loading the e100 driver: + In order to see link messages and other Intel driver information on your + console, you must set the dmesg level up to six. This can be done by + entering the following on the command line before loading the e100 driver: dmesg -n 8 - If you wish to see all messages issued by the driver, including debug + If you wish to see all messages issued by the driver, including debug messages, set the dmesg level to eight. NOTE: This setting is not saved across reboots. - Ethtool + + ethtool ------- The driver utilizes the ethtool interface for driver configuration and - diagnostics, as well as displaying statistical information. Ethtool + diagnostics, as well as displaying statistical information. The ethtool version 1.6 or later is required for this functionality. - The latest release of ethtool can be found at: - http://sf.net/projects/gkernel. - - NOTE: This driver uses mii support from the kernel. As a result, when - there is no link, ethtool will report speed/duplex to be 10/half. - - NOTE: Ethtool 1.6 only supports a limited set of ethtool options. Support - for a more complete ethtool feature set can be enabled by upgrading - ethtool to ethtool-1.8.1. + The latest release of ethtool can be found from + http://ftp.kernel.org/pub/software/network/ethtool/ Enabling Wake on LAN* (WoL) --------------------------- - WoL is provided through the Ethtool* utility. Ethtool is included with Red - Hat* 8.0. For other Linux distributions, download and install Ethtool from - the following website: http://sourceforge.net/projects/gkernel. - - For instructions on enabling WoL with Ethtool, refer to the Ethtool man - page. + WoL is provided through the ethtool* utility. For instructions on enabling + WoL with ethtool, refer to the ethtool man page. WoL will be enabled on the system during the next shut down or reboot. For - this driver version, in order to enable WoL, the e100 driver must be + this driver version, in order to enable WoL, the e100 driver must be loaded when shutting down or rebooting the system. NAPI @@ -144,6 +148,25 @@ Additional Configurations See www.cyberus.ca/~hadi/usenix-paper.tgz for more information on NAPI. + Multiple Interfaces on Same Ethernet Broadcast Network + ------------------------------------------------------ + + Due to the default ARP behavior on Linux, it is not possible to have + one system on two IP networks in the same Ethernet broadcast domain + (non-partitioned switch) behave as expected. All Ethernet interfaces + will respond to IP traffic for any IP address assigned to the system. + This results in unbalanced receive traffic. + + If you have multiple interfaces in a server, either turn on ARP + filtering by + + (1) entering: echo 1 > /proc/sys/net/ipv4/conf/all/arp_filter + (this only works if your kernel's version is higher than 2.4.5), or + + (2) installing the interfaces in separate broadcast domains (either + in different switches or in a switch partitioned to VLANs). + + Support ======= @@ -151,20 +174,24 @@ For general information, go to the Intel support website at: http://support.intel.com + or the Intel Wired Networking project hosted by Sourceforge at: + + http://sourceforge.net/projects/e1000 + If an issue is identified with the released source code on the supported -kernel with a supported adapter, email the specific information related to -the issue to linux.nics@intel.com. +kernel with a supported adapter, email the specific information related to the +issue to e1000-devel@lists.sourceforge.net. License ======= -This software program is released under the terms of a license agreement -between you ('Licensee') and Intel. Do not use or load this software or any -associated materials (collectively, the 'Software') until you have carefully -read the full terms and conditions of the LICENSE located in this software -package. By loading or using the Software, you agree to the terms of this -Agreement. If you do not agree with the terms of this Agreement, do not -install or use the Software. +This software program is released under the terms of a license agreement +between you ('Licensee') and Intel. Do not use or load this software or any +associated materials (collectively, the 'Software') until you have carefully +read the full terms and conditions of the file COPYING located in this software +package. By loading or using the Software, you agree to the terms of this +Agreement. If you do not agree with the terms of this Agreement, do not install +or use the Software. * Other names and brands may be claimed as the property of others. diff --git a/Documentation/networking/e1000.txt b/Documentation/networking/e1000.txt index 2ebd4058d46..437b2099cce 100644 --- a/Documentation/networking/e1000.txt +++ b/Documentation/networking/e1000.txt @@ -1,380 +1,450 @@ -Linux* Base Driver for the Intel(R) PRO/1000 Family of Adapters -=============================================================== - -November 17, 2004 +Linux* Base Driver for Intel(R) Ethernet Network Connection +=========================================================== +Intel Gigabit Linux driver. +Copyright(c) 1999 - 2013 Intel Corporation. Contents ======== -- In This Release - Identifying Your Adapter - Command Line Parameters - Speed and Duplex Configuration - Additional Configurations -- Known Issues - Support - -In This Release -=============== - -This file describes the Linux* Base Driver for the Intel(R) PRO/1000 Family -of Adapters, version 5.x.x. - -For questions related to hardware requirements, refer to the documentation -supplied with your Intel PRO/1000 adapter. All hardware requirements listed -apply to use with Linux. - -Native VLANs are now available with supported kernels. - Identifying Your Adapter ======================== -For more information on how to identify your adapter, go to the Adapter & +For more information on how to identify your adapter, go to the Adapter & Driver ID Guide at: - http://support.intel.com/support/network/adapter/pro100/21397.htm + http://support.intel.com/support/go/network/adapter/idguide.htm -For the latest Intel network drivers for Linux, refer to the following -website. In the search field, enter your adapter name or type, or use the +For the latest Intel network drivers for Linux, refer to the following +website. In the search field, enter your adapter name or type, or use the networking link on the left to search for your adapter: - http://downloadfinder.intel.com/scripts-df/support_intel.asp + http://support.intel.com/support/go/network/adapter/home.htm Command Line Parameters ======================= -If the driver is built as a module, the following optional parameters are -used by entering them on the command line with the modprobe or insmod command -using this syntax: - - modprobe e1000 [<option>=<VAL1>,<VAL2>,...] - - insmod e1000 [<option>=<VAL1>,<VAL2>,...] - -For example, with two PRO/1000 PCI adapters, entering: - - insmod e1000 TxDescriptors=80,128 +The default value for each parameter is generally the recommended setting, +unless otherwise noted. -loads the e1000 driver with 80 TX descriptors for the first adapter and 128 TX -descriptors for the second adapter. +NOTES: For more information about the AutoNeg, Duplex, and Speed + parameters, see the "Speed and Duplex Configuration" section in + this document. -The default value for each parameter is generally the recommended setting, -unless otherwise noted. Also, if the driver is statically built into the -kernel, the driver is loaded with the default values for all the parameters. -Ethtool can be used to change some of the parameters at runtime. + For more information about the InterruptThrottleRate, + RxIntDelay, TxIntDelay, RxAbsIntDelay, and TxAbsIntDelay + parameters, see the application note at: + http://www.intel.com/design/network/applnots/ap450.htm - NOTES: For more information about the AutoNeg, Duplex, and Speed - parameters, see the "Speed and Duplex Configuration" section in - this document. +AutoNeg +------- +(Supported only on adapters with copper connections) +Valid Range: 0x01-0x0F, 0x20-0x2F +Default Value: 0x2F - For more information about the InterruptThrottleRate, RxIntDelay, - TxIntDelay, RxAbsIntDelay, and TxAbsIntDelay parameters, see the - application note at: - http://www.intel.com/design/network/applnots/ap450.htm +This parameter is a bit-mask that specifies the speed and duplex settings +advertised by the adapter. When this parameter is used, the Speed and +Duplex parameters must not be specified. - A descriptor describes a data buffer and attributes related to the - data buffer. This information is accessed by the hardware. +NOTE: Refer to the Speed and Duplex section of this readme for more + information on the AutoNeg parameter. -AutoNeg (adapters using copper connections only) -Valid Range: 0x01-0x0F, 0x20-0x2F -Default Value: 0x2F - This parameter is a bit mask that specifies which speed and duplex - settings the board advertises. When this parameter is used, the Speed and - Duplex parameters must not be specified. - NOTE: Refer to the Speed and Duplex section of this readme for more - information on the AutoNeg parameter. - -Duplex (adapters using copper connections only) -Valid Range: 0-2 (0=auto-negotiate, 1=half, 2=full) +Duplex +------ +(Supported only on adapters with copper connections) +Valid Range: 0-2 (0=auto-negotiate, 1=half, 2=full) Default Value: 0 - Defines the direction in which data is allowed to flow. Can be either one - or two-directional. If both Duplex and the link partner are set to auto- - negotiate, the board auto-detects the correct duplex. If the link partner - is forced (either full or half), Duplex defaults to half-duplex. + +This defines the direction in which data is allowed to flow. Can be +either one or two-directional. If both Duplex and the link partner are +set to auto-negotiate, the board auto-detects the correct duplex. If the +link partner is forced (either full or half), Duplex defaults to half- +duplex. FlowControl -Valid Range: 0-3 (0=none, 1=Rx only, 2=Tx only, 3=Rx&Tx) -Default: Read flow control settings from the EEPROM - This parameter controls the automatic generation(Tx) and response(Rx) to - Ethernet PAUSE frames. +----------- +Valid Range: 0-3 (0=none, 1=Rx only, 2=Tx only, 3=Rx&Tx) +Default Value: Reads flow control settings from the EEPROM + +This parameter controls the automatic generation(Tx) and response(Rx) +to Ethernet PAUSE frames. InterruptThrottleRate -Valid Range: 100-100000 (0=off, 1=dynamic) -Default Value: 8000 - This value represents the maximum number of interrupts per second the - controller generates. InterruptThrottleRate is another setting used in - interrupt moderation. Dynamic mode uses a heuristic algorithm to adjust - InterruptThrottleRate based on the current traffic load. -Un-supported Adapters: InterruptThrottleRate is NOT supported by 82542, 82543 - or 82544-based adapters. - - NOTE: InterruptThrottleRate takes precedence over the TxAbsIntDelay and - RxAbsIntDelay parameters. In other words, minimizing the receive - and/or transmit absolute delays does not force the controller to - generate more interrupts than what the Interrupt Throttle Rate - allows. - CAUTION: If you are using the Intel PRO/1000 CT Network Connection - (controller 82547), setting InterruptThrottleRate to a value - greater than 75,000, may hang (stop transmitting) adapters under - certain network conditions. If this occurs a NETDEV WATCHDOG - message is logged in the system event log. In addition, the - controller is automatically reset, restoring the network - connection. To eliminate the potential for the hang, ensure - that InterruptThrottleRate is set no greater than 75,000 and is - not set to 0. - NOTE: When e1000 is loaded with default settings and multiple adapters are - in use simultaneously, the CPU utilization may increase non-linearly. - In order to limit the CPU utilization without impacting the overall - throughput, we recommend that you load the driver as follows: - - insmod e1000.o InterruptThrottleRate=3000,3000,3000 - - This sets the InterruptThrottleRate to 3000 interrupts/sec for the - first, second, and third instances of the driver. The range of 2000 to - 3000 interrupts per second works on a majority of systems and is a - good starting point, but the optimal value will be platform-specific. - If CPU utilization is not a concern, use RX_POLLING (NAPI) and default - driver settings. +--------------------- +(not supported on Intel(R) 82542, 82543 or 82544-based adapters) +Valid Range: 0,1,3,4,100-100000 (0=off, 1=dynamic, 3=dynamic conservative, + 4=simplified balancing) +Default Value: 3 + +The driver can limit the amount of interrupts per second that the adapter +will generate for incoming packets. It does this by writing a value to the +adapter that is based on the maximum amount of interrupts that the adapter +will generate per second. + +Setting InterruptThrottleRate to a value greater or equal to 100 +will program the adapter to send out a maximum of that many interrupts +per second, even if more packets have come in. This reduces interrupt +load on the system and can lower CPU utilization under heavy load, +but will increase latency as packets are not processed as quickly. + +The default behaviour of the driver previously assumed a static +InterruptThrottleRate value of 8000, providing a good fallback value for +all traffic types,but lacking in small packet performance and latency. +The hardware can handle many more small packets per second however, and +for this reason an adaptive interrupt moderation algorithm was implemented. + +Since 7.3.x, the driver has two adaptive modes (setting 1 or 3) in which +it dynamically adjusts the InterruptThrottleRate value based on the traffic +that it receives. After determining the type of incoming traffic in the last +timeframe, it will adjust the InterruptThrottleRate to an appropriate value +for that traffic. + +The algorithm classifies the incoming traffic every interval into +classes. Once the class is determined, the InterruptThrottleRate value is +adjusted to suit that traffic type the best. There are three classes defined: +"Bulk traffic", for large amounts of packets of normal size; "Low latency", +for small amounts of traffic and/or a significant percentage of small +packets; and "Lowest latency", for almost completely small packets or +minimal traffic. + +In dynamic conservative mode, the InterruptThrottleRate value is set to 4000 +for traffic that falls in class "Bulk traffic". If traffic falls in the "Low +latency" or "Lowest latency" class, the InterruptThrottleRate is increased +stepwise to 20000. This default mode is suitable for most applications. + +For situations where low latency is vital such as cluster or +grid computing, the algorithm can reduce latency even more when +InterruptThrottleRate is set to mode 1. In this mode, which operates +the same as mode 3, the InterruptThrottleRate will be increased stepwise to +70000 for traffic in class "Lowest latency". + +In simplified mode the interrupt rate is based on the ratio of TX and +RX traffic. If the bytes per second rate is approximately equal, the +interrupt rate will drop as low as 2000 interrupts per second. If the +traffic is mostly transmit or mostly receive, the interrupt rate could +be as high as 8000. + +Setting InterruptThrottleRate to 0 turns off any interrupt moderation +and may improve small packet latency, but is generally not suitable +for bulk throughput traffic. + +NOTE: InterruptThrottleRate takes precedence over the TxAbsIntDelay and + RxAbsIntDelay parameters. In other words, minimizing the receive + and/or transmit absolute delays does not force the controller to + generate more interrupts than what the Interrupt Throttle Rate + allows. + +CAUTION: If you are using the Intel(R) PRO/1000 CT Network Connection + (controller 82547), setting InterruptThrottleRate to a value + greater than 75,000, may hang (stop transmitting) adapters + under certain network conditions. If this occurs a NETDEV + WATCHDOG message is logged in the system event log. In + addition, the controller is automatically reset, restoring + the network connection. To eliminate the potential for the + hang, ensure that InterruptThrottleRate is set no greater + than 75,000 and is not set to 0. + +NOTE: When e1000 is loaded with default settings and multiple adapters + are in use simultaneously, the CPU utilization may increase non- + linearly. In order to limit the CPU utilization without impacting + the overall throughput, we recommend that you load the driver as + follows: + + modprobe e1000 InterruptThrottleRate=3000,3000,3000 + + This sets the InterruptThrottleRate to 3000 interrupts/sec for + the first, second, and third instances of the driver. The range + of 2000 to 3000 interrupts per second works on a majority of + systems and is a good starting point, but the optimal value will + be platform-specific. If CPU utilization is not a concern, use + RX_POLLING (NAPI) and default driver settings. RxDescriptors -Valid Range: 80-256 for 82542 and 82543-based adapters - 80-4096 for all other supported adapters +------------- +Valid Range: 80-256 for 82542 and 82543-based adapters + 80-4096 for all other supported adapters Default Value: 256 - This value is the number of receive descriptors allocated by the driver. - Increasing this value allows the driver to buffer more incoming packets. - Each descriptor is 16 bytes. A receive buffer is allocated for each - descriptor and can either be 2048 or 4096 bytes long, depending on the MTU - setting. An incoming packet can span one or more receive descriptors. - The maximum MTU size is 16110. +This value specifies the number of receive buffer descriptors allocated +by the driver. Increasing this value allows the driver to buffer more +incoming packets, at the expense of increased system memory utilization. - NOTE: MTU designates the frame size. It only needs to be set for Jumbo - Frames. - NOTE: Depending on the available system resources, the request for a - higher number of receive descriptors may be denied. In this case, - use a lower number. +Each descriptor is 16 bytes. A receive buffer is also allocated for each +descriptor and can be either 2048, 4096, 8192, or 16384 bytes, depending +on the MTU setting. The maximum MTU size is 16110. + +NOTE: MTU designates the frame size. It only needs to be set for Jumbo + Frames. Depending on the available system resources, the request + for a higher number of receive descriptors may be denied. In this + case, use a lower number. RxIntDelay -Valid Range: 0-65535 (0=off) +---------- +Valid Range: 0-65535 (0=off) Default Value: 0 - This value delays the generation of receive interrupts in units of 1.024 - microseconds. Receive interrupt reduction can improve CPU efficiency if - properly tuned for specific network traffic. Increasing this value adds - extra latency to frame reception and can end up decreasing the throughput - of TCP traffic. If the system is reporting dropped receives, this value - may be set too high, causing the driver to run out of available receive - descriptors. - - CAUTION: When setting RxIntDelay to a value other than 0, adapters may - hang (stop transmitting) under certain network conditions. If - this occurs a NETDEV WATCHDOG message is logged in the system - event log. In addition, the controller is automatically reset, - restoring the network connection. To eliminate the potential for - the hang ensure that RxIntDelay is set to 0. - -RxAbsIntDelay (82540, 82545 and later adapters only) -Valid Range: 0-65535 (0=off) + +This value delays the generation of receive interrupts in units of 1.024 +microseconds. Receive interrupt reduction can improve CPU efficiency if +properly tuned for specific network traffic. Increasing this value adds +extra latency to frame reception and can end up decreasing the throughput +of TCP traffic. If the system is reporting dropped receives, this value +may be set too high, causing the driver to run out of available receive +descriptors. + +CAUTION: When setting RxIntDelay to a value other than 0, adapters may + hang (stop transmitting) under certain network conditions. If + this occurs a NETDEV WATCHDOG message is logged in the system + event log. In addition, the controller is automatically reset, + restoring the network connection. To eliminate the potential + for the hang ensure that RxIntDelay is set to 0. + +RxAbsIntDelay +------------- +(This parameter is supported only on 82540, 82545 and later adapters.) +Valid Range: 0-65535 (0=off) Default Value: 128 - This value, in units of 1.024 microseconds, limits the delay in which a - receive interrupt is generated. Useful only if RxIntDelay is non-zero, - this value ensures that an interrupt is generated after the initial - packet is received within the set amount of time. Proper tuning, - along with RxIntDelay, may improve traffic throughput in specific network - conditions. - -Speed (adapters using copper connections only) + +This value, in units of 1.024 microseconds, limits the delay in which a +receive interrupt is generated. Useful only if RxIntDelay is non-zero, +this value ensures that an interrupt is generated after the initial +packet is received within the set amount of time. Proper tuning, +along with RxIntDelay, may improve traffic throughput in specific network +conditions. + +Speed +----- +(This parameter is supported only on adapters with copper connections.) Valid Settings: 0, 10, 100, 1000 -Default Value: 0 (auto-negotiate at all supported speeds) - Speed forces the line speed to the specified value in megabits per second - (Mbps). If this parameter is not specified or is set to 0 and the link - partner is set to auto-negotiate, the board will auto-detect the correct - speed. Duplex should also be set when Speed is set to either 10 or 100. +Default Value: 0 (auto-negotiate at all supported speeds) + +Speed forces the line speed to the specified value in megabits per second +(Mbps). If this parameter is not specified or is set to 0 and the link +partner is set to auto-negotiate, the board will auto-detect the correct +speed. Duplex should also be set when Speed is set to either 10 or 100. TxDescriptors -Valid Range: 80-256 for 82542 and 82543-based adapters - 80-4096 for all other supported adapters +------------- +Valid Range: 80-256 for 82542 and 82543-based adapters + 80-4096 for all other supported adapters Default Value: 256 - This value is the number of transmit descriptors allocated by the driver. - Increasing this value allows the driver to queue more transmits. Each - descriptor is 16 bytes. - NOTE: Depending on the available system resources, the request for a - higher number of transmit descriptors may be denied. In this case, - use a lower number. +This value is the number of transmit descriptors allocated by the driver. +Increasing this value allows the driver to queue more transmits. Each +descriptor is 16 bytes. + +NOTE: Depending on the available system resources, the request for a + higher number of transmit descriptors may be denied. In this case, + use a lower number. + +TxDescriptorStep +---------------- +Valid Range: 1 (use every Tx Descriptor) + 4 (use every 4th Tx Descriptor) + +Default Value: 1 (use every Tx Descriptor) + +On certain non-Intel architectures, it has been observed that intense TX +traffic bursts of short packets may result in an improper descriptor +writeback. If this occurs, the driver will report a "TX Timeout" and reset +the adapter, after which the transmit flow will restart, though data may +have stalled for as much as 10 seconds before it resumes. + +The improper writeback does not occur on the first descriptor in a system +memory cache-line, which is typically 32 bytes, or 4 descriptors long. + +Setting TxDescriptorStep to a value of 4 will ensure that all TX descriptors +are aligned to the start of a system memory cache line, and so this problem +will not occur. + +NOTES: Setting TxDescriptorStep to 4 effectively reduces the number of + TxDescriptors available for transmits to 1/4 of the normal allocation. + This has a possible negative performance impact, which may be + compensated for by allocating more descriptors using the TxDescriptors + module parameter. + + There are other conditions which may result in "TX Timeout", which will + not be resolved by the use of the TxDescriptorStep parameter. As the + issue addressed by this parameter has never been observed on Intel + Architecture platforms, it should not be used on Intel platforms. TxIntDelay -Valid Range: 0-65535 (0=off) +---------- +Valid Range: 0-65535 (0=off) Default Value: 64 - This value delays the generation of transmit interrupts in units of - 1.024 microseconds. Transmit interrupt reduction can improve CPU - efficiency if properly tuned for specific network traffic. If the - system is reporting dropped transmits, this value may be set too high - causing the driver to run out of available transmit descriptors. - -TxAbsIntDelay (82540, 82545 and later adapters only) -Valid Range: 0-65535 (0=off) + +This value delays the generation of transmit interrupts in units of +1.024 microseconds. Transmit interrupt reduction can improve CPU +efficiency if properly tuned for specific network traffic. If the +system is reporting dropped transmits, this value may be set too high +causing the driver to run out of available transmit descriptors. + +TxAbsIntDelay +------------- +(This parameter is supported only on 82540, 82545 and later adapters.) +Valid Range: 0-65535 (0=off) Default Value: 64 - This value, in units of 1.024 microseconds, limits the delay in which a - transmit interrupt is generated. Useful only if TxIntDelay is non-zero, - this value ensures that an interrupt is generated after the initial - packet is sent on the wire within the set amount of time. Proper tuning, - along with TxIntDelay, may improve traffic throughput in specific - network conditions. - -XsumRX (not available on the 82542-based adapter) -Valid Range: 0-1 + +This value, in units of 1.024 microseconds, limits the delay in which a +transmit interrupt is generated. Useful only if TxIntDelay is non-zero, +this value ensures that an interrupt is generated after the initial +packet is sent on the wire within the set amount of time. Proper tuning, +along with TxIntDelay, may improve traffic throughput in specific +network conditions. + +XsumRX +------ +(This parameter is NOT supported on the 82542-based adapter.) +Valid Range: 0-1 Default Value: 1 - A value of '1' indicates that the driver should enable IP checksum - offload for received packets (both UDP and TCP) to the adapter hardware. -Speed and Duplex Configuration -============================== +A value of '1' indicates that the driver should enable IP checksum +offload for received packets (both UDP and TCP) to the adapter hardware. -Three keywords are used to control the speed and duplex configuration. These -keywords are Speed, Duplex, and AutoNeg. +Copybreak +--------- +Valid Range: 0-xxxxxxx (0=off) +Default Value: 256 +Usage: insmod e1000.ko copybreak=128 -If the board uses a fiber interface, these keywords are ignored, and the -fiber interface board only links at 1000 Mbps full-duplex. +Driver copies all packets below or equaling this size to a fresh RX +buffer before handing it up the stack. -For copper-based boards, the keywords interact as follows: +This parameter is different than other parameters, in that it is a +single (not 1,1,1 etc.) parameter applied to all driver instances and +it is also available during runtime at +/sys/module/e1000/parameters/copybreak - The default operation is auto-negotiate. The board advertises all supported - speed and duplex combinations, and it links at the highest common speed and - duplex mode IF the link partner is set to auto-negotiate. +SmartPowerDownEnable +-------------------- +Valid Range: 0-1 +Default Value: 0 (disabled) - If Speed = 1000, limited auto-negotiation is enabled and only 1000 Mbps is - advertised (The 1000BaseT spec requires auto-negotiation.) +Allows PHY to turn off in lower power states. The user can turn off +this parameter in supported chipsets. - If Speed = 10 or 100, then both Speed and Duplex should be set. Auto- - negotiation is disabled, and the AutoNeg parameter is ignored. Partner SHOULD - also be forced. +KumeranLockLoss +--------------- +Valid Range: 0-1 +Default Value: 1 (enabled) -The AutoNeg parameter is used when more control is required over the auto- -negotiation process. When this parameter is used, Speed and Duplex parameters -must not be specified. The following table describes supported values for the -AutoNeg parameter: +This workaround skips resetting the PHY at shutdown for the initial +silicon releases of ICH8 systems. -Speed (Mbps) 1000 100 100 10 10 -Duplex Full Full Half Full Half -Value (in base 16) 0x20 0x08 0x04 0x02 0x01 +Speed and Duplex Configuration +============================== -Example: insmod e1000 AutoNeg=0x03, loads e1000 and specifies (10 full duplex, -10 half duplex) for negotiation with the peer. +Three keywords are used to control the speed and duplex configuration. +These keywords are Speed, Duplex, and AutoNeg. -Note that setting AutoNeg does not guarantee that the board will link at the -highest specified speed or duplex mode, but the board will link at the -highest possible speed/duplex of the link partner IF the link partner is also -set to auto-negotiate. If the link partner is forced speed/duplex, the -adapter MUST be forced to the same speed/duplex. +If the board uses a fiber interface, these keywords are ignored, and the +fiber interface board only links at 1000 Mbps full-duplex. +For copper-based boards, the keywords interact as follows: -Additional Configurations -========================= + The default operation is auto-negotiate. The board advertises all + supported speed and duplex combinations, and it links at the highest + common speed and duplex mode IF the link partner is set to auto-negotiate. + + If Speed = 1000, limited auto-negotiation is enabled and only 1000 Mbps + is advertised (The 1000BaseT spec requires auto-negotiation.) + + If Speed = 10 or 100, then both Speed and Duplex should be set. Auto- + negotiation is disabled, and the AutoNeg parameter is ignored. Partner + SHOULD also be forced. - Configuring the Driver on Different Distributions - ------------------------------------------------- +The AutoNeg parameter is used when more control is required over the +auto-negotiation process. It should be used when you wish to control which +speed and duplex combinations are advertised during the auto-negotiation +process. - Configuring a network driver to load properly when the system is started is - distribution dependent. Typically, the configuration process involves adding - an alias line to /etc/modules.conf as well as editing other system startup - scripts and/or configuration files. Many popular Linux distributions ship - with tools to make these changes for you. To learn the proper way to - configure a network device for your system, refer to your distribution - documentation. If during this process you are asked for the driver or module - name, the name for the Linux Base Driver for the Intel PRO/1000 Family of - Adapters is e1000. +The parameter may be specified as either a decimal or hexadecimal value as +determined by the bitmap below. - As an example, if you install the e1000 driver for two PRO/1000 adapters - (eth0 and eth1) and set the speed and duplex to 10full and 100half, add the - following to modules.conf: +Bit position 7 6 5 4 3 2 1 0 +Decimal Value 128 64 32 16 8 4 2 1 +Hex value 80 40 20 10 8 4 2 1 +Speed (Mbps) N/A N/A 1000 N/A 100 100 10 10 +Duplex Full Full Half Full Half - alias eth0 e1000 - alias eth1 e1000 - options e1000 Speed=10,100 Duplex=2,1 +Some examples of using AutoNeg: - Viewing Link Messages - --------------------- + modprobe e1000 AutoNeg=0x01 (Restricts autonegotiation to 10 Half) + modprobe e1000 AutoNeg=1 (Same as above) + modprobe e1000 AutoNeg=0x02 (Restricts autonegotiation to 10 Full) + modprobe e1000 AutoNeg=0x03 (Restricts autonegotiation to 10 Half or 10 Full) + modprobe e1000 AutoNeg=0x04 (Restricts autonegotiation to 100 Half) + modprobe e1000 AutoNeg=0x05 (Restricts autonegotiation to 10 Half or 100 + Half) + modprobe e1000 AutoNeg=0x020 (Restricts autonegotiation to 1000 Full) + modprobe e1000 AutoNeg=32 (Same as above) - Link messages will not be displayed to the console if the distribution is - restricting system messages. In order to see network driver link messages on - your console, set dmesg to eight by entering the following: +Note that when this parameter is used, Speed and Duplex must not be specified. - dmesg -n 8 +If the link partner is forced to a specific speed and duplex, then this +parameter should not be used. Instead, use the Speed and Duplex parameters +previously mentioned to force the adapter to the same speed and duplex. - NOTE: This setting is not saved across reboots. +Additional Configurations +========================= Jumbo Frames ------------ + Jumbo Frames support is enabled by changing the MTU to a value larger than + the default of 1500. Use the ifconfig command to increase the MTU size. + For example: - The driver supports Jumbo Frames for all adapters except 82542-based - adapters. Jumbo Frames support is enabled by changing the MTU to a value - larger than the default of 1500. Use the ifconfig command to increase the - MTU size. For example: + ifconfig eth<x> mtu 9000 up - ifconfig ethx mtu 9000 up + This setting is not saved across reboots. It can be made permanent if + you add: - The maximum MTU setting for Jumbo Frames is 16110. This value coincides - with the maximum Jumbo Frames size of 16128. + MTU=9000 - NOTE: Jumbo Frames are supported at 1000 Mbps only. Using Jumbo Frames at - 10 or 100 Mbps may result in poor performance or loss of link. + to the file /etc/sysconfig/network-scripts/ifcfg-eth<x>. This example + applies to the Red Hat distributions; other distributions may store this + setting in a different location. + Notes: + Degradation in throughput performance may be observed in some Jumbo frames + environments. If this is observed, increasing the application's socket buffer + size and/or increasing the /proc/sys/net/ipv4/tcp_*mem entry values may help. + See the specific application manual and /usr/src/linux*/Documentation/ + networking/ip-sysctl.txt for more details. - NOTE: MTU designates the frame size. To enable Jumbo Frames, increase the - MTU size on the interface beyond 1500. + - The maximum MTU setting for Jumbo Frames is 16110. This value coincides + with the maximum Jumbo Frames size of 16128. - Ethtool - ------- + - Using Jumbo frames at 10 or 100 Mbps is not supported and may result in + poor performance or loss of link. + + - Adapters based on the Intel(R) 82542 and 82573V/E controller do not + support Jumbo Frames. These correspond to the following product names: + Intel(R) PRO/1000 Gigabit Server Adapter + Intel(R) PRO/1000 PM Network Connection + ethtool + ------- The driver utilizes the ethtool interface for driver configuration and - diagnostics, as well as displaying statistical information. Ethtool + diagnostics, as well as displaying statistical information. The ethtool version 1.6 or later is required for this functionality. The latest release of ethtool can be found from - http://sf.net/projects/gkernel. - - NOTE: Ethtool 1.6 only supports a limited set of ethtool options. Support - for a more complete ethtool feature set can be enabled by upgrading - ethtool to ethtool-1.8.1. + http://ftp.kernel.org/pub/software/network/ethtool/ Enabling Wake on LAN* (WoL) --------------------------- + WoL is configured through the ethtool* utility. - WoL is configured through the Ethtool* utility. Ethtool is included with - all versions of Red Hat after Red Hat 7.2. For other Linux distributions, - download and install Ethtool from the following website: - http://sourceforge.net/projects/gkernel. - - For instructions on enabling WoL with Ethtool, refer to the website listed - above. - - WoL will be enabled on the system during the next shut down or reboot. - For this driver version, in order to enable WoL, the e1000 driver must be + WoL will be enabled on the system during the next shut down or reboot. + For this driver version, in order to enable WoL, the e1000 driver must be loaded when shutting down or rebooting the system. - NAPI - ---- - - NAPI (Rx polling mode) is supported in the e1000 driver. NAPI is enabled - or disabled based on the configuration of the kernel. - - See www.cyberus.ca/~hadi/usenix-paper.tgz for more information on NAPI. - - -Known Issues -============ - - Jumbo Frames System Requirement - ------------------------------- - - Memory allocation failures have been observed on Linux systems with 64 MB - of RAM or less that are running Jumbo Frames. If you are using Jumbo Frames, - your system may require more than the advertised minimum requirement of 64 MB - of system memory. - - Support ======= @@ -382,20 +452,10 @@ For general information, go to the Intel support website at: http://support.intel.com -If an issue is identified with the released source code on the supported -kernel with a supported adapter, email the specific information related to -the issue to linux.nics@intel.com. - - -License -======= +or the Intel Wired Networking project hosted by Sourceforge at: -This software program is released under the terms of a license agreement -between you ('Licensee') and Intel. Do not use or load this software or any -associated materials (collectively, the 'Software') until you have carefully -read the full terms and conditions of the LICENSE located in this software -package. By loading or using the Software, you agree to the terms of this -Agreement. If you do not agree with the terms of this Agreement, do not -install or use the Software. + http://sourceforge.net/projects/e1000 -* Other names and brands may be claimed as the property of others. +If an issue is identified with the released source code on the supported +kernel with a supported adapter, email the specific information related +to the issue to e1000-devel@lists.sf.net diff --git a/Documentation/networking/e1000e.txt b/Documentation/networking/e1000e.txt new file mode 100644 index 00000000000..ad2d9f38ce1 --- /dev/null +++ b/Documentation/networking/e1000e.txt @@ -0,0 +1,312 @@ +Linux* Driver for Intel(R) Ethernet Network Connection +====================================================== + +Intel Gigabit Linux driver. +Copyright(c) 1999 - 2013 Intel Corporation. + +Contents +======== + +- Identifying Your Adapter +- Command Line Parameters +- Additional Configurations +- Support + +Identifying Your Adapter +======================== + +The e1000e driver supports all PCI Express Intel(R) Gigabit Network +Connections, except those that are 82575, 82576 and 82580-based*. + +* NOTE: The Intel(R) PRO/1000 P Dual Port Server Adapter is supported by + the e1000 driver, not the e1000e driver due to the 82546 part being used + behind a PCI Express bridge. + +For more information on how to identify your adapter, go to the Adapter & +Driver ID Guide at: + + http://support.intel.com/support/go/network/adapter/idguide.htm + +For the latest Intel network drivers for Linux, refer to the following +website. In the search field, enter your adapter name or type, or use the +networking link on the left to search for your adapter: + + http://support.intel.com/support/go/network/adapter/home.htm + +Command Line Parameters +======================= + +The default value for each parameter is generally the recommended setting, +unless otherwise noted. + +NOTES: For more information about the InterruptThrottleRate, + RxIntDelay, TxIntDelay, RxAbsIntDelay, and TxAbsIntDelay + parameters, see the application note at: + http://www.intel.com/design/network/applnots/ap450.htm + +InterruptThrottleRate +--------------------- +Valid Range: 0,1,3,4,100-100000 (0=off, 1=dynamic, 3=dynamic conservative, + 4=simplified balancing) +Default Value: 3 + +The driver can limit the amount of interrupts per second that the adapter +will generate for incoming packets. It does this by writing a value to the +adapter that is based on the maximum amount of interrupts that the adapter +will generate per second. + +Setting InterruptThrottleRate to a value greater or equal to 100 +will program the adapter to send out a maximum of that many interrupts +per second, even if more packets have come in. This reduces interrupt +load on the system and can lower CPU utilization under heavy load, +but will increase latency as packets are not processed as quickly. + +The default behaviour of the driver previously assumed a static +InterruptThrottleRate value of 8000, providing a good fallback value for +all traffic types, but lacking in small packet performance and latency. +The hardware can handle many more small packets per second however, and +for this reason an adaptive interrupt moderation algorithm was implemented. + +The driver has two adaptive modes (setting 1 or 3) in which +it dynamically adjusts the InterruptThrottleRate value based on the traffic +that it receives. After determining the type of incoming traffic in the last +timeframe, it will adjust the InterruptThrottleRate to an appropriate value +for that traffic. + +The algorithm classifies the incoming traffic every interval into +classes. Once the class is determined, the InterruptThrottleRate value is +adjusted to suit that traffic type the best. There are three classes defined: +"Bulk traffic", for large amounts of packets of normal size; "Low latency", +for small amounts of traffic and/or a significant percentage of small +packets; and "Lowest latency", for almost completely small packets or +minimal traffic. + +In dynamic conservative mode, the InterruptThrottleRate value is set to 4000 +for traffic that falls in class "Bulk traffic". If traffic falls in the "Low +latency" or "Lowest latency" class, the InterruptThrottleRate is increased +stepwise to 20000. This default mode is suitable for most applications. + +For situations where low latency is vital such as cluster or +grid computing, the algorithm can reduce latency even more when +InterruptThrottleRate is set to mode 1. In this mode, which operates +the same as mode 3, the InterruptThrottleRate will be increased stepwise to +70000 for traffic in class "Lowest latency". + +In simplified mode the interrupt rate is based on the ratio of TX and +RX traffic. If the bytes per second rate is approximately equal, the +interrupt rate will drop as low as 2000 interrupts per second. If the +traffic is mostly transmit or mostly receive, the interrupt rate could +be as high as 8000. + +Setting InterruptThrottleRate to 0 turns off any interrupt moderation +and may improve small packet latency, but is generally not suitable +for bulk throughput traffic. + +NOTE: InterruptThrottleRate takes precedence over the TxAbsIntDelay and + RxAbsIntDelay parameters. In other words, minimizing the receive + and/or transmit absolute delays does not force the controller to + generate more interrupts than what the Interrupt Throttle Rate + allows. + +NOTE: When e1000e is loaded with default settings and multiple adapters + are in use simultaneously, the CPU utilization may increase non- + linearly. In order to limit the CPU utilization without impacting + the overall throughput, we recommend that you load the driver as + follows: + + modprobe e1000e InterruptThrottleRate=3000,3000,3000 + + This sets the InterruptThrottleRate to 3000 interrupts/sec for + the first, second, and third instances of the driver. The range + of 2000 to 3000 interrupts per second works on a majority of + systems and is a good starting point, but the optimal value will + be platform-specific. If CPU utilization is not a concern, use + RX_POLLING (NAPI) and default driver settings. + +RxIntDelay +---------- +Valid Range: 0-65535 (0=off) +Default Value: 0 + +This value delays the generation of receive interrupts in units of 1.024 +microseconds. Receive interrupt reduction can improve CPU efficiency if +properly tuned for specific network traffic. Increasing this value adds +extra latency to frame reception and can end up decreasing the throughput +of TCP traffic. If the system is reporting dropped receives, this value +may be set too high, causing the driver to run out of available receive +descriptors. + +CAUTION: When setting RxIntDelay to a value other than 0, adapters may + hang (stop transmitting) under certain network conditions. If + this occurs a NETDEV WATCHDOG message is logged in the system + event log. In addition, the controller is automatically reset, + restoring the network connection. To eliminate the potential + for the hang ensure that RxIntDelay is set to 0. + +RxAbsIntDelay +------------- +Valid Range: 0-65535 (0=off) +Default Value: 8 + +This value, in units of 1.024 microseconds, limits the delay in which a +receive interrupt is generated. Useful only if RxIntDelay is non-zero, +this value ensures that an interrupt is generated after the initial +packet is received within the set amount of time. Proper tuning, +along with RxIntDelay, may improve traffic throughput in specific network +conditions. + +TxIntDelay +---------- +Valid Range: 0-65535 (0=off) +Default Value: 8 + +This value delays the generation of transmit interrupts in units of +1.024 microseconds. Transmit interrupt reduction can improve CPU +efficiency if properly tuned for specific network traffic. If the +system is reporting dropped transmits, this value may be set too high +causing the driver to run out of available transmit descriptors. + +TxAbsIntDelay +------------- +Valid Range: 0-65535 (0=off) +Default Value: 32 + +This value, in units of 1.024 microseconds, limits the delay in which a +transmit interrupt is generated. Useful only if TxIntDelay is non-zero, +this value ensures that an interrupt is generated after the initial +packet is sent on the wire within the set amount of time. Proper tuning, +along with TxIntDelay, may improve traffic throughput in specific +network conditions. + +Copybreak +--------- +Valid Range: 0-xxxxxxx (0=off) +Default Value: 256 + +Driver copies all packets below or equaling this size to a fresh RX +buffer before handing it up the stack. + +This parameter is different than other parameters, in that it is a +single (not 1,1,1 etc.) parameter applied to all driver instances and +it is also available during runtime at +/sys/module/e1000e/parameters/copybreak + +SmartPowerDownEnable +-------------------- +Valid Range: 0-1 +Default Value: 0 (disabled) + +Allows PHY to turn off in lower power states. The user can set this parameter +in supported chipsets. + +KumeranLockLoss +--------------- +Valid Range: 0-1 +Default Value: 1 (enabled) + +This workaround skips resetting the PHY at shutdown for the initial +silicon releases of ICH8 systems. + +IntMode +------- +Valid Range: 0-2 (0=legacy, 1=MSI, 2=MSI-X) +Default Value: 2 + +Allows changing the interrupt mode at module load time, without requiring a +recompile. If the driver load fails to enable a specific interrupt mode, the +driver will try other interrupt modes, from least to most compatible. The +interrupt order is MSI-X, MSI, Legacy. If specifying MSI (IntMode=1) +interrupts, only MSI and Legacy will be attempted. + +CrcStripping +------------ +Valid Range: 0-1 +Default Value: 1 (enabled) + +Strip the CRC from received packets before sending up the network stack. If +you have a machine with a BMC enabled but cannot receive IPMI traffic after +loading or enabling the driver, try disabling this feature. + +WriteProtectNVM +--------------- +Valid Range: 0,1 +Default Value: 1 + +If set to 1, configure the hardware to ignore all write/erase cycles to the +GbE region in the ICHx NVM (in order to prevent accidental corruption of the +NVM). This feature can be disabled by setting the parameter to 0 during initial +driver load. +NOTE: The machine must be power cycled (full off/on) when enabling NVM writes +via setting the parameter to zero. Once the NVM has been locked (via the +parameter at 1 when the driver loads) it cannot be unlocked except via power +cycle. + +Additional Configurations +========================= + + Jumbo Frames + ------------ + Jumbo Frames support is enabled by changing the MTU to a value larger than + the default of 1500. Use the ifconfig command to increase the MTU size. + For example: + + ifconfig eth<x> mtu 9000 up + + This setting is not saved across reboots. + + Notes: + + - The maximum MTU setting for Jumbo Frames is 9216. This value coincides + with the maximum Jumbo Frames size of 9234 bytes. + + - Using Jumbo frames at 10 or 100 Mbps is not supported and may result in + poor performance or loss of link. + + - Some adapters limit Jumbo Frames sized packets to a maximum of + 4096 bytes and some adapters do not support Jumbo Frames. + + - Jumbo Frames cannot be configured on an 82579-based Network device, if + MACSec is enabled on the system. + + ethtool + ------- + The driver utilizes the ethtool interface for driver configuration and + diagnostics, as well as displaying statistical information. We + strongly recommend downloading the latest version of ethtool at: + + http://ftp.kernel.org/pub/software/network/ethtool/ + + NOTE: When validating enable/disable tests on some parts (82578, for example) + you need to add a few seconds between tests when working with ethtool. + + Speed and Duplex + ---------------- + Speed and Duplex are configured through the ethtool* utility. For + instructions, refer to the ethtool man page. + + Enabling Wake on LAN* (WoL) + --------------------------- + WoL is configured through the ethtool* utility. For instructions on + enabling WoL with ethtool, refer to the ethtool man page. + + WoL will be enabled on the system during the next shut down or reboot. + For this driver version, in order to enable WoL, the e1000e driver must be + loaded when shutting down or rebooting the system. + + In most cases Wake On LAN is only supported on port A for multiple port + adapters. To verify if a port supports Wake on Lan run ethtool eth<X>. + +Support +======= + +For general information, go to the Intel support website at: + + www.intel.com/support/ + +or the Intel Wired Networking project hosted by Sourceforge at: + + http://sourceforge.net/projects/e1000 + +If an issue is identified with the released source code on the supported +kernel with a supported adapter, email the specific information related +to the issue to e1000-devel@lists.sf.net diff --git a/Documentation/networking/ewrk3.txt b/Documentation/networking/ewrk3.txt deleted file mode 100644 index 90e9e5f16e6..00000000000 --- a/Documentation/networking/ewrk3.txt +++ /dev/null @@ -1,46 +0,0 @@ -The EtherWORKS 3 driver in this distribution is designed to work with all -kernels > 1.1.33 (approx) and includes tools in the 'ewrk3tools' -subdirectory to allow set up of the card, similar to the MSDOS -'NICSETUP.EXE' tools provided on the DOS drivers disk (type 'make' in that -subdirectory to make the tools). - -The supported cards are DE203, DE204 and DE205. All other cards are NOT -supported - refer to 'depca.c' for running the LANCE based network cards and -'de4x5.c' for the DIGITAL Semiconductor PCI chip based adapters from -Digital. - -The ability to load this driver as a loadable module has been included and -used extensively during the driver development (to save those long reboot -sequences). To utilise this ability, you have to do 8 things: - - 0) have a copy of the loadable modules code installed on your system. - 1) copy ewrk3.c from the /linux/drivers/net directory to your favourite - temporary directory. - 2) edit the source code near line 1898 to reflect the I/O address and - IRQ you're using. - 3) compile ewrk3.c, but include -DMODULE in the command line to ensure - that the correct bits are compiled (see end of source code). - 4) if you are wanting to add a new card, goto 5. Otherwise, recompile a - kernel with the ewrk3 configuration turned off and reboot. - 5) insmod ewrk3.o - [Alan Cox: Changed this so you can insmod ewrk3.o irq=x io=y] - [Adam Kropelin: Multiple cards now supported by irq=x1,x2 io=y1,y2] - 6) run the net startup bits for your new eth?? interface manually - (usually /etc/rc.inet[12] at boot time). - 7) enjoy! - - Note that autoprobing is not allowed in loadable modules - the system is - already up and running and you're messing with interrupts. - - To unload a module, turn off the associated interface - 'ifconfig eth?? down' then 'rmmod ewrk3'. - -The performance we've achieved so far has been measured through the 'ttcp' -tool at 975kB/s. This measures the total TCP stack performance which -includes the card, so don't expect to get much nearer the 1.25MB/s -theoretical Ethernet rate. - - -Enjoy! - -Dave diff --git a/Documentation/networking/fib_trie.txt b/Documentation/networking/fib_trie.txt index f50d0c673c5..0723db7f849 100644 --- a/Documentation/networking/fib_trie.txt +++ b/Documentation/networking/fib_trie.txt @@ -79,7 +79,7 @@ trie_rebalance() resize() Analyzes a tnode and optimizes the child array size by either inflating - or shrinking it repeatedly until it fullfills the criteria for optimal + or shrinking it repeatedly until it fulfills the criteria for optimal level compression. This part follows the original paper pretty closely and there may be some room for experimentation here. diff --git a/Documentation/networking/filter.txt b/Documentation/networking/filter.txt index bbf2005270b..ee78eba78a9 100644 --- a/Documentation/networking/filter.txt +++ b/Documentation/networking/filter.txt @@ -1,42 +1,1027 @@ -filter.txt: Linux Socket Filtering -Written by: Jay Schulist <jschlst@samba.org> +Linux Socket Filtering aka Berkeley Packet Filter (BPF) +======================================================= Introduction -============ - - Linux Socket Filtering is derived from the Berkeley -Packet Filter. There are some distinct differences between -the BSD and Linux Kernel Filtering. - -Linux Socket Filtering (LSF) allows a user-space program to -attach a filter onto any socket and allow or disallow certain -types of data to come through the socket. LSF follows exactly -the same filter code structure as the BSD Berkeley Packet Filter -(BPF), so referring to the BSD bpf.4 manpage is very helpful in -creating filters. - -LSF is much simpler than BPF. One does not have to worry about -devices or anything like that. You simply create your filter -code, send it to the kernel via the SO_ATTACH_FILTER ioctl and -if your filter code passes the kernel check on it, you then -immediately begin filtering data on that socket. - -You can also detach filters from your socket via the -SO_DETACH_FILTER ioctl. This will probably not be used much -since when you close a socket that has a filter on it the -filter is automagically removed. The other less common case -may be adding a different filter on the same socket where you had another -filter that is still running: the kernel takes care of removing -the old one and placing your new one in its place, assuming your -filter has passed the checks, otherwise if it fails the old filter -will remain on that socket. - -Examples -======== - -Ioctls- -setsockopt(sockfd, SOL_SOCKET, SO_ATTACH_FILTER, &Filter, sizeof(Filter)); -setsockopt(sockfd, SOL_SOCKET, SO_DETACH_FILTER, &value, sizeof(value)); - -See the BSD bpf.4 manpage and the BSD Packet Filter paper written by -Steven McCanne and Van Jacobson of Lawrence Berkeley Laboratory. +------------ + +Linux Socket Filtering (LSF) is derived from the Berkeley Packet Filter. +Though there are some distinct differences between the BSD and Linux +Kernel filtering, but when we speak of BPF or LSF in Linux context, we +mean the very same mechanism of filtering in the Linux kernel. + +BPF allows a user-space program to attach a filter onto any socket and +allow or disallow certain types of data to come through the socket. LSF +follows exactly the same filter code structure as BSD's BPF, so referring +to the BSD bpf.4 manpage is very helpful in creating filters. + +On Linux, BPF is much simpler than on BSD. One does not have to worry +about devices or anything like that. You simply create your filter code, +send it to the kernel via the SO_ATTACH_FILTER option and if your filter +code passes the kernel check on it, you then immediately begin filtering +data on that socket. + +You can also detach filters from your socket via the SO_DETACH_FILTER +option. This will probably not be used much since when you close a socket +that has a filter on it the filter is automagically removed. The other +less common case may be adding a different filter on the same socket where +you had another filter that is still running: the kernel takes care of +removing the old one and placing your new one in its place, assuming your +filter has passed the checks, otherwise if it fails the old filter will +remain on that socket. + +SO_LOCK_FILTER option allows to lock the filter attached to a socket. Once +set, a filter cannot be removed or changed. This allows one process to +setup a socket, attach a filter, lock it then drop privileges and be +assured that the filter will be kept until the socket is closed. + +The biggest user of this construct might be libpcap. Issuing a high-level +filter command like `tcpdump -i em1 port 22` passes through the libpcap +internal compiler that generates a structure that can eventually be loaded +via SO_ATTACH_FILTER to the kernel. `tcpdump -i em1 port 22 -ddd` +displays what is being placed into this structure. + +Although we were only speaking about sockets here, BPF in Linux is used +in many more places. There's xt_bpf for netfilter, cls_bpf in the kernel +qdisc layer, SECCOMP-BPF (SECure COMPuting [1]), and lots of other places +such as team driver, PTP code, etc where BPF is being used. + + [1] Documentation/prctl/seccomp_filter.txt + +Original BPF paper: + +Steven McCanne and Van Jacobson. 1993. The BSD packet filter: a new +architecture for user-level packet capture. In Proceedings of the +USENIX Winter 1993 Conference Proceedings on USENIX Winter 1993 +Conference Proceedings (USENIX'93). USENIX Association, Berkeley, +CA, USA, 2-2. [http://www.tcpdump.org/papers/bpf-usenix93.pdf] + +Structure +--------- + +User space applications include <linux/filter.h> which contains the +following relevant structures: + +struct sock_filter { /* Filter block */ + __u16 code; /* Actual filter code */ + __u8 jt; /* Jump true */ + __u8 jf; /* Jump false */ + __u32 k; /* Generic multiuse field */ +}; + +Such a structure is assembled as an array of 4-tuples, that contains +a code, jt, jf and k value. jt and jf are jump offsets and k a generic +value to be used for a provided code. + +struct sock_fprog { /* Required for SO_ATTACH_FILTER. */ + unsigned short len; /* Number of filter blocks */ + struct sock_filter __user *filter; +}; + +For socket filtering, a pointer to this structure (as shown in +follow-up example) is being passed to the kernel through setsockopt(2). + +Example +------- + +#include <sys/socket.h> +#include <sys/types.h> +#include <arpa/inet.h> +#include <linux/if_ether.h> +/* ... */ + +/* From the example above: tcpdump -i em1 port 22 -dd */ +struct sock_filter code[] = { + { 0x28, 0, 0, 0x0000000c }, + { 0x15, 0, 8, 0x000086dd }, + { 0x30, 0, 0, 0x00000014 }, + { 0x15, 2, 0, 0x00000084 }, + { 0x15, 1, 0, 0x00000006 }, + { 0x15, 0, 17, 0x00000011 }, + { 0x28, 0, 0, 0x00000036 }, + { 0x15, 14, 0, 0x00000016 }, + { 0x28, 0, 0, 0x00000038 }, + { 0x15, 12, 13, 0x00000016 }, + { 0x15, 0, 12, 0x00000800 }, + { 0x30, 0, 0, 0x00000017 }, + { 0x15, 2, 0, 0x00000084 }, + { 0x15, 1, 0, 0x00000006 }, + { 0x15, 0, 8, 0x00000011 }, + { 0x28, 0, 0, 0x00000014 }, + { 0x45, 6, 0, 0x00001fff }, + { 0xb1, 0, 0, 0x0000000e }, + { 0x48, 0, 0, 0x0000000e }, + { 0x15, 2, 0, 0x00000016 }, + { 0x48, 0, 0, 0x00000010 }, + { 0x15, 0, 1, 0x00000016 }, + { 0x06, 0, 0, 0x0000ffff }, + { 0x06, 0, 0, 0x00000000 }, +}; + +struct sock_fprog bpf = { + .len = ARRAY_SIZE(code), + .filter = code, +}; + +sock = socket(PF_PACKET, SOCK_RAW, htons(ETH_P_ALL)); +if (sock < 0) + /* ... bail out ... */ + +ret = setsockopt(sock, SOL_SOCKET, SO_ATTACH_FILTER, &bpf, sizeof(bpf)); +if (ret < 0) + /* ... bail out ... */ + +/* ... */ +close(sock); + +The above example code attaches a socket filter for a PF_PACKET socket +in order to let all IPv4/IPv6 packets with port 22 pass. The rest will +be dropped for this socket. + +The setsockopt(2) call to SO_DETACH_FILTER doesn't need any arguments +and SO_LOCK_FILTER for preventing the filter to be detached, takes an +integer value with 0 or 1. + +Note that socket filters are not restricted to PF_PACKET sockets only, +but can also be used on other socket families. + +Summary of system calls: + + * setsockopt(sockfd, SOL_SOCKET, SO_ATTACH_FILTER, &val, sizeof(val)); + * setsockopt(sockfd, SOL_SOCKET, SO_DETACH_FILTER, &val, sizeof(val)); + * setsockopt(sockfd, SOL_SOCKET, SO_LOCK_FILTER, &val, sizeof(val)); + +Normally, most use cases for socket filtering on packet sockets will be +covered by libpcap in high-level syntax, so as an application developer +you should stick to that. libpcap wraps its own layer around all that. + +Unless i) using/linking to libpcap is not an option, ii) the required BPF +filters use Linux extensions that are not supported by libpcap's compiler, +iii) a filter might be more complex and not cleanly implementable with +libpcap's compiler, or iv) particular filter codes should be optimized +differently than libpcap's internal compiler does; then in such cases +writing such a filter "by hand" can be of an alternative. For example, +xt_bpf and cls_bpf users might have requirements that could result in +more complex filter code, or one that cannot be expressed with libpcap +(e.g. different return codes for various code paths). Moreover, BPF JIT +implementors may wish to manually write test cases and thus need low-level +access to BPF code as well. + +BPF engine and instruction set +------------------------------ + +Under tools/net/ there's a small helper tool called bpf_asm which can +be used to write low-level filters for example scenarios mentioned in the +previous section. Asm-like syntax mentioned here has been implemented in +bpf_asm and will be used for further explanations (instead of dealing with +less readable opcodes directly, principles are the same). The syntax is +closely modelled after Steven McCanne's and Van Jacobson's BPF paper. + +The BPF architecture consists of the following basic elements: + + Element Description + + A 32 bit wide accumulator + X 32 bit wide X register + M[] 16 x 32 bit wide misc registers aka "scratch memory + store", addressable from 0 to 15 + +A program, that is translated by bpf_asm into "opcodes" is an array that +consists of the following elements (as already mentioned): + + op:16, jt:8, jf:8, k:32 + +The element op is a 16 bit wide opcode that has a particular instruction +encoded. jt and jf are two 8 bit wide jump targets, one for condition +"jump if true", the other one "jump if false". Eventually, element k +contains a miscellaneous argument that can be interpreted in different +ways depending on the given instruction in op. + +The instruction set consists of load, store, branch, alu, miscellaneous +and return instructions that are also represented in bpf_asm syntax. This +table lists all bpf_asm instructions available resp. what their underlying +opcodes as defined in linux/filter.h stand for: + + Instruction Addressing mode Description + + ld 1, 2, 3, 4, 10 Load word into A + ldi 4 Load word into A + ldh 1, 2 Load half-word into A + ldb 1, 2 Load byte into A + ldx 3, 4, 5, 10 Load word into X + ldxi 4 Load word into X + ldxb 5 Load byte into X + + st 3 Store A into M[] + stx 3 Store X into M[] + + jmp 6 Jump to label + ja 6 Jump to label + jeq 7, 8 Jump on k == A + jneq 8 Jump on k != A + jne 8 Jump on k != A + jlt 8 Jump on k < A + jle 8 Jump on k <= A + jgt 7, 8 Jump on k > A + jge 7, 8 Jump on k >= A + jset 7, 8 Jump on k & A + + add 0, 4 A + <x> + sub 0, 4 A - <x> + mul 0, 4 A * <x> + div 0, 4 A / <x> + mod 0, 4 A % <x> + neg 0, 4 !A + and 0, 4 A & <x> + or 0, 4 A | <x> + xor 0, 4 A ^ <x> + lsh 0, 4 A << <x> + rsh 0, 4 A >> <x> + + tax Copy A into X + txa Copy X into A + + ret 4, 9 Return + +The next table shows addressing formats from the 2nd column: + + Addressing mode Syntax Description + + 0 x/%x Register X + 1 [k] BHW at byte offset k in the packet + 2 [x + k] BHW at the offset X + k in the packet + 3 M[k] Word at offset k in M[] + 4 #k Literal value stored in k + 5 4*([k]&0xf) Lower nibble * 4 at byte offset k in the packet + 6 L Jump label L + 7 #k,Lt,Lf Jump to Lt if true, otherwise jump to Lf + 8 #k,Lt Jump to Lt if predicate is true + 9 a/%a Accumulator A + 10 extension BPF extension + +The Linux kernel also has a couple of BPF extensions that are used along +with the class of load instructions by "overloading" the k argument with +a negative offset + a particular extension offset. The result of such BPF +extensions are loaded into A. + +Possible BPF extensions are shown in the following table: + + Extension Description + + len skb->len + proto skb->protocol + type skb->pkt_type + poff Payload start offset + ifidx skb->dev->ifindex + nla Netlink attribute of type X with offset A + nlan Nested Netlink attribute of type X with offset A + mark skb->mark + queue skb->queue_mapping + hatype skb->dev->type + rxhash skb->hash + cpu raw_smp_processor_id() + vlan_tci vlan_tx_tag_get(skb) + vlan_pr vlan_tx_tag_present(skb) + rand prandom_u32() + +These extensions can also be prefixed with '#'. +Examples for low-level BPF: + +** ARP packets: + + ldh [12] + jne #0x806, drop + ret #-1 + drop: ret #0 + +** IPv4 TCP packets: + + ldh [12] + jne #0x800, drop + ldb [23] + jneq #6, drop + ret #-1 + drop: ret #0 + +** (Accelerated) VLAN w/ id 10: + + ld vlan_tci + jneq #10, drop + ret #-1 + drop: ret #0 + +** icmp random packet sampling, 1 in 4 + ldh [12] + jne #0x800, drop + ldb [23] + jneq #1, drop + # get a random uint32 number + ld rand + mod #4 + jneq #1, drop + ret #-1 + drop: ret #0 + +** SECCOMP filter example: + + ld [4] /* offsetof(struct seccomp_data, arch) */ + jne #0xc000003e, bad /* AUDIT_ARCH_X86_64 */ + ld [0] /* offsetof(struct seccomp_data, nr) */ + jeq #15, good /* __NR_rt_sigreturn */ + jeq #231, good /* __NR_exit_group */ + jeq #60, good /* __NR_exit */ + jeq #0, good /* __NR_read */ + jeq #1, good /* __NR_write */ + jeq #5, good /* __NR_fstat */ + jeq #9, good /* __NR_mmap */ + jeq #14, good /* __NR_rt_sigprocmask */ + jeq #13, good /* __NR_rt_sigaction */ + jeq #35, good /* __NR_nanosleep */ + bad: ret #0 /* SECCOMP_RET_KILL */ + good: ret #0x7fff0000 /* SECCOMP_RET_ALLOW */ + +The above example code can be placed into a file (here called "foo"), and +then be passed to the bpf_asm tool for generating opcodes, output that xt_bpf +and cls_bpf understands and can directly be loaded with. Example with above +ARP code: + +$ ./bpf_asm foo +4,40 0 0 12,21 0 1 2054,6 0 0 4294967295,6 0 0 0, + +In copy and paste C-like output: + +$ ./bpf_asm -c foo +{ 0x28, 0, 0, 0x0000000c }, +{ 0x15, 0, 1, 0x00000806 }, +{ 0x06, 0, 0, 0xffffffff }, +{ 0x06, 0, 0, 0000000000 }, + +In particular, as usage with xt_bpf or cls_bpf can result in more complex BPF +filters that might not be obvious at first, it's good to test filters before +attaching to a live system. For that purpose, there's a small tool called +bpf_dbg under tools/net/ in the kernel source directory. This debugger allows +for testing BPF filters against given pcap files, single stepping through the +BPF code on the pcap's packets and to do BPF machine register dumps. + +Starting bpf_dbg is trivial and just requires issuing: + +# ./bpf_dbg + +In case input and output do not equal stdin/stdout, bpf_dbg takes an +alternative stdin source as a first argument, and an alternative stdout +sink as a second one, e.g. `./bpf_dbg test_in.txt test_out.txt`. + +Other than that, a particular libreadline configuration can be set via +file "~/.bpf_dbg_init" and the command history is stored in the file +"~/.bpf_dbg_history". + +Interaction in bpf_dbg happens through a shell that also has auto-completion +support (follow-up example commands starting with '>' denote bpf_dbg shell). +The usual workflow would be to ... + +> load bpf 6,40 0 0 12,21 0 3 2048,48 0 0 23,21 0 1 1,6 0 0 65535,6 0 0 0 + Loads a BPF filter from standard output of bpf_asm, or transformed via + e.g. `tcpdump -iem1 -ddd port 22 | tr '\n' ','`. Note that for JIT + debugging (next section), this command creates a temporary socket and + loads the BPF code into the kernel. Thus, this will also be useful for + JIT developers. + +> load pcap foo.pcap + Loads standard tcpdump pcap file. + +> run [<n>] +bpf passes:1 fails:9 + Runs through all packets from a pcap to account how many passes and fails + the filter will generate. A limit of packets to traverse can be given. + +> disassemble +l0: ldh [12] +l1: jeq #0x800, l2, l5 +l2: ldb [23] +l3: jeq #0x1, l4, l5 +l4: ret #0xffff +l5: ret #0 + Prints out BPF code disassembly. + +> dump +/* { op, jt, jf, k }, */ +{ 0x28, 0, 0, 0x0000000c }, +{ 0x15, 0, 3, 0x00000800 }, +{ 0x30, 0, 0, 0x00000017 }, +{ 0x15, 0, 1, 0x00000001 }, +{ 0x06, 0, 0, 0x0000ffff }, +{ 0x06, 0, 0, 0000000000 }, + Prints out C-style BPF code dump. + +> breakpoint 0 +breakpoint at: l0: ldh [12] +> breakpoint 1 +breakpoint at: l1: jeq #0x800, l2, l5 + ... + Sets breakpoints at particular BPF instructions. Issuing a `run` command + will walk through the pcap file continuing from the current packet and + break when a breakpoint is being hit (another `run` will continue from + the currently active breakpoint executing next instructions): + + > run + -- register dump -- + pc: [0] <-- program counter + code: [40] jt[0] jf[0] k[12] <-- plain BPF code of current instruction + curr: l0: ldh [12] <-- disassembly of current instruction + A: [00000000][0] <-- content of A (hex, decimal) + X: [00000000][0] <-- content of X (hex, decimal) + M[0,15]: [00000000][0] <-- folded content of M (hex, decimal) + -- packet dump -- <-- Current packet from pcap (hex) + len: 42 + 0: 00 19 cb 55 55 a4 00 14 a4 43 78 69 08 06 00 01 + 16: 08 00 06 04 00 01 00 14 a4 43 78 69 0a 3b 01 26 + 32: 00 00 00 00 00 00 0a 3b 01 01 + (breakpoint) + > + +> breakpoint +breakpoints: 0 1 + Prints currently set breakpoints. + +> step [-<n>, +<n>] + Performs single stepping through the BPF program from the current pc + offset. Thus, on each step invocation, above register dump is issued. + This can go forwards and backwards in time, a plain `step` will break + on the next BPF instruction, thus +1. (No `run` needs to be issued here.) + +> select <n> + Selects a given packet from the pcap file to continue from. Thus, on + the next `run` or `step`, the BPF program is being evaluated against + the user pre-selected packet. Numbering starts just as in Wireshark + with index 1. + +> quit +# + Exits bpf_dbg. + +JIT compiler +------------ + +The Linux kernel has a built-in BPF JIT compiler for x86_64, SPARC, PowerPC, +ARM and s390 and can be enabled through CONFIG_BPF_JIT. The JIT compiler is +transparently invoked for each attached filter from user space or for internal +kernel users if it has been previously enabled by root: + + echo 1 > /proc/sys/net/core/bpf_jit_enable + +For JIT developers, doing audits etc, each compile run can output the generated +opcode image into the kernel log via: + + echo 2 > /proc/sys/net/core/bpf_jit_enable + +Example output from dmesg: + +[ 3389.935842] flen=6 proglen=70 pass=3 image=ffffffffa0069c8f +[ 3389.935847] JIT code: 00000000: 55 48 89 e5 48 83 ec 60 48 89 5d f8 44 8b 4f 68 +[ 3389.935849] JIT code: 00000010: 44 2b 4f 6c 4c 8b 87 d8 00 00 00 be 0c 00 00 00 +[ 3389.935850] JIT code: 00000020: e8 1d 94 ff e0 3d 00 08 00 00 75 16 be 17 00 00 +[ 3389.935851] JIT code: 00000030: 00 e8 28 94 ff e0 83 f8 01 75 07 b8 ff ff 00 00 +[ 3389.935852] JIT code: 00000040: eb 02 31 c0 c9 c3 + +In the kernel source tree under tools/net/, there's bpf_jit_disasm for +generating disassembly out of the kernel log's hexdump: + +# ./bpf_jit_disasm +70 bytes emitted from JIT compiler (pass:3, flen:6) +ffffffffa0069c8f + <x>: + 0: push %rbp + 1: mov %rsp,%rbp + 4: sub $0x60,%rsp + 8: mov %rbx,-0x8(%rbp) + c: mov 0x68(%rdi),%r9d + 10: sub 0x6c(%rdi),%r9d + 14: mov 0xd8(%rdi),%r8 + 1b: mov $0xc,%esi + 20: callq 0xffffffffe0ff9442 + 25: cmp $0x800,%eax + 2a: jne 0x0000000000000042 + 2c: mov $0x17,%esi + 31: callq 0xffffffffe0ff945e + 36: cmp $0x1,%eax + 39: jne 0x0000000000000042 + 3b: mov $0xffff,%eax + 40: jmp 0x0000000000000044 + 42: xor %eax,%eax + 44: leaveq + 45: retq + +Issuing option `-o` will "annotate" opcodes to resulting assembler +instructions, which can be very useful for JIT developers: + +# ./bpf_jit_disasm -o +70 bytes emitted from JIT compiler (pass:3, flen:6) +ffffffffa0069c8f + <x>: + 0: push %rbp + 55 + 1: mov %rsp,%rbp + 48 89 e5 + 4: sub $0x60,%rsp + 48 83 ec 60 + 8: mov %rbx,-0x8(%rbp) + 48 89 5d f8 + c: mov 0x68(%rdi),%r9d + 44 8b 4f 68 + 10: sub 0x6c(%rdi),%r9d + 44 2b 4f 6c + 14: mov 0xd8(%rdi),%r8 + 4c 8b 87 d8 00 00 00 + 1b: mov $0xc,%esi + be 0c 00 00 00 + 20: callq 0xffffffffe0ff9442 + e8 1d 94 ff e0 + 25: cmp $0x800,%eax + 3d 00 08 00 00 + 2a: jne 0x0000000000000042 + 75 16 + 2c: mov $0x17,%esi + be 17 00 00 00 + 31: callq 0xffffffffe0ff945e + e8 28 94 ff e0 + 36: cmp $0x1,%eax + 83 f8 01 + 39: jne 0x0000000000000042 + 75 07 + 3b: mov $0xffff,%eax + b8 ff ff 00 00 + 40: jmp 0x0000000000000044 + eb 02 + 42: xor %eax,%eax + 31 c0 + 44: leaveq + c9 + 45: retq + c3 + +For BPF JIT developers, bpf_jit_disasm, bpf_asm and bpf_dbg provides a useful +toolchain for developing and testing the kernel's JIT compiler. + +BPF kernel internals +-------------------- +Internally, for the kernel interpreter, a different instruction set +format with similar underlying principles from BPF described in previous +paragraphs is being used. However, the instruction set format is modelled +closer to the underlying architecture to mimic native instruction sets, so +that a better performance can be achieved (more details later). This new +ISA is called 'eBPF' or 'internal BPF' interchangeably. (Note: eBPF which +originates from [e]xtended BPF is not the same as BPF extensions! While +eBPF is an ISA, BPF extensions date back to classic BPF's 'overloading' +of BPF_LD | BPF_{B,H,W} | BPF_ABS instruction.) + +It is designed to be JITed with one to one mapping, which can also open up +the possibility for GCC/LLVM compilers to generate optimized eBPF code through +an eBPF backend that performs almost as fast as natively compiled code. + +The new instruction set was originally designed with the possible goal in +mind to write programs in "restricted C" and compile into eBPF with a optional +GCC/LLVM backend, so that it can just-in-time map to modern 64-bit CPUs with +minimal performance overhead over two steps, that is, C -> eBPF -> native code. + +Currently, the new format is being used for running user BPF programs, which +includes seccomp BPF, classic socket filters, cls_bpf traffic classifier, +team driver's classifier for its load-balancing mode, netfilter's xt_bpf +extension, PTP dissector/classifier, and much more. They are all internally +converted by the kernel into the new instruction set representation and run +in the eBPF interpreter. For in-kernel handlers, this all works transparently +by using sk_unattached_filter_create() for setting up the filter, resp. +sk_unattached_filter_destroy() for destroying it. The macro +SK_RUN_FILTER(filter, ctx) transparently invokes eBPF interpreter or JITed +code to run the filter. 'filter' is a pointer to struct sk_filter that we +got from sk_unattached_filter_create(), and 'ctx' the given context (e.g. +skb pointer). All constraints and restrictions from sk_chk_filter() apply +before a conversion to the new layout is being done behind the scenes! + +Currently, the classic BPF format is being used for JITing on most of the +architectures. Only x86-64 performs JIT compilation from eBPF instruction set, +however, future work will migrate other JIT compilers as well, so that they +will profit from the very same benefits. + +Some core changes of the new internal format: + +- Number of registers increase from 2 to 10: + + The old format had two registers A and X, and a hidden frame pointer. The + new layout extends this to be 10 internal registers and a read-only frame + pointer. Since 64-bit CPUs are passing arguments to functions via registers + the number of args from eBPF program to in-kernel function is restricted + to 5 and one register is used to accept return value from an in-kernel + function. Natively, x86_64 passes first 6 arguments in registers, aarch64/ + sparcv9/mips64 have 7 - 8 registers for arguments; x86_64 has 6 callee saved + registers, and aarch64/sparcv9/mips64 have 11 or more callee saved registers. + + Therefore, eBPF calling convention is defined as: + + * R0 - return value from in-kernel function, and exit value for eBPF program + * R1 - R5 - arguments from eBPF program to in-kernel function + * R6 - R9 - callee saved registers that in-kernel function will preserve + * R10 - read-only frame pointer to access stack + + Thus, all eBPF registers map one to one to HW registers on x86_64, aarch64, + etc, and eBPF calling convention maps directly to ABIs used by the kernel on + 64-bit architectures. + + On 32-bit architectures JIT may map programs that use only 32-bit arithmetic + and may let more complex programs to be interpreted. + + R0 - R5 are scratch registers and eBPF program needs spill/fill them if + necessary across calls. Note that there is only one eBPF program (== one + eBPF main routine) and it cannot call other eBPF functions, it can only + call predefined in-kernel functions, though. + +- Register width increases from 32-bit to 64-bit: + + Still, the semantics of the original 32-bit ALU operations are preserved + via 32-bit subregisters. All eBPF registers are 64-bit with 32-bit lower + subregisters that zero-extend into 64-bit if they are being written to. + That behavior maps directly to x86_64 and arm64 subregister definition, but + makes other JITs more difficult. + + 32-bit architectures run 64-bit internal BPF programs via interpreter. + Their JITs may convert BPF programs that only use 32-bit subregisters into + native instruction set and let the rest being interpreted. + + Operation is 64-bit, because on 64-bit architectures, pointers are also + 64-bit wide, and we want to pass 64-bit values in/out of kernel functions, + so 32-bit eBPF registers would otherwise require to define register-pair + ABI, thus, there won't be able to use a direct eBPF register to HW register + mapping and JIT would need to do combine/split/move operations for every + register in and out of the function, which is complex, bug prone and slow. + Another reason is the use of atomic 64-bit counters. + +- Conditional jt/jf targets replaced with jt/fall-through: + + While the original design has constructs such as "if (cond) jump_true; + else jump_false;", they are being replaced into alternative constructs like + "if (cond) jump_true; /* else fall-through */". + +- Introduces bpf_call insn and register passing convention for zero overhead + calls from/to other kernel functions: + + Before an in-kernel function call, the internal BPF program needs to + place function arguments into R1 to R5 registers to satisfy calling + convention, then the interpreter will take them from registers and pass + to in-kernel function. If R1 - R5 registers are mapped to CPU registers + that are used for argument passing on given architecture, the JIT compiler + doesn't need to emit extra moves. Function arguments will be in the correct + registers and BPF_CALL instruction will be JITed as single 'call' HW + instruction. This calling convention was picked to cover common call + situations without performance penalty. + + After an in-kernel function call, R1 - R5 are reset to unreadable and R0 has + a return value of the function. Since R6 - R9 are callee saved, their state + is preserved across the call. + + For example, consider three C functions: + + u64 f1() { return (*_f2)(1); } + u64 f2(u64 a) { return f3(a + 1, a); } + u64 f3(u64 a, u64 b) { return a - b; } + + GCC can compile f1, f3 into x86_64: + + f1: + movl $1, %edi + movq _f2(%rip), %rax + jmp *%rax + f3: + movq %rdi, %rax + subq %rsi, %rax + ret + + Function f2 in eBPF may look like: + + f2: + bpf_mov R2, R1 + bpf_add R1, 1 + bpf_call f3 + bpf_exit + + If f2 is JITed and the pointer stored to '_f2'. The calls f1 -> f2 -> f3 and + returns will be seamless. Without JIT, __sk_run_filter() interpreter needs to + be used to call into f2. + + For practical reasons all eBPF programs have only one argument 'ctx' which is + already placed into R1 (e.g. on __sk_run_filter() startup) and the programs + can call kernel functions with up to 5 arguments. Calls with 6 or more arguments + are currently not supported, but these restrictions can be lifted if necessary + in the future. + + On 64-bit architectures all register map to HW registers one to one. For + example, x86_64 JIT compiler can map them as ... + + R0 - rax + R1 - rdi + R2 - rsi + R3 - rdx + R4 - rcx + R5 - r8 + R6 - rbx + R7 - r13 + R8 - r14 + R9 - r15 + R10 - rbp + + ... since x86_64 ABI mandates rdi, rsi, rdx, rcx, r8, r9 for argument passing + and rbx, r12 - r15 are callee saved. + + Then the following internal BPF pseudo-program: + + bpf_mov R6, R1 /* save ctx */ + bpf_mov R2, 2 + bpf_mov R3, 3 + bpf_mov R4, 4 + bpf_mov R5, 5 + bpf_call foo + bpf_mov R7, R0 /* save foo() return value */ + bpf_mov R1, R6 /* restore ctx for next call */ + bpf_mov R2, 6 + bpf_mov R3, 7 + bpf_mov R4, 8 + bpf_mov R5, 9 + bpf_call bar + bpf_add R0, R7 + bpf_exit + + After JIT to x86_64 may look like: + + push %rbp + mov %rsp,%rbp + sub $0x228,%rsp + mov %rbx,-0x228(%rbp) + mov %r13,-0x220(%rbp) + mov %rdi,%rbx + mov $0x2,%esi + mov $0x3,%edx + mov $0x4,%ecx + mov $0x5,%r8d + callq foo + mov %rax,%r13 + mov %rbx,%rdi + mov $0x2,%esi + mov $0x3,%edx + mov $0x4,%ecx + mov $0x5,%r8d + callq bar + add %r13,%rax + mov -0x228(%rbp),%rbx + mov -0x220(%rbp),%r13 + leaveq + retq + + Which is in this example equivalent in C to: + + u64 bpf_filter(u64 ctx) + { + return foo(ctx, 2, 3, 4, 5) + bar(ctx, 6, 7, 8, 9); + } + + In-kernel functions foo() and bar() with prototype: u64 (*)(u64 arg1, u64 + arg2, u64 arg3, u64 arg4, u64 arg5); will receive arguments in proper + registers and place their return value into '%rax' which is R0 in eBPF. + Prologue and epilogue are emitted by JIT and are implicit in the + interpreter. R0-R5 are scratch registers, so eBPF program needs to preserve + them across the calls as defined by calling convention. + + For example the following program is invalid: + + bpf_mov R1, 1 + bpf_call foo + bpf_mov R0, R1 + bpf_exit + + After the call the registers R1-R5 contain junk values and cannot be read. + In the future an eBPF verifier can be used to validate internal BPF programs. + +Also in the new design, eBPF is limited to 4096 insns, which means that any +program will terminate quickly and will only call a fixed number of kernel +functions. Original BPF and the new format are two operand instructions, +which helps to do one-to-one mapping between eBPF insn and x86 insn during JIT. + +The input context pointer for invoking the interpreter function is generic, +its content is defined by a specific use case. For seccomp register R1 points +to seccomp_data, for converted BPF filters R1 points to a skb. + +A program, that is translated internally consists of the following elements: + + op:16, jt:8, jf:8, k:32 ==> op:8, dst_reg:4, src_reg:4, off:16, imm:32 + +So far 87 internal BPF instructions were implemented. 8-bit 'op' opcode field +has room for new instructions. Some of them may use 16/24/32 byte encoding. New +instructions must be multiple of 8 bytes to preserve backward compatibility. + +Internal BPF is a general purpose RISC instruction set. Not every register and +every instruction are used during translation from original BPF to new format. +For example, socket filters are not using 'exclusive add' instruction, but +tracing filters may do to maintain counters of events, for example. Register R9 +is not used by socket filters either, but more complex filters may be running +out of registers and would have to resort to spill/fill to stack. + +Internal BPF can used as generic assembler for last step performance +optimizations, socket filters and seccomp are using it as assembler. Tracing +filters may use it as assembler to generate code from kernel. In kernel usage +may not be bounded by security considerations, since generated internal BPF code +may be optimizing internal code path and not being exposed to the user space. +Safety of internal BPF can come from a verifier (TBD). In such use cases as +described, it may be used as safe instruction set. + +Just like the original BPF, the new format runs within a controlled environment, +is deterministic and the kernel can easily prove that. The safety of the program +can be determined in two steps: first step does depth-first-search to disallow +loops and other CFG validation; second step starts from the first insn and +descends all possible paths. It simulates execution of every insn and observes +the state change of registers and stack. + +eBPF opcode encoding +-------------------- + +eBPF is reusing most of the opcode encoding from classic to simplify conversion +of classic BPF to eBPF. For arithmetic and jump instructions the 8-bit 'code' +field is divided into three parts: + + +----------------+--------+--------------------+ + | 4 bits | 1 bit | 3 bits | + | operation code | source | instruction class | + +----------------+--------+--------------------+ + (MSB) (LSB) + +Three LSB bits store instruction class which is one of: + + Classic BPF classes: eBPF classes: + + BPF_LD 0x00 BPF_LD 0x00 + BPF_LDX 0x01 BPF_LDX 0x01 + BPF_ST 0x02 BPF_ST 0x02 + BPF_STX 0x03 BPF_STX 0x03 + BPF_ALU 0x04 BPF_ALU 0x04 + BPF_JMP 0x05 BPF_JMP 0x05 + BPF_RET 0x06 [ class 6 unused, for future if needed ] + BPF_MISC 0x07 BPF_ALU64 0x07 + +When BPF_CLASS(code) == BPF_ALU or BPF_JMP, 4th bit encodes source operand ... + + BPF_K 0x00 + BPF_X 0x08 + + * in classic BPF, this means: + + BPF_SRC(code) == BPF_X - use register X as source operand + BPF_SRC(code) == BPF_K - use 32-bit immediate as source operand + + * in eBPF, this means: + + BPF_SRC(code) == BPF_X - use 'src_reg' register as source operand + BPF_SRC(code) == BPF_K - use 32-bit immediate as source operand + +... and four MSB bits store operation code. + +If BPF_CLASS(code) == BPF_ALU or BPF_ALU64 [ in eBPF ], BPF_OP(code) is one of: + + BPF_ADD 0x00 + BPF_SUB 0x10 + BPF_MUL 0x20 + BPF_DIV 0x30 + BPF_OR 0x40 + BPF_AND 0x50 + BPF_LSH 0x60 + BPF_RSH 0x70 + BPF_NEG 0x80 + BPF_MOD 0x90 + BPF_XOR 0xa0 + BPF_MOV 0xb0 /* eBPF only: mov reg to reg */ + BPF_ARSH 0xc0 /* eBPF only: sign extending shift right */ + BPF_END 0xd0 /* eBPF only: endianness conversion */ + +If BPF_CLASS(code) == BPF_JMP, BPF_OP(code) is one of: + + BPF_JA 0x00 + BPF_JEQ 0x10 + BPF_JGT 0x20 + BPF_JGE 0x30 + BPF_JSET 0x40 + BPF_JNE 0x50 /* eBPF only: jump != */ + BPF_JSGT 0x60 /* eBPF only: signed '>' */ + BPF_JSGE 0x70 /* eBPF only: signed '>=' */ + BPF_CALL 0x80 /* eBPF only: function call */ + BPF_EXIT 0x90 /* eBPF only: function return */ + +So BPF_ADD | BPF_X | BPF_ALU means 32-bit addition in both classic BPF +and eBPF. There are only two registers in classic BPF, so it means A += X. +In eBPF it means dst_reg = (u32) dst_reg + (u32) src_reg; similarly, +BPF_XOR | BPF_K | BPF_ALU means A ^= imm32 in classic BPF and analogous +src_reg = (u32) src_reg ^ (u32) imm32 in eBPF. + +Classic BPF is using BPF_MISC class to represent A = X and X = A moves. +eBPF is using BPF_MOV | BPF_X | BPF_ALU code instead. Since there are no +BPF_MISC operations in eBPF, the class 7 is used as BPF_ALU64 to mean +exactly the same operations as BPF_ALU, but with 64-bit wide operands +instead. So BPF_ADD | BPF_X | BPF_ALU64 means 64-bit addition, i.e.: +dst_reg = dst_reg + src_reg + +Classic BPF wastes the whole BPF_RET class to represent a single 'ret' +operation. Classic BPF_RET | BPF_K means copy imm32 into return register +and perform function exit. eBPF is modeled to match CPU, so BPF_JMP | BPF_EXIT +in eBPF means function exit only. The eBPF program needs to store return +value into register R0 before doing a BPF_EXIT. Class 6 in eBPF is currently +unused and reserved for future use. + +For load and store instructions the 8-bit 'code' field is divided as: + + +--------+--------+-------------------+ + | 3 bits | 2 bits | 3 bits | + | mode | size | instruction class | + +--------+--------+-------------------+ + (MSB) (LSB) + +Size modifier is one of ... + + BPF_W 0x00 /* word */ + BPF_H 0x08 /* half word */ + BPF_B 0x10 /* byte */ + BPF_DW 0x18 /* eBPF only, double word */ + +... which encodes size of load/store operation: + + B - 1 byte + H - 2 byte + W - 4 byte + DW - 8 byte (eBPF only) + +Mode modifier is one of: + + BPF_IMM 0x00 /* classic BPF only, reserved in eBPF */ + BPF_ABS 0x20 + BPF_IND 0x40 + BPF_MEM 0x60 + BPF_LEN 0x80 /* classic BPF only, reserved in eBPF */ + BPF_MSH 0xa0 /* classic BPF only, reserved in eBPF */ + BPF_XADD 0xc0 /* eBPF only, exclusive add */ + +eBPF has two non-generic instructions: (BPF_ABS | <size> | BPF_LD) and +(BPF_IND | <size> | BPF_LD) which are used to access packet data. + +They had to be carried over from classic to have strong performance of +socket filters running in eBPF interpreter. These instructions can only +be used when interpreter context is a pointer to 'struct sk_buff' and +have seven implicit operands. Register R6 is an implicit input that must +contain pointer to sk_buff. Register R0 is an implicit output which contains +the data fetched from the packet. Registers R1-R5 are scratch registers +and must not be used to store the data across BPF_ABS | BPF_LD or +BPF_IND | BPF_LD instructions. + +These instructions have implicit program exit condition as well. When +eBPF program is trying to access the data beyond the packet boundary, +the interpreter will abort the execution of the program. JIT compilers +therefore must preserve this property. src_reg and imm32 fields are +explicit inputs to these instructions. + +For example: + + BPF_IND | BPF_W | BPF_LD means: + + R0 = ntohl(*(u32 *) (((struct sk_buff *) R6)->data + src_reg + imm32)) + and R1 - R5 were scratched. + +Unlike classic BPF instruction set, eBPF has generic load/store operations: + +BPF_MEM | <size> | BPF_STX: *(size *) (dst_reg + off) = src_reg +BPF_MEM | <size> | BPF_ST: *(size *) (dst_reg + off) = imm32 +BPF_MEM | <size> | BPF_LDX: dst_reg = *(size *) (src_reg + off) +BPF_XADD | BPF_W | BPF_STX: lock xadd *(u32 *)(dst_reg + off16) += src_reg +BPF_XADD | BPF_DW | BPF_STX: lock xadd *(u64 *)(dst_reg + off16) += src_reg + +Where size is one of: BPF_B or BPF_H or BPF_W or BPF_DW. Note that 1 and +2 byte atomic increments are not supported. + +Testing +------- + +Next to the BPF toolchain, the kernel also ships a test module that contains +various test cases for classic and internal BPF that can be executed against +the BPF interpreter and JIT compiler. It can be found in lib/test_bpf.c and +enabled via Kconfig: + + CONFIG_TEST_BPF=m + +After the module has been built and installed, the test suite can be executed +via insmod or modprobe against 'test_bpf' module. Results of the test cases +including timings in nsec can be found in the kernel log (dmesg). + +Misc +---- + +Also trinity, the Linux syscall fuzzer, has built-in support for BPF and +SECCOMP-BPF kernel fuzzing. + +Written by +---------- + +The document was written in the hope that it is found useful and in order +to give potential BPF hackers or security auditors a better overview of +the underlying architecture. + +Jay Schulist <jschlst@samba.org> +Daniel Borkmann <dborkman@redhat.com> +Alexei Starovoitov <ast@plumgrid.com> diff --git a/Documentation/networking/fore200e.txt b/Documentation/networking/fore200e.txt index b1f337f0f4c..d52af53efdc 100644 --- a/Documentation/networking/fore200e.txt +++ b/Documentation/networking/fore200e.txt @@ -11,12 +11,10 @@ i386, alpha (untested), powerpc, sparc and sparc64 archs. The intent is to enable the use of different models of FORE adapters at the same time, by hosts that have several bus interfaces (such as PCI+SBUS, -PCI+MCA or PCI+EISA). +or PCI+EISA). Only PCI and SBUS devices are currently supported by the driver, but support -for other bus interfaces such as EISA should not be too hard to add (this may -be more tricky for the MCA bus, though, as FORE made some MCA-specific -modifications to the adapter's AALI interface). +for other bus interfaces such as EISA should not be too hard to add. Firmware Copyright Notice @@ -39,12 +37,12 @@ version. Alternative binary firmware images can be found somewhere on the ForeThought CD-ROM supplied with your adapter by FORE Systems. You can also get the latest firmware images from FORE Systems at -http://www.fore.com. Register TACTics Online and go to +http://en.wikipedia.org/wiki/FORE_Systems. Register TACTics Online and go to the 'software updates' pages. The firmware binaries are part of the various ForeThought software distributions. Notice that different versions of the PCA-200E firmware exist, depending -on the endianess of the host architecture. The driver is shipped with +on the endianness of the host architecture. The driver is shipped with both little and big endian PCA firmware images. Name and location of the new firmware images can be set at kernel diff --git a/Documentation/networking/gen_stats.txt b/Documentation/networking/gen_stats.txt index c3297f79c13..70e6275b757 100644 --- a/Documentation/networking/gen_stats.txt +++ b/Documentation/networking/gen_stats.txt @@ -79,8 +79,8 @@ Rate Estimator: 0) Prepare an estimator attribute. Most likely this would be in user space. The value of this TLV should contain a tc_estimator structure. - As usual, such a TLV nees to be 32 bit aligned and therefore the - length needs to be appropriately set etc. The estimator interval + As usual, such a TLV needs to be 32 bit aligned and therefore the + length needs to be appropriately set, etc. The estimator interval and ewma log need to be converted to the appropriate values. tc_estimator.c::tc_setup_estimator() is advisable to be used as the conversion routine. It does a few clever things. It takes a time @@ -103,8 +103,8 @@ In the kernel when setting up: else failed -From now on, everytime you dump my_rate_est_stats it will contain -uptodate info. +From now on, every time you dump my_rate_est_stats it will contain +up-to-date info. Once you are done, call gen_kill_estimator(my_basicstats, my_rate_est_stats) Make sure that my_basicstats and my_rate_est_stats diff --git a/Documentation/networking/generic-hdlc.txt b/Documentation/networking/generic-hdlc.txt index 31bc8b759b7..4eb3cc40b70 100644 --- a/Documentation/networking/generic-hdlc.txt +++ b/Documentation/networking/generic-hdlc.txt @@ -3,15 +3,15 @@ Krzysztof Halasa <khc@pm.waw.pl> Generic HDLC layer currently supports: -1. Frame Relay (ANSI, CCITT, Cisco and no LMI). +1. Frame Relay (ANSI, CCITT, Cisco and no LMI) - Normal (routed) and Ethernet-bridged (Ethernet device emulation) interfaces can share a single PVC. - ARP support (no InARP support in the kernel - there is an experimental InARP user-space daemon available on: http://www.kernel.org/pub/linux/utils/net/hdlc/). -2. raw HDLC - either IP (IPv4) interface or Ethernet device emulation. -3. Cisco HDLC. -4. PPP (uses syncppp.c). +2. raw HDLC - either IP (IPv4) interface or Ethernet device emulation +3. Cisco HDLC +4. PPP 5. X.25 (uses X.25 routines). Generic HDLC is a protocol driver only - it needs a low-level driver diff --git a/Documentation/networking/generic_netlink.txt b/Documentation/networking/generic_netlink.txt new file mode 100644 index 00000000000..3e071115ca9 --- /dev/null +++ b/Documentation/networking/generic_netlink.txt @@ -0,0 +1,3 @@ +A wiki document on how to use Generic Netlink can be found here: + + * http://www.linuxfoundation.org/collaborate/workgroups/networking/generic_netlink_howto diff --git a/Documentation/networking/gianfar.txt b/Documentation/networking/gianfar.txt new file mode 100644 index 00000000000..ba1daea7f2e --- /dev/null +++ b/Documentation/networking/gianfar.txt @@ -0,0 +1,42 @@ +The Gianfar Ethernet Driver + +Author: Andy Fleming <afleming@freescale.com> +Updated: 2005-07-28 + + +CHECKSUM OFFLOADING + +The eTSEC controller (first included in parts from late 2005 like +the 8548) has the ability to perform TCP, UDP, and IP checksums +in hardware. The Linux kernel only offloads the TCP and UDP +checksums (and always performs the pseudo header checksums), so +the driver only supports checksumming for TCP/IP and UDP/IP +packets. Use ethtool to enable or disable this feature for RX +and TX. + +VLAN + +In order to use VLAN, please consult Linux documentation on +configuring VLANs. The gianfar driver supports hardware insertion and +extraction of VLAN headers, but not filtering. Filtering will be +done by the kernel. + +MULTICASTING + +The gianfar driver supports using the group hash table on the +TSEC (and the extended hash table on the eTSEC) for multicast +filtering. On the eTSEC, the exact-match MAC registers are used +before the hash tables. See Linux documentation on how to join +multicast groups. + +PADDING + +The gianfar driver supports padding received frames with 2 bytes +to align the IP header to a 16-byte boundary, when supported by +hardware. + +ETHTOOL + +The gianfar driver supports the use of ethtool for many +configuration options. You must run ethtool only on currently +open interfaces. See ethtool documentation for details. diff --git a/Documentation/networking/i40e.txt b/Documentation/networking/i40e.txt new file mode 100644 index 00000000000..f737273c6dc --- /dev/null +++ b/Documentation/networking/i40e.txt @@ -0,0 +1,115 @@ +Linux Base Driver for the Intel(R) Ethernet Controller XL710 Family +=================================================================== + +Intel i40e Linux driver. +Copyright(c) 2013 Intel Corporation. + +Contents +======== + +- Identifying Your Adapter +- Additional Configurations +- Performance Tuning +- Known Issues +- Support + + +Identifying Your Adapter +======================== + +The driver in this release is compatible with the Intel Ethernet +Controller XL710 Family. + +For more information on how to identify your adapter, go to the Adapter & +Driver ID Guide at: + + http://support.intel.com/support/network/sb/CS-012904.htm + + +Enabling the driver +=================== + +The driver is enabled via the standard kernel configuration system, +using the make command: + + Make oldconfig/silentoldconfig/menuconfig/etc. + +The driver is located in the menu structure at: + + -> Device Drivers + -> Network device support (NETDEVICES [=y]) + -> Ethernet driver support + -> Intel devices + -> Intel(R) Ethernet Controller XL710 Family + +Additional Configurations +========================= + + Generic Receive Offload (GRO) + ----------------------------- + The driver supports the in-kernel software implementation of GRO. GRO has + shown that by coalescing Rx traffic into larger chunks of data, CPU + utilization can be significantly reduced when under large Rx load. GRO is + an evolution of the previously-used LRO interface. GRO is able to coalesce + other protocols besides TCP. It's also safe to use with configurations that + are problematic for LRO, namely bridging and iSCSI. + + Ethtool + ------- + The driver utilizes the ethtool interface for driver configuration and + diagnostics, as well as displaying statistical information. The latest + ethtool version is required for this functionality. + + The latest release of ethtool can be found from + https://www.kernel.org/pub/software/network/ethtool + + Data Center Bridging (DCB) + -------------------------- + DCB configuration is not currently supported. + + FCoE + ---- + Fiber Channel over Ethernet (FCoE) hardware offload is not currently + supported. + + MAC and VLAN anti-spoofing feature + ---------------------------------- + When a malicious driver attempts to send a spoofed packet, it is dropped by + the hardware and not transmitted. An interrupt is sent to the PF driver + notifying it of the spoof attempt. + + When a spoofed packet is detected the PF driver will send the following + message to the system log (displayed by the "dmesg" command): + + Spoof event(s) detected on VF (n) + + Where n=the VF that attempted to do the spoofing. + + +Performance Tuning +================== + +An excellent article on performance tuning can be found at: + +http://www.redhat.com/promo/summit/2008/downloads/pdf/Thursday/Mark_Wagner.pdf + + +Known Issues +============ + + +Support +======= + +For general information, go to the Intel support website at: + + http://support.intel.com + +or the Intel Wired Networking project hosted by Sourceforge at: + + http://e1000.sourceforge.net + +If an issue is identified with the released source code on the supported +kernel with a supported adapter, email the specific information related +to the issue to e1000-devel@lists.sourceforge.net and copy +netdev@vger.kernel.org. diff --git a/Documentation/networking/i40evf.txt b/Documentation/networking/i40evf.txt new file mode 100644 index 00000000000..21e41271af7 --- /dev/null +++ b/Documentation/networking/i40evf.txt @@ -0,0 +1,47 @@ +Linux* Base Driver for Intel(R) Network Connection +================================================== + +Intel XL710 X710 Virtual Function Linux driver. +Copyright(c) 2013 Intel Corporation. + +Contents +======== + +- Identifying Your Adapter +- Known Issues/Troubleshooting +- Support + +This file describes the i40evf Linux* Base Driver for the Intel(R) XL710 +X710 Virtual Function. + +The i40evf driver supports XL710 and X710 virtual function devices that +can only be activated on kernels with CONFIG_PCI_IOV enabled. + +The guest OS loading the i40evf driver must support MSI-X interrupts. + +Identifying Your Adapter +======================== + +For more information on how to identify your adapter, go to the Adapter & +Driver ID Guide at: + + http://support.intel.com/support/go/network/adapter/idguide.htm + +Known Issues/Troubleshooting +============================ + + +Support +======= + +For general information, go to the Intel support website at: + + http://support.intel.com + +or the Intel Wired Networking project hosted by Sourceforge at: + + http://sourceforge.net/projects/e1000 + +If an issue is identified with the released source code on the supported +kernel with a supported adapter, email the specific information related +to the issue to e1000-devel@lists.sf.net diff --git a/Documentation/networking/ieee802154.txt b/Documentation/networking/ieee802154.txt new file mode 100644 index 00000000000..22bbc7225f8 --- /dev/null +++ b/Documentation/networking/ieee802154.txt @@ -0,0 +1,151 @@ + + Linux IEEE 802.15.4 implementation + + +Introduction +============ +The IEEE 802.15.4 working group focuses on standardization of bottom +two layers: Medium Access Control (MAC) and Physical (PHY). And there +are mainly two options available for upper layers: + - ZigBee - proprietary protocol from ZigBee Alliance + - 6LowPAN - IPv6 networking over low rate personal area networks + +The Linux-ZigBee project goal is to provide complete implementation +of IEEE 802.15.4 and 6LoWPAN protocols. IEEE 802.15.4 is a stack +of protocols for organizing Low-Rate Wireless Personal Area Networks. + +The stack is composed of three main parts: + - IEEE 802.15.4 layer; We have chosen to use plain Berkeley socket API, + the generic Linux networking stack to transfer IEEE 802.15.4 messages + and a special protocol over genetlink for configuration/management + - MAC - provides access to shared channel and reliable data delivery + - PHY - represents device drivers + + +Socket API +========== + +int sd = socket(PF_IEEE802154, SOCK_DGRAM, 0); +..... + +The address family, socket addresses etc. are defined in the +include/net/af_ieee802154.h header or in the special header +in our userspace package (see either linux-zigbee sourceforge download page +or git tree at git://linux-zigbee.git.sourceforge.net/gitroot/linux-zigbee). + +One can use SOCK_RAW for passing raw data towards device xmit function. YMMV. + + +Kernel side +============= + +Like with WiFi, there are several types of devices implementing IEEE 802.15.4. +1) 'HardMAC'. The MAC layer is implemented in the device itself, the device + exports MLME and data API. +2) 'SoftMAC' or just radio. These types of devices are just radio transceivers + possibly with some kinds of acceleration like automatic CRC computation and + comparation, automagic ACK handling, address matching, etc. + +Those types of devices require different approach to be hooked into Linux kernel. + + +MLME - MAC Level Management +============================ + +Most of IEEE 802.15.4 MLME interfaces are directly mapped on netlink commands. +See the include/net/nl802154.h header. Our userspace tools package +(see above) provides CLI configuration utility for radio interfaces and simple +coordinator for IEEE 802.15.4 networks as an example users of MLME protocol. + + +HardMAC +======= + +See the header include/net/ieee802154_netdev.h. You have to implement Linux +net_device, with .type = ARPHRD_IEEE802154. Data is exchanged with socket family +code via plain sk_buffs. On skb reception skb->cb must contain additional +info as described in the struct ieee802154_mac_cb. During packet transmission +the skb->cb is used to provide additional data to device's header_ops->create +function. Be aware that this data can be overridden later (when socket code +submits skb to qdisc), so if you need something from that cb later, you should +store info in the skb->data on your own. + +To hook the MLME interface you have to populate the ml_priv field of your +net_device with a pointer to struct ieee802154_mlme_ops instance. The fields +assoc_req, assoc_resp, disassoc_req, start_req, and scan_req are optional. +All other fields are required. + +We provide an example of simple HardMAC driver at drivers/ieee802154/fakehard.c + + +SoftMAC +======= + +The MAC is the middle layer in the IEEE 802.15.4 Linux stack. This moment it +provides interface for drivers registration and management of slave interfaces. + +NOTE: Currently the only monitor device type is supported - it's IEEE 802.15.4 +stack interface for network sniffers (e.g. WireShark). + +This layer is going to be extended soon. + +See header include/net/mac802154.h and several drivers in drivers/ieee802154/. + + +Device drivers API +================== + +The include/net/mac802154.h defines following functions: + - struct ieee802154_dev *ieee802154_alloc_device + (size_t priv_size, struct ieee802154_ops *ops): + allocation of IEEE 802.15.4 compatible device + + - void ieee802154_free_device(struct ieee802154_dev *dev): + freeing allocated device + + - int ieee802154_register_device(struct ieee802154_dev *dev): + register PHY in the system + + - void ieee802154_unregister_device(struct ieee802154_dev *dev): + freeing registered PHY + +Moreover IEEE 802.15.4 device operations structure should be filled. + +Fake drivers +============ + +In addition there are two drivers available which simulate real devices with +HardMAC (fakehard) and SoftMAC (fakelb - IEEE 802.15.4 loopback driver) +interfaces. This option provides possibility to test and debug stack without +usage of real hardware. + +See sources in drivers/ieee802154 folder for more details. + + +6LoWPAN Linux implementation +============================ + +The IEEE 802.15.4 standard specifies an MTU of 128 bytes, yielding about 80 +octets of actual MAC payload once security is turned on, on a wireless link +with a link throughput of 250 kbps or less. The 6LoWPAN adaptation format +[RFC4944] was specified to carry IPv6 datagrams over such constrained links, +taking into account limited bandwidth, memory, or energy resources that are +expected in applications such as wireless Sensor Networks. [RFC4944] defines +a Mesh Addressing header to support sub-IP forwarding, a Fragmentation header +to support the IPv6 minimum MTU requirement [RFC2460], and stateless header +compression for IPv6 datagrams (LOWPAN_HC1 and LOWPAN_HC2) to reduce the +relatively large IPv6 and UDP headers down to (in the best case) several bytes. + +In Semptember 2011 the standard update was published - [RFC6282]. +It deprecates HC1 and HC2 compression and defines IPHC encoding format which is +used in this Linux implementation. + +All the code related to 6lowpan you may find in files: net/ieee802154/6lowpan.* + +To setup 6lowpan interface you need (busybox release > 1.17.0): +1. Add IEEE802.15.4 interface and initialize PANid; +2. Add 6lowpan interface by command like: + # ip link add link wpan0 name lowpan0 type lowpan +3. Set MAC (if needs): + # ip link set lowpan0 address de:ad:be:ef:ca:fe:ba:be +4. Bring up 'lowpan0' interface diff --git a/Documentation/networking/ifenslave.c b/Documentation/networking/ifenslave.c deleted file mode 100644 index f315d20d386..00000000000 --- a/Documentation/networking/ifenslave.c +++ /dev/null @@ -1,1110 +0,0 @@ -/* Mode: C; - * ifenslave.c: Configure network interfaces for parallel routing. - * - * This program controls the Linux implementation of running multiple - * network interfaces in parallel. - * - * Author: Donald Becker <becker@cesdis.gsfc.nasa.gov> - * Copyright 1994-1996 Donald Becker - * - * This program is free software; you can redistribute it - * and/or modify it under the terms of the GNU General Public - * License as published by the Free Software Foundation. - * - * The author may be reached as becker@CESDIS.gsfc.nasa.gov, or C/O - * Center of Excellence in Space Data and Information Sciences - * Code 930.5, Goddard Space Flight Center, Greenbelt MD 20771 - * - * Changes : - * - 2000/10/02 Willy Tarreau <willy at meta-x.org> : - * - few fixes. Master's MAC address is now correctly taken from - * the first device when not previously set ; - * - detach support : call BOND_RELEASE to detach an enslaved interface. - * - give a mini-howto from command-line help : # ifenslave -h - * - * - 2001/02/16 Chad N. Tindel <ctindel at ieee dot org> : - * - Master is now brought down before setting the MAC address. In - * the 2.4 kernel you can't change the MAC address while the device is - * up because you get EBUSY. - * - * - 2001/09/13 Takao Indoh <indou dot takao at jp dot fujitsu dot com> - * - Added the ability to change the active interface on a mode 1 bond - * at runtime. - * - * - 2001/10/23 Chad N. Tindel <ctindel at ieee dot org> : - * - No longer set the MAC address of the master. The bond device will - * take care of this itself - * - Try the SIOC*** versions of the bonding ioctls before using the - * old versions - * - 2002/02/18 Erik Habbinga <erik_habbinga @ hp dot com> : - * - ifr2.ifr_flags was not initialized in the hwaddr_notset case, - * SIOCGIFFLAGS now called before hwaddr_notset test - * - * - 2002/10/31 Tony Cureington <tony.cureington * hp_com> : - * - If the master does not have a hardware address when the first slave - * is enslaved, the master is assigned the hardware address of that - * slave - there is a comment in bonding.c stating "ifenslave takes - * care of this now." This corrects the problem of slaves having - * different hardware addresses in active-backup mode when - * multiple interfaces are specified on a single ifenslave command - * (ifenslave bond0 eth0 eth1). - * - * - 2003/03/18 - Tsippy Mendelson <tsippy.mendelson at intel dot com> and - * Shmulik Hen <shmulik.hen at intel dot com> - * - Moved setting the slave's mac address and openning it, from - * the application to the driver. This enables support of modes - * that need to use the unique mac address of each slave. - * The driver also takes care of closing the slave and restoring its - * original mac address upon release. - * In addition, block possibility of enslaving before the master is up. - * This prevents putting the system in an undefined state. - * - * - 2003/05/01 - Amir Noam <amir.noam at intel dot com> - * - Added ABI version control to restore compatibility between - * new/old ifenslave and new/old bonding. - * - Prevent adding an adapter that is already a slave. - * Fixes the problem of stalling the transmission and leaving - * the slave in a down state. - * - * - 2003/05/01 - Shmulik Hen <shmulik.hen at intel dot com> - * - Prevent enslaving if the bond device is down. - * Fixes the problem of leaving the system in unstable state and - * halting when trying to remove the module. - * - Close socket on all abnormal exists. - * - Add versioning scheme that follows that of the bonding driver. - * current version is 1.0.0 as a base line. - * - * - 2003/05/22 - Jay Vosburgh <fubar at us dot ibm dot com> - * - ifenslave -c was broken; it's now fixed - * - Fixed problem with routes vanishing from master during enslave - * processing. - * - * - 2003/05/27 - Amir Noam <amir.noam at intel dot com> - * - Fix backward compatibility issues: - * For drivers not using ABI versions, slave was set down while - * it should be left up before enslaving. - * Also, master was not set down and the default set_mac_address() - * would fail and generate an error message in the system log. - * - For opt_c: slave should not be set to the master's setting - * while it is running. It was already set during enslave. To - * simplify things, it is now handeled separately. - * - * - 2003/12/01 - Shmulik Hen <shmulik.hen at intel dot com> - * - Code cleanup and style changes - * set version to 1.1.0 - */ - -#define APP_VERSION "1.1.0" -#define APP_RELDATE "December 1, 2003" -#define APP_NAME "ifenslave" - -static char *version = -APP_NAME ".c:v" APP_VERSION " (" APP_RELDATE ")\n" -"o Donald Becker (becker@cesdis.gsfc.nasa.gov).\n" -"o Detach support added on 2000/10/02 by Willy Tarreau (willy at meta-x.org).\n" -"o 2.4 kernel support added on 2001/02/16 by Chad N. Tindel\n" -" (ctindel at ieee dot org).\n"; - -static const char *usage_msg = -"Usage: ifenslave [-f] <master-if> <slave-if> [<slave-if>...]\n" -" ifenslave -d <master-if> <slave-if> [<slave-if>...]\n" -" ifenslave -c <master-if> <slave-if>\n" -" ifenslave --help\n"; - -static const char *help_msg = -"\n" -" To create a bond device, simply follow these three steps :\n" -" - ensure that the required drivers are properly loaded :\n" -" # modprobe bonding ; modprobe <3c59x|eepro100|pcnet32|tulip|...>\n" -" - assign an IP address to the bond device :\n" -" # ifconfig bond0 <addr> netmask <mask> broadcast <bcast>\n" -" - attach all the interfaces you need to the bond device :\n" -" # ifenslave [{-f|--force}] bond0 eth0 [eth1 [eth2]...]\n" -" If bond0 didn't have a MAC address, it will take eth0's. Then, all\n" -" interfaces attached AFTER this assignment will get the same MAC addr.\n" -" (except for ALB/TLB modes)\n" -"\n" -" To set the bond device down and automatically release all the slaves :\n" -" # ifconfig bond0 down\n" -"\n" -" To detach a dead interface without setting the bond device down :\n" -" # ifenslave {-d|--detach} bond0 eth0 [eth1 [eth2]...]\n" -"\n" -" To change active slave :\n" -" # ifenslave {-c|--change-active} bond0 eth0\n" -"\n" -" To show master interface info\n" -" # ifenslave bond0\n" -"\n" -" To show all interfaces info\n" -" # ifenslave {-a|--all-interfaces}\n" -"\n" -" To be more verbose\n" -" # ifenslave {-v|--verbose} ...\n" -"\n" -" # ifenslave {-u|--usage} Show usage\n" -" # ifenslave {-V|--version} Show version\n" -" # ifenslave {-h|--help} This message\n" -"\n"; - -#include <unistd.h> -#include <stdlib.h> -#include <stdio.h> -#include <ctype.h> -#include <string.h> -#include <errno.h> -#include <fcntl.h> -#include <getopt.h> -#include <sys/types.h> -#include <sys/socket.h> -#include <sys/ioctl.h> -#include <linux/if.h> -#include <net/if_arp.h> -#include <linux/if_ether.h> -#include <linux/if_bonding.h> -#include <linux/sockios.h> - -typedef unsigned long long u64; /* hack, so we may include kernel's ethtool.h */ -typedef __uint32_t u32; /* ditto */ -typedef __uint16_t u16; /* ditto */ -typedef __uint8_t u8; /* ditto */ -#include <linux/ethtool.h> - -struct option longopts[] = { - /* { name has_arg *flag val } */ - {"all-interfaces", 0, 0, 'a'}, /* Show all interfaces. */ - {"change-active", 0, 0, 'c'}, /* Change the active slave. */ - {"detach", 0, 0, 'd'}, /* Detach a slave interface. */ - {"force", 0, 0, 'f'}, /* Force the operation. */ - {"help", 0, 0, 'h'}, /* Give help */ - {"usage", 0, 0, 'u'}, /* Give usage */ - {"verbose", 0, 0, 'v'}, /* Report each action taken. */ - {"version", 0, 0, 'V'}, /* Emit version information. */ - { 0, 0, 0, 0} -}; - -/* Command-line flags. */ -unsigned int -opt_a = 0, /* Show-all-interfaces flag. */ -opt_c = 0, /* Change-active-slave flag. */ -opt_d = 0, /* Detach a slave interface. */ -opt_f = 0, /* Force the operation. */ -opt_h = 0, /* Help */ -opt_u = 0, /* Usage */ -opt_v = 0, /* Verbose flag. */ -opt_V = 0; /* Version */ - -int skfd = -1; /* AF_INET socket for ioctl() calls.*/ -int abi_ver = 0; /* userland - kernel ABI version */ -int hwaddr_set = 0; /* Master's hwaddr is set */ -int saved_errno; - -struct ifreq master_mtu, master_flags, master_hwaddr; -struct ifreq slave_mtu, slave_flags, slave_hwaddr; - -struct dev_ifr { - struct ifreq *req_ifr; - char *req_name; - int req_type; -}; - -struct dev_ifr master_ifra[] = { - {&master_mtu, "SIOCGIFMTU", SIOCGIFMTU}, - {&master_flags, "SIOCGIFFLAGS", SIOCGIFFLAGS}, - {&master_hwaddr, "SIOCGIFHWADDR", SIOCGIFHWADDR}, - {NULL, "", 0} -}; - -struct dev_ifr slave_ifra[] = { - {&slave_mtu, "SIOCGIFMTU", SIOCGIFMTU}, - {&slave_flags, "SIOCGIFFLAGS", SIOCGIFFLAGS}, - {&slave_hwaddr, "SIOCGIFHWADDR", SIOCGIFHWADDR}, - {NULL, "", 0} -}; - -static void if_print(char *ifname); -static int get_drv_info(char *master_ifname); -static int get_if_settings(char *ifname, struct dev_ifr ifra[]); -static int get_slave_flags(char *slave_ifname); -static int set_master_hwaddr(char *master_ifname, struct sockaddr *hwaddr); -static int set_slave_hwaddr(char *slave_ifname, struct sockaddr *hwaddr); -static int set_slave_mtu(char *slave_ifname, int mtu); -static int set_if_flags(char *ifname, short flags); -static int set_if_up(char *ifname, short flags); -static int set_if_down(char *ifname, short flags); -static int clear_if_addr(char *ifname); -static int set_if_addr(char *master_ifname, char *slave_ifname); -static int change_active(char *master_ifname, char *slave_ifname); -static int enslave(char *master_ifname, char *slave_ifname); -static int release(char *master_ifname, char *slave_ifname); -#define v_print(fmt, args...) \ - if (opt_v) \ - fprintf(stderr, fmt, ## args ) - -int main(int argc, char *argv[]) -{ - char **spp, *master_ifname, *slave_ifname; - int c, i, rv; - int res = 0; - int exclusive = 0; - - while ((c = getopt_long(argc, argv, "acdfhuvV", longopts, 0)) != EOF) { - switch (c) { - case 'a': opt_a++; exclusive++; break; - case 'c': opt_c++; exclusive++; break; - case 'd': opt_d++; exclusive++; break; - case 'f': opt_f++; exclusive++; break; - case 'h': opt_h++; exclusive++; break; - case 'u': opt_u++; exclusive++; break; - case 'v': opt_v++; break; - case 'V': opt_V++; exclusive++; break; - - case '?': - fprintf(stderr, usage_msg); - res = 2; - goto out; - } - } - - /* options check */ - if (exclusive > 1) { - fprintf(stderr, usage_msg); - res = 2; - goto out; - } - - if (opt_v || opt_V) { - printf(version); - if (opt_V) { - res = 0; - goto out; - } - } - - if (opt_u) { - printf(usage_msg); - res = 0; - goto out; - } - - if (opt_h) { - printf(usage_msg); - printf(help_msg); - res = 0; - goto out; - } - - /* Open a basic socket */ - if ((skfd = socket(AF_INET, SOCK_DGRAM, 0)) < 0) { - perror("socket"); - res = 1; - goto out; - } - - if (opt_a) { - if (optind == argc) { - /* No remaining args */ - /* show all interfaces */ - if_print((char *)NULL); - goto out; - } else { - /* Just show usage */ - fprintf(stderr, usage_msg); - res = 2; - goto out; - } - } - - /* Copy the interface name */ - spp = argv + optind; - master_ifname = *spp++; - - if (master_ifname == NULL) { - fprintf(stderr, usage_msg); - res = 2; - goto out; - } - - /* exchange abi version with bonding module */ - res = get_drv_info(master_ifname); - if (res) { - fprintf(stderr, - "Master '%s': Error: handshake with driver failed. " - "Aborting\n", - master_ifname); - goto out; - } - - slave_ifname = *spp++; - - if (slave_ifname == NULL) { - if (opt_d || opt_c) { - fprintf(stderr, usage_msg); - res = 2; - goto out; - } - - /* A single arg means show the - * configuration for this interface - */ - if_print(master_ifname); - goto out; - } - - res = get_if_settings(master_ifname, master_ifra); - if (res) { - /* Probably a good reason not to go on */ - fprintf(stderr, - "Master '%s': Error: get settings failed: %s. " - "Aborting\n", - master_ifname, strerror(res)); - goto out; - } - - /* check if master is indeed a master; - * if not then fail any operation - */ - if (!(master_flags.ifr_flags & IFF_MASTER)) { - fprintf(stderr, - "Illegal operation; the specified interface '%s' " - "is not a master. Aborting\n", - master_ifname); - res = 1; - goto out; - } - - /* check if master is up; if not then fail any operation */ - if (!(master_flags.ifr_flags & IFF_UP)) { - fprintf(stderr, - "Illegal operation; the specified master interface " - "'%s' is not up.\n", - master_ifname); - res = 1; - goto out; - } - - /* Only for enslaving */ - if (!opt_c && !opt_d) { - sa_family_t master_family = master_hwaddr.ifr_hwaddr.sa_family; - unsigned char *hwaddr = - (unsigned char *)master_hwaddr.ifr_hwaddr.sa_data; - - /* The family '1' is ARPHRD_ETHER for ethernet. */ - if (master_family != 1 && !opt_f) { - fprintf(stderr, - "Illegal operation: The specified master " - "interface '%s' is not ethernet-like.\n " - "This program is designed to work with " - "ethernet-like network interfaces.\n " - "Use the '-f' option to force the " - "operation.\n", - master_ifname); - res = 1; - goto out; - } - - /* Check master's hw addr */ - for (i = 0; i < 6; i++) { - if (hwaddr[i] != 0) { - hwaddr_set = 1; - break; - } - } - - if (hwaddr_set) { - v_print("current hardware address of master '%s' " - "is %2.2x:%2.2x:%2.2x:%2.2x:%2.2x:%2.2x, " - "type %d\n", - master_ifname, - hwaddr[0], hwaddr[1], - hwaddr[2], hwaddr[3], - hwaddr[4], hwaddr[5], - master_family); - } - } - - /* Accepts only one slave */ - if (opt_c) { - /* change active slave */ - res = get_slave_flags(slave_ifname); - if (res) { - fprintf(stderr, - "Slave '%s': Error: get flags failed. " - "Aborting\n", - slave_ifname); - goto out; - } - res = change_active(master_ifname, slave_ifname); - if (res) { - fprintf(stderr, - "Master '%s', Slave '%s': Error: " - "Change active failed\n", - master_ifname, slave_ifname); - } - } else { - /* Accept multiple slaves */ - do { - if (opt_d) { - /* detach a slave interface from the master */ - rv = get_slave_flags(slave_ifname); - if (rv) { - /* Can't work with this slave. */ - /* remember the error and skip it*/ - fprintf(stderr, - "Slave '%s': Error: get flags " - "failed. Skipping\n", - slave_ifname); - res = rv; - continue; - } - rv = release(master_ifname, slave_ifname); - if (rv) { - fprintf(stderr, - "Master '%s', Slave '%s': Error: " - "Release failed\n", - master_ifname, slave_ifname); - res = rv; - } - } else { - /* attach a slave interface to the master */ - rv = get_if_settings(slave_ifname, slave_ifra); - if (rv) { - /* Can't work with this slave. */ - /* remember the error and skip it*/ - fprintf(stderr, - "Slave '%s': Error: get " - "settings failed: %s. " - "Skipping\n", - slave_ifname, strerror(rv)); - res = rv; - continue; - } - rv = enslave(master_ifname, slave_ifname); - if (rv) { - fprintf(stderr, - "Master '%s', Slave '%s': Error: " - "Enslave failed\n", - master_ifname, slave_ifname); - res = rv; - } - } - } while ((slave_ifname = *spp++) != NULL); - } - -out: - if (skfd >= 0) { - close(skfd); - } - - return res; -} - -static short mif_flags; - -/* Get the inteface configuration from the kernel. */ -static int if_getconfig(char *ifname) -{ - struct ifreq ifr; - int metric, mtu; /* Parameters of the master interface. */ - struct sockaddr dstaddr, broadaddr, netmask; - unsigned char *hwaddr; - - strcpy(ifr.ifr_name, ifname); - if (ioctl(skfd, SIOCGIFFLAGS, &ifr) < 0) - return -1; - mif_flags = ifr.ifr_flags; - printf("The result of SIOCGIFFLAGS on %s is %x.\n", - ifname, ifr.ifr_flags); - - strcpy(ifr.ifr_name, ifname); - if (ioctl(skfd, SIOCGIFADDR, &ifr) < 0) - return -1; - printf("The result of SIOCGIFADDR is %2.2x.%2.2x.%2.2x.%2.2x.\n", - ifr.ifr_addr.sa_data[0], ifr.ifr_addr.sa_data[1], - ifr.ifr_addr.sa_data[2], ifr.ifr_addr.sa_data[3]); - - strcpy(ifr.ifr_name, ifname); - if (ioctl(skfd, SIOCGIFHWADDR, &ifr) < 0) - return -1; - - /* Gotta convert from 'char' to unsigned for printf(). */ - hwaddr = (unsigned char *)ifr.ifr_hwaddr.sa_data; - printf("The result of SIOCGIFHWADDR is type %d " - "%2.2x:%2.2x:%2.2x:%2.2x:%2.2x:%2.2x.\n", - ifr.ifr_hwaddr.sa_family, hwaddr[0], hwaddr[1], - hwaddr[2], hwaddr[3], hwaddr[4], hwaddr[5]); - - strcpy(ifr.ifr_name, ifname); - if (ioctl(skfd, SIOCGIFMETRIC, &ifr) < 0) { - metric = 0; - } else - metric = ifr.ifr_metric; - - strcpy(ifr.ifr_name, ifname); - if (ioctl(skfd, SIOCGIFMTU, &ifr) < 0) - mtu = 0; - else - mtu = ifr.ifr_mtu; - - strcpy(ifr.ifr_name, ifname); - if (ioctl(skfd, SIOCGIFDSTADDR, &ifr) < 0) { - memset(&dstaddr, 0, sizeof(struct sockaddr)); - } else - dstaddr = ifr.ifr_dstaddr; - - strcpy(ifr.ifr_name, ifname); - if (ioctl(skfd, SIOCGIFBRDADDR, &ifr) < 0) { - memset(&broadaddr, 0, sizeof(struct sockaddr)); - } else - broadaddr = ifr.ifr_broadaddr; - - strcpy(ifr.ifr_name, ifname); - if (ioctl(skfd, SIOCGIFNETMASK, &ifr) < 0) { - memset(&netmask, 0, sizeof(struct sockaddr)); - } else - netmask = ifr.ifr_netmask; - - return 0; -} - -static void if_print(char *ifname) -{ - char buff[1024]; - struct ifconf ifc; - struct ifreq *ifr; - int i; - - if (ifname == (char *)NULL) { - ifc.ifc_len = sizeof(buff); - ifc.ifc_buf = buff; - if (ioctl(skfd, SIOCGIFCONF, &ifc) < 0) { - perror("SIOCGIFCONF failed"); - return; - } - - ifr = ifc.ifc_req; - for (i = ifc.ifc_len / sizeof(struct ifreq); --i >= 0; ifr++) { - if (if_getconfig(ifr->ifr_name) < 0) { - fprintf(stderr, - "%s: unknown interface.\n", - ifr->ifr_name); - continue; - } - - if (((mif_flags & IFF_UP) == 0) && !opt_a) continue; - /*ife_print(&ife);*/ - } - } else { - if (if_getconfig(ifname) < 0) { - fprintf(stderr, - "%s: unknown interface.\n", ifname); - } - } -} - -static int get_drv_info(char *master_ifname) -{ - struct ifreq ifr; - struct ethtool_drvinfo info; - char *endptr; - - memset(&ifr, 0, sizeof(ifr)); - strncpy(ifr.ifr_name, master_ifname, IFNAMSIZ); - ifr.ifr_data = (caddr_t)&info; - - info.cmd = ETHTOOL_GDRVINFO; - strncpy(info.driver, "ifenslave", 32); - snprintf(info.fw_version, 32, "%d", BOND_ABI_VERSION); - - if (ioctl(skfd, SIOCETHTOOL, &ifr) < 0) { - if (errno == EOPNOTSUPP) { - goto out; - } - - saved_errno = errno; - v_print("Master '%s': Error: get bonding info failed %s\n", - master_ifname, strerror(saved_errno)); - return 1; - } - - abi_ver = strtoul(info.fw_version, &endptr, 0); - if (*endptr) { - v_print("Master '%s': Error: got invalid string as an ABI " - "version from the bonding module\n", - master_ifname); - return 1; - } - -out: - v_print("ABI ver is %d\n", abi_ver); - - return 0; -} - -static int change_active(char *master_ifname, char *slave_ifname) -{ - struct ifreq ifr; - int res = 0; - - if (!(slave_flags.ifr_flags & IFF_SLAVE)) { - fprintf(stderr, - "Illegal operation: The specified slave interface " - "'%s' is not a slave\n", - slave_ifname); - return 1; - } - - strncpy(ifr.ifr_name, master_ifname, IFNAMSIZ); - strncpy(ifr.ifr_slave, slave_ifname, IFNAMSIZ); - if ((ioctl(skfd, SIOCBONDCHANGEACTIVE, &ifr) < 0) && - (ioctl(skfd, BOND_CHANGE_ACTIVE_OLD, &ifr) < 0)) { - saved_errno = errno; - v_print("Master '%s': Error: SIOCBONDCHANGEACTIVE failed: " - "%s\n", - master_ifname, strerror(saved_errno)); - res = 1; - } - - return res; -} - -static int enslave(char *master_ifname, char *slave_ifname) -{ - struct ifreq ifr; - int res = 0; - - if (slave_flags.ifr_flags & IFF_SLAVE) { - fprintf(stderr, - "Illegal operation: The specified slave interface " - "'%s' is already a slave\n", - slave_ifname); - return 1; - } - - res = set_if_down(slave_ifname, slave_flags.ifr_flags); - if (res) { - fprintf(stderr, - "Slave '%s': Error: bring interface down failed\n", - slave_ifname); - return res; - } - - if (abi_ver < 2) { - /* Older bonding versions would panic if the slave has no IP - * address, so get the IP setting from the master. - */ - res = set_if_addr(master_ifname, slave_ifname); - if (res) { - fprintf(stderr, - "Slave '%s': Error: set address failed\n", - slave_ifname); - return res; - } - } else { - res = clear_if_addr(slave_ifname); - if (res) { - fprintf(stderr, - "Slave '%s': Error: clear address failed\n", - slave_ifname); - return res; - } - } - - if (master_mtu.ifr_mtu != slave_mtu.ifr_mtu) { - res = set_slave_mtu(slave_ifname, master_mtu.ifr_mtu); - if (res) { - fprintf(stderr, - "Slave '%s': Error: set MTU failed\n", - slave_ifname); - return res; - } - } - - if (hwaddr_set) { - /* Master already has an hwaddr - * so set it's hwaddr to the slave - */ - if (abi_ver < 1) { - /* The driver is using an old ABI, so - * the application sets the slave's - * hwaddr - */ - res = set_slave_hwaddr(slave_ifname, - &(master_hwaddr.ifr_hwaddr)); - if (res) { - fprintf(stderr, - "Slave '%s': Error: set hw address " - "failed\n", - slave_ifname); - goto undo_mtu; - } - - /* For old ABI the application needs to bring the - * slave back up - */ - res = set_if_up(slave_ifname, slave_flags.ifr_flags); - if (res) { - fprintf(stderr, - "Slave '%s': Error: bring interface " - "down failed\n", - slave_ifname); - goto undo_slave_mac; - } - } - /* The driver is using a new ABI, - * so the driver takes care of setting - * the slave's hwaddr and bringing - * it up again - */ - } else { - /* No hwaddr for master yet, so - * set the slave's hwaddr to it - */ - if (abi_ver < 1) { - /* For old ABI, the master needs to be - * down before setting it's hwaddr - */ - res = set_if_down(master_ifname, master_flags.ifr_flags); - if (res) { - fprintf(stderr, - "Master '%s': Error: bring interface " - "down failed\n", - master_ifname); - goto undo_mtu; - } - } - - res = set_master_hwaddr(master_ifname, - &(slave_hwaddr.ifr_hwaddr)); - if (res) { - fprintf(stderr, - "Master '%s': Error: set hw address " - "failed\n", - master_ifname); - goto undo_mtu; - } - - if (abi_ver < 1) { - /* For old ABI, bring the master - * back up - */ - res = set_if_up(master_ifname, master_flags.ifr_flags); - if (res) { - fprintf(stderr, - "Master '%s': Error: bring interface " - "up failed\n", - master_ifname); - goto undo_master_mac; - } - } - - hwaddr_set = 1; - } - - /* Do the real thing */ - strncpy(ifr.ifr_name, master_ifname, IFNAMSIZ); - strncpy(ifr.ifr_slave, slave_ifname, IFNAMSIZ); - if ((ioctl(skfd, SIOCBONDENSLAVE, &ifr) < 0) && - (ioctl(skfd, BOND_ENSLAVE_OLD, &ifr) < 0)) { - saved_errno = errno; - v_print("Master '%s': Error: SIOCBONDENSLAVE failed: %s\n", - master_ifname, strerror(saved_errno)); - res = 1; - } - - if (res) { - goto undo_master_mac; - } - - return 0; - -/* rollback (best effort) */ -undo_master_mac: - set_master_hwaddr(master_ifname, &(master_hwaddr.ifr_hwaddr)); - hwaddr_set = 0; - goto undo_mtu; -undo_slave_mac: - set_slave_hwaddr(slave_ifname, &(slave_hwaddr.ifr_hwaddr)); -undo_mtu: - set_slave_mtu(slave_ifname, slave_mtu.ifr_mtu); - return res; -} - -static int release(char *master_ifname, char *slave_ifname) -{ - struct ifreq ifr; - int res = 0; - - if (!(slave_flags.ifr_flags & IFF_SLAVE)) { - fprintf(stderr, - "Illegal operation: The specified slave interface " - "'%s' is not a slave\n", - slave_ifname); - return 1; - } - - strncpy(ifr.ifr_name, master_ifname, IFNAMSIZ); - strncpy(ifr.ifr_slave, slave_ifname, IFNAMSIZ); - if ((ioctl(skfd, SIOCBONDRELEASE, &ifr) < 0) && - (ioctl(skfd, BOND_RELEASE_OLD, &ifr) < 0)) { - saved_errno = errno; - v_print("Master '%s': Error: SIOCBONDRELEASE failed: %s\n", - master_ifname, strerror(saved_errno)); - return 1; - } else if (abi_ver < 1) { - /* The driver is using an old ABI, so we'll set the interface - * down to avoid any conflicts due to same MAC/IP - */ - res = set_if_down(slave_ifname, slave_flags.ifr_flags); - if (res) { - fprintf(stderr, - "Slave '%s': Error: bring interface " - "down failed\n", - slave_ifname); - } - } - - /* set to default mtu */ - set_slave_mtu(slave_ifname, 1500); - - return res; -} - -static int get_if_settings(char *ifname, struct dev_ifr ifra[]) -{ - int i; - int res = 0; - - for (i = 0; ifra[i].req_ifr; i++) { - strncpy(ifra[i].req_ifr->ifr_name, ifname, IFNAMSIZ); - res = ioctl(skfd, ifra[i].req_type, ifra[i].req_ifr); - if (res < 0) { - saved_errno = errno; - v_print("Interface '%s': Error: %s failed: %s\n", - ifname, ifra[i].req_name, - strerror(saved_errno)); - - return saved_errno; - } - } - - return 0; -} - -static int get_slave_flags(char *slave_ifname) -{ - int res = 0; - - strncpy(slave_flags.ifr_name, slave_ifname, IFNAMSIZ); - res = ioctl(skfd, SIOCGIFFLAGS, &slave_flags); - if (res < 0) { - saved_errno = errno; - v_print("Slave '%s': Error: SIOCGIFFLAGS failed: %s\n", - slave_ifname, strerror(saved_errno)); - } else { - v_print("Slave %s: flags %04X.\n", - slave_ifname, slave_flags.ifr_flags); - } - - return res; -} - -static int set_master_hwaddr(char *master_ifname, struct sockaddr *hwaddr) -{ - unsigned char *addr = (unsigned char *)hwaddr->sa_data; - struct ifreq ifr; - int res = 0; - - strncpy(ifr.ifr_name, master_ifname, IFNAMSIZ); - memcpy(&(ifr.ifr_hwaddr), hwaddr, sizeof(struct sockaddr)); - res = ioctl(skfd, SIOCSIFHWADDR, &ifr); - if (res < 0) { - saved_errno = errno; - v_print("Master '%s': Error: SIOCSIFHWADDR failed: %s\n", - master_ifname, strerror(saved_errno)); - return res; - } else { - v_print("Master '%s': hardware address set to " - "%2.2x:%2.2x:%2.2x:%2.2x:%2.2x:%2.2x.\n", - master_ifname, addr[0], addr[1], addr[2], - addr[3], addr[4], addr[5]); - } - - return res; -} - -static int set_slave_hwaddr(char *slave_ifname, struct sockaddr *hwaddr) -{ - unsigned char *addr = (unsigned char *)hwaddr->sa_data; - struct ifreq ifr; - int res = 0; - - strncpy(ifr.ifr_name, slave_ifname, IFNAMSIZ); - memcpy(&(ifr.ifr_hwaddr), hwaddr, sizeof(struct sockaddr)); - res = ioctl(skfd, SIOCSIFHWADDR, &ifr); - if (res < 0) { - saved_errno = errno; - - v_print("Slave '%s': Error: SIOCSIFHWADDR failed: %s\n", - slave_ifname, strerror(saved_errno)); - - if (saved_errno == EBUSY) { - v_print(" The device is busy: it must be idle " - "before running this command.\n"); - } else if (saved_errno == EOPNOTSUPP) { - v_print(" The device does not support setting " - "the MAC address.\n" - " Your kernel likely does not support slave " - "devices.\n"); - } else if (saved_errno == EINVAL) { - v_print(" The device's address type does not match " - "the master's address type.\n"); - } - return res; - } else { - v_print("Slave '%s': hardware address set to " - "%2.2x:%2.2x:%2.2x:%2.2x:%2.2x:%2.2x.\n", - slave_ifname, addr[0], addr[1], addr[2], - addr[3], addr[4], addr[5]); - } - - return res; -} - -static int set_slave_mtu(char *slave_ifname, int mtu) -{ - struct ifreq ifr; - int res = 0; - - ifr.ifr_mtu = mtu; - strncpy(ifr.ifr_name, slave_ifname, IFNAMSIZ); - - res = ioctl(skfd, SIOCSIFMTU, &ifr); - if (res < 0) { - saved_errno = errno; - v_print("Slave '%s': Error: SIOCSIFMTU failed: %s\n", - slave_ifname, strerror(saved_errno)); - } else { - v_print("Slave '%s': MTU set to %d.\n", slave_ifname, mtu); - } - - return res; -} - -static int set_if_flags(char *ifname, short flags) -{ - struct ifreq ifr; - int res = 0; - - ifr.ifr_flags = flags; - strncpy(ifr.ifr_name, ifname, IFNAMSIZ); - - res = ioctl(skfd, SIOCSIFFLAGS, &ifr); - if (res < 0) { - saved_errno = errno; - v_print("Interface '%s': Error: SIOCSIFFLAGS failed: %s\n", - ifname, strerror(saved_errno)); - } else { - v_print("Interface '%s': flags set to %04X.\n", ifname, flags); - } - - return res; -} - -static int set_if_up(char *ifname, short flags) -{ - return set_if_flags(ifname, flags | IFF_UP); -} - -static int set_if_down(char *ifname, short flags) -{ - return set_if_flags(ifname, flags & ~IFF_UP); -} - -static int clear_if_addr(char *ifname) -{ - struct ifreq ifr; - int res = 0; - - strncpy(ifr.ifr_name, ifname, IFNAMSIZ); - ifr.ifr_addr.sa_family = AF_INET; - memset(ifr.ifr_addr.sa_data, 0, sizeof(ifr.ifr_addr.sa_data)); - - res = ioctl(skfd, SIOCSIFADDR, &ifr); - if (res < 0) { - saved_errno = errno; - v_print("Interface '%s': Error: SIOCSIFADDR failed: %s\n", - ifname, strerror(saved_errno)); - } else { - v_print("Interface '%s': address cleared\n", ifname); - } - - return res; -} - -static int set_if_addr(char *master_ifname, char *slave_ifname) -{ - struct ifreq ifr; - int res; - unsigned char *ipaddr; - int i; - struct { - char *req_name; - char *desc; - int g_ioctl; - int s_ioctl; - } ifra[] = { - {"IFADDR", "addr", SIOCGIFADDR, SIOCSIFADDR}, - {"DSTADDR", "destination addr", SIOCGIFDSTADDR, SIOCSIFDSTADDR}, - {"BRDADDR", "broadcast addr", SIOCGIFBRDADDR, SIOCSIFBRDADDR}, - {"NETMASK", "netmask", SIOCGIFNETMASK, SIOCSIFNETMASK}, - {NULL, NULL, 0, 0}, - }; - - for (i = 0; ifra[i].req_name; i++) { - strncpy(ifr.ifr_name, master_ifname, IFNAMSIZ); - res = ioctl(skfd, ifra[i].g_ioctl, &ifr); - if (res < 0) { - int saved_errno = errno; - - v_print("Interface '%s': Error: SIOCG%s failed: %s\n", - master_ifname, ifra[i].req_name, - strerror(saved_errno)); - - ifr.ifr_addr.sa_family = AF_INET; - memset(ifr.ifr_addr.sa_data, 0, - sizeof(ifr.ifr_addr.sa_data)); - } - - strncpy(ifr.ifr_name, slave_ifname, IFNAMSIZ); - res = ioctl(skfd, ifra[i].s_ioctl, &ifr); - if (res < 0) { - int saved_errno = errno; - - v_print("Interface '%s': Error: SIOCS%s failed: %s\n", - slave_ifname, ifra[i].req_name, - strerror(saved_errno)); - - return res; - } - - ipaddr = ifr.ifr_addr.sa_data; - v_print("Interface '%s': set IP %s to %d.%d.%d.%d\n", - slave_ifname, ifra[i].desc, - ipaddr[0], ipaddr[1], ipaddr[2], ipaddr[3]); - } - - return 0; -} - -/* - * Local variables: - * version-control: t - * kept-new-versions: 5 - * c-indent-level: 4 - * c-basic-offset: 4 - * tab-width: 4 - * compile-command: "gcc -Wall -Wstrict-prototypes -O -I/usr/src/linux/include ifenslave.c -o ifenslave" - * End: - */ - diff --git a/Documentation/networking/igb.txt b/Documentation/networking/igb.txt new file mode 100644 index 00000000000..43d3549366a --- /dev/null +++ b/Documentation/networking/igb.txt @@ -0,0 +1,129 @@ +Linux* Base Driver for Intel(R) Ethernet Network Connection +=========================================================== + +Intel Gigabit Linux driver. +Copyright(c) 1999 - 2013 Intel Corporation. + +Contents +======== + +- Identifying Your Adapter +- Additional Configurations +- Support + +Identifying Your Adapter +======================== + +This driver supports all 82575, 82576 and 82580-based Intel (R) gigabit network +connections. + +For specific information on how to identify your adapter, go to the Adapter & +Driver ID Guide at: + + http://support.intel.com/support/go/network/adapter/idguide.htm + +Command Line Parameters +======================= + +The default value for each parameter is generally the recommended setting, +unless otherwise noted. + +max_vfs +------- +Valid Range: 0-7 +Default Value: 0 + +This parameter adds support for SR-IOV. It causes the driver to spawn up to +max_vfs worth of virtual function. + +Additional Configurations +========================= + + Jumbo Frames + ------------ + Jumbo Frames support is enabled by changing the MTU to a value larger than + the default of 1500. Use the ifconfig command to increase the MTU size. + For example: + + ifconfig eth<x> mtu 9000 up + + This setting is not saved across reboots. + + Notes: + + - The maximum MTU setting for Jumbo Frames is 9216. This value coincides + with the maximum Jumbo Frames size of 9234 bytes. + + - Using Jumbo frames at 10 or 100 Mbps is not supported and may result in + poor performance or loss of link. + + ethtool + ------- + The driver utilizes the ethtool interface for driver configuration and + diagnostics, as well as displaying statistical information. The latest + version of ethtool can be found at: + + http://ftp.kernel.org/pub/software/network/ethtool/ + + Enabling Wake on LAN* (WoL) + --------------------------- + WoL is configured through the ethtool* utility. + + For instructions on enabling WoL with ethtool, refer to the ethtool man page. + + WoL will be enabled on the system during the next shut down or reboot. + For this driver version, in order to enable WoL, the igb driver must be + loaded when shutting down or rebooting the system. + + Wake On LAN is only supported on port A of multi-port adapters. + + Wake On LAN is not supported for the Intel(R) Gigabit VT Quad Port Server + Adapter. + + Multiqueue + ---------- + In this mode, a separate MSI-X vector is allocated for each queue and one + for "other" interrupts such as link status change and errors. All + interrupts are throttled via interrupt moderation. Interrupt moderation + must be used to avoid interrupt storms while the driver is processing one + interrupt. The moderation value should be at least as large as the expected + time for the driver to process an interrupt. Multiqueue is off by default. + + REQUIREMENTS: MSI-X support is required for Multiqueue. If MSI-X is not + found, the system will fallback to MSI or to Legacy interrupts. + + MAC and VLAN anti-spoofing feature + ---------------------------------- + When a malicious driver attempts to send a spoofed packet, it is dropped by + the hardware and not transmitted. An interrupt is sent to the PF driver + notifying it of the spoof attempt. + + When a spoofed packet is detected the PF driver will send the following + message to the system log (displayed by the "dmesg" command): + + Spoof event(s) detected on VF(n) + + Where n=the VF that attempted to do the spoofing. + + Setting MAC Address, VLAN and Rate Limit Using IProute2 Tool + ------------------------------------------------------------ + You can set a MAC address of a Virtual Function (VF), a default VLAN and the + rate limit using the IProute2 tool. Download the latest version of the + iproute2 tool from Sourceforge if your version does not have all the + features you require. + + +Support +======= + +For general information, go to the Intel support website at: + + www.intel.com/support/ + +or the Intel Wired Networking project hosted by Sourceforge at: + + http://sourceforge.net/projects/e1000 + +If an issue is identified with the released source code on the supported +kernel with a supported adapter, email the specific information related +to the issue to e1000-devel@lists.sf.net diff --git a/Documentation/networking/igbvf.txt b/Documentation/networking/igbvf.txt new file mode 100644 index 00000000000..40db17a6665 --- /dev/null +++ b/Documentation/networking/igbvf.txt @@ -0,0 +1,80 @@ +Linux* Base Driver for Intel(R) Ethernet Network Connection +=========================================================== + +Intel Gigabit Linux driver. +Copyright(c) 1999 - 2013 Intel Corporation. + +Contents +======== + +- Identifying Your Adapter +- Additional Configurations +- Support + +This file describes the igbvf Linux* Base Driver for Intel Network Connection. + +The igbvf driver supports 82576-based virtual function devices that can only +be activated on kernels that support SR-IOV. SR-IOV requires the correct +platform and OS support. + +The igbvf driver requires the igb driver, version 2.0 or later. The igbvf +driver supports virtual functions generated by the igb driver with a max_vfs +value of 1 or greater. For more information on the max_vfs parameter refer +to the README included with the igb driver. + +The guest OS loading the igbvf driver must support MSI-X interrupts. + +This driver is only supported as a loadable module at this time. Intel is +not supplying patches against the kernel source to allow for static linking +of the driver. For questions related to hardware requirements, refer to the +documentation supplied with your Intel Gigabit adapter. All hardware +requirements listed apply to use with Linux. + +Instructions on updating ethtool can be found in the section "Additional +Configurations" later in this document. + +VLANs: There is a limit of a total of 32 shared VLANs to 1 or more VFs. + +Identifying Your Adapter +======================== + +The igbvf driver supports 82576-based virtual function devices that can only +be activated on kernels that support SR-IOV. + +For more information on how to identify your adapter, go to the Adapter & +Driver ID Guide at: + + http://support.intel.com/support/go/network/adapter/idguide.htm + +For the latest Intel network drivers for Linux, refer to the following +website. In the search field, enter your adapter name or type, or use the +networking link on the left to search for your adapter: + + http://downloadcenter.intel.com/scripts-df-external/Support_Intel.aspx + +Additional Configurations +========================= + + ethtool + ------- + The driver utilizes the ethtool interface for driver configuration and + diagnostics, as well as displaying statistical information. The ethtool + version 3.0 or later is required for this functionality, although we + strongly recommend downloading the latest version at: + + http://ftp.kernel.org/pub/software/network/ethtool/ + +Support +======= + +For general information, go to the Intel support website at: + + http://support.intel.com + +or the Intel Wired Networking project hosted by Sourceforge at: + + http://sourceforge.net/projects/e1000 + +If an issue is identified with the released source code on the supported +kernel with a supported adapter, email the specific information related +to the issue to e1000-devel@lists.sf.net diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt index 65895bb5141..ab42c95f998 100644 --- a/Documentation/networking/ip-sysctl.txt +++ b/Documentation/networking/ip-sysctl.txt @@ -2,7 +2,7 @@ ip_forward - BOOLEAN 0 - disabled (default) - not 0 - enabled + not 0 - enabled Forward Packets between interfaces. @@ -11,14 +11,82 @@ ip_forward - BOOLEAN for routers) ip_default_ttl - INTEGER - default 64 - -ip_no_pmtu_disc - BOOLEAN - Disable Path MTU Discovery. - default FALSE + Default value of TTL field (Time To Live) for outgoing (but not + forwarded) IP packets. Should be between 1 and 255 inclusive. + Default: 64 (as recommended by RFC1700) + +ip_no_pmtu_disc - INTEGER + Disable Path MTU Discovery. If enabled in mode 1 and a + fragmentation-required ICMP is received, the PMTU to this + destination will be set to min_pmtu (see below). You will need + to raise min_pmtu to the smallest interface MTU on your system + manually if you want to avoid locally generated fragments. + + In mode 2 incoming Path MTU Discovery messages will be + discarded. Outgoing frames are handled the same as in mode 1, + implicitly setting IP_PMTUDISC_DONT on every created socket. + + Mode 3 is a hardend pmtu discover mode. The kernel will only + accept fragmentation-needed errors if the underlying protocol + can verify them besides a plain socket lookup. Current + protocols for which pmtu events will be honored are TCP, SCTP + and DCCP as they verify e.g. the sequence number or the + association. This mode should not be enabled globally but is + only intended to secure e.g. name servers in namespaces where + TCP path mtu must still work but path MTU information of other + protocols should be discarded. If enabled globally this mode + could break other protocols. + + Possible values: 0-3 + Default: FALSE min_pmtu - INTEGER - default 562 - minimum discovered Path MTU + default 552 - minimum discovered Path MTU + +ip_forward_use_pmtu - BOOLEAN + By default we don't trust protocol path MTUs while forwarding + because they could be easily forged and can lead to unwanted + fragmentation by the router. + You only need to enable this if you have user-space software + which tries to discover path mtus by itself and depends on the + kernel honoring this information. This is normally not the + case. + Default: 0 (disabled) + Possible values: + 0 - disabled + 1 - enabled + +route/max_size - INTEGER + Maximum number of routes allowed in the kernel. Increase + this when using large numbers of interfaces and/or routes. + +neigh/default/gc_thresh1 - INTEGER + Minimum number of entries to keep. Garbage collector will not + purge entries if there are fewer than this number. + Default: 128 + +neigh/default/gc_thresh3 - INTEGER + Maximum number of neighbor entries allowed. Increase this + when using large numbers of interfaces and when communicating + with large numbers of directly-connected peers. + Default: 1024 + +neigh/default/unres_qlen_bytes - INTEGER + The maximum number of bytes which may be used by packets + queued for each unresolved address by other network layers. + (added in linux 3.3) + Setting negative value is meaningless and will return error. + Default: 65536 Bytes(64KB) + +neigh/default/unres_qlen - INTEGER + The maximum number of packets which may be queued for each + unresolved address by other network layers. + (deprecated in linux 3.3) : use unres_qlen_bytes instead. + Prior to linux 3.3, the default value is 3 which may cause + unexpected packet loss. The current default value is calculated + according to default value of unres_qlen_bytes and true size of + packet. + Default: 31 mtu_expires - INTEGER Time, in seconds, that cached PMTU information is kept. @@ -30,26 +98,49 @@ min_adv_mss - INTEGER IP Fragmentation: ipfrag_high_thresh - INTEGER - Maximum memory used to reassemble IP fragments. When + Maximum memory used to reassemble IP fragments. When ipfrag_high_thresh bytes of memory is allocated for this purpose, the fragment handler will toss packets until ipfrag_low_thresh is reached. - + ipfrag_low_thresh - INTEGER - See ipfrag_high_thresh + See ipfrag_high_thresh ipfrag_time - INTEGER - Time in seconds to keep an IP fragment in memory. + Time in seconds to keep an IP fragment in memory. ipfrag_secret_interval - INTEGER - Regeneration interval (in seconds) of the hash secret (or lifetime + Regeneration interval (in seconds) of the hash secret (or lifetime for the hash secret) for IP fragments. Default: 600 +ipfrag_max_dist - INTEGER + ipfrag_max_dist is a non-negative integer value which defines the + maximum "disorder" which is allowed among fragments which share a + common IP source address. Note that reordering of packets is + not unusual, but if a large number of fragments arrive from a source + IP address while a particular fragment queue remains incomplete, it + probably indicates that one or more fragments belonging to that queue + have been lost. When ipfrag_max_dist is positive, an additional check + is done on fragments before they are added to a reassembly queue - if + ipfrag_max_dist (or more) fragments have arrived from a particular IP + address between additions to any IP fragment queue using that source + address, it's presumed that one or more fragments in the queue are + lost. The existing fragment queue will be dropped, and a new one + started. An ipfrag_max_dist value of zero disables this check. + + Using a very small value, e.g. 1 or 2, for ipfrag_max_dist can + result in unnecessarily dropping fragment queues when normal + reordering of packets occurs, which could lead to poor application + performance. Using a very large value, e.g. 50000, increases the + likelihood of incorrectly reassembling IP fragments that originate + from different IP datagrams, which could result in data corruption. + Default: 64 + INET peer storage: inet_peer_threshold - INTEGER - The approximate size of the storage. Starting from this threshold + The approximate size of the storage. Starting from this threshold entries will be thrown aggressively. This threshold also determines entries' time-to-live and time intervals between garbage collection passes. More entries, less time-to-live, less GC interval. @@ -58,35 +149,133 @@ inet_peer_minttl - INTEGER Minimum time-to-live of entries. Should be enough to cover fragment time-to-live on the reassembling side. This minimum time-to-live is guaranteed if the pool size is less than inet_peer_threshold. - Measured in jiffies(1). + Measured in seconds. inet_peer_maxttl - INTEGER Maximum time-to-live of entries. Unused entries will expire after this period of time if there is no memory pressure on the pool (i.e. when the number of entries in the pool is very small). - Measured in jiffies(1). + Measured in seconds. -inet_peer_gc_mintime - INTEGER - Minimum interval between garbage collection passes. This interval is - in effect under high memory pressure on the pool. - Measured in jiffies(1). +TCP variables: -inet_peer_gc_maxtime - INTEGER - Minimum interval between garbage collection passes. This interval is - in effect under low (or absent) memory pressure on the pool. - Measured in jiffies(1). +somaxconn - INTEGER + Limit of socket listen() backlog, known in userspace as SOMAXCONN. + Defaults to 128. See also tcp_max_syn_backlog for additional tuning + for TCP sockets. -TCP variables: +tcp_abort_on_overflow - BOOLEAN + If listening service is too slow to accept new connections, + reset them. Default state is FALSE. It means that if overflow + occurred due to a burst, connection will recover. Enable this + option _only_ if you are really sure that listening daemon + cannot be tuned to accept connections faster. Enabling this + option can harm clients of your server. -tcp_syn_retries - INTEGER - Number of times initial SYNs for an active TCP connection attempt - will be retransmitted. Should not be higher than 255. Default value - is 5, which corresponds to ~180seconds. +tcp_adv_win_scale - INTEGER + Count buffering overhead as bytes/2^tcp_adv_win_scale + (if tcp_adv_win_scale > 0) or bytes-bytes/2^(-tcp_adv_win_scale), + if it is <= 0. + Possible values are [-31, 31], inclusive. + Default: 1 -tcp_synack_retries - INTEGER - Number of times SYNACKs for a passive TCP connection attempt will - be retransmitted. Should not be higher than 255. Default value - is 5, which corresponds to ~180seconds. +tcp_allowed_congestion_control - STRING + Show/set the congestion control choices available to non-privileged + processes. The list is a subset of those listed in + tcp_available_congestion_control. + Default is "reno" and the default setting (tcp_congestion_control). + +tcp_app_win - INTEGER + Reserve max(window/2^tcp_app_win, mss) of window for application + buffer. Value 0 is special, it means that nothing is reserved. + Default: 31 + +tcp_autocorking - BOOLEAN + Enable TCP auto corking : + When applications do consecutive small write()/sendmsg() system calls, + we try to coalesce these small writes as much as possible, to lower + total amount of sent packets. This is done if at least one prior + packet for the flow is waiting in Qdisc queues or device transmit + queue. Applications can still use TCP_CORK for optimal behavior + when they know how/when to uncork their sockets. + Default : 1 + +tcp_available_congestion_control - STRING + Shows the available congestion control choices that are registered. + More congestion control algorithms may be available as modules, + but not loaded. + +tcp_base_mss - INTEGER + The initial value of search_low to be used by the packetization layer + Path MTU discovery (MTU probing). If MTU probing is enabled, + this is the initial MSS used by the connection. + +tcp_congestion_control - STRING + Set the congestion control algorithm to be used for new + connections. The algorithm "reno" is always available, but + additional choices may be available based on kernel configuration. + Default is set as part of kernel configuration. + For passive connections, the listener congestion control choice + is inherited. + [see setsockopt(listenfd, SOL_TCP, TCP_CONGESTION, "name" ...) ] + +tcp_dsack - BOOLEAN + Allows TCP to send "duplicate" SACKs. + +tcp_early_retrans - INTEGER + Enable Early Retransmit (ER), per RFC 5827. ER lowers the threshold + for triggering fast retransmit when the amount of outstanding data is + small and when no previously unsent data can be transmitted (such + that limited transmit could be used). Also controls the use of + Tail loss probe (TLP) that converts RTOs occurring due to tail + losses into fast recovery (draft-dukkipati-tcpm-tcp-loss-probe-01). + Possible values: + 0 disables ER + 1 enables ER + 2 enables ER but delays fast recovery and fast retransmit + by a fourth of RTT. This mitigates connection falsely + recovers when network has a small degree of reordering + (less than 3 packets). + 3 enables delayed ER and TLP. + 4 enables TLP only. + Default: 3 + +tcp_ecn - INTEGER + Control use of Explicit Congestion Notification (ECN) by TCP. + ECN is used only when both ends of the TCP connection indicate + support for it. This feature is useful in avoiding losses due + to congestion by allowing supporting routers to signal + congestion before having to drop packets. + Possible values are: + 0 Disable ECN. Neither initiate nor accept ECN. + 1 Enable ECN when requested by incoming connections and + also request ECN on outgoing connection attempts. + 2 Enable ECN when requested by incoming connections + but do not request ECN on outgoing connections. + Default: 2 + +tcp_fack - BOOLEAN + Enable FACK congestion avoidance and fast retransmission. + The value is not used, if tcp_sack is not enabled. + +tcp_fin_timeout - INTEGER + The length of time an orphaned (no longer referenced by any + application) connection will remain in the FIN_WAIT_2 state + before it is aborted at the local end. While a perfectly + valid "receive only" state for an un-orphaned connection, an + orphaned connection in FIN_WAIT_2 state could otherwise wait + forever for the remote to close its end of the connection. + Cf. tcp_max_orphans + Default: 60 seconds + +tcp_frto - INTEGER + Enables Forward RTO-Recovery (F-RTO) defined in RFC5682. + F-RTO is an enhanced recovery algorithm for TCP retransmission + timeouts. It is particularly beneficial in networks where the + RTT fluctuates (e.g., wireless). F-RTO is sender-side only + modification. It does not require any support from the peer. + + By default it's enabled with a non-zero value. 0 disables F-RTO. tcp_keepalive_time - INTEGER How often TCP sends out keepalive messages when keepalive is enabled. @@ -102,54 +291,13 @@ tcp_keepalive_intvl - INTEGER after probes started. Default value: 75sec i.e. connection will be aborted after ~11 minutes of retries. -tcp_retries1 - INTEGER - How many times to retry before deciding that something is wrong - and it is necessary to report this suspicion to network layer. - Minimal RFC value is 3, it is default, which corresponds - to ~3sec-8min depending on RTO. - -tcp_retries2 - INTEGER - How may times to retry before killing alive TCP connection. - RFC1122 says that the limit should be longer than 100 sec. - It is too small number. Default value 15 corresponds to ~13-30min - depending on RTO. - -tcp_orphan_retries - INTEGER - How may times to retry before killing TCP connection, closed - by our side. Default value 7 corresponds to ~50sec-16min - depending on RTO. If you machine is loaded WEB server, - you should think about lowering this value, such sockets - may consume significant resources. Cf. tcp_max_orphans. - -tcp_fin_timeout - INTEGER - Time to hold socket in state FIN-WAIT-2, if it was closed - by our side. Peer can be broken and never close its side, - or even died unexpectedly. Default value is 60sec. - Usual value used in 2.2 was 180 seconds, you may restore - it, but remember that if your machine is even underloaded WEB server, - you risk to overflow memory with kilotons of dead sockets, - FIN-WAIT-2 sockets are less dangerous than FIN-WAIT-1, - because they eat maximum 1.5K of memory, but they tend - to live longer. Cf. tcp_max_orphans. - -tcp_max_tw_buckets - INTEGER - Maximal number of timewait sockets held by system simultaneously. - If this number is exceeded time-wait socket is immediately destroyed - and warning is printed. This limit exists only to prevent - simple DoS attacks, you _must_ not lower the limit artificially, - but rather increase it (probably, after increasing installed memory), - if network conditions require more than default value. - -tcp_tw_recycle - BOOLEAN - Enable fast recycling TIME-WAIT sockets. Default value is 0. - It should not be changed without advice/request of technical - experts. - -tcp_tw_reuse - BOOLEAN - Allow to reuse TIME-WAIT sockets for new connections when it is - safe from protocol viewpoint. Default value is 0. - It should not be changed without advice/request of technical - experts. +tcp_low_latency - BOOLEAN + If set, the TCP stack makes decisions that prefer lower + latency as opposed to higher throughput. By default, this + option is not set meaning that higher throughput is preferred. + An example of an application where this default should be + changed would be a Beowulf compute cluster. + Default: 0 tcp_max_orphans - INTEGER Maximal number of TCP sockets not attached to any user file handle, @@ -163,98 +311,112 @@ tcp_max_orphans - INTEGER more aggressively. Let me to remind again: each orphan eats up to ~64K of unswappable memory. -tcp_abort_on_overflow - BOOLEAN - If listening service is too slow to accept new connections, - reset them. Default state is FALSE. It means that if overflow - occurred due to a burst, connection will recover. Enable this - option _only_ if you are really sure that listening daemon - cannot be tuned to accept connections faster. Enabling this - option can harm clients of your server. - -tcp_syncookies - BOOLEAN - Only valid when the kernel was compiled with CONFIG_SYNCOOKIES - Send out syncookies when the syn backlog queue of a socket - overflows. This is to prevent against the common 'syn flood attack' - Default: FALSE - - Note, that syncookies is fallback facility. - It MUST NOT be used to help highly loaded servers to stand - against legal connection rate. If you see synflood warnings - in your logs, but investigation shows that they occur - because of overload with legal connections, you should tune - another parameters until this warning disappear. - See: tcp_max_syn_backlog, tcp_synack_retries, tcp_abort_on_overflow. +tcp_max_syn_backlog - INTEGER + Maximal number of remembered connection requests, which have not + received an acknowledgment from connecting client. + The minimal value is 128 for low memory machines, and it will + increase in proportion to the memory of machine. + If server suffers from overload, try increasing this number. - syncookies seriously violate TCP protocol, do not allow - to use TCP extensions, can result in serious degradation - of some services (f.e. SMTP relaying), visible not by you, - but your clients and relays, contacting you. While you see - synflood warnings in logs not being really flooded, your server - is seriously misconfigured. +tcp_max_tw_buckets - INTEGER + Maximal number of timewait sockets held by system simultaneously. + If this number is exceeded time-wait socket is immediately destroyed + and warning is printed. This limit exists only to prevent + simple DoS attacks, you _must_ not lower the limit artificially, + but rather increase it (probably, after increasing installed memory), + if network conditions require more than default value. -tcp_stdurg - BOOLEAN - Use the Host requirements interpretation of the TCP urg pointer field. - Most hosts use the older BSD interpretation, so if you turn this on - Linux might not communicate correctly with them. - Default: FALSE - -tcp_max_syn_backlog - INTEGER - Maximal number of remembered connection requests, which are - still did not receive an acknowledgment from connecting client. - Default value is 1024 for systems with more than 128Mb of memory, - and 128 for low memory machines. If server suffers of overload, - try to increase this number. +tcp_mem - vector of 3 INTEGERs: min, pressure, max + min: below this number of pages TCP is not bothered about its + memory appetite. -tcp_window_scaling - BOOLEAN - Enable window scaling as defined in RFC1323. + pressure: when amount of memory allocated by TCP exceeds this number + of pages, TCP moderates its memory consumption and enters memory + pressure mode, which is exited when memory consumption falls + under "min". -tcp_timestamps - BOOLEAN - Enable timestamps as defined in RFC1323. + max: number of pages allowed for queueing by all TCP sockets. -tcp_sack - BOOLEAN - Enable select acknowledgments (SACKS). + Defaults are calculated at boot time from amount of available + memory. -tcp_fack - BOOLEAN - Enable FACK congestion avoidance and fast retransmission. - The value is not used, if tcp_sack is not enabled. +tcp_moderate_rcvbuf - BOOLEAN + If set, TCP performs receive buffer auto-tuning, attempting to + automatically size the buffer (no greater than tcp_rmem[2]) to + match the size required by the path for full throughput. Enabled by + default. + +tcp_mtu_probing - INTEGER + Controls TCP Packetization-Layer Path MTU Discovery. Takes three + values: + 0 - Disabled + 1 - Disabled by default, enabled when an ICMP black hole detected + 2 - Always enabled, use initial MSS of tcp_base_mss. + +tcp_no_metrics_save - BOOLEAN + By default, TCP saves various connection metrics in the route cache + when the connection closes, so that connections established in the + near future can use these to set initial conditions. Usually, this + increases overall performance, but may sometimes cause performance + degradation. If set, TCP will not cache metrics on closing + connections. -tcp_dsack - BOOLEAN - Allows TCP to send "duplicate" SACKs. +tcp_orphan_retries - INTEGER + This value influences the timeout of a locally closed TCP connection, + when RTO retransmissions remain unacknowledged. + See tcp_retries2 for more details. -tcp_ecn - BOOLEAN - Enable Explicit Congestion Notification in TCP. + The default value is 8. + If your machine is a loaded WEB server, + you should think about lowering this value, such sockets + may consume significant resources. Cf. tcp_max_orphans. tcp_reordering - INTEGER Maximal reordering of packets in a TCP stream. - Default: 3 + Default: 3 tcp_retrans_collapse - BOOLEAN Bug-to-bug compatibility with some broken printers. On retransmit try to send bigger packets to work around bugs in certain TCP stacks. -tcp_wmem - vector of 3 INTEGERs: min, default, max - min: Amount of memory reserved for send buffers for TCP socket. - Each TCP socket has rights to use it due to fact of its birth. - Default: 4K +tcp_retries1 - INTEGER + This value influences the time, after which TCP decides, that + something is wrong due to unacknowledged RTO retransmissions, + and reports this suspicion to the network layer. + See tcp_retries2 for more details. - default: Amount of memory allowed for send buffers for TCP socket - by default. This value overrides net.core.wmem_default used - by other protocols, it is usually lower than net.core.wmem_default. - Default: 16K + RFC 1122 recommends at least 3 retransmissions, which is the + default. + +tcp_retries2 - INTEGER + This value influences the timeout of an alive TCP connection, + when RTO retransmissions remain unacknowledged. + Given a value of N, a hypothetical TCP connection following + exponential backoff with an initial RTO of TCP_RTO_MIN would + retransmit N times before killing the connection at the (N+1)th RTO. + + The default value of 15 yields a hypothetical timeout of 924.6 + seconds and is a lower bound for the effective timeout. + TCP will effectively time out at the first RTO which exceeds the + hypothetical timeout. - max: Maximal amount of memory allowed for automatically selected - send buffers for TCP socket. This value does not override - net.core.wmem_max, "static" selection via SO_SNDBUF does not use this. - Default: 128K + RFC 1122 recommends at least 100 seconds for the timeout, + which corresponds to a value of at least 8. + +tcp_rfc1337 - BOOLEAN + If set, the TCP stack behaves conforming to RFC1337. If unset, + we are not conforming to RFC, but prevent TCP TIME_WAIT + assassination. + Default: 0 tcp_rmem - vector of 3 INTEGERs: min, default, max min: Minimal size of receive buffer used by TCP sockets. It is guaranteed to each TCP socket, even under moderate memory pressure. - Default: 8K + Default: 1 page - default: default size of receive buffer used by TCP sockets. + default: initial size of receive buffer used by TCP sockets. This value overrides net.core.rmem_default used by other protocols. Default: 87380 bytes. This value results in window of 65535 with default setting of tcp_adv_win_scale and tcp_app_win:0 and a bit @@ -262,85 +424,310 @@ tcp_rmem - vector of 3 INTEGERs: min, default, max max: maximal size of receive buffer allowed for automatically selected receiver buffers for TCP socket. This value does not override - net.core.rmem_max, "static" selection via SO_RCVBUF does not use this. - Default: 87380*2 bytes. + net.core.rmem_max. Calling setsockopt() with SO_RCVBUF disables + automatic tuning of that socket's receive buffer size, in which + case this value is ignored. + Default: between 87380B and 6MB, depending on RAM size. -tcp_mem - vector of 3 INTEGERs: min, pressure, max - low: below this number of pages TCP is not bothered about its - memory appetite. +tcp_sack - BOOLEAN + Enable select acknowledgments (SACKS). - pressure: when amount of memory allocated by TCP exceeds this number - of pages, TCP moderates its memory consumption and enters memory - pressure mode, which is exited when memory consumption falls - under "low". +tcp_slow_start_after_idle - BOOLEAN + If set, provide RFC2861 behavior and time out the congestion + window after an idle period. An idle period is defined at + the current RTO. If unset, the congestion window will not + be timed out after an idle period. + Default: 1 - high: number of pages allowed for queueing by all TCP sockets. +tcp_stdurg - BOOLEAN + Use the Host requirements interpretation of the TCP urgent pointer field. + Most hosts use the older BSD interpretation, so if you turn this on + Linux might not communicate correctly with them. + Default: FALSE - Defaults are calculated at boot time from amount of available - memory. +tcp_synack_retries - INTEGER + Number of times SYNACKs for a passive TCP connection attempt will + be retransmitted. Should not be higher than 255. Default value + is 5, which corresponds to 31seconds till the last retransmission + with the current initial RTO of 1second. With this the final timeout + for a passive TCP connection will happen after 63seconds. -tcp_app_win - INTEGER - Reserve max(window/2^tcp_app_win, mss) of window for application - buffer. Value 0 is special, it means that nothing is reserved. - Default: 31 +tcp_syncookies - BOOLEAN + Only valid when the kernel was compiled with CONFIG_SYN_COOKIES + Send out syncookies when the syn backlog queue of a socket + overflows. This is to prevent against the common 'SYN flood attack' + Default: 1 -tcp_adv_win_scale - INTEGER - Count buffering overhead as bytes/2^tcp_adv_win_scale - (if tcp_adv_win_scale > 0) or bytes-bytes/2^(-tcp_adv_win_scale), - if it is <= 0. + Note, that syncookies is fallback facility. + It MUST NOT be used to help highly loaded servers to stand + against legal connection rate. If you see SYN flood warnings + in your logs, but investigation shows that they occur + because of overload with legal connections, you should tune + another parameters until this warning disappear. + See: tcp_max_syn_backlog, tcp_synack_retries, tcp_abort_on_overflow. + + syncookies seriously violate TCP protocol, do not allow + to use TCP extensions, can result in serious degradation + of some services (f.e. SMTP relaying), visible not by you, + but your clients and relays, contacting you. While you see + SYN flood warnings in logs not being really flooded, your server + is seriously misconfigured. + + If you want to test which effects syncookies have to your + network connections you can set this knob to 2 to enable + unconditionally generation of syncookies. + +tcp_fastopen - INTEGER + Enable TCP Fast Open feature (draft-ietf-tcpm-fastopen) to send data + in the opening SYN packet. To use this feature, the client application + must use sendmsg() or sendto() with MSG_FASTOPEN flag rather than + connect() to perform a TCP handshake automatically. + + The values (bitmap) are + 1: Enables sending data in the opening SYN on the client w/ MSG_FASTOPEN. + 2: Enables TCP Fast Open on the server side, i.e., allowing data in + a SYN packet to be accepted and passed to the application before + 3-way hand shake finishes. + 4: Send data in the opening SYN regardless of cookie availability and + without a cookie option. + 0x100: Accept SYN data w/o validating the cookie. + 0x200: Accept data-in-SYN w/o any cookie option present. + 0x400/0x800: Enable Fast Open on all listeners regardless of the + TCP_FASTOPEN socket option. The two different flags designate two + different ways of setting max_qlen without the TCP_FASTOPEN socket + option. + + Default: 1 + + Note that the client & server side Fast Open flags (1 and 2 + respectively) must be also enabled before the rest of flags can take + effect. + + See include/net/tcp.h and the code for more details. + +tcp_syn_retries - INTEGER + Number of times initial SYNs for an active TCP connection attempt + will be retransmitted. Should not be higher than 255. Default value + is 6, which corresponds to 63seconds till the last retransmission + with the current initial RTO of 1second. With this the final timeout + for an active TCP connection attempt will happen after 127seconds. + +tcp_timestamps - BOOLEAN + Enable timestamps as defined in RFC1323. + +tcp_min_tso_segs - INTEGER + Minimal number of segments per TSO frame. + Since linux-3.12, TCP does an automatic sizing of TSO frames, + depending on flow rate, instead of filling 64Kbytes packets. + For specific usages, it's possible to force TCP to build big + TSO frames. Note that TCP stack might split too big TSO packets + if available window is too small. Default: 2 -tcp_rfc1337 - BOOLEAN - If set, the TCP stack behaves conforming to RFC1337. If unset, - we are not conforming to RFC, but prevent TCP TIME_WAIT - assassination. +tcp_tso_win_divisor - INTEGER + This allows control over what percentage of the congestion window + can be consumed by a single TSO frame. + The setting of this parameter is a choice between burstiness and + building larger TSO frames. + Default: 3 + +tcp_tw_recycle - BOOLEAN + Enable fast recycling TIME-WAIT sockets. Default value is 0. + It should not be changed without advice/request of technical + experts. + +tcp_tw_reuse - BOOLEAN + Allow to reuse TIME-WAIT sockets for new connections when it is + safe from protocol viewpoint. Default value is 0. + It should not be changed without advice/request of technical + experts. + +tcp_window_scaling - BOOLEAN + Enable window scaling as defined in RFC1323. + +tcp_wmem - vector of 3 INTEGERs: min, default, max + min: Amount of memory reserved for send buffers for TCP sockets. + Each TCP socket has rights to use it due to fact of its birth. + Default: 1 page + + default: initial size of send buffer used by TCP sockets. This + value overrides net.core.wmem_default used by other protocols. + It is usually lower than net.core.wmem_default. + Default: 16K + + max: Maximal amount of memory allowed for automatically tuned + send buffers for TCP sockets. This value does not override + net.core.wmem_max. Calling setsockopt() with SO_SNDBUF disables + automatic tuning of that socket's send buffer size, in which case + this value is ignored. + Default: between 64K and 4MB, depending on RAM size. + +tcp_notsent_lowat - UNSIGNED INTEGER + A TCP socket can control the amount of unsent bytes in its write queue, + thanks to TCP_NOTSENT_LOWAT socket option. poll()/select()/epoll() + reports POLLOUT events if the amount of unsent bytes is below a per + socket value, and if the write queue is not full. sendmsg() will + also not add new buffers if the limit is hit. + + This global variable controls the amount of unsent data for + sockets not using TCP_NOTSENT_LOWAT. For these sockets, a change + to the global variable has immediate effect. + + Default: UINT_MAX (0xFFFFFFFF) + +tcp_workaround_signed_windows - BOOLEAN + If set, assume no receipt of a window scaling option means the + remote TCP is broken and treats the window as a signed quantity. + If unset, assume the remote TCP is not broken even if we do + not receive a window scaling option from them. Default: 0 -tcp_low_latency - BOOLEAN - If set, the TCP stack makes decisions that prefer lower - latency as opposed to higher throughput. By default, this - option is not set meaning that higher throughput is preferred. - An example of an application where this default should be - changed would be a Beowulf compute cluster. +tcp_dma_copybreak - INTEGER + Lower limit, in bytes, of the size of socket reads that will be + offloaded to a DMA copy engine, if one is present in the system + and CONFIG_NET_DMA is enabled. + Default: 4096 + +tcp_thin_linear_timeouts - BOOLEAN + Enable dynamic triggering of linear timeouts for thin streams. + If set, a check is performed upon retransmission by timeout to + determine if the stream is thin (less than 4 packets in flight). + As long as the stream is found to be thin, up to 6 linear + timeouts may be performed before exponential backoff mode is + initiated. This improves retransmission latency for + non-aggressive thin streams, often found to be time-dependent. + For more information on thin streams, see + Documentation/networking/tcp-thin.txt Default: 0 -tcp_tso_win_divisor - INTEGER - This allows control over what percentage of the congestion window - can be consumed by a single TSO frame. - The setting of this parameter is a choice between burstiness and - building larger TSO frames. - Default: 3 - -tcp_frto - BOOLEAN - Enables F-RTO, an enhanced recovery algorithm for TCP retransmission - timeouts. It is particularly beneficial in wireless environments - where packet loss is typically due to random radio interference - rather than intermediate router congestion. +tcp_thin_dupack - BOOLEAN + Enable dynamic triggering of retransmissions after one dupACK + for thin streams. If set, a check is performed upon reception + of a dupACK to determine if the stream is thin (less than 4 + packets in flight). As long as the stream is found to be thin, + data is retransmitted on the first received dupACK. This + improves retransmission latency for non-aggressive thin + streams, often found to be time-dependent. + For more information on thin streams, see + Documentation/networking/tcp-thin.txt + Default: 0 -tcp_congestion_control - STRING - Set the congestion control algorithm to be used for new - connections. The algorithm "reno" is always available, but - additional choices may be available based on kernel configuration. +tcp_limit_output_bytes - INTEGER + Controls TCP Small Queue limit per tcp socket. + TCP bulk sender tends to increase packets in flight until it + gets losses notifications. With SNDBUF autotuning, this can + result in a large amount of packets queued in qdisc/device + on the local machine, hurting latency of other flows, for + typical pfifo_fast qdiscs. + tcp_limit_output_bytes limits the number of bytes on qdisc + or device to reduce artificial RTT/cwnd and reduce bufferbloat. + Default: 131072 + +tcp_challenge_ack_limit - INTEGER + Limits number of Challenge ACK sent per second, as recommended + in RFC 5961 (Improving TCP's Robustness to Blind In-Window Attacks) + Default: 100 -somaxconn - INTEGER - Limit of socket listen() backlog, known in userspace as SOMAXCONN. - Defaults to 128. See also tcp_max_syn_backlog for additional tuning - for TCP sockets. +UDP variables: + +udp_mem - vector of 3 INTEGERs: min, pressure, max + Number of pages allowed for queueing by all UDP sockets. + + min: Below this number of pages UDP is not bothered about its + memory appetite. When amount of memory allocated by UDP exceeds + this number, UDP starts to moderate memory usage. + + pressure: This value was introduced to follow format of tcp_mem. + + max: Number of pages allowed for queueing by all UDP sockets. + + Default is calculated at boot time from amount of available memory. + +udp_rmem_min - INTEGER + Minimal size of receive buffer used by UDP sockets in moderation. + Each UDP socket is able to use the size for receiving data, even if + total pages of UDP sockets exceed udp_mem pressure. The unit is byte. + Default: 1 page + +udp_wmem_min - INTEGER + Minimal size of send buffer used by UDP sockets in moderation. + Each UDP socket is able to use the size for sending data, even if + total pages of UDP sockets exceed udp_mem pressure. The unit is byte. + Default: 1 page + +CIPSOv4 Variables: + +cipso_cache_enable - BOOLEAN + If set, enable additions to and lookups from the CIPSO label mapping + cache. If unset, additions are ignored and lookups always result in a + miss. However, regardless of the setting the cache is still + invalidated when required when means you can safely toggle this on and + off and the cache will always be "safe". + Default: 1 + +cipso_cache_bucket_size - INTEGER + The CIPSO label cache consists of a fixed size hash table with each + hash bucket containing a number of cache entries. This variable limits + the number of entries in each hash bucket; the larger the value the + more CIPSO label mappings that can be cached. When the number of + entries in a given hash bucket reaches this limit adding new entries + causes the oldest entry in the bucket to be removed to make room. + Default: 10 + +cipso_rbm_optfmt - BOOLEAN + Enable the "Optimized Tag 1 Format" as defined in section 3.4.2.6 of + the CIPSO draft specification (see Documentation/netlabel for details). + This means that when set the CIPSO tag will be padded with empty + categories in order to make the packet data 32-bit aligned. + Default: 0 + +cipso_rbm_structvalid - BOOLEAN + If set, do a very strict check of the CIPSO option when + ip_options_compile() is called. If unset, relax the checks done during + ip_options_compile(). Either way is "safe" as errors are caught else + where in the CIPSO processing code but setting this to 0 (False) should + result in less work (i.e. it should be faster) but could cause problems + with other implementations that require strict checking. + Default: 0 IP Variables: ip_local_port_range - 2 INTEGERS Defines the local port range that is used by TCP and UDP to - choose the local port. The first number is the first, the - second the last local port number. Default value depends on - amount of memory available on the system: - > 128Mb 32768-61000 - < 128Mb 1024-4999 or even less. - This number defines number of active connections, which this - system can issue simultaneously to systems not supporting - TCP extensions (timestamps). With tcp_tw_recycle enabled - (i.e. by default) range 1024-4999 is enough to issue up to - 2000 connections per second to systems supporting timestamps. + choose the local port. The first number is the first, the + second the last local port number. The default values are + 32768 and 61000 respectively. + +ip_local_reserved_ports - list of comma separated ranges + Specify the ports which are reserved for known third-party + applications. These ports will not be used by automatic port + assignments (e.g. when calling connect() or bind() with port + number 0). Explicit port allocation behavior is unchanged. + + The format used for both input and output is a comma separated + list of ranges (e.g. "1,2-4,10-10" for ports 1, 2, 3, 4 and + 10). Writing to the file will clear all previously reserved + ports and update the current list with the one given in the + input. + + Note that ip_local_port_range and ip_local_reserved_ports + settings are independent and both are considered by the kernel + when determining which ports are available for automatic port + assignments. + + You can reserve ports which are not in the current + ip_local_port_range, e.g.: + + $ cat /proc/sys/net/ipv4/ip_local_port_range + 32000 61000 + $ cat /proc/sys/net/ipv4/ip_local_reserved_ports + 8080,9148 + + although this is redundant. However such a setting is useful + if later the port range is changed to a value that will + include the reserved ports. + + Default: Empty ip_nonlocal_bind - BOOLEAN If set, allows processes to bind() to non-local IP addresses, @@ -354,6 +741,15 @@ ip_dynaddr - BOOLEAN occurs. Default: 0 +ip_early_demux - BOOLEAN + Optimize input packet processing down to one demux for + certain kinds of local sockets. Currently we only do this + for established TCP sockets. + + It may add an additional cost for pure routing workloads that + reduces overall throughput, in such case you should disable it. + Default: 1 + icmp_echo_ignore_all - BOOLEAN If set non-zero, then the kernel will ignore all ICMP ECHO requests sent to it. @@ -367,8 +763,9 @@ icmp_echo_ignore_broadcasts - BOOLEAN icmp_ratelimit - INTEGER Limit the maximal rates for sending ICMP packets whose type matches icmp_ratemask (see below) to specific targets. - 0 to disable any limiting, otherwise the maximal rate in jiffies(1) - Default: 100 + 0 to disable any limiting, + otherwise the minimal space between responses in milliseconds. + Default: 1000 icmp_ratemask - INTEGER Mask made of ICMP types for which rates are being limited. @@ -397,16 +794,51 @@ icmp_ignore_bogus_error_responses - BOOLEAN frames. Such violations are normally logged via a kernel warning. If this is set to TRUE, the kernel will not give such warnings, which will avoid log file clutter. - Default: FALSE + Default: 1 + +icmp_errors_use_inbound_ifaddr - BOOLEAN + + If zero, icmp error messages are sent with the primary address of + the exiting interface. + + If non-zero, the message will be sent with the primary address of + the interface that received the packet that caused the icmp error. + This is the behaviour network many administrators will expect from + a router. And it can make debugging complicated network layouts + much easier. + + Note that if no primary address exists for the interface selected, + then the primary address of the first non-loopback interface that + has one will be used regardless of this setting. + + Default: 0 igmp_max_memberships - INTEGER Change the maximum number of multicast groups we can subscribe to. Default: 20 -conf/interface/* changes special settings per interface (where "interface" is - the name of your network interface) -conf/all/* is special, changes the settings for all interfaces + Theoretical maximum value is bounded by having to send a membership + report in a single datagram (i.e. the report can't span multiple + datagrams, or risk confusing the switch and leaving groups you don't + intend to). + + The number of supported groups 'M' is bounded by the number of group + report entries you can fit into a single datagram of 65535 bytes. + + M = 65536-sizeof (ip header)/(sizeof(Group record)) + + Group records are variable length, with a minimum of 12 bytes. + So net.ipv4.igmp_max_memberships should not be set higher than: + + (65536-24) / 12 = 5459 + The value 5459 assumes no IP header options, so in practice + this number may be lower. + + conf/interface/* changes special settings per interface (where + "interface" is the name of your network interface) + + conf/all/* is special, changes the settings for all interfaces log_martians - BOOLEAN Log packets with impossible addresses to kernel log. @@ -417,11 +849,11 @@ log_martians - BOOLEAN accept_redirects - BOOLEAN Accept ICMP redirect messages. accept_redirects for the interface will be enabled if: - - both conf/{all,interface}/accept_redirects are TRUE in the case forwarding - for the interface is enabled + - both conf/{all,interface}/accept_redirects are TRUE in the case + forwarding for the interface is enabled or - - at least one of conf/{all,interface}/accept_redirects is TRUE in the case - forwarding for the interface is disabled + - at least one of conf/{all,interface}/accept_redirects is TRUE in the + case forwarding for the interface is disabled accept_redirects for the interface will be disabled otherwise default TRUE (host) FALSE (router) @@ -432,8 +864,8 @@ forwarding - BOOLEAN mc_forwarding - BOOLEAN Do multicast routing. The kernel needs to be compiled with CONFIG_MROUTE and a multicast routing daemon is required. - conf/all/mc_forwarding must also be set to TRUE to enable multicast routing - for the interface + conf/all/mc_forwarding must also be set to TRUE to enable multicast + routing for the interface medium_id - INTEGER Integer value used to differentiate the devices by the medium they @@ -441,7 +873,7 @@ medium_id - INTEGER the broadcast packets are received only on one of them. The default value 0 means that the device is the only interface to its medium, value of -1 means that medium is not known. - + Currently, it is used to change the proxy_arp behavior: the proxy_arp feature is enabled for packets forwarded between two devices attached to different media. @@ -452,6 +884,25 @@ proxy_arp - BOOLEAN conf/{all,interface}/proxy_arp is set to TRUE, it will be disabled otherwise +proxy_arp_pvlan - BOOLEAN + Private VLAN proxy arp. + Basically allow proxy arp replies back to the same interface + (from which the ARP request/solicitation was received). + + This is done to support (ethernet) switch features, like RFC + 3069, where the individual ports are NOT allowed to + communicate with each other, but they are allowed to talk to + the upstream router. As described in RFC 3069, it is possible + to allow these hosts to communicate through the upstream + router by proxy_arp'ing. Don't need to be used together with + proxy_arp. + + This technology is known by different names: + In RFC 3069 it is called VLAN Aggregation. + Cisco and Allied Telesyn call it Private VLAN. + Hewlett-Packard call it Source-Port filtering or port-isolation. + Ericsson call it MAC-Forced Forwarding (RFC Draft). + shared_media - BOOLEAN Send(router) or accept(host) RFC1620 shared media redirects. Overrides ip_secure_redirects. @@ -491,17 +942,39 @@ accept_source_route - BOOLEAN default TRUE (router) FALSE (host) -rp_filter - BOOLEAN - 1 - do source validation by reversed path, as specified in RFC1812 - Recommended option for single homed hosts and stub network - routers. Could cause troubles for complicated (not loop free) - networks running a slow unreliable protocol (sort of RIP), - or using static routes. +accept_local - BOOLEAN + Accept packets with local source addresses. In combination + with suitable routing, this can be used to direct packets + between two local interfaces over the wire and have them + accepted properly. - 0 - No source validation. + rp_filter must be set to a non-zero value in order for + accept_local to have an effect. - conf/all/rp_filter must also be set to TRUE to do source validation - on the interface + default FALSE + +route_localnet - BOOLEAN + Do not consider loopback addresses as martian source or destination + while routing. This enables the use of 127/8 for local routing purposes. + default FALSE + +rp_filter - INTEGER + 0 - No source validation. + 1 - Strict mode as defined in RFC3704 Strict Reverse Path + Each incoming packet is tested against the FIB and if the interface + is not the best reverse path the packet check will fail. + By default failed packets are discarded. + 2 - Loose mode as defined in RFC3704 Loose Reverse Path + Each incoming packet's source address is also tested against the FIB + and if the source address is not reachable via any interface + the packet check will fail. + + Current recommended practice in RFC3704 is to enable strict mode + to prevent IP spoofing from DDos attacks. If using asymmetric routing + or other complicated routing, then loose mode is recommended. + + The max value from conf/{all,interface}/rp_filter is used + when doing source validation on the {interface}. Default value is 0. Note that some distributions enable it in startup scripts. @@ -574,6 +1047,26 @@ arp_ignore - INTEGER The max value from conf/{all,interface}/arp_ignore is used when ARP request is received on the {interface} +arp_notify - BOOLEAN + Define mode for notification of address and device changes. + 0 - (default): do nothing + 1 - Generate gratuitous arp requests when device is brought up + or hardware address changes. + +arp_accept - BOOLEAN + Define behavior for gratuitous ARP frames who's IP is not + already present in the ARP table: + 0 - don't create new entries in the ARP table + 1 - create new entries in the ARP table + + Both replies and requests type gratuitous arp will trigger the + ARP table to be updated, if this setting is on. + + If the ARP table already contains the IP address of the + gratuitous arp frame, the arp table will be updated regardless + if this setting is on or off. + + app_solicit - INTEGER The maximum number of probes to send to the user space ARP daemon via netlink before dropping back to multicast probes (see @@ -585,16 +1078,26 @@ disable_policy - BOOLEAN disable_xfrm - BOOLEAN Disable IPSEC encryption on this interface, whatever the policy +igmpv2_unsolicited_report_interval - INTEGER + The interval in milliseconds in which the next unsolicited + IGMPv1 or IGMPv2 report retransmit will take place. + Default: 10000 (10 seconds) + +igmpv3_unsolicited_report_interval - INTEGER + The interval in milliseconds in which the next unsolicited + IGMPv3 report retransmit will take place. + Default: 1000 (1 seconds) + +promote_secondaries - BOOLEAN + When a primary IP address is removed from this interface + promote a corresponding secondary IP address instead of + removing all the corresponding secondary IP addresses. tag - INTEGER Allows you to write a number, which can be used as required. Default value is 0. -(1) Jiffie: internal timeunit for the kernel. On the i386 1/100s, on the -Alpha 1/1024s. See the HZ define in /usr/include/asm/param.h for the exact -value on your system. - Alexey Kuznetsov. kuznet@ms2.inr.ac.ru @@ -614,29 +1117,44 @@ apply to IPv6 [XXX?]. bindv6only - BOOLEAN Default value for IPV6_V6ONLY socket option, - which restricts use of the IPv6 socket to IPv6 communication + which restricts use of the IPv6 socket to IPv6 communication only. TRUE: disable IPv4-mapped address feature FALSE: enable IPv4-mapped address feature - Default: FALSE (as specified in RFC2553bis) + Default: FALSE (as specified in RFC3493) + +flowlabel_consistency - BOOLEAN + Protect the consistency (and unicity) of flow label. + You have to disable it to use IPV6_FL_F_REFLECT flag on the + flow label manager. + TRUE: enabled + FALSE: disabled + Default: TRUE + +anycast_src_echo_reply - BOOLEAN + Controls the use of anycast addresses as source addresses for ICMPv6 + echo reply + TRUE: enabled + FALSE: disabled + Default: FALSE IPv6 Fragmentation: ip6frag_high_thresh - INTEGER - Maximum memory used to reassemble IPv6 fragments. When + Maximum memory used to reassemble IPv6 fragments. When ip6frag_high_thresh bytes of memory is allocated for this purpose, the fragment handler will toss packets until ip6frag_low_thresh is reached. - + ip6frag_low_thresh - INTEGER - See ip6frag_high_thresh + See ip6frag_high_thresh ip6frag_time - INTEGER Time in seconds to keep an IPv6 fragment in memory. ip6frag_secret_interval - INTEGER - Regeneration interval (in seconds) of the hash secret (or lifetime + Regeneration interval (in seconds) of the hash secret (or lifetime for the hash secret) for IPv6 fragments. Default: 600 @@ -645,78 +1163,132 @@ conf/default/*: conf/all/*: - Change all the interface-specific settings. + Change all the interface-specific settings. [XXX: Other special features than forwarding?] conf/all/forwarding - BOOLEAN - Enable global IPv6 forwarding between all interfaces. + Enable global IPv6 forwarding between all interfaces. - IPv4 and IPv6 work differently here; e.g. netfilter must be used + IPv4 and IPv6 work differently here; e.g. netfilter must be used to control which interfaces may forward packets and which not. - This also sets all interfaces' Host/Router setting + This also sets all interfaces' Host/Router setting 'forwarding' to the specified value. See below for details. This referred to as global forwarding. +proxy_ndp - BOOLEAN + Do proxy ndp. + conf/interface/*: Change special settings per interface. - The functional behaviour for certain settings is different + The functional behaviour for certain settings is different depending on whether local forwarding is enabled or not. -accept_ra - BOOLEAN +accept_ra - INTEGER Accept Router Advertisements; autoconfigure using them. - + + It also determines whether or not to transmit Router + Solicitations. If and only if the functional setting is to + accept Router Advertisements, Router Solicitations will be + transmitted. + + Possible values are: + 0 Do not accept Router Advertisements. + 1 Accept Router Advertisements if forwarding is disabled. + 2 Overrule forwarding behaviour. Accept Router Advertisements + even if forwarding is enabled. + Functional default: enabled if local forwarding is disabled. disabled if local forwarding is enabled. +accept_ra_defrtr - BOOLEAN + Learn default router in Router Advertisement. + + Functional default: enabled if accept_ra is enabled. + disabled if accept_ra is disabled. + +accept_ra_pinfo - BOOLEAN + Learn Prefix Information in Router Advertisement. + + Functional default: enabled if accept_ra is enabled. + disabled if accept_ra is disabled. + +accept_ra_rt_info_max_plen - INTEGER + Maximum prefix length of Route Information in RA. + + Route Information w/ prefix larger than or equal to this + variable shall be ignored. + + Functional default: 0 if accept_ra_rtr_pref is enabled. + -1 if accept_ra_rtr_pref is disabled. + +accept_ra_rtr_pref - BOOLEAN + Accept Router Preference in RA. + + Functional default: enabled if accept_ra is enabled. + disabled if accept_ra is disabled. + accept_redirects - BOOLEAN Accept Redirects. Functional default: enabled if local forwarding is disabled. disabled if local forwarding is enabled. +accept_source_route - INTEGER + Accept source routing (routing extension header). + + >= 0: Accept only routing header type 2. + < 0: Do not accept routing header. + + Default: 0 + autoconf - BOOLEAN - Autoconfigure addresses using Prefix Information in Router + Autoconfigure addresses using Prefix Information in Router Advertisements. - Functional default: enabled if accept_ra is enabled. - disabled if accept_ra is disabled. + Functional default: enabled if accept_ra_pinfo is enabled. + disabled if accept_ra_pinfo is disabled. dad_transmits - INTEGER The amount of Duplicate Address Detection probes to send. Default: 1 - -forwarding - BOOLEAN - Configure interface-specific Host/Router behaviour. - Note: It is recommended to have the same setting on all +forwarding - INTEGER + Configure interface-specific Host/Router behaviour. + + Note: It is recommended to have the same setting on all interfaces; mixed router/host scenarios are rather uncommon. - FALSE: + Possible values are: + 0 Forwarding disabled + 1 Forwarding enabled + + FALSE (0): By default, Host behaviour is assumed. This means: 1. IsRouter flag is not set in Neighbour Advertisements. - 2. Router Solicitations are being sent when necessary. - 3. If accept_ra is TRUE (default), accept Router + 2. If accept_ra is TRUE (default), transmit Router + Solicitations. + 3. If accept_ra is TRUE (default), accept Router Advertisements (and do autoconfiguration). 4. If accept_redirects is TRUE (default), accept Redirects. - TRUE: + TRUE (1): - If local forwarding is enabled, Router behaviour is assumed. + If local forwarding is enabled, Router behaviour is assumed. This means exactly the reverse from the above: 1. IsRouter flag is set in Neighbour Advertisements. - 2. Router Solicitations are not sent. - 3. Router Advertisements are ignored. + 2. Router Solicitations are not sent unless accept_ra is 2. + 3. Router Advertisements are ignored unless accept_ra is 2. 4. Redirects are ignored. - Default: FALSE if global forwarding is disabled (default), - otherwise TRUE. + Default: 0 (disabled) if global forwarding is disabled (default), + otherwise 1 (enabled). hop_limit - INTEGER Default Hop Limit to set. @@ -726,6 +1298,12 @@ mtu - INTEGER Default Maximum Transfer Unit Default: 1280 (IPv6 required minimum) +router_probe_interval - INTEGER + Minimum interval (in seconds) between Router Probing described + in RFC4191. + + Default: 60 + router_solicitation_delay - INTEGER Number of seconds to wait after interface is brought up before sending Router Solicitations. @@ -736,7 +1314,7 @@ router_solicitation_interval - INTEGER Default: 4 router_solicitations - INTEGER - Number of Router Solicitations to send until assuming no + Number of Router Solicitations to send until assuming no routers are present. Default: 3 @@ -760,28 +1338,94 @@ temp_prefered_lft - INTEGER max_desync_factor - INTEGER Maximum value for DESYNC_FACTOR, which is a random value - that ensures that clients don't synchronize with each + that ensures that clients don't synchronize with each other and generate new addresses at exactly the same time. value is in seconds. Default: 600 - + regen_max_retry - INTEGER Number of attempts before give up attempting to generate valid temporary addresses. Default: 5 max_addresses - INTEGER - Number of maximum addresses per interface. 0 disables limitation. - It is recommended not set too large value (or 0) because it would - be too easy way to crash kernel to allow to create too much of - autoconfigured addresses. + Maximum number of autoconfigured addresses per interface. Setting + to zero disables the limitation. It is not recommended to set this + value too large (or to zero) because it would be an easy way to + crash the kernel by allowing too many addresses to be created. Default: 16 +disable_ipv6 - BOOLEAN + Disable IPv6 operation. If accept_dad is set to 2, this value + will be dynamically set to TRUE if DAD fails for the link-local + address. + Default: FALSE (enable IPv6 operation) + + When this value is changed from 1 to 0 (IPv6 is being enabled), + it will dynamically create a link-local address on the given + interface and start Duplicate Address Detection, if necessary. + + When this value is changed from 0 to 1 (IPv6 is being disabled), + it will dynamically delete all address on the given interface. + +accept_dad - INTEGER + Whether to accept DAD (Duplicate Address Detection). + 0: Disable DAD + 1: Enable DAD (default) + 2: Enable DAD, and disable IPv6 operation if MAC-based duplicate + link-local address has been found. + +force_tllao - BOOLEAN + Enable sending the target link-layer address option even when + responding to a unicast neighbor solicitation. + Default: FALSE + + Quoting from RFC 2461, section 4.4, Target link-layer address: + + "The option MUST be included for multicast solicitations in order to + avoid infinite Neighbor Solicitation "recursion" when the peer node + does not have a cache entry to return a Neighbor Advertisements + message. When responding to unicast solicitations, the option can be + omitted since the sender of the solicitation has the correct link- + layer address; otherwise it would not have be able to send the unicast + solicitation in the first place. However, including the link-layer + address in this case adds little overhead and eliminates a potential + race condition where the sender deletes the cached link-layer address + prior to receiving a response to a previous solicitation." + +ndisc_notify - BOOLEAN + Define mode for notification of address and device changes. + 0 - (default): do nothing + 1 - Generate unsolicited neighbour advertisements when device is brought + up or hardware address changes. + +mldv1_unsolicited_report_interval - INTEGER + The interval in milliseconds in which the next unsolicited + MLDv1 report retransmit will take place. + Default: 10000 (10 seconds) + +mldv2_unsolicited_report_interval - INTEGER + The interval in milliseconds in which the next unsolicited + MLDv2 report retransmit will take place. + Default: 1000 (1 second) + +force_mld_version - INTEGER + 0 - (default) No enforcement of a MLD version, MLDv1 fallback allowed + 1 - Enforce to use MLD version 1 + 2 - Enforce to use MLD version 2 + +suppress_frag_ndisc - INTEGER + Control RFC 6980 (Security Implications of IPv6 Fragmentation + with IPv6 Neighbor Discovery) behavior: + 1 - (default) discard fragmented neighbor discovery packets + 0 - allow fragmented neighbor discovery packets + icmp/*: ratelimit - INTEGER Limit the maximal rates for sending ICMPv6 packets. - 0 to disable any limiting, otherwise the maximal rate in jiffies(1) - Default: 100 + 0 to disable any limiting, + otherwise the minimal space between responses in milliseconds. + Default: 1000 IPv6 Update by: @@ -807,30 +1451,263 @@ bridge-nf-call-ip6tables - BOOLEAN Default: 1 bridge-nf-filter-vlan-tagged - BOOLEAN - 1 : pass bridged vlan-tagged ARP/IP traffic to arptables/iptables. + 1 : pass bridged vlan-tagged ARP/IP/IPv6 traffic to {arp,ip,ip6}tables. 0 : disable this. + Default: 0 + +bridge-nf-filter-pppoe-tagged - BOOLEAN + 1 : pass bridged pppoe-tagged IP/IPv6 traffic to {ip,ip6}tables. + 0 : disable this. + Default: 0 + +bridge-nf-pass-vlan-input-dev - BOOLEAN + 1: if bridge-nf-filter-vlan-tagged is enabled, try to find a vlan + interface on the bridge and set the netfilter input device to the vlan. + This allows use of e.g. "iptables -i br0.1" and makes the REDIRECT + target work with vlan-on-top-of-bridge interfaces. When no matching + vlan interface is found, or this switch is off, the input device is + set to the bridge interface. + 0: disable bridge netfilter vlan interface lookup. + Default: 0 + +proc/sys/net/sctp/* Variables: + +addip_enable - BOOLEAN + Enable or disable extension of Dynamic Address Reconfiguration + (ADD-IP) functionality specified in RFC5061. This extension provides + the ability to dynamically add and remove new addresses for the SCTP + associations. + + 1: Enable extension. + + 0: Disable extension. + + Default: 0 + +addip_noauth_enable - BOOLEAN + Dynamic Address Reconfiguration (ADD-IP) requires the use of + authentication to protect the operations of adding or removing new + addresses. This requirement is mandated so that unauthorized hosts + would not be able to hijack associations. However, older + implementations may not have implemented this requirement while + allowing the ADD-IP extension. For reasons of interoperability, + we provide this variable to control the enforcement of the + authentication requirement. + + 1: Allow ADD-IP extension to be used without authentication. This + should only be set in a closed environment for interoperability + with older implementations. + + 0: Enforce the authentication requirement + + Default: 0 + +auth_enable - BOOLEAN + Enable or disable Authenticated Chunks extension. This extension + provides the ability to send and receive authenticated chunks and is + required for secure operation of Dynamic Address Reconfiguration + (ADD-IP) extension. + + 1: Enable this extension. + 0: Disable this extension. + + Default: 0 + +prsctp_enable - BOOLEAN + Enable or disable the Partial Reliability extension (RFC3758) which + is used to notify peers that a given DATA should no longer be expected. + + 1: Enable extension + 0: Disable + Default: 1 +max_burst - INTEGER + The limit of the number of new packets that can be initially sent. It + controls how bursty the generated traffic can be. + + Default: 4 + +association_max_retrans - INTEGER + Set the maximum number for retransmissions that an association can + attempt deciding that the remote end is unreachable. If this value + is exceeded, the association is terminated. + + Default: 10 + +max_init_retransmits - INTEGER + The maximum number of retransmissions of INIT and COOKIE-ECHO chunks + that an association will attempt before declaring the destination + unreachable and terminating. + + Default: 8 + +path_max_retrans - INTEGER + The maximum number of retransmissions that will be attempted on a given + path. Once this threshold is exceeded, the path is considered + unreachable, and new traffic will use a different path when the + association is multihomed. + + Default: 5 + +pf_retrans - INTEGER + The number of retransmissions that will be attempted on a given path + before traffic is redirected to an alternate transport (should one + exist). Note this is distinct from path_max_retrans, as a path that + passes the pf_retrans threshold can still be used. Its only + deprioritized when a transmission path is selected by the stack. This + setting is primarily used to enable fast failover mechanisms without + having to reduce path_max_retrans to a very low value. See: + http://www.ietf.org/id/draft-nishida-tsvwg-sctp-failover-05.txt + for details. Note also that a value of pf_retrans > path_max_retrans + disables this feature + + Default: 0 + +rto_initial - INTEGER + The initial round trip timeout value in milliseconds that will be used + in calculating round trip times. This is the initial time interval + for retransmissions. + + Default: 3000 + +rto_max - INTEGER + The maximum value (in milliseconds) of the round trip timeout. This + is the largest time interval that can elapse between retransmissions. + + Default: 60000 + +rto_min - INTEGER + The minimum value (in milliseconds) of the round trip timeout. This + is the smallest time interval the can elapse between retransmissions. + + Default: 1000 + +hb_interval - INTEGER + The interval (in milliseconds) between HEARTBEAT chunks. These chunks + are sent at the specified interval on idle paths to probe the state of + a given path between 2 associations. + + Default: 30000 + +sack_timeout - INTEGER + The amount of time (in milliseconds) that the implementation will wait + to send a SACK. + + Default: 200 + +valid_cookie_life - INTEGER + The default lifetime of the SCTP cookie (in milliseconds). The cookie + is used during association establishment. + + Default: 60000 + +cookie_preserve_enable - BOOLEAN + Enable or disable the ability to extend the lifetime of the SCTP cookie + that is used during the establishment phase of SCTP association + + 1: Enable cookie lifetime extension. + 0: Disable + + Default: 1 + +cookie_hmac_alg - STRING + Select the hmac algorithm used when generating the cookie value sent by + a listening sctp socket to a connecting client in the INIT-ACK chunk. + Valid values are: + * md5 + * sha1 + * none + Ability to assign md5 or sha1 as the selected alg is predicated on the + configuration of those algorithms at build time (CONFIG_CRYPTO_MD5 and + CONFIG_CRYPTO_SHA1). + + Default: Dependent on configuration. MD5 if available, else SHA1 if + available, else none. + +rcvbuf_policy - INTEGER + Determines if the receive buffer is attributed to the socket or to + association. SCTP supports the capability to create multiple + associations on a single socket. When using this capability, it is + possible that a single stalled association that's buffering a lot + of data may block other associations from delivering their data by + consuming all of the receive buffer space. To work around this, + the rcvbuf_policy could be set to attribute the receiver buffer space + to each association instead of the socket. This prevents the described + blocking. + + 1: rcvbuf space is per association + 0: rcvbuf space is per socket + + Default: 0 + +sndbuf_policy - INTEGER + Similar to rcvbuf_policy above, this applies to send buffer space. + + 1: Send buffer is tracked per association + 0: Send buffer is tracked per socket. + + Default: 0 + +sctp_mem - vector of 3 INTEGERs: min, pressure, max + Number of pages allowed for queueing by all SCTP sockets. + + min: Below this number of pages SCTP is not bothered about its + memory appetite. When amount of memory allocated by SCTP exceeds + this number, SCTP starts to moderate memory usage. + + pressure: This value was introduced to follow format of tcp_mem. + + max: Number of pages allowed for queueing by all SCTP sockets. + + Default is calculated at boot time from amount of available memory. + +sctp_rmem - vector of 3 INTEGERs: min, default, max + Only the first value ("min") is used, "default" and "max" are + ignored. + + min: Minimal size of receive buffer used by SCTP socket. + It is guaranteed to each SCTP socket (but not association) even + under moderate memory pressure. + + Default: 1 page + +sctp_wmem - vector of 3 INTEGERs: min, default, max + Currently this tunable has no effect. + +addr_scope_policy - INTEGER + Control IPv4 address scoping - draft-stewart-tsvwg-sctp-ipv4-00 + + 0 - Disable IPv4 address scoping + 1 - Enable IPv4 address scoping + 2 - Follow draft but allow IPv4 private addresses + 3 - Follow draft but allow IPv4 link local addresses + + Default: 1 + + +/proc/sys/net/core/* + Please see: Documentation/sysctl/net.txt for descriptions of these entries. + + +/proc/sys/net/unix/* +max_dgram_qlen - INTEGER + The maximum length of dgram socket receive queue + + Default: 10 + UNDOCUMENTED: -dev_weight FIXME -discovery_slots FIXME -discovery_timeout FIXME -fast_poll_increase FIXME -ip6_queue_maxlen FIXME -lap_keepalive_time FIXME -lo_cong FIXME -max_baud_rate FIXME -max_dgram_qlen FIXME -max_noreply_time FIXME -max_tx_data_size FIXME -max_tx_window FIXME -min_tx_turn_time FIXME -mod_cong FIXME -no_cong FIXME -no_cong_thresh FIXME -slot_timeout FIXME -warn_noreply_time FIXME - -$Id: ip-sysctl.txt,v 1.20 2001/12/13 09:00:18 davem Exp $ +/proc/sys/net/irda/* + fast_poll_increase FIXME + warn_noreply_time FIXME + discovery_slots FIXME + slot_timeout FIXME + max_baud_rate FIXME + discovery_timeout FIXME + lap_keepalive_time FIXME + max_noreply_time FIXME + max_tx_data_size FIXME + max_tx_window FIXME + min_tx_turn_time FIXME diff --git a/Documentation/networking/ipddp.txt b/Documentation/networking/ipddp.txt index 661a5558dd8..ba5c217fffe 100644 --- a/Documentation/networking/ipddp.txt +++ b/Documentation/networking/ipddp.txt @@ -36,11 +36,6 @@ AppleTalk-IP to IP decapsulation. Basic instructions for user space tools ======================================= -To enable AppleTalk-IP decapsulation/encapsulation you will need the -proper tools. You can get the tools for decapsulation from -http://spacs1.spacs.k12.wi.us/~jschlst/index.html and for encapsulation -from http://www.maths.unm.edu/~bradford/ltpc.html - I will briefly describe the operation of the tools, but you will need to consult the supporting documentation for each set of tools. diff --git a/Documentation/networking/iphase.txt b/Documentation/networking/iphase.txt index 39ccb8595bf..670b72f1658 100644 --- a/Documentation/networking/iphase.txt +++ b/Documentation/networking/iphase.txt @@ -22,7 +22,7 @@ The features and limitations of this driver are as follows: - All variants of Interphase ATM PCI (i)Chip adapter cards are supported, including x575 (OC3, control memory 128K , 512K and packet memory 128K, 512K and 1M), x525 (UTP25) and x531 (DS3 and E3). See - http://www.iphase.com/products/ClassSheet.cfm?ClassID=ATM + http://www.iphase.com/ for details. - Only x86 platforms are supported. - SMP is supported. @@ -81,7 +81,7 @@ Installation 1M. The RAM size decides the number of buffers and buffer size. The default size and number of buffers are set as following: - Totol Rx RAM Tx RAM Rx Buf Tx Buf Rx buf Tx buf + Total Rx RAM Tx RAM Rx Buf Tx Buf Rx buf Tx buf RAM size size size size size cnt cnt -------- ------ ------ ------ ------ ------ ------ 128K 64K 64K 10K 10K 6 6 diff --git a/Documentation/networking/ipsec.txt b/Documentation/networking/ipsec.txt new file mode 100644 index 00000000000..8dbc08b7e43 --- /dev/null +++ b/Documentation/networking/ipsec.txt @@ -0,0 +1,38 @@ + +Here documents known IPsec corner cases which need to be keep in mind when +deploy various IPsec configuration in real world production environment. + +1. IPcomp: Small IP packet won't get compressed at sender, and failed on + policy check on receiver. + +Quote from RFC3173: +2.2. Non-Expansion Policy + + If the total size of a compressed payload and the IPComp header, as + defined in section 3, is not smaller than the size of the original + payload, the IP datagram MUST be sent in the original non-compressed + form. To clarify: If an IP datagram is sent non-compressed, no + + IPComp header is added to the datagram. This policy ensures saving + the decompression processing cycles and avoiding incurring IP + datagram fragmentation when the expanded datagram is larger than the + MTU. + + Small IP datagrams are likely to expand as a result of compression. + Therefore, a numeric threshold should be applied before compression, + where IP datagrams of size smaller than the threshold are sent in the + original form without attempting compression. The numeric threshold + is implementation dependent. + +Current IPComp implementation is indeed by the book, while as in practice +when sending non-compressed packet to the peer(whether or not packet len +is smaller than the threshold or the compressed len is large than original +packet len), the packet is dropped when checking the policy as this packet +matches the selector but not coming from any XFRM layer, i.e., with no +security path. Such naked packet will not eventually make it to upper layer. +The result is much more wired to the user when ping peer with different +payload length. + +One workaround is try to set "level use" for each policy if user observed +above scenario. The consequence of doing so is small packet(uncompressed) +will skip policy checking on receiver side. diff --git a/Documentation/networking/ipv6.txt b/Documentation/networking/ipv6.txt new file mode 100644 index 00000000000..6cd74fa5535 --- /dev/null +++ b/Documentation/networking/ipv6.txt @@ -0,0 +1,72 @@ + +Options for the ipv6 module are supplied as parameters at load time. + +Module options may be given as command line arguments to the insmod +or modprobe command, but are usually specified in either +/etc/modules.d/*.conf configuration files, or in a distro-specific +configuration file. + +The available ipv6 module parameters are listed below. If a parameter +is not specified the default value is used. + +The parameters are as follows: + +disable + + Specifies whether to load the IPv6 module, but disable all + its functionality. This might be used when another module + has a dependency on the IPv6 module being loaded, but no + IPv6 addresses or operations are desired. + + The possible values and their effects are: + + 0 + IPv6 is enabled. + + This is the default value. + + 1 + IPv6 is disabled. + + No IPv6 addresses will be added to interfaces, and + it will not be possible to open an IPv6 socket. + + A reboot is required to enable IPv6. + +autoconf + + Specifies whether to enable IPv6 address autoconfiguration + on all interfaces. This might be used when one does not wish + for addresses to be automatically generated from prefixes + received in Router Advertisements. + + The possible values and their effects are: + + 0 + IPv6 address autoconfiguration is disabled on all interfaces. + + Only the IPv6 loopback address (::1) and link-local addresses + will be added to interfaces. + + 1 + IPv6 address autoconfiguration is enabled on all interfaces. + + This is the default value. + +disable_ipv6 + + Specifies whether to disable IPv6 on all interfaces. + This might be used when no IPv6 addresses are desired. + + The possible values and their effects are: + + 0 + IPv6 is enabled on all interfaces. + + This is the default value. + + 1 + IPv6 is disabled on all interfaces. + + No IPv6 addresses will be added to interfaces. + diff --git a/Documentation/networking/ipvs-sysctl.txt b/Documentation/networking/ipvs-sysctl.txt new file mode 100644 index 00000000000..7a3c0472959 --- /dev/null +++ b/Documentation/networking/ipvs-sysctl.txt @@ -0,0 +1,211 @@ +/proc/sys/net/ipv4/vs/* Variables: + +am_droprate - INTEGER + default 10 + + It sets the always mode drop rate, which is used in the mode 3 + of the drop_rate defense. + +amemthresh - INTEGER + default 1024 + + It sets the available memory threshold (in pages), which is + used in the automatic modes of defense. When there is no + enough available memory, the respective strategy will be + enabled and the variable is automatically set to 2, otherwise + the strategy is disabled and the variable is set to 1. + +backup_only - BOOLEAN + 0 - disabled (default) + not 0 - enabled + + If set, disable the director function while the server is + in backup mode to avoid packet loops for DR/TUN methods. + +conntrack - BOOLEAN + 0 - disabled (default) + not 0 - enabled + + If set, maintain connection tracking entries for + connections handled by IPVS. + + This should be enabled if connections handled by IPVS are to be + also handled by stateful firewall rules. That is, iptables rules + that make use of connection tracking. It is a performance + optimisation to disable this setting otherwise. + + Connections handled by the IPVS FTP application module + will have connection tracking entries regardless of this setting. + + Only available when IPVS is compiled with CONFIG_IP_VS_NFCT enabled. + +cache_bypass - BOOLEAN + 0 - disabled (default) + not 0 - enabled + + If it is enabled, forward packets to the original destination + directly when no cache server is available and destination + address is not local (iph->daddr is RTN_UNICAST). It is mostly + used in transparent web cache cluster. + +debug_level - INTEGER + 0 - transmission error messages (default) + 1 - non-fatal error messages + 2 - configuration + 3 - destination trash + 4 - drop entry + 5 - service lookup + 6 - scheduling + 7 - connection new/expire, lookup and synchronization + 8 - state transition + 9 - binding destination, template checks and applications + 10 - IPVS packet transmission + 11 - IPVS packet handling (ip_vs_in/ip_vs_out) + 12 or more - packet traversal + + Only available when IPVS is compiled with CONFIG_IP_VS_DEBUG enabled. + + Higher debugging levels include the messages for lower debugging + levels, so setting debug level 2, includes level 0, 1 and 2 + messages. Thus, logging becomes more and more verbose the higher + the level. + +drop_entry - INTEGER + 0 - disabled (default) + + The drop_entry defense is to randomly drop entries in the + connection hash table, just in order to collect back some + memory for new connections. In the current code, the + drop_entry procedure can be activated every second, then it + randomly scans 1/32 of the whole and drops entries that are in + the SYN-RECV/SYNACK state, which should be effective against + syn-flooding attack. + + The valid values of drop_entry are from 0 to 3, where 0 means + that this strategy is always disabled, 1 and 2 mean automatic + modes (when there is no enough available memory, the strategy + is enabled and the variable is automatically set to 2, + otherwise the strategy is disabled and the variable is set to + 1), and 3 means that that the strategy is always enabled. + +drop_packet - INTEGER + 0 - disabled (default) + + The drop_packet defense is designed to drop 1/rate packets + before forwarding them to real servers. If the rate is 1, then + drop all the incoming packets. + + The value definition is the same as that of the drop_entry. In + the automatic mode, the rate is determined by the follow + formula: rate = amemthresh / (amemthresh - available_memory) + when available memory is less than the available memory + threshold. When the mode 3 is set, the always mode drop rate + is controlled by the /proc/sys/net/ipv4/vs/am_droprate. + +expire_nodest_conn - BOOLEAN + 0 - disabled (default) + not 0 - enabled + + The default value is 0, the load balancer will silently drop + packets when its destination server is not available. It may + be useful, when user-space monitoring program deletes the + destination server (because of server overload or wrong + detection) and add back the server later, and the connections + to the server can continue. + + If this feature is enabled, the load balancer will expire the + connection immediately when a packet arrives and its + destination server is not available, then the client program + will be notified that the connection is closed. This is + equivalent to the feature some people requires to flush + connections when its destination is not available. + +expire_quiescent_template - BOOLEAN + 0 - disabled (default) + not 0 - enabled + + When set to a non-zero value, the load balancer will expire + persistent templates when the destination server is quiescent. + This may be useful, when a user makes a destination server + quiescent by setting its weight to 0 and it is desired that + subsequent otherwise persistent connections are sent to a + different destination server. By default new persistent + connections are allowed to quiescent destination servers. + + If this feature is enabled, the load balancer will expire the + persistence template if it is to be used to schedule a new + connection and the destination server is quiescent. + +nat_icmp_send - BOOLEAN + 0 - disabled (default) + not 0 - enabled + + It controls sending icmp error messages (ICMP_DEST_UNREACH) + for VS/NAT when the load balancer receives packets from real + servers but the connection entries don't exist. + +secure_tcp - INTEGER + 0 - disabled (default) + + The secure_tcp defense is to use a more complicated TCP state + transition table. For VS/NAT, it also delays entering the + TCP ESTABLISHED state until the three way handshake is completed. + + The value definition is the same as that of drop_entry and + drop_packet. + +sync_threshold - INTEGER + default 3 + + It sets synchronization threshold, which is the minimum number + of incoming packets that a connection needs to receive before + the connection will be synchronized. A connection will be + synchronized, every time the number of its incoming packets + modulus 50 equals the threshold. The range of the threshold is + from 0 to 49. + +snat_reroute - BOOLEAN + 0 - disabled + not 0 - enabled (default) + + If enabled, recalculate the route of SNATed packets from + realservers so that they are routed as if they originate from the + director. Otherwise they are routed as if they are forwarded by the + director. + + If policy routing is in effect then it is possible that the route + of a packet originating from a director is routed differently to a + packet being forwarded by the director. + + If policy routing is not in effect then the recalculated route will + always be the same as the original route so it is an optimisation + to disable snat_reroute and avoid the recalculation. + +sync_persist_mode - INTEGER + default 0 + + Controls the synchronisation of connections when using persistence + + 0: All types of connections are synchronised + 1: Attempt to reduce the synchronisation traffic depending on + the connection type. For persistent services avoid synchronisation + for normal connections, do it only for persistence templates. + In such case, for TCP and SCTP it may need enabling sloppy_tcp and + sloppy_sctp flags on backup servers. For non-persistent services + such optimization is not applied, mode 0 is assumed. + +sync_version - INTEGER + default 1 + + The version of the synchronisation protocol used when sending + synchronisation messages. + + 0 selects the original synchronisation protocol (version 0). This + should be used when sending synchronisation messages to a legacy + system that only understands the original synchronisation protocol. + + 1 selects the current synchronisation protocol (version 1). This + should be used where possible. + + Kernels with this sync_version entry are able to receive messages + of both version 1 and version 2 of the synchronisation protocol. diff --git a/Documentation/networking/irda.txt b/Documentation/networking/irda.txt index 9e5b8e66d6a..bff26c138be 100644 --- a/Documentation/networking/irda.txt +++ b/Documentation/networking/irda.txt @@ -3,12 +3,8 @@ of the IrDA Utilities. More detailed information about these and associated programs can be found on http://irda.sourceforge.net/ For more information about how to use the IrDA protocol stack, see the -Linux Infared HOWTO (http://www.tuxmobil.org/Infrared-HOWTO/Infrared-HOWTO.html) -by Werner Heuser <wehe@tuxmobil.org> +Linux Infrared HOWTO by Werner Heuser <wehe@tuxmobil.org>: +<http://www.tuxmobil.org/Infrared-HOWTO/Infrared-HOWTO.html> There is an active mailing list for discussing Linux-IrDA matters called irda-users@lists.sourceforge.net - - - - diff --git a/Documentation/networking/ixgb.txt b/Documentation/networking/ixgb.txt index 7c98277777e..1e0c045e89f 100644 --- a/Documentation/networking/ixgb.txt +++ b/Documentation/networking/ixgb.txt @@ -1,7 +1,7 @@ -Linux* Base Driver for the Intel(R) PRO/10GbE Family of Adapters -================================================================ +Linux Base Driver for 10 Gigabit Intel(R) Ethernet Network Connection +===================================================================== -November 17, 2004 +March 14, 2011 Contents @@ -9,94 +9,151 @@ Contents - In This Release - Identifying Your Adapter +- Building and Installation - Command Line Parameters - Improving Performance +- Additional Configurations +- Known Issues/Troubleshooting - Support + In This Release =============== -This file describes the Linux* Base Driver for the Intel(R) PRO/10GbE Family -of Adapters, version 1.0.x. +This file describes the ixgb Linux Base Driver for the 10 Gigabit Intel(R) +Network Connection. This driver includes support for Itanium(R)2-based +systems. + +For questions related to hardware requirements, refer to the documentation +supplied with your 10 Gigabit adapter. All hardware requirements listed apply +to use with Linux. + +The following features are available in this kernel: + - Native VLANs + - Channel Bonding (teaming) + - SNMP + +Channel Bonding documentation can be found in the Linux kernel source: +/Documentation/networking/bonding.txt + +The driver information previously displayed in the /proc filesystem is not +supported in this release. Alternatively, you can use ethtool (version 1.6 +or later), lspci, and ifconfig to obtain the same information. + +Instructions on updating ethtool can be found in the section "Additional +Configurations" later in this document. -For questions related to hardware requirements, refer to the documentation -supplied with your Intel PRO/10GbE adapter. All hardware requirements listed -apply to use with Linux. Identifying Your Adapter ======================== -To verify your Intel adapter is supported, find the board ID number on the -adapter. Look for a label that has a barcode and a number in the format -A12345-001. +The following Intel network adapters are compatible with the drivers in this +release: + +Controller Adapter Name Physical Layer +---------- ------------ -------------- +82597EX Intel(R) PRO/10GbE LR/SR/CX4 10G Base-LR (1310 nm optical fiber) + Server Adapters 10G Base-SR (850 nm optical fiber) + 10G Base-CX4(twin-axial copper cabling) + +For more information on how to identify your adapter, go to the Adapter & +Driver ID Guide at: + + http://support.intel.com/support/network/sb/CS-012904.htm + + +Building and Installation +========================= + +select m for "Intel(R) PRO/10GbE support" located at: + Location: + -> Device Drivers + -> Network device support (NETDEVICES [=y]) + -> Ethernet (10000 Mbit) (NETDEV_10000 [=y]) +1. make modules && make modules_install + +2. Load the module: + +    modprobe ixgb <parameter>=<value> + + The insmod command can be used if the full + path to the driver module is specified. For example: + + insmod /lib/modules/<KERNEL VERSION>/kernel/drivers/net/ixgb/ixgb.ko + + With 2.6 based kernels also make sure that older ixgb drivers are + removed from the kernel, before loading the new module: -Use the above information and the Adapter & Driver ID Guide at: + rmmod ixgb; modprobe ixgb - http://support.intel.com/support/network/adapter/pro100/21397.htm +3. Assign an IP address to the interface by entering the following, where + x is the interface number: -For the latest Intel network drivers for Linux, go to: + ifconfig ethx <IP_address> + +4. Verify that the interface works. Enter the following, where <IP_address> + is the IP address for another machine on the same subnet as the interface + that is being tested: + + ping <IP_address> - http://downloadfinder.intel.com/scripts-df/support_intel.asp Command Line Parameters ======================= -If the driver is built as a module, the following optional parameters are -used by entering them on the command line with the modprobe or insmod command -using this syntax: +If the driver is built as a module, the following optional parameters are +used by entering them on the command line with the modprobe command using +this syntax: modprobe ixgb [<option>=<VAL1>,<VAL2>,...] - insmod ixgb [<option>=<VAL1>,<VAL2>,...] +For example, with two 10GbE PCI adapters, entering: -For example, with two PRO/10GbE PCI adapters, entering: + modprobe ixgb TxDescriptors=80,128 - insmod ixgb TxDescriptors=80,128 - -loads the ixgb driver with 80 TX resources for the first adapter and 128 TX +loads the ixgb driver with 80 TX resources for the first adapter and 128 TX resources for the second adapter. The default value for each parameter is generally the recommended setting, -unless otherwise noted. Also, if the driver is statically built into the -kernel, the driver is loaded with the default values for all the parameters. -Ethtool can be used to change some of the parameters at runtime. +unless otherwise noted. FlowControl Valid Range: 0-3 (0=none, 1=Rx only, 2=Tx only, 3=Rx&Tx) Default: Read from the EEPROM - If EEPROM is not detected, default is 3 - This parameter controls the automatic generation(Tx) and response(Rx) to - Ethernet PAUSE frames. + If EEPROM is not detected, default is 1 + This parameter controls the automatic generation(Tx) and response(Rx) to + Ethernet PAUSE frames. There are hardware bugs associated with enabling + Tx flow control so beware. RxDescriptors Valid Range: 64-512 Default Value: 512 - This value is the number of receive descriptors allocated by the driver. - Increasing this value allows the driver to buffer more incoming packets. - Each descriptor is 16 bytes. A receive buffer is also allocated for - each descriptor and can be either 2048, 4056, 8192, or 16384 bytes, - depending on the MTU setting. When the MTU size is 1500 or less, the + This value is the number of receive descriptors allocated by the driver. + Increasing this value allows the driver to buffer more incoming packets. + Each descriptor is 16 bytes. A receive buffer is also allocated for + each descriptor and can be either 2048, 4056, 8192, or 16384 bytes, + depending on the MTU setting. When the MTU size is 1500 or less, the receive buffer size is 2048 bytes. When the MTU is greater than 1500 the - receive buffer size will be either 4056, 8192, or 16384 bytes. The + receive buffer size will be either 4056, 8192, or 16384 bytes. The maximum MTU size is 16114. RxIntDelay Valid Range: 0-65535 (0=off) -Default Value: 6 - This value delays the generation of receive interrupts in units of - 0.8192 microseconds. Receive interrupt reduction can improve CPU - efficiency if properly tuned for specific network traffic. Increasing - this value adds extra latency to frame reception and can end up - decreasing the throughput of TCP traffic. If the system is reporting - dropped receives, this value may be set too high, causing the driver to +Default Value: 72 + This value delays the generation of receive interrupts in units of + 0.8192 microseconds. Receive interrupt reduction can improve CPU + efficiency if properly tuned for specific network traffic. Increasing + this value adds extra latency to frame reception and can end up + decreasing the throughput of TCP traffic. If the system is reporting + dropped receives, this value may be set too high, causing the driver to run out of available receive descriptors. TxDescriptors Valid Range: 64-4096 Default Value: 256 This value is the number of transmit descriptors allocated by the driver. - Increasing this value allows the driver to queue more transmits. Each + Increasing this value allows the driver to queue more transmits. Each descriptor is 16 bytes. XsumRX @@ -105,51 +162,49 @@ Default Value: 1 A value of '1' indicates that the driver should enable IP checksum offload for received packets (both UDP and TCP) to the adapter hardware. -XsumTX -Valid Range: 0-1 -Default Value: 1 - A value of '1' indicates that the driver should enable IP checksum - offload for transmitted packets (both UDP and TCP) to the adapter - hardware. Improving Performance ===================== -With the Intel PRO/10 GbE adapter, the default Linux configuration will very -likely limit the total available throughput artificially. There is a set of -things that when applied together increase the ability of Linux to transmit -and receive data. The following enhancements were originally acquired from -settings published at http://www.spec.org/web99 for various submitted results -using Linux. +With the 10 Gigabit server adapters, the default Linux configuration will +very likely limit the total available throughput artificially. There is a set +of configuration changes that, when applied together, will increase the ability +of Linux to transmit and receive data. The following enhancements were +originally acquired from settings published at http://www.spec.org/web99/ for +various submitted results using Linux. -NOTE: These changes are only suggestions, and serve as a starting point for -tuning your network performance. +NOTE: These changes are only suggestions, and serve as a starting point for + tuning your network performance. The changes are made in three major ways, listed in order of greatest effect: -- Use ifconfig to modify the mtu (maximum transmission unit) and the txqueuelen +- Use ifconfig to modify the mtu (maximum transmission unit) and the txqueuelen parameter. - Use sysctl to modify /proc parameters (essentially kernel tuning) -- Use setpci to modify the MMRBC field in PCI-X configuration space to increase +- Use setpci to modify the MMRBC field in PCI-X configuration space to increase transmit burst lengths on the bus. -NOTE: setpci modifies the adapter's configuration registers to allow it to read -up to 4k bytes at a time (for transmits). However, for some systems the -behavior after modifying this register may be undefined (possibly errors of some -kind). A power-cycle, hard reset or explicitly setting the e6 register back to -22 (setpci -d 8086:1048 e6.b=22) may be required to get back to a stable -configuration. +NOTE: setpci modifies the adapter's configuration registers to allow it to read +up to 4k bytes at a time (for transmits). However, for some systems the +behavior after modifying this register may be undefined (possibly errors of +some kind). A power-cycle, hard reset or explicitly setting the e6 register +back to 22 (setpci -d 8086:1a48 e6.b=22) may be required to get back to a +stable configuration. - COPY these lines and paste them into ixgb_perf.sh: #!/bin/bash -echo "configuring network performance , edit this file to change the interface" +echo "configuring network performance , edit this file to change the interface +or device ID of 10GbE card" # set mmrbc to 4k reads, modify only Intel 10GbE device IDs -setpci -d 8086:1048 e6.b=2e -# set the MTU (max transmission unit) - it requires your switch and clients to change too! +# replace 1a48 with appropriate 10GbE device's ID installed on the system, +# if needed. +setpci -d 8086:1a48 e6.b=2e +# set the MTU (max transmission unit) - it requires your switch and clients +# to change as well. # set the txqueuelen # your ixgb adapter should be loaded as eth1 for this to work, change if needed ifconfig eth1 mtu 9000 txqueuelen 1000 up -# call the sysctl utility to modify /proc/sys entries -sysctl -p ./sysctl_ixgb.conf +# call the sysctl utility to modify /proc/sys entries +sysctl -p ./sysctl_ixgb.conf - END ixgb_perf.sh - COPY these lines and paste them into sysctl_ixgb.conf: @@ -159,54 +214,220 @@ sysctl -p ./sysctl_ixgb.conf # several network benchmark tests, your mileage may vary ### IPV4 specific settings -net.ipv4.tcp_timestamps = 0 # turns TCP timestamp support off, default 1, reduces CPU use -net.ipv4.tcp_sack = 0 # turn SACK support off, default on -# on systems with a VERY fast bus -> memory interface this is the big gainer -net.ipv4.tcp_rmem = 10000000 10000000 10000000 # sets min/default/max TCP read buffer, default 4096 87380 174760 -net.ipv4.tcp_wmem = 10000000 10000000 10000000 # sets min/pressure/max TCP write buffer, default 4096 16384 131072 -net.ipv4.tcp_mem = 10000000 10000000 10000000 # sets min/pressure/max TCP buffer space, default 31744 32256 32768 +# turn TCP timestamp support off, default 1, reduces CPU use +net.ipv4.tcp_timestamps = 0 +# turn SACK support off, default on +# on systems with a VERY fast bus -> memory interface this is the big gainer +net.ipv4.tcp_sack = 0 +# set min/default/max TCP read buffer, default 4096 87380 174760 +net.ipv4.tcp_rmem = 10000000 10000000 10000000 +# set min/pressure/max TCP write buffer, default 4096 16384 131072 +net.ipv4.tcp_wmem = 10000000 10000000 10000000 +# set min/pressure/max TCP buffer space, default 31744 32256 32768 +net.ipv4.tcp_mem = 10000000 10000000 10000000 ### CORE settings (mostly for socket and UDP effect) -net.core.rmem_max = 524287 # maximum receive socket buffer size, default 131071 -net.core.wmem_max = 524287 # maximum send socket buffer size, default 131071 -net.core.rmem_default = 524287 # default receive socket buffer size, default 65535 -net.core.wmem_default = 524287 # default send socket buffer size, default 65535 -net.core.optmem_max = 524287 # maximum amount of option memory buffers, default 10240 -net.core.netdev_max_backlog = 300000 # number of unprocessed input packets before kernel starts dropping them, default 300 +# set maximum receive socket buffer size, default 131071 +net.core.rmem_max = 524287 +# set maximum send socket buffer size, default 131071 +net.core.wmem_max = 524287 +# set default receive socket buffer size, default 65535 +net.core.rmem_default = 524287 +# set default send socket buffer size, default 65535 +net.core.wmem_default = 524287 +# set maximum amount of option memory buffers, default 10240 +net.core.optmem_max = 524287 +# set number of unprocessed input packets before kernel starts dropping them; default 300 +net.core.netdev_max_backlog = 300000 - END sysctl_ixgb.conf -Edit the ixgb_perf.sh script if necessary to change eth1 to whatever interface -your ixgb driver is using. +Edit the ixgb_perf.sh script if necessary to change eth1 to whatever interface +your ixgb driver is using and/or replace '1a48' with appropriate 10GbE device's +ID installed on the system. -NOTE: Unless these scripts are added to the boot process, these changes will -only last only until the next system reboot. +NOTE: Unless these scripts are added to the boot process, these changes will + only last only until the next system reboot. Resolving Slow UDP Traffic -------------------------- +If your server does not seem to be able to receive UDP traffic as fast as it +can receive TCP traffic, it could be because Linux, by default, does not set +the network stack buffers as large as they need to be to support high UDP +transfer rates. One way to alleviate this problem is to allow more memory to +be used by the IP stack to store incoming data. -If your server does not seem to be able to receive UDP traffic as fast as it -can receive TCP traffic, it could be because Linux, by default, does not set -the network stack buffers as large as they need to be to support high UDP -transfer rates. One way to alleviate this problem is to allow more memory to -be used by the IP stack to store incoming data. - -For instance, use the commands: +For instance, use the commands: sysctl -w net.core.rmem_max=262143 and sysctl -w net.core.rmem_default=262143 -to increase the read buffer memory max and default to 262143 (256k - 1) from -defaults of max=131071 (128k - 1) and default=65535 (64k - 1). These variables -will increase the amount of memory used by the network stack for receives, and +to increase the read buffer memory max and default to 262143 (256k - 1) from +defaults of max=131071 (128k - 1) and default=65535 (64k - 1). These variables +will increase the amount of memory used by the network stack for receives, and can be increased significantly more if necessary for your application. + +Additional Configurations +========================= + + Configuring the Driver on Different Distributions + ------------------------------------------------- + Configuring a network driver to load properly when the system is started is + distribution dependent. Typically, the configuration process involves adding + an alias line to /etc/modprobe.conf as well as editing other system startup + scripts and/or configuration files. Many popular Linux distributions ship + with tools to make these changes for you. To learn the proper way to + configure a network device for your system, refer to your distribution + documentation. If during this process you are asked for the driver or module + name, the name for the Linux Base Driver for the Intel 10GbE Family of + Adapters is ixgb. + + Viewing Link Messages + --------------------- + Link messages will not be displayed to the console if the distribution is + restricting system messages. In order to see network driver link messages on + your console, set dmesg to eight by entering the following: + + dmesg -n 8 + + NOTE: This setting is not saved across reboots. + + + Jumbo Frames + ------------ + The driver supports Jumbo Frames for all adapters. Jumbo Frames support is + enabled by changing the MTU to a value larger than the default of 1500. + The maximum value for the MTU is 16114. Use the ifconfig command to + increase the MTU size. For example: + + ifconfig ethx mtu 9000 up + + The maximum MTU setting for Jumbo Frames is 16114. This value coincides + with the maximum Jumbo Frames size of 16128. + + + ethtool + ------- + The driver utilizes the ethtool interface for driver configuration and + diagnostics, as well as displaying statistical information. The ethtool + version 1.6 or later is required for this functionality. + + The latest release of ethtool can be found from + http://ftp.kernel.org/pub/software/network/ethtool/ + + NOTE: The ethtool version 1.6 only supports a limited set of ethtool options. + Support for a more complete ethtool feature set can be enabled by + upgrading to the latest version. + + + NAPI + ---- + + NAPI (Rx polling mode) is supported in the ixgb driver. NAPI is enabled + or disabled based on the configuration of the kernel. see CONFIG_IXGB_NAPI + + See www.cyberus.ca/~hadi/usenix-paper.tgz for more information on NAPI. + + +Known Issues/Troubleshooting +============================ + + NOTE: After installing the driver, if your Intel Network Connection is not + working, verify in the "In This Release" section of the readme that you have + installed the correct driver. + + Intel(R) PRO/10GbE CX4 Server Adapter Cable Interoperability Issue with + Fujitsu XENPAK Module in SmartBits Chassis + --------------------------------------------------------------------- + Excessive CRC errors may be observed if the Intel(R) PRO/10GbE CX4 + Server adapter is connected to a Fujitsu XENPAK CX4 module in a SmartBits + chassis using 15 m/24AWG cable assemblies manufactured by Fujitsu or Leoni. + The CRC errors may be received either by the Intel(R) PRO/10GbE CX4 + Server adapter or the SmartBits. If this situation occurs using a different + cable assembly may resolve the issue. + + CX4 Server Adapter Cable Interoperability Issues with HP Procurve 3400cl + Switch Port + ------------------------------------------------------------------------ + Excessive CRC errors may be observed if the Intel(R) PRO/10GbE CX4 Server + adapter is connected to an HP Procurve 3400cl switch port using short cables + (1 m or shorter). If this situation occurs, using a longer cable may resolve + the issue. + + Excessive CRC errors may be observed using Fujitsu 24AWG cable assemblies that + Are 10 m or longer or where using a Leoni 15 m/24AWG cable assembly. The CRC + errors may be received either by the CX4 Server adapter or at the switch. If + this situation occurs, using a different cable assembly may resolve the issue. + + + Jumbo Frames System Requirement + ------------------------------- + Memory allocation failures have been observed on Linux systems with 64 MB + of RAM or less that are running Jumbo Frames. If you are using Jumbo + Frames, your system may require more than the advertised minimum + requirement of 64 MB of system memory. + + + Performance Degradation with Jumbo Frames + ----------------------------------------- + Degradation in throughput performance may be observed in some Jumbo frames + environments. If this is observed, increasing the application's socket buffer + size and/or increasing the /proc/sys/net/ipv4/tcp_*mem entry values may help. + See the specific application manual and /usr/src/linux*/Documentation/ + networking/ip-sysctl.txt for more details. + + + Allocating Rx Buffers when Using Jumbo Frames + --------------------------------------------- + Allocating Rx buffers when using Jumbo Frames on 2.6.x kernels may fail if + the available memory is heavily fragmented. This issue may be seen with PCI-X + adapters or with packet split disabled. This can be reduced or eliminated + by changing the amount of available memory for receive buffer allocation, by + increasing /proc/sys/vm/min_free_kbytes. + + + Multiple Interfaces on Same Ethernet Broadcast Network + ------------------------------------------------------ + Due to the default ARP behavior on Linux, it is not possible to have + one system on two IP networks in the same Ethernet broadcast domain + (non-partitioned switch) behave as expected. All Ethernet interfaces + will respond to IP traffic for any IP address assigned to the system. + This results in unbalanced receive traffic. + + If you have multiple interfaces in a server, do either of the following: + + - Turn on ARP filtering by entering: + echo 1 > /proc/sys/net/ipv4/conf/all/arp_filter + + - Install the interfaces in separate broadcast domains - either in + different switches or in a switch partitioned to VLANs. + + + UDP Stress Test Dropped Packet Issue + -------------------------------------- + Under small packets UDP stress test with 10GbE driver, the Linux system + may drop UDP packets due to the fullness of socket buffers. You may want + to change the driver's Flow Control variables to the minimum value for + controlling packet reception. + + + Tx Hangs Possible Under Stress + ------------------------------ + Under stress conditions, if TX hangs occur, turning off TSO + "ethtool -K eth0 tso off" may resolve the problem. + + Support ======= -For general information and support, go to the Intel support website at: +For general information, go to the Intel support website at: http://support.intel.com +or the Intel Wired Networking project hosted by Sourceforge at: + + http://sourceforge.net/projects/e1000 + If an issue is identified with the released source code on the supported -kernel with a supported adapter, email the specific information related to -the issue to linux.nics@intel.com. +kernel with a supported adapter, email the specific information related +to the issue to e1000-devel@lists.sf.net diff --git a/Documentation/networking/ixgbe.txt b/Documentation/networking/ixgbe.txt new file mode 100644 index 00000000000..96cccebb839 --- /dev/null +++ b/Documentation/networking/ixgbe.txt @@ -0,0 +1,349 @@ +Linux* Base Driver for the Intel(R) Ethernet 10 Gigabit PCI Express Family of +Adapters +============================================================================= + +Intel 10 Gigabit Linux driver. +Copyright(c) 1999 - 2013 Intel Corporation. + +Contents +======== + +- Identifying Your Adapter +- Additional Configurations +- Performance Tuning +- Known Issues +- Support + +Identifying Your Adapter +======================== + +The driver in this release is compatible with 82598, 82599 and X540-based +Intel Network Connections. + +For more information on how to identify your adapter, go to the Adapter & +Driver ID Guide at: + + http://support.intel.com/support/network/sb/CS-012904.htm + +SFP+ Devices with Pluggable Optics +---------------------------------- + +82599-BASED ADAPTERS + +NOTES: If your 82599-based Intel(R) Network Adapter came with Intel optics, or +is an Intel(R) Ethernet Server Adapter X520-2, then it only supports Intel +optics and/or the direct attach cables listed below. + +When 82599-based SFP+ devices are connected back to back, they should be set to +the same Speed setting via ethtool. Results may vary if you mix speed settings. +82598-based adapters support all passive direct attach cables that comply +with SFF-8431 v4.1 and SFF-8472 v10.4 specifications. Active direct attach +cables are not supported. + +Supplier Type Part Numbers + +SR Modules +Intel DUAL RATE 1G/10G SFP+ SR (bailed) FTLX8571D3BCV-IT +Intel DUAL RATE 1G/10G SFP+ SR (bailed) AFBR-703SDDZ-IN1 +Intel DUAL RATE 1G/10G SFP+ SR (bailed) AFBR-703SDZ-IN2 +LR Modules +Intel DUAL RATE 1G/10G SFP+ LR (bailed) FTLX1471D3BCV-IT +Intel DUAL RATE 1G/10G SFP+ LR (bailed) AFCT-701SDDZ-IN1 +Intel DUAL RATE 1G/10G SFP+ LR (bailed) AFCT-701SDZ-IN2 + +The following is a list of 3rd party SFP+ modules and direct attach cables that +have received some testing. Not all modules are applicable to all devices. + +Supplier Type Part Numbers + +Finisar SFP+ SR bailed, 10g single rate FTLX8571D3BCL +Avago SFP+ SR bailed, 10g single rate AFBR-700SDZ +Finisar SFP+ LR bailed, 10g single rate FTLX1471D3BCL + +Finisar DUAL RATE 1G/10G SFP+ SR (No Bail) FTLX8571D3QCV-IT +Avago DUAL RATE 1G/10G SFP+ SR (No Bail) AFBR-703SDZ-IN1 +Finisar DUAL RATE 1G/10G SFP+ LR (No Bail) FTLX1471D3QCV-IT +Avago DUAL RATE 1G/10G SFP+ LR (No Bail) AFCT-701SDZ-IN1 +Finistar 1000BASE-T SFP FCLF8522P2BTL +Avago 1000BASE-T SFP ABCU-5710RZ + +82599-based adapters support all passive and active limiting direct attach +cables that comply with SFF-8431 v4.1 and SFF-8472 v10.4 specifications. + +Laser turns off for SFP+ when ifconfig down +------------------------------------------- +"ifconfig down" turns off the laser for 82599-based SFP+ fiber adapters. +"ifconfig up" turns on the laser. + + +82598-BASED ADAPTERS + +NOTES for 82598-Based Adapters: +- Intel(R) Network Adapters that support removable optical modules only support + their original module type (i.e., the Intel(R) 10 Gigabit SR Dual Port + Express Module only supports SR optical modules). If you plug in a different + type of module, the driver will not load. +- Hot Swapping/hot plugging optical modules is not supported. +- Only single speed, 10 gigabit modules are supported. +- LAN on Motherboard (LOMs) may support DA, SR, or LR modules. Other module + types are not supported. Please see your system documentation for details. + +The following is a list of 3rd party SFP+ modules and direct attach cables that +have received some testing. Not all modules are applicable to all devices. + +Supplier Type Part Numbers + +Finisar SFP+ SR bailed, 10g single rate FTLX8571D3BCL +Avago SFP+ SR bailed, 10g single rate AFBR-700SDZ +Finisar SFP+ LR bailed, 10g single rate FTLX1471D3BCL + +82598-based adapters support all passive direct attach cables that comply +with SFF-8431 v4.1 and SFF-8472 v10.4 specifications. Active direct attach +cables are not supported. + + +Flow Control +------------ +Ethernet Flow Control (IEEE 802.3x) can be configured with ethtool to enable +receiving and transmitting pause frames for ixgbe. When TX is enabled, PAUSE +frames are generated when the receive packet buffer crosses a predefined +threshold. When rx is enabled, the transmit unit will halt for the time delay +specified when a PAUSE frame is received. + +Flow Control is enabled by default. If you want to disable a flow control +capable link partner, use ethtool: + + ethtool -A eth? autoneg off RX off TX off + +NOTE: For 82598 backplane cards entering 1 gig mode, flow control default +behavior is changed to off. Flow control in 1 gig mode on these devices can +lead to Tx hangs. + +Intel(R) Ethernet Flow Director +------------------------------- +Supports advanced filters that direct receive packets by their flows to +different queues. Enables tight control on routing a flow in the platform. +Matches flows and CPU cores for flow affinity. Supports multiple parameters +for flexible flow classification and load balancing. + +Flow director is enabled only if the kernel is multiple TX queue capable. + +An included script (set_irq_affinity.sh) automates setting the IRQ to CPU +affinity. + +You can verify that the driver is using Flow Director by looking at the counter +in ethtool: fdir_miss and fdir_match. + +Other ethtool Commands: +To enable Flow Director + ethtool -K ethX ntuple on +To add a filter + Use -U switch. e.g., ethtool -U ethX flow-type tcp4 src-ip 0x178000a + action 1 +To see the list of filters currently present: + ethtool -u ethX + +Perfect Filter: Perfect filter is an interface to load the filter table that +funnels all flow into queue_0 unless an alternative queue is specified using +"action". In that case, any flow that matches the filter criteria will be +directed to the appropriate queue. + +If the queue is defined as -1, filter will drop matching packets. + +To account for filter matches and misses, there are two stats in ethtool: +fdir_match and fdir_miss. In addition, rx_queue_N_packets shows the number of +packets processed by the Nth queue. + +NOTE: Receive Packet Steering (RPS) and Receive Flow Steering (RFS) are not +compatible with Flow Director. IF Flow Director is enabled, these will be +disabled. + +The following three parameters impact Flow Director. + +FdirMode +-------- +Valid Range: 0-2 (0=off, 1=ATR, 2=Perfect filter mode) +Default Value: 1 + + Flow Director filtering modes. + +FdirPballoc +----------- +Valid Range: 0-2 (0=64k, 1=128k, 2=256k) +Default Value: 0 + + Flow Director allocated packet buffer size. + +AtrSampleRate +-------------- +Valid Range: 1-100 +Default Value: 20 + + Software ATR Tx packet sample rate. For example, when set to 20, every 20th + packet, looks to see if the packet will create a new flow. + +Node +---- +Valid Range: 0-n +Default Value: 1 (off) + + 0 - n: where n is the number of NUMA nodes (i.e. 0 - 3) currently online in + your system + 1: turns this option off + + The Node parameter will allow you to pick which NUMA node you want to have + the adapter allocate memory on. + +max_vfs +------- +Valid Range: 1-63 +Default Value: 0 + + If the value is greater than 0 it will also force the VMDq parameter to be 1 + or more. + + This parameter adds support for SR-IOV. It causes the driver to spawn up to + max_vfs worth of virtual function. + + +Additional Configurations +========================= + + Jumbo Frames + ------------ + The driver supports Jumbo Frames for all adapters. Jumbo Frames support is + enabled by changing the MTU to a value larger than the default of 1500. + The maximum value for the MTU is 16110. Use the ifconfig command to + increase the MTU size. For example: + + ifconfig ethx mtu 9000 up + + The maximum MTU setting for Jumbo Frames is 16110. This value coincides + with the maximum Jumbo Frames size of 16128. + + Generic Receive Offload, aka GRO + -------------------------------- + The driver supports the in-kernel software implementation of GRO. GRO has + shown that by coalescing Rx traffic into larger chunks of data, CPU + utilization can be significantly reduced when under large Rx load. GRO is an + evolution of the previously-used LRO interface. GRO is able to coalesce + other protocols besides TCP. It's also safe to use with configurations that + are problematic for LRO, namely bridging and iSCSI. + + Data Center Bridging, aka DCB + ----------------------------- + DCB is a configuration Quality of Service implementation in hardware. + It uses the VLAN priority tag (802.1p) to filter traffic. That means + that there are 8 different priorities that traffic can be filtered into. + It also enables priority flow control which can limit or eliminate the + number of dropped packets during network stress. Bandwidth can be + allocated to each of these priorities, which is enforced at the hardware + level. + + To enable DCB support in ixgbe, you must enable the DCB netlink layer to + allow the userspace tools (see below) to communicate with the driver. + This can be found in the kernel configuration here: + + -> Networking support + -> Networking options + -> Data Center Bridging support + + Once this is selected, DCB support must be selected for ixgbe. This can + be found here: + + -> Device Drivers + -> Network device support (NETDEVICES [=y]) + -> Ethernet (10000 Mbit) (NETDEV_10000 [=y]) + -> Intel(R) 10GbE PCI Express adapters support + -> Data Center Bridging (DCB) Support + + After these options are selected, you must rebuild your kernel and your + modules. + + In order to use DCB, userspace tools must be downloaded and installed. + The dcbd tools can be found at: + + http://e1000.sf.net + + Ethtool + ------- + The driver utilizes the ethtool interface for driver configuration and + diagnostics, as well as displaying statistical information. The latest + ethtool version is required for this functionality. + + The latest release of ethtool can be found from + http://ftp.kernel.org/pub/software/network/ethtool/ + + FCoE + ---- + This release of the ixgbe driver contains new code to enable users to use + Fiber Channel over Ethernet (FCoE) and Data Center Bridging (DCB) + functionality that is supported by the 82598-based hardware. This code has + no default effect on the regular driver operation, and configuring DCB and + FCoE is outside the scope of this driver README. Refer to + http://www.open-fcoe.org/ for FCoE project information and contact + e1000-eedc@lists.sourceforge.net for DCB information. + + MAC and VLAN anti-spoofing feature + ---------------------------------- + When a malicious driver attempts to send a spoofed packet, it is dropped by + the hardware and not transmitted. An interrupt is sent to the PF driver + notifying it of the spoof attempt. + + When a spoofed packet is detected the PF driver will send the following + message to the system log (displayed by the "dmesg" command): + + Spoof event(s) detected on VF (n) + + Where n=the VF that attempted to do the spoofing. + + +Performance Tuning +================== + +An excellent article on performance tuning can be found at: + +http://www.redhat.com/promo/summit/2008/downloads/pdf/Thursday/Mark_Wagner.pdf + + +Known Issues +============ + + Enabling SR-IOV in a 32-bit or 64-bit Microsoft* Windows* Server 2008/R2 + Guest OS using Intel (R) 82576-based GbE or Intel (R) 82599-based 10GbE + controller under KVM + ------------------------------------------------------------------------ + KVM Hypervisor/VMM supports direct assignment of a PCIe device to a VM. This + includes traditional PCIe devices, as well as SR-IOV-capable devices using + Intel 82576-based and 82599-based controllers. + + While direct assignment of a PCIe device or an SR-IOV Virtual Function (VF) + to a Linux-based VM running 2.6.32 or later kernel works fine, there is a + known issue with Microsoft Windows Server 2008 VM that results in a "yellow + bang" error. This problem is within the KVM VMM itself, not the Intel driver, + or the SR-IOV logic of the VMM, but rather that KVM emulates an older CPU + model for the guests, and this older CPU model does not support MSI-X + interrupts, which is a requirement for Intel SR-IOV. + + If you wish to use the Intel 82576 or 82599-based controllers in SR-IOV mode + with KVM and a Microsoft Windows Server 2008 guest try the following + workaround. The workaround is to tell KVM to emulate a different model of CPU + when using qemu to create the KVM guest: + + "-cpu qemu64,model=13" + + +Support +======= + +For general information, go to the Intel support website at: + + http://support.intel.com + +or the Intel Wired Networking project hosted by Sourceforge at: + + http://e1000.sourceforge.net + +If an issue is identified with the released source code on the supported +kernel with a supported adapter, email the specific information related +to the issue to e1000-devel@lists.sf.net diff --git a/Documentation/networking/ixgbevf.txt b/Documentation/networking/ixgbevf.txt new file mode 100644 index 00000000000..53d8d2a5a6a --- /dev/null +++ b/Documentation/networking/ixgbevf.txt @@ -0,0 +1,52 @@ +Linux* Base Driver for Intel(R) Ethernet Network Connection +=========================================================== + +Intel Gigabit Linux driver. +Copyright(c) 1999 - 2013 Intel Corporation. + +Contents +======== + +- Identifying Your Adapter +- Known Issues/Troubleshooting +- Support + +This file describes the ixgbevf Linux* Base Driver for Intel Network +Connection. + +The ixgbevf driver supports 82599-based virtual function devices that can only +be activated on kernels with CONFIG_PCI_IOV enabled. + +The ixgbevf driver supports virtual functions generated by the ixgbe driver +with a max_vfs value of 1 or greater. + +The guest OS loading the ixgbevf driver must support MSI-X interrupts. + +VLANs: There is a limit of a total of 32 shared VLANs to 1 or more VFs. + +Identifying Your Adapter +======================== + +For more information on how to identify your adapter, go to the Adapter & +Driver ID Guide at: + + http://support.intel.com/support/go/network/adapter/idguide.htm + +Known Issues/Troubleshooting +============================ + + +Support +======= + +For general information, go to the Intel support website at: + + http://support.intel.com + +or the Intel Wired Networking project hosted by Sourceforge at: + + http://sourceforge.net/projects/e1000 + +If an issue is identified with the released source code on the supported +kernel with a supported adapter, email the specific information related +to the issue to e1000-devel@lists.sf.net diff --git a/Documentation/networking/l2tp.txt b/Documentation/networking/l2tp.txt new file mode 100644 index 00000000000..c74434de2fa --- /dev/null +++ b/Documentation/networking/l2tp.txt @@ -0,0 +1,348 @@ +This document describes how to use the kernel's L2TP drivers to +provide L2TP functionality. L2TP is a protocol that tunnels one or +more sessions over an IP tunnel. It is commonly used for VPNs +(L2TP/IPSec) and by ISPs to tunnel subscriber PPP sessions over an IP +network infrastructure. With L2TPv3, it is also useful as a Layer-2 +tunneling infrastructure. + +Features +======== + +L2TPv2 (PPP over L2TP (UDP tunnels)). +L2TPv3 ethernet pseudowires. +L2TPv3 PPP pseudowires. +L2TPv3 IP encapsulation. +Netlink sockets for L2TPv3 configuration management. + +History +======= + +The original pppol2tp driver was introduced in 2.6.23 and provided +L2TPv2 functionality (rfc2661). L2TPv2 is used to tunnel one or more PPP +sessions over a UDP tunnel. + +L2TPv3 (rfc3931) changes the protocol to allow different frame types +to be passed over an L2TP tunnel by moving the PPP-specific parts of +the protocol out of the core L2TP packet headers. Each frame type is +known as a pseudowire type. Ethernet, PPP, HDLC, Frame Relay and ATM +pseudowires for L2TP are defined in separate RFC standards. Another +change for L2TPv3 is that it can be carried directly over IP with no +UDP header (UDP is optional). It is also possible to create static +unmanaged L2TPv3 tunnels manually without a control protocol +(userspace daemon) to manage them. + +To support L2TPv3, the original pppol2tp driver was split up to +separate the L2TP and PPP functionality. Existing L2TPv2 userspace +apps should be unaffected as the original pppol2tp sockets API is +retained. L2TPv3, however, uses netlink to manage L2TPv3 tunnels and +sessions. + +Design +====== + +The L2TP protocol separates control and data frames. The L2TP kernel +drivers handle only L2TP data frames; control frames are always +handled by userspace. L2TP control frames carry messages between L2TP +clients/servers and are used to setup / teardown tunnels and +sessions. An L2TP client or server is implemented in userspace. + +Each L2TP tunnel is implemented using a UDP or L2TPIP socket; L2TPIP +provides L2TPv3 IP encapsulation (no UDP) and is implemented using a +new l2tpip socket family. The tunnel socket is typically created by +userspace, though for unmanaged L2TPv3 tunnels, the socket can also be +created by the kernel. Each L2TP session (pseudowire) gets a network +interface instance. In the case of PPP, these interfaces are created +indirectly by pppd using a pppol2tp socket. In the case of ethernet, +the netdevice is created upon a netlink request to create an L2TPv3 +ethernet pseudowire. + +For PPP, the PPPoL2TP driver, net/l2tp/l2tp_ppp.c, provides a +mechanism by which PPP frames carried through an L2TP session are +passed through the kernel's PPP subsystem. The standard PPP daemon, +pppd, handles all PPP interaction with the peer. PPP network +interfaces are created for each local PPP endpoint. The kernel's PPP +subsystem arranges for PPP control frames to be delivered to pppd, +while data frames are forwarded as usual. + +For ethernet, the L2TPETH driver, net/l2tp/l2tp_eth.c, implements a +netdevice driver, managing virtual ethernet devices, one per +pseudowire. These interfaces can be managed using standard Linux tools +such as "ip" and "ifconfig". If only IP frames are passed over the +tunnel, the interface can be given an IP addresses of itself and its +peer. If non-IP frames are to be passed over the tunnel, the interface +can be added to a bridge using brctl. All L2TP datapath protocol +functions are handled by the L2TP core driver. + +Each tunnel and session within a tunnel is assigned a unique tunnel_id +and session_id. These ids are carried in the L2TP header of every +control and data packet. (Actually, in L2TPv3, the tunnel_id isn't +present in data frames - it is inferred from the IP connection on +which the packet was received.) The L2TP driver uses the ids to lookup +internal tunnel and/or session contexts to determine how to handle the +packet. Zero tunnel / session ids are treated specially - zero ids are +never assigned to tunnels or sessions in the network. In the driver, +the tunnel context keeps a reference to the tunnel UDP or L2TPIP +socket. The session context holds data that lets the driver interface +to the kernel's network frame type subsystems, i.e. PPP, ethernet. + +Userspace Programming +===================== + +For L2TPv2, there are a number of requirements on the userspace L2TP +daemon in order to use the pppol2tp driver. + +1. Use a UDP socket per tunnel. + +2. Create a single PPPoL2TP socket per tunnel bound to a special null + session id. This is used only for communicating with the driver but + must remain open while the tunnel is active. Opening this tunnel + management socket causes the driver to mark the tunnel socket as an + L2TP UDP encapsulation socket and flags it for use by the + referenced tunnel id. This hooks up the UDP receive path via + udp_encap_rcv() in net/ipv4/udp.c. PPP data frames are never passed + in this special PPPoX socket. + +3. Create a PPPoL2TP socket per L2TP session. This is typically done + by starting pppd with the pppol2tp plugin and appropriate + arguments. A PPPoL2TP tunnel management socket (Step 2) must be + created before the first PPPoL2TP session socket is created. + +When creating PPPoL2TP sockets, the application provides information +to the driver about the socket in a socket connect() call. Source and +destination tunnel and session ids are provided, as well as the file +descriptor of a UDP socket. See struct pppol2tp_addr in +include/linux/if_pppol2tp.h. Note that zero tunnel / session ids are +treated specially. When creating the per-tunnel PPPoL2TP management +socket in Step 2 above, zero source and destination session ids are +specified, which tells the driver to prepare the supplied UDP file +descriptor for use as an L2TP tunnel socket. + +Userspace may control behavior of the tunnel or session using +setsockopt and ioctl on the PPPoX socket. The following socket +options are supported:- + +DEBUG - bitmask of debug message categories. See below. +SENDSEQ - 0 => don't send packets with sequence numbers + 1 => send packets with sequence numbers +RECVSEQ - 0 => receive packet sequence numbers are optional + 1 => drop receive packets without sequence numbers +LNSMODE - 0 => act as LAC. + 1 => act as LNS. +REORDERTO - reorder timeout (in millisecs). If 0, don't try to reorder. + +Only the DEBUG option is supported by the special tunnel management +PPPoX socket. + +In addition to the standard PPP ioctls, a PPPIOCGL2TPSTATS is provided +to retrieve tunnel and session statistics from the kernel using the +PPPoX socket of the appropriate tunnel or session. + +For L2TPv3, userspace must use the netlink API defined in +include/linux/l2tp.h to manage tunnel and session contexts. The +general procedure to create a new L2TP tunnel with one session is:- + +1. Open a GENL socket using L2TP_GENL_NAME for configuring the kernel + using netlink. + +2. Create a UDP or L2TPIP socket for the tunnel. + +3. Create a new L2TP tunnel using a L2TP_CMD_TUNNEL_CREATE + request. Set attributes according to desired tunnel parameters, + referencing the UDP or L2TPIP socket created in the previous step. + +4. Create a new L2TP session in the tunnel using a + L2TP_CMD_SESSION_CREATE request. + +The tunnel and all of its sessions are closed when the tunnel socket +is closed. The netlink API may also be used to delete sessions and +tunnels. Configuration and status info may be set or read using netlink. + +The L2TP driver also supports static (unmanaged) L2TPv3 tunnels. These +are where there is no L2TP control message exchange with the peer to +setup the tunnel; the tunnel is configured manually at each end of the +tunnel. There is no need for an L2TP userspace application in this +case -- the tunnel socket is created by the kernel and configured +using parameters sent in the L2TP_CMD_TUNNEL_CREATE netlink +request. The "ip" utility of iproute2 has commands for managing static +L2TPv3 tunnels; do "ip l2tp help" for more information. + +Debugging +========= + +The driver supports a flexible debug scheme where kernel trace +messages may be optionally enabled per tunnel and per session. Care is +needed when debugging a live system since the messages are not +rate-limited and a busy system could be swamped. Userspace uses +setsockopt on the PPPoX socket to set a debug mask. + +The following debug mask bits are available: + +PPPOL2TP_MSG_DEBUG verbose debug (if compiled in) +PPPOL2TP_MSG_CONTROL userspace - kernel interface +PPPOL2TP_MSG_SEQ sequence numbers handling +PPPOL2TP_MSG_DATA data packets + +If enabled, files under a l2tp debugfs directory can be used to dump +kernel state about L2TP tunnels and sessions. To access it, the +debugfs filesystem must first be mounted. + +# mount -t debugfs debugfs /debug + +Files under the l2tp directory can then be accessed. + +# cat /debug/l2tp/tunnels + +The debugfs files should not be used by applications to obtain L2TP +state information because the file format is subject to change. It is +implemented to provide extra debug information to help diagnose +problems.) Users should use the netlink API. + +/proc/net/pppol2tp is also provided for backwards compatibility with +the original pppol2tp driver. It lists information about L2TPv2 +tunnels and sessions only. Its use is discouraged. + +Unmanaged L2TPv3 Tunnels +======================== + +Some commercial L2TP products support unmanaged L2TPv3 ethernet +tunnels, where there is no L2TP control protocol; tunnels are +configured at each side manually. New commands are available in +iproute2's ip utility to support this. + +To create an L2TPv3 ethernet pseudowire between local host 192.168.1.1 +and peer 192.168.1.2, using IP addresses 10.5.1.1 and 10.5.1.2 for the +tunnel endpoints:- + +# modprobe l2tp_eth +# modprobe l2tp_netlink + +# ip l2tp add tunnel tunnel_id 1 peer_tunnel_id 1 udp_sport 5000 \ + udp_dport 5000 encap udp local 192.168.1.1 remote 192.168.1.2 +# ip l2tp add session tunnel_id 1 session_id 1 peer_session_id 1 +# ifconfig -a +# ip addr add 10.5.1.2/32 peer 10.5.1.1/32 dev l2tpeth0 +# ifconfig l2tpeth0 up + +Choose IP addresses to be the address of a local IP interface and that +of the remote system. The IP addresses of the l2tpeth0 interface can be +anything suitable. + +Repeat the above at the peer, with ports, tunnel/session ids and IP +addresses reversed. The tunnel and session IDs can be any non-zero +32-bit number, but the values must be reversed at the peer. + +Host 1 Host2 +udp_sport=5000 udp_sport=5001 +udp_dport=5001 udp_dport=5000 +tunnel_id=42 tunnel_id=45 +peer_tunnel_id=45 peer_tunnel_id=42 +session_id=128 session_id=5196755 +peer_session_id=5196755 peer_session_id=128 + +When done at both ends of the tunnel, it should be possible to send +data over the network. e.g. + +# ping 10.5.1.1 + + +Sample Userspace Code +===================== + +1. Create tunnel management PPPoX socket + + kernel_fd = socket(AF_PPPOX, SOCK_DGRAM, PX_PROTO_OL2TP); + if (kernel_fd >= 0) { + struct sockaddr_pppol2tp sax; + struct sockaddr_in const *peer_addr; + + peer_addr = l2tp_tunnel_get_peer_addr(tunnel); + memset(&sax, 0, sizeof(sax)); + sax.sa_family = AF_PPPOX; + sax.sa_protocol = PX_PROTO_OL2TP; + sax.pppol2tp.fd = udp_fd; /* fd of tunnel UDP socket */ + sax.pppol2tp.addr.sin_addr.s_addr = peer_addr->sin_addr.s_addr; + sax.pppol2tp.addr.sin_port = peer_addr->sin_port; + sax.pppol2tp.addr.sin_family = AF_INET; + sax.pppol2tp.s_tunnel = tunnel_id; + sax.pppol2tp.s_session = 0; /* special case: mgmt socket */ + sax.pppol2tp.d_tunnel = 0; + sax.pppol2tp.d_session = 0; /* special case: mgmt socket */ + + if(connect(kernel_fd, (struct sockaddr *)&sax, sizeof(sax) ) < 0 ) { + perror("connect failed"); + result = -errno; + goto err; + } + } + +2. Create session PPPoX data socket + + struct sockaddr_pppol2tp sax; + int fd; + + /* Note, the target socket must be bound already, else it will not be ready */ + sax.sa_family = AF_PPPOX; + sax.sa_protocol = PX_PROTO_OL2TP; + sax.pppol2tp.fd = tunnel_fd; + sax.pppol2tp.addr.sin_addr.s_addr = addr->sin_addr.s_addr; + sax.pppol2tp.addr.sin_port = addr->sin_port; + sax.pppol2tp.addr.sin_family = AF_INET; + sax.pppol2tp.s_tunnel = tunnel_id; + sax.pppol2tp.s_session = session_id; + sax.pppol2tp.d_tunnel = peer_tunnel_id; + sax.pppol2tp.d_session = peer_session_id; + + /* session_fd is the fd of the session's PPPoL2TP socket. + * tunnel_fd is the fd of the tunnel UDP socket. + */ + fd = connect(session_fd, (struct sockaddr *)&sax, sizeof(sax)); + if (fd < 0 ) { + return -errno; + } + return 0; + +Internal Implementation +======================= + +The driver keeps a struct l2tp_tunnel context per L2TP tunnel and a +struct l2tp_session context for each session. The l2tp_tunnel is +always associated with a UDP or L2TP/IP socket and keeps a list of +sessions in the tunnel. The l2tp_session context keeps kernel state +about the session. It has private data which is used for data specific +to the session type. With L2TPv2, the session always carried PPP +traffic. With L2TPv3, the session can also carry ethernet frames +(ethernet pseudowire) or other data types such as ATM, HDLC or Frame +Relay. + +When a tunnel is first opened, the reference count on the socket is +increased using sock_hold(). This ensures that the kernel socket +cannot be removed while L2TP's data structures reference it. + +Some L2TP sessions also have a socket (PPP pseudowires) while others +do not (ethernet pseudowires). We can't use the socket reference count +as the reference count for session contexts. The L2TP implementation +therefore has its own internal reference counts on the session +contexts. + +To Do +===== + +Add L2TP tunnel switching support. This would route tunneled traffic +from one L2TP tunnel into another. Specified in +http://tools.ietf.org/html/draft-ietf-l2tpext-tunnel-switching-08 + +Add L2TPv3 VLAN pseudowire support. + +Add L2TPv3 IP pseudowire support. + +Add L2TPv3 ATM pseudowire support. + +Miscellaneous +============= + +The L2TP drivers were developed as part of the OpenL2TP project by +Katalix Systems Ltd. OpenL2TP is a full-featured L2TP client / server, +designed from the ground up to have the L2TP datapath in the +kernel. The project also implemented the pppol2tp plugin for pppd +which allows pppd to use the kernel driver. Details can be found at +http://www.openl2tp.org. diff --git a/Documentation/networking/ltpc.txt b/Documentation/networking/ltpc.txt index fe2a9129d95..0bf3220c715 100644 --- a/Documentation/networking/ltpc.txt +++ b/Documentation/networking/ltpc.txt @@ -25,7 +25,7 @@ the driver will try to determine them itself. If you load the driver as a module, you can pass the parameters "io=", "irq=", and "dma=" on the command line with insmod or modprobe, or add -them as options in /etc/modprobe.conf: +them as options in a configuration file in /etc/modprobe.d/ directory: alias lt0 ltpc # autoload the module when the interface is configured options ltpc io=0x240 irq=9 dma=1 diff --git a/Documentation/networking/mac80211-auth-assoc-deauth.txt b/Documentation/networking/mac80211-auth-assoc-deauth.txt new file mode 100644 index 00000000000..d7a15fe91bf --- /dev/null +++ b/Documentation/networking/mac80211-auth-assoc-deauth.txt @@ -0,0 +1,95 @@ +# +# This outlines the Linux authentication/association and +# deauthentication/disassociation flows. +# +# This can be converted into a diagram using the service +# at http://www.websequencediagrams.com/ +# + +participant userspace +participant mac80211 +participant driver + +alt authentication needed (not FT) +userspace->mac80211: authenticate + +alt authenticated/authenticating already +mac80211->driver: sta_state(AP, not-exists) +mac80211->driver: bss_info_changed(clear BSSID) +else associated +note over mac80211,driver +like deauth/disassoc, without sending the +BA session stop & deauth/disassoc frames +end note +end + +mac80211->driver: config(channel, channel type) +mac80211->driver: bss_info_changed(set BSSID, basic rate bitmap) +mac80211->driver: sta_state(AP, exists) + +alt no probe request data known +mac80211->driver: TX directed probe request +driver->mac80211: RX probe response +end + +mac80211->driver: TX auth frame +driver->mac80211: RX auth frame + +alt WEP shared key auth +mac80211->driver: TX auth frame +driver->mac80211: RX auth frame +end + +mac80211->driver: sta_state(AP, authenticated) +mac80211->userspace: RX auth frame + +end + +userspace->mac80211: associate +alt authenticated or associated +note over mac80211,driver: cleanup like for authenticate +end + +alt not previously authenticated (FT) +mac80211->driver: config(channel, channel type) +mac80211->driver: bss_info_changed(set BSSID, basic rate bitmap) +mac80211->driver: sta_state(AP, exists) +mac80211->driver: sta_state(AP, authenticated) +end +mac80211->driver: TX assoc +driver->mac80211: RX assoc response +note over mac80211: init rate control +mac80211->driver: sta_state(AP, associated) + +alt not using WPA +mac80211->driver: sta_state(AP, authorized) +end + +mac80211->driver: set up QoS parameters + +mac80211->driver: bss_info_changed(QoS, HT, associated with AID) +mac80211->userspace: associated + +note left of userspace: associated now + +alt using WPA +note over userspace +do 4-way-handshake +(data frames) +end note +userspace->mac80211: authorized +mac80211->driver: sta_state(AP, authorized) +end + +userspace->mac80211: deauthenticate/disassociate +mac80211->driver: stop BA sessions +mac80211->driver: TX deauth/disassoc +mac80211->driver: flush frames +mac80211->driver: sta_state(AP,associated) +mac80211->driver: sta_state(AP,authenticated) +mac80211->driver: sta_state(AP,exists) +mac80211->driver: sta_state(AP,not-exists) +mac80211->driver: turn off powersave +mac80211->driver: bss_info_changed(clear BSSID, not associated, no QoS, ...) +mac80211->driver: config(channel type to non-HT) +mac80211->userspace: disconnected diff --git a/Documentation/networking/mac80211-injection.txt b/Documentation/networking/mac80211-injection.txt new file mode 100644 index 00000000000..3a930072b16 --- /dev/null +++ b/Documentation/networking/mac80211-injection.txt @@ -0,0 +1,67 @@ +How to use packet injection with mac80211 +========================================= + +mac80211 now allows arbitrary packets to be injected down any Monitor Mode +interface from userland. The packet you inject needs to be composed in the +following format: + + [ radiotap header ] + [ ieee80211 header ] + [ payload ] + +The radiotap format is discussed in +./Documentation/networking/radiotap-headers.txt. + +Despite many radiotap parameters being currently defined, most only make sense +to appear on received packets. The following information is parsed from the +radiotap headers and used to control injection: + + * IEEE80211_RADIOTAP_FLAGS + + IEEE80211_RADIOTAP_F_FCS: FCS will be removed and recalculated + IEEE80211_RADIOTAP_F_WEP: frame will be encrypted if key available + IEEE80211_RADIOTAP_F_FRAG: frame will be fragmented if longer than the + current fragmentation threshold. + + * IEEE80211_RADIOTAP_TX_FLAGS + + IEEE80211_RADIOTAP_F_TX_NOACK: frame should be sent without waiting for + an ACK even if it is a unicast frame + +The injection code can also skip all other currently defined radiotap fields +facilitating replay of captured radiotap headers directly. + +Here is an example valid radiotap header defining some parameters + + 0x00, 0x00, // <-- radiotap version + 0x0b, 0x00, // <- radiotap header length + 0x04, 0x0c, 0x00, 0x00, // <-- bitmap + 0x6c, // <-- rate + 0x0c, //<-- tx power + 0x01 //<-- antenna + +The ieee80211 header follows immediately afterwards, looking for example like +this: + + 0x08, 0x01, 0x00, 0x00, + 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, + 0x13, 0x22, 0x33, 0x44, 0x55, 0x66, + 0x13, 0x22, 0x33, 0x44, 0x55, 0x66, + 0x10, 0x86 + +Then lastly there is the payload. + +After composing the packet contents, it is sent by send()-ing it to a logical +mac80211 interface that is in Monitor mode. Libpcap can also be used, +(which is easier than doing the work to bind the socket to the right +interface), along the following lines: + + ppcap = pcap_open_live(szInterfaceName, 800, 1, 20, szErrbuf); +... + r = pcap_inject(ppcap, u8aSendBuffer, nLength); + +You can also find a link to a complete inject application here: + +http://wireless.kernel.org/en/users/Documentation/packetspammer + +Andy Green <andy@warmcat.com> diff --git a/Documentation/networking/mac80211_hwsim/README b/Documentation/networking/mac80211_hwsim/README new file mode 100644 index 00000000000..24ac91d5669 --- /dev/null +++ b/Documentation/networking/mac80211_hwsim/README @@ -0,0 +1,68 @@ +mac80211_hwsim - software simulator of 802.11 radio(s) for mac80211 +Copyright (c) 2008, Jouni Malinen <j@w1.fi> + +This program is free software; you can redistribute it and/or modify +it under the terms of the GNU General Public License version 2 as +published by the Free Software Foundation. + + +Introduction + +mac80211_hwsim is a Linux kernel module that can be used to simulate +arbitrary number of IEEE 802.11 radios for mac80211. It can be used to +test most of the mac80211 functionality and user space tools (e.g., +hostapd and wpa_supplicant) in a way that matches very closely with +the normal case of using real WLAN hardware. From the mac80211 view +point, mac80211_hwsim is yet another hardware driver, i.e., no changes +to mac80211 are needed to use this testing tool. + +The main goal for mac80211_hwsim is to make it easier for developers +to test their code and work with new features to mac80211, hostapd, +and wpa_supplicant. The simulated radios do not have the limitations +of real hardware, so it is easy to generate an arbitrary test setup +and always reproduce the same setup for future tests. In addition, +since all radio operation is simulated, any channel can be used in +tests regardless of regulatory rules. + +mac80211_hwsim kernel module has a parameter 'radios' that can be used +to select how many radios are simulated (default 2). This allows +configuration of both very simply setups (e.g., just a single access +point and a station) or large scale tests (multiple access points with +hundreds of stations). + +mac80211_hwsim works by tracking the current channel of each virtual +radio and copying all transmitted frames to all other radios that are +currently enabled and on the same channel as the transmitting +radio. Software encryption in mac80211 is used so that the frames are +actually encrypted over the virtual air interface to allow more +complete testing of encryption. + +A global monitoring netdev, hwsim#, is created independent of +mac80211. This interface can be used to monitor all transmitted frames +regardless of channel. + + +Simple example + +This example shows how to use mac80211_hwsim to simulate two radios: +one to act as an access point and the other as a station that +associates with the AP. hostapd and wpa_supplicant are used to take +care of WPA2-PSK authentication. In addition, hostapd is also +processing access point side of association. + + +# Build mac80211_hwsim as part of kernel configuration + +# Load the module +modprobe mac80211_hwsim + +# Run hostapd (AP) for wlan0 +hostapd hostapd.conf + +# Run wpa_supplicant (station) for wlan1 +wpa_supplicant -Dwext -iwlan1 -c wpa_supplicant.conf + + +More test cases are available in hostap.git: +git://w1.fi/srv/git/hostap.git and mac80211_hwsim/tests subdirectory +(http://w1.fi/gitweb/gitweb.cgi?p=hostap.git;a=tree;f=mac80211_hwsim/tests) diff --git a/Documentation/networking/mac80211_hwsim/hostapd.conf b/Documentation/networking/mac80211_hwsim/hostapd.conf new file mode 100644 index 00000000000..08cde7e35f2 --- /dev/null +++ b/Documentation/networking/mac80211_hwsim/hostapd.conf @@ -0,0 +1,11 @@ +interface=wlan0 +driver=nl80211 + +hw_mode=g +channel=1 +ssid=mac80211 test + +wpa=2 +wpa_key_mgmt=WPA-PSK +wpa_pairwise=CCMP +wpa_passphrase=12345678 diff --git a/Documentation/networking/mac80211_hwsim/wpa_supplicant.conf b/Documentation/networking/mac80211_hwsim/wpa_supplicant.conf new file mode 100644 index 00000000000..299128cff03 --- /dev/null +++ b/Documentation/networking/mac80211_hwsim/wpa_supplicant.conf @@ -0,0 +1,10 @@ +ctrl_interface=/var/run/wpa_supplicant + +network={ + ssid="mac80211 test" + psk="12345678" + key_mgmt=WPA-PSK + proto=WPA2 + pairwise=CCMP + group=CCMP +} diff --git a/Documentation/networking/multicast.txt b/Documentation/networking/multicast.txt deleted file mode 100644 index b06c8c69266..00000000000 --- a/Documentation/networking/multicast.txt +++ /dev/null @@ -1,63 +0,0 @@ -Behaviour of Cards Under Multicast -================================== - -This is how they currently behave, not what the hardware can do--for example, -the Lance driver doesn't use its filter, even though the code for loading -it is in the DEC Lance-based driver. - -The following are requirements for multicasting ------------------------------------------------ -AppleTalk Multicast hardware filtering not important but - avoid cards only doing promisc -IP-Multicast Multicast hardware filters really help -IP-MRoute AllMulti hardware filters are of no help - - -Board Multicast AllMulti Promisc Filter ------------------------------------------------------------------------- -3c501 YES YES YES Software -3c503 YES YES YES Hardware -3c505 YES NO YES Hardware -3c507 NO NO NO N/A -3c509 YES YES YES Software -3c59x YES YES YES Software -ac3200 YES YES YES Hardware -apricot YES PROMISC YES Hardware -arcnet NO NO NO N/A -at1700 PROMISC PROMISC YES Software -atp PROMISC PROMISC YES Software -cs89x0 YES YES YES Software -de4x5 YES YES YES Hardware -de600 NO NO NO N/A -de620 PROMISC PROMISC YES Software -depca YES PROMISC YES Hardware -dmfe YES YES YES Software(*) -e2100 YES YES YES Hardware -eepro YES PROMISC YES Hardware -eexpress NO NO NO N/A -ewrk3 YES PROMISC YES Hardware -hp-plus YES YES YES Hardware -hp YES YES YES Hardware -hp100 YES YES YES Hardware -ibmtr NO NO NO N/A -ioc3-eth YES YES YES Hardware -lance YES YES YES Software(#) -ne YES YES YES Hardware -ni52 <------------------ Buggy ------------------> -ni65 YES YES YES Software(#) -seeq NO NO NO N/A -sgiseek <------------------ Buggy ------------------> -smc-ultra YES YES YES Hardware -sunlance YES YES YES Hardware -tulip YES YES YES Hardware -wavelan YES PROMISC YES Hardware -wd YES YES YES Hardware -xirc2ps_cs YES YES YES Hardware -znet YES YES YES Software - - -PROMISC = This multicast mode is in fact promiscuous mode. Avoid using -cards who go PROMISC on any multicast in a multicast kernel. - -(#) = Hardware multicast support is not used yet. -(*) = Hardware support for Davicom 9132 chipset only. diff --git a/Documentation/networking/multiqueue.txt b/Documentation/networking/multiqueue.txt new file mode 100644 index 00000000000..4caa0e314cc --- /dev/null +++ b/Documentation/networking/multiqueue.txt @@ -0,0 +1,79 @@ + + HOWTO for multiqueue network device support + =========================================== + +Section 1: Base driver requirements for implementing multiqueue support + +Intro: Kernel support for multiqueue devices +--------------------------------------------------------- + +Kernel support for multiqueue devices is always present. + +Section 1: Base driver requirements for implementing multiqueue support +----------------------------------------------------------------------- + +Base drivers are required to use the new alloc_etherdev_mq() or +alloc_netdev_mq() functions to allocate the subqueues for the device. The +underlying kernel API will take care of the allocation and deallocation of +the subqueue memory, as well as netdev configuration of where the queues +exist in memory. + +The base driver will also need to manage the queues as it does the global +netdev->queue_lock today. Therefore base drivers should use the +netif_{start|stop|wake}_subqueue() functions to manage each queue while the +device is still operational. netdev->queue_lock is still used when the device +comes online or when it's completely shut down (unregister_netdev(), etc.). + + +Section 2: Qdisc support for multiqueue devices + +----------------------------------------------- + +Currently two qdiscs are optimized for multiqueue devices. The first is the +default pfifo_fast qdisc. This qdisc supports one qdisc per hardware queue. +A new round-robin qdisc, sch_multiq also supports multiple hardware queues. The +qdisc is responsible for classifying the skb's and then directing the skb's to +bands and queues based on the value in skb->queue_mapping. Use this field in +the base driver to determine which queue to send the skb to. + +sch_multiq has been added for hardware that wishes to avoid head-of-line +blocking. It will cycle though the bands and verify that the hardware queue +associated with the band is not stopped prior to dequeuing a packet. + +On qdisc load, the number of bands is based on the number of queues on the +hardware. Once the association is made, any skb with skb->queue_mapping set, +will be queued to the band associated with the hardware queue. + + +Section 3: Brief howto using MULTIQ for multiqueue devices +--------------------------------------------------------------- + +The userspace command 'tc,' part of the iproute2 package, is used to configure +qdiscs. To add the MULTIQ qdisc to your network device, assuming the device +is called eth0, run the following command: + +# tc qdisc add dev eth0 root handle 1: multiq + +The qdisc will allocate the number of bands to equal the number of queues that +the device reports, and bring the qdisc online. Assuming eth0 has 4 Tx +queues, the band mapping would look like: + +band 0 => queue 0 +band 1 => queue 1 +band 2 => queue 2 +band 3 => queue 3 + +Traffic will begin flowing through each queue based on either the simple_tx_hash +function or based on netdev->select_queue() if you have it defined. + +The behavior of tc filters remains the same. However a new tc action, +skbedit, has been added. Assuming you wanted to route all traffic to a +specific host, for example 192.168.0.3, through a specific queue you could use +this action and establish a filter such as: + +tc filter add dev eth0 parent 1: protocol ip prio 1 u32 \ + match ip dst 192.168.0.3 \ + action skbedit queue_mapping 3 + +Author: Alexander Duyck <alexander.h.duyck@intel.com> +Original Author: Peter P. Waskiewicz Jr. <peter.p.waskiewicz.jr@intel.com> diff --git a/Documentation/networking/ncsa-telnet b/Documentation/networking/ncsa-telnet deleted file mode 100644 index d77d28b0909..00000000000 --- a/Documentation/networking/ncsa-telnet +++ /dev/null @@ -1,16 +0,0 @@ -NCSA telnet doesn't work with path MTU discovery enabled. This is due to a -bug in NCSA that also stops it working with other modern networking code -such as Solaris. - -The following information is courtesy of -Marek <marekm@i17linuxb.ists.pwr.wroc.pl> - -There is a fixed version somewhere on ftp.upe.ac.za (sorry, I don't -remember the exact pathname, and this site is very slow from here). -It may or may not be faster for you to get it from -ftp://ftp.ists.pwr.wroc.pl/pub/msdos/telnet/ncsa_upe/tel23074.zip -(source is in v230704s.zip). I have tested it with 1.3.79 (with -path mtu discovery enabled - ncsa 2.3.08 didn't work) and it seems -to work. I don't know if anyone is working on this code - this -version is over a year old. Too bad - it's faster and often more -stable than these windoze telnets, and runs on almost anything... diff --git a/Documentation/networking/net-modules.txt b/Documentation/networking/net-modules.txt deleted file mode 100644 index 0b27863f155..00000000000 --- a/Documentation/networking/net-modules.txt +++ /dev/null @@ -1,321 +0,0 @@ -Wed 2-Aug-95 <matti.aarnio@utu.fi> - - Linux network driver modules - - Do not mistake this for "README.modules" at the top-level - directory! That document tells about modules in general, while - this one tells only about network device driver modules. - - This is a potpourri of INSMOD-time(*) configuration options - (if such exists) and their default values of various modules - in the Linux network drivers collection. - - Some modules have also hidden (= non-documented) tunable values. - The choice of not documenting them is based on general belief, that - the less the user needs to know, the better. (There are things that - driver developers can use, others should not confuse themselves.) - - In many cases it is highly preferred that insmod:ing is done - ONLY with defining an explicit address for the card, AND BY - NOT USING AUTO-PROBING! - - Now most cards have some explicitly defined base address that they - are compiled with (to avoid auto-probing, among other things). - If that compiled value does not match your actual configuration, - do use the "io=0xXXX" -parameter for the insmod, and give there - a value matching your environment. - - If you are adventurous, you can ask the driver to autoprobe - by using the "io=0" parameter, however it is a potentially dangerous - thing to do in a live system. (If you don't know where the - card is located, you can try autoprobing, and after possible - crash recovery, insmod with proper IO-address..) - - -------------------------- - (*) "INSMOD-time" means when you load module with - /sbin/insmod you can feed it optional parameters. - See "man insmod". - -------------------------- - - - 8390 based Network Modules (Paul Gortmaker, Nov 12, 1995) - -------------------------- - -(Includes: smc-ultra, ne, wd, 3c503, hp, hp-plus, e2100 and ac3200) - -The 8390 series of network drivers now support multiple card systems without -reloading the same module multiple times (memory efficient!) This is done by -specifying multiple comma separated values, such as: - - insmod 3c503.o io=0x280,0x300,0x330,0x350 xcvr=0,1,0,1 - -The above would have the one module controlling four 3c503 cards, with card 2 -and 4 using external transceivers. The "insmod" manual describes the usage -of comma separated value lists. - -It is *STRONGLY RECOMMENDED* that you supply "io=" instead of autoprobing. -If an "io=" argument is not supplied, then the ISA drivers will complain -about autoprobing being not recommended, and begrudgingly autoprobe for -a *SINGLE CARD ONLY* -- if you want to use multiple cards you *have* to -supply an "io=0xNNN,0xQQQ,..." argument. - -The ne module is an exception to the above. A NE2000 is essentially an -8390 chip, some bus glue and some RAM. Because of this, the ne probe is -more invasive than the rest, and so at boot we make sure the ne probe is -done last of all the 8390 cards (so that it won't trip over other 8390 based -cards) With modules we can't ensure that all other non-ne 8390 cards have -already been found. Because of this, the ne module REQUIRES an "io=0xNNN" -argument passed in via insmod. It will refuse to autoprobe. - -It is also worth noting that auto-IRQ probably isn't as reliable during -the flurry of interrupt activity on a running machine. Cards such as the -ne2000 that can't get the IRQ setting from an EEPROM or configuration -register are probably best supplied with an "irq=M" argument as well. - - ----------------------------------------------------------------------- -Card/Module List - Configurable Parameters and Default Values ----------------------------------------------------------------------- - -3c501.c: - io = 0x280 IO base address - irq = 5 IRQ - (Probes ports: 0x280, 0x300) - -3c503.c: - io = 0 (It will complain if you don't supply an "io=0xNNN") - irq = 0 (IRQ software selected by driver using autoIRQ) - xcvr = 0 (Use xcvr=1 to select external transceiver.) - (Probes ports: 0x300, 0x310, 0x330, 0x350, 0x250, 0x280, 0x2A0, 0x2E0) - -3c505.c: - io = 0 - irq = 0 - dma = 6 (not autoprobed) - (Probes ports: 0x300, 0x280, 0x310) - -3c507.c: - io = 0x300 - irq = 0 - (Probes ports: 0x300, 0x320, 0x340, 0x280) - -3c509.c: - io = 0 - irq = 0 - ( Module load-time probing Works reliably only on EISA, ISA ID-PROBE - IS NOT RELIABLE! Compile this driver statically into kernel for - now, if you need it auto-probing on an ISA-bus machine. ) - -8390.c: - (No public options, several other modules need this one) - -a2065.c: - Since this is a Zorro board, it supports full autoprobing, even for - multiple boards. (m68k/Amiga) - -ac3200.c: - io = 0 (Checks 0x1000 to 0x8fff in 0x1000 intervals) - irq = 0 (Read from config register) - (EISA probing..) - -apricot.c: - io = 0x300 (Can't be altered!) - irq = 10 - -arcnet.c: - io = 0 - irqnum = 0 - shmem = 0 - num = 0 - DO SET THESE MANUALLY AT INSMOD! - (When probing, looks at the following possible addresses: - Suggested ones: - 0x300, 0x2E0, 0x2F0, 0x2D0 - Other ones: - 0x200, 0x210, 0x220, 0x230, 0x240, 0x250, 0x260, 0x270, - 0x280, 0x290, 0x2A0, 0x2B0, 0x2C0, - 0x310, 0x320, 0x330, 0x340, 0x350, 0x360, 0x370, - 0x380, 0x390, 0x3A0, 0x3E0, 0x3F0 ) - -ariadne.c: - Since this is a Zorro board, it supports full autoprobing, even for - multiple boards. (m68k/Amiga) - -at1700.c: - io = 0x260 - irq = 0 - (Probes ports: 0x260, 0x280, 0x2A0, 0x240, 0x340, 0x320, 0x380, 0x300) - -atari_bionet.c: - Supports full autoprobing. (m68k/Atari) - -atari_pamsnet.c: - Supports full autoprobing. (m68k/Atari) - -atarilance.c: - Supports full autoprobing. (m68k/Atari) - -atp.c: *Not modularized* - (Probes ports: 0x378, 0x278, 0x3BC; - fixed IRQs: 5 and 7 ) - -cops.c: - io = 0x240 - irq = 5 - nodeid = 0 (AutoSelect = 0, NodeID 1-254 is hand selected.) - (Probes ports: 0x240, 0x340, 0x200, 0x210, 0x220, 0x230, 0x260, - 0x2A0, 0x300, 0x310, 0x320, 0x330, 0x350, 0x360) - -de4x5.c: - io = 0x000b - irq = 10 - is_not_dec = 0 -- For non-DEC card using DEC 21040/21041/21140 chip, set this to 1 - (EISA, and PCI probing) - -de600.c: - de600_debug = 0 - (On port 0x378, irq 7 -- lpt1; compile time configurable) - -de620.c: - bnc = 0, utp = 0 <-- Force media by setting either. - io = 0x378 (also compile-time configurable) - irq = 7 - -depca.c: - io = 0x200 - irq = 7 - (Probes ports: ISA: 0x300, 0x200; - EISA: 0x0c00 ) - -dummy.c: - No options - -e2100.c: - io = 0 (It will complain if you don't supply an "io=0xNNN") - irq = 0 (IRQ software selected by driver) - mem = 0 (Override default shared memory start of 0xd0000) - xcvr = 0 (Use xcvr=1 to select external transceiver.) - (Probes ports: 0x300, 0x280, 0x380, 0x220) - -eepro.c: - io = 0x200 - irq = 0 - (Probes ports: 0x200, 0x240, 0x280, 0x2C0, 0x300, 0x320, 0x340, 0x360) - -eexpress.c: - io = 0x300 - irq = 0 (IRQ value read from EEPROM) - (Probes ports: 0x300, 0x270, 0x320, 0x340) - -eql.c: - (No parameters) - -ewrk3.c: - io = 0x300 - irq = 5 - (With module no autoprobing! - On EISA-bus does EISA probing. - Static linkage probes ports on ISA bus: - 0x100, 0x120, 0x140, 0x160, 0x180, 0x1A0, 0x1C0, - 0x200, 0x220, 0x240, 0x260, 0x280, 0x2A0, 0x2C0, 0x2E0, - 0x300, 0x340, 0x360, 0x380, 0x3A0, 0x3C0) - -hp-plus.c: - io = 0 (It will complain if you don't supply an "io=0xNNN") - irq = 0 (IRQ read from configuration register) - (Probes ports: 0x200, 0x240, 0x280, 0x2C0, 0x300, 0x320, 0x340) - -hp.c: - io = 0 (It will complain if you don't supply an "io=0xNNN") - irq = 0 (IRQ software selected by driver using autoIRQ) - (Probes ports: 0x300, 0x320, 0x340, 0x280, 0x2C0, 0x200, 0x240) - -hp100.c: - hp100_port = 0 (IO-base address) - (Does EISA-probing, if on EISA-slot; - On ISA-bus probes all ports from 0x100 thru to 0x3E0 - in increments of 0x020) - -hydra.c: - Since this is a Zorro board, it supports full autoprobing, even for - multiple boards. (m68k/Amiga) - -ibmtr.c: - io = 0xa20, 0xa24 (autoprobed by default) - irq = 0 (driver cannot select irq - read from hardware) - mem = 0 (shared memory base set at 0xd0000 and not yet - able to override thru mem= parameter.) - -lance.c: *Not modularized* - (PCI, and ISA probing; "CONFIG_PCI" needed for PCI support) - (Probes ISA ports: 0x300, 0x320, 0x340, 0x360) - -loopback.c: *Static kernel component* - -ne.c: - io = 0 (Explicitly *requires* an "io=0xNNN" value) - irq = 0 (Tries to determine configured IRQ via autoIRQ) - (Probes ports: 0x300, 0x280, 0x320, 0x340, 0x360) - -net_init.c: *Static kernel component* - -ni52.c: *Not modularized* - (Probes ports: 0x300, 0x280, 0x360, 0x320, 0x340 - mems: 0xD0000, 0xD2000, 0xC8000, 0xCA000, - 0xD4000, 0xD6000, 0xD8000 ) - -ni65.c: *Not modularized* **16MB MEMORY BARRIER BUG** - (Probes ports: 0x300, 0x320, 0x340, 0x360) - -pi2.c: *Not modularized* (well, NON-STANDARD modularization!) - Only one card supported at this time. - (Probes ports: 0x380, 0x300, 0x320, 0x340, 0x360, 0x3A0) - -plip.c: - io = 0 - irq = 0 (by default, uses IRQ 5 for port at 0x3bc, IRQ 7 - for port at 0x378, and IRQ 2 for port at 0x278) - (Probes ports: 0x278, 0x378, 0x3bc) - -ppp.c: - No options (ppp-2.2+ has some, this is based on non-dynamic - version from ppp-2.1.2d) - -seeq8005.c: *Not modularized* - (Probes ports: 0x300, 0x320, 0x340, 0x360) - -skeleton.c: *Skeleton* - -slhc.c: - No configuration parameters - -slip.c: - slip_maxdev = 256 (default value from SL_NRUNIT on slip.h) - - -smc-ultra.c: - io = 0 (It will complain if you don't supply an "io=0xNNN") - irq = 0 (IRQ val. read from EEPROM) - (Probes ports: 0x200, 0x220, 0x240, 0x280, 0x300, 0x340, 0x380) - -tulip.c: *Partial modularization* - (init-time memory allocation makes problems..) - -tunnel.c: - No insmod parameters - -wavelan.c: - io = 0x390 (Settable, but change not recommended) - irq = 0 (Not honoured, if changed..) - -wd.c: - io = 0 (It will complain if you don't supply an "io=0xNNN") - irq = 0 (IRQ val. read from EEPROM, ancient cards use autoIRQ) - mem = 0 (Force shared-memory on address 0xC8000, or whatever..) - mem_end = 0 (Force non-std. mem. size via supplying mem_end val.) - (eg. for 32k WD8003EBT, use mem=0xd0000 mem_end=0xd8000) - (Probes ports: 0x300, 0x280, 0x380, 0x240) - -znet.c: *Not modularized* - (Only one device on Zenith Z-Note (notebook?) systems, - configuration information from (EE)PROM) diff --git a/Documentation/networking/netconsole.txt b/Documentation/networking/netconsole.txt index 53618fb1a71..a5d574a9ae0 100644 --- a/Documentation/networking/netconsole.txt +++ b/Documentation/networking/netconsole.txt @@ -1,8 +1,13 @@ started by Ingo Molnar <mingo@redhat.com>, 2001.09.17 2.6 port and netpoll api by Matt Mackall <mpm@selenic.com>, Sep 9 2003 +IPv6 support by Cong Wang <xiyou.wangcong@gmail.com>, Jan 1 2013 Please send bug reports to Matt Mackall <mpm@selenic.com> +Satyam Sharma <satyam.sharma@gmail.com>, and Cong Wang <xiyou.wangcong@gmail.com> + +Introduction: +============= This module logs kernel printk messages over UDP allowing debugging of problem where disk logging fails and serial consoles are impractical. @@ -13,6 +18,9 @@ the specified interface as soon as possible. While this doesn't allow capture of early kernel panics, it does capture most of the boot process. +Sender and receiver configuration: +================================== + It takes a string configuration parameter "netconsole" in the following format: @@ -34,24 +42,136 @@ Examples: insmod netconsole netconsole=@/,@10.0.0.2/ + or using IPv6 + + insmod netconsole netconsole=@/,@fd00:1:2:3::1/ + +It also supports logging to multiple remote agents by specifying +parameters for the multiple agents separated by semicolons and the +complete string enclosed in "quotes", thusly: + + modprobe netconsole netconsole="@/,@10.0.0.2/;@/eth1,6892@10.0.0.3/" + Built-in netconsole starts immediately after the TCP stack is initialized and attempts to bring up the supplied dev at the supplied address. -The remote host can run either 'netcat -u -l -p <port>' or syslogd. +The remote host has several options to receive the kernel messages, +for example: + +1) syslogd + +2) netcat + + On distributions using a BSD-based netcat version (e.g. Fedora, + openSUSE and Ubuntu) the listening port must be specified without + the -p switch: + + 'nc -u -l -p <port>' / 'nc -u -l <port>' or + 'netcat -u -l -p <port>' / 'netcat -u -l <port>' + +3) socat + + 'socat udp-recv:<port> -' + +Dynamic reconfiguration: +======================== + +Dynamic reconfigurability is a useful addition to netconsole that enables +remote logging targets to be dynamically added, removed, or have their +parameters reconfigured at runtime from a configfs-based userspace interface. +[ Note that the parameters of netconsole targets that were specified/created +from the boot/module option are not exposed via this interface, and hence +cannot be modified dynamically. ] + +To include this feature, select CONFIG_NETCONSOLE_DYNAMIC when building the +netconsole module (or kernel, if netconsole is built-in). + +Some examples follow (where configfs is mounted at the /sys/kernel/config +mountpoint). + +To add a remote logging target (target names can be arbitrary): + + cd /sys/kernel/config/netconsole/ + mkdir target1 + +Note that newly created targets have default parameter values (as mentioned +above) and are disabled by default -- they must first be enabled by writing +"1" to the "enabled" attribute (usually after setting parameters accordingly) +as described below. + +To remove a target: + + rmdir /sys/kernel/config/netconsole/othertarget/ + +The interface exposes these parameters of a netconsole target to userspace: + + enabled Is this target currently enabled? (read-write) + dev_name Local network interface name (read-write) + local_port Source UDP port to use (read-write) + remote_port Remote agent's UDP port (read-write) + local_ip Source IP address to use (read-write) + remote_ip Remote agent's IP address (read-write) + local_mac Local interface's MAC address (read-only) + remote_mac Remote agent's MAC address (read-write) + +The "enabled" attribute is also used to control whether the parameters of +a target can be updated or not -- you can modify the parameters of only +disabled targets (i.e. if "enabled" is 0). + +To update a target's parameters: + + cat enabled # check if enabled is 1 + echo 0 > enabled # disable the target (if required) + echo eth2 > dev_name # set local interface + echo 10.0.0.4 > remote_ip # update some parameter + echo cb:a9:87:65:43:21 > remote_mac # update more parameters + echo 1 > enabled # enable target again + +You can also update the local interface dynamically. This is especially +useful if you want to use interfaces that have newly come up (and may not +have existed when netconsole was loaded / initialized). + +Miscellaneous notes: +==================== WARNING: the default target ethernet setting uses the broadcast ethernet address to send packets, which can cause increased load on other systems on the same ethernet segment. +TIP: some LAN switches may be configured to suppress ethernet broadcasts +so it is advised to explicitly specify the remote agents' MAC addresses +from the config parameters passed to netconsole. + +TIP: to find out the MAC address of, say, 10.0.0.2, you may try using: + + ping -c 1 10.0.0.2 ; /sbin/arp -n | grep 10.0.0.2 + +TIP: in case the remote logging agent is on a separate LAN subnet than +the sender, it is suggested to try specifying the MAC address of the +default gateway (you may use /sbin/route -n to find it out) as the +remote MAC address instead. + NOTE: the network device (eth1 in the above case) can run any kind of other network traffic, netconsole is not intrusive. Netconsole might cause slight delays in other traffic if the volume of kernel messages is high, but should have no other impact. +NOTE: if you find that the remote logging agent is not receiving or +printing all messages from the sender, it is likely that you have set +the "console_loglevel" parameter (on the sender) to only send high +priority messages to the console. You can change this at runtime using: + + dmesg -n 8 + +or by specifying "debug" on the kernel command line at boot, to send +all kernel messages to the console. A specific value for this parameter +can also be set using the "loglevel" kernel boot option. See the +dmesg(8) man page and Documentation/kernel-parameters.txt for details. + Netconsole was designed to be as instantaneous as possible, to enable the logging of even the most critical kernel bugs. It works from IRQ contexts as well, and does not enable interrupts while -sending packets. Due to these unique needs, configuration can not +sending packets. Due to these unique needs, configuration cannot be more automatic, and some fundamental limitations will remain: only IP networks, UDP packets and ethernet devices are supported. diff --git a/Documentation/networking/netdev-FAQ.txt b/Documentation/networking/netdev-FAQ.txt new file mode 100644 index 00000000000..0fe1c6e0dbc --- /dev/null +++ b/Documentation/networking/netdev-FAQ.txt @@ -0,0 +1,224 @@ + +Information you need to know about netdev +----------------------------------------- + +Q: What is netdev? + +A: It is a mailing list for all network-related Linux stuff. This includes + anything found under net/ (i.e. core code like IPv6) and drivers/net + (i.e. hardware specific drivers) in the Linux source tree. + + Note that some subsystems (e.g. wireless drivers) which have a high volume + of traffic have their own specific mailing lists. + + The netdev list is managed (like many other Linux mailing lists) through + VGER ( http://vger.kernel.org/ ) and archives can be found below: + + http://marc.info/?l=linux-netdev + http://www.spinics.net/lists/netdev/ + + Aside from subsystems like that mentioned above, all network-related Linux + development (i.e. RFC, review, comments, etc.) takes place on netdev. + +Q: How do the changes posted to netdev make their way into Linux? + +A: There are always two trees (git repositories) in play. Both are driven + by David Miller, the main network maintainer. There is the "net" tree, + and the "net-next" tree. As you can probably guess from the names, the + net tree is for fixes to existing code already in the mainline tree from + Linus, and net-next is where the new code goes for the future release. + You can find the trees here: + + http://git.kernel.org/?p=linux/kernel/git/davem/net.git + http://git.kernel.org/?p=linux/kernel/git/davem/net-next.git + +Q: How often do changes from these trees make it to the mainline Linus tree? + +A: To understand this, you need to know a bit of background information + on the cadence of Linux development. Each new release starts off with + a two week "merge window" where the main maintainers feed their new + stuff to Linus for merging into the mainline tree. After the two weeks, + the merge window is closed, and it is called/tagged "-rc1". No new + features get mainlined after this -- only fixes to the rc1 content + are expected. After roughly a week of collecting fixes to the rc1 + content, rc2 is released. This repeats on a roughly weekly basis + until rc7 (typically; sometimes rc6 if things are quiet, or rc8 if + things are in a state of churn), and a week after the last vX.Y-rcN + was done, the official "vX.Y" is released. + + Relating that to netdev: At the beginning of the 2-week merge window, + the net-next tree will be closed - no new changes/features. The + accumulated new content of the past ~10 weeks will be passed onto + mainline/Linus via a pull request for vX.Y -- at the same time, + the "net" tree will start accumulating fixes for this pulled content + relating to vX.Y + + An announcement indicating when net-next has been closed is usually + sent to netdev, but knowing the above, you can predict that in advance. + + IMPORTANT: Do not send new net-next content to netdev during the + period during which net-next tree is closed. + + Shortly after the two weeks have passed (and vX.Y-rc1 is released), the + tree for net-next reopens to collect content for the next (vX.Y+1) release. + + If you aren't subscribed to netdev and/or are simply unsure if net-next + has re-opened yet, simply check the net-next git repository link above for + any new networking-related commits. + + The "net" tree continues to collect fixes for the vX.Y content, and + is fed back to Linus at regular (~weekly) intervals. Meaning that the + focus for "net" is on stabilization and bugfixes. + + Finally, the vX.Y gets released, and the whole cycle starts over. + +Q: So where are we now in this cycle? + +A: Load the mainline (Linus) page here: + + http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git + + and note the top of the "tags" section. If it is rc1, it is early + in the dev cycle. If it was tagged rc7 a week ago, then a release + is probably imminent. + +Q: How do I indicate which tree (net vs. net-next) my patch should be in? + +A: Firstly, think whether you have a bug fix or new "next-like" content. + Then once decided, assuming that you use git, use the prefix flag, i.e. + + git format-patch --subject-prefix='PATCH net-next' start..finish + + Use "net" instead of "net-next" (always lower case) in the above for + bug-fix net content. If you don't use git, then note the only magic in + the above is just the subject text of the outgoing e-mail, and you can + manually change it yourself with whatever MUA you are comfortable with. + +Q: I sent a patch and I'm wondering what happened to it. How can I tell + whether it got merged? + +A: Start by looking at the main patchworks queue for netdev: + + http://patchwork.ozlabs.org/project/netdev/list/ + + The "State" field will tell you exactly where things are at with + your patch. + +Q: The above only says "Under Review". How can I find out more? + +A: Generally speaking, the patches get triaged quickly (in less than 48h). + So be patient. Asking the maintainer for status updates on your + patch is a good way to ensure your patch is ignored or pushed to + the bottom of the priority list. + +Q: How can I tell what patches are queued up for backporting to the + various stable releases? + +A: Normally Greg Kroah-Hartman collects stable commits himself, but + for networking, Dave collects up patches he deems critical for the + networking subsystem, and then hands them off to Greg. + + There is a patchworks queue that you can see here: + http://patchwork.ozlabs.org/bundle/davem/stable/?state=* + + It contains the patches which Dave has selected, but not yet handed + off to Greg. If Greg already has the patch, then it will be here: + http://git.kernel.org/cgit/linux/kernel/git/stable/stable-queue.git + + A quick way to find whether the patch is in this stable-queue is + to simply clone the repo, and then git grep the mainline commit ID, e.g. + + stable-queue$ git grep -l 284041ef21fdf2e + releases/3.0.84/ipv6-fix-possible-crashes-in-ip6_cork_release.patch + releases/3.4.51/ipv6-fix-possible-crashes-in-ip6_cork_release.patch + releases/3.9.8/ipv6-fix-possible-crashes-in-ip6_cork_release.patch + stable/stable-queue$ + +Q: I see a network patch and I think it should be backported to stable. + Should I request it via "stable@vger.kernel.org" like the references in + the kernel's Documentation/stable_kernel_rules.txt file say? + +A: No, not for networking. Check the stable queues as per above 1st to see + if it is already queued. If not, then send a mail to netdev, listing + the upstream commit ID and why you think it should be a stable candidate. + + Before you jump to go do the above, do note that the normal stable rules + in Documentation/stable_kernel_rules.txt still apply. So you need to + explicitly indicate why it is a critical fix and exactly what users are + impacted. In addition, you need to convince yourself that you _really_ + think it has been overlooked, vs. having been considered and rejected. + + Generally speaking, the longer it has had a chance to "soak" in mainline, + the better the odds that it is an OK candidate for stable. So scrambling + to request a commit be added the day after it appears should be avoided. + +Q: I have created a network patch and I think it should be backported to + stable. Should I add a "Cc: stable@vger.kernel.org" like the references + in the kernel's Documentation/ directory say? + +A: No. See above answer. In short, if you think it really belongs in + stable, then ensure you write a decent commit log that describes who + gets impacted by the bugfix and how it manifests itself, and when the + bug was introduced. If you do that properly, then the commit will + get handled appropriately and most likely get put in the patchworks + stable queue if it really warrants it. + + If you think there is some valid information relating to it being in + stable that does _not_ belong in the commit log, then use the three + dash marker line as described in Documentation/SubmittingPatches to + temporarily embed that information into the patch that you send. + +Q: Someone said that the comment style and coding convention is different + for the networking content. Is this true? + +A: Yes, in a largely trivial way. Instead of this: + + /* + * foobar blah blah blah + * another line of text + */ + + it is requested that you make it look like this: + + /* foobar blah blah blah + * another line of text + */ + +Q: I am working in existing code that has the former comment style and not the + latter. Should I submit new code in the former style or the latter? + +A: Make it the latter style, so that eventually all code in the domain of + netdev is of this format. + +Q: I found a bug that might have possible security implications or similar. + Should I mail the main netdev maintainer off-list? + +A: No. The current netdev maintainer has consistently requested that people + use the mailing lists and not reach out directly. If you aren't OK with + that, then perhaps consider mailing "security@kernel.org" or reading about + http://oss-security.openwall.org/wiki/mailing-lists/distros + as possible alternative mechanisms. + +Q: What level of testing is expected before I submit my change? + +A: If your changes are against net-next, the expectation is that you + have tested by layering your changes on top of net-next. Ideally you + will have done run-time testing specific to your change, but at a + minimum, your changes should survive an "allyesconfig" and an + "allmodconfig" build without new warnings or failures. + +Q: Any other tips to help ensure my net/net-next patch gets OK'd? + +A: Attention to detail. Re-read your own work as if you were the + reviewer. You can start with using checkpatch.pl, perhaps even + with the "--strict" flag. But do not be mindlessly robotic in + doing so. If your change is a bug-fix, make sure your commit log + indicates the end-user visible symptom, the underlying reason as + to why it happens, and then if necessary, explain why the fix proposed + is the best way to get things done. Don't mangle whitespace, and as + is common, don't mis-indent function arguments that span multiple lines. + If it is your first patch, mail it to yourself so you can test apply + it to an unpatched tree to confirm infrastructure didn't mangle it. + + Finally, go back and read Documentation/SubmittingPatches to be + sure you are not repeating some common mistake documented there. diff --git a/Documentation/networking/netdev-features.txt b/Documentation/networking/netdev-features.txt new file mode 100644 index 00000000000..f310edec8a7 --- /dev/null +++ b/Documentation/networking/netdev-features.txt @@ -0,0 +1,167 @@ +Netdev features mess and how to get out from it alive +===================================================== + +Author: + MichaÅ‚ MirosÅ‚aw <mirq-linux@rere.qmqm.pl> + + + + Part I: Feature sets +====================== + +Long gone are the days when a network card would just take and give packets +verbatim. Today's devices add multiple features and bugs (read: offloads) +that relieve an OS of various tasks like generating and checking checksums, +splitting packets, classifying them. Those capabilities and their state +are commonly referred to as netdev features in Linux kernel world. + +There are currently three sets of features relevant to the driver, and +one used internally by network core: + + 1. netdev->hw_features set contains features whose state may possibly + be changed (enabled or disabled) for a particular device by user's + request. This set should be initialized in ndo_init callback and not + changed later. + + 2. netdev->features set contains features which are currently enabled + for a device. This should be changed only by network core or in + error paths of ndo_set_features callback. + + 3. netdev->vlan_features set contains features whose state is inherited + by child VLAN devices (limits netdev->features set). This is currently + used for all VLAN devices whether tags are stripped or inserted in + hardware or software. + + 4. netdev->wanted_features set contains feature set requested by user. + This set is filtered by ndo_fix_features callback whenever it or + some device-specific conditions change. This set is internal to + networking core and should not be referenced in drivers. + + + + Part II: Controlling enabled features +======================================= + +When current feature set (netdev->features) is to be changed, new set +is calculated and filtered by calling ndo_fix_features callback +and netdev_fix_features(). If the resulting set differs from current +set, it is passed to ndo_set_features callback and (if the callback +returns success) replaces value stored in netdev->features. +NETDEV_FEAT_CHANGE notification is issued after that whenever current +set might have changed. + +The following events trigger recalculation: + 1. device's registration, after ndo_init returned success + 2. user requested changes in features state + 3. netdev_update_features() is called + +ndo_*_features callbacks are called with rtnl_lock held. Missing callbacks +are treated as always returning success. + +A driver that wants to trigger recalculation must do so by calling +netdev_update_features() while holding rtnl_lock. This should not be done +from ndo_*_features callbacks. netdev->features should not be modified by +driver except by means of ndo_fix_features callback. + + + + Part III: Implementation hints +================================ + + * ndo_fix_features: + +All dependencies between features should be resolved here. The resulting +set can be reduced further by networking core imposed limitations (as coded +in netdev_fix_features()). For this reason it is safer to disable a feature +when its dependencies are not met instead of forcing the dependency on. + +This callback should not modify hardware nor driver state (should be +stateless). It can be called multiple times between successive +ndo_set_features calls. + +Callback must not alter features contained in NETIF_F_SOFT_FEATURES or +NETIF_F_NEVER_CHANGE sets. The exception is NETIF_F_VLAN_CHALLENGED but +care must be taken as the change won't affect already configured VLANs. + + * ndo_set_features: + +Hardware should be reconfigured to match passed feature set. The set +should not be altered unless some error condition happens that can't +be reliably detected in ndo_fix_features. In this case, the callback +should update netdev->features to match resulting hardware state. +Errors returned are not (and cannot be) propagated anywhere except dmesg. +(Note: successful return is zero, >0 means silent error.) + + + + Part IV: Features +=================== + +For current list of features, see include/linux/netdev_features.h. +This section describes semantics of some of them. + + * Transmit checksumming + +For complete description, see comments near the top of include/linux/skbuff.h. + +Note: NETIF_F_HW_CSUM is a superset of NETIF_F_IP_CSUM + NETIF_F_IPV6_CSUM. +It means that device can fill TCP/UDP-like checksum anywhere in the packets +whatever headers there might be. + + * Transmit TCP segmentation offload + +NETIF_F_TSO_ECN means that hardware can properly split packets with CWR bit +set, be it TCPv4 (when NETIF_F_TSO is enabled) or TCPv6 (NETIF_F_TSO6). + + * Transmit DMA from high memory + +On platforms where this is relevant, NETIF_F_HIGHDMA signals that +ndo_start_xmit can handle skbs with frags in high memory. + + * Transmit scatter-gather + +Those features say that ndo_start_xmit can handle fragmented skbs: +NETIF_F_SG --- paged skbs (skb_shinfo()->frags), NETIF_F_FRAGLIST --- +chained skbs (skb->next/prev list). + + * Software features + +Features contained in NETIF_F_SOFT_FEATURES are features of networking +stack. Driver should not change behaviour based on them. + + * LLTX driver (deprecated for hardware drivers) + +NETIF_F_LLTX should be set in drivers that implement their own locking in +transmit path or don't need locking at all (e.g. software tunnels). +In ndo_start_xmit, it is recommended to use a try_lock and return +NETDEV_TX_LOCKED when the spin lock fails. The locking should also properly +protect against other callbacks (the rules you need to find out). + +Don't use it for new drivers. + + * netns-local device + +NETIF_F_NETNS_LOCAL is set for devices that are not allowed to move between +network namespaces (e.g. loopback). + +Don't use it in drivers. + + * VLAN challenged + +NETIF_F_VLAN_CHALLENGED should be set for devices which can't cope with VLAN +headers. Some drivers set this because the cards can't handle the bigger MTU. +[FIXME: Those cases could be fixed in VLAN code by allowing only reduced-MTU +VLANs. This may be not useful, though.] + +* rx-fcs + +This requests that the NIC append the Ethernet Frame Checksum (FCS) +to the end of the skb data. This allows sniffers and other tools to +read the CRC recorded by the NIC on receipt of the packet. + +* rx-all + +This requests that the NIC receive all possible frames, including errored +frames (such as bad FCS, etc). This can be helpful when sniffing a link with +bad packets on it. Some NICs may receive more packets if also put into normal +PROMISC mode. diff --git a/Documentation/networking/netdevices.txt b/Documentation/networking/netdevices.txt index 3c0a5ba614d..0b1cf6b2a59 100644 --- a/Documentation/networking/netdevices.txt +++ b/Documentation/networking/netdevices.txt @@ -10,49 +10,74 @@ network devices. struct net_device allocation rules ================================== Network device structures need to persist even after module is unloaded and -must be allocated with kmalloc. If device has registered successfully, -it will be freed on last use by free_netdev. This is required to handle the -pathologic case cleanly (example: rmmod mydriver </sys/class/net/myeth/mtu ) +must be allocated with alloc_netdev_mqs() and friends. +If device has registered successfully, it will be freed on last use +by free_netdev(). This is required to handle the pathologic case cleanly +(example: rmmod mydriver </sys/class/net/myeth/mtu ) -There are routines in net_init.c to handle the common cases of -alloc_etherdev, alloc_netdev. These reserve extra space for driver +alloc_netdev_mqs()/alloc_netdev() reserve extra space for driver private data which gets freed when the network device is freed. If separately allocated data is attached to the network device -(dev->priv) then it is up to the module exit handler to free that. +(netdev_priv(dev)) then it is up to the module exit handler to free that. + +MTU +=== +Each network device has a Maximum Transfer Unit. The MTU does not +include any link layer protocol overhead. Upper layer protocols must +not pass a socket buffer (skb) to a device to transmit with more data +than the mtu. The MTU does not include link layer header overhead, so +for example on Ethernet if the standard MTU is 1500 bytes used, the +actual skb will contain up to 1514 bytes because of the Ethernet +header. Devices should allow for the 4 byte VLAN header as well. + +Segmentation Offload (GSO, TSO) is an exception to this rule. The +upper layer protocol may pass a large socket buffer to the device +transmit routine, and the device will break that up into separate +packets based on the current MTU. + +MTU is symmetrical and applies both to receive and transmit. A device +must be able to receive at least the maximum size packet allowed by +the MTU. A network device may use the MTU as mechanism to size receive +buffers, but the device should allow packets with VLAN header. With +standard Ethernet mtu of 1500 bytes, the device should allow up to +1518 byte packets (1500 + 14 header + 4 tag). The device may either: +drop, truncate, or pass up oversize packets, but dropping oversize +packets is preferred. struct net_device synchronization rules ======================================= -dev->open: +ndo_open: Synchronization: rtnl_lock() semaphore. Context: process -dev->stop: +ndo_stop: Synchronization: rtnl_lock() semaphore. Context: process - Note1: netif_running() is guaranteed false - Note2: dev->poll() is guaranteed to be stopped + Note: netif_running() is guaranteed false -dev->do_ioctl: +ndo_do_ioctl: Synchronization: rtnl_lock() semaphore. Context: process -dev->get_stats: +ndo_get_stats: Synchronization: dev_base_lock rwlock. Context: nominally process, but don't sleep inside an rwlock -dev->hard_start_xmit: - Synchronization: dev->xmit_lock spinlock. +ndo_start_xmit: + Synchronization: __netif_tx_lock spinlock. + When the driver sets NETIF_F_LLTX in dev->features this will be - called without holding xmit_lock. In this case the driver + called without holding netif_tx_lock. In this case the driver has to lock by itself when needed. It is recommended to use a try lock - for this and return -1 when the spin lock fails. + for this and return NETDEV_TX_LOCKED when the spin lock fails. The locking there should also properly protect against - set_multicast_list - Context: BHs disabled - Notes: netif_queue_stopped() is guaranteed false - Interrupts must be enabled when calling hard_start_xmit. - (Interrupts must also be enabled when enabling the BH handler.) + set_rx_mode. Note that the use of NETIF_F_LLTX is deprecated. + Don't use it for new drivers. + + Context: Process with BHs disabled or BH (timer), + will be called with interrupts disabled by netconsole. + Return codes: o NETDEV_TX_OK everything ok. o NETDEV_TX_BUSY Cannot transmit packet, try later @@ -61,17 +86,22 @@ dev->hard_start_xmit: o NETDEV_TX_LOCKED Locking failed, please retry quickly. Only valid when NETIF_F_LLTX is set. -dev->tx_timeout: - Synchronization: dev->xmit_lock spinlock. +ndo_tx_timeout: + Synchronization: netif_tx_lock spinlock; all TX queues frozen. Context: BHs disabled Notes: netif_queue_stopped() is guaranteed true -dev->set_multicast_list: - Synchronization: dev->xmit_lock spinlock. +ndo_set_rx_mode: + Synchronization: netif_addr_lock spinlock. Context: BHs disabled -dev->poll: - Synchronization: __LINK_STATE_RX_SCHED bit in dev->state. See - dev_close code and comments in net/core/dev.c for more info. +struct napi_struct synchronization rules +======================================== +napi->poll: + Synchronization: NAPI_STATE_SCHED bit in napi->state. Device + driver's ndo_stop method will invoke napi_disable() on + all NAPI instances which will do a sleeping poll on the + NAPI_STATE_SCHED napi->state bit, waiting for all pending + NAPI activity to cease. Context: softirq - + will be called with interrupts disabled by netconsole. diff --git a/Documentation/networking/netif-msg.txt b/Documentation/networking/netif-msg.txt index 18ad4cea625..c967ddb90d0 100644 --- a/Documentation/networking/netif-msg.txt +++ b/Documentation/networking/netif-msg.txt @@ -40,7 +40,7 @@ History Per-interface rather than per-driver message level setting. More selective control over the type of messages emitted. - The netif_msg recommandation adds these features with only a minor + The netif_msg recommendation adds these features with only a minor complexity and code size increase. The recommendation is the following points diff --git a/Documentation/networking/netlink_mmap.txt b/Documentation/networking/netlink_mmap.txt new file mode 100644 index 00000000000..c6af4bac5aa --- /dev/null +++ b/Documentation/networking/netlink_mmap.txt @@ -0,0 +1,339 @@ +This file documents how to use memory mapped I/O with netlink. + +Author: Patrick McHardy <kaber@trash.net> + +Overview +-------- + +Memory mapped netlink I/O can be used to increase throughput and decrease +overhead of unicast receive and transmit operations. Some netlink subsystems +require high throughput, these are mainly the netfilter subsystems +nfnetlink_queue and nfnetlink_log, but it can also help speed up large +dump operations of f.i. the routing database. + +Memory mapped netlink I/O used two circular ring buffers for RX and TX which +are mapped into the processes address space. + +The RX ring is used by the kernel to directly construct netlink messages into +user-space memory without copying them as done with regular socket I/O, +additionally as long as the ring contains messages no recvmsg() or poll() +syscalls have to be issued by user-space to get more message. + +The TX ring is used to process messages directly from user-space memory, the +kernel processes all messages contained in the ring using a single sendmsg() +call. + +Usage overview +-------------- + +In order to use memory mapped netlink I/O, user-space needs three main changes: + +- ring setup +- conversion of the RX path to get messages from the ring instead of recvmsg() +- conversion of the TX path to construct messages into the ring + +Ring setup is done using setsockopt() to provide the ring parameters to the +kernel, then a call to mmap() to map the ring into the processes address space: + +- setsockopt(fd, SOL_NETLINK, NETLINK_RX_RING, ¶ms, sizeof(params)); +- setsockopt(fd, SOL_NETLINK, NETLINK_TX_RING, ¶ms, sizeof(params)); +- ring = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0) + +Usage of either ring is optional, but even if only the RX ring is used the +mapping still needs to be writable in order to update the frame status after +processing. + +Conversion of the reception path involves calling poll() on the file +descriptor, once the socket is readable the frames from the ring are +processed in order until no more messages are available, as indicated by +a status word in the frame header. + +On kernel side, in order to make use of memory mapped I/O on receive, the +originating netlink subsystem needs to support memory mapped I/O, otherwise +it will use an allocated socket buffer as usual and the contents will be + copied to the ring on transmission, nullifying most of the performance gains. +Dumps of kernel databases automatically support memory mapped I/O. + +Conversion of the transmit path involves changing message construction to +use memory from the TX ring instead of (usually) a buffer declared on the +stack and setting up the frame header appropriately. Optionally poll() can +be used to wait for free frames in the TX ring. + +Structured and definitions for using memory mapped I/O are contained in +<linux/netlink.h>. + +RX and TX rings +---------------- + +Each ring contains a number of continuous memory blocks, containing frames of +fixed size dependent on the parameters used for ring setup. + +Ring: [ block 0 ] + [ frame 0 ] + [ frame 1 ] + [ block 1 ] + [ frame 2 ] + [ frame 3 ] + ... + [ block n ] + [ frame 2 * n ] + [ frame 2 * n + 1 ] + +The blocks are only visible to the kernel, from the point of view of user-space +the ring just contains the frames in a continuous memory zone. + +The ring parameters used for setting up the ring are defined as follows: + +struct nl_mmap_req { + unsigned int nm_block_size; + unsigned int nm_block_nr; + unsigned int nm_frame_size; + unsigned int nm_frame_nr; +}; + +Frames are grouped into blocks, where each block is a continuous region of memory +and holds nm_block_size / nm_frame_size frames. The total number of frames in +the ring is nm_frame_nr. The following invariants hold: + +- frames_per_block = nm_block_size / nm_frame_size + +- nm_frame_nr = frames_per_block * nm_block_nr + +Some parameters are constrained, specifically: + +- nm_block_size must be a multiple of the architectures memory page size. + The getpagesize() function can be used to get the page size. + +- nm_frame_size must be equal or larger to NL_MMAP_HDRLEN, IOW a frame must be + able to hold at least the frame header + +- nm_frame_size must be smaller or equal to nm_block_size + +- nm_frame_size must be a multiple of NL_MMAP_MSG_ALIGNMENT + +- nm_frame_nr must equal the actual number of frames as specified above. + +When the kernel can't allocate physically continuous memory for a ring block, +it will fall back to use physically discontinuous memory. This might affect +performance negatively, in order to avoid this the nm_frame_size parameter +should be chosen to be as small as possible for the required frame size and +the number of blocks should be increased instead. + +Ring frames +------------ + +Each frames contain a frame header, consisting of a synchronization word and some +meta-data, and the message itself. + +Frame: [ header message ] + +The frame header is defined as follows: + +struct nl_mmap_hdr { + unsigned int nm_status; + unsigned int nm_len; + __u32 nm_group; + /* credentials */ + __u32 nm_pid; + __u32 nm_uid; + __u32 nm_gid; +}; + +- nm_status is used for synchronizing processing between the kernel and user- + space and specifies ownership of the frame as well as the operation to perform + +- nm_len contains the length of the message contained in the data area + +- nm_group specified the destination multicast group of message + +- nm_pid, nm_uid and nm_gid contain the netlink pid, UID and GID of the sending + process. These values correspond to the data available using SOCK_PASSCRED in + the SCM_CREDENTIALS cmsg. + +The possible values in the status word are: + +- NL_MMAP_STATUS_UNUSED: + RX ring: frame belongs to the kernel and contains no message + for user-space. Approriate action is to invoke poll() + to wait for new messages. + + TX ring: frame belongs to user-space and can be used for + message construction. + +- NL_MMAP_STATUS_RESERVED: + RX ring only: frame is currently used by the kernel for message + construction and contains no valid message yet. + Appropriate action is to invoke poll() to wait for + new messages. + +- NL_MMAP_STATUS_VALID: + RX ring: frame contains a valid message. Approriate action is + to process the message and release the frame back to + the kernel by setting the status to + NL_MMAP_STATUS_UNUSED or queue the frame by setting the + status to NL_MMAP_STATUS_SKIP. + + TX ring: the frame contains a valid message from user-space to + be processed by the kernel. After completing processing + the kernel will release the frame back to user-space by + setting the status to NL_MMAP_STATUS_UNUSED. + +- NL_MMAP_STATUS_COPY: + RX ring only: a message is ready to be processed but could not be + stored in the ring, either because it exceeded the + frame size or because the originating subsystem does + not support memory mapped I/O. Appropriate action is + to invoke recvmsg() to receive the message and release + the frame back to the kernel by setting the status to + NL_MMAP_STATUS_UNUSED. + +- NL_MMAP_STATUS_SKIP: + RX ring only: user-space queued the message for later processing, but + processed some messages following it in the ring. The + kernel should skip this frame when looking for unused + frames. + +The data area of a frame begins at a offset of NL_MMAP_HDRLEN relative to the +frame header. + +TX limitations +-------------- + +Kernel processing usually involves validation of the message received by +user-space, then processing its contents. The kernel must assure that +userspace is not able to modify the message contents after they have been +validated. In order to do so, the message is copied from the ring frame +to an allocated buffer if either of these conditions is false: + +- only a single mapping of the ring exists +- the file descriptor is not shared between processes + +This means that for threaded programs, the kernel will fall back to copying. + +Example +------- + +Ring setup: + + unsigned int block_size = 16 * getpagesize(); + struct nl_mmap_req req = { + .nm_block_size = block_size, + .nm_block_nr = 64, + .nm_frame_size = 16384, + .nm_frame_nr = 64 * block_size / 16384, + }; + unsigned int ring_size; + void *rx_ring, *tx_ring; + + /* Configure ring parameters */ + if (setsockopt(fd, SOL_NETLINK, NETLINK_RX_RING, &req, sizeof(req)) < 0) + exit(1); + if (setsockopt(fd, SOL_NETLINK, NETLINK_TX_RING, &req, sizeof(req)) < 0) + exit(1) + + /* Calculate size of each individual ring */ + ring_size = req.nm_block_nr * req.nm_block_size; + + /* Map RX/TX rings. The TX ring is located after the RX ring */ + rx_ring = mmap(NULL, 2 * ring_size, PROT_READ | PROT_WRITE, + MAP_SHARED, fd, 0); + if ((long)rx_ring == -1L) + exit(1); + tx_ring = rx_ring + ring_size: + +Message reception: + +This example assumes some ring parameters of the ring setup are available. + + unsigned int frame_offset = 0; + struct nl_mmap_hdr *hdr; + struct nlmsghdr *nlh; + unsigned char buf[16384]; + ssize_t len; + + while (1) { + struct pollfd pfds[1]; + + pfds[0].fd = fd; + pfds[0].events = POLLIN | POLLERR; + pfds[0].revents = 0; + + if (poll(pfds, 1, -1) < 0 && errno != -EINTR) + exit(1); + + /* Check for errors. Error handling omitted */ + if (pfds[0].revents & POLLERR) + <handle error> + + /* If no new messages, poll again */ + if (!(pfds[0].revents & POLLIN)) + continue; + + /* Process all frames */ + while (1) { + /* Get next frame header */ + hdr = rx_ring + frame_offset; + + if (hdr->nm_status == NL_MMAP_STATUS_VALID) { + /* Regular memory mapped frame */ + nlh = (void *)hdr + NL_MMAP_HDRLEN; + len = hdr->nm_len; + + /* Release empty message immediately. May happen + * on error during message construction. + */ + if (len == 0) + goto release; + } else if (hdr->nm_status == NL_MMAP_STATUS_COPY) { + /* Frame queued to socket receive queue */ + len = recv(fd, buf, sizeof(buf), MSG_DONTWAIT); + if (len <= 0) + break; + nlh = buf; + } else + /* No more messages to process, continue polling */ + break; + + process_msg(nlh); +release: + /* Release frame back to the kernel */ + hdr->nm_status = NL_MMAP_STATUS_UNUSED; + + /* Advance frame offset to next frame */ + frame_offset = (frame_offset + frame_size) % ring_size; + } + } + +Message transmission: + +This example assumes some ring parameters of the ring setup are available. +A single message is constructed and transmitted, to send multiple messages +at once they would be constructed in consecutive frames before a final call +to sendto(). + + unsigned int frame_offset = 0; + struct nl_mmap_hdr *hdr; + struct nlmsghdr *nlh; + struct sockaddr_nl addr = { + .nl_family = AF_NETLINK, + }; + + hdr = tx_ring + frame_offset; + if (hdr->nm_status != NL_MMAP_STATUS_UNUSED) + /* No frame available. Use poll() to avoid. */ + exit(1); + + nlh = (void *)hdr + NL_MMAP_HDRLEN; + + /* Build message */ + build_message(nlh); + + /* Fill frame header: length and status need to be set */ + hdr->nm_len = nlh->nlmsg_len; + hdr->nm_status = NL_MMAP_STATUS_VALID; + + if (sendto(fd, NULL, 0, 0, &addr, sizeof(addr)) < 0) + exit(1); + + /* Advance frame offset to next frame */ + frame_offset = (frame_offset + frame_size) % ring_size; diff --git a/Documentation/networking/nf_conntrack-sysctl.txt b/Documentation/networking/nf_conntrack-sysctl.txt new file mode 100644 index 00000000000..70da5086153 --- /dev/null +++ b/Documentation/networking/nf_conntrack-sysctl.txt @@ -0,0 +1,176 @@ +/proc/sys/net/netfilter/nf_conntrack_* Variables: + +nf_conntrack_acct - BOOLEAN + 0 - disabled (default) + not 0 - enabled + + Enable connection tracking flow accounting. 64-bit byte and packet + counters per flow are added. + +nf_conntrack_buckets - INTEGER (read-only) + Size of hash table. If not specified as parameter during module + loading, the default size is calculated by dividing total memory + by 16384 to determine the number of buckets but the hash table will + never have fewer than 32 or more than 16384 buckets. + +nf_conntrack_checksum - BOOLEAN + 0 - disabled + not 0 - enabled (default) + + Verify checksum of incoming packets. Packets with bad checksums are + in INVALID state. If this is enabled, such packets will not be + considered for connection tracking. + +nf_conntrack_count - INTEGER (read-only) + Number of currently allocated flow entries. + +nf_conntrack_events - BOOLEAN + 0 - disabled + not 0 - enabled (default) + + If this option is enabled, the connection tracking code will + provide userspace with connection tracking events via ctnetlink. + +nf_conntrack_events_retry_timeout - INTEGER (seconds) + default 15 + + This option is only relevant when "reliable connection tracking + events" are used. Normally, ctnetlink is "lossy", that is, + events are normally dropped when userspace listeners can't keep up. + + Userspace can request "reliable event mode". When this mode is + active, the conntrack will only be destroyed after the event was + delivered. If event delivery fails, the kernel periodically + re-tries to send the event to userspace. + + This is the maximum interval the kernel should use when re-trying + to deliver the destroy event. + + A higher number means there will be fewer delivery retries and it + will take longer for a backlog to be processed. + +nf_conntrack_expect_max - INTEGER + Maximum size of expectation table. Default value is + nf_conntrack_buckets / 256. Minimum is 1. + +nf_conntrack_frag6_high_thresh - INTEGER + default 262144 + + Maximum memory used to reassemble IPv6 fragments. When + nf_conntrack_frag6_high_thresh bytes of memory is allocated for this + purpose, the fragment handler will toss packets until + nf_conntrack_frag6_low_thresh is reached. + +nf_conntrack_frag6_low_thresh - INTEGER + default 196608 + + See nf_conntrack_frag6_low_thresh + +nf_conntrack_frag6_timeout - INTEGER (seconds) + default 60 + + Time to keep an IPv6 fragment in memory. + +nf_conntrack_generic_timeout - INTEGER (seconds) + default 600 + + Default for generic timeout. This refers to layer 4 unknown/unsupported + protocols. + +nf_conntrack_helper - BOOLEAN + 0 - disabled + not 0 - enabled (default) + + Enable automatic conntrack helper assignment. + +nf_conntrack_icmp_timeout - INTEGER (seconds) + default 30 + + Default for ICMP timeout. + +nf_conntrack_icmpv6_timeout - INTEGER (seconds) + default 30 + + Default for ICMP6 timeout. + +nf_conntrack_log_invalid - INTEGER + 0 - disable (default) + 1 - log ICMP packets + 6 - log TCP packets + 17 - log UDP packets + 33 - log DCCP packets + 41 - log ICMPv6 packets + 136 - log UDPLITE packets + 255 - log packets of any protocol + + Log invalid packets of a type specified by value. + +nf_conntrack_max - INTEGER + Size of connection tracking table. Default value is + nf_conntrack_buckets value * 4. + +nf_conntrack_tcp_be_liberal - BOOLEAN + 0 - disabled (default) + not 0 - enabled + + Be conservative in what you do, be liberal in what you accept from others. + If it's non-zero, we mark only out of window RST segments as INVALID. + +nf_conntrack_tcp_loose - BOOLEAN + 0 - disabled + not 0 - enabled (default) + + If it is set to zero, we disable picking up already established + connections. + +nf_conntrack_tcp_max_retrans - INTEGER + default 3 + + Maximum number of packets that can be retransmitted without + received an (acceptable) ACK from the destination. If this number + is reached, a shorter timer will be started. + +nf_conntrack_tcp_timeout_close - INTEGER (seconds) + default 10 + +nf_conntrack_tcp_timeout_close_wait - INTEGER (seconds) + default 60 + +nf_conntrack_tcp_timeout_established - INTEGER (seconds) + default 432000 (5 days) + +nf_conntrack_tcp_timeout_fin_wait - INTEGER (seconds) + default 120 + +nf_conntrack_tcp_timeout_last_ack - INTEGER (seconds) + default 30 + +nf_conntrack_tcp_timeout_max_retrans - INTEGER (seconds) + default 300 + +nf_conntrack_tcp_timeout_syn_recv - INTEGER (seconds) + default 60 + +nf_conntrack_tcp_timeout_syn_sent - INTEGER (seconds) + default 120 + +nf_conntrack_tcp_timeout_time_wait - INTEGER (seconds) + default 120 + +nf_conntrack_tcp_timeout_unacknowledged - INTEGER (seconds) + default 300 + +nf_conntrack_timestamp - BOOLEAN + 0 - disabled (default) + not 0 - enabled + + Enable connection tracking flow timestamping. + +nf_conntrack_udp_timeout - INTEGER (seconds) + default 30 + +nf_conntrack_udp_timeout_stream2 - INTEGER (seconds) + default 180 + + This extended timeout will be used in case there is an UDP stream + detected. diff --git a/Documentation/networking/nfc.txt b/Documentation/networking/nfc.txt new file mode 100644 index 00000000000..b24c29bdae2 --- /dev/null +++ b/Documentation/networking/nfc.txt @@ -0,0 +1,128 @@ +Linux NFC subsystem +=================== + +The Near Field Communication (NFC) subsystem is required to standardize the +NFC device drivers development and to create an unified userspace interface. + +This document covers the architecture overview, the device driver interface +description and the userspace interface description. + +Architecture overview +--------------------- + +The NFC subsystem is responsible for: + - NFC adapters management; + - Polling for targets; + - Low-level data exchange; + +The subsystem is divided in some parts. The 'core' is responsible for +providing the device driver interface. On the other side, it is also +responsible for providing an interface to control operations and low-level +data exchange. + +The control operations are available to userspace via generic netlink. + +The low-level data exchange interface is provided by the new socket family +PF_NFC. The NFC_SOCKPROTO_RAW performs raw communication with NFC targets. + + + +--------------------------------------+ + | USER SPACE | + +--------------------------------------+ + ^ ^ + | low-level | control + | data exchange | operations + | | + | v + | +-----------+ + | AF_NFC | netlink | + | socket +-----------+ + | raw ^ + | | + v v + +---------+ +-----------+ + | rawsock | <--------> | core | + +---------+ +-----------+ + ^ + | + v + +-----------+ + | driver | + +-----------+ + +Device Driver Interface +----------------------- + +When registering on the NFC subsystem, the device driver must inform the core +of the set of supported NFC protocols and the set of ops callbacks. The ops +callbacks that must be implemented are the following: + +* start_poll - setup the device to poll for targets +* stop_poll - stop on progress polling operation +* activate_target - select and initialize one of the targets found +* deactivate_target - deselect and deinitialize the selected target +* data_exchange - send data and receive the response (transceive operation) + +Userspace interface +-------------------- + +The userspace interface is divided in control operations and low-level data +exchange operation. + +CONTROL OPERATIONS: + +Generic netlink is used to implement the interface to the control operations. +The operations are composed by commands and events, all listed below: + +* NFC_CMD_GET_DEVICE - get specific device info or dump the device list +* NFC_CMD_START_POLL - setup a specific device to polling for targets +* NFC_CMD_STOP_POLL - stop the polling operation in a specific device +* NFC_CMD_GET_TARGET - dump the list of targets found by a specific device + +* NFC_EVENT_DEVICE_ADDED - reports an NFC device addition +* NFC_EVENT_DEVICE_REMOVED - reports an NFC device removal +* NFC_EVENT_TARGETS_FOUND - reports START_POLL results when 1 or more targets +are found + +The user must call START_POLL to poll for NFC targets, passing the desired NFC +protocols through NFC_ATTR_PROTOCOLS attribute. The device remains in polling +state until it finds any target. However, the user can stop the polling +operation by calling STOP_POLL command. In this case, it will be checked if +the requester of STOP_POLL is the same of START_POLL. + +If the polling operation finds one or more targets, the event TARGETS_FOUND is +sent (including the device id). The user must call GET_TARGET to get the list of +all targets found by such device. Each reply message has target attributes with +relevant information such as the supported NFC protocols. + +All polling operations requested through one netlink socket are stopped when +it's closed. + +LOW-LEVEL DATA EXCHANGE: + +The userspace must use PF_NFC sockets to perform any data communication with +targets. All NFC sockets use AF_NFC: + +struct sockaddr_nfc { + sa_family_t sa_family; + __u32 dev_idx; + __u32 target_idx; + __u32 nfc_protocol; +}; + +To establish a connection with one target, the user must create an +NFC_SOCKPROTO_RAW socket and call the 'connect' syscall with the sockaddr_nfc +struct correctly filled. All information comes from NFC_EVENT_TARGETS_FOUND +netlink event. As a target can support more than one NFC protocol, the user +must inform which protocol it wants to use. + +Internally, 'connect' will result in an activate_target call to the driver. +When the socket is closed, the target is deactivated. + +The data format exchanged through the sockets is NFC protocol dependent. For +instance, when communicating with MIFARE tags, the data exchanged are MIFARE +commands and their responses. + +The first received package is the response to the first sent package and so +on. In order to allow valid "empty" responses, every data received has a NULL +header of 1 byte. diff --git a/Documentation/networking/olympic.txt b/Documentation/networking/olympic.txt deleted file mode 100644 index c65a94010ea..00000000000 --- a/Documentation/networking/olympic.txt +++ /dev/null @@ -1,79 +0,0 @@ - -IBM PCI Pit/Pit-Phy/Olympic CHIPSET BASED TOKEN RING CARDS README - -Release 0.2.0 - Release - June 8th 1999 Peter De Schrijver & Mike Phillips -Release 0.9.C - Release - April 18th 2001 Mike Phillips - -Thanks: -Erik De Cock, Adrian Bridgett and Frank Fiene for their -patience and testing. -Donald Champion for the cardbus support -Kyle Lucke for the dma api changes. -Jonathon Bitner for hardware support. -Everybody on linux-tr for their continued support. - -Options: - -The driver accepts four options: ringspeed, pkt_buf_sz, -message_level and network_monitor. - -These options can be specified differently for each card found. - -ringspeed: Has one of three settings 0 (default), 4 or 16. 0 will -make the card autosense the ringspeed and join at the appropriate speed, -this will be the default option for most people. 4 or 16 allow you to -explicitly force the card to operate at a certain speed. The card will fail -if you try to insert it at the wrong speed. (Although some hubs will allow -this so be *very* careful). The main purpose for explicitly setting the ring -speed is for when the card is first on the ring. In autosense mode, if the card -cannot detect any active monitors on the ring it will not open, so you must -re-init the card at the appropriate speed. Unfortunately at present the only -way of doing this is rmmod and insmod which is a bit tough if it is compiled -in the kernel. - -pkt_buf_sz: This is this initial receive buffer allocation size. This will -default to 4096 if no value is entered. You may increase performance of the -driver by setting this to a value larger than the network packet size, although -the driver now re-sizes buffers based on MTU settings as well. - -message_level: Controls level of messages created by the driver. Defaults to 0: -which only displays start-up and critical messages. Presently any non-zero -value will display all soft messages as well. NB This does not turn -debugging messages on, that must be done by modified the source code. - -network_monitor: Any non-zero value will provide a quasi network monitoring -mode. All unexpected MAC frames (beaconing etc.) will be received -by the driver and the source and destination addresses printed. -Also an entry will be added in /proc/net called olympic_tr%d, where tr%d -is the registered device name, i.e tr0, tr1, etc. This displays low -level information about the configuration of the ring and the adapter. -This feature has been designed for network administrators to assist in -the diagnosis of network / ring problems. (This used to OLYMPIC_NETWORK_MONITOR, -but has now changed to allow each adapter to be configured differently and -to alleviate the necessity to re-compile olympic to turn the option on). - -Multi-card: - -The driver will detect multiple cards and will work with shared interrupts, -each card is assigned the next token ring device, i.e. tr0 , tr1, tr2. The -driver should also happily reside in the system with other drivers. It has -been tested with ibmtr.c running, and I personally have had one Olicom PCI -card and two IBM olympic cards (all on the same interrupt), all running -together. - -Variable MTU size: - -The driver can handle a MTU size upto either 4500 or 18000 depending upon -ring speed. The driver also changes the size of the receive buffers as part -of the mtu re-sizing, so if you set mtu = 18000, you will need to be able -to allocate 16 * (sk_buff with 18000 buffer size) call it 18500 bytes per ring -position = 296,000 bytes of memory space, plus of course anything -necessary for the tx sk_buff's. Remember this is per card, so if you are -building routers, gateway's etc, you could start to use a lot of memory -real fast. - - -6/8/99 Peter De Schrijver and Mike Phillips - diff --git a/Documentation/networking/openvswitch.txt b/Documentation/networking/openvswitch.txt new file mode 100644 index 00000000000..37c20ee2455 --- /dev/null +++ b/Documentation/networking/openvswitch.txt @@ -0,0 +1,235 @@ +Open vSwitch datapath developer documentation +============================================= + +The Open vSwitch kernel module allows flexible userspace control over +flow-level packet processing on selected network devices. It can be +used to implement a plain Ethernet switch, network device bonding, +VLAN processing, network access control, flow-based network control, +and so on. + +The kernel module implements multiple "datapaths" (analogous to +bridges), each of which can have multiple "vports" (analogous to ports +within a bridge). Each datapath also has associated with it a "flow +table" that userspace populates with "flows" that map from keys based +on packet headers and metadata to sets of actions. The most common +action forwards the packet to another vport; other actions are also +implemented. + +When a packet arrives on a vport, the kernel module processes it by +extracting its flow key and looking it up in the flow table. If there +is a matching flow, it executes the associated actions. If there is +no match, it queues the packet to userspace for processing (as part of +its processing, userspace will likely set up a flow to handle further +packets of the same type entirely in-kernel). + + +Flow key compatibility +---------------------- + +Network protocols evolve over time. New protocols become important +and existing protocols lose their prominence. For the Open vSwitch +kernel module to remain relevant, it must be possible for newer +versions to parse additional protocols as part of the flow key. It +might even be desirable, someday, to drop support for parsing +protocols that have become obsolete. Therefore, the Netlink interface +to Open vSwitch is designed to allow carefully written userspace +applications to work with any version of the flow key, past or future. + +To support this forward and backward compatibility, whenever the +kernel module passes a packet to userspace, it also passes along the +flow key that it parsed from the packet. Userspace then extracts its +own notion of a flow key from the packet and compares it against the +kernel-provided version: + + - If userspace's notion of the flow key for the packet matches the + kernel's, then nothing special is necessary. + + - If the kernel's flow key includes more fields than the userspace + version of the flow key, for example if the kernel decoded IPv6 + headers but userspace stopped at the Ethernet type (because it + does not understand IPv6), then again nothing special is + necessary. Userspace can still set up a flow in the usual way, + as long as it uses the kernel-provided flow key to do it. + + - If the userspace flow key includes more fields than the + kernel's, for example if userspace decoded an IPv6 header but + the kernel stopped at the Ethernet type, then userspace can + forward the packet manually, without setting up a flow in the + kernel. This case is bad for performance because every packet + that the kernel considers part of the flow must go to userspace, + but the forwarding behavior is correct. (If userspace can + determine that the values of the extra fields would not affect + forwarding behavior, then it could set up a flow anyway.) + +How flow keys evolve over time is important to making this work, so +the following sections go into detail. + + +Flow key format +--------------- + +A flow key is passed over a Netlink socket as a sequence of Netlink +attributes. Some attributes represent packet metadata, defined as any +information about a packet that cannot be extracted from the packet +itself, e.g. the vport on which the packet was received. Most +attributes, however, are extracted from headers within the packet, +e.g. source and destination addresses from Ethernet, IP, or TCP +headers. + +The <linux/openvswitch.h> header file defines the exact format of the +flow key attributes. For informal explanatory purposes here, we write +them as comma-separated strings, with parentheses indicating arguments +and nesting. For example, the following could represent a flow key +corresponding to a TCP packet that arrived on vport 1: + + in_port(1), eth(src=e0:91:f5:21:d0:b2, dst=00:02:e3:0f:80:a4), + eth_type(0x0800), ipv4(src=172.16.0.20, dst=172.18.0.52, proto=17, tos=0, + frag=no), tcp(src=49163, dst=80) + +Often we ellipsize arguments not important to the discussion, e.g.: + + in_port(1), eth(...), eth_type(0x0800), ipv4(...), tcp(...) + + +Wildcarded flow key format +-------------------------- + +A wildcarded flow is described with two sequences of Netlink attributes +passed over the Netlink socket. A flow key, exactly as described above, and an +optional corresponding flow mask. + +A wildcarded flow can represent a group of exact match flows. Each '1' bit +in the mask specifies a exact match with the corresponding bit in the flow key. +A '0' bit specifies a don't care bit, which will match either a '1' or '0' bit +of a incoming packet. Using wildcarded flow can improve the flow set up rate +by reduce the number of new flows need to be processed by the user space program. + +Support for the mask Netlink attribute is optional for both the kernel and user +space program. The kernel can ignore the mask attribute, installing an exact +match flow, or reduce the number of don't care bits in the kernel to less than +what was specified by the user space program. In this case, variations in bits +that the kernel does not implement will simply result in additional flow setups. +The kernel module will also work with user space programs that neither support +nor supply flow mask attributes. + +Since the kernel may ignore or modify wildcard bits, it can be difficult for +the userspace program to know exactly what matches are installed. There are +two possible approaches: reactively install flows as they miss the kernel +flow table (and therefore not attempt to determine wildcard changes at all) +or use the kernel's response messages to determine the installed wildcards. + +When interacting with userspace, the kernel should maintain the match portion +of the key exactly as originally installed. This will provides a handle to +identify the flow for all future operations. However, when reporting the +mask of an installed flow, the mask should include any restrictions imposed +by the kernel. + +The behavior when using overlapping wildcarded flows is undefined. It is the +responsibility of the user space program to ensure that any incoming packet +can match at most one flow, wildcarded or not. The current implementation +performs best-effort detection of overlapping wildcarded flows and may reject +some but not all of them. However, this behavior may change in future versions. + + +Basic rule for evolving flow keys +--------------------------------- + +Some care is needed to really maintain forward and backward +compatibility for applications that follow the rules listed under +"Flow key compatibility" above. + +The basic rule is obvious: + + ------------------------------------------------------------------ + New network protocol support must only supplement existing flow + key attributes. It must not change the meaning of already defined + flow key attributes. + ------------------------------------------------------------------ + +This rule does have less-obvious consequences so it is worth working +through a few examples. Suppose, for example, that the kernel module +did not already implement VLAN parsing. Instead, it just interpreted +the 802.1Q TPID (0x8100) as the Ethertype then stopped parsing the +packet. The flow key for any packet with an 802.1Q header would look +essentially like this, ignoring metadata: + + eth(...), eth_type(0x8100) + +Naively, to add VLAN support, it makes sense to add a new "vlan" flow +key attribute to contain the VLAN tag, then continue to decode the +encapsulated headers beyond the VLAN tag using the existing field +definitions. With this change, a TCP packet in VLAN 10 would have a +flow key much like this: + + eth(...), vlan(vid=10, pcp=0), eth_type(0x0800), ip(proto=6, ...), tcp(...) + +But this change would negatively affect a userspace application that +has not been updated to understand the new "vlan" flow key attribute. +The application could, following the flow compatibility rules above, +ignore the "vlan" attribute that it does not understand and therefore +assume that the flow contained IP packets. This is a bad assumption +(the flow only contains IP packets if one parses and skips over the +802.1Q header) and it could cause the application's behavior to change +across kernel versions even though it follows the compatibility rules. + +The solution is to use a set of nested attributes. This is, for +example, why 802.1Q support uses nested attributes. A TCP packet in +VLAN 10 is actually expressed as: + + eth(...), eth_type(0x8100), vlan(vid=10, pcp=0), encap(eth_type(0x0800), + ip(proto=6, ...), tcp(...))) + +Notice how the "eth_type", "ip", and "tcp" flow key attributes are +nested inside the "encap" attribute. Thus, an application that does +not understand the "vlan" key will not see either of those attributes +and therefore will not misinterpret them. (Also, the outer eth_type +is still 0x8100, not changed to 0x0800.) + +Handling malformed packets +-------------------------- + +Don't drop packets in the kernel for malformed protocol headers, bad +checksums, etc. This would prevent userspace from implementing a +simple Ethernet switch that forwards every packet. + +Instead, in such a case, include an attribute with "empty" content. +It doesn't matter if the empty content could be valid protocol values, +as long as those values are rarely seen in practice, because userspace +can always forward all packets with those values to userspace and +handle them individually. + +For example, consider a packet that contains an IP header that +indicates protocol 6 for TCP, but which is truncated just after the IP +header, so that the TCP header is missing. The flow key for this +packet would include a tcp attribute with all-zero src and dst, like +this: + + eth(...), eth_type(0x0800), ip(proto=6, ...), tcp(src=0, dst=0) + +As another example, consider a packet with an Ethernet type of 0x8100, +indicating that a VLAN TCI should follow, but which is truncated just +after the Ethernet type. The flow key for this packet would include +an all-zero-bits vlan and an empty encap attribute, like this: + + eth(...), eth_type(0x8100), vlan(0), encap() + +Unlike a TCP packet with source and destination ports 0, an +all-zero-bits VLAN TCI is not that rare, so the CFI bit (aka +VLAN_TAG_PRESENT inside the kernel) is ordinarily set in a vlan +attribute expressly to allow this situation to be distinguished. +Thus, the flow key in this second example unambiguously indicates a +missing or malformed VLAN TCI. + +Other rules +----------- + +The other rules for flow keys are much less subtle: + + - Duplicate attributes are not allowed at a given nesting level. + + - Ordering of attributes is not significant. + + - When the kernel sends a given flow key to userspace, it always + composes it the same way. This allows userspace to hash and + compare entire flow keys that it may not be able to fully + interpret. diff --git a/Documentation/networking/operstates.txt b/Documentation/networking/operstates.txt new file mode 100644 index 00000000000..355c6d8ef8a --- /dev/null +++ b/Documentation/networking/operstates.txt @@ -0,0 +1,162 @@ + +1. Introduction + +Linux distinguishes between administrative and operational state of an +interface. Administrative state is the result of "ip link set dev +<dev> up or down" and reflects whether the administrator wants to use +the device for traffic. + +However, an interface is not usable just because the admin enabled it +- ethernet requires to be plugged into the switch and, depending on +a site's networking policy and configuration, an 802.1X authentication +to be performed before user data can be transferred. Operational state +shows the ability of an interface to transmit this user data. + +Thanks to 802.1X, userspace must be granted the possibility to +influence operational state. To accommodate this, operational state is +split into two parts: Two flags that can be set by the driver only, and +a RFC2863 compatible state that is derived from these flags, a policy, +and changeable from userspace under certain rules. + + +2. Querying from userspace + +Both admin and operational state can be queried via the netlink +operation RTM_GETLINK. It is also possible to subscribe to RTMGRP_LINK +to be notified of updates. This is important for setting from userspace. + +These values contain interface state: + +ifinfomsg::if_flags & IFF_UP: + Interface is admin up +ifinfomsg::if_flags & IFF_RUNNING: + Interface is in RFC2863 operational state UP or UNKNOWN. This is for + backward compatibility, routing daemons, dhcp clients can use this + flag to determine whether they should use the interface. +ifinfomsg::if_flags & IFF_LOWER_UP: + Driver has signaled netif_carrier_on() +ifinfomsg::if_flags & IFF_DORMANT: + Driver has signaled netif_dormant_on() + +TLV IFLA_OPERSTATE + +contains RFC2863 state of the interface in numeric representation: + +IF_OPER_UNKNOWN (0): + Interface is in unknown state, neither driver nor userspace has set + operational state. Interface must be considered for user data as + setting operational state has not been implemented in every driver. +IF_OPER_NOTPRESENT (1): + Unused in current kernel (notpresent interfaces normally disappear), + just a numerical placeholder. +IF_OPER_DOWN (2): + Interface is unable to transfer data on L1, f.e. ethernet is not + plugged or interface is ADMIN down. +IF_OPER_LOWERLAYERDOWN (3): + Interfaces stacked on an interface that is IF_OPER_DOWN show this + state (f.e. VLAN). +IF_OPER_TESTING (4): + Unused in current kernel. +IF_OPER_DORMANT (5): + Interface is L1 up, but waiting for an external event, f.e. for a + protocol to establish. (802.1X) +IF_OPER_UP (6): + Interface is operational up and can be used. + +This TLV can also be queried via sysfs. + +TLV IFLA_LINKMODE + +contains link policy. This is needed for userspace interaction +described below. + +This TLV can also be queried via sysfs. + + +3. Kernel driver API + +Kernel drivers have access to two flags that map to IFF_LOWER_UP and +IFF_DORMANT. These flags can be set from everywhere, even from +interrupts. It is guaranteed that only the driver has write access, +however, if different layers of the driver manipulate the same flag, +the driver has to provide the synchronisation needed. + +__LINK_STATE_NOCARRIER, maps to !IFF_LOWER_UP: + +The driver uses netif_carrier_on() to clear and netif_carrier_off() to +set this flag. On netif_carrier_off(), the scheduler stops sending +packets. The name 'carrier' and the inversion are historical, think of +it as lower layer. + +Note that for certain kind of soft-devices, which are not managing any +real hardware, it is possible to set this bit from userspace. One +should use TVL IFLA_CARRIER to do so. + +netif_carrier_ok() can be used to query that bit. + +__LINK_STATE_DORMANT, maps to IFF_DORMANT: + +Set by the driver to express that the device cannot yet be used +because some driver controlled protocol establishment has to +complete. Corresponding functions are netif_dormant_on() to set the +flag, netif_dormant_off() to clear it and netif_dormant() to query. + +On device allocation, networking core sets the flags equivalent to +netif_carrier_ok() and !netif_dormant(). + + +Whenever the driver CHANGES one of these flags, a workqueue event is +scheduled to translate the flag combination to IFLA_OPERSTATE as +follows: + +!netif_carrier_ok(): + IF_OPER_LOWERLAYERDOWN if the interface is stacked, IF_OPER_DOWN + otherwise. Kernel can recognise stacked interfaces because their + ifindex != iflink. + +netif_carrier_ok() && netif_dormant(): + IF_OPER_DORMANT + +netif_carrier_ok() && !netif_dormant(): + IF_OPER_UP if userspace interaction is disabled. Otherwise + IF_OPER_DORMANT with the possibility for userspace to initiate the + IF_OPER_UP transition afterwards. + + +4. Setting from userspace + +Applications have to use the netlink interface to influence the +RFC2863 operational state of an interface. Setting IFLA_LINKMODE to 1 +via RTM_SETLINK instructs the kernel that an interface should go to +IF_OPER_DORMANT instead of IF_OPER_UP when the combination +netif_carrier_ok() && !netif_dormant() is set by the +driver. Afterwards, the userspace application can set IFLA_OPERSTATE +to IF_OPER_DORMANT or IF_OPER_UP as long as the driver does not set +netif_carrier_off() or netif_dormant_on(). Changes made by userspace +are multicasted on the netlink group RTMGRP_LINK. + +So basically a 802.1X supplicant interacts with the kernel like this: + +-subscribe to RTMGRP_LINK +-set IFLA_LINKMODE to 1 via RTM_SETLINK +-query RTM_GETLINK once to get initial state +-if initial flags are not (IFF_LOWER_UP && !IFF_DORMANT), wait until + netlink multicast signals this state +-do 802.1X, eventually abort if flags go down again +-send RTM_SETLINK to set operstate to IF_OPER_UP if authentication + succeeds, IF_OPER_DORMANT otherwise +-see how operstate and IFF_RUNNING is echoed via netlink multicast +-set interface back to IF_OPER_DORMANT if 802.1X reauthentication + fails +-restart if kernel changes IFF_LOWER_UP or IFF_DORMANT flag + +if supplicant goes down, bring back IFLA_LINKMODE to 0 and +IFLA_OPERSTATE to a sane value. + +A routing daemon or dhcp client just needs to care for IFF_RUNNING or +waiting for operstate to go IF_OPER_UP/IF_OPER_UNKNOWN before +considering the interface / querying a DHCP address. + + +For technical questions and/or comments please e-mail to Stefan Rompf +(stefan at loplof.de). diff --git a/Documentation/networking/packet_mmap.txt b/Documentation/networking/packet_mmap.txt index 8d4cf78258e..38112d512f4 100644 --- a/Documentation/networking/packet_mmap.txt +++ b/Documentation/networking/packet_mmap.txt @@ -2,45 +2,52 @@ + ABSTRACT -------------------------------------------------------------------------------- -This file documents the CONFIG_PACKET_MMAP option available with the PACKET -socket interface on 2.4 and 2.6 kernels. This type of sockets is used for -capture network traffic with utilities like tcpdump or any other that uses -the libpcap library. +This file documents the mmap() facility available with the PACKET +socket interface on 2.4/2.6/3.x kernels. This type of sockets is used for +i) capture network traffic with utilities like tcpdump, ii) transmit network +traffic, or any other that needs raw access to network interface. -You can find the latest version of this document at +You can find the latest version of this document at: + http://wiki.ipxwarzone.com/index.php5?title=Linux_packet_mmap - http://pusa.uv.es/~ulisses/packet_mmap/ +Howto can be found at: + http://wiki.gnu-log.net (packet_mmap) -Please send me your comments to - - Ulisses Alonso Camaró <uaca@i.hate.spam.alumni.uv.es> +Please send your comments to + Ulisses Alonso Camaró <uaca@i.hate.spam.alumni.uv.es> + Johann Baudy <johann.baudy@gnu-log.net> ------------------------------------------------------------------------------- + Why use PACKET_MMAP -------------------------------------------------------------------------------- -In Linux 2.4/2.6 if PACKET_MMAP is not enabled, the capture process is very -inefficient. It uses very limited buffers and requires one system call -to capture each packet, it requires two if you want to get packet's -timestamp (like libpcap always does). +In Linux 2.4/2.6/3.x if PACKET_MMAP is not enabled, the capture process is very +inefficient. It uses very limited buffers and requires one system call to +capture each packet, it requires two if you want to get packet's timestamp +(like libpcap always does). In the other hand PACKET_MMAP is very efficient. PACKET_MMAP provides a size -configurable circular buffer mapped in user space. This way reading packets just -needs to wait for them, most of the time there is no need to issue a single -system call. By using a shared buffer between the kernel and the user +configurable circular buffer mapped in user space that can be used to either +send or receive packets. This way reading packets just needs to wait for them, +most of the time there is no need to issue a single system call. Concerning +transmission, multiple packets can be sent through one system call to get the +highest bandwidth. By using a shared buffer between the kernel and the user also has the benefit of minimizing packet copies. -It's fine to use PACKET_MMAP to improve the performance of the capture process, -but it isn't everything. At least, if you are capturing at high speeds (this -is relative to the cpu speed), you should check if the device driver of your -network interface card supports some sort of interrupt load mitigation or -(even better) if it supports NAPI, also make sure it is enabled. +It's fine to use PACKET_MMAP to improve the performance of the capture and +transmission process, but it isn't everything. At least, if you are capturing +at high speeds (this is relative to the cpu speed), you should check if the +device driver of your network interface card supports some sort of interrupt +load mitigation or (even better) if it supports NAPI, also make sure it is +enabled. For transmission, check the MTU (Maximum Transmission Unit) used and +supported by devices of your network. CPU IRQ pinning of your network interface +card can also be an advantage. -------------------------------------------------------------------------------- -+ How to use CONFIG_PACKET_MMAP ++ How to use mmap() to improve capture process -------------------------------------------------------------------------------- -From the user standpoint, you should use the higher level libpcap library, wich +From the user standpoint, you should use the higher level libpcap library, which is a de facto standard, portable across nearly all operating systems including Win32. @@ -49,7 +56,7 @@ support for PACKET_MMAP, and also probably the libpcap included in your distribu I'm aware of two implementations of PACKET_MMAP in libpcap: - http://pusa.uv.es/~ulisses/packet_mmap/ (by Simon Patarin, based on libpcap 0.6.2) + http://wiki.ipxwarzone.com/ (by Simon Patarin, based on libpcap 0.6.2) http://public.lanl.gov/cpw/ (by Phil Wood, based on lastest libpcap) The rest of this document is intended for people who want to understand @@ -57,7 +64,7 @@ the low level details or want to improve libpcap by including PACKET_MMAP support. -------------------------------------------------------------------------------- -+ How to use CONFIG_PACKET_MMAP directly ++ How to use mmap() directly to improve capture process -------------------------------------------------------------------------------- From the system calls stand point, the use of PACKET_MMAP involves @@ -66,7 +73,8 @@ the following process: [setup] socket() -------> creation of the capture socket setsockopt() ---> allocation of the circular buffer (ring) - mmap() ---------> maping of the allocated buffer to the + option: PACKET_RX_RING + mmap() ---------> mapping of the allocated buffer to the user process [capture] poll() ---------> to wait for incoming packets @@ -79,9 +87,7 @@ the following process: socket creation and destruction is straight forward, and is done the same way with or without PACKET_MMAP: -int fd; - -fd= socket(PF_PACKET, mode, htons(ETH_P_ALL)) + int fd = socket(PF_PACKET, mode, htons(ETH_P_ALL)); where mode is SOCK_RAW for the raw interface were link level information can be captured or SOCK_DGRAM for the cooked @@ -92,18 +98,107 @@ by the kernel. The destruction of the socket and all associated resources is done by a simple call to close(fd). -Next I will describe PACKET_MMAP settings and it's constraints, -also the maping of the circular buffer in the user process and +Similarly as without PACKET_MMAP, it is possible to use one socket +for capture and transmission. This can be done by mapping the +allocated RX and TX buffer ring with a single mmap() call. +See "Mapping and use of the circular buffer (ring)". + +Next I will describe PACKET_MMAP settings and its constraints, +also the mapping of the circular buffer in the user process and the use of this buffer. -------------------------------------------------------------------------------- -+ PACKET_MMAP settings ++ How to use mmap() directly to improve transmission process -------------------------------------------------------------------------------- +Transmission process is similar to capture as shown below. + +[setup] socket() -------> creation of the transmission socket + setsockopt() ---> allocation of the circular buffer (ring) + option: PACKET_TX_RING + bind() ---------> bind transmission socket with a network interface + mmap() ---------> mapping of the allocated buffer to the + user process + +[transmission] poll() ---------> wait for free packets (optional) + send() ---------> send all packets that are set as ready in + the ring + The flag MSG_DONTWAIT can be used to return + before end of transfer. + +[shutdown] close() --------> destruction of the transmission socket and + deallocation of all associated resources. + +Socket creation and destruction is also straight forward, and is done +the same way as in capturing described in the previous paragraph: + + int fd = socket(PF_PACKET, mode, 0); + +The protocol can optionally be 0 in case we only want to transmit +via this socket, which avoids an expensive call to packet_rcv(). +In this case, you also need to bind(2) the TX_RING with sll_protocol = 0 +set. Otherwise, htons(ETH_P_ALL) or any other protocol, for example. + +Binding the socket to your network interface is mandatory (with zero copy) to +know the header size of frames used in the circular buffer. + +As capture, each frame contains two parts: + + -------------------- +| struct tpacket_hdr | Header. It contains the status of +| | of this frame +|--------------------| +| data buffer | +. . Data that will be sent over the network interface. +. . + -------------------- + + bind() associates the socket to your network interface thanks to + sll_ifindex parameter of struct sockaddr_ll. + + Initialization example: + + struct sockaddr_ll my_addr; + struct ifreq s_ifr; + ... + + strncpy (s_ifr.ifr_name, "eth0", sizeof(s_ifr.ifr_name)); + + /* get interface index of eth0 */ + ioctl(this->socket, SIOCGIFINDEX, &s_ifr); + + /* fill sockaddr_ll struct to prepare binding */ + my_addr.sll_family = AF_PACKET; + my_addr.sll_protocol = htons(ETH_P_ALL); + my_addr.sll_ifindex = s_ifr.ifr_ifindex; + + /* bind socket to eth0 */ + bind(this->socket, (struct sockaddr *)&my_addr, sizeof(struct sockaddr_ll)); + A complete tutorial is available at: http://wiki.gnu-log.net/ + +By default, the user should put data at : + frame base + TPACKET_HDRLEN - sizeof(struct sockaddr_ll) + +So, whatever you choose for the socket mode (SOCK_DGRAM or SOCK_RAW), +the beginning of the user data will be at : + frame base + TPACKET_ALIGN(sizeof(struct tpacket_hdr)) + +If you wish to put user data at a custom offset from the beginning of +the frame (for payload alignment with SOCK_RAW mode for instance) you +can set tp_net (with SOCK_DGRAM) or tp_mac (with SOCK_RAW). In order +to make this work it must be enabled previously with setsockopt() +and the PACKET_TX_HAS_OFF option. + +-------------------------------------------------------------------------------- ++ PACKET_MMAP settings +-------------------------------------------------------------------------------- To setup PACKET_MMAP from user level code is done with a call like + - Capture process setsockopt(fd, SOL_PACKET, PACKET_RX_RING, (void *) &req, sizeof(req)) + - Transmission process + setsockopt(fd, SOL_PACKET, PACKET_TX_RING, (void *) &req, sizeof(req)) The most significant argument in the previous call is the req parameter, this parameter must to have the following structure: @@ -117,11 +212,11 @@ this parameter must to have the following structure: }; This structure is defined in /usr/include/linux/if_packet.h and establishes a -circular buffer (ring) of unswappable memory mapped in the capture process. +circular buffer (ring) of unswappable memory. Being mapped in the capture process allows reading the captured frames and related meta-information like timestamps without requiring a system call. -Captured frames are grouped in blocks. Each block is a physically contiguous +Frames are grouped in blocks. Each block is a physically contiguous region of memory and holds tp_block_size/tp_frame_size frames. The total number of blocks is tp_block_nr. Note that tp_frame_nr is a redundant parameter because @@ -131,7 +226,6 @@ indeed, packet_set_ring checks that the following condition is true frames_per_block * tp_block_nr == tp_frame_nr - Lets see an example, with the following values: tp_block_size= 4096 @@ -153,11 +247,10 @@ we will get the following buffer structure: A frame can be of any size with the only condition it can fit in a block. A block can only hold an integer number of frames, or in other words, a frame cannot -be spawn accross two blocks so there are some datails you have to take into -account when choosing the frame_size. See "Maping and use of the circular +be spawned across two blocks, so there are some details you have to take into +account when choosing the frame_size. See "Mapping and use of the circular buffer (ring)". - -------------------------------------------------------------------------------- + PACKET_MMAP setting constraints -------------------------------------------------------------------------------- @@ -194,7 +287,6 @@ User space programs can include /usr/include/sys/user.h and The pagesize can also be determined dynamically with the getpagesize (2) system call. - Block number limit -------------------- @@ -214,11 +306,10 @@ called pg_vec, its size limits the number of blocks that can be allocated. v block #2 block #1 - -kmalloc allocates any number of bytes of phisically contiguous memory from -a pool of pre-determined sizes. This pool of memory is mantained by the slab -allocator wich is at the end the responsible for doing the allocation and -hence wich imposes the maximum memory that kmalloc can allocate. +kmalloc allocates any number of bytes of physically contiguous memory from +a pool of pre-determined sizes. This pool of memory is maintained by the slab +allocator which is at the end the responsible for doing the allocation and +hence which imposes the maximum memory that kmalloc can allocate. In a 2.4/2.6 kernel and the i386 architecture, the limit is 131072 bytes. The predetermined sizes that kmalloc uses can be checked in the "size-<bytes>" @@ -229,7 +320,6 @@ pointers to blocks is 131072/4 = 32768 blocks - PACKET_MMAP buffer size calculator ------------------------------------ @@ -254,7 +344,7 @@ and, the number of frames be <block number> * <block size> / <frame size> -Suposse the following parameters, wich apply for 2.6 kernel and an +Suppose the following parameters, which apply for 2.6 kernel and an i386 architecture: <size-max> = 131072 bytes @@ -262,7 +352,7 @@ i386 architecture: <pagesize> = 4096 bytes <max-order> = 11 -and a value for <frame size> of 2048 byteas. These parameters will yield +and a value for <frame size> of 2048 bytes. These parameters will yield <block number> = 131072/4 = 32768 blocks <block size> = 4096 << 11 = 8 MiB. @@ -270,7 +360,6 @@ and a value for <frame size> of 2048 byteas. These parameters will yield and hence the buffer will have a 262144 MiB size. So it can hold 262144 MiB / 2048 bytes = 134217728 frames - Actually, this buffer size is not possible with an i386 architecture. Remember that the memory is allocated in kernel space, in the case of an i386 kernel's memory size is limited to 1GiB. @@ -278,13 +367,13 @@ an i386 kernel's memory size is limited to 1GiB. All memory allocations are not freed until the socket is closed. The memory allocations are done with GFP_KERNEL priority, this basically means that the allocation can wait and swap other process' memory in order to allocate -the nececessary memory, so normally limits can be reached. +the necessary memory, so normally limits can be reached. Other constraints ------------------- If you check the source code you will see that what I draw here as a frame -is not only the link level frame. At the begining of each frame there is a +is not only the link level frame. At the beginning of each frame there is a header called struct tpacket_hdr used in PACKET_MMAP to hold link level's frame meta information like timestamp. So what we draw here a frame it's really the following (from include/linux/if_packet.h): @@ -296,13 +385,12 @@ the following (from include/linux/if_packet.h): - struct tpacket_hdr - pad to TPACKET_ALIGNMENT=16 - struct sockaddr_ll - - Gap, chosen so that packet data (Start+tp_net) alignes to + - Gap, chosen so that packet data (Start+tp_net) aligns to TPACKET_ALIGNMENT=16 - Start+tp_mac: [ Optional MAC header ] - Start+tp_net: Packet data, aligned to TPACKET_ALIGNMENT=16. - Pad to align to TPACKET_ALIGNMENT=16 */ - The following are conditions that are checked in packet_set_ring @@ -311,14 +399,14 @@ the following (from include/linux/if_packet.h): tp_frame_size must be a multiple of TPACKET_ALIGNMENT tp_frame_nr must be exactly frames_per_block*tp_block_nr -Note that tp_block_size should be choosed to be a power of two or there will +Note that tp_block_size should be chosen to be a power of two or there will be a waste of memory. -------------------------------------------------------------------------------- -+ Maping and use of the circular buffer (ring) ++ Mapping and use of the circular buffer (ring) -------------------------------------------------------------------------------- -The maping of the buffer in the user process is done with the conventional +The mapping of the buffer in the user process is done with the conventional mmap function. Even the circular buffer is compound of several physically discontiguous blocks of memory, they are contiguous to the user space, hence just one call to mmap is needed: @@ -326,23 +414,36 @@ just one call to mmap is needed: mmap(0, size, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0); If tp_frame_size is a divisor of tp_block_size frames will be -contiguosly spaced by tp_frame_size bytes. If not, each +contiguously spaced by tp_frame_size bytes. If not, each tp_block_size/tp_frame_size frames there will be a gap between the frames. This is because a frame cannot be spawn across two blocks. +To use one socket for capture and transmission, the mapping of both the +RX and TX buffer ring has to be done with one call to mmap: + + ... + setsockopt(fd, SOL_PACKET, PACKET_RX_RING, &foo, sizeof(foo)); + setsockopt(fd, SOL_PACKET, PACKET_TX_RING, &bar, sizeof(bar)); + ... + rx_ring = mmap(0, size * 2, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0); + tx_ring = rx_ring + size; + +RX must be the first as the kernel maps the TX ring memory right +after the RX one. + At the beginning of each frame there is an status field (see struct tpacket_hdr). If this field is 0 means that the frame is ready to be used for the kernel, If not, there is a frame the user can read and the following flags apply: ++++ Capture process: from include/linux/if_packet.h #define TP_STATUS_COPY 2 #define TP_STATUS_LOSING 4 #define TP_STATUS_CSUMNOTREADY 8 - TP_STATUS_COPY : This flag indicates that the frame (and associated meta information) has been truncated because it's larger than tp_frame_size. This packet can be @@ -352,7 +453,7 @@ TP_STATUS_COPY : This flag indicates that the frame (and associated enabled previously with setsockopt() and the PACKET_COPY_THRESH option. - The number of frames than can be buffered to + The number of frames that can be buffered to be read with recvfrom is limited like a normal socket. See the SO_RCVBUF option in the socket (7) man page. @@ -360,8 +461,8 @@ TP_STATUS_LOSING : indicates there were packet drops from last time statistics where checked with getsockopt() and the PACKET_STATISTICS option. -TP_STATUS_CSUMNOTREADY: currently it's used for outgoing IP packets wich - it's checksum will be done in hardware. So while +TP_STATUS_CSUMNOTREADY: currently it's used for outgoing IP packets which + its checksum will be done in hardware. So while reading the packet we should not try to check the checksum. @@ -391,6 +492,573 @@ packets are in the ring: It doesn't incur in a race condition to first check the status value and then poll for frames. +++ Transmission process +Those defines are also used for transmission: + + #define TP_STATUS_AVAILABLE 0 // Frame is available + #define TP_STATUS_SEND_REQUEST 1 // Frame will be sent on next send() + #define TP_STATUS_SENDING 2 // Frame is currently in transmission + #define TP_STATUS_WRONG_FORMAT 4 // Frame format is not correct + +First, the kernel initializes all frames to TP_STATUS_AVAILABLE. To send a +packet, the user fills a data buffer of an available frame, sets tp_len to +current data buffer size and sets its status field to TP_STATUS_SEND_REQUEST. +This can be done on multiple frames. Once the user is ready to transmit, it +calls send(). Then all buffers with status equal to TP_STATUS_SEND_REQUEST are +forwarded to the network device. The kernel updates each status of sent +frames with TP_STATUS_SENDING until the end of transfer. +At the end of each transfer, buffer status returns to TP_STATUS_AVAILABLE. + + header->tp_len = in_i_size; + header->tp_status = TP_STATUS_SEND_REQUEST; + retval = send(this->socket, NULL, 0, 0); + +The user can also use poll() to check if a buffer is available: +(status == TP_STATUS_SENDING) + + struct pollfd pfd; + pfd.fd = fd; + pfd.revents = 0; + pfd.events = POLLOUT; + retval = poll(&pfd, 1, timeout); + +------------------------------------------------------------------------------- ++ What TPACKET versions are available and when to use them? +------------------------------------------------------------------------------- + + int val = tpacket_version; + setsockopt(fd, SOL_PACKET, PACKET_VERSION, &val, sizeof(val)); + getsockopt(fd, SOL_PACKET, PACKET_VERSION, &val, sizeof(val)); + +where 'tpacket_version' can be TPACKET_V1 (default), TPACKET_V2, TPACKET_V3. + +TPACKET_V1: + - Default if not otherwise specified by setsockopt(2) + - RX_RING, TX_RING available + +TPACKET_V1 --> TPACKET_V2: + - Made 64 bit clean due to unsigned long usage in TPACKET_V1 + structures, thus this also works on 64 bit kernel with 32 bit + userspace and the like + - Timestamp resolution in nanoseconds instead of microseconds + - RX_RING, TX_RING available + - VLAN metadata information available for packets + (TP_STATUS_VLAN_VALID, TP_STATUS_VLAN_TPID_VALID), + in the tpacket2_hdr structure: + - TP_STATUS_VLAN_VALID bit being set into the tp_status field indicates + that the tp_vlan_tci field has valid VLAN TCI value + - TP_STATUS_VLAN_TPID_VALID bit being set into the tp_status field + indicates that the tp_vlan_tpid field has valid VLAN TPID value + - How to switch to TPACKET_V2: + 1. Replace struct tpacket_hdr by struct tpacket2_hdr + 2. Query header len and save + 3. Set protocol version to 2, set up ring as usual + 4. For getting the sockaddr_ll, + use (void *)hdr + TPACKET_ALIGN(hdrlen) instead of + (void *)hdr + TPACKET_ALIGN(sizeof(struct tpacket_hdr)) + +TPACKET_V2 --> TPACKET_V3: + - Flexible buffer implementation: + 1. Blocks can be configured with non-static frame-size + 2. Read/poll is at a block-level (as opposed to packet-level) + 3. Added poll timeout to avoid indefinite user-space wait + on idle links + 4. Added user-configurable knobs: + 4.1 block::timeout + 4.2 tpkt_hdr::sk_rxhash + - RX Hash data available in user space + - Currently only RX_RING available + +------------------------------------------------------------------------------- ++ AF_PACKET fanout mode +------------------------------------------------------------------------------- + +In the AF_PACKET fanout mode, packet reception can be load balanced among +processes. This also works in combination with mmap(2) on packet sockets. + +Currently implemented fanout policies are: + + - PACKET_FANOUT_HASH: schedule to socket by skb's packet hash + - PACKET_FANOUT_LB: schedule to socket by round-robin + - PACKET_FANOUT_CPU: schedule to socket by CPU packet arrives on + - PACKET_FANOUT_RND: schedule to socket by random selection + - PACKET_FANOUT_ROLLOVER: if one socket is full, rollover to another + - PACKET_FANOUT_QM: schedule to socket by skbs recorded queue_mapping + +Minimal example code by David S. Miller (try things like "./test eth0 hash", +"./test eth0 lb", etc.): + +#include <stddef.h> +#include <stdlib.h> +#include <stdio.h> +#include <string.h> + +#include <sys/types.h> +#include <sys/wait.h> +#include <sys/socket.h> +#include <sys/ioctl.h> + +#include <unistd.h> + +#include <linux/if_ether.h> +#include <linux/if_packet.h> + +#include <net/if.h> + +static const char *device_name; +static int fanout_type; +static int fanout_id; + +#ifndef PACKET_FANOUT +# define PACKET_FANOUT 18 +# define PACKET_FANOUT_HASH 0 +# define PACKET_FANOUT_LB 1 +#endif + +static int setup_socket(void) +{ + int err, fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_IP)); + struct sockaddr_ll ll; + struct ifreq ifr; + int fanout_arg; + + if (fd < 0) { + perror("socket"); + return EXIT_FAILURE; + } + + memset(&ifr, 0, sizeof(ifr)); + strcpy(ifr.ifr_name, device_name); + err = ioctl(fd, SIOCGIFINDEX, &ifr); + if (err < 0) { + perror("SIOCGIFINDEX"); + return EXIT_FAILURE; + } + + memset(&ll, 0, sizeof(ll)); + ll.sll_family = AF_PACKET; + ll.sll_ifindex = ifr.ifr_ifindex; + err = bind(fd, (struct sockaddr *) &ll, sizeof(ll)); + if (err < 0) { + perror("bind"); + return EXIT_FAILURE; + } + + fanout_arg = (fanout_id | (fanout_type << 16)); + err = setsockopt(fd, SOL_PACKET, PACKET_FANOUT, + &fanout_arg, sizeof(fanout_arg)); + if (err) { + perror("setsockopt"); + return EXIT_FAILURE; + } + + return fd; +} + +static void fanout_thread(void) +{ + int fd = setup_socket(); + int limit = 10000; + + if (fd < 0) + exit(fd); + + while (limit-- > 0) { + char buf[1600]; + int err; + + err = read(fd, buf, sizeof(buf)); + if (err < 0) { + perror("read"); + exit(EXIT_FAILURE); + } + if ((limit % 10) == 0) + fprintf(stdout, "(%d) \n", getpid()); + } + + fprintf(stdout, "%d: Received 10000 packets\n", getpid()); + + close(fd); + exit(0); +} + +int main(int argc, char **argp) +{ + int fd, err; + int i; + + if (argc != 3) { + fprintf(stderr, "Usage: %s INTERFACE {hash|lb}\n", argp[0]); + return EXIT_FAILURE; + } + + if (!strcmp(argp[2], "hash")) + fanout_type = PACKET_FANOUT_HASH; + else if (!strcmp(argp[2], "lb")) + fanout_type = PACKET_FANOUT_LB; + else { + fprintf(stderr, "Unknown fanout type [%s]\n", argp[2]); + exit(EXIT_FAILURE); + } + + device_name = argp[1]; + fanout_id = getpid() & 0xffff; + + for (i = 0; i < 4; i++) { + pid_t pid = fork(); + + switch (pid) { + case 0: + fanout_thread(); + + case -1: + perror("fork"); + exit(EXIT_FAILURE); + } + } + + for (i = 0; i < 4; i++) { + int status; + + wait(&status); + } + + return 0; +} + +------------------------------------------------------------------------------- ++ AF_PACKET TPACKET_V3 example +------------------------------------------------------------------------------- + +AF_PACKET's TPACKET_V3 ring buffer can be configured to use non-static frame +sizes by doing it's own memory management. It is based on blocks where polling +works on a per block basis instead of per ring as in TPACKET_V2 and predecessor. + +It is said that TPACKET_V3 brings the following benefits: + *) ~15 - 20% reduction in CPU-usage + *) ~20% increase in packet capture rate + *) ~2x increase in packet density + *) Port aggregation analysis + *) Non static frame size to capture entire packet payload + +So it seems to be a good candidate to be used with packet fanout. + +Minimal example code by Daniel Borkmann based on Chetan Loke's lolpcap (compile +it with gcc -Wall -O2 blob.c, and try things like "./a.out eth0", etc.): + +/* Written from scratch, but kernel-to-user space API usage + * dissected from lolpcap: + * Copyright 2011, Chetan Loke <loke.chetan@gmail.com> + * License: GPL, version 2.0 + */ + +#include <stdio.h> +#include <stdlib.h> +#include <stdint.h> +#include <string.h> +#include <assert.h> +#include <net/if.h> +#include <arpa/inet.h> +#include <netdb.h> +#include <poll.h> +#include <unistd.h> +#include <signal.h> +#include <inttypes.h> +#include <sys/socket.h> +#include <sys/mman.h> +#include <linux/if_packet.h> +#include <linux/if_ether.h> +#include <linux/ip.h> + +#ifndef likely +# define likely(x) __builtin_expect(!!(x), 1) +#endif +#ifndef unlikely +# define unlikely(x) __builtin_expect(!!(x), 0) +#endif + +struct block_desc { + uint32_t version; + uint32_t offset_to_priv; + struct tpacket_hdr_v1 h1; +}; + +struct ring { + struct iovec *rd; + uint8_t *map; + struct tpacket_req3 req; +}; + +static unsigned long packets_total = 0, bytes_total = 0; +static sig_atomic_t sigint = 0; + +static void sighandler(int num) +{ + sigint = 1; +} + +static int setup_socket(struct ring *ring, char *netdev) +{ + int err, i, fd, v = TPACKET_V3; + struct sockaddr_ll ll; + unsigned int blocksiz = 1 << 22, framesiz = 1 << 11; + unsigned int blocknum = 64; + + fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL)); + if (fd < 0) { + perror("socket"); + exit(1); + } + + err = setsockopt(fd, SOL_PACKET, PACKET_VERSION, &v, sizeof(v)); + if (err < 0) { + perror("setsockopt"); + exit(1); + } + + memset(&ring->req, 0, sizeof(ring->req)); + ring->req.tp_block_size = blocksiz; + ring->req.tp_frame_size = framesiz; + ring->req.tp_block_nr = blocknum; + ring->req.tp_frame_nr = (blocksiz * blocknum) / framesiz; + ring->req.tp_retire_blk_tov = 60; + ring->req.tp_feature_req_word = TP_FT_REQ_FILL_RXHASH; + + err = setsockopt(fd, SOL_PACKET, PACKET_RX_RING, &ring->req, + sizeof(ring->req)); + if (err < 0) { + perror("setsockopt"); + exit(1); + } + + ring->map = mmap(NULL, ring->req.tp_block_size * ring->req.tp_block_nr, + PROT_READ | PROT_WRITE, MAP_SHARED | MAP_LOCKED, fd, 0); + if (ring->map == MAP_FAILED) { + perror("mmap"); + exit(1); + } + + ring->rd = malloc(ring->req.tp_block_nr * sizeof(*ring->rd)); + assert(ring->rd); + for (i = 0; i < ring->req.tp_block_nr; ++i) { + ring->rd[i].iov_base = ring->map + (i * ring->req.tp_block_size); + ring->rd[i].iov_len = ring->req.tp_block_size; + } + + memset(&ll, 0, sizeof(ll)); + ll.sll_family = PF_PACKET; + ll.sll_protocol = htons(ETH_P_ALL); + ll.sll_ifindex = if_nametoindex(netdev); + ll.sll_hatype = 0; + ll.sll_pkttype = 0; + ll.sll_halen = 0; + + err = bind(fd, (struct sockaddr *) &ll, sizeof(ll)); + if (err < 0) { + perror("bind"); + exit(1); + } + + return fd; +} + +static void display(struct tpacket3_hdr *ppd) +{ + struct ethhdr *eth = (struct ethhdr *) ((uint8_t *) ppd + ppd->tp_mac); + struct iphdr *ip = (struct iphdr *) ((uint8_t *) eth + ETH_HLEN); + + if (eth->h_proto == htons(ETH_P_IP)) { + struct sockaddr_in ss, sd; + char sbuff[NI_MAXHOST], dbuff[NI_MAXHOST]; + + memset(&ss, 0, sizeof(ss)); + ss.sin_family = PF_INET; + ss.sin_addr.s_addr = ip->saddr; + getnameinfo((struct sockaddr *) &ss, sizeof(ss), + sbuff, sizeof(sbuff), NULL, 0, NI_NUMERICHOST); + + memset(&sd, 0, sizeof(sd)); + sd.sin_family = PF_INET; + sd.sin_addr.s_addr = ip->daddr; + getnameinfo((struct sockaddr *) &sd, sizeof(sd), + dbuff, sizeof(dbuff), NULL, 0, NI_NUMERICHOST); + + printf("%s -> %s, ", sbuff, dbuff); + } + + printf("rxhash: 0x%x\n", ppd->hv1.tp_rxhash); +} + +static void walk_block(struct block_desc *pbd, const int block_num) +{ + int num_pkts = pbd->h1.num_pkts, i; + unsigned long bytes = 0; + struct tpacket3_hdr *ppd; + + ppd = (struct tpacket3_hdr *) ((uint8_t *) pbd + + pbd->h1.offset_to_first_pkt); + for (i = 0; i < num_pkts; ++i) { + bytes += ppd->tp_snaplen; + display(ppd); + + ppd = (struct tpacket3_hdr *) ((uint8_t *) ppd + + ppd->tp_next_offset); + } + + packets_total += num_pkts; + bytes_total += bytes; +} + +static void flush_block(struct block_desc *pbd) +{ + pbd->h1.block_status = TP_STATUS_KERNEL; +} + +static void teardown_socket(struct ring *ring, int fd) +{ + munmap(ring->map, ring->req.tp_block_size * ring->req.tp_block_nr); + free(ring->rd); + close(fd); +} + +int main(int argc, char **argp) +{ + int fd, err; + socklen_t len; + struct ring ring; + struct pollfd pfd; + unsigned int block_num = 0, blocks = 64; + struct block_desc *pbd; + struct tpacket_stats_v3 stats; + + if (argc != 2) { + fprintf(stderr, "Usage: %s INTERFACE\n", argp[0]); + return EXIT_FAILURE; + } + + signal(SIGINT, sighandler); + + memset(&ring, 0, sizeof(ring)); + fd = setup_socket(&ring, argp[argc - 1]); + assert(fd > 0); + + memset(&pfd, 0, sizeof(pfd)); + pfd.fd = fd; + pfd.events = POLLIN | POLLERR; + pfd.revents = 0; + + while (likely(!sigint)) { + pbd = (struct block_desc *) ring.rd[block_num].iov_base; + + if ((pbd->h1.block_status & TP_STATUS_USER) == 0) { + poll(&pfd, 1, -1); + continue; + } + + walk_block(pbd, block_num); + flush_block(pbd); + block_num = (block_num + 1) % blocks; + } + + len = sizeof(stats); + err = getsockopt(fd, SOL_PACKET, PACKET_STATISTICS, &stats, &len); + if (err < 0) { + perror("getsockopt"); + exit(1); + } + + fflush(stdout); + printf("\nReceived %u packets, %lu bytes, %u dropped, freeze_q_cnt: %u\n", + stats.tp_packets, bytes_total, stats.tp_drops, + stats.tp_freeze_q_cnt); + + teardown_socket(&ring, fd); + return 0; +} + +------------------------------------------------------------------------------- ++ PACKET_QDISC_BYPASS +------------------------------------------------------------------------------- + +If there is a requirement to load the network with many packets in a similar +fashion as pktgen does, you might set the following option after socket +creation: + + int one = 1; + setsockopt(fd, SOL_PACKET, PACKET_QDISC_BYPASS, &one, sizeof(one)); + +This has the side-effect, that packets sent through PF_PACKET will bypass the +kernel's qdisc layer and are forcedly pushed to the driver directly. Meaning, +packet are not buffered, tc disciplines are ignored, increased loss can occur +and such packets are also not visible to other PF_PACKET sockets anymore. So, +you have been warned; generally, this can be useful for stress testing various +components of a system. + +On default, PACKET_QDISC_BYPASS is disabled and needs to be explicitly enabled +on PF_PACKET sockets. + +------------------------------------------------------------------------------- ++ PACKET_TIMESTAMP +------------------------------------------------------------------------------- + +The PACKET_TIMESTAMP setting determines the source of the timestamp in +the packet meta information for mmap(2)ed RX_RING and TX_RINGs. If your +NIC is capable of timestamping packets in hardware, you can request those +hardware timestamps to be used. Note: you may need to enable the generation +of hardware timestamps with SIOCSHWTSTAMP (see related information from +Documentation/networking/timestamping.txt). + +PACKET_TIMESTAMP accepts the same integer bit field as +SO_TIMESTAMPING. However, only the SOF_TIMESTAMPING_SYS_HARDWARE +and SOF_TIMESTAMPING_RAW_HARDWARE values are recognized by +PACKET_TIMESTAMP. SOF_TIMESTAMPING_SYS_HARDWARE takes precedence over +SOF_TIMESTAMPING_RAW_HARDWARE if both bits are set. + + int req = 0; + req |= SOF_TIMESTAMPING_SYS_HARDWARE; + setsockopt(fd, SOL_PACKET, PACKET_TIMESTAMP, (void *) &req, sizeof(req)) + +For the mmap(2)ed ring buffers, such timestamps are stored in the +tpacket{,2,3}_hdr structure's tp_sec and tp_{n,u}sec members. To determine +what kind of timestamp has been reported, the tp_status field is binary |'ed +with the following possible bits ... + + TP_STATUS_TS_SYS_HARDWARE + TP_STATUS_TS_RAW_HARDWARE + TP_STATUS_TS_SOFTWARE + +... that are equivalent to its SOF_TIMESTAMPING_* counterparts. For the +RX_RING, if none of those 3 are set (i.e. PACKET_TIMESTAMP is not set), +then this means that a software fallback was invoked *within* PF_PACKET's +processing code (less precise). + +Getting timestamps for the TX_RING works as follows: i) fill the ring frames, +ii) call sendto() e.g. in blocking mode, iii) wait for status of relevant +frames to be updated resp. the frame handed over to the application, iv) walk +through the frames to pick up the individual hw/sw timestamps. + +Only (!) if transmit timestamping is enabled, then these bits are combined +with binary | with TP_STATUS_AVAILABLE, so you must check for that in your +application (e.g. !(tp_status & (TP_STATUS_SEND_REQUEST | TP_STATUS_SENDING)) +in a first step to see if the frame belongs to the application, and then +one can extract the type of timestamp in a second step from tp_status)! + +If you don't care about them, thus having it disabled, checking for +TP_STATUS_AVAILABLE resp. TP_STATUS_WRONG_FORMAT is sufficient. If in the +TX_RING part only TP_STATUS_AVAILABLE is set, then the tp_sec and tp_{n,u}sec +members do not contain a valid value. For TX_RINGs, by default no timestamp +is generated! + +See include/linux/net_tstamp.h and Documentation/networking/timestamping +for more information on hardware timestamps. + +------------------------------------------------------------------------------- ++ Miscellaneous bits +------------------------------------------------------------------------------- + +- Packet sockets work well together with Linux socket filters, thus you also + might want to have a look at Documentation/networking/filter.txt + -------------------------------------------------------------------------------- + THANKS -------------------------------------------------------------------------------- diff --git a/Documentation/networking/phonet.txt b/Documentation/networking/phonet.txt new file mode 100644 index 00000000000..81003581f47 --- /dev/null +++ b/Documentation/networking/phonet.txt @@ -0,0 +1,214 @@ +Linux Phonet protocol family +============================ + +Introduction +------------ + +Phonet is a packet protocol used by Nokia cellular modems for both IPC +and RPC. With the Linux Phonet socket family, Linux host processes can +receive and send messages from/to the modem, or any other external +device attached to the modem. The modem takes care of routing. + +Phonet packets can be exchanged through various hardware connections +depending on the device, such as: + - USB with the CDC Phonet interface, + - infrared, + - Bluetooth, + - an RS232 serial port (with a dedicated "FBUS" line discipline), + - the SSI bus with some TI OMAP processors. + + +Packets format +-------------- + +Phonet packets have a common header as follows: + + struct phonethdr { + uint8_t pn_media; /* Media type (link-layer identifier) */ + uint8_t pn_rdev; /* Receiver device ID */ + uint8_t pn_sdev; /* Sender device ID */ + uint8_t pn_res; /* Resource ID or function */ + uint16_t pn_length; /* Big-endian message byte length (minus 6) */ + uint8_t pn_robj; /* Receiver object ID */ + uint8_t pn_sobj; /* Sender object ID */ + }; + +On Linux, the link-layer header includes the pn_media byte (see below). +The next 7 bytes are part of the network-layer header. + +The device ID is split: the 6 higher-order bits constitute the device +address, while the 2 lower-order bits are used for multiplexing, as are +the 8-bit object identifiers. As such, Phonet can be considered as a +network layer with 6 bits of address space and 10 bits for transport +protocol (much like port numbers in IP world). + +The modem always has address number zero. All other device have a their +own 6-bit address. + + +Link layer +---------- + +Phonet links are always point-to-point links. The link layer header +consists of a single Phonet media type byte. It uniquely identifies the +link through which the packet is transmitted, from the modem's +perspective. Each Phonet network device shall prepend and set the media +type byte as appropriate. For convenience, a common phonet_header_ops +link-layer header operations structure is provided. It sets the +media type according to the network device hardware address. + +Linux Phonet network interfaces support a dedicated link layer packets +type (ETH_P_PHONET) which is out of the Ethernet type range. They can +only send and receive Phonet packets. + +The virtual TUN tunnel device driver can also be used for Phonet. This +requires IFF_TUN mode, _without_ the IFF_NO_PI flag. In this case, +there is no link-layer header, so there is no Phonet media type byte. + +Note that Phonet interfaces are not allowed to re-order packets, so +only the (default) Linux FIFO qdisc should be used with them. + + +Network layer +------------- + +The Phonet socket address family maps the Phonet packet header: + + struct sockaddr_pn { + sa_family_t spn_family; /* AF_PHONET */ + uint8_t spn_obj; /* Object ID */ + uint8_t spn_dev; /* Device ID */ + uint8_t spn_resource; /* Resource or function */ + uint8_t spn_zero[...]; /* Padding */ + }; + +The resource field is only used when sending and receiving; +It is ignored by bind() and getsockname(). + + +Low-level datagram protocol +--------------------------- + +Applications can send Phonet messages using the Phonet datagram socket +protocol from the PF_PHONET family. Each socket is bound to one of the +2^10 object IDs available, and can send and receive packets with any +other peer. + + struct sockaddr_pn addr = { .spn_family = AF_PHONET, }; + ssize_t len; + socklen_t addrlen = sizeof(addr); + int fd; + + fd = socket(PF_PHONET, SOCK_DGRAM, 0); + bind(fd, (struct sockaddr *)&addr, sizeof(addr)); + /* ... */ + + sendto(fd, msg, msglen, 0, (struct sockaddr *)&addr, sizeof(addr)); + len = recvfrom(fd, buf, sizeof(buf), 0, + (struct sockaddr *)&addr, &addrlen); + +This protocol follows the SOCK_DGRAM connection-less semantics. +However, connect() and getpeername() are not supported, as they did +not seem useful with Phonet usages (could be added easily). + + +Resource subscription +--------------------- + +A Phonet datagram socket can be subscribed to any number of 8-bits +Phonet resources, as follow: + + uint32_t res = 0xXX; + ioctl(fd, SIOCPNADDRESOURCE, &res); + +Subscription is similarly cancelled using the SIOCPNDELRESOURCE I/O +control request, or when the socket is closed. + +Note that no more than one socket can be subcribed to any given +resource at a time. If not, ioctl() will return EBUSY. + + +Phonet Pipe protocol +-------------------- + +The Phonet Pipe protocol is a simple sequenced packets protocol +with end-to-end congestion control. It uses the passive listening +socket paradigm. The listening socket is bound to an unique free object +ID. Each listening socket can handle up to 255 simultaneous +connections, one per accept()'d socket. + + int lfd, cfd; + + lfd = socket(PF_PHONET, SOCK_SEQPACKET, PN_PROTO_PIPE); + listen (lfd, INT_MAX); + + /* ... */ + cfd = accept(lfd, NULL, NULL); + for (;;) + { + char buf[...]; + ssize_t len = read(cfd, buf, sizeof(buf)); + + /* ... */ + + write(cfd, msg, msglen); + } + +Connections are traditionally established between two endpoints by a +"third party" application. This means that both endpoints are passive. + + +As of Linux kernel version 2.6.39, it is also possible to connect +two endpoints directly, using connect() on the active side. This is +intended to support the newer Nokia Wireless Modem API, as found in +e.g. the Nokia Slim Modem in the ST-Ericsson U8500 platform: + + struct sockaddr_spn spn; + int fd; + + fd = socket(PF_PHONET, SOCK_SEQPACKET, PN_PROTO_PIPE); + memset(&spn, 0, sizeof(spn)); + spn.spn_family = AF_PHONET; + spn.spn_obj = ...; + spn.spn_dev = ...; + spn.spn_resource = 0xD9; + connect(fd, (struct sockaddr *)&spn, sizeof(spn)); + /* normal I/O here ... */ + close(fd); + + +WARNING: +When polling a connected pipe socket for writability, there is an +intrinsic race condition whereby writability might be lost between the +polling and the writing system calls. In this case, the socket will +block until write becomes possible again, unless non-blocking mode +is enabled. + + +The pipe protocol provides two socket options at the SOL_PNPIPE level: + + PNPIPE_ENCAP accepts one integer value (int) of: + + PNPIPE_ENCAP_NONE: The socket operates normally (default). + + PNPIPE_ENCAP_IP: The socket is used as a backend for a virtual IP + interface. This requires CAP_NET_ADMIN capability. GPRS data + support on Nokia modems can use this. Note that the socket cannot + be reliably poll()'d or read() from while in this mode. + + PNPIPE_IFINDEX is a read-only integer value. It contains the + interface index of the network interface created by PNPIPE_ENCAP, + or zero if encapsulation is off. + + PNPIPE_HANDLE is a read-only integer value. It contains the underlying + identifier ("pipe handle") of the pipe. This is only defined for + socket descriptors that are already connected or being connected. + + +Authors +------- + +Linux Phonet was initially written by Sakari Ailus. +Other contributors include Mikä Liljeberg, Andras Domokos, +Carlos Chinea and Rémi Denis-Courmont. +Copyright (C) 2008 Nokia Corporation. diff --git a/Documentation/networking/phy.txt b/Documentation/networking/phy.txt index 29ccae40903..3544c98401f 100644 --- a/Documentation/networking/phy.txt +++ b/Documentation/networking/phy.txt @@ -1,7 +1,7 @@ ------- PHY Abstraction Layer -(Updated 2005-07-21) +(Updated 2008-04-08) Purpose @@ -48,7 +48,7 @@ The MDIO bus time, so it is safe for them to block, waiting for an interrupt to signal the operation is complete - 2) A reset function is necessary. This is used to return the bus to an + 2) A reset function is optional. This is used to return the bus to an initialized state. 3) A probe function is needed. This function should set up anything the bus @@ -62,7 +62,8 @@ The MDIO bus 5) The bus must also be declared somewhere as a device, and registered. As an example for how one driver implemented an mdio bus driver, see - drivers/net/gianfar_mii.c and arch/ppc/syslib/mpc85xx_devices.c + drivers/net/ethernet/freescale/fsl_pq_mdio.c and an associated DTS file + for one of the users. (e.g. "git grep fsl,.*-mdio arch/powerpc/boot/dts/") Connecting to a PHY @@ -96,12 +97,13 @@ Letting the PHY Abstraction Layer do Everything static void adjust_link(struct net_device *dev); Next, you need to know the device name of the PHY connected to this device. - The name will look something like, "phy0:0", where the first number is the - bus id, and the second is the PHY's address on that bus. + The name will look something like, "0:00", where the first number is the + bus id, and the second is the PHY's address on that bus. Typically, + the bus is responsible for making its ID unique. Now, to connect, just call this function: - phydev = phy_connect(dev, phy_name, &adjust_link, flags); + phydev = phy_connect(dev, phy_name, &adjust_link, interface); phydev is a pointer to the phy_device structure which represents the PHY. If phy_connect is successful, it will return the pointer. dev, here, is the @@ -111,10 +113,16 @@ Letting the PHY Abstraction Layer do Everything current state, though the PHY will not yet be truly operational at this point. - flags is a u32 which can optionally contain phy-specific flags. + PHY-specific flags should be set in phydev->dev_flags prior to the call + to phy_connect() such that the underlying PHY driver can check for flags + and perform specific operations based on them. This is useful if the system has put hardware restrictions on the PHY/controller, of which the PHY needs to be aware. + interface is a u32 which specifies the connection type used + between the controller and the PHY. Examples are GMII, MII, + RGMII, and SGMII. For a full list, see include/linux/phy.h + Now just make sure that phydev->supported and phydev->advertising have any values pruned from them which don't make sense for your controller (a 10/100 controller may be connected to a gigabit capable PHY, so you would need to @@ -172,18 +180,6 @@ Doing it all yourself A convenience function to print out the PHY status neatly. - int phy_clear_interrupt(struct phy_device *phydev); - int phy_config_interrupt(struct phy_device *phydev, u32 interrupts); - - Clear the PHY's interrupt, and configure which ones are allowed, - respectively. Currently only supports all on, or all off. - - int phy_enable_interrupts(struct phy_device *phydev); - int phy_disable_interrupts(struct phy_device *phydev); - - Functions which enable/disable PHY interrupts, clearing them - before and after, respectively. - int phy_start_interrupts(struct phy_device *phydev); int phy_stop_interrupts(struct phy_device *phydev); @@ -191,11 +187,10 @@ Doing it all yourself start, or disables then frees them for stop. struct phy_device * phy_attach(struct net_device *dev, const char *phy_id, - u32 flags); + phy_interface_t interface); Attaches a network device to a particular PHY, binding the PHY to a generic - driver if none was found during bus initialization. Passes in - any phy-specific flags as needed. + driver if none was found during bus initialization. int phy_start_aneg(struct phy_device *phydev); @@ -208,12 +203,6 @@ Doing it all yourself Fills the phydev structure with up-to-date information about the current settings in the PHY. - void phy_sanitize_settings(struct phy_device *phydev) - - Resolves differences between currently desired settings, and - supported settings for the given PHY device. Does not make - the changes in the hardware, though. - int phy_ethtool_sset(struct phy_device *phydev, struct ethtool_cmd *cmd); int phy_ethtool_gset(struct phy_device *phydev, struct ethtool_cmd *cmd); @@ -264,15 +253,25 @@ Writing a PHY driver Each driver consists of a number of function pointers: + soft_reset: perform a PHY software reset config_init: configures PHY into a sane state after a reset. For instance, a Davicom PHY requires descrambling disabled. - probe: Does any setup needed by the driver + probe: Allocate phy->priv, optionally refuse to bind. + PHY may not have been reset or had fixups run yet. suspend/resume: power management config_aneg: Changes the speed/duplex/negotiation settings + aneg_done: Determines the auto-negotiation result read_status: Reads the current speed/duplex/negotiation settings ack_interrupt: Clear a pending interrupt + did_interrupt: Checks if the PHY generated an interrupt config_intr: Enable or disable interrupts remove: Does any driver take-down + ts_info: Queries about the HW timestamping status + hwtstamp: Set the PHY HW timestamping configuration + rxtstamp: Requests a receive timestamp at the PHY level for a 'skb' + txtsamp: Requests a transmit timestamp at the PHY level for a 'skb' + set_wol: Enable Wake-on-LAN at the PHY level + get_wol: Get the Wake-on-LAN status at the PHY level Of these, only config_aneg and read_status are required to be assigned by the driver code. The rest are optional. Also, it is @@ -286,3 +285,39 @@ Writing a PHY driver Feel free to look at the Marvell, Cicada, and Davicom drivers in drivers/net/phy/ for examples (the lxt and qsemi drivers have not been tested as of this writing) + +Board Fixups + + Sometimes the specific interaction between the platform and the PHY requires + special handling. For instance, to change where the PHY's clock input is, + or to add a delay to account for latency issues in the data path. In order + to support such contingencies, the PHY Layer allows platform code to register + fixups to be run when the PHY is brought up (or subsequently reset). + + When the PHY Layer brings up a PHY it checks to see if there are any fixups + registered for it, matching based on UID (contained in the PHY device's phy_id + field) and the bus identifier (contained in phydev->dev.bus_id). Both must + match, however two constants, PHY_ANY_ID and PHY_ANY_UID, are provided as + wildcards for the bus ID and UID, respectively. + + When a match is found, the PHY layer will invoke the run function associated + with the fixup. This function is passed a pointer to the phy_device of + interest. It should therefore only operate on that PHY. + + The platform code can either register the fixup using phy_register_fixup(): + + int phy_register_fixup(const char *phy_id, + u32 phy_uid, u32 phy_uid_mask, + int (*run)(struct phy_device *)); + + Or using one of the two stubs, phy_register_fixup_for_uid() and + phy_register_fixup_for_id(): + + int phy_register_fixup_for_uid(u32 phy_uid, u32 phy_uid_mask, + int (*run)(struct phy_device *)); + int phy_register_fixup_for_id(const char *phy_id, + int (*run)(struct phy_device *)); + + The stubs set one of the two matching criteria, and set the other one to + match anything. + diff --git a/Documentation/networking/pktgen.txt b/Documentation/networking/pktgen.txt index cc4b4d04129..0e30c7845b2 100644 --- a/Documentation/networking/pktgen.txt +++ b/Documentation/networking/pktgen.txt @@ -7,7 +7,7 @@ Date: 041221 Enable CONFIG_NET_PKTGEN to compile and build pktgen.o either in kernel or as module. Module is preferred. insmod pktgen if needed. Once running -pktgen creates a thread on each CPU where each thread has affinty it's CPU. +pktgen creates a thread on each CPU where each thread has affinity to its CPU. Monitoring and controlling is done via /proc. Easiest to select a suitable a sample script and configure. @@ -18,7 +18,7 @@ root 129 0.3 0.0 0 0 ? SW 2003 523:20 [pktgen/0] root 130 0.3 0.0 0 0 ? SW 2003 509:50 [pktgen/1] -For montoring and control pktgen creates: +For monitoring and control pktgen creates: /proc/net/pktgen/pgctrl /proc/net/pktgen/kpktgend_X /proc/net/pktgen/ethX @@ -32,7 +32,7 @@ Running: Stopped: eth1 Result: OK: max_before_softirq=10000 -Most important the devices assigend to thread. Note! A device can only belong +Most important the devices assigned to thread. Note! A device can only belong to one thread. @@ -63,8 +63,8 @@ Current: Result: OK: 13101142(c12220741+d880401) usec, 10000000 (60byte,0frags) 763292pps 390Mb/sec (390805504bps) errors: 39664 -Confguring threads and devices -============================== +Configuring threads and devices +================================ This is done via the /proc interface easiest done via pgset in the scripts Examples: @@ -74,7 +74,7 @@ Examples: pgset "pkt_size 9014" sets packet size to 9014 pgset "frags 5" packet will consist of 5 fragments pgset "count 200000" sets number of packets to send, set to zero - for continious sends untill explicitl stopped. + for continuous sends until explicitly stopped. pgset "delay 5000" adds delay to hard_start_xmit(). nanoseconds @@ -90,6 +90,11 @@ Examples: pgset "dstmac 00:00:00:00:00:00" sets MAC destination address pgset "srcmac 00:00:00:00:00:00" sets MAC source address + pgset "queue_map_min 0" Sets the min value of tx queue interval + pgset "queue_map_max 7" Sets the max value of tx queue interval, for multiqueue devices + To select queue 1 of a given device, + use queue_map_min=1 and queue_map_max=1 + pgset "src_mac_count 1" Sets the number of MACs we'll range through. The 'minimum' MAC is what you set with srcmac. @@ -97,9 +102,20 @@ Examples: The 'minimum' MAC is what you set with dstmac. pgset "flag [name]" Set a flag to determine behaviour. Current flags - are: IPSRC_RND #IP Source is random (between min/max), - IPDST_RND, UDPSRC_RND, - UDPDST_RND, MACSRC_RND, MACDST_RND + are: IPSRC_RND # IP source is random (between min/max) + IPDST_RND # IP destination is random + UDPSRC_RND, UDPDST_RND, + MACSRC_RND, MACDST_RND + TXSIZE_RND, IPV6, + MPLS_RND, VID_RND, SVID_RND + FLOW_SEQ, + QUEUE_MAP_RND # queue map random + QUEUE_MAP_CPU # queue map mirrors smp_processor_id() + UDPCSUM, + IPSEC # IPsec encapsulation (needs CONFIG_XFRM) + NODE_ALLOC # node specific memory allocation + + pgset spi SPI_VALUE Set specific SA used to transform packet. pgset "udp_src_min 9" set UDP source port min, If < udp_src_max, then cycle through the port range. @@ -109,13 +125,46 @@ Examples: cycle through the port range. pgset "udp_dst_max 9" set UDP destination port max. + pgset "mpls 0001000a,0002000a,0000000a" set MPLS labels (in this example + outer label=16,middle label=32, + inner label=0 (IPv4 NULL)) Note that + there must be no spaces between the + arguments. Leading zeros are required. + Do not set the bottom of stack bit, + that's done automatically. If you do + set the bottom of stack bit, that + indicates that you want to randomly + generate that address and the flag + MPLS_RND will be turned on. You + can have any mix of random and fixed + labels in the label stack. + + pgset "mpls 0" turn off mpls (or any invalid argument works too!) + + pgset "vlan_id 77" set VLAN ID 0-4095 + pgset "vlan_p 3" set priority bit 0-7 (default 0) + pgset "vlan_cfi 0" set canonical format identifier 0-1 (default 0) + + pgset "svlan_id 22" set SVLAN ID 0-4095 + pgset "svlan_p 3" set priority bit 0-7 (default 0) + pgset "svlan_cfi 0" set canonical format identifier 0-1 (default 0) + + pgset "vlan_id 9999" > 4095 remove vlan and svlan tags + pgset "svlan 9999" > 4095 remove svlan tag + + + pgset "tos XX" set former IPv4 TOS field (e.g. "tos 28" for AF11 no ECN, default 00) + pgset "traffic_class XX" set former IPv6 TRAFFIC CLASS (e.g. "traffic_class B8" for EF no ECN, default 00) + pgset stop aborts injection. Also, ^C aborts generator. + pgset "rate 300M" set rate to 300 Mb/s + pgset "ratep 1000000" set rate to 1Mpps Example scripts =============== -A collection of small tutorial scripts for pktgen is in expamples dir. +A collection of small tutorial scripts for pktgen is in examples dir. pktgen.conf-1-1 # 1 CPU 1 dev pktgen.conf-1-2 # 1 CPU 2 dev @@ -135,6 +184,18 @@ Note when adding devices to a specific CPU there good idea to also assign /proc/irq/XX/smp_affinity so the TX-interrupts gets bound to the same CPU. as this reduces cache bouncing when freeing skb's. +Enable IPsec +============ +Default IPsec transformation with ESP encapsulation plus Transport mode +could be enabled by simply setting: + +pgset "flag IPSEC" +pgset "flows 1" + +To avoid breaking existing testbed scripts for using AH type and tunnel mode, +user could use "pgset spi SPI_VALUE" to specify which formal of transformation +to employ. + Current commands and configuration options ========================================== @@ -167,6 +228,8 @@ pkt_size min_pkt_size max_pkt_size +mpls + udp_src_min udp_src_max @@ -175,12 +238,22 @@ udp_dst_max flag IPSRC_RND - TXSIZE_RND IPDST_RND UDPSRC_RND UDPDST_RND MACSRC_RND MACDST_RND + TXSIZE_RND + IPV6 + MPLS_RND + VID_RND + SVID_RND + FLOW_SEQ + QUEUE_MAP_RND + QUEUE_MAP_CPU + UDPCSUM + IPSEC + NODE_ALLOC dst_min dst_max @@ -199,6 +272,9 @@ src6 flows flowlen +rate +ratep + References: ftp://robur.slu.se/pub/Linux/net-development/pktgen-testing/ ftp://robur.slu.se/pub/Linux/net-development/pktgen-testing/examples/ @@ -211,4 +287,4 @@ Grant Grundler for testing on IA-64 and parisc, Harald Welte, Lennert Buytenhek Stephen Hemminger, Andi Kleen, Dave Miller and many others. -Good luck with the linux net-development.
\ No newline at end of file +Good luck with the linux net-development. diff --git a/Documentation/networking/ppp_generic.txt b/Documentation/networking/ppp_generic.txt index 15b5172fbb9..091d20273dc 100644 --- a/Documentation/networking/ppp_generic.txt +++ b/Documentation/networking/ppp_generic.txt @@ -342,7 +342,7 @@ an interface unit are: numbers on received multilink fragments SC_MP_XSHORTSEQ transmit short multilink sequence nos. - The values of these flags are defined in <linux/if_ppp.h>. Note + The values of these flags are defined in <linux/ppp-ioctl.h>. Note that the values of the SC_MULTILINK, SC_MP_SHORTSEQ and SC_MP_XSHORTSEQ bits are ignored if the CONFIG_PPP_MULTILINK option is not selected. @@ -358,7 +358,7 @@ an interface unit are: * PPPIOCSCOMPRESS sets the parameters for packet compression or decompression. The argument should point to a ppp_option_data - structure (defined in <linux/if_ppp.h>), which contains a + structure (defined in <linux/ppp-ioctl.h>), which contains a pointer/length pair which should describe a block of memory containing a CCP option specifying a compression method and its parameters. The ppp_option_data struct also contains a `transmit' @@ -395,7 +395,7 @@ an interface unit are: * PPPIOCSNPMODE sets the network-protocol mode for a given network protocol. The argument should point to an npioctl struct (defined - in <linux/if_ppp.h>). The `protocol' field gives the PPP protocol + in <linux/ppp-ioctl.h>). The `protocol' field gives the PPP protocol number for the protocol to be affected, and the `mode' field specifies what to do with packets for that protocol: diff --git a/Documentation/networking/proc_net_tcp.txt b/Documentation/networking/proc_net_tcp.txt index 59cb915c371..4a79209e77a 100644 --- a/Documentation/networking/proc_net_tcp.txt +++ b/Documentation/networking/proc_net_tcp.txt @@ -1,8 +1,9 @@ This document describes the interfaces /proc/net/tcp and /proc/net/tcp6. +Note that these interfaces are deprecated in favor of tcp_diag. These /proc interfaces provide information about currently active TCP -connections, and are implemented by tcp_get_info() in net/ipv4/tcp_ipv4.c and -tcp6_get_info() in net/ipv6/tcp_ipv6.c, respectively. +connections, and are implemented by tcp4_seq_show() in net/ipv4/tcp_ipv4.c +and tcp6_seq_show() in net/ipv6/tcp_ipv6.c, respectively. It will first list all listening TCP sockets, and next list all established TCP connections. A typical entry of /proc/net/tcp would look like this (split @@ -25,7 +26,7 @@ up into 3 parts because of the length of the line): 1000 0 54165785 4 cd1e6040 25 4 27 3 -1 | | | | | | | | | |--> slow start size threshold, - | | | | | | | | | or -1 if the treshold + | | | | | | | | | or -1 if the threshold | | | | | | | | | is >= 0xFFFF | | | | | | | | |----> sending congestion window | | | | | | | |-------> (ack.quick<<1)|ack.pingpong diff --git a/Documentation/networking/pt.txt b/Documentation/networking/pt.txt deleted file mode 100644 index 72e888c1d98..00000000000 --- a/Documentation/networking/pt.txt +++ /dev/null @@ -1,58 +0,0 @@ -This is the README for the Gracilis Packetwin device driver, version 0.5 -ALPHA for Linux 1.3.43. - -These files will allow you to talk to the PackeTwin (now know as PT) and -connect through it just like a pair of TNCs. To do this you will also -require the AX.25 code in the kernel enabled. - -There are four files in this archive; this readme, a patch file, a .c file -and finally a .h file. The two program files need to be put into the -drivers/net directory in the Linux source tree, for me this is the -directory /usr/src/linux/drivers/net. The patch file needs to be patched in -at the top of the Linux source tree (/usr/src/linux in my case). - -You will most probably have to edit the pt.c file to suit your own setup, -this should just involve changing some of the defines at the top of the file. -Please note that if you run an external modem you must specify a speed of 0. - -The program is currently setup to run a 4800 baud external modem on port A -and a Kantronics DE-9600 daughter board on port B so if you have this (or -something similar) then you're right. - -To compile in the driver, put the files in the correct place and patch in -the diff. You will have to re-configure the kernel again before you -recompile it. - -The driver is not real good at the moment for finding the card. You can -'help' it by changing the order of the potential addresses in the structure -found in the pt_init() function so the address of where the card is is put -first. - -After compiling, you have to get them going, they are pretty well like any -other net device and just need ifconfig to get them going. -As an example, here is my /etc/rc.net --------------------------- - -# -# Configure the PackeTwin, port A. -/sbin/ifconfig pt0a 44.136.8.87 hw ax25 vk2xlz mtu 512 -/sbin/ifconfig pt0a 44.136.8.87 broadcast 44.136.8.255 netmask 255.255.255.0 -/sbin/route add -net 44.136.8.0 netmask 255.255.255.0 dev pt0a -/sbin/route add -net 44.0.0.0 netmask 255.0.0.0 gw 44.136.8.68 dev pt0a -/sbin/route add -net 138.25.16.0 netmask 255.255.240.0 dev pt0a -/sbin/route add -host 44.136.8.255 dev pt0a -# -# Configure the PackeTwin, port B. -/sbin/ifconfig pt0b 44.136.8.87 hw ax25 vk2xlz-1 mtu 512 -/sbin/ifconfig pt0b 44.136.8.87 broadcast 44.255.255.255 netmask 255.0.0.0 -/sbin/route add -host 44.136.8.216 dev pt0b -/sbin/route add -host 44.136.8.95 dev pt0b -/sbin/route add -host 44.255.255.255 dev pt0b - -This version of the driver comes under the GNU GPL. If you have one of my -previous (non-GPL) versions of the driver, please update to this one. - -I hope that this all works well for you. I would be pleased to hear how -many people use the driver and if it does its job. - - - Craig vk2xlz <csmall@small.dropbear.id.au> diff --git a/Documentation/networking/radiotap-headers.txt b/Documentation/networking/radiotap-headers.txt new file mode 100644 index 00000000000..953331c7984 --- /dev/null +++ b/Documentation/networking/radiotap-headers.txt @@ -0,0 +1,152 @@ +How to use radiotap headers +=========================== + +Pointer to the radiotap include file +------------------------------------ + +Radiotap headers are variable-length and extensible, you can get most of the +information you need to know on them from: + +./include/net/ieee80211_radiotap.h + +This document gives an overview and warns on some corner cases. + + +Structure of the header +----------------------- + +There is a fixed portion at the start which contains a u32 bitmap that defines +if the possible argument associated with that bit is present or not. So if b0 +of the it_present member of ieee80211_radiotap_header is set, it means that +the header for argument index 0 (IEEE80211_RADIOTAP_TSFT) is present in the +argument area. + + < 8-byte ieee80211_radiotap_header > + [ <possible argument bitmap extensions ... > ] + [ <argument> ... ] + +At the moment there are only 13 possible argument indexes defined, but in case +we run out of space in the u32 it_present member, it is defined that b31 set +indicates that there is another u32 bitmap following (shown as "possible +argument bitmap extensions..." above), and the start of the arguments is moved +forward 4 bytes each time. + +Note also that the it_len member __le16 is set to the total number of bytes +covered by the ieee80211_radiotap_header and any arguments following. + + +Requirements for arguments +-------------------------- + +After the fixed part of the header, the arguments follow for each argument +index whose matching bit is set in the it_present member of +ieee80211_radiotap_header. + + - the arguments are all stored little-endian! + + - the argument payload for a given argument index has a fixed size. So + IEEE80211_RADIOTAP_TSFT being present always indicates an 8-byte argument is + present. See the comments in ./include/net/ieee80211_radiotap.h for a nice + breakdown of all the argument sizes + + - the arguments must be aligned to a boundary of the argument size using + padding. So a u16 argument must start on the next u16 boundary if it isn't + already on one, a u32 must start on the next u32 boundary and so on. + + - "alignment" is relative to the start of the ieee80211_radiotap_header, ie, + the first byte of the radiotap header. The absolute alignment of that first + byte isn't defined. So even if the whole radiotap header is starting at, eg, + address 0x00000003, still the first byte of the radiotap header is treated as + 0 for alignment purposes. + + - the above point that there may be no absolute alignment for multibyte + entities in the fixed radiotap header or the argument region means that you + have to take special evasive action when trying to access these multibyte + entities. Some arches like Blackfin cannot deal with an attempt to + dereference, eg, a u16 pointer that is pointing to an odd address. Instead + you have to use a kernel API get_unaligned() to dereference the pointer, + which will do it bytewise on the arches that require that. + + - The arguments for a given argument index can be a compound of multiple types + together. For example IEEE80211_RADIOTAP_CHANNEL has an argument payload + consisting of two u16s of total length 4. When this happens, the padding + rule is applied dealing with a u16, NOT dealing with a 4-byte single entity. + + +Example valid radiotap header +----------------------------- + + 0x00, 0x00, // <-- radiotap version + pad byte + 0x0b, 0x00, // <- radiotap header length + 0x04, 0x0c, 0x00, 0x00, // <-- bitmap + 0x6c, // <-- rate (in 500kHz units) + 0x0c, //<-- tx power + 0x01 //<-- antenna + + +Using the Radiotap Parser +------------------------- + +If you are having to parse a radiotap struct, you can radically simplify the +job by using the radiotap parser that lives in net/wireless/radiotap.c and has +its prototypes available in include/net/cfg80211.h. You use it like this: + +#include <net/cfg80211.h> + +/* buf points to the start of the radiotap header part */ + +int MyFunction(u8 * buf, int buflen) +{ + int pkt_rate_100kHz = 0, antenna = 0, pwr = 0; + struct ieee80211_radiotap_iterator iterator; + int ret = ieee80211_radiotap_iterator_init(&iterator, buf, buflen); + + while (!ret) { + + ret = ieee80211_radiotap_iterator_next(&iterator); + + if (ret) + continue; + + /* see if this argument is something we can use */ + + switch (iterator.this_arg_index) { + /* + * You must take care when dereferencing iterator.this_arg + * for multibyte types... the pointer is not aligned. Use + * get_unaligned((type *)iterator.this_arg) to dereference + * iterator.this_arg for type "type" safely on all arches. + */ + case IEEE80211_RADIOTAP_RATE: + /* radiotap "rate" u8 is in + * 500kbps units, eg, 0x02=1Mbps + */ + pkt_rate_100kHz = (*iterator.this_arg) * 5; + break; + + case IEEE80211_RADIOTAP_ANTENNA: + /* radiotap uses 0 for 1st ant */ + antenna = *iterator.this_arg); + break; + + case IEEE80211_RADIOTAP_DBM_TX_POWER: + pwr = *iterator.this_arg; + break; + + default: + break; + } + } /* while more rt headers */ + + if (ret != -ENOENT) + return TXRX_DROP; + + /* discard the radiotap header part */ + buf += iterator.max_length; + buflen -= iterator.max_length; + + ... + +} + +Andy Green <andy@warmcat.com> diff --git a/Documentation/networking/ray_cs.txt b/Documentation/networking/ray_cs.txt index b1def00bc4a..c0c12307ed9 100644 --- a/Documentation/networking/ray_cs.txt +++ b/Documentation/networking/ray_cs.txt @@ -13,8 +13,8 @@ wireless LAN cards. As of kernel 2.3.18, the ray_cs driver is part of the Linux kernel source. My web page for the development of ray_cs is at -http://world.std.com/~corey/raylink.html and I can be emailed at -corey@world.std.com +http://web.ralinktech.com/ralink/Home/Support/Linux.html +and I can be emailed at corey@world.std.com The kernel driver is based on ray_cs-1.62.tgz @@ -25,12 +25,11 @@ the essid= string parameter is available via the kernel command line. This will change after the method of sorting out parameters for all the PCMCIA drivers is agreed upon. If you must have a built in driver with nondefault parameters, they can be edited in -/usr/src/linux/drivers/net/pcmcia/ray_cs.c. Searching for MODULE_PARM +/usr/src/linux/drivers/net/pcmcia/ray_cs.c. Searching for module_param will find them all. Information on card services is available at: - ftp://hyper.stanford.edu/pub/pcmcia/doc - http://hyper.stanford.edu/HyperNews/get/pcmcia/home.html + http://pcmcia-cs.sourceforge.net/ Card services user programs are still required for PCMCIA devices. diff --git a/Documentation/networking/rds.txt b/Documentation/networking/rds.txt new file mode 100644 index 00000000000..c67077cbeb8 --- /dev/null +++ b/Documentation/networking/rds.txt @@ -0,0 +1,356 @@ + +Overview +======== + +This readme tries to provide some background on the hows and whys of RDS, +and will hopefully help you find your way around the code. + +In addition, please see this email about RDS origins: +http://oss.oracle.com/pipermail/rds-devel/2007-November/000228.html + +RDS Architecture +================ + +RDS provides reliable, ordered datagram delivery by using a single +reliable connection between any two nodes in the cluster. This allows +applications to use a single socket to talk to any other process in the +cluster - so in a cluster with N processes you need N sockets, in contrast +to N*N if you use a connection-oriented socket transport like TCP. + +RDS is not Infiniband-specific; it was designed to support different +transports. The current implementation used to support RDS over TCP as well +as IB. Work is in progress to support RDS over iWARP, and using DCE to +guarantee no dropped packets on Ethernet, it may be possible to use RDS over +UDP in the future. + +The high-level semantics of RDS from the application's point of view are + + * Addressing + RDS uses IPv4 addresses and 16bit port numbers to identify + the end point of a connection. All socket operations that involve + passing addresses between kernel and user space generally + use a struct sockaddr_in. + + The fact that IPv4 addresses are used does not mean the underlying + transport has to be IP-based. In fact, RDS over IB uses a + reliable IB connection; the IP address is used exclusively to + locate the remote node's GID (by ARPing for the given IP). + + The port space is entirely independent of UDP, TCP or any other + protocol. + + * Socket interface + RDS sockets work *mostly* as you would expect from a BSD + socket. The next section will cover the details. At any rate, + all I/O is performed through the standard BSD socket API. + Some additions like zerocopy support are implemented through + control messages, while other extensions use the getsockopt/ + setsockopt calls. + + Sockets must be bound before you can send or receive data. + This is needed because binding also selects a transport and + attaches it to the socket. Once bound, the transport assignment + does not change. RDS will tolerate IPs moving around (eg in + a active-active HA scenario), but only as long as the address + doesn't move to a different transport. + + * sysctls + RDS supports a number of sysctls in /proc/sys/net/rds + + +Socket Interface +================ + + AF_RDS, PF_RDS, SOL_RDS + These constants haven't been assigned yet, because RDS isn't in + mainline yet. Currently, the kernel module assigns some constant + and publishes it to user space through two sysctl files + /proc/sys/net/rds/pf_rds + /proc/sys/net/rds/sol_rds + + fd = socket(PF_RDS, SOCK_SEQPACKET, 0); + This creates a new, unbound RDS socket. + + setsockopt(SOL_SOCKET): send and receive buffer size + RDS honors the send and receive buffer size socket options. + You are not allowed to queue more than SO_SNDSIZE bytes to + a socket. A message is queued when sendmsg is called, and + it leaves the queue when the remote system acknowledges + its arrival. + + The SO_RCVSIZE option controls the maximum receive queue length. + This is a soft limit rather than a hard limit - RDS will + continue to accept and queue incoming messages, even if that + takes the queue length over the limit. However, it will also + mark the port as "congested" and send a congestion update to + the source node. The source node is supposed to throttle any + processes sending to this congested port. + + bind(fd, &sockaddr_in, ...) + This binds the socket to a local IP address and port, and a + transport. + + sendmsg(fd, ...) + Sends a message to the indicated recipient. The kernel will + transparently establish the underlying reliable connection + if it isn't up yet. + + An attempt to send a message that exceeds SO_SNDSIZE will + return with -EMSGSIZE + + An attempt to send a message that would take the total number + of queued bytes over the SO_SNDSIZE threshold will return + EAGAIN. + + An attempt to send a message to a destination that is marked + as "congested" will return ENOBUFS. + + recvmsg(fd, ...) + Receives a message that was queued to this socket. The sockets + recv queue accounting is adjusted, and if the queue length + drops below SO_SNDSIZE, the port is marked uncongested, and + a congestion update is sent to all peers. + + Applications can ask the RDS kernel module to receive + notifications via control messages (for instance, there is a + notification when a congestion update arrived, or when a RDMA + operation completes). These notifications are received through + the msg.msg_control buffer of struct msghdr. The format of the + messages is described in manpages. + + poll(fd) + RDS supports the poll interface to allow the application + to implement async I/O. + + POLLIN handling is pretty straightforward. When there's an + incoming message queued to the socket, or a pending notification, + we signal POLLIN. + + POLLOUT is a little harder. Since you can essentially send + to any destination, RDS will always signal POLLOUT as long as + there's room on the send queue (ie the number of bytes queued + is less than the sendbuf size). + + However, the kernel will refuse to accept messages to + a destination marked congested - in this case you will loop + forever if you rely on poll to tell you what to do. + This isn't a trivial problem, but applications can deal with + this - by using congestion notifications, and by checking for + ENOBUFS errors returned by sendmsg. + + setsockopt(SOL_RDS, RDS_CANCEL_SENT_TO, &sockaddr_in) + This allows the application to discard all messages queued to a + specific destination on this particular socket. + + This allows the application to cancel outstanding messages if + it detects a timeout. For instance, if it tried to send a message, + and the remote host is unreachable, RDS will keep trying forever. + The application may decide it's not worth it, and cancel the + operation. In this case, it would use RDS_CANCEL_SENT_TO to + nuke any pending messages. + + +RDMA for RDS +============ + + see rds-rdma(7) manpage (available in rds-tools) + + +Congestion Notifications +======================== + + see rds(7) manpage + + +RDS Protocol +============ + + Message header + + The message header is a 'struct rds_header' (see rds.h): + Fields: + h_sequence: + per-packet sequence number + h_ack: + piggybacked acknowledgment of last packet received + h_len: + length of data, not including header + h_sport: + source port + h_dport: + destination port + h_flags: + CONG_BITMAP - this is a congestion update bitmap + ACK_REQUIRED - receiver must ack this packet + RETRANSMITTED - packet has previously been sent + h_credit: + indicate to other end of connection that + it has more credits available (i.e. there is + more send room) + h_padding[4]: + unused, for future use + h_csum: + header checksum + h_exthdr: + optional data can be passed here. This is currently used for + passing RDMA-related information. + + ACK and retransmit handling + + One might think that with reliable IB connections you wouldn't need + to ack messages that have been received. The problem is that IB + hardware generates an ack message before it has DMAed the message + into memory. This creates a potential message loss if the HCA is + disabled for any reason between when it sends the ack and before + the message is DMAed and processed. This is only a potential issue + if another HCA is available for fail-over. + + Sending an ack immediately would allow the sender to free the sent + message from their send queue quickly, but could cause excessive + traffic to be used for acks. RDS piggybacks acks on sent data + packets. Ack-only packets are reduced by only allowing one to be + in flight at a time, and by the sender only asking for acks when + its send buffers start to fill up. All retransmissions are also + acked. + + Flow Control + + RDS's IB transport uses a credit-based mechanism to verify that + there is space in the peer's receive buffers for more data. This + eliminates the need for hardware retries on the connection. + + Congestion + + Messages waiting in the receive queue on the receiving socket + are accounted against the sockets SO_RCVBUF option value. Only + the payload bytes in the message are accounted for. If the + number of bytes queued equals or exceeds rcvbuf then the socket + is congested. All sends attempted to this socket's address + should return block or return -EWOULDBLOCK. + + Applications are expected to be reasonably tuned such that this + situation very rarely occurs. An application encountering this + "back-pressure" is considered a bug. + + This is implemented by having each node maintain bitmaps which + indicate which ports on bound addresses are congested. As the + bitmap changes it is sent through all the connections which + terminate in the local address of the bitmap which changed. + + The bitmaps are allocated as connections are brought up. This + avoids allocation in the interrupt handling path which queues + sages on sockets. The dense bitmaps let transports send the + entire bitmap on any bitmap change reasonably efficiently. This + is much easier to implement than some finer-grained + communication of per-port congestion. The sender does a very + inexpensive bit test to test if the port it's about to send to + is congested or not. + + +RDS Transport Layer +================== + + As mentioned above, RDS is not IB-specific. Its code is divided + into a general RDS layer and a transport layer. + + The general layer handles the socket API, congestion handling, + loopback, stats, usermem pinning, and the connection state machine. + + The transport layer handles the details of the transport. The IB + transport, for example, handles all the queue pairs, work requests, + CM event handlers, and other Infiniband details. + + +RDS Kernel Structures +===================== + + struct rds_message + aka possibly "rds_outgoing", the generic RDS layer copies data to + be sent and sets header fields as needed, based on the socket API. + This is then queued for the individual connection and sent by the + connection's transport. + struct rds_incoming + a generic struct referring to incoming data that can be handed from + the transport to the general code and queued by the general code + while the socket is awoken. It is then passed back to the transport + code to handle the actual copy-to-user. + struct rds_socket + per-socket information + struct rds_connection + per-connection information + struct rds_transport + pointers to transport-specific functions + struct rds_statistics + non-transport-specific statistics + struct rds_cong_map + wraps the raw congestion bitmap, contains rbnode, waitq, etc. + +Connection management +===================== + + Connections may be in UP, DOWN, CONNECTING, DISCONNECTING, and + ERROR states. + + The first time an attempt is made by an RDS socket to send data to + a node, a connection is allocated and connected. That connection is + then maintained forever -- if there are transport errors, the + connection will be dropped and re-established. + + Dropping a connection while packets are queued will cause queued or + partially-sent datagrams to be retransmitted when the connection is + re-established. + + +The send path +============= + + rds_sendmsg() + struct rds_message built from incoming data + CMSGs parsed (e.g. RDMA ops) + transport connection alloced and connected if not already + rds_message placed on send queue + send worker awoken + rds_send_worker() + calls rds_send_xmit() until queue is empty + rds_send_xmit() + transmits congestion map if one is pending + may set ACK_REQUIRED + calls transport to send either non-RDMA or RDMA message + (RDMA ops never retransmitted) + rds_ib_xmit() + allocs work requests from send ring + adds any new send credits available to peer (h_credits) + maps the rds_message's sg list + piggybacks ack + populates work requests + post send to connection's queue pair + +The recv path +============= + + rds_ib_recv_cq_comp_handler() + looks at write completions + unmaps recv buffer from device + no errors, call rds_ib_process_recv() + refill recv ring + rds_ib_process_recv() + validate header checksum + copy header to rds_ib_incoming struct if start of a new datagram + add to ibinc's fraglist + if competed datagram: + update cong map if datagram was cong update + call rds_recv_incoming() otherwise + note if ack is required + rds_recv_incoming() + drop duplicate packets + respond to pings + find the sock associated with this datagram + add to sock queue + wake up sock + do some congestion calculations + rds_recvmsg + copy data into user iovec + handle CMSGs + return to application + + diff --git a/Documentation/networking/regulatory.txt b/Documentation/networking/regulatory.txt new file mode 100644 index 00000000000..356f791af57 --- /dev/null +++ b/Documentation/networking/regulatory.txt @@ -0,0 +1,214 @@ +Linux wireless regulatory documentation +--------------------------------------- + +This document gives a brief review over how the Linux wireless +regulatory infrastructure works. + +More up to date information can be obtained at the project's web page: + +http://wireless.kernel.org/en/developers/Regulatory + +Keeping regulatory domains in userspace +--------------------------------------- + +Due to the dynamic nature of regulatory domains we keep them +in userspace and provide a framework for userspace to upload +to the kernel one regulatory domain to be used as the central +core regulatory domain all wireless devices should adhere to. + +How to get regulatory domains to the kernel +------------------------------------------- + +Userspace gets a regulatory domain in the kernel by having +a userspace agent build it and send it via nl80211. Only +expected regulatory domains will be respected by the kernel. + +A currently available userspace agent which can accomplish this +is CRDA - central regulatory domain agent. Its documented here: + +http://wireless.kernel.org/en/developers/Regulatory/CRDA + +Essentially the kernel will send a udev event when it knows +it needs a new regulatory domain. A udev rule can be put in place +to trigger crda to send the respective regulatory domain for a +specific ISO/IEC 3166 alpha2. + +Below is an example udev rule which can be used: + +# Example file, should be put in /etc/udev/rules.d/regulatory.rules +KERNEL=="regulatory*", ACTION=="change", SUBSYSTEM=="platform", RUN+="/sbin/crda" + +The alpha2 is passed as an environment variable under the variable COUNTRY. + +Who asks for regulatory domains? +-------------------------------- + +* Users + +Users can use iw: + +http://wireless.kernel.org/en/users/Documentation/iw + +An example: + + # set regulatory domain to "Costa Rica" + iw reg set CR + +This will request the kernel to set the regulatory domain to +the specificied alpha2. The kernel in turn will then ask userspace +to provide a regulatory domain for the alpha2 specified by the user +by sending a uevent. + +* Wireless subsystems for Country Information elements + +The kernel will send a uevent to inform userspace a new +regulatory domain is required. More on this to be added +as its integration is added. + +* Drivers + +If drivers determine they need a specific regulatory domain +set they can inform the wireless core using regulatory_hint(). +They have two options -- they either provide an alpha2 so that +crda can provide back a regulatory domain for that country or +they can build their own regulatory domain based on internal +custom knowledge so the wireless core can respect it. + +*Most* drivers will rely on the first mechanism of providing a +regulatory hint with an alpha2. For these drivers there is an additional +check that can be used to ensure compliance based on custom EEPROM +regulatory data. This additional check can be used by drivers by +registering on its struct wiphy a reg_notifier() callback. This notifier +is called when the core's regulatory domain has been changed. The driver +can use this to review the changes made and also review who made them +(driver, user, country IE) and determine what to allow based on its +internal EEPROM data. Devices drivers wishing to be capable of world +roaming should use this callback. More on world roaming will be +added to this document when its support is enabled. + +Device drivers who provide their own built regulatory domain +do not need a callback as the channels registered by them are +the only ones that will be allowed and therefore *additional* +channels cannot be enabled. + +Example code - drivers hinting an alpha2: +------------------------------------------ + +This example comes from the zd1211rw device driver. You can start +by having a mapping of your device's EEPROM country/regulatory +domain value to a specific alpha2 as follows: + +static struct zd_reg_alpha2_map reg_alpha2_map[] = { + { ZD_REGDOMAIN_FCC, "US" }, + { ZD_REGDOMAIN_IC, "CA" }, + { ZD_REGDOMAIN_ETSI, "DE" }, /* Generic ETSI, use most restrictive */ + { ZD_REGDOMAIN_JAPAN, "JP" }, + { ZD_REGDOMAIN_JAPAN_ADD, "JP" }, + { ZD_REGDOMAIN_SPAIN, "ES" }, + { ZD_REGDOMAIN_FRANCE, "FR" }, + +Then you can define a routine to map your read EEPROM value to an alpha2, +as follows: + +static int zd_reg2alpha2(u8 regdomain, char *alpha2) +{ + unsigned int i; + struct zd_reg_alpha2_map *reg_map; + for (i = 0; i < ARRAY_SIZE(reg_alpha2_map); i++) { + reg_map = ®_alpha2_map[i]; + if (regdomain == reg_map->reg) { + alpha2[0] = reg_map->alpha2[0]; + alpha2[1] = reg_map->alpha2[1]; + return 0; + } + } + return 1; +} + +Lastly, you can then hint to the core of your discovered alpha2, if a match +was found. You need to do this after you have registered your wiphy. You +are expected to do this during initialization. + + r = zd_reg2alpha2(mac->regdomain, alpha2); + if (!r) + regulatory_hint(hw->wiphy, alpha2); + +Example code - drivers providing a built in regulatory domain: +-------------------------------------------------------------- + +[NOTE: This API is not currently available, it can be added when required] + +If you have regulatory information you can obtain from your +driver and you *need* to use this we let you build a regulatory domain +structure and pass it to the wireless core. To do this you should +kmalloc() a structure big enough to hold your regulatory domain +structure and you should then fill it with your data. Finally you simply +call regulatory_hint() with the regulatory domain structure in it. + +Bellow is a simple example, with a regulatory domain cached using the stack. +Your implementation may vary (read EEPROM cache instead, for example). + +Example cache of some regulatory domain + +struct ieee80211_regdomain mydriver_jp_regdom = { + .n_reg_rules = 3, + .alpha2 = "JP", + //.alpha2 = "99", /* If I have no alpha2 to map it to */ + .reg_rules = { + /* IEEE 802.11b/g, channels 1..14 */ + REG_RULE(2412-20, 2484+20, 40, 6, 20, 0), + /* IEEE 802.11a, channels 34..48 */ + REG_RULE(5170-20, 5240+20, 40, 6, 20, + NL80211_RRF_NO_IR), + /* IEEE 802.11a, channels 52..64 */ + REG_RULE(5260-20, 5320+20, 40, 6, 20, + NL80211_RRF_NO_IR| + NL80211_RRF_DFS), + } +}; + +Then in some part of your code after your wiphy has been registered: + + struct ieee80211_regdomain *rd; + int size_of_regd; + int num_rules = mydriver_jp_regdom.n_reg_rules; + unsigned int i; + + size_of_regd = sizeof(struct ieee80211_regdomain) + + (num_rules * sizeof(struct ieee80211_reg_rule)); + + rd = kzalloc(size_of_regd, GFP_KERNEL); + if (!rd) + return -ENOMEM; + + memcpy(rd, &mydriver_jp_regdom, sizeof(struct ieee80211_regdomain)); + + for (i=0; i < num_rules; i++) + memcpy(&rd->reg_rules[i], + &mydriver_jp_regdom.reg_rules[i], + sizeof(struct ieee80211_reg_rule)); + regulatory_struct_hint(rd); + +Statically compiled regulatory database +--------------------------------------- + +In most situations the userland solution using CRDA as described +above is the preferred solution. However in some cases a set of +rules built into the kernel itself may be desirable. To account +for this situation, a configuration option has been provided +(i.e. CONFIG_CFG80211_INTERNAL_REGDB). With this option enabled, +the wireless database information contained in net/wireless/db.txt is +used to generate a data structure encoded in net/wireless/regdb.c. +That option also enables code in net/wireless/reg.c which queries +the data in regdb.c as an alternative to using CRDA. + +The file net/wireless/db.txt should be kept up-to-date with the db.txt +file available in the git repository here: + + git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-regdb.git + +Again, most users in most situations should be using the CRDA package +provided with their distribution, and in most other situations users +should be building and using CRDA on their own rather than using +this option. If you are not absolutely sure that you should be using +CONFIG_CFG80211_INTERNAL_REGDB then _DO_NOT_USE_IT_. diff --git a/Documentation/networking/routing.txt b/Documentation/networking/routing.txt deleted file mode 100644 index a26838b930f..00000000000 --- a/Documentation/networking/routing.txt +++ /dev/null @@ -1,46 +0,0 @@ -The directory ftp.inr.ac.ru:/ip-routing contains: - -- iproute.c - "professional" routing table maintenance utility. - -- rdisc.tar.gz - rdisc daemon, ported from Sun. - STRONGLY RECOMMENDED FOR ALL HOSTS. - -- routing.tgz - original Mike McLagan's route by source patch. - Currently it is obsolete. - -- gated.dif-ss<NEWEST>.gz - gated-R3_6Alpha_2 fixes. - Look at README.gated - -- mrouted-3.8.dif.gz - mrouted-3.8 fixes. - -- rtmon.c - trivial debugging utility: reads and stores netlink. - - -NEWS for user. - -- Policy based routing. Routing decisions are made on the basis - not only of destination address, but also source address, - TOS and incoming interface. -- Complete set of IP level control messages. - Now Linux is the only OS in the world complying to RFC requirements. - Great win 8) -- New interface addressing paradigm. - Assignment of address ranges to interface, - multiple prefixes etc. etc. - Do not bother, it is compatible with the old one. Moreover: -- You don't need to do "route add aaa.bbb.ccc... eth0" anymore, - it is done automatically. -- "Abstract" UNIX sockets and security enhancements. - This is necessary to use TIRPC and TLI emulation library. - -NEWS for hacker. - -- New destination cache. Flexible, robust and just beautiful. -- Network stack is reordered, simplified, optimized, a lot of bugs fixed. - (well, and new bugs were introduced, but I haven't seen them yet 8)) - It is difficult to describe all the changes, look into source. - -If you see this file, then this patch works 8) - -Alexey Kuznetsov. -kuznet@ms2.inr.ac.ru diff --git a/Documentation/networking/rxrpc.txt b/Documentation/networking/rxrpc.txt new file mode 100644 index 00000000000..16a924c486b --- /dev/null +++ b/Documentation/networking/rxrpc.txt @@ -0,0 +1,947 @@ + ====================== + RxRPC NETWORK PROTOCOL + ====================== + +The RxRPC protocol driver provides a reliable two-phase transport on top of UDP +that can be used to perform RxRPC remote operations. This is done over sockets +of AF_RXRPC family, using sendmsg() and recvmsg() with control data to send and +receive data, aborts and errors. + +Contents of this document: + + (*) Overview. + + (*) RxRPC protocol summary. + + (*) AF_RXRPC driver model. + + (*) Control messages. + + (*) Socket options. + + (*) Security. + + (*) Example client usage. + + (*) Example server usage. + + (*) AF_RXRPC kernel interface. + + (*) Configurable parameters. + + +======== +OVERVIEW +======== + +RxRPC is a two-layer protocol. There is a session layer which provides +reliable virtual connections using UDP over IPv4 (or IPv6) as the transport +layer, but implements a real network protocol; and there's the presentation +layer which renders structured data to binary blobs and back again using XDR +(as does SunRPC): + + +-------------+ + | Application | + +-------------+ + | XDR | Presentation + +-------------+ + | RxRPC | Session + +-------------+ + | UDP | Transport + +-------------+ + + +AF_RXRPC provides: + + (1) Part of an RxRPC facility for both kernel and userspace applications by + making the session part of it a Linux network protocol (AF_RXRPC). + + (2) A two-phase protocol. The client transmits a blob (the request) and then + receives a blob (the reply), and the server receives the request and then + transmits the reply. + + (3) Retention of the reusable bits of the transport system set up for one call + to speed up subsequent calls. + + (4) A secure protocol, using the Linux kernel's key retention facility to + manage security on the client end. The server end must of necessity be + more active in security negotiations. + +AF_RXRPC does not provide XDR marshalling/presentation facilities. That is +left to the application. AF_RXRPC only deals in blobs. Even the operation ID +is just the first four bytes of the request blob, and as such is beyond the +kernel's interest. + + +Sockets of AF_RXRPC family are: + + (1) created as type SOCK_DGRAM; + + (2) provided with a protocol of the type of underlying transport they're going + to use - currently only PF_INET is supported. + + +The Andrew File System (AFS) is an example of an application that uses this and +that has both kernel (filesystem) and userspace (utility) components. + + +====================== +RXRPC PROTOCOL SUMMARY +====================== + +An overview of the RxRPC protocol: + + (*) RxRPC sits on top of another networking protocol (UDP is the only option + currently), and uses this to provide network transport. UDP ports, for + example, provide transport endpoints. + + (*) RxRPC supports multiple virtual "connections" from any given transport + endpoint, thus allowing the endpoints to be shared, even to the same + remote endpoint. + + (*) Each connection goes to a particular "service". A connection may not go + to multiple services. A service may be considered the RxRPC equivalent of + a port number. AF_RXRPC permits multiple services to share an endpoint. + + (*) Client-originating packets are marked, thus a transport endpoint can be + shared between client and server connections (connections have a + direction). + + (*) Up to a billion connections may be supported concurrently between one + local transport endpoint and one service on one remote endpoint. An RxRPC + connection is described by seven numbers: + + Local address } + Local port } Transport (UDP) address + Remote address } + Remote port } + Direction + Connection ID + Service ID + + (*) Each RxRPC operation is a "call". A connection may make up to four + billion calls, but only up to four calls may be in progress on a + connection at any one time. + + (*) Calls are two-phase and asymmetric: the client sends its request data, + which the service receives; then the service transmits the reply data + which the client receives. + + (*) The data blobs are of indefinite size, the end of a phase is marked with a + flag in the packet. The number of packets of data making up one blob may + not exceed 4 billion, however, as this would cause the sequence number to + wrap. + + (*) The first four bytes of the request data are the service operation ID. + + (*) Security is negotiated on a per-connection basis. The connection is + initiated by the first data packet on it arriving. If security is + requested, the server then issues a "challenge" and then the client + replies with a "response". If the response is successful, the security is + set for the lifetime of that connection, and all subsequent calls made + upon it use that same security. In the event that the server lets a + connection lapse before the client, the security will be renegotiated if + the client uses the connection again. + + (*) Calls use ACK packets to handle reliability. Data packets are also + explicitly sequenced per call. + + (*) There are two types of positive acknowledgment: hard-ACKs and soft-ACKs. + A hard-ACK indicates to the far side that all the data received to a point + has been received and processed; a soft-ACK indicates that the data has + been received but may yet be discarded and re-requested. The sender may + not discard any transmittable packets until they've been hard-ACK'd. + + (*) Reception of a reply data packet implicitly hard-ACK's all the data + packets that make up the request. + + (*) An call is complete when the request has been sent, the reply has been + received and the final hard-ACK on the last packet of the reply has + reached the server. + + (*) An call may be aborted by either end at any time up to its completion. + + +===================== +AF_RXRPC DRIVER MODEL +===================== + +About the AF_RXRPC driver: + + (*) The AF_RXRPC protocol transparently uses internal sockets of the transport + protocol to represent transport endpoints. + + (*) AF_RXRPC sockets map onto RxRPC connection bundles. Actual RxRPC + connections are handled transparently. One client socket may be used to + make multiple simultaneous calls to the same service. One server socket + may handle calls from many clients. + + (*) Additional parallel client connections will be initiated to support extra + concurrent calls, up to a tunable limit. + + (*) Each connection is retained for a certain amount of time [tunable] after + the last call currently using it has completed in case a new call is made + that could reuse it. + + (*) Each internal UDP socket is retained [tunable] for a certain amount of + time [tunable] after the last connection using it discarded, in case a new + connection is made that could use it. + + (*) A client-side connection is only shared between calls if they have have + the same key struct describing their security (and assuming the calls + would otherwise share the connection). Non-secured calls would also be + able to share connections with each other. + + (*) A server-side connection is shared if the client says it is. + + (*) ACK'ing is handled by the protocol driver automatically, including ping + replying. + + (*) SO_KEEPALIVE automatically pings the other side to keep the connection + alive [TODO]. + + (*) If an ICMP error is received, all calls affected by that error will be + aborted with an appropriate network error passed through recvmsg(). + + +Interaction with the user of the RxRPC socket: + + (*) A socket is made into a server socket by binding an address with a + non-zero service ID. + + (*) In the client, sending a request is achieved with one or more sendmsgs, + followed by the reply being received with one or more recvmsgs. + + (*) The first sendmsg for a request to be sent from a client contains a tag to + be used in all other sendmsgs or recvmsgs associated with that call. The + tag is carried in the control data. + + (*) connect() is used to supply a default destination address for a client + socket. This may be overridden by supplying an alternate address to the + first sendmsg() of a call (struct msghdr::msg_name). + + (*) If connect() is called on an unbound client, a random local port will + bound before the operation takes place. + + (*) A server socket may also be used to make client calls. To do this, the + first sendmsg() of the call must specify the target address. The server's + transport endpoint is used to send the packets. + + (*) Once the application has received the last message associated with a call, + the tag is guaranteed not to be seen again, and so it can be used to pin + client resources. A new call can then be initiated with the same tag + without fear of interference. + + (*) In the server, a request is received with one or more recvmsgs, then the + the reply is transmitted with one or more sendmsgs, and then the final ACK + is received with a last recvmsg. + + (*) When sending data for a call, sendmsg is given MSG_MORE if there's more + data to come on that call. + + (*) When receiving data for a call, recvmsg flags MSG_MORE if there's more + data to come for that call. + + (*) When receiving data or messages for a call, MSG_EOR is flagged by recvmsg + to indicate the terminal message for that call. + + (*) A call may be aborted by adding an abort control message to the control + data. Issuing an abort terminates the kernel's use of that call's tag. + Any messages waiting in the receive queue for that call will be discarded. + + (*) Aborts, busy notifications and challenge packets are delivered by recvmsg, + and control data messages will be set to indicate the context. Receiving + an abort or a busy message terminates the kernel's use of that call's tag. + + (*) The control data part of the msghdr struct is used for a number of things: + + (*) The tag of the intended or affected call. + + (*) Sending or receiving errors, aborts and busy notifications. + + (*) Notifications of incoming calls. + + (*) Sending debug requests and receiving debug replies [TODO]. + + (*) When the kernel has received and set up an incoming call, it sends a + message to server application to let it know there's a new call awaiting + its acceptance [recvmsg reports a special control message]. The server + application then uses sendmsg to assign a tag to the new call. Once that + is done, the first part of the request data will be delivered by recvmsg. + + (*) The server application has to provide the server socket with a keyring of + secret keys corresponding to the security types it permits. When a secure + connection is being set up, the kernel looks up the appropriate secret key + in the keyring and then sends a challenge packet to the client and + receives a response packet. The kernel then checks the authorisation of + the packet and either aborts the connection or sets up the security. + + (*) The name of the key a client will use to secure its communications is + nominated by a socket option. + + +Notes on recvmsg: + + (*) If there's a sequence of data messages belonging to a particular call on + the receive queue, then recvmsg will keep working through them until: + + (a) it meets the end of that call's received data, + + (b) it meets a non-data message, + + (c) it meets a message belonging to a different call, or + + (d) it fills the user buffer. + + If recvmsg is called in blocking mode, it will keep sleeping, awaiting the + reception of further data, until one of the above four conditions is met. + + (2) MSG_PEEK operates similarly, but will return immediately if it has put any + data in the buffer rather than sleeping until it can fill the buffer. + + (3) If a data message is only partially consumed in filling a user buffer, + then the remainder of that message will be left on the front of the queue + for the next taker. MSG_TRUNC will never be flagged. + + (4) If there is more data to be had on a call (it hasn't copied the last byte + of the last data message in that phase yet), then MSG_MORE will be + flagged. + + +================ +CONTROL MESSAGES +================ + +AF_RXRPC makes use of control messages in sendmsg() and recvmsg() to multiplex +calls, to invoke certain actions and to report certain conditions. These are: + + MESSAGE ID SRT DATA MEANING + ======================= === =========== =============================== + RXRPC_USER_CALL_ID sr- User ID App's call specifier + RXRPC_ABORT srt Abort code Abort code to issue/received + RXRPC_ACK -rt n/a Final ACK received + RXRPC_NET_ERROR -rt error num Network error on call + RXRPC_BUSY -rt n/a Call rejected (server busy) + RXRPC_LOCAL_ERROR -rt error num Local error encountered + RXRPC_NEW_CALL -r- n/a New call received + RXRPC_ACCEPT s-- n/a Accept new call + + (SRT = usable in Sendmsg / delivered by Recvmsg / Terminal message) + + (*) RXRPC_USER_CALL_ID + + This is used to indicate the application's call ID. It's an unsigned long + that the app specifies in the client by attaching it to the first data + message or in the server by passing it in association with an RXRPC_ACCEPT + message. recvmsg() passes it in conjunction with all messages except + those of the RXRPC_NEW_CALL message. + + (*) RXRPC_ABORT + + This is can be used by an application to abort a call by passing it to + sendmsg, or it can be delivered by recvmsg to indicate a remote abort was + received. Either way, it must be associated with an RXRPC_USER_CALL_ID to + specify the call affected. If an abort is being sent, then error EBADSLT + will be returned if there is no call with that user ID. + + (*) RXRPC_ACK + + This is delivered to a server application to indicate that the final ACK + of a call was received from the client. It will be associated with an + RXRPC_USER_CALL_ID to indicate the call that's now complete. + + (*) RXRPC_NET_ERROR + + This is delivered to an application to indicate that an ICMP error message + was encountered in the process of trying to talk to the peer. An + errno-class integer value will be included in the control message data + indicating the problem, and an RXRPC_USER_CALL_ID will indicate the call + affected. + + (*) RXRPC_BUSY + + This is delivered to a client application to indicate that a call was + rejected by the server due to the server being busy. It will be + associated with an RXRPC_USER_CALL_ID to indicate the rejected call. + + (*) RXRPC_LOCAL_ERROR + + This is delivered to an application to indicate that a local error was + encountered and that a call has been aborted because of it. An + errno-class integer value will be included in the control message data + indicating the problem, and an RXRPC_USER_CALL_ID will indicate the call + affected. + + (*) RXRPC_NEW_CALL + + This is delivered to indicate to a server application that a new call has + arrived and is awaiting acceptance. No user ID is associated with this, + as a user ID must subsequently be assigned by doing an RXRPC_ACCEPT. + + (*) RXRPC_ACCEPT + + This is used by a server application to attempt to accept a call and + assign it a user ID. It should be associated with an RXRPC_USER_CALL_ID + to indicate the user ID to be assigned. If there is no call to be + accepted (it may have timed out, been aborted, etc.), then sendmsg will + return error ENODATA. If the user ID is already in use by another call, + then error EBADSLT will be returned. + + +============== +SOCKET OPTIONS +============== + +AF_RXRPC sockets support a few socket options at the SOL_RXRPC level: + + (*) RXRPC_SECURITY_KEY + + This is used to specify the description of the key to be used. The key is + extracted from the calling process's keyrings with request_key() and + should be of "rxrpc" type. + + The optval pointer points to the description string, and optlen indicates + how long the string is, without the NUL terminator. + + (*) RXRPC_SECURITY_KEYRING + + Similar to above but specifies a keyring of server secret keys to use (key + type "keyring"). See the "Security" section. + + (*) RXRPC_EXCLUSIVE_CONNECTION + + This is used to request that new connections should be used for each call + made subsequently on this socket. optval should be NULL and optlen 0. + + (*) RXRPC_MIN_SECURITY_LEVEL + + This is used to specify the minimum security level required for calls on + this socket. optval must point to an int containing one of the following + values: + + (a) RXRPC_SECURITY_PLAIN + + Encrypted checksum only. + + (b) RXRPC_SECURITY_AUTH + + Encrypted checksum plus packet padded and first eight bytes of packet + encrypted - which includes the actual packet length. + + (c) RXRPC_SECURITY_ENCRYPTED + + Encrypted checksum plus entire packet padded and encrypted, including + actual packet length. + + +======== +SECURITY +======== + +Currently, only the kerberos 4 equivalent protocol has been implemented +(security index 2 - rxkad). This requires the rxkad module to be loaded and, +on the client, tickets of the appropriate type to be obtained from the AFS +kaserver or the kerberos server and installed as "rxrpc" type keys. This is +normally done using the klog program. An example simple klog program can be +found at: + + http://people.redhat.com/~dhowells/rxrpc/klog.c + +The payload provided to add_key() on the client should be of the following +form: + + struct rxrpc_key_sec2_v1 { + uint16_t security_index; /* 2 */ + uint16_t ticket_length; /* length of ticket[] */ + uint32_t expiry; /* time at which expires */ + uint8_t kvno; /* key version number */ + uint8_t __pad[3]; + uint8_t session_key[8]; /* DES session key */ + uint8_t ticket[0]; /* the encrypted ticket */ + }; + +Where the ticket blob is just appended to the above structure. + + +For the server, keys of type "rxrpc_s" must be made available to the server. +They have a description of "<serviceID>:<securityIndex>" (eg: "52:2" for an +rxkad key for the AFS VL service). When such a key is created, it should be +given the server's secret key as the instantiation data (see the example +below). + + add_key("rxrpc_s", "52:2", secret_key, 8, keyring); + +A keyring is passed to the server socket by naming it in a sockopt. The server +socket then looks the server secret keys up in this keyring when secure +incoming connections are made. This can be seen in an example program that can +be found at: + + http://people.redhat.com/~dhowells/rxrpc/listen.c + + +==================== +EXAMPLE CLIENT USAGE +==================== + +A client would issue an operation by: + + (1) An RxRPC socket is set up by: + + client = socket(AF_RXRPC, SOCK_DGRAM, PF_INET); + + Where the third parameter indicates the protocol family of the transport + socket used - usually IPv4 but it can also be IPv6 [TODO]. + + (2) A local address can optionally be bound: + + struct sockaddr_rxrpc srx = { + .srx_family = AF_RXRPC, + .srx_service = 0, /* we're a client */ + .transport_type = SOCK_DGRAM, /* type of transport socket */ + .transport.sin_family = AF_INET, + .transport.sin_port = htons(7000), /* AFS callback */ + .transport.sin_address = 0, /* all local interfaces */ + }; + bind(client, &srx, sizeof(srx)); + + This specifies the local UDP port to be used. If not given, a random + non-privileged port will be used. A UDP port may be shared between + several unrelated RxRPC sockets. Security is handled on a basis of + per-RxRPC virtual connection. + + (3) The security is set: + + const char *key = "AFS:cambridge.redhat.com"; + setsockopt(client, SOL_RXRPC, RXRPC_SECURITY_KEY, key, strlen(key)); + + This issues a request_key() to get the key representing the security + context. The minimum security level can be set: + + unsigned int sec = RXRPC_SECURITY_ENCRYPTED; + setsockopt(client, SOL_RXRPC, RXRPC_MIN_SECURITY_LEVEL, + &sec, sizeof(sec)); + + (4) The server to be contacted can then be specified (alternatively this can + be done through sendmsg): + + struct sockaddr_rxrpc srx = { + .srx_family = AF_RXRPC, + .srx_service = VL_SERVICE_ID, + .transport_type = SOCK_DGRAM, /* type of transport socket */ + .transport.sin_family = AF_INET, + .transport.sin_port = htons(7005), /* AFS volume manager */ + .transport.sin_address = ..., + }; + connect(client, &srx, sizeof(srx)); + + (5) The request data should then be posted to the server socket using a series + of sendmsg() calls, each with the following control message attached: + + RXRPC_USER_CALL_ID - specifies the user ID for this call + + MSG_MORE should be set in msghdr::msg_flags on all but the last part of + the request. Multiple requests may be made simultaneously. + + If a call is intended to go to a destination other than the default + specified through connect(), then msghdr::msg_name should be set on the + first request message of that call. + + (6) The reply data will then be posted to the server socket for recvmsg() to + pick up. MSG_MORE will be flagged by recvmsg() if there's more reply data + for a particular call to be read. MSG_EOR will be set on the terminal + read for a call. + + All data will be delivered with the following control message attached: + + RXRPC_USER_CALL_ID - specifies the user ID for this call + + If an abort or error occurred, this will be returned in the control data + buffer instead, and MSG_EOR will be flagged to indicate the end of that + call. + + +==================== +EXAMPLE SERVER USAGE +==================== + +A server would be set up to accept operations in the following manner: + + (1) An RxRPC socket is created by: + + server = socket(AF_RXRPC, SOCK_DGRAM, PF_INET); + + Where the third parameter indicates the address type of the transport + socket used - usually IPv4. + + (2) Security is set up if desired by giving the socket a keyring with server + secret keys in it: + + keyring = add_key("keyring", "AFSkeys", NULL, 0, + KEY_SPEC_PROCESS_KEYRING); + + const char secret_key[8] = { + 0xa7, 0x83, 0x8a, 0xcb, 0xc7, 0x83, 0xec, 0x94 }; + add_key("rxrpc_s", "52:2", secret_key, 8, keyring); + + setsockopt(server, SOL_RXRPC, RXRPC_SECURITY_KEYRING, "AFSkeys", 7); + + The keyring can be manipulated after it has been given to the socket. This + permits the server to add more keys, replace keys, etc. whilst it is live. + + (2) A local address must then be bound: + + struct sockaddr_rxrpc srx = { + .srx_family = AF_RXRPC, + .srx_service = VL_SERVICE_ID, /* RxRPC service ID */ + .transport_type = SOCK_DGRAM, /* type of transport socket */ + .transport.sin_family = AF_INET, + .transport.sin_port = htons(7000), /* AFS callback */ + .transport.sin_address = 0, /* all local interfaces */ + }; + bind(server, &srx, sizeof(srx)); + + (3) The server is then set to listen out for incoming calls: + + listen(server, 100); + + (4) The kernel notifies the server of pending incoming connections by sending + it a message for each. This is received with recvmsg() on the server + socket. It has no data, and has a single dataless control message + attached: + + RXRPC_NEW_CALL + + The address that can be passed back by recvmsg() at this point should be + ignored since the call for which the message was posted may have gone by + the time it is accepted - in which case the first call still on the queue + will be accepted. + + (5) The server then accepts the new call by issuing a sendmsg() with two + pieces of control data and no actual data: + + RXRPC_ACCEPT - indicate connection acceptance + RXRPC_USER_CALL_ID - specify user ID for this call + + (6) The first request data packet will then be posted to the server socket for + recvmsg() to pick up. At that point, the RxRPC address for the call can + be read from the address fields in the msghdr struct. + + Subsequent request data will be posted to the server socket for recvmsg() + to collect as it arrives. All but the last piece of the request data will + be delivered with MSG_MORE flagged. + + All data will be delivered with the following control message attached: + + RXRPC_USER_CALL_ID - specifies the user ID for this call + + (8) The reply data should then be posted to the server socket using a series + of sendmsg() calls, each with the following control messages attached: + + RXRPC_USER_CALL_ID - specifies the user ID for this call + + MSG_MORE should be set in msghdr::msg_flags on all but the last message + for a particular call. + + (9) The final ACK from the client will be posted for retrieval by recvmsg() + when it is received. It will take the form of a dataless message with two + control messages attached: + + RXRPC_USER_CALL_ID - specifies the user ID for this call + RXRPC_ACK - indicates final ACK (no data) + + MSG_EOR will be flagged to indicate that this is the final message for + this call. + +(10) Up to the point the final packet of reply data is sent, the call can be + aborted by calling sendmsg() with a dataless message with the following + control messages attached: + + RXRPC_USER_CALL_ID - specifies the user ID for this call + RXRPC_ABORT - indicates abort code (4 byte data) + + Any packets waiting in the socket's receive queue will be discarded if + this is issued. + +Note that all the communications for a particular service take place through +the one server socket, using control messages on sendmsg() and recvmsg() to +determine the call affected. + + +========================= +AF_RXRPC KERNEL INTERFACE +========================= + +The AF_RXRPC module also provides an interface for use by in-kernel utilities +such as the AFS filesystem. This permits such a utility to: + + (1) Use different keys directly on individual client calls on one socket + rather than having to open a whole slew of sockets, one for each key it + might want to use. + + (2) Avoid having RxRPC call request_key() at the point of issue of a call or + opening of a socket. Instead the utility is responsible for requesting a + key at the appropriate point. AFS, for instance, would do this during VFS + operations such as open() or unlink(). The key is then handed through + when the call is initiated. + + (3) Request the use of something other than GFP_KERNEL to allocate memory. + + (4) Avoid the overhead of using the recvmsg() call. RxRPC messages can be + intercepted before they get put into the socket Rx queue and the socket + buffers manipulated directly. + +To use the RxRPC facility, a kernel utility must still open an AF_RXRPC socket, +bind an address as appropriate and listen if it's to be a server socket, but +then it passes this to the kernel interface functions. + +The kernel interface functions are as follows: + + (*) Begin a new client call. + + struct rxrpc_call * + rxrpc_kernel_begin_call(struct socket *sock, + struct sockaddr_rxrpc *srx, + struct key *key, + unsigned long user_call_ID, + gfp_t gfp); + + This allocates the infrastructure to make a new RxRPC call and assigns + call and connection numbers. The call will be made on the UDP port that + the socket is bound to. The call will go to the destination address of a + connected client socket unless an alternative is supplied (srx is + non-NULL). + + If a key is supplied then this will be used to secure the call instead of + the key bound to the socket with the RXRPC_SECURITY_KEY sockopt. Calls + secured in this way will still share connections if at all possible. + + The user_call_ID is equivalent to that supplied to sendmsg() in the + control data buffer. It is entirely feasible to use this to point to a + kernel data structure. + + If this function is successful, an opaque reference to the RxRPC call is + returned. The caller now holds a reference on this and it must be + properly ended. + + (*) End a client call. + + void rxrpc_kernel_end_call(struct rxrpc_call *call); + + This is used to end a previously begun call. The user_call_ID is expunged + from AF_RXRPC's knowledge and will not be seen again in association with + the specified call. + + (*) Send data through a call. + + int rxrpc_kernel_send_data(struct rxrpc_call *call, struct msghdr *msg, + size_t len); + + This is used to supply either the request part of a client call or the + reply part of a server call. msg.msg_iovlen and msg.msg_iov specify the + data buffers to be used. msg_iov may not be NULL and must point + exclusively to in-kernel virtual addresses. msg.msg_flags may be given + MSG_MORE if there will be subsequent data sends for this call. + + The msg must not specify a destination address, control data or any flags + other than MSG_MORE. len is the total amount of data to transmit. + + (*) Abort a call. + + void rxrpc_kernel_abort_call(struct rxrpc_call *call, u32 abort_code); + + This is used to abort a call if it's still in an abortable state. The + abort code specified will be placed in the ABORT message sent. + + (*) Intercept received RxRPC messages. + + typedef void (*rxrpc_interceptor_t)(struct sock *sk, + unsigned long user_call_ID, + struct sk_buff *skb); + + void + rxrpc_kernel_intercept_rx_messages(struct socket *sock, + rxrpc_interceptor_t interceptor); + + This installs an interceptor function on the specified AF_RXRPC socket. + All messages that would otherwise wind up in the socket's Rx queue are + then diverted to this function. Note that care must be taken to process + the messages in the right order to maintain DATA message sequentiality. + + The interceptor function itself is provided with the address of the socket + and handling the incoming message, the ID assigned by the kernel utility + to the call and the socket buffer containing the message. + + The skb->mark field indicates the type of message: + + MARK MEANING + =============================== ======================================= + RXRPC_SKB_MARK_DATA Data message + RXRPC_SKB_MARK_FINAL_ACK Final ACK received for an incoming call + RXRPC_SKB_MARK_BUSY Client call rejected as server busy + RXRPC_SKB_MARK_REMOTE_ABORT Call aborted by peer + RXRPC_SKB_MARK_NET_ERROR Network error detected + RXRPC_SKB_MARK_LOCAL_ERROR Local error encountered + RXRPC_SKB_MARK_NEW_CALL New incoming call awaiting acceptance + + The remote abort message can be probed with rxrpc_kernel_get_abort_code(). + The two error messages can be probed with rxrpc_kernel_get_error_number(). + A new call can be accepted with rxrpc_kernel_accept_call(). + + Data messages can have their contents extracted with the usual bunch of + socket buffer manipulation functions. A data message can be determined to + be the last one in a sequence with rxrpc_kernel_is_data_last(). When a + data message has been used up, rxrpc_kernel_data_delivered() should be + called on it.. + + Non-data messages should be handled to rxrpc_kernel_free_skb() to dispose + of. It is possible to get extra refs on all types of message for later + freeing, but this may pin the state of a call until the message is finally + freed. + + (*) Accept an incoming call. + + struct rxrpc_call * + rxrpc_kernel_accept_call(struct socket *sock, + unsigned long user_call_ID); + + This is used to accept an incoming call and to assign it a call ID. This + function is similar to rxrpc_kernel_begin_call() and calls accepted must + be ended in the same way. + + If this function is successful, an opaque reference to the RxRPC call is + returned. The caller now holds a reference on this and it must be + properly ended. + + (*) Reject an incoming call. + + int rxrpc_kernel_reject_call(struct socket *sock); + + This is used to reject the first incoming call on the socket's queue with + a BUSY message. -ENODATA is returned if there were no incoming calls. + Other errors may be returned if the call had been aborted (-ECONNABORTED) + or had timed out (-ETIME). + + (*) Record the delivery of a data message and free it. + + void rxrpc_kernel_data_delivered(struct sk_buff *skb); + + This is used to record a data message as having been delivered and to + update the ACK state for the call. The socket buffer will be freed. + + (*) Free a message. + + void rxrpc_kernel_free_skb(struct sk_buff *skb); + + This is used to free a non-DATA socket buffer intercepted from an AF_RXRPC + socket. + + (*) Determine if a data message is the last one on a call. + + bool rxrpc_kernel_is_data_last(struct sk_buff *skb); + + This is used to determine if a socket buffer holds the last data message + to be received for a call (true will be returned if it does, false + if not). + + The data message will be part of the reply on a client call and the + request on an incoming call. In the latter case there will be more + messages, but in the former case there will not. + + (*) Get the abort code from an abort message. + + u32 rxrpc_kernel_get_abort_code(struct sk_buff *skb); + + This is used to extract the abort code from a remote abort message. + + (*) Get the error number from a local or network error message. + + int rxrpc_kernel_get_error_number(struct sk_buff *skb); + + This is used to extract the error number from a message indicating either + a local error occurred or a network error occurred. + + (*) Allocate a null key for doing anonymous security. + + struct key *rxrpc_get_null_key(const char *keyname); + + This is used to allocate a null RxRPC key that can be used to indicate + anonymous security for a particular domain. + + +======================= +CONFIGURABLE PARAMETERS +======================= + +The RxRPC protocol driver has a number of configurable parameters that can be +adjusted through sysctls in /proc/net/rxrpc/: + + (*) req_ack_delay + + The amount of time in milliseconds after receiving a packet with the + request-ack flag set before we honour the flag and actually send the + requested ack. + + Usually the other side won't stop sending packets until the advertised + reception window is full (to a maximum of 255 packets), so delaying the + ACK permits several packets to be ACK'd in one go. + + (*) soft_ack_delay + + The amount of time in milliseconds after receiving a new packet before we + generate a soft-ACK to tell the sender that it doesn't need to resend. + + (*) idle_ack_delay + + The amount of time in milliseconds after all the packets currently in the + received queue have been consumed before we generate a hard-ACK to tell + the sender it can free its buffers, assuming no other reason occurs that + we would send an ACK. + + (*) resend_timeout + + The amount of time in milliseconds after transmitting a packet before we + transmit it again, assuming no ACK is received from the receiver telling + us they got it. + + (*) max_call_lifetime + + The maximum amount of time in seconds that a call may be in progress + before we preemptively kill it. + + (*) dead_call_expiry + + The amount of time in seconds before we remove a dead call from the call + list. Dead calls are kept around for a little while for the purpose of + repeating ACK and ABORT packets. + + (*) connection_expiry + + The amount of time in seconds after a connection was last used before we + remove it from the connection list. Whilst a connection is in existence, + it serves as a placeholder for negotiated security; when it is deleted, + the security must be renegotiated. + + (*) transport_expiry + + The amount of time in seconds after a transport was last used before we + remove it from the transport list. Whilst a transport is in existence, it + serves to anchor the peer data and keeps the connection ID counter. + + (*) rxrpc_rx_window_size + + The size of the receive window in packets. This is the maximum number of + unconsumed received packets we're willing to hold in memory for any + particular call. + + (*) rxrpc_rx_mtu + + The maximum packet MTU size that we're willing to receive in bytes. This + indicates to the peer whether we're willing to accept jumbo packets. + + (*) rxrpc_rx_jumbo_max + + The maximum number of packets that we're willing to accept in a jumbo + packet. Non-terminal packets in a jumbo packet must contain a four byte + header plus exactly 1412 bytes of data. The terminal packet must contain + a four byte header plus any amount of data. In any event, a jumbo packet + may not exceed rxrpc_rx_mtu in size. diff --git a/Documentation/networking/s2io.txt b/Documentation/networking/s2io.txt index bd528ffbeb4..d2a9f43b554 100644 --- a/Documentation/networking/s2io.txt +++ b/Documentation/networking/s2io.txt @@ -37,7 +37,7 @@ To associate an interface with a physical adapter use "ethtool -p <ethX>". The corresponding adapter's LED will blink multiple times. 3. Features supported: -a. Jumbo frames. Xframe I/II supports MTU upto 9600 bytes, +a. Jumbo frames. Xframe I/II supports MTU up to 9600 bytes, modifiable using ifconfig command. b. Offloads. Supports checksum offload(TCP/UDP/IP) on transmit @@ -49,16 +49,13 @@ significant performance improvement on certain platforms(SGI Altix, IBM xSeries). d. MSI/MSI-X. Can be enabled on platforms which support this feature -(IA64, Xeon) resulting in noticeable performance improvement(upto 7% +(IA64, Xeon) resulting in noticeable performance improvement(up to 7% on certain platforms). -e. NAPI. Compile-time option(CONFIG_S2IO_NAPI) for better Rx interrupt -moderation. - -f. Statistics. Comprehensive MAC-level and software statistics displayed +e. Statistics. Comprehensive MAC-level and software statistics displayed using "ethtool -S" option. -g. Multi-FIFO/Ring. Supports up to 8 transmit queues and receive rings, +f. Multi-FIFO/Ring. Supports up to 8 transmit queues and receive rings, with multiple steering options. 4. Command line parameters @@ -83,9 +80,9 @@ Valid range: Limited by memory on system Default: 30 e. intr_type -Specifies interrupt type. Possible values 1(INTA), 2(MSI), 3(MSI-X) -Valid range: 1-3 -Default: 1 +Specifies interrupt type. Possible values 0(INTA), 2(MSI-X) +Valid values: 0, 2 +Default: 2 5. Performance suggestions General: @@ -126,7 +123,7 @@ However, you may want to set PCI latency timer to 248. #setpci -d 17d5:* LATENCY_TIMER=f8 For detailed description of the PCI registers, please see Xframe User Guide. b. Use 2-buffer mode. This results in large performance boost on -on certain platforms(eg. SGI Altix, IBM xSeries). +certain platforms(eg. SGI Altix, IBM xSeries). c. Ensure Receive Checksum offload is enabled. Use "ethtool -K ethX" command to set/verify this option. d. Enable NAPI feature(in kernel configuration Device Drivers ---> Network @@ -136,18 +133,9 @@ bring down CPU utilization. ** For AMD opteron platforms with 8131 chipset, MMRBC=1 and MOST=1 are recommended as safe parameters. For more information, please review the AMD8131 errata at -http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/26310.pdf - -6. Available Downloads -Neterion "s2io" driver in Red Hat and Suse 2.6-based distributions is kept up -to date, also the latest "s2io" code (including support for 2.4 kernels) is -available via "Support" link on the Neterion site: http://www.neterion.com. +http://vip.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/ +26310_AMD-8131_HyperTransport_PCI-X_Tunnel_Revision_Guide_rev_3_18.pdf -For Xframe User Guide (Programming manual), visit ftp site ns1.s2io.com, -user: linuxdocs password: HALdocs - -7. Support +6. Support For further support please contact either your 10GbE Xframe NIC vendor (IBM, -HP, SGI etc.) or click on the "Support" link on the Neterion site: -http://www.neterion.com. - +HP, SGI etc.) diff --git a/Documentation/networking/scaling.txt b/Documentation/networking/scaling.txt new file mode 100644 index 00000000000..99ca40e8e81 --- /dev/null +++ b/Documentation/networking/scaling.txt @@ -0,0 +1,436 @@ +Scaling in the Linux Networking Stack + + +Introduction +============ + +This document describes a set of complementary techniques in the Linux +networking stack to increase parallelism and improve performance for +multi-processor systems. + +The following technologies are described: + + RSS: Receive Side Scaling + RPS: Receive Packet Steering + RFS: Receive Flow Steering + Accelerated Receive Flow Steering + XPS: Transmit Packet Steering + + +RSS: Receive Side Scaling +========================= + +Contemporary NICs support multiple receive and transmit descriptor queues +(multi-queue). On reception, a NIC can send different packets to different +queues to distribute processing among CPUs. The NIC distributes packets by +applying a filter to each packet that assigns it to one of a small number +of logical flows. Packets for each flow are steered to a separate receive +queue, which in turn can be processed by separate CPUs. This mechanism is +generally known as “Receive-side Scaling†(RSS). The goal of RSS and +the other scaling techniques is to increase performance uniformly. +Multi-queue distribution can also be used for traffic prioritization, but +that is not the focus of these techniques. + +The filter used in RSS is typically a hash function over the network +and/or transport layer headers-- for example, a 4-tuple hash over +IP addresses and TCP ports of a packet. The most common hardware +implementation of RSS uses a 128-entry indirection table where each entry +stores a queue number. The receive queue for a packet is determined +by masking out the low order seven bits of the computed hash for the +packet (usually a Toeplitz hash), taking this number as a key into the +indirection table and reading the corresponding value. + +Some advanced NICs allow steering packets to queues based on +programmable filters. For example, webserver bound TCP port 80 packets +can be directed to their own receive queue. Such “n-tuple†filters can +be configured from ethtool (--config-ntuple). + +==== RSS Configuration + +The driver for a multi-queue capable NIC typically provides a kernel +module parameter for specifying the number of hardware queues to +configure. In the bnx2x driver, for instance, this parameter is called +num_queues. A typical RSS configuration would be to have one receive queue +for each CPU if the device supports enough queues, or otherwise at least +one for each memory domain, where a memory domain is a set of CPUs that +share a particular memory level (L1, L2, NUMA node, etc.). + +The indirection table of an RSS device, which resolves a queue by masked +hash, is usually programmed by the driver at initialization. The +default mapping is to distribute the queues evenly in the table, but the +indirection table can be retrieved and modified at runtime using ethtool +commands (--show-rxfh-indir and --set-rxfh-indir). Modifying the +indirection table could be done to give different queues different +relative weights. + +== RSS IRQ Configuration + +Each receive queue has a separate IRQ associated with it. The NIC triggers +this to notify a CPU when new packets arrive on the given queue. The +signaling path for PCIe devices uses message signaled interrupts (MSI-X), +that can route each interrupt to a particular CPU. The active mapping +of queues to IRQs can be determined from /proc/interrupts. By default, +an IRQ may be handled on any CPU. Because a non-negligible part of packet +processing takes place in receive interrupt handling, it is advantageous +to spread receive interrupts between CPUs. To manually adjust the IRQ +affinity of each interrupt see Documentation/IRQ-affinity.txt. Some systems +will be running irqbalance, a daemon that dynamically optimizes IRQ +assignments and as a result may override any manual settings. + +== Suggested Configuration + +RSS should be enabled when latency is a concern or whenever receive +interrupt processing forms a bottleneck. Spreading load between CPUs +decreases queue length. For low latency networking, the optimal setting +is to allocate as many queues as there are CPUs in the system (or the +NIC maximum, if lower). The most efficient high-rate configuration +is likely the one with the smallest number of receive queues where no +receive queue overflows due to a saturated CPU, because in default +mode with interrupt coalescing enabled, the aggregate number of +interrupts (and thus work) grows with each additional queue. + +Per-cpu load can be observed using the mpstat utility, but note that on +processors with hyperthreading (HT), each hyperthread is represented as +a separate CPU. For interrupt handling, HT has shown no benefit in +initial tests, so limit the number of queues to the number of CPU cores +in the system. + + +RPS: Receive Packet Steering +============================ + +Receive Packet Steering (RPS) is logically a software implementation of +RSS. Being in software, it is necessarily called later in the datapath. +Whereas RSS selects the queue and hence CPU that will run the hardware +interrupt handler, RPS selects the CPU to perform protocol processing +above the interrupt handler. This is accomplished by placing the packet +on the desired CPU’s backlog queue and waking up the CPU for processing. +RPS has some advantages over RSS: 1) it can be used with any NIC, +2) software filters can easily be added to hash over new protocols, +3) it does not increase hardware device interrupt rate (although it does +introduce inter-processor interrupts (IPIs)). + +RPS is called during bottom half of the receive interrupt handler, when +a driver sends a packet up the network stack with netif_rx() or +netif_receive_skb(). These call the get_rps_cpu() function, which +selects the queue that should process a packet. + +The first step in determining the target CPU for RPS is to calculate a +flow hash over the packet’s addresses or ports (2-tuple or 4-tuple hash +depending on the protocol). This serves as a consistent hash of the +associated flow of the packet. The hash is either provided by hardware +or will be computed in the stack. Capable hardware can pass the hash in +the receive descriptor for the packet; this would usually be the same +hash used for RSS (e.g. computed Toeplitz hash). The hash is saved in +skb->rx_hash and can be used elsewhere in the stack as a hash of the +packet’s flow. + +Each receive hardware queue has an associated list of CPUs to which +RPS may enqueue packets for processing. For each received packet, +an index into the list is computed from the flow hash modulo the size +of the list. The indexed CPU is the target for processing the packet, +and the packet is queued to the tail of that CPU’s backlog queue. At +the end of the bottom half routine, IPIs are sent to any CPUs for which +packets have been queued to their backlog queue. The IPI wakes backlog +processing on the remote CPU, and any queued packets are then processed +up the networking stack. + +==== RPS Configuration + +RPS requires a kernel compiled with the CONFIG_RPS kconfig symbol (on +by default for SMP). Even when compiled in, RPS remains disabled until +explicitly configured. The list of CPUs to which RPS may forward traffic +can be configured for each receive queue using a sysfs file entry: + + /sys/class/net/<dev>/queues/rx-<n>/rps_cpus + +This file implements a bitmap of CPUs. RPS is disabled when it is zero +(the default), in which case packets are processed on the interrupting +CPU. Documentation/IRQ-affinity.txt explains how CPUs are assigned to +the bitmap. + +== Suggested Configuration + +For a single queue device, a typical RPS configuration would be to set +the rps_cpus to the CPUs in the same memory domain of the interrupting +CPU. If NUMA locality is not an issue, this could also be all CPUs in +the system. At high interrupt rate, it might be wise to exclude the +interrupting CPU from the map since that already performs much work. + +For a multi-queue system, if RSS is configured so that a hardware +receive queue is mapped to each CPU, then RPS is probably redundant +and unnecessary. If there are fewer hardware queues than CPUs, then +RPS might be beneficial if the rps_cpus for each queue are the ones that +share the same memory domain as the interrupting CPU for that queue. + +==== RPS Flow Limit + +RPS scales kernel receive processing across CPUs without introducing +reordering. The trade-off to sending all packets from the same flow +to the same CPU is CPU load imbalance if flows vary in packet rate. +In the extreme case a single flow dominates traffic. Especially on +common server workloads with many concurrent connections, such +behavior indicates a problem such as a misconfiguration or spoofed +source Denial of Service attack. + +Flow Limit is an optional RPS feature that prioritizes small flows +during CPU contention by dropping packets from large flows slightly +ahead of those from small flows. It is active only when an RPS or RFS +destination CPU approaches saturation. Once a CPU's input packet +queue exceeds half the maximum queue length (as set by sysctl +net.core.netdev_max_backlog), the kernel starts a per-flow packet +count over the last 256 packets. If a flow exceeds a set ratio (by +default, half) of these packets when a new packet arrives, then the +new packet is dropped. Packets from other flows are still only +dropped once the input packet queue reaches netdev_max_backlog. +No packets are dropped when the input packet queue length is below +the threshold, so flow limit does not sever connections outright: +even large flows maintain connectivity. + +== Interface + +Flow limit is compiled in by default (CONFIG_NET_FLOW_LIMIT), but not +turned on. It is implemented for each CPU independently (to avoid lock +and cache contention) and toggled per CPU by setting the relevant bit +in sysctl net.core.flow_limit_cpu_bitmap. It exposes the same CPU +bitmap interface as rps_cpus (see above) when called from procfs: + + /proc/sys/net/core/flow_limit_cpu_bitmap + +Per-flow rate is calculated by hashing each packet into a hashtable +bucket and incrementing a per-bucket counter. The hash function is +the same that selects a CPU in RPS, but as the number of buckets can +be much larger than the number of CPUs, flow limit has finer-grained +identification of large flows and fewer false positives. The default +table has 4096 buckets. This value can be modified through sysctl + + net.core.flow_limit_table_len + +The value is only consulted when a new table is allocated. Modifying +it does not update active tables. + +== Suggested Configuration + +Flow limit is useful on systems with many concurrent connections, +where a single connection taking up 50% of a CPU indicates a problem. +In such environments, enable the feature on all CPUs that handle +network rx interrupts (as set in /proc/irq/N/smp_affinity). + +The feature depends on the input packet queue length to exceed +the flow limit threshold (50%) + the flow history length (256). +Setting net.core.netdev_max_backlog to either 1000 or 10000 +performed well in experiments. + + +RFS: Receive Flow Steering +========================== + +While RPS steers packets solely based on hash, and thus generally +provides good load distribution, it does not take into account +application locality. This is accomplished by Receive Flow Steering +(RFS). The goal of RFS is to increase datacache hitrate by steering +kernel processing of packets to the CPU where the application thread +consuming the packet is running. RFS relies on the same RPS mechanisms +to enqueue packets onto the backlog of another CPU and to wake up that +CPU. + +In RFS, packets are not forwarded directly by the value of their hash, +but the hash is used as index into a flow lookup table. This table maps +flows to the CPUs where those flows are being processed. The flow hash +(see RPS section above) is used to calculate the index into this table. +The CPU recorded in each entry is the one which last processed the flow. +If an entry does not hold a valid CPU, then packets mapped to that entry +are steered using plain RPS. Multiple table entries may point to the +same CPU. Indeed, with many flows and few CPUs, it is very likely that +a single application thread handles flows with many different flow hashes. + +rps_sock_flow_table is a global flow table that contains the *desired* CPU +for flows: the CPU that is currently processing the flow in userspace. +Each table value is a CPU index that is updated during calls to recvmsg +and sendmsg (specifically, inet_recvmsg(), inet_sendmsg(), inet_sendpage() +and tcp_splice_read()). + +When the scheduler moves a thread to a new CPU while it has outstanding +receive packets on the old CPU, packets may arrive out of order. To +avoid this, RFS uses a second flow table to track outstanding packets +for each flow: rps_dev_flow_table is a table specific to each hardware +receive queue of each device. Each table value stores a CPU index and a +counter. The CPU index represents the *current* CPU onto which packets +for this flow are enqueued for further kernel processing. Ideally, kernel +and userspace processing occur on the same CPU, and hence the CPU index +in both tables is identical. This is likely false if the scheduler has +recently migrated a userspace thread while the kernel still has packets +enqueued for kernel processing on the old CPU. + +The counter in rps_dev_flow_table values records the length of the current +CPU's backlog when a packet in this flow was last enqueued. Each backlog +queue has a head counter that is incremented on dequeue. A tail counter +is computed as head counter + queue length. In other words, the counter +in rps_dev_flow[i] records the last element in flow i that has +been enqueued onto the currently designated CPU for flow i (of course, +entry i is actually selected by hash and multiple flows may hash to the +same entry i). + +And now the trick for avoiding out of order packets: when selecting the +CPU for packet processing (from get_rps_cpu()) the rps_sock_flow table +and the rps_dev_flow table of the queue that the packet was received on +are compared. If the desired CPU for the flow (found in the +rps_sock_flow table) matches the current CPU (found in the rps_dev_flow +table), the packet is enqueued onto that CPU’s backlog. If they differ, +the current CPU is updated to match the desired CPU if one of the +following is true: + +- The current CPU's queue head counter >= the recorded tail counter + value in rps_dev_flow[i] +- The current CPU is unset (equal to RPS_NO_CPU) +- The current CPU is offline + +After this check, the packet is sent to the (possibly updated) current +CPU. These rules aim to ensure that a flow only moves to a new CPU when +there are no packets outstanding on the old CPU, as the outstanding +packets could arrive later than those about to be processed on the new +CPU. + +==== RFS Configuration + +RFS is only available if the kconfig symbol CONFIG_RPS is enabled (on +by default for SMP). The functionality remains disabled until explicitly +configured. The number of entries in the global flow table is set through: + + /proc/sys/net/core/rps_sock_flow_entries + +The number of entries in the per-queue flow table are set through: + + /sys/class/net/<dev>/queues/rx-<n>/rps_flow_cnt + +== Suggested Configuration + +Both of these need to be set before RFS is enabled for a receive queue. +Values for both are rounded up to the nearest power of two. The +suggested flow count depends on the expected number of active connections +at any given time, which may be significantly less than the number of open +connections. We have found that a value of 32768 for rps_sock_flow_entries +works fairly well on a moderately loaded server. + +For a single queue device, the rps_flow_cnt value for the single queue +would normally be configured to the same value as rps_sock_flow_entries. +For a multi-queue device, the rps_flow_cnt for each queue might be +configured as rps_sock_flow_entries / N, where N is the number of +queues. So for instance, if rps_sock_flow_entries is set to 32768 and there +are 16 configured receive queues, rps_flow_cnt for each queue might be +configured as 2048. + + +Accelerated RFS +=============== + +Accelerated RFS is to RFS what RSS is to RPS: a hardware-accelerated load +balancing mechanism that uses soft state to steer flows based on where +the application thread consuming the packets of each flow is running. +Accelerated RFS should perform better than RFS since packets are sent +directly to a CPU local to the thread consuming the data. The target CPU +will either be the same CPU where the application runs, or at least a CPU +which is local to the application thread’s CPU in the cache hierarchy. + +To enable accelerated RFS, the networking stack calls the +ndo_rx_flow_steer driver function to communicate the desired hardware +queue for packets matching a particular flow. The network stack +automatically calls this function every time a flow entry in +rps_dev_flow_table is updated. The driver in turn uses a device specific +method to program the NIC to steer the packets. + +The hardware queue for a flow is derived from the CPU recorded in +rps_dev_flow_table. The stack consults a CPU to hardware queue map which +is maintained by the NIC driver. This is an auto-generated reverse map of +the IRQ affinity table shown by /proc/interrupts. Drivers can use +functions in the cpu_rmap (“CPU affinity reverse mapâ€) kernel library +to populate the map. For each CPU, the corresponding queue in the map is +set to be one whose processing CPU is closest in cache locality. + +==== Accelerated RFS Configuration + +Accelerated RFS is only available if the kernel is compiled with +CONFIG_RFS_ACCEL and support is provided by the NIC device and driver. +It also requires that ntuple filtering is enabled via ethtool. The map +of CPU to queues is automatically deduced from the IRQ affinities +configured for each receive queue by the driver, so no additional +configuration should be necessary. + +== Suggested Configuration + +This technique should be enabled whenever one wants to use RFS and the +NIC supports hardware acceleration. + +XPS: Transmit Packet Steering +============================= + +Transmit Packet Steering is a mechanism for intelligently selecting +which transmit queue to use when transmitting a packet on a multi-queue +device. To accomplish this, a mapping from CPU to hardware queue(s) is +recorded. The goal of this mapping is usually to assign queues +exclusively to a subset of CPUs, where the transmit completions for +these queues are processed on a CPU within this set. This choice +provides two benefits. First, contention on the device queue lock is +significantly reduced since fewer CPUs contend for the same queue +(contention can be eliminated completely if each CPU has its own +transmit queue). Secondly, cache miss rate on transmit completion is +reduced, in particular for data cache lines that hold the sk_buff +structures. + +XPS is configured per transmit queue by setting a bitmap of CPUs that +may use that queue to transmit. The reverse mapping, from CPUs to +transmit queues, is computed and maintained for each network device. +When transmitting the first packet in a flow, the function +get_xps_queue() is called to select a queue. This function uses the ID +of the running CPU as a key into the CPU-to-queue lookup table. If the +ID matches a single queue, that is used for transmission. If multiple +queues match, one is selected by using the flow hash to compute an index +into the set. + +The queue chosen for transmitting a particular flow is saved in the +corresponding socket structure for the flow (e.g. a TCP connection). +This transmit queue is used for subsequent packets sent on the flow to +prevent out of order (ooo) packets. The choice also amortizes the cost +of calling get_xps_queues() over all packets in the flow. To avoid +ooo packets, the queue for a flow can subsequently only be changed if +skb->ooo_okay is set for a packet in the flow. This flag indicates that +there are no outstanding packets in the flow, so the transmit queue can +change without the risk of generating out of order packets. The +transport layer is responsible for setting ooo_okay appropriately. TCP, +for instance, sets the flag when all data for a connection has been +acknowledged. + +==== XPS Configuration + +XPS is only available if the kconfig symbol CONFIG_XPS is enabled (on by +default for SMP). The functionality remains disabled until explicitly +configured. To enable XPS, the bitmap of CPUs that may use a transmit +queue is configured using the sysfs file entry: + +/sys/class/net/<dev>/queues/tx-<n>/xps_cpus + +== Suggested Configuration + +For a network device with a single transmission queue, XPS configuration +has no effect, since there is no choice in this case. In a multi-queue +system, XPS is preferably configured so that each CPU maps onto one queue. +If there are as many queues as there are CPUs in the system, then each +queue can also map onto one CPU, resulting in exclusive pairings that +experience no contention. If there are fewer queues than CPUs, then the +best CPUs to share a given queue are probably those that share the cache +with the CPU that processes transmit completions for that queue +(transmit interrupts). + + +Further Information +=================== +RPS and RFS were introduced in kernel 2.6.35. XPS was incorporated into +2.6.38. Original patches were submitted by Tom Herbert +(therbert@google.com) + +Accelerated RFS was introduced in 2.6.35. Original patches were +submitted by Ben Hutchings (bwh@kernel.org) + +Authors: +Tom Herbert (therbert@google.com) +Willem de Bruijn (willemb@google.com) diff --git a/Documentation/networking/sctp.txt b/Documentation/networking/sctp.txt index 0c790a76910..97b810ca908 100644 --- a/Documentation/networking/sctp.txt +++ b/Documentation/networking/sctp.txt @@ -19,7 +19,6 @@ of SCTP that is RFC 2960 compliant and provides an programming interface referred to as the UDP-style API of the Sockets Extensions for SCTP, as proposed in IETF Internet-Drafts. - Caveats: -lksctp can be built as statically or as a module. However, be aware that @@ -33,6 +32,4 @@ For more information, please visit the lksctp project website: http://www.sf.net/projects/lksctp Or contact the lksctp developers through the mailing list: - <lksctp-developers@lists.sourceforge.net> - - + <linux-sctp@vger.kernel.org> diff --git a/Documentation/networking/secid.txt b/Documentation/networking/secid.txt new file mode 100644 index 00000000000..95ea0678433 --- /dev/null +++ b/Documentation/networking/secid.txt @@ -0,0 +1,14 @@ +flowi structure: + +The secid member in the flow structure is used in LSMs (e.g. SELinux) to indicate +the label of the flow. This label of the flow is currently used in selecting +matching labeled xfrm(s). + +If this is an outbound flow, the label is derived from the socket, if any, or +the incoming packet this flow is being generated as a response to (e.g. tcp +resets, timewait ack, etc.). It is also conceivable that the label could be +derived from other sources such as process context, device, etc., in special +cases, as may be appropriate. + +If this is an inbound flow, the label is derived from the IPSec security +associations, if any, used by the packet. diff --git a/Documentation/networking/shaper.txt b/Documentation/networking/shaper.txt deleted file mode 100644 index 6c4ebb66a90..00000000000 --- a/Documentation/networking/shaper.txt +++ /dev/null @@ -1,48 +0,0 @@ -Traffic Shaper For Linux - -This is the current BETA release of the traffic shaper for Linux. It works -within the following limits: - -o Minimum shaping speed is currently about 9600 baud (it can only -shape down to 1 byte per clock tick) - -o Maximum is about 256K, it will go above this but get a bit blocky. - -o If you ifconfig the master device that a shaper is attached to down -then your machine will follow. - -o The shaper must be a module. - - -Setup: - - A shaper device is configured using the shapeconfig program. -Typically you will do something like this - -shapecfg attach shaper0 eth1 -shapecfg speed shaper0 64000 -ifconfig shaper0 myhost netmask 255.255.255.240 broadcast 1.2.3.4.255 up -route add -net some.network netmask a.b.c.d dev shaper0 - -The shaper should have the same IP address as the device it is attached to -for normal use. - -Gotchas: - - The shaper shapes transmitted traffic. It's rather impossible to -shape received traffic except at the end (or a router) transmitting it. - - Gated/routed/rwhod/mrouted all see the shaper as an additional device -and will treat it as such unless patched. Note that for mrouted you can run -mrouted tunnels via a traffic shaper to control bandwidth usage. - - The shaper is device/route based. This makes it very easy to use -with any setup BUT less flexible. You may need to use iproute2 to set up -multiple route tables to get the flexibility. - - There is no "borrowing" or "sharing" scheme. This is a simple -traffic limiter. We implement Van Jacobson and Sally Floyd's CBQ -architecture into Linux 2.2. This is the preferred solution. Shaper is -for simple or back compatible setups. - -Alan diff --git a/Documentation/networking/sis900.txt b/Documentation/networking/sis900.txt deleted file mode 100644 index bddffd7385a..00000000000 --- a/Documentation/networking/sis900.txt +++ /dev/null @@ -1,257 +0,0 @@ - -SiS 900/7016 Fast Ethernet Device Driver - -Ollie Lho - -Lei Chun Chang - - Copyright © 1999 by Silicon Integrated System Corp. - - This document gives some information on installation and usage of SiS - 900/7016 device driver under Linux. - - This program is free software; you can redistribute it and/or modify - it under the terms of the GNU General Public License as published by - the Free Software Foundation; either version 2 of the License, or (at - your option) any later version. - - This program is distributed in the hope that it will be useful, but - WITHOUT ANY WARRANTY; without even the implied warranty of - MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU - General Public License for more details. - - You should have received a copy of the GNU General Public License - along with this program; if not, write to the Free Software - Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 - USA - _________________________________________________________________ - - Table of Contents - 1. Introduction - 2. Changes - 3. Tested Environment - 4. Files in This Package - 5. Installation - - Building the driver as loadable module - Building the driver into kernel - - 6. Known Problems and Bugs - 7. Revision History - 8. Acknowledgements - _________________________________________________________________ - -Chapter 1. Introduction - - This document describes the revision 1.06 and 1.07 of SiS 900/7016 - Fast Ethernet device driver under Linux. The driver is developed by - Silicon Integrated System Corp. and distributed freely under the GNU - General Public License (GPL). The driver can be compiled as a loadable - module and used under Linux kernel version 2.2.x. (rev. 1.06) With - minimal changes, the driver can also be used under 2.3.x and 2.4.x - kernel (rev. 1.07), please see Chapter 5. If you are intended to use - the driver for earlier kernels, you are on your own. - - The driver is tested with usual TCP/IP applications including FTP, - Telnet, Netscape etc. and is used constantly by the developers. - - Please send all comments/fixes/questions to Lei-Chun Chang. - _________________________________________________________________ - -Chapter 2. Changes - - Changes made in Revision 1.07 - - 1. Separation of sis900.c and sis900.h in order to move most constant - definition to sis900.h (many of those constants were corrected) - 2. Clean up PCI detection, the pci-scan from Donald Becker were not - used, just simple pci_find_*. - 3. MII detection is modified to support multiple mii transceiver. - 4. Bugs in read_eeprom, mdio_* were removed. - 5. Lot of sis900 irrelevant comments were removed/changed and more - comments were added to reflect the real situation. - 6. Clean up of physical/virtual address space mess in buffer - descriptors. - 7. Better transmit/receive error handling. - 8. The driver now uses zero-copy single buffer management scheme to - improve performance. - 9. Names of variables were changed to be more consistent. - 10. Clean up of auo-negotiation and timer code. - 11. Automatic detection and change of PHY on the fly. - 12. Bug in mac probing fixed. - 13. Fix 630E equalier problem by modifying the equalizer workaround - rule. - 14. Support for ICS1893 10/100 Interated PHYceiver. - 15. Support for media select by ifconfig. - 16. Added kernel-doc extratable documentation. - _________________________________________________________________ - -Chapter 3. Tested Environment - - This driver is developed on the following hardware - - * Intel Celeron 500 with SiS 630 (rev 02) chipset - * SiS 900 (rev 01) and SiS 7016/7014 Fast Ethernet Card - - and tested with these software environments - - * Red Hat Linux version 6.2 - * Linux kernel version 2.4.0 - * Netscape version 4.6 - * NcFTP 3.0.0 beta 18 - * Samba version 2.0.3 - _________________________________________________________________ - -Chapter 4. Files in This Package - - In the package you can find these files: - - sis900.c - Driver source file in C - - sis900.h - Header file for sis900.c - - sis900.sgml - DocBook SGML source of the document - - sis900.txt - Driver document in plain text - _________________________________________________________________ - -Chapter 5. Installation - - Silicon Integrated System Corp. is cooperating closely with core Linux - Kernel developers. The revisions of SiS 900 driver are distributed by - the usuall channels for kernel tar files and patches. Those kernel tar - files for official kernel and patches for kernel pre-release can be - download at official kernel ftp site and its mirrors. The 1.06 - revision can be found in kernel version later than 2.3.15 and - pre-2.2.14, and 1.07 revision can be found in kernel version 2.4.0. If - you have no prior experience in networking under Linux, please read - Ethernet HOWTO and Networking HOWTO available from Linux Documentation - Project (LDP). - - The driver is bundled in release later than 2.2.11 and 2.3.15 so this - is the most easy case. Be sure you have the appropriate packages for - compiling kernel source. Those packages are listed in Document/Changes - in kernel source distribution. If you have to install the driver other - than those bundled in kernel release, you should have your driver file - sis900.c and sis900.h copied into /usr/src/linux/drivers/net/ first. - There are two alternative ways to install the driver - _________________________________________________________________ - -Building the driver as loadable module - - To build the driver as a loadable kernel module you have to - reconfigure the kernel to activate network support by - -make menuconfig - - Choose "Loadable module support --->", then select "Enable loadable - module support". - - Choose "Network Device Support --->", select "Ethernet (10 or - 100Mbit)". Then select "EISA, VLB, PCI and on board controllers", and - choose "SiS 900/7016 PCI Fast Ethernet Adapter support" to "M". - - After reconfiguring the kernel, you can make the driver module by - -make modules - - The driver should be compiled with no errors. After compiling the - driver, the driver can be installed to proper place by - -make modules_install - - Load the driver into kernel by - -insmod sis900 - - When loading the driver into memory, some information message can be - view by - -dmesg - - or -cat /var/log/message - - If the driver is loaded properly you will have messages similar to - this: - -sis900.c: v1.07.06 11/07/2000 -eth0: SiS 900 PCI Fast Ethernet at 0xd000, IRQ 10, 00:00:e8:83:7f:a4. -eth0: SiS 900 Internal MII PHY transceiver found at address 1. -eth0: Using SiS 900 Internal MII PHY as default - - showing the version of the driver and the results of probing routine. - - Once the driver is loaded, network can be brought up by - -/sbin/ifconfig eth0 IPADDR broadcast BROADCAST netmask NETMASK media TYPE - - where IPADDR, BROADCAST, NETMASK are your IP address, broadcast - address and netmask respectively. TYPE is used to set medium type used - by the device. Typical values are "10baseT"(twisted-pair 10Mbps - Ethernet) or "100baseT" (twisted-pair 100Mbps Ethernet). For more - information on how to configure network interface, please refer to - Networking HOWTO. - - The link status is also shown by kernel messages. For example, after - the network interface is activated, you may have the message: - -eth0: Media Link On 100mbps full-duplex - - If you try to unplug the twist pair (TP) cable you will get - -eth0: Media Link Off - - indicating that the link is failed. - _________________________________________________________________ - -Building the driver into kernel - - If you want to make the driver into kernel, choose "Y" rather than "M" - on "SiS 900/7016 PCI Fast Ethernet Adapter support" when configuring - the kernel. Build the kernel image in the usual way - -make clean - -make bzlilo - - Next time the system reboot, you have the driver in memory. - _________________________________________________________________ - -Chapter 6. Known Problems and Bugs - - There are some known problems and bugs. If you find any other bugs - please mail to lcchang@sis.com.tw - - 1. AM79C901 HomePNA PHY is not thoroughly tested, there may be some - bugs in the "on the fly" change of transceiver. - 2. A bug is hidden somewhere in the receive buffer management code, - the bug causes NULL pointer reference in the kernel. This fault is - caught before bad things happen and reported with the message: - eth0: NULL pointer encountered in Rx ring, skipping which can be - viewed with dmesg or cat /var/log/message. - 3. The media type change from 10Mbps to 100Mbps twisted-pair ethernet - by ifconfig causes the media link down. - _________________________________________________________________ - -Chapter 7. Revision History - - * November 13, 2000, Revision 1.07, seventh release, 630E problem - fixed and further clean up. - * November 4, 1999, Revision 1.06, Second release, lots of clean up - and optimization. - * August 8, 1999, Revision 1.05, Initial Public Release - _________________________________________________________________ - -Chapter 8. Acknowledgements - - This driver was originally derived form Donald Becker's pci-skeleton - and rtl8139 drivers. Donald also provided various suggestion regarded - with improvements made in revision 1.06. - - The 1.05 revision was created by Jim Huang, AMD 79c901 support was - added by Chin-Shan Li. diff --git a/Documentation/networking/sk98lin.txt b/Documentation/networking/sk98lin.txt deleted file mode 100644 index 851fc97bb22..00000000000 --- a/Documentation/networking/sk98lin.txt +++ /dev/null @@ -1,568 +0,0 @@ -(C)Copyright 1999-2004 Marvell(R). -All rights reserved -=========================================================================== - -sk98lin.txt created 13-Feb-2004 - -Readme File for sk98lin v6.23 -Marvell Yukon/SysKonnect SK-98xx Gigabit Ethernet Adapter family driver for LINUX - -This file contains - 1 Overview - 2 Required Files - 3 Installation - 3.1 Driver Installation - 3.2 Inclusion of adapter at system start - 4 Driver Parameters - 4.1 Per-Port Parameters - 4.2 Adapter Parameters - 5 Large Frame Support - 6 VLAN and Link Aggregation Support (IEEE 802.1, 802.1q, 802.3ad) - 7 Troubleshooting - -=========================================================================== - - -1 Overview -=========== - -The sk98lin driver supports the Marvell Yukon and SysKonnect -SK-98xx/SK-95xx compliant Gigabit Ethernet Adapter on Linux. It has -been tested with Linux on Intel/x86 machines. -*** - - -2 Required Files -================= - -The linux kernel source. -No additional files required. -*** - - -3 Installation -=============== - -It is recommended to download the latest version of the driver from the -SysKonnect web site www.syskonnect.com. If you have downloaded the latest -driver, the Linux kernel has to be patched before the driver can be -installed. For details on how to patch a Linux kernel, refer to the -patch.txt file. - -3.1 Driver Installation ------------------------- - -The following steps describe the actions that are required to install -the driver and to start it manually. These steps should be carried -out for the initial driver setup. Once confirmed to be ok, they can -be included in the system start. - -NOTE 1: To perform the following tasks you need 'root' access. - -NOTE 2: In case of problems, please read the section "Troubleshooting" - below. - -The driver can either be integrated into the kernel or it can be compiled -as a module. Select the appropriate option during the kernel -configuration. - -Compile/use the driver as a module ----------------------------------- -To compile the driver, go to the directory /usr/src/linux and -execute the command "make menuconfig" or "make xconfig" and proceed as -follows: - -To integrate the driver permanently into the kernel, proceed as follows: - -1. Select the menu "Network device support" and then "Ethernet(1000Mbit)" -2. Mark "Marvell Yukon Chipset / SysKonnect SK-98xx family support" - with (*) -3. Build a new kernel when the configuration of the above options is - finished. -4. Install the new kernel. -5. Reboot your system. - -To use the driver as a module, proceed as follows: - -1. Enable 'loadable module support' in the kernel. -2. For automatic driver start, enable the 'Kernel module loader'. -3. Select the menu "Network device support" and then "Ethernet(1000Mbit)" -4. Mark "Marvell Yukon Chipset / SysKonnect SK-98xx family support" - with (M) -5. Execute the command "make modules". -6. Execute the command "make modules_install". - The appropiate modules will be installed. -7. Reboot your system. - - -Load the module manually ------------------------- -To load the module manually, proceed as follows: - -1. Enter "modprobe sk98lin". -2. If a Marvell Yukon or SysKonnect SK-98xx adapter is installed in - your computer and you have a /proc file system, execute the command: - "ls /proc/net/sk98lin/" - This should produce an output containing a line with the following - format: - eth0 eth1 ... - which indicates that your adapter has been found and initialized. - - NOTE 1: If you have more than one Marvell Yukon or SysKonnect SK-98xx - adapter installed, the adapters will be listed as 'eth0', - 'eth1', 'eth2', etc. - For each adapter, repeat steps 3 and 4 below. - - NOTE 2: If you have other Ethernet adapters installed, your Marvell - Yukon or SysKonnect SK-98xx adapter will be mapped to the - next available number, e.g. 'eth1'. The mapping is executed - automatically. - The module installation message (displayed either in a system - log file or on the console) prints a line for each adapter - found containing the corresponding 'ethX'. - -3. Select an IP address and assign it to the respective adapter by - entering: - ifconfig eth0 <ip-address> - With this command, the adapter is connected to the Ethernet. - - SK-98xx Gigabit Ethernet Server Adapters: The yellow LED on the adapter - is now active, the link status LED of the primary port is active and - the link status LED of the secondary port (on dual port adapters) is - blinking (if the ports are connected to a switch or hub). - SK-98xx V2.0 Gigabit Ethernet Adapters: The link status LED is active. - In addition, you will receive a status message on the console stating - "ethX: network connection up using port Y" and showing the selected - connection parameters (x stands for the ethernet device number - (0,1,2, etc), y stands for the port name (A or B)). - - NOTE: If you are in doubt about IP addresses, ask your network - administrator for assistance. - -4. Your adapter should now be fully operational. - Use 'ping <otherstation>' to verify the connection to other computers - on your network. -5. To check the adapter configuration view /proc/net/sk98lin/[devicename]. - For example by executing: - "cat /proc/net/sk98lin/eth0" - -Unload the module ------------------ -To stop and unload the driver modules, proceed as follows: - -1. Execute the command "ifconfig eth0 down". -2. Execute the command "rmmod sk98lin". - -3.2 Inclusion of adapter at system start ------------------------------------------ - -Since a large number of different Linux distributions are -available, we are unable to describe a general installation procedure -for the driver module. -Because the driver is now integrated in the kernel, installation should -be easy, using the standard mechanism of your distribution. -Refer to the distribution's manual for installation of ethernet adapters. - -*** - -4 Driver Parameters -==================== - -Parameters can be set at the command line after the module has been -loaded with the command 'modprobe'. -In some distributions, the configuration tools are able to pass parameters -to the driver module. - -If you use the kernel module loader, you can set driver parameters -in the file /etc/modprobe.conf (or /etc/modules.conf in 2.4 or earlier). -To set the driver parameters in this file, proceed as follows: - -1. Insert a line of the form : - options sk98lin ... - For "...", the same syntax is required as described for the command - line paramaters of modprobe below. -2. To activate the new parameters, either reboot your computer - or - unload and reload the driver. - The syntax of the driver parameters is: - - modprobe sk98lin parameter=value1[,value2[,value3...]] - - where value1 refers to the first adapter, value2 to the second etc. - -NOTE: All parameters are case sensitive. Write them exactly as shown - below. - -Example: -Suppose you have two adapters. You want to set auto-negotiation -on the first adapter to ON and on the second adapter to OFF. -You also want to set DuplexCapabilities on the first adapter -to FULL, and on the second adapter to HALF. -Then, you must enter: - - modprobe sk98lin AutoNeg_A=On,Off DupCap_A=Full,Half - -NOTE: The number of adapters that can be configured this way is - limited in the driver (file skge.c, constant SK_MAX_CARD_PARAM). - The current limit is 16. If you happen to install - more adapters, adjust this and recompile. - - -4.1 Per-Port Parameters ------------------------- - -These settings are available for each port on the adapter. -In the following description, '?' stands for the port for -which you set the parameter (A or B). - -Speed ------ -Parameter: Speed_? -Values: 10, 100, 1000, Auto -Default: Auto - -This parameter is used to set the speed capabilities. It is only valid -for the SK-98xx V2.0 copper adapters. -Usually, the speed is negotiated between the two ports during link -establishment. If this fails, a port can be forced to a specific setting -with this parameter. - -Auto-Negotiation ----------------- -Parameter: AutoNeg_? -Values: On, Off, Sense -Default: On - -The "Sense"-mode automatically detects whether the link partner supports -auto-negotiation or not. - -Duplex Capabilities -------------------- -Parameter: DupCap_? -Values: Half, Full, Both -Default: Both - -This parameters is only relevant if auto-negotiation for this port is -not set to "Sense". If auto-negotiation is set to "On", all three values -are possible. If it is set to "Off", only "Full" and "Half" are allowed. -This parameter is usefull if your link partner does not support all -possible combinations. - -Flow Control ------------- -Parameter: FlowCtrl_? -Values: Sym, SymOrRem, LocSend, None -Default: SymOrRem - -This parameter can be used to set the flow control capabilities the -port reports during auto-negotiation. It can be set for each port -individually. -Possible modes: - -- Sym = Symmetric: both link partners are allowed to send - PAUSE frames - -- SymOrRem = SymmetricOrRemote: both or only remote partner - are allowed to send PAUSE frames - -- LocSend = LocalSend: only local link partner is allowed - to send PAUSE frames - -- None = no link partner is allowed to send PAUSE frames - -NOTE: This parameter is ignored if auto-negotiation is set to "Off". - -Role in Master-Slave-Negotiation (1000Base-T only) --------------------------------------------------- -Parameter: Role_? -Values: Auto, Master, Slave -Default: Auto - -This parameter is only valid for the SK-9821 and SK-9822 adapters. -For two 1000Base-T ports to communicate, one must take the role of the -master (providing timing information), while the other must be the -slave. Usually, this is negotiated between the two ports during link -establishment. If this fails, a port can be forced to a specific setting -with this parameter. - - -4.2 Adapter Parameters ------------------------ - -Connection Type (SK-98xx V2.0 copper adapters only) ---------------- -Parameter: ConType -Values: Auto, 100FD, 100HD, 10FD, 10HD -Default: Auto - -The parameter 'ConType' is a combination of all five per-port parameters -within one single parameter. This simplifies the configuration of both ports -of an adapter card! The different values of this variable reflect the most -meaningful combinations of port parameters. - -The following table shows the values of 'ConType' and the corresponding -combinations of the per-port parameters: - - ConType | DupCap AutoNeg FlowCtrl Role Speed - ----------+------------------------------------------------------ - Auto | Both On SymOrRem Auto Auto - 100FD | Full Off None Auto (ignored) 100 - 100HD | Half Off None Auto (ignored) 100 - 10FD | Full Off None Auto (ignored) 10 - 10HD | Half Off None Auto (ignored) 10 - -Stating any other port parameter together with this 'ConType' variable -will result in a merged configuration of those settings. This due to -the fact, that the per-port parameters (e.g. Speed_? ) have a higher -priority than the combined variable 'ConType'. - -NOTE: This parameter is always used on both ports of the adapter card. - -Interrupt Moderation --------------------- -Parameter: Moderation -Values: None, Static, Dynamic -Default: None - -Interrupt moderation is employed to limit the maxmimum number of interrupts -the driver has to serve. That is, one or more interrupts (which indicate any -transmit or receive packet to be processed) are queued until the driver -processes them. When queued interrupts are to be served, is determined by the -'IntsPerSec' parameter, which is explained later below. - -Possible modes: - - -- None - No interrupt moderation is applied on the adapter card. - Therefore, each transmit or receive interrupt is served immediately - as soon as it appears on the interrupt line of the adapter card. - - -- Static - Interrupt moderation is applied on the adapter card. - All transmit and receive interrupts are queued until a complete - moderation interval ends. If such a moderation interval ends, all - queued interrupts are processed in one big bunch without any delay. - The term 'static' reflects the fact, that interrupt moderation is - always enabled, regardless how much network load is currently - passing via a particular interface. In addition, the duration of - the moderation interval has a fixed length that never changes while - the driver is operational. - - -- Dynamic - Interrupt moderation might be applied on the adapter card, - depending on the load of the system. If the driver detects that the - system load is too high, the driver tries to shield the system against - too much network load by enabling interrupt moderation. If - at a later - time - the CPU utilizaton decreases again (or if the network load is - negligible) the interrupt moderation will automatically be disabled. - -Interrupt moderation should be used when the driver has to handle one or more -interfaces with a high network load, which - as a consequence - leads also to a -high CPU utilization. When moderation is applied in such high network load -situations, CPU load might be reduced by 20-30%. - -NOTE: The drawback of using interrupt moderation is an increase of the round- -trip-time (RTT), due to the queueing and serving of interrupts at dedicated -moderation times. - -Interrupts per second ---------------------- -Parameter: IntsPerSec -Values: 30...40000 (interrupts per second) -Default: 2000 - -This parameter is only used, if either static or dynamic interrupt moderation -is used on a network adapter card. Using this paramter if no moderation is -applied, will lead to no action performed. - -This parameter determines the length of any interrupt moderation interval. -Assuming that static interrupt moderation is to be used, an 'IntsPerSec' -parameter value of 2000 will lead to an interrupt moderation interval of -500 microseconds. - -NOTE: The duration of the moderation interval is to be chosen with care. -At first glance, selecting a very long duration (e.g. only 100 interrupts per -second) seems to be meaningful, but the increase of packet-processing delay -is tremendous. On the other hand, selecting a very short moderation time might -compensate the use of any moderation being applied. - - -Preferred Port --------------- -Parameter: PrefPort -Values: A, B -Default: A - -This is used to force the preferred port to A or B (on dual-port network -adapters). The preferred port is the one that is used if both are detected -as fully functional. - -RLMT Mode (Redundant Link Management Technology) ------------------------------------------------- -Parameter: RlmtMode -Values: CheckLinkState,CheckLocalPort, CheckSeg, DualNet -Default: CheckLinkState - -RLMT monitors the status of the port. If the link of the active port -fails, RLMT switches immediately to the standby link. The virtual link is -maintained as long as at least one 'physical' link is up. - -Possible modes: - - -- CheckLinkState - Check link state only: RLMT uses the link state - reported by the adapter hardware for each individual port to - determine whether a port can be used for all network traffic or - not. - - -- CheckLocalPort - In this mode, RLMT monitors the network path - between the two ports of an adapter by regularly exchanging packets - between them. This mode requires a network configuration in which - the two ports are able to "see" each other (i.e. there must not be - any router between the ports). - - -- CheckSeg - Check local port and segmentation: This mode supports the - same functions as the CheckLocalPort mode and additionally checks - network segmentation between the ports. Therefore, this mode is only - to be used if Gigabit Ethernet switches are installed on the network - that have been configured to use the Spanning Tree protocol. - - -- DualNet - In this mode, ports A and B are used as separate devices. - If you have a dual port adapter, port A will be configured as eth0 - and port B as eth1. Both ports can be used independently with - distinct IP addresses. The preferred port setting is not used. - RLMT is turned off. - -NOTE: RLMT modes CLP and CLPSS are designed to operate in configurations - where a network path between the ports on one adapter exists. - Moreover, they are not designed to work where adapters are connected - back-to-back. -*** - - -5 Large Frame Support -====================== - -The driver supports large frames (also called jumbo frames). Using large -frames can result in an improved throughput if transferring large amounts -of data. -To enable large frames, set the MTU (maximum transfer unit) of the -interface to the desired value (up to 9000), execute the following -command: - ifconfig eth0 mtu 9000 -This will only work if you have two adapters connected back-to-back -or if you use a switch that supports large frames. When using a switch, -it should be configured to allow large frames and auto-negotiation should -be set to OFF. The setting must be configured on all adapters that can be -reached by the large frames. If one adapter is not set to receive large -frames, it will simply drop them. - -You can switch back to the standard ethernet frame size by executing the -following command: - ifconfig eth0 mtu 1500 - -To permanently configure this setting, add a script with the 'ifconfig' -line to the system startup sequence (named something like "S99sk98lin" -in /etc/rc.d/rc2.d). -*** - - -6 VLAN and Link Aggregation Support (IEEE 802.1, 802.1q, 802.3ad) -================================================================== - -The Marvell Yukon/SysKonnect Linux drivers are able to support VLAN and -Link Aggregation according to IEEE standards 802.1, 802.1q, and 802.3ad. -These features are only available after installation of open source -modules available on the Internet: -For VLAN go to: http://www.candelatech.com/~greear/vlan.html -For Link Aggregation go to: http://www.st.rim.or.jp/~yumo - -NOTE: SysKonnect GmbH does not offer any support for these open source - modules and does not take the responsibility for any kind of - failures or problems arising in connection with these modules. - -NOTE: Configuring Link Aggregation on a SysKonnect dual link adapter may - cause problems when unloading the driver. - - -7 Troubleshooting -================== - -If any problems occur during the installation process, check the -following list: - - -Problem: The SK-98xx adapter can not be found by the driver. -Solution: In /proc/pci search for the following entry: - 'Ethernet controller: SysKonnect SK-98xx ...' - If this entry exists, the SK-98xx or SK-98xx V2.0 adapter has - been found by the system and should be operational. - If this entry does not exist or if the file '/proc/pci' is not - found, there may be a hardware problem or the PCI support may - not be enabled in your kernel. - The adapter can be checked using the diagnostics program which - is available on the SysKonnect web site: - www.syskonnect.com - - Some COMPAQ machines have problems dealing with PCI under Linux. - Linux. This problem is described in the 'PCI howto' document - (included in some distributions or available from the - web, e.g. at 'www.linux.org'). - - -Problem: Programs such as 'ifconfig' or 'route' can not be found or the - error message 'Operation not permitted' is displayed. -Reason: You are not logged in as user 'root'. -Solution: Logout and login as 'root' or change to 'root' via 'su'. - - -Problem: Upon use of the command 'ping <address>' the message - "ping: sendto: Network is unreachable" is displayed. -Reason: Your route is not set correctly. -Solution: If you are using RedHat, you probably forgot to set up the - route in the 'network configuration'. - Check the existing routes with the 'route' command and check - if an entry for 'eth0' exists, and if so, if it is set correctly. - - -Problem: The driver can be started, the adapter is connected to the - network, but you cannot receive or transmit any packets; - e.g. 'ping' does not work. -Reason: There is an incorrect route in your routing table. -Solution: Check the routing table with the command 'route' and read the - manual help pages dealing with routes (enter 'man route'). - -NOTE: Although the 2.2.x kernel versions generate the routing entry - automatically, problems of this kind may occur here as well. We've - come across a situation in which the driver started correctly at - system start, but after the driver has been removed and reloaded, - the route of the adapter's network pointed to the 'dummy0'device - and had to be corrected manually. - - -Problem: Your computer should act as a router between multiple - IP subnetworks (using multiple adapters), but computers in - other subnetworks cannot be reached. -Reason: Either the router's kernel is not configured for IP forwarding - or the routing table and gateway configuration of at least one - computer is not working. - -Problem: Upon driver start, the following error message is displayed: - "eth0: -- ERROR -- - Class: internal Software error - Nr: 0xcc - Msg: SkGeInitPort() cannot init running ports" -Reason: You are using a driver compiled for single processor machines - on a multiprocessor machine with SMP (Symmetric MultiProcessor) - kernel. -Solution: Configure your kernel appropriately and recompile the kernel or - the modules. - - - -If your problem is not listed here, please contact SysKonnect's technical -support for help (linux@syskonnect.de). -When contacting our technical support, please ensure that the following -information is available: -- System Manufacturer and HW Informations (CPU, Memory... ) -- PCI-Boards in your system -- Distribution -- Kernel version -- Driver version -*** - - - -***End of Readme File*** diff --git a/Documentation/networking/skfp.txt b/Documentation/networking/skfp.txt index 3a419ed42f8..203ec66c9fb 100644 --- a/Documentation/networking/skfp.txt +++ b/Documentation/networking/skfp.txt @@ -68,7 +68,7 @@ Compaq adapters (not tested): ======================= From v2.01 on, the driver is integrated in the linux kernel sources. -Therefor, the installation is the same as for any other adapter +Therefore, the installation is the same as for any other adapter supported by the kernel. Refer to the manual of your distribution about the installation of network adapters. @@ -81,7 +81,7 @@ Makes my life much easier :-) If you run into problems during installation, check those items: -Problem: The FDDI adapter can not be found by the driver. +Problem: The FDDI adapter cannot be found by the driver. Reason: Look in /proc/pci for the following entry: 'FDDI network controller: SysKonnect SK-FDDI-PCI ...' If this entry exists, then the FDDI adapter has been @@ -99,7 +99,7 @@ Reason: Look in /proc/pci for the following entry: Problem: You want to use your computer as a router between multiple IP subnetworks (using multiple adapters), but - you can not reach computers in other subnetworks. + you cannot reach computers in other subnetworks. Reason: Either the router's kernel is not configured for IP forwarding or there is a problem with the routing table and gateway configuration in at least one of the diff --git a/Documentation/networking/slicecom.hun b/Documentation/networking/slicecom.hun deleted file mode 100644 index 5acf1918694..00000000000 --- a/Documentation/networking/slicecom.hun +++ /dev/null @@ -1,371 +0,0 @@ - -SliceCOM adapter felhasznaloi dokumentacioja - 0.51 verziohoz - -Bartók István <bartoki@itc.hu> -Utolso modositas: Wed Aug 29 17:26:58 CEST 2001 - ------------------------------------------------------------------ - -Hasznalata: - -Forditas: - -Code maturity level options - [*] Prompt for development and/or incomplete code/drivers - -Network device support - Wan interfaces - <M> MultiGate (COMX) synchronous - <M> Support for MUNICH based boards: SliceCOM, PCICOM (NEW) - <M> Support for HDLC and syncPPP... - - -A modulok betoltese: - -modprobe comx - -modprobe comx-proto-ppp # a Cisco-HDLC es a SyncPPP protokollt is - # ez a modul adja - -modprobe comx-hw-munich # a modul betoltodeskor azonnal jelent a - # syslogba a detektalt kartyakrol - - -Konfiguralas: - -# Ezen az interfeszen Cisco-HDLC vonali protokoll fog futni -# Az interfeszhez rendelt idoszeletek: 1,2 (128 kbit/sec-es vonal) -# (a G.703 keretben az elso adatot vivo idoszelet az 1-es) -# -mkdir /proc/comx/comx0.1/ -echo slicecom >/proc/comx/comx0.1/boardtype -echo hdlc >/proc/comx/comx0.1/protocol -echo 1 2 >/proc/comx/comx0.1/timeslots - - -# Ezen az interfeszen SyncPPP vonali protokoll fog futni -# Az interfeszhez rendelt idoszelet: 3 (64 kbit/sec-es vonal) -# -mkdir /proc/comx/comx0.2/ -echo slicecom >/proc/comx/comx0.2/boardtype -echo ppp >/proc/comx/comx0.2/protocol -echo 3 >/proc/comx/comx0.2/timeslots - -... - -ifconfig comx0.1 up -ifconfig comx0.2 up - ------------------------------------------------------------------ - -A COMX driverek default 20 csomagnyi transmit queue-t rendelnek a halozati -interfeszekhez. WAN halozatokban ennel hosszabbat is szokas hasznalni -(20 es 100 kozott), hogy a vonal kihasznaltsaga nagy terheles eseten jobb -legyen (bar ezzel megno a varhato kesleltetes a csomagok sorban allasa miatt): - -# ifconfig comx0 txqueuelen 50 - -Ezt a beallitasi lehetoseget csak az ujabb disztribuciok ifconfig parancsa -tamogatja (amik mar a 2.2 kernelekhez keszultek, mint a RedHat 6.1 vagy a -Debian 2.2). - -A 2.1-es Debian disztribuciohoz a http://www.debian.org/~rcw/2.2/netbase/ -cimrol toltheto le ujabb netbase csomag, ami mar ilyet tamogato ifconfig -parancsot tartalmaz. Bovebben a 2.2 kernel hasznalatarol Debian 2.1 alatt: -http://www.debian.org/releases/stable/running-kernel-2.2 - ------------------------------------------------------------------ - -A kartya LED-jeinek jelentese: - -piros - eg, ha Remote Alarm-ot kuld a tuloldal -zold - eg, ha a vett jelben megtalalja a keretszinkront - -Reszletesebben: - -piros: zold: jelentes: - -- - nincs keretszinkron (nincs jel, vagy rossz a jel) -- eg "minden rendben" -eg eg a vetel OK, de a tuloldal Remote Alarm-ot kuld -eg - ez nincs ertelmezve, egyelore funkcio nelkul - ------------------------------------------------------------------ - -Reszletesebb leiras a hardver beallitasi lehetosegeirol: - -Az altalanos,- es a protokoll-retegek beallitasi lehetosegeirol a 'comx.txt' -fajlban leirtak SliceCOM kartyanal is ervenyesek, itt csak a hardver-specifikus -beallitasi lehetosegek vannak osszefoglalva: - -Konfiguralasi interfesz a /proc/comx/ alatt: - -Minden timeslot-csoportnak kulon comx* interfeszt kell letrehozni mkdir-rel: -comx0, comx1, .. stb. Itt beallithato, hogy az adott interfesz hanyadik kartya -melyik timeslotja(i)bol alljon ossze. A Cisco-fele serial3:1 elnevezesek -(serial3:1 = a 3. kartyaban az 1-es idoszelet-csoport) Linuxon aliasing-ot -jelentenenek, ezert mi nem tudunk ilyen elnevezest hasznalni. - -Tobb kartya eseten a comx0.1, comx0.2, ... vagy slice0.1, slice0.2 nevek -hasznalhatoak. - -Tobb SliceCOM kartya is lehet egy gepben, de sajat interrupt kell mindegyiknek, -nem tud meg megosztott interruptot kezelni. - -Az egesz kartyat erinto beallitasok: - -Az ioport es irq beallitas nincs: amit a PCI BIOS kioszt a rendszernek, -azt hasznalja a driver. - - -comx0/boardnum - hanyadik SliceCOM kartya a gepben (a 'termeszetes' PCI - sorrendben ertve: ahogyan a /proc/pci-ban vagy az 'lspci' - kimeneteben megjelenik, altalaban az alaplapi PCI meghajto - aramkorokhoz kozelebb eso kartyak a kisebb sorszamuak) - - Default: 0 (0-tol kezdodik a szamolas) - - -Bar a kovetkezoket csak egy-egy interfeszen allitjuk at, megis az egesz kartya -mukodeset egyszerre allitjak. A megkotes hogy csak UP-ban levo interfeszen -hasznalhatoak, azert van, mert kulonben nem vart eredmenyekre vezetne egy ilyen -paranccsorozat: - - echo 0 >boardnum - echo internal >clock_source - echo 1 >boardnum - -- Ez a 0-s board clock_source-at allitana at. - -Ezek a beallitasok megmaradnak az osszes interfesz torlesekor, de torlodnek -a driver modul ki/betoltesekor. - - -comx0/clock_source - A Tx orajelforrasa, a Cisco-val hasonlatosra keszult. - Hasznalata: - - papaya:# echo line >/proc/comx/comx0/clock_source - papaya:# echo internal >/proc/comx/comx0/clock_source - - line - A Tx orajelet a vett adatfolyambol dekodolja, igyekszik - igazodni hozza. Ha nem lat orajelet az inputon, akkor - atall a sajat orajelgeneratorara. - internal - A Tx orajelet a sajat orajelgeneratora szolgaltatja. - - Default: line - - Normal osszeallitas eseten a tavkozlesi szolgaltato eszkoze - (pl. HDSL modem) adja az orajelet, ezert ez a default. - - -comx0/framing - A CRC4 ki/be kapcsolasa - - A CRC4: 16 PCM keretet (A PCM keret az, amibe a 32 darab 64 - kilobites csatorna van bemultiplexalva. Nem osszetevesztendo a HDLC - kerettel.) 2x8 -as csoportokra osztanak, es azokhoz 4-4 bites CRC-t - szamolnak. Elsosorban a vonal minosegenek a monitorozasara szolgal. - - papaya:~# echo crc4 >/proc/comx/comx0/framing - papaya:~# echo no-crc4 >/proc/comx/comx0/framing - - Default a 'crc4', a MATAV vonalak altalaban igy futnak. De ha nem - egyforma is a beallitas a vonal ket vegen, attol a forgalom altalaban - at tud menni. - - -comx0/linecode - A vonali kodolas beallitasa - - papaya:~# echo hdb3 >/proc/comx/comx0/linecode - papaya:~# echo ami >/proc/comx/comx0/linecode - - Default a 'hdb3', a MATAV vonalak igy futnak. - - (az AMI kodolas igen ritka E1-es vonalaknal). Ha ez a beallitas nem - egyezik a vonal ket vegen, akkor elofordulhat hogy a keretszinkron - osszejon, de CRC4-hibak es a vonalakon atvitt adatokban is hibak - keletkeznek (amit a HDLC/SyncPPP szinten CRC-hibaval jelez) - - -comx0/reg - a kartya aramkoreinek, a MUNICH (reg) es a FALC (lbireg) -comx0/lbireg regisztereinek kozvetlen elerese. Hasznalata: - - echo >reg 0x04 0x0 - a 4-es regiszterbe 0-t ir - echo >reg 0x104 - printk()-val kiirja a 4-es regiszter - tartalmat a syslogba. - - WARNING: ezek csak a fejleszteshez keszultek, sok galibat - lehet veluk okozni! - - -comx0/loopback - A kartya G.703 jelenek a visszahurkolasara is van lehetoseg: - - papaya:# echo none >/proc/comx/comx0/loopback - papaya:# echo local >/proc/comx/comx0/loopback - papaya:# echo remote >/proc/comx/comx0/loopback - - none - nincs visszahurkolas, normal mukodes - local - a kartya a sajat maga altal adott jelet kapja vissza - remote - a kartya a kivulrol vett jelet adja kifele - - Default: none - ------------------------------------------------------------------ - -Az interfeszhez (Cisco terminologiaban 'channel-group') kapcsolodo beallitasok: - -comx0/timeslots - mely timeslotok (idoszeletek) tartoznak az adott interfeszhez. - - papaya:~# cat /proc/comx/comx0/timeslots - 1 3 4 5 6 - papaya:~# - - Egy timeslot megkeresese (hanyas interfeszbe tartozik nalunk): - - papaya:~# grep ' 4' /proc/comx/comx*/timeslots - /proc/comx/comx0/timeslots:1 3 4 5 6 - papaya:~# - - Beallitasa: - papaya:~# echo '1 5 2 6 7 8' >/proc/comx/comx0/timeslots - - A timeslotok sorrendje nem szamit, '1 3 2' ugyanaz mint az '1 2 3'. - - Beallitashoz az adott interfesznek DOWN-ban kell lennie - (ifconfig comx0 down), de ugyanannak a kartyanak a tobbi interfesze - uzemelhet kozben. - - Beallitaskor leellenorzi, hogy az uj timeslotok nem utkoznek-e egy - masik interfesz timeslotjaival. Ha utkoznek, akkor nem allitja at. - - Mindig 10-es szamrendszerben tortenik a timeslotok ertelmezese, nehogy - a 08, 09 alaku felirast rosszul ertelmezze. - ------------------------------------------------------------------ - -Az interfeszek es a kartya allapotanak lekerdezese: - -- A ' '-szel kezdodo sorok az eredeti kimenetet, a //-rel kezdodo sorok a -magyarazatot jelzik. - - papaya:~$ cat /proc/comx/comx1/status - Interface administrative status is UP, modem status is UP, protocol is UP - Modem status changes: 0, Transmitter status is IDLE, tbusy: 0 - Interface load (input): 978376 / 947808 / 951024 bits/s (5s/5m/15m) - (output): 978376 / 947848 / 951024 bits/s (5s/5m/15m) - Debug flags: none - RX errors: len: 22, overrun: 1, crc: 0, aborts: 0 - buffer overrun: 0, pbuffer overrun: 0 - TX errors: underrun: 0 - Line keepalive (value: 10) status UP [0] - -// Itt kezdodik a hardver-specifikus resz: - Controller status: - No alarms - -// Alarm: hibajelzes: -// -// No alarms - minden rendben -// -// LOS - Loss Of Signal - nem erzekel jelet a bemeneten. -// AIS - Alarm Indication Signal - csak egymas utani 1-esek jonnek -// a bemeneten, a tuloldal igy is jelezheti hogy meghibasodott vagy -// nincs inicializalva. -// AUXP - Auxiliary Pattern Indication - 01010101.. sorozat jon a bemeneten. -// LFA - Loss of Frame Alignment - nincs keretszinkron -// RRA - Receive Remote Alarm - a tuloldal el, de hibat jelez. -// LMFA - Loss of CRC4 Multiframe Alignment - nincs CRC4-multikeret-szinkron -// NMF - No Multiframe alignment Found after 400 msec - ilyen alarm a no-crc4 -// es crc4 keretezesek eseten nincs, lasd lentebb -// -// Egyeb lehetseges hibajelzesek: -// -// Transmit Line Short - a kartya ugy erzi hogy az adasi kimenete rovidre -// van zarva, ezert kikapcsolta az adast. (nem feltetlenul veszi eszre -// a kulso rovidzarat) - -// A veteli oldal csomagjainak lancolt listai, debug celokra: - - Rx ring: - rafutott: 0 - lastcheck: 50845731, jiffies: 51314281 - base: 017b1858 - rx_desc_ptr: 0 - rx_desc_ptr: 017b1858 - hw_curr_ptr: 017b1858 - 06040000 017b1868 017b1898 c016ff00 - 06040000 017b1878 017b1e9c c016ff00 - 46040000 017b1888 017b24a0 c016ff00 - 06040000 017b1858 017b2aa4 c016ff00 - -// A kartyat hasznalo tobbi interfesz: a 0-s channel-group a comx1 interfesz, -// es az 1,2,...,16 timeslotok tartoznak hozza: - - Interfaces using this board: (channel-group, interface, timeslots) - 0 comx1: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 - 1 comx2: 17 - 2 comx3: 18 - 3 comx4: 19 - 4 comx5: 20 - 5 comx6: 21 - 6 comx7: 22 - 7 comx8: 23 - 8 comx9: 24 - 9 comx10: 25 - 10 comx11: 26 - 11 comx12: 27 - 12 comx13: 28 - 13 comx14: 29 - 14 comx15: 30 - 15 comx16: 31 - -// Hany esemenyt kezelt le a driver egy-egy hardver-interrupt kiszolgalasanal: - - Interrupt work histogram: - hist[ 0]: 0 hist[ 1]: 2 hist[ 2]: 18574 hist[ 3]: 79 - hist[ 4]: 14 hist[ 5]: 1 hist[ 6]: 0 hist[ 7]: 1 - hist[ 8]: 0 hist[ 9]: 7 - -// Hany kikuldendo csomag volt mar a Tx-ringben amikor ujabb lett irva bele: - - Tx ring histogram: - hist[ 0]: 2329 hist[ 1]: 0 hist[ 2]: 0 hist[ 3]: 0 - -// Az E1-interfesz hiba-szamlaloi, az rfc2495-nek megfeleloen: -// (kb. a Cisco routerek "show controllers e1" formatumaban: http://www.cisco.com/univercd/cc/td/doc/product/software/ios11/rbook/rinterfc.htm#xtocid25669126) - -Data in current interval (91 seconds elapsed): - 9516 Line Code Violations, 65 Path Code Violations, 2 E-Bit Errors - 0 Slip Secs, 2 Fr Loss Secs, 2 Line Err Secs, 0 Degraded Mins - 0 Errored Secs, 0 Bursty Err Secs, 0 Severely Err Secs, 11 Unavail Secs -Data in Interval 1 (15 minutes): - 0 Line Code Violations, 0 Path Code Violations, 0 E-Bit Errors - 0 Slip Secs, 0 Fr Loss Secs, 0 Line Err Secs, 0 Degraded Mins - 0 Errored Secs, 0 Bursty Err Secs, 0 Severely Err Secs, 0 Unavail Secs -Data in last 4 intervals (1 hour): - 0 Line Code Violations, 0 Path Code Violations, 0 E-Bit Errors - 0 Slip Secs, 0 Fr Loss Secs, 0 Line Err Secs, 0 Degraded Mins - 0 Errored Secs, 0 Bursty Err Secs, 0 Severely Err Secs, 0 Unavail Secs -Data in last 96 intervals (24 hours): - 0 Line Code Violations, 0 Path Code Violations, 0 E-Bit Errors - 0 Slip Secs, 0 Fr Loss Secs, 0 Line Err Secs, 0 Degraded Mins - 0 Errored Secs, 0 Bursty Err Secs, 0 Severely Err Secs, 0 Unavail Secs - ------------------------------------------------------------------ - -Nehany kulonlegesebb beallitasi lehetoseg (idovel beepulhetnek majd a driverbe): -Ezekkel sok galibat lehet okozni, nagyon ovatosan kell oket hasznalni! - - modified CRC-4, for improved interworking of CRC-4 and non-CRC-4 - devices: (lasd page 107 es g706 Annex B) - lbireg[ 0x1b ] |= 0x08 - lbireg[ 0x1c ] |= 0xc0 - - ilyenkor ertelmezett az NMF - 'No Multiframe alignment Found after - 400 msec' alarm. - - FALC - a vonali meghajto IC - local loop - a sajat adasomat halljam vissza - remote loop - a kivulrol jovo adast adom vissza - - Egy hibakeresesre hasznalhato dolog: - - 1-es timeslot local loop a FALC-ban: echo >lbireg 0x1d 0x21 - - local loop kikapcsolasa: echo >lbireg 0x1d 0x00 diff --git a/Documentation/networking/slicecom.txt b/Documentation/networking/slicecom.txt deleted file mode 100644 index 59cfd95121f..00000000000 --- a/Documentation/networking/slicecom.txt +++ /dev/null @@ -1,369 +0,0 @@ - -SliceCOM adapter user's documentation - for the 0.51 driver version - -Written by Bartók István <bartoki@itc.hu> - -English translation: Lakatos György <gyuri@itc.hu> -Mon Dec 11 15:28:42 CET 2000 - -Last modified: Wed Aug 29 17:25:37 CEST 2001 - ------------------------------------------------------------------ - -Usage: - -Compiling the kernel: - -Code maturity level options - [*] Prompt for development and/or incomplete code/drivers - -Network device support - Wan interfaces - <M> MultiGate (COMX) synchronous - <M> Support for MUNICH based boards: SliceCOM, PCICOM (NEW) - <M> Support for HDLC and syncPPP... - - -Loading the modules: - -modprobe comx - -modprobe comx-proto-ppp # module for Cisco-HDLC and SyncPPP protocols - -modprobe comx-hw-munich # the module logs information by the kernel - # about the detected boards - - -Configuring the board: - -# This interface will use the Cisco-HDLC line protocol, -# the timeslices assigned are 1,2 (128 KiBit line speed) -# (the first data timeslice in the G.703 frame is no. 1) -# -mkdir /proc/comx/comx0.1/ -echo slicecom >/proc/comx/comx0.1/boardtype -echo hdlc >/proc/comx/comx0.1/protocol -echo 1 2 >/proc/comx/comx0.1/timeslots - - -# This interface uses SyncPPP line protocol, the assigned -# is no. 3 (64 KiBit line speed) -# -mkdir /proc/comx/comx0.2/ -echo slicecom >/proc/comx/comx0.2/boardtype -echo ppp >/proc/comx/comx0.2/protocol -echo 3 >/proc/comx/comx0.2/timeslots - -... - -ifconfig comx0.1 up -ifconfig comx0.2 up - ------------------------------------------------------------------ - -The COMX interfaces use a 10 packet transmit queue by default, however WAN -networks sometimes use bigger values (20 to 100), to utilize the line better -by large traffic (though the line delay increases because of more packets -join the queue). - -# ifconfig comx0 txqueuelen 50 - -This option is only supported by the ifconfig command of the later -distributions, which came with 2.2 kernels, such as RedHat 6.1 or Debian 2.2. - -You can download a newer netbase packet from -http://www.debian.org/~rcw/2.2/netbase/ for Debian 2.1, which has a new -ifconfig. You can get further information about using 2.2 kernel with -Debian 2.1 from http://www.debian.org/releases/stable/running-kernel-2.2 - ------------------------------------------------------------------ - -The SliceCom LEDs: - -red - on, if the interface is unconfigured, or it gets Remote Alarm-s -green - on, if the board finds frame-sync in the received signal - -A bit more detailed: - -red: green: meaning: - -- - no frame-sync, no signal received, or signal SNAFU. -- on "Everything is OK" -on on Recepion is ok, but the remote end sends Remote Alarm -on - The interface is unconfigured - ------------------------------------------------------------------ - -A more detailed description of the hardware setting options: - -The general and the protocol layer options described in the 'comx.txt' file -apply to the SliceCom as well, I only summarize the SliceCom hardware specific -settings below. - -The '/proc/comx' configuring interface: - -An interface directory should be created for every timeslot group with -'mkdir', e,g: 'comx0', 'comx1' etc. The timeslots can be assigned here to the -specific interface. The Cisco-like naming convention (serial3:1 - first -timeslot group of the 3rd. board) can't be used here, because these mean IP -aliasing in Linux. - -You can give any meaningful name to keep the configuration clear; -e.g: 'comx0.1', 'comx0.2', 'comx1.1', comx1.2', if you have two boards -with two interfaces each. - -Settings, which apply to the board: - -Neither 'io' nor 'irq' settings required, the driver uses the resources -given by the PCI BIOS. - -comx0/boardnum - board number of the SliceCom in the PC (using the 'natural' - PCI order) as listed in '/proc/pci' or the output of the - 'lspci' command, generally the slots nearer to the motherboard - PCI driver chips have the lower numbers. - - Default: 0 (the counting starts with 0) - -Though the options below are to be set on a single interface, they apply to the -whole board. The restriction, to use them on 'UP' interfaces, is because the -command sequence below could lead to unpredicable results. - - # echo 0 >boardnum - # echo internal >clock_source - # echo 1 >boardnum - -The sequence would set the clock source of board 0. - -These settings will persist after all the interfaces are cleared, but are -cleared when the driver module is unloaded and loaded again. - -comx0/clock_source - source of the transmit clock - Usage: - - # echo line >/proc/comx/comx0/clock_source - # echo internal >/proc/comx/comx0/clock_source - - line - The Tx clock is being decoded if the input data stream, - if no clock seen on the input, then the board will use it's - own clock generator. - - internal - The Tx clock is supplied by the builtin clock generator. - - Default: line - - Normally, the telecommunication company's end device (the HDSL - modem) provides the Tx clock, that's why 'line' is the default. - -comx0/framing - Switching CRC4 off/on - - CRC4: 16 PCM frames (The 32 64Kibit channels are multiplexed into a - PCM frame, nothing to do with HDLC frames) are divided into 2x8 - groups, each group has a 4 bit CRC. - - # echo crc4 >/proc/comx/comx0/framing - # echo no-crc4 >/proc/comx/comx0/framing - - Default is 'crc4', the Hungarian MATAV lines behave like this. - The traffic generally passes if this setting on both ends don't match. - -comx0/linecode - Setting the line coding - - # echo hdb3 >/proc/comx/comx0/linecode - # echo ami >/proc/comx/comx0/linecode - - Default a 'hdb3', MATAV lines use this. - - (AMI coding is rarely used with E1 lines). Frame sync may occur, if - this setting doesn't match the other end's, but CRC4 and data errors - will come, which will result in CRC errors on HDLC/SyncPPP level. - -comx0/reg - direct access to the board's MUNICH (reg) and FALC (lbireg) -comx0/lbireg circuit's registers - - # echo >reg 0x04 0x0 - write 0 to register 4 - # echo >reg 0x104 - write the contents of register 4 with - printk() to syslog - -WARNING! These are only for development purposes, messing with this will - result much trouble! - -comx0/loopback - Places a loop to the board's G.703 signals - - # echo none >/proc/comx/comx0/loopback - # echo local >/proc/comx/comx0/loopback - # echo remote >/proc/comx/comx0/loopback - - none - normal operation, no loop - local - the board receives it's own output - remote - the board sends the received data to the remote side - - Default: none - ------------------------------------------------------------------ - -Interface (channel group in Cisco terms) settings: - -comx0/timeslots - which timeslots belong to the given interface - - Setting: - - # echo '1 5 2 6 7 8' >/proc/comx/comx0/timeslots - - # cat /proc/comx/comx0/timeslots - 1 2 5 6 7 8 - # - - Finding a timeslot: - - # grep ' 4' /proc/comx/comx*/timeslots - /proc/comx/comx0/timeslots:1 3 4 5 6 - # - - The timeslots can be in any order, '1 2 3' is the same as '1 3 2'. - - The interface has to be DOWN during the setting ('ifconfig comx0 - down'), but the other interfaces could operate normally. - - The driver checks if the assigned timeslots are vacant, if not, then - the setting won't be applied. - - The timeslot values are treated as decimal numbers, not to misunderstand - values of 08, 09 form. - ------------------------------------------------------------------ - -Checking the interface and board status: - -- Lines beginning with ' ' (space) belong to the original output, the lines -which begin with '//' are the comments. - - papaya:~$ cat /proc/comx/comx1/status - Interface administrative status is UP, modem status is UP, protocol is UP - Modem status changes: 0, Transmitter status is IDLE, tbusy: 0 - Interface load (input): 978376 / 947808 / 951024 bits/s (5s/5m/15m) - (output): 978376 / 947848 / 951024 bits/s (5s/5m/15m) - Debug flags: none - RX errors: len: 22, overrun: 1, crc: 0, aborts: 0 - buffer overrun: 0, pbuffer overrun: 0 - TX errors: underrun: 0 - Line keepalive (value: 10) status UP [0] - -// The hardware specific part starts here: - Controller status: - No alarms - -// Alarm: -// -// No alarms - Everything OK -// -// LOS - Loss Of Signal - No signal sensed on the input -// AIS - Alarm Indication Signal - The remot side sends '11111111'-s, -// it tells, that there's an error condition, or it's not -// initialised. -// AUXP - Auxiliary Pattern Indication - 01010101.. received. -// LFA - Loss of Frame Alignment - no frame sync received. -// RRA - Receive Remote Alarm - the remote end's OK, but singnals error cond. -// LMFA - Loss of CRC4 Multiframe Alignment - no CRC4 multiframe sync. -// NMF - No Multiframe alignment Found after 400 msec - no such alarm using -// no-crc4 or crc4 framing, see below. -// -// Other possible error messages: -// -// Transmit Line Short - the board felt, that it's output is short-circuited, -// so it switched the transmission off. (The board can't definitely tell, -// that it's output is short-circuited.) - -// Chained list of the received packets, for debug purposes: - - Rx ring: - rafutott: 0 - lastcheck: 50845731, jiffies: 51314281 - base: 017b1858 - rx_desc_ptr: 0 - rx_desc_ptr: 017b1858 - hw_curr_ptr: 017b1858 - 06040000 017b1868 017b1898 c016ff00 - 06040000 017b1878 017b1e9c c016ff00 - 46040000 017b1888 017b24a0 c016ff00 - 06040000 017b1858 017b2aa4 c016ff00 - -// All the interfaces using the board: comx1, using the 1,2,...16 timeslots, -// comx2, using timeslot 17, etc. - - Interfaces using this board: (channel-group, interface, timeslots) - 0 comx1: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 - 1 comx2: 17 - 2 comx3: 18 - 3 comx4: 19 - 4 comx5: 20 - 5 comx6: 21 - 6 comx7: 22 - 7 comx8: 23 - 8 comx9: 24 - 9 comx10: 25 - 10 comx11: 26 - 11 comx12: 27 - 12 comx13: 28 - 13 comx14: 29 - 14 comx15: 30 - 15 comx16: 31 - -// The number of events handled by the driver during an interrupt cycle: - - Interrupt work histogram: - hist[ 0]: 0 hist[ 1]: 2 hist[ 2]: 18574 hist[ 3]: 79 - hist[ 4]: 14 hist[ 5]: 1 hist[ 6]: 0 hist[ 7]: 1 - hist[ 8]: 0 hist[ 9]: 7 - -// The number of packets to send in the Tx ring, when a new one arrived: - - Tx ring histogram: - hist[ 0]: 2329 hist[ 1]: 0 hist[ 2]: 0 hist[ 3]: 0 - -// The error counters of the E1 interface, according to the RFC2495, -// (similar to the Cisco "show controllers e1" command's output: -// http://www.cisco.com/univercd/cc/td/doc/product/software/ios11/rbook/rinterfc.htm#xtocid25669126) - -Data in current interval (91 seconds elapsed): - 9516 Line Code Violations, 65 Path Code Violations, 2 E-Bit Errors - 0 Slip Secs, 2 Fr Loss Secs, 2 Line Err Secs, 0 Degraded Mins - 0 Errored Secs, 0 Bursty Err Secs, 0 Severely Err Secs, 11 Unavail Secs -Data in Interval 1 (15 minutes): - 0 Line Code Violations, 0 Path Code Violations, 0 E-Bit Errors - 0 Slip Secs, 0 Fr Loss Secs, 0 Line Err Secs, 0 Degraded Mins - 0 Errored Secs, 0 Bursty Err Secs, 0 Severely Err Secs, 0 Unavail Secs -Data in last 4 intervals (1 hour): - 0 Line Code Violations, 0 Path Code Violations, 0 E-Bit Errors - 0 Slip Secs, 0 Fr Loss Secs, 0 Line Err Secs, 0 Degraded Mins - 0 Errored Secs, 0 Bursty Err Secs, 0 Severely Err Secs, 0 Unavail Secs -Data in last 96 intervals (24 hours): - 0 Line Code Violations, 0 Path Code Violations, 0 E-Bit Errors - 0 Slip Secs, 0 Fr Loss Secs, 0 Line Err Secs, 0 Degraded Mins - 0 Errored Secs, 0 Bursty Err Secs, 0 Severely Err Secs, 0 Unavail Secs - ------------------------------------------------------------------ - -Some unique options, (may get into the driver later): -Treat them very carefully, these can cause much trouble! - - modified CRC-4, for improved interworking of CRC-4 and non-CRC-4 - devices: (see page 107 and g706 Annex B) - lbireg[ 0x1b ] |= 0x08 - lbireg[ 0x1c ] |= 0xc0 - - - The NMF - 'No Multiframe alignment Found after 400 msec' alarm - comes into account. - - FALC - the line driver chip. - local loop - I hear my transmission back. - remote loop - I echo the remote transmission back. - - Something useful for finding errors: - - - local loop for timeslot 1 in the FALC chip: - - # echo >lbireg 0x1d 0x21 - - - Swithing the loop off: - - # echo >lbireg 0x1d 0x00 diff --git a/Documentation/networking/smctr.txt b/Documentation/networking/smctr.txt deleted file mode 100644 index 4c866f5a0ee..00000000000 --- a/Documentation/networking/smctr.txt +++ /dev/null @@ -1,66 +0,0 @@ -Text File for the SMC TokenCard TokenRing Linux driver (smctr.c). - By Jay Schulist <jschlst@samba.org> - -The Linux SMC Token Ring driver works with the SMC TokenCard Elite (8115T) -ISA and SMC TokenCard Elite/A (8115T/A) MCA adapters. - -Latest information on this driver can be obtained on the Linux-SNA WWW site. -Please point your browser to: http://www.linux-sna.org - -This driver is rather simple to use. Select Y to Token Ring adapter support -in the kernel configuration. A choice for SMC Token Ring adapters will -appear. This drives supports all SMC ISA/MCA adapters. Choose this -option. I personally recommend compiling the driver as a module (M), but if you -you would like to compile it staticly answer Y instead. - -This driver supports multiple adapters without the need to load multiple copies -of the driver. You should be able to load up to 7 adapters without any kernel -modifications, if you are in need of more please contact the maintainer of this -driver. - -Load the driver either by lilo/loadlin or as a module. When a module using the -following command will suffice for most: - -# modprobe smctr -smctr.c: v1.00 12/6/99 by jschlst@samba.org -tr0: SMC TokenCard 8115T at Io 0x300, Irq 10, Rom 0xd8000, Ram 0xcc000. - -Now just setup the device via ifconfig and set and routes you may have. After -this you are ready to start sending some tokens. - -Errata: -1). For anyone wondering where to pick up the SMC adapters please browse - to http://www.smc.com - -2). If you are the first/only Token Ring Client on a Token Ring LAN, please - specify the ringspeed with the ringspeed=[4/16] module option. If no - ringspeed is specified the driver will attempt to autodetect the ring - speed and/or if the adapter is the first/only station on the ring take - the appropriate actions. - - NOTE: Default ring speed is 16MB UTP. - -3). PnP support for this adapter sucks. I recommend hard setting the - IO/MEM/IRQ by the jumpers on the adapter. If this is not possible - load the module with the following io=[ioaddr] mem=[mem_addr] - irq=[irq_num]. - - The following IRQ, IO, and MEM settings are supported. - - IO ports: - 0x200, 0x220, 0x240, 0x260, 0x280, 0x2A0, 0x2C0, 0x2E0, 0x300, - 0x320, 0x340, 0x360, 0x380. - - IRQs: - 2, 3, 4, 5, 7, 8, 9, 10, 11, 12, 13, 14, 15 - - Memory addresses: - 0xA0000, 0xA4000, 0xA8000, 0xAC000, 0xB0000, 0xB4000, - 0xB8000, 0xBC000, 0xC0000, 0xC4000, 0xC8000, 0xCC000, - 0xD0000, 0xD4000, 0xD8000, 0xDC000, 0xE0000, 0xE4000, - 0xE8000, 0xEC000, 0xF0000, 0xF4000, 0xF8000, 0xFC000 - -This driver is under the GNU General Public License. Its Firmware image is -included as an initialized C-array and is licensed by SMC to the Linux -users of this driver. However no warranty about its fitness is expressed or -implied by SMC. diff --git a/Documentation/networking/spider_net.txt b/Documentation/networking/spider_net.txt new file mode 100644 index 00000000000..b0b75f8463b --- /dev/null +++ b/Documentation/networking/spider_net.txt @@ -0,0 +1,204 @@ + + The Spidernet Device Driver + =========================== + +Written by Linas Vepstas <linas@austin.ibm.com> + +Version of 7 June 2007 + +Abstract +======== +This document sketches the structure of portions of the spidernet +device driver in the Linux kernel tree. The spidernet is a gigabit +ethernet device built into the Toshiba southbridge commonly used +in the SONY Playstation 3 and the IBM QS20 Cell blade. + +The Structure of the RX Ring. +============================= +The receive (RX) ring is a circular linked list of RX descriptors, +together with three pointers into the ring that are used to manage its +contents. + +The elements of the ring are called "descriptors" or "descrs"; they +describe the received data. This includes a pointer to a buffer +containing the received data, the buffer size, and various status bits. + +There are three primary states that a descriptor can be in: "empty", +"full" and "not-in-use". An "empty" or "ready" descriptor is ready +to receive data from the hardware. A "full" descriptor has data in it, +and is waiting to be emptied and processed by the OS. A "not-in-use" +descriptor is neither empty or full; it is simply not ready. It may +not even have a data buffer in it, or is otherwise unusable. + +During normal operation, on device startup, the OS (specifically, the +spidernet device driver) allocates a set of RX descriptors and RX +buffers. These are all marked "empty", ready to receive data. This +ring is handed off to the hardware, which sequentially fills in the +buffers, and marks them "full". The OS follows up, taking the full +buffers, processing them, and re-marking them empty. + +This filling and emptying is managed by three pointers, the "head" +and "tail" pointers, managed by the OS, and a hardware current +descriptor pointer (GDACTDPA). The GDACTDPA points at the descr +currently being filled. When this descr is filled, the hardware +marks it full, and advances the GDACTDPA by one. Thus, when there is +flowing RX traffic, every descr behind it should be marked "full", +and everything in front of it should be "empty". If the hardware +discovers that the current descr is not empty, it will signal an +interrupt, and halt processing. + +The tail pointer tails or trails the hardware pointer. When the +hardware is ahead, the tail pointer will be pointing at a "full" +descr. The OS will process this descr, and then mark it "not-in-use", +and advance the tail pointer. Thus, when there is flowing RX traffic, +all of the descrs in front of the tail pointer should be "full", and +all of those behind it should be "not-in-use". When RX traffic is not +flowing, then the tail pointer can catch up to the hardware pointer. +The OS will then note that the current tail is "empty", and halt +processing. + +The head pointer (somewhat mis-named) follows after the tail pointer. +When traffic is flowing, then the head pointer will be pointing at +a "not-in-use" descr. The OS will perform various housekeeping duties +on this descr. This includes allocating a new data buffer and +dma-mapping it so as to make it visible to the hardware. The OS will +then mark the descr as "empty", ready to receive data. Thus, when there +is flowing RX traffic, everything in front of the head pointer should +be "not-in-use", and everything behind it should be "empty". If no +RX traffic is flowing, then the head pointer can catch up to the tail +pointer, at which point the OS will notice that the head descr is +"empty", and it will halt processing. + +Thus, in an idle system, the GDACTDPA, tail and head pointers will +all be pointing at the same descr, which should be "empty". All of the +other descrs in the ring should be "empty" as well. + +The show_rx_chain() routine will print out the locations of the +GDACTDPA, tail and head pointers. It will also summarize the contents +of the ring, starting at the tail pointer, and listing the status +of the descrs that follow. + +A typical example of the output, for a nearly idle system, might be + +net eth1: Total number of descrs=256 +net eth1: Chain tail located at descr=20 +net eth1: Chain head is at 20 +net eth1: HW curr desc (GDACTDPA) is at 21 +net eth1: Have 1 descrs with stat=x40800101 +net eth1: HW next desc (GDACNEXTDA) is at 22 +net eth1: Last 255 descrs with stat=xa0800000 + +In the above, the hardware has filled in one descr, number 20. Both +head and tail are pointing at 20, because it has not yet been emptied. +Meanwhile, hw is pointing at 21, which is free. + +The "Have nnn decrs" refers to the descr starting at the tail: in this +case, nnn=1 descr, starting at descr 20. The "Last nnn descrs" refers +to all of the rest of the descrs, from the last status change. The "nnn" +is a count of how many descrs have exactly the same status. + +The status x4... corresponds to "full" and status xa... corresponds +to "empty". The actual value printed is RXCOMST_A. + +In the device driver source code, a different set of names are +used for these same concepts, so that + +"empty" == SPIDER_NET_DESCR_CARDOWNED == 0xa +"full" == SPIDER_NET_DESCR_FRAME_END == 0x4 +"not in use" == SPIDER_NET_DESCR_NOT_IN_USE == 0xf + + +The RX RAM full bug/feature +=========================== + +As long as the OS can empty out the RX buffers at a rate faster than +the hardware can fill them, there is no problem. If, for some reason, +the OS fails to empty the RX ring fast enough, the hardware GDACTDPA +pointer will catch up to the head, notice the not-empty condition, +ad stop. However, RX packets may still continue arriving on the wire. +The spidernet chip can save some limited number of these in local RAM. +When this local ram fills up, the spider chip will issue an interrupt +indicating this (GHIINT0STS will show ERRINT, and the GRMFLLINT bit +will be set in GHIINT1STS). When the RX ram full condition occurs, +a certain bug/feature is triggered that has to be specially handled. +This section describes the special handling for this condition. + +When the OS finally has a chance to run, it will empty out the RX ring. +In particular, it will clear the descriptor on which the hardware had +stopped. However, once the hardware has decided that a certain +descriptor is invalid, it will not restart at that descriptor; instead +it will restart at the next descr. This potentially will lead to a +deadlock condition, as the tail pointer will be pointing at this descr, +which, from the OS point of view, is empty; the OS will be waiting for +this descr to be filled. However, the hardware has skipped this descr, +and is filling the next descrs. Since the OS doesn't see this, there +is a potential deadlock, with the OS waiting for one descr to fill, +while the hardware is waiting for a different set of descrs to become +empty. + +A call to show_rx_chain() at this point indicates the nature of the +problem. A typical print when the network is hung shows the following: + +net eth1: Spider RX RAM full, incoming packets might be discarded! +net eth1: Total number of descrs=256 +net eth1: Chain tail located at descr=255 +net eth1: Chain head is at 255 +net eth1: HW curr desc (GDACTDPA) is at 0 +net eth1: Have 1 descrs with stat=xa0800000 +net eth1: HW next desc (GDACNEXTDA) is at 1 +net eth1: Have 127 descrs with stat=x40800101 +net eth1: Have 1 descrs with stat=x40800001 +net eth1: Have 126 descrs with stat=x40800101 +net eth1: Last 1 descrs with stat=xa0800000 + +Both the tail and head pointers are pointing at descr 255, which is +marked xa... which is "empty". Thus, from the OS point of view, there +is nothing to be done. In particular, there is the implicit assumption +that everything in front of the "empty" descr must surely also be empty, +as explained in the last section. The OS is waiting for descr 255 to +become non-empty, which, in this case, will never happen. + +The HW pointer is at descr 0. This descr is marked 0x4.. or "full". +Since its already full, the hardware can do nothing more, and thus has +halted processing. Notice that descrs 0 through 254 are all marked +"full", while descr 254 and 255 are empty. (The "Last 1 descrs" is +descr 254, since tail was at 255.) Thus, the system is deadlocked, +and there can be no forward progress; the OS thinks there's nothing +to do, and the hardware has nowhere to put incoming data. + +This bug/feature is worked around with the spider_net_resync_head_ptr() +routine. When the driver receives RX interrupts, but an examination +of the RX chain seems to show it is empty, then it is probable that +the hardware has skipped a descr or two (sometimes dozens under heavy +network conditions). The spider_net_resync_head_ptr() subroutine will +search the ring for the next full descr, and the driver will resume +operations there. Since this will leave "holes" in the ring, there +is also a spider_net_resync_tail_ptr() that will skip over such holes. + +As of this writing, the spider_net_resync() strategy seems to work very +well, even under heavy network loads. + + +The TX ring +=========== +The TX ring uses a low-watermark interrupt scheme to make sure that +the TX queue is appropriately serviced for large packet sizes. + +For packet sizes greater than about 1KBytes, the kernel can fill +the TX ring quicker than the device can drain it. Once the ring +is full, the netdev is stopped. When there is room in the ring, +the netdev needs to be reawakened, so that more TX packets are placed +in the ring. The hardware can empty the ring about four times per jiffy, +so its not appropriate to wait for the poll routine to refill, since +the poll routine runs only once per jiffy. The low-watermark mechanism +marks a descr about 1/4th of the way from the bottom of the queue, so +that an interrupt is generated when the descr is processed. This +interrupt wakes up the netdev, which can then refill the queue. +For large packets, this mechanism generates a relatively small number +of interrupts, about 1K/sec. For smaller packets, this will drop to zero +interrupts, as the hardware can empty the queue faster than the kernel +can fill it. + + + ======= END OF DOCUMENT ======== + diff --git a/Documentation/networking/stmmac.txt b/Documentation/networking/stmmac.txt new file mode 100644 index 00000000000..2090895b08d --- /dev/null +++ b/Documentation/networking/stmmac.txt @@ -0,0 +1,371 @@ + STMicroelectronics 10/100/1000 Synopsys Ethernet driver + +Copyright (C) 2007-2013 STMicroelectronics Ltd +Author: Giuseppe Cavallaro <peppe.cavallaro@st.com> + +This is the driver for the MAC 10/100/1000 on-chip Ethernet controllers +(Synopsys IP blocks). + +Currently this network device driver is for all STM embedded MAC/GMAC +(i.e. 7xxx/5xxx SoCs), SPEAr (arm), Loongson1B (mips) and XLINX XC2V3000 +FF1152AMT0221 D1215994A VIRTEX FPGA board. + +DWC Ether MAC 10/100/1000 Universal version 3.70a (and older) and DWC Ether +MAC 10/100 Universal version 4.0 have been used for developing this driver. + +This driver supports both the platform bus and PCI. + +Please, for more information also visit: www.stlinux.com + +1) Kernel Configuration +The kernel configuration option is STMMAC_ETH: + Device Drivers ---> Network device support ---> Ethernet (1000 Mbit) ---> + STMicroelectronics 10/100/1000 Ethernet driver (STMMAC_ETH) + +2) Driver parameters list: + debug: message level (0: no output, 16: all); + phyaddr: to manually provide the physical address to the PHY device; + dma_rxsize: DMA rx ring size; + dma_txsize: DMA tx ring size; + buf_sz: DMA buffer size; + tc: control the HW FIFO threshold; + watchdog: transmit timeout (in milliseconds); + flow_ctrl: Flow control ability [on/off]; + pause: Flow Control Pause Time; + eee_timer: tx EEE timer; + chain_mode: select chain mode instead of ring. + +3) Command line options +Driver parameters can be also passed in command line by using: + stmmaceth=dma_rxsize:128,dma_txsize:512 + +4) Driver information and notes + +4.1) Transmit process +The xmit method is invoked when the kernel needs to transmit a packet; it sets +the descriptors in the ring and informs the DMA engine that there is a packet +ready to be transmitted. +Once the controller has finished transmitting the packet, an interrupt is +triggered; So the driver will be able to release the socket buffers. +By default, the driver sets the NETIF_F_SG bit in the features field of the +net_device structure enabling the scatter/gather feature. + +4.2) Receive process +When one or more packets are received, an interrupt happens. The interrupts +are not queued so the driver has to scan all the descriptors in the ring during +the receive process. +This is based on NAPI so the interrupt handler signals only if there is work +to be done, and it exits. +Then the poll method will be scheduled at some future point. +The incoming packets are stored, by the DMA, in a list of pre-allocated socket +buffers in order to avoid the memcpy (Zero-copy). + +4.3) Interrupt Mitigation +The driver is able to mitigate the number of its DMA interrupts +using NAPI for the reception on chips older than the 3.50. +New chips have an HW RX-Watchdog used for this mitigation. + +On Tx-side, the mitigation schema is based on a SW timer that calls the +tx function (stmmac_tx) to reclaim the resource after transmitting the +frames. +Also there is another parameter (like a threshold) used to program +the descriptors avoiding to set the interrupt on completion bit in +when the frame is sent (xmit). + +Mitigation parameters can be tuned by ethtool. + +4.4) WOL +Wake up on Lan feature through Magic and Unicast frames are supported for the +GMAC core. + +4.5) DMA descriptors +Driver handles both normal and enhanced descriptors. The latter has been only +tested on DWC Ether MAC 10/100/1000 Universal version 3.41a and later. + +STMMAC supports DMA descriptor to operate both in dual buffer (RING) +and linked-list(CHAINED) mode. In RING each descriptor points to two +data buffer pointers whereas in CHAINED mode they point to only one data +buffer pointer. RING mode is the default. + +In CHAINED mode each descriptor will have pointer to next descriptor in +the list, hence creating the explicit chaining in the descriptor itself, +whereas such explicit chaining is not possible in RING mode. + +4.6) Ethtool support +Ethtool is supported. Driver statistics and internal errors can be taken using: +ethtool -S ethX command. It is possible to dump registers etc. + +4.7) Jumbo and Segmentation Offloading +Jumbo frames are supported and tested for the GMAC. +The GSO has been also added but it's performed in software. +LRO is not supported. + +4.8) Physical +The driver is compatible with PAL to work with PHY and GPHY devices. + +4.9) Platform information +Several driver's information can be passed through the platform +These are included in the include/linux/stmmac.h header file +and detailed below as well: + +struct plat_stmmacenet_data { + char *phy_bus_name; + int bus_id; + int phy_addr; + int interface; + struct stmmac_mdio_bus_data *mdio_bus_data; + struct stmmac_dma_cfg *dma_cfg; + int clk_csr; + int has_gmac; + int enh_desc; + int tx_coe; + int rx_coe; + int bugged_jumbo; + int pmt; + int force_sf_dma_mode; + int force_thresh_dma_mode; + int riwt_off; + void (*fix_mac_speed)(void *priv, unsigned int speed); + void (*bus_setup)(void __iomem *ioaddr); + void *(*setup)(struct platform_device *pdev); + int (*init)(struct platform_device *pdev, void *priv); + void (*exit)(struct platform_device *pdev, void *priv); + void *custom_cfg; + void *custom_data; + void *bsp_priv; + }; + +Where: + o phy_bus_name: phy bus name to attach to the stmmac. + o bus_id: bus identifier. + o phy_addr: the physical address can be passed from the platform. + If it is set to -1 the driver will automatically + detect it at run-time by probing all the 32 addresses. + o interface: PHY device's interface. + o mdio_bus_data: specific platform fields for the MDIO bus. + o dma_cfg: internal DMA parameters + o pbl: the Programmable Burst Length is maximum number of beats to + be transferred in one DMA transaction. + GMAC also enables the 4xPBL by default. + o fixed_burst/mixed_burst/burst_len + o clk_csr: fixed CSR Clock range selection. + o has_gmac: uses the GMAC core. + o enh_desc: if sets the MAC will use the enhanced descriptor structure. + o tx_coe: core is able to perform the tx csum in HW. + o rx_coe: the supports three check sum offloading engine types: + type_1, type_2 (full csum) and no RX coe. + o bugged_jumbo: some HWs are not able to perform the csum in HW for + over-sized frames due to limited buffer sizes. + Setting this flag the csum will be done in SW on + JUMBO frames. + o pmt: core has the embedded power module (optional). + o force_sf_dma_mode: force DMA to use the Store and Forward mode + instead of the Threshold. + o force_thresh_dma_mode: force DMA to use the Threshold mode other than + the Store and Forward mode. + o riwt_off: force to disable the RX watchdog feature and switch to NAPI mode. + o fix_mac_speed: this callback is used for modifying some syscfg registers + (on ST SoCs) according to the link speed negotiated by the + physical layer . + o bus_setup: perform HW setup of the bus. For example, on some ST platforms + this field is used to configure the AMBA bridge to generate more + efficient STBus traffic. + o setup/init/exit: callbacks used for calling a custom initialization; + this is sometime necessary on some platforms (e.g. ST boxes) + where the HW needs to have set some PIO lines or system cfg + registers. setup should return a pointer to private data, + which will be stored in bsp_priv, and then passed to init and + exit callbacks. init/exit callbacks should not use or modify + platform data. + o custom_cfg/custom_data: this is a custom configuration that can be passed + while initializing the resources. + o bsp_priv: another private pointer. + +For MDIO bus The we have: + + struct stmmac_mdio_bus_data { + int (*phy_reset)(void *priv); + unsigned int phy_mask; + int *irqs; + int probed_phy_irq; + }; + +Where: + o phy_reset: hook to reset the phy device attached to the bus. + o phy_mask: phy mask passed when register the MDIO bus within the driver. + o irqs: list of IRQs, one per PHY. + o probed_phy_irq: if irqs is NULL, use this for probed PHY. + +For DMA engine we have the following internal fields that should be +tuned according to the HW capabilities. + +struct stmmac_dma_cfg { + int pbl; + int fixed_burst; + int burst_len_supported; +}; + +Where: + o pbl: Programmable Burst Length + o fixed_burst: program the DMA to use the fixed burst mode + o burst_len: this is the value we put in the register + supported values are provided as macros in + linux/stmmac.h header file. + +--- + +Below an example how the structures above are using on ST platforms. + + static struct plat_stmmacenet_data stxYYY_ethernet_platform_data = { + .has_gmac = 0, + .enh_desc = 0, + .fix_mac_speed = stxYYY_ethernet_fix_mac_speed, + | + |-> to write an internal syscfg + | on this platform when the + | link speed changes from 10 to + | 100 and viceversa + .init = &stmmac_claim_resource, + | + |-> On ST SoC this calls own "PAD" + | manager framework to claim + | all the resources necessary + | (GPIO ...). The .custom_cfg field + | is used to pass a custom config. +}; + +Below the usage of the stmmac_mdio_bus_data: on this SoC, in fact, +there are two MAC cores: one MAC is for MDIO Bus/PHY emulation +with fixed_link support. + +static struct stmmac_mdio_bus_data stmmac1_mdio_bus = { + .phy_reset = phy_reset; + | + |-> function to provide the phy_reset on this board + .phy_mask = 0, +}; + +static struct fixed_phy_status stmmac0_fixed_phy_status = { + .link = 1, + .speed = 100, + .duplex = 1, +}; + +During the board's device_init we can configure the first +MAC for fixed_link by calling: + fixed_phy_add(PHY_POLL, 1, &stmmac0_fixed_phy_status));) +and the second one, with a real PHY device attached to the bus, +by using the stmmac_mdio_bus_data structure (to provide the id, the +reset procedure etc). + +4.10) List of source files: + o Kconfig + o Makefile + o stmmac_main.c: main network device driver; + o stmmac_mdio.c: mdio functions; + o stmmac_pci: PCI driver; + o stmmac_platform.c: platform driver + o stmmac_ethtool.c: ethtool support; + o stmmac_timer.[ch]: timer code used for mitigating the driver dma interrupts + (only tested on ST40 platforms based); + o stmmac.h: private driver structure; + o common.h: common definitions and VFTs; + o descs.h: descriptor structure definitions; + o dwmac1000_core.c: GMAC core functions; + o dwmac1000_dma.c: dma functions for the GMAC chip; + o dwmac1000.h: specific header file for the GMAC; + o dwmac100_core: MAC 100 core and dma code; + o dwmac100_dma.c: dma functions for the MAC chip; + o dwmac1000.h: specific header file for the MAC; + o dwmac_lib.c: generic DMA functions shared among chips; + o enh_desc.c: functions for handling enhanced descriptors; + o norm_desc.c: functions for handling normal descriptors; + o chain_mode.c/ring_mode.c:: functions to manage RING/CHAINED modes; + o mmc_core.c/mmc.h: Management MAC Counters; + o stmmac_hwtstamp.c: HW timestamp support for PTP + o stmmac_ptp.c: PTP 1588 clock + +5) Debug Information + +The driver exports many information i.e. internal statistics, +debug information, MAC and DMA registers etc. + +These can be read in several ways depending on the +type of the information actually needed. + +For example a user can be use the ethtool support +to get statistics: e.g. using: ethtool -S ethX +(that shows the Management counters (MMC) if supported) +or sees the MAC/DMA registers: e.g. using: ethtool -d ethX + +Compiling the Kernel with CONFIG_DEBUG_FS and enabling the +STMMAC_DEBUG_FS option the driver will export the following +debugfs entries: + +/sys/kernel/debug/stmmaceth/descriptors_status + To show the DMA TX/RX descriptor rings + +Developer can also use the "debug" module parameter to get +further debug information. + +In the end, there are other macros (that cannot be enabled +via menuconfig) to turn-on the RX/TX DMA debugging, +specific MAC core debug printk etc. Others to enable the +debug in the TX and RX processes. +All these are only useful during the developing stage +and should never enabled inside the code for general usage. +In fact, these can generate an huge amount of debug messages. + +6) Energy Efficient Ethernet + +Energy Efficient Ethernet(EEE) enables IEEE 802.3 MAC sublayer along +with a family of Physical layer to operate in the Low power Idle(LPI) +mode. The EEE mode supports the IEEE 802.3 MAC operation at 100Mbps, +1000Mbps & 10Gbps. + +The LPI mode allows power saving by switching off parts of the +communication device functionality when there is no data to be +transmitted & received. The system on both the side of the link can +disable some functionalities & save power during the period of low-link +utilization. The MAC controls whether the system should enter or exit +the LPI mode & communicate this to PHY. + +As soon as the interface is opened, the driver verifies if the EEE can +be supported. This is done by looking at both the DMA HW capability +register and the PHY devices MCD registers. +To enter in Tx LPI mode the driver needs to have a software timer +that enable and disable the LPI mode when there is nothing to be +transmitted. + +7) Extended descriptors +The extended descriptors give us information about the receive Ethernet payload +when it is carrying PTP packets or TCP/UDP/ICMP over IP. +These are not available on GMAC Synopsys chips older than the 3.50. +At probe time the driver will decide if these can be actually used. +This support also is mandatory for PTPv2 because the extra descriptors 6 and 7 +are used for saving the hardware timestamps. + +8) Precision Time Protocol (PTP) +The driver supports the IEEE 1588-2002, Precision Time Protocol (PTP), +which enables precise synchronization of clocks in measurement and +control systems implemented with technologies such as network +communication. + +In addition to the basic timestamp features mentioned in IEEE 1588-2002 +Timestamps, new GMAC cores support the advanced timestamp features. +IEEE 1588-2008 that can be enabled when configure the Kernel. + +9) SGMII/RGMII supports +New GMAC devices provide own way to manage RGMII/SGMII. +This information is available at run-time by looking at the +HW capability register. This means that the stmmac can manage +auto-negotiation and link status w/o using the PHYLIB stuff +In fact, the HW provides a subset of extended registers to +restart the ANE, verify Full/Half duplex mode and Speed. +Also thanks to these registers it is possible to look at the +Auto-negotiated Link Parter Ability. + +10) TODO: + o XGMAC is not supported. + o Complete the TBI & RTBI support. + o extend VLAN support for 3.70a SYNP GMAC. diff --git a/Documentation/networking/tc-actions-env-rules.txt b/Documentation/networking/tc-actions-env-rules.txt new file mode 100644 index 00000000000..70d6cf60825 --- /dev/null +++ b/Documentation/networking/tc-actions-env-rules.txt @@ -0,0 +1,30 @@ + +The "environmental" rules for authors of any new tc actions are: + +1) If you stealeth or borroweth any packet thou shalt be branching +from the righteous path and thou shalt cloneth. + +For example if your action queues a packet to be processed later, +or intentionally branches by redirecting a packet, then you need to +clone the packet. + +There are certain fields in the skb tc_verd that need to be reset so we +avoid loops, etc. A few are generic enough that skb_act_clone() +resets them for you, so invoke skb_act_clone() rather than skb_clone(). + +2) If you munge any packet thou shalt call pskb_expand_head in the case +someone else is referencing the skb. After that you "own" the skb. +You must also tell us if it is ok to munge the packet (TC_OK2MUNGE), +this way any action downstream can stomp on the packet. + +3) Dropping packets you don't own is a no-no. You simply return +TC_ACT_SHOT to the caller and they will drop it. + +The "environmental" rules for callers of actions (qdiscs etc) are: + +*) Thou art responsible for freeing anything returned as being +TC_ACT_SHOT/STOLEN/QUEUED. If none of TC_ACT_SHOT/STOLEN/QUEUED is +returned, then all is great and you don't need to do anything. + +Post on netdev if something is unclear. + diff --git a/Documentation/networking/tcp-thin.txt b/Documentation/networking/tcp-thin.txt new file mode 100644 index 00000000000..151e229980f --- /dev/null +++ b/Documentation/networking/tcp-thin.txt @@ -0,0 +1,47 @@ +Thin-streams and TCP +==================== +A wide range of Internet-based services that use reliable transport +protocols display what we call thin-stream properties. This means +that the application sends data with such a low rate that the +retransmission mechanisms of the transport protocol are not fully +effective. In time-dependent scenarios (like online games, control +systems, stock trading etc.) where the user experience depends +on the data delivery latency, packet loss can be devastating for +the service quality. Extreme latencies are caused by TCP's +dependency on the arrival of new data from the application to trigger +retransmissions effectively through fast retransmit instead of +waiting for long timeouts. + +After analysing a large number of time-dependent interactive +applications, we have seen that they often produce thin streams +and also stay with this traffic pattern throughout its entire +lifespan. The combination of time-dependency and the fact that the +streams provoke high latencies when using TCP is unfortunate. + +In order to reduce application-layer latency when packets are lost, +a set of mechanisms has been made, which address these latency issues +for thin streams. In short, if the kernel detects a thin stream, +the retransmission mechanisms are modified in the following manner: + +1) If the stream is thin, fast retransmit on the first dupACK. +2) If the stream is thin, do not apply exponential backoff. + +These enhancements are applied only if the stream is detected as +thin. This is accomplished by defining a threshold for the number +of packets in flight. If there are less than 4 packets in flight, +fast retransmissions can not be triggered, and the stream is prone +to experience high retransmission latencies. + +Since these mechanisms are targeted at time-dependent applications, +they must be specifically activated by the application using the +TCP_THIN_LINEAR_TIMEOUTS and TCP_THIN_DUPACK IOCTLS or the +tcp_thin_linear_timeouts and tcp_thin_dupack sysctls. Both +modifications are turned off by default. + +References +========== +More information on the modifications, as well as a wide range of +experimental data can be found here: +"Improving latency for interactive, thin-stream applications over +reliable transport" +http://simula.no/research/nd/publications/Simula.nd.477/simula_pdf_file diff --git a/Documentation/networking/tcp.txt b/Documentation/networking/tcp.txt index 0fa30042557..bdc4c0db51e 100644 --- a/Documentation/networking/tcp.txt +++ b/Documentation/networking/tcp.txt @@ -1,7 +1,7 @@ TCP protocol ============ -Last updated: 21 June 2005 +Last updated: 9 February 2008 Contents ======== @@ -30,7 +30,7 @@ A congestion control mechanism can be registered through functions in tcp_cong.c. The functions used by the congestion control mechanism are registered via passing a tcp_congestion_ops struct to tcp_register_congestion_control. As a minimum name, ssthresh, -cong_avoid, min_cwnd must be valid. +cong_avoid must be valid. Private data for a congestion control mechanism is stored in tp->ca_priv. tcp_ca(tp) returns a pointer to this space. This is preallocated space - it @@ -52,9 +52,9 @@ research and RFC's before developing new modules. The method that is used to determine which congestion control mechanism is determined by the setting of the sysctl net.ipv4.tcp_congestion_control. The default congestion control will be the last one registered (LIFO); -so if you built everything as modules. the default will be reno. If you -build with the default's from Kconfig, then BIC will be builtin (not a module) -and it will end up the default. +so if you built everything as modules, the default will be reno. If you +build with the defaults from Kconfig, then CUBIC will be builtin (not a +module) and it will end up the default. If you really want a particular default value then you will need to set it with the sysctl. If you use a sysctl, the module will be autoloaded @@ -62,7 +62,7 @@ if needed and you will get the expected protocol. If you ask for an unknown congestion method, then the sysctl attempt will fail. If you remove a tcp congestion control module, then you will get the next -available one. Since reno can not be built as a module, and can not be +available one. Since reno cannot be built as a module, and cannot be deleted, it will always be available. How the new TCP output machine [nyi] works. diff --git a/Documentation/networking/team.txt b/Documentation/networking/team.txt new file mode 100644 index 00000000000..5a013686b9e --- /dev/null +++ b/Documentation/networking/team.txt @@ -0,0 +1,2 @@ +Team devices are driven from userspace via libteam library which is here: + https://github.com/jpirko/libteam diff --git a/Documentation/networking/timestamping.txt b/Documentation/networking/timestamping.txt new file mode 100644 index 00000000000..bc355412490 --- /dev/null +++ b/Documentation/networking/timestamping.txt @@ -0,0 +1,217 @@ +The existing interfaces for getting network packages time stamped are: + +* SO_TIMESTAMP + Generate time stamp for each incoming packet using the (not necessarily + monotonous!) system time. Result is returned via recv_msg() in a + control message as timeval (usec resolution). + +* SO_TIMESTAMPNS + Same time stamping mechanism as SO_TIMESTAMP, but returns result as + timespec (nsec resolution). + +* IP_MULTICAST_LOOP + SO_TIMESTAMP[NS] + Only for multicasts: approximate send time stamp by receiving the looped + packet and using its receive time stamp. + +The following interface complements the existing ones: receive time +stamps can be generated and returned for arbitrary packets and much +closer to the point where the packet is really sent. Time stamps can +be generated in software (as before) or in hardware (if the hardware +has such a feature). + +SO_TIMESTAMPING: + +Instructs the socket layer which kind of information should be collected +and/or reported. The parameter is an integer with some of the following +bits set. Setting other bits is an error and doesn't change the current +state. + +Four of the bits are requests to the stack to try to generate +timestamps. Any combination of them is valid. + +SOF_TIMESTAMPING_TX_HARDWARE: try to obtain send time stamps in hardware +SOF_TIMESTAMPING_TX_SOFTWARE: try to obtain send time stamps in software +SOF_TIMESTAMPING_RX_HARDWARE: try to obtain receive time stamps in hardware +SOF_TIMESTAMPING_RX_SOFTWARE: try to obtain receive time stamps in software + +The other three bits control which timestamps will be reported in a +generated control message. If none of these bits are set or if none of +the set bits correspond to data that is available, then the control +message will not be generated: + +SOF_TIMESTAMPING_SOFTWARE: report systime if available +SOF_TIMESTAMPING_SYS_HARDWARE: report hwtimetrans if available +SOF_TIMESTAMPING_RAW_HARDWARE: report hwtimeraw if available + +It is worth noting that timestamps may be collected for reasons other +than being requested by a particular socket with +SOF_TIMESTAMPING_[TR]X_(HARD|SOFT)WARE. For example, most drivers that +can generate hardware receive timestamps ignore +SOF_TIMESTAMPING_RX_HARDWARE. It is still a good idea to set that flag +in case future drivers pay attention. + +If timestamps are reported, they will appear in a control message with +cmsg_level==SOL_SOCKET, cmsg_type==SO_TIMESTAMPING, and a payload like +this: + +struct scm_timestamping { + struct timespec systime; + struct timespec hwtimetrans; + struct timespec hwtimeraw; +}; + +recvmsg() can be used to get this control message for regular incoming +packets. For send time stamps the outgoing packet is looped back to +the socket's error queue with the send time stamp(s) attached. It can +be received with recvmsg(flags=MSG_ERRQUEUE). The call returns the +original outgoing packet data including all headers preprended down to +and including the link layer, the scm_timestamping control message and +a sock_extended_err control message with ee_errno==ENOMSG and +ee_origin==SO_EE_ORIGIN_TIMESTAMPING. A socket with such a pending +bounced packet is ready for reading as far as select() is concerned. +If the outgoing packet has to be fragmented, then only the first +fragment is time stamped and returned to the sending socket. + +All three values correspond to the same event in time, but were +generated in different ways. Each of these values may be empty (= all +zero), in which case no such value was available. If the application +is not interested in some of these values, they can be left blank to +avoid the potential overhead of calculating them. + +systime is the value of the system time at that moment. This +corresponds to the value also returned via SO_TIMESTAMP[NS]. If the +time stamp was generated by hardware, then this field is +empty. Otherwise it is filled in if SOF_TIMESTAMPING_SOFTWARE is +set. + +hwtimeraw is the original hardware time stamp. Filled in if +SOF_TIMESTAMPING_RAW_HARDWARE is set. No assumptions about its +relation to system time should be made. + +hwtimetrans is the hardware time stamp transformed so that it +corresponds as good as possible to system time. This correlation is +not perfect; as a consequence, sorting packets received via different +NICs by their hwtimetrans may differ from the order in which they were +received. hwtimetrans may be non-monotonic even for the same NIC. +Filled in if SOF_TIMESTAMPING_SYS_HARDWARE is set. Requires support +by the network device and will be empty without that support. + + +SIOCSHWTSTAMP, SIOCGHWTSTAMP: + +Hardware time stamping must also be initialized for each device driver +that is expected to do hardware time stamping. The parameter is defined in +/include/linux/net_tstamp.h as: + +struct hwtstamp_config { + int flags; /* no flags defined right now, must be zero */ + int tx_type; /* HWTSTAMP_TX_* */ + int rx_filter; /* HWTSTAMP_FILTER_* */ +}; + +Desired behavior is passed into the kernel and to a specific device by +calling ioctl(SIOCSHWTSTAMP) with a pointer to a struct ifreq whose +ifr_data points to a struct hwtstamp_config. The tx_type and +rx_filter are hints to the driver what it is expected to do. If +the requested fine-grained filtering for incoming packets is not +supported, the driver may time stamp more than just the requested types +of packets. + +A driver which supports hardware time stamping shall update the struct +with the actual, possibly more permissive configuration. If the +requested packets cannot be time stamped, then nothing should be +changed and ERANGE shall be returned (in contrast to EINVAL, which +indicates that SIOCSHWTSTAMP is not supported at all). + +Only a processes with admin rights may change the configuration. User +space is responsible to ensure that multiple processes don't interfere +with each other and that the settings are reset. + +Any process can read the actual configuration by passing this +structure to ioctl(SIOCGHWTSTAMP) in the same way. However, this has +not been implemented in all drivers. + +/* possible values for hwtstamp_config->tx_type */ +enum { + /* + * no outgoing packet will need hardware time stamping; + * should a packet arrive which asks for it, no hardware + * time stamping will be done + */ + HWTSTAMP_TX_OFF, + + /* + * enables hardware time stamping for outgoing packets; + * the sender of the packet decides which are to be + * time stamped by setting SOF_TIMESTAMPING_TX_SOFTWARE + * before sending the packet + */ + HWTSTAMP_TX_ON, +}; + +/* possible values for hwtstamp_config->rx_filter */ +enum { + /* time stamp no incoming packet at all */ + HWTSTAMP_FILTER_NONE, + + /* time stamp any incoming packet */ + HWTSTAMP_FILTER_ALL, + + /* return value: time stamp all packets requested plus some others */ + HWTSTAMP_FILTER_SOME, + + /* PTP v1, UDP, any kind of event packet */ + HWTSTAMP_FILTER_PTP_V1_L4_EVENT, + + /* for the complete list of values, please check + * the include file /include/linux/net_tstamp.h + */ +}; + + +DEVICE IMPLEMENTATION + +A driver which supports hardware time stamping must support the +SIOCSHWTSTAMP ioctl and update the supplied struct hwtstamp_config with +the actual values as described in the section on SIOCSHWTSTAMP. It +should also support SIOCGHWTSTAMP. + +Time stamps for received packets must be stored in the skb. To get a pointer +to the shared time stamp structure of the skb call skb_hwtstamps(). Then +set the time stamps in the structure: + +struct skb_shared_hwtstamps { + /* hardware time stamp transformed into duration + * since arbitrary point in time + */ + ktime_t hwtstamp; + ktime_t syststamp; /* hwtstamp transformed to system time base */ +}; + +Time stamps for outgoing packets are to be generated as follows: +- In hard_start_xmit(), check if (skb_shinfo(skb)->tx_flags & SKBTX_HW_TSTAMP) + is set no-zero. If yes, then the driver is expected to do hardware time + stamping. +- If this is possible for the skb and requested, then declare + that the driver is doing the time stamping by setting the flag + SKBTX_IN_PROGRESS in skb_shinfo(skb)->tx_flags , e.g. with + + skb_shinfo(skb)->tx_flags |= SKBTX_IN_PROGRESS; + + You might want to keep a pointer to the associated skb for the next step + and not free the skb. A driver not supporting hardware time stamping doesn't + do that. A driver must never touch sk_buff::tstamp! It is used to store + software generated time stamps by the network subsystem. +- Driver should call skb_tx_timestamp() as close to passing sk_buff to hardware + as possible. skb_tx_timestamp() provides a software time stamp if requested + and hardware timestamping is not possible (SKBTX_IN_PROGRESS not set). +- As soon as the driver has sent the packet and/or obtained a + hardware time stamp for it, it passes the time stamp back by + calling skb_hwtstamp_tx() with the original skb, the raw + hardware time stamp. skb_hwtstamp_tx() clones the original skb and + adds the timestamps, therefore the original skb has to be freed now. + If obtaining the hardware time stamp somehow fails, then the driver + should not fall back to software time stamping. The rationale is that + this would occur at a later time in the processing pipeline than other + software time stamping and therefore could lead to unexpected deltas + between time stamps. diff --git a/Documentation/networking/timestamping/.gitignore b/Documentation/networking/timestamping/.gitignore new file mode 100644 index 00000000000..a380159765c --- /dev/null +++ b/Documentation/networking/timestamping/.gitignore @@ -0,0 +1,2 @@ +timestamping +hwtstamp_config diff --git a/Documentation/networking/timestamping/Makefile b/Documentation/networking/timestamping/Makefile new file mode 100644 index 00000000000..d934afc8306 --- /dev/null +++ b/Documentation/networking/timestamping/Makefile @@ -0,0 +1,14 @@ +# kbuild trick to avoid linker error. Can be omitted if a module is built. +obj- := dummy.o + +# List of programs to build +hostprogs-y := timestamping hwtstamp_config + +# Tell kbuild to always build the programs +always := $(hostprogs-y) + +HOSTCFLAGS_timestamping.o += -I$(objtree)/usr/include +HOSTCFLAGS_hwtstamp_config.o += -I$(objtree)/usr/include + +clean: + rm -f timestamping hwtstamp_config diff --git a/Documentation/networking/timestamping/hwtstamp_config.c b/Documentation/networking/timestamping/hwtstamp_config.c new file mode 100644 index 00000000000..e8b685a7f15 --- /dev/null +++ b/Documentation/networking/timestamping/hwtstamp_config.c @@ -0,0 +1,134 @@ +/* Test program for SIOC{G,S}HWTSTAMP + * Copyright 2013 Solarflare Communications + * Author: Ben Hutchings + */ + +#include <errno.h> +#include <stdio.h> +#include <stdlib.h> +#include <string.h> + +#include <sys/socket.h> +#include <sys/ioctl.h> + +#include <linux/if.h> +#include <linux/net_tstamp.h> +#include <linux/sockios.h> + +static int +lookup_value(const char **names, int size, const char *name) +{ + int value; + + for (value = 0; value < size; value++) + if (names[value] && strcasecmp(names[value], name) == 0) + return value; + + return -1; +} + +static const char * +lookup_name(const char **names, int size, int value) +{ + return (value >= 0 && value < size) ? names[value] : NULL; +} + +static void list_names(FILE *f, const char **names, int size) +{ + int value; + + for (value = 0; value < size; value++) + if (names[value]) + fprintf(f, " %s\n", names[value]); +} + +static const char *tx_types[] = { +#define TX_TYPE(name) [HWTSTAMP_TX_ ## name] = #name + TX_TYPE(OFF), + TX_TYPE(ON), + TX_TYPE(ONESTEP_SYNC) +#undef TX_TYPE +}; +#define N_TX_TYPES ((int)(sizeof(tx_types) / sizeof(tx_types[0]))) + +static const char *rx_filters[] = { +#define RX_FILTER(name) [HWTSTAMP_FILTER_ ## name] = #name + RX_FILTER(NONE), + RX_FILTER(ALL), + RX_FILTER(SOME), + RX_FILTER(PTP_V1_L4_EVENT), + RX_FILTER(PTP_V1_L4_SYNC), + RX_FILTER(PTP_V1_L4_DELAY_REQ), + RX_FILTER(PTP_V2_L4_EVENT), + RX_FILTER(PTP_V2_L4_SYNC), + RX_FILTER(PTP_V2_L4_DELAY_REQ), + RX_FILTER(PTP_V2_L2_EVENT), + RX_FILTER(PTP_V2_L2_SYNC), + RX_FILTER(PTP_V2_L2_DELAY_REQ), + RX_FILTER(PTP_V2_EVENT), + RX_FILTER(PTP_V2_SYNC), + RX_FILTER(PTP_V2_DELAY_REQ), +#undef RX_FILTER +}; +#define N_RX_FILTERS ((int)(sizeof(rx_filters) / sizeof(rx_filters[0]))) + +static void usage(void) +{ + fputs("Usage: hwtstamp_config if_name [tx_type rx_filter]\n" + "tx_type is any of (case-insensitive):\n", + stderr); + list_names(stderr, tx_types, N_TX_TYPES); + fputs("rx_filter is any of (case-insensitive):\n", stderr); + list_names(stderr, rx_filters, N_RX_FILTERS); +} + +int main(int argc, char **argv) +{ + struct ifreq ifr; + struct hwtstamp_config config; + const char *name; + int sock; + + if ((argc != 2 && argc != 4) || (strlen(argv[1]) >= IFNAMSIZ)) { + usage(); + return 2; + } + + if (argc == 4) { + config.flags = 0; + config.tx_type = lookup_value(tx_types, N_TX_TYPES, argv[2]); + config.rx_filter = lookup_value(rx_filters, N_RX_FILTERS, argv[3]); + if (config.tx_type < 0 || config.rx_filter < 0) { + usage(); + return 2; + } + } + + sock = socket(AF_INET, SOCK_DGRAM, 0); + if (sock < 0) { + perror("socket"); + return 1; + } + + strcpy(ifr.ifr_name, argv[1]); + ifr.ifr_data = (caddr_t)&config; + + if (ioctl(sock, (argc == 2) ? SIOCGHWTSTAMP : SIOCSHWTSTAMP, &ifr)) { + perror("ioctl"); + return 1; + } + + printf("flags = %#x\n", config.flags); + name = lookup_name(tx_types, N_TX_TYPES, config.tx_type); + if (name) + printf("tx_type = %s\n", name); + else + printf("tx_type = %d\n", config.tx_type); + name = lookup_name(rx_filters, N_RX_FILTERS, config.rx_filter); + if (name) + printf("rx_filter = %s\n", name); + else + printf("rx_filter = %d\n", config.rx_filter); + + return 0; +} diff --git a/Documentation/networking/timestamping/timestamping.c b/Documentation/networking/timestamping/timestamping.c new file mode 100644 index 00000000000..8ba82bfe6a3 --- /dev/null +++ b/Documentation/networking/timestamping/timestamping.c @@ -0,0 +1,533 @@ +/* + * This program demonstrates how the various time stamping features in + * the Linux kernel work. It emulates the behavior of a PTP + * implementation in stand-alone master mode by sending PTPv1 Sync + * multicasts once every second. It looks for similar packets, but + * beyond that doesn't actually implement PTP. + * + * Outgoing packets are time stamped with SO_TIMESTAMPING with or + * without hardware support. + * + * Incoming packets are time stamped with SO_TIMESTAMPING with or + * without hardware support, SIOCGSTAMP[NS] (per-socket time stamp) and + * SO_TIMESTAMP[NS]. + * + * Copyright (C) 2009 Intel Corporation. + * Author: Patrick Ohly <patrick.ohly@intel.com> + * + * This program is free software; you can redistribute it and/or modify it + * under the terms and conditions of the GNU General Public License, + * version 2, as published by the Free Software Foundation. + * + * This program is distributed in the hope it will be useful, but WITHOUT + * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or + * FITNESS FOR A PARTICULAR PURPOSE. * See the GNU General Public License for + * more details. + * + * You should have received a copy of the GNU General Public License along with + * this program; if not, write to the Free Software Foundation, Inc., + * 51 Franklin St - Fifth Floor, Boston, MA 02110-1301 USA. + */ + +#include <stdio.h> +#include <stdlib.h> +#include <errno.h> +#include <string.h> + +#include <sys/time.h> +#include <sys/socket.h> +#include <sys/select.h> +#include <sys/ioctl.h> +#include <arpa/inet.h> +#include <net/if.h> + +#include <asm/types.h> +#include <linux/net_tstamp.h> +#include <linux/errqueue.h> + +#ifndef SO_TIMESTAMPING +# define SO_TIMESTAMPING 37 +# define SCM_TIMESTAMPING SO_TIMESTAMPING +#endif + +#ifndef SO_TIMESTAMPNS +# define SO_TIMESTAMPNS 35 +#endif + +#ifndef SIOCGSTAMPNS +# define SIOCGSTAMPNS 0x8907 +#endif + +#ifndef SIOCSHWTSTAMP +# define SIOCSHWTSTAMP 0x89b0 +#endif + +static void usage(const char *error) +{ + if (error) + printf("invalid option: %s\n", error); + printf("timestamping interface option*\n\n" + "Options:\n" + " IP_MULTICAST_LOOP - looping outgoing multicasts\n" + " SO_TIMESTAMP - normal software time stamping, ms resolution\n" + " SO_TIMESTAMPNS - more accurate software time stamping\n" + " SOF_TIMESTAMPING_TX_HARDWARE - hardware time stamping of outgoing packets\n" + " SOF_TIMESTAMPING_TX_SOFTWARE - software fallback for outgoing packets\n" + " SOF_TIMESTAMPING_RX_HARDWARE - hardware time stamping of incoming packets\n" + " SOF_TIMESTAMPING_RX_SOFTWARE - software fallback for incoming packets\n" + " SOF_TIMESTAMPING_SOFTWARE - request reporting of software time stamps\n" + " SOF_TIMESTAMPING_SYS_HARDWARE - request reporting of transformed HW time stamps\n" + " SOF_TIMESTAMPING_RAW_HARDWARE - request reporting of raw HW time stamps\n" + " SIOCGSTAMP - check last socket time stamp\n" + " SIOCGSTAMPNS - more accurate socket time stamp\n"); + exit(1); +} + +static void bail(const char *error) +{ + printf("%s: %s\n", error, strerror(errno)); + exit(1); +} + +static const unsigned char sync[] = { + 0x00, 0x01, 0x00, 0x01, + 0x5f, 0x44, 0x46, 0x4c, + 0x54, 0x00, 0x00, 0x00, + 0x00, 0x00, 0x00, 0x00, + 0x00, 0x00, 0x00, 0x00, + 0x01, 0x01, + + /* fake uuid */ + 0x00, 0x01, + 0x02, 0x03, 0x04, 0x05, + + 0x00, 0x01, 0x00, 0x37, + 0x00, 0x00, 0x00, 0x08, + 0x00, 0x00, 0x00, 0x00, + 0x49, 0x05, 0xcd, 0x01, + 0x29, 0xb1, 0x8d, 0xb0, + 0x00, 0x00, 0x00, 0x00, + 0x00, 0x01, + + /* fake uuid */ + 0x00, 0x01, + 0x02, 0x03, 0x04, 0x05, + + 0x00, 0x00, 0x00, 0x37, + 0x00, 0x00, 0x00, 0x04, + 0x44, 0x46, 0x4c, 0x54, + 0x00, 0x00, 0xf0, 0x60, + 0x00, 0x01, 0x00, 0x00, + 0x00, 0x00, 0x00, 0x01, + 0x00, 0x00, 0xf0, 0x60, + 0x00, 0x00, 0x00, 0x00, + 0x00, 0x00, 0x00, 0x04, + 0x44, 0x46, 0x4c, 0x54, + 0x00, 0x01, + + /* fake uuid */ + 0x00, 0x01, + 0x02, 0x03, 0x04, 0x05, + + 0x00, 0x00, 0x00, 0x00, + 0x00, 0x00, 0x00, 0x00, + 0x00, 0x00, 0x00, 0x00, + 0x00, 0x00, 0x00, 0x00 +}; + +static void sendpacket(int sock, struct sockaddr *addr, socklen_t addr_len) +{ + struct timeval now; + int res; + + res = sendto(sock, sync, sizeof(sync), 0, + addr, addr_len); + gettimeofday(&now, 0); + if (res < 0) + printf("%s: %s\n", "send", strerror(errno)); + else + printf("%ld.%06ld: sent %d bytes\n", + (long)now.tv_sec, (long)now.tv_usec, + res); +} + +static void printpacket(struct msghdr *msg, int res, + char *data, + int sock, int recvmsg_flags, + int siocgstamp, int siocgstampns) +{ + struct sockaddr_in *from_addr = (struct sockaddr_in *)msg->msg_name; + struct cmsghdr *cmsg; + struct timeval tv; + struct timespec ts; + struct timeval now; + + gettimeofday(&now, 0); + + printf("%ld.%06ld: received %s data, %d bytes from %s, %zu bytes control messages\n", + (long)now.tv_sec, (long)now.tv_usec, + (recvmsg_flags & MSG_ERRQUEUE) ? "error" : "regular", + res, + inet_ntoa(from_addr->sin_addr), + msg->msg_controllen); + for (cmsg = CMSG_FIRSTHDR(msg); + cmsg; + cmsg = CMSG_NXTHDR(msg, cmsg)) { + printf(" cmsg len %zu: ", cmsg->cmsg_len); + switch (cmsg->cmsg_level) { + case SOL_SOCKET: + printf("SOL_SOCKET "); + switch (cmsg->cmsg_type) { + case SO_TIMESTAMP: { + struct timeval *stamp = + (struct timeval *)CMSG_DATA(cmsg); + printf("SO_TIMESTAMP %ld.%06ld", + (long)stamp->tv_sec, + (long)stamp->tv_usec); + break; + } + case SO_TIMESTAMPNS: { + struct timespec *stamp = + (struct timespec *)CMSG_DATA(cmsg); + printf("SO_TIMESTAMPNS %ld.%09ld", + (long)stamp->tv_sec, + (long)stamp->tv_nsec); + break; + } + case SO_TIMESTAMPING: { + struct timespec *stamp = + (struct timespec *)CMSG_DATA(cmsg); + printf("SO_TIMESTAMPING "); + printf("SW %ld.%09ld ", + (long)stamp->tv_sec, + (long)stamp->tv_nsec); + stamp++; + printf("HW transformed %ld.%09ld ", + (long)stamp->tv_sec, + (long)stamp->tv_nsec); + stamp++; + printf("HW raw %ld.%09ld", + (long)stamp->tv_sec, + (long)stamp->tv_nsec); + break; + } + default: + printf("type %d", cmsg->cmsg_type); + break; + } + break; + case IPPROTO_IP: + printf("IPPROTO_IP "); + switch (cmsg->cmsg_type) { + case IP_RECVERR: { + struct sock_extended_err *err = + (struct sock_extended_err *)CMSG_DATA(cmsg); + printf("IP_RECVERR ee_errno '%s' ee_origin %d => %s", + strerror(err->ee_errno), + err->ee_origin, +#ifdef SO_EE_ORIGIN_TIMESTAMPING + err->ee_origin == SO_EE_ORIGIN_TIMESTAMPING ? + "bounced packet" : "unexpected origin" +#else + "probably SO_EE_ORIGIN_TIMESTAMPING" +#endif + ); + if (res < sizeof(sync)) + printf(" => truncated data?!"); + else if (!memcmp(sync, data + res - sizeof(sync), + sizeof(sync))) + printf(" => GOT OUR DATA BACK (HURRAY!)"); + break; + } + case IP_PKTINFO: { + struct in_pktinfo *pktinfo = + (struct in_pktinfo *)CMSG_DATA(cmsg); + printf("IP_PKTINFO interface index %u", + pktinfo->ipi_ifindex); + break; + } + default: + printf("type %d", cmsg->cmsg_type); + break; + } + break; + default: + printf("level %d type %d", + cmsg->cmsg_level, + cmsg->cmsg_type); + break; + } + printf("\n"); + } + + if (siocgstamp) { + if (ioctl(sock, SIOCGSTAMP, &tv)) + printf(" %s: %s\n", "SIOCGSTAMP", strerror(errno)); + else + printf("SIOCGSTAMP %ld.%06ld\n", + (long)tv.tv_sec, + (long)tv.tv_usec); + } + if (siocgstampns) { + if (ioctl(sock, SIOCGSTAMPNS, &ts)) + printf(" %s: %s\n", "SIOCGSTAMPNS", strerror(errno)); + else + printf("SIOCGSTAMPNS %ld.%09ld\n", + (long)ts.tv_sec, + (long)ts.tv_nsec); + } +} + +static void recvpacket(int sock, int recvmsg_flags, + int siocgstamp, int siocgstampns) +{ + char data[256]; + struct msghdr msg; + struct iovec entry; + struct sockaddr_in from_addr; + struct { + struct cmsghdr cm; + char control[512]; + } control; + int res; + + memset(&msg, 0, sizeof(msg)); + msg.msg_iov = &entry; + msg.msg_iovlen = 1; + entry.iov_base = data; + entry.iov_len = sizeof(data); + msg.msg_name = (caddr_t)&from_addr; + msg.msg_namelen = sizeof(from_addr); + msg.msg_control = &control; + msg.msg_controllen = sizeof(control); + + res = recvmsg(sock, &msg, recvmsg_flags|MSG_DONTWAIT); + if (res < 0) { + printf("%s %s: %s\n", + "recvmsg", + (recvmsg_flags & MSG_ERRQUEUE) ? "error" : "regular", + strerror(errno)); + } else { + printpacket(&msg, res, data, + sock, recvmsg_flags, + siocgstamp, siocgstampns); + } +} + +int main(int argc, char **argv) +{ + int so_timestamping_flags = 0; + int so_timestamp = 0; + int so_timestampns = 0; + int siocgstamp = 0; + int siocgstampns = 0; + int ip_multicast_loop = 0; + char *interface; + int i; + int enabled = 1; + int sock; + struct ifreq device; + struct ifreq hwtstamp; + struct hwtstamp_config hwconfig, hwconfig_requested; + struct sockaddr_in addr; + struct ip_mreq imr; + struct in_addr iaddr; + int val; + socklen_t len; + struct timeval next; + + if (argc < 2) + usage(0); + interface = argv[1]; + + for (i = 2; i < argc; i++) { + if (!strcasecmp(argv[i], "SO_TIMESTAMP")) + so_timestamp = 1; + else if (!strcasecmp(argv[i], "SO_TIMESTAMPNS")) + so_timestampns = 1; + else if (!strcasecmp(argv[i], "SIOCGSTAMP")) + siocgstamp = 1; + else if (!strcasecmp(argv[i], "SIOCGSTAMPNS")) + siocgstampns = 1; + else if (!strcasecmp(argv[i], "IP_MULTICAST_LOOP")) + ip_multicast_loop = 1; + else if (!strcasecmp(argv[i], "SOF_TIMESTAMPING_TX_HARDWARE")) + so_timestamping_flags |= SOF_TIMESTAMPING_TX_HARDWARE; + else if (!strcasecmp(argv[i], "SOF_TIMESTAMPING_TX_SOFTWARE")) + so_timestamping_flags |= SOF_TIMESTAMPING_TX_SOFTWARE; + else if (!strcasecmp(argv[i], "SOF_TIMESTAMPING_RX_HARDWARE")) + so_timestamping_flags |= SOF_TIMESTAMPING_RX_HARDWARE; + else if (!strcasecmp(argv[i], "SOF_TIMESTAMPING_RX_SOFTWARE")) + so_timestamping_flags |= SOF_TIMESTAMPING_RX_SOFTWARE; + else if (!strcasecmp(argv[i], "SOF_TIMESTAMPING_SOFTWARE")) + so_timestamping_flags |= SOF_TIMESTAMPING_SOFTWARE; + else if (!strcasecmp(argv[i], "SOF_TIMESTAMPING_SYS_HARDWARE")) + so_timestamping_flags |= SOF_TIMESTAMPING_SYS_HARDWARE; + else if (!strcasecmp(argv[i], "SOF_TIMESTAMPING_RAW_HARDWARE")) + so_timestamping_flags |= SOF_TIMESTAMPING_RAW_HARDWARE; + else + usage(argv[i]); + } + + sock = socket(PF_INET, SOCK_DGRAM, IPPROTO_UDP); + if (sock < 0) + bail("socket"); + + memset(&device, 0, sizeof(device)); + strncpy(device.ifr_name, interface, sizeof(device.ifr_name)); + if (ioctl(sock, SIOCGIFADDR, &device) < 0) + bail("getting interface IP address"); + + memset(&hwtstamp, 0, sizeof(hwtstamp)); + strncpy(hwtstamp.ifr_name, interface, sizeof(hwtstamp.ifr_name)); + hwtstamp.ifr_data = (void *)&hwconfig; + memset(&hwconfig, 0, sizeof(hwconfig)); + hwconfig.tx_type = + (so_timestamping_flags & SOF_TIMESTAMPING_TX_HARDWARE) ? + HWTSTAMP_TX_ON : HWTSTAMP_TX_OFF; + hwconfig.rx_filter = + (so_timestamping_flags & SOF_TIMESTAMPING_RX_HARDWARE) ? + HWTSTAMP_FILTER_PTP_V1_L4_SYNC : HWTSTAMP_FILTER_NONE; + hwconfig_requested = hwconfig; + if (ioctl(sock, SIOCSHWTSTAMP, &hwtstamp) < 0) { + if ((errno == EINVAL || errno == ENOTSUP) && + hwconfig_requested.tx_type == HWTSTAMP_TX_OFF && + hwconfig_requested.rx_filter == HWTSTAMP_FILTER_NONE) + printf("SIOCSHWTSTAMP: disabling hardware time stamping not possible\n"); + else + bail("SIOCSHWTSTAMP"); + } + printf("SIOCSHWTSTAMP: tx_type %d requested, got %d; rx_filter %d requested, got %d\n", + hwconfig_requested.tx_type, hwconfig.tx_type, + hwconfig_requested.rx_filter, hwconfig.rx_filter); + + /* bind to PTP port */ + addr.sin_family = AF_INET; + addr.sin_addr.s_addr = htonl(INADDR_ANY); + addr.sin_port = htons(319 /* PTP event port */); + if (bind(sock, + (struct sockaddr *)&addr, + sizeof(struct sockaddr_in)) < 0) + bail("bind"); + + /* set multicast group for outgoing packets */ + inet_aton("224.0.1.130", &iaddr); /* alternate PTP domain 1 */ + addr.sin_addr = iaddr; + imr.imr_multiaddr.s_addr = iaddr.s_addr; + imr.imr_interface.s_addr = + ((struct sockaddr_in *)&device.ifr_addr)->sin_addr.s_addr; + if (setsockopt(sock, IPPROTO_IP, IP_MULTICAST_IF, + &imr.imr_interface.s_addr, sizeof(struct in_addr)) < 0) + bail("set multicast"); + + /* join multicast group, loop our own packet */ + if (setsockopt(sock, IPPROTO_IP, IP_ADD_MEMBERSHIP, + &imr, sizeof(struct ip_mreq)) < 0) + bail("join multicast group"); + + if (setsockopt(sock, IPPROTO_IP, IP_MULTICAST_LOOP, + &ip_multicast_loop, sizeof(enabled)) < 0) { + bail("loop multicast"); + } + + /* set socket options for time stamping */ + if (so_timestamp && + setsockopt(sock, SOL_SOCKET, SO_TIMESTAMP, + &enabled, sizeof(enabled)) < 0) + bail("setsockopt SO_TIMESTAMP"); + + if (so_timestampns && + setsockopt(sock, SOL_SOCKET, SO_TIMESTAMPNS, + &enabled, sizeof(enabled)) < 0) + bail("setsockopt SO_TIMESTAMPNS"); + + if (so_timestamping_flags && + setsockopt(sock, SOL_SOCKET, SO_TIMESTAMPING, + &so_timestamping_flags, + sizeof(so_timestamping_flags)) < 0) + bail("setsockopt SO_TIMESTAMPING"); + + /* request IP_PKTINFO for debugging purposes */ + if (setsockopt(sock, SOL_IP, IP_PKTINFO, + &enabled, sizeof(enabled)) < 0) + printf("%s: %s\n", "setsockopt IP_PKTINFO", strerror(errno)); + + /* verify socket options */ + len = sizeof(val); + if (getsockopt(sock, SOL_SOCKET, SO_TIMESTAMP, &val, &len) < 0) + printf("%s: %s\n", "getsockopt SO_TIMESTAMP", strerror(errno)); + else + printf("SO_TIMESTAMP %d\n", val); + + if (getsockopt(sock, SOL_SOCKET, SO_TIMESTAMPNS, &val, &len) < 0) + printf("%s: %s\n", "getsockopt SO_TIMESTAMPNS", + strerror(errno)); + else + printf("SO_TIMESTAMPNS %d\n", val); + + if (getsockopt(sock, SOL_SOCKET, SO_TIMESTAMPING, &val, &len) < 0) { + printf("%s: %s\n", "getsockopt SO_TIMESTAMPING", + strerror(errno)); + } else { + printf("SO_TIMESTAMPING %d\n", val); + if (val != so_timestamping_flags) + printf(" not the expected value %d\n", + so_timestamping_flags); + } + + /* send packets forever every five seconds */ + gettimeofday(&next, 0); + next.tv_sec = (next.tv_sec + 1) / 5 * 5; + next.tv_usec = 0; + while (1) { + struct timeval now; + struct timeval delta; + long delta_us; + int res; + fd_set readfs, errorfs; + + gettimeofday(&now, 0); + delta_us = (long)(next.tv_sec - now.tv_sec) * 1000000 + + (long)(next.tv_usec - now.tv_usec); + if (delta_us > 0) { + /* continue waiting for timeout or data */ + delta.tv_sec = delta_us / 1000000; + delta.tv_usec = delta_us % 1000000; + + FD_ZERO(&readfs); + FD_ZERO(&errorfs); + FD_SET(sock, &readfs); + FD_SET(sock, &errorfs); + printf("%ld.%06ld: select %ldus\n", + (long)now.tv_sec, (long)now.tv_usec, + delta_us); + res = select(sock + 1, &readfs, 0, &errorfs, &delta); + gettimeofday(&now, 0); + printf("%ld.%06ld: select returned: %d, %s\n", + (long)now.tv_sec, (long)now.tv_usec, + res, + res < 0 ? strerror(errno) : "success"); + if (res > 0) { + if (FD_ISSET(sock, &readfs)) + printf("ready for reading\n"); + if (FD_ISSET(sock, &errorfs)) + printf("has error\n"); + recvpacket(sock, 0, + siocgstamp, + siocgstampns); + recvpacket(sock, MSG_ERRQUEUE, + siocgstamp, + siocgstampns); + } + } else { + /* write one packet */ + sendpacket(sock, + (struct sockaddr *)&addr, + sizeof(addr)); + next.tv_sec += 5; + continue; + } + } + + return 0; +} diff --git a/Documentation/networking/tlan.txt b/Documentation/networking/tlan.txt index 7e6aa5b20c3..34550dfcef7 100644 --- a/Documentation/networking/tlan.txt +++ b/Documentation/networking/tlan.txt @@ -2,7 +2,7 @@ (C) 1998 James Banks (C) 1999-2001 Torben Mathiasen <tmm@image.dk, torben.mathiasen@compaq.com> -For driver information/updates visit http://opensource.compaq.com +For driver information/updates visit http://www.compaq.com TLAN driver for Linux, version 1.14a @@ -113,5 +113,5 @@ III. Things to try if you have problems. There is also a tlan mailing list which you can join by sending "subscribe tlan" in the body of an email to majordomo@vuser.vu.union.edu. -There is also a tlan website at http://opensource.compaq.com +There is also a tlan website at http://www.compaq.com diff --git a/Documentation/networking/tms380tr.txt b/Documentation/networking/tms380tr.txt deleted file mode 100644 index 179e527b9da..00000000000 --- a/Documentation/networking/tms380tr.txt +++ /dev/null @@ -1,147 +0,0 @@ -Text file for the Linux SysKonnect Token Ring ISA/PCI Adapter Driver. - Text file by: Jay Schulist <jschlst@samba.org> - -The Linux SysKonnect Token Ring driver works with the SysKonnect TR4/16(+) ISA, -SysKonnect TR4/16(+) PCI, SysKonnect TR4/16 PCI, and older revisions of the -SK NET TR4/16 ISA card. - -Latest information on this driver can be obtained on the Linux-SNA WWW site. -Please point your browser to: -http://www.linux-sna.org - -Many thanks to Christoph Goos for his excellent work on this driver and -SysKonnect for donating the adapters to Linux-SNA for the testing and -maintenance of this device driver. - -Important information to be noted: -1. Adapters can be slow to open (~20 secs) and close (~5 secs), please be - patient. -2. This driver works very well when autoprobing for adapters. Why even - think about those nasty io/int/dma settings of modprobe when the driver - will do it all for you! - -This driver is rather simple to use. Select Y to Token Ring adapter support -in the kernel configuration. A choice for SysKonnect Token Ring adapters will -appear. This drives supports all SysKonnect ISA and PCI adapters. Choose this -option. I personally recommend compiling the driver as a module (M), but if you -you would like to compile it staticly answer Y instead. - -This driver supports multiple adapters without the need to load multiple copies -of the driver. You should be able to load up to 7 adapters without any kernel -modifications, if you are in need of more please contact the maintainer of this -driver. - -Load the driver either by lilo/loadlin or as a module. When a module using the -following command will suffice for most: - -# modprobe sktr - -This will produce output similar to the following: (Output is user specific) - -sktr.c: v1.01 08/29/97 by Christoph Goos -tr0: SK NET TR 4/16 PCI found at 0x6100, using IRQ 17. -tr1: SK NET TR 4/16 PCI found at 0x6200, using IRQ 16. -tr2: SK NET TR 4/16 ISA found at 0xa20, using IRQ 10 and DMA 5. - -Now just setup the device via ifconfig and set and routes you may have. After -this you are ready to start sending some tokens. - -Errata: -For anyone wondering where to pick up the SysKonnect adapters please browse -to http://www.syskonnect.com - -This driver is under the GNU General Public License. Its Firmware image is -included as an initialized C-array and is licensed by SysKonnect to the Linux -users of this driver. However no warranty about its fitness is expressed or -implied by SysKonnect. - -Below find attached the setting for the SK NET TR 4/16 ISA adapters -------------------------------------------------------------------- - - *************************** - *** C O N T E N T S *** - *************************** - - 1) Location of DIP-Switch W1 - 2) Default settings - 3) DIP-Switch W1 description - - - ============================================================== - CHAPTER 1 LOCATION OF DIP-SWITCH - ============================================================== - -UÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄ¿ -þUÄÄÄÄÄÄ¿ UÄÄÄÄÄ¿ UÄÄÄ¿ þ -þAÄÄÄÄÄÄU W1 AÄÄÄÄÄU UÄÄÄÄ¿ þ þ þ -þUÄÄÄÄÄÄ¿ þ þ þ þ UÄÄÅ¿ -þAÄÄÄÄÄÄU UÄÄÄÄÄÄÄÄÄÄÄ¿ AÄÄÄÄU þ þ þ þþ -þUÄÄÄÄÄÄ¿ þ þ UÄÄÄ¿ AÄÄÄU AÄÄÅU -þAÄÄÄÄÄÄU þ TMS380C26 þ þ þ þ -þUÄÄÄÄÄÄ¿ þ þ AÄÄÄU AÄ¿ -þAÄÄÄÄÄÄU þ þ þ þ -þ AÄÄÄÄÄÄÄÄÄÄÄU þ þ -þ þ þ -þ AÄU -þ þ -þ þ -þ þ -þ þ -AÄÄÄÄÄÄÄÄÄÄÄÄAÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄAÄÄAÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄAÄÄÄÄÄÄÄÄÄU - AÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄU AÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄU - - ============================================================== - CHAPTER 2 DEFAULT SETTINGS - ============================================================== - - W1 1 2 3 4 5 6 7 8 - +------------------------------+ - | ON X | - | OFF X X X X X X X | - +------------------------------+ - - W1.1 = ON Adapter drives address lines SA17..19 - W1.2 - 1.5 = OFF BootROM disabled - W1.6 - 1.8 = OFF I/O address 0A20h - - ============================================================== - CHAPTER 3 DIP SWITCH W1 DESCRIPTION - ============================================================== - - UÄÄÄAÄÄÄAÄÄÄAÄÄÄAÄÄÄAÄÄÄAÄÄÄAÄÄÄ¿ ON - þ 1 þ 2 þ 3 þ 4 þ 5 þ 6 þ 7 þ 8 þ - AÄÄÄAÄÄÄAÄÄÄAÄÄÄAÄÄÄAÄÄÄAÄÄÄAÄÄÄU OFF - |AD | BootROM Addr. | I/O | - +-+-+-------+-------+-----+-----+ - | | | - | | +------ 6 7 8 - | | ON ON ON 1900h - | | ON ON OFF 0900h - | | ON OFF ON 1980h - | | ON OFF OFF 0980h - | | OFF ON ON 1b20h - | | OFF ON OFF 0b20h - | | OFF OFF ON 1a20h - | | OFF OFF OFF 0a20h (+) - | | - | | - | +-------- 2 3 4 5 - | OFF x x x disabled (+) - | ON ON ON ON C0000 - | ON ON ON OFF C4000 - | ON ON OFF ON C8000 - | ON ON OFF OFF CC000 - | ON OFF ON ON D0000 - | ON OFF ON OFF D4000 - | ON OFF OFF ON D8000 - | ON OFF OFF OFF DC000 - | - | - +----- 1 - OFF adapter does NOT drive SA<17..19> - ON adapter drives SA<17..19> (+) - - - (+) means default setting - - ******************************** diff --git a/Documentation/networking/tproxy.txt b/Documentation/networking/tproxy.txt new file mode 100644 index 00000000000..ec11429e1d4 --- /dev/null +++ b/Documentation/networking/tproxy.txt @@ -0,0 +1,84 @@ +Transparent proxy support +========================= + +This feature adds Linux 2.2-like transparent proxy support to current kernels. +To use it, enable the socket match and the TPROXY target in your kernel config. +You will need policy routing too, so be sure to enable that as well. + + +1. Making non-local sockets work +================================ + +The idea is that you identify packets with destination address matching a local +socket on your box, set the packet mark to a certain value, and then match on that +value using policy routing to have those packets delivered locally: + +# iptables -t mangle -N DIVERT +# iptables -t mangle -A PREROUTING -p tcp -m socket -j DIVERT +# iptables -t mangle -A DIVERT -j MARK --set-mark 1 +# iptables -t mangle -A DIVERT -j ACCEPT + +# ip rule add fwmark 1 lookup 100 +# ip route add local 0.0.0.0/0 dev lo table 100 + +Because of certain restrictions in the IPv4 routing output code you'll have to +modify your application to allow it to send datagrams _from_ non-local IP +addresses. All you have to do is enable the (SOL_IP, IP_TRANSPARENT) socket +option before calling bind: + +fd = socket(AF_INET, SOCK_STREAM, 0); +/* - 8< -*/ +int value = 1; +setsockopt(fd, SOL_IP, IP_TRANSPARENT, &value, sizeof(value)); +/* - 8< -*/ +name.sin_family = AF_INET; +name.sin_port = htons(0xCAFE); +name.sin_addr.s_addr = htonl(0xDEADBEEF); +bind(fd, &name, sizeof(name)); + +A trivial patch for netcat is available here: +http://people.netfilter.org/hidden/tproxy/netcat-ip_transparent-support.patch + + +2. Redirecting traffic +====================== + +Transparent proxying often involves "intercepting" traffic on a router. This is +usually done with the iptables REDIRECT target; however, there are serious +limitations of that method. One of the major issues is that it actually +modifies the packets to change the destination address -- which might not be +acceptable in certain situations. (Think of proxying UDP for example: you won't +be able to find out the original destination address. Even in case of TCP +getting the original destination address is racy.) + +The 'TPROXY' target provides similar functionality without relying on NAT. Simply +add rules like this to the iptables ruleset above: + +# iptables -t mangle -A PREROUTING -p tcp --dport 80 -j TPROXY \ + --tproxy-mark 0x1/0x1 --on-port 50080 + +Note that for this to work you'll have to modify the proxy to enable (SOL_IP, +IP_TRANSPARENT) for the listening socket. + + +3. Iptables extensions +====================== + +To use tproxy you'll need to have the 'socket' and 'TPROXY' modules +compiled for iptables. A patched version of iptables is available +here: http://git.balabit.hu/?p=bazsi/iptables-tproxy.git + + +4. Application support +====================== + +4.1. Squid +---------- + +Squid 3.HEAD has support built-in. To use it, pass +'--enable-linux-netfilter' to configure and set the 'tproxy' option on +the HTTP listener you redirect traffic to with the TPROXY iptables +target. + +For more information please consult the following page on the Squid +wiki: http://wiki.squid-cache.org/Features/Tproxy4 diff --git a/Documentation/networking/tuntap.txt b/Documentation/networking/tuntap.txt index ec3d109d787..949d5dcdd9a 100644 --- a/Documentation/networking/tuntap.txt +++ b/Documentation/networking/tuntap.txt @@ -39,10 +39,13 @@ Copyright (C) 1999-2000 Maxim Krasnyansky <max_mk@yahoo.com> mknod /dev/net/tun c 10 200 Set permissions: - e.g. chmod 0700 /dev/net/tun - if you want the device only accessible by root. Giving regular users the - right to assign network devices is NOT a good idea. Users could assign - bogus network interfaces to trick firewalls or administrators. + e.g. chmod 0666 /dev/net/tun + There's no harm in allowing the device to be accessible by non-root users, + since CAP_NET_ADMIN is required for creating network devices or for + connecting to network devices which aren't owned by the user in question. + If you want to create persistent devices and give ownership of them to + unprivileged users, then you need the /dev/net/tun device to be usable by + those users. Driver module autoloading @@ -102,6 +105,83 @@ Copyright (C) 1999-2000 Maxim Krasnyansky <max_mk@yahoo.com> Proto [2 bytes] Raw protocol(IP, IPv6, etc) frame. + 3.3 Multiqueue tuntap interface: + + From version 3.8, Linux supports multiqueue tuntap which can uses multiple + file descriptors (queues) to parallelize packets sending or receiving. The + device allocation is the same as before, and if user wants to create multiple + queues, TUNSETIFF with the same device name must be called many times with + IFF_MULTI_QUEUE flag. + + char *dev should be the name of the device, queues is the number of queues to + be created, fds is used to store and return the file descriptors (queues) + created to the caller. Each file descriptor were served as the interface of a + queue which could be accessed by userspace. + + #include <linux/if.h> + #include <linux/if_tun.h> + + int tun_alloc_mq(char *dev, int queues, int *fds) + { + struct ifreq ifr; + int fd, err, i; + + if (!dev) + return -1; + + memset(&ifr, 0, sizeof(ifr)); + /* Flags: IFF_TUN - TUN device (no Ethernet headers) + * IFF_TAP - TAP device + * + * IFF_NO_PI - Do not provide packet information + * IFF_MULTI_QUEUE - Create a queue of multiqueue device + */ + ifr.ifr_flags = IFF_TAP | IFF_NO_PI | IFF_MULTI_QUEUE; + strcpy(ifr.ifr_name, dev); + + for (i = 0; i < queues; i++) { + if ((fd = open("/dev/net/tun", O_RDWR)) < 0) + goto err; + err = ioctl(fd, TUNSETIFF, (void *)&ifr); + if (err) { + close(fd); + goto err; + } + fds[i] = fd; + } + + return 0; + err: + for (--i; i >= 0; i--) + close(fds[i]); + return err; + } + + A new ioctl(TUNSETQUEUE) were introduced to enable or disable a queue. When + calling it with IFF_DETACH_QUEUE flag, the queue were disabled. And when + calling it with IFF_ATTACH_QUEUE flag, the queue were enabled. The queue were + enabled by default after it was created through TUNSETIFF. + + fd is the file descriptor (queue) that we want to enable or disable, when + enable is true we enable it, otherwise we disable it + + #include <linux/if.h> + #include <linux/if_tun.h> + + int tun_set_queue(int fd, int enable) + { + struct ifreq ifr; + + memset(&ifr, 0, sizeof(ifr)); + + if (enable) + ifr.ifr_flags = IFF_ATTACH_QUEUE; + else + ifr.ifr_flags = IFF_DETACH_QUEUE; + + return ioctl(fd, TUNSETQUEUE, (void *)&ifr); + } + Universal TUN/TAP device driver Frequently Asked Question. 1. What platforms are supported by TUN/TAP driver ? @@ -115,7 +195,7 @@ As mentioned above, main purpose of TUN/TAP driver is tunneling. It is used by VTun (http://vtun.sourceforge.net). Another interesting application using TUN/TAP is pipsecd -(http://perso.enst.fr/~beyssac/pipsec/), an userspace IPSec +(http://perso.enst.fr/~beyssac/pipsec/), a userspace IPSec implementation that can use complete kernel routing (unlike FreeS/WAN). 3. How does Virtual network device actually work ? @@ -138,7 +218,7 @@ This means that you have to read/write IP packets when you are using tun and ethernet frames when using tap. 5. What is the difference between BPF and TUN/TAP driver? -BFP is an advanced packet filter. It can be attached to existing +BPF is an advanced packet filter. It can be attached to existing network interface. It does not provide a virtual network interface. A TUN/TAP driver does provide a virtual network interface and it is possible to attach BPF to this interface. diff --git a/Documentation/networking/udplite.txt b/Documentation/networking/udplite.txt new file mode 100644 index 00000000000..d727a382910 --- /dev/null +++ b/Documentation/networking/udplite.txt @@ -0,0 +1,278 @@ + =========================================================================== + The UDP-Lite protocol (RFC 3828) + =========================================================================== + + + UDP-Lite is a Standards-Track IETF transport protocol whose characteristic + is a variable-length checksum. This has advantages for transport of multimedia + (video, VoIP) over wireless networks, as partly damaged packets can still be + fed into the codec instead of being discarded due to a failed checksum test. + + This file briefly describes the existing kernel support and the socket API. + For in-depth information, you can consult: + + o The UDP-Lite Homepage: + http://web.archive.org/web/*/http://www.erg.abdn.ac.uk/users/gerrit/udp-lite/ + From here you can also download some example application source code. + + o The UDP-Lite HOWTO on + http://web.archive.org/web/*/http://www.erg.abdn.ac.uk/users/gerrit/udp-lite/ + files/UDP-Lite-HOWTO.txt + + o The Wireshark UDP-Lite WiKi (with capture files): + http://wiki.wireshark.org/Lightweight_User_Datagram_Protocol + + o The Protocol Spec, RFC 3828, http://www.ietf.org/rfc/rfc3828.txt + + + I) APPLICATIONS + + Several applications have been ported successfully to UDP-Lite. Ethereal + (now called wireshark) has UDP-Litev4/v6 support by default. + Porting applications to UDP-Lite is straightforward: only socket level and + IPPROTO need to be changed; senders additionally set the checksum coverage + length (default = header length = 8). Details are in the next section. + + + II) PROGRAMMING API + + UDP-Lite provides a connectionless, unreliable datagram service and hence + uses the same socket type as UDP. In fact, porting from UDP to UDP-Lite is + very easy: simply add `IPPROTO_UDPLITE' as the last argument of the socket(2) + call so that the statement looks like: + + s = socket(PF_INET, SOCK_DGRAM, IPPROTO_UDPLITE); + + or, respectively, + + s = socket(PF_INET6, SOCK_DGRAM, IPPROTO_UDPLITE); + + With just the above change you are able to run UDP-Lite services or connect + to UDP-Lite servers. The kernel will assume that you are not interested in + using partial checksum coverage and so emulate UDP mode (full coverage). + + To make use of the partial checksum coverage facilities requires setting a + single socket option, which takes an integer specifying the coverage length: + + * Sender checksum coverage: UDPLITE_SEND_CSCOV + + For example, + + int val = 20; + setsockopt(s, SOL_UDPLITE, UDPLITE_SEND_CSCOV, &val, sizeof(int)); + + sets the checksum coverage length to 20 bytes (12b data + 8b header). + Of each packet only the first 20 bytes (plus the pseudo-header) will be + checksummed. This is useful for RTP applications which have a 12-byte + base header. + + + * Receiver checksum coverage: UDPLITE_RECV_CSCOV + + This option is the receiver-side analogue. It is truly optional, i.e. not + required to enable traffic with partial checksum coverage. Its function is + that of a traffic filter: when enabled, it instructs the kernel to drop + all packets which have a coverage _less_ than this value. For example, if + RTP and UDP headers are to be protected, a receiver can enforce that only + packets with a minimum coverage of 20 are admitted: + + int min = 20; + setsockopt(s, SOL_UDPLITE, UDPLITE_RECV_CSCOV, &min, sizeof(int)); + + The calls to getsockopt(2) are analogous. Being an extension and not a stand- + alone protocol, all socket options known from UDP can be used in exactly the + same manner as before, e.g. UDP_CORK or UDP_ENCAP. + + A detailed discussion of UDP-Lite checksum coverage options is in section IV. + + + III) HEADER FILES + + The socket API requires support through header files in /usr/include: + + * /usr/include/netinet/in.h + to define IPPROTO_UDPLITE + + * /usr/include/netinet/udplite.h + for UDP-Lite header fields and protocol constants + + For testing purposes, the following can serve as a `mini' header file: + + #define IPPROTO_UDPLITE 136 + #define SOL_UDPLITE 136 + #define UDPLITE_SEND_CSCOV 10 + #define UDPLITE_RECV_CSCOV 11 + + Ready-made header files for various distros are in the UDP-Lite tarball. + + + IV) KERNEL BEHAVIOUR WITH REGARD TO THE VARIOUS SOCKET OPTIONS + + To enable debugging messages, the log level need to be set to 8, as most + messages use the KERN_DEBUG level (7). + + 1) Sender Socket Options + + If the sender specifies a value of 0 as coverage length, the module + assumes full coverage, transmits a packet with coverage length of 0 + and according checksum. If the sender specifies a coverage < 8 and + different from 0, the kernel assumes 8 as default value. Finally, + if the specified coverage length exceeds the packet length, the packet + length is used instead as coverage length. + + 2) Receiver Socket Options + + The receiver specifies the minimum value of the coverage length it + is willing to accept. A value of 0 here indicates that the receiver + always wants the whole of the packet covered. In this case, all + partially covered packets are dropped and an error is logged. + + It is not possible to specify illegal values (<0 and <8); in these + cases the default of 8 is assumed. + + All packets arriving with a coverage value less than the specified + threshold are discarded, these events are also logged. + + 3) Disabling the Checksum Computation + + On both sender and receiver, checksumming will always be performed + and cannot be disabled using SO_NO_CHECK. Thus + + setsockopt(sockfd, SOL_SOCKET, SO_NO_CHECK, ... ); + + will always will be ignored, while the value of + + getsockopt(sockfd, SOL_SOCKET, SO_NO_CHECK, &value, ...); + + is meaningless (as in TCP). Packets with a zero checksum field are + illegal (cf. RFC 3828, sec. 3.1) and will be silently discarded. + + 4) Fragmentation + + The checksum computation respects both buffersize and MTU. The size + of UDP-Lite packets is determined by the size of the send buffer. The + minimum size of the send buffer is 2048 (defined as SOCK_MIN_SNDBUF + in include/net/sock.h), the default value is configurable as + net.core.wmem_default or via setting the SO_SNDBUF socket(7) + option. The maximum upper bound for the send buffer is determined + by net.core.wmem_max. + + Given a payload size larger than the send buffer size, UDP-Lite will + split the payload into several individual packets, filling up the + send buffer size in each case. + + The precise value also depends on the interface MTU. The interface MTU, + in turn, may trigger IP fragmentation. In this case, the generated + UDP-Lite packet is split into several IP packets, of which only the + first one contains the L4 header. + + The send buffer size has implications on the checksum coverage length. + Consider the following example: + + Payload: 1536 bytes Send Buffer: 1024 bytes + MTU: 1500 bytes Coverage Length: 856 bytes + + UDP-Lite will ship the 1536 bytes in two separate packets: + + Packet 1: 1024 payload + 8 byte header + 20 byte IP header = 1052 bytes + Packet 2: 512 payload + 8 byte header + 20 byte IP header = 540 bytes + + The coverage packet covers the UDP-Lite header and 848 bytes of the + payload in the first packet, the second packet is fully covered. Note + that for the second packet, the coverage length exceeds the packet + length. The kernel always re-adjusts the coverage length to the packet + length in such cases. + + As an example of what happens when one UDP-Lite packet is split into + several tiny fragments, consider the following example. + + Payload: 1024 bytes Send buffer size: 1024 bytes + MTU: 300 bytes Coverage length: 575 bytes + + +-+-----------+--------------+--------------+--------------+ + |8| 272 | 280 | 280 | 280 | + +-+-----------+--------------+--------------+--------------+ + 280 560 840 1032 + ^ + *****checksum coverage************* + + The UDP-Lite module generates one 1032 byte packet (1024 + 8 byte + header). According to the interface MTU, these are split into 4 IP + packets (280 byte IP payload + 20 byte IP header). The kernel module + sums the contents of the entire first two packets, plus 15 bytes of + the last packet before releasing the fragments to the IP module. + + To see the analogous case for IPv6 fragmentation, consider a link + MTU of 1280 bytes and a write buffer of 3356 bytes. If the checksum + coverage is less than 1232 bytes (MTU minus IPv6/fragment header + lengths), only the first fragment needs to be considered. When using + larger checksum coverage lengths, each eligible fragment needs to be + checksummed. Suppose we have a checksum coverage of 3062. The buffer + of 3356 bytes will be split into the following fragments: + + Fragment 1: 1280 bytes carrying 1232 bytes of UDP-Lite data + Fragment 2: 1280 bytes carrying 1232 bytes of UDP-Lite data + Fragment 3: 948 bytes carrying 900 bytes of UDP-Lite data + + The first two fragments have to be checksummed in full, of the last + fragment only 598 (= 3062 - 2*1232) bytes are checksummed. + + While it is important that such cases are dealt with correctly, they + are (annoyingly) rare: UDP-Lite is designed for optimising multimedia + performance over wireless (or generally noisy) links and thus smaller + coverage lengths are likely to be expected. + + + V) UDP-LITE RUNTIME STATISTICS AND THEIR MEANING + + Exceptional and error conditions are logged to syslog at the KERN_DEBUG + level. Live statistics about UDP-Lite are available in /proc/net/snmp + and can (with newer versions of netstat) be viewed using + + netstat -svu + + This displays UDP-Lite statistics variables, whose meaning is as follows. + + InDatagrams: The total number of datagrams delivered to users. + + NoPorts: Number of packets received to an unknown port. + These cases are counted separately (not as InErrors). + + InErrors: Number of erroneous UDP-Lite packets. Errors include: + * internal socket queue receive errors + * packet too short (less than 8 bytes or stated + coverage length exceeds received length) + * xfrm4_policy_check() returned with error + * application has specified larger min. coverage + length than that of incoming packet + * checksum coverage violated + * bad checksum + + OutDatagrams: Total number of sent datagrams. + + These statistics derive from the UDP MIB (RFC 2013). + + + VI) IPTABLES + + There is packet match support for UDP-Lite as well as support for the LOG target. + If you copy and paste the following line into /etc/protocols, + + udplite 136 UDP-Lite # UDP-Lite [RFC 3828] + + then + iptables -A INPUT -p udplite -j LOG + + will produce logging output to syslog. Dropping and rejecting packets also works. + + + VII) MAINTAINER ADDRESS + + The UDP-Lite patch was developed at + University of Aberdeen + Electronics Research Group + Department of Engineering + Fraser Noble Building + Aberdeen AB24 3UE; UK + The current maintainer is Gerrit Renker, <gerrit@erg.abdn.ac.uk>. Initial + code was developed by William Stanislaus, <william@erg.abdn.ac.uk>. diff --git a/Documentation/networking/vortex.txt b/Documentation/networking/vortex.txt index 80e1cb19609..97282da82b7 100644 --- a/Documentation/networking/vortex.txt +++ b/Documentation/networking/vortex.txt @@ -1,5 +1,5 @@ Documentation/networking/vortex.txt -Andrew Morton <andrewm@uow.edu.au> +Andrew Morton 30 April 2000 @@ -11,7 +11,7 @@ The driver was written by Donald Becker <becker@scyld.com> Don is no longer the prime maintainer of this version of the driver. Please report problems to one or more of: - Andrew Morton <andrewm@uow.edu.au> + Andrew Morton Netdev mailing list <netdev@vger.kernel.org> Linux kernel mailing list <linux-kernel@vger.kernel.org> @@ -24,43 +24,51 @@ Since kernel 2.3.99-pre6, this driver incorporates the support for the This driver supports the following hardware: - 3c590 Vortex 10Mbps - 3c592 EISA 10mbps Demon/Vortex - 3c597 EISA Fast Demon/Vortex - 3c595 Vortex 100baseTx - 3c595 Vortex 100baseT4 - 3c595 Vortex 100base-MII - 3Com Vortex - 3c900 Boomerang 10baseT - 3c900 Boomerang 10Mbps Combo - 3c900 Cyclone 10Mbps TPO - 3c900B Cyclone 10Mbps T - 3c900 Cyclone 10Mbps Combo - 3c900 Cyclone 10Mbps TPC - 3c900B-FL Cyclone 10base-FL - 3c905 Boomerang 100baseTx - 3c905 Boomerang 100baseT4 - 3c905B Cyclone 100baseTx - 3c905B Cyclone 10/100/BNC - 3c905B-FX Cyclone 100baseFx - 3c905C Tornado - 3c980 Cyclone - 3cSOHO100-TX Hurricane - 3c555 Laptop Hurricane - 3c575 Boomerang CardBus - 3CCFE575 Cyclone CardBus - 3CCFE575CT Cyclone CardBus - 3CCFE656 Cyclone CardBus - 3CCFEM656 Cyclone CardBus - 3c450 Cyclone/unknown - + 3c590 Vortex 10Mbps + 3c592 EISA 10Mbps Demon/Vortex + 3c597 EISA Fast Demon/Vortex + 3c595 Vortex 100baseTx + 3c595 Vortex 100baseT4 + 3c595 Vortex 100base-MII + 3c900 Boomerang 10baseT + 3c900 Boomerang 10Mbps Combo + 3c900 Cyclone 10Mbps TPO + 3c900 Cyclone 10Mbps Combo + 3c900 Cyclone 10Mbps TPC + 3c900B-FL Cyclone 10base-FL + 3c905 Boomerang 100baseTx + 3c905 Boomerang 100baseT4 + 3c905B Cyclone 100baseTx + 3c905B Cyclone 10/100/BNC + 3c905B-FX Cyclone 100baseFx + 3c905C Tornado + 3c920B-EMB-WNM (ATI Radeon 9100 IGP) + 3c980 Cyclone + 3c980C Python-T + 3cSOHO100-TX Hurricane + 3c555 Laptop Hurricane + 3c556 Laptop Tornado + 3c556B Laptop Hurricane + 3c575 [Megahertz] 10/100 LAN CardBus + 3c575 Boomerang CardBus + 3CCFE575BT Cyclone CardBus + 3CCFE575CT Tornado CardBus + 3CCFE656 Cyclone CardBus + 3CCFEM656B Cyclone+Winmodem CardBus + 3CXFEM656C Tornado+Winmodem CardBus + 3c450 HomePNA Tornado + 3c920 Tornado + 3c982 Hydra Dual Port A + 3c982 Hydra Dual Port B + 3c905B-T4 + 3c920B-EMB-WNM Tornado Module parameters ================= There are several parameters which may be provided to the driver when -its module is loaded. These are usually placed in /etc/modprobe.conf -(/etc/modules.conf in 2.4). Example: +its module is loaded. These are usually placed in /etc/modprobe.d/*.conf +configuration files. Example: options 3c59x debug=3 rx_copybreak=300 @@ -170,7 +178,7 @@ max_interrupt_work=N The driver's interrupt service routine can handle many receive and transmit packets in a single invocation. It does this in a loop. - The value of max_interrupt_work governs how mnay times the interrupt + The value of max_interrupt_work governs how many times the interrupt service routine will loop. The default value is 32 loops. If this is exceeded the interrupt service routine gives up and generates a warning message "eth0: Too much work in interrupt". @@ -274,48 +282,38 @@ Details of the device driver implementation are at the top of the source file. Additional documentation is available at Don Becker's Linux Drivers site: - http://www.scyld.com/network/vortex.html + http://www.scyld.com/vortex.html Donald Becker's driver development site: - http://www.scyld.com/network + http://www.scyld.com/network.html Donald's vortex-diag program is useful for inspecting the NIC's state: - http://www.scyld.com/diag/#pci-diags + http://www.scyld.com/ethercard_diag.html Donald's mii-diag program may be used for inspecting and manipulating the NIC's Media Independent Interface subsystem: - http://www.scyld.com/diag/#mii-diag + http://www.scyld.com/ethercard_diag.html#mii-diag Donald's wake-on-LAN page: - http://www.scyld.com/expert/wake-on-lan.html - -3Com's documentation for many NICs, including the ones supported by -this driver is available at - - http://support.3com.com/partners/developer/developer_form.html + http://www.scyld.com/wakeonlan.html 3Com's DOS-based application for setting up the NICs EEPROMs: ftp://ftp.3com.com/pub/nic/3c90x/3c90xx2.exe -Driver updates and a detailed changelog for the modifications which -were made for the 2.3/2,4 series kernel is available at - - http://www.uow.edu.au/~andrewm/linux/#3c59x-2.3 - Autonegotiation notes --------------------- The driver uses a one-minute heartbeat for adapting to changes in - the external LAN environment. This means that when, for example, a - machine is unplugged from a hubbed 10baseT LAN plugged into a - switched 100baseT LAN, the throughput will be quite dreadful for up - to sixty seconds. Be patient. + the external LAN environment if link is up and 5 seconds if link is down. + This means that when, for example, a machine is unplugged from a hubbed + 10baseT LAN plugged into a switched 100baseT LAN, the throughput + will be quite dreadful for up to sixty seconds. Be patient. Cisco interoperability note from Walter Wong <wcw+@CMU.EDU>: @@ -356,13 +354,13 @@ steps you should take: Eliminate some variables: try different cards, different computers, different cables, different ports on the switch/hub, - different versions of the kernel or ofthe driver, etc. + different versions of the kernel or of the driver, etc. - OK, it's a driver problem. You need to generate a report. Typically this is an email to the - maintainer and/or linux-net@vger.kernel.org. The maintainer's - email address will be inthe driver source or in the MAINTAINERS file. + maintainer and/or netdev@vger.kernel.org. The maintainer's + email address will be in the driver source or in the MAINTAINERS file. - The contents of your report will vary a lot depending upon the problem. If it's a kernel crash then you should refer to the @@ -427,15 +425,15 @@ steps you should take: 1) Increase the debug level. Usually this is done via: a) modprobe driver debug=7 - b) In /etc/modprobe.conf (or /etc/modules.conf for 2.4): + b) In /etc/modprobe.d/driver.conf: options driver debug=7 2) Recreate the problem with the higher debug level, send all logs to the maintainer. 3) Download you card's diagnostic tool from Donald - Backer's website http://www.scyld.com/diag. Download - mii-diag.c as well. Build these. + Becker's website <http://www.scyld.com/ethercard_diag.html>. + Download mii-diag.c as well. Build these. a) Run 'vortex-diag -aaee' and 'mii-diag -v' when the card is working correctly. Save the output. @@ -443,8 +441,8 @@ steps you should take: b) Run the above commands when the card is malfunctioning. Send both sets of output. -Finally, please be patient and be prepared to do some work. You may end up working on -this problem for a week or more as the maintainer asks more questions, asks for more -tests, asks for patches to be applied, etc. At the end of it all, the problem may even -remain unresolved. - +Finally, please be patient and be prepared to do some work. You may +end up working on this problem for a week or more as the maintainer +asks more questions, asks for more tests, asks for patches to be +applied, etc. At the end of it all, the problem may even remain +unresolved. diff --git a/Documentation/networking/vxge.txt b/Documentation/networking/vxge.txt new file mode 100644 index 00000000000..bb76c667a47 --- /dev/null +++ b/Documentation/networking/vxge.txt @@ -0,0 +1,93 @@ +Neterion's (Formerly S2io) X3100 Series 10GbE PCIe Server Adapter Linux driver +============================================================================== + +Contents +-------- + +1) Introduction +2) Features supported +3) Configurable driver parameters +4) Troubleshooting + +1) Introduction: +---------------- +This Linux driver supports all Neterion's X3100 series 10 GbE PCIe I/O +Virtualized Server adapters. +The X3100 series supports four modes of operation, configurable via +firmware - + Single function mode + Multi function mode + SRIOV mode + MRIOV mode +The functions share a 10GbE link and the pci-e bus, but hardly anything else +inside the ASIC. Features like independent hw reset, statistics, bandwidth/ +priority allocation and guarantees, GRO, TSO, interrupt moderation etc are +supported independently on each function. + +(See below for a complete list of features supported for both IPv4 and IPv6) + +2) Features supported: +---------------------- + +i) Single function mode (up to 17 queues) + +ii) Multi function mode (up to 17 functions) + +iii) PCI-SIG's I/O Virtualization + - Single Root mode: v1.0 (up to 17 functions) + - Multi-Root mode: v1.0 (up to 17 functions) + +iv) Jumbo frames + X3100 Series supports MTU up to 9600 bytes, modifiable using + ifconfig command. + +v) Offloads supported: (Enabled by default) + Checksum offload (TCP/UDP/IP) on transmit and receive paths + TCP Segmentation Offload (TSO) on transmit path + Generic Receive Offload (GRO) on receive path + +vi) MSI-X: (Enabled by default) + Resulting in noticeable performance improvement (up to 7% on certain + platforms). + +vii) NAPI: (Enabled by default) + For better Rx interrupt moderation. + +viii)RTH (Receive Traffic Hash): (Enabled by default) + Receive side steering for better scaling. + +ix) Statistics + Comprehensive MAC-level and software statistics displayed using + "ethtool -S" option. + +x) Multiple hardware queues: (Enabled by default) + Up to 17 hardware based transmit and receive data channels, with + multiple steering options (transmit multiqueue enabled by default). + +3) Configurable driver parameters: +---------------------------------- + +i) max_config_dev + Specifies maximum device functions to be enabled. + Valid range: 1-8 + +ii) max_config_port + Specifies number of ports to be enabled. + Valid range: 1,2 + Default: 1 + +iii)max_config_vpath + Specifies maximum VPATH(s) configured for each device function. + Valid range: 1-17 + +iv) vlan_tag_strip + Enables/disables vlan tag stripping from all received tagged frames that + are not replicated at the internal L2 switch. + Valid range: 0,1 (disabled, enabled respectively) + Default: 1 + +v) addr_learn_en + Enable learning the mac address of the guest OS interface in + virtualization environment. + Valid range: 0,1 (disabled, enabled respectively) + Default: 0 diff --git a/Documentation/networking/vxlan.txt b/Documentation/networking/vxlan.txt new file mode 100644 index 00000000000..6d993510f09 --- /dev/null +++ b/Documentation/networking/vxlan.txt @@ -0,0 +1,47 @@ +Virtual eXtensible Local Area Networking documentation +====================================================== + +The VXLAN protocol is a tunnelling protocol that is designed to +solve the problem of limited number of available VLAN's (4096). +With VXLAN identifier is expanded to 24 bits. + +It is a draft RFC standard, that is implemented by Cisco Nexus, +Vmware and Brocade. The protocol runs over UDP using a single +destination port (still not standardized by IANA). +This document describes the Linux kernel tunnel device, +there is also an implantation of VXLAN for Openvswitch. + +Unlike most tunnels, a VXLAN is a 1 to N network, not just point +to point. A VXLAN device can either dynamically learn the IP address +of the other end, in a manner similar to a learning bridge, or the +forwarding entries can be configured statically. + +The management of vxlan is done in a similar fashion to it's +too closest neighbors GRE and VLAN. Configuring VXLAN requires +the version of iproute2 that matches the kernel release +where VXLAN was first merged upstream. + +1. Create vxlan device + # ip li add vxlan0 type vxlan id 42 group 239.1.1.1 dev eth1 + +This creates a new device (vxlan0). The device uses the +the multicast group 239.1.1.1 over eth1 to handle packets where +no entry is in the forwarding table. + +2. Delete vxlan device + # ip link delete vxlan0 + +3. Show vxlan info + # ip -d link show vxlan0 + +It is possible to create, destroy and display the vxlan +forwarding table using the new bridge command. + +1. Create forwarding table entry + # bridge fdb add to 00:17:42:8a:b4:05 dst 192.19.0.2 dev vxlan0 + +2. Delete forwarding table entry + # bridge fdb delete 00:17:42:8a:b4:05 dev vxlan0 + +3. Show forwarding table + # bridge fdb show dev vxlan0 diff --git a/Documentation/networking/wan-router.txt b/Documentation/networking/wan-router.txt deleted file mode 100644 index c96897aa08b..00000000000 --- a/Documentation/networking/wan-router.txt +++ /dev/null @@ -1,622 +0,0 @@ ------------------------------------------------------------------------------- -Linux WAN Router Utilities Package ------------------------------------------------------------------------------- -Version 2.2.1 -Mar 28, 2001 -Author: Nenad Corbic <ncorbic@sangoma.com> -Copyright (c) 1995-2001 Sangoma Technologies Inc. ------------------------------------------------------------------------------- - -INTRODUCTION - -Wide Area Networks (WANs) are used to interconnect Local Area Networks (LANs) -and/or stand-alone hosts over vast distances with data transfer rates -significantly higher than those achievable with commonly used dial-up -connections. - -Usually an external device called `WAN router' sitting on your local network -or connected to your machine's serial port provides physical connection to -WAN. Although router's job may be as simple as taking your local network -traffic, converting it to WAN format and piping it through the WAN link, these -devices are notoriously expensive, with prices as much as 2 - 5 times higher -then the price of a typical PC box. - -Alternatively, considering robustness and multitasking capabilities of Linux, -an internal router can be built (most routers use some sort of stripped down -Unix-like operating system anyway). With a number of relatively inexpensive WAN -interface cards available on the market, a perfectly usable router can be -built for less than half a price of an external router. Yet a Linux box -acting as a router can still be used for other purposes, such as fire-walling, -running FTP, WWW or DNS server, etc. - -This kernel module introduces the notion of a WAN Link Driver (WLD) to Linux -operating system and provides generic hardware-independent services for such -drivers. Why can existing Linux network device interface not be used for -this purpose? Well, it can. However, there are a few key differences between -a typical network interface (e.g. Ethernet) and a WAN link. - -Many WAN protocols, such as X.25 and frame relay, allow for multiple logical -connections (known as `virtual circuits' in X.25 terminology) over a single -physical link. Each such virtual circuit may (and almost always does) lead -to a different geographical location and, therefore, different network. As a -result, it is the virtual circuit, not the physical link, that represents a -route and, therefore, a network interface in Linux terms. - -To further complicate things, virtual circuits are usually volatile in nature -(excluding so called `permanent' virtual circuits or PVCs). With almost no -time required to set up and tear down a virtual circuit, it is highly desirable -to implement on-demand connections in order to minimize network charges. So -unlike a typical network driver, the WAN driver must be able to handle multiple -network interfaces and cope as multiple virtual circuits come into existence -and go away dynamically. - -Last, but not least, WAN configuration is much more complex than that of say -Ethernet and may well amount to several dozens of parameters. Some of them -are "link-wide" while others are virtual circuit-specific. The same holds -true for WAN statistics which is by far more extensive and extremely useful -when troubleshooting WAN connections. Extending the ifconfig utility to suit -these needs may be possible, but does not seem quite reasonable. Therefore, a -WAN configuration utility and corresponding application programmer's interface -is needed for this purpose. - -Most of these problems are taken care of by this module. Its goal is to -provide a user with more-or-less standard look and feel for all WAN devices and -assist a WAN device driver writer by providing common services, such as: - - o User-level interface via /proc file system - o Centralized configuration - o Device management (setup, shutdown, etc.) - o Network interface management (dynamic creation/destruction) - o Protocol encapsulation/decapsulation - -To ba able to use the Linux WAN Router you will also need a WAN Tools package -available from - - ftp.sangoma.com/pub/linux/current_wanpipe/wanpipe-X.Y.Z.tgz - -where vX.Y.Z represent the wanpipe version number. - -For technical questions and/or comments please e-mail to ncorbic@sangoma.com. -For general inquiries please contact Sangoma Technologies Inc. by - - Hotline: 1-800-388-2475 (USA and Canada, toll free) - Phone: (905) 474-1990 ext: 106 - Fax: (905) 474-9223 - E-mail: dm@sangoma.com (David Mandelstam) - WWW: http://www.sangoma.com - - -INSTALLATION - -Please read the WanpipeForLinux.pdf manual on how to -install the WANPIPE tools and drivers properly. - - -After installing wanpipe package: /usr/local/wanrouter/doc. -On the ftp.sangoma.com : /linux/current_wanpipe/doc - - -COPYRIGHT AND LICENSING INFORMATION - -This program is free software; you can redistribute it and/or modify it under -the terms of the GNU General Public License as published by the Free Software -Foundation; either version 2, or (at your option) any later version. - -This program is distributed in the hope that it will be useful, but WITHOUT -ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS -FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. - -You should have received a copy of the GNU General Public License along with -this program; if not, write to the Free Software Foundation, Inc., 675 Mass -Ave, Cambridge, MA 02139, USA. - - - -ACKNOWLEDGEMENTS - -This product is based on the WANPIPE(tm) Multiprotocol WAN Router developed -by Sangoma Technologies Inc. for Linux 2.0.x and 2.2.x. Success of the WANPIPE -together with the next major release of Linux kernel in summer 1996 commanded -adequate changes to the WANPIPE code to take full advantage of new Linux -features. - -Instead of continuing developing proprietary interface tied to Sangoma WAN -cards, we decided to separate all hardware-independent code into a separate -module and defined two levels of interfaces - one for user-level applications -and another for kernel-level WAN drivers. WANPIPE is now implemented as a -WAN driver compliant with the WAN Link Driver interface. Also a general -purpose WAN configuration utility and a set of shell scripts was developed to -support WAN router at the user level. - -Many useful ideas concerning hardware-independent interface implementation -were given by Mike McLagan <mike.mclagan@linux.org> and his implementation -of the Frame Relay router and drivers for Sangoma cards (dlci/sdla). - -With the new implementation of the APIs being incorporated into the WANPIPE, -a special thank goes to Alan Cox in providing insight into BSD sockets. - -Special thanks to all the WANPIPE users who performed field-testing, reported -bugs and made valuable comments and suggestions that help us to improve this -product. - - - -NEW IN THIS RELEASE - - o Updated the WANCFG utility - Calls the pppconfig to configure the PPPD - for async connections. - - o Added the PPPCONFIG utility - Used to configure the PPPD dameon for the - WANPIPE Async PPP and standard serial port. - The wancfg calls the pppconfig to configure - the pppd. - - o Fixed the PCI autodetect feature. - The SLOT 0 was used as an autodetect option - however, some high end PC's slot numbers start - from 0. - - o This release has been tested with the new backupd - daemon release. - - -PRODUCT COMPONENTS AND RELATED FILES - -/etc: (or user defined) - wanpipe1.conf default router configuration file - -/lib/modules/X.Y.Z/misc: - wanrouter.o router kernel loadable module - af_wanpipe.o wanpipe api socket module - -/lib/modules/X.Y.Z/net: - sdladrv.o Sangoma SDLA support module - wanpipe.o Sangoma WANPIPE(tm) driver module - -/proc/net/wanrouter - Config reads current router configuration - Status reads current router status - {name} reads WAN driver statistics - -/usr/sbin: - wanrouter wanrouter start-up script - wanconfig wanrouter configuration utility - sdladump WANPIPE adapter memory dump utility - fpipemon Monitor for Frame Relay - cpipemon Monitor for Cisco HDLC - ppipemon Monitor for PPP - xpipemon Monitor for X25 - wpkbdmon WANPIPE keyboard led monitor/debugger - -/usr/local/wanrouter: - README this file - COPYING GNU General Public License - Setup installation script - Filelist distribution definition file - wanrouter.rc meta-configuration file - (used by the Setup and wanrouter script) - -/usr/local/wanrouter/doc: - wanpipeForLinux.pdf WAN Router User's Manual - -/usr/local/wanrouter/patches: - wanrouter-v2213.gz patch for Linux kernels 2.2.11 up to 2.2.13. - wanrouter-v2214.gz patch for Linux kernel 2.2.14. - wanrouter-v2215.gz patch for Linux kernels 2.2.15 to 2.2.17. - wanrouter-v2218.gz patch for Linux kernels 2.2.18 and up. - wanrouter-v240.gz patch for Linux kernel 2.4.0. - wanrouter-v242.gz patch for Linux kernel 2.4.2 and up. - wanrouter-v2034.gz patch for Linux kernel 2.0.34 - wanrouter-v2036.gz patch for Linux kernel 2.0.36 and up. - -/usr/local/wanrouter/patches/kdrivers: - Sources of the latest WANPIPE device drivers. - These are used to UPGRADE the linux kernel to the newest - version if the kernel source has already been pathced with - WANPIPE drivers. - -/usr/local/wanrouter/samples: - interface sample interface configuration file - wanpipe1.cpri CHDLC primary port - wanpipe2.csec CHDLC secondary port - wanpipe1.fr Frame Relay protocol - wanpipe1.ppp PPP protocol ) - wanpipe1.asy CHDLC ASYNC protocol - wanpipe1.x25 X25 protocol - wanpipe1.stty Sync TTY driver (Used by Kernel PPPD daemon) - wanpipe1.atty Async TTY driver (Used by Kernel PPPD daemon) - wanrouter.rc sample meta-configuration file - -/usr/local/wanrouter/util: - * wan-tools utilities source code - -/usr/local/wanrouter/api/x25: - * x25 api sample programs. -/usr/local/wanrouter/api/chdlc: - * chdlc api sample programs. -/usr/local/wanrouter/api/fr: - * fr api sample programs. -/usr/local/wanrouter/config/wancfg: - wancfg WANPIPE GUI configuration program. - Creates wanpipe#.conf files. -/usr/local/wanrouter/config/cfgft1: - cfgft1 GUI CSU/DSU configuration program. - -/usr/include/linux: - wanrouter.h router API definitions - wanpipe.h WANPIPE API definitions - sdladrv.h SDLA support module API definitions - sdlasfm.h SDLA firmware module definitions - if_wanpipe.h WANPIPE Socket definitions - if_wanpipe_common.h WANPIPE Socket/Driver common definitions. - sdlapci.h WANPIPE PCI definitions - - -/usr/src/linux/net/wanrouter: - * wanrouter source code - -/var/log: - wanrouter wanrouter start-up log (created by the Setup script) - -/var/lock: (or /var/lock/subsys for RedHat) - wanrouter wanrouter lock file (created by the Setup script) - -/usr/local/wanrouter/firmware: - fr514.sfm Frame relay firmware for Sangoma S508/S514 card - cdual514.sfm Dual Port Cisco HDLC firmware for Sangoma S508/S514 card - ppp514.sfm PPP Firmware for Sangoma S508 and S514 cards - x25_508.sfm X25 Firmware for Sangoma S508 card. - - -REVISION HISTORY - -1.0.0 December 31, 1996 Initial version - -1.0.1 January 30, 1997 Status and statistics can be read via /proc - filesystem entries. - -1.0.2 April 30, 1997 Added UDP management via monitors. - -1.0.3 June 3, 1997 UDP management for multiple boards using Frame - Relay and PPP - Enabled continuous transmission of Configure - Request Packet for PPP (for 508 only) - Connection Timeout for PPP changed from 900 to 0 - Flow Control Problem fixed for Frame Relay - -1.0.4 July 10, 1997 S508/FT1 monitoring capability in fpipemon and - ppipemon utilities. - Configurable TTL for UDP packets. - Multicast and Broadcast IP source addresses are - silently discarded. - -1.0.5 July 28, 1997 Configurable T391,T392,N391,N392,N393 for Frame - Relay in router.conf. - Configurable Memory Address through router.conf - for Frame Relay, PPP and X.25. (commenting this - out enables auto-detection). - Fixed freeing up received buffers using kfree() - for Frame Relay and X.25. - Protect sdla_peek() by calling save_flags(), - cli() and restore_flags(). - Changed number of Trace elements from 32 to 20 - Added DLCI specific data monitoring in FPIPEMON. -2.0.0 Nov 07, 1997 Implemented protection of RACE conditions by - critical flags for FRAME RELAY and PPP. - DLCI List interrupt mode implemented. - IPX support in FRAME RELAY and PPP. - IPX Server Support (MARS) - More driver specific stats included in FPIPEMON - and PIPEMON. - -2.0.1 Nov 28, 1997 Bug Fixes for version 2.0.0. - Protection of "enable_irq()" while - "disable_irq()" has been enabled from any other - routine (for Frame Relay, PPP and X25). - Added additional Stats for Fpipemon and Ppipemon - Improved Load Sharing for multiple boards - -2.0.2 Dec 09, 1997 Support for PAP and CHAP for ppp has been - implemented. - -2.0.3 Aug 15, 1998 New release supporting Cisco HDLC, CIR for Frame - relay, Dynamic IP assignment for PPP and Inverse - Arp support for Frame-relay. Man Pages are - included for better support and a new utility - for configuring FT1 cards. - -2.0.4 Dec 09, 1998 Dual Port support for Cisco HDLC. - Support for HDLC (LAPB) API. - Supports BiSync Streaming code for S502E - and S503 cards. - Support for Streaming HDLC API. - Provides a BSD socket interface for - creating applications using BiSync - streaming. - -2.0.5 Aug 04, 1999 CHDLC initializatin bug fix. - PPP interrupt driven driver: - Fix to the PPP line hangup problem. - New PPP firmware - Added comments to the startup SYSTEM ERROR messages - Xpipemon debugging application for the X25 protocol - New USER_MANUAL.txt - Fixed the odd boundary 4byte writes to the board. - BiSync Streaming code has been taken out. - Available as a patch. - Streaming HDLC API has been taken out. - Available as a patch. - -2.0.6 Aug 17, 1999 Increased debugging in statup scripts - Fixed insallation bugs from 2.0.5 - Kernel patch works for both 2.2.10 and 2.2.11 kernels. - There is no functional difference between the two packages - -2.0.7 Aug 26, 1999 o Merged X25API code into WANPIPE. - o Fixed a memory leak for X25API - o Updated the X25API code for 2.2.X kernels. - o Improved NEM handling. - -2.1.0 Oct 25, 1999 o New code for S514 PCI Card - o New CHDLC and Frame Relay drivers - o PPP and X25 are not supported in this release - -2.1.1 Nov 30, 1999 o PPP support for S514 PCI Cards - -2.1.3 Apr 06, 2000 o Socket based x25api - o Socket based chdlc api - o Socket based fr api - o Dual Port Receive only CHDLC support. - o Asynchronous CHDLC support (Secondary Port) - o cfgft1 GUI csu/dsu configurator - o wancfg GUI configuration file - configurator. - o Architectual directory changes. - -beta-2.1.4 Jul 2000 o Dynamic interface configuration: - Network interfaces reflect the state - of protocol layer. If the protocol becomes - disconnected, driver will bring down - the interface. Once the protocol reconnects - the interface will be brought up. - - Note: This option is turned off by default. - - o Dynamic wanrouter setup using 'wanconfig': - wanconfig utility can be used to - shutdown,restart,start or reconfigure - a virtual circuit dynamically. - - Frame Relay: Each DLCI can be: - created,stopped,restarted and reconfigured - dynamically using wanconfig. - - ex: wanconfig card wanpipe1 dev wp1_fr16 up - - o Wanrouter startup via command line arguments: - wanconfig also supports wanrouter startup via command line - arguments. Thus, there is no need to create a wanpipe#.conf - configuration file. - - o Socket based x25api update/bug fixes. - Added support for LCN numbers greater than 255. - Option to pass up modem messages. - Provided a PCI IRQ check, so a single S514 - card is guaranteed to have a non-sharing interrupt. - - o Fixes to the wancfg utility. - o New FT1 debugging support via *pipemon utilities. - o Frame Relay ARP support Enabled. - -beta3-2.1.4 Jul 2000 o X25 M_BIT Problem fix. - o Added the Multi-Port PPP - Updated utilites for the Multi-Port PPP. - -2.1.4 Aut 2000 - o In X25API: - Maximum packet an application can send - to the driver has been extended to 4096 bytes. - - Fixed the x25 startup bug. Enable - communications only after all interfaces - come up. HIGH SVC/PVC is used to calculate - the number of channels. - Enable protocol only after all interfaces - are enabled. - - o Added an extra state to the FT1 config, kernel module. - o Updated the pipemon debuggers. - - o Blocked the Multi-Port PPP from running on kernels - 2.2.16 or greater, due to syncppp kernel module - change. - -beta1-2.1.5 Nov 15 2000 - o Fixed the MulitPort PPP Support for kernels 2.2.16 and above. - 2.2.X kernels only - - o Secured the driver UDP debugging calls - - All illegal netowrk debugging calls are reported to - the log. - - Defined a set of allowed commands, all other denied. - - o Cpipemon - - Added set FT1 commands to the cpipemon. Thus CSU/DSU - configuraiton can be performed using cpipemon. - All systems that cannot run cfgft1 GUI utility should - use cpipemon to configure the on board CSU/DSU. - - - o Keyboard Led Monitor/Debugger - - A new utilty /usr/sbin/wpkbdmon uses keyboard leds - to convey operatinal statistic information of the - Sangoma WANPIPE cards. - NUM_LOCK = Line State (On=connected, Off=disconnected) - CAPS_LOCK = Tx data (On=transmitting, Off=no tx data) - SCROLL_LOCK = Rx data (On=receiving, Off=no rx data - - o Hardware probe on module load and dynamic device allocation - - During WANPIPE module load, all Sangoma cards are probed - and found information is printed in the /var/log/messages. - - If no cards are found, the module load fails. - - Appropriate number of devices are dynamically loaded - based on the number of Sangoma cards found. - - Note: The kernel configuraiton option - CONFIG_WANPIPE_CARDS has been taken out. - - o Fixed the Frame Relay and Chdlc network interfaces so they are - compatible with libpcap libraries. Meaning, tcpdump, snort, - ethereal, and all other packet sniffers and debuggers work on - all WANPIPE netowrk interfaces. - - Set the network interface encoding type to ARPHRD_PPP. - This tell the sniffers that data obtained from the - network interface is in pure IP format. - Fix for 2.2.X kernels only. - - o True interface encoding option for Frame Relay and CHDLC - - The above fix sets the network interface encoding - type to ARPHRD_PPP, however some customers use - the encoding interface type to determine the - protocol running. Therefore, the TURE ENCODING - option will set the interface type back to the - original value. - - NOTE: If this option is used with Frame Relay and CHDLC - libpcap library support will be broken. - i.e. tcpdump will not work. - Fix for 2.2.x Kernels only. - - o Ethernet Bridgind over Frame Relay - - The Frame Relay bridging has been developed by - Kristian Hoffmann and Mark Wells. - - The Linux kernel bridge is used to send ethernet - data over the frame relay links. - For 2.2.X Kernels only. - - o Added extensive 2.0.X support. Most new features of - 2.1.5 for protocols Frame Relay, PPP and CHDLC are - supported under 2.0.X kernels. - -beta1-2.2.0 Dec 30 2000 - o Updated drivers for 2.4.X kernels. - o Updated drivers for SMP support. - o X25API is now able to share PCI interrupts. - o Took out a general polling routine that was used - only by X25API. - o Added appropriate locks to the dynamic reconfiguration - code. - o Fixed a bug in the keyboard debug monitor. - -beta2-2.2.0 Jan 8 2001 - o Patches for 2.4.0 kernel - o Patches for 2.2.18 kernel - o Minor updates to PPP and CHLDC drivers. - Note: No functional difference. - -beta3-2.2.9 Jan 10 2001 - o I missed the 2.2.18 kernel patches in beta2-2.2.0 - release. They are included in this release. - -Stable Release -2.2.0 Feb 01 2001 - o Bug fix in wancfg GUI configurator. - The edit function didn't work properly. - - -bata1-2.2.1 Feb 09 2001 - o WANPIPE TTY Driver emulation. - Two modes of operation Sync and Async. - Sync: Using the PPPD daemon, kernel SyncPPP layer - and the Wanpipe sync TTY driver: a PPP protocol - connection can be established via Sangoma adapter, over - a T1 leased line. - - The 2.4.0 kernel PPP layer supports MULTILINK - protocol, that can be used to bundle any number of Sangoma - adapters (T1 lines) into one, under a single IP address. - Thus, efficiently obtaining multiple T1 throughput. - - NOTE: The remote side must also implement MULTILINK PPP - protocol. - - Async:Using the PPPD daemon, kernel AsyncPPP layer - and the WANPIPE async TTY driver: a PPP protocol - connection can be established via Sangoma adapter and - a modem, over a telephone line. - - Thus, the WANPIPE async TTY driver simulates a serial - TTY driver that would normally be used to interface the - MODEM to the linux kernel. - - o WANPIPE PPP Backup Utility - This utility will monitor the state of the PPP T1 line. - In case of failure, a dial up connection will be established - via pppd daemon, ether via a serial tty driver (serial port), - or a WANPIPE async TTY driver (in case serial port is unavailable). - - Furthermore, while in dial up mode, the primary PPP T1 link - will be monitored for signs of life. - - If the PPP T1 link comes back to life, the dial up connection - will be shutdown and T1 line re-established. - - - o New Setup installation script. - Option to UPGRADE device drivers if the kernel source has - already been patched with WANPIPE. - - Option to COMPILE WANPIPE modules against the currently - running kernel, thus no need for manual kernel and module - re-compilatin. - - o Updates and Bug Fixes to wancfg utility. - -bata2-2.2.1 Feb 20 2001 - - o Bug fixes to the CHDLC device drivers. - The driver had compilation problems under kernels - 2.2.14 or lower. - - o Bug fixes to the Setup installation script. - The device drivers compilation options didn't work - properly. - - o Update to the wpbackupd daemon. - Optimized the cross-over times, between the primary - link and the backup dialup. - -beta3-2.2.1 Mar 02 2001 - o Patches for 2.4.2 kernel. - - o Bug fixes to util/ make files. - o Bug fixes to the Setup installation script. - - o Took out the backupd support and made it into - as separate package. - -beta4-2.2.1 Mar 12 2001 - - o Fix to the Frame Relay Device driver. - IPSAC sends a packet of zero length - header to the frame relay driver. The - driver tries to push its own 2 byte header - into the packet, which causes the driver to - crash. - - o Fix the WANPIPE re-configuration code. - Bug was found by trying to run the cfgft1 while the - interface was already running. - - o Updates to cfgft1. - Writes a wanpipe#.cfgft1 configuration file - once the CSU/DSU is configured. This file can - holds the current CSU/DSU configuration. - - - ->>>>>> END OF README <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< - - diff --git a/Documentation/networking/wavelan.txt b/Documentation/networking/wavelan.txt deleted file mode 100644 index c1acf5eb371..00000000000 --- a/Documentation/networking/wavelan.txt +++ /dev/null @@ -1,73 +0,0 @@ - The Wavelan drivers saga - ------------------------ - - By Jean Tourrilhes <jt@hpl.hp.com> - - The Wavelan is a Radio network adapter designed by -Lucent. Under this generic name is hidden quite a variety of hardware, -and many Linux driver to support it. - The get the full story on Wireless LANs, please consult : - http://www.hpl.hp.com/personal/Jean_Tourrilhes/Linux/ - -"wavelan" driver (old ISA Wavelan) ----------------- - o Config : Network device -> Wireless LAN -> AT&T WaveLAN - o Location : .../drivers/net/wavelan* - o in-line doc : .../drivers/net/wavelan.p.h - o on-line doc : - http://www.hpl.hp.com/personal/Jean_Tourrilhes/Linux/Wavelan.html - - This is the driver for the ISA version of the first generation -of the Wavelan, now discontinued. The device is 2 Mb/s, composed of a -Intel 82586 controller and a Lucent Modem, and is NOT 802.11 compliant. - The driver has been tested with the following hardware : - o Wavelan ISA 915 MHz (full length ISA card) - o Wavelan ISA 915 MHz 2.0 (half length ISA card) - o Wavelan ISA 2.4 GHz (full length ISA card, fixed frequency) - o Wavelan ISA 2.4 GHz 2.0 (half length ISA card, frequency selectable) - o Above cards with the optional DES encryption feature - -"wavelan_cs" driver (old Pcmcia Wavelan) -------------------- - o Config : Network device -> PCMCIA network -> - Pcmcia Wireless LAN -> AT&T/Lucent WaveLAN - o Location : .../drivers/net/pcmcia/wavelan* - o in-line doc : .../drivers/net/pcmcia/wavelan_cs.h - o on-line doc : - http://www.hpl.hp.com/personal/Jean_Tourrilhes/Linux/Wavelan.html - - This is the driver for the PCMCIA version of the first -generation of the Wavelan, now discontinued. The device is 2 Mb/s, -composed of a Intel 82593 controller (totally different from the 82586) -and a Lucent Modem, and NOT 802.11 compatible. - The driver has been tested with the following hardware : - o Wavelan Pcmcia 915 MHz 2.0 (Pcmcia card + separate - modem/antenna block) - o Wavelan Pcmcia 2.4 GHz 2.0 (Pcmcia card + separate - modem/antenna block) - -"wvlan_cs" driver (Wavelan IEEE, GPL) ------------------ - o Config : Not yet in kernel - o Location : Pcmcia package 3.1.10+ - o on-line doc : http://www.fasta.fh-dortmund.de/users/andy/wvlan/ - - This is the driver for the current generation of Wavelan IEEE, -which is 802.11 compatible. Depending on version, it is 2 Mb/s or 11 -Mb/s, with or without encryption, all implemented in Lucent specific -DSP (the Hermes). - This is a GPL full source PCMCIA driver (ISA is just a Pcmcia -card with ISA-Pcmcia bridge). - -"wavelan2_cs" driver (Wavelan IEEE, binary) --------------------- - o Config : Not yet in kernel - o Location : ftp://sourceforge.org/pcmcia/contrib/ - - This driver support exactly the same hardware as the previous -driver, the main difference is that it is based on a binary library -and supported by Lucent. - - I hope it clears the confusion ;-) - - Jean diff --git a/Documentation/networking/x25-iface.txt b/Documentation/networking/x25-iface.txt index 975cc87ebdd..7f213b556e8 100644 --- a/Documentation/networking/x25-iface.txt +++ b/Documentation/networking/x25-iface.txt @@ -20,23 +20,23 @@ the rest of the skbuff, if any more information does exist. Packet Layer to Device Driver ----------------------------- -First Byte = 0x00 +First Byte = 0x00 (X25_IFACE_DATA) This indicates that the rest of the skbuff contains data to be transmitted over the LAPB link. The LAPB link should already exist before any data is passed down. -First Byte = 0x01 +First Byte = 0x01 (X25_IFACE_CONNECT) Establish the LAPB link. If the link is already established then the connect confirmation message should be returned as soon as possible. -First Byte = 0x02 +First Byte = 0x02 (X25_IFACE_DISCONNECT) Terminate the LAPB link. If it is already disconnected then the disconnect confirmation message should be returned as soon as possible. -First Byte = 0x03 +First Byte = 0x03 (X25_IFACE_PARAMS) LAPB parameters. To be defined. @@ -44,22 +44,22 @@ LAPB parameters. To be defined. Device Driver to Packet Layer ----------------------------- -First Byte = 0x00 +First Byte = 0x00 (X25_IFACE_DATA) This indicates that the rest of the skbuff contains data that has been received over the LAPB link. -First Byte = 0x01 +First Byte = 0x01 (X25_IFACE_CONNECT) LAPB link has been established. The same message is used for both a LAPB link connect_confirmation and a connect_indication. -First Byte = 0x02 +First Byte = 0x02 (X25_IFACE_DISCONNECT) LAPB link has been terminated. This same message is used for both a LAPB link disconnect_confirmation and a disconnect_indication. -First Byte = 0x03 +First Byte = 0x03 (X25_IFACE_PARAMS) LAPB parameters. To be defined. @@ -105,7 +105,7 @@ reduced by the following measures or a combination thereof: later. The lapb module interface was modified to support this. Its data_indication() method should now transparently pass the - netif_rx() return value to the (lapb mopdule) caller. + netif_rx() return value to the (lapb module) caller. (2) Drivers for kernel versions 2.2.x should always check the global variable netdev_dropping when a new frame is received. The driver should only call netif_rx() if netdev_dropping is zero. Otherwise diff --git a/Documentation/networking/xfrm_proc.txt b/Documentation/networking/xfrm_proc.txt new file mode 100644 index 00000000000..d0d8bafa901 --- /dev/null +++ b/Documentation/networking/xfrm_proc.txt @@ -0,0 +1,74 @@ +XFRM proc - /proc/net/xfrm_* files +================================== +Masahide NAKAMURA <nakam@linux-ipv6.org> + + +Transformation Statistics +------------------------- +xfrm_proc is a statistics shown factor dropped by transformation +for developer. +It is a counter designed from current transformation source code +and defined like linux private MIB. + +Inbound statistics +~~~~~~~~~~~~~~~~~~ +XfrmInError: + All errors which is not matched others +XfrmInBufferError: + No buffer is left +XfrmInHdrError: + Header error +XfrmInNoStates: + No state is found + i.e. Either inbound SPI, address, or IPsec protocol at SA is wrong +XfrmInStateProtoError: + Transformation protocol specific error + e.g. SA key is wrong +XfrmInStateModeError: + Transformation mode specific error +XfrmInStateSeqError: + Sequence error + i.e. Sequence number is out of window +XfrmInStateExpired: + State is expired +XfrmInStateMismatch: + State has mismatch option + e.g. UDP encapsulation type is mismatch +XfrmInStateInvalid: + State is invalid +XfrmInTmplMismatch: + No matching template for states + e.g. Inbound SAs are correct but SP rule is wrong +XfrmInNoPols: + No policy is found for states + e.g. Inbound SAs are correct but no SP is found +XfrmInPolBlock: + Policy discards +XfrmInPolError: + Policy error + +Outbound errors +~~~~~~~~~~~~~~~ +XfrmOutError: + All errors which is not matched others +XfrmOutBundleGenError: + Bundle generation error +XfrmOutBundleCheckError: + Bundle check error +XfrmOutNoStates: + No state is found +XfrmOutStateProtoError: + Transformation protocol specific error +XfrmOutStateModeError: + Transformation mode specific error +XfrmOutStateSeqError: + Sequence error + i.e. Sequence number overflow +XfrmOutStateExpired: + State is expired +XfrmOutPolBlock: + Policy discards +XfrmOutPolDead: + Policy is dead +XfrmOutPolError: + Policy error diff --git a/Documentation/networking/xfrm_sync.txt b/Documentation/networking/xfrm_sync.txt new file mode 100644 index 00000000000..d7aac9dedeb --- /dev/null +++ b/Documentation/networking/xfrm_sync.txt @@ -0,0 +1,169 @@ + +The sync patches work is based on initial patches from +Krisztian <hidden@balabit.hu> and others and additional patches +from Jamal <hadi@cyberus.ca>. + +The end goal for syncing is to be able to insert attributes + generate +events so that the an SA can be safely moved from one machine to another +for HA purposes. +The idea is to synchronize the SA so that the takeover machine can do +the processing of the SA as accurate as possible if it has access to it. + +We already have the ability to generate SA add/del/upd events. +These patches add ability to sync and have accurate lifetime byte (to +ensure proper decay of SAs) and replay counters to avoid replay attacks +with as minimal loss at failover time. +This way a backup stays as closely uptodate as an active member. + +Because the above items change for every packet the SA receives, +it is possible for a lot of the events to be generated. +For this reason, we also add a nagle-like algorithm to restrict +the events. i.e we are going to set thresholds to say "let me +know if the replay sequence threshold is reached or 10 secs have passed" +These thresholds are set system-wide via sysctls or can be updated +per SA. + +The identified items that need to be synchronized are: +- the lifetime byte counter +note that: lifetime time limit is not important if you assume the failover +machine is known ahead of time since the decay of the time countdown +is not driven by packet arrival. +- the replay sequence for both inbound and outbound + +1) Message Structure +---------------------- + +nlmsghdr:aevent_id:optional-TLVs. + +The netlink message types are: + +XFRM_MSG_NEWAE and XFRM_MSG_GETAE. + +A XFRM_MSG_GETAE does not have TLVs. +A XFRM_MSG_NEWAE will have at least two TLVs (as is +discussed further below). + +aevent_id structure looks like: + + struct xfrm_aevent_id { + struct xfrm_usersa_id sa_id; + xfrm_address_t saddr; + __u32 flags; + __u32 reqid; + }; + +The unique SA is identified by the combination of xfrm_usersa_id, +reqid and saddr. + +flags are used to indicate different things. The possible +flags are: + XFRM_AE_RTHR=1, /* replay threshold*/ + XFRM_AE_RVAL=2, /* replay value */ + XFRM_AE_LVAL=4, /* lifetime value */ + XFRM_AE_ETHR=8, /* expiry timer threshold */ + XFRM_AE_CR=16, /* Event cause is replay update */ + XFRM_AE_CE=32, /* Event cause is timer expiry */ + XFRM_AE_CU=64, /* Event cause is policy update */ + +How these flags are used is dependent on the direction of the +message (kernel<->user) as well the cause (config, query or event). +This is described below in the different messages. + +The pid will be set appropriately in netlink to recognize direction +(0 to the kernel and pid = processid that created the event +when going from kernel to user space) + +A program needs to subscribe to multicast group XFRMNLGRP_AEVENTS +to get notified of these events. + +2) TLVS reflect the different parameters: +----------------------------------------- + +a) byte value (XFRMA_LTIME_VAL) +This TLV carries the running/current counter for byte lifetime since +last event. + +b)replay value (XFRMA_REPLAY_VAL) +This TLV carries the running/current counter for replay sequence since +last event. + +c)replay threshold (XFRMA_REPLAY_THRESH) +This TLV carries the threshold being used by the kernel to trigger events +when the replay sequence is exceeded. + +d) expiry timer (XFRMA_ETIMER_THRESH) +This is a timer value in milliseconds which is used as the nagle +value to rate limit the events. + +3) Default configurations for the parameters: +---------------------------------------------- + +By default these events should be turned off unless there is +at least one listener registered to listen to the multicast +group XFRMNLGRP_AEVENTS. + +Programs installing SAs will need to specify the two thresholds, however, +in order to not change existing applications such as racoon +we also provide default threshold values for these different parameters +in case they are not specified. + +the two sysctls/proc entries are: +a) /proc/sys/net/core/sysctl_xfrm_aevent_etime +used to provide default values for the XFRMA_ETIMER_THRESH in incremental +units of time of 100ms. The default is 10 (1 second) + +b) /proc/sys/net/core/sysctl_xfrm_aevent_rseqth +used to provide default values for XFRMA_REPLAY_THRESH parameter +in incremental packet count. The default is two packets. + +4) Message types +---------------- + +a) XFRM_MSG_GETAE issued by user-->kernel. +XFRM_MSG_GETAE does not carry any TLVs. +The response is a XFRM_MSG_NEWAE which is formatted based on what +XFRM_MSG_GETAE queried for. +The response will always have XFRMA_LTIME_VAL and XFRMA_REPLAY_VAL TLVs. +*if XFRM_AE_RTHR flag is set, then XFRMA_REPLAY_THRESH is also retrieved +*if XFRM_AE_ETHR flag is set, then XFRMA_ETIMER_THRESH is also retrieved + +b) XFRM_MSG_NEWAE is issued by either user space to configure +or kernel to announce events or respond to a XFRM_MSG_GETAE. + +i) user --> kernel to configure a specific SA. +any of the values or threshold parameters can be updated by passing the +appropriate TLV. +A response is issued back to the sender in user space to indicate success +or failure. +In the case of success, additionally an event with +XFRM_MSG_NEWAE is also issued to any listeners as described in iii). + +ii) kernel->user direction as a response to XFRM_MSG_GETAE +The response will always have XFRMA_LTIME_VAL and XFRMA_REPLAY_VAL TLVs. +The threshold TLVs will be included if explicitly requested in +the XFRM_MSG_GETAE message. + +iii) kernel->user to report as event if someone sets any values or +thresholds for an SA using XFRM_MSG_NEWAE (as described in #i above). +In such a case XFRM_AE_CU flag is set to inform the user that +the change happened as a result of an update. +The message will always have XFRMA_LTIME_VAL and XFRMA_REPLAY_VAL TLVs. + +iv) kernel->user to report event when replay threshold or a timeout +is exceeded. +In such a case either XFRM_AE_CR (replay exceeded) or XFRM_AE_CE (timeout +happened) is set to inform the user what happened. +Note the two flags are mutually exclusive. +The message will always have XFRMA_LTIME_VAL and XFRMA_REPLAY_VAL TLVs. + +Exceptions to threshold settings +-------------------------------- + +If you have an SA that is getting hit by traffic in bursts such that +there is a period where the timer threshold expires with no packets +seen, then an odd behavior is seen as follows: +The first packet arrival after a timer expiry will trigger a timeout +aevent; i.e we dont wait for a timeout period or a packet threshold +to be reached. This is done for simplicity and efficiency reasons. + +-JHS diff --git a/Documentation/networking/xfrm_sysctl.txt b/Documentation/networking/xfrm_sysctl.txt new file mode 100644 index 00000000000..5bbd16792fe --- /dev/null +++ b/Documentation/networking/xfrm_sysctl.txt @@ -0,0 +1,4 @@ +/proc/sys/net/core/xfrm_* Variables: + +xfrm_acq_expires - INTEGER + default 30 - hard timeout in seconds for acquire requests |
