diff options
Diffstat (limited to 'Documentation/networking')
90 files changed, 7435 insertions, 3143 deletions
diff --git a/Documentation/networking/.gitignore b/Documentation/networking/.gitignore index 286a5680f49..e69de29bb2d 100644 --- a/Documentation/networking/.gitignore +++ b/Documentation/networking/.gitignore @@ -1 +0,0 @@ -ifenslave diff --git a/Documentation/networking/00-INDEX b/Documentation/networking/00-INDEX index fe5c099b8fc..557b6ef70c2 100644 --- a/Documentation/networking/00-INDEX +++ b/Documentation/networking/00-INDEX @@ -2,16 +2,28 @@ - this file 3c505.txt - information on the 3Com EtherLink Plus (3c505) driver. +3c509.txt + - information on the 3Com Etherlink III Series Ethernet cards. 6pack.txt - info on the 6pack protocol, an alternative to KISS for AX.25 -DLINK.txt - - info on the D-Link DE-600/DE-620 parallel port pocket adapters +LICENSE.qla3xxx + - GPLv2 for QLogic Linux Networking HBA Driver +LICENSE.qlge + - GPLv2 for QLogic Linux qlge NIC Driver +LICENSE.qlcnic + - GPLv2 for QLogic Linux qlcnic NIC Driver +Makefile + - Makefile for docsrc. PLIP.txt - PLIP: The Parallel Line Internet Protocol device driver +README.ipw2100 + - README for the Intel PRO/Wireless 2100 driver. +README.ipw2200 + - README for the Intel PRO/Wireless 2915ABG and 2200BG driver. README.sb1000 - info on General Instrument/NextLevel SURFboard1000 cable modem. alias.txt - - info on using alias network devices + - info on using alias network devices. arcnet-hardware.txt - tons of info on ARCnet, hubs, jumper settings for ARCnet cards, etc. arcnet.txt @@ -20,8 +32,12 @@ atm.txt - info on where to get ATM programs and support for Linux. ax25.txt - info on using AX.25 and NET/ROM code for Linux +batman-adv.txt + - B.A.T.M.A.N routing protocol on top of layer 2 Ethernet Frames. baycom.txt - info on the driver for Baycom style amateur radio modems +bonding.txt + - Linux Ethernet Bonding Driver HOWTO: link aggregation in Linux. bridge.txt - where to get user space programs for ethernet bridging with Linux. can.txt @@ -34,36 +50,58 @@ cxacru.txt - Conexant AccessRunner USB ADSL Modem cxacru-cf.py - Conexant AccessRunner USB ADSL Modem configuration file parser +cxgb.txt + - Release Notes for the Chelsio N210 Linux device driver. +dccp.txt + - the Datagram Congestion Control Protocol (DCCP) (RFC 4340..42). de4x5.txt - the Digital EtherWORKS DE4?? and DE5?? PCI Ethernet driver decnet.txt - info on using the DECnet networking layer in Linux. -depca.txt - - the Digital DEPCA/EtherWORKS DE1?? and DE2?? LANCE Ethernet driver -dgrs.txt - - the Digi International RightSwitch SE-X Ethernet driver +dl2k.txt + - README for D-Link DL2000-based Gigabit Ethernet Adapters (dl2k.ko). +dm9000.txt + - README for the Simtec DM9000 Network driver. dmfe.txt - info on the Davicom DM9102(A)/DM9132/DM9801 fast ethernet driver. +dns_resolver.txt + - The DNS resolver module allows kernel servies to make DNS queries. +driver.txt + - Softnet driver issues. e100.txt - info on Intel's EtherExpress PRO/100 line of 10/100 boards e1000.txt - info on Intel's E1000 line of gigabit ethernet boards +e1000e.txt + - README for the Intel Gigabit Ethernet Driver (e1000e). eql.txt - serial IP load balancing -ethertap.txt - - the Ethertap user space packet reception and transmission driver -ewrk3.txt - - the Digital EtherWORKS 3 DE203/4/5 Ethernet driver +fib_trie.txt + - Level Compressed Trie (LC-trie) notes: a structure for routing. filter.txt - Linux Socket Filtering fore200e.txt - FORE Systems PCA-200E/SBA-200E ATM NIC driver info. framerelay.txt - info on using Frame Relay/Data Link Connection Identifier (DLCI). +gen_stats.txt + - Generic networking statistics for netlink users. +generic-hdlc.txt + - The generic High Level Data Link Control (HDLC) layer. generic_netlink.txt - info on Generic Netlink +gianfar.txt + - Gianfar Ethernet Driver. +i40e.txt + - README for the Intel Ethernet Controller XL710 Driver (i40e). +i40evf.txt + - Short note on the Driver for the Intel(R) XL710 X710 Virtual Function ieee802154.txt - Linux IEEE 802.15.4 implementation, API and drivers +igb.txt + - README for the Intel Gigabit Ethernet Driver (igb). +igbvf.txt + - README for the Intel Gigabit Ethernet Driver (igbvf). ip-sysctl.txt - /proc/sys/net/ipv4/* variables ip_dynaddr.txt @@ -72,43 +110,125 @@ ipddp.txt - AppleTalk-IP Decapsulation and AppleTalk-IP Encapsulation iphase.txt - Interphase PCI ATM (i)Chip IA Linux driver info. +ipsec.txt + - Note on not compressing IPSec payload and resulting failed policy check. +ipv6.txt + - Options to the ipv6 kernel module. +ipvs-sysctl.txt + - Per-inode explanation of the /proc/sys/net/ipv4/vs interface. irda.txt - where to get IrDA (infrared) utilities and info for Linux. +ixgb.txt + - README for the Intel 10 Gigabit Ethernet Driver (ixgb). +ixgbe.txt + - README for the Intel 10 Gigabit Ethernet Driver (ixgbe). +ixgbevf.txt + - README for the Intel Virtual Function (VF) Driver (ixgbevf). +l2tp.txt + - User guide to the L2TP tunnel protocol. lapb-module.txt - programming information of the LAPB module. ltpc.txt - the Apple or Farallon LocalTalk PC card driver -multicast.txt - - Behaviour of cards under Multicast +mac80211-auth-assoc-deauth.txt + - authentication and association / deauth-disassoc with max80211 +mac80211-injection.txt + - HOWTO use packet injection with mac80211 +multiqueue.txt + - HOWTO for multiqueue network device support. +netconsole.txt + - The network console module netconsole.ko: configuration and notes. +netdev-FAQ.txt + - FAQ describing how to submit net changes to netdev mailing list. +netdev-features.txt + - Network interface features API description. netdevices.txt - info on network device driver functions exported to the kernel. -olympic.txt - - IBM PCI Pit/Pit-Phy/Olympic Token Ring driver info. +netif-msg.txt + - Design of the network interface message level setting (NETIF_MSG_*). +netlink_mmap.txt + - memory mapped I/O with netlink +nf_conntrack-sysctl.txt + - list of netfilter-sysctl knobs. +nfc.txt + - The Linux Near Field Communication (NFS) subsystem. +openvswitch.txt + - Open vSwitch developer documentation. +operstates.txt + - Overview of network interface operational states. +packet_mmap.txt + - User guide to memory mapped packet socket rings (PACKET_[RT]X_RING). +phonet.txt + - The Phonet packet protocol used in Nokia cellular modems. +phy.txt + - The PHY abstraction layer. +pktgen.txt + - User guide to the kernel packet generator (pktgen.ko). policy-routing.txt - IP policy-based routing +ppp_generic.txt + - Information about the generic PPP driver. +proc_net_tcp.txt + - Per inode overview of the /proc/net/tcp and /proc/net/tcp6 interfaces. +radiotap-headers.txt + - Background on radiotap headers. ray_cs.txt - Raylink Wireless LAN card driver info. +rds.txt + - Background on the reliable, ordered datagram delivery method RDS. +regulatory.txt + - Overview of the Linux wireless regulatory infrastructure. +rxrpc.txt + - Guide to the RxRPC protocol. +s2io.txt + - Release notes for Neterion Xframe I/II 10GbE driver. +scaling.txt + - Explanation of network scaling techniques: RSS, RPS, RFS, aRFS, XPS. +sctp.txt + - Notes on the Linux kernel implementation of the SCTP protocol. +secid.txt + - Explanation of the secid member in flow structures. skfp.txt - SysKonnect FDDI (SK-5xxx, Compaq Netelligent) driver info. smc9.txt - the driver for SMC's 9000 series of Ethernet cards -smctr.txt - - SMC TokenCard TokenRing Linux driver info. +spider_net.txt + - README for the Spidernet Driver (as found in PS3 / Cell BE). +stmmac.txt + - README for the STMicro Synopsys Ethernet driver. +tc-actions-env-rules.txt + - rules for traffic control (tc) actions. +timestamping.txt + - overview of network packet timestamping variants. tcp.txt - short blurb on how TCP output takes place. +tcp-thin.txt + - kernel tuning options for low rate 'thin' TCP streams. +team.txt + - pointer to information for ethernet teaming devices. tlan.txt - ThunderLAN (Compaq Netelligent 10/100, Olicom OC-2xxx) driver info. -tms380tr.txt - - SysKonnect Token Ring ISA/PCI adapter driver info. +tproxy.txt + - Transparent proxy support user guide. tuntap.txt - TUN/TAP device driver, allowing user space Rx/Tx of packets. +udplite.txt + - UDP-Lite protocol (RFC 3828) introduction. vortex.txt - info on using 3Com Vortex (3c590, 3c592, 3c595, 3c597) Ethernet cards. -wavelan.txt - - AT&T GIS (nee NCR) WaveLAN card: An Ethernet-like radio transceiver +vxge.txt + - README for the Neterion X3100 PCIe Server Adapter. +vxlan.txt + - Virtual extensible LAN overview x25.txt - general info on X.25 development. x25-iface.txt - description of the X.25 Packet Layer to LAPB device interface. +xfrm_proc.txt + - description of the statistics package for XFRM. +xfrm_sync.txt + - sync patches for XFRM enable migration of an SA between hosts. +xfrm_sysctl.txt + - description of the XFRM configuration options. z8530drv.txt - info about Linux driver for Z8530 based HDLC cards for AX.25 diff --git a/Documentation/networking/3c359.txt b/Documentation/networking/3c359.txt deleted file mode 100644 index 4af8071a6d1..00000000000 --- a/Documentation/networking/3c359.txt +++ /dev/null @@ -1,58 +0,0 @@ - -3COM PCI TOKEN LINK VELOCITY XL TOKEN RING CARDS README - -Release 0.9.0 - Release - Jul 17th 2000 Mike Phillips - - 1.2.0 - Final - Feb 17th 2002 Mike Phillips - Updated for submission to the 2.4.x kernel. - -Thanks: - Terry Murphy from 3Com for tech docs and support, - Adam D. Ligas for testing the driver. - -Note: - This driver will NOT work with the 3C339 Token Ring cards, you need -to use the tms380 driver instead. - -Options: - -The driver accepts three options: ringspeed, pkt_buf_sz and message_level. - -These options can be specified differently for each card found. - -ringspeed: Has one of three settings 0 (default), 4 or 16. 0 will -make the card autosense the ringspeed and join at the appropriate speed, -this will be the default option for most people. 4 or 16 allow you to -explicitly force the card to operate at a certain speed. The card will fail -if you try to insert it at the wrong speed. (Although some hubs will allow -this so be *very* careful). The main purpose for explicitly setting the ring -speed is for when the card is first on the ring. In autosense mode, if the card -cannot detect any active monitors on the ring it will open at the same speed as -its last opening. This can be hazardous if this speed does not match the speed -you want the ring to operate at. - -pkt_buf_sz: This is this initial receive buffer allocation size. This will -default to 4096 if no value is entered. You may increase performance of the -driver by setting this to a value larger than the network packet size, although -the driver now re-sizes buffers based on MTU settings as well. - -message_level: Controls level of messages created by the driver. Defaults to 0: -which only displays start-up and critical messages. Presently any non-zero -value will display all soft messages as well. NB This does not turn -debugging messages on, that must be done by modified the source code. - -Variable MTU size: - -The driver can handle a MTU size upto either 4500 or 18000 depending upon -ring speed. The driver also changes the size of the receive buffers as part -of the mtu re-sizing, so if you set mtu = 18000, you will need to be able -to allocate 16 * (sk_buff with 18000 buffer size) call it 18500 bytes per ring -position = 296,000 bytes of memory space, plus of course anything -necessary for the tx sk_buff's. Remember this is per card, so if you are -building routers, gateway's etc, you could start to use a lot of memory -real fast. - -2/17/02 Mike Phillips - diff --git a/Documentation/networking/3c505.txt b/Documentation/networking/3c505.txt deleted file mode 100644 index 72f38b13101..00000000000 --- a/Documentation/networking/3c505.txt +++ /dev/null @@ -1,45 +0,0 @@ -The 3Com Etherlink Plus (3c505) driver. - -This driver now uses DMA. There is currently no support for PIO operation. -The default DMA channel is 6; this is _not_ autoprobed, so you must -make sure you configure it correctly. If loading the driver as a -module, you can do this with "modprobe 3c505 dma=n". If the driver is -linked statically into the kernel, you must either use an "ether=" -statement on the command line, or change the definition of ELP_DMA in 3c505.h. - -The driver will warn you if it has to fall back on the compiled in -default DMA channel. - -If no base address is given at boot time, the driver will autoprobe -ports 0x300, 0x280 and 0x310 (in that order). If no IRQ is given, the driver -will try to probe for it. - -The driver can be used as a loadable module. - -Theoretically, one instance of the driver can now run multiple cards, -in the standard way (when loading a module, say "modprobe 3c505 -io=0x300,0x340 irq=10,11 dma=6,7" or whatever). I have not tested -this, though. - -The driver may now support revision 2 hardware; the dependency on -being able to read the host control register has been removed. This -is also untested, since I don't have a suitable card. - -Known problems: - I still see "DMA upload timed out" messages from time to time. These -seem to be fairly non-fatal though. - The card is old and slow. - -To do: - Improve probe/setup code - Test multicast and promiscuous operation - -Authors: - The driver is mainly written by Craig Southeren, email - <craigs@ineluki.apana.org.au>. - Parts of the driver (adapting the driver to 1.1.4+ kernels, - IRQ/address detection, some changes) and this README by - Juha Laiho <jlaiho@ichaos.nullnet.fi>. - DMA mode, more fixes, etc, by Philip Blundell <pjb27@cam.ac.uk> - Multicard support, Software configurable DMA, etc., by - Christopher Collins <ccollins@pcug.org.au> diff --git a/Documentation/networking/3c509.txt b/Documentation/networking/3c509.txt index dcc9eaf5939..fbf722e15ac 100644 --- a/Documentation/networking/3c509.txt +++ b/Documentation/networking/3c509.txt @@ -25,7 +25,6 @@ models: 3c509B (later revision of the ISA card; supports full-duplex) 3c589 (PCMCIA) 3c589B (later revision of the 3c589; supports full-duplex) - 3c529 (MCA) 3c579 (EISA) Large portions of this documentation were heavily borrowed from the guide diff --git a/Documentation/networking/DLINK.txt b/Documentation/networking/DLINK.txt deleted file mode 100644 index 55d24433d15..00000000000 --- a/Documentation/networking/DLINK.txt +++ /dev/null @@ -1,203 +0,0 @@ -Released 1994-06-13 - - - CONTENTS: - - 1. Introduction. - 2. License. - 3. Files in this release. - 4. Installation. - 5. Problems and tuning. - 6. Using the drivers with earlier releases. - 7. Acknowledgments. - - - 1. INTRODUCTION. - - This is a set of Ethernet drivers for the D-Link DE-600/DE-620 - pocket adapters, for the parallel port on a Linux based machine. - Some adapter "clones" will also work. Xircom is _not_ a clone... - These drivers _can_ be used as loadable modules, - and were developed for use on Linux 1.1.13 and above. - For use on Linux 1.0.X, or earlier releases, see below. - - I have used these drivers for NFS, ftp, telnet and X-clients on - remote machines. Transmissions with ftp seems to work as - good as can be expected (i.e. > 80k bytes/sec) from a - parallel port...:-) Receive speeds will be about 60-80% of this. - Depending on your machine, somewhat higher speeds can be achieved. - - All comments/fixes to Bjorn Ekwall (bj0rn@blox.se). - - - 2. LICENSE. - - This program is free software; you can redistribute it - and/or modify it under the terms of the GNU General Public - License as published by the Free Software Foundation; either - version 2, or (at your option) any later version. - - This program is distributed in the hope that it will be - useful, but WITHOUT ANY WARRANTY; without even the implied - warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR - PURPOSE. See the GNU General Public License for more - details. - - You should have received a copy of the GNU General Public - License along with this program; if not, write to the Free - Software Foundation, Inc., 675 Mass Ave, Cambridge, MA - 02139, USA. - - - 3. FILES IN THIS RELEASE. - - README.DLINK This file. - de600.c The Source (may it be with You :-) for the DE-600 - de620.c ditto for the DE-620 - de620.h Macros for de620.c - - If you are upgrading from the d-link tar release, there will - also be a "dlink-patches" file that will patch Linux 1.1.18: - linux/drivers/net/Makefile - linux/drivers/net/CONFIG - linux/drivers/net/MODULES - linux/drivers/net/Space.c - linux/config.in - Apply the patch by: - "cd /usr/src; patch -p0 < linux/drivers/net/dlink-patches" - The old source, "linux/drivers/net/d_link.c", can be removed. - - - 4. INSTALLATION. - - o Get the latest net binaries, according to current net.wisdom. - - o Read the NET-2 and Ethernet HOWTOs and modify your setup. - - o If your parallel port has a strange address or irq, - modify "linux/drivers/net/CONFIG" accordingly, or adjust - the parameters in the "tuning" section in the sources. - - If you are going to use the drivers as loadable modules, do _not_ - enable them while doing "make config", but instead make sure that - the drivers are included in "linux/drivers/net/MODULES". - - If you are _not_ going to use the driver(s) as loadable modules, - but instead have them included in the kernel, remember to enable - the drivers while doing "make config". - - o To include networking and DE600/DE620 support in your kernel: - # cd /linux - (as modules:) - # make config (answer yes on CONFIG_NET and CONFIG_INET) - (else included in the kernel:) - # make config (answer yes on CONFIG _NET, _INET and _DE600 or _DE620) - # make clean - # make zImage (or whatever magic you usually do) - - o I use lilo to boot multiple kernels, so that I at least - can have one working kernel :-). If you do too, append - these lines to /etc/lilo/config: - - image = /linux/zImage - label = newlinux - root = /dev/hda2 (or whatever YOU have...) - - # /etc/lilo/install - - o Do "sync" and reboot the new kernel with a D-Link - DE-600/DE-620 pocket adapter connected. - - o The adapter can be configured with ifconfig eth? - where the actual number is decided by the kernel - when the drivers are initialized. - - - 5. "PROBLEMS" AND TUNING, - - o If you see error messages from the driver, and if the traffic - stops on the adapter, try to do "ifconfig" and "route" once - more, just as in "rc.inet1". This should take care of most - problems, including effects from power loss, or adapters that - aren't connected to the printer port in some way or another. - You can somewhat change the behaviour by enabling/disabling - the macro SHUTDOWN_WHEN_LOST in the "tuning" section. - For the DE-600 there is another macro, CHECK_LOST_DE600, - that you might want to read about in the "tuning" section. - - o Some machines have trouble handling the parallel port and - the adapter at high speed. If you experience problems: - - DE-600: - - The adapter is not recognized at boot, i.e. an Ethernet - address of 00:80:c8:... is not shown, try to add another - "; SLOW_DOWN_IO" - at DE600_SLOW_DOWN in the "tuning" section. As a last resort, - uncomment: "#define REALLY_SLOW_IO" (see <asm/io.h> for hints). - - - You experience "timeout" messages: first try to add another - "; SLOW_DOWN_IO" - at DE600_SLOW_DOWN in the "tuning" section, _then_ try to - increase the value (original value: 5) at - "if (tickssofar < 5)" near line 422. - - DE-620: - - Your parallel port might be "sluggish". To cater for - this, there are the macros LOWSPEED and READ_DELAY/WRITE_DELAY - in the "tuning" section. Your first step should be to enable - LOWSPEED, and after that you can "tune" the XXX_DELAY values. - - o If the adapter _is_ recognized at boot but you get messages - about "Network Unreachable", then the problem is probably - _not_ with the driver. Check your net configuration instead - (ifconfig and route) in "rc.inet1". - - o There is some rudimentary support for debugging, look at - the source. Use "-DDE600_DEBUG=3" or "-DDE620_DEBUG=3" - when compiling, or include it in "linux/drivers/net/CONFIG". - IF YOU HAVE PROBLEMS YOU CAN'T SOLVE: PLEASE COMPILE THE DRIVER - WITH DEBUGGING ENABLED, AND SEND ME THE RESULTING OUTPUT! - - - 6. USING THE DRIVERS WITH EARLIER RELEASES. - - The later 1.1.X releases of the Linux kernel include some - changes in the networking layer (a.k.a. NET3). This affects - these drivers in a few places. The hints that follow are - _not_ tested by me, since I don't have the disk space to keep - all releases on-line. - Known needed changes to date: - - release patchfile: some patches will fail, but they should - be easy to apply "by hand", since they are trivial. - (Space.c: d_link_init() is now called de600_probe()) - - de600.c: change "mark_bh(NET_BH)" to "mark_bh(INET_BH)". - - de620.c: (maybe) change the code around "netif_rx(skb);" to be - similar to the code around "dev_rint(...)" in de600.c - - - 7. ACKNOWLEDGMENTS. - - These drivers wouldn't have been done without the base - (and support) from Ross Biro, and D-Link Systems Inc. - The driver relies upon GPL-ed source from D-Link Systems Inc. - and from Russel Nelson at Crynwr Software <nelson@crynwr.com>. - - Additional input also from: - Donald Becker <becker@super.org>, Alan Cox <A.Cox@swansea.ac.uk> - and Fred N. van Kempen <waltje@uWalt.NL.Mugnet.ORG> - - DE-600 alpha release primary victim^H^H^H^H^H^Htester: - - Erik Proper <erikp@cs.kun.nl>. - Good input also from several users, most notably - - Mark Burton <markb@ordern.demon.co.uk>. - - DE-620 alpha release victims^H^H^H^H^H^H^Htesters: - - J. Joshua Kopper <kopper@rtsg.mot.com> - - Olav Kvittem <Olav.Kvittem@uninett.no> - - Germano Caronni <caronni@nessie.cs.id.ethz.ch> - - Jeremy Fitzhardinge <jeremy@suite.sw.oz.au> - - - Happy hacking! - - Bjorn Ekwall == bj0rn@blox.se diff --git a/Documentation/networking/LICENSE.qlcnic b/Documentation/networking/LICENSE.qlcnic index 29ad4b10642..2ae3b64983a 100644 --- a/Documentation/networking/LICENSE.qlcnic +++ b/Documentation/networking/LICENSE.qlcnic @@ -1,61 +1,22 @@ -Copyright (c) 2009-2010 QLogic Corporation +Copyright (c) 2009-2013 QLogic Corporation QLogic Linux qlcnic NIC Driver -This program includes a device driver for Linux 2.6 that may be -distributed with QLogic hardware specific firmware binary file. You may modify and redistribute the device driver code under the GNU General Public License (a copy of which is attached hereto as Exhibit A) published by the Free Software Foundation (version 2). -You may redistribute the hardware specific firmware binary file -under the following terms: - - 1. Redistribution of source code (only if applicable), - must retain the above copyright notice, this list of - conditions and the following disclaimer. - - 2. Redistribution in binary form must reproduce the above - copyright notice, this list of conditions and the - following disclaimer in the documentation and/or other - materials provided with the distribution. - - 3. The name of QLogic Corporation may not be used to - endorse or promote products derived from this software - without specific prior written permission - -REGARDLESS OF WHAT LICENSING MECHANISM IS USED OR APPLICABLE, -THIS PROGRAM IS PROVIDED BY QLOGIC CORPORATION "AS IS'' AND ANY -EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE -IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A -PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR -BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, -EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED -TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, -DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON -ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, -OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY -OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE -POSSIBILITY OF SUCH DAMAGE. - -USER ACKNOWLEDGES AND AGREES THAT USE OF THIS PROGRAM WILL NOT -CREATE OR GIVE GROUNDS FOR A LICENSE BY IMPLICATION, ESTOPPEL, OR -OTHERWISE IN ANY INTELLECTUAL PROPERTY RIGHTS (PATENT, COPYRIGHT, -TRADE SECRET, MASK WORK, OR OTHER PROPRIETARY RIGHT) EMBODIED IN -ANY OTHER QLOGIC HARDWARE OR SOFTWARE EITHER SOLELY OR IN -COMBINATION WITH THIS PROGRAM. - EXHIBIT A - GNU GENERAL PUBLIC LICENSE - Version 2, June 1991 + GNU GENERAL PUBLIC LICENSE + Version 2, June 1991 Copyright (C) 1989, 1991 Free Software Foundation, Inc. 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA Everyone is permitted to copy and distribute verbatim copies of this license document, but changing it is not allowed. - Preamble + Preamble The licenses for most software are designed to take away your freedom to share and change it. By contrast, the GNU General Public @@ -105,7 +66,7 @@ patent must be licensed for everyone's free use or not licensed at all. The precise terms and conditions for copying, distribution and modification follow. - GNU GENERAL PUBLIC LICENSE + GNU GENERAL PUBLIC LICENSE TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION 0. This License applies to any program or other work which contains @@ -304,7 +265,7 @@ make exceptions for this. Our decision will be guided by the two goals of preserving the free status of all derivatives of our free software and of promoting the sharing and reuse of software generally. - NO WARRANTY + NO WARRANTY 11. BECAUSE THE PROGRAM IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN diff --git a/Documentation/networking/LICENSE.qlge b/Documentation/networking/LICENSE.qlge index 123b6edd7f1..ce64e4d15b2 100644 --- a/Documentation/networking/LICENSE.qlge +++ b/Documentation/networking/LICENSE.qlge @@ -1,46 +1,288 @@ -Copyright (c) 2003-2008 QLogic Corporation -QLogic Linux Networking HBA Driver +Copyright (c) 2003-2011 QLogic Corporation +QLogic Linux qlge NIC Driver -This program includes a device driver for Linux 2.6 that may be -distributed with QLogic hardware specific firmware binary file. You may modify and redistribute the device driver code under the -GNU General Public License as published by the Free Software -Foundation (version 2 or a later version). - -You may redistribute the hardware specific firmware binary file -under the following terms: - - 1. Redistribution of source code (only if applicable), - must retain the above copyright notice, this list of - conditions and the following disclaimer. - - 2. Redistribution in binary form must reproduce the above - copyright notice, this list of conditions and the - following disclaimer in the documentation and/or other - materials provided with the distribution. - - 3. The name of QLogic Corporation may not be used to - endorse or promote products derived from this software - without specific prior written permission - -REGARDLESS OF WHAT LICENSING MECHANISM IS USED OR APPLICABLE, -THIS PROGRAM IS PROVIDED BY QLOGIC CORPORATION "AS IS'' AND ANY -EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE -IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A -PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR -BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, -EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED -TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, -DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON -ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, -OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY -OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE -POSSIBILITY OF SUCH DAMAGE. - -USER ACKNOWLEDGES AND AGREES THAT USE OF THIS PROGRAM WILL NOT -CREATE OR GIVE GROUNDS FOR A LICENSE BY IMPLICATION, ESTOPPEL, OR -OTHERWISE IN ANY INTELLECTUAL PROPERTY RIGHTS (PATENT, COPYRIGHT, -TRADE SECRET, MASK WORK, OR OTHER PROPRIETARY RIGHT) EMBODIED IN -ANY OTHER QLOGIC HARDWARE OR SOFTWARE EITHER SOLELY OR IN -COMBINATION WITH THIS PROGRAM. +GNU General Public License (a copy of which is attached hereto as +Exhibit A) published by the Free Software Foundation (version 2). + +EXHIBIT A + + GNU GENERAL PUBLIC LICENSE + Version 2, June 1991 + + Copyright (C) 1989, 1991 Free Software Foundation, Inc. + 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA + Everyone is permitted to copy and distribute verbatim copies + of this license document, but changing it is not allowed. + + Preamble + + The licenses for most software are designed to take away your +freedom to share and change it. By contrast, the GNU General Public +License is intended to guarantee your freedom to share and change free +software--to make sure the software is free for all its users. This +General Public License applies to most of the Free Software +Foundation's software and to any other program whose authors commit to +using it. (Some other Free Software Foundation software is covered by +the GNU Lesser General Public License instead.) You can apply it to +your programs, too. + + When we speak of free software, we are referring to freedom, not +price. Our General Public Licenses are designed to make sure that you +have the freedom to distribute copies of free software (and charge for +this service if you wish), that you receive source code or can get it +if you want it, that you can change the software or use pieces of it +in new free programs; and that you know you can do these things. + + To protect your rights, we need to make restrictions that forbid +anyone to deny you these rights or to ask you to surrender the rights. +These restrictions translate to certain responsibilities for you if you +distribute copies of the software, or if you modify it. + + For example, if you distribute copies of such a program, whether +gratis or for a fee, you must give the recipients all the rights that +you have. You must make sure that they, too, receive or can get the +source code. And you must show them these terms so they know their +rights. + + We protect your rights with two steps: (1) copyright the software, and +(2) offer you this license which gives you legal permission to copy, +distribute and/or modify the software. + + Also, for each author's protection and ours, we want to make certain +that everyone understands that there is no warranty for this free +software. If the software is modified by someone else and passed on, we +want its recipients to know that what they have is not the original, so +that any problems introduced by others will not reflect on the original +authors' reputations. + + Finally, any free program is threatened constantly by software +patents. We wish to avoid the danger that redistributors of a free +program will individually obtain patent licenses, in effect making the +program proprietary. To prevent this, we have made it clear that any +patent must be licensed for everyone's free use or not licensed at all. + + The precise terms and conditions for copying, distribution and +modification follow. + + GNU GENERAL PUBLIC LICENSE + TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION + + 0. This License applies to any program or other work which contains +a notice placed by the copyright holder saying it may be distributed +under the terms of this General Public License. The "Program", below, +refers to any such program or work, and a "work based on the Program" +means either the Program or any derivative work under copyright law: +that is to say, a work containing the Program or a portion of it, +either verbatim or with modifications and/or translated into another +language. (Hereinafter, translation is included without limitation in +the term "modification".) Each licensee is addressed as "you". + +Activities other than copying, distribution and modification are not +covered by this License; they are outside its scope. The act of +running the Program is not restricted, and the output from the Program +is covered only if its contents constitute a work based on the +Program (independent of having been made by running the Program). +Whether that is true depends on what the Program does. + + 1. You may copy and distribute verbatim copies of the Program's +source code as you receive it, in any medium, provided that you +conspicuously and appropriately publish on each copy an appropriate +copyright notice and disclaimer of warranty; keep intact all the +notices that refer to this License and to the absence of any warranty; +and give any other recipients of the Program a copy of this License +along with the Program. + +You may charge a fee for the physical act of transferring a copy, and +you may at your option offer warranty protection in exchange for a fee. + + 2. You may modify your copy or copies of the Program or any portion +of it, thus forming a work based on the Program, and copy and +distribute such modifications or work under the terms of Section 1 +above, provided that you also meet all of these conditions: + + a) You must cause the modified files to carry prominent notices + stating that you changed the files and the date of any change. + + b) You must cause any work that you distribute or publish, that in + whole or in part contains or is derived from the Program or any + part thereof, to be licensed as a whole at no charge to all third + parties under the terms of this License. + + c) If the modified program normally reads commands interactively + when run, you must cause it, when started running for such + interactive use in the most ordinary way, to print or display an + announcement including an appropriate copyright notice and a + notice that there is no warranty (or else, saying that you provide + a warranty) and that users may redistribute the program under + these conditions, and telling the user how to view a copy of this + License. (Exception: if the Program itself is interactive but + does not normally print such an announcement, your work based on + the Program is not required to print an announcement.) + +These requirements apply to the modified work as a whole. If +identifiable sections of that work are not derived from the Program, +and can be reasonably considered independent and separate works in +themselves, then this License, and its terms, do not apply to those +sections when you distribute them as separate works. But when you +distribute the same sections as part of a whole which is a work based +on the Program, the distribution of the whole must be on the terms of +this License, whose permissions for other licensees extend to the +entire whole, and thus to each and every part regardless of who wrote it. + +Thus, it is not the intent of this section to claim rights or contest +your rights to work written entirely by you; rather, the intent is to +exercise the right to control the distribution of derivative or +collective works based on the Program. + +In addition, mere aggregation of another work not based on the Program +with the Program (or with a work based on the Program) on a volume of +a storage or distribution medium does not bring the other work under +the scope of this License. + + 3. You may copy and distribute the Program (or a work based on it, +under Section 2) in object code or executable form under the terms of +Sections 1 and 2 above provided that you also do one of the following: + + a) Accompany it with the complete corresponding machine-readable + source code, which must be distributed under the terms of Sections + 1 and 2 above on a medium customarily used for software interchange; or, + + b) Accompany it with a written offer, valid for at least three + years, to give any third party, for a charge no more than your + cost of physically performing source distribution, a complete + machine-readable copy of the corresponding source code, to be + distributed under the terms of Sections 1 and 2 above on a medium + customarily used for software interchange; or, + + c) Accompany it with the information you received as to the offer + to distribute corresponding source code. (This alternative is + allowed only for noncommercial distribution and only if you + received the program in object code or executable form with such + an offer, in accord with Subsection b above.) + +The source code for a work means the preferred form of the work for +making modifications to it. For an executable work, complete source +code means all the source code for all modules it contains, plus any +associated interface definition files, plus the scripts used to +control compilation and installation of the executable. However, as a +special exception, the source code distributed need not include +anything that is normally distributed (in either source or binary +form) with the major components (compiler, kernel, and so on) of the +operating system on which the executable runs, unless that component +itself accompanies the executable. + +If distribution of executable or object code is made by offering +access to copy from a designated place, then offering equivalent +access to copy the source code from the same place counts as +distribution of the source code, even though third parties are not +compelled to copy the source along with the object code. + + 4. You may not copy, modify, sublicense, or distribute the Program +except as expressly provided under this License. Any attempt +otherwise to copy, modify, sublicense or distribute the Program is +void, and will automatically terminate your rights under this License. +However, parties who have received copies, or rights, from you under +this License will not have their licenses terminated so long as such +parties remain in full compliance. + + 5. You are not required to accept this License, since you have not +signed it. However, nothing else grants you permission to modify or +distribute the Program or its derivative works. These actions are +prohibited by law if you do not accept this License. Therefore, by +modifying or distributing the Program (or any work based on the +Program), you indicate your acceptance of this License to do so, and +all its terms and conditions for copying, distributing or modifying +the Program or works based on it. + + 6. Each time you redistribute the Program (or any work based on the +Program), the recipient automatically receives a license from the +original licensor to copy, distribute or modify the Program subject to +these terms and conditions. You may not impose any further +restrictions on the recipients' exercise of the rights granted herein. +You are not responsible for enforcing compliance by third parties to +this License. + + 7. If, as a consequence of a court judgment or allegation of patent +infringement or for any other reason (not limited to patent issues), +conditions are imposed on you (whether by court order, agreement or +otherwise) that contradict the conditions of this License, they do not +excuse you from the conditions of this License. If you cannot +distribute so as to satisfy simultaneously your obligations under this +License and any other pertinent obligations, then as a consequence you +may not distribute the Program at all. For example, if a patent +license would not permit royalty-free redistribution of the Program by +all those who receive copies directly or indirectly through you, then +the only way you could satisfy both it and this License would be to +refrain entirely from distribution of the Program. + +If any portion of this section is held invalid or unenforceable under +any particular circumstance, the balance of the section is intended to +apply and the section as a whole is intended to apply in other +circumstances. + +It is not the purpose of this section to induce you to infringe any +patents or other property right claims or to contest validity of any +such claims; this section has the sole purpose of protecting the +integrity of the free software distribution system, which is +implemented by public license practices. Many people have made +generous contributions to the wide range of software distributed +through that system in reliance on consistent application of that +system; it is up to the author/donor to decide if he or she is willing +to distribute software through any other system and a licensee cannot +impose that choice. + +This section is intended to make thoroughly clear what is believed to +be a consequence of the rest of this License. + + 8. If the distribution and/or use of the Program is restricted in +certain countries either by patents or by copyrighted interfaces, the +original copyright holder who places the Program under this License +may add an explicit geographical distribution limitation excluding +those countries, so that distribution is permitted only in or among +countries not thus excluded. In such case, this License incorporates +the limitation as if written in the body of this License. + + 9. The Free Software Foundation may publish revised and/or new versions +of the General Public License from time to time. Such new versions will +be similar in spirit to the present version, but may differ in detail to +address new problems or concerns. + +Each version is given a distinguishing version number. If the Program +specifies a version number of this License which applies to it and "any +later version", you have the option of following the terms and conditions +either of that version or of any later version published by the Free +Software Foundation. If the Program does not specify a version number of +this License, you may choose any version ever published by the Free Software +Foundation. + + 10. If you wish to incorporate parts of the Program into other free +programs whose distribution conditions are different, write to the author +to ask for permission. For software which is copyrighted by the Free +Software Foundation, write to the Free Software Foundation; we sometimes +make exceptions for this. Our decision will be guided by the two goals +of preserving the free status of all derivatives of our free software and +of promoting the sharing and reuse of software generally. + + NO WARRANTY + + 11. BECAUSE THE PROGRAM IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY +FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN +OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES +PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED +OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF +MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS +TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU. SHOULD THE +PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING, +REPAIR OR CORRECTION. + + 12. IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING +WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR +REDISTRIBUTE THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, +INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING +OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED +TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY +YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER +PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE +POSSIBILITY OF SUCH DAMAGES. diff --git a/Documentation/networking/Makefile b/Documentation/networking/Makefile index 5aba7a33aee..0aa1ac98fc2 100644 --- a/Documentation/networking/Makefile +++ b/Documentation/networking/Makefile @@ -1,9 +1,6 @@ # kbuild trick to avoid linker error. Can be omitted if a module is built. obj- := dummy.o -# List of programs to build -hostprogs-y := ifenslave - # Tell kbuild to always build the programs always := $(hostprogs-y) diff --git a/Documentation/networking/README.ipw2200 b/Documentation/networking/README.ipw2200 index 616a8e540b0..b7658bed490 100644 --- a/Documentation/networking/README.ipw2200 +++ b/Documentation/networking/README.ipw2200 @@ -256,7 +256,7 @@ You can set the debug level via: Where $VALUE would be a number in the case of this sysfs entry. The input to sysfs files does not have to be a number. For example, the -firmware loader used by hotplug utilizes sysfs entries for transfering +firmware loader used by hotplug utilizes sysfs entries for transferring the firmware image from user space into the driver. The Intel(R) PRO/Wireless 2915ABG Driver for Linux exposes sysfs entries diff --git a/Documentation/networking/altera_tse.txt b/Documentation/networking/altera_tse.txt new file mode 100644 index 00000000000..3f24df8c6e6 --- /dev/null +++ b/Documentation/networking/altera_tse.txt @@ -0,0 +1,263 @@ + Altera Triple-Speed Ethernet MAC driver + +Copyright (C) 2008-2014 Altera Corporation + +This is the driver for the Altera Triple-Speed Ethernet (TSE) controllers +using the SGDMA and MSGDMA soft DMA IP components. The driver uses the +platform bus to obtain component resources. The designs used to test this +driver were built for a Cyclone(R) V SOC FPGA board, a Cyclone(R) V FPGA board, +and tested with ARM and NIOS processor hosts seperately. The anticipated use +cases are simple communications between an embedded system and an external peer +for status and simple configuration of the embedded system. + +For more information visit www.altera.com and www.rocketboards.org. Support +forums for the driver may be found on www.rocketboards.org, and a design used +to test this driver may be found there as well. Support is also available from +the maintainer of this driver, found in MAINTAINERS. + +The Triple-Speed Ethernet, SGDMA, and MSGDMA components are all soft IP +components that can be assembled and built into an FPGA using the Altera +Quartus toolchain. Quartus 13.1 and 14.0 were used to build the design that +this driver was tested against. The sopc2dts tool is used to create the +device tree for the driver, and may be found at rocketboards.org. + +The driver probe function examines the device tree and determines if the +Triple-Speed Ethernet instance is using an SGDMA or MSGDMA component. The +probe function then installs the appropriate set of DMA routines to +initialize, setup transmits, receives, and interrupt handling primitives for +the respective configurations. + +The SGDMA component is to be deprecated in the near future (over the next 1-2 +years as of this writing in early 2014) in favor of the MSGDMA component. +SGDMA support is included for existing designs and reference in case a +developer wishes to support their own soft DMA logic and driver support. Any +new designs should not use the SGDMA. + +The SGDMA supports only a single transmit or receive operation at a time, and +therefore will not perform as well compared to the MSGDMA soft IP. Please +visit www.altera.com for known, documented SGDMA errata. + +Scatter-gather DMA is not supported by the SGDMA or MSGDMA at this time. +Scatter-gather DMA will be added to a future maintenance update to this +driver. + +Jumbo frames are not supported at this time. + +The driver limits PHY operations to 10/100Mbps, and has not yet been fully +tested for 1Gbps. This support will be added in a future maintenance update. + +1) Kernel Configuration +The kernel configuration option is ALTERA_TSE: + Device Drivers ---> Network device support ---> Ethernet driver support ---> + Altera Triple-Speed Ethernet MAC support (ALTERA_TSE) + +2) Driver parameters list: + debug: message level (0: no output, 16: all); + dma_rx_num: Number of descriptors in the RX list (default is 64); + dma_tx_num: Number of descriptors in the TX list (default is 64). + +3) Command line options +Driver parameters can be also passed in command line by using: + altera_tse=dma_rx_num:128,dma_tx_num:512 + +4) Driver information and notes + +4.1) Transmit process +When the driver's transmit routine is called by the kernel, it sets up a +transmit descriptor by calling the underlying DMA transmit routine (SGDMA or +MSGDMA), and initites a transmit operation. Once the transmit is complete, an +interrupt is driven by the transmit DMA logic. The driver handles the transmit +completion in the context of the interrupt handling chain by recycling +resource required to send and track the requested transmit operation. + +4.2) Receive process +The driver will post receive buffers to the receive DMA logic during driver +intialization. Receive buffers may or may not be queued depending upon the +underlying DMA logic (MSGDMA is able queue receive buffers, SGDMA is not able +to queue receive buffers to the SGDMA receive logic). When a packet is +received, the DMA logic generates an interrupt. The driver handles a receive +interrupt by obtaining the DMA receive logic status, reaping receive +completions until no more receive completions are available. + +4.3) Interrupt Mitigation +The driver is able to mitigate the number of its DMA interrupts +using NAPI for receive operations. Interrupt mitigation is not yet supported +for transmit operations, but will be added in a future maintenance release. + +4.4) Ethtool support +Ethtool is supported. Driver statistics and internal errors can be taken using: +ethtool -S ethX command. It is possible to dump registers etc. + +4.5) PHY Support +The driver is compatible with PAL to work with PHY and GPHY devices. + +4.7) List of source files: + o Kconfig + o Makefile + o altera_tse_main.c: main network device driver + o altera_tse_ethtool.c: ethtool support + o altera_tse.h: private driver structure and common definitions + o altera_msgdma.h: MSGDMA implementation function definitions + o altera_sgdma.h: SGDMA implementation function definitions + o altera_msgdma.c: MSGDMA implementation + o altera_sgdma.c: SGDMA implementation + o altera_sgdmahw.h: SGDMA register and descriptor definitions + o altera_msgdmahw.h: MSGDMA register and descriptor definitions + o altera_utils.c: Driver utility functions + o altera_utils.h: Driver utility function definitions + +5) Debug Information + +The driver exports debug information such as internal statistics, +debug information, MAC and DMA registers etc. + +A user may use the ethtool support to get statistics: +e.g. using: ethtool -S ethX (that shows the statistics counters) +or sees the MAC registers: e.g. using: ethtool -d ethX + +The developer can also use the "debug" module parameter to get +further debug information. + +6) Statistics Support + +The controller and driver support a mix of IEEE standard defined statistics, +RFC defined statistics, and driver or Altera defined statistics. The four +specifications containing the standard definitions for these statistics are +as follows: + + o IEEE 802.3-2012 - IEEE Standard for Ethernet. + o RFC 2863 found at http://www.rfc-editor.org/rfc/rfc2863.txt. + o RFC 2819 found at http://www.rfc-editor.org/rfc/rfc2819.txt. + o Altera Triple Speed Ethernet User Guide, found at http://www.altera.com + +The statistics supported by the TSE and the device driver are as follows: + +"tx_packets" is equivalent to aFramesTransmittedOK defined in IEEE 802.3-2012, +Section 5.2.2.1.2. This statistics is the count of frames that are successfully +transmitted. + +"rx_packets" is equivalent to aFramesReceivedOK defined in IEEE 802.3-2012, +Section 5.2.2.1.5. This statistic is the count of frames that are successfully +received. This count does not include any error packets such as CRC errors, +length errors, or alignment errors. + +"rx_crc_errors" is equivalent to aFrameCheckSequenceErrors defined in IEEE +802.3-2012, Section 5.2.2.1.6. This statistic is the count of frames that are +an integral number of bytes in length and do not pass the CRC test as the frame +is received. + +"rx_align_errors" is equivalent to aAlignmentErrors defined in IEEE 802.3-2012, +Section 5.2.2.1.7. This statistic is the count of frames that are not an +integral number of bytes in length and do not pass the CRC test as the frame is +received. + +"tx_bytes" is equivalent to aOctetsTransmittedOK defined in IEEE 802.3-2012, +Section 5.2.2.1.8. This statistic is the count of data and pad bytes +successfully transmitted from the interface. + +"rx_bytes" is equivalent to aOctetsReceivedOK defined in IEEE 802.3-2012, +Section 5.2.2.1.14. This statistic is the count of data and pad bytes +successfully received by the controller. + +"tx_pause" is equivalent to aPAUSEMACCtrlFramesTransmitted defined in IEEE +802.3-2012, Section 30.3.4.2. This statistic is a count of PAUSE frames +transmitted from the network controller. + +"rx_pause" is equivalent to aPAUSEMACCtrlFramesReceived defined in IEEE +802.3-2012, Section 30.3.4.3. This statistic is a count of PAUSE frames +received by the network controller. + +"rx_errors" is equivalent to ifInErrors defined in RFC 2863. This statistic is +a count of the number of packets received containing errors that prevented the +packet from being delivered to a higher level protocol. + +"tx_errors" is equivalent to ifOutErrors defined in RFC 2863. This statistic +is a count of the number of packets that could not be transmitted due to errors. + +"rx_unicast" is equivalent to ifInUcastPkts defined in RFC 2863. This +statistic is a count of the number of packets received that were not addressed +to the broadcast address or a multicast group. + +"rx_multicast" is equivalent to ifInMulticastPkts defined in RFC 2863. This +statistic is a count of the number of packets received that were addressed to +a multicast address group. + +"rx_broadcast" is equivalent to ifInBroadcastPkts defined in RFC 2863. This +statistic is a count of the number of packets received that were addressed to +the broadcast address. + +"tx_discards" is equivalent to ifOutDiscards defined in RFC 2863. This +statistic is the number of outbound packets not transmitted even though an +error was not detected. An example of a reason this might occur is to free up +internal buffer space. + +"tx_unicast" is equivalent to ifOutUcastPkts defined in RFC 2863. This +statistic counts the number of packets transmitted that were not addressed to +a multicast group or broadcast address. + +"tx_multicast" is equivalent to ifOutMulticastPkts defined in RFC 2863. This +statistic counts the number of packets transmitted that were addressed to a +multicast group. + +"tx_broadcast" is equivalent to ifOutBroadcastPkts defined in RFC 2863. This +statistic counts the number of packets transmitted that were addressed to a +broadcast address. + +"ether_drops" is equivalent to etherStatsDropEvents defined in RFC 2819. +This statistic counts the number of packets dropped due to lack of internal +controller resources. + +"rx_total_bytes" is equivalent to etherStatsOctets defined in RFC 2819. +This statistic counts the total number of bytes received by the controller, +including error and discarded packets. + +"rx_total_packets" is equivalent to etherStatsPkts defined in RFC 2819. +This statistic counts the total number of packets received by the controller, +including error, discarded, unicast, multicast, and broadcast packets. + +"rx_undersize" is equivalent to etherStatsUndersizePkts defined in RFC 2819. +This statistic counts the number of correctly formed packets received less +than 64 bytes long. + +"rx_oversize" is equivalent to etherStatsOversizePkts defined in RFC 2819. +This statistic counts the number of correctly formed packets greater than 1518 +bytes long. + +"rx_64_bytes" is equivalent to etherStatsPkts64Octets defined in RFC 2819. +This statistic counts the total number of packets received that were 64 octets +in length. + +"rx_65_127_bytes" is equivalent to etherStatsPkts65to127Octets defined in RFC +2819. This statistic counts the total number of packets received that were +between 65 and 127 octets in length inclusive. + +"rx_128_255_bytes" is equivalent to etherStatsPkts128to255Octets defined in +RFC 2819. This statistic is the total number of packets received that were +between 128 and 255 octets in length inclusive. + +"rx_256_511_bytes" is equivalent to etherStatsPkts256to511Octets defined in +RFC 2819. This statistic is the total number of packets received that were +between 256 and 511 octets in length inclusive. + +"rx_512_1023_bytes" is equivalent to etherStatsPkts512to1023Octets defined in +RFC 2819. This statistic is the total number of packets received that were +between 512 and 1023 octets in length inclusive. + +"rx_1024_1518_bytes" is equivalent to etherStatsPkts1024to1518Octets define +in RFC 2819. This statistic is the total number of packets received that were +between 1024 and 1518 octets in length inclusive. + +"rx_gte_1519_bytes" is a statistic defined specific to the behavior of the +Altera TSE. This statistics counts the number of received good and errored +frames between the length of 1519 and the maximum frame length configured +in the frm_length register. See the Altera TSE User Guide for More details. + +"rx_jabbers" is equivalent to etherStatsJabbers defined in RFC 2819. This +statistic is the total number of packets received that were longer than 1518 +octets, and had either a bad CRC with an integral number of octets (CRC Error) +or a bad CRC with a non-integral number of octets (Alignment Error). + +"rx_runts" is equivalent to etherStatsFragments defined in RFC 2819. This +statistic is the total number of packets received that were less than 64 octets +in length and had either a bad CRC with an integral number of octets (CRC +error) or a bad CRC with a non-integral number of octets (Alignment Error). diff --git a/Documentation/networking/arcnet.txt b/Documentation/networking/arcnet.txt index 9ff57950215..aff97f47c05 100644 --- a/Documentation/networking/arcnet.txt +++ b/Documentation/networking/arcnet.txt @@ -70,9 +70,10 @@ list, mail to linux-arcnet@tichy.ch.uj.edu.pl. There are archives of the mailing list at: http://epistolary.org/mailman/listinfo.cgi/arcnet -The people on linux-net@vger.kernel.org have also been known to be very -helpful, especially when we're talking about ALPHA Linux kernels that may or -may not work right in the first place. +The people on linux-net@vger.kernel.org (now defunct, replaced by +netdev@vger.kernel.org) have also been known to be very helpful, especially +when we're talking about ALPHA Linux kernels that may or may not work right +in the first place. Other Drivers and Info diff --git a/Documentation/networking/batman-adv.txt b/Documentation/networking/batman-adv.txt new file mode 100644 index 00000000000..58e49042fc2 --- /dev/null +++ b/Documentation/networking/batman-adv.txt @@ -0,0 +1,202 @@ +BATMAN-ADV +---------- + +Batman advanced is a new approach to wireless networking which +does no longer operate on the IP basis. Unlike the batman daemon, +which exchanges information using UDP packets and sets routing +tables, batman-advanced operates on ISO/OSI Layer 2 only and uses +and routes (or better: bridges) Ethernet Frames. It emulates a +virtual network switch of all nodes participating. Therefore all +nodes appear to be link local, thus all higher operating proto- +cols won't be affected by any changes within the network. You can +run almost any protocol above batman advanced, prominent examples +are: IPv4, IPv6, DHCP, IPX. + +Batman advanced was implemented as a Linux kernel driver to re- +duce the overhead to a minimum. It does not depend on any (other) +network driver, and can be used on wifi as well as ethernet lan, +vpn, etc ... (anything with ethernet-style layer 2). + + +CONFIGURATION +------------- + +Load the batman-adv module into your kernel: + +# insmod batman-adv.ko + +The module is now waiting for activation. You must add some in- +terfaces on which batman can operate. After loading the module +batman advanced will scan your systems interfaces to search for +compatible interfaces. Once found, it will create subfolders in +the /sys directories of each supported interface, e.g. + +# ls /sys/class/net/eth0/batman_adv/ +# iface_status mesh_iface + +If an interface does not have the "batman_adv" subfolder it prob- +ably is not supported. Not supported interfaces are: loopback, +non-ethernet and batman's own interfaces. + +Note: After the module was loaded it will continuously watch for +new interfaces to verify the compatibility. There is no need to +reload the module if you plug your USB wifi adapter into your ma- +chine after batman advanced was initially loaded. + +To activate a given interface simply write "bat0" into its +"mesh_iface" file inside the batman_adv subfolder: + +# echo bat0 > /sys/class/net/eth0/batman_adv/mesh_iface + +Repeat this step for all interfaces you wish to add. Now batman +starts using/broadcasting on this/these interface(s). + +By reading the "iface_status" file you can check its status: + +# cat /sys/class/net/eth0/batman_adv/iface_status +# active + +To deactivate an interface you have to write "none" into its +"mesh_iface" file: + +# echo none > /sys/class/net/eth0/batman_adv/mesh_iface + + +All mesh wide settings can be found in batman's own interface +folder: + +# ls /sys/class/net/bat0/mesh/ +#aggregated_ogms distributed_arp_table gw_sel_class orig_interval +#ap_isolation fragmentation hop_penalty routing_algo +#bonding gw_bandwidth isolation_mark vlan0 +#bridge_loop_avoidance gw_mode log_level + +There is a special folder for debugging information: + +# ls /sys/kernel/debug/batman_adv/bat0/ +# bla_backbone_table log transtable_global +# bla_claim_table originators transtable_local +# gateways socket + +Some of the files contain all sort of status information regard- +ing the mesh network. For example, you can view the table of +originators (mesh participants) with: + +# cat /sys/kernel/debug/batman_adv/bat0/originators + +Other files allow to change batman's behaviour to better fit your +requirements. For instance, you can check the current originator +interval (value in milliseconds which determines how often batman +sends its broadcast packets): + +# cat /sys/class/net/bat0/mesh/orig_interval +# 1000 + +and also change its value: + +# echo 3000 > /sys/class/net/bat0/mesh/orig_interval + +In very mobile scenarios, you might want to adjust the originator +interval to a lower value. This will make the mesh more respon- +sive to topology changes, but will also increase the overhead. + + +USAGE +----- + +To make use of your newly created mesh, batman advanced provides +a new interface "bat0" which you should use from this point on. +All interfaces added to batman advanced are not relevant any +longer because batman handles them for you. Basically, one "hands +over" the data by using the batman interface and batman will make +sure it reaches its destination. + +The "bat0" interface can be used like any other regular inter- +face. It needs an IP address which can be either statically con- +figured or dynamically (by using DHCP or similar services): + +# NodeA: ifconfig bat0 192.168.0.1 +# NodeB: ifconfig bat0 192.168.0.2 +# NodeB: ping 192.168.0.1 + +Note: In order to avoid problems remove all IP addresses previ- +ously assigned to interfaces now used by batman advanced, e.g. + +# ifconfig eth0 0.0.0.0 + + +LOGGING/DEBUGGING +----------------- + +All error messages, warnings and information messages are sent to +the kernel log. Depending on your operating system distribution +this can be read in one of a number of ways. Try using the com- +mands: dmesg, logread, or looking in the files /var/log/kern.log +or /var/log/syslog. All batman-adv messages are prefixed with +"batman-adv:" So to see just these messages try + +# dmesg | grep batman-adv + +When investigating problems with your mesh network it is some- +times necessary to see more detail debug messages. This must be +enabled when compiling the batman-adv module. When building bat- +man-adv as part of kernel, use "make menuconfig" and enable the +option "B.A.T.M.A.N. debugging". + +Those additional debug messages can be accessed using a special +file in debugfs + +# cat /sys/kernel/debug/batman_adv/bat0/log + +The additional debug output is by default disabled. It can be en- +abled during run time. Following log_levels are defined: + +0 - All debug output disabled +1 - Enable messages related to routing / flooding / broadcasting +2 - Enable messages related to route added / changed / deleted +4 - Enable messages related to translation table operations +8 - Enable messages related to bridge loop avoidance +16 - Enable messaged related to DAT, ARP snooping and parsing +31 - Enable all messages + +The debug output can be changed at runtime using the file +/sys/class/net/bat0/mesh/log_level. e.g. + +# echo 6 > /sys/class/net/bat0/mesh/log_level + +will enable debug messages for when routes change. + +Counters for different types of packets entering and leaving the +batman-adv module are available through ethtool: + +# ethtool --statistics bat0 + + +BATCTL +------ + +As batman advanced operates on layer 2 all hosts participating in +the virtual switch are completely transparent for all protocols +above layer 2. Therefore the common diagnosis tools do not work +as expected. To overcome these problems batctl was created. At +the moment the batctl contains ping, traceroute, tcpdump and +interfaces to the kernel module settings. + +For more information, please see the manpage (man batctl). + +batctl is available on http://www.open-mesh.org/ + + +CONTACT +------- + +Please send us comments, experiences, questions, anything :) + +IRC: #batman on irc.freenode.org +Mailing-list: b.a.t.m.a.n@open-mesh.org (optional subscription + at https://lists.open-mesh.org/mm/listinfo/b.a.t.m.a.n) + +You can also contact the Authors: + +Marek Lindner <mareklindner@neomailbox.ch> +Simon Wunderlich <sw@simonwunderlich.de> diff --git a/Documentation/networking/baycom.txt b/Documentation/networking/baycom.txt index 4e68849d563..688f18fd446 100644 --- a/Documentation/networking/baycom.txt +++ b/Documentation/networking/baycom.txt @@ -93,7 +93,7 @@ Every time a driver is inserted into the kernel, it has to know which modems it should access at which ports. This can be done with the setbaycom utility. If you are only using one modem, you can also configure the driver from the insmod command line (or by means of an option line in -/etc/modprobe.conf). +/etc/modprobe.d/*.conf). Examples: modprobe baycom_ser_fdx mode="ser12*" iobase=0x3f8 irq=4 diff --git a/Documentation/networking/bonding.txt b/Documentation/networking/bonding.txt index 5dc638791d9..9c723ecd002 100644 --- a/Documentation/networking/bonding.txt +++ b/Documentation/networking/bonding.txt @@ -1,7 +1,7 @@ Linux Ethernet Bonding Driver HOWTO - Latest update: 23 September 2009 + Latest update: 27 April 2011 Initial release : Thomas Davis <tadavis at lbl.gov> Corrections, HA extensions : 2000/10/03-15 : @@ -49,7 +49,8 @@ Table of Contents 3.3 Configuring Bonding Manually with Ifenslave 3.3.1 Configuring Multiple Bonds Manually 3.4 Configuring Bonding Manually via Sysfs -3.5 Overriding Configuration for Special Cases +3.5 Configuration with Interfaces Support +3.6 Overriding Configuration for Special Cases 4. Querying Bonding Configuration 4.1 Bonding Configuration @@ -103,8 +104,7 @@ Table of Contents ============================== Most popular distro kernels ship with the bonding driver -already available as a module and the ifenslave user level control -program installed and ready for use. If your distro does not, or you +already available as a module. If your distro does not, or you have need to compile bonding from source (e.g., configuring and installing a mainline kernel from kernel.org), you'll need to perform the following steps: @@ -123,46 +123,13 @@ device support" section. It is recommended that you configure the driver as module since it is currently the only way to pass parameters to the driver or configure more than one bonding device. - Build and install the new kernel and modules, then continue -below to install ifenslave. + Build and install the new kernel and modules. -1.2 Install ifenslave Control Utility +1.2 Bonding Control Utility ------------------------------------- - The ifenslave user level control program is included in the -kernel source tree, in the file Documentation/networking/ifenslave.c. -It is generally recommended that you use the ifenslave that -corresponds to the kernel that you are using (either from the same -source tree or supplied with the distro), however, ifenslave -executables from older kernels should function (but features newer -than the ifenslave release are not supported). Running an ifenslave -that is newer than the kernel is not supported, and may or may not -work. - - To install ifenslave, do the following: - -# gcc -Wall -O -I/usr/src/linux/include ifenslave.c -o ifenslave -# cp ifenslave /sbin/ifenslave - - If your kernel source is not in "/usr/src/linux," then replace -"/usr/src/linux/include" in the above with the location of your kernel -source include directory. - - You may wish to back up any existing /sbin/ifenslave, or, for -testing or informal use, tag the ifenslave to the kernel version -(e.g., name the ifenslave executable /sbin/ifenslave-2.6.10). - -IMPORTANT NOTE: - - If you omit the "-I" or specify an incorrect directory, you -may end up with an ifenslave that is incompatible with the kernel -you're trying to build it for. Some distros (e.g., Red Hat from 7.1 -onwards) do not have /usr/include/linux symbolically linked to the -default kernel source include directory. - -SECOND IMPORTANT NOTE: - If you plan to configure bonding using sysfs, you do not need -to use ifenslave. + It is recommended to configure bonding via iproute2 (netlink) +or sysfs, the old ifenslave control utility is obsolete. 2. Bonding Driver Options ========================= @@ -172,9 +139,8 @@ bonding module at load time, or are specified via sysfs. Module options may be given as command line arguments to the insmod or modprobe command, but are usually specified in either the -/etc/modules.conf or /etc/modprobe.conf configuration file, or in a -distro-specific configuration file (some of which are detailed in the next -section). +/etc/modrobe.d/*.conf configuration files, or in a distro-specific +configuration file (some of which are detailed in the next section). Details on bonding support for sysfs is provided in the "Configuring Bonding Manually via Sysfs" section, below. @@ -195,6 +161,23 @@ or, for backwards compatibility, the option value. E.g., The parameters are as follows: +active_slave + + Specifies the new active slave for modes that support it + (active-backup, balance-alb and balance-tlb). Possible values + are the name of any currently enslaved interface, or an empty + string. If a name is given, the slave and its link must be up in order + to be selected as the new active slave. If an empty string is + specified, the current active slave is cleared, and a new active + slave is selected automatically. + + Note that this is only available through the sysfs interface. No module + parameter by this name exists. + + The normal value of this option is the name of the currently + active slave, or the empty string if there is no active slave or + the current mode does not use an active slave. + ad_select Specifies the 802.3ad aggregation selection logic to use. The @@ -237,6 +220,18 @@ ad_select This option was added in bonding version 3.4.0. +all_slaves_active + + Specifies that duplicate frames (received on inactive ports) should be + dropped (0) or delivered (1). + + Normally, bonding will drop duplicate frames (received on inactive + ports), which is desirable for most users. But there are some times + it is nice to allow duplicate frames to be delivered. + + The default value is 0 (drop duplicate frames received on inactive + ports). + arp_interval Specifies the ARP link monitoring frequency in milliseconds. @@ -275,16 +270,15 @@ arp_ip_target arp_validate Specifies whether or not ARP probes and replies should be - validated in the active-backup mode. This causes the ARP - monitor to examine the incoming ARP requests and replies, and - only consider a slave to be up if it is receiving the - appropriate ARP traffic. + validated in any mode that supports arp monitoring, or whether + non-ARP traffic should be filtered (disregarded) for link + monitoring purposes. Possible values are: none or 0 - No validation is performed. This is the default. + No validation or filtering is performed. active or 1 @@ -298,28 +292,90 @@ arp_validate Validation is performed for all slaves. - For the active slave, the validation checks ARP replies to - confirm that they were generated by an arp_ip_target. Since - backup slaves do not typically receive these replies, the - validation performed for backup slaves is on the ARP request - sent out via the active slave. It is possible that some - switch or network configurations may result in situations - wherein the backup slaves do not receive the ARP requests; in - such a situation, validation of backup slaves must be - disabled. - - This option is useful in network configurations in which - multiple bonding hosts are concurrently issuing ARPs to one or - more targets beyond a common switch. Should the link between - the switch and target fail (but not the switch itself), the - probe traffic generated by the multiple bonding instances will - fool the standard ARP monitor into considering the links as - still up. Use of the arp_validate option can resolve this, as - the ARP monitor will only consider ARP requests and replies - associated with its own instance of bonding. + filter or 4 + + Filtering is applied to all slaves. No validation is + performed. + + filter_active or 5 + + Filtering is applied to all slaves, validation is performed + only for the active slave. + + filter_backup or 6 + + Filtering is applied to all slaves, validation is performed + only for backup slaves. + + Validation: + + Enabling validation causes the ARP monitor to examine the incoming + ARP requests and replies, and only consider a slave to be up if it + is receiving the appropriate ARP traffic. + + For an active slave, the validation checks ARP replies to confirm + that they were generated by an arp_ip_target. Since backup slaves + do not typically receive these replies, the validation performed + for backup slaves is on the broadcast ARP request sent out via the + active slave. It is possible that some switch or network + configurations may result in situations wherein the backup slaves + do not receive the ARP requests; in such a situation, validation + of backup slaves must be disabled. + + The validation of ARP requests on backup slaves is mainly helping + bonding to decide which slaves are more likely to work in case of + the active slave failure, it doesn't really guarantee that the + backup slave will work if it's selected as the next active slave. + + Validation is useful in network configurations in which multiple + bonding hosts are concurrently issuing ARPs to one or more targets + beyond a common switch. Should the link between the switch and + target fail (but not the switch itself), the probe traffic + generated by the multiple bonding instances will fool the standard + ARP monitor into considering the links as still up. Use of + validation can resolve this, as the ARP monitor will only consider + ARP requests and replies associated with its own instance of + bonding. + + Filtering: + + Enabling filtering causes the ARP monitor to only use incoming ARP + packets for link availability purposes. Arriving packets that are + not ARPs are delivered normally, but do not count when determining + if a slave is available. + + Filtering operates by only considering the reception of ARP + packets (any ARP packet, regardless of source or destination) when + determining if a slave has received traffic for link availability + purposes. + + Filtering is useful in network configurations in which significant + levels of third party broadcast traffic would fool the standard + ARP monitor into considering the links as still up. Use of + filtering can resolve this, as only ARP traffic is considered for + link availability purposes. This option was added in bonding version 3.1.0. +arp_all_targets + + Specifies the quantity of arp_ip_targets that must be reachable + in order for the ARP monitor to consider a slave as being up. + This option affects only active-backup mode for slaves with + arp_validation enabled. + + Possible values are: + + any or 0 + + consider the slave up only when any of the arp_ip_targets + is reachable + + all or 1 + + consider the slave up only when all of the arp_ip_targets + are reachable + downdelay Specifies the time, in milliseconds, to wait before disabling @@ -367,7 +423,7 @@ fail_over_mac gratuitous ARP is lost, communication may be disrupted. - When this policy is used in conjuction with the mii + When this policy is used in conjunction with the mii monitor, devices which assert link up prior to being able to actually transmit and receive are particularly susceptible to loss of the gratuitous ARP, and an @@ -432,6 +488,23 @@ miimon determined. See the High Availability section for additional information. The default value is 0. +min_links + + Specifies the minimum number of links that must be active before + asserting carrier. It is similar to the Cisco EtherChannel min-links + feature. This allows setting the minimum number of member ports that + must be up (link-up state) before marking the bond device as up + (carrier on). This is useful for situations where higher level services + such as clustering want to ensure a minimum number of low bandwidth + links are active before switchover. This option only affect 802.3ad + mode. + + The default value is 0. This will cause carrier to be asserted (for + 802.3ad mode) whenever there is an active aggregator, regardless of the + number of available links in that aggregator. Note that, because an + aggregator cannot be active without at least one available link, + setting this option to 0 or to 1 has the exact same effect. + mode Specifies one of the bonding policies. The default is @@ -512,13 +585,19 @@ mode balance-tlb or 5 Adaptive transmit load balancing: channel bonding that - does not require any special switch support. The - outgoing traffic is distributed according to the - current load (computed relative to the speed) on each - slave. Incoming traffic is received by the current - slave. If the receiving slave fails, another slave - takes over the MAC address of the failed receiving - slave. + does not require any special switch support. + + In tlb_dynamic_lb=1 mode; the outgoing traffic is + distributed according to the current load (computed + relative to the speed) on each slave. + + In tlb_dynamic_lb=0 mode; the load balancing based on + current load is disabled and the load is distributed + only using the hash distribution. + + Incoming traffic is received by the current slave. + If the receiving slave fails, another slave takes over + the MAC address of the failed receiving slave. Prerequisite: @@ -584,25 +663,32 @@ mode chosen. num_grat_arp +num_unsol_na - Specifies the number of gratuitous ARPs to be issued after a - failover event. One gratuitous ARP is issued immediately after - the failover, subsequent ARPs are sent at a rate of one per link - monitor interval (arp_interval or miimon, whichever is active). + Specify the number of peer notifications (gratuitous ARPs and + unsolicited IPv6 Neighbor Advertisements) to be issued after a + failover event. As soon as the link is up on the new slave + (possibly immediately) a peer notification is sent on the + bonding device and each VLAN sub-device. This is repeated at + each link monitor interval (arp_interval or miimon, whichever + is active) if the number is greater than 1. - The valid range is 0 - 255; the default value is 1. This option - affects only the active-backup mode. This option was added for - bonding version 3.3.0. + The valid range is 0 - 255; the default value is 1. These options + affect only the active-backup mode. These options were added for + bonding versions 3.3.0 and 3.4.0 respectively. -num_unsol_na + From Linux 3.0 and bonding version 3.7.1, these notifications + are generated by the ipv4 and ipv6 code and the numbers of + repetitions cannot be set independently. + +packets_per_slave - Specifies the number of unsolicited IPv6 Neighbor Advertisements - to be issued after a failover event. One unsolicited NA is issued - immediately after the failover. + Specify the number of packets to transmit through a slave before + moving to the next one. When set to 0 then a slave is chosen at + random. - The valid range is 0 - 255; the default value is 1. This option - affects only the active-backup mode. This option was added for - bonding version 3.4.0. + The valid range is 0 - 65535; the default value is 1. This option + has effect only in balance-rr mode. primary @@ -613,7 +699,8 @@ primary one slave is preferred over another, e.g., when one slave has higher throughput than another. - The primary option is only valid for active-backup mode. + The primary option is only valid for active-backup(1), + balance-tlb (5) and balance-alb (6) mode. primary_reselect @@ -655,6 +742,28 @@ primary_reselect This option was added for bonding version 3.6.0. +tlb_dynamic_lb + + Specifies if dynamic shuffling of flows is enabled in tlb + mode. The value has no effect on any other modes. + + The default behavior of tlb mode is to shuffle active flows across + slaves based on the load in that interval. This gives nice lb + characteristics but can cause packet reordering. If re-ordering is + a concern use this variable to disable flow shuffling and rely on + load balancing provided solely by the hash distribution. + xmit-hash-policy can be used to select the appropriate hashing for + the setup. + + The sysfs entry can be used to change the setting per bond device + and the initial value is derived from the module parameter. The + sysfs entry is allowed to be changed only if the bond device is + down. + + The default value is "1" that enables flow shuffling while value "0" + disables it. This option was added in bonding driver 3.7.1 + + updelay Specifies the time, in milliseconds, to wait before enabling a @@ -688,7 +797,7 @@ use_carrier xmit_hash_policy Selects the transmit hash policy to use for slave selection in - balance-xor and 802.3ad modes. Possible values are: + balance-xor, 802.3ad, and tlb modes. Possible values are: layer2 @@ -710,9 +819,14 @@ xmit_hash_policy Uses XOR of hardware MAC addresses and IP addresses to generate the hash. The formula is - (((source IP XOR dest IP) AND 0xffff) XOR - ( source MAC XOR destination MAC )) - modulo slave count + hash = source MAC XOR destination MAC + hash = hash XOR source IP XOR destination IP + hash = hash XOR (hash RSHIFT 16) + hash = hash XOR (hash RSHIFT 8) + And then hash is reduced modulo slave count. + + If the protocol is IPv6 then the source and destination + addresses are first hashed using ipv6_addr_hash. This algorithm will place all traffic to a particular network peer on the same slave. For non-IP traffic, @@ -736,20 +850,21 @@ xmit_hash_policy The formula for unfragmented TCP and UDP packets is - ((source port XOR dest port) XOR - ((source IP XOR dest IP) AND 0xffff) - modulo slave count + hash = source port, destination port (as in the header) + hash = hash XOR source IP XOR destination IP + hash = hash XOR (hash RSHIFT 16) + hash = hash XOR (hash RSHIFT 8) + And then hash is reduced modulo slave count. + + If the protocol is IPv6 then the source and destination + addresses are first hashed using ipv6_addr_hash. - For fragmented TCP or UDP packets and all other IP - protocol traffic, the source and destination port + For fragmented TCP or UDP packets and all other IPv4 and + IPv6 protocol traffic, the source and destination port information is omitted. For non-IP traffic, the formula is the same as for the layer2 transmit hash policy. - This policy is intended to mimic the behavior of - certain switches, notably Cisco switches with PFC2 as - well as some Foundry and IBM products. - This algorithm is not fully 802.3ad compliant. A single TCP or UDP conversation containing both fragmented and unfragmented packets will see packets @@ -760,6 +875,26 @@ xmit_hash_policy conversations. Other implementations of 802.3ad may or may not tolerate this noncompliance. + encap2+3 + + This policy uses the same formula as layer2+3 but it + relies on skb_flow_dissect to obtain the header fields + which might result in the use of inner headers if an + encapsulation protocol is used. For example this will + improve the performance for tunnel users because the + packets will be distributed according to the encapsulated + flows. + + encap3+4 + + This policy uses the same formula as layer3+4 but it + relies on skb_flow_dissect to obtain the header fields + which might result in the use of inner headers if an + encapsulation protocol is used. For example this will + improve the performance for tunnel users because the + packets will be distributed according to the encapsulated + flows. + The default value is layer2. This option was added in bonding version 2.6.3. In earlier versions of bonding, this parameter does not exist, and the layer2 policy is the only policy. The @@ -771,30 +906,51 @@ resend_igmp a failover event. One membership report is issued immediately after the failover, subsequent packets are sent in each 200ms interval. - The valid range is 0 - 255; the default value is 1. This option - was added for bonding version 3.7.0. + The valid range is 0 - 255; the default value is 1. A value of 0 + prevents the IGMP membership report from being issued in response + to the failover event. + + This option is useful for bonding modes balance-rr (0), active-backup + (1), balance-tlb (5) and balance-alb (6), in which a failover can + switch the IGMP traffic from one slave to another. Therefore a fresh + IGMP report must be issued to cause the switch to forward the incoming + IGMP traffic over the newly selected slave. + + This option was added for bonding version 3.7.0. + +lp_interval + + Specifies the number of seconds between instances where the bonding + driver sends learning packets to each slaves peer switch. + + The valid range is 1 - 0x7fffffff; the default value is 1. This Option + has effect only in balance-tlb and balance-alb modes. 3. Configuring Bonding Devices ============================== You can configure bonding using either your distro's network -initialization scripts, or manually using either ifenslave or the -sysfs interface. Distros generally use one of two packages for the -network initialization scripts: initscripts or sysconfig. Recent -versions of these packages have support for bonding, while older +initialization scripts, or manually using either iproute2 or the +sysfs interface. Distros generally use one of three packages for the +network initialization scripts: initscripts, sysconfig or interfaces. +Recent versions of these packages have support for bonding, while older versions do not. We will first describe the options for configuring bonding for -distros using versions of initscripts and sysconfig with full or -partial support for bonding, then provide information on enabling +distros using versions of initscripts, sysconfig and interfaces with full +or partial support for bonding, then provide information on enabling bonding without support from the network initialization scripts (i.e., older versions of initscripts or sysconfig). - If you're unsure whether your distro uses sysconfig or -initscripts, or don't know if it's new enough, have no fear. + If you're unsure whether your distro uses sysconfig, +initscripts or interfaces, or don't know if it's new enough, have no fear. Determining this is fairly straightforward. - First, issue the command: + First, look for a file called interfaces in /etc/network directory. +If this file is present in your system, then your system use interfaces. See +Configuration with Interfaces Support. + + Else, issue the command: $ rpm -qf /sbin/ifup @@ -963,7 +1119,7 @@ ifcfg-bondX files. Because the sysconfig scripts supply the bonding module options in the ifcfg-bondX file, it is not necessary to add them to -the system /etc/modules.conf or /etc/modprobe.conf configuration file. +the system /etc/modules.d/*.conf configuration files. 3.2 Configuration with Initscripts Support ------------------------------------------ @@ -1040,15 +1196,13 @@ queried targets, e.g., arp_ip_target=+192.168.1.1 arp_ip_target=+192.168.1.2 is the proper syntax to specify multiple targets. When specifying -options via BONDING_OPTS, it is not necessary to edit /etc/modules.conf or -/etc/modprobe.conf. +options via BONDING_OPTS, it is not necessary to edit /etc/modprobe.d/*.conf. For even older versions of initscripts that do not support -BONDING_OPTS, it is necessary to edit /etc/modules.conf (or -/etc/modprobe.conf, depending upon your distro) to load the bonding module -with your desired options when the bond0 interface is brought up. The -following lines in /etc/modules.conf (or modprobe.conf) will load the -bonding module, and select its options: +BONDING_OPTS, it is necessary to edit /etc/modprobe.d/*.conf, depending upon +your distro) to load the bonding module with your desired options when the +bond0 interface is brought up. The following lines in /etc/modprobe.d/*.conf +will load the bonding module, and select its options: alias bond0 bonding options bond0 mode=balance-alb miimon=100 @@ -1085,7 +1239,7 @@ not support this method for specifying multiple bonding interfaces; for those instances, see the "Configuring Multiple Bonds Manually" section, below. -3.3 Configuring Bonding Manually with Ifenslave +3.3 Configuring Bonding Manually with iproute2 ----------------------------------------------- This section applies to distros whose network initialization @@ -1094,9 +1248,9 @@ knowledge of bonding. One such distro is SuSE Linux Enterprise Server version 8. The general method for these systems is to place the bonding -module parameters into /etc/modules.conf or /etc/modprobe.conf (as +module parameters into a config file in /etc/modprobe.d/ (as appropriate for the installed distro), then add modprobe and/or -ifenslave commands to the system's global init script. The name of +`ip link` commands to the system's global init script. The name of the global init script differs; for sysconfig, it is /etc/init.d/boot.local and for initscripts it is /etc/rc.d/rc.local. @@ -1108,8 +1262,8 @@ reboots, edit the appropriate file (/etc/init.d/boot.local or modprobe bonding mode=balance-alb miimon=100 modprobe e100 ifconfig bond0 192.168.1.1 netmask 255.255.255.0 up -ifenslave bond0 eth0 -ifenslave bond0 eth1 +ip link set eth0 master bond0 +ip link set eth1 master bond0 Replace the example bonding module parameters and bond0 network configuration (IP address, netmask, etc) with the appropriate @@ -1155,7 +1309,7 @@ options, you may wish to use the "max_bonds" module parameter, documented above. To create multiple bonding devices with differing options, it is -preferrable to use bonding parameters exported by sysfs, documented in the +preferable to use bonding parameters exported by sysfs, documented in the section below. For versions of bonding without sysfs support, the only means to @@ -1170,7 +1324,7 @@ network initialization scripts. specify a different name for each instance (the module loading system requires that every loaded module, even multiple instances of the same module, have a unique name). This is accomplished by supplying multiple -sets of bonding options in /etc/modprobe.conf, for example: +sets of bonding options in /etc/modprobe.d/*.conf, for example: alias bond0 bonding options bond0 -o bond0 mode=balance-rr miimon=100 @@ -1296,6 +1450,12 @@ To add ARP targets: To remove an ARP target: # echo -192.168.0.100 > /sys/class/net/bond0/bonding/arp_ip_target +To configure the interval between learning packet transmits: +# echo 12 > /sys/class/net/bond0/bonding/lp_interval + NOTE: the lp_inteval is the number of seconds between instances where +the bonding driver sends learning packets to each slaves peer switch. The +default interval is 1 second. + Example Configuration --------------------- We begin with the same example that is shown in section 3.3, @@ -1327,8 +1487,62 @@ echo 2000 > /sys/class/net/bond1/bonding/arp_interval echo +eth2 > /sys/class/net/bond1/bonding/slaves echo +eth3 > /sys/class/net/bond1/bonding/slaves -3.5 Overriding Configuration for Special Cases +3.5 Configuration with Interfaces Support +----------------------------------------- + + This section applies to distros which use /etc/network/interfaces file +to describe network interface configuration, most notably Debian and it's +derivatives. + + The ifup and ifdown commands on Debian don't support bonding out of +the box. The ifenslave-2.6 package should be installed to provide bonding +support. Once installed, this package will provide bond-* options to be used +into /etc/network/interfaces. + + Note that ifenslave-2.6 package will load the bonding module and use +the ifenslave command when appropriate. + +Example Configurations +---------------------- + +In /etc/network/interfaces, the following stanza will configure bond0, in +active-backup mode, with eth0 and eth1 as slaves. + +auto bond0 +iface bond0 inet dhcp + bond-slaves eth0 eth1 + bond-mode active-backup + bond-miimon 100 + bond-primary eth0 eth1 + +If the above configuration doesn't work, you might have a system using +upstart for system startup. This is most notably true for recent +Ubuntu versions. The following stanza in /etc/network/interfaces will +produce the same result on those systems. + +auto bond0 +iface bond0 inet dhcp + bond-slaves none + bond-mode active-backup + bond-miimon 100 + +auto eth0 +iface eth0 inet manual + bond-master bond0 + bond-primary eth0 eth1 + +auto eth1 +iface eth1 inet manual + bond-master bond0 + bond-primary eth0 eth1 + +For a full list of bond-* supported options in /etc/network/interfaces and some +more advanced examples tailored to you particular distros, see the files in +/usr/share/doc/ifenslave-2.6. + +3.6 Overriding Configuration for Special Cases ---------------------------------------------- + When using the bonding driver, the physical port which transmits a frame is typically selected by the bonding driver, and is not relevant to the user or system administrator. The output port is simply selected using the policies of @@ -1681,8 +1895,8 @@ route additions may cause trouble. On systems with network configuration scripts that do not associate physical devices directly with network interface names (so that the same physical device always has the same "ethX" name), it may -be necessary to add some special logic to either /etc/modules.conf or -/etc/modprobe.conf (depending upon which is installed on the system). +be necessary to add some special logic to config files in +/etc/modprobe.d/. For example, given a modules.conf containing the following: @@ -1709,20 +1923,15 @@ add above bonding e1000 tg3 bonding is loaded. This command is fully documented in the modules.conf manual page. - On systems utilizing modprobe.conf (or modprobe.conf.local), -an equivalent problem can occur. In this case, the following can be -added to modprobe.conf (or modprobe.conf.local, as appropriate), as -follows (all on one line; it has been split here for clarity): + On systems utilizing modprobe an equivalent problem can occur. +In this case, the following can be added to config files in +/etc/modprobe.d/ as: -install bonding /sbin/modprobe tg3; /sbin/modprobe e1000; - /sbin/modprobe --ignore-install bonding +softdep bonding pre: tg3 e1000 - This will, when loading the bonding module, rather than -performing the normal action, instead execute the provided command. -This command loads the device drivers in the order needed, then calls -modprobe with --ignore-install to cause the normal action to then take -place. Full documentation on this can be found in the modprobe.conf -and modprobe manual pages. + This will load tg3 and e1000 modules before loading the bonding one. +Full documentation on this can be found in the modprobe.d and modprobe +manual pages. 8.3. Painfully Slow Or No Failed Link Detection By Miimon --------------------------------------------------------- @@ -1846,7 +2055,7 @@ access to fail over to. Additionally, the bonding load balance modes support link monitoring of their members, so if individual links fail, the load will be rebalanced across the remaining devices. - See Section 13, "Configuring Bonding for Maximum Throughput" + See Section 12, "Configuring Bonding for Maximum Throughput" for information on configuring bonding with one peer device. 11.2 High Availability in a Multiple Switch Topology @@ -2499,18 +2708,15 @@ enslaved. 16. Resources and Links ======================= -The latest version of the bonding driver can be found in the latest + The latest version of the bonding driver can be found in the latest version of the linux kernel, found on http://kernel.org -The latest version of this document can be found in either the latest -kernel source (named Documentation/networking/bonding.txt), or on the -bonding sourceforge site: + The latest version of this document can be found in the latest kernel +source (named Documentation/networking/bonding.txt). -http://www.sourceforge.net/projects/bonding - -Discussions regarding the bonding driver take place primarily on the -bonding-devel mailing list, hosted at sourceforge.net. If you have -questions or problems, post them to the list. The list address is: + Discussions regarding the usage of the bonding driver take place on the +bonding-devel mailing list, hosted at sourceforge.net. If you have questions or +problems, post them to the list. The list address is: bonding-devel@lists.sourceforge.net @@ -2519,6 +2725,17 @@ be found at: https://lists.sourceforge.net/lists/listinfo/bonding-devel + Discussions regarding the development of the bonding driver take place +on the main Linux network mailing list, hosted at vger.kernel.org. The list +address is: + +netdev@vger.kernel.org + + The administrative interface (to subscribe or unsubscribe) can +be found at: + +http://vger.kernel.org/vger-lists.html#netdev + Donald Becker's Ethernet Drivers and diag programs may be found at : - http://web.archive.org/web/*/http://www.scyld.com/network/ diff --git a/Documentation/networking/bridge.txt b/Documentation/networking/bridge.txt index bec69a8a169..a27cb6214ed 100644 --- a/Documentation/networking/bridge.txt +++ b/Documentation/networking/bridge.txt @@ -1,8 +1,15 @@ In order to use the Ethernet bridging functionality, you'll need the -userspace tools. These programs and documentation are available -at http://www.linux-foundation.org/en/Net:Bridge. The download page is -http://prdownloads.sourceforge.net/bridge. +userspace tools. + +Documentation for Linux bridging is on: + http://www.linuxfoundation.org/collaborate/workgroups/networking/bridge + +The bridge-utilities are maintained at: + git://git.kernel.org/pub/scm/linux/kernel/git/shemminger/bridge-utils.git + +Additionally, the iproute2 utilities can be used to configure +bridge devices. If you still have questions, don't hesitate to post to the mailing list -(more info http://lists.osdl.org/mailman/listinfo/bridge). +(more info https://lists.linux-foundation.org/mailman/listinfo/bridge). diff --git a/Documentation/networking/caif/Linux-CAIF.txt b/Documentation/networking/caif/Linux-CAIF.txt index 7fe7a9a33a4..0aa4bd381be 100644 --- a/Documentation/networking/caif/Linux-CAIF.txt +++ b/Documentation/networking/caif/Linux-CAIF.txt @@ -19,60 +19,36 @@ and host. Currently, UART and Loopback are available for Linux. Architecture: ------------ The implementation of CAIF is divided into: -* CAIF Socket Layer, Kernel API, and Net Device. +* CAIF Socket Layer and GPRS IP Interface. * CAIF Core Protocol Implementation * CAIF Link Layer, implemented as NET devices. RTNL ! - ! +------+ +------+ +------+ - ! +------+! +------+! +------+! - ! ! Sock !! !Kernel!! ! Net !! - ! ! API !+ ! API !+ ! Dev !+ <- CAIF Client APIs - ! +------+ +------! +------+ - ! ! ! ! - ! +----------!----------+ - ! +------+ <- CAIF Protocol Implementation - +-------> ! CAIF ! - ! Core ! - +------+ - +--------!--------+ - ! ! - +------+ +-----+ - ! ! ! TTY ! <- Link Layer (Net Devices) - +------+ +-----+ - - -Using the Kernel API ----------------------- -The Kernel API is used for accessing CAIF channels from the -kernel. -The user of the API has to implement two callbacks for receive -and control. -The receive callback gives a CAIF packet as a SKB. The control -callback will -notify of channel initialization complete, and flow-on/flow- -off. - - - struct caif_device caif_dev = { - .caif_config = { - .name = "MYDEV" - .type = CAIF_CHTY_AT - } - .receive_cb = my_receive, - .control_cb = my_control, - }; - caif_add_device(&caif_dev); - caif_transmit(&caif_dev, skb); - -See the caif_kernel.h for details about the CAIF kernel API. + ! +------+ +------+ + ! +------+! +------+! + ! ! IP !! !Socket!! + +-------> !interf!+ ! API !+ <- CAIF Client APIs + ! +------+ +------! + ! ! ! + ! +-----------+ + ! ! + ! +------+ <- CAIF Core Protocol + ! ! CAIF ! + ! ! Core ! + ! +------+ + ! +----------!---------+ + ! ! ! ! + ! +------+ +-----+ +------+ + +--> ! HSI ! ! TTY ! ! USB ! <- Link Layer (Net Devices) + +------+ +-----+ +------+ + I M P L E M E N T A T I O N =========================== -=========================== + CAIF Core Protocol Layer ========================================= @@ -88,17 +64,13 @@ The Core CAIF implementation contains: - Simple implementation of CAIF. - Layered architecture (a la Streams), each layer in the CAIF specification is implemented in a separate c-file. - - Clients must implement PHY layer to access physical HW - with receive and transmit functions. - Clients must call configuration function to add PHY layer. - Clients must implement CAIF layer to consume/produce CAIF payload with receive and transmit functions. - Clients must call configuration function to add and connect the Client layer. - When receiving / transmitting CAIF Packets (cfpkt), ownership is passed - to the called function (except for framing layers' receive functions - or if a transmit function returns an error, in which case the caller - must free the packet). + to the called function (except for framing layers' receive function) Layered Architecture -------------------- @@ -109,11 +81,6 @@ Implementation. The support functions include: CAIF Packet has functions for creating, destroying and adding content and for adding/extracting header and trailers to protocol packets. - - CFLST CAIF list implementation. - - - CFGLUE CAIF Glue. Contains OS Specifics, such as memory - allocation, endianness, etc. - The CAIF Protocol implementation contains: - CFCNFG CAIF Configuration layer. Configures the CAIF Protocol @@ -128,7 +95,7 @@ The CAIF Protocol implementation contains: control and remote shutdown requests. - CFVEI CAIF VEI layer. Handles CAIF AT Channels on VEI (Virtual - External Interface). This layer encodes/decodes VEI frames. + External Interface). This layer encodes/decodes VEI frames. - CFDGML CAIF Datagram layer. Handles CAIF Datagram layer (IP traffic), encodes/decodes Datagram frames. @@ -136,7 +103,7 @@ The CAIF Protocol implementation contains: - CFMUX CAIF Mux layer. Handles multiplexing between multiple physical bearers and multiple channels such as VEI, Datagram, etc. The MUX keeps track of the existing CAIF Channels and - Physical Instances and selects the apropriate instance based + Physical Instances and selects the appropriate instance based on Channel-Id and Physical-ID. - CFFRML CAIF Framing layer. Handles Framing i.e. Frame length @@ -170,7 +137,7 @@ The CAIF Protocol implementation contains: +---------+ +---------+ ! ! +---------+ +---------+ - | | | Serial | + | | | Serial | | | | CFSERL | +---------+ +---------+ @@ -186,24 +153,20 @@ In this layered approach the following "rules" apply. layer->dn->transmit(layer->dn, packet); -Linux Driver Implementation +CAIF Socket and IP interface =========================== -Linux GPRS Net Device and CAIF socket are implemented on top of the -CAIF Core protocol. The Net device and CAIF socket have an instance of +The IP interface and CAIF socket API are implemented on top of the +CAIF Core protocol. The IP Interface and CAIF socket have an instance of 'struct cflayer', just like the CAIF Core protocol stack. Net device and Socket implement the 'receive()' function defined by 'struct cflayer', just like the rest of the CAIF stack. In this way, transmit and receive of packets is handled as by the rest of the layers: the 'dn->transmit()' function is called in order to transmit data. -The layer on top of the CAIF Core implementation is -sometimes referred to as the "Client layer". - - Configuration of Link Layer --------------------------- -The Link Layer is implemented as Linux net devices (struct net_device). +The Link Layer is implemented as Linux network devices (struct net_device). Payload handling and registration is done using standard Linux mechanisms. The CAIF Protocol relies on a loss-less link layer without implementing diff --git a/Documentation/networking/caif/spi_porting.txt b/Documentation/networking/caif/spi_porting.txt index 61d7c924745..9efd0687dc4 100644 --- a/Documentation/networking/caif/spi_porting.txt +++ b/Documentation/networking/caif/spi_porting.txt @@ -32,7 +32,7 @@ the physical hardware, both with regard to SPI and to GPIOs. This function is called by the CAIF SPI interface to give you a chance to set up your hardware to be ready to receive a stream of data from the master. The xfer structure contains - both physical and logical adresses, as well as the total length + both physical and logical addresses, as well as the total length of the transfer in both directions.The dev parameter can be used to map to different CAIF SPI slave devices. @@ -150,7 +150,7 @@ static int sspi_init_xfer(struct cfspi_xfer *xfer, struct cfspi_dev *dev) void sspi_sig_xfer(bool xfer, struct cfspi_dev *dev) { /* If xfer is true then you should assert the SPI_INT to indicate to - * the master that you are ready to recieve the data from the master + * the master that you are ready to receive the data from the master * SPI. If xfer is false then you should de-assert SPI_INT to indicate * that the transfer is done. */ diff --git a/Documentation/networking/can.txt b/Documentation/networking/can.txt index 5b04b67ddca..2236d6dcb7d 100644 --- a/Documentation/networking/can.txt +++ b/Documentation/networking/can.txt @@ -2,32 +2,38 @@ can.txt -Readme file for the Controller Area Network Protocol Family (aka Socket CAN) +Readme file for the Controller Area Network Protocol Family (aka SocketCAN) This file contains - 1 Overview / What is Socket CAN + 1 Overview / What is SocketCAN 2 Motivation / Why using the socket API - 3 Socket CAN concept + 3 SocketCAN concept 3.1 receive lists 3.2 local loopback of sent frames - 3.3 network security issues (capabilities) - 3.4 network problem notifications + 3.3 network problem notifications - 4 How to use Socket CAN + 4 How to use SocketCAN 4.1 RAW protocol sockets with can_filters (SOCK_RAW) 4.1.1 RAW socket option CAN_RAW_FILTER 4.1.2 RAW socket option CAN_RAW_ERR_FILTER 4.1.3 RAW socket option CAN_RAW_LOOPBACK 4.1.4 RAW socket option CAN_RAW_RECV_OWN_MSGS - 4.1.5 RAW socket returned message flags + 4.1.5 RAW socket option CAN_RAW_FD_FRAMES + 4.1.6 RAW socket returned message flags 4.2 Broadcast Manager protocol sockets (SOCK_DGRAM) + 4.2.1 Broadcast Manager operations + 4.2.2 Broadcast Manager message flags + 4.2.3 Broadcast Manager transmission timers + 4.2.4 Broadcast Manager message sequence transmission + 4.2.5 Broadcast Manager receive filter timers + 4.2.6 Broadcast Manager multiplex message receive filter 4.3 connected transport protocols (SOCK_SEQPACKET) 4.4 unconnected transport protocols (SOCK_DGRAM) - 5 Socket CAN core module + 5 SocketCAN core module 5.1 can.ko module params 5.2 procfs content 5.3 writing own CAN protocol modules @@ -41,22 +47,23 @@ This file contains 6.5.1 Netlink interface to set/get devices properties 6.5.2 Setting the CAN bit-timing 6.5.3 Starting and stopping the CAN network device - 6.6 supported CAN hardware + 6.6 CAN FD (flexible data rate) driver support + 6.7 supported CAN hardware - 7 Socket CAN resources + 7 SocketCAN resources 8 Credits ============================================================================ -1. Overview / What is Socket CAN +1. Overview / What is SocketCAN -------------------------------- The socketcan package is an implementation of CAN protocols (Controller Area Network) for Linux. CAN is a networking technology which has widespread use in automation, embedded devices, and automotive fields. While there have been other CAN implementations -for Linux based on character devices, Socket CAN uses the Berkeley +for Linux based on character devices, SocketCAN uses the Berkeley socket API, the Linux network stack and implements the CAN device drivers as network interfaces. The CAN socket API has been designed as similar as possible to the TCP/IP protocols to allow programmers, @@ -66,7 +73,7 @@ sockets. 2. Motivation / Why using the socket API ---------------------------------------- -There have been CAN implementations for Linux before Socket CAN so the +There have been CAN implementations for Linux before SocketCAN so the question arises, why we have started another project. Most existing implementations come as a device driver for some CAN hardware, they are based on character devices and provide comparatively little @@ -81,10 +88,10 @@ the CAN controller requires employment of another device driver and often the need for adaption of large parts of the application to the new driver's API. -Socket CAN was designed to overcome all of these limitations. A new +SocketCAN was designed to overcome all of these limitations. A new protocol family has been implemented which provides a socket interface to user space applications and which builds upon the Linux network -layer, so to use all of the provided queueing functionality. A device +layer, enabling use all of the provided queueing functionality. A device driver for CAN controller hardware registers itself with the Linux network layer as a network device, so that CAN frames from the controller can be passed up to the network layer and on to the CAN @@ -138,15 +145,15 @@ solution for a couple of reasons: providing an API for device drivers to register with. However, then it would be no more difficult, or may be even easier, to use the networking framework provided by the Linux kernel, and this is what - Socket CAN does. + SocketCAN does. The use of the networking framework of the Linux kernel is just the natural and most appropriate way to implement CAN for Linux. -3. Socket CAN concept +3. SocketCAN concept --------------------- - As described in chapter 2 it is the main goal of Socket CAN to + As described in chapter 2 it is the main goal of SocketCAN to provide a socket interface to user space applications which builds upon the Linux network layer. In contrast to the commonly known TCP/IP and ethernet networking, the CAN bus is a broadcast-only(!) @@ -160,11 +167,11 @@ solution for a couple of reasons: The network transparent access of multiple applications leads to the problem that different applications may be interested in the same - CAN-IDs from the same CAN network interface. The Socket CAN core + CAN-IDs from the same CAN network interface. The SocketCAN core module - which implements the protocol family CAN - provides several high efficient receive lists for this reason. If e.g. a user space application opens a CAN RAW socket, the raw protocol module itself - requests the (range of) CAN-IDs from the Socket CAN core that are + requests the (range of) CAN-IDs from the SocketCAN core that are requested by the user. The subscription and unsubscription of CAN-IDs can be done for specific CAN interfaces or for all(!) known CAN interfaces with the can_rx_(un)register() functions provided to @@ -209,21 +216,7 @@ solution for a couple of reasons: * = you really like to have this when you're running analyser tools like 'candump' or 'cansniffer' on the (same) node. - 3.3 network security issues (capabilities) - - The Controller Area Network is a local field bus transmitting only - broadcast messages without any routing and security concepts. - In the majority of cases the user application has to deal with - raw CAN frames. Therefore it might be reasonable NOT to restrict - the CAN access only to the user root, as known from other networks. - Since the currently implemented CAN_RAW and CAN_BCM sockets can only - send and receive frames to/from CAN interfaces it does not affect - security of others networks to allow all users to access the CAN. - To enable non-root users to access CAN_RAW and CAN_BCM protocol - sockets the Kconfig options CAN_RAW_USER and/or CAN_BCM_USER may be - selected at kernel compile time. - - 3.4 network problem notifications + 3.3 network problem notifications The use of the CAN bus may lead to several problems on the physical and media access control layer. Detecting and logging of these lower @@ -232,22 +225,22 @@ solution for a couple of reasons: arbitration problems and error frames caused by the different ECUs. The occurrence of detected errors are important for diagnosis and have to be logged together with the exact timestamp. For this - reason the CAN interface driver can generate so called Error Frames - that can optionally be passed to the user application in the same - way as other CAN frames. Whenever an error on the physical layer + reason the CAN interface driver can generate so called Error Message + Frames that can optionally be passed to the user application in the + same way as other CAN frames. Whenever an error on the physical layer or the MAC layer is detected (e.g. by the CAN controller) the driver - creates an appropriate error frame. Error frames can be requested by - the user application using the common CAN filter mechanisms. Inside - this filter definition the (interested) type of errors may be - selected. The reception of error frames is disabled by default. - The format of the CAN error frame is briefly decribed in the Linux - header file "include/linux/can/error.h". - -4. How to use Socket CAN + creates an appropriate error message frame. Error messages frames can + be requested by the user application using the common CAN filter + mechanisms. Inside this filter definition the (interested) type of + errors may be selected. The reception of error messages is disabled + by default. The format of the CAN error message frame is briefly + described in the Linux header file "include/linux/can/error.h". + +4. How to use SocketCAN ------------------------ Like TCP/IP, you first need to open a socket for communicating over a - CAN network. Since Socket CAN implements a new protocol family, you + CAN network. Since SocketCAN implements a new protocol family, you need to pass PF_CAN as the first argument to the socket(2) system call. Currently, there are two CAN protocols to choose from, the raw socket protocol and the broadcast manager (BCM). So to open a socket, @@ -273,13 +266,13 @@ solution for a couple of reasons: struct can_frame { canid_t can_id; /* 32 bit CAN_ID + EFF/RTR/ERR flags */ - __u8 can_dlc; /* data length code: 0 .. 8 */ + __u8 can_dlc; /* frame payload length in byte (0 .. 8) */ __u8 data[8] __attribute__((aligned(8))); }; The alignment of the (linear) payload data[] to a 64bit boundary - allows the user to define own structs and unions to easily access the - CAN payload. There is no given byteorder on the CAN bus by + allows the user to define their own structs and unions to easily access + the CAN payload. There is no given byteorder on the CAN bus by default. A read(2) system call on a CAN_RAW socket transfers a struct can_frame to the user space. @@ -375,6 +368,51 @@ solution for a couple of reasons: nbytes = sendto(s, &frame, sizeof(struct can_frame), 0, (struct sockaddr*)&addr, sizeof(addr)); + Remark about CAN FD (flexible data rate) support: + + Generally the handling of CAN FD is very similar to the formerly described + examples. The new CAN FD capable CAN controllers support two different + bitrates for the arbitration phase and the payload phase of the CAN FD frame + and up to 64 bytes of payload. This extended payload length breaks all the + kernel interfaces (ABI) which heavily rely on the CAN frame with fixed eight + bytes of payload (struct can_frame) like the CAN_RAW socket. Therefore e.g. + the CAN_RAW socket supports a new socket option CAN_RAW_FD_FRAMES that + switches the socket into a mode that allows the handling of CAN FD frames + and (legacy) CAN frames simultaneously (see section 4.1.5). + + The struct canfd_frame is defined in include/linux/can.h: + + struct canfd_frame { + canid_t can_id; /* 32 bit CAN_ID + EFF/RTR/ERR flags */ + __u8 len; /* frame payload length in byte (0 .. 64) */ + __u8 flags; /* additional flags for CAN FD */ + __u8 __res0; /* reserved / padding */ + __u8 __res1; /* reserved / padding */ + __u8 data[64] __attribute__((aligned(8))); + }; + + The struct canfd_frame and the existing struct can_frame have the can_id, + the payload length and the payload data at the same offset inside their + structures. This allows to handle the different structures very similar. + When the content of a struct can_frame is copied into a struct canfd_frame + all structure elements can be used as-is - only the data[] becomes extended. + + When introducing the struct canfd_frame it turned out that the data length + code (DLC) of the struct can_frame was used as a length information as the + length and the DLC has a 1:1 mapping in the range of 0 .. 8. To preserve + the easy handling of the length information the canfd_frame.len element + contains a plain length value from 0 .. 64. So both canfd_frame.len and + can_frame.can_dlc are equal and contain a length information and no DLC. + For details about the distinction of CAN and CAN FD capable devices and + the mapping to the bus-relevant data length code (DLC), see chapter 6.6. + + The length of the two CAN(FD) frame structures define the maximum transfer + unit (MTU) of the CAN(FD) network interface and skbuff data length. Two + definitions are specified for CAN specific MTUs in include/linux/can.h : + + #define CAN_MTU (sizeof(struct can_frame)) == 16 => 'legacy' CAN frame + #define CANFD_MTU (sizeof(struct canfd_frame)) == 72 => CAN FD frame + 4.1 RAW protocol sockets with can_filters (SOCK_RAW) Using CAN_RAW sockets is extensively comparable to the commonly @@ -383,7 +421,7 @@ solution for a couple of reasons: defaults are set at RAW socket binding time: - The filters are set to exactly one filter receiving everything - - The socket only receives valid data frames (=> no error frames) + - The socket only receives valid data frames (=> no error message frames) - The loopback of sent CAN frames is enabled (see chapter 3.2) - The socket does not receive its own sent frames (in loopback mode) @@ -426,15 +464,50 @@ solution for a couple of reasons: setsockopt(s, SOL_CAN_RAW, CAN_RAW_FILTER, NULL, 0); - To set the filters to zero filters is quite obsolete as not read + To set the filters to zero filters is quite obsolete as to not read data causes the raw socket to discard the received CAN frames. But having this 'send only' use-case we may remove the receive list in the Kernel to save a little (really a very little!) CPU usage. + 4.1.1.1 CAN filter usage optimisation + + The CAN filters are processed in per-device filter lists at CAN frame + reception time. To reduce the number of checks that need to be performed + while walking through the filter lists the CAN core provides an optimized + filter handling when the filter subscription focusses on a single CAN ID. + + For the possible 2048 SFF CAN identifiers the identifier is used as an index + to access the corresponding subscription list without any further checks. + For the 2^29 possible EFF CAN identifiers a 10 bit XOR folding is used as + hash function to retrieve the EFF table index. + + To benefit from the optimized filters for single CAN identifiers the + CAN_SFF_MASK or CAN_EFF_MASK have to be set into can_filter.mask together + with set CAN_EFF_FLAG and CAN_RTR_FLAG bits. A set CAN_EFF_FLAG bit in the + can_filter.mask makes clear that it matters whether a SFF or EFF CAN ID is + subscribed. E.g. in the example from above + + rfilter[0].can_id = 0x123; + rfilter[0].can_mask = CAN_SFF_MASK; + + both SFF frames with CAN ID 0x123 and EFF frames with 0xXXXXX123 can pass. + + To filter for only 0x123 (SFF) and 0x12345678 (EFF) CAN identifiers the + filter has to be defined in this way to benefit from the optimized filters: + + struct can_filter rfilter[2]; + + rfilter[0].can_id = 0x123; + rfilter[0].can_mask = (CAN_EFF_FLAG | CAN_RTR_FLAG | CAN_SFF_MASK); + rfilter[1].can_id = 0x12345678 | CAN_EFF_FLAG; + rfilter[1].can_mask = (CAN_EFF_FLAG | CAN_RTR_FLAG | CAN_EFF_MASK); + + setsockopt(s, SOL_CAN_RAW, CAN_RAW_FILTER, &rfilter, sizeof(rfilter)); + 4.1.2 RAW socket option CAN_RAW_ERR_FILTER As described in chapter 3.4 the CAN interface driver can generate so - called Error Frames that can optionally be passed to the user + called Error Message Frames that can optionally be passed to the user application in the same way as other CAN frames. The possible errors are divided into different error classes that may be filtered using the appropriate error mask. To register for every possible @@ -472,7 +545,63 @@ solution for a couple of reasons: setsockopt(s, SOL_CAN_RAW, CAN_RAW_RECV_OWN_MSGS, &recv_own_msgs, sizeof(recv_own_msgs)); - 4.1.5 RAW socket returned message flags + 4.1.5 RAW socket option CAN_RAW_FD_FRAMES + + CAN FD support in CAN_RAW sockets can be enabled with a new socket option + CAN_RAW_FD_FRAMES which is off by default. When the new socket option is + not supported by the CAN_RAW socket (e.g. on older kernels), switching the + CAN_RAW_FD_FRAMES option returns the error -ENOPROTOOPT. + + Once CAN_RAW_FD_FRAMES is enabled the application can send both CAN frames + and CAN FD frames. OTOH the application has to handle CAN and CAN FD frames + when reading from the socket. + + CAN_RAW_FD_FRAMES enabled: CAN_MTU and CANFD_MTU are allowed + CAN_RAW_FD_FRAMES disabled: only CAN_MTU is allowed (default) + + Example: + [ remember: CANFD_MTU == sizeof(struct canfd_frame) ] + + struct canfd_frame cfd; + + nbytes = read(s, &cfd, CANFD_MTU); + + if (nbytes == CANFD_MTU) { + printf("got CAN FD frame with length %d\n", cfd.len); + /* cfd.flags contains valid data */ + } else if (nbytes == CAN_MTU) { + printf("got legacy CAN frame with length %d\n", cfd.len); + /* cfd.flags is undefined */ + } else { + fprintf(stderr, "read: invalid CAN(FD) frame\n"); + return 1; + } + + /* the content can be handled independently from the received MTU size */ + + printf("can_id: %X data length: %d data: ", cfd.can_id, cfd.len); + for (i = 0; i < cfd.len; i++) + printf("%02X ", cfd.data[i]); + + When reading with size CANFD_MTU only returns CAN_MTU bytes that have + been received from the socket a legacy CAN frame has been read into the + provided CAN FD structure. Note that the canfd_frame.flags data field is + not specified in the struct can_frame and therefore it is only valid in + CANFD_MTU sized CAN FD frames. + + Implementation hint for new CAN applications: + + To build a CAN FD aware application use struct canfd_frame as basic CAN + data structure for CAN_RAW based applications. When the application is + executed on an older Linux kernel and switching the CAN_RAW_FD_FRAMES + socket option returns an error: No problem. You'll get legacy CAN frames + or CAN FD frames and can process them the same way. + + When sending to CAN devices make sure that the device is capable to handle + CAN FD frames by checking if the device maximum transfer unit is CANFD_MTU. + The CAN device MTU can be retrieved e.g. with a SIOCGIFMTU ioctl() syscall. + + 4.1.6 RAW socket returned message flags When using recvmsg() call, the msg->msg_flags may contain following flags: @@ -484,21 +613,232 @@ solution for a couple of reasons: In order to receive such messages, CAN_RAW_RECV_OWN_MSGS must be set. 4.2 Broadcast Manager protocol sockets (SOCK_DGRAM) + + The Broadcast Manager protocol provides a command based configuration + interface to filter and send (e.g. cyclic) CAN messages in kernel space. + + Receive filters can be used to down sample frequent messages; detect events + such as message contents changes, packet length changes, and do time-out + monitoring of received messages. + + Periodic transmission tasks of CAN frames or a sequence of CAN frames can be + created and modified at runtime; both the message content and the two + possible transmit intervals can be altered. + + A BCM socket is not intended for sending individual CAN frames using the + struct can_frame as known from the CAN_RAW socket. Instead a special BCM + configuration message is defined. The basic BCM configuration message used + to communicate with the broadcast manager and the available operations are + defined in the linux/can/bcm.h include. The BCM message consists of a + message header with a command ('opcode') followed by zero or more CAN frames. + The broadcast manager sends responses to user space in the same form: + + struct bcm_msg_head { + __u32 opcode; /* command */ + __u32 flags; /* special flags */ + __u32 count; /* run 'count' times with ival1 */ + struct timeval ival1, ival2; /* count and subsequent interval */ + canid_t can_id; /* unique can_id for task */ + __u32 nframes; /* number of can_frames following */ + struct can_frame frames[0]; + }; + + The aligned payload 'frames' uses the same basic CAN frame structure defined + at the beginning of section 4 and in the include/linux/can.h include. All + messages to the broadcast manager from user space have this structure. + + Note a CAN_BCM socket must be connected instead of bound after socket + creation (example without error checking): + + int s; + struct sockaddr_can addr; + struct ifreq ifr; + + s = socket(PF_CAN, SOCK_DGRAM, CAN_BCM); + + strcpy(ifr.ifr_name, "can0"); + ioctl(s, SIOCGIFINDEX, &ifr); + + addr.can_family = AF_CAN; + addr.can_ifindex = ifr.ifr_ifindex; + + connect(s, (struct sockaddr *)&addr, sizeof(addr)) + + (..) + + The broadcast manager socket is able to handle any number of in flight + transmissions or receive filters concurrently. The different RX/TX jobs are + distinguished by the unique can_id in each BCM message. However additional + CAN_BCM sockets are recommended to communicate on multiple CAN interfaces. + When the broadcast manager socket is bound to 'any' CAN interface (=> the + interface index is set to zero) the configured receive filters apply to any + CAN interface unless the sendto() syscall is used to overrule the 'any' CAN + interface index. When using recvfrom() instead of read() to retrieve BCM + socket messages the originating CAN interface is provided in can_ifindex. + + 4.2.1 Broadcast Manager operations + + The opcode defines the operation for the broadcast manager to carry out, + or details the broadcast managers response to several events, including + user requests. + + Transmit Operations (user space to broadcast manager): + + TX_SETUP: Create (cyclic) transmission task. + + TX_DELETE: Remove (cyclic) transmission task, requires only can_id. + + TX_READ: Read properties of (cyclic) transmission task for can_id. + + TX_SEND: Send one CAN frame. + + Transmit Responses (broadcast manager to user space): + + TX_STATUS: Reply to TX_READ request (transmission task configuration). + + TX_EXPIRED: Notification when counter finishes sending at initial interval + 'ival1'. Requires the TX_COUNTEVT flag to be set at TX_SETUP. + + Receive Operations (user space to broadcast manager): + + RX_SETUP: Create RX content filter subscription. + + RX_DELETE: Remove RX content filter subscription, requires only can_id. + + RX_READ: Read properties of RX content filter subscription for can_id. + + Receive Responses (broadcast manager to user space): + + RX_STATUS: Reply to RX_READ request (filter task configuration). + + RX_TIMEOUT: Cyclic message is detected to be absent (timer ival1 expired). + + RX_CHANGED: BCM message with updated CAN frame (detected content change). + Sent on first message received or on receipt of revised CAN messages. + + 4.2.2 Broadcast Manager message flags + + When sending a message to the broadcast manager the 'flags' element may + contain the following flag definitions which influence the behaviour: + + SETTIMER: Set the values of ival1, ival2 and count + + STARTTIMER: Start the timer with the actual values of ival1, ival2 + and count. Starting the timer leads simultaneously to emit a CAN frame. + + TX_COUNTEVT: Create the message TX_EXPIRED when count expires + + TX_ANNOUNCE: A change of data by the process is emitted immediately. + + TX_CP_CAN_ID: Copies the can_id from the message header to each + subsequent frame in frames. This is intended as usage simplification. For + TX tasks the unique can_id from the message header may differ from the + can_id(s) stored for transmission in the subsequent struct can_frame(s). + + RX_FILTER_ID: Filter by can_id alone, no frames required (nframes=0). + + RX_CHECK_DLC: A change of the DLC leads to an RX_CHANGED. + + RX_NO_AUTOTIMER: Prevent automatically starting the timeout monitor. + + RX_ANNOUNCE_RESUME: If passed at RX_SETUP and a receive timeout occurred, a + RX_CHANGED message will be generated when the (cyclic) receive restarts. + + TX_RESET_MULTI_IDX: Reset the index for the multiple frame transmission. + + RX_RTR_FRAME: Send reply for RTR-request (placed in op->frames[0]). + + 4.2.3 Broadcast Manager transmission timers + + Periodic transmission configurations may use up to two interval timers. + In this case the BCM sends a number of messages ('count') at an interval + 'ival1', then continuing to send at another given interval 'ival2'. When + only one timer is needed 'count' is set to zero and only 'ival2' is used. + When SET_TIMER and START_TIMER flag were set the timers are activated. + The timer values can be altered at runtime when only SET_TIMER is set. + + 4.2.4 Broadcast Manager message sequence transmission + + Up to 256 CAN frames can be transmitted in a sequence in the case of a cyclic + TX task configuration. The number of CAN frames is provided in the 'nframes' + element of the BCM message head. The defined number of CAN frames are added + as array to the TX_SETUP BCM configuration message. + + /* create a struct to set up a sequence of four CAN frames */ + struct { + struct bcm_msg_head msg_head; + struct can_frame frame[4]; + } mytxmsg; + + (..) + mytxmsg.nframes = 4; + (..) + + write(s, &mytxmsg, sizeof(mytxmsg)); + + With every transmission the index in the array of CAN frames is increased + and set to zero at index overflow. + + 4.2.5 Broadcast Manager receive filter timers + + The timer values ival1 or ival2 may be set to non-zero values at RX_SETUP. + When the SET_TIMER flag is set the timers are enabled: + + ival1: Send RX_TIMEOUT when a received message is not received again within + the given time. When START_TIMER is set at RX_SETUP the timeout detection + is activated directly - even without a former CAN frame reception. + + ival2: Throttle the received message rate down to the value of ival2. This + is useful to reduce messages for the application when the signal inside the + CAN frame is stateless as state changes within the ival2 periode may get + lost. + + 4.2.6 Broadcast Manager multiplex message receive filter + + To filter for content changes in multiplex message sequences an array of more + than one CAN frames can be passed in a RX_SETUP configuration message. The + data bytes of the first CAN frame contain the mask of relevant bits that + have to match in the subsequent CAN frames with the received CAN frame. + If one of the subsequent CAN frames is matching the bits in that frame data + mark the relevant content to be compared with the previous received content. + Up to 257 CAN frames (multiplex filter bit mask CAN frame plus 256 CAN + filters) can be added as array to the TX_SETUP BCM configuration message. + + /* usually used to clear CAN frame data[] - beware of endian problems! */ + #define U64_DATA(p) (*(unsigned long long*)(p)->data) + + struct { + struct bcm_msg_head msg_head; + struct can_frame frame[5]; + } msg; + + msg.msg_head.opcode = RX_SETUP; + msg.msg_head.can_id = 0x42; + msg.msg_head.flags = 0; + msg.msg_head.nframes = 5; + U64_DATA(&msg.frame[0]) = 0xFF00000000000000ULL; /* MUX mask */ + U64_DATA(&msg.frame[1]) = 0x01000000000000FFULL; /* data mask (MUX 0x01) */ + U64_DATA(&msg.frame[2]) = 0x0200FFFF000000FFULL; /* data mask (MUX 0x02) */ + U64_DATA(&msg.frame[3]) = 0x330000FFFFFF0003ULL; /* data mask (MUX 0x33) */ + U64_DATA(&msg.frame[4]) = 0x4F07FC0FF0000000ULL; /* data mask (MUX 0x4F) */ + + write(s, &msg, sizeof(msg)); + 4.3 connected transport protocols (SOCK_SEQPACKET) 4.4 unconnected transport protocols (SOCK_DGRAM) -5. Socket CAN core module +5. SocketCAN core module ------------------------- - The Socket CAN core module implements the protocol family + The SocketCAN core module implements the protocol family PF_CAN. CAN protocol modules are loaded by the core module at runtime. The core module provides an interface for CAN protocol modules to subscribe needed CAN IDs (see chapter 3.1). 5.1 can.ko module params - - stats_timer: To calculate the Socket CAN core statistics + - stats_timer: To calculate the SocketCAN core statistics (e.g. current/maximum frames per second) this 1 second timer is invoked at can.ko module start time by default. This timer can be disabled by using stattimer=0 on the module commandline. @@ -507,7 +847,7 @@ solution for a couple of reasons: 5.2 procfs content - As described in chapter 3.1 the Socket CAN core uses several filter + As described in chapter 3.1 the SocketCAN core uses several filter lists to deliver received CAN frames to CAN protocol modules. These receive lists, their filters and the count of filter matches can be checked in the appropriate receive list. All entries contain the @@ -527,22 +867,22 @@ solution for a couple of reasons: rcvlist_all - list for unfiltered entries (no filter operations) rcvlist_eff - list for single extended frame (EFF) entries - rcvlist_err - list for error frames masks + rcvlist_err - list for error message frames masks rcvlist_fil - list for mask/value filters rcvlist_inv - list for mask/value filters (inverse semantic) rcvlist_sff - list for single standard frame (SFF) entries Additional procfs files in /proc/net/can - stats - Socket CAN core statistics (rx/tx frames, match ratios, ...) + stats - SocketCAN core statistics (rx/tx frames, match ratios, ...) reset_stats - manual statistic reset - version - prints the Socket CAN core version and the ABI version + version - prints the SocketCAN core version and the ABI version 5.3 writing own CAN protocol modules To implement a new protocol in the protocol family PF_CAN a new protocol has to be defined in include/linux/can.h . - The prototypes and definitions to use the Socket CAN core can be + The prototypes and definitions to use the SocketCAN core can be accessed by including include/linux/can/core.h . In addition to functions that register the CAN protocol and the CAN device notifier chain there are functions to subscribe CAN @@ -573,10 +913,13 @@ solution for a couple of reasons: dev->type = ARPHRD_CAN; /* the netdevice hardware type */ dev->flags = IFF_NOARP; /* CAN has no arp */ - dev->mtu = sizeof(struct can_frame); + dev->mtu = CAN_MTU; /* sizeof(struct can_frame) -> legacy CAN interface */ + + or alternative, when the controller supports CAN with flexible data rate: + dev->mtu = CANFD_MTU; /* sizeof(struct canfd_frame) -> CAN FD interface */ - The struct can_frame is the payload of each socket buffer in the - protocol family PF_CAN. + The struct can_frame or struct canfd_frame is the payload of each socket + buffer (skbuff) in the protocol family PF_CAN. 6.2 local loopback of sent frames @@ -649,7 +992,7 @@ solution for a couple of reasons: The CAN device must be configured via netlink interface. The supported netlink message types are defined and briefly described in "include/linux/can/netlink.h". CAN link support for the program "ip" - of the IPROUTE2 utility suite is avaiable and it can be used as shown + of the IPROUTE2 utility suite is available and it can be used as shown below: - Setting CAN device properties: @@ -709,7 +1052,7 @@ solution for a couple of reasons: in case of a bus-off condition after the specified delay time in milliseconds. By default it's off. - "bitrate 125000 sample_point 0.875" + "bitrate 125000 sample-point 0.875" Shows the real bit-rate in bits/sec and the sample-point in the range 0.000..0.999. If the calculation of bit-timing parameters is enabled in the kernel (CONFIG_CAN_CALC_BITTIMING=y), the @@ -776,7 +1119,7 @@ solution for a couple of reasons: $ ip link set canX up type can bitrate 125000 - A device may enter the "bus-off" state if too much errors occurred on + A device may enter the "bus-off" state if too many errors occurred on the CAN bus. Then no more messages are received or sent. An automatic bus-off recovery can be enabled by setting the "restart-ms" to a non-zero value, e.g.: @@ -784,32 +1127,53 @@ solution for a couple of reasons: $ ip link set canX type can restart-ms 100 Alternatively, the application may realize the "bus-off" condition - by monitoring CAN error frames and do a restart when appropriate with - the command: + by monitoring CAN error message frames and do a restart when + appropriate with the command: $ ip link set canX type can restart - Note that a restart will also create a CAN error frame (see also - chapter 3.4). + Note that a restart will also create a CAN error message frame (see + also chapter 3.4). + + 6.6 CAN FD (flexible data rate) driver support + + CAN FD capable CAN controllers support two different bitrates for the + arbitration phase and the payload phase of the CAN FD frame. Therefore a + second bit timing has to be specified in order to enable the CAN FD bitrate. - 6.6 Supported CAN hardware + Additionally CAN FD capable CAN controllers support up to 64 bytes of + payload. The representation of this length in can_frame.can_dlc and + canfd_frame.len for userspace applications and inside the Linux network + layer is a plain value from 0 .. 64 instead of the CAN 'data length code'. + The data length code was a 1:1 mapping to the payload length in the legacy + CAN frames anyway. The payload length to the bus-relevant DLC mapping is + only performed inside the CAN drivers, preferably with the helper + functions can_dlc2len() and can_len2dlc(). + + The CAN netdevice driver capabilities can be distinguished by the network + devices maximum transfer unit (MTU): + + MTU = 16 (CAN_MTU) => sizeof(struct can_frame) => 'legacy' CAN device + MTU = 72 (CANFD_MTU) => sizeof(struct canfd_frame) => CAN FD capable device + + The CAN device MTU can be retrieved e.g. with a SIOCGIFMTU ioctl() syscall. + N.B. CAN FD capable devices can also handle and send legacy CAN frames. + + FIXME: Add details about the CAN FD controller configuration when available. + + 6.7 Supported CAN hardware Please check the "Kconfig" file in "drivers/net/can" to get an actual - list of the support CAN hardware. On the Socket CAN project website + list of the support CAN hardware. On the SocketCAN project website (see chapter 7) there might be further drivers available, also for older kernel versions. -7. Socket CAN resources +7. SocketCAN resources ----------------------- - You can find further resources for Socket CAN like user space tools, - support for old kernel versions, more drivers, mailing lists, etc. - at the BerliOS OSS project website for Socket CAN: - - http://developer.berlios.de/projects/socketcan - - If you have questions, bug fixes, etc., don't hesitate to post them to - the Socketcan-Users mailing list. But please search the archives first. + The Linux CAN / SocketCAN project ressources (project site / mailing list) + are referenced in the MAINTAINERS file in the Linux source tree. + Search for CAN NETWORK [LAYERS|DRIVERS]. 8. Credits ---------- diff --git a/Documentation/networking/cdc_mbim.txt b/Documentation/networking/cdc_mbim.txt new file mode 100644 index 00000000000..a15ea602aa5 --- /dev/null +++ b/Documentation/networking/cdc_mbim.txt @@ -0,0 +1,339 @@ + cdc_mbim - Driver for CDC MBIM Mobile Broadband modems + ======================================================== + +The cdc_mbim driver supports USB devices conforming to the "Universal +Serial Bus Communications Class Subclass Specification for Mobile +Broadband Interface Model" [1], which is a further development of +"Universal Serial Bus Communications Class Subclass Specifications for +Network Control Model Devices" [2] optimized for Mobile Broadband +devices, aka "3G/LTE modems". + + +Command Line Parameters +======================= + +The cdc_mbim driver has no parameters of its own. But the probing +behaviour for NCM 1.0 backwards compatible MBIM functions (an +"NCM/MBIM function" as defined in section 3.2 of [1]) is affected +by a cdc_ncm driver parameter: + +prefer_mbim +----------- +Type: Boolean +Valid Range: N/Y (0-1) +Default Value: Y (MBIM is preferred) + +This parameter sets the system policy for NCM/MBIM functions. Such +functions will be handled by either the cdc_ncm driver or the cdc_mbim +driver depending on the prefer_mbim setting. Setting prefer_mbim=N +makes the cdc_mbim driver ignore these functions and lets the cdc_ncm +driver handle them instead. + +The parameter is writable, and can be changed at any time. A manual +unbind/bind is required to make the change effective for NCM/MBIM +functions bound to the "wrong" driver + + +Basic usage +=========== + +MBIM functions are inactive when unmanaged. The cdc_mbim driver only +provides an userspace interface to the MBIM control channel, and will +not participate in the management of the function. This implies that a +userspace MBIM management application always is required to enable a +MBIM function. + +Such userspace applications includes, but are not limited to: + - mbimcli (included with the libmbim [3] library), and + - ModemManager [4] + +Establishing a MBIM IP session reequires at least these actions by the +management application: + - open the control channel + - configure network connection settings + - connect to network + - configure IP interface + +Management application development +---------------------------------- +The driver <-> userspace interfaces are described below. The MBIM +control channel protocol is described in [1]. + + +MBIM control channel userspace ABI +================================== + +/dev/cdc-wdmX character device +------------------------------ +The driver creates a two-way pipe to the MBIM function control channel +using the cdc-wdm driver as a subdriver. The userspace end of the +control channel pipe is a /dev/cdc-wdmX character device. + +The cdc_mbim driver does not process or police messages on the control +channel. The channel is fully delegated to the userspace management +application. It is therefore up to this application to ensure that it +complies with all the control channel requirements in [1]. + +The cdc-wdmX device is created as a child of the MBIM control +interface USB device. The character device associated with a specific +MBIM function can be looked up using sysfs. For example: + + bjorn@nemi:~$ ls /sys/bus/usb/drivers/cdc_mbim/2-4:2.12/usbmisc + cdc-wdm0 + + bjorn@nemi:~$ grep . /sys/bus/usb/drivers/cdc_mbim/2-4:2.12/usbmisc/cdc-wdm0/dev + 180:0 + + +USB configuration descriptors +----------------------------- +The wMaxControlMessage field of the CDC MBIM functional descriptor +limits the maximum control message size. The managament application is +responsible for negotiating a control message size complying with the +requirements in section 9.3.1 of [1], taking this descriptor field +into consideration. + +The userspace application can access the CDC MBIM functional +descriptor of a MBIM function using either of the two USB +configuration descriptor kernel interfaces described in [6] or [7]. + +See also the ioctl documentation below. + + +Fragmentation +------------- +The userspace application is responsible for all control message +fragmentation and defragmentaion, as described in section 9.5 of [1]. + + +/dev/cdc-wdmX write() +--------------------- +The MBIM control messages from the management application *must not* +exceed the negotiated control message size. + + +/dev/cdc-wdmX read() +-------------------- +The management application *must* accept control messages of up the +negotiated control message size. + + +/dev/cdc-wdmX ioctl() +-------------------- +IOCTL_WDM_MAX_COMMAND: Get Maximum Command Size +This ioctl returns the wMaxControlMessage field of the CDC MBIM +functional descriptor for MBIM devices. This is intended as a +convenience, eliminating the need to parse the USB descriptors from +userspace. + + #include <stdio.h> + #include <fcntl.h> + #include <sys/ioctl.h> + #include <linux/types.h> + #include <linux/usb/cdc-wdm.h> + int main() + { + __u16 max; + int fd = open("/dev/cdc-wdm0", O_RDWR); + if (!ioctl(fd, IOCTL_WDM_MAX_COMMAND, &max)) + printf("wMaxControlMessage is %d\n", max); + } + + +Custom device services +---------------------- +The MBIM specification allows vendors to freely define additional +services. This is fully supported by the cdc_mbim driver. + +Support for new MBIM services, including vendor specified services, is +implemented entirely in userspace, like the rest of the MBIM control +protocol + +New services should be registered in the MBIM Registry [5]. + + + +MBIM data channel userspace ABI +=============================== + +wwanY network device +-------------------- +The cdc_mbim driver represents the MBIM data channel as a single +network device of the "wwan" type. This network device is initially +mapped to MBIM IP session 0. + + +Multiplexed IP sessions (IPS) +----------------------------- +MBIM allows multiplexing up to 256 IP sessions over a single USB data +channel. The cdc_mbim driver models such IP sessions as 802.1q VLAN +subdevices of the master wwanY device, mapping MBIM IP session Z to +VLAN ID Z for all values of Z greater than 0. + +The device maximum Z is given in the MBIM_DEVICE_CAPS_INFO structure +described in section 10.5.1 of [1]. + +The userspace management application is responsible for adding new +VLAN links prior to establishing MBIM IP sessions where the SessionId +is greater than 0. These links can be added by using the normal VLAN +kernel interfaces, either ioctl or netlink. + +For example, adding a link for a MBIM IP session with SessionId 3: + + ip link add link wwan0 name wwan0.3 type vlan id 3 + +The driver will automatically map the "wwan0.3" network device to MBIM +IP session 3. + + +Device Service Streams (DSS) +---------------------------- +MBIM also allows up to 256 non-IP data streams to be multiplexed over +the same shared USB data channel. The cdc_mbim driver models these +sessions as another set of 802.1q VLAN subdevices of the master wwanY +device, mapping MBIM DSS session A to VLAN ID (256 + A) for all values +of A. + +The device maximum A is given in the MBIM_DEVICE_SERVICES_INFO +structure described in section 10.5.29 of [1]. + +The DSS VLAN subdevices are used as a practical interface between the +shared MBIM data channel and a MBIM DSS aware userspace application. +It is not intended to be presented as-is to an end user. The +assumption is that an userspace application initiating a DSS session +also takes care of the necessary framing of the DSS data, presenting +the stream to the end user in an appropriate way for the stream type. + +The network device ABI requires a dummy ethernet header for every DSS +data frame being transported. The contents of this header is +arbitrary, with the following exceptions: + - TX frames using an IP protocol (0x0800 or 0x86dd) will be dropped + - RX frames will have the protocol field set to ETH_P_802_3 (but will + not be properly formatted 802.3 frames) + - RX frames will have the destination address set to the hardware + address of the master device + +The DSS supporting userspace management application is responsible for +adding the dummy ethernet header on TX and stripping it on RX. + +This is a simple example using tools commonly available, exporting +DssSessionId 5 as a pty character device pointed to by a /dev/nmea +symlink: + + ip link add link wwan0 name wwan0.dss5 type vlan id 261 + ip link set dev wwan0.dss5 up + socat INTERFACE:wwan0.dss5,type=2 PTY:,echo=0,link=/dev/nmea + +This is only an example, most suitable for testing out a DSS +service. Userspace applications supporting specific MBIM DSS services +are expected to use the tools and programming interfaces required by +that service. + +Note that adding VLAN links for DSS sessions is entirely optional. A +management application may instead choose to bind a packet socket +directly to the master network device, using the received VLAN tags to +map frames to the correct DSS session and adding 18 byte VLAN ethernet +headers with the appropriate tag on TX. In this case using a socket +filter is recommended, matching only the DSS VLAN subset. This avoid +unnecessary copying of unrelated IP session data to userspace. For +example: + + static struct sock_filter dssfilter[] = { + /* use special negative offsets to get VLAN tag */ + BPF_STMT(BPF_LD|BPF_B|BPF_ABS, SKF_AD_OFF + SKF_AD_VLAN_TAG_PRESENT), + BPF_JUMP(BPF_JMP|BPF_JEQ|BPF_K, 1, 0, 6), /* true */ + + /* verify DSS VLAN range */ + BPF_STMT(BPF_LD|BPF_H|BPF_ABS, SKF_AD_OFF + SKF_AD_VLAN_TAG), + BPF_JUMP(BPF_JMP|BPF_JGE|BPF_K, 256, 0, 4), /* 256 is first DSS VLAN */ + BPF_JUMP(BPF_JMP|BPF_JGE|BPF_K, 512, 3, 0), /* 511 is last DSS VLAN */ + + /* verify ethertype */ + BPF_STMT(BPF_LD|BPF_H|BPF_ABS, 2 * ETH_ALEN), + BPF_JUMP(BPF_JMP|BPF_JEQ|BPF_K, ETH_P_802_3, 0, 1), + + BPF_STMT(BPF_RET|BPF_K, (u_int)-1), /* accept */ + BPF_STMT(BPF_RET|BPF_K, 0), /* ignore */ + }; + + + +Tagged IP session 0 VLAN +------------------------ +As described above, MBIM IP session 0 is treated as special by the +driver. It is initially mapped to untagged frames on the wwanY +network device. + +This mapping implies a few restrictions on multiplexed IPS and DSS +sessions, which may not always be practical: + - no IPS or DSS session can use a frame size greater than the MTU on + IP session 0 + - no IPS or DSS session can be in the up state unless the network + device representing IP session 0 also is up + +These problems can be avoided by optionally making the driver map IP +session 0 to a VLAN subdevice, similar to all other IP sessions. This +behaviour is triggered by adding a VLAN link for the magic VLAN ID +4094. The driver will then immediately start mapping MBIM IP session +0 to this VLAN, and will drop untagged frames on the master wwanY +device. + +Tip: It might be less confusing to the end user to name this VLAN +subdevice after the MBIM SessionID instead of the VLAN ID. For +example: + + ip link add link wwan0 name wwan0.0 type vlan id 4094 + + +VLAN mapping +------------ + +Summarizing the cdc_mbim driver mapping described above, we have this +relationship between VLAN tags on the wwanY network device and MBIM +sessions on the shared USB data channel: + + VLAN ID MBIM type MBIM SessionID Notes + --------------------------------------------------------- + untagged IPS 0 a) + 1 - 255 IPS 1 - 255 <VLANID> + 256 - 511 DSS 0 - 255 <VLANID - 256> + 512 - 4093 b) + 4094 IPS 0 c) + + a) if no VLAN ID 4094 link exists, else dropped + b) unsupported VLAN range, unconditionally dropped + c) if a VLAN ID 4094 link exists, else dropped + + + + +References +========== + +[1] USB Implementers Forum, Inc. - "Universal Serial Bus + Communications Class Subclass Specification for Mobile Broadband + Interface Model", Revision 1.0 (Errata 1), May 1, 2013 + - http://www.usb.org/developers/docs/devclass_docs/ + +[2] USB Implementers Forum, Inc. - "Universal Serial Bus + Communications Class Subclass Specifications for Network Control + Model Devices", Revision 1.0 (Errata 1), November 24, 2010 + - http://www.usb.org/developers/docs/devclass_docs/ + +[3] libmbim - "a glib-based library for talking to WWAN modems and + devices which speak the Mobile Interface Broadband Model (MBIM) + protocol" + - http://www.freedesktop.org/wiki/Software/libmbim/ + +[4] ModemManager - "a DBus-activated daemon which controls mobile + broadband (2G/3G/4G) devices and connections" + - http://www.freedesktop.org/wiki/Software/ModemManager/ + +[5] "MBIM (Mobile Broadband Interface Model) Registry" + - http://compliance.usb.org/mbim/ + +[6] "/proc/bus/usb filesystem output" + - Documentation/usb/proc_usb_info.txt + +[7] "/sys/bus/usb/devices/.../descriptors" + - Documentation/ABI/stable/sysfs-bus-usb diff --git a/Documentation/networking/cs89x0.txt b/Documentation/networking/cs89x0.txt index c725d33b316..0e190180eec 100644 --- a/Documentation/networking/cs89x0.txt +++ b/Documentation/networking/cs89x0.txt @@ -36,7 +36,6 @@ TABLE OF CONTENTS 4.1 Compiling the Driver as a Loadable Module 4.2 Compiling the driver to support memory mode 4.3 Compiling the driver to support Rx DMA - 4.4 Compiling the Driver into the Kernel 5.0 TESTING AND TROUBLESHOOTING 5.1 Known Defects and Limitations @@ -364,84 +363,6 @@ The compile-time optionality for DMA was removed in the 2.3 kernel series. DMA support is now unconditionally part of the driver. It is enabled by the 'use_dma=1' module option. -4.4 COMPILING THE DRIVER INTO THE KERNEL - -If your Linux distribution already has support for the cs89x0 driver -then simply copy the source file to the /usr/src/linux/drivers/net -directory to replace the original ones and run the make utility to -rebuild the kernel. See Step 3 for rebuilding the kernel. - -If your Linux does not include the cs89x0 driver, you need to edit three -configuration files, copy the source file to the /usr/src/linux/drivers/net -directory, and then run the make utility to rebuild the kernel. - -1. Edit the following configuration files by adding the statements as -indicated. (When possible, try to locate the added text to the section of the -file containing similar statements). - - -a.) In /usr/src/linux/drivers/net/Config.in, add: - -tristate 'CS89x0 support' CONFIG_CS89x0 - -Example: - - if [ "$CONFIG_EXPERIMENTAL" = "y" ]; then - tristate 'ICL EtherTeam 16i/32 support' CONFIG_ETH16I - fi - - tristate 'CS89x0 support' CONFIG_CS89x0 - - tristate 'NE2000/NE1000 support' CONFIG_NE2000 - if [ "$CONFIG_EXPERIMENTAL" = "y" ]; then - tristate 'NI5210 support' CONFIG_NI52 - - -b.) In /usr/src/linux/drivers/net/Makefile, add the following lines: - -ifeq ($(CONFIG_CS89x0),y) -L_OBJS += cs89x0.o -else - ifeq ($(CONFIG_CS89x0),m) - M_OBJS += cs89x0.o - endif -endif - - -c.) In /linux/drivers/net/Space.c file, add the line: - -extern int cs89x0_probe(struct device *dev); - - -Example: - - extern int ultra_probe(struct device *dev); - extern int wd_probe(struct device *dev); - extern int el2_probe(struct device *dev); - - extern int cs89x0_probe(struct device *dev); - - extern int ne_probe(struct device *dev); - extern int hp_probe(struct device *dev); - extern int hp_plus_probe(struct device *dev); - - -Also add: - - #ifdef CONFIG_CS89x0 - { cs89x0_probe,0 }, - #endif - - -2.) Copy the driver source files (cs89x0.c and cs89x0.h) -into the /usr/src/linux/drivers/net directory. - - -3.) Go to /usr/src/linux directory and run 'make config' followed by 'make' -(or make bzImage) to rebuild the kernel. - -4.) Use the DOS 'setup' utility to disable plug and play on the NIC. - 5.0 TESTING AND TROUBLESHOOTING =============================================================================== diff --git a/Documentation/networking/dccp.txt b/Documentation/networking/dccp.txt index b395ca6a49f..55c575fcaf1 100644 --- a/Documentation/networking/dccp.txt +++ b/Documentation/networking/dccp.txt @@ -18,8 +18,8 @@ Introduction Datagram Congestion Control Protocol (DCCP) is an unreliable, connection oriented protocol designed to solve issues present in UDP and TCP, particularly for real-time and multimedia (streaming) traffic. -It divides into a base protocol (RFC 4340) and plugable congestion control -modules called CCIDs. Like plugable TCP congestion control, at least one CCID +It divides into a base protocol (RFC 4340) and pluggable congestion control +modules called CCIDs. Like pluggable TCP congestion control, at least one CCID needs to be enabled in order for the protocol to function properly. In the Linux implementation, this is the TCP-like CCID2 (RFC 4341). Additional CCIDs, such as the TCP-friendly CCID3 (RFC 4342), are optional. @@ -38,11 +38,11 @@ The Linux DCCP implementation does not currently support all the features that a specified in RFCs 4340...42. The known bugs are at: - http://linux-net.osdl.org/index.php/TODO#DCCP + http://www.linuxfoundation.org/collaborate/workgroups/networking/todo#DCCP For more up-to-date versions of the DCCP implementation, please consider using the experimental DCCP test tree; instructions for checking this out are on: -http://linux-net.osdl.org/index.php/DCCP_Testing#Experimental_DCCP_source_tree +http://www.linuxfoundation.org/collaborate/workgroups/networking/dccp_testing#Experimental_DCCP_source_tree Socket options @@ -86,7 +86,7 @@ built-in CCIDs. DCCP_SOCKOPT_CCID is write-only and sets both the TX and RX CCIDs at the same time, combining the operation of the next two socket options. This option is -preferrable over the latter two, since often applications will use the same +preferable over the latter two, since often applications will use the same type of CCID for both directions; and mixed use of CCIDs is not currently well understood. This socket option takes as argument at least one uint8_t value, or an array of uint8_t values, which must match available CCIDS (see above). CCIDs @@ -167,6 +167,7 @@ rx_ccid = 2 seq_window = 100 The initial sequence window (sec. 7.5.2) of the sender. This influences the local ackno validity and the remote seqno validity windows (7.5.1). + Values in the range Wmin = 32 (RFC 4340, 7.5.2) up to 2^32-1 can be set. tx_qlen = 5 The size of the transmit buffer in packets. A value of 0 corresponds diff --git a/Documentation/networking/depca.txt b/Documentation/networking/depca.txt deleted file mode 100644 index 24c6b26e565..00000000000 --- a/Documentation/networking/depca.txt +++ /dev/null @@ -1,92 +0,0 @@ - -DE10x -===== - -Memory Addresses: - - SW1 SW2 SW3 SW4 -64K on on on on d0000 dbfff - off on on on c0000 cbfff - off off on on e0000 ebfff - -32K on on off on d8000 dbfff - off on off on c8000 cbfff - off off off on e8000 ebfff - -DBR ROM on on dc000 dffff - off on cc000 cffff - off off ec000 effff - -Note that the 2K mode is set by SW3/SW4 on/off or off/off. Address -assignment is through the RBSA register. - -I/O Address: - SW5 -0x300 on -0x200 off - -Remote Boot: - SW6 -Disable on -Enable off - -Remote Boot Timeout: - SW7 -2.5min on -30s off - -IRQ: - SW8 SW9 SW10 SW11 SW12 -2 on off off off off -3 off on off off off -4 off off on off off -5 off off off on off -7 off off off off on - -DE20x -===== - -Memory Size: - - SW3 SW4 -64K on on -32K off on -2K on off -2K off off - -Start Addresses: - - SW1 SW2 SW3 SW4 -64K on on on on c0000 cffff - on off on on d0000 dffff - off on on on e0000 effff - -32K on on off off c8000 cffff - on off off off d8000 dffff - off on off off e8000 effff - -Illegal off off - - - - - -I/O Address: - SW5 -0x300 on -0x200 off - -Remote Boot: - SW6 -Disable on -Enable off - -Remote Boot Timeout: - SW7 -2.5min on -30s off - -IRQ: - SW8 SW9 SW10 SW11 SW12 -5 on off off off off -9 off on off off off -10 off off on off off -11 off off off on off -15 off off off off on - diff --git a/Documentation/networking/dl2k.txt b/Documentation/networking/dl2k.txt index 10e8490fa40..cba74f7a3ab 100644 --- a/Documentation/networking/dl2k.txt +++ b/Documentation/networking/dl2k.txt @@ -45,12 +45,13 @@ Now eth0 should active, you can test it by "ping" or get more information by "ifconfig". If tested ok, continue the next step. 4. cp dl2k.ko /lib/modules/`uname -r`/kernel/drivers/net -5. Add the following line to /etc/modprobe.conf: +5. Add the following line to /etc/modprobe.d/dl2k.conf: alias eth0 dl2k -6. Run "netconfig" or "netconf" to create configuration script ifcfg-eth0 +6. Run depmod to updated module indexes. +7. Run "netconfig" or "netconf" to create configuration script ifcfg-eth0 located at /etc/sysconfig/network-scripts or create it manually. [see - Configuration Script Sample] -7. Driver will automatically load and configure at next boot time. +8. Driver will automatically load and configure at next boot time. Compiling the Driver ==================== @@ -154,8 +155,8 @@ Installing the Driver ----------------- 1. Copy dl2k.o to the network modules directory, typically /lib/modules/2.x.x-xx/net or /lib/modules/2.x.x/kernel/drivers/net. - 2. Locate the boot module configuration file, most commonly modprobe.conf - or modules.conf (for 2.4) in the /etc directory. Add the following lines: + 2. Locate the boot module configuration file, most commonly in the + /etc/modprobe.d/ directory. Add the following lines: alias ethx dl2k options dl2k <optional parameters> diff --git a/Documentation/networking/dmfe.txt b/Documentation/networking/dmfe.txt index 8006c227fda..25320bf19c8 100644 --- a/Documentation/networking/dmfe.txt +++ b/Documentation/networking/dmfe.txt @@ -1,3 +1,5 @@ +Note: This driver doesn't have a maintainer. + Davicom DM9102(A)/DM9132/DM9801 fast ethernet driver for Linux. This program is free software; you can redistribute it and/or @@ -55,7 +57,6 @@ Test and make sure PCI latency is now correct for all cases. Authors: Sten Wang <sten_wang@davicom.com.tw > : Original Author -Tobias Ringstrom <tori@unhappy.mine.nu> : Current Maintainer Contributors: diff --git a/Documentation/networking/dns_resolver.txt b/Documentation/networking/dns_resolver.txt index aefd1e68180..d86adcdae42 100644 --- a/Documentation/networking/dns_resolver.txt +++ b/Documentation/networking/dns_resolver.txt @@ -61,7 +61,6 @@ before the more general line given above as the first match is the one taken. create dns_resolver foo:* * /usr/sbin/dns.foo %k - ===== USAGE ===== @@ -103,6 +102,18 @@ implemented in the module can be called after doing: If _expiry is non-NULL, the expiry time (TTL) of the result will be returned also. +The kernel maintains an internal keyring in which it caches looked up keys. +This can be cleared by any process that has the CAP_SYS_ADMIN capability by +the use of KEYCTL_KEYRING_CLEAR on the keyring ID. + + +=============================== +READING DNS KEYS FROM USERSPACE +=============================== + +Keys of dns_resolver type can be read from userspace using keyctl_read() or +"keyctl read/print/pipe". + ========= MECHANISM @@ -132,8 +143,8 @@ the key will be discarded and recreated when the data it holds has expired. dns_query() returns a copy of the value attached to the key, or an error if that is indicated instead. -See <file:Documentation/keys-request-key.txt> for further information about -request-key function. +See <file:Documentation/security/keys-request-key.txt> for further +information about request-key function. ========= diff --git a/Documentation/networking/driver.txt b/Documentation/networking/driver.txt index 03283daa64f..da59e288413 100644 --- a/Documentation/networking/driver.txt +++ b/Documentation/networking/driver.txt @@ -2,16 +2,16 @@ Document about softnet driver issues Transmit path guidelines: -1) The hard_start_xmit method must never return '1' under any - normal circumstances. It is considered a hard error unless +1) The ndo_start_xmit method must not return NETDEV_TX_BUSY under + any normal circumstances. It is considered a hard error unless there is no way your device can tell ahead of time when it's transmit function will become busy. Instead it must maintain the queue properly. For example, for a driver implementing scatter-gather this means: - static int drv_hard_start_xmit(struct sk_buff *skb, - struct net_device *dev) + static netdev_tx_t drv_hard_start_xmit(struct sk_buff *skb, + struct net_device *dev) { struct drv *dp = netdev_priv(dev); @@ -23,7 +23,7 @@ Transmit path guidelines: unlock_tx(dp); printk(KERN_ERR PFX "%s: BUG! Tx Ring full when queue awake!\n", dev->name); - return 1; + return NETDEV_TX_BUSY; } ... queue packet to card ... @@ -35,6 +35,7 @@ Transmit path guidelines: ... unlock_tx(dp); ... + return NETDEV_TX_OK; } And then at the end of your TX reclamation event handling: @@ -58,15 +59,12 @@ Transmit path guidelines: TX_BUFFS_AVAIL(dp) > 0) netif_wake_queue(dp->dev); -2) Do not forget to update netdev->trans_start to jiffies after - each new tx packet is given to the hardware. - -3) A hard_start_xmit method must not modify the shared parts of a +2) An ndo_start_xmit method must not modify the shared parts of a cloned SKB. -4) Do not forget that once you return 0 from your hard_start_xmit - method, it is your driver's responsibility to free up the SKB - and in some finite amount of time. +3) Do not forget that once you return NETDEV_TX_OK from your + ndo_start_xmit method, it is your driver's responsibility to free + up the SKB and in some finite amount of time. For example, this means that it is not allowed for your TX mitigation scheme to let TX packets "hang out" in the TX @@ -74,8 +72,9 @@ Transmit path guidelines: This error can deadlock sockets waiting for send buffer room to be freed up. - If you return 1 from the hard_start_xmit method, you must not keep - any reference to that SKB and you must not attempt to free it up. + If you return NETDEV_TX_BUSY from the ndo_start_xmit method, you + must not keep any reference to that SKB and you must not attempt + to free it up. Probing guidelines: @@ -85,10 +84,10 @@ Probing guidelines: Close/stop guidelines: -1) After the dev->stop routine has been called, the hardware must +1) After the ndo_stop routine has been called, the hardware must not receive or transmit any data. All in flight packets must be aborted. If necessary, poll or wait for completion of any reset commands. -2) The dev->stop routine will be called by unregister_netdevice +2) The ndo_stop routine will be called by unregister_netdevice if device is still UP. diff --git a/Documentation/networking/e100.txt b/Documentation/networking/e100.txt index 944aa55e79f..f862cf3aff3 100644 --- a/Documentation/networking/e100.txt +++ b/Documentation/networking/e100.txt @@ -1,7 +1,7 @@ Linux* Base Driver for the Intel(R) PRO/100 Family of Adapters ============================================================== -November 15, 2005 +March 15, 2011 Contents ======== @@ -72,7 +72,7 @@ Tx Descriptors: Number of transmit descriptors. A transmit descriptor is a data ethtool -G eth? tx n, where n is the number of desired tx descriptors. Speed/Duplex: The driver auto-negotiates the link speed and duplex settings by - default. Ethtool can be used as follows to force speed/duplex. + default. The ethtool utility can be used as follows to force speed/duplex. ethtool -s eth? autoneg off speed {10|100} duplex {full|half} @@ -94,8 +94,8 @@ Additional Configurations Configuring a network driver to load properly when the system is started is distribution dependent. Typically, the configuration process involves adding - an alias line to /etc/modules.conf or /etc/modprobe.conf as well as editing - other system startup scripts and/or configuration files. Many popular Linux + an alias line to /etc/modprobe.d/*.conf as well as editing other system + startup scripts and/or configuration files. Many popular Linux distributions ship with tools to make these changes for you. To learn the proper way to configure a network device for your system, refer to your distribution documentation. If during this process you are asked for the @@ -103,7 +103,7 @@ Additional Configurations PRO/100 Family of Adapters is e100. As an example, if you install the e100 driver for two PRO/100 adapters - (eth0 and eth1), add the following to modules.conf or modprobe.conf: + (eth0 and eth1), add the following to a configuration file in /etc/modprobe.d/ alias eth0 e100 alias eth1 e100 @@ -122,34 +122,25 @@ Additional Configurations NOTE: This setting is not saved across reboots. - Ethtool + ethtool ------- The driver utilizes the ethtool interface for driver configuration and - diagnostics, as well as displaying statistical information. Ethtool + diagnostics, as well as displaying statistical information. The ethtool version 1.6 or later is required for this functionality. The latest release of ethtool can be found from - http://sourceforge.net/projects/gkernel. - - NOTE: Ethtool 1.6 only supports a limited set of ethtool options. Support - for a more complete ethtool feature set can be enabled by upgrading - ethtool to ethtool-1.8.1. - + http://ftp.kernel.org/pub/software/network/ethtool/ Enabling Wake on LAN* (WoL) --------------------------- - WoL is provided through the Ethtool* utility. Ethtool is included with Red - Hat* 8.0. For other Linux distributions, download and install Ethtool from - the following website: http://sourceforge.net/projects/gkernel. - - For instructions on enabling WoL with Ethtool, refer to the Ethtool man page. + WoL is provided through the ethtool* utility. For instructions on enabling + WoL with ethtool, refer to the ethtool man page. WoL will be enabled on the system during the next shut down or reboot. For this driver version, in order to enable WoL, the e100 driver must be loaded when shutting down or rebooting the system. - NAPI ---- diff --git a/Documentation/networking/e1000.txt b/Documentation/networking/e1000.txt index d9271e74e48..437b2099cce 100644 --- a/Documentation/networking/e1000.txt +++ b/Documentation/networking/e1000.txt @@ -1,8 +1,8 @@ -Linux* Base Driver for the Intel(R) PRO/1000 Family of Adapters -=============================================================== +Linux* Base Driver for Intel(R) Ethernet Network Connection +=========================================================== Intel Gigabit Linux driver. -Copyright(c) 1999 - 2010 Intel Corporation. +Copyright(c) 1999 - 2013 Intel Corporation. Contents ======== @@ -79,7 +79,7 @@ InterruptThrottleRate --------------------- (not supported on Intel(R) 82542, 82543 or 82544-based adapters) Valid Range: 0,1,3,4,100-100000 (0=off, 1=dynamic, 3=dynamic conservative, - 4=simplified balancing) + 4=simplified balancing) Default Value: 3 The driver can limit the amount of interrupts per second that the adapter @@ -124,8 +124,8 @@ InterruptThrottleRate is set to mode 1. In this mode, which operates the same as mode 3, the InterruptThrottleRate will be increased stepwise to 70000 for traffic in class "Lowest latency". -In simplified mode the interrupt rate is based on the ratio of Tx and -Rx traffic. If the bytes per second rate is approximately equal, the +In simplified mode the interrupt rate is based on the ratio of TX and +RX traffic. If the bytes per second rate is approximately equal, the interrupt rate will drop as low as 2000 interrupts per second. If the traffic is mostly transmit or mostly receive, the interrupt rate could be as high as 8000. @@ -245,7 +245,7 @@ NOTE: Depending on the available system resources, the request for a TxDescriptorStep ---------------- Valid Range: 1 (use every Tx Descriptor) - 4 (use every 4th Tx Descriptor) + 4 (use every 4th Tx Descriptor) Default Value: 1 (use every Tx Descriptor) @@ -312,7 +312,7 @@ Valid Range: 0-xxxxxxx (0=off) Default Value: 256 Usage: insmod e1000.ko copybreak=128 -Driver copies all packets below or equaling this size to a fresh Rx +Driver copies all packets below or equaling this size to a fresh RX buffer before handing it up the stack. This parameter is different than other parameters, in that it is a @@ -420,26 +420,26 @@ Additional Configurations - The maximum MTU setting for Jumbo Frames is 16110. This value coincides with the maximum Jumbo Frames size of 16128. - - Using Jumbo Frames at 10 or 100 Mbps may result in poor performance or - loss of link. + - Using Jumbo frames at 10 or 100 Mbps is not supported and may result in + poor performance or loss of link. - Adapters based on the Intel(R) 82542 and 82573V/E controller do not support Jumbo Frames. These correspond to the following product names: Intel(R) PRO/1000 Gigabit Server Adapter Intel(R) PRO/1000 PM Network Connection - Ethtool + ethtool ------- The driver utilizes the ethtool interface for driver configuration and - diagnostics, as well as displaying statistical information. Ethtool + diagnostics, as well as displaying statistical information. The ethtool version 1.6 or later is required for this functionality. The latest release of ethtool can be found from - http://sourceforge.net/projects/gkernel. + http://ftp.kernel.org/pub/software/network/ethtool/ Enabling Wake on LAN* (WoL) --------------------------- - WoL is configured through the Ethtool* utility. + WoL is configured through the ethtool* utility. WoL will be enabled on the system during the next shut down or reboot. For this driver version, in order to enable WoL, the e1000 driver must be diff --git a/Documentation/networking/e1000e.txt b/Documentation/networking/e1000e.txt index 6aa048badf3..ad2d9f38ce1 100644 --- a/Documentation/networking/e1000e.txt +++ b/Documentation/networking/e1000e.txt @@ -1,8 +1,8 @@ -Linux* Driver for Intel(R) Network Connection -=============================================================== +Linux* Driver for Intel(R) Ethernet Network Connection +====================================================== Intel Gigabit Linux driver. -Copyright(c) 1999 - 2010 Intel Corporation. +Copyright(c) 1999 - 2013 Intel Corporation. Contents ======== @@ -61,6 +61,12 @@ per second, even if more packets have come in. This reduces interrupt load on the system and can lower CPU utilization under heavy load, but will increase latency as packets are not processed as quickly. +The default behaviour of the driver previously assumed a static +InterruptThrottleRate value of 8000, providing a good fallback value for +all traffic types, but lacking in small packet performance and latency. +The hardware can handle many more small packets per second however, and +for this reason an adaptive interrupt moderation algorithm was implemented. + The driver has two adaptive modes (setting 1 or 3) in which it dynamically adjusts the InterruptThrottleRate value based on the traffic that it receives. After determining the type of incoming traffic in the last @@ -86,8 +92,8 @@ InterruptThrottleRate is set to mode 1. In this mode, which operates the same as mode 3, the InterruptThrottleRate will be increased stepwise to 70000 for traffic in class "Lowest latency". -In simplified mode the interrupt rate is based on the ratio of Tx and -Rx traffic. If the bytes per second rate is approximately equal the +In simplified mode the interrupt rate is based on the ratio of TX and +RX traffic. If the bytes per second rate is approximately equal, the interrupt rate will drop as low as 2000 interrupts per second. If the traffic is mostly transmit or mostly receive, the interrupt rate could be as high as 8000. @@ -177,7 +183,7 @@ Copybreak Valid Range: 0-xxxxxxx (0=off) Default Value: 256 -Driver copies all packets below or equaling this size to a fresh Rx +Driver copies all packets below or equaling this size to a fresh RX buffer before handing it up the stack. This parameter is different than other parameters, in that it is a @@ -223,17 +229,17 @@ loading or enabling the driver, try disabling this feature. WriteProtectNVM --------------- -Valid Range: 0-1 -Default Value: 1 (enabled) - -Set the hardware to ignore all write/erase cycles to the GbE region in the -ICHx NVM (non-volatile memory). This feature can be disabled by the -WriteProtectNVM module parameter (enabled by default) only after a hardware -reset, but the machine must be power cycled before trying to enable writes. - -Note: the kernel boot option iomem=relaxed may need to be set if the kernel -config option CONFIG_STRICT_DEVMEM=y, if the root user wants to write the -NVM from user space via ethtool. +Valid Range: 0,1 +Default Value: 1 + +If set to 1, configure the hardware to ignore all write/erase cycles to the +GbE region in the ICHx NVM (in order to prevent accidental corruption of the +NVM). This feature can be disabled by setting the parameter to 0 during initial +driver load. +NOTE: The machine must be power cycled (full off/on) when enabling NVM writes +via setting the parameter to zero. Once the NVM has been locked (via the +parameter at 1 when the driver loads) it cannot be unlocked except via power +cycle. Additional Configurations ========================= @@ -253,38 +259,42 @@ Additional Configurations - The maximum MTU setting for Jumbo Frames is 9216. This value coincides with the maximum Jumbo Frames size of 9234 bytes. - - Using Jumbo Frames at 10 or 100 Mbps is not supported and may result in + - Using Jumbo frames at 10 or 100 Mbps is not supported and may result in poor performance or loss of link. - Some adapters limit Jumbo Frames sized packets to a maximum of 4096 bytes and some adapters do not support Jumbo Frames. + - Jumbo Frames cannot be configured on an 82579-based Network device, if + MACSec is enabled on the system. - Ethtool + ethtool ------- The driver utilizes the ethtool interface for driver configuration and diagnostics, as well as displaying statistical information. We - strongly recommend downloading the latest version of Ethtool at: + strongly recommend downloading the latest version of ethtool at: + + http://ftp.kernel.org/pub/software/network/ethtool/ - http://sourceforge.net/projects/gkernel. + NOTE: When validating enable/disable tests on some parts (82578, for example) + you need to add a few seconds between tests when working with ethtool. Speed and Duplex ---------------- - Speed and Duplex are configured through the Ethtool* utility. For - instructions, refer to the Ethtool man page. + Speed and Duplex are configured through the ethtool* utility. For + instructions, refer to the ethtool man page. Enabling Wake on LAN* (WoL) --------------------------- - WoL is configured through the Ethtool* utility. For instructions on - enabling WoL with Ethtool, refer to the Ethtool man page. + WoL is configured through the ethtool* utility. For instructions on + enabling WoL with ethtool, refer to the ethtool man page. WoL will be enabled on the system during the next shut down or reboot. For this driver version, in order to enable WoL, the e1000e driver must be loaded when shutting down or rebooting the system. In most cases Wake On LAN is only supported on port A for multiple port - adapters. To verify if a port supports Wake on LAN run ethtool eth<X>. - + adapters. To verify if a port supports Wake on Lan run ethtool eth<X>. Support ======= diff --git a/Documentation/networking/ewrk3.txt b/Documentation/networking/ewrk3.txt deleted file mode 100644 index 90e9e5f16e6..00000000000 --- a/Documentation/networking/ewrk3.txt +++ /dev/null @@ -1,46 +0,0 @@ -The EtherWORKS 3 driver in this distribution is designed to work with all -kernels > 1.1.33 (approx) and includes tools in the 'ewrk3tools' -subdirectory to allow set up of the card, similar to the MSDOS -'NICSETUP.EXE' tools provided on the DOS drivers disk (type 'make' in that -subdirectory to make the tools). - -The supported cards are DE203, DE204 and DE205. All other cards are NOT -supported - refer to 'depca.c' for running the LANCE based network cards and -'de4x5.c' for the DIGITAL Semiconductor PCI chip based adapters from -Digital. - -The ability to load this driver as a loadable module has been included and -used extensively during the driver development (to save those long reboot -sequences). To utilise this ability, you have to do 8 things: - - 0) have a copy of the loadable modules code installed on your system. - 1) copy ewrk3.c from the /linux/drivers/net directory to your favourite - temporary directory. - 2) edit the source code near line 1898 to reflect the I/O address and - IRQ you're using. - 3) compile ewrk3.c, but include -DMODULE in the command line to ensure - that the correct bits are compiled (see end of source code). - 4) if you are wanting to add a new card, goto 5. Otherwise, recompile a - kernel with the ewrk3 configuration turned off and reboot. - 5) insmod ewrk3.o - [Alan Cox: Changed this so you can insmod ewrk3.o irq=x io=y] - [Adam Kropelin: Multiple cards now supported by irq=x1,x2 io=y1,y2] - 6) run the net startup bits for your new eth?? interface manually - (usually /etc/rc.inet[12] at boot time). - 7) enjoy! - - Note that autoprobing is not allowed in loadable modules - the system is - already up and running and you're messing with interrupts. - - To unload a module, turn off the associated interface - 'ifconfig eth?? down' then 'rmmod ewrk3'. - -The performance we've achieved so far has been measured through the 'ttcp' -tool at 975kB/s. This measures the total TCP stack performance which -includes the card, so don't expect to get much nearer the 1.25MB/s -theoretical Ethernet rate. - - -Enjoy! - -Dave diff --git a/Documentation/networking/filter.txt b/Documentation/networking/filter.txt index bbf2005270b..ee78eba78a9 100644 --- a/Documentation/networking/filter.txt +++ b/Documentation/networking/filter.txt @@ -1,42 +1,1027 @@ -filter.txt: Linux Socket Filtering -Written by: Jay Schulist <jschlst@samba.org> +Linux Socket Filtering aka Berkeley Packet Filter (BPF) +======================================================= Introduction -============ - - Linux Socket Filtering is derived from the Berkeley -Packet Filter. There are some distinct differences between -the BSD and Linux Kernel Filtering. - -Linux Socket Filtering (LSF) allows a user-space program to -attach a filter onto any socket and allow or disallow certain -types of data to come through the socket. LSF follows exactly -the same filter code structure as the BSD Berkeley Packet Filter -(BPF), so referring to the BSD bpf.4 manpage is very helpful in -creating filters. - -LSF is much simpler than BPF. One does not have to worry about -devices or anything like that. You simply create your filter -code, send it to the kernel via the SO_ATTACH_FILTER ioctl and -if your filter code passes the kernel check on it, you then -immediately begin filtering data on that socket. - -You can also detach filters from your socket via the -SO_DETACH_FILTER ioctl. This will probably not be used much -since when you close a socket that has a filter on it the -filter is automagically removed. The other less common case -may be adding a different filter on the same socket where you had another -filter that is still running: the kernel takes care of removing -the old one and placing your new one in its place, assuming your -filter has passed the checks, otherwise if it fails the old filter -will remain on that socket. - -Examples -======== - -Ioctls- -setsockopt(sockfd, SOL_SOCKET, SO_ATTACH_FILTER, &Filter, sizeof(Filter)); -setsockopt(sockfd, SOL_SOCKET, SO_DETACH_FILTER, &value, sizeof(value)); - -See the BSD bpf.4 manpage and the BSD Packet Filter paper written by -Steven McCanne and Van Jacobson of Lawrence Berkeley Laboratory. +------------ + +Linux Socket Filtering (LSF) is derived from the Berkeley Packet Filter. +Though there are some distinct differences between the BSD and Linux +Kernel filtering, but when we speak of BPF or LSF in Linux context, we +mean the very same mechanism of filtering in the Linux kernel. + +BPF allows a user-space program to attach a filter onto any socket and +allow or disallow certain types of data to come through the socket. LSF +follows exactly the same filter code structure as BSD's BPF, so referring +to the BSD bpf.4 manpage is very helpful in creating filters. + +On Linux, BPF is much simpler than on BSD. One does not have to worry +about devices or anything like that. You simply create your filter code, +send it to the kernel via the SO_ATTACH_FILTER option and if your filter +code passes the kernel check on it, you then immediately begin filtering +data on that socket. + +You can also detach filters from your socket via the SO_DETACH_FILTER +option. This will probably not be used much since when you close a socket +that has a filter on it the filter is automagically removed. The other +less common case may be adding a different filter on the same socket where +you had another filter that is still running: the kernel takes care of +removing the old one and placing your new one in its place, assuming your +filter has passed the checks, otherwise if it fails the old filter will +remain on that socket. + +SO_LOCK_FILTER option allows to lock the filter attached to a socket. Once +set, a filter cannot be removed or changed. This allows one process to +setup a socket, attach a filter, lock it then drop privileges and be +assured that the filter will be kept until the socket is closed. + +The biggest user of this construct might be libpcap. Issuing a high-level +filter command like `tcpdump -i em1 port 22` passes through the libpcap +internal compiler that generates a structure that can eventually be loaded +via SO_ATTACH_FILTER to the kernel. `tcpdump -i em1 port 22 -ddd` +displays what is being placed into this structure. + +Although we were only speaking about sockets here, BPF in Linux is used +in many more places. There's xt_bpf for netfilter, cls_bpf in the kernel +qdisc layer, SECCOMP-BPF (SECure COMPuting [1]), and lots of other places +such as team driver, PTP code, etc where BPF is being used. + + [1] Documentation/prctl/seccomp_filter.txt + +Original BPF paper: + +Steven McCanne and Van Jacobson. 1993. The BSD packet filter: a new +architecture for user-level packet capture. In Proceedings of the +USENIX Winter 1993 Conference Proceedings on USENIX Winter 1993 +Conference Proceedings (USENIX'93). USENIX Association, Berkeley, +CA, USA, 2-2. [http://www.tcpdump.org/papers/bpf-usenix93.pdf] + +Structure +--------- + +User space applications include <linux/filter.h> which contains the +following relevant structures: + +struct sock_filter { /* Filter block */ + __u16 code; /* Actual filter code */ + __u8 jt; /* Jump true */ + __u8 jf; /* Jump false */ + __u32 k; /* Generic multiuse field */ +}; + +Such a structure is assembled as an array of 4-tuples, that contains +a code, jt, jf and k value. jt and jf are jump offsets and k a generic +value to be used for a provided code. + +struct sock_fprog { /* Required for SO_ATTACH_FILTER. */ + unsigned short len; /* Number of filter blocks */ + struct sock_filter __user *filter; +}; + +For socket filtering, a pointer to this structure (as shown in +follow-up example) is being passed to the kernel through setsockopt(2). + +Example +------- + +#include <sys/socket.h> +#include <sys/types.h> +#include <arpa/inet.h> +#include <linux/if_ether.h> +/* ... */ + +/* From the example above: tcpdump -i em1 port 22 -dd */ +struct sock_filter code[] = { + { 0x28, 0, 0, 0x0000000c }, + { 0x15, 0, 8, 0x000086dd }, + { 0x30, 0, 0, 0x00000014 }, + { 0x15, 2, 0, 0x00000084 }, + { 0x15, 1, 0, 0x00000006 }, + { 0x15, 0, 17, 0x00000011 }, + { 0x28, 0, 0, 0x00000036 }, + { 0x15, 14, 0, 0x00000016 }, + { 0x28, 0, 0, 0x00000038 }, + { 0x15, 12, 13, 0x00000016 }, + { 0x15, 0, 12, 0x00000800 }, + { 0x30, 0, 0, 0x00000017 }, + { 0x15, 2, 0, 0x00000084 }, + { 0x15, 1, 0, 0x00000006 }, + { 0x15, 0, 8, 0x00000011 }, + { 0x28, 0, 0, 0x00000014 }, + { 0x45, 6, 0, 0x00001fff }, + { 0xb1, 0, 0, 0x0000000e }, + { 0x48, 0, 0, 0x0000000e }, + { 0x15, 2, 0, 0x00000016 }, + { 0x48, 0, 0, 0x00000010 }, + { 0x15, 0, 1, 0x00000016 }, + { 0x06, 0, 0, 0x0000ffff }, + { 0x06, 0, 0, 0x00000000 }, +}; + +struct sock_fprog bpf = { + .len = ARRAY_SIZE(code), + .filter = code, +}; + +sock = socket(PF_PACKET, SOCK_RAW, htons(ETH_P_ALL)); +if (sock < 0) + /* ... bail out ... */ + +ret = setsockopt(sock, SOL_SOCKET, SO_ATTACH_FILTER, &bpf, sizeof(bpf)); +if (ret < 0) + /* ... bail out ... */ + +/* ... */ +close(sock); + +The above example code attaches a socket filter for a PF_PACKET socket +in order to let all IPv4/IPv6 packets with port 22 pass. The rest will +be dropped for this socket. + +The setsockopt(2) call to SO_DETACH_FILTER doesn't need any arguments +and SO_LOCK_FILTER for preventing the filter to be detached, takes an +integer value with 0 or 1. + +Note that socket filters are not restricted to PF_PACKET sockets only, +but can also be used on other socket families. + +Summary of system calls: + + * setsockopt(sockfd, SOL_SOCKET, SO_ATTACH_FILTER, &val, sizeof(val)); + * setsockopt(sockfd, SOL_SOCKET, SO_DETACH_FILTER, &val, sizeof(val)); + * setsockopt(sockfd, SOL_SOCKET, SO_LOCK_FILTER, &val, sizeof(val)); + +Normally, most use cases for socket filtering on packet sockets will be +covered by libpcap in high-level syntax, so as an application developer +you should stick to that. libpcap wraps its own layer around all that. + +Unless i) using/linking to libpcap is not an option, ii) the required BPF +filters use Linux extensions that are not supported by libpcap's compiler, +iii) a filter might be more complex and not cleanly implementable with +libpcap's compiler, or iv) particular filter codes should be optimized +differently than libpcap's internal compiler does; then in such cases +writing such a filter "by hand" can be of an alternative. For example, +xt_bpf and cls_bpf users might have requirements that could result in +more complex filter code, or one that cannot be expressed with libpcap +(e.g. different return codes for various code paths). Moreover, BPF JIT +implementors may wish to manually write test cases and thus need low-level +access to BPF code as well. + +BPF engine and instruction set +------------------------------ + +Under tools/net/ there's a small helper tool called bpf_asm which can +be used to write low-level filters for example scenarios mentioned in the +previous section. Asm-like syntax mentioned here has been implemented in +bpf_asm and will be used for further explanations (instead of dealing with +less readable opcodes directly, principles are the same). The syntax is +closely modelled after Steven McCanne's and Van Jacobson's BPF paper. + +The BPF architecture consists of the following basic elements: + + Element Description + + A 32 bit wide accumulator + X 32 bit wide X register + M[] 16 x 32 bit wide misc registers aka "scratch memory + store", addressable from 0 to 15 + +A program, that is translated by bpf_asm into "opcodes" is an array that +consists of the following elements (as already mentioned): + + op:16, jt:8, jf:8, k:32 + +The element op is a 16 bit wide opcode that has a particular instruction +encoded. jt and jf are two 8 bit wide jump targets, one for condition +"jump if true", the other one "jump if false". Eventually, element k +contains a miscellaneous argument that can be interpreted in different +ways depending on the given instruction in op. + +The instruction set consists of load, store, branch, alu, miscellaneous +and return instructions that are also represented in bpf_asm syntax. This +table lists all bpf_asm instructions available resp. what their underlying +opcodes as defined in linux/filter.h stand for: + + Instruction Addressing mode Description + + ld 1, 2, 3, 4, 10 Load word into A + ldi 4 Load word into A + ldh 1, 2 Load half-word into A + ldb 1, 2 Load byte into A + ldx 3, 4, 5, 10 Load word into X + ldxi 4 Load word into X + ldxb 5 Load byte into X + + st 3 Store A into M[] + stx 3 Store X into M[] + + jmp 6 Jump to label + ja 6 Jump to label + jeq 7, 8 Jump on k == A + jneq 8 Jump on k != A + jne 8 Jump on k != A + jlt 8 Jump on k < A + jle 8 Jump on k <= A + jgt 7, 8 Jump on k > A + jge 7, 8 Jump on k >= A + jset 7, 8 Jump on k & A + + add 0, 4 A + <x> + sub 0, 4 A - <x> + mul 0, 4 A * <x> + div 0, 4 A / <x> + mod 0, 4 A % <x> + neg 0, 4 !A + and 0, 4 A & <x> + or 0, 4 A | <x> + xor 0, 4 A ^ <x> + lsh 0, 4 A << <x> + rsh 0, 4 A >> <x> + + tax Copy A into X + txa Copy X into A + + ret 4, 9 Return + +The next table shows addressing formats from the 2nd column: + + Addressing mode Syntax Description + + 0 x/%x Register X + 1 [k] BHW at byte offset k in the packet + 2 [x + k] BHW at the offset X + k in the packet + 3 M[k] Word at offset k in M[] + 4 #k Literal value stored in k + 5 4*([k]&0xf) Lower nibble * 4 at byte offset k in the packet + 6 L Jump label L + 7 #k,Lt,Lf Jump to Lt if true, otherwise jump to Lf + 8 #k,Lt Jump to Lt if predicate is true + 9 a/%a Accumulator A + 10 extension BPF extension + +The Linux kernel also has a couple of BPF extensions that are used along +with the class of load instructions by "overloading" the k argument with +a negative offset + a particular extension offset. The result of such BPF +extensions are loaded into A. + +Possible BPF extensions are shown in the following table: + + Extension Description + + len skb->len + proto skb->protocol + type skb->pkt_type + poff Payload start offset + ifidx skb->dev->ifindex + nla Netlink attribute of type X with offset A + nlan Nested Netlink attribute of type X with offset A + mark skb->mark + queue skb->queue_mapping + hatype skb->dev->type + rxhash skb->hash + cpu raw_smp_processor_id() + vlan_tci vlan_tx_tag_get(skb) + vlan_pr vlan_tx_tag_present(skb) + rand prandom_u32() + +These extensions can also be prefixed with '#'. +Examples for low-level BPF: + +** ARP packets: + + ldh [12] + jne #0x806, drop + ret #-1 + drop: ret #0 + +** IPv4 TCP packets: + + ldh [12] + jne #0x800, drop + ldb [23] + jneq #6, drop + ret #-1 + drop: ret #0 + +** (Accelerated) VLAN w/ id 10: + + ld vlan_tci + jneq #10, drop + ret #-1 + drop: ret #0 + +** icmp random packet sampling, 1 in 4 + ldh [12] + jne #0x800, drop + ldb [23] + jneq #1, drop + # get a random uint32 number + ld rand + mod #4 + jneq #1, drop + ret #-1 + drop: ret #0 + +** SECCOMP filter example: + + ld [4] /* offsetof(struct seccomp_data, arch) */ + jne #0xc000003e, bad /* AUDIT_ARCH_X86_64 */ + ld [0] /* offsetof(struct seccomp_data, nr) */ + jeq #15, good /* __NR_rt_sigreturn */ + jeq #231, good /* __NR_exit_group */ + jeq #60, good /* __NR_exit */ + jeq #0, good /* __NR_read */ + jeq #1, good /* __NR_write */ + jeq #5, good /* __NR_fstat */ + jeq #9, good /* __NR_mmap */ + jeq #14, good /* __NR_rt_sigprocmask */ + jeq #13, good /* __NR_rt_sigaction */ + jeq #35, good /* __NR_nanosleep */ + bad: ret #0 /* SECCOMP_RET_KILL */ + good: ret #0x7fff0000 /* SECCOMP_RET_ALLOW */ + +The above example code can be placed into a file (here called "foo"), and +then be passed to the bpf_asm tool for generating opcodes, output that xt_bpf +and cls_bpf understands and can directly be loaded with. Example with above +ARP code: + +$ ./bpf_asm foo +4,40 0 0 12,21 0 1 2054,6 0 0 4294967295,6 0 0 0, + +In copy and paste C-like output: + +$ ./bpf_asm -c foo +{ 0x28, 0, 0, 0x0000000c }, +{ 0x15, 0, 1, 0x00000806 }, +{ 0x06, 0, 0, 0xffffffff }, +{ 0x06, 0, 0, 0000000000 }, + +In particular, as usage with xt_bpf or cls_bpf can result in more complex BPF +filters that might not be obvious at first, it's good to test filters before +attaching to a live system. For that purpose, there's a small tool called +bpf_dbg under tools/net/ in the kernel source directory. This debugger allows +for testing BPF filters against given pcap files, single stepping through the +BPF code on the pcap's packets and to do BPF machine register dumps. + +Starting bpf_dbg is trivial and just requires issuing: + +# ./bpf_dbg + +In case input and output do not equal stdin/stdout, bpf_dbg takes an +alternative stdin source as a first argument, and an alternative stdout +sink as a second one, e.g. `./bpf_dbg test_in.txt test_out.txt`. + +Other than that, a particular libreadline configuration can be set via +file "~/.bpf_dbg_init" and the command history is stored in the file +"~/.bpf_dbg_history". + +Interaction in bpf_dbg happens through a shell that also has auto-completion +support (follow-up example commands starting with '>' denote bpf_dbg shell). +The usual workflow would be to ... + +> load bpf 6,40 0 0 12,21 0 3 2048,48 0 0 23,21 0 1 1,6 0 0 65535,6 0 0 0 + Loads a BPF filter from standard output of bpf_asm, or transformed via + e.g. `tcpdump -iem1 -ddd port 22 | tr '\n' ','`. Note that for JIT + debugging (next section), this command creates a temporary socket and + loads the BPF code into the kernel. Thus, this will also be useful for + JIT developers. + +> load pcap foo.pcap + Loads standard tcpdump pcap file. + +> run [<n>] +bpf passes:1 fails:9 + Runs through all packets from a pcap to account how many passes and fails + the filter will generate. A limit of packets to traverse can be given. + +> disassemble +l0: ldh [12] +l1: jeq #0x800, l2, l5 +l2: ldb [23] +l3: jeq #0x1, l4, l5 +l4: ret #0xffff +l5: ret #0 + Prints out BPF code disassembly. + +> dump +/* { op, jt, jf, k }, */ +{ 0x28, 0, 0, 0x0000000c }, +{ 0x15, 0, 3, 0x00000800 }, +{ 0x30, 0, 0, 0x00000017 }, +{ 0x15, 0, 1, 0x00000001 }, +{ 0x06, 0, 0, 0x0000ffff }, +{ 0x06, 0, 0, 0000000000 }, + Prints out C-style BPF code dump. + +> breakpoint 0 +breakpoint at: l0: ldh [12] +> breakpoint 1 +breakpoint at: l1: jeq #0x800, l2, l5 + ... + Sets breakpoints at particular BPF instructions. Issuing a `run` command + will walk through the pcap file continuing from the current packet and + break when a breakpoint is being hit (another `run` will continue from + the currently active breakpoint executing next instructions): + + > run + -- register dump -- + pc: [0] <-- program counter + code: [40] jt[0] jf[0] k[12] <-- plain BPF code of current instruction + curr: l0: ldh [12] <-- disassembly of current instruction + A: [00000000][0] <-- content of A (hex, decimal) + X: [00000000][0] <-- content of X (hex, decimal) + M[0,15]: [00000000][0] <-- folded content of M (hex, decimal) + -- packet dump -- <-- Current packet from pcap (hex) + len: 42 + 0: 00 19 cb 55 55 a4 00 14 a4 43 78 69 08 06 00 01 + 16: 08 00 06 04 00 01 00 14 a4 43 78 69 0a 3b 01 26 + 32: 00 00 00 00 00 00 0a 3b 01 01 + (breakpoint) + > + +> breakpoint +breakpoints: 0 1 + Prints currently set breakpoints. + +> step [-<n>, +<n>] + Performs single stepping through the BPF program from the current pc + offset. Thus, on each step invocation, above register dump is issued. + This can go forwards and backwards in time, a plain `step` will break + on the next BPF instruction, thus +1. (No `run` needs to be issued here.) + +> select <n> + Selects a given packet from the pcap file to continue from. Thus, on + the next `run` or `step`, the BPF program is being evaluated against + the user pre-selected packet. Numbering starts just as in Wireshark + with index 1. + +> quit +# + Exits bpf_dbg. + +JIT compiler +------------ + +The Linux kernel has a built-in BPF JIT compiler for x86_64, SPARC, PowerPC, +ARM and s390 and can be enabled through CONFIG_BPF_JIT. The JIT compiler is +transparently invoked for each attached filter from user space or for internal +kernel users if it has been previously enabled by root: + + echo 1 > /proc/sys/net/core/bpf_jit_enable + +For JIT developers, doing audits etc, each compile run can output the generated +opcode image into the kernel log via: + + echo 2 > /proc/sys/net/core/bpf_jit_enable + +Example output from dmesg: + +[ 3389.935842] flen=6 proglen=70 pass=3 image=ffffffffa0069c8f +[ 3389.935847] JIT code: 00000000: 55 48 89 e5 48 83 ec 60 48 89 5d f8 44 8b 4f 68 +[ 3389.935849] JIT code: 00000010: 44 2b 4f 6c 4c 8b 87 d8 00 00 00 be 0c 00 00 00 +[ 3389.935850] JIT code: 00000020: e8 1d 94 ff e0 3d 00 08 00 00 75 16 be 17 00 00 +[ 3389.935851] JIT code: 00000030: 00 e8 28 94 ff e0 83 f8 01 75 07 b8 ff ff 00 00 +[ 3389.935852] JIT code: 00000040: eb 02 31 c0 c9 c3 + +In the kernel source tree under tools/net/, there's bpf_jit_disasm for +generating disassembly out of the kernel log's hexdump: + +# ./bpf_jit_disasm +70 bytes emitted from JIT compiler (pass:3, flen:6) +ffffffffa0069c8f + <x>: + 0: push %rbp + 1: mov %rsp,%rbp + 4: sub $0x60,%rsp + 8: mov %rbx,-0x8(%rbp) + c: mov 0x68(%rdi),%r9d + 10: sub 0x6c(%rdi),%r9d + 14: mov 0xd8(%rdi),%r8 + 1b: mov $0xc,%esi + 20: callq 0xffffffffe0ff9442 + 25: cmp $0x800,%eax + 2a: jne 0x0000000000000042 + 2c: mov $0x17,%esi + 31: callq 0xffffffffe0ff945e + 36: cmp $0x1,%eax + 39: jne 0x0000000000000042 + 3b: mov $0xffff,%eax + 40: jmp 0x0000000000000044 + 42: xor %eax,%eax + 44: leaveq + 45: retq + +Issuing option `-o` will "annotate" opcodes to resulting assembler +instructions, which can be very useful for JIT developers: + +# ./bpf_jit_disasm -o +70 bytes emitted from JIT compiler (pass:3, flen:6) +ffffffffa0069c8f + <x>: + 0: push %rbp + 55 + 1: mov %rsp,%rbp + 48 89 e5 + 4: sub $0x60,%rsp + 48 83 ec 60 + 8: mov %rbx,-0x8(%rbp) + 48 89 5d f8 + c: mov 0x68(%rdi),%r9d + 44 8b 4f 68 + 10: sub 0x6c(%rdi),%r9d + 44 2b 4f 6c + 14: mov 0xd8(%rdi),%r8 + 4c 8b 87 d8 00 00 00 + 1b: mov $0xc,%esi + be 0c 00 00 00 + 20: callq 0xffffffffe0ff9442 + e8 1d 94 ff e0 + 25: cmp $0x800,%eax + 3d 00 08 00 00 + 2a: jne 0x0000000000000042 + 75 16 + 2c: mov $0x17,%esi + be 17 00 00 00 + 31: callq 0xffffffffe0ff945e + e8 28 94 ff e0 + 36: cmp $0x1,%eax + 83 f8 01 + 39: jne 0x0000000000000042 + 75 07 + 3b: mov $0xffff,%eax + b8 ff ff 00 00 + 40: jmp 0x0000000000000044 + eb 02 + 42: xor %eax,%eax + 31 c0 + 44: leaveq + c9 + 45: retq + c3 + +For BPF JIT developers, bpf_jit_disasm, bpf_asm and bpf_dbg provides a useful +toolchain for developing and testing the kernel's JIT compiler. + +BPF kernel internals +-------------------- +Internally, for the kernel interpreter, a different instruction set +format with similar underlying principles from BPF described in previous +paragraphs is being used. However, the instruction set format is modelled +closer to the underlying architecture to mimic native instruction sets, so +that a better performance can be achieved (more details later). This new +ISA is called 'eBPF' or 'internal BPF' interchangeably. (Note: eBPF which +originates from [e]xtended BPF is not the same as BPF extensions! While +eBPF is an ISA, BPF extensions date back to classic BPF's 'overloading' +of BPF_LD | BPF_{B,H,W} | BPF_ABS instruction.) + +It is designed to be JITed with one to one mapping, which can also open up +the possibility for GCC/LLVM compilers to generate optimized eBPF code through +an eBPF backend that performs almost as fast as natively compiled code. + +The new instruction set was originally designed with the possible goal in +mind to write programs in "restricted C" and compile into eBPF with a optional +GCC/LLVM backend, so that it can just-in-time map to modern 64-bit CPUs with +minimal performance overhead over two steps, that is, C -> eBPF -> native code. + +Currently, the new format is being used for running user BPF programs, which +includes seccomp BPF, classic socket filters, cls_bpf traffic classifier, +team driver's classifier for its load-balancing mode, netfilter's xt_bpf +extension, PTP dissector/classifier, and much more. They are all internally +converted by the kernel into the new instruction set representation and run +in the eBPF interpreter. For in-kernel handlers, this all works transparently +by using sk_unattached_filter_create() for setting up the filter, resp. +sk_unattached_filter_destroy() for destroying it. The macro +SK_RUN_FILTER(filter, ctx) transparently invokes eBPF interpreter or JITed +code to run the filter. 'filter' is a pointer to struct sk_filter that we +got from sk_unattached_filter_create(), and 'ctx' the given context (e.g. +skb pointer). All constraints and restrictions from sk_chk_filter() apply +before a conversion to the new layout is being done behind the scenes! + +Currently, the classic BPF format is being used for JITing on most of the +architectures. Only x86-64 performs JIT compilation from eBPF instruction set, +however, future work will migrate other JIT compilers as well, so that they +will profit from the very same benefits. + +Some core changes of the new internal format: + +- Number of registers increase from 2 to 10: + + The old format had two registers A and X, and a hidden frame pointer. The + new layout extends this to be 10 internal registers and a read-only frame + pointer. Since 64-bit CPUs are passing arguments to functions via registers + the number of args from eBPF program to in-kernel function is restricted + to 5 and one register is used to accept return value from an in-kernel + function. Natively, x86_64 passes first 6 arguments in registers, aarch64/ + sparcv9/mips64 have 7 - 8 registers for arguments; x86_64 has 6 callee saved + registers, and aarch64/sparcv9/mips64 have 11 or more callee saved registers. + + Therefore, eBPF calling convention is defined as: + + * R0 - return value from in-kernel function, and exit value for eBPF program + * R1 - R5 - arguments from eBPF program to in-kernel function + * R6 - R9 - callee saved registers that in-kernel function will preserve + * R10 - read-only frame pointer to access stack + + Thus, all eBPF registers map one to one to HW registers on x86_64, aarch64, + etc, and eBPF calling convention maps directly to ABIs used by the kernel on + 64-bit architectures. + + On 32-bit architectures JIT may map programs that use only 32-bit arithmetic + and may let more complex programs to be interpreted. + + R0 - R5 are scratch registers and eBPF program needs spill/fill them if + necessary across calls. Note that there is only one eBPF program (== one + eBPF main routine) and it cannot call other eBPF functions, it can only + call predefined in-kernel functions, though. + +- Register width increases from 32-bit to 64-bit: + + Still, the semantics of the original 32-bit ALU operations are preserved + via 32-bit subregisters. All eBPF registers are 64-bit with 32-bit lower + subregisters that zero-extend into 64-bit if they are being written to. + That behavior maps directly to x86_64 and arm64 subregister definition, but + makes other JITs more difficult. + + 32-bit architectures run 64-bit internal BPF programs via interpreter. + Their JITs may convert BPF programs that only use 32-bit subregisters into + native instruction set and let the rest being interpreted. + + Operation is 64-bit, because on 64-bit architectures, pointers are also + 64-bit wide, and we want to pass 64-bit values in/out of kernel functions, + so 32-bit eBPF registers would otherwise require to define register-pair + ABI, thus, there won't be able to use a direct eBPF register to HW register + mapping and JIT would need to do combine/split/move operations for every + register in and out of the function, which is complex, bug prone and slow. + Another reason is the use of atomic 64-bit counters. + +- Conditional jt/jf targets replaced with jt/fall-through: + + While the original design has constructs such as "if (cond) jump_true; + else jump_false;", they are being replaced into alternative constructs like + "if (cond) jump_true; /* else fall-through */". + +- Introduces bpf_call insn and register passing convention for zero overhead + calls from/to other kernel functions: + + Before an in-kernel function call, the internal BPF program needs to + place function arguments into R1 to R5 registers to satisfy calling + convention, then the interpreter will take them from registers and pass + to in-kernel function. If R1 - R5 registers are mapped to CPU registers + that are used for argument passing on given architecture, the JIT compiler + doesn't need to emit extra moves. Function arguments will be in the correct + registers and BPF_CALL instruction will be JITed as single 'call' HW + instruction. This calling convention was picked to cover common call + situations without performance penalty. + + After an in-kernel function call, R1 - R5 are reset to unreadable and R0 has + a return value of the function. Since R6 - R9 are callee saved, their state + is preserved across the call. + + For example, consider three C functions: + + u64 f1() { return (*_f2)(1); } + u64 f2(u64 a) { return f3(a + 1, a); } + u64 f3(u64 a, u64 b) { return a - b; } + + GCC can compile f1, f3 into x86_64: + + f1: + movl $1, %edi + movq _f2(%rip), %rax + jmp *%rax + f3: + movq %rdi, %rax + subq %rsi, %rax + ret + + Function f2 in eBPF may look like: + + f2: + bpf_mov R2, R1 + bpf_add R1, 1 + bpf_call f3 + bpf_exit + + If f2 is JITed and the pointer stored to '_f2'. The calls f1 -> f2 -> f3 and + returns will be seamless. Without JIT, __sk_run_filter() interpreter needs to + be used to call into f2. + + For practical reasons all eBPF programs have only one argument 'ctx' which is + already placed into R1 (e.g. on __sk_run_filter() startup) and the programs + can call kernel functions with up to 5 arguments. Calls with 6 or more arguments + are currently not supported, but these restrictions can be lifted if necessary + in the future. + + On 64-bit architectures all register map to HW registers one to one. For + example, x86_64 JIT compiler can map them as ... + + R0 - rax + R1 - rdi + R2 - rsi + R3 - rdx + R4 - rcx + R5 - r8 + R6 - rbx + R7 - r13 + R8 - r14 + R9 - r15 + R10 - rbp + + ... since x86_64 ABI mandates rdi, rsi, rdx, rcx, r8, r9 for argument passing + and rbx, r12 - r15 are callee saved. + + Then the following internal BPF pseudo-program: + + bpf_mov R6, R1 /* save ctx */ + bpf_mov R2, 2 + bpf_mov R3, 3 + bpf_mov R4, 4 + bpf_mov R5, 5 + bpf_call foo + bpf_mov R7, R0 /* save foo() return value */ + bpf_mov R1, R6 /* restore ctx for next call */ + bpf_mov R2, 6 + bpf_mov R3, 7 + bpf_mov R4, 8 + bpf_mov R5, 9 + bpf_call bar + bpf_add R0, R7 + bpf_exit + + After JIT to x86_64 may look like: + + push %rbp + mov %rsp,%rbp + sub $0x228,%rsp + mov %rbx,-0x228(%rbp) + mov %r13,-0x220(%rbp) + mov %rdi,%rbx + mov $0x2,%esi + mov $0x3,%edx + mov $0x4,%ecx + mov $0x5,%r8d + callq foo + mov %rax,%r13 + mov %rbx,%rdi + mov $0x2,%esi + mov $0x3,%edx + mov $0x4,%ecx + mov $0x5,%r8d + callq bar + add %r13,%rax + mov -0x228(%rbp),%rbx + mov -0x220(%rbp),%r13 + leaveq + retq + + Which is in this example equivalent in C to: + + u64 bpf_filter(u64 ctx) + { + return foo(ctx, 2, 3, 4, 5) + bar(ctx, 6, 7, 8, 9); + } + + In-kernel functions foo() and bar() with prototype: u64 (*)(u64 arg1, u64 + arg2, u64 arg3, u64 arg4, u64 arg5); will receive arguments in proper + registers and place their return value into '%rax' which is R0 in eBPF. + Prologue and epilogue are emitted by JIT and are implicit in the + interpreter. R0-R5 are scratch registers, so eBPF program needs to preserve + them across the calls as defined by calling convention. + + For example the following program is invalid: + + bpf_mov R1, 1 + bpf_call foo + bpf_mov R0, R1 + bpf_exit + + After the call the registers R1-R5 contain junk values and cannot be read. + In the future an eBPF verifier can be used to validate internal BPF programs. + +Also in the new design, eBPF is limited to 4096 insns, which means that any +program will terminate quickly and will only call a fixed number of kernel +functions. Original BPF and the new format are two operand instructions, +which helps to do one-to-one mapping between eBPF insn and x86 insn during JIT. + +The input context pointer for invoking the interpreter function is generic, +its content is defined by a specific use case. For seccomp register R1 points +to seccomp_data, for converted BPF filters R1 points to a skb. + +A program, that is translated internally consists of the following elements: + + op:16, jt:8, jf:8, k:32 ==> op:8, dst_reg:4, src_reg:4, off:16, imm:32 + +So far 87 internal BPF instructions were implemented. 8-bit 'op' opcode field +has room for new instructions. Some of them may use 16/24/32 byte encoding. New +instructions must be multiple of 8 bytes to preserve backward compatibility. + +Internal BPF is a general purpose RISC instruction set. Not every register and +every instruction are used during translation from original BPF to new format. +For example, socket filters are not using 'exclusive add' instruction, but +tracing filters may do to maintain counters of events, for example. Register R9 +is not used by socket filters either, but more complex filters may be running +out of registers and would have to resort to spill/fill to stack. + +Internal BPF can used as generic assembler for last step performance +optimizations, socket filters and seccomp are using it as assembler. Tracing +filters may use it as assembler to generate code from kernel. In kernel usage +may not be bounded by security considerations, since generated internal BPF code +may be optimizing internal code path and not being exposed to the user space. +Safety of internal BPF can come from a verifier (TBD). In such use cases as +described, it may be used as safe instruction set. + +Just like the original BPF, the new format runs within a controlled environment, +is deterministic and the kernel can easily prove that. The safety of the program +can be determined in two steps: first step does depth-first-search to disallow +loops and other CFG validation; second step starts from the first insn and +descends all possible paths. It simulates execution of every insn and observes +the state change of registers and stack. + +eBPF opcode encoding +-------------------- + +eBPF is reusing most of the opcode encoding from classic to simplify conversion +of classic BPF to eBPF. For arithmetic and jump instructions the 8-bit 'code' +field is divided into three parts: + + +----------------+--------+--------------------+ + | 4 bits | 1 bit | 3 bits | + | operation code | source | instruction class | + +----------------+--------+--------------------+ + (MSB) (LSB) + +Three LSB bits store instruction class which is one of: + + Classic BPF classes: eBPF classes: + + BPF_LD 0x00 BPF_LD 0x00 + BPF_LDX 0x01 BPF_LDX 0x01 + BPF_ST 0x02 BPF_ST 0x02 + BPF_STX 0x03 BPF_STX 0x03 + BPF_ALU 0x04 BPF_ALU 0x04 + BPF_JMP 0x05 BPF_JMP 0x05 + BPF_RET 0x06 [ class 6 unused, for future if needed ] + BPF_MISC 0x07 BPF_ALU64 0x07 + +When BPF_CLASS(code) == BPF_ALU or BPF_JMP, 4th bit encodes source operand ... + + BPF_K 0x00 + BPF_X 0x08 + + * in classic BPF, this means: + + BPF_SRC(code) == BPF_X - use register X as source operand + BPF_SRC(code) == BPF_K - use 32-bit immediate as source operand + + * in eBPF, this means: + + BPF_SRC(code) == BPF_X - use 'src_reg' register as source operand + BPF_SRC(code) == BPF_K - use 32-bit immediate as source operand + +... and four MSB bits store operation code. + +If BPF_CLASS(code) == BPF_ALU or BPF_ALU64 [ in eBPF ], BPF_OP(code) is one of: + + BPF_ADD 0x00 + BPF_SUB 0x10 + BPF_MUL 0x20 + BPF_DIV 0x30 + BPF_OR 0x40 + BPF_AND 0x50 + BPF_LSH 0x60 + BPF_RSH 0x70 + BPF_NEG 0x80 + BPF_MOD 0x90 + BPF_XOR 0xa0 + BPF_MOV 0xb0 /* eBPF only: mov reg to reg */ + BPF_ARSH 0xc0 /* eBPF only: sign extending shift right */ + BPF_END 0xd0 /* eBPF only: endianness conversion */ + +If BPF_CLASS(code) == BPF_JMP, BPF_OP(code) is one of: + + BPF_JA 0x00 + BPF_JEQ 0x10 + BPF_JGT 0x20 + BPF_JGE 0x30 + BPF_JSET 0x40 + BPF_JNE 0x50 /* eBPF only: jump != */ + BPF_JSGT 0x60 /* eBPF only: signed '>' */ + BPF_JSGE 0x70 /* eBPF only: signed '>=' */ + BPF_CALL 0x80 /* eBPF only: function call */ + BPF_EXIT 0x90 /* eBPF only: function return */ + +So BPF_ADD | BPF_X | BPF_ALU means 32-bit addition in both classic BPF +and eBPF. There are only two registers in classic BPF, so it means A += X. +In eBPF it means dst_reg = (u32) dst_reg + (u32) src_reg; similarly, +BPF_XOR | BPF_K | BPF_ALU means A ^= imm32 in classic BPF and analogous +src_reg = (u32) src_reg ^ (u32) imm32 in eBPF. + +Classic BPF is using BPF_MISC class to represent A = X and X = A moves. +eBPF is using BPF_MOV | BPF_X | BPF_ALU code instead. Since there are no +BPF_MISC operations in eBPF, the class 7 is used as BPF_ALU64 to mean +exactly the same operations as BPF_ALU, but with 64-bit wide operands +instead. So BPF_ADD | BPF_X | BPF_ALU64 means 64-bit addition, i.e.: +dst_reg = dst_reg + src_reg + +Classic BPF wastes the whole BPF_RET class to represent a single 'ret' +operation. Classic BPF_RET | BPF_K means copy imm32 into return register +and perform function exit. eBPF is modeled to match CPU, so BPF_JMP | BPF_EXIT +in eBPF means function exit only. The eBPF program needs to store return +value into register R0 before doing a BPF_EXIT. Class 6 in eBPF is currently +unused and reserved for future use. + +For load and store instructions the 8-bit 'code' field is divided as: + + +--------+--------+-------------------+ + | 3 bits | 2 bits | 3 bits | + | mode | size | instruction class | + +--------+--------+-------------------+ + (MSB) (LSB) + +Size modifier is one of ... + + BPF_W 0x00 /* word */ + BPF_H 0x08 /* half word */ + BPF_B 0x10 /* byte */ + BPF_DW 0x18 /* eBPF only, double word */ + +... which encodes size of load/store operation: + + B - 1 byte + H - 2 byte + W - 4 byte + DW - 8 byte (eBPF only) + +Mode modifier is one of: + + BPF_IMM 0x00 /* classic BPF only, reserved in eBPF */ + BPF_ABS 0x20 + BPF_IND 0x40 + BPF_MEM 0x60 + BPF_LEN 0x80 /* classic BPF only, reserved in eBPF */ + BPF_MSH 0xa0 /* classic BPF only, reserved in eBPF */ + BPF_XADD 0xc0 /* eBPF only, exclusive add */ + +eBPF has two non-generic instructions: (BPF_ABS | <size> | BPF_LD) and +(BPF_IND | <size> | BPF_LD) which are used to access packet data. + +They had to be carried over from classic to have strong performance of +socket filters running in eBPF interpreter. These instructions can only +be used when interpreter context is a pointer to 'struct sk_buff' and +have seven implicit operands. Register R6 is an implicit input that must +contain pointer to sk_buff. Register R0 is an implicit output which contains +the data fetched from the packet. Registers R1-R5 are scratch registers +and must not be used to store the data across BPF_ABS | BPF_LD or +BPF_IND | BPF_LD instructions. + +These instructions have implicit program exit condition as well. When +eBPF program is trying to access the data beyond the packet boundary, +the interpreter will abort the execution of the program. JIT compilers +therefore must preserve this property. src_reg and imm32 fields are +explicit inputs to these instructions. + +For example: + + BPF_IND | BPF_W | BPF_LD means: + + R0 = ntohl(*(u32 *) (((struct sk_buff *) R6)->data + src_reg + imm32)) + and R1 - R5 were scratched. + +Unlike classic BPF instruction set, eBPF has generic load/store operations: + +BPF_MEM | <size> | BPF_STX: *(size *) (dst_reg + off) = src_reg +BPF_MEM | <size> | BPF_ST: *(size *) (dst_reg + off) = imm32 +BPF_MEM | <size> | BPF_LDX: dst_reg = *(size *) (src_reg + off) +BPF_XADD | BPF_W | BPF_STX: lock xadd *(u32 *)(dst_reg + off16) += src_reg +BPF_XADD | BPF_DW | BPF_STX: lock xadd *(u64 *)(dst_reg + off16) += src_reg + +Where size is one of: BPF_B or BPF_H or BPF_W or BPF_DW. Note that 1 and +2 byte atomic increments are not supported. + +Testing +------- + +Next to the BPF toolchain, the kernel also ships a test module that contains +various test cases for classic and internal BPF that can be executed against +the BPF interpreter and JIT compiler. It can be found in lib/test_bpf.c and +enabled via Kconfig: + + CONFIG_TEST_BPF=m + +After the module has been built and installed, the test suite can be executed +via insmod or modprobe against 'test_bpf' module. Results of the test cases +including timings in nsec can be found in the kernel log (dmesg). + +Misc +---- + +Also trinity, the Linux syscall fuzzer, has built-in support for BPF and +SECCOMP-BPF kernel fuzzing. + +Written by +---------- + +The document was written in the hope that it is found useful and in order +to give potential BPF hackers or security auditors a better overview of +the underlying architecture. + +Jay Schulist <jschlst@samba.org> +Daniel Borkmann <dborkman@redhat.com> +Alexei Starovoitov <ast@plumgrid.com> diff --git a/Documentation/networking/fore200e.txt b/Documentation/networking/fore200e.txt index 6e0d2a9613e..d52af53efdc 100644 --- a/Documentation/networking/fore200e.txt +++ b/Documentation/networking/fore200e.txt @@ -11,12 +11,10 @@ i386, alpha (untested), powerpc, sparc and sparc64 archs. The intent is to enable the use of different models of FORE adapters at the same time, by hosts that have several bus interfaces (such as PCI+SBUS, -PCI+MCA or PCI+EISA). +or PCI+EISA). Only PCI and SBUS devices are currently supported by the driver, but support -for other bus interfaces such as EISA should not be too hard to add (this may -be more tricky for the MCA bus, though, as FORE made some MCA-specific -modifications to the adapter's AALI interface). +for other bus interfaces such as EISA should not be too hard to add. Firmware Copyright Notice @@ -44,7 +42,7 @@ the 'software updates' pages. The firmware binaries are part of the various ForeThought software distributions. Notice that different versions of the PCA-200E firmware exist, depending -on the endianess of the host architecture. The driver is shipped with +on the endianness of the host architecture. The driver is shipped with both little and big endian PCA firmware images. Name and location of the new firmware images can be set at kernel diff --git a/Documentation/networking/generic_netlink.txt b/Documentation/networking/generic_netlink.txt index d4f8b8b9b53..3e071115ca9 100644 --- a/Documentation/networking/generic_netlink.txt +++ b/Documentation/networking/generic_netlink.txt @@ -1,3 +1,3 @@ A wiki document on how to use Generic Netlink can be found here: - * http://linux-net.osdl.org/index.php/Generic_Netlink_HOWTO + * http://www.linuxfoundation.org/collaborate/workgroups/networking/generic_netlink_howto diff --git a/Documentation/networking/gianfar.txt b/Documentation/networking/gianfar.txt index ad474ea07d0..ba1daea7f2e 100644 --- a/Documentation/networking/gianfar.txt +++ b/Documentation/networking/gianfar.txt @@ -1,38 +1,8 @@ The Gianfar Ethernet Driver -Sysfs File description Author: Andy Fleming <afleming@freescale.com> Updated: 2005-07-28 -SYSFS - -Several of the features of the gianfar driver are controlled -through sysfs files. These are: - -bd_stash: -To stash RX Buffer Descriptors in the L2, echo 'on' or '1' to -bd_stash, echo 'off' or '0' to disable - -rx_stash_len: -To stash the first n bytes of the packet in L2, echo the number -of bytes to buf_stash_len. echo 0 to disable. - -WARNING: You could really screw these up if you set them too low or high! -fifo_threshold: -To change the number of bytes the controller needs in the -fifo before it starts transmission, echo the number of bytes to -fifo_thresh. Range should be 0-511. - -fifo_starve: -When the FIFO has less than this many bytes during a transmit, it -enters starve mode, and increases the priority of TX memory -transactions. To change, echo the number of bytes to -fifo_starve. Range should be 0-511. - -fifo_starve_off: -Once in starve mode, the FIFO remains there until it has this -many bytes. To change, echo the number of bytes to -fifo_starve_off. Range should be 0-511. CHECKSUM OFFLOADING diff --git a/Documentation/networking/i40e.txt b/Documentation/networking/i40e.txt new file mode 100644 index 00000000000..f737273c6dc --- /dev/null +++ b/Documentation/networking/i40e.txt @@ -0,0 +1,115 @@ +Linux Base Driver for the Intel(R) Ethernet Controller XL710 Family +=================================================================== + +Intel i40e Linux driver. +Copyright(c) 2013 Intel Corporation. + +Contents +======== + +- Identifying Your Adapter +- Additional Configurations +- Performance Tuning +- Known Issues +- Support + + +Identifying Your Adapter +======================== + +The driver in this release is compatible with the Intel Ethernet +Controller XL710 Family. + +For more information on how to identify your adapter, go to the Adapter & +Driver ID Guide at: + + http://support.intel.com/support/network/sb/CS-012904.htm + + +Enabling the driver +=================== + +The driver is enabled via the standard kernel configuration system, +using the make command: + + Make oldconfig/silentoldconfig/menuconfig/etc. + +The driver is located in the menu structure at: + + -> Device Drivers + -> Network device support (NETDEVICES [=y]) + -> Ethernet driver support + -> Intel devices + -> Intel(R) Ethernet Controller XL710 Family + +Additional Configurations +========================= + + Generic Receive Offload (GRO) + ----------------------------- + The driver supports the in-kernel software implementation of GRO. GRO has + shown that by coalescing Rx traffic into larger chunks of data, CPU + utilization can be significantly reduced when under large Rx load. GRO is + an evolution of the previously-used LRO interface. GRO is able to coalesce + other protocols besides TCP. It's also safe to use with configurations that + are problematic for LRO, namely bridging and iSCSI. + + Ethtool + ------- + The driver utilizes the ethtool interface for driver configuration and + diagnostics, as well as displaying statistical information. The latest + ethtool version is required for this functionality. + + The latest release of ethtool can be found from + https://www.kernel.org/pub/software/network/ethtool + + Data Center Bridging (DCB) + -------------------------- + DCB configuration is not currently supported. + + FCoE + ---- + Fiber Channel over Ethernet (FCoE) hardware offload is not currently + supported. + + MAC and VLAN anti-spoofing feature + ---------------------------------- + When a malicious driver attempts to send a spoofed packet, it is dropped by + the hardware and not transmitted. An interrupt is sent to the PF driver + notifying it of the spoof attempt. + + When a spoofed packet is detected the PF driver will send the following + message to the system log (displayed by the "dmesg" command): + + Spoof event(s) detected on VF (n) + + Where n=the VF that attempted to do the spoofing. + + +Performance Tuning +================== + +An excellent article on performance tuning can be found at: + +http://www.redhat.com/promo/summit/2008/downloads/pdf/Thursday/Mark_Wagner.pdf + + +Known Issues +============ + + +Support +======= + +For general information, go to the Intel support website at: + + http://support.intel.com + +or the Intel Wired Networking project hosted by Sourceforge at: + + http://e1000.sourceforge.net + +If an issue is identified with the released source code on the supported +kernel with a supported adapter, email the specific information related +to the issue to e1000-devel@lists.sourceforge.net and copy +netdev@vger.kernel.org. diff --git a/Documentation/networking/i40evf.txt b/Documentation/networking/i40evf.txt new file mode 100644 index 00000000000..21e41271af7 --- /dev/null +++ b/Documentation/networking/i40evf.txt @@ -0,0 +1,47 @@ +Linux* Base Driver for Intel(R) Network Connection +================================================== + +Intel XL710 X710 Virtual Function Linux driver. +Copyright(c) 2013 Intel Corporation. + +Contents +======== + +- Identifying Your Adapter +- Known Issues/Troubleshooting +- Support + +This file describes the i40evf Linux* Base Driver for the Intel(R) XL710 +X710 Virtual Function. + +The i40evf driver supports XL710 and X710 virtual function devices that +can only be activated on kernels with CONFIG_PCI_IOV enabled. + +The guest OS loading the i40evf driver must support MSI-X interrupts. + +Identifying Your Adapter +======================== + +For more information on how to identify your adapter, go to the Adapter & +Driver ID Guide at: + + http://support.intel.com/support/go/network/adapter/idguide.htm + +Known Issues/Troubleshooting +============================ + + +Support +======= + +For general information, go to the Intel support website at: + + http://support.intel.com + +or the Intel Wired Networking project hosted by Sourceforge at: + + http://sourceforge.net/projects/e1000 + +If an issue is identified with the released source code on the supported +kernel with a supported adapter, email the specific information related +to the issue to e1000-devel@lists.sf.net diff --git a/Documentation/networking/ieee802154.txt b/Documentation/networking/ieee802154.txt index 23c995e6403..22bbc7225f8 100644 --- a/Documentation/networking/ieee802154.txt +++ b/Documentation/networking/ieee802154.txt @@ -4,15 +4,22 @@ Introduction ============ +The IEEE 802.15.4 working group focuses on standardization of bottom +two layers: Medium Access Control (MAC) and Physical (PHY). And there +are mainly two options available for upper layers: + - ZigBee - proprietary protocol from ZigBee Alliance + - 6LowPAN - IPv6 networking over low rate personal area networks The Linux-ZigBee project goal is to provide complete implementation -of IEEE 802.15.4 / ZigBee / 6LoWPAN protocols. IEEE 802.15.4 is a stack +of IEEE 802.15.4 and 6LoWPAN protocols. IEEE 802.15.4 is a stack of protocols for organizing Low-Rate Wireless Personal Area Networks. -Currently only IEEE 802.15.4 layer is implemented. We have choosen -to use plain Berkeley socket API, the generic Linux networking stack -to transfer IEEE 802.15.4 messages and a special protocol over genetlink -for configuration/management +The stack is composed of three main parts: + - IEEE 802.15.4 layer; We have chosen to use plain Berkeley socket API, + the generic Linux networking stack to transfer IEEE 802.15.4 messages + and a special protocol over genetlink for configuration/management + - MAC - provides access to shared channel and reliable data delivery + - PHY - represents device drivers Socket API @@ -29,15 +36,6 @@ or git tree at git://linux-zigbee.git.sourceforge.net/gitroot/linux-zigbee). One can use SOCK_RAW for passing raw data towards device xmit function. YMMV. -MLME - MAC Level Management -============================ - -Most of IEEE 802.15.4 MLME interfaces are directly mapped on netlink commands. -See the include/net/nl802154.h header. Our userspace tools package -(see above) provides CLI configuration utility for radio interfaces and simple -coordinator for IEEE 802.15.4 networks as an example users of MLME protocol. - - Kernel side ============= @@ -51,6 +49,15 @@ Like with WiFi, there are several types of devices implementing IEEE 802.15.4. Those types of devices require different approach to be hooked into Linux kernel. +MLME - MAC Level Management +============================ + +Most of IEEE 802.15.4 MLME interfaces are directly mapped on netlink commands. +See the include/net/nl802154.h header. Our userspace tools package +(see above) provides CLI configuration utility for radio interfaces and simple +coordinator for IEEE 802.15.4 networks as an example users of MLME protocol. + + HardMAC ======= @@ -59,13 +66,14 @@ net_device, with .type = ARPHRD_IEEE802154. Data is exchanged with socket family code via plain sk_buffs. On skb reception skb->cb must contain additional info as described in the struct ieee802154_mac_cb. During packet transmission the skb->cb is used to provide additional data to device's header_ops->create -function. Be aware, that this data can be overriden later (when socket code +function. Be aware that this data can be overridden later (when socket code submits skb to qdisc), so if you need something from that cb later, you should store info in the skb->data on your own. To hook the MLME interface you have to populate the ml_priv field of your -net_device with a pointer to struct ieee802154_mlme_ops instance. All fields are -required. +net_device with a pointer to struct ieee802154_mlme_ops instance. The fields +assoc_req, assoc_resp, disassoc_req, start_req, and scan_req are optional. +All other fields are required. We provide an example of simple HardMAC driver at drivers/ieee802154/fakehard.c @@ -73,8 +81,71 @@ We provide an example of simple HardMAC driver at drivers/ieee802154/fakehard.c SoftMAC ======= -We are going to provide intermediate layer implementing IEEE 802.15.4 MAC -in software. This is currently WIP. +The MAC is the middle layer in the IEEE 802.15.4 Linux stack. This moment it +provides interface for drivers registration and management of slave interfaces. + +NOTE: Currently the only monitor device type is supported - it's IEEE 802.15.4 +stack interface for network sniffers (e.g. WireShark). + +This layer is going to be extended soon. See header include/net/mac802154.h and several drivers in drivers/ieee802154/. + +Device drivers API +================== + +The include/net/mac802154.h defines following functions: + - struct ieee802154_dev *ieee802154_alloc_device + (size_t priv_size, struct ieee802154_ops *ops): + allocation of IEEE 802.15.4 compatible device + + - void ieee802154_free_device(struct ieee802154_dev *dev): + freeing allocated device + + - int ieee802154_register_device(struct ieee802154_dev *dev): + register PHY in the system + + - void ieee802154_unregister_device(struct ieee802154_dev *dev): + freeing registered PHY + +Moreover IEEE 802.15.4 device operations structure should be filled. + +Fake drivers +============ + +In addition there are two drivers available which simulate real devices with +HardMAC (fakehard) and SoftMAC (fakelb - IEEE 802.15.4 loopback driver) +interfaces. This option provides possibility to test and debug stack without +usage of real hardware. + +See sources in drivers/ieee802154 folder for more details. + + +6LoWPAN Linux implementation +============================ + +The IEEE 802.15.4 standard specifies an MTU of 128 bytes, yielding about 80 +octets of actual MAC payload once security is turned on, on a wireless link +with a link throughput of 250 kbps or less. The 6LoWPAN adaptation format +[RFC4944] was specified to carry IPv6 datagrams over such constrained links, +taking into account limited bandwidth, memory, or energy resources that are +expected in applications such as wireless Sensor Networks. [RFC4944] defines +a Mesh Addressing header to support sub-IP forwarding, a Fragmentation header +to support the IPv6 minimum MTU requirement [RFC2460], and stateless header +compression for IPv6 datagrams (LOWPAN_HC1 and LOWPAN_HC2) to reduce the +relatively large IPv6 and UDP headers down to (in the best case) several bytes. + +In Semptember 2011 the standard update was published - [RFC6282]. +It deprecates HC1 and HC2 compression and defines IPHC encoding format which is +used in this Linux implementation. + +All the code related to 6lowpan you may find in files: net/ieee802154/6lowpan.* + +To setup 6lowpan interface you need (busybox release > 1.17.0): +1. Add IEEE802.15.4 interface and initialize PANid; +2. Add 6lowpan interface by command like: + # ip link add link wpan0 name lowpan0 type lowpan +3. Set MAC (if needs): + # ip link set lowpan0 address de:ad:be:ef:ca:fe:ba:be +4. Bring up 'lowpan0' interface diff --git a/Documentation/networking/ifenslave.c b/Documentation/networking/ifenslave.c deleted file mode 100644 index 2bac9618c34..00000000000 --- a/Documentation/networking/ifenslave.c +++ /dev/null @@ -1,1103 +0,0 @@ -/* Mode: C; - * ifenslave.c: Configure network interfaces for parallel routing. - * - * This program controls the Linux implementation of running multiple - * network interfaces in parallel. - * - * Author: Donald Becker <becker@cesdis.gsfc.nasa.gov> - * Copyright 1994-1996 Donald Becker - * - * This program is free software; you can redistribute it - * and/or modify it under the terms of the GNU General Public - * License as published by the Free Software Foundation. - * - * The author may be reached as becker@CESDIS.gsfc.nasa.gov, or C/O - * Center of Excellence in Space Data and Information Sciences - * Code 930.5, Goddard Space Flight Center, Greenbelt MD 20771 - * - * Changes : - * - 2000/10/02 Willy Tarreau <willy at meta-x.org> : - * - few fixes. Master's MAC address is now correctly taken from - * the first device when not previously set ; - * - detach support : call BOND_RELEASE to detach an enslaved interface. - * - give a mini-howto from command-line help : # ifenslave -h - * - * - 2001/02/16 Chad N. Tindel <ctindel at ieee dot org> : - * - Master is now brought down before setting the MAC address. In - * the 2.4 kernel you can't change the MAC address while the device is - * up because you get EBUSY. - * - * - 2001/09/13 Takao Indoh <indou dot takao at jp dot fujitsu dot com> - * - Added the ability to change the active interface on a mode 1 bond - * at runtime. - * - * - 2001/10/23 Chad N. Tindel <ctindel at ieee dot org> : - * - No longer set the MAC address of the master. The bond device will - * take care of this itself - * - Try the SIOC*** versions of the bonding ioctls before using the - * old versions - * - 2002/02/18 Erik Habbinga <erik_habbinga @ hp dot com> : - * - ifr2.ifr_flags was not initialized in the hwaddr_notset case, - * SIOCGIFFLAGS now called before hwaddr_notset test - * - * - 2002/10/31 Tony Cureington <tony.cureington * hp_com> : - * - If the master does not have a hardware address when the first slave - * is enslaved, the master is assigned the hardware address of that - * slave - there is a comment in bonding.c stating "ifenslave takes - * care of this now." This corrects the problem of slaves having - * different hardware addresses in active-backup mode when - * multiple interfaces are specified on a single ifenslave command - * (ifenslave bond0 eth0 eth1). - * - * - 2003/03/18 - Tsippy Mendelson <tsippy.mendelson at intel dot com> and - * Shmulik Hen <shmulik.hen at intel dot com> - * - Moved setting the slave's mac address and openning it, from - * the application to the driver. This enables support of modes - * that need to use the unique mac address of each slave. - * The driver also takes care of closing the slave and restoring its - * original mac address upon release. - * In addition, block possibility of enslaving before the master is up. - * This prevents putting the system in an undefined state. - * - * - 2003/05/01 - Amir Noam <amir.noam at intel dot com> - * - Added ABI version control to restore compatibility between - * new/old ifenslave and new/old bonding. - * - Prevent adding an adapter that is already a slave. - * Fixes the problem of stalling the transmission and leaving - * the slave in a down state. - * - * - 2003/05/01 - Shmulik Hen <shmulik.hen at intel dot com> - * - Prevent enslaving if the bond device is down. - * Fixes the problem of leaving the system in unstable state and - * halting when trying to remove the module. - * - Close socket on all abnormal exists. - * - Add versioning scheme that follows that of the bonding driver. - * current version is 1.0.0 as a base line. - * - * - 2003/05/22 - Jay Vosburgh <fubar at us dot ibm dot com> - * - ifenslave -c was broken; it's now fixed - * - Fixed problem with routes vanishing from master during enslave - * processing. - * - * - 2003/05/27 - Amir Noam <amir.noam at intel dot com> - * - Fix backward compatibility issues: - * For drivers not using ABI versions, slave was set down while - * it should be left up before enslaving. - * Also, master was not set down and the default set_mac_address() - * would fail and generate an error message in the system log. - * - For opt_c: slave should not be set to the master's setting - * while it is running. It was already set during enslave. To - * simplify things, it is now handled separately. - * - * - 2003/12/01 - Shmulik Hen <shmulik.hen at intel dot com> - * - Code cleanup and style changes - * set version to 1.1.0 - */ - -#define APP_VERSION "1.1.0" -#define APP_RELDATE "December 1, 2003" -#define APP_NAME "ifenslave" - -static char *version = -APP_NAME ".c:v" APP_VERSION " (" APP_RELDATE ")\n" -"o Donald Becker (becker@cesdis.gsfc.nasa.gov).\n" -"o Detach support added on 2000/10/02 by Willy Tarreau (willy at meta-x.org).\n" -"o 2.4 kernel support added on 2001/02/16 by Chad N. Tindel\n" -" (ctindel at ieee dot org).\n"; - -static const char *usage_msg = -"Usage: ifenslave [-f] <master-if> <slave-if> [<slave-if>...]\n" -" ifenslave -d <master-if> <slave-if> [<slave-if>...]\n" -" ifenslave -c <master-if> <slave-if>\n" -" ifenslave --help\n"; - -static const char *help_msg = -"\n" -" To create a bond device, simply follow these three steps :\n" -" - ensure that the required drivers are properly loaded :\n" -" # modprobe bonding ; modprobe <3c59x|eepro100|pcnet32|tulip|...>\n" -" - assign an IP address to the bond device :\n" -" # ifconfig bond0 <addr> netmask <mask> broadcast <bcast>\n" -" - attach all the interfaces you need to the bond device :\n" -" # ifenslave [{-f|--force}] bond0 eth0 [eth1 [eth2]...]\n" -" If bond0 didn't have a MAC address, it will take eth0's. Then, all\n" -" interfaces attached AFTER this assignment will get the same MAC addr.\n" -" (except for ALB/TLB modes)\n" -"\n" -" To set the bond device down and automatically release all the slaves :\n" -" # ifconfig bond0 down\n" -"\n" -" To detach a dead interface without setting the bond device down :\n" -" # ifenslave {-d|--detach} bond0 eth0 [eth1 [eth2]...]\n" -"\n" -" To change active slave :\n" -" # ifenslave {-c|--change-active} bond0 eth0\n" -"\n" -" To show master interface info\n" -" # ifenslave bond0\n" -"\n" -" To show all interfaces info\n" -" # ifenslave {-a|--all-interfaces}\n" -"\n" -" To be more verbose\n" -" # ifenslave {-v|--verbose} ...\n" -"\n" -" # ifenslave {-u|--usage} Show usage\n" -" # ifenslave {-V|--version} Show version\n" -" # ifenslave {-h|--help} This message\n" -"\n"; - -#include <unistd.h> -#include <stdlib.h> -#include <stdio.h> -#include <ctype.h> -#include <string.h> -#include <errno.h> -#include <fcntl.h> -#include <getopt.h> -#include <sys/types.h> -#include <sys/socket.h> -#include <sys/ioctl.h> -#include <linux/if.h> -#include <net/if_arp.h> -#include <linux/if_ether.h> -#include <linux/if_bonding.h> -#include <linux/sockios.h> - -typedef unsigned long long u64; /* hack, so we may include kernel's ethtool.h */ -typedef __uint32_t u32; /* ditto */ -typedef __uint16_t u16; /* ditto */ -typedef __uint8_t u8; /* ditto */ -#include <linux/ethtool.h> - -struct option longopts[] = { - /* { name has_arg *flag val } */ - {"all-interfaces", 0, 0, 'a'}, /* Show all interfaces. */ - {"change-active", 0, 0, 'c'}, /* Change the active slave. */ - {"detach", 0, 0, 'd'}, /* Detach a slave interface. */ - {"force", 0, 0, 'f'}, /* Force the operation. */ - {"help", 0, 0, 'h'}, /* Give help */ - {"usage", 0, 0, 'u'}, /* Give usage */ - {"verbose", 0, 0, 'v'}, /* Report each action taken. */ - {"version", 0, 0, 'V'}, /* Emit version information. */ - { 0, 0, 0, 0} -}; - -/* Command-line flags. */ -unsigned int -opt_a = 0, /* Show-all-interfaces flag. */ -opt_c = 0, /* Change-active-slave flag. */ -opt_d = 0, /* Detach a slave interface. */ -opt_f = 0, /* Force the operation. */ -opt_h = 0, /* Help */ -opt_u = 0, /* Usage */ -opt_v = 0, /* Verbose flag. */ -opt_V = 0; /* Version */ - -int skfd = -1; /* AF_INET socket for ioctl() calls.*/ -int abi_ver = 0; /* userland - kernel ABI version */ -int hwaddr_set = 0; /* Master's hwaddr is set */ -int saved_errno; - -struct ifreq master_mtu, master_flags, master_hwaddr; -struct ifreq slave_mtu, slave_flags, slave_hwaddr; - -struct dev_ifr { - struct ifreq *req_ifr; - char *req_name; - int req_type; -}; - -struct dev_ifr master_ifra[] = { - {&master_mtu, "SIOCGIFMTU", SIOCGIFMTU}, - {&master_flags, "SIOCGIFFLAGS", SIOCGIFFLAGS}, - {&master_hwaddr, "SIOCGIFHWADDR", SIOCGIFHWADDR}, - {NULL, "", 0} -}; - -struct dev_ifr slave_ifra[] = { - {&slave_mtu, "SIOCGIFMTU", SIOCGIFMTU}, - {&slave_flags, "SIOCGIFFLAGS", SIOCGIFFLAGS}, - {&slave_hwaddr, "SIOCGIFHWADDR", SIOCGIFHWADDR}, - {NULL, "", 0} -}; - -static void if_print(char *ifname); -static int get_drv_info(char *master_ifname); -static int get_if_settings(char *ifname, struct dev_ifr ifra[]); -static int get_slave_flags(char *slave_ifname); -static int set_master_hwaddr(char *master_ifname, struct sockaddr *hwaddr); -static int set_slave_hwaddr(char *slave_ifname, struct sockaddr *hwaddr); -static int set_slave_mtu(char *slave_ifname, int mtu); -static int set_if_flags(char *ifname, short flags); -static int set_if_up(char *ifname, short flags); -static int set_if_down(char *ifname, short flags); -static int clear_if_addr(char *ifname); -static int set_if_addr(char *master_ifname, char *slave_ifname); -static int change_active(char *master_ifname, char *slave_ifname); -static int enslave(char *master_ifname, char *slave_ifname); -static int release(char *master_ifname, char *slave_ifname); -#define v_print(fmt, args...) \ - if (opt_v) \ - fprintf(stderr, fmt, ## args ) - -int main(int argc, char *argv[]) -{ - char **spp, *master_ifname, *slave_ifname; - int c, i, rv; - int res = 0; - int exclusive = 0; - - while ((c = getopt_long(argc, argv, "acdfhuvV", longopts, 0)) != EOF) { - switch (c) { - case 'a': opt_a++; exclusive++; break; - case 'c': opt_c++; exclusive++; break; - case 'd': opt_d++; exclusive++; break; - case 'f': opt_f++; exclusive++; break; - case 'h': opt_h++; exclusive++; break; - case 'u': opt_u++; exclusive++; break; - case 'v': opt_v++; break; - case 'V': opt_V++; exclusive++; break; - - case '?': - fprintf(stderr, usage_msg); - res = 2; - goto out; - } - } - - /* options check */ - if (exclusive > 1) { - fprintf(stderr, usage_msg); - res = 2; - goto out; - } - - if (opt_v || opt_V) { - printf(version); - if (opt_V) { - res = 0; - goto out; - } - } - - if (opt_u) { - printf(usage_msg); - res = 0; - goto out; - } - - if (opt_h) { - printf(usage_msg); - printf(help_msg); - res = 0; - goto out; - } - - /* Open a basic socket */ - if ((skfd = socket(AF_INET, SOCK_DGRAM, 0)) < 0) { - perror("socket"); - res = 1; - goto out; - } - - if (opt_a) { - if (optind == argc) { - /* No remaining args */ - /* show all interfaces */ - if_print((char *)NULL); - goto out; - } else { - /* Just show usage */ - fprintf(stderr, usage_msg); - res = 2; - goto out; - } - } - - /* Copy the interface name */ - spp = argv + optind; - master_ifname = *spp++; - - if (master_ifname == NULL) { - fprintf(stderr, usage_msg); - res = 2; - goto out; - } - - /* exchange abi version with bonding module */ - res = get_drv_info(master_ifname); - if (res) { - fprintf(stderr, - "Master '%s': Error: handshake with driver failed. " - "Aborting\n", - master_ifname); - goto out; - } - - slave_ifname = *spp++; - - if (slave_ifname == NULL) { - if (opt_d || opt_c) { - fprintf(stderr, usage_msg); - res = 2; - goto out; - } - - /* A single arg means show the - * configuration for this interface - */ - if_print(master_ifname); - goto out; - } - - res = get_if_settings(master_ifname, master_ifra); - if (res) { - /* Probably a good reason not to go on */ - fprintf(stderr, - "Master '%s': Error: get settings failed: %s. " - "Aborting\n", - master_ifname, strerror(res)); - goto out; - } - - /* check if master is indeed a master; - * if not then fail any operation - */ - if (!(master_flags.ifr_flags & IFF_MASTER)) { - fprintf(stderr, - "Illegal operation; the specified interface '%s' " - "is not a master. Aborting\n", - master_ifname); - res = 1; - goto out; - } - - /* check if master is up; if not then fail any operation */ - if (!(master_flags.ifr_flags & IFF_UP)) { - fprintf(stderr, - "Illegal operation; the specified master interface " - "'%s' is not up.\n", - master_ifname); - res = 1; - goto out; - } - - /* Only for enslaving */ - if (!opt_c && !opt_d) { - sa_family_t master_family = master_hwaddr.ifr_hwaddr.sa_family; - unsigned char *hwaddr = - (unsigned char *)master_hwaddr.ifr_hwaddr.sa_data; - - /* The family '1' is ARPHRD_ETHER for ethernet. */ - if (master_family != 1 && !opt_f) { - fprintf(stderr, - "Illegal operation: The specified master " - "interface '%s' is not ethernet-like.\n " - "This program is designed to work with " - "ethernet-like network interfaces.\n " - "Use the '-f' option to force the " - "operation.\n", - master_ifname); - res = 1; - goto out; - } - - /* Check master's hw addr */ - for (i = 0; i < 6; i++) { - if (hwaddr[i] != 0) { - hwaddr_set = 1; - break; - } - } - - if (hwaddr_set) { - v_print("current hardware address of master '%s' " - "is %2.2x:%2.2x:%2.2x:%2.2x:%2.2x:%2.2x, " - "type %d\n", - master_ifname, - hwaddr[0], hwaddr[1], - hwaddr[2], hwaddr[3], - hwaddr[4], hwaddr[5], - master_family); - } - } - - /* Accepts only one slave */ - if (opt_c) { - /* change active slave */ - res = get_slave_flags(slave_ifname); - if (res) { - fprintf(stderr, - "Slave '%s': Error: get flags failed. " - "Aborting\n", - slave_ifname); - goto out; - } - res = change_active(master_ifname, slave_ifname); - if (res) { - fprintf(stderr, - "Master '%s', Slave '%s': Error: " - "Change active failed\n", - master_ifname, slave_ifname); - } - } else { - /* Accept multiple slaves */ - do { - if (opt_d) { - /* detach a slave interface from the master */ - rv = get_slave_flags(slave_ifname); - if (rv) { - /* Can't work with this slave. */ - /* remember the error and skip it*/ - fprintf(stderr, - "Slave '%s': Error: get flags " - "failed. Skipping\n", - slave_ifname); - res = rv; - continue; - } - rv = release(master_ifname, slave_ifname); - if (rv) { - fprintf(stderr, - "Master '%s', Slave '%s': Error: " - "Release failed\n", - master_ifname, slave_ifname); - res = rv; - } - } else { - /* attach a slave interface to the master */ - rv = get_if_settings(slave_ifname, slave_ifra); - if (rv) { - /* Can't work with this slave. */ - /* remember the error and skip it*/ - fprintf(stderr, - "Slave '%s': Error: get " - "settings failed: %s. " - "Skipping\n", - slave_ifname, strerror(rv)); - res = rv; - continue; - } - rv = enslave(master_ifname, slave_ifname); - if (rv) { - fprintf(stderr, - "Master '%s', Slave '%s': Error: " - "Enslave failed\n", - master_ifname, slave_ifname); - res = rv; - } - } - } while ((slave_ifname = *spp++) != NULL); - } - -out: - if (skfd >= 0) { - close(skfd); - } - - return res; -} - -static short mif_flags; - -/* Get the inteface configuration from the kernel. */ -static int if_getconfig(char *ifname) -{ - struct ifreq ifr; - int metric, mtu; /* Parameters of the master interface. */ - struct sockaddr dstaddr, broadaddr, netmask; - unsigned char *hwaddr; - - strcpy(ifr.ifr_name, ifname); - if (ioctl(skfd, SIOCGIFFLAGS, &ifr) < 0) - return -1; - mif_flags = ifr.ifr_flags; - printf("The result of SIOCGIFFLAGS on %s is %x.\n", - ifname, ifr.ifr_flags); - - strcpy(ifr.ifr_name, ifname); - if (ioctl(skfd, SIOCGIFADDR, &ifr) < 0) - return -1; - printf("The result of SIOCGIFADDR is %2.2x.%2.2x.%2.2x.%2.2x.\n", - ifr.ifr_addr.sa_data[0], ifr.ifr_addr.sa_data[1], - ifr.ifr_addr.sa_data[2], ifr.ifr_addr.sa_data[3]); - - strcpy(ifr.ifr_name, ifname); - if (ioctl(skfd, SIOCGIFHWADDR, &ifr) < 0) - return -1; - - /* Gotta convert from 'char' to unsigned for printf(). */ - hwaddr = (unsigned char *)ifr.ifr_hwaddr.sa_data; - printf("The result of SIOCGIFHWADDR is type %d " - "%2.2x:%2.2x:%2.2x:%2.2x:%2.2x:%2.2x.\n", - ifr.ifr_hwaddr.sa_family, hwaddr[0], hwaddr[1], - hwaddr[2], hwaddr[3], hwaddr[4], hwaddr[5]); - - strcpy(ifr.ifr_name, ifname); - if (ioctl(skfd, SIOCGIFMETRIC, &ifr) < 0) { - metric = 0; - } else - metric = ifr.ifr_metric; - - strcpy(ifr.ifr_name, ifname); - if (ioctl(skfd, SIOCGIFMTU, &ifr) < 0) - mtu = 0; - else - mtu = ifr.ifr_mtu; - - strcpy(ifr.ifr_name, ifname); - if (ioctl(skfd, SIOCGIFDSTADDR, &ifr) < 0) { - memset(&dstaddr, 0, sizeof(struct sockaddr)); - } else - dstaddr = ifr.ifr_dstaddr; - - strcpy(ifr.ifr_name, ifname); - if (ioctl(skfd, SIOCGIFBRDADDR, &ifr) < 0) { - memset(&broadaddr, 0, sizeof(struct sockaddr)); - } else - broadaddr = ifr.ifr_broadaddr; - - strcpy(ifr.ifr_name, ifname); - if (ioctl(skfd, SIOCGIFNETMASK, &ifr) < 0) { - memset(&netmask, 0, sizeof(struct sockaddr)); - } else - netmask = ifr.ifr_netmask; - - return 0; -} - -static void if_print(char *ifname) -{ - char buff[1024]; - struct ifconf ifc; - struct ifreq *ifr; - int i; - - if (ifname == (char *)NULL) { - ifc.ifc_len = sizeof(buff); - ifc.ifc_buf = buff; - if (ioctl(skfd, SIOCGIFCONF, &ifc) < 0) { - perror("SIOCGIFCONF failed"); - return; - } - - ifr = ifc.ifc_req; - for (i = ifc.ifc_len / sizeof(struct ifreq); --i >= 0; ifr++) { - if (if_getconfig(ifr->ifr_name) < 0) { - fprintf(stderr, - "%s: unknown interface.\n", - ifr->ifr_name); - continue; - } - - if (((mif_flags & IFF_UP) == 0) && !opt_a) continue; - /*ife_print(&ife);*/ - } - } else { - if (if_getconfig(ifname) < 0) { - fprintf(stderr, - "%s: unknown interface.\n", ifname); - } - } -} - -static int get_drv_info(char *master_ifname) -{ - struct ifreq ifr; - struct ethtool_drvinfo info; - char *endptr; - - memset(&ifr, 0, sizeof(ifr)); - strncpy(ifr.ifr_name, master_ifname, IFNAMSIZ); - ifr.ifr_data = (caddr_t)&info; - - info.cmd = ETHTOOL_GDRVINFO; - strncpy(info.driver, "ifenslave", 32); - snprintf(info.fw_version, 32, "%d", BOND_ABI_VERSION); - - if (ioctl(skfd, SIOCETHTOOL, &ifr) < 0) { - if (errno == EOPNOTSUPP) { - goto out; - } - - saved_errno = errno; - v_print("Master '%s': Error: get bonding info failed %s\n", - master_ifname, strerror(saved_errno)); - return 1; - } - - abi_ver = strtoul(info.fw_version, &endptr, 0); - if (*endptr) { - v_print("Master '%s': Error: got invalid string as an ABI " - "version from the bonding module\n", - master_ifname); - return 1; - } - -out: - v_print("ABI ver is %d\n", abi_ver); - - return 0; -} - -static int change_active(char *master_ifname, char *slave_ifname) -{ - struct ifreq ifr; - int res = 0; - - if (!(slave_flags.ifr_flags & IFF_SLAVE)) { - fprintf(stderr, - "Illegal operation: The specified slave interface " - "'%s' is not a slave\n", - slave_ifname); - return 1; - } - - strncpy(ifr.ifr_name, master_ifname, IFNAMSIZ); - strncpy(ifr.ifr_slave, slave_ifname, IFNAMSIZ); - if ((ioctl(skfd, SIOCBONDCHANGEACTIVE, &ifr) < 0) && - (ioctl(skfd, BOND_CHANGE_ACTIVE_OLD, &ifr) < 0)) { - saved_errno = errno; - v_print("Master '%s': Error: SIOCBONDCHANGEACTIVE failed: " - "%s\n", - master_ifname, strerror(saved_errno)); - res = 1; - } - - return res; -} - -static int enslave(char *master_ifname, char *slave_ifname) -{ - struct ifreq ifr; - int res = 0; - - if (slave_flags.ifr_flags & IFF_SLAVE) { - fprintf(stderr, - "Illegal operation: The specified slave interface " - "'%s' is already a slave\n", - slave_ifname); - return 1; - } - - res = set_if_down(slave_ifname, slave_flags.ifr_flags); - if (res) { - fprintf(stderr, - "Slave '%s': Error: bring interface down failed\n", - slave_ifname); - return res; - } - - if (abi_ver < 2) { - /* Older bonding versions would panic if the slave has no IP - * address, so get the IP setting from the master. - */ - set_if_addr(master_ifname, slave_ifname); - } else { - res = clear_if_addr(slave_ifname); - if (res) { - fprintf(stderr, - "Slave '%s': Error: clear address failed\n", - slave_ifname); - return res; - } - } - - if (master_mtu.ifr_mtu != slave_mtu.ifr_mtu) { - res = set_slave_mtu(slave_ifname, master_mtu.ifr_mtu); - if (res) { - fprintf(stderr, - "Slave '%s': Error: set MTU failed\n", - slave_ifname); - return res; - } - } - - if (hwaddr_set) { - /* Master already has an hwaddr - * so set it's hwaddr to the slave - */ - if (abi_ver < 1) { - /* The driver is using an old ABI, so - * the application sets the slave's - * hwaddr - */ - res = set_slave_hwaddr(slave_ifname, - &(master_hwaddr.ifr_hwaddr)); - if (res) { - fprintf(stderr, - "Slave '%s': Error: set hw address " - "failed\n", - slave_ifname); - goto undo_mtu; - } - - /* For old ABI the application needs to bring the - * slave back up - */ - res = set_if_up(slave_ifname, slave_flags.ifr_flags); - if (res) { - fprintf(stderr, - "Slave '%s': Error: bring interface " - "down failed\n", - slave_ifname); - goto undo_slave_mac; - } - } - /* The driver is using a new ABI, - * so the driver takes care of setting - * the slave's hwaddr and bringing - * it up again - */ - } else { - /* No hwaddr for master yet, so - * set the slave's hwaddr to it - */ - if (abi_ver < 1) { - /* For old ABI, the master needs to be - * down before setting its hwaddr - */ - res = set_if_down(master_ifname, master_flags.ifr_flags); - if (res) { - fprintf(stderr, - "Master '%s': Error: bring interface " - "down failed\n", - master_ifname); - goto undo_mtu; - } - } - - res = set_master_hwaddr(master_ifname, - &(slave_hwaddr.ifr_hwaddr)); - if (res) { - fprintf(stderr, - "Master '%s': Error: set hw address " - "failed\n", - master_ifname); - goto undo_mtu; - } - - if (abi_ver < 1) { - /* For old ABI, bring the master - * back up - */ - res = set_if_up(master_ifname, master_flags.ifr_flags); - if (res) { - fprintf(stderr, - "Master '%s': Error: bring interface " - "up failed\n", - master_ifname); - goto undo_master_mac; - } - } - - hwaddr_set = 1; - } - - /* Do the real thing */ - strncpy(ifr.ifr_name, master_ifname, IFNAMSIZ); - strncpy(ifr.ifr_slave, slave_ifname, IFNAMSIZ); - if ((ioctl(skfd, SIOCBONDENSLAVE, &ifr) < 0) && - (ioctl(skfd, BOND_ENSLAVE_OLD, &ifr) < 0)) { - saved_errno = errno; - v_print("Master '%s': Error: SIOCBONDENSLAVE failed: %s\n", - master_ifname, strerror(saved_errno)); - res = 1; - } - - if (res) { - goto undo_master_mac; - } - - return 0; - -/* rollback (best effort) */ -undo_master_mac: - set_master_hwaddr(master_ifname, &(master_hwaddr.ifr_hwaddr)); - hwaddr_set = 0; - goto undo_mtu; -undo_slave_mac: - set_slave_hwaddr(slave_ifname, &(slave_hwaddr.ifr_hwaddr)); -undo_mtu: - set_slave_mtu(slave_ifname, slave_mtu.ifr_mtu); - return res; -} - -static int release(char *master_ifname, char *slave_ifname) -{ - struct ifreq ifr; - int res = 0; - - if (!(slave_flags.ifr_flags & IFF_SLAVE)) { - fprintf(stderr, - "Illegal operation: The specified slave interface " - "'%s' is not a slave\n", - slave_ifname); - return 1; - } - - strncpy(ifr.ifr_name, master_ifname, IFNAMSIZ); - strncpy(ifr.ifr_slave, slave_ifname, IFNAMSIZ); - if ((ioctl(skfd, SIOCBONDRELEASE, &ifr) < 0) && - (ioctl(skfd, BOND_RELEASE_OLD, &ifr) < 0)) { - saved_errno = errno; - v_print("Master '%s': Error: SIOCBONDRELEASE failed: %s\n", - master_ifname, strerror(saved_errno)); - return 1; - } else if (abi_ver < 1) { - /* The driver is using an old ABI, so we'll set the interface - * down to avoid any conflicts due to same MAC/IP - */ - res = set_if_down(slave_ifname, slave_flags.ifr_flags); - if (res) { - fprintf(stderr, - "Slave '%s': Error: bring interface " - "down failed\n", - slave_ifname); - } - } - - /* set to default mtu */ - set_slave_mtu(slave_ifname, 1500); - - return res; -} - -static int get_if_settings(char *ifname, struct dev_ifr ifra[]) -{ - int i; - int res = 0; - - for (i = 0; ifra[i].req_ifr; i++) { - strncpy(ifra[i].req_ifr->ifr_name, ifname, IFNAMSIZ); - res = ioctl(skfd, ifra[i].req_type, ifra[i].req_ifr); - if (res < 0) { - saved_errno = errno; - v_print("Interface '%s': Error: %s failed: %s\n", - ifname, ifra[i].req_name, - strerror(saved_errno)); - - return saved_errno; - } - } - - return 0; -} - -static int get_slave_flags(char *slave_ifname) -{ - int res = 0; - - strncpy(slave_flags.ifr_name, slave_ifname, IFNAMSIZ); - res = ioctl(skfd, SIOCGIFFLAGS, &slave_flags); - if (res < 0) { - saved_errno = errno; - v_print("Slave '%s': Error: SIOCGIFFLAGS failed: %s\n", - slave_ifname, strerror(saved_errno)); - } else { - v_print("Slave %s: flags %04X.\n", - slave_ifname, slave_flags.ifr_flags); - } - - return res; -} - -static int set_master_hwaddr(char *master_ifname, struct sockaddr *hwaddr) -{ - unsigned char *addr = (unsigned char *)hwaddr->sa_data; - struct ifreq ifr; - int res = 0; - - strncpy(ifr.ifr_name, master_ifname, IFNAMSIZ); - memcpy(&(ifr.ifr_hwaddr), hwaddr, sizeof(struct sockaddr)); - res = ioctl(skfd, SIOCSIFHWADDR, &ifr); - if (res < 0) { - saved_errno = errno; - v_print("Master '%s': Error: SIOCSIFHWADDR failed: %s\n", - master_ifname, strerror(saved_errno)); - return res; - } else { - v_print("Master '%s': hardware address set to " - "%2.2x:%2.2x:%2.2x:%2.2x:%2.2x:%2.2x.\n", - master_ifname, addr[0], addr[1], addr[2], - addr[3], addr[4], addr[5]); - } - - return res; -} - -static int set_slave_hwaddr(char *slave_ifname, struct sockaddr *hwaddr) -{ - unsigned char *addr = (unsigned char *)hwaddr->sa_data; - struct ifreq ifr; - int res = 0; - - strncpy(ifr.ifr_name, slave_ifname, IFNAMSIZ); - memcpy(&(ifr.ifr_hwaddr), hwaddr, sizeof(struct sockaddr)); - res = ioctl(skfd, SIOCSIFHWADDR, &ifr); - if (res < 0) { - saved_errno = errno; - - v_print("Slave '%s': Error: SIOCSIFHWADDR failed: %s\n", - slave_ifname, strerror(saved_errno)); - - if (saved_errno == EBUSY) { - v_print(" The device is busy: it must be idle " - "before running this command.\n"); - } else if (saved_errno == EOPNOTSUPP) { - v_print(" The device does not support setting " - "the MAC address.\n" - " Your kernel likely does not support slave " - "devices.\n"); - } else if (saved_errno == EINVAL) { - v_print(" The device's address type does not match " - "the master's address type.\n"); - } - return res; - } else { - v_print("Slave '%s': hardware address set to " - "%2.2x:%2.2x:%2.2x:%2.2x:%2.2x:%2.2x.\n", - slave_ifname, addr[0], addr[1], addr[2], - addr[3], addr[4], addr[5]); - } - - return res; -} - -static int set_slave_mtu(char *slave_ifname, int mtu) -{ - struct ifreq ifr; - int res = 0; - - ifr.ifr_mtu = mtu; - strncpy(ifr.ifr_name, slave_ifname, IFNAMSIZ); - - res = ioctl(skfd, SIOCSIFMTU, &ifr); - if (res < 0) { - saved_errno = errno; - v_print("Slave '%s': Error: SIOCSIFMTU failed: %s\n", - slave_ifname, strerror(saved_errno)); - } else { - v_print("Slave '%s': MTU set to %d.\n", slave_ifname, mtu); - } - - return res; -} - -static int set_if_flags(char *ifname, short flags) -{ - struct ifreq ifr; - int res = 0; - - ifr.ifr_flags = flags; - strncpy(ifr.ifr_name, ifname, IFNAMSIZ); - - res = ioctl(skfd, SIOCSIFFLAGS, &ifr); - if (res < 0) { - saved_errno = errno; - v_print("Interface '%s': Error: SIOCSIFFLAGS failed: %s\n", - ifname, strerror(saved_errno)); - } else { - v_print("Interface '%s': flags set to %04X.\n", ifname, flags); - } - - return res; -} - -static int set_if_up(char *ifname, short flags) -{ - return set_if_flags(ifname, flags | IFF_UP); -} - -static int set_if_down(char *ifname, short flags) -{ - return set_if_flags(ifname, flags & ~IFF_UP); -} - -static int clear_if_addr(char *ifname) -{ - struct ifreq ifr; - int res = 0; - - strncpy(ifr.ifr_name, ifname, IFNAMSIZ); - ifr.ifr_addr.sa_family = AF_INET; - memset(ifr.ifr_addr.sa_data, 0, sizeof(ifr.ifr_addr.sa_data)); - - res = ioctl(skfd, SIOCSIFADDR, &ifr); - if (res < 0) { - saved_errno = errno; - v_print("Interface '%s': Error: SIOCSIFADDR failed: %s\n", - ifname, strerror(saved_errno)); - } else { - v_print("Interface '%s': address cleared\n", ifname); - } - - return res; -} - -static int set_if_addr(char *master_ifname, char *slave_ifname) -{ - struct ifreq ifr; - int res; - unsigned char *ipaddr; - int i; - struct { - char *req_name; - char *desc; - int g_ioctl; - int s_ioctl; - } ifra[] = { - {"IFADDR", "addr", SIOCGIFADDR, SIOCSIFADDR}, - {"DSTADDR", "destination addr", SIOCGIFDSTADDR, SIOCSIFDSTADDR}, - {"BRDADDR", "broadcast addr", SIOCGIFBRDADDR, SIOCSIFBRDADDR}, - {"NETMASK", "netmask", SIOCGIFNETMASK, SIOCSIFNETMASK}, - {NULL, NULL, 0, 0}, - }; - - for (i = 0; ifra[i].req_name; i++) { - strncpy(ifr.ifr_name, master_ifname, IFNAMSIZ); - res = ioctl(skfd, ifra[i].g_ioctl, &ifr); - if (res < 0) { - int saved_errno = errno; - - v_print("Interface '%s': Error: SIOCG%s failed: %s\n", - master_ifname, ifra[i].req_name, - strerror(saved_errno)); - - ifr.ifr_addr.sa_family = AF_INET; - memset(ifr.ifr_addr.sa_data, 0, - sizeof(ifr.ifr_addr.sa_data)); - } - - strncpy(ifr.ifr_name, slave_ifname, IFNAMSIZ); - res = ioctl(skfd, ifra[i].s_ioctl, &ifr); - if (res < 0) { - int saved_errno = errno; - - v_print("Interface '%s': Error: SIOCS%s failed: %s\n", - slave_ifname, ifra[i].req_name, - strerror(saved_errno)); - - } - - ipaddr = (unsigned char *)ifr.ifr_addr.sa_data; - v_print("Interface '%s': set IP %s to %d.%d.%d.%d\n", - slave_ifname, ifra[i].desc, - ipaddr[0], ipaddr[1], ipaddr[2], ipaddr[3]); - } - - return 0; -} - -/* - * Local variables: - * version-control: t - * kept-new-versions: 5 - * c-indent-level: 4 - * c-basic-offset: 4 - * tab-width: 4 - * compile-command: "gcc -Wall -Wstrict-prototypes -O -I/usr/src/linux/include ifenslave.c -o ifenslave" - * End: - */ - diff --git a/Documentation/networking/igb.txt b/Documentation/networking/igb.txt index ab2d7183189..43d3549366a 100644 --- a/Documentation/networking/igb.txt +++ b/Documentation/networking/igb.txt @@ -1,8 +1,8 @@ -Linux* Base Driver for Intel(R) Network Connection -================================================== +Linux* Base Driver for Intel(R) Ethernet Network Connection +=========================================================== Intel Gigabit Linux driver. -Copyright(c) 1999 - 2010 Intel Corporation. +Copyright(c) 1999 - 2013 Intel Corporation. Contents ======== @@ -54,21 +54,22 @@ Additional Configurations - The maximum MTU setting for Jumbo Frames is 9216. This value coincides with the maximum Jumbo Frames size of 9234 bytes. - - Using Jumbo Frames at 10 or 100 Mbps may result in poor performance or - loss of link. + - Using Jumbo frames at 10 or 100 Mbps is not supported and may result in + poor performance or loss of link. - Ethtool + ethtool ------- The driver utilizes the ethtool interface for driver configuration and - diagnostics, as well as displaying statistical information. + diagnostics, as well as displaying statistical information. The latest + version of ethtool can be found at: - http://sourceforge.net/projects/gkernel. + http://ftp.kernel.org/pub/software/network/ethtool/ Enabling Wake on LAN* (WoL) --------------------------- - WoL is configured through the Ethtool* utility. + WoL is configured through the ethtool* utility. - For instructions on enabling WoL with Ethtool, refer to the Ethtool man page. + For instructions on enabling WoL with ethtool, refer to the ethtool man page. WoL will be enabled on the system during the next shut down or reboot. For this driver version, in order to enable WoL, the igb driver must be @@ -91,30 +92,26 @@ Additional Configurations REQUIREMENTS: MSI-X support is required for Multiqueue. If MSI-X is not found, the system will fallback to MSI or to Legacy interrupts. - LRO - --- - Large Receive Offload (LRO) is a technique for increasing inbound throughput - of high-bandwidth network connections by reducing CPU overhead. It works by - aggregating multiple incoming packets from a single stream into a larger - buffer before they are passed higher up the networking stack, thus reducing - the number of packets that have to be processed. LRO combines multiple - Ethernet frames into a single receive in the stack, thereby potentially - decreasing CPU utilization for receives. - - NOTE: You need to have inet_lro enabled via either the CONFIG_INET_LRO or - CONFIG_INET_LRO_MODULE kernel config option. Additionally, if - CONFIG_INET_LRO_MODULE is used, the inet_lro module needs to be loaded - before the igb driver. - - You can verify that the driver is using LRO by looking at these counters in - Ethtool: - - lro_aggregated - count of total packets that were combined - lro_flushed - counts the number of packets flushed out of LRO - lro_no_desc - counts the number of times an LRO descriptor was not available - for the LRO packet - - NOTE: IPv6 and UDP are not supported by LRO. + MAC and VLAN anti-spoofing feature + ---------------------------------- + When a malicious driver attempts to send a spoofed packet, it is dropped by + the hardware and not transmitted. An interrupt is sent to the PF driver + notifying it of the spoof attempt. + + When a spoofed packet is detected the PF driver will send the following + message to the system log (displayed by the "dmesg" command): + + Spoof event(s) detected on VF(n) + + Where n=the VF that attempted to do the spoofing. + + Setting MAC Address, VLAN and Rate Limit Using IProute2 Tool + ------------------------------------------------------------ + You can set a MAC address of a Virtual Function (VF), a default VLAN and the + rate limit using the IProute2 tool. Download the latest version of the + iproute2 tool from Sourceforge if your version does not have all the + features you require. + Support ======= diff --git a/Documentation/networking/igbvf.txt b/Documentation/networking/igbvf.txt index 056028138d9..40db17a6665 100644 --- a/Documentation/networking/igbvf.txt +++ b/Documentation/networking/igbvf.txt @@ -1,8 +1,8 @@ -Linux* Base Driver for Intel(R) Network Connection -================================================== +Linux* Base Driver for Intel(R) Ethernet Network Connection +=========================================================== Intel Gigabit Linux driver. -Copyright(c) 1999 - 2010 Intel Corporation. +Copyright(c) 1999 - 2013 Intel Corporation. Contents ======== @@ -55,12 +55,14 @@ networking link on the left to search for your adapter: Additional Configurations ========================= - Ethtool + ethtool ------- The driver utilizes the ethtool interface for driver configuration and - diagnostics, as well as displaying statistical information. + diagnostics, as well as displaying statistical information. The ethtool + version 3.0 or later is required for this functionality, although we + strongly recommend downloading the latest version at: - http://sourceforge.net/projects/gkernel. + http://ftp.kernel.org/pub/software/network/ethtool/ Support ======= diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt index ae5522703d1..ab42c95f998 100644 --- a/Documentation/networking/ip-sysctl.txt +++ b/Documentation/networking/ip-sysctl.txt @@ -11,23 +11,82 @@ ip_forward - BOOLEAN for routers) ip_default_ttl - INTEGER - default 64 - -ip_no_pmtu_disc - BOOLEAN - Disable Path MTU Discovery. - default FALSE + Default value of TTL field (Time To Live) for outgoing (but not + forwarded) IP packets. Should be between 1 and 255 inclusive. + Default: 64 (as recommended by RFC1700) + +ip_no_pmtu_disc - INTEGER + Disable Path MTU Discovery. If enabled in mode 1 and a + fragmentation-required ICMP is received, the PMTU to this + destination will be set to min_pmtu (see below). You will need + to raise min_pmtu to the smallest interface MTU on your system + manually if you want to avoid locally generated fragments. + + In mode 2 incoming Path MTU Discovery messages will be + discarded. Outgoing frames are handled the same as in mode 1, + implicitly setting IP_PMTUDISC_DONT on every created socket. + + Mode 3 is a hardend pmtu discover mode. The kernel will only + accept fragmentation-needed errors if the underlying protocol + can verify them besides a plain socket lookup. Current + protocols for which pmtu events will be honored are TCP, SCTP + and DCCP as they verify e.g. the sequence number or the + association. This mode should not be enabled globally but is + only intended to secure e.g. name servers in namespaces where + TCP path mtu must still work but path MTU information of other + protocols should be discarded. If enabled globally this mode + could break other protocols. + + Possible values: 0-3 + Default: FALSE min_pmtu - INTEGER - default 562 - minimum discovered Path MTU + default 552 - minimum discovered Path MTU + +ip_forward_use_pmtu - BOOLEAN + By default we don't trust protocol path MTUs while forwarding + because they could be easily forged and can lead to unwanted + fragmentation by the router. + You only need to enable this if you have user-space software + which tries to discover path mtus by itself and depends on the + kernel honoring this information. This is normally not the + case. + Default: 0 (disabled) + Possible values: + 0 - disabled + 1 - enabled route/max_size - INTEGER Maximum number of routes allowed in the kernel. Increase this when using large numbers of interfaces and/or routes. +neigh/default/gc_thresh1 - INTEGER + Minimum number of entries to keep. Garbage collector will not + purge entries if there are fewer than this number. + Default: 128 + neigh/default/gc_thresh3 - INTEGER Maximum number of neighbor entries allowed. Increase this when using large numbers of interfaces and when communicating with large numbers of directly-connected peers. + Default: 1024 + +neigh/default/unres_qlen_bytes - INTEGER + The maximum number of bytes which may be used by packets + queued for each unresolved address by other network layers. + (added in linux 3.3) + Setting negative value is meaningless and will return error. + Default: 65536 Bytes(64KB) + +neigh/default/unres_qlen - INTEGER + The maximum number of packets which may be queued for each + unresolved address by other network layers. + (deprecated in linux 3.3) : use unres_qlen_bytes instead. + Prior to linux 3.3, the default value is 3 which may cause + unexpected packet loss. The current default value is calculated + according to default value of unres_qlen_bytes and true size of + packet. + Default: 31 mtu_expires - INTEGER Time, in seconds, that cached PMTU information is kept. @@ -36,12 +95,6 @@ min_adv_mss - INTEGER The advertised MSS depends on the first hop route MTU, but will never be lower than this setting. -rt_cache_rebuild_count - INTEGER - The per net-namespace route cache emergency rebuild threshold. - Any net-namespace having its route cache rebuilt due to - a hash bucket chain being too long more than this many times - will have its route caching disabled - IP Fragmentation: ipfrag_high_thresh - INTEGER @@ -104,16 +157,6 @@ inet_peer_maxttl - INTEGER when the number of entries in the pool is very small). Measured in seconds. -inet_peer_gc_mintime - INTEGER - Minimum interval between garbage collection passes. This interval is - in effect under high memory pressure on the pool. - Measured in seconds. - -inet_peer_gc_maxtime - INTEGER - Minimum interval between garbage collection passes. This interval is - in effect under low (or absent) memory pressure on the pool. - Measured in seconds. - TCP variables: somaxconn - INTEGER @@ -121,17 +164,6 @@ somaxconn - INTEGER Defaults to 128. See also tcp_max_syn_backlog for additional tuning for TCP sockets. -tcp_abc - INTEGER - Controls Appropriate Byte Count (ABC) defined in RFC3465. - ABC is a way of increasing congestion window (cwnd) more slowly - in response to partial acknowledgments. - Possible values are: - 0 increase cwnd once per acknowledgment (no ABC) - 1 increase cwnd once per acknowledgment of full sized segment - 2 allow increase cwnd by two if acknowledgment is - of two segments to compensate for delayed acknowledgments. - Default: 0 (off) - tcp_abort_on_overflow - BOOLEAN If listening service is too slow to accept new connections, reset them. Default state is FALSE. It means that if overflow @@ -144,7 +176,8 @@ tcp_adv_win_scale - INTEGER Count buffering overhead as bytes/2^tcp_adv_win_scale (if tcp_adv_win_scale > 0) or bytes-bytes/2^(-tcp_adv_win_scale), if it is <= 0. - Default: 2 + Possible values are [-31, 31], inclusive. + Default: 1 tcp_allowed_congestion_control - STRING Show/set the congestion control choices available to non-privileged @@ -157,6 +190,16 @@ tcp_app_win - INTEGER buffer. Value 0 is special, it means that nothing is reserved. Default: 31 +tcp_autocorking - BOOLEAN + Enable TCP auto corking : + When applications do consecutive small write()/sendmsg() system calls, + we try to coalesce these small writes as much as possible, to lower + total amount of sent packets. This is done if at least one prior + packet for the flow is waiting in Qdisc queues or device transmit + queue. Applications can still use TCP_CORK for optimal behavior + when they know how/when to uncork their sockets. + Default : 1 + tcp_available_congestion_control - STRING Shows the available congestion control choices that are registered. More congestion control algorithms may be available as modules, @@ -172,28 +215,43 @@ tcp_congestion_control - STRING connections. The algorithm "reno" is always available, but additional choices may be available based on kernel configuration. Default is set as part of kernel configuration. - -tcp_cookie_size - INTEGER - Default size of TCP Cookie Transactions (TCPCT) option, that may be - overridden on a per socket basis by the TCPCT socket option. - Values greater than the maximum (16) are interpreted as the maximum. - Values greater than zero and less than the minimum (8) are interpreted - as the minimum. Odd values are interpreted as the next even value. - Default: 0 (off). + For passive connections, the listener congestion control choice + is inherited. + [see setsockopt(listenfd, SOL_TCP, TCP_CONGESTION, "name" ...) ] tcp_dsack - BOOLEAN Allows TCP to send "duplicate" SACKs. -tcp_ecn - BOOLEAN - Enable Explicit Congestion Notification (ECN) in TCP. ECN is only - used when both ends of the TCP flow support it. It is useful to - avoid losses due to congestion (when the bottleneck router supports - ECN). +tcp_early_retrans - INTEGER + Enable Early Retransmit (ER), per RFC 5827. ER lowers the threshold + for triggering fast retransmit when the amount of outstanding data is + small and when no previously unsent data can be transmitted (such + that limited transmit could be used). Also controls the use of + Tail loss probe (TLP) that converts RTOs occurring due to tail + losses into fast recovery (draft-dukkipati-tcpm-tcp-loss-probe-01). + Possible values: + 0 disables ER + 1 enables ER + 2 enables ER but delays fast recovery and fast retransmit + by a fourth of RTT. This mitigates connection falsely + recovers when network has a small degree of reordering + (less than 3 packets). + 3 enables delayed ER and TLP. + 4 enables TLP only. + Default: 3 + +tcp_ecn - INTEGER + Control use of Explicit Congestion Notification (ECN) by TCP. + ECN is used only when both ends of the TCP connection indicate + support for it. This feature is useful in avoiding losses due + to congestion by allowing supporting routers to signal + congestion before having to drop packets. Possible values are: - 0 disable ECN - 1 ECN enabled - 2 Only server-side ECN enabled. If the other end does - not support ECN, behavior is like with ECN disabled. + 0 Disable ECN. Neither initiate nor accept ECN. + 1 Enable ECN when requested by incoming connections and + also request ECN on outgoing connection attempts. + 2 Enable ECN when requested by incoming connections + but do not request ECN on outgoing connections. Default: 2 tcp_fack - BOOLEAN @@ -201,47 +259,23 @@ tcp_fack - BOOLEAN The value is not used, if tcp_sack is not enabled. tcp_fin_timeout - INTEGER - Time to hold socket in state FIN-WAIT-2, if it was closed - by our side. Peer can be broken and never close its side, - or even died unexpectedly. Default value is 60sec. - Usual value used in 2.2 was 180 seconds, you may restore - it, but remember that if your machine is even underloaded WEB server, - you risk to overflow memory with kilotons of dead sockets, - FIN-WAIT-2 sockets are less dangerous than FIN-WAIT-1, - because they eat maximum 1.5K of memory, but they tend - to live longer. Cf. tcp_max_orphans. + The length of time an orphaned (no longer referenced by any + application) connection will remain in the FIN_WAIT_2 state + before it is aborted at the local end. While a perfectly + valid "receive only" state for an un-orphaned connection, an + orphaned connection in FIN_WAIT_2 state could otherwise wait + forever for the remote to close its end of the connection. + Cf. tcp_max_orphans + Default: 60 seconds tcp_frto - INTEGER - Enables Forward RTO-Recovery (F-RTO) defined in RFC4138. + Enables Forward RTO-Recovery (F-RTO) defined in RFC5682. F-RTO is an enhanced recovery algorithm for TCP retransmission - timeouts. It is particularly beneficial in wireless environments - where packet loss is typically due to random radio interference - rather than intermediate router congestion. F-RTO is sender-side - only modification. Therefore it does not require any support from - the peer. - - If set to 1, basic version is enabled. 2 enables SACK enhanced - F-RTO if flow uses SACK. The basic version can be used also when - SACK is in use though scenario(s) with it exists where F-RTO - interacts badly with the packet counting of the SACK enabled TCP - flow. - -tcp_frto_response - INTEGER - When F-RTO has detected that a TCP retransmission timeout was - spurious (i.e, the timeout would have been avoided had TCP set a - longer retransmission timeout), TCP has several options what to do - next. Possible values are: - 0 Rate halving based; a smooth and conservative response, - results in halved cwnd and ssthresh after one RTT - 1 Very conservative response; not recommended because even - though being valid, it interacts poorly with the rest of - Linux TCP, halves cwnd and ssthresh immediately - 2 Aggressive response; undoes congestion control measures - that are now known to be unnecessary (ignoring the - possibility of a lost retransmission that would require - TCP to be more cautious), cwnd and ssthresh are restored - to the values prior timeout - Default: 0 (rate halving based) + timeouts. It is particularly beneficial in networks where the + RTT fluctuates (e.g., wireless). F-RTO is sender-side only + modification. It does not require any support from the peer. + + By default it's enabled with a non-zero value. 0 disables F-RTO. tcp_keepalive_time - INTEGER How often TCP sends out keepalive messages when keepalive is enabled. @@ -278,11 +312,11 @@ tcp_max_orphans - INTEGER up to ~64K of unswappable memory. tcp_max_syn_backlog - INTEGER - Maximal number of remembered connection requests, which are - still did not receive an acknowledgment from connecting client. - Default value is 1024 for systems with more than 128Mb of memory, - and 128 for low memory machines. If server suffers of overload, - try to increase this number. + Maximal number of remembered connection requests, which have not + received an acknowledgment from connecting client. + The minimal value is 128 for low memory machines, and it will + increase in proportion to the memory of machine. + If server suffers from overload, try increasing this number. tcp_max_tw_buckets - INTEGER Maximal number of timewait sockets held by system simultaneously. @@ -332,7 +366,7 @@ tcp_orphan_retries - INTEGER when RTO retransmissions remain unacknowledged. See tcp_retries2 for more details. - The default value is 7. + The default value is 8. If your machine is a loaded WEB server, you should think about lowering this value, such sockets may consume significant resources. Cf. tcp_max_orphans. @@ -380,7 +414,7 @@ tcp_rmem - vector of 3 INTEGERs: min, default, max min: Minimal size of receive buffer used by TCP sockets. It is guaranteed to each TCP socket, even under moderate memory pressure. - Default: 8K + Default: 1 page default: initial size of receive buffer used by TCP sockets. This value overrides net.core.rmem_default used by other protocols. @@ -393,7 +427,7 @@ tcp_rmem - vector of 3 INTEGERs: min, default, max net.core.rmem_max. Calling setsockopt() with SO_RCVBUF disables automatic tuning of that socket's receive buffer size, in which case this value is ignored. - Default: between 87380B and 4MB, depending on RAM size. + Default: between 87380B and 6MB, depending on RAM size. tcp_sack - BOOLEAN Enable select acknowledgments (SACKS). @@ -414,13 +448,15 @@ tcp_stdurg - BOOLEAN tcp_synack_retries - INTEGER Number of times SYNACKs for a passive TCP connection attempt will be retransmitted. Should not be higher than 255. Default value - is 5, which corresponds to ~180seconds. + is 5, which corresponds to 31seconds till the last retransmission + with the current initial RTO of 1second. With this the final timeout + for a passive TCP connection will happen after 63seconds. tcp_syncookies - BOOLEAN - Only valid when the kernel was compiled with CONFIG_SYNCOOKIES + Only valid when the kernel was compiled with CONFIG_SYN_COOKIES Send out syncookies when the syn backlog queue of a socket overflows. This is to prevent against the common 'SYN flood attack' - Default: FALSE + Default: 1 Note, that syncookies is fallback facility. It MUST NOT be used to help highly loaded servers to stand @@ -437,14 +473,57 @@ tcp_syncookies - BOOLEAN SYN flood warnings in logs not being really flooded, your server is seriously misconfigured. + If you want to test which effects syncookies have to your + network connections you can set this knob to 2 to enable + unconditionally generation of syncookies. + +tcp_fastopen - INTEGER + Enable TCP Fast Open feature (draft-ietf-tcpm-fastopen) to send data + in the opening SYN packet. To use this feature, the client application + must use sendmsg() or sendto() with MSG_FASTOPEN flag rather than + connect() to perform a TCP handshake automatically. + + The values (bitmap) are + 1: Enables sending data in the opening SYN on the client w/ MSG_FASTOPEN. + 2: Enables TCP Fast Open on the server side, i.e., allowing data in + a SYN packet to be accepted and passed to the application before + 3-way hand shake finishes. + 4: Send data in the opening SYN regardless of cookie availability and + without a cookie option. + 0x100: Accept SYN data w/o validating the cookie. + 0x200: Accept data-in-SYN w/o any cookie option present. + 0x400/0x800: Enable Fast Open on all listeners regardless of the + TCP_FASTOPEN socket option. The two different flags designate two + different ways of setting max_qlen without the TCP_FASTOPEN socket + option. + + Default: 1 + + Note that the client & server side Fast Open flags (1 and 2 + respectively) must be also enabled before the rest of flags can take + effect. + + See include/net/tcp.h and the code for more details. + tcp_syn_retries - INTEGER Number of times initial SYNs for an active TCP connection attempt will be retransmitted. Should not be higher than 255. Default value - is 5, which corresponds to ~180seconds. + is 6, which corresponds to 63seconds till the last retransmission + with the current initial RTO of 1second. With this the final timeout + for an active TCP connection attempt will happen after 127seconds. tcp_timestamps - BOOLEAN Enable timestamps as defined in RFC1323. +tcp_min_tso_segs - INTEGER + Minimal number of segments per TSO frame. + Since linux-3.12, TCP does an automatic sizing of TSO frames, + depending on flow rate, instead of filling 64Kbytes packets. + For specific usages, it's possible to force TCP to build big + TSO frames. Note that TCP stack might split too big TSO packets + if available window is too small. + Default: 2 + tcp_tso_win_divisor - INTEGER This allows control over what percentage of the congestion window can be consumed by a single TSO frame. @@ -469,7 +548,7 @@ tcp_window_scaling - BOOLEAN tcp_wmem - vector of 3 INTEGERs: min, default, max min: Amount of memory reserved for send buffers for TCP sockets. Each TCP socket has rights to use it due to fact of its birth. - Default: 4K + Default: 1 page default: initial size of send buffer used by TCP sockets. This value overrides net.core.wmem_default used by other protocols. @@ -483,6 +562,19 @@ tcp_wmem - vector of 3 INTEGERs: min, default, max this value is ignored. Default: between 64K and 4MB, depending on RAM size. +tcp_notsent_lowat - UNSIGNED INTEGER + A TCP socket can control the amount of unsent bytes in its write queue, + thanks to TCP_NOTSENT_LOWAT socket option. poll()/select()/epoll() + reports POLLOUT events if the amount of unsent bytes is below a per + socket value, and if the write queue is not full. sendmsg() will + also not add new buffers if the limit is hit. + + This global variable controls the amount of unsent data for + sockets not using TCP_NOTSENT_LOWAT. For these sockets, a change + to the global variable has immediate effect. + + Default: UINT_MAX (0xFFFFFFFF) + tcp_workaround_signed_windows - BOOLEAN If set, assume no receipt of a window scaling option means the remote TCP is broken and treats the window as a signed quantity. @@ -520,6 +612,22 @@ tcp_thin_dupack - BOOLEAN Documentation/networking/tcp-thin.txt Default: 0 +tcp_limit_output_bytes - INTEGER + Controls TCP Small Queue limit per tcp socket. + TCP bulk sender tends to increase packets in flight until it + gets losses notifications. With SNDBUF autotuning, this can + result in a large amount of packets queued in qdisc/device + on the local machine, hurting latency of other flows, for + typical pfifo_fast qdiscs. + tcp_limit_output_bytes limits the number of bytes on qdisc + or device to reduce artificial RTT/cwnd and reduce bufferbloat. + Default: 131072 + +tcp_challenge_ack_limit - INTEGER + Limits number of Challenge ACK sent per second, as recommended + in RFC 5961 (Improving TCP's Robustness to Blind In-Window Attacks) + Default: 100 + UDP variables: udp_mem - vector of 3 INTEGERs: min, pressure, max @@ -539,13 +647,13 @@ udp_rmem_min - INTEGER Minimal size of receive buffer used by UDP sockets in moderation. Each UDP socket is able to use the size for receiving data, even if total pages of UDP sockets exceed udp_mem pressure. The unit is byte. - Default: 4096 + Default: 1 page udp_wmem_min - INTEGER Minimal size of send buffer used by UDP sockets in moderation. Each UDP socket is able to use the size for sending data, even if total pages of UDP sockets exceed udp_mem pressure. The unit is byte. - Default: 4096 + Default: 1 page CIPSOv4 Variables: @@ -587,15 +695,8 @@ IP Variables: ip_local_port_range - 2 INTEGERS Defines the local port range that is used by TCP and UDP to choose the local port. The first number is the first, the - second the last local port number. Default value depends on - amount of memory available on the system: - > 128Mb 32768-61000 - < 128Mb 1024-4999 or even less. - This number defines number of active connections, which this - system can issue simultaneously to systems not supporting - TCP extensions (timestamps). With tcp_tw_recycle enabled - (i.e. by default) range 1024-4999 is enough to issue up to - 2000 connections per second to systems supporting timestamps. + second the last local port number. The default values are + 32768 and 61000 respectively. ip_local_reserved_ports - list of comma separated ranges Specify the ports which are reserved for known third-party @@ -640,6 +741,15 @@ ip_dynaddr - BOOLEAN occurs. Default: 0 +ip_early_demux - BOOLEAN + Optimize input packet processing down to one demux for + certain kinds of local sockets. Currently we only do this + for established TCP sockets. + + It may add an additional cost for pure routing workloads that + reduces overall throughput, in such case you should disable it. + Default: 1 + icmp_echo_ignore_all - BOOLEAN If set non-zero, then the kernel will ignore all ICMP ECHO requests sent to it. @@ -684,7 +794,7 @@ icmp_ignore_bogus_error_responses - BOOLEAN frames. Such violations are normally logged via a kernel warning. If this is set to TRUE, the kernel will not give such warnings, which will avoid log file clutter. - Default: FALSE + Default: 1 icmp_errors_use_inbound_ifaddr - BOOLEAN @@ -833,9 +943,19 @@ accept_source_route - BOOLEAN FALSE (host) accept_local - BOOLEAN - Accept packets with local source addresses. In combination with - suitable routing, this can be used to direct packets between two - local interfaces over the wire and have them accepted properly. + Accept packets with local source addresses. In combination + with suitable routing, this can be used to direct packets + between two local interfaces over the wire and have them + accepted properly. + + rp_filter must be set to a non-zero value in order for + accept_local to have an effect. + + default FALSE + +route_localnet - BOOLEAN + Do not consider loopback addresses as martian source or destination + while routing. This enables the use of 127/8 for local routing purposes. default FALSE rp_filter - INTEGER @@ -958,6 +1078,20 @@ disable_policy - BOOLEAN disable_xfrm - BOOLEAN Disable IPSEC encryption on this interface, whatever the policy +igmpv2_unsolicited_report_interval - INTEGER + The interval in milliseconds in which the next unsolicited + IGMPv1 or IGMPv2 report retransmit will take place. + Default: 10000 (10 seconds) + +igmpv3_unsolicited_report_interval - INTEGER + The interval in milliseconds in which the next unsolicited + IGMPv3 report retransmit will take place. + Default: 1000 (1 seconds) + +promote_secondaries - BOOLEAN + When a primary IP address is removed from this interface + promote a corresponding secondary IP address instead of + removing all the corresponding secondary IP addresses. tag - INTEGER @@ -988,7 +1122,22 @@ bindv6only - BOOLEAN TRUE: disable IPv4-mapped address feature FALSE: enable IPv4-mapped address feature - Default: FALSE (as specified in RFC2553bis) + Default: FALSE (as specified in RFC3493) + +flowlabel_consistency - BOOLEAN + Protect the consistency (and unicity) of flow label. + You have to disable it to use IPV6_FL_F_REFLECT flag on the + flow label manager. + TRUE: enabled + FALSE: disabled + Default: TRUE + +anycast_src_echo_reply - BOOLEAN + Controls the use of anycast addresses as source addresses for ICMPv6 + echo reply + TRUE: enabled + FALSE: disabled + Default: FALSE IPv6 Fragmentation: @@ -1038,9 +1187,14 @@ conf/interface/*: The functional behaviour for certain settings is different depending on whether local forwarding is enabled or not. -accept_ra - BOOLEAN +accept_ra - INTEGER Accept Router Advertisements; autoconfigure using them. + It also determines whether or not to transmit Router + Solicitations. If and only if the functional setting is to + accept Router Advertisements, Router Solicitations will be + transmitted. + Possible values are: 0 Do not accept Router Advertisements. 1 Accept Router Advertisements if forwarding is disabled. @@ -1102,7 +1256,7 @@ dad_transmits - INTEGER The amount of Duplicate Address Detection probes to send. Default: 1 -forwarding - BOOLEAN +forwarding - INTEGER Configure interface-specific Host/Router behaviour. Note: It is recommended to have the same setting on all @@ -1111,14 +1265,14 @@ forwarding - BOOLEAN Possible values are: 0 Forwarding disabled 1 Forwarding enabled - 2 Forwarding enabled (Hybrid Mode) FALSE (0): By default, Host behaviour is assumed. This means: 1. IsRouter flag is not set in Neighbour Advertisements. - 2. Router Solicitations are being sent when necessary. + 2. If accept_ra is TRUE (default), transmit Router + Solicitations. 3. If accept_ra is TRUE (default), accept Router Advertisements (and do autoconfiguration). 4. If accept_redirects is TRUE (default), accept Redirects. @@ -1129,16 +1283,10 @@ forwarding - BOOLEAN This means exactly the reverse from the above: 1. IsRouter flag is set in Neighbour Advertisements. - 2. Router Solicitations are not sent. + 2. Router Solicitations are not sent unless accept_ra is 2. 3. Router Advertisements are ignored unless accept_ra is 2. 4. Redirects are ignored. - TRUE (2): - - Hybrid mode. Same behaviour as TRUE, except for: - - 2. Router Solicitations are being sent when necessary. - Default: 0 (disabled) if global forwarding is disabled (default), otherwise 1 (enabled). @@ -1245,6 +1393,33 @@ force_tllao - BOOLEAN race condition where the sender deletes the cached link-layer address prior to receiving a response to a previous solicitation." +ndisc_notify - BOOLEAN + Define mode for notification of address and device changes. + 0 - (default): do nothing + 1 - Generate unsolicited neighbour advertisements when device is brought + up or hardware address changes. + +mldv1_unsolicited_report_interval - INTEGER + The interval in milliseconds in which the next unsolicited + MLDv1 report retransmit will take place. + Default: 10000 (10 seconds) + +mldv2_unsolicited_report_interval - INTEGER + The interval in milliseconds in which the next unsolicited + MLDv2 report retransmit will take place. + Default: 1000 (1 second) + +force_mld_version - INTEGER + 0 - (default) No enforcement of a MLD version, MLDv1 fallback allowed + 1 - Enforce to use MLD version 1 + 2 - Enforce to use MLD version 2 + +suppress_frag_ndisc - INTEGER + Control RFC 6980 (Security Implications of IPv6 Fragmentation + with IPv6 Neighbor Discovery) behavior: + 1 - (default) discard fragmented neighbor discovery packets + 0 - allow fragmented neighbor discovery packets + icmp/*: ratelimit - INTEGER Limit the maximal rates for sending ICMPv6 packets. @@ -1278,13 +1453,22 @@ bridge-nf-call-ip6tables - BOOLEAN bridge-nf-filter-vlan-tagged - BOOLEAN 1 : pass bridged vlan-tagged ARP/IP/IPv6 traffic to {arp,ip,ip6}tables. 0 : disable this. - Default: 1 + Default: 0 bridge-nf-filter-pppoe-tagged - BOOLEAN 1 : pass bridged pppoe-tagged IP/IPv6 traffic to {ip,ip6}tables. 0 : disable this. - Default: 1 + Default: 0 +bridge-nf-pass-vlan-input-dev - BOOLEAN + 1: if bridge-nf-filter-vlan-tagged is enabled, try to find a vlan + interface on the bridge and set the netfilter input device to the vlan. + This allows use of e.g. "iptables -i br0.1" and makes the REDIRECT + target work with vlan-on-top-of-bridge interfaces. When no matching + vlan interface is found, or this switch is off, the input device is + set to the bridge interface. + 0: disable bridge netfilter vlan interface lookup. + Default: 0 proc/sys/net/sctp/* Variables: @@ -1366,6 +1550,20 @@ path_max_retrans - INTEGER Default: 5 +pf_retrans - INTEGER + The number of retransmissions that will be attempted on a given path + before traffic is redirected to an alternate transport (should one + exist). Note this is distinct from path_max_retrans, as a path that + passes the pf_retrans threshold can still be used. Its only + deprioritized when a transmission path is selected by the stack. This + setting is primarily used to enable fast failover mechanisms without + having to reduce path_max_retrans to a very low value. See: + http://www.ietf.org/id/draft-nishida-tsvwg-sctp-failover-05.txt + for details. Note also that a value of pf_retrans > path_max_retrans + disables this feature + + Default: 0 + rto_initial - INTEGER The initial round trip timeout value in milliseconds that will be used in calculating round trip times. This is the initial time interval @@ -1413,6 +1611,20 @@ cookie_preserve_enable - BOOLEAN Default: 1 +cookie_hmac_alg - STRING + Select the hmac algorithm used when generating the cookie value sent by + a listening sctp socket to a connecting client in the INIT-ACK chunk. + Valid values are: + * md5 + * sha1 + * none + Ability to assign md5 or sha1 as the selected alg is predicated on the + configuration of those algorithms at build time (CONFIG_CRYPTO_MD5 and + CONFIG_CRYPTO_SHA1). + + Default: Dependent on configuration. MD5 if available, else SHA1 if + available, else none. + rcvbuf_policy - INTEGER Determines if the receive buffer is attributed to the socket or to association. SCTP supports the capability to create multiple @@ -1425,7 +1637,7 @@ rcvbuf_policy - INTEGER blocking. 1: rcvbuf space is per association - 0: recbuf space is per socket + 0: rcvbuf space is per socket Default: 0 @@ -1451,10 +1663,17 @@ sctp_mem - vector of 3 INTEGERs: min, pressure, max Default is calculated at boot time from amount of available memory. sctp_rmem - vector of 3 INTEGERs: min, default, max - See tcp_rmem for a description. + Only the first value ("min") is used, "default" and "max" are + ignored. + + min: Minimal size of receive buffer used by SCTP socket. + It is guaranteed to each SCTP socket (but not association) even + under moderate memory pressure. + + Default: 1 page sctp_wmem - vector of 3 INTEGERs: min, default, max - See tcp_wmem for a description. + Currently this tunable has no effect. addr_scope_policy - INTEGER Control IPv4 address scoping - draft-stewart-tsvwg-sctp-ipv4-00 @@ -1468,11 +1687,8 @@ addr_scope_policy - INTEGER /proc/sys/net/core/* -dev_weight - INTEGER - The maximum number of packets that kernel can handle on a NAPI - interrupt, it's a Per-CPU variable. + Please see: Documentation/sysctl/net.txt for descriptions of these entries. - Default: 64 /proc/sys/net/unix/* max_dgram_qlen - INTEGER diff --git a/Documentation/networking/ipsec.txt b/Documentation/networking/ipsec.txt new file mode 100644 index 00000000000..8dbc08b7e43 --- /dev/null +++ b/Documentation/networking/ipsec.txt @@ -0,0 +1,38 @@ + +Here documents known IPsec corner cases which need to be keep in mind when +deploy various IPsec configuration in real world production environment. + +1. IPcomp: Small IP packet won't get compressed at sender, and failed on + policy check on receiver. + +Quote from RFC3173: +2.2. Non-Expansion Policy + + If the total size of a compressed payload and the IPComp header, as + defined in section 3, is not smaller than the size of the original + payload, the IP datagram MUST be sent in the original non-compressed + form. To clarify: If an IP datagram is sent non-compressed, no + + IPComp header is added to the datagram. This policy ensures saving + the decompression processing cycles and avoiding incurring IP + datagram fragmentation when the expanded datagram is larger than the + MTU. + + Small IP datagrams are likely to expand as a result of compression. + Therefore, a numeric threshold should be applied before compression, + where IP datagrams of size smaller than the threshold are sent in the + original form without attempting compression. The numeric threshold + is implementation dependent. + +Current IPComp implementation is indeed by the book, while as in practice +when sending non-compressed packet to the peer(whether or not packet len +is smaller than the threshold or the compressed len is large than original +packet len), the packet is dropped when checking the policy as this packet +matches the selector but not coming from any XFRM layer, i.e., with no +security path. Such naked packet will not eventually make it to upper layer. +The result is much more wired to the user when ping peer with different +payload length. + +One workaround is try to set "level use" for each policy if user observed +above scenario. The consequence of doing so is small packet(uncompressed) +will skip policy checking on receiver side. diff --git a/Documentation/networking/ipv6.txt b/Documentation/networking/ipv6.txt index 9fd7e21296c..6cd74fa5535 100644 --- a/Documentation/networking/ipv6.txt +++ b/Documentation/networking/ipv6.txt @@ -2,9 +2,9 @@ Options for the ipv6 module are supplied as parameters at load time. Module options may be given as command line arguments to the insmod -or modprobe command, but are usually specified in either the -/etc/modules.conf or /etc/modprobe.conf configuration file, or in a -distro-specific configuration file. +or modprobe command, but are usually specified in either +/etc/modules.d/*.conf configuration files, or in a distro-specific +configuration file. The available ipv6 module parameters are listed below. If a parameter is not specified the default value is used. diff --git a/Documentation/networking/ipvs-sysctl.txt b/Documentation/networking/ipvs-sysctl.txt index 4ccdbca0381..7a3c0472959 100644 --- a/Documentation/networking/ipvs-sysctl.txt +++ b/Documentation/networking/ipvs-sysctl.txt @@ -15,6 +15,30 @@ amemthresh - INTEGER enabled and the variable is automatically set to 2, otherwise the strategy is disabled and the variable is set to 1. +backup_only - BOOLEAN + 0 - disabled (default) + not 0 - enabled + + If set, disable the director function while the server is + in backup mode to avoid packet loops for DR/TUN methods. + +conntrack - BOOLEAN + 0 - disabled (default) + not 0 - enabled + + If set, maintain connection tracking entries for + connections handled by IPVS. + + This should be enabled if connections handled by IPVS are to be + also handled by stateful firewall rules. That is, iptables rules + that make use of connection tracking. It is a performance + optimisation to disable this setting otherwise. + + Connections handled by the IPVS FTP application module + will have connection tracking entries regardless of this setting. + + Only available when IPVS is compiled with CONFIG_IP_VS_NFCT enabled. + cache_bypass - BOOLEAN 0 - disabled (default) not 0 - enabled @@ -39,7 +63,7 @@ debug_level - INTEGER 11 - IPVS packet handling (ip_vs_in/ip_vs_out) 12 or more - packet traversal - Only available when IPVS is compiled with the CONFIG_IPVS_DEBUG + Only available when IPVS is compiled with CONFIG_IP_VS_DEBUG enabled. Higher debugging levels include the messages for lower debugging levels, so setting debug level 2, includes level 0, 1 and 2 @@ -123,13 +147,11 @@ nat_icmp_send - BOOLEAN secure_tcp - INTEGER 0 - disabled (default) - The secure_tcp defense is to use a more complicated state - transition table and some possible short timeouts of each - state. In the VS/NAT, it delays the entering the ESTABLISHED - until the real server starts to send data and ACK packet - (after 3-way handshake). + The secure_tcp defense is to use a more complicated TCP state + transition table. For VS/NAT, it also delays entering the + TCP ESTABLISHED state until the three way handshake is completed. - The value definition is the same as that of drop_entry or + The value definition is the same as that of drop_entry and drop_packet. sync_threshold - INTEGER @@ -141,3 +163,49 @@ sync_threshold - INTEGER synchronized, every time the number of its incoming packets modulus 50 equals the threshold. The range of the threshold is from 0 to 49. + +snat_reroute - BOOLEAN + 0 - disabled + not 0 - enabled (default) + + If enabled, recalculate the route of SNATed packets from + realservers so that they are routed as if they originate from the + director. Otherwise they are routed as if they are forwarded by the + director. + + If policy routing is in effect then it is possible that the route + of a packet originating from a director is routed differently to a + packet being forwarded by the director. + + If policy routing is not in effect then the recalculated route will + always be the same as the original route so it is an optimisation + to disable snat_reroute and avoid the recalculation. + +sync_persist_mode - INTEGER + default 0 + + Controls the synchronisation of connections when using persistence + + 0: All types of connections are synchronised + 1: Attempt to reduce the synchronisation traffic depending on + the connection type. For persistent services avoid synchronisation + for normal connections, do it only for persistence templates. + In such case, for TCP and SCTP it may need enabling sloppy_tcp and + sloppy_sctp flags on backup servers. For non-persistent services + such optimization is not applied, mode 0 is assumed. + +sync_version - INTEGER + default 1 + + The version of the synchronisation protocol used when sending + synchronisation messages. + + 0 selects the original synchronisation protocol (version 0). This + should be used when sending synchronisation messages to a legacy + system that only understands the original synchronisation protocol. + + 1 selects the current synchronisation protocol (version 1). This + should be used where possible. + + Kernels with this sync_version entry are able to receive messages + of both version 1 and version 2 of the synchronisation protocol. diff --git a/Documentation/networking/ixgb.txt b/Documentation/networking/ixgb.txt index a0d0ffb5e58..1e0c045e89f 100644 --- a/Documentation/networking/ixgb.txt +++ b/Documentation/networking/ixgb.txt @@ -1,7 +1,7 @@ -Linux Base Driver for 10 Gigabit Intel(R) Network Connection -============================================================= +Linux Base Driver for 10 Gigabit Intel(R) Ethernet Network Connection +===================================================================== -October 9, 2007 +March 14, 2011 Contents @@ -306,18 +306,18 @@ Additional Configurations with the maximum Jumbo Frames size of 16128. - Ethtool + ethtool ------- The driver utilizes the ethtool interface for driver configuration and - diagnostics, as well as displaying statistical information. Ethtool + diagnostics, as well as displaying statistical information. The ethtool version 1.6 or later is required for this functionality. The latest release of ethtool can be found from - http://sourceforge.net/projects/gkernel + http://ftp.kernel.org/pub/software/network/ethtool/ - NOTE: Ethtool 1.6 only supports a limited set of ethtool options. Support - for a more complete ethtool feature set can be enabled by upgrading - to the latest version. + NOTE: The ethtool version 1.6 only supports a limited set of ethtool options. + Support for a more complete ethtool feature set can be enabled by + upgrading to the latest version. NAPI diff --git a/Documentation/networking/ixgbe.txt b/Documentation/networking/ixgbe.txt index eeb68685c78..96cccebb839 100644 --- a/Documentation/networking/ixgbe.txt +++ b/Documentation/networking/ixgbe.txt @@ -1,106 +1,213 @@ -Linux Base Driver for 10 Gigabit PCI Express Intel(R) Network Connection -======================================================================== - -March 10, 2009 +Linux* Base Driver for the Intel(R) Ethernet 10 Gigabit PCI Express Family of +Adapters +============================================================================= +Intel 10 Gigabit Linux driver. +Copyright(c) 1999 - 2013 Intel Corporation. Contents ======== -- In This Release - Identifying Your Adapter -- Building and Installation - Additional Configurations +- Performance Tuning +- Known Issues - Support +Identifying Your Adapter +======================== +The driver in this release is compatible with 82598, 82599 and X540-based +Intel Network Connections. -In This Release -=============== +For more information on how to identify your adapter, go to the Adapter & +Driver ID Guide at: -This file describes the ixgbe Linux Base Driver for the 10 Gigabit PCI -Express Intel(R) Network Connection. This driver includes support for -Itanium(R)2-based systems. + http://support.intel.com/support/network/sb/CS-012904.htm -For questions related to hardware requirements, refer to the documentation -supplied with your 10 Gigabit adapter. All hardware requirements listed apply -to use with Linux. +SFP+ Devices with Pluggable Optics +---------------------------------- -The following features are available in this kernel: - - Native VLANs - - Channel Bonding (teaming) - - SNMP - - Generic Receive Offload - - Data Center Bridging +82599-BASED ADAPTERS -Channel Bonding documentation can be found in the Linux kernel source: -/Documentation/networking/bonding.txt +NOTES: If your 82599-based Intel(R) Network Adapter came with Intel optics, or +is an Intel(R) Ethernet Server Adapter X520-2, then it only supports Intel +optics and/or the direct attach cables listed below. -Ethtool, lspci, and ifconfig can be used to display device and driver -specific information. +When 82599-based SFP+ devices are connected back to back, they should be set to +the same Speed setting via ethtool. Results may vary if you mix speed settings. +82598-based adapters support all passive direct attach cables that comply +with SFF-8431 v4.1 and SFF-8472 v10.4 specifications. Active direct attach +cables are not supported. +Supplier Type Part Numbers -Identifying Your Adapter -======================== +SR Modules +Intel DUAL RATE 1G/10G SFP+ SR (bailed) FTLX8571D3BCV-IT +Intel DUAL RATE 1G/10G SFP+ SR (bailed) AFBR-703SDDZ-IN1 +Intel DUAL RATE 1G/10G SFP+ SR (bailed) AFBR-703SDZ-IN2 +LR Modules +Intel DUAL RATE 1G/10G SFP+ LR (bailed) FTLX1471D3BCV-IT +Intel DUAL RATE 1G/10G SFP+ LR (bailed) AFCT-701SDDZ-IN1 +Intel DUAL RATE 1G/10G SFP+ LR (bailed) AFCT-701SDZ-IN2 -This driver supports devices based on the 82598 controller and the 82599 -controller. +The following is a list of 3rd party SFP+ modules and direct attach cables that +have received some testing. Not all modules are applicable to all devices. -For specific information on identifying which adapter you have, please visit: +Supplier Type Part Numbers - http://support.intel.com/support/network/sb/CS-008441.htm +Finisar SFP+ SR bailed, 10g single rate FTLX8571D3BCL +Avago SFP+ SR bailed, 10g single rate AFBR-700SDZ +Finisar SFP+ LR bailed, 10g single rate FTLX1471D3BCL +Finisar DUAL RATE 1G/10G SFP+ SR (No Bail) FTLX8571D3QCV-IT +Avago DUAL RATE 1G/10G SFP+ SR (No Bail) AFBR-703SDZ-IN1 +Finisar DUAL RATE 1G/10G SFP+ LR (No Bail) FTLX1471D3QCV-IT +Avago DUAL RATE 1G/10G SFP+ LR (No Bail) AFCT-701SDZ-IN1 +Finistar 1000BASE-T SFP FCLF8522P2BTL +Avago 1000BASE-T SFP ABCU-5710RZ -Building and Installation -========================= +82599-based adapters support all passive and active limiting direct attach +cables that comply with SFF-8431 v4.1 and SFF-8472 v10.4 specifications. -select m for "Intel(R) 10GbE PCI Express adapters support" located at: - Location: - -> Device Drivers - -> Network device support (NETDEVICES [=y]) - -> Ethernet (10000 Mbit) (NETDEV_10000 [=y]) +Laser turns off for SFP+ when ifconfig down +------------------------------------------- +"ifconfig down" turns off the laser for 82599-based SFP+ fiber adapters. +"ifconfig up" turns on the laser. -1. make modules & make modules_install -2. Load the module: +82598-BASED ADAPTERS -# modprobe ixgbe +NOTES for 82598-Based Adapters: +- Intel(R) Network Adapters that support removable optical modules only support + their original module type (i.e., the Intel(R) 10 Gigabit SR Dual Port + Express Module only supports SR optical modules). If you plug in a different + type of module, the driver will not load. +- Hot Swapping/hot plugging optical modules is not supported. +- Only single speed, 10 gigabit modules are supported. +- LAN on Motherboard (LOMs) may support DA, SR, or LR modules. Other module + types are not supported. Please see your system documentation for details. - The insmod command can be used if the full - path to the driver module is specified. For example: +The following is a list of 3rd party SFP+ modules and direct attach cables that +have received some testing. Not all modules are applicable to all devices. - insmod /lib/modules/<KERNEL VERSION>/kernel/drivers/net/ixgbe/ixgbe.ko +Supplier Type Part Numbers - With 2.6 based kernels also make sure that older ixgbe drivers are - removed from the kernel, before loading the new module: +Finisar SFP+ SR bailed, 10g single rate FTLX8571D3BCL +Avago SFP+ SR bailed, 10g single rate AFBR-700SDZ +Finisar SFP+ LR bailed, 10g single rate FTLX1471D3BCL - rmmod ixgbe; modprobe ixgbe +82598-based adapters support all passive direct attach cables that comply +with SFF-8431 v4.1 and SFF-8472 v10.4 specifications. Active direct attach +cables are not supported. -3. Assign an IP address to the interface by entering the following, where - x is the interface number: - ifconfig ethx <IP_address> +Flow Control +------------ +Ethernet Flow Control (IEEE 802.3x) can be configured with ethtool to enable +receiving and transmitting pause frames for ixgbe. When TX is enabled, PAUSE +frames are generated when the receive packet buffer crosses a predefined +threshold. When rx is enabled, the transmit unit will halt for the time delay +specified when a PAUSE frame is received. -4. Verify that the interface works. Enter the following, where <IP_address> - is the IP address for another machine on the same subnet as the interface - that is being tested: +Flow Control is enabled by default. If you want to disable a flow control +capable link partner, use ethtool: - ping <IP_address> + ethtool -A eth? autoneg off RX off TX off +NOTE: For 82598 backplane cards entering 1 gig mode, flow control default +behavior is changed to off. Flow control in 1 gig mode on these devices can +lead to Tx hangs. -Additional Configurations -========================= +Intel(R) Ethernet Flow Director +------------------------------- +Supports advanced filters that direct receive packets by their flows to +different queues. Enables tight control on routing a flow in the platform. +Matches flows and CPU cores for flow affinity. Supports multiple parameters +for flexible flow classification and load balancing. + +Flow director is enabled only if the kernel is multiple TX queue capable. + +An included script (set_irq_affinity.sh) automates setting the IRQ to CPU +affinity. + +You can verify that the driver is using Flow Director by looking at the counter +in ethtool: fdir_miss and fdir_match. + +Other ethtool Commands: +To enable Flow Director + ethtool -K ethX ntuple on +To add a filter + Use -U switch. e.g., ethtool -U ethX flow-type tcp4 src-ip 0x178000a + action 1 +To see the list of filters currently present: + ethtool -u ethX + +Perfect Filter: Perfect filter is an interface to load the filter table that +funnels all flow into queue_0 unless an alternative queue is specified using +"action". In that case, any flow that matches the filter criteria will be +directed to the appropriate queue. + +If the queue is defined as -1, filter will drop matching packets. + +To account for filter matches and misses, there are two stats in ethtool: +fdir_match and fdir_miss. In addition, rx_queue_N_packets shows the number of +packets processed by the Nth queue. + +NOTE: Receive Packet Steering (RPS) and Receive Flow Steering (RFS) are not +compatible with Flow Director. IF Flow Director is enabled, these will be +disabled. + +The following three parameters impact Flow Director. + +FdirMode +-------- +Valid Range: 0-2 (0=off, 1=ATR, 2=Perfect filter mode) +Default Value: 1 + + Flow Director filtering modes. + +FdirPballoc +----------- +Valid Range: 0-2 (0=64k, 1=128k, 2=256k) +Default Value: 0 + + Flow Director allocated packet buffer size. + +AtrSampleRate +-------------- +Valid Range: 1-100 +Default Value: 20 - Viewing Link Messages - --------------------- - Link messages will not be displayed to the console if the distribution is - restricting system messages. In order to see network driver link messages on - your console, set dmesg to eight by entering the following: + Software ATR Tx packet sample rate. For example, when set to 20, every 20th + packet, looks to see if the packet will create a new flow. - dmesg -n 8 +Node +---- +Valid Range: 0-n +Default Value: 1 (off) - NOTE: This setting is not saved across reboots. + 0 - n: where n is the number of NUMA nodes (i.e. 0 - 3) currently online in + your system + 1: turns this option off + The Node parameter will allow you to pick which NUMA node you want to have + the adapter allocate memory on. + +max_vfs +------- +Valid Range: 1-63 +Default Value: 0 + + If the value is greater than 0 it will also force the VMDq parameter to be 1 + or more. + + This parameter adds support for SR-IOV. It causes the driver to spawn up to + max_vfs worth of virtual function. + + +Additional Configurations +========================= Jumbo Frames ------------ @@ -123,13 +230,8 @@ Additional Configurations other protocols besides TCP. It's also safe to use with configurations that are problematic for LRO, namely bridging and iSCSI. - GRO is enabled by default in the driver. Future versions of ethtool will - support disabling and re-enabling GRO on the fly. - - Data Center Bridging, aka DCB ----------------------------- - DCB is a configuration Quality of Service implementation in hardware. It uses the VLAN priority tag (802.1p) to filter traffic. That means that there are 8 different priorities that traffic can be filtered into. @@ -163,24 +265,72 @@ Additional Configurations http://e1000.sf.net - Ethtool ------- The driver utilizes the ethtool interface for driver configuration and - diagnostics, as well as displaying statistical information. Ethtool - version 3.0 or later is required for this functionality. + diagnostics, as well as displaying statistical information. The latest + ethtool version is required for this functionality. The latest release of ethtool can be found from - http://sourceforge.net/projects/gkernel. - + http://ftp.kernel.org/pub/software/network/ethtool/ - NAPI + FCoE ---- + This release of the ixgbe driver contains new code to enable users to use + Fiber Channel over Ethernet (FCoE) and Data Center Bridging (DCB) + functionality that is supported by the 82598-based hardware. This code has + no default effect on the regular driver operation, and configuring DCB and + FCoE is outside the scope of this driver README. Refer to + http://www.open-fcoe.org/ for FCoE project information and contact + e1000-eedc@lists.sourceforge.net for DCB information. + + MAC and VLAN anti-spoofing feature + ---------------------------------- + When a malicious driver attempts to send a spoofed packet, it is dropped by + the hardware and not transmitted. An interrupt is sent to the PF driver + notifying it of the spoof attempt. + + When a spoofed packet is detected the PF driver will send the following + message to the system log (displayed by the "dmesg" command): + + Spoof event(s) detected on VF (n) + + Where n=the VF that attempted to do the spoofing. + + +Performance Tuning +================== + +An excellent article on performance tuning can be found at: + +http://www.redhat.com/promo/summit/2008/downloads/pdf/Thursday/Mark_Wagner.pdf + + +Known Issues +============ + + Enabling SR-IOV in a 32-bit or 64-bit Microsoft* Windows* Server 2008/R2 + Guest OS using Intel (R) 82576-based GbE or Intel (R) 82599-based 10GbE + controller under KVM + ------------------------------------------------------------------------ + KVM Hypervisor/VMM supports direct assignment of a PCIe device to a VM. This + includes traditional PCIe devices, as well as SR-IOV-capable devices using + Intel 82576-based and 82599-based controllers. + + While direct assignment of a PCIe device or an SR-IOV Virtual Function (VF) + to a Linux-based VM running 2.6.32 or later kernel works fine, there is a + known issue with Microsoft Windows Server 2008 VM that results in a "yellow + bang" error. This problem is within the KVM VMM itself, not the Intel driver, + or the SR-IOV logic of the VMM, but rather that KVM emulates an older CPU + model for the guests, and this older CPU model does not support MSI-X + interrupts, which is a requirement for Intel SR-IOV. - NAPI (Rx polling mode) is supported in the ixgbe driver. NAPI is enabled - by default in the driver. + If you wish to use the Intel 82576 or 82599-based controllers in SR-IOV mode + with KVM and a Microsoft Windows Server 2008 guest try the following + workaround. The workaround is to tell KVM to emulate a different model of CPU + when using qemu to create the KVM guest: - See www.cyberus.ca/~hadi/usenix-paper.tgz for more information on NAPI. + "-cpu qemu64,model=13" Support diff --git a/Documentation/networking/ixgbevf.txt b/Documentation/networking/ixgbevf.txt index 21dd5d15b6b..53d8d2a5a6a 100644 --- a/Documentation/networking/ixgbevf.txt +++ b/Documentation/networking/ixgbevf.txt @@ -1,8 +1,8 @@ -Linux* Base Driver for Intel(R) Network Connection -================================================== +Linux* Base Driver for Intel(R) Ethernet Network Connection +=========================================================== Intel Gigabit Linux driver. -Copyright(c) 1999 - 2010 Intel Corporation. +Copyright(c) 1999 - 2013 Intel Corporation. Contents ======== @@ -35,10 +35,6 @@ Driver ID Guide at: Known Issues/Troubleshooting ============================ - Unloading Physical Function (PF) Driver Causes System Reboots When VM is - Running and VF is Loaded on the VM - ------------------------------------------------------------------------ - Do not unload the PF driver (ixgbe) while VFs are assigned to guests. Support ======= diff --git a/Documentation/networking/l2tp.txt b/Documentation/networking/l2tp.txt index e7bf3979fac..c74434de2fa 100644 --- a/Documentation/networking/l2tp.txt +++ b/Documentation/networking/l2tp.txt @@ -111,7 +111,7 @@ When creating PPPoL2TP sockets, the application provides information to the driver about the socket in a socket connect() call. Source and destination tunnel and session ids are provided, as well as the file descriptor of a UDP socket. See struct pppol2tp_addr in -include/linux/if_ppp.h. Note that zero tunnel / session ids are +include/linux/if_pppol2tp.h. Note that zero tunnel / session ids are treated specially. When creating the per-tunnel PPPoL2TP management socket in Step 2 above, zero source and destination session ids are specified, which tells the driver to prepare the supplied UDP file @@ -197,7 +197,7 @@ state information because the file format is subject to change. It is implemented to provide extra debug information to help diagnose problems.) Users should use the netlink API. -/proc/net/pppol2tp is also provided for backwards compaibility with +/proc/net/pppol2tp is also provided for backwards compatibility with the original pppol2tp driver. It lists information about L2TPv2 tunnels and sessions only. Its use is discouraged. diff --git a/Documentation/networking/ltpc.txt b/Documentation/networking/ltpc.txt index fe2a9129d95..0bf3220c715 100644 --- a/Documentation/networking/ltpc.txt +++ b/Documentation/networking/ltpc.txt @@ -25,7 +25,7 @@ the driver will try to determine them itself. If you load the driver as a module, you can pass the parameters "io=", "irq=", and "dma=" on the command line with insmod or modprobe, or add -them as options in /etc/modprobe.conf: +them as options in a configuration file in /etc/modprobe.d/ directory: alias lt0 ltpc # autoload the module when the interface is configured options ltpc io=0x240 irq=9 dma=1 diff --git a/Documentation/networking/mac80211-auth-assoc-deauth.txt b/Documentation/networking/mac80211-auth-assoc-deauth.txt new file mode 100644 index 00000000000..d7a15fe91bf --- /dev/null +++ b/Documentation/networking/mac80211-auth-assoc-deauth.txt @@ -0,0 +1,95 @@ +# +# This outlines the Linux authentication/association and +# deauthentication/disassociation flows. +# +# This can be converted into a diagram using the service +# at http://www.websequencediagrams.com/ +# + +participant userspace +participant mac80211 +participant driver + +alt authentication needed (not FT) +userspace->mac80211: authenticate + +alt authenticated/authenticating already +mac80211->driver: sta_state(AP, not-exists) +mac80211->driver: bss_info_changed(clear BSSID) +else associated +note over mac80211,driver +like deauth/disassoc, without sending the +BA session stop & deauth/disassoc frames +end note +end + +mac80211->driver: config(channel, channel type) +mac80211->driver: bss_info_changed(set BSSID, basic rate bitmap) +mac80211->driver: sta_state(AP, exists) + +alt no probe request data known +mac80211->driver: TX directed probe request +driver->mac80211: RX probe response +end + +mac80211->driver: TX auth frame +driver->mac80211: RX auth frame + +alt WEP shared key auth +mac80211->driver: TX auth frame +driver->mac80211: RX auth frame +end + +mac80211->driver: sta_state(AP, authenticated) +mac80211->userspace: RX auth frame + +end + +userspace->mac80211: associate +alt authenticated or associated +note over mac80211,driver: cleanup like for authenticate +end + +alt not previously authenticated (FT) +mac80211->driver: config(channel, channel type) +mac80211->driver: bss_info_changed(set BSSID, basic rate bitmap) +mac80211->driver: sta_state(AP, exists) +mac80211->driver: sta_state(AP, authenticated) +end +mac80211->driver: TX assoc +driver->mac80211: RX assoc response +note over mac80211: init rate control +mac80211->driver: sta_state(AP, associated) + +alt not using WPA +mac80211->driver: sta_state(AP, authorized) +end + +mac80211->driver: set up QoS parameters + +mac80211->driver: bss_info_changed(QoS, HT, associated with AID) +mac80211->userspace: associated + +note left of userspace: associated now + +alt using WPA +note over userspace +do 4-way-handshake +(data frames) +end note +userspace->mac80211: authorized +mac80211->driver: sta_state(AP, authorized) +end + +userspace->mac80211: deauthenticate/disassociate +mac80211->driver: stop BA sessions +mac80211->driver: TX deauth/disassoc +mac80211->driver: flush frames +mac80211->driver: sta_state(AP,associated) +mac80211->driver: sta_state(AP,authenticated) +mac80211->driver: sta_state(AP,exists) +mac80211->driver: sta_state(AP,not-exists) +mac80211->driver: turn off powersave +mac80211->driver: bss_info_changed(clear BSSID, not associated, no QoS, ...) +mac80211->driver: config(channel type to non-HT) +mac80211->userspace: disconnected diff --git a/Documentation/networking/mac80211-injection.txt b/Documentation/networking/mac80211-injection.txt index b30e81ad530..3a930072b16 100644 --- a/Documentation/networking/mac80211-injection.txt +++ b/Documentation/networking/mac80211-injection.txt @@ -23,6 +23,10 @@ radiotap headers and used to control injection: IEEE80211_RADIOTAP_F_FRAG: frame will be fragmented if longer than the current fragmentation threshold. + * IEEE80211_RADIOTAP_TX_FLAGS + + IEEE80211_RADIOTAP_F_TX_NOACK: frame should be sent without waiting for + an ACK even if it is a unicast frame The injection code can also skip all other currently defined radiotap fields facilitating replay of captured radiotap headers directly. diff --git a/Documentation/networking/multicast.txt b/Documentation/networking/multicast.txt deleted file mode 100644 index b06c8c69266..00000000000 --- a/Documentation/networking/multicast.txt +++ /dev/null @@ -1,63 +0,0 @@ -Behaviour of Cards Under Multicast -================================== - -This is how they currently behave, not what the hardware can do--for example, -the Lance driver doesn't use its filter, even though the code for loading -it is in the DEC Lance-based driver. - -The following are requirements for multicasting ------------------------------------------------ -AppleTalk Multicast hardware filtering not important but - avoid cards only doing promisc -IP-Multicast Multicast hardware filters really help -IP-MRoute AllMulti hardware filters are of no help - - -Board Multicast AllMulti Promisc Filter ------------------------------------------------------------------------- -3c501 YES YES YES Software -3c503 YES YES YES Hardware -3c505 YES NO YES Hardware -3c507 NO NO NO N/A -3c509 YES YES YES Software -3c59x YES YES YES Software -ac3200 YES YES YES Hardware -apricot YES PROMISC YES Hardware -arcnet NO NO NO N/A -at1700 PROMISC PROMISC YES Software -atp PROMISC PROMISC YES Software -cs89x0 YES YES YES Software -de4x5 YES YES YES Hardware -de600 NO NO NO N/A -de620 PROMISC PROMISC YES Software -depca YES PROMISC YES Hardware -dmfe YES YES YES Software(*) -e2100 YES YES YES Hardware -eepro YES PROMISC YES Hardware -eexpress NO NO NO N/A -ewrk3 YES PROMISC YES Hardware -hp-plus YES YES YES Hardware -hp YES YES YES Hardware -hp100 YES YES YES Hardware -ibmtr NO NO NO N/A -ioc3-eth YES YES YES Hardware -lance YES YES YES Software(#) -ne YES YES YES Hardware -ni52 <------------------ Buggy ------------------> -ni65 YES YES YES Software(#) -seeq NO NO NO N/A -sgiseek <------------------ Buggy ------------------> -smc-ultra YES YES YES Hardware -sunlance YES YES YES Hardware -tulip YES YES YES Hardware -wavelan YES PROMISC YES Hardware -wd YES YES YES Hardware -xirc2ps_cs YES YES YES Hardware -znet YES YES YES Software - - -PROMISC = This multicast mode is in fact promiscuous mode. Avoid using -cards who go PROMISC on any multicast in a multicast kernel. - -(#) = Hardware multicast support is not used yet. -(*) = Hardware support for Davicom 9132 chipset only. diff --git a/Documentation/networking/netconsole.txt b/Documentation/networking/netconsole.txt index 8d022073e3e..a5d574a9ae0 100644 --- a/Documentation/networking/netconsole.txt +++ b/Documentation/networking/netconsole.txt @@ -1,9 +1,10 @@ started by Ingo Molnar <mingo@redhat.com>, 2001.09.17 2.6 port and netpoll api by Matt Mackall <mpm@selenic.com>, Sep 9 2003 +IPv6 support by Cong Wang <xiyou.wangcong@gmail.com>, Jan 1 2013 Please send bug reports to Matt Mackall <mpm@selenic.com> -and Satyam Sharma <satyam.sharma@gmail.com> +Satyam Sharma <satyam.sharma@gmail.com>, and Cong Wang <xiyou.wangcong@gmail.com> Introduction: ============= @@ -41,6 +42,10 @@ Examples: insmod netconsole netconsole=@/,@10.0.0.2/ + or using IPv6 + + insmod netconsole netconsole=@/,@fd00:1:2:3::1/ + It also supports logging to multiple remote agents by specifying parameters for the multiple agents separated by semicolons and the complete string enclosed in "quotes", thusly: @@ -51,8 +56,23 @@ Built-in netconsole starts immediately after the TCP stack is initialized and attempts to bring up the supplied dev at the supplied address. -The remote host can run either 'netcat -u -l -p <port>', -'nc -l -u <port>' or syslogd. +The remote host has several options to receive the kernel messages, +for example: + +1) syslogd + +2) netcat + + On distributions using a BSD-based netcat version (e.g. Fedora, + openSUSE and Ubuntu) the listening port must be specified without + the -p switch: + + 'nc -u -l -p <port>' / 'nc -u -l <port>' or + 'netcat -u -l -p <port>' / 'netcat -u -l <port>' + +3) socat + + 'socat udp-recv:<port> -' Dynamic reconfiguration: ======================== diff --git a/Documentation/networking/netdev-FAQ.txt b/Documentation/networking/netdev-FAQ.txt new file mode 100644 index 00000000000..0fe1c6e0dbc --- /dev/null +++ b/Documentation/networking/netdev-FAQ.txt @@ -0,0 +1,224 @@ + +Information you need to know about netdev +----------------------------------------- + +Q: What is netdev? + +A: It is a mailing list for all network-related Linux stuff. This includes + anything found under net/ (i.e. core code like IPv6) and drivers/net + (i.e. hardware specific drivers) in the Linux source tree. + + Note that some subsystems (e.g. wireless drivers) which have a high volume + of traffic have their own specific mailing lists. + + The netdev list is managed (like many other Linux mailing lists) through + VGER ( http://vger.kernel.org/ ) and archives can be found below: + + http://marc.info/?l=linux-netdev + http://www.spinics.net/lists/netdev/ + + Aside from subsystems like that mentioned above, all network-related Linux + development (i.e. RFC, review, comments, etc.) takes place on netdev. + +Q: How do the changes posted to netdev make their way into Linux? + +A: There are always two trees (git repositories) in play. Both are driven + by David Miller, the main network maintainer. There is the "net" tree, + and the "net-next" tree. As you can probably guess from the names, the + net tree is for fixes to existing code already in the mainline tree from + Linus, and net-next is where the new code goes for the future release. + You can find the trees here: + + http://git.kernel.org/?p=linux/kernel/git/davem/net.git + http://git.kernel.org/?p=linux/kernel/git/davem/net-next.git + +Q: How often do changes from these trees make it to the mainline Linus tree? + +A: To understand this, you need to know a bit of background information + on the cadence of Linux development. Each new release starts off with + a two week "merge window" where the main maintainers feed their new + stuff to Linus for merging into the mainline tree. After the two weeks, + the merge window is closed, and it is called/tagged "-rc1". No new + features get mainlined after this -- only fixes to the rc1 content + are expected. After roughly a week of collecting fixes to the rc1 + content, rc2 is released. This repeats on a roughly weekly basis + until rc7 (typically; sometimes rc6 if things are quiet, or rc8 if + things are in a state of churn), and a week after the last vX.Y-rcN + was done, the official "vX.Y" is released. + + Relating that to netdev: At the beginning of the 2-week merge window, + the net-next tree will be closed - no new changes/features. The + accumulated new content of the past ~10 weeks will be passed onto + mainline/Linus via a pull request for vX.Y -- at the same time, + the "net" tree will start accumulating fixes for this pulled content + relating to vX.Y + + An announcement indicating when net-next has been closed is usually + sent to netdev, but knowing the above, you can predict that in advance. + + IMPORTANT: Do not send new net-next content to netdev during the + period during which net-next tree is closed. + + Shortly after the two weeks have passed (and vX.Y-rc1 is released), the + tree for net-next reopens to collect content for the next (vX.Y+1) release. + + If you aren't subscribed to netdev and/or are simply unsure if net-next + has re-opened yet, simply check the net-next git repository link above for + any new networking-related commits. + + The "net" tree continues to collect fixes for the vX.Y content, and + is fed back to Linus at regular (~weekly) intervals. Meaning that the + focus for "net" is on stabilization and bugfixes. + + Finally, the vX.Y gets released, and the whole cycle starts over. + +Q: So where are we now in this cycle? + +A: Load the mainline (Linus) page here: + + http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git + + and note the top of the "tags" section. If it is rc1, it is early + in the dev cycle. If it was tagged rc7 a week ago, then a release + is probably imminent. + +Q: How do I indicate which tree (net vs. net-next) my patch should be in? + +A: Firstly, think whether you have a bug fix or new "next-like" content. + Then once decided, assuming that you use git, use the prefix flag, i.e. + + git format-patch --subject-prefix='PATCH net-next' start..finish + + Use "net" instead of "net-next" (always lower case) in the above for + bug-fix net content. If you don't use git, then note the only magic in + the above is just the subject text of the outgoing e-mail, and you can + manually change it yourself with whatever MUA you are comfortable with. + +Q: I sent a patch and I'm wondering what happened to it. How can I tell + whether it got merged? + +A: Start by looking at the main patchworks queue for netdev: + + http://patchwork.ozlabs.org/project/netdev/list/ + + The "State" field will tell you exactly where things are at with + your patch. + +Q: The above only says "Under Review". How can I find out more? + +A: Generally speaking, the patches get triaged quickly (in less than 48h). + So be patient. Asking the maintainer for status updates on your + patch is a good way to ensure your patch is ignored or pushed to + the bottom of the priority list. + +Q: How can I tell what patches are queued up for backporting to the + various stable releases? + +A: Normally Greg Kroah-Hartman collects stable commits himself, but + for networking, Dave collects up patches he deems critical for the + networking subsystem, and then hands them off to Greg. + + There is a patchworks queue that you can see here: + http://patchwork.ozlabs.org/bundle/davem/stable/?state=* + + It contains the patches which Dave has selected, but not yet handed + off to Greg. If Greg already has the patch, then it will be here: + http://git.kernel.org/cgit/linux/kernel/git/stable/stable-queue.git + + A quick way to find whether the patch is in this stable-queue is + to simply clone the repo, and then git grep the mainline commit ID, e.g. + + stable-queue$ git grep -l 284041ef21fdf2e + releases/3.0.84/ipv6-fix-possible-crashes-in-ip6_cork_release.patch + releases/3.4.51/ipv6-fix-possible-crashes-in-ip6_cork_release.patch + releases/3.9.8/ipv6-fix-possible-crashes-in-ip6_cork_release.patch + stable/stable-queue$ + +Q: I see a network patch and I think it should be backported to stable. + Should I request it via "stable@vger.kernel.org" like the references in + the kernel's Documentation/stable_kernel_rules.txt file say? + +A: No, not for networking. Check the stable queues as per above 1st to see + if it is already queued. If not, then send a mail to netdev, listing + the upstream commit ID and why you think it should be a stable candidate. + + Before you jump to go do the above, do note that the normal stable rules + in Documentation/stable_kernel_rules.txt still apply. So you need to + explicitly indicate why it is a critical fix and exactly what users are + impacted. In addition, you need to convince yourself that you _really_ + think it has been overlooked, vs. having been considered and rejected. + + Generally speaking, the longer it has had a chance to "soak" in mainline, + the better the odds that it is an OK candidate for stable. So scrambling + to request a commit be added the day after it appears should be avoided. + +Q: I have created a network patch and I think it should be backported to + stable. Should I add a "Cc: stable@vger.kernel.org" like the references + in the kernel's Documentation/ directory say? + +A: No. See above answer. In short, if you think it really belongs in + stable, then ensure you write a decent commit log that describes who + gets impacted by the bugfix and how it manifests itself, and when the + bug was introduced. If you do that properly, then the commit will + get handled appropriately and most likely get put in the patchworks + stable queue if it really warrants it. + + If you think there is some valid information relating to it being in + stable that does _not_ belong in the commit log, then use the three + dash marker line as described in Documentation/SubmittingPatches to + temporarily embed that information into the patch that you send. + +Q: Someone said that the comment style and coding convention is different + for the networking content. Is this true? + +A: Yes, in a largely trivial way. Instead of this: + + /* + * foobar blah blah blah + * another line of text + */ + + it is requested that you make it look like this: + + /* foobar blah blah blah + * another line of text + */ + +Q: I am working in existing code that has the former comment style and not the + latter. Should I submit new code in the former style or the latter? + +A: Make it the latter style, so that eventually all code in the domain of + netdev is of this format. + +Q: I found a bug that might have possible security implications or similar. + Should I mail the main netdev maintainer off-list? + +A: No. The current netdev maintainer has consistently requested that people + use the mailing lists and not reach out directly. If you aren't OK with + that, then perhaps consider mailing "security@kernel.org" or reading about + http://oss-security.openwall.org/wiki/mailing-lists/distros + as possible alternative mechanisms. + +Q: What level of testing is expected before I submit my change? + +A: If your changes are against net-next, the expectation is that you + have tested by layering your changes on top of net-next. Ideally you + will have done run-time testing specific to your change, but at a + minimum, your changes should survive an "allyesconfig" and an + "allmodconfig" build without new warnings or failures. + +Q: Any other tips to help ensure my net/net-next patch gets OK'd? + +A: Attention to detail. Re-read your own work as if you were the + reviewer. You can start with using checkpatch.pl, perhaps even + with the "--strict" flag. But do not be mindlessly robotic in + doing so. If your change is a bug-fix, make sure your commit log + indicates the end-user visible symptom, the underlying reason as + to why it happens, and then if necessary, explain why the fix proposed + is the best way to get things done. Don't mangle whitespace, and as + is common, don't mis-indent function arguments that span multiple lines. + If it is your first patch, mail it to yourself so you can test apply + it to an unpatched tree to confirm infrastructure didn't mangle it. + + Finally, go back and read Documentation/SubmittingPatches to be + sure you are not repeating some common mistake documented there. diff --git a/Documentation/networking/netdev-features.txt b/Documentation/networking/netdev-features.txt new file mode 100644 index 00000000000..f310edec8a7 --- /dev/null +++ b/Documentation/networking/netdev-features.txt @@ -0,0 +1,167 @@ +Netdev features mess and how to get out from it alive +===================================================== + +Author: + Michał Mirosław <mirq-linux@rere.qmqm.pl> + + + + Part I: Feature sets +====================== + +Long gone are the days when a network card would just take and give packets +verbatim. Today's devices add multiple features and bugs (read: offloads) +that relieve an OS of various tasks like generating and checking checksums, +splitting packets, classifying them. Those capabilities and their state +are commonly referred to as netdev features in Linux kernel world. + +There are currently three sets of features relevant to the driver, and +one used internally by network core: + + 1. netdev->hw_features set contains features whose state may possibly + be changed (enabled or disabled) for a particular device by user's + request. This set should be initialized in ndo_init callback and not + changed later. + + 2. netdev->features set contains features which are currently enabled + for a device. This should be changed only by network core or in + error paths of ndo_set_features callback. + + 3. netdev->vlan_features set contains features whose state is inherited + by child VLAN devices (limits netdev->features set). This is currently + used for all VLAN devices whether tags are stripped or inserted in + hardware or software. + + 4. netdev->wanted_features set contains feature set requested by user. + This set is filtered by ndo_fix_features callback whenever it or + some device-specific conditions change. This set is internal to + networking core and should not be referenced in drivers. + + + + Part II: Controlling enabled features +======================================= + +When current feature set (netdev->features) is to be changed, new set +is calculated and filtered by calling ndo_fix_features callback +and netdev_fix_features(). If the resulting set differs from current +set, it is passed to ndo_set_features callback and (if the callback +returns success) replaces value stored in netdev->features. +NETDEV_FEAT_CHANGE notification is issued after that whenever current +set might have changed. + +The following events trigger recalculation: + 1. device's registration, after ndo_init returned success + 2. user requested changes in features state + 3. netdev_update_features() is called + +ndo_*_features callbacks are called with rtnl_lock held. Missing callbacks +are treated as always returning success. + +A driver that wants to trigger recalculation must do so by calling +netdev_update_features() while holding rtnl_lock. This should not be done +from ndo_*_features callbacks. netdev->features should not be modified by +driver except by means of ndo_fix_features callback. + + + + Part III: Implementation hints +================================ + + * ndo_fix_features: + +All dependencies between features should be resolved here. The resulting +set can be reduced further by networking core imposed limitations (as coded +in netdev_fix_features()). For this reason it is safer to disable a feature +when its dependencies are not met instead of forcing the dependency on. + +This callback should not modify hardware nor driver state (should be +stateless). It can be called multiple times between successive +ndo_set_features calls. + +Callback must not alter features contained in NETIF_F_SOFT_FEATURES or +NETIF_F_NEVER_CHANGE sets. The exception is NETIF_F_VLAN_CHALLENGED but +care must be taken as the change won't affect already configured VLANs. + + * ndo_set_features: + +Hardware should be reconfigured to match passed feature set. The set +should not be altered unless some error condition happens that can't +be reliably detected in ndo_fix_features. In this case, the callback +should update netdev->features to match resulting hardware state. +Errors returned are not (and cannot be) propagated anywhere except dmesg. +(Note: successful return is zero, >0 means silent error.) + + + + Part IV: Features +=================== + +For current list of features, see include/linux/netdev_features.h. +This section describes semantics of some of them. + + * Transmit checksumming + +For complete description, see comments near the top of include/linux/skbuff.h. + +Note: NETIF_F_HW_CSUM is a superset of NETIF_F_IP_CSUM + NETIF_F_IPV6_CSUM. +It means that device can fill TCP/UDP-like checksum anywhere in the packets +whatever headers there might be. + + * Transmit TCP segmentation offload + +NETIF_F_TSO_ECN means that hardware can properly split packets with CWR bit +set, be it TCPv4 (when NETIF_F_TSO is enabled) or TCPv6 (NETIF_F_TSO6). + + * Transmit DMA from high memory + +On platforms where this is relevant, NETIF_F_HIGHDMA signals that +ndo_start_xmit can handle skbs with frags in high memory. + + * Transmit scatter-gather + +Those features say that ndo_start_xmit can handle fragmented skbs: +NETIF_F_SG --- paged skbs (skb_shinfo()->frags), NETIF_F_FRAGLIST --- +chained skbs (skb->next/prev list). + + * Software features + +Features contained in NETIF_F_SOFT_FEATURES are features of networking +stack. Driver should not change behaviour based on them. + + * LLTX driver (deprecated for hardware drivers) + +NETIF_F_LLTX should be set in drivers that implement their own locking in +transmit path or don't need locking at all (e.g. software tunnels). +In ndo_start_xmit, it is recommended to use a try_lock and return +NETDEV_TX_LOCKED when the spin lock fails. The locking should also properly +protect against other callbacks (the rules you need to find out). + +Don't use it for new drivers. + + * netns-local device + +NETIF_F_NETNS_LOCAL is set for devices that are not allowed to move between +network namespaces (e.g. loopback). + +Don't use it in drivers. + + * VLAN challenged + +NETIF_F_VLAN_CHALLENGED should be set for devices which can't cope with VLAN +headers. Some drivers set this because the cards can't handle the bigger MTU. +[FIXME: Those cases could be fixed in VLAN code by allowing only reduced-MTU +VLANs. This may be not useful, though.] + +* rx-fcs + +This requests that the NIC append the Ethernet Frame Checksum (FCS) +to the end of the skb data. This allows sniffers and other tools to +read the CRC recorded by the NIC on receipt of the packet. + +* rx-all + +This requests that the NIC receive all possible frames, including errored +frames (such as bad FCS, etc). This can be helpful when sniffing a link with +bad packets on it. Some NICs may receive more packets if also put into normal +PROMISC mode. diff --git a/Documentation/networking/netdevices.txt b/Documentation/networking/netdevices.txt index 87b3d15f523..0b1cf6b2a59 100644 --- a/Documentation/networking/netdevices.txt +++ b/Documentation/networking/netdevices.txt @@ -10,12 +10,12 @@ network devices. struct net_device allocation rules ================================== Network device structures need to persist even after module is unloaded and -must be allocated with kmalloc. If device has registered successfully, -it will be freed on last use by free_netdev. This is required to handle the -pathologic case cleanly (example: rmmod mydriver </sys/class/net/myeth/mtu ) +must be allocated with alloc_netdev_mqs() and friends. +If device has registered successfully, it will be freed on last use +by free_netdev(). This is required to handle the pathologic case cleanly +(example: rmmod mydriver </sys/class/net/myeth/mtu ) -There are routines in net_init.c to handle the common cases of -alloc_etherdev, alloc_netdev. These reserve extra space for driver +alloc_netdev_mqs()/alloc_netdev() reserve extra space for driver private data which gets freed when the network device is freed. If separately allocated data is attached to the network device (netdev_priv(dev)) then it is up to the module exit handler to free that. @@ -47,33 +47,32 @@ packets is preferred. struct net_device synchronization rules ======================================= -dev->open: +ndo_open: Synchronization: rtnl_lock() semaphore. Context: process -dev->stop: +ndo_stop: Synchronization: rtnl_lock() semaphore. Context: process - Note1: netif_running() is guaranteed false - Note2: dev->poll() is guaranteed to be stopped + Note: netif_running() is guaranteed false -dev->do_ioctl: +ndo_do_ioctl: Synchronization: rtnl_lock() semaphore. Context: process -dev->get_stats: +ndo_get_stats: Synchronization: dev_base_lock rwlock. Context: nominally process, but don't sleep inside an rwlock -dev->hard_start_xmit: - Synchronization: netif_tx_lock spinlock. +ndo_start_xmit: + Synchronization: __netif_tx_lock spinlock. When the driver sets NETIF_F_LLTX in dev->features this will be called without holding netif_tx_lock. In this case the driver has to lock by itself when needed. It is recommended to use a try lock for this and return NETDEV_TX_LOCKED when the spin lock fails. The locking there should also properly protect against - set_multicast_list. Note that the use of NETIF_F_LLTX is deprecated. + set_rx_mode. Note that the use of NETIF_F_LLTX is deprecated. Don't use it for new drivers. Context: Process with BHs disabled or BH (timer), @@ -87,20 +86,20 @@ dev->hard_start_xmit: o NETDEV_TX_LOCKED Locking failed, please retry quickly. Only valid when NETIF_F_LLTX is set. -dev->tx_timeout: - Synchronization: netif_tx_lock spinlock. +ndo_tx_timeout: + Synchronization: netif_tx_lock spinlock; all TX queues frozen. Context: BHs disabled Notes: netif_queue_stopped() is guaranteed true -dev->set_multicast_list: - Synchronization: netif_tx_lock spinlock. +ndo_set_rx_mode: + Synchronization: netif_addr_lock spinlock. Context: BHs disabled struct napi_struct synchronization rules ======================================== napi->poll: Synchronization: NAPI_STATE_SCHED bit in napi->state. Device - driver's dev->close method will invoke napi_disable() on + driver's ndo_stop method will invoke napi_disable() on all NAPI instances which will do a sleeping poll on the NAPI_STATE_SCHED napi->state bit, waiting for all pending NAPI activity to cease. diff --git a/Documentation/networking/netlink_mmap.txt b/Documentation/networking/netlink_mmap.txt new file mode 100644 index 00000000000..c6af4bac5aa --- /dev/null +++ b/Documentation/networking/netlink_mmap.txt @@ -0,0 +1,339 @@ +This file documents how to use memory mapped I/O with netlink. + +Author: Patrick McHardy <kaber@trash.net> + +Overview +-------- + +Memory mapped netlink I/O can be used to increase throughput and decrease +overhead of unicast receive and transmit operations. Some netlink subsystems +require high throughput, these are mainly the netfilter subsystems +nfnetlink_queue and nfnetlink_log, but it can also help speed up large +dump operations of f.i. the routing database. + +Memory mapped netlink I/O used two circular ring buffers for RX and TX which +are mapped into the processes address space. + +The RX ring is used by the kernel to directly construct netlink messages into +user-space memory without copying them as done with regular socket I/O, +additionally as long as the ring contains messages no recvmsg() or poll() +syscalls have to be issued by user-space to get more message. + +The TX ring is used to process messages directly from user-space memory, the +kernel processes all messages contained in the ring using a single sendmsg() +call. + +Usage overview +-------------- + +In order to use memory mapped netlink I/O, user-space needs three main changes: + +- ring setup +- conversion of the RX path to get messages from the ring instead of recvmsg() +- conversion of the TX path to construct messages into the ring + +Ring setup is done using setsockopt() to provide the ring parameters to the +kernel, then a call to mmap() to map the ring into the processes address space: + +- setsockopt(fd, SOL_NETLINK, NETLINK_RX_RING, ¶ms, sizeof(params)); +- setsockopt(fd, SOL_NETLINK, NETLINK_TX_RING, ¶ms, sizeof(params)); +- ring = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0) + +Usage of either ring is optional, but even if only the RX ring is used the +mapping still needs to be writable in order to update the frame status after +processing. + +Conversion of the reception path involves calling poll() on the file +descriptor, once the socket is readable the frames from the ring are +processed in order until no more messages are available, as indicated by +a status word in the frame header. + +On kernel side, in order to make use of memory mapped I/O on receive, the +originating netlink subsystem needs to support memory mapped I/O, otherwise +it will use an allocated socket buffer as usual and the contents will be + copied to the ring on transmission, nullifying most of the performance gains. +Dumps of kernel databases automatically support memory mapped I/O. + +Conversion of the transmit path involves changing message construction to +use memory from the TX ring instead of (usually) a buffer declared on the +stack and setting up the frame header appropriately. Optionally poll() can +be used to wait for free frames in the TX ring. + +Structured and definitions for using memory mapped I/O are contained in +<linux/netlink.h>. + +RX and TX rings +---------------- + +Each ring contains a number of continuous memory blocks, containing frames of +fixed size dependent on the parameters used for ring setup. + +Ring: [ block 0 ] + [ frame 0 ] + [ frame 1 ] + [ block 1 ] + [ frame 2 ] + [ frame 3 ] + ... + [ block n ] + [ frame 2 * n ] + [ frame 2 * n + 1 ] + +The blocks are only visible to the kernel, from the point of view of user-space +the ring just contains the frames in a continuous memory zone. + +The ring parameters used for setting up the ring are defined as follows: + +struct nl_mmap_req { + unsigned int nm_block_size; + unsigned int nm_block_nr; + unsigned int nm_frame_size; + unsigned int nm_frame_nr; +}; + +Frames are grouped into blocks, where each block is a continuous region of memory +and holds nm_block_size / nm_frame_size frames. The total number of frames in +the ring is nm_frame_nr. The following invariants hold: + +- frames_per_block = nm_block_size / nm_frame_size + +- nm_frame_nr = frames_per_block * nm_block_nr + +Some parameters are constrained, specifically: + +- nm_block_size must be a multiple of the architectures memory page size. + The getpagesize() function can be used to get the page size. + +- nm_frame_size must be equal or larger to NL_MMAP_HDRLEN, IOW a frame must be + able to hold at least the frame header + +- nm_frame_size must be smaller or equal to nm_block_size + +- nm_frame_size must be a multiple of NL_MMAP_MSG_ALIGNMENT + +- nm_frame_nr must equal the actual number of frames as specified above. + +When the kernel can't allocate physically continuous memory for a ring block, +it will fall back to use physically discontinuous memory. This might affect +performance negatively, in order to avoid this the nm_frame_size parameter +should be chosen to be as small as possible for the required frame size and +the number of blocks should be increased instead. + +Ring frames +------------ + +Each frames contain a frame header, consisting of a synchronization word and some +meta-data, and the message itself. + +Frame: [ header message ] + +The frame header is defined as follows: + +struct nl_mmap_hdr { + unsigned int nm_status; + unsigned int nm_len; + __u32 nm_group; + /* credentials */ + __u32 nm_pid; + __u32 nm_uid; + __u32 nm_gid; +}; + +- nm_status is used for synchronizing processing between the kernel and user- + space and specifies ownership of the frame as well as the operation to perform + +- nm_len contains the length of the message contained in the data area + +- nm_group specified the destination multicast group of message + +- nm_pid, nm_uid and nm_gid contain the netlink pid, UID and GID of the sending + process. These values correspond to the data available using SOCK_PASSCRED in + the SCM_CREDENTIALS cmsg. + +The possible values in the status word are: + +- NL_MMAP_STATUS_UNUSED: + RX ring: frame belongs to the kernel and contains no message + for user-space. Approriate action is to invoke poll() + to wait for new messages. + + TX ring: frame belongs to user-space and can be used for + message construction. + +- NL_MMAP_STATUS_RESERVED: + RX ring only: frame is currently used by the kernel for message + construction and contains no valid message yet. + Appropriate action is to invoke poll() to wait for + new messages. + +- NL_MMAP_STATUS_VALID: + RX ring: frame contains a valid message. Approriate action is + to process the message and release the frame back to + the kernel by setting the status to + NL_MMAP_STATUS_UNUSED or queue the frame by setting the + status to NL_MMAP_STATUS_SKIP. + + TX ring: the frame contains a valid message from user-space to + be processed by the kernel. After completing processing + the kernel will release the frame back to user-space by + setting the status to NL_MMAP_STATUS_UNUSED. + +- NL_MMAP_STATUS_COPY: + RX ring only: a message is ready to be processed but could not be + stored in the ring, either because it exceeded the + frame size or because the originating subsystem does + not support memory mapped I/O. Appropriate action is + to invoke recvmsg() to receive the message and release + the frame back to the kernel by setting the status to + NL_MMAP_STATUS_UNUSED. + +- NL_MMAP_STATUS_SKIP: + RX ring only: user-space queued the message for later processing, but + processed some messages following it in the ring. The + kernel should skip this frame when looking for unused + frames. + +The data area of a frame begins at a offset of NL_MMAP_HDRLEN relative to the +frame header. + +TX limitations +-------------- + +Kernel processing usually involves validation of the message received by +user-space, then processing its contents. The kernel must assure that +userspace is not able to modify the message contents after they have been +validated. In order to do so, the message is copied from the ring frame +to an allocated buffer if either of these conditions is false: + +- only a single mapping of the ring exists +- the file descriptor is not shared between processes + +This means that for threaded programs, the kernel will fall back to copying. + +Example +------- + +Ring setup: + + unsigned int block_size = 16 * getpagesize(); + struct nl_mmap_req req = { + .nm_block_size = block_size, + .nm_block_nr = 64, + .nm_frame_size = 16384, + .nm_frame_nr = 64 * block_size / 16384, + }; + unsigned int ring_size; + void *rx_ring, *tx_ring; + + /* Configure ring parameters */ + if (setsockopt(fd, SOL_NETLINK, NETLINK_RX_RING, &req, sizeof(req)) < 0) + exit(1); + if (setsockopt(fd, SOL_NETLINK, NETLINK_TX_RING, &req, sizeof(req)) < 0) + exit(1) + + /* Calculate size of each individual ring */ + ring_size = req.nm_block_nr * req.nm_block_size; + + /* Map RX/TX rings. The TX ring is located after the RX ring */ + rx_ring = mmap(NULL, 2 * ring_size, PROT_READ | PROT_WRITE, + MAP_SHARED, fd, 0); + if ((long)rx_ring == -1L) + exit(1); + tx_ring = rx_ring + ring_size: + +Message reception: + +This example assumes some ring parameters of the ring setup are available. + + unsigned int frame_offset = 0; + struct nl_mmap_hdr *hdr; + struct nlmsghdr *nlh; + unsigned char buf[16384]; + ssize_t len; + + while (1) { + struct pollfd pfds[1]; + + pfds[0].fd = fd; + pfds[0].events = POLLIN | POLLERR; + pfds[0].revents = 0; + + if (poll(pfds, 1, -1) < 0 && errno != -EINTR) + exit(1); + + /* Check for errors. Error handling omitted */ + if (pfds[0].revents & POLLERR) + <handle error> + + /* If no new messages, poll again */ + if (!(pfds[0].revents & POLLIN)) + continue; + + /* Process all frames */ + while (1) { + /* Get next frame header */ + hdr = rx_ring + frame_offset; + + if (hdr->nm_status == NL_MMAP_STATUS_VALID) { + /* Regular memory mapped frame */ + nlh = (void *)hdr + NL_MMAP_HDRLEN; + len = hdr->nm_len; + + /* Release empty message immediately. May happen + * on error during message construction. + */ + if (len == 0) + goto release; + } else if (hdr->nm_status == NL_MMAP_STATUS_COPY) { + /* Frame queued to socket receive queue */ + len = recv(fd, buf, sizeof(buf), MSG_DONTWAIT); + if (len <= 0) + break; + nlh = buf; + } else + /* No more messages to process, continue polling */ + break; + + process_msg(nlh); +release: + /* Release frame back to the kernel */ + hdr->nm_status = NL_MMAP_STATUS_UNUSED; + + /* Advance frame offset to next frame */ + frame_offset = (frame_offset + frame_size) % ring_size; + } + } + +Message transmission: + +This example assumes some ring parameters of the ring setup are available. +A single message is constructed and transmitted, to send multiple messages +at once they would be constructed in consecutive frames before a final call +to sendto(). + + unsigned int frame_offset = 0; + struct nl_mmap_hdr *hdr; + struct nlmsghdr *nlh; + struct sockaddr_nl addr = { + .nl_family = AF_NETLINK, + }; + + hdr = tx_ring + frame_offset; + if (hdr->nm_status != NL_MMAP_STATUS_UNUSED) + /* No frame available. Use poll() to avoid. */ + exit(1); + + nlh = (void *)hdr + NL_MMAP_HDRLEN; + + /* Build message */ + build_message(nlh); + + /* Fill frame header: length and status need to be set */ + hdr->nm_len = nlh->nlmsg_len; + hdr->nm_status = NL_MMAP_STATUS_VALID; + + if (sendto(fd, NULL, 0, 0, &addr, sizeof(addr)) < 0) + exit(1); + + /* Advance frame offset to next frame */ + frame_offset = (frame_offset + frame_size) % ring_size; diff --git a/Documentation/networking/nf_conntrack-sysctl.txt b/Documentation/networking/nf_conntrack-sysctl.txt new file mode 100644 index 00000000000..70da5086153 --- /dev/null +++ b/Documentation/networking/nf_conntrack-sysctl.txt @@ -0,0 +1,176 @@ +/proc/sys/net/netfilter/nf_conntrack_* Variables: + +nf_conntrack_acct - BOOLEAN + 0 - disabled (default) + not 0 - enabled + + Enable connection tracking flow accounting. 64-bit byte and packet + counters per flow are added. + +nf_conntrack_buckets - INTEGER (read-only) + Size of hash table. If not specified as parameter during module + loading, the default size is calculated by dividing total memory + by 16384 to determine the number of buckets but the hash table will + never have fewer than 32 or more than 16384 buckets. + +nf_conntrack_checksum - BOOLEAN + 0 - disabled + not 0 - enabled (default) + + Verify checksum of incoming packets. Packets with bad checksums are + in INVALID state. If this is enabled, such packets will not be + considered for connection tracking. + +nf_conntrack_count - INTEGER (read-only) + Number of currently allocated flow entries. + +nf_conntrack_events - BOOLEAN + 0 - disabled + not 0 - enabled (default) + + If this option is enabled, the connection tracking code will + provide userspace with connection tracking events via ctnetlink. + +nf_conntrack_events_retry_timeout - INTEGER (seconds) + default 15 + + This option is only relevant when "reliable connection tracking + events" are used. Normally, ctnetlink is "lossy", that is, + events are normally dropped when userspace listeners can't keep up. + + Userspace can request "reliable event mode". When this mode is + active, the conntrack will only be destroyed after the event was + delivered. If event delivery fails, the kernel periodically + re-tries to send the event to userspace. + + This is the maximum interval the kernel should use when re-trying + to deliver the destroy event. + + A higher number means there will be fewer delivery retries and it + will take longer for a backlog to be processed. + +nf_conntrack_expect_max - INTEGER + Maximum size of expectation table. Default value is + nf_conntrack_buckets / 256. Minimum is 1. + +nf_conntrack_frag6_high_thresh - INTEGER + default 262144 + + Maximum memory used to reassemble IPv6 fragments. When + nf_conntrack_frag6_high_thresh bytes of memory is allocated for this + purpose, the fragment handler will toss packets until + nf_conntrack_frag6_low_thresh is reached. + +nf_conntrack_frag6_low_thresh - INTEGER + default 196608 + + See nf_conntrack_frag6_low_thresh + +nf_conntrack_frag6_timeout - INTEGER (seconds) + default 60 + + Time to keep an IPv6 fragment in memory. + +nf_conntrack_generic_timeout - INTEGER (seconds) + default 600 + + Default for generic timeout. This refers to layer 4 unknown/unsupported + protocols. + +nf_conntrack_helper - BOOLEAN + 0 - disabled + not 0 - enabled (default) + + Enable automatic conntrack helper assignment. + +nf_conntrack_icmp_timeout - INTEGER (seconds) + default 30 + + Default for ICMP timeout. + +nf_conntrack_icmpv6_timeout - INTEGER (seconds) + default 30 + + Default for ICMP6 timeout. + +nf_conntrack_log_invalid - INTEGER + 0 - disable (default) + 1 - log ICMP packets + 6 - log TCP packets + 17 - log UDP packets + 33 - log DCCP packets + 41 - log ICMPv6 packets + 136 - log UDPLITE packets + 255 - log packets of any protocol + + Log invalid packets of a type specified by value. + +nf_conntrack_max - INTEGER + Size of connection tracking table. Default value is + nf_conntrack_buckets value * 4. + +nf_conntrack_tcp_be_liberal - BOOLEAN + 0 - disabled (default) + not 0 - enabled + + Be conservative in what you do, be liberal in what you accept from others. + If it's non-zero, we mark only out of window RST segments as INVALID. + +nf_conntrack_tcp_loose - BOOLEAN + 0 - disabled + not 0 - enabled (default) + + If it is set to zero, we disable picking up already established + connections. + +nf_conntrack_tcp_max_retrans - INTEGER + default 3 + + Maximum number of packets that can be retransmitted without + received an (acceptable) ACK from the destination. If this number + is reached, a shorter timer will be started. + +nf_conntrack_tcp_timeout_close - INTEGER (seconds) + default 10 + +nf_conntrack_tcp_timeout_close_wait - INTEGER (seconds) + default 60 + +nf_conntrack_tcp_timeout_established - INTEGER (seconds) + default 432000 (5 days) + +nf_conntrack_tcp_timeout_fin_wait - INTEGER (seconds) + default 120 + +nf_conntrack_tcp_timeout_last_ack - INTEGER (seconds) + default 30 + +nf_conntrack_tcp_timeout_max_retrans - INTEGER (seconds) + default 300 + +nf_conntrack_tcp_timeout_syn_recv - INTEGER (seconds) + default 60 + +nf_conntrack_tcp_timeout_syn_sent - INTEGER (seconds) + default 120 + +nf_conntrack_tcp_timeout_time_wait - INTEGER (seconds) + default 120 + +nf_conntrack_tcp_timeout_unacknowledged - INTEGER (seconds) + default 300 + +nf_conntrack_timestamp - BOOLEAN + 0 - disabled (default) + not 0 - enabled + + Enable connection tracking flow timestamping. + +nf_conntrack_udp_timeout - INTEGER (seconds) + default 30 + +nf_conntrack_udp_timeout_stream2 - INTEGER (seconds) + default 180 + + This extended timeout will be used in case there is an UDP stream + detected. diff --git a/Documentation/networking/nfc.txt b/Documentation/networking/nfc.txt new file mode 100644 index 00000000000..b24c29bdae2 --- /dev/null +++ b/Documentation/networking/nfc.txt @@ -0,0 +1,128 @@ +Linux NFC subsystem +=================== + +The Near Field Communication (NFC) subsystem is required to standardize the +NFC device drivers development and to create an unified userspace interface. + +This document covers the architecture overview, the device driver interface +description and the userspace interface description. + +Architecture overview +--------------------- + +The NFC subsystem is responsible for: + - NFC adapters management; + - Polling for targets; + - Low-level data exchange; + +The subsystem is divided in some parts. The 'core' is responsible for +providing the device driver interface. On the other side, it is also +responsible for providing an interface to control operations and low-level +data exchange. + +The control operations are available to userspace via generic netlink. + +The low-level data exchange interface is provided by the new socket family +PF_NFC. The NFC_SOCKPROTO_RAW performs raw communication with NFC targets. + + + +--------------------------------------+ + | USER SPACE | + +--------------------------------------+ + ^ ^ + | low-level | control + | data exchange | operations + | | + | v + | +-----------+ + | AF_NFC | netlink | + | socket +-----------+ + | raw ^ + | | + v v + +---------+ +-----------+ + | rawsock | <--------> | core | + +---------+ +-----------+ + ^ + | + v + +-----------+ + | driver | + +-----------+ + +Device Driver Interface +----------------------- + +When registering on the NFC subsystem, the device driver must inform the core +of the set of supported NFC protocols and the set of ops callbacks. The ops +callbacks that must be implemented are the following: + +* start_poll - setup the device to poll for targets +* stop_poll - stop on progress polling operation +* activate_target - select and initialize one of the targets found +* deactivate_target - deselect and deinitialize the selected target +* data_exchange - send data and receive the response (transceive operation) + +Userspace interface +-------------------- + +The userspace interface is divided in control operations and low-level data +exchange operation. + +CONTROL OPERATIONS: + +Generic netlink is used to implement the interface to the control operations. +The operations are composed by commands and events, all listed below: + +* NFC_CMD_GET_DEVICE - get specific device info or dump the device list +* NFC_CMD_START_POLL - setup a specific device to polling for targets +* NFC_CMD_STOP_POLL - stop the polling operation in a specific device +* NFC_CMD_GET_TARGET - dump the list of targets found by a specific device + +* NFC_EVENT_DEVICE_ADDED - reports an NFC device addition +* NFC_EVENT_DEVICE_REMOVED - reports an NFC device removal +* NFC_EVENT_TARGETS_FOUND - reports START_POLL results when 1 or more targets +are found + +The user must call START_POLL to poll for NFC targets, passing the desired NFC +protocols through NFC_ATTR_PROTOCOLS attribute. The device remains in polling +state until it finds any target. However, the user can stop the polling +operation by calling STOP_POLL command. In this case, it will be checked if +the requester of STOP_POLL is the same of START_POLL. + +If the polling operation finds one or more targets, the event TARGETS_FOUND is +sent (including the device id). The user must call GET_TARGET to get the list of +all targets found by such device. Each reply message has target attributes with +relevant information such as the supported NFC protocols. + +All polling operations requested through one netlink socket are stopped when +it's closed. + +LOW-LEVEL DATA EXCHANGE: + +The userspace must use PF_NFC sockets to perform any data communication with +targets. All NFC sockets use AF_NFC: + +struct sockaddr_nfc { + sa_family_t sa_family; + __u32 dev_idx; + __u32 target_idx; + __u32 nfc_protocol; +}; + +To establish a connection with one target, the user must create an +NFC_SOCKPROTO_RAW socket and call the 'connect' syscall with the sockaddr_nfc +struct correctly filled. All information comes from NFC_EVENT_TARGETS_FOUND +netlink event. As a target can support more than one NFC protocol, the user +must inform which protocol it wants to use. + +Internally, 'connect' will result in an activate_target call to the driver. +When the socket is closed, the target is deactivated. + +The data format exchanged through the sockets is NFC protocol dependent. For +instance, when communicating with MIFARE tags, the data exchanged are MIFARE +commands and their responses. + +The first received package is the response to the first sent package and so +on. In order to allow valid "empty" responses, every data received has a NULL +header of 1 byte. diff --git a/Documentation/networking/olympic.txt b/Documentation/networking/olympic.txt deleted file mode 100644 index c65a94010ea..00000000000 --- a/Documentation/networking/olympic.txt +++ /dev/null @@ -1,79 +0,0 @@ - -IBM PCI Pit/Pit-Phy/Olympic CHIPSET BASED TOKEN RING CARDS README - -Release 0.2.0 - Release - June 8th 1999 Peter De Schrijver & Mike Phillips -Release 0.9.C - Release - April 18th 2001 Mike Phillips - -Thanks: -Erik De Cock, Adrian Bridgett and Frank Fiene for their -patience and testing. -Donald Champion for the cardbus support -Kyle Lucke for the dma api changes. -Jonathon Bitner for hardware support. -Everybody on linux-tr for their continued support. - -Options: - -The driver accepts four options: ringspeed, pkt_buf_sz, -message_level and network_monitor. - -These options can be specified differently for each card found. - -ringspeed: Has one of three settings 0 (default), 4 or 16. 0 will -make the card autosense the ringspeed and join at the appropriate speed, -this will be the default option for most people. 4 or 16 allow you to -explicitly force the card to operate at a certain speed. The card will fail -if you try to insert it at the wrong speed. (Although some hubs will allow -this so be *very* careful). The main purpose for explicitly setting the ring -speed is for when the card is first on the ring. In autosense mode, if the card -cannot detect any active monitors on the ring it will not open, so you must -re-init the card at the appropriate speed. Unfortunately at present the only -way of doing this is rmmod and insmod which is a bit tough if it is compiled -in the kernel. - -pkt_buf_sz: This is this initial receive buffer allocation size. This will -default to 4096 if no value is entered. You may increase performance of the -driver by setting this to a value larger than the network packet size, although -the driver now re-sizes buffers based on MTU settings as well. - -message_level: Controls level of messages created by the driver. Defaults to 0: -which only displays start-up and critical messages. Presently any non-zero -value will display all soft messages as well. NB This does not turn -debugging messages on, that must be done by modified the source code. - -network_monitor: Any non-zero value will provide a quasi network monitoring -mode. All unexpected MAC frames (beaconing etc.) will be received -by the driver and the source and destination addresses printed. -Also an entry will be added in /proc/net called olympic_tr%d, where tr%d -is the registered device name, i.e tr0, tr1, etc. This displays low -level information about the configuration of the ring and the adapter. -This feature has been designed for network administrators to assist in -the diagnosis of network / ring problems. (This used to OLYMPIC_NETWORK_MONITOR, -but has now changed to allow each adapter to be configured differently and -to alleviate the necessity to re-compile olympic to turn the option on). - -Multi-card: - -The driver will detect multiple cards and will work with shared interrupts, -each card is assigned the next token ring device, i.e. tr0 , tr1, tr2. The -driver should also happily reside in the system with other drivers. It has -been tested with ibmtr.c running, and I personally have had one Olicom PCI -card and two IBM olympic cards (all on the same interrupt), all running -together. - -Variable MTU size: - -The driver can handle a MTU size upto either 4500 or 18000 depending upon -ring speed. The driver also changes the size of the receive buffers as part -of the mtu re-sizing, so if you set mtu = 18000, you will need to be able -to allocate 16 * (sk_buff with 18000 buffer size) call it 18500 bytes per ring -position = 296,000 bytes of memory space, plus of course anything -necessary for the tx sk_buff's. Remember this is per card, so if you are -building routers, gateway's etc, you could start to use a lot of memory -real fast. - - -6/8/99 Peter De Schrijver and Mike Phillips - diff --git a/Documentation/networking/openvswitch.txt b/Documentation/networking/openvswitch.txt new file mode 100644 index 00000000000..37c20ee2455 --- /dev/null +++ b/Documentation/networking/openvswitch.txt @@ -0,0 +1,235 @@ +Open vSwitch datapath developer documentation +============================================= + +The Open vSwitch kernel module allows flexible userspace control over +flow-level packet processing on selected network devices. It can be +used to implement a plain Ethernet switch, network device bonding, +VLAN processing, network access control, flow-based network control, +and so on. + +The kernel module implements multiple "datapaths" (analogous to +bridges), each of which can have multiple "vports" (analogous to ports +within a bridge). Each datapath also has associated with it a "flow +table" that userspace populates with "flows" that map from keys based +on packet headers and metadata to sets of actions. The most common +action forwards the packet to another vport; other actions are also +implemented. + +When a packet arrives on a vport, the kernel module processes it by +extracting its flow key and looking it up in the flow table. If there +is a matching flow, it executes the associated actions. If there is +no match, it queues the packet to userspace for processing (as part of +its processing, userspace will likely set up a flow to handle further +packets of the same type entirely in-kernel). + + +Flow key compatibility +---------------------- + +Network protocols evolve over time. New protocols become important +and existing protocols lose their prominence. For the Open vSwitch +kernel module to remain relevant, it must be possible for newer +versions to parse additional protocols as part of the flow key. It +might even be desirable, someday, to drop support for parsing +protocols that have become obsolete. Therefore, the Netlink interface +to Open vSwitch is designed to allow carefully written userspace +applications to work with any version of the flow key, past or future. + +To support this forward and backward compatibility, whenever the +kernel module passes a packet to userspace, it also passes along the +flow key that it parsed from the packet. Userspace then extracts its +own notion of a flow key from the packet and compares it against the +kernel-provided version: + + - If userspace's notion of the flow key for the packet matches the + kernel's, then nothing special is necessary. + + - If the kernel's flow key includes more fields than the userspace + version of the flow key, for example if the kernel decoded IPv6 + headers but userspace stopped at the Ethernet type (because it + does not understand IPv6), then again nothing special is + necessary. Userspace can still set up a flow in the usual way, + as long as it uses the kernel-provided flow key to do it. + + - If the userspace flow key includes more fields than the + kernel's, for example if userspace decoded an IPv6 header but + the kernel stopped at the Ethernet type, then userspace can + forward the packet manually, without setting up a flow in the + kernel. This case is bad for performance because every packet + that the kernel considers part of the flow must go to userspace, + but the forwarding behavior is correct. (If userspace can + determine that the values of the extra fields would not affect + forwarding behavior, then it could set up a flow anyway.) + +How flow keys evolve over time is important to making this work, so +the following sections go into detail. + + +Flow key format +--------------- + +A flow key is passed over a Netlink socket as a sequence of Netlink +attributes. Some attributes represent packet metadata, defined as any +information about a packet that cannot be extracted from the packet +itself, e.g. the vport on which the packet was received. Most +attributes, however, are extracted from headers within the packet, +e.g. source and destination addresses from Ethernet, IP, or TCP +headers. + +The <linux/openvswitch.h> header file defines the exact format of the +flow key attributes. For informal explanatory purposes here, we write +them as comma-separated strings, with parentheses indicating arguments +and nesting. For example, the following could represent a flow key +corresponding to a TCP packet that arrived on vport 1: + + in_port(1), eth(src=e0:91:f5:21:d0:b2, dst=00:02:e3:0f:80:a4), + eth_type(0x0800), ipv4(src=172.16.0.20, dst=172.18.0.52, proto=17, tos=0, + frag=no), tcp(src=49163, dst=80) + +Often we ellipsize arguments not important to the discussion, e.g.: + + in_port(1), eth(...), eth_type(0x0800), ipv4(...), tcp(...) + + +Wildcarded flow key format +-------------------------- + +A wildcarded flow is described with two sequences of Netlink attributes +passed over the Netlink socket. A flow key, exactly as described above, and an +optional corresponding flow mask. + +A wildcarded flow can represent a group of exact match flows. Each '1' bit +in the mask specifies a exact match with the corresponding bit in the flow key. +A '0' bit specifies a don't care bit, which will match either a '1' or '0' bit +of a incoming packet. Using wildcarded flow can improve the flow set up rate +by reduce the number of new flows need to be processed by the user space program. + +Support for the mask Netlink attribute is optional for both the kernel and user +space program. The kernel can ignore the mask attribute, installing an exact +match flow, or reduce the number of don't care bits in the kernel to less than +what was specified by the user space program. In this case, variations in bits +that the kernel does not implement will simply result in additional flow setups. +The kernel module will also work with user space programs that neither support +nor supply flow mask attributes. + +Since the kernel may ignore or modify wildcard bits, it can be difficult for +the userspace program to know exactly what matches are installed. There are +two possible approaches: reactively install flows as they miss the kernel +flow table (and therefore not attempt to determine wildcard changes at all) +or use the kernel's response messages to determine the installed wildcards. + +When interacting with userspace, the kernel should maintain the match portion +of the key exactly as originally installed. This will provides a handle to +identify the flow for all future operations. However, when reporting the +mask of an installed flow, the mask should include any restrictions imposed +by the kernel. + +The behavior when using overlapping wildcarded flows is undefined. It is the +responsibility of the user space program to ensure that any incoming packet +can match at most one flow, wildcarded or not. The current implementation +performs best-effort detection of overlapping wildcarded flows and may reject +some but not all of them. However, this behavior may change in future versions. + + +Basic rule for evolving flow keys +--------------------------------- + +Some care is needed to really maintain forward and backward +compatibility for applications that follow the rules listed under +"Flow key compatibility" above. + +The basic rule is obvious: + + ------------------------------------------------------------------ + New network protocol support must only supplement existing flow + key attributes. It must not change the meaning of already defined + flow key attributes. + ------------------------------------------------------------------ + +This rule does have less-obvious consequences so it is worth working +through a few examples. Suppose, for example, that the kernel module +did not already implement VLAN parsing. Instead, it just interpreted +the 802.1Q TPID (0x8100) as the Ethertype then stopped parsing the +packet. The flow key for any packet with an 802.1Q header would look +essentially like this, ignoring metadata: + + eth(...), eth_type(0x8100) + +Naively, to add VLAN support, it makes sense to add a new "vlan" flow +key attribute to contain the VLAN tag, then continue to decode the +encapsulated headers beyond the VLAN tag using the existing field +definitions. With this change, a TCP packet in VLAN 10 would have a +flow key much like this: + + eth(...), vlan(vid=10, pcp=0), eth_type(0x0800), ip(proto=6, ...), tcp(...) + +But this change would negatively affect a userspace application that +has not been updated to understand the new "vlan" flow key attribute. +The application could, following the flow compatibility rules above, +ignore the "vlan" attribute that it does not understand and therefore +assume that the flow contained IP packets. This is a bad assumption +(the flow only contains IP packets if one parses and skips over the +802.1Q header) and it could cause the application's behavior to change +across kernel versions even though it follows the compatibility rules. + +The solution is to use a set of nested attributes. This is, for +example, why 802.1Q support uses nested attributes. A TCP packet in +VLAN 10 is actually expressed as: + + eth(...), eth_type(0x8100), vlan(vid=10, pcp=0), encap(eth_type(0x0800), + ip(proto=6, ...), tcp(...))) + +Notice how the "eth_type", "ip", and "tcp" flow key attributes are +nested inside the "encap" attribute. Thus, an application that does +not understand the "vlan" key will not see either of those attributes +and therefore will not misinterpret them. (Also, the outer eth_type +is still 0x8100, not changed to 0x0800.) + +Handling malformed packets +-------------------------- + +Don't drop packets in the kernel for malformed protocol headers, bad +checksums, etc. This would prevent userspace from implementing a +simple Ethernet switch that forwards every packet. + +Instead, in such a case, include an attribute with "empty" content. +It doesn't matter if the empty content could be valid protocol values, +as long as those values are rarely seen in practice, because userspace +can always forward all packets with those values to userspace and +handle them individually. + +For example, consider a packet that contains an IP header that +indicates protocol 6 for TCP, but which is truncated just after the IP +header, so that the TCP header is missing. The flow key for this +packet would include a tcp attribute with all-zero src and dst, like +this: + + eth(...), eth_type(0x0800), ip(proto=6, ...), tcp(src=0, dst=0) + +As another example, consider a packet with an Ethernet type of 0x8100, +indicating that a VLAN TCI should follow, but which is truncated just +after the Ethernet type. The flow key for this packet would include +an all-zero-bits vlan and an empty encap attribute, like this: + + eth(...), eth_type(0x8100), vlan(0), encap() + +Unlike a TCP packet with source and destination ports 0, an +all-zero-bits VLAN TCI is not that rare, so the CFI bit (aka +VLAN_TAG_PRESENT inside the kernel) is ordinarily set in a vlan +attribute expressly to allow this situation to be distinguished. +Thus, the flow key in this second example unambiguously indicates a +missing or malformed VLAN TCI. + +Other rules +----------- + +The other rules for flow keys are much less subtle: + + - Duplicate attributes are not allowed at a given nesting level. + + - Ordering of attributes is not significant. + + - When the kernel sends a given flow key to userspace, it always + composes it the same way. This allows userspace to hash and + compare entire flow keys that it may not be able to fully + interpret. diff --git a/Documentation/networking/operstates.txt b/Documentation/networking/operstates.txt index 1a77a3cfae5..355c6d8ef8a 100644 --- a/Documentation/networking/operstates.txt +++ b/Documentation/networking/operstates.txt @@ -88,6 +88,10 @@ set this flag. On netif_carrier_off(), the scheduler stops sending packets. The name 'carrier' and the inversion are historical, think of it as lower layer. +Note that for certain kind of soft-devices, which are not managing any +real hardware, it is possible to set this bit from userspace. One +should use TVL IFLA_CARRIER to do so. + netif_carrier_ok() can be used to query that bit. __LINK_STATE_DORMANT, maps to IFF_DORMANT: diff --git a/Documentation/networking/packet_mmap.txt b/Documentation/networking/packet_mmap.txt index 073894d1c09..38112d512f4 100644 --- a/Documentation/networking/packet_mmap.txt +++ b/Documentation/networking/packet_mmap.txt @@ -3,9 +3,9 @@ -------------------------------------------------------------------------------- This file documents the mmap() facility available with the PACKET -socket interface on 2.4 and 2.6 kernels. This type of sockets is used for -capture network traffic with utilities like tcpdump or any other that needs -raw access to network interface. +socket interface on 2.4/2.6/3.x kernels. This type of sockets is used for +i) capture network traffic with utilities like tcpdump, ii) transmit network +traffic, or any other that needs raw access to network interface. You can find the latest version of this document at: http://wiki.ipxwarzone.com/index.php5?title=Linux_packet_mmap @@ -21,19 +21,18 @@ Please send your comments to + Why use PACKET_MMAP -------------------------------------------------------------------------------- -In Linux 2.4/2.6 if PACKET_MMAP is not enabled, the capture process is very -inefficient. It uses very limited buffers and requires one system call -to capture each packet, it requires two if you want to get packet's -timestamp (like libpcap always does). +In Linux 2.4/2.6/3.x if PACKET_MMAP is not enabled, the capture process is very +inefficient. It uses very limited buffers and requires one system call to +capture each packet, it requires two if you want to get packet's timestamp +(like libpcap always does). In the other hand PACKET_MMAP is very efficient. PACKET_MMAP provides a size configurable circular buffer mapped in user space that can be used to either send or receive packets. This way reading packets just needs to wait for them, most of the time there is no need to issue a single system call. Concerning transmission, multiple packets can be sent through one system call to get the -highest bandwidth. -By using a shared buffer between the kernel and the user also has the benefit -of minimizing packet copies. +highest bandwidth. By using a shared buffer between the kernel and the user +also has the benefit of minimizing packet copies. It's fine to use PACKET_MMAP to improve the performance of the capture and transmission process, but it isn't everything. At least, if you are capturing @@ -41,7 +40,8 @@ at high speeds (this is relative to the cpu speed), you should check if the device driver of your network interface card supports some sort of interrupt load mitigation or (even better) if it supports NAPI, also make sure it is enabled. For transmission, check the MTU (Maximum Transmission Unit) used and -supported by devices of your network. +supported by devices of your network. CPU IRQ pinning of your network interface +card can also be an advantage. -------------------------------------------------------------------------------- + How to use mmap() to improve capture process @@ -87,9 +87,7 @@ the following process: socket creation and destruction is straight forward, and is done the same way with or without PACKET_MMAP: -int fd; - -fd= socket(PF_PACKET, mode, htons(ETH_P_ALL)) + int fd = socket(PF_PACKET, mode, htons(ETH_P_ALL)); where mode is SOCK_RAW for the raw interface were link level information can be captured or SOCK_DGRAM for the cooked @@ -100,6 +98,11 @@ by the kernel. The destruction of the socket and all associated resources is done by a simple call to close(fd). +Similarly as without PACKET_MMAP, it is possible to use one socket +for capture and transmission. This can be done by mapping the +allocated RX and TX buffer ring with a single mmap() call. +See "Mapping and use of the circular buffer (ring)". + Next I will describe PACKET_MMAP settings and its constraints, also the mapping of the circular buffer in the user process and the use of this buffer. @@ -125,6 +128,16 @@ Transmission process is similar to capture as shown below. [shutdown] close() --------> destruction of the transmission socket and deallocation of all associated resources. +Socket creation and destruction is also straight forward, and is done +the same way as in capturing described in the previous paragraph: + + int fd = socket(PF_PACKET, mode, 0); + +The protocol can optionally be 0 in case we only want to transmit +via this socket, which avoids an expensive call to packet_rcv(). +In this case, you also need to bind(2) the TX_RING with sll_protocol = 0 +set. Otherwise, htons(ETH_P_ALL) or any other protocol, for example. + Binding the socket to your network interface is mandatory (with zero copy) to know the header size of frames used in the circular buffer. @@ -155,7 +168,7 @@ As capture, each frame contains two parts: /* fill sockaddr_ll struct to prepare binding */ my_addr.sll_family = AF_PACKET; - my_addr.sll_protocol = ETH_P_ALL; + my_addr.sll_protocol = htons(ETH_P_ALL); my_addr.sll_ifindex = s_ifr.ifr_ifindex; /* bind socket to eth0 */ @@ -163,11 +176,23 @@ As capture, each frame contains two parts: A complete tutorial is available at: http://wiki.gnu-log.net/ +By default, the user should put data at : + frame base + TPACKET_HDRLEN - sizeof(struct sockaddr_ll) + +So, whatever you choose for the socket mode (SOCK_DGRAM or SOCK_RAW), +the beginning of the user data will be at : + frame base + TPACKET_ALIGN(sizeof(struct tpacket_hdr)) + +If you wish to put user data at a custom offset from the beginning of +the frame (for payload alignment with SOCK_RAW mode for instance) you +can set tp_net (with SOCK_DGRAM) or tp_mac (with SOCK_RAW). In order +to make this work it must be enabled previously with setsockopt() +and the PACKET_TX_HAS_OFF option. + -------------------------------------------------------------------------------- + PACKET_MMAP settings -------------------------------------------------------------------------------- - To setup PACKET_MMAP from user level code is done with a call like - Capture process @@ -201,7 +226,6 @@ indeed, packet_set_ring checks that the following condition is true frames_per_block * tp_block_nr == tp_frame_nr - Lets see an example, with the following values: tp_block_size= 4096 @@ -223,11 +247,10 @@ we will get the following buffer structure: A frame can be of any size with the only condition it can fit in a block. A block can only hold an integer number of frames, or in other words, a frame cannot -be spawned accross two blocks, so there are some details you have to take into +be spawned across two blocks, so there are some details you have to take into account when choosing the frame_size. See "Mapping and use of the circular buffer (ring)". - -------------------------------------------------------------------------------- + PACKET_MMAP setting constraints -------------------------------------------------------------------------------- @@ -264,7 +287,6 @@ User space programs can include /usr/include/sys/user.h and The pagesize can also be determined dynamically with the getpagesize (2) system call. - Block number limit -------------------- @@ -284,7 +306,6 @@ called pg_vec, its size limits the number of blocks that can be allocated. v block #2 block #1 - kmalloc allocates any number of bytes of physically contiguous memory from a pool of pre-determined sizes. This pool of memory is maintained by the slab allocator which is at the end the responsible for doing the allocation and @@ -299,7 +320,6 @@ pointers to blocks is 131072/4 = 32768 blocks - PACKET_MMAP buffer size calculator ------------------------------------ @@ -340,7 +360,6 @@ and a value for <frame size> of 2048 bytes. These parameters will yield and hence the buffer will have a 262144 MiB size. So it can hold 262144 MiB / 2048 bytes = 134217728 frames - Actually, this buffer size is not possible with an i386 architecture. Remember that the memory is allocated in kernel space, in the case of an i386 kernel's memory size is limited to 1GiB. @@ -372,7 +391,6 @@ the following (from include/linux/if_packet.h): - Start+tp_net: Packet data, aligned to TPACKET_ALIGNMENT=16. - Pad to align to TPACKET_ALIGNMENT=16 */ - The following are conditions that are checked in packet_set_ring @@ -401,6 +419,19 @@ tp_block_size/tp_frame_size frames there will be a gap between the frames. This is because a frame cannot be spawn across two blocks. +To use one socket for capture and transmission, the mapping of both the +RX and TX buffer ring has to be done with one call to mmap: + + ... + setsockopt(fd, SOL_PACKET, PACKET_RX_RING, &foo, sizeof(foo)); + setsockopt(fd, SOL_PACKET, PACKET_TX_RING, &bar, sizeof(bar)); + ... + rx_ring = mmap(0, size * 2, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0); + tx_ring = rx_ring + size; + +RX must be the first as the kernel maps the TX ring memory right +after the RX one. + At the beginning of each frame there is an status field (see struct tpacket_hdr). If this field is 0 means that the frame is ready to be used for the kernel, If not, there is a frame the user can read @@ -413,7 +444,6 @@ and the following flags apply: #define TP_STATUS_LOSING 4 #define TP_STATUS_CSUMNOTREADY 8 - TP_STATUS_COPY : This flag indicates that the frame (and associated meta information) has been truncated because it's larger than tp_frame_size. This packet can be @@ -423,7 +453,7 @@ TP_STATUS_COPY : This flag indicates that the frame (and associated enabled previously with setsockopt() and the PACKET_COPY_THRESH option. - The number of frames than can be buffered to + The number of frames that can be buffered to be read with recvfrom is limited like a normal socket. See the SO_RCVBUF option in the socket (7) man page. @@ -462,7 +492,6 @@ packets are in the ring: It doesn't incur in a race condition to first check the status value and then poll for frames. - ++ Transmission process Those defines are also used for transmission: @@ -494,14 +523,490 @@ The user can also use poll() to check if a buffer is available: retval = poll(&pfd, 1, timeout); ------------------------------------------------------------------------------- ++ What TPACKET versions are available and when to use them? +------------------------------------------------------------------------------- + + int val = tpacket_version; + setsockopt(fd, SOL_PACKET, PACKET_VERSION, &val, sizeof(val)); + getsockopt(fd, SOL_PACKET, PACKET_VERSION, &val, sizeof(val)); + +where 'tpacket_version' can be TPACKET_V1 (default), TPACKET_V2, TPACKET_V3. + +TPACKET_V1: + - Default if not otherwise specified by setsockopt(2) + - RX_RING, TX_RING available + +TPACKET_V1 --> TPACKET_V2: + - Made 64 bit clean due to unsigned long usage in TPACKET_V1 + structures, thus this also works on 64 bit kernel with 32 bit + userspace and the like + - Timestamp resolution in nanoseconds instead of microseconds + - RX_RING, TX_RING available + - VLAN metadata information available for packets + (TP_STATUS_VLAN_VALID, TP_STATUS_VLAN_TPID_VALID), + in the tpacket2_hdr structure: + - TP_STATUS_VLAN_VALID bit being set into the tp_status field indicates + that the tp_vlan_tci field has valid VLAN TCI value + - TP_STATUS_VLAN_TPID_VALID bit being set into the tp_status field + indicates that the tp_vlan_tpid field has valid VLAN TPID value + - How to switch to TPACKET_V2: + 1. Replace struct tpacket_hdr by struct tpacket2_hdr + 2. Query header len and save + 3. Set protocol version to 2, set up ring as usual + 4. For getting the sockaddr_ll, + use (void *)hdr + TPACKET_ALIGN(hdrlen) instead of + (void *)hdr + TPACKET_ALIGN(sizeof(struct tpacket_hdr)) + +TPACKET_V2 --> TPACKET_V3: + - Flexible buffer implementation: + 1. Blocks can be configured with non-static frame-size + 2. Read/poll is at a block-level (as opposed to packet-level) + 3. Added poll timeout to avoid indefinite user-space wait + on idle links + 4. Added user-configurable knobs: + 4.1 block::timeout + 4.2 tpkt_hdr::sk_rxhash + - RX Hash data available in user space + - Currently only RX_RING available + +------------------------------------------------------------------------------- ++ AF_PACKET fanout mode +------------------------------------------------------------------------------- + +In the AF_PACKET fanout mode, packet reception can be load balanced among +processes. This also works in combination with mmap(2) on packet sockets. + +Currently implemented fanout policies are: + + - PACKET_FANOUT_HASH: schedule to socket by skb's packet hash + - PACKET_FANOUT_LB: schedule to socket by round-robin + - PACKET_FANOUT_CPU: schedule to socket by CPU packet arrives on + - PACKET_FANOUT_RND: schedule to socket by random selection + - PACKET_FANOUT_ROLLOVER: if one socket is full, rollover to another + - PACKET_FANOUT_QM: schedule to socket by skbs recorded queue_mapping + +Minimal example code by David S. Miller (try things like "./test eth0 hash", +"./test eth0 lb", etc.): + +#include <stddef.h> +#include <stdlib.h> +#include <stdio.h> +#include <string.h> + +#include <sys/types.h> +#include <sys/wait.h> +#include <sys/socket.h> +#include <sys/ioctl.h> + +#include <unistd.h> + +#include <linux/if_ether.h> +#include <linux/if_packet.h> + +#include <net/if.h> + +static const char *device_name; +static int fanout_type; +static int fanout_id; + +#ifndef PACKET_FANOUT +# define PACKET_FANOUT 18 +# define PACKET_FANOUT_HASH 0 +# define PACKET_FANOUT_LB 1 +#endif + +static int setup_socket(void) +{ + int err, fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_IP)); + struct sockaddr_ll ll; + struct ifreq ifr; + int fanout_arg; + + if (fd < 0) { + perror("socket"); + return EXIT_FAILURE; + } + + memset(&ifr, 0, sizeof(ifr)); + strcpy(ifr.ifr_name, device_name); + err = ioctl(fd, SIOCGIFINDEX, &ifr); + if (err < 0) { + perror("SIOCGIFINDEX"); + return EXIT_FAILURE; + } + + memset(&ll, 0, sizeof(ll)); + ll.sll_family = AF_PACKET; + ll.sll_ifindex = ifr.ifr_ifindex; + err = bind(fd, (struct sockaddr *) &ll, sizeof(ll)); + if (err < 0) { + perror("bind"); + return EXIT_FAILURE; + } + + fanout_arg = (fanout_id | (fanout_type << 16)); + err = setsockopt(fd, SOL_PACKET, PACKET_FANOUT, + &fanout_arg, sizeof(fanout_arg)); + if (err) { + perror("setsockopt"); + return EXIT_FAILURE; + } + + return fd; +} + +static void fanout_thread(void) +{ + int fd = setup_socket(); + int limit = 10000; + + if (fd < 0) + exit(fd); + + while (limit-- > 0) { + char buf[1600]; + int err; + + err = read(fd, buf, sizeof(buf)); + if (err < 0) { + perror("read"); + exit(EXIT_FAILURE); + } + if ((limit % 10) == 0) + fprintf(stdout, "(%d) \n", getpid()); + } + + fprintf(stdout, "%d: Received 10000 packets\n", getpid()); + + close(fd); + exit(0); +} + +int main(int argc, char **argp) +{ + int fd, err; + int i; + + if (argc != 3) { + fprintf(stderr, "Usage: %s INTERFACE {hash|lb}\n", argp[0]); + return EXIT_FAILURE; + } + + if (!strcmp(argp[2], "hash")) + fanout_type = PACKET_FANOUT_HASH; + else if (!strcmp(argp[2], "lb")) + fanout_type = PACKET_FANOUT_LB; + else { + fprintf(stderr, "Unknown fanout type [%s]\n", argp[2]); + exit(EXIT_FAILURE); + } + + device_name = argp[1]; + fanout_id = getpid() & 0xffff; + + for (i = 0; i < 4; i++) { + pid_t pid = fork(); + + switch (pid) { + case 0: + fanout_thread(); + + case -1: + perror("fork"); + exit(EXIT_FAILURE); + } + } + + for (i = 0; i < 4; i++) { + int status; + + wait(&status); + } + + return 0; +} + +------------------------------------------------------------------------------- ++ AF_PACKET TPACKET_V3 example +------------------------------------------------------------------------------- + +AF_PACKET's TPACKET_V3 ring buffer can be configured to use non-static frame +sizes by doing it's own memory management. It is based on blocks where polling +works on a per block basis instead of per ring as in TPACKET_V2 and predecessor. + +It is said that TPACKET_V3 brings the following benefits: + *) ~15 - 20% reduction in CPU-usage + *) ~20% increase in packet capture rate + *) ~2x increase in packet density + *) Port aggregation analysis + *) Non static frame size to capture entire packet payload + +So it seems to be a good candidate to be used with packet fanout. + +Minimal example code by Daniel Borkmann based on Chetan Loke's lolpcap (compile +it with gcc -Wall -O2 blob.c, and try things like "./a.out eth0", etc.): + +/* Written from scratch, but kernel-to-user space API usage + * dissected from lolpcap: + * Copyright 2011, Chetan Loke <loke.chetan@gmail.com> + * License: GPL, version 2.0 + */ + +#include <stdio.h> +#include <stdlib.h> +#include <stdint.h> +#include <string.h> +#include <assert.h> +#include <net/if.h> +#include <arpa/inet.h> +#include <netdb.h> +#include <poll.h> +#include <unistd.h> +#include <signal.h> +#include <inttypes.h> +#include <sys/socket.h> +#include <sys/mman.h> +#include <linux/if_packet.h> +#include <linux/if_ether.h> +#include <linux/ip.h> + +#ifndef likely +# define likely(x) __builtin_expect(!!(x), 1) +#endif +#ifndef unlikely +# define unlikely(x) __builtin_expect(!!(x), 0) +#endif + +struct block_desc { + uint32_t version; + uint32_t offset_to_priv; + struct tpacket_hdr_v1 h1; +}; + +struct ring { + struct iovec *rd; + uint8_t *map; + struct tpacket_req3 req; +}; + +static unsigned long packets_total = 0, bytes_total = 0; +static sig_atomic_t sigint = 0; + +static void sighandler(int num) +{ + sigint = 1; +} + +static int setup_socket(struct ring *ring, char *netdev) +{ + int err, i, fd, v = TPACKET_V3; + struct sockaddr_ll ll; + unsigned int blocksiz = 1 << 22, framesiz = 1 << 11; + unsigned int blocknum = 64; + + fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL)); + if (fd < 0) { + perror("socket"); + exit(1); + } + + err = setsockopt(fd, SOL_PACKET, PACKET_VERSION, &v, sizeof(v)); + if (err < 0) { + perror("setsockopt"); + exit(1); + } + + memset(&ring->req, 0, sizeof(ring->req)); + ring->req.tp_block_size = blocksiz; + ring->req.tp_frame_size = framesiz; + ring->req.tp_block_nr = blocknum; + ring->req.tp_frame_nr = (blocksiz * blocknum) / framesiz; + ring->req.tp_retire_blk_tov = 60; + ring->req.tp_feature_req_word = TP_FT_REQ_FILL_RXHASH; + + err = setsockopt(fd, SOL_PACKET, PACKET_RX_RING, &ring->req, + sizeof(ring->req)); + if (err < 0) { + perror("setsockopt"); + exit(1); + } + + ring->map = mmap(NULL, ring->req.tp_block_size * ring->req.tp_block_nr, + PROT_READ | PROT_WRITE, MAP_SHARED | MAP_LOCKED, fd, 0); + if (ring->map == MAP_FAILED) { + perror("mmap"); + exit(1); + } + + ring->rd = malloc(ring->req.tp_block_nr * sizeof(*ring->rd)); + assert(ring->rd); + for (i = 0; i < ring->req.tp_block_nr; ++i) { + ring->rd[i].iov_base = ring->map + (i * ring->req.tp_block_size); + ring->rd[i].iov_len = ring->req.tp_block_size; + } + + memset(&ll, 0, sizeof(ll)); + ll.sll_family = PF_PACKET; + ll.sll_protocol = htons(ETH_P_ALL); + ll.sll_ifindex = if_nametoindex(netdev); + ll.sll_hatype = 0; + ll.sll_pkttype = 0; + ll.sll_halen = 0; + + err = bind(fd, (struct sockaddr *) &ll, sizeof(ll)); + if (err < 0) { + perror("bind"); + exit(1); + } + + return fd; +} + +static void display(struct tpacket3_hdr *ppd) +{ + struct ethhdr *eth = (struct ethhdr *) ((uint8_t *) ppd + ppd->tp_mac); + struct iphdr *ip = (struct iphdr *) ((uint8_t *) eth + ETH_HLEN); + + if (eth->h_proto == htons(ETH_P_IP)) { + struct sockaddr_in ss, sd; + char sbuff[NI_MAXHOST], dbuff[NI_MAXHOST]; + + memset(&ss, 0, sizeof(ss)); + ss.sin_family = PF_INET; + ss.sin_addr.s_addr = ip->saddr; + getnameinfo((struct sockaddr *) &ss, sizeof(ss), + sbuff, sizeof(sbuff), NULL, 0, NI_NUMERICHOST); + + memset(&sd, 0, sizeof(sd)); + sd.sin_family = PF_INET; + sd.sin_addr.s_addr = ip->daddr; + getnameinfo((struct sockaddr *) &sd, sizeof(sd), + dbuff, sizeof(dbuff), NULL, 0, NI_NUMERICHOST); + + printf("%s -> %s, ", sbuff, dbuff); + } + + printf("rxhash: 0x%x\n", ppd->hv1.tp_rxhash); +} + +static void walk_block(struct block_desc *pbd, const int block_num) +{ + int num_pkts = pbd->h1.num_pkts, i; + unsigned long bytes = 0; + struct tpacket3_hdr *ppd; + + ppd = (struct tpacket3_hdr *) ((uint8_t *) pbd + + pbd->h1.offset_to_first_pkt); + for (i = 0; i < num_pkts; ++i) { + bytes += ppd->tp_snaplen; + display(ppd); + + ppd = (struct tpacket3_hdr *) ((uint8_t *) ppd + + ppd->tp_next_offset); + } + + packets_total += num_pkts; + bytes_total += bytes; +} + +static void flush_block(struct block_desc *pbd) +{ + pbd->h1.block_status = TP_STATUS_KERNEL; +} + +static void teardown_socket(struct ring *ring, int fd) +{ + munmap(ring->map, ring->req.tp_block_size * ring->req.tp_block_nr); + free(ring->rd); + close(fd); +} + +int main(int argc, char **argp) +{ + int fd, err; + socklen_t len; + struct ring ring; + struct pollfd pfd; + unsigned int block_num = 0, blocks = 64; + struct block_desc *pbd; + struct tpacket_stats_v3 stats; + + if (argc != 2) { + fprintf(stderr, "Usage: %s INTERFACE\n", argp[0]); + return EXIT_FAILURE; + } + + signal(SIGINT, sighandler); + + memset(&ring, 0, sizeof(ring)); + fd = setup_socket(&ring, argp[argc - 1]); + assert(fd > 0); + + memset(&pfd, 0, sizeof(pfd)); + pfd.fd = fd; + pfd.events = POLLIN | POLLERR; + pfd.revents = 0; + + while (likely(!sigint)) { + pbd = (struct block_desc *) ring.rd[block_num].iov_base; + + if ((pbd->h1.block_status & TP_STATUS_USER) == 0) { + poll(&pfd, 1, -1); + continue; + } + + walk_block(pbd, block_num); + flush_block(pbd); + block_num = (block_num + 1) % blocks; + } + + len = sizeof(stats); + err = getsockopt(fd, SOL_PACKET, PACKET_STATISTICS, &stats, &len); + if (err < 0) { + perror("getsockopt"); + exit(1); + } + + fflush(stdout); + printf("\nReceived %u packets, %lu bytes, %u dropped, freeze_q_cnt: %u\n", + stats.tp_packets, bytes_total, stats.tp_drops, + stats.tp_freeze_q_cnt); + + teardown_socket(&ring, fd); + return 0; +} + +------------------------------------------------------------------------------- ++ PACKET_QDISC_BYPASS +------------------------------------------------------------------------------- + +If there is a requirement to load the network with many packets in a similar +fashion as pktgen does, you might set the following option after socket +creation: + + int one = 1; + setsockopt(fd, SOL_PACKET, PACKET_QDISC_BYPASS, &one, sizeof(one)); + +This has the side-effect, that packets sent through PF_PACKET will bypass the +kernel's qdisc layer and are forcedly pushed to the driver directly. Meaning, +packet are not buffered, tc disciplines are ignored, increased loss can occur +and such packets are also not visible to other PF_PACKET sockets anymore. So, +you have been warned; generally, this can be useful for stress testing various +components of a system. + +On default, PACKET_QDISC_BYPASS is disabled and needs to be explicitly enabled +on PF_PACKET sockets. + +------------------------------------------------------------------------------- + PACKET_TIMESTAMP ------------------------------------------------------------------------------- The PACKET_TIMESTAMP setting determines the source of the timestamp in -the packet meta information. If your NIC is capable of timestamping -packets in hardware, you can request those hardware timestamps to used. -Note: you may need to enable the generation of hardware timestamps with -SIOCSHWTSTAMP. +the packet meta information for mmap(2)ed RX_RING and TX_RINGs. If your +NIC is capable of timestamping packets in hardware, you can request those +hardware timestamps to be used. Note: you may need to enable the generation +of hardware timestamps with SIOCSHWTSTAMP (see related information from +Documentation/networking/timestamping.txt). PACKET_TIMESTAMP accepts the same integer bit field as SO_TIMESTAMPING. However, only the SOF_TIMESTAMPING_SYS_HARDWARE @@ -513,12 +1018,47 @@ SOF_TIMESTAMPING_RAW_HARDWARE if both bits are set. req |= SOF_TIMESTAMPING_SYS_HARDWARE; setsockopt(fd, SOL_PACKET, PACKET_TIMESTAMP, (void *) &req, sizeof(req)) -If PACKET_TIMESTAMP is not set, a software timestamp generated inside -the networking stack is used (the behavior before this setting was added). +For the mmap(2)ed ring buffers, such timestamps are stored in the +tpacket{,2,3}_hdr structure's tp_sec and tp_{n,u}sec members. To determine +what kind of timestamp has been reported, the tp_status field is binary |'ed +with the following possible bits ... + + TP_STATUS_TS_SYS_HARDWARE + TP_STATUS_TS_RAW_HARDWARE + TP_STATUS_TS_SOFTWARE + +... that are equivalent to its SOF_TIMESTAMPING_* counterparts. For the +RX_RING, if none of those 3 are set (i.e. PACKET_TIMESTAMP is not set), +then this means that a software fallback was invoked *within* PF_PACKET's +processing code (less precise). + +Getting timestamps for the TX_RING works as follows: i) fill the ring frames, +ii) call sendto() e.g. in blocking mode, iii) wait for status of relevant +frames to be updated resp. the frame handed over to the application, iv) walk +through the frames to pick up the individual hw/sw timestamps. + +Only (!) if transmit timestamping is enabled, then these bits are combined +with binary | with TP_STATUS_AVAILABLE, so you must check for that in your +application (e.g. !(tp_status & (TP_STATUS_SEND_REQUEST | TP_STATUS_SENDING)) +in a first step to see if the frame belongs to the application, and then +one can extract the type of timestamp in a second step from tp_status)! + +If you don't care about them, thus having it disabled, checking for +TP_STATUS_AVAILABLE resp. TP_STATUS_WRONG_FORMAT is sufficient. If in the +TX_RING part only TP_STATUS_AVAILABLE is set, then the tp_sec and tp_{n,u}sec +members do not contain a valid value. For TX_RINGs, by default no timestamp +is generated! See include/linux/net_tstamp.h and Documentation/networking/timestamping for more information on hardware timestamps. +------------------------------------------------------------------------------- ++ Miscellaneous bits +------------------------------------------------------------------------------- + +- Packet sockets work well together with Linux socket filters, thus you also + might want to have a look at Documentation/networking/filter.txt + -------------------------------------------------------------------------------- + THANKS -------------------------------------------------------------------------------- diff --git a/Documentation/networking/phonet.txt b/Documentation/networking/phonet.txt index 24ad2adba6e..81003581f47 100644 --- a/Documentation/networking/phonet.txt +++ b/Documentation/networking/phonet.txt @@ -154,9 +154,28 @@ connections, one per accept()'d socket. write(cfd, msg, msglen); } -Connections are established between two endpoints by a "third party" -application. This means that both endpoints are passive; so connect() -is not possible. +Connections are traditionally established between two endpoints by a +"third party" application. This means that both endpoints are passive. + + +As of Linux kernel version 2.6.39, it is also possible to connect +two endpoints directly, using connect() on the active side. This is +intended to support the newer Nokia Wireless Modem API, as found in +e.g. the Nokia Slim Modem in the ST-Ericsson U8500 platform: + + struct sockaddr_spn spn; + int fd; + + fd = socket(PF_PHONET, SOCK_SEQPACKET, PN_PROTO_PIPE); + memset(&spn, 0, sizeof(spn)); + spn.spn_family = AF_PHONET; + spn.spn_obj = ...; + spn.spn_dev = ...; + spn.spn_resource = 0xD9; + connect(fd, (struct sockaddr *)&spn, sizeof(spn)); + /* normal I/O here ... */ + close(fd); + WARNING: When polling a connected pipe socket for writability, there is an @@ -181,45 +200,9 @@ The pipe protocol provides two socket options at the SOL_PNPIPE level: interface index of the network interface created by PNPIPE_ENCAP, or zero if encapsulation is off. - -Phonet Pipe-controller Implementation -------------------------------------- - -Phonet Pipe-controller is enabled by selecting the CONFIG_PHONET_PIPECTRLR Kconfig -option. It is useful when communicating with those Nokia Modems which do not -implement Pipe controller in them e.g. Nokia Slim Modem used in ST-Ericsson -U8500 platform. - -The implementation is based on the Data Connection Establishment Sequence -depicted in 'Nokia Wireless Modem API - Wireless_modem_user_guide.pdf' -document. - -It allows a phonet sequenced socket (host-pep) to initiate a Pipe connection -between itself and a remote pipe-end point (e.g. modem). - -The implementation adds socket options at SOL_PNPIPE level: - - PNPIPE_PIPE_HANDLE - It accepts an integer argument for setting value of pipe handle. - - PNPIPE_ENABLE accepts one integer value (int). If set to zero, the pipe - is disabled. If the value is non-zero, the pipe is enabled. If the pipe - is not (yet) connected, ENOTCONN is error is returned. - -The implementation also adds socket 'connect'. On calling the 'connect', pipe -will be created between the source socket and the destination, and the pipe -state will be set to PIPE_DISABLED. - -After a pipe has been created and enabled successfully, the Pipe data can be -exchanged between the host-pep and remote-pep (modem). - -User-space would typically follow below sequence with Pipe controller:- --socket --bind --setsockopt for PNPIPE_PIPE_HANDLE --connect --setsockopt for PNPIPE_ENCAP_IP --setsockopt for PNPIPE_ENABLE + PNPIPE_HANDLE is a read-only integer value. It contains the underlying + identifier ("pipe handle") of the pipe. This is only defined for + socket descriptors that are already connected or being connected. Authors diff --git a/Documentation/networking/phy.txt b/Documentation/networking/phy.txt index 9eb1ba52013..3544c98401f 100644 --- a/Documentation/networking/phy.txt +++ b/Documentation/networking/phy.txt @@ -48,7 +48,7 @@ The MDIO bus time, so it is safe for them to block, waiting for an interrupt to signal the operation is complete - 2) A reset function is necessary. This is used to return the bus to an + 2) A reset function is optional. This is used to return the bus to an initialized state. 3) A probe function is needed. This function should set up anything the bus @@ -62,7 +62,8 @@ The MDIO bus 5) The bus must also be declared somewhere as a device, and registered. As an example for how one driver implemented an mdio bus driver, see - drivers/net/gianfar_mii.c and arch/ppc/syslib/mpc85xx_devices.c + drivers/net/ethernet/freescale/fsl_pq_mdio.c and an associated DTS file + for one of the users. (e.g. "git grep fsl,.*-mdio arch/powerpc/boot/dts/") Connecting to a PHY @@ -102,7 +103,7 @@ Letting the PHY Abstraction Layer do Everything Now, to connect, just call this function: - phydev = phy_connect(dev, phy_name, &adjust_link, flags, interface); + phydev = phy_connect(dev, phy_name, &adjust_link, interface); phydev is a pointer to the phy_device structure which represents the PHY. If phy_connect is successful, it will return the pointer. dev, here, is the @@ -112,7 +113,9 @@ Letting the PHY Abstraction Layer do Everything current state, though the PHY will not yet be truly operational at this point. - flags is a u32 which can optionally contain phy-specific flags. + PHY-specific flags should be set in phydev->dev_flags prior to the call + to phy_connect() such that the underlying PHY driver can check for flags + and perform specific operations based on them. This is useful if the system has put hardware restrictions on the PHY/controller, of which the PHY needs to be aware. @@ -184,11 +187,10 @@ Doing it all yourself start, or disables then frees them for stop. struct phy_device * phy_attach(struct net_device *dev, const char *phy_id, - u32 flags, phy_interface_t interface); + phy_interface_t interface); Attaches a network device to a particular PHY, binding the PHY to a generic - driver if none was found during bus initialization. Passes in - any phy-specific flags as needed. + driver if none was found during bus initialization. int phy_start_aneg(struct phy_device *phydev); @@ -251,15 +253,25 @@ Writing a PHY driver Each driver consists of a number of function pointers: + soft_reset: perform a PHY software reset config_init: configures PHY into a sane state after a reset. For instance, a Davicom PHY requires descrambling disabled. - probe: Does any setup needed by the driver + probe: Allocate phy->priv, optionally refuse to bind. + PHY may not have been reset or had fixups run yet. suspend/resume: power management config_aneg: Changes the speed/duplex/negotiation settings + aneg_done: Determines the auto-negotiation result read_status: Reads the current speed/duplex/negotiation settings ack_interrupt: Clear a pending interrupt + did_interrupt: Checks if the PHY generated an interrupt config_intr: Enable or disable interrupts remove: Does any driver take-down + ts_info: Queries about the HW timestamping status + hwtstamp: Set the PHY HW timestamping configuration + rxtstamp: Requests a receive timestamp at the PHY level for a 'skb' + txtsamp: Requests a transmit timestamp at the PHY level for a 'skb' + set_wol: Enable Wake-on-LAN at the PHY level + get_wol: Get the Wake-on-LAN status at the PHY level Of these, only config_aneg and read_status are required to be assigned by the driver code. The rest are optional. Also, it is diff --git a/Documentation/networking/pktgen.txt b/Documentation/networking/pktgen.txt index 75e4fd708cc..0e30c7845b2 100644 --- a/Documentation/networking/pktgen.txt +++ b/Documentation/networking/pktgen.txt @@ -102,13 +102,20 @@ Examples: The 'minimum' MAC is what you set with dstmac. pgset "flag [name]" Set a flag to determine behaviour. Current flags - are: IPSRC_RND #IP Source is random (between min/max), - IPDST_RND, UDPSRC_RND, - UDPDST_RND, MACSRC_RND, MACDST_RND + are: IPSRC_RND # IP source is random (between min/max) + IPDST_RND # IP destination is random + UDPSRC_RND, UDPDST_RND, + MACSRC_RND, MACDST_RND + TXSIZE_RND, IPV6, MPLS_RND, VID_RND, SVID_RND + FLOW_SEQ, QUEUE_MAP_RND # queue map random QUEUE_MAP_CPU # queue map mirrors smp_processor_id() + UDPCSUM, + IPSEC # IPsec encapsulation (needs CONFIG_XFRM) + NODE_ALLOC # node specific memory allocation + pgset spi SPI_VALUE Set specific SA used to transform packet. pgset "udp_src_min 9" set UDP source port min, If < udp_src_max, then cycle through the port range. @@ -177,6 +184,18 @@ Note when adding devices to a specific CPU there good idea to also assign /proc/irq/XX/smp_affinity so the TX-interrupts gets bound to the same CPU. as this reduces cache bouncing when freeing skb's. +Enable IPsec +============ +Default IPsec transformation with ESP encapsulation plus Transport mode +could be enabled by simply setting: + +pgset "flag IPSEC" +pgset "flows 1" + +To avoid breaking existing testbed scripts for using AH type and tunnel mode, +user could use "pgset spi SPI_VALUE" to specify which formal of transformation +to employ. + Current commands and configuration options ========================================== @@ -219,12 +238,22 @@ udp_dst_max flag IPSRC_RND - TXSIZE_RND IPDST_RND UDPSRC_RND UDPDST_RND MACSRC_RND MACDST_RND + TXSIZE_RND + IPV6 + MPLS_RND + VID_RND + SVID_RND + FLOW_SEQ + QUEUE_MAP_RND + QUEUE_MAP_CPU + UDPCSUM + IPSEC + NODE_ALLOC dst_min dst_max diff --git a/Documentation/networking/ppp_generic.txt b/Documentation/networking/ppp_generic.txt index 15b5172fbb9..091d20273dc 100644 --- a/Documentation/networking/ppp_generic.txt +++ b/Documentation/networking/ppp_generic.txt @@ -342,7 +342,7 @@ an interface unit are: numbers on received multilink fragments SC_MP_XSHORTSEQ transmit short multilink sequence nos. - The values of these flags are defined in <linux/if_ppp.h>. Note + The values of these flags are defined in <linux/ppp-ioctl.h>. Note that the values of the SC_MULTILINK, SC_MP_SHORTSEQ and SC_MP_XSHORTSEQ bits are ignored if the CONFIG_PPP_MULTILINK option is not selected. @@ -358,7 +358,7 @@ an interface unit are: * PPPIOCSCOMPRESS sets the parameters for packet compression or decompression. The argument should point to a ppp_option_data - structure (defined in <linux/if_ppp.h>), which contains a + structure (defined in <linux/ppp-ioctl.h>), which contains a pointer/length pair which should describe a block of memory containing a CCP option specifying a compression method and its parameters. The ppp_option_data struct also contains a `transmit' @@ -395,7 +395,7 @@ an interface unit are: * PPPIOCSNPMODE sets the network-protocol mode for a given network protocol. The argument should point to an npioctl struct (defined - in <linux/if_ppp.h>). The `protocol' field gives the PPP protocol + in <linux/ppp-ioctl.h>). The `protocol' field gives the PPP protocol number for the protocol to be affected, and the `mode' field specifies what to do with packets for that protocol: diff --git a/Documentation/networking/regulatory.txt b/Documentation/networking/regulatory.txt index 9551622d0a7..356f791af57 100644 --- a/Documentation/networking/regulatory.txt +++ b/Documentation/networking/regulatory.txt @@ -159,10 +159,10 @@ struct ieee80211_regdomain mydriver_jp_regdom = { REG_RULE(2412-20, 2484+20, 40, 6, 20, 0), /* IEEE 802.11a, channels 34..48 */ REG_RULE(5170-20, 5240+20, 40, 6, 20, - NL80211_RRF_PASSIVE_SCAN), + NL80211_RRF_NO_IR), /* IEEE 802.11a, channels 52..64 */ REG_RULE(5260-20, 5320+20, 40, 6, 20, - NL80211_RRF_NO_IBSS | + NL80211_RRF_NO_IR| NL80211_RRF_DFS), } }; diff --git a/Documentation/networking/rxrpc.txt b/Documentation/networking/rxrpc.txt index 60d05eb77c6..16a924c486b 100644 --- a/Documentation/networking/rxrpc.txt +++ b/Documentation/networking/rxrpc.txt @@ -27,6 +27,8 @@ Contents of this document: (*) AF_RXRPC kernel interface. + (*) Configurable parameters. + ======== OVERVIEW @@ -144,7 +146,7 @@ An overview of the RxRPC protocol: (*) Calls use ACK packets to handle reliability. Data packets are also explicitly sequenced per call. - (*) There are two types of positive acknowledgement: hard-ACKs and soft-ACKs. + (*) There are two types of positive acknowledgment: hard-ACKs and soft-ACKs. A hard-ACK indicates to the far side that all the data received to a point has been received and processed; a soft-ACK indicates that the data has been received but may yet be discarded and re-requested. The sender may @@ -864,3 +866,82 @@ The kernel interface functions are as follows: This is used to allocate a null RxRPC key that can be used to indicate anonymous security for a particular domain. + + +======================= +CONFIGURABLE PARAMETERS +======================= + +The RxRPC protocol driver has a number of configurable parameters that can be +adjusted through sysctls in /proc/net/rxrpc/: + + (*) req_ack_delay + + The amount of time in milliseconds after receiving a packet with the + request-ack flag set before we honour the flag and actually send the + requested ack. + + Usually the other side won't stop sending packets until the advertised + reception window is full (to a maximum of 255 packets), so delaying the + ACK permits several packets to be ACK'd in one go. + + (*) soft_ack_delay + + The amount of time in milliseconds after receiving a new packet before we + generate a soft-ACK to tell the sender that it doesn't need to resend. + + (*) idle_ack_delay + + The amount of time in milliseconds after all the packets currently in the + received queue have been consumed before we generate a hard-ACK to tell + the sender it can free its buffers, assuming no other reason occurs that + we would send an ACK. + + (*) resend_timeout + + The amount of time in milliseconds after transmitting a packet before we + transmit it again, assuming no ACK is received from the receiver telling + us they got it. + + (*) max_call_lifetime + + The maximum amount of time in seconds that a call may be in progress + before we preemptively kill it. + + (*) dead_call_expiry + + The amount of time in seconds before we remove a dead call from the call + list. Dead calls are kept around for a little while for the purpose of + repeating ACK and ABORT packets. + + (*) connection_expiry + + The amount of time in seconds after a connection was last used before we + remove it from the connection list. Whilst a connection is in existence, + it serves as a placeholder for negotiated security; when it is deleted, + the security must be renegotiated. + + (*) transport_expiry + + The amount of time in seconds after a transport was last used before we + remove it from the transport list. Whilst a transport is in existence, it + serves to anchor the peer data and keeps the connection ID counter. + + (*) rxrpc_rx_window_size + + The size of the receive window in packets. This is the maximum number of + unconsumed received packets we're willing to hold in memory for any + particular call. + + (*) rxrpc_rx_mtu + + The maximum packet MTU size that we're willing to receive in bytes. This + indicates to the peer whether we're willing to accept jumbo packets. + + (*) rxrpc_rx_jumbo_max + + The maximum number of packets that we're willing to accept in a jumbo + packet. Non-terminal packets in a jumbo packet must contain a four byte + header plus exactly 1412 bytes of data. The terminal packet must contain + a four byte header plus any amount of data. In any event, a jumbo packet + may not exceed rxrpc_rx_mtu in size. diff --git a/Documentation/networking/s2io.txt b/Documentation/networking/s2io.txt index 9d4e0f4df5a..d2a9f43b554 100644 --- a/Documentation/networking/s2io.txt +++ b/Documentation/networking/s2io.txt @@ -37,7 +37,7 @@ To associate an interface with a physical adapter use "ethtool -p <ethX>". The corresponding adapter's LED will blink multiple times. 3. Features supported: -a. Jumbo frames. Xframe I/II supports MTU upto 9600 bytes, +a. Jumbo frames. Xframe I/II supports MTU up to 9600 bytes, modifiable using ifconfig command. b. Offloads. Supports checksum offload(TCP/UDP/IP) on transmit @@ -49,7 +49,7 @@ significant performance improvement on certain platforms(SGI Altix, IBM xSeries). d. MSI/MSI-X. Can be enabled on platforms which support this feature -(IA64, Xeon) resulting in noticeable performance improvement(upto 7% +(IA64, Xeon) resulting in noticeable performance improvement(up to 7% on certain platforms). e. Statistics. Comprehensive MAC-level and software statistics displayed @@ -136,16 +136,6 @@ For more information, please review the AMD8131 errata at http://vip.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/ 26310_AMD-8131_HyperTransport_PCI-X_Tunnel_Revision_Guide_rev_3_18.pdf -6. Available Downloads -Neterion "s2io" driver in Red Hat and Suse 2.6-based distributions is kept up -to date, also the latest "s2io" code (including support for 2.4 kernels) is -available via "Support" link on the Neterion site: http://www.neterion.com. - -For Xframe User Guide (Programming manual), visit ftp site ns1.s2io.com, -user: linuxdocs password: HALdocs - -7. Support +6. Support For further support please contact either your 10GbE Xframe NIC vendor (IBM, -HP, SGI etc.) or click on the "Support" link on the Neterion site: -http://www.neterion.com. - +HP, SGI etc.) diff --git a/Documentation/networking/scaling.txt b/Documentation/networking/scaling.txt new file mode 100644 index 00000000000..99ca40e8e81 --- /dev/null +++ b/Documentation/networking/scaling.txt @@ -0,0 +1,436 @@ +Scaling in the Linux Networking Stack + + +Introduction +============ + +This document describes a set of complementary techniques in the Linux +networking stack to increase parallelism and improve performance for +multi-processor systems. + +The following technologies are described: + + RSS: Receive Side Scaling + RPS: Receive Packet Steering + RFS: Receive Flow Steering + Accelerated Receive Flow Steering + XPS: Transmit Packet Steering + + +RSS: Receive Side Scaling +========================= + +Contemporary NICs support multiple receive and transmit descriptor queues +(multi-queue). On reception, a NIC can send different packets to different +queues to distribute processing among CPUs. The NIC distributes packets by +applying a filter to each packet that assigns it to one of a small number +of logical flows. Packets for each flow are steered to a separate receive +queue, which in turn can be processed by separate CPUs. This mechanism is +generally known as “Receive-side Scaling” (RSS). The goal of RSS and +the other scaling techniques is to increase performance uniformly. +Multi-queue distribution can also be used for traffic prioritization, but +that is not the focus of these techniques. + +The filter used in RSS is typically a hash function over the network +and/or transport layer headers-- for example, a 4-tuple hash over +IP addresses and TCP ports of a packet. The most common hardware +implementation of RSS uses a 128-entry indirection table where each entry +stores a queue number. The receive queue for a packet is determined +by masking out the low order seven bits of the computed hash for the +packet (usually a Toeplitz hash), taking this number as a key into the +indirection table and reading the corresponding value. + +Some advanced NICs allow steering packets to queues based on +programmable filters. For example, webserver bound TCP port 80 packets +can be directed to their own receive queue. Such “n-tuple” filters can +be configured from ethtool (--config-ntuple). + +==== RSS Configuration + +The driver for a multi-queue capable NIC typically provides a kernel +module parameter for specifying the number of hardware queues to +configure. In the bnx2x driver, for instance, this parameter is called +num_queues. A typical RSS configuration would be to have one receive queue +for each CPU if the device supports enough queues, or otherwise at least +one for each memory domain, where a memory domain is a set of CPUs that +share a particular memory level (L1, L2, NUMA node, etc.). + +The indirection table of an RSS device, which resolves a queue by masked +hash, is usually programmed by the driver at initialization. The +default mapping is to distribute the queues evenly in the table, but the +indirection table can be retrieved and modified at runtime using ethtool +commands (--show-rxfh-indir and --set-rxfh-indir). Modifying the +indirection table could be done to give different queues different +relative weights. + +== RSS IRQ Configuration + +Each receive queue has a separate IRQ associated with it. The NIC triggers +this to notify a CPU when new packets arrive on the given queue. The +signaling path for PCIe devices uses message signaled interrupts (MSI-X), +that can route each interrupt to a particular CPU. The active mapping +of queues to IRQs can be determined from /proc/interrupts. By default, +an IRQ may be handled on any CPU. Because a non-negligible part of packet +processing takes place in receive interrupt handling, it is advantageous +to spread receive interrupts between CPUs. To manually adjust the IRQ +affinity of each interrupt see Documentation/IRQ-affinity.txt. Some systems +will be running irqbalance, a daemon that dynamically optimizes IRQ +assignments and as a result may override any manual settings. + +== Suggested Configuration + +RSS should be enabled when latency is a concern or whenever receive +interrupt processing forms a bottleneck. Spreading load between CPUs +decreases queue length. For low latency networking, the optimal setting +is to allocate as many queues as there are CPUs in the system (or the +NIC maximum, if lower). The most efficient high-rate configuration +is likely the one with the smallest number of receive queues where no +receive queue overflows due to a saturated CPU, because in default +mode with interrupt coalescing enabled, the aggregate number of +interrupts (and thus work) grows with each additional queue. + +Per-cpu load can be observed using the mpstat utility, but note that on +processors with hyperthreading (HT), each hyperthread is represented as +a separate CPU. For interrupt handling, HT has shown no benefit in +initial tests, so limit the number of queues to the number of CPU cores +in the system. + + +RPS: Receive Packet Steering +============================ + +Receive Packet Steering (RPS) is logically a software implementation of +RSS. Being in software, it is necessarily called later in the datapath. +Whereas RSS selects the queue and hence CPU that will run the hardware +interrupt handler, RPS selects the CPU to perform protocol processing +above the interrupt handler. This is accomplished by placing the packet +on the desired CPU’s backlog queue and waking up the CPU for processing. +RPS has some advantages over RSS: 1) it can be used with any NIC, +2) software filters can easily be added to hash over new protocols, +3) it does not increase hardware device interrupt rate (although it does +introduce inter-processor interrupts (IPIs)). + +RPS is called during bottom half of the receive interrupt handler, when +a driver sends a packet up the network stack with netif_rx() or +netif_receive_skb(). These call the get_rps_cpu() function, which +selects the queue that should process a packet. + +The first step in determining the target CPU for RPS is to calculate a +flow hash over the packet’s addresses or ports (2-tuple or 4-tuple hash +depending on the protocol). This serves as a consistent hash of the +associated flow of the packet. The hash is either provided by hardware +or will be computed in the stack. Capable hardware can pass the hash in +the receive descriptor for the packet; this would usually be the same +hash used for RSS (e.g. computed Toeplitz hash). The hash is saved in +skb->rx_hash and can be used elsewhere in the stack as a hash of the +packet’s flow. + +Each receive hardware queue has an associated list of CPUs to which +RPS may enqueue packets for processing. For each received packet, +an index into the list is computed from the flow hash modulo the size +of the list. The indexed CPU is the target for processing the packet, +and the packet is queued to the tail of that CPU’s backlog queue. At +the end of the bottom half routine, IPIs are sent to any CPUs for which +packets have been queued to their backlog queue. The IPI wakes backlog +processing on the remote CPU, and any queued packets are then processed +up the networking stack. + +==== RPS Configuration + +RPS requires a kernel compiled with the CONFIG_RPS kconfig symbol (on +by default for SMP). Even when compiled in, RPS remains disabled until +explicitly configured. The list of CPUs to which RPS may forward traffic +can be configured for each receive queue using a sysfs file entry: + + /sys/class/net/<dev>/queues/rx-<n>/rps_cpus + +This file implements a bitmap of CPUs. RPS is disabled when it is zero +(the default), in which case packets are processed on the interrupting +CPU. Documentation/IRQ-affinity.txt explains how CPUs are assigned to +the bitmap. + +== Suggested Configuration + +For a single queue device, a typical RPS configuration would be to set +the rps_cpus to the CPUs in the same memory domain of the interrupting +CPU. If NUMA locality is not an issue, this could also be all CPUs in +the system. At high interrupt rate, it might be wise to exclude the +interrupting CPU from the map since that already performs much work. + +For a multi-queue system, if RSS is configured so that a hardware +receive queue is mapped to each CPU, then RPS is probably redundant +and unnecessary. If there are fewer hardware queues than CPUs, then +RPS might be beneficial if the rps_cpus for each queue are the ones that +share the same memory domain as the interrupting CPU for that queue. + +==== RPS Flow Limit + +RPS scales kernel receive processing across CPUs without introducing +reordering. The trade-off to sending all packets from the same flow +to the same CPU is CPU load imbalance if flows vary in packet rate. +In the extreme case a single flow dominates traffic. Especially on +common server workloads with many concurrent connections, such +behavior indicates a problem such as a misconfiguration or spoofed +source Denial of Service attack. + +Flow Limit is an optional RPS feature that prioritizes small flows +during CPU contention by dropping packets from large flows slightly +ahead of those from small flows. It is active only when an RPS or RFS +destination CPU approaches saturation. Once a CPU's input packet +queue exceeds half the maximum queue length (as set by sysctl +net.core.netdev_max_backlog), the kernel starts a per-flow packet +count over the last 256 packets. If a flow exceeds a set ratio (by +default, half) of these packets when a new packet arrives, then the +new packet is dropped. Packets from other flows are still only +dropped once the input packet queue reaches netdev_max_backlog. +No packets are dropped when the input packet queue length is below +the threshold, so flow limit does not sever connections outright: +even large flows maintain connectivity. + +== Interface + +Flow limit is compiled in by default (CONFIG_NET_FLOW_LIMIT), but not +turned on. It is implemented for each CPU independently (to avoid lock +and cache contention) and toggled per CPU by setting the relevant bit +in sysctl net.core.flow_limit_cpu_bitmap. It exposes the same CPU +bitmap interface as rps_cpus (see above) when called from procfs: + + /proc/sys/net/core/flow_limit_cpu_bitmap + +Per-flow rate is calculated by hashing each packet into a hashtable +bucket and incrementing a per-bucket counter. The hash function is +the same that selects a CPU in RPS, but as the number of buckets can +be much larger than the number of CPUs, flow limit has finer-grained +identification of large flows and fewer false positives. The default +table has 4096 buckets. This value can be modified through sysctl + + net.core.flow_limit_table_len + +The value is only consulted when a new table is allocated. Modifying +it does not update active tables. + +== Suggested Configuration + +Flow limit is useful on systems with many concurrent connections, +where a single connection taking up 50% of a CPU indicates a problem. +In such environments, enable the feature on all CPUs that handle +network rx interrupts (as set in /proc/irq/N/smp_affinity). + +The feature depends on the input packet queue length to exceed +the flow limit threshold (50%) + the flow history length (256). +Setting net.core.netdev_max_backlog to either 1000 or 10000 +performed well in experiments. + + +RFS: Receive Flow Steering +========================== + +While RPS steers packets solely based on hash, and thus generally +provides good load distribution, it does not take into account +application locality. This is accomplished by Receive Flow Steering +(RFS). The goal of RFS is to increase datacache hitrate by steering +kernel processing of packets to the CPU where the application thread +consuming the packet is running. RFS relies on the same RPS mechanisms +to enqueue packets onto the backlog of another CPU and to wake up that +CPU. + +In RFS, packets are not forwarded directly by the value of their hash, +but the hash is used as index into a flow lookup table. This table maps +flows to the CPUs where those flows are being processed. The flow hash +(see RPS section above) is used to calculate the index into this table. +The CPU recorded in each entry is the one which last processed the flow. +If an entry does not hold a valid CPU, then packets mapped to that entry +are steered using plain RPS. Multiple table entries may point to the +same CPU. Indeed, with many flows and few CPUs, it is very likely that +a single application thread handles flows with many different flow hashes. + +rps_sock_flow_table is a global flow table that contains the *desired* CPU +for flows: the CPU that is currently processing the flow in userspace. +Each table value is a CPU index that is updated during calls to recvmsg +and sendmsg (specifically, inet_recvmsg(), inet_sendmsg(), inet_sendpage() +and tcp_splice_read()). + +When the scheduler moves a thread to a new CPU while it has outstanding +receive packets on the old CPU, packets may arrive out of order. To +avoid this, RFS uses a second flow table to track outstanding packets +for each flow: rps_dev_flow_table is a table specific to each hardware +receive queue of each device. Each table value stores a CPU index and a +counter. The CPU index represents the *current* CPU onto which packets +for this flow are enqueued for further kernel processing. Ideally, kernel +and userspace processing occur on the same CPU, and hence the CPU index +in both tables is identical. This is likely false if the scheduler has +recently migrated a userspace thread while the kernel still has packets +enqueued for kernel processing on the old CPU. + +The counter in rps_dev_flow_table values records the length of the current +CPU's backlog when a packet in this flow was last enqueued. Each backlog +queue has a head counter that is incremented on dequeue. A tail counter +is computed as head counter + queue length. In other words, the counter +in rps_dev_flow[i] records the last element in flow i that has +been enqueued onto the currently designated CPU for flow i (of course, +entry i is actually selected by hash and multiple flows may hash to the +same entry i). + +And now the trick for avoiding out of order packets: when selecting the +CPU for packet processing (from get_rps_cpu()) the rps_sock_flow table +and the rps_dev_flow table of the queue that the packet was received on +are compared. If the desired CPU for the flow (found in the +rps_sock_flow table) matches the current CPU (found in the rps_dev_flow +table), the packet is enqueued onto that CPU’s backlog. If they differ, +the current CPU is updated to match the desired CPU if one of the +following is true: + +- The current CPU's queue head counter >= the recorded tail counter + value in rps_dev_flow[i] +- The current CPU is unset (equal to RPS_NO_CPU) +- The current CPU is offline + +After this check, the packet is sent to the (possibly updated) current +CPU. These rules aim to ensure that a flow only moves to a new CPU when +there are no packets outstanding on the old CPU, as the outstanding +packets could arrive later than those about to be processed on the new +CPU. + +==== RFS Configuration + +RFS is only available if the kconfig symbol CONFIG_RPS is enabled (on +by default for SMP). The functionality remains disabled until explicitly +configured. The number of entries in the global flow table is set through: + + /proc/sys/net/core/rps_sock_flow_entries + +The number of entries in the per-queue flow table are set through: + + /sys/class/net/<dev>/queues/rx-<n>/rps_flow_cnt + +== Suggested Configuration + +Both of these need to be set before RFS is enabled for a receive queue. +Values for both are rounded up to the nearest power of two. The +suggested flow count depends on the expected number of active connections +at any given time, which may be significantly less than the number of open +connections. We have found that a value of 32768 for rps_sock_flow_entries +works fairly well on a moderately loaded server. + +For a single queue device, the rps_flow_cnt value for the single queue +would normally be configured to the same value as rps_sock_flow_entries. +For a multi-queue device, the rps_flow_cnt for each queue might be +configured as rps_sock_flow_entries / N, where N is the number of +queues. So for instance, if rps_sock_flow_entries is set to 32768 and there +are 16 configured receive queues, rps_flow_cnt for each queue might be +configured as 2048. + + +Accelerated RFS +=============== + +Accelerated RFS is to RFS what RSS is to RPS: a hardware-accelerated load +balancing mechanism that uses soft state to steer flows based on where +the application thread consuming the packets of each flow is running. +Accelerated RFS should perform better than RFS since packets are sent +directly to a CPU local to the thread consuming the data. The target CPU +will either be the same CPU where the application runs, or at least a CPU +which is local to the application thread’s CPU in the cache hierarchy. + +To enable accelerated RFS, the networking stack calls the +ndo_rx_flow_steer driver function to communicate the desired hardware +queue for packets matching a particular flow. The network stack +automatically calls this function every time a flow entry in +rps_dev_flow_table is updated. The driver in turn uses a device specific +method to program the NIC to steer the packets. + +The hardware queue for a flow is derived from the CPU recorded in +rps_dev_flow_table. The stack consults a CPU to hardware queue map which +is maintained by the NIC driver. This is an auto-generated reverse map of +the IRQ affinity table shown by /proc/interrupts. Drivers can use +functions in the cpu_rmap (“CPU affinity reverse map”) kernel library +to populate the map. For each CPU, the corresponding queue in the map is +set to be one whose processing CPU is closest in cache locality. + +==== Accelerated RFS Configuration + +Accelerated RFS is only available if the kernel is compiled with +CONFIG_RFS_ACCEL and support is provided by the NIC device and driver. +It also requires that ntuple filtering is enabled via ethtool. The map +of CPU to queues is automatically deduced from the IRQ affinities +configured for each receive queue by the driver, so no additional +configuration should be necessary. + +== Suggested Configuration + +This technique should be enabled whenever one wants to use RFS and the +NIC supports hardware acceleration. + +XPS: Transmit Packet Steering +============================= + +Transmit Packet Steering is a mechanism for intelligently selecting +which transmit queue to use when transmitting a packet on a multi-queue +device. To accomplish this, a mapping from CPU to hardware queue(s) is +recorded. The goal of this mapping is usually to assign queues +exclusively to a subset of CPUs, where the transmit completions for +these queues are processed on a CPU within this set. This choice +provides two benefits. First, contention on the device queue lock is +significantly reduced since fewer CPUs contend for the same queue +(contention can be eliminated completely if each CPU has its own +transmit queue). Secondly, cache miss rate on transmit completion is +reduced, in particular for data cache lines that hold the sk_buff +structures. + +XPS is configured per transmit queue by setting a bitmap of CPUs that +may use that queue to transmit. The reverse mapping, from CPUs to +transmit queues, is computed and maintained for each network device. +When transmitting the first packet in a flow, the function +get_xps_queue() is called to select a queue. This function uses the ID +of the running CPU as a key into the CPU-to-queue lookup table. If the +ID matches a single queue, that is used for transmission. If multiple +queues match, one is selected by using the flow hash to compute an index +into the set. + +The queue chosen for transmitting a particular flow is saved in the +corresponding socket structure for the flow (e.g. a TCP connection). +This transmit queue is used for subsequent packets sent on the flow to +prevent out of order (ooo) packets. The choice also amortizes the cost +of calling get_xps_queues() over all packets in the flow. To avoid +ooo packets, the queue for a flow can subsequently only be changed if +skb->ooo_okay is set for a packet in the flow. This flag indicates that +there are no outstanding packets in the flow, so the transmit queue can +change without the risk of generating out of order packets. The +transport layer is responsible for setting ooo_okay appropriately. TCP, +for instance, sets the flag when all data for a connection has been +acknowledged. + +==== XPS Configuration + +XPS is only available if the kconfig symbol CONFIG_XPS is enabled (on by +default for SMP). The functionality remains disabled until explicitly +configured. To enable XPS, the bitmap of CPUs that may use a transmit +queue is configured using the sysfs file entry: + +/sys/class/net/<dev>/queues/tx-<n>/xps_cpus + +== Suggested Configuration + +For a network device with a single transmission queue, XPS configuration +has no effect, since there is no choice in this case. In a multi-queue +system, XPS is preferably configured so that each CPU maps onto one queue. +If there are as many queues as there are CPUs in the system, then each +queue can also map onto one CPU, resulting in exclusive pairings that +experience no contention. If there are fewer queues than CPUs, then the +best CPUs to share a given queue are probably those that share the cache +with the CPU that processes transmit completions for that queue +(transmit interrupts). + + +Further Information +=================== +RPS and RFS were introduced in kernel 2.6.35. XPS was incorporated into +2.6.38. Original patches were submitted by Tom Herbert +(therbert@google.com) + +Accelerated RFS was introduced in 2.6.35. Original patches were +submitted by Ben Hutchings (bwh@kernel.org) + +Authors: +Tom Herbert (therbert@google.com) +Willem de Bruijn (willemb@google.com) diff --git a/Documentation/networking/sctp.txt b/Documentation/networking/sctp.txt index 0c790a76910..97b810ca908 100644 --- a/Documentation/networking/sctp.txt +++ b/Documentation/networking/sctp.txt @@ -19,7 +19,6 @@ of SCTP that is RFC 2960 compliant and provides an programming interface referred to as the UDP-style API of the Sockets Extensions for SCTP, as proposed in IETF Internet-Drafts. - Caveats: -lksctp can be built as statically or as a module. However, be aware that @@ -33,6 +32,4 @@ For more information, please visit the lksctp project website: http://www.sf.net/projects/lksctp Or contact the lksctp developers through the mailing list: - <lksctp-developers@lists.sourceforge.net> - - + <linux-sctp@vger.kernel.org> diff --git a/Documentation/networking/smctr.txt b/Documentation/networking/smctr.txt deleted file mode 100644 index 9af25b810c1..00000000000 --- a/Documentation/networking/smctr.txt +++ /dev/null @@ -1,66 +0,0 @@ -Text File for the SMC TokenCard TokenRing Linux driver (smctr.c). - By Jay Schulist <jschlst@samba.org> - -The Linux SMC Token Ring driver works with the SMC TokenCard Elite (8115T) -ISA and SMC TokenCard Elite/A (8115T/A) MCA adapters. - -Latest information on this driver can be obtained on the Linux-SNA WWW site. -Please point your browser to: http://www.linux-sna.org - -This driver is rather simple to use. Select Y to Token Ring adapter support -in the kernel configuration. A choice for SMC Token Ring adapters will -appear. This drives supports all SMC ISA/MCA adapters. Choose this -option. I personally recommend compiling the driver as a module (M), but if you -you would like to compile it statically answer Y instead. - -This driver supports multiple adapters without the need to load multiple copies -of the driver. You should be able to load up to 7 adapters without any kernel -modifications, if you are in need of more please contact the maintainer of this -driver. - -Load the driver either by lilo/loadlin or as a module. When a module using the -following command will suffice for most: - -# modprobe smctr -smctr.c: v1.00 12/6/99 by jschlst@samba.org -tr0: SMC TokenCard 8115T at Io 0x300, Irq 10, Rom 0xd8000, Ram 0xcc000. - -Now just setup the device via ifconfig and set and routes you may have. After -this you are ready to start sending some tokens. - -Errata: -1). For anyone wondering where to pick up the SMC adapters please browse - to http://www.smc.com - -2). If you are the first/only Token Ring Client on a Token Ring LAN, please - specify the ringspeed with the ringspeed=[4/16] module option. If no - ringspeed is specified the driver will attempt to autodetect the ring - speed and/or if the adapter is the first/only station on the ring take - the appropriate actions. - - NOTE: Default ring speed is 16MB UTP. - -3). PnP support for this adapter sucks. I recommend hard setting the - IO/MEM/IRQ by the jumpers on the adapter. If this is not possible - load the module with the following io=[ioaddr] mem=[mem_addr] - irq=[irq_num]. - - The following IRQ, IO, and MEM settings are supported. - - IO ports: - 0x200, 0x220, 0x240, 0x260, 0x280, 0x2A0, 0x2C0, 0x2E0, 0x300, - 0x320, 0x340, 0x360, 0x380. - - IRQs: - 2, 3, 4, 5, 7, 8, 9, 10, 11, 12, 13, 14, 15 - - Memory addresses: - 0xA0000, 0xA4000, 0xA8000, 0xAC000, 0xB0000, 0xB4000, - 0xB8000, 0xBC000, 0xC0000, 0xC4000, 0xC8000, 0xCC000, - 0xD0000, 0xD4000, 0xD8000, 0xDC000, 0xE0000, 0xE4000, - 0xE8000, 0xEC000, 0xF0000, 0xF4000, 0xF8000, 0xFC000 - -This driver is under the GNU General Public License. Its Firmware image is -included as an initialized C-array and is licensed by SMC to the Linux -users of this driver. However no warranty about its fitness is expressed or -implied by SMC. diff --git a/Documentation/networking/spider_net.txt b/Documentation/networking/spider_net.txt index 4b4adb8eb14..b0b75f8463b 100644 --- a/Documentation/networking/spider_net.txt +++ b/Documentation/networking/spider_net.txt @@ -73,7 +73,7 @@ Thus, in an idle system, the GDACTDPA, tail and head pointers will all be pointing at the same descr, which should be "empty". All of the other descrs in the ring should be "empty" as well. -The show_rx_chain() routine will print out the the locations of the +The show_rx_chain() routine will print out the locations of the GDACTDPA, tail and head pointers. It will also summarize the contents of the ring, starting at the tail pointer, and listing the status of the descrs that follow. diff --git a/Documentation/networking/stmmac.txt b/Documentation/networking/stmmac.txt index 80a7a345490..2090895b08d 100644 --- a/Documentation/networking/stmmac.txt +++ b/Documentation/networking/stmmac.txt @@ -1,17 +1,19 @@ STMicroelectronics 10/100/1000 Synopsys Ethernet driver -Copyright (C) 2007-2010 STMicroelectronics Ltd +Copyright (C) 2007-2013 STMicroelectronics Ltd Author: Giuseppe Cavallaro <peppe.cavallaro@st.com> This is the driver for the MAC 10/100/1000 on-chip Ethernet controllers -(Synopsys IP blocks); it has been fully tested on STLinux platforms. +(Synopsys IP blocks). Currently this network device driver is for all STM embedded MAC/GMAC -(7xxx SoCs). Other platforms start using it i.e. ARM SPEAr. +(i.e. 7xxx/5xxx SoCs), SPEAr (arm), Loongson1B (mips) and XLINX XC2V3000 +FF1152AMT0221 D1215994A VIRTEX FPGA board. -DWC Ether MAC 10/100/1000 Universal version 3.41a and DWC Ether MAC 10/100 -Universal version 4.0 have been used for developing the first code -implementation. +DWC Ether MAC 10/100/1000 Universal version 3.70a (and older) and DWC Ether +MAC 10/100 Universal version 4.0 have been used for developing this driver. + +This driver supports both the platform bus and PCI. Please, for more information also visit: www.stlinux.com @@ -27,11 +29,11 @@ The kernel configuration option is STMMAC_ETH: dma_txsize: DMA tx ring size; buf_sz: DMA buffer size; tc: control the HW FIFO threshold; - tx_coe: Enable/Disable Tx Checksum Offload engine; watchdog: transmit timeout (in milliseconds); flow_ctrl: Flow control ability [on/off]; pause: Flow Control Pause Time; - tmrate: timer period (only if timer optimisation is configured). + eee_timer: tx EEE timer; + chain_mode: select chain mode instead of ring. 3) Command line options Driver parameters can be also passed in command line by using: @@ -52,31 +54,42 @@ net_device structure enabling the scatter/gather feature. When one or more packets are received, an interrupt happens. The interrupts are not queued so the driver has to scan all the descriptors in the ring during the receive process. -This is based on NAPI so the interrupt handler signals only if there is work to be -done, and it exits. +This is based on NAPI so the interrupt handler signals only if there is work +to be done, and it exits. Then the poll method will be scheduled at some future point. The incoming packets are stored, by the DMA, in a list of pre-allocated socket buffers in order to avoid the memcpy (Zero-copy). -4.3) Timer-Driver Interrupt -Instead of having the device that asynchronously notifies the frame receptions, the -driver configures a timer to generate an interrupt at regular intervals. -Based on the granularity of the timer, the frames that are received by the device -will experience different levels of latency. Some NICs have dedicated timer -device to perform this task. STMMAC can use either the RTC device or the TMU -channel 2 on STLinux platforms. -The timers frequency can be passed to the driver as parameter; when change it, -take care of both hardware capability and network stability/performance impact. -Several performance tests on STM platforms showed this optimisation allows to spare -the CPU while having the maximum throughput. +4.3) Interrupt Mitigation +The driver is able to mitigate the number of its DMA interrupts +using NAPI for the reception on chips older than the 3.50. +New chips have an HW RX-Watchdog used for this mitigation. + +On Tx-side, the mitigation schema is based on a SW timer that calls the +tx function (stmmac_tx) to reclaim the resource after transmitting the +frames. +Also there is another parameter (like a threshold) used to program +the descriptors avoiding to set the interrupt on completion bit in +when the frame is sent (xmit). + +Mitigation parameters can be tuned by ethtool. 4.4) WOL -Wake up on Lan feature through Magic Frame is only supported for the GMAC -core. +Wake up on Lan feature through Magic and Unicast frames are supported for the +GMAC core. 4.5) DMA descriptors Driver handles both normal and enhanced descriptors. The latter has been only -tested on DWC Ether MAC 10/100/1000 Universal version 3.41a. +tested on DWC Ether MAC 10/100/1000 Universal version 3.41a and later. + +STMMAC supports DMA descriptor to operate both in dual buffer (RING) +and linked-list(CHAINED) mode. In RING each descriptor points to two +data buffer pointers whereas in CHAINED mode they point to only one data +buffer pointer. RING mode is the default. + +In CHAINED mode each descriptor will have pointer to next descriptor in +the list, hence creating the explicit chaining in the descriptor itself, +whereas such explicit chaining is not possible in RING mode. 4.6) Ethtool support Ethtool is supported. Driver statistics and internal errors can be taken using: @@ -91,79 +104,268 @@ LRO is not supported. The driver is compatible with PAL to work with PHY and GPHY devices. 4.9) Platform information -Several information came from the platform; please refer to the -driver's Header file in include/linux directory. +Several driver's information can be passed through the platform +These are included in the include/linux/stmmac.h header file +and detailed below as well: struct plat_stmmacenet_data { + char *phy_bus_name; int bus_id; - int pbl; + int phy_addr; + int interface; + struct stmmac_mdio_bus_data *mdio_bus_data; + struct stmmac_dma_cfg *dma_cfg; int clk_csr; int has_gmac; int enh_desc; int tx_coe; + int rx_coe; int bugged_jumbo; int pmt; - void (*fix_mac_speed)(void *priv, unsigned int speed); - void (*bus_setup)(unsigned long ioaddr); -#ifdef CONFIG_STM_DRIVERS - struct stm_pad_config *pad_config; -#endif - void *bsp_priv; -}; + int force_sf_dma_mode; + int force_thresh_dma_mode; + int riwt_off; + void (*fix_mac_speed)(void *priv, unsigned int speed); + void (*bus_setup)(void __iomem *ioaddr); + void *(*setup)(struct platform_device *pdev); + int (*init)(struct platform_device *pdev, void *priv); + void (*exit)(struct platform_device *pdev, void *priv); + void *custom_cfg; + void *custom_data; + void *bsp_priv; + }; + +Where: + o phy_bus_name: phy bus name to attach to the stmmac. + o bus_id: bus identifier. + o phy_addr: the physical address can be passed from the platform. + If it is set to -1 the driver will automatically + detect it at run-time by probing all the 32 addresses. + o interface: PHY device's interface. + o mdio_bus_data: specific platform fields for the MDIO bus. + o dma_cfg: internal DMA parameters + o pbl: the Programmable Burst Length is maximum number of beats to + be transferred in one DMA transaction. + GMAC also enables the 4xPBL by default. + o fixed_burst/mixed_burst/burst_len + o clk_csr: fixed CSR Clock range selection. + o has_gmac: uses the GMAC core. + o enh_desc: if sets the MAC will use the enhanced descriptor structure. + o tx_coe: core is able to perform the tx csum in HW. + o rx_coe: the supports three check sum offloading engine types: + type_1, type_2 (full csum) and no RX coe. + o bugged_jumbo: some HWs are not able to perform the csum in HW for + over-sized frames due to limited buffer sizes. + Setting this flag the csum will be done in SW on + JUMBO frames. + o pmt: core has the embedded power module (optional). + o force_sf_dma_mode: force DMA to use the Store and Forward mode + instead of the Threshold. + o force_thresh_dma_mode: force DMA to use the Threshold mode other than + the Store and Forward mode. + o riwt_off: force to disable the RX watchdog feature and switch to NAPI mode. + o fix_mac_speed: this callback is used for modifying some syscfg registers + (on ST SoCs) according to the link speed negotiated by the + physical layer . + o bus_setup: perform HW setup of the bus. For example, on some ST platforms + this field is used to configure the AMBA bridge to generate more + efficient STBus traffic. + o setup/init/exit: callbacks used for calling a custom initialization; + this is sometime necessary on some platforms (e.g. ST boxes) + where the HW needs to have set some PIO lines or system cfg + registers. setup should return a pointer to private data, + which will be stored in bsp_priv, and then passed to init and + exit callbacks. init/exit callbacks should not use or modify + platform data. + o custom_cfg/custom_data: this is a custom configuration that can be passed + while initializing the resources. + o bsp_priv: another private pointer. + +For MDIO bus The we have: + + struct stmmac_mdio_bus_data { + int (*phy_reset)(void *priv); + unsigned int phy_mask; + int *irqs; + int probed_phy_irq; + }; Where: -- pbl (Programmable Burst Length) is maximum number of - beats to be transferred in one DMA transaction. - GMAC also enables the 4xPBL by default. -- fix_mac_speed and bus_setup are used to configure internal target - registers (on STM platforms); -- has_gmac: GMAC core is on board (get it at run-time in the next step); -- bus_id: bus identifier. -- tx_coe: core is able to perform the tx csum in HW. -- enh_desc: if sets the MAC will use the enhanced descriptor structure. -- clk_csr: CSR Clock range selection. -- bugged_jumbo: some HWs are not able to perform the csum in HW for - over-sized frames due to limited buffer sizes. Setting this - flag the csum will be done in SW on JUMBO frames. - -struct plat_stmmacphy_data { - int bus_id; - int phy_addr; - unsigned int phy_mask; - int interface; - int (*phy_reset)(void *priv); - void *priv; + o phy_reset: hook to reset the phy device attached to the bus. + o phy_mask: phy mask passed when register the MDIO bus within the driver. + o irqs: list of IRQs, one per PHY. + o probed_phy_irq: if irqs is NULL, use this for probed PHY. + +For DMA engine we have the following internal fields that should be +tuned according to the HW capabilities. + +struct stmmac_dma_cfg { + int pbl; + int fixed_burst; + int burst_len_supported; }; Where: -- bus_id: bus identifier; -- phy_addr: physical address used for the attached phy device; - set it to -1 to get it at run-time; -- interface: physical MII interface mode; -- phy_reset: hook to reset HW function. - -SOURCES: -- Kconfig -- Makefile -- stmmac_main.c: main network device driver; -- stmmac_mdio.c: mdio functions; -- stmmac_ethtool.c: ethtool support; -- stmmac_timer.[ch]: timer code used for mitigating the driver dma interrupts - Only tested on ST40 platforms based. -- stmmac.h: private driver structure; -- common.h: common definitions and VFTs; -- descs.h: descriptor structure definitions; -- dwmac1000_core.c: GMAC core functions; -- dwmac1000_dma.c: dma functions for the GMAC chip; -- dwmac1000.h: specific header file for the GMAC; -- dwmac100_core: MAC 100 core and dma code; -- dwmac100_dma.c: dma funtions for the MAC chip; -- dwmac1000.h: specific header file for the MAC; -- dwmac_lib.c: generic DMA functions shared among chips -- enh_desc.c: functions for handling enhanced descriptors -- norm_desc.c: functions for handling normal descriptors - -TODO: -- XGMAC controller is not supported. -- Review the timer optimisation code to use an embedded device that seems to be - available in new chip generations. + o pbl: Programmable Burst Length + o fixed_burst: program the DMA to use the fixed burst mode + o burst_len: this is the value we put in the register + supported values are provided as macros in + linux/stmmac.h header file. + +--- + +Below an example how the structures above are using on ST platforms. + + static struct plat_stmmacenet_data stxYYY_ethernet_platform_data = { + .has_gmac = 0, + .enh_desc = 0, + .fix_mac_speed = stxYYY_ethernet_fix_mac_speed, + | + |-> to write an internal syscfg + | on this platform when the + | link speed changes from 10 to + | 100 and viceversa + .init = &stmmac_claim_resource, + | + |-> On ST SoC this calls own "PAD" + | manager framework to claim + | all the resources necessary + | (GPIO ...). The .custom_cfg field + | is used to pass a custom config. +}; + +Below the usage of the stmmac_mdio_bus_data: on this SoC, in fact, +there are two MAC cores: one MAC is for MDIO Bus/PHY emulation +with fixed_link support. + +static struct stmmac_mdio_bus_data stmmac1_mdio_bus = { + .phy_reset = phy_reset; + | + |-> function to provide the phy_reset on this board + .phy_mask = 0, +}; + +static struct fixed_phy_status stmmac0_fixed_phy_status = { + .link = 1, + .speed = 100, + .duplex = 1, +}; + +During the board's device_init we can configure the first +MAC for fixed_link by calling: + fixed_phy_add(PHY_POLL, 1, &stmmac0_fixed_phy_status));) +and the second one, with a real PHY device attached to the bus, +by using the stmmac_mdio_bus_data structure (to provide the id, the +reset procedure etc). + +4.10) List of source files: + o Kconfig + o Makefile + o stmmac_main.c: main network device driver; + o stmmac_mdio.c: mdio functions; + o stmmac_pci: PCI driver; + o stmmac_platform.c: platform driver + o stmmac_ethtool.c: ethtool support; + o stmmac_timer.[ch]: timer code used for mitigating the driver dma interrupts + (only tested on ST40 platforms based); + o stmmac.h: private driver structure; + o common.h: common definitions and VFTs; + o descs.h: descriptor structure definitions; + o dwmac1000_core.c: GMAC core functions; + o dwmac1000_dma.c: dma functions for the GMAC chip; + o dwmac1000.h: specific header file for the GMAC; + o dwmac100_core: MAC 100 core and dma code; + o dwmac100_dma.c: dma functions for the MAC chip; + o dwmac1000.h: specific header file for the MAC; + o dwmac_lib.c: generic DMA functions shared among chips; + o enh_desc.c: functions for handling enhanced descriptors; + o norm_desc.c: functions for handling normal descriptors; + o chain_mode.c/ring_mode.c:: functions to manage RING/CHAINED modes; + o mmc_core.c/mmc.h: Management MAC Counters; + o stmmac_hwtstamp.c: HW timestamp support for PTP + o stmmac_ptp.c: PTP 1588 clock + +5) Debug Information + +The driver exports many information i.e. internal statistics, +debug information, MAC and DMA registers etc. + +These can be read in several ways depending on the +type of the information actually needed. + +For example a user can be use the ethtool support +to get statistics: e.g. using: ethtool -S ethX +(that shows the Management counters (MMC) if supported) +or sees the MAC/DMA registers: e.g. using: ethtool -d ethX + +Compiling the Kernel with CONFIG_DEBUG_FS and enabling the +STMMAC_DEBUG_FS option the driver will export the following +debugfs entries: + +/sys/kernel/debug/stmmaceth/descriptors_status + To show the DMA TX/RX descriptor rings + +Developer can also use the "debug" module parameter to get +further debug information. + +In the end, there are other macros (that cannot be enabled +via menuconfig) to turn-on the RX/TX DMA debugging, +specific MAC core debug printk etc. Others to enable the +debug in the TX and RX processes. +All these are only useful during the developing stage +and should never enabled inside the code for general usage. +In fact, these can generate an huge amount of debug messages. + +6) Energy Efficient Ethernet + +Energy Efficient Ethernet(EEE) enables IEEE 802.3 MAC sublayer along +with a family of Physical layer to operate in the Low power Idle(LPI) +mode. The EEE mode supports the IEEE 802.3 MAC operation at 100Mbps, +1000Mbps & 10Gbps. + +The LPI mode allows power saving by switching off parts of the +communication device functionality when there is no data to be +transmitted & received. The system on both the side of the link can +disable some functionalities & save power during the period of low-link +utilization. The MAC controls whether the system should enter or exit +the LPI mode & communicate this to PHY. + +As soon as the interface is opened, the driver verifies if the EEE can +be supported. This is done by looking at both the DMA HW capability +register and the PHY devices MCD registers. +To enter in Tx LPI mode the driver needs to have a software timer +that enable and disable the LPI mode when there is nothing to be +transmitted. + +7) Extended descriptors +The extended descriptors give us information about the receive Ethernet payload +when it is carrying PTP packets or TCP/UDP/ICMP over IP. +These are not available on GMAC Synopsys chips older than the 3.50. +At probe time the driver will decide if these can be actually used. +This support also is mandatory for PTPv2 because the extra descriptors 6 and 7 +are used for saving the hardware timestamps. + +8) Precision Time Protocol (PTP) +The driver supports the IEEE 1588-2002, Precision Time Protocol (PTP), +which enables precise synchronization of clocks in measurement and +control systems implemented with technologies such as network +communication. + +In addition to the basic timestamp features mentioned in IEEE 1588-2002 +Timestamps, new GMAC cores support the advanced timestamp features. +IEEE 1588-2008 that can be enabled when configure the Kernel. + +9) SGMII/RGMII supports +New GMAC devices provide own way to manage RGMII/SGMII. +This information is available at run-time by looking at the +HW capability register. This means that the stmmac can manage +auto-negotiation and link status w/o using the PHYLIB stuff +In fact, the HW provides a subset of extended registers to +restart the ANE, verify Full/Half duplex mode and Speed. +Also thanks to these registers it is possible to look at the +Auto-negotiated Link Parter Ability. + +10) TODO: + o XGMAC is not supported. + o Complete the TBI & RTBI support. + o extend VLAN support for 3.70a SYNP GMAC. diff --git a/Documentation/networking/tc-actions-env-rules.txt b/Documentation/networking/tc-actions-env-rules.txt index dcadf6f88e3..70d6cf60825 100644 --- a/Documentation/networking/tc-actions-env-rules.txt +++ b/Documentation/networking/tc-actions-env-rules.txt @@ -1,5 +1,5 @@ -The "enviromental" rules for authors of any new tc actions are: +The "environmental" rules for authors of any new tc actions are: 1) If you stealeth or borroweth any packet thou shalt be branching from the righteous path and thou shalt cloneth. @@ -20,7 +20,7 @@ this way any action downstream can stomp on the packet. 3) Dropping packets you don't own is a no-no. You simply return TC_ACT_SHOT to the caller and they will drop it. -The "enviromental" rules for callers of actions (qdiscs etc) are: +The "environmental" rules for callers of actions (qdiscs etc) are: *) Thou art responsible for freeing anything returned as being TC_ACT_SHOT/STOLEN/QUEUED. If none of TC_ACT_SHOT/STOLEN/QUEUED is diff --git a/Documentation/networking/tcp.txt b/Documentation/networking/tcp.txt index 7d11bb5dc30..bdc4c0db51e 100644 --- a/Documentation/networking/tcp.txt +++ b/Documentation/networking/tcp.txt @@ -30,7 +30,7 @@ A congestion control mechanism can be registered through functions in tcp_cong.c. The functions used by the congestion control mechanism are registered via passing a tcp_congestion_ops struct to tcp_register_congestion_control. As a minimum name, ssthresh, -cong_avoid, min_cwnd must be valid. +cong_avoid must be valid. Private data for a congestion control mechanism is stored in tp->ca_priv. tcp_ca(tp) returns a pointer to this space. This is preallocated space - it diff --git a/Documentation/networking/team.txt b/Documentation/networking/team.txt new file mode 100644 index 00000000000..5a013686b9e --- /dev/null +++ b/Documentation/networking/team.txt @@ -0,0 +1,2 @@ +Team devices are driven from userspace via libteam library which is here: + https://github.com/jpirko/libteam diff --git a/Documentation/networking/timestamping.txt b/Documentation/networking/timestamping.txt index 98097d8cb91..bc355412490 100644 --- a/Documentation/networking/timestamping.txt +++ b/Documentation/networking/timestamping.txt @@ -21,26 +21,38 @@ has such a feature). SO_TIMESTAMPING: -Instructs the socket layer which kind of information is wanted. The -parameter is an integer with some of the following bits set. Setting -other bits is an error and doesn't change the current state. - -SOF_TIMESTAMPING_TX_HARDWARE: try to obtain send time stamp in hardware -SOF_TIMESTAMPING_TX_SOFTWARE: if SOF_TIMESTAMPING_TX_HARDWARE is off or - fails, then do it in software -SOF_TIMESTAMPING_RX_HARDWARE: return the original, unmodified time stamp - as generated by the hardware -SOF_TIMESTAMPING_RX_SOFTWARE: if SOF_TIMESTAMPING_RX_HARDWARE is off or - fails, then do it in software -SOF_TIMESTAMPING_RAW_HARDWARE: return original raw hardware time stamp -SOF_TIMESTAMPING_SYS_HARDWARE: return hardware time stamp transformed to - the system time base -SOF_TIMESTAMPING_SOFTWARE: return system time stamp generated in - software - -SOF_TIMESTAMPING_TX/RX determine how time stamps are generated. -SOF_TIMESTAMPING_RAW/SYS determine how they are reported in the -following control message: +Instructs the socket layer which kind of information should be collected +and/or reported. The parameter is an integer with some of the following +bits set. Setting other bits is an error and doesn't change the current +state. + +Four of the bits are requests to the stack to try to generate +timestamps. Any combination of them is valid. + +SOF_TIMESTAMPING_TX_HARDWARE: try to obtain send time stamps in hardware +SOF_TIMESTAMPING_TX_SOFTWARE: try to obtain send time stamps in software +SOF_TIMESTAMPING_RX_HARDWARE: try to obtain receive time stamps in hardware +SOF_TIMESTAMPING_RX_SOFTWARE: try to obtain receive time stamps in software + +The other three bits control which timestamps will be reported in a +generated control message. If none of these bits are set or if none of +the set bits correspond to data that is available, then the control +message will not be generated: + +SOF_TIMESTAMPING_SOFTWARE: report systime if available +SOF_TIMESTAMPING_SYS_HARDWARE: report hwtimetrans if available +SOF_TIMESTAMPING_RAW_HARDWARE: report hwtimeraw if available + +It is worth noting that timestamps may be collected for reasons other +than being requested by a particular socket with +SOF_TIMESTAMPING_[TR]X_(HARD|SOFT)WARE. For example, most drivers that +can generate hardware receive timestamps ignore +SOF_TIMESTAMPING_RX_HARDWARE. It is still a good idea to set that flag +in case future drivers pay attention. + +If timestamps are reported, they will appear in a control message with +cmsg_level==SOL_SOCKET, cmsg_type==SO_TIMESTAMPING, and a payload like +this: struct scm_timestamping { struct timespec systime; @@ -85,7 +97,7 @@ Filled in if SOF_TIMESTAMPING_SYS_HARDWARE is set. Requires support by the network device and will be empty without that support. -SIOCSHWTSTAMP: +SIOCSHWTSTAMP, SIOCGHWTSTAMP: Hardware time stamping must also be initialized for each device driver that is expected to do hardware time stamping. The parameter is defined in @@ -115,6 +127,10 @@ Only a processes with admin rights may change the configuration. User space is responsible to ensure that multiple processes don't interfere with each other and that the settings are reset. +Any process can read the actual configuration by passing this +structure to ioctl(SIOCGHWTSTAMP) in the same way. However, this has +not been implemented in all drivers. + /* possible values for hwtstamp_config->tx_type */ enum { /* @@ -157,7 +173,8 @@ DEVICE IMPLEMENTATION A driver which supports hardware time stamping must support the SIOCSHWTSTAMP ioctl and update the supplied struct hwtstamp_config with -the actual values as described in the section on SIOCSHWTSTAMP. +the actual values as described in the section on SIOCSHWTSTAMP. It +should also support SIOCGHWTSTAMP. Time stamps for received packets must be stored in the skb. To get a pointer to the shared time stamp structure of the skb call skb_hwtstamps(). Then @@ -185,6 +202,9 @@ Time stamps for outgoing packets are to be generated as follows: and not free the skb. A driver not supporting hardware time stamping doesn't do that. A driver must never touch sk_buff::tstamp! It is used to store software generated time stamps by the network subsystem. +- Driver should call skb_tx_timestamp() as close to passing sk_buff to hardware + as possible. skb_tx_timestamp() provides a software time stamp if requested + and hardware timestamping is not possible (SKBTX_IN_PROGRESS not set). - As soon as the driver has sent the packet and/or obtained a hardware time stamp for it, it passes the time stamp back by calling skb_hwtstamp_tx() with the original skb, the raw @@ -195,6 +215,3 @@ Time stamps for outgoing packets are to be generated as follows: this would occur at a later time in the processing pipeline than other software time stamping and therefore could lead to unexpected deltas between time stamps. -- If the driver did not set the SKBTX_IN_PROGRESS flag (see above), then - dev_hard_start_xmit() checks whether software time stamping - is wanted as fallback and potentially generates the time stamp. diff --git a/Documentation/networking/timestamping/.gitignore b/Documentation/networking/timestamping/.gitignore index 71e81eb2e22..a380159765c 100644 --- a/Documentation/networking/timestamping/.gitignore +++ b/Documentation/networking/timestamping/.gitignore @@ -1 +1,2 @@ timestamping +hwtstamp_config diff --git a/Documentation/networking/timestamping/Makefile b/Documentation/networking/timestamping/Makefile index e79973443e9..d934afc8306 100644 --- a/Documentation/networking/timestamping/Makefile +++ b/Documentation/networking/timestamping/Makefile @@ -2,12 +2,13 @@ obj- := dummy.o # List of programs to build -hostprogs-y := timestamping +hostprogs-y := timestamping hwtstamp_config # Tell kbuild to always build the programs always := $(hostprogs-y) HOSTCFLAGS_timestamping.o += -I$(objtree)/usr/include +HOSTCFLAGS_hwtstamp_config.o += -I$(objtree)/usr/include clean: - rm -f timestamping + rm -f timestamping hwtstamp_config diff --git a/Documentation/networking/timestamping/hwtstamp_config.c b/Documentation/networking/timestamping/hwtstamp_config.c new file mode 100644 index 00000000000..e8b685a7f15 --- /dev/null +++ b/Documentation/networking/timestamping/hwtstamp_config.c @@ -0,0 +1,134 @@ +/* Test program for SIOC{G,S}HWTSTAMP + * Copyright 2013 Solarflare Communications + * Author: Ben Hutchings + */ + +#include <errno.h> +#include <stdio.h> +#include <stdlib.h> +#include <string.h> + +#include <sys/socket.h> +#include <sys/ioctl.h> + +#include <linux/if.h> +#include <linux/net_tstamp.h> +#include <linux/sockios.h> + +static int +lookup_value(const char **names, int size, const char *name) +{ + int value; + + for (value = 0; value < size; value++) + if (names[value] && strcasecmp(names[value], name) == 0) + return value; + + return -1; +} + +static const char * +lookup_name(const char **names, int size, int value) +{ + return (value >= 0 && value < size) ? names[value] : NULL; +} + +static void list_names(FILE *f, const char **names, int size) +{ + int value; + + for (value = 0; value < size; value++) + if (names[value]) + fprintf(f, " %s\n", names[value]); +} + +static const char *tx_types[] = { +#define TX_TYPE(name) [HWTSTAMP_TX_ ## name] = #name + TX_TYPE(OFF), + TX_TYPE(ON), + TX_TYPE(ONESTEP_SYNC) +#undef TX_TYPE +}; +#define N_TX_TYPES ((int)(sizeof(tx_types) / sizeof(tx_types[0]))) + +static const char *rx_filters[] = { +#define RX_FILTER(name) [HWTSTAMP_FILTER_ ## name] = #name + RX_FILTER(NONE), + RX_FILTER(ALL), + RX_FILTER(SOME), + RX_FILTER(PTP_V1_L4_EVENT), + RX_FILTER(PTP_V1_L4_SYNC), + RX_FILTER(PTP_V1_L4_DELAY_REQ), + RX_FILTER(PTP_V2_L4_EVENT), + RX_FILTER(PTP_V2_L4_SYNC), + RX_FILTER(PTP_V2_L4_DELAY_REQ), + RX_FILTER(PTP_V2_L2_EVENT), + RX_FILTER(PTP_V2_L2_SYNC), + RX_FILTER(PTP_V2_L2_DELAY_REQ), + RX_FILTER(PTP_V2_EVENT), + RX_FILTER(PTP_V2_SYNC), + RX_FILTER(PTP_V2_DELAY_REQ), +#undef RX_FILTER +}; +#define N_RX_FILTERS ((int)(sizeof(rx_filters) / sizeof(rx_filters[0]))) + +static void usage(void) +{ + fputs("Usage: hwtstamp_config if_name [tx_type rx_filter]\n" + "tx_type is any of (case-insensitive):\n", + stderr); + list_names(stderr, tx_types, N_TX_TYPES); + fputs("rx_filter is any of (case-insensitive):\n", stderr); + list_names(stderr, rx_filters, N_RX_FILTERS); +} + +int main(int argc, char **argv) +{ + struct ifreq ifr; + struct hwtstamp_config config; + const char *name; + int sock; + + if ((argc != 2 && argc != 4) || (strlen(argv[1]) >= IFNAMSIZ)) { + usage(); + return 2; + } + + if (argc == 4) { + config.flags = 0; + config.tx_type = lookup_value(tx_types, N_TX_TYPES, argv[2]); + config.rx_filter = lookup_value(rx_filters, N_RX_FILTERS, argv[3]); + if (config.tx_type < 0 || config.rx_filter < 0) { + usage(); + return 2; + } + } + + sock = socket(AF_INET, SOCK_DGRAM, 0); + if (sock < 0) { + perror("socket"); + return 1; + } + + strcpy(ifr.ifr_name, argv[1]); + ifr.ifr_data = (caddr_t)&config; + + if (ioctl(sock, (argc == 2) ? SIOCGHWTSTAMP : SIOCSHWTSTAMP, &ifr)) { + perror("ioctl"); + return 1; + } + + printf("flags = %#x\n", config.flags); + name = lookup_name(tx_types, N_TX_TYPES, config.tx_type); + if (name) + printf("tx_type = %s\n", name); + else + printf("tx_type = %d\n", config.tx_type); + name = lookup_name(rx_filters, N_RX_FILTERS, config.rx_filter); + if (name) + printf("rx_filter = %s\n", name); + else + printf("rx_filter = %d\n", config.rx_filter); + + return 0; +} diff --git a/Documentation/networking/tms380tr.txt b/Documentation/networking/tms380tr.txt deleted file mode 100644 index 1f73e13058d..00000000000 --- a/Documentation/networking/tms380tr.txt +++ /dev/null @@ -1,147 +0,0 @@ -Text file for the Linux SysKonnect Token Ring ISA/PCI Adapter Driver. - Text file by: Jay Schulist <jschlst@samba.org> - -The Linux SysKonnect Token Ring driver works with the SysKonnect TR4/16(+) ISA, -SysKonnect TR4/16(+) PCI, SysKonnect TR4/16 PCI, and older revisions of the -SK NET TR4/16 ISA card. - -Latest information on this driver can be obtained on the Linux-SNA WWW site. -Please point your browser to: -http://www.linux-sna.org - -Many thanks to Christoph Goos for his excellent work on this driver and -SysKonnect for donating the adapters to Linux-SNA for the testing and -maintenance of this device driver. - -Important information to be noted: -1. Adapters can be slow to open (~20 secs) and close (~5 secs), please be - patient. -2. This driver works very well when autoprobing for adapters. Why even - think about those nasty io/int/dma settings of modprobe when the driver - will do it all for you! - -This driver is rather simple to use. Select Y to Token Ring adapter support -in the kernel configuration. A choice for SysKonnect Token Ring adapters will -appear. This drives supports all SysKonnect ISA and PCI adapters. Choose this -option. I personally recommend compiling the driver as a module (M), but if you -you would like to compile it statically answer Y instead. - -This driver supports multiple adapters without the need to load multiple copies -of the driver. You should be able to load up to 7 adapters without any kernel -modifications, if you are in need of more please contact the maintainer of this -driver. - -Load the driver either by lilo/loadlin or as a module. When a module using the -following command will suffice for most: - -# modprobe sktr - -This will produce output similar to the following: (Output is user specific) - -sktr.c: v1.01 08/29/97 by Christoph Goos -tr0: SK NET TR 4/16 PCI found at 0x6100, using IRQ 17. -tr1: SK NET TR 4/16 PCI found at 0x6200, using IRQ 16. -tr2: SK NET TR 4/16 ISA found at 0xa20, using IRQ 10 and DMA 5. - -Now just setup the device via ifconfig and set and routes you may have. After -this you are ready to start sending some tokens. - -Errata: -For anyone wondering where to pick up the SysKonnect adapters please browse -to http://www.syskonnect.com - -This driver is under the GNU General Public License. Its Firmware image is -included as an initialized C-array and is licensed by SysKonnect to the Linux -users of this driver. However no warranty about its fitness is expressed or -implied by SysKonnect. - -Below find attached the setting for the SK NET TR 4/16 ISA adapters -------------------------------------------------------------------- - - *************************** - *** C O N T E N T S *** - *************************** - - 1) Location of DIP-Switch W1 - 2) Default settings - 3) DIP-Switch W1 description - - - ============================================================== - CHAPTER 1 LOCATION OF DIP-SWITCH - ============================================================== - -UÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄ¿ -þUÄÄÄÄÄÄ¿ UÄÄÄÄÄ¿ UÄÄÄ¿ þ -þAÄÄÄÄÄÄU W1 AÄÄÄÄÄU UÄÄÄÄ¿ þ þ þ -þUÄÄÄÄÄÄ¿ þ þ þ þ UÄÄÅ¿ -þAÄÄÄÄÄÄU UÄÄÄÄÄÄÄÄÄÄÄ¿ AÄÄÄÄU þ þ þ þþ -þUÄÄÄÄÄÄ¿ þ þ UÄÄÄ¿ AÄÄÄU AÄÄÅU -þAÄÄÄÄÄÄU þ TMS380C26 þ þ þ þ -þUÄÄÄÄÄÄ¿ þ þ AÄÄÄU AÄ¿ -þAÄÄÄÄÄÄU þ þ þ þ -þ AÄÄÄÄÄÄÄÄÄÄÄU þ þ -þ þ þ -þ AÄU -þ þ -þ þ -þ þ -þ þ -AÄÄÄÄÄÄÄÄÄÄÄÄAÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄAÄÄAÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄAÄÄÄÄÄÄÄÄÄU - AÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄU AÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄÄU - - ============================================================== - CHAPTER 2 DEFAULT SETTINGS - ============================================================== - - W1 1 2 3 4 5 6 7 8 - +------------------------------+ - | ON X | - | OFF X X X X X X X | - +------------------------------+ - - W1.1 = ON Adapter drives address lines SA17..19 - W1.2 - 1.5 = OFF BootROM disabled - W1.6 - 1.8 = OFF I/O address 0A20h - - ============================================================== - CHAPTER 3 DIP SWITCH W1 DESCRIPTION - ============================================================== - - UÄÄÄAÄÄÄAÄÄÄAÄÄÄAÄÄÄAÄÄÄAÄÄÄAÄÄÄ¿ ON - þ 1 þ 2 þ 3 þ 4 þ 5 þ 6 þ 7 þ 8 þ - AÄÄÄAÄÄÄAÄÄÄAÄÄÄAÄÄÄAÄÄÄAÄÄÄAÄÄÄU OFF - |AD | BootROM Addr. | I/O | - +-+-+-------+-------+-----+-----+ - | | | - | | +------ 6 7 8 - | | ON ON ON 1900h - | | ON ON OFF 0900h - | | ON OFF ON 1980h - | | ON OFF OFF 0980h - | | OFF ON ON 1b20h - | | OFF ON OFF 0b20h - | | OFF OFF ON 1a20h - | | OFF OFF OFF 0a20h (+) - | | - | | - | +-------- 2 3 4 5 - | OFF x x x disabled (+) - | ON ON ON ON C0000 - | ON ON ON OFF C4000 - | ON ON OFF ON C8000 - | ON ON OFF OFF CC000 - | ON OFF ON ON D0000 - | ON OFF ON OFF D4000 - | ON OFF OFF ON D8000 - | ON OFF OFF OFF DC000 - | - | - +----- 1 - OFF adapter does NOT drive SA<17..19> - ON adapter drives SA<17..19> (+) - - - (+) means default setting - - ******************************** diff --git a/Documentation/networking/tproxy.txt b/Documentation/networking/tproxy.txt index 7b5996d9357..ec11429e1d4 100644 --- a/Documentation/networking/tproxy.txt +++ b/Documentation/networking/tproxy.txt @@ -2,9 +2,8 @@ Transparent proxy support ========================= This feature adds Linux 2.2-like transparent proxy support to current kernels. -To use it, enable NETFILTER_TPROXY, the socket match and the TPROXY target in -your kernel config. You will need policy routing too, so be sure to enable that -as well. +To use it, enable the socket match and the TPROXY target in your kernel config. +You will need policy routing too, so be sure to enable that as well. 1. Making non-local sockets work diff --git a/Documentation/networking/tuntap.txt b/Documentation/networking/tuntap.txt index c0aab985bad..949d5dcdd9a 100644 --- a/Documentation/networking/tuntap.txt +++ b/Documentation/networking/tuntap.txt @@ -105,6 +105,83 @@ Copyright (C) 1999-2000 Maxim Krasnyansky <max_mk@yahoo.com> Proto [2 bytes] Raw protocol(IP, IPv6, etc) frame. + 3.3 Multiqueue tuntap interface: + + From version 3.8, Linux supports multiqueue tuntap which can uses multiple + file descriptors (queues) to parallelize packets sending or receiving. The + device allocation is the same as before, and if user wants to create multiple + queues, TUNSETIFF with the same device name must be called many times with + IFF_MULTI_QUEUE flag. + + char *dev should be the name of the device, queues is the number of queues to + be created, fds is used to store and return the file descriptors (queues) + created to the caller. Each file descriptor were served as the interface of a + queue which could be accessed by userspace. + + #include <linux/if.h> + #include <linux/if_tun.h> + + int tun_alloc_mq(char *dev, int queues, int *fds) + { + struct ifreq ifr; + int fd, err, i; + + if (!dev) + return -1; + + memset(&ifr, 0, sizeof(ifr)); + /* Flags: IFF_TUN - TUN device (no Ethernet headers) + * IFF_TAP - TAP device + * + * IFF_NO_PI - Do not provide packet information + * IFF_MULTI_QUEUE - Create a queue of multiqueue device + */ + ifr.ifr_flags = IFF_TAP | IFF_NO_PI | IFF_MULTI_QUEUE; + strcpy(ifr.ifr_name, dev); + + for (i = 0; i < queues; i++) { + if ((fd = open("/dev/net/tun", O_RDWR)) < 0) + goto err; + err = ioctl(fd, TUNSETIFF, (void *)&ifr); + if (err) { + close(fd); + goto err; + } + fds[i] = fd; + } + + return 0; + err: + for (--i; i >= 0; i--) + close(fds[i]); + return err; + } + + A new ioctl(TUNSETQUEUE) were introduced to enable or disable a queue. When + calling it with IFF_DETACH_QUEUE flag, the queue were disabled. And when + calling it with IFF_ATTACH_QUEUE flag, the queue were enabled. The queue were + enabled by default after it was created through TUNSETIFF. + + fd is the file descriptor (queue) that we want to enable or disable, when + enable is true we enable it, otherwise we disable it + + #include <linux/if.h> + #include <linux/if_tun.h> + + int tun_set_queue(int fd, int enable) + { + struct ifreq ifr; + + memset(&ifr, 0, sizeof(ifr)); + + if (enable) + ifr.ifr_flags = IFF_ATTACH_QUEUE; + else + ifr.ifr_flags = IFF_DETACH_QUEUE; + + return ioctl(fd, TUNSETQUEUE, (void *)&ifr); + } + Universal TUN/TAP device driver Frequently Asked Question. 1. What platforms are supported by TUN/TAP driver ? diff --git a/Documentation/networking/vortex.txt b/Documentation/networking/vortex.txt index bd70976b816..97282da82b7 100644 --- a/Documentation/networking/vortex.txt +++ b/Documentation/networking/vortex.txt @@ -67,8 +67,8 @@ Module parameters ================= There are several parameters which may be provided to the driver when -its module is loaded. These are usually placed in /etc/modprobe.conf -(/etc/modules.conf in 2.4). Example: +its module is loaded. These are usually placed in /etc/modprobe.d/*.conf +configuration files. Example: options 3c59x debug=3 rx_copybreak=300 @@ -178,7 +178,7 @@ max_interrupt_work=N The driver's interrupt service routine can handle many receive and transmit packets in a single invocation. It does this in a loop. - The value of max_interrupt_work governs how mnay times the interrupt + The value of max_interrupt_work governs how many times the interrupt service routine will loop. The default value is 32 loops. If this is exceeded the interrupt service routine gives up and generates a warning message "eth0: Too much work in interrupt". @@ -359,7 +359,7 @@ steps you should take: - OK, it's a driver problem. You need to generate a report. Typically this is an email to the - maintainer and/or linux-net@vger.kernel.org. The maintainer's + maintainer and/or netdev@vger.kernel.org. The maintainer's email address will be in the driver source or in the MAINTAINERS file. - The contents of your report will vary a lot depending upon the @@ -425,7 +425,7 @@ steps you should take: 1) Increase the debug level. Usually this is done via: a) modprobe driver debug=7 - b) In /etc/modprobe.conf (or /etc/modules.conf for 2.4): + b) In /etc/modprobe.d/driver.conf: options driver debug=7 2) Recreate the problem with the higher debug level, diff --git a/Documentation/networking/vxge.txt b/Documentation/networking/vxge.txt index d2e2997e6fa..bb76c667a47 100644 --- a/Documentation/networking/vxge.txt +++ b/Documentation/networking/vxge.txt @@ -91,10 +91,3 @@ v) addr_learn_en virtualization environment. Valid range: 0,1 (disabled, enabled respectively) Default: 0 - -4) Troubleshooting: -------------------- - -To resolve an issue with the source code or X3100 series adapter, please collect -the statistics, register dumps using ethool, relevant logs and email them to -support@neterion.com. diff --git a/Documentation/networking/vxlan.txt b/Documentation/networking/vxlan.txt new file mode 100644 index 00000000000..6d993510f09 --- /dev/null +++ b/Documentation/networking/vxlan.txt @@ -0,0 +1,47 @@ +Virtual eXtensible Local Area Networking documentation +====================================================== + +The VXLAN protocol is a tunnelling protocol that is designed to +solve the problem of limited number of available VLAN's (4096). +With VXLAN identifier is expanded to 24 bits. + +It is a draft RFC standard, that is implemented by Cisco Nexus, +Vmware and Brocade. The protocol runs over UDP using a single +destination port (still not standardized by IANA). +This document describes the Linux kernel tunnel device, +there is also an implantation of VXLAN for Openvswitch. + +Unlike most tunnels, a VXLAN is a 1 to N network, not just point +to point. A VXLAN device can either dynamically learn the IP address +of the other end, in a manner similar to a learning bridge, or the +forwarding entries can be configured statically. + +The management of vxlan is done in a similar fashion to it's +too closest neighbors GRE and VLAN. Configuring VXLAN requires +the version of iproute2 that matches the kernel release +where VXLAN was first merged upstream. + +1. Create vxlan device + # ip li add vxlan0 type vxlan id 42 group 239.1.1.1 dev eth1 + +This creates a new device (vxlan0). The device uses the +the multicast group 239.1.1.1 over eth1 to handle packets where +no entry is in the forwarding table. + +2. Delete vxlan device + # ip link delete vxlan0 + +3. Show vxlan info + # ip -d link show vxlan0 + +It is possible to create, destroy and display the vxlan +forwarding table using the new bridge command. + +1. Create forwarding table entry + # bridge fdb add to 00:17:42:8a:b4:05 dst 192.19.0.2 dev vxlan0 + +2. Delete forwarding table entry + # bridge fdb delete 00:17:42:8a:b4:05 dev vxlan0 + +3. Show forwarding table + # bridge fdb show dev vxlan0 diff --git a/Documentation/networking/x25-iface.txt b/Documentation/networking/x25-iface.txt index 78f662ee062..7f213b556e8 100644 --- a/Documentation/networking/x25-iface.txt +++ b/Documentation/networking/x25-iface.txt @@ -105,7 +105,7 @@ reduced by the following measures or a combination thereof: later. The lapb module interface was modified to support this. Its data_indication() method should now transparently pass the - netif_rx() return value to the (lapb mopdule) caller. + netif_rx() return value to the (lapb module) caller. (2) Drivers for kernel versions 2.2.x should always check the global variable netdev_dropping when a new frame is received. The driver should only call netif_rx() if netdev_dropping is zero. Otherwise |
