Diffstat (limited to 'Documentation/networking')
-rw-r--r-- Documentation/networking/00-INDEX | 30
-rw-r--r-- Documentation/networking/3c505.txt | 45
-rw-r--r-- Documentation/networking/altera_tse.txt | 263
-rw-r--r-- Documentation/networking/batman-adv.txt | 61
-rw-r--r-- Documentation/networking/bonding.txt | 232
-rw-r--r-- Documentation/networking/can.txt | 354
-rw-r--r-- Documentation/networking/cdc_mbim.txt | 339
-rw-r--r-- Documentation/networking/dccp.txt | 6
-rw-r--r-- Documentation/networking/e100.txt | 2
-rw-r--r-- Documentation/networking/filter.txt | 1072
-rw-r--r-- Documentation/networking/gianfar.txt | 30
-rw-r--r-- Documentation/networking/i40evf.txt | 47
-rw-r--r-- Documentation/networking/ieee802154.txt | 4
-rw-r--r-- Documentation/networking/igb.txt | 48
-rw-r--r-- Documentation/networking/ip-sysctl.txt | 89
-rw-r--r-- Documentation/networking/ipsec.txt | 38
-rw-r--r-- Documentation/networking/l2tp.txt | 2
-rw-r--r-- Documentation/networking/netdev-FAQ.txt | 24
-rw-r--r-- Documentation/networking/netdevices.txt | 10
-rw-r--r-- Documentation/networking/netlink_mmap.txt | 10
-rw-r--r-- Documentation/networking/operstates.txt | 4
-rw-r--r-- Documentation/networking/packet_mmap.txt | 63
-rw-r--r-- Documentation/networking/phy.txt | 14
-rw-r--r-- Documentation/networking/pktgen.txt | 37
-rw-r--r-- Documentation/networking/regulatory.txt | 4
-rw-r--r-- Documentation/networking/rxrpc.txt | 83
-rw-r--r-- Documentation/networking/scaling.txt | 2
-rw-r--r-- Documentation/networking/spider_net.txt | 2
-rw-r--r-- Documentation/networking/stmmac.txt | 20
-rw-r--r-- Documentation/networking/tcp.txt | 2
-rw-r--r-- Documentation/networking/timestamping.txt | 67
-rw-r--r-- Documentation/networking/timestamping/.gitignore | 1
-rw-r--r-- Documentation/networking/timestamping/Makefile | 5
-rw-r--r-- Documentation/networking/timestamping/hwtstamp_config.c | 134
-rw-r--r-- Documentation/networking/vortex.txt | 4
-rw-r--r-- Documentation/networking/x25-iface.txt | 2
36 files changed, 2684 insertions, 466 deletions
diff --git a/Documentation/networking/00-INDEX b/Documentation/networking/00-INDEX
index f11580f8719..557b6ef70c2 100644
--- a/Documentation/networking/00-INDEX
+++ b/Documentation/networking/00-INDEX
@@ -6,8 +6,14 @@
- information on the 3Com Etherlink III Series Ethernet cards.
6pack.txt
- info on the 6pack protocol, an alternative to KISS for AX.25
-DLINK.txt
- - info on the D-Link DE-600/DE-620 parallel port pocket adapters
+LICENSE.qla3xxx
+ - GPLv2 for QLogic Linux Networking HBA Driver
+LICENSE.qlge
+ - GPLv2 for QLogic Linux qlge NIC Driver
+LICENSE.qlcnic
+ - GPLv2 for QLogic Linux qlcnic NIC Driver
+Makefile
+ - Makefile for docsrc.
PLIP.txt
- PLIP: The Parallel Line Internet Protocol device driver
README.ipw2100
@@ -17,7 +23,7 @@ README.ipw2200
README.sb1000
- info on General Instrument/NextLevel SURFboard1000 cable modem.
alias.txt
- - info on using alias network devices
+ - info on using alias network devices.
arcnet-hardware.txt
- tons of info on ARCnet, hubs, jumper settings for ARCnet cards, etc.
arcnet.txt
@@ -80,7 +86,7 @@ framerelay.txt
- info on using Frame Relay/Data Link Connection Identifier (DLCI).
gen_stats.txt
- Generic networking statistics for netlink users.
-generic_hdlc.txt
+generic-hdlc.txt
- The generic High Level Data Link Control (HDLC) layer.
generic_netlink.txt
- info on Generic Netlink
@@ -88,6 +94,8 @@ gianfar.txt
- Gianfar Ethernet Driver.
i40e.txt
- README for the Intel Ethernet Controller XL710 Driver (i40e).
+i40evf.txt
+ - Short note on the Driver for the Intel(R) XL710 X710 Virtual Function
ieee802154.txt
- Linux IEEE 802.15.4 implementation, API and drivers
igb.txt
@@ -102,6 +110,8 @@ ipddp.txt
- AppleTalk-IP Decapsulation and AppleTalk-IP Encapsulation
iphase.txt
- Interphase PCI ATM (i)Chip IA Linux driver info.
+ipsec.txt
+ - Note on not compressing IPSec payload and resulting failed policy check.
ipv6.txt
- Options to the ipv6 kernel module.
ipvs-sysctl.txt
@@ -120,6 +130,8 @@ lapb-module.txt
- programming information of the LAPB module.
ltpc.txt
- the Apple or Farallon LocalTalk PC card driver
+mac80211-auth-assoc-deauth.txt
+ - authentication and association / deauth-disassoc with mac80211
mac80211-injection.txt
- HOWTO use packet injection with mac80211
multiqueue.txt
@@ -134,6 +146,10 @@ netdevices.txt
- info on network device driver functions exported to the kernel.
netif-msg.txt
- Design of the network interface message level setting (NETIF_MSG_*).
+netlink_mmap.txt
+ - memory mapped I/O with netlink
+nf_conntrack-sysctl.txt
+ - list of netfilter-sysctl knobs.
nfc.txt
- The Linux Near Field Communication (NFC) subsystem.
openvswitch.txt
@@ -176,7 +192,7 @@ skfp.txt
- SysKonnect FDDI (SK-5xxx, Compaq Netelligent) driver info.
smc9.txt
- the driver for SMC's 9000 series of Ethernet cards
-spider-net.txt
+spider_net.txt
- README for the Spidernet Driver (as found in PS3 / Cell BE).
stmmac.txt
- README for the STMicro Synopsys Ethernet driver.
@@ -188,6 +204,8 @@ tcp.txt
- short blurb on how TCP output takes place.
tcp-thin.txt
- kernel tuning options for low rate 'thin' TCP streams.
+team.txt
+ - pointer to information for ethernet teaming devices.
tlan.txt
- ThunderLAN (Compaq Netelligent 10/100, Olicom OC-2xxx) driver info.
tproxy.txt
@@ -200,6 +218,8 @@ vortex.txt
- info on using 3Com Vortex (3c590, 3c592, 3c595, 3c597) Ethernet cards.
vxge.txt
- README for the Neterion X3100 PCIe Server Adapter.
+vxlan.txt
+ - Virtual extensible LAN overview
x25.txt
- general info on X.25 development.
x25-iface.txt
diff --git a/Documentation/networking/3c505.txt b/Documentation/networking/3c505.txt
deleted file mode 100644
index 72f38b13101..00000000000
--- a/Documentation/networking/3c505.txt
+++ /dev/null
@@ -1,45 +0,0 @@
-The 3Com Etherlink Plus (3c505) driver.
-
-This driver now uses DMA. There is currently no support for PIO operation.
-The default DMA channel is 6; this is _not_ autoprobed, so you must
-make sure you configure it correctly. If loading the driver as a
-module, you can do this with "modprobe 3c505 dma=n". If the driver is
-linked statically into the kernel, you must either use an "ether="
-statement on the command line, or change the definition of ELP_DMA in 3c505.h.
-
-The driver will warn you if it has to fall back on the compiled in
-default DMA channel.
-
-If no base address is given at boot time, the driver will autoprobe
-ports 0x300, 0x280 and 0x310 (in that order). If no IRQ is given, the driver
-will try to probe for it.
-
-The driver can be used as a loadable module.
-
-Theoretically, one instance of the driver can now run multiple cards,
-in the standard way (when loading a module, say "modprobe 3c505
-io=0x300,0x340 irq=10,11 dma=6,7" or whatever). I have not tested
-this, though.
-
-The driver may now support revision 2 hardware; the dependency on
-being able to read the host control register has been removed. This
-is also untested, since I don't have a suitable card.
-
-Known problems:
- I still see "DMA upload timed out" messages from time to time. These
-seem to be fairly non-fatal though.
- The card is old and slow.
-
-To do:
- Improve probe/setup code
- Test multicast and promiscuous operation
-
-Authors:
- The driver is mainly written by Craig Southeren, email
- <craigs@ineluki.apana.org.au>.
- Parts of the driver (adapting the driver to 1.1.4+ kernels,
- IRQ/address detection, some changes) and this README by
- Juha Laiho <jlaiho@ichaos.nullnet.fi>.
- DMA mode, more fixes, etc, by Philip Blundell <pjb27@cam.ac.uk>
- Multicard support, Software configurable DMA, etc., by
- Christopher Collins <ccollins@pcug.org.au>
diff --git a/Documentation/networking/altera_tse.txt b/Documentation/networking/altera_tse.txt
new file mode 100644
index 00000000000..3f24df8c6e6
--- /dev/null
+++ b/Documentation/networking/altera_tse.txt
@@ -0,0 +1,263 @@
+ Altera Triple-Speed Ethernet MAC driver
+
+Copyright (C) 2008-2014 Altera Corporation
+
+This is the driver for the Altera Triple-Speed Ethernet (TSE) controllers
+using the SGDMA and MSGDMA soft DMA IP components. The driver uses the
+platform bus to obtain component resources. The designs used to test this
+driver were built for a Cyclone(R) V SOC FPGA board, a Cyclone(R) V FPGA board,
+and tested with ARM and NIOS processor hosts separately. The anticipated use
+cases are simple communications between an embedded system and an external peer
+for status and simple configuration of the embedded system.
+
+For more information visit www.altera.com and www.rocketboards.org. Support
+forums for the driver may be found on www.rocketboards.org, and a design used
+to test this driver may be found there as well. Support is also available from
+the maintainer of this driver, found in MAINTAINERS.
+
+The Triple-Speed Ethernet, SGDMA, and MSGDMA components are all soft IP
+components that can be assembled and built into an FPGA using the Altera
+Quartus toolchain. Quartus 13.1 and 14.0 were used to build the design that
+this driver was tested against. The sopc2dts tool is used to create the
+device tree for the driver, and may be found at rocketboards.org.
+
+The driver probe function examines the device tree and determines if the
+Triple-Speed Ethernet instance is using an SGDMA or MSGDMA component. The
+probe function then installs the appropriate set of DMA routines to
+initialize, setup transmits, receives, and interrupt handling primitives for
+the respective configurations.
+
+The SGDMA component is to be deprecated in the near future (over the next 1-2
+years as of this writing in early 2014) in favor of the MSGDMA component.
+SGDMA support is included for existing designs and reference in case a
+developer wishes to support their own soft DMA logic and driver support. Any
+new designs should not use the SGDMA.
+
+The SGDMA supports only a single transmit or receive operation at a time, and
+therefore will not perform as well compared to the MSGDMA soft IP. Please
+visit www.altera.com for known, documented SGDMA errata.
+
+Scatter-gather DMA is not supported by the SGDMA or MSGDMA at this time.
+Scatter-gather DMA will be added to a future maintenance update to this
+driver.
+
+Jumbo frames are not supported at this time.
+
+The driver limits PHY operations to 10/100Mbps, and has not yet been fully
+tested for 1Gbps. This support will be added in a future maintenance update.
+
+1) Kernel Configuration
+The kernel configuration option is ALTERA_TSE:
+ Device Drivers ---> Network device support ---> Ethernet driver support --->
+ Altera Triple-Speed Ethernet MAC support (ALTERA_TSE)
+
+2) Driver parameters list:
+ debug: message level (0: no output, 16: all);
+ dma_rx_num: Number of descriptors in the RX list (default is 64);
+ dma_tx_num: Number of descriptors in the TX list (default is 64).
+
+3) Command line options
+Driver parameters can also be passed on the command line by using:
+ altera_tse=dma_rx_num:128,dma_tx_num:512
+
+4) Driver information and notes
+
+4.1) Transmit process
+When the driver's transmit routine is called by the kernel, it sets up a
+transmit descriptor by calling the underlying DMA transmit routine (SGDMA or
+MSGDMA), and initiates a transmit operation. Once the transmit is complete, an
+interrupt is driven by the transmit DMA logic. The driver handles the transmit
+completion in the context of the interrupt handling chain by recycling
+resources required to send and track the requested transmit operation.
+
+4.2) Receive process
+The driver will post receive buffers to the receive DMA logic during driver
+initialization. Receive buffers may or may not be queued depending upon the
+underlying DMA logic (MSGDMA is able to queue receive buffers, SGDMA is not able
+to queue receive buffers to the SGDMA receive logic). When a packet is
+received, the DMA logic generates an interrupt. The driver handles a receive
+interrupt by obtaining the DMA receive logic status, reaping receive
+completions until no more receive completions are available.
+
+4.3) Interrupt Mitigation
+The driver is able to mitigate the number of its DMA interrupts
+using NAPI for receive operations. Interrupt mitigation is not yet supported
+for transmit operations, but will be added in a future maintenance release.
+
+4.4) Ethtool support
+Ethtool is supported. Driver statistics and internal errors can be read with the
+"ethtool -S ethX" command. Registers can be dumped as well.
+
+4.5) PHY Support
+The driver is compatible with PAL to work with PHY and GPHY devices.
+
+4.7) List of source files:
+ o Kconfig
+ o Makefile
+ o altera_tse_main.c: main network device driver
+ o altera_tse_ethtool.c: ethtool support
+ o altera_tse.h: private driver structure and common definitions
+ o altera_msgdma.h: MSGDMA implementation function definitions
+ o altera_sgdma.h: SGDMA implementation function definitions
+ o altera_msgdma.c: MSGDMA implementation
+ o altera_sgdma.c: SGDMA implementation
+ o altera_sgdmahw.h: SGDMA register and descriptor definitions
+ o altera_msgdmahw.h: MSGDMA register and descriptor definitions
+ o altera_utils.c: Driver utility functions
+ o altera_utils.h: Driver utility function definitions
+
+5) Debug Information
+
+The driver exports debug information such as internal statistics,
+debug information, MAC and DMA registers etc.
+
+A user may use the ethtool support to get statistics,
+e.g. using: ethtool -S ethX (which shows the statistics counters),
+or to see the MAC registers, e.g. using: ethtool -d ethX
+
+The developer can also use the "debug" module parameter to get
+further debug information.
+
+6) Statistics Support
+
+The controller and driver support a mix of IEEE standard defined statistics,
+RFC defined statistics, and driver or Altera defined statistics. The four
+specifications containing the standard definitions for these statistics are
+as follows:
+
+ o IEEE 802.3-2012 - IEEE Standard for Ethernet.
+ o RFC 2863 found at http://www.rfc-editor.org/rfc/rfc2863.txt.
+ o RFC 2819 found at http://www.rfc-editor.org/rfc/rfc2819.txt.
+ o Altera Triple Speed Ethernet User Guide, found at http://www.altera.com
+
+The statistics supported by the TSE and the device driver are as follows:
+
+"tx_packets" is equivalent to aFramesTransmittedOK defined in IEEE 802.3-2012,
+Section 5.2.2.1.2. This statistic is the count of frames that are successfully
+transmitted.
+
+"rx_packets" is equivalent to aFramesReceivedOK defined in IEEE 802.3-2012,
+Section 5.2.2.1.5. This statistic is the count of frames that are successfully
+received. This count does not include any error packets such as CRC errors,
+length errors, or alignment errors.
+
+"rx_crc_errors" is equivalent to aFrameCheckSequenceErrors defined in IEEE
+802.3-2012, Section 5.2.2.1.6. This statistic is the count of frames that are
+an integral number of bytes in length and do not pass the CRC test as the frame
+is received.
+
+"rx_align_errors" is equivalent to aAlignmentErrors defined in IEEE 802.3-2012,
+Section 5.2.2.1.7. This statistic is the count of frames that are not an
+integral number of bytes in length and do not pass the CRC test as the frame is
+received.
+
+"tx_bytes" is equivalent to aOctetsTransmittedOK defined in IEEE 802.3-2012,
+Section 5.2.2.1.8. This statistic is the count of data and pad bytes
+successfully transmitted from the interface.
+
+"rx_bytes" is equivalent to aOctetsReceivedOK defined in IEEE 802.3-2012,
+Section 5.2.2.1.14. This statistic is the count of data and pad bytes
+successfully received by the controller.
+
+"tx_pause" is equivalent to aPAUSEMACCtrlFramesTransmitted defined in IEEE
+802.3-2012, Section 30.3.4.2. This statistic is a count of PAUSE frames
+transmitted from the network controller.
+
+"rx_pause" is equivalent to aPAUSEMACCtrlFramesReceived defined in IEEE
+802.3-2012, Section 30.3.4.3. This statistic is a count of PAUSE frames
+received by the network controller.
+
+"rx_errors" is equivalent to ifInErrors defined in RFC 2863. This statistic is
+a count of the number of packets received containing errors that prevented the
+packet from being delivered to a higher level protocol.
+
+"tx_errors" is equivalent to ifOutErrors defined in RFC 2863. This statistic
+is a count of the number of packets that could not be transmitted due to errors.
+
+"rx_unicast" is equivalent to ifInUcastPkts defined in RFC 2863. This
+statistic is a count of the number of packets received that were not addressed
+to the broadcast address or a multicast group.
+
+"rx_multicast" is equivalent to ifInMulticastPkts defined in RFC 2863. This
+statistic is a count of the number of packets received that were addressed to
+a multicast address group.
+
+"rx_broadcast" is equivalent to ifInBroadcastPkts defined in RFC 2863. This
+statistic is a count of the number of packets received that were addressed to
+the broadcast address.
+
+"tx_discards" is equivalent to ifOutDiscards defined in RFC 2863. This
+statistic is the number of outbound packets not transmitted even though an
+error was not detected. An example of a reason this might occur is to free up
+internal buffer space.
+
+"tx_unicast" is equivalent to ifOutUcastPkts defined in RFC 2863. This
+statistic counts the number of packets transmitted that were not addressed to
+a multicast group or broadcast address.
+
+"tx_multicast" is equivalent to ifOutMulticastPkts defined in RFC 2863. This
+statistic counts the number of packets transmitted that were addressed to a
+multicast group.
+
+"tx_broadcast" is equivalent to ifOutBroadcastPkts defined in RFC 2863. This
+statistic counts the number of packets transmitted that were addressed to a
+broadcast address.
+
+"ether_drops" is equivalent to etherStatsDropEvents defined in RFC 2819.
+This statistic counts the number of packets dropped due to lack of internal
+controller resources.
+
+"rx_total_bytes" is equivalent to etherStatsOctets defined in RFC 2819.
+This statistic counts the total number of bytes received by the controller,
+including error and discarded packets.
+
+"rx_total_packets" is equivalent to etherStatsPkts defined in RFC 2819.
+This statistic counts the total number of packets received by the controller,
+including error, discarded, unicast, multicast, and broadcast packets.
+
+"rx_undersize" is equivalent to etherStatsUndersizePkts defined in RFC 2819.
+This statistic counts the number of correctly formed packets received less
+than 64 bytes long.
+
+"rx_oversize" is equivalent to etherStatsOversizePkts defined in RFC 2819.
+This statistic counts the number of correctly formed packets greater than 1518
+bytes long.
+
+"rx_64_bytes" is equivalent to etherStatsPkts64Octets defined in RFC 2819.
+This statistic counts the total number of packets received that were 64 octets
+in length.
+
+"rx_65_127_bytes" is equivalent to etherStatsPkts65to127Octets defined in RFC
+2819. This statistic counts the total number of packets received that were
+between 65 and 127 octets in length inclusive.
+
+"rx_128_255_bytes" is equivalent to etherStatsPkts128to255Octets defined in
+RFC 2819. This statistic is the total number of packets received that were
+between 128 and 255 octets in length inclusive.
+
+"rx_256_511_bytes" is equivalent to etherStatsPkts256to511Octets defined in
+RFC 2819. This statistic is the total number of packets received that were
+between 256 and 511 octets in length inclusive.
+
+"rx_512_1023_bytes" is equivalent to etherStatsPkts512to1023Octets defined in
+RFC 2819. This statistic is the total number of packets received that were
+between 512 and 1023 octets in length inclusive.
+
+"rx_1024_1518_bytes" is equivalent to etherStatsPkts1024to1518Octets defined
+in RFC 2819. This statistic is the total number of packets received that were
+between 1024 and 1518 octets in length inclusive.
+
+"rx_gte_1519_bytes" is a statistic defined specific to the behavior of the
+Altera TSE. This statistic counts the number of received good and errored
+frames between the length of 1519 and the maximum frame length configured
+in the frm_length register. See the Altera TSE User Guide for more details.
+
+"rx_jabbers" is equivalent to etherStatsJabbers defined in RFC 2819. This
+statistic is the total number of packets received that were longer than 1518
+octets, and had either a bad CRC with an integral number of octets (CRC Error)
+or a bad CRC with a non-integral number of octets (Alignment Error).
+
+"rx_runts" is equivalent to etherStatsFragments defined in RFC 2819. This
+statistic is the total number of packets received that were less than 64 octets
+in length and had either a bad CRC with an integral number of octets (CRC
+error) or a bad CRC with a non-integral number of octets (Alignment Error).
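Reviewer note: the length-based receive counters described in the altera_tse.txt hunk above can be illustrated with a small sketch. This is not driver code; the function name `counters_for_frame` and the `crc_ok` flag are hypothetical, the frm_length upper bound is ignored, and the sketch assumes (following RFC 2819) that the in-range size buckets do not depend on the CRC result.

```python
# Illustrative only: map a received frame length to the counter names
# listed in altera_tse.txt. Not taken from the driver sources.

# (upper bound in octets, counter name) for the RFC 2819 size buckets
SIZE_BUCKETS = [
    (64, "rx_64_bytes"),
    (127, "rx_65_127_bytes"),
    (255, "rx_128_255_bytes"),
    (511, "rx_256_511_bytes"),
    (1023, "rx_512_1023_bytes"),
    (1518, "rx_1024_1518_bytes"),
]


def counters_for_frame(length, crc_ok=True):
    """Return the length-related counters one frame would increment."""
    if length < 64:
        # Short frames: well formed -> undersize, bad CRC -> runt (fragment).
        return ["rx_undersize" if crc_ok else "rx_runts"]
    if length > 1518:
        # Long frames: well formed -> oversize, bad CRC -> jabber.
        # rx_gte_1519_bytes counts both good and errored long frames.
        return ["rx_oversize" if crc_ok else "rx_jabbers", "rx_gte_1519_bytes"]
    for upper, name in SIZE_BUCKETS:
        if length <= upper:
            return [name]
    raise AssertionError("unreachable: length is within 64..1518")
```

Comparing such a model against "ethtool -S ethX" output can help when checking that the counters move as expected for test traffic of a known size.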
diff --git a/Documentation/networking/batman-adv.txt b/Documentation/networking/batman-adv.txt
index c1d82047a4b..58e49042fc2 100644
--- a/Documentation/networking/batman-adv.txt
+++ b/Documentation/networking/batman-adv.txt
@@ -66,19 +66,17 @@ All mesh wide settings can be found in batman's own interface
folder:
# ls /sys/class/net/bat0/mesh/
-# aggregated_ogms gw_bandwidth log_level
-# ap_isolation gw_mode orig_interval
-# bonding gw_sel_class routing_algo
-# bridge_loop_avoidance hop_penalty vis_mode
-# fragmentation
-
+#aggregated_ogms distributed_arp_table gw_sel_class orig_interval
+#ap_isolation fragmentation hop_penalty routing_algo
+#bonding gw_bandwidth isolation_mark vlan0
+#bridge_loop_avoidance gw_mode log_level
There is a special folder for debugging information:
# ls /sys/kernel/debug/batman_adv/bat0/
# bla_backbone_table log transtable_global
# bla_claim_table originators transtable_local
-# gateways socket vis_data
+# gateways socket
Some of the files contain all sort of status information regard-
ing the mesh network. For example, you can view the table of
@@ -127,51 +125,6 @@ ously assigned to interfaces now used by batman advanced, e.g.
# ifconfig eth0 0.0.0.0
-VISUALIZATION
--------------
-
-If you want topology visualization, at least one mesh node must
-be configured as VIS-server:
-
-# echo "server" > /sys/class/net/bat0/mesh/vis_mode
-
-Each node is either configured as "server" or as "client" (de-
-fault: "client"). Clients send their topology data to the server
-next to them, and server synchronize with other servers. If there
-is no server configured (default) within the mesh, no topology
-information will be transmitted. With these "synchronizing
-servers", there can be 1 or more vis servers sharing the same (or
-at least very similar) data.
-
-When configured as server, you can get a topology snapshot of
-your mesh:
-
-# cat /sys/kernel/debug/batman_adv/bat0/vis_data
-
-This raw output is intended to be easily parsable and convertable
-with other tools. Have a look at the batctl README if you want a
-vis output in dot or json format for instance and how those out-
-puts could then be visualised in an image.
-
-The raw format consists of comma separated values per entry where
-each entry is giving information about a certain source inter-
-face. Each entry can/has to have the following values:
--> "mac" - mac address of an originator's source interface
- (each line begins with it)
--> "TQ mac value" - src mac's link quality towards mac address
- of a neighbor originator's interface which
- is being used for routing
--> "TT mac" - TT announced by source mac
--> "PRIMARY" - this is a primary interface
--> "SEC mac" - secondary mac address of source
- (requires preceding PRIMARY)
-
-The TQ value has a range from 4 to 255 with 255 being the best.
-The TT entries are showing which hosts are connected to the mesh
-via bat0 or being bridged into the mesh network. The PRIMARY/SEC
-values are only applied on primary interfaces
-
-
LOGGING/DEBUGGING
-----------------
@@ -245,5 +198,5 @@ Mailing-list: b.a.t.m.a.n@open-mesh.org (optional subscription
You can also contact the Authors:
-Marek Lindner <lindner_marek@yahoo.de>
-Simon Wunderlich <siwu@hrz.tu-chemnitz.de>
+Marek Lindner <mareklindner@neomailbox.ch>
+Simon Wunderlich <sw@simonwunderlich.de>
diff --git a/Documentation/networking/bonding.txt b/Documentation/networking/bonding.txt
index 87bbcfee2e0..9c723ecd002 100644
--- a/Documentation/networking/bonding.txt
+++ b/Documentation/networking/bonding.txt
@@ -270,16 +270,15 @@ arp_ip_target
arp_validate
Specifies whether or not ARP probes and replies should be
- validated in the active-backup mode. This causes the ARP
- monitor to examine the incoming ARP requests and replies, and
- only consider a slave to be up if it is receiving the
- appropriate ARP traffic.
+ validated in any mode that supports arp monitoring, or whether
+ non-ARP traffic should be filtered (disregarded) for link
+ monitoring purposes.
Possible values are:
none or 0
- No validation is performed. This is the default.
+ No validation or filtering is performed.
active or 1
@@ -293,31 +292,68 @@ arp_validate
Validation is performed for all slaves.
- For the active slave, the validation checks ARP replies to
- confirm that they were generated by an arp_ip_target. Since
- backup slaves do not typically receive these replies, the
- validation performed for backup slaves is on the ARP request
- sent out via the active slave. It is possible that some
- switch or network configurations may result in situations
- wherein the backup slaves do not receive the ARP requests; in
- such a situation, validation of backup slaves must be
- disabled.
-
- The validation of ARP requests on backup slaves is mainly
- helping bonding to decide which slaves are more likely to
- work in case of the active slave failure, it doesn't really
- guarantee that the backup slave will work if it's selected
- as the next active slave.
-
- This option is useful in network configurations in which
- multiple bonding hosts are concurrently issuing ARPs to one or
- more targets beyond a common switch. Should the link between
- the switch and target fail (but not the switch itself), the
- probe traffic generated by the multiple bonding instances will
- fool the standard ARP monitor into considering the links as
- still up. Use of the arp_validate option can resolve this, as
- the ARP monitor will only consider ARP requests and replies
- associated with its own instance of bonding.
+ filter or 4
+
+ Filtering is applied to all slaves. No validation is
+ performed.
+
+ filter_active or 5
+
+ Filtering is applied to all slaves, validation is performed
+ only for the active slave.
+
+ filter_backup or 6
+
+ Filtering is applied to all slaves, validation is performed
+ only for backup slaves.
+
+ Validation:
+
+ Enabling validation causes the ARP monitor to examine the incoming
+ ARP requests and replies, and only consider a slave to be up if it
+ is receiving the appropriate ARP traffic.
+
+ For an active slave, the validation checks ARP replies to confirm
+ that they were generated by an arp_ip_target. Since backup slaves
+ do not typically receive these replies, the validation performed
+ for backup slaves is on the broadcast ARP request sent out via the
+ active slave. It is possible that some switch or network
+ configurations may result in situations wherein the backup slaves
+ do not receive the ARP requests; in such a situation, validation
+ of backup slaves must be disabled.
+
+ The validation of ARP requests on backup slaves is mainly helping
+ bonding to decide which slaves are more likely to work in case of
+ the active slave failure, it doesn't really guarantee that the
+ backup slave will work if it's selected as the next active slave.
+
+ Validation is useful in network configurations in which multiple
+ bonding hosts are concurrently issuing ARPs to one or more targets
+ beyond a common switch. Should the link between the switch and
+ target fail (but not the switch itself), the probe traffic
+ generated by the multiple bonding instances will fool the standard
+ ARP monitor into considering the links as still up. Use of
+ validation can resolve this, as the ARP monitor will only consider
+ ARP requests and replies associated with its own instance of
+ bonding.
+
+ Filtering:
+
+ Enabling filtering causes the ARP monitor to only use incoming ARP
+ packets for link availability purposes. Arriving packets that are
+ not ARPs are delivered normally, but do not count when determining
+ if a slave is available.
+
+ Filtering operates by only considering the reception of ARP
+ packets (any ARP packet, regardless of source or destination) when
+ determining if a slave has received traffic for link availability
+ purposes.
+
+ Filtering is useful in network configurations in which significant
+ levels of third party broadcast traffic would fool the standard
+ ARP monitor into considering the links as still up. Use of
+ filtering can resolve this, as only ARP traffic is considered for
+ link availability purposes.
This option was added in bonding version 3.1.0.
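Reviewer note: the numeric arp_validate values in the hunk above read naturally as a three-bit mask (validate active, validate backup, filter). The sketch below is one way to interpret the table, not the bonding driver's code. The names `ARP_VALIDATE` and `describe` are hypothetical, and the "backup or 2" and "all or 3" entries come from the unchanged part of bonding.txt rather than from this hunk.

```python
# Illustrative only: interpret arp_validate values as a bitmask.
ARP_VALIDATE = {
    "none": 0,
    "active": 1,
    "backup": 2,
    "all": 3,
    "filter": 4,
    "filter_active": 5,
    "filter_backup": 6,
}


def describe(value):
    """Split an arp_validate value into its three behaviours."""
    return {
        "validate_active": bool(value & 1),  # validate ARPs on the active slave
        "validate_backup": bool(value & 2),  # validate ARPs on backup slaves
        "filter": bool(value & 4),           # only ARP traffic counts for link state
    }
```

For example, `describe(ARP_VALIDATE["filter_active"])` reports filtering plus validation on the active slave only, which matches the description of value 5.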
@@ -549,13 +585,19 @@ mode
balance-tlb or 5
Adaptive transmit load balancing: channel bonding that
- does not require any special switch support. The
- outgoing traffic is distributed according to the
- current load (computed relative to the speed) on each
- slave. Incoming traffic is received by the current
- slave. If the receiving slave fails, another slave
- takes over the MAC address of the failed receiving
- slave.
+ does not require any special switch support.
+
+ In tlb_dynamic_lb=1 mode, the outgoing traffic is
+ distributed according to the current load (computed
+ relative to the speed) on each slave.
+
+ In tlb_dynamic_lb=0 mode, load balancing based on
+ current load is disabled and the load is distributed
+ only using the hash distribution.
+
+ Incoming traffic is received by the current slave.
+ If the receiving slave fails, another slave takes over
+ the MAC address of the failed receiving slave.
Prerequisite:
@@ -639,6 +681,15 @@ num_unsol_na
are generated by the ipv4 and ipv6 code and the numbers of
repetitions cannot be set independently.
+packets_per_slave
+
+ Specify the number of packets to transmit through a slave before
+ moving to the next one. When set to 0 then a slave is chosen at
+ random.
+
+ The valid range is 0 - 65535; the default value is 1. This option
+ has effect only in balance-rr mode.
+
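Reviewer note: the effect of packets_per_slave on balance-rr can be pictured with a short simulation. This is a sketch of the documented behaviour only, not the kernel's transmit path; `slave_sequence` and its parameters are hypothetical names.

```python
import random


def slave_sequence(slaves, packets_per_slave, n_packets, rng=None):
    """Return the slave chosen for each of n_packets transmitted packets."""
    if not 0 <= packets_per_slave <= 65535:
        raise ValueError("packets_per_slave must be in the range 0 - 65535")
    rng = rng or random.Random()
    chosen = []
    for i in range(n_packets):
        if packets_per_slave == 0:
            # 0 means: pick a slave at random for every packet.
            chosen.append(rng.choice(slaves))
        else:
            # Send packets_per_slave packets on one slave, then move on.
            chosen.append(slaves[(i // packets_per_slave) % len(slaves)])
    return chosen
```

With two slaves and the default value of 1 the sequence alternates on every packet; with a value of 2 each slave carries two consecutive packets before the next one is used.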
primary
A string (eth0, eth2, etc) specifying which slave is the
@@ -648,7 +699,8 @@ primary
one slave is preferred over another, e.g., when one slave has
higher throughput than another.
- The primary option is only valid for active-backup mode.
+ The primary option is only valid for active-backup (1),
+ balance-tlb (5) and balance-alb (6) modes.
primary_reselect
@@ -690,6 +742,28 @@ primary_reselect
This option was added for bonding version 3.6.0.
+tlb_dynamic_lb
+
+ Specifies if dynamic shuffling of flows is enabled in tlb
+ mode. The value has no effect on any other modes.
+
+ The default behavior of tlb mode is to shuffle active flows across
+ slaves based on the load in that interval. This gives nice lb
+ characteristics but can cause packet reordering. If re-ordering is
+ a concern use this variable to disable flow shuffling and rely on
+ load balancing provided solely by the hash distribution.
+ xmit_hash_policy can be used to select the appropriate hashing for
+ the setup.
+
+ The sysfs entry can be used to change the setting per bond device
+ and the initial value is derived from the module parameter. The
+ sysfs entry is allowed to be changed only if the bond device is
+ down.
+
+ The default value is "1", which enables flow shuffling, while value "0"
+ disables it. This option was added in bonding driver 3.7.1.
+
+
updelay
Specifies the time, in milliseconds, to wait before enabling a
@@ -723,7 +797,7 @@ use_carrier
xmit_hash_policy
Selects the transmit hash policy to use for slave selection in
- balance-xor and 802.3ad modes. Possible values are:
+ balance-xor, 802.3ad, and tlb modes. Possible values are:
layer2
@@ -743,21 +817,16 @@ xmit_hash_policy
protocol information to generate the hash.
Uses XOR of hardware MAC addresses and IP addresses to
- generate the hash. The IPv4 formula is
-
- (((source IP XOR dest IP) AND 0xffff) XOR
- ( source MAC XOR destination MAC ))
- modulo slave count
+ generate the hash. The formula is
- The IPv6 formula is
+ hash = source MAC XOR destination MAC
+ hash = hash XOR source IP XOR destination IP
+ hash = hash XOR (hash RSHIFT 16)
+ hash = hash XOR (hash RSHIFT 8)
+ And then hash is reduced modulo slave count.
- hash = (source ip quad 2 XOR dest IP quad 2) XOR
- (source ip quad 3 XOR dest IP quad 3) XOR
- (source ip quad 4 XOR dest IP quad 4)
-
- (((hash >> 24) XOR (hash >> 16) XOR (hash >> 8) XOR hash)
- XOR (source MAC XOR destination MAC))
- modulo slave count
+ If the protocol is IPv6 then the source and destination
+ addresses are first hashed using ipv6_addr_hash.
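The layer2+3 steps above can be sketched in user-space C as follows. This is only an illustration of the documented formula, not the driver's code: the helper names bond_fold() and l23_slave() are hypothetical, and mac_xor is assumed to be "source MAC XOR destination MAC" already reduced to a 32-bit value by the caller.

```c
#include <stdint.h>

/* Final mixing steps of the documented formula. */
static uint32_t bond_fold(uint32_t hash)
{
	hash ^= hash >> 16;
	hash ^= hash >> 8;
	return hash;
}

/*
 * Hypothetical helper: mac_xor stands for "source MAC XOR destination
 * MAC" reduced to 32 bits (an assumption of this sketch). The IPv4
 * addresses are passed as plain 32-bit values.
 */
static unsigned int l23_slave(uint32_t mac_xor, uint32_t src_ip,
			      uint32_t dst_ip, unsigned int slave_count)
{
	uint32_t hash = mac_xor;

	hash ^= src_ip ^ dst_ip;
	return bond_fold(hash) % slave_count;
}
```

Because only XOR is applied to the address pair, swapping source and destination selects the same slave in this sketch.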
This algorithm will place all traffic to a particular
network peer on the same slave. For non-IP traffic,
@@ -779,21 +848,16 @@ xmit_hash_policy
slaves, although a single connection will not span
multiple slaves.
- The formula for unfragmented IPv4 TCP and UDP packets is
-
- ((source port XOR dest port) XOR
- ((source IP XOR dest IP) AND 0xffff)
- modulo slave count
+ The formula for unfragmented TCP and UDP packets is
- The formula for unfragmented IPv6 TCP and UDP packets is
+ hash = source port, destination port (as in the header)
+ hash = hash XOR source IP XOR destination IP
+ hash = hash XOR (hash RSHIFT 16)
+ hash = hash XOR (hash RSHIFT 8)
+ And then hash is reduced modulo slave count.
- hash = (source port XOR dest port) XOR
- ((source ip quad 2 XOR dest IP quad 2) XOR
- (source ip quad 3 XOR dest IP quad 3) XOR
- (source ip quad 4 XOR dest IP quad 4))
-
- ((hash >> 24) XOR (hash >> 16) XOR (hash >> 8) XOR hash)
- modulo slave count
+ If the protocol is IPv6 then the source and destination
+ addresses are first hashed using ipv6_addr_hash.
For fragmented TCP or UDP packets and all other IPv4 and
IPv6 protocol traffic, the source and destination port
@@ -801,10 +865,6 @@ xmit_hash_policy
formula is the same as for the layer2 transmit hash
policy.
- The IPv4 policy is intended to mimic the behavior of
- certain switches, notably Cisco switches with PFC2 as
- well as some Foundry and IBM products.
-
This algorithm is not fully 802.3ad compliant. A
single TCP or UDP conversation containing both
fragmented and unfragmented packets will see packets
@@ -815,6 +875,26 @@ xmit_hash_policy
conversations. Other implementations of 802.3ad may
or may not tolerate this noncompliance.
+ encap2+3
+
+ This policy uses the same formula as layer2+3 but it
+ relies on skb_flow_dissect to obtain the header fields
+ which might result in the use of inner headers if an
+ encapsulation protocol is used. For example this will
+ improve the performance for tunnel users because the
+ packets will be distributed according to the encapsulated
+ flows.
+
+ encap3+4
+
+ This policy uses the same formula as layer3+4 but it
+ relies on skb_flow_dissect to obtain the header fields
+ which might result in the use of inner headers if an
+ encapsulation protocol is used. For example this will
+ improve the performance for tunnel users because the
+ packets will be distributed according to the encapsulated
+ flows.
+
The default value is layer2. This option was added in bonding
version 2.6.3. In earlier versions of bonding, this parameter
does not exist, and the layer2 policy is the only policy. The
@@ -838,6 +918,14 @@ resend_igmp
This option was added for bonding version 3.7.0.
+lp_interval
+
+ Specifies the number of seconds between instances where the bonding
+ driver sends learning packets to each slave's peer switch.
+
+ The valid range is 1 - 0x7fffffff; the default value is 1. This option
+ has effect only in balance-tlb and balance-alb modes.
+
3. Configuring Bonding Devices
==============================
@@ -1362,6 +1450,12 @@ To add ARP targets:
To remove an ARP target:
# echo -192.168.0.100 > /sys/class/net/bond0/bonding/arp_ip_target
+To configure the interval between learning packet transmits:
+# echo 12 > /sys/class/net/bond0/bonding/lp_interval
+ NOTE: lp_interval is the number of seconds between instances where
+the bonding driver sends learning packets to each slave's peer switch. The
+default interval is 1 second.
+
Example Configuration
---------------------
We begin with the same example that is shown in section 3.3,
diff --git a/Documentation/networking/can.txt b/Documentation/networking/can.txt
index 820f55344ed..2236d6dcb7d 100644
--- a/Documentation/networking/can.txt
+++ b/Documentation/networking/can.txt
@@ -2,21 +2,20 @@
can.txt
-Readme file for the Controller Area Network Protocol Family (aka Socket CAN)
+Readme file for the Controller Area Network Protocol Family (aka SocketCAN)
This file contains
- 1 Overview / What is Socket CAN
+ 1 Overview / What is SocketCAN
2 Motivation / Why using the socket API
- 3 Socket CAN concept
+ 3 SocketCAN concept
3.1 receive lists
3.2 local loopback of sent frames
- 3.3 network security issues (capabilities)
- 3.4 network problem notifications
+ 3.3 network problem notifications
- 4 How to use Socket CAN
+ 4 How to use SocketCAN
4.1 RAW protocol sockets with can_filters (SOCK_RAW)
4.1.1 RAW socket option CAN_RAW_FILTER
4.1.2 RAW socket option CAN_RAW_ERR_FILTER
@@ -25,10 +24,16 @@ This file contains
4.1.5 RAW socket option CAN_RAW_FD_FRAMES
4.1.6 RAW socket returned message flags
4.2 Broadcast Manager protocol sockets (SOCK_DGRAM)
+ 4.2.1 Broadcast Manager operations
+ 4.2.2 Broadcast Manager message flags
+ 4.2.3 Broadcast Manager transmission timers
+ 4.2.4 Broadcast Manager message sequence transmission
+ 4.2.5 Broadcast Manager receive filter timers
+ 4.2.6 Broadcast Manager multiplex message receive filter
4.3 connected transport protocols (SOCK_SEQPACKET)
4.4 unconnected transport protocols (SOCK_DGRAM)
- 5 Socket CAN core module
+ 5 SocketCAN core module
5.1 can.ko module params
5.2 procfs content
5.3 writing own CAN protocol modules
@@ -45,20 +50,20 @@ This file contains
6.6 CAN FD (flexible data rate) driver support
6.7 supported CAN hardware
- 7 Socket CAN resources
+ 7 SocketCAN resources
8 Credits
============================================================================
-1. Overview / What is Socket CAN
+1. Overview / What is SocketCAN
--------------------------------
The socketcan package is an implementation of CAN protocols
(Controller Area Network) for Linux. CAN is a networking technology
which has widespread use in automation, embedded devices, and
automotive fields. While there have been other CAN implementations
-for Linux based on character devices, Socket CAN uses the Berkeley
+for Linux based on character devices, SocketCAN uses the Berkeley
socket API, the Linux network stack and implements the CAN device
drivers as network interfaces. The CAN socket API has been designed
as similar as possible to the TCP/IP protocols to allow programmers,
@@ -68,7 +73,7 @@ sockets.
2. Motivation / Why using the socket API
----------------------------------------
-There have been CAN implementations for Linux before Socket CAN so the
+There have been CAN implementations for Linux before SocketCAN so the
question arises, why we have started another project. Most existing
implementations come as a device driver for some CAN hardware, they
are based on character devices and provide comparatively little
@@ -83,10 +88,10 @@ the CAN controller requires employment of another device driver and
often the need for adaption of large parts of the application to the
new driver's API.
-Socket CAN was designed to overcome all of these limitations. A new
+SocketCAN was designed to overcome all of these limitations. A new
protocol family has been implemented which provides a socket interface
to user space applications and which builds upon the Linux network
-layer, so to use all of the provided queueing functionality. A device
+layer, enabling use of all the provided queueing functionality. A device
driver for CAN controller hardware registers itself with the Linux
network layer as a network device, so that CAN frames from the
controller can be passed up to the network layer and on to the CAN
@@ -140,15 +145,15 @@ solution for a couple of reasons:
providing an API for device drivers to register with. However, then
it would be no more difficult, or may be even easier, to use the
networking framework provided by the Linux kernel, and this is what
- Socket CAN does.
+ SocketCAN does.
The use of the networking framework of the Linux kernel is just the
natural and most appropriate way to implement CAN for Linux.
-3. Socket CAN concept
+3. SocketCAN concept
---------------------
- As described in chapter 2 it is the main goal of Socket CAN to
+ As described in chapter 2 it is the main goal of SocketCAN to
provide a socket interface to user space applications which builds
upon the Linux network layer. In contrast to the commonly known
TCP/IP and ethernet networking, the CAN bus is a broadcast-only(!)
@@ -162,11 +167,11 @@ solution for a couple of reasons:
The network transparent access of multiple applications leads to the
problem that different applications may be interested in the same
- CAN-IDs from the same CAN network interface. The Socket CAN core
+ CAN-IDs from the same CAN network interface. The SocketCAN core
module - which implements the protocol family CAN - provides several
high efficient receive lists for this reason. If e.g. a user space
application opens a CAN RAW socket, the raw protocol module itself
- requests the (range of) CAN-IDs from the Socket CAN core that are
+ requests the (range of) CAN-IDs from the SocketCAN core that are
requested by the user. The subscription and unsubscription of
CAN-IDs can be done for specific CAN interfaces or for all(!) known
CAN interfaces with the can_rx_(un)register() functions provided to
@@ -211,21 +216,7 @@ solution for a couple of reasons:
* = you really like to have this when you're running analyser tools
like 'candump' or 'cansniffer' on the (same) node.
- 3.3 network security issues (capabilities)
-
- The Controller Area Network is a local field bus transmitting only
- broadcast messages without any routing and security concepts.
- In the majority of cases the user application has to deal with
- raw CAN frames. Therefore it might be reasonable NOT to restrict
- the CAN access only to the user root, as known from other networks.
- Since the currently implemented CAN_RAW and CAN_BCM sockets can only
- send and receive frames to/from CAN interfaces it does not affect
- security of others networks to allow all users to access the CAN.
- To enable non-root users to access CAN_RAW and CAN_BCM protocol
- sockets the Kconfig options CAN_RAW_USER and/or CAN_BCM_USER may be
- selected at kernel compile time.
-
- 3.4 network problem notifications
+ 3.3 network problem notifications
The use of the CAN bus may lead to several problems on the physical
and media access control layer. Detecting and logging of these lower
@@ -245,11 +236,11 @@ solution for a couple of reasons:
by default. The format of the CAN error message frame is briefly
described in the Linux header file "include/linux/can/error.h".
-4. How to use Socket CAN
+4. How to use SocketCAN
------------------------
Like TCP/IP, you first need to open a socket for communicating over a
- CAN network. Since Socket CAN implements a new protocol family, you
+ CAN network. Since SocketCAN implements a new protocol family, you
need to pass PF_CAN as the first argument to the socket(2) system
call. Currently, there are two CAN protocols to choose from, the raw
socket protocol and the broadcast manager (BCM). So to open a socket,
@@ -280,8 +271,8 @@ solution for a couple of reasons:
};
The alignment of the (linear) payload data[] to a 64bit boundary
- allows the user to define own structs and unions to easily access the
- CAN payload. There is no given byteorder on the CAN bus by
+ allows the user to define their own structs and unions to easily access
+ the CAN payload. There is no given byteorder on the CAN bus by
default. A read(2) system call on a CAN_RAW socket transfers a
struct can_frame to the user space.
@@ -473,11 +464,46 @@ solution for a couple of reasons:
setsockopt(s, SOL_CAN_RAW, CAN_RAW_FILTER, NULL, 0);
- To set the filters to zero filters is quite obsolete as not read
+ Setting the filters to zero filters is largely superfluous, as not reading
data causes the raw socket to discard the received CAN frames. But
having this 'send only' use-case we may remove the receive list in the
Kernel to save a little (really a very little!) CPU usage.
+ 4.1.1.1 CAN filter usage optimisation
+
+ The CAN filters are processed in per-device filter lists at CAN frame
+ reception time. To reduce the number of checks that need to be performed
+ while walking through the filter lists the CAN core provides an optimized
+ filter handling when the filter subscription focusses on a single CAN ID.
+
+ For the possible 2048 SFF CAN identifiers the identifier is used as an index
+ to access the corresponding subscription list without any further checks.
+ For the 2^29 possible EFF CAN identifiers a 10 bit XOR folding is used as
+ hash function to retrieve the EFF table index.
+
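One plausible reading of the "10 bit XOR folding" mentioned above is sketched below. The function name eff_hash() and the macro EFF_HASH_BITS are hypothetical, and the exact folding used by the CAN core may differ; the sketch only shows how a 29-bit identifier can be folded into a 10-bit table index.

```c
#include <stdint.h>

#define EFF_HASH_BITS 10	/* assumed width of the EFF table index */

/* Fold a 29-bit EFF identifier into an index in the range 0..1023. */
static unsigned int eff_hash(uint32_t can_id)
{
	uint32_t hash;

	can_id &= 0x1FFFFFFFU;	/* keep only the 29 identifier bits */
	hash = can_id;
	hash ^= can_id >> EFF_HASH_BITS;
	hash ^= can_id >> (2 * EFF_HASH_BITS);
	return hash & ((1U << EFF_HASH_BITS) - 1);
}
```

Under these assumptions every EFF identifier maps to one of 1024 lists, so a single-ID subscription is looked up without walking all filters of the device.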
+ To benefit from the optimized filters for single CAN identifiers the
+ CAN_SFF_MASK or CAN_EFF_MASK has to be set in can_filter.can_mask together
+ with the CAN_EFF_FLAG and CAN_RTR_FLAG bits. A set CAN_EFF_FLAG bit in
+ can_filter.can_mask makes clear that it matters whether a SFF or EFF CAN ID
+ is subscribed. E.g. in the example from above
+
+ rfilter[0].can_id = 0x123;
+ rfilter[0].can_mask = CAN_SFF_MASK;
+
+ both SFF frames with CAN ID 0x123 and EFF frames with 0xXXXXX123 can pass.
+
+ To filter for only 0x123 (SFF) and 0x12345678 (EFF) CAN identifiers the
+ filter has to be defined in this way to benefit from the optimized filters:
+
+ struct can_filter rfilter[2];
+
+ rfilter[0].can_id = 0x123;
+ rfilter[0].can_mask = (CAN_EFF_FLAG | CAN_RTR_FLAG | CAN_SFF_MASK);
+ rfilter[1].can_id = 0x12345678 | CAN_EFF_FLAG;
+ rfilter[1].can_mask = (CAN_EFF_FLAG | CAN_RTR_FLAG | CAN_EFF_MASK);
+
+ setsockopt(s, SOL_CAN_RAW, CAN_RAW_FILTER, &rfilter, sizeof(rfilter));
+
4.1.2 RAW socket option CAN_RAW_ERR_FILTER
As described in chapter 3.4 the CAN interface driver can generate so
@@ -563,12 +589,6 @@ solution for a couple of reasons:
not specified in the struct can_frame and therefore it is only valid in
CANFD_MTU sized CAN FD frames.
- As long as the payload length is <=8 the received CAN frames from CAN FD
- capable CAN devices can be received and read by legacy sockets too. When
- user-generated CAN FD frames have a payload length <=8 these can be send
- by legacy CAN network interfaces too. Sending CAN FD frames with payload
- length > 8 to a legacy CAN network interface returns an -EMSGSIZE error.
-
Implementation hint for new CAN applications:
To build a CAN FD aware application use struct canfd_frame as basic CAN
@@ -593,21 +613,232 @@ solution for a couple of reasons:
In order to receive such messages, CAN_RAW_RECV_OWN_MSGS must be set.
4.2 Broadcast Manager protocol sockets (SOCK_DGRAM)
+
+ The Broadcast Manager protocol provides a command based configuration
+ interface to filter and send (e.g. cyclic) CAN messages in kernel space.
+
+ Receive filters can be used to down sample frequent messages; detect events
+ such as message contents changes, packet length changes, and do time-out
+ monitoring of received messages.
+
+ Periodic transmission tasks of CAN frames or a sequence of CAN frames can be
+ created and modified at runtime; both the message content and the two
+ possible transmit intervals can be altered.
+
+ A BCM socket is not intended for sending individual CAN frames using the
+ struct can_frame as known from the CAN_RAW socket. Instead a special BCM
+ configuration message is defined. The basic BCM configuration message used
+ to communicate with the broadcast manager and the available operations are
+ defined in the linux/can/bcm.h include. The BCM message consists of a
+ message header with a command ('opcode') followed by zero or more CAN frames.
+ The broadcast manager sends responses to user space in the same form:
+
+ struct bcm_msg_head {
+ __u32 opcode; /* command */
+ __u32 flags; /* special flags */
+ __u32 count; /* run 'count' times with ival1 */
+ struct timeval ival1, ival2; /* count and subsequent interval */
+ canid_t can_id; /* unique can_id for task */
+ __u32 nframes; /* number of can_frames following */
+ struct can_frame frames[0];
+ };
+
+ The aligned payload 'frames' uses the same basic CAN frame structure defined
+ at the beginning of section 4 and in the include/linux/can.h include. All
+ messages to the broadcast manager from user space have this structure.
+
+ Note a CAN_BCM socket must be connected instead of bound after socket
+ creation (example without error checking):
+
+ int s;
+ struct sockaddr_can addr;
+ struct ifreq ifr;
+
+ s = socket(PF_CAN, SOCK_DGRAM, CAN_BCM);
+
+ strcpy(ifr.ifr_name, "can0");
+ ioctl(s, SIOCGIFINDEX, &ifr);
+
+ addr.can_family = AF_CAN;
+ addr.can_ifindex = ifr.ifr_ifindex;
+
+ connect(s, (struct sockaddr *)&addr, sizeof(addr));
+
+ (..)
+
+ The broadcast manager socket is able to handle any number of in flight
+ transmissions or receive filters concurrently. The different RX/TX jobs are
+ distinguished by the unique can_id in each BCM message. However additional
+ CAN_BCM sockets are recommended to communicate on multiple CAN interfaces.
+ When the broadcast manager socket is bound to 'any' CAN interface (=> the
+ interface index is set to zero) the configured receive filters apply to any
+ CAN interface unless the sendto() syscall is used to overrule the 'any' CAN
+ interface index. When using recvfrom() instead of read() to retrieve BCM
+ socket messages the originating CAN interface is provided in can_ifindex.
+
+ 4.2.1 Broadcast Manager operations
+
+ The opcode defines the operation for the broadcast manager to carry out,
+ or details the broadcast manager's response to several events, including
+ user requests.
+
+ Transmit Operations (user space to broadcast manager):
+
+ TX_SETUP: Create (cyclic) transmission task.
+
+ TX_DELETE: Remove (cyclic) transmission task, requires only can_id.
+
+ TX_READ: Read properties of (cyclic) transmission task for can_id.
+
+ TX_SEND: Send one CAN frame.
+
+ Transmit Responses (broadcast manager to user space):
+
+ TX_STATUS: Reply to TX_READ request (transmission task configuration).
+
+ TX_EXPIRED: Notification when counter finishes sending at initial interval
+ 'ival1'. Requires the TX_COUNTEVT flag to be set at TX_SETUP.
+
+ Receive Operations (user space to broadcast manager):
+
+ RX_SETUP: Create RX content filter subscription.
+
+ RX_DELETE: Remove RX content filter subscription, requires only can_id.
+
+ RX_READ: Read properties of RX content filter subscription for can_id.
+
+ Receive Responses (broadcast manager to user space):
+
+ RX_STATUS: Reply to RX_READ request (filter task configuration).
+
+ RX_TIMEOUT: Cyclic message is detected to be absent (timer ival1 expired).
+
+ RX_CHANGED: BCM message with updated CAN frame (detected content change).
+ Sent on first message received or on receipt of revised CAN messages.
+
+ 4.2.2 Broadcast Manager message flags
+
+ When sending a message to the broadcast manager the 'flags' element may
+ contain the following flag definitions which influence the behaviour:
+
+ SETTIMER: Set the values of ival1, ival2 and count
+
+ STARTTIMER: Start the timer with the actual values of ival1, ival2
+ and count. Starting the timer simultaneously emits a CAN frame.
+
+ TX_COUNTEVT: Create the message TX_EXPIRED when count expires
+
+ TX_ANNOUNCE: A change of data by the process is emitted immediately.
+
+ TX_CP_CAN_ID: Copies the can_id from the message header to each
+ subsequent frame in frames. This is intended as usage simplification. For
+ TX tasks the unique can_id from the message header may differ from the
+ can_id(s) stored for transmission in the subsequent struct can_frame(s).
+
+ RX_FILTER_ID: Filter by can_id alone, no frames required (nframes=0).
+
+ RX_CHECK_DLC: A change of the DLC leads to an RX_CHANGED.
+
+ RX_NO_AUTOTIMER: Prevent automatically starting the timeout monitor.
+
+ RX_ANNOUNCE_RESUME: If passed at RX_SETUP and a receive timeout occurred, a
+ RX_CHANGED message will be generated when the (cyclic) receive restarts.
+
+ TX_RESET_MULTI_IDX: Reset the index for the multiple frame transmission.
+
+ RX_RTR_FRAME: Send reply for RTR-request (placed in op->frames[0]).
+
+ 4.2.3 Broadcast Manager transmission timers
+
+ Periodic transmission configurations may use up to two interval timers.
+ In this case the BCM sends a number of messages ('count') at an interval
+ 'ival1', then continues to send at another given interval 'ival2'. When
+ only one timer is needed 'count' is set to zero and only 'ival2' is used.
+ When the SETTIMER and STARTTIMER flags are set the timers are activated.
+ The timer values can be altered at runtime when only SETTIMER is set.
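As a minimal sketch of the two-timer behaviour described above, the following fills a TX_SETUP message that sends one frame 10 times at 100 ms and then once per second. The struct name bcm_tx_msg, the helper fill_tx_setup() and the chosen CAN ID, payload and intervals are illustrative assumptions; only the bcm_msg_head fields, opcodes and flags come from linux/can/bcm.h.

```c
#include <string.h>
#include <linux/can.h>
#include <linux/can/bcm.h>

/* BCM message head followed by a single CAN frame (illustrative name). */
struct bcm_tx_msg {
	struct bcm_msg_head msg_head;
	struct can_frame frame[1];
};

/* Hypothetical helper: prepare a cyclic TX task using both timers. */
static void fill_tx_setup(struct bcm_tx_msg *msg)
{
	memset(msg, 0, sizeof(*msg));

	msg->msg_head.opcode  = TX_SETUP;
	msg->msg_head.flags   = SETTIMER | STARTTIMER;
	msg->msg_head.can_id  = 0x123;		/* example task identifier */
	msg->msg_head.count   = 10;		/* 10 frames at ival1 ... */
	msg->msg_head.ival1.tv_sec  = 0;
	msg->msg_head.ival1.tv_usec = 100000;	/* ... every 100 ms */
	msg->msg_head.ival2.tv_sec  = 1;	/* afterwards every second */
	msg->msg_head.ival2.tv_usec = 0;
	msg->msg_head.nframes = 1;

	msg->frame[0].can_id  = 0x123;
	msg->frame[0].can_dlc = 2;
	msg->frame[0].data[0] = 0xDE;		/* example payload */
	msg->frame[0].data[1] = 0xAD;
}
```

The filled message would then be passed to write(2) on a connected CAN_BCM socket, e.g. write(s, &msg, sizeof(msg)).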
+
+ 4.2.4 Broadcast Manager message sequence transmission
+
+ Up to 256 CAN frames can be transmitted in a sequence in the case of a cyclic
+ TX task configuration. The number of CAN frames is provided in the 'nframes'
+ element of the BCM message head. The defined number of CAN frames are added
+ as array to the TX_SETUP BCM configuration message.
+
+ /* create a struct to set up a sequence of four CAN frames */
+ struct {
+ struct bcm_msg_head msg_head;
+ struct can_frame frame[4];
+ } mytxmsg;
+
+ (..)
+ mytxmsg.msg_head.nframes = 4;
+ (..)
+
+ write(s, &mytxmsg, sizeof(mytxmsg));
+
+ With every transmission the index in the array of CAN frames is increased
+ and set to zero at index overflow.
+
+ 4.2.5 Broadcast Manager receive filter timers
+
+ The timer values ival1 or ival2 may be set to non-zero values at RX_SETUP.
+ When the SETTIMER flag is set the timers are enabled:
+
+ ival1: Send RX_TIMEOUT when a received message is not received again within
+ the given time. When STARTTIMER is set at RX_SETUP the timeout detection
+ is activated directly - even without a former CAN frame reception.
+
+ ival2: Throttle the received message rate down to the value of ival2. This
+ is useful to reduce messages for the application when the signal inside the
+ CAN frame is stateless, as state changes within the ival2 period may get
+ lost.
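The receive timers above might be configured as in the following sketch, which subscribes to one CAN ID with a 1 second timeout (ival1) and a 200 ms throttle (ival2). The helper name fill_rx_setup() and the concrete ID and intervals are assumptions made for illustration; RX_FILTER_ID is used so that no CAN frames need to follow the message head.

```c
#include <string.h>
#include <linux/can.h>
#include <linux/can/bcm.h>

/* Hypothetical helper: RX subscription with timeout and throttling. */
static void fill_rx_setup(struct bcm_msg_head *head)
{
	memset(head, 0, sizeof(*head));

	head->opcode = RX_SETUP;
	head->flags  = SETTIMER | RX_FILTER_ID;	/* filter by can_id only */
	head->can_id = 0x123;			/* example CAN ID */
	head->ival1.tv_sec  = 1;		/* RX_TIMEOUT after 1 s of silence */
	head->ival1.tv_usec = 0;
	head->ival2.tv_sec  = 0;
	head->ival2.tv_usec = 200000;		/* pass updates at most every 200 ms */
	head->nframes = 0;			/* no frames follow the head */
}
```

Since nframes is zero, only the message head would be written to the BCM socket, e.g. write(s, &head, sizeof(head)).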
+
+ 4.2.6 Broadcast Manager multiplex message receive filter
+
+ To filter for content changes in multiplex message sequences an array of more
+ than one CAN frame can be passed in an RX_SETUP configuration message. The
+ data bytes of the first CAN frame contain the mask of relevant bits that
+ have to match in the subsequent CAN frames with the received CAN frame.
+ If one of the subsequent CAN frames matches, the bits in that frame's data
+ mark the relevant content to be compared with the previously received content.
+ Up to 257 CAN frames (multiplex filter bit mask CAN frame plus 256 CAN
+ filters) can be added as an array to the RX_SETUP BCM configuration message.
+
+ /* usually used to clear CAN frame data[] - beware of endian problems! */
+ #define U64_DATA(p) (*(unsigned long long*)(p)->data)
+
+ struct {
+ struct bcm_msg_head msg_head;
+ struct can_frame frame[5];
+ } msg;
+
+ msg.msg_head.opcode = RX_SETUP;
+ msg.msg_head.can_id = 0x42;
+ msg.msg_head.flags = 0;
+ msg.msg_head.nframes = 5;
+ U64_DATA(&msg.frame[0]) = 0xFF00000000000000ULL; /* MUX mask */
+ U64_DATA(&msg.frame[1]) = 0x01000000000000FFULL; /* data mask (MUX 0x01) */
+ U64_DATA(&msg.frame[2]) = 0x0200FFFF000000FFULL; /* data mask (MUX 0x02) */
+ U64_DATA(&msg.frame[3]) = 0x330000FFFFFF0003ULL; /* data mask (MUX 0x33) */
+ U64_DATA(&msg.frame[4]) = 0x4F07FC0FF0000000ULL; /* data mask (MUX 0x4F) */
+
+ write(s, &msg, sizeof(msg));
+
4.3 connected transport protocols (SOCK_SEQPACKET)
4.4 unconnected transport protocols (SOCK_DGRAM)
-5. Socket CAN core module
+5. SocketCAN core module
-------------------------
- The Socket CAN core module implements the protocol family
+ The SocketCAN core module implements the protocol family
PF_CAN. CAN protocol modules are loaded by the core module at
runtime. The core module provides an interface for CAN protocol
modules to subscribe needed CAN IDs (see chapter 3.1).
5.1 can.ko module params
- - stats_timer: To calculate the Socket CAN core statistics
+ - stats_timer: To calculate the SocketCAN core statistics
(e.g. current/maximum frames per second) this 1 second timer is
invoked at can.ko module start time by default. This timer can be
disabled by using stattimer=0 on the module commandline.
@@ -616,7 +847,7 @@ solution for a couple of reasons:
5.2 procfs content
- As described in chapter 3.1 the Socket CAN core uses several filter
+ As described in chapter 3.1 the SocketCAN core uses several filter
lists to deliver received CAN frames to CAN protocol modules. These
receive lists, their filters and the count of filter matches can be
checked in the appropriate receive list. All entries contain the
@@ -643,15 +874,15 @@ solution for a couple of reasons:
Additional procfs files in /proc/net/can
- stats - Socket CAN core statistics (rx/tx frames, match ratios, ...)
+ stats - SocketCAN core statistics (rx/tx frames, match ratios, ...)
reset_stats - manual statistic reset
- version - prints the Socket CAN core version and the ABI version
+ version - prints the SocketCAN core version and the ABI version
5.3 writing own CAN protocol modules
To implement a new protocol in the protocol family PF_CAN a new
protocol has to be defined in include/linux/can.h .
- The prototypes and definitions to use the Socket CAN core can be
+ The prototypes and definitions to use the SocketCAN core can be
accessed by including include/linux/can/core.h .
In addition to functions that register the CAN protocol and the
CAN device notifier chain there are functions to subscribe CAN
@@ -821,7 +1052,7 @@ solution for a couple of reasons:
in case of a bus-off condition after the specified delay time
in milliseconds. By default it's off.
- "bitrate 125000 sample_point 0.875"
+ "bitrate 125000 sample-point 0.875"
Shows the real bit-rate in bits/sec and the sample-point in the
range 0.000..0.999. If the calculation of bit-timing parameters
is enabled in the kernel (CONFIG_CAN_CALC_BITTIMING=y), the
@@ -888,7 +1119,7 @@ solution for a couple of reasons:
$ ip link set canX up type can bitrate 125000
- A device may enter the "bus-off" state if too much errors occurred on
+ A device may enter the "bus-off" state if too many errors occurred on
the CAN bus. Then no more messages are received or sent. An automatic
bus-off recovery can be enabled by setting the "restart-ms" to a
non-zero value, e.g.:
@@ -908,7 +1139,7 @@ solution for a couple of reasons:
CAN FD capable CAN controllers support two different bitrates for the
arbitration phase and the payload phase of the CAN FD frame. Therefore a
- second bittiming has to be specified in order to enable the CAN FD bitrate.
+ second bit timing has to be specified in order to enable the CAN FD bitrate.
Additionally CAN FD capable CAN controllers support up to 64 bytes of
payload. The representation of this length in can_frame.can_dlc and
@@ -933,21 +1164,16 @@ solution for a couple of reasons:
6.7 Supported CAN hardware
Please check the "Kconfig" file in "drivers/net/can" to get an actual
- list of the support CAN hardware. On the Socket CAN project website
+ list of the support CAN hardware. On the SocketCAN project website
(see chapter 7) there might be further drivers available, also for
older kernel versions.
-7. Socket CAN resources
+7. SocketCAN resources
-----------------------
- You can find further resources for Socket CAN like user space tools,
- support for old kernel versions, more drivers, mailing lists, etc.
- at the BerliOS OSS project website for Socket CAN:
-
- http://developer.berlios.de/projects/socketcan
-
- If you have questions, bug fixes, etc., don't hesitate to post them to
- the Socketcan-Users mailing list. But please search the archives first.
+ The Linux CAN / SocketCAN project resources (project site / mailing list)
+ are referenced in the MAINTAINERS file in the Linux source tree.
+ Search for CAN NETWORK [LAYERS|DRIVERS].
8. Credits
----------
diff --git a/Documentation/networking/cdc_mbim.txt b/Documentation/networking/cdc_mbim.txt
new file mode 100644
index 00000000000..a15ea602aa5
--- /dev/null
+++ b/Documentation/networking/cdc_mbim.txt
@@ -0,0 +1,339 @@
+ cdc_mbim - Driver for CDC MBIM Mobile Broadband modems
+ ========================================================
+
+The cdc_mbim driver supports USB devices conforming to the "Universal
+Serial Bus Communications Class Subclass Specification for Mobile
+Broadband Interface Model" [1], which is a further development of
+"Universal Serial Bus Communications Class Subclass Specifications for
+Network Control Model Devices" [2] optimized for Mobile Broadband
+devices, aka "3G/LTE modems".
+
+
+Command Line Parameters
+=======================
+
+The cdc_mbim driver has no parameters of its own. But the probing
+behaviour for NCM 1.0 backwards compatible MBIM functions (an
+"NCM/MBIM function" as defined in section 3.2 of [1]) is affected
+by a cdc_ncm driver parameter:
+
+prefer_mbim
+-----------
+Type: Boolean
+Valid Range: N/Y (0-1)
+Default Value: Y (MBIM is preferred)
+
+This parameter sets the system policy for NCM/MBIM functions. Such
+functions will be handled by either the cdc_ncm driver or the cdc_mbim
+driver depending on the prefer_mbim setting. Setting prefer_mbim=N
+makes the cdc_mbim driver ignore these functions and lets the cdc_ncm
+driver handle them instead.
+
+The parameter is writable, and can be changed at any time. A manual
+unbind/bind is required to make the change effective for NCM/MBIM
+functions bound to the "wrong" driver.
+
+
+Basic usage
+===========
+
+MBIM functions are inactive when unmanaged. The cdc_mbim driver only
+provides a userspace interface to the MBIM control channel, and will
+not participate in the management of the function. This implies that a
+userspace MBIM management application is always required to enable an
+MBIM function.
+
+Such userspace applications include, but are not limited to:
+ - mbimcli (included with the libmbim [3] library), and
+ - ModemManager [4]
+
+Establishing an MBIM IP session requires at least these actions by the
+management application:
+ - open the control channel
+ - configure network connection settings
+ - connect to network
+ - configure IP interface
+
+Management application development
+----------------------------------
+The driver <-> userspace interfaces are described below. The MBIM
+control channel protocol is described in [1].
+
+
+MBIM control channel userspace ABI
+==================================
+
+/dev/cdc-wdmX character device
+------------------------------
+The driver creates a two-way pipe to the MBIM function control channel
+using the cdc-wdm driver as a subdriver. The userspace end of the
+control channel pipe is a /dev/cdc-wdmX character device.
+
+The cdc_mbim driver does not process or police messages on the control
+channel. The channel is fully delegated to the userspace management
+application. It is therefore up to this application to ensure that it
+complies with all the control channel requirements in [1].
+
+The cdc-wdmX device is created as a child of the MBIM control
+interface USB device. The character device associated with a specific
+MBIM function can be looked up using sysfs. For example:
+
+ bjorn@nemi:~$ ls /sys/bus/usb/drivers/cdc_mbim/2-4:2.12/usbmisc
+ cdc-wdm0
+
+ bjorn@nemi:~$ grep . /sys/bus/usb/drivers/cdc_mbim/2-4:2.12/usbmisc/cdc-wdm0/dev
+ 180:0
+
+
+USB configuration descriptors
+-----------------------------
+The wMaxControlMessage field of the CDC MBIM functional descriptor
+limits the maximum control message size. The management application is
+responsible for negotiating a control message size complying with the
+requirements in section 9.3.1 of [1], taking this descriptor field
+into consideration.
+
+The userspace application can access the CDC MBIM functional
+descriptor of a MBIM function using either of the two USB
+configuration descriptor kernel interfaces described in [6] or [7].
+
+See also the ioctl documentation below.
+
+
+Fragmentation
+-------------
+The userspace application is responsible for all control message
+fragmentation and defragmentation, as described in section 9.5 of [1].
+
+
+/dev/cdc-wdmX write()
+---------------------
+The MBIM control messages from the management application *must not*
+exceed the negotiated control message size.
+
+
+/dev/cdc-wdmX read()
+--------------------
+The management application *must* accept control messages of up to the
+negotiated control message size.
+
+
+/dev/cdc-wdmX ioctl()
+---------------------
+IOCTL_WDM_MAX_COMMAND: Get Maximum Command Size
+This ioctl returns the wMaxControlMessage field of the CDC MBIM
+functional descriptor for MBIM devices. This is intended as a
+convenience, eliminating the need to parse the USB descriptors from
+userspace.
+
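+ #include <stdio.h>
+ #include <fcntl.h>
+ #include <unistd.h>
+ #include <sys/ioctl.h>
+ #include <linux/types.h>
+ #include <linux/usb/cdc-wdm.h>
+ int main()
+ {
+         __u16 max;
+         int fd = open("/dev/cdc-wdm0", O_RDWR);
+         if (fd < 0) {
+                 perror("open");         /* device missing or no permission */
+                 return 1;
+         }
+         /* the ioctl returns 0 on success and fills in max */
+         if (!ioctl(fd, IOCTL_WDM_MAX_COMMAND, &max))
+                 printf("wMaxControlMessage is %d\n", max);
+         close(fd);
+         return 0;
+ }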
+
+
+Custom device services
+----------------------
+The MBIM specification allows vendors to freely define additional
+services. This is fully supported by the cdc_mbim driver.
+
+Support for new MBIM services, including vendor specified services, is
+implemented entirely in userspace, like the rest of the MBIM control
+protocol.
+
+New services should be registered in the MBIM Registry [5].
+
+
+
+MBIM data channel userspace ABI
+===============================
+
+wwanY network device
+--------------------
+The cdc_mbim driver represents the MBIM data channel as a single
+network device of the "wwan" type. This network device is initially
+mapped to MBIM IP session 0.
+
+
+Multiplexed IP sessions (IPS)
+-----------------------------
+MBIM allows multiplexing up to 256 IP sessions over a single USB data
+channel. The cdc_mbim driver models such IP sessions as 802.1q VLAN
+subdevices of the master wwanY device, mapping MBIM IP session Z to
+VLAN ID Z for all values of Z greater than 0.
+
+The device maximum Z is given in the MBIM_DEVICE_CAPS_INFO structure
+described in section 10.5.1 of [1].
+
+The userspace management application is responsible for adding new
+VLAN links prior to establishing MBIM IP sessions where the SessionId
+is greater than 0. These links can be added by using the normal VLAN
+kernel interfaces, either ioctl or netlink.
+
+For example, adding a link for a MBIM IP session with SessionId 3:
+
+ ip link add link wwan0 name wwan0.3 type vlan id 3
+
+The driver will automatically map the "wwan0.3" network device to MBIM
+IP session 3.
+
+
+Device Service Streams (DSS)
+----------------------------
+MBIM also allows up to 256 non-IP data streams to be multiplexed over
+the same shared USB data channel. The cdc_mbim driver models these
+sessions as another set of 802.1q VLAN subdevices of the master wwanY
+device, mapping MBIM DSS session A to VLAN ID (256 + A) for all values
+of A.
+
+The device maximum A is given in the MBIM_DEVICE_SERVICES_INFO
+structure described in section 10.5.29 of [1].
+
+The DSS VLAN subdevices are used as a practical interface between the
+shared MBIM data channel and a MBIM DSS aware userspace application.
+They are not intended to be presented as-is to an end user. The
+assumption is that a userspace application initiating a DSS session
+also takes care of the necessary framing of the DSS data, presenting
+the stream to the end user in an appropriate way for the stream type.
+
+The network device ABI requires a dummy ethernet header for every DSS
+data frame being transported. The contents of this header are
+arbitrary, with the following exceptions:
+ - TX frames using an IP protocol (0x0800 or 0x86dd) will be dropped
+ - RX frames will have the protocol field set to ETH_P_802_3 (but will
+ not be properly formatted 802.3 frames)
+ - RX frames will have the destination address set to the hardware
+ address of the master device
+
+The DSS supporting userspace management application is responsible for
+adding the dummy ethernet header on TX and stripping it on RX.
+
+This is a simple example using tools commonly available, exporting
+DssSessionId 5 as a pty character device pointed to by a /dev/nmea
+symlink:
+
+ ip link add link wwan0 name wwan0.dss5 type vlan id 261
+ ip link set dev wwan0.dss5 up
+ socat INTERFACE:wwan0.dss5,type=2 PTY:,echo=0,link=/dev/nmea
+
+This is only an example, most suitable for testing out a DSS
+service. Userspace applications supporting specific MBIM DSS services
+are expected to use the tools and programming interfaces required by
+that service.
+
+Note that adding VLAN links for DSS sessions is entirely optional. A
+management application may instead choose to bind a packet socket
+directly to the master network device, using the received VLAN tags to
+map frames to the correct DSS session and adding 18 byte VLAN ethernet
+headers with the appropriate tag on TX. In this case using a socket
+filter is recommended, matching only the DSS VLAN subset. This avoids
+unnecessary copying of unrelated IP session data to userspace. For
+example:
+
+ static struct sock_filter dssfilter[] = {
+ /* use special negative offsets to get VLAN tag */
+ BPF_STMT(BPF_LD|BPF_B|BPF_ABS, SKF_AD_OFF + SKF_AD_VLAN_TAG_PRESENT),
+ BPF_JUMP(BPF_JMP|BPF_JEQ|BPF_K, 1, 0, 6), /* true */
+
+ /* verify DSS VLAN range */
+ BPF_STMT(BPF_LD|BPF_H|BPF_ABS, SKF_AD_OFF + SKF_AD_VLAN_TAG),
+ BPF_JUMP(BPF_JMP|BPF_JGE|BPF_K, 256, 0, 4), /* 256 is first DSS VLAN */
+ BPF_JUMP(BPF_JMP|BPF_JGE|BPF_K, 512, 3, 0), /* 511 is last DSS VLAN */
+
+ /* verify ethertype */
+ BPF_STMT(BPF_LD|BPF_H|BPF_ABS, 2 * ETH_ALEN),
+ BPF_JUMP(BPF_JMP|BPF_JEQ|BPF_K, ETH_P_802_3, 0, 1),
+
+ BPF_STMT(BPF_RET|BPF_K, (u_int)-1), /* accept */
+ BPF_STMT(BPF_RET|BPF_K, 0), /* ignore */
+ };
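+
+The 18 byte VLAN ethernet header mentioned above could be built along
+these lines. This is only a sketch: dss_build_header() and the two
+constants are hypothetical names, the zeroed addresses rely on the
+header contents being arbitrary on TX, and 0x0001 (ETH_P_802_3) is
+picked as ethertype merely because it is not an IP protocol.
+
```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DSS_VLAN_HLEN	18	/* 2 * 6 address bytes + TPID + TCI + ethertype */
#define DSS_VLAN_BASE	256	/* DSS session A maps to VLAN ID 256 + A */

/*
 * Write a dummy VLAN ethernet header for DSS session 'dss_id' into 'buf'.
 * Returns the header length, or 0 if 'buf' is too small.
 */
size_t dss_build_header(uint8_t *buf, size_t len, uint8_t dss_id)
{
	uint16_t vid = DSS_VLAN_BASE + dss_id;

	if (len < DSS_VLAN_HLEN)
		return 0;
	memset(buf, 0, 12);		/* addresses: arbitrary on TX (assumption: zero) */
	buf[12] = 0x81;			/* TPID 0x8100 (802.1Q), network byte order */
	buf[13] = 0x00;
	buf[14] = vid >> 8;		/* TCI: priority 0, VLAN ID in the low 12 bits */
	buf[15] = vid & 0xff;
	buf[16] = 0x00;			/* ethertype 0x0001 (ETH_P_802_3); any non-IP */
	buf[17] = 0x01;			/* value avoids the TX drop rule listed above */
	return DSS_VLAN_HLEN;
}
```
+
+The DSS payload would then follow directly after these 18 bytes in the
+buffer handed to the packet socket.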
+
+
+
+Tagged IP session 0 VLAN
+------------------------
+As described above, MBIM IP session 0 is treated as special by the
+driver. It is initially mapped to untagged frames on the wwanY
+network device.
+
+This mapping implies a few restrictions on multiplexed IPS and DSS
+sessions, which may not always be practical:
+ - no IPS or DSS session can use a frame size greater than the MTU on
+ IP session 0
+ - no IPS or DSS session can be in the up state unless the network
+ device representing IP session 0 is also up
+
+These problems can be avoided by optionally making the driver map IP
+session 0 to a VLAN subdevice, similar to all other IP sessions. This
+behaviour is triggered by adding a VLAN link for the magic VLAN ID
+4094. The driver will then immediately start mapping MBIM IP session
+0 to this VLAN, and will drop untagged frames on the master wwanY
+device.
+
+Tip: It might be less confusing to the end user to name this VLAN
+subdevice after the MBIM SessionID instead of the VLAN ID. For
+example:
+
+ ip link add link wwan0 name wwan0.0 type vlan id 4094
+
+
+VLAN mapping
+------------
+
+Summarizing the cdc_mbim driver mapping described above, we have this
+relationship between VLAN tags on the wwanY network device and MBIM
+sessions on the shared USB data channel:
+
+  VLAN ID       MBIM type   MBIM SessionID           Notes
+  ---------------------------------------------------------
+  untagged      IPS         0                        a)
+  1 - 255       IPS         1 - 255 <VLANID>
+  256 - 511     DSS         0 - 255 <VLANID - 256>
+  512 - 4093                                         b)
+  4094          IPS         0                        c)
+
+    a) if no VLAN ID 4094 link exists, else dropped
+    b) unsupported VLAN range, unconditionally dropped
+    c) if a VLAN ID 4094 link exists, else dropped
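+
+The table can also be read as a small lookup function. The following is
+a sketch for illustration only, not the driver's code: all type and
+function names are hypothetical, and treating VLAN IDs the table does
+not list (0 and 4095) as dropped is an assumption.
+
```c
#include <stdbool.h>
#include <stdint.h>

enum mbim_session_type { MBIM_NONE, MBIM_IPS, MBIM_DSS };	/* hypothetical */

struct mbim_session {
	enum mbim_session_type type;	/* MBIM_NONE: frame is dropped */
	uint8_t id;			/* MBIM SessionId */
};

/*
 * Map a frame on wwanY to an MBIM session according to the table above.
 * 'tagged' is false for untagged frames, 'have_4094' tells whether a
 * VLAN ID 4094 link exists.
 */
struct mbim_session mbim_map_vlan(bool tagged, uint16_t vid, bool have_4094)
{
	struct mbim_session s = { MBIM_NONE, 0 };

	if (!tagged) {
		if (!have_4094)
			s.type = MBIM_IPS;	/* a) IP session 0 */
	} else if (vid >= 1 && vid <= 255) {
		s.type = MBIM_IPS;
		s.id = vid;
	} else if (vid >= 256 && vid <= 511) {
		s.type = MBIM_DSS;
		s.id = vid - 256;
	} else if (vid == 4094 && have_4094) {
		s.type = MBIM_IPS;		/* c) IP session 0 */
	}
	/* b) 512 - 4093 and unlisted IDs fall through as MBIM_NONE */
	return s;
}
```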
+
+
+
+
+References
+==========
+
+[1] USB Implementers Forum, Inc. - "Universal Serial Bus
+ Communications Class Subclass Specification for Mobile Broadband
+ Interface Model", Revision 1.0 (Errata 1), May 1, 2013
+ - http://www.usb.org/developers/docs/devclass_docs/
+
+[2] USB Implementers Forum, Inc. - "Universal Serial Bus
+ Communications Class Subclass Specifications for Network Control
+ Model Devices", Revision 1.0 (Errata 1), November 24, 2010
+ - http://www.usb.org/developers/docs/devclass_docs/
+
+[3] libmbim - "a glib-based library for talking to WWAN modems and
+ devices which speak the Mobile Interface Broadband Model (MBIM)
+ protocol"
+ - http://www.freedesktop.org/wiki/Software/libmbim/
+
+[4] ModemManager - "a DBus-activated daemon which controls mobile
+ broadband (2G/3G/4G) devices and connections"
+ - http://www.freedesktop.org/wiki/Software/ModemManager/
+
+[5] "MBIM (Mobile Broadband Interface Model) Registry"
+ - http://compliance.usb.org/mbim/
+
+[6] "/proc/bus/usb filesystem output"
+ - Documentation/usb/proc_usb_info.txt
+
+[7] "/sys/bus/usb/devices/.../descriptors"
+ - Documentation/ABI/stable/sysfs-bus-usb
diff --git a/Documentation/networking/dccp.txt b/Documentation/networking/dccp.txt
index d718bc2ff1c..55c575fcaf1 100644
--- a/Documentation/networking/dccp.txt
+++ b/Documentation/networking/dccp.txt
@@ -18,8 +18,8 @@ Introduction
Datagram Congestion Control Protocol (DCCP) is an unreliable, connection
oriented protocol designed to solve issues present in UDP and TCP, particularly
for real-time and multimedia (streaming) traffic.
-It divides into a base protocol (RFC 4340) and plugable congestion control
-modules called CCIDs. Like plugable TCP congestion control, at least one CCID
+It divides into a base protocol (RFC 4340) and pluggable congestion control
+modules called CCIDs. Like pluggable TCP congestion control, at least one CCID
needs to be enabled in order for the protocol to function properly. In the Linux
implementation, this is the TCP-like CCID2 (RFC 4341). Additional CCIDs, such as
the TCP-friendly CCID3 (RFC 4342), are optional.
@@ -86,7 +86,7 @@ built-in CCIDs.
DCCP_SOCKOPT_CCID is write-only and sets both the TX and RX CCIDs at the same
time, combining the operation of the next two socket options. This option is
-preferrable over the latter two, since often applications will use the same
+preferable over the latter two, since often applications will use the same
type of CCID for both directions; and mixed use of CCIDs is not currently well
understood. This socket option takes as argument at least one uint8_t value, or
an array of uint8_t values, which must match available CCIDS (see above). CCIDs
diff --git a/Documentation/networking/e100.txt b/Documentation/networking/e100.txt
index 13a32124bca..f862cf3aff3 100644
--- a/Documentation/networking/e100.txt
+++ b/Documentation/networking/e100.txt
@@ -103,7 +103,7 @@ Additional Configurations
PRO/100 Family of Adapters is e100.
As an example, if you install the e100 driver for two PRO/100 adapters
- (eth0 and eth1), add the following to a configuraton file in /etc/modprobe.d/
+ (eth0 and eth1), add the following to a configuration file in /etc/modprobe.d/
alias eth0 e100
alias eth1 e100
diff --git a/Documentation/networking/filter.txt b/Documentation/networking/filter.txt
index cdb3e40b9d1..ee78eba78a9 100644
--- a/Documentation/networking/filter.txt
+++ b/Documentation/networking/filter.txt
@@ -1,49 +1,1027 @@
-filter.txt: Linux Socket Filtering
-Written by: Jay Schulist <jschlst@samba.org>
+Linux Socket Filtering aka Berkeley Packet Filter (BPF)
+=======================================================
Introduction
-============
-
- Linux Socket Filtering is derived from the Berkeley
-Packet Filter. There are some distinct differences between
-the BSD and Linux Kernel Filtering.
-
-Linux Socket Filtering (LSF) allows a user-space program to
-attach a filter onto any socket and allow or disallow certain
-types of data to come through the socket. LSF follows exactly
-the same filter code structure as the BSD Berkeley Packet Filter
-(BPF), so referring to the BSD bpf.4 manpage is very helpful in
-creating filters.
-
-LSF is much simpler than BPF. One does not have to worry about
-devices or anything like that. You simply create your filter
-code, send it to the kernel via the SO_ATTACH_FILTER option and
-if your filter code passes the kernel check on it, you then
-immediately begin filtering data on that socket.
-
-You can also detach filters from your socket via the
-SO_DETACH_FILTER option. This will probably not be used much
-since when you close a socket that has a filter on it the
-filter is automagically removed. The other less common case
-may be adding a different filter on the same socket where you had another
-filter that is still running: the kernel takes care of removing
-the old one and placing your new one in its place, assuming your
-filter has passed the checks, otherwise if it fails the old filter
-will remain on that socket.
-
-SO_LOCK_FILTER option allows to lock the filter attached to a
-socket. Once set, a filter cannot be removed or changed. This allows
-one process to setup a socket, attach a filter, lock it then drop
-privileges and be assured that the filter will be kept until the
-socket is closed.
-
-Examples
-========
-
-Ioctls-
-setsockopt(sockfd, SOL_SOCKET, SO_ATTACH_FILTER, &Filter, sizeof(Filter));
-setsockopt(sockfd, SOL_SOCKET, SO_DETACH_FILTER, &value, sizeof(value));
-setsockopt(sockfd, SOL_SOCKET, SO_LOCK_FILTER, &value, sizeof(value));
-
-See the BSD bpf.4 manpage and the BSD Packet Filter paper written by
-Steven McCanne and Van Jacobson of Lawrence Berkeley Laboratory.
+------------
+
+Linux Socket Filtering (LSF) is derived from the Berkeley Packet Filter.
+Though there are some distinct differences between the BSD and Linux
+kernel filtering, when we speak of BPF or LSF in Linux context, we
+mean the very same mechanism of filtering in the Linux kernel.
+
+BPF allows a user-space program to attach a filter onto any socket and
+allow or disallow certain types of data to come through the socket. LSF
+follows exactly the same filter code structure as BSD's BPF, so referring
+to the BSD bpf.4 manpage is very helpful in creating filters.
+
+On Linux, BPF is much simpler than on BSD. One does not have to worry
+about devices or anything like that. You simply create your filter code,
+send it to the kernel via the SO_ATTACH_FILTER option and if your filter
+code passes the kernel check on it, you then immediately begin filtering
+data on that socket.
+
+You can also detach filters from your socket via the SO_DETACH_FILTER
+option. This will probably not be used much since when you close a socket
+that has a filter on it the filter is automagically removed. The other
+less common case may be adding a different filter on the same socket where
+you had another filter that is still running: the kernel takes care of
+removing the old one and placing your new one in its place, assuming your
+filter has passed the checks, otherwise if it fails the old filter will
+remain on that socket.
+
+The SO_LOCK_FILTER option allows locking the filter attached to a socket.
+Once set, a filter cannot be removed or changed. This allows one process to
+set up a socket, attach a filter, lock it then drop privileges and be
+assured that the filter will be kept until the socket is closed.
+
+The biggest user of this construct might be libpcap. Issuing a high-level
+filter command like `tcpdump -i em1 port 22` passes through the libpcap
+internal compiler that generates a structure that can eventually be loaded
+via SO_ATTACH_FILTER to the kernel. `tcpdump -i em1 port 22 -ddd`
+displays what is being placed into this structure.
+
+Although we were only speaking about sockets here, BPF in Linux is used
+in many more places. There's xt_bpf for netfilter, cls_bpf in the kernel
+qdisc layer, SECCOMP-BPF (SECure COMPuting [1]), and lots of other places
+such as the team driver, PTP code, etc. where BPF is being used.
+
+ [1] Documentation/prctl/seccomp_filter.txt
+
+Original BPF paper:
+
+Steven McCanne and Van Jacobson. 1993. The BSD packet filter: a new
+architecture for user-level packet capture. In Proceedings of the
+USENIX Winter 1993 Conference (USENIX'93). USENIX Association, Berkeley,
+CA, USA, 2-2. [http://www.tcpdump.org/papers/bpf-usenix93.pdf]
+
+Structure
+---------
+
+User space applications include <linux/filter.h> which contains the
+following relevant structures:
+
+struct sock_filter { /* Filter block */
+ __u16 code; /* Actual filter code */
+ __u8 jt; /* Jump true */
+ __u8 jf; /* Jump false */
+ __u32 k; /* Generic multiuse field */
+};
+
+Such a structure is assembled as an array of 4-tuples, each of which
+contains a code, jt, jf and k value. jt and jf are jump offsets and k is a
+generic value to be used for a provided code.
+
+struct sock_fprog { /* Required for SO_ATTACH_FILTER. */
+ unsigned short len; /* Number of filter blocks */
+ struct sock_filter __user *filter;
+};
+
+For socket filtering, a pointer to this structure (as shown in the
+follow-up example) is passed to the kernel through setsockopt(2).
+
+Example
+-------
+
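+#include <sys/socket.h>
+#include <sys/types.h>
+#include <unistd.h>
+#include <arpa/inet.h>
+#include <linux/if_ether.h>
+#include <linux/filter.h>
+/* ... */
+
+/* ARRAY_SIZE() is a kernel macro; user space has to define it itself */
+#define ARRAY_SIZE(x) (sizeof(x) / sizeof((x)[0]))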
+
+/* From the example above: tcpdump -i em1 port 22 -dd */
+struct sock_filter code[] = {
+ { 0x28, 0, 0, 0x0000000c },
+ { 0x15, 0, 8, 0x000086dd },
+ { 0x30, 0, 0, 0x00000014 },
+ { 0x15, 2, 0, 0x00000084 },
+ { 0x15, 1, 0, 0x00000006 },
+ { 0x15, 0, 17, 0x00000011 },
+ { 0x28, 0, 0, 0x00000036 },
+ { 0x15, 14, 0, 0x00000016 },
+ { 0x28, 0, 0, 0x00000038 },
+ { 0x15, 12, 13, 0x00000016 },
+ { 0x15, 0, 12, 0x00000800 },
+ { 0x30, 0, 0, 0x00000017 },
+ { 0x15, 2, 0, 0x00000084 },
+ { 0x15, 1, 0, 0x00000006 },
+ { 0x15, 0, 8, 0x00000011 },
+ { 0x28, 0, 0, 0x00000014 },
+ { 0x45, 6, 0, 0x00001fff },
+ { 0xb1, 0, 0, 0x0000000e },
+ { 0x48, 0, 0, 0x0000000e },
+ { 0x15, 2, 0, 0x00000016 },
+ { 0x48, 0, 0, 0x00000010 },
+ { 0x15, 0, 1, 0x00000016 },
+ { 0x06, 0, 0, 0x0000ffff },
+ { 0x06, 0, 0, 0x00000000 },
+};
+
+struct sock_fprog bpf = {
+ .len = ARRAY_SIZE(code),
+ .filter = code,
+};
+
+sock = socket(PF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
+if (sock < 0)
+ /* ... bail out ... */
+
+ret = setsockopt(sock, SOL_SOCKET, SO_ATTACH_FILTER, &bpf, sizeof(bpf));
+if (ret < 0)
+ /* ... bail out ... */
+
+/* ... */
+close(sock);
+
+The above example code attaches a socket filter for a PF_PACKET socket
+in order to let all IPv4/IPv6 packets with port 22 pass. The rest will
+be dropped for this socket.
+
+The setsockopt(2) call for SO_DETACH_FILTER doesn't need any arguments,
+and SO_LOCK_FILTER, which prevents the filter from being detached, takes
+an integer value of 0 or 1.
+
+Note that socket filters are not restricted to PF_PACKET sockets only,
+but can also be used on other socket families.
+
+Summary of system calls:
+
+ * setsockopt(sockfd, SOL_SOCKET, SO_ATTACH_FILTER, &val, sizeof(val));
+ * setsockopt(sockfd, SOL_SOCKET, SO_DETACH_FILTER, &val, sizeof(val));
+ * setsockopt(sockfd, SOL_SOCKET, SO_LOCK_FILTER, &val, sizeof(val));
+
+Normally, most use cases for socket filtering on packet sockets will be
+covered by libpcap in high-level syntax, so as an application developer
+you should stick to that. libpcap wraps its own layer around all that.
+
+Writing such a filter "by hand" can be an alternative if i) using/linking
+to libpcap is not an option, ii) the required BPF filters use Linux
+extensions that are not supported by libpcap's compiler, iii) a filter
+might be more complex and not cleanly implementable with libpcap's
+compiler, or iv) particular filter codes should be optimized differently
+than libpcap's internal compiler does. For example,
+xt_bpf and cls_bpf users might have requirements that could result in
+more complex filter code, or one that cannot be expressed with libpcap
+(e.g. different return codes for various code paths). Moreover, BPF JIT
+implementors may wish to manually write test cases and thus need low-level
+access to BPF code as well.
+
+BPF engine and instruction set
+------------------------------
+
+Under tools/net/ there's a small helper tool called bpf_asm which can
+be used to write low-level filters for example scenarios mentioned in the
+previous section. Asm-like syntax mentioned here has been implemented in
+bpf_asm and will be used for further explanations (instead of dealing with
+less readable opcodes directly, principles are the same). The syntax is
+closely modelled after Steven McCanne's and Van Jacobson's BPF paper.
+
+The BPF architecture consists of the following basic elements:
+
+ Element Description
+
+ A 32 bit wide accumulator
+ X 32 bit wide X register
+ M[] 16 x 32 bit wide misc registers aka "scratch memory
+ store", addressable from 0 to 15
+
+A program that is translated by bpf_asm into "opcodes" is an array that
+consists of the following elements (as already mentioned):
+
+ op:16, jt:8, jf:8, k:32
+
+The element op is a 16 bit wide opcode that has a particular instruction
+encoded. jt and jf are two 8 bit wide jump targets, one for condition
+"jump if true", the other one "jump if false". Finally, element k
+contains a miscellaneous argument that can be interpreted in different
+ways depending on the given instruction in op.
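+
+As a rough sketch of how the op element is laid out, the following
+helpers split an opcode into its bit fields. The names OP_CLASS(),
+OP_SIZE(), OP_MODE(), OP_JMPOP() and bpf_class_name() are made up for
+this illustration; the masks are meant to mirror BPF_CLASS(), BPF_SIZE(),
+BPF_MODE() and BPF_OP() from the kernel headers.
+
```c
#include <stdint.h>

#define OP_CLASS(code)	((code) & 0x07)	/* instruction class, see names[] below */
#define OP_SIZE(code)	((code) & 0x18)	/* 0x00 word, 0x08 half-word, 0x10 byte */
#define OP_MODE(code)	((code) & 0xe0)	/* 0x00 imm, 0x20 abs, 0x40 ind, 0x60 mem */
#define OP_JMPOP(code)	((code) & 0xf0)	/* 0x00 ja, 0x10 jeq, 0x20 jgt, 0x30 jge */

/* Return a short name for the instruction class encoded in 'code'. */
const char *bpf_class_name(uint16_t code)
{
	static const char *const names[] = {
		"ld", "ldx", "st", "stx", "alu", "jmp", "ret", "misc",
	};

	return names[OP_CLASS(code)];
}
```
+
+With these, 0x28 decomposes into class ld, size half-word (0x08) and
+mode abs (0x20), and 0x15 into class jmp with operation jeq (0x10).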
+
+The instruction set consists of load, store, branch, alu, miscellaneous
+and return instructions that are also represented in bpf_asm syntax. This
+table lists all available bpf_asm instructions and what their underlying
+opcodes, as defined in linux/filter.h, stand for:
+
+  Instruction      Addressing mode      Description
+
+  ld               1, 2, 3, 4, 10       Load word into A
+  ldi              4                    Load word into A
+  ldh              1, 2                 Load half-word into A
+  ldb              1, 2                 Load byte into A
+  ldx              3, 4, 5, 10          Load word into X
+  ldxi             4                    Load word into X
+  ldxb             5                    Load byte into X
+
+  st               3                    Store A into M[]
+  stx              3                    Store X into M[]
+
+  jmp              6                    Jump to label
+  ja               6                    Jump to label
+  jeq              7, 8                 Jump on A == k
+  jneq             8                    Jump on A != k
+  jne              8                    Jump on A != k
+  jlt              8                    Jump on A < k
+  jle              8                    Jump on A <= k
+  jgt              7, 8                 Jump on A > k
+  jge              7, 8                 Jump on A >= k
+  jset             7, 8                 Jump on A & k
+
+  add              0, 4                 A + <x>
+  sub              0, 4                 A - <x>
+  mul              0, 4                 A * <x>
+  div              0, 4                 A / <x>
+  mod              0, 4                 A % <x>
+  neg                                   -A
+  and              0, 4                 A & <x>
+  or               0, 4                 A | <x>
+  xor              0, 4                 A ^ <x>
+  lsh              0, 4                 A << <x>
+  rsh              0, 4                 A >> <x>
+
+  tax                                   Copy A into X
+  txa                                   Copy X into A
+
+  ret              4, 9                 Return
+
+The next table shows addressing formats from the 2nd column:
+
+  Addressing mode  Syntax               Description
+
+   0               x/%x                 Register X
+   1               [k]                  BHW at byte offset k in the packet
+   2               [x + k]              BHW at the offset X + k in the packet
+   3               M[k]                 Word at offset k in M[]
+   4               #k                   Literal value stored in k
+   5               4*([k]&0xf)          Lower nibble * 4 at byte offset k in the packet
+   6               L                    Jump label L
+   7               #k,Lt,Lf             Jump to Lt if true, otherwise jump to Lf
+   8               #k,Lt                Jump to Lt if predicate is true
+   9               a/%a                 Accumulator A
+  10               extension            BPF extension
+
+The Linux kernel also has a couple of BPF extensions that are used along
+with the class of load instructions by "overloading" the k argument with
+a negative offset + a particular extension offset. The result of such BPF
+extensions is loaded into A.
+
+Possible BPF extensions are shown in the following table:
+
+  Extension                             Description
+
+  len                                   skb->len
+  proto                                 skb->protocol
+  type                                  skb->pkt_type
+  poff                                  Payload start offset
+  ifidx                                 skb->dev->ifindex
+  nla                                   Netlink attribute of type X with offset A
+  nlan                                  Nested Netlink attribute of type X with offset A
+  mark                                  skb->mark
+  queue                                 skb->queue_mapping
+  hatype                                skb->dev->type
+  rxhash                                skb->hash
+  cpu                                   raw_smp_processor_id()
+  vlan_tci                              vlan_tx_tag_get(skb)
+  vlan_pr                               vlan_tx_tag_present(skb)
+  rand                                  prandom_u32()
+
+These extensions can also be prefixed with '#'.
+
+Examples for low-level BPF:
+
+** ARP packets:
+
+ ldh [12]
+ jne #0x806, drop
+ ret #-1
+ drop: ret #0
+
+** IPv4 TCP packets:
+
+ ldh [12]
+ jne #0x800, drop
+ ldb [23]
+ jneq #6, drop
+ ret #-1
+ drop: ret #0
+
+** (Accelerated) VLAN w/ id 10:
+
+ ld vlan_tci
+ jneq #10, drop
+ ret #-1
+ drop: ret #0
+
+** ICMP random packet sampling, 1 in 4:
+
+ ldh [12]
+ jne #0x800, drop
+ ldb [23]
+ jneq #1, drop
+ # get a random uint32 number
+ ld rand
+ mod #4
+ jneq #1, drop
+ ret #-1
+ drop: ret #0
+
+** SECCOMP filter example:
+
+ ld [4] /* offsetof(struct seccomp_data, arch) */
+ jne #0xc000003e, bad /* AUDIT_ARCH_X86_64 */
+ ld [0] /* offsetof(struct seccomp_data, nr) */
+ jeq #15, good /* __NR_rt_sigreturn */
+ jeq #231, good /* __NR_exit_group */
+ jeq #60, good /* __NR_exit */
+ jeq #0, good /* __NR_read */
+ jeq #1, good /* __NR_write */
+ jeq #5, good /* __NR_fstat */
+ jeq #9, good /* __NR_mmap */
+ jeq #14, good /* __NR_rt_sigprocmask */
+ jeq #13, good /* __NR_rt_sigaction */
+ jeq #35, good /* __NR_nanosleep */
+ bad: ret #0 /* SECCOMP_RET_KILL */
+ good: ret #0x7fff0000 /* SECCOMP_RET_ALLOW */
+
+The above example code can be placed into a file (here called "foo"), and
+then be passed to the bpf_asm tool for generating opcodes, output that xt_bpf
+and cls_bpf understand and that can be loaded directly. Example with above
+ARP code:
+
+$ ./bpf_asm foo
+4,40 0 0 12,21 0 1 2054,6 0 0 4294967295,6 0 0 0,
+
+In copy and paste C-like output:
+
+$ ./bpf_asm -c foo
+{ 0x28, 0, 0, 0x0000000c },
+{ 0x15, 0, 1, 0x00000806 },
+{ 0x06, 0, 0, 0xffffffff },
+{ 0x06, 0, 0, 0000000000 },
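+
+To see how such opcodes are evaluated, a deliberately tiny interpreter
+is sketched below. It is not the kernel's implementation: struct insn
+and mini_bpf_run() are hypothetical names, only the four opcodes that
+appear in the ARP and ICMP examples are handled, and rejecting the
+packet on an out-of-bounds load or an unknown opcode is a simplifying
+assumption.
+
```c
#include <stddef.h>
#include <stdint.h>

struct insn {			/* same fields as struct sock_filter */
	uint16_t code;
	uint8_t jt, jf;
	uint32_t k;
};

/*
 * Run program 'prog' of 'n' instructions over packet 'pkt' of 'len'
 * bytes and return the filter's return value (0 means drop).
 */
uint32_t mini_bpf_run(const struct insn *prog, size_t n,
		      const uint8_t *pkt, size_t len)
{
	uint32_t A = 0;		/* accumulator */
	size_t pc = 0;		/* program counter */

	while (pc < n) {
		const struct insn *i = &prog[pc++];

		switch (i->code) {
		case 0x28:	/* ldh [k]: big endian half-word at offset k */
			if (len < 2 || i->k > len - 2)
				return 0;
			A = (uint32_t)pkt[i->k] << 8 | pkt[i->k + 1];
			break;
		case 0x30:	/* ldb [k] */
			if (i->k >= len)
				return 0;
			A = pkt[i->k];
			break;
		case 0x15:	/* jeq #k: jt/jf are relative to the next insn */
			pc += (A == i->k) ? i->jt : i->jf;
			break;
		case 0x06:	/* ret #k */
			return i->k;
		default:	/* opcode not covered by this sketch */
			return 0;
		}
	}
	return 0;
}
```
+
+Fed with the four instructions printed above, this returns 0xffffffff
+for a frame whose bytes 12 and 13 are 08 06 (ARP) and 0 otherwise.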
+
+In particular, as usage with xt_bpf or cls_bpf can result in more complex BPF
+filters that might not be obvious at first, it's good to test filters before
+attaching to a live system. For that purpose, there's a small tool called
+bpf_dbg under tools/net/ in the kernel source directory. This debugger allows
+for testing BPF filters against given pcap files, single stepping through the
+BPF code on the pcap's packets, and dumping BPF machine registers.
+
+Starting bpf_dbg is trivial and just requires issuing:
+
+# ./bpf_dbg
+
+In case input and output do not equal stdin/stdout, bpf_dbg takes an
+alternative stdin source as a first argument, and an alternative stdout
+sink as a second one, e.g. `./bpf_dbg test_in.txt test_out.txt`.
+
+Other than that, a particular libreadline configuration can be set via
+file "~/.bpf_dbg_init" and the command history is stored in the file
+"~/.bpf_dbg_history".
+
+Interaction in bpf_dbg happens through a shell that also has auto-completion
+support (follow-up example commands starting with '>' denote bpf_dbg shell).
+The usual workflow would be to ...
+
+> load bpf 6,40 0 0 12,21 0 3 2048,48 0 0 23,21 0 1 1,6 0 0 65535,6 0 0 0
+ Loads a BPF filter from standard output of bpf_asm, or transformed via
+ e.g. `tcpdump -iem1 -ddd port 22 | tr '\n' ','`. Note that for JIT
+ debugging (next section), this command creates a temporary socket and
+ loads the BPF code into the kernel. Thus, this will also be useful for
+ JIT developers.
+
+> load pcap foo.pcap
+ Loads standard tcpdump pcap file.
+
+> run [<n>]
+bpf passes:1 fails:9
+ Runs through all packets from a pcap to count how many passes and fails
+ the filter will generate. A limit of packets to traverse can be given.
+
+> disassemble
+l0: ldh [12]
+l1: jeq #0x800, l2, l5
+l2: ldb [23]
+l3: jeq #0x1, l4, l5
+l4: ret #0xffff
+l5: ret #0
+ Prints out BPF code disassembly.
+
+> dump
+/* { op, jt, jf, k }, */
+{ 0x28, 0, 0, 0x0000000c },
+{ 0x15, 0, 3, 0x00000800 },
+{ 0x30, 0, 0, 0x00000017 },
+{ 0x15, 0, 1, 0x00000001 },
+{ 0x06, 0, 0, 0x0000ffff },
+{ 0x06, 0, 0, 0000000000 },
+ Prints out C-style BPF code dump.
+
+> breakpoint 0
+breakpoint at: l0: ldh [12]
+> breakpoint 1
+breakpoint at: l1: jeq #0x800, l2, l5
+ ...
+ Sets breakpoints at particular BPF instructions. Issuing a `run` command
+ will walk through the pcap file continuing from the current packet and
+ break when a breakpoint is being hit (another `run` will continue from
+ the currently active breakpoint executing next instructions):
+
+ > run
+ -- register dump --
+ pc: [0] <-- program counter
+ code: [40] jt[0] jf[0] k[12] <-- plain BPF code of current instruction
+ curr: l0: ldh [12] <-- disassembly of current instruction
+ A: [00000000][0] <-- content of A (hex, decimal)
+ X: [00000000][0] <-- content of X (hex, decimal)
+ M[0,15]: [00000000][0] <-- folded content of M (hex, decimal)
+ -- packet dump -- <-- Current packet from pcap (hex)
+ len: 42
+ 0: 00 19 cb 55 55 a4 00 14 a4 43 78 69 08 06 00 01
+ 16: 08 00 06 04 00 01 00 14 a4 43 78 69 0a 3b 01 26
+ 32: 00 00 00 00 00 00 0a 3b 01 01
+ (breakpoint)
+ >
+
+> breakpoint
+breakpoints: 0 1
+ Prints currently set breakpoints.
+
+> step [-<n>, +<n>]
+ Performs single stepping through the BPF program from the current pc
+ offset. Thus, on each step invocation, above register dump is issued.
+ This can go forwards and backwards in time, a plain `step` will break
+ on the next BPF instruction, thus +1. (No `run` needs to be issued here.)
+
+> select <n>
+ Selects a given packet from the pcap file to continue from. Thus, on
+ the next `run` or `step`, the BPF program is being evaluated against
+ the user pre-selected packet. Numbering starts just as in Wireshark
+ with index 1.
+
+> quit
+#
+ Exits bpf_dbg.
+
+JIT compiler
+------------
+
+The Linux kernel has a built-in BPF JIT compiler for x86_64, SPARC, PowerPC,
+ARM and s390 and can be enabled through CONFIG_BPF_JIT. The JIT compiler is
+transparently invoked for each attached filter from user space or for internal
+kernel users if it has been previously enabled by root:
+
+ echo 1 > /proc/sys/net/core/bpf_jit_enable
+
+For JIT developers, doing audits etc, each compile run can output the generated
+opcode image into the kernel log via:
+
+ echo 2 > /proc/sys/net/core/bpf_jit_enable
+
+Example output from dmesg:
+
+[ 3389.935842] flen=6 proglen=70 pass=3 image=ffffffffa0069c8f
+[ 3389.935847] JIT code: 00000000: 55 48 89 e5 48 83 ec 60 48 89 5d f8 44 8b 4f 68
+[ 3389.935849] JIT code: 00000010: 44 2b 4f 6c 4c 8b 87 d8 00 00 00 be 0c 00 00 00
+[ 3389.935850] JIT code: 00000020: e8 1d 94 ff e0 3d 00 08 00 00 75 16 be 17 00 00
+[ 3389.935851] JIT code: 00000030: 00 e8 28 94 ff e0 83 f8 01 75 07 b8 ff ff 00 00
+[ 3389.935852] JIT code: 00000040: eb 02 31 c0 c9 c3
+
+In the kernel source tree under tools/net/, there's bpf_jit_disasm for
+generating disassembly out of the kernel log's hexdump:
+
+# ./bpf_jit_disasm
+70 bytes emitted from JIT compiler (pass:3, flen:6)
+ffffffffa0069c8f + <x>:
+ 0: push %rbp
+ 1: mov %rsp,%rbp
+ 4: sub $0x60,%rsp
+ 8: mov %rbx,-0x8(%rbp)
+ c: mov 0x68(%rdi),%r9d
+ 10: sub 0x6c(%rdi),%r9d
+ 14: mov 0xd8(%rdi),%r8
+ 1b: mov $0xc,%esi
+ 20: callq 0xffffffffe0ff9442
+ 25: cmp $0x800,%eax
+ 2a: jne 0x0000000000000042
+ 2c: mov $0x17,%esi
+ 31: callq 0xffffffffe0ff945e
+ 36: cmp $0x1,%eax
+ 39: jne 0x0000000000000042
+ 3b: mov $0xffff,%eax
+ 40: jmp 0x0000000000000044
+ 42: xor %eax,%eax
+ 44: leaveq
+ 45: retq
+
+Issuing option `-o` will "annotate" opcodes to resulting assembler
+instructions, which can be very useful for JIT developers:
+
+# ./bpf_jit_disasm -o
+70 bytes emitted from JIT compiler (pass:3, flen:6)
+ffffffffa0069c8f + <x>:
+ 0: push %rbp
+ 55
+ 1: mov %rsp,%rbp
+ 48 89 e5
+ 4: sub $0x60,%rsp
+ 48 83 ec 60
+ 8: mov %rbx,-0x8(%rbp)
+ 48 89 5d f8
+ c: mov 0x68(%rdi),%r9d
+ 44 8b 4f 68
+ 10: sub 0x6c(%rdi),%r9d
+ 44 2b 4f 6c
+ 14: mov 0xd8(%rdi),%r8
+ 4c 8b 87 d8 00 00 00
+ 1b: mov $0xc,%esi
+ be 0c 00 00 00
+ 20: callq 0xffffffffe0ff9442
+ e8 1d 94 ff e0
+ 25: cmp $0x800,%eax
+ 3d 00 08 00 00
+ 2a: jne 0x0000000000000042
+ 75 16
+ 2c: mov $0x17,%esi
+ be 17 00 00 00
+ 31: callq 0xffffffffe0ff945e
+ e8 28 94 ff e0
+ 36: cmp $0x1,%eax
+ 83 f8 01
+ 39: jne 0x0000000000000042
+ 75 07
+ 3b: mov $0xffff,%eax
+ b8 ff ff 00 00
+ 40: jmp 0x0000000000000044
+ eb 02
+ 42: xor %eax,%eax
+ 31 c0
+ 44: leaveq
+ c9
+ 45: retq
+ c3
+
+For BPF JIT developers, bpf_jit_disasm, bpf_asm and bpf_dbg provide a useful
+toolchain for developing and testing the kernel's JIT compiler.
+
+BPF kernel internals
+--------------------
+Internally, the kernel interpreter uses a different instruction set format
+that follows similar underlying principles to the BPF described in the previous
+paragraphs. However, the instruction set format is modelled
+closer to the underlying architecture to mimic native instruction sets, so
+that better performance can be achieved (more details later). This new
+ISA is called 'eBPF' or 'internal BPF' interchangeably. (Note: eBPF which
+originates from [e]xtended BPF is not the same as BPF extensions! While
+eBPF is an ISA, BPF extensions date back to classic BPF's 'overloading'
+of BPF_LD | BPF_{B,H,W} | BPF_ABS instruction.)
+
+It is designed to be JITed with one to one mapping, which can also open up
+the possibility for GCC/LLVM compilers to generate optimized eBPF code through
+an eBPF backend that performs almost as fast as natively compiled code.
+
+The new instruction set was originally designed with the possible goal in
+mind to write programs in "restricted C" and compile into eBPF with an optional
+GCC/LLVM backend, so that it can just-in-time map to modern 64-bit CPUs with
+minimal performance overhead over two steps, that is, C -> eBPF -> native code.
+
+Currently, the new format is being used for running user BPF programs, which
+includes seccomp BPF, classic socket filters, cls_bpf traffic classifier,
+team driver's classifier for its load-balancing mode, netfilter's xt_bpf
+extension, PTP dissector/classifier, and much more. They are all internally
+converted by the kernel into the new instruction set representation and run
+in the eBPF interpreter. For in-kernel handlers, this all works transparently
+by using sk_unattached_filter_create() for setting up the filter, resp.
+sk_unattached_filter_destroy() for destroying it. The macro
+SK_RUN_FILTER(filter, ctx) transparently invokes eBPF interpreter or JITed
+code to run the filter. 'filter' is a pointer to struct sk_filter that we
+got from sk_unattached_filter_create(), and 'ctx' the given context (e.g.
+skb pointer). All constraints and restrictions from sk_chk_filter() apply
+before a conversion to the new layout is being done behind the scenes!
+
+Currently, the classic BPF format is being used for JITing on most of the
+architectures. Only x86-64 performs JIT compilation from eBPF instruction set,
+however, future work will migrate other JIT compilers as well, so that they
+will profit from the very same benefits.
+
+Some core changes of the new internal format:
+
+- Number of registers increase from 2 to 10:
+
+ The old format had two registers A and X, and a hidden frame pointer. The
+ new layout extends this to be 10 internal registers and a read-only frame
+ pointer. Since 64-bit CPUs are passing arguments to functions via registers
+ the number of args from eBPF program to in-kernel function is restricted
+ to 5 and one register is used to accept return value from an in-kernel
+ function. Natively, x86_64 passes first 6 arguments in registers, aarch64/
+ sparcv9/mips64 have 7 - 8 registers for arguments; x86_64 has 6 callee saved
+ registers, and aarch64/sparcv9/mips64 have 11 or more callee saved registers.
+
+ Therefore, eBPF calling convention is defined as:
+
+ * R0 - return value from in-kernel function, and exit value for eBPF program
+ * R1 - R5 - arguments from eBPF program to in-kernel function
+ * R6 - R9 - callee saved registers that in-kernel function will preserve
+ * R10 - read-only frame pointer to access stack
+
+ Thus, all eBPF registers map one to one to HW registers on x86_64, aarch64,
+ etc, and eBPF calling convention maps directly to ABIs used by the kernel on
+ 64-bit architectures.
+
+ On 32-bit architectures, the JIT may map programs that use only 32-bit
+ arithmetic and may leave more complex programs to be interpreted.
+
+ R0 - R5 are scratch registers and the eBPF program needs to spill/fill them
+ if necessary across calls. Note that there is only one eBPF program (== one
+ eBPF main routine) and it cannot call other eBPF functions; it can only
+ call predefined in-kernel functions.
+
+- Register width increases from 32-bit to 64-bit:
+
+ Still, the semantics of the original 32-bit ALU operations are preserved
+ via 32-bit subregisters. All eBPF registers are 64-bit with 32-bit lower
+ subregisters that zero-extend into 64-bit if they are being written to.
+ That behavior maps directly to x86_64 and arm64 subregister definition, but
+ makes other JITs more difficult.
+
+ 32-bit architectures run 64-bit internal BPF programs via interpreter.
+ Their JITs may convert BPF programs that only use 32-bit subregisters into
+ native instruction set and let the rest be interpreted.
+
+ Operation is 64-bit because on 64-bit architectures pointers are also
+ 64-bit wide, and we want to pass 64-bit values in/out of kernel functions.
+ 32-bit eBPF registers would otherwise require defining a register-pair
+ ABI; a direct eBPF register to HW register mapping would then not be
+ possible, and the JIT would need to do combine/split/move operations for
+ every register in and out of the function, which is complex, bug prone and
+ slow. Another reason is the use of atomic 64-bit counters.
+
+- Conditional jt/jf targets replaced with jt/fall-through:
+
+ While the original design has constructs such as "if (cond) jump_true;
+ else jump_false;", they are replaced with alternative constructs like
+ "if (cond) jump_true; /* else fall-through */".
+
+- Introduces bpf_call insn and register passing convention for zero overhead
+ calls from/to other kernel functions:
+
+ Before an in-kernel function call, the internal BPF program needs to
+ place function arguments into R1 to R5 registers to satisfy calling
+ convention, then the interpreter will take them from registers and pass
+ to in-kernel function. If R1 - R5 registers are mapped to CPU registers
+ that are used for argument passing on given architecture, the JIT compiler
+ doesn't need to emit extra moves. Function arguments will be in the correct
+ registers and BPF_CALL instruction will be JITed as single 'call' HW
+ instruction. This calling convention was picked to cover common call
+ situations without performance penalty.
+
+ After an in-kernel function call, R1 - R5 are reset to unreadable and R0 has
+ a return value of the function. Since R6 - R9 are callee saved, their state
+ is preserved across the call.
+
+ For example, consider three C functions:
+
+ u64 f1() { return (*_f2)(1); }
+ u64 f2(u64 a) { return f3(a + 1, a); }
+ u64 f3(u64 a, u64 b) { return a - b; }
+
+ GCC can compile f1, f3 into x86_64:
+
+ f1:
+ movl $1, %edi
+ movq _f2(%rip), %rax
+ jmp *%rax
+ f3:
+ movq %rdi, %rax
+ subq %rsi, %rax
+ ret
+
+ Function f2 in eBPF may look like:
+
+ f2:
+ bpf_mov R2, R1
+ bpf_add R1, 1
+ bpf_call f3
+ bpf_exit
+
+ If f2 is JITed and the pointer is stored to '_f2', the calls f1 -> f2 -> f3
+ and returns will be seamless. Without JIT, the __sk_run_filter() interpreter
+ needs to be used to call into f2.
+
+ For practical reasons all eBPF programs have only one argument 'ctx' which is
+ already placed into R1 (e.g. on __sk_run_filter() startup) and the programs
+ can call kernel functions with up to 5 arguments. Calls with 6 or more arguments
+ are currently not supported, but these restrictions can be lifted if necessary
+ in the future.
+
+ On 64-bit architectures all registers map to HW registers one to one. For
+ example, x86_64 JIT compiler can map them as ...
+
+ R0 - rax
+ R1 - rdi
+ R2 - rsi
+ R3 - rdx
+ R4 - rcx
+ R5 - r8
+ R6 - rbx
+ R7 - r13
+ R8 - r14
+ R9 - r15
+ R10 - rbp
+
+ ... since x86_64 ABI mandates rdi, rsi, rdx, rcx, r8, r9 for argument passing
+ and rbx, r12 - r15 are callee saved.
+
+ Then the following internal BPF pseudo-program:
+
+ bpf_mov R6, R1 /* save ctx */
+ bpf_mov R2, 2
+ bpf_mov R3, 3
+ bpf_mov R4, 4
+ bpf_mov R5, 5
+ bpf_call foo
+ bpf_mov R7, R0 /* save foo() return value */
+ bpf_mov R1, R6 /* restore ctx for next call */
+ bpf_mov R2, 6
+ bpf_mov R3, 7
+ bpf_mov R4, 8
+ bpf_mov R5, 9
+ bpf_call bar
+ bpf_add R0, R7
+ bpf_exit
+
+ After JIT to x86_64 may look like:
+
+ push %rbp
+ mov %rsp,%rbp
+ sub $0x228,%rsp
+ mov %rbx,-0x228(%rbp)
+ mov %r13,-0x220(%rbp)
+ mov %rdi,%rbx
+ mov $0x2,%esi
+ mov $0x3,%edx
+ mov $0x4,%ecx
+ mov $0x5,%r8d
+ callq foo
+ mov %rax,%r13
+ mov %rbx,%rdi
+ mov $0x6,%esi
+ mov $0x7,%edx
+ mov $0x8,%ecx
+ mov $0x9,%r8d
+ callq bar
+ add %r13,%rax
+ mov -0x228(%rbp),%rbx
+ mov -0x220(%rbp),%r13
+ leaveq
+ retq
+
+ Which is in this example equivalent in C to:
+
+ u64 bpf_filter(u64 ctx)
+ {
+ return foo(ctx, 2, 3, 4, 5) + bar(ctx, 6, 7, 8, 9);
+ }
+
+ In-kernel functions foo() and bar() with prototype: u64 (*)(u64 arg1, u64
+ arg2, u64 arg3, u64 arg4, u64 arg5); will receive arguments in proper
+ registers and place their return value into '%rax' which is R0 in eBPF.
+ Prologue and epilogue are emitted by JIT and are implicit in the
+ interpreter. R0-R5 are scratch registers, so eBPF program needs to preserve
+ them across the calls as defined by calling convention.
+
+ For example the following program is invalid:
+
+ bpf_mov R1, 1
+ bpf_call foo
+ bpf_mov R0, R1
+ bpf_exit
+
+ After the call the registers R1-R5 contain junk values and cannot be read.
+ In the future an eBPF verifier can be used to validate internal BPF programs.
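+
The calling convention described in the list above can be modelled in plain C.
This is only a sketch under stated assumptions: the register file is a plain
array, foo() and bar() are hypothetical stand-ins with made-up bodies (the text
only gives their prototype), and ex_call()/ex_run() are illustrative names that
do not exist in the kernel. It mirrors the internal BPF pseudo-program shown
earlier, including the clobbering of R1 - R5 after each call.

```c
#include <stdint.h>

typedef uint64_t (*ex_helper_fn)(uint64_t, uint64_t, uint64_t,
				 uint64_t, uint64_t);

/* Hypothetical stand-ins for the in-kernel functions foo() and bar(). */
static uint64_t foo(uint64_t a1, uint64_t a2, uint64_t a3,
		    uint64_t a4, uint64_t a5)
{
	return a1 + a2 + a3 + a4 + a5;
}

static uint64_t bar(uint64_t a1, uint64_t a2, uint64_t a3,
		    uint64_t a4, uint64_t a5)
{
	return a1 * a2 + a3 + a4 + a5;
}

/*
 * Model of bpf_call: arguments are taken from R1 - R5, the result is
 * placed into R0, and R1 - R5 are overwritten with junk because they
 * are scratch registers. R6 - R9 are left untouched (callee saved).
 */
static void ex_call(uint64_t r[11], ex_helper_fn fn)
{
	int i;

	r[0] = fn(r[1], r[2], r[3], r[4], r[5]);
	for (i = 1; i <= 5; i++)
		r[i] = 0xdeadbeefdeadbeefULL;
}

/* Same instruction sequence as the pseudo-program in the text. */
static uint64_t ex_run(uint64_t ctx)
{
	uint64_t r[11] = { 0 };

	r[1] = ctx;			/* ctx arrives in R1 */
	r[6] = r[1];			/* bpf_mov R6, R1: save ctx */
	r[2] = 2; r[3] = 3; r[4] = 4; r[5] = 5;
	ex_call(r, foo);		/* bpf_call foo */
	r[7] = r[0];			/* save foo() return value */
	r[1] = r[6];			/* restore ctx for next call */
	r[2] = 6; r[3] = 7; r[4] = 8; r[5] = 9;
	ex_call(r, bar);		/* bpf_call bar */
	r[0] += r[7];			/* bpf_add R0, R7 */
	return r[0];			/* bpf_exit: R0 is the exit value */
}
```

With these stand-in bodies, ex_run(ctx) returns the same value as
foo(ctx, 2, 3, 4, 5) + bar(ctx, 6, 7, 8, 9). Dropping the "restore ctx" step
would pass the junk value to bar(), which is the class of bug the invalid
program above illustrates.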
+
+Also in the new design, eBPF is limited to 4096 insns, which means that any
+program will terminate quickly and will only call a fixed number of kernel
+functions. Original BPF and the new format are two operand instructions,
+which helps to do one-to-one mapping between eBPF insn and x86 insn during JIT.
+
+The input context pointer for invoking the interpreter function is generic;
+its content is defined by a specific use case. For seccomp, register R1 points
+to seccomp_data; for converted BPF filters, R1 points to an skb.
+
+A program that is translated internally consists of the following elements:
+
+ op:16, jt:8, jf:8, k:32 ==> op:8, dst_reg:4, src_reg:4, off:16, imm:32
+
+So far 87 internal BPF instructions were implemented. 8-bit 'op' opcode field
+has room for new instructions. Some of them may use 16/24/32 byte encoding. New
+instructions must be multiple of 8 bytes to preserve backward compatibility.
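+
As an illustration of the two layouts above, both instruction formats can be
written as 8-byte C structures. This is a sketch, not the kernel's definition:
the struct and function names are made up for this example, the signedness of
'off' and 'imm' is an assumption, and the bit-field placement relies on common
GCC/Clang behaviour.

```c
#include <stdint.h>

/* Classic BPF: op:16, jt:8, jf:8, k:32 */
struct ex_classic_insn {
	uint16_t code;
	uint8_t  jt;
	uint8_t  jf;
	uint32_t k;
};

/* New format: op:8, dst_reg:4, src_reg:4, off:16, imm:32 */
struct ex_ebpf_insn {
	uint8_t code;
	uint8_t dst_reg:4;
	uint8_t src_reg:4;
	int16_t off;	/* assumed signed */
	int32_t imm;	/* assumed signed */
};

/* Both formats occupy exactly 8 bytes per instruction. */
_Static_assert(sizeof(struct ex_classic_insn) == 8, "classic insn size");
_Static_assert(sizeof(struct ex_ebpf_insn) == 8, "eBPF insn size");

/* Build a 64-bit register-to-register move: BPF_ALU64 | BPF_MOV | BPF_X. */
static struct ex_ebpf_insn ex_mov64_reg(uint8_t dst, uint8_t src)
{
	struct ex_ebpf_insn insn = { 0 };

	insn.code = 0x07 | 0xb0 | 0x08;	/* class | operation | source */
	insn.dst_reg = dst & 0xf;
	insn.src_reg = src & 0xf;
	return insn;
}
```

For instance, ex_mov64_reg(6, 1) encodes the "bpf_mov R6, R1" step used to
save ctx in the earlier pseudo-program.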
+
+Internal BPF is a general purpose RISC instruction set. Not every register and
+every instruction are used during translation from original BPF to new format.
+For example, socket filters are not using the 'exclusive add' instruction, but
+tracing filters may do so to maintain counters of events. Register R9
+is not used by socket filters either, but more complex filters may run
+out of registers and would have to resort to spill/fill to stack.
+
+Internal BPF can be used as a generic assembler for last step performance
+optimizations; socket filters and seccomp are using it as an assembler. Tracing
+filters may use it as an assembler to generate code from the kernel. In-kernel
+usage may not be bounded by security considerations, since generated internal
+BPF code may be optimizing an internal code path and not be exposed to user
+space. Safety of internal BPF can come from a verifier (TBD). In such use cases
+as described, it may be used as a safe instruction set.
+
+Just like the original BPF, the new format runs within a controlled environment,
+is deterministic and the kernel can easily prove that. The safety of the program
+can be determined in two steps: first step does depth-first-search to disallow
+loops and other CFG validation; second step starts from the first insn and
+descends all possible paths. It simulates execution of every insn and observes
+the state change of registers and stack.
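+
The first of the two steps, a depth-first search that rejects loops, could
look roughly like the sketch below. It is not the kernel's verifier (which the
text marks as TBD): the instruction type is reduced to two successor indices,
all names are hypothetical, and the 64-instruction cap only keeps the example
small.

```c
#include <stddef.h>

/*
 * Hypothetical, simplified instruction: 'next' is the fall-through
 * successor and 'target' the jump target; -1 means "none" (e.g. exit).
 */
struct ex_insn {
	int next;
	int target;
};

enum { EX_UNSEEN, EX_ON_PATH, EX_DONE };

/* Returns 1 if everything reachable from insn i is loop-free. */
static int ex_visit(const struct ex_insn *prog, int len, int i,
		    unsigned char *state)
{
	int succ[2] = { prog[i].next, prog[i].target };
	int k;

	state[i] = EX_ON_PATH;
	for (k = 0; k < 2; k++) {
		int s = succ[k];

		if (s < 0)
			continue;		/* no successor on this edge */
		if (s >= len)
			return 0;		/* jump out of range */
		if (state[s] == EX_ON_PATH)
			return 0;		/* back edge: a loop */
		if (state[s] == EX_UNSEEN && !ex_visit(prog, len, s, state))
			return 0;
	}
	state[i] = EX_DONE;
	return 1;
}

/* Returns 1 if the program is loop-free and all jumps are in range. */
static int ex_cfg_ok(const struct ex_insn *prog, int len)
{
	unsigned char state[64] = { 0 };

	if (len <= 0 || len > 64)
		return 0;
	return ex_visit(prog, len, 0, state);
}

/* Two tiny sample programs: a forward-only one and one with a back edge. */
static int ex_demo_forward_only(void)
{
	static const struct ex_insn prog[] = {
		{ 1, -1 }, { 2, 3 }, { 3, -1 }, { -1, -1 },
	};

	return ex_cfg_ok(prog, 4);
}

static int ex_demo_back_edge(void)
{
	static const struct ex_insn prog[] = {
		{ 1, -1 }, { 2, 0 }, { -1, -1 },
	};

	return ex_cfg_ok(prog, 3);
}
```

The second step described above (simulating every insn along all paths and
tracking register and stack state) is not modelled here.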
+
+eBPF opcode encoding
+--------------------
+
+eBPF is reusing most of the opcode encoding from classic to simplify conversion
+of classic BPF to eBPF. For arithmetic and jump instructions the 8-bit 'code'
+field is divided into three parts:
+
+ +----------------+--------+--------------------+
+ | 4 bits | 1 bit | 3 bits |
+ | operation code | source | instruction class |
+ +----------------+--------+--------------------+
+ (MSB) (LSB)
+
+Three LSB bits store instruction class which is one of:
+
+ Classic BPF classes: eBPF classes:
+
+ BPF_LD 0x00 BPF_LD 0x00
+ BPF_LDX 0x01 BPF_LDX 0x01
+ BPF_ST 0x02 BPF_ST 0x02
+ BPF_STX 0x03 BPF_STX 0x03
+ BPF_ALU 0x04 BPF_ALU 0x04
+ BPF_JMP 0x05 BPF_JMP 0x05
+ BPF_RET 0x06 [ class 6 unused, for future if needed ]
+ BPF_MISC 0x07 BPF_ALU64 0x07
+
+When BPF_CLASS(code) == BPF_ALU or BPF_JMP, 4th bit encodes source operand ...
+
+ BPF_K 0x00
+ BPF_X 0x08
+
+ * in classic BPF, this means:
+
+ BPF_SRC(code) == BPF_X - use register X as source operand
+ BPF_SRC(code) == BPF_K - use 32-bit immediate as source operand
+
+ * in eBPF, this means:
+
+ BPF_SRC(code) == BPF_X - use 'src_reg' register as source operand
+ BPF_SRC(code) == BPF_K - use 32-bit immediate as source operand
+
+... and four MSB bits store operation code.
+
+If BPF_CLASS(code) == BPF_ALU or BPF_ALU64 [ in eBPF ], BPF_OP(code) is one of:
+
+ BPF_ADD 0x00
+ BPF_SUB 0x10
+ BPF_MUL 0x20
+ BPF_DIV 0x30
+ BPF_OR 0x40
+ BPF_AND 0x50
+ BPF_LSH 0x60
+ BPF_RSH 0x70
+ BPF_NEG 0x80
+ BPF_MOD 0x90
+ BPF_XOR 0xa0
+ BPF_MOV 0xb0 /* eBPF only: mov reg to reg */
+ BPF_ARSH 0xc0 /* eBPF only: sign extending shift right */
+ BPF_END 0xd0 /* eBPF only: endianness conversion */
+
+If BPF_CLASS(code) == BPF_JMP, BPF_OP(code) is one of:
+
+ BPF_JA 0x00
+ BPF_JEQ 0x10
+ BPF_JGT 0x20
+ BPF_JGE 0x30
+ BPF_JSET 0x40
+ BPF_JNE 0x50 /* eBPF only: jump != */
+ BPF_JSGT 0x60 /* eBPF only: signed '>' */
+ BPF_JSGE 0x70 /* eBPF only: signed '>=' */
+ BPF_CALL 0x80 /* eBPF only: function call */
+ BPF_EXIT 0x90 /* eBPF only: function return */
+
+So BPF_ADD | BPF_X | BPF_ALU means 32-bit addition in both classic BPF
+and eBPF. There are only two registers in classic BPF, so it means A += X.
+In eBPF it means dst_reg = (u32) dst_reg + (u32) src_reg; similarly,
+BPF_XOR | BPF_K | BPF_ALU means A ^= imm32 in classic BPF and analogous
+dst_reg = (u32) dst_reg ^ (u32) imm32 in eBPF.
+
+Classic BPF is using BPF_MISC class to represent A = X and X = A moves.
+eBPF is using BPF_MOV | BPF_X | BPF_ALU code instead. Since there are no
+BPF_MISC operations in eBPF, the class 7 is used as BPF_ALU64 to mean
+exactly the same operations as BPF_ALU, but with 64-bit wide operands
+instead. So BPF_ADD | BPF_X | BPF_ALU64 means 64-bit addition, i.e.:
+dst_reg = dst_reg + src_reg
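+
To make the field split concrete, the following sketch decodes the 'code' byte
and applies ADD or XOR with either 32-bit or 64-bit semantics. The EX_ names
are local to this example (the masks follow the tables above), only two
operations are modelled, and the sign extension of the immediate for 64-bit
operations is an assumption that the text does not spell out.

```c
#include <stdint.h>

/* Field masks following the layout above; the EX_ names are illustrative. */
#define EX_CLASS(code)	((code) & 0x07)
#define EX_SRC(code)	((code) & 0x08)
#define EX_OP(code)	((code) & 0xf0)

enum {
	EX_ALU = 0x04, EX_ALU64 = 0x07,	/* instruction classes */
	EX_K = 0x00, EX_X = 0x08,	/* source operand */
	EX_ADD = 0x00, EX_XOR = 0xa0,	/* operation codes */
};

/*
 * Apply one ADD or XOR instruction and return the new dst_reg value.
 * Operations other than ADD/XOR leave dst unchanged in this sketch.
 */
static uint64_t ex_alu(uint8_t code, uint64_t dst, uint64_t src_reg,
		       int32_t imm)
{
	/* BPF_X selects 'src_reg', BPF_K the 32-bit immediate
	 * (sign-extended here, which is an assumption). */
	uint64_t src = (EX_SRC(code) == EX_X) ? src_reg
					      : (uint64_t)(int64_t)imm;
	uint64_t res;

	switch (EX_OP(code)) {
	case EX_ADD:
		res = dst + src;
		break;
	case EX_XOR:
		res = dst ^ src;
		break;
	default:
		return dst;
	}

	/* BPF_ALU keeps the low 32 bits and zero-extends into 64-bit. */
	if (EX_CLASS(code) == EX_ALU)
		res = (uint32_t)res;
	return res;
}
```

With dst = 0x100000002 and src_reg = 3, the code EX_ADD | EX_X | EX_ALU yields
5 (the upper half is cleared), while EX_ADD | EX_X | EX_ALU64 yields
0x100000005.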
+
+Classic BPF wastes the whole BPF_RET class to represent a single 'ret'
+operation. Classic BPF_RET | BPF_K means copy imm32 into return register
+and perform function exit. eBPF is modeled to match CPU, so BPF_JMP | BPF_EXIT
+in eBPF means function exit only. The eBPF program needs to store return
+value into register R0 before doing a BPF_EXIT. Class 6 in eBPF is currently
+unused and reserved for future use.
+
+For load and store instructions the 8-bit 'code' field is divided as:
+
+ +--------+--------+-------------------+
+ | 3 bits | 2 bits | 3 bits |
+ | mode | size | instruction class |
+ +--------+--------+-------------------+
+ (MSB) (LSB)
+
+Size modifier is one of ...
+
+ BPF_W 0x00 /* word */
+ BPF_H 0x08 /* half word */
+ BPF_B 0x10 /* byte */
+ BPF_DW 0x18 /* eBPF only, double word */
+
+... which encodes size of load/store operation:
+
+ B - 1 byte
+ H - 2 byte
+ W - 4 byte
+ DW - 8 byte (eBPF only)
+
+Mode modifier is one of:
+
+ BPF_IMM 0x00 /* classic BPF only, reserved in eBPF */
+ BPF_ABS 0x20
+ BPF_IND 0x40
+ BPF_MEM 0x60
+ BPF_LEN 0x80 /* classic BPF only, reserved in eBPF */
+ BPF_MSH 0xa0 /* classic BPF only, reserved in eBPF */
+ BPF_XADD 0xc0 /* eBPF only, exclusive add */
+
+eBPF has two non-generic instructions: (BPF_ABS | <size> | BPF_LD) and
+(BPF_IND | <size> | BPF_LD) which are used to access packet data.
+
+They had to be carried over from classic to have strong performance of
+socket filters running in eBPF interpreter. These instructions can only
+be used when interpreter context is a pointer to 'struct sk_buff' and
+have seven implicit operands. Register R6 is an implicit input that must
+contain pointer to sk_buff. Register R0 is an implicit output which contains
+the data fetched from the packet. Registers R1-R5 are scratch registers
+and must not be used to store the data across BPF_ABS | BPF_LD or
+BPF_IND | BPF_LD instructions.
+
+These instructions have implicit program exit condition as well. When
+eBPF program is trying to access the data beyond the packet boundary,
+the interpreter will abort the execution of the program. JIT compilers
+therefore must preserve this property. src_reg and imm32 fields are
+explicit inputs to these instructions.
+
+For example:
+
+ BPF_IND | BPF_W | BPF_LD means:
+
+ R0 = ntohl(*(u32 *) (((struct sk_buff *) R6)->data + src_reg + imm32))
+ and R1 - R5 were scratched.
+
+Unlike classic BPF instruction set, eBPF has generic load/store operations:
+
+BPF_MEM | <size> | BPF_STX: *(size *) (dst_reg + off) = src_reg
+BPF_MEM | <size> | BPF_ST: *(size *) (dst_reg + off) = imm32
+BPF_MEM | <size> | BPF_LDX: dst_reg = *(size *) (src_reg + off)
+BPF_XADD | BPF_W | BPF_STX: lock xadd *(u32 *)(dst_reg + off16) += src_reg
+BPF_XADD | BPF_DW | BPF_STX: lock xadd *(u64 *)(dst_reg + off16) += src_reg
+
+Where size is one of: BPF_B or BPF_H or BPF_W or BPF_DW. Note that 1 and
+2 byte atomic increments are not supported.
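+
The size and mode fields of load/store opcodes can be decoded in the same
spirit. This is a sketch with illustrative EX_ names whose constants follow
the tables above; ex_is_valid_xadd() encodes only the rule stated in the text
that atomic adds exist for 4 and 8 byte operands.

```c
#include <stdint.h>

/* Field masks following the load/store layout above; names are illustrative. */
#define EX_CLASS(code)	((code) & 0x07)
#define EX_SIZE(code)	((code) & 0x18)
#define EX_MODE(code)	((code) & 0xe0)

enum {
	EX_STX = 0x03,					/* instruction class */
	EX_W = 0x00, EX_H = 0x08, EX_B = 0x10, EX_DW = 0x18,	/* sizes */
	EX_MEM = 0x60, EX_XADD = 0xc0,			/* modes */
};

/* Number of bytes moved by a load/store with the given opcode. */
static int ex_size_bytes(uint8_t code)
{
	switch (EX_SIZE(code)) {
	case EX_W:
		return 4;
	case EX_H:
		return 2;
	case EX_B:
		return 1;
	case EX_DW:
		return 8;
	}
	return 0;	/* not reached: the 2-bit field has four values */
}

/* Atomic add is only defined for BPF_STX with word or double word size. */
static int ex_is_valid_xadd(uint8_t code)
{
	int n = ex_size_bytes(code);

	return EX_MODE(code) == EX_XADD && EX_CLASS(code) == EX_STX &&
	       (n == 4 || n == 8);
}
```

For example, EX_XADD | EX_DW | EX_STX (0xdb) passes the check, whereas
EX_XADD | EX_B | EX_STX (0xd3) does not.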
+
+Testing
+-------
+
+Next to the BPF toolchain, the kernel also ships a test module that contains
+various test cases for classic and internal BPF that can be executed against
+the BPF interpreter and JIT compiler. It can be found in lib/test_bpf.c and
+enabled via Kconfig:
+
+ CONFIG_TEST_BPF=m
+
+After the module has been built and installed, the test suite can be executed
+via insmod or modprobe against 'test_bpf' module. Results of the test cases
+including timings in nsec can be found in the kernel log (dmesg).
+
+Misc
+----
+
+Also trinity, the Linux syscall fuzzer, has built-in support for BPF and
+SECCOMP-BPF kernel fuzzing.
+
+Written by
+----------
+
+The document was written in the hope that it is found useful and in order
+to give potential BPF hackers or security auditors a better overview of
+the underlying architecture.
+
+Jay Schulist <jschlst@samba.org>
+Daniel Borkmann <dborkman@redhat.com>
+Alexei Starovoitov <ast@plumgrid.com>
diff --git a/Documentation/networking/gianfar.txt b/Documentation/networking/gianfar.txt
index ad474ea07d0..ba1daea7f2e 100644
--- a/Documentation/networking/gianfar.txt
+++ b/Documentation/networking/gianfar.txt
@@ -1,38 +1,8 @@
The Gianfar Ethernet Driver
-Sysfs File description
Author: Andy Fleming <afleming@freescale.com>
Updated: 2005-07-28
-SYSFS
-
-Several of the features of the gianfar driver are controlled
-through sysfs files. These are:
-
-bd_stash:
-To stash RX Buffer Descriptors in the L2, echo 'on' or '1' to
-bd_stash, echo 'off' or '0' to disable
-
-rx_stash_len:
-To stash the first n bytes of the packet in L2, echo the number
-of bytes to buf_stash_len. echo 0 to disable.
-
-WARNING: You could really screw these up if you set them too low or high!
-fifo_threshold:
-To change the number of bytes the controller needs in the
-fifo before it starts transmission, echo the number of bytes to
-fifo_thresh. Range should be 0-511.
-
-fifo_starve:
-When the FIFO has less than this many bytes during a transmit, it
-enters starve mode, and increases the priority of TX memory
-transactions. To change, echo the number of bytes to
-fifo_starve. Range should be 0-511.
-
-fifo_starve_off:
-Once in starve mode, the FIFO remains there until it has this
-many bytes. To change, echo the number of bytes to
-fifo_starve_off. Range should be 0-511.
CHECKSUM OFFLOADING
diff --git a/Documentation/networking/i40evf.txt b/Documentation/networking/i40evf.txt
new file mode 100644
index 00000000000..21e41271af7
--- /dev/null
+++ b/Documentation/networking/i40evf.txt
@@ -0,0 +1,47 @@
+Linux* Base Driver for Intel(R) Network Connection
+==================================================
+
+Intel XL710 X710 Virtual Function Linux driver.
+Copyright(c) 2013 Intel Corporation.
+
+Contents
+========
+
+- Identifying Your Adapter
+- Known Issues/Troubleshooting
+- Support
+
+This file describes the i40evf Linux* Base Driver for the Intel(R) XL710
+X710 Virtual Function.
+
+The i40evf driver supports XL710 and X710 virtual function devices that
+can only be activated on kernels with CONFIG_PCI_IOV enabled.
+
+The guest OS loading the i40evf driver must support MSI-X interrupts.
+
+Identifying Your Adapter
+========================
+
+For more information on how to identify your adapter, go to the Adapter &
+Driver ID Guide at:
+
+ http://support.intel.com/support/go/network/adapter/idguide.htm
+
+Known Issues/Troubleshooting
+============================
+
+
+Support
+=======
+
+For general information, go to the Intel support website at:
+
+ http://support.intel.com
+
+or the Intel Wired Networking project hosted by Sourceforge at:
+
+ http://sourceforge.net/projects/e1000
+
+If an issue is identified with the released source code on the supported
+kernel with a supported adapter, email the specific information related
+to the issue to e1000-devel@lists.sf.net
diff --git a/Documentation/networking/ieee802154.txt b/Documentation/networking/ieee802154.txt
index 09eb57329f1..22bbc7225f8 100644
--- a/Documentation/networking/ieee802154.txt
+++ b/Documentation/networking/ieee802154.txt
@@ -4,7 +4,7 @@
Introduction
============
-The IEEE 802.15.4 working group focuses on standartization of bottom
+The IEEE 802.15.4 working group focuses on standardization of bottom
two layers: Medium Access Control (MAC) and Physical (PHY). And there
are mainly two options available for upper layers:
- ZigBee - proprietary protocol from ZigBee Alliance
@@ -66,7 +66,7 @@ net_device, with .type = ARPHRD_IEEE802154. Data is exchanged with socket family
code via plain sk_buffs. On skb reception skb->cb must contain additional
info as described in the struct ieee802154_mac_cb. During packet transmission
the skb->cb is used to provide additional data to device's header_ops->create
-function. Be aware, that this data can be overriden later (when socket code
+function. Be aware that this data can be overridden later (when socket code
submits skb to qdisc), so if you need something from that cb later, you should
store info in the skb->data on your own.
diff --git a/Documentation/networking/igb.txt b/Documentation/networking/igb.txt
index 4ebbd659256..43d3549366a 100644
--- a/Documentation/networking/igb.txt
+++ b/Documentation/networking/igb.txt
@@ -36,54 +36,6 @@ Default Value: 0
This parameter adds support for SR-IOV. It causes the driver to spawn up to
max_vfs worth of virtual function.
-QueuePairs
-----------
-Valid Range: 0-1
-Default Value: 1 (TX and RX will be paired onto one interrupt vector)
-
-If set to 0, when MSI-X is enabled, the TX and RX will attempt to occupy
-separate vectors.
-
-This option can be overridden to 1 if there are not sufficient interrupts
-available. This can occur if any combination of RSS, VMDQ, and max_vfs
-results in more than 4 queues being used.
-
-Node
-----
-Valid Range: 0-n
-Default Value: -1 (off)
-
- 0 - n: where n is the number of the NUMA node that should be used to
- allocate memory for this adapter port.
- -1: uses the driver default of allocating memory on whichever processor is
- running insmod/modprobe.
-
- The Node parameter will allow you to pick which NUMA node you want to have
- the adapter allocate memory from. All driver structures, in-memory queues,
- and receive buffers will be allocated on the node specified. This parameter
- is only useful when interrupt affinity is specified, otherwise some portion
- of the time the interrupt could run on a different core than the memory is
- allocated on, causing slower memory access and impacting throughput, CPU, or
- both.
-
-EEE
----
-Valid Range: 0-1
-Default Value: 1 (enabled)
-
- A link between two EEE-compliant devices will result in periodic bursts of
- data followed by long periods where in the link is in an idle state. This Low
- Power Idle (LPI) state is supported in both 1Gbps and 100Mbps link speeds.
- NOTE: EEE support requires autonegotiation.
-
-DMAC
-----
-Valid Range: 0-1
-Default Value: 1 (enabled)
- Enables or disables DMA Coalescing feature.
-
-
-
Additional Configurations
=========================
diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index a46d78583ae..ab42c95f998 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -15,13 +15,47 @@ ip_default_ttl - INTEGER
forwarded) IP packets. Should be between 1 and 255 inclusive.
Default: 64 (as recommended by RFC1700)
-ip_no_pmtu_disc - BOOLEAN
- Disable Path MTU Discovery.
- default FALSE
+ip_no_pmtu_disc - INTEGER
+ Disable Path MTU Discovery. If enabled in mode 1 and a
+ fragmentation-required ICMP is received, the PMTU to this
+ destination will be set to min_pmtu (see below). You will need
+ to raise min_pmtu to the smallest interface MTU on your system
+ manually if you want to avoid locally generated fragments.
+
+ In mode 2 incoming Path MTU Discovery messages will be
+ discarded. Outgoing frames are handled the same as in mode 1,
+ implicitly setting IP_PMTUDISC_DONT on every created socket.
+
+ Mode 3 is a hardened pmtu discovery mode. The kernel will only
+ accept fragmentation-needed errors if the underlying protocol
+ can verify them besides a plain socket lookup. Current
+ protocols for which pmtu events will be honored are TCP, SCTP
+ and DCCP as they verify e.g. the sequence number or the
+ association. This mode should not be enabled globally but is
+ only intended to secure e.g. name servers in namespaces where
+ TCP path mtu must still work but path MTU information of other
+ protocols should be discarded. If enabled globally this mode
+ could break other protocols.
+
+ Possible values: 0-3
+ Default: FALSE
min_pmtu - INTEGER
default 552 - minimum discovered Path MTU
+ip_forward_use_pmtu - BOOLEAN
+ By default we don't trust protocol path MTUs while forwarding
+ because they could be easily forged and can lead to unwanted
+ fragmentation by the router.
+ You only need to enable this if you have user-space software
+ which tries to discover path mtus by itself and depends on the
+ kernel honoring this information. This is normally not the
+ case.
+ Default: 0 (disabled)
+ Possible values:
+ 0 - disabled
+ 1 - enabled
+
route/max_size - INTEGER
Maximum number of routes allowed in the kernel. Increase
this when using large numbers of interfaces and/or routes.
@@ -156,6 +190,16 @@ tcp_app_win - INTEGER
buffer. Value 0 is special, it means that nothing is reserved.
Default: 31
+tcp_autocorking - BOOLEAN
+ Enable TCP auto corking :
+ When applications do consecutive small write()/sendmsg() system calls,
+ we try to coalesce these small writes as much as possible, to lower
+ total amount of sent packets. This is done if at least one prior
+ packet for the flow is waiting in Qdisc queues or device transmit
+ queue. Applications can still use TCP_CORK for optimal behavior
+ when they know how/when to uncork their sockets.
+ Default : 1
+
tcp_available_congestion_control - STRING
Shows the available congestion control choices that are registered.
More congestion control algorithms may be available as modules,
@@ -267,17 +311,6 @@ tcp_max_orphans - INTEGER
more aggressively. Let me to remind again: each orphan eats
up to ~64K of unswappable memory.
-tcp_max_ssthresh - INTEGER
- Limited Slow-Start for TCP with large congestion windows (cwnd) defined in
- RFC3742. Limited slow-start is a mechanism to limit growth of the cwnd
- on the region where cwnd is larger than tcp_max_ssthresh. TCP increases cwnd
- by at most tcp_max_ssthresh segments, and by at least tcp_max_ssthresh/2
- segments per RTT when the cwnd is above tcp_max_ssthresh.
- If TCP connection increased cwnd to thousands (or tens of thousands) segments,
- and thousands of packets were being dropped during slow-start, you can set
- tcp_max_ssthresh to improve performance for new TCP connection.
- Default: 0 (off)
-
tcp_max_syn_backlog - INTEGER
Maximal number of remembered connection requests, which have not
received an acknowledgment from connecting client.
@@ -451,7 +484,7 @@ tcp_fastopen - INTEGER
connect() to perform a TCP handshake automatically.
The values (bitmap) are
- 1: Enables sending data in the opening SYN on the client.
+ 1: Enables sending data in the opening SYN on the client w/ MSG_FASTOPEN.
2: Enables TCP Fast Open on the server side, i.e., allowing data in
a SYN packet to be accepted and passed to the application before
3-way hand shake finishes.
@@ -464,7 +497,7 @@ tcp_fastopen - INTEGER
different ways of setting max_qlen without the TCP_FASTOPEN socket
option.
- Default: 0
+ Default: 1
Note that the client & server side Fast Open flags (1 and 2
respectively) must be also enabled before the rest of flags can take
@@ -588,9 +621,6 @@ tcp_limit_output_bytes - INTEGER
typical pfifo_fast qdiscs.
tcp_limit_output_bytes limits the number of bytes on qdisc
or device to reduce artificial RTT/cwnd and reduce bufferbloat.
- Note: For GSO/TSO enabled flows, we try to have at least two
- packets in flight. Reducing tcp_limit_output_bytes might also
- reduce the size of individual GSO packet (64KB being the max)
Default: 131072
tcp_challenge_ack_limit - INTEGER
@@ -1058,6 +1088,12 @@ igmpv3_unsolicited_report_interval - INTEGER
IGMPv3 report retransmit will take place.
Default: 1000 (1 seconds)
+promote_secondaries - BOOLEAN
+ When a primary IP address is removed from this interface
+ promote a corresponding secondary IP address instead of
+ removing all the corresponding secondary IP addresses.
+
+
tag - INTEGER
Allows you to write a number, which can be used as required.
Default value is 0.
@@ -1088,6 +1124,21 @@ bindv6only - BOOLEAN
Default: FALSE (as specified in RFC3493)
+flowlabel_consistency - BOOLEAN
+ Protect the consistency (and unicity) of flow label.
+ You have to disable it to use IPV6_FL_F_REFLECT flag on the
+ flow label manager.
+ TRUE: enabled
+ FALSE: disabled
+ Default: TRUE
+
+anycast_src_echo_reply - BOOLEAN
+ Controls the use of anycast addresses as source addresses for ICMPv6
+ echo reply
+ TRUE: enabled
+ FALSE: disabled
+ Default: FALSE
+
IPv6 Fragmentation:
ip6frag_high_thresh - INTEGER
diff --git a/Documentation/networking/ipsec.txt b/Documentation/networking/ipsec.txt
new file mode 100644
index 00000000000..8dbc08b7e43
--- /dev/null
+++ b/Documentation/networking/ipsec.txt
@@ -0,0 +1,38 @@
+
+This document describes known IPsec corner cases which need to be kept in mind
+when deploying various IPsec configurations in real-world production
+environments.
+
+1. IPcomp: Small IP packets won't get compressed at the sender, and fail the
+ policy check on the receiver.
+
+Quote from RFC3173:
+2.2. Non-Expansion Policy
+
+ If the total size of a compressed payload and the IPComp header, as
+ defined in section 3, is not smaller than the size of the original
+ payload, the IP datagram MUST be sent in the original non-compressed
+ form. To clarify: If an IP datagram is sent non-compressed, no
+
+ IPComp header is added to the datagram. This policy ensures saving
+ the decompression processing cycles and avoiding incurring IP
+ datagram fragmentation when the expanded datagram is larger than the
+ MTU.
+
+ Small IP datagrams are likely to expand as a result of compression.
+ Therefore, a numeric threshold should be applied before compression,
+ where IP datagrams of size smaller than the threshold are sent in the
+ original form without attempting compression. The numeric threshold
+ is implementation dependent.
+
+The current IPComp implementation is indeed by the book. In practice, however,
+when a non-compressed packet is sent to the peer (whether because the packet
+length is smaller than the threshold or because the compressed length is larger
+than the original packet length), the packet is dropped during the policy check,
+as it matches the selector but does not come from any XFRM layer, i.e., it has
+no security path. Such a naked packet will never make it to the upper layer.
+The result looks all the more weird to the user when pinging the peer with
+different payload lengths.
+
+One workaround is try to set "level use" for each policy if user observed
+above scenario. The consequence of doing so is small packet(uncompressed)
+will skip policy checking on receiver side.
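The non-expansion rule quoted from RFC 3173 in ipsec.txt can be sketched as a small decision function. This is only an illustration under stated assumptions, not the kernel's IPComp code: the function name and the threshold parameter are hypothetical, and only the 4-octet IPComp header size is taken from the RFC.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Size of the IPComp header defined in RFC 3173, section 3. */
#define IPCOMP_HDR_LEN 4

/*
 * Hypothetical helper: decide whether a datagram goes out compressed.
 * 'threshold' stands for the implementation-dependent minimum size
 * below which compression is not even attempted.
 */
static bool ipcomp_send_compressed(size_t orig_len, size_t comp_len,
				   size_t threshold)
{
	/* Small datagrams are sent in their original form. */
	if (orig_len < threshold)
		return false;

	/* Compressed payload plus header must be strictly smaller. */
	return comp_len + IPCOMP_HDR_LEN < orig_len;
}
```

Whenever such a check yields false, the datagram reaches the receiver without an IPComp header and therefore without a security path, which is the case the "level use" workaround addresses.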
diff --git a/Documentation/networking/l2tp.txt b/Documentation/networking/l2tp.txt
index e63fc1f7bf8..c74434de2fa 100644
--- a/Documentation/networking/l2tp.txt
+++ b/Documentation/networking/l2tp.txt
@@ -197,7 +197,7 @@ state information because the file format is subject to change. It is
implemented to provide extra debug information to help diagnose
problems.) Users should use the netlink API.
-/proc/net/pppol2tp is also provided for backwards compaibility with
+/proc/net/pppol2tp is also provided for backwards compatibility with
the original pppol2tp driver. It lists information about L2TPv2
tunnels and sessions only. Its use is discouraged.
diff --git a/Documentation/networking/netdev-FAQ.txt b/Documentation/networking/netdev-FAQ.txt
index d9112f01c44..0fe1c6e0dbc 100644
--- a/Documentation/networking/netdev-FAQ.txt
+++ b/Documentation/networking/netdev-FAQ.txt
@@ -4,23 +4,23 @@ Information you need to know about netdev
Q: What is netdev?
-A: It is a mailing list for all network related linux stuff. This includes
+A: It is a mailing list for all network-related Linux stuff. This includes
anything found under net/ (i.e. core code like IPv6) and drivers/net
- (i.e. hardware specific drivers) in the linux source tree.
+ (i.e. hardware specific drivers) in the Linux source tree.
Note that some subsystems (e.g. wireless drivers) which have a high volume
of traffic have their own specific mailing lists.
- The netdev list is managed (like many other linux mailing lists) through
+ The netdev list is managed (like many other Linux mailing lists) through
VGER ( http://vger.kernel.org/ ) and archives can be found below:
http://marc.info/?l=linux-netdev
http://www.spinics.net/lists/netdev/
- Aside from subsystems like that mentioned above, all network related linux
- development (i.e. RFC, review, comments, etc) takes place on netdev.
+ Aside from subsystems like that mentioned above, all network-related Linux
+ development (i.e. RFC, review, comments, etc.) takes place on netdev.
-Q: How do the changes posted to netdev make their way into linux?
+Q: How do the changes posted to netdev make their way into Linux?
A: There are always two trees (git repositories) in play. Both are driven
by David Miller, the main network maintainer. There is the "net" tree,
@@ -35,7 +35,7 @@ A: There are always two trees (git repositories) in play. Both are driven
Q: How often do changes from these trees make it to the mainline Linus tree?
A: To understand this, you need to know a bit of background information
- on the cadence of linux development. Each new release starts off with
+ on the cadence of Linux development. Each new release starts off with
a two week "merge window" where the main maintainers feed their new
stuff to Linus for merging into the mainline tree. After the two weeks,
the merge window is closed, and it is called/tagged "-rc1". No new
@@ -46,7 +46,7 @@ A: To understand this, you need to know a bit of background information
things are in a state of churn), and a week after the last vX.Y-rcN
was done, the official "vX.Y" is released.
- Relating that to netdev: At the beginning of the 2 week merge window,
+ Relating that to netdev: At the beginning of the 2-week merge window,
the net-next tree will be closed - no new changes/features. The
accumulated new content of the past ~10 weeks will be passed onto
mainline/Linus via a pull request for vX.Y -- at the same time,
@@ -59,16 +59,16 @@ A: To understand this, you need to know a bit of background information
IMPORTANT: Do not send new net-next content to netdev during the
period during which net-next tree is closed.
- Shortly after the two weeks have passed, (and vX.Y-rc1 is released) the
+ Shortly after the two weeks have passed (and vX.Y-rc1 is released), the
tree for net-next reopens to collect content for the next (vX.Y+1) release.
If you aren't subscribed to netdev and/or are simply unsure if net-next
has re-opened yet, simply check the net-next git repository link above for
- any new networking related commits.
+ any new networking-related commits.
The "net" tree continues to collect fixes for the vX.Y content, and
is fed back to Linus at regular (~weekly) intervals. Meaning that the
- focus for "net" is on stablilization and bugfixes.
+ focus for "net" is on stabilization and bugfixes.
Finally, the vX.Y gets released, and the whole cycle starts over.
@@ -217,7 +217,7 @@ A: Attention to detail. Re-read your own work as if you were the
to why it happens, and then if necessary, explain why the fix proposed
is the best way to get things done. Don't mangle whitespace, and as
is common, don't mis-indent function arguments that span multiple lines.
- If it is your 1st patch, mail it to yourself so you can test apply
+ If it is your first patch, mail it to yourself so you can test apply
it to an unpatched tree to confirm infrastructure didn't mangle it.
Finally, go back and read Documentation/SubmittingPatches to be
diff --git a/Documentation/networking/netdevices.txt b/Documentation/networking/netdevices.txt
index c7ecc708049..0b1cf6b2a59 100644
--- a/Documentation/networking/netdevices.txt
+++ b/Documentation/networking/netdevices.txt
@@ -10,12 +10,12 @@ network devices.
struct net_device allocation rules
==================================
Network device structures need to persist even after module is unloaded and
-must be allocated with kmalloc. If device has registered successfully,
-it will be freed on last use by free_netdev. This is required to handle the
-pathologic case cleanly (example: rmmod mydriver </sys/class/net/myeth/mtu )
+must be allocated with alloc_netdev_mqs() and friends.
+If the device has registered successfully, it will be freed on last use
+by free_netdev(). This is required to handle the pathologic case cleanly
+(example: rmmod mydriver </sys/class/net/myeth/mtu )
-There are routines in net_init.c to handle the common cases of
-alloc_etherdev, alloc_netdev. These reserve extra space for driver
+alloc_netdev_mqs()/alloc_netdev() reserve extra space for driver
private data which gets freed when the network device is freed. If
separately allocated data is attached to the network device
(netdev_priv(dev)) then it is up to the module exit handler to free that.
diff --git a/Documentation/networking/netlink_mmap.txt b/Documentation/networking/netlink_mmap.txt
index 53337883954..c6af4bac5aa 100644
--- a/Documentation/networking/netlink_mmap.txt
+++ b/Documentation/networking/netlink_mmap.txt
@@ -45,7 +45,7 @@ processing.
Conversion of the reception path involves calling poll() on the file
descriptor, once the socket is readable the frames from the ring are
-processsed in order until no more messages are available, as indicated by
+processed in order until no more messages are available, as indicated by
a status word in the frame header.
On kernel side, in order to make use of memory mapped I/O on receive, the
@@ -56,7 +56,7 @@ Dumps of kernel databases automatically support memory mapped I/O.
Conversion of the transmit path involves changing message construction to
use memory from the TX ring instead of (usually) a buffer declared on the
-stack and setting up the frame header approriately. Optionally poll() can
+stack and setting up the frame header appropriately. Optionally poll() can
be used to wait for free frames in the TX ring.
Structured and definitions for using memory mapped I/O are contained in
@@ -226,12 +226,12 @@ Ring setup:
void *rx_ring, *tx_ring;
/* Configure ring parameters */
- if (setsockopt(fd, NETLINK_RX_RING, &req, sizeof(req)) < 0)
+ if (setsockopt(fd, SOL_NETLINK, NETLINK_RX_RING, &req, sizeof(req)) < 0)
exit(1);
- if (setsockopt(fd, NETLINK_TX_RING, &req, sizeof(req)) < 0)
+ if (setsockopt(fd, SOL_NETLINK, NETLINK_TX_RING, &req, sizeof(req)) < 0)
exit(1)
- /* Calculate size of each invididual ring */
+ /* Calculate size of each individual ring */
ring_size = req.nm_block_nr * req.nm_block_size;
/* Map RX/TX rings. The TX ring is located after the RX ring */
diff --git a/Documentation/networking/operstates.txt b/Documentation/networking/operstates.txt
index 97694572338..355c6d8ef8a 100644
--- a/Documentation/networking/operstates.txt
+++ b/Documentation/networking/operstates.txt
@@ -89,8 +89,8 @@ packets. The name 'carrier' and the inversion are historical, think of
it as lower layer.
Note that for certain kind of soft-devices, which are not managing any
-real hardware, there is possible to set this bit from userpsace.
-One should use TVL IFLA_CARRIER to do so.
+real hardware, it is possible to set this bit from userspace. One
+should use the TLV IFLA_CARRIER to do so.
netif_carrier_ok() can be used to query that bit.
diff --git a/Documentation/networking/packet_mmap.txt b/Documentation/networking/packet_mmap.txt
index c01223628a8..38112d512f4 100644
--- a/Documentation/networking/packet_mmap.txt
+++ b/Documentation/networking/packet_mmap.txt
@@ -98,6 +98,11 @@ by the kernel.
The destruction of the socket and all associated resources
is done by a simple call to close(fd).
+As without PACKET_MMAP, it is possible to use one socket
+for capture and transmission. This can be done by mapping the
+allocated RX and TX buffer rings with a single mmap() call.
+See "Mapping and use of the circular buffer (ring)".
+
Next I will describe PACKET_MMAP settings and its constraints,
also the mapping of the circular buffer in the user process and
the use of this buffer.
@@ -123,6 +128,16 @@ Transmission process is similar to capture as shown below.
[shutdown] close() --------> destruction of the transmission socket and
deallocation of all associated resources.
+Socket creation and destruction are also straightforward, and are done
+the same way as for capturing, described in the previous paragraph:
+
+ int fd = socket(PF_PACKET, mode, 0);
+
+The protocol can optionally be 0 if we only want to transmit
+via this socket, which avoids an expensive call to packet_rcv().
+In this case, you also need to bind(2) the TX_RING with sll_protocol = 0
+set. Otherwise, use htons(ETH_P_ALL) or any other protocol.
+
Binding the socket to your network interface is mandatory (with zero copy) to
know the header size of frames used in the circular buffer.
@@ -404,6 +419,19 @@ tp_block_size/tp_frame_size frames there will be a gap between
the frames. This is because a frame cannot be spawn across two
blocks.
+To use one socket for capture and transmission, the mapping of both the
+RX and TX buffer rings has to be done with one call to mmap:
+
+ ...
+ setsockopt(fd, SOL_PACKET, PACKET_RX_RING, &foo, sizeof(foo));
+ setsockopt(fd, SOL_PACKET, PACKET_TX_RING, &bar, sizeof(bar));
+ ...
+ rx_ring = mmap(0, size * 2, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
+ tx_ring = rx_ring + size;
+
+RX must come first, as the kernel maps the TX ring memory right
+after the RX one.
+
At the beginning of each frame there is an status field (see
struct tpacket_hdr). If this field is 0 means that the frame is ready
to be used for the kernel, If not, there is a frame the user can read
@@ -425,7 +453,7 @@ TP_STATUS_COPY : This flag indicates that the frame (and associated
enabled previously with setsockopt() and
the PACKET_COPY_THRESH option.
- The number of frames than can be buffered to
+ The number of frames that can be buffered to
be read with recvfrom is limited like a normal socket.
See the SO_RCVBUF option in the socket (7) man page.
@@ -507,8 +535,6 @@ where 'tpacket_version' can be TPACKET_V1 (default), TPACKET_V2, TPACKET_V3.
TPACKET_V1:
- Default if not otherwise specified by setsockopt(2)
- RX_RING, TX_RING available
- - VLAN metadata information available for packets
- (TP_STATUS_VLAN_VALID)
TPACKET_V1 --> TPACKET_V2:
- Made 64 bit clean due to unsigned long usage in TPACKET_V1
@@ -516,6 +542,13 @@ TPACKET_V1 --> TPACKET_V2:
userspace and the like
- Timestamp resolution in nanoseconds instead of microseconds
- RX_RING, TX_RING available
+ - VLAN metadata information available for packets
+ (TP_STATUS_VLAN_VALID, TP_STATUS_VLAN_TPID_VALID),
+ in the tpacket2_hdr structure:
+ - A set TP_STATUS_VLAN_VALID bit in the tp_status field indicates
+ that the tp_vlan_tci field holds a valid VLAN TCI value
+ - A set TP_STATUS_VLAN_TPID_VALID bit in the tp_status field
+ indicates that the tp_vlan_tpid field holds a valid VLAN TPID value
- How to switch to TPACKET_V2:
1. Replace struct tpacket_hdr by struct tpacket2_hdr
2. Query header len and save
@@ -545,11 +578,12 @@ processes. This also works in combination with mmap(2) on packet sockets.
Currently implemented fanout policies are:
- - PACKET_FANOUT_HASH: schedule to socket by skb's rxhash
+ - PACKET_FANOUT_HASH: schedule to socket by skb's packet hash
- PACKET_FANOUT_LB: schedule to socket by round-robin
- PACKET_FANOUT_CPU: schedule to socket by CPU packet arrives on
- PACKET_FANOUT_RND: schedule to socket by random selection
- PACKET_FANOUT_ROLLOVER: if one socket is full, rollover to another
+ - PACKET_FANOUT_QM: schedule to socket by skb's recorded queue_mapping
Minimal example code by David S. Miller (try things like "./test eth0 hash",
"./test eth0 lb", etc.):
@@ -943,6 +977,27 @@ int main(int argc, char **argp)
}
-------------------------------------------------------------------------------
++ PACKET_QDISC_BYPASS
+-------------------------------------------------------------------------------
+
+If there is a requirement to load the network with many packets in a similar
+fashion as pktgen does, you might set the following option after socket
+creation:
+
+ int one = 1;
+ setsockopt(fd, SOL_PACKET, PACKET_QDISC_BYPASS, &one, sizeof(one));
+
+This has the side effect that packets sent through PF_PACKET bypass the
+kernel's qdisc layer and are pushed directly to the driver. As a result,
+packets are not buffered, tc disciplines are ignored, increased loss can occur,
+and such packets are no longer visible to other PF_PACKET sockets. Use this
+option with care; it is mainly useful for stress testing various
+components of a system.
+
+By default, PACKET_QDISC_BYPASS is disabled and needs to be explicitly enabled
+on PF_PACKET sockets.
+
+-------------------------------------------------------------------------------
+ PACKET_TIMESTAMP
-------------------------------------------------------------------------------
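The TP_STATUS_VLAN_VALID and TP_STATUS_VLAN_TPID_VALID bits described in the TPACKET_V2 notes of packet_mmap.txt could be checked along the following lines. This is a minimal sketch, assuming a frame header obtained from a TPACKET_V2 ring; the two helper names are hypothetical and do not appear in the kernel documentation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <linux/if_packet.h>

/* Hypothetical helper: fetch the VLAN TCI only if it is marked valid. */
static bool frame_vlan_tci(const struct tpacket2_hdr *hdr, uint16_t *tci)
{
	if (!(hdr->tp_status & TP_STATUS_VLAN_VALID))
		return false;
	*tci = hdr->tp_vlan_tci;
	return true;
}

/* Hypothetical helper: fetch the VLAN TPID only if it is marked valid. */
static bool frame_vlan_tpid(const struct tpacket2_hdr *hdr, uint16_t *tpid)
{
	if (!(hdr->tp_status & TP_STATUS_VLAN_TPID_VALID))
		return false;
	*tpid = hdr->tp_vlan_tpid;
	return true;
}
```

Checking the status bit first matters because a TCI or TPID of 0 is otherwise indistinguishable from "no VLAN metadata".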
diff --git a/Documentation/networking/phy.txt b/Documentation/networking/phy.txt
index d5b1a393524..3544c98401f 100644
--- a/Documentation/networking/phy.txt
+++ b/Documentation/networking/phy.txt
@@ -48,7 +48,7 @@ The MDIO bus
time, so it is safe for them to block, waiting for an interrupt to signal
the operation is complete
- 2) A reset function is necessary. This is used to return the bus to an
+ 2) A reset function is optional. This is used to return the bus to an
initialized state.
3) A probe function is needed. This function should set up anything the bus
@@ -253,15 +253,25 @@ Writing a PHY driver
Each driver consists of a number of function pointers:
+ soft_reset: perform a PHY software reset
config_init: configures PHY into a sane state after a reset.
For instance, a Davicom PHY requires descrambling disabled.
- probe: Does any setup needed by the driver
+ probe: Allocate phy->priv, optionally refuse to bind.
+ PHY may not have been reset or had fixups run yet.
suspend/resume: power management
config_aneg: Changes the speed/duplex/negotiation settings
+ aneg_done: Determines the auto-negotiation result
read_status: Reads the current speed/duplex/negotiation settings
ack_interrupt: Clear a pending interrupt
+ did_interrupt: Checks if the PHY generated an interrupt
config_intr: Enable or disable interrupts
remove: Does any driver take-down
+ ts_info: Queries about the HW timestamping status
+ hwtstamp: Set the PHY HW timestamping configuration
+ rxtstamp: Requests a receive timestamp at the PHY level for a 'skb'
+ txtstamp: Requests a transmit timestamp at the PHY level for a 'skb'
+ set_wol: Enable Wake-on-LAN at the PHY level
+ get_wol: Get the Wake-on-LAN status at the PHY level
Of these, only config_aneg and read_status are required to be
assigned by the driver code. The rest are optional. Also, it is
diff --git a/Documentation/networking/pktgen.txt b/Documentation/networking/pktgen.txt
index 75e4fd708cc..0e30c7845b2 100644
--- a/Documentation/networking/pktgen.txt
+++ b/Documentation/networking/pktgen.txt
@@ -102,13 +102,20 @@ Examples:
The 'minimum' MAC is what you set with dstmac.
pgset "flag [name]" Set a flag to determine behaviour. Current flags
- are: IPSRC_RND #IP Source is random (between min/max),
- IPDST_RND, UDPSRC_RND,
- UDPDST_RND, MACSRC_RND, MACDST_RND
+ are: IPSRC_RND # IP source is random (between min/max)
+ IPDST_RND # IP destination is random
+ UDPSRC_RND, UDPDST_RND,
+ MACSRC_RND, MACDST_RND
+ TXSIZE_RND, IPV6,
MPLS_RND, VID_RND, SVID_RND
+ FLOW_SEQ,
QUEUE_MAP_RND # queue map random
QUEUE_MAP_CPU # queue map mirrors smp_processor_id()
+ UDPCSUM,
+ IPSEC # IPsec encapsulation (needs CONFIG_XFRM)
+ NODE_ALLOC # node specific memory allocation
+ pgset spi SPI_VALUE Set specific SA used to transform packet.
pgset "udp_src_min 9" set UDP source port min, If < udp_src_max, then
cycle through the port range.
@@ -177,6 +184,18 @@ Note when adding devices to a specific CPU there good idea to also assign
/proc/irq/XX/smp_affinity so the TX-interrupts gets bound to the same CPU.
as this reduces cache bouncing when freeing skb's.
+Enable IPsec
+============
+The default IPsec transformation, ESP encapsulation in transport mode,
+can be enabled by simply setting:
+
+pgset "flag IPSEC"
+pgset "flows 1"
+
+To use the AH type or tunnel mode without breaking existing testbed scripts,
+the user can use "pgset spi SPI_VALUE" to specify which form of transformation
+to employ.
+
Current commands and configuration options
==========================================
@@ -219,12 +238,22 @@ udp_dst_max
flag
IPSRC_RND
- TXSIZE_RND
IPDST_RND
UDPSRC_RND
UDPDST_RND
MACSRC_RND
MACDST_RND
+ TXSIZE_RND
+ IPV6
+ MPLS_RND
+ VID_RND
+ SVID_RND
+ FLOW_SEQ
+ QUEUE_MAP_RND
+ QUEUE_MAP_CPU
+ UDPCSUM
+ IPSEC
+ NODE_ALLOC
dst_min
dst_max
diff --git a/Documentation/networking/regulatory.txt b/Documentation/networking/regulatory.txt
index 9551622d0a7..356f791af57 100644
--- a/Documentation/networking/regulatory.txt
+++ b/Documentation/networking/regulatory.txt
@@ -159,10 +159,10 @@ struct ieee80211_regdomain mydriver_jp_regdom = {
REG_RULE(2412-20, 2484+20, 40, 6, 20, 0),
/* IEEE 802.11a, channels 34..48 */
REG_RULE(5170-20, 5240+20, 40, 6, 20,
- NL80211_RRF_PASSIVE_SCAN),
+ NL80211_RRF_NO_IR),
/* IEEE 802.11a, channels 52..64 */
REG_RULE(5260-20, 5320+20, 40, 6, 20,
- NL80211_RRF_NO_IBSS |
+ NL80211_RRF_NO_IR |
NL80211_RRF_DFS),
}
};
diff --git a/Documentation/networking/rxrpc.txt b/Documentation/networking/rxrpc.txt
index 60d05eb77c6..16a924c486b 100644
--- a/Documentation/networking/rxrpc.txt
+++ b/Documentation/networking/rxrpc.txt
@@ -27,6 +27,8 @@ Contents of this document:
(*) AF_RXRPC kernel interface.
+ (*) Configurable parameters.
+
========
OVERVIEW
@@ -144,7 +146,7 @@ An overview of the RxRPC protocol:
(*) Calls use ACK packets to handle reliability. Data packets are also
explicitly sequenced per call.
- (*) There are two types of positive acknowledgement: hard-ACKs and soft-ACKs.
+ (*) There are two types of positive acknowledgment: hard-ACKs and soft-ACKs.
A hard-ACK indicates to the far side that all the data received to a point
has been received and processed; a soft-ACK indicates that the data has
been received but may yet be discarded and re-requested. The sender may
@@ -864,3 +866,82 @@ The kernel interface functions are as follows:
This is used to allocate a null RxRPC key that can be used to indicate
anonymous security for a particular domain.
+
+
+=======================
+CONFIGURABLE PARAMETERS
+=======================
+
+The RxRPC protocol driver has a number of configurable parameters that can be
+adjusted through sysctls in /proc/net/rxrpc/:
+
+ (*) req_ack_delay
+
+ The amount of time in milliseconds after receiving a packet with the
+ request-ack flag set before we honour the flag and actually send the
+ requested ack.
+
+ Usually the other side won't stop sending packets until the advertised
+ reception window is full (to a maximum of 255 packets), so delaying the
+ ACK permits several packets to be ACK'd in one go.
+
+ (*) soft_ack_delay
+
+ The amount of time in milliseconds after receiving a new packet before we
+ generate a soft-ACK to tell the sender that it doesn't need to resend.
+
+ (*) idle_ack_delay
+
+ The amount of time in milliseconds after all the packets currently in the
+ received queue have been consumed before we generate a hard-ACK to tell
+ the sender it can free its buffers, assuming no other reason occurs that
+ we would send an ACK.
+
+ (*) resend_timeout
+
+ The amount of time in milliseconds after transmitting a packet before we
+ transmit it again, assuming no ACK is received from the receiver telling
+ us they got it.
+
+ (*) max_call_lifetime
+
+ The maximum amount of time in seconds that a call may be in progress
+ before we preemptively kill it.
+
+ (*) dead_call_expiry
+
+ The amount of time in seconds before we remove a dead call from the call
+ list. Dead calls are kept around for a little while for the purpose of
+ repeating ACK and ABORT packets.
+
+ (*) connection_expiry
+
+ The amount of time in seconds after a connection was last used before we
+ remove it from the connection list. Whilst a connection is in existence,
+ it serves as a placeholder for negotiated security; when it is deleted,
+ the security must be renegotiated.
+
+ (*) transport_expiry
+
+ The amount of time in seconds after a transport was last used before we
+ remove it from the transport list. Whilst a transport is in existence, it
+ serves to anchor the peer data and keeps the connection ID counter.
+
+ (*) rxrpc_rx_window_size
+
+ The size of the receive window in packets. This is the maximum number of
+ unconsumed received packets we're willing to hold in memory for any
+ particular call.
+
+ (*) rxrpc_rx_mtu
+
+ The maximum packet MTU size that we're willing to receive in bytes. This
+ indicates to the peer whether we're willing to accept jumbo packets.
+
+ (*) rxrpc_rx_jumbo_max
+
+ The maximum number of packets that we're willing to accept in a jumbo
+ packet. Non-terminal packets in a jumbo packet must contain a four byte
+ header plus exactly 1412 bytes of data. The terminal packet must contain
+ a four byte header plus any amount of data. In any event, a jumbo packet
+ may not exceed rxrpc_rx_mtu in size.
diff --git a/Documentation/networking/scaling.txt b/Documentation/networking/scaling.txt
index ca6977f5b2e..99ca40e8e81 100644
--- a/Documentation/networking/scaling.txt
+++ b/Documentation/networking/scaling.txt
@@ -429,7 +429,7 @@ RPS and RFS were introduced in kernel 2.6.35. XPS was incorporated into
(therbert@google.com)
Accelerated RFS was introduced in 2.6.35. Original patches were
-submitted by Ben Hutchings (bhutchings@solarflare.com)
+submitted by Ben Hutchings (bwh@kernel.org)
Authors:
Tom Herbert (therbert@google.com)
diff --git a/Documentation/networking/spider_net.txt b/Documentation/networking/spider_net.txt
index 4b4adb8eb14..b0b75f8463b 100644
--- a/Documentation/networking/spider_net.txt
+++ b/Documentation/networking/spider_net.txt
@@ -73,7 +73,7 @@ Thus, in an idle system, the GDACTDPA, tail and head pointers will
all be pointing at the same descr, which should be "empty". All of the
other descrs in the ring should be "empty" as well.
-The show_rx_chain() routine will print out the the locations of the
+The show_rx_chain() routine will print out the locations of the
GDACTDPA, tail and head pointers. It will also summarize the contents
of the ring, starting at the tail pointer, and listing the status
of the descrs that follow.
diff --git a/Documentation/networking/stmmac.txt b/Documentation/networking/stmmac.txt
index 457b8bbafb0..2090895b08d 100644
--- a/Documentation/networking/stmmac.txt
+++ b/Documentation/networking/stmmac.txt
@@ -127,8 +127,9 @@ struct plat_stmmacenet_data {
int riwt_off;
void (*fix_mac_speed)(void *priv, unsigned int speed);
void (*bus_setup)(void __iomem *ioaddr);
- int (*init)(struct platform_device *pdev);
- void (*exit)(struct platform_device *pdev);
+ void *(*setup)(struct platform_device *pdev);
+ int (*init)(struct platform_device *pdev, void *priv);
+ void (*exit)(struct platform_device *pdev, void *priv);
void *custom_cfg;
void *custom_data;
void *bsp_priv;
@@ -160,7 +161,7 @@ Where:
o pmt: core has the embedded power module (optional).
o force_sf_dma_mode: force DMA to use the Store and Forward mode
instead of the Threshold.
- o force_thresh_dma_mode: force DMA to use the Shreshold mode other than
+ o force_thresh_dma_mode: force DMA to use the Threshold mode other than
the Store and Forward mode.
o riwt_off: force to disable the RX watchdog feature and switch to NAPI mode.
o fix_mac_speed: this callback is used for modifying some syscfg registers
@@ -169,13 +170,16 @@ Where:
o bus_setup: perform HW setup of the bus. For example, on some ST platforms
this field is used to configure the AMBA bridge to generate more
efficient STBus traffic.
- o init/exit: callbacks used for calling a custom initialization;
+ o setup/init/exit: callbacks used for calling a custom initialization;
this is sometime necessary on some platforms (e.g. ST boxes)
where the HW needs to have set some PIO lines or system cfg
- registers.
+ registers. setup should return a pointer to private data,
+ which will be stored in bsp_priv, and then passed to init and
+ exit callbacks. init/exit callbacks should not use or modify
+ platform data.
o custom_cfg/custom_data: this is a custom configuration that can be passed
while initializing the resources.
- o bsp_priv: another private poiter.
+ o bsp_priv: another private pointer.
For MDIO bus The we have:
@@ -271,7 +275,7 @@ reset procedure etc).
o dwmac1000_dma.c: dma functions for the GMAC chip;
o dwmac1000.h: specific header file for the GMAC;
o dwmac100_core: MAC 100 core and dma code;
- o dwmac100_dma.c: dma funtions for the MAC chip;
+ o dwmac100_dma.c: dma functions for the MAC chip;
o dwmac1000.h: specific header file for the MAC;
o dwmac_lib.c: generic DMA functions shared among chips;
o enh_desc.c: functions for handling enhanced descriptors;
@@ -364,4 +368,4 @@ Auto-negotiated Link Parter Ability.
10) TODO:
o XGMAC is not supported.
o Complete the TBI & RTBI support.
- o extened VLAN support for 3.70a SYNP GMAC.
+ o extend VLAN support for 3.70a SYNP GMAC.
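The setup/init/exit contract described in the stmmac.txt hunk (setup returns private data that is later handed to init and exit) might look roughly like this from a platform's point of view. It is a user-space sketch under stated assumptions: struct platform_device is a stand-in for the kernel type, and struct board_priv and the board_* function names are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* Stand-in for the kernel type so the sketch compiles in user space. */
struct platform_device { const char *name; };

/* Hypothetical board-private state returned by setup(). */
struct board_priv { int pio_configured; };

/* setup: allocate private data; the driver would store it in bsp_priv. */
static void *board_setup(struct platform_device *pdev)
{
	(void)pdev;
	return calloc(1, sizeof(struct board_priv));
}

/* init: works only on the private data, not on platform data. */
static int board_init(struct platform_device *pdev, void *priv)
{
	struct board_priv *bp = priv;

	(void)pdev;
	if (!bp)
		return -1;
	bp->pio_configured = 1;
	return 0;
}

/* exit: undo what init did, again using only the private data. */
static void board_exit(struct platform_device *pdev, void *priv)
{
	struct board_priv *bp = priv;

	(void)pdev;
	bp->pio_configured = 0;
}
```

Keeping all mutable state behind the pointer returned by setup is what lets init and exit leave the platform data untouched.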
diff --git a/Documentation/networking/tcp.txt b/Documentation/networking/tcp.txt
index 7d11bb5dc30..bdc4c0db51e 100644
--- a/Documentation/networking/tcp.txt
+++ b/Documentation/networking/tcp.txt
@@ -30,7 +30,7 @@ A congestion control mechanism can be registered through functions in
tcp_cong.c. The functions used by the congestion control mechanism are
registered via passing a tcp_congestion_ops struct to
tcp_register_congestion_control. As a minimum name, ssthresh,
-cong_avoid, min_cwnd must be valid.
+cong_avoid must be valid.
Private data for a congestion control mechanism is stored in tp->ca_priv.
tcp_ca(tp) returns a pointer to this space. This is preallocated space - it
diff --git a/Documentation/networking/timestamping.txt b/Documentation/networking/timestamping.txt
index 98097d8cb91..bc355412490 100644
--- a/Documentation/networking/timestamping.txt
+++ b/Documentation/networking/timestamping.txt
@@ -21,26 +21,38 @@ has such a feature).
SO_TIMESTAMPING:
-Instructs the socket layer which kind of information is wanted. The
-parameter is an integer with some of the following bits set. Setting
-other bits is an error and doesn't change the current state.
-
-SOF_TIMESTAMPING_TX_HARDWARE: try to obtain send time stamp in hardware
-SOF_TIMESTAMPING_TX_SOFTWARE: if SOF_TIMESTAMPING_TX_HARDWARE is off or
- fails, then do it in software
-SOF_TIMESTAMPING_RX_HARDWARE: return the original, unmodified time stamp
- as generated by the hardware
-SOF_TIMESTAMPING_RX_SOFTWARE: if SOF_TIMESTAMPING_RX_HARDWARE is off or
- fails, then do it in software
-SOF_TIMESTAMPING_RAW_HARDWARE: return original raw hardware time stamp
-SOF_TIMESTAMPING_SYS_HARDWARE: return hardware time stamp transformed to
- the system time base
-SOF_TIMESTAMPING_SOFTWARE: return system time stamp generated in
- software
-
-SOF_TIMESTAMPING_TX/RX determine how time stamps are generated.
-SOF_TIMESTAMPING_RAW/SYS determine how they are reported in the
-following control message:
+Instructs the socket layer which kind of information should be collected
+and/or reported. The parameter is an integer with some of the following
+bits set. Setting other bits is an error and doesn't change the current
+state.
+
+Four of the bits are requests to the stack to try to generate
+timestamps. Any combination of them is valid.
+
+SOF_TIMESTAMPING_TX_HARDWARE: try to obtain send time stamps in hardware
+SOF_TIMESTAMPING_TX_SOFTWARE: try to obtain send time stamps in software
+SOF_TIMESTAMPING_RX_HARDWARE: try to obtain receive time stamps in hardware
+SOF_TIMESTAMPING_RX_SOFTWARE: try to obtain receive time stamps in software
+
+The other three bits control which timestamps will be reported in a
+generated control message. If none of these bits are set or if none of
+the set bits correspond to data that is available, then the control
+message will not be generated:
+
+SOF_TIMESTAMPING_SOFTWARE: report systime if available
+SOF_TIMESTAMPING_SYS_HARDWARE: report hwtimetrans if available
+SOF_TIMESTAMPING_RAW_HARDWARE: report hwtimeraw if available
+
+It is worth noting that timestamps may be collected for reasons other
+than being requested by a particular socket with
+SOF_TIMESTAMPING_[TR]X_(HARD|SOFT)WARE. For example, most drivers that
+can generate hardware receive timestamps ignore
+SOF_TIMESTAMPING_RX_HARDWARE. It is still a good idea to set that flag
+in case future drivers pay attention.
+
+If timestamps are reported, they will appear in a control message with
+cmsg_level==SOL_SOCKET, cmsg_type==SO_TIMESTAMPING, and a payload like
+this:
struct scm_timestamping {
struct timespec systime;
@@ -85,7 +97,7 @@ Filled in if SOF_TIMESTAMPING_SYS_HARDWARE is set. Requires support
by the network device and will be empty without that support.
-SIOCSHWTSTAMP:
+SIOCSHWTSTAMP, SIOCGHWTSTAMP:
Hardware time stamping must also be initialized for each device driver
that is expected to do hardware time stamping. The parameter is defined in
@@ -115,6 +127,10 @@ Only a processes with admin rights may change the configuration. User
space is responsible to ensure that multiple processes don't interfere
with each other and that the settings are reset.
+Any process can read the actual configuration by passing this
+structure to ioctl(SIOCGHWTSTAMP) in the same way. However, this has
+not been implemented in all drivers.
+
/* possible values for hwtstamp_config->tx_type */
enum {
/*
@@ -157,7 +173,8 @@ DEVICE IMPLEMENTATION
A driver which supports hardware time stamping must support the
SIOCSHWTSTAMP ioctl and update the supplied struct hwtstamp_config with
-the actual values as described in the section on SIOCSHWTSTAMP.
+the actual values as described in the section on SIOCSHWTSTAMP. It
+should also support SIOCGHWTSTAMP.
Time stamps for received packets must be stored in the skb. To get a pointer
to the shared time stamp structure of the skb call skb_hwtstamps(). Then
@@ -185,6 +202,9 @@ Time stamps for outgoing packets are to be generated as follows:
and not free the skb. A driver not supporting hardware time stamping doesn't
do that. A driver must never touch sk_buff::tstamp! It is used to store
software generated time stamps by the network subsystem.
+- Driver should call skb_tx_timestamp() as close to passing sk_buff to hardware
+ as possible. skb_tx_timestamp() provides a software time stamp if requested
+ and hardware timestamping is not possible (SKBTX_IN_PROGRESS not set).
- As soon as the driver has sent the packet and/or obtained a
hardware time stamp for it, it passes the time stamp back by
calling skb_hwtstamp_tx() with the original skb, the raw
@@ -195,6 +215,3 @@ Time stamps for outgoing packets are to be generated as follows:
this would occur at a later time in the processing pipeline than other
software time stamping and therefore could lead to unexpected deltas
between time stamps.
-- If the driver did not set the SKBTX_IN_PROGRESS flag (see above), then
- dev_hard_start_xmit() checks whether software time stamping
- is wanted as fallback and potentially generates the time stamp.
diff --git a/Documentation/networking/timestamping/.gitignore b/Documentation/networking/timestamping/.gitignore
index 71e81eb2e22..a380159765c 100644
--- a/Documentation/networking/timestamping/.gitignore
+++ b/Documentation/networking/timestamping/.gitignore
@@ -1 +1,2 @@
timestamping
+hwtstamp_config
diff --git a/Documentation/networking/timestamping/Makefile b/Documentation/networking/timestamping/Makefile
index e79973443e9..d934afc8306 100644
--- a/Documentation/networking/timestamping/Makefile
+++ b/Documentation/networking/timestamping/Makefile
@@ -2,12 +2,13 @@
obj- := dummy.o
# List of programs to build
-hostprogs-y := timestamping
+hostprogs-y := timestamping hwtstamp_config
# Tell kbuild to always build the programs
always := $(hostprogs-y)
HOSTCFLAGS_timestamping.o += -I$(objtree)/usr/include
+HOSTCFLAGS_hwtstamp_config.o += -I$(objtree)/usr/include
clean:
- rm -f timestamping
+ rm -f timestamping hwtstamp_config
diff --git a/Documentation/networking/timestamping/hwtstamp_config.c b/Documentation/networking/timestamping/hwtstamp_config.c
new file mode 100644
index 00000000000..e8b685a7f15
--- /dev/null
+++ b/Documentation/networking/timestamping/hwtstamp_config.c
@@ -0,0 +1,134 @@
+/* Test program for SIOC{G,S}HWTSTAMP
+ * Copyright 2013 Solarflare Communications
+ * Author: Ben Hutchings
+ */
+
+#include <errno.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+
+#include <sys/socket.h>
+#include <sys/ioctl.h>
+
+#include <linux/if.h>
+#include <linux/net_tstamp.h>
+#include <linux/sockios.h>
+
+static int
+lookup_value(const char **names, int size, const char *name)
+{
+ int value;
+
+ for (value = 0; value < size; value++)
+ if (names[value] && strcasecmp(names[value], name) == 0)
+ return value;
+
+ return -1;
+}
+
+static const char *
+lookup_name(const char **names, int size, int value)
+{
+ return (value >= 0 && value < size) ? names[value] : NULL;
+}
+
+static void list_names(FILE *f, const char **names, int size)
+{
+ int value;
+
+ for (value = 0; value < size; value++)
+ if (names[value])
+ fprintf(f, " %s\n", names[value]);
+}
+
+static const char *tx_types[] = {
+#define TX_TYPE(name) [HWTSTAMP_TX_ ## name] = #name
+ TX_TYPE(OFF),
+ TX_TYPE(ON),
+ TX_TYPE(ONESTEP_SYNC)
+#undef TX_TYPE
+};
+#define N_TX_TYPES ((int)(sizeof(tx_types) / sizeof(tx_types[0])))
+
+static const char *rx_filters[] = {
+#define RX_FILTER(name) [HWTSTAMP_FILTER_ ## name] = #name
+ RX_FILTER(NONE),
+ RX_FILTER(ALL),
+ RX_FILTER(SOME),
+ RX_FILTER(PTP_V1_L4_EVENT),
+ RX_FILTER(PTP_V1_L4_SYNC),
+ RX_FILTER(PTP_V1_L4_DELAY_REQ),
+ RX_FILTER(PTP_V2_L4_EVENT),
+ RX_FILTER(PTP_V2_L4_SYNC),
+ RX_FILTER(PTP_V2_L4_DELAY_REQ),
+ RX_FILTER(PTP_V2_L2_EVENT),
+ RX_FILTER(PTP_V2_L2_SYNC),
+ RX_FILTER(PTP_V2_L2_DELAY_REQ),
+ RX_FILTER(PTP_V2_EVENT),
+ RX_FILTER(PTP_V2_SYNC),
+ RX_FILTER(PTP_V2_DELAY_REQ),
+#undef RX_FILTER
+};
+#define N_RX_FILTERS ((int)(sizeof(rx_filters) / sizeof(rx_filters[0])))
+
+static void usage(void)
+{
+ fputs("Usage: hwtstamp_config if_name [tx_type rx_filter]\n"
+ "tx_type is any of (case-insensitive):\n",
+ stderr);
+ list_names(stderr, tx_types, N_TX_TYPES);
+ fputs("rx_filter is any of (case-insensitive):\n", stderr);
+ list_names(stderr, rx_filters, N_RX_FILTERS);
+}
+
+int main(int argc, char **argv)
+{
+ struct ifreq ifr;
+ struct hwtstamp_config config;
+ const char *name;
+ int sock;
+
+ if ((argc != 2 && argc != 4) || (strlen(argv[1]) >= IFNAMSIZ)) {
+ usage();
+ return 2;
+ }
+
+ if (argc == 4) {
+ config.flags = 0;
+ config.tx_type = lookup_value(tx_types, N_TX_TYPES, argv[2]);
+ config.rx_filter = lookup_value(rx_filters, N_RX_FILTERS, argv[3]);
+ if (config.tx_type < 0 || config.rx_filter < 0) {
+ usage();
+ return 2;
+ }
+ }
+
+ sock = socket(AF_INET, SOCK_DGRAM, 0);
+ if (sock < 0) {
+ perror("socket");
+ return 1;
+ }
+
+ strcpy(ifr.ifr_name, argv[1]);
+ ifr.ifr_data = (caddr_t)&config;
+
+ if (ioctl(sock, (argc == 2) ? SIOCGHWTSTAMP : SIOCSHWTSTAMP, &ifr)) {
+ perror("ioctl");
+ return 1;
+ }
+
+ printf("flags = %#x\n", config.flags);
+ name = lookup_name(tx_types, N_TX_TYPES, config.tx_type);
+ if (name)
+ printf("tx_type = %s\n", name);
+ else
+ printf("tx_type = %d\n", config.tx_type);
+ name = lookup_name(rx_filters, N_RX_FILTERS, config.rx_filter);
+ if (name)
+ printf("rx_filter = %s\n", name);
+ else
+ printf("rx_filter = %d\n", config.rx_filter);
+
+ return 0;
+}
diff --git a/Documentation/networking/vortex.txt b/Documentation/networking/vortex.txt
index 9a8041dcbb5..97282da82b7 100644
--- a/Documentation/networking/vortex.txt
+++ b/Documentation/networking/vortex.txt
@@ -68,7 +68,7 @@ Module parameters
There are several parameters which may be provided to the driver when
its module is loaded. These are usually placed in /etc/modprobe.d/*.conf
-configuretion files. Example:
+configuration files. Example:
options 3c59x debug=3 rx_copybreak=300
@@ -178,7 +178,7 @@ max_interrupt_work=N
The driver's interrupt service routine can handle many receive and
transmit packets in a single invocation. It does this in a loop.
- The value of max_interrupt_work governs how mnay times the interrupt
+ The value of max_interrupt_work governs how many times the interrupt
service routine will loop. The default value is 32 loops. If this
is exceeded the interrupt service routine gives up and generates a
warning message "eth0: Too much work in interrupt".
diff --git a/Documentation/networking/x25-iface.txt b/Documentation/networking/x25-iface.txt
index 78f662ee062..7f213b556e8 100644
--- a/Documentation/networking/x25-iface.txt
+++ b/Documentation/networking/x25-iface.txt
@@ -105,7 +105,7 @@ reduced by the following measures or a combination thereof:
later.
The lapb module interface was modified to support this. Its
data_indication() method should now transparently pass the
- netif_rx() return value to the (lapb mopdule) caller.
+ netif_rx() return value to the (lapb module) caller.
(2) Drivers for kernel versions 2.2.x should always check the global
variable netdev_dropping when a new frame is received. The driver
should only call netif_rx() if netdev_dropping is zero. Otherwise