aboutsummaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2012-05-03net: Stop decapitating clones that have a head_fragAlexander Duyck
This change is meant ot prevent stealing the skb->head to use as a page in the event that the skb->head was cloned. This allows the other clones to track each other via shinfo->dataref. Without this we break down to two methods for tracking the reference count, one being dataref, the other being the page count. As a result it becomes difficult to track how many references there are to skb->head. Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-05-02net: implement tcp coalescing in tcp_queue_rcv()Eric Dumazet
Extend tcp coalescing implementing it from tcp_queue_rcv(), the main receiver function when application is not blocked in recvmsg(). Function tcp_queue_rcv() is moved a bit to allow its call from tcp_data_queue() This gives good results especially if GRO could not kick, and if skb head is a fragment. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Alexander Duyck <alexander.h.duyck@intel.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-05-02net: take care of cloned skbs in tcp_try_coalesce()Eric Dumazet
Before stealing fragments or skb head, we must make sure skbs are not cloned. Alexander was worried about destination skb being cloned : In bridge setups, a driver could be fooled if skb->data_len would not match skb nr_frags. If source skb is cloned, we must take references on pages instead. Bug happened using tcpdump (if not using mmap()) Introduce kfree_skb_partial() helper to cleanup code. Reported-by: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-05-02be2net: Fix EEH error reset before a flash dump completesSomnath Kotur
An EEH error can cause the FW to trigger a flash debug dump. Resetting the card while flash dump is in progress can cause it not to recover. Wait for it to finish before letting EEH flow to reset the card. Signed-off-by: Sathya Perla <Sathya.Perla@emulex.com> Signed-off-by: Somnath Kotur <somnath.kotur@emulex.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-05-02be2net: Record receive queue index in skb to aid RPS.Somnath Kotur
Signed-off-by: Sarveshwar Bandi <Sarveshwar.Bandi@emulex.com> Signed-off-by: Somnath Kotur <somnath.kotur@emulex.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-05-02be2net: Fix to apply duplex value as unknown when link is down.Somnath Kotur
Suggested-by: Ben Hutchings <bhutchings@solarflare.com> Signed-off-by: Sarveshwar Bandi <sarveshwar.bandi@emulex.com> Signed-off-by: Somnath Kotur <somnath.kotur@emulex.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-05-02be2net: Fix to not set link speed for disabled functions of a UMC cardSomnath Kotur
This renders the interface view somewhat inconsistent from the Host OS POV considering the rest of the interfaces are showing their respective speeds based on the bandwidth assigned to them. Signed-off-by: Somnath Kotur <somnath.kotur@emulex.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-05-02tcp: early retransmit: delayed fast retransmitYuchung Cheng
Implementing the advanced early retransmit (sysctl_tcp_early_retrans==2). Delays the fast retransmit by an interval of RTT/4. We borrow the RTO timer to implement the delay. If we receive another ACK or send a new packet, the timer is cancelled and restored to original RTO value offset by time elapsed. When the delayed-ER timer fires, we enter fast recovery and perform fast retransmit. Signed-off-by: Yuchung Cheng <ycheng@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-05-02tcp: early retransmitYuchung Cheng
This patch implements RFC 5827 early retransmit (ER) for TCP. It reduces DUPACK threshold (dupthresh) if outstanding packets are less than 4 to recover losses by fast recovery instead of timeout. While the algorithm is simple, small but frequent network reordering makes this feature dangerous: the connection repeatedly enter false recovery and degrade performance. Therefore we implement a mitigation suggested in the appendix of the RFC that delays entering fast recovery by a small interval, i.e., RTT/4. Currently ER is conservative and is disabled for the rest of the connection after the first reordering event. A large scale web server experiment on the performance impact of ER is summarized in section 6 of the paper "Proportional Rate Reduction for TCP”, IMC 2011. http://conferences.sigcomm.org/imc/2011/docs/p155.pdf Note that Linux has a similar feature called THIN_DUPACK. The differences are THIN_DUPACK do not mitigate reorderings and is only used after slow start. Currently ER is disabled if THIN_DUPACK is enabled. I would be happy to merge THIN_DUPACK feature with ER if people think it's a good idea. ER is enabled by sysctl_tcp_early_retrans: 0: Disables ER 1: Reduce dupthresh to packets_out - 1 when outstanding packets < 4. 2: (Default) reduce dupthresh like mode 1. In addition, delay entering fast recovery by RTT/4. Note: mode 2 is implemented in the third part of this patch series. Signed-off-by: Yuchung Cheng <ycheng@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-05-02tcp: early retransmit: tcp_enter_recovery()Yuchung Cheng
This a prepartion patch that refactors the code to enter recovery into a new function tcp_enter_recovery(). It's needed to implement the delayed fast retransmit in ER. Signed-off-by: Yuchung Cheng <ycheng@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-05-02net/pasemi: fix compiler warningStephen Rothwell
Fix this compiler warning (on PowerPC) by not marking a parameter as const: drivers/net/ethernet/pasemi/pasemi_mac.c: In function 'pasemi_mac_replenish_rx_ring': drivers/net/ethernet/pasemi/pasemi_mac.c:646:3: warning: passing argument 1 of 'netdev_alloc_skb' discards qualifiers from pointer target type include/linux/skbuff.h:1706:31: note: expected 'struct net_device *' but argument is of type 'const struct net_device *' Cc: Olof Johansson <olof@lixom.net> Cc: Pradeep A. Dalvi <netdev@pradeepdalvi.com> Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-05-02bnx2x: fix handling single MSIX mode for 57710/57711Dmitry Kravkov
commit 30a5de7723a8a4211be02e94236e9167a424fd07 added ability to use single MSI-X vector, but lack proper handling for 57710/57711 HW Signed-off-by: Dmitry Kravkov <dmitry@broadcom.com> Signed-off-by: Eilon Greenstein <eilong@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-05-02ixgbe: Reset max_vfs to zero when user request is out of rangeGreg Rose
If the user request for the number of VFs in the max_vfs parameter is out of range then reset the value to the default value of zero. This makes the behavior of the ixgbe driver the same as for the igb driver. Signed-off-by: Greg Rose <gregory.v.rose@intel.com> Tested-by: Robert Garrett <robertx.e.garrett@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2012-05-02ixgbe: Deny MACVLAN requests from VFs with admin set MACGreg Rose
If the host VMM administrator has set the virtual function device's MAC address then also deny VF requests for MACVLAN filters. Signed-off-by: Greg Rose <gregory.v.rose@intel.com> Tested-by: Garrett, Robert <robertx.e.garrett@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2012-05-02ixgbe: add hwmon interface to export thermal dataDon Skidmore
Some of our adapters have thermal data available, this patch exports this data via hwmon sysfs interface. Signed-off-by: Don Skidmore <donald.c.skidmore@intel.com> Tested-by: Stephen Ko <stephen.s.ko@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2012-05-02ixgbe: add support functions to access thermal dataDon Skidmore
Some 82599 adapters contain thermal data that we can get to via an i2c interface. These functions provide support to get at that data. A following patch will export this data. Signed-off-by: Don Skidmore <donald.c.skidmore@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2012-05-02e1000e: fix .ndo_set_rx_mode for 82579Bruce Allan
Secondary unicast and multicast addresses are added to the Receive Address registers (RAR) for most parts supported by the driver. For 82579, there is only one actual RAR and a number of Shared Receive Address registers (SHRAR) that are shared among the driver and f/w which can be reserved and write-protected by the f/w. On this device, use the SHRARs that are not taken by f/w for the additional addresses. Add a MAC ops function pointer infrastructure (similar to other MAC operations in the driver) for setting RARs, introduce a new rar_set function for 82579 and convert the existing code that sets RARs on other devices to a generic rar_set function. Signed-off-by: Bruce Allan <bruce.w.allan@intel.com> Tested-by: Jeff Pieper <jeffrey.e.pieper@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2012-05-02e1000e: PHY initialization flow changes for 82577/8/9Bruce Allan
The PHY initialization flows and assorted workarounds for 82577/8/9 done during driver load and resume from Sx should be the same yet they are not. Combine the current flows/workarounds into a common set of functions that are called during the different code paths. Signed-off-by: Bruce Allan <bruce.w.allan@intel.com> Tested-by: Aaron Brown <aaron.f.brown@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2012-05-02e1000e: workaround EEPROM configuration change on 82579Bruce Allan
An update to the EEPROM on 82579 will extend a delay in hardware to fix an issue with WoL not working after a G3->S5 transition which is unrelated to the driver. However, this extended delay conflicts with nominal operation of the device when it is initialized by the driver and after every reset of the hardware (i.e. the driver starts configuring the device before the hardware is done with it's own configuration work). The workaround for when the driver is in control of the device is to tell the hardware after every reset the configuration delay should be the original shorter one. Some pre-existing variables are renamed generically to be re-used with new register accesses. Signed-off-by: Bruce Allan <bruce.w.allan@intel.com> Tested-by: Jeff Pieper <jeffrey.e.pieper@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2012-05-01netem: add ECN capabilityEric Dumazet
Add ECN (Explicit Congestion Notification) marking capability to netem tc qdisc add dev eth0 root netem drop 0.5 ecn Instead of dropping packets, try to ECN mark them. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Tom Herbert <therbert@google.com> Cc: Hagen Paul Pfeifer <hagen@jauu.net> Cc: Stephen Hemminger <shemminger@vyatta.com> Acked-by: Hagen Paul Pfeifer <hagen@jauu.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-05-01net: skb_peek()/skb_peek_tail() cleanupsEric Dumazet
remove useless casts and rename variables for less confusion. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-05-01net: add a prefetch in socket backlog processingEric Dumazet
TCP or UDP stacks have big enough latencies that prefetching next pointer is worth it. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-05-01l2tp: let iproute2 create L2TPv3 IP tunnels using IPv6James Chapman
The netlink API lets users create unmanaged L2TPv3 tunnels using iproute2. Until now, a request to create an unmanaged L2TPv3 IP encapsulation tunnel over IPv6 would be rejected with EPROTONOSUPPORT. Now that l2tp_ip6 implements sockets for L2TP IP encapsulation over IPv6, we can add support for that tunnel type. Signed-off-by: James Chapman <jchapman@katalix.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-05-01l2tp: introduce L2TPv3 IP encapsulation support for IPv6Chris Elston
L2TPv3 defines an IP encapsulation packet format where data is carried directly over IP (no UDP). The kernel already has support for L2TP IP encapsulation over IPv4 (l2tp_ip). This patch introduces support for L2TP IP encapsulation over IPv6. The implementation is derived from ipv6/raw and ipv4/l2tp_ip. Signed-off-by: Chris Elston <celston@katalix.com> Signed-off-by: James Chapman <jchapman@katalix.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-05-01ipv6: Export ipv6 functions for use by other protocolsChris Elston
For implementing other protocols on top of IPv6, such as L2TPv3's IP encapsulation over ipv6, we'd like to call some IPv6 functions which are not currently exported. This patch exports them. Signed-off-by: Chris Elston <celston@katalix.com> Signed-off-by: James Chapman <jchapman@katalix.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-05-01l2tp: netlink api for l2tpv3 ipv6 unmanaged tunnelsChris Elston
This patch adds support for unmanaged L2TPv3 tunnels over IPv6 using the netlink API. We already support unmanaged L2TPv3 tunnels over IPv4. A patch to iproute2 to make use of this feature will be submitted separately. Signed-off-by: Chris Elston <celston@katalix.com> Signed-off-by: James Chapman <jchapman@katalix.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-05-01l2tp: show IPv6 addresses in l2tp debugfs fileChris Elston
If an L2TP tunnel uses IPv6, make sure the l2tp debugfs file shows the IPv6 address correctly. Signed-off-by: Chris Elston <celston@katalix.com> Signed-off-by: James Chapman <jchapman@katalix.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-05-01l2tp: pppol2tp_connect() handles ipv6 sockaddr variantsJames Chapman
Userspace uses connect() to associate a pppol2tp socket with a tunnel socket. This needs to allow the caller to supply the new IPv6 sockaddr_pppol2tp structures if IPv6 is used. Signed-off-by: James Chapman <jchapman@katalix.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-05-01pppox: Replace __attribute__((packed)) in if_pppox.hJames Chapman
Checkpatch warns about the use of __attribute__((packed)). So use the recommended __packed syntax instead. Signed-off-by: James Chapman <jchapman@katalix.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-05-01l2tp: remove unused stats from l2tp_ip socketJames Chapman
The l2tp_ip socket currently maintains packet/byte stats in its private socket structure. But these counters aren't exposed to userspace and so serve no purpose. The counters were also smp-unsafe. So this patch just gets rid of the stats. While here, change a couple of internal __u32 variables to u32. Signed-off-by: James Chapman <jchapman@katalix.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-05-01l2tp: Use ip4_datagram_connect() in l2tp_ip_connect()James Chapman
Cleanup the l2tp_ip code to make use of an existing ipv4 support function. Signed-off-by: James Chapman <jchapman@katalix.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-05-01l2tp: fix locking of 64-bit counters for smpJames Chapman
L2TP uses 64-bit counters but since these are not updated atomically, we need to make them safe for smp. This patch addresses that. Signed-off-by: James Chapman <jchapman@katalix.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-30atl1c: remove PHY polling from atl1c_change_mtuHuang, Xiong
PHY polling code for FPGA is considered in every MDIO R/W API. no need to add additional code to atl1c_change_mtu. Signed-off-by: xiong <xiong@qca.qualcomm.com> Tested-by: David Liu <dwliu@qca.qaulcomm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-30atl1c: Disable L0S when no cable linkHuang, Xiong
L0S might be unstable if no cable link, only enable it when link up. Signed-off-by: xiong <xiong@qca.qualcomm.com> Tested-by: Liu David <dwliu@qca.qualcomm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-30atl1c: do MAC-reset when PHY link downHuang, Xiong
There may be tx-skbs still pending in HW when PHY link down. Reset MAC will make the DMA engine go to the start point. and release all pending skbs. Note: Reset MAC will clear any interrupt status and mask. Signed-off-by: xiong <xiong@qca.qualcomm.com> Tested-by: Liu David <dwliu@qca.qualcomm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-30atl1c: cancel task when interface closedHuang, Xiong
common_task might be running while close routine is called, wait/cancel it. Signed-off-by: xiong <xiong@qca.qualcomm.com> Tested-by: Liu David <dwliu@qca.qualcomm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-30atl1c: enlarge L1 response waiting timerHuang, Xiong
The hardware incorrectly process L0S/L1 entrance if the chipset/root response after specific/shorter timer and cause system hang. Enlarge the timeout value to avoid this issue. Signed-off-by: xiong <xiong@qca.qualcomm.com> Tested-by: Liu David <dwliu@qca.qualcomm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-30atl1c: refine mac address related codeHuang, Xiong
On some platform with EEPROM/OTP existing, the BIOS could overwrite a new MAC address for the NIC. so, the permanent mac address should be from BIOS. the address is restored when driver removing. Voltage raising isn't applicable for l1d. Replace swab32 with htonl for big/little endian platform. related Registers are refined as well. Signed-off-by: xiong <xiong@qca.qualcomm.com> Tested-by: Liu David <dwliu@qca.qualcomm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-30atl1c: remove code of closing register writable attributionHuang, Xiong
The Close-action is done by atl1c_reset_pcie, remove it from atl1c_get_permanent_address. Signed-off-by: xiong <xiong@qca.qualcomm.com> Tested-by: Liu David <dwliu@qca.qualcomm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-30atl1c: clear WoL status when reset pcieHuang, Xiong
WoL status is read-clear and should be cleared when in S0 status. putting it in atl1c_reset_pcie is more suitable than in atl1c_get_permanent_address. Signed-off-by: xiong <xiong@qca.qualcomm.com> Tested-by: Liu David <dwliu@qca.qualcomm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-30atl1c: add PHY link event(up/down) patchHuang, Xiong
On some platforms the PHY settings need to change depending on the cable link status to get better stability. Signed-off-by: xiong <xiong@qca.qualcomm.com> Tested-by: Liu David <dwliu@qca.qualcomm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-30atl1c: add workaround for issue of bit INTX-disable for MSI interruptHuang, Xiong
All supported devices have one issue that msi interrupt doesn't assert if pci command register bit (PCI_COMMAND_INTX_DISABLE) is set. Add workaround in drivers/pci/quirks.c Signed-off-by: xiong <xiong@qca.qualcomm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-30Merge branch 'tipc_net-next' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux
2012-04-30bnx2x: remove some bloatEric Dumazet
Before doing skb->head_frag work on bnx2x driver, I found too much stuff was inlined in bnx2x/bnx2x_cmn.h for no good reason and made my work not very easy. Move some big functions out of this include file to the respective .c file. A lot of inline keywords are not needed at all in this huge driver. text data bss dec hex filename 490083 1270 56 491409 77f91 bnx2x/bnx2x.ko.before 484206 1270 56 485532 7689c bnx2x/bnx2x.ko Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Eilon Greenstein <eilong@broadcom.com> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Maciej Żenczykowski <maze@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Tom Herbert <therbert@google.com> Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Cc: Ben Hutchings <bhutchings@solarflare.com> Cc: Matt Carlson <mcarlson@broadcom.com> Cc: Michael Chan <mchan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-30pch_gbe: reprogram multicast address register on resetRongQing.Li
The reset logic after a Rx FIFO overrun will clear the programmed multicast addresses. This patch fixes the issue by reprogramming the registers after the reset. The commit eefc48b ("pch_gbe: reprogram multicast address register on reset") tried to fix this problem, but it introduces unnecessary codes. In fact, all multicast addresses have been saved in netdev->mc, So we can call pch_gbe_set_multi() directly after reset_hw and reset_rx. This commit kills 50+ line codes Cc: Richard Cochran <richardcochran@gmail.com> Cc: Takahiro Shimizu <tshimizu818@gmail.com> Signed-off-by: RongQing.Li <roy.qing.li@gmail.com> Acked-by: Richard Cochran <richardcochran@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-30net: makes skb_splice_bits() aware of skb->head_fragEric Dumazet
__skb_splice_bits() can check if skb to be spliced has its skb->head mapped to a page fragment, instead of a kmalloc() area. If so we can avoid a copy of the skb head and get a reference on underlying page. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Maciej Żenczykowski <maze@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Tom Herbert <therbert@google.com> Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Cc: Ben Hutchings <bhutchings@solarflare.com> Cc: Matt Carlson <mcarlson@broadcom.com> Cc: Michael Chan <mchan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-30tcp: makes tcp_try_coalesce aware of skb->head_fragEric Dumazet
TCP coalesce can check if skb to be merged has its skb->head mapped to a page fragment, instead of a kmalloc() area. We had to disable coalescing in this case, for performance reasons. We 'upgrade' skb->head as a fragment in itself. This reduces number of cache misses when user makes its copies, since a less sk_buff are fetched. This makes receive and ofo queues shorter and thus reduce cache line misses in TCP stack. This is a followup of patch "net: allow skb->head to be a page fragment" Tested with tg3 nic, with GRO on or off. We can see "TCPRcvCoalesce" counter being incremented. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Maciej Żenczykowski <maze@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Tom Herbert <therbert@google.com> Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Cc: Ben Hutchings <bhutchings@solarflare.com> Cc: Matt Carlson <mcarlson@broadcom.com> Cc: Michael Chan <mchan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-30net: make GRO aware of skb->head_fragEric Dumazet
GRO can check if skb to be merged has its skb->head mapped to a page fragment, instead of a kmalloc() area. We 'upgrade' skb->head as a fragment in itself This avoids the frag_list fallback, and permits to build true GRO skb (one sk_buff and up to 16 fragments), using less memory. This reduces number of cache misses when user makes its copy, since a single sk_buff is fetched. This is a followup of patch "net: allow skb->head to be a page fragment" Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Maciej Żenczykowski <maze@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Tom Herbert <therbert@google.com> Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Cc: Ben Hutchings <bhutchings@solarflare.com> Cc: Matt Carlson <mcarlson@broadcom.com> Cc: Michael Chan <mchan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-30tg3: provide frags as skb headEric Dumazet
This patch converts tg3 driver, one of our reference drivers, to use new build_skb() api in frag mode. Instead of using kmalloc() to allocate the memory block that will be used by build_skb() as skb->head, we use a page fragment. This is a followup of patch "net: allow skb->head to be a page fragment" This allows GRO, TCP coalescing, and splice() to be more efficient. Incidentally, this also removes SLUB slow path contention in kfree() Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Maciej Żenczykowski <maze@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Tom Herbert <therbert@google.com> Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Cc: Ben Hutchings <bhutchings@solarflare.com> Cc: Matt Carlson <mcarlson@broadcom.com> Cc: Michael Chan <mchan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-30net: allow skb->head to be a page fragmentEric Dumazet
skb->head is currently allocated from kmalloc(). This is convenient but has the drawback the data cannot be converted to a page fragment if needed. We have three spots were it hurts : 1) GRO aggregation When a linear skb must be appended to another skb, GRO uses the frag_list fallback, very inefficient since we keep all struct sk_buff around. So drivers enabling GRO but delivering linear skbs to network stack aren't enabling full GRO power. 2) splice(socket -> pipe). We must copy the linear part to a page fragment. This kind of defeats splice() purpose (zero copy claim) 3) TCP coalescing. Recently introduced, this permits to group several contiguous segments into a single skb. This shortens queue lengths and save kernel memory, and greatly reduce probabilities of TCP collapses. This coalescing doesnt work on linear skbs (or we would need to copy data, this would be too slow) Given all these issues, the following patch introduces the possibility of having skb->head be a fragment in itself. We use a new skb flag, skb->head_frag to carry this information. build_skb() is changed to accept a frag_size argument. Drivers willing to provide a page fragment instead of kmalloc() data will set a non zero value, set to the fragment size. Then, on situations we need to convert the skb head to a frag in itself, we can check if skb->head_frag is set and avoid the copies or various fallbacks we have. This means drivers currently using frags could be updated to avoid the current skb->head allocation and reduce their memory footprint (aka skb truesize). (thats 512 or 1024 bytes saved per skb). This also makes bpf/netfilter faster since the 'first frag' will be part of skb linear part, no need to copy data. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Cc: Herbert Xu <herbert@gondor.apana.org.au> Cc: Maciej Żenczykowski <maze@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Tom Herbert <therbert@google.com> Cc: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Cc: Ben Hutchings <bhutchings@solarflare.com> Cc: Matt Carlson <mcarlson@broadcom.com> Cc: Michael Chan <mchan@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>