aboutsummaryrefslogtreecommitdiff
path: root/net/ipv4
AgeCommit message (Collapse)Author
2012-12-17inet_diag: validate port comparison byte code to prevent unsafe readsNeal Cardwell
[ Upstream commit 5e1f54201cb481f40a04bc47e1bc8c093a189e23 ] Add logic to verify that a port comparison byte code operation actually has the second inet_diag_bc_op from which we read the port for such operations. Previously the code blindly referenced op[1] without first checking whether a second inet_diag_bc_op struct could fit there. So a malicious user could make the kernel read 4 bytes beyond the end of the bytecode array by claiming to have a whole port comparison byte code (2 inet_diag_bc_op structs) when in fact the bytecode was not long enough to hold both. Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-12-17inet_diag: avoid unsafe and nonsensical prefix matches in inet_diag_bc_run()Neal Cardwell
[ Upstream commit f67caec9068cee426ec23cf9005a1dee2ecad187 ] Add logic to check the address family of the user-supplied conditional and the address family of the connection entry. We now do not do prefix matching of addresses from different address families (AF_INET vs AF_INET6), except for the previously existing support for having an IPv4 prefix match an IPv4-mapped IPv6 address (which this commit maintains as-is). This change is needed for two reasons: (1) The addresses are different lengths, so comparing a 128-bit IPv6 prefix match condition to a 32-bit IPv4 connection address can cause us to unwittingly walk off the end of the IPv4 address and read garbage or oops. (2) The IPv4 and IPv6 address spaces are semantically distinct, so a simple bit-wise comparison of the prefixes is not meaningful, and would lead to bogus results (except for the IPv4-mapped IPv6 case, which this commit maintains). Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-12-17inet_diag: validate byte code to prevent oops in inet_diag_bc_run()Neal Cardwell
[ Upstream commit 405c005949e47b6e91359159c24753519ded0c67 ] Add logic to validate INET_DIAG_BC_S_COND and INET_DIAG_BC_D_COND operations. Previously we did not validate the inet_diag_hostcond, address family, address length, and prefix length. So a malicious user could make the kernel read beyond the end of the bytecode array by claiming to have a whole inet_diag_hostcond when the bytecode was not long enough to contain a whole inet_diag_hostcond of the given address family. Or they could make the kernel read up to about 27 bytes beyond the end of a connection address by passing a prefix length that exceeded the length of addresses of the given family. Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-12-17inet_diag: fix oops for IPv4 AF_INET6 TCP SYN-RECV stateNeal Cardwell
[ Upstream commit 1c95df85ca49640576de2f0a850925957b547b84 ] Fix inet_diag to be aware of the fact that AF_INET6 TCP connections instantiated for IPv4 traffic and in the SYN-RECV state were actually created with inet_reqsk_alloc(), instead of inet6_reqsk_alloc(). This means that for such connections inet6_rsk(req) returns a pointer to a random spot in memory up to roughly 64KB beyond the end of the request_sock. With this bug, for a server using AF_INET6 TCP sockets and serving IPv4 traffic, an inet_diag user like `ss state SYN-RECV` would lead to inet_diag_fill_req() causing an oops or the export to user space of 16 bytes of kernel memory as a garbage IPv6 address, depending on where the garbage inet6_rsk(req) pointed. Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-12-17ipv4: ip_check_defrag must not modify skb before unsharingJohannes Berg
[ Upstream commit 1bf3751ec90cc3174e01f0d701e8449ce163d113 ] ip_check_defrag() might be called from af_packet within the RX path where shared SKBs are used, so it must not modify the input SKB before it has unshared it for defragmentation. Use skb_copy_bits() to get the IP header and only pull in everything later. The same is true for the other caller in macvlan as it is called from dev->rx_handler which can also get a shared SKB. Reported-by: Eric Leblond <eric@regit.org> Cc: stable@vger.kernel.org Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-12-17ipv4: avoid passing NULL to inet_putpeer() in icmpv4_xrlim_allow()Neal Cardwell
[ Upstream commit e1a676424c290b1c8d757e3860170ac7ecd89af4 ] inet_getpeer_v4() can return NULL under OOM conditions, and while inet_peer_xrlim_allow() is OK with a NULL peer, inet_putpeer() will crash. This code path now uses the same idiom as the others from: 1d861aa4b3fb08822055345f480850205ffe6170 ("inet: Minimize use of cached route inetpeer."). Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-12-17ipv4: do not cache looped multicastsJulian Anastasov
[ Upstream commit 636174219b52b5a8bc51bc23bbcba97cd30a65e3 ] Starting from 3.6 we cache output routes for multicasts only when using route to 224/4. For local receivers we can set RTCF_LOCAL flag depending on the membership but in such case we use maddr and saddr which are not caching keys as before. Additionally, we can not use same place to cache routes that differ in RTCF_LOCAL flag value. Fix it by caching only RTCF_MULTICAST entries without RTCF_LOCAL (send-only, no loopback). As a side effect, we avoid unneeded lookup for fnhe when not caching because multicasts are not redirected and they do not learn PMTU. Thanks to Maxime Bizon for showing the caching problems in __mkroute_output for 3.6 kernels: different RTCF_LOCAL flag in cache can lead to wrong ip_mc_output or ip_output call and the visible problem is that traffic can not reach local receivers via loopback. Reported-by: Maxime Bizon <mbizon@freebox.fr> Tested-by: Maxime Bizon <mbizon@freebox.fr> Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-11-26netfilter: nf_nat: don't check for port change on ICMP tuplesUlrich Weber
commit 38fe36a248ec3228f8e6507955d7ceb0432d2000 upstream. ICMP tuples have id in src and type/code in dst. So comparing src.u.all with dst.u.all will always fail here and ip_xfrm_me_harder() is called for every ICMP packet, even if there was no NAT. Signed-off-by: Ulrich Weber <ulrich.weber@sophos.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-11-26tcp: handle tcp_net_metrics_init() order-5 memory allocation failuresEric Dumazet
[ Upstream commit 976a702ac9eeacea09e588456ab165dc06f9ee83 ] order-5 allocations can fail with current kernels, we should try vmalloc() as well. Reported-by: Julien Tinnes <jln@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-11-26tcp: fix retransmission in repair modeAndrew Vagin
[ Upstream commit ec34232575083fd0f43d3a101e8ebb041b203761 ] Currently if a socket was repaired with a few packet in a write queue, a kernel bug may be triggered: kernel BUG at net/ipv4/tcp_output.c:2330! RIP: 0010:[<ffffffff8155784f>] tcp_retransmit_skb+0x5ff/0x610 According to the initial realization v3.4-rc2-963-gc0e88ff, all skb-s should look like already posted. This patch fixes code according with this sentence. Here are three points, which were not done in the initial patch: 1. A tcp send head should not be changed 2. Initialize TSO state of a skb 3. Reset the retransmission time This patch moves logic from tcp_sendmsg to tcp_write_xmit. A packet passes the ussual way, but isn't sent to network. This patch solves all described problems and handles tcp_sendpages. Signed-off-by: Andrey Vagin <avagin@openvz.org> Cc: Pavel Emelyanov <xemul@parallels.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> Cc: James Morris <jmorris@namei.org> Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> Cc: Patrick McHardy <kaber@trash.net> Acked-by: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-11-26ipv4/ip_vti.c: VTI fix post-decryption forwardingSaurabh Mohan
[ Upstream commit b2942004fb5c9f3304b77e187b8a1977b3626c9b ] With the latest kernel there are two things that must be done post decryption so that the packet are forwarded. 1. Remove the mark from the packet. This will cause the packet to not match the ipsec-policy again. However doing this causes the post-decryption check to fail also and the packet will get dropped. (cat /proc/net/xfrm_stat). 2. Remove the sp association in the skbuff so that no policy check is done on the packet for VTI tunnels. Due to #2 above we must now do a security-policy check in the vti rcv path prior to resetting the mark in the skbuff. Signed-off-by: Saurabh Mohan <saurabh.mohan@vyatta.com> Reported-by: Ruben Herold <ruben@puettmann.net> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-11-26ipv4: avoid undefined behavior in do_ip_setsockopt()Xi Wang
[ Upstream commit 0c9f79be295c99ac7e4b569ca493d75fdcc19e4e ] (1<<optname) is undefined behavior in C with a negative optname or optname larger than 31. In those cases the result of the shift is not necessarily zero (e.g., on x86). This patch simplifies the code with a switch statement on optname. It also allows the compiler to generate better code (e.g., using a 64-bit mask). Signed-off-by: Xi Wang <xi.wang@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-11-17net: inet_diag -- Return error code if protocol handler is missedCyrill Gorcunov
[ Upstream commit cacb6ba0f36ab14a507f4ee7697e8332899015d2 ] We've observed that in case if UDP diag module is not supported in kernel the netlink returns NLMSG_DONE without notifying a caller that handler is missed. This patch makes __inet_diag_dump to return error code instead. So as example it become possible to detect such situation and handle it gracefully on userspace level. Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org> CC: David Miller <davem@davemloft.net> CC: Eric Dumazet <eric.dumazet@gmail.com> CC: Pavel Emelyanov <xemul@parallels.com> Acked-by: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-11-17tcp-repair: Handle zero-length data put in rcv queuePavel Emelyanov
[ Upstream commit c454e6111d1ef4268fe98e87087216e51c2718c3 ] When sending data into a tcp socket in repair state we should check for the amount of data being 0 explicitly. Otherwise we'll have an skb with seq == end_seq in rcv queue, but tcp doesn't expect this to happen (in particular a warn_on in tcp_recvmsg shoots). Signed-off-by: Pavel Emelyanov <xemul@parallels.com> Reported-by: Giorgos Mavrikas <gmavrikas@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-11-17net: fix divide by zero in tcp algorithm illinoisJesper Dangaard Brouer
[ Upstream commit 8f363b77ee4fbf7c3bbcf5ec2c5ca482d396d664 ] Reading TCP stats when using TCP Illinois congestion control algorithm can cause a divide by zero kernel oops. The division by zero occur in tcp_illinois_info() at: do_div(t, ca->cnt_rtt); where ca->cnt_rtt can become zero (when rtt_reset is called) Steps to Reproduce: 1. Register tcp_illinois: # sysctl -w net.ipv4.tcp_congestion_control=illinois 2. Monitor internal TCP information via command "ss -i" # watch -d ss -i 3. Establish new TCP conn to machine Either it fails at the initial conn, or else it needs to wait for a loss or a reset. This is only related to reading stats. The function avg_delay() also performs the same divide, but is guarded with a (ca->cnt_rtt > 0) at its calling point in update_params(). Thus, simply fix tcp_illinois_info(). Function tcp_illinois_info() / get_info() is called without socket lock. Thus, eliminate any race condition on ca->cnt_rtt by using a local stack variable. Simply reuse info.tcpv_rttcnt, as its already set to ca->cnt_rtt. Function avg_delay() is not affected by this race condition, as its called with the socket lock. Cc: Petr Matousek <pmatouse@redhat.com> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Acked-by: Eric Dumazet <edumazet@google.com> Acked-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-11-17tcp: fix FIONREAD/SIOCINQEric Dumazet
[ Upstream commit a3374c42aa5f7237e87ff3b0622018636b0c847e ] tcp_ioctl() tries to take into account if tcp socket received a FIN to report correct number bytes in receive queue. But its flaky because if the application ate the last skb, we return 1 instead of 0. Correct way to detect that FIN was received is to test SOCK_DONE. Reported-by: Elliot Hughes <enh@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-11-17ipv4: Fix flushing of cached routing informationsSteffen Klassert
[ Upstream commit 13d82bf50dce632355fcccafa4fe44a9b5e706d8 ] Currently we can not flush cached pmtu/redirect informations via the ipv4_sysctl_rtcache_flush sysctl. We need to check the rt_genid of the old route and reset the nh exeption if the old route is expired when we bind a new route to a nh exeption. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-10-28tcp: resets are misroutedAlexey Kuznetsov
[ Upstream commit 4c67525849e0b7f4bd4fab2487ec9e43ea52ef29 ] After commit e2446eaa ("tcp_v4_send_reset: binding oif to iif in no sock case").. tcp resets are always lost, when routing is asymmetric. Yes, backing out that patch will result in misrouting of resets for dead connections which used interface binding when were alive, but we actually cannot do anything here. What's died that's died and correct handling normal unbound connections is obviously a priority. Comment to comment: > This has few benefits: > 1. tcp_v6_send_reset already did that. It was done to route resets for IPv6 link local addresses. It was a mistake to do so for global addresses. The patch fixes this as well. Actually, the problem appears to be even more serious than guaranteed loss of resets. As reported by Sergey Soloviev <sol@eqv.ru>, those misrouted resets create a lot of arp traffic and huge amount of unresolved arp entires putting down to knees NAT firewalls which use asymmetric routing. Signed-off-by: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-10-28ipv4: Add FLOWI_FLAG_KNOWN_NHJulian Anastasov
[ Upstream commit c92b96553a80c1dbe2ebe128bbe37c8f98f148bf ] Add flag to request that output route should be returned with known rt_gateway, in case we want to use it as nexthop for neighbour resolving. The returned route can be cached as follows: - in NH exception: because the cached routes are not shared with other destinations - in FIB NH: when using gateway because all destinations for NH share same gateway As last option, to return rt_gateway!=0 we have to set DST_NOCACHE. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-10-28ipv4: introduce rt_uses_gatewayJulian Anastasov
[ Upstream commit 155e8336c373d14d87a7f91e356d85ef4b93b8f9 ] Add new flag to remember when route is via gateway. We will use it to allow rt_gateway to contain address of directly connected host for the cases when DST_NOCACHE is used or when the NH exception caches per-destination route without DST_NOCACHE flag, i.e. when routes are not used for other destinations. By this way we force the neighbour resolving to work with the routed destination but we can use different address in the packet, feature needed for IPVS-DR where original packet for virtual IP is routed via route to real IP. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-10-28ipv4: make sure nh_pcpu_rth_output is always allocatedJulian Anastasov
[ Upstream commit f8a17175c63fd3e8b573719f7538816f8c96abf4 ] Avoid checking nh_pcpu_rth_output in fast path, abort fib_info creation on alloc_percpu failure. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-10-28ipv4: fix forwarding for strict source routesJulian Anastasov
[ Upstream commit e0adef0f7456d5d3a3bfe8ea61c7dddf146b40e1 ] After the change "Adjust semantics of rt->rt_gateway" (commit f8126f1d51) rt_gateway can be 0 but ip_forward() compares it directly with nexthop. What we want here is to check if traffic is to directly connected nexthop and to fail if using gateway. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-10-28ipv4: fix sending of redirectsJulian Anastasov
[ Upstream commit e81da0e113a1b7fc7449ae6213f65f89ccac6d06 ] After "Cache input routes in fib_info nexthops" (commit d2d68ba9fe) and "Elide fib_validate_source() completely when possible" (commit 7a9bc9b81a) we can not send ICMP redirects. It seems we should not cache the RTCF_DOREDIRECT flag in nh_rth_input because the same fib_info can be used for traffic that is not redirected, eg. from other input devices or from sources that are not in same subnet. As result, we have to disable the caching of RTCF_DOREDIRECT flag and to force source validation for the case when forwarding traffic to the input device. If traffic comes from directly connected source we allow redirection as it was done before both changes. Avoid setting RTCF_DOREDIRECT if IN_DEV_TX_REDIRECTS is disabled, this can avoid source address validation and to help caching the routes. After the change "Adjust semantics of rt->rt_gateway" (commit f8126f1d51) we should make sure our ICMP_REDIR_HOST messages contain daddr instead of 0.0.0.0 when target is directly connected. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-10-28ipv4: Don't report stale pmtu values to userspaceSteffen Klassert
[ Upstream commit ee9a8f7ab2edf801b8b514c310455c94acc232f6 ] We report cached pmtu values even if they are already expired. Change this to not report these values after they are expired and fix a race in the expire time calculation, as suggested by Eric Dumazet. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-10-28ipv4: Don't create nh exeption when the device mtu is smaller than the ↵Steffen Klassert
reported pmtu [ Upstream commit 7f92d334ba19a0d8e96f8f8f092219553367d921 ] When a local tool like tracepath tries to send packets bigger than the device mtu, we create a nh exeption and set the pmtu to device mtu. The device mtu does not expire, so check if the device mtu is smaller than the reported pmtu and don't crerate a nh exeption in that case. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-10-28ipv4: Always invalidate or update the route on pmtu eventsSteffen Klassert
[ Upstream commit d851c12b60471188e15e5c8405b289073e8dd025 ] Some protocols, like IPsec still cache routes. So we need to invalidate the old route on pmtu events to avoid the reuse of stale routes. We also need to update the mtu and expire time of the route if we already use a nh exception route, otherwise we ignore newly learned pmtu values after the first expiration. With this patch we always invalidate or update the route on pmtu events. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-10-13ipv4: add a fib_type to fib_infoEric Dumazet
[ Upstream commit f4ef85bbda96324785097356336bc79cdd37db0a ] commit d2d68ba9fe8 (ipv4: Cache input routes in fib_info nexthops.) introduced a regression for forwarding. This was hard to reproduce but the symptom was that packets were delivered to local host instead of being forwarded. David suggested to add fib_type to fib_info so that we dont inadvertently share same fib_info for different purposes. With help from Julian Anastasov who provided very helpful hints, reproduced here : <quote> Can it be a problem related to fib_info reuse from different routes. For example, when local IP address is created for subnet we have: broadcast 192.168.0.255 dev DEV proto kernel scope link src 192.168.0.1 192.168.0.0/24 dev DEV proto kernel scope link src 192.168.0.1 local 192.168.0.1 dev DEV proto kernel scope host src 192.168.0.1 The "dev DEV proto kernel scope link src 192.168.0.1" is a reused fib_info structure where we put cached routes. The result can be same fib_info for 192.168.0.255 and 192.168.0.0/24. RTN_BROADCAST is cached only for input routes. Incoming broadcast to 192.168.0.255 can be cached and can cause problems for traffic forwarded to 192.168.0.0/24. So, this patch should solve the problem because it separates the broadcast from unicast traffic. And the ip_route_input_slow caching will work for local and broadcast input routes (above routes 1 and 3) just because they differ in scope and use different fib_info. </quote> Many thanks to Chris Clayton for his patience and help. Reported-by: Chris Clayton <chris2553@googlemail.com> Bisected-by: Chris Clayton <chris2553@googlemail.com> Reported-by: Dave Jones <davej@redhat.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Julian Anastasov <ja@ssi.bg> Tested-by: Chris Clayton <chris2553@googlemail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-09-27inetpeer: fix token initializationNicolas Dichtel
When jiffies wraps around (for example, 5 minutes after the boot, see INITIAL_JIFFIES) and peer has just been created, now - peer->rate_last can be < XRLIM_BURST_FACTOR * timeout, so token is not set to the maximum value, thus some icmp packets can be unexpectedly dropped. Fix this case by initializing last_rate to 60 seconds in the past. Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-22ipv4: raw: fix icmp_filter()Eric Dumazet
icmp_filter() should not modify its input, or else its caller would need to recompute ip_hdr() if skb->head is reallocated. Use skb_header_pointer() instead of pskb_may_pull() and change the prototype to make clear both sk and skb are const. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-21net: change return values from -EACCES to -EPERMZhao Hongjiang
Change return value from -EACCES to -EPERM when the permission check fails. Signed-off-by: Zhao Hongjiang <zhaohongjiang@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-20tcp: restore rcv_wscale in a repair mode (v2)Andrey Vagin
rcv_wscale is a symetric parameter with snd_wscale. Both this parameters are set on a connection handshake. Without this value a remote window size can not be interpreted correctly, because a value from a packet should be shifted on rcv_wscale. And one more thing is that wscale_ok should be set too. This patch doesn't break a backward compatibility. If someone uses it in a old scheme, a rcv window will be restored with the same bug (rcv_wscale = 0). v2: Save backward compatibility on big-endian system. Before the first two bytes were snd_wscale and the second two bytes were rcv_wscale. Now snd_wscale is opt_val & 0xFFFF and rcv_wscale >> 16. This approach is independent on byte ordering. Cc: David S. Miller <davem@davemloft.net> Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> Cc: James Morris <jmorris@namei.org> Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> Cc: Patrick McHardy <kaber@trash.net> CC: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: Andrew Vagin <avagin@openvz.org> Acked-by: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-19tcp: flush DMA queue before sk_wait_data if rcv_wnd is zeroMichal Kubeček
If recv() syscall is called for a TCP socket so that - IOAT DMA is used - MSG_WAITALL flag is used - requested length is bigger than sk_rcvbuf - enough data has already arrived to bring rcv_wnd to zero then when tcp_recvmsg() gets to calling sk_wait_data(), receive window can be still zero while sk_async_wait_queue exhausts enough space to keep it zero. As this queue isn't cleaned until the tcp_service_net_dma() call, sk_wait_data() cannot receive any data and blocks forever. If zero receive window and non-empty sk_async_wait_queue is detected before calling sk_wait_data(), process the queue first. Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-18tcp: fix regression in urgent data handlingEric Dumazet
Stephan Springl found that commit 1402d366019fed "tcp: introduce tcp_try_coalesce" introduced a regression for rlogin It turns out problem comes from TCP urgent data handling and a change in behavior in input path. rlogin sends two one-byte packets with URG ptr set, and when next data frame is coalesced, we lack sk_data_ready() calls to wakeup consumer. Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Stephan Springl <springl-k@lar.bfw.de> Cc: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-18netns: move net->ipv4.rt_genid to net->rt_genidNicolas Dichtel
This commit prepares the use of rt_genid by both IPv4 and IPv6. Initialization is left in IPv4 part. Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-18net: rt_cache_flush() cleanupEric Dumazet
We dont use jhash anymore since route cache removal, so we can get rid of get_random_bytes() calls for rt_genid changes. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-18ipv4/route: arg delay is useless in rt_cache_flush()Nicolas Dichtel
Since route cache deletion (89aef8921bfbac22f), delay is no more used. Remove it. Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-09-07udp: increment UDP_MIB_INERRORS if copy failedEric Dumazet
In UDP recvmsg(), we miss an increase of UDP_MIB_INERRORS if the copy of skb to userspace failed for whatever reason. Reported-by: Shawn Bohrer <sbohrer@rgmadvisors.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-08-31Merge branch 'master' of git://1984.lsi.us.es/nfDavid S. Miller
2012-08-30ipv4: must use rcu protection while calling fib_lookupEric Dumazet
Following lockdep splat was reported by Pavel Roskin : [ 1570.586223] =============================== [ 1570.586225] [ INFO: suspicious RCU usage. ] [ 1570.586228] 3.6.0-rc3-wl-main #98 Not tainted [ 1570.586229] ------------------------------- [ 1570.586231] /home/proski/src/linux/net/ipv4/route.c:645 suspicious rcu_dereference_check() usage! [ 1570.586233] [ 1570.586233] other info that might help us debug this: [ 1570.586233] [ 1570.586236] [ 1570.586236] rcu_scheduler_active = 1, debug_locks = 0 [ 1570.586238] 2 locks held by Chrome_IOThread/4467: [ 1570.586240] #0: (slock-AF_INET){+.-...}, at: [<ffffffff814f2c0c>] release_sock+0x2c/0xa0 [ 1570.586253] #1: (fnhe_lock){+.-...}, at: [<ffffffff815302fc>] update_or_create_fnhe+0x2c/0x270 [ 1570.586260] [ 1570.586260] stack backtrace: [ 1570.586263] Pid: 4467, comm: Chrome_IOThread Not tainted 3.6.0-rc3-wl-main #98 [ 1570.586265] Call Trace: [ 1570.586271] [<ffffffff810976ed>] lockdep_rcu_suspicious+0xfd/0x130 [ 1570.586275] [<ffffffff8153042c>] update_or_create_fnhe+0x15c/0x270 [ 1570.586278] [<ffffffff815305b3>] __ip_rt_update_pmtu+0x73/0xb0 [ 1570.586282] [<ffffffff81530619>] ip_rt_update_pmtu+0x29/0x90 [ 1570.586285] [<ffffffff815411dc>] inet_csk_update_pmtu+0x2c/0x80 [ 1570.586290] [<ffffffff81558d1e>] tcp_v4_mtu_reduced+0x2e/0xc0 [ 1570.586293] [<ffffffff81553bc4>] tcp_release_cb+0xa4/0xb0 [ 1570.586296] [<ffffffff814f2c35>] release_sock+0x55/0xa0 [ 1570.586300] [<ffffffff815442ef>] tcp_sendmsg+0x4af/0xf50 [ 1570.586305] [<ffffffff8156fc60>] inet_sendmsg+0x120/0x230 [ 1570.586308] [<ffffffff8156fb40>] ? inet_sk_rebuild_header+0x40/0x40 [ 1570.586312] [<ffffffff814f4bdd>] ? sock_update_classid+0xbd/0x3b0 [ 1570.586315] [<ffffffff814f4c50>] ? sock_update_classid+0x130/0x3b0 [ 1570.586320] [<ffffffff814ec435>] do_sock_write+0xc5/0xe0 [ 1570.586323] [<ffffffff814ec4a3>] sock_aio_write+0x53/0x80 [ 1570.586328] [<ffffffff8114bc83>] do_sync_write+0xa3/0xe0 [ 1570.586332] [<ffffffff8114c5a5>] vfs_write+0x165/0x180 [ 1570.586335] [<ffffffff8114c805>] sys_write+0x45/0x90 [ 1570.586340] [<ffffffff815d2722>] system_call_fastpath+0x16/0x1b Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Pavel Roskin <proski@gnu.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-08-30net: ipv4: ipmr_expire_timer causes crash when removing net namespaceFrancesco Ruggeri
When tearing down a net namespace, ipv4 mr_table structures are freed without first deactivating their timers. This can result in a crash in run_timer_softirq. This patch mimics the corresponding behaviour in ipv6. Locking and synchronization seem to be adequate. We are about to kfree mrt, so existing code should already make sure that no other references to mrt are pending or can be created by incoming traffic. The functions invoked here do not cause new references to mrt or other race conditions to be created. Invoking del_timer_sync guarantees that ipmr_expire_timer is inactive. Both ipmr_expire_process (whose completion we may have to wait in del_timer_sync) and mroute_clean_tables internally use mfc_unres_lock or other synchronizations when needed, and they both only modify mrt. Tested in Linux 3.4.8. Signed-off-by: Francesco Ruggeri <fruggeri@aristanetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-08-30netfilter: nf_nat_sip: fix incorrect handling of EBUSY for RTCP expectationPablo Neira Ayuso
We're hitting bug while trying to reinsert an already existing expectation: kernel BUG at kernel/timer.c:895! invalid opcode: 0000 [#1] SMP [...] Call Trace: <IRQ> [<ffffffffa0069563>] nf_ct_expect_related_report+0x4a0/0x57a [nf_conntrack] [<ffffffff812d423a>] ? in4_pton+0x72/0x131 [<ffffffffa00ca69e>] ip_nat_sdp_media+0xeb/0x185 [nf_nat_sip] [<ffffffffa00b5b9b>] set_expected_rtp_rtcp+0x32d/0x39b [nf_conntrack_sip] [<ffffffffa00b5f15>] process_sdp+0x30c/0x3ec [nf_conntrack_sip] [<ffffffff8103f1eb>] ? irq_exit+0x9a/0x9c [<ffffffffa00ca738>] ? ip_nat_sdp_media+0x185/0x185 [nf_nat_sip] We have to remove the RTP expectation if the RTCP expectation hits EBUSY since we keep trying with other ports until we succeed. Reported-by: Rafal Fitt <rafalf@aplusc.com.pl> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2012-08-24tcp: fix cwnd reduction for non-sack recoveryYuchung Cheng
The cwnd reduction in fast recovery is based on the number of packets newly delivered per ACK. For non-sack connections every DUPACK signifies a packet has been delivered, but the sender mistakenly skips counting them for cwnd reduction. The fix is to compute newly_acked_sacked after DUPACKs are accounted in sacked_out for non-sack connections. Signed-off-by: Yuchung Cheng <ycheng@google.com> Acked-by: Nandita Dukkipati <nanditad@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-08-24ipv4: take rt_uncached_lock only if neededEric Dumazet
Multicast traffic allocates dst with DST_NOCACHE, but dst is not inserted into rt_uncached_list. This slowdown multicast workloads on SMP because rt_uncached_lock is contended. Change the test before taking the lock to actually check the dst was inserted into rt_uncached_list. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-08-22ipv4: properly update pmtuEric Dumazet
Sylvain Munault reported following info : - TCP connection get "stuck" with data in send queue when doing "large" transfers ( like typing 'ps ax' on a ssh connection ) - Only happens on path where the PMTU is lower than the MTU of the interface - Is not present right after boot, it only appears 10-20min after boot or so. (and that's inside the _same_ TCP connection, it works fine at first and then in the same ssh session, it'll get stuck) - Definitely seems related to fragments somehow since I see a router sending ICMP message saying fragmentation is needed. - Exact same setup works fine with kernel 3.5.1 Problem happens when the 10 minutes (ip_rt_mtu_expires) expiration period is over. ip_rt_update_pmtu() calls dst_set_expires() to rearm a new expiration, but dst_set_expires() does nothing because dst.expires is already set. It seems we want to set the expires field to a new value, regardless of prior one. With help from Julian Anastasov. Reported-by: Sylvain Munaut <s.munaut@whatever-company.com> Signed-off-by: Eric Dumazet <edumazet@google.com> CC: Julian Anastasov <ja@ssi.bg> Tested-by: Sylvain Munaut <s.munaut@whatever-company.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-08-21ipv4: fix ip header ident selection in __ip_make_skb()Eric Dumazet
Christian Casteyde reported a kmemcheck 32-bit read from uninitialized memory in __ip_select_ident(). It turns out that __ip_make_skb() called ip_select_ident() before properly initializing iph->daddr. This is a bug uncovered by commit 1d861aa4b3fb (inet: Minimize use of cached route inetpeer.) Addresses https://bugzilla.kernel.org/show_bug.cgi?id=46131 Reported-by: Christian Casteyde <casteyde.christian@free.fr> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-08-21ipv4: Use newinet->inet_opt in inet_csk_route_child_sock()Christoph Paasch
Since 0e734419923bd ("ipv4: Use inet_csk_route_child_sock() in DCCP and TCP."), inet_csk_route_child_sock() is called instead of inet_csk_route_req(). However, after creating the child-sock in tcp/dccp_v4_syn_recv_sock(), ireq->opt is set to NULL, before calling inet_csk_route_child_sock(). Thus, inside inet_csk_route_child_sock() opt is always NULL and the SRR-options are not respected anymore. Packets sent by the server won't have the correct destination-IP. This patch fixes it by accessing newinet->inet_opt instead of ireq->opt inside inet_csk_route_child_sock(). Reported-by: Luca Boccassi <luca.boccassi@gmail.com> Signed-off-by: Christoph Paasch <christoph.paasch@uclouvain.be> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-08-21tcp: fix possible socket refcount problemEric Dumazet
Commit 6f458dfb40 (tcp: improve latencies of timer triggered events) added bug leading to following trace : [ 2866.131281] IPv4: Attempt to release TCP socket in state 1 ffff880019ec0000 [ 2866.131726] [ 2866.132188] ========================= [ 2866.132281] [ BUG: held lock freed! ] [ 2866.132281] 3.6.0-rc1+ #622 Not tainted [ 2866.132281] ------------------------- [ 2866.132281] kworker/0:1/652 is freeing memory ffff880019ec0000-ffff880019ec0a1f, with a lock still held there! [ 2866.132281] (sk_lock-AF_INET-RPC){+.+...}, at: [<ffffffff81903619>] tcp_sendmsg+0x29/0xcc6 [ 2866.132281] 4 locks held by kworker/0:1/652: [ 2866.132281] #0: (rpciod){.+.+.+}, at: [<ffffffff81083567>] process_one_work+0x1de/0x47f [ 2866.132281] #1: ((&task->u.tk_work)){+.+.+.}, at: [<ffffffff81083567>] process_one_work+0x1de/0x47f [ 2866.132281] #2: (sk_lock-AF_INET-RPC){+.+...}, at: [<ffffffff81903619>] tcp_sendmsg+0x29/0xcc6 [ 2866.132281] #3: (&icsk->icsk_retransmit_timer){+.-...}, at: [<ffffffff81078017>] run_timer_softirq+0x1ad/0x35f [ 2866.132281] [ 2866.132281] stack backtrace: [ 2866.132281] Pid: 652, comm: kworker/0:1 Not tainted 3.6.0-rc1+ #622 [ 2866.132281] Call Trace: [ 2866.132281] <IRQ> [<ffffffff810bc527>] debug_check_no_locks_freed+0x112/0x159 [ 2866.132281] [<ffffffff818a0839>] ? __sk_free+0xfd/0x114 [ 2866.132281] [<ffffffff811549fa>] kmem_cache_free+0x6b/0x13a [ 2866.132281] [<ffffffff818a0839>] __sk_free+0xfd/0x114 [ 2866.132281] [<ffffffff818a08c0>] sk_free+0x1c/0x1e [ 2866.132281] [<ffffffff81911e1c>] tcp_write_timer+0x51/0x56 [ 2866.132281] [<ffffffff81078082>] run_timer_softirq+0x218/0x35f [ 2866.132281] [<ffffffff81078017>] ? run_timer_softirq+0x1ad/0x35f [ 2866.132281] [<ffffffff810f5831>] ? rb_commit+0x58/0x85 [ 2866.132281] [<ffffffff81911dcb>] ? tcp_write_timer_handler+0x148/0x148 [ 2866.132281] [<ffffffff81070bd6>] __do_softirq+0xcb/0x1f9 [ 2866.132281] [<ffffffff81a0a00c>] ? _raw_spin_unlock+0x29/0x2e [ 2866.132281] [<ffffffff81a1227c>] call_softirq+0x1c/0x30 [ 2866.132281] [<ffffffff81039f38>] do_softirq+0x4a/0xa6 [ 2866.132281] [<ffffffff81070f2b>] irq_exit+0x51/0xad [ 2866.132281] [<ffffffff81a129cd>] do_IRQ+0x9d/0xb4 [ 2866.132281] [<ffffffff81a0a3ef>] common_interrupt+0x6f/0x6f [ 2866.132281] <EOI> [<ffffffff8109d006>] ? sched_clock_cpu+0x58/0xd1 [ 2866.132281] [<ffffffff81a0a172>] ? _raw_spin_unlock_irqrestore+0x4c/0x56 [ 2866.132281] [<ffffffff81078692>] mod_timer+0x178/0x1a9 [ 2866.132281] [<ffffffff818a00aa>] sk_reset_timer+0x19/0x26 [ 2866.132281] [<ffffffff8190b2cc>] tcp_rearm_rto+0x99/0xa4 [ 2866.132281] [<ffffffff8190dfba>] tcp_event_new_data_sent+0x6e/0x70 [ 2866.132281] [<ffffffff8190f7ea>] tcp_write_xmit+0x7de/0x8e4 [ 2866.132281] [<ffffffff818a565d>] ? __alloc_skb+0xa0/0x1a1 [ 2866.132281] [<ffffffff8190f952>] __tcp_push_pending_frames+0x2e/0x8a [ 2866.132281] [<ffffffff81904122>] tcp_sendmsg+0xb32/0xcc6 [ 2866.132281] [<ffffffff819229c2>] inet_sendmsg+0xaa/0xd5 [ 2866.132281] [<ffffffff81922918>] ? inet_autobind+0x5f/0x5f [ 2866.132281] [<ffffffff810ee7f1>] ? trace_clock_local+0x9/0xb [ 2866.132281] [<ffffffff8189adab>] sock_sendmsg+0xa3/0xc4 [ 2866.132281] [<ffffffff810f5de6>] ? rb_reserve_next_event+0x26f/0x2d5 [ 2866.132281] [<ffffffff8103e6a9>] ? native_sched_clock+0x29/0x6f [ 2866.132281] [<ffffffff8103e6f8>] ? sched_clock+0x9/0xd [ 2866.132281] [<ffffffff810ee7f1>] ? trace_clock_local+0x9/0xb [ 2866.132281] [<ffffffff8189ae03>] kernel_sendmsg+0x37/0x43 [ 2866.132281] [<ffffffff8199ce49>] xs_send_kvec+0x77/0x80 [ 2866.132281] [<ffffffff8199cec1>] xs_sendpages+0x6f/0x1a0 [ 2866.132281] [<ffffffff8107826d>] ? try_to_del_timer_sync+0x55/0x61 [ 2866.132281] [<ffffffff8199d0d2>] xs_tcp_send_request+0x55/0xf1 [ 2866.132281] [<ffffffff8199bb90>] xprt_transmit+0x89/0x1db [ 2866.132281] [<ffffffff81999bcd>] ? call_connect+0x3c/0x3c [ 2866.132281] [<ffffffff81999d92>] call_transmit+0x1c5/0x20e [ 2866.132281] [<ffffffff819a0d55>] __rpc_execute+0x6f/0x225 [ 2866.132281] [<ffffffff81999bcd>] ? call_connect+0x3c/0x3c [ 2866.132281] [<ffffffff819a0f33>] rpc_async_schedule+0x28/0x34 [ 2866.132281] [<ffffffff810835d6>] process_one_work+0x24d/0x47f [ 2866.132281] [<ffffffff81083567>] ? process_one_work+0x1de/0x47f [ 2866.132281] [<ffffffff819a0f0b>] ? __rpc_execute+0x225/0x225 [ 2866.132281] [<ffffffff81083a6d>] worker_thread+0x236/0x317 [ 2866.132281] [<ffffffff81083837>] ? process_scheduled_works+0x2f/0x2f [ 2866.132281] [<ffffffff8108b7b8>] kthread+0x9a/0xa2 [ 2866.132281] [<ffffffff81a12184>] kernel_thread_helper+0x4/0x10 [ 2866.132281] [<ffffffff81a0a4b0>] ? retint_restore_args+0x13/0x13 [ 2866.132281] [<ffffffff8108b71e>] ? __init_kthread_worker+0x5a/0x5a [ 2866.132281] [<ffffffff81a12180>] ? gs_change+0x13/0x13 [ 2866.308506] IPv4: Attempt to release TCP socket in state 1 ffff880019ec0000 [ 2866.309689] ============================================================================= [ 2866.310254] BUG TCP (Not tainted): Object already free [ 2866.310254] ----------------------------------------------------------------------------- [ 2866.310254] The bug comes from the fact that timer set in sk_reset_timer() can run before we actually do the sock_hold(). socket refcount reaches zero and we free the socket too soon. timer handler is not allowed to reduce socket refcnt if socket is owned by the user, or we need to change sk_reset_timer() implementation. We should take a reference on the socket in case TCP_DELACK_TIMER_DEFERRED or TCP_DELACK_TIMER_DEFERRED bit are set in tsq_flags Also fix a typo in tcp_delack_timer(), where TCP_WRITE_TIMER_DEFERRED was used instead of TCP_DELACK_TIMER_DEFERRED. For consistency, use same socket refcount change for TCP_MTU_REDUCED_DEFERRED, even if not fired from a timer. Reported-by: Fengguang Wu <fengguang.wu@intel.com> Tested-by: Fengguang Wu <fengguang.wu@intel.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-08-20net: tcp: move sk_rx_dst_set call after tcp_create_openreq_child()Neal Cardwell
This commit removes the sk_rx_dst_set calls from tcp_create_openreq_child(), because at that point the icsk_af_ops field of ipv6_mapped TCP sockets has not been set to its proper final value. Instead, to make sure we get the right sk_rx_dst_set variant appropriate for the address family of the new connection, we have tcp_v{4,6}_syn_recv_sock() directly call the appropriate function shortly after the call to tcp_create_openreq_child() returns. This also moves inet6_sk_rx_dst_set() to avoid a forward declaration with the new approach. Signed-off-by: Neal Cardwell <ncardwell@google.com> Reported-by: Artem Savkov <artem.savkov@gmail.com> Cc: Eric Dumazet <edumazet@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-08-20Merge git://1984.lsi.us.es/nfDavid S. Miller
Pable Neira Ayuso says: ==================== The following five patches contain fixes for 3.6-rc, they are: * Two fixes for message parsing in the SIP conntrack helper, from Patrick McHardy. * One fix for the SIP helper introduced in the user-space cthelper infrastructure, from Patrick McHardy. * fix missing appropriate locking while modifying one conntrack entry from the nfqueue integration code, from myself. * fix possible access to uninitiliazed timer in the nf_conntrack expectation infrastructure, from myself. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2012-08-14ipv4: Cache local output routesYan, Zheng
Commit caacf05e5ad1abf causes big drop of UDP loop back performance. The cause of the regression is that we do not cache the local output routes. Each time we send a datagram from unconnected UDP socket, the kernel allocates a dst_entry and adds it to the rt_uncached_list. It creates lock contention on the rt_uncached_lock. Reported-by: Alex Shi <alex.shi@intel.com> Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>