Age | Commit message (Collapse) | Author |
|
[ Upstream commit 5e1f54201cb481f40a04bc47e1bc8c093a189e23 ]
Add logic to verify that a port comparison byte code operation
actually has the second inet_diag_bc_op from which we read the port
for such operations.
Previously the code blindly referenced op[1] without first checking
whether a second inet_diag_bc_op struct could fit there. So a
malicious user could make the kernel read 4 bytes beyond the end of
the bytecode array by claiming to have a whole port comparison byte
code (2 inet_diag_bc_op structs) when in fact the bytecode was not
long enough to hold both.
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit f67caec9068cee426ec23cf9005a1dee2ecad187 ]
Add logic to check the address family of the user-supplied conditional
and the address family of the connection entry. We now do not do
prefix matching of addresses from different address families (AF_INET
vs AF_INET6), except for the previously existing support for having an
IPv4 prefix match an IPv4-mapped IPv6 address (which this commit
maintains as-is).
This change is needed for two reasons:
(1) The addresses are different lengths, so comparing a 128-bit IPv6
prefix match condition to a 32-bit IPv4 connection address can cause
us to unwittingly walk off the end of the IPv4 address and read
garbage or oops.
(2) The IPv4 and IPv6 address spaces are semantically distinct, so a
simple bit-wise comparison of the prefixes is not meaningful, and
would lead to bogus results (except for the IPv4-mapped IPv6 case,
which this commit maintains).
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 405c005949e47b6e91359159c24753519ded0c67 ]
Add logic to validate INET_DIAG_BC_S_COND and INET_DIAG_BC_D_COND
operations.
Previously we did not validate the inet_diag_hostcond, address family,
address length, and prefix length. So a malicious user could make the
kernel read beyond the end of the bytecode array by claiming to have a
whole inet_diag_hostcond when the bytecode was not long enough to
contain a whole inet_diag_hostcond of the given address family. Or
they could make the kernel read up to about 27 bytes beyond the end of
a connection address by passing a prefix length that exceeded the
length of addresses of the given family.
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 1c95df85ca49640576de2f0a850925957b547b84 ]
Fix inet_diag to be aware of the fact that AF_INET6 TCP connections
instantiated for IPv4 traffic and in the SYN-RECV state were actually
created with inet_reqsk_alloc(), instead of inet6_reqsk_alloc(). This
means that for such connections inet6_rsk(req) returns a pointer to a
random spot in memory up to roughly 64KB beyond the end of the
request_sock.
With this bug, for a server using AF_INET6 TCP sockets and serving
IPv4 traffic, an inet_diag user like `ss state SYN-RECV` would lead to
inet_diag_fill_req() causing an oops or the export to user space of 16
bytes of kernel memory as a garbage IPv6 address, depending on where
the garbage inet6_rsk(req) pointed.
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 1bf3751ec90cc3174e01f0d701e8449ce163d113 ]
ip_check_defrag() might be called from af_packet within the
RX path where shared SKBs are used, so it must not modify
the input SKB before it has unshared it for defragmentation.
Use skb_copy_bits() to get the IP header and only pull in
everything later.
The same is true for the other caller in macvlan as it is
called from dev->rx_handler which can also get a shared SKB.
Reported-by: Eric Leblond <eric@regit.org>
Cc: stable@vger.kernel.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit e1a676424c290b1c8d757e3860170ac7ecd89af4 ]
inet_getpeer_v4() can return NULL under OOM conditions, and while
inet_peer_xrlim_allow() is OK with a NULL peer, inet_putpeer() will
crash.
This code path now uses the same idiom as the others from:
1d861aa4b3fb08822055345f480850205ffe6170 ("inet: Minimize use of
cached route inetpeer.").
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 636174219b52b5a8bc51bc23bbcba97cd30a65e3 ]
Starting from 3.6 we cache output routes for
multicasts only when using route to 224/4. For local receivers
we can set RTCF_LOCAL flag depending on the membership but
in such case we use maddr and saddr which are not caching
keys as before. Additionally, we can not use same place to
cache routes that differ in RTCF_LOCAL flag value.
Fix it by caching only RTCF_MULTICAST entries
without RTCF_LOCAL (send-only, no loopback). As a side effect,
we avoid unneeded lookup for fnhe when not caching because
multicasts are not redirected and they do not learn PMTU.
Thanks to Maxime Bizon for showing the caching
problems in __mkroute_output for 3.6 kernels: different
RTCF_LOCAL flag in cache can lead to wrong ip_mc_output or
ip_output call and the visible problem is that traffic can
not reach local receivers via loopback.
Reported-by: Maxime Bizon <mbizon@freebox.fr>
Tested-by: Maxime Bizon <mbizon@freebox.fr>
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
commit 38fe36a248ec3228f8e6507955d7ceb0432d2000 upstream.
ICMP tuples have id in src and type/code in dst.
So comparing src.u.all with dst.u.all will always fail here
and ip_xfrm_me_harder() is called for every ICMP packet,
even if there was no NAT.
Signed-off-by: Ulrich Weber <ulrich.weber@sophos.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 976a702ac9eeacea09e588456ab165dc06f9ee83 ]
order-5 allocations can fail with current kernels, we should
try vmalloc() as well.
Reported-by: Julien Tinnes <jln@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit ec34232575083fd0f43d3a101e8ebb041b203761 ]
Currently if a socket was repaired with a few packet in a write queue,
a kernel bug may be triggered:
kernel BUG at net/ipv4/tcp_output.c:2330!
RIP: 0010:[<ffffffff8155784f>] tcp_retransmit_skb+0x5ff/0x610
According to the initial realization v3.4-rc2-963-gc0e88ff,
all skb-s should look like already posted. This patch fixes code
according with this sentence.
Here are three points, which were not done in the initial patch:
1. A tcp send head should not be changed
2. Initialize TSO state of a skb
3. Reset the retransmission time
This patch moves logic from tcp_sendmsg to tcp_write_xmit. A packet
passes the ussual way, but isn't sent to network. This patch solves
all described problems and handles tcp_sendpages.
Signed-off-by: Andrey Vagin <avagin@openvz.org>
Cc: Pavel Emelyanov <xemul@parallels.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: James Morris <jmorris@namei.org>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Patrick McHardy <kaber@trash.net>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit b2942004fb5c9f3304b77e187b8a1977b3626c9b ]
With the latest kernel there are two things that must be done post decryption
so that the packet are forwarded.
1. Remove the mark from the packet. This will cause the packet to not match
the ipsec-policy again. However doing this causes the post-decryption check to
fail also and the packet will get dropped. (cat /proc/net/xfrm_stat).
2. Remove the sp association in the skbuff so that no policy check is done on
the packet for VTI tunnels.
Due to #2 above we must now do a security-policy check in the vti rcv path
prior to resetting the mark in the skbuff.
Signed-off-by: Saurabh Mohan <saurabh.mohan@vyatta.com>
Reported-by: Ruben Herold <ruben@puettmann.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 0c9f79be295c99ac7e4b569ca493d75fdcc19e4e ]
(1<<optname) is undefined behavior in C with a negative optname or
optname larger than 31. In those cases the result of the shift is
not necessarily zero (e.g., on x86).
This patch simplifies the code with a switch statement on optname.
It also allows the compiler to generate better code (e.g., using a
64-bit mask).
Signed-off-by: Xi Wang <xi.wang@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit cacb6ba0f36ab14a507f4ee7697e8332899015d2 ]
We've observed that in case if UDP diag module is not
supported in kernel the netlink returns NLMSG_DONE without
notifying a caller that handler is missed.
This patch makes __inet_diag_dump to return error code instead.
So as example it become possible to detect such situation
and handle it gracefully on userspace level.
Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
CC: David Miller <davem@davemloft.net>
CC: Eric Dumazet <eric.dumazet@gmail.com>
CC: Pavel Emelyanov <xemul@parallels.com>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit c454e6111d1ef4268fe98e87087216e51c2718c3 ]
When sending data into a tcp socket in repair state we should check
for the amount of data being 0 explicitly. Otherwise we'll have an skb
with seq == end_seq in rcv queue, but tcp doesn't expect this to happen
(in particular a warn_on in tcp_recvmsg shoots).
Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
Reported-by: Giorgos Mavrikas <gmavrikas@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 8f363b77ee4fbf7c3bbcf5ec2c5ca482d396d664 ]
Reading TCP stats when using TCP Illinois congestion control algorithm
can cause a divide by zero kernel oops.
The division by zero occur in tcp_illinois_info() at:
do_div(t, ca->cnt_rtt);
where ca->cnt_rtt can become zero (when rtt_reset is called)
Steps to Reproduce:
1. Register tcp_illinois:
# sysctl -w net.ipv4.tcp_congestion_control=illinois
2. Monitor internal TCP information via command "ss -i"
# watch -d ss -i
3. Establish new TCP conn to machine
Either it fails at the initial conn, or else it needs to wait
for a loss or a reset.
This is only related to reading stats. The function avg_delay() also
performs the same divide, but is guarded with a (ca->cnt_rtt > 0) at its
calling point in update_params(). Thus, simply fix tcp_illinois_info().
Function tcp_illinois_info() / get_info() is called without
socket lock. Thus, eliminate any race condition on ca->cnt_rtt
by using a local stack variable. Simply reuse info.tcpv_rttcnt,
as its already set to ca->cnt_rtt.
Function avg_delay() is not affected by this race condition, as
its called with the socket lock.
Cc: Petr Matousek <pmatouse@redhat.com>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit a3374c42aa5f7237e87ff3b0622018636b0c847e ]
tcp_ioctl() tries to take into account if tcp socket received a FIN
to report correct number bytes in receive queue.
But its flaky because if the application ate the last skb,
we return 1 instead of 0.
Correct way to detect that FIN was received is to test SOCK_DONE.
Reported-by: Elliot Hughes <enh@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Tom Herbert <therbert@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 13d82bf50dce632355fcccafa4fe44a9b5e706d8 ]
Currently we can not flush cached pmtu/redirect informations via
the ipv4_sysctl_rtcache_flush sysctl. We need to check the rt_genid
of the old route and reset the nh exeption if the old route is
expired when we bind a new route to a nh exeption.
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 4c67525849e0b7f4bd4fab2487ec9e43ea52ef29 ]
After commit e2446eaa ("tcp_v4_send_reset: binding oif to iif in no
sock case").. tcp resets are always lost, when routing is asymmetric.
Yes, backing out that patch will result in misrouting of resets for
dead connections which used interface binding when were alive, but we
actually cannot do anything here. What's died that's died and correct
handling normal unbound connections is obviously a priority.
Comment to comment:
> This has few benefits:
> 1. tcp_v6_send_reset already did that.
It was done to route resets for IPv6 link local addresses. It was a
mistake to do so for global addresses. The patch fixes this as well.
Actually, the problem appears to be even more serious than guaranteed
loss of resets. As reported by Sergey Soloviev <sol@eqv.ru>, those
misrouted resets create a lot of arp traffic and huge amount of
unresolved arp entires putting down to knees NAT firewalls which use
asymmetric routing.
Signed-off-by: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit c92b96553a80c1dbe2ebe128bbe37c8f98f148bf ]
Add flag to request that output route should be
returned with known rt_gateway, in case we want to use
it as nexthop for neighbour resolving.
The returned route can be cached as follows:
- in NH exception: because the cached routes are not shared
with other destinations
- in FIB NH: when using gateway because all destinations for
NH share same gateway
As last option, to return rt_gateway!=0 we have to
set DST_NOCACHE.
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit 155e8336c373d14d87a7f91e356d85ef4b93b8f9 ]
Add new flag to remember when route is via gateway.
We will use it to allow rt_gateway to contain address of
directly connected host for the cases when DST_NOCACHE is
used or when the NH exception caches per-destination route
without DST_NOCACHE flag, i.e. when routes are not used for
other destinations. By this way we force the neighbour
resolving to work with the routed destination but we
can use different address in the packet, feature needed
for IPVS-DR where original packet for virtual IP is routed
via route to real IP.
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit f8a17175c63fd3e8b573719f7538816f8c96abf4 ]
Avoid checking nh_pcpu_rth_output in fast path,
abort fib_info creation on alloc_percpu failure.
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit e0adef0f7456d5d3a3bfe8ea61c7dddf146b40e1 ]
After the change "Adjust semantics of rt->rt_gateway"
(commit f8126f1d51) rt_gateway can be 0 but ip_forward() compares
it directly with nexthop. What we want here is to check if traffic
is to directly connected nexthop and to fail if using gateway.
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit e81da0e113a1b7fc7449ae6213f65f89ccac6d06 ]
After "Cache input routes in fib_info nexthops" (commit
d2d68ba9fe) and "Elide fib_validate_source() completely when possible"
(commit 7a9bc9b81a) we can not send ICMP redirects. It seems we
should not cache the RTCF_DOREDIRECT flag in nh_rth_input because
the same fib_info can be used for traffic that is not redirected,
eg. from other input devices or from sources that are not in same subnet.
As result, we have to disable the caching of RTCF_DOREDIRECT
flag and to force source validation for the case when forwarding
traffic to the input device. If traffic comes from directly connected
source we allow redirection as it was done before both changes.
Avoid setting RTCF_DOREDIRECT if IN_DEV_TX_REDIRECTS
is disabled, this can avoid source address validation and to
help caching the routes.
After the change "Adjust semantics of rt->rt_gateway"
(commit f8126f1d51) we should make sure our ICMP_REDIR_HOST messages
contain daddr instead of 0.0.0.0 when target is directly connected.
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit ee9a8f7ab2edf801b8b514c310455c94acc232f6 ]
We report cached pmtu values even if they are already expired.
Change this to not report these values after they are expired
and fix a race in the expire time calculation, as suggested by
Eric Dumazet.
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
reported pmtu
[ Upstream commit 7f92d334ba19a0d8e96f8f8f092219553367d921 ]
When a local tool like tracepath tries to send packets bigger than
the device mtu, we create a nh exeption and set the pmtu to device
mtu. The device mtu does not expire, so check if the device mtu is
smaller than the reported pmtu and don't crerate a nh exeption in
that case.
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit d851c12b60471188e15e5c8405b289073e8dd025 ]
Some protocols, like IPsec still cache routes. So we need to invalidate
the old route on pmtu events to avoid the reuse of stale routes.
We also need to update the mtu and expire time of the route if we already
use a nh exception route, otherwise we ignore newly learned pmtu values
after the first expiration.
With this patch we always invalidate or update the route on pmtu events.
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
[ Upstream commit f4ef85bbda96324785097356336bc79cdd37db0a ]
commit d2d68ba9fe8 (ipv4: Cache input routes in fib_info nexthops.)
introduced a regression for forwarding.
This was hard to reproduce but the symptom was that packets were
delivered to local host instead of being forwarded.
David suggested to add fib_type to fib_info so that we dont
inadvertently share same fib_info for different purposes.
With help from Julian Anastasov who provided very helpful
hints, reproduced here :
<quote>
Can it be a problem related to fib_info reuse
from different routes. For example, when local IP address
is created for subnet we have:
broadcast 192.168.0.255 dev DEV proto kernel scope link src
192.168.0.1
192.168.0.0/24 dev DEV proto kernel scope link src 192.168.0.1
local 192.168.0.1 dev DEV proto kernel scope host src 192.168.0.1
The "dev DEV proto kernel scope link src 192.168.0.1" is
a reused fib_info structure where we put cached routes.
The result can be same fib_info for 192.168.0.255 and
192.168.0.0/24. RTN_BROADCAST is cached only for input
routes. Incoming broadcast to 192.168.0.255 can be cached
and can cause problems for traffic forwarded to 192.168.0.0/24.
So, this patch should solve the problem because it
separates the broadcast from unicast traffic.
And the ip_route_input_slow caching will work for
local and broadcast input routes (above routes 1 and 3) just
because they differ in scope and use different fib_info.
</quote>
Many thanks to Chris Clayton for his patience and help.
Reported-by: Chris Clayton <chris2553@googlemail.com>
Bisected-by: Chris Clayton <chris2553@googlemail.com>
Reported-by: Dave Jones <davej@redhat.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Julian Anastasov <ja@ssi.bg>
Tested-by: Chris Clayton <chris2553@googlemail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
|
When jiffies wraps around (for example, 5 minutes after the boot, see
INITIAL_JIFFIES) and peer has just been created, now - peer->rate_last can be
< XRLIM_BURST_FACTOR * timeout, so token is not set to the maximum value, thus
some icmp packets can be unexpectedly dropped.
Fix this case by initializing last_rate to 60 seconds in the past.
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
icmp_filter() should not modify its input, or else its caller
would need to recompute ip_hdr() if skb->head is reallocated.
Use skb_header_pointer() instead of pskb_may_pull() and
change the prototype to make clear both sk and skb are const.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Change return value from -EACCES to -EPERM when the permission check fails.
Signed-off-by: Zhao Hongjiang <zhaohongjiang@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
rcv_wscale is a symetric parameter with snd_wscale.
Both this parameters are set on a connection handshake.
Without this value a remote window size can not be interpreted correctly,
because a value from a packet should be shifted on rcv_wscale.
And one more thing is that wscale_ok should be set too.
This patch doesn't break a backward compatibility.
If someone uses it in a old scheme, a rcv window
will be restored with the same bug (rcv_wscale = 0).
v2: Save backward compatibility on big-endian system. Before
the first two bytes were snd_wscale and the second two bytes were
rcv_wscale. Now snd_wscale is opt_val & 0xFFFF and rcv_wscale >> 16.
This approach is independent on byte ordering.
Cc: David S. Miller <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: James Morris <jmorris@namei.org>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Patrick McHardy <kaber@trash.net>
CC: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: Andrew Vagin <avagin@openvz.org>
Acked-by: Pavel Emelyanov <xemul@parallels.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
If recv() syscall is called for a TCP socket so that
- IOAT DMA is used
- MSG_WAITALL flag is used
- requested length is bigger than sk_rcvbuf
- enough data has already arrived to bring rcv_wnd to zero
then when tcp_recvmsg() gets to calling sk_wait_data(), receive
window can be still zero while sk_async_wait_queue exhausts
enough space to keep it zero. As this queue isn't cleaned until
the tcp_service_net_dma() call, sk_wait_data() cannot receive
any data and blocks forever.
If zero receive window and non-empty sk_async_wait_queue is
detected before calling sk_wait_data(), process the queue first.
Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Stephan Springl found that commit 1402d366019fed "tcp: introduce
tcp_try_coalesce" introduced a regression for rlogin
It turns out problem comes from TCP urgent data handling and
a change in behavior in input path.
rlogin sends two one-byte packets with URG ptr set, and when next data
frame is coalesced, we lack sk_data_ready() calls to wakeup consumer.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Stephan Springl <springl-k@lar.bfw.de>
Cc: Alexander Duyck <alexander.h.duyck@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This commit prepares the use of rt_genid by both IPv4 and IPv6.
Initialization is left in IPv4 part.
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
We dont use jhash anymore since route cache removal,
so we can get rid of get_random_bytes() calls for rt_genid
changes.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Since route cache deletion (89aef8921bfbac22f), delay is no
more used. Remove it.
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
In UDP recvmsg(), we miss an increase of UDP_MIB_INERRORS if the copy
of skb to userspace failed for whatever reason.
Reported-by: Shawn Bohrer <sbohrer@rgmadvisors.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
|
|
Following lockdep splat was reported by Pavel Roskin :
[ 1570.586223] ===============================
[ 1570.586225] [ INFO: suspicious RCU usage. ]
[ 1570.586228] 3.6.0-rc3-wl-main #98 Not tainted
[ 1570.586229] -------------------------------
[ 1570.586231] /home/proski/src/linux/net/ipv4/route.c:645 suspicious rcu_dereference_check() usage!
[ 1570.586233]
[ 1570.586233] other info that might help us debug this:
[ 1570.586233]
[ 1570.586236]
[ 1570.586236] rcu_scheduler_active = 1, debug_locks = 0
[ 1570.586238] 2 locks held by Chrome_IOThread/4467:
[ 1570.586240] #0: (slock-AF_INET){+.-...}, at: [<ffffffff814f2c0c>] release_sock+0x2c/0xa0
[ 1570.586253] #1: (fnhe_lock){+.-...}, at: [<ffffffff815302fc>] update_or_create_fnhe+0x2c/0x270
[ 1570.586260]
[ 1570.586260] stack backtrace:
[ 1570.586263] Pid: 4467, comm: Chrome_IOThread Not tainted 3.6.0-rc3-wl-main #98
[ 1570.586265] Call Trace:
[ 1570.586271] [<ffffffff810976ed>] lockdep_rcu_suspicious+0xfd/0x130
[ 1570.586275] [<ffffffff8153042c>] update_or_create_fnhe+0x15c/0x270
[ 1570.586278] [<ffffffff815305b3>] __ip_rt_update_pmtu+0x73/0xb0
[ 1570.586282] [<ffffffff81530619>] ip_rt_update_pmtu+0x29/0x90
[ 1570.586285] [<ffffffff815411dc>] inet_csk_update_pmtu+0x2c/0x80
[ 1570.586290] [<ffffffff81558d1e>] tcp_v4_mtu_reduced+0x2e/0xc0
[ 1570.586293] [<ffffffff81553bc4>] tcp_release_cb+0xa4/0xb0
[ 1570.586296] [<ffffffff814f2c35>] release_sock+0x55/0xa0
[ 1570.586300] [<ffffffff815442ef>] tcp_sendmsg+0x4af/0xf50
[ 1570.586305] [<ffffffff8156fc60>] inet_sendmsg+0x120/0x230
[ 1570.586308] [<ffffffff8156fb40>] ? inet_sk_rebuild_header+0x40/0x40
[ 1570.586312] [<ffffffff814f4bdd>] ? sock_update_classid+0xbd/0x3b0
[ 1570.586315] [<ffffffff814f4c50>] ? sock_update_classid+0x130/0x3b0
[ 1570.586320] [<ffffffff814ec435>] do_sock_write+0xc5/0xe0
[ 1570.586323] [<ffffffff814ec4a3>] sock_aio_write+0x53/0x80
[ 1570.586328] [<ffffffff8114bc83>] do_sync_write+0xa3/0xe0
[ 1570.586332] [<ffffffff8114c5a5>] vfs_write+0x165/0x180
[ 1570.586335] [<ffffffff8114c805>] sys_write+0x45/0x90
[ 1570.586340] [<ffffffff815d2722>] system_call_fastpath+0x16/0x1b
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Pavel Roskin <proski@gnu.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
When tearing down a net namespace, ipv4 mr_table structures are freed
without first deactivating their timers. This can result in a crash in
run_timer_softirq.
This patch mimics the corresponding behaviour in ipv6.
Locking and synchronization seem to be adequate.
We are about to kfree mrt, so existing code should already make sure that
no other references to mrt are pending or can be created by incoming traffic.
The functions invoked here do not cause new references to mrt or other
race conditions to be created.
Invoking del_timer_sync guarantees that ipmr_expire_timer is inactive.
Both ipmr_expire_process (whose completion we may have to wait in
del_timer_sync) and mroute_clean_tables internally use mfc_unres_lock
or other synchronizations when needed, and they both only modify mrt.
Tested in Linux 3.4.8.
Signed-off-by: Francesco Ruggeri <fruggeri@aristanetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
We're hitting bug while trying to reinsert an already existing
expectation:
kernel BUG at kernel/timer.c:895!
invalid opcode: 0000 [#1] SMP
[...]
Call Trace:
<IRQ>
[<ffffffffa0069563>] nf_ct_expect_related_report+0x4a0/0x57a [nf_conntrack]
[<ffffffff812d423a>] ? in4_pton+0x72/0x131
[<ffffffffa00ca69e>] ip_nat_sdp_media+0xeb/0x185 [nf_nat_sip]
[<ffffffffa00b5b9b>] set_expected_rtp_rtcp+0x32d/0x39b [nf_conntrack_sip]
[<ffffffffa00b5f15>] process_sdp+0x30c/0x3ec [nf_conntrack_sip]
[<ffffffff8103f1eb>] ? irq_exit+0x9a/0x9c
[<ffffffffa00ca738>] ? ip_nat_sdp_media+0x185/0x185 [nf_nat_sip]
We have to remove the RTP expectation if the RTCP expectation hits EBUSY
since we keep trying with other ports until we succeed.
Reported-by: Rafal Fitt <rafalf@aplusc.com.pl>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
The cwnd reduction in fast recovery is based on the number of packets
newly delivered per ACK. For non-sack connections every DUPACK
signifies a packet has been delivered, but the sender mistakenly
skips counting them for cwnd reduction.
The fix is to compute newly_acked_sacked after DUPACKs are accounted
in sacked_out for non-sack connections.
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Nandita Dukkipati <nanditad@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Multicast traffic allocates dst with DST_NOCACHE, but dst is
not inserted into rt_uncached_list.
This slowdown multicast workloads on SMP because rt_uncached_lock is
contended.
Change the test before taking the lock to actually check the dst
was inserted into rt_uncached_list.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Sylvain Munault reported following info :
- TCP connection get "stuck" with data in send queue when doing
"large" transfers ( like typing 'ps ax' on a ssh connection )
- Only happens on path where the PMTU is lower than the MTU of
the interface
- Is not present right after boot, it only appears 10-20min after
boot or so. (and that's inside the _same_ TCP connection, it works
fine at first and then in the same ssh session, it'll get stuck)
- Definitely seems related to fragments somehow since I see a router
sending ICMP message saying fragmentation is needed.
- Exact same setup works fine with kernel 3.5.1
Problem happens when the 10 minutes (ip_rt_mtu_expires) expiration
period is over.
ip_rt_update_pmtu() calls dst_set_expires() to rearm a new expiration,
but dst_set_expires() does nothing because dst.expires is already set.
It seems we want to set the expires field to a new value, regardless
of prior one.
With help from Julian Anastasov.
Reported-by: Sylvain Munaut <s.munaut@whatever-company.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
CC: Julian Anastasov <ja@ssi.bg>
Tested-by: Sylvain Munaut <s.munaut@whatever-company.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Christian Casteyde reported a kmemcheck 32-bit read from uninitialized
memory in __ip_select_ident().
It turns out that __ip_make_skb() called ip_select_ident() before
properly initializing iph->daddr.
This is a bug uncovered by commit 1d861aa4b3fb (inet: Minimize use of
cached route inetpeer.)
Addresses https://bugzilla.kernel.org/show_bug.cgi?id=46131
Reported-by: Christian Casteyde <casteyde.christian@free.fr>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Stephen Hemminger <shemminger@vyatta.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Since 0e734419923bd ("ipv4: Use inet_csk_route_child_sock() in DCCP and
TCP."), inet_csk_route_child_sock() is called instead of
inet_csk_route_req().
However, after creating the child-sock in tcp/dccp_v4_syn_recv_sock(),
ireq->opt is set to NULL, before calling inet_csk_route_child_sock().
Thus, inside inet_csk_route_child_sock() opt is always NULL and the
SRR-options are not respected anymore.
Packets sent by the server won't have the correct destination-IP.
This patch fixes it by accessing newinet->inet_opt instead of ireq->opt
inside inet_csk_route_child_sock().
Reported-by: Luca Boccassi <luca.boccassi@gmail.com>
Signed-off-by: Christoph Paasch <christoph.paasch@uclouvain.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Commit 6f458dfb40 (tcp: improve latencies of timer triggered events)
added bug leading to following trace :
[ 2866.131281] IPv4: Attempt to release TCP socket in state 1 ffff880019ec0000
[ 2866.131726]
[ 2866.132188] =========================
[ 2866.132281] [ BUG: held lock freed! ]
[ 2866.132281] 3.6.0-rc1+ #622 Not tainted
[ 2866.132281] -------------------------
[ 2866.132281] kworker/0:1/652 is freeing memory ffff880019ec0000-ffff880019ec0a1f, with a lock still held there!
[ 2866.132281] (sk_lock-AF_INET-RPC){+.+...}, at: [<ffffffff81903619>] tcp_sendmsg+0x29/0xcc6
[ 2866.132281] 4 locks held by kworker/0:1/652:
[ 2866.132281] #0: (rpciod){.+.+.+}, at: [<ffffffff81083567>] process_one_work+0x1de/0x47f
[ 2866.132281] #1: ((&task->u.tk_work)){+.+.+.}, at: [<ffffffff81083567>] process_one_work+0x1de/0x47f
[ 2866.132281] #2: (sk_lock-AF_INET-RPC){+.+...}, at: [<ffffffff81903619>] tcp_sendmsg+0x29/0xcc6
[ 2866.132281] #3: (&icsk->icsk_retransmit_timer){+.-...}, at: [<ffffffff81078017>] run_timer_softirq+0x1ad/0x35f
[ 2866.132281]
[ 2866.132281] stack backtrace:
[ 2866.132281] Pid: 652, comm: kworker/0:1 Not tainted 3.6.0-rc1+ #622
[ 2866.132281] Call Trace:
[ 2866.132281] <IRQ> [<ffffffff810bc527>] debug_check_no_locks_freed+0x112/0x159
[ 2866.132281] [<ffffffff818a0839>] ? __sk_free+0xfd/0x114
[ 2866.132281] [<ffffffff811549fa>] kmem_cache_free+0x6b/0x13a
[ 2866.132281] [<ffffffff818a0839>] __sk_free+0xfd/0x114
[ 2866.132281] [<ffffffff818a08c0>] sk_free+0x1c/0x1e
[ 2866.132281] [<ffffffff81911e1c>] tcp_write_timer+0x51/0x56
[ 2866.132281] [<ffffffff81078082>] run_timer_softirq+0x218/0x35f
[ 2866.132281] [<ffffffff81078017>] ? run_timer_softirq+0x1ad/0x35f
[ 2866.132281] [<ffffffff810f5831>] ? rb_commit+0x58/0x85
[ 2866.132281] [<ffffffff81911dcb>] ? tcp_write_timer_handler+0x148/0x148
[ 2866.132281] [<ffffffff81070bd6>] __do_softirq+0xcb/0x1f9
[ 2866.132281] [<ffffffff81a0a00c>] ? _raw_spin_unlock+0x29/0x2e
[ 2866.132281] [<ffffffff81a1227c>] call_softirq+0x1c/0x30
[ 2866.132281] [<ffffffff81039f38>] do_softirq+0x4a/0xa6
[ 2866.132281] [<ffffffff81070f2b>] irq_exit+0x51/0xad
[ 2866.132281] [<ffffffff81a129cd>] do_IRQ+0x9d/0xb4
[ 2866.132281] [<ffffffff81a0a3ef>] common_interrupt+0x6f/0x6f
[ 2866.132281] <EOI> [<ffffffff8109d006>] ? sched_clock_cpu+0x58/0xd1
[ 2866.132281] [<ffffffff81a0a172>] ? _raw_spin_unlock_irqrestore+0x4c/0x56
[ 2866.132281] [<ffffffff81078692>] mod_timer+0x178/0x1a9
[ 2866.132281] [<ffffffff818a00aa>] sk_reset_timer+0x19/0x26
[ 2866.132281] [<ffffffff8190b2cc>] tcp_rearm_rto+0x99/0xa4
[ 2866.132281] [<ffffffff8190dfba>] tcp_event_new_data_sent+0x6e/0x70
[ 2866.132281] [<ffffffff8190f7ea>] tcp_write_xmit+0x7de/0x8e4
[ 2866.132281] [<ffffffff818a565d>] ? __alloc_skb+0xa0/0x1a1
[ 2866.132281] [<ffffffff8190f952>] __tcp_push_pending_frames+0x2e/0x8a
[ 2866.132281] [<ffffffff81904122>] tcp_sendmsg+0xb32/0xcc6
[ 2866.132281] [<ffffffff819229c2>] inet_sendmsg+0xaa/0xd5
[ 2866.132281] [<ffffffff81922918>] ? inet_autobind+0x5f/0x5f
[ 2866.132281] [<ffffffff810ee7f1>] ? trace_clock_local+0x9/0xb
[ 2866.132281] [<ffffffff8189adab>] sock_sendmsg+0xa3/0xc4
[ 2866.132281] [<ffffffff810f5de6>] ? rb_reserve_next_event+0x26f/0x2d5
[ 2866.132281] [<ffffffff8103e6a9>] ? native_sched_clock+0x29/0x6f
[ 2866.132281] [<ffffffff8103e6f8>] ? sched_clock+0x9/0xd
[ 2866.132281] [<ffffffff810ee7f1>] ? trace_clock_local+0x9/0xb
[ 2866.132281] [<ffffffff8189ae03>] kernel_sendmsg+0x37/0x43
[ 2866.132281] [<ffffffff8199ce49>] xs_send_kvec+0x77/0x80
[ 2866.132281] [<ffffffff8199cec1>] xs_sendpages+0x6f/0x1a0
[ 2866.132281] [<ffffffff8107826d>] ? try_to_del_timer_sync+0x55/0x61
[ 2866.132281] [<ffffffff8199d0d2>] xs_tcp_send_request+0x55/0xf1
[ 2866.132281] [<ffffffff8199bb90>] xprt_transmit+0x89/0x1db
[ 2866.132281] [<ffffffff81999bcd>] ? call_connect+0x3c/0x3c
[ 2866.132281] [<ffffffff81999d92>] call_transmit+0x1c5/0x20e
[ 2866.132281] [<ffffffff819a0d55>] __rpc_execute+0x6f/0x225
[ 2866.132281] [<ffffffff81999bcd>] ? call_connect+0x3c/0x3c
[ 2866.132281] [<ffffffff819a0f33>] rpc_async_schedule+0x28/0x34
[ 2866.132281] [<ffffffff810835d6>] process_one_work+0x24d/0x47f
[ 2866.132281] [<ffffffff81083567>] ? process_one_work+0x1de/0x47f
[ 2866.132281] [<ffffffff819a0f0b>] ? __rpc_execute+0x225/0x225
[ 2866.132281] [<ffffffff81083a6d>] worker_thread+0x236/0x317
[ 2866.132281] [<ffffffff81083837>] ? process_scheduled_works+0x2f/0x2f
[ 2866.132281] [<ffffffff8108b7b8>] kthread+0x9a/0xa2
[ 2866.132281] [<ffffffff81a12184>] kernel_thread_helper+0x4/0x10
[ 2866.132281] [<ffffffff81a0a4b0>] ? retint_restore_args+0x13/0x13
[ 2866.132281] [<ffffffff8108b71e>] ? __init_kthread_worker+0x5a/0x5a
[ 2866.132281] [<ffffffff81a12180>] ? gs_change+0x13/0x13
[ 2866.308506] IPv4: Attempt to release TCP socket in state 1 ffff880019ec0000
[ 2866.309689] =============================================================================
[ 2866.310254] BUG TCP (Not tainted): Object already free
[ 2866.310254] -----------------------------------------------------------------------------
[ 2866.310254]
The bug comes from the fact that timer set in sk_reset_timer() can run
before we actually do the sock_hold(). socket refcount reaches zero and
we free the socket too soon.
timer handler is not allowed to reduce socket refcnt if socket is owned
by the user, or we need to change sk_reset_timer() implementation.
We should take a reference on the socket in case TCP_DELACK_TIMER_DEFERRED
or TCP_DELACK_TIMER_DEFERRED bit are set in tsq_flags
Also fix a typo in tcp_delack_timer(), where TCP_WRITE_TIMER_DEFERRED
was used instead of TCP_DELACK_TIMER_DEFERRED.
For consistency, use same socket refcount change for TCP_MTU_REDUCED_DEFERRED,
even if not fired from a timer.
Reported-by: Fengguang Wu <fengguang.wu@intel.com>
Tested-by: Fengguang Wu <fengguang.wu@intel.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
This commit removes the sk_rx_dst_set calls from
tcp_create_openreq_child(), because at that point the icsk_af_ops
field of ipv6_mapped TCP sockets has not been set to its proper final
value.
Instead, to make sure we get the right sk_rx_dst_set variant
appropriate for the address family of the new connection, we have
tcp_v{4,6}_syn_recv_sock() directly call the appropriate function
shortly after the call to tcp_create_openreq_child() returns.
This also moves inet6_sk_rx_dst_set() to avoid a forward declaration
with the new approach.
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Reported-by: Artem Savkov <artem.savkov@gmail.com>
Cc: Eric Dumazet <edumazet@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Pable Neira Ayuso says:
====================
The following five patches contain fixes for 3.6-rc, they are:
* Two fixes for message parsing in the SIP conntrack helper, from
Patrick McHardy.
* One fix for the SIP helper introduced in the user-space cthelper
infrastructure, from Patrick McHardy.
* fix missing appropriate locking while modifying one conntrack entry
from the nfqueue integration code, from myself.
* fix possible access to uninitiliazed timer in the nf_conntrack
expectation infrastructure, from myself.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Commit caacf05e5ad1abf causes big drop of UDP loop back performance.
The cause of the regression is that we do not cache the local output
routes. Each time we send a datagram from unconnected UDP socket,
the kernel allocates a dst_entry and adds it to the rt_uncached_list.
It creates lock contention on the rt_uncached_lock.
Reported-by: Alex Shi <alex.shi@intel.com>
Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|