aboutsummaryrefslogtreecommitdiff
path: root/net/ipv4
AgeCommit message (Collapse)Author
2012-03-19tcp: fix tcp_shift_skb_data() to not shift SACKed data below snd_unaNeal Cardwell
[ Upstream commit 4648dc97af9d496218a05353b0e442b3dfa6aaab ] This commit fixes tcp_shift_skb_data() so that it does not shift SACKed data below snd_una. This fixes an issue whose symptoms exactly match reports showing tp->sacked_out going negative since 3.3.0-rc4 (see "WARNING: at net/ipv4/tcp_input.c:3418" thread on netdev). Since 2008 (832d11c5cd076abc0aa1eaf7be96c81d1a59ce41) tcp_shift_skb_data() had been shifting SACKed ranges that were below snd_una. It checked that the *end* of the skb it was about to shift from was above snd_una, but did not check that the end of the actual shifted range was above snd_una; this commit adds that check. Shifting SACKed ranges below snd_una is problematic because for such ranges tcp_sacktag_one() short-circuits: it does not declare anything as SACKed and does not increase sacked_out. Before the fixes in commits cc9a672ee522d4805495b98680f4a3db5d0a0af9 and daef52bab1fd26e24e8e9578f8fb33ba1d0cb412, shifting SACKed ranges below snd_una happened to work because tcp_shifted_skb() was always (incorrectly) passing in to tcp_sacktag_one() an skb whose end_seq tcp_shift_skb_data() had already guaranteed was beyond snd_una. Hence tcp_sacktag_one() never short-circuited and always increased tp->sacked_out in this case. After those two fixes, my testing has verified that shifting SACKed ranges below snd_una could cause tp->sacked_out to go negative with the following sequence of events: (1) tcp_shift_skb_data() sees an skb whose end_seq is beyond snd_una, then shifts a prefix of that skb that is below snd_una (2) tcp_shifted_skb() increments the packet count of the already-SACKed prev sk_buff (3) tcp_sacktag_one() sees the end of the new SACKed range is below snd_una, so it short-circuits and doesn't increase tp->sacked_out (5) tcp_clean_rtx_queue() sees the SACKed skb has been ACKed, decrements tp->sacked_out by this "inflated" pcount that was missing a matching increase in tp->sacked_out, and hence tp->sacked_out underflows to a u32 like 0xFFFFFFFF, which casted to s32 is negative. (6) this leads to the warnings seen in the recent "WARNING: at net/ipv4/tcp_input.c:3418" thread on the netdev list; e.g.: tcp_input.c:3418 WARN_ON((int)tp->sacked_out < 0); More generally, I think this bug can be tickled in some cases where two or more ACKs from the receiver are lost and then a DSACK arrives that is immediately above an existing SACKed skb in the write queue. This fix changes tcp_shift_skb_data() to abort this sequence at step (1) in the scenario above by noticing that the bytes are below snd_una and not shifting them. Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-03-19tcp: don't fragment SACKed skbs in tcp_mark_head_lost()Neal Cardwell
[ Upstream commit c0638c247f559e1a16ee79e54df14bca2cb679ea ] In tcp_mark_head_lost() we should not attempt to fragment a SACKed skb to mark the first portion as lost. This is for two primary reasons: (1) tcp_shifted_skb() coalesces adjacent regions of SACKed skbs. When doing this, it preserves the sum of their packet counts in order to reflect the real-world dynamics on the wire. But given that skbs can have remainders that do not align to MSS boundaries, this packet count preservation means that for SACKed skbs there is not necessarily a direct linear relationship between tcp_skb_pcount(skb) and skb->len. Thus tcp_mark_head_lost()'s previous attempts to fragment off and mark as lost a prefix of length (packets - oldcnt)*mss from SACKed skbs were leading to occasional failures of the WARN_ON(len > skb->len) in tcp_fragment() (which used to be a BUG_ON(); see the recent "crash in tcp_fragment" thread on netdev). (2) there is no real point in fragmenting off part of a SACKed skb and calling tcp_skb_mark_lost() on it, since tcp_skb_mark_lost() is a NOP for SACKed skbs. Signed-off-by: Neal Cardwell <ncardwell@google.com> Acked-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Acked-by: Yuchung Cheng <ycheng@google.com> Acked-by: Nandita Dukkipati <nanditad@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-03-19tcp: fix false reordering signal in tcp_shifted_skbNeal Cardwell
[ Upstream commit 4c90d3b30334833450ccbb02f452d4972a3c3c3f ] When tcp_shifted_skb() shifts bytes from the skb that is currently pointed to by 'highest_sack' then the increment of TCP_SKB_CB(skb)->seq implicitly advances tcp_highest_sack_seq(). This implicit advancement, combined with the recent fix to pass the correct SACKed range into tcp_sacktag_one(), caused tcp_sacktag_one() to think that the newly SACKed range was before the tcp_highest_sack_seq(), leading to a call to tcp_update_reordering() with a degree of reordering matching the size of the newly SACKed range (typically just 1 packet, which is a NOP, but potentially larger). This commit fixes this by simply calling tcp_sacktag_one() before the TCP_SKB_CB(skb)->seq advancement that can advance our notion of the highest SACKed sequence. Correspondingly, we can simplify the code a little now that tcp_shifted_skb() should update the lost_cnt_hint in all cases where skb == tp->lost_skb_hint. Signed-off-by: Neal Cardwell <ncardwell@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-03-19ipsec: be careful of non existing mac headersEric Dumazet
[ Upstream commit 03606895cd98c0a628b17324fd7b5ff15db7e3cd ] Niccolo Belli reported ipsec crashes in case we handle a frame without mac header (atm in his case) Before copying mac header, better make sure it is present. Bugzilla reference: https://bugzilla.kernel.org/show_bug.cgi?id=42809 Reported-by: Niccolò Belli <darkbasic@linuxsystems.it> Tested-by: Niccolò Belli <darkbasic@linuxsystems.it> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-02-29tcp: fix tcp_shifted_skb() adjustment of lost_cnt_hint for FACKNeal Cardwell
[ Upstream commit 0af2a0d0576205dda778d25c6c344fc6508fc81d ] This commit ensures that lost_cnt_hint is correctly updated in tcp_shifted_skb() for FACK TCP senders. The lost_cnt_hint adjustment in tcp_sacktag_one() only applies to non-FACK senders, so FACK senders need their own adjustment. This applies the spirit of 1e5289e121372a3494402b1b131b41bfe1cf9b7f - except now that the sequence range passed into tcp_sacktag_one() is correct we need only have a special case adjustment for FACK. Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-02-29tcp: fix range tcp_shifted_skb() passes to tcp_sacktag_one()Neal Cardwell
[ Upstream commit daef52bab1fd26e24e8e9578f8fb33ba1d0cb412 ] Fix the newly-SACKed range to be the range of newly-shifted bytes. Previously - since 832d11c5cd076abc0aa1eaf7be96c81d1a59ce41 - tcp_shifted_skb() incorrectly called tcp_sacktag_one() with the start and end sequence numbers of the skb it passes in set to the range just beyond the range that is newly-SACKed. This commit also removes a special-case adjustment to lost_cnt_hint in tcp_shifted_skb() since the pre-existing adjustment of lost_cnt_hint in tcp_sacktag_one() now properly handles this things now that the correct start sequence number is passed in. Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-02-29tcp: allow tcp_sacktag_one() to tag ranges not aligned with skbsNeal Cardwell
[ Upstream commit cc9a672ee522d4805495b98680f4a3db5d0a0af9 ] This commit allows callers of tcp_sacktag_one() to pass in sequence ranges that do not align with skb boundaries, as tcp_shifted_skb() needs to do in an upcoming fix in this patch series. In fact, now tcp_sacktag_one() does not need to depend on an input skb at all, which makes its semantics and dependencies more clear. Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-02-29ipv4: Fix wrong order of ip_rt_get_source() and update iph->daddr.Li Wei
[ Upstream commit 5dc7883f2a7c25f8df40d7479687153558cd531b ] This patch fix a bug which introduced by commit ac8a4810 (ipv4: Save nexthop address of LSRR/SSRR option to IPCB.).In that patch, we saved the nexthop of SRR in ip_option->nexthop and update iph->daddr until we get to ip_forward_options(), but we need to update it before ip_rt_get_source(), otherwise we may get a wrong src. Signed-off-by: Li Wei <lw@cn.fujitsu.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-02-29tcp_v4_send_reset: binding oif to iif in no sock caseShawn Lu
[ Upstream commit e2446eaab5585555a38ea0df4e01ff313dbb4ac9 ] Binding RST packet outgoing interface to incoming interface for tcp v4 when there is no socket associate with it. when sk is not NULL, using sk->sk_bound_dev_if instead. (suggested by Eric Dumazet). This has few benefits: 1. tcp_v6_send_reset already did that. 2. This helps tcp connect with SO_BINDTODEVICE set. When connection is lost, we still able to sending out RST using same interface. 3. we are sending reply, it is most likely to be succeed if iif is used Signed-off-by: Shawn Lu <shawn.lu@ericsson.com> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-02-29net: Don't proxy arp respond if iif == rt->dst.dev if private VLAN is disabledThomas Graf
[ Upstream commit 70620c46ac2b45c24b0f22002fdf5ddd1f7daf81 ] Commit 653241 (net: RFC3069, private VLAN proxy arp support) changed the behavior of arp proxy to send arp replies back out on the interface the request came in even if the private VLAN feature is disabled. Previously we checked rt->dst.dev != skb->dev for in scenarios, when proxy arp is enabled on for the netdevice and also when individual proxy neighbour entries have been added. This patch adds the check back for the pneigh_lookup() scenario. Signed-off-by: Thomas Graf <tgraf@suug.ch> Acked-by: Jesper Dangaard Brouer <hawk@comx.dk> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-02-03tcp: md5: using remote adress for md5 lookup in rst packetshawnlu
[ Upstream commit 8a622e71f58ec9f092fc99eacae0e6cf14f6e742 ] md5 key is added in socket through remote address. remote address should be used in finding md5 key when sending out reset packet. Signed-off-by: shawnlu <shawn.lu@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-02-03tcp: fix tcp_trim_head() to adjust segment count with skb MSSNeal Cardwell
[ Upstream commit 5b35e1e6e9ca651e6b291c96d1106043c9af314a ] This commit fixes tcp_trim_head() to recalculate the number of segments in the skb with the skb's existing MSS, so trimming the head causes the skb segment count to be monotonically non-increasing - it should stay the same or go down, but not increase. Previously tcp_trim_head() used the current MSS of the connection. But if there was a decrease in MSS between original transmission and ACK (e.g. due to PMTUD), this could cause tcp_trim_head() to counter-intuitively increase the segment count when trimming bytes off the head of an skb. This violated assumptions in tcp_tso_acked() that tcp_trim_head() only decreases the packet count, so that packets_acked in tcp_tso_acked() could underflow, leading tcp_clean_rtx_queue() to pass u32 pkts_acked values as large as 0xffffffff to ca_ops->pkts_acked(). As an aside, if tcp_trim_head() had really wanted the skb to reflect the current MSS, it should have called tcp_set_skb_tso_segs() unconditionally, since a decrease in MSS would mean that a single-packet skb should now be sliced into multiple segments. Signed-off-by: Neal Cardwell <ncardwell@google.com> Acked-by: Nandita Dukkipati <nanditad@google.com> Acked-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-02-03net: reintroduce missing rcu_assign_pointer() callsEric Dumazet
[ Upstream commit cf778b00e96df6d64f8e21b8395d1f8a859ecdc7 ] commit a9b3cd7f32 (rcu: convert uses of rcu_assign_pointer(x, NULL) to RCU_INIT_POINTER) did a lot of incorrect changes, since it did a complete conversion of rcu_assign_pointer(x, y) to RCU_INIT_POINTER(x, y). We miss needed barriers, even on x86, when y is not NULL. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> CC: Stephen Hemminger <shemminger@vyatta.com> CC: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-01-12igmp: Avoid zero delay when receiving odd mixture of IGMP queriesBen Hutchings
commit a8c1f65c79cbbb2f7da782d4c9d15639a9b94b27 upstream. Commit 5b7c84066733c5dfb0e4016d939757b38de189e4 ('ipv4: correct IGMP behavior on v3 query during v2-compatibility mode') added yet another case for query parsing, which can result in max_delay = 0. Substitute a value of 1, as in the usual v3 case. Reported-by: Simon McVittie <smcv@debian.org> References: http://bugs.debian.org/654876 Signed-off-by: Ben Hutchings <ben@decadent.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-12-22net: introduce DST_NOPEER dst flagEric Dumazet
Chris Boot reported crashes occurring in ipv6_select_ident(). [ 461.457562] RIP: 0010:[<ffffffff812dde61>] [<ffffffff812dde61>] ipv6_select_ident+0x31/0xa7 [ 461.578229] Call Trace: [ 461.580742] <IRQ> [ 461.582870] [<ffffffff812efa7f>] ? udp6_ufo_fragment+0x124/0x1a2 [ 461.589054] [<ffffffff812dbfe0>] ? ipv6_gso_segment+0xc0/0x155 [ 461.595140] [<ffffffff812700c6>] ? skb_gso_segment+0x208/0x28b [ 461.601198] [<ffffffffa03f236b>] ? ipv6_confirm+0x146/0x15e [nf_conntrack_ipv6] [ 461.608786] [<ffffffff81291c4d>] ? nf_iterate+0x41/0x77 [ 461.614227] [<ffffffff81271d64>] ? dev_hard_start_xmit+0x357/0x543 [ 461.620659] [<ffffffff81291cf6>] ? nf_hook_slow+0x73/0x111 [ 461.626440] [<ffffffffa0379745>] ? br_parse_ip_options+0x19a/0x19a [bridge] [ 461.633581] [<ffffffff812722ff>] ? dev_queue_xmit+0x3af/0x459 [ 461.639577] [<ffffffffa03747d2>] ? br_dev_queue_push_xmit+0x72/0x76 [bridge] [ 461.646887] [<ffffffffa03791e3>] ? br_nf_post_routing+0x17d/0x18f [bridge] [ 461.653997] [<ffffffff81291c4d>] ? nf_iterate+0x41/0x77 [ 461.659473] [<ffffffffa0374760>] ? br_flood+0xfa/0xfa [bridge] [ 461.665485] [<ffffffff81291cf6>] ? nf_hook_slow+0x73/0x111 [ 461.671234] [<ffffffffa0374760>] ? br_flood+0xfa/0xfa [bridge] [ 461.677299] [<ffffffffa0379215>] ? nf_bridge_update_protocol+0x20/0x20 [bridge] [ 461.684891] [<ffffffffa03bb0e5>] ? nf_ct_zone+0xa/0x17 [nf_conntrack] [ 461.691520] [<ffffffffa0374760>] ? br_flood+0xfa/0xfa [bridge] [ 461.697572] [<ffffffffa0374812>] ? NF_HOOK.constprop.8+0x3c/0x56 [bridge] [ 461.704616] [<ffffffffa0379031>] ? nf_bridge_push_encap_header+0x1c/0x26 [bridge] [ 461.712329] [<ffffffffa037929f>] ? br_nf_forward_finish+0x8a/0x95 [bridge] [ 461.719490] [<ffffffffa037900a>] ? nf_bridge_pull_encap_header+0x1c/0x27 [bridge] [ 461.727223] [<ffffffffa0379974>] ? br_nf_forward_ip+0x1c0/0x1d4 [bridge] [ 461.734292] [<ffffffff81291c4d>] ? nf_iterate+0x41/0x77 [ 461.739758] [<ffffffffa03748cc>] ? __br_deliver+0xa0/0xa0 [bridge] [ 461.746203] [<ffffffff81291cf6>] ? nf_hook_slow+0x73/0x111 [ 461.751950] [<ffffffffa03748cc>] ? __br_deliver+0xa0/0xa0 [bridge] [ 461.758378] [<ffffffffa037533a>] ? NF_HOOK.constprop.4+0x56/0x56 [bridge] This is caused by bridge netfilter special dst_entry (fake_rtable), a special shared entry, where attaching an inetpeer makes no sense. Problem is present since commit 87c48fa3b46 (ipv6: make fragment identifications less predictable) Introduce DST_NOPEER dst flag and make sure ipv6_select_ident() and __ip_select_ident() fallback to the 'no peer attached' handling. Reported-by: Chris Boot <bootc@bootc.net> Tested-by: Chris Boot <bootc@bootc.net> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-12-22ipv4: using prefetch requires including prefetch.hStephen Rothwell
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Acked-by: David Miller <davem@davemloft.net> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2011-12-21ipv4: reintroduce route cache garbage collectorEric Dumazet
Commit 2c8cec5c10b (ipv4: Cache learned PMTU information in inetpeer) removed IP route cache garbage collector a bit too soon, as this gc was responsible for expired routes cleanup, releasing their neighbour reference. As pointed out by Robert Gladewitz, recent kernels can fill and exhaust their neighbour cache. Reintroduce the garbage collection, since we'll have to wait our neighbour lookups become refcount-less to not depend on this stuff. Reported-by: Robert Gladewitz <gladewitz@gmx.de> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-12-20net: have ipconfig not wait if no dev is availableGerlando Falauto
previous commit 3fb72f1e6e6165c5f495e8dc11c5bbd14c73385c makes IP-Config wait for carrier on at least one network device. Before waiting (predefined value 120s), check that at least one device was successfully brought up. Otherwise (e.g. buggy bootloader which does not set the MAC address) there is no point in waiting for carrier. Cc: Micha Nelissen <micha@neli.hopto.org> Cc: Holger Brunck <holger.brunck@keymile.com> Signed-off-by: Gerlando Falauto <gerlando.falauto@keymile.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-12-12ipip, sit: copy parms.name after register_netdeviceTed Feng
Same fix as 731abb9cb2 for ipip and sit tunnel. Commit 1c5cae815d removed an explicit call to dev_alloc_name in ipip_tunnel_locate and ipip6_tunnel_locate, because register_netdevice will now create a valid name, however the tunnel keeps a copy of the name in the private parms structure. Fix this by copying the name back after register_netdevice has successfully returned. This shows up if you do a simple tunnel add, followed by a tunnel show: $ sudo ip tunnel add mode ipip remote 10.2.20.211 $ ip tunnel tunl0: ip/ip remote any local any ttl inherit nopmtudisc tunl%d: ip/ip remote 10.2.20.211 local any ttl inherit $ sudo ip tunnel add mode sit remote 10.2.20.212 $ ip tunnel sit0: ipv6/ip remote any local any ttl 64 nopmtudisc 6rd-prefix 2002::/16 sit%d: ioctl 89f8 failed: No such device sit%d: ipv6/ip remote 10.2.20.212 local any ttl inherit Cc: stable@vger.kernel.org Signed-off-by: Ted Feng <artisdom@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-12-05ipv4: Fix peer validation on cached lookup.David S. Miller
If ipv4_valdiate_peer() fails during a cached entry lookup, we'll NULL derer since the loop iterator assumes rth is not NULL. Letting this be handled as a failure is just bogus, so just make it not fail. If we have trouble getting a non-NULL neighbour for the redirected gateway, just restore the original gateway and continue. The very next use of this cached route will try again. Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-12-03ipv4: make sure RTO_ONLINK is saved in routing cacheJulian Anastasov
__mkroute_output fails to work with the original tos and uses value with stripped RTO_ONLINK bit. Make sure we put the original TOS bits into rt_key_tos because it used to match cached route. Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-12-01ipv4: flush route cache after change accept_localPeter Pan(潘卫平)
After reset ipv4_devconf->data[IPV4_DEVCONF_ACCEPT_LOCAL] to 0, we should flush route cache, or it will continue receive packets with local source address, which should be dropped. Signed-off-by: Weiping Pan <panweiping3@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-12-01Revert "udp: remove redundant variable"David S. Miller
This reverts commit 81d54ec8479a2c695760da81f05b5a9fb2dbe40a. If we take the "try_again" goto, due to a checksum error, the 'len' has already been truncated. So we won't compute the same values as the original code did. Reported-by: paul bilke <fsmail@conspiracy.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-12-01ipv4: Perform peer validation on cached route lookup.David S. Miller
Otherwise we won't notice the peer GENID change. Reported-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-11-30ipv4: fix lockdep splat in rt_cache_seq_showEric Dumazet
After commit f2c31e32b378 (fix NULL dereferences in check_peer_redir()), dst_get_neighbour() should be guarded by rcu_read_lock() / rcu_read_unlock() section. Reported-by: Miles Lane <miles.lane@gmail.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-11-29Merge branch 'nf' of git://1984.lsi.us.es/netDavid S. Miller
2011-11-26inet: add a redirect generation id in inetpeerEric Dumazet
Now inetpeer is the place where we cache redirect information for ipv4 destinations, we must be able to invalidate informations when a route is added/removed on host. As inetpeer is not yet namespace aware, this patch adds a shared redirect_genid, and a per inetpeer redirect_genid. This might be changed later if inetpeer becomes ns aware. Cache information for one inerpeer is valid as long as its redirect_genid has the same value than global redirect_genid. Reported-by: Arkadiusz Miśkiewicz <a.miskiewicz@gmail.com> Tested-by: Arkadiusz Miśkiewicz <a.miskiewicz@gmail.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-11-26ipv4: Don't use the cached pmtu informations for input routesSteffen Klassert
The pmtu informations on the inetpeer are visible for output and input routes. On packet forwarding, we might propagate a learned pmtu to the sender. As we update the pmtu informations of the inetpeer on demand, the original sender of the forwarded packets might never notice when the pmtu to that inetpeer increases. So use the mtu of the outgoing device on packet forwarding instead of the pmtu to the final destination. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-11-26net: Move mtu handling down to the protocol depended handlersSteffen Klassert
We move all mtu handling from dst_mtu() down to the protocol layer. So each protocol can implement the mtu handling in a different manner. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-11-26net: Rename the dst_opt default_mtu method to mtuSteffen Klassert
We plan to invoke the dst_opt->default_mtu() method unconditioally from dst_mtu(). So rename the method to dst_opt->mtu() to match the name with the new meaning. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-11-26route: Use the device mtu as the default for blackhole routesSteffen Klassert
As it is, we return null as the default mtu of blackhole routes. This may lead to a propagation of a bogus pmtu if the default_mtu method of a blackhole route is invoked. So return dst->dev->mtu as the default mtu instead. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-11-23ipv4: Save nexthop address of LSRR/SSRR option to IPCB.Li Wei
We can not update iph->daddr in ip_options_rcv_srr(), It is too early. When some exception ocurred later (eg. in ip_forward() when goto sr_failed) we need the ip header be identical to the original one as ICMP need it. Add a field 'nexthop' in struct ip_options to save nexthop of LSRR or SSRR option. Signed-off-by: Li Wei <lw@cn.fujitsu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-11-23ipv4 : igmp : fix error handle in ip_mc_add_src()Jun Zhao
When add sources to interface failure, need to roll back the sfcount[MODE] to before state. We need to match it corresponding. Acked-by: David L Stevens <dlstevens@us.ibm.com> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Jun Zhao <mypopydev@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-11-23netfilter: Remove NOTRACK/RAW dependency on NETFILTER_ADVANCED.David S. Miller
Distributions are using this in their default scripts, so don't hide them behind the advanced setting. Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-11-22net-netlink: fix diag to export IPv4 tos for dual-stack IPv6 socketsMaciej Żenczykowski
Signed-off-by: Maciej Żenczykowski <maze@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-11-21netfilter: possible unaligned packet header in ip_route_me_harderPaul Guo
This patch tries to fix the following issue in netfilter: In ip_route_me_harder(), we invoke pskb_expand_head() that rellocates new header with additional head room which can break the alignment of the original packet header. In one of my NAT test case, the NIC port for internal hosts is configured with vlan and the port for external hosts is with general configuration. If we ping an external "unknown" hosts from an internal host, an icmp packet will be sent. We find that in icmp_send()->...->ip_route_me_harder()->pskb_expand_head(), hh_len=18 and current headroom (skb_headroom(skb)) of the packet is 16. After calling pskb_expand_head() the packet header becomes to be unaligned and then our system (arch/tile) panics immediately. Signed-off-by: Paul Guo <ggang@tilera.com> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2011-11-18ipv4: fix redirect handlingEric Dumazet
commit f39925dbde77 (ipv4: Cache learned redirect information in inetpeer.) introduced a regression in ICMP redirect handling. It assumed ipv4_dst_check() would be called because all possible routes were attached to the inetpeer we modify in ip_rt_redirect(), but thats not true. commit 7cc9150ebe (route: fix ICMP redirect validation) tried to fix this but solution was not complete. (It fixed only one route) So we must lookup existing routes (including different TOS values) and call check_peer_redir() on them. Reported-by: Ivan Zahariev <famzah@icdsoft.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> CC: Flavio Leitner <fbl@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-11-18ping: dont increment ICMP_MIB_INERRORSEric Dumazet
ping module incorrectly increments ICMP_MIB_INERRORS if feeded with a frame not belonging to its own sockets. RFC 2011 states that ICMP_MIB_INERRORS should count "the number of ICMP messages which the entiry received but determined as having ICMP-specific errors (bad ICMP checksums, bad length, etc.)." Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> CC: Vasiliy Kulikov <segoon@openwall.com> Acked-by: Flavio Leitner <fbl@redhat.com> Acked-by: Vasiliy Kulikov <segoon@openwall.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-11-16tcp: clear xmit timers in tcp_v4_syn_recv_sock()Eric Dumazet
Simon Kirby reported divides by zero errors in __tcp_select_window() This happens when inet_csk_route_child_sock() returns a NULL pointer : We free new socket while we eventually armed keepalive timer in tcp_create_openreq_child() Fix this by a call to tcp_clear_xmit_timers() [ This is a followup to commit 918eb39962dff (net: add missing bh_unlock_sock() calls) ] Reported-by: Simon Kirby <sim@hostway.ca> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Tested-by: Simon Kirby <sim@hostway.ca> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-11-14net-netlink: Add a new attribute to expose TCLASS values via netlinkMaciej Żenczykowski
commit 3ceca749668a52bd795585e0f71c6f0b04814f7b added a TOS attribute. Unfortunately TOS and TCLASS are both present in a dual-stack v6 socket, furthermore they can have different values. As such one cannot in a sane way expose both through a single attribute. Signed-off-by: Maciej Żenczyowski <maze@google.com> CC: Murali Raja <muralira@google.com> CC: Stephen Hemminger <shemminger@vyatta.com> CC: Eric Dumazet <eric.dumazet@gmail.com> CC: David S. Miller <davem@davemloft.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-11-12ah: Don't return NET_XMIT_DROP on input.Nick Bowler
When the ahash driver returns -EBUSY, AH4/6 input functions return NET_XMIT_DROP, presumably copied from the output code path. But returning transmit codes on input doesn't make a lot of sense. Since NET_XMIT_DROP is a positive int, this gets interpreted as the next header type (i.e., success). As that can only end badly, remove the check. Signed-off-by: Nick Bowler <nbowler@elliptictech.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-11-09ipv4: fix for ip_options_rcv_srr() daddr update.Li Wei
When opt->srr_is_hit is set skb_rtable(skb) has been updated for 'nexthop' and iph->daddr should always equals to skb_rtable->rt_dst holds, We need update iph->daddr either. Signed-off-by: Li Wei <lw@cn.fujitsu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-11-09ah: Read nexthdr value before overwriting it in ahash input callback.Nick Bowler
The AH4/6 ahash input callbacks read out the nexthdr field from the AH header *after* they overwrite that header. This is obviously not going to end well. Fix it up. Signed-off-by: Nick Bowler <nbowler@elliptictech.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-11-09ah: Correctly pass error codes in ahash output callback.Nick Bowler
The AH4/6 ahash output callbacks pass nexthdr to xfrm_output_resume instead of the error code. This appears to be a copy+paste error from the input case, where nexthdr is expected. This causes the driver to continuously add AH headers to the datagram until either an allocation fails and the packet is dropped or the ahash driver hits a synchronous fallback and the resulting monstrosity is transmitted. Correct this issue by simply passing the error code unadulterated. Signed-off-by: Nick Bowler <nbowler@elliptictech.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-11-08ipv4: Fix inetpeer expire time informationSteffen Klassert
As we update the learned pmtu informations on demand, we might report a nagative expiration time value to userspace if the pmtu informations are already expired and we have not send a packet to that inetpeer after expiration. With this patch we send a expire time of null to userspace after expiration until the next packet is send to that inetpeer. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-11-08tcp: Fix comments for Nagle algorithmFeng King
TCP_NODELAY is weaker than TCP_CORK, when TCP_CORK was set, small segments will always pass Nagle test regardless of TCP_NODELAY option. Signed-off-by: Feng King <kinwin2008@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-11-06Merge branch 'modsplit-Oct31_2011' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux * 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux: (230 commits) Revert "tracing: Include module.h in define_trace.h" irq: don't put module.h into irq.h for tracking irqgen modules. bluetooth: macroize two small inlines to avoid module.h ip_vs.h: fix implicit use of module_get/module_put from module.h nf_conntrack.h: fix up fallout from implicit moduleparam.h presence include: replace linux/module.h with "struct module" wherever possible include: convert various register fcns to macros to avoid include chaining crypto.h: remove unused crypto_tfm_alg_modname() inline uwb.h: fix implicit use of asm/page.h for PAGE_SIZE pm_runtime.h: explicitly requires notifier.h linux/dmaengine.h: fix implicit use of bitmap.h and asm/page.h miscdevice.h: fix up implicit use of lists and types stop_machine.h: fix implicit use of smp.h for smp_processor_id of: fix implicit use of errno.h in include/linux/of.h of_platform.h: delete needless include <linux/module.h> acpi: remove module.h include from platform/aclinux.h miscdevice.h: delete unnecessary inclusion of module.h device_cgroup.h: delete needless include <linux/module.h> net: sch_generic remove redundant use of <linux/module.h> net: inet_timewait_sock doesnt need <linux/module.h> ... Fix up trivial conflicts (other header files, and removal of the ab3550 mfd driver) in - drivers/media/dvb/frontends/dibx000_common.c - drivers/media/video/{mt9m111.c,ov6650.c} - drivers/mfd/ab3550-core.c - include/linux/dmaengine.h
2011-11-03net: add missing bh_unlock_sock() callsEric Dumazet
Simon Kirby reported lockdep warnings and following messages : [104661.897577] huh, entered softirq 3 NET_RX ffffffff81613740 preempt_count 00000101, exited with 00000102? [104661.923653] huh, entered softirq 3 NET_RX ffffffff81613740 preempt_count 00000101, exited with 00000102? Problem comes from commit 0e734419 (ipv4: Use inet_csk_route_child_sock() in DCCP and TCP.) If inet_csk_route_child_sock() returns NULL, we should release socket lock before freeing it. Another lock imbalance exists if __inet_inherit_port() returns an error since commit 093d282321da ( tproxy: fix hash locking issue when using port redirection in __inet_inherit_port()) a backport is also needed for >= 2.6.37 kernels. Reported-by: Simon Kirby <sim@hostway.ca> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Tested-by: Eric Dumazet <eric.dumazet@gmail.com> CC: Balazs Scheidler <bazsi@balabit.hu> CC: KOVACS Krisztian <hidden@balabit.hu> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Tested-by: Simon Kirby <sim@hostway.ca> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-11-02udp: fix a race in encap_rcv handlingEric Dumazet
udp_queue_rcv_skb() has a possible race in encap_rcv handling, since this pointer can be changed anytime. We should use ACCESS_ONCE() to close the race. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-11-01net: make the tcp and udp file_operations for the /proc stuff constArjan van de Ven
the tcp and udp code creates a set of struct file_operations at runtime while it can also be done at compile time, with the added benefit of then having these file operations be const. the trickiest part was to get the "THIS_MODULE" reference right; the naive method of declaring a struct in the place of registration would not work for this reason. Signed-off-by: Arjan van de Ven <arjan@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>