linux/net/ipv4/inetpeer.c, branch v3.2.38

net: Compute protocol sequence numbers and fragment IDs using MD5.

2011-08-07T01:33:19Z

Computers have become a lot faster since we compromised on the partial MD4 hash which we use currently for performance reasons. MD5 is a much safer choice, and is inline with both RFC1948 and other ISS generators (OpenBSD, Solaris, etc.) Furthermore, only having 24-bits of the sequence number be truly unpredictable is a very serious limitation. So the periodic regeneration and 8-bit counter have been removed. We compute and use a full 32-bit sequence number. For ipv6, DCCP was found to use a 32-bit truncated initial sequence number (it needs 43-bits) and that is fixed here as well. Reported-by: Dan Kaminsky Tested-by: Willy Tarreau Signed-off-by: David S. Miller

ipv6: make fragment identifications less predictable

2011-07-22T04:25:58Z

IPv6 fragment identification generation is way beyond what we use for IPv4 : It uses a single generator. Its not scalable and allows DOS attacks. Now inetpeer is IPv6 aware, we can use it to provide a more secure and scalable frag ident generator (per destination, instead of system wide) This patch : 1) defines a new secure_ipv6_id() helper 2) extends inet_getid() to provide 32bit results 3) extends ipv6_select_ident() with a new dest parameter Reported-by: Fernando Gont Signed-off-by: Eric Dumazet Signed-off-by: David S. Miller

inetpeer: kill inet_putpeer race

2011-07-12T03:25:04Z

We currently can free inetpeer entries too early : [ 782.636674] WARNING: kmemcheck: Caught 32-bit read from uninitialized memory (f130f44c) [ 782.636677] 1f7b13c100000000000000000000000002000000000000000000000000000000 [ 782.636686] i i i i u u u u i i i i u u u u i i i i u u u u u u u u u u u u [ 782.636694] ^ [ 782.636696] [ 782.636698] Pid: 4638, comm: ssh Not tainted 3.0.0-rc5+ #270 Hewlett-Packard HP Compaq 6005 Pro SFF PC/3047h [ 782.636702] EIP: 0060:[] EFLAGS: 00010286 CPU: 0 [ 782.636707] EIP is at inet_getpeer+0x25b/0x5a0 [ 782.636709] EAX: 00000002 EBX: 00010080 ECX: f130f3c0 EDX: f0209d30 [ 782.636711] ESI: 0000bc87 EDI: 0000ea60 EBP: f0209ddc ESP: c173134c [ 782.636712] DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068 [ 782.636714] CR0: 8005003b CR2: f0beca80 CR3: 30246000 CR4: 000006d0 [ 782.636716] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000 [ 782.636717] DR6: ffff4ff0 DR7: 00000400 [ 782.636718] [] rt_set_nexthop.clone.45+0x56/0x220 [ 782.636722] [] __ip_route_output_key+0x309/0x860 [ 782.636724] [] tcp_v4_connect+0x124/0x450 [ 782.636728] [] inet_stream_connect+0xa3/0x270 [ 782.636731] [] sys_connect+0xa1/0xb0 [ 782.636733] [] sys_socketcall+0x25d/0x2a0 [ 782.636736] [] sysenter_do_call+0x12/0x28 [ 782.636738] [] 0xffffffff Signed-off-by: Eric Dumazet Signed-off-by: David S. Miller

inetpeer: remove unused list

2011-06-09T00:05:30Z

Andi Kleen and Tim Chen reported huge contention on inetpeer unused_peers.lock, on memcached workload on a 40 core machine, with disabled route cache. It appears we constantly flip peers refcnt between 0 and 1 values, and we must insert/remove peers from unused_peers.list, holding a contended spinlock. Remove this list completely and perform a garbage collection on-the-fly, at lookup time, using the expired nodes we met during the tree traversal. This removes a lot of code, makes locking more standard, and obsoletes two sysctls (inet_peer_gc_mintime and inet_peer_gc_maxtime). This also removes two pointers in inet_peer structure. There is still a false sharing effect because refcnt is in first cache line of object [were the links and keys used by lookups are located], we might move it at the end of inet_peer structure to let this first cache line mostly read by cpus. Signed-off-by: Eric Dumazet CC: Andi Kleen CC: Tim Chen Signed-off-by: David S. Miller

inetpeer: fix race in unused_list manipulations

2011-05-27T17:39:11Z

Several crashes in cleanup_once() were reported in recent kernels. Commit d6cc1d642de9 (inetpeer: various changes) added a race in unlink_from_unused(). One way to avoid taking unused_peers.lock before doing the list_empty() test is to catch 0->1 refcnt transitions, using full barrier atomic operations variants (atomic_cmpxchg() and atomic_inc_return()) instead of previous atomic_inc() and atomic_add_unless() variants. We then call unlink_from_unused() only for the owner of the 0->1 transition. Add a new atomic_add_unless_return() static helper With help from Arun Sharma. Refs: https://bugzilla.kernel.org/show_bug.cgi?id=32772 Reported-by: Arun Sharma Reported-by: Maximilian Engelhardt Reported-by: Yann Dupont Reported-by: Denys Fedoryshchenko Signed-off-by: Eric Dumazet Signed-off-by: David S. Miller

inetpeer: reduce stack usage

2011-04-12T20:58:33Z

On 64bit arches, we use 752 bytes of stack when cleanup_once() is called from inet_getpeer(). Lets share the avl stack to save ~376 bytes. Before patch : # objdump -d net/ipv4/inetpeer.o | scripts/checkstack.pl 0x000006c3 unlink_from_pool [inetpeer.o]: 376 0x00000721 unlink_from_pool [inetpeer.o]: 376 0x00000cb1 inet_getpeer [inetpeer.o]: 376 0x00000e6d inet_getpeer [inetpeer.o]: 376 0x0004 inet_initpeers [inetpeer.o]: 112 # size net/ipv4/inetpeer.o text data bss dec hex filename 5320 432 21 5773 168d net/ipv4/inetpeer.o After patch : objdump -d net/ipv4/inetpeer.o | scripts/checkstack.pl 0x00000c11 inet_getpeer [inetpeer.o]: 376 0x00000dcd inet_getpeer [inetpeer.o]: 376 0x00000ab9 peer_check_expire [inetpeer.o]: 328 0x00000b7f peer_check_expire [inetpeer.o]: 328 0x0004 inet_initpeers [inetpeer.o]: 112 # size net/ipv4/inetpeer.o text data bss dec hex filename 5163 432 21 5616 15f0 net/ipv4/inetpeer.o Signed-off-by: Eric Dumazet Cc: Scot Doyle Cc: Stephen Hemminger Cc: Hiroaki SHIMODA Reviewed-by: Hiroaki SHIMODA Signed-off-by: David S. Miller

inetpeer: should use call_rcu() variant

2011-03-14T06:22:23Z

After commit 7b46ac4e77f3224a (inetpeer: Don't disable BH for initial fast RCU lookup.), we should use call_rcu() to wait proper RCU grace period. Signed-off-by: Eric Dumazet Signed-off-by: David S. Miller

ipv4: Fix PMTU update.

2011-03-14T01:37:49Z

On current net-next-2.6, when Linux receives ICMP Type: 3, Code: 4 (Destination unreachable (Fragmentation needed)), icmp_unreach -> ip_rt_frag_needed (peer->pmtu_expires is set here) -> tcp_v4_err -> do_pmtu_discovery -> ip_rt_update_pmtu (peer->pmtu_expires is already set, so check_peer_pmtu is skipped.) -> check_peer_pmtu check_peer_pmtu is skipped and MTU is not updated. To fix this, let check_peer_pmtu execute unconditionally. And some minor fixes 1) Avoid potential peer->pmtu_expires set to be zero. 2) In check_peer_pmtu, argument of time_before is reversed. 3) check_peer_pmtu expects peer->pmtu_orig is initialized as zero, but not initialized. Signed-off-by: Hiroaki SHIMODA Signed-off-by: David S. Miller

inetpeer: Don't disable BH for initial fast RCU lookup.

2011-03-08T22:59:28Z

If modifications on other cpus are ok, then modifications to the tree during lookup done by the local cpu are ok too. Signed-off-by: David S. Miller

inetpeer: seqlock optimization

2011-03-04T22:33:59Z

David noticed : ------------------ Eric, I was profiling the non-routing-cache case and something that stuck out is the case of calling inet_getpeer() with create==0. If an entry is not found, we have to redo the lookup under a spinlock to make certain that a concurrent writer rebalancing the tree does not "hide" an existing entry from us. This makes the case of a create==0 lookup for a not-present entry really expensive. It is on the order of 600 cpu cycles on my Niagara2. I added a hack to not do the relookup under the lock when create==0 and it now costs less than 300 cycles. This is now a pretty common operation with the way we handle COW'd metrics, so I think it's definitely worth optimizing. ----------------- One solution is to use a seqlock instead of a spinlock to protect struct inet_peer_base. After a failed avl tree lookup, we can easily detect if a writer did some changes during our lookup. Taking the lock and redo the lookup is only necessary in this case. Note: Add one private rcu_deref_locked() macro to place in one spot the access to spinlock included in seqlock. Signed-off-by: Eric Dumazet Signed-off-by: David S. Miller