diff options
author | David Woodhouse <dwmw2@shinybook.infradead.org> | 2005-07-02 13:39:09 +0100 |
---|---|---|
committer | David Woodhouse <dwmw2@shinybook.infradead.org> | 2005-07-02 13:39:09 +0100 |
commit | d2f6409584e2c62ffad81690562330ff3bf4a458 (patch) | |
tree | 3bdfb97d0b51be2f7f414f2107e97603c1206abb /net/ipv4 | |
parent | e1b09eba2686eca94a3a188042b518df6044a3c1 (diff) | |
parent | 4a89a04f1ee21a7c1f4413f1ad7dcfac50ff9b63 (diff) |
Merge with master.kernel.org:/pub/scm/linux/kernel/git/torvalds/linux-2.6.git
Diffstat (limited to 'net/ipv4')
56 files changed, 5065 insertions, 1058 deletions
diff --git a/net/ipv4/Kconfig b/net/ipv4/Kconfig index 6d3e8b1bd1f..3e63123f7bb 100644 --- a/net/ipv4/Kconfig +++ b/net/ipv4/Kconfig @@ -53,6 +53,44 @@ config IP_ADVANCED_ROUTER If unsure, say N here. +choice + prompt "Choose IP: FIB lookup algorithm (choose FIB_HASH if unsure)" + depends on IP_ADVANCED_ROUTER + default IP_FIB_HASH + +config IP_FIB_HASH + bool "FIB_HASH" + ---help--- + Current FIB is very proven and good enough for most users. + +config IP_FIB_TRIE + bool "FIB_TRIE" + ---help--- + Use new experimental LC-trie as FIB lookup algoritm. + This improves lookup performance if you have a large + number of routes. + + LC-trie is a longest matching prefix lookup algorithm which + performs better than FIB_HASH for large routing tables. + But, it consumes more memory and is more complex. + + LC-trie is described in: + + IP-address lookup using LC-tries. Stefan Nilsson and Gunnar Karlsson + IEEE Journal on Selected Areas in Communications, 17(6):1083-1092, June 1999 + An experimental study of compression methods for dynamic tries + Stefan Nilsson and Matti Tikkanen. Algorithmica, 33(1):19-33, 2002. + http://www.nada.kth.se/~snilsson/public/papers/dyntrie2/ + +endchoice + +# If the user does not enable advanced routing, he gets the safe +# default of the fib-hash algorithm. +config IP_FIB_HASH + bool + depends on !IP_ADVANCED_ROUTER + default y + config IP_MULTIPLE_TABLES bool "IP: policy routing" depends on IP_ADVANCED_ROUTER @@ -407,5 +445,112 @@ config IP_TCPDIAG config IP_TCPDIAG_IPV6 def_bool (IP_TCPDIAG=y && IPV6=y) || (IP_TCPDIAG=m && IPV6) +config TCP_CONG_ADVANCED + bool "TCP: advanced congestion control" + depends on INET + ---help--- + Support for selection of various TCP congestion control + modules. + + Nearly all users can safely say no here, and a safe default + selection will be made (BIC-TCP with new Reno as a fallback). + + If unsure, say N. + +# TCP Reno is builtin (required as fallback) +menu "TCP congestion control" + depends on TCP_CONG_ADVANCED + +config TCP_CONG_BIC + tristate "Binary Increase Congestion (BIC) control" + depends on INET + default y + ---help--- + BIC-TCP is a sender-side only change that ensures a linear RTT + fairness under large windows while offering both scalability and + bounded TCP-friendliness. The protocol combines two schemes + called additive increase and binary search increase. When the + congestion window is large, additive increase with a large + increment ensures linear RTT fairness as well as good + scalability. Under small congestion windows, binary search + increase provides TCP friendliness. + See http://www.csc.ncsu.edu/faculty/rhee/export/bitcp/ + +config TCP_CONG_WESTWOOD + tristate "TCP Westwood+" + depends on INET + default m + ---help--- + TCP Westwood+ is a sender-side only modification of the TCP Reno + protocol stack that optimizes the performance of TCP congestion + control. It is based on end-to-end bandwidth estimation to set + congestion window and slow start threshold after a congestion + episode. Using this estimation, TCP Westwood+ adaptively sets a + slow start threshold and a congestion window which takes into + account the bandwidth used at the time congestion is experienced. + TCP Westwood+ significantly increases fairness wrt TCP Reno in + wired networks and throughput over wireless links. + +config TCP_CONG_HTCP + tristate "H-TCP" + depends on INET + default m + ---help--- + H-TCP is a send-side only modifications of the TCP Reno + protocol stack that optimizes the performance of TCP + congestion control for high speed network links. It uses a + modeswitch to change the alpha and beta parameters of TCP Reno + based on network conditions and in a way so as to be fair with + other Reno and H-TCP flows. + +config TCP_CONG_HSTCP + tristate "High Speed TCP" + depends on INET && EXPERIMENTAL + default n + ---help--- + Sally Floyd's High Speed TCP (RFC 3649) congestion control. + A modification to TCP's congestion control mechanism for use + with large congestion windows. A table indicates how much to + increase the congestion window by when an ACK is received. + For more detail see http://www.icir.org/floyd/hstcp.html + +config TCP_CONG_HYBLA + tristate "TCP-Hybla congestion control algorithm" + depends on INET && EXPERIMENTAL + default n + ---help--- + TCP-Hybla is a sender-side only change that eliminates penalization of + long-RTT, large-bandwidth connections, like when satellite legs are + involved, expecially when sharing a common bottleneck with normal + terrestrial connections. + +config TCP_CONG_VEGAS + tristate "TCP Vegas" + depends on INET && EXPERIMENTAL + default n + ---help--- + TCP Vegas is a sender-side only change to TCP that anticipates + the onset of congestion by estimating the bandwidth. TCP Vegas + adjusts the sending rate by modifying the congestion + window. TCP Vegas should provide less packet loss, but it is + not as aggressive as TCP Reno. + +config TCP_CONG_SCALABLE + tristate "Scalable TCP" + depends on INET && EXPERIMENTAL + default n + ---help--- + Scalable TCP is a sender-side only change to TCP which uses a + MIMD congestion control algorithm which has some nice scaling + properties, though is known to have fairness issues. + See http://www-lce.eng.cam.ac.uk/~ctk21/scalable/ + +endmenu + +config TCP_CONG_BIC + tristate + depends on !TCP_CONG_ADVANCED + default y + source "net/ipv4/ipvs/Kconfig" diff --git a/net/ipv4/Makefile b/net/ipv4/Makefile index 8b379627ebb..5718cdb3a61 100644 --- a/net/ipv4/Makefile +++ b/net/ipv4/Makefile @@ -5,10 +5,13 @@ obj-y := utils.o route.o inetpeer.o protocol.o \ ip_input.o ip_fragment.o ip_forward.o ip_options.o \ ip_output.o ip_sockglue.o \ - tcp.o tcp_input.o tcp_output.o tcp_timer.o tcp_ipv4.o tcp_minisocks.o \ + tcp.o tcp_input.o tcp_output.o tcp_timer.o tcp_ipv4.o \ + tcp_minisocks.o tcp_cong.o \ datagram.o raw.o udp.o arp.o icmp.o devinet.o af_inet.o igmp.o \ - sysctl_net_ipv4.o fib_frontend.o fib_semantics.o fib_hash.o + sysctl_net_ipv4.o fib_frontend.o fib_semantics.o +obj-$(CONFIG_IP_FIB_HASH) += fib_hash.o +obj-$(CONFIG_IP_FIB_TRIE) += fib_trie.o obj-$(CONFIG_PROC_FS) += proc.o obj-$(CONFIG_IP_MULTIPLE_TABLES) += fib_rules.o obj-$(CONFIG_IP_MROUTE) += ipmr.o @@ -28,6 +31,13 @@ obj-$(CONFIG_NETFILTER) += netfilter/ obj-$(CONFIG_IP_VS) += ipvs/ obj-$(CONFIG_IP_TCPDIAG) += tcp_diag.o obj-$(CONFIG_IP_ROUTE_MULTIPATH_CACHED) += multipath.o +obj-$(CONFIG_TCP_CONG_BIC) += tcp_bic.o +obj-$(CONFIG_TCP_CONG_WESTWOOD) += tcp_westwood.o +obj-$(CONFIG_TCP_CONG_HSTCP) += tcp_highspeed.o +obj-$(CONFIG_TCP_CONG_HYBLA) += tcp_hybla.o +obj-$(CONFIG_TCP_CONG_HTCP) += tcp_htcp.o +obj-$(CONFIG_TCP_CONG_VEGAS) += tcp_vegas.o +obj-$(CONFIG_TCP_CONG_SCALABLE) += tcp_scalable.o obj-$(CONFIG_XFRM) += xfrm4_policy.o xfrm4_state.o xfrm4_input.o \ xfrm4_output.o diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c index 03942f13394..658e7977924 100644 --- a/net/ipv4/af_inet.c +++ b/net/ipv4/af_inet.c @@ -1119,6 +1119,10 @@ module_init(inet_init); #ifdef CONFIG_PROC_FS extern int fib_proc_init(void); extern void fib_proc_exit(void); +#ifdef CONFIG_IP_FIB_TRIE +extern int fib_stat_proc_init(void); +extern void fib_stat_proc_exit(void); +#endif extern int ip_misc_proc_init(void); extern int raw_proc_init(void); extern void raw_proc_exit(void); @@ -1139,11 +1143,19 @@ static int __init ipv4_proc_init(void) goto out_udp; if (fib_proc_init()) goto out_fib; +#ifdef CONFIG_IP_FIB_TRIE + if (fib_stat_proc_init()) + goto out_fib_stat; + #endif if (ip_misc_proc_init()) goto out_misc; out: return rc; out_misc: +#ifdef CONFIG_IP_FIB_TRIE + fib_stat_proc_exit(); +out_fib_stat: +#endif fib_proc_exit(); out_fib: udp4_proc_exit(); diff --git a/net/ipv4/ah4.c b/net/ipv4/ah4.c index 0e98f2235b6..514c85b2631 100644 --- a/net/ipv4/ah4.c +++ b/net/ipv4/ah4.c @@ -200,7 +200,7 @@ static void ah4_err(struct sk_buff *skb, u32 info) xfrm_state_put(x); } -static int ah_init_state(struct xfrm_state *x, void *args) +static int ah_init_state(struct xfrm_state *x) { struct ah_data *ahp = NULL; struct xfrm_algo_desc *aalg_desc; diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c index 650dcb12d9a..d8a10e3dd77 100644 --- a/net/ipv4/devinet.c +++ b/net/ipv4/devinet.c @@ -1471,7 +1471,7 @@ static void devinet_sysctl_register(struct in_device *in_dev, * by sysctl and we wouldn't want anyone to change it under our feet * (see SIOCSIFNAME). */ - dev_name = net_sysctl_strdup(dev_name); + dev_name = kstrdup(dev_name, GFP_KERNEL); if (!dev_name) goto free; diff --git a/net/ipv4/esp4.c b/net/ipv4/esp4.c index eae84cc39d3..ba57446d5d1 100644 --- a/net/ipv4/esp4.c +++ b/net/ipv4/esp4.c @@ -362,7 +362,7 @@ static void esp_destroy(struct xfrm_state *x) kfree(esp); } -static int esp_init_state(struct xfrm_state *x, void *args) +static int esp_init_state(struct xfrm_state *x) { struct esp_data *esp = NULL; diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c index 563e7d61270..cd8e45ab958 100644 --- a/net/ipv4/fib_frontend.c +++ b/net/ipv4/fib_frontend.c @@ -516,6 +516,60 @@ static void fib_del_ifaddr(struct in_ifaddr *ifa) #undef BRD1_OK } +static void nl_fib_lookup(struct fib_result_nl *frn, struct fib_table *tb ) +{ + + struct fib_result res; + struct flowi fl = { .nl_u = { .ip4_u = { .daddr = frn->fl_addr, + .fwmark = frn->fl_fwmark, + .tos = frn->fl_tos, + .scope = frn->fl_scope } } }; + if (tb) { + local_bh_disable(); + + frn->tb_id = tb->tb_id; + frn->err = tb->tb_lookup(tb, &fl, &res); + + if (!frn->err) { + frn->prefixlen = res.prefixlen; + frn->nh_sel = res.nh_sel; + frn->type = res.type; + frn->scope = res.scope; + } + local_bh_enable(); + } +} + +static void nl_fib_input(struct sock *sk, int len) +{ + struct sk_buff *skb = NULL; + struct nlmsghdr *nlh = NULL; + struct fib_result_nl *frn; + int err; + u32 pid; + struct fib_table *tb; + + skb = skb_recv_datagram(sk, 0, 0, &err); + nlh = (struct nlmsghdr *)skb->data; + + frn = (struct fib_result_nl *) NLMSG_DATA(nlh); + tb = fib_get_table(frn->tb_id_in); + + nl_fib_lookup(frn, tb); + + pid = nlh->nlmsg_pid; /*pid of sending process */ + NETLINK_CB(skb).groups = 0; /* not in mcast group */ + NETLINK_CB(skb).pid = 0; /* from kernel */ + NETLINK_CB(skb).dst_pid = pid; + NETLINK_CB(skb).dst_groups = 0; /* unicast */ + netlink_unicast(sk, skb, pid, MSG_DONTWAIT); +} + +static void nl_fib_lookup_init(void) +{ + netlink_kernel_create(NETLINK_FIB_LOOKUP, nl_fib_input); +} + static void fib_disable_ip(struct net_device *dev, int force) { if (fib_sync_down(0, dev, force)) @@ -604,6 +658,7 @@ void __init ip_fib_init(void) register_netdevice_notifier(&fib_netdev_notifier); register_inetaddr_notifier(&fib_inetaddr_notifier); + nl_fib_lookup_init(); } EXPORT_SYMBOL(inet_addr_type); diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c new file mode 100644 index 00000000000..b56e88edf1b --- /dev/null +++ b/net/ipv4/fib_trie.c @@ -0,0 +1,2476 @@ +/* + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License + * as published by the Free Software Foundation; either version + * 2 of the License, or (at your option) any later version. + * + * Robert Olsson <robert.olsson@its.uu.se> Uppsala Universitet + * & Swedish University of Agricultural Sciences. + * + * Jens Laas <jens.laas@data.slu.se> Swedish University of + * Agricultural Sciences. + * + * Hans Liss <hans.liss@its.uu.se> Uppsala Universitet + * + * This work is based on the LPC-trie which is originally descibed in: + * + * An experimental study of compression methods for dynamic tries + * Stefan Nilsson and Matti Tikkanen. Algorithmica, 33(1):19-33, 2002. + * http://www.nada.kth.se/~snilsson/public/papers/dyntrie2/ + * + * + * IP-address lookup using LC-tries. Stefan Nilsson and Gunnar Karlsson + * IEEE Journal on Selected Areas in Communications, 17(6):1083-1092, June 1999 + * + * Version: $Id: fib_trie.c,v 1.3 2005/06/08 14:20:01 robert Exp $ + * + * + * Code from fib_hash has been reused which includes the following header: + * + * + * INET An implementation of the TCP/IP protocol suite for the LINUX + * operating system. INET is implemented using the BSD Socket + * interface as the means of communication with the user level. + * + * IPv4 FIB: lookup engine and maintenance routines. + * + * + * Authors: Alexey Kuznetsov, <kuznet@ms2.inr.ac.ru> + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License + * as published by the Free Software Foundation; either version + * 2 of the License, or (at your option) any later version. + */ + +#define VERSION "0.324" + +#include <linux/config.h> +#include <asm/uaccess.h> +#include <asm/system.h> +#include <asm/bitops.h> +#include <linux/types.h> +#include <linux/kernel.h> +#include <linux/sched.h> +#include <linux/mm.h> +#include <linux/string.h> +#include <linux/socket.h> +#include <linux/sockios.h> +#include <linux/errno.h> +#include <linux/in.h> +#include <linux/inet.h> +#include <linux/netdevice.h> +#include <linux/if_arp.h> +#include <linux/proc_fs.h> +#include <linux/skbuff.h> +#include <linux/netlink.h> +#include <linux/init.h> +#include <linux/list.h> +#include <net/ip.h> +#include <net/protocol.h> +#include <net/route.h> +#include <net/tcp.h> +#include <net/sock.h> +#include <net/ip_fib.h> +#include "fib_lookup.h" + +#undef CONFIG_IP_FIB_TRIE_STATS +#define MAX_CHILDS 16384 + +#define EXTRACT(p, n, str) ((str)<<(p)>>(32-(n))) +#define KEYLENGTH (8*sizeof(t_key)) +#define MASK_PFX(k, l) (((l)==0)?0:(k >> (KEYLENGTH-l)) << (KEYLENGTH-l)) +#define TKEY_GET_MASK(offset, bits) (((bits)==0)?0:((t_key)(-1) << (KEYLENGTH - bits) >> offset)) + +static DEFINE_RWLOCK(fib_lock); + +typedef unsigned int t_key; + +#define T_TNODE 0 +#define T_LEAF 1 +#define NODE_TYPE_MASK 0x1UL +#define NODE_PARENT(_node) \ +((struct tnode *)((_node)->_parent & ~NODE_TYPE_MASK)) +#define NODE_SET_PARENT(_node, _ptr) \ +((_node)->_parent = (((unsigned long)(_ptr)) | \ + ((_node)->_parent & NODE_TYPE_MASK))) +#define NODE_INIT_PARENT(_node, _type) \ +((_node)->_parent = (_type)) +#define NODE_TYPE(_node) \ +((_node)->_parent & NODE_TYPE_MASK) + +#define IS_TNODE(n) (!(n->_parent & T_LEAF)) +#define IS_LEAF(n) (n->_parent & T_LEAF) + +struct node { + t_key key; + unsigned long _parent; +}; + +struct leaf { + t_key key; + unsigned long _parent; + struct hlist_head list; +}; + +struct leaf_info { + struct hlist_node hlist; + int plen; + struct list_head falh; +}; + +struct tnode { + t_key key; + unsigned long _parent; + unsigned short pos:5; /* 2log(KEYLENGTH) bits needed */ + unsigned short bits:5; /* 2log(KEYLENGTH) bits needed */ + unsigned short full_children; /* KEYLENGTH bits needed */ + unsigned short empty_children; /* KEYLENGTH bits needed */ + struct node *child[0]; +}; + +#ifdef CONFIG_IP_FIB_TRIE_STATS +struct trie_use_stats { + unsigned int gets; + unsigned int backtrack; + unsigned int semantic_match_passed; + unsigned int semantic_match_miss; + unsigned int null_node_hit; +}; +#endif + +struct trie_stat { + unsigned int totdepth; + unsigned int maxdepth; + unsigned int tnodes; + unsigned int leaves; + unsigned int nullpointers; + unsigned int nodesizes[MAX_CHILDS]; +}; + +struct trie { + struct node *trie; +#ifdef CONFIG_IP_FIB_TRIE_STATS + struct trie_use_stats stats; +#endif + int size; + unsigned int revision; +}; + +static int trie_debug = 0; + +static int tnode_full(struct tnode *tn, struct node *n); +static void put_child(struct trie *t, struct tnode *tn, int i, struct node *n); +static void tnode_put_child_reorg(struct tnode *tn, int i, struct node *n, int wasfull); +static int tnode_child_length(struct tnode *tn); +static struct node *resize(struct trie *t, struct tnode *tn); +static struct tnode *inflate(struct trie *t, struct tnode *tn); +static struct tnode *halve(struct trie *t, struct tnode *tn); +static void tnode_free(struct tnode *tn); +static void trie_dump_seq(struct seq_file *seq, struct trie *t); +extern struct fib_alias *fib_find_alias(struct list_head *fah, u8 tos, u32 prio); +extern int fib_detect_death(struct fib_info *fi, int order, + struct fib_info **last_resort, int *last_idx, int *dflt); + +extern void rtmsg_fib(int event, u32 key, struct fib_alias *fa, int z, int tb_id, + struct nlmsghdr *n, struct netlink_skb_parms *req); + +static kmem_cache_t *fn_alias_kmem; +static struct trie *trie_local = NULL, *trie_main = NULL; + +static void trie_bug(char *err) +{ + printk("Trie Bug: %s\n", err); + BUG(); +} + +static inline struct node *tnode_get_child(struct tnode *tn, int i) +{ + if (i >= 1<<tn->bits) + trie_bug("tnode_get_child"); + + return tn->child[i]; +} + +static inline int tnode_child_length(struct tnode *tn) +{ + return 1<<tn->bits; +} + +/* + _________________________________________________________________ + | i | i | i | i | i | i | i | N | N | N | S | S | S | S | S | C | + ---------------------------------------------------------------- + 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 + + _________________________________________________________________ + | C | C | C | u | u | u | u | u | u | u | u | u | u | u | u | u | + ----------------------------------------------------------------- + 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 + + tp->pos = 7 + tp->bits = 3 + n->pos = 15 + n->bits=4 + KEYLENGTH=32 +*/ + +static inline t_key tkey_extract_bits(t_key a, int offset, int bits) +{ + if (offset < KEYLENGTH) + return ((t_key)(a << offset)) >> (KEYLENGTH - bits); + else + return 0; +} + +static inline int tkey_equals(t_key a, t_key b) +{ + return a == b; +} + +static inline int tkey_sub_equals(t_key a, int offset, int bits, t_key b) +{ + if (bits == 0 || offset >= KEYLENGTH) + return 1; + bits = bits > KEYLENGTH ? KEYLENGTH : bits; + return ((a ^ b) << offset) >> (KEYLENGTH - bits) == 0; +} + +static inline int tkey_mismatch(t_key a, int offset, t_key b) +{ + t_key diff = a ^ b; + int i = offset; + + if(!diff) + return 0; + while((diff << i) >> (KEYLENGTH-1) == 0) + i++; + return i; +} + +/* Candiate for fib_semantics */ + +static void fn_free_alias(struct fib_alias *fa) +{ + fib_release_info(fa->fa_info); + kmem_cache_free(fn_alias_kmem, fa); +} + +/* + To understand this stuff, an understanding of keys and all their bits is + necessary. Every node in the trie has a key associated with it, but not + all of the bits in that key are significant. + + Consider a node 'n' and its parent 'tp'. + + If n is a leaf, every bit in its key is significant. Its presence is + necessitaded by path compression, since during a tree traversal (when + searching for a leaf - unless we are doing an insertion) we will completely + ignore all skipped bits we encounter. Thus we need to verify, at the end of + a potentially successful search, that we have indeed been walking the + correct key path. + + Note that we can never "miss" the correct key in the tree if present by + following the wrong path. Path compression ensures that segments of the key + that are the same for all keys with a given prefix are skipped, but the + skipped part *is* identical for each node in the su |