aboutsummaryrefslogtreecommitdiff
path: root/net/sched/Kconfig
AgeCommit message (Collapse)Author
2013-10-29net: sched: cls_bpf: add BPF-based classifierDaniel Borkmann
This work contains a lightweight BPF-based traffic classifier that can serve as a flexible alternative to ematch-based tree classification, i.e. now that BPF filter engine can also be JITed in the kernel. Naturally, tc actions and policies are supported as well with cls_bpf. Multiple BPF programs/filter can be attached for a class, or they can just as well be written within a single BPF program, that's really up to the user how he wishes to run/optimize the code, e.g. also for inversion of verdicts etc. The notion of a BPF program's return/exit codes is being kept as follows: 0: No match -1: Select classid given in "tc filter ..." command else: flowid, overwrite the default one As a minimal usage example with iproute2, we use a 3 band prio root qdisc on a router with sfq each as leave, and assign ssh and icmp bpf-based filters to band 1, http traffic to band 2 and the rest to band 3. For the first two bands we load the bytecode from a file, in the 2nd we load it inline as an example: echo 1 > /proc/sys/net/core/bpf_jit_enable tc qdisc del dev em1 root tc qdisc add dev em1 root handle 1: prio bands 3 priomap 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 tc qdisc add dev em1 parent 1:1 sfq perturb 16 tc qdisc add dev em1 parent 1:2 sfq perturb 16 tc qdisc add dev em1 parent 1:3 sfq perturb 16 tc filter add dev em1 parent 1: bpf run bytecode-file /etc/tc/ssh.bpf flowid 1:1 tc filter add dev em1 parent 1: bpf run bytecode-file /etc/tc/icmp.bpf flowid 1:1 tc filter add dev em1 parent 1: bpf run bytecode-file /etc/tc/http.bpf flowid 1:2 tc filter add dev em1 parent 1: bpf run bytecode "`bpfc -f tc -i misc.ops`" flowid 1:3 BPF programs can be easily created and passed to tc, either as inline 'bytecode' or 'bytecode-file'. There are a couple of front-ends that can compile opcodes, for example: 1) People familiar with tcpdump-like filters: tcpdump -iem1 -ddd port 22 | tr '\n' ',' > /etc/tc/ssh.bpf 2) People that want to low-level program their filters or use BPF extensions that lack support by libpcap's compiler: bpfc -f tc -i ssh.ops > /etc/tc/ssh.bpf ssh.ops example code: ldh [12] jne #0x800, drop ldb [23] jneq #6, drop ldh [20] jset #0x1fff, drop ldxb 4 * ([14] & 0xf) ldh [%x + 14] jeq #0x16, pass ldh [%x + 16] jne #0x16, drop pass: ret #-1 drop: ret #0 It was chosen to load bytecode into tc, since the reverse operation, tc filter list dev em1, is then able to show the exact commands again. Possible follow-up work could also include a small expression compiler for iproute2. Tested with the help of bmon. This idea came up during the Netfilter Workshop 2013 in Copenhagen. Also thanks to feedback from Eric Dumazet! Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Cc: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-08-29pkt_sched: fq: Fair Queue packet schedulerEric Dumazet
- Uses perfect flow match (not stochastic hash like SFQ/FQ_codel) - Uses the new_flow/old_flow separation from FQ_codel - New flows get an initial credit allowing IW10 without added delay. - Special FIFO queue for high prio packets (no need for PRIO + FQ) - Uses a hash table of RB trees to locate the flows at enqueue() time - Smart on demand gc (at enqueue() time, RB tree lookup evicts old unused flows) - Dynamic memory allocations. - Designed to allow millions of concurrent flows per Qdisc. - Small memory footprint : ~8K per Qdisc, and 104 bytes per flow. - Single high resolution timer for throttled flows (if any). - One RB tree to link throttled flows. - Ability to have a max rate per flow. We might add a socket option to add per socket limitation. Attempts have been made to add TCP pacing in TCP stack, but this seems to add complex code to an already complex stack. TCP pacing is welcomed for flows having idle times, as the cwnd permits TCP stack to queue a possibly large number of packets. This removes the 'slow start after idle' choice, hitting badly large BDP flows, and applications delivering chunks of data as video streams. Nicely spaced packets : Here interface is 10Gbit, but flow bottleneck is ~20Mbit cwin is big, yet FQ avoids the typical bursts generated by TCP (as in netperf TCP_RR -- -r 100000,100000) 15:01:23.545279 IP A > B: . 78193:81089(2896) ack 65248 win 3125 <nop,nop,timestamp 1115 11597805> 15:01:23.545394 IP B > A: . ack 81089 win 3668 <nop,nop,timestamp 11597985 1115> 15:01:23.546488 IP A > B: . 81089:83985(2896) ack 65248 win 3125 <nop,nop,timestamp 1115 11597805> 15:01:23.546565 IP B > A: . ack 83985 win 3668 <nop,nop,timestamp 11597986 1115> 15:01:23.547713 IP A > B: . 83985:86881(2896) ack 65248 win 3125 <nop,nop,timestamp 1115 11597805> 15:01:23.547778 IP B > A: . ack 86881 win 3668 <nop,nop,timestamp 11597987 1115> 15:01:23.548911 IP A > B: . 86881:89777(2896) ack 65248 win 3125 <nop,nop,timestamp 1115 11597805> 15:01:23.548949 IP B > A: . ack 89777 win 3668 <nop,nop,timestamp 11597988 1115> 15:01:23.550116 IP A > B: . 89777:92673(2896) ack 65248 win 3125 <nop,nop,timestamp 1115 11597805> 15:01:23.550182 IP B > A: . ack 92673 win 3668 <nop,nop,timestamp 11597989 1115> 15:01:23.551333 IP A > B: . 92673:95569(2896) ack 65248 win 3125 <nop,nop,timestamp 1115 11597805> 15:01:23.551406 IP B > A: . ack 95569 win 3668 <nop,nop,timestamp 11597991 1115> 15:01:23.552539 IP A > B: . 95569:98465(2896) ack 65248 win 3125 <nop,nop,timestamp 1115 11597805> 15:01:23.552576 IP B > A: . ack 98465 win 3668 <nop,nop,timestamp 11597992 1115> 15:01:23.553756 IP A > B: . 98465:99913(1448) ack 65248 win 3125 <nop,nop,timestamp 1115 11597805> 15:01:23.554138 IP A > B: P 99913:100001(88) ack 65248 win 3125 <nop,nop,timestamp 1115 11597805> 15:01:23.554204 IP B > A: . ack 100001 win 3668 <nop,nop,timestamp 11597993 1115> 15:01:23.554234 IP B > A: . 65248:68144(2896) ack 100001 win 3668 <nop,nop,timestamp 11597993 1115> 15:01:23.555620 IP B > A: . 68144:71040(2896) ack 100001 win 3668 <nop,nop,timestamp 11597993 1115> 15:01:23.557005 IP B > A: . 71040:73936(2896) ack 100001 win 3668 <nop,nop,timestamp 11597993 1115> 15:01:23.558390 IP B > A: . 73936:76832(2896) ack 100001 win 3668 <nop,nop,timestamp 11597993 1115> 15:01:23.559773 IP B > A: . 76832:79728(2896) ack 100001 win 3668 <nop,nop,timestamp 11597993 1115> 15:01:23.561158 IP B > A: . 79728:82624(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115> 15:01:23.562543 IP B > A: . 82624:85520(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115> 15:01:23.563928 IP B > A: . 85520:88416(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115> 15:01:23.565313 IP B > A: . 88416:91312(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115> 15:01:23.566698 IP B > A: . 91312:94208(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115> 15:01:23.568083 IP B > A: . 94208:97104(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115> 15:01:23.569467 IP B > A: . 97104:100000(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115> 15:01:23.570852 IP B > A: . 100000:102896(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115> 15:01:23.572237 IP B > A: . 102896:105792(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115> 15:01:23.573639 IP B > A: . 105792:108688(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115> 15:01:23.575024 IP B > A: . 108688:111584(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115> 15:01:23.576408 IP B > A: . 111584:114480(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115> 15:01:23.577793 IP B > A: . 114480:117376(2896) ack 100001 win 3668 <nop,nop,timestamp 11597994 1115> TCP timestamps show that most packets from B were queued in the same ms timeframe (TSval 1159799{3,4}), but FQ managed to send them right in time to avoid a big burst. In slow start or steady state, very few packets are throttled [1] FQ gets a bunch of tunables as : limit : max number of packets on whole Qdisc (default 10000) flow_limit : max number of packets per flow (default 100) quantum : the credit per RR round (default is 2 MTU) initial_quantum : initial credit for new flows (default is 10 MTU) maxrate : max per flow rate (default : unlimited) buckets : number of RB trees (default : 1024) in hash table. (consumes 8 bytes per bucket) [no]pacing : disable/enable pacing (default is enable) All of them can be changed on a live qdisc. $ tc qd add dev eth0 root fq help Usage: ... fq [ limit PACKETS ] [ flow_limit PACKETS ] [ quantum BYTES ] [ initial_quantum BYTES ] [ maxrate RATE ] [ buckets NUMBER ] [ [no]pacing ] $ tc -s -d qd qdisc fq 8002: dev eth0 root refcnt 32 limit 10000p flow_limit 100p buckets 256 quantum 3028 initial_quantum 15140 Sent 216532416 bytes 148395 pkt (dropped 0, overlimits 0 requeues 14) backlog 0b 0p requeues 14 511 flows, 511 inactive, 0 throttled 110 gc, 0 highprio, 0 retrans, 1143 throttled, 0 flows_plimit [1] Except if initial srtt is overestimated, as if using cached srtt in tcp metrics. We'll provide a fix for this issue. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-11-25net: sched: enable CAN Identifier to be build into kernelMarc Kleine-Budde
This patch makes it possible to build the CAN Identifier into the kernel, even if the CAN support is build as a module. Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-12net: sched: add ipset ematchFlorian Westphal
Can be used to match packets against netfilter ip sets created via ipset(8). skb->sk_iif is used as 'incoming interface', skb->dev is 'outgoing interface'. Since ipset is usually called from netfilter, the ematch initializes a fake xt_action_param, pulls the ip header into the linear area and also sets skb->data to the IP header (otherwise matching Layer 4 set types doesn't work). Tested-by: Mr Dash Four <mr.dash.four@googlemail.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-07-04net: em_canid: Ematch rule to match CAN frames according to their identifiersRostislav Lisovy
This ematch makes it possible to classify CAN frames (AF_CAN) according to their identifiers. This functionality can not be easily achieved with existing classifiers, such as u32, because CAN identifier is always stored in native endianness, whereas u32 expects Network byte order. Signed-off-by: Rostislav Lisovy <lisovy@gmail.com> Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net> Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2012-05-12fq_codel: Fair Queue Codel AQMEric Dumazet
Fair Queue Codel packet scheduler Principles : - Packets are classified (internal classifier or external) on flows. - This is a Stochastic model (as we use a hash, several flows might be hashed on same slot) - Each flow has a CoDel managed queue. - Flows are linked onto two (Round Robin) lists, so that new flows have priority on old ones. - For a given flow, packets are not reordered (CoDel uses a FIFO) - head drops only. - ECN capability is on by default. - Very low memory footprint (64 bytes per flow) tc qdisc ... fq_codel [ limit PACKETS ] [ flows number ] [ target TIME ] [ interval TIME ] [ noecn ] [ quantum BYTES ] defaults : 1024 flows, 10240 packets limit, quantum : device MTU target : 5ms (CoDel default) interval : 100ms (CoDel default) Impressive results on load : class htb 1:1 root leaf 10: prio 0 quantum 1514 rate 200000Kbit ceil 200000Kbit burst 1475b/8 mpu 0b overhead 0b cburst 1475b/8 mpu 0b overhead 0b level 0 Sent 43304920109 bytes 33063109 pkt (dropped 0, overlimits 0 requeues 0) rate 201691Kbit 28595pps backlog 0b 312p requeues 0 lended: 33063109 borrowed: 0 giants: 0 tokens: -912 ctokens: -912 class fq_codel 10:1735 parent 10: (dropped 1292, overlimits 0 requeues 0) backlog 15140b 10p requeues 0 deficit 1514 count 1 lastcount 1 ldelay 7.1ms class fq_codel 10:4524 parent 10: (dropped 1291, overlimits 0 requeues 0) backlog 16654b 11p requeues 0 deficit 1514 count 1 lastcount 1 ldelay 7.1ms class fq_codel 10:4e74 parent 10: (dropped 1290, overlimits 0 requeues 0) backlog 6056b 4p requeues 0 deficit 1514 count 1 lastcount 1 ldelay 6.4ms dropping drop_next 92.0ms class fq_codel 10:628a parent 10: (dropped 1289, overlimits 0 requeues 0) backlog 7570b 5p requeues 0 deficit 1514 count 1 lastcount 1 ldelay 5.4ms dropping drop_next 90.9ms class fq_codel 10:a4b3 parent 10: (dropped 302, overlimits 0 requeues 0) backlog 16654b 11p requeues 0 deficit 1514 count 1 lastcount 1 ldelay 7.1ms class fq_codel 10:c3c2 parent 10: (dropped 1284, overlimits 0 requeues 0) backlog 13626b 9p requeues 0 deficit 1514 count 1 lastcount 1 ldelay 5.9ms class fq_codel 10:d331 parent 10: (dropped 299, overlimits 0 requeues 0) backlog 15140b 10p requeues 0 deficit 1514 count 1 lastcount 1 ldelay 7.0ms class fq_codel 10:d526 parent 10: (dropped 12160, overlimits 0 requeues 0) backlog 35870b 211p requeues 0 deficit 1508 count 12160 lastcount 1 ldelay 15.3ms dropping drop_next 247us class fq_codel 10:e2c6 parent 10: (dropped 1288, overlimits 0 requeues 0) backlog 15140b 10p requeues 0 deficit 1514 count 1 lastcount 1 ldelay 7.1ms class fq_codel 10:eab5 parent 10: (dropped 1285, overlimits 0 requeues 0) backlog 16654b 11p requeues 0 deficit 1514 count 1 lastcount 1 ldelay 5.9ms class fq_codel 10:f220 parent 10: (dropped 1289, overlimits 0 requeues 0) backlog 15140b 10p requeues 0 deficit 1514 count 1 lastcount 1 ldelay 7.1ms qdisc htb 1: root refcnt 6 r2q 10 default 1 direct_packets_stat 0 ver 3.17 Sent 43331086547 bytes 33092812 pkt (dropped 0, overlimits 66063544 requeues 71) rate 201697Kbit 28602pps backlog 0b 260p requeues 71 qdisc fq_codel 10: parent 1:1 limit 10240p flows 65536 target 5.0ms interval 100.0ms ecn Sent 43331086547 bytes 33092812 pkt (dropped 949359, overlimits 0 requeues 0) rate 201697Kbit 28602pps backlog 189352b 260p requeues 0 maxpacket 1514 drop_overlimit 0 new_flow_count 5582 ecn_mark 125593 new_flows_len 0 old_flows_len 11 PING 172.30.42.18 (172.30.42.18) 56(84) bytes of data. 64 bytes from 172.30.42.18: icmp_req=1 ttl=64 time=0.227 ms 64 bytes from 172.30.42.18: icmp_req=2 ttl=64 time=0.165 ms 64 bytes from 172.30.42.18: icmp_req=3 ttl=64 time=0.166 ms 64 bytes from 172.30.42.18: icmp_req=4 ttl=64 time=0.151 ms 64 bytes from 172.30.42.18: icmp_req=5 ttl=64 time=0.164 ms 64 bytes from 172.30.42.18: icmp_req=6 ttl=64 time=0.172 ms 64 bytes from 172.30.42.18: icmp_req=7 ttl=64 time=0.175 ms 64 bytes from 172.30.42.18: icmp_req=8 ttl=64 time=0.183 ms 64 bytes from 172.30.42.18: icmp_req=9 ttl=64 time=0.158 ms 64 bytes from 172.30.42.18: icmp_req=10 ttl=64 time=0.200 ms 10 packets transmitted, 10 received, 0% packet loss, time 8999ms rtt min/avg/max/mdev = 0.151/0.176/0.227/0.022 ms Much better than SFQ because of priority given to new flows, and fast path dirtying less cache lines. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-05-10codel: Controlled Delay AQMEric Dumazet
An implementation of CoDel AQM, from Kathleen Nichols and Van Jacobson. http://queue.acm.org/detail.cfm?id=2209336 This AQM main input is no longer queue size in bytes or packets, but the delay packets stay in (FIFO) queue. As we don't have infinite memory, we still can drop packets in enqueue() in case of massive load, but mean of CoDel is to drop packets in dequeue(), using a control law based on two simple parameters : target : target sojourn time (default 5ms) interval : width of moving time window (default 100ms) Based on initial work from Dave Taht. Refactored to help future codel inclusion as a plugin for other linux qdisc (FQ_CODEL, ...), like RED. include/net/codel.h contains codel algorithm as close as possible than Kathleen reference. net/sched/sch_codel.c contains the linux qdisc specific glue. Separate structures permit a memory efficient implementation of fq_codel (to be sent as a separate work) : Each flow has its own struct codel_vars. timestamps are taken at enqueue() time with 1024 ns precision, allowing a range of 2199 seconds in queue, and 100Gb links support. iproute2 uses usec as base unit. Selected packets are dropped, unless ECN is enabled and packets can get ECN mark instead. Tested from 2Mb to 10Gb speeds with no particular problems, on ixgbe and tg3 drivers (BQL enabled). Usage: tc qdisc ... codel [ limit PACKETS ] [ target TIME ] [ interval TIME ] [ ecn ] qdisc codel 10: parent 1:1 limit 2000p target 3.0ms interval 60.0ms ecn Sent 13347099587 bytes 8815805 pkt (dropped 0, overlimits 0 requeues 0) rate 202365Kbit 16708pps backlog 113550b 75p requeues 0 count 116 lastcount 98 ldelay 4.3ms dropping drop_next 816us maxpacket 1514 ecn_mark 84399 drop_overlimit 0 CoDel must be seen as a base module, and should be used keeping in mind there is still a FIFO queue. So a typical setup will probably need a hierarchy of several qdiscs and packet classifiers to be able to meet whatever constraints a user might have. One possible example would be to use fq_codel, which combines Fair Queueing and CoDel, in replacement of sfq / sfq_red. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Dave Taht <dave.taht@bufferbloat.net> Cc: Kathleen Nichols <nichols@pollere.com> Cc: Van Jacobson <van@pollere.net> Cc: Tom Herbert <therbert@google.com> Cc: Matt Mathis <mattmathis@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-02-07net/sched: sch_plug - Queue traffic until an explicit release commandShriram Rajagopalan
The qdisc supports two operations - plug and unplug. When the qdisc receives a plug command via netlink request, packets arriving henceforth are buffered until a corresponding unplug command is received. Depending on the type of unplug command, the queue can be unplugged indefinitely or selectively. This qdisc can be used to implement output buffering, an essential functionality required for consistent recovery in checkpoint based fault-tolerance systems. Output buffering enables speculative execution by allowing generated network traffic to be rolled back. It is used to provide network protection for Xen Guests in the Remus high availability project, available as part of Xen. This module is generic enough to be used by any other system that wishes to add speculative execution and output buffering to its applications. This module was originally available in the linux 2.6.32 PV-OPS tree, used as dom0 for Xen. For more information, please refer to http://nss.cs.ubc.ca/remus/ and http://wiki.xensource.com/xenwiki/Remus Changes in V3: * Removed debug output (printk) on queue overflow * Added TCQ_PLUG_RELEASE_INDEFINITE - that allows the user to use this qdisc, for simple plug/unplug operations. * Use of packet counts instead of pointers to keep track of the buffers in the queue. Signed-off-by: Shriram Rajagopalan <rshriram@cs.ubc.ca> Signed-off-by: Brendan Cully <brendan@cs.ubc.ca> [author of the code in the linux 2.6.32 pvops tree] Signed-off-by: David S. Miller <davem@davemloft.net>
2011-05-19networking: NET_CLS_ROUTE4 depends on INETRandy Dunlap
IP_ROUTE_CLASSID depends on INET and NET_CLS_ROUTE4 selects IP_ROUTE_CLASSID, but when INET is not enabled, this kconfig warning is produced, so fix it by making NET_CLS_ROUTE4 depend on INET. warning: (NET_CLS_ROUTE4) selects IP_ROUTE_CLASSID which has unmet direct dependencies (NET && INET) Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-04-04pkt_sched: QFQ - quick fair queue schedulerstephen hemminger
This is an implementation of the Quick Fair Queue scheduler developed by Fabio Checconi. The same algorithm is already implemented in ipfw in FreeBSD. Fabio had an earlier version developed on Linux, I just cleaned it up. Thanks to Eric Dumazet for testing this under load. Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-02-23net_sched: SFB flow schedulerEric Dumazet
This is the Stochastic Fair Blue scheduler, based on work from : W. Feng, D. Kandlur, D. Saha, K. Shin. Blue: A New Class of Active Queue Management Algorithms. U. Michigan CSE-TR-387-99, April 1999. http://www.thefengs.com/wuchang/blue/CSE-TR-387-99.pdf This implementation is based on work done by Juliusz Chroboczek General SFB algorithm can be found in figure 14, page 15: B[l][n] : L x N array of bins (L levels, N bins per level) enqueue() Calculate hash function values h{0}, h{1}, .. h{L-1} Update bins at each level for i = 0 to L - 1 if (B[i][h{i}].qlen > bin_size) B[i][h{i}].p_mark += p_increment; else if (B[i][h{i}].qlen == 0) B[i][h{i}].p_mark -= p_decrement; p_min = min(B[0][h{0}].p_mark ... B[L-1][h{L-1}].p_mark); if (p_min == 1.0) ratelimit(); else mark/drop with probabilty p_min; I did the adaptation of Juliusz code to meet current kernel standards, and various changes to address previous comments : http://thread.gmane.org/gmane.linux.network/90225 http://thread.gmane.org/gmane.linux.network/90375 Default flow classifier is the rxhash introduced by RPS in 2.6.35, but we can use an external flow classifier if wanted. tc qdisc add dev $DEV parent 1:11 handle 11: \ est 0.5sec 2sec sfb limit 128 tc filter add dev $DEV protocol ip parent 11: handle 3 \ flow hash keys dst divisor 1024 Notes: 1) SFB default child qdisc is pfifo_fast. It can be changed by another qdisc but a child qdisc MUST not drop a packet previously queued. This is because SFB needs to handle a dequeued packet in order to maintain its virtual queue states. pfifo_head_drop or CHOKe should not be used. 2) ECN is enabled by default, unlike RED/CHOKe/GRED With help from Patrick McHardy & Andi Kleen Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> CC: Juliusz Chroboczek <Juliusz.Chroboczek@pps.jussieu.fr> CC: Stephen Hemminger <shemminger@vyatta.com> CC: Patrick McHardy <kaber@trash.net> CC: Andi Kleen <andi@firstfloor.org> CC: John W. Linville <linville@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-02-02sched: CHOKe flow schedulerstephen hemminger
CHOKe ("CHOose and Kill" or "CHOose and Keep") is an alternative packet scheduler based on the Random Exponential Drop (RED) algorithm. The core idea is: For every packet arrival: Calculate Qave if (Qave < minth) Queue the new packet else Select randomly a packet from the queue if (both packets from same flow) then Drop both the packets else if (Qave > maxth) Drop packet else Admit packet with proability p (same as RED) See also: Rong Pan, Balaji Prabhakar, Konstantinos Psounis, "CHOKe: a stateless active queue management scheme for approximating fair bandwidth allocation", Proceeding of INFOCOM'2000, March 2000. Help from: Eric Dumazet <eric.dumazet@gmail.com> Patrick McHardy <kaber@trash.net> Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-01-20Merge branch 'master' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/kaber/nf-next-2.6
2011-01-19net_sched: implement a root container qdisc sch_mqprioJohn Fastabend
This implements a mqprio queueing discipline that by default creates a pfifo_fast qdisc per tx queue and provides the needed configuration interface. Using the mqprio qdisc the number of tcs currently in use along with the range of queues alloted to each class can be configured. By default skbs are mapped to traffic classes using the skb priority. This mapping is configurable. Configurable parameters, struct tc_mqprio_qopt { __u8 num_tc; __u8 prio_tc_map[TC_BITMASK + 1]; __u8 hw; __u16 count[TC_MAX_QUEUE]; __u16 offset[TC_MAX_QUEUE]; }; Here the count/offset pairing give the queue alignment and the prio_tc_map gives the mapping from skb->priority to tc. The hw bit determines if the hardware should configure the count and offset values. If the hardware bit is set then the operation will fail if the hardware does not implement the ndo_setup_tc operation. This is to avoid undetermined states where the hardware may or may not control the queue mapping. Also minimal bounds checking is done on the count/offset to verify a queue does not exceed num_tx_queues and that queue ranges do not overlap. Otherwise it is left to user policy or hardware configuration to create useful mappings. It is expected that hardware QOS schemes can be implemented by creating appropriate mappings of queues in ndo_tc_setup(). One expected use case is drivers will use the ndo_setup_tc to map queue ranges onto 802.1Q traffic classes. This provides a generic mechanism to map network traffic onto these traffic classes and removes the need for lower layer drivers to know specifics about traffic types. Signed-off-by: John Fastabend <john.r.fastabend@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-01-19Merge branch 'master' of /repos/git/net-next-2.6Patrick McHardy
2011-01-14netfilter: fix Kconfig dependenciesPatrick McHardy
Fix dependencies of netfilter realm match: it depends on NET_CLS_ROUTE, which itself depends on NET_SCHED; this dependency is missing from netfilter. Since matching on realms is also useful without having NET_SCHED enabled and the option really only controls whether the tclassid member is included in route and dst entries, rename the config option to IP_ROUTE_CLASSID and move it outside of traffic scheduling context to get rid of the NET_SCHED dependeny. Reported-by: Vladis Kletnieks <Valdis.Kletnieks@vt.edu> Signed-off-by: Patrick McHardy <kaber@trash.net>
2010-11-15Docs/Kconfig: Update: osdl.org -> linuxfoundation.orgMichael Witten
Some of the documentation refers to web pages under the domain `osdl.org'. However, `osdl.org' now redirects to `linuxfoundation.org'. Rather than rely on redirections, this patch updates the addresses appropriately; for the most part, only documentation that is meant to be current has been updated. The patch should be pretty quick to scan and check; each new web-page url was gotten by trying out the original URL in a browser and then simply copying the the redirected URL (formatting as necessary). There is some conflict as to which one of these domain names is preferred: linuxfoundation.org linux-foundation.org So, I wrote: info@linuxfoundation.org and got this reply: Message-ID: <4CE17EE6.9040807@linuxfoundation.org> Date: Mon, 15 Nov 2010 10:41:42 -0800 From: David Ames <david@linuxfoundation.org> ... linuxfoundation.org is preferred. The canonical name for our web site is www.linuxfoundation.org. Our list site is actually lists.linux-foundation.org. Regarding email linuxfoundation.org is preferred there are a few people who choose to use linux-foundation.org for their own reasons. Consequently, I used `linuxfoundation.org' for web pages and `lists.linux-foundation.org' for mailing-list web pages and email addresses; the only personal email address I updated from `@osdl.org' was that of Andrew Morton, who prefers `linux-foundation.org' according `git log'. Signed-off-by: Michael Witten <mfwitten@gmail.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
2010-08-23pkt_sched: Make act_csum depend upon INET.David S. Miller
It uses ip_send_check() and stuff like that. Reported-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-08-20net/sched: add ACT_CSUM action to update packets checksumsGrégoire Baron
net/sched: add ACT_CSUM action to update packets checksums ACT_CSUM can be called just after ACT_PEDIT in order to re-compute some altered checksums in IPv4 and IPv6 packets. The following checksums are supported by this patch: - IPv4: IPv4 header, ICMP, IGMP, TCP, UDP & UDPLite - IPv6: ICMPv6, TCP, UDP & UDPLite It's possible to request in the same action to update different kind of checksums, if the packets flow mix TCP, UDP and UDPLite, ... An example of usage is done in the associated iproute2 patch. Version 3 changes: - remove useless goto instructions - improve IPv6 hop options decoding Version 2 changes: - coding style correction - remove useless arguments of some functions - use stack in tcf_csum_dump() - add tcf_csum_skb_nextlayer() to factor code Signed-off-by: Gregoire Baron <baronchon@n7mm.org> Acked-by: jamal <hadi@cyberus.ca> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-03-23cgroups: net_cls as moduleBen Blum
Allows the net_cls cgroup subsystem to be compiled as a module This patch modifies net/sched/cls_cgroup.c to allow the net_cls subsystem to be optionally compiled as a module instead of builtin. The cgroup_subsys struct is moved around a bit to allow the subsys_id to be either declared as a compile-time constant by the cgroup_subsys.h include in cgroup.h, or, if it's a module, initialized within the struct by cgroup_load_subsys. Signed-off-by: Ben Blum <bblum@andrew.cmu.edu> Acked-by: Li Zefan <lizf@cn.fujitsu.com> Cc: Paul Menage <menage@google.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Lai Jiangshan <laijs@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2010-02-08net/sched: Fix module name in KconfigJan Luebbe
The action modules have been prefixed with 'act_', but the Kconfig description was not changed. Signed-off-by: Jan Luebbe <jluebbe@debian.org> Acked-by: Jamal Hadi Salim <hadi@cyberus.ca> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-12-29cls_cgroup: clean up KconfigLi Zefan
cls_cgroup can't be compiled as a module, since it's not supported by cgroup. Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-20pkt_sched: add DRR schedulerPatrick McHardy
Add classful DRR scheduler as a more flexible replacement for SFQ. The main difference to the algorithm described in "Efficient Fair Queueing using Deficit Round Robin" is that this implementation doesn't drop packets from the longest queue on overrun because its classful and limits are handled by each individual child qdisc. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-11-07pkt_sched: Control group classifierThomas Graf
The classifier should cover the most common use case and will work without any special configuration. The principle of the classifier is to directly access the task_struct via get_current(). In order for this to work, classification requests from softirqs must be ignored. This is not a problem because the vast majority of packets in softirq context are not assigned to a task anyway. For this to work, a mechanism is needed to trace softirq context. This repost goes back to the method of relying on the number of nested bh disable calls for the sake of not adding too much complexity and the option to come up with something more reliable if actually needed. Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-09-12pkt_action: add new action skbeditAlexander Duyck
This new action will have the ability to change the priority and/or queue_mapping fields on an sk_buff. Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-09-12pkt_sched: Add multiqueue scheduler supportAlexander Duyck
This patch is intended to add a qdisc to support the new tx multiqueue architecture by providing a band for each hardware queue. By doing this it is possible to support a different qdisc per physical hardware queue. This qdisc uses the skb->queue_mapping to select which band to place the traffic onto. It then uses a round robin w/ a check to see if the subqueue is stopped to determine which band to dequeue the packet from. Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-06-27pkt_sched: Remove CONFIG_NET_SCH_RRAdrian Bunk
Commit d62733c8e437fdb58325617c4b3331769ba82d70 ([SCHED]: Qdisc changes and sch_rr added for multiqueue) added a NET_SCH_RR option that was unused since the code went unconditionally into sch_prio. Reported-by: Robert P. J. Day <rpjday@crashcourse.ca> Signed-off-by: Adrian Bunk <bunk@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-31[NET_SCHED]: Add flow classifierPatrick McHardy
Add new "flow" classifier, which is meant to extend the SFQ hashing capabilities without hard-coding new hash functions and also allows deterministic mappings of keys to classes, replacing some out of tree iptables patches like IPCLASSIFY (maps IPs to classes), IPMARK (maps IPs to marks, with fw filters to classes), ... Some examples: - Classic SFQ hash: tc filter add ... flow hash \ keys src,dst,proto,proto-src,proto-dst divisor 1024 - Classic SFQ hash, but using information from conntrack to work properly in combination with NAT: tc filter add ... flow hash \ keys nfct-src,nfct-dst,proto,nfct-proto-src,nfct-proto-dst divisor 1024 - Map destination IPs of 192.168.0.0/24 to classids 1-257: tc filter add ... flow map \ key dst addend -192.168.0.0 divisor 256 - alternatively: tc filter add ... flow map \ key dst and 0xff - similar, but reverse ordered: tc filter add ... flow map \ key dst and 0xff xor 0xff Perturbation is currently not supported because we can't reliable kill the timer on destruction. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-31[NET_SCHED]: sch_ingress: remove netfilter supportPatrick McHardy
Since the old policer code is gone, TC actions are needed for policing. The ingress qdisc can get packets directly from netif_receive_skb() in case TC actions are enabled or through netfilter otherwise, but since without TC actions there is no policer the only thing it actually does is count packets. Remove the netfilter support and always require TC actions. Signed-off-by: Patrick McHardy <kaber@trash.net> Acked-by: Jamal Hadi Salim <hadi@cyberus.ca> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28[NET_SCHED]: kill obsolete NET_CLS_POLICE optionPatrick McHardy
The code is already gone for about half a year, the config option has been kept around to select the replacement options for easier upgrades. This seems long enough, people upgrading from older kernels will have to reconfigure a lot anyway. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28[NET_SCHED]: sch_ingress: move dependencies to KconfigPatrick McHardy
Instead of complaining at scheduler initialization time, check the dependencies in Kconfig. Signed-off-by: Patrick McHardy <kaber@trash.net> Acked-by: Jamal Hadi Salim <hadi@cyberus.ca> Signed-off-by: David S. Miller <davem@davemloft.net>
2008-01-28[NETFILTER]: Convert old checksum helper namesPatrick McHardy
Kill the defines again, convert to the new checksum helper names and remove the dependency of NET_ACT_NAT on NETFILTER. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-18[NET]: QoS/Sched as menuconfigRandy Dunlap
Convert "QoS and/or fair queueing" to menuconfig. This makes it easy for someone to disable all sub-options with one config symbol. Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2007-10-10[PKT_SCHED]: Add stateless NATHerbert Xu
Stateless NAT is useful in controlled environments where restrictions are placed on through traffic such that we don't need connection tracking to correctly NAT protocol-specific data. In particular, this is of interest when the number of flows or the number of addresses being NATed is large, or if connection tracking information has to be replicated and where it is not practical to do so. Previously we had stateless NAT functionality which was integrated into the IPv4 routing subsystem. This was a great solution as long as the NAT worked on a subnet to subnet basis such that the number of NAT rules was relatively small. The reason is that for SNAT the routing based system had to perform a linear scan through the rules. If the number of rules is large then major renovations would have take place in the routing subsystem to make this practical. For the time being, the least intrusive way of achieving this is to use the u32 classifier written by Alexey Kuznetsov along with the actions infrastructure implemented by Jamal Hadi Salim. The following patch is an attempt at this problem by creating a new nat action that can be invoked from u32 hash tables which would allow large number of stateless NAT rules that can be used/updated in constant time. The actual NAT code is mostly based on the previous stateless NAT code written by Alexey. In future we might be able to utilise the protocol NAT code from netfilter to improve support for other protocols. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2007-07-18[PKT_SCHED]: Some typo fixes in net/sched/KconfigGabriel Craciunescu
Signed-off-by: Gabriel Craciunescu <nix.or.die@googlemail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2007-07-15[NET_SCHED]: Kill CONFIG_NET_CLS_POLICEPatrick McHardy
The NET_CLS_ACT option is now a full replacement for NET_CLS_POLICE, remove the old code. The config option will be kept around to select the equivalent NET_CLS_ACT options for a short time to allow easier upgrades. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2007-07-10[NET_SCHED]: Remove CONFIG_NET_ESTIMATOR optionPatrick McHardy
The generic estimator is always built in anways and all the config options does is prevent including a minimal amount of code for setting it up. Additionally the option is already automatically selected for most cases. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2007-07-10[SCHED]: Qdisc changes and sch_rr added for multiqueuePeter P Waskiewicz Jr
Add the new sch_rr qdisc for multiqueue network device support. Allow sch_prio and sch_rr to be compiled with or without multiqueue hardware support. sch_rr is part of sch_prio, and is referenced from MODULE_ALIAS. This was done since sch_prio and sch_rr only differ in their dequeue routine. Signed-off-by: Peter P Waskiewicz Jr <peter.p.waskiewicz.jr@intel.com> Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2007-04-25[NET_SCHED]: Use ktime as clocksourcePatrick McHardy
Get rid of the manual clock source selection mess and use ktime. Also use a scalar representation, which allows to clean up pkt_sched.h a bit more and results in less ktime_to_ns() calls in most cases. The PSCHED_US2JIFFIE/PSCHED_JIFFIE2US macros are implemented quite inefficient by this patch, following patches will convert all qdiscs to hrtimers and get rid of them entirely. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2006-12-02[PKT_SCHED]: Make sch_fifo.o available when CONFIG_NET_SCHED is not set.David Kimdon
Based on patch by Patrick McHardy. Add a new option, NET_SCH_FIFO, which provides a simple fifo qdisc without requiring CONFIG_NET_SCHED. The d80211 stack needs a generic fifo qdisc for WME. At present it uses net/d80211/fifo_qdisc.c which is functionally equivalent to sch_fifo.c. This patch will allow the d80211 stack to remove net/d80211/fifo_qdisc.c and use sch_fifo.c instead. Signed-off-by: David Kimdon <david.kimdon@devicescape.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2006-12-02[NET]: Turn nfmark into generic markThomas Graf
nfmark is being used in various subsystems and has become the defacto mark field for all kinds of packets. Therefore it makes sense to rename it to `mark' and remove the dependency on CONFIG_NETFILTER. Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2006-06-30Kconfig: Typos in net/sched/KconfigMatt LaPlante
Signed-off-by: Adrian Bunk <bunk@stusta.de>
2006-03-20[PKT_SCHED]: Let NET_CLS_ACT no longer depend on EXPERIMENTALAdrian Bunk
This option should IMHO no longer depend on EXPERIMENTAL. Signed-off-by: Adrian Bunk <bunk@stusta.de> ACKed-by: Jamal Hadi Salim <hadi@cyberus.ca> Signed-off-by: David S. Miller <davem@davemloft.net>
2006-01-13[PKT_SCHED]: Change default clock source to gettimeofdayPatrick McHardy
The default of using jiffies is very bad and results in underutilization except with very low bandwidth. Signed-off-by: Patrick McHardy <kaber@trash.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2006-01-11[PKT_SCHED] net/sched/Kconfig: fix typo in NET_EMATCH_META descriptionAdrian Bunk
Noted by Matt LaPlante <webmaster@cyberdogtech.com>. Signed-off-by: Adrian Bunk <bunk@stusta.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2005-11-17[NET]: Sanitize NET_SCHED protection in /net/sched/KconfigRoman Zippel
On Thu, 17 Nov 2005, David Gómez wrote: > I found out that if i select NET_CLS_ROUTE4, save my changes and exit > menuconfig, execute again make menuconfig and go to QoS options, then the new > available options are visible. So menuconfig has some problem refreshing > contents :? No, they were there before too, but you have to go up one level to see them. It's better in 2.6.15-rc1-git5, but the menu structure is still a little messed up, the patch below properly indents all menu entries. Signed-off-by: Roman Zippel <zippel@linux-m68k.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2005-11-03[PKT_SCHED]: Rework QoS and/or fair queueing configurationThomas Graf
Make "QoS and/or fair queueing" have its own menu, it's too big to be inlined into "Network options". Remove the obsolete NET_QOS option. Automatically select NET_CLS if needed. Do the same for NET_ESTIMATOR but allow it to be selected manually for statistical purposes. Add comments to separate queueing from classification. Fix dependencies and ordering of classifiers. Improve descriptions/help texts and remove outdated pieces. Signed-off-by: Thomas Graf <tgraf@suug.ch> Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
2005-10-13[NET]: Disable NET_SCH_CLK_CPU for SMP x86 hostsAndi Kleen
Opterons with frequency scaling have fully unsynchronized TSCs running at different frequencies, so using TSCs there is not a good idea. Also some other x86 boxes have this problem. gettimeofday should be good enough, so just disable it. Signed-off-by: Andi Kleen <ak@suse.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2005-08-29[LIB]: Make TEXTSEARCH_BM plain tristate like the othersDavid S. Miller
And select it when the relevant modules are enabled. Signed-off-by: David S. Miller <davem@davemloft.net>
2005-07-11[NET]: move config options out to individual protocolsSam Ravnborg
Move the protocol specific config options out to the specific protocols. With this change net/Kconfig now starts to become readable and serve as a good basis for further re-structuring. The menu structure is left almost intact, except that indention is fixed in most cases. Most visible are the INET changes where several "depends on INET" are replaced with a single ifdef INET / endif pair. Several new files were created to accomplish this change - they are small but serve the purpose that config options are now distributed out where they belongs. Signed-off-by: Sam Ravnborg <sam@ravnborg.org> Signed-off-by: David S. Miller <davem@davemloft.net>