summaryrefslogtreecommitdiff
path: root/net/sched
AgeCommit message (Collapse)Author
2013-09-14htb: fix sign extension bugstephen hemminger
[ Upstream commit cbd375567f7e4811b1c721f75ec519828ac6583f ] When userspace passes a large priority value the assignment of the unsigned value hopt->prio to signed int cl->prio causes cl->prio to become negative and the comparison is with TC_HTB_NUMPRIO is always false. The result is that HTB crashes by referencing outside the array when processing packets. With this patch the large value wraps around like other values outside the normal range. See: https://bugzilla.kernel.org/show_bug.cgi?id=60669 Signed-off-by: Stephen Hemminger <stephen@networkplumber.org> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-08-11net_sched: info leak in atm_tc_dump_class()Dan Carpenter
[ Upstream commit 8cb3b9c3642c0263d48f31d525bcee7170eedc20 ] The "pvc" struct has a hole after pvc.sap_family which is not cleared. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Reviewed-by: Jiri Pirko <jiri@resnulli.us> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-08-11net_sched: Fix stack info leak in cbq_dump_wrr().David S. Miller
[ Upstream commit a0db856a95a29efb1c23db55c02d9f0ff4f0db48 ] Make sure the reserved fields, and padding (if any), are fully initialized. Based upon a patch by Dan Carpenter and feedback from Joe Perches. Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-05-19net_sched: act_ipt forward compat with xtablesJamal Hadi Salim
[ Upstream commit 0dcffd09641f3abb21ac5cabc61542ab289d1a3c ] Deal with changes in newer xtables while maintaining backward compatibility. Thanks to Jan Engelhardt for suggestions. Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-05-01cbq: incorrect processing of high limitsVasily Averin
[ Upstream commit f0f6ee1f70c4eaab9d52cf7d255df4bd89f8d1c2 ] currently cbq works incorrectly for limits > 10% real link bandwidth, and practically does not work for limits > 50% real link bandwidth. Below are results of experiments taken on 1 Gbit link In shaper | Actual Result -----------+--------------- 100M | 108 Mbps 200M | 244 Mbps 300M | 412 Mbps 500M | 893 Mbps This happen because of q->now changes incorrectly in cbq_dequeue(): when it is called before real end of packet transmitting, L2T is greater than real time delay, q_now gets an extra boost but never compensate it. To fix this problem we prevent change of q->now until its synchronization with real time. Signed-off-by: Vasily Averin <vvs@openvz.org> Reviewed-by: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2013-01-11net: sched: integer overflow fixStefan Hasko
[ Upstream commit d2fe85da52e89b8012ffad010ef352a964725d5f ] Fixed integer overflow in function htb_dequeue Signed-off-by: Stefan Hasko <hasko.stevo@gmail.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-10-13pkt_sched: fix virtual-start-time update in QFQPaolo Valente
[ Upstream commit 71261956973ba9e0637848a5adb4a5819b4bae83 ] If the old timestamps of a class, say cl, are stale when the class becomes active, then QFQ may assign to cl a much higher start time than the maximum value allowed. This may happen when QFQ assigns to the start time of cl the finish time of a group whose classes are characterized by a higher value of the ratio max_class_pkt/weight_of_the_class with respect to that of cl. Inserting a class with a too high start time into the bucket list corrupts the data structure and may eventually lead to crashes. This patch limits the maximum start time assigned to a class. Signed-off-by: Paolo Valente <paolo.valente@unimore.it> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-10-13net-sched: sch_cbq: avoid infinite loopEric Dumazet
[ Upstream commit bdfc87f7d1e253e0a61e2fc6a75ea9d76f7fc03a ] Its possible to setup a bad cbq configuration leading to an infinite loop in cbq_classify() DEV_OUT=eth0 ICMP="match ip protocol 1 0xff" U32="protocol ip u32" DST="match ip dst" tc qdisc add dev $DEV_OUT root handle 1: cbq avpkt 1000 \ bandwidth 100mbit tc class add dev $DEV_OUT parent 1: classid 1:1 cbq \ rate 512kbit allot 1500 prio 5 bounded isolated tc filter add dev $DEV_OUT parent 1: prio 3 $U32 \ $ICMP $DST 192.168.3.234 flowid 1: Reported-by: Denys Fedoryschenko <denys@visp.net.lb> Tested-by: Denys Fedoryschenko <denys@visp.net.lb> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-10-02net_sched: gact: Fix potential panic in tcf_gact().Hiroaki SHIMODA
[ Upstream commit 696ecdc10622d86541f2e35cc16e15b6b3b1b67e ] gact_rand array is accessed by gact->tcfg_ptype whose value is assumed to less than MAX_RAND, but any range checks are not performed. So add a check in tcf_gact_init(). And in tcf_gact(), we can reduce a branch. Signed-off-by: Hiroaki SHIMODA <shimoda.hiroaki@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-08-09sch_sfb: Fix missing NULL checkAlan Cox
[ Upstream commit 7ac2908e4b2edaec60e9090ddb4d9ceb76c05e7d ] Resolves-bug: https://bugzilla.kernel.org/show_bug.cgi?id=44461 Signed-off-by: Alan Cox <alan@linux.intel.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-08-09netem: add limitation to reordered packetsEric Dumazet
[ Upstream commit 960fb66e520a405dde39ff883f17ff2669c13d85 ] Fix two netem bugs : 1) When a frame was dropped by tfifo_enqueue(), drop counter was incremented twice. 2) When reordering is triggered, we enqueue a packet without checking queue limit. This can OOM pretty fast when this is repeated enough, since skbs are orphaned, no socket limit can help in this situation. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Mark Gordon <msg@google.com> Cc: Andreas Terzis <aterzis@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Hagen Paul Pfeifer <hagen@jauu.net> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2012-05-01netem: fix possible skb leakEric Dumazet
skb_checksum_help(skb) can return an error, we must free skb in this case. qdisc_drop(skb, sch) can also be feeded with a NULL skb (if skb_unshare() failed), so lets use this generic helper. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Stephen Hemminger <shemminger@osdl.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-04-16net_sched: gred: Fix oops in gred_dump() in WRED modeDavid Ward
A parameter set exists for WRED mode, called wred_set, to hold the same values for qavg and qidlestart across all VQs. The WRED mode values had been previously held in the VQ for the default DP. After these values were moved to wred_set, the VQ for the default DP was no longer created automatically (so that it could be omitted on purpose, to have packets in the default DP enqueued directly to the device without using RED). However, gred_dump() was overlooked during that change; in WRED mode it still reads qavg/qidlestart from the VQ for the default DP, which might not even exist. As a result, this command sequence will cause an oops: tc qdisc add dev $DEV handle $HANDLE parent $PARENT gred setup \ DPs 3 default 2 grio tc qdisc change dev $DEV handle $HANDLE gred DP 0 prio 8 $RED_OPTIONS tc qdisc change dev $DEV handle $HANDLE gred DP 1 prio 8 $RED_OPTIONS This fixes gred_dump() in WRED mode to use the values held in wred_set. Signed-off-by: David Ward <david.ward@ll.mit.edu> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-03-20Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-nextLinus Torvalds
Pull networking merge from David Miller: "1) Move ixgbe driver over to purely page based buffering on receive. From Alexander Duyck. 2) Add receive packet steering support to e1000e, from Bruce Allan. 3) Convert TCP MD5 support over to RCU, from Eric Dumazet. 4) Reduce cpu usage in handling out-of-order TCP packets on modern systems, also from Eric Dumazet. 5) Support the IP{,V6}_UNICAST_IF socket options, making the wine folks happy, from Erich Hoover. 6) Support VLAN trunking from guests in hyperv driver, from Haiyang Zhang. 7) Support byte-queue-limtis in r8169, from Igor Maravic. 8) Outline code intended for IP_RECVTOS in IP_PKTOPTIONS existed but was never properly implemented, Jiri Benc fixed that. 9) 64-bit statistics support in r8169 and 8139too, from Junchang Wang. 10) Support kernel side dump filtering by ctmark in netfilter ctnetlink, from Pablo Neira Ayuso. 11) Support byte-queue-limits in gianfar driver, from Paul Gortmaker. 12) Add new peek socket options to assist with socket migration, from Pavel Emelyanov. 13) Add sch_plug packet scheduler whose queue is controlled by userland daemons using explicit freeze and release commands. From Shriram Rajagopalan. 14) Fix FCOE checksum offload handling on transmit, from Yi Zou." * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1846 commits) Fix pppol2tp getsockname() Remove printk from rds_sendmsg ipv6: fix incorrent ipv6 ipsec packet fragment cpsw: Hook up default ndo_change_mtu. net: qmi_wwan: fix build error due to cdc-wdm dependecy netdev: driver: ethernet: Add TI CPSW driver netdev: driver: ethernet: add cpsw address lookup engine support phy: add am79c874 PHY support mlx4_core: fix race on comm channel bonding: send igmp report for its master fs_enet: Add MPC5125 FEC support and PHY interface selection net: bpf_jit: fix BPF_S_LDX_B_MSH compilation net: update the usage of CHECKSUM_UNNECESSARY fcoe: use CHECKSUM_UNNECESSARY instead of CHECKSUM_PARTIAL on tx net: do not do gso for CHECKSUM_UNNECESSARY in netif_needs_gso ixgbe: Fix issues with SR-IOV loopback when flow control is disabled net/hyperv: Fix the code handling tx busy ixgbe: fix namespace issues when FCoE/DCB is not enabled rtlwifi: Remove unused ETH_ADDR_LEN defines igbvf: Use ETH_ALEN ... Fix up fairly trivial conflicts in drivers/isdn/gigaset/interface.c and drivers/net/usb/{Kconfig,qmi_wwan.c} as per David.
2012-03-20Merge branch 'for-3.4' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup Pull cgroup changes from Tejun Heo: "Out of the 8 commits, one fixes a long-standing locking issue around tasklist walking and others are cleanups." * 'for-3.4' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: cgroup: Walk task list under tasklist_lock in cgroup_enable_task_cg_list cgroup: Remove wrong comment on cgroup_enable_task_cg_list() cgroup: remove cgroup_subsys argument from callbacks cgroup: remove extra calls to find_existing_css_set cgroup: replace tasklist_lock with rcu_read_lock cgroup: simplify double-check locking in cgroup_attach_proc cgroup: move struct cgroup_pidlist out from the header file cgroup: remove cgroup_attach_task_current_cg()
2012-03-18Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
2012-03-16sch_sfq: revert dont put new flow at the end of flowsEric Dumazet
This reverts commit d47a0ac7b6 (sch_sfq: dont put new flow at the end of flows) As Jesper found out, patch sounded great but has bad side effects. In stress situation, pushing new flows in front of the queue can prevent old flows doing any progress. Packets can stay in SFQ queue for unlimited amount of time. It's possible to add heuristics to limit this problem, but this would add complexity outside of SFQ scope. A more sensible answer to Dave Taht concerns (who reported the issued I tried to solve in original commit) is probably to use a qdisc hierarchy so that high prio packets dont enter a potentially crowded SFQ qdisc. Reported-by: Jesper Dangaard Brouer <jdb@comx.dk> Cc: Dave Taht <dave.taht@gmail.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-02-26Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Conflicts: drivers/net/ethernet/sfc/rx.c Overlapping changes in drivers/net/ethernet/sfc/rx.c, one to change the rx_buf->is_page boolean into a set of u16 flags, and another to adjust how ->ip_summed is initialized. Signed-off-by: David S. Miller <davem@davemloft.net>
2012-02-19netem: fix dequeueEric Dumazet
commit 50612537e9 (netem: fix classful handling) added two errors in netem_dequeue() 1) After checking skb at the head of tfifo queue for time constraints, it dequeues tail skb, thus adding unwanted reordering. 2) qdisc stats are updated twice per packet (one when packet dequeued from tfifo, once when delivered) Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-02-13net_sched: sch_plug: plug_qdisc_ops is staticEric Dumazet
net/sched/sch_plug.c:211:18: warning: symbol 'plug_qdisc_ops' was not declared. Should it be static? Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-02-09net: Make qdisc_skb_cb upper size bound explicit.David S. Miller
Just like skb->cb[], so that qdisc_skb_cb can be encapsulated inside of other data structures. This is intended to be used by IPoIB so that it can remember addressing information stored at hard_header_ops->create() time that it can fetch when the packet gets to the transmit routine. Signed-off-by: David S. Miller <davem@davemloft.net>
2012-02-07net/sched: sch_plug - Queue traffic until an explicit release commandShriram Rajagopalan
The qdisc supports two operations - plug and unplug. When the qdisc receives a plug command via netlink request, packets arriving henceforth are buffered until a corresponding unplug command is received. Depending on the type of unplug command, the queue can be unplugged indefinitely or selectively. This qdisc can be used to implement output buffering, an essential functionality required for consistent recovery in checkpoint based fault-tolerance systems. Output buffering enables speculative execution by allowing generated network traffic to be rolled back. It is used to provide network protection for Xen Guests in the Remus high availability project, available as part of Xen. This module is generic enough to be used by any other system that wishes to add speculative execution and output buffering to its applications. This module was originally available in the linux 2.6.32 PV-OPS tree, used as dom0 for Xen. For more information, please refer to http://nss.cs.ubc.ca/remus/ and http://wiki.xensource.com/xenwiki/Remus Changes in V3: * Removed debug output (printk) on queue overflow * Added TCQ_PLUG_RELEASE_INDEFINITE - that allows the user to use this qdisc, for simple plug/unplug operations. * Use of packet counts instead of pointers to keep track of the buffers in the queue. Signed-off-by: Shriram Rajagopalan <rshriram@cs.ubc.ca> Signed-off-by: Brendan Cully <brendan@cs.ubc.ca> [author of the code in the linux 2.6.32 pvops tree] Signed-off-by: David S. Miller <davem@davemloft.net>
2012-02-06net: Make qdisc_skb_cb upper size bound explicit.David S. Miller
Just like skb->cb[], so that qdisc_skb_cb can be encapsulated inside of other data structures. This is intended to be used by IPoIB so that it can remember addressing information stored at hard_header_ops->create() time that it can fetch when the packet gets to the transmit routine. Signed-off-by: David S. Miller <davem@davemloft.net>
2012-02-02cgroup: remove cgroup_subsys argument from callbacksLi Zefan
The argument is not used at all, and it's not necessary, because a specific callback handler of course knows which subsys it belongs to. Now only ->pupulate() takes this argument, because the handlers of this callback always call cgroup_add_file()/cgroup_add_files(). So we reduce a few lines of code, though the shrinking of object size is minimal. 16 files changed, 113 insertions(+), 162 deletions(-) text data bss dec hex filename 5486240 656987 7039960 13183187 c928d3 vmlinux.o.orig 5486170 656987 7039960 13183117 c9288d vmlinux.o Signed-off-by: Li Zefan <lizf@cn.fujitsu.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2012-01-22netem: Fix off-by-one bug in reorderingVijay Subramanian
With netem reordering, a gap of N is supposed to reorder every Nth packet with given reorder probability. However, the code currently skips N packets and reorders every (N+1)th packet. Signed-off-by: Vijay Subramanian <subramanian.vijay@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-01-12net_sched: sfq: add optional RED on top of SFQEric Dumazet
Adds an optional Random Early Detection on each SFQ flow queue. Traditional SFQ limits count of packets, while RED permits to also control number of bytes per flow, and adds ECN capability as well. 1) We dont handle the idle time management in this RED implementation, since each 'new flow' begins with a null qavg. We really want to address backlogged flows. 2) if headdrop is selected, we try to ecn mark first packet instead of currently enqueued packet. This gives faster feedback for tcp flows compared to traditional RED [ marking the last packet in queue ] Example of use : tc qdisc add dev $DEV parent 1:1 handle 10: est 1sec 4sec sfq \ limit 3000 headdrop flows 512 divisor 16384 \ redflowlimit 100000 min 8000 max 60000 probability 0.20 ecn qdisc sfq 10: parent 1:1 limit 3000p quantum 1514b depth 127 headdrop flows 512/16384 divisor 16384 ewma 6 min 8000b max 60000b probability 0.2 ecn prob_mark 0 prob_mark_head 4876 prob_drop 6131 forced_mark 0 forced_mark_head 0 forced_drop 0 Sent 1175211782 bytes 777537 pkt (dropped 6131, overlimits 11007 requeues 0) rate 99483Kbit 8219pps backlog 689392b 456p requeues 0 In this test, with 64 netperf TCP_STREAM sessions, 50% using ECN enabled flows, we can see number of packets CE marked is smaller than number of drops (for non ECN flows) If same test is run, without RED, we can check backlog is much bigger. qdisc sfq 10: parent 1:1 limit 3000p quantum 1514b depth 127 headdrop flows 512/16384 divisor 16384 Sent 1148683617 bytes 795006 pkt (dropped 0, overlimits 0 requeues 0) rate 98429Kbit 8521pps backlog 1221290b 841p requeues 0 Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> CC: Stephen Hemminger <shemminger@vyatta.com> CC: Dave Taht <dave.taht@gmail.com> Tested-by: Dave Taht <dave.taht@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-01-05net_sched: red: split red_parms into parms and varsEric Dumazet
This patch splits the red_parms structure into two components. One holding the RED 'constant' parameters, and one containing the variables. This permits a size reduction of GRED qdisc, and is a preliminary step to add an optional RED unit to SFQ. SFQRED will have a single red_parms structure shared by all flows, and a private red_vars per flow. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> CC: Dave Taht <dave.taht@gmail.com> CC: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-01-05net_sched: sfq: extend limitsEric Dumazet
SFQ as implemented in Linux is very limited, with at most 127 flows and limit of 127 packets. [ So if 127 flows are active, we have one packet per flow ] This patch brings to SFQ following features to cope with modern needs. - Ability to specify a smaller per flow limit of inflight packets. (default value being at 127 packets) - Ability to have up to 65408 active flows (instead of 127) - Ability to have head drops instead of tail drops (to drop old packets from a flow) Example of use : No more than 20 packets per flow, max 8000 flows, max 20000 packets in SFQ qdisc, hash table of 65536 slots. tc qdisc add ... sfq \ flows 8000 \ depth 20 \ headdrop \ limit 20000 \ divisor 65536 Ram usage : 2 bytes per hash table entry (instead of previous 1 byte/entry) 32 bytes per flow on 64bit arches, instead of 384 for QFQ, so much better cache hit ratio. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> CC: Dave Taht <dave.taht@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-01-05net_sched: Bug in netem reorderingHagen Paul Pfeifer
Not now, but it looks you are correct. q->qdisc is NULL until another additional qdisc is attached (beside tfifo). See 50612537e9ab2969312. The following patch should work. From: Hagen Paul Pfeifer <hagen@jauu.net> netem: catch NULL pointer by updating the real qdisc statistic Reported-by: Vijay Subramanian <subramanian.vijay@gmail.com> Signed-off-by: Hagen Paul Pfeifer <hagen@jauu.net> Acked-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-01-04Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
2012-01-04net_sched: sfq: always randomize hash perturbationEric Dumazet
SFQ q->perturbation is used in sfq_hash() as an input to Jenkins hash. We currently randomize this 32bit value only if a perturbation timer is setup. Its much better to always initialize it to defeat attackers, or else they can predict very well what kind of packets they have to forge to hit a particular flow. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-01-04net_sched: sfq: fix mem alloc error recoveryEric Dumazet
Since commit 817fb15dfd98 (net_sched: sfq: allow divisor to be a parameter), we can leave perturbation timer armed if a memory allocation error aborts sfq_init(). Memory containing active struct timer_list is freed and kernel can crash. Call sfq_destroy() from sfq_init() to properly dismantle qdisc. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-01-03net_sched: qdisc_alloc_handle() can be too slowEric Dumazet
When trying to allocate ~32768 qdiscs using autohandle mechanism, we can fill the space managed by kernel (handles in [8000-FFFF]:0000 range) But O(N^2) qdisc_alloc_handle() loops 0x10000 times instead of 0x8000 time tc add qdisc add dev eth0 parent 10:7fff pfifo limit 10 RTNETLINK answers: Cannot allocate memory real 1m54.826s user 0m0.000s sys 0m0.004s INFO: rcu_sched_state detected stall on CPU 0 (t=60000 jiffies) Half number of loops, and add a cond_resched() call. We hold rtnl at this point. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> CC: Dave Taht <dave.taht@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-01-03sch_qfq: accurate wsum handlingEric Dumazet
We can underestimate q->wsum in case of "tc class replace ... qfq" and/or qdisc_create_dflt() error. wsum is not really used in fast path, only at qfq qdisc/class setup, to catch user error. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> CC: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-01-03sch_qfq: fix overflow in qfq_update_start()Eric Dumazet
grp->slot_shift is between 22 and 41, so using 32bit wide variables is probably a typo. This could explain QFQ hangs Dave reported to me, after 2^23 packets ? (23 = 64 - 41) Reported-by: Dave Taht <dave.taht@gmail.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> CC: Stephen Hemminger <shemminger@vyatta.com> CC: Dave Taht <dave.taht@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2012-01-03sch_sfq: dont put new flow at the end of flowsEric Dumazet
SFQ enqueue algo puts a new flow _behind_ all pre-existing flows in the circular list. In fact this is probably an old SFQ implementation bug. 100 Mbits = ~8333 full frames per second, or ~8 frames per ms. With 50 flows, it means your "new flow" will have to wait 50 packets being sent before its own packet. Thats the ~6ms. We certainly can change SFQ to give a priority advantage to new flows, so that next dequeued packet is taken from a new flow, not an old one. Reported-by: Dave Taht <dave.taht@gmail.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-12-30netem: fix classful handlingEric Dumazet
Commit 10f6dfcfde (Revert "sch_netem: Remove classful functionality") reintroduced classful functionality to netem, but broke basic netem behavior : netem uses an t(ime)fifo queue, and store timestamps in skb->cb[] If qdisc is changed, time constraints are not respected and other qdisc can destroy skb->cb[] and block netem at dequeue time. Fix this by always using internal tfifo, and optionally attach a child qdisc to netem (or a tree of qdiscs) Example of use : DEV=eth3 tc qdisc del dev $DEV root tc qdisc add dev $DEV root handle 30: est 1sec 8sec netem delay 20ms 10ms tc qdisc add dev $DEV handle 40:0 parent 30:0 tbf \ burst 20480 limit 20480 mtu 1514 rate 32000bps qdisc netem 30: root refcnt 18 limit 1000 delay 20.0ms 10.0ms Sent 190792 bytes 413 pkt (dropped 0, overlimits 0 requeues 0) rate 18416bit 3pps backlog 0b 0p requeues 0 qdisc tbf 40: parent 30: rate 256000bit burst 20Kb/8 mpu 0b lat 0us Sent 190792 bytes 413 pkt (dropped 6, overlimits 10 requeues 0) backlog 0b 5p requeues 0 Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> CC: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-12-30Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
2011-12-29sch_tbf: report backlog informationEric Dumazet
Provide child qdisc backlog (byte count) information so that "tc -s qdisc" can report it to user. qdisc netem 30: root refcnt 18 limit 1000 delay 20.0ms 10.0ms Sent 948517 bytes 898 pkt (dropped 0, overlimits 0 requeues 1) rate 175056bit 16pps backlog 114b 1p requeues 1 qdisc tbf 40: parent 30: rate 256000bit burst 20Kb/8 mpu 0b lat 0us Sent 948517 bytes 898 pkt (dropped 15, overlimits 611 requeues 0) backlog 18168b 12p requeues 0 Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-12-24netem: dont call vfree() under spinlock and BH disabledEric Dumazet
commit 6373a9a286 (netem: use vmalloc for distribution table) added a regression, since vfree() is called while holding a spinlock and BH being disabled. Fix this by doing the pointers swap in critical section, and freeing after spinlock release. Also add __GFP_NOWARN to the kmalloc() try, since we fallback to vmalloc(). Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Acked-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-12-23Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Conflicts: net/bluetooth/l2cap_core.c Just two overlapping changes, one added an initialization of a local variable, and another change added a new local variable. Signed-off-by: David S. Miller <davem@davemloft.net>
2011-12-23netem: loss model API sizesstephen hemminger
The new netem loss model is configured with nested netlink messages. This code is being overly strict about sizes, and is easily confused by padding (or possible future expansion). Also message for gemodel is incorrect. Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-12-23sch_hfsc: report backlog informationEric Dumazet
Add backlog (byte count) information in hfsc classes and qdisc, so that "tc -s" can report it to user, instead of 0 values : qdisc hfsc 1: root refcnt 6 default 20 Sent 45141660 bytes 30545 pkt (dropped 0, overlimits 91751 requeues 0) rate 1492Kbit 126pps backlog 103226b 74p requeues 0 ... class hfsc 1:20 parent 1:1 leaf 1201: rt m1 0bit d 0us m2 400000bit ls m1 0bit d 0us m2 200000bit Sent 49534912 bytes 33519 pkt (dropped 0, overlimits 0 requeues 0) backlog 81822b 56p requeues 0 period 23 work 49451576 bytes rtwork 13277552 bytes level 0 ... Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> CC: John A. Sullivan III <jsullivan@opensourcedevel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-12-22mqprio: Avoid panic if no options are providedThomas Graf
Userspace may not provide TCA_OPTIONS, in fact tc currently does so not do so if no arguments are specified on the command line. Return EINVAL instead of panicing. Signed-off-by: Thomas Graf <tgraf@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-12-21sch_sfq: rehash queues in perturb timerEric Dumazet
A known Out Of Order (OOO) problem hurts SFQ when timer changes perturbation value, since all new packets delivered to SFQ enqueue might end on different slots than previous in-flight packets. With round robin delivery, we can thus deliver packets in a different order. Since SFQ is limited to small amount of in-flight packets, we can rehash packets so that this OOO problem is fixed. This rehashing is performed only if internal flow classifier is in use. We now store in skb->cb[] the "struct flow_keys" so that we dont call skb_flow_dissect() again while rehashing. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-12-16sch_gred: prefer GFP_KERNEL allocationsEric Dumazet
In control path, its better to use GFP_KERNEL allocations where possible. Before taking qdisc spinlock, we preallocate memory just in case we'll need it in gred_change_vq() This is a followup to commit 3f1e6d3fd37b (sch_gred: should not use GFP_KERNEL while holding a spinlock) Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-12-16Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Conflicts: drivers/net/ethernet/freescale/fsl_pq_mdio.c net/batman-adv/translation-table.c net/ipv6/route.c
2011-12-14cls_flow: remove one dynamic arrayEric Dumazet
Its better to use a predefined size for this small automatic variable. Removes a sparse error as well : net/sched/cls_flow.c:288:13: error: bad constant expression Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-12-12netem: add cell concept to simulate special MAC behaviorHagen Paul Pfeifer
This extension can be used to simulate special link layer characteristics. Simulate because packet data is not modified, only the calculation base is changed to delay a packet based on the original packet size and artificial cell information. packet_overhead can be used to simulate a link layer header compression scheme (e.g. set packet_overhead to -20) or with a positive packet_overhead value an additional MAC header can be simulated. It is also possible to "replace" the 14 byte Ethernet header with something else. cell_size and cell_overhead can be used to simulate link layer schemes, based on cells, like some TDMA schemes. Another application area are MAC schemes using a link layer fragmentation with a (small) header each. Cell size is the maximum amount of data bytes within one cell. Cell overhead is an additional variable to change the per-cell-overhead (e.g. 5 byte header per fragment). Example (5 kbit/s, 20 byte per packet overhead, cell-size 100 byte, per cell overhead 5 byte): tc qdisc add dev eth0 root netem rate 5kbit 20 100 5 Signed-off-by: Hagen Paul Pfeifer <hagen@jauu.net> Signed-off-by: Florian Westphal <fw@strlen.de> Acked-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2011-12-12sch_gred: should not use GFP_KERNEL while holding a spinlockEric Dumazet
gred_change_vq() is called under sch_tree_lock(sch). This means a spinlock is held, and we are not allowed to sleep in this context. We might pre-allocate memory using GFP_KERNEL before taking spinlock, but this is not suitable for stable material. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>