Age | Commit message (Collapse) | Author |
|
When TCP Small Queues was added, we used a sysctl to limit amount of
packets queues on Qdisc/device queues for a given TCP flow.
Problem is this limit is either too big for low rates, or too small
for high rates.
Now TCP stack has rate estimation in sk->sk_pacing_rate, and TSO
auto sizing, it can better control number of packets in Qdisc/device
queues.
New limit is two packets or at least 1 to 2 ms worth of packets.
Low rates flows benefit from this patch by having even smaller
number of packets in queues, allowing for faster recovery,
better RTT estimations.
High rates flows benefit from this patch by allowing more than 2 packets
in flight as we had reports this was a limiting factor to reach line
rate. [ In particular if TX completion is delayed because of coalescing
parameters ]
Example for a single flow on 10Gbp link controlled by FQ/pacing
14 packets in flight instead of 2
$ tc -s -d qd
qdisc fq 8001: dev eth0 root refcnt 32 limit 10000p flow_limit 100p
buckets 1024 quantum 3028 initial_quantum 15140
Sent 1168459366606 bytes 771822841 pkt (dropped 0, overlimits 0
requeues 6822476)
rate 9346Mbit 771713pps backlog 953820b 14p requeues 6822476
2047 flow, 2046 inactive, 1 throttled, delay 15673 ns
2372 gc, 0 highprio, 0 retrans, 9739249 throttled, 0 flows_plimit
Note that sk_pacing_rate is currently set to twice the actual rate, but
this might be refined in the future when a flow is in congestion
avoidance.
Additional change : skb->destructor should be set to tcp_wfree().
A future patch (for linux 3.13+) might remove tcp_limit_output_bytes
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Wei Liu <wei.liu2@citrix.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211
|
|
fq_reset() should drops all packets in queue, including
throttled flows.
This patch moves code from fq_destroy() to fq_reset()
to do the cleaning.
fq_change() must stop calling fq_dequeue() if all remaining
packets are from throttled flows.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
In commit 8ed781668dd49 ("flow_keys: include thoff into flow_keys for
later usage"), we missed that existing code was using nhoff as a
temporary variable that could not always contain transport header
offset.
This is not a problem for TCP/UDP because port offset (@poff)
is 0 for these protocols.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Daniel Borkmann <dborkman@redhat.com>
Cc: Nikolay Aleksandrov <nikolay@redhat.com>
Acked-by: Nikolay Aleksandrov <nikolay@redhat.com>
Acked-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Consider the scenario where an IPv6 router is advertising a fixed
preferred_lft of 1800 seconds, while the valid_lft begins at 3600
seconds and counts down in realtime.
A client should reset its preferred_lft to 1800 every time the RA is
received, but a bug is causing Linux to ignore the update.
The core problem is here:
if (prefered_lft != ifp->prefered_lft) {
Note that ifp->prefered_lft is an offset, so it doesn't decrease over
time. Thus, the comparison is always (1800 != 1800), which fails to
trigger an update.
The most direct solution would be to compute a "stored_prefered_lft",
and use that value in the comparison. But I think that trying to filter
out unnecessary updates here is a premature optimization. In order for
the filter to apply, both of these would need to hold:
- The advertised valid_lft and preferred_lft are both declining in
real time.
- No clock skew exists between the router & client.
So in this patch, I've set "update_lft = 1" unconditionally, which
allows the surrounding code to be greatly simplified.
Signed-off-by: Paul Marks <pmarks@google.com>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
While sending packet skb_cow_head() can change skb header which
invalidates inner_iph pointer to skb header. Following patch
avoid using it. Found by code inspection.
This bug was introduced by commit 0e6fbc5b6c6218 (ip_tunnels: extend
iptunnel_xmit()).
Signed-off-by: Pravin B Shelar <pshelar@nicira.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
TCP packets hitting the SYN proxy through the SYNPROXY target are not
validated by TCP conntrack. When th->doff is below 5, an underflow happens
when calculating the options length, causing skb_header_pointer() to
return NULL and triggering the BUG_ON().
Handle this case gracefully by checking for NULL instead of using BUG_ON().
Reported-by: Martin Topholm <mph@one.com>
Tested-by: Martin Topholm <mph@one.com>
Signed-off-by: Patrick McHardy <kaber@trash.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
mac80211 scan processing could get stuck if roc work for pending, but
not started when a scan request was deferred due to such roc item.
Normally the deferred scan would be started from
ieee80211_start_next_roc(), but ieee80211_sw_roc_work() calls that only
if the finished ROC was started. Fix this by calling
ieee80211_run_deferred_scan() in the case the last ROC was not actually
started.
This issue was hit relatively easily in P2P find operations where Listen
state (remain-on-channel) and Search state (scan) are repeated in a
loop.
Signed-off-by: Jouni Malinen <j@w1.fi>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
When clients are idle for too long, hostapd sends nullfunc frames for
probing. When those are acked by the client, the idle time needs to be
updated.
To make this work (and to avoid unnecessary probing), update sta->last_rx
whenever an ACK was received for a tx packet. Only do this if the flag
IEEE80211_HW_REPORTS_TX_ACK_STATUS is set.
Cc: stable@vger.kernel.org
Signed-off-by: Felix Fietkau <nbd@openwrt.org>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
This allows calls for clients in AP_VLANs (e.g. for 4-addr) to succeed
Cc: stable@vger.kernel.org
Signed-off-by: Felix Fietkau <nbd@openwrt.org>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
A host might need net_secret[] and never open a single socket.
Problem added in commit aebda156a570782
("net: defer net_secret[] initialization")
Based on prior patch from Hannes Frederic Sowa.
Reported-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Hannes Frederic Sowa <hannes@strressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
There is currently serialization network namespaces exiting and
network devices exiting as the final part of netdev_run_todo does not
happen under the rtnl_lock. This is compounded by the fact that the
only list of devices unregistering in netdev_run_todo is local to the
netdev_run_todo.
This lack of serialization in extreme cases results in network devices
unregistering in netdev_run_todo after the loopback device of their
network namespace has been freed (making dst_ifdown unsafe), and after
the their network namespace has exited (making the NETDEV_UNREGISTER,
and NETDEV_UNREGISTER_FINAL callbacks unsafe).
Add the missing serialization by a per network namespace count of how
many network devices are unregistering and having a wait queue that is
woken up whenever the count is decreased. The count and wait queue
allow default_device_exit_batch to wait until all of the unregistration
activity for a network namespace has finished before proceeding to
unregister the loopback device and then allowing the network namespace
to exit.
Only a single global wait queue is used because there is a single global
lock, and there is a single waiter, per network namespace wait queues
would be a waste of resources.
The per network namespace count of unregistering devices gives a
progress guarantee because the number of network devices unregistering
in an exiting network namespace must ultimately drop to zero (assuming
network device unregistration completes).
The basic logic remains the same as in v1. This patch is now half
comment and half rtnl_lock_unregistering an expanded version of
wait_event performs no extra work in the common case where no network
devices are unregistering when we get to default_device_exit_batch.
Reported-by: Francesco Ruggeri <fruggeri@aristanetworks.com>
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
When a router is doing DNAT for 6to4/6rd packets the latest
anti-spoofing commit 218774dc ("ipv6: add anti-spoofing checks for
6to4 and 6rd") will drop them because the IPv6 address embedded does
not match the IPv4 destination. This patch will allow them to pass by
testing if we have an address that matches on 6to4/6rd interface. I
have been hit by this problem using Fedora and IPV6TO4_IPV4ADDR.
Also, log the dropped packets (with rate limit).
Signed-off-by: Catalin(ux) M. BOIE <catab@embedromix.ro>
Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless into for-davem
Also fixed-up a badly indented closing brace...
Signed-off-by: John W. Linville <linville@tuxdriver.com>
|
|
My locking rework/race fixes caused a regression in the
registration, causing uevent notifications for wireless
devices before the device is really fully registered and
available in nl80211.
Fix this by moving the device_add() under rtnl and move
the rfkill to afterwards (it can't be under rtnl.)
Reported-and-tested-by: Maxime Bizon <mbizon@freebox.fr>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
The patch "mac80211: select and adjust bitrates according to
channel mode" causes regression and breaks the extended supported rate
IE setting. Since "i" is starting with 8, so this is not necessary
to introduce "skip" here.
Signed-off-by: Chun-Yeow Yeoh <yeohchunyeow@cozybit.com>
Signed-off-by: Colleen Twitty <colleen@cozybit.com>
Reviewed-by: Jason Abele <jason@cozybit.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
If an Ad-Hoc node receives packets with the Cell ID or its own MAC
address as source address, it hits a WARN_ON in sta_info_insert_check()
With many packets, this can massively spam the logs. One way that this
can easily happen is through having Cisco APs in the area with rouge AP
detection and countermeasures enabled.
Such Cisco APs will regularly send fake beacons, disassoc and deauth
packets that trigger these warnings.
To fix this issue, drop such spoofed packets early in the rx path.
Cc: stable@vger.kernel.org
Reported-by: Thomas Huehn <thomas@net.t-labs.tu-berlin.de>
Signed-off-by: Felix Fietkau <nbd@openwrt.org>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth
|
|
Fix kernel warning when using WEXT for configuring ad-hoc mode,
e.g. "iwconfig wlan0 essid test channel 1"
WARNING: at net/wireless/chan.c:373 cfg80211_chandef_usable+0x50/0x21c [cfg80211]()
The warning is caused by an uninitialized variable center_freq1.
Cc: stable@vger.kernel.org
Signed-off-by: Bruno Randolf <br1@einfach.org>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
Use MONITOR_FLAG_ACTIVE, which is a flag mask, instead of
NL80211_MNTR_FLAG_ACTIVE, which is a flag index, when checking if the
hardware supports active monitoring.
Cc: stable@vger.kernel.org
Signed-off-by: Luciano Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
|
If asynchronous events are enabled for a particular netlink socket,
the notify function is called by the advance function. The notify
function creates and dispatches a km_event if a replay timeout occurred,
or at least replay_maxdiff packets have been received since the last
asynchronous event has been sent. The function is supposed to return if
neither of the two events were detected for a state, or replay_maxdiff
is equal to zero.
Replay_maxdiff is initialized in xfrm_state_construct to the value of
the xfrm.sysctl_aevent_rseqth (2 by default), and updated if for a state
if the netlink attribute XFRMA_REPLAY_THRESH is set.
If, however, replay_maxdiff is set to zero, then all of the three notify
implementations perform a break from the switch statement instead of
checking whether a timeout occurred, and -- if not -- return. As a
result an asynchronous event is generated for every replay update of a
state that has a zero replay_maxdiff value.
This patch modifies the notify functions such that they immediately
return if replay_maxdiff has the value zero, unless a timeout occurred.
Signed-off-by: Thomas Egerer <thomas.egerer@secunet.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
|
|
In the following scenario the socket is corked:
If the first UDP packet is larger then the mtu we try to append it to the
write queue via ip6_ufo_append_data. A following packet, which is smaller
than the mtu would be appended to the already queued up gso-skb via
plain ip6_append_data. This causes random memory corruptions.
In ip6_ufo_append_data we also have to be careful to not queue up the
same skb multiple times. So setup the gso frame only when no first skb
is available.
This also fixes a shortcoming where we add the current packet's length to
cork->length but return early because of a packet > mtu with dontfrag set
(instead of sutracting it again).
Found with trinity.
Cc: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Redirect isn't an error condition, it should leave
the error handler without touching the socket.
Signed-off-by: Duan Jiong <duanj.fnst@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Redirect isn't an error condition, it should leave
the error handler without touching the socket.
Signed-off-by: Duan Jiong <duanj.fnst@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
MRP doesn't implement the periodictimer in 802.1Q, so it never retries
if packets get lost. I ran into this problem when MRP sent a MVRP
JoinIn before the interface was fully up. The JoinIn was lost, MRP
didn't retry, and MVRP registration failed.
Tested against Juniper QFabric switches
Signed-off-by: Noel Burton-Krahn <noel@burton-krahn.com>
Acked-by: David Ward <david.ward@ll.mit.edu>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Actually re-send packets when the T1 timer runs out. This fixes a bug
where packets are waiting on the write queue until disconnection when
no other traffic is outstanding.
Signed-off-by: Josselin Costanzi <josselin.costanzi@mobile-devices.fr>
Signed-off-by: Maxime Jayat <maxime.jayat@mobile-devices.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Pull NFS client bugfix from Trond Myklebust:
"Fix a regression due to incorrect sharing of gss auth caches"
* tag 'nfs-for-3.12-3' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
RPCSEC_GSS: fix crash on destroying gss auth
|
|
When the dlc is closed, rfcomm_dev_state_change() tries to release the
port in the case it cannot get a reference to the tty. However this is
racy and not even needed.
Infact as Peter Hurley points out:
1. Only consider dlcs that are 'stolen' from a connected socket, ie.
reused. Allocated dlcs cannot have been closed prior to port
activate and so for these dlcs a tty reference will always be avail
in rfcomm_dev_state_change() -- except for the conditions covered by
#2b below.
2. If a tty was at some point previously created for this rfcomm, then
either
(a) the tty reference is still avail, so rfcomm_dev_state_change()
will perform a hangup. So nothing to do, or,
(b) the tty reference is no longer avail, and the tty_port will be
destroyed by the last tty_port_put() in rfcomm_tty_cleanup.
Again, no action required.
3. Prior to obtaining the dlc lock in rfcomm_dev_add(),
rfcomm_dev_state_change() will not 'see' a rfcomm_dev so nothing to
do here.
4. After releasing the dlc lock in rfcomm_dev_add(),
rfcomm_dev_state_change() will 'see' an incomplete rfcomm_dev if a
tty reference could not be obtained. Again, the best thing to do here
is nothing. Any future attempted open() will block on
rfcomm_dev_carrier_raised(). The unconnected device will exist until
released by ioctl(RFCOMMRELEASEDEV).
The patch removes the aforementioned code and uses the
tty_port_tty_hangup() helper to hangup the tty.
Signed-off-by: Gianluca Anzolin <gianluca@sottospazio.it>
Reviewed-by: Peter Hurley <peter@hurleysoftware.com>
Signed-off-by: Gustavo Padovan <gustavo.padovan@collabora.co.uk>
|
|
Pull networking fixes from David Miller:
1) If the local_df boolean is set on an SKB we have to allocate a
unique ID even if IP_DF is set in the ipv4 headers, from Ansis
Atteka.
2) Some fixups for the new chipset support that went into the sfc
driver, from Ben Hutchings.
3) Because SCTP bypasses a good chunk of, and actually duplicates, the
logic of the ipv6 output path, some IPSEC things don't get done
properly. Integrate SCTP better into the ipv6 output path so that
these problems are fixed and such issues don't get missed in the
future either. From Daniel Borkmann.
4) Fix skge regressions added by the DMA mapping error return checking
added in v3.10, from Mikulas Patocka.
5) Kill some more IRQF_DISABLED references, from Michael Opdenacker.
6) Fix races and deadlocks in the bridging code, from Hong Zhiguo.
7) Fix error handling in tun_set_iff(), in particular don't leak
resources. From Jason Wang.
8) Prevent format-string injection into xen-netback driver, from Kees
Cook.
9) Fix regression added to netpoll ARP packet handling, in particular
check for the right ETH_P_ARP protocol code. From Sonic Zhang.
10) Try to deal with AMD IOMMU errors when using r8169 chips, from
Francois Romieu.
11) Cure freezes due to recent changes in the rt2x00 wireless driver,
from Stanislaw Gruszka.
12) Don't do SPI transfers (which can sleep) in interrupt context in
cw1200 driver, from Solomon Peachy.
13) Fix LEDs handling bug in 5720 tg3 chips already handled for 5719.
From Nithin Sujir.
14) Make xen_netbk_count_skb_slots() count the actual number of slots
that will be used, taking into consideration packing and other
issues that the transmit path will run into. From David Vrabel.
15) Use the correct maximum age when calculating the bridge
message_age_timer, from Chris Healy.
16) Get rid of memory leaks in mcs7780 IRDA driver, from Alexey
Khoroshilov.
17) Netfilter conntrack extensions were converted to RCU but are not
always freed properly using kfree_rcu(). Fix from Michal Kubecek.
18) VF reset recovery not being done correctly in qlcnic driver, from
Manish Chopra.
19) Fix inverted test in ATM nicstar driver, from Andy Shevchenko.
20) Missing workqueue destroy in cxgb4 error handling, from Wei Yang.
21) Internal switch not initialized properly in bgmac driver, from Rafał
Miłecki.
22) Netlink messages report wrong local and remote addresses in IPv6
tunneling, from Ding Zhi.
23) ICMP redirects should not generate socket errors in DCCP and SCTP.
We're still working out how this should be handled for RAW and UDP
sockets. From Daniel Borkmann and Duan Jiong.
24) We've had several bugs wherein the network namespace's loopback
device gets accessed after it is free'd, NULL it out so that we can
catch these problems more readily. From Eric W Biederman.
25) Fix regression in TCP RTO calculations, from Neal Cardwell.
26) Fix too early free of xen-netback network device when VIFs still
exist. From Paul Durrant.
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (87 commits)
netconsole: fix a deadlock with rtnl and netconsole's mutex
netpoll: fix NULL pointer dereference in netpoll_cleanup
skge: fix broken driver
ip: generate unique IP identificator if local fragmentation is allowed
ip: use ip_hdr() in __ip_make_skb() to retrieve IP header
xen-netback: Don't destroy the netdev until the vif is shut down
net:dccp: do not report ICMP redirects to user space
cnic: Fix crash in cnic_bnx2x_service_kcq()
bnx2x, cnic, bnx2i, bnx2fc: Fix bnx2i and bnx2fc regressions.
vxlan: Avoid creating fdb entry with NULL destination
tcp: fix RTO calculated from cached RTT
drivers: net: phy: cicada.c: clears warning Use #include <linux/io.h> instead of <asm/io.h>
net loopback: Set loopback_dev to NULL when freed
batman-adv: set the TAG flag for the vid passed to BLA
netfilter: nfnetlink_queue: use network skb for sequence adjustment
net: sctp: rfc4443: do not report ICMP redirects to user space
net: usb: cdc_ether: use usb.h macros whenever possible
net: usb: cdc_ether: fix checkpatch errors and warnings
net: usb: cdc_ether: Use wwan interface for Telit modules
ip6_tunnels: raddr and laddr are inverted in nl msg
...
|
|
I've been hitting a NULL ptr deref while using netconsole because the
np->dev check and the pointer manipulation in netpoll_cleanup are done
without rtnl and the following sequence happens when having a netconsole
over a vlan and we remove the vlan while disabling the netconsole:
CPU 1 CPU2
removes vlan and calls the notifier
enters store_enabled(), calls
netdev_cleanup which checks np->dev
and then waits for rtnl
executes the netconsole netdev
release notifier making np->dev
== NULL and releases rtnl
continues to dereference a member of
np->dev which at this point is == NULL
Signed-off-by: Nikolay Aleksandrov <nikolay@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
If local fragmentation is allowed, then ip_select_ident() and
ip_select_ident_more() need to generate unique IDs to ensure
correct defragmentation on the peer.
For example, if IPsec (tunnel mode) has to encrypt large skbs
that have local_df bit set, then all IP fragments that belonged
to different ESP datagrams would have used the same identificator.
If one of these IP fragments would get lost or reordered, then
peer could possibly stitch together wrong IP fragments that did
not belong to the same datagram. This would lead to a packet loss
or data corruption.
Signed-off-by: Ansis Atteka <aatteka@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
skb->data already points to IP header, but for the sake of
consistency we can also use ip_hdr() to retrieve it.
Signed-off-by: Ansis Atteka <aatteka@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client
Pull ceph fixes from Sage Weil:
"These fix several bugs with RBD from 3.11 that didn't get tested in
time for the merge window: some error handling, a use-after-free, and
a sequencing issue when unmapping and image races with a notify
operation.
There is also a patch fixing a problem with the new ceph + fscache
code that just went in"
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client:
fscache: check consistency does not decrement refcount
rbd: fix error handling from rbd_snap_name()
rbd: ignore unmapped snapshots that no longer exist
rbd: fix use-after free of rbd_dev->disk
rbd: make rbd_obj_notify_ack() synchronous
rbd: complete notifies before cleaning up osd_client and rbd_dev
libceph: add function to ensure notifies are complete
|
|
When reading percpu stats we need to properly reset
the sum when CPU 0 is not present in the possible mask.
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
|
|
commit c5549571f975ab ("ipvs: convert lblcr scheduler to rcu")
allows RCU readers to use dest after calling ip_vs_dest_put().
In the corner case it can race with ip_vs_dest_trash_expire()
which can release the dest while it is being returned to the
RCU readers as scheduling result.
To fix the problem do not allow e->dest to be replaced and
defer the ip_vs_dest_put() call by using RCU callback. Now
e->dest does not need to be RCU pointer.
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
|
|
commit c2a4ffb70eef39 ("ipvs: convert lblc scheduler to rcu")
allows RCU readers to use dest after calling ip_vs_dest_put().
In the corner case it can race with ip_vs_dest_trash_expire()
which can release the dest while it is being returned to the
RCU readers as scheduling result.
To fix the problem do not allow en->dest to be replaced and
defer the ip_vs_dest_put() call by using RCU callback. Now
en->dest does not need to be RCU pointer.
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
|
|
commit 578bc3ef1e473a ("ipvs: reorganize dest trash") added
IP_VS_DEST_STATE_REMOVING flag and RCU callback named
ip_vs_dest_wait_readers() to keep dests and services after
removal for at least a RCU grace period. But we have the
following corner cases:
- we can not reuse the same dest if its service is removed
while IP_VS_DEST_STATE_REMOVING is still set because another dest
removal in the first grace period can not extend this period.
It can happen when ipvsadm -C && ipvsadm -R is used.
- dest->svc can be replaced but ip_vs_in_stats() and
ip_vs_out_stats() have no explicit read memory barriers
when accessing dest->svc. It can happen that dest->svc
was just freed (replaced) while we use it to update
the stats.
We solve the problems as follows:
- IP_VS_DEST_STATE_REMOVING is removed and we ensure a fixed
idle period for the dest (IP_VS_DEST_TRASH_PERIOD). idle_start
will remember when for first time after deletion we noticed
dest->refcnt=0. Later, the connections can grab a reference
while in RCU grace period but if refcnt becomes 0 we can
safely free the dest and its svc.
- dest->svc becomes RCU pointer. As result, we add explicit
RCU locking in ip_vs_in_stats() and ip_vs_out_stats().
- __ip_vs_unbind_svc is renamed to __ip_vs_svc_put(), it
now can free the service immediately or after a RCU grace
period. dest->svc is not set to NULL anymore.
As result, unlinked dests and their services are
freed always after IP_VS_DEST_TRASH_PERIOD period, unused
services are freed after a RCU grace period.
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
|
|
Schedulers such as lblc and lblcr require the weight to be as high as the
maximum number of active connections. In commit b552f7e3a9524abcbcdf
("ipvs: unify the formula to estimate the overhead of processing
connections"), the consideration of inactconns and activeconns was cleaned
up to always count activeconns as 256 times more important than inactconns.
In cases where 3000 or more connections are expected, a weight of 3000 *
256 * 3000 connections overflows the 32-bit signed result used to determine
if rescheduling is required.
On amd64, this merely changes the multiply and comparison instructions to
64-bit. On x86, a 64-bit result is already present from imull, so only
a few more comparison instructions are emitted.
Signed-off-by: Simon Kirby <sim@hostway.ca>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
|
|
We need to let the setup stage complete cleanly even when the HCI device
is rfkilled. Otherwise the HCI device will stay in an undefined state
and never get notified to user space through mgmt (even when it gets
unblocked through rfkill).
This patch makes sure that hci_dev_open() can be called in the HCI_SETUP
stage, that blocking the device doesn't abort the setup stage, and that
the device gets proper powered down as soon as the setup stage completes
in case it was blocked meanwhile.
The bug that this patch fixed can be very easily reproduced using e.g.
the rfkill command line too. By running "rfkill block all" before
inserting a Bluetooth dongle the resulting HCI device goes into a state
where it is never announced over mgmt, not even when "rfkill unblock all"
is run.
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Cc: stable@vger.kernel.org
Acked-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Gustavo Padovan <gustavo.padovan@collabora.co.uk>
|
|
This makes it more convenient to check for rfkill (no need to check for
dev->rfkill before calling rfkill_blocked()) and also avoids potential
races if the RFKILL state needs to be checked from within the rfkill
callback.
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Cc: stable@vger.kernel.org
Acked-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Gustavo Padovan <gustavo.padovan@collabora.co.uk>
|
|
DCCP shouldn't be setting sk_err on redirects as it
isn't an error condition. it should be doing exactly
what tcp is doing and leaving the error handler without
touching the socket.
Signed-off-by: Duan Jiong <duanj.fnst@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Included change:
- fix the Bridge Loop Avoidance component by marking the variables containing
the VLAN ID with the HAS_TAG flag when needed.
|
|
This fixes a regression since eb6dc19d8e72ce3a957af5511d20c0db0a8bd007
"RPCSEC_GSS: Share all credential caches on a per-transport basis" which
could cause an occasional oops in the nfsd code (see below).
The problem was that an auth was left referencing a client that had been
freed. To avoid this we need to ensure that auths are shared only
between descendants of a common client; the fact that a clone of an
rpc_client takes a reference on its parent then ensures that the parent
client will last as long as the auth.
Also add a comment explaining what I think was the intention of this
code.
general protection fault: 0000 [#1] PREEMPT SMP
Modules linked in: rpcsec_gss_krb5 nfsd auth_rpcgss oid_registry nfs_acl lockd sunrpc
CPU: 3 PID: 4071 Comm: kworker/u8:2 Not tainted 3.11.0-rc2-00182-g025145f #1665
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
Workqueue: nfsd4_callbacks nfsd4_do_callback_rpc [nfsd]
task: ffff88003e206080 ti: ffff88003c384000 task.ti: ffff88003c384000
RIP: 0010:[<ffffffffa00001f3>] [<ffffffffa00001f3>] rpc_net_ns+0x53/0x70 [sunrpc]
RSP: 0000:ffff88003c385ab8 EFLAGS: 00010246
RAX: 6b6b6b6b6b6b6b6b RBX: ffff88003af9a800 RCX: 0000000000000002
RDX: ffffffffa00001a5 RSI: 0000000000000001 RDI: ffffffff81e284e0
RBP: ffff88003c385ad8 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000015 R12: ffff88003c990840
R13: ffff88003c990878 R14: ffff88003c385ba8 R15: ffff88003e206080
FS: 0000000000000000(0000) GS:ffff88003fd80000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007fcdf737e000 CR3: 000000003ad2b000 CR4: 00000000000006e0
Stack:
ffffffffa00001a5 0000000000000006 0000000000000006 ffff88003af9a800
ffff88003c385b08 ffffffffa00d52a4 ffff88003c385ba8 ffff88003c751bd8
ffff88003c751bc0 ffff88003e113600 ffff88003c385b18 ffffffffa00d530c
Call Trace:
[<ffffffffa00001a5>] ? rpc_net_ns+0x5/0x70 [sunrpc]
[<ffffffffa00d52a4>] __gss_pipe_release+0x54/0x90 [auth_rpcgss]
[<ffffffffa00d530c>] gss_pipe_free+0x2c/0x30 [auth_rpcgss]
[<ffffffffa00d678b>] gss_destroy+0x9b/0xf0 [auth_rpcgss]
[<ffffffffa000de63>] rpcauth_release+0x23/0x30 [sunrpc]
[<ffffffffa0001e81>] rpc_release_client+0x51/0xb0 [sunrpc]
[<ffffffffa00020d5>] rpc_shutdown_client+0xe5/0x170 [sunrpc]
[<ffffffff81098a14>] ? cpuacct_charge+0xa4/0xb0
[<ffffffff81098975>] ? cpuacct_charge+0x5/0xb0
[<ffffffffa019556f>] nfsd4_process_cb_update.isra.17+0x2f/0x210 [nfsd]
[<ffffffff819a4ac0>] ? _raw_spin_unlock_irq+0x30/0x60
[<ffffffff819a4acb>] ? _raw_spin_unlock_irq+0x3b/0x60
[<ffffffff810703ab>] ? process_one_work+0x15b/0x510
[<ffffffffa01957dd>] nfsd4_do_callback_rpc+0x8d/0xa0 [nfsd]
[<ffffffff8107041e>] process_one_work+0x1ce/0x510
[<ffffffff810703ab>] ? process_one_work+0x15b/0x510
[<ffffffff810712ab>] worker_thread+0x11b/0x370
[<ffffffff81071190>] ? manage_workers.isra.24+0x2b0/0x2b0
[<ffffffff8107854b>] kthread+0xdb/0xe0
[<ffffffff819a4ac0>] ? _raw_spin_unlock_irq+0x30/0x60
[<ffffffff81078470>] ? __init_kthread_worker+0x70/0x70
[<ffffffff819ac7dc>] ret_from_fork+0x7c/0xb0
[<ffffffff81078470>] ? __init_kthread_worker+0x70/0x70
Code: a5 01 00 a0 31 d2 31 f6 48 c7 c7 e0 84 e2 81 e8 f4 91 0a e1 48 8b 43 60 48 c7 c2 a5 01 00 a0 be 01 00 00 00 48 c7 c7 e0 84 e2 81 <48> 8b 98 10 07 00 00 e8 91 8f 0a e1 e8
+3c 4e 07 e1 48 83 c4 18
RIP [<ffffffffa00001f3>] rpc_net_ns+0x53/0x70 [sunrpc]
RSP <ffff88003c385ab8>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
|
|
Pablo Neira Ayuso says:
====================
The following patchset contains Netfilter fixes for you net tree,
mostly targeted to ipset, they are:
* Fix ICMPv6 NAT due to wrong comparison, code instead of type, from
Phil Oester.
* Fix RCU race in conntrack extensions release path, from Michal Kubecek.
* Fix missing inversion in the userspace ipset test command match if
the nomatch option is specified, from Jozsef Kadlecsik.
* Skip layer 4 protocol matching in ipset in case of IPv6 fragments,
also from Jozsef Kadlecsik.
* Fix sequence adjustment in nfnetlink_queue due to using the netlink
skb instead of the network skb, from Gao feng.
* Make sure we cannot swap of sets with different layer 3 family in
ipset, from Jozsef Kadlecsik.
* Fix possible bogus matching in ipset if hash sets with net elements
are used, from Oliver Smith.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Commit 1b7fdd2ab5852 ("tcp: do not use cached RTT for RTT estimation")
did not correctly account for the fact that crtt is the RTT shifted
left 3 bits. Fix the calculation to consistently reflect this fact.
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-By: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
When receiving or sending a packet a packet on a VLAN, the
vid has to be marked with the TAG flag in order to make any
component in batman-adv understand that the packet is coming
from a really tagged network.
This fix the Bridge Loop Avoidance behaviour which was not
able to send announces over VLAN interfaces.
Introduced by 0b1da1765fdb00ca5d53bc95c9abc70dfc9aae5b
("batman-adv: change VID semantic in the BLA code")
Signed-off-by: Antonio Quartulli <antonio@open-mesh.org>
Acked-by: Simon Wunderlich <siwu@hrz.tu-chemnitz.de>
Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
|
|
Instead of the netlink skb.
Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
For legacy IPsec anti replay mechanism:
bitmap in struct xfrm_replay_state could only provide a 32 bits
window size limit in current design, thus user level parameter
sadb_sa_replay should honor this limit, otherwise misleading
outputs("replay=244") by setkey -D will be:
192.168.25.2 192.168.22.2
esp mode=transport spi=147561170(0x08cb9ad2) reqid=0(0x00000000)
E: aes-cbc 9a8d7468 7655cf0b 719d27be b0ddaac2
A: hmac-sha1 2d2115c2 ebf7c126 1c54f186 3b139b58 264a7331
seq=0x00000000 replay=244 flags=0x00000000 state=mature
created: Sep 17 14:00:00 2013 current: Sep 17 14:00:22 2013
diff: 22(s) hard: 30(s) soft: 26(s)
last: Sep 17 14:00:00 2013 hard: 0(s) soft: 0(s)
current: 1408(bytes) hard: 0(bytes) soft: 0(bytes)
allocated: 22 hard: 0 soft: 0
sadb_seq=1 pid=4854 refcnt=0
192.168.22.2 192.168.25.2
esp mode=transport spi=255302123(0x0f3799eb) reqid=0(0x00000000)
E: aes-cbc 6485d990 f61a6bd5 e5660252 608ad282
A: hmac-sha1 0cca811a eb4fa893 c47ae56c 98f6e413 87379a88
seq=0x00000000 replay=244 flags=0x00000000 state=mature
created: Sep 17 14:00:00 2013 current: Sep 17 14:00:22 2013
diff: 22(s) hard: 30(s) soft: 26(s)
last: Sep 17 14:00:00 2013 hard: 0(s) soft: 0(s)
current: 1408(bytes) hard: 0(bytes) soft: 0(bytes)
allocated: 22 hard: 0 soft: 0
sadb_seq=0 pid=4854 refcnt=0
And also, optimizing xfrm_replay_check window checking by setting the
desirable x->props.replay_window with only doing the comparison once
for all when xfrm_state is first born.
Signed-off-by: Fan Du <fan.du@windriver.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
|
|
Adapt the same behaviour for SCTP as present in TCP for ICMP redirect
messages. For IPv6, RFC4443, section 2.4. says:
...
(e) An ICMPv6 error message MUST NOT be originated as a result of
receiving the following:
...
(e.2) An ICMPv6 redirect message [IPv6-DISC].
...
Therefore, do not report an error to user space, just invoke dst's redirect
callback and leave, same for IPv4 as done in TCP as well. The implication
w/o having this patch could be that the reception of such packets would
generate a poll notification and in worst case it could even tear down the
whole connection. Therefore, stop updating sk_err on redirects.
Reported-by: Duan Jiong <duanj.fnst@cn.fujitsu.com>
Reported-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Suggested-by: Vlad Yasevich <vyasevich@gmail.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
IFLA_IPTUN_LOCAL and IFLA_IPTUN_REMOTE were inverted.
Introduced by c075b13098b3 (ip6tnl: advertise tunnel param via rtnl).
Signed-off-by: Ding Zhi <zhi.ding@6wind.com>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|