summaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)Author
2023-08-18Merge tag 'net-6.5-rc7' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Including fixes from ipsec and netfilter. No known outstanding regressions. Fixes to fixes: - virtio-net: set queues after driver_ok, avoid a potential race added by recent fix - Revert "vlan: Fix VLAN 0 memory leak", it may lead to a warning when VLAN 0 is registered explicitly - nf_tables: - fix false-positive lockdep splat in recent fixes - don't fail inserts if duplicate has expired (fix test failures) - fix races between garbage collection and netns dismantle Current release - new code bugs: - mlx5: Fix mlx5_cmd_update_root_ft() error flow Previous releases - regressions: - phy: fix IRQ-based wake-on-lan over hibernate / power off Previous releases - always broken: - sock: fix misuse of sk_under_memory_pressure() preventing system from exiting global TCP memory pressure if a single cgroup is under pressure - fix the RTO timer retransmitting skb every 1ms if linear option is enabled - af_key: fix sadb_x_filter validation, amment netlink policy - ipsec: fix slab-use-after-free in decode_session6() - macb: in ZynqMP resume always configure PS GTR for non-wakeup source Misc: - netfilter: set default timeout to 3 secs for sctp shutdown send and recv state (from 300ms), align with protocol timers" * tag 'net-6.5-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (49 commits) ice: Block switchdev mode when ADQ is active and vice versa qede: fix firmware halt over suspend and resume net: do not allow gso_size to be set to GSO_BY_FRAGS sock: Fix misuse of sk_under_memory_pressure() sfc: don't fail probe if MAE/TC setup fails sfc: don't unregister flow_indr if it was never registered net: dsa: mv88e6xxx: Wait for EEPROM done before HW reset net/mlx5: Fix mlx5_cmd_update_root_ft() error flow net/mlx5e: XDP, Fix fifo overrun on XDP_REDIRECT i40e: fix misleading debug logs iavf: fix FDIR rule fields masks validation ipv6: fix indentation of a config attribute mailmap: add entries for Simon Horman broadcom: b44: Use b44_writephy() return value net: openvswitch: reject negative ifindex team: Fix incorrect deletion of ETH_P_8021AD protocol vid from slaves net: phy: broadcom: stub c45 read/write for 54810 netfilter: nft_dynset: disallow object maps netfilter: nf_tables: GC transaction race with netns dismantle netfilter: nf_tables: fix GC transaction races with netns and netlink event exit path ...
2023-08-17sock: Fix misuse of sk_under_memory_pressure()Abel Wu
The status of global socket memory pressure is updated when: a) __sk_mem_raise_allocated(): enter: sk_memory_allocated(sk) > sysctl_mem[1] leave: sk_memory_allocated(sk) <= sysctl_mem[0] b) __sk_mem_reduce_allocated(): leave: sk_under_memory_pressure(sk) && sk_memory_allocated(sk) < sysctl_mem[0] So the conditions of leaving global pressure are inconstant, which may lead to the situation that one pressured net-memcg prevents the global pressure from being cleared when there is indeed no global pressure, thus the global constrains are still in effect unexpectedly on the other sockets. This patch fixes this by ignoring the net-memcg's pressure when deciding whether should leave global memory pressure. Fixes: e1aab161e013 ("socket: initial cgroup code.") Signed-off-by: Abel Wu <wuyun.abel@bytedance.com> Acked-by: Shakeel Butt <shakeelb@google.com> Link: https://lore.kernel.org/r/20230816091226.1542-1-wuyun.abel@bytedance.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-16Merge tag 'nf-23-08-16' of ↵David S. Miller
https://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf Florisn Westphal says: ==================== These are netfilter fixes for the *net* tree. First patch resolves a false-positive lockdep splat: rcu_dereference is used outside of rcu read lock. Let lockdep validate that the transaction mutex is locked. Second patch fixes a kdoc warning added in previous PR. Third patch fixes a memory leak: The catchall element isn't disabled correctly, this allows userspace to deactivate the element again. This results in refcount underflow which in turn prevents memory release. This was always broken since the feature was added in 5.13. Patch 4 fixes an incorrect change in the previous pull request: Adding a duplicate key to a set should work if the duplicate key has expired, restore this behaviour. All from myself. Patch #5 resolves an old historic artifact in sctp conntrack: a 300ms timeout for shutdown_ack. Increase this to 3s. From Xin Long. Patch #6 fixes a sysctl data race in ipvs, two threads can clobber the sysctl value, from Sishuai Gong. This is a day-0 bug that predates git history. Patches 7, 8 and 9, from Pablo Neira Ayuso, are also followups for the previous GC rework in nf_tables: The netlink notifier and the netns exit path must both increment the gc worker seqcount, else worker may encounter stale (free'd) pointers. ================ Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-16ipv6: fix indentation of a config attributePrasad Pandit
Fix indentation of a type attribute of IPV6_VTI config entry. Signed-off-by: Prasad Pandit <pjp@fedoraproject.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-16Merge tag 'ipsec-2023-08-15' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec Steffen Klassert says: ==================== 1) Fix a slab-out-of-bounds read in xfrm_address_filter. From Lin Ma. 2) Fix the pfkey sadb_x_filter validation. From Lin Ma. 3) Use the correct nla_policy structure for XFRMA_SEC_CTX. From Lin Ma. 4) Fix warnings triggerable by bad packets in the encap functions. From Herbert Xu. 5) Fix some slab-use-after-free in decode_session6. From Zhengchao Shao. 6) Fix a possible NULL piointer dereference in xfrm_update_ae_params. Lin Ma. 7) Add a forgotten nla_policy for XFRMA_MTIMER_THRESH. From Lin Ma. 8) Don't leak offloaded policies. From Leon Romanovsky. 9) Delete also the offloading part of an acquire state. From Leon Romanovsky. Please pull or let me know if there are problems.
2023-08-15net: openvswitch: reject negative ifindexJakub Kicinski
Recent changes in net-next (commit 759ab1edb56c ("net: store netdevs in an xarray")) refactored the handling of pre-assigned ifindexes and let syzbot surface a latent problem in ovs. ovs does not validate ifindex, making it possible to create netdev ports with negative ifindex values. It's easy to repro with YNL: $ ./cli.py --spec netlink/specs/ovs_datapath.yaml \ --do new \ --json '{"upcall-pid": 1, "name":"my-dp"}' $ ./cli.py --spec netlink/specs/ovs_vport.yaml \ --do new \ --json '{"upcall-pid": "00000001", "name": "some-port0", "dp-ifindex":3,"ifindex":4294901760,"type":2}' $ ip link show -65536: some-port0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000 link/ether 7a:48:21:ad:0b:fb brd ff:ff:ff:ff:ff:ff ... Validate the inputs. Now the second command correctly returns: $ ./cli.py --spec netlink/specs/ovs_vport.yaml \ --do new \ --json '{"upcall-pid": "00000001", "name": "some-port0", "dp-ifindex":3,"ifindex":4294901760,"type":2}' lib.ynl.NlError: Netlink error: Numerical result out of range nl_len = 108 (92) nl_flags = 0x300 nl_type = 2 error: -34 extack: {'msg': 'integer out of range', 'unknown': [[type:4 len:36] b'\x0c\x00\x02\x00\x00\x00\x00\x00\x00\x00\x00\x00\x0c\x00\x03\x00\xff\xff\xff\x7f\x00\x00\x00\x00\x08\x00\x01\x00\x08\x00\x00\x00'], 'bad-attr': '.ifindex'} Accept 0 since it used to be silently ignored. Fixes: 54c4ef34c4b6 ("openvswitch: allow specifying ifindex of new interfaces") Reported-by: syzbot+7456b5dcf65111553320@syzkaller.appspotmail.com Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Reviewed-by: Aaron Conole <aconole@redhat.com> Link: https://lore.kernel.org/r/20230814203840.2908710-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-16netfilter: nft_dynset: disallow object mapsPablo Neira Ayuso
Do not allow to insert elements from datapath to objects maps. Fixes: 8aeff920dcc9 ("netfilter: nf_tables: add stateful object reference to set elements") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Florian Westphal <fw@strlen.de>
2023-08-16netfilter: nf_tables: GC transaction race with netns dismantlePablo Neira Ayuso
Use maybe_get_net() since GC workqueue might race with netns exit path. Fixes: 5f68718b34a5 ("netfilter: nf_tables: GC transaction API to avoid race with control plane") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Florian Westphal <fw@strlen.de>
2023-08-16netfilter: nf_tables: fix GC transaction races with netns and netlink event ↵Pablo Neira Ayuso
exit path Netlink event path is missing a synchronization point with GC transactions. Add GC sequence number update to netns release path and netlink event path, any GC transaction losing race will be discarded. Fixes: 5f68718b34a5 ("netfilter: nf_tables: GC transaction API to avoid race with control plane") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Florian Westphal <fw@strlen.de>
2023-08-16ipvs: fix racy memcpy in proc_do_sync_thresholdSishuai Gong
When two threads run proc_do_sync_threshold() in parallel, data races could happen between the two memcpy(): Thread-1 Thread-2 memcpy(val, valp, sizeof(val)); memcpy(valp, val, sizeof(val)); This race might mess up the (struct ctl_table *) table->data, so we add a mutex lock to serialize them. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Link: https://lore.kernel.org/netdev/B6988E90-0A1E-4B85-BF26-2DAF6D482433@gmail.com/ Signed-off-by: Sishuai Gong <sishuai.system@gmail.com> Acked-by: Simon Horman <horms@kernel.org> Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Florian Westphal <fw@strlen.de>
2023-08-16netfilter: set default timeout to 3 secs for sctp shutdown send and recv stateXin Long
In SCTP protocol, it is using the same timer (T2 timer) for SHUTDOWN and SHUTDOWN_ACK retransmission. However in sctp conntrack the default timeout value for SCTP_CONNTRACK_SHUTDOWN_ACK_SENT state is 3 secs while it's 300 msecs for SCTP_CONNTRACK_SHUTDOWN_SEND/RECV state. As Paolo Valerio noticed, this might cause unwanted expiration of the ct entry. In my test, with 1s tc netem delay set on the NAT path, after the SHUTDOWN is sent, the sctp ct entry enters SCTP_CONNTRACK_SHUTDOWN_SEND state. However, due to 300ms (too short) delay, when the SHUTDOWN_ACK is sent back from the peer, the sctp ct entry has expired and been deleted, and then the SHUTDOWN_ACK has to be dropped. Also, it is confusing these two sysctl options always show 0 due to all timeout values using sec as unit: net.netfilter.nf_conntrack_sctp_timeout_shutdown_recd = 0 net.netfilter.nf_conntrack_sctp_timeout_shutdown_sent = 0 This patch fixes it by also using 3 secs for sctp shutdown send and recv state in sctp conntrack, which is also RTO.initial value in SCTP protocol. Note that the very short time value for SCTP_CONNTRACK_SHUTDOWN_SEND/RECV was probably used for a rare scenario where SHUTDOWN is sent on 1st path but SHUTDOWN_ACK is replied on 2nd path, then a new connection started immediately on 1st path. So this patch also moves from SHUTDOWN_SEND/RECV to CLOSE when receiving INIT in the ORIGINAL direction. Fixes: 9fb9cbb1082d ("[NETFILTER]: Add nf_conntrack subsystem.") Reported-by: Paolo Valerio <pvalerio@redhat.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Florian Westphal <fw@strlen.de>
2023-08-16netfilter: nf_tables: don't fail inserts if duplicate has expiredFlorian Westphal
nftables selftests fail: run-tests.sh testcases/sets/0044interval_overlap_0 Expected: 0-2 . 0-3, got: W: [FAILED] ./testcases/sets/0044interval_overlap_0: got 1 Insertion must ignore duplicate but expired entries. Moreover, there is a strange asymmetry in nft_pipapo_activate: It refetches the current element, whereas the other ->activate callbacks (bitmap, hash, rhash, rbtree) use elem->priv. Same for .remove: other set implementations take elem->priv, nft_pipapo_remove fetches elem->priv, then does a relookup, remove this. I suspect this was the reason for the change that prompted the removal of the expired check in pipapo_get() in the first place, but skipping exired elements there makes no sense to me, this helper is used for normal get requests, insertions (duplicate check) and deactivate callback. In first two cases expired elements must be skipped. For ->deactivate(), this gets called for DELSETELEM, so it seems to me that expired elements should be skipped as well, i.e. delete request should fail with -ENOENT error. Fixes: 24138933b97b ("netfilter: nf_tables: don't skip expired elements during walk") Signed-off-by: Florian Westphal <fw@strlen.de>
2023-08-16netfilter: nf_tables: deactivate catchall elements in next generationFlorian Westphal
When flushing, individual set elements are disabled in the next generation via the ->flush callback. Catchall elements are not disabled. This is incorrect and may lead to double-deactivations of catchall elements which then results in memory leaks: WARNING: CPU: 1 PID: 3300 at include/net/netfilter/nf_tables.h:1172 nft_map_deactivate+0x549/0x730 CPU: 1 PID: 3300 Comm: nft Not tainted 6.5.0-rc5+ #60 RIP: 0010:nft_map_deactivate+0x549/0x730 [..] ? nft_map_deactivate+0x549/0x730 nf_tables_delset+0xb66/0xeb0 (the warn is due to nft_use_dec() detecting underflow). Fixes: aaa31047a6d2 ("netfilter: nftables: add catch-all set element support") Reported-by: lonial con <kongln9170@gmail.com> Signed-off-by: Florian Westphal <fw@strlen.de>
2023-08-16netfilter: nf_tables: fix kdoc warnings after gc reworkFlorian Westphal
Jakub Kicinski says: We've got some new kdoc warnings here: net/netfilter/nft_set_pipapo.c:1557: warning: Function parameter or member '_set' not described in 'pipapo_gc' net/netfilter/nft_set_pipapo.c:1557: warning: Excess function parameter 'set' description in 'pipapo_gc' include/net/netfilter/nf_tables.h:577: warning: Function parameter or member 'dead' not described in 'nft_set' Fixes: 5f68718b34a5 ("netfilter: nf_tables: GC transaction API to avoid race with control plane") Fixes: f6c383b8c31a ("netfilter: nf_tables: adapt set backend to use GC transaction API") Reported-by: Jakub Kicinski <kuba@kernel.org> Closes: https://lore.kernel.org/netdev/20230810104638.746e46f1@kernel.org/ Signed-off-by: Florian Westphal <fw@strlen.de>
2023-08-16netfilter: nf_tables: fix false-positive lockdep splatFlorian Westphal
->abort invocation may cause splat on debug kernels: WARNING: suspicious RCU usage net/netfilter/nft_set_pipapo.c:1697 suspicious rcu_dereference_check() usage! [..] rcu_scheduler_active = 2, debug_locks = 1 1 lock held by nft/133554: [..] (nft_net->commit_mutex){+.+.}-{3:3}, at: nf_tables_valid_genid [..] lockdep_rcu_suspicious+0x1ad/0x260 nft_pipapo_abort+0x145/0x180 __nf_tables_abort+0x5359/0x63d0 nf_tables_abort+0x24/0x40 nfnetlink_rcv+0x1a0a/0x22c0 netlink_unicast+0x73c/0x900 netlink_sendmsg+0x7f0/0xc20 ____sys_sendmsg+0x48d/0x760 Transaction mutex is held, so parallel updates are not possible. Switch to _protected and check mutex is held for lockdep enabled builds. Fixes: 212ed75dc5fb ("netfilter: nf_tables: integrate pipapo into commit protocol") Signed-off-by: Florian Westphal <fw@strlen.de>
2023-08-15net: fix the RTO timer retransmitting skb every 1ms if linear option is enabledJason Xing
In the real workload, I encountered an issue which could cause the RTO timer to retransmit the skb per 1ms with linear option enabled. The amount of lost-retransmitted skbs can go up to 1000+ instantly. The root cause is that if the icsk_rto happens to be zero in the 6th round (which is the TCP_THIN_LINEAR_RETRIES value), then it will always be zero due to the changed calculation method in tcp_retransmit_timer() as follows: icsk->icsk_rto = min(icsk->icsk_rto << 1, TCP_RTO_MAX); Above line could be converted to icsk->icsk_rto = min(0 << 1, TCP_RTO_MAX) = 0 Therefore, the timer expires so quickly without any doubt. I read through the RFC 6298 and found that the RTO value can be rounded up to a certain value, in Linux, say TCP_RTO_MIN as default, which is regarded as the lower bound in this patch as suggested by Eric. Fixes: 36e31b0af587 ("net: TCP thin linear timeouts") Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jason Xing <kernelxing@tencent.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-14sunrpc: set the bv_offset of first bvec in svc_tcp_sendmsgJeff Layton
svc_tcp_sendmsg used to factor in the xdr->page_base when sending pages, but commit 5df5dd03a8f7 ("sunrpc: Use sendmsg(MSG_SPLICE_PAGES) rather then sendpage") dropped that part of the handling. Fix it by setting the bv_offset of the first bvec. Fixes: 5df5dd03a8f7 ("sunrpc: Use sendmsg(MSG_SPLICE_PAGES) rather then sendpage") Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2023-08-14Revert "vlan: Fix VLAN 0 memory leak"Vlad Buslov
This reverts commit 718cb09aaa6fa78cc8124e9517efbc6c92665384. The commit triggers multiple syzbot issues, probably due to possibility of manually creating VLAN 0 on netdevice which will cause the code to delete it since it can't distinguish such VLAN from implicit VLAN 0 automatically created for devices with NETIF_F_HW_VLAN_CTAG_FILTER feature. Reported-by: syzbot+662f783a5cdf3add2719@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/00000000000090196d0602a6167d@google.com/ Reported-by: syzbot+4b4f06495414e92701d5@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/00000000000096ae870602a61602@google.com/ Reported-by: syzbot+d810d3cd45ed1848c3f7@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/0000000000009f0f9c0602a616ce@google.com/ Fixes: 718cb09aaa6f ("vlan: Fix VLAN 0 memory leak") Signed-off-by: Vlad Buslov <vladbu@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-10Merge tag 'net-6.5-rc6' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Including fixes from netfilter, wireless and bpf. Still trending up in size but the good news is that the "current" regressions are resolved, AFAIK. We're getting weirdly many fixes for Wake-on-LAN and suspend/resume handling on embedded this week (most not merged yet), not sure why. But those are all for older bugs. Current release - regressions: - tls: set MSG_SPLICE_PAGES consistently when handing encrypted data over to TCP Current release - new code bugs: - eth: mlx5: correct IDs on VFs internal to the device (IPU) Previous releases - regressions: - phy: at803x: fix WoL support / reporting on AR8032 - bonding: fix incorrect deletion of ETH_P_8021AD protocol VID from slaves, leading to BUG_ON() - tun: prevent tun_build_skb() from exceeding the packet size limit - wifi: rtw89: fix 8852AE disconnection caused by RX full flags - eth/PCI: enetc: fix probing after 6fffbc7ae137 ("PCI: Honor firmware's device disabled status"), keep PCI devices around even if they are disabled / not going to be probed to be able to apply quirks on them - eth: prestera: fix handling IPv4 routes with nexthop IDs Previous releases - always broken: - netfilter: re-work garbage collection to avoid races between user-facing API and timeouts - tunnels: fix generating ipv4 PMTU error on non-linear skbs - nexthop: fix infinite nexthop bucket dump when using maximum nexthop ID - wifi: nl80211: fix integer overflow in nl80211_parse_mbssid_elems() Misc: - unix: use consistent error code in SO_PEERPIDFD - ipv6: adjust ndisc_is_useropt() to include PREFIX_INFO, in prep for upcoming IETF RFC" * tag 'net-6.5-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (94 commits) net: hns3: fix strscpy causing content truncation issue net: tls: set MSG_SPLICE_PAGES consistently ibmvnic: Ensure login failure recovery is safe from other resets ibmvnic: Do partial reset on login failure ibmvnic: Handle DMA unmapping of login buffs in release functions ibmvnic: Unmap DMA login rsp buffer on send login fail ibmvnic: Enforce stronger sanity checks on login response net: mana: Fix MANA VF unload when hardware is unresponsive netfilter: nf_tables: remove busy mark and gc batch API netfilter: nft_set_hash: mark set element as dead when deleting from packet path netfilter: nf_tables: adapt set backend to use GC transaction API netfilter: nf_tables: GC transaction API to avoid race with control plane selftests/bpf: Add sockmap test for redirecting partial skb data selftests/bpf: fix a CI failure caused by vsock sockmap test bpf, sockmap: Fix bug that strp_done cannot be called bpf, sockmap: Fix map type error in sock_map_del_link xsk: fix refcount underflow in error path ipv6: adjust ndisc_is_useropt() to also return true for PIO selftests: forwarding: bridge_mdb: Make test more robust selftests: forwarding: bridge_mdb_max: Fix failing test with old libnet ...
2023-08-10net: tls: set MSG_SPLICE_PAGES consistentlyJakub Kicinski
We used to change the flags for the last segment, because non-last segments had the MSG_SENDPAGE_NOTLAST flag set. That flag is no longer a thing so remove the setting. Since flags most likely don't have MSG_SPLICE_PAGES set this avoids passing parts of the sg as splice and parts as non-splice. Before commit under Fixes we'd have called tcp_sendpage() which would add the MSG_SPLICE_PAGES. Why this leads to trouble remains unclear but Tariq reports hitting the WARN_ON(!sendpage_ok()) due to page refcount of 0. Fixes: e117dcfd646e ("tls: Inline do_tcp_sendpages()") Reported-by: Tariq Toukan <tariqt@nvidia.com> Link: https://lore.kernel.org/all/4c49176f-147a-4283-f1b1-32aac7b4b996@gmail.com/ Tested-by: Tariq Toukan <tariqt@nvidia.com> Link: https://lore.kernel.org/r/20230808180917.1243540-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-10Merge tag 'nf-23-08-10' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf Pablo Neira Ayuso says: ==================== Netfilter fixes for net The existing attempt to resolve races between control plane and GC work is error prone, as reported by Bien Pham <phamnnb@sea.com>, some places forgot to call nft_set_elem_mark_busy(), leading to double-deactivation of elements. This series contains the following patches: 1) Do not skip expired elements during walk otherwise elements might never decrement the reference counter on data, leading to memleak. 2) Add a GC transaction API to replace the former attempt to deal with races between control plane and GC. GC worker sets on NFT_SET_ELEM_DEAD_BIT on elements and it creates a GC transaction to remove the expired elements, GC transaction could abort in case of interference with control plane and retried later (GC async). Set backends such as rbtree and pipapo also perform GC from control plane (GC sync), in such case, element deactivation and removal is safe because mutex is held then collected elements are released via call_rcu(). 3) Adapt existing set backends to use the GC transaction API. 4) Update rhash set backend to set on _DEAD bit to report deleted elements from datapath for GC. 5) Remove old GC batch API and the NFT_SET_ELEM_BUSY_BIT. * tag 'nf-23-08-10' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf: netfilter: nf_tables: remove busy mark and gc batch API netfilter: nft_set_hash: mark set element as dead when deleting from packet path netfilter: nf_tables: adapt set backend to use GC transaction API netfilter: nf_tables: GC transaction API to avoid race with control plane netfilter: nf_tables: don't skip expired elements during walk ==================== Link: https://lore.kernel.org/r/20230810070830.24064-1-pablo@netfilter.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-10Merge tag 'for-netdev' of ↵Jakub Kicinski
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf Martin KaFai Lau says: ==================== pull-request: bpf 2023-08-09 We've added 5 non-merge commits during the last 7 day(s) which contain a total of 6 files changed, 102 insertions(+), 8 deletions(-). The main changes are: 1) A bpf sockmap memleak fix and a fix in accessing the programs of a sockmap under the incorrect map type from Xu Kuohai. 2) A refcount underflow fix in xsk from Magnus Karlsson. * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: selftests/bpf: Add sockmap test for redirecting partial skb data selftests/bpf: fix a CI failure caused by vsock sockmap test bpf, sockmap: Fix bug that strp_done cannot be called bpf, sockmap: Fix map type error in sock_map_del_link xsk: fix refcount underflow in error path ==================== Link: https://lore.kernel.org/r/20230810055303.120917-1-martin.lau@linux.dev Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-10netfilter: nf_tables: remove busy mark and gc batch APIPablo Neira Ayuso
Ditch it, it has been replace it by the GC transaction API and it has no clients anymore. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-08-10netfilter: nft_set_hash: mark set element as dead when deleting from packet pathPablo Neira Ayuso
Set on the NFT_SET_ELEM_DEAD_BIT flag on this element, instead of performing element removal which might race with an ongoing transaction. Enable gc when dynamic flag is set on since dynset deletion requires garbage collection after this patch. Fixes: d0a8d877da97 ("netfilter: nft_dynset: support for element deletion") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-08-10netfilter: nf_tables: adapt set backend to use GC transaction APIPablo Neira Ayuso
Use the GC transaction API to replace the old and buggy gc API and the busy mark approach. No set elements are removed from async garbage collection anymore, instead the _DEAD bit is set on so the set element is not visible from lookup path anymore. Async GC enqueues transaction work that might be aborted and retried later. rbtree and pipapo set backends does not set on the _DEAD bit from the sync GC path since this runs in control plane path where mutex is held. In this case, set elements are deactivated, removed and then released via RCU callback, sync GC never fails. Fixes: 3c4287f62044 ("nf_tables: Add set type for arbitrary concatenation of ranges") Fixes: 8d8540c4f5e0 ("netfilter: nft_set_rbtree: add timeout support") Fixes: 9d0982927e79 ("netfilter: nft_hash: add support for timeouts") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-08-10netfilter: nf_tables: GC transaction API to avoid race with control planePablo Neira Ayuso
The set types rhashtable and rbtree use a GC worker to reclaim memory. From system work queue, in periodic intervals, a scan of the table is done. The major caveat here is that the nft transaction mutex is not held. This causes a race between control plane and GC when they attempt to delete the same element. We cannot grab the netlink mutex from the work queue, because the control plane has to wait for the GC work queue in case the set is to be removed, so we get following deadlock: cpu 1 cpu2 GC work transaction comes in , lock nft mutex `acquire nft mutex // BLOCKS transaction asks to remove the set set destruction calls cancel_work_sync() cancel_work_sync will now block forever, because it is waiting for the mutex the caller already owns. This patch adds a new API that deals with garbage collection in two steps: 1) Lockless GC of expired elements sets on the NFT_SET_ELEM_DEAD_BIT so they are not visible via lookup. Annotate current GC sequence in the GC transaction. Enqueue GC transaction work as soon as it is full. If ruleset is updated, then GC transaction is aborted and retried later. 2) GC work grabs the mutex. If GC sequence has changed then this GC transaction lost race with control plane, abort it as it contains stale references to objects and let GC try again later. If the ruleset is intact, then this GC transaction deactivates and removes the elements and it uses call_rcu() to destroy elements. Note that no elements are removed from GC lockless path, the _DEAD bit is set and pointers are collected. GC catchall does not remove the elements anymore too. There is a new set->dead flag that is set on to abort the GC transaction to deal with set->ops->destroy() path which removes the remaining elements in the set from commit_release, where no mutex is held. To deal with GC when mutex is held, which allows safe deactivate and removal, add sync GC API which releases the set element object via call_rcu(). This is used by rbtree and pipapo backends which also perform garbage collection from control plane path. Since element removal from sets can happen from control plane and element garbage collection/timeout, it is necessary to keep the set structure alive until all elements have been deactivated and destroyed. We cannot do a cancel_work_sync or flush_work in nft_set_destroy because its called with the transaction mutex held, but the aforementioned async work queue might be blocked on the very mutex that nft_set_destroy() callchain is sitting on. This gives us the choice of ABBA deadlock or UaF. To avoid both, add set->refs refcount_t member. The GC API can then increment the set refcount and release it once the elements have been free'd. Set backends are adapted to use the GC transaction API in a follow up patch entitled: ("netfilter: nf_tables: use gc transaction API in set backends") This is joint work with Florian Westphal. Fixes: cfed7e1b1f8e ("netfilter: nf_tables: add set garbage collection helpers") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-08-09bpf, sockmap: Fix bug that strp_done cannot be calledXu Kuohai
strp_done is only called when psock->progs.stream_parser is not NULL, but stream_parser was set to NULL by sk_psock_stop_strp(), called by sk_psock_drop() earlier. So, strp_done can never be called. Introduce SK_PSOCK_RX_ENABLED to mark whether there is strp on psock. Change the condition for calling strp_done from judging whether stream_parser is set to judging whether this flag is set. This flag is only set once when strp_init() succeeds, and will never be cleared later. Fixes: c0d95d3380ee ("bpf, sockmap: Re-evaluate proto ops when psock is removed from sockmap") Signed-off-by: Xu Kuohai <xukuohai@huawei.com> Reviewed-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/r/20230804073740.194770-3-xukuohai@huaweicloud.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-08-09bpf, sockmap: Fix map type error in sock_map_del_linkXu Kuohai
sock_map_del_link() operates on both SOCKMAP and SOCKHASH, although both types have member named "progs", the offset of "progs" member in these two types is different, so "progs" should be accessed with the real map type. Fixes: 604326b41a6f ("bpf, sockmap: convert to generic sk_msg interface") Signed-off-by: Xu Kuohai <xukuohai@huawei.com> Reviewed-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/r/20230804073740.194770-2-xukuohai@huaweicloud.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-08-09xsk: fix refcount underflow in error pathMagnus Karlsson
Fix a refcount underflow problem reported by syzbot that can happen when a system is running out of memory. If xp_alloc_tx_descs() fails, and it can only fail due to not having enough memory, then the error path is triggered. In this error path, the refcount of the pool is decremented as it has incremented before. However, the reference to the pool in the socket was not nulled. This means that when the socket is closed later, the socket teardown logic will think that there is a pool attached to the socket and try to decrease the refcount again, leading to a refcount underflow. I chose this fix as it involved adding just a single line. Another option would have been to move xp_get_pool() and the assignment of xs->pool to after the if-statement and using xs_umem->pool instead of xs->pool in the whole if-statement resulting in somewhat simpler code, but this would have led to much more churn in the code base perhaps making it harder to backport. Fixes: ba3beec2ec1d ("xsk: Fix possible crash when multiple sockets are created") Reported-by: syzbot+8ada0057e69293a05fd4@syzkaller.appspotmail.com Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com> Link: https://lore.kernel.org/r/20230809142843.13944-1-magnus.karlsson@gmail.com Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
2023-08-09ipv6: adjust ndisc_is_useropt() to also return true for PIOMaciej Żenczykowski
The upcoming (and nearly finalized): https://datatracker.ietf.org/doc/draft-collink-6man-pio-pflag/ will update the IPv6 RA to include a new flag in the PIO field, which will serve as a hint to perform DHCPv6-PD. As we don't want DHCPv6 related logic inside the kernel, this piece of information needs to be exposed to userspace. The simplest option is to simply expose the entire PIO through the already existing mechanism. Even without this new flag, the already existing PIO R (router address) flag (from RFC6275) cannot AFAICT be handled entirely in kernel, and provides useful information that should be exposed to userspace (the router's global address, for use by Mobile IPv6). Also cc'ing stable@ for inclusion in LTS, as while technically this is not quite a bugfix, and instead more of a feature, it is absolutely trivial and the alternative is manually cherrypicking into all Android Common Kernel trees - and I know Greg will ask for it to be sent in via LTS instead... Cc: Jen Linkova <furry@google.com> Cc: Lorenzo Colitti <lorenzo@google.com> Cc: David Ahern <dsahern@gmail.com> Cc: YOSHIFUJI Hideaki / 吉藤英明 <yoshfuji@linux-ipv6.org> Cc: stable@vger.kernel.org Signed-off-by: Maciej Żenczykowski <maze@google.com> Link: https://lore.kernel.org/r/20230807102533.1147559-1-maze@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-09Merge tag 'wireless-2023-08-09' of ↵Jakub Kicinski
git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless Johannes Berg says: ==================== Just a few small updates: * fix an integer overflow in nl80211 * fix rtw89 8852AE disconnections * fix a buffer overflow in ath12k * fix AP_VLAN configuration lookups * fix allocation failure handling in brcm80211 * update MAINTAINERS for some drivers * tag 'wireless-2023-08-09' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless: wifi: ath12k: Fix buffer overflow when scanning with extraie wifi: nl80211: fix integer overflow in nl80211_parse_mbssid_elems() wifi: cfg80211: fix sband iftype data lookup for AP_VLAN wifi: rtw89: fix 8852AE disconnection caused by RX full flags MAINTAINERS: Remove tree entry for rtl8180 MAINTAINERS: Update entry for rtl8187 wifi: brcm80211: handle params_v1 allocation failure ==================== Link: https://lore.kernel.org/r/20230809124818.167432-2-johannes@sipsolutions.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-09nexthop: Fix infinite nexthop bucket dump when using maximum nexthop IDIdo Schimmel
A netlink dump callback can return a positive number to signal that more information needs to be dumped or zero to signal that the dump is complete. In the second case, the core netlink code will append the NLMSG_DONE message to the skb in order to indicate to user space that the dump is complete. The nexthop bucket dump callback always returns a positive number if nexthop buckets were filled in the provided skb, even if the dump is complete. This means that a dump will span at least two recvmsg() calls as long as nexthop buckets are present. In the last recvmsg() call the dump callback will not fill in any nexthop buckets because the previous call indicated that the dump should restart from the last dumped nexthop ID plus one. # ip link add name dummy1 up type dummy # ip nexthop add id 1 dev dummy1 # ip nexthop add id 10 group 1 type resilient buckets 2 # strace -e sendto,recvmsg -s 5 ip nexthop bucket sendto(3, [[{nlmsg_len=24, nlmsg_type=RTM_GETNEXTHOPBUCKET, nlmsg_flags=NLM_F_REQUEST|NLM_F_DUMP, nlmsg_seq=1691396980, nlmsg_pid=0}, {family=AF_UNSPEC, data="\x00\x00\x00\x00\x00"...}], {nlmsg_len=0, nlmsg_type=0 /* NLMSG_??? */, nlmsg_flags=0, nlmsg_seq=0, nlmsg_pid=0}], 152, 0, NULL, 0) = 152 recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=NULL, iov_len=0}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_TRUNC}, MSG_PEEK|MSG_TRUNC) = 128 recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=[[{nlmsg_len=64, nlmsg_type=RTM_NEWNEXTHOPBUCKET, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=1691396980, nlmsg_pid=347}, {family=AF_UNSPEC, data="\x00\x00\x00\x00\x00"...}], [{nlmsg_len=64, nlmsg_type=RTM_NEWNEXTHOPBUCKET, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=1691396980, nlmsg_pid=347}, {family=AF_UNSPEC, data="\x00\x00\x00\x00\x00"...}]], iov_len=32768}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 128 id 10 index 0 idle_time 6.66 nhid 1 id 10 index 1 idle_time 6.66 nhid 1 recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=NULL, iov_len=0}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_TRUNC}, MSG_PEEK|MSG_TRUNC) = 20 recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=[{nlmsg_len=20, nlmsg_type=NLMSG_DONE, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=1691396980, nlmsg_pid=347}, 0], iov_len=32768}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 20 +++ exited with 0 +++ This behavior is both inefficient and buggy. If the last nexthop to be dumped had the maximum ID of 0xffffffff, then the dump will restart from 0 (0xffffffff + 1) and never end: # ip link add name dummy1 up type dummy # ip nexthop add id 1 dev dummy1 # ip nexthop add id $((2**32-1)) group 1 type resilient buckets 2 # ip nexthop bucket id 4294967295 index 0 idle_time 5.55 nhid 1 id 4294967295 index 1 idle_time 5.55 nhid 1 id 4294967295 index 0 idle_time 5.55 nhid 1 id 4294967295 index 1 idle_time 5.55 nhid 1 [...] Fix by adjusting the dump callback to return zero when the dump is complete. After the fix only one recvmsg() call is made and the NLMSG_DONE message is appended to the RTM_NEWNEXTHOPBUCKET responses: # ip link add name dummy1 up type dummy # ip nexthop add id 1 dev dummy1 # ip nexthop add id $((2**32-1)) group 1 type resilient buckets 2 # strace -e sendto,recvmsg -s 5 ip nexthop bucket sendto(3, [[{nlmsg_len=24, nlmsg_type=RTM_GETNEXTHOPBUCKET, nlmsg_flags=NLM_F_REQUEST|NLM_F_DUMP, nlmsg_seq=1691396737, nlmsg_pid=0}, {family=AF_UNSPEC, data="\x00\x00\x00\x00\x00"...}], {nlmsg_len=0, nlmsg_type=0 /* NLMSG_??? */, nlmsg_flags=0, nlmsg_seq=0, nlmsg_pid=0}], 152, 0, NULL, 0) = 152 recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=NULL, iov_len=0}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_TRUNC}, MSG_PEEK|MSG_TRUNC) = 148 recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=[[{nlmsg_len=64, nlmsg_type=RTM_NEWNEXTHOPBUCKET, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=1691396737, nlmsg_pid=350}, {family=AF_UNSPEC, data="\x00\x00\x00\x00\x00"...}], [{nlmsg_len=64, nlmsg_type=RTM_NEWNEXTHOPBUCKET, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=1691396737, nlmsg_pid=350}, {family=AF_UNSPEC, data="\x00\x00\x00\x00\x00"...}], [{nlmsg_len=20, nlmsg_type=NLMSG_DONE, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=1691396737, nlmsg_pid=350}, 0]], iov_len=32768}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 148 id 4294967295 index 0 idle_time 6.61 nhid 1 id 4294967295 index 1 idle_time 6.61 nhid 1 +++ exited with 0 +++ Note that if the NLMSG_DONE message cannot be appended because of size limitations, then another recvmsg() will be needed, but the core netlink code will not invoke the dump callback and simply reply with a NLMSG_DONE message since it knows that the callback previously returned zero. Add a test that fails before the fix: # ./fib_nexthops.sh -t basic_res [...] TEST: Maximum nexthop ID dump [FAIL] [...] And passes after it: # ./fib_nexthops.sh -t basic_res [...] TEST: Maximum nexthop ID dump [ OK ] [...] Fixes: 8a1bbabb034d ("nexthop: Add netlink handlers for bucket dump") Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20230808075233.3337922-4-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-09nexthop: Make nexthop bucket dump more efficientIdo Schimmel
rtm_dump_nexthop_bucket_nh() is used to dump nexthop buckets belonging to a specific resilient nexthop group. The function returns a positive return code (the skb length) upon both success and failure. The above behavior is problematic. When a complete nexthop bucket dump is requested, the function that walks the different nexthops treats the non-zero return code as an error. This causes buckets belonging to different resilient nexthop groups to be dumped using different buffers even if they can all fit in the same buffer: # ip link add name dummy1 up type dummy # ip nexthop add id 1 dev dummy1 # ip nexthop add id 10 group 1 type resilient buckets 1 # ip nexthop add id 20 group 1 type resilient buckets 1 # strace -e recvmsg -s 0 ip nexthop bucket [...] recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[...], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 64 id 10 index 0 idle_time 10.27 nhid 1 [...] recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[...], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 64 id 20 index 0 idle_time 6.44 nhid 1 [...] Fix by only returning a non-zero return code when an error occurred and restarting the dump from the bucket index we failed to fill in. This allows buckets belonging to different resilient nexthop groups to be dumped using the same buffer: # ip link add name dummy1 up type dummy # ip nexthop add id 1 dev dummy1 # ip nexthop add id 10 group 1 type resilient buckets 1 # ip nexthop add id 20 group 1 type resilient buckets 1 # strace -e recvmsg -s 0 ip nexthop bucket [...] recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[...], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 128 id 10 index 0 idle_time 30.21 nhid 1 id 20 index 0 idle_time 26.7 nhid 1 [...] While this change is more of a performance improvement change than an actual bug fix, it is a prerequisite for a subsequent patch that does fix a bug. Fixes: 8a1bbabb034d ("nexthop: Add netlink handlers for bucket dump") Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20230808075233.3337922-3-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-09nexthop: Fix infinite nexthop dump when using maximum nexthop IDIdo Schimmel
A netlink dump callback can return a positive number to signal that more information needs to be dumped or zero to signal that the dump is complete. In the second case, the core netlink code will append the NLMSG_DONE message to the skb in order to indicate to user space that the dump is complete. The nexthop dump callback always returns a positive number if nexthops were filled in the provided skb, even if the dump is complete. This means that a dump will span at least two recvmsg() calls as long as nexthops are present. In the last recvmsg() call the dump callback will not fill in any nexthops because the previous call indicated that the dump should restart from the last dumped nexthop ID plus one. # ip nexthop add id 1 blackhole # strace -e sendto,recvmsg -s 5 ip nexthop sendto(3, [[{nlmsg_len=24, nlmsg_type=RTM_GETNEXTHOP, nlmsg_flags=NLM_F_REQUEST|NLM_F_DUMP, nlmsg_seq=1691394315, nlmsg_pid=0}, {nh_family=AF_UNSPEC, nh_scope=RT_SCOPE_UNIVERSE, nh_protocol=RTPROT_UNSPEC, nh_flags=0}], {nlmsg_len=0, nlmsg_type=0 /* NLMSG_??? */, nlmsg_flags=0, nlmsg_seq=0, nlmsg_pid=0}], 152, 0, NULL, 0) = 152 recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=NULL, iov_len=0}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_TRUNC}, MSG_PEEK|MSG_TRUNC) = 36 recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=[{nlmsg_len=36, nlmsg_type=RTM_NEWNEXTHOP, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=1691394315, nlmsg_pid=343}, {nh_family=AF_INET, nh_scope=RT_SCOPE_UNIVERSE, nh_protocol=RTPROT_UNSPEC, nh_flags=0}, [[{nla_len=8, nla_type=NHA_ID}, 1], {nla_len=4, nla_type=NHA_BLACKHOLE}]], iov_len=32768}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 36 id 1 blackhole recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=NULL, iov_len=0}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_TRUNC}, MSG_PEEK|MSG_TRUNC) = 20 recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=[{nlmsg_len=20, nlmsg_type=NLMSG_DONE, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=1691394315, nlmsg_pid=343}, 0], iov_len=32768}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 20 +++ exited with 0 +++ This behavior is both inefficient and buggy. If the last nexthop to be dumped had the maximum ID of 0xffffffff, then the dump will restart from 0 (0xffffffff + 1) and never end: # ip nexthop add id $((2**32-1)) blackhole # ip nexthop id 4294967295 blackhole id 4294967295 blackhole [...] Fix by adjusting the dump callback to return zero when the dump is complete. After the fix only one recvmsg() call is made and the NLMSG_DONE message is appended to the RTM_NEWNEXTHOP response: # ip nexthop add id $((2**32-1)) blackhole # strace -e sendto,recvmsg -s 5 ip nexthop sendto(3, [[{nlmsg_len=24, nlmsg_type=RTM_GETNEXTHOP, nlmsg_flags=NLM_F_REQUEST|NLM_F_DUMP, nlmsg_seq=1691394080, nlmsg_pid=0}, {nh_family=AF_UNSPEC, nh_scope=RT_SCOPE_UNIVERSE, nh_protocol=RTPROT_UNSPEC, nh_flags=0}], {nlmsg_len=0, nlmsg_type=0 /* NLMSG_??? */, nlmsg_flags=0, nlmsg_seq=0, nlmsg_pid=0}], 152, 0, NULL, 0) = 152 recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=NULL, iov_len=0}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_TRUNC}, MSG_PEEK|MSG_TRUNC) = 56 recvmsg(3, {msg_name={sa_family=AF_NETLINK, nl_pid=0, nl_groups=00000000}, msg_namelen=12, msg_iov=[{iov_base=[[{nlmsg_len=36, nlmsg_type=RTM_NEWNEXTHOP, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=1691394080, nlmsg_pid=342}, {nh_family=AF_INET, nh_scope=RT_SCOPE_UNIVERSE, nh_protocol=RTPROT_UNSPEC, nh_flags=0}, [[{nla_len=8, nla_type=NHA_ID}, 4294967295], {nla_len=4, nla_type=NHA_BLACKHOLE}]], [{nlmsg_len=20, nlmsg_type=NLMSG_DONE, nlmsg_flags=NLM_F_MULTI, nlmsg_seq=1691394080, nlmsg_pid=342}, 0]], iov_len=32768}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 56 id 4294967295 blackhole +++ exited with 0 +++ Note that if the NLMSG_DONE message cannot be appended because of size limitations, then another recvmsg() will be needed, but the core netlink code will not invoke the dump callback and simply reply with a NLMSG_DONE message since it knows that the callback previously returned zero. Add a test that fails before the fix: # ./fib_nexthops.sh -t basic [...] TEST: Maximum nexthop ID dump [FAIL] [...] And passes after it: # ./fib_nexthops.sh -t basic [...] TEST: Maximum nexthop ID dump [ OK ] [...] Fixes: ab84be7e54fc ("net: Initial nexthop code") Reported-by: Petr Machata <petrm@nvidia.com> Closes: https://lore.kernel.org/netdev/87sf91enuf.fsf@nvidia.com/ Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20230808075233.3337922-2-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-09vlan: Fix VLAN 0 memory leakVlad Buslov
The referenced commit intended to fix memleak of VLAN 0 that is implicitly created on devices with NETIF_F_HW_VLAN_CTAG_FILTER feature. However, it doesn't take into account that the feature can be re-set during the netdevice lifetime which will cause memory leak if feature is disabled during the device deletion as illustrated by [0]. Fix the leak by unconditionally deleting VLAN 0 on NETDEV_DOWN event. [0]: > modprobe 8021q > ip l set dev eth2 up > ethtool -K eth2 rx-vlan-filter off > modprobe -r mlx5_ib > modprobe -r mlx5_core > cat /sys/kernel/debug/kmemleak unreferenced object 0xffff888103dcd900 (size 256): comm "ip", pid 1490, jiffies 4294907305 (age 325.364s) hex dump (first 32 bytes): 00 80 5d 03 81 88 ff ff 00 00 00 00 00 00 00 00 ..]............. 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace: [<00000000899f3bb9>] kmalloc_trace+0x25/0x80 [<000000002889a7a2>] vlan_vid_add+0xa0/0x210 [<000000007177800e>] vlan_device_event+0x374/0x760 [8021q] [<000000009a0716b1>] notifier_call_chain+0x35/0xb0 [<00000000bbf3d162>] __dev_notify_flags+0x58/0xf0 [<0000000053d2b05d>] dev_change_flags+0x4d/0x60 [<00000000982807e9>] do_setlink+0x28d/0x10a0 [<0000000058c1be00>] __rtnl_newlink+0x545/0x980 [<00000000e66c3bd9>] rtnl_newlink+0x44/0x70 [<00000000a2cc5970>] rtnetlink_rcv_msg+0x29c/0x390 [<00000000d307d1e4>] netlink_rcv_skb+0x54/0x100 [<00000000259d16f9>] netlink_unicast+0x1f6/0x2c0 [<000000007ce2afa1>] netlink_sendmsg+0x232/0x4a0 [<00000000f3f4bb39>] sock_sendmsg+0x38/0x60 [<000000002f9c0624>] ____sys_sendmsg+0x1e3/0x200 [<00000000d6ff5520>] ___sys_sendmsg+0x80/0xc0 unreferenced object 0xffff88813354fde0 (size 32): comm "ip", pid 1490, jiffies 4294907305 (age 325.364s) hex dump (first 32 bytes): a0 d9 dc 03 81 88 ff ff a0 d9 dc 03 81 88 ff ff ................ 81 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace: [<00000000899f3bb9>] kmalloc_trace+0x25/0x80 [<000000002da64724>] vlan_vid_add+0xdf/0x210 [<000000007177800e>] vlan_device_event+0x374/0x760 [8021q] [<000000009a0716b1>] notifier_call_chain+0x35/0xb0 [<00000000bbf3d162>] __dev_notify_flags+0x58/0xf0 [<0000000053d2b05d>] dev_change_flags+0x4d/0x60 [<00000000982807e9>] do_setlink+0x28d/0x10a0 [<0000000058c1be00>] __rtnl_newlink+0x545/0x980 [<00000000e66c3bd9>] rtnl_newlink+0x44/0x70 [<00000000a2cc5970>] rtnetlink_rcv_msg+0x29c/0x390 [<00000000d307d1e4>] netlink_rcv_skb+0x54/0x100 [<00000000259d16f9>] netlink_unicast+0x1f6/0x2c0 [<000000007ce2afa1>] netlink_sendmsg+0x232/0x4a0 [<00000000f3f4bb39>] sock_sendmsg+0x38/0x60 [<000000002f9c0624>] ____sys_sendmsg+0x1e3/0x200 [<00000000d6ff5520>] ___sys_sendmsg+0x80/0xc0 Fixes: efc73f4bbc23 ("net: Fix memory leak - vlan_info struct") Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Vlad Buslov <vladbu@nvidia.com> Link: https://lore.kernel.org/r/20230808093521.1468929-1-vladbu@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-09wifi: nl80211: fix integer overflow in nl80211_parse_mbssid_elems()Keith Yeo
nl80211_parse_mbssid_elems() uses a u8 variable num_elems to count the number of MBSSID elements in the nested netlink attribute attrs, which can lead to an integer overflow if a user of the nl80211 interface specifies 256 or more elements in the corresponding attribute in userspace. The integer overflow can lead to a heap buffer overflow as num_elems determines the size of the trailing array in elems, and this array is thereafter written to for each element in attrs. Note that this vulnerability only affects devices with the wiphy->mbssid_max_interfaces member set for the wireless physical device struct in the device driver, and can only be triggered by a process with CAP_NET_ADMIN capabilities. Fix this by checking for a maximum of 255 elements in attrs. Cc: stable@vger.kernel.org Fixes: dc1e3cb8da8b ("nl80211: MBSSID and EMA support in AP mode") Signed-off-by: Keith Yeo <keithyjy@gmail.com> Link: https://lore.kernel.org/r/20230731034719.77206-1-keithyjy@gmail.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2023-08-09netfilter: nf_tables: don't skip expired elements during walkFlorian Westphal
There is an asymmetry between commit/abort and preparation phase if the following conditions are met: 1. set is a verdict map ("1.2.3.4 : jump foo") 2. timeouts are enabled In this case, following sequence is problematic: 1. element E in set S refers to chain C 2. userspace requests removal of set S 3. kernel does a set walk to decrement chain->use count for all elements from preparation phase 4. kernel does another set walk to remove elements from the commit phase (or another walk to do a chain->use increment for all elements from abort phase) If E has already expired in 1), it will be ignored during list walk, so its use count won't have been changed. Then, when set is culled, ->destroy callback will zap the element via nf_tables_set_elem_destroy(), but this function is only safe for elements that have been deactivated earlier from the preparation phase: lack of earlier deactivate removes the element but leaks the chain use count, which results in a WARN splat when the chain gets removed later, plus a leak of the nft_chain structure. Update pipapo_get() not to skip expired elements, otherwise flush command reports bogus ENOENT errors. Fixes: 3c4287f62044 ("nf_tables: Add set type for arbitrary concatenation of ranges") Fixes: 8d8540c4f5e0 ("netfilter: nft_set_rbtree: add timeout support") Fixes: 9d0982927e79 ("netfilter: nft_hash: add support for timeouts") Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2023-08-09net/smc: Use correct buffer sizes when switching between TCP and SMCGerd Bayer
Tuning of the effective buffer size through setsockopts was working for SMC traffic only but not for TCP fall-back connections even before commit 0227f058aa29 ("net/smc: Unbind r/w buffer size from clcsock and make them tunable"). That change made it apparent that TCP fall-back connections would use net.smc.[rw]mem as buffer size instead of net.ipv4_tcp_[rw]mem. Amend the code that copies attributes between the (TCP) clcsock and the SMC socket and adjust buffer sizes appropriately: - Copy over sk_userlocks so that both sockets agree on whether tuning via setsockopt is active. - When falling back to TCP use sk_sndbuf or sk_rcvbuf as specified with setsockopt. Otherwise, use the sysctl value for TCP/IPv4. - Likewise, use either values from setsockopt or from sysctl for SMC (duplicated) on successful SMC connect. In smc_tcp_listen_work() drop the explicit copy of buffer sizes as that is taken care of by the attribute copy. Fixes: 0227f058aa29 ("net/smc: Unbind r/w buffer size from clcsock and make them tunable") Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com> Reviewed-by: Tony Lu <tonylu@linux.alibaba.com> Signed-off-by: Gerd Bayer <gbayer@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-09net/smc: Fix setsockopt and sysctl to specify same buffer size againGerd Bayer
Commit 0227f058aa29 ("net/smc: Unbind r/w buffer size from clcsock and make them tunable") introduced the net.smc.rmem and net.smc.wmem sysctls to specify the size of buffers to be used for SMC type connections. This created a regression for users that specified the buffer size via setsockopt() as the effective buffer size was now doubled. Re-introduce the division by 2 in the SMC buffer create code and level this out by duplicating the net.smc.[rw]mem values used for initializing sk_rcvbuf/sk_sndbuf at socket creation time. This gives users of both methods (setsockopt or sysctl) the effective buffer size that they expect. Initialize net.smc.[rw]mem from its own constant of 64kB, respectively. Internal performance tests show that this value is a good compromise between throughput/latency and memory consumption. Also, this decouples it from any tuning that was done to net.ipv4.tcp_[rw]mem[1] before the module for SMC protocol was loaded. Check that no more than INT_MAX / 2 is assigned to net.smc.[rw]mem, in order to avoid any overflow condition when that is doubled for use in sk_sndbuf or sk_rcvbuf. While at it, drop the confusing sk_buf_size variable from __smc_buf_create and name "compressed" buffer size variables more consistently. Background: Before the commit mentioned above, SMC's buffer allocator in __smc_buf_create() always used half of the sockets' sk_rcvbuf/sk_sndbuf value as initial value to search for appropriate buffers. If the search resorted to using a bigger buffer when all buffers of the specified size were busy, the duplicate of the used effective buffer size is stored back to sk_rcvbuf/sk_sndbuf. When available, buffers of exactly the size that a user had specified as input to setsockopt() were used, despite setsockopt()'s documentation in "man 7 socket" talking of a mandatory duplication: [...] SO_SNDBUF Sets or gets the maximum socket send buffer in bytes. The kernel doubles this value (to allow space for book‐ keeping overhead) when it is set using setsockopt(2), and this doubled value is returned by getsockopt(2). The default value is set by the /proc/sys/net/core/wmem_default file and the maximum allowed value is set by the /proc/sys/net/core/wmem_max file. The minimum (doubled) value for this option is 2048. [...] Fixes: 0227f058aa29 ("net/smc: Unbind r/w buffer size from clcsock and make them tunable") Co-developed-by: Jan Karcher <jaka@linux.ibm.com> Signed-off-by: Jan Karcher <jaka@linux.ibm.com> Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com> Reviewed-by: Tony Lu <tonylu@linux.alibaba.com> Signed-off-by: Gerd Bayer <gbayer@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-08net/unix: use consistent error code in SO_PEERPIDFDDavid Rheinsberg
Change the new (unreleased) SO_PEERPIDFD sockopt to return ENODATA rather than ESRCH if a socket type does not support remote peer-PID queries. Currently, SO_PEERPIDFD returns ESRCH when the socket in question is not an AF_UNIX socket. This is quite unexpected, given that one would assume ESRCH means the peer process already exited and thus cannot be found. However, in that case the sockopt actually returns EINVAL (via pidfd_prepare()). This is rather inconsistent with other syscalls, which usually return ESRCH if a given PID refers to a non-existant process. This changes SO_PEERPIDFD to return ENODATA instead. This is also what SO_PEERGROUPS returns, and thus keeps a consistent behavior across sockopts. Note that this code is returned in 2 cases: First, if the socket type is not AF_UNIX, and secondly if the socket was not yet connected. In both cases ENODATA seems suitable. Signed-off-by: David Rheinsberg <david@readahead.eu> Reviewed-by: Christian Brauner <brauner@kernel.org> Acked-by: Luca Boccassi <bluca@debian.org> Fixes: 7b26952a91cf ("net: core: add getsockopt SO_PEERPIDFD") Link: https://lore.kernel.org/r/20230807081225.816199-1-david@readahead.eu Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-07net: core: remove unnecessary frame_sz check in bpf_xdp_adjust_tail()Andrew Kanner
Syzkaller reported the following issue: ======================================= Too BIG xdp->frame_sz = 131072 WARNING: CPU: 0 PID: 5020 at net/core/filter.c:4121 ____bpf_xdp_adjust_tail net/core/filter.c:4121 [inline] WARNING: CPU: 0 PID: 5020 at net/core/filter.c:4121 bpf_xdp_adjust_tail+0x466/0xa10 net/core/filter.c:4103 ... Call Trace: <TASK> bpf_prog_4add87e5301a4105+0x1a/0x1c __bpf_prog_run include/linux/filter.h:600 [inline] bpf_prog_run_xdp include/linux/filter.h:775 [inline] bpf_prog_run_generic_xdp+0x57e/0x11e0 net/core/dev.c:4721 netif_receive_generic_xdp net/core/dev.c:4807 [inline] do_xdp_generic+0x35c/0x770 net/core/dev.c:4866 tun_get_user+0x2340/0x3ca0 drivers/net/tun.c:1919 tun_chr_write_iter+0xe8/0x210 drivers/net/tun.c:2043 call_write_iter include/linux/fs.h:1871 [inline] new_sync_write fs/read_write.c:491 [inline] vfs_write+0x650/0xe40 fs/read_write.c:584 ksys_write+0x12f/0x250 fs/read_write.c:637 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x38/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd xdp->frame_sz > PAGE_SIZE check was introduced in commit c8741e2bfe87 ("xdp: Allow bpf_xdp_adjust_tail() to grow packet size"). But Jesper Dangaard Brouer <jbrouer@redhat.com> noted that after introducing the xdp_init_buff() which all XDP driver use - it's safe to remove this check. The original intend was to catch cases where XDP drivers have not been updated to use xdp.frame_sz, but that is not longer a concern (since xdp_init_buff). Running the initial syzkaller repro it was discovered that the contiguous physical memory allocation is used for both xdp paths in tun_get_user(), e.g. tun_build_skb() and tun_alloc_skb(). It was also stated by Jesper Dangaard Brouer <jbrouer@redhat.com> that XDP can work on higher order pages, as long as this is contiguous physical memory (e.g. a page). Reported-and-tested-by: syzbot+f817490f5bd20541b90a@syzkaller.appspotmail.com Closes: https://lore.kernel.org/all/000000000000774b9205f1d8a80d@google.com/T/ Link: https://syzkaller.appspot.com/bug?extid=f817490f5bd20541b90a Link: https://lore.kernel.org/all/20230725155403.796-1-andrew.kanner@gmail.com/T/ Fixes: 43b5169d8355 ("net, xdp: Introduce xdp_init_buff utility routine") Signed-off-by: Andrew Kanner <andrew.kanner@gmail.com> Acked-by: Jesper Dangaard Brouer <hawk@kernel.org> Acked-by: Jason Wang <jasowang@redhat.com> Link: https://lore.kernel.org/r/20230803190316.2380231-1-andrew.kanner@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-06net: tls: avoid discarding data on record closeJakub Kicinski
TLS records end with a 16B tag. For TLS device offload we only need to make space for this tag in the stream, the device will generate and replace it with the actual calculated tag. Long time ago the code would just re-reference the head frag which mostly worked but was suboptimal because it prevented TCP from combining the record into a single skb frag. I'm not sure if it was correct as the first frag may be shorter than the tag. The commit under fixes tried to replace that with using the page frag and if the allocation failed rolling back the data, if record was long enough. It achieves better fragment coalescing but is also buggy. We don't roll back the iterator, so unless we're at the end of send we'll skip the data we designated as tag and start the next record as if the rollback never happened. There's also the possibility that the record was constructed with MSG_MORE and the data came from a different syscall and we already told the user space that we "got it". Allocate a single dummy page and use it as fallback. Found by code inspection, and proven by forcing allocation failures. Fixes: e7b159a48ba6 ("net/tls: remove the record tail optimization") Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2023-08-04dccp: fix data-race around dp->dccps_mss_cacheEric Dumazet
dccp_sendmsg() reads dp->dccps_mss_cache before locking the socket. Same thing in do_dccp_getsockopt(). Add READ_ONCE()/WRITE_ONCE() annotations, and change dccp_sendmsg() to check again dccps_mss_cache after socket is locked. Fixes: 7c657876b63c ("[DCCP]: Initial implementation") Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20230803163021.2958262-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-04mptcp: fix disconnect vs accept racePaolo Abeni
Despite commit 0ad529d9fd2b ("mptcp: fix possible divide by zero in recvmsg()"), the mptcp protocol is still prone to a race between disconnect() (or shutdown) and accept. The root cause is that the mentioned commit checks the msk-level flag, but mptcp_stream_accept() does acquire the msk-level lock, as it can rely directly on the first subflow lock. As reported by Christoph than can lead to a race where an msk socket is accepted after that mptcp_subflow_queue_clean() releases the listener socket lock and just before it takes destructive actions leading to the following splat: BUG: kernel NULL pointer dereference, address: 0000000000000012 PGD 5a4ca067 P4D 5a4ca067 PUD 37d4c067 PMD 0 Oops: 0000 [#1] PREEMPT SMP CPU: 2 PID: 10955 Comm: syz-executor.5 Not tainted 6.5.0-rc1-gdc7b257ee5dd #37 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.el7 04/01/2014 RIP: 0010:mptcp_stream_accept+0x1ee/0x2f0 include/net/inet_sock.h:330 Code: 0a 09 00 48 8b 1b 4c 39 e3 74 07 e8 bc 7c 7f fe eb a1 e8 b5 7c 7f fe 4c 8b 6c 24 08 eb 05 e8 a9 7c 7f fe 49 8b 85 d8 09 00 00 <0f> b6 40 12 88 44 24 07 0f b6 6c 24 07 bf 07 00 00 00 89 ee e8 89 RSP: 0018:ffffc90000d07dc0 EFLAGS: 00010293 RAX: 0000000000000000 RBX: ffff888037e8d020 RCX: ffff88803b093300 RDX: 0000000000000000 RSI: ffffffff833822c5 RDI: ffffffff8333896a RBP: 0000607f82031520 R08: ffff88803b093300 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000003e83 R12: ffff888037e8d020 R13: ffff888037e8c680 R14: ffff888009af7900 R15: ffff888009af6880 FS: 00007fc26d708640(0000) GS:ffff88807dd00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000012 CR3: 0000000066bc5001 CR4: 0000000000370ee0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> do_accept+0x1ae/0x260 net/socket.c:1872 __sys_accept4+0x9b/0x110 net/socket.c:1913 __do_sys_accept4 net/socket.c:1954 [inline] __se_sys_accept4 net/socket.c:1951 [inline] __x64_sys_accept4+0x20/0x30 net/socket.c:1951 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x47/0xa0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x6e/0xd8 Address the issue by temporary removing the pending request socket from the accept queue, so that racing accept() can't touch them. After depleting the msk - the ssk still exists, as plain TCP sockets, re-insert them into the accept queue, so that later inet_csk_listen_stop() will complete the tcp socket disposal. Fixes: 2a6a870e44dd ("mptcp: stops worker on unaccepted sockets at listener close") Cc: stable@vger.kernel.org Reported-by: Christoph Paasch <cpaasch@apple.com> Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/423 Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Link: https://lore.kernel.org/r/20230803-upstream-net-20230803-misc-fixes-6-5-v1-4-6671b1ab11cc@tessares.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-04mptcp: avoid bogus reset on fallback closePaolo Abeni
Since the blamed commit, the MPTCP protocol unconditionally sends TCP resets on all the subflows on disconnect(). That fits full-blown MPTCP sockets - to implement the fastclose mechanism - but causes unexpected corruption of the data stream, caught as sporadic self-tests failures. Fixes: d21f83485518 ("mptcp: use fastclose on more edge scenarios") Cc: stable@vger.kernel.org Tested-by: Matthieu Baerts <matthieu.baerts@tessares.net> Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/419 Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Link: https://lore.kernel.org/r/20230803-upstream-net-20230803-misc-fixes-6-5-v1-3-6671b1ab11cc@tessares.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-04tunnels: fix kasan splat when generating ipv4 pmtu errorFlorian Westphal
If we try to emit an icmp error in response to a nonliner skb, we get BUG: KASAN: slab-out-of-bounds in ip_compute_csum+0x134/0x220 Read of size 4 at addr ffff88811c50db00 by task iperf3/1691 CPU: 2 PID: 1691 Comm: iperf3 Not tainted 6.5.0-rc3+ #309 [..] kasan_report+0x105/0x140 ip_compute_csum+0x134/0x220 iptunnel_pmtud_build_icmp+0x554/0x1020 skb_tunnel_check_pmtu+0x513/0xb80 vxlan_xmit_one+0x139e/0x2ef0 vxlan_xmit+0x1867/0x2760 dev_hard_start_xmit+0x1ee/0x4f0 br_dev_queue_push_xmit+0x4d1/0x660 [..] ip_compute_csum() cannot deal with nonlinear skbs, so avoid it. After this change, splat is gone and iperf3 is no longer stuck. Fixes: 4cb47a8644cc ("tunnels: PMTU discovery support for directly bridged IP packets") Signed-off-by: Florian Westphal <fw@strlen.de> Link: https://lore.kernel.org/r/20230803152653.29535-2-fw@strlen.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-04net/packet: annotate data-races around tp->statusEric Dumazet
Another syzbot report [1] is about tp->status lockless reads from __packet_get_status() [1] BUG: KCSAN: data-race in __packet_rcv_has_room / __packet_set_status write to 0xffff888117d7c080 of 8 bytes by interrupt on cpu 0: __packet_set_status+0x78/0xa0 net/packet/af_packet.c:407 tpacket_rcv+0x18bb/0x1a60 net/packet/af_packet.c:2483 deliver_skb net/core/dev.c:2173 [inline] __netif_receive_skb_core+0x408/0x1e80 net/core/dev.c:5337 __netif_receive_skb_one_core net/core/dev.c:5491 [inline] __netif_receive_skb+0x57/0x1b0 net/core/dev.c:5607 process_backlog+0x21f/0x380 net/core/dev.c:5935 __napi_poll+0x60/0x3b0 net/core/dev.c:6498 napi_poll net/core/dev.c:6565 [inline] net_rx_action+0x32b/0x750 net/core/dev.c:6698 __do_softirq+0xc1/0x265 kernel/softirq.c:571 invoke_softirq kernel/softirq.c:445 [inline] __irq_exit_rcu+0x57/0xa0 kernel/softirq.c:650 sysvec_apic_timer_interrupt+0x6d/0x80 arch/x86/kernel/apic/apic.c:1106 asm_sysvec_apic_timer_interrupt+0x1a/0x20 arch/x86/include/asm/idtentry.h:645 smpboot_thread_fn+0x33c/0x4a0 kernel/smpboot.c:112 kthread+0x1d7/0x210 kernel/kthread.c:379 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:308 read to 0xffff888117d7c080 of 8 bytes by interrupt on cpu 1: __packet_get_status net/packet/af_packet.c:436 [inline] packet_lookup_frame net/packet/af_packet.c:524 [inline] __tpacket_has_room net/packet/af_packet.c:1255 [inline] __packet_rcv_has_room+0x3f9/0x450 net/packet/af_packet.c:1298 tpacket_rcv+0x275/0x1a60 net/packet/af_packet.c:2285 deliver_skb net/core/dev.c:2173 [inline] dev_queue_xmit_nit+0x38a/0x5e0 net/core/dev.c:2243 xmit_one net/core/dev.c:3574 [inline] dev_hard_start_xmit+0xcf/0x3f0 net/core/dev.c:3594 __dev_queue_xmit+0xefb/0x1d10 net/core/dev.c:4244 dev_queue_xmit include/linux/netdevice.h:3088 [inline] can_send+0x4eb/0x5d0 net/can/af_can.c:276 bcm_can_tx+0x314/0x410 net/can/bcm.c:302 bcm_tx_timeout_handler+0xdb/0x260 __run_hrtimer kernel/time/hrtimer.c:1685 [inline] __hrtimer_run_queues+0x217/0x700 kernel/time/hrtimer.c:1749 hrtimer_run_softirq+0xd6/0x120 kernel/time/hrtimer.c:1766 __do_softirq+0xc1/0x265 kernel/softirq.c:571 run_ksoftirqd+0x17/0x20 kernel/softirq.c:939 smpboot_thread_fn+0x30a/0x4a0 kernel/smpboot.c:164 kthread+0x1d7/0x210 kernel/kthread.c:379 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:308 value changed: 0x0000000000000000 -> 0x0000000020000081 Reported by Kernel Concurrency Sanitizer on: CPU: 1 PID: 19 Comm: ksoftirqd/1 Not tainted 6.4.0-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/27/2023 Fixes: 69e3c75f4d54 ("net: TX_RING and packet mmap") Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://lore.kernel.org/r/20230803145600.2937518-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-04mptcp: fix the incorrect judgment for msk->cb_flagsXiang Yang
Coccicheck reports the error below: net/mptcp/protocol.c:3330:15-28: ERROR: test of a variable/field address Since the address of msk->cb_flags is used in __test_and_clear_bit, the address should not be NULL. The judgment for if (unlikely(msk->cb_flags)) will always be true, we should check the real value of msk->cb_flags here. Fixes: 65a569b03ca8 ("mptcp: optimize release_cb for the common case") Signed-off-by: Xiang Yang <xiangyang3@huawei.com> Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Link: https://lore.kernel.org/r/20230803072438.1847500-1-xiangyang3@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2023-08-04Merge tag 'ceph-for-6.5-rc5' of https://github.com/ceph/ceph-clientLinus Torvalds
Pull ceph fixes from Ilya Dryomov: "Two patches to improve RBD exclusive lock interaction with osd_request_timeout option and another fix to reduce the potential for erroneous blocklisting -- this time in CephFS. All going to stable" * tag 'ceph-for-6.5-rc5' of https://github.com/ceph/ceph-client: libceph: fix potential hang in ceph_osdc_notify() rbd: prevent busy loop when requesting exclusive lock ceph: defer stopping mdsc delayed_work
2023-08-03Merge tag 'net-6.5-rc5' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: "Including fixes from bpf and wireless. Nothing scary here. Feels like the first wave of regressions from v6.5 is addressed - one outstanding fix still to come in TLS for the sendpage rework. Current release - regressions: - udp: fix __ip_append_data()'s handling of MSG_SPLICE_PAGES - dsa: fix older DSA drivers using phylink Previous releases - regressions: - gro: fix misuse of CB in udp socket lookup - mlx5: unregister devlink params in case interface is down - Revert "wifi: ath11k: Enable threaded NAPI" Previous releases - always broken: - sched: cls_u32: fix match key mis-addressing - sched: bind logic fixes for cls_fw, cls_u32 and cls_route - add bound checks to a number of places which hand-parse netlink - bpf: disable preemption in perf_event_output helpers code - qed: fix scheduling in a tasklet while getting stats - avoid using APIs which are not hardirq-safe in couple of drivers, when we may be in a hard IRQ (netconsole) - wifi: cfg80211: fix return value in scan logic, avoid page allocator warning - wifi: mt76: mt7615: do not advertise 5 GHz on first PHY of MT7615D (DBDC) Misc: - drop handful of inactive maintainers, put some new in place" * tag 'net-6.5-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (98 commits) MAINTAINERS: update TUN/TAP maintainers test/vsock: remove vsock_perf executable on `make clean` tcp_metrics: fix data-race in tcpm_suck_dst() vs fastopen tcp_metrics: annotate data-races around tm->tcpm_net tcp_metrics: annotate data-races around tm->tcpm_vals[] tcp_metrics: annotate data-races around tm->tcpm_lock tcp_metrics: annotate data-races around tm->tcpm_stamp tcp_metrics: fix addr_same() helper prestera: fix fallback to previous version on same major version udp: Fix __ip_append_data()'s handling of MSG_SPLICE_PAGES net/mlx5e: Set proper IPsec source port in L4 selector net/mlx5: fs_core: Skip the FTs in the same FS_TYPE_PRIO_CHAINS fs_prio net/mlx5: fs_core: Make find_closest_ft more generic wifi: brcmfmac: Fix field-spanning write in brcmf_scan_params_v2_to_v1() vxlan: Fix nexthop hash size ip6mr: Fix skb_under_panic in ip6mr_cache_report() s390/qeth: Don't call dev_close/dev_open (DOWN/UP) net: tap_open(): set sk_uid from current_fsuid() net: tun_chr_open(): set sk_uid from current_fsuid() net: dcb: choose correct policy to parse DCB_ATTR_BCN ...