summaryrefslogtreecommitdiff
path: root/net/core
AgeCommit message (Collapse)Author
2025-07-18net: stream: add description for sk_stream_write_space()Suchit Karunakaran
Add a proper description for the sk_stream_write_space() function as previously marked by a FIXME comment. No functional changes. Signed-off-by: Suchit Karunakaran <suchitkarunakaran@gmail.com> Link: https://patch.msgid.link/20250716153404.7385-1-suchitkarunakaran@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17neighbour: Update pneigh_entry in pneigh_create().Kuniyuki Iwashima
neigh_add() updates pneigh_entry() found or created by pneigh_create(). This update is serialised by RTNL, but we will remove it. Let's move the update part to pneigh_create() and make it return errno instead of a pointer of pneigh_entry. Now, the pneigh code is RTNL free. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250716221221.442239-16-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17neighbour: Protect tbl->phash_buckets[] with a dedicated mutex.Kuniyuki Iwashima
tbl->phash_buckets[] is only modified in the slow path by pneigh_create() and pneigh_delete() under the table lock. Both of them are called under RTNL, so no extra lock is needed, but we will remove RTNL from the paths. pneigh_create() looks up a pneigh_entry, and this part can be lockless, but it would complicate the logic like 1. lookup 2. allocate pengih_entry for GFP_KERNEL 3. lookup again but under lock 4. if found, return it after freeing the allocated memory 5. else, return the new one Instead, let's add a per-table mutex and run lookup and allocation under it. Note that updating pneigh_entry part in neigh_add() is still protected by RTNL and will be moved to pneigh_create() in the next patch. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250716221221.442239-15-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17neighbour: Drop read_lock_bh(&tbl->lock) in pneigh_lookup().Kuniyuki Iwashima
Now, all callers of pneigh_lookup() are under RCU, and the read lock there is no longer needed. Let's drop the lock, inline __pneigh_lookup_1() to pneigh_lookup(), and call it from pneigh_create(). The next patch will remove tbl->lock from pneigh_create(). Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250716221221.442239-14-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17neighbour: Remove __pneigh_lookup().Kuniyuki Iwashima
__pneigh_lookup() is the lockless version of pneigh_lookup(), but its only caller pndisc_is_router() holds the table lock and reads pneigh_netry.flags. This is because accessing pneigh_entry after pneigh_lookup() was illegal unless the caller holds RTNL or the table lock. Now, pneigh_entry is guaranteed to be alive during the RCU critical section. Let's call pneigh_lookup() and use READ_ONCE() for n->flags in pndisc_is_router() and remove __pneigh_lookup(). Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250716221221.442239-13-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17neighbour: Use rcu_dereference() in pneigh_get_{first,next}().Kuniyuki Iwashima
Now pneigh_entry is guaranteed to be alive during the RCU critical section even without holding tbl->lock. Let's use rcu_dereference() in pneigh_get_{first,next}(). Note that neigh_seq_start() still holds tbl->lock for the normal neighbour entry. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250716221221.442239-12-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17neighbour: Drop read_lock_bh(&tbl->lock) in pneigh_dump_table().Kuniyuki Iwashima
Now pneigh_entry is guaranteed to be alive during the RCU critical section even without holding tbl->lock. Let's drop read_lock_bh(&tbl->lock) and use rcu_dereference() to iterate tbl->phash_buckets[] in pneigh_dump_table() Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250716221221.442239-11-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17neighbour: Convert RTM_GETNEIGH to RCU.Kuniyuki Iwashima
Only __dev_get_by_index() is the RTNL dependant in neigh_get(). Let's replace it with dev_get_by_index_rcu() and convert RTM_GETNEIGH to RCU. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250716221221.442239-10-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17neighbour: Annotate access to struct pneigh_entry.{flags,protocol}.Kuniyuki Iwashima
We will convert pneigh readers to RCU, and its flags and protocol will be read locklessly. Let's annotate the access to the two fields. Note that all access to pn->permanent is under RTNL (neigh_add() and pneigh_ifdown_and_unlock()), so WRITE_ONCE() and READ_ONCE() are not needed. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250716221221.442239-9-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17neighbour: Free pneigh_entry after RCU grace period.Kuniyuki Iwashima
We will convert RTM_GETNEIGH to RCU. neigh_get() looks up pneigh_entry by pneigh_lookup() and passes it to pneigh_fill_info(). Then, we must ensure that the entry is alive till pneigh_fill_info() completes, but read_lock_bh(&tbl->lock) in pneigh_lookup() does not guarantee that. Also, we will convert all readers of tbl->phash_buckets[] to RCU. Let's use call_rcu() to free pneigh_entry and update phash_buckets[] and ->next by rcu_assign_pointer(). pneigh_ifdown_and_unlock() uses list_head to avoid overwriting ->next and moving RCU iterators to another list. pndisc_destructor() (only IPv6 ndisc uses this) uses a mutex, so it is not delayed to call_rcu(), where we cannot sleep. This is fine because the mcast code works with RCU and ipv6_dev_mc_dec() frees mcast objects after RCU grace period. While at it, we change the return type of pneigh_ifdown_and_unlock() to void. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250716221221.442239-8-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17neighbour: Annotate neigh_table.phash_buckets and pneigh_entry.next with __rcu.Kuniyuki Iwashima
The next patch will free pneigh_entry with call_rcu(). Then, we need to annotate neigh_table.phash_buckets[] and pneigh_entry.next with __rcu. To make the next patch cleaner, let's annotate the fields in advance. Currently, all accesses to the fields are under the neigh table lock, so rcu_dereference_protected() is used with 1 for now, but most of them (except in pneigh_delete() and pneigh_ifdown_and_unlock()) will be replaced with rcu_dereference() and rcu_dereference_check(). Note that pneigh_ifdown_and_unlock() changes pneigh_entry.next to a local list, which is illegal because the RCU iterator could be moved to another list. This part will be fixed in the next patch. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250716221221.442239-7-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17neighbour: Split pneigh_lookup().Kuniyuki Iwashima
pneigh_lookup() has ASSERT_RTNL() in the middle of the function, which is confusing. When called with the last argument, creat, 0, pneigh_lookup() literally looks up a proxy neighbour entry. This is the case of the reader path as the fast path and RTM_GETNEIGH. pneigh_lookup(), however, creates a pneigh_entry when called with creat 1 from RTM_NEWNEIGH and SIOCSARP, which require RTNL. Let's split pneigh_lookup() into two functions. We will convert all the reader paths to RCU, and read_lock_bh(&tbl->lock) in the new pneigh_lookup() will be dropped. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250716221221.442239-6-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17neighbour: Move neigh_find_table() to neigh_get().Kuniyuki Iwashima
neigh_valid_get_req() calls neigh_find_table() to fetch neigh_tables[]. neigh_find_table() uses rcu_dereference_rtnl(), but RTNL actually does not protect it at all; neigh_table_clear() can be called without RTNL and only waits for RCU readers by synchronize_rcu(). Fortunately, there is no bug because IPv4 is built-in, IPv6 cannot be unloaded, and DECNET was removed. To fetch neigh_tables[] by rcu_dereference() later, let's move neigh_find_table() from neigh_valid_get_req() to neigh_get(). Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250716221221.442239-5-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17neighbour: Allocate skb in neigh_get().Kuniyuki Iwashima
We will remove RTNL for neigh_get() and run it under RCU instead. neigh_get_reply() and pneigh_get_reply() allocate skb with GFP_KERNEL. Let's move the allocation before __dev_get_by_index() in neigh_get(). Now, neigh_get_reply() and pneigh_get_reply() are inlined and rtnl_unicast() is factorised. We will convert pneigh_lookup() to __pneigh_lookup() later. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250716221221.442239-4-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17neighbour: Move two validations from neigh_get() to neigh_valid_get_req().Kuniyuki Iwashima
We will remove RTNL for neigh_get() and run it under RCU instead. neigh_get() returns -EINVAL in the following cases: * NDA_DST is not specified * Both ndm->ndm_ifindex and NTF_PROXY are not specified These validations do not require RCU. Let's move them to neigh_valid_get_req(). While at it, the extack string for the first case is replaced with NL_SET_ERR_ATTR_MISS(). Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250716221221.442239-3-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-17neighbour: Make neigh_valid_get_req() return ndmsg.Kuniyuki Iwashima
neigh_get() passes 4 local variable pointers to neigh_valid_get_req(). If it returns a pointer of struct ndmsg, we do not need to pass two of them. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250716221221.442239-2-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-14Add support to set NAPI threaded for individual NAPISamiullah Khawaja
A net device has a threaded sysctl that can be used to enable threaded NAPI polling on all of the NAPI contexts under that device. Allow enabling threaded NAPI polling at individual NAPI level using netlink. Extend the netlink operation `napi-set` and allow setting the threaded attribute of a NAPI. This will enable the threaded polling on a NAPI context. Add a test in `nl_netdev.py` that verifies various cases of threaded NAPI being set at NAPI and at device level. Tested ./tools/testing/selftests/net/nl_netdev.py TAP version 13 1..7 ok 1 nl_netdev.empty_check ok 2 nl_netdev.lo_check ok 3 nl_netdev.page_pool_check ok 4 nl_netdev.napi_list_check ok 5 nl_netdev.dev_set_threaded ok 6 nl_netdev.napi_set_threaded ok 7 nl_netdev.nsim_rxq_reset_down # Totals: pass:7 fail:0 xfail:0 xpass:0 skip:0 error:0 Signed-off-by: Samiullah Khawaja <skhawaja@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20250710211203.3979655-1-skhawaja@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-14dev: Pass netdevice_tracker to dev_get_by_flags_rcu().Kuniyuki Iwashima
This is a follow-up for commit eb1ac9ff6c4a5 ("ipv6: anycast: Don't hold RTNL for IPV6_JOIN_ANYCAST."). We should not add a new device lookup API without netdevice_tracker. Let's pass netdevice_tracker to dev_get_by_flags_rcu() and rename it with netdev_ prefix to match other newer APIs. Note that we always use GFP_ATOMIC for netdev_hold() as it's expected to be called under RCU. Suggested-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/netdev/20250708184053.102109f6@kernel.org/ Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250711051120.2866855-1-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-09skbuff: Add MSG_MORE flag to optimize tcp large packet transmissionFeng Yang
When using sockmap for forwarding, the average latency for different packet sizes after sending 10,000 packets is as follows: size old(us) new(us) 512 56 55 1472 58 58 1600 106 81 3000 145 105 5000 182 125 Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Feng Yang <yangfeng@kylinos.cn> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250708054053.39551-1-yangfeng59949@163.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-08ipv6: anycast: Don't hold RTNL for IPV6_JOIN_ANYCAST.Kuniyuki Iwashima
inet6_sk(sk)->ipv6_ac_list is protected by lock_sock(). In ipv6_sock_ac_join(), only __dev_get_by_index(), __dev_get_by_flags(), and __in6_dev_get() require RTNL. __dev_get_by_flags() is only used by ipv6_sock_ac_join() and can be converted to RCU version. Let's replace RCU version helper and drop RTNL from IPV6_JOIN_ANYCAST. setsockopt_needs_rtnl() will be removed in the next patch. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20250702230210.3115355-15-kuni1840@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-08net: ethtool: remove the compat code for _rxfh_context opsJakub Kicinski
All drivers are now converted to dedicated _rxfh_context ops. Remove the use of >set_rxfh() to manage additional contexts. Reviewed-by: Gal Pressman <gal@nvidia.com> Reviewed-by: Edward Cree <ecree.xilinx@gmail.com> Link: https://patch.msgid.link/20250707184115.2285277-5-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-08net: account for encap headers in qdisc pkt lenFengyuan Gong
Refine qdisc_pkt_len_init to include headers up through the inner transport header when computing header size for encapsulations. Also refine net/sched/sch_cake.c borrowed from qdisc_pkt_len_init(). Signed-off-by: Fengyuan Gong <gfengyuan@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://patch.msgid.link/20250702160741.1204919-1-gfengyuan@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-08net: skbuff: Drop unused @skbMichal Luczaj
Since its introduction in commit 6fa01ccd8830 ("skbuff: Add pskb_extract() helper function"), pskb_carve_frag_list() never used the argument @skb. Drop it and adapt the only caller. No functional change intended. Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Michal Luczaj <mhal@rbox.co> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-08net: skbuff: Drop unused @skbMichal Luczaj
Since its introduction in commit ce098da1497c ("skbuff: Introduce slab_build_skb()"), __slab_build_skb() never used the @skb argument. Remove it and adapt both callers. No functional change intended. Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Michal Luczaj <mhal@rbox.co> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-08net: splice: Drop unused @gfpMichal Luczaj
Since its introduction in commit 2e910b95329c ("net: Add a function to splice pages into an skbuff for MSG_SPLICE_PAGES"), skb_splice_from_iter() never used the @gfp argument. Remove it and adapt callers. No functional change intended. Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Michal Luczaj <mhal@rbox.co> Link: https://patch.msgid.link/20250702-splice-drop-unused-v3-2-55f68b60d2b7@rbox.co Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-08net: splice: Drop unused @pipeMichal Luczaj
Since commit 41c73a0d44c9 ("net: speedup skb_splice_bits()"), __splice_segment() and spd_fill_page() do not use the @pipe argument. Drop it. While adapting the callers, move one line to enforce reverse xmas tree order. No functional change intended. Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Michal Luczaj <mhal@rbox.co> Link: https://patch.msgid.link/20250702-splice-drop-unused-v3-1-55f68b60d2b7@rbox.co Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-07netpoll: move Ethernet setup to push_eth() helperBreno Leitao
Refactor Ethernet header population into dedicated function, completing the layered abstraction with: - push_eth() for link layer - push_udp() for transport - push_ipv4()/push_ipv6() for network Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250702-netpoll_untagle_ip-v2-6-13cf3db24e2b@debian.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-07netpoll: factor out UDP header setup into push_udp() helperBreno Leitao
Move UDP header construction from netpoll_send_udp() into a new static helper function push_udp(). This completes the protocol layer refactoring by: 1. Creating a dedicated helper for UDP header assembly 2. Removing UDP-specific logic from the main send function 3. Establishing a consistent pattern with existing IPv4/IPv6 helpers: - push_udp() - push_ipv4() - push_ipv6() The change improves code organization and maintains the encapsulation pattern established in previous refactorings. Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250702-netpoll_untagle_ip-v2-5-13cf3db24e2b@debian.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-07netpoll: factor out IPv4 header setup into push_ipv4() helperBreno Leitao
Move IPv4 header construction from netpoll_send_udp() into a new static helper function push_ipv4(). This completes the refactoring started with IPv6 header handling, creating symmetric helper functions for both IP versions. Changes include: 1. Extracting IPv4 header setup logic into push_ipv4() 2. Replacing inline IPv4 code with helper call 3. Moving eth assignment after helper calls for consistency The refactoring reduces code duplication and improves maintainability by isolating IP version-specific logic. Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250702-netpoll_untagle_ip-v2-4-13cf3db24e2b@debian.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-07netpoll: factor out IPv6 header setup into push_ipv6() helperBreno Leitao
Move IPv6 header construction from netpoll_send_udp() into a new static helper function, push_ipv6(). This refactoring reduces code duplication and improves readability in netpoll_send_udp(). Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250702-netpoll_untagle_ip-v2-3-13cf3db24e2b@debian.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-07netpoll: factor out UDP checksum calculation into helperBreno Leitao
Extract UDP checksum calculation logic from netpoll_send_udp() into a new static helper function netpoll_udp_checksum(). This reduces code duplication and improves readability for both IPv4 and IPv6 cases. No functional change intended. Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250702-netpoll_untagle_ip-v2-2-13cf3db24e2b@debian.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-07netpoll: Improve code clarity with explicit struct size calculationsBreno Leitao
Replace pointer-dereference sizeof() operations with explicit struct names for improved readability and maintainability. This change: 1. Replaces `sizeof(*udph)` with `sizeof(struct udphdr)` 2. Replaces `sizeof(*ip6h)` with `sizeof(struct ipv6hdr)` 3. Replaces `sizeof(*iph)` with `sizeof(struct iphdr)` This will make it easy to move code in the upcoming patches. No functional changes are introduced by this patch. Signed-off-by: Breno Leitao <leitao@debian.org> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20250702-netpoll_untagle_ip-v2-1-13cf3db24e2b@debian.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-07net: remove RTNL use for /proc/sys/net/core/rps_default_maskEric Dumazet
Use a dedicated mutex instead. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250702061558.1585870-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-07page_pool: rename __page_pool_alloc_pages_slow() to ↵Byungchul Park
__page_pool_alloc_netmems_slow() Now that __page_pool_alloc_pages_slow() is for allocating netmem, not struct page, rename it to __page_pool_alloc_netmems_slow() to reflect what it does. Signed-off-by: Byungchul Park <byungchul@sk.com> Reviewed-by: Mina Almasry <almasrymina@google.com> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://patch.msgid.link/20250702053256.4594-4-byungchul@sk.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-07page_pool: rename __page_pool_release_page_dma() to ↵Byungchul Park
__page_pool_release_netmem_dma() Now that __page_pool_release_page_dma() is for releasing netmem, not struct page, rename it to __page_pool_release_netmem_dma() to reflect what it does. Signed-off-by: Byungchul Park <byungchul@sk.com> Reviewed-by: Mina Almasry <almasrymina@google.com> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org> Link: https://patch.msgid.link/20250702053256.4594-3-byungchul@sk.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-07page_pool: rename page_pool_return_page() to page_pool_return_netmem()Byungchul Park
Now that page_pool_return_page() is for returning netmem, not struct page, rename it to page_pool_return_netmem() to reflect what it does. Signed-off-by: Byungchul Park <byungchul@sk.com> Reviewed-by: Mina Almasry <almasrymina@google.com> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org> Link: https://patch.msgid.link/20250702053256.4594-2-byungchul@sk.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02net: preserve MSG_ZEROCOPY with forwardingWillem de Bruijn
MSG_ZEROCOPY data must be copied before data is queued to local sockets, to avoid indefinite timeout until memory release. This test is performed by skb_orphan_frags_rx, which is called when looping an egress skb to packet sockets, error queue or ingress path. To preserve zerocopy for skbs that are looped to ingress but are then forwarded to an egress device rather than delivered locally, defer this last check until an skb enters the local IP receive path. This is analogous to existing behavior of skb_clear_delivery_time. Signed-off-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20250630194312.1571410-2-willemdebruijn.kernel@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02net: dst: add four helpers to annotate data-races around dst->devEric Dumazet
dst->dev is read locklessly in many contexts, and written in dst_dev_put(). Fixing all the races is going to need many changes. We probably will have to add full RCU protection. Add three helpers to ease this painful process. static inline struct net_device *dst_dev(const struct dst_entry *dst) { return READ_ONCE(dst->dev); } static inline struct net_device *skb_dst_dev(const struct sk_buff *skb) { return dst_dev(skb_dst(skb)); } static inline struct net *skb_dst_dev_net(const struct sk_buff *skb) { return dev_net(skb_dst_dev(skb)); } static inline struct net *skb_dst_dev_net_rcu(const struct sk_buff *skb) { return dev_net_rcu(skb_dst_dev(skb)); } Fixes: 4a6ce2b6f2ec ("net: introduce a new function dst_dev_put()") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250630121934.3399505-7-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02net: dst: annotate data-races around dst->outputEric Dumazet
dst_dev_put() can overwrite dst->output while other cpus might read this field (for instance from dst_output()) Add READ_ONCE()/WRITE_ONCE() annotations to suppress potential issues. We will likely need RCU protection in the future. Fixes: 4a6ce2b6f2ec ("net: introduce a new function dst_dev_put()") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250630121934.3399505-6-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02net: dst: annotate data-races around dst->inputEric Dumazet
dst_dev_put() can overwrite dst->input while other cpus might read this field (for instance from dst_input()) Add READ_ONCE()/WRITE_ONCE() annotations to suppress potential issues. We will likely need full RCU protection later. Fixes: 4a6ce2b6f2ec ("net: introduce a new function dst_dev_put()") Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250630121934.3399505-5-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02net: dst: annotate data-races around dst->lastuseEric Dumazet
(dst_entry)->lastuse is read and written locklessly, add corresponding annotations. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250630121934.3399505-4-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02net: dst: annotate data-races around dst->obsoleteEric Dumazet
(dst_entry)->obsolete is read locklessly, add corresponding annotations. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250630121934.3399505-2-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02tcp: move tcp_memory_allocated into net_aligned_dataEric Dumazet
____cacheline_aligned_in_smp attribute only makes sure to align a field to a cache line. It does not prevent the linker to use the remaining of the cache line for other variables, causing potential false sharing. Move tcp_memory_allocated into a dedicated cache line. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20250630093540.3052835-4-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02net: move net_cookie into net_aligned_dataEric Dumazet
Using per-cpu data for net->net_cookie generation is overkill, because even busy hosts do not create hundreds of netns per second. Make sure to put net_cookie in a private cache line to avoid potential false sharing. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20250630093540.3052835-3-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-02net: add struct net_aligned_dataEric Dumazet
This structure will hold networking data that must consume a full cache line to avoid accidental false sharing. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20250630093540.3052835-2-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-07-01net: ieee8021q: fix insufficient table-size assertionRubenKelevra
_Static_assert(ARRAY_SIZE(map) != IEEE8021Q_TT_MAX - 1) rejects only a length of 7 and allows any other mismatch. Replace it with a strict equality test via a helper macro so that every mapping table must have exactly IEEE8021Q_TT_MAX (8) entries. Signed-off-by: RubenKelevra <rubenkelevra@gmail.com> Link: https://patch.msgid.link/20250626205907.1566384-1-rubenkelevra@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-06-30net: net->nsid_lock does not need BH safetyEric Dumazet
At the time of commit bc51dddf98c9 ("netns: avoid disabling irq for netns id") peernet2id() was not yet using RCU. Commit 2dce224f469f ("netns: protect netns ID lookups with RCU") changed peernet2id() to no longer acquire net->nsid_lock (potentially from BH context). We do not need to block soft interrupts when acquiring net->nsid_lock anymore. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Guillaume Nault <gnault@redhat.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20250627163242.230866-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-30neighbor: Add NTF_EXT_VALIDATED flag for externally validated entriesIdo Schimmel
tl;dr ===== Add a new neighbor flag ("extern_valid") that can be used to indicate to the kernel that a neighbor entry was learned and determined to be valid externally. The kernel will not try to remove or invalidate such an entry, leaving these decisions to the user space control plane. This is needed for EVPN multi-homing where a neighbor entry for a multi-homed host needs to be synced across all the VTEPs among which the host is multi-homed. Background ========== In a typical EVPN multi-homing setup each host is multi-homed using a set of links called ES (Ethernet Segment, i.e., LAG) to multiple leaf switches (VTEPs). VTEPs that are connected to the same ES are called ES peers. When a neighbor entry is learned on a VTEP, it is distributed to both ES peers and remote VTEPs using EVPN MAC/IP advertisement routes. ES peers use the neighbor entry when routing traffic towards the multi-homed host and remote VTEPs use it for ARP/NS suppression. Motivation ========== If the ES link between a host and the VTEP on which the neighbor entry was locally learned goes down, the EVPN MAC/IP advertisement route will be withdrawn and the neighbor entries will be removed from both ES peers and remote VTEPs. Routing towards the multi-homed host and ARP/NS suppression can fail until another ES peer locally learns the neighbor entry and distributes it via an EVPN MAC/IP advertisement route. "draft-rbickhart-evpn-ip-mac-proxy-adv-03" [1] suggests avoiding these intermittent failures by having the ES peers install the neighbor entries as before, but also injecting EVPN MAC/IP advertisement routes with a proxy indication. When the previously mentioned ES link goes down and the original EVPN MAC/IP advertisement route is withdrawn, the ES peers will not withdraw their neighbor entries, but instead start aging timers for the proxy indication. If an ES peer locally learns the neighbor entry (i.e., it becomes "reachable"), it will restart its aging timer for the entry and emit an EVPN MAC/IP advertisement route without a proxy indication. An ES peer will stop its aging timer for the proxy indication if it observes the removal of the proxy indication from at least one of the ES peers advertising the entry. In the event that the aging timer for the proxy indication expired, an ES peer will withdraw its EVPN MAC/IP advertisement route. If the timer expired on all ES peers and they all withdrew their proxy advertisements, the neighbor entry will be completely removed from the EVPN fabric. Implementation ============== In the above scheme, when the control plane (e.g., FRR) advertises a neighbor entry with a proxy indication, it expects the corresponding entry in the data plane (i.e., the kernel) to remain valid and not be removed due to garbage collection or loss of carrier. The control plane also expects the kernel to notify it if the entry was learned locally (i.e., became "reachable") so that it will remove the proxy indication from the EVPN MAC/IP advertisement route. That is why these entries cannot be programmed with dummy states such as "permanent" or "noarp". Instead, add a new neighbor flag ("extern_valid") which indicates that the entry was learned and determined to be valid externally and should not be removed or invalidated by the kernel. The kernel can probe the entry and notify user space when it becomes "reachable" (it is initially installed as "stale"). However, if the kernel does not receive a confirmation, have it return the entry to the "stale" state instead of the "failed" state. In other words, an entry marked with the "extern_valid" flag behaves like any other dynamically learned entry other than the fact that the kernel cannot remove or invalidate it. One can argue that the "extern_valid" flag should not prevent garbage collection and that instead a neighbor entry should be programmed with both the "extern_valid" and "extern_learn" flags. There are two reasons for not doing that: 1. Unclear why a control plane would like to program an entry that the kernel cannot invalidate but can completely remove. 2. The "extern_learn" flag is used by FRR for neighbor entries learned on remote VTEPs (for ARP/NS suppression) whereas here we are concerned with local entries. This distinction is currently irrelevant for the kernel, but might be relevant in the future. Given that the flag only makes sense when the neighbor has a valid state, reject attempts to add a neighbor with an invalid state and with this flag set. For example: # ip neigh add 192.0.2.1 nud none dev br0.10 extern_valid Error: Cannot create externally validated neighbor with an invalid state. # ip neigh add 192.0.2.1 lladdr 00:11:22:33:44:55 nud stale dev br0.10 extern_valid # ip neigh replace 192.0.2.1 nud failed dev br0.10 extern_valid Error: Cannot mark neighbor as externally validated with an invalid state. The above means that a neighbor cannot be created with the "extern_valid" flag and flags such as "use" or "managed" as they result in a neighbor being created with an invalid state ("none") and immediately getting probed: # ip neigh add 192.0.2.1 lladdr 00:11:22:33:44:55 nud stale dev br0.10 extern_valid use Error: Cannot create externally validated neighbor with an invalid state. However, these flags can be used together with "extern_valid" after the neighbor was created with a valid state: # ip neigh add 192.0.2.1 lladdr 00:11:22:33:44:55 nud stale dev br0.10 extern_valid # ip neigh replace 192.0.2.1 lladdr 00:11:22:33:44:55 nud stale dev br0.10 extern_valid use One consequence of preventing the kernel from invalidating a neighbor entry is that by default it will only try to determine reachability using unicast probes. This can be changed using the "mcast_resolicit" sysctl: # sysctl net.ipv4.neigh.br0/10.mcast_resolicit 0 # tcpdump -nn -e -i br0.10 -Q out arp & # ip neigh replace 192.0.2.1 lladdr 00:11:22:33:44:55 nud stale dev br0.10 extern_valid use 62:50:1d:11:93:6f > 00:11:22:33:44:55, ethertype ARP (0x0806), length 42: Request who-has 192.0.2.1 tell 192.0.2.2, length 28 62:50:1d:11:93:6f > 00:11:22:33:44:55, ethertype ARP (0x0806), length 42: Request who-has 192.0.2.1 tell 192.0.2.2, length 28 62:50:1d:11:93:6f > 00:11:22:33:44:55, ethertype ARP (0x0806), length 42: Request who-has 192.0.2.1 tell 192.0.2.2, length 28 # sysctl -wq net.ipv4.neigh.br0/10.mcast_resolicit=3 # ip neigh replace 192.0.2.1 lladdr 00:11:22:33:44:55 nud stale dev br0.10 extern_valid use 62:50:1d:11:93:6f > 00:11:22:33:44:55, ethertype ARP (0x0806), length 42: Request who-has 192.0.2.1 tell 192.0.2.2, length 28 62:50:1d:11:93:6f > 00:11:22:33:44:55, ethertype ARP (0x0806), length 42: Request who-has 192.0.2.1 tell 192.0.2.2, length 28 62:50:1d:11:93:6f > 00:11:22:33:44:55, ethertype ARP (0x0806), length 42: Request who-has 192.0.2.1 tell 192.0.2.2, length 28 62:50:1d:11:93:6f > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Request who-has 192.0.2.1 tell 192.0.2.2, length 28 62:50:1d:11:93:6f > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Request who-has 192.0.2.1 tell 192.0.2.2, length 28 62:50:1d:11:93:6f > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Request who-has 192.0.2.1 tell 192.0.2.2, length 28 iproute2 patches can be found here [2]. [1] https://datatracker.ietf.org/doc/html/draft-rbickhart-evpn-ip-mac-proxy-adv-03 [2] https://github.com/idosch/iproute2/tree/submit/extern_valid_v1 Signed-off-by: Ido Schimmel <idosch@nvidia.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://patch.msgid.link/20250626073111.244534-2-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-26Merge tag 'for-netdev' of ↵Jakub Kicinski
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Daniel Borkmann says: ==================== pull-request: bpf-next 2025-06-27 We've added 6 non-merge commits during the last 8 day(s) which contain a total of 6 files changed, 120 insertions(+), 20 deletions(-). The main changes are: 1) Fix RCU usage in task_cls_state() for BPF programs using helpers like bpf_get_cgroup_classid_curr() outside of networking, from Charalampos Mitrodimas. 2) Fix a sockmap race between map_update and a pending workqueue from an earlier map_delete freeing the old psock where both pointed to the same psock->sk, from Jiayuan Chen. 3) Fix a data corruption issue when using bpf_msg_pop_data() in kTLS which failed to recalculate the ciphertext length, also from Jiayuan Chen. 4) Remove xdp_redirect_map{,_err} trace events since they are unused and also hide XDP trace events under CONFIG_BPF_SYSCALL, from Steven Rostedt. * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: xdp: tracing: Hide some xdp events under CONFIG_BPF_SYSCALL xdp: Remove unused events xdp_redirect_map and xdp_redirect_map_err net, bpf: Fix RCU usage in task_cls_state() for BPF programs selftests/bpf: Add test to cover ktls with bpf_msg_pop_data bpf, ktls: Fix data corruption when using bpf_msg_pop_data() in ktls bpf, sockmap: Fix psock incorrectly pointing to sk ==================== Link: https://patch.msgid.link/20250626230111.24772-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-06-26Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski
Cross-merge networking fixes after downstream PR (net-6.16-rc4). Conflicts: Documentation/netlink/specs/mptcp_pm.yaml 9e6dd4c256d0 ("netlink: specs: mptcp: replace underscores with dashes in names") ec362192aa9e ("netlink: specs: fix up indentation errors") https://lore.kernel.org/20250626122205.389c2cd4@canb.auug.org.au Adjacent changes: Documentation/netlink/specs/fou.yaml 791a9ed0a40d ("netlink: specs: fou: replace underscores with dashes in names") 880d43ca9aa4 ("netlink: specs: clean up spaces in brackets") Signed-off-by: Jakub Kicinski <kuba@kernel.org>