summaryrefslogtreecommitdiff
AgeCommit message (Collapse)Author
2024-05-27net: gro: initialize network_offset in network layerWillem de Bruijn
Syzkaller was able to trigger kernel BUG at net/core/gro.c:424 ! RIP: 0010:gro_pull_from_frag0 net/core/gro.c:424 [inline] RIP: 0010:gro_try_pull_from_frag0 net/core/gro.c:446 [inline] RIP: 0010:dev_gro_receive+0x242f/0x24b0 net/core/gro.c:571 Due to using an incorrect NAPI_GRO_CB(skb)->network_offset. The referenced commit sets this offset to 0 in skb_gro_reset_offset. That matches the expected case in dev_gro_receive: pp = INDIRECT_CALL_INET(ptype->callbacks.gro_receive, ipv6_gro_receive, inet_gro_receive, &gro_list->list, skb); But syzkaller injected an skb with protocol ETH_P_TEB into an ip6gre device (by writing the IP6GRE encapsulated version to a TAP device). The result was a first call to eth_gro_receive, and thus an extra ETH_HLEN in network_offset that should not be there. First issue hit is when computing offset from network header in ipv6_gro_pull_exthdrs. Initialize both offsets in the network layer gro_receive. This pairs with all reads in gro_receive, which use skb_gro_receive_network_offset(). Fixes: 186b1ea73ad8 ("net: gro: use cb instead of skb->network_header") Reported-by: syzkaller <syzkaller@googlegroups.com> Signed-off-by: Willem de Bruijn <willemb@google.com> CC: Richard Gobert <richardbgobert@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20240523141434.1752483-1-willemdebruijn.kernel@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-27ipv4: Fix address dump when IPv4 is disabled on an interfaceIdo Schimmel
Cited commit started returning an error when user space requests to dump the interface's IPv4 addresses and IPv4 is disabled on the interface. Restore the previous behavior and do not return an error. Before cited commit: # ip address show dev dummy1 10: dummy1: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000 link/ether e2:40:68:98:d0:18 brd ff:ff:ff:ff:ff:ff inet6 fe80::e040:68ff:fe98:d018/64 scope link proto kernel_ll valid_lft forever preferred_lft forever # ip link set dev dummy1 mtu 67 # ip address show dev dummy1 10: dummy1: <BROADCAST,NOARP,UP,LOWER_UP> mtu 67 qdisc noqueue state UNKNOWN group default qlen 1000 link/ether e2:40:68:98:d0:18 brd ff:ff:ff:ff:ff:ff After cited commit: # ip address show dev dummy1 10: dummy1: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000 link/ether 32:2d:69:f2:9c:99 brd ff:ff:ff:ff:ff:ff inet6 fe80::302d:69ff:fef2:9c99/64 scope link proto kernel_ll valid_lft forever preferred_lft forever # ip link set dev dummy1 mtu 67 # ip address show dev dummy1 RTNETLINK answers: No such device Dump terminated With this patch: # ip address show dev dummy1 10: dummy1: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000 link/ether de:17:56:bb:57:c0 brd ff:ff:ff:ff:ff:ff inet6 fe80::dc17:56ff:febb:57c0/64 scope link proto kernel_ll valid_lft forever preferred_lft forever # ip link set dev dummy1 mtu 67 # ip address show dev dummy1 10: dummy1: <BROADCAST,NOARP,UP,LOWER_UP> mtu 67 qdisc noqueue state UNKNOWN group default qlen 1000 link/ether de:17:56:bb:57:c0 brd ff:ff:ff:ff:ff:ff I fixed the exact same issue for IPv6 in commit c04f7dfe6ec2 ("ipv6: Fix address dump when IPv6 is disabled on an interface"), but noted [1] that I am not doing the change for IPv4 because I am not aware of a way to disable IPv4 on an interface other than unregistering it. I clearly missed the above case. [1] https://lore.kernel.org/netdev/20240321173042.2151756-1-idosch@nvidia.com/ Fixes: cdb2f80f1c10 ("inet: use xa_array iterator to implement inet_dump_ifaddr()") Reported-by: Carolina Jubran <cjubran@nvidia.com> Reported-by: Yamen Safadi <ysafadi@nvidia.com> Tested-by: Carolina Jubran <cjubran@nvidia.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://lore.kernel.org/r/20240523110257.334315-1-idosch@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-27Merge tag 'for-netdev' of ↵Jakub Kicinski
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf Daniel Borkmann says: ==================== pull-request: bpf 2024-05-27 We've added 15 non-merge commits during the last 7 day(s) which contain a total of 18 files changed, 583 insertions(+), 55 deletions(-). The main changes are: 1) Fix broken BPF multi-uprobe PID filtering logic which filtered by thread while the promise was to filter by process, from Andrii Nakryiko. 2) Fix the recent influx of syzkaller reports to sockmap which triggered a locking rule violation by performing a map_delete, from Jakub Sitnicki. 3) Fixes to netkit driver in particular on skb->pkt_type override upon pass verdict, from Daniel Borkmann. 4) Fix an integer overflow in resolve_btfids which can wrongly trigger build failures, from Friedrich Vock. 5) Follow-up fixes for ARC JIT reported by static analyzers, from Shahab Vahedi. * tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: selftests/bpf: Cover verifier checks for mutating sockmap/sockhash Revert "bpf, sockmap: Prevent lock inversion deadlock in map delete elem" bpf: Allow delete from sockmap/sockhash only if update is allowed selftests/bpf: Add netkit test for pkt_type selftests/bpf: Add netkit tests for mac address netkit: Fix pkt_type override upon netkit pass verdict netkit: Fix setting mac address in l2 mode ARC, bpf: Fix issues reported by the static analyzers selftests/bpf: extend multi-uprobe tests with USDTs selftests/bpf: extend multi-uprobe tests with child thread case libbpf: detect broken PID filtering logic for multi-uprobe bpf: remove unnecessary rcu_read_{lock,unlock}() in multi-uprobe attach logic bpf: fix multi-uprobe PID filtering logic bpf: Fix potential integer overflow in resolve_btfids MAINTAINERS: Add myself as reviewer of ARM64 BPF JIT ==================== Link: https://lore.kernel.org/r/20240527203551.29712-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-05-27selftests/bpf: Cover verifier checks for mutating sockmap/sockhashJakub Sitnicki
Verifier enforces that only certain program types can mutate sock{map,hash} maps, that is update it or delete from it. Add test coverage for these checks so we don't regress. Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20240527-sockmap-verify-deletes-v1-3-944b372f2101@cloudflare.com
2024-05-27Revert "bpf, sockmap: Prevent lock inversion deadlock in map delete elem"Jakub Sitnicki
This reverts commit ff91059932401894e6c86341915615c5eb0eca48. This check is no longer needed. BPF programs attached to tracepoints are now rejected by the verifier when they attempt to delete from a sockmap/sockhash maps. Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20240527-sockmap-verify-deletes-v1-2-944b372f2101@cloudflare.com
2024-05-27bpf: Allow delete from sockmap/sockhash only if update is allowedJakub Sitnicki
We have seen an influx of syzkaller reports where a BPF program attached to a tracepoint triggers a locking rule violation by performing a map_delete on a sockmap/sockhash. We don't intend to support this artificial use scenario. Extend the existing verifier allowed-program-type check for updating sockmap/sockhash to also cover deleting from a map. From now on only BPF programs which were previously allowed to update sockmap/sockhash can delete from these map types. Fixes: ff9105993240 ("bpf, sockmap: Prevent lock inversion deadlock in map delete elem") Reported-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp> Reported-by: syzbot+ec941d6e24f633a59172@syzkaller.appspotmail.com Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Tested-by: syzbot+ec941d6e24f633a59172@syzkaller.appspotmail.com Acked-by: John Fastabend <john.fastabend@gmail.com> Closes: https://syzkaller.appspot.com/bug?extid=ec941d6e24f633a59172 Link: https://lore.kernel.org/bpf/20240527-sockmap-verify-deletes-v1-1-944b372f2101@cloudflare.com
2024-05-27net: usb: smsc95xx: fix changing LED_SEL bit value updated from EEPROMParthiban Veerasooran
LED Select (LED_SEL) bit in the LED General Purpose IO Configuration register is used to determine the functionality of external LED pins (Speed Indicator, Link and Activity Indicator, Full Duplex Link Indicator). The default value for this bit is 0 when no EEPROM is present. If a EEPROM is present, the default value is the value of the LED Select bit in the Configuration Flags of the EEPROM. A USB Reset or Lite Reset (LRST) will cause this bit to be restored to the image value last loaded from EEPROM, or to be set to 0 if no EEPROM is present. While configuring the dual purpose GPIO/LED pins to LED outputs in the LED General Purpose IO Configuration register, the LED_SEL bit is changed as 0 and resulting the configured value from the EEPROM is cleared. The issue is fixed by using read-modify-write approach. Fixes: f293501c61c5 ("smsc95xx: configure LED outputs") Signed-off-by: Parthiban Veerasooran <Parthiban.Veerasooran@microchip.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Woojung Huh <woojung.huh@microchip.com> Link: https://lore.kernel.org/r/20240523085314.167650-1-Parthiban.Veerasooran@microchip.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-05-27Octeontx2-pf: Free send queue buffers incase of leaf to innerHariprasad Kelam
There are two type of classes. "Leaf classes" that are the bottom of the class hierarchy. "Inner classes" that are neither the root class nor leaf classes. QoS rules can only specify leaf classes as targets for traffic. Root / \ / \ 1 2 /\ / \ 4 5 classes 1,4 and 5 are leaf classes. class 2 is a inner class. When a leaf class made as inner, or vice versa, resources associated with send queue (send queue buffers and transmit schedulers) are not getting freed. Fixes: 5e6808b4c68d ("octeontx2-pf: Add support for HTB offload") Signed-off-by: Hariprasad Kelam <hkelam@marvell.com> Link: https://lore.kernel.org/r/20240523073626.4114-1-hkelam@marvell.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-05-27af_unix: Read sk->sk_hash under bindlock during bind().Kuniyuki Iwashima
syzkaller reported data-race of sk->sk_hash in unix_autobind() [0], and the same ones exist in unix_bind_bsd() and unix_bind_abstract(). The three bind() functions prefetch sk->sk_hash locklessly and use it later after validating that unix_sk(sk)->addr is NULL under unix_sk(sk)->bindlock. The prefetched sk->sk_hash is the hash value of unbound socket set in unix_create1() and does not change until bind() completes. There could be a chance that sk->sk_hash changes after the lockless read. However, in such a case, non-NULL unix_sk(sk)->addr is visible under unix_sk(sk)->bindlock, and bind() returns -EINVAL without using the prefetched value. The KCSAN splat is false-positive, but let's silence it by reading sk->sk_hash under unix_sk(sk)->bindlock. [0]: BUG: KCSAN: data-race in unix_autobind / unix_autobind write to 0xffff888034a9fb88 of 4 bytes by task 4468 on cpu 0: __unix_set_addr_hash net/unix/af_unix.c:331 [inline] unix_autobind+0x47a/0x7d0 net/unix/af_unix.c:1185 unix_dgram_connect+0x7e3/0x890 net/unix/af_unix.c:1373 __sys_connect_file+0xd7/0xe0 net/socket.c:2048 __sys_connect+0x114/0x140 net/socket.c:2065 __do_sys_connect net/socket.c:2075 [inline] __se_sys_connect net/socket.c:2072 [inline] __x64_sys_connect+0x40/0x50 net/socket.c:2072 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0x4f/0x110 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x46/0x4e read to 0xffff888034a9fb88 of 4 bytes by task 4465 on cpu 1: unix_autobind+0x28/0x7d0 net/unix/af_unix.c:1134 unix_dgram_connect+0x7e3/0x890 net/unix/af_unix.c:1373 __sys_connect_file+0xd7/0xe0 net/socket.c:2048 __sys_connect+0x114/0x140 net/socket.c:2065 __do_sys_connect net/socket.c:2075 [inline] __se_sys_connect net/socket.c:2072 [inline] __x64_sys_connect+0x40/0x50 net/socket.c:2072 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0x4f/0x110 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x46/0x4e value changed: 0x000000e4 -> 0x000001e3 Reported by Kernel Concurrency Sanitizer on: CPU: 1 PID: 4465 Comm: syz-executor.0 Not tainted 6.8.0-12822-gcd51db110a7e #12 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014 Fixes: afd20b9290e1 ("af_unix: Replace the big lock with small locks.") Reported-by: syzkaller <syzkaller@googlegroups.com> Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://lore.kernel.org/r/20240522154218.78088-1-kuniyu@amazon.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-05-27af_unix: Annotate data-race around unix_sk(sk)->addr.Kuniyuki Iwashima
Once unix_sk(sk)->addr is assigned under net->unx.table.locks and unix_sk(sk)->bindlock, *(unix_sk(sk)->addr) and unix_sk(sk)->path are fully set up, and unix_sk(sk)->addr is never changed. unix_getname() and unix_copy_addr() access the two fields locklessly, and commit ae3b564179bf ("missing barriers in some of unix_sock ->addr and ->path accesses") added smp_store_release() and smp_load_acquire() pairs. In other functions, we still read unix_sk(sk)->addr locklessly to check if the socket is bound, and KCSAN complains about it. [0] Given these functions have no dependency for *(unix_sk(sk)->addr) and unix_sk(sk)->path, READ_ONCE() is enough to annotate the data-race. Note that it is safe to access unix_sk(sk)->addr locklessly if the socket is found in the hash table. For example, the lockless read of otheru->addr in unix_stream_connect() is safe. Note also that newu->addr there is of the child socket that is still not accessible from userspace, and smp_store_release() publishes the address in case the socket is accept()ed and unix_getname() / unix_copy_addr() is called. [0]: BUG: KCSAN: data-race in unix_bind / unix_listen write (marked) to 0xffff88805f8d1840 of 8 bytes by task 13723 on cpu 0: __unix_set_addr_hash net/unix/af_unix.c:329 [inline] unix_bind_bsd net/unix/af_unix.c:1241 [inline] unix_bind+0x881/0x1000 net/unix/af_unix.c:1319 __sys_bind+0x194/0x1e0 net/socket.c:1847 __do_sys_bind net/socket.c:1858 [inline] __se_sys_bind net/socket.c:1856 [inline] __x64_sys_bind+0x40/0x50 net/socket.c:1856 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0x4f/0x110 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x46/0x4e read to 0xffff88805f8d1840 of 8 bytes by task 13724 on cpu 1: unix_listen+0x72/0x180 net/unix/af_unix.c:734 __sys_listen+0xdc/0x160 net/socket.c:1881 __do_sys_listen net/socket.c:1890 [inline] __se_sys_listen net/socket.c:1888 [inline] __x64_sys_listen+0x2e/0x40 net/socket.c:1888 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0x4f/0x110 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x46/0x4e value changed: 0x0000000000000000 -> 0xffff88807b5b1b40 Reported by Kernel Concurrency Sanitizer on: CPU: 1 PID: 13724 Comm: syz-executor.4 Not tainted 6.8.0-12822-gcd51db110a7e #12 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014 Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Reported-by: syzkaller <syzkaller@googlegroups.com> Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://lore.kernel.org/r/20240522154002.77857-1-kuniyu@amazon.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-05-27selftests: hsr: Fix "File exists" errors for hsr_pingGeliang Tang
The hsr_ping test reports the following errors: INFO: preparing interfaces for HSRv0. INFO: Initial validation ping. INFO: Longer ping test. INFO: Cutting one link. INFO: Delay the link and drop a few packages. INFO: All good. INFO: preparing interfaces for HSRv1. RTNETLINK answers: File exists RTNETLINK answers: File exists RTNETLINK answers: File exists RTNETLINK answers: File exists RTNETLINK answers: File exists RTNETLINK answers: File exists Error: ipv4: Address already assigned. Error: ipv6: address already assigned. Error: ipv4: Address already assigned. Error: ipv6: address already assigned. Error: ipv4: Address already assigned. Error: ipv6: address already assigned. INFO: Initial validation ping. That is because the cleanup code for the 2nd round test before "setup_hsr_interfaces 1" is removed incorrectly in commit 680fda4f6714 ("test: hsr: Remove script code already implemented in lib.sh"). This patch fixes it by re-setup the namespaces using setup_ns ns1 ns2 ns3 command before "setup_hsr_interfaces 1". It deletes previous namespaces and create new ones. Fixes: 680fda4f6714 ("test: hsr: Remove script code already implemented in lib.sh") Reviewed-by: Hangbin Liu <liuhangbin@gmail.com> Signed-off-by: Geliang Tang <tanggeliang@kylinos.cn> Link: https://lore.kernel.org/r/6485d3005f467758d49f0f313c8c009759ba6b05.1716374462.git.tanggeliang@kylinos.cn Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-05-27enic: Validate length of nl attributes in enic_set_vf_portRoded Zats
enic_set_vf_port assumes that the nl attribute IFLA_PORT_PROFILE is of length PORT_PROFILE_MAX and that the nl attributes IFLA_PORT_INSTANCE_UUID, IFLA_PORT_HOST_UUID are of length PORT_UUID_MAX. These attributes are validated (in the function do_setlink in rtnetlink.c) using the nla_policy ifla_port_policy. The policy defines IFLA_PORT_PROFILE as NLA_STRING, IFLA_PORT_INSTANCE_UUID as NLA_BINARY and IFLA_PORT_HOST_UUID as NLA_STRING. That means that the length validation using the policy is for the max size of the attributes and not on exact size so the length of these attributes might be less than the sizes that enic_set_vf_port expects. This might cause an out of bands read access in the memcpys of the data of these attributes in enic_set_vf_port. Fixes: f8bd909183ac ("net: Add ndo_{set|get}_vf_port support for enic dynamic vnics") Signed-off-by: Roded Zats <rzats@paloaltonetworks.com> Link: https://lore.kernel.org/r/20240522073044.33519-1-rzats@paloaltonetworks.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-05-25selftests/bpf: Add netkit test for pkt_typeDaniel Borkmann
Add a test case to assert that the skb->pkt_type which was set from the BPF program is retained from the netkit xmit side to the peer's device at tcx ingress location. # ./vmtest.sh -- ./test_progs -t netkit [...] ./test_progs -t netkit [ 1.140780] bpf_testmod: loading out-of-tree module taints kernel. [ 1.141127] bpf_testmod: module verification failed: signature and/or required key missing - tainting kernel [ 1.284601] tsc: Refined TSC clocksource calibration: 3408.006 MHz [ 1.286672] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x311fd9b189d, max_idle_ns: 440795225691 ns [ 1.290384] clocksource: Switched to clocksource tsc #345 tc_netkit_basic:OK #346 tc_netkit_device:OK #347 tc_netkit_multi_links:OK #348 tc_netkit_multi_opts:OK #349 tc_netkit_neigh_links:OK #350 tc_netkit_pkt_type:OK Summary: 6/0 PASSED, 0 SKIPPED, 0 FAILED Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/r/20240524163619.26001-4-daniel@iogearbox.net Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-05-25selftests/bpf: Add netkit tests for mac addressDaniel Borkmann
This adds simple tests around setting MAC addresses in the different netkit modes. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/r/20240524163619.26001-3-daniel@iogearbox.net Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-05-25netkit: Fix pkt_type override upon netkit pass verdictDaniel Borkmann
When running Cilium connectivity test suite with netkit in L2 mode, we found that compared to tcx a few tests were failing which pushed traffic into an L7 proxy sitting in host namespace. The problem in particular is around the invocation of eth_type_trans() in netkit. In case of tcx, this is run before the tcx ingress is triggered inside host namespace and thus if the BPF program uses the bpf_skb_change_type() helper the newly set type is retained. However, in case of netkit, the late eth_type_trans() invocation overrides the earlier decision from the BPF program which eventually leads to the test failure. Instead of eth_type_trans(), split out the relevant parts, meaning, reset of mac header and call to eth_skb_pkt_type() before the BPF program is run in order to have the same behavior as with tcx, and refactor a small helper called eth_skb_pull_mac() which is run in case it's passed up the stack where the mac header must be pulled. With this all connectivity tests pass. Fixes: 35dfaad7188c ("netkit, bpf: Add bpf programmable net device") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://lore.kernel.org/r/20240524163619.26001-2-daniel@iogearbox.net Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-05-25netkit: Fix setting mac address in l2 modeDaniel Borkmann
When running Cilium connectivity test suite with netkit in L2 mode, we found that it is expected to be able to specify a custom MAC address for the devices, in particular, cilium-cni obtains the specified MAC address by querying the endpoint and sets the MAC address of the interface inside the Pod. Thus, fix the missing support in netkit for L2 mode. Fixes: 35dfaad7188c ("netkit, bpf: Add bpf programmable net device") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Acked-by: Stanislav Fomichev <sdf@google.com> Link: https://lore.kernel.org/r/20240524163619.26001-1-daniel@iogearbox.net Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-05-25ARC, bpf: Fix issues reported by the static analyzersShahab Vahedi
Also updated couple of comments along the way. One of the issues reported was indeed a bug in the code: memset(ctx, 0, sizeof(ctx)) // original line memset(ctx, 0, sizeof(*ctx)) // fixed line That was a nice catch. Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202405222314.UG5F2NHn-lkp@intel.com/ Closes: https://lore.kernel.org/oe-kbuild-all/202405232036.Xqoc3b0J-lkp@intel.com/ Signed-off-by: Shahab Vahedi <shahab@synopsys.com> Link: https://lore.kernel.org/r/20240525035628.1026-1-list+bpf@vahedi.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-05-25Merge branch 'fix-bpf-multi-uprobe-pid-filtering-logic'Alexei Starovoitov
Andrii Nakryiko says: ==================== Fix BPF multi-uprobe PID filtering logic It turns out that current implementation of multi-uprobe PID filtering logic is broken. It filters by thread, while the promise is filtering by process. Patch #1 fixes the logic trivially. The rest is testing and mitigations that are necessary for libbpf to not break users of USDT programs. v1->v2: - fix selftest in last patch (CI); - use semicolon in patch #3 (Jiri). ==================== Link: https://lore.kernel.org/r/20240521163401.3005045-1-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-05-25selftests/bpf: extend multi-uprobe tests with USDTsAndrii Nakryiko
Validate libbpf's USDT-over-multi-uprobe logic by adding USDTs to existing multi-uprobe tests. This checks correct libbpf fallback to singular uprobes (when run on older kernels with buggy PID filtering). We reuse already established child process and child thread testing infrastructure, so additions are minimal. These test fail on either older kernels or older version of libbpf that doesn't detect PID filtering problems. Acked-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20240521163401.3005045-6-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-05-25selftests/bpf: extend multi-uprobe tests with child thread caseAndrii Nakryiko
Extend existing multi-uprobe tests to test that PID filtering works correctly. We already have child *process* tests, but we need also child *thread* tests. This patch adds spawn_thread() helper to start child thread, wait for it to be ready, and then instruct it to trigger desired uprobes. Additionally, we extend BPF-side code to track thread ID, not just process ID. Also we detect whether extraneous triggerings with unexpected process IDs happened, and validate that none of that happened in practice. These changes prove that fixed PID filtering logic for multi-uprobe works as expected. These tests fail on old kernels. Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Acked-by: Jiri Olsa <jolsa@kernel.org> Link: https://lore.kernel.org/r/20240521163401.3005045-5-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-05-25libbpf: detect broken PID filtering logic for multi-uprobeAndrii Nakryiko
Libbpf is automatically (and transparently to user) detecting multi-uprobe support in the kernel, and, if supported, uses multi-uprobes to improve USDT attachment speed. USDTs can be attached system-wide or for the specific process by PID. In the latter case, we rely on correct kernel logic of not triggering USDT for unrelated processes. As such, on older kernels that do support multi-uprobes, but still have broken PID filtering logic, we need to fall back to singular uprobes. Unfortunately, whether user is using PID filtering or not is known at the attachment time, which happens after relevant BPF programs were loaded into the kernel. Also unfortunately, we need to make a call whether to use multi-uprobes or singular uprobe for SEC("usdt") programs during BPF object load time, at which point we have no information about possible PID filtering. The distinction between single and multi-uprobes is small, but important for the kernel. Multi-uprobes get BPF_TRACE_UPROBE_MULTI attach type, and kernel internally substitiute different implementation of some of BPF helpers (e.g., bpf_get_attach_cookie()) depending on whether uprobe is multi or singular. So, multi-uprobes and singular uprobes cannot be intermixed. All the above implies that we have to make an early and conservative call about the use of multi-uprobes. And so this patch modifies libbpf's existing feature detector for multi-uprobe support to also check correct PID filtering. If PID filtering is not yet fixed, we fall back to singular uprobes for USDTs. This extension to feature detection is simple thanks to kernel's -EINVAL addition for pid < 0. Acked-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20240521163401.3005045-4-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-05-25bpf: remove unnecessary rcu_read_{lock,unlock}() in multi-uprobe attach logicAndrii Nakryiko
get_pid_task() internally already calls rcu_read_lock() and rcu_read_unlock(), so there is no point to do this one extra time. This is a drive-by improvement and has no correctness implications. Acked-by: Jiri Olsa <jolsa@kernel.org> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20240521163401.3005045-3-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-05-25bpf: fix multi-uprobe PID filtering logicAndrii Nakryiko
Current implementation of PID filtering logic for multi-uprobes in uprobe_prog_run() is filtering down to exact *thread*, while the intent for PID filtering it to filter by *process* instead. The check in uprobe_prog_run() also differs from the analogous one in uprobe_multi_link_filter() for some reason. The latter is correct, checking task->mm, not the task itself. Fix the check in uprobe_prog_run() to perform the same task->mm check. While doing this, we also update get_pid_task() use to use PIDTYPE_TGID type of lookup, given the intent is to get a representative task of an entire process. This doesn't change behavior, but seems more logical. It would hold task group leader task now, not any random thread task. Last but not least, given multi-uprobe support is half-broken due to this PID filtering logic (depending on whether PID filtering is important or not), we need to make it easy for user space consumers (including libbpf) to easily detect whether PID filtering logic was already fixed. We do it here by adding an early check on passed pid parameter. If it's negative (and so has no chance of being a valid PID), we return -EINVAL. Previous behavior would eventually return -ESRCH ("No process found"), given there can't be any process with negative PID. This subtle change won't make any practical change in behavior, but will allow applications to detect PID filtering fixes easily. Libbpf fixes take advantage of this in the next patch. Cc: stable@vger.kernel.org Acked-by: Jiri Olsa <jolsa@kernel.org> Fixes: b733eeade420 ("bpf: Add pid filter support for uprobe_multi link") Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/r/20240521163401.3005045-2-andrii@kernel.org Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2024-05-24bpf: Fix potential integer overflow in resolve_btfidsFriedrich Vock
err is a 32-bit integer, but elf_update returns an off_t, which is 64-bit at least on 64-bit platforms. If symbols_patch is called on a binary between 2-4GB in size, the result will be negative when cast to a 32-bit integer, which the code assumes means an error occurred. This can wrongly trigger build failures when building very large kernel images. Fixes: fbbb68de80a4 ("bpf: Add resolve_btfids tool to resolve BTF IDs in ELF object") Signed-off-by: Friedrich Vock <friedrich.vock@gmx.de> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20240514070931.199694-1-friedrich.vock@gmx.de
2024-05-24Merge branch 'mlx5-fixes'David S. Miller
Tariq Toukan says: ==================== mlx5 fixes 24-05-22 This patchset provides bug fixes to mlx5 core and Eth drivers. Series generated against: commit 9c91c7fadb17 ("net: mana: Fix the extra HZ in mana_hwc_send_request") ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2024-05-24net/mlx5e: Fix UDP GSO for encapsulated packetsGal Pressman
When the skb is encapsulated, adjust the inner UDP header instead of the outer one, and account for UDP header (instead of TCP) in the inline header size calculation. Fixes: 689adf0d4892 ("net/mlx5e: Add UDP GSO support") Reported-by: Jason Baron <jbaron@akamai.com> Closes: https://lore.kernel.org/netdev/c42961cb-50b9-4a9a-bd43-87fe48d88d29@akamai.com/ Signed-off-by: Gal Pressman <gal@nvidia.com> Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com> Reviewed-by: Boris Pismenny <borisp@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-05-24net/mlx5e: Use rx_missed_errors instead of rx_dropped for reporting buffer ↵Carolina Jubran
exhaustion Previously, the driver incorrectly used rx_dropped to report device buffer exhaustion. According to the documentation, rx_dropped should not be used to count packets dropped due to buffer exhaustion, which is the purpose of rx_missed_errors. Use rx_missed_errors as intended for counting packets dropped due to buffer exhaustion. Fixes: 269e6b3af3bf ("net/mlx5e: Report additional error statistics in get stats ndo") Signed-off-by: Carolina Jubran <cjubran@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-05-24net/mlx5e: Do not use ptp structure for tx ts stats when not initializedRahul Rameshbabu
The ptp channel instance is only initialized when ptp traffic is first processed by the driver. This means that there is a window in between when port timestamping is enabled and ptp traffic is sent where the ptp channel instance is not initialized. Accessing statistics during this window will lead to an access violation (NULL + member offset). Check the validity of the instance before attempting to query statistics. BUG: unable to handle page fault for address: 0000000000003524 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 109dfc067 P4D 109dfc067 PUD 1064ef067 PMD 0 Oops: 0000 [#1] SMP CPU: 0 PID: 420 Comm: ethtool Not tainted 6.9.0-rc2-rrameshbabu+ #245 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS Arch Linux 1.16.3-1-1 04/01/204 RIP: 0010:mlx5e_stats_ts_get+0x4c/0x130 <snip> Call Trace: <TASK> ? show_regs+0x60/0x70 ? __die+0x24/0x70 ? page_fault_oops+0x15f/0x430 ? do_user_addr_fault+0x2c9/0x5c0 ? exc_page_fault+0x63/0x110 ? asm_exc_page_fault+0x27/0x30 ? mlx5e_stats_ts_get+0x4c/0x130 ? mlx5e_stats_ts_get+0x20/0x130 mlx5e_get_ts_stats+0x15/0x20 <snip> Fixes: 3579032c08c1 ("net/mlx5e: Implement ethtool hardware timestamping statistics") Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-05-24net/mlx5e: Fix IPsec tunnel mode offload feature checkRahul Rameshbabu
Remove faulty check disabling checksum offload and GSO for offload of simple IPsec tunnel L4 traffic. Comment previously describing the deleted code incorrectly claimed the check prevented double tunnel (or three layers of ip headers). Fixes: f1267798c980 ("net/mlx5: Fix checksum issue of VXLAN and IPsec crypto offload") Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-05-24net/mlx5: Use mlx5_ipsec_rx_status_destroy to correctly delete status rulesRahul Rameshbabu
rx_create no longer allocates a modify_hdr instance that needs to be cleaned up. The mlx5_modify_header_dealloc call will lead to a NULL pointer dereference. A leak in the rules also previously occurred since there are now two rules populated related to status. BUG: kernel NULL pointer dereference, address: 0000000000000000 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 109907067 P4D 109907067 PUD 116890067 PMD 0 Oops: 0000 [#1] SMP CPU: 1 PID: 484 Comm: ip Not tainted 6.9.0-rc2-rrameshbabu+ #254 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS Arch Linux 1.16.3-1-1 04/01/2014 RIP: 0010:mlx5_modify_header_dealloc+0xd/0x70 <snip> Call Trace: <TASK> ? show_regs+0x60/0x70 ? __die+0x24/0x70 ? page_fault_oops+0x15f/0x430 ? free_to_partial_list.constprop.0+0x79/0x150 ? do_user_addr_fault+0x2c9/0x5c0 ? exc_page_fault+0x63/0x110 ? asm_exc_page_fault+0x27/0x30 ? mlx5_modify_header_dealloc+0xd/0x70 rx_create+0x374/0x590 rx_add_rule+0x3ad/0x500 ? rx_add_rule+0x3ad/0x500 ? mlx5_cmd_exec+0x2c/0x40 ? mlx5_create_ipsec_obj+0xd6/0x200 mlx5e_accel_ipsec_fs_add_rule+0x31/0xf0 mlx5e_xfrm_add_state+0x426/0xc00 <snip> Fixes: 94af50c0a9bb ("net/mlx5e: Unify esw and normal IPsec status table creation/destruction") Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-05-24net/mlx5: Fix MTMP register capability offset in MCAM registerGal Pressman
The MTMP register (0x900a) capability offset is off-by-one, move it to the right place. Fixes: 1f507e80c700 ("net/mlx5: Expose NIC temperature via hardware monitoring kernel API") Signed-off-by: Gal Pressman <gal@nvidia.com> Reviewed-by: Cosmin Ratiu <cratiu@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-05-24net/mlx5: Do not query MPIR on embedded CPU functionTariq Toukan
A proper query to MPIR needs to set the correct value in the depth field. On embedded CPU this value is not necessarily zero. As there is no real use case for multi-PF netdev on the embedded CPU of the smart NIC, block this option. This fixes the following failure: ACCESS_REG(0x805) op_mod(0x1) failed, status bad system state(0x4), syndrome (0x685f19), err(-5) Fixes: 678eb448055a ("net/mlx5: SD, Implement basic query and instantiation") Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-05-24net/mlx5: Lag, do bond only if slaves agree on roce stateMaher Sanalla
Currently, the driver does not enforce that lag bond slaves must have matching roce capabilities. Yet, in mlx5_do_bond(), the driver attempts to enable roce on all vports of the bond slaves, causing the following syndrome when one slave has no roce fw support: mlx5_cmd_out_err:809:(pid 25427): MODIFY_NIC_VPORT_CONTEXT(0×755) op_mod(0×0) failed, status bad parameter(0×3), syndrome (0xc1f678), err(-22) Thus, create HW lag only if bond's slaves agree on roce state, either all slaves have roce support resulting in a roce lag bond, or none do, resulting in a raw eth bond. Fixes: 7907f23adc18 ("net/mlx5: Implement RoCE LAG feature") Signed-off-by: Maher Sanalla <msanalla@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-05-24net: phy: micrel: set soft_reset callback to genphy_soft_reset for KSZ8061Mathieu Othacehe
Following a similar reinstate for the KSZ8081 and KSZ9031. Older kernels would use the genphy_soft_reset if the PHY did not implement a .soft_reset. The KSZ8061 errata described here: https://ww1.microchip.com/downloads/en/DeviceDoc/KSZ8061-Errata-DS80000688B.pdf and worked around with 232ba3a51c ("net: phy: Micrel KSZ8061: link failure after cable connect") is back again without this soft reset. Fixes: 6e2d85ec0559 ("net: phy: Stop with excessive soft reset") Tested-by: Karim Ben Houcine <karim.benhoucine@landisgyr.com> Signed-off-by: Mathieu Othacehe <othacehe@gnu.org> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Reviewed-by: Florian Fainelli <florian.fainelli@broadcom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-05-24connector: Fix invalid conversion in cn_proc.hMatt Jan
The implicit conversion from unsigned int to enum proc_cn_event is invalid, so explicitly cast it for compilation in a C++ compiler. /usr/include/linux/cn_proc.h: In function 'proc_cn_event valid_event(proc_cn_event)': /usr/include/linux/cn_proc.h:72:17: error: invalid conversion from 'unsigned int' to 'proc_cn_event' [-fpermissive] 72 | ev_type &= PROC_EVENT_ALL; | ^ | | | unsigned int Signed-off-by: Matt Jan <zoo868e@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2024-05-23Merge tag 'net-6.10-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Paolo Abeni: "Quite smaller than usual. Notably it includes the fix for the unix regression from the past weeks. The TCP window fix will require some follow-up, already queued. Current release - regressions: - af_unix: fix garbage collection of embryos Previous releases - regressions: - af_unix: fix race between GC and receive path - ipv6: sr: fix missing sk_buff release in seg6_input_core - tcp: remove 64 KByte limit for initial tp->rcv_wnd value - eth: r8169: fix rx hangup - eth: lan966x: remove ptp traps in case the ptp is not enabled - eth: ixgbe: fix link breakage vs cisco switches - eth: ice: prevent ethtool from corrupting the channels Previous releases - always broken: - openvswitch: set the skbuff pkt_type for proper pmtud support - tcp: Fix shift-out-of-bounds in dctcp_update_alpha() Misc: - a bunch of selftests stabilization patches" * tag 'net-6.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (25 commits) r8169: Fix possible ring buffer corruption on fragmented Tx packets. idpf: Interpret .set_channels() input differently ice: Interpret .set_channels() input differently nfc: nci: Fix handling of zero-length payload packets in nci_rx_work() net: relax socket state check at accept time. tcp: remove 64 KByte limit for initial tp->rcv_wnd value net: ti: icssg_prueth: Fix NULL pointer dereference in prueth_probe() tls: fix missing memory barrier in tls_init net: fec: avoid lock evasion when reading pps_enable Revert "ixgbe: Manual AN-37 for troublesome link partners for X550 SFI" testing: net-drv: use stats64 for testing net: mana: Fix the extra HZ in mana_hwc_send_request net: lan966x: Remove ptp traps in case the ptp is not enabled. openvswitch: Set the skbuff pkt_type for proper pmtud support. selftest: af_unix: Make SCM_RIGHTS into OOB data. af_unix: Fix garbage collection of embryos carrying OOB with SCM_RIGHTS tcp: Fix shift-out-of-bounds in dctcp_update_alpha(). selftests/net: use tc rule to filter the na packet ipv6: sr: fix memleak in seg6_hmac_init_algo af_unix: Update unix_sk(sk)->oob_skb under sk_receive_queue lock. ...
2024-05-23Merge tag 'trace-fixes-v6.10' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing fixes from Steven Rostedt: "Minor last minute fixes: - Fix a very tight race between the ring buffer readers and resizing the ring buffer - Correct some stale comments in the ring buffer code - Fix kernel-doc in the rv code - Add a MODULE_DESCRIPTION to preemptirq_delay_test" * tag 'trace-fixes-v6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: rv: Update rv_en(dis)able_monitor doc to match kernel-doc tracing: Add MODULE_DESCRIPTION() to preemptirq_delay_test ring-buffer: Fix a race between readers and resize checks ring-buffer: Correct stale comments related to non-consuming readers
2024-05-23Merge tag 'trace-tools-v6.10-2' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing tool fix from Steven Rostedt: "Fix printf format warnings in latency-collector. Use the printf format string with %s to take a string instead of taking in a string directly" * tag 'trace-tools-v6.10-2' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: tools/latency-collector: Fix -Wformat-security compile warns
2024-05-23Merge tag 'trace-assign-str-v6.10' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull tracing cleanup from Steven Rostedt: "Remove second argument of __assign_str() The __assign_str() macro logic of the TRACE_EVENT() macro was optimized so that it no longer needs the second argument. The __assign_str() is always matched with __string() field that takes a field name and the source for that field: __string(field, source) The TRACE_EVENT() macro logic will save off the source value and then use that value to copy into the ring buffer via the __assign_str(). Before commit c1fa617caeb0 ("tracing: Rework __assign_str() and __string() to not duplicate getting the string"), the __assign_str() needed the second argument which would perform the same logic as the __string() source parameter did. Not only would this add overhead, but it was error prone as if the __assign_str() source produced something different, it may not have allocated enough for the string in the ring buffer (as the __string() source was used to determine how much to allocate) Now that the __assign_str() just uses the same string that was used in __string() it no longer needs the source parameter. It can now be removed" * tag 'trace-assign-str-v6.10' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: tracing/treewide: Remove second parameter of __assign_str()
2024-05-23Merge tag 'sparc-for-6.10-tag1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/alarsson/linux-sparc Pull sparc updates from Andreas Larsson: - Avoid on-stack cpumask variables in a number of places - Move struct termio to asm/termios.h, matching other architectures and allowing certain user space applications to build also for sparc - Fix missing prototype warnings for sparc64 - Fix version generation warnings for sparc32 - Fix bug where non-consecutive CPU IDs lead to some CPUs not starting - Simplification using swap and cleanup using NULL for pointer - Convert sparc parport and chmc drivers to use remove callbacks returning void * tag 'sparc-for-6.10-tag1' of git://git.kernel.org/pub/scm/linux/kernel/git/alarsson/linux-sparc: sparc/leon: Remove on-stack cpumask var sparc/pci_msi: Remove on-stack cpumask var sparc/of: Remove on-stack cpumask var sparc/irq: Remove on-stack cpumask var sparc/srmmu: Remove on-stack cpumask var sparc: chmc: Convert to platform remove callback returning void sparc: parport: Convert to platform remove callback returning void sparc: Compare pointers to NULL instead of 0 sparc: Use swap() to fix Coccinelle warning sparc32: Fix version generation failed warnings sparc64: Fix number of online CPUs sparc64: Fix prototype warning for sched_clock sparc64: Fix prototype warnings in adi_64.c sparc64: Fix prototype warning for dma_4v_iotsb_bind sparc64: Fix prototype warning for uprobe_trap sparc64: Fix prototype warning for alloc_irqstack_bootmem sparc64: Fix prototype warning for vmemmap_free sparc64: Fix prototype warnings in traps_64.c sparc64: Fix prototype warning for init_vdso_image sparc: move struct termio to asm/termios.h
2024-05-23Merge tag 'arm64-fixes' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux Pull arm64 fixes from Will Deacon: "The major fix here is for a filesystem corruption issue reported on Apple M1 as a result of buggy management of the floating point register state introduced in 6.8. I initially reverted one of the offending patches, but in the end Ard cooked a proper fix so there's a revert+reapply in the series. Aside from that, we've got some CPU errata workarounds and misc other fixes. - Fix broken FP register state tracking which resulted in filesystem corruption when dm-crypt is used - Workarounds for Arm CPU errata affecting the SSBS Spectre mitigation - Fix lockdep assertion in DMC620 memory controller PMU driver - Fix alignment of BUG table when CONFIG_DEBUG_BUGVERBOSE is disabled" * tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: arm64/fpsimd: Avoid erroneous elide of user state reload Reapply "arm64: fpsimd: Implement lazy restore for kernel mode FPSIMD" arm64: asm-bug: Add .align 2 to the end of __BUG_ENTRY perf/arm-dmc620: Fix lockdep assert in ->event_init() Revert "arm64: fpsimd: Implement lazy restore for kernel mode FPSIMD" arm64: errata: Add workaround for Arm errata 3194386 and 3312417 arm64: cputype: Add Neoverse-V3 definitions arm64: cputype: Add Cortex-X4 definitions arm64: barrier: Restore spec_bar() macro
2024-05-23Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhostLinus Torvalds
Pull virtio updates from Michael Tsirkin: "Several new features here: - virtio-net is finally supported in vduse - virtio (balloon and mem) interaction with suspend is improved - vhost-scsi now handles signals better/faster And fixes, cleanups all over the place" * tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: (48 commits) virtio-pci: Check if is_avq is NULL virtio: delete vq in vp_find_vqs_msix() when request_irq() fails MAINTAINERS: add Eugenio Pérez as reviewer vhost-vdpa: Remove usage of the deprecated ida_simple_xx() API vp_vdpa: don't allocate unused msix vectors sound: virtio: drop owner assignment fuse: virtio: drop owner assignment scsi: virtio: drop owner assignment rpmsg: virtio: drop owner assignment nvdimm: virtio_pmem: drop owner assignment wifi: mac80211_hwsim: drop owner assignment vsock/virtio: drop owner assignment net: 9p: virtio: drop owner assignment net: virtio: drop owner assignment net: caif: virtio: drop owner assignment misc: nsm: drop owner assignment iommu: virtio: drop owner assignment drm/virtio: drop owner assignment gpio: virtio: drop owner assignment firmware: arm_scmi: virtio: drop owner assignment ...
2024-05-23tools/latency-collector: Fix -Wformat-security compile warnsShuah Khan
Fix the following -Wformat-security compile warnings adding missing format arguments: latency-collector.c: In function ‘show_available’: latency-collector.c:938:17: warning: format not a string literal and no format arguments [-Wformat-security] 938 | warnx(no_tracer_msg); | ^~~~~ latency-collector.c:943:17: warning: format not a string literal and no format arguments [-Wformat-security] 943 | warnx(no_latency_tr_msg); | ^~~~~ latency-collector.c: In function ‘find_default_tracer’: latency-collector.c:986:25: warning: format not a string literal and no format arguments [-Wformat-security] 986 | errx(EXIT_FAILURE, no_tracer_msg); | ^~~~ latency-collector.c: In function ‘scan_arguments’: latency-collector.c:1881:33: warning: format not a string literal and no format arguments [-Wformat-security] 1881 | errx(EXIT_FAILURE, no_tracer_msg); | ^~~~ Link: https://lore.kernel.org/linux-trace-kernel/20240404011009.32945-1-skhan@linuxfoundation.org Cc: stable@vger.kernel.org Fixes: e23db805da2df ("tracing/tools: Add the latency-collector to tools directory") Signed-off-by: Shuah Khan <skhan@linuxfoundation.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-05-23r8169: Fix possible ring buffer corruption on fragmented Tx packets.Ken Milmore
An issue was found on the RTL8125b when transmitting small fragmented packets, whereby invalid entries were inserted into the transmit ring buffer, subsequently leading to calls to dma_unmap_single() with a null address. This was caused by rtl8169_start_xmit() not noticing changes to nr_frags which may occur when small packets are padded (to work around hardware quirks) in rtl8169_tso_csum_v2(). To fix this, postpone inspecting nr_frags until after any padding has been applied. Fixes: 9020845fb5d6 ("r8169: improve rtl8169_start_xmit") Cc: stable@vger.kernel.org Signed-off-by: Ken Milmore <ken.milmore@gmail.com> Reviewed-by: Heiner Kallweit <hkallweit1@gmail.com> Link: https://lore.kernel.org/r/27ead18b-c23d-4f49-a020-1fc482c5ac95@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-05-23Merge branch 'intel-interpret-set_channels-input-differently'Paolo Abeni
Jacob Keller says: ==================== intel: Interpret .set_channels() input differently The ice and idpf drivers can trigger a crash with AF_XDP due to incorrect interpretation of the asymmetric Tx and Rx parameters in their .set_channels() implementations: 1. ethtool -l <IFNAME> -> combined: 40 2. Attach AF_XDP to queue 30 3. ethtool -L <IFNAME> rx 15 tx 15 combined number is not specified, so command becomes {rx_count = 15, tx_count = 15, combined_count = 40}. 4. ethnl_set_channels checks, if there are any AF_XDP of queues from the new (combined_count + rx_count) to the old one, so from 55 to 40, check does not trigger. 5. the driver interprets `rx 15 tx 15` as 15 combined channels and deletes the queue that AF_XDP is attached to. This is fundamentally a problem with interpreting a request for asymmetric queues as symmetric combined queues. Fix the ice and idpf drivers to stop interpreting such requests as a request for combined queues. Due to current driver design for both ice and idpf, it is not possible to support requests of the same count of Tx and Rx queues with independent interrupts, (i.e. ethtool -L <IFNAME> rx 15 tx 15) so such requests are now rejected. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> ==================== Link: https://lore.kernel.org/r/20240521-iwl-net-2024-05-14-set-channels-fixes-v2-0-7aa39e2e99f1@intel.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-05-23idpf: Interpret .set_channels() input differentlyLarysa Zaremba
Unlike ice, idpf does not check, if user has requested at least 1 combined channel. Instead, it relies on a check in the core code. Unfortunately, the check does not trigger for us because of the hacky .set_channels() interpretation logic that is not consistent with the core code. This naturally leads to user being able to trigger a crash with an invalid input. This is how: 1. ethtool -l <IFNAME> -> combined: 40 2. ethtool -L <IFNAME> rx 0 tx 0 combined number is not specified, so command becomes {rx_count = 0, tx_count = 0, combined_count = 40}. 3. ethnl_set_channels checks, if there is at least 1 RX and 1 TX channel, comparing (combined_count + rx_count) and (combined_count + tx_count) to zero. Obviously, (40 + 0) is greater than zero, so the core code deems the input OK. 4. idpf interprets `rx 0 tx 0` as 0 channels and tries to proceed with such configuration. The issue has to be solved fundamentally, as current logic is also known to cause AF_XDP problems in ice [0]. Interpret the command in a way that is more consistent with ethtool manual [1] (--show-channels and --set-channels) and new ice logic. Considering that in the idpf driver only the difference between RX and TX queues forms dedicated channels, change the correct way to set number of channels to: ethtool -L <IFNAME> combined 10 /* For symmetric queues */ ethtool -L <IFNAME> combined 8 tx 2 rx 0 /* For asymmetric queues */ [0] https://lore.kernel.org/netdev/20240418095857.2827-1-larysa.zaremba@intel.com/ [1] https://man7.org/linux/man-pages/man8/ethtool.8.html Fixes: 02cbfba1add5 ("idpf: add ethtool callbacks") Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Reviewed-by: Igor Bagnucki <igor.bagnucki@intel.com> Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com> Tested-by: Krishneil Singh <krishneil.k.singh@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-05-23ice: Interpret .set_channels() input differentlyLarysa Zaremba
A bug occurs because a safety check guarding AF_XDP-related queues in ethnl_set_channels(), does not trigger. This happens, because kernel and ice driver interpret the ethtool command differently. How the bug occurs: 1. ethtool -l <IFNAME> -> combined: 40 2. Attach AF_XDP to queue 30 3. ethtool -L <IFNAME> rx 15 tx 15 combined number is not specified, so command becomes {rx_count = 15, tx_count = 15, combined_count = 40}. 4. ethnl_set_channels checks, if there are any AF_XDP of queues from the new (combined_count + rx_count) to the old one, so from 55 to 40, check does not trigger. 5. ice interprets `rx 15 tx 15` as 15 combined channels and deletes the queue that AF_XDP is attached to. Interpret the command in a way that is more consistent with ethtool manual [0] (--show-channels and --set-channels). Considering that in the ice driver only the difference between RX and TX queues forms dedicated channels, change the correct way to set number of channels to: ethtool -L <IFNAME> combined 10 /* For symmetric queues */ ethtool -L <IFNAME> combined 8 tx 2 rx 0 /* For asymmetric queues */ [0] https://man7.org/linux/man-pages/man8/ethtool.8.html Fixes: 87324e747fde ("ice: Implement ethtool ops for channels") Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com> Tested-by: Chandan Kumar Rout <chandanx.rout@intel.com> Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-05-23nfc: nci: Fix handling of zero-length payload packets in nci_rx_work()Ryosuke Yasuoka
When nci_rx_work() receives a zero-length payload packet, it should not discard the packet and exit the loop. Instead, it should continue processing subsequent packets. Fixes: d24b03535e5e ("nfc: nci: Fix uninit-value in nci_dev_up and nci_ntf_packet") Signed-off-by: Ryosuke Yasuoka <ryasuoka@redhat.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Link: https://lore.kernel.org/r/20240521153444.535399-1-ryasuoka@redhat.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-05-23net: relax socket state check at accept time.Paolo Abeni
Christoph reported the following splat: WARNING: CPU: 1 PID: 772 at net/ipv4/af_inet.c:761 __inet_accept+0x1f4/0x4a0 Modules linked in: CPU: 1 PID: 772 Comm: syz-executor510 Not tainted 6.9.0-rc7-g7da7119fe22b #56 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.el7 04/01/2014 RIP: 0010:__inet_accept+0x1f4/0x4a0 net/ipv4/af_inet.c:759 Code: 04 38 84 c0 0f 85 87 00 00 00 41 c7 04 24 03 00 00 00 48 83 c4 10 5b 41 5c 41 5d 41 5e 41 5f 5d c3 cc cc cc cc e8 ec b7 da fd <0f> 0b e9 7f fe ff ff e8 e0 b7 da fd 0f 0b e9 fe fe ff ff 89 d9 80 RSP: 0018:ffffc90000c2fc58 EFLAGS: 00010293 RAX: ffffffff836bdd14 RBX: 0000000000000000 RCX: ffff888104668000 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000 RBP: dffffc0000000000 R08: ffffffff836bdb89 R09: fffff52000185f64 R10: dffffc0000000000 R11: fffff52000185f64 R12: dffffc0000000000 R13: 1ffff92000185f98 R14: ffff88810754d880 R15: ffff8881007b7800 FS: 000000001c772880(0000) GS:ffff88811b280000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fb9fcf2e178 CR3: 00000001045d2002 CR4: 0000000000770ef0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 PKRU: 55555554 Call Trace: <TASK> inet_accept+0x138/0x1d0 net/ipv4/af_inet.c:786 do_accept+0x435/0x620 net/socket.c:1929 __sys_accept4_file net/socket.c:1969 [inline] __sys_accept4+0x9b/0x110 net/socket.c:1999 __do_sys_accept net/socket.c:2016 [inline] __se_sys_accept net/socket.c:2013 [inline] __x64_sys_accept+0x7d/0x90 net/socket.c:2013 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0x58/0x100 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x76/0x7e RIP: 0033:0x4315f9 Code: fd ff 48 81 c4 80 00 00 00 e9 f1 fe ff ff 0f 1f 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 ab b4 fd ff c3 66 2e 0f 1f 84 00 00 00 00 RSP: 002b:00007ffdb26d9c78 EFLAGS: 00000246 ORIG_RAX: 000000000000002b RAX: ffffffffffffffda RBX: 0000000000400300 RCX: 00000000004315f9 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000004 RBP: 00000000006e1018 R08: 0000000000400300 R09: 0000000000400300 R10: 0000000000400300 R11: 0000000000000246 R12: 0000000000000000 R13: 000000000040cdf0 R14: 000000000040ce80 R15: 0000000000000055 </TASK> The reproducer invokes shutdown() before entering the listener status. After commit 94062790aedb ("tcp: defer shutdown(SEND_SHUTDOWN) for TCP_SYN_RECV sockets"), the above causes the child to reach the accept syscall in FIN_WAIT1 status. Eric noted we can relax the existing assertion in __inet_accept() Reported-by: Christoph Paasch <cpaasch@apple.com> Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/490 Suggested-by: Eric Dumazet <edumazet@google.com> Fixes: 94062790aedb ("tcp: defer shutdown(SEND_SHUTDOWN) for TCP_SYN_RECV sockets") Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/23ab880a44d8cfd967e84de8b93dbf48848e3d8c.1716299669.git.pabeni@redhat.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-05-23tcp: remove 64 KByte limit for initial tp->rcv_wnd valueJason Xing
Recently, we had some servers upgraded to the latest kernel and noticed the indicator from the user side showed worse results than before. It is caused by the limitation of tp->rcv_wnd. In 2018 commit a337531b942b ("tcp: up initial rmem to 128KB and SYN rwin to around 64KB") limited the initial value of tp->rcv_wnd to 65535, most CDN teams would not benefit from this change because they cannot have a large window to receive a big packet, which will be slowed down especially in long RTT. Small rcv_wnd means slow transfer speed, to some extent. It's the side effect for the latency/time-sensitive users. To avoid future confusion, current change doesn't affect the initial receive window on the wire in a SYN or SYN+ACK packet which are set within 65535 bytes according to RFC 7323 also due to the limit in __tcp_transmit_skb(): th->window = htons(min(tp->rcv_wnd, 65535U)); In one word, __tcp_transmit_skb() already ensures that constraint is respected, no matter how large tp->rcv_wnd is. The change doesn't violate RFC. Let me provide one example if with or without the patch: Before: client --- SYN: rwindow=65535 ---> server client <--- SYN+ACK: rwindow=65535 ---- server client --- ACK: rwindow=65536 ---> server Note: for the last ACK, the calculation is 512 << 7. After: client --- SYN: rwindow=65535 ---> server client <--- SYN+ACK: rwindow=65535 ---- server client --- ACK: rwindow=175232 ---> server Note: I use the following command to make it work: ip route change default via [ip] dev eth0 metric 100 initrwnd 120 For the last ACK, the calculation is 1369 << 7. When we apply such a patch, having a large rcv_wnd if the user tweak this knob can help transfer data more rapidly and save some rtts. Fixes: a337531b942b ("tcp: up initial rmem to 128KB and SYN rwin to around 64KB") Signed-off-by: Jason Xing <kernelxing@tencent.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Link: https://lore.kernel.org/r/20240521134220.12510-1-kerneljasonxing@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>