summaryrefslogtreecommitdiff
path: root/net/ipv4
AgeCommit message (Collapse)Author
2019-06-21Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds
Pull networking fixes from David Miller: 1) Fix leak of unqueued fragments in ipv6 nf_defrag, from Guillaume Nault. 2) Don't access the DDM interface unless the transceiver implements it in bnx2x, from Mauro S. M. Rodrigues. 3) Don't double fetch 'len' from userspace in sock_getsockopt(), from JingYi Hou. 4) Sign extension overflow in lio_core, from Colin Ian King. 5) Various netem bug fixes wrt. corrupted packets from Jakub Kicinski. 6) Fix epollout hang in hvsock, from Sunil Muthuswamy. 7) Fix regression in default fib6_type, from David Ahern. 8) Handle memory limits in tcp_fragment more appropriately, from Eric Dumazet. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (24 commits) tcp: refine memory limit test in tcp_fragment() inet: clear num_timeout reqsk_alloc() net: mvpp2: debugfs: Add pmap to fs dump ipv6: Default fib6_type to RTN_UNICAST when not set net: hns3: Fix inconsistent indenting net/af_iucv: always register net_device notifier net/af_iucv: build proper skbs for HiperTransport net/af_iucv: remove GFP_DMA restriction for HiperTransport net: dsa: mv88e6xxx: fix shift of FID bits in mv88e6185_g1_vtu_loadpurge() hvsock: fix epollout hang from race condition net/udp_gso: Allow TX timestamp with UDP GSO net: netem: fix use after free and double free with packet corruption net: netem: fix backlog accounting for corrupted GSO frames net: lio_core: fix potential sign-extension overflow on large shift tipc: pass tunnel dev as NULL to udp_tunnel(6)_xmit_skb ip6_tunnel: allow not to count pkts on tstats by passing dev as NULL ip_tunnel: allow not to count pkts on tstats by setting skb's dev to NULL tun: wake up waitqueues after IFF_UP is set net: remove duplicate fetch in sock_getsockopt tipc: fix issues with early FAILOVER_MSG from peer ...
2019-06-21tcp: refine memory limit test in tcp_fragment()Eric Dumazet
tcp_fragment() might be called for skbs in the write queue. Memory limits might have been exceeded because tcp_sendmsg() only checks limits at full skb (64KB) boundaries. Therefore, we need to make sure tcp_fragment() wont punish applications that might have setup very low SO_SNDBUF values. Fixes: f070ef2ac667 ("tcp: tcp_fragment() should apply sane memory limits") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Christoph Paasch <cpaasch@apple.com> Tested-by: Christoph Paasch <cpaasch@apple.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-21Merge tag 'spdx-5.2-rc6' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/spdx Pull still more SPDX updates from Greg KH: "Another round of SPDX updates for 5.2-rc6 Here is what I am guessing is going to be the last "big" SPDX update for 5.2. It contains all of the remaining GPLv2 and GPLv2+ updates that were "easy" to determine by pattern matching. The ones after this are going to be a bit more difficult and the people on the spdx list will be discussing them on a case-by-case basis now. Another 5000+ files are fixed up, so our overall totals are: Files checked: 64545 Files with SPDX: 45529 Compared to the 5.1 kernel which was: Files checked: 63848 Files with SPDX: 22576 This is a huge improvement. Also, we deleted another 20000 lines of boilerplate license crud, always nice to see in a diffstat" * tag 'spdx-5.2-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/spdx: (65 commits) treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 507 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 506 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 505 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 504 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 503 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 502 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 501 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 500 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 499 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 498 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 497 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 496 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 495 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 491 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 490 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 489 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 488 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 487 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 486 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 485 ...
2019-06-21netfilter: nf_tables: add support for matching IPv4 optionsStephen Suryaputra
This is the kernel change for the overall changes with this description: Add capability to have rules matching IPv4 options. This is developed mainly to support dropping of IP packets with loose and/or strict source route route options. Signed-off-by: Stephen Suryaputra <ssuryaextr@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-06-21netfilter: synproxy: fix manual bump of the reference counterFernando Fernandez Mancera
This operation is handled by nf_synproxy_ipv4_init() now. Fixes: d7f9b2f18eae ("netfilter: synproxy: extract SYNPROXY infrastructure from {ipt, ip6t}_SYNPROXY") Signed-off-by: Fernando Fernandez Mancera <ffmancera@riseup.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-06-19inet: clear num_timeout reqsk_alloc()Eric Dumazet
KMSAN caught uninit-value in tcp_create_openreq_child() [1] This is caused by a recent change, combined by the fact that TCP cleared num_timeout, num_retrans and sk fields only when a request socket was about to be queued. Under syncookie mode, a temporary request socket is used, and req->num_timeout could contain garbage. Lets clear these three fields sooner, there is really no point trying to defer this and risk other bugs. [1] BUG: KMSAN: uninit-value in tcp_create_openreq_child+0x157f/0x1cc0 net/ipv4/tcp_minisocks.c:526 CPU: 1 PID: 13357 Comm: syz-executor591 Not tainted 5.2.0-rc4+ #3 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: <IRQ> __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x191/0x1f0 lib/dump_stack.c:113 kmsan_report+0x162/0x2d0 mm/kmsan/kmsan.c:611 __msan_warning+0x75/0xe0 mm/kmsan/kmsan_instr.c:304 tcp_create_openreq_child+0x157f/0x1cc0 net/ipv4/tcp_minisocks.c:526 tcp_v6_syn_recv_sock+0x761/0x2d80 net/ipv6/tcp_ipv6.c:1152 tcp_get_cookie_sock+0x16e/0x6b0 net/ipv4/syncookies.c:209 cookie_v6_check+0x27e0/0x29a0 net/ipv6/syncookies.c:252 tcp_v6_cookie_check net/ipv6/tcp_ipv6.c:1039 [inline] tcp_v6_do_rcv+0xf1c/0x1ce0 net/ipv6/tcp_ipv6.c:1344 tcp_v6_rcv+0x60b7/0x6a30 net/ipv6/tcp_ipv6.c:1554 ip6_protocol_deliver_rcu+0x1433/0x22f0 net/ipv6/ip6_input.c:397 ip6_input_finish net/ipv6/ip6_input.c:438 [inline] NF_HOOK include/linux/netfilter.h:305 [inline] ip6_input+0x2af/0x340 net/ipv6/ip6_input.c:447 dst_input include/net/dst.h:439 [inline] ip6_rcv_finish net/ipv6/ip6_input.c:76 [inline] NF_HOOK include/linux/netfilter.h:305 [inline] ipv6_rcv+0x683/0x710 net/ipv6/ip6_input.c:272 __netif_receive_skb_one_core net/core/dev.c:4981 [inline] __netif_receive_skb net/core/dev.c:5095 [inline] process_backlog+0x721/0x1410 net/core/dev.c:5906 napi_poll net/core/dev.c:6329 [inline] net_rx_action+0x738/0x1940 net/core/dev.c:6395 __do_softirq+0x4ad/0x858 kernel/softirq.c:293 do_softirq_own_stack+0x49/0x80 arch/x86/entry/entry_64.S:1052 </IRQ> do_softirq kernel/softirq.c:338 [inline] __local_bh_enable_ip+0x199/0x1e0 kernel/softirq.c:190 local_bh_enable+0x36/0x40 include/linux/bottom_half.h:32 rcu_read_unlock_bh include/linux/rcupdate.h:682 [inline] ip6_finish_output2+0x213f/0x2670 net/ipv6/ip6_output.c:117 ip6_finish_output+0xae4/0xbc0 net/ipv6/ip6_output.c:150 NF_HOOK_COND include/linux/netfilter.h:294 [inline] ip6_output+0x5d3/0x720 net/ipv6/ip6_output.c:167 dst_output include/net/dst.h:433 [inline] NF_HOOK include/linux/netfilter.h:305 [inline] ip6_xmit+0x1f53/0x2650 net/ipv6/ip6_output.c:271 inet6_csk_xmit+0x3df/0x4f0 net/ipv6/inet6_connection_sock.c:135 __tcp_transmit_skb+0x4076/0x5b40 net/ipv4/tcp_output.c:1156 tcp_transmit_skb net/ipv4/tcp_output.c:1172 [inline] tcp_write_xmit+0x39a9/0xa730 net/ipv4/tcp_output.c:2397 __tcp_push_pending_frames+0x124/0x4e0 net/ipv4/tcp_output.c:2573 tcp_send_fin+0xd43/0x1540 net/ipv4/tcp_output.c:3118 tcp_close+0x16ba/0x1860 net/ipv4/tcp.c:2403 inet_release+0x1f7/0x270 net/ipv4/af_inet.c:427 inet6_release+0xaf/0x100 net/ipv6/af_inet6.c:470 __sock_release net/socket.c:601 [inline] sock_close+0x156/0x490 net/socket.c:1273 __fput+0x4c9/0xba0 fs/file_table.c:280 ____fput+0x37/0x40 fs/file_table.c:313 task_work_run+0x22e/0x2a0 kernel/task_work.c:113 tracehook_notify_resume include/linux/tracehook.h:185 [inline] exit_to_usermode_loop arch/x86/entry/common.c:168 [inline] prepare_exit_to_usermode+0x39d/0x4d0 arch/x86/entry/common.c:199 syscall_return_slowpath+0x90/0x5c0 arch/x86/entry/common.c:279 do_syscall_64+0xe2/0xf0 arch/x86/entry/common.c:305 entry_SYSCALL_64_after_hwframe+0x63/0xe7 RIP: 0033:0x401d50 Code: 01 f0 ff ff 0f 83 40 0d 00 00 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 83 3d dd 8d 2d 00 00 75 14 b8 03 00 00 00 0f 05 <48> 3d 01 f0 ff ff 0f 83 14 0d 00 00 c3 48 83 ec 08 e8 7a 02 00 00 RSP: 002b:00007fff1cf58cf8 EFLAGS: 00000246 ORIG_RAX: 0000000000000003 RAX: 0000000000000000 RBX: 0000000000000004 RCX: 0000000000401d50 RDX: 000000000000001c RSI: 0000000000000000 RDI: 0000000000000003 RBP: 00000000004a9050 R08: 0000000020000040 R09: 000000000000001c R10: 0000000020004004 R11: 0000000000000246 R12: 0000000000402ef0 R13: 0000000000402f80 R14: 0000000000000000 R15: 0000000000000000 Uninit was created at: kmsan_save_stack_with_flags mm/kmsan/kmsan.c:201 [inline] kmsan_internal_poison_shadow+0x53/0xa0 mm/kmsan/kmsan.c:160 kmsan_kmalloc+0xa4/0x130 mm/kmsan/kmsan_hooks.c:177 kmem_cache_alloc+0x534/0xb00 mm/slub.c:2781 reqsk_alloc include/net/request_sock.h:84 [inline] inet_reqsk_alloc+0xa8/0x600 net/ipv4/tcp_input.c:6384 cookie_v6_check+0xadb/0x29a0 net/ipv6/syncookies.c:173 tcp_v6_cookie_check net/ipv6/tcp_ipv6.c:1039 [inline] tcp_v6_do_rcv+0xf1c/0x1ce0 net/ipv6/tcp_ipv6.c:1344 tcp_v6_rcv+0x60b7/0x6a30 net/ipv6/tcp_ipv6.c:1554 ip6_protocol_deliver_rcu+0x1433/0x22f0 net/ipv6/ip6_input.c:397 ip6_input_finish net/ipv6/ip6_input.c:438 [inline] NF_HOOK include/linux/netfilter.h:305 [inline] ip6_input+0x2af/0x340 net/ipv6/ip6_input.c:447 dst_input include/net/dst.h:439 [inline] ip6_rcv_finish net/ipv6/ip6_input.c:76 [inline] NF_HOOK include/linux/netfilter.h:305 [inline] ipv6_rcv+0x683/0x710 net/ipv6/ip6_input.c:272 __netif_receive_skb_one_core net/core/dev.c:4981 [inline] __netif_receive_skb net/core/dev.c:5095 [inline] process_backlog+0x721/0x1410 net/core/dev.c:5906 napi_poll net/core/dev.c:6329 [inline] net_rx_action+0x738/0x1940 net/core/dev.c:6395 __do_softirq+0x4ad/0x858 kernel/softirq.c:293 do_softirq_own_stack+0x49/0x80 arch/x86/entry/entry_64.S:1052 do_softirq kernel/softirq.c:338 [inline] __local_bh_enable_ip+0x199/0x1e0 kernel/softirq.c:190 local_bh_enable+0x36/0x40 include/linux/bottom_half.h:32 rcu_read_unlock_bh include/linux/rcupdate.h:682 [inline] ip6_finish_output2+0x213f/0x2670 net/ipv6/ip6_output.c:117 ip6_finish_output+0xae4/0xbc0 net/ipv6/ip6_output.c:150 NF_HOOK_COND include/linux/netfilter.h:294 [inline] ip6_output+0x5d3/0x720 net/ipv6/ip6_output.c:167 dst_output include/net/dst.h:433 [inline] NF_HOOK include/linux/netfilter.h:305 [inline] ip6_xmit+0x1f53/0x2650 net/ipv6/ip6_output.c:271 inet6_csk_xmit+0x3df/0x4f0 net/ipv6/inet6_connection_sock.c:135 __tcp_transmit_skb+0x4076/0x5b40 net/ipv4/tcp_output.c:1156 tcp_transmit_skb net/ipv4/tcp_output.c:1172 [inline] tcp_write_xmit+0x39a9/0xa730 net/ipv4/tcp_output.c:2397 __tcp_push_pending_frames+0x124/0x4e0 net/ipv4/tcp_output.c:2573 tcp_send_fin+0xd43/0x1540 net/ipv4/tcp_output.c:3118 tcp_close+0x16ba/0x1860 net/ipv4/tcp.c:2403 inet_release+0x1f7/0x270 net/ipv4/af_inet.c:427 inet6_release+0xaf/0x100 net/ipv6/af_inet6.c:470 __sock_release net/socket.c:601 [inline] sock_close+0x156/0x490 net/socket.c:1273 __fput+0x4c9/0xba0 fs/file_table.c:280 ____fput+0x37/0x40 fs/file_table.c:313 task_work_run+0x22e/0x2a0 kernel/task_work.c:113 tracehook_notify_resume include/linux/tracehook.h:185 [inline] exit_to_usermode_loop arch/x86/entry/common.c:168 [inline] prepare_exit_to_usermode+0x39d/0x4d0 arch/x86/entry/common.c:199 syscall_return_slowpath+0x90/0x5c0 arch/x86/entry/common.c:279 do_syscall_64+0xe2/0xf0 arch/x86/entry/common.c:305 entry_SYSCALL_64_after_hwframe+0x63/0xe7 Fixes: 336c39a03151 ("tcp: undo init congestion window on false SYNACK timeout") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Soheil Hassas Yeganeh <soheil@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-19net/ipv4: fib_trie: Avoid cryptic ternary expressionsMatthias Kaehlcke
empty_child_inc/dec() use the ternary operator for conditional operations. The conditions involve the post/pre in/decrement operator and the operation is only performed when the condition is *not* true. This is hard to parse for humans, use a regular 'if' construct instead and perform the in/decrement separately. This also fixes two warnings that are emitted about the value of the ternary expression being unused, when building the kernel with clang + "kbuild: Remove unnecessary -Wno-unused-value" (https://lore.kernel.org/patchwork/patch/1089869/): CC net/ipv4/fib_trie.o net/ipv4/fib_trie.c:351:2: error: expression result unused [-Werror,-Wunused-value] ++tn_info(n)->empty_children ? : ++tn_info(n)->full_children; Fixes: 95f60ea3e99a ("fib_trie: Add collapse() and should_collapse() to resize") Signed-off-by: Matthias Kaehlcke <mka@chromium.org> Reviewed-by: Douglas Anderson <dianders@chromium.org> Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> Acked-by: Alexander Duyck <alexander.h.duyck@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-19inet: fix various use-after-free in defrags unitsEric Dumazet
syzbot reported another issue caused by my recent patches. [1] The issue here is that fqdir_exit() is initiating a work queue and immediately returns. A bit later cleanup_net() was able to free the MIB (percpu data) and the whole struct net was freed, but we had active frag timers that fired and triggered use-after-free. We need to make sure that timers can catch fqdir->dead being set, to bailout. Since RCU is used for the reader side, this means we want to respect an RCU grace period between these operations : 1) qfdir->dead = 1; 2) netns dismantle (freeing of various data structure) This patch uses new new (struct pernet_operations)->pre_exit infrastructure to ensures a full RCU grace period happens between fqdir_pre_exit() and fqdir_exit() This also means we can use a regular work queue, we no longer need rcu_work. Tested: $ time for i in {1..1000}; do unshare -n /bin/false;done real 0m2.585s user 0m0.160s sys 0m2.214s [1] BUG: KASAN: use-after-free in ip_expire+0x73e/0x800 net/ipv4/ip_fragment.c:152 Read of size 8 at addr ffff88808b9fe330 by task syz-executor.4/11860 CPU: 1 PID: 11860 Comm: syz-executor.4 Not tainted 5.2.0-rc2+ #22 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: <IRQ> __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x172/0x1f0 lib/dump_stack.c:113 print_address_description.cold+0x7c/0x20d mm/kasan/report.c:188 __kasan_report.cold+0x1b/0x40 mm/kasan/report.c:317 kasan_report+0x12/0x20 mm/kasan/common.c:614 __asan_report_load8_noabort+0x14/0x20 mm/kasan/generic_report.c:132 ip_expire+0x73e/0x800 net/ipv4/ip_fragment.c:152 call_timer_fn+0x193/0x720 kernel/time/timer.c:1322 expire_timers kernel/time/timer.c:1366 [inline] __run_timers kernel/time/timer.c:1685 [inline] __run_timers kernel/time/timer.c:1653 [inline] run_timer_softirq+0x66f/0x1740 kernel/time/timer.c:1698 __do_softirq+0x25c/0x94c kernel/softirq.c:293 invoke_softirq kernel/softirq.c:374 [inline] irq_exit+0x180/0x1d0 kernel/softirq.c:414 exiting_irq arch/x86/include/asm/apic.h:536 [inline] smp_apic_timer_interrupt+0x13b/0x550 arch/x86/kernel/apic/apic.c:1068 apic_timer_interrupt+0xf/0x20 arch/x86/entry/entry_64.S:806 </IRQ> RIP: 0010:tomoyo_domain_quota_is_ok+0x131/0x540 security/tomoyo/util.c:1035 Code: 24 4c 3b 65 d0 0f 84 9c 00 00 00 e8 19 1d 73 fe 49 8d 7c 24 18 48 ba 00 00 00 00 00 fc ff df 48 89 f8 48 c1 e8 03 0f b6 04 10 <48> 89 fa 83 e2 07 38 d0 7f 08 84 c0 0f 85 69 03 00 00 41 0f b6 5c RSP: 0018:ffff88806ae079c0 EFLAGS: 00000a02 ORIG_RAX: ffffffffffffff13 RAX: 0000000000000000 RBX: 0000000000000010 RCX: ffffc9000e655000 RDX: dffffc0000000000 RSI: ffffffff82fd88a7 RDI: ffff888086202398 RBP: ffff88806ae07a00 R08: ffff88808b6c8700 R09: ffffed100d5c0f4d R10: ffffed100d5c0f4c R11: 0000000000000000 R12: ffff888086202380 R13: 0000000000000030 R14: 00000000000000d3 R15: 0000000000000000 tomoyo_supervisor+0x2e8/0xef0 security/tomoyo/common.c:2087 tomoyo_audit_path_number_log security/tomoyo/file.c:235 [inline] tomoyo_path_number_perm+0x42f/0x520 security/tomoyo/file.c:734 tomoyo_file_ioctl+0x23/0x30 security/tomoyo/tomoyo.c:335 security_file_ioctl+0x77/0xc0 security/security.c:1370 ksys_ioctl+0x57/0xd0 fs/ioctl.c:711 __do_sys_ioctl fs/ioctl.c:720 [inline] __se_sys_ioctl fs/ioctl.c:718 [inline] __x64_sys_ioctl+0x73/0xb0 fs/ioctl.c:718 do_syscall_64+0xfd/0x680 arch/x86/entry/common.c:301 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x4592c9 Code: fd b7 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 cb b7 fb ff c3 66 2e 0f 1f 84 00 00 00 00 RSP: 002b:00007f8db5e44c78 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00000000004592c9 RDX: 0000000020000080 RSI: 00000000000089f1 RDI: 0000000000000006 RBP: 000000000075bf20 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 00007f8db5e456d4 R13: 00000000004cc770 R14: 00000000004d5cd8 R15: 00000000ffffffff Allocated by task 9047: save_stack+0x23/0x90 mm/kasan/common.c:71 set_track mm/kasan/common.c:79 [inline] __kasan_kmalloc mm/kasan/common.c:489 [inline] __kasan_kmalloc.constprop.0+0xcf/0xe0 mm/kasan/common.c:462 kasan_slab_alloc+0xf/0x20 mm/kasan/common.c:497 slab_post_alloc_hook mm/slab.h:437 [inline] slab_alloc mm/slab.c:3326 [inline] kmem_cache_alloc+0x11a/0x6f0 mm/slab.c:3488 kmem_cache_zalloc include/linux/slab.h:732 [inline] net_alloc net/core/net_namespace.c:386 [inline] copy_net_ns+0xed/0x340 net/core/net_namespace.c:426 create_new_namespaces+0x400/0x7b0 kernel/nsproxy.c:107 unshare_nsproxy_namespaces+0xc2/0x200 kernel/nsproxy.c:206 ksys_unshare+0x440/0x980 kernel/fork.c:2692 __do_sys_unshare kernel/fork.c:2760 [inline] __se_sys_unshare kernel/fork.c:2758 [inline] __x64_sys_unshare+0x31/0x40 kernel/fork.c:2758 do_syscall_64+0xfd/0x680 arch/x86/entry/common.c:301 entry_SYSCALL_64_after_hwframe+0x49/0xbe Freed by task 2541: save_stack+0x23/0x90 mm/kasan/common.c:71 set_track mm/kasan/common.c:79 [inline] __kasan_slab_free+0x102/0x150 mm/kasan/common.c:451 kasan_slab_free+0xe/0x10 mm/kasan/common.c:459 __cache_free mm/slab.c:3432 [inline] kmem_cache_free+0x86/0x260 mm/slab.c:3698 net_free net/core/net_namespace.c:402 [inline] net_drop_ns.part.0+0x70/0x90 net/core/net_namespace.c:409 net_drop_ns net/core/net_namespace.c:408 [inline] cleanup_net+0x538/0x960 net/core/net_namespace.c:571 process_one_work+0x989/0x1790 kernel/workqueue.c:2269 worker_thread+0x98/0xe40 kernel/workqueue.c:2415 kthread+0x354/0x420 kernel/kthread.c:255 ret_from_fork+0x24/0x30 arch/x86/entry/entry_64.S:352 The buggy address belongs to the object at ffff88808b9fe100 which belongs to the cache net_namespace of size 6784 The buggy address is located 560 bytes inside of 6784-byte region [ffff88808b9fe100, ffff88808b9ffb80) The buggy address belongs to the page: page:ffffea00022e7f80 refcount:1 mapcount:0 mapping:ffff88821b6f60c0 index:0x0 compound_mapcount: 0 flags: 0x1fffc0000010200(slab|head) raw: 01fffc0000010200 ffffea000256f288 ffffea0001bbef08 ffff88821b6f60c0 raw: 0000000000000000 ffff88808b9fe100 0000000100000001 0000000000000000 page dumped because: kasan: bad access detected Memory state around the buggy address: ffff88808b9fe200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff88808b9fe280: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb >ffff88808b9fe300: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ^ ffff88808b9fe380: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff88808b9fe400: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb Fixes: 3c8fc8782044 ("inet: frags: rework rhashtable dismantle") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-19treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 500Thomas Gleixner
Based on 2 normalized pattern(s): this program is free software you can redistribute it and or modify it under the terms of the gnu general public license version 2 as published by the free software foundation this program is free software you can redistribute it and or modify it under the terms of the gnu general public license version 2 as published by the free software foundation # extracted by the scancode license scanner the SPDX license identifier GPL-2.0-only has been chosen to replace the boilerplate/reference in 4122 file(s). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Enrico Weigelt <info@metux.net> Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org> Reviewed-by: Allison Randal <allison@lohutok.net> Cc: linux-spdx@vger.kernel.org Link: https://lkml.kernel.org/r/20190604081206.933168790@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-19treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 484Thomas Gleixner
Based on 1 normalized pattern(s): this source code is licensed under general public license version 2 extracted by the scancode license scanner the SPDX license identifier GPL-2.0-only has been chosen to replace the boilerplate/reference in 5 file(s). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Allison Randal <allison@lohutok.net> Reviewed-by: Enrico Weigelt <info@metux.net> Cc: linux-spdx@vger.kernel.org Link: https://lkml.kernel.org/r/20190604081204.871734026@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-18net/udp_gso: Allow TX timestamp with UDP GSOFred Klassen
Fixes an issue where TX Timestamps are not arriving on the error queue when UDP_SEGMENT CMSG type is combined with CMSG type SO_TIMESTAMPING. This can be illustrated with an updated updgso_bench_tx program which includes the '-T' option to test for this condition. It also introduces the '-P' option which will call poll() before reading the error queue. ./udpgso_bench_tx -4ucTPv -S 1472 -l2 -D 172.16.120.18 poll timeout udp tx: 0 MB/s 1 calls/s 1 msg/s The "poll timeout" message above indicates that TX timestamp never arrived. This patch preserves tx_flags for the first UDP GSO segment. Only the first segment is timestamped, even though in some cases there may be benefital in timestamping both the first and last segment. Factors in deciding on first segment timestamp only: - Timestamping both first and last segmented is not feasible. Hardware can only have one outstanding TS request at a time. - Timestamping last segment may under report network latency of the previous segments. Even though the doorbell is suppressed, the ring producer counter has been incremented. - Timestamping the first segment has the upside in that it reports timestamps from the application's view, e.g. RTT. - Timestamping the first segment has the downside that it may underreport tx host network latency. It appears that we have to pick one or the other. And possibly follow-up with a config flag to choose behavior. v2: Remove tests as noted by Willem de Bruijn <willemb@google.com> Moving tests from net to net-next v3: Update only relevant tx_flag bits as per Willem de Bruijn <willemb@google.com> v4: Update comments and commit message as per Willem de Bruijn <willemb@google.com> Fixes: ee80d1ebe5ba ("udp: add udp gso") Signed-off-by: Fred Klassen <fklassen@appneta.com> Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-18ip_tunnel: allow not to count pkts on tstats by setting skb's dev to NULLXin Long
iptunnel_xmit() works as a common function, also used by a udp tunnel which doesn't have to have a tunnel device, like how TIPC works with udp media. In these cases, we should allow not to count pkts on dev's tstats, so that udp tunnel can work with no tunnel device safely. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-17Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Honestly all the conflicts were simple overlapping changes, nothing really interesting to report. Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-17net: ipv4: remove erroneous advancement of list pointerFlorian Westphal
Causes crash when lifetime expires on an adress as garbage is dereferenced soon after. This used to look like this: for (ifap = &ifa->ifa_dev->ifa_list; *ifap != NULL; ifap = &(*ifap)->ifa_next) { if (*ifap == ifa) ... but this was changed to: struct in_ifaddr *tmp; ifap = &ifa->ifa_dev->ifa_list; tmp = rtnl_dereference(*ifap); while (tmp) { tmp = rtnl_dereference(tmp->ifa_next); // Bogus if (rtnl_dereference(*ifap) == ifa) { ... ifap = &tmp->ifa_next; // Can be NULL tmp = rtnl_dereference(*ifap); // Dereference } } Remove the bogus assigment/list entry skip. Fixes: 2638eb8b50cf ("net: ipv4: provide __rcu annotation for ifa_list") Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-17Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds
Pull networking fixes from David Miller: "Lots of bug fixes here: 1) Out of bounds access in __bpf_skc_lookup, from Lorenz Bauer. 2) Fix rate reporting in cfg80211_calculate_bitrate_he(), from John Crispin. 3) Use after free in psock backlog workqueue, from John Fastabend. 4) Fix source port matching in fdb peer flow rule of mlx5, from Raed Salem. 5) Use atomic_inc_not_zero() in fl6_sock_lookup(), from Eric Dumazet. 6) Network header needs to be set for packet redirect in nfp, from John Hurley. 7) Fix udp zerocopy refcnt, from Willem de Bruijn. 8) Don't assume linear buffers in vxlan and geneve error handlers, from Stefano Brivio. 9) Fix TOS matching in mlxsw, from Jiri Pirko. 10) More SCTP cookie memory leak fixes, from Neil Horman. 11) Fix VLAN filtering in rtl8366, from Linus Walluij. 12) Various TCP SACK payload size and fragmentation memory limit fixes from Eric Dumazet. 13) Use after free in pneigh_get_next(), also from Eric Dumazet. 14) LAPB control block leak fix from Jeremy Sowden" * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (145 commits) lapb: fixed leak of control-blocks. tipc: purge deferredq list for each grp member in tipc_group_delete ax25: fix inconsistent lock state in ax25_destroy_timer neigh: fix use-after-free read in pneigh_get_next tcp: fix compile error if !CONFIG_SYSCTL hv_sock: Suppress bogus "may be used uninitialized" warnings be2net: Fix number of Rx queues used for flow hashing net: handle 802.1P vlan 0 packets properly tcp: enforce tcp_min_snd_mss in tcp_mtu_probing() tcp: add tcp_min_snd_mss sysctl tcp: tcp_fragment() should apply sane memory limits tcp: limit payload size of sacked skbs Revert "net: phylink: set the autoneg state in phylink_phy_change" bpf: fix nested bpf tracepoints with per-cpu data bpf: Fix out of bounds memory access in bpf_sk_storage vsock/virtio: set SOCK_DONE on peer shutdown net: dsa: rtl8366: Fix up VLAN filtering net: phylink: set the autoneg state in phylink_phy_change net: add high_order_alloc_disable sysctl/static key tcp: add tcp_tx_skb_cache sysctl ...
2019-06-17net: ipv4: move tcp_fastopen server side code to SipHash libraryArd Biesheuvel
Using a bare block cipher in non-crypto code is almost always a bad idea, not only for security reasons (and we've seen some examples of this in the kernel in the past), but also for performance reasons. In the TCP fastopen case, we call into the bare AES block cipher one or two times (depending on whether the connection is IPv4 or IPv6). On most systems, this results in a call chain such as crypto_cipher_encrypt_one(ctx, dst, src) crypto_cipher_crt(tfm)->cit_encrypt_one(crypto_cipher_tfm(tfm), ...); aesni_encrypt kernel_fpu_begin(); aesni_enc(ctx, dst, src); // asm routine kernel_fpu_end(); It is highly unlikely that the use of special AES instructions has a benefit in this case, especially since we are doing the above twice for IPv6 connections, instead of using a transform which can process the entire input in one go. We could switch to the cbcmac(aes) shash, which would at least get rid of the duplicated overhead in *some* cases (i.e., today, only arm64 has an accelerated implementation of cbcmac(aes), while x86 will end up using the generic cbcmac template wrapping the AES-NI cipher, which basically ends up doing exactly the above). However, in the given context, it makes more sense to use a light-weight MAC algorithm that is more suitable for the purpose at hand, such as SipHash. Since the output size of SipHash already matches our chosen value for TCP_FASTOPEN_COOKIE_SIZE, and given that it accepts arbitrary input sizes, this greatly simplifies the code as well. NOTE: Server farms backing a single server IP for load balancing purposes and sharing a single fastopen key will be adversely affected by this change unless all systems in the pool receive their kernel upgrades at the same time. Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-17Merge branch 'tcp-fixes'David S. Miller
Eric Dumazet says: ==================== tcp: make sack processing more robust Jonathan Looney brought to our attention multiple problems in TCP stack at the sender side. SACK processing can be abused by malicious peers to either cause overflows, or increase of memory usage. First two patches fix the immediate problems. Since the malicious peers abuse senders by advertizing a very small MSS in their SYN or SYNACK packet, the last two patches add a new sysctl so that admins can chose a higher limit for MSS clamping. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-17netfilter: synproxy: extract SYNPROXY infrastructure from {ipt, ip6t}_SYNPROXYFernando Fernandez Mancera
Add common functions into nf_synproxy_core.c to prepare for nftables support. The prototypes of the functions used by {ipt, ip6t}_SYNPROXY are in the new file nf_synproxy.h Signed-off-by: Fernando Fernandez Mancera <ffmancera@riseup.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-06-16tcp: fix compile error if !CONFIG_SYSCTLEric Dumazet
tcp_tx_skb_cache_key and tcp_rx_skb_cache_key must be available even if CONFIG_SYSCTL is not set. Fixes: 0b7d7f6b2208 ("tcp: add tcp_tx_skb_cache sysctl") Fixes: ede61ca474a0 ("tcp: add tcp_rx_skb_cache sysctl") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-15tcp: enforce tcp_min_snd_mss in tcp_mtu_probing()Eric Dumazet
If mtu probing is enabled tcp_mtu_probing() could very well end up with a too small MSS. Use the new sysctl tcp_min_snd_mss to make sure MSS search is performed in an acceptable range. CVE-2019-11479 -- tcp mss hardcoded to 48 Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Jonathan Lemon <jonathan.lemon@gmail.com> Cc: Jonathan Looney <jtl@netflix.com> Acked-by: Neal Cardwell <ncardwell@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Tyler Hicks <tyhicks@canonical.com> Cc: Bruce Curtis <brucec@netflix.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-15tcp: add tcp_min_snd_mss sysctlEric Dumazet
Some TCP peers announce a very small MSS option in their SYN and/or SYN/ACK messages. This forces the stack to send packets with a very high network/cpu overhead. Linux has enforced a minimal value of 48. Since this value includes the size of TCP options, and that the options can consume up to 40 bytes, this means that each segment can include only 8 bytes of payload. In some cases, it can be useful to increase the minimal value to a saner value. We still let the default to 48 (TCP_MIN_SND_MSS), for compatibility reasons. Note that TCP_MAXSEG socket option enforces a minimal value of (TCP_MIN_MSS). David Miller increased this minimal value in commit c39508d6f118 ("tcp: Make TCP_MAXSEG minimum more correct.") from 64 to 88. We might in the future merge TCP_MIN_SND_MSS and TCP_MIN_MSS. CVE-2019-11479 -- tcp mss hardcoded to 48 Signed-off-by: Eric Dumazet <edumazet@google.com> Suggested-by: Jonathan Looney <jtl@netflix.com> Acked-by: Neal Cardwell <ncardwell@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Tyler Hicks <tyhicks@canonical.com> Cc: Bruce Curtis <brucec@netflix.com> Cc: Jonathan Lemon <jonathan.lemon@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-15tcp: tcp_fragment() should apply sane memory limitsEric Dumazet
Jonathan Looney reported that a malicious peer can force a sender to fragment its retransmit queue into tiny skbs, inflating memory usage and/or overflow 32bit counters. TCP allows an application to queue up to sk_sndbuf bytes, so we need to give some allowance for non malicious splitting of retransmit queue. A new SNMP counter is added to monitor how many times TCP did not allow to split an skb if the allowance was exceeded. Note that this counter might increase in the case applications use SO_SNDBUF socket option to lower sk_sndbuf. CVE-2019-11478 : tcp_fragment, prevent fragmenting a packet when the socket is already using more than half the allowed space Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Jonathan Looney <jtl@netflix.com> Acked-by: Neal Cardwell <ncardwell@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Reviewed-by: Tyler Hicks <tyhicks@canonical.com> Cc: Bruce Curtis <brucec@netflix.com> Cc: Jonathan Lemon <jonathan.lemon@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-15tcp: limit payload size of sacked skbsEric Dumazet
Jonathan Looney reported that TCP can trigger the following crash in tcp_shifted_skb() : BUG_ON(tcp_skb_pcount(skb) < pcount); This can happen if the remote peer has advertized the smallest MSS that linux TCP accepts : 48 An skb can hold 17 fragments, and each fragment can hold 32KB on x86, or 64KB on PowerPC. This means that the 16bit witdh of TCP_SKB_CB(skb)->tcp_gso_segs can overflow. Note that tcp_sendmsg() builds skbs with less than 64KB of payload, so this problem needs SACK to be enabled. SACK blocks allow TCP to coalesce multiple skbs in the retransmit queue, thus filling the 17 fragments to maximal capacity. CVE-2019-11477 -- u16 overflow of TCP_SKB_CB(skb)->tcp_gso_segs Fixes: 832d11c5cd07 ("tcp: Try to restore large SKBs while SACK processing") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Jonathan Looney <jtl@netflix.com> Acked-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Tyler Hicks <tyhicks@canonical.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Bruce Curtis <brucec@netflix.com> Cc: Jonathan Lemon <jonathan.lemon@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-14tcp: add tcp_tx_skb_cache sysctlEric Dumazet
Feng Tang reported a performance regression after introduction of per TCP socket tx/rx caches, for TCP over loopback (netperf) There is high chance the regression is caused by a change on how well the 32 KB per-thread page (current->task_frag) can be recycled, and lack of pcp caches for order-3 pages. I could not reproduce the regression myself, cpus all being spinning on the mm spinlocks for page allocs/freeing, regardless of enabling or disabling the per tcp socket caches. It seems best to disable the feature by default, and let admins enabling it. MM layer either needs to provide scalable order-3 pages allocations, or could attempt a trylock on zone->lock if the caller only attempts to get a high-order page and is able to fallback to order-0 ones in case of pressure. Tests run on a 56 cores host (112 hyper threads) - 35.49% netperf [kernel.vmlinux] [k] queued_spin_lock_slowpath - 35.49% queued_spin_lock_slowpath - 18.18% get_page_from_freelist - __alloc_pages_nodemask - 18.18% alloc_pages_current skb_page_frag_refill sk_page_frag_refill tcp_sendmsg_locked tcp_sendmsg inet_sendmsg sock_sendmsg __sys_sendto __x64_sys_sendto do_syscall_64 entry_SYSCALL_64_after_hwframe __libc_send + 17.31% __free_pages_ok + 31.43% swapper [kernel.vmlinux] [k] intel_idle + 9.12% netperf [kernel.vmlinux] [k] copy_user_enhanced_fast_string + 6.53% netserver [kernel.vmlinux] [k] copy_user_enhanced_fast_string + 0.69% netserver [kernel.vmlinux] [k] queued_spin_lock_slowpath + 0.68% netperf [kernel.vmlinux] [k] skb_release_data + 0.52% netperf [kernel.vmlinux] [k] tcp_sendmsg_locked 0.46% netperf [kernel.vmlinux] [k] _raw_spin_lock_irqsave Fixes: 472c2e07eef0 ("tcp: add one skb cache for tx") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Feng Tang <feng.tang@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-14tcp: add tcp_rx_skb_cache sysctlEric Dumazet
Instead of relying on rps_needed, it is safer to use a separate static key, since we do not want to enable TCP rx_skb_cache by default. This feature can cause huge increase of memory usage on hosts with millions of sockets. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-14udp: Remove unused variable/function (exact_dif)Tim Beale
This was originally passed through to the VRF logic in compute_score(). But that logic has now been replaced by udp_sk_bound_dev_eq() and so this code is no longer used or needed. Signed-off-by: Tim Beale <timbeale@catalyst.net.nz> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-14udp: Remove unused parameter (exact_dif)Tim Beale
Originally this was used by the VRF logic in compute_score(), but that was later replaced by udp_sk_bound_dev_eq() and the parameter became unused. Note this change adds an 'unused variable' compiler warning that will be removed in the next patch (I've split the removal in two to make review slightly easier). Signed-off-by: Tim Beale <timbeale@catalyst.net.nz> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-14ipv4: tcp: fix ACK/RST sent with a transmit delayEric Dumazet
If we want to set a EDT time for the skb we want to send via ip_send_unicast_reply(), we have to pass a new parameter and initialize ipc.sockc.transmit_time with it. This fixes the EDT time for ACK/RST packets sent on behalf of a TIME_WAIT socket. Fixes: a842fe1425cb ("tcp: add optional per socket transmit delay") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-14ipv4: Support multipath hashing on inner IP pkts for GRE tunnelStephen Suryaputra
Multipath hash policy value of 0 isn't distributing since the outer IP dest and src aren't varied eventhough the inner ones are. Since the flow is on the inner ones in the case of tunneled traffic, hashing on them is desired. This is done mainly for IP over GRE, hence only tested for that. But anything else supported by flow dissection should work. v2: Use skb_flow_dissect_flow_keys() directly so that other tunneling can be supported through flow dissection (per Nikolay Aleksandrov). v3: Remove accidental inclusion of ports in the hash keys and clarify the documentation (Nikolay Alexandrov). Signed-off-by: Stephen Suryaputra <ssuryaextr@gmail.com> Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-14tcp: use static_branch_deferred_inc for clean_acked_data_enabledWillem de Bruijn
Deferred static key clean_acked_data_enabled uses the deferred variants of dec and flush. Do the same for inc. Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-14docs: kbuild: convert docs to ReST and rename to *.rstMauro Carvalho Chehab
The kbuild documentation clearly shows that the documents there are written at different times: some use markdown, some use their own peculiar logic to split sections. Convert everything to ReST without affecting too much the author's style and avoiding adding uneeded markups. The conversion is actually: - add blank lines and identation in order to identify paragraphs; - fix tables markups; - add some lists markups; - mark literal blocks; - adjust title markups. At its new index.rst, let's add a :orphan: while this is not linked to the main index.rst file, in order to avoid build warnings. Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org> Signed-off-by: Jonathan Corbet <corbet@lwn.net>
2019-06-12tcp: add optional per socket transmit delayEric Dumazet
Adding delays to TCP flows is crucial for studying behavior of TCP stacks, including congestion control modules. Linux offers netem module, but it has unpractical constraints : - Need root access to change qdisc - Hard to setup on egress if combined with non trivial qdisc like FQ - Single delay for all flows. EDT (Earliest Departure Time) adoption in TCP stack allows us to enable a per socket delay at a very small cost. Networking tools can now establish thousands of flows, each of them with a different delay, simulating real world conditions. This requires FQ packet scheduler or a EDT-enabled NIC. This patchs adds TCP_TX_DELAY socket option, to set a delay in usec units. unsigned int tx_delay = 10000; /* 10 msec */ setsockopt(fd, SOL_TCP, TCP_TX_DELAY, &tx_delay, sizeof(tx_delay)); Note that FQ packet scheduler limits might need some tweaking : man tc-fq PARAMETERS limit Hard limit on the real queue size. When this limit is reached, new packets are dropped. If the value is lowered, packets are dropped so that the new limit is met. Default is 10000 packets. flow_limit Hard limit on the maximum number of packets queued per flow. Default value is 100. Use of TCP_TX_DELAY option will increase number of skbs in FQ qdisc, so packets would be dropped if any of the previous limit is hit. Use of a jump label makes this support runtime-free, for hosts never using the option. Also note that TSQ (TCP Small Queues) limits are slightly changed with this patch : we need to account that skbs artificially delayed wont stop us providind more skbs to feed the pipe (netem uses skb_orphan_partial() for this purpose, but FQ can not use this trick) Because of that, using big delays might very well trigger old bugs in TSO auto defer logic and/or sndbuf limited detection. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-11net: correct udp zerocopy refcnt also when zerocopy only on appendWillem de Bruijn
The below patch fixes an incorrect zerocopy refcnt increment when appending with MSG_MORE to an existing zerocopy udp skb. send(.., MSG_ZEROCOPY | MSG_MORE); // refcnt 1 send(.., MSG_ZEROCOPY | MSG_MORE); // refcnt still 1 (bar frags) But it missed that zerocopy need not be passed at the first send. The right test whether the uarg is newly allocated and thus has extra refcnt 1 is not !skb, but !skb_zcopy. send(.., MSG_MORE); // <no uarg> send(.., MSG_ZEROCOPY); // refcnt 1 Fixes: 100f6d8e09905 ("net: correct zerocopy refcnt with udp MSG_MORE") Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-10nexthops: add support for replaceDavid Ahern
Add support for atomically upating a nexthop config. When updating a nexthop, walk the lists of associated fib entries and verify the new config is valid. Replace is done by swapping nh_info for single nexthops - new config is applied to old nexthop struct, and old config is moved to new nexthop struct. For nexthop groups the same applies but for nh_group. In addition for groups the nh_parent reference needs to be updated. The old config is released by calling __remove_nexthop on the 'new' nexthop which now has the old config. This is done to avoid messing around with the list_heads that track which fib entries are using the nexthop. After the swap of config data, bump the sequence counters for FIB entries to invalidate any dst entries and send notifications to userspace. The notifications include the new nexthop spec as well as any fib entries using the updated nexthop struct. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-10ipv4: Optimization for fib_info lookup with nexthopsDavid Ahern
Be optimistic about re-using a fib_info when nexthop id is given and the route does not use metrics. Avoids a memory allocation which in most cases is expected to be freed anyways. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-10ipv4: Allow routes to use nexthop objectsDavid Ahern
Add support for RTA_NH_ID attribute to allow a user to specify a nexthop id to use with a route. fc_nh_id is added to fib_config to hold the value passed in the RTA_NH_ID attribute. If a nexthop id is given, the gateway, device, encap and multipath attributes can not be set. Update fib_nh_match to check ids on a route delete. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-10nexthops: Add ipv6 helper to walk all fib6_nh in a nexthop structDavid Ahern
IPv6 has traditionally had a single fib6_nh per fib6_info. With nexthops we can have multiple fib6_nh associated with a fib6_info. Add a nexthop helper to invoke a callback for each fib6_nh in a 'struct nexthop'. If the callback returns non-0, the loop is stopped and the return value passed to the caller. Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-10tcp: Make tcp_fastopen_alloc_ctx staticYueHaibing
Fix sparse warning: net/ipv4/tcp_fastopen.c:75:29: warning: symbol 'tcp_fastopen_alloc_ctx' was not declared. Should it be static? Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: YueHaibing <yuehaibing@huawei.com> Acked-by: Jason Baron <jbaron@akamai.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-10Update my email addressJozsef Kadlecsik
It's better to use my kadlec@netfilter.org email address in the source code. I might not be able to use kadlec@blackhole.kfki.hu in the future. Signed-off-by: Jozsef Kadlecsik <kadlec@netfilter.org> Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
2019-06-09ipv6: tcp: send consistent autoflowlabel in TIME_WAIT stateEric Dumazet
In case autoflowlabel is in action, skb_get_hash_flowi6() derives a non zero skb->hash to the flowlabel. If skb->hash is zero, a flow dissection is performed. Since all TCP skbs sent from ESTABLISH state inherit their skb->hash from sk->sk_txhash, we better keep a copy of sk->sk_txhash into the TIME_WAIT socket. After this patch, ACK or RST packets sent on behalf of a TIME_WAIT socket have the flowlabel that was previously used by the flow. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-09tcp: fix undo spurious SYNACK in passive Fast OpenYuchung Cheng
Commit 794200d66273 ("tcp: undo cwnd on Fast Open spurious SYNACK retransmit") may cause tcp_fastretrans_alert() to warn about pending retransmission in Open state. This is triggered when the Fast Open server both sends data and has spurious SYNACK retransmission during the handshake, and the data packets were lost or reordered. The root cause is a bit complicated: (1) Upon receiving SYN-data: a full socket is created with snd_una = ISN + 1 by tcp_create_openreq_child() (2) On SYNACK timeout the server/sender enters CA_Loss state. (3) Upon receiving the final ACK to complete the handshake, sender does not mark FLAG_SND_UNA_ADVANCED since (1) Sender then calls tcp_process_loss since state is CA_loss by (2) (4) tcp_process_loss() does not invoke undo operations but instead mark REXMIT_LOST to force retransmission (5) tcp_rcv_synrecv_state_fastopen() calls tcp_try_undo_loss(). It changes state to CA_Open but has positive tp->retrans_out (6) Next ACK triggers the WARN_ON in tcp_fastretrans_alert() The step that goes wrong is (4) where the undo operation should have been invoked because the ACK successfully acknowledged the SYN sequence. This fixes that by specifically checking undo when the SYN-ACK sequence is acknowledged. Then after tcp_process_loss() the state would be further adjusted based in tcp_fastretrans_alert() to avoid triggering the warning in (6). Fixes: 794200d66273 ("tcp: undo cwnd on Fast Open spurious SYNACK retransmit") Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-09net: ipv4: fib_semantics: fix uninitialized variableEnrico Weigelt
fix an uninitialized variable: CC net/ipv4/fib_semantics.o net/ipv4/fib_semantics.c: In function 'fib_check_nh_v4_gw': net/ipv4/fib_semantics.c:1027:12: warning: 'err' may be used uninitialized in this function [-Wmaybe-uninitialized] if (!tbl || err) { ^~ Signed-off-by: Enrico Weigelt <info@metux.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-08Merge tag 'spdx-5.2-rc4' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull yet more SPDX updates from Greg KH: "Another round of SPDX header file fixes for 5.2-rc4 These are all more "GPL-2.0-or-later" or "GPL-2.0-only" tags being added, based on the text in the files. We are slowly chipping away at the 700+ different ways people tried to write the license text. All of these were reviewed on the spdx mailing list by a number of different people. We now have over 60% of the kernel files covered with SPDX tags: $ ./scripts/spdxcheck.py -v 2>&1 | grep Files Files checked: 64533 Files with SPDX: 40392 Files with errors: 0 I think the majority of the "easy" fixups are now done, it's now the start of the longer-tail of crazy variants to wade through" * tag 'spdx-5.2-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (159 commits) treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 450 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 449 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 448 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 446 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 445 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 444 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 443 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 442 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 441 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 440 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 438 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 437 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 436 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 435 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 434 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 433 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 432 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 431 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 430 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 429 ...
2019-06-07Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfDavid S. Miller
Daniel Borkmann says: ==================== pull-request: bpf 2019-06-07 The following pull-request contains BPF updates for your *net* tree. The main changes are: 1) Fix several bugs in riscv64 JIT code emission which forgot to clear high 32-bits for alu32 ops, from Björn and Luke with selftests covering all relevant BPF alu ops from Björn and Jiong. 2) Two fixes for UDP BPF reuseport that avoid calling the program in case of __udp6_lib_err and UDP GRO which broke reuseport_select_sock() assumption that skb->data is pointing to transport header, from Martin. 3) Two fixes for BPF sockmap: a use-after-free from sleep in psock's backlog workqueue, and a missing restore of sk_write_space when psock gets dropped, from Jakub and John. 4) Fix unconnected UDP sendmsg hook API which is insufficient as-is since it breaks standard applications like DNS if reverse NAT is not performed upon receive, from Daniel. 5) Fix an out-of-bounds read in __bpf_skc_lookup which in case of AF_INET6 fails to verify that the length of the tuple is long enough, from Lorenz. 6) Fix libbpf's libbpf__probe_raw_btf to return an fd instead of 0/1 (for {un,}successful probe) as that is expected to be propagated as an fd to load_sk_storage_btf() and thus closing the wrong descriptor otherwise, from Michal. 7) Fix bpftool's JSON output for the case when a lookup fails, from Krzesimir. 8) Minor misc fixes in docs, samples and selftests, from various others. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-07Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Some ISDN files that got removed in net-next had some changes done in mainline, take the removals. Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-07Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds
Pull networking fixes from David Miller: 1) Free AF_PACKET po->rollover properly, from Willem de Bruijn. 2) Read SFP eeprom in max 16 byte increments to avoid problems with some SFP modules, from Russell King. 3) Fix UDP socket lookup wrt. VRF, from Tim Beale. 4) Handle route invalidation properly in s390 qeth driver, from Julian Wiedmann. 5) Memory leak on unload in RDS, from Zhu Yanjun. 6) sctp_process_init leak, from Neil HOrman. 7) Fix fib_rules rule insertion semantic change that broke Android, from Hangbin Liu. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (33 commits) pktgen: do not sleep with the thread lock held. net: mvpp2: Use strscpy to handle stat strings net: rds: fix memory leak in rds_ib_flush_mr_pool ipv6: fix EFAULT on sendto with icmpv6 and hdrincl ipv6: use READ_ONCE() for inet->hdrincl as in ipv4 Revert "fib_rules: return 0 directly if an exactly same rule exists when NLM_F_EXCL not supplied" net: aquantia: fix wol configuration not applied sometimes ethtool: fix potential userspace buffer overflow Fix memory leak in sctp_process_init net: rds: fix memory leak when unload rds_rdma ipv6: fix the check before getting the cookie in rt6_get_cookie ipv4: not do cache for local delivery if bc_forwarding is enabled s390/qeth: handle error when updating TX queue count s390/qeth: fix VLAN attribute in bridge_hostnotify udev event s390/qeth: check dst entry before use s390/qeth: handle limited IPv4 broadcast in L3 TX path net: fix indirect calls helpers for ptype list hooks. net: ipvlan: Fix ipvlan device tso disabled while NETIF_F_IP_CSUM is set udp: only choose unbound UDP socket for multicast when not in a VRF net/tls: replace the sleeping lock around RX resync with a bit lock ...
2019-06-06bpf: fix unconnected udp hooksDaniel Borkmann
Intention of cgroup bind/connect/sendmsg BPF hooks is to act transparently to applications as also stated in original motivation in 7828f20e3779 ("Merge branch 'bpf-cgroup-bind-connect'"). When recently integrating the latter two hooks into Cilium to enable host based load-balancing with Kubernetes, I ran into the issue that pods couldn't start up as DNS got broken. Kubernetes typically sets up DNS as a service and is thus subject to load-balancing. Upon further debugging, it turns out that the cgroupv2 sendmsg BPF hooks API is currently insufficient and thus not usable as-is for standard applications shipped with most distros. To break down the issue we ran into with a simple example: # cat /etc/resolv.conf nameserver 147.75.207.207 nameserver 147.75.207.208 For the purpose of a simple test, we set up above IPs as service IPs and transparently redirect traffic to a different DNS backend server for that node: # cilium service list ID Frontend Backend 1 147.75.207.207:53 1 => 8.8.8.8:53 2 147.75.207.208:53 1 => 8.8.8.8:53 The attached BPF program is basically selecting one of the backends if the service IP/port matches on the cgroup hook. DNS breaks here, because the hooks are not transparent enough to applications which have built-in msg_name address checks: # nslookup 1.1.1.1 ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53 ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.208#53 ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53 [...] ;; connection timed out; no servers could be reached # dig 1.1.1.1 ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53 ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.208#53 ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53 [...] ; <<>> DiG 9.11.3-1ubuntu1.7-Ubuntu <<>> 1.1.1.1 ;; global options: +cmd ;; connection timed out; no servers could be reached For comparison, if none of the service IPs is used, and we tell nslookup to use 8.8.8.8 directly it works just fine, of course: # nslookup 1.1.1.1 8.8.8.8 1.1.1.1.in-addr.arpa name = one.one.one.one. In order to fix this and thus act more transparent to the application, this needs reverse translation on recvmsg() side. A minimal fix for this API is to add similar recvmsg() hooks behind the BPF cgroups static key such that the program can track state and replace the current sockaddr_in{,6} with the original service IP. From BPF side, this basically tracks the service tuple plus socket cookie in an LRU map where the reverse NAT can then be retrieved via map value as one example. Side-note: the BPF cgroups static key should be converted to a per-hook static key in future. Same example after this fix: # cilium service list ID Frontend Backend 1 147.75.207.207:53 1 => 8.8.8.8:53 2 147.75.207.208:53 1 => 8.8.8.8:53 Lookups work fine now: # nslookup 1.1.1.1 1.1.1.1.in-addr.arpa name = one.one.one.one. Authoritative answers can be found from: # dig 1.1.1.1 ; <<>> DiG 9.11.3-1ubuntu1.7-Ubuntu <<>> 1.1.1.1 ;; global options: +cmd ;; Got answer: ;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 51550 ;; flags: qr rd ra ad; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1 ;; OPT PSEUDOSECTION: ; EDNS: version: 0, flags:; udp: 512 ;; QUESTION SECTION: ;1.1.1.1. IN A ;; AUTHORITY SECTION: . 23426 IN SOA a.root-servers.net. nstld.verisign-grs.com. 2019052001 1800 900 604800 86400 ;; Query time: 17 msec ;; SERVER: 147.75.207.207#53(147.75.207.207) ;; WHEN: Tue May 21 12:59:38 UTC 2019 ;; MSG SIZE rcvd: 111 And from an actual packet level it shows that we're using the back end server when talking via 147.75.207.20{7,8} front end: # tcpdump -i any udp [...] 12:59:52.698732 IP foo.42011 > google-public-dns-a.google.com.domain: 18803+ PTR? 1.1.1.1.in-addr.arpa. (38) 12:59:52.698735 IP foo.42011 > google-public-dns-a.google.com.domain: 18803+ PTR? 1.1.1.1.in-addr.arpa. (38) 12:59:52.701208 IP google-public-dns-a.google.com.domain > foo.42011: 18803 1/0/0 PTR one.one.one.one. (67) 12:59:52.701208 IP google-public-dns-a.google.com.domain > foo.42011: 18803 1/0/0 PTR one.one.one.one. (67) [...] In order to be flexible and to have same semantics as in sendmsg BPF programs, we only allow return codes in [1,1] range. In the sendmsg case the program is called if msg->msg_name is present which can be the case in both, connected and unconnected UDP. The former only relies on the sockaddr_in{,6} passed via connect(2) if passed msg->msg_name was NULL. Therefore, on recvmsg side, we act in similar way to call into the BPF program whenever a non-NULL msg->msg_name was passed independent of sk->sk_state being TCP_ESTABLISHED or not. Note that for TCP case, the msg->msg_name is ignored in the regular recvmsg path and therefore not relevant. For the case of ip{,v6}_recv_error() paths, picked up via MSG_ERRQUEUE, the hook is not called. This is intentional as it aligns with the same semantics as in case of TCP cgroup BPF hooks right now. This might be better addressed in future through a different bpf_attach_type such that this case can be distinguished from the regular recvmsg paths, for example. Fixes: 1cedee13d25a ("bpf: Hooks for sys_sendmsg") Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Andrey Ignatov <rdna@fb.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Acked-by: Martynas Pumputis <m@lambda.lt> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-06-06xfrm: remove type and offload_type map from xfrm_state_afinfoFlorian Westphal
Only a handful of xfrm_types exist, no need to have 512 pointers for them. Reduces size of afinfo struct from 4k to 120 bytes on 64bit platforms. Also, the unregister function doesn't need to return an error, no single caller does anything useful with it. Just place a WARN_ON() where needed instead. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2019-06-06xfrm: remove eth_proto value from xfrm_state_afinfoFlorian Westphal
xfrm_prepare_input needs to lookup the state afinfo backend again to fetch the address family ethernet protocol value. There are only two address families, so a switch statement is simpler. While at it, use u8 for family and proto and remove the owner member -- its not used anywhere. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2019-06-05inet_connection_sock: remove unused parameter of reqsk_queue_unlink funcZhiqiang Liu
small cleanup: "struct request_sock_queue *queue" parameter of reqsk_queue_unlink func is never used in the func, so we can remove it. Signed-off-by: Zhiqiang Liu <liuzhiqiang26@huawei.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>