summaryrefslogtreecommitdiff
path: root/net/ipv6/tcp_ipv6.c
AgeCommit message (Collapse)Author
2015-01-05net: tcp: add per route congestion controlDaniel Borkmann
This work adds the possibility to define a per route/destination congestion control algorithm. Generally, this opens up the possibility for a machine with different links to enforce specific congestion control algorithms with optimal strategies for each of them based on their network characteristics, even transparently for a single application listening on all links. For our specific use case, this additionally facilitates deployment of DCTCP, for example, applications can easily serve internal traffic/dsts in DCTCP and external one with CUBIC. Other scenarios would also allow for utilizing e.g. long living, low priority background flows for certain destinations/routes while still being able for normal traffic to utilize the default congestion control algorithm. We also thought about a per netns setting (where different defaults are possible), but given its actually a link specific property, we argue that a per route/destination setting is the most natural and flexible. The administrator can utilize this through ip-route(8) by appending "congctl [lock] <name>", where <name> denotes the name of a congestion control algorithm and the optional lock parameter allows to enforce the given algorithm so that applications in user space would not be allowed to overwrite that algorithm for that destination. The dst metric lookups are being done when a dst entry is already available in order to avoid a costly lookup and still before the algorithms are being initialized, thus overhead is very low when the feature is not being used. While the client side would need to drop the current reference on the module, on server side this can actually even be avoided as we just got a flat-copied socket clone. Joint work with Florian Westphal. Suggested-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Daniel Borkmann <dborkman@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-12-22tcp6: don't move IP6CB before xfrm6_policy_check()Nicolas Dichtel
When xfrm6_policy_check() is used, _decode_session6() is called after some intermediate functions. This function uses IP6CB(), thus TCP_SKB_CB() must be prepared after the call of xfrm6_policy_check(). Before this patch, scenarii with IPv6 + TCP + IPsec Transport are broken. Fixes: 971f10eca186 ("tcp: better TCP_SKB_CB layout to reduce cache line misses") Reported-by: Huaibin Wang <huaibin.wang@6wind.com> Suggested-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-12-10Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Conflicts: drivers/net/ethernet/amd/xgbe/xgbe-desc.c drivers/net/ethernet/renesas/sh_eth.c Overlapping changes in both conflict cases. Signed-off-by: David S. Miller <davem@davemloft.net>
2014-12-09tcp: fix more NULL deref after prequeue changesEric Dumazet
When I cooked commit c3658e8d0f1 ("tcp: fix possible NULL dereference in tcp_vX_send_reset()") I missed other spots we could deref a NULL skb_dst(skb) Again, if a socket is provided, we do not need skb_dst() to get a pointer to network namespace : sock_net(sk) is good enough. Reported-by: Dann Frazier <dann.frazier@canonical.com> Bisected-by: Dann Frazier <dann.frazier@canonical.com> Tested-by: Dann Frazier <dann.frazier@canonical.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Fixes: ca777eff51f7 ("tcp: remove dst refcount false sharing for prequeue mode") Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-29Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
2014-11-25tcp: fix possible NULL dereference in tcp_vX_send_reset()Eric Dumazet
After commit ca777eff51f7 ("tcp: remove dst refcount false sharing for prequeue mode") we have to relax check against skb dst in tcp_v[46]_send_reset() if prequeue dropped the dst. If a socket is provided, a full lookup was done to find this socket, so the dst test can be skipped. Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=88191 Reported-by: Jaša Bartelj <jasa.bartelj@gmail.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Daniel Borkmann <dborkman@redhat.com> Fixes: ca777eff51f7 ("tcp: remove dst refcount false sharing for prequeue mode") Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-11net: introduce SO_INCOMING_CPUEric Dumazet
Alternative to RPS/RFS is to use hardware support for multiple queues. Then split a set of million of sockets into worker threads, each one using epoll() to manage events on its own socket pool. Ideally, we want one thread per RX/TX queue/cpu, but we have no way to know after accept() or connect() on which queue/cpu a socket is managed. We normally use one cpu per RX queue (IRQ smp_affinity being properly set), so remembering on socket structure which cpu delivered last packet is enough to solve the problem. After accept(), connect(), or even file descriptor passing around processes, applications can use : int cpu; socklen_t len = sizeof(cpu); getsockopt(fd, SOL_SOCKET, SO_INCOMING_CPU, &cpu, &len); And use this information to put the socket into the right silo for optimal performance, as all networking stack should run on the appropriate cpu, without need to send IPI (RPS/RFS). Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-11-11tcp: move sk_mark_napi_id() at the right placeEric Dumazet
sk_mark_napi_id() is used to record for a flow napi id of incoming packets for busypoll sake. We should do this only on established flows, not on listeners. This was 'working' by virtue of the socket cloning, but doing this on SYN packets in unecessary cache line dirtying. Even if we move sk_napi_id in the same cache line than sk_lock, we are working to make SYN processing lockless, so it is desirable to set sk_napi_id only for established flows. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-22net: fix saving TX flow hash in sock for outgoing connectionsSathya Perla
The commit "net: Save TX flow hash in sock and set in skbuf on xmit" introduced the inet_set_txhash() and ip6_set_txhash() routines to calculate and record flow hash(sk_txhash) in the socket structure. sk_txhash is used to set skb->hash which is used to spread flows across multiple TXQs. But, the above routines are invoked before the source port of the connection is created. Because of this all outgoing connections that just differ in the source port get hashed into the same TXQ. This patch fixes this problem for IPv4/6 by invoking the the above routines after the source port is available for the socket. Fixes: b73c3d0e4("net: Save TX flow hash in sock and set in skbuf on xmit") Signed-off-by: Sathya Perla <sathya.perla@emulex.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-17ipv6: introduce tcp_v6_iif()Eric Dumazet
Commit 971f10eca186 ("tcp: better TCP_SKB_CB layout to reduce cache line misses") added a regression for SO_BINDTODEVICE on IPv6. This is because we still use inet6_iif() which expects that IP6 control block is still at the beginning of skb->cb[] This patch adds tcp_v6_iif() helper and uses it where necessary. Because __inet6_lookup_skb() is used by TCP and DCCP, we add an iif parameter to it. Signed-off-by: Eric Dumazet <edumazet@google.com> Fixes: 971f10eca186 ("tcp: better TCP_SKB_CB layout to reduce cache line misses") Acked-by: Cong Wang <cwang@twopensource.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-10-08Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-nextLinus Torvalds
Pull networking updates from David Miller: "Most notable changes in here: 1) By far the biggest accomplishment, thanks to a large range of contributors, is the addition of multi-send for transmit. This is the result of discussions back in Chicago, and the hard work of several individuals. Now, when the ->ndo_start_xmit() method of a driver sees skb->xmit_more as true, it can choose to defer the doorbell telling the driver to start processing the new TX queue entires. skb->xmit_more means that the generic networking is guaranteed to call the driver immediately with another SKB to send. There is logic added to the qdisc layer to dequeue multiple packets at a time, and the handling mis-predicted offloads in software is now done with no locks held. Finally, pktgen is extended to have a "burst" parameter that can be used to test a multi-send implementation. Several drivers have xmit_more support: i40e, igb, ixgbe, mlx4, virtio_net Adding support is almost trivial, so export more drivers to support this optimization soon. I want to thank, in no particular or implied order, Jesper Dangaard Brouer, Eric Dumazet, Alexander Duyck, Tom Herbert, Jamal Hadi Salim, John Fastabend, Florian Westphal, Daniel Borkmann, David Tat, Hannes Frederic Sowa, and Rusty Russell. 2) PTP and timestamping support in bnx2x, from Michal Kalderon. 3) Allow adjusting the rx_copybreak threshold for a driver via ethtool, and add rx_copybreak support to enic driver. From Govindarajulu Varadarajan. 4) Significant enhancements to the generic PHY layer and the bcm7xxx driver in particular (EEE support, auto power down, etc.) from Florian Fainelli. 5) Allow raw buffers to be used for flow dissection, allowing drivers to determine the optimal "linear pull" size for devices that DMA into pools of pages. The objective is to get exactly the necessary amount of headers into the linear SKB area pre-pulled, but no more. The new interface drivers use is eth_get_headlen(). From WANG Cong, with driver conversions (several had their own by-hand duplicated implementations) by Alexander Duyck and Eric Dumazet. 6) Support checksumming more smoothly and efficiently for encapsulations, and add "foo over UDP" facility. From Tom Herbert. 7) Add Broadcom SF2 switch driver to DSA layer, from Florian Fainelli. 8) eBPF now can load programs via a system call and has an extensive testsuite. Alexei Starovoitov and Daniel Borkmann. 9) Major overhaul of the packet scheduler to use RCU in several major areas such as the classifiers and rate estimators. From John Fastabend. 10) Add driver for Intel FM10000 Ethernet Switch, from Alexander Duyck. 11) Rearrange TCP_SKB_CB() to reduce cache line misses, from Eric Dumazet. 12) Add Datacenter TCP congestion control algorithm support, From Florian Westphal. 13) Reorganize sk_buff so that __copy_skb_header() is significantly faster. From Eric Dumazet" * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1558 commits) netlabel: directly return netlbl_unlabel_genl_init() net: add netdev_txq_bql_{enqueue, complete}_prefetchw() helpers net: description of dma_cookie cause make xmldocs warning cxgb4: clean up a type issue cxgb4: potential shift wrapping bug i40e: skb->xmit_more support net: fs_enet: Add NAPI TX net: fs_enet: Remove non NAPI RX r8169:add support for RTL8168EP net_sched: copy exts->type in tcf_exts_change() wimax: convert printk to pr_foo() af_unix: remove 0 assignment on static ipv6: Do not warn for informational ICMP messages, regardless of type. Update Intel Ethernet Driver maintainers list bridge: Save frag_max_size between PRE_ROUTING and POST_ROUTING tipc: fix bug in multicast congestion handling net: better IFF_XMIT_DST_RELEASE support net/mlx4_en: remove NETDEV_TX_BUSY 3c59x: fix bad split of cpu_to_le32(pci_map_single()) net: bcmgenet: fix Tx ring priority programming ...
2014-10-07Merge tag 'dmaengine-3.17' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/djbw/dmaengine Pull dmaengine updates from Dan Williams: "Even though this has fixes marked for -stable, given the size and the needed conflict resolutions this is 3.18-rc1/merge-window material. These patches have been languishing in my tree for a long while. The fact that I do not have the time to do proper/prompt maintenance of this tree is a primary factor in the decision to step down as dmaengine maintainer. That and the fact that the bulk of drivers/dma/ activity is going through Vinod these days. The net_dma removal has not been in -next. It has developed simple conflicts against mainline and net-next (for-3.18). Continuing thanks to Vinod for staying on top of drivers/dma/. Summary: 1/ Step down as dmaengine maintainer see commit 08223d80df38 "dmaengine maintainer update" 2/ Removal of net_dma, as it has been marked 'broken' since 3.13 (commit 77873803363c "net_dma: mark broken"), without reports of performance regression. 3/ Miscellaneous fixes" * tag 'dmaengine-3.17' of git://git.kernel.org/pub/scm/linux/kernel/git/djbw/dmaengine: net: make tcp_cleanup_rbuf private net_dma: revert 'copied_early' net_dma: simple removal dmaengine maintainer update dmatest: prevent memory leakage on error path in thread ioat: Use time_before_jiffies() dmaengine: fix xor sources continuation dma: mv_xor: Rename __mv_xor_slot_cleanup() to mv_xor_slot_cleanup() dma: mv_xor: Remove all callers of mv_xor_slot_cleanup() dma: mv_xor: Remove unneeded mv_xor_clean_completed_slots() call ioat: Use pci_enable_msix_exact() instead of pci_enable_msix() drivers: dma: Include appropriate header file in dca.c drivers: dma: Mark functions as static in dma_v3.c dma: mv_xor: Add DMA API error checks ioat/dca: Use dev_is_pci() to check whether it is pci device
2014-09-28tcp: better TCP_SKB_CB layout to reduce cache line missesEric Dumazet
TCP maintains lists of skb in write queue, and in receive queues (in order and out of order queues) Scanning these lists both in input and output path usually requires access to skb->next, TCP_SKB_CB(skb)->seq, and TCP_SKB_CB(skb)->end_seq These fields are currently in two different cache lines, meaning we waste lot of memory bandwidth when these queues are big and flows have either packet drops or packet reorders. We can move TCP_SKB_CB(skb)->header at the end of TCP_SKB_CB, because this header is not used in fast path. This allows TCP to search much faster in the skb lists. Even with regular flows, we save one cache line miss in fast path. Thanks to Christoph Paasch for noticing we need to cleanup skb->cb[] (IPCB/IP6CB) before entering IP stack in tx path, and that I forgot IPCB use in tcp_v4_hnd_req() and tcp_v4_save_options(). Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-28ipv6: add a struct inet6_skb_parm param to ipv6_opt_accepted()Eric Dumazet
ipv6_opt_accepted() assumes IP6CB(skb) holds the struct inet6_skb_parm that it needs. Lets not assume this, as TCP stack might use a different place. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-28net_dma: simple removalDan Williams
Per commit "77873803363c net_dma: mark broken" net_dma is no longer used and there is no plan to fix it. This is the mechanical removal of bits in CONFIG_NET_DMA ifdef guards. Reverting the remainder of the net_dma induced changes is deferred to subsequent patches. Marked for stable due to Roman's report of a memory leak in dma_pin_iovec_pages(): https://lkml.org/lkml/2014/9/3/177 Cc: Dave Jiang <dave.jiang@intel.com> Cc: Vinod Koul <vinod.koul@intel.com> Cc: David Whipple <whipple@securedatainnovations.ch> Cc: Alexander Duyck <alexander.h.duyck@intel.com> Cc: <stable@vger.kernel.org> Reported-by: Roman Gushchin <klamm@yandex-team.ru> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2014-09-15tcp: use TCP_SKB_CB(skb)->tcp_flags in input pathEric Dumazet
Input path of TCP do not currently uses TCP_SKB_CB(skb)->tcp_flags, which is only used in output path. tcp_recvmsg(), looks at tcp_hdr(skb)->syn for every skb found in receive queue, and its unfortunate because this bit is located in a cache line right before the payload. We can simplify TCP by copying tcp flags into TCP_SKB_CB(skb)->tcp_flags. This patch does so, and avoids the cache line miss in tcp_recvmsg() Following patches will - allow a segment with FIN being coalesced in tcp_try_coalesce() - simplify tcp_collapse() by not copying the headers. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-09tcp: remove dst refcount false sharing for prequeue modeEric Dumazet
Alexander Duyck reported high false sharing on dst refcount in tcp stack when prequeue is used. prequeue is the mechanism used when a thread is blocked in recvmsg()/read() on a TCP socket, using a blocking model rather than select()/poll()/epoll() non blocking one. We already try to use RCU in input path as much as possible, but we were forced to take a refcount on the dst when skb escaped RCU protected region. When/if the user thread runs on different cpu, dst_release() will then touch dst refcount again. Commit 093162553c33 (tcp: force a dst refcount when prequeue packet) was an example of a race fix. It turns out the only remaining usage of skb->dst for a packet stored in a TCP socket prequeue is IP early demux. We can add a logic to detect when IP early demux is probably going to use skb->dst. Because we do an optimistic check rather than duplicate existing logic, we need to guard inet_sk_rx_dst_set() and inet6_sk_rx_dst_set() from using a NULL dst. Many thanks to Alexander for providing a nice bug report, git bisection, and reproducer. Tested using Alexander script on a 40Gb NIC, 8 RX queues. Hosts have 24 cores, 48 hyper threads. echo 0 >/proc/sys/net/ipv4/tcp_autocorking for i in `seq 0 47` do for j in `seq 0 2` do netperf -H $DEST -t TCP_STREAM -l 1000 \ -c -C -T $i,$i -P 0 -- \ -m 64 -s 64K -D & done done Before patch : ~6Mpps and ~95% cpu usage on receiver After patch : ~9Mpps and ~35% cpu usage on receiver. Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-09-05tcp: introduce TCP_SKB_CB(skb)->tcp_tw_isnEric Dumazet
TCP_SKB_CB(skb)->when has different meaning in output and input paths. In output path, it contains a timestamp. In input path, it contains an ISN, chosen by tcp_timewait_state_process() Lets add a different name to ease code comprehension. Note that 'when' field will disappear in following patch, as skb_mstamp already contains timestamp, the anonymous union will promptly disappear as well. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-14tcp: fix tcp_release_cb() to dispatch via address family for mtu_reduced()Neal Cardwell
Make sure we use the correct address-family-specific function for handling MTU reductions from within tcp_release_cb(). Previously AF_INET6 sockets were incorrectly always using the IPv6 code path when sometimes they were handling IPv4 traffic and thus had an IPv4 dst. Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Diagnosed-by: Willem de Bruijn <willemb@google.com> Fixes: 563d34d057862 ("tcp: dont drop MTU reduction indications") Reviewed-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-08-06tcp: md5: check md5 signature without socket lockDmitry Popov
Since a8afca032 (tcp: md5: protects md5sig_info with RCU) tcp_md5_do_lookup doesn't require socket lock, rcu_read_lock is enough. Therefore socket lock is no longer required for tcp_v{4,6}_inbound_md5_hash too, so we can move these calls (wrapped with rcu_read_{,un}lock) before bh_lock_sock: from tcp_v{4,6}_do_rcv to tcp_v{4,6}_rcv. Signed-off-by: Dmitry Popov <ixaphire@qrator.net> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-07net: Save TX flow hash in sock and set in skbuf on xmitTom Herbert
For a connected socket we can precompute the flow hash for setting in skb->hash on output. This is a performance advantage over calculating the skb->hash for every packet on the connection. The computation is done using the common hash algorithm to be consistent with computations done for packets of the connection in other states where thers is no socket (e.g. time-wait, syn-recv, syn-cookies). This patch adds sk_txhash to the sock structure. inet_set_txhash and ip6_set_txhash functions are added which are called from points in TCP and UDP where socket moves to established state. skb_set_hash_from_sk is a function which sets skb->hash from the sock txhash value. This is called in UDP and TCP transmit path when transmitting within the context of a socket. Tested: ran super_netperf with 200 TCP_RR streams over a vxlan interface (in this case skb_get_hash called on every TX packet to create a UDP source port). Before fix: 95.02% CPU utilization 154/256/505 90/95/99% latencies 1.13042e+06 tps Time in functions: 0.28% skb_flow_dissect 0.21% __skb_get_hash After fix: 94.95% CPU utilization 156/254/485 90/95/99% latencies 1.15447e+06 Neither __skb_get_hash nor skb_flow_dissect appear in perf Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-07-07tcp: switch snt_synack back to measuring transmit time of first SYNACKNeal Cardwell
Always store in snt_synack the time at which the server received the first client SYN and attempted to send the first SYNACK. Recent commit aa27fc501 ("tcp: tcp_v[46]_conn_request: fix snt_synack initialization") resolved an inconsistency between IPv4 and IPv6 in the initialization of snt_synack. This commit brings back the idea from 843f4a55e (tcp: use tcp_v4_send_synack on first SYN-ACK), which was going for the original behavior of snt_synack from the commit where it was added in 9ad7c049f0f79 ("tcp: RFC2988bis + taking RTT sample from 3WHS for the passive open side") in v3.1. In addition to being simpler (and probably a tiny bit faster), unconditionally storing the time of the first SYNACK attempt has been useful because it allows calculating a performance metric quantifying how long it took to establish a passive TCP connection. Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Cc: Octavian Purdila <octavian.purdila@intel.com> Cc: Jerry Chu <hkchu@google.com> Acked-by: Octavian Purdila <octavian.purdila@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-27tcp: add tcp_conn_requestOctavian Purdila
Create tcp_conn_request and remove most of the code from tcp_v4_conn_request and tcp_v6_conn_request. Signed-off-by: Octavian Purdila <octavian.purdila@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-27tcp: add queue_add_hash to tcp_request_sock_opsOctavian Purdila
Add queue_add_hash member to tcp_request_sock_ops so that we can later unify tcp_v4_conn_request and tcp_v6_conn_request. Signed-off-by: Octavian Purdila <octavian.purdila@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-27tcp: add mss_clamp to tcp_request_sock_opsOctavian Purdila
Add mss_clamp member to tcp_request_sock_ops so that we can later unify tcp_v4_conn_request and tcp_v6_conn_request. Signed-off-by: Octavian Purdila <octavian.purdila@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-27tcp: unify tcp_v4_rtx_synack and tcp_v6_rtx_synackOctavian Purdila
Signed-off-by: Octavian Purdila <octavian.purdila@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-27tcp: add send_synack method to tcp_request_sock_opsOctavian Purdila
Create a new tcp_request_sock_ops method to unify the IPv4/IPv6 signature for tcp_v[46]_send_synack. This allows us to later unify tcp_v4_rtx_synack with tcp_v6_rtx_synack and tcp_v4_conn_request with tcp_v4_conn_request. Signed-off-by: Octavian Purdila <octavian.purdila@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-27tcp: add init_seq method to tcp_request_sock_opsOctavian Purdila
More work in preparation of unifying tcp_v4_conn_request and tcp_v6_conn_request: indirect the init sequence calls via the tcp_request_sock_ops. Signed-off-by: Octavian Purdila <octavian.purdila@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-27tcp: move around a few calls in tcp_v6_conn_requestOctavian Purdila
Make the tcp_v6_conn_request calls flow similar with that of tcp_v4_conn_request. Note that want_cookie can be true only if isn is zero and that is why we can move the if (want_cookie) block out of the if (!isn) block. Moving security_inet_conn_request() has a couple of side effects: missing inet_rsk(req)->ecn_ok update and the req->cookie_ts update. However, neither SELinux nor Smack security hooks seems to check them. This change should also avoid future different behaviour for IPv4 and IPv6 in the security hooks. Signed-off-by: Octavian Purdila <octavian.purdila@intel.com> Acked-by: Paul Moore <paul@paul-moore.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-27tcp: add route_req method to tcp_request_sock_opsOctavian Purdila
Create wrappers with same signature for the IPv4/IPv6 request routing calls and use these wrappers (via route_req method from tcp_request_sock_ops) in tcp_v4_conn_request and tcp_v6_conn_request with the purpose of unifying the two functions in a later patch. We can later drop the wrapper functions and modify inet_csk_route_req and inet6_cks_route_req to use the same signature. Signed-off-by: Octavian Purdila <octavian.purdila@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-27tcp: add init_cookie_seq method to tcp_request_sock_opsOctavian Purdila
Move the specific IPv4/IPv6 cookie sequence initialization to a new method in tcp_request_sock_ops in preparation for unifying tcp_v4_conn_request and tcp_v6_conn_request. Signed-off-by: Octavian Purdila <octavian.purdila@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-27tcp: add init_req method to tcp_request_sock_opsOctavian Purdila
Move the specific IPv4/IPv6 intializations to a new method in tcp_request_sock_ops in preparation for unifying tcp_v4_conn_request and tcp_v6_conn_request. Signed-off-by: Octavian Purdila <octavian.purdila@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-27net: remove inet6_reqsk_allocOctavian Purdila
Since pktops is only used for IPv6 only and opts is used for IPv4 only, we can move these fields into a union and this allows us to drop the inet6_reqsk_alloc function as after this change it becomes equivalent with inet_reqsk_alloc. This patch also fixes a kmemcheck issue in the IPv6 stack: the flags field was not annotated after a request_sock was allocated. Signed-off-by: Octavian Purdila <octavian.purdila@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-27tcp: tcp_v[46]_conn_request: fix snt_synack initializationOctavian Purdila
Commit 016818d07 (tcp: TCP Fast Open Server - take SYNACK RTT after completing 3WHS) changes the code to only take a snt_synack timestamp when a SYNACK transmit or retransmit succeeds. This behaviour is later broken by commit 843f4a55e (tcp: use tcp_v4_send_synack on first SYN-ACK), as snt_synack is now updated even if tcp_v4_send_synack fails. Also, commit 3a19ce0ee (tcp: IPv6 support for fastopen server) misses the required IPv6 updates for 016818d07. This patch makes sure that snt_synack is updated only when the SYNACK trasnmit/retransmit succeeds, for both IPv4 and IPv6. Cc: Cardwell <ncardwell@google.com> Cc: Daniel Lee <longinus00@gmail.com> Cc: Yuchung Cheng <ycheng@google.com> Signed-off-by: Octavian Purdila <octavian.purdila@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-06-17tcp: move ir_mark initialization to tcp_openreq_initOctavian Purdila
ir_mark initialization is done for both TCP v4 and v6, move it in the common tcp_openreq_init function. Signed-off-by: Octavian Purdila <octavian.purdila@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-05-23net: Eliminate no_check from protoswTom Herbert
It doesn't seem like an protocols are setting anything other than the default, and allowing to arbitrarily disable checksums for a whole protocol seems dangerous. This can be done on a per socket basis. Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-05-13net: support marking accepting TCP socketsLorenzo Colitti
When using mark-based routing, sockets returned from accept() may need to be marked differently depending on the incoming connection request. This is the case, for example, if different socket marks identify different networks: a listening socket may want to accept connections from all networks, but each connection should be marked with the network that the request came in on, so that subsequent packets are sent on the correct network. This patch adds a sysctl to mark TCP sockets based on the fwmark of the incoming SYN packet. If enabled, and an unmarked socket receives a SYN, then the SYN packet's fwmark is written to the connection's inet_request_sock, and later written back to the accepted socket when the connection is established. If the socket already has a nonzero mark, then the behaviour is the same as it is today, i.e., the listening socket's fwmark is used. Black-box tested using user-mode linux: - IPv4/IPv6 SYN+ACK, FIN, etc. packets are routed based on the mark of the incoming SYN packet. - The socket returned by accept() is marked with the mark of the incoming SYN packet. - Tested with syncookies=1 and syncookies=2. Signed-off-by: Lorenzo Colitti <lorenzo@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-05-13net: add a sysctl to reflect the fwmark on repliesLorenzo Colitti
Kernel-originated IP packets that have no user socket associated with them (e.g., ICMP errors and echo replies, TCP RSTs, etc.) are emitted with a mark of zero. Add a sysctl to make them have the same mark as the packet they are replying to. This allows an administrator that wishes to do so to use mark-based routing, firewalling, etc. for these replies by marking the original packets inbound. Tested using user-mode linux: - ICMP/ICMPv6 echo replies and errors. - TCP RST packets (IPv4 and IPv6). Signed-off-by: Lorenzo Colitti <lorenzo@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-05-13tcp: IPv6 support for fastopen serverDaniel Lee
After all the preparatory works, supporting IPv6 in Fast Open is now easy. We pretty much just mirror v4 code. The only difference is how we generate the Fast Open cookie for IPv6 sockets. Since Fast Open cookie is 128 bits and we use AES 128, we use CBC-MAC to encrypt both the source and destination IPv6 addresses since the cookie is a MAC tag. Signed-off-by: Daniel Lee <longinus00@gmail.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Jerry Chu <hkchu@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-05-13tcp: improve fastopen icmp handlingYuchung Cheng
If a fast open socket is already accepted by the user, it should be treated like a connected socket to record the ICMP error in sk_softerr, so the user can fetch it. Do that in both tcp_v4_err and tcp_v6_err. Also refactor the sequence window check to improve readability (e.g., there were two local variables named 'req'). Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Daniel Lee <longinus00@gmail.com> Signed-off-by: Jerry Chu <hkchu@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-05-05net: Call skb_checksum_init in IPv6Tom Herbert
Call skb_checksum_init instead of private functions. Signed-off-by: Tom Herbert <therbert@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-04-11net: ipv6: Fix oif in TCP SYN+ACK route lookup.Lorenzo Colitti
net-next commit 9c76a11, ipv6: tcp_ipv6 policy route issue, had a boolean logic error that caused incorrect behaviour for TCP SYN+ACK when oif-based rules are in use. Specifically: 1. If a SYN comes in from a global address, and sk_bound_dev_if is not set, the routing lookup has oif set to the interface the SYN came in on. Instead, it should have oif unset, because for global addresses, the incoming interface doesn't necessarily have any bearing on the interface the SYN+ACK is sent out on. 2. If a SYN comes in from a link-local address, and sk_bound_dev_if is set, the routing lookup has oif set to the interface the SYN came in on. Instead, it should have oif set to sk_bound_dev_if, because that's what the application requested. Signed-off-by: Lorenzo Colitti <lorenzo@google.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-31ipv6: tcp_ipv6 policy route issueWang Yufen
The issue raises when adding policy route, specify a particular NIC as oif, the policy route did not take effect. The reason is that fl6.oif is not set and route map failed. From the tcp_v6_send_response function, if the binding address is linklocal, fl6.oif is set, but not for global address. Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: Wang Yufen <wangyufen@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-31ipv6: tcp_ipv6 do some cleanupWang Yufen
Signed-off-by: Wang Yufen <wangyufen@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-03-03tcp: snmp stats for Fast Open, SYN rtx, and data pktsYuchung Cheng
Add the following snmp stats: TCPFastOpenActiveFail: Fast Open attempts (SYN/data) failed beacuse the remote does not accept it or the attempts timed out. TCPSynRetrans: number of SYN and SYN/ACK retransmits to break down retransmissions into SYN, fast-retransmits, timeout retransmits, etc. TCPOrigDataSent: number of outgoing packets with original data (excluding retransmission but including data-in-SYN). This counter is different from TcpOutSegs because TcpOutSegs also tracks pure ACKs. TCPOrigDataSent is more useful to track the TCP retransmission rate. Change TCPFastOpenActive to track only successful Fast Opens to be symmetric to TCPFastOpenPassive. Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Nandita Dukkipati <nanditad@google.com> Signed-off-by: Lawrence Brakmo <brakmo@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-21tcp: delete redundant calls of tcp_mtup_init()Peter Pan(潘卫平)
As tcp_rcv_state_process() has already calls tcp_mtup_init() for non-fastopen sock, we can delete the redundant calls of tcp_mtup_init() in tcp_{v4,v6}_syn_recv_sock(). Signed-off-by: Weiping Pan <panweiping3@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-19ipv6: add the IPV6_FL_F_REFLECT flag to IPV6_FL_A_GETFlorent Fourcot
With this option, the socket will reply with the flow label value read on received packets. The goal is to have a connection with the same flow label in both direction of the communication. Changelog of V4: * Do not erase the flow label on the listening socket. Use pktopts to store the received value Signed-off-by: Florent Fourcot <florent.fourcot@enst-bretagne.fr> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2014-01-17ipv6: tcp: fix flowlabel value in ACK messages send from TIME_WAITFlorent Fourcot
This patch is following the commit b903d324bee262 (ipv6: tcp: fix TCLASS value in ACK messages sent from TIME_WAIT). For the same reason than tclass, we have to store the flow label in the inet_timewait_sock to provide consistency of flow label on the last ACK. Signed-off-by: Florent Fourcot <florent.fourcot@enst-bretagne.fr> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-26ipv6: cleanup for tcp_ipv6.cWeilong Chen
Fix some checkpatch errors for tcp_ipv6.c Signed-off-by: Weilong Chen <chenweilong@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2013-12-19Merge branch 'master' of ↵David S. Miller
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next Steffen Klassert says: ==================== pull request (net-next): ipsec-next 2013-12-19 1) Use the user supplied policy index instead of a generated one if present. From Fan Du. 2) Make xfrm migration namespace aware. From Fan Du. 3) Make the xfrm state and policy locks namespace aware. From Fan Du. 4) Remove ancient sleeping when the SA is in acquire state, we now queue packets to the policy instead. This replaces the sleeping code. 5) Remove FLOWI_FLAG_CAN_SLEEP. This was used to notify xfrm about the posibility to sleep. The sleeping code is gone, so remove it. 6) Check user specified spi for IPComp. Thr spi for IPcomp is only 16 bit wide, so check for a valid value. From Fan Du. 7) Export verify_userspi_info to check for valid user supplied spi ranges with pfkey and netlink. From Fan Du. 8) RFC3173 states that if the total size of a compressed payload and the IPComp header is not smaller than the size of the original payload, the IP datagram must be sent in the original non-compressed form. These packets are dropped by the inbound policy check because they are not transformed. Document the need to set 'level use' for IPcomp to receive such packets anyway. From Fan Du. Please pull or let me know if there are problems. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>