Diffstat (limited to 'Documentation/networking')
12 files changed, 247 insertions, 27 deletions
diff --git a/Documentation/networking/af_xdp.rst b/Documentation/networking/af_xdp.rst
index 247c6c4127e9..1cc35de336a4 100644
--- a/Documentation/networking/af_xdp.rst
+++ b/Documentation/networking/af_xdp.rst
@@ -433,6 +433,15 @@ start N bytes into the buffer leaving the first N bytes for the
 application to use. The final option is the flags field, but it will be
 dealt with in separate sections for each UMEM flag.
 
+SO_BINDTODEVICE setsockopt
+--------------------------
+
+This is a generic SOL_SOCKET option that can be used to tie AF_XDP
+socket to a particular network interface. It is useful when a socket
+is created by a privileged process and passed to a non-privileged one.
+Once the option is set, kernel will refuse attempts to bind that socket
+to a different interface. Updating the value requires CAP_NET_RAW.
+
 XDP_STATISTICS getsockopt
 -------------------------
 
diff --git a/Documentation/networking/device_drivers/cellular/qualcomm/rmnet.rst b/Documentation/networking/device_drivers/cellular/qualcomm/rmnet.rst
index 4118384cf8eb..289c146a8291 100644
--- a/Documentation/networking/device_drivers/cellular/qualcomm/rmnet.rst
+++ b/Documentation/networking/device_drivers/cellular/qualcomm/rmnet.rst
@@ -190,8 +190,7 @@ MAP header|IP Packet|Optional padding|MAP header|Command Packet|Optional pad...
 3. Userspace configuration
 ==========================
 
-rmnet userspace configuration is done through netlink library librmnetctl
-and command line utility rmnetcli. Utility is hosted in codeaurora forum git.
-The driver uses rtnl_link_ops for communication.
+rmnet userspace configuration is done through netlink using iproute2
+https://git.kernel.org/pub/scm/network/iproute2/iproute2.git/
 
-https://source.codeaurora.org/quic/la/platform/vendor/qcom-opensource/dataservices/tree/rmnetctl
+The driver uses rtnl_link_ops for communication.
diff --git a/Documentation/networking/device_drivers/ethernet/amazon/ena.rst b/Documentation/networking/device_drivers/ethernet/amazon/ena.rst
index 8bcb173e0353..5eaa3ab6c73e 100644
--- a/Documentation/networking/device_drivers/ethernet/amazon/ena.rst
+++ b/Documentation/networking/device_drivers/ethernet/amazon/ena.rst
@@ -38,6 +38,7 @@ debug logs.
 Some of the ENA devices support a working mode called Low-latency
 Queue (LLQ), which saves several more microseconds.
 
+
 ENA Source Code Directory Structure
 ===================================
 
@@ -205,6 +206,8 @@ Adaptive coalescing can be switched on/off through `ethtool(8)`'s
 More information about Adaptive Interrupt Moderation (DIM) can be found in
 Documentation/networking/net_dim.rst
 
+.. _`RX copybreak`:
+
 RX copybreak
 ============
 The rx_copybreak is initialized by default to ENA_DEFAULT_RX_COPYBREAK
@@ -315,3 +318,34 @@ Rx
 - The new SKB is updated with the necessary information (protocol,
   checksum hw verify result, etc), and then passed to the network
   stack, using the NAPI interface function :code:`napi_gro_receive()`.
+
+Dynamic RX Buffers (DRB)
+------------------------
+
+Each RX descriptor in the RX ring is a single memory page (which is either 4KB
+or 16KB long depending on system's configurations).
+To reduce the memory allocations required when dealing with a high rate of small
+packets, the driver tries to reuse the remaining RX descriptor's space if more
+than 2KB of this page remain unused.
+
+A simple example of this mechanism is the following sequence of events:
+
+::
+
+    1. Driver allocates page-sized RX buffer and passes it to hardware
+        +----------------------+
+        |4KB RX Buffer         |
+        +----------------------+
+
+    2. A 300Bytes packet is received on this buffer
+
+    3. The driver increases the ref count on this page and returns it back to
+       HW as an RX buffer of size 4KB - 300Bytes = 3796 Bytes
+        +----+--------------------+
+        |****|3796 Bytes RX Buffer|
+        +----+--------------------+
+
+This mechanism isn't used when an XDP program is loaded, or when the
+RX packet is less than rx_copybreak bytes (in which case the packet is
+copied out of the RX buffer into the linear part of a new skb allocated
+for it and the RX buffer remains the same size, see `RX copybreak`_).
diff --git a/Documentation/networking/device_drivers/ethernet/amd/pds_vdpa.rst b/Documentation/networking/device_drivers/ethernet/amd/pds_vdpa.rst
new file mode 100644
index 000000000000..587927d3de92
--- /dev/null
+++ b/Documentation/networking/device_drivers/ethernet/amd/pds_vdpa.rst
@@ -0,0 +1,85 @@
+.. SPDX-License-Identifier: GPL-2.0+
+.. note: can be edited and viewed with /usr/bin/formiko-vim
+
+==========================================================
+PCI vDPA driver for the AMD/Pensando(R) DSC adapter family
+==========================================================
+
+AMD/Pensando vDPA VF Device Driver
+
+Copyright(c) 2023 Advanced Micro Devices, Inc
+
+Overview
+========
+
+The ``pds_vdpa`` driver is an auxiliary bus driver that supplies
+a vDPA device for use by the virtio network stack. It is used with
+the Pensando Virtual Function devices that offer vDPA and virtio queue
+services. It depends on the ``pds_core`` driver and hardware for the PF
+and VF PCI handling as well as for device configuration services.
+
+Using the device
+================
+
+The ``pds_vdpa`` device is enabled via multiple configuration steps and
+depends on the ``pds_core`` driver to create and enable SR-IOV Virtual
+Function devices. After the VFs are enabled, we enable the vDPA service
+in the ``pds_core`` device to create the auxiliary devices used by pds_vdpa.
+
+Example steps:
+
+.. code-block:: bash
+
+  #!/bin/bash
+
+  modprobe pds_core
+  modprobe vdpa
+  modprobe pds_vdpa
+
+  PF_BDF=`ls /sys/module/pds_core/drivers/pci\:pds_core/*/sriov_numvfs | awk -F / '{print $7}'`
+
+  # Enable vDPA VF auxiliary device(s) in the PF
+  devlink dev param set pci/$PF_BDF name enable_vnet cmode runtime value true
+
+  # Create a VF for vDPA use
+  echo 1 > /sys/bus/pci/drivers/pds_core/$PF_BDF/sriov_numvfs
+
+  # Find the vDPA services/devices available
+  PDS_VDPA_MGMT=`vdpa mgmtdev show | grep vDPA | head -1 | cut -d: -f1`
+
+  # Create a vDPA device for use in virtio network configurations
+  vdpa dev add name vdpa1 mgmtdev $PDS_VDPA_MGMT mac 00:11:22:33:44:55
+
+  # Set up an ethernet interface on the vdpa device
+  modprobe virtio_vdpa
+
+
+
+Enabling the driver
+===================
+
+The driver is enabled via the standard kernel configuration system,
+using the make command::
+
+  make oldconfig/menuconfig/etc.
+
+The driver is located in the menu structure at:
+
+  -> Device Drivers
+    -> Network device support (NETDEVICES [=y])
+      -> Ethernet driver support
+        -> Pensando devices
+          -> Pensando Ethernet PDS_VDPA Support
+
+Support
+=======
+
+For general Linux networking support, please use the netdev mailing
+list, which is monitored by Pensando personnel::
+
+  netdev@vger.kernel.org
+
+For more specific support needs, please use the Pensando driver support
+email::
+
+  drivers@pensando.io
diff --git a/Documentation/networking/device_drivers/ethernet/index.rst b/Documentation/networking/device_drivers/ethernet/index.rst
index 417ca514a4d0..94ecb67c0885 100644
--- a/Documentation/networking/device_drivers/ethernet/index.rst
+++ b/Documentation/networking/device_drivers/ethernet/index.rst
@@ -15,6 +15,7 @@ Contents:
    amazon/ena
    altera/altera_tse
    amd/pds_core
+   amd/pds_vdpa
    aquantia/atlantic
    chelsio/cxgb
    cirrus/cs89x0
diff --git a/Documentation/networking/device_drivers/ethernet/intel/ice.rst b/Documentation/networking/device_drivers/ethernet/intel/ice.rst
index 69695e5511f4..e4d065c55ea8 100644
--- a/Documentation/networking/device_drivers/ethernet/intel/ice.rst
+++ b/Documentation/networking/device_drivers/ethernet/intel/ice.rst
@@ -84,24 +84,6 @@ Once the VM shuts down, or otherwise releases the VF, the command will
 complete.
 
 
-Important notes for SR-IOV and Link Aggregation
------------------------------------------------
-Link Aggregation is mutually exclusive with SR-IOV.
-
-- If Link Aggregation is active, SR-IOV VFs cannot be created on the PF.
-- If SR-IOV is active, you cannot set up Link Aggregation on the interface.
-
-Bridging and MACVLAN are also affected by this. If you wish to use bridging or
-MACVLAN with SR-IOV, you must set up bridging or MACVLAN before enabling
-SR-IOV. If you are using bridging or MACVLAN in conjunction with SR-IOV, and
-you want to remove the interface from the bridge or MACVLAN, you must follow
-these steps:
-
-1. Destroy SR-IOV VFs if they exist
-2. Remove the interface from the bridge or MACVLAN
-3. Recreate SRIOV VFs as needed
-
-
 Additional Features and Configurations
 ======================================
 
diff --git a/Documentation/networking/device_drivers/ethernet/marvell/octeontx2.rst b/Documentation/networking/device_drivers/ethernet/marvell/octeontx2.rst
index 5ba9015336e2..bfd233cfac35 100644
--- a/Documentation/networking/device_drivers/ethernet/marvell/octeontx2.rst
+++ b/Documentation/networking/device_drivers/ethernet/marvell/octeontx2.rst
@@ -13,6 +13,7 @@ Contents
 - `Drivers`_
 - `Basic packet flow`_
 - `Devlink health reporters`_
+- `Quality of service`_
 
 Overview
 ========
@@ -287,3 +288,47 @@ For example::
 	 NIX_AF_ERR:
 	        NIX Error Interrupt Reg : 64
 	        Rx on unmapped PF_FUNC
+
+
+Quality of service
+==================
+
+
+Hardware algorithms used in scheduling
+--------------------------------------
+
+octeontx2 silicon and CN10K transmit interface consists of five transmit levels
+starting from SMQ/MDQ, TL4 to TL1. Each packet will traverse MDQ, TL4 to TL1
+levels. Each level contains an array of queues to support scheduling and shaping.
+The hardware uses the below algorithms depending on the priority of scheduler queues.
+once the usercreates tc classes with different priorities, the driver configures
+schedulers allocated to the class with specified priority along with rate-limiting
+configuration.
+
+1. Strict Priority
+
+      - Once packets are submitted to MDQ, hardware picks all active MDQs having different priority
+        using strict priority.
+
+2. Round Robin
+
+      - Active MDQs having the same priority level are chosen using round robin.
+
+
+Setup HTB offload
+-----------------
+
+1. Enable HW TC offload on the interface::
+
+        # ethtool -K <interface> hw-tc-offload on
+
+2. Crate htb root::
+
+        # tc qdisc add dev <interface> clsact
+        # tc qdisc replace dev <interface> root handle 1: htb offload
+
+3. Create tc classes with different priorities::
+
+        # tc class add dev <interface> parent 1: classid 1:1 htb rate 10Gbit prio 1
+
+        # tc class add dev <interface> parent 1: classid 1:2 htb rate 10Gbit prio 7
diff --git a/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/counters.rst b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/counters.rst
index 6b2d1fe74ecf..a395df9c2751 100644
--- a/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/counters.rst
+++ b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/counters.rst
@@ -797,6 +797,16 @@ Counters on the NIC port that is connected to a eSwitch.
        RoCE/UD/RC traffic) [#accel]_.
      - Acceleration
 
+   * - `vport_loopback_packets`
+     - Unicast, multicast and broadcast packets that were loop-back (received
+       and transmitted), IB/Eth [#accel]_.
+     - Acceleration
+
+   * - `vport_loopback_bytes`
+     - Unicast, multicast and broadcast bytes that were loop-back (received
+       and transmitted), IB/Eth [#accel]_.
+     - Acceleration
+
    * - `rx_steer_missed_packets`
      - Number of packets that was received by the NIC, however was discarded
        because it did not match any flow in the NIC flow table.
diff --git a/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/devlink.rst b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/devlink.rst
index 3354ca3608ee..a4edf908b707 100644
--- a/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/devlink.rst
+++ b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/devlink.rst
@@ -290,6 +290,13 @@ Description of the vnic counters:
 - nic_receive_steering_discard
         number of packets that completed RX flow
         steering but were discarded due to a mismatch in flow table.
+- generated_pkt_steering_fail
+        number of packets generated by the VNIC experiencing unexpected steering
+        failure (at any point in steering flow).
+- handled_pkt_steering_fail
+        number of packets handled by the VNIC experiencing unexpected steering
+        failure (at any point in steering flow owned by the VNIC, including the FDB
+        for the eswitch owner).
 
 User commands examples:
 
diff --git a/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/switchdev.rst b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/switchdev.rst
index 01deedb71597..6e3f5ee8b0d0 100644
--- a/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/switchdev.rst
+++ b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/switchdev.rst
@@ -45,6 +45,28 @@ Following bridge VLAN functions are supported by mlx5:
 Subfunction
 ===========
 
+Subfunction which are spawned over the E-switch are created only with devlink
+device, and by default all the SF auxiliary devices are disabled.
+This will allow user to configure the SF before the SF have been fully probed,
+which will save time.
+
+Usage example:
+
+- Create SF::
+
+    $ devlink port add pci/0000:08:00.0 flavour pcisf pfnum 0 sfnum 11
+    $ devlink port function set pci/0000:08:00.0/32768 hw_addr 00:00:00:00:00:11 state active
+
+- Enable ETH auxiliary device::
+
+    $ devlink dev param set auxiliary/mlx5_core.sf.1 name enable_eth value true cmode driverinit
+
+- Now, in order to fully probe the SF, use devlink reload::
+
+    $ devlink dev reload auxiliary/mlx5_core.sf.1
+
+mlx5 supports ETH,rdma and vdpa (vnet) auxiliary devices devlink params (see :ref:`Documentation/networking/devlink/devlink-params.rst <devlink_params_generic>`).
+
 mlx5 supports subfunction management using devlink port (see :ref:`Documentation/networking/devlink/devlink-port.rst <devlink_port>`) interface.
 
 A subfunction has its own function capabilities and its own resources. This
diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst
index 80b8f73a0244..4a010a7cde7f 100644
--- a/Documentation/networking/ip-sysctl.rst
+++ b/Documentation/networking/ip-sysctl.rst
@@ -881,9 +881,10 @@ tcp_fastopen_key - list of comma separated 32-digit hexadecimal INTEGERs
 tcp_syn_retries - INTEGER
 	Number of times initial SYNs for an active TCP connection attempt
 	will be retransmitted. Should not be higher than 127. Default value
-	is 6, which corresponds to 63seconds till the last retransmission
-	with the current initial RTO of 1second. With this the final timeout
-	for an active TCP connection attempt will happen after 127seconds.
+	is 6, which corresponds to 67seconds (with tcp_syn_linear_timeouts = 4)
+	till the last retransmission with the current initial RTO of 1second.
+	With this the final timeout for an active TCP connection attempt
+	will happen after 131seconds.
 
 tcp_timestamps - INTEGER
 	Enable timestamps as defined in RFC1323.
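Note on the tcp_syn_retries hunk above: the 67 and 131 second figures can be reproduced from the RTO sequence 1, 1, 1, 1, 1, 2, 4, ... given in the tcp_syn_linear_timeouts description of this same diff. The C sketch below is a worked check under stated assumptions, not kernel code. It assumes that the linear timeouts come in addition to tcp_syn_retries (the only reading that yields 67 = 4 + 63), and it ignores the kernel's upper bound on the RTO. All function names are illustrative.

```c
#include <assert.h>

/*
 * Length in seconds of the n-th SYN timeout, counting from 0.
 * The first `linear` timeouts keep the initial RTO; after that the RTO
 * doubles each time, starting from 2^0 * initial_rto.
 */
static unsigned int syn_rto(unsigned int n, unsigned int linear,
                            unsigned int initial_rto)
{
    if (n < linear)
        return initial_rto;
    /* Keep the shift within the width of unsigned int. */
    assert(n - linear < sizeof(unsigned int) * 8);
    return initial_rto << (n - linear);
}

/*
 * Seconds from the first SYN until the last retransmission.
 * Assumption: syn_retries + linear retransmissions are sent in total.
 */
static unsigned int secs_to_last_retransmit(unsigned int syn_retries,
                                            unsigned int linear,
                                            unsigned int initial_rto)
{
    unsigned int total = 0;

    for (unsigned int n = 0; n < syn_retries + linear; n++)
        total += syn_rto(n, linear, initial_rto);
    return total;
}

/* Seconds until the connection attempt finally times out. */
static unsigned int secs_to_final_timeout(unsigned int syn_retries,
                                          unsigned int linear,
                                          unsigned int initial_rto)
{
    return secs_to_last_retransmit(syn_retries, linear, initial_rto) +
           syn_rto(syn_retries + linear, linear, initial_rto);
}
```

With syn_retries = 6, linear = 4 and an initial RTO of 1 second the two functions return 67 and 131, matching the added text. With linear = 0 they return 63 and 127, matching the removed text.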
@@ -946,6 +947,16 @@ tcp_pacing_ca_ratio - INTEGER
 
 	Default: 120
 
+tcp_syn_linear_timeouts - INTEGER
+	The number of times for an active TCP connection to retransmit SYNs with
+	a linear backoff timeout before defaulting to an exponential backoff
+	timeout. This has no effect on SYNACK at the passive TCP side.
+
+	With an initial RTO of 1 and tcp_syn_linear_timeouts = 4 we would
+	expect SYN RTOs to be: 1, 1, 1, 1, 1, 2, 4, ... (4 linear timeouts,
+	and the first exponential backoff using 2^0 * initial_RTO).
+	Default: 4
+
 tcp_tso_win_divisor - INTEGER
 	This allows control over what percentage of the congestion window
 	can be consumed by a single TSO frame.
@@ -970,6 +981,21 @@ tcp_tw_reuse - INTEGER
 tcp_window_scaling - BOOLEAN
 	Enable window scaling as defined in RFC1323.
 
+tcp_shrink_window - BOOLEAN
+	This changes how the TCP receive window is calculated.
+
+	RFC 7323, section 2.4, says there are instances when a retracted
+	window can be offered, and that TCP implementations MUST ensure
+	that they handle a shrinking window, as specified in RFC 1122.
+
+	- 0 - Disabled. The window is never shrunk.
+	- 1 - Enabled. The window is shrunk when necessary to remain within
+	  the memory limit set by autotuning (sk_rcvbuf).
+	  This only occurs if a non-zero receive window
+	  scaling factor is also in effect.
+
+	Default: 0
+
 tcp_wmem - vector of 3 INTEGERs: min, default, max
 	min: Amount of memory reserved for send buffers for TCP sockets.
 	Each TCP socket has rights to use it due to fact of its birth.
diff --git a/Documentation/networking/scaling.rst b/Documentation/networking/scaling.rst
index 3d435caa3ef2..92c9fb46d6a2 100644
--- a/Documentation/networking/scaling.rst
+++ b/Documentation/networking/scaling.rst
@@ -269,8 +269,8 @@ a single application thread handles flows with many different flow hashes.
 rps_sock_flow_table is a global flow table that contains the *desired* CPU
 for flows: the CPU that is currently processing the flow in userspace.
 Each table value is a CPU index that is updated during calls to recvmsg
-and sendmsg (specifically, inet_recvmsg(), inet_sendmsg(), inet_sendpage()
-and tcp_splice_read()).
+and sendmsg (specifically, inet_recvmsg(), inet_sendmsg() and
+tcp_splice_read()).
 
 When the scheduler moves a thread to a new CPU while it has outstanding
 receive packets on the old CPU, packets may arrive out of order. To