summaryrefslogtreecommitdiff
path: root/Documentation
diff options
context:
space:
mode:
authorLinus Torvalds <torvalds@linux-foundation.org>2022-12-13 15:47:48 -0800
committerLinus Torvalds <torvalds@linux-foundation.org>2022-12-13 15:47:48 -0800
commit7e68dd7d07a28faa2e6574dd6b9dbd90cdeaae91 (patch)
treeae0427c5a3b905f24b3a44b510a9bcf35d9b67a3 /Documentation
parent1ca06f1c1acecbe02124f14a37cce347b8c1a90c (diff)
parent7c4a6309e27f411743817fe74a832ec2d2798a4b (diff)
Merge tag 'net-next-6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from Paolo Abeni: "Core: - Allow live renaming when an interface is up - Add retpoline wrappers for tc, improving considerably the performances of complex queue discipline configurations - Add inet drop monitor support - A few GRO performance improvements - Add infrastructure for atomic dev stats, addressing long standing data races - De-duplicate common code between OVS and conntrack offloading infrastructure - A bunch of UBSAN_BOUNDS/FORTIFY_SOURCE improvements - Netfilter: introduce packet parser for tunneled packets - Replace IPVS timer-based estimators with kthreads to scale up the workload with the number of available CPUs - Add the helper support for connection-tracking OVS offload BPF: - Support for user defined BPF objects: the use case is to allocate own objects, build own object hierarchies and use the building blocks to build own data structures flexibly, for example, linked lists in BPF - Make cgroup local storage available to non-cgroup attached BPF programs - Avoid unnecessary deadlock detection and failures wrt BPF task storage helpers - A relevant bunch of BPF verifier fixes and improvements - Veristat tool improvements to support custom filtering, sorting, and replay of results - Add LLVM disassembler as default library for dumping JITed code - Lots of new BPF documentation for various BPF maps - Add bpf_rcu_read_{,un}lock() support for sleepable programs - Add RCU grace period chaining to BPF to wait for the completion of access from both sleepable and non-sleepable BPF programs - Add support storing struct task_struct objects as kptrs in maps - Improve helper UAPI by explicitly defining BPF_FUNC_xxx integer values - Add libbpf *_opts API-variants for bpf_*_get_fd_by_id() functions Protocols: - TCP: implement Protective Load Balancing across switch links - TCP: allow dynamically disabling TCP-MD5 static key, reverting back to fast[er]-path - UDP: Introduce optional per-netns hash lookup table - IPv6: simplify and cleanup sockets disposal - Netlink: support different type policies for each generic netlink operation - MPTCP: add MSG_FASTOPEN and FastOpen listener side support - MPTCP: add netlink notification support for listener sockets events - SCTP: add VRF support, allowing sctp sockets binding to VRF devices - Add bridging MAC Authentication Bypass (MAB) support - Extensions for Ethernet VPN bridging implementation to better support multicast scenarios - More work for Wi-Fi 7 support, comprising conversion of all the existing drivers to internal TX queue usage - IPSec: introduce a new offload type (packet offload) allowing complete header processing and crypto offloading - IPSec: extended ack support for more descriptive XFRM error reporting - RXRPC: increase SACK table size and move processing into a per-local endpoint kernel thread, reducing considerably the required locking - IEEE 802154: synchronous send frame and extended filtering support, initial support for scanning available 15.4 networks - Tun: bump the link speed from 10Mbps to 10Gbps - Tun/VirtioNet: implement UDP segmentation offload support Driver API: - PHY/SFP: improve power level switching between standard level 1 and the higher power levels - New API for netdev <-> devlink_port linkage - PTP: convert existing drivers to new frequency adjustment implementation - DSA: add support for rx offloading - Autoload DSA tagging driver when dynamically changing protocol - Add new PCP and APPTRUST attributes to Data Center Bridging - Add configuration support for 800Gbps link speed - Add devlink port function attribute to enable/disable RoCE and migratable - Extend devlink-rate to support strict prioriry and weighted fair queuing - Add devlink support to directly reading from region memory - New device tree helper to fetch MAC address from nvmem - New big TCP helper to simplify temporary header stripping New hardware / drivers: - Ethernet: - Marvel Octeon CNF95N and CN10KB Ethernet Switches - Marvel Prestera AC5X Ethernet Switch - WangXun 10 Gigabit NIC - Motorcomm yt8521 Gigabit Ethernet - Microchip ksz9563 Gigabit Ethernet Switch - Microsoft Azure Network Adapter - Linux Automation 10Base-T1L adapter - PHY: - Aquantia AQR112 and AQR412 - Motorcomm YT8531S - PTP: - Orolia ART-CARD - WiFi: - MediaTek Wi-Fi 7 (802.11be) devices - RealTek rtw8821cu, rtw8822bu, rtw8822cu and rtw8723du USB devices - Bluetooth: - Broadcom BCM4377/4378/4387 Bluetooth chipsets - Realtek RTL8852BE and RTL8723DS - Cypress.CYW4373A0 WiFi + Bluetooth combo device Drivers: - CAN: - gs_usb: bus error reporting support - kvaser_usb: listen only and bus error reporting support - Ethernet NICs: - Intel (100G): - extend action skbedit to RX queue mapping - implement devlink-rate support - support direct read from memory - nVidia/Mellanox (mlx5): - SW steering improvements, increasing rules update rate - Support for enhanced events compression - extend H/W offload packet manipulation capabilities - implement IPSec packet offload mode - nVidia/Mellanox (mlx4): - better big TCP support - Netronome Ethernet NICs (nfp): - IPsec offload support - add support for multicast filter - Broadcom: - RSS and PTP support improvements - AMD/SolarFlare: - netlink extened ack improvements - add basic flower matches to offload, and related stats - Virtual NICs: - ibmvnic: introduce affinity hint support - small / embedded: - FreeScale fec: add initial XDP support - Marvel mv643xx_eth: support MII/GMII/RGMII modes for Kirkwood - TI am65-cpsw: add suspend/resume support - Mediatek MT7986: add RX wireless wthernet dispatch support - Realtek 8169: enable GRO software interrupt coalescing per default - Ethernet high-speed switches: - Microchip (sparx5): - add support for Sparx5 TC/flower H/W offload via VCAP - Mellanox mlxsw: - add 802.1X and MAC Authentication Bypass offload support - add ip6gre support - Embedded Ethernet switches: - Mediatek (mtk_eth_soc): - improve PCS implementation, add DSA untag support - enable flow offload support - Renesas: - add rswitch R-Car Gen4 gPTP support - Microchip (lan966x): - add full XDP support - add TC H/W offload via VCAP - enable PTP on bridge interfaces - Microchip (ksz8): - add MTU support for KSZ8 series - Qualcomm 802.11ax WiFi (ath11k): - support configuring channel dwell time during scan - MediaTek WiFi (mt76): - enable Wireless Ethernet Dispatch (WED) offload support - add ack signal support - enable coredump support - remain_on_channel support - Intel WiFi (iwlwifi): - enable Wi-Fi 7 Extremely High Throughput (EHT) PHY capabilities - 320 MHz channels support - RealTek WiFi (rtw89): - new dynamic header firmware format support - wake-over-WLAN support" * tag 'net-next-6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2002 commits) ipvs: fix type warning in do_div() on 32 bit net: lan966x: Remove a useless test in lan966x_ptp_add_trap() net: ipa: add IPA v4.7 support dt-bindings: net: qcom,ipa: Add SM6350 compatible bnxt: Use generic HBH removal helper in tx path IPv6/GRO: generic helper to remove temporary HBH/jumbo header in driver selftests: forwarding: Add bridge MDB test selftests: forwarding: Rename bridge_mdb test bridge: mcast: Support replacement of MDB port group entries bridge: mcast: Allow user space to specify MDB entry routing protocol bridge: mcast: Allow user space to add (*, G) with a source list and filter mode bridge: mcast: Add support for (*, G) with a source list and filter mode bridge: mcast: Avoid arming group timer when (S, G) corresponds to a source bridge: mcast: Add a flag for user installed source entries bridge: mcast: Expose __br_multicast_del_group_src() bridge: mcast: Expose br_multicast_new_group_src() bridge: mcast: Add a centralized error path bridge: mcast: Place netlink policy before validation functions bridge: mcast: Split (*, G) and (S, G) addition into different functions bridge: mcast: Do not derive entry type from its filter mode ...
Diffstat (limited to 'Documentation')
-rw-r--r--Documentation/bpf/bpf_design_QA.rst45
-rw-r--r--Documentation/bpf/bpf_devel_QA.rst27
-rw-r--r--Documentation/bpf/bpf_iterators.rst485
-rw-r--r--Documentation/bpf/btf.rst7
-rw-r--r--Documentation/bpf/index.rst2
-rw-r--r--Documentation/bpf/instruction-set.rst4
-rw-r--r--Documentation/bpf/kfuncs.rst255
-rw-r--r--Documentation/bpf/libbpf/index.rst3
-rw-r--r--Documentation/bpf/libbpf/program_types.rst203
-rw-r--r--Documentation/bpf/map_array.rst262
-rw-r--r--Documentation/bpf/map_bloom_filter.rst174
-rw-r--r--Documentation/bpf/map_cgrp_storage.rst109
-rw-r--r--Documentation/bpf/map_cpumap.rst177
-rw-r--r--Documentation/bpf/map_devmap.rst238
-rw-r--r--Documentation/bpf/map_hash.rst33
-rw-r--r--Documentation/bpf/map_lpm_trie.rst197
-rw-r--r--Documentation/bpf/map_of_maps.rst130
-rw-r--r--Documentation/bpf/map_queue_stack.rst146
-rw-r--r--Documentation/bpf/map_sk_storage.rst155
-rw-r--r--Documentation/bpf/map_xskmap.rst192
-rw-r--r--Documentation/bpf/maps.rst101
-rw-r--r--Documentation/bpf/programs.rst3
-rw-r--r--Documentation/bpf/redirect.rst81
-rw-r--r--Documentation/devicetree/bindings/arm/mediatek/mediatek,mt7622-wed.yaml52
-rw-r--r--Documentation/devicetree/bindings/interrupt-controller/fsl,intmux.yaml3
-rw-r--r--Documentation/devicetree/bindings/net/adi,adin1110.yaml4
-rw-r--r--Documentation/devicetree/bindings/net/asix,ax88178.yaml4
-rw-r--r--Documentation/devicetree/bindings/net/bluetooth.txt5
-rw-r--r--Documentation/devicetree/bindings/net/bluetooth/bluetooth-controller.yaml29
-rw-r--r--Documentation/devicetree/bindings/net/bluetooth/brcm,bcm4377-bluetooth.yaml81
-rw-r--r--Documentation/devicetree/bindings/net/bluetooth/qualcomm-bluetooth.yaml (renamed from Documentation/devicetree/bindings/net/qualcomm-bluetooth.yaml)6
-rw-r--r--Documentation/devicetree/bindings/net/broadcom-bluetooth.yaml3
-rw-r--r--Documentation/devicetree/bindings/net/can/fsl,flexcan.yaml1
-rw-r--r--Documentation/devicetree/bindings/net/can/renesas,rcar-canfd.yaml135
-rw-r--r--Documentation/devicetree/bindings/net/dsa/dsa-port.yaml3
-rw-r--r--Documentation/devicetree/bindings/net/dsa/hirschmann,hellcreek.yaml2
-rw-r--r--Documentation/devicetree/bindings/net/dsa/renesas,rzn1-a5psw.yaml2
-rw-r--r--Documentation/devicetree/bindings/net/ethernet-controller.yaml11
-rw-r--r--Documentation/devicetree/bindings/net/fsl,fec.yaml4
-rw-r--r--Documentation/devicetree/bindings/net/fsl,fman-dtsec.yaml53
-rw-r--r--Documentation/devicetree/bindings/net/fsl,qoriq-mc-dpmac.yaml2
-rw-r--r--Documentation/devicetree/bindings/net/fsl-fman.txt5
-rw-r--r--Documentation/devicetree/bindings/net/marvell,dfx-server.yaml62
-rw-r--r--Documentation/devicetree/bindings/net/marvell,pp2.yaml305
-rw-r--r--Documentation/devicetree/bindings/net/marvell,prestera.txt81
-rw-r--r--Documentation/devicetree/bindings/net/marvell,prestera.yaml91
-rw-r--r--Documentation/devicetree/bindings/net/marvell-pp2.txt141
-rw-r--r--Documentation/devicetree/bindings/net/microchip,lan95xx.yaml4
-rw-r--r--Documentation/devicetree/bindings/net/nfc/nxp,nci.yaml4
-rw-r--r--Documentation/devicetree/bindings/net/nxp,dwmac-imx.yaml4
-rw-r--r--Documentation/devicetree/bindings/net/pcs/fsl,lynx-pcs.yaml40
-rw-r--r--Documentation/devicetree/bindings/net/qca,ar71xx.yaml1
-rw-r--r--Documentation/devicetree/bindings/net/qcom,ipa.yaml86
-rw-r--r--Documentation/devicetree/bindings/net/qcom,ipq4019-mdio.yaml46
-rw-r--r--Documentation/devicetree/bindings/net/realtek-bluetooth.yaml1
-rw-r--r--Documentation/devicetree/bindings/net/renesas,r8a779f0-ether-switch.yaml262
-rw-r--r--Documentation/devicetree/bindings/net/sff,sfp.yaml3
-rw-r--r--Documentation/devicetree/bindings/net/snps,dwmac.yaml345
-rw-r--r--Documentation/devicetree/bindings/net/socionext,synquacer-netsec.yaml73
-rw-r--r--Documentation/devicetree/bindings/net/socionext-netsec.txt56
-rw-r--r--Documentation/devicetree/bindings/net/xilinx_axienet.txt2
-rw-r--r--Documentation/devicetree/bindings/soc/mediatek/mediatek,mt7986-wo-ccif.yaml51
-rw-r--r--Documentation/devicetree/bindings/soc/qcom/qcom,wcnss.yaml8
-rw-r--r--Documentation/networking/bonding.rst4
-rw-r--r--Documentation/networking/can.rst33
-rw-r--r--Documentation/networking/device_drivers/ethernet/freescale/dpaa2/mac-phy-support.rst9
-rw-r--r--Documentation/networking/device_drivers/ethernet/marvell/octeon_ep.rst1
-rw-r--r--Documentation/networking/device_drivers/ethernet/mellanox/mlx5.rst124
-rw-r--r--Documentation/networking/device_drivers/ethernet/netronome/nfp.rst165
-rw-r--r--Documentation/networking/devlink/devlink-info.rst5
-rw-r--r--Documentation/networking/devlink/devlink-port.rst168
-rw-r--r--Documentation/networking/devlink/devlink-region.rst13
-rw-r--r--Documentation/networking/devlink/devlink-trap.rst13
-rw-r--r--Documentation/networking/devlink/etas_es58x.rst36
-rw-r--r--Documentation/networking/devlink/ice.rst128
-rw-r--r--Documentation/networking/ethtool-netlink.rst32
-rw-r--r--Documentation/networking/index.rst1
-rw-r--r--Documentation/networking/ip-sysctl.rst111
-rw-r--r--Documentation/networking/ipvs-sysctl.rst24
-rw-r--r--Documentation/networking/tc-queue-filters.rst37
-rw-r--r--Documentation/networking/timestamping.rst32
-rw-r--r--Documentation/networking/xfrm_device.rst62
82 files changed, 5425 insertions, 673 deletions
diff --git a/Documentation/bpf/bpf_design_QA.rst b/Documentation/bpf/bpf_design_QA.rst
index a210b8a4df00..cec2371173d7 100644
--- a/Documentation/bpf/bpf_design_QA.rst
+++ b/Documentation/bpf/bpf_design_QA.rst
@@ -298,3 +298,48 @@ A: NO.
The BTF_ID macro does not cause a function to become part of the ABI
any more than does the EXPORT_SYMBOL_GPL macro.
+
+Q: What is the compatibility story for special BPF types in map values?
+-----------------------------------------------------------------------
+Q: Users are allowed to embed bpf_spin_lock, bpf_timer fields in their BPF map
+values (when using BTF support for BPF maps). This allows to use helpers for
+such objects on these fields inside map values. Users are also allowed to embed
+pointers to some kernel types (with __kptr and __kptr_ref BTF tags). Will the
+kernel preserve backwards compatibility for these features?
+
+A: It depends. For bpf_spin_lock, bpf_timer: YES, for kptr and everything else:
+NO, but see below.
+
+For struct types that have been added already, like bpf_spin_lock and bpf_timer,
+the kernel will preserve backwards compatibility, as they are part of UAPI.
+
+For kptrs, they are also part of UAPI, but only with respect to the kptr
+mechanism. The types that you can use with a __kptr and __kptr_ref tagged
+pointer in your struct are NOT part of the UAPI contract. The supported types can
+and will change across kernel releases. However, operations like accessing kptr
+fields and bpf_kptr_xchg() helper will continue to be supported across kernel
+releases for the supported types.
+
+For any other supported struct type, unless explicitly stated in this document
+and added to bpf.h UAPI header, such types can and will arbitrarily change their
+size, type, and alignment, or any other user visible API or ABI detail across
+kernel releases. The users must adapt their BPF programs to the new changes and
+update them to make sure their programs continue to work correctly.
+
+NOTE: BPF subsystem specially reserves the 'bpf\_' prefix for type names, in
+order to introduce more special fields in the future. Hence, user programs must
+avoid defining types with 'bpf\_' prefix to not be broken in future releases.
+In other words, no backwards compatibility is guaranteed if one using a type
+in BTF with 'bpf\_' prefix.
+
+Q: What is the compatibility story for special BPF types in allocated objects?
+------------------------------------------------------------------------------
+Q: Same as above, but for allocated objects (i.e. objects allocated using
+bpf_obj_new for user defined types). Will the kernel preserve backwards
+compatibility for these features?
+
+A: NO.
+
+Unlike map value types, there are no stability guarantees for this case. The
+whole API to work with allocated objects and any support for special fields
+inside them is unstable (since it is exposed through kfuncs).
diff --git a/Documentation/bpf/bpf_devel_QA.rst b/Documentation/bpf/bpf_devel_QA.rst
index 761474bd7fe6..03d4993eda6f 100644
--- a/Documentation/bpf/bpf_devel_QA.rst
+++ b/Documentation/bpf/bpf_devel_QA.rst
@@ -44,6 +44,33 @@ is a guarantee that the reported issue will be overlooked.**
Submitting patches
==================
+Q: How do I run BPF CI on my changes before sending them out for review?
+------------------------------------------------------------------------
+A: BPF CI is GitHub based and hosted at https://github.com/kernel-patches/bpf.
+While GitHub also provides a CLI that can be used to accomplish the same
+results, here we focus on the UI based workflow.
+
+The following steps lay out how to start a CI run for your patches:
+
+- Create a fork of the aforementioned repository in your own account (one time
+ action)
+
+- Clone the fork locally, check out a new branch tracking either the bpf-next
+ or bpf branch, and apply your to-be-tested patches on top of it
+
+- Push the local branch to your fork and create a pull request against
+ kernel-patches/bpf's bpf-next_base or bpf_base branch, respectively
+
+Shortly after the pull request has been created, the CI workflow will run. Note
+that capacity is shared with patches submitted upstream being checked and so
+depending on utilization the run can take a while to finish.
+
+Note furthermore that both base branches (bpf-next_base and bpf_base) will be
+updated as patches are pushed to the respective upstream branches they track. As
+such, your patch set will automatically (be attempted to) be rebased as well.
+This behavior can result in a CI run being aborted and restarted with the new
+base line.
+
Q: To which mailing list do I need to submit my BPF patches?
------------------------------------------------------------
A: Please submit your BPF patches to the bpf kernel mailing list:
diff --git a/Documentation/bpf/bpf_iterators.rst b/Documentation/bpf/bpf_iterators.rst
new file mode 100644
index 000000000000..6d7770793fab
--- /dev/null
+++ b/Documentation/bpf/bpf_iterators.rst
@@ -0,0 +1,485 @@
+=============
+BPF Iterators
+=============
+
+
+----------
+Motivation
+----------
+
+There are a few existing ways to dump kernel data into user space. The most
+popular one is the ``/proc`` system. For example, ``cat /proc/net/tcp6`` dumps
+all tcp6 sockets in the system, and ``cat /proc/net/netlink`` dumps all netlink
+sockets in the system. However, their output format tends to be fixed, and if
+users want more information about these sockets, they have to patch the kernel,
+which often takes time to publish upstream and release. The same is true for popular
+tools like `ss <https://man7.org/linux/man-pages/man8/ss.8.html>`_ where any
+additional information needs a kernel patch.
+
+To solve this problem, the `drgn
+<https://www.kernel.org/doc/html/latest/bpf/drgn.html>`_ tool is often used to
+dig out the kernel data with no kernel change. However, the main drawback for
+drgn is performance, as it cannot do pointer tracing inside the kernel. In
+addition, drgn cannot validate a pointer value and may read invalid data if the
+pointer becomes invalid inside the kernel.
+
+The BPF iterator solves the above problem by providing flexibility on what data
+(e.g., tasks, bpf_maps, etc.) to collect by calling BPF programs for each kernel
+data object.
+
+----------------------
+How BPF Iterators Work
+----------------------
+
+A BPF iterator is a type of BPF program that allows users to iterate over
+specific types of kernel objects. Unlike traditional BPF tracing programs that
+allow users to define callbacks that are invoked at particular points of
+execution in the kernel, BPF iterators allow users to define callbacks that
+should be executed for every entry in a variety of kernel data structures.
+
+For example, users can define a BPF iterator that iterates over every task on
+the system and dumps the total amount of CPU runtime currently used by each of
+them. Another BPF task iterator may instead dump the cgroup information for each
+task. Such flexibility is the core value of BPF iterators.
+
+A BPF program is always loaded into the kernel at the behest of a user space
+process. A user space process loads a BPF program by opening and initializing
+the program skeleton as required and then invoking a syscall to have the BPF
+program verified and loaded by the kernel.
+
+In traditional tracing programs, a program is activated by having user space
+obtain a ``bpf_link`` to the program with ``bpf_program__attach()``. Once
+activated, the program callback will be invoked whenever the tracepoint is
+triggered in the main kernel. For BPF iterator programs, a ``bpf_link`` to the
+program is obtained using ``bpf_link_create()``, and the program callback is
+invoked by issuing system calls from user space.
+
+Next, let us see how you can use the iterators to iterate on kernel objects and
+read data.
+
+------------------------
+How to Use BPF iterators
+------------------------
+
+BPF selftests are a great resource to illustrate how to use the iterators. In
+this section, we’ll walk through a BPF selftest which shows how to load and use
+a BPF iterator program. To begin, we’ll look at `bpf_iter.c
+<https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/tools/testing/selftests/bpf/prog_tests/bpf_iter.c>`_,
+which illustrates how to load and trigger BPF iterators on the user space side.
+Later, we’ll look at a BPF program that runs in kernel space.
+
+Loading a BPF iterator in the kernel from user space typically involves the
+following steps:
+
+* The BPF program is loaded into the kernel through ``libbpf``. Once the kernel
+ has verified and loaded the program, it returns a file descriptor (fd) to user
+ space.
+* Obtain a ``link_fd`` to the BPF program by calling the ``bpf_link_create()``
+ specified with the BPF program file descriptor received from the kernel.
+* Next, obtain a BPF iterator file descriptor (``bpf_iter_fd``) by calling the
+ ``bpf_iter_create()`` specified with the ``bpf_link`` received from Step 2.
+* Trigger the iteration by calling ``read(bpf_iter_fd)`` until no data is
+ available.
+* Close the iterator fd using ``close(bpf_iter_fd)``.
+* If needed to reread the data, get a new ``bpf_iter_fd`` and do the read again.
+
+The following are a few examples of selftest BPF iterator programs:
+
+* `bpf_iter_tcp4.c <https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/tools/testing/selftests/bpf/progs/bpf_iter_tcp4.c>`_
+* `bpf_iter_task_vma.c <https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/tools/testing/selftests/bpf/progs/bpf_iter_task_vma.c>`_
+* `bpf_iter_task_file.c <https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/tools/testing/selftests/bpf/progs/bpf_iter_task_file.c>`_
+
+Let us look at ``bpf_iter_task_file.c``, which runs in kernel space:
+
+Here is the definition of ``bpf_iter__task_file`` in `vmlinux.h
+<https://facebookmicrosites.github.io/bpf/blog/2020/02/19/bpf-portability-and-co-re.html#btf>`_.
+Any struct name in ``vmlinux.h`` in the format ``bpf_iter__<iter_name>``
+represents a BPF iterator. The suffix ``<iter_name>`` represents the type of
+iterator.
+
+::
+
+ struct bpf_iter__task_file {
+ union {
+ struct bpf_iter_meta *meta;
+ };
+ union {
+ struct task_struct *task;
+ };
+ u32 fd;
+ union {
+ struct file *file;
+ };
+ };
+
+In the above code, the field 'meta' contains the metadata, which is the same for
+all BPF iterator programs. The rest of the fields are specific to different
+iterators. For example, for task_file iterators, the kernel layer provides the
+'task', 'fd' and 'file' field values. The 'task' and 'file' are `reference
+counted
+<https://facebookmicrosites.github.io/bpf/blog/2018/08/31/object-lifetime.html#file-descriptors-and-reference-counters>`_,
+so they won't go away when the BPF program runs.
+
+Here is a snippet from the ``bpf_iter_task_file.c`` file:
+
+::
+
+ SEC("iter/task_file")
+ int dump_task_file(struct bpf_iter__task_file *ctx)
+ {
+ struct seq_file *seq = ctx->meta->seq;
+ struct task_struct *task = ctx->task;
+ struct file *file = ctx->file;
+ __u32 fd = ctx->fd;
+
+ if (task == NULL || file == NULL)
+ return 0;
+
+ if (ctx->meta->seq_num == 0) {
+ count = 0;
+ BPF_SEQ_PRINTF(seq, " tgid gid fd file\n");
+ }
+
+ if (tgid == task->tgid && task->tgid != task->pid)
+ count++;
+
+ if (last_tgid != task->tgid) {
+ last_tgid = task->tgid;
+ unique_tgid_count++;
+ }
+
+ BPF_SEQ_PRINTF(seq, "%8d %8d %8d %lx\n", task->tgid, task->pid, fd,
+ (long)file->f_op);
+ return 0;
+ }
+
+In the above example, the section name ``SEC(iter/task_file)``, indicates that
+the program is a BPF iterator program to iterate all files from all tasks. The
+context of the program is ``bpf_iter__task_file`` struct.
+
+The user space program invokes the BPF iterator program running in the kernel
+by issuing a ``read()`` syscall. Once invoked, the BPF
+program can export data to user space using a variety of BPF helper functions.
+You can use either ``bpf_seq_printf()`` (and BPF_SEQ_PRINTF helper macro) or
+``bpf_seq_write()`` function based on whether you need formatted output or just
+binary data, respectively. For binary-encoded data, the user space applications
+can process the data from ``bpf_seq_write()`` as needed. For the formatted data,
+you can use ``cat <path>`` to print the results similar to ``cat
+/proc/net/netlink`` after pinning the BPF iterator to the bpffs mount. Later,
+use ``rm -f <path>`` to remove the pinned iterator.
+
+For example, you can use the following command to create a BPF iterator from the
+``bpf_iter_ipv6_route.o`` object file and pin it to the ``/sys/fs/bpf/my_route``
+path:
+
+::
+
+ $ bpftool iter pin ./bpf_iter_ipv6_route.o /sys/fs/bpf/my_route
+
+And then print out the results using the following command:
+
+::
+
+ $ cat /sys/fs/bpf/my_route
+
+
+-------------------------------------------------------
+Implement Kernel Support for BPF Iterator Program Types
+-------------------------------------------------------
+
+To implement a BPF iterator in the kernel, the developer must make a one-time
+change to the following key data structure defined in the `bpf.h
+<https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/tree/include/linux/bpf.h>`_
+file.
+
+::
+
+ struct bpf_iter_reg {
+ const char *target;
+ bpf_iter_attach_target_t attach_target;
+ bpf_iter_detach_target_t detach_target;
+ bpf_iter_show_fdinfo_t show_fdinfo;
+ bpf_iter_fill_link_info_t fill_link_info;
+ bpf_iter_get_func_proto_t get_func_proto;
+ u32 ctx_arg_info_size;
+ u32 feature;
+ struct bpf_ctx_arg_aux ctx_arg_info[BPF_ITER_CTX_ARG_MAX];
+ const struct bpf_iter_seq_info *seq_info;
+ };
+
+After filling the data structure fields, call ``bpf_iter_reg_target()`` to
+register the iterator to the main BPF iterator subsystem.
+
+The following is the breakdown for each field in struct ``bpf_iter_reg``.
+
+.. list-table::
+ :widths: 25 50
+ :header-rows: 1
+
+ * - Fields
+ - Description
+ * - target
+ - Specifies the name of the BPF iterator. For example: ``bpf_map``,
+ ``bpf_map_elem``. The name should be different from other ``bpf_iter`` target names in the kernel.
+ * - attach_target and detach_target
+ - Allows for target specific ``link_create`` action since some targets
+ may need special processing. Called during the user space link_create stage.
+ * - show_fdinfo and fill_link_info
+ - Called to fill target specific information when user tries to get link
+ info associated with the iterator.
+ * - get_func_proto
+ - Permits a BPF iterator to access BPF helpers specific to the iterator.
+ * - ctx_arg_info_size and ctx_arg_info
+ - Specifies the verifier states for BPF program arguments associated with
+ the bpf iterator.
+ * - feature
+ - Specifies certain action requests in the kernel BPF iterator
+ infrastructure. Currently, only BPF_ITER_RESCHED is supported. This means
+ that the kernel function cond_resched() is called to avoid other kernel
+ subsystem (e.g., rcu) misbehaving.
+ * - seq_info
+ - Specifies certain action requests in the kernel BPF iterator
+ infrastructure. Currently, only BPF_ITER_RESCHED is supported. This means
+ that the kernel function cond_resched() is called to avoid other kernel
+ subsystem (e.g., rcu) misbehaving.
+
+
+`Click here
+<https://lore.kernel.org/bpf/20210212183107.50963-2-songliubraving@fb.com/>`_
+to see an implementation of the ``task_vma`` BPF iterator in the kernel.
+
+---------------------------------
+Parameterizing BPF Task Iterators
+---------------------------------
+
+By default, BPF iterators walk through all the objects of the specified types
+(processes, cgroups, maps, etc.) across the entire system to read relevant
+kernel data. But often, there are cases where we only care about a much smaller
+subset of iterable kernel objects, such as only iterating tasks within a
+specific process. Therefore, BPF iterator programs support filtering out objects
+from iteration by allowing user space to configure the iterator program when it
+is attached.
+
+--------------------------
+BPF Task Iterator Program
+--------------------------
+
+The following code is a BPF iterator program to print files and task information
+through the ``seq_file`` of the iterator. It is a standard BPF iterator program
+that visits every file of an iterator. We will use this BPF program in our
+example later.
+
+::
+
+ #include <vmlinux.h>
+ #include <bpf/bpf_helpers.h>
+
+ char _license[] SEC("license") = "GPL";
+
+ SEC("iter/task_file")
+ int dump_task_file(struct bpf_iter__task_file *ctx)
+ {
+ struct seq_file *seq = ctx->meta->seq;
+ struct task_struct *task = ctx->task;
+ struct file *file = ctx->file;
+ __u32 fd = ctx->fd;
+ if (task == NULL || file == NULL)
+ return 0;
+ if (ctx->meta->seq_num == 0) {
+ BPF_SEQ_PRINTF(seq, " tgid pid fd file\n");
+ }
+ BPF_SEQ_PRINTF(seq, "%8d %8d %8d %lx\n", task->tgid, task->pid, fd,
+ (long)file->f_op);
+ return 0;
+ }
+
+----------------------------------------
+Creating a File Iterator with Parameters
+----------------------------------------
+
+Now, let us look at how to create an iterator that includes only files of a
+process.
+
+First, fill the ``bpf_iter_attach_opts`` struct as shown below:
+
+::
+
+ LIBBPF_OPTS(bpf_iter_attach_opts, opts);
+ union bpf_iter_link_info linfo;
+ memset(&linfo, 0, sizeof(linfo));
+ linfo.task.pid = getpid();
+ opts.link_info = &linfo;
+ opts.link_info_len = sizeof(linfo);
+
+``linfo.task.pid``, if it is non-zero, directs the kernel to create an iterator
+that only includes opened files for the process with the specified ``pid``. In
+this example, we will only be iterating files for our process. If
+``linfo.task.pid`` is zero, the iterator will visit every opened file of every
+process. Similarly, ``linfo.task.tid`` directs the kernel to create an iterator
+that visits opened files of a specific thread, not a process. In this example,
+``linfo.task.tid`` is different from ``linfo.task.pid`` only if the thread has a
+separate file descriptor table. In most circumstances, all process threads share
+a single file descriptor table.
+
+Now, in the userspace program, pass the pointer of struct to the
+``bpf_program__attach_iter()``.
+
+::
+
+ link = bpf_program__attach_iter(prog, &opts); iter_fd =
+ bpf_iter_create(bpf_link__fd(link));
+
+If both *tid* and *pid* are zero, an iterator created from this struct
+``bpf_iter_attach_opts`` will include every opened file of every task in the
+system (in the namespace, actually.) It is the same as passing a NULL as the
+second argument to ``bpf_program__attach_iter()``.
+
+The whole program looks like the following code:
+
+::
+
+ #include <stdio.h>
+ #include <unistd.h>
+ #include <bpf/bpf.h>
+ #include <bpf/libbpf.h>
+ #include "bpf_iter_task_ex.skel.h"
+
+ static int do_read_opts(struct bpf_program *prog, struct bpf_iter_attach_opts *opts)
+ {
+ struct bpf_link *link;
+ char buf[16] = {};
+ int iter_fd = -1, len;
+ int ret = 0;
+
+ link = bpf_program__attach_iter(prog, opts);
+ if (!link) {
+ fprintf(stderr, "bpf_program__attach_iter() fails\n");
+ return -1;
+ }
+ iter_fd = bpf_iter_create(bpf_link__fd(link));
+ if (iter_fd < 0) {
+ fprintf(stderr, "bpf_iter_create() fails\n");
+ ret = -1;
+ goto free_link;
+ }
+ /* not check contents, but ensure read() ends without error */
+ while ((len = read(iter_fd, buf, sizeof(buf) - 1)) > 0) {
+ buf[len] = 0;
+ printf("%s", buf);
+ }
+ printf("\n");
+ free_link:
+ if (iter_fd >= 0)
+ close(iter_fd);
+ bpf_link__destroy(link);
+ return 0;
+ }
+
+ static void test_task_file(void)
+ {
+ LIBBPF_OPTS(bpf_iter_attach_opts, opts);
+ struct bpf_iter_task_ex *skel;
+ union bpf_iter_link_info linfo;
+ skel = bpf_iter_task_ex__open_and_load();
+ if (skel == NULL)
+ return;
+ memset(&linfo, 0, sizeof(linfo));
+ linfo.task.pid = getpid();
+ opts.link_info = &linfo;
+ opts.link_info_len = sizeof(linfo);
+ printf("PID %d\n", getpid());
+ do_read_opts(skel->progs.dump_task_file, &opts);
+ bpf_iter_task_ex__destroy(skel);
+ }
+
+ int main(int argc, const char * const * argv)
+ {
+ test_task_file();
+ return 0;
+ }
+
+The following lines are the output of the program.
+::
+
+ PID 1859
+
+ tgid pid fd file
+ 1859 1859 0 ffffffff82270aa0
+ 1859 1859 1 ffffffff82270aa0
+ 1859 1859 2 ffffffff82270aa0
+ 1859 1859 3 ffffffff82272980
+ 1859 1859 4 ffffffff8225e120
+ 1859 1859 5 ffffffff82255120
+ 1859 1859 6 ffffffff82254f00
+ 1859 1859 7 ffffffff82254d80
+ 1859 1859 8 ffffffff8225abe0
+
+------------------
+Without Parameters
+------------------
+
+Let us look at how a BPF iterator without parameters skips files of other
+processes in the system. In this case, the BPF program has to check the pid or
+the tid of tasks, or it will receive every opened file in the system (in the
+current *pid* namespace, actually). So, we usually add a global variable in the
+BPF program to pass a *pid* to the BPF program.
+
+The BPF program would look like the following block.
+
+ ::
+
+ ......
+ int target_pid = 0;
+
+ SEC("iter/task_file")
+ int dump_task_file(struct bpf_iter__task_file *ctx)
+ {
+ ......
+ if (task->tgid != target_pid) /* Check task->pid instead to check thread IDs */
+ return 0;
+ BPF_SEQ_PRINTF(seq, "%8d %8d %8d %lx\n", task->tgid, task->pid, fd,
+ (long)file->f_op);
+ return 0;
+ }
+
+The user space program would look like the following block:
+
+ ::
+
+ ......
+ static void test_task_file(void)
+ {
+ ......
+ skel = bpf_iter_task_ex__open_and_load();
+ if (skel == NULL)
+ return;
+ skel->bss->target_pid = getpid(); /* process ID. For thread id, use gettid() */
+ memset(&linfo, 0, sizeof(linfo));
+ linfo.task.pid = getpid();
+ opts.link_info = &linfo;
+ opts.link_info_len = sizeof(linfo);
+ ......
+ }
+
+``target_pid`` is a global variable in the BPF program. The user space program
+should initialize the variable with a process ID to skip opened files of other
+processes in the BPF program. When you parametrize a BPF iterator, the iterator
+calls the BPF program fewer times which can save significant resources.
+
+---------------------------
+Parametrizing VMA Iterators
+---------------------------
+
+By default, a BPF VMA iterator includes every VMA in every process. However,
+you can still specify a process or a thread to include only its VMAs. Unlike
+files, a thread can not have a separate address space (since Linux 2.6.0-test6).
+Here, using *tid* makes no difference from using *pid*.
+
+----------------------------
+Parametrizing Task Iterators
+----------------------------
+
+A BPF task iterator with *pid* includes all tasks (threads) of a process. The
+BPF program receives these tasks one after another. You can specify a BPF task
+iterator with *tid* parameter to include only the tasks that match the given
+*tid*.
diff --git a/Documentation/bpf/btf.rst b/Documentation/bpf/btf.rst
index cf8722f96090..7cd7c5415a99 100644
--- a/Documentation/bpf/btf.rst
+++ b/Documentation/bpf/btf.rst
@@ -1062,4 +1062,9 @@ format.::
7. Testing
==========
-Kernel bpf selftest `test_btf.c` provides extensive set of BTF-related tests.
+The kernel BPF selftest `tools/testing/selftests/bpf/prog_tests/btf.c`_
+provides an extensive set of BTF-related tests.
+
+.. Links
+.. _tools/testing/selftests/bpf/prog_tests/btf.c:
+ https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/tools/testing/selftests/bpf/prog_tests/btf.c
diff --git a/Documentation/bpf/index.rst b/Documentation/bpf/index.rst
index 1b50de1983ee..b81533d8b061 100644
--- a/Documentation/bpf/index.rst
+++ b/Documentation/bpf/index.rst
@@ -24,11 +24,13 @@ that goes into great technical depth about the BPF Architecture.
maps
bpf_prog_run
classic_vs_extended.rst
+ bpf_iterators
bpf_licensing
test_debug
clang-notes
linux-notes
other
+ redirect
.. only:: subproject and html
diff --git a/Documentation/bpf/instruction-set.rst b/Documentation/bpf/instruction-set.rst
index 5d798437dad4..e672d5ec6cc7 100644
--- a/Documentation/bpf/instruction-set.rst
+++ b/Documentation/bpf/instruction-set.rst
@@ -122,11 +122,11 @@ BPF_END 0xd0 byte swap operations (see `Byte swap instructions`_ below)
``BPF_XOR | BPF_K | BPF_ALU`` means::
- src_reg = (u32) src_reg ^ (u32) imm32
+ dst_reg = (u32) dst_reg ^ (u32) imm32
``BPF_XOR | BPF_K | BPF_ALU64`` means::
- src_reg = src_reg ^ imm32
+ dst_reg = dst_reg ^ imm32
Byte swap instructions
diff --git a/Documentation/bpf/kfuncs.rst b/Documentation/bpf/kfuncs.rst
index 0f858156371d..9fd7fb539f85 100644
--- a/Documentation/bpf/kfuncs.rst
+++ b/Documentation/bpf/kfuncs.rst
@@ -72,6 +72,30 @@ argument as its size. By default, without __sz annotation, the size of the type
of the pointer is used. Without __sz annotation, a kfunc cannot accept a void
pointer.
+2.2.2 __k Annotation
+--------------------
+
+This annotation is only understood for scalar arguments, where it indicates that
+the verifier must check the scalar argument to be a known constant, which does
+not indicate a size parameter, and the value of the constant is relevant to the
+safety of the program.
+
+An example is given below::
+
+ void *bpf_obj_new(u32 local_type_id__k, ...)
+ {
+ ...
+ }
+
+Here, bpf_obj_new uses local_type_id argument to find out the size of that type
+ID in program's BTF and return a sized pointer to it. Each type ID will have a
+distinct size, hence it is crucial to treat each such call as distinct when
+values don't match during verifier state pruning checks.
+
+Hence, whenever a constant scalar argument is accepted by a kfunc which is not a
+size parameter, and the value of the constant matters for program safety, __k
+suffix should be used.
+
.. _BPF_kfunc_nodef:
2.3 Using an existing kernel function
@@ -137,22 +161,20 @@ KF_ACQUIRE and KF_RET_NULL flags.
--------------------------
The KF_TRUSTED_ARGS flag is used for kfuncs taking pointer arguments. It
-indicates that the all pointer arguments will always have a guaranteed lifetime,
-and pointers to kernel objects are always passed to helpers in their unmodified
-form (as obtained from acquire kfuncs).
+indicates that the all pointer arguments are valid, and that all pointers to
+BTF objects have been passed in their unmodified form (that is, at a zero
+offset, and without having been obtained from walking another pointer).
-It can be used to enforce that a pointer to a refcounted object acquired from a
-kfunc or BPF helper is passed as an argument to this kfunc without any
-modifications (e.g. pointer arithmetic) such that it is trusted and points to
-the original object.
+There are two types of pointers to kernel objects which are considered "valid":
-Meanwhile, it is also allowed pass pointers to normal memory to such kfuncs,
-but those can have a non-zero offset.
+1. Pointers which are passed as tracepoint or struct_ops callback arguments.
+2. Pointers which were returned from a KF_ACQUIRE or KF_KPTR_GET kfunc.
-This flag is often used for kfuncs that operate (change some property, perform
-some operation) on an object that was obtained using an acquire kfunc. Such
-kfuncs need an unchanged pointer to ensure the integrity of the operation being
-performed on the expected object.
+Pointers to non-BTF objects (e.g. scalar pointers) may also be passed to
+KF_TRUSTED_ARGS kfuncs, and may have a non-zero offset.
+
+The definition of "valid" pointers is subject to change at any time, and has
+absolutely no ABI stability guarantees.
2.4.6 KF_SLEEPABLE flag
-----------------------
@@ -169,6 +191,15 @@ rebooting or panicking. Due to this additional restrictions apply to these
calls. At the moment they only require CAP_SYS_BOOT capability, but more can be
added later.
+2.4.8 KF_RCU flag
+-----------------
+
+The KF_RCU flag is used for kfuncs which have a rcu ptr as its argument.
+When used together with KF_ACQUIRE, it indicates the kfunc should have a
+single argument which must be a trusted argument or a MEM_RCU pointer.
+The argument may have reference count of 0 and the kfunc must take this
+into consideration.
+
2.5 Registering the kfuncs
--------------------------
@@ -191,3 +222,201 @@ type. An example is shown below::
return register_btf_kfunc_id_set(BPF_PROG_TYPE_TRACING, &bpf_task_kfunc_set);
}
late_initcall(init_subsystem);
+
+3. Core kfuncs
+==============
+
+The BPF subsystem provides a number of "core" kfuncs that are potentially
+applicable to a wide variety of different possible use cases and programs.
+Those kfuncs are documented here.
+
+3.1 struct task_struct * kfuncs
+-------------------------------
+
+There are a number of kfuncs that allow ``struct task_struct *`` objects to be
+used as kptrs:
+
+.. kernel-doc:: kernel/bpf/helpers.c
+ :identifiers: bpf_task_acquire bpf_task_release
+
+These kfuncs are useful when you want to acquire or release a reference to a
+``struct task_struct *`` that was passed as e.g. a tracepoint arg, or a
+struct_ops callback arg. For example:
+
+.. code-block:: c
+
+ /**
+ * A trivial example tracepoint program that shows how to
+ * acquire and release a struct task_struct * pointer.
+ */
+ SEC("tp_btf/task_newtask")
+ int BPF_PROG(task_acquire_release_example, struct task_struct *task, u64 clone_flags)
+ {
+ struct task_struct *acquired;
+
+ acquired = bpf_task_acquire(task);
+
+ /*
+ * In a typical program you'd do something like store
+ * the task in a map, and the map will automatically
+ * release it later. Here, we release it manually.
+ */
+ bpf_task_release(acquired);
+ return 0;
+ }
+
+----
+
+A BPF program can also look up a task from a pid. This can be useful if the
+caller doesn't have a trusted pointer to a ``struct task_struct *`` object that
+it can acquire a reference on with bpf_task_acquire().
+
+.. kernel-doc:: kernel/bpf/helpers.c
+ :identifiers: bpf_task_from_pid
+
+Here is an example of it being used:
+
+.. code-block:: c
+
+ SEC("tp_btf/task_newtask")
+ int BPF_PROG(task_get_pid_example, struct task_struct *task, u64 clone_flags)
+ {
+ struct task_struct *lookup;
+
+ lookup = bpf_task_from_pid(task->pid);
+ if (!lookup)
+ /* A task should always be found, as %task is a tracepoint arg. */
+ return -ENOENT;
+
+ if (lookup->pid != task->pid) {
+ /* bpf_task_from_pid() looks up the task via its
+ * globally-unique pid from the init_pid_ns. Thus,
+ * the pid of the lookup task should always be the
+ * same as the input task.
+ */
+ bpf_task_release(lookup);
+ return -EINVAL;
+ }
+
+ /* bpf_task_from_pid() returns an acquired reference,
+ * so it must be dropped before returning from the
+ * tracepoint handler.
+ */
+ bpf_task_release(lookup);
+ return 0;
+ }
+
+3.2 struct cgroup * kfuncs
+--------------------------
+
+``struct cgroup *`` objects also have acquire and release functions:
+
+.. kernel-doc:: kernel/bpf/helpers.c
+ :identifiers: bpf_cgroup_acquire bpf_cgroup_release
+
+These kfuncs are used in exactly the same manner as bpf_task_acquire() and
+bpf_task_release() respectively, so we won't provide examples for them.
+
+----
+
+You may also acquire a reference to a ``struct cgroup`` kptr that's already
+stored in a map using bpf_cgroup_kptr_get():
+
+.. kernel-doc:: kernel/bpf/helpers.c
+ :identifiers: bpf_cgroup_kptr_get
+
+Here's an example of how it can be used:
+
+.. code-block:: c
+
+ /* struct containing the struct task_struct kptr which is actually stored in the map. */
+ struct __cgroups_kfunc_map_value {
+ struct cgroup __kptr_ref * cgroup;
+ };
+
+ /* The map containing struct __cgroups_kfunc_map_value entries. */
+ struct {
+ __uint(type, BPF_MAP_TYPE_HASH);
+ __type(key, int);
+ __type(value, struct __cgroups_kfunc_map_value);
+ __uint(max_entries, 1);
+ } __cgroups_kfunc_map SEC(".maps");
+
+ /* ... */
+
+ /**
+ * A simple example tracepoint program showing how a
+ * struct cgroup kptr that is stored in a map can
+ * be acquired using the bpf_cgroup_kptr_get() kfunc.
+ */
+ SEC("tp_btf/cgroup_mkdir")
+ int BPF_PROG(cgroup_kptr_get_example, struct cgroup *cgrp, const char *path)
+ {
+ struct cgroup *kptr;
+ struct __cgroups_kfunc_map_value *v;
+ s32 id = cgrp->self.id;
+
+ /* Assume a cgroup kptr was previously stored in the map. */
+ v = bpf_map_lookup_elem(&__cgroups_kfunc_map, &id);
+ if (!v)
+ return -ENOENT;
+
+ /* Acquire a reference to the cgroup kptr that's already stored in the map. */
+ kptr = bpf_cgroup_kptr_get(&v->cgroup);
+ if (!kptr)
+ /* If no cgroup was present in the map, it's because
+ * we're racing with another CPU that removed it with
+ * bpf_kptr_xchg() between the bpf_map_lookup_elem()
+ * above, and our call to bpf_cgroup_kptr_get().
+ * bpf_cgroup_kptr_get() internally safely handles this
+ * race, and will return NULL if the task is no longer
+ * present in the map by the time we invoke the kfunc.
+ */
+ return -EBUSY;
+
+ /* Free the reference we just took above. Note that the
+ * original struct cgroup kptr is still in the map. It will
+ * be freed either at a later time if another context deletes
+ * it from the map, or automatically by the BPF subsystem if
+ * it's still present when the map is destroyed.
+ */
+ bpf_cgroup_release(kptr);
+
+ return 0;
+ }
+
+----
+
+Another kfunc available for interacting with ``struct cgroup *`` objects is
+bpf_cgroup_ancestor(). This allows callers to access the ancestor of a cgroup,
+and return it as a cgroup kptr.
+
+.. kernel-doc:: kernel/bpf/helpers.c
+ :identifiers: bpf_cgroup_ancestor
+
+Eventually, BPF should be updated to allow this to happen with a normal memory
+load in the program itself. This is currently not possible without more work in
+the verifier. bpf_cgroup_ancestor() can be used as follows:
+
+.. code-block:: c
+
+ /**
+ * Simple tracepoint example that illustrates how a cgroup's
+ * ancestor can be accessed using bpf_cgroup_ancestor().
+ */
+ SEC("tp_btf/cgroup_mkdir")
+ int BPF_PROG(cgrp_ancestor_example, struct cgroup *cgrp, const char *path)
+ {
+ struct cgroup *parent;
+
+ /* The parent cgroup resides at the level before the current cgroup's level. */
+ parent = bpf_cgroup_ancestor(cgrp, cgrp->level - 1);
+ if (!parent)
+ return -ENOENT;
+
+ bpf_printk("Parent id is %d", parent->self.id);
+
+ /* Return the parent cgroup that was acquired above. */
+ bpf_cgroup_release(parent);
+ return 0;
+ }
diff --git a/Documentation/bpf/libbpf/index.rst b/Documentation/bpf/libbpf/index.rst
index 3722537d1384..f9b3b252e28f 100644
--- a/Documentation/bpf/libbpf/index.rst
+++ b/Documentation/bpf/libbpf/index.rst
@@ -1,5 +1,7 @@
.. SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause)
+.. _libbpf:
+
libbpf
======
@@ -7,6 +9,7 @@ libbpf
:maxdepth: 1
API Documentation <https://libbpf.readthedocs.io/en/latest/api.html>
+ program_types
libbpf_naming_convention
libbpf_build
diff --git a/Documentation/bpf/libbpf/program_types.rst b/Documentation/bpf/libbpf/program_types.rst
new file mode 100644
index 000000000000..ad4d4d5eecb0
--- /dev/null
+++ b/Documentation/bpf/libbpf/program_types.rst
@@ -0,0 +1,203 @@
+.. SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause)
+
+.. _program_types_and_elf:
+
+Program Types and ELF Sections
+==============================
+
+The table below lists the program types, their attach types where relevant and the ELF section
+names supported by libbpf for them. The ELF section names follow these rules:
+
+- ``type`` is an exact match, e.g. ``SEC("socket")``
+- ``type+`` means it can be either exact ``SEC("type")`` or well-formed ``SEC("type/extras")``
+ with a '``/``' separator between ``type`` and ``extras``.
+
+When ``extras`` are specified, they provide details of how to auto-attach the BPF program. The
+format of ``extras`` depends on the program type, e.g. ``SEC("tracepoint/<category>/<name>")``
+for tracepoints or ``SEC("usdt/<path>:<provider>:<name>")`` for USDT probes. The extras are
+described in more detail in the footnotes.
+
+
++-------------------------------------------+----------------------------------------+----------------------------------+-----------+
+| Program Type | Attach Type | ELF Section Name | Sleepable |
++===========================================+========================================+==================================+===========+
+| ``BPF_PROG_TYPE_CGROUP_DEVICE`` | ``BPF_CGROUP_DEVICE`` | ``cgroup/dev`` | |
++-------------------------------------------+----------------------------------------+----------------------------------+-----------+
+| ``BPF_PROG_TYPE_CGROUP_SKB`` | | ``cgroup/skb`` | |
++ +----------------------------------------+----------------------------------+-----------+
+| | ``BPF_CGROUP_INET_EGRESS`` | ``cgroup_skb/egress`` | |
++ +----------------------------------------+----------------------------------+-----------+
+| | ``BPF_CGROUP_INET_INGRESS`` | ``cgroup_skb/ingress`` | |
++-------------------------------------------+----------------------------------------+----------------------------------+-----------+
+| ``BPF_PROG_TYPE_CGROUP_SOCKOPT`` | ``BPF_CGROUP_GETSOCKOPT`` | ``cgroup/getsockopt`` | |
++ +----------------------------------------+----------------------------------+-----------+
+| | ``BPF_CGROUP_SETSOCKOPT`` | ``cgroup/setsockopt`` | |
++-------------------------------------------+----------------------------------------+----------------------------------+-----------+
+| ``BPF_PROG_TYPE_CGROUP_SOCK_ADDR`` | ``BPF_CGROUP_INET4_BIND`` | ``cgroup/bind4`` | |
++ +----------------------------------------+----------------------------------+-----------+
+| | ``BPF_CGROUP_INET4_CONNECT`` | ``cgroup/connect4`` | |
++ +----------------------------------------+----------------------------------+-----------+
+| | ``BPF_CGROUP_INET4_GETPEERNAME`` | ``cgroup/getpeername4`` | |
++ +----------------------------------------+----------------------------------+-----------+
+| | ``BPF_CGROUP_INET4_GETSOCKNAME`` | ``cgroup/getsockname4`` | |
++ +----------------------------------------+----------------------------------+-----------+
+| | ``BPF_CGROUP_INET6_BIND`` | ``cgroup/bind6`` | |
++ +----------------------------------------+----------------------------------+-----------+
+| | ``BPF_CGROUP_INET6_CONNECT`` | ``cgroup/connect6`` | |
++ +----------------------------------------+----------------------------------+-----------+
+| | ``BPF_CGROUP_INET6_GETPEERNAME`` | ``cgroup/getpeername6`` | |
++ +----------------------------------------+----------------------------------+-----------+
+| | ``BPF_CGROUP_INET6_GETSOCKNAME`` | ``cgroup/getsockname6`` | |
++ +----------------------------------------+----------------------------------+-----------+
+| | ``BPF_CGROUP_UDP4_RECVMSG`` | ``cgroup/recvmsg4`` | |
++ +----------------------------------------+----------------------------------+-----------+
+| | ``BPF_CGROUP_UDP4_SENDMSG`` | ``cgroup/sendmsg4`` | |
++ +----------------------------------------+----------------------------------+-----------+
+| | ``BPF_CGROUP_UDP6_RECVMSG`` | ``cgroup/recvmsg6`` | |
++ +----------------------------------------+----------------------------------+-----------+
+| | ``BPF_CGROUP_UDP6_SENDMSG`` | ``cgroup/sendmsg6`` | |
++-------------------------------------------+----------------------------------------+----------------------------------+-----------+
+| ``BPF_PROG_TYPE_CGROUP_SOCK`` | ``BPF_CGROUP_INET4_POST_BIND`` | ``cgroup/post_bind4`` | |
++ +----------------------------------------+----------------------------------+-----------+
+| | ``BPF_CGROUP_INET6_POST_BIND`` | ``cgroup/post_bind6`` | |
++ +----------------------------------------+----------------------------------+-----------+
+| | ``BPF_CGROUP_INET_SOCK_CREATE`` | ``cgroup/sock_create`` | |
++ + +----------------------------------+-----------+
+| | | ``cgroup/sock`` | |
++ +----------------------------------------+----------------------------------+-----------+
+| | ``BPF_CGROUP_INET_SOCK_RELEASE`` | ``cgroup/sock_release`` | |
++-------------------------------------------+----------------------------------------+----------------------------------+-----------+
+| ``BPF_PROG_TYPE_CGROUP_SYSCTL`` | ``BPF_CGROUP_SYSCTL`` | ``cgroup/sysctl`` | |
++-------------------------------------------+----------------------------------------+----------------------------------+-----------+
+| ``BPF_PROG_TYPE_EXT`` | | ``freplace+`` [#fentry]_ | |
++-------------------------------------------+----------------------------------------+----------------------------------+-----------+
+| ``BPF_PROG_TYPE_FLOW_DISSECTOR`` | ``BPF_FLOW_DISSECTOR`` | ``flow_dissector`` | |
++-------------------------------------------+----------------------------------------+----------------------------------+-----------+
+| ``BPF_PROG_TYPE_KPROBE`` | | ``kprobe+`` [#kprobe]_ | |
++ + +----------------------------------+-----------+
+| | | ``kretprobe+`` [#kprobe]_ | |
++ + +----------------------------------+-----------+
+| | | ``ksyscall+`` [#ksyscall]_ | |
++ + +----------------------------------+-----------+
+| | | ``kretsyscall+`` [#ksyscall]_ | |
++ + +----------------------------------+-----------+
+| | | ``uprobe+`` [#uprobe]_ | |
++ + +----------------------------------+-----------+
+| | | ``uprobe.s+`` [#uprobe]_ | Yes |
++ + +----------------------------------+-----------+
+| | | ``uretprobe+`` [#uprobe]_ | |
++ + +----------------------------------+-----------+
+| | | ``uretprobe.s+`` [#uprobe]_ | Yes |
++ + +----------------------------------+-----------+
+| | | ``usdt+`` [#usdt]_ | |
++ +----------------------------------------+----------------------------------+-----------+
+| | ``BPF_TRACE_KPROBE_MULTI`` | ``kprobe.multi+`` [#kpmulti]_ | |
++ + +----------------------------------+-----------+
+| | | ``kretprobe.multi+`` [#kpmulti]_ | |
++-------------------------------------------+----------------------------------------+----------------------------------+-----------+
+| ``BPF_PROG_TYPE_LIRC_MODE2`` | ``BPF_LIRC_MODE2`` | ``lirc_mode2`` | |
++-------------------------------------------+----------------------------------------+----------------------------------+-----------+
+| ``BPF_PROG_TYPE_LSM`` | ``BPF_LSM_CGROUP`` | ``lsm_cgroup+`` | |
++ +----------------------------------------+----------------------------------+-----------+
+| | ``BPF_LSM_MAC`` | ``lsm+`` [#lsm]_ | |
++ + +----------------------------------+-----------+
+| | | ``lsm.s+`` [#lsm]_ | Yes |
++-------------------------------------------+----------------------------------------+----------------------------------+-----------+
+| ``BPF_PROG_TYPE_LWT_IN`` | | ``lwt_in`` | |
++-------------------------------------------+----------------------------------------+----------------------------------+-----------+
+| ``BPF_PROG_TYPE_LWT_OUT`` | | ``lwt_out`` | |
++-------------------------------------------+----------------------------------------+----------------------------------+-----------+
+| ``BPF_PROG_TYPE_LWT_SEG6LOCAL`` | | ``lwt_seg6local`` | |
++-------------------------------------------+----------------------------------------+----------------------------------+-----------+
+| ``BPF_PROG_TYPE_LWT_XMIT`` | | ``lwt_xmit`` | |
++-------------------------------------------+----------------------------------------+----------------------------------+-----------+
+| ``BPF_PROG_TYPE_PERF_EVENT`` | | ``perf_event`` | |
++-------------------------------------------+----------------------------------------+----------------------------------+-----------+
+| ``BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE`` | | ``raw_tp.w+`` [#rawtp]_ | |
++ + +----------------------------------+-----------+
+| | | ``raw_tracepoint.w+`` | |
++-------------------------------------------+----------------------------------------+----------------------------------+-----------+
+| ``BPF_PROG_TYPE_RAW_TRACEPOINT`` | | ``raw_tp+`` [#rawtp]_ | |
++ + +----------------------------------+-----------+
+| | | ``raw_tracepoint+`` | |
++-------------------------------------------+----------------------------------------+----------------------------------+-----------+
+| ``BPF_PROG_TYPE_SCHED_ACT`` | | ``action`` | |
++-------------------------------------------+----------------------------------------+----------------------------------+-----------+
+| ``BPF_PROG_TYPE_SCHED_CLS`` | | ``classifier`` | |
++ + +----------------------------------+-----------+
+| | | ``tc`` | |
++-------------------------------------------+----------------------------------------+----------------------------------+-----------+
+| ``BPF_PROG_TYPE_SK_LOOKUP`` | ``BPF_SK_LOOKUP`` | ``sk_lookup`` | |
++-------------------------------------------+----------------------------------------+----------------------------------+-----------+
+| ``BPF_PROG_TYPE_SK_MSG`` | ``BPF_SK_MSG_VERDICT`` | ``sk_msg`` | |
++-------------------------------------------+----------------------------------------+----------------------------------+-----------+
+| ``BPF_PROG_TYPE_SK_REUSEPORT`` | ``BPF_SK_REUSEPORT_SELECT_OR_MIGRATE`` | ``sk_reuseport/migrate`` | |
++ +----------------------------------------+----------------------------------+-----------+
+| | ``BPF_SK_REUSEPORT_SELECT`` | ``sk_reuseport`` | |
++-------------------------------------------+----------------------------------------+----------------------------------+-----------+
+| ``BPF_PROG_TYPE_SK_SKB`` | | ``sk_skb`` | |
++ +----------------------------------------+----------------------------------+-----------+
+| | ``BPF_SK_SKB_STREAM_PARSER`` | ``sk_skb/stream_parser`` | |
++ +----------------------------------------+----------------------------------+-----------+
+| | ``BPF_SK_SKB_STREAM_VERDICT`` | ``sk_skb/stream_verdict`` | |
++-------------------------------------------+----------------------------------------+----------------------------------+-----------+
+| ``BPF_PROG_TYPE_SOCKET_FILTER`` | | ``socket`` | |
++-------------------------------------------+----------------------------------------+----------------------------------+-----------+
+| ``BPF_PROG_TYPE_SOCK_OPS`` | ``BPF_CGROUP_SOCK_OPS`` | ``sockops`` | |
++-------------------------------------------+----------------------------------------+----------------------------------+-----------+
+| ``BPF_PROG_TYPE_STRUCT_OPS`` | | ``struct_ops+`` | |
++-------------------------------------------+----------------------------------------+----------------------------------+-----------+
+| ``BPF_PROG_TYPE_SYSCALL`` | | ``syscall`` | Yes |
++-------------------------------------------+----------------------------------------+----------------------------------+-----------+
+| ``BPF_PROG_TYPE_TRACEPOINT`` | | ``tp+`` [#tp]_ | |
++ + +----------------------------------+-----------+
+| | | ``tracepoint+`` [#tp]_ | |
++-------------------------------------------+----------------------------------------+----------------------------------+-----------+
+| ``BPF_PROG_TYPE_TRACING`` | ``BPF_MODIFY_RETURN`` | ``fmod_ret+`` [#fentry]_ | |
++ + +----------------------------------+-----------+
+| | | ``fmod_ret.s+`` [#fentry]_ | Yes |
++ +----------------------------------------+----------------------------------+-----------+
+| | ``BPF_TRACE_FENTRY`` | ``fentry+`` [#fentry]_ | |
++ + +----------------------------------+-----------+
+| | | ``fentry.s+`` [#fentry]_ | Yes |
++ +----------------------------------------+----------------------------------+-----------+
+| | ``BPF_TRACE_FEXIT`` | ``fexit+`` [#fentry]_ | |
++ + +----------------------------------+-----------+
+| | | ``fexit.s+`` [#fentry]_ | Yes |
++ +----------------------------------------+----------------------------------+-----------+
+| | ``BPF_TRACE_ITER`` | ``iter+`` [#iter]_ | |
++ + +----------------------------------+-----------+
+| | | ``iter.s+`` [#iter]_ | Yes |
++ +----------------------------------------+----------------------------------+-----------+
+| | ``BPF_TRACE_RAW_TP`` | ``tp_btf+`` [#fentry]_ | |
++-------------------------------------------+----------------------------------------+----------------------------------+-----------+
+| ``BPF_PROG_TYPE_XDP`` | ``BPF_XDP_CPUMAP`` | ``xdp.frags/cpumap`` | |
++ + +----------------------------------+-----------+
+| | | ``xdp/cpumap`` | |
++ +----------------------------------------+----------------------------------+-----------+
+| | ``BPF_XDP_DEVMAP`` | ``xdp.frags/devmap`` | |
++ + +----------------------------------+-----------+
+| | | ``xdp/devmap`` | |
++ +----------------------------------------+----------------------------------+-----------+
+| | ``BPF_XDP`` | ``xdp.frags`` | |
++ + +----------------------------------+-----------+
+| | | ``xdp`` | |
++-------------------------------------------+----------------------------------------+----------------------------------+-----------+
+
+
+.. rubric:: Footnotes
+
+.. [#fentry] The ``fentry`` attach format is ``fentry[.s]/<function>``.
+.. [#kprobe] The ``kprobe`` attach format is ``kprobe/<function>[+<offset>]``. Valid
+ characters for ``function`` are ``a-zA-Z0-9_.`` and ``offset`` must be a valid
+ non-negative integer.
+.. [#ksyscall] The ``ksyscall`` attach format is ``ksyscall/<syscall>``.
+.. [#uprobe] The ``uprobe`` attach format is ``uprobe[.s]/<path>:<function>[+<offset>]``.
+.. [#usdt] The ``usdt`` attach format is ``usdt/<path>:<provider>:<name>``.
+.. [#kpmulti] The ``kprobe.multi`` attach format is ``kprobe.multi/<pattern>`` where ``pattern``
+ supports ``*`` and ``?`` wildcards. Valid characters for pattern are
+ ``a-zA-Z0-9_.*?``.
+.. [#lsm] The ``lsm`` attachment format is ``lsm[.s]/<hook>``.
+.. [#rawtp] The ``raw_tp`` attach format is ``raw_tracepoint[.w]/<tracepoint>``.
+.. [#tp] The ``tracepoint`` attach format is ``tracepoint/<category>/<name>``.
+.. [#iter] The ``iter`` attach format is ``iter[.s]/<struct-name>``.
diff --git a/Documentation/bpf/map_array.rst b/Documentation/bpf/map_array.rst
new file mode 100644
index 000000000000..f2f51a53e8ae
--- /dev/null
+++ b/Documentation/bpf/map_array.rst
@@ -0,0 +1,262 @@
+.. SPDX-License-Identifier: GPL-2.0-only
+.. Copyright (C) 2022 Red Hat, Inc.
+
+================================================
+BPF_MAP_TYPE_ARRAY and BPF_MAP_TYPE_PERCPU_ARRAY
+================================================
+
+.. note::
+ - ``BPF_MAP_TYPE_ARRAY`` was introduced in kernel version 3.19
+ - ``BPF_MAP_TYPE_PERCPU_ARRAY`` was introduced in version 4.6
+
+``BPF_MAP_TYPE_ARRAY`` and ``BPF_MAP_TYPE_PERCPU_ARRAY`` provide generic array
+storage. The key type is an unsigned 32-bit integer (4 bytes) and the map is
+of constant size. The size of the array is defined in ``max_entries`` at
+creation time. All array elements are pre-allocated and zero initialized when
+created. ``BPF_MAP_TYPE_PERCPU_ARRAY`` uses a different memory region for each
+CPU whereas ``BPF_MAP_TYPE_ARRAY`` uses the same memory region. The value
+stored can be of any size, however, all array elements are aligned to 8
+bytes.
+
+Since kernel 5.5, memory mapping may be enabled for ``BPF_MAP_TYPE_ARRAY`` by
+setting the flag ``BPF_F_MMAPABLE``. The map definition is page-aligned and
+starts on the first page. Sufficient page-sized and page-aligned blocks of
+memory are allocated to store all array values, starting on the second page,
+which in some cases will result in over-allocation of memory. The benefit of
+using this is increased performance and ease of use since userspace programs
+would not be required to use helper functions to access and mutate data.
+
+Usage
+=====
+
+Kernel BPF
+----------
+
+bpf_map_lookup_elem()
+~~~~~~~~~~~~~~~~~~~~~
+
+.. code-block:: c
+
+ void *bpf_map_lookup_elem(struct bpf_map *map, const void *key)
+
+Array elements can be retrieved using the ``bpf_map_lookup_elem()`` helper.
+This helper returns a pointer into the array element, so to avoid data races
+with userspace reading the value, the user must use primitives like
+``__sync_fetch_and_add()`` when updating the value in-place.
+
+bpf_map_update_elem()
+~~~~~~~~~~~~~~~~~~~~~
+
+.. code-block:: c
+
+ long bpf_map_update_elem(struct bpf_map *map, const void *key, const void *value, u64 flags)
+
+Array elements can be updated using the ``bpf_map_update_elem()`` helper.
+
+``bpf_map_update_elem()`` returns 0 on success, or negative error in case of
+failure.
+
+Since the array is of constant size, ``bpf_map_delete_elem()`` is not supported.
+To clear an array element, you may use ``bpf_map_update_elem()`` to insert a
+zero value to that index.
+
+Per CPU Array
+-------------
+
+Values stored in ``BPF_MAP_TYPE_ARRAY`` can be accessed by multiple programs
+across different CPUs. To restrict storage to a single CPU, you may use a
+``BPF_MAP_TYPE_PERCPU_ARRAY``.
+
+When using a ``BPF_MAP_TYPE_PERCPU_ARRAY`` the ``bpf_map_update_elem()`` and
+``bpf_map_lookup_elem()`` helpers automatically access the slot for the current
+CPU.
+
+bpf_map_lookup_percpu_elem()
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. code-block:: c
+
+ void *bpf_map_lookup_percpu_elem(struct bpf_map *map, const void *key, u32 cpu)
+
+The ``bpf_map_lookup_percpu_elem()`` helper can be used to lookup the array
+value for a specific CPU. Returns value on success , or ``NULL`` if no entry was
+found or ``cpu`` is invalid.
+
+Concurrency
+-----------
+
+Since kernel version 5.1, the BPF infrastructure provides ``struct bpf_spin_lock``
+to synchronize access.
+
+Userspace
+---------
+
+Access from userspace uses libbpf APIs with the same names as above, with
+the map identified by its ``fd``.
+
+Examples
+========
+
+Please see the ``tools/testing/selftests/bpf`` directory for functional
+examples. The code samples below demonstrate API usage.
+
+Kernel BPF
+----------
+
+This snippet shows how to declare an array in a BPF program.
+
+.. code-block:: c
+
+ struct {
+ __uint(type, BPF_MAP_TYPE_ARRAY);
+ __type(key, u32);
+ __type(value, long);
+ __uint(max_entries, 256);
+ } my_map SEC(".maps");
+
+
+This example BPF program shows how to access an array element.
+
+.. code-block:: c
+
+ int bpf_prog(struct __sk_buff *skb)
+ {
+ struct iphdr ip;
+ int index;
+ long *value;
+
+ if (bpf_skb_load_bytes(skb, ETH_HLEN, &ip, sizeof(ip)) < 0)
+ return 0;
+
+ index = ip.protocol;
+ value = bpf_map_lookup_elem(&my_map, &index);
+ if (value)
+ __sync_fetch_and_add(value, skb->len);
+
+ return 0;
+ }
+
+Userspace
+---------
+
+BPF_MAP_TYPE_ARRAY
+~~~~~~~~~~~~~~~~~~
+
+This snippet shows how to create an array, using ``bpf_map_create_opts`` to
+set flags.
+
+.. code-block:: c
+
+ #include <bpf/libbpf.h>
+ #include <bpf/bpf.h>
+
+ int create_array()
+ {
+ int fd;
+ LIBBPF_OPTS(bpf_map_create_opts, opts, .map_flags = BPF_F_MMAPABLE);
+
+ fd = bpf_map_create(BPF_MAP_TYPE_ARRAY,
+ "example_array", /* name */
+ sizeof(__u32), /* key size */
+ sizeof(long), /* value size */
+ 256, /* max entries */
+ &opts); /* create opts */
+ return fd;
+ }
+
+This snippet shows how to initialize the elements of an array.
+
+.. code-block:: c
+
+ int initialize_array(int fd)
+ {
+ __u32 i;
+ long value;
+ int ret;
+
+ for (i = 0; i < 256; i++) {
+ value = i;
+ ret = bpf_map_update_elem(fd, &i, &value, BPF_ANY);
+ if (ret < 0)
+ return ret;
+ }
+
+ return ret;
+ }
+
+This snippet shows how to retrieve an element value from an array.
+
+.. code-block:: c
+
+ int lookup(int fd)
+ {
+ __u32 index = 42;
+ long value;
+ int ret;
+
+ ret = bpf_map_lookup_elem(fd, &index, &value);
+ if (ret < 0)
+ return ret;
+
+ /* use value here */
+ assert(value == 42);
+
+ return ret;
+ }
+
+BPF_MAP_TYPE_PERCPU_ARRAY
+~~~~~~~~~~~~~~~~~~~~~~~~~
+
+This snippet shows how to initialize the elements of a per CPU array.
+
+.. code-block:: c
+
+ int initialize_array(int fd)
+ {
+ int ncpus = libbpf_num_possible_cpus();
+ long values[ncpus];
+ __u32 i, j;
+ int ret;
+
+ for (i = 0; i < 256 ; i++) {
+ for (j = 0; j < ncpus; j++)
+ values[j] = i;
+ ret = bpf_map_update_elem(fd, &i, &values, BPF_ANY);
+ if (ret < 0)
+ return ret;
+ }
+
+ return ret;
+ }
+
+This snippet shows how to access the per CPU elements of an array value.
+
+.. code-block:: c
+
+ int lookup(int fd)
+ {
+ int ncpus = libbpf_num_possible_cpus();
+ __u32 index = 42, j;
+ long values[ncpus];
+ int ret;
+
+ ret = bpf_map_lookup_elem(fd, &index, &values);
+ if (ret < 0)
+ return ret;
+
+ for (j = 0; j < ncpus; j++) {
+ /* Use per CPU value here */
+ assert(values[j] == 42);
+ }
+
+ return ret;
+ }
+
+Semantics
+=========
+
+As shown in the example above, when accessing a ``BPF_MAP_TYPE_PERCPU_ARRAY``
+in userspace, each value is an array with ``ncpus`` elements.
+
+When calling ``bpf_map_update_elem()`` the flag ``BPF_NOEXIST`` can not be used
+for these maps.
diff --git a/Documentation/bpf/map_bloom_filter.rst b/Documentation/bpf/map_bloom_filter.rst
new file mode 100644
index 000000000000..c82487f2fe0d
--- /dev/null
+++ b/Documentation/bpf/map_bloom_filter.rst
@@ -0,0 +1,174 @@
+.. SPDX-License-Identifier: GPL-2.0-only
+.. Copyright (C) 2022 Red Hat, Inc.
+
+=========================
+BPF_MAP_TYPE_BLOOM_FILTER
+=========================
+
+.. note::
+ - ``BPF_MAP_TYPE_BLOOM_FILTER`` was introduced in kernel version 5.16
+
+``BPF_MAP_TYPE_BLOOM_FILTER`` provides a BPF bloom filter map. Bloom
+filters are a space-efficient probabilistic data structure used to
+quickly test whether an element exists in a set. In a bloom filter,
+false positives are possible whereas false negatives are not.
+
+The bloom filter map does not have keys, only values. When the bloom
+filter map is created, it must be created with a ``key_size`` of 0. The
+bloom filter map supports two operations:
+
+- push: adding an element to the map
+- peek: determining whether an element is present in the map
+
+BPF programs must use ``bpf_map_push_elem`` to add an element to the
+bloom filter map and ``bpf_map_peek_elem`` to query the map. These
+operations are exposed to userspace applications using the existing
+``bpf`` syscall in the following way:
+
+- ``BPF_MAP_UPDATE_ELEM`` -> push
+- ``BPF_MAP_LOOKUP_ELEM`` -> peek
+
+The ``max_entries`` size that is specified at map creation time is used
+to approximate a reasonable bitmap size for the bloom filter, and is not
+otherwise strictly enforced. If the user wishes to insert more entries
+into the bloom filter than ``max_entries``, this may lead to a higher
+false positive rate.
+
+The number of hashes to use for the bloom filter is configurable using
+the lower 4 bits of ``map_extra`` in ``union bpf_attr`` at map creation
+time. If no number is specified, the default used will be 5 hash
+functions. In general, using more hashes decreases both the false
+positive rate and the speed of a lookup.
+
+It is not possible to delete elements from a bloom filter map. A bloom
+filter map may be used as an inner map. The user is responsible for
+synchronising concurrent updates and lookups to ensure no false negative
+lookups occur.
+
+Usage
+=====
+
+Kernel BPF
+----------
+
+bpf_map_push_elem()
+~~~~~~~~~~~~~~~~~~~
+
+.. code-block:: c
+
+ long bpf_map_push_elem(struct bpf_map *map, const void *value, u64 flags)
+
+A ``value`` can be added to a bloom filter using the
+``bpf_map_push_elem()`` helper. The ``flags`` parameter must be set to
+``BPF_ANY`` when adding an entry to the bloom filter. This helper
+returns ``0`` on success, or negative error in case of failure.
+
+bpf_map_peek_elem()
+~~~~~~~~~~~~~~~~~~~
+
+.. code-block:: c
+
+ long bpf_map_peek_elem(struct bpf_map *map, void *value)
+
+The ``bpf_map_peek_elem()`` helper is used to determine whether
+``value`` is present in the bloom filter map. This helper returns ``0``
+if ``value`` is probably present in the map, or ``-ENOENT`` if ``value``
+is definitely not present in the map.
+
+Userspace
+---------
+
+bpf_map_update_elem()
+~~~~~~~~~~~~~~~~~~~~~
+
+.. code-block:: c
+
+ int bpf_map_update_elem (int fd, const void *key, const void *value, __u64 flags)
+
+A userspace program can add a ``value`` to a bloom filter using libbpf's
+``bpf_map_update_elem`` function. The ``key`` parameter must be set to
+``NULL`` and ``flags`` must be set to ``BPF_ANY``. Returns ``0`` on
+success, or negative error in case of failure.
+
+bpf_map_lookup_elem()
+~~~~~~~~~~~~~~~~~~~~~
+
+.. code-block:: c
+
+ int bpf_map_lookup_elem (int fd, const void *key, void *value)
+
+A userspace program can determine the presence of ``value`` in a bloom
+filter using libbpf's ``bpf_map_lookup_elem`` function. The ``key``
+parameter must be set to ``NULL``. Returns ``0`` if ``value`` is
+probably present in the map, or ``-ENOENT`` if ``value`` is definitely
+not present in the map.
+
+Examples
+========
+
+Kernel BPF
+----------
+
+This snippet shows how to declare a bloom filter in a BPF program:
+
+.. code-block:: c
+
+ struct {
+ __uint(type, BPF_MAP_TYPE_BLOOM_FILTER);
+ __type(value, __u32);
+ __uint(max_entries, 1000);
+ __uint(map_extra, 3);
+ } bloom_filter SEC(".maps");
+
+This snippet shows how to determine presence of a value in a bloom
+filter in a BPF program:
+
+.. code-block:: c
+
+ void *lookup(__u32 key)
+ {
+ if (bpf_map_peek_elem(&bloom_filter, &key) == 0) {
+ /* Verify not a false positive and fetch an associated
+ * value using a secondary lookup, e.g. in a hash table
+ */
+ return bpf_map_lookup_elem(&hash_table, &key);
+ }
+ return 0;
+ }
+
+Userspace
+---------
+
+This snippet shows how to use libbpf to create a bloom filter map from
+userspace:
+
+.. code-block:: c
+
+ int create_bloom()
+ {
+ LIBBPF_OPTS(bpf_map_create_opts, opts,
+ .map_extra = 3); /* number of hashes */
+
+ return bpf_map_create(BPF_MAP_TYPE_BLOOM_FILTER,
+ "ipv6_bloom", /* name */
+ 0, /* key size, must be zero */
+ sizeof(ipv6_addr), /* value size */
+ 10000, /* max entries */
+ &opts); /* create options */
+ }
+
+This snippet shows how to add an element to a bloom filter from
+userspace:
+
+.. code-block:: c
+
+ int add_element(struct bpf_map *bloom_map, __u32 value)
+ {
+ int bloom_fd = bpf_map__fd(bloom_map);
+ return bpf_map_update_elem(bloom_fd, NULL, &value, BPF_ANY);
+ }
+
+References
+==========
+
+https://lwn.net/ml/bpf/20210831225005.2762202-1-joannekoong@fb.com/
diff --git a/Documentation/bpf/map_cgrp_storage.rst b/Documentation/bpf/map_cgrp_storage.rst
new file mode 100644
index 000000000000..5d3f603efffa
--- /dev/null
+++ b/Documentation/bpf/map_cgrp_storage.rst
@@ -0,0 +1,109 @@
+.. SPDX-License-Identifier: GPL-2.0-only
+.. Copyright (C) 2022 Meta Platforms, Inc. and affiliates.
+
+=========================
+BPF_MAP_TYPE_CGRP_STORAGE
+=========================
+
+The ``BPF_MAP_TYPE_CGRP_STORAGE`` map type represents a local fix-sized
+storage for cgroups. It is only available with ``CONFIG_CGROUPS``.
+The programs are made available by the same Kconfig. The
+data for a particular cgroup can be retrieved by looking up the map
+with that cgroup.
+
+This document describes the usage and semantics of the
+``BPF_MAP_TYPE_CGRP_STORAGE`` map type.
+
+Usage
+=====
+
+The map key must be ``sizeof(int)`` representing a cgroup fd.
+To access the storage in a program, use ``bpf_cgrp_storage_get``::
+
+ void *bpf_cgrp_storage_get(struct bpf_map *map, struct cgroup *cgroup, void *value, u64 flags)
+
+``flags`` could be 0 or ``BPF_LOCAL_STORAGE_GET_F_CREATE`` which indicates that
+a new local storage will be created if one does not exist.
+
+The local storage can be removed with ``bpf_cgrp_storage_delete``::
+
+ long bpf_cgrp_storage_delete(struct bpf_map *map, struct cgroup *cgroup)
+
+The map is available to all program types.
+
+Examples
+========
+
+A BPF program example with BPF_MAP_TYPE_CGRP_STORAGE::
+
+ #include <vmlinux.h>
+ #include <bpf/bpf_helpers.h>
+ #include <bpf/bpf_tracing.h>
+
+ struct {
+ __uint(type, BPF_MAP_TYPE_CGRP_STORAGE);
+ __uint(map_flags, BPF_F_NO_PREALLOC);
+ __type(key, int);
+ __type(value, long);
+ } cgrp_storage SEC(".maps");
+
+ SEC("tp_btf/sys_enter")
+ int BPF_PROG(on_enter, struct pt_regs *regs, long id)
+ {
+ struct task_struct *task = bpf_get_current_task_btf();
+ long *ptr;
+
+ ptr = bpf_cgrp_storage_get(&cgrp_storage, task->cgroups->dfl_cgrp, 0,
+ BPF_LOCAL_STORAGE_GET_F_CREATE);
+ if (ptr)
+ __sync_fetch_and_add(ptr, 1);
+
+ return 0;
+ }
+
+Userspace accessing map declared above::
+
+ #include <linux/bpf.h>
+ #include <linux/libbpf.h>
+
+ __u32 map_lookup(struct bpf_map *map, int cgrp_fd)
+ {
+ __u32 *value;
+ value = bpf_map_lookup_elem(bpf_map__fd(map), &cgrp_fd);
+ if (value)
+ return *value;
+ return 0;
+ }
+
+Difference Between BPF_MAP_TYPE_CGRP_STORAGE and BPF_MAP_TYPE_CGROUP_STORAGE
+============================================================================
+
+The old cgroup storage map ``BPF_MAP_TYPE_CGROUP_STORAGE`` has been marked as
+deprecated (renamed to ``BPF_MAP_TYPE_CGROUP_STORAGE_DEPRECATED``). The new
+``BPF_MAP_TYPE_CGRP_STORAGE`` map should be used instead. The following
+illusates the main difference between ``BPF_MAP_TYPE_CGRP_STORAGE`` and
+``BPF_MAP_TYPE_CGROUP_STORAGE_DEPRECATED``.
+
+(1). ``BPF_MAP_TYPE_CGRP_STORAGE`` can be used by all program types while
+ ``BPF_MAP_TYPE_CGROUP_STORAGE_DEPRECATED`` is available only to cgroup program types
+ like BPF_CGROUP_INET_INGRESS or BPF_CGROUP_SOCK_OPS, etc.
+
+(2). ``BPF_MAP_TYPE_CGRP_STORAGE`` supports local storage for more than one
+ cgroup while ``BPF_MAP_TYPE_CGROUP_STORAGE_DEPRECATED`` only supports one cgroup
+ which is attached by a BPF program.
+
+(3). ``BPF_MAP_TYPE_CGROUP_STORAGE_DEPRECATED`` allocates local storage at attach time so
+ ``bpf_get_local_storage()`` always returns non-NULL local storage.
+ ``BPF_MAP_TYPE_CGRP_STORAGE`` allocates local storage at runtime so
+ it is possible that ``bpf_cgrp_storage_get()`` may return null local storage.
+ To avoid such null local storage issue, user space can do
+ ``bpf_map_update_elem()`` to pre-allocate local storage before a BPF program
+ is attached.
+
+(4). ``BPF_MAP_TYPE_CGRP_STORAGE`` supports deleting local storage by a BPF program
+ while ``BPF_MAP_TYPE_CGROUP_STORAGE_DEPRECATED`` only deletes storage during
+ prog detach time.
+
+So overall, ``BPF_MAP_TYPE_CGRP_STORAGE`` supports all ``BPF_MAP_TYPE_CGROUP_STORAGE_DEPRECATED``
+functionality and beyond. It is recommended to use ``BPF_MAP_TYPE_CGRP_STORAGE``
+instead of ``BPF_MAP_TYPE_CGROUP_STORAGE_DEPRECATED``.
diff --git a/Documentation/bpf/map_cpumap.rst b/Documentation/bpf/map_cpumap.rst
new file mode 100644
index 000000000000..923cfc8ab51f
--- /dev/null
+++ b/Documentation/bpf/map_cpumap.rst
@@ -0,0 +1,177 @@
+.. SPDX-License-Identifier: GPL-2.0-only
+.. Copyright (C) 2022 Red Hat, Inc.
+
+===================
+BPF_MAP_TYPE_CPUMAP
+===================
+
+.. note::
+ - ``BPF_MAP_TYPE_CPUMAP`` was introduced in kernel version 4.15
+
+.. kernel-doc:: kernel/bpf/cpumap.c
+ :doc: cpu map
+
+An example use-case for this map type is software based Receive Side Scaling (RSS).
+
+The CPUMAP represents the CPUs in the system indexed as the map-key, and the
+map-value is the config setting (per CPUMAP entry). Each CPUMAP entry has a dedicated
+kernel thread bound to the given CPU to represent the remote CPU execution unit.
+
+Starting from Linux kernel version 5.9 the CPUMAP can run a second XDP program
+on the remote CPU. This allows an XDP program to split its processing across
+multiple CPUs. For example, a scenario where the initial CPU (that sees/receives
+the packets) needs to do minimal packet processing and the remote CPU (to which
+the packet is directed) can afford to spend more cycles processing the frame. The
+initial CPU is where the XDP redirect program is executed. The remote CPU
+receives raw ``xdp_frame`` objects.
+
+Usage
+=====
+
+Kernel BPF
+----------
+bpf_redirect_map()
+^^^^^^^^^^^^^^^^^^
+.. code-block:: c
+
+ long bpf_redirect_map(struct bpf_map *map, u32 key, u64 flags)
+
+Redirect the packet to the endpoint referenced by ``map`` at index ``key``.
+For ``BPF_MAP_TYPE_CPUMAP`` this map contains references to CPUs.
+
+The lower two bits of ``flags`` are used as the return code if the map lookup
+fails. This is so that the return value can be one of the XDP program return
+codes up to ``XDP_TX``, as chosen by the caller.
+
+User space
+----------
+.. note::
+ CPUMAP entries can only be updated/looked up/deleted from user space and not
+ from an eBPF program. Trying to call these functions from a kernel eBPF
+ program will result in the program failing to load and a verifier warning.
+
+bpf_map_update_elem()
+^^^^^^^^^^^^^^^^^^^^^
+.. code-block:: c
+
+ int bpf_map_update_elem(int fd, const void *key, const void *value, __u64 flags);
+
+CPU entries can be added or updated using the ``bpf_map_update_elem()``
+helper. This helper replaces existing elements atomically. The ``value`` parameter
+can be ``struct bpf_cpumap_val``.
+
+ .. code-block:: c
+
+ struct bpf_cpumap_val {
+ __u32 qsize; /* queue size to remote target CPU */
+ union {
+ int fd; /* prog fd on map write */
+ __u32 id; /* prog id on map read */
+ } bpf_prog;
+ };
+
+The flags argument can be one of the following:
+ - BPF_ANY: Create a new element or update an existing element.
+ - BPF_NOEXIST: Create a new element only if it did not exist.
+ - BPF_EXIST: Update an existing element.
+
+bpf_map_lookup_elem()
+^^^^^^^^^^^^^^^^^^^^^
+.. code-block:: c
+
+ int bpf_map_lookup_elem(int fd, const void *key, void *value);
+
+CPU entries can be retrieved using the ``bpf_map_lookup_elem()``
+helper.
+
+bpf_map_delete_elem()
+^^^^^^^^^^^^^^^^^^^^^
+.. code-block:: c
+
+ int bpf_map_delete_elem(int fd, const void *key);
+
+CPU entries can be deleted using the ``bpf_map_delete_elem()``
+helper. This helper will return 0 on success, or negative error in case of
+failure.
+
+Examples
+========
+Kernel
+------
+
+The following code snippet shows how to declare a ``BPF_MAP_TYPE_CPUMAP`` called
+``cpu_map`` and how to redirect packets to a remote CPU using a round robin scheme.
+
+.. code-block:: c
+
+ struct {
+ __uint(type, BPF_MAP_TYPE_CPUMAP);
+ __type(key, __u32);
+ __type(value, struct bpf_cpumap_val);
+ __uint(max_entries, 12);
+ } cpu_map SEC(".maps");
+
+ struct {
+ __uint(type, BPF_MAP_TYPE_ARRAY);
+ __type(key, __u32);
+ __type(value, __u32);
+ __uint(max_entries, 12);
+ } cpus_available SEC(".maps");
+
+ struct {
+ __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
+ __type(key, __u32);
+ __type(value, __u32);
+ __uint(max_entries, 1);
+ } cpus_iterator SEC(".maps");
+
+ SEC("xdp")
+ int xdp_redir_cpu_round_robin(struct xdp_md *ctx)
+ {
+ __u32 key = 0;
+ __u32 cpu_dest = 0;
+ __u32 *cpu_selected, *cpu_iterator;
+ __u32 cpu_idx;
+
+ cpu_iterator = bpf_map_lookup_elem(&cpus_iterator, &key);
+ if (!cpu_iterator)
+ return XDP_ABORTED;
+ cpu_idx = *cpu_iterator;
+
+ *cpu_iterator += 1;
+ if (*cpu_iterator == bpf_num_possible_cpus())
+ *cpu_iterator = 0;
+
+ cpu_selected = bpf_map_lookup_elem(&cpus_available, &cpu_idx);
+ if (!cpu_selected)
+ return XDP_ABORTED;
+ cpu_dest = *cpu_selected;
+
+ if (cpu_dest >= bpf_num_possible_cpus())
+ return XDP_ABORTED;
+
+ return bpf_redirect_map(&cpu_map, cpu_dest, 0);
+ }
+
+User space
+----------
+
+The following code snippet shows how to dynamically set the max_entries for a
+CPUMAP to the max number of cpus available on the system.
+
+.. code-block:: c
+
+ int set_max_cpu_entries(struct bpf_map *cpu_map)
+ {
+ if (bpf_map__set_max_entries(cpu_map, libbpf_num_possible_cpus()) < 0) {
+ fprintf(stderr, "Failed to set max entries for cpu_map map: %s",
+ strerror(errno));
+ return -1;
+ }
+ return 0;
+ }
+
+References
+===========
+
+- https://developers.redhat.com/blog/2021/05/13/receive-side-scaling-rss-with-ebpf-and-cpumap#redirecting_into_a_cpumap
diff --git a/Documentation/bpf/map_devmap.rst b/Documentation/bpf/map_devmap.rst
new file mode 100644
index 000000000000..927312c7b8c8
--- /dev/null
+++ b/Documentation/bpf/map_devmap.rst
@@ -0,0 +1,238 @@
+.. SPDX-License-Identifier: GPL-2.0-only
+.. Copyright (C) 2022 Red Hat, Inc.
+
+=================================================
+BPF_MAP_TYPE_DEVMAP and BPF_MAP_TYPE_DEVMAP_HASH
+=================================================
+
+.. note::
+ - ``BPF_MAP_TYPE_DEVMAP`` was introduced in kernel version 4.14
+ - ``BPF_MAP_TYPE_DEVMAP_HASH`` was introduced in kernel version 5.4
+
+``BPF_MAP_TYPE_DEVMAP`` and ``BPF_MAP_TYPE_DEVMAP_HASH`` are BPF maps primarily
+used as backend maps for the XDP BPF helper call ``bpf_redirect_map()``.
+``BPF_MAP_TYPE_DEVMAP`` is backed by an array that uses the key as
+the index to lookup a reference to a net device. While ``BPF_MAP_TYPE_DEVMAP_HASH``
+is backed by a hash table that uses a key to lookup a reference to a net device.
+The user provides either <``key``/ ``ifindex``> or <``key``/ ``struct bpf_devmap_val``>
+pairs to update the maps with new net devices.
+
+.. note::
+ - The key to a hash map doesn't have to be an ``ifindex``.
+ - While ``BPF_MAP_TYPE_DEVMAP_HASH`` allows for densely packing the net devices
+ it comes at the cost of a hash of the key when performing a look up.
+
+The setup and packet enqueue/send code is shared between the two types of
+devmap; only the lookup and insertion is different.
+
+Usage
+=====
+Kernel BPF
+----------
+bpf_redirect_map()
+^^^^^^^^^^^^^^^^^^
+.. code-block:: c
+
+ long bpf_redirect_map(struct bpf_map *map, u32 key, u64 flags)
+
+Redirect the packet to the endpoint referenced by ``map`` at index ``key``.
+For ``BPF_MAP_TYPE_DEVMAP`` and ``BPF_MAP_TYPE_DEVMAP_HASH`` this map contains
+references to net devices (for forwarding packets through other ports).
+
+The lower two bits of *flags* are used as the return code if the map lookup
+fails. This is so that the return value can be one of the XDP program return
+codes up to ``XDP_TX``, as chosen by the caller. The higher bits of ``flags``
+can be set to ``BPF_F_BROADCAST`` or ``BPF_F_EXCLUDE_INGRESS`` as defined
+below.
+
+With ``BPF_F_BROADCAST`` the packet will be broadcast to all the interfaces
+in the map, with ``BPF_F_EXCLUDE_INGRESS`` the ingress interface will be excluded
+from the broadcast.
+
+.. note::
+ - The key is ignored if BPF_F_BROADCAST is set.
+ - The broadcast feature can also be used to implement multicast forwarding:
+ simply create multiple DEVMAPs, each one corresponding to a single multicast group.
+
+This helper will return ``XDP_REDIRECT`` on success, or the value of the two
+lower bits of the ``flags`` argument if the map lookup fails.
+
+More information about redirection can be found :doc:`redirect`
+
+bpf_map_lookup_elem()
+^^^^^^^^^^^^^^^^^^^^^
+.. code-block:: c
+
+ void *bpf_map_lookup_elem(struct bpf_map *map, const void *key)
+
+Net device entries can be retrieved using the ``bpf_map_lookup_elem()``
+helper.
+
+User space
+----------
+.. note::
+ DEVMAP entries can only be updated/deleted from user space and not
+ from an eBPF program. Trying to call these functions from a kernel eBPF
+ program will result in the program failing to load and a verifier warning.
+
+bpf_map_update_elem()
+^^^^^^^^^^^^^^^^^^^^^
+.. code-block:: c
+
+ int bpf_map_update_elem(int fd, const void *key, const void *value, __u64 flags);
+
+Net device entries can be added or updated using the ``bpf_map_update_elem()``
+helper. This helper replaces existing elements atomically. The ``value`` parameter
+can be ``struct bpf_devmap_val`` or a simple ``int ifindex`` for backwards
+compatibility.
+
+ .. code-block:: c
+
+ struct bpf_devmap_val {
+ __u32 ifindex; /* device index */
+ union {
+ int fd; /* prog fd on map write */
+ __u32 id; /* prog id on map read */
+ } bpf_prog;
+ };
+
+The ``flags`` argument can be one of the following:
+ - ``BPF_ANY``: Create a new element or update an existing element.
+ - ``BPF_NOEXIST``: Create a new element only if it did not exist.
+ - ``BPF_EXIST``: Update an existing element.
+
+DEVMAPs can associate a program with a device entry by adding a ``bpf_prog.fd``
+to ``struct bpf_devmap_val``. Programs are run after ``XDP_REDIRECT`` and have
+access to both Rx device and Tx device. The program associated with the ``fd``
+must have type XDP with expected attach type ``xdp_devmap``.
+When a program is associated with a device index, the program is run on an
+``XDP_REDIRECT`` and before the buffer is added to the per-cpu queue. Examples
+of how to attach/use xdp_devmap progs can be found in the kernel selftests:
+
+- ``tools/testing/selftests/bpf/prog_tests/xdp_devmap_attach.c``
+- ``tools/testing/selftests/bpf/progs/test_xdp_with_devmap_helpers.c``
+
+bpf_map_lookup_elem()
+^^^^^^^^^^^^^^^^^^^^^
+.. code-block:: c
+
+.. c:function::
+ int bpf_map_lookup_elem(int fd, const void *key, void *value);
+
+Net device entries can be retrieved using the ``bpf_map_lookup_elem()``
+helper.
+
+bpf_map_delete_elem()
+^^^^^^^^^^^^^^^^^^^^^
+.. code-block:: c
+
+.. c:function::
+ int bpf_map_delete_elem(int fd, const void *key);
+
+Net device entries can be deleted using the ``bpf_map_delete_elem()``
+helper. This helper will return 0 on success, or negative error in case of
+failure.
+
+Examples
+========
+
+Kernel BPF
+----------
+
+The following code snippet shows how to declare a ``BPF_MAP_TYPE_DEVMAP``
+called tx_port.
+
+.. code-block:: c
+
+ struct {
+ __uint(type, BPF_MAP_TYPE_DEVMAP);
+ __type(key, __u32);
+ __type(value, __u32);
+ __uint(max_entries, 256);
+ } tx_port SEC(".maps");
+
+The following code snippet shows how to declare a ``BPF_MAP_TYPE_DEVMAP_HASH``
+called forward_map.
+
+.. code-block:: c
+
+ struct {
+ __uint(type, BPF_MAP_TYPE_DEVMAP_HASH);
+ __type(key, __u32);
+ __type(value, struct bpf_devmap_val);
+ __uint(max_entries, 32);
+ } forward_map SEC(".maps");
+
+.. note::
+
+ The value type in the DEVMAP above is a ``struct bpf_devmap_val``
+
+The following code snippet shows a simple xdp_redirect_map program. This program
+would work with a user space program that populates the devmap ``forward_map`` based
+on ingress ifindexes. The BPF program (below) is redirecting packets using the
+ingress ``ifindex`` as the ``key``.
+
+.. code-block:: c
+
+ SEC("xdp")
+ int xdp_redirect_map_func(struct xdp_md *ctx)
+ {
+ int index = ctx->ingress_ifindex;
+
+ return bpf_redirect_map(&forward_map, index, 0);
+ }
+
+The following code snippet shows a BPF program that is broadcasting packets to
+all the interfaces in the ``tx_port`` devmap.
+
+.. code-block:: c
+
+ SEC("xdp")
+ int xdp_redirect_map_func(struct xdp_md *ctx)
+ {
+ return bpf_redirect_map(&tx_port, 0, BPF_F_BROADCAST | BPF_F_EXCLUDE_INGRESS);
+ }
+
+User space
+----------
+
+The following code snippet shows how to update a devmap called ``tx_port``.
+
+.. code-block:: c
+
+ int update_devmap(int ifindex, int redirect_ifindex)
+ {
+ int ret;
+
+ ret = bpf_map_update_elem(bpf_map__fd(tx_port), &ifindex, &redirect_ifindex, 0);
+ if (ret < 0) {
+ fprintf(stderr, "Failed to update devmap_ value: %s\n",
+ strerror(errno));
+ }
+
+ return ret;
+ }
+
+The following code snippet shows how to update a hash_devmap called ``forward_map``.
+
+.. code-block:: c
+
+ int update_devmap(int ifindex, int redirect_ifindex)
+ {
+ struct bpf_devmap_val devmap_val = { .ifindex = redirect_ifindex };
+ int ret;
+
+ ret = bpf_map_update_elem(bpf_map__fd(forward_map), &ifindex, &devmap_val, 0);
+ if (ret < 0) {
+ fprintf(stderr, "Failed to update devmap_ value: %s\n",
+ strerror(errno));
+ }
+ return ret;
+ }
+
+References
+===========
+
+- https://lwn.net/Articles/728146/
+- https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git/commit/?id=6f9d451ab1a33728adb72d7ff66a7b374d665176
+- https://elixir.bootlin.com/linux/latest/source/net/core/filter.c#L4106
diff --git a/Documentation/bpf/map_hash.rst b/Documentation/bpf/map_hash.rst
index e85120878b27..8669426264c6 100644
--- a/Documentation/bpf/map_hash.rst
+++ b/Documentation/bpf/map_hash.rst
@@ -34,7 +34,14 @@ the ``BPF_F_NO_COMMON_LRU`` flag when calling ``bpf_map_create``.
Usage
=====
-.. c:function::
+Kernel BPF
+----------
+
+bpf_map_update_elem()
+~~~~~~~~~~~~~~~~~~~~~
+
+.. code-block:: c
+
long bpf_map_update_elem(struct bpf_map *map, const void *key, const void *value, u64 flags)
Hash entries can be added or updated using the ``bpf_map_update_elem()``
@@ -49,14 +56,22 @@ parameter can be used to control the update behaviour:
``bpf_map_update_elem()`` returns 0 on success, or negative error in
case of failure.
-.. c:function::
+bpf_map_lookup_elem()
+~~~~~~~~~~~~~~~~~~~~~
+
+.. code-block:: c
+
void *bpf_map_lookup_elem(struct bpf_map *map, const void *key)
Hash entries can be retrieved using the ``bpf_map_lookup_elem()``
helper. This helper returns a pointer to the value associated with
``key``, or ``NULL`` if no entry was found.
-.. c:function::
+bpf_map_delete_elem()
+~~~~~~~~~~~~~~~~~~~~~
+
+.. code-block:: c
+
long bpf_map_delete_elem(struct bpf_map *map, const void *key)
Hash entries can be deleted using the ``bpf_map_delete_elem()``
@@ -70,7 +85,11 @@ For ``BPF_MAP_TYPE_PERCPU_HASH`` and ``BPF_MAP_TYPE_LRU_PERCPU_HASH``
the ``bpf_map_update_elem()`` and ``bpf_map_lookup_elem()`` helpers
automatically access the hash slot for the current CPU.
-.. c:function::
+bpf_map_lookup_percpu_elem()
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. code-block:: c
+
void *bpf_map_lookup_percpu_elem(struct bpf_map *map, const void *key, u32 cpu)
The ``bpf_map_lookup_percpu_elem()`` helper can be used to lookup the
@@ -89,7 +108,11 @@ See ``tools/testing/selftests/bpf/progs/test_spin_lock.c``.
Userspace
---------
-.. c:function::
+bpf_map_get_next_key()
+~~~~~~~~~~~~~~~~~~~~~~
+
+.. code-block:: c
+
int bpf_map_get_next_key(int fd, const void *cur_key, void *next_key)
In userspace, it is possible to iterate through the keys of a hash using
diff --git a/Documentation/bpf/map_lpm_trie.rst b/Documentation/bpf/map_lpm_trie.rst
new file mode 100644
index 000000000000..74d64a30f500
--- /dev/null
+++ b/Documentation/bpf/map_lpm_trie.rst
@@ -0,0 +1,197 @@
+.. SPDX-License-Identifier: GPL-2.0-only
+.. Copyright (C) 2022 Red Hat, Inc.
+
+=====================
+BPF_MAP_TYPE_LPM_TRIE
+=====================
+
+.. note::
+ - ``BPF_MAP_TYPE_LPM_TRIE`` was introduced in kernel version 4.11
+
+``BPF_MAP_TYPE_LPM_TRIE`` provides a longest prefix match algorithm that
+can be used to match IP addresses to a stored set of prefixes.
+Internally, data is stored in an unbalanced trie of nodes that uses
+``prefixlen,data`` pairs as its keys. The ``data`` is interpreted in
+network byte order, i.e. big endian, so ``data[0]`` stores the most
+significant byte.
+
+LPM tries may be created with a maximum prefix length that is a multiple
+of 8, in the range from 8 to 2048. The key used for lookup and update
+operations is a ``struct bpf_lpm_trie_key``, extended by
+``max_prefixlen/8`` bytes.
+
+- For IPv4 addresses the data length is 4 bytes
+- For IPv6 addresses the data length is 16 bytes
+
+The value type stored in the LPM trie can be any user defined type.
+
+.. note::
+ When creating a map of type ``BPF_MAP_TYPE_LPM_TRIE`` you must set the
+ ``BPF_F_NO_PREALLOC`` flag.
+
+Usage
+=====
+
+Kernel BPF
+----------
+
+bpf_map_lookup_elem()
+~~~~~~~~~~~~~~~~~~~~~
+
+.. code-block:: c
+
+ void *bpf_map_lookup_elem(struct bpf_map *map, const void *key)
+
+The longest prefix entry for a given data value can be found using the
+``bpf_map_lookup_elem()`` helper. This helper returns a pointer to the
+value associated with the longest matching ``key``, or ``NULL`` if no
+entry was found.
+
+The ``key`` should have ``prefixlen`` set to ``max_prefixlen`` when
+performing longest prefix lookups. For example, when searching for the
+longest prefix match for an IPv4 address, ``prefixlen`` should be set to
+``32``.
+
+bpf_map_update_elem()
+~~~~~~~~~~~~~~~~~~~~~
+
+.. code-block:: c
+
+ long bpf_map_update_elem(struct bpf_map *map, const void *key, const void *value, u64 flags)
+
+Prefix entries can be added or updated using the ``bpf_map_update_elem()``
+helper. This helper replaces existing elements atomically.
+
+``bpf_map_update_elem()`` returns ``0`` on success, or negative error in
+case of failure.
+
+ .. note::
+ The flags parameter must be one of BPF_ANY, BPF_NOEXIST or BPF_EXIST,
+ but the value is ignored, giving BPF_ANY semantics.
+
+bpf_map_delete_elem()
+~~~~~~~~~~~~~~~~~~~~~
+
+.. code-block:: c
+
+ long bpf_map_delete_elem(struct bpf_map *map, const void *key)
+
+Prefix entries can be deleted using the ``bpf_map_delete_elem()``
+helper. This helper will return 0 on success, or negative error in case
+of failure.
+
+Userspace
+---------
+
+Access from userspace uses libbpf APIs with the same names as above, with
+the map identified by ``fd``.
+
+bpf_map_get_next_key()
+~~~~~~~~~~~~~~~~~~~~~~
+
+.. code-block:: c
+
+ int bpf_map_get_next_key (int fd, const void *cur_key, void *next_key)
+
+A userspace program can iterate through the entries in an LPM trie using
+libbpf's ``bpf_map_get_next_key()`` function. The first key can be
+fetched by calling ``bpf_map_get_next_key()`` with ``cur_key`` set to
+``NULL``. Subsequent calls will fetch the next key that follows the
+current key. ``bpf_map_get_next_key()`` returns ``0`` on success,
+``-ENOENT`` if ``cur_key`` is the last key in the trie, or negative
+error in case of failure.
+
+``bpf_map_get_next_key()`` will iterate through the LPM trie elements
+from leftmost leaf first. This means that iteration will return more
+specific keys before less specific ones.
+
+Examples
+========
+
+Please see ``tools/testing/selftests/bpf/test_lpm_map.c`` for examples
+of LPM trie usage from userspace. The code snippets below demonstrate
+API usage.
+
+Kernel BPF
+----------
+
+The following BPF code snippet shows how to declare a new LPM trie for IPv4
+address prefixes:
+
+.. code-block:: c
+
+ #include <linux/bpf.h>
+ #include <bpf/bpf_helpers.h>
+
+ struct ipv4_lpm_key {
+ __u32 prefixlen;
+ __u32 data;
+ };
+
+ struct {
+ __uint(type, BPF_MAP_TYPE_LPM_TRIE);
+ __type(key, struct ipv4_lpm_key);
+ __type(value, __u32);
+ __uint(map_flags, BPF_F_NO_PREALLOC);
+ __uint(max_entries, 255);
+ } ipv4_lpm_map SEC(".maps");
+
+The following BPF code snippet shows how to lookup by IPv4 address:
+
+.. code-block:: c
+
+ void *lookup(__u32 ipaddr)
+ {
+ struct ipv4_lpm_key key = {
+ .prefixlen = 32,
+ .data = ipaddr
+ };
+
+ return bpf_map_lookup_elem(&ipv4_lpm_map, &key);
+ }
+
+Userspace
+---------
+
+The following snippet shows how to insert an IPv4 prefix entry into an
+LPM trie:
+
+.. code-block:: c
+
+ int add_prefix_entry(int lpm_fd, __u32 addr, __u32 prefixlen, struct value *value)
+ {
+ struct ipv4_lpm_key ipv4_key = {
+ .prefixlen = prefixlen,
+ .data = addr
+ };
+ return bpf_map_update_elem(lpm_fd, &ipv4_key, value, BPF_ANY);
+ }
+
+The following snippet shows a userspace program walking through the entries
+of an LPM trie:
+
+
+.. code-block:: c
+
+ #include <bpf/libbpf.h>
+ #include <bpf/bpf.h>
+
+ void iterate_lpm_trie(int map_fd)
+ {
+ struct ipv4_lpm_key *cur_key = NULL;
+ struct ipv4_lpm_key next_key;
+ struct value value;
+ int err;
+
+ for (;;) {
+ err = bpf_map_get_next_key(map_fd, cur_key, &next_key);
+ if (err)
+ break;
+
+ bpf_map_lookup_elem(map_fd, &next_key, &value);
+
+ /* Use key and value here */
+
+ cur_key = &next_key;
+ }
+ }
diff --git a/Documentation/bpf/map_of_maps.rst b/Documentation/bpf/map_of_maps.rst
new file mode 100644
index 000000000000..7b5617c2d017
--- /dev/null
+++ b/Documentation/bpf/map_of_maps.rst
@@ -0,0 +1,130 @@
+.. SPDX-License-Identifier: GPL-2.0-only
+.. Copyright (C) 2022 Red Hat, Inc.
+
+========================================================
+BPF_MAP_TYPE_ARRAY_OF_MAPS and BPF_MAP_TYPE_HASH_OF_MAPS
+========================================================
+
+.. note::
+ - ``BPF_MAP_TYPE_ARRAY_OF_MAPS`` and ``BPF_MAP_TYPE_HASH_OF_MAPS`` were
+ introduced in kernel version 4.12
+
+``BPF_MAP_TYPE_ARRAY_OF_MAPS`` and ``BPF_MAP_TYPE_HASH_OF_MAPS`` provide general
+purpose support for map in map storage. One level of nesting is supported, where
+an outer map contains instances of a single type of inner map, for example
+``array_of_maps->sock_map``.
+
+When creating an outer map, an inner map instance is used to initialize the
+metadata that the outer map holds about its inner maps. This inner map has a
+separate lifetime from the outer map and can be deleted after the outer map has
+been created.
+
+The outer map supports element lookup, update and delete from user space using
+the syscall API. A BPF program is only allowed to do element lookup in the outer
+map.
+
+.. note::
+ - Multi-level nesting is not supported.
+ - Any BPF map type can be used as an inner map, except for
+ ``BPF_MAP_TYPE_PROG_ARRAY``.
+ - A BPF program cannot update or delete outer map entries.
+
+For ``BPF_MAP_TYPE_ARRAY_OF_MAPS`` the key is an unsigned 32-bit integer index
+into the array. The array is a fixed size with ``max_entries`` elements that are
+zero initialized when created.
+
+For ``BPF_MAP_TYPE_HASH_OF_MAPS`` the key type can be chosen when defining the
+map. The kernel is responsible for allocating and freeing key/value pairs, up to
+the max_entries limit that you specify. Hash maps use pre-allocation of hash
+table elements by default. The ``BPF_F_NO_PREALLOC`` flag can be used to disable
+pre-allocation when it is too memory expensive.
+
+Usage
+=====
+
+Kernel BPF Helper
+-----------------
+
+bpf_map_lookup_elem()
+~~~~~~~~~~~~~~~~~~~~~
+
+.. code-block:: c
+
+ void *bpf_map_lookup_elem(struct bpf_map *map, const void *key)
+
+Inner maps can be retrieved using the ``bpf_map_lookup_elem()`` helper. This
+helper returns a pointer to the inner map, or ``NULL`` if no entry was found.
+
+Examples
+========
+
+Kernel BPF Example
+------------------
+
+This snippet shows how to create and initialise an array of devmaps in a BPF
+program. Note that the outer array can only be modified from user space using
+the syscall API.
+
+.. code-block:: c
+
+ struct inner_map {
+ __uint(type, BPF_MAP_TYPE_DEVMAP);
+ __uint(max_entries, 10);
+ __type(key, __u32);
+ __type(value, __u32);
+ } inner_map1 SEC(".maps"), inner_map2 SEC(".maps");
+
+ struct {
+ __uint(type, BPF_MAP_TYPE_ARRAY_OF_MAPS);
+ __uint(max_entries, 2);
+ __type(key, __u32);
+ __array(values, struct inner_map);
+ } outer_map SEC(".maps") = {
+ .values = { &inner_map1,
+ &inner_map2 }
+ };
+
+See ``progs/test_btf_map_in_map.c`` in ``tools/testing/selftests/bpf`` for more
+examples of declarative initialisation of outer maps.
+
+User Space
+----------
+
+This snippet shows how to create an array based outer map:
+
+.. code-block:: c
+
+ int create_outer_array(int inner_fd) {
+ LIBBPF_OPTS(bpf_map_create_opts, opts, .inner_map_fd = inner_fd);
+ int fd;
+
+ fd = bpf_map_create(BPF_MAP_TYPE_ARRAY_OF_MAPS,
+ "example_array", /* name */
+ sizeof(__u32), /* key size */
+ sizeof(__u32), /* value size */
+ 256, /* max entries */
+ &opts); /* create opts */
+ return fd;
+ }
+
+
+This snippet shows how to add an inner map to an outer map:
+
+.. code-block:: c
+
+ int add_devmap(int outer_fd, int index, const char *name) {
+ int fd;
+
+ fd = bpf_map_create(BPF_MAP_TYPE_DEVMAP, name,
+ sizeof(__u32), sizeof(__u32), 256, NULL);
+ if (fd < 0)
+ return fd;
+
+ return bpf_map_update_elem(outer_fd, &index, &fd, BPF_ANY);
+ }
+
+References
+==========
+
+- https://lore.kernel.org/netdev/20170322170035.923581-3-kafai@fb.com/
+- https://lore.kernel.org/netdev/20170322170035.923581-4-kafai@fb.com/
diff --git a/Documentation/bpf/map_queue_stack.rst b/Documentation/bpf/map_queue_stack.rst
new file mode 100644
index 000000000000..8d14ed49d6e1
--- /dev/null
+++ b/Documentation/bpf/map_queue_stack.rst
@@ -0,0 +1,146 @@
+.. SPDX-License-Identifier: GPL-2.0-only
+.. Copyright (C) 2022 Red Hat, Inc.
+
+=========================================
+BPF_MAP_TYPE_QUEUE and BPF_MAP_TYPE_STACK
+=========================================
+
+.. note::
+ - ``BPF_MAP_TYPE_QUEUE`` and ``BPF_MAP_TYPE_STACK`` were introduced
+ in kernel version 4.20
+
+``BPF_MAP_TYPE_QUEUE`` provides FIFO storage and ``BPF_MAP_TYPE_STACK``
+provides LIFO storage for BPF programs. These maps support peek, pop and
+push operations that are exposed to BPF programs through the respective
+helpers. These operations are exposed to userspace applications using
+the existing ``bpf`` syscall in the following way:
+
+- ``BPF_MAP_LOOKUP_ELEM`` -> peek
+- ``BPF_MAP_LOOKUP_AND_DELETE_ELEM`` -> pop
+- ``BPF_MAP_UPDATE_ELEM`` -> push
+
+``BPF_MAP_TYPE_QUEUE`` and ``BPF_MAP_TYPE_STACK`` do not support
+``BPF_F_NO_PREALLOC``.
+
+Usage
+=====
+
+Kernel BPF
+----------
+
+bpf_map_push_elem()
+~~~~~~~~~~~~~~~~~~~
+
+.. code-block:: c
+
+ long bpf_map_push_elem(struct bpf_map *map, const void *value, u64 flags)
+
+An element ``value`` can be added to a queue or stack using the
+``bpf_map_push_elem`` helper. The ``flags`` parameter must be set to
+``BPF_ANY`` or ``BPF_EXIST``. If ``flags`` is set to ``BPF_EXIST`` then,
+when the queue or stack is full, the oldest element will be removed to
+make room for ``value`` to be added. Returns ``0`` on success, or
+negative error in case of failure.
+
+bpf_map_peek_elem()
+~~~~~~~~~~~~~~~~~~~
+
+.. code-block:: c
+
+ long bpf_map_peek_elem(struct bpf_map *map, void *value)
+
+This helper fetches an element ``value`` from a queue or stack without
+removing it. Returns ``0`` on success, or negative error in case of
+failure.
+
+bpf_map_pop_elem()
+~~~~~~~~~~~~~~~~~~
+
+.. code-block:: c
+
+ long bpf_map_pop_elem(struct bpf_map *map, void *value)
+
+This helper removes an element into ``value`` from a queue or
+stack. Returns ``0`` on success, or negative error in case of failure.
+
+
+Userspace
+---------
+
+bpf_map_update_elem()
+~~~~~~~~~~~~~~~~~~~~~
+
+.. code-block:: c
+
+ int bpf_map_update_elem (int fd, const void *key, const void *value, __u64 flags)
+
+A userspace program can push ``value`` onto a queue or stack using libbpf's
+``bpf_map_update_elem`` function. The ``key`` parameter must be set to
+``NULL`` and ``flags`` must be set to ``BPF_ANY`` or ``BPF_EXIST``, with the
+same semantics as the ``bpf_map_push_elem`` kernel helper. Returns ``0`` on
+success, or negative error in case of failure.
+
+bpf_map_lookup_elem()
+~~~~~~~~~~~~~~~~~~~~~
+
+.. code-block:: c
+
+ int bpf_map_lookup_elem (int fd, const void *key, void *value)
+
+A userspace program can peek at the ``value`` at the head of a queue or stack
+using the libbpf ``bpf_map_lookup_elem`` function. The ``key`` parameter must be
+set to ``NULL``. Returns ``0`` on success, or negative error in case of
+failure.
+
+bpf_map_lookup_and_delete_elem()
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. code-block:: c
+
+ int bpf_map_lookup_and_delete_elem (int fd, const void *key, void *value)
+
+A userspace program can pop a ``value`` from the head of a queue or stack using
+the libbpf ``bpf_map_lookup_and_delete_elem`` function. The ``key`` parameter
+must be set to ``NULL``. Returns ``0`` on success, or negative error in case of
+failure.
+
+Examples
+========
+
+Kernel BPF
+----------
+
+This snippet shows how to declare a queue in a BPF program:
+
+.. code-block:: c
+
+ struct {
+ __uint(type, BPF_MAP_TYPE_QUEUE);
+ __type(value, __u32);
+ __uint(max_entries, 10);
+ } queue SEC(".maps");
+
+
+Userspace
+---------
+
+This snippet shows how to use libbpf's low-level API to create a queue from
+userspace:
+
+.. code-block:: c
+
+ int create_queue()
+ {
+ return bpf_map_create(BPF_MAP_TYPE_QUEUE,
+ "sample_queue", /* name */
+ 0, /* key size, must be zero */
+ sizeof(__u32), /* value size */
+ 10, /* max entries */
+ NULL); /* create options */
+ }
+
+
+References
+==========
+
+https://lwn.net/ml/netdev/153986858555.9127.14517764371945179514.stgit@kernel/
diff --git a/Documentation/bpf/map_sk_storage.rst b/Documentation/bpf/map_sk_storage.rst
new file mode 100644
index 000000000000..047e16c8aaa8
--- /dev/null
+++ b/Documentation/bpf/map_sk_storage.rst
@@ -0,0 +1,155 @@
+.. SPDX-License-Identifier: GPL-2.0-only
+.. Copyright (C) 2022 Red Hat, Inc.
+
+=======================
+BPF_MAP_TYPE_SK_STORAGE
+=======================
+
+.. note::
+ - ``BPF_MAP_TYPE_SK_STORAGE`` was introduced in kernel version 5.2
+
+``BPF_MAP_TYPE_SK_STORAGE`` is used to provide socket-local storage for BPF
+programs. A map of type ``BPF_MAP_TYPE_SK_STORAGE`` declares the type of storage
+to be provided and acts as the handle for accessing the socket-local
+storage. The values for maps of type ``BPF_MAP_TYPE_SK_STORAGE`` are stored
+locally with each socket instead of with the map. The kernel is responsible for
+allocating storage for a socket when requested and for freeing the storage when
+either the map or the socket is deleted.
+
+.. note::
+ - The key type must be ``int`` and ``max_entries`` must be set to ``0``.
+ - The ``BPF_F_NO_PREALLOC`` flag must be used when creating a map for
+ socket-local storage.
+
+Usage
+=====
+
+Kernel BPF
+----------
+
+bpf_sk_storage_get()
+~~~~~~~~~~~~~~~~~~~~
+
+.. code-block:: c
+
+ void *bpf_sk_storage_get(struct bpf_map *map, void *sk, void *value, u64 flags)
+
+Socket-local storage can be retrieved using the ``bpf_sk_storage_get()``
+helper. The helper gets the storage from ``sk`` that is associated with ``map``.
+If the ``BPF_LOCAL_STORAGE_GET_F_CREATE`` flag is used then
+``bpf_sk_storage_get()`` will create the storage for ``sk`` if it does not
+already exist. ``value`` can be used together with
+``BPF_LOCAL_STORAGE_GET_F_CREATE`` to initialize the storage value, otherwise it
+will be zero initialized. Returns a pointer to the storage on success, or
+``NULL`` in case of failure.
+
+.. note::
+ - ``sk`` is a kernel ``struct sock`` pointer for LSM or tracing programs.
+ - ``sk`` is a ``struct bpf_sock`` pointer for other program types.
+
+bpf_sk_storage_delete()
+~~~~~~~~~~~~~~~~~~~~~~~
+
+.. code-block:: c
+
+ long bpf_sk_storage_delete(struct bpf_map *map, void *sk)
+
+Socket-local storage can be deleted using the ``bpf_sk_storage_delete()``
+helper. The helper deletes the storage from ``sk`` that is identified by
+``map``. Returns ``0`` on success, or negative error in case of failure.
+
+User space
+----------
+
+bpf_map_update_elem()
+~~~~~~~~~~~~~~~~~~~~~
+
+.. code-block:: c
+
+ int bpf_map_update_elem(int map_fd, const void *key, const void *value, __u64 flags)
+
+Socket-local storage for the socket identified by ``key`` belonging to
+``map_fd`` can be added or updated using the ``bpf_map_update_elem()`` libbpf
+function. ``key`` must be a pointer to a valid ``fd`` in the user space
+program. The ``flags`` parameter can be used to control the update behaviour:
+
+- ``BPF_ANY`` will create storage for ``fd`` or update existing storage.
+- ``BPF_NOEXIST`` will create storage for ``fd`` only if it did not already
+ exist, otherwise the call will fail with ``-EEXIST``.
+- ``BPF_EXIST`` will update existing storage for ``fd`` if it already exists,
+ otherwise the call will fail with ``-ENOENT``.
+
+Returns ``0`` on success, or negative error in case of failure.
+
+bpf_map_lookup_elem()
+~~~~~~~~~~~~~~~~~~~~~
+
+.. code-block:: c
+
+ int bpf_map_lookup_elem(int map_fd, const void *key, void *value)
+
+Socket-local storage for the socket identified by ``key`` belonging to
+``map_fd`` can be retrieved using the ``bpf_map_lookup_elem()`` libbpf
+function. ``key`` must be a pointer to a valid ``fd`` in the user space
+program. Returns ``0`` on success, or negative error in case of failure.
+
+bpf_map_delete_elem()
+~~~~~~~~~~~~~~~~~~~~~
+
+.. code-block:: c
+
+ int bpf_map_delete_elem(int map_fd, const void *key)
+
+Socket-local storage for the socket identified by ``key`` belonging to
+``map_fd`` can be deleted using the ``bpf_map_delete_elem()`` libbpf
+function. Returns ``0`` on success, or negative error in case of failure.
+
+Examples
+========
+
+Kernel BPF
+----------
+
+This snippet shows how to declare socket-local storage in a BPF program:
+
+.. code-block:: c
+
+ struct {
+ __uint(type, BPF_MAP_TYPE_SK_STORAGE);
+ __uint(map_flags, BPF_F_NO_PREALLOC);
+ __type(key, int);
+ __type(value, struct my_storage);
+ } socket_storage SEC(".maps");
+
+This snippet shows how to retrieve socket-local storage in a BPF program:
+
+.. code-block:: c
+
+ SEC("sockops")
+ int _sockops(struct bpf_sock_ops *ctx)
+ {
+ struct my_storage *storage;
+ struct bpf_sock *sk;
+
+ sk = ctx->sk;
+ if (!sk)
+ return 1;
+
+ storage = bpf_sk_storage_get(&socket_storage, sk, 0,
+ BPF_LOCAL_STORAGE_GET_F_CREATE);
+ if (!storage)
+ return 1;
+
+ /* Use 'storage' here */
+
+ return 1;
+ }
+
+
+Please see the ``tools/testing/selftests/bpf`` directory for functional
+examples.
+
+References
+==========
+
+https://lwn.net/ml/netdev/20190426171103.61892-1-kafai@fb.com/
diff --git a/Documentation/bpf/map_xskmap.rst b/Documentation/bpf/map_xskmap.rst
new file mode 100644
index 000000000000..7093b8208451
--- /dev/null
+++ b/Documentation/bpf/map_xskmap.rst
@@ -0,0 +1,192 @@
+.. SPDX-License-Identifier: GPL-2.0-only
+.. Copyright (C) 2022 Red Hat, Inc.
+
+===================
+BPF_MAP_TYPE_XSKMAP
+===================
+
+.. note::
+ - ``BPF_MAP_TYPE_XSKMAP`` was introduced in kernel version 4.18
+
+The ``BPF_MAP_TYPE_XSKMAP`` is used as a backend map for XDP BPF helper
+call ``bpf_redirect_map()`` and ``XDP_REDIRECT`` action, like 'devmap' and 'cpumap'.
+This map type redirects raw XDP frames to `AF_XDP`_ sockets (XSKs), a new type of
+address family in the kernel that allows redirection of frames from a driver to
+user space without having to traverse the full network stack. An AF_XDP socket
+binds to a single netdev queue. A mapping of XSKs to queues is shown below:
+
+.. code-block:: none
+
+ +---------------------------------------------------+
+ | xsk A | xsk B | xsk C |<---+ User space
+ =========================================================|==========
+ | Queue 0 | Queue 1 | Queue 2 | | Kernel
+ +---------------------------------------------------+ |
+ | Netdev eth0 | |
+ +---------------------------------------------------+ |
+ | +=============+ | |
+ | | key | xsk | | |
+ | +---------+ +=============+ | |
+ | | | | 0 | xsk A | | |
+ | | | +-------------+ | |
+ | | | | 1 | xsk B | | |
+ | | BPF |-- redirect -->+-------------+-------------+
+ | | prog | | 2 | xsk C | |
+ | | | +-------------+ |
+ | | | |
+ | | | |
+ | +---------+ |
+ | |
+ +---------------------------------------------------+
+
+.. note::
+ An AF_XDP socket that is bound to a certain <netdev/queue_id> will *only*
+ accept XDP frames from that <netdev/queue_id>. If an XDP program tries to redirect
+ from a <netdev/queue_id> other than what the socket is bound to, the frame will
+ not be received on the socket.
+
+Typically an XSKMAP is created per netdev. This map contains an array of XSK File
+Descriptors (FDs). The number of array elements is typically set or adjusted using
+the ``max_entries`` map parameter. For AF_XDP ``max_entries`` is equal to the number
+of queues supported by the netdev.
+
+.. note::
+ Both the map key and map value size must be 4 bytes.
+
+Usage
+=====
+
+Kernel BPF
+----------
+bpf_redirect_map()
+^^^^^^^^^^^^^^^^^^
+.. code-block:: c
+
+ long bpf_redirect_map(struct bpf_map *map, u32 key, u64 flags)
+
+Redirect the packet to the endpoint referenced by ``map`` at index ``key``.
+For ``BPF_MAP_TYPE_XSKMAP`` this map contains references to XSK FDs
+for sockets attached to a netdev's queues.
+
+.. note::
+ If the map is empty at an index, the packet is dropped. This means that it is
+ necessary to have an XDP program loaded with at least one XSK in the
+ XSKMAP to be able to get any traffic to user space through the socket.
+
+bpf_map_lookup_elem()
+^^^^^^^^^^^^^^^^^^^^^
+.. code-block:: c
+
+ void *bpf_map_lookup_elem(struct bpf_map *map, const void *key)
+
+XSK entry references of type ``struct xdp_sock *`` can be retrieved using the
+``bpf_map_lookup_elem()`` helper.
+
+User space
+----------
+.. note::
+ XSK entries can only be updated/deleted from user space and not from
+ a BPF program. Trying to call these functions from a kernel BPF program will
+ result in the program failing to load and a verifier warning.
+
+bpf_map_update_elem()
+^^^^^^^^^^^^^^^^^^^^^
+.. code-block:: c
+
+ int bpf_map_update_elem(int fd, const void *key, const void *value, __u64 flags)
+
+XSK entries can be added or updated using the ``bpf_map_update_elem()``
+helper. The ``key`` parameter is equal to the queue_id of the queue the XSK
+is attaching to. And the ``value`` parameter is the FD value of that socket.
+
+Under the hood, the XSKMAP update function uses the XSK FD value to retrieve the
+associated ``struct xdp_sock`` instance.
+
+The flags argument can be one of the following:
+
+- BPF_ANY: Create a new element or update an existing element.
+- BPF_NOEXIST: Create a new element only if it did not exist.
+- BPF_EXIST: Update an existing element.
+
+bpf_map_lookup_elem()
+^^^^^^^^^^^^^^^^^^^^^
+.. code-block:: c
+
+ int bpf_map_lookup_elem(int fd, const void *key, void *value)
+
+Returns ``struct xdp_sock *`` or negative error in case of failure.
+
+bpf_map_delete_elem()
+^^^^^^^^^^^^^^^^^^^^^
+.. code-block:: c
+
+ int bpf_map_delete_elem(int fd, const void *key)
+
+XSK entries can be deleted using the ``bpf_map_delete_elem()``
+helper. This helper will return 0 on success, or negative error in case of
+failure.
+
+.. note::
+ When `libxdp`_ deletes an XSK it also removes the associated socket
+ entry from the XSKMAP.
+
+Examples
+========
+Kernel
+------
+
+The following code snippet shows how to declare a ``BPF_MAP_TYPE_XSKMAP`` called
+``xsks_map`` and how to redirect packets to an XSK.
+
+.. code-block:: c
+
+ struct {
+ __uint(type, BPF_MAP_TYPE_XSKMAP);
+ __type(key, __u32);
+ __type(value, __u32);
+ __uint(max_entries, 64);
+ } xsks_map SEC(".maps");
+
+
+ SEC("xdp")
+ int xsk_redir_prog(struct xdp_md *ctx)
+ {
+ __u32 index = ctx->rx_queue_index;
+
+ if (bpf_map_lookup_elem(&xsks_map, &index))
+ return bpf_redirect_map(&xsks_map, index, 0);
+ return XDP_PASS;
+ }
+
+User space
+----------
+
+The following code snippet shows how to update an XSKMAP with an XSK entry.
+
+.. code-block:: c
+
+ int update_xsks_map(struct bpf_map *xsks_map, int queue_id, int xsk_fd)
+ {
+ int ret;
+
+ ret = bpf_map_update_elem(bpf_map__fd(xsks_map), &queue_id, &xsk_fd, 0);
+ if (ret < 0)
+ fprintf(stderr, "Failed to update xsks_map: %s\n", strerror(errno));
+
+ return ret;
+ }
+
+For an example on how create AF_XDP sockets, please see the AF_XDP-example and
+AF_XDP-forwarding programs in the `bpf-examples`_ directory in the `libxdp`_ repository.
+For a detailed explaination of the AF_XDP interface please see:
+
+- `libxdp-readme`_.
+- `AF_XDP`_ kernel documentation.
+
+.. note::
+ The most comprehensive resource for using XSKMAPs and AF_XDP is `libxdp`_.
+
+.. _libxdp: https://github.com/xdp-project/xdp-tools/tree/master/lib/libxdp
+.. _AF_XDP: https://www.kernel.org/doc/html/latest/networking/af_xdp.html
+.. _bpf-examples: https://github.com/xdp-project/bpf-examples
+.. _libxdp-readme: https://github.com/xdp-project/xdp-tools/tree/master/lib/libxdp#using-af_xdp-sockets
diff --git a/Documentation/bpf/maps.rst b/Documentation/bpf/maps.rst
index f41619e312ac..4906ff0f8382 100644
--- a/Documentation/bpf/maps.rst
+++ b/Documentation/bpf/maps.rst
@@ -1,52 +1,81 @@
-=========
-eBPF maps
+========
+BPF maps
+========
+
+BPF 'maps' provide generic storage of different types for sharing data between
+kernel and user space. There are several storage types available, including
+hash, array, bloom filter and radix-tree. Several of the map types exist to
+support specific BPF helpers that perform actions based on the map contents. The
+maps are accessed from BPF programs via BPF helpers which are documented in the
+`man-pages`_ for `bpf-helpers(7)`_.
+
+BPF maps are accessed from user space via the ``bpf`` syscall, which provides
+commands to create maps, lookup elements, update elements and delete
+elements. More details of the BPF syscall are available in
+:doc:`/userspace-api/ebpf/syscall` and in the `man-pages`_ for `bpf(2)`_.
+
+Map Types
=========
-'maps' is a generic storage of different types for sharing data between kernel
-and userspace.
+.. toctree::
+ :maxdepth: 1
+ :glob:
-The maps are accessed from user space via BPF syscall, which has commands:
+ map_*
-- create a map with given type and attributes
- ``map_fd = bpf(BPF_MAP_CREATE, union bpf_attr *attr, u32 size)``
- using attr->map_type, attr->key_size, attr->value_size, attr->max_entries
- returns process-local file descriptor or negative error
+Usage Notes
+===========
-- lookup key in a given map
- ``err = bpf(BPF_MAP_LOOKUP_ELEM, union bpf_attr *attr, u32 size)``
- using attr->map_fd, attr->key, attr->value
- returns zero and stores found elem into value or negative error
+.. c:function::
+ int bpf(int command, union bpf_attr *attr, u32 size)
-- create or update key/value pair in a given map
- ``err = bpf(BPF_MAP_UPDATE_ELEM, union bpf_attr *attr, u32 size)``
- using attr->map_fd, attr->key, attr->value
- returns zero or negative error
+Use the ``bpf()`` system call to perform the operation specified by
+``command``. The operation takes parameters provided in ``attr``. The ``size``
+argument is the size of the ``union bpf_attr`` in ``attr``.
-- find and delete element by key in a given map
- ``err = bpf(BPF_MAP_DELETE_ELEM, union bpf_attr *attr, u32 size)``
- using attr->map_fd, attr->key
+**BPF_MAP_CREATE**
-- to delete map: close(fd)
- Exiting process will delete maps automatically
+Create a map with the desired type and attributes in ``attr``:
-userspace programs use this syscall to create/access maps that eBPF programs
-are concurrently updating.
+.. code-block:: c
-maps can have different types: hash, array, bloom filter, radix-tree, etc.
+ int fd;
+ union bpf_attr attr = {
+ .map_type = BPF_MAP_TYPE_ARRAY; /* mandatory */
+ .key_size = sizeof(__u32); /* mandatory */
+ .value_size = sizeof(__u32); /* mandatory */
+ .max_entries = 256; /* mandatory */
+ .map_flags = BPF_F_MMAPABLE;
+ .map_name = "example_array";
+ };
-The map is defined by:
+ fd = bpf(BPF_MAP_CREATE, &attr, sizeof(attr));
- - type
- - max number of elements
- - key size in bytes
- - value size in bytes
+Returns a process-local file descriptor on success, or negative error in case of
+failure. The map can be deleted by calling ``close(fd)``. Maps held by open
+file descriptors will be deleted automatically when a process exits.
-Map Types
-=========
+.. note:: Valid characters for ``map_name`` are ``A-Z``, ``a-z``, ``0-9``,
+ ``'_'`` and ``'.'``.
-.. toctree::
- :maxdepth: 1
- :glob:
+**BPF_MAP_LOOKUP_ELEM**
+
+Lookup key in a given map using ``attr->map_fd``, ``attr->key``,
+``attr->value``. Returns zero and stores found elem into ``attr->value`` on
+success, or negative error on failure.
+
+**BPF_MAP_UPDATE_ELEM**
+
+Create or update key/value pair in a given map using ``attr->map_fd``, ``attr->key``,
+``attr->value``. Returns zero on success or negative error on failure.
+
+**BPF_MAP_DELETE_ELEM**
+
+Find and delete element by key in a given map using ``attr->map_fd``,
+``attr->key``. Returns zero on success or negative error on failure.
- map_* \ No newline at end of file
+.. Links:
+.. _man-pages: https://www.kernel.org/doc/man-pages/
+.. _bpf(2): https://man7.org/linux/man-pages/man2/bpf.2.html
+.. _bpf-helpers(7): https://man7.org/linux/man-pages/man7/bpf-helpers.7.html
diff --git a/Documentation/bpf/programs.rst b/Documentation/bpf/programs.rst
index 620eb667ac7a..c99000ab6d9b 100644
--- a/Documentation/bpf/programs.rst
+++ b/Documentation/bpf/programs.rst
@@ -7,3 +7,6 @@ Program Types
:glob:
prog_*
+
+For a list of all program types, see :ref:`program_types_and_elf` in
+the :ref:`libbpf` documentation.
diff --git a/Documentation/bpf/redirect.rst b/Documentation/bpf/redirect.rst
new file mode 100644
index 000000000000..2fa2b0b05004
--- /dev/null
+++ b/Documentation/bpf/redirect.rst
@@ -0,0 +1,81 @@
+.. SPDX-License-Identifier: GPL-2.0-only
+.. Copyright (C) 2022 Red Hat, Inc.
+
+========
+Redirect
+========
+XDP_REDIRECT
+############
+Supported maps
+--------------
+
+XDP_REDIRECT works with the following map types:
+
+- ``BPF_MAP_TYPE_DEVMAP``
+- ``BPF_MAP_TYPE_DEVMAP_HASH``
+- ``BPF_MAP_TYPE_CPUMAP``
+- ``BPF_MAP_TYPE_XSKMAP``
+
+For more information on these maps, please see the specific map documentation.
+
+Process
+-------
+
+.. kernel-doc:: net/core/filter.c
+ :doc: xdp redirect
+
+.. note::
+ Not all drivers support transmitting frames after a redirect, and for
+ those that do, not all of them support non-linear frames. Non-linear xdp
+ bufs/frames are bufs/frames that contain more than one fragment.
+
+Debugging packet drops
+----------------------
+Silent packet drops for XDP_REDIRECT can be debugged using:
+
+- bpf_trace
+- perf_record
+
+bpf_trace
+^^^^^^^^^
+The following bpftrace command can be used to capture and count all XDP tracepoints:
+
+.. code-block:: none
+
+ sudo bpftrace -e 'tracepoint:xdp:* { @cnt[probe] = count(); }'
+ Attaching 12 probes...
+ ^C
+
+ @cnt[tracepoint:xdp:mem_connect]: 18
+ @cnt[tracepoint:xdp:mem_disconnect]: 18
+ @cnt[tracepoint:xdp:xdp_exception]: 19605
+ @cnt[tracepoint:xdp:xdp_devmap_xmit]: 1393604
+ @cnt[tracepoint:xdp:xdp_redirect]: 22292200
+
+.. note::
+ The various xdp tracepoints can be found in ``source/include/trace/events/xdp.h``
+
+The following bpftrace command can be used to extract the ``ERRNO`` being returned as
+part of the err parameter:
+
+.. code-block:: none
+
+ sudo bpftrace -e \
+ 'tracepoint:xdp:xdp_redirect*_err {@redir_errno[-args->err] = count();}
+ tracepoint:xdp:xdp_devmap_xmit {@devmap_errno[-args->err] = count();}'
+
+perf record
+^^^^^^^^^^^
+The perf tool also supports recording tracepoints:
+
+.. code-block:: none
+
+ perf record -a -e xdp:xdp_redirect_err \
+ -e xdp:xdp_redirect_map_err \
+ -e xdp:xdp_exception \
+ -e xdp:xdp_devmap_xmit
+
+References
+===========
+
+- https://github.com/xdp-project/xdp-tutorial/tree/master/tracing02-xdp-monitor
diff --git a/Documentation/devicetree/bindings/arm/mediatek/mediatek,mt7622-wed.yaml b/Documentation/devicetree/bindings/arm/mediatek/mediatek,mt7622-wed.yaml
index 84fb0a146b6e..5c223cb063d4 100644
--- a/Documentation/devicetree/bindings/arm/mediatek/mediatek,mt7622-wed.yaml
+++ b/Documentation/devicetree/bindings/arm/mediatek/mediatek,mt7622-wed.yaml
@@ -29,6 +29,38 @@ properties:
interrupts:
maxItems: 1
+ memory-region:
+ items:
+ - description: firmware EMI region
+ - description: firmware ILM region
+ - description: firmware DLM region
+ - description: firmware CPU DATA region
+ - description: firmware BOOT region
+
+ memory-region-names:
+ items:
+ - const: wo-emi
+ - const: wo-ilm
+ - const: wo-dlm
+ - const: wo-data
+ - const: wo-boot
+
+ mediatek,wo-ccif:
+ $ref: /schemas/types.yaml#/definitions/phandle
+ description: mediatek wed-wo controller interface.
+
+allOf:
+ - if:
+ properties:
+ compatible:
+ contains:
+ const: mediatek,mt7622-wed
+ then:
+ properties:
+ memory-region-names: false
+ memory-region: false
+ mediatek,wo-ccif: false
+
required:
- compatible
- reg
@@ -49,3 +81,23 @@ examples:
interrupts = <GIC_SPI 214 IRQ_TYPE_LEVEL_LOW>;
};
};
+
+ - |
+ #include <dt-bindings/interrupt-controller/arm-gic.h>
+ #include <dt-bindings/interrupt-controller/irq.h>
+ soc {
+ #address-cells = <2>;
+ #size-cells = <2>;
+
+ wed@15010000 {
+ compatible = "mediatek,mt7986-wed", "syscon";
+ reg = <0 0x15010000 0 0x1000>;
+ interrupts = <GIC_SPI 205 IRQ_TYPE_LEVEL_HIGH>;
+
+ memory-region = <&wo_emi>, <&wo_ilm>, <&wo_dlm>,
+ <&wo_data>, <&wo_boot>;
+ memory-region-names = "wo-emi", "wo-ilm", "wo-dlm",
+ "wo-data", "wo-boot";
+ mediatek,wo-ccif = <&wo_ccif0>;
+ };
+ };
diff --git a/Documentation/devicetree/bindings/interrupt-controller/fsl,intmux.yaml b/Documentation/devicetree/bindings/interrupt-controller/fsl,intmux.yaml
index 1d6e0f64a807..985bfa4f6fda 100644
--- a/Documentation/devicetree/bindings/interrupt-controller/fsl,intmux.yaml
+++ b/Documentation/devicetree/bindings/interrupt-controller/fsl,intmux.yaml
@@ -7,7 +7,8 @@ $schema: http://devicetree.org/meta-schemas/core.yaml#
title: Freescale INTMUX interrupt multiplexer
maintainers:
- - Joakim Zhang <qiangqing.zhang@nxp.com>
+ - Shawn Guo <shawnguo@kernel.org>
+ - NXP Linux Team <linux-imx@nxp.com>
properties:
compatible:
diff --git a/Documentation/devicetree/bindings/net/adi,adin1110.yaml b/Documentation/devicetree/bindings/net/adi,adin1110.yaml
index b6bd8ee38a18..9de865295d7a 100644
--- a/Documentation/devicetree/bindings/net/adi,adin1110.yaml
+++ b/Documentation/devicetree/bindings/net/adi,adin1110.yaml
@@ -46,6 +46,10 @@ properties:
interrupts:
maxItems: 1
+ reset-gpios:
+ maxItems: 1
+ description: GPIO connected to active low reset
+
required:
- compatible
- reg
diff --git a/Documentation/devicetree/bindings/net/asix,ax88178.yaml b/Documentation/devicetree/bindings/net/asix,ax88178.yaml
index 1af52358de4c..a81dbc4792f6 100644
--- a/Documentation/devicetree/bindings/net/asix,ax88178.yaml
+++ b/Documentation/devicetree/bindings/net/asix,ax88178.yaml
@@ -27,7 +27,9 @@ properties:
- usbb95,772b # ASIX AX88772B
- usbb95,7e2b # ASIX AX88772B
- reg: true
+ reg:
+ maxItems: 1
+
local-mac-address: true
mac-address: true
diff --git a/Documentation/devicetree/bindings/net/bluetooth.txt b/Documentation/devicetree/bindings/net/bluetooth.txt
deleted file mode 100644
index 94797df751b8..000000000000
--- a/Documentation/devicetree/bindings/net/bluetooth.txt
+++ /dev/null
@@ -1,5 +0,0 @@
-The following properties are common to the Bluetooth controllers:
-
-- local-bd-address: array of 6 bytes, specifies the BD address that was
- uniquely assigned to the Bluetooth device, formatted with least significant
- byte first (little-endian).
diff --git a/Documentation/devicetree/bindings/net/bluetooth/bluetooth-controller.yaml b/Documentation/devicetree/bindings/net/bluetooth/bluetooth-controller.yaml
new file mode 100644
index 000000000000..9309dc40f54f
--- /dev/null
+++ b/Documentation/devicetree/bindings/net/bluetooth/bluetooth-controller.yaml
@@ -0,0 +1,29 @@
+# SPDX-License-Identifier: (GPL-2.0 OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/net/bluetooth/bluetooth-controller.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Bluetooth Controller Generic Binding
+
+maintainers:
+ - Marcel Holtmann <marcel@holtmann.org>
+ - Johan Hedberg <johan.hedberg@gmail.com>
+ - Luiz Augusto von Dentz <luiz.dentz@gmail.com>
+
+properties:
+ $nodename:
+ pattern: "^bluetooth(@.*)?$"
+
+ local-bd-address:
+ $ref: /schemas/types.yaml#/definitions/uint8-array
+ maxItems: 6
+ description:
+ Specifies the BD address that was uniquely assigned to the Bluetooth
+ device. Formatted with least significant byte first (little-endian), e.g.
+ in order to assign the address 00:11:22:33:44:55 this property must have
+ the value [55 44 33 22 11 00].
+
+additionalProperties: true
+
+...
diff --git a/Documentation/devicetree/bindings/net/bluetooth/brcm,bcm4377-bluetooth.yaml b/Documentation/devicetree/bindings/net/bluetooth/brcm,bcm4377-bluetooth.yaml
new file mode 100644
index 000000000000..37cb39a3a62e
--- /dev/null
+++ b/Documentation/devicetree/bindings/net/bluetooth/brcm,bcm4377-bluetooth.yaml
@@ -0,0 +1,81 @@
+# SPDX-License-Identifier: (GPL-2.0 OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/net/bluetooth/brcm,bcm4377-bluetooth.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Broadcom BCM4377 family PCIe Bluetooth Chips
+
+maintainers:
+ - Sven Peter <sven@svenpeter.dev>
+
+description:
+ This binding describes Broadcom BCM4377 family PCIe-attached bluetooth chips
+ usually found in Apple machines. The Wi-Fi part of the chip is described in
+ bindings/net/wireless/brcm,bcm4329-fmac.yaml.
+
+allOf:
+ - $ref: bluetooth-controller.yaml#
+
+properties:
+ compatible:
+ enum:
+ - pci14e4,5fa0 # BCM4377
+ - pci14e4,5f69 # BCM4378
+ - pci14e4,5f71 # BCM4387
+
+ reg:
+ maxItems: 1
+
+ brcm,board-type:
+ $ref: /schemas/types.yaml#/definitions/string
+ description: Board type of the Bluetooth chip. This is used to decouple
+ the overall system board from the Bluetooth module and used to construct
+ firmware and calibration data filenames.
+ On Apple platforms, this should be the Apple module-instance codename
+ prefixed by "apple,", e.g. "apple,atlantisb".
+ pattern: '^apple,.*'
+
+ brcm,taurus-cal-blob:
+ $ref: /schemas/types.yaml#/definitions/uint8-array
+ description: A per-device calibration blob for the Bluetooth radio. This
+ should be filled in by the bootloader from platform configuration
+ data, if necessary, and will be uploaded to the device.
+ This blob is used if the chip stepping of the Bluetooth module does not
+ support beamforming.
+
+ brcm,taurus-bf-cal-blob:
+ $ref: /schemas/types.yaml#/definitions/uint8-array
+ description: A per-device calibration blob for the Bluetooth radio. This
+ should be filled in by the bootloader from platform configuration
+ data, if necessary, and will be uploaded to the device.
+ This blob is used if the chip stepping of the Bluetooth module supports
+ beamforming.
+
+ local-bd-address: true
+
+required:
+ - compatible
+ - reg
+ - local-bd-address
+ - brcm,board-type
+
+additionalProperties: false
+
+examples:
+ - |
+ pcie@a0000000 {
+ #address-cells = <3>;
+ #size-cells = <2>;
+ reg = <0xa0000000 0x1000000>;
+ device_type = "pci";
+ ranges = <0x43000000 0x6 0xa0000000 0xa0000000 0x0 0x20000000>;
+
+ bluetooth@0,1 {
+ compatible = "pci14e4,5f69";
+ reg = <0x100 0x0 0x0 0x0 0x0>;
+ brcm,board-type = "apple,honshu";
+ /* To be filled by the bootloader */
+ local-bd-address = [00 00 00 00 00 00];
+ };
+ };
diff --git a/Documentation/devicetree/bindings/net/qualcomm-bluetooth.yaml b/Documentation/devicetree/bindings/net/bluetooth/qualcomm-bluetooth.yaml
index f93c6e7a1b59..a6a6b0e4df7a 100644
--- a/Documentation/devicetree/bindings/net/qualcomm-bluetooth.yaml
+++ b/Documentation/devicetree/bindings/net/bluetooth/qualcomm-bluetooth.yaml
@@ -1,7 +1,7 @@
# SPDX-License-Identifier: (GPL-2.0 OR BSD-2-Clause)
%YAML 1.2
---
-$id: http://devicetree.org/schemas/net/qualcomm-bluetooth.yaml#
+$id: http://devicetree.org/schemas/net/bluetooth/qualcomm-bluetooth.yaml#
$schema: http://devicetree.org/meta-schemas/core.yaml#
title: Qualcomm Bluetooth Chips
@@ -79,8 +79,7 @@ properties:
firmware-name:
description: specify the name of nvm firmware to load
- local-bd-address:
- description: see Documentation/devicetree/bindings/net/bluetooth.txt
+ local-bd-address: true
required:
@@ -89,6 +88,7 @@ required:
additionalProperties: false
allOf:
+ - $ref: bluetooth-controller.yaml#
- if:
properties:
compatible:
diff --git a/Documentation/devicetree/bindings/net/broadcom-bluetooth.yaml b/Documentation/devicetree/bindings/net/broadcom-bluetooth.yaml
index 445b2a553625..b964c7dcec15 100644
--- a/Documentation/devicetree/bindings/net/broadcom-bluetooth.yaml
+++ b/Documentation/devicetree/bindings/net/broadcom-bluetooth.yaml
@@ -19,11 +19,14 @@ properties:
- brcm,bcm4329-bt
- brcm,bcm4330-bt
- brcm,bcm4334-bt
+ - brcm,bcm43430a0-bt
+ - brcm,bcm43430a1-bt
- brcm,bcm43438-bt
- brcm,bcm4345c5
- brcm,bcm43540-bt
- brcm,bcm4335a0
- brcm,bcm4349-bt
+ - cypress,cyw4373a0-bt
- infineon,cyw55572-bt
shutdown-gpios:
diff --git a/Documentation/devicetree/bindings/net/can/fsl,flexcan.yaml b/Documentation/devicetree/bindings/net/can/fsl,flexcan.yaml
index e52db841bb8c..6e59bd2a6094 100644
--- a/Documentation/devicetree/bindings/net/can/fsl,flexcan.yaml
+++ b/Documentation/devicetree/bindings/net/can/fsl,flexcan.yaml
@@ -17,6 +17,7 @@ properties:
compatible:
oneOf:
- enum:
+ - fsl,imx93-flexcan
- fsl,imx8qm-flexcan
- fsl,imx8mp-flexcan
- fsl,imx6q-flexcan
diff --git a/Documentation/devicetree/bindings/net/can/renesas,rcar-canfd.yaml b/Documentation/devicetree/bindings/net/can/renesas,rcar-canfd.yaml
index 6f71fc96bc4e..1eb98c9a1a26 100644
--- a/Documentation/devicetree/bindings/net/can/renesas,rcar-canfd.yaml
+++ b/Documentation/devicetree/bindings/net/can/renesas,rcar-canfd.yaml
@@ -9,9 +9,6 @@ title: Renesas R-Car CAN FD Controller
maintainers:
- Fabrizio Castro <fabrizio.castro.jz@renesas.com>
-allOf:
- - $ref: can-controller.yaml#
-
properties:
compatible:
oneOf:
@@ -33,7 +30,7 @@ properties:
- items:
- enum:
- - renesas,r9a07g043-canfd # RZ/G2UL
+ - renesas,r9a07g043-canfd # RZ/G2UL and RZ/Five
- renesas,r9a07g044-canfd # RZ/G2{L,LC}
- renesas,r9a07g054-canfd # RZ/V2L
- const: renesas,rzg2l-canfd # RZ/G2L family
@@ -77,12 +74,13 @@ properties:
description: Maximum frequency of the CANFD clock.
patternProperties:
- "^channel[01]$":
+ "^channel[0-7]$":
type: object
description:
- The controller supports two channels and each is represented as a child
- node. Each child node supports the "status" property only, which
- is used to enable/disable the respective channel.
+ The controller supports multiple channels and each is represented as a
+ child node. Each channel can be enabled/disabled individually.
+
+ additionalProperties: false
required:
- compatible
@@ -98,60 +96,73 @@ required:
- channel0
- channel1
-if:
- properties:
- compatible:
- contains:
- enum:
- - renesas,rzg2l-canfd
-then:
- properties:
- interrupts:
- items:
- - description: CAN global error interrupt
- - description: CAN receive FIFO interrupt
- - description: CAN0 error interrupt
- - description: CAN0 transmit interrupt
- - description: CAN0 transmit/receive FIFO receive completion interrupt
- - description: CAN1 error interrupt
- - description: CAN1 transmit interrupt
- - description: CAN1 transmit/receive FIFO receive completion interrupt
-
- interrupt-names:
- items:
- - const: g_err
- - const: g_recc
- - const: ch0_err
- - const: ch0_rec
- - const: ch0_trx
- - const: ch1_err
- - const: ch1_rec
- - const: ch1_trx
-
- resets:
- maxItems: 2
-
- reset-names:
- items:
- - const: rstp_n
- - const: rstc_n
-
- required:
- - reset-names
-else:
- properties:
- interrupts:
- items:
- - description: Channel interrupt
- - description: Global interrupt
-
- interrupt-names:
- items:
- - const: ch_int
- - const: g_int
-
- resets:
- maxItems: 1
+allOf:
+ - $ref: can-controller.yaml#
+
+ - if:
+ properties:
+ compatible:
+ contains:
+ enum:
+ - renesas,rzg2l-canfd
+ then:
+ properties:
+ interrupts:
+ items:
+ - description: CAN global error interrupt
+ - description: CAN receive FIFO interrupt
+ - description: CAN0 error interrupt
+ - description: CAN0 transmit interrupt
+ - description: CAN0 transmit/receive FIFO receive completion interrupt
+ - description: CAN1 error interrupt
+ - description: CAN1 transmit interrupt
+ - description: CAN1 transmit/receive FIFO receive completion interrupt
+
+ interrupt-names:
+ items:
+ - const: g_err
+ - const: g_recc
+ - const: ch0_err
+ - const: ch0_rec
+ - const: ch0_trx
+ - const: ch1_err
+ - const: ch1_rec
+ - const: ch1_trx
+
+ resets:
+ maxItems: 2
+
+ reset-names:
+ items:
+ - const: rstp_n
+ - const: rstc_n
+
+ required:
+ - reset-names
+ else:
+ properties:
+ interrupts:
+ items:
+ - description: Channel interrupt
+ - description: Global interrupt
+
+ interrupt-names:
+ items:
+ - const: ch_int
+ - const: g_int
+
+ resets:
+ maxItems: 1
+
+ - if:
+ not:
+ properties:
+ compatible:
+ contains:
+ const: renesas,r8a779a0-canfd
+ then:
+ patternProperties:
+ "^channel[2-7]$": false
unevaluatedProperties: false
diff --git a/Documentation/devicetree/bindings/net/dsa/dsa-port.yaml b/Documentation/devicetree/bindings/net/dsa/dsa-port.yaml
index 10ad7e71097b..9abb8eba5fad 100644
--- a/Documentation/devicetree/bindings/net/dsa/dsa-port.yaml
+++ b/Documentation/devicetree/bindings/net/dsa/dsa-port.yaml
@@ -19,7 +19,8 @@ allOf:
properties:
reg:
- description: Port number
+ items:
+ - description: Port number
label:
description:
diff --git a/Documentation/devicetree/bindings/net/dsa/hirschmann,hellcreek.yaml b/Documentation/devicetree/bindings/net/dsa/hirschmann,hellcreek.yaml
index 73b774eadd0b..1d7dab31457d 100644
--- a/Documentation/devicetree/bindings/net/dsa/hirschmann,hellcreek.yaml
+++ b/Documentation/devicetree/bindings/net/dsa/hirschmann,hellcreek.yaml
@@ -12,7 +12,7 @@ allOf:
maintainers:
- Andrew Lunn <andrew@lunn.ch>
- Florian Fainelli <f.fainelli@gmail.com>
- - Vivien Didelot <vivien.didelot@gmail.com>
+ - Vladimir Oltean <olteanv@gmail.com>
- Kurt Kanzenbach <kurt@linutronix.de>
description:
diff --git a/Documentation/devicetree/bindings/net/dsa/renesas,rzn1-a5psw.yaml b/Documentation/devicetree/bindings/net/dsa/renesas,rzn1-a5psw.yaml
index 7ca9c19a157c..0a0d62b6c00e 100644
--- a/Documentation/devicetree/bindings/net/dsa/renesas,rzn1-a5psw.yaml
+++ b/Documentation/devicetree/bindings/net/dsa/renesas,rzn1-a5psw.yaml
@@ -74,10 +74,10 @@ properties:
properties:
pcs-handle:
+ maxItems: 1
description:
phandle pointing to a PCS sub-node compatible with
renesas,rzn1-miic.yaml#
- $ref: /schemas/types.yaml#/definitions/phandle
unevaluatedProperties: false
diff --git a/Documentation/devicetree/bindings/net/ethernet-controller.yaml b/Documentation/devicetree/bindings/net/ethernet-controller.yaml
index 4b3c590fcebf..3aef506fa158 100644
--- a/Documentation/devicetree/bindings/net/ethernet-controller.yaml
+++ b/Documentation/devicetree/bindings/net/ethernet-controller.yaml
@@ -108,11 +108,17 @@ properties:
$ref: "#/properties/phy-connection-type"
pcs-handle:
- $ref: /schemas/types.yaml#/definitions/phandle
+ $ref: /schemas/types.yaml#/definitions/phandle-array
+ items:
+ maxItems: 1
description:
Specifies a reference to a node representing a PCS PHY device on a MDIO
bus to link with an external PHY (phy-handle) if exists.
+ pcs-handle-names:
+ description:
+ The name of each PCS in pcs-handle.
+
phy-handle:
$ref: /schemas/types.yaml#/definitions/phandle
description:
@@ -216,6 +222,9 @@ properties:
required:
- speed
+dependencies:
+ pcs-handle-names: [pcs-handle]
+
allOf:
- if:
properties:
diff --git a/Documentation/devicetree/bindings/net/fsl,fec.yaml b/Documentation/devicetree/bindings/net/fsl,fec.yaml
index e0f376f7e274..77e5f32cb62f 100644
--- a/Documentation/devicetree/bindings/net/fsl,fec.yaml
+++ b/Documentation/devicetree/bindings/net/fsl,fec.yaml
@@ -7,7 +7,9 @@ $schema: http://devicetree.org/meta-schemas/core.yaml#
title: Freescale Fast Ethernet Controller (FEC)
maintainers:
- - Joakim Zhang <qiangqing.zhang@nxp.com>
+ - Shawn Guo <shawnguo@kernel.org>
+ - Wei Fang <wei.fang@nxp.com>
+ - NXP Linux Team <linux-imx@nxp.com>
allOf:
- $ref: ethernet-controller.yaml#
diff --git a/Documentation/devicetree/bindings/net/fsl,fman-dtsec.yaml b/Documentation/devicetree/bindings/net/fsl,fman-dtsec.yaml
index 3a35ac1c260d..c80c880a9dab 100644
--- a/Documentation/devicetree/bindings/net/fsl,fman-dtsec.yaml
+++ b/Documentation/devicetree/bindings/net/fsl,fman-dtsec.yaml
@@ -85,9 +85,39 @@ properties:
$ref: /schemas/types.yaml#/definitions/phandle
description: A reference to the IEEE1588 timer
+ phys:
+ description: A reference to the SerDes lane(s)
+ maxItems: 1
+
+ phy-names:
+ items:
+ - const: serdes
+
pcsphy-handle:
- $ref: /schemas/types.yaml#/definitions/phandle
- description: A reference to the PCS (typically found on the SerDes)
+ $ref: /schemas/types.yaml#/definitions/phandle-array
+ minItems: 1
+ maxItems: 3
+ deprecated: true
+ description: See pcs-handle.
+
+ pcs-handle:
+ minItems: 1
+ maxItems: 3
+ description: |
+ A reference to the various PCSs (typically found on the SerDes). If
+ pcs-handle-names is absent, and phy-connection-type is "xgmii", then the first
+ reference will be assumed to be for "xfi". Otherwise, if pcs-handle-names is
+ absent, then the first reference will be assumed to be for "sgmii".
+
+ pcs-handle-names:
+ minItems: 1
+ maxItems: 3
+ items:
+ enum:
+ - sgmii
+ - qsgmii
+ - xfi
+ description: The type of each PCS in pcsphy-handle.
tbi-handle:
$ref: /schemas/types.yaml#/definitions/phandle
@@ -100,6 +130,10 @@ required:
- fsl,fman-ports
- ptp-timer
+dependencies:
+ pcs-handle-names:
+ - pcs-handle
+
allOf:
- $ref: ethernet-controller.yaml#
- if:
@@ -110,14 +144,6 @@ allOf:
then:
required:
- tbi-handle
- - if:
- properties:
- compatible:
- contains:
- const: fsl,fman-memac
- then:
- required:
- - pcsphy-handle
unevaluatedProperties: false
@@ -138,8 +164,9 @@ examples:
reg = <0xe8000 0x1000>;
fsl,fman-ports = <&fman0_rx_0x0c &fman0_tx_0x2c>;
ptp-timer = <&ptp_timer0>;
- pcsphy-handle = <&pcsphy4>;
- phy-handle = <&sgmii_phy1>;
- phy-connection-type = "sgmii";
+ pcs-handle = <&pcsphy4>, <&qsgmiib_pcs1>;
+ pcs-handle-names = "sgmii", "qsgmii";
+ phys = <&serdes1 1>;
+ phy-names = "serdes";
};
...
diff --git a/Documentation/devicetree/bindings/net/fsl,qoriq-mc-dpmac.yaml b/Documentation/devicetree/bindings/net/fsl,qoriq-mc-dpmac.yaml
index 7f620a71a972..600240281e8c 100644
--- a/Documentation/devicetree/bindings/net/fsl,qoriq-mc-dpmac.yaml
+++ b/Documentation/devicetree/bindings/net/fsl,qoriq-mc-dpmac.yaml
@@ -31,7 +31,7 @@ properties:
phy-mode: true
pcs-handle:
- $ref: /schemas/types.yaml#/definitions/phandle
+ maxItems: 1
description:
A reference to a node representing a PCS PHY device found on
the internal MDIO bus.
diff --git a/Documentation/devicetree/bindings/net/fsl-fman.txt b/Documentation/devicetree/bindings/net/fsl-fman.txt
index b9055335db3b..bda4b41af074 100644
--- a/Documentation/devicetree/bindings/net/fsl-fman.txt
+++ b/Documentation/devicetree/bindings/net/fsl-fman.txt
@@ -320,8 +320,9 @@ For internal PHY device on internal mdio bus, a PHY node should be created.
See the definition of the PHY node in booting-without-of.txt for an
example of how to define a PHY (Internal PHY has no interrupt line).
- For "fsl,fman-mdio" compatible internal mdio bus, the PHY is TBI PHY.
-- For "fsl,fman-memac-mdio" compatible internal mdio bus, the PHY is PCS PHY,
- PCS PHY addr must be '0'.
+- For "fsl,fman-memac-mdio" compatible internal mdio bus, the PHY is PCS PHY.
+ The PCS PHY address should correspond to the value of the appropriate
+ MDEV_PORT.
EXAMPLE
diff --git a/Documentation/devicetree/bindings/net/marvell,dfx-server.yaml b/Documentation/devicetree/bindings/net/marvell,dfx-server.yaml
new file mode 100644
index 000000000000..8a14c919e3f7
--- /dev/null
+++ b/Documentation/devicetree/bindings/net/marvell,dfx-server.yaml
@@ -0,0 +1,62 @@
+# SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/net/marvell,dfx-server.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Marvell Prestera DFX server
+
+maintainers:
+ - Miquel Raynal <miquel.raynal@bootlin.com>
+
+select:
+ properties:
+ compatible:
+ contains:
+ const: marvell,dfx-server
+ required:
+ - compatible
+
+properties:
+ compatible:
+ items:
+ - const: marvell,dfx-server
+ - const: simple-bus
+
+ reg:
+ maxItems: 1
+
+ ranges: true
+
+ '#address-cells':
+ const: 1
+
+ '#size-cells':
+ const: 1
+
+required:
+ - compatible
+ - reg
+ - ranges
+
+# The DFX server may expose clocks described as subnodes
+additionalProperties:
+ type: object
+
+examples:
+ - |
+
+ #define MBUS_ID(target,attributes) (((target) << 24) | ((attributes) << 16))
+ bus@0 {
+ reg = <0 0>;
+ #address-cells = <2>;
+ #size-cells = <1>;
+
+ dfx-bus@ac000000 {
+ compatible = "marvell,dfx-server", "simple-bus";
+ #address-cells = <1>;
+ #size-cells = <1>;
+ ranges = <0 MBUS_ID(0x08, 0x00) 0 0x100000>;
+ reg = <MBUS_ID(0x08, 0x00) 0 0x100000>;
+ };
+ };
diff --git a/Documentation/devicetree/bindings/net/marvell,pp2.yaml b/Documentation/devicetree/bindings/net/marvell,pp2.yaml
new file mode 100644
index 000000000000..4eadafc43d4f
--- /dev/null
+++ b/Documentation/devicetree/bindings/net/marvell,pp2.yaml
@@ -0,0 +1,305 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/net/marvell,pp2.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Marvell CN913X / Marvell Armada 375, 7K, 8K Ethernet Controller
+
+maintainers:
+ - Marcin Wojtas <mw@semihalf.com>
+ - Russell King <linux@armlinux.org>
+
+description: |
+ Marvell Armada 375 Ethernet Controller (PPv2.1)
+ Marvell Armada 7K/8K Ethernet Controller (PPv2.2)
+ Marvell CN913X Ethernet Controller (PPv2.3)
+
+properties:
+ compatible:
+ enum:
+ - marvell,armada-375-pp2
+ - marvell,armada-7k-pp22
+
+ reg:
+ minItems: 3
+ maxItems: 4
+
+ "#address-cells":
+ const: 1
+
+ "#size-cells":
+ const: 0
+
+ clocks:
+ minItems: 2
+ items:
+ - description: main controller clock
+ - description: GOP clock
+ - description: MG clock
+ - description: MG Core clock
+ - description: AXI clock
+
+ clock-names:
+ minItems: 2
+ items:
+ - const: pp_clk
+ - const: gop_clk
+ - const: mg_clk
+ - const: mg_core_clk
+ - const: axi_clk
+
+ dma-coherent: true
+
+ marvell,system-controller:
+ $ref: /schemas/types.yaml#/definitions/phandle
+ description: a phandle to the system controller.
+
+patternProperties:
+ '^(ethernet-)?port@[0-2]$':
+ type: object
+ description: subnode for each ethernet port.
+ $ref: ethernet-controller.yaml#
+ unevaluatedProperties: false
+
+ properties:
+ reg:
+ description: ID of the port from the MAC point of view.
+ maximum: 2
+
+ interrupts:
+ minItems: 1
+ maxItems: 10
+ description: interrupt(s) for the port
+
+ interrupt-names:
+ minItems: 1
+ items:
+ - const: hif0
+ - const: hif1
+ - const: hif2
+ - const: hif3
+ - const: hif4
+ - const: hif5
+ - const: hif6
+ - const: hif7
+ - const: hif8
+ - const: link
+
+ description: >
+ if more than a single interrupt for is given, must be the
+ name associated to the interrupts listed. Valid names are:
+ "hifX", with X in [0..8], and "link". The names "tx-cpu0",
+ "tx-cpu1", "tx-cpu2", "tx-cpu3" and "rx-shared" are supported
+ for backward compatibility but shouldn't be used for new
+ additions.
+
+ phys:
+ minItems: 1
+ maxItems: 2
+ description: >
+ Generic PHY, providing SerDes connectivity. For most modes,
+ one lane is sufficient, but some (e.g. RXAUI) may require two.
+
+ phy-mode:
+ enum:
+ - gmii
+ - sgmii
+ - rgmii-id
+ - 1000base-x
+ - 2500base-x
+ - 5gbase-r
+ - rxaui
+ - 10gbase-r
+
+ port-id:
+ $ref: /schemas/types.yaml#/definitions/uint32
+ deprecated: true
+ description: >
+ ID of the port from the MAC point of view.
+ Legacy binding for backward compatibility.
+
+ marvell,loopback:
+ $ref: /schemas/types.yaml#/definitions/flag
+ description: port is loopback mode.
+
+ gop-port-id:
+ $ref: /schemas/types.yaml#/definitions/uint32
+ description: >
+ only for marvell,armada-7k-pp22, ID of the port from the
+ GOP (Group Of Ports) point of view. This ID is used to index the
+ per-port registers in the second register area.
+
+ required:
+ - reg
+ - interrupts
+ - phy-mode
+ - port-id
+
+required:
+ - compatible
+ - reg
+ - clocks
+ - clock-names
+
+allOf:
+ - if:
+ properties:
+ compatible:
+ const: marvell,armada-7k-pp22
+ then:
+ properties:
+ reg:
+ items:
+ - description: Packet Processor registers
+ - description: Networking interfaces registers
+ - description: CM3 address space used for TX Flow Control
+
+ clocks:
+ minItems: 5
+
+ clock-names:
+ minItems: 5
+
+ patternProperties:
+ '^(ethernet-)?port@[0-2]$':
+ required:
+ - gop-port-id
+
+ required:
+ - marvell,system-controller
+ else:
+ properties:
+ reg:
+ items:
+ - description: Packet Processor registers
+ - description: LMS registers
+ - description: Register area per eth0
+ - description: Register area per eth1
+
+ clocks:
+ maxItems: 2
+
+ clock-names:
+ maxItems: 2
+
+ patternProperties:
+ '^(ethernet-)?port@[0-1]$':
+ properties:
+ reg:
+ maximum: 1
+
+ gop-port-id: false
+
+additionalProperties: false
+
+examples:
+ - |
+ // For Armada 375 variant
+ #include <dt-bindings/interrupt-controller/mvebu-icu.h>
+ #include <dt-bindings/interrupt-controller/arm-gic.h>
+
+ ethernet@f0000 {
+ #address-cells = <1>;
+ #size-cells = <0>;
+ compatible = "marvell,armada-375-pp2";
+ reg = <0xf0000 0xa000>,
+ <0xc0000 0x3060>,
+ <0xc4000 0x100>,
+ <0xc5000 0x100>;
+ clocks = <&gateclk 3>, <&gateclk 19>;
+ clock-names = "pp_clk", "gop_clk";
+
+ ethernet-port@0 {
+ interrupts = <GIC_SPI 37 IRQ_TYPE_LEVEL_HIGH>;
+ reg = <0>;
+ port-id = <0>; /* For backward compatibility. */
+ phy = <&phy0>;
+ phy-mode = "rgmii-id";
+ };
+
+ ethernet-port@1 {
+ interrupts = <GIC_SPI 41 IRQ_TYPE_LEVEL_HIGH>;
+ reg = <1>;
+ port-id = <1>; /* For backward compatibility. */
+ phy = <&phy3>;
+ phy-mode = "gmii";
+ };
+ };
+
+ - |
+ // For Armada 7k/8k and Cn913x variants
+ #include <dt-bindings/interrupt-controller/mvebu-icu.h>
+ #include <dt-bindings/interrupt-controller/arm-gic.h>
+
+ ethernet@0 {
+ #address-cells = <1>;
+ #size-cells = <0>;
+ compatible = "marvell,armada-7k-pp22";
+ reg = <0x0 0x100000>, <0x129000 0xb000>, <0x220000 0x800>;
+ clocks = <&cp0_clk 1 3>, <&cp0_clk 1 9>,
+ <&cp0_clk 1 5>, <&cp0_clk 1 6>, <&cp0_clk 1 18>;
+ clock-names = "pp_clk", "gop_clk", "mg_clk", "mg_core_clk", "axi_clk";
+ marvell,system-controller = <&cp0_syscon0>;
+
+ ethernet-port@0 {
+ interrupts = <ICU_GRP_NSR 39 IRQ_TYPE_LEVEL_HIGH>,
+ <ICU_GRP_NSR 43 IRQ_TYPE_LEVEL_HIGH>,
+ <ICU_GRP_NSR 47 IRQ_TYPE_LEVEL_HIGH>,
+ <ICU_GRP_NSR 51 IRQ_TYPE_LEVEL_HIGH>,
+ <ICU_GRP_NSR 55 IRQ_TYPE_LEVEL_HIGH>,
+ <ICU_GRP_NSR 59 IRQ_TYPE_LEVEL_HIGH>,
+ <ICU_GRP_NSR 63 IRQ_TYPE_LEVEL_HIGH>,
+ <ICU_GRP_NSR 67 IRQ_TYPE_LEVEL_HIGH>,
+ <ICU_GRP_NSR 71 IRQ_TYPE_LEVEL_HIGH>,
+ <ICU_GRP_NSR 129 IRQ_TYPE_LEVEL_HIGH>;
+ interrupt-names = "hif0", "hif1", "hif2", "hif3", "hif4",
+ "hif5", "hif6", "hif7", "hif8", "link";
+ phy-mode = "10gbase-r";
+ phys = <&cp0_comphy4 0>;
+ reg = <0>;
+ port-id = <0>; /* For backward compatibility. */
+ gop-port-id = <0>;
+ };
+
+ ethernet-port@1 {
+ interrupts = <ICU_GRP_NSR 40 IRQ_TYPE_LEVEL_HIGH>,
+ <ICU_GRP_NSR 44 IRQ_TYPE_LEVEL_HIGH>,
+ <ICU_GRP_NSR 48 IRQ_TYPE_LEVEL_HIGH>,
+ <ICU_GRP_NSR 52 IRQ_TYPE_LEVEL_HIGH>,
+ <ICU_GRP_NSR 56 IRQ_TYPE_LEVEL_HIGH>,
+ <ICU_GRP_NSR 60 IRQ_TYPE_LEVEL_HIGH>,
+ <ICU_GRP_NSR 64 IRQ_TYPE_LEVEL_HIGH>,
+ <ICU_GRP_NSR 68 IRQ_TYPE_LEVEL_HIGH>,
+ <ICU_GRP_NSR 72 IRQ_TYPE_LEVEL_HIGH>,
+ <ICU_GRP_NSR 128 IRQ_TYPE_LEVEL_HIGH>;
+ interrupt-names = "hif0", "hif1", "hif2", "hif3", "hif4",
+ "hif5", "hif6", "hif7", "hif8", "link";
+ phy-mode = "rgmii-id";
+ reg = <1>;
+ port-id = <1>; /* For backward compatibility. */
+ gop-port-id = <2>;
+ };
+
+ ethernet-port@2 {
+ interrupts = <ICU_GRP_NSR 41 IRQ_TYPE_LEVEL_HIGH>,
+ <ICU_GRP_NSR 45 IRQ_TYPE_LEVEL_HIGH>,
+ <ICU_GRP_NSR 49 IRQ_TYPE_LEVEL_HIGH>,
+ <ICU_GRP_NSR 53 IRQ_TYPE_LEVEL_HIGH>,
+ <ICU_GRP_NSR 57 IRQ_TYPE_LEVEL_HIGH>,
+ <ICU_GRP_NSR 61 IRQ_TYPE_LEVEL_HIGH>,
+ <ICU_GRP_NSR 65 IRQ_TYPE_LEVEL_HIGH>,
+ <ICU_GRP_NSR 69 IRQ_TYPE_LEVEL_HIGH>,
+ <ICU_GRP_NSR 73 IRQ_TYPE_LEVEL_HIGH>,
+ <ICU_GRP_NSR 127 IRQ_TYPE_LEVEL_HIGH>;
+ interrupt-names = "hif0", "hif1", "hif2", "hif3", "hif4",
+ "hif5", "hif6", "hif7", "hif8", "link";
+ phy-mode = "2500base-x";
+ managed = "in-band-status";
+ phys = <&cp0_comphy5 2>;
+ sfp = <&sfp_eth3>;
+ reg = <2>;
+ port-id = <2>; /* For backward compatibility. */
+ gop-port-id = <3>;
+ };
+ };
diff --git a/Documentation/devicetree/bindings/net/marvell,prestera.txt b/Documentation/devicetree/bindings/net/marvell,prestera.txt
deleted file mode 100644
index e28938ddfdf5..000000000000
--- a/Documentation/devicetree/bindings/net/marvell,prestera.txt
+++ /dev/null
@@ -1,81 +0,0 @@
-Marvell Prestera Switch Chip bindings
--------------------------------------
-
-Required properties:
-- compatible: must be "marvell,prestera" and one of the following
- "marvell,prestera-98dx3236",
- "marvell,prestera-98dx3336",
- "marvell,prestera-98dx4251",
-- reg: address and length of the register set for the device.
-- interrupts: interrupt for the device
-
-Optional properties:
-- dfx: phandle reference to the "DFX Server" node
-
-Example:
-
-switch {
- compatible = "simple-bus";
- #address-cells = <1>;
- #size-cells = <1>;
- ranges = <0 MBUS_ID(0x03, 0x00) 0 0x100000>;
-
- packet-processor@0 {
- compatible = "marvell,prestera-98dx3236", "marvell,prestera";
- reg = <0 0x4000000>;
- interrupts = <33>, <34>, <35>;
- dfx = <&dfx>;
- };
-};
-
-DFX Server bindings
--------------------
-
-Required properties:
-- compatible: must be "marvell,dfx-server", "simple-bus"
-- ranges: describes the address mapping of a memory-mapped bus.
-- reg: address and length of the register set for the device.
-
-Example:
-
-dfx-server {
- compatible = "marvell,dfx-server", "simple-bus";
- #address-cells = <1>;
- #size-cells = <1>;
- ranges = <0 MBUS_ID(0x08, 0x00) 0 0x100000>;
- reg = <MBUS_ID(0x08, 0x00) 0 0x100000>;
-};
-
-Marvell Prestera SwitchDev bindings
------------------------------------
-Optional properties:
-- compatible: must be "marvell,prestera"
-- base-mac-provider: describes handle to node which provides base mac address,
- might be a static base mac address or nvme cell provider.
-
-Example:
-
-eeprom_mac_addr: eeprom-mac-addr {
- compatible = "eeprom,mac-addr-cell";
- status = "okay";
-
- nvmem = <&eeprom_at24>;
-};
-
-prestera {
- compatible = "marvell,prestera";
- status = "okay";
-
- base-mac-provider = <&eeprom_mac_addr>;
-};
-
-The current implementation of Prestera Switchdev PCI interface driver requires
-that BAR2 is assigned to 0xf6000000 as base address from the PCI IO range:
-
-&cp0_pcie0 {
- ranges = <0x81000000 0x0 0xfb000000 0x0 0xfb000000 0x0 0xf0000
- 0x82000000 0x0 0xf6000000 0x0 0xf6000000 0x0 0x2000000
- 0x82000000 0x0 0xf9000000 0x0 0xf9000000 0x0 0x100000>;
- phys = <&cp0_comphy0 0>;
- status = "okay";
-};
diff --git a/Documentation/devicetree/bindings/net/marvell,prestera.yaml b/Documentation/devicetree/bindings/net/marvell,prestera.yaml
new file mode 100644
index 000000000000..5ea8b73663a5
--- /dev/null
+++ b/Documentation/devicetree/bindings/net/marvell,prestera.yaml
@@ -0,0 +1,91 @@
+# SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/net/marvell,prestera.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Marvell Prestera switch family
+
+maintainers:
+ - Miquel Raynal <miquel.raynal@bootlin.com>
+
+properties:
+ compatible:
+ oneOf:
+ - items:
+ - enum:
+ - marvell,prestera-98dx3236
+ - marvell,prestera-98dx3336
+ - marvell,prestera-98dx4251
+ - const: marvell,prestera
+ - enum:
+ - pci11ab,c804
+ - pci11ab,c80c
+ - pci11ab,cc1e
+
+ reg:
+ maxItems: 1
+
+ interrupts:
+ maxItems: 3
+
+ dfx:
+ description: Reference to the DFX Server bus node.
+ $ref: /schemas/types.yaml#/definitions/phandle
+
+ nvmem-cells: true
+
+ nvmem-cell-names: true
+
+if:
+ properties:
+ compatible:
+ contains:
+ const: marvell,prestera
+
+# Memory mapped AlleyCat3 family
+then:
+ properties:
+ nvmem-cells: false
+ nvmem-cell-names: false
+ required:
+ - interrupts
+
+# PCI Aldrin family
+else:
+ properties:
+ interrupts: false
+ dfx: false
+
+required:
+ - compatible
+ - reg
+
+# Ports can also be described
+additionalProperties:
+ type: object
+
+examples:
+ - |
+ packet-processor@0 {
+ compatible = "marvell,prestera-98dx3236", "marvell,prestera";
+ reg = <0 0x4000000>;
+ interrupts = <33>, <34>, <35>;
+ dfx = <&dfx>;
+ };
+
+ - |
+ pcie@0 {
+ #address-cells = <3>;
+ #size-cells = <2>;
+ ranges = <0x0 0x0 0x0 0x0 0x0 0x0>;
+ reg = <0x0 0x0 0x0 0x0 0x0 0x0>;
+ device_type = "pci";
+
+ switch@0,0 {
+ reg = <0x0 0x0 0x0 0x0 0x0>;
+ compatible = "pci11ab,c80c";
+ nvmem-cells = <&mac_address 0>;
+ nvmem-cell-names = "mac-address";
+ };
+ };
diff --git a/Documentation/devicetree/bindings/net/marvell-pp2.txt b/Documentation/devicetree/bindings/net/marvell-pp2.txt
deleted file mode 100644
index ce15c173f43f..000000000000
--- a/Documentation/devicetree/bindings/net/marvell-pp2.txt
+++ /dev/null
@@ -1,141 +0,0 @@
-* Marvell Armada 375 Ethernet Controller (PPv2.1)
- Marvell Armada 7K/8K Ethernet Controller (PPv2.2)
- Marvell CN913X Ethernet Controller (PPv2.3)
-
-Required properties:
-
-- compatible: should be one of:
- "marvell,armada-375-pp2"
- "marvell,armada-7k-pp2"
-- reg: addresses and length of the register sets for the device.
- For "marvell,armada-375-pp2", must contain the following register
- sets:
- - common controller registers
- - LMS registers
- - one register area per Ethernet port
- For "marvell,armada-7k-pp2" used by 7K/8K and CN913X, must contain the following register
- sets:
- - packet processor registers
- - networking interfaces registers
- - CM3 address space used for TX Flow Control
-
-- clocks: pointers to the reference clocks for this device, consequently:
- - main controller clock (for both armada-375-pp2 and armada-7k-pp2)
- - GOP clock (for both armada-375-pp2 and armada-7k-pp2)
- - MG clock (only for armada-7k-pp2)
- - MG Core clock (only for armada-7k-pp2)
- - AXI clock (only for armada-7k-pp2)
-- clock-names: names of used clocks, must be "pp_clk", "gop_clk", "mg_clk",
- "mg_core_clk" and "axi_clk" (the 3 latter only for armada-7k-pp2).
-
-The ethernet ports are represented by subnodes. At least one port is
-required.
-
-Required properties (port):
-
-- interrupts: interrupt(s) for the port
-- port-id: ID of the port from the MAC point of view
-- gop-port-id: only for marvell,armada-7k-pp2, ID of the port from the
- GOP (Group Of Ports) point of view. This ID is used to index the
- per-port registers in the second register area.
-- phy-mode: See ethernet.txt file in the same directory
-
-Optional properties (port):
-
-- marvell,loopback: port is loopback mode
-- phy: a phandle to a phy node defining the PHY address (as the reg
- property, a single integer).
-- interrupt-names: if more than a single interrupt for is given, must be the
- name associated to the interrupts listed. Valid names are:
- "hifX", with X in [0..8], and "link". The names "tx-cpu0",
- "tx-cpu1", "tx-cpu2", "tx-cpu3" and "rx-shared" are supported
- for backward compatibility but shouldn't be used for new
- additions.
-- marvell,system-controller: a phandle to the system controller.
-
-Example for marvell,armada-375-pp2:
-
-ethernet@f0000 {
- compatible = "marvell,armada-375-pp2";
- reg = <0xf0000 0xa000>,
- <0xc0000 0x3060>,
- <0xc4000 0x100>,
- <0xc5000 0x100>;
- clocks = <&gateclk 3>, <&gateclk 19>;
- clock-names = "pp_clk", "gop_clk";
-
- eth0: eth0@c4000 {
- interrupts = <GIC_SPI 37 IRQ_TYPE_LEVEL_HIGH>;
- port-id = <0>;
- phy = <&phy0>;
- phy-mode = "gmii";
- };
-
- eth1: eth1@c5000 {
- interrupts = <GIC_SPI 41 IRQ_TYPE_LEVEL_HIGH>;
- port-id = <1>;
- phy = <&phy3>;
- phy-mode = "gmii";
- };
-};
-
-Example for marvell,armada-7k-pp2:
-
-cpm_ethernet: ethernet@0 {
- compatible = "marvell,armada-7k-pp22";
- reg = <0x0 0x100000>, <0x129000 0xb000>, <0x220000 0x800>;
- clocks = <&cpm_syscon0 1 3>, <&cpm_syscon0 1 9>,
- <&cpm_syscon0 1 5>, <&cpm_syscon0 1 6>, <&cpm_syscon0 1 18>;
- clock-names = "pp_clk", "gop_clk", "mg_clk", "mg_core_clk", "axi_clk";
-
- eth0: eth0 {
- interrupts = <ICU_GRP_NSR 39 IRQ_TYPE_LEVEL_HIGH>,
- <ICU_GRP_NSR 43 IRQ_TYPE_LEVEL_HIGH>,
- <ICU_GRP_NSR 47 IRQ_TYPE_LEVEL_HIGH>,
- <ICU_GRP_NSR 51 IRQ_TYPE_LEVEL_HIGH>,
- <ICU_GRP_NSR 55 IRQ_TYPE_LEVEL_HIGH>,
- <ICU_GRP_NSR 59 IRQ_TYPE_LEVEL_HIGH>,
- <ICU_GRP_NSR 63 IRQ_TYPE_LEVEL_HIGH>,
- <ICU_GRP_NSR 67 IRQ_TYPE_LEVEL_HIGH>,
- <ICU_GRP_NSR 71 IRQ_TYPE_LEVEL_HIGH>,
- <ICU_GRP_NSR 129 IRQ_TYPE_LEVEL_HIGH>;
- interrupt-names = "hif0", "hif1", "hif2", "hif3", "hif4",
- "hif5", "hif6", "hif7", "hif8", "link";
- port-id = <0>;
- gop-port-id = <0>;
- };
-
- eth1: eth1 {
- interrupts = <ICU_GRP_NSR 40 IRQ_TYPE_LEVEL_HIGH>,
- <ICU_GRP_NSR 44 IRQ_TYPE_LEVEL_HIGH>,
- <ICU_GRP_NSR 48 IRQ_TYPE_LEVEL_HIGH>,
- <ICU_GRP_NSR 52 IRQ_TYPE_LEVEL_HIGH>,
- <ICU_GRP_NSR 56 IRQ_TYPE_LEVEL_HIGH>,
- <ICU_GRP_NSR 60 IRQ_TYPE_LEVEL_HIGH>,
- <ICU_GRP_NSR 64 IRQ_TYPE_LEVEL_HIGH>,
- <ICU_GRP_NSR 68 IRQ_TYPE_LEVEL_HIGH>,
- <ICU_GRP_NSR 72 IRQ_TYPE_LEVEL_HIGH>,
- <ICU_GRP_NSR 128 IRQ_TYPE_LEVEL_HIGH>;
- interrupt-names = "hif0", "hif1", "hif2", "hif3", "hif4",
- "hif5", "hif6", "hif7", "hif8", "link";
- port-id = <1>;
- gop-port-id = <2>;
- };
-
- eth2: eth2 {
- interrupts = <ICU_GRP_NSR 41 IRQ_TYPE_LEVEL_HIGH>,
- <ICU_GRP_NSR 45 IRQ_TYPE_LEVEL_HIGH>,
- <ICU_GRP_NSR 49 IRQ_TYPE_LEVEL_HIGH>,
- <ICU_GRP_NSR 53 IRQ_TYPE_LEVEL_HIGH>,
- <ICU_GRP_NSR 57 IRQ_TYPE_LEVEL_HIGH>,
- <ICU_GRP_NSR 61 IRQ_TYPE_LEVEL_HIGH>,
- <ICU_GRP_NSR 65 IRQ_TYPE_LEVEL_HIGH>,
- <ICU_GRP_NSR 69 IRQ_TYPE_LEVEL_HIGH>,
- <ICU_GRP_NSR 73 IRQ_TYPE_LEVEL_HIGH>,
- <ICU_GRP_NSR 127 IRQ_TYPE_LEVEL_HIGH>;
- interrupt-names = "hif0", "hif1", "hif2", "hif3", "hif4",
- "hif5", "hif6", "hif7", "hif8", "link";
- port-id = <2>;
- gop-port-id = <3>;
- };
-};
diff --git a/Documentation/devicetree/bindings/net/microchip,lan95xx.yaml b/Documentation/devicetree/bindings/net/microchip,lan95xx.yaml
index cf91fecd8909..3715c5f8f0e0 100644
--- a/Documentation/devicetree/bindings/net/microchip,lan95xx.yaml
+++ b/Documentation/devicetree/bindings/net/microchip,lan95xx.yaml
@@ -39,7 +39,9 @@ properties:
- usb424,9e08 # SMSC LAN89530 USB Ethernet Device
- usb424,ec00 # SMSC9512/9514 USB Hub & Ethernet Device
- reg: true
+ reg:
+ maxItems: 1
+
local-mac-address: true
mac-address: true
diff --git a/Documentation/devicetree/bindings/net/nfc/nxp,nci.yaml b/Documentation/devicetree/bindings/net/nfc/nxp,nci.yaml
index b2558421268a..6924aff0b2c5 100644
--- a/Documentation/devicetree/bindings/net/nfc/nxp,nci.yaml
+++ b/Documentation/devicetree/bindings/net/nfc/nxp,nci.yaml
@@ -14,7 +14,9 @@ properties:
oneOf:
- const: nxp,nxp-nci-i2c
- items:
- - const: nxp,pn547
+ - enum:
+ - nxp,nq310
+ - nxp,pn547
- const: nxp,nxp-nci-i2c
enable-gpios:
diff --git a/Documentation/devicetree/bindings/net/nxp,dwmac-imx.yaml b/Documentation/devicetree/bindings/net/nxp,dwmac-imx.yaml
index 0270b0ca166b..04df496af7e6 100644
--- a/Documentation/devicetree/bindings/net/nxp,dwmac-imx.yaml
+++ b/Documentation/devicetree/bindings/net/nxp,dwmac-imx.yaml
@@ -7,7 +7,9 @@ $schema: http://devicetree.org/meta-schemas/core.yaml#
title: NXP i.MX8 DWMAC glue layer
maintainers:
- - Joakim Zhang <qiangqing.zhang@nxp.com>
+ - Clark Wang <xiaoning.wang@nxp.com>
+ - Shawn Guo <shawnguo@kernel.org>
+ - NXP Linux Team <linux-imx@nxp.com>
# We need a select here so we don't match all nodes with 'snps,dwmac'
select:
diff --git a/Documentation/devicetree/bindings/net/pcs/fsl,lynx-pcs.yaml b/Documentation/devicetree/bindings/net/pcs/fsl,lynx-pcs.yaml
new file mode 100644
index 000000000000..fbedf696c555
--- /dev/null
+++ b/Documentation/devicetree/bindings/net/pcs/fsl,lynx-pcs.yaml
@@ -0,0 +1,40 @@
+# SPDX-License-Identifier: GPL-2.0-only OR BSD-2-Clause
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/net/pcs/fsl,lynx-pcs.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: NXP Lynx PCS
+
+maintainers:
+ - Ioana Ciornei <ioana.ciornei@nxp.com>
+
+description: |
+ NXP Lynx 10G and 28G SerDes have Ethernet PCS devices which can be used as
+ protocol controllers. They are accessible over the Ethernet interface's MDIO
+ bus.
+
+properties:
+ compatible:
+ const: fsl,lynx-pcs
+
+ reg:
+ maxItems: 1
+
+required:
+ - compatible
+ - reg
+
+additionalProperties: false
+
+examples:
+ - |
+ mdio-bus {
+ #address-cells = <1>;
+ #size-cells = <0>;
+
+ qsgmii_pcs1: ethernet-pcs@1 {
+ compatible = "fsl,lynx-pcs";
+ reg = <1>;
+ };
+ };
diff --git a/Documentation/devicetree/bindings/net/qca,ar71xx.yaml b/Documentation/devicetree/bindings/net/qca,ar71xx.yaml
index 1ebf9e8c8a1d..89f94b31b546 100644
--- a/Documentation/devicetree/bindings/net/qca,ar71xx.yaml
+++ b/Documentation/devicetree/bindings/net/qca,ar71xx.yaml
@@ -123,7 +123,6 @@ examples:
switch_port0: port@0 {
reg = <0x0>;
- label = "cpu";
ethernet = <&eth1>;
phy-mode = "gmii";
diff --git a/Documentation/devicetree/bindings/net/qcom,ipa.yaml b/Documentation/devicetree/bindings/net/qcom,ipa.yaml
index dd4bb2e74880..4aeda379726f 100644
--- a/Documentation/devicetree/bindings/net/qcom,ipa.yaml
+++ b/Documentation/devicetree/bindings/net/qcom,ipa.yaml
@@ -49,6 +49,7 @@ properties:
- qcom,sc7280-ipa
- qcom,sdm845-ipa
- qcom,sdx55-ipa
+ - qcom,sm6350-ipa
- qcom,sm8350-ipa
reg:
@@ -124,19 +125,31 @@ properties:
- const: ipa-clock-enabled-valid
- const: ipa-clock-enabled
+ qcom,gsi-loader:
+ enum:
+ - self
+ - modem
+ - skip
+ description:
+ Indicates how GSI firmware should be loaded. If the AP loads
+ and validates GSI firmware, this property has value "self".
+ If the modem does this, this property has value "modem".
+ Otherwise, "skip" means GSI firmware loading is not required.
+
modem-init:
+ deprecated: true
type: boolean
description:
- If present, it indicates that the modem is responsible for
- performing early IPA initialization, including loading and
- validating firwmare used by the GSI.
+ This is the older (deprecated) way of indicating how GSI firmware
+ should be loaded. If present, the modem loads GSI firmware; if
+ absent, the AP loads GSI firmware.
memory-region:
maxItems: 1
description:
If present, a phandle for a reserved memory area that holds
the firmware passed to Trust Zone for authentication. Required
- when Trust Zone (not the modem) performs early initialization.
+ when the AP (not the modem) performs early initialization.
firmware-name:
$ref: /schemas/types.yaml#/definitions/string
@@ -155,22 +168,36 @@ required:
- interconnects
- qcom,smem-states
-# Either modem-init is present, or memory-region must be present.
-oneOf:
- - required:
- - modem-init
- - required:
- - memory-region
-
-# If memory-region is present, firmware-name may optionally be present.
-# But if modem-init is present, firmware-name must not be present.
-if:
- required:
- - modem-init
-then:
- not:
- required:
- - firmware-name
+allOf:
+ # If qcom,gsi-loader is present, modem-init must not be present
+ - if:
+ required:
+ - qcom,gsi-loader
+ then:
+ properties:
+ modem-init: false
+
+ # If qcom,gsi-loader is "self", the AP loads GSI firmware, and
+ # memory-region must be specified
+ if:
+ properties:
+ qcom,gsi-loader:
+ contains:
+ const: self
+ then:
+ required:
+ - memory-region
+ else:
+ # If qcom,gsi-loader is not present, we use deprecated behavior.
+ # If modem-init is not present, the AP loads GSI firmware, and
+ # memory-region must be specified.
+ if:
+ not:
+ required:
+ - modem-init
+ then:
+ required:
+ - memory-region
additionalProperties: false
@@ -201,14 +228,17 @@ examples:
};
ipa@1e40000 {
- compatible = "qcom,sdm845-ipa";
+ compatible = "qcom,sc7180-ipa";
- modem-init;
+ qcom,gsi-loader = "self";
+ memory-region = <&ipa_fw_mem>;
+ firmware-name = "qcom/sc7180-trogdor/modem/modem.mdt";
- iommus = <&apps_smmu 0x720 0x3>;
+ iommus = <&apps_smmu 0x440 0x0>,
+ <&apps_smmu 0x442 0x0>;
reg = <0x1e40000 0x7000>,
- <0x1e47000 0x2000>,
- <0x1e04000 0x2c000>;
+ <0x1e47000 0x2000>,
+ <0x1e04000 0x2c000>;
reg-names = "ipa-reg",
"ipa-shared",
"gsi";
@@ -226,9 +256,9 @@ examples:
clock-names = "core";
interconnects =
- <&rsc_hlos MASTER_IPA &rsc_hlos SLAVE_EBI1>,
- <&rsc_hlos MASTER_IPA &rsc_hlos SLAVE_IMEM>,
- <&rsc_hlos MASTER_APPSS_PROC &rsc_hlos SLAVE_IPA_CFG>;
+ <&aggre2_noc MASTER_IPA 0 &mc_virt SLAVE_EBI1 0>,
+ <&aggre2_noc MASTER_IPA 0 &system_noc SLAVE_IMEM 0>,
+ <&gem_noc MASTER_APPSS_PROC 0 &config_noc SLAVE_IPA_CFG 0>;
interconnect-names = "memory",
"imem",
"config";
diff --git a/Documentation/devicetree/bindings/net/qcom,ipq4019-mdio.yaml b/Documentation/devicetree/bindings/net/qcom,ipq4019-mdio.yaml
index ad8b2b41c140..7631ecc8fd01 100644
--- a/Documentation/devicetree/bindings/net/qcom,ipq4019-mdio.yaml
+++ b/Documentation/devicetree/bindings/net/qcom,ipq4019-mdio.yaml
@@ -9,14 +9,18 @@ title: Qualcomm IPQ40xx MDIO Controller
maintainers:
- Robert Marko <robert.marko@sartura.hr>
-allOf:
- - $ref: "mdio.yaml#"
-
properties:
compatible:
- enum:
- - qcom,ipq4019-mdio
- - qcom,ipq5018-mdio
+ oneOf:
+ - enum:
+ - qcom,ipq4019-mdio
+ - qcom,ipq5018-mdio
+
+ - items:
+ - enum:
+ - qcom,ipq6018-mdio
+ - qcom,ipq8074-mdio
+ - const: qcom,ipq4019-mdio
"#address-cells":
const: 1
@@ -33,10 +37,12 @@ properties:
address range is only required by the platform IPQ50xx.
clocks:
- maxItems: 1
- description: |
- MDIO clock source frequency fixed to 100MHZ, this clock should be specified
- by the platform IPQ807x, IPQ60xx and IPQ50xx.
+ items:
+ - description: MDIO clock source frequency fixed to 100MHZ
+
+ clock-names:
+ items:
+ - const: gcc_mdio_ahb_clk
required:
- compatible
@@ -44,6 +50,26 @@ required:
- "#address-cells"
- "#size-cells"
+allOf:
+ - $ref: "mdio.yaml#"
+
+ - if:
+ properties:
+ compatible:
+ contains:
+ enum:
+ - qcom,ipq5018-mdio
+ - qcom,ipq6018-mdio
+ - qcom,ipq8074-mdio
+ then:
+ required:
+ - clocks
+ - clock-names
+ else:
+ properties:
+ clocks: false
+ clock-names: false
+
unevaluatedProperties: false
examples:
diff --git a/Documentation/devicetree/bindings/net/realtek-bluetooth.yaml b/Documentation/devicetree/bindings/net/realtek-bluetooth.yaml
index e329ef06e10f..143b5667abad 100644
--- a/Documentation/devicetree/bindings/net/realtek-bluetooth.yaml
+++ b/Documentation/devicetree/bindings/net/realtek-bluetooth.yaml
@@ -20,6 +20,7 @@ properties:
enum:
- realtek,rtl8723bs-bt
- realtek,rtl8723cs-bt
+ - realtek,rtl8723ds-bt
- realtek,rtl8822cs-bt
device-wake-gpios:
diff --git a/Documentation/devicetree/bindings/net/renesas,r8a779f0-ether-switch.yaml b/Documentation/devicetree/bindings/net/renesas,r8a779f0-ether-switch.yaml
new file mode 100644
index 000000000000..e933a1e48d67
--- /dev/null
+++ b/Documentation/devicetree/bindings/net/renesas,r8a779f0-ether-switch.yaml
@@ -0,0 +1,262 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/net/renesas,r8a779f0-ether-switch.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Renesas Ethernet Switch
+
+maintainers:
+ - Yoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com>
+
+properties:
+ compatible:
+ const: renesas,r8a779f0-ether-switch
+
+ reg:
+ maxItems: 2
+
+ reg-names:
+ items:
+ - const: base
+ - const: secure_base
+
+ interrupts:
+ maxItems: 47
+
+ interrupt-names:
+ items:
+ - const: mfwd_error
+ - const: race_error
+ - const: coma_error
+ - const: gwca0_error
+ - const: gwca1_error
+ - const: etha0_error
+ - const: etha1_error
+ - const: etha2_error
+ - const: gptp0_status
+ - const: gptp1_status
+ - const: mfwd_status
+ - const: race_status
+ - const: coma_status
+ - const: gwca0_status
+ - const: gwca1_status
+ - const: etha0_status
+ - const: etha1_status
+ - const: etha2_status
+ - const: rmac0_status
+ - const: rmac1_status
+ - const: rmac2_status
+ - const: gwca0_rxtx0
+ - const: gwca0_rxtx1
+ - const: gwca0_rxtx2
+ - const: gwca0_rxtx3
+ - const: gwca0_rxtx4
+ - const: gwca0_rxtx5
+ - const: gwca0_rxtx6
+ - const: gwca0_rxtx7
+ - const: gwca1_rxtx0
+ - const: gwca1_rxtx1
+ - const: gwca1_rxtx2
+ - const: gwca1_rxtx3
+ - const: gwca1_rxtx4
+ - const: gwca1_rxtx5
+ - const: gwca1_rxtx6
+ - const: gwca1_rxtx7
+ - const: gwca0_rxts0
+ - const: gwca0_rxts1
+ - const: gwca1_rxts0
+ - const: gwca1_rxts1
+ - const: rmac0_mdio
+ - const: rmac1_mdio
+ - const: rmac2_mdio
+ - const: rmac0_phy
+ - const: rmac1_phy
+ - const: rmac2_phy
+
+ clocks:
+ maxItems: 1
+
+ resets:
+ maxItems: 1
+
+ iommus:
+ maxItems: 16
+
+ power-domains:
+ maxItems: 1
+
+ ethernet-ports:
+ type: object
+ additionalProperties: false
+
+ properties:
+ '#address-cells':
+ description: Port number of ETHA (TSNA).
+ const: 1
+
+ '#size-cells':
+ const: 0
+
+ patternProperties:
+ "^port@[0-9a-f]+$":
+ type: object
+ $ref: /schemas/net/ethernet-controller.yaml#
+ unevaluatedProperties: false
+
+ properties:
+ reg:
+ maxItems: 1
+ description:
+ Port number of ETHA (TSNA).
+
+ phys:
+ maxItems: 1
+ description:
+ Phandle of an Ethernet SERDES.
+
+ mdio:
+ $ref: /schemas/net/mdio.yaml#
+ unevaluatedProperties: false
+
+ required:
+ - reg
+ - phy-handle
+ - phy-mode
+ - phys
+ - mdio
+
+required:
+ - compatible
+ - reg
+ - reg-names
+ - interrupts
+ - interrupt-names
+ - clocks
+ - resets
+ - power-domains
+ - ethernet-ports
+
+additionalProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/clock/r8a779f0-cpg-mssr.h>
+ #include <dt-bindings/interrupt-controller/arm-gic.h>
+ #include <dt-bindings/power/r8a779f0-sysc.h>
+
+ ethernet@e6880000 {
+ compatible = "renesas,r8a779f0-ether-switch";
+ reg = <0xe6880000 0x20000>, <0xe68c0000 0x20000>;
+ reg-names = "base", "secure_base";
+ interrupts = <GIC_SPI 256 IRQ_TYPE_LEVEL_HIGH>,
+ <GIC_SPI 257 IRQ_TYPE_LEVEL_HIGH>,
+ <GIC_SPI 258 IRQ_TYPE_LEVEL_HIGH>,
+ <GIC_SPI 259 IRQ_TYPE_LEVEL_HIGH>,
+ <GIC_SPI 260 IRQ_TYPE_LEVEL_HIGH>,
+ <GIC_SPI 261 IRQ_TYPE_LEVEL_HIGH>,
+ <GIC_SPI 262 IRQ_TYPE_LEVEL_HIGH>,
+ <GIC_SPI 263 IRQ_TYPE_LEVEL_HIGH>,
+ <GIC_SPI 265 IRQ_TYPE_LEVEL_HIGH>,
+ <GIC_SPI 266 IRQ_TYPE_LEVEL_HIGH>,
+ <GIC_SPI 267 IRQ_TYPE_LEVEL_HIGH>,
+ <GIC_SPI 268 IRQ_TYPE_LEVEL_HIGH>,
+ <GIC_SPI 269 IRQ_TYPE_LEVEL_HIGH>,
+ <GIC_SPI 270 IRQ_TYPE_LEVEL_HIGH>,
+ <GIC_SPI 271 IRQ_TYPE_LEVEL_HIGH>,
+ <GIC_SPI 272 IRQ_TYPE_LEVEL_HIGH>,
+ <GIC_SPI 273 IRQ_TYPE_LEVEL_HIGH>,
+ <GIC_SPI 274 IRQ_TYPE_LEVEL_HIGH>,
+ <GIC_SPI 276 IRQ_TYPE_LEVEL_HIGH>,
+ <GIC_SPI 277 IRQ_TYPE_LEVEL_HIGH>,
+ <GIC_SPI 278 IRQ_TYPE_LEVEL_HIGH>,
+ <GIC_SPI 280 IRQ_TYPE_LEVEL_HIGH>,
+ <GIC_SPI 281 IRQ_TYPE_LEVEL_HIGH>,
+ <GIC_SPI 282 IRQ_TYPE_LEVEL_HIGH>,
+ <GIC_SPI 283 IRQ_TYPE_LEVEL_HIGH>,
+ <GIC_SPI 284 IRQ_TYPE_LEVEL_HIGH>,
+ <GIC_SPI 285 IRQ_TYPE_LEVEL_HIGH>,
+ <GIC_SPI 286 IRQ_TYPE_LEVEL_HIGH>,
+ <GIC_SPI 287 IRQ_TYPE_LEVEL_HIGH>,
+ <GIC_SPI 288 IRQ_TYPE_LEVEL_HIGH>,
+ <GIC_SPI 289 IRQ_TYPE_LEVEL_HIGH>,
+ <GIC_SPI 290 IRQ_TYPE_LEVEL_HIGH>,
+ <GIC_SPI 291 IRQ_TYPE_LEVEL_HIGH>,
+ <GIC_SPI 292 IRQ_TYPE_LEVEL_HIGH>,
+ <GIC_SPI 293 IRQ_TYPE_LEVEL_HIGH>,
+ <GIC_SPI 294 IRQ_TYPE_LEVEL_HIGH>,
+ <GIC_SPI 295 IRQ_TYPE_LEVEL_HIGH>,
+ <GIC_SPI 296 IRQ_TYPE_LEVEL_HIGH>,
+ <GIC_SPI 297 IRQ_TYPE_LEVEL_HIGH>,
+ <GIC_SPI 298 IRQ_TYPE_LEVEL_HIGH>,
+ <GIC_SPI 299 IRQ_TYPE_LEVEL_HIGH>,
+ <GIC_SPI 300 IRQ_TYPE_LEVEL_HIGH>,
+ <GIC_SPI 301 IRQ_TYPE_LEVEL_HIGH>,
+ <GIC_SPI 302 IRQ_TYPE_LEVEL_HIGH>,
+ <GIC_SPI 304 IRQ_TYPE_LEVEL_HIGH>,
+ <GIC_SPI 305 IRQ_TYPE_LEVEL_HIGH>,
+ <GIC_SPI 306 IRQ_TYPE_LEVEL_HIGH>;
+ interrupt-names = "mfwd_error", "race_error",
+ "coma_error", "gwca0_error",
+ "gwca1_error", "etha0_error",
+ "etha1_error", "etha2_error",
+ "gptp0_status", "gptp1_status",
+ "mfwd_status", "race_status",
+ "coma_status", "gwca0_status",
+ "gwca1_status", "etha0_status",
+ "etha1_status", "etha2_status",
+ "rmac0_status", "rmac1_status",
+ "rmac2_status",
+ "gwca0_rxtx0", "gwca0_rxtx1",
+ "gwca0_rxtx2", "gwca0_rxtx3",
+ "gwca0_rxtx4", "gwca0_rxtx5",
+ "gwca0_rxtx6", "gwca0_rxtx7",
+ "gwca1_rxtx0", "gwca1_rxtx1",
+ "gwca1_rxtx2", "gwca1_rxtx3",
+ "gwca1_rxtx4", "gwca1_rxtx5",
+ "gwca1_rxtx6", "gwca1_rxtx7",
+ "gwca0_rxts0", "gwca0_rxts1",
+ "gwca1_rxts0", "gwca1_rxts1",
+ "rmac0_mdio", "rmac1_mdio",
+ "rmac2_mdio",
+ "rmac0_phy", "rmac1_phy",
+ "rmac2_phy";
+ clocks = <&cpg CPG_MOD 1505>;
+ power-domains = <&sysc R8A779F0_PD_ALWAYS_ON>;
+ resets = <&cpg 1505>;
+
+ ethernet-ports {
+ #address-cells = <1>;
+ #size-cells = <0>;
+ port@0 {
+ reg = <0>;
+ phy-handle = <&eth_phy0>;
+ phy-mode = "sgmii";
+ phys = <&eth_serdes 0>;
+ mdio {
+ #address-cells = <1>;
+ #size-cells = <0>;
+ };
+ };
+ port@1 {
+ reg = <1>;
+ phy-handle = <&eth_phy1>;
+ phy-mode = "sgmii";
+ phys = <&eth_serdes 1>;
+ mdio {
+ #address-cells = <1>;
+ #size-cells = <0>;
+ };
+ };
+ port@2 {
+ reg = <2>;
+ phy-handle = <&eth_phy2>;
+ phy-mode = "sgmii";
+ phys = <&eth_serdes 2>;
+ mdio {
+ #address-cells = <1>;
+ #size-cells = <0>;
+ };
+ };
+ };
+ };
diff --git a/Documentation/devicetree/bindings/net/sff,sfp.yaml b/Documentation/devicetree/bindings/net/sff,sfp.yaml
index 06c66ab81c01..231c4d75e4b1 100644
--- a/Documentation/devicetree/bindings/net/sff,sfp.yaml
+++ b/Documentation/devicetree/bindings/net/sff,sfp.yaml
@@ -22,7 +22,8 @@ properties:
phandle of an I2C bus controller for the SFP two wire serial
maximum-power-milliwatt:
- maxItems: 1
+ minimum: 1000
+ default: 1000
description:
Maximum module power consumption Specifies the maximum power consumption
allowable by a module in the slot, in milli-Watts. Presently, modules can
diff --git a/Documentation/devicetree/bindings/net/snps,dwmac.yaml b/Documentation/devicetree/bindings/net/snps,dwmac.yaml
index 13b984076af5..e88a86623fce 100644
--- a/Documentation/devicetree/bindings/net/snps,dwmac.yaml
+++ b/Documentation/devicetree/bindings/net/snps,dwmac.yaml
@@ -167,56 +167,238 @@ properties:
snps,mtl-rx-config:
$ref: /schemas/types.yaml#/definitions/phandle
description:
- Multiple RX Queues parameters. Phandle to a node that can
- contain the following properties
- * snps,rx-queues-to-use, number of RX queues to be used in the
- driver
- * Choose one of these RX scheduling algorithms
- * snps,rx-sched-sp, Strict priority
- * snps,rx-sched-wsp, Weighted Strict priority
- * For each RX queue
- * Choose one of these modes
- * snps,dcb-algorithm, Queue to be enabled as DCB
- * snps,avb-algorithm, Queue to be enabled as AVB
- * snps,map-to-dma-channel, Channel to map
- * Specifiy specific packet routing
- * snps,route-avcp, AV Untagged Control packets
- * snps,route-ptp, PTP Packets
- * snps,route-dcbcp, DCB Control Packets
- * snps,route-up, Untagged Packets
- * snps,route-multi-broad, Multicast & Broadcast Packets
- * snps,priority, bitmask of the tagged frames priorities assigned to
- the queue
+ Multiple RX Queues parameters. Phandle to a node that
+ implements the 'rx-queues-config' object described in
+ this binding.
+
+ rx-queues-config:
+ type: object
+ properties:
+ snps,rx-queues-to-use:
+ $ref: /schemas/types.yaml#/definitions/uint32
+ description: number of RX queues to be used in the driver
+ snps,rx-sched-sp:
+ type: boolean
+ description: Strict priority
+ snps,rx-sched-wsp:
+ type: boolean
+ description: Weighted Strict priority
+ allOf:
+ - if:
+ required:
+ - snps,rx-sched-sp
+ then:
+ properties:
+ snps,rx-sched-wsp: false
+ - if:
+ required:
+ - snps,rx-sched-wsp
+ then:
+ properties:
+ snps,rx-sched-sp: false
+ patternProperties:
+ "^queue[0-9]$":
+ description: Each subnode represents a queue.
+ type: object
+ properties:
+ snps,dcb-algorithm:
+ type: boolean
+ description: Queue to be enabled as DCB
+ snps,avb-algorithm:
+ type: boolean
+ description: Queue to be enabled as AVB
+ snps,map-to-dma-channel:
+ $ref: /schemas/types.yaml#/definitions/uint32
+ description: DMA channel id to map
+ snps,route-avcp:
+ type: boolean
+ description: AV Untagged Control packets
+ snps,route-ptp:
+ type: boolean
+ description: PTP Packets
+ snps,route-dcbcp:
+ type: boolean
+ description: DCB Control Packets
+ snps,route-up:
+ type: boolean
+ description: Untagged Packets
+ snps,route-multi-broad:
+ type: boolean
+ description: Multicast & Broadcast Packets
+ snps,priority:
+ $ref: /schemas/types.yaml#/definitions/uint32
+ description: Bitmask of the tagged frames priorities assigned to the queue
+ allOf:
+ - if:
+ required:
+ - snps,dcb-algorithm
+ then:
+ properties:
+ snps,avb-algorithm: false
+ - if:
+ required:
+ - snps,avb-algorithm
+ then:
+ properties:
+ snps,dcb-algorithm: false
+ - if:
+ required:
+ - snps,route-avcp
+ then:
+ properties:
+ snps,route-ptp: false
+ snps,route-dcbcp: false
+ snps,route-up: false
+ snps,route-multi-broad: false
+ - if:
+ required:
+ - snps,route-ptp
+ then:
+ properties:
+ snps,route-avcp: false
+ snps,route-dcbcp: false
+ snps,route-up: false
+ snps,route-multi-broad: false
+ - if:
+ required:
+ - snps,route-dcbcp
+ then:
+ properties:
+ snps,route-avcp: false
+ snps,route-ptp: false
+ snps,route-up: false
+ snps,route-multi-broad: false
+ - if:
+ required:
+ - snps,route-up
+ then:
+ properties:
+ snps,route-avcp: false
+ snps,route-ptp: false
+ snps,route-dcbcp: false
+ snps,route-multi-broad: false
+ - if:
+ required:
+ - snps,route-multi-broad
+ then:
+ properties:
+ snps,route-avcp: false
+ snps,route-ptp: false
+ snps,route-dcbcp: false
+ snps,route-up: false
+ additionalProperties: false
+ additionalProperties: false
snps,mtl-tx-config:
$ref: /schemas/types.yaml#/definitions/phandle
description:
- Multiple TX Queues parameters. Phandle to a node that can
- contain the following properties
- * snps,tx-queues-to-use, number of TX queues to be used in the
- driver
- * Choose one of these TX scheduling algorithms
- * snps,tx-sched-wrr, Weighted Round Robin
- * snps,tx-sched-wfq, Weighted Fair Queuing
- * snps,tx-sched-dwrr, Deficit Weighted Round Robin
- * snps,tx-sched-sp, Strict priority
- * For each TX queue
- * snps,weight, TX queue weight (if using a DCB weight
- algorithm)
- * Choose one of these modes
- * snps,dcb-algorithm, TX queue will be working in DCB
- * snps,avb-algorithm, TX queue will be working in AVB
- [Attention] Queue 0 is reserved for legacy traffic
- and so no AVB is available in this queue.
- * Configure Credit Base Shaper (if AVB Mode selected)
- * snps,send_slope, enable Low Power Interface
- * snps,idle_slope, unlock on WoL
- * snps,high_credit, max write outstanding req. limit
- * snps,low_credit, max read outstanding req. limit
- * snps,priority, bitmask of the priorities assigned to the queue.
- When a PFC frame is received with priorities matching the bitmask,
- the queue is blocked from transmitting for the pause time specified
- in the PFC frame.
+ Multiple TX Queues parameters. Phandle to a node that
+ implements the 'tx-queues-config' object described in
+ this binding.
+
+ tx-queues-config:
+ type: object
+ properties:
+ snps,tx-queues-to-use:
+ $ref: /schemas/types.yaml#/definitions/uint32
+ description: number of TX queues to be used in the driver
+ snps,tx-sched-wrr:
+ type: boolean
+ description: Weighted Round Robin
+ snps,tx-sched-wfq:
+ type: boolean
+ description: Weighted Fair Queuing
+ snps,tx-sched-dwrr:
+ type: boolean
+ description: Deficit Weighted Round Robin
+ snps,tx-sched-sp:
+ type: boolean
+ description: Strict priority
+ allOf:
+ - if:
+ required:
+ - snps,tx-sched-wrr
+ then:
+ properties:
+ snps,tx-sched-wfq: false
+ snps,tx-sched-dwrr: false
+ snps,tx-sched-sp: false
+ - if:
+ required:
+ - snps,tx-sched-wfq
+ then:
+ properties:
+ snps,tx-sched-wrr: false
+ snps,tx-sched-dwrr: false
+ snps,tx-sched-sp: false
+ - if:
+ required:
+ - snps,tx-sched-dwrr
+ then:
+ properties:
+ snps,tx-sched-wrr: false
+ snps,tx-sched-wfq: false
+ snps,tx-sched-sp: false
+ - if:
+ required:
+ - snps,tx-sched-sp
+ then:
+ properties:
+ snps,tx-sched-wrr: false
+ snps,tx-sched-wfq: false
+ snps,tx-sched-dwrr: false
+ patternProperties:
+ "^queue[0-9]$":
+ description: Each subnode represents a queue.
+ type: object
+ properties:
+ snps,weight:
+ $ref: /schemas/types.yaml#/definitions/uint32
+ description: TX queue weight (if using a DCB weight algorithm)
+ snps,dcb-algorithm:
+ type: boolean
+ description: TX queue will be working in DCB
+ snps,avb-algorithm:
+ type: boolean
+ description:
+ TX queue will be working in AVB.
+ Queue 0 is reserved for legacy traffic and so no AVB is
+ available in this queue.
+ snps,send_slope:
+ $ref: /schemas/types.yaml#/definitions/uint32
+ description: enable Low Power Interface
+ snps,idle_slope:
+ $ref: /schemas/types.yaml#/definitions/uint32
+ description: unlock on WoL
+ snps,high_credit:
+ $ref: /schemas/types.yaml#/definitions/uint32
+ description: max write outstanding req. limit
+ snps,low_credit:
+ $ref: /schemas/types.yaml#/definitions/uint32
+ description: max read outstanding req. limit
+ snps,priority:
+ $ref: /schemas/types.yaml#/definitions/uint32
+ description:
+ Bitmask of the tagged frames priorities assigned to the queue.
+ When a PFC frame is received with priorities matching the bitmask,
+ the queue is blocked from transmitting for the pause time specified
+ in the PFC frame.
+ allOf:
+ - if:
+ required:
+ - snps,dcb-algorithm
+ then:
+ properties:
+ snps,avb-algorithm: false
+ - if:
+ required:
+ - snps,avb-algorithm
+ then:
+ properties:
+ snps,dcb-algorithm: false
+ snps,weight: false
+ additionalProperties: false
+ additionalProperties: false
snps,reset-gpio:
deprecated: true
@@ -463,41 +645,6 @@ additionalProperties: true
examples:
- |
- stmmac_axi_setup: stmmac-axi-config {
- snps,wr_osr_lmt = <0xf>;
- snps,rd_osr_lmt = <0xf>;
- snps,blen = <256 128 64 32 0 0 0>;
- };
-
- mtl_rx_setup: rx-queues-config {
- snps,rx-queues-to-use = <1>;
- snps,rx-sched-sp;
- queue0 {
- snps,dcb-algorithm;
- snps,map-to-dma-channel = <0x0>;
- snps,priority = <0x0>;
- };
- };
-
- mtl_tx_setup: tx-queues-config {
- snps,tx-queues-to-use = <2>;
- snps,tx-sched-wrr;
- queue0 {
- snps,weight = <0x10>;
- snps,dcb-algorithm;
- snps,priority = <0x0>;
- };
-
- queue1 {
- snps,avb-algorithm;
- snps,send_slope = <0x1000>;
- snps,idle_slope = <0x1000>;
- snps,high_credit = <0x3E800>;
- snps,low_credit = <0xFFC18000>;
- snps,priority = <0x1>;
- };
- };
-
gmac0: ethernet@e0800000 {
compatible = "snps,dwxgmac-2.10", "snps,dwxgmac";
reg = <0xe0800000 0x8000>;
@@ -516,6 +663,42 @@ examples:
snps,axi-config = <&stmmac_axi_setup>;
snps,mtl-rx-config = <&mtl_rx_setup>;
snps,mtl-tx-config = <&mtl_tx_setup>;
+
+ stmmac_axi_setup: stmmac-axi-config {
+ snps,wr_osr_lmt = <0xf>;
+ snps,rd_osr_lmt = <0xf>;
+ snps,blen = <256 128 64 32 0 0 0>;
+ };
+
+ mtl_rx_setup: rx-queues-config {
+ snps,rx-queues-to-use = <1>;
+ snps,rx-sched-sp;
+ queue0 {
+ snps,dcb-algorithm;
+ snps,map-to-dma-channel = <0x0>;
+ snps,priority = <0x0>;
+ };
+ };
+
+ mtl_tx_setup: tx-queues-config {
+ snps,tx-queues-to-use = <2>;
+ snps,tx-sched-wrr;
+ queue0 {
+ snps,weight = <0x10>;
+ snps,dcb-algorithm;
+ snps,priority = <0x0>;
+ };
+
+ queue1 {
+ snps,avb-algorithm;
+ snps,send_slope = <0x1000>;
+ snps,idle_slope = <0x1000>;
+ snps,high_credit = <0x3E800>;
+ snps,low_credit = <0xFFC18000>;
+ snps,priority = <0x1>;
+ };
+ };
+
mdio0 {
#address-cells = <1>;
#size-cells = <0>;
diff --git a/Documentation/devicetree/bindings/net/socionext,synquacer-netsec.yaml b/Documentation/devicetree/bindings/net/socionext,synquacer-netsec.yaml
new file mode 100644
index 000000000000..a65e6aa215a7
--- /dev/null
+++ b/Documentation/devicetree/bindings/net/socionext,synquacer-netsec.yaml
@@ -0,0 +1,73 @@
+# SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/net/socionext,synquacer-netsec.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: Socionext NetSec Ethernet Controller IP
+
+maintainers:
+ - Jassi Brar <jaswinder.singh@linaro.org>
+ - Ilias Apalodimas <ilias.apalodimas@linaro.org>
+
+allOf:
+ - $ref: ethernet-controller.yaml#
+
+properties:
+ compatible:
+ const: socionext,synquacer-netsec
+
+ reg:
+ items:
+ - description: control register area
+ - description: EEPROM holding the MAC address and microengine firmware
+
+ clocks:
+ maxItems: 1
+
+ clock-names:
+ const: phy_ref_clk
+
+ dma-coherent: true
+
+ interrupts:
+ maxItems: 1
+
+ mdio:
+ $ref: mdio.yaml#
+
+required:
+ - compatible
+ - reg
+ - clocks
+ - clock-names
+ - interrupts
+ - mdio
+
+unevaluatedProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/interrupt-controller/arm-gic.h>
+
+ ethernet@522d0000 {
+ compatible = "socionext,synquacer-netsec";
+ reg = <0x522d0000 0x10000>, <0x10000000 0x10000>;
+ interrupts = <GIC_SPI 176 IRQ_TYPE_LEVEL_HIGH>;
+ clocks = <&clk_netsec>;
+ clock-names = "phy_ref_clk";
+ phy-mode = "rgmii";
+ max-speed = <1000>;
+ max-frame-size = <9000>;
+ phy-handle = <&phy1>;
+
+ mdio {
+ #address-cells = <1>;
+ #size-cells = <0>;
+ phy1: ethernet-phy@1 {
+ compatible = "ethernet-phy-ieee802.3-c22";
+ reg = <1>;
+ };
+ };
+ };
+...
diff --git a/Documentation/devicetree/bindings/net/socionext-netsec.txt b/Documentation/devicetree/bindings/net/socionext-netsec.txt
deleted file mode 100644
index a3c1dffaa4bb..000000000000
--- a/Documentation/devicetree/bindings/net/socionext-netsec.txt
+++ /dev/null
@@ -1,56 +0,0 @@
-* Socionext NetSec Ethernet Controller IP
-
-Required properties:
-- compatible: Should be "socionext,synquacer-netsec"
-- reg: Address and length of the control register area, followed by the
- address and length of the EEPROM holding the MAC address and
- microengine firmware
-- interrupts: Should contain ethernet controller interrupt
-- clocks: phandle to the PHY reference clock
-- clock-names: Should be "phy_ref_clk"
-- phy-mode: See ethernet.txt file in the same directory
-- phy-handle: See ethernet.txt in the same directory.
-
-- mdio device tree subnode: When the Netsec has a phy connected to its local
- mdio, there must be device tree subnode with the following
- required properties:
-
- - #address-cells: Must be <1>.
- - #size-cells: Must be <0>.
-
- For each phy on the mdio bus, there must be a node with the following
- fields:
- - compatible: Refer to phy.txt
- - reg: phy id used to communicate to phy.
-
-Optional properties: (See ethernet.txt file in the same directory)
-- dma-coherent: Boolean property, must only be present if memory
- accesses performed by the device are cache coherent.
-- max-speed: See ethernet.txt in the same directory.
-- max-frame-size: See ethernet.txt in the same directory.
-
-The MAC address will be determined using the optional properties
-defined in ethernet.txt. The 'phy-mode' property is required, but may
-be set to the empty string if the PHY configuration is programmed by
-the firmware or set by hardware straps, and needs to be preserved.
-
-Example:
- eth0: ethernet@522d0000 {
- compatible = "socionext,synquacer-netsec";
- reg = <0 0x522d0000 0x0 0x10000>, <0 0x10000000 0x0 0x10000>;
- interrupts = <GIC_SPI 176 IRQ_TYPE_LEVEL_HIGH>;
- clocks = <&clk_netsec>;
- clock-names = "phy_ref_clk";
- phy-mode = "rgmii";
- max-speed = <1000>;
- max-frame-size = <9000>;
- phy-handle = <&phy1>;
-
- mdio {
- #address-cells = <1>;
- #size-cells = <0>;
- phy1: ethernet-phy@1 {
- compatible = "ethernet-phy-ieee802.3-c22";
- reg = <1>;
- };
- };
diff --git a/Documentation/devicetree/bindings/net/xilinx_axienet.txt b/Documentation/devicetree/bindings/net/xilinx_axienet.txt
index 1aa4c6006cd0..80e505a2fda1 100644
--- a/Documentation/devicetree/bindings/net/xilinx_axienet.txt
+++ b/Documentation/devicetree/bindings/net/xilinx_axienet.txt
@@ -68,6 +68,8 @@ Optional properties:
- mdio : Child node for MDIO bus. Must be defined if PHY access is
required through the core's MDIO interface (i.e. always,
unless the PHY is accessed through a different bus).
+ Non-standard MDIO bus frequency is supported via
+ "clock-frequency", see mdio.yaml.
- pcs-handle: Phandle to the internal PCS/PMA PHY in SGMII or 1000Base-X
modes, where "pcs-handle" should be used to point
diff --git a/Documentation/devicetree/bindings/soc/mediatek/mediatek,mt7986-wo-ccif.yaml b/Documentation/devicetree/bindings/soc/mediatek/mediatek,mt7986-wo-ccif.yaml
new file mode 100644
index 000000000000..8e6ba2ec8a43
--- /dev/null
+++ b/Documentation/devicetree/bindings/soc/mediatek/mediatek,mt7986-wo-ccif.yaml
@@ -0,0 +1,51 @@
+# SPDX-License-Identifier: (GPL-2.0 OR BSD-2-Clause)
+%YAML 1.2
+---
+$id: http://devicetree.org/schemas/soc/mediatek/mediatek,mt7986-wo-ccif.yaml#
+$schema: http://devicetree.org/meta-schemas/core.yaml#
+
+title: MediaTek Wireless Ethernet Dispatch (WED) WO controller interface for MT7986
+
+maintainers:
+ - Lorenzo Bianconi <lorenzo@kernel.org>
+ - Felix Fietkau <nbd@nbd.name>
+
+description:
+ The MediaTek wo-ccif provides a configuration interface for WED WO
+ controller used to perfrom offload rx packet processing (e.g. 802.11
+ aggregation packet reordering or rx header translation) on MT7986 soc.
+
+properties:
+ compatible:
+ items:
+ - enum:
+ - mediatek,mt7986-wo-ccif
+ - const: syscon
+
+ reg:
+ maxItems: 1
+
+ interrupts:
+ maxItems: 1
+
+required:
+ - compatible
+ - reg
+ - interrupts
+
+additionalProperties: false
+
+examples:
+ - |
+ #include <dt-bindings/interrupt-controller/arm-gic.h>
+ #include <dt-bindings/interrupt-controller/irq.h>
+ soc {
+ #address-cells = <2>;
+ #size-cells = <2>;
+
+ syscon@151a5000 {
+ compatible = "mediatek,mt7986-wo-ccif", "syscon";
+ reg = <0 0x151a5000 0 0x1000>;
+ interrupts = <GIC_SPI 205 IRQ_TYPE_LEVEL_HIGH>;
+ };
+ };
diff --git a/Documentation/devicetree/bindings/soc/qcom/qcom,wcnss.yaml b/Documentation/devicetree/bindings/soc/qcom/qcom,wcnss.yaml
index 5320504bb5e0..0e6fd57d658d 100644
--- a/Documentation/devicetree/bindings/soc/qcom/qcom,wcnss.yaml
+++ b/Documentation/devicetree/bindings/soc/qcom/qcom,wcnss.yaml
@@ -42,15 +42,13 @@ properties:
bluetooth:
type: object
additionalProperties: false
+ allOf:
+ - $ref: /schemas/net/bluetooth/bluetooth-controller.yaml#
properties:
compatible:
const: qcom,wcnss-bt
- local-bd-address:
- $ref: /schemas/types.yaml#/definitions/uint8-array
- maxItems: 6
- description:
- See Documentation/devicetree/bindings/net/bluetooth.txt
+ local-bd-address: true
required:
- compatible
diff --git a/Documentation/networking/bonding.rst b/Documentation/networking/bonding.rst
index 96cd7a26f3d9..adc4bf4f3c50 100644
--- a/Documentation/networking/bonding.rst
+++ b/Documentation/networking/bonding.rst
@@ -566,7 +566,8 @@ miimon
link monitoring. A value of 100 is a good starting point.
The use_carrier option, below, affects how the link state is
determined. See the High Availability section for additional
- information. The default value is 0.
+ information. The default value is 100 if arp_interval is not
+ set.
min_links
@@ -956,6 +957,7 @@ xmit_hash_policy
hash = hash XOR source IP XOR destination IP
hash = hash XOR (hash RSHIFT 16)
hash = hash XOR (hash RSHIFT 8)
+ hash = hash RSHIFT 1
And then hash is reduced modulo slave count.
If the protocol is IPv6 then the source and destination
diff --git a/Documentation/networking/can.rst b/Documentation/networking/can.rst
index ebc822e605f5..90121deef217 100644
--- a/Documentation/networking/can.rst
+++ b/Documentation/networking/can.rst
@@ -1148,6 +1148,39 @@ tuning on deep embedded systems'. The author is running a MPC603e
load without any problems ...
+Switchable Termination Resistors
+--------------------------------
+
+CAN bus requires a specific impedance across the differential pair,
+typically provided by two 120Ohm resistors on the farthest nodes of
+the bus. Some CAN controllers support activating / deactivating a
+termination resistor(s) to provide the correct impedance.
+
+Query the available resistances::
+
+ $ ip -details link show can0
+ ...
+ termination 120 [ 0, 120 ]
+
+Activate the terminating resistor::
+
+ $ ip link set dev can0 type can termination 120
+
+Deactivate the terminating resistor::
+
+ $ ip link set dev can0 type can termination 0
+
+To enable termination resistor support to a can-controller, either
+implement in the controller's struct can-priv::
+
+ termination_const
+ termination_const_cnt
+ do_set_termination
+
+or add gpio control with the device tree entries from
+Documentation/devicetree/bindings/net/can/can-controller.yaml
+
+
The Virtual CAN Driver (vcan)
-----------------------------
diff --git a/Documentation/networking/device_drivers/ethernet/freescale/dpaa2/mac-phy-support.rst b/Documentation/networking/device_drivers/ethernet/freescale/dpaa2/mac-phy-support.rst
index 51e6624fb774..1d2f55feca24 100644
--- a/Documentation/networking/device_drivers/ethernet/freescale/dpaa2/mac-phy-support.rst
+++ b/Documentation/networking/device_drivers/ethernet/freescale/dpaa2/mac-phy-support.rst
@@ -181,10 +181,13 @@ when necessary using the below listed API::
- int dpaa2_mac_connect(struct dpaa2_mac *mac);
- void dpaa2_mac_disconnect(struct dpaa2_mac *mac);
-A phylink integration is necessary only when the partner DPMAC is not of TYPE_FIXED.
-One can check for this condition using the below API::
+A phylink integration is necessary only when the partner DPMAC is not of
+``TYPE_FIXED``. This means it is either of ``TYPE_PHY``, or of
+``TYPE_BACKPLANE`` (the difference being the two that in the ``TYPE_BACKPLANE``
+mode, the MC firmware does not access the PCS registers). One can check for
+this condition using the following helper::
- - bool dpaa2_mac_is_type_fixed(struct fsl_mc_device *dpmac_dev,struct fsl_mc_io *mc_io);
+ - static inline bool dpaa2_mac_is_type_phy(struct dpaa2_mac *mac);
Before connection to a MAC, the caller must allocate and populate the
dpaa2_mac structure with the associated net_device, a pointer to the MC portal
diff --git a/Documentation/networking/device_drivers/ethernet/marvell/octeon_ep.rst b/Documentation/networking/device_drivers/ethernet/marvell/octeon_ep.rst
index bc562c49011b..cad96c8d1f97 100644
--- a/Documentation/networking/device_drivers/ethernet/marvell/octeon_ep.rst
+++ b/Documentation/networking/device_drivers/ethernet/marvell/octeon_ep.rst
@@ -23,6 +23,7 @@ Supported Devices
=================
Currently, this driver support following devices:
* Network controller: Cavium, Inc. Device b200
+ * Network controller: Cavium, Inc. Device b400
Interface Control
=================
diff --git a/Documentation/networking/device_drivers/ethernet/mellanox/mlx5.rst b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5.rst
index 5edf50d7dbd5..6969652f593c 100644
--- a/Documentation/networking/device_drivers/ethernet/mellanox/mlx5.rst
+++ b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5.rst
@@ -25,7 +25,7 @@ Enabling the driver and kconfig options
| at build time via kernel Kconfig flags.
| Basic features, ethernet net device rx/tx offloads and XDP, are available with the most basic flags
| CONFIG_MLX5_CORE=y/m and CONFIG_MLX5_CORE_EN=y.
-| For the list of advanced features please see below.
+| For the list of advanced features, please see below.
**CONFIG_MLX5_CORE=(y/m/n)** (module mlx5_core.ko)
@@ -89,11 +89,11 @@ Enabling the driver and kconfig options
**CONFIG_MLX5_EN_IPSEC=(y/n)**
-| Enables `IPSec XFRM cryptography-offload accelaration <http://www.mellanox.com/related-docs/prod_software/Mellanox_Innova_IPsec_Ethernet_Adapter_Card_User_Manual.pdf>`_.
+| Enables `IPSec XFRM cryptography-offload acceleration <http://www.mellanox.com/related-docs/prod_software/Mellanox_Innova_IPsec_Ethernet_Adapter_Card_User_Manual.pdf>`_.
**CONFIG_MLX5_EN_TLS=(y/n)**
-| TLS cryptography-offload accelaration.
+| TLS cryptography-offload acceleration.
**CONFIG_MLX5_INFINIBAND=(y/n/m)** (module mlx5_ib.ko)
@@ -139,14 +139,14 @@ flow_steering_mode: Device flow steering mode
The flow steering mode parameter controls the flow steering mode of the driver.
Two modes are supported:
1. 'dmfs' - Device managed flow steering.
-2. 'smfs - Software/Driver managed flow steering.
+2. 'smfs' - Software/Driver managed flow steering.
In DMFS mode, the HW steering entities are created and managed through the
Firmware.
In SMFS mode, the HW steering entities are created and managed though by
-the driver directly into Hardware without firmware intervention.
+the driver directly into hardware without firmware intervention.
-SMFS mode is faster and provides better rule inserstion rate compared to default DMFS mode.
+SMFS mode is faster and provides better rule insertion rate compared to default DMFS mode.
User command examples:
@@ -165,9 +165,9 @@ User command examples:
enable_roce: RoCE enablement state
----------------------------------
RoCE enablement state controls driver support for RoCE traffic.
-When RoCE is disabled, there is no gid table, only raw ethernet QPs are supported and traffic on the well known UDP RoCE port is handled as raw ethernet traffic.
+When RoCE is disabled, there is no gid table, only raw ethernet QPs are supported and traffic on the well-known UDP RoCE port is handled as raw ethernet traffic.
-To change RoCE enablement state a user must change the driverinit cmode value and run devlink reload.
+To change RoCE enablement state, a user must change the driverinit cmode value and run devlink reload.
User command examples:
@@ -186,7 +186,7 @@ User command examples:
esw_port_metadata: Eswitch port metadata state
----------------------------------------------
-When applicable, disabling Eswitch metadata can increase packet rate
+When applicable, disabling eswitch metadata can increase packet rate
up to 20% depending on the use case and packet sizes.
Eswitch port metadata state controls whether to internally tag packets with
@@ -253,26 +253,26 @@ mlx5 subfunction
================
mlx5 supports subfunction management using devlink port (see :ref:`Documentation/networking/devlink/devlink-port.rst <devlink_port>`) interface.
-A Subfunction has its own function capabilities and its own resources. This
+A subfunction has its own function capabilities and its own resources. This
means a subfunction has its own dedicated queues (txq, rxq, cq, eq). These
queues are neither shared nor stolen from the parent PCI function.
-When a subfunction is RDMA capable, it has its own QP1, GID table and rdma
+When a subfunction is RDMA capable, it has its own QP1, GID table, and RDMA
resources neither shared nor stolen from the parent PCI function.
A subfunction has a dedicated window in PCI BAR space that is not shared
-with ther other subfunctions or the parent PCI function. This ensures that all
-devices (netdev, rdma, vdpa etc.) of the subfunction accesses only assigned
+with the other subfunctions or the parent PCI function. This ensures that all
+devices (netdev, rdma, vdpa, etc.) of the subfunction accesses only assigned
PCI BAR space.
-A Subfunction supports eswitch representation through which it supports tc
+A subfunction supports eswitch representation through which it supports tc
offloads. The user configures eswitch to send/receive packets from/to
the subfunction port.
Subfunctions share PCI level resources such as PCI MSI-X IRQs with
other subfunctions and/or with its parent PCI function.
-Example mlx5 software, system and device view::
+Example mlx5 software, system, and device view::
_______
| admin |
@@ -310,7 +310,7 @@ Example mlx5 software, system and device view::
| (device add/del)
_____|____ ____|________
| | | subfunction |
- | PCI NIC |---- activate/deactive events---->| host driver |
+ | PCI NIC |--- activate/deactivate events--->| host driver |
|__________| | (mlx5_core) |
|_____________|
@@ -320,7 +320,7 @@ Subfunction is created using devlink port interface.
$ devlink dev eswitch set pci/0000:06:00.0 mode switchdev
-- Add a devlink port of subfunction flaovur::
+- Add a devlink port of subfunction flavour::
$ devlink port add pci/0000:06:00.0 flavour pcisf pfnum 0 sfnum 88
pci/0000:06:00.0/32768: type eth netdev eth6 flavour pcisf controller 0 pfnum 0 sfnum 88 external false splittable false
@@ -351,46 +351,30 @@ driver.
MAC address setup
-----------------
-mlx5 driver provides mechanism to setup the MAC address of the PCI VF/SF.
+mlx5 driver support devlink port function attr mechanism to setup MAC
+address. (refer to Documentation/networking/devlink/devlink-port.rst)
-The configured MAC address of the PCI VF/SF will be used by netdevice and rdma
-device created for the PCI VF/SF.
+RoCE capability setup
+---------------------
+Not all mlx5 PCI devices/SFs require RoCE capability.
-- Get the MAC address of the VF identified by its unique devlink port index::
+When RoCE capability is disabled, it saves 1 Mbytes worth of system memory per
+PCI devices/SF.
- $ devlink port show pci/0000:06:00.0/2
- pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
- function:
- hw_addr 00:00:00:00:00:00
-
-- Set the MAC address of the VF identified by its unique devlink port index::
-
- $ devlink port function set pci/0000:06:00.0/2 hw_addr 00:11:22:33:44:55
-
- $ devlink port show pci/0000:06:00.0/2
- pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
- function:
- hw_addr 00:11:22:33:44:55
+mlx5 driver support devlink port function attr mechanism to setup RoCE
+capability. (refer to Documentation/networking/devlink/devlink-port.rst)
-- Get the MAC address of the SF identified by its unique devlink port index::
+migratable capability setup
+---------------------------
+User who wants mlx5 PCI VFs to be able to perform live migration need to
+explicitly enable the VF migratable capability.
- $ devlink port show pci/0000:06:00.0/32768
- pci/0000:06:00.0/32768: type eth netdev enp6s0pf0sf88 flavour pcisf pfnum 0 sfnum 88
- function:
- hw_addr 00:00:00:00:00:00
-
-- Set the MAC address of the VF identified by its unique devlink port index::
-
- $ devlink port function set pci/0000:06:00.0/32768 hw_addr 00:00:00:00:88:88
-
- $ devlink port show pci/0000:06:00.0/32768
- pci/0000:06:00.0/32768: type eth netdev enp6s0pf0sf88 flavour pcivf pfnum 0 sfnum 88
- function:
- hw_addr 00:00:00:00:88:88
+mlx5 driver support devlink port function attr mechanism to setup migratable
+capability. (refer to Documentation/networking/devlink/devlink-port.rst)
SF state setup
--------------
-To use the SF, the user must active the SF using the SF function state
+To use the SF, the user must activate the SF using the SF function state
attribute.
- Get the state of the SF identified by its unique devlink port index::
@@ -447,7 +431,7 @@ for it.
Additionally, the SF port also gets the event when the driver attaches to the
auxiliary device of the subfunction. This results in changing the operational
-state of the function. This provides visiblity to the user to decide when is it
+state of the function. This provides visibility to the user to decide when is it
safe to delete the SF port for graceful termination of the subfunction.
- Show the SF port operational state::
@@ -464,14 +448,14 @@ tx reporter
-----------
The tx reporter is responsible for reporting and recovering of the following two error scenarios:
-- TX timeout
+- tx timeout
Report on kernel tx timeout detection.
Recover by searching lost interrupts.
-- TX error completion
+- tx error completion
Report on error tx completion.
- Recover by flushing the TX queue and reset it.
+ Recover by flushing the tx queue and reset it.
-TX reporter also support on demand diagnose callback, on which it provides
+tx reporter also support on demand diagnose callback, on which it provides
real time information of its send queues status.
User commands examples:
@@ -491,32 +475,32 @@ rx reporter
-----------
The rx reporter is responsible for reporting and recovering of the following two error scenarios:
-- RX queues initialization (population) timeout
- RX queues descriptors population on ring initialization is done in
- napi context via triggering an irq, in case of a failure to get
- the minimum amount of descriptors, a timeout would occur and it
- could be recoverable by polling the EQ (Event Queue).
-- RX completions with errors (reported by HW on interrupt context)
+- rx queues' initialization (population) timeout
+ Population of rx queues' descriptors on ring initialization is done
+ in napi context via triggering an irq. In case of a failure to get
+ the minimum amount of descriptors, a timeout would occur, and
+ descriptors could be recovered by polling the EQ (Event Queue).
+- rx completions with errors (reported by HW on interrupt context)
Report on rx completion error.
Recover (if needed) by flushing the related queue and reset it.
-RX reporter also supports on demand diagnose callback, on which it
-provides real time information of its receive queues status.
+rx reporter also supports on demand diagnose callback, on which it
+provides real time information of its receive queues' status.
-- Diagnose rx queues status, and corresponding completion queue::
+- Diagnose rx queues' status and corresponding completion queue::
$ devlink health diagnose pci/0000:82:00.0 reporter rx
-NOTE: This command has valid output only when interface is up, otherwise the command has empty output.
+NOTE: This command has valid output only when interface is up. Otherwise, the command has empty output.
- Show number of rx errors indicated, number of recover flows ended successfully,
- is autorecover enabled and graceful period from last recover::
+ is autorecover enabled, and graceful period from last recover::
$ devlink health show pci/0000:82:00.0 reporter rx
fw reporter
-----------
-The fw reporter implements diagnose and dump callbacks.
+The fw reporter implements `diagnose` and `dump` callbacks.
It follows symptoms of fw error such as fw syndrome by triggering
fw core dump and storing it into the dump buffer.
The fw reporter diagnose command can be triggered any time by the user to check
@@ -537,7 +521,7 @@ running it on other PF or any VF will return "Operation not permitted".
fw fatal reporter
-----------------
-The fw fatal reporter implements dump and recover callbacks.
+The fw fatal reporter implements `dump` and `recover` callbacks.
It follows fatal errors indications by CR-space dump and recover flow.
The CR-space dump uses vsc interface which is valid even if the FW command
interface is not functional, which is the case in most FW fatal errors.
@@ -552,7 +536,7 @@ User commands examples:
$ devlink health recover pci/0000:82:00.0 reporter fw_fatal
-- Read FW CR-space dump if already strored or trigger new one::
+- Read FW CR-space dump if already stored or trigger new one::
$ devlink health dump show pci/0000:82:00.1 reporter fw_fatal
@@ -561,10 +545,10 @@ NOTE: This command can run only on PF.
mlx5 tracepoints
================
-mlx5 driver provides internal trace points for tracking and debugging using
+mlx5 driver provides internal tracepoints for tracking and debugging using
kernel tracepoints interfaces (refer to Documentation/trace/ftrace.rst).
-For the list of support mlx5 events check /sys/kernel/debug/tracing/events/mlx5/
+For the list of support mlx5 events, check `/sys/kernel/debug/tracing/events/mlx5/`.
tc and eswitch offloads tracepoints:
diff --git a/Documentation/networking/device_drivers/ethernet/netronome/nfp.rst b/Documentation/networking/device_drivers/ethernet/netronome/nfp.rst
index ada611fb427c..650b57742d51 100644
--- a/Documentation/networking/device_drivers/ethernet/netronome/nfp.rst
+++ b/Documentation/networking/device_drivers/ethernet/netronome/nfp.rst
@@ -1,50 +1,57 @@
.. SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause)
+.. include:: <isonum.txt>
-=============================================
-Netronome Flow Processor (NFP) Kernel Drivers
-=============================================
+===========================================
+Network Flow Processor (NFP) Kernel Drivers
+===========================================
-Copyright (c) 2019, Netronome Systems, Inc.
+:Copyright: |copy| 2019, Netronome Systems, Inc.
+:Copyright: |copy| 2022, Corigine, Inc.
Contents
========
- `Overview`_
- `Acquiring Firmware`_
+- `Devlink Info`_
+- `Configure Device`_
+- `Statistics`_
Overview
========
-This driver supports Netronome's line of Flow Processor devices,
-including the NFP4000, NFP5000, and NFP6000 models, which are also
-incorporated in the company's family of Agilio SmartNICs. The SR-IOV
-physical and virtual functions for these devices are supported by
-the driver.
+This driver supports Netronome and Corigine's line of Network Flow Processor
+devices, including the NFP3800, NFP4000, NFP5000, and NFP6000 models, which
+are also incorporated in the companies' family of Agilio SmartNICs. The SR-IOV
+physical and virtual functions for these devices are supported by the driver.
Acquiring Firmware
==================
-The NFP4000 and NFP6000 devices require application specific firmware
-to function. Application firmware can be located either on the host file system
+The NFP3800, NFP4000 and NFP6000 devices require application specific firmware
+to function. Application firmware can be located either on the host file system
or in the device flash (if supported by management firmware).
Firmware files on the host filesystem contain card type (`AMDA-*` string), media
-config etc. They should be placed in `/lib/firmware/netronome` directory to
+config etc. They should be placed in `/lib/firmware/netronome` directory to
load firmware from the host file system.
Firmware for basic NIC operation is available in the upstream
`linux-firmware.git` repository.
+A more comprehensive list of firmware can be downloaded from the
+`Corigine Support site <https://www.corigine.com/DPUDownload.html>`_.
+
Firmware in NVRAM
-----------------
Recent versions of management firmware supports loading application
-firmware from flash when the host driver gets probed. The firmware loading
+firmware from flash when the host driver gets probed. The firmware loading
policy configuration may be used to configure this feature appropriately.
Devlink or ethtool can be used to update the application firmware on the device
flash by providing the appropriate `nic_AMDA*.nffw` file to the respective
-command. Users need to take care to write the correct firmware image for the
+command. Users need to take care to write the correct firmware image for the
card and media configuration to flash.
Available storage space in flash depends on the card being used.
@@ -79,9 +86,9 @@ You may need to use hard instead of symbolic links on distributions
which use old `mkinitrd` command instead of `dracut` (e.g. Ubuntu).
After changing firmware files you may need to regenerate the initramfs
-image. Initramfs contains drivers and firmware files your system may
-need to boot. Refer to the documentation of your distribution to find
-out how to update initramfs. Good indication of stale initramfs
+image. Initramfs contains drivers and firmware files your system may
+need to boot. Refer to the documentation of your distribution to find
+out how to update initramfs. Good indication of stale initramfs
is system loading wrong driver or firmware on boot, but when driver is
later reloaded manually everything works correctly.
@@ -89,9 +96,9 @@ Selecting firmware per device
-----------------------------
Most commonly all cards on the system use the same type of firmware.
-If you want to load specific firmware image for a specific card, you
-can use either the PCI bus address or serial number. Driver will print
-which files it's looking for when it recognizes a NFP device::
+If you want to load a specific firmware image for a specific card, you
+can use either the PCI bus address or serial number. The driver will
+print which files it's looking for when it recognizes a NFP device::
nfp: Looking for firmware file in order of priority:
nfp: netronome/serial-00-12-34-aa-bb-cc-10-ff.nffw: not found
@@ -106,6 +113,15 @@ Note that `serial-*` and `pci-*` files are **not** automatically included
in initramfs, you will have to refer to documentation of appropriate tools
to find out how to include them.
+Running firmware version
+------------------------
+
+The version of the loaded firmware for a particular <netdev> interface,
+(e.g. enp4s0), or an interface's port <netdev port> (e.g. enp4s0np0) can
+be displayed with the ethtool command::
+
+ $ ethtool -i <netdev>
+
Firmware loading policy
-----------------------
@@ -132,6 +148,115 @@ abi_drv_load_ifc
Defines a list of PF devices allowed to load FW on the device.
This variable is not currently user configurable.
+Devlink Info
+============
+
+The devlink info command displays the running and stored firmware versions
+on the device, serial number and board information.
+
+Devlink info command example (replace PCI address)::
+
+ $ devlink dev info pci/0000:03:00.0
+ pci/0000:03:00.0:
+ driver nfp
+ serial_number CSAAMDA2001-1003000111
+ versions:
+ fixed:
+ board.id AMDA2001-1003
+ board.rev 01
+ board.manufacture CSA
+ board.model mozart
+ running:
+ fw.mgmt 22.10.0-rc3
+ fw.cpld 0x1000003
+ fw.app nic-22.09.0
+ chip.init AMDA-2001-1003 1003000111
+ stored:
+ fw.bundle_id bspbundle_1003000111
+ fw.mgmt 22.10.0-rc3
+ fw.cpld 0x0
+ chip.init AMDA-2001-1003 1003000111
+
+Configure Device
+================
+
+This section explains how to use Agilio SmartNICs running basic NIC firmware.
+
+Configure interface link-speed
+------------------------------
+The following steps explains how to change between 10G mode and 25G mode on
+Agilio CX 2x25GbE cards. The changing of port speed must be done in order,
+port 0 (p0) must be set to 10G before port 1 (p1) may be set to 10G.
+
+Down the respective interface(s)::
+
+ $ ip link set dev <netdev port 0> down
+ $ ip link set dev <netdev port 1> down
+
+Set interface link-speed to 10G::
+
+ $ ethtool -s <netdev port 0> speed 10000
+ $ ethtool -s <netdev port 1> speed 10000
+
+Set interface link-speed to 25G::
+
+ $ ethtool -s <netdev port 0> speed 25000
+ $ ethtool -s <netdev port 1> speed 25000
+
+Reload driver for changes to take effect::
+
+ $ rmmod nfp; modprobe nfp
+
+Configure interface Maximum Transmission Unit (MTU)
+---------------------------------------------------
+
+The MTU of interfaces can temporarily be set using the iproute2, ip link or
+ifconfig tools. Note that this change will not persist. Setting this via
+Network Manager, or another appropriate OS configuration tool, is
+recommended as changes to the MTU using Network Manager can be made to
+persist.
+
+Set interface MTU to 9000 bytes::
+
+ $ ip link set dev <netdev port> mtu 9000
+
+It is the responsibility of the user or the orchestration layer to set
+appropriate MTU values when handling jumbo frames or utilizing tunnels. For
+example, if packets sent from a VM are to be encapsulated on the card and
+egress a physical port, then the MTU of the VF should be set to lower than
+that of the physical port to account for the extra bytes added by the
+additional header. If a setup is expected to see fallback traffic between
+the SmartNIC and the kernel then the user should also ensure that the PF MTU
+is appropriately set to avoid unexpected drops on this path.
+
+Configure Forward Error Correction (FEC) modes
+----------------------------------------------
+
+Agilio SmartNICs support FEC mode configuration, e.g. Auto, Firecode Base-R,
+ReedSolomon and Off modes. Each physical port's FEC mode can be set
+independently using ethtool. The supported FEC modes for an interface can
+be viewed using::
+
+ $ ethtool <netdev>
+
+The currently configured FEC mode can be viewed using::
+
+ $ ethtool --show-fec <netdev>
+
+To force the FEC mode for a particular port, auto-negotiation must be disabled
+(see the `Auto-negotiation`_ section). An example of how to set the FEC mode
+to Reed-Solomon is::
+
+ $ ethtool --set-fec <netdev> encoding rs
+
+Auto-negotiation
+----------------
+
+To change auto-negotiation settings, the link must first be put down. After the
+link is down, auto-negotiation can be enabled or disabled using::
+
+ ethtool -s <netdev> autoneg <on|off>
+
Statistics
==========
diff --git a/Documentation/networking/devlink/devlink-info.rst b/Documentation/networking/devlink/devlink-info.rst
index 7572bf6de5c1..1242b0e6826b 100644
--- a/Documentation/networking/devlink/devlink-info.rst
+++ b/Documentation/networking/devlink/devlink-info.rst
@@ -198,6 +198,11 @@ fw.bundle_id
Unique identifier of the entire firmware bundle.
+fw.bootloader
+-------------
+
+Version of the bootloader.
+
Future work
===========
diff --git a/Documentation/networking/devlink/devlink-port.rst b/Documentation/networking/devlink/devlink-port.rst
index 7627b1da01f2..3da590953ce8 100644
--- a/Documentation/networking/devlink/devlink-port.rst
+++ b/Documentation/networking/devlink/devlink-port.rst
@@ -110,7 +110,7 @@ devlink ports for both the controllers.
Function configuration
======================
-A user can configure the function attribute before enumerating the PCI
+Users can configure one or more function attributes before enumerating the PCI
function. Usually it means, user should configure function attribute
before a bus specific device for the function is created. However, when
SRIOV is enabled, virtual function devices are created on the PCI bus.
@@ -119,9 +119,127 @@ function device to the driver. For subfunctions, this means user should
configure port function attribute before activating the port function.
A user may set the hardware address of the function using
-'devlink port function set hw_addr' command. For Ethernet port function
+`devlink port function set hw_addr` command. For Ethernet port function
this means a MAC address.
+Users may also set the RoCE capability of the function using
+`devlink port function set roce` command.
+
+Users may also set the function as migratable using
+'devlink port function set migratable' command.
+
+Function attributes
+===================
+
+MAC address setup
+-----------------
+The configured MAC address of the PCI VF/SF will be used by netdevice and rdma
+device created for the PCI VF/SF.
+
+- Get the MAC address of the VF identified by its unique devlink port index::
+
+ $ devlink port show pci/0000:06:00.0/2
+ pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
+ function:
+ hw_addr 00:00:00:00:00:00
+
+- Set the MAC address of the VF identified by its unique devlink port index::
+
+ $ devlink port function set pci/0000:06:00.0/2 hw_addr 00:11:22:33:44:55
+
+ $ devlink port show pci/0000:06:00.0/2
+ pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
+ function:
+ hw_addr 00:11:22:33:44:55
+
+- Get the MAC address of the SF identified by its unique devlink port index::
+
+ $ devlink port show pci/0000:06:00.0/32768
+ pci/0000:06:00.0/32768: type eth netdev enp6s0pf0sf88 flavour pcisf pfnum 0 sfnum 88
+ function:
+ hw_addr 00:00:00:00:00:00
+
+- Set the MAC address of the SF identified by its unique devlink port index::
+
+ $ devlink port function set pci/0000:06:00.0/32768 hw_addr 00:00:00:00:88:88
+
+ $ devlink port show pci/0000:06:00.0/32768
+ pci/0000:06:00.0/32768: type eth netdev enp6s0pf0sf88 flavour pcisf pfnum 0 sfnum 88
+ function:
+ hw_addr 00:00:00:00:88:88
+
+RoCE capability setup
+---------------------
+Not all PCI VFs/SFs require RoCE capability.
+
+When RoCE capability is disabled, it saves system memory per PCI VF/SF.
+
+When user disables RoCE capability for a VF/SF, user application cannot send or
+receive any RoCE packets through this VF/SF and RoCE GID table for this PCI
+will be empty.
+
+When RoCE capability is disabled in the device using port function attribute,
+VF/SF driver cannot override it.
+
+- Get RoCE capability of the VF device::
+
+ $ devlink port show pci/0000:06:00.0/2
+ pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
+ function:
+ hw_addr 00:00:00:00:00:00 roce enable
+
+- Set RoCE capability of the VF device::
+
+ $ devlink port function set pci/0000:06:00.0/2 roce disable
+
+ $ devlink port show pci/0000:06:00.0/2
+ pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
+ function:
+ hw_addr 00:00:00:00:00:00 roce disable
+
+migratable capability setup
+---------------------------
+Live migration is the process of transferring a live virtual machine
+from one physical host to another without disrupting its normal
+operation.
+
+User who want PCI VFs to be able to perform live migration need to
+explicitly enable the VF migratable capability.
+
+When user enables migratable capability for a VF, and the HV binds the VF to VFIO driver
+with migration support, the user can migrate the VM with this VF from one HV to a
+different one.
+
+However, when migratable capability is enable, device will disable features which cannot
+be migrated. Thus migratable cap can impose limitations on a VF so let the user decide.
+
+Example of LM with migratable function configuration:
+- Get migratable capability of the VF device::
+
+ $ devlink port show pci/0000:06:00.0/2
+ pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
+ function:
+ hw_addr 00:00:00:00:00:00 migratable disable
+
+- Set migratable capability of the VF device::
+
+ $ devlink port function set pci/0000:06:00.0/2 migratable enable
+
+ $ devlink port show pci/0000:06:00.0/2
+ pci/0000:06:00.0/2: type eth netdev enp6s0pf0vf1 flavour pcivf pfnum 0 vfnum 1
+ function:
+ hw_addr 00:00:00:00:00:00 migratable enable
+
+- Bind VF to VFIO driver with migration support::
+
+ $ echo <pci_id> > /sys/bus/pci/devices/0000:08:00.0/driver/unbind
+ $ echo mlx5_vfio_pci > /sys/bus/pci/devices/0000:08:00.0/driver_override
+ $ echo <pci_id> > /sys/bus/pci/devices/0000:08:00.0/driver/bind
+
+Attach VF to the VM.
+Start the VM.
+Perform live migration.
+
Subfunction
============
@@ -130,10 +248,11 @@ it is deployed. Subfunction is created and deployed in unit of 1. Unlike
SRIOV VFs, a subfunction doesn't require its own PCI virtual function.
A subfunction communicates with the hardware through the parent PCI function.
-To use a subfunction, 3 steps setup sequence is followed.
-(1) create - create a subfunction;
-(2) configure - configure subfunction attributes;
-(3) deploy - deploy the subfunction;
+To use a subfunction, 3 steps setup sequence is followed:
+
+1) create - create a subfunction;
+2) configure - configure subfunction attributes;
+3) deploy - deploy the subfunction;
Subfunction management is done using devlink port user interface.
User performs setup on the subfunction management device.
@@ -191,13 +310,48 @@ API allows to configure following rate object's parameters:
``tx_max``
Maximum TX rate value.
+``tx_priority``
+ Allows for usage of strict priority arbiter among siblings. This
+ arbitration scheme attempts to schedule nodes based on their priority
+ as long as the nodes remain within their bandwidth limit. The higher the
+ priority the higher the probability that the node will get selected for
+ scheduling.
+
+``tx_weight``
+ Allows for usage of Weighted Fair Queuing arbitration scheme among
+ siblings. This arbitration scheme can be used simultaneously with the
+ strict priority. As a node is configured with a higher rate it gets more
+ BW relative to it's siblings. Values are relative like a percentage
+ points, they basically tell how much BW should node take relative to
+ it's siblings.
+
``parent``
Parent node name. Parent node rate limits are considered as additional limits
to all node children limits. ``tx_max`` is an upper limit for children.
``tx_share`` is a total bandwidth distributed among children.
+``tx_priority`` and ``tx_weight`` can be used simultaneously. In that case
+nodes with the same priority form a WFQ subgroup in the sibling group
+and arbitration among them is based on assigned weights.
+
+Arbitration flow from the high level:
+
+#. Choose a node, or group of nodes with the highest priority that stays
+ within the BW limit and are not blocked. Use ``tx_priority`` as a
+ parameter for this arbitration.
+
+#. If group of nodes have the same priority perform WFQ arbitration on
+ that subgroup. Use ``tx_weight`` as a parameter for this arbitration.
+
+#. Select the winner node, and continue arbitration flow among it's children,
+ until leaf node is reached, and the winner is established.
+
+#. If all the nodes from the highest priority sub-group are satisfied, or
+ overused their assigned BW, move to the lower priority nodes.
+
Driver implementations are allowed to support both or either rate object types
-and setting methods of their parameters.
+and setting methods of their parameters. Additionally driver implementation
+may export nodes/leafs and their child-parent relationships.
Terms and Definitions
=====================
diff --git a/Documentation/networking/devlink/devlink-region.rst b/Documentation/networking/devlink/devlink-region.rst
index f06dca9a1eb6..9232cd7da301 100644
--- a/Documentation/networking/devlink/devlink-region.rst
+++ b/Documentation/networking/devlink/devlink-region.rst
@@ -31,6 +31,15 @@ in its ``devlink_region_ops`` structure. If snapshot id is not set in
the ``DEVLINK_CMD_REGION_NEW`` request kernel will allocate one and send
the snapshot information to user space.
+Regions may optionally allow directly reading from their contents without a
+snapshot. Direct read requests are not atomic. In particular a read request
+of size 256 bytes or larger will be split into multiple chunks. If atomic
+access is required, use a snapshot. A driver wishing to enable this for a
+region should implement the ``.read`` callback in the ``devlink_region_ops``
+structure. User space can request a direct read by using the
+``DEVLINK_ATTR_REGION_DIRECT`` attribute instead of specifying a snapshot
+id.
+
example usage
-------------
@@ -65,6 +74,10 @@ example usage
$ devlink region read pci/0000:00:05.0/fw-health snapshot 1 address 0 length 16
0000000000000000 0014 95dc 0014 9514 0035 1670 0034 db30
+ # Read from the region without a snapshot
+ $ devlink region read pci/0000:00:05.0/fw-health address 16 length 16
+ 0000000000000010 0000 0000 ffff ff04 0029 8c00 0028 8cc8
+
As regions are likely very device or driver specific, no generic regions are
defined. See the driver-specific documentation files for information on the
specific regions a driver supports.
diff --git a/Documentation/networking/devlink/devlink-trap.rst b/Documentation/networking/devlink/devlink-trap.rst
index 90d1381b88de..2c14dfe69b3a 100644
--- a/Documentation/networking/devlink/devlink-trap.rst
+++ b/Documentation/networking/devlink/devlink-trap.rst
@@ -485,6 +485,16 @@ be added to the following table:
- Traps incoming packets that the device decided to drop because
the destination MAC is not configured in the MAC table and
the interface is not in promiscuous mode
+ * - ``eapol``
+ - ``control``
+ - Traps "Extensible Authentication Protocol over LAN" (EAPOL) packets
+ specified in IEEE 802.1X
+ * - ``locked_port``
+ - ``drop``
+ - Traps packets that the device decided to drop because they failed the
+ locked bridge port check. That is, packets that were received via a
+ locked port and whose {SMAC, VID} does not correspond to an FDB entry
+ pointing to the port
Driver-specific Packet Traps
============================
@@ -589,6 +599,9 @@ narrow. The description of these groups must be added to the following table:
* - ``parser_error_drops``
- Contains packet traps for packets that were marked by the device during
parsing as erroneous
+ * - ``eapol``
+ - Contains packet traps for "Extensible Authentication Protocol over LAN"
+ (EAPOL) packets specified in IEEE 802.1X
Packet Trap Policers
====================
diff --git a/Documentation/networking/devlink/etas_es58x.rst b/Documentation/networking/devlink/etas_es58x.rst
new file mode 100644
index 000000000000..3b857d82a44c
--- /dev/null
+++ b/Documentation/networking/devlink/etas_es58x.rst
@@ -0,0 +1,36 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==========================
+etas_es58x devlink support
+==========================
+
+This document describes the devlink features implemented by the
+``etas_es58x`` device driver.
+
+Info versions
+=============
+
+The ``etas_es58x`` driver reports the following versions
+
+.. list-table:: devlink info versions implemented
+ :widths: 5 5 90
+
+ * - Name
+ - Type
+ - Description
+ * - ``fw``
+ - running
+ - Version of the firmware running on the device. Also available
+ through ``ethtool -i`` as the first member of the
+ ``firmware-version``.
+ * - ``fw.bootloader``
+ - running
+ - Version of the bootloader running on the device. Also available
+ through ``ethtool -i`` as the second member of the
+ ``firmware-version``.
+ * - ``board.rev``
+ - fixed
+ - The hardware revision of the device.
+ * - ``serial_number``
+ - fixed
+ - The USB serial number. Also available through ``lsusb -v``.
diff --git a/Documentation/networking/devlink/ice.rst b/Documentation/networking/devlink/ice.rst
index 0c89ceb8986d..625efb3777d5 100644
--- a/Documentation/networking/devlink/ice.rst
+++ b/Documentation/networking/devlink/ice.rst
@@ -189,12 +189,21 @@ device data.
* - ``nvm-flash``
- The contents of the entire flash chip, sometimes referred to as
the device's Non Volatile Memory.
+ * - ``shadow-ram``
+ - The contents of the Shadow RAM, which is loaded from the beginning
+ of the flash. Although the contents are primarily from the flash,
+ this area also contains data generated during device boot which is
+ not stored in flash.
* - ``device-caps``
- The contents of the device firmware's capabilities buffer. Useful to
determine the current state and configuration of the device.
-Users can request an immediate capture of a snapshot via the
-``DEVLINK_CMD_REGION_NEW``
+Both the ``nvm-flash`` and ``shadow-ram`` regions can be accessed without a
+snapshot. The ``device-caps`` region requires a snapshot as the contents are
+sent by firmware and can't be split into separate reads.
+
+Users can request an immediate capture of a snapshot for all three regions
+via the ``DEVLINK_CMD_REGION_NEW`` command.
.. code:: shell
@@ -254,3 +263,118 @@ Users can request an immediate capture of a snapshot via the
0000000000000210 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
$ devlink region delete pci/0000:01:00.0/device-caps snapshot 1
+
+Devlink Rate
+============
+
+The ``ice`` driver implements devlink-rate API. It allows for offload of
+the Hierarchical QoS to the hardware. It enables user to group Virtual
+Functions in a tree structure and assign supported parameters: tx_share,
+tx_max, tx_priority and tx_weight to each node in a tree. So effectively
+user gains an ability to control how much bandwidth is allocated for each
+VF group. This is later enforced by the HW.
+
+It is assumed that this feature is mutually exclusive with DCB performed
+in FW and ADQ, or any driver feature that would trigger changes in QoS,
+for example creation of the new traffic class. The driver will prevent DCB
+or ADQ configuration if user started making any changes to the nodes using
+devlink-rate API. To configure those features a driver reload is necessary.
+Correspondingly if ADQ or DCB will get configured the driver won't export
+hierarchy at all, or will remove the untouched hierarchy if those
+features are enabled after the hierarchy is exported, but before any
+changes are made.
+
+This feature is also dependent on switchdev being enabled in the system.
+It's required bacause devlink-rate requires devlink-port objects to be
+present, and those objects are only created in switchdev mode.
+
+If the driver is set to the switchdev mode, it will export internal
+hierarchy the moment VF's are created. Root of the tree is always
+represented by the node_0. This node can't be deleted by the user. Leaf
+nodes and nodes with children also can't be deleted.
+
+.. list-table:: Attributes supported
+ :widths: 15 85
+
+ * - Name
+ - Description
+ * - ``tx_max``
+ - maximum bandwidth to be consumed by the tree Node. Rate Limit is
+ an absolute number specifying a maximum amount of bytes a Node may
+ consume during the course of one second. Rate limit guarantees
+ that a link will not oversaturate the receiver on the remote end
+ and also enforces an SLA between the subscriber and network
+ provider.
+ * - ``tx_share``
+ - minimum bandwidth allocated to a tree node when it is not blocked.
+ It specifies an absolute BW. While tx_max defines the maximum
+ bandwidth the node may consume, the tx_share marks committed BW
+ for the Node.
+ * - ``tx_priority``
+ - allows for usage of strict priority arbiter among siblings. This
+ arbitration scheme attempts to schedule nodes based on their
+ priority as long as the nodes remain within their bandwidth limit.
+ Range 0-7. Nodes with priority 7 have the highest priority and are
+ selected first, while nodes with priority 0 have the lowest
+ priority. Nodes that have the same priority are treated equally.
+ * - ``tx_weight``
+ - allows for usage of Weighted Fair Queuing arbitration scheme among
+ siblings. This arbitration scheme can be used simultaneously with
+ the strict priority. Range 1-200. Only relative values mater for
+ arbitration.
+
+``tx_priority`` and ``tx_weight`` can be used simultaneously. In that case
+nodes with the same priority form a WFQ subgroup in the sibling group
+and arbitration among them is based on assigned weights.
+
+.. code:: shell
+
+ # enable switchdev
+ $ devlink dev eswitch set pci/0000:4b:00.0 mode switchdev
+
+ # at this point driver should export internal hierarchy
+ $ echo 2 > /sys/class/net/ens785np0/device/sriov_numvfs
+
+ $ devlink port function rate show
+ pci/0000:4b:00.0/node_25: type node parent node_24
+ pci/0000:4b:00.0/node_24: type node parent node_0
+ pci/0000:4b:00.0/node_32: type node parent node_31
+ pci/0000:4b:00.0/node_31: type node parent node_30
+ pci/0000:4b:00.0/node_30: type node parent node_16
+ pci/0000:4b:00.0/node_19: type node parent node_18
+ pci/0000:4b:00.0/node_18: type node parent node_17
+ pci/0000:4b:00.0/node_17: type node parent node_16
+ pci/0000:4b:00.0/node_14: type node parent node_5
+ pci/0000:4b:00.0/node_5: type node parent node_3
+ pci/0000:4b:00.0/node_13: type node parent node_4
+ pci/0000:4b:00.0/node_12: type node parent node_4
+ pci/0000:4b:00.0/node_11: type node parent node_4
+ pci/0000:4b:00.0/node_10: type node parent node_4
+ pci/0000:4b:00.0/node_9: type node parent node_4
+ pci/0000:4b:00.0/node_8: type node parent node_4
+ pci/0000:4b:00.0/node_7: type node parent node_4
+ pci/0000:4b:00.0/node_6: type node parent node_4
+ pci/0000:4b:00.0/node_4: type node parent node_3
+ pci/0000:4b:00.0/node_3: type node parent node_16
+ pci/0000:4b:00.0/node_16: type node parent node_15
+ pci/0000:4b:00.0/node_15: type node parent node_0
+ pci/0000:4b:00.0/node_2: type node parent node_1
+ pci/0000:4b:00.0/node_1: type node parent node_0
+ pci/0000:4b:00.0/node_0: type node
+ pci/0000:4b:00.0/1: type leaf parent node_25
+ pci/0000:4b:00.0/2: type leaf parent node_25
+
+ # let's create some custom node
+ $ devlink port function rate add pci/0000:4b:00.0/node_custom parent node_0
+
+ # second custom node
+ $ devlink port function rate add pci/0000:4b:00.0/node_custom_1 parent node_custom
+
+ # reassign second VF to newly created branch
+ $ devlink port function rate set pci/0000:4b:00.0/2 parent node_custom_1
+
+ # assign tx_weight to the VF
+ $ devlink port function rate set pci/0000:4b:00.0/2 tx_weight 5
+
+ # assign tx_share to the VF
+ $ devlink port function rate set pci/0000:4b:00.0/2 tx_share 500Mbps
diff --git a/Documentation/networking/ethtool-netlink.rst b/Documentation/networking/ethtool-netlink.rst
index d578b8bcd8a4..f10f8eb44255 100644
--- a/Documentation/networking/ethtool-netlink.rst
+++ b/Documentation/networking/ethtool-netlink.rst
@@ -222,6 +222,7 @@ Userspace to kernel:
``ETHTOOL_MSG_MODULE_GET`` get transceiver module parameters
``ETHTOOL_MSG_PSE_SET`` set PSE parameters
``ETHTOOL_MSG_PSE_GET`` get PSE parameters
+ ``ETHTOOL_MSG_RSS_GET`` get RSS settings
===================================== =================================
Kernel to userspace:
@@ -263,6 +264,7 @@ Kernel to userspace:
``ETHTOOL_MSG_PHC_VCLOCKS_GET_REPLY`` PHC virtual clocks info
``ETHTOOL_MSG_MODULE_GET_REPLY`` transceiver module parameters
``ETHTOOL_MSG_PSE_GET_REPLY`` PSE parameters
+ ``ETHTOOL_MSG_RSS_GET_REPLY`` RSS settings
======================================== =================================
``GET`` requests are sent by userspace applications to retrieve device
@@ -491,6 +493,7 @@ Kernel response contents:
``ETHTOOL_A_LINKSTATE_SQI_MAX`` u32 Max support SQI value
``ETHTOOL_A_LINKSTATE_EXT_STATE`` u8 link extended state
``ETHTOOL_A_LINKSTATE_EXT_SUBSTATE`` u8 link extended substate
+ ``ETHTOOL_A_LINKSTATE_EXT_DOWN_CNT`` u32 count of link down events
==================================== ====== ============================
For most NIC drivers, the value of ``ETHTOOL_A_LINKSTATE_LINK`` returns
@@ -1686,6 +1689,33 @@ to control PoDL PSE Admin functions. This option is implementing
``IEEE 802.3-2018`` 30.15.1.2.1 acPoDLPSEAdminControl. See
``ETHTOOL_A_PODL_PSE_ADMIN_STATE`` for supported values.
+RSS_GET
+=======
+
+Get indirection table, hash key and hash function info associated with a
+RSS context of an interface similar to ``ETHTOOL_GRSSH`` ioctl request.
+
+Request contents:
+
+===================================== ====== ==========================
+ ``ETHTOOL_A_RSS_HEADER`` nested request header
+ ``ETHTOOL_A_RSS_CONTEXT`` u32 context number
+===================================== ====== ==========================
+
+Kernel response contents:
+
+===================================== ====== ==========================
+ ``ETHTOOL_A_RSS_HEADER`` nested reply header
+ ``ETHTOOL_A_RSS_HFUNC`` u32 RSS hash func
+ ``ETHTOOL_A_RSS_INDIR`` binary Indir table bytes
+ ``ETHTOOL_A_RSS_HKEY`` binary Hash key bytes
+===================================== ====== ==========================
+
+ETHTOOL_A_RSS_HFUNC attribute is bitmap indicating the hash function
+being used. Current supported options are toeplitz, xor or crc32.
+ETHTOOL_A_RSS_INDIR attribute returns RSS indrection table where each byte
+indicates queue number.
+
Request translation
===================
@@ -1767,7 +1797,7 @@ are netlink only.
``ETHTOOL_GMODULEEEPROM`` ``ETHTOOL_MSG_MODULE_EEPROM_GET``
``ETHTOOL_GEEE`` ``ETHTOOL_MSG_EEE_GET``
``ETHTOOL_SEEE`` ``ETHTOOL_MSG_EEE_SET``
- ``ETHTOOL_GRSSH`` n/a
+ ``ETHTOOL_GRSSH`` ``ETHTOOL_MSG_RSS_GET``
``ETHTOOL_SRSSH`` n/a
``ETHTOOL_GTUNABLE`` n/a
``ETHTOOL_STUNABLE`` n/a
diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst
index 16a153bcc5fe..4f2d1f682a18 100644
--- a/Documentation/networking/index.rst
+++ b/Documentation/networking/index.rst
@@ -104,6 +104,7 @@ Contents:
switchdev
sysfs-tagging
tc-actions-env-rules
+ tc-queue-filters
tcp-thin
team
timestamping
diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst
index e7b3fa7bb3f7..7fbd060d6047 100644
--- a/Documentation/networking/ip-sysctl.rst
+++ b/Documentation/networking/ip-sysctl.rst
@@ -1069,6 +1069,81 @@ tcp_child_ehash_entries - INTEGER
Default: 0
+tcp_plb_enabled - BOOLEAN
+ If set and the underlying congestion control (e.g. DCTCP) supports
+ and enables PLB feature, TCP PLB (Protective Load Balancing) is
+ enabled. PLB is described in the following paper:
+ https://doi.org/10.1145/3544216.3544226. Based on PLB parameters,
+ upon sensing sustained congestion, TCP triggers a change in
+ flow label field for outgoing IPv6 packets. A change in flow label
+ field potentially changes the path of outgoing packets for switches
+ that use ECMP/WCMP for routing.
+
+ PLB changes socket txhash which results in a change in IPv6 Flow Label
+ field, and currently no-op for IPv4 headers. It is possible
+ to apply PLB for IPv4 with other network header fields (e.g. TCP
+ or IPv4 options) or using encapsulation where outer header is used
+ by switches to determine next hop. In either case, further host
+ and switch side changes will be needed.
+
+ When set, PLB assumes that congestion signal (e.g. ECN) is made
+ available and used by congestion control module to estimate a
+ congestion measure (e.g. ce_ratio). PLB needs a congestion measure to
+ make repathing decisions.
+
+ Default: FALSE
+
+tcp_plb_idle_rehash_rounds - INTEGER
+ Number of consecutive congested rounds (RTT) seen after which
+ a rehash can be performed, given there are no packets in flight.
+ This is referred to as M in PLB paper:
+ https://doi.org/10.1145/3544216.3544226.
+
+ Possible Values: 0 - 31
+
+ Default: 3
+
+tcp_plb_rehash_rounds - INTEGER
+ Number of consecutive congested rounds (RTT) seen after which
+ a forced rehash can be performed. Be careful when setting this
+ parameter, as a small value increases the risk of retransmissions.
+ This is referred to as N in PLB paper:
+ https://doi.org/10.1145/3544216.3544226.
+
+ Possible Values: 0 - 31
+
+ Default: 12
+
+tcp_plb_suspend_rto_sec - INTEGER
+ Time, in seconds, to suspend PLB in event of an RTO. In order to avoid
+ having PLB repath onto a connectivity "black hole", after an RTO a TCP
+ connection suspends PLB repathing for a random duration between 1x and
+ 2x of this parameter. Randomness is added to avoid concurrent rehashing
+ of multiple TCP connections. This should be set corresponding to the
+ amount of time it takes to repair a failed link.
+
+ Possible Values: 0 - 255
+
+ Default: 60
+
+tcp_plb_cong_thresh - INTEGER
+ Fraction of packets marked with congestion over a round (RTT) to
+ tag that round as congested. This is referred to as K in the PLB paper:
+ https://doi.org/10.1145/3544216.3544226.
+
+ The 0-1 fraction range is mapped to 0-256 range to avoid floating
+ point operations. For example, 128 means that if at least 50% of
+ the packets in a round were marked as congested then the round
+ will be tagged as congested.
+
+ Setting threshold to 0 means that PLB repaths every RTT regardless
+ of congestion. This is not intended behavior for PLB and should be
+ used only for experimentation purpose.
+
+ Possible Values: 0 - 256
+
+ Default: 128
+
UDP variables
=============
@@ -1102,6 +1177,33 @@ udp_rmem_min - INTEGER
udp_wmem_min - INTEGER
UDP does not have tx memory accounting and this tunable has no effect.
+udp_hash_entries - INTEGER
+ Show the number of hash buckets for UDP sockets in the current
+ networking namespace.
+
+ A negative value means the networking namespace does not own its
+ hash buckets and shares the initial networking namespace's one.
+
+udp_child_ehash_entries - INTEGER
+ Control the number of hash buckets for UDP sockets in the child
+ networking namespace, which must be set before clone() or unshare().
+
+ If the value is not 0, the kernel uses a value rounded up to 2^n
+ as the actual hash bucket size. 0 is a special value, meaning
+ the child networking namespace will share the initial networking
+ namespace's hash buckets.
+
+ Note that the child will use the global one in case the kernel
+ fails to allocate enough memory. In addition, the global hash
+ buckets are spread over available NUMA nodes, but the allocation
+ of the child hash table depends on the current process's NUMA
+ policy, which could result in performance differences.
+
+ Possible values: 0, 2^n (n: 7 (128) - 16 (64K))
+
+ Default: 0
+
+
RAW variables
=============
@@ -3025,6 +3127,15 @@ ecn_enable - BOOLEAN
Default: 1
+l3mdev_accept - BOOLEAN
+ Enabling this option allows a "global" bound socket to work
+ across L3 master domains (e.g., VRFs) with packets capable of
+ being received regardless of the L3 domain in which they
+ originated. Only valid when the kernel was compiled with
+ CONFIG_NET_L3_MASTER_DEV.
+
+ Default: 1 (enabled)
+
``/proc/sys/net/core/*``
========================
diff --git a/Documentation/networking/ipvs-sysctl.rst b/Documentation/networking/ipvs-sysctl.rst
index 387fda80f05f..3fb5fa142eef 100644
--- a/Documentation/networking/ipvs-sysctl.rst
+++ b/Documentation/networking/ipvs-sysctl.rst
@@ -129,6 +129,26 @@ drop_packet - INTEGER
threshold. When the mode 3 is set, the always mode drop rate
is controlled by the /proc/sys/net/ipv4/vs/am_droprate.
+est_cpulist - CPULIST
+ Allowed CPUs for estimation kthreads
+
+ Syntax: standard cpulist format
+ empty list - stop kthread tasks and estimation
+ default - the system's housekeeping CPUs for kthreads
+
+ Example:
+ "all": all possible CPUs
+ "0-N": all possible CPUs, N denotes last CPU number
+ "0,1-N:1/2": first and all CPUs with odd number
+ "": empty list
+
+est_nice - INTEGER
+ default 0
+ Valid range: -20 (more favorable) .. 19 (less favorable)
+
+ Niceness value to use for the estimation kthreads (scheduling
+ priority)
+
expire_nodest_conn - BOOLEAN
- 0 - disabled (default)
- not 0 - enabled
@@ -304,8 +324,8 @@ run_estimation - BOOLEAN
0 - disabled
not 0 - enabled (default)
- If disabled, the estimation will be stop, and you can't see
- any update on speed estimation data.
+ If disabled, the estimation will be suspended and kthread tasks
+ stopped.
You can always re-enable estimation by setting this value to 1.
But be careful, the first estimation after re-enable is not
diff --git a/Documentation/networking/tc-queue-filters.rst b/Documentation/networking/tc-queue-filters.rst
new file mode 100644
index 000000000000..6b417092276f
--- /dev/null
+++ b/Documentation/networking/tc-queue-filters.rst
@@ -0,0 +1,37 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=========================
+TC queue based filtering
+=========================
+
+TC can be used for directing traffic to either a set of queues or
+to a single queue on both the transmit and receive side.
+
+On the transmit side:
+
+1) TC filter directing traffic to a set of queues is achieved
+ using the action skbedit priority for Tx priority selection,
+ the priority maps to a traffic class (set of queues) when
+ the queue-sets are configured using mqprio.
+
+2) TC filter directs traffic to a transmit queue with the action
+ skbedit queue_mapping $tx_qid. The action skbedit queue_mapping
+ for transmit queue is executed in software only and cannot be
+ offloaded.
+
+Likewise, on the receive side, the two filters for selecting set of
+queues and/or a single queue are supported as below:
+
+1) TC flower filter directs incoming traffic to a set of queues using
+ the 'hw_tc' option.
+ hw_tc $TCID - Specify a hardware traffic class to pass matching
+ packets on to. TCID is in the range 0 through 15.
+
+2) TC filter with action skbedit queue_mapping $rx_qid selects a
+ receive queue. The action skbedit queue_mapping for receive queue
+ is supported only in hardware. Multiple filters may compete in
+ the hardware for queue selection. In such case, the hardware
+ pipeline resolves conflicts based on priority. On Intel E810
+ devices, TC filter directing traffic to a queue have higher
+ priority over flow director filter assigning a queue. The hash
+ filter has lowest priority.
diff --git a/Documentation/networking/timestamping.rst b/Documentation/networking/timestamping.rst
index be4eb1242057..f17c01834a12 100644
--- a/Documentation/networking/timestamping.rst
+++ b/Documentation/networking/timestamping.rst
@@ -179,7 +179,8 @@ SOF_TIMESTAMPING_OPT_ID:
identifier and returns that along with the timestamp. The identifier
is derived from a per-socket u32 counter (that wraps). For datagram
sockets, the counter increments with each sent packet. For stream
- sockets, it increments with every byte.
+ sockets, it increments with every byte. For stream sockets, also set
+ SOF_TIMESTAMPING_OPT_ID_TCP, see the section below.
The counter starts at zero. It is initialized the first time that
the socket option is enabled. It is reset each time the option is
@@ -192,6 +193,35 @@ SOF_TIMESTAMPING_OPT_ID:
among all possibly concurrently outstanding timestamp requests for
that socket.
+SOF_TIMESTAMPING_OPT_ID_TCP:
+ Pass this modifier along with SOF_TIMESTAMPING_OPT_ID for new TCP
+ timestamping applications. SOF_TIMESTAMPING_OPT_ID defines how the
+ counter increments for stream sockets, but its starting point is
+ not entirely trivial. This option fixes that.
+
+ For stream sockets, if SOF_TIMESTAMPING_OPT_ID is set, this should
+ always be set too. On datagram sockets the option has no effect.
+
+ A reasonable expectation is that the counter is reset to zero with
+ the system call, so that a subsequent write() of N bytes generates
+ a timestamp with counter N-1. SOF_TIMESTAMPING_OPT_ID_TCP
+ implements this behavior under all conditions.
+
+ SOF_TIMESTAMPING_OPT_ID without modifier often reports the same,
+ especially when the socket option is set when no data is in
+ transmission. If data is being transmitted, it may be off by the
+ length of the output queue (SIOCOUTQ).
+
+ The difference is due to being based on snd_una versus write_seq.
+ snd_una is the offset in the stream acknowledged by the peer. This
+ depends on factors outside of process control, such as network RTT.
+ write_seq is the last byte written by the process. This offset is
+ not affected by external inputs.
+
+ The difference is subtle and unlikely to be noticed when configured
+ at initial socket creation, when no data is queued or sent. But
+ SOF_TIMESTAMPING_OPT_ID_TCP behavior is more robust regardless of
+ when the socket option is set.
SOF_TIMESTAMPING_OPT_CMSG:
Support recv() cmsg for all timestamped packets. Control messages
diff --git a/Documentation/networking/xfrm_device.rst b/Documentation/networking/xfrm_device.rst
index 01391dfd37d9..c43ace79e320 100644
--- a/Documentation/networking/xfrm_device.rst
+++ b/Documentation/networking/xfrm_device.rst
@@ -5,6 +5,7 @@ XFRM device - offloading the IPsec computations
===============================================
Shannon Nelson <shannon.nelson@oracle.com>
+Leon Romanovsky <leonro@nvidia.com>
Overview
@@ -18,10 +19,21 @@ can radically increase throughput and decrease CPU utilization. The XFRM
Device interface allows NIC drivers to offer to the stack access to the
hardware offload.
+Right now, there are two types of hardware offload that kernel supports.
+ * IPsec crypto offload:
+ * NIC performs encrypt/decrypt
+ * Kernel does everything else
+ * IPsec packet offload:
+ * NIC performs encrypt/decrypt
+ * NIC does encapsulation
+ * Kernel and NIC have SA and policy in-sync
+ * NIC handles the SA and policies states
+ * The Kernel talks to the keymanager
+
Userland access to the offload is typically through a system such as
libreswan or KAME/raccoon, but the iproute2 'ip xfrm' command set can
be handy when experimenting. An example command might look something
-like this::
+like this for crypto offload:
ip x s add proto esp dst 14.0.0.70 src 14.0.0.52 spi 0x07 mode transport \
reqid 0x07 replay-window 32 \
@@ -29,6 +41,17 @@ like this::
sel src 14.0.0.52/24 dst 14.0.0.70/24 proto tcp \
offload dev eth4 dir in
+and for packet offload
+
+ ip x s add proto esp dst 14.0.0.70 src 14.0.0.52 spi 0x07 mode transport \
+ reqid 0x07 replay-window 32 \
+ aead 'rfc4106(gcm(aes))' 0x44434241343332312423222114131211f4f3f2f1 128 \
+ sel src 14.0.0.52/24 dst 14.0.0.70/24 proto tcp \
+ offload packet dev eth4 dir in
+
+ ip x p add src 14.0.0.70 dst 14.0.0.52 offload packet dev eth4 dir in
+ tmpl src 14.0.0.70 dst 14.0.0.52 proto esp reqid 10000 mode transport
+
Yes, that's ugly, but that's what shell scripts and/or libreswan are for.
@@ -40,17 +63,24 @@ Callbacks to implement
/* from include/linux/netdevice.h */
struct xfrmdev_ops {
+ /* Crypto and Packet offload callbacks */
int (*xdo_dev_state_add) (struct xfrm_state *x);
void (*xdo_dev_state_delete) (struct xfrm_state *x);
void (*xdo_dev_state_free) (struct xfrm_state *x);
bool (*xdo_dev_offload_ok) (struct sk_buff *skb,
struct xfrm_state *x);
void (*xdo_dev_state_advance_esn) (struct xfrm_state *x);
+
+ /* Solely packet offload callbacks */
+ void (*xdo_dev_state_update_curlft) (struct xfrm_state *x);
+ int (*xdo_dev_policy_add) (struct xfrm_policy *x);
+ void (*xdo_dev_policy_delete) (struct xfrm_policy *x);
+ void (*xdo_dev_policy_free) (struct xfrm_policy *x);
};
-The NIC driver offering ipsec offload will need to implement these
-callbacks to make the offload available to the network stack's
-XFRM subsystem. Additionally, the feature bits NETIF_F_HW_ESP and
+The NIC driver offering ipsec offload will need to implement callbacks
+relevant to supported offload to make the offload available to the network
+stack's XFRM subsystem. Additionally, the feature bits NETIF_F_HW_ESP and
NETIF_F_HW_ESP_TX_CSUM will signal the availability of the offload.
@@ -79,7 +109,8 @@ and an indication of whether it is for Rx or Tx. The driver should
=========== ===================================
0 success
- -EOPNETSUPP offload not supported, try SW IPsec
+ -EOPNETSUPP offload not supported, try SW IPsec,
+ not applicable for packet offload mode
other fail the request
=========== ===================================
@@ -96,6 +127,7 @@ will serviceable. This can check the packet information to be sure the
offload can be supported (e.g. IPv4 or IPv6, no IPv4 options, etc) and
return true of false to signify its support.
+Crypto offload mode:
When ready to send, the driver needs to inspect the Tx packet for the
offload information, including the opaque context, and set up the packet
send accordingly::
@@ -139,13 +171,25 @@ the stack in xfrm_input().
In ESN mode, xdo_dev_state_advance_esn() is called from xfrm_replay_advance_esn().
Driver will check packet seq number and update HW ESN state machine if needed.
+Packet offload mode:
+HW adds and deletes XFRM headers. So in RX path, XFRM stack is bypassed if HW
+reported success. In TX path, the packet lefts kernel without extra header
+and not encrypted, the HW is responsible to perform it.
+
When the SA is removed by the user, the driver's xdo_dev_state_delete()
-is asked to disable the offload. Later, xdo_dev_state_free() is called
-from a garbage collection routine after all reference counts to the state
+and xdo_dev_policy_delete() are asked to disable the offload. Later,
+xdo_dev_state_free() and xdo_dev_policy_free() are called from a garbage
+collection routine after all reference counts to the state and policy
have been removed and any remaining resources can be cleared for the
offload state. How these are used by the driver will depend on specific
hardware needs.
As a netdev is set to DOWN the XFRM stack's netdev listener will call
-xdo_dev_state_delete() and xdo_dev_state_free() on any remaining offloaded
-states.
+xdo_dev_state_delete(), xdo_dev_policy_delete(), xdo_dev_state_free() and
+xdo_dev_policy_free() on any remaining offloaded states.
+
+Outcome of HW handling packets, the XFRM core can't count hard, soft limits.
+The HW/driver are responsible to perform it and provide accurate data when
+xdo_dev_state_update_curlft() is called. In case of one of these limits
+occuried, the driver needs to call to xfrm_state_check_expire() to make sure
+that XFRM performs rekeying sequence.