Diffstat (limited to 'Documentation/filesystems')
-rw-r--r--  Documentation/filesystems/9p.rst                    |  60
-rw-r--r--  Documentation/filesystems/autofs.rst                |   4
-rw-r--r--  Documentation/filesystems/bcachefs/CodingStyle.rst  |   2
-rw-r--r--  Documentation/filesystems/fsverity.rst              |  27
-rw-r--r--  Documentation/filesystems/idmappings.rst            |   8
-rw-r--r--  Documentation/filesystems/iomap/design.rst          |  10
-rw-r--r--  Documentation/filesystems/journalling.rst           |   6
-rw-r--r--  Documentation/filesystems/locking.rst               |   6
-rw-r--r--  Documentation/filesystems/netfs_library.rst         |   2
-rw-r--r--  Documentation/filesystems/nfs/index.rst             |   1
-rw-r--r--  Documentation/filesystems/nfs/localio.rst           | 357
-rw-r--r--  Documentation/filesystems/overlayfs.rst             |   7
-rw-r--r--  Documentation/filesystems/vfs.rst                   |  15
13 files changed, 472 insertions, 33 deletions
diff --git a/Documentation/filesystems/9p.rst b/Documentation/filesystems/9p.rst
index 1e0e0bb6fdf9..2bbf68b56b0d 100644
--- a/Documentation/filesystems/9p.rst
+++ b/Documentation/filesystems/9p.rst
@@ -31,7 +31,7 @@ Other applications are described in the following papers:
* PROSE I/O: Using 9p to enable Application Partitions
http://plan9.escet.urjc.es/iwp9/cready/PROSE_iwp9_2006.pdf
* VirtFS: A Virtualization Aware File System pass-through
- http://goo.gl/3WPDg
+ https://kernel.org/doc/ols/2010/ols2010-pages-109-120.pdf
Usage
=====
@@ -48,11 +48,66 @@ For server running on QEMU host with virtio transport::
mount -t 9p -o trans=virtio <mount_tag> /mnt/9
-where mount_tag is the tag associated by the server to each of the exported
+where mount_tag is the tag generated by the server for each of the exported
mount points. Each 9P export is seen by the client as a virtio device with an
associated "mount_tag" property. Available mount tags can be
seen by reading /sys/bus/virtio/drivers/9pnet_virtio/virtio<n>/mount_tag files.
+USBG Usage
+==========
+
+To mount a 9p FS on a USB Host accessible via the gadget at runtime::
+
+ mount -t 9p -o trans=usbg,aname=/path/to/fs <device> /mnt/9
+
+To mount a 9p FS on a USB Host accessible via the gadget as root filesystem::
+
+ root=<device> rootfstype=9p rootflags=trans=usbg,cache=loose,uname=root,access=0,dfltuid=0,dfltgid=0,aname=/path/to/rootfs
+
+where <device> is the tag associated with the usb gadget transport.
+It is defined by the configfs instance name.
+
+USBG Example
+============
+
+The USB host exports a filesystem, while the gadget on the USB device
+side makes it mountable.
+
+Diod (9pfs server) and the forwarder are on the development host, where
+the root filesystem is actually stored. The gadget is initialized during
+boot (or later) on the embedded board. Then the forwarder will find it
+on the USB bus and start forwarding requests.
+
+In this case the 9p requests come from the device and are handled by the
+host. The reason is that USB device ports are normally not available on
+PCs, so a connection in the other direction would not work.
+
+When using the usbg transport, for now there is no native usb host
+service capable of handling the requests from the gadget driver. For
+this we have to use the extra python tool p9_fwd.py from tools/usb.
+
+Just start a 9pfs-capable network server like diod/nfs-ganesha, e.g.::
+
+ $ diod -f -n -d 0 -S -l 0.0.0.0:9999 -e $PWD
+
+Optionally scan your bus if there is more than one usbg gadget to find their paths::
+
+ $ python $kernel_dir/tools/usb/p9_fwd.py list
+
+ Bus | Addr | Manufacturer | Product | ID | Path
+ --- | ---- | ---------------- | ---------------- | --------- | ----
+ 2 | 67 | unknown | unknown | 1d6b:0109 | 2-1.1.2
+ 2 | 68 | unknown | unknown | 1d6b:0109 | 2-1.1.3
+
+Then start the python transport::
+
+ $ python $kernel_dir/tools/usb/p9_fwd.py --path 2-1.1.2 connect -p 9999
+
+After that the gadget driver can be used as described above.
+
+One use case is as an alternative to NFS root booting during
+the development of embedded Linux devices.
+
Options
=======
@@ -68,6 +123,7 @@ Options
virtio connect to the next virtio channel available
(from QEMU with trans_virtio module)
rdma connect to a specified RDMA channel
+ usbg connect to a specified usb gadget channel
======== ============================================
uname=name user name to attempt mount as on the remote server. The
diff --git a/Documentation/filesystems/autofs.rst b/Documentation/filesystems/autofs.rst
index 3b6e38e646cd..1ac576458c69 100644
--- a/Documentation/filesystems/autofs.rst
+++ b/Documentation/filesystems/autofs.rst
@@ -18,7 +18,7 @@ key advantages:
2. The names and locations of filesystems can be stored in
a remote database and can change at any time. The content
- in that data base at the time of access will be used to provide
+ in that database at the time of access will be used to provide
a target for the access. The interpretation of names in the
filesystem can even be programmatic rather than database-backed,
allowing wildcards for example, and can vary based on the user who
@@ -423,7 +423,7 @@ The available ioctl commands are:
and objects are expired if the are not in use.
**AUTOFS_EXP_FORCED** causes the in use status to be ignored
- and objects are expired ieven if they are in use. This assumes
+ and objects are expired even if they are in use. This assumes
that the daemon has requested this because it is capable of
performing the umount.
diff --git a/Documentation/filesystems/bcachefs/CodingStyle.rst b/Documentation/filesystems/bcachefs/CodingStyle.rst
index 0c45829a4899..01de555e21d8 100644
--- a/Documentation/filesystems/bcachefs/CodingStyle.rst
+++ b/Documentation/filesystems/bcachefs/CodingStyle.rst
@@ -175,7 +175,7 @@ errors in our thinking by running our code and seeing what happens. If your
time is being wasted because your tools are bad or too slow - don't accept it,
fix it.
-Put effort into your documentation, commmit messages, and code comments - but
+Put effort into your documentation, commit messages, and code comments - but
don't go overboard. A good commit message is wonderful - but if the information
was important enough to go in a commit message, ask yourself if it would be
even better as a code comment.
diff --git a/Documentation/filesystems/fsverity.rst b/Documentation/filesystems/fsverity.rst
index 13e4b18e5dbb..0e2fac7a16da 100644
--- a/Documentation/filesystems/fsverity.rst
+++ b/Documentation/filesystems/fsverity.rst
@@ -86,6 +86,16 @@ authenticating fs-verity file hashes include:
signature in their "security.ima" extended attribute, as controlled
by the IMA policy. For more information, see the IMA documentation.
+- Integrity Policy Enforcement (IPE). IPE supports enforcing access
+ control decisions based on immutable security properties of files,
+ including those protected by fs-verity's built-in signatures.
+ "IPE policy" specifically allows for the authorization of fs-verity
+ files using properties ``fsverity_digest`` for identifying
+ files by their verity digest, and ``fsverity_signature`` to authorize
+ files with a verified fs-verity built-in signature. For
+ details on configuring IPE policies and understanding its operational
+ modes, please refer to :doc:`IPE admin guide </admin-guide/LSM/ipe>`.
+
- Trusted userspace code in combination with `Built-in signature
verification`_. This approach should be used only with great care.
@@ -457,7 +467,11 @@ Enabling this option adds the following:
On success, the ioctl persists the signature alongside the Merkle
tree. Then, any time the file is opened, the kernel verifies the
file's actual digest against this signature, using the certificates
- in the ".fs-verity" keyring.
+ in the ".fs-verity" keyring. This verification happens as long as the
+ file's signature exists, regardless of the state of the sysctl variable
+ "fs.verity.require_signatures" described in the next item. The IPE LSM
+ relies on this behavior to recognize and label fsverity files
+ that contain a verified built-in fsverity signature.
3. A new sysctl "fs.verity.require_signatures" is made available.
When set to 1, the kernel requires that all verity files have a
@@ -481,7 +495,7 @@ be carefully considered before using them:
- Builtin signature verification does *not* make the kernel enforce
that any files actually have fs-verity enabled. Thus, it is not a
- complete authentication policy. Currently, if it is used, the only
+ complete authentication policy. Currently, if it is used, one
way to complete the authentication policy is for trusted userspace
code to explicitly check whether files have fs-verity enabled with a
signature before they are accessed. (With
@@ -490,6 +504,15 @@ be carefully considered before using them:
could just store the signature alongside the file and verify it
itself using a cryptographic library, instead of using this feature.
+- Another approach is to utilize fs-verity builtin signature
+ verification in conjunction with the IPE LSM, which supports defining
+ a kernel-enforced, system-wide authentication policy that allows only
+ files with a verified fs-verity builtin signature to perform certain
+ operations, such as execution. Note that IPE doesn't require
+ fs.verity.require_signatures=1.
+ Please refer to :doc:`IPE admin guide </admin-guide/LSM/ipe>` for
+ more details.
+
- A file's builtin signature can only be set at the same time that
fs-verity is being enabled on the file. Changing or deleting the
builtin signature later requires re-creating the file.
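
For illustration, a minimal userspace sketch of the enable-time flow described
above, passing a builtin signature to FS_IOC_ENABLE_VERITY (the helper name is
made up, the struct fields come from include/uapi/linux/fsverity.h, and error
handling is trimmed)::

    #include <stdint.h>
    #include <sys/ioctl.h>
    #include <linux/fsverity.h>

    /* Hypothetical helper: enable fs-verity on fd with a builtin signature. */
    static int enable_verity_with_sig(int fd, const void *sig, uint32_t sig_size)
    {
            struct fsverity_enable_arg arg = {
                    .version        = 1,
                    .hash_algorithm = FS_VERITY_HASH_ALG_SHA256,
                    .block_size     = 4096,
                    .sig_size       = sig_size,
                    .sig_ptr        = (uint64_t)(uintptr_t)sig,
            };

            /* Builds the Merkle tree and persists the signature alongside it. */
            return ioctl(fd, FS_IOC_ENABLE_VERITY, &arg);
    }
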
diff --git a/Documentation/filesystems/idmappings.rst b/Documentation/filesystems/idmappings.rst
index ac0af679e61e..77930c77fcfe 100644
--- a/Documentation/filesystems/idmappings.rst
+++ b/Documentation/filesystems/idmappings.rst
@@ -821,7 +821,7 @@ the same idmapping to the mount. We now perform three steps:
/* Map the userspace id down into a kernel id in the filesystem's idmapping. */
make_kuid(u0:k20000:r10000, u1000) = k21000
-2. Verify that the caller's kernel ids can be mapped to userspace ids in the
+3. Verify that the caller's kernel ids can be mapped to userspace ids in the
filesystem's idmapping::
from_kuid(u0:k20000:r10000, k21000) = u1000
@@ -854,10 +854,10 @@ The same translation algorithm works with the third example.
/* Map the userspace id down into a kernel id in the filesystem's idmapping. */
make_kuid(u0:k0:r4294967295, u1000) = k1000
-2. Verify that the caller's kernel ids can be mapped to userspace ids in the
+3. Verify that the caller's kernel ids can be mapped to userspace ids in the
filesystem's idmapping::
- from_kuid(u0:k0:r4294967295, k21000) = u1000
+ from_kuid(u0:k0:r4294967295, k1000) = u1000
So the ownership that lands on disk will be ``u1000``.
@@ -994,7 +994,7 @@ from above:::
/* Map the userspace id down into a kernel id in the filesystem's idmapping. */
make_kuid(u0:k0:r4294967295, u1000) = k1000
-2. Verify that the caller's filesystem ids can be mapped to userspace ids in the
+3. Verify that the caller's filesystem ids can be mapped to userspace ids in the
filesystem's idmapping::
from_kuid(u0:k0:r4294967295, k1000) = u1000
diff --git a/Documentation/filesystems/iomap/design.rst b/Documentation/filesystems/iomap/design.rst
index f8ee3427bc1a..b0d0188a095e 100644
--- a/Documentation/filesystems/iomap/design.rst
+++ b/Documentation/filesystems/iomap/design.rst
@@ -142,9 +142,9 @@ Definitions
* **pure overwrite**: A write operation that does not require any
metadata or zeroing operations to perform during either submission
or completion.
- This implies that the fileystem must have already allocated space
+ This implies that the filesystem must have already allocated space
on disk as ``IOMAP_MAPPED`` and the filesystem must not place any
- constaints on IO alignment or size.
+ constraints on IO alignment or size.
The only constraints on I/O alignment are device level (minimum I/O
size and alignment, typically sector size).
@@ -165,7 +165,7 @@ structure below:
u16 flags;
struct block_device *bdev;
struct dax_device *dax_dev;
- voidw *inline_data;
+ void *inline_data;
void *private;
const struct iomap_folio_ops *folio_ops;
u64 validity_cookie;
@@ -394,7 +394,7 @@ iomap is concerned:
* The **upper** level primitive is provided by the filesystem to
coordinate access to different iomap operations.
- The exact primitive is specifc to the filesystem and operation,
+ The exact primitive is specific to the filesystem and operation,
but is often a VFS inode, pagecache invalidation, or folio lock.
For example, a filesystem might take ``i_rwsem`` before calling
``iomap_file_buffered_write`` and ``iomap_file_unshare`` to prevent
@@ -426,7 +426,7 @@ iomap is concerned:
The exact locking requirements are specific to the filesystem; for
certain operations, some of these locks can be elided.
-All further mention of locking are *recommendations*, not mandates.
+All further mentions of locking are *recommendations*, not mandates.
Each filesystem author must figure out the locking for themself.
Bugs and Limitations
diff --git a/Documentation/filesystems/journalling.rst b/Documentation/filesystems/journalling.rst
index e18f90ffc6fd..0254f7d57429 100644
--- a/Documentation/filesystems/journalling.rst
+++ b/Documentation/filesystems/journalling.rst
@@ -137,7 +137,7 @@ Fast commits
JBD2 to also allows you to perform file-system specific delta commits known as
fast commits. In order to use fast commits, you will need to set following
-callbacks that perform correspodning work:
+callbacks that perform corresponding work:
`journal->j_fc_cleanup_cb`: Cleanup function called after every full commit and
fast commit.
@@ -149,7 +149,7 @@ File system is free to perform fast commits as and when it wants as long as it
gets permission from JBD2 to do so by calling the function
:c:func:`jbd2_fc_begin_commit()`. Once a fast commit is done, the client
file system should tell JBD2 about it by calling
-:c:func:`jbd2_fc_end_commit()`. If file system wants JBD2 to perform a full
+:c:func:`jbd2_fc_end_commit()`. If the file system wants JBD2 to perform a full
commit immediately after stopping the fast commit it can do so by calling
:c:func:`jbd2_fc_end_commit_fallback()`. This is useful if fast commit operation
fails for some reason and the only way to guarantee consistency is for JBD2 to
@@ -199,7 +199,7 @@ Journal Level
.. kernel-doc:: fs/jbd2/recovery.c
:internal:
-Transasction Level
+Transaction Level
~~~~~~~~~~~~~~~~~~
.. kernel-doc:: fs/jbd2/transaction.c
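
For illustration, a rough sketch of the fast-commit flow built on
jbd2_fc_begin_commit()/jbd2_fc_end_commit(); the delta-logging helper is
hypothetical and filesystem specific, and error handling is abbreviated::

    #include <linux/jbd2.h>

    /* Hypothetical: write this filesystem's fast-commit delta blocks. */
    static int example_write_fc_delta_blocks(journal_t *journal);

    static int example_do_fast_commit(journal_t *journal, tid_t tid)
    {
            int ret;

            /* Ask JBD2 for permission to start a fast commit. */
            ret = jbd2_fc_begin_commit(journal, tid);
            if (ret)
                    return ret;

            ret = example_write_fc_delta_blocks(journal);
            if (ret)
                    /* Delta logging failed: fall back to a full commit. */
                    return jbd2_fc_end_commit_fallback(journal);

            return jbd2_fc_end_commit(journal);
    }
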
diff --git a/Documentation/filesystems/locking.rst b/Documentation/filesystems/locking.rst
index e664061ed55d..f5e3676db954 100644
--- a/Documentation/filesystems/locking.rst
+++ b/Documentation/filesystems/locking.rst
@@ -251,10 +251,10 @@ prototypes::
void (*readahead)(struct readahead_control *);
int (*write_begin)(struct file *, struct address_space *mapping,
loff_t pos, unsigned len,
- struct page **pagep, void **fsdata);
+ struct folio **foliop, void **fsdata);
int (*write_end)(struct file *, struct address_space *mapping,
loff_t pos, unsigned len, unsigned copied,
- struct page *page, void *fsdata);
+ struct folio *folio, void *fsdata);
sector_t (*bmap)(struct address_space *, sector_t);
void (*invalidate_folio) (struct folio *, size_t start, size_t len);
bool (*release_folio)(struct folio *, gfp_t);
@@ -280,7 +280,7 @@ read_folio: yes, unlocks shared
writepages:
dirty_folio: maybe
readahead: yes, unlocks shared
-write_begin: locks the page exclusive
+write_begin: locks the folio exclusive
write_end: yes, unlocks exclusive
bmap:
invalidate_folio: yes exclusive
diff --git a/Documentation/filesystems/netfs_library.rst b/Documentation/filesystems/netfs_library.rst
index 4cc657d743f7..f0d2cb257bb8 100644
--- a/Documentation/filesystems/netfs_library.rst
+++ b/Documentation/filesystems/netfs_library.rst
@@ -116,7 +116,7 @@ The following services are provided:
* Handle local caching, allowing cached data and server-read data to be
interleaved for a single request.
- * Handle clearing of bufferage that aren't on the server.
+ * Handle clearing of bufferage that isn't on the server.
* Handle retrying of reads that failed, switching reads from the cache to the
server as necessary.
diff --git a/Documentation/filesystems/nfs/index.rst b/Documentation/filesystems/nfs/index.rst
index 8536134f31fd..95c2c009874c 100644
--- a/Documentation/filesystems/nfs/index.rst
+++ b/Documentation/filesystems/nfs/index.rst
@@ -8,6 +8,7 @@ NFS
client-identifier
exporting
+ localio
pnfs
rpc-cache
rpc-server-gss
diff --git a/Documentation/filesystems/nfs/localio.rst b/Documentation/filesystems/nfs/localio.rst
new file mode 100644
index 000000000000..bd1967e2eab3
--- /dev/null
+++ b/Documentation/filesystems/nfs/localio.rst
@@ -0,0 +1,357 @@
+===========
+NFS LOCALIO
+===========
+
+Overview
+========
+
+The LOCALIO auxiliary RPC protocol allows the Linux NFS client and
+server to reliably handshake to determine if they are on the same
+host. Select "NFS client and server support for LOCALIO auxiliary
+protocol" in menuconfig to enable CONFIG_NFS_LOCALIO in the kernel
+config (both CONFIG_NFS_FS and CONFIG_NFSD must also be enabled).
+
+Once an NFS client and server handshake as "local", the client will
+bypass the network RPC protocol for read, write and commit operations.
+Due to this XDR and RPC bypass, these operations complete faster.
+
+The LOCALIO auxiliary protocol's implementation, which uses the same
+connection as NFS traffic, follows the pattern established by the NFS
+ACL protocol extension.
+
+The LOCALIO auxiliary protocol is needed to allow robust discovery of
+clients local to their servers. In a private implementation that
+preceded use of this LOCALIO protocol, a fragile sockaddr network
+address based match against all local network interfaces was attempted.
+But unlike the LOCALIO protocol, the sockaddr-based matching didn't
+handle use of iptables or containers.
+
+The robust handshake between local client and server is just the
+beginning; the ultimate use case this locality makes possible is that
+the client is able to open files and issue reads, writes and commits
+directly to the server without having to go over the network. The
+requirement is to perform these loopback NFS operations as efficiently
+as possible; this is particularly useful for container use cases
+(e.g. kubernetes) where it is possible to run an IO job local to the
+server.
+
+The performance advantage realized from LOCALIO's ability to bypass
+using XDR and RPC for reads, writes and commits can be extreme, e.g.:
+
+fio for 20 secs with directio, qd of 8, 16 libaio threads:
+ - With LOCALIO:
+ 4K read: IOPS=979k, BW=3825MiB/s (4011MB/s)(74.7GiB/20002msec)
+ 4K write: IOPS=165k, BW=646MiB/s (678MB/s)(12.6GiB/20002msec)
+ 128K read: IOPS=402k, BW=49.1GiB/s (52.7GB/s)(982GiB/20002msec)
+ 128K write: IOPS=11.5k, BW=1433MiB/s (1503MB/s)(28.0GiB/20004msec)
+
+ - Without LOCALIO:
+ 4K read: IOPS=79.2k, BW=309MiB/s (324MB/s)(6188MiB/20003msec)
+ 4K write: IOPS=59.8k, BW=234MiB/s (245MB/s)(4671MiB/20002msec)
+ 128K read: IOPS=33.9k, BW=4234MiB/s (4440MB/s)(82.7GiB/20004msec)
+ 128K write: IOPS=11.5k, BW=1434MiB/s (1504MB/s)(28.0GiB/20011msec)
+
+fio for 20 secs with directio, qd of 8, 1 libaio thread:
+ - With LOCALIO:
+ 4K read: IOPS=230k, BW=898MiB/s (941MB/s)(17.5GiB/20001msec)
+ 4K write: IOPS=22.6k, BW=88.3MiB/s (92.6MB/s)(1766MiB/20001msec)
+ 128K read: IOPS=38.8k, BW=4855MiB/s (5091MB/s)(94.8GiB/20001msec)
+ 128K write: IOPS=11.4k, BW=1428MiB/s (1497MB/s)(27.9GiB/20001msec)
+
+ - Without LOCALIO:
+ 4K read: IOPS=77.1k, BW=301MiB/s (316MB/s)(6022MiB/20001msec)
+ 4K write: IOPS=32.8k, BW=128MiB/s (135MB/s)(2566MiB/20001msec)
+ 128K read: IOPS=24.4k, BW=3050MiB/s (3198MB/s)(59.6GiB/20001msec)
+ 128K write: IOPS=11.4k, BW=1430MiB/s (1500MB/s)(27.9GiB/20001msec)
+
+FAQ
+===
+
+1. What are the use cases for LOCALIO?
+
+ a. Workloads where the NFS client and server are on the same host
+ realize improved IO performance. In particular, it is common when
+ running containerised workloads for jobs to find themselves
+ running on the same host as the knfsd server being used for
+ storage.
+
+2. What are the requirements for LOCALIO?
+
+ a. Bypass use of the network RPC protocol as much as possible. This
+ includes bypassing XDR and RPC for open, read, write and commit
+ operations.
+ b. Allow client and server to autonomously discover if they are
+ running local to each other without making any assumptions about
+ the local network topology.
+ c. Support the use of containers by being compatible with relevant
+ namespaces (e.g. network, user, mount).
+ d. Support all versions of NFS. NFSv3 is of particular importance
+ because it has wide enterprise usage and pNFS flexfiles makes use
+ of it for the data path.
+
+3. Why doesn’t LOCALIO just compare IP addresses or hostnames when
+ deciding if the NFS client and server are co-located on the same
+ host?
+
+ Since one of the main use cases is containerised workloads, we cannot
+ assume that IP addresses will be shared between the client and
+ server. This sets up a requirement for a handshake protocol that
+ needs to go over the same connection as the NFS traffic in order to
+ identify that the client and the server really are running on the
+ same host. The handshake uses a secret that is sent over the wire,
+ and can be verified by both parties by comparing with a value stored
+ in shared kernel memory if they are truly co-located.
+
+4. Does LOCALIO improve pNFS flexfiles?
+
+ Yes, LOCALIO complements pNFS flexfiles by allowing it to take
+ advantage of NFS client and server locality. Policy that initiates
+ client IO as close as possible to the server where the data is stored naturally
+ benefits from the data path optimization LOCALIO provides.
+
+5. Why not develop a new pNFS layout to enable LOCALIO?
+
+ A new pNFS layout could be developed, but doing so would put the
+ onus on the server to somehow discover that the client is co-located
+ when deciding to hand out the layout.
+ There is value in a simpler approach (as provided by LOCALIO) that
+ allows the NFS client to negotiate and leverage locality without
+ requiring more elaborate modeling and discovery of such locality in a
+ more centralized manner.
+
+6. Why is having the client perform a server-side file OPEN, without
+ using RPC, beneficial? Is the benefit pNFS specific?
+
+ Avoiding the use of XDR and RPC for file opens is beneficial to
+ performance regardless of whether pNFS is used. Especially when
+ dealing with small files it's best to avoid going over the wire
+ whenever possible; otherwise the wire overhead could reduce or even negate
+ the benefits of avoiding the wire for doing the small file I/O itself.
+ Given LOCALIO's requirements the current approach of having the
+ client perform a server-side file open, without using RPC, is ideal.
+ If in the future requirements change then we can adapt accordingly.
+
+7. Why is LOCALIO only supported with UNIX Authentication (AUTH_UNIX)?
+
+ Strong authentication is usually tied to the connection itself. It
+ works by establishing a context that is cached by the server, and
+ that acts as the key for discovering the authorisation token, which
+ can then be passed to rpc.mountd to complete the authentication
+ process. On the other hand, in the case of AUTH_UNIX, the credential
+ that was passed over the wire is used directly as the key in the
+ upcall to rpc.mountd. This simplifies the authentication process, and
+ so makes AUTH_UNIX easier to support.
+
+8. How do export options that translate RPC user IDs behave for LOCALIO
+ operations (e.g. root_squash, all_squash)?
+
+ Export options that translate user IDs are managed by nfsd_setuser()
+ which is called by nfsd_setuser_and_check_port() which is called by
+ __fh_verify(). So they get handled exactly the same way for LOCALIO
+ as they do for non-LOCALIO.
+
+9. How does LOCALIO make certain that object lifetimes are managed
+ properly given NFSD and NFS operate in different contexts?
+
+ See the detailed "NFS Client and Server Interlock" section below.
+
+RPC
+===
+
+The LOCALIO auxiliary RPC protocol consists of a single "UUID_IS_LOCAL"
+RPC method that allows the Linux NFS client to verify the local Linux
+NFS server can see the nonce (single-use UUID) the client generated and
+made available in nfs_common. This protocol isn't part of an IETF
+standard, nor does it need to be, considering it is a Linux-to-Linux
+auxiliary RPC protocol that amounts to an implementation detail.
+
+The UUID_IS_LOCAL method encodes the client generated uuid_t in terms of
+the fixed UUID_SIZE (16 bytes). The fixed size opaque encode and decode
+XDR methods are used instead of the less efficient variable sized
+methods.
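
For illustration, a fixed-size opaque encode built on the kernel's xdr_stream
helpers might look roughly like this (the function and argument names are
illustrative, not the actual implementation)::

    #include <linux/sunrpc/xdr.h>

    /* Encode the client-generated uuid_t as UUID_SIZE raw bytes, no length. */
    static void localio_xdr_enc_uuidargs(struct rpc_rqst *req,
                                         struct xdr_stream *xdr,
                                         const void *data)
    {
            const u8 *uuid = data;

            xdr_stream_encode_opaque_fixed(xdr, uuid, UUID_SIZE);
    }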
+
+The RPC program number for the NFS_LOCALIO_PROGRAM is 400122 (as assigned
+by IANA, see https://www.iana.org/assignments/rpc-program-numbers/ ):
+Linux Kernel Organization 400122 nfslocalio
+
+The LOCALIO protocol spec in rpcgen syntax is::
+
+ /* raw RFC 9562 UUID */
+ #define UUID_SIZE 16
+ typedef u8 uuid_t<UUID_SIZE>;
+
+ program NFS_LOCALIO_PROGRAM {
+ version LOCALIO_V1 {
+ void
+ NULL(void) = 0;
+
+ void
+ UUID_IS_LOCAL(uuid_t) = 1;
+ } = 1;
+ } = 400122;
+
+LOCALIO uses the same transport connection as NFS traffic. As such,
+LOCALIO is not registered with rpcbind.
+
+NFS Common and Client/Server Handshake
+======================================
+
+fs/nfs_common/nfslocalio.c provides interfaces that enable an NFS client
+to generate a nonce (single-use UUID) and associated short-lived
+nfs_uuid_t struct, and to register it with nfs_common for subsequent
+lookup and verification by the NFS server; if matched, the NFS server
+populates members in the nfs_uuid_t struct. The NFS client then uses
+nfs_common to transfer the nfs_uuid_t from nfs_common's uuids_list to
+the nn->nfsd_serv clients_list. See:
+fs/nfs/localio.c:nfs_local_probe()
+
+nfs_common's nfs_uuids list is the basis for LOCALIO enablement; as such,
+it has members that point to nfsd memory for direct use by the client
+(e.g. 'net' is the server's network namespace, and through it the client
+can access nn->nfsd_serv with proper rcu read access). It is this client
+and server synchronization that enables advanced usage and lifetime of
+objects to span from the host kernel's nfsd to per-container knfsd
+instances that are connected to nfs clients running on the same local
+host.
+
+NFS Client and Server Interlock
+===============================
+
+LOCALIO provides the nfs_uuid_t object and associated interfaces to
+allow proper network namespace (net-ns) and NFSD object refcounting:
+
+ We don't want to keep a long-term counted reference on each NFSD's
+ net-ns in the client because that prevents a server container from
+ completely shutting down.
+
+ So we avoid taking a reference at all and rely on the per-cpu
+ reference to the server (detailed below) being sufficient to keep
+ the net-ns active. This involves allowing the NFSD's net-ns exit
+ code to iterate all active clients and clear their ->net pointers
+ (which are needed to find the per-cpu-refcount for the nfsd_serv).
+
+ Details:
+
+ - Embed nfs_uuid_t in nfs_client. nfs_uuid_t provides a list_head
+ that can be used to find the client. It does add the 16-byte
+ uuid_t to nfs_client so it is bigger than needed (given that
+ uuid_t is only used during the initial NFS client and server
+ LOCALIO handshake to determine if they are local to each other).
+ If that is really a problem we can find a fix.
+
+ - When the nfs server confirms that the uuid_t is local, it moves
+ the nfs_uuid_t onto a per-net-ns list in NFSD's nfsd_net.
+
+ - When each server's net-ns is shutting down - in a "pre_exit"
+ handler, all these nfs_uuid_t have their ->net cleared. There is
+ an rcu_synchronize() call between pre_exit() handlers and exit()
+ handlers so any caller that sees nfs_uuid_t ->net as not NULL can
+ safely manage the per-cpu-refcount for nfsd_serv.
+
+ - The client's nfs_uuid_t is passed to nfsd_open_local_fh() so it
+ can safely dereference ->net in a private rcu_read_lock() section
+ to allow safe access to the associated nfsd_net and nfsd_serv.
+
+So LOCALIO required the introduction and use of NFSD's percpu_ref to
+interlock nfsd_destroy_serv() and nfsd_open_local_fh(), to ensure each
+nn->nfsd_serv is not destroyed while in use by nfsd_open_local_fh(), and
+warrants a more detailed explanation:
+
+ nfsd_open_local_fh() uses nfsd_serv_try_get() before opening its
+ nfsd_file handle and then the caller (NFS client) must drop the
+ reference for the nfsd_file and associated nn->nfsd_serv using
+ nfs_file_put_local() once it has completed its IO.
+
+ This interlock working relies heavily on nfsd_open_local_fh() being
+ afforded the ability to safely deal with the possibility that the
+ NFSD's net-ns (and nfsd_net by association) may have been destroyed
+ by nfsd_destroy_serv() via nfsd_shutdown_net() -- which is only
+ possible given the nfs_uuid_t ->net pointer management detailed
+ above.
+
+All told, this elaborate interlock of the NFS client and server has been
+verified to fix an easy-to-hit crash that would occur if an NFSD
+instance running in a container, with a LOCALIO client mounted, is
+shut down. Upon restart of the container and associated NFSD, the client
+would go on to crash due to a NULL pointer dereference caused by the
+LOCALIO client attempting to call nfsd_open_local_fh(), using
+nn->nfsd_serv, without having a proper reference on nn->nfsd_serv.
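
For illustration, the caller-side pattern this interlock enables can be
sketched as follows (not verbatim kernel code; the field and helper names
follow the text above)::

    static int example_local_open(nfs_uuid_t *uuid)
    {
            struct net *net;

            rcu_read_lock();
            net = rcu_dereference(uuid->net);   /* cleared by NFSD's pre_exit handler */
            if (!net || !nfsd_serv_try_get(net)) {
                    rcu_read_unlock();
                    return -ENXIO;              /* caller re-probes LOCALIO */
            }
            rcu_read_unlock();

            /*
             * The reference taken above keeps nn->nfsd_serv (and with it the
             * net-ns) alive while the nfsd_file is opened and IO is issued;
             * the client drops it with nfs_file_put_local() once IO completes.
             */

            return 0;
    }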
+
+NFS Client issues IO instead of Server
+======================================
+
+Because LOCALIO is focused on protocol bypass to achieve improved IO
+performance, alternatives to the traditional NFS wire protocol (SUNRPC
+with XDR) must be provided to access the backing filesystem.
+
+See fs/nfs/localio.c:nfs_local_open_fh() and
+fs/nfsd/localio.c:nfsd_open_local_fh() for the interface that makes
+focused use of select nfs server objects to allow a client local to a
+server to open a file pointer without needing to go over the network.
+
+The client's fs/nfs/localio.c:nfs_local_open_fh() will call into the
+server's fs/nfsd/localio.c:nfsd_open_local_fh() and carefully access
+both the associated nfsd network namespace and nn->nfsd_serv in terms of
+RCU. If nfsd_open_local_fh() finds that the client no longer sees valid
+nfsd objects (be it struct net or nn->nfsd_serv) it returns -ENXIO
+to nfs_local_open_fh() and the client will try to reestablish the
+LOCALIO resources needed by calling nfs_local_probe() again. This
+recovery is needed if/when an nfsd instance running in a container were
+to reboot while a LOCALIO client is connected to it.
+
+Once the client has an open nfsd_file pointer it will issue reads,
+writes and commits directly to the underlying local filesystem (normally
+done by the nfs server). As such, for these operations, the NFS client
+is issuing IO to the underlying local filesystem that it is sharing with
+the NFS server. See: fs/nfs/localio.c:nfs_local_doio() and
+fs/nfs/localio.c:nfs_local_commit().
+
+Security
+========
+
+Localio is only supported when UNIX-style authentication (AUTH_UNIX, aka
+AUTH_SYS) is used.
+
+Care is taken to ensure the same NFS security mechanisms are used
+(authentication, etc) regardless of whether LOCALIO or regular NFS
+access is used. The auth_domain established as part of the traditional
+NFS client access to the NFS server is also used for LOCALIO.
+
+Relative to containers, LOCALIO gives the client access to the network
+namespace the server has. This is required to allow the client to access
+the server's per-namespace nfsd_net struct. With traditional NFS, the
+client is afforded this same level of access (albeit in terms of the NFS
+protocol via SUNRPC). No other namespaces (user, mount, etc) have been
+altered or purposely extended from the server to the client.
+
+Testing
+=======
+
+The LOCALIO auxiliary protocol and associated NFS LOCALIO read, write
+and commit access have proven stable against various test scenarios:
+
+- Client and server both on the same host.
+
+- All permutations of LOCALIO support being enabled or disabled on the
+ client and server, for both local and remote client and server.
+
+- Testing against NFS storage products that don't support the LOCALIO
+ protocol was also performed.
+
+- Client on host, server within a container (for both v3 and v4.2).
+ The container testing was in terms of podman managed containers and
+ includes a successful container stop/restart scenario.
+
+- Formalizing these test scenarios in terms of existing test
+ infrastructure is on-going. Initial regular coverage is provided in
+ terms of ktest running xfstests against a LOCALIO-enabled NFS loopback
+ mount configuration, and includes lockdep and KASAN coverage, see:
+ https://evilpiepirate.org/~testdashboard/ci?user=snitzer&branch=snitm-nfs-next
+ https://github.com/koverstreet/ktest
+
+- Various kdevops testing (in terms of "Chuck's BuildBot") has been
+ performed to regularly verify the LOCALIO changes haven't caused any
+ regressions to non-LOCALIO NFS use cases.
+
+- All of Hammerspace's various sanity tests pass with LOCALIO enabled
+ (this includes numerous pNFS and flexfiles tests).
diff --git a/Documentation/filesystems/overlayfs.rst b/Documentation/filesystems/overlayfs.rst
index 165514401441..343644712340 100644
--- a/Documentation/filesystems/overlayfs.rst
+++ b/Documentation/filesystems/overlayfs.rst
@@ -367,8 +367,11 @@ Metadata only copy up
When the "metacopy" feature is enabled, overlayfs will only copy
up metadata (as opposed to whole file), when a metadata specific operation
-like chown/chmod is performed. Full file will be copied up later when
-file is opened for WRITE operation.
+like chown/chmod is performed. An upper file in this state is marked with
+the "trusted.overlayfs.metacopy" xattr, which indicates that the upper file
+contains no data. The data will be copied up later when the file is opened
+for a WRITE operation. After the lower file's data is copied up,
+the "trusted.overlayfs.metacopy" xattr is removed from the upper file.
In other words, this is delayed data copy up operation and data is copied
up when there is a need to actually modify data.
diff --git a/Documentation/filesystems/vfs.rst b/Documentation/filesystems/vfs.rst
index 6e903a903f8f..0b18af3f954e 100644
--- a/Documentation/filesystems/vfs.rst
+++ b/Documentation/filesystems/vfs.rst
@@ -810,7 +810,7 @@ cache in your filesystem. The following members are defined:
struct page **pagep, void **fsdata);
int (*write_end)(struct file *, struct address_space *mapping,
loff_t pos, unsigned len, unsigned copied,
- struct page *page, void *fsdata);
+ struct folio *folio, void *fsdata);
sector_t (*bmap)(struct address_space *, sector_t);
void (*invalidate_folio) (struct folio *, size_t start, size_t len);
bool (*release_folio)(struct folio *, gfp_t);
@@ -913,8 +913,7 @@ cache in your filesystem. The following members are defined:
stop attempting I/O, it can simply return. The caller will
remove the remaining pages from the address space, unlock them
and decrement the page refcount. Set PageUptodate if the I/O
- completes successfully. Setting PageError on any page will be
- ignored; simply unlock the page if an I/O error occurs.
+ completes successfully.
``write_begin``
Called by the generic buffered write code to ask the filesystem
@@ -926,12 +925,12 @@ cache in your filesystem. The following members are defined:
(if they haven't been read already) so that the updated blocks
can be written out properly.
- The filesystem must return the locked pagecache page for the
- specified offset, in ``*pagep``, for the caller to write into.
+ The filesystem must return the locked pagecache folio for the
+ specified offset, in ``*foliop``, for the caller to write into.
It must be able to cope with short writes (where the length
passed to write_begin is greater than the number of bytes copied
- into the page).
+ into the folio).
A void * may be returned in fsdata, which then gets passed into
write_end.
@@ -944,8 +943,8 @@ cache in your filesystem. The following members are defined:
called. len is the original len passed to write_begin, and
copied is the amount that was able to be copied.
- The filesystem must take care of unlocking the page and
- releasing it refcount, and updating i_size.
+ The filesystem must take care of unlocking the folio,
+ decrementing its refcount, and updating i_size.
Returns < 0 on failure, otherwise the number of bytes (<=
'copied') that were able to be copied into pagecache.
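
For illustration, a stripped-down ``->write_begin``/``->write_end`` pair for a
simple in-pagecache filesystem might look as follows (loosely modeled on
fs/libfs.c; names are illustrative and real implementations must also bring
partially-written blocks uptodate)::

    static int example_write_begin(struct file *file,
                                   struct address_space *mapping,
                                   loff_t pos, unsigned len,
                                   struct folio **foliop, void **fsdata)
    {
            struct folio *folio;

            folio = __filemap_get_folio(mapping, pos >> PAGE_SHIFT,
                                        FGP_WRITEBEGIN, mapping_gfp_mask(mapping));
            if (IS_ERR(folio))
                    return PTR_ERR(folio);

            *foliop = folio;    /* returned locked, with an elevated refcount */
            return 0;
    }

    static int example_write_end(struct file *file,
                                 struct address_space *mapping,
                                 loff_t pos, unsigned len, unsigned copied,
                                 struct folio *folio, void *fsdata)
    {
            struct inode *inode = mapping->host;

            /* Simplified: assumes the write fully overwrote the folio. */
            if (!folio_test_uptodate(folio))
                    folio_mark_uptodate(folio);

            if (pos + copied > i_size_read(inode))
                    i_size_write(inode, pos + copied);

            folio_mark_dirty(folio);
            folio_unlock(folio);
            folio_put(folio);   /* drop the reference taken in write_begin */

            return copied;
    }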