Diffstat (limited to 'Documentation/filesystems')
-rw-r--r--   Documentation/filesystems/9p.rst                    |  58
-rw-r--r--   Documentation/filesystems/caching/cachefiles.rst    |   2
-rw-r--r--   Documentation/filesystems/index.rst                 |   1
-rw-r--r--   Documentation/filesystems/iomap/operations.rst      |  17
-rw-r--r--   Documentation/filesystems/multigrain-ts.rst         | 125
-rw-r--r--   Documentation/filesystems/netfs_library.rst         |   1
-rw-r--r--   Documentation/filesystems/nfs/exporting.rst         |   7
-rw-r--r--   Documentation/filesystems/nfs/index.rst             |   1
-rw-r--r--   Documentation/filesystems/nfs/localio.rst           | 357
-rw-r--r--   Documentation/filesystems/overlayfs.rst             |  17
-rw-r--r--   Documentation/filesystems/porting.rst               |   2
-rw-r--r--   Documentation/filesystems/proc.rst                  |   2
-rw-r--r--   Documentation/filesystems/tmpfs.rst                 |  24
13 files changed, 601 insertions, 13 deletions
diff --git a/Documentation/filesystems/9p.rst b/Documentation/filesystems/9p.rst
index d270a0aa8d55..2bbf68b56b0d 100644
--- a/Documentation/filesystems/9p.rst
+++ b/Documentation/filesystems/9p.rst
@@ -48,11 +48,66 @@ For server running on QEMU host with virtio transport::

     mount -t 9p -o trans=virtio <mount_tag> /mnt/9

-where mount_tag is the tag associated by the server to each of the exported
+where mount_tag is the tag generated by the server for each of the exported
 mount points. Each 9P export is seen by the client as a virtio device with an
 associated "mount_tag" property. Available mount tags can be seen by reading
 /sys/bus/virtio/drivers/9pnet_virtio/virtio<n>/mount_tag files.

+USBG Usage
+==========
+
+To mount a 9p FS on a USB Host accessible via the gadget at runtime::
+
+    mount -t 9p -o trans=usbg,aname=/path/to/fs <device> /mnt/9
+
+To mount a 9p FS on a USB Host accessible via the gadget as root filesystem::
+
+    root=<device> rootfstype=9p rootflags=trans=usbg,cache=loose,uname=root,access=0,dfltuid=0,dfltgid=0,aname=/path/to/rootfs
+
+where <device> is the tag associated with the USB gadget transport.
+It is defined by the configfs instance name.
+
+USBG Example
+============
+
+The USB host exports a filesystem, while the gadget on the USB device
+side makes it mountable.
+
+Diod (9pfs server) and the forwarder are on the development host, where
+the root filesystem is actually stored. The gadget is initialized during
+boot (or later) on the embedded board. Then the forwarder will find it
+on the USB bus and start forwarding requests.
+
+In this case the 9p requests come from the device and are handled by the
+host. The reason is that USB device ports are normally not available on
+PCs, so a connection in the other direction would not work.
+
+When using the usbg transport, there is currently no native USB host
+service capable of handling the requests from the gadget driver. For
+this we have to use the extra Python tool p9_fwd.py from tools/usb.
+
+Start a 9pfs-capable network server such as diod or nfs-ganesha, e.g.::
+
+    $ diod -f -n -d 0 -S -l 0.0.0.0:9999 -e $PWD
+
+Optionally, scan your bus to find the right path if more than one usbg
+gadget is present::
+
+    $ python $kernel_dir/tools/usb/p9_fwd.py list
+
+    Bus | Addr | Manufacturer     | Product          | ID        | Path
+    --- | ---- | ---------------- | ---------------- | --------- | ----
+      2 |   67 | unknown          | unknown          | 1d6b:0109 | 2-1.1.2
+      2 |   68 | unknown          | unknown          | 1d6b:0109 | 2-1.1.3
+
+Then start the Python forwarder::
+
+    $ python $kernel_dir/tools/usb/p9_fwd.py --path 2-1.1.2 connect -p 9999
+
+After that the gadget driver can be used as described above.
+
+One use case is to use it as an alternative to NFS root booting during
+the development of embedded Linux devices.
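The gadget side referenced above is assembled through configfs. A minimal
sketch, assuming the usb9pfs gadget function; the gadget name, IDs and UDC
name are illustrative::

    # The usb9pfs instance name ("usb0" here) becomes the mount tag <device>.
    modprobe libcomposite
    cd /sys/kernel/config/usb_gadget
    mkdir g1 && cd g1
    echo 0x1d6b > idVendor
    echo 0x0109 > idProduct
    mkdir configs/c.1
    mkdir functions/usb9pfs.usb0
    ln -s functions/usb9pfs.usb0 configs/c.1/
    ls /sys/class/udc                 # pick the UDC present on your board
    echo <udc> > UDC                  # bind the gadget

See Documentation/usb/gadget-testing.rst for the general configfs gadget
workflow.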
+
 Options
 =======
@@ -68,6 +123,7 @@ Options
   virtio	connect to the next virtio channel available
 		(from QEMU with trans_virtio module)
   rdma		connect to a specified RDMA channel
+  usbg		connect to a specified usb gadget channel
   ======== ============================================

   uname=name	user name to attempt mount as on the remote server.  The
diff --git a/Documentation/filesystems/caching/cachefiles.rst b/Documentation/filesystems/caching/cachefiles.rst
index e04a27bdbe19..b3ccc782cb3b 100644
--- a/Documentation/filesystems/caching/cachefiles.rst
+++ b/Documentation/filesystems/caching/cachefiles.rst
@@ -115,7 +115,7 @@ set up cache ready for use.
 The following script commands are available:
 This mask can also be set through sysfs, eg::

-	echo 5 >/sys/modules/cachefiles/parameters/debug
+	echo 5 > /sys/module/cachefiles/parameters/debug

 Starting the Cache
diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst
index e8e496d23e1d..44e9e77ffe0d 100644
--- a/Documentation/filesystems/index.rst
+++ b/Documentation/filesystems/index.rst
@@ -29,6 +29,7 @@ algorithms work.
    fiemap
    files
    locks
+   multigrain-ts
    mount_api
    quota
    seq_file
diff --git a/Documentation/filesystems/iomap/operations.rst b/Documentation/filesystems/iomap/operations.rst
index 8e6c721d2330..ef082e5a4e0c 100644
--- a/Documentation/filesystems/iomap/operations.rst
+++ b/Documentation/filesystems/iomap/operations.rst
@@ -208,7 +208,7 @@ The filesystem must arrange to `cancel
 such `reservations
 <https://lore.kernel.org/linux-xfs/20220817093627.GZ3600936@dread.disaster.area/>`_
 because writeback will not consume the reservation.
-The ``iomap_file_buffered_write_punch_delalloc`` can be called from a
+The ``iomap_write_delalloc_release`` can be called from a
 ``->iomap_end`` function to find all the clean areas of the folios
 caching a fresh (``IOMAP_F_NEW``) delalloc mapping.
 It takes the ``invalidate_lock``.
@@ -513,6 +513,21 @@ IOMAP_WRITE`` with any combination of the following enhancements:
    if the mapping is unwritten and the filesystem cannot handle zeroing
    the unaligned regions without exposing stale contents.

+ * ``IOMAP_ATOMIC``: This write is being issued with torn-write
+   protection.
+   Only a single bio can be created for the write, and the write must
+   not be split into multiple I/O requests, i.e. flag REQ_ATOMIC must be
+   set.
+   The file range to write must be aligned to satisfy the requirements
+   of both the filesystem and the underlying block device's atomic
+   commit capabilities.
+   If filesystem metadata updates are required (e.g. unwritten extent
+   conversion or copy on write), all updates for the entire file range
+   must be committed atomically as well.
+   Only one space mapping is allowed per untorn write.
+   Untorn writes must be aligned to, and must not be longer than, a
+   single file block.
+
 Callers commonly hold ``i_rwsem`` in shared or exclusive mode before
 calling this function.
diff --git a/Documentation/filesystems/multigrain-ts.rst b/Documentation/filesystems/multigrain-ts.rst
new file mode 100644
index 000000000000..c779e47284e8
--- /dev/null
+++ b/Documentation/filesystems/multigrain-ts.rst
@@ -0,0 +1,125 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=====================
+Multigrain Timestamps
+=====================
+
+Introduction
+============
+Historically, the kernel has always used coarse time values to stamp inodes.
+This value is updated every jiffy, so any change that happens within that
+jiffy will end up with the same timestamp.
+
+When the kernel goes to stamp an inode (due to a read or write), it first
+gets the current time and then compares it to the existing timestamp(s) to
+see whether anything will change. If nothing changed, then it can avoid
+updating the inode's metadata.
+
+Coarse timestamps are therefore good from a performance standpoint, since
+they reduce the need for metadata updates, but bad from the standpoint of
+determining whether anything has changed, since a lot of things can happen
+in a jiffy.
+
+They are particularly troublesome with NFSv3, where unchanging timestamps
+can make it difficult to tell whether to invalidate caches. NFSv4 provides a
+dedicated change attribute that should always show a visible change, but not
+all filesystems implement this properly, causing the NFS server to
+substitute the ctime in many cases.
+
+Multigrain timestamps aim to remedy this by selectively using fine-grained
+timestamps when a file has had its timestamps queried recently, and the
+current coarse-grained time does not cause a change.
+
+Inode Timestamps
+================
+There are currently 3 timestamps in the inode that are updated to the
+current wallclock time on different activity:
+
+ctime:
+  The inode change time. This is stamped with the current time whenever
+  the inode's metadata is changed. Note that this value is not settable
+  from userland.
+
+mtime:
+  The inode modification time. This is stamped with the current time
+  any time a file's contents change.
+
+atime:
+  The inode access time. This is stamped whenever an inode's contents are
+  read. Widely considered to be a terrible mistake. Usually avoided with
+  options like noatime or relatime.
+
+Updating the mtime always implies a change to the ctime, but updating the
+atime due to a read request does not.
+
+Multigrain timestamps are only tracked for the ctime and the mtime. atimes
+are not affected and always use the coarse-grained value (subject to the
+floor).
+
+Inode Timestamp Ordering
+========================
+
+In addition to providing info about changes to individual files, file
+timestamps also serve an important purpose in applications like "make".
+These programs examine timestamps in order to determine whether source files
+might be newer than cached objects.
+
+Userland applications like make can only determine ordering based on
+operational boundaries. For a syscall those are the syscall entry and exit
+points. For io_uring or nfsd operations, that's the request submission and
+response. In the case of concurrent operations, userland can make no
+determination about the order in which things will occur.
+
+For instance, if a single thread modifies one file, and then another file in
+sequence, the second file must show an equal or later mtime than the first.
+The same is true if two threads are issuing similar operations that do not
+overlap in time.
+
+If, however, two threads have racing syscalls that overlap in time, then
+there is no such guarantee, and the second file may appear to have been
+modified before, after or at the same time as the first, regardless of which
+one was submitted first.
+
+Note that the above assumes that the system doesn't experience a backward
+jump of the realtime clock. If that occurs at an inopportune time, then
+timestamps can appear to go backward, even on a properly functioning system.
+
+Multigrain Timestamp Implementation
+===================================
+Multigrain timestamps are aimed at ensuring that changes to a single file
+are always recognizable, without violating the ordering guarantees when
+multiple different files are modified. This affects the mtime and the ctime,
+but the atime will always use coarse-grained timestamps.
+
+The implementation uses an unused bit in the i_ctime_nsec field to indicate
+whether the mtime or ctime has been queried. If either or both have, then
+the kernel takes special care to ensure the next timestamp update will
+display a visible change. This ensures tight cache coherency for use cases
+like NFS, without sacrificing the benefits of reduced metadata updates when
+files aren't being watched.
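The effect is observable from userland. A hypothetical session on a
filesystem that has opted in to multigrain timestamps: stat marks the
timestamps as queried, so the following write is stamped with a fine-grained
value and two changes landing in the same jiffy remain distinguishable (the
timestamps shown are illustrative)::

    $ stat -c '%y' file
    2024-08-23 12:00:00.123456700 +0000
    $ echo a >> file            # next update must show a visible change
    $ stat -c '%y' file
    2024-08-23 12:00:00.123461032 +0000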
+
+The Ctime Floor Value
+=====================
+It's not sufficient to simply use fine- or coarse-grained timestamps based
+on whether the mtime or ctime has been queried. A file could get a
+fine-grained timestamp, and then a second file modified later could get a
+coarse-grained one that appears earlier than the first, which would break
+the kernel's timestamp ordering guarantees.
+
+To mitigate this problem, the kernel maintains a global floor value that
+ensures this can't happen. The two files in the above example may appear to
+have been modified at the same time in such a case, but they will never show
+the reverse order. To avoid problems with realtime clock jumps, the floor is
+managed as a monotonic ktime_t, and the values are converted to realtime
+clock values as needed.
+
+Implementation Notes
+====================
+Multigrain timestamps are intended for use by local filesystems that get
+ctime values from the local clock. This is in contrast to network
+filesystems and the like that just mirror timestamp values from a server.
+
+For most filesystems, it's sufficient to just set the FS_MGTIME flag in the
+fstype->fs_flags in order to opt in, provided the ctime is only ever set via
+inode_set_ctime_current(). If the filesystem has a ->getattr routine that
+doesn't call generic_fillattr, then it should call fill_mg_cmtime() to
+fill those values. For setattr, it should use setattr_copy() to update the
+timestamps, or otherwise mimic its behavior.
diff --git a/Documentation/filesystems/netfs_library.rst b/Documentation/filesystems/netfs_library.rst
index f0d2cb257bb8..73f0bfd7e903 100644
--- a/Documentation/filesystems/netfs_library.rst
+++ b/Documentation/filesystems/netfs_library.rst
@@ -592,4 +592,3 @@ API Function Reference

 .. kernel-doc:: include/linux/netfs.h
 .. kernel-doc:: fs/netfs/buffered_read.c
-.. kernel-doc:: fs/netfs/io.c
diff --git a/Documentation/filesystems/nfs/exporting.rst b/Documentation/filesystems/nfs/exporting.rst
index f04ce1215a03..de64d2d002a2 100644
--- a/Documentation/filesystems/nfs/exporting.rst
+++ b/Documentation/filesystems/nfs/exporting.rst
@@ -238,10 +238,3 @@ following flags are defined:
   all of an inode's dirty data on last close. Exports that behave this
   way should set EXPORT_OP_FLUSH_ON_CLOSE so that NFSD knows to skip
   waiting for writeback when closing such files.
-
-  EXPORT_OP_ASYNC_LOCK - Indicates a capable filesystem to do async lock
-  requests from lockd. Only set EXPORT_OP_ASYNC_LOCK if the filesystem has
-  it's own ->lock() functionality as core posix_lock_file() implementation
-  has no async lock request handling yet. For more information about how to
-  indicate an async lock request from a ->lock() file_operations struct, see
-  fs/locks.c and comment for the function vfs_lock_file().
diff --git a/Documentation/filesystems/nfs/index.rst b/Documentation/filesystems/nfs/index.rst
index 8536134f31fd..95c2c009874c 100644
--- a/Documentation/filesystems/nfs/index.rst
+++ b/Documentation/filesystems/nfs/index.rst
@@ -8,6 +8,7 @@ NFS

    client-identifier
    exporting
+   localio
    pnfs
    rpc-cache
    rpc-server-gss
diff --git a/Documentation/filesystems/nfs/localio.rst b/Documentation/filesystems/nfs/localio.rst
new file mode 100644
index 000000000000..bd1967e2eab3
--- /dev/null
+++ b/Documentation/filesystems/nfs/localio.rst
@@ -0,0 +1,357 @@
+===========
+NFS LOCALIO
+===========
+
+Overview
+========
+
+The LOCALIO auxiliary RPC protocol allows the Linux NFS client and
+server to reliably handshake to determine if they are on the same
+host.  Select "NFS client and server support for LOCALIO auxiliary
Select "NFS client and server support for LOCALIO auxiliary +protocol" in menuconfig to enable CONFIG_NFS_LOCALIO in the kernel +config (both CONFIG_NFS_FS and CONFIG_NFSD must also be enabled). + +Once an NFS client and server handshake as "local", the client will +bypass the network RPC protocol for read, write and commit operations. +Due to this XDR and RPC bypass, these operations will operate faster. + +The LOCALIO auxiliary protocol's implementation, which uses the same +connection as NFS traffic, follows the pattern established by the NFS +ACL protocol extension. + +The LOCALIO auxiliary protocol is needed to allow robust discovery of +clients local to their servers. In a private implementation that +preceded use of this LOCALIO protocol, a fragile sockaddr network +address based match against all local network interfaces was attempted. +But unlike the LOCALIO protocol, the sockaddr-based matching didn't +handle use of iptables or containers. + +The robust handshake between local client and server is just the +beginning, the ultimate use case this locality makes possible is the +client is able to open files and issue reads, writes and commits +directly to the server without having to go over the network. The +requirement is to perform these loopback NFS operations as efficiently +as possible, this is particularly useful for container use cases +(e.g. kubernetes) where it is possible to run an IO job local to the +server. + +The performance advantage realized from LOCALIO's ability to bypass +using XDR and RPC for reads, writes and commits can be extreme, e.g.: + +fio for 20 secs with directio, qd of 8, 16 libaio threads: + - With LOCALIO: + 4K read: IOPS=979k, BW=3825MiB/s (4011MB/s)(74.7GiB/20002msec) + 4K write: IOPS=165k, BW=646MiB/s (678MB/s)(12.6GiB/20002msec) + 128K read: IOPS=402k, BW=49.1GiB/s (52.7GB/s)(982GiB/20002msec) + 128K write: IOPS=11.5k, BW=1433MiB/s (1503MB/s)(28.0GiB/20004msec) + + - Without LOCALIO: + 4K read: IOPS=79.2k, BW=309MiB/s (324MB/s)(6188MiB/20003msec) + 4K write: IOPS=59.8k, BW=234MiB/s (245MB/s)(4671MiB/20002msec) + 128K read: IOPS=33.9k, BW=4234MiB/s (4440MB/s)(82.7GiB/20004msec) + 128K write: IOPS=11.5k, BW=1434MiB/s (1504MB/s)(28.0GiB/20011msec) + +fio for 20 secs with directio, qd of 8, 1 libaio thread: + - With LOCALIO: + 4K read: IOPS=230k, BW=898MiB/s (941MB/s)(17.5GiB/20001msec) + 4K write: IOPS=22.6k, BW=88.3MiB/s (92.6MB/s)(1766MiB/20001msec) + 128K read: IOPS=38.8k, BW=4855MiB/s (5091MB/s)(94.8GiB/20001msec) + 128K write: IOPS=11.4k, BW=1428MiB/s (1497MB/s)(27.9GiB/20001msec) + + - Without LOCALIO: + 4K read: IOPS=77.1k, BW=301MiB/s (316MB/s)(6022MiB/20001msec) + 4K write: IOPS=32.8k, BW=128MiB/s (135MB/s)(2566MiB/20001msec) + 128K read: IOPS=24.4k, BW=3050MiB/s (3198MB/s)(59.6GiB/20001msec) + 128K write: IOPS=11.4k, BW=1430MiB/s (1500MB/s)(27.9GiB/20001msec) + +FAQ +=== + +1. What are the use cases for LOCALIO? + + a. Workloads where the NFS client and server are on the same host + realize improved IO performance. In particular, it is common when + running containerised workloads for jobs to find themselves + running on the same host as the knfsd server being used for + storage. + +2. What are the requirements for LOCALIO? + + a. Bypass use of the network RPC protocol as much as possible. This + includes bypassing XDR and RPC for open, read, write and commit + operations. + b. Allow client and server to autonomously discover if they are + running local to each other without making any assumptions about + the local network topology. + c. 
+FAQ
+===
+
+1. What are the use cases for LOCALIO?
+
+   a. Workloads where the NFS client and server are on the same host
+      realize improved IO performance. In particular, it is common when
+      running containerised workloads for jobs to find themselves
+      running on the same host as the knfsd server being used for
+      storage.
+
+2. What are the requirements for LOCALIO?
+
+   a. Bypass use of the network RPC protocol as much as possible. This
+      includes bypassing XDR and RPC for open, read, write and commit
+      operations.
+   b. Allow client and server to autonomously discover if they are
+      running local to each other without making any assumptions about
+      the local network topology.
+   c. Support the use of containers by being compatible with relevant
+      namespaces (e.g. network, user, mount).
+   d. Support all versions of NFS. NFSv3 is of particular importance
+      because it has wide enterprise usage and pNFS flexfiles makes use
+      of it for the data path.
+
+3. Why doesn't LOCALIO just compare IP addresses or hostnames when
+   deciding if the NFS client and server are co-located on the same
+   host?
+
+   Since one of the main use cases is containerised workloads, we cannot
+   assume that IP addresses will be shared between the client and
+   server. This sets up a requirement for a handshake protocol that
+   needs to go over the same connection as the NFS traffic in order to
+   identify that the client and the server really are running on the
+   same host. The handshake uses a secret that is sent over the wire,
+   and can be verified by both parties by comparing it with a value
+   stored in shared kernel memory if they are truly co-located.
+
+4. Does LOCALIO improve pNFS flexfiles?
+
+   Yes, LOCALIO complements pNFS flexfiles by allowing it to take
+   advantage of NFS client and server locality. Policy that initiates
+   client IO as close as possible to the server where the data is
+   stored naturally benefits from the data path optimization LOCALIO
+   provides.
+
+5. Why not develop a new pNFS layout to enable LOCALIO?
+
+   A new pNFS layout could be developed, but doing so would put the
+   onus on the server to somehow discover that the client is co-located
+   when deciding to hand out the layout.
+   There is value in a simpler approach (as provided by LOCALIO) that
+   allows the NFS client to negotiate and leverage locality without
+   requiring more elaborate modeling and discovery of such locality in a
+   more centralized manner.
+
+6. Why is having the client perform a server-side file OPEN, without
+   using RPC, beneficial? Is the benefit pNFS specific?
+
+   Avoiding the use of XDR and RPC for file opens is beneficial to
+   performance regardless of whether pNFS is used. Especially when
+   dealing with small files, it's best to avoid going over the wire
+   whenever possible; otherwise the overhead could reduce or even
+   negate the benefits of avoiding the wire for doing the small file
+   I/O itself. Given LOCALIO's requirements, the current approach of
+   having the client perform a server-side file open, without using
+   RPC, is ideal. If in the future requirements change then we can
+   adapt accordingly.
+
+7. Why is LOCALIO only supported with UNIX Authentication (AUTH_UNIX)?
+
+   Strong authentication is usually tied to the connection itself. It
+   works by establishing a context that is cached by the server, and
+   that acts as the key for discovering the authorisation token, which
+   can then be passed to rpc.mountd to complete the authentication
+   process. On the other hand, in the case of AUTH_UNIX, the credential
+   that was passed over the wire is used directly as the key in the
+   upcall to rpc.mountd. This simplifies the authentication process, and
+   so makes AUTH_UNIX easier to support.
+
+8. How do export options that translate RPC user IDs behave for LOCALIO
+   operations (e.g. root_squash, all_squash)?
+
+   Export options that translate user IDs are managed by nfsd_setuser(),
+   which is called by nfsd_setuser_and_check_port(), which is called by
+   __fh_verify(). So they get handled exactly the same way for LOCALIO
+   as they do for non-LOCALIO.
+
+9. How does LOCALIO make certain that object lifetimes are managed
+   properly given that NFSD and NFS operate in different contexts?
+
+   See the detailed "NFS Client and Server Interlock" section below.
+
+RPC
+===
+
+The LOCALIO auxiliary RPC protocol consists of a single "UUID_IS_LOCAL"
+RPC method that allows the Linux NFS client to verify that the local
+Linux NFS server can see the nonce (single-use UUID) the client
+generated and made available in nfs_common. This protocol isn't part of
+an IETF standard, nor does it need to be, considering it is a
+Linux-to-Linux auxiliary RPC protocol that amounts to an implementation
+detail.
+
+The UUID_IS_LOCAL method encodes the client-generated uuid_t in terms of
+the fixed UUID_SIZE (16 bytes). The fixed-size opaque encode and decode
+XDR methods are used instead of the less efficient variable-sized
+methods.
+
+The RPC program number for the NFS_LOCALIO_PROGRAM is 400122 (as assigned
+by IANA, see https://www.iana.org/assignments/rpc-program-numbers/ ):
+Linux Kernel Organization	400122	nfslocalio
+
+The LOCALIO protocol spec in rpcgen syntax is::
+
+    /* raw RFC 9562 UUID */
+    #define UUID_SIZE 16
+    typedef u8 uuid_t<UUID_SIZE>;
+
+    program NFS_LOCALIO_PROGRAM {
+        version LOCALIO_V1 {
+            void
+                NULL(void) = 0;
+
+            void
+                UUID_IS_LOCAL(uuid_t) = 1;
+        } = 1;
+    } = 400122;
+
+LOCALIO uses the same transport connection as NFS traffic. As such,
+LOCALIO is not registered with rpcbind.
+
+NFS Common and Client/Server Handshake
+======================================
+
+fs/nfs_common/nfslocalio.c provides interfaces that enable an NFS client
+to generate a nonce (single-use UUID) and associated short-lived
+nfs_uuid_t struct, and to register it with nfs_common for subsequent
+lookup and verification by the NFS server; if matched, the NFS server
+populates members in the nfs_uuid_t struct. The NFS client then uses
+nfs_common to transfer the nfs_uuid_t from its nfs_uuids to the
+nn->nfsd_serv clients_list from the nfs_common's uuids_list. See:
+fs/nfs/localio.c:nfs_local_probe()
+
+nfs_common's nfs_uuids list is the basis for LOCALIO enablement; as such
+it has members that point to nfsd memory for direct use by the client
+(e.g. 'net' is the server's network namespace, through it the client can
+access nn->nfsd_serv with proper rcu read access). It is this client
+and server synchronization that enables advanced usage and lifetime of
+objects to span from the host kernel's nfsd to per-container knfsd
+instances that are connected to NFS clients running on the same local
+host.
+
+NFS Client and Server Interlock
+===============================
+
+LOCALIO provides the nfs_uuid_t object and associated interfaces to
+allow proper network namespace (net-ns) and NFSD object refcounting:
+
+  We don't want to keep a long-term counted reference on each NFSD's
+  net-ns in the client because that prevents a server container from
+  completely shutting down.
+
+  So we avoid taking a reference at all and rely on the per-cpu
+  reference to the server (detailed below) being sufficient to keep
+  the net-ns active. This involves allowing the NFSD's net-ns exit
+  code to iterate all active clients and clear their ->net pointers
+  (which are needed to find the per-cpu-refcount for the nfsd_serv).
+
+  Details:
+
+   - Embed nfs_uuid_t in nfs_client. nfs_uuid_t provides a list_head
+     that can be used to find the client. It does add the 16-byte
+     uuid_t to nfs_client so it is bigger than needed (given that
+     uuid_t is only used during the initial NFS client and server
+     LOCALIO handshake to determine if they are local to each other).
+     If that is really a problem we can find a fix.
+
+   - When the NFS server confirms that the uuid_t is local, it moves
+     the nfs_uuid_t onto a per-net-ns list in NFSD's nfsd_net.
+
+   - When each server's net-ns is shutting down, in a "pre_exit"
+     handler, all these nfs_uuid_t have their ->net cleared. There is
+     an rcu_synchronize() call between pre_exit() handlers and exit()
+     handlers so any caller that sees nfs_uuid_t ->net as not NULL can
+     safely manage the per-cpu-refcount for nfsd_serv.
+
+   - The client's nfs_uuid_t is passed to nfsd_open_local_fh() so it
+     can safely dereference ->net in a private rcu_read_lock() section
+     to allow safe access to the associated nfsd_net and nfsd_serv.
+
+So LOCALIO required the introduction and use of NFSD's percpu_ref to
+interlock nfsd_destroy_serv() and nfsd_open_local_fh(), to ensure each
+nn->nfsd_serv is not destroyed while in use by nfsd_open_local_fh(), and
+warrants a more detailed explanation:
+
+  nfsd_open_local_fh() uses nfsd_serv_try_get() before opening its
+  nfsd_file handle and then the caller (NFS client) must drop the
+  reference for the nfsd_file and associated nn->nfsd_serv using
+  nfs_file_put_local() once it has completed its IO.
+
+  This interlock relies heavily on nfsd_open_local_fh() being
+  afforded the ability to safely deal with the possibility that the
+  NFSD's net-ns (and nfsd_net by association) may have been destroyed
+  by nfsd_destroy_serv() via nfsd_shutdown_net() -- which is only
+  possible given the nfs_uuid_t ->net pointer management detailed
+  above.
+
+All told, this elaborate interlock of the NFS client and server has been
+verified to fix an easy-to-hit crash that would occur if an NFSD
+instance running in a container, with a LOCALIO client mounted, is shut
+down. Upon restart of the container and associated NFSD, the client
+would go on to crash due to a NULL pointer dereference caused by the
+LOCALIO client attempting to call nfsd_open_local_fh() using
+nn->nfsd_serv without holding a proper reference on it.
+
+NFS Client issues IO instead of Server
+======================================
+
+Because LOCALIO is focused on protocol bypass to achieve improved IO
+performance, alternatives to the traditional NFS wire protocol (SUNRPC
+with XDR) must be provided to access the backing filesystem.
+
+See fs/nfs/localio.c:nfs_local_open_fh() and
+fs/nfsd/localio.c:nfsd_open_local_fh() for the interface that makes
+focused use of select nfs server objects to allow a client local to a
+server to open a file pointer without needing to go over the network.
+
+The client's fs/nfs/localio.c:nfs_local_open_fh() will call into the
+server's fs/nfsd/localio.c:nfsd_open_local_fh() and carefully access
+both the associated nfsd network namespace and nn->nfsd_serv in terms of
+RCU. If nfsd_open_local_fh() finds that the client no longer sees valid
+nfsd objects (be it struct net or nn->nfsd_serv) it returns -ENXIO
+to nfs_local_open_fh() and the client will try to reestablish the
+LOCALIO resources needed by calling nfs_local_probe() again. This
+recovery is needed if/when an nfsd instance running in a container were
+to reboot while a LOCALIO client is connected to it.
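This recovery path can be exercised by hand. A hypothetical session,
assuming a podman-managed knfsd container named "nfsd-container" and a
LOCALIO-eligible loopback mount (names and paths are illustrative)::

    $ mount -t nfs -o vers=3 127.0.0.1:/export /mnt/nfs
    $ podman restart nfsd-container
    $ dd if=/mnt/nfs/testfile of=/dev/null bs=1M count=10
    # the first IO after the restart triggers a fresh nfs_local_probe()
    # and LOCALIO re-engages without remounting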
+
+Once the client has an open nfsd_file pointer it will issue reads,
+writes and commits directly to the underlying local filesystem (normally
+done by the NFS server). As such, for these operations, the NFS client
+is issuing IO to the underlying local filesystem that it is sharing with
+the NFS server. See: fs/nfs/localio.c:nfs_local_doio() and
+fs/nfs/localio.c:nfs_local_commit().
+
+Security
+========
+
+LOCALIO is only supported when UNIX-style authentication (AUTH_UNIX, aka
+AUTH_SYS) is used.
+
+Care is taken to ensure the same NFS security mechanisms are used
+(authentication, etc.) regardless of whether LOCALIO or regular NFS
+access is used. The auth_domain established as part of the traditional
+NFS client access to the NFS server is also used for LOCALIO.
+
+Relative to containers, LOCALIO gives the client access to the network
+namespace the server has. This is required to allow the client to access
+the server's per-namespace nfsd_net struct. With traditional NFS, the
+client is afforded this same level of access (albeit in terms of the NFS
+protocol via SUNRPC). No other namespaces (user, mount, etc.) have been
+altered or purposely extended from the server to the client.
+
+Testing
+=======
+
+The LOCALIO auxiliary protocol and associated NFS LOCALIO read, write
+and commit access have proven stable against various test scenarios:
+
+- Client and server both on the same host.
+
+- All permutations of client and server support enablement for both
+  local and remote client and server.
+
+- Testing against NFS storage products that don't support the LOCALIO
+  protocol was also performed.
+
+- Client on host, server within a container (for both v3 and v4.2).
+  The container testing was in terms of podman-managed containers and
+  includes a successful container stop/restart scenario.
+
+- Formalizing these test scenarios in terms of existing test
+  infrastructure is ongoing. Initial regular coverage is provided in
+  terms of ktest running xfstests against a LOCALIO-enabled NFS loopback
+  mount configuration, and includes lockdep and KASAN coverage, see:
+  https://evilpiepirate.org/~testdashboard/ci?user=snitzer&branch=snitm-nfs-next
+  https://github.com/koverstreet/ktest
+
+- Various kdevops testing (in terms of "Chuck's BuildBot") has been
+  performed to regularly verify the LOCALIO changes haven't caused any
+  regressions to non-LOCALIO NFS use cases.
+
+- All of Hammerspace's various sanity tests pass with LOCALIO enabled
+  (this includes numerous pNFS and flexfiles tests).
diff --git a/Documentation/filesystems/overlayfs.rst b/Documentation/filesystems/overlayfs.rst
index 343644712340..4c8387e1c880 100644
--- a/Documentation/filesystems/overlayfs.rst
+++ b/Documentation/filesystems/overlayfs.rst
@@ -440,6 +440,23 @@ For example::

     fsconfig(fs_fd, FSCONFIG_SET_STRING, "datadir+", "/do2", 0);

+Specifying layers via file descriptors
+--------------------------------------
+
+Since kernel v6.13, overlayfs supports specifying layers via file
+descriptors in addition to specifying them as paths. This feature is
+available for the "datadir+", "lowerdir+", "upperdir", and "workdir"
+mount options with the fsconfig syscall from the new mount api::
+
+	fsconfig(fs_fd, FSCONFIG_SET_FD, "lowerdir+", NULL, fd_lower1);
+	fsconfig(fs_fd, FSCONFIG_SET_FD, "lowerdir+", NULL, fd_lower2);
+	fsconfig(fs_fd, FSCONFIG_SET_FD, "lowerdir+", NULL, fd_lower3);
+	fsconfig(fs_fd, FSCONFIG_SET_FD, "datadir+", NULL, fd_data1);
+	fsconfig(fs_fd, FSCONFIG_SET_FD, "datadir+", NULL, fd_data2);
+	fsconfig(fs_fd, FSCONFIG_SET_FD, "workdir", NULL, fd_work);
+	fsconfig(fs_fd, FSCONFIG_SET_FD, "upperdir", NULL, fd_upper);
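For comparison, the same layer stack can also be specified by path using the
appendable string options; a sketch with illustrative directory names::

    $ mount -t overlay overlay -o lowerdir+=/l1,lowerdir+=/l2,lowerdir+=/l3 \
          -o datadir+=/do1,datadir+=/do2,upperdir=/up,workdir=/work /merged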
+
+
 fs-verity support
 -----------------
diff --git a/Documentation/filesystems/porting.rst b/Documentation/filesystems/porting.rst
index 92bffcc6747a..9ab2a3d6f2b4 100644
--- a/Documentation/filesystems/porting.rst
+++ b/Documentation/filesystems/porting.rst
@@ -177,7 +177,7 @@ settles down a bit.
 **mandatory**

 s_export_op is now required for exporting a filesystem.
-isofs, ext2, ext3, reiserfs, fat
+isofs, ext2, ext3, fat
 can be used as examples of very different filesystems.

 ---
diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
index e834779d9611..6a882c57a7e7 100644
--- a/Documentation/filesystems/proc.rst
+++ b/Documentation/filesystems/proc.rst
@@ -579,7 +579,7 @@ encoded manner. The codes are the following:
     mt    arm64 MTE allocation tags are enabled
     um    userfaultfd missing tracking
     uw    userfaultfd wr-protect tracking
-    ss    shadow stack page
+    ss    shadow/guarded control stack page
     sl    sealed
     ==    =======================================
diff --git a/Documentation/filesystems/tmpfs.rst b/Documentation/filesystems/tmpfs.rst
index 56a26c843dbe..d677e0428c3f 100644
--- a/Documentation/filesystems/tmpfs.rst
+++ b/Documentation/filesystems/tmpfs.rst
@@ -241,6 +241,28 @@ So 'mount -t tmpfs -o size=10G,nr_inodes=10k,mode=700 tmpfs /mytmpfs'
 will give you tmpfs instance on /mytmpfs which can allocate 10GB
 RAM/SWAP in 10240 inodes and it is only accessible by root.

+tmpfs has the following mount options for case-insensitive lookup support:
+
+================= ==============================================================
+casefold          Enable casefold support at this mount point using the given
+                  argument as the encoding standard. Currently only UTF-8
+                  encodings are supported. If no argument is used, it will load
+                  the latest UTF-8 encoding available.
+strict_encoding   Enable strict encoding at this mount point (disabled by
+                  default). In this mode, the filesystem refuses to create
+                  files and directories with names containing invalid UTF-8
+                  characters.
+================= ==============================================================
+
+These options don't render the entire filesystem case-insensitive. One still
+needs to set the casefold flag per directory, by flipping the +F attribute on
+an empty directory; new directories created within it will inherit the
+attribute. The mountpoint itself cannot be made case-insensitive.
+
+Example::
+
+    $ mount -t tmpfs -o casefold=utf8-12.1.0,strict_encoding fs_name /mytmpfs
+    $ mount -t tmpfs -o casefold fs_name /mytmpfs
+

 :Author:
    Christoph Rohland <cr@sap.com>, 1.12.01
@@ -250,3 +272,5 @@ RAM/SWAP in 10240 inodes and it is only accessible by root.
    KOSAKI Motohiro, 16 Mar 2010
 :Updated:
    Chris Down, 13 July 2020
+:Updated:
+   André Almeida, 23 Aug 2024