summaryrefslogtreecommitdiff
path: root/fs/fuse
AgeCommit message (Collapse)Author
2020-12-10fuse: fix bad inodeMiklos Szeredi
Jan Kara's analysis of the syzbot report (edited): The reproducer opens a directory on FUSE filesystem, it then attaches dnotify mark to the open directory. After that a fuse_do_getattr() call finds that attributes returned by the server are inconsistent, and calls make_bad_inode() which, among other things does: inode->i_mode = S_IFREG; This then confuses dnotify which doesn't tear down its structures properly and eventually crashes. Avoid calling make_bad_inode() on a live inode: switch to a private flag on the fuse inode. Also add the test to ops which the bad_inode_ops would have caught. This bug goes back to the initial merge of fuse in 2.6.14... Reported-by: syzbot+f427adf9324b92652ccc@syzkaller.appspotmail.com Signed-off-by: Miklos Szeredi <mszeredi@redhat.com> Tested-by: Jan Kara <jack@suse.cz> Cc: <stable@vger.kernel.org>
2020-11-11fuse: support SB_NOSEC flag to improve write performanceVivek Goyal
Virtiofs can be slow with small writes if xattr are enabled and we are doing cached writes (No direct I/O). Ganesh Mahalingam noticed this. Some debugging showed that file_remove_privs() is called in cached write path on every write. And everytime it calls security_inode_need_killpriv() which results in call to __vfs_getxattr(XATTR_NAME_CAPS). And this goes to file server to fetch xattr. This extra round trip for every write slows down writes tremendously. Normally to avoid paying this penalty on every write, vfs has the notion of caching this information in inode (S_NOSEC). So vfs sets S_NOSEC, if filesystem opted for it using super block flag SB_NOSEC. And S_NOSEC is cleared when setuid/setgid bit is set or when security xattr is set on inode so that next time a write happens, we check inode again for clearing setuid/setgid bits as well clear any security.capability xattr. This seems to work well for local file systems but for remote file systems it is possible that VFS does not have full picture and a different client sets setuid/setgid bit or security.capability xattr on file and that means VFS information about S_NOSEC on another client will be stale. So for remote filesystems SB_NOSEC was disabled by default. Commit 9e1f1de02c22 ("more conservative S_NOSEC handling") mentioned that these filesystems can still make use of SB_NOSEC as long as they clear S_NOSEC when they are refreshing inode attriutes from server. So this patch tries to enable SB_NOSEC on fuse (regular fuse as well as virtiofs). And clear SB_NOSEC when we are refreshing inode attributes. This is enabled only if server supports FUSE_HANDLE_KILLPRIV_V2. This says that server will clear setuid/setgid/security.capability on chown/truncate/write as apporpriate. This should provide tighter coherency because now suid/sgid/ security.capability will be cleared even if fuse client cache has not seen these attrs. Basic idea is that fuse client will trigger suid/sgid/security.capability clearing based on its attr cache. But even if cache has gone stale, it is fine because FUSE_HANDLE_KILLPRIV_V2 will make sure WRITE clear suid/sgid/security.capability. We make this change only if server supports FUSE_HANDLE_KILLPRIV_V2. This should make sure that existing filesystems which might be relying on seucurity.capability always being queried from server are not impacted. This tighter coherency relies on WRITE showing up on server (and not being cached in guest). So writeback_cache mode will not provide that tight coherency and it is not recommended to use two together. Having said that it might work reasonably well for lot of use cases. This change improves random write performance very significantly. Running virtiofsd with cache=auto and following fio command: fio --ioengine=libaio --direct=1 --name=test --filename=/mnt/virtiofs/random_read_write.fio --bs=4k --iodepth=64 --size=4G --readwrite=randwrite Bandwidth increases from around 50MB/s to around 250MB/s as a result of applying this patch. So improvement is very significant. Link: https://github.com/kata-containers/runtime/issues/2815 Reported-by: "Mahalingam, Ganesh" <ganesh.mahalingam@intel.com> Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-11-11fuse: add a flag FUSE_OPEN_KILL_SUIDGID for open() requestVivek Goyal
With FUSE_HANDLE_KILLPRIV_V2 support, server will need to kill suid/sgid/ security.capability on open(O_TRUNC), if server supports FUSE_ATOMIC_O_TRUNC. But server needs to kill suid/sgid only if caller does not have CAP_FSETID. Given server does not have this information, client needs to send this info to server. So add a flag FUSE_OPEN_KILL_SUIDGID to fuse_open_in request which tells server to kill suid/sgid (only if group execute is set). This flag is added to the FUSE_OPEN request, as well as the FUSE_CREATE request if the create was non-exclusive, since that might result in an existing file being opened/truncated. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-11-11fuse: don't send ATTR_MODE to kill suid/sgid for handle_killpriv_v2Vivek Goyal
If client does a write() on a suid/sgid file, VFS will first call fuse_setattr() with ATTR_KILL_S[UG]ID set. This requires sending setattr to file server with ATTR_MODE set to kill suid/sgid. But to do that client needs to know latest mode otherwise it is racy. To reduce the race window, current code first call fuse_do_getattr() to get latest ->i_mode and then resets suid/sgid bits and sends rest to server with setattr(ATTR_MODE). This does not reduce the race completely but narrows race window significantly. With fc->handle_killpriv_v2 enabled, it should be possible to remove this race completely. Do not kill suid/sgid with ATTR_MODE at all. It will be killed by server when WRITE request is sent to server soon. This is similar to fc->handle_killpriv logic. V2 is just more refined version of protocol. Hence this patch does not send ATTR_MODE to kill suid/sgid if fc->handle_killpriv_v2 is enabled. This creates an issue if fc->writeback_cache is enabled. In that case WRITE can be cached in guest and server might not see WRITE request and hence will not kill suid/sgid. Miklos suggested that in such cases, we should fallback to a writethrough WRITE instead and that will generate WRITE request and kill suid/sgid. This patch implements that too. But this relies on client seeing the suid/sgid set. If another client sets suid/sgid and this client does not see it immideately, then we will not fallback to writethrough WRITE. So this is one limitation with both fc->handle_killpriv_v2 and fc->writeback_cache enabled. Both the options are not fully compatible. But might be good enough for many use cases. Note: This patch is not checking whether security.capability is set or not when falling back to writethrough path. If suid/sgid is not set and only security.capability is set, that will be taken care of by file_remove_privs() call in ->writeback_cache path. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-11-11fuse: setattr should set FATTR_KILL_SUIDGIDVivek Goyal
If fc->handle_killpriv_v2 is enabled, we expect file server to clear suid/sgid/security.capbility upon chown/truncate/write as appropriate. Upon truncate (ATTR_SIZE), suid/sgid are cleared only if caller does not have CAP_FSETID. File server does not know whether caller has CAP_FSETID or not. Hence set FATTR_KILL_SUIDGID upon truncate to let file server know that caller does not have CAP_FSETID and it should kill suid/sgid as appropriate. On chown (ATTR_UID/ATTR_GID) suid/sgid need to be cleared irrespective of capabilities of calling process, so set FATTR_KILL_SUIDGID unconditionally in that case. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-11-11fuse: set FUSE_WRITE_KILL_SUIDGID in cached write pathVivek Goyal
With HANDLE_KILLPRIV_V2, server will need to kill suid/sgid if caller does not have CAP_FSETID. We already have a flag FUSE_WRITE_KILL_SUIDGID in WRITE request and we already set it in direct I/O path. To make it work in cached write path also, start setting FUSE_WRITE_KILL_SUIDGID in this path too. Set it only if fc->handle_killpriv_v2 is set. Otherwise client is responsible for kill suid/sgid. In case of direct I/O we set FUSE_WRITE_KILL_SUIDGID unconditionally because we don't call file_remove_privs() in that path (with cache=none option). Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-11-11fuse: rename FUSE_WRITE_KILL_PRIV to FUSE_WRITE_KILL_SUIDGIDMiklos Szeredi
Kernel has: ATTR_KILL_PRIV -> clear "security.capability" ATTR_KILL_SUID -> clear S_ISUID ATTR_KILL_SGID -> clear S_ISGID if executable Fuse has: FUSE_WRITE_KILL_PRIV -> clear S_ISUID and S_ISGID if executable So FUSE_WRITE_KILL_PRIV implies the complement of ATTR_KILL_PRIV, which is somewhat confusing. Also PRIV implies all privileges, including "security.capability". Change the name to FUSE_WRITE_KILL_SUIDGID and make FUSE_WRITE_KILL_PRIV an alias to perserve API compatibility Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-11-11fuse: introduce the notion of FUSE_HANDLE_KILLPRIV_V2Vivek Goyal
We already have FUSE_HANDLE_KILLPRIV flag that says that file server will remove suid/sgid/caps on truncate/chown/write. But that's little different from what Linux VFS implements. To be consistent with Linux VFS behavior what we want is. - caps are always cleared on chown/write/truncate - suid is always cleared on chown, while for truncate/write it is cleared only if caller does not have CAP_FSETID. - sgid is always cleared on chown, while for truncate/write it is cleared only if caller does not have CAP_FSETID as well as file has group execute permission. As previous flag did not provide above semantics. Implement a V2 of the protocol with above said constraints. Server does not know if caller has CAP_FSETID or not. So for the case of write()/truncate(), client will send information in special flag to indicate whether to kill priviliges or not. These changes are in subsequent patches. FUSE_HANDLE_KILLPRIV_V2 relies on WRITE being sent to server to clear suid/sgid/security.capability. But with ->writeback_cache, WRITES are cached in guest. So it is not recommended to use FUSE_HANDLE_KILLPRIV_V2 and writeback_cache together. Though it probably might be good enough for lot of use cases. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-11-11fuse: always revalidate if exclusive createMiklos Szeredi
Failure to do so may result in EEXIST even if the file only exists in the cache and not in the filesystem. The atomic nature of O_EXCL mandates that the cached state should be ignored and existence verified anew. Reported-by: Ken Schalk <kschalk@nvidia.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-11-11virtiofs: clean up error handling in virtio_fs_get_tree()Miklos Szeredi
Avoid duplicating error cleanup. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-11-11fuse: add fuse_sb_destroy() helperMiklos Szeredi
This is to avoid minor code duplication between fuse_kill_sb_anon() and fuse_kill_sb_blk(). Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-11-11fuse: simplify get_fuse_conn*()Miklos Szeredi
All callers dereference the result, so no point in checking for NULL pointer dereference here. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-11-11fuse: get rid of fuse_mount refcountMiklos Szeredi
Fuse mount now only ever has a refcount of one (before being freed) so the count field is unnecessary. Remove the refcounting and fold fuse_mount_put() into callers. The only caller of fuse_mount_put() where fm->fc was NULL is fuse_dentry_automount() and here the fuse_conn_put() can simply be omitted. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-11-11virtiofs: simplify sb setupMiklos Szeredi
Currently when acquiring an sb for virtiofs fuse_mount_get() is being called from virtio_fs_set_super() if a new sb is being filled and fuse_mount_put() is called unconditionally after sget_fc() returns. The exact same result can be obtained by checking whether fs_contex->s_fs_info was set to NULL (ref trasferred to sb->s_fs_info) and only calling fuse_mount_put() if the ref wasn't transferred (error or matching sb found). This allows getting rid of virtio_fs_set_super() and fuse_mount_get(). Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-11-11virtiofs fix leak in setupMiklos Szeredi
This can be triggered for example by adding the "-omand" mount option, which will be rejected and virtio_fs_fill_super() will return an error. In such a case the allocations for fuse_conn and fuse_mount will leak due to s_root not yet being set and so ->put_super() not being called. Fixes: a62a8ef9d97d ("virtio-fs: add virtiofs filesystem") Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-11-11fuse: launder page should wait for page writebackMiklos Szeredi
Qian Cai reports that the WARNING in tree_insert() can be triggered by a fuzzer with the following call chain: invalidate_inode_pages2_range() fuse_launder_page() fuse_writepage_locked() tree_insert() The reason is that another write for the same page is already queued. The simplest fix is to wait until the pending write is completed and only after that queue the new write. Since this case is very rare, the additional wait should not be a problem. Reported-by: Qian Cai <cai@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-10-19Merge tag 'fuse-update-5.10' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse Pull fuse updates from Miklos Szeredi: - Support directly accessing host page cache from virtiofs. This can improve I/O performance for various workloads, as well as reducing the memory requirement by eliminating double caching. Thanks to Vivek Goyal for doing most of the work on this. - Allow automatic submounting inside virtiofs. This allows unique st_dev/ st_ino values to be assigned inside the guest to files residing on different filesystems on the host. Thanks to Max Reitz for the patches. - Fix an old use after free bug found by Pradeep P V K. * tag 'fuse-update-5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: (25 commits) virtiofs: calculate number of scatter-gather elements accurately fuse: connection remove fix fuse: implement crossmounts fuse: Allow fuse_fill_super_common() for submounts fuse: split fuse_mount off of fuse_conn fuse: drop fuse_conn parameter where possible fuse: store fuse_conn in fuse_req fuse: add submount support to <uapi/linux/fuse.h> fuse: fix page dereference after free virtiofs: add logic to free up a memory range virtiofs: maintain a list of busy elements virtiofs: serialize truncate/punch_hole and dax fault path virtiofs: define dax address space operations virtiofs: add DAX mmap support virtiofs: implement dax read/write operations virtiofs: introduce setupmapping/removemapping commands virtiofs: implement FUSE_INIT map_alignment field virtiofs: keep a list of free dax memory ranges virtiofs: add a mount option to enable dax virtiofs: set up virtio_fs dax_device ...
2020-10-14virtiofs: calculate number of scatter-gather elements accuratelyVivek Goyal
virtiofs currently maps various buffers in scatter gather list and it looks at number of pages (ap->pages) and assumes that same number of pages will be used both for input and output (sg_count_fuse_req()), and calculates total number of scatterlist elements accordingly. But looks like this assumption is not valid in all the cases. For example, Cai Qian reported that trinity, triggers warning with virtiofs sometimes. A closer look revealed that if one calls ioctl(fd, 0x5a004000, buf), it will trigger following warning. WARN_ON(out_sgs + in_sgs != total_sgs) In this case, total_sgs = 8, out_sgs=4, in_sgs=3. Number of pages is 2 (ap->pages), but out_sgs are using both the pages but in_sgs are using only one page. In this case, fuse_do_ioctl() sets different size values for input and output. args->in_args[args->in_numargs - 1].size == 6656 args->out_args[args->out_numargs - 1].size == 4096 So current method of calculating how many scatter-gather list elements will be used is not accurate. Make calculations more precise by parsing size and ap->descs. Reported-by: Qian Cai <cai@redhat.com> Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Link: https://lore.kernel.org/linux-fsdevel/5ea77e9f6cb8c2db43b09fbd4158ab2d8c066a0a.camel@redhat.com/ Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-10-13Merge tag 'block-5.10-2020-10-12' of git://git.kernel.dk/linux-blockLinus Torvalds
Pull block updates from Jens Axboe: - Series of merge handling cleanups (Baolin, Christoph) - Series of blk-throttle fixes and cleanups (Baolin) - Series cleaning up BDI, seperating the block device from the backing_dev_info (Christoph) - Removal of bdget() as a generic API (Christoph) - Removal of blkdev_get() as a generic API (Christoph) - Cleanup of is-partition checks (Christoph) - Series reworking disk revalidation (Christoph) - Series cleaning up bio flags (Christoph) - bio crypt fixes (Eric) - IO stats inflight tweak (Gabriel) - blk-mq tags fixes (Hannes) - Buffer invalidation fixes (Jan) - Allow soft limits for zone append (Johannes) - Shared tag set improvements (John, Kashyap) - Allow IOPRIO_CLASS_RT for CAP_SYS_NICE (Khazhismel) - DM no-wait support (Mike, Konstantin) - Request allocation improvements (Ming) - Allow md/dm/bcache to use IO stat helpers (Song) - Series improving blk-iocost (Tejun) - Various cleanups (Geert, Damien, Danny, Julia, Tetsuo, Tian, Wang, Xianting, Yang, Yufen, yangerkun) * tag 'block-5.10-2020-10-12' of git://git.kernel.dk/linux-block: (191 commits) block: fix uapi blkzoned.h comments blk-mq: move cancel of hctx->run_work to the front of blk_exit_queue blk-mq: get rid of the dead flush handle code path block: get rid of unnecessary local variable block: fix comment and add lockdep assert blk-mq: use helper function to test hw stopped block: use helper function to test queue register block: remove redundant mq check block: invoke blk_mq_exit_sched no matter whether have .exit_sched percpu_ref: don't refer to ref->data if it isn't allocated block: ratelimit handle_bad_sector() message blk-throttle: Re-use the throtl_set_slice_end() blk-throttle: Open code __throtl_de/enqueue_tg() blk-throttle: Move service tree validation out of the throtl_rb_first() blk-throttle: Move the list operation after list validation blk-throttle: Fix IO hang for a corner case blk-throttle: Avoid tracking latency if low limit is invalid blk-throttle: Avoid getting the current time if tg->last_finish_time is 0 blk-throttle: Remove a meaningless parameter for throtl_downgrade_state() block: Remove redundant 'return' statement ...
2020-10-12fuse: connection remove fixMiklos Szeredi
Re-add lost removal of fc from fuse_conn_list and the control filesystem. Reported-by: kernel test robot <rong.a.chen@intel.com> Fixes: fcee216beb9c ("fuse: split fuse_mount off of fuse_conn") Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-10-09fuse: implement crossmountsMax Reitz
FUSE servers can indicate crossmount points by setting FUSE_ATTR_SUBMOUNT in fuse_attr.flags. The inode will then be marked as S_AUTOMOUNT, and the .d_automount implementation creates a new submount at that location, so that the submount gets a distinct st_dev value. Note that all submounts get a distinct superblock and a distinct st_dev value, so for virtio-fs, even if the same filesystem is mounted more than once on the host, none of its mount points will have the same st_dev. We need distinct superblocks because the superblock points to the root node, but the different host mounts may show different trees (e.g. due to submounts in some of them, but not in others). Right now, this behavior is only enabled when fuse_conn.auto_submounts is set, which is the case only for virtio-fs. Signed-off-by: Max Reitz <mreitz@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-09-24bdi: invert BDI_CAP_NO_ACCT_WBChristoph Hellwig
Replace BDI_CAP_NO_ACCT_WB with a positive BDI_CAP_WRITEBACK_ACCT to make the checks more obvious. Also remove the pointless bdi_cap_account_writeback wrapper that just obsfucates the check. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-24bdi: initialize ->ra_pages and ->io_pages in bdi_initChristoph Hellwig
Set up a readahead size by default, as very few users have a good reason to change it. This means code, ecryptfs, and orangefs now set up the values while they were previously missing it, while ubifs, mtd and vboxsf manually set it to 0 to avoid readahead. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Acked-by: David Sterba <dsterba@suse.com> [btrfs] Acked-by: Richard Weinberger <richard@nod.at> [ubifs, mtd] Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-09-18fuse: Allow fuse_fill_super_common() for submountsMax Reitz
Submounts have their own superblock, which needs to be initialized. However, they do not have a fuse_fs_context associated with them, and the root node's attributes should be taken from the mountpoint's node. Extend fuse_fill_super_common() to work for submounts by making the @ctx parameter optional, and by adding a @submount_finode parameter. (There is a plain "unsigned" in an existing code block that is being indented by this commit. Extend it to "unsigned int" so checkpatch does not complain.) Signed-off-by: Max Reitz <mreitz@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-09-18fuse: split fuse_mount off of fuse_connMax Reitz
We want to allow submounts for the same fuse_conn, but with different superblocks so that each of the submounts has its own device ID. To do so, we need to split all mount-specific information off of fuse_conn into a new fuse_mount structure, so that multiple mounts can share a single fuse_conn. We need to take care only to perform connection-level actions once (i.e. when the fuse_conn and thus the first fuse_mount are established, or when the last fuse_mount and thus the fuse_conn are destroyed). For example, fuse_sb_destroy() must invoke fuse_send_destroy() until the last superblock is released. To do so, we keep track of which fuse_mount is the root mount and perform all fuse_conn-level actions only when this fuse_mount is involved. Signed-off-by: Max Reitz <mreitz@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-09-18fuse: drop fuse_conn parameter where possibleMax Reitz
With the last commit, all functions that handle some existing fuse_req no longer need to be given the associated fuse_conn, because they can get it from the fuse_req object. Signed-off-by: Max Reitz <mreitz@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-09-18fuse: store fuse_conn in fuse_reqMax Reitz
Every fuse_req belongs to a fuse_conn. Right now, we always know which fuse_conn that is based on the respective device, but we want to allow multiple (sub)mounts per single connection, and then the corresponding filesystem is not going to be so trivial to obtain. Storing a pointer to the associated fuse_conn in every fuse_req will allow us to trivially find any request's superblock (and thus filesystem) even then. Signed-off-by: Max Reitz <mreitz@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-09-18fuse: fix page dereference after freeMiklos Szeredi
After unlock_request() pages from the ap->pages[] array may be put (e.g. by aborting the connection) and the pages can be freed. Prevent use after free by grabbing a reference to the page before calling unlock_request(). The original patch was created by Pradeep P V K. Reported-by: Pradeep P V K <ppvk@codeaurora.org> Cc: <stable@vger.kernel.org> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-09-17fuse: fix the ->direct_IO() treatment of iov_iterAl Viro
the callers rely upon having any iov_iter_truncate() done inside ->direct_IO() countered by iov_iter_reexpand(). Reported-by: Qian Cai <cai@redhat.com> Tested-by: Qian Cai <cai@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-09-10virtiofs: add logic to free up a memory rangeVivek Goyal
Add logic to free up a busy memory range. Freed memory range will be returned to free pool. Add a worker which can be started to select and free some busy memory ranges. Process can also steal one of its busy dax ranges if free range is not available. I will refer it to as direct reclaim. If free range is not available and nothing can't be stolen from same inode, caller waits on a waitq for free range to become available. For reclaiming a range, as of now we need to hold following locks in specified order. down_write(&fi->i_mmap_sem); down_write(&fi->dax->sem); We look for a free range in following order. A. Try to get a free range. B. If not, try direct reclaim. C. If not, wait for a memory range to become free Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-09-10virtiofs: maintain a list of busy elementsVivek Goyal
This list will be used selecting fuse_dax_mapping to free when number of free mappings drops below a threshold. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-09-10virtiofs: serialize truncate/punch_hole and dax fault pathVivek Goyal
Currently in fuse we don't seem have any lock which can serialize fault path with truncate/punch_hole path. With dax support I need one for following reasons. 1. Dax requirement DAX fault code relies on inode size being stable for the duration of fault and want to serialize with truncate/punch_hole and they explicitly mention it. static vm_fault_t dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp, const struct iomap_ops *ops) /* * Check whether offset isn't beyond end of file now. Caller is * supposed to hold locks serializing us with truncate / punch hole so * this is a reliable test. */ max_pgoff = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE); 2. Make sure there are no users of pages being truncated/punch_hole get_user_pages() might take references to page and then do some DMA to said pages. Filesystem might truncate those pages without knowing that a DMA is in progress or some I/O is in progress. So use dax_layout_busy_page() to make sure there are no such references and I/O is not in progress on said pages before moving ahead with truncation. 3. Limitation of kvm page fault error reporting If we are truncating file on host first and then removing mappings in guest lateter (truncate page cache etc), then this could lead to a problem with KVM. Say a mapping is in place in guest and truncation happens on host. Now if guest accesses that mapping, then host will take a fault and kvm will either exit to qemu or spin infinitely. IOW, before we do truncation on host, we need to make sure that guest inode does not have any mapping in that region or whole file. 4. virtiofs memory range reclaim Soon I will introduce the notion of being able to reclaim dax memory ranges from a fuse dax inode. There also I need to make sure that no I/O or fault is going on in the reclaimed range and nobody is using it so that range can be reclaimed without issues. Currently if we take inode lock, that serializes read/write. But it does not do anything for faults. So I add another semaphore fuse_inode->i_mmap_sem for this purpose. It can be used to serialize with faults. As of now, I am adding taking this semaphore only in dax fault path and not regular fault path because existing code does not have one. May be existing code can benefit from it as well to take care of some races, but that we can fix later if need be. For now, I am just focussing only on DAX path which is new path. Also added logic to take fuse_inode->i_mmap_sem in truncate/punch_hole/open(O_TRUNC) path to make sure file truncation and fuse dax fault are mutually exlusive and avoid all the above problems. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Cc: Dave Chinner <david@fromorbit.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-09-10virtiofs: define dax address space operationsVivek Goyal
This is done along the lines of ext4 and xfs. I primarily wanted ->writepages hook at this time so that I could call into dax_writeback_mapping_range(). This in turn will decide which pfns need to be written back. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-09-10virtiofs: add DAX mmap supportStefan Hajnoczi
Add DAX mmap() support. Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-09-10virtiofs: implement dax read/write operationsVivek Goyal
This patch implements basic DAX support. mmap() is not implemented yet and will come in later patches. This patch looks into implemeting read/write. We make use of interval tree to keep track of per inode dax mappings. Do not use dax for file extending writes, instead just send WRITE message to daemon (like we do for direct I/O path). This will keep write and i_size change atomic w.r.t crash. Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com> Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com> Signed-off-by: Peng Tao <tao.peng@linux.alibaba.com> Cc: Dave Chinner <david@fromorbit.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-09-10virtiofs: implement FUSE_INIT map_alignment fieldStefan Hajnoczi
The device communicates FUSE_SETUPMAPPING/FUSE_REMOVMAPPING alignment constraints via the FUST_INIT map_alignment field. Parse this field and ensure our DAX mappings meet the alignment constraints. We don't actually align anything differently since our mappings are already 2MB aligned. Just check the value when the connection is established. If it becomes necessary to honor arbitrary alignments in the future we'll have to adjust how mappings are sized. The upshot of this commit is that we can be confident that mappings will work even when emulating x86 on Power and similar combinations where the host page sizes are different. Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-09-10virtiofs: keep a list of free dax memory rangesVivek Goyal
Divide the dax memory range into fixed size ranges (2MB for now) and put them in a list. This will track free ranges. Once an inode requires a free range, we will take one from here and put it in interval-tree of ranges assigned to inode. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Peng Tao <tao.peng@linux.alibaba.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-09-10virtiofs: add a mount option to enable daxVivek Goyal
Add a mount option to allow using dax with virtio_fs. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-09-10virtiofs: set up virtio_fs dax_deviceStefan Hajnoczi
Setup a dax device. Use the shm capability to find the cache entry and map it. The DAX window is accessed by the fs/dax.c infrastructure and must have struct pages (at least on x86). Use devm_memremap_pages() to map the DAX window PCI BAR and allocate struct page. Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com> Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com> Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-09-10virtiofs: get rid of no_mount_optionsVivek Goyal
This option was introduced so that for virtio_fs we don't show any mounts options fuse_show_options(). Because we don't offer any of these options to be controlled by mounter. Very soon we are planning to introduce option "dax" which mounter should be able to specify. And no_mount_options does not work anymore. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-09-10virtiofs: provide a helper function for virtqueue initializationVivek Goyal
This reduces code duplication and make it little easier to read code. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-09-04fuse: update project homepageAndré Almeida
As stated in https://sourceforge.net/projects/fuse/, "the FUSE project has moved to https://github.com/libfuse/" in 22-Dec-2015. Update URLs to reflect this. Signed-off-by: André Almeida <andrealmeid@collabora.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-08-11Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhostLinus Torvalds
Pull virtio updates from Michael Tsirkin: - IRQ bypass support for vdpa and IFC - MLX5 vdpa driver - Endianness fixes for virtio drivers - Misc other fixes * tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: (71 commits) vdpa/mlx5: fix up endian-ness for mtu vdpa: Fix pointer math bug in vdpasim_get_config() vdpa/mlx5: Fix pointer math in mlx5_vdpa_get_config() vdpa/mlx5: fix memory allocation failure checks vdpa/mlx5: Fix uninitialised variable in core/mr.c vdpa_sim: init iommu lock virtio_config: fix up warnings on parisc vdpa/mlx5: Add VDPA driver for supported mlx5 devices vdpa/mlx5: Add shared memory registration code vdpa/mlx5: Add support library for mlx5 VDPA implementation vdpa/mlx5: Add hardware descriptive header file vdpa: Modify get_vq_state() to return error code net/vdpa: Use struct for set/get vq state vdpa: remove hard coded virtq num vdpasim: support batch updating vhost-vdpa: support IOTLB batching hints vhost-vdpa: support get/set backend features vhost: generialize backend features setting/getting vhost-vdpa: refine ioctl pre-processing vDPA: dont change vq irq after DRIVER_OK ...
2020-08-05virtio_fs: convert to LE accessorsMichael S. Tsirkin
Virtio fs is modern-only. Use LE accessors for config space. Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2020-08-04Merge tag 'uninit-macro-v5.9-rc1' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull uninitialized_var() macro removal from Kees Cook: "This is long overdue, and has hidden too many bugs over the years. The series has several "by hand" fixes, and then a trivial treewide replacement. - Clean up non-trivial uses of uninitialized_var() - Update documentation and checkpatch for uninitialized_var() removal - Treewide removal of uninitialized_var()" * tag 'uninit-macro-v5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: compiler: Remove uninitialized_var() macro treewide: Remove uninitialized_var() usage checkpatch: Remove awareness of uninitialized_var() macro mm/debug_vm_pgtable: Remove uninitialized_var() usage f2fs: Eliminate usage of uninitialized_var() macro media: sur40: Remove uninitialized_var() usage KVM: PPC: Book3S PR: Remove uninitialized_var() usage clk: spear: Remove uninitialized_var() usage clk: st: Remove uninitialized_var() usage spi: davinci: Remove uninitialized_var() usage ide: Remove uninitialized_var() usage rtlwifi: rtl8192cu: Remove uninitialized_var() usage b43: Remove uninitialized_var() usage drbd: Remove uninitialized_var() usage x86/mm/numa: Remove uninitialized_var() usage docs: deprecated.rst: Add uninitialized_var()
2020-07-16treewide: Remove uninitialized_var() usageKees Cook
Using uninitialized_var() is dangerous as it papers over real bugs[1] (or can in the future), and suppresses unrelated compiler warnings (e.g. "unused variable"). If the compiler thinks it is uninitialized, either simply initialize the variable or make compiler changes. In preparation for removing[2] the[3] macro[4], remove all remaining needless uses with the following script: git grep '\buninitialized_var\b' | cut -d: -f1 | sort -u | \ xargs perl -pi -e \ 's/\buninitialized_var\(([^\)]+)\)/\1/g; s:\s*/\* (GCC be quiet|to make compiler happy) \*/$::g;' drivers/video/fbdev/riva/riva_hw.c was manually tweaked to avoid pathological white-space. No outstanding warnings were found building allmodconfig with GCC 9.3.0 for x86_64, i386, arm64, arm, powerpc, powerpc64le, s390x, mips, sparc64, alpha, and m68k. [1] https://lore.kernel.org/lkml/20200603174714.192027-1-glider@google.com/ [2] https://lore.kernel.org/lkml/CA+55aFw+Vbj0i=1TGqCR5vQkCzWJ0QxK6CernOU6eedsudAixw@mail.gmail.com/ [3] https://lore.kernel.org/lkml/CA+55aFwgbgqhbp1fkxvRKEpzyR5J8n1vKT1VZdz9knmPuXhOeg@mail.gmail.com/ [4] https://lore.kernel.org/lkml/CA+55aFz2500WfbKXAx8s67wrm9=yVJu65TpLgN_ybYNv0VEOKA@mail.gmail.com/ Reviewed-by: Leon Romanovsky <leonro@mellanox.com> # drivers/infiniband and mlx4/mlx5 Acked-by: Jason Gunthorpe <jgg@mellanox.com> # IB Acked-by: Kalle Valo <kvalo@codeaurora.org> # wireless drivers Reviewed-by: Chao Yu <yuchao0@huawei.com> # erofs Signed-off-by: Kees Cook <keescook@chromium.org>
2020-07-15fuse: Fix parameter for FS_IOC_{GET,SET}FLAGSChirantan Ekbote
The ioctl encoding for this parameter is a long but the documentation says it should be an int and the kernel drivers expect it to be an int. If the fuse driver treats this as a long it might end up scribbling over the stack of a userspace process that only allocated enough space for an int. This was previously discussed in [1] and a patch for fuse was proposed in [2]. From what I can tell the patch in [2] was nacked in favor of adding new, "fixed" ioctls and using those from userspace. However there is still no "fixed" version of these ioctls and the fact is that it's sometimes infeasible to change all userspace to use the new one. Handling the ioctls specially in the fuse driver seems like the most pragmatic way for fuse servers to support them without causing crashes in userspace applications that call them. [1]: https://lore.kernel.org/linux-fsdevel/20131126200559.GH20559@hall.aurel32.net/T/ [2]: https://sourceforge.net/p/fuse/mailman/message/31771759/ Signed-off-by: Chirantan Ekbote <chirantan@chromium.org> Fixes: 59efec7b9039 ("fuse: implement ioctl support") Cc: <stable@vger.kernel.org> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-07-14fuse: don't ignore errors from fuse_writepages_fill()Vasily Averin
fuse_writepages() ignores some errors taken from fuse_writepages_fill() I believe it is a bug: if .writepages is called with WB_SYNC_ALL it should either guarantee that all data was successfully saved or return error. Fixes: 26d614df1da9 ("fuse: Implement writepages callback") Signed-off-by: Vasily Averin <vvs@virtuozzo.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-07-14fuse: clean up condition for writepage sendingMiklos Szeredi
fuse_writepages_fill uses following construction: if (wpa && ap->num_pages && (A || B || C)) { action; } else if (wpa && D) { if (E) { the same action; } } - ap->num_pages check is always true and can be removed - "if" and "else if" calls the same action and can be merged. Move checking A, B, C, D, E conditions to a helper, add comments. Original-patch-by: Vasily Averin <vvs@virtuozzo.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-07-14fuse: reject options on reconfigure via fsconfig(2)Miklos Szeredi
Previous patch changed handling of remount/reconfigure to ignore all options, including those that are unknown to the fuse kernel fs. This was done for backward compatibility, but this likely only affects the old mount(2) API. The new fsconfig(2) based reconfiguration could possibly be improved. This would make the new API less of a drop in replacement for the old, OTOH this is a good chance to get rid of some weirdnesses in the old API. Several other behaviors might make sense: 1) unknown options are rejected, known options are ignored 2) unknown options are rejected, known options are rejected if the value is changed, allowed otherwise 3) all options are rejected Prior to the backward compatibility fix to ignore all options all known options were accepted (1), even if they change the value of a mount parameter; fuse_reconfigure() does not look at the config values set by fuse_parse_param(). To fix that we'd need to verify that the value provided is the same as set in the initial configuration (2). The major drawback is that this is much more complex than just rejecting all attempts at changing options (3); i.e. all options signify initial configuration values and don't make sense on reconfigure. This patch opts for (3) with the rationale that no mount options are reconfigurable in fuse. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>