summaryrefslogtreecommitdiff
path: root/fs/fuse/virtio_fs.c
AgeCommit message (Collapse)Author
2022-01-18Merge tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhostLinus Torvalds
Pull virtio updates from Michael Tsirkin: "virtio,vdpa,qemu_fw_cfg: features, cleanups, and fixes. - partial support for < MAX_ORDER - 1 granularity for virtio-mem - driver_override for vdpa - sysfs ABI documentation for vdpa - multiqueue config support for mlx5 vdpa - and misc fixes, cleanups" * tag 'for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost: (42 commits) vdpa/mlx5: Fix tracking of current number of VQs vdpa/mlx5: Fix is_index_valid() to refer to features vdpa: Protect vdpa reset with cf_mutex vdpa: Avoid taking cf_mutex lock on get status vdpa/vdpa_sim_net: Report max device capabilities vdpa: Use BIT_ULL for bit operations vdpa/vdpa_sim: Configure max supported virtqueues vdpa/mlx5: Report max device capabilities vdpa: Support reporting max device capabilities vdpa/mlx5: Restore cur_num_vqs in case of failure in change_num_qps() vdpa: Add support for returning device configuration information vdpa/mlx5: Support configuring max data virtqueue vdpa/mlx5: Fix config_attr_mask assignment vdpa: Allow to configure max data virtqueues vdpa: Read device configuration only if FEATURES_OK vdpa: Sync calls set/get config/status with cf_mutex vdpa/mlx5: Distribute RX virtqueues in RQT object vdpa: Provide interface to read driver features vdpa: clean up get_config_size ret value handling virtio_ring: mark ring unused on error ...
2022-01-14virtio: wrap config->reset callsMichael S. Tsirkin
This will enable cleanups down the road. The idea is to disable cbs, then add "flush_queued_cbs" callback as a parameter, this way drivers can flush any work queued after callbacks have been disabled. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Link: https://lore.kernel.org/r/20211013105226.20225-1-mst@redhat.com Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2022-01-12Merge tag 'libnvdimm-for-5.17' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm Pull dax and libnvdimm updates from Dan Williams: "The bulk of this is a rework of the dax_operations API after discovering the obstacles it posed to the work-in-progress DAX+reflink support for XFS and other copy-on-write filesystem mechanics. Primarily the need to plumb a block_device through the API to handle partition offsets was a sticking point and Christoph untangled that dependency in addition to other cleanups to make landing the DAX+reflink support easier. The DAX_PMEM_COMPAT option has been around for 4 years and not only are distributions shipping userspace that understand the current configuration API, but some are not even bothering to turn this option on anymore, so it seems a good time to remove it per the deprecation schedule. Recall that this was added after the device-dax subsystem moved from /sys/class/dax to /sys/bus/dax for its sysfs organization. All recent functionality depends on /sys/bus/dax. Some other miscellaneous cleanups and reflink prep patches are included as well. Summary: - Simplify the dax_operations API: - Eliminate bdev_dax_pgoff() in favor of the filesystem maintaining and applying a partition offset to all its DAX iomap operations. - Remove wrappers and device-mapper stacked callbacks for ->copy_from_iter() and ->copy_to_iter() in favor of moving block_device relative offset responsibility to the dax_direct_access() caller. - Remove the need for an @bdev in filesystem-DAX infrastructure - Remove unused uio helpers copy_from_iter_flushcache() and copy_mc_to_iter() as only the non-check_copy_size() versions are used for DAX. - Prepare XFS for the pending (next merge window) DAX+reflink support - Remove deprecated DEV_DAX_PMEM_COMPAT support - Cleanup a straggling misuse of the GUID api" * tag 'libnvdimm-for-5.17' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (38 commits) iomap: Fix error handling in iomap_zero_iter() ACPI: NFIT: Import GUID before use dax: remove the copy_from_iter and copy_to_iter methods dax: remove the DAXDEV_F_SYNC flag dax: simplify dax_synchronous and set_dax_synchronous uio: remove copy_from_iter_flushcache() and copy_mc_to_iter() iomap: turn the byte variable in iomap_zero_iter into a ssize_t memremap: remove support for external pgmap refcounts fsdax: don't require CONFIG_BLOCK iomap: build the block based code conditionally dax: fix up some of the block device related ifdefs fsdax: shift partition offset handling into the file systems dax: return the partition offset from fs_dax_get_by_bdev iomap: add a IOMAP_DAX flag xfs: pass the mapping flags to xfs_bmbt_to_iomap xfs: use xfs_direct_write_iomap_ops for DAX zeroing xfs: move dax device handling into xfs_{alloc,free}_buftarg ext4: cleanup the dax handling in ext4_fill_super ext2: cleanup the dax handling in ext2_fill_super fsdax: decouple zeroing from the iomap buffered I/O code ...
2021-12-18dax: remove the copy_from_iter and copy_to_iter methodsChristoph Hellwig
These methods indirect the actual DAX read/write path. In the end pmem uses magic flush and mc safe variants and fuse and dcssblk use plain ones while device mapper picks redirects to the underlying device. Add set_dax_nocache() and set_dax_nomc() APIs to control which copy routines are used to remove indirect call from the read/write fast path as well as a lot of boilerplate code. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Vivek Goyal <vgoyal@redhat.com> [virtiofs] Link: https://lore.kernel.org/r/20211215084508.435401-5-hch@lst.de Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2021-12-18dax: remove the DAXDEV_F_SYNC flagChristoph Hellwig
Remove the DAXDEV_F_SYNC flag and thus the flags argument to alloc_dax and just let the drivers call set_dax_synchronous directly. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Pankaj Gupta <pankaj.gupta@ionos.com> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Link: https://lore.kernel.org/r/20211215084508.435401-4-hch@lst.de Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2021-12-14fuse: make DAX mount option a tri-stateJeffle Xu
We add 'always', 'never', and 'inode' (default). '-o dax' continues to operate the same which is equivalent to 'always'. The following behavior is consistent with that on ext4/xfs: - The default behavior (when neither '-o dax' nor '-o dax=always|never|inode' option is specified) is equal to 'inode' mode, while 'dax=inode' won't be printed among the mount option list. - The 'inode' mode is only advisory. It will silently fallback to 'never' mode if fuse server doesn't support that. Also noted that by the time of this commit, 'inode' mode is actually equal to 'always' mode, before the per inode DAX flag is introduced in the following patch. Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com> Reviewed-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-12-04dax: simplify the dax_device <-> gendisk associationChristoph Hellwig
Replace the dax_host_hash with an xarray indexed by the pointer value of the gendisk, and require explicitly calls from the block drivers that want to associate their gendisk with a dax_device. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Mike Snitzer <snitzer@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Link: https://lore.kernel.org/r/20211129102203.2243509-5-hch@lst.de Signed-off-by: Dan Williams <dan.j.williams@intel.com>
2021-11-02virtiofs: use strscpy for copying the queue nameMiklos Szeredi
Always null terminate fsvq->name. Reported-by: kernel test robot <lkp@intel.com> Fixes: b43b7e81eb2b ("virtiofs: provide a helper function for virtqueue initialization") Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-10-21fuse: clean up fuse_mount destructionMiklos Szeredi
1. call fuse_mount_destroy() for open coded variants 2. before deactivate_locked_super() don't need fuse_mount destruction since that will now be done (if ->s_fs_info is not cleared) 3. rearrange fuse_mount setup in fuse_get_tree_submount() so that the regular pattern can be used Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-10-21fuse: get rid of fuse_put_super()Miklos Szeredi
The ->put_super callback is called from generic_shutdown_super() in case of a fully initialized sb. This is called from kill_***_super(), which is called from ->kill_sb instances. Fuse uses ->put_super to destroy the fs specific fuse_mount and drop the reference to the fuse_conn, while it does the same on each error case during sb setup. This patch moves the destruction from fuse_put_super() to fuse_mount_destroy(), called at the end of all ->kill_sb instances. A follup patch will clean up the error paths. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-10-21fuse: check s_root when destroying sbMiklos Szeredi
Checking "fm" works because currently sb->s_fs_info is cleared on error paths; however, sb->s_root is what generic_shutdown_super() checks to determine whether the sb was fully initialized or not. This change will allow cleanup of sb setup error paths. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-08-04fuse: name fs_context consistentlyMiklos Szeredi
Naming convention under fs/fuse/: struct fuse_conn *fc; struct fs_context *fsc; Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-06-22fuse: add dedicated filesystem context ops for submountsGreg Kurz
The creation of a submount is open-coded in fuse_dentry_automount(). This brings a lot of complexity and we recently had to fix bugs because we weren't setting SB_BORN or because we were unlocking sb->s_umount before sb was fully configured. Most of these could have been avoided by using the mount API instead of open-coding. Basically, this means coming up with a proper ->get_tree() implementation for submounts and call vfs_get_tree(), or better fc_mount(). The creation of the superblock for submounts is quite different from the root mount. Especially, it doesn't require to allocate a FUSE filesystem context, nor to parse parameters. Introduce a dedicated context ops for submounts to make this clear. This is just a placeholder for now, fuse_get_tree_submount() will be populated in a subsequent patch. Only visible change is that we stop allocating/freeing a useless FUSE filesystem context with submounts. Signed-off-by: Greg Kurz <groug@kaod.org> Reviewed-by: Max Reitz <mreitz@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-06-22virtiofs: propagate sync() to file serverGreg Kurz
Even if POSIX doesn't mandate it, linux users legitimately expect sync() to flush all data and metadata to physical storage when it is located on the same system. This isn't happening with virtiofs though: sync() inside the guest returns right away even though data still needs to be flushed from the host page cache. This is easily demonstrated by doing the following in the guest: $ dd if=/dev/zero of=/mnt/foo bs=1M count=5K ; strace -T -e sync sync 5120+0 records in 5120+0 records out 5368709120 bytes (5.4 GB, 5.0 GiB) copied, 5.22224 s, 1.0 GB/s sync() = 0 <0.024068> and start the following in the host when the 'dd' command completes in the guest: $ strace -T -e fsync /usr/bin/sync virtiofs/foo fsync(3) = 0 <10.371640> There are no good reasons not to honor the expected behavior of sync() actually: it gives an unrealistic impression that virtiofs is super fast and that data has safely landed on HW, which isn't the case obviously. Implement a ->sync_fs() superblock operation that sends a new FUSE_SYNCFS request type for this purpose. Provision a 64-bit placeholder for possible future extensions. Since the file server cannot handle the wait == 0 case, we skip it to avoid a gratuitous roundtrip. Note that this is per-superblock: a FUSE_SYNCFS is send for the root mount and for each submount. Like with FUSE_FSYNC and FUSE_FSYNCDIR, lack of support for FUSE_SYNCFS in the file server is treated as permanent success. This ensures compatibility with older file servers: the client will get the current behavior of sync() not being propagated to the file server. Note that such an operation allows the file server to DoS sync(). Since a typical FUSE file server is an untrusted piece of software running in userspace, this is disabled by default. Only enable it with virtiofs for now since virtiofsd is supposedly trusted by the guest kernel. Reported-by: Robert Krawitz <rlk@redhat.com> Signed-off-by: Greg Kurz <groug@kaod.org> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-04-14virtiofs: fix usernsMiklos Szeredi
get_user_ns() is done twice (once in virtio_fs_get_tree() and once in fuse_conn_init()), resulting in a reference leak. Also looks better to use fsc->user_ns (which *should* be the current_user_ns() at this point). Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-04-14virtiofs: remove useless functionJiapeng Chong
Fix the following clang warning: fs/fuse/virtio_fs.c:130:35: warning: unused function 'vq_to_fpq' [-Wunused-function]. Reported-by: Abaci Robot <abaci@linux.alibaba.com> Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-04-14virtiofs: split requests that exceed virtqueue sizeConnor Kuehl
If an incoming FUSE request can't fit on the virtqueue, the request is placed onto a workqueue so a worker can try to resubmit it later where there will (hopefully) be space for it next time. This is fine for requests that aren't larger than a virtqueue's maximum capacity. However, if a request's size exceeds the maximum capacity of the virtqueue (even if the virtqueue is empty), it will be doomed to a life of being placed on the workqueue, removed, discovered it won't fit, and placed on the workqueue yet again. Furthermore, from section 2.6.5.3.1 (Driver Requirements: Indirect Descriptors) of the virtio spec: "A driver MUST NOT create a descriptor chain longer than the Queue Size of the device." To fix this, limit the number of pages FUSE will use for an overall request. This way, each request can realistically fit on the virtqueue when it is decomposed into a scattergather list and avoid violating section 2.6.5.3.1 of the virtio spec. Signed-off-by: Connor Kuehl <ckuehl@redhat.com> Reviewed-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-04-14virtiofs: fix memory leak in virtio_fs_probe()Luis Henriques
When accidentally passing twice the same tag to qemu, kmemleak ended up reporting a memory leak in virtiofs. Also, looking at the log I saw the following error (that's when I realised the duplicated tag): virtiofs: probe of virtio5 failed with error -17 Here's the kmemleak log for reference: unreferenced object 0xffff888103d47800 (size 1024): comm "systemd-udevd", pid 118, jiffies 4294893780 (age 18.340s) hex dump (first 32 bytes): 00 00 00 00 ad 4e ad de ff ff ff ff 00 00 00 00 .....N.......... ff ff ff ff ff ff ff ff 80 90 02 a0 ff ff ff ff ................ backtrace: [<000000000ebb87c1>] virtio_fs_probe+0x171/0x7ae [virtiofs] [<00000000f8aca419>] virtio_dev_probe+0x15f/0x210 [<000000004d6baf3c>] really_probe+0xea/0x430 [<00000000a6ceeac8>] device_driver_attach+0xa8/0xb0 [<00000000196f47a7>] __driver_attach+0x98/0x140 [<000000000b20601d>] bus_for_each_dev+0x7b/0xc0 [<00000000399c7b7f>] bus_add_driver+0x11b/0x1f0 [<0000000032b09ba7>] driver_register+0x8f/0xe0 [<00000000cdd55998>] 0xffffffffa002c013 [<000000000ea196a2>] do_one_initcall+0x64/0x2e0 [<0000000008f727ce>] do_init_module+0x5c/0x260 [<000000003cdedab6>] __do_sys_finit_module+0xb5/0x120 [<00000000ad2f48c6>] do_syscall_64+0x33/0x40 [<00000000809526b5>] entry_SYSCALL_64_after_hwframe+0x44/0xae Cc: stable@vger.kernel.org Signed-off-by: Luis Henriques <lhenriques@suse.de> Fixes: a62a8ef9d97d ("virtio-fs: add virtiofs filesystem") Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Reviewed-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2021-03-05virtiofs: Fail dax mount if device does not support itVivek Goyal
Right now "mount -t virtiofs -o dax myfs /mnt/virtiofs" succeeds even if filesystem deivce does not have a cache window and hence DAX can't be supported. This gives a false sense to user that they are using DAX with virtiofs but fact of the matter is that they are not. Fix this by returning error if dax can't be supported and user has asked for it. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-11-11virtiofs: clean up error handling in virtio_fs_get_tree()Miklos Szeredi
Avoid duplicating error cleanup. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-11-11fuse: get rid of fuse_mount refcountMiklos Szeredi
Fuse mount now only ever has a refcount of one (before being freed) so the count field is unnecessary. Remove the refcounting and fold fuse_mount_put() into callers. The only caller of fuse_mount_put() where fm->fc was NULL is fuse_dentry_automount() and here the fuse_conn_put() can simply be omitted. Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-11-11virtiofs: simplify sb setupMiklos Szeredi
Currently when acquiring an sb for virtiofs fuse_mount_get() is being called from virtio_fs_set_super() if a new sb is being filled and fuse_mount_put() is called unconditionally after sget_fc() returns. The exact same result can be obtained by checking whether fs_contex->s_fs_info was set to NULL (ref trasferred to sb->s_fs_info) and only calling fuse_mount_put() if the ref wasn't transferred (error or matching sb found). This allows getting rid of virtio_fs_set_super() and fuse_mount_get(). Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-11-11virtiofs fix leak in setupMiklos Szeredi
This can be triggered for example by adding the "-omand" mount option, which will be rejected and virtio_fs_fill_super() will return an error. In such a case the allocations for fuse_conn and fuse_mount will leak due to s_root not yet being set and so ->put_super() not being called. Fixes: a62a8ef9d97d ("virtio-fs: add virtiofs filesystem") Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-10-19Merge tag 'fuse-update-5.10' of ↵Linus Torvalds
git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse Pull fuse updates from Miklos Szeredi: - Support directly accessing host page cache from virtiofs. This can improve I/O performance for various workloads, as well as reducing the memory requirement by eliminating double caching. Thanks to Vivek Goyal for doing most of the work on this. - Allow automatic submounting inside virtiofs. This allows unique st_dev/ st_ino values to be assigned inside the guest to files residing on different filesystems on the host. Thanks to Max Reitz for the patches. - Fix an old use after free bug found by Pradeep P V K. * tag 'fuse-update-5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: (25 commits) virtiofs: calculate number of scatter-gather elements accurately fuse: connection remove fix fuse: implement crossmounts fuse: Allow fuse_fill_super_common() for submounts fuse: split fuse_mount off of fuse_conn fuse: drop fuse_conn parameter where possible fuse: store fuse_conn in fuse_req fuse: add submount support to <uapi/linux/fuse.h> fuse: fix page dereference after free virtiofs: add logic to free up a memory range virtiofs: maintain a list of busy elements virtiofs: serialize truncate/punch_hole and dax fault path virtiofs: define dax address space operations virtiofs: add DAX mmap support virtiofs: implement dax read/write operations virtiofs: introduce setupmapping/removemapping commands virtiofs: implement FUSE_INIT map_alignment field virtiofs: keep a list of free dax memory ranges virtiofs: add a mount option to enable dax virtiofs: set up virtio_fs dax_device ...
2020-10-14virtiofs: calculate number of scatter-gather elements accuratelyVivek Goyal
virtiofs currently maps various buffers in scatter gather list and it looks at number of pages (ap->pages) and assumes that same number of pages will be used both for input and output (sg_count_fuse_req()), and calculates total number of scatterlist elements accordingly. But looks like this assumption is not valid in all the cases. For example, Cai Qian reported that trinity, triggers warning with virtiofs sometimes. A closer look revealed that if one calls ioctl(fd, 0x5a004000, buf), it will trigger following warning. WARN_ON(out_sgs + in_sgs != total_sgs) In this case, total_sgs = 8, out_sgs=4, in_sgs=3. Number of pages is 2 (ap->pages), but out_sgs are using both the pages but in_sgs are using only one page. In this case, fuse_do_ioctl() sets different size values for input and output. args->in_args[args->in_numargs - 1].size == 6656 args->out_args[args->out_numargs - 1].size == 4096 So current method of calculating how many scatter-gather list elements will be used is not accurate. Make calculations more precise by parsing size and ap->descs. Reported-by: Qian Cai <cai@redhat.com> Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Link: https://lore.kernel.org/linux-fsdevel/5ea77e9f6cb8c2db43b09fbd4158ab2d8c066a0a.camel@redhat.com/ Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-10-09fuse: implement crossmountsMax Reitz
FUSE servers can indicate crossmount points by setting FUSE_ATTR_SUBMOUNT in fuse_attr.flags. The inode will then be marked as S_AUTOMOUNT, and the .d_automount implementation creates a new submount at that location, so that the submount gets a distinct st_dev value. Note that all submounts get a distinct superblock and a distinct st_dev value, so for virtio-fs, even if the same filesystem is mounted more than once on the host, none of its mount points will have the same st_dev. We need distinct superblocks because the superblock points to the root node, but the different host mounts may show different trees (e.g. due to submounts in some of them, but not in others). Right now, this behavior is only enabled when fuse_conn.auto_submounts is set, which is the case only for virtio-fs. Signed-off-by: Max Reitz <mreitz@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-09-18fuse: split fuse_mount off of fuse_connMax Reitz
We want to allow submounts for the same fuse_conn, but with different superblocks so that each of the submounts has its own device ID. To do so, we need to split all mount-specific information off of fuse_conn into a new fuse_mount structure, so that multiple mounts can share a single fuse_conn. We need to take care only to perform connection-level actions once (i.e. when the fuse_conn and thus the first fuse_mount are established, or when the last fuse_mount and thus the fuse_conn are destroyed). For example, fuse_sb_destroy() must invoke fuse_send_destroy() until the last superblock is released. To do so, we keep track of which fuse_mount is the root mount and perform all fuse_conn-level actions only when this fuse_mount is involved. Signed-off-by: Max Reitz <mreitz@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-09-18fuse: drop fuse_conn parameter where possibleMax Reitz
With the last commit, all functions that handle some existing fuse_req no longer need to be given the associated fuse_conn, because they can get it from the fuse_req object. Signed-off-by: Max Reitz <mreitz@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-09-10virtiofs: add logic to free up a memory rangeVivek Goyal
Add logic to free up a busy memory range. Freed memory range will be returned to free pool. Add a worker which can be started to select and free some busy memory ranges. Process can also steal one of its busy dax ranges if free range is not available. I will refer it to as direct reclaim. If free range is not available and nothing can't be stolen from same inode, caller waits on a waitq for free range to become available. For reclaiming a range, as of now we need to hold following locks in specified order. down_write(&fi->i_mmap_sem); down_write(&fi->dax->sem); We look for a free range in following order. A. Try to get a free range. B. If not, try direct reclaim. C. If not, wait for a memory range to become free Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-09-10virtiofs: add a mount option to enable daxVivek Goyal
Add a mount option to allow using dax with virtio_fs. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-09-10virtiofs: set up virtio_fs dax_deviceStefan Hajnoczi
Setup a dax device. Use the shm capability to find the cache entry and map it. The DAX window is accessed by the fs/dax.c infrastructure and must have struct pages (at least on x86). Use devm_memremap_pages() to map the DAX window PCI BAR and allocate struct page. Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Dr. David Alan Gilbert <dgilbert@redhat.com> Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com> Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-09-10virtiofs: get rid of no_mount_optionsVivek Goyal
This option was introduced so that for virtio_fs we don't show any mounts options fuse_show_options(). Because we don't offer any of these options to be controlled by mounter. Very soon we are planning to introduce option "dax" which mounter should be able to specify. And no_mount_options does not work anymore. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-09-10virtiofs: provide a helper function for virtqueue initializationVivek Goyal
This reduces code duplication and make it little easier to read code. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-08-05virtio_fs: convert to LE accessorsMichael S. Tsirkin
Virtio fs is modern-only. Use LE accessors for config space. Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
2020-05-19virtiofs: do not use fuse_fill_super_common() for device installationVivek Goyal
fuse_fill_super_common() allocates and installs one fuse_device. Hence virtiofs allocates and install all fuse devices by itself except one. This makes logic little twisted. There does not seem to be any real need that why virtiofs can't allocate and install all fuse devices itself. So opt out of fuse device allocation and installation while calling fuse_fill_super_common(). Regular fuse still wants fuse_fill_super_common() to install fuse_device. It needs to prevent against races where two mounters are trying to mount fuse using same fd. In that case one will succeed while other will get -EINVAL. virtiofs does not have this issue because sget_fc() resolves the race w.r.t multiple mounters and only one instance of virtio_fs_fill_super() should be in progress for same filesystem. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2020-04-20virtiofs: schedule blocking async replies in separate workerVivek Goyal
In virtiofs (unlike in regular fuse) processing of async replies is serialized. This can result in a deadlock in rare corner cases when there's a circular dependency between the completion of two or more async replies. Such a deadlock can be reproduced with xfstests:generic/503 if TEST_DIR == SCRATCH_MNT (which is a misconfiguration): - Process A is waiting for page lock in worker thread context and blocked (virtio_fs_requests_done_work()). - Process B is holding page lock and waiting for pending writes to finish (fuse_wait_on_page_writeback()). - Write requests are waiting in virtqueue and can't complete because worker thread is blocked on page lock (process A). Fix this by creating a unique work_struct for each async reply that can block (O_DIRECT read). Fixes: a62a8ef9d97d ("virtio-fs: add virtiofs filesystem") Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-11-22virtiofs: Use completions while waiting for queue to be drainedVivek Goyal
While we wait for queue to finish draining, use completions instead of usleep_range(). This is better way of waiting for event. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-11-22virtiofs: Do not send forget request "struct list_head" elementVivek Goyal
We are sending whole of virtio_fs_forget struct to the other end over virtqueue. Other end does not need to see elements like "struct list". That's internal detail of guest kernel. Fix it. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-11-22virtiofs: Use a common function to send forgetVivek Goyal
Currently we are duplicating logic to send forgets at two places. Consolidate the code by calling one helper function. This also uses virtqueue_add_outbuf() instead of virtqueue_add_sgs(). Former is simpler to call. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-11-12virtiofs: Fix old-style declarationYueHaibing
There expect the 'static' keyword to come first in a declaration, and we get warnings like this with "make W=1": fs/fuse/virtio_fs.c:687:1: warning: 'static' is not at beginning of declaration [-Wold-style-declaration] fs/fuse/virtio_fs.c:692:1: warning: 'static' is not at beginning of declaration [-Wold-style-declaration] fs/fuse/virtio_fs.c:1029:1: warning: 'static' is not at beginning of declaration [-Wold-style-declaration] Signed-off-by: YueHaibing <yuehaibing@huawei.com> Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-10-23virtiofs: Remove set but not used variable 'fc'zhengbin
Fixes gcc '-Wunused-but-set-variable' warning: fs/fuse/virtio_fs.c: In function virtio_fs_wake_pending_and_unlock: fs/fuse/virtio_fs.c:983:20: warning: variable fc set but not used [-Wunused-but-set-variable] It is not used since commit 7ee1e2e631db ("virtiofs: No need to check fpq->connected state") Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: zhengbin <zhengbin13@huawei.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-10-21virtiofs: Retry request submission from worker contextVivek Goyal
If regular request queue gets full, currently we sleep for a bit and retrying submission in submitter's context. This assumes submitter is not holding any spin lock. But this assumption is not true for background requests. For background requests, we are called with fc->bg_lock held. This can lead to deadlock where one thread is trying submission with fc->bg_lock held while request completion thread has called fuse_request_end() which tries to acquire fc->bg_lock and gets blocked. As request completion thread gets blocked, it does not make further progress and that means queue does not get empty and submitter can't submit more requests. To solve this issue, retry submission with the help of a worker, instead of retrying in submitter's context. We already do this for hiprio/forget requests. Reported-by: Chirantan Ekbote <chirantan@chromium.org> Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-10-21virtiofs: Count pending forgets as in_flight forgetsVivek Goyal
If virtqueue is full, we put forget requests on a list and these forgets are dispatched later using a worker. As of now we don't count these forgets in fsvq->in_flight variable. This means when queue is being drained, we have to have special logic to first drain these pending requests and then wait for fsvq->in_flight to go to zero. By counting pending forgets in fsvq->in_flight, we can get rid of special logic and just wait for in_flight to go to zero. Worker thread will kick and drain all the forgets anyway, leading in_flight to zero. I also need similar logic for normal request queue in next patch where I am about to defer request submission in the worker context if queue is full. This simplifies the code a bit. Also add two helper functions to inc/dec in_flight. Decrement in_flight helper will later used to call completion when in_flight reaches zero. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-10-21virtiofs: Set FR_SENT flag only after request has been sentVivek Goyal
FR_SENT flag should be set when request has been sent successfully sent over virtqueue. This is used by interrupt logic to figure out if interrupt request should be sent or not. Also add it to fqp->processing list after sending it successfully. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-10-21virtiofs: No need to check fpq->connected stateVivek Goyal
In virtiofs we keep per queue connected state in virtio_fs_vq->connected and use that to end request if queue is not connected. And virtiofs does not even touch fpq->connected state. We probably need to merge these two at some point of time. For now, simplify the code a bit and do not worry about checking state of fpq->connected. Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-10-21virtiofs: Do not end request in submission contextVivek Goyal
Submission context can hold some locks which end request code tries to hold again and deadlock can occur. For example, fc->bg_lock. If a background request is being submitted, it might hold fc->bg_lock and if we could not submit request (because device went away) and tried to end request, then deadlock happens. During testing, I also got a warning from deadlock detection code. So put requests on a list and end requests from a worker thread. I got following warning from deadlock detector. [ 603.137138] WARNING: possible recursive locking detected [ 603.137142] -------------------------------------------- [ 603.137144] blogbench/2036 is trying to acquire lock: [ 603.137149] 00000000f0f51107 (&(&fc->bg_lock)->rlock){+.+.}, at: fuse_request_end+0xdf/0x1c0 [fuse] [ 603.140701] [ 603.140701] but task is already holding lock: [ 603.140703] 00000000f0f51107 (&(&fc->bg_lock)->rlock){+.+.}, at: fuse_simple_background+0x92/0x1d0 [fuse] [ 603.140713] [ 603.140713] other info that might help us debug this: [ 603.140714] Possible unsafe locking scenario: [ 603.140714] [ 603.140715] CPU0 [ 603.140716] ---- [ 603.140716] lock(&(&fc->bg_lock)->rlock); [ 603.140718] lock(&(&fc->bg_lock)->rlock); [ 603.140719] [ 603.140719] *** DEADLOCK *** Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-10-15virtio-fs: don't show mount optionsMiklos Szeredi
Virtio-fs does not accept any mount options, so it's confusing and wrong to show any in /proc/mounts. Reported-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
2019-09-18virtio-fs: add virtiofs filesystemStefan Hajnoczi
Add a basic file system module for virtio-fs. This does not yet contain shared data support between host and guest or metadata coherency speedups. However it is already significantly faster than virtio-9p. Design Overview =============== With the goal of designing something with better performance and local file system semantics, a bunch of ideas were proposed. - Use fuse protocol (instead of 9p) for communication between guest and host. Guest kernel will be fuse client and a fuse server will run on host to serve the requests. - For data access inside guest, mmap portion of file in QEMU address space and guest accesses this memory using dax. That way guest page cache is bypassed and there is only one copy of data (on host). This will also enable mmap(MAP_SHARED) between guests. - For metadata coherency, there is a shared memory region which contains version number associated with metadata and any guest changing metadata updates version number and other guests refresh metadata on next access. This is yet to be implemented. How virtio-fs differs from existing approaches ============================================== The unique idea behind virtio-fs is to take advantage of the co-location of the virtual machine and hypervisor to avoid communication (vmexits). DAX allows file contents to be accessed without communication with the hypervisor. The shared memory region for metadata avoids communication in the common case where metadata is unchanged. By replacing expensive communication with cheaper shared memory accesses, we expect to achieve better performance than approaches based on network file system protocols. In addition, this also makes it easier to achieve local file system semantics (coherency). These techniques are not applicable to network file system protocols since the communications channel is bypassed by taking advantage of shared memory on a local machine. This is why we decided to build virtio-fs rather than focus on 9P or NFS. Caching Modes ============= Like virtio-9p, different caching modes are supported which determine the coherency level as well. The “cache=FOO” and “writeback” options control the level of coherence between the guest and host filesystems. - cache=none metadata, data and pathname lookup are not cached in guest. They are always fetched from host and any changes are immediately pushed to host. - cache=always metadata, data and pathname lookup are cached in guest and never expire. - cache=auto metadata and pathname lookup cache expires after a configured amount of time (default is 1 second). Data is cached while the file is open (close to open consistency). - writeback/no_writeback These options control the writeback strategy. If writeback is disabled, then normal writes will immediately be synchronized with the host fs. If writeback is enabled, then writes may be cached in the guest until the file is closed or an fsync(2) performed. This option has no effect on mmap-ed writes or writes going through the DAX mechanism. Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Vivek Goyal <vgoyal@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>