diff options
author | Linus Torvalds <torvalds@linux-foundation.org> | 2024-05-13 12:48:06 -0700 |
---|---|---|
committer | Linus Torvalds <torvalds@linux-foundation.org> | 2024-05-13 12:48:06 -0700 |
commit | 9961a785944601e32f185ea696347b22ffda634c (patch) | |
tree | 24362a6360b2a1fa96c7eb562c6a95f96b37e459 /io_uring/memmap.c | |
parent | f4e8d80292859809ea135e9f4c43bae47e4f58bc (diff) | |
parent | deb1e496a83557896fe0cca0b8af01c2a97c0dc6 (diff) |
Merge tag 'for-6.10/io_uring-20240511' of git://git.kernel.dk/linux
Pull io_uring updates from Jens Axboe:
- Greatly improve send zerocopy performance, by enabling coalescing of
sent buffers.
MSG_ZEROCOPY already does this with send(2) and sendmsg(2), but the
io_uring side did not. In local testing, the crossover point for send
zerocopy being faster is now around 3000 byte packets, and it
performs better than the sync syscall variants as well.
This feature relies on a shared branch with net-next, which was
pulled into both branches.
- Unification of how async preparation is done across opcodes.
Previously, opcodes that required extra memory for async retry would
allocate that as needed, using on-stack state until that was the
case. If async retry was needed, the on-stack state was adjusted
appropriately for a retry and then copied to the allocated memory.
This led to some fragile and ugly code, particularly for read/write
handling, and made storage retries more difficult than they needed to
be. Allocate the memory upfront, as it's cheap from our pools, and
use that state consistently both initially and also from the retry
side.
- Move away from using remap_pfn_range() for mapping the rings.
This is really not the right interface to use and can cause lifetime
issues or leaks. Additionally, it means the ring sq/cq arrays need to
be physically contigious, which can cause problems in production with
larger rings when services are restarted, as memory can be very
fragmented at that point.
Move to using vm_insert_page(s) for the ring sq/cq arrays, and apply
the same treatment to mapped ring provided buffers. This also helps
unify the code we have dealing with allocating and mapping memory.
Hard to see in the diffstat as we're adding a few features as well,
but this kills about ~400 lines of code from the codebase as well.
- Add support for bundles for send/recv.
When used with provided buffers, bundles support sending or receiving
more than one buffer at the time, improving the efficiency by only
needing to call into the networking stack once for multiple sends or
receives.
- Tweaks for our accept operations, supporting both a DONTWAIT flag for
skipping poll arm and retry if we can, and a POLLFIRST flag that the
application can use to skip the initial accept attempt and rely
purely on poll for triggering the operation. Both of these have
identical flags on the receive side already.
- Make the task_work ctx locking unconditional.
We had various code paths here that would do a mix of lock/trylock
and set the task_work state to whether or not it was locked. All of
that goes away, we lock it unconditionally and get rid of the state
flag indicating whether it's locked or not.
The state struct still exists as an empty type, can go away in the
future.
- Add support for specifying NOP completion values, allowing it to be
used for error handling testing.
- Use set/test bit for io-wq worker flags. Not strictly needed, but
also doesn't hurt and helps silence a KCSAN warning.
- Cleanups for io-wq locking and work assignments, closing a tiny race
where cancelations would not be able to find the work item reliably.
- Misc fixes, cleanups, and improvements
* tag 'for-6.10/io_uring-20240511' of git://git.kernel.dk/linux: (97 commits)
io_uring: support to inject result for NOP
io_uring: fail NOP if non-zero op flags is passed in
io_uring/net: add IORING_ACCEPT_POLL_FIRST flag
io_uring/net: add IORING_ACCEPT_DONTWAIT flag
io_uring/filetable: don't unnecessarily clear/reset bitmap
io_uring/io-wq: Use set_bit() and test_bit() at worker->flags
io_uring/msg_ring: cleanup posting to IOPOLL vs !IOPOLL ring
io_uring: Require zeroed sqe->len on provided-buffers send
io_uring/notif: disable LAZY_WAKE for linked notifs
io_uring/net: fix sendzc lazy wake polling
io_uring/msg_ring: reuse ctx->submitter_task read using READ_ONCE instead of re-reading it
io_uring/rw: reinstate thread check for retries
io_uring/notif: implement notification stacking
io_uring/notif: simplify io_notif_flush()
net: add callback for setting a ubuf_info to skb
net: extend ubuf_info callback to ops structure
io_uring/net: support bundles for recv
io_uring/net: support bundles for send
io_uring/kbuf: add helpers for getting/peeking multiple buffers
io_uring/net: add provided buffer support for IORING_OP_SEND
...
Diffstat (limited to 'io_uring/memmap.c')
-rw-r--r-- | io_uring/memmap.c | 336 |
1 files changed, 336 insertions, 0 deletions
diff --git a/io_uring/memmap.c b/io_uring/memmap.c new file mode 100644 index 000000000000..523d982af2b0 --- /dev/null +++ b/io_uring/memmap.c @@ -0,0 +1,336 @@ +// SPDX-License-Identifier: GPL-2.0 +#include <linux/kernel.h> +#include <linux/init.h> +#include <linux/errno.h> +#include <linux/mm.h> +#include <linux/mman.h> +#include <linux/slab.h> +#include <linux/vmalloc.h> +#include <linux/io_uring.h> +#include <linux/io_uring_types.h> +#include <asm/shmparam.h> + +#include "memmap.h" +#include "kbuf.h" + +static void *io_mem_alloc_compound(struct page **pages, int nr_pages, + size_t size, gfp_t gfp) +{ + struct page *page; + int i, order; + + order = get_order(size); + if (order > MAX_PAGE_ORDER) + return ERR_PTR(-ENOMEM); + else if (order) + gfp |= __GFP_COMP; + + page = alloc_pages(gfp, order); + if (!page) + return ERR_PTR(-ENOMEM); + + for (i = 0; i < nr_pages; i++) + pages[i] = page + i; + + return page_address(page); +} + +static void *io_mem_alloc_single(struct page **pages, int nr_pages, size_t size, + gfp_t gfp) +{ + void *ret; + int i; + + for (i = 0; i < nr_pages; i++) { + pages[i] = alloc_page(gfp); + if (!pages[i]) + goto err; + } + + ret = vmap(pages, nr_pages, VM_MAP, PAGE_KERNEL); + if (ret) + return ret; +err: + while (i--) + put_page(pages[i]); + return ERR_PTR(-ENOMEM); +} + +void *io_pages_map(struct page ***out_pages, unsigned short *npages, + size_t size) +{ + gfp_t gfp = GFP_KERNEL_ACCOUNT | __GFP_ZERO | __GFP_NOWARN; + struct page **pages; + int nr_pages; + void *ret; + + nr_pages = (size + PAGE_SIZE - 1) >> PAGE_SHIFT; + pages = kvmalloc_array(nr_pages, sizeof(struct page *), gfp); + if (!pages) + return ERR_PTR(-ENOMEM); + + ret = io_mem_alloc_compound(pages, nr_pages, size, gfp); + if (!IS_ERR(ret)) + goto done; + + ret = io_mem_alloc_single(pages, nr_pages, size, gfp); + if (!IS_ERR(ret)) { +done: + *out_pages = pages; + *npages = nr_pages; + return ret; + } + + kvfree(pages); + *out_pages = NULL; + *npages = 0; + return ret; +} + +void io_pages_unmap(void *ptr, struct page ***pages, unsigned short *npages, + bool put_pages) +{ + bool do_vunmap = false; + + if (!ptr) + return; + + if (put_pages && *npages) { + struct page **to_free = *pages; + int i; + + /* + * Only did vmap for the non-compound multiple page case. + * For the compound page, we just need to put the head. + */ + if (PageCompound(to_free[0])) + *npages = 1; + else if (*npages > 1) + do_vunmap = true; + for (i = 0; i < *npages; i++) + put_page(to_free[i]); + } + if (do_vunmap) + vunmap(ptr); + kvfree(*pages); + *pages = NULL; + *npages = 0; +} + +void io_pages_free(struct page ***pages, int npages) +{ + struct page **page_array = *pages; + + if (!page_array) + return; + + unpin_user_pages(page_array, npages); + kvfree(page_array); + *pages = NULL; +} + +struct page **io_pin_pages(unsigned long uaddr, unsigned long len, int *npages) +{ + unsigned long start, end, nr_pages; + struct page **pages; + int ret; + + end = (uaddr + len + PAGE_SIZE - 1) >> PAGE_SHIFT; + start = uaddr >> PAGE_SHIFT; + nr_pages = end - start; + if (WARN_ON_ONCE(!nr_pages)) + return ERR_PTR(-EINVAL); + + pages = kvmalloc_array(nr_pages, sizeof(struct page *), GFP_KERNEL); + if (!pages) + return ERR_PTR(-ENOMEM); + + ret = pin_user_pages_fast(uaddr, nr_pages, FOLL_WRITE | FOLL_LONGTERM, + pages); + /* success, mapped all pages */ + if (ret == nr_pages) { + *npages = nr_pages; + return pages; + } + + /* partial map, or didn't map anything */ + if (ret >= 0) { + /* if we did partial map, release any pages we did get */ + if (ret) + unpin_user_pages(pages, ret); + ret = -EFAULT; + } + kvfree(pages); + return ERR_PTR(ret); +} + +void *__io_uaddr_map(struct page ***pages, unsigned short *npages, + unsigned long uaddr, size_t size) +{ + struct page **page_array; + unsigned int nr_pages; + void *page_addr; + + *npages = 0; + + if (uaddr & (PAGE_SIZE - 1) || !size) + return ERR_PTR(-EINVAL); + + nr_pages = 0; + page_array = io_pin_pages(uaddr, size, &nr_pages); + if (IS_ERR(page_array)) + return page_array; + + page_addr = vmap(page_array, nr_pages, VM_MAP, PAGE_KERNEL); + if (page_addr) { + *pages = page_array; + *npages = nr_pages; + return page_addr; + } + + io_pages_free(&page_array, nr_pages); + return ERR_PTR(-ENOMEM); +} + +static void *io_uring_validate_mmap_request(struct file *file, loff_t pgoff, + size_t sz) +{ + struct io_ring_ctx *ctx = file->private_data; + loff_t offset = pgoff << PAGE_SHIFT; + + switch ((pgoff << PAGE_SHIFT) & IORING_OFF_MMAP_MASK) { + case IORING_OFF_SQ_RING: + case IORING_OFF_CQ_RING: + /* Don't allow mmap if the ring was setup without it */ + if (ctx->flags & IORING_SETUP_NO_MMAP) + return ERR_PTR(-EINVAL); + return ctx->rings; + case IORING_OFF_SQES: + /* Don't allow mmap if the ring was setup without it */ + if (ctx->flags & IORING_SETUP_NO_MMAP) + return ERR_PTR(-EINVAL); + return ctx->sq_sqes; + case IORING_OFF_PBUF_RING: { + struct io_buffer_list *bl; + unsigned int bgid; + void *ptr; + + bgid = (offset & ~IORING_OFF_MMAP_MASK) >> IORING_OFF_PBUF_SHIFT; + bl = io_pbuf_get_bl(ctx, bgid); + if (IS_ERR(bl)) + return bl; + ptr = bl->buf_ring; + io_put_bl(ctx, bl); + return ptr; + } + } + + return ERR_PTR(-EINVAL); +} + +int io_uring_mmap_pages(struct io_ring_ctx *ctx, struct vm_area_struct *vma, + struct page **pages, int npages) +{ + unsigned long nr_pages = npages; + + vm_flags_set(vma, VM_DONTEXPAND); + return vm_insert_pages(vma, vma->vm_start, pages, &nr_pages); +} + +#ifdef CONFIG_MMU + +__cold int io_uring_mmap(struct file *file, struct vm_area_struct *vma) +{ + struct io_ring_ctx *ctx = file->private_data; + size_t sz = vma->vm_end - vma->vm_start; + long offset = vma->vm_pgoff << PAGE_SHIFT; + void *ptr; + + ptr = io_uring_validate_mmap_request(file, vma->vm_pgoff, sz); + if (IS_ERR(ptr)) + return PTR_ERR(ptr); + + switch (offset & IORING_OFF_MMAP_MASK) { + case IORING_OFF_SQ_RING: + case IORING_OFF_CQ_RING: + return io_uring_mmap_pages(ctx, vma, ctx->ring_pages, + ctx->n_ring_pages); + case IORING_OFF_SQES: + return io_uring_mmap_pages(ctx, vma, ctx->sqe_pages, + ctx->n_sqe_pages); + case IORING_OFF_PBUF_RING: + return io_pbuf_mmap(file, vma); + } + + return -EINVAL; +} + +unsigned long io_uring_get_unmapped_area(struct file *filp, unsigned long addr, + unsigned long len, unsigned long pgoff, + unsigned long flags) +{ + void *ptr; + + /* + * Do not allow to map to user-provided address to avoid breaking the + * aliasing rules. Userspace is not able to guess the offset address of + * kernel kmalloc()ed memory area. + */ + if (addr) + return -EINVAL; + + ptr = io_uring_validate_mmap_request(filp, pgoff, len); + if (IS_ERR(ptr)) + return -ENOMEM; + + /* + * Some architectures have strong cache aliasing requirements. + * For such architectures we need a coherent mapping which aliases + * kernel memory *and* userspace memory. To achieve that: + * - use a NULL file pointer to reference physical memory, and + * - use the kernel virtual address of the shared io_uring context + * (instead of the userspace-provided address, which has to be 0UL + * anyway). + * - use the same pgoff which the get_unmapped_area() uses to + * calculate the page colouring. + * For architectures without such aliasing requirements, the + * architecture will return any suitable mapping because addr is 0. + */ + filp = NULL; + flags |= MAP_SHARED; + pgoff = 0; /* has been translated to ptr above */ +#ifdef SHM_COLOUR + addr = (uintptr_t) ptr; + pgoff = addr >> PAGE_SHIFT; +#else + addr = 0UL; +#endif + return current->mm->get_unmapped_area(filp, addr, len, pgoff, flags); +} + +#else /* !CONFIG_MMU */ + +int io_uring_mmap(struct file *file, struct vm_area_struct *vma) +{ + return is_nommu_shared_mapping(vma->vm_flags) ? 0 : -EINVAL; +} + +unsigned int io_uring_nommu_mmap_capabilities(struct file *file) +{ + return NOMMU_MAP_DIRECT | NOMMU_MAP_READ | NOMMU_MAP_WRITE; +} + +unsigned long io_uring_get_unmapped_area(struct file *file, unsigned long addr, + unsigned long len, unsigned long pgoff, + unsigned long flags) +{ + void *ptr; + + ptr = io_uring_validate_mmap_request(file, pgoff, len); + if (IS_ERR(ptr)) + return PTR_ERR(ptr); + + return (unsigned long) ptr; +} + +#endif /* !CONFIG_MMU */ |