Merge tag 'for-6.10/io_uring-20240511' of git://git.kernel.dk/linux

Pull io_uring updates from Jens Axboe: - Greatly improve send zerocopy performance, by enabling coalescing of sent buffers. MSG_ZEROCOPY already does this with send(2) and sendmsg(2), but the io_uring side did not. In local testing, the crossover point for send zerocopy being faster is now around 3000 byte packets, and it performs better than the sync syscall variants as well. This feature relies on a shared branch with net-next, which was pulled into both branches. - Unification of how async preparation is done across opcodes. Previously, opcodes that required extra memory for async retry would allocate that as needed, using on-stack state until that was the case. If async retry was needed, the on-stack state was adjusted appropriately for a retry and then copied to the allocated memory. This led to some fragile and ugly code, particularly for read/write handling, and made storage retries more difficult than they needed to be. Allocate the memory upfront, as it's cheap from our pools, and use that state consistently both initially and also from the retry side. - Move away from using remap_pfn_range() for mapping the rings. This is really not the right interface to use and can cause lifetime issues or leaks. Additionally, it means the ring sq/cq arrays need to be physically contigious, which can cause problems in production with larger rings when services are restarted, as memory can be very fragmented at that point. Move to using vm_insert_page(s) for the ring sq/cq arrays, and apply the same treatment to mapped ring provided buffers. This also helps unify the code we have dealing with allocating and mapping memory. Hard to see in the diffstat as we're adding a few features as well, but this kills about ~400 lines of code from the codebase as well. - Add support for bundles for send/recv. When used with provided buffers, bundles support sending or receiving more than one buffer at the time, improving the efficiency by only needing to call into the networking stack once for multiple sends or receives. - Tweaks for our accept operations, supporting both a DONTWAIT flag for skipping poll arm and retry if we can, and a POLLFIRST flag that the application can use to skip the initial accept attempt and rely purely on poll for triggering the operation. Both of these have identical flags on the receive side already. - Make the task_work ctx locking unconditional. We had various code paths here that would do a mix of lock/trylock and set the task_work state to whether or not it was locked. All of that goes away, we lock it unconditionally and get rid of the state flag indicating whether it's locked or not. The state struct still exists as an empty type, can go away in the future. - Add support for specifying NOP completion values, allowing it to be used for error handling testing. - Use set/test bit for io-wq worker flags. Not strictly needed, but also doesn't hurt and helps silence a KCSAN warning. - Cleanups for io-wq locking and work assignments, closing a tiny race where cancelations would not be able to find the work item reliably. - Misc fixes, cleanups, and improvements * tag 'for-6.10/io_uring-20240511' of git://git.kernel.dk/linux: (97 commits) io_uring: support to inject result for NOP io_uring: fail NOP if non-zero op flags is passed in io_uring/net: add IORING_ACCEPT_POLL_FIRST flag io_uring/net: add IORING_ACCEPT_DONTWAIT flag io_uring/filetable: don't unnecessarily clear/reset bitmap io_uring/io-wq: Use set_bit() and test_bit() at worker->flags io_uring/msg_ring: cleanup posting to IOPOLL vs !IOPOLL ring io_uring: Require zeroed sqe->len on provided-buffers send io_uring/notif: disable LAZY_WAKE for linked notifs io_uring/net: fix sendzc lazy wake polling io_uring/msg_ring: reuse ctx->submitter_task read using READ_ONCE instead of re-reading it io_uring/rw: reinstate thread check for retries io_uring/notif: implement notification stacking io_uring/notif: simplify io_notif_flush() net: add callback for setting a ubuf_info to skb net: extend ubuf_info callback to ops structure io_uring/net: support bundles for recv io_uring/net: support bundles for send io_uring/kbuf: add helpers for getting/peeking multiple buffers io_uring/net: add provided buffer support for IORING_OP_SEND ...
author: Linus Torvalds <torvalds@linux-foundation.org> 2024-05-13 12:48:06 -0700
committer: Linus Torvalds <torvalds@linux-foundation.org> 2024-05-13 12:48:06 -0700
commit: 9961a785944601e32f185ea696347b22ffda634c (patch)
tree: 24362a6360b2a1fa96c7eb562c6a95f96b37e459 /io_uring/kbuf.c
parent: f4e8d80292859809ea135e9f4c43bae47e4f58bc (diff)
parent: deb1e496a83557896fe0cca0b8af01c2a97c0dc6 (diff)
1 files changed, 188 insertions, 130 deletions
diff --git a/io_uring/kbuf.c b/io_uring/kbuf.c
index 3aa16e27f509..d2945c9c812b 100644
--- a/io_uring/kbuf.c
+++ b/io_uring/kbuf.c
@@ -7,6 +7,7 @@
 #include <linux/slab.h>
 #include <linux/namei.h>
 #include <linux/poll.h>
+#include <linux/vmalloc.h>
 #include <linux/io_uring.h>
 
 #include <uapi/linux/io_uring.h>
@@ -14,8 +15,7 @@
 #include "io_uring.h"
 #include "opdef.h"
 #include "kbuf.h"
-
-#define IO_BUFFER_LIST_BUF_PER_PAGE (PAGE_SIZE / sizeof(struct io_uring_buf))
+#include "memmap.h"
 
 /* BIDs are addressed by a 16-bit field in a CQE */
 #define MAX_BIDS_PER_BGID (1 << 16)
@@ -31,25 +31,12 @@ struct io_provide_buf {
 	__u16				bid;
 };
 
-struct io_buf_free {
-	struct hlist_node		list;
-	void				*mem;
-	size_t				size;
-	int				inuse;
-};
-
-static inline struct io_buffer_list *__io_buffer_get_list(struct io_ring_ctx *ctx,
-							  unsigned int bgid)
-{
-	return xa_load(&ctx->io_bl_xa, bgid);
-}
-
 static inline struct io_buffer_list *io_buffer_get_list(struct io_ring_ctx *ctx,
 							unsigned int bgid)
 {
 	lockdep_assert_held(&ctx->uring_lock);
 
-	return __io_buffer_get_list(ctx, bgid);
+	return xa_load(&ctx->io_bl_xa, bgid);
 }
 
 static int io_buffer_add_list(struct io_ring_ctx *ctx,
@@ -130,6 +117,27 @@ static void __user *io_provided_buffer_select(struct io_kiocb *req, size_t *len,
 	return NULL;
 }
 
+static int io_provided_buffers_select(struct io_kiocb *req, size_t *len,
+				      struct io_buffer_list *bl,
+				      struct iovec *iov)
+{
+	void __user *buf;
+
+	buf = io_provided_buffer_select(req, len, bl);
+	if (unlikely(!buf))
+		return -ENOBUFS;
+
+	iov[0].iov_base = buf;
+	iov[0].iov_len = *len;
+	return 0;
+}
+
+static struct io_uring_buf *io_ring_head_to_buf(struct io_uring_buf_ring *br,
+						__u16 head, __u16 mask)
+{
+	return &br->bufs[head & mask];
+}
+
 static void __user *io_ring_buffer_select(struct io_kiocb *req, size_t *len,
 					  struct io_buffer_list *bl,
 					  unsigned int issue_flags)
@@ -145,19 +153,10 @@ static void __user *io_ring_buffer_select(struct io_kiocb *req, size_t *len,
 	if (head + 1 == tail)
 		req->flags |= REQ_F_BL_EMPTY;
 
-	head &= bl->mask;
-	/* mmaped buffers are always contig */
-	if (bl->is_mmap || head < IO_BUFFER_LIST_BUF_PER_PAGE) {
-		buf = &br->bufs[head];
-	} else {
-		int off = head & (IO_BUFFER_LIST_BUF_PER_PAGE - 1);
-		int index = head / IO_BUFFER_LIST_BUF_PER_PAGE;
-		buf = page_address(bl->buf_pages[index]);
-		buf += off;
-	}
+	buf = io_ring_head_to_buf(br, head, bl->mask);
 	if (*len == 0 || *len > buf->len)
 		*len = buf->len;
-	req->flags |= REQ_F_BUFFER_RING;
+	req->flags |= REQ_F_BUFFER_RING | REQ_F_BUFFERS_COMMIT;
 	req->buf_list = bl;
 	req->buf_index = buf->bid;
 
@@ -172,6 +171,7 @@ static void __user *io_ring_buffer_select(struct io_kiocb *req, size_t *len,
 		 * the transfer completes (or if we get -EAGAIN and must poll of
 		 * retry).
 		 */
+		req->flags &= ~REQ_F_BUFFERS_COMMIT;
 		req->buf_list = NULL;
 		bl->head++;
 	}
@@ -198,22 +198,134 @@ void __user *io_buffer_select(struct io_kiocb *req, size_t *len,
 	return ret;
 }
 
-/*
- * Mark the given mapped range as free for reuse
- */
-static void io_kbuf_mark_free(struct io_ring_ctx *ctx, struct io_buffer_list *bl)
+/* cap it at a reasonable 256, will be one page even for 4K */
+#define PEEK_MAX_IMPORT		256
+
+static int io_ring_buffers_peek(struct io_kiocb *req, struct buf_sel_arg *arg,
+				struct io_buffer_list *bl)
+{
+	struct io_uring_buf_ring *br = bl->buf_ring;
+	struct iovec *iov = arg->iovs;
+	int nr_iovs = arg->nr_iovs;
+	__u16 nr_avail, tail, head;
+	struct io_uring_buf *buf;
+
+	tail = smp_load_acquire(&br->tail);
+	head = bl->head;
+	nr_avail = min_t(__u16, tail - head, UIO_MAXIOV);
+	if (unlikely(!nr_avail))
+		return -ENOBUFS;
+
+	buf = io_ring_head_to_buf(br, head, bl->mask);
+	if (arg->max_len) {
+		int needed;
+
+		needed = (arg->max_len + buf->len - 1) / buf->len;
+		needed = min(needed, PEEK_MAX_IMPORT);
+		if (nr_avail > needed)
+			nr_avail = needed;
+	}
+
+	/*
+	 * only alloc a bigger array if we know we have data to map, eg not
+	 * a speculative peek operation.
+	 */
+	if (arg->mode & KBUF_MODE_EXPAND && nr_avail > nr_iovs && arg->max_len) {
+		iov = kmalloc_array(nr_avail, sizeof(struct iovec), GFP_KERNEL);
+		if (unlikely(!iov))
+			return -ENOMEM;
+		if (arg->mode & KBUF_MODE_FREE)
+			kfree(arg->iovs);
+		arg->iovs = iov;
+		nr_iovs = nr_avail;
+	} else if (nr_avail < nr_iovs) {
+		nr_iovs = nr_avail;
+	}
+
+	/* set it to max, if not set, so we can use it unconditionally */
+	if (!arg->max_len)
+		arg->max_len = INT_MAX;
+
+	req->buf_index = buf->bid;
+	do {
+		/* truncate end piece, if needed */
+		if (buf->len > arg->max_len)
+			buf->len = arg->max_len;
+
+		iov->iov_base = u64_to_user_ptr(buf->addr);
+		iov->iov_len = buf->len;
+		iov++;
+
+		arg->out_len += buf->len;
+		arg->max_len -= buf->len;
+		if (!arg->max_len)
+			break;
+
+		buf = io_ring_head_to_buf(br, ++head, bl->mask);
+	} while (--nr_iovs);
+
+	if (head == tail)
+		req->flags |= REQ_F_BL_EMPTY;
+
+	req->flags |= REQ_F_BUFFER_RING;
+	req->buf_list = bl;
+	return iov - arg->iovs;
+}
+
+int io_buffers_select(struct io_kiocb *req, struct buf_sel_arg *arg,
+		      unsigned int issue_flags)
 {
-	struct io_buf_free *ibf;
+	struct io_ring_ctx *ctx = req->ctx;
+	struct io_buffer_list *bl;
+	int ret = -ENOENT;
 
-	hlist_for_each_entry(ibf, &ctx->io_buf_list, list) {
-		if (bl->buf_ring == ibf->mem) {
-			ibf->inuse = 0;
-			return;
+	io_ring_submit_lock(ctx, issue_flags);
+	bl = io_buffer_get_list(ctx, req->buf_index);
+	if (unlikely(!bl))
+		goto out_unlock;
+
+	if (bl->is_buf_ring) {
+		ret = io_ring_buffers_peek(req, arg, bl);
+		/*
+		 * Don't recycle these buffers if we need to go through poll.
+		 * Nobody else can use them anyway, and holding on to provided
+		 * buffers for a send/write operation would happen on the app
+		 * side anyway with normal buffers. Besides, we already
+		 * committed them, they cannot be put back in the queue.
+		 */
+		if (ret > 0) {
+			req->flags |= REQ_F_BL_NO_RECYCLE;
+			req->buf_list->head += ret;
 		}
+	} else {
+		ret = io_provided_buffers_select(req, &arg->out_len, bl, arg->iovs);
+	}
+out_unlock:
+	io_ring_submit_unlock(ctx, issue_flags);
+	return ret;
+}
+
+int io_buffers_peek(struct io_kiocb *req, struct buf_sel_arg *arg)
+{
+	struct io_ring_ctx *ctx = req->ctx;
+	struct io_buffer_list *bl;
+	int ret;
+
+	lockdep_assert_held(&ctx->uring_lock);
+
+	bl = io_buffer_get_list(ctx, req->buf_index);
+	if (unlikely(!bl))
+		return -ENOENT;
+
+	if (bl->is_buf_ring) {
+		ret = io_ring_buffers_peek(req, arg, bl);
+		if (ret > 0)
+			req->flags |= REQ_F_BUFFERS_COMMIT;
+		return ret;
 	}
 
-	/* can't happen... */
-	WARN_ON_ONCE(1);
+	/* don't support multiple buffer selections for legacy */
+	return io_provided_buffers_select(req, &arg->max_len, bl, arg->iovs);
 }
 
 static int __io_remove_buffers(struct io_ring_ctx *ctx,
@@ -227,22 +339,16 @@ static int __io_remove_buffers(struct io_ring_ctx *ctx,
 
 	if (bl->is_buf_ring) {
 		i = bl->buf_ring->tail - bl->head;
-		if (bl->is_mmap) {
-			/*
-			 * io_kbuf_list_free() will free the page(s) at
-			 * ->release() time.
-			 */
-			io_kbuf_mark_free(ctx, bl);
-			bl->buf_ring = NULL;
-			bl->is_mmap = 0;
-		} else if (bl->buf_nr_pages) {
+		if (bl->buf_nr_pages) {
 			int j;
 
-			for (j = 0; j < bl->buf_nr_pages; j++)
-				unpin_user_page(bl->buf_pages[j]);
-			kvfree(bl->buf_pages);
-			bl->buf_pages = NULL;
-			bl->buf_nr_pages = 0;
+			if (!bl->is_mmap) {
+				for (j = 0; j < bl->buf_nr_pages; j++)
+					unpin_user_page(bl->buf_pages[j]);
+			}
+			io_pages_unmap(bl->buf_ring, &bl->buf_pages,
+					&bl->buf_nr_pages, bl->is_mmap);
+			bl->is_mmap = 0;
 		}
 		/* make sure it's seen as empty */
 		INIT_LIST_HEAD(&bl->buf_list);
@@ -498,9 +604,9 @@ err:
 static int io_pin_pbuf_ring(struct io_uring_buf_reg *reg,
 			    struct io_buffer_list *bl)
 {
-	struct io_uring_buf_ring *br;
+	struct io_uring_buf_ring *br = NULL;
 	struct page **pages;
-	int i, nr_pages;
+	int nr_pages, ret;
 
 	pages = io_pin_pages(reg->ring_addr,
 			     flex_array_size(br, bufs, reg->ring_entries),
@@ -508,18 +614,12 @@ static int io_pin_pbuf_ring(struct io_uring_buf_reg *reg,
 	if (IS_ERR(pages))
 		return PTR_ERR(pages);
 
-	/*
-	 * Apparently some 32-bit boxes (ARM) will return highmem pages,
-	 * which then need to be mapped. We could support that, but it'd
-	 * complicate the code and slowdown the common cases quite a bit.
-	 * So just error out, returning -EINVAL just like we did on kernels
-	 * that didn't support mapped buffer rings.
-	 */
-	for (i = 0; i < nr_pages; i++)
-		if (PageHighMem(pages[i]))
-			goto error_unpin;
+	br = vmap(pages, nr_pages, VM_MAP, PAGE_KERNEL);
+	if (!br) {
+		ret = -ENOMEM;
+		goto error_unpin;
+	}
 
-	br = page_address(pages[0]);
 #ifdef SHM_COLOUR
 	/*
 	 * On platforms that have specific aliasing requirements, SHM_COLOUR
@@ -530,8 +630,10 @@ static int io_pin_pbuf_ring(struct io_uring_buf_reg *reg,
 	 * should use IOU_PBUF_RING_MMAP instead, and liburing will handle
 	 * this transparently.
 	 */
-	if ((reg->ring_addr | (unsigned long) br) & (SHM_COLOUR - 1))
+	if ((reg->ring_addr | (unsigned long) br) & (SHM_COLOUR - 1)) {
+		ret = -EINVAL;
 		goto error_unpin;
+	}
 #endif
 	bl->buf_pages = pages;
 	bl->buf_nr_pages = nr_pages;
@@ -540,69 +642,24 @@ static int io_pin_pbuf_ring(struct io_uring_buf_reg *reg,
 	bl->is_mmap = 0;
 	return 0;
 error_unpin:
-	for (i = 0; i < nr_pages; i++)
-		unpin_user_page(pages[i]);
+	unpin_user_pages(pages, nr_pages);
 	kvfree(pages);
-	return -EINVAL;
-}
-
-/*
- * See if we have a suitable region that we can reuse, rather than allocate
- * both a new io_buf_free and mem region again. We leave it on the list as
- * even a reused entry will need freeing at ring release.
- */
-static struct io_buf_free *io_lookup_buf_free_entry(struct io_ring_ctx *ctx,
-						    size_t ring_size)
-{
-	struct io_buf_free *ibf, *best = NULL;
-	size_t best_dist;
-
-	hlist_for_each_entry(ibf, &ctx->io_buf_list, list) {
-		size_t dist;
-
-		if (ibf->inuse || ibf->size < ring_size)
-			continue;
-		dist = ibf->size - ring_size;
-		if (!best || dist < best_dist) {
-			best = ibf;
-			if (!dist)
-				break;
-			best_dist = dist;
-		}
-	}
-
-	return best;
+	vunmap(br);
+	return ret;
 }
 
 static int io_alloc_pbuf_ring(struct io_ring_ctx *ctx,
 			      struct io_uring_buf_reg *reg,
 			      struct io_buffer_list *bl)
 {
-	struct io_buf_free *ibf;
 	size_t ring_size;
-	void *ptr;
 
 	ring_size = reg->ring_entries * sizeof(struct io_uring_buf_ring);
 
-	/* Reuse existing entry, if we can */
-	ibf = io_lookup_buf_free_entry(ctx, ring_size);
-	if (!ibf) {
-		ptr = io_mem_alloc(ring_size);
-		if (IS_ERR(ptr))
-			return PTR_ERR(ptr);
-
-		/* Allocate and store deferred free entry */
-		ibf = kmalloc(sizeof(*ibf), GFP_KERNEL_ACCOUNT);
-		if (!ibf) {
-			io_mem_free(ptr);
-			return -ENOMEM;
-		}
-		ibf->mem = ptr;
-		ibf->size = ring_size;
-		hlist_add_head(&ibf->list, &ctx->io_buf_list);
-	}
-	ibf->inuse = 1;
-	bl->buf_ring = ibf->mem;
+	bl->buf_ring = io_pages_map(&bl->buf_pages, &bl->buf_nr_pages, ring_size);
+	if (!bl->buf_ring)
+		return -ENOMEM;
+
 	bl->is_buf_ring = 1;
 	bl->is_mmap = 1;
 	return 0;
@@ -750,18 +807,19 @@ struct io_buffer_list *io_pbuf_get_bl(struct io_ring_ctx *ctx,
 	return ERR_PTR(-EINVAL);
 }
 
-/*
- * Called at or after ->release(), free the mmap'ed buffers that we used
- * for memory mapped provided buffer rings.
- */
-void io_kbuf_mmap_list_free(struct io_ring_ctx *ctx)
+int io_pbuf_mmap(struct file *file, struct vm_area_struct *vma)
 {
-	struct io_buf_free *ibf;
-	struct hlist_node *tmp;
+	struct io_ring_ctx *ctx = file->private_data;
+	loff_t pgoff = vma->vm_pgoff << PAGE_SHIFT;
+	struct io_buffer_list *bl;
+	int bgid, ret;
 
-	hlist_for_each_entry_safe(ibf, tmp, &ctx->io_buf_list, list) {
-		hlist_del(&ibf->list);
-		io_mem_free(ibf->mem);
-		kfree(ibf);
-	}
+	bgid = (pgoff & ~IORING_OFF_MMAP_MASK) >> IORING_OFF_PBUF_SHIFT;
+	bl = io_pbuf_get_bl(ctx, bgid);
+	if (IS_ERR(bl))
+		return PTR_ERR(bl);
+
+	ret = io_uring_mmap_pages(ctx, vma, bl->buf_pages, bl->buf_nr_pages);
+	io_put_bl(ctx, bl);
+	return ret;
 }
author	Linus Torvalds <torvalds@linux-foundation.org>	2024-05-13 12:48:06 -0700
committer	Linus Torvalds <torvalds@linux-foundation.org>	2024-05-13 12:48:06 -0700
commit	9961a785944601e32f185ea696347b22ffda634c (patch)
tree	24362a6360b2a1fa96c7eb562c6a95f96b37e459 /io_uring/kbuf.c
parent	f4e8d80292859809ea135e9f4c43bae47e4f58bc (diff)
parent	deb1e496a83557896fe0cca0b8af01c2a97c0dc6 (diff)