summaryrefslogtreecommitdiff
path: root/io_uring
AgeCommit message (Collapse)Author
2024-03-18io_uring/sqpoll: early exit thread if task_context wasn't allocatedJens Axboe
Ideally we'd want to simply kill the task rather than wake it, but for now let's just add a startup check that causes the thread to exit. This can only happen if io_uring_alloc_task_context() fails, which generally requires fault injection. Reported-by: Ubisectech Sirius <bugreport@ubisectech.com> Fixes: af5d68f8892f ("io_uring/sqpoll: manage task_work privately") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-16io_uring: clear opcode specific data for an early failureJens Axboe
If failure happens before the opcode prep handler is called, ensure that we clear the opcode specific area of the request, which holds data specific to that request type. This prevents errors where opcode handlers either don't get to clear per-request private data since prep isn't even called. Reported-and-tested-by: syzbot+f8e9a371388aa62ecab4@syzkaller.appspotmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-16io_uring/net: ensure async prep handlers always initialize ->done_ioJens Axboe
If we get a request with IOSQE_ASYNC set, then we first run the prep async handlers. But if we then fail setting it up and want to post a CQE with -EINVAL, we use ->done_io. This was previously guarded with REQ_F_PARTIAL_IO, and the normal setup handlers do set it up before any potential errors, but we need to cover the async setup too. Fixes: 9817ad85899f ("io_uring/net: remove dependency on REQ_F_PARTIAL_IO for sr->done_io") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-15io_uring/waitid: always remove waitid entry for cancel allJens Axboe
We know the request is either being removed, or already in the process of being removed through task_work, so we can delete it from our waitid list upfront. This is important for remove all conditions, as we otherwise will find it multiple times and prevent cancelation progress. Remove the dead check in cancelation as well for the hash_node being empty or not. We already have a waitid reference check for ownership, so we don't need to check the list too. Cc: stable@vger.kernel.org Fixes: f31ecf671ddc ("io_uring: add IORING_OP_WAITID support") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-15io_uring/futex: always remove futex entry for cancel allJens Axboe
We know the request is either being removed, or already in the process of being removed through task_work, so we can delete it from our futex list upfront. This is important for remove all conditions, as we otherwise will find it multiple times and prevent cancelation progress. Cc: stable@vger.kernel.org Fixes: 194bb58c6090 ("io_uring: add support for futex wake and wait") Fixes: 8f350194d5cf ("io_uring: add support for vectored futex waits") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-15io_uring: fix poll_remove stalled req completionPavel Begunkov
Taking the ctx lock is not enough to use the deferred request completion infrastructure, it'll get queued into the list but no one would expect it there, so it will sit there until next io_submit_flush_completions(). It's hard to care about the cancellation path, so complete it via tw. Fixes: ef7dfac51d8ed ("io_uring/poll: serialize poll linked timer start with poll removal") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/c446740bc16858f8a2a8dcdce899812f21d15f23.1710514702.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-13io_uring: Fix release of pinned pages when __io_uaddr_map failsGabriel Krisman Bertazi
Looking at the error path of __io_uaddr_map, if we fail after pinning the pages for any reasons, ret will be set to -EINVAL and the error handler won't properly release the pinned pages. I didn't manage to trigger it without forcing a failure, but it can happen in real life when memory is heavily fragmented. Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de> Fixes: 223ef4743164 ("io_uring: don't allow IORING_SETUP_NO_MMAP rings on highmem pages") Link: https://lore.kernel.org/r/20240313213912.1920-1-krisman@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-13io_uring/kbuf: rename is_mappedPavel Begunkov
In buffer lists we have ->is_mapped as well as ->is_mmap, it's pretty hard to stay sane double checking which one means what, and in the long run there is a high chance of an eventual bug. Rename ->is_mapped into ->is_buf_ring. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/c4838f4d8ad506ad6373f1c305aee2d2c1a89786.1710343154.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-13io_uring: simplify io_pages_freePavel Begunkov
We never pass a null (top-level) pointer, remove the check. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/0e1a46f9a5cd38e6876905e8030bdff9b0845e96.1710343154.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-12io_uring: clean rings on NO_MMAP alloc failPavel Begunkov
We make a few cancellation judgements based on ctx->rings, so let's zero it afer deallocation for IORING_SETUP_NO_MMAP just like it's done with the mmap case. Likely, it's not a real problem, but zeroing is safer and better tested. Cc: stable@vger.kernel.org Fixes: 03d89a2de25bbc ("io_uring: support for user allocated memory for rings/sqes") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/9ff6cdf91429b8a51699c210e1f6af6ea3f8bdcf.1710255382.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-12io_uring/rw: return IOU_ISSUE_SKIP_COMPLETE for multishot retryJens Axboe
If read multishot is being invoked from the poll retry handler, then we should return IOU_ISSUE_SKIP_COMPLETE rather than -EAGAIN. If not, then a CQE will be posted with -EAGAIN rather than triggering the retry when the file is flagged as readable again. Cc: stable@vger.kernel.org Reported-by: Sargun Dhillon <sargun@meta.com> Fixes: fc68fcda04910 ("io_uring/rw: add support for IORING_OP_READ_MULTISHOT") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-11io_uring: don't save/restore iowait stateJens Axboe
This kind of state is per-syscall, and since we're doing the waiting off entering the io_uring_enter(2) syscall, there's no way that iowait can already be set for this case. Simplify it by setting it if we need to, and always clearing it to 0 when done. Fixes: 7b72d661f1f2 ("io_uring: gate iowait schedule on having pending requests") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-11Merge tag 'for-6.9/io_uring-20240310' of git://git.kernel.dk/linuxLinus Torvalds
Pull io_uring updates from Jens Axboe: - Make running of task_work internal loops more fair, and unify how the different methods deal with them (me) - Support for per-ring NAPI. The two minor networking patches are in a shared branch with netdev (Stefan) - Add support for truncate (Tony) - Export SQPOLL utilization stats (Xiaobing) - Multishot fixes (Pavel) - Fix for a race in manipulating the request flags via poll (Pavel) - Cleanup the multishot checking by making it generic, moving it out of opcode handlers (Pavel) - Various tweaks and cleanups (me, Kunwu, Alexander) * tag 'for-6.9/io_uring-20240310' of git://git.kernel.dk/linux: (53 commits) io_uring: Fix sqpoll utilization check racing with dying sqpoll io_uring/net: dedup io_recv_finish req completion io_uring: refactor DEFER_TASKRUN multishot checks io_uring: fix mshot io-wq checks io_uring/net: add io_req_msg_cleanup() helper io_uring/net: simplify msghd->msg_inq checking io_uring/kbuf: rename REQ_F_PARTIAL_IO to REQ_F_BL_NO_RECYCLE io_uring/net: remove dependency on REQ_F_PARTIAL_IO for sr->done_io io_uring/net: correctly handle multishot recvmsg retry setup io_uring/net: clear REQ_F_BL_EMPTY in the multishot retry handler io_uring: fix io_queue_proc modifying req->flags io_uring: fix mshot read defer taskrun cqe posting io_uring/net: fix overflow check in io_recvmsg_mshot_prep() io_uring/net: correct the type of variable io_uring/sqpoll: statistics of the true utilization of sq threads io_uring/net: move recv/recvmsg flags out of retry loop io_uring/kbuf: flag request if buffer pool is empty after buffer pick io_uring/net: improve the usercopy for sendmsg/recvmsg io_uring/net: move receive multishot out of the generic msghdr path io_uring/net: unify how recvmsg and sendmsg copy in the msghdr ...
2024-03-09io_uring: Fix sqpoll utilization check racing with dying sqpollGabriel Krisman Bertazi
Commit 3fcb9d17206e ("io_uring/sqpoll: statistics of the true utilization of sq threads"), currently in Jens for-next branch, peeks at io_sq_data->thread to report utilization statistics. But, If io_uring_show_fdinfo races with sqpoll terminating, even though we hold the ctx lock, sqd->thread might be NULL and we hit the Oops below. Note that we could technically just protect the getrusage() call and the sq total/work time calculations. But showing some sq information (pid/cpu) and not other information (utilization) is more confusing than not reporting anything, IMO. So let's hide it all if we happen to race with a dying sqpoll. This can be triggered consistently in my vm setup running sqpoll-cancel-hang.t in a loop. BUG: kernel NULL pointer dereference, address: 00000000000007b0 PGD 0 P4D 0 Oops: 0000 [#1] PREEMPT SMP NOPTI CPU: 0 PID: 16587 Comm: systemd-coredum Not tainted 6.8.0-rc3-g3fcb9d17206e-dirty #69 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS unknown 2/2/2022 RIP: 0010:getrusage+0x21/0x3e0 Code: 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 55 48 89 d1 48 89 e5 41 57 41 56 41 55 41 54 49 89 fe 41 52 53 48 89 d3 48 83 ec 30 <4c> 8b a7 b0 07 00 00 48 8d 7a 08 65 48 8b 04 25 28 00 00 00 48 89 RSP: 0018:ffffa166c671bb80 EFLAGS: 00010282 RAX: 00000000000040ca RBX: ffffa166c671bc60 RCX: ffffa166c671bc60 RDX: ffffa166c671bc60 RSI: 0000000000000000 RDI: 0000000000000000 RBP: ffffa166c671bbe0 R08: ffff9448cc3930c0 R09: 0000000000000000 R10: ffffa166c671bd50 R11: ffffffff9ee89260 R12: 0000000000000000 R13: ffff9448ce099480 R14: 0000000000000000 R15: ffff9448cff5b000 FS: 00007f786e225900(0000) GS:ffff94493bc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00000000000007b0 CR3: 000000010d39c000 CR4: 0000000000750ef0 PKRU: 55555554 Call Trace: <TASK> ? __die_body+0x1a/0x60 ? page_fault_oops+0x154/0x440 ? srso_alias_return_thunk+0x5/0xfbef5 ? do_user_addr_fault+0x174/0x7c0 ? srso_alias_return_thunk+0x5/0xfbef5 ? exc_page_fault+0x63/0x140 ? asm_exc_page_fault+0x22/0x30 ? getrusage+0x21/0x3e0 ? seq_printf+0x4e/0x70 io_uring_show_fdinfo+0x9db/0xa10 ? srso_alias_return_thunk+0x5/0xfbef5 ? vsnprintf+0x101/0x4d0 ? srso_alias_return_thunk+0x5/0xfbef5 ? seq_vprintf+0x34/0x50 ? srso_alias_return_thunk+0x5/0xfbef5 ? seq_printf+0x4e/0x70 ? seq_show+0x16b/0x1d0 ? __pfx_io_uring_show_fdinfo+0x10/0x10 seq_show+0x16b/0x1d0 seq_read_iter+0xd7/0x440 seq_read+0x102/0x140 vfs_read+0xae/0x320 ? srso_alias_return_thunk+0x5/0xfbef5 ? __do_sys_newfstat+0x35/0x60 ksys_read+0xa5/0xe0 do_syscall_64+0x50/0x110 entry_SYSCALL_64_after_hwframe+0x6e/0x76 RIP: 0033:0x7f786ec1db4d Code: e8 46 e3 01 00 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 80 3d d9 ce 0e 00 00 74 17 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 5b c3 66 2e 0f 1f 84 00 00 00 00 00 48 83 ec RSP: 002b:00007ffcb361a4b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000000 RAX: ffffffffffffffda RBX: 000055a4c8fe42f0 RCX: 00007f786ec1db4d RDX: 0000000000000400 RSI: 000055a4c8fe48a0 RDI: 0000000000000006 RBP: 00007f786ecfb0b0 R08: 00007f786ecfb2a8 R09: 0000000000000001 R10: 0000000000000000 R11: 0000000000000246 R12: 00007f786ecfaf60 R13: 000055a4c8fe42f0 R14: 0000000000000000 R15: 00007ffcb361a628 </TASK> Modules linked in: CR2: 00000000000007b0 ---[ end trace 0000000000000000 ]--- RIP: 0010:getrusage+0x21/0x3e0 Code: 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 55 48 89 d1 48 89 e5 41 57 41 56 41 55 41 54 49 89 fe 41 52 53 48 89 d3 48 83 ec 30 <4c> 8b a7 b0 07 00 00 48 8d 7a 08 65 48 8b 04 25 28 00 00 00 48 89 RSP: 0018:ffffa166c671bb80 EFLAGS: 00010282 RAX: 00000000000040ca RBX: ffffa166c671bc60 RCX: ffffa166c671bc60 RDX: ffffa166c671bc60 RSI: 0000000000000000 RDI: 0000000000000000 RBP: ffffa166c671bbe0 R08: ffff9448cc3930c0 R09: 0000000000000000 R10: ffffa166c671bd50 R11: ffffffff9ee89260 R12: 0000000000000000 R13: ffff9448ce099480 R14: 0000000000000000 R15: ffff9448cff5b000 FS: 00007f786e225900(0000) GS:ffff94493bc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00000000000007b0 CR3: 000000010d39c000 CR4: 0000000000750ef0 PKRU: 55555554 Kernel panic - not syncing: Fatal exception Kernel Offset: 0x1ce00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff) Fixes: 3fcb9d17206e ("io_uring/sqpoll: statistics of the true utilization of sq threads") Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de> Link: https://lore.kernel.org/r/20240309003256.358-1-krisman@suse.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-08io_uring/net: dedup io_recv_finish req completionPavel Begunkov
There are two block in io_recv_finish() completing the request, which we can combine and remove jumping. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/0e338dcb33c88de83809fda021cba9e7c9681620.1709905727.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-08io_uring: refactor DEFER_TASKRUN multishot checksPavel Begunkov
We disallow DEFER_TASKRUN multishots from running by io-wq, which is checked by individual opcodes in the issue path. We can consolidate all it in io_wq_submit_work() at the same time moving the checks out of the hot path. Suggested-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/e492f0f11588bb5aa11d7d24e6f53b7c7628afdb.1709905727.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-08io_uring: fix mshot io-wq checksPavel Begunkov
When checking for concurrent CQE posting, we're not only interested in requests running from the poll handler but also strayed requests ended up in normal io-wq execution. We're disallowing multishots in general from io-wq, not only when they came in a certain way. Cc: stable@vger.kernel.org Fixes: 17add5cea2bba ("io_uring: force multishot CQEs into task context") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/d8c5b36a39258036f93301cd60d3cd295e40653d.1709905727.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-08io_uring/net: add io_req_msg_cleanup() helperJens Axboe
For the fast inline path, we manually recycle the io_async_msghdr and free the iovec, and then clear the REQ_F_NEED_CLEANUP flag to avoid that needing doing in the slower path. We already do that in 2 spots, and in preparation for adding more, add a helper and use it. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-08io_uring/net: simplify msghd->msg_inq checkingJens Axboe
Just check for larger than zero rather than check for non-zero and not -1. This is easier to read, and also protects against any errants < 0 values that aren't -1. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-08io_uring/kbuf: rename REQ_F_PARTIAL_IO to REQ_F_BL_NO_RECYCLEJens Axboe
We only use the flag for this purpose, so rename it accordingly. This further prevents various other use cases of it, keeping it clean and consistent. Then we can also check it in one spot, when it's being attempted recycled, and remove some dead code in io_kbuf_recycle_ring(). Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-08io_uring/net: remove dependency on REQ_F_PARTIAL_IO for sr->done_ioJens Axboe
Ensure that prep handlers always initialize sr->done_io before any potential failure conditions, and with that, we now it's always been set even for the failure case. With that, we don't need to use the REQ_F_PARTIAL_IO flag to gate on that. Additionally, we should not overwrite req->cqe.res unless sr->done_io is actually positive. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-07io_uring/net: correctly handle multishot recvmsg retry setupJens Axboe
If we loop for multishot receive on the initial attempt, and then abort later on to wait for more, we miss a case where we should be copying the io_async_msghdr from the stack to stable storage. This leads to the next retry potentially failing, if the application had the msghdr on the stack. Cc: stable@vger.kernel.org Fixes: 9bb66906f23e ("io_uring: support multishot in recvmsg") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-07io_uring/net: clear REQ_F_BL_EMPTY in the multishot retry handlerJens Axboe
This flag should not be persistent across retries, so ensure we clear it before potentially attemting a retry. Fixes: c3f9109dbc9e ("io_uring/kbuf: flag request if buffer pool is empty after buffer pick") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-07io_uring: fix io_queue_proc modifying req->flagsPavel Begunkov
With multiple poll entries __io_queue_proc() might be running in parallel with poll handlers and possibly task_work, we should not be carelessly modifying req->flags there. io_poll_double_prepare() handles a similar case with locking but it's much easier to move it into __io_arm_poll_handler(). Cc: stable@vger.kernel.org Fixes: 595e52284d24a ("io_uring/poll: don't enable lazy wake for POLLEXCLUSIVE") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/455cc49e38cf32026fa1b49670be8c162c2cb583.1709834755.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-07io_uring: fix mshot read defer taskrun cqe postingPavel Begunkov
We can't post CQEs from io-wq with DEFER_TASKRUN set, normal completions are handled but aux should be explicitly disallowed by opcode handlers. Cc: stable@vger.kernel.org Fixes: fc68fcda04910 ("io_uring/rw: add support for IORING_OP_READ_MULTISHOT") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/6fb7cba6f5366da25f4d3eb95273f062309d97fa.1709740837.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-04io_uring/net: fix overflow check in io_recvmsg_mshot_prep()Dan Carpenter
The "controllen" variable is type size_t (unsigned long). Casting it to int could lead to an integer underflow. The check_add_overflow() function considers the type of the destination which is type int. If we add two positive values and the result cannot fit in an integer then that's counted as an overflow. However, if we cast "controllen" to an int and it turns negative, then negative values *can* fit into an int type so there is no overflow. Good: 100 + (unsigned long)-4 = 96 <-- overflow Bad: 100 + (int)-4 = 96 <-- no overflow I deleted the cast of the sizeof() as well. That's not a bug but the cast is unnecessary. Fixes: 9b0fc3c054ff ("io_uring: fix types in io_recvmsg_multishot_overflow") Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Link: https://lore.kernel.org/r/138bd2e2-ede8-4bcc-aa7b-f3d9de167a37@moroto.mountain Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-04io_uring/net: correct the type of variableMuhammad Usama Anjum
The namelen is of type int. It shouldn't be made size_t which is unsigned. The signed number is needed for error checking before use. Fixes: c55978024d12 ("io_uring/net: move receive multishot out of the generic msghdr path") Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com> Link: https://lore.kernel.org/r/20240301144349.2807544-1-usama.anjum@collabora.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-01io_uring/sqpoll: statistics of the true utilization of sq threadsXiaobing Li
Count the running time and actual IO processing time of the sqpoll thread, and output the statistical data to fdinfo. Variable description: "work_time" in the code represents the sum of the jiffies of the sq thread actually processing IO, that is, how many milliseconds it actually takes to process IO. "total_time" represents the total time that the sq thread has elapsed from the beginning of the loop to the current time point, that is, how many milliseconds it has spent in total. The test tool is fio, and its parameters are as follows: [global] ioengine=io_uring direct=1 group_reporting bs=128k norandommap=1 randrepeat=0 refill_buffers ramp_time=30s time_based runtime=1m clocksource=clock_gettime overwrite=1 log_avg_msec=1000 numjobs=1 [disk0] filename=/dev/nvme0n1 rw=read iodepth=16 hipri sqthread_poll=1 The test results are as follows: Every 2.0s: cat /proc/9230/fdinfo/6 | grep -E Sq SqMask: 0x3 SqHead: 3197153 SqTail: 3197153 CachedSqHead: 3197153 SqThread: 9231 SqThreadCpu: 11 SqTotalTime: 18099614 SqWorkTime: 16748316 The test results corresponding to different iodepths are as follows: |-----------|-------|-------|-------|------|-------| | iodepth | 1 | 4 | 8 | 16 | 64 | |-----------|-------|-------|-------|------|-------| |utilization| 2.9% | 8.8% | 10.9% | 92.9%| 84.4% | |-----------|-------|-------|-------|------|-------| | idle | 97.1% | 91.2% | 89.1% | 7.1% | 15.6% | |-----------|-------|-------|-------|------|-------| Signed-off-by: Xiaobing Li <xiaobing.li@samsung.com> Link: https://lore.kernel.org/r/20240228091251.543383-1-xiaobing.li@samsung.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-03-01io_uring/net: move recv/recvmsg flags out of retry loopJens Axboe
The flags don't change, just intialize them once rather than every loop for multishot. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-27io_uring/kbuf: flag request if buffer pool is empty after buffer pickJens Axboe
Normally we do an extra roundtrip for retries even if the buffer pool has depleted, as we don't check that upfront. Rather than add this check, have the buffer selection methods mark the request with REQ_F_BL_EMPTY if the used buffer group is out of buffers after this selection. This is very cheap to do once we're all the way inside there anyway, and it gives the caller a chance to make better decisions on how to proceed. For example, recv/recvmsg multishot could check this flag when it decides whether to keep receiving or not. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-27io_uring/net: improve the usercopy for sendmsg/recvmsgJens Axboe
We're spending a considerable amount of the sendmsg/recvmsg time just copying in the message header. And for provided buffers, the known single entry iovec. Be a bit smarter about it and enable/disable user access around our copying. In a test case that does both sendmsg and recvmsg, the runtime before this change (averaged over multiple runs, very stable times however): Kernel Time Diff ==================================== -git 4720 usec -git+commit 4311 usec -8.7% and looking at a profile diff, we see the following: 0.25% +9.33% [kernel.kallsyms] [k] _copy_from_user 4.47% -3.32% [kernel.kallsyms] [k] __io_msg_copy_hdr.constprop.0 where we drop more than 9% of _copy_from_user() time, and consequently add time to __io_msg_copy_hdr() where the copies are now attributed to, but with a net win of 6%. In comparison, the same test case with send/recv runs in 3745 usec, which is (expectedly) still quite a bit faster. But at least sendmsg/recvmsg is now only ~13% slower, where it was ~21% slower before. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-27io_uring/net: move receive multishot out of the generic msghdr pathJens Axboe
Move the actual user_msghdr / compat_msghdr into the send and receive sides, respectively, so we can move the uaddr receive handling into its own handler, and ditto the multishot with buffer selection logic. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-27io_uring/net: unify how recvmsg and sendmsg copy in the msghdrJens Axboe
For recvmsg, we roll our own since we support buffer selections. This isn't the case for sendmsg right now, but in preparation for doing so, make the recvmsg copy helpers generic so we can call them from the sendmsg side as well. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-15io_uring/napi: enable even with a timeout of 0Jens Axboe
1 usec is not as short as it used to be, and it makes sense to allow 0 for a busy poll timeout - this means just do one loop to check if we have anything available. Add a separate ->napi_enabled to check if napi has been enabled or not. While at it, move the writing of the ctx napi values after we've copied the old values back to userspace. This ensures that if the call fails, we'll be in the same state as we were before, rather than some indeterminate state. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-15io_uring: kill stale comment for io_cqring_overflow_kill()Jens Axboe
This function now deals only with discarding overflow entries on ring free and exit, and it no longer returns whether we successfully flushed all entries as there's no CQE posting involved anymore. Kill the outdated comment. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-14io_uring/net: fix multishot accept overflow handlingJens Axboe
If we hit CQ ring overflow when attempting to post a multishot accept completion, we don't properly save the result or return code. This results in losing the accepted fd value. Instead, we return the result from the poll operation that triggered the accept retry. This is generally POLLIN|POLLPRI|POLLRDNORM|POLLRDBAND which is 0xc3, or 195, which looks like a valid file descriptor, but it really has no connection to that. Handle this like we do for other multishot completions - assign the result, and return IOU_STOP_MULTISHOT to cancel any further completions from this request when overflow is hit. This preserves the result, as we should, and tells the application that the request needs to be re-armed. Cc: stable@vger.kernel.org Fixes: 515e26961295 ("io_uring: revert "io_uring fix multishot accept ordering"") Link: https://github.com/axboe/liburing/issues/1062 Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-14io_uring/sqpoll: use the correct check for pending task_workJens Axboe
A previous commit moved to using just the private task_work list for SQPOLL, but it neglected to update the check for whether we have pending task_work. Normally this is fine as we'll attempt to run it unconditionally, but if we race with going to sleep AND task_work being added, then we certainly need the right check here. Fixes: af5d68f8892f ("io_uring/sqpoll: manage task_work privately") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-14io_uring: wake SQPOLL task when task_work is added to an empty queueJens Axboe
If there's no current work on the list, we still need to potentially wake the SQPOLL task if it is sleeping. This is ordered with the wait queue addition in sqpoll, which adds to the wait queue before checking for pending work items. Fixes: af5d68f8892f ("io_uring/sqpoll: manage task_work privately") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-14io_uring/napi: ensure napi polling is aborted when work is availableJens Axboe
While testing io_uring NAPI with DEFER_TASKRUN, I ran into slowdowns and stalls in packet delivery. Turns out that while io_napi_busy_loop_should_end() aborts appropriately on regular task_work, it does not abort if we have local task_work pending. Move io_has_work() into the private io_uring.h header, and gate whether we should continue polling on that as well. This makes NAPI polling on send/receive work as designed with IORING_SETUP_DEFER_TASKRUN as well. Fixes: 8d0c12a80cde ("io-uring: add napi busy poll support") Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-12io_uring: Don't include af_unix.h.Kuniyuki Iwashima
Changes to AF_UNIX trigger rebuild of io_uring, but io_uring does not use AF_UNIX anymore. Let's not include af_unix.h and instead include necessary headers. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://lore.kernel.org/r/20240212234236.63714-1-kuniyu@amazon.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-09io_uring: add register/unregister napi functionStefan Roesch
This adds an api to register and unregister the napi for io-uring. If the arg value is specified when unregistering, the current napi setting for the busy poll timeout is copied into the user structure. If this is not required, NULL can be passed as the arg value. Signed-off-by: Stefan Roesch <shr@devkernel.io> Acked-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/r/20230608163839.2891748-7-shr@devkernel.io Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-09io-uring: add sqpoll support for napi busy pollStefan Roesch
This adds the sqpoll support to the io-uring napi. Signed-off-by: Stefan Roesch <shr@devkernel.io> Suggested-by: Olivier Langlois <olivier@trillion01.com> Link: https://lore.kernel.org/r/20230608163839.2891748-6-shr@devkernel.io Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-09io-uring: add napi busy poll supportStefan Roesch
This adds the napi busy polling support in io_uring.c. It adds a new napi_list to the io_ring_ctx structure. This list contains the list of napi_id's that are currently enabled for busy polling. The list is synchronized by the new napi_lock spin lock. The current default napi busy polling time is stored in napi_busy_poll_to. If napi busy polling is not enabled, the value is 0. In addition there is also a hash table. The hash table store the napi id and the pointer to the above list nodes. The hash table is used to speed up the lookup to the list elements. The hash table is synchronized with rcu. The NAPI_TIMEOUT is stored as a timeout to make sure that the time a napi entry is stored in the napi list is limited. The busy poll timeout is also stored as part of the io_wait_queue. This is necessary as for sq polling the poll interval needs to be adjusted and the napi callback allows only to pass in one value. This has been tested with two simple programs from the liburing library repository: the napi client and the napi server program. The client sends a request, which has a timestamp in its payload and the server replies with the same payload. The client calculates the roundtrip time and stores it to calculate the results. The client is running on host1 and the server is running on host 2 (in the same rack). The measured times below are roundtrip times. They are average times over 5 runs each. Each run measures 1 million roundtrips. no rx coal rx coal: frames=88,usecs=33 Default 57us 56us client_poll=100us 47us 46us server_poll=100us 51us 46us client_poll=100us+ 40us 40us server_poll=100us client_poll=100us+ 41us 39us server_poll=100us+ prefer napi busy poll on client client_poll=100us+ 41us 39us server_poll=100us+ prefer napi busy poll on server client_poll=100us+ 41us 39us server_poll=100us+ prefer napi busy poll on client + server Signed-off-by: Stefan Roesch <shr@devkernel.io> Suggested-by: Olivier Langlois <olivier@trillion01.com> Acked-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/r/20230608163839.2891748-5-shr@devkernel.io Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-09io-uring: move io_wait_queue definition to header fileStefan Roesch
This moves the definition of the io_wait_queue structure to the header file so it can be also used from other files. Signed-off-by: Stefan Roesch <shr@devkernel.io> Link: https://lore.kernel.org/r/20230608163839.2891748-4-shr@devkernel.io Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-09io_uring: add support for ftruncateTony Solomonik
Adds support for doing truncate through io_uring, eliminating the need for applications to roll their own thread pool or offload mechanism to be able to do non-blocking truncates. Signed-off-by: Tony Solomonik <tony.solomonik@gmail.com> Link: https://lore.kernel.org/r/20240202121724.17461-3-tony.solomonik@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-08io_uring: Simplify the allocation of slab cachesKunwu Chan
commit 0a31bd5f2bbb ("KMEM_CACHE(): simplify slab cache creation") introduces a new macro. Use the new KMEM_CACHE() macro instead of direct kmem_cache_create to simplify the creation of SLAB caches. Signed-off-by: Kunwu Chan <chentao@kylinos.cn> Link: https://lore.kernel.org/r/20240130100247.81460-1-chentao@kylinos.cn Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-08io_uring/sqpoll: manage task_work privatelyJens Axboe
Decouple from task_work running, and cap the number of entries we process at the time. If we exceed that number, push remaining entries to a retry list that we'll process first next time. We cap the number of entries to process at 8, which is fairly random. We just want to get enough per-ctx batching here, while not processing endlessly. Since we manually run PF_IO_WORKER related task_work anyway as the task never exits to userspace, with this we no longer need to add an actual task_work item to the per-process list. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-08io_uring: pass in counter to handle_tw_list() rather than return itJens Axboe
No functional changes in this patch, just in preparation for returning something other than count from this helper. Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-08io_uring: cleanup handle_tw_list() calling conventionJens Axboe
Now that we don't loop around task_work anymore, there's no point in maintaining the ring and locked state outside of handle_tw_list(). Get rid of passing in those pointers (and pointers to pointers) and just do the management internally in handle_tw_list(). Signed-off-by: Jens Axboe <axboe@kernel.dk>
2024-02-08io_uring/poll: improve readability of poll reference decrementingJens Axboe
This overly long line is hard to read. Break it up by AND'ing the ref mask first, then perform the atomic_sub_return() with the value itself. Signed-off-by: Jens Axboe <axboe@kernel.dk>