1. 02 September 2020, 29 commits
    • io_uring: Fix NULL pointer dereference in loop_rw_iter() · 58511eb2
      Guoyu Huang authored
      fix #29760246
      
      Cherry-pick 2dd2111d0d383df104b144e0d1f6b5a00cb7cd88 from io_uring-5.9.
      
      loop_rw_iter() does not check whether the file has a read or
      write function. This can lead to a NULL pointer dereference
      when the user passes in a file descriptor that has neither
      (a minimal sketch of the missing check follows this entry).
      
      The crash log looks like this:
      
      [   99.834071] BUG: kernel NULL pointer dereference, address: 0000000000000000
      [   99.835364] #PF: supervisor instruction fetch in kernel mode
      [   99.836522] #PF: error_code(0x0010) - not-present page
      [   99.837771] PGD 8000000079d62067 P4D 8000000079d62067 PUD 79d8c067 PMD 0
      [   99.839649] Oops: 0010 [#2] SMP PTI
      [   99.840591] CPU: 1 PID: 333 Comm: io_wqe_worker-0 Tainted: G      D           5.8.0 #2
      [   99.842622] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1 04/01/2014
      [   99.845140] RIP: 0010:0x0
      [   99.845840] Code: Bad RIP value.
      [   99.846672] RSP: 0018:ffffa1c7c01ebc08 EFLAGS: 00010202
      [   99.848018] RAX: 0000000000000000 RBX: ffff92363bd67300 RCX: ffff92363d461208
      [   99.849854] RDX: 0000000000000010 RSI: 00007ffdbf696bb0 RDI: ffff92363bd67300
      [   99.851743] RBP: ffffa1c7c01ebc40 R08: 0000000000000000 R09: 0000000000000000
      [   99.853394] R10: ffffffff9ec692a0 R11: 0000000000000000 R12: 0000000000000010
      [   99.855148] R13: 0000000000000000 R14: ffff92363d461208 R15: ffffa1c7c01ebc68
      [   99.856914] FS:  0000000000000000(0000) GS:ffff92363dd00000(0000) knlGS:0000000000000000
      [   99.858651] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [   99.860032] CR2: ffffffffffffffd6 CR3: 000000007ac66000 CR4: 00000000000006e0
      [   99.861979] Call Trace:
      [   99.862617]  loop_rw_iter.part.0+0xad/0x110
      [   99.863838]  io_write+0x2ae/0x380
      [   99.864644]  ? kvm_sched_clock_read+0x11/0x20
      [   99.865595]  ? sched_clock+0x9/0x10
      [   99.866453]  ? sched_clock_cpu+0x11/0xb0
      [   99.867326]  ? newidle_balance+0x1d4/0x3c0
      [   99.868283]  io_issue_sqe+0xd8f/0x1340
      [   99.869216]  ? __switch_to+0x7f/0x450
      [   99.870280]  ? __switch_to_asm+0x42/0x70
      [   99.871254]  ? __switch_to_asm+0x36/0x70
      [   99.872133]  ? lock_timer_base+0x72/0xa0
      [   99.873155]  ? switch_mm_irqs_off+0x1bf/0x420
      [   99.874152]  io_wq_submit_work+0x64/0x180
      [   99.875192]  ? kthread_use_mm+0x71/0x100
      [   99.876132]  io_worker_handle_work+0x267/0x440
      [   99.877233]  io_wqe_worker+0x297/0x350
      [   99.878145]  kthread+0x112/0x150
      [   99.878849]  ? __io_worker_unuse+0x100/0x100
      [   99.879935]  ? kthread_park+0x90/0x90
      [   99.880874]  ret_from_fork+0x22/0x30
      [   99.881679] Modules linked in:
      [   99.882493] CR2: 0000000000000000
      [   99.883324] ---[ end trace 4453745f4673190b ]---
      [   99.884289] RIP: 0010:0x0
      [   99.884837] Code: Bad RIP value.
      [   99.885492] RSP: 0018:ffffa1c7c01ebc08 EFLAGS: 00010202
      [   99.886851] RAX: 0000000000000000 RBX: ffff92363acd7f00 RCX: ffff92363d461608
      [   99.888561] RDX: 0000000000000010 RSI: 00007ffe040d9e10 RDI: ffff92363acd7f00
      [   99.890203] RBP: ffffa1c7c01ebc40 R08: 0000000000000000 R09: 0000000000000000
      [   99.891907] R10: ffffffff9ec692a0 R11: 0000000000000000 R12: 0000000000000010
      [   99.894106] R13: 0000000000000000 R14: ffff92363d461608 R15: ffffa1c7c01ebc68
      [   99.896079] FS:  0000000000000000(0000) GS:ffff92363dd00000(0000) knlGS:0000000000000000
      [   99.898017] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [   99.899197] CR2: ffffffffffffffd6 CR3: 000000007ac66000 CR4: 00000000000006e0
      
      Fixes: 32960613b7c3 ("io_uring: correctly handle non ->{read,write}_iter() file_operations")
      Cc: stable@vger.kernel.org
      Signed-off-by: Guoyu Huang <hgy5945@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      58511eb2
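
      The fix amounts to refusing to fall back to loop_rw_iter() when the file
      has no ->read/->write method. A minimal userspace sketch of the same
      guard-the-function-pointer pattern (hypothetical ops struct, not the
      actual kernel patch):

      #include <stdio.h>
      #include <stddef.h>
      #include <sys/types.h>

      /* Hypothetical ops table: either handler may be NULL, like file->f_op. */
      struct file_ops {
              ssize_t (*read)(void *buf, size_t len);
              ssize_t (*write)(const void *buf, size_t len);
      };

      static ssize_t do_write(const struct file_ops *ops, const void *buf, size_t len)
      {
              /* The bug was calling the handler unconditionally; check it first. */
              if (!ops->write)
                      return -1;      /* the kernel returns -EINVAL here */
              return ops->write(buf, len);
      }

      int main(void)
      {
              struct file_ops ops = { .read = NULL, .write = NULL };

              if (do_write(&ops, "x", 1) < 0)
                      printf("no ->write, refuse instead of crashing\n");
              return 0;
      }
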
    • io_uring: clear IORING_SQ_NEED_WAKEUP after executing task works · cd40824f
      Xiaoguang Wang authored
      fix #29605829
      
      commit 23b3628e45924419399da48c2b3a522b05557c91 upstream
      
      In io_sq_thread(), if there are task works to handle, the current code
      skips schedule() and goes on polling the sq again, but forgets to clear
      the IORING_SQ_NEED_WAKEUP flag; fix this. Also add two helpers to set
      and clear the IORING_SQ_NEED_WAKEUP flag (see the sketch after this
      entry).
      Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      cd40824f
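
      A minimal sketch of the set/clear helper pattern for a shared flags word
      (constant and names below are stand-ins; the real helpers operate on the
      ring's sq flags, with the appropriate barriers):

      #include <stdio.h>

      #define RING_NEED_WAKEUP (1U << 0)   /* stand-in for IORING_SQ_NEED_WAKEUP */

      static void ring_set_wakeup_flag(unsigned int *sq_flags)
      {
              *sq_flags |= RING_NEED_WAKEUP;
      }

      static void ring_clear_wakeup_flag(unsigned int *sq_flags)
      {
              *sq_flags &= ~RING_NEED_WAKEUP;
      }

      int main(void)
      {
              unsigned int sq_flags = 0;

              ring_set_wakeup_flag(&sq_flags);     /* about to sleep, ask userspace for a wakeup */
              /* ... task works arrive, so we keep polling instead of sleeping ... */
              ring_clear_wakeup_flag(&sq_flags);   /* the step the bug forgot */
              printf("sq_flags=%#x\n", sq_flags);
              return 0;
      }
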
    • io_uring: fix lockup in io_fail_links() · 18ca51d0
      Pavel Begunkov authored
      to #29608102
      
      commit 4ae6dbd683860b9edc254ea8acf5e04b5ae242e5 upstream.
      
      io_fail_links() doesn't consider REQ_F_COMP_LOCKED, leading to a nested
      spin_lock(completion_lock) and a lockup (see the sketch after this entry).
      
      [  197.680409] rcu: INFO: rcu_preempt detected expedited stalls on
      	CPUs/tasks: { 6-... } 18239 jiffies s: 1421 root: 0x40/.
      [  197.680411] rcu: blocking rcu_node structures:
      [  197.680412] Task dump for CPU 6:
      [  197.680413] link-timeout    R  running task        0  1669
      	1 0x8000008a
      [  197.680414] Call Trace:
      [  197.680420]  ? io_req_find_next+0xa0/0x200
      [  197.680422]  ? io_put_req_find_next+0x2a/0x50
      [  197.680423]  ? io_poll_task_func+0xcf/0x140
      [  197.680425]  ? task_work_run+0x67/0xa0
      [  197.680426]  ? do_exit+0x35d/0xb70
      [  197.680429]  ? syscall_trace_enter+0x187/0x2c0
      [  197.680430]  ? do_group_exit+0x43/0xa0
      [  197.680448]  ? __x64_sys_exit_group+0x18/0x20
      [  197.680450]  ? do_syscall_64+0x52/0xa0
      [  197.680452]  ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      18ca51d0
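
      The fix makes the link-failing path honour a "completion lock already
      held" marker (the role REQ_F_COMP_LOCKED plays) instead of taking the
      lock a second time. A pthread sketch of that convention, with made-up
      names and a heavily simplified shape:

      #include <pthread.h>
      #include <stdbool.h>
      #include <stdio.h>

      static pthread_mutex_t completion_lock = PTHREAD_MUTEX_INITIALIZER;

      /* Complete a request; 'locked' says the caller already holds completion_lock. */
      static void complete_req(int res, bool locked)
      {
              if (!locked)
                      pthread_mutex_lock(&completion_lock);
              printf("completed with %d\n", res);
              if (!locked)
                      pthread_mutex_unlock(&completion_lock);
      }

      static void fail_links_locked(void)
      {
              pthread_mutex_lock(&completion_lock);
              /* Must pass locked=true, otherwise we'd deadlock on the non-recursive lock. */
              complete_req(-125 /* -ECANCELED */, true);
              pthread_mutex_unlock(&completion_lock);
      }

      int main(void)
      {
              fail_links_locked();
              complete_req(0, false);
              return 0;
      }
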
    • io_uring: fix ->work corruption with poll_add · 7943eeed
      Pavel Begunkov authored
      to #29608102
      
      commit d5e16d8e23825304c6a9945116cc6b6f8d51f28c upstream.
      
      req->work might already be initialised by the time it gets into
      __io_arm_poll_handler(), which will corrupt it by using fields that are
      in a union with req->work. Luckily, the only side effect is a missing
      put_creds(). Clean req->work before going there.
      Suggested-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      7943eeed
    • io_uring: missed req_init_async() for IOSQE_ASYNC · e6508674
      Pavel Begunkov authored
      to #29608102
      
      commit 3e863ea3bb1a2203ae648eb272db0ce6a1a2072c upstream.
      
      The IOSQE_ASYNC branch of io_queue_sqe() is another place where an
      uninitialised req->work can be accessed (i.e. prior to io_req_init_async()).
      Nothing really bad though, it just loses the IO_WQ_WORK_CONCURRENT flag.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      e6508674
    • io_uring: always allow drain/link/hardlink/async sqe flags · 6e5cbea1
      Daniele Albano authored
      to #29608102
      
      commit 61710e437f2807e26a3402543bdbb7217a9c8620 upstream.
      
      We currently filter these for timeout_remove/async_cancel/files_update,
      but we should only be filtering for fixed file and buffer select. This
      also causes a second read of sqe->flags, which isn't needed.

      Just check req->flags for the relevant bits (see the sketch after this
      entry). This then allows these commands to be used in links, for
      example, like everything else.
      Signed-off-by: Daniele Albano <d.albano@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      6e5cbea1
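
      A sketch of the "check only the relevant bits" idea: validate req->flags
      against a mask of the flags that are actually invalid for these commands,
      instead of re-reading sqe->flags and filtering everything. The constants
      below are illustrative, not the real sqe flag values:

      #include <stdio.h>

      /* Illustrative stand-ins for the request flag bits. */
      #define REQ_FIXED_FILE    (1U << 0)
      #define REQ_IO_DRAIN      (1U << 1)
      #define REQ_LINK          (1U << 2)
      #define REQ_HARDLINK      (1U << 3)
      #define REQ_ASYNC         (1U << 4)
      #define REQ_BUFFER_SELECT (1U << 5)

      /* Only these two make no sense for timeout_remove/async_cancel/files_update. */
      #define REQ_INVALID_FOR_SPECIAL (REQ_FIXED_FILE | REQ_BUFFER_SELECT)

      static int check_flags(unsigned int req_flags)
      {
              if (req_flags & REQ_INVALID_FOR_SPECIAL)
                      return -22;      /* -EINVAL */
              return 0;
      }

      int main(void)
      {
              /* Drain/link/hardlink/async are now allowed... */
              printf("%d\n", check_flags(REQ_LINK | REQ_ASYNC));   /* 0   */
              /* ...but fixed-file/buffer-select still are not. */
              printf("%d\n", check_flags(REQ_BUFFER_SELECT));      /* -22 */
              return 0;
      }
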
    • io_uring: ensure double poll additions work with both request types · 8d834b0c
      Jens Axboe authored
      to #29608102
      
      commit 807abcb0883439af5ead73f3308310453b97b624 upstream.
      
      The double poll additions were centered around doing POLL_ADD on file
      descriptors that use more than one waitqueue (typically one for read,
      one for write) when being polled. However, it can also end up being
      triggered when we use poll-triggered retry. For that case, we cannot
      safely use req->io, as that could be used by the request type itself.
      
      Add a second io_poll_iocb pointer in the structure we allocate for poll
      based retry, and ensure we use the right one from the two paths.
      
      Fixes: 18bceab101ad ("io_uring: allow POLL_ADD with double poll_wait() users")
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      8d834b0c
    • io_uring: fix recvmsg memory leak with buffer selection · fe15220e
      Pavel Begunkov authored
      to #29441901
      
      commit 681fda8d27a66f7e65ff7f2d200d7635e64a8d05 upstream.
      
      io_recvmsg() doesn't free the memory allocated for struct io_buffer.
      This can cause a leak when used with automatic buffer selection.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      fe15220e
    • io_uring: fix not initialised work->flags · 109831ff
      Pavel Begunkov authored
      to #29276773
      
      commit 16d598030a37853a7a6b4384cad19c9c0af2f021 upstream.
      
      Commit 59960b9deb535 ("io_uring: fix lazy work init") tried to fix a
      missing io_req_init_async(), but left out work.flags and hash. Do it earlier.
      
      Fixes: 7cdaf587de7c ("io_uring: avoid whole io_wq_work copy for requests completed inline")
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      109831ff
    • io_uring: fix missing msg_name assignment · 589ee219
      Pavel Begunkov authored
      to #29276773
      
      commit dd821e0c95a64b5923a0c57f07d3f7563553e756 upstream.
      
      Ensure msg.msg_name is set for the async portion of send/recvmsg,
      as the header copy will copy to/from it.
      
      Cc: stable@vger.kernel.org # v5.5+
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      589ee219
    • io_uring: account user memory freed when exit has been queued · 8c9ebb73
      Jens Axboe authored
      to #29276773
      
      commit 309fc03a3284af62eb6082fb60327045a1dabf57 upstream.
      
      We currently account the memory after the exit work has been run, but
      that leaves a gap between a process closing its ring and the memory
      being accounted as freed. If the memlocked ulimit is borderline, that
      can introduce spurious setup errors returning -ENOMEM because the free
      work hasn't been run yet.

      Account this as freed when we close the ring, so as not to expose a
      tiny gap where setting up a new ring can fail.
      
      Fixes: 85faa7b8346e ("io_uring: punt final io_ring_ctx wait-and-free to workqueue")
      Cc: stable@vger.kernel.org # v5.7
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      8c9ebb73
    • io_uring: fix memleak in io_sqe_files_register() · c1f5f815
      Yang Yingliang authored
      to #29276773
      
      commit 667e57da358f61b6966e12e925a69e42d912e8bb upstream.
      
      I got a memleak report when doing some fuzz test:
      
      BUG: memory leak
      unreferenced object 0x607eeac06e78 (size 8):
        comm "test", pid 295, jiffies 4294735835 (age 31.745s)
        hex dump (first 8 bytes):
          00 00 00 00 00 00 00 00                          ........
        backtrace:
          [<00000000932632e6>] percpu_ref_init+0x2a/0x1b0
          [<0000000092ddb796>] __io_uring_register+0x111d/0x22a0
          [<00000000eadd6c77>] __x64_sys_io_uring_register+0x17b/0x480
          [<00000000591b89a6>] do_syscall_64+0x56/0xa0
          [<00000000864a281d>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Call percpu_ref_exit() on the error path to avoid the refcount memleak
      (see the sketch after this entry).
      
      Fixes: 05f3fb3c5397 ("io_uring: avoid ring quiesce for fixed file set unregister and update")
      Cc: stable@vger.kernel.org
      Reported-by: Hulk Robot <hulkci@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      c1f5f815
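
      The underlying pattern is the usual "undo earlier init steps when a
      later step fails". A generic sketch with goto-based unwinding, using
      hypothetical setup steps standing in for percpu_ref_init()/percpu_ref_exit():

      #include <stdio.h>
      #include <stdlib.h>

      struct ctx { int *counter; char *name; };

      static int ctx_setup(struct ctx *c)
      {
              c->counter = malloc(sizeof(*c->counter));   /* step 1, like percpu_ref_init() */
              if (!c->counter)
                      return -1;

              c->name = malloc(32);                       /* step 2 may fail */
              if (!c->name)
                      goto err_counter;                   /* undo step 1, like percpu_ref_exit() */

              return 0;

      err_counter:
              free(c->counter);
              c->counter = NULL;
              return -1;
      }

      int main(void)
      {
              struct ctx c = { 0 };

              if (ctx_setup(&c) == 0) {
                      free(c.name);
                      free(c.counter);
              }
              return 0;
      }
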
    • io_uring: fix memleak in __io_sqe_files_update() · 4b392775
      Yang Yingliang authored
      to #29276773
      
      commit f3bd9dae3708a0ff6b067e766073ffeb853301f9 upstream.
      
      I got a memleak report when doing some fuzz test:
      
      BUG: memory leak
      unreferenced object 0xffff888113e02300 (size 488):
      comm "syz-executor401", pid 356, jiffies 4294809529 (age 11.954s)
      hex dump (first 32 bytes):
      00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
      a0 a4 ce 19 81 88 ff ff 60 ce 09 0d 81 88 ff ff ........`.......
      backtrace:
      [<00000000129a84ec>] kmem_cache_zalloc include/linux/slab.h:659 [inline]
      [<00000000129a84ec>] __alloc_file+0x25/0x310 fs/file_table.c:101
      [<000000003050ad84>] alloc_empty_file+0x4f/0x120 fs/file_table.c:151
      [<000000004d0a41a3>] alloc_file+0x5e/0x550 fs/file_table.c:193
      [<000000002cb242f0>] alloc_file_pseudo+0x16a/0x240 fs/file_table.c:233
      [<00000000046a4baa>] anon_inode_getfile fs/anon_inodes.c:91 [inline]
      [<00000000046a4baa>] anon_inode_getfile+0xac/0x1c0 fs/anon_inodes.c:74
      [<0000000035beb745>] __do_sys_perf_event_open+0xd4a/0x2680 kernel/events/core.c:11720
      [<0000000049009dc7>] do_syscall_64+0x56/0xa0 arch/x86/entry/common.c:359
      [<00000000353731ca>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      BUG: memory leak
      unreferenced object 0xffff8881152dd5e0 (size 16):
      comm "syz-executor401", pid 356, jiffies 4294809529 (age 11.954s)
      hex dump (first 16 bytes):
      01 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00 ................
      backtrace:
      [<0000000074caa794>] kmem_cache_zalloc include/linux/slab.h:659 [inline]
      [<0000000074caa794>] lsm_file_alloc security/security.c:567 [inline]
      [<0000000074caa794>] security_file_alloc+0x32/0x160 security/security.c:1440
      [<00000000c6745ea3>] __alloc_file+0xba/0x310 fs/file_table.c:106
      [<000000003050ad84>] alloc_empty_file+0x4f/0x120 fs/file_table.c:151
      [<000000004d0a41a3>] alloc_file+0x5e/0x550 fs/file_table.c:193
      [<000000002cb242f0>] alloc_file_pseudo+0x16a/0x240 fs/file_table.c:233
      [<00000000046a4baa>] anon_inode_getfile fs/anon_inodes.c:91 [inline]
      [<00000000046a4baa>] anon_inode_getfile+0xac/0x1c0 fs/anon_inodes.c:74
      [<0000000035beb745>] __do_sys_perf_event_open+0xd4a/0x2680 kernel/events/core.c:11720
      [<0000000049009dc7>] do_syscall_64+0x56/0xa0 arch/x86/entry/common.c:359
      [<00000000353731ca>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      If io_sqe_file_register() fails, we need to put the file obtained by
      fget() to avoid the memleak (see the sketch after this entry).
      
      Fixes: c3a31e605620 ("io_uring: add support for IORING_REGISTER_FILES_UPDATE")
      Cc: stable@vger.kernel.org
      Reported-by: Hulk Robot <hulkci@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      4b392775
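
      Same family of bug: a reference taken with fget() must be dropped again
      if the registration step that would take ownership of it fails. A
      refcount-style sketch with hypothetical get/put and register functions:

      #include <stdio.h>

      struct object { int refcount; };

      static void obj_get(struct object *o) { o->refcount++; }   /* like fget() */
      static void obj_put(struct object *o) { o->refcount--; }   /* like fput() */

      /* Hypothetical registration step that may fail. */
      static int register_obj(struct object *o, int fail)
      {
              (void)o;
              return fail ? -1 : 0;
      }

      static int update_one(struct object *o, int fail)
      {
              obj_get(o);
              if (register_obj(o, fail)) {
                      obj_put(o);             /* the put that was missing in the bug */
                      return -1;
              }
              /* on success the file table now owns the reference */
              return 0;
      }

      int main(void)
      {
              struct object o = { .refcount = 1 };

              update_one(&o, 1 /* force a failure */);
              printf("refcount=%d (still 1, nothing leaked)\n", o.refcount);
              return 0;
      }
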
    • io_uring: export cq overflow status to userspace · c7f865a4
      Xiaoguang Wang authored
      to #29233603
      
      commit 6d5f904904608a9cd32854d7d0a4dd65b27f9935 upstream
      
      Applications that are not willing to use io_uring_enter() to reap and
      handle cqes may rely completely on liburing's io_uring_peek_cqe(). But
      if the cq ring has overflowed, io_uring_peek_cqe() is currently not
      aware of it and won't enter the kernel to flush cqes; the test program
      below reveals this bug:
      
      static void test_cq_overflow(struct io_uring *ring)
      {
              struct io_uring_cqe *cqe;
              struct io_uring_sqe *sqe;
              int issued = 0;
              int ret = 0;
      
              do {
                      sqe = io_uring_get_sqe(ring);
                      if (!sqe) {
                              fprintf(stderr, "get sqe failed\n");
                              break;
                      }
                      ret = io_uring_submit(ring);
                      if (ret <= 0) {
                              if (ret != -EBUSY)
                                      fprintf(stderr, "sqe submit failed: %d\n", ret);
                              break;
                      }
                      issued++;
              } while (ret > 0);
              assert(ret == -EBUSY);
      
              printf("issued requests: %d\n", issued);
      
              while (issued) {
                      ret = io_uring_peek_cqe(ring, &cqe);
                      if (ret) {
                              if (ret != -EAGAIN) {
                                      fprintf(stderr, "peek completion failed: %s\n",
                                              strerror(ret));
                                      break;
                              }
                              printf("left requets: %d\n", issued);
                              continue;
                      }
                      io_uring_cqe_seen(ring, cqe);
                      issued--;
                      printf("left requets: %d\n", issued);
              }
      }
      
      int main(int argc, char *argv[])
      {
              int ret;
              struct io_uring ring;
      
              ret = io_uring_queue_init(16, &ring, 0);
              if (ret) {
                      fprintf(stderr, "ring setup failed: %d\n", ret);
                      return 1;
              }
      
              test_cq_overflow(&ring);
              return 0;
      }
      
      To fix this issue, export the cq overflow status to userspace by adding
      a new IORING_SQ_CQ_OVERFLOW flag; helper functions in liburing, such as
      io_uring_peek_cqe(), can then be aware of the cq overflow and flush
      accordingly (see the sketch after this entry).
      Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      c7f865a4
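
      On the userspace side, a peek-style helper can now test the SQ ring
      flags for IORING_SQ_CQ_OVERFLOW and enter the kernel just to flush. A
      rough sketch, assuming the mmap'ed flags word is at hand and that the
      libc exposes __NR_io_uring_enter; liburing's real implementation differs:

      #include <stdio.h>
      #include <unistd.h>
      #include <sys/syscall.h>
      #include <linux/io_uring.h>

      #ifndef IORING_SQ_CQ_OVERFLOW
      #define IORING_SQ_CQ_OVERFLOW (1U << 1)   /* value added by this patch, in case the header predates it */
      #endif

      /* If the kernel reports a CQ overflow, flush it with a plain GETEVENTS enter. */
      static int maybe_flush_overflow(int ring_fd, const unsigned *sq_kflags)
      {
              if (*sq_kflags & IORING_SQ_CQ_OVERFLOW)
                      return (int)syscall(__NR_io_uring_enter, ring_fd, 0, 0,
                                          IORING_ENTER_GETEVENTS, NULL, 0);
              return 0;
      }

      int main(void)
      {
              unsigned flags = 0;   /* stand-in for the mmap'ed SQ ring flags word */

              printf("%d\n", maybe_flush_overflow(-1, &flags));   /* 0: nothing to flush */
              return 0;
      }
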
    • io_uring: fix current->mm NULL dereference on exit · 1f9ef808
      Pavel Begunkov authored
      to #29197839
      
      commit d60b5fbc1ce8210759b568da49d149b868e7c6d3 upstream.
      
      Don't reissue requests from io_iopoll_reap_events(); the task may not
      have an mm, which ends up dereferencing a NULL current->mm. It's better
      to kill everything off on exit anyway.
      
      [  677.734670] RIP: 0010:io_iopoll_complete+0x27e/0x630
      ...
      [  677.734679] Call Trace:
      [  677.734695]  ? __send_signal+0x1f2/0x420
      [  677.734698]  ? _raw_spin_unlock_irqrestore+0x24/0x40
      [  677.734699]  ? send_signal+0xf5/0x140
      [  677.734700]  io_iopoll_getevents+0x12f/0x1a0
      [  677.734702]  io_iopoll_reap_events.part.0+0x5e/0xa0
      [  677.734703]  io_ring_ctx_wait_and_kill+0x132/0x1c0
      [  677.734704]  io_uring_release+0x20/0x30
      [  677.734706]  __fput+0xcd/0x230
      [  677.734707]  ____fput+0xe/0x10
      [  677.734709]  task_work_run+0x67/0xa0
      [  677.734710]  do_exit+0x35d/0xb70
      [  677.734712]  do_group_exit+0x43/0xa0
      [  677.734713]  get_signal+0x140/0x900
      [  677.734715]  do_signal+0x37/0x780
      [  677.734717]  ? enqueue_hrtimer+0x41/0xb0
      [  677.734718]  ? recalibrate_cpu_khz+0x10/0x10
      [  677.734720]  ? ktime_get+0x3e/0xa0
      [  677.734721]  ? lapic_next_deadline+0x26/0x30
      [  677.734723]  ? tick_program_event+0x4d/0x90
      [  677.734724]  ? __hrtimer_get_next_event+0x4d/0x80
      [  677.734726]  __prepare_exit_to_usermode+0x126/0x1c0
      [  677.734741]  prepare_exit_to_usermode+0x9/0x40
      [  677.734742]  idtentry_exit_cond_rcu+0x4c/0x60
      [  677.734743]  sysvec_reschedule_ipi+0x92/0x160
      [  677.734744]  ? asm_sysvec_reschedule_ipi+0xa/0x20
      [  677.734745]  asm_sysvec_reschedule_ipi+0x12/0x20
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
      1f9ef808
    • io_uring: fix hanging iopoll in case of -EAGAIN · d52d1291
      Pavel Begunkov authored
      to #29197839
      
      commit cd664b0e35cb1202f40c259a1a5ea791d18c879d upstream.
      
      io_do_iopoll() won't do anything with a request unless
      req->iopoll_completed is set. So io_complete_rw_iopoll() has to set
      it, otherwise io_do_iopoll() will poll a file again and again even
      though the request of interest was completed a long time ago.
      
      Also, remove -EAGAIN check from io_issue_sqe() as it races with
      the changed lines. The request will take the long way and be
      resubmitted from io_iopoll*().
      
      Fixes: bbde017a32b3 ("io_uring: add memory barrier to synchronize io_kiocb's result and iopoll_completed")
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
      d52d1291
    • io_uring: fix io_sq_thread no schedule when busy · 73cef744
      Xuan Zhuo authored
      fix #27804498
      
      commit b772f07add1c0b22e02c0f1e96f647560679d3a9 upstream
      
      When the user consumes and generates sqes at a fast rate,
      io_sqring_entries() can always get an sqe and ret will not be equal to
      -EBUSY, so io_sq_thread() will never call cond_resched() or schedule(),
      and we then get the following system errors:
      
      rcu: INFO: rcu_sched self-detected stall on CPU
      or
      watchdog: BUG: soft lockup-CPU#23 stuck for 112s! [io_uring-sq:1863]
      
      This patch checks whether cond_resched() needs to be called by calling
      need_resched() on every iteration of the loop (see the sketch after
      this entry).
      Suggested-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      73cef744
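
      A userspace analogue of the loop structure: even when work keeps
      arriving, the polling loop must periodically give the CPU back. The
      kernel checks need_resched() and calls cond_resched(); the sketch below
      approximates that with an iteration budget and sched_yield():

      #include <sched.h>
      #include <stdio.h>

      /* Stand-in for "is there another sqe to process right now?" */
      static int more_work(int i) { return i < 1000000; }

      int main(void)
      {
              for (int i = 0; more_work(i); i++) {
                      /* ... process one sqe ... */

                      /* The fix: even when work keeps arriving, periodically give
                       * the CPU up instead of spinning until the watchdog fires. */
                      if ((i & 0xfff) == 0)
                              sched_yield();
              }
              printf("done without hogging the cpu\n");
              return 0;
      }
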
    • io_uring: fix possible race condition against REQ_F_NEED_CLEANUP · 00eaddd4
      Xiaoguang Wang authored
      to #28736503
      
      commit 6f2cc1664db20676069cff27a461ccc97dbfd114 upstream
      
      In io_read() or io_write(), when an io request is submitted
      successfully, it'll go through the below sequence:
      
          kfree(iovec);
          req->flags &= ~REQ_F_NEED_CLEANUP;
          return ret;
      
      But clearing REQ_F_NEED_CLEANUP might be unsafe. The io request may
      already have been completed, and then io_complete_rw_iopoll() and
      io_complete_rw() will be called, both of which will also modify
      req->flags if needed. This causes a race condition, with concurrent
      non-atomic modification of req->flags (see the sketch after this entry).

      To eliminate this race, in io_read() or io_write(), if the io request
      is submitted successfully, we don't remove the REQ_F_NEED_CLEANUP flag.
      If REQ_F_NEED_CLEANUP is set, the iovec cleanup work is left to
      __io_req_aux_free() correspondingly.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      00eaddd4
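
      The hazard is a plain read-modify-write on a flags word that another
      context may be updating concurrently. The sketch below spells out the
      interleaving that loses a bit; the patch avoids it by simply not
      touching the flags from the submission path once the request is in
      flight (flag values are made up):

      #include <stdio.h>

      #define F_NEED_CLEANUP (1U << 0)
      #define F_DONE         (1U << 1)

      int main(void)
      {
              unsigned int flags = F_NEED_CLEANUP;

              /* Racy interleaving, spelled out step by step:
               * the submitter reads flags in order to clear NEED_CLEANUP ... */
              unsigned int submitter_copy = flags;
              /* ... meanwhile the completion path sets DONE ... */
              flags |= F_DONE;
              /* ... and the submitter writes back its stale copy, losing DONE. */
              flags = submitter_copy & ~F_NEED_CLEANUP;

              printf("DONE lost: %s\n", (flags & F_DONE) ? "no" : "yes");

              /* The fix: once the request is in flight, the submission path
               * leaves flags alone; cleanup is done later by the code that
               * frees the request, so only one context writes flags at a time. */
              return 0;
      }
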
    • io_uring: reap poll completions while waiting for refs to drop on exit · 8ec093aa
      Jens Axboe authored
      to #28736503
      
      commit 56952e91acc93ed624fe9da840900defb75f1323 upstream
      
      If we're doing polled IO and end up having requests being submitted
      async, then completions can come in while we're waiting for refs to
      drop. We need to reap these manually, as nobody else will be looking
      for them.
      
      Break the wait into 1/20th-of-a-second timed waits, and check for done
      poll completions if we time out (see the sketch after this entry).
      Otherwise we can have done poll completions sitting in ctx->poll_list
      that need us to reap them while we're just waiting for them.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      8ec093aa
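
      The shape of the fix is a bounded wait in a loop: wait for up to 1/20th
      of a second, and if the wait times out, reap whatever completed in the
      meantime before waiting again. A pthread condition-variable sketch of
      that loop, with a hypothetical reap function and a counter standing in
      for the outstanding refs:

      #include <pthread.h>
      #include <stdio.h>
      #include <time.h>

      static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
      static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
      static int pending_refs = 2;   /* demo stand-in for ctx->refs not yet dropped */

      static void reap_poll_completions(void)
      {
              /* would walk ctx->poll_list and complete finished requests */
      }

      int main(void)
      {
              pthread_mutex_lock(&lock);
              while (pending_refs > 0) {
                      struct timespec ts;

                      clock_gettime(CLOCK_REALTIME, &ts);
                      ts.tv_nsec += 50 * 1000 * 1000;          /* ~HZ/20 == 50ms */
                      if (ts.tv_nsec >= 1000000000L) {
                              ts.tv_sec += 1;
                              ts.tv_nsec -= 1000000000L;
                      }

                      if (pthread_cond_timedwait(&cond, &lock, &ts) != 0) {
                              /* timed out: nobody woke us, reap completions ourselves */
                              reap_poll_completions();
                              pending_refs--;   /* demo only: pretend reaping dropped a ref */
                      }
              }
              pthread_mutex_unlock(&lock);
              printf("all refs dropped\n");
              return 0;
      }
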
    • io_uring: name sq thread and ref completions · 64d14fe5
      Jens Axboe authored
      to #28736503
      
      commit 0f158b4cf20e7983d5b33878a6aad118cfac4f05 upstream
      
      We used to have three completions, now we just have two. With the two,
      let's not allocate them dynamically, just embed them in the ctx and
      name them appropriately.
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      64d14fe5
    • io_uring: acquire 'mm' for task_work for SQPOLL · f0f8d236
      Jens Axboe authored
      to #28736503
      
      commit 9d8426a09195e2dcf2aa249de2aaadd792d491c7 upstream
      
      If we're unlucky with timing, we could be running task_work after
      having dropped the memory context in the sq thread. Since dropping
      the context requires a runnable task state, we cannot reliably drop
      it as part of our check-for-work loop in io_sq_thread(). Instead,
      abstract out the mm acquire for the sq thread into a helper, and call
      it from the async task work handler.
      
      Cc: stable@vger.kernel.org # v5.7
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      f0f8d236
    • io_uring: add memory barrier to synchronize io_kiocb's result and iopoll_completed · a1db6be6
      Xiaoguang Wang authored
      to #28736503
      
      commit bbde017a32b32d2fa8d5fddca25fade20132abf8 upstream
      
      In io_complete_rw_iopoll(), the stores to io_kiocb's result and
      iopoll_completed are two independent store operations. To ensure that
      once iopoll_completed is true, req->result is also visible to the cpu
      executing io_do_iopoll(), a proper memory barrier is needed (see the
      sketch after this entry).

      And in io_do_iopoll(), we check whether req->result is EAGAIN; if it
      is, we'll need to issue this io request using io-wq again. In order to
      issue just a single smp_rmb() on the completion side, move the
      re-submit work to io_iopoll_complete().
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      [axboe: don't set ->iopoll_completed for -EAGAIN retry]
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      a1db6be6
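
      The ordering requirement maps directly onto a store-release of the
      completed flag after the result, paired with a load-acquire before
      reading the result. A C11 stdatomic sketch of that pairing (the kernel
      patch itself uses WRITE_ONCE/READ_ONCE with smp_wmb()/smp_rmb()):

      #include <stdatomic.h>
      #include <pthread.h>
      #include <stdio.h>

      static int result;                     /* plain data, like req->result      */
      static atomic_int iopoll_completed;    /* like req->iopoll_completed        */

      static void *completion_side(void *arg)
      {
              (void)arg;
              result = 42;                                  /* store the result first */
              atomic_store_explicit(&iopoll_completed, 1,
                                    memory_order_release);  /* then publish the flag  */
              return NULL;
      }

      int main(void)
      {
              pthread_t t;

              pthread_create(&t, NULL, completion_side, NULL);

              /* poll side: only read the result after the flag has been observed */
              while (!atomic_load_explicit(&iopoll_completed, memory_order_acquire))
                      ;                                     /* spin, like io_do_iopoll() */
              printf("result=%d\n", result);                /* guaranteed to see 42 */

              pthread_join(t, NULL);
              return 0;
      }
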
    • io_uring: don't fail links for EAGAIN error in IOPOLL mode · b0dc0883
      Xiaoguang Wang authored
      to #28736503
      
      commit 2d7d67920e5c8e0854df23ca77da2dd5880ce5dd upstream
      
      In IOPOLL mode, on an EAGAIN error we'll try to submit the io request
      again using io-wq, so don't fail the rest of the links if this io
      request has links.
      
      Cc: stable@vger.kernel.org
      Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      b0dc0883
    • io_uring: cancel by ->task not pid · aa04321d
      Pavel Begunkov authored
      to #28736503
      
      commit 801dd57bd1d8c2c253f43635a3045bfa32a810b3 upstream
      
      An exiting process tries to cancel all of its inflight requests. Use
      req->task to match them instead of work.pid. We always have req->task
      set, and it will be valid because we're only matching the current
      exiting task.

      Also, remove work.pid and everything related to it; it's useless now.
      Reported-by: Eric W. Biederman <ebiederm@xmission.com>
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      aa04321d
    • io_uring: lazy get task · 090d6c4e
      Pavel Begunkov authored
      to #28736503
      
      commit 4dd2824d6d5914949b5fe589538bc2622d84c5dd upstream
      
      There will be multiple places where req->task is used, so refcount-pin
      it lazily with the newly introduced io_{get,put}_req_task() helpers. We
      need to always have a valid ->task for cancellation reasons, but we
      don't care about pinning it in some cases. That's why req->task is set
      in io_req_init() and get/put laziness is implemented with a flag.

      This also removes the use of @current from polling in
      io_arm_poll_handler(), etc., but doesn't change observable behaviour.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      090d6c4e
    • io_uring: batch cancel in io_uring_cancel_files() · 037c3f51
      Pavel Begunkov authored
      to #28736503
      
      commit 67c4d9e693e3bb7fb968af24e3584f821a78ba56 upstream
      
      Instead of waiting for each request one by one, first try to cancel all
      of them in a batched manner, and then go over inflight_list/etc to reap
      leftovers.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      037c3f51
    • io_uring: cancel all task's requests on exit · f4c5149f
      Pavel Begunkov authored
      to #28736503
      
      commit 44e728b8aae0bb6d4229129083974f9dea43f50b upstream
      
      If a process is going away, io_uring_flush() will cancel only one
      request with a matching pid. Cancel all of them.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      f4c5149f
    • io-wq: add an option to cancel all matched reqs · 4b786826
      Pavel Begunkov authored
      to #28736503
      
      commit 4f26bda1522c35d2701fc219368c7101c17005c1 upstream
      
      This adds support for cancelling all io-wq works matching a predicate
      (see the sketch after this entry). It isn't used yet, so there is no
      change in observable behaviour.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      4b786826
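
      A sketch of the "cancel everything that matches" interface: walk the
      pending works, apply the match callback with its data argument, and
      cancel every match instead of stopping at the first one. Names and
      layout are made up; the real code walks the io-wq lists:

      #include <stdbool.h>
      #include <stdio.h>

      struct work { int owner_pid; bool cancelled; };

      typedef bool (*work_match_fn)(const struct work *w, void *data);

      static void cancel_all_matching(struct work *works, int n,
                                      work_match_fn match, void *data)
      {
              for (int i = 0; i < n; i++)
                      if (match(&works[i], data))
                              works[i].cancelled = true;   /* cancel and keep scanning */
      }

      static bool match_owner(const struct work *w, void *data)
      {
              return w->owner_pid == *(int *)data;
      }

      int main(void)
      {
              struct work works[] = { { 1, false }, { 2, false }, { 1, false } };
              int pid = 1;

              cancel_all_matching(works, 3, match_owner, &pid);
              for (int i = 0; i < 3; i++)
                      printf("work %d cancelled=%d\n", i, works[i].cancelled);
              return 0;
      }
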
    • io_uring: fix lazy work init · 819f9648
      Pavel Begunkov authored
      to #28736503
      
      commit 59960b9deb5354e4cdb0b6ed3a3b653a2b4eb602 upstream
      
      Don't leave garbage in req.work before punting async on -EAGAIN
      in io_iopoll_queue().
      
      [  140.922099] general protection fault, probably for non-canonical
           address 0xdead000000000100: 0000 [#1] PREEMPT SMP PTI
      ...
      [  140.922105] RIP: 0010:io_worker_handle_work+0x1db/0x480
      ...
      [  140.922114] Call Trace:
      [  140.922118]  ? __next_timer_interrupt+0xe0/0xe0
      [  140.922119]  io_wqe_worker+0x2a9/0x360
      [  140.922121]  ? _raw_spin_unlock_irqrestore+0x24/0x40
      [  140.922124]  kthread+0x12c/0x170
      [  140.922125]  ? io_worker_handle_work+0x480/0x480
      [  140.922126]  ? kthread_park+0x90/0x90
      [  140.922127]  ret_from_fork+0x22/0x30
      
      Fixes: 7cdaf587de7c ("io_uring: avoid whole io_wq_work copy for requests completed inline")
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      819f9648
  2. 29 June 2020, 11 commits