.\" Copyright (C) 2019 Jens Axboe <axboe@kernel.dk>
.\" Copyright (C) 2019 Jon Corbet <corbet@lwn.net>
.\" Copyright (C) 2019 Red Hat, Inc.
.\"
.\" SPDX-License-Identifier: LGPL-2.0-or-later
.\"
.TH IO_URING_SETUP 2 2019-01-29 "Linux" "Linux Programmer's Manual"
.SH NAME
io_uring_setup \- set up a context for performing asynchronous I/O
.SH SYNOPSIS
.nf
.BR "#include <linux/io_uring.h>"
.PP
.BI "int io_uring_setup(u32 " entries ", struct io_uring_params *" p );
.fi
.PP
.SH DESCRIPTION
.PP
The io_uring_setup() system call sets up a submission queue (SQ) and
completion queue (CQ) with at least
.I entries
entries, and returns a file descriptor which can be used to perform
subsequent operations on the io_uring instance.  The submission and
completion queues are shared between userspace and the kernel, which
eliminates the need to copy data when initiating and completing I/O.
.PP
The
.I p
argument is used by the application to pass options to the kernel, and by the
kernel to convey information about the ring buffers.
.PP
.in +4n
.EX
struct io_uring_params {
    __u32 sq_entries;
    __u32 cq_entries;
    __u32 flags;
    __u32 sq_thread_cpu;
    __u32 sq_thread_idle;
    __u32 features;
    __u32 resv[4];
    struct io_sqring_offsets sq_off;
    struct io_cqring_offsets cq_off;
};
.EE
.in
.PP
The
.IR flags ,
.IR sq_thread_cpu ,
and
.I sq_thread_idle
fields are used to configure the io_uring instance.
.I flags
is a bit mask of zero or more of the following values ORed
together:
.TP
.B IORING_SETUP_IOPOLL
Perform busy-waiting for an I/O completion, as opposed to getting
notifications via an asynchronous IRQ (Interrupt Request).  The file
system (if any) and block device must support polling in order for
this to work.  Busy-waiting provides lower latency, but may consume
more CPU resources than interrupt-driven I/O.  Currently, this feature
is usable only on a file descriptor opened using the
.B O_DIRECT
flag.  When a read or write is submitted to a polled context, the
application must poll for completions on the CQ ring by calling
.BR io_uring_enter (2).
It is illegal to mix and match polled and non-polled I/O on an io_uring
instance.
.TP
.B IORING_SETUP_SQPOLL
When this flag is specified, a kernel thread is created to perform
submission queue polling.  An io_uring instance configured in this way
enables an application to issue I/O without ever context switching
into the kernel.  By using the submission queue to fill in new
submission queue entries and watching for completions on the
completion queue, the application can submit and reap I/Os without
doing a single system call.
.IP
If the kernel thread is idle for more than
.I sq_thread_idle
milliseconds, it will set the
.B IORING_SQ_NEED_WAKEUP
bit in the
.I flags
field of the
.IR "struct io_sq_ring" .
When this happens, the application must call
.BR io_uring_enter (2)
to wake the kernel thread.  If I/O is kept busy, the kernel thread
will never sleep.  An application making use of this feature will need
to guard the
.BR io_uring_enter (2)
call with the following code sequence:
.IP
.in +4n
.EX
/*
 * Ensure that the wakeup flag is read after the tail pointer has been
 * written.
 */
smp_mb();
if (*sq_ring->flags & IORING_SQ_NEED_WAKEUP)
    io_uring_enter(fd, 0, 0, IORING_ENTER_SQ_WAKEUP);
.EE
.in
.IP
where
.I sq_ring
is a submission queue ring set up using the
.I struct io_sqring_offsets
described below.
.IP
To successfully use this feature, the application must register a set of files
to be used for I/O through
.BR io_uring_register (2)
using the
.B IORING_REGISTER_FILES
opcode.  Failure to do so will result in submitted I/O failing with
.BR EBADF .
.TP
.B IORING_SETUP_SQ_AFF
If this flag is specified, then the poll thread will be bound to the
CPU set in the
.I sq_thread_cpu
field of the
.IR "struct io_uring_params" .
This flag is only meaningful when
.B IORING_SETUP_SQPOLL
is specified.
.TP
.B IORING_SETUP_CQSIZE
Create the completion queue with
.I struct io_uring_params.cq_entries
entries.  The value must be greater than
.IR entries ,
and may be rounded up to the next power-of-two.
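.PP
The rounding mentioned for
.B IORING_SETUP_CQSIZE
can be illustrated with a small helper.
This is only a sketch: the name
.I round_up_pow2
is hypothetical, and it assumes that a value which is already a power
of two is left unchanged.
The kernel's exact sizing rules may differ between versions.
.PP
.in +4n
.EX
#include <stdint.h>

/* Hypothetical helper: smallest power of two >= n, for n in 1..2^31. */
static uint32_t round_up_pow2(uint32_t n)
{
    uint32_t p = 1;

    while (p < n)
        p <<= 1;
    return p;
}
.EE
.in
.PP
Under these assumptions, a request for 5 completion queue entries
would yield a ring with 8 slots.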
.PP
If no flags are specified, the io_uring instance is set up for
interrupt-driven I/O.  I/O may be submitted using
.BR io_uring_enter (2)
and can be reaped by polling the completion queue.
.PP
The
.I resv
array must be initialized to zero.
.PP
.I features
is filled in by the kernel and is a bit mask of the features supported
by the current kernel version:
.TP
.B IORING_FEAT_SINGLE_MMAP
If this flag is set, the SQ and CQ rings can be mapped with a single
.BR mmap (2)
call.  The SQEs must still be allocated separately.  This brings the necessary
.BR mmap (2)
calls down from three to two.
.TP
.B IORING_FEAT_NODROP
If this flag is set, io_uring supports never dropping completion events.
If a completion event occurs and the CQ ring is full, the kernel stores
the event internally until such a time that the CQ ring has room for more
entries.  If this overflow condition is entered, attempting to submit more
I/O will fail with the
.B \-EBUSY
error value if the kernel can't flush the overflowed events to the CQ ring.
If this
happens, the application must reap events from the CQ ring and attempt the
submit again.
.TP
.B IORING_FEAT_SUBMIT_STABLE
If this flag is set, applications can be certain that any data for
async offload has been consumed when the kernel has consumed the SQE.
.TP
.B IORING_FEAT_RW_CUR_POS
If this flag is set, applications can specify
.I offset
== \-1 with
.BR IORING_OP_{READV,WRITEV} ,
.BR IORING_OP_{READ,WRITE}_FIXED ,
and
.B IORING_OP_{READ,WRITE}
to mean current file position, which behaves like
.BR preadv2 (2)
and
.BR pwritev2 (2)
with
.I offset
== \-1.  The operation will use (and update) the current file position.
This comes
with the caveat that if the application has multiple reads or writes in flight,
then the end result will not be as expected.  This is similar to threads sharing
a file descriptor and doing I/O using the current file position.
.TP
.B IORING_FEAT_CUR_PERSONALITY
If this flag is set, then io_uring guarantees that both sync and async
execution of a request assumes the credentials of the task that called
.BR io_uring_enter (2)
to queue the requests.  If this flag isn't set, then requests are issued with
the credentials of the task that originally registered the io_uring.  If only
one task is using a ring, then this flag doesn't matter as the credentials
will always be the same.  Note that this is the default behavior; tasks can
still register different personalities through
.BR io_uring_register (2)
with
.B IORING_REGISTER_PERSONALITY
and specify the personality to use in the SQE.
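.PP
An application can test the bits in
.I features
before relying on the behavior they advertise.
The following is a minimal sketch rather than a definitive implementation:
the function name
.I report_features
is hypothetical, and the
.B #ifndef
fallbacks are an assumption made only so that the code builds against
older kernel headers.
.PP
.in +4n
.EX
#include <stdio.h>
#include <linux/io_uring.h>

/* Fallbacks for older headers (assumption: same values as the UAPI). */
#ifndef IORING_FEAT_SINGLE_MMAP
#define IORING_FEAT_SINGLE_MMAP     (1U << 0)
#endif
#ifndef IORING_FEAT_NODROP
#define IORING_FEAT_NODROP          (1U << 1)
#endif
#ifndef IORING_FEAT_SUBMIT_STABLE
#define IORING_FEAT_SUBMIT_STABLE   (1U << 2)
#endif
#ifndef IORING_FEAT_RW_CUR_POS
#define IORING_FEAT_RW_CUR_POS      (1U << 3)
#endif
#ifndef IORING_FEAT_CUR_PERSONALITY
#define IORING_FEAT_CUR_PERSONALITY (1U << 4)
#endif

/* Print each known feature bit that is set; return how many were set. */
static int report_features(unsigned int features)
{
    static const struct {
        unsigned int bit;
        const char *name;
    } known[] = {
        { IORING_FEAT_SINGLE_MMAP,     "IORING_FEAT_SINGLE_MMAP" },
        { IORING_FEAT_NODROP,          "IORING_FEAT_NODROP" },
        { IORING_FEAT_SUBMIT_STABLE,   "IORING_FEAT_SUBMIT_STABLE" },
        { IORING_FEAT_RW_CUR_POS,      "IORING_FEAT_RW_CUR_POS" },
        { IORING_FEAT_CUR_PERSONALITY, "IORING_FEAT_CUR_PERSONALITY" },
    };
    int count = 0;

    for (size_t i = 0; i < sizeof(known) / sizeof(known[0]); i++) {
        if (features & known[i].bit) {
            puts(known[i].name);
            count++;
        }
    }
    return count;
}
.EE
.in
.PP
A caller would typically pass the
.I features
value that the kernel stored in
.I struct io_uring_params
after a successful
.BR io_uring_setup (2)
call.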
.PP
The rest of the fields in the
.I struct io_uring_params
are filled in by the kernel, and provide the information necessary to
memory map the submission queue, completion queue, and the array of
submission queue entries.
.I sq_entries
specifies the number of submission queue entries allocated.
.I sq_off
describes the offsets of various ring buffer fields:
.PP
.in +4n
.EX
struct io_sqring_offsets {
    __u32 head;
    __u32 tail;
    __u32 ring_mask;
    __u32 ring_entries;
    __u32 flags;
    __u32 dropped;
    __u32 array;
    __u32 resv[3];
};
.EE
.in
.PP
Taken together,
.I sq_entries
and
.I sq_off
provide all of the information necessary for accessing the submission
queue ring buffer and the submission queue entry array.  The
submission queue can be mapped with a call like:
.PP
.in +4n
.EX
ptr = mmap(0, sq_off.array + sq_entries * sizeof(__u32),
           PROT_READ|PROT_WRITE, MAP_SHARED|MAP_POPULATE,
           ring_fd, IORING_OFF_SQ_RING);
.EE
.in
.PP
where
.I sq_off
is the
.I io_sqring_offsets
structure, and
.I ring_fd
is the file descriptor returned from
.BR io_uring_setup (2).
The addition of
.I sq_off.array
to the length of the region accounts for the fact that the ring
is located at the end of the data structure.  As an example, the ring
buffer head pointer can be accessed by adding
.I sq_off.head
to the address returned from
.BR mmap (2):
.PP
.in +4n
.EX
head = ptr + sq_off.head;
.EE
.in
.PP
The
.I flags
field is used by the kernel to communicate state information to the
application.  Currently, it is used to inform the application when a
call to
.BR io_uring_enter (2)
is necessary.  See the documentation for the
.B IORING_SETUP_SQPOLL
flag above.
The
.I dropped
member is incremented for each invalid submission queue entry
encountered in the ring buffer.
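.PP
The wakeup check described for
.B IORING_SETUP_SQPOLL
can also be written with C11 atomics in place of the kernel-style
.IR smp_mb ().
This is a hedged sketch, not the reference sequence: the function name
.I publish_tail_needs_wakeup
is hypothetical, and it assumes that the application treats the mapped
tail and flags words as
.I _Atomic unsigned int
objects.
.PP
.in +4n
.EX
#include <stdatomic.h>
#include <stdbool.h>

/* Assumption: same value as in <linux/io_uring.h>. */
#ifndef IORING_SQ_NEED_WAKEUP
#define IORING_SQ_NEED_WAKEUP (1U << 0)
#endif

/*
 * Publish the new tail, then report whether the poll thread must be
 * woken with io_uring_enter(2).  The sequentially consistent fence
 * orders the flags read after the tail write.
 */
static bool publish_tail_needs_wakeup(_Atomic unsigned int *sq_tail,
                                      unsigned int new_tail,
                                      _Atomic unsigned int *sq_flags)
{
    atomic_store_explicit(sq_tail, new_tail, memory_order_release);
    atomic_thread_fence(memory_order_seq_cst);
    return atomic_load_explicit(sq_flags, memory_order_relaxed)
           & IORING_SQ_NEED_WAKEUP;
}
.EE
.in
.PP
When the function returns true, the application would call
.BR io_uring_enter (2)
with
.B IORING_ENTER_SQ_WAKEUP
as shown earlier.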
.PP
The head and tail track the ring buffer state.  The tail is
incremented by the application when submitting new I/O, and the head
is incremented by the kernel when the I/O has been successfully
submitted.  Determining the index of the head or tail into the ring is
accomplished by applying a mask:
.PP
.in +4n
.EX
index = tail & ring_mask;
.EE
.in
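.PP
The masking step can be sketched in isolation.
The helper names below are hypothetical, and the sketch assumes that
the head and tail are free-running 32-bit counters and that
.I ring_mask
equals the number of ring entries minus one, with that number being a
power of two.
.PP
.in +4n
.EX
#include <stdint.h>

/* Slot in the ring for a free-running head or tail counter. */
static uint32_t ring_index(uint32_t counter, uint32_t ring_mask)
{
    return counter & ring_mask;
}

/* Entries queued but not yet consumed; correct across wrap-around. */
static uint32_t ring_pending(uint32_t head, uint32_t tail)
{
    return tail - head;
}
.EE
.in
.PP
Because the subtraction is done in unsigned arithmetic, the count of
pending entries stays correct even after the tail counter wraps past
zero while the head has not yet done so.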
.PP
The array of submission queue entries is mapped with:
.PP
.in +4n
.EX
sqentries = mmap(0, sq_entries * sizeof(struct io_uring_sqe),
                 PROT_READ|PROT_WRITE, MAP_SHARED|MAP_POPULATE,
                 ring_fd, IORING_OFF_SQES);
.EE
.in
.PP
The completion queue is described by
.I cq_entries
and
.I cq_off
shown here:
.PP
.in +4n
.EX
struct io_cqring_offsets {
    __u32 head;
    __u32 tail;
    __u32 ring_mask;
    __u32 ring_entries;
    __u32 overflow;
    __u32 cqes;
    __u32 flags;
    __u32 resv[3];
};
.EE
.in
.PP
The completion queue is simpler, since the entries are not separated
from the queue itself, and can be mapped with:
.PP
.in +4n
.EX
ptr = mmap(0, cq_off.cqes + cq_entries * sizeof(struct io_uring_cqe),
           PROT_READ|PROT_WRITE, MAP_SHARED|MAP_POPULATE, ring_fd,
           IORING_OFF_CQ_RING);
.EE
.in
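.PP
The three mapping lengths used above can be computed in one place.
The following is a sketch under stated assumptions:
.I struct ring_sizes
and
.I ring_map_sizes
are hypothetical names, and using the larger of the two ring lengths
when
.B IORING_FEAT_SINGLE_MMAP
is set follows the approach taken by liburing rather than anything
this page specifies.
.PP
.in +4n
.EX
#include <stddef.h>
#include <linux/io_uring.h>

/* Assumption: same value as the UAPI, for older headers only. */
#ifndef IORING_FEAT_SINGLE_MMAP
#define IORING_FEAT_SINGLE_MMAP (1U << 0)
#endif

/* Hypothetical container for the three mmap(2) lengths. */
struct ring_sizes {
    size_t sq_ring;
    size_t cq_ring;
    size_t sqes;
};

/* Compute the lengths from the values the kernel stored in *p. */
static struct ring_sizes ring_map_sizes(const struct io_uring_params *p)
{
    struct ring_sizes s;

    s.sq_ring = p->sq_off.array + p->sq_entries * sizeof(__u32);
    s.cq_ring = p->cq_off.cqes +
                p->cq_entries * sizeof(struct io_uring_cqe);
    s.sqes = p->sq_entries * sizeof(struct io_uring_sqe);

    /*
     * With a single shared mapping, one length must cover both rings,
     * so use the larger of the two for both.
     */
    if (p->features & IORING_FEAT_SINGLE_MMAP) {
        if (s.cq_ring > s.sq_ring)
            s.sq_ring = s.cq_ring;
        s.cq_ring = s.sq_ring;
    }
    return s;
}
.EE
.in
.PP
The resulting values would be passed as the length argument of the
corresponding
.BR mmap (2)
calls.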
.PP
Closing the file descriptor returned by
.BR io_uring_setup (2)
will free all resources associated with the io_uring context.
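.PP
On systems where the C library provides no wrapper for this system
call, it can be invoked through
.BR syscall (2).
The following is a minimal sketch, assuming kernel headers that define
.BR __NR_io_uring_setup ;
the wrapper name
.I ring_setup
is hypothetical, and it requests an instance with no flags.
.PP
.in +4n
.EX
#define _GNU_SOURCE
#include <string.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/io_uring.h>

/*
 * Thin wrapper around the raw system call.  Returns the ring file
 * descriptor, or -1 with errno set.
 */
static int ring_setup(unsigned int entries, struct io_uring_params *p)
{
    memset(p, 0, sizeof(*p));   /* also zeroes flags and resv */
    return (int) syscall(__NR_io_uring_setup, entries, p);
}
.EE
.in
.PP
A caller would check the return value, map the rings as described
above, and eventually
.BR close (2)
the descriptor.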
.PP
.SH RETURN VALUE
.BR io_uring_setup (2)
returns a new file descriptor on success.  The application may then
provide the file descriptor in a subsequent
.BR mmap (2)
call to map the submission and completion queues, or to the
.BR io_uring_register (2)
or
.BR io_uring_enter (2)
system calls.
.PP
On error, -1 is returned and
.I errno
is set appropriately.
.PP
.SH ERRORS
.TP
.B EFAULT
.I p
points outside your accessible address space.
.TP
.B EINVAL
The
.I resv
array contains non-zero data,
.I p.flags
contains an unsupported flag,
.I entries
is out of bounds,
.B IORING_SETUP_SQ_AFF
was specified, but
.B IORING_SETUP_SQPOLL
was not, or
.B IORING_SETUP_CQSIZE
was specified, but
.I io_uring_params.cq_entries
was invalid.
.TP
.B EMFILE
The per-process limit on the number of open file descriptors has been
reached (see the description of
.B RLIMIT_NOFILE
in
.BR getrlimit (2)).
.TP
.B ENFILE
The system-wide limit on the total number of open files has been
reached.
.TP
.B ENOMEM
Insufficient kernel resources are available.
.TP
.B EPERM
.B IORING_SETUP_SQPOLL
was specified, but the effective user ID of the caller did not have sufficient
privileges.
.SH SEE ALSO
.BR io_uring_register (2),
.BR io_uring_enter (2)