
Non-blocking buffered file read operations

By Jonathan Corbet
September 23, 2014
It is natural to think of buffered I/O on Unix-like systems as being asynchronous. In Linux, the page cache exists to separate process-level I/O requests from the physical requests sent to the storage media. But, in truth, some operations are synchronous; in particular, a read operation cannot complete until the data is actually read into the page cache. So a call to read() on a normal file descriptor can always block; most of the time this blocking causes no difficulties, but it can be problematic for programs that need to always be responsive. Now, a partial solution is in the works for this kind of code, but it comes at the cost of adding as many as four new system calls.

The problem of blocking buffered reads is not new, of course, so applications have worked around it in a number of ways. One common approach is to create a set of threads dedicated to performing buffered I/O. Those threads can block while other threads in the program continue to do other work. This solution works and can be efficient, but it inevitably adds a certain amount of inter-thread communication overhead, especially in cases where the desired data is already in the page cache and a read() call could have completed immediately.

A recent patch set from Milosz Tanski attempts to solve the problem a different way. Milosz's approach is to allow a program to request non-blocking behavior at the level of a single read() call. Unfortunately, the current read() and variants do not have a "flags" argument, so there is no way to express that request using them. So Milosz adds two new versions of each of read() and write():

    int readv2(unsigned long fd, struct iovec *vec, unsigned long vlen, int flags);
    int writev2(unsigned long fd, struct iovec *vec, unsigned long vlen, int flags);
    int preadv2(unsigned long fd, struct iovec *vec, unsigned long vlen,
                unsigned long pos_l, unsigned long pos_h, int flags);
    int pwritev2(unsigned long fd, struct iovec *vec, unsigned long vlen,
                 unsigned long pos_l, unsigned long pos_h, int flags);

In each case, the system call is just like its predecessor with the exception of the addition of the flags argument. Note that the two offset parameters (pos_l and pos_h) to preadv2() and pwritev2() will be combined into a single off_t parameter at the C library level.
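
The combination of the two offset halves can be sketched as follows. This is an illustration only: split_offset() and join_offset() are hypothetical names, and the sketch assumes that each half carries 32 bits, as it would on an architecture where unsigned long is 32 bits wide.

```c
#include <stdint.h>

/* Split a 64-bit file offset into low and high 32-bit halves, mirroring
 * the pos_l/pos_h pair (assumption: 32 bits per half). */
static void split_offset(uint64_t pos, uint32_t *pos_l, uint32_t *pos_h)
{
    *pos_l = (uint32_t)(pos & 0xffffffffu);
    *pos_h = (uint32_t)(pos >> 32);
}

/* Recombine the two halves into the single offset a caller would pass. */
static uint64_t join_offset(uint32_t pos_l, uint32_t pos_h)
{
    return ((uint64_t)pos_h << 32) | pos_l;
}
```

On a 64-bit system the low half would normally hold the whole offset by itself; the split matters on 32-bit systems, where a single register cannot hold a 64-bit file position.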

In Milosz's patch set, the only supported flag is RWF_NONBLOCK, which requests non-blocking operation. If a read request is accompanied by this flag, it will only complete if (at least some of) the requested data is already in the page cache; otherwise it returns EAGAIN. The current patch does not start any sort of readahead operation if it is unable to satisfy a non-blocking read request. The new write operations do not support non-blocking operation; the flags argument must be zero when calling them. Adding non-blocking behavior to write() is possible; such a write would only complete if memory were immediately available for a copy of the data in the page cache. But that implementation has been left as a future exercise.
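
A caller-side sketch of such a read might look like the code below. It rests on assumptions: the RWF_NONBLOCK flag from this patch set is not available in mainline kernels, so the sketch uses the glibc preadv2() wrapper with the similar RWF_NOWAIT flag that later kernels provide. The helper names read_if_cached() and file_start_is_cached() are hypothetical.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef RWF_NOWAIT
#define RWF_NOWAIT 0x00000008 /* value used by the Linux uapi headers */
#endif

/* Hypothetical helper: read up to len bytes at offset off, but only from
 * data that is already in the page cache. Returns the number of bytes
 * copied (possibly fewer than len), or -1 with errno set; EAGAIN means
 * that nothing could be returned without blocking. */
static ssize_t read_if_cached(int fd, void *buf, size_t len, off_t off)
{
    struct iovec iov = { .iov_base = buf, .iov_len = len };

    return preadv2(fd, &iov, 1, off, RWF_NOWAIT);
}

/* Hypothetical helper: report whether the start of a file is cached.
 * Returns 1 if some data came straight from the cache, 0 if the read
 * would have blocked, and -1 on any other outcome (an empty file, an
 * open() failure, or a kernel or filesystem without RWF_NOWAIT). */
static int file_start_is_cached(const char *path)
{
    char buf[64];
    int fd = open(path, O_RDONLY);

    if (fd == -1)
        return -1;

    ssize_t n = read_if_cached(fd, buf, sizeof(buf), 0);
    int saved_errno = errno;

    close(fd);
    if (n > 0)
        return 1;
    if (n == -1 && saved_errno == EAGAIN)
        return 0;
    return -1;
}
```

Since a successful call may return only part of the requested range, a caller has to be prepared to issue a follow-up read for the remainder.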

Considering the alternatives

The patch is relatively simple and straightforward, but one might well wonder: why is it necessary to add a new set of system calls for non-blocking operation when the kernel has long supported this mode via the O_NONBLOCK flag, set with either open() or fcntl()? There are, it seems, a couple of reasons for not wanting to implement ordinary non-blocking I/O behavior for regular files, the first being that doing so would break applications.

Given that non-blocking I/O is an optional behavior that must be explicitly requested, it is not obvious that supporting it for regular files would create trouble. It comes down to the fact that passing O_NONBLOCK to an open() call actually requests two different things: (1) that the open() call, itself, not block, and (2) that subsequent I/O be non-blocking. There are applications that use O_NONBLOCK for the first reason; Samba uses it, for example, to keep an open() call from blocking in the presence of locks on the file. Since buffered reads have always blocked regardless of O_NONBLOCK, applications do not concern themselves with calling fcntl() to reset the flag before calling read(). If those read() calls start returning EAGAIN, the application, which is not expecting that behavior, will fail.
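
For illustration, this is roughly what an application would have to do if it wanted a non-blocking open() followed by ordinary blocking I/O on a kernel that honored O_NONBLOCK for regular files. The helper names clear_nonblock() and open_blocking_io() are hypothetical; the fcntl() calls are the standard F_GETFL/F_SETFL operations.

```c
#include <fcntl.h>
#include <unistd.h>

/* Clear O_NONBLOCK on an open descriptor so that later reads may block.
 * Returns 0 on success, -1 on error. */
static int clear_nonblock(int fd)
{
    int flags = fcntl(fd, F_GETFL);

    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFL, flags & ~O_NONBLOCK) == -1 ? -1 : 0;
}

/* Open without blocking in open() itself, then restore blocking I/O.
 * Returns the descriptor, or -1 on error. */
static int open_blocking_io(const char *path)
{
    int fd = open(path, O_RDONLY | O_NONBLOCK);

    if (fd == -1)
        return -1;
    if (clear_nonblock(fd) == -1) {
        close(fd);
        return -1;
    }
    return fd;
}
```

Existing applications skip the clear_nonblock() step because it has never been needed for regular files, which is why changing the meaning of O_NONBLOCK would break them.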

One could argue that this behavior is incorrect, but it has worked for decades; breaking these applications with a kernel change is not acceptable. Samba is not the only application to run into trouble here; evidently squid and GQview fail as well. So the problem is clearly real.

Beyond that, as Volker Lendecke explained, full non-blocking behavior would not play well with how applications like Samba want to use this feature. The wish is to attempt to read the needed data in the non-blocking mode; should the data not be available, the request will be turned over to the thread pool for synchronous execution. If the thread pool is using the same file descriptor, its attempts to perform blocking reads will fail. If it uses a different file descriptor, it can run into weird problems relating to the surprising semantics of POSIX file locks (see this article for more information). So the ability to request non-blocking behavior on a per-read basis is needed.
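
The pattern described here, a fast attempt followed by a blocking read on the same descriptor, could be sketched as below. The sketch makes several assumptions: it uses preadv2() with RWF_NOWAIT (available in later mainline kernels) in place of the patch's RWF_NONBLOCK, the names slow_path_read() and read_range() are hypothetical, and the "thread pool" is reduced to an inline blocking pread() to keep the example short.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef RWF_NOWAIT
#define RWF_NOWAIT 0x00000008 /* value used by the Linux uapi headers */
#endif

/* Stand-in for handing the request to a thread pool: a real server would
 * queue it to a worker thread; here the blocking read simply runs inline. */
static ssize_t slow_path_read(int fd, void *buf, size_t len, off_t off)
{
    return pread(fd, buf, len, off);
}

/* Try to satisfy the read from the page cache; if that is not possible
 * (or not supported), fall back to a blocking read on the SAME descriptor. */
static ssize_t read_range(const char *path, void *buf, size_t len, off_t off)
{
    struct iovec iov = { .iov_base = buf, .iov_len = len };
    int fd = open(path, O_RDONLY);

    if (fd == -1)
        return -1;

    ssize_t n = preadv2(fd, &iov, 1, off, RWF_NOWAIT);
    if (n == -1 && (errno == EAGAIN || errno == EOPNOTSUPP ||
                    errno == ENOSYS || errno == EINVAL))
        n = slow_path_read(fd, buf, len, off);

    int saved_errno = errno;
    close(fd);
    errno = saved_errno;
    return n;
}
```

Because the non-blocking request is attached to the individual call rather than to the descriptor, the fallback read needs neither a second descriptor nor an fcntl() call to switch modes.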

Another possibility would be to add a version of the fincore() system call, which allows a process to ask the kernel whether a specific range of file data is present in the page cache. Patches adding fincore() have been around since at least 2010. But fincore() is seen as a bit of an indirect route toward the real goal, and there is always the possibility that the situation might change between a call to fincore() and the application's decision to do something based on the result. Requesting non-blocking behavior with the read() avoids that particular race condition.
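
Since fincore() is not in the mainline kernel, the closest long-standing equivalent is to map the file and ask mincore() about the mapping. The sketch below takes that route as an assumption-laden illustration of the "check first, then read" approach; range_is_cached() is a hypothetical name, and off must be a multiple of the page size.

```c
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Returns 1 if every page backing [off, off + len) is resident in memory,
 * 0 if at least one page is not, and -1 on error. */
static int range_is_cached(int fd, off_t off, size_t len)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t pages = (len + page - 1) / page;
    unsigned char *vec = malloc(pages);
    int result = -1;

    if (vec == NULL)
        return -1;

    void *map = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, off);
    if (map == MAP_FAILED) {
        free(vec);
        return -1;
    }

    /* mincore() sets the low bit of each vec entry for a resident page. */
    if (mincore(map, len, vec) == 0) {
        result = 1;
        for (size_t i = 0; i < pages; i++) {
            if (!(vec[i] & 1)) {
                result = 0;
                break;
            }
        }
    }

    munmap(map, len);
    free(vec);
    return result;
}
```

Even when this returns 1, the pages can be evicted before the subsequent read() is issued, which is exactly the race described above.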

Finally, one could also consider the kernel's asynchronous I/O subsystem, which allows an application to obtain non-blocking behavior on a per-request basis. But asynchronous I/O has never been supported for buffered I/O, and attempts to add that functionality have bogged down in the sheer complexity of the problem. Adding non-blocking behavior to read() — where, unlike with asynchronous I/O, a request can simply fail if it cannot be satisfied immediately — is far simpler.

So the end result would appear to be that we will get a new set of Linux-specific system calls allowing applications to request non-blocking read() behavior on regular files. The rate of change on this patch set is slowing — though it is worth noting that readv2() and writev2() have been removed from the latest version (as of this writing) of the patch set. It is getting late to have this code ready for the 3.18 development cycle, but it should be more than ready for 3.19.


Non-blocking buffered file read operations

Posted Sep 25, 2014 3:06 UTC (Thu) by shemminger (subscriber, #5739) [Link]

Of course if AIO in Linux were not so fundamentally brain dead, no one would have to reinvent it.

Non-blocking buffered file read operations

Posted Sep 25, 2014 13:17 UTC (Thu) by sorokin (guest, #88478) [Link]

I'm not familiar with POSIX AIO. Could you elaborate on why it is brain dead? After a quick reading, the API seems pretty straightforward. Sure, I'd prefer to have notifications enqueued in an epoll descriptor instead of using aio_suspend(), but this seems to be achievable using SIGEV_SIGNAL + signalfd.

Non-blocking buffered file read operations

Posted Sep 25, 2014 13:21 UTC (Thu) by ejr (subscriber, #51652) [Link]

Different cases. This appears to be "if it's already here, gimme a quick memory copy" while AIO is "read this when you can, I'll check back later" (or at least was on AIX & Solaris when last I used it). A server can use the first to lower latency on fast-path responses. For the former, the permissions and meta-data are assumed also to be in memory. The latter is more useful with pre-allocated spaces to avoid eating memory twice. But it's been a while using non-MPI-AIO, so I could have missed AIO changes.

Non-blocking buffered file read operations

Posted Sep 25, 2014 15:10 UTC (Thu) by epa (subscriber, #39769) [Link]

"if it's already here, gimme a quick memory copy" sounds like the file analogue of volatile ranges.

Non-blocking buffered file read operations

Posted Oct 8, 2014 7:19 UTC (Wed) by dgm (subscriber, #49227) [Link]

It looks like a recurring pattern. Apparently there's a desire to delegate caching from applications into the kernel. And it makes sense, if you ask me. The kernel can manage resources, be it memory or disk space, on a global scale. Just think of the vast amount of memory "trapped" in each application's internal caches that could be handled much more efficiently by the kernel, if only the right abstractions existed (basically: a system call for "remember this if you can"). This could also be applied to disk space, for instance, allowing full use of your disks for caching network data in a controlled way and without affecting the behavior of the rest of the system.

Non-blocking buffered file read operations

Posted Sep 25, 2014 10:34 UTC (Thu) by simlo (guest, #10866) [Link]

Please, just give me a flag saying that the O_NONBLOCK option should also cover disk I/O.

In general I code with non-blocking I/O. Everything is a file descriptor, so I can use the same code to handle data from a TCP connection, a serial interface, a fifo, and a file. It is a pain, and inconsistent, that a read from a file blocks when I have set O_NONBLOCK on the socket.

In general file IO ought to be handled more like network io, due to network file systems. A blocking io from a network file system can make applications hang. This happens a lot for many desktop applications.

Non-blocking buffered file read operations

Posted Sep 25, 2014 10:55 UTC (Thu) by mpr22 (subscriber, #60784) [Link]

It seems to me that your proposal doesn't achieve what is desired:

> The wish is to attempt to read the needed data in the non-blocking mode; should the data not be available, the request will be turned over to the thread pool for synchronous execution. If the thread pool is using the same file descriptor, its attempts to perform blocking reads will fail. If it uses a different file descriptor, it can run into weird problems relating to the surprising semantics of POSIX file locks (see this article for more information). So the ability to request non-blocking behavior on a per-read basis is needed.

Non-blocking buffered file read operations

Posted Sep 25, 2014 21:48 UTC (Thu) by simlo (guest, #10866) [Link]

I don't see why I can't get the option of making file io behave as non-blocking TCP:

Use select()/poll() to get an event on a readable and/or writable file descriptor; read returns whatever is available, and write writes without blocking until the kernel buffer is full. In both cases, the DMA interrupt comes back and marks the file descriptor readable or writable again, so that select or poll will wake up, just as the network interrupt wakes up my TCP socket.

I know with the current UNIX behaviour, where files are seen as an extension of the direct memory rather than IO, it doesn't make sense to be non-blocking for file-io.
But files are much, much slower than memory, file I/O can fail, and file I/O can go via a network file system. It does not make sense to make file I/O behave differently from network I/O.

So I turn the UNIX philosophy around: instead of describing a network connection with a file descriptor, I would rather use a network descriptor for files.

Non-blocking buffered file read operations

Posted Sep 26, 2014 23:45 UTC (Fri) by giraffedata (guest, #1954) [Link]

The analogous function for regular files to the select and nonblocking read function of a socket is for select to wait until there is data in the file at the current file pointer, and a blocking read to wait at eof until it isn't eof anymore because the file got bigger.

The problem discussed in the article is a different one from the one involved in blocking vs nonblocking socket I/O. In the socket case, it's a question of whether to wait for data to exist, while in the article it's about whether to wait to get existing data, since it may or may not be gettable quickly.

If you want to use select and nonblocking I/O to avoid waiting for file data to be read from the backing store, you need a rather different protocol. You need a system call to start a stream of file data into memory, and then read system calls to consume that data. (And you'd probably use a socket for that).

RWF_ATOMIC and RWF_COMPARE_AND_WRITE

Posted Sep 25, 2014 22:55 UTC (Thu) by dougg (subscriber, #1894) [Link]

In the subject line there are two suggestions for write flags. They could be used when the file descriptor was a block device opened O_DIRECT.

They would tie in with the recent SCSI commands: WRITE ATOMIC and COMPARE AND WRITE (see SBC-4 draft). For the latter command the first half of the buffer would be the compare and the second half the optional write. An errno would be needed to convey the idea of MISCOMPARE (i.e. when the optional write is not done).

Non-blocking buffered file read operations

Posted Sep 26, 2014 12:07 UTC (Fri) by eNovance (guest, #92805) [Link]

"If a read request is accompanied by this flag, it will only complete if (at least some of) the requested data is already in the page cache; otherwise it returns EAGAIN. The current patch does not start any sort of readahead operation if it is unable to satisfy a non-blocking read request."

I don't understand how this API should be used. If I get EAGAIN, how can I asynchronously get more bytes?

Is there an asynchronous "readahead" syscall? Or should I fallback to the blocking read()?

Non-blocking buffered file read operations

Posted Sep 26, 2014 12:23 UTC (Fri) by JGR (subscriber, #93631) [Link]

> I don't understand how this API should be used. If I get EAGAIN, how can I asynchronously get more bytes?

> Is there an asynchronous "readahead" syscall? Or should I fallback to the blocking read()?

The idea is that if you get an EAGAIN you can then do a blocking read using your thread pool.
However if the data is already in the page cache, you get it instantly without having incurred the overhead of context switching to/from a thread pool, and without the risk of blocking your "main" thread potentially indefinitely.

Non-blocking buffered file read operations

Posted Sep 26, 2014 23:49 UTC (Fri) by giraffedata (guest, #1954) [Link]

And there must be some applications where if you can't get the data quickly and cheaply, you'll just do without it.

Non-blocking buffered file read operations

Posted Sep 26, 2014 17:04 UTC (Fri) by jonabbey (guest, #2736) [Link]

Any word on how the BSD folks handle this?

Non-blocking buffered file read operations

Posted Sep 29, 2014 20:11 UTC (Mon) by justincormack (subscriber, #70439) [Link]

As far as I know not particularly well. The interfaces are there for compatibility but are fairly basic.

Non-blocking buffered file read operations

Posted Sep 26, 2014 21:44 UTC (Fri) by ppisa (subscriber, #67307) [Link]

I am not a big Windows fan, but the VMS-inspired approach of passing an event object (lpOverlapped) to ReadFile and WriteFile is an elegant solution. It allows the program to continue running and, when the data is actually required, to synchronize on (wait for) that event. Multiple objects can be waited for at once (WaitForMultipleObjects), but only up to a maximum of 64, which is quite limiting compared with epoll(), poll(), or even select(). Those are almost unlimited; 100,000 descriptors are no problem. For the ancient select(), that requires allocating a larger set than the default libc one provides, but it is manageable.

On the other hand, I am not sure whether handle-based explicit transfers are interesting these days. A 64-bit address space is big enough to map a portion of a file, or the whole file, into the process address space; madvise() plus some clever completion event check/wait mechanism would then be a better option. I can imagine an operation that creates an event descriptor (similar to eventfd), with the application declaring a set of ranges that must be present in physical memory for that "eventfd" to be signaled as ready. The fd could be added to the thread loops' epoll sets. A simpler implementation could signal only a single ready event per fd; a more complex option could report ready ranges back through a read of that "memreadyfd". The second solution would need fewer "memreadyfd" descriptors to be opened in the complex case, but the complexity moves into the "memreadyfd" implementation. A flag to mlock the ready pages of the requested range could be used to ensure that missed reads do not cause unnecessary thrashing. The process could use its ulimit-permitted number of locked pages for these range requests.

With the above approach, a big database engine could request a set of ranges for each pending transaction. The ranges in a given set could even come from different files (as in double-entry bookkeeping), and the actual transactions would be processed in the order in which their data becomes available in main memory.

I even looked into the notification mechanisms in the page cache some years ago, and I think that the described feature would need minimal or no changes there; the changes would be limited mostly to the new user-kernel API. But my knowledge is not sufficient to make these changes without cooperation and, above all, a review of the idea by somebody who can predict the caveats, real usability, and performance of such an approach.

Non-blocking buffered file read operations

Posted Sep 30, 2014 10:16 UTC (Tue) by paulj (subscriber, #341) [Link]

Two comments:

- Why not just add a new flag to fcntl (NON_BLOCKING_IO?), rather than new system calls? Then applications that want the full non-blocking experience can just do open+NON_BLOCK, fcntl+NON_BLOCKING_IO, and hey presto?

- If system calls have to be added then, given the other problems there have been with adding new flags, it might be a good idea to have a way to distinguish between mandatory and optional new flags. That is, if an application is compiled to use some new flag but is run on a kernel that doesn't support it, the old kernel can still do something sensible if it at least knows whether the application requires that flag to be supported.

Non-blocking buffered file read operations

Posted Oct 1, 2014 11:25 UTC (Wed) by nix (subscriber, #2304) [Link]

I'm wondering why there's this obsession with adding flags for every new feature. System calls are not a fundamentally limited resource: we're not going to run out.

You want a flag if it is likely that a program will want to call the function both with and without the flag from the same locus: changing the flags programmatically is a tiny bit easier than changing what function you're calling. But it will be rare to want to sometimes synchronously and sometimes asynchronously read() from the same read()-calling locus. And for this we're getting system calls with the viciously ugly names of '*v2()' and a bunch of flags? (This is particularly true given that the iovec-based interface proposed, while the only possible one, is clumsy enough that nobody will use this call unless they want asynchronous I/O. So we have a flag here that will never be left out! Not until more flags are proposed, anyway.)

Tell me, which is easier to read?

reada(foo, bar, baz);
readv2(foo, bar, baz, RWF_NONBLOCK);

I'd say it's the former. It's not so SHOUTY, and the name is a bit clearer. (Though there's not much in it.)

Non-blocking buffered file read operations

Posted Oct 2, 2014 9:34 UTC (Thu) by ssokolow (guest, #94568) [Link]

Correct me if I'm wrong, but I seem to remember an LWN article from a few years ago where room for new system calls being a limited resource WAS listed as a concern because the kernel ABI distinguished syscalls by integer identifiers rather than character strings for performance reasons.

Non-blocking buffered file read operations

Posted Oct 7, 2014 17:01 UTC (Tue) by nix (subscriber, #2304) [Link]

What, so we can only have billions of them? That's not limited enough to provide a reason to not make something a system call. :)

(More relevant is that you need to wait until support percolates into glibc to use a system call, or have applications hardwire the call themselves during the transition period -- but the same applies to adding new flag bits, since you need to wait for their symbolic constants to hit glibc, or have applications hardwire the call themselves.)

Non-blocking buffered file read operations

Posted Oct 9, 2014 10:04 UTC (Thu) by kevinm (guest, #69913) [Link]

This entire discussion revolves around a fundamental imprecision of language.

Ordinary files *never* block, and this is why O_NONBLOCK does not affect them. select() and poll() will always show ordinary files as readable and writeable.

However, if the data is not immediately available, then the thread enters "disk wait" (this is the origin of the D state, as opposed to the S state shown by threads that are blocked). Waiting and blocking are not the same thing, though.

So really, the flag should be RWF_NO_WAIT.
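
The claim that poll() always reports ordinary files as ready can be checked with a short sketch like the one below. The helper name poll_file_now() is hypothetical; it opens a regular file and returns whatever a zero-timeout poll() reports for it.

```c
#include <fcntl.h>
#include <poll.h>
#include <unistd.h>

/* Open path read-only and ask poll(), with a zero timeout, whether the
 * descriptor is readable. Returns the revents mask, or -1 on error. */
static int poll_file_now(const char *path)
{
    struct pollfd p = { .events = POLLIN };
    int rc;

    p.fd = open(path, O_RDONLY);
    if (p.fd == -1)
        return -1;

    rc = poll(&p, 1, 0);
    close(p.fd);
    return rc == -1 ? -1 : p.revents;
}
```

For a regular file on Linux the returned mask includes POLLIN whether or not the data is in the page cache, so readiness notification says nothing about whether a subsequent read() will have to wait for the disk.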


Copyright © 2014, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds