fincore()
Linux has long had the mincore() system call which allows an application to determine whether a given page is in RAM or not. There is no easy way, though, to tell whether a given page from a file is in the page cache or not. An application can mmap() the file and use mincore() on it, but that can be slow. So Chris Frost has proposed a new fincore() system call to handle this task:
int fincore(int fd, loff_t start, loff_t len, unsigned char *vec);
A call to fincore() will look at the pages of the file associated with fd in the range indicated by start and len. For each page of the file, one byte of vec will be set to a non-zero value if that page is in memory. Naturally, this answer is an approximation - the situation can change while the system call is running.
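The mmap()-plus-mincore() workaround described above can be sketched as follows. This is a minimal illustration, not code from Chris's patch or from libprefetch; the helper name count_resident_pages() is hypothetical, and it assumes a page-aligned start offset and a descriptor opened for reading.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1   /* exposes mincore() under strict -std= modes on glibc */
#endif
#include <assert.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: count how many pages of the file range
 * [start, start + len) are currently resident in memory.
 * start must be page-aligned. Returns -1 on any error.
 */
long count_resident_pages(int fd, off_t start, size_t len)
{
    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0);

    if (len == 0 || start % page != 0)
        return -1;

    size_t npages = (len + (size_t)page - 1) / (size_t)page;

    /* Map the range; the mapping alone does not read the data in. */
    void *map = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, start);
    if (map == MAP_FAILED)
        return -1;

    unsigned char *vec = malloc(npages);
    long resident = -1;

    if (vec != NULL && mincore(map, len, vec) == 0) {
        resident = 0;
        for (size_t i = 0; i < npages; i++)
            resident += vec[i] & 1;   /* low bit set: page is resident */
    }

    free(vec);
    munmap(map, len);
    return resident;
}
```

Every query here pays for setting up and tearing down a mapping, which is presumably part of why this route can be slow; like fincore(), the answer is only a snapshot.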
That, however, can be good enough for Chris's needs. His objective is to speed up applications which perform large numbers of non-sequential file reads. The traditional readahead code deals poorly with this kind of application, since the access pattern cannot be predicted ahead of time. But the application often does know about a sequence of reads in advance; if the kernel could be told to pull in those pages ahead of time, it could order the I/O operations optimally and make the whole thing go faster. When doing this for sqlite and the GIMP, Chris reports significant speedups.
The fadvise() system call can be used to request prefetching of file data. But there's a problem: it's hard for a prefetch library to know how much system memory is available. If too little data is prefetched, the performance gains will not be what they could be. Prefetching too much data, however, can lead to thrashing. Hence the fincore() system call: if prefetched pages are no longer present by the time the application gets around to using them, the library knows that it is asking for too much and can back off.
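The back-off idea can be sketched roughly as below. The prefetch hint uses posix_fadvise() with POSIX_FADV_WILLNEED, the userspace interface to fadvise on Linux. The adjust_window() policy is purely an assumption for illustration (halve on eviction, grow by a quarter otherwise); it is not libprefetch's actual algorithm, and both function names are hypothetical. The resident count would come from fincore() or some equivalent residency check.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L   /* exposes posix_fadvise() under strict -std= modes */
#endif
#include <fcntl.h>
#include <stddef.h>

/*
 * Hint that [offset, offset + len) of fd will be needed soon.
 * Returns 0 on success or an errno value on failure.
 */
int prefetch_range(int fd, off_t offset, off_t len)
{
    return posix_fadvise(fd, offset, len, POSIX_FADV_WILLNEED);
}

/*
 * Assumed policy, for illustration only: if some of the pages we asked
 * for were evicted before use, halve the prefetch window; otherwise grow
 * it by a quarter. The result is clamped to [min_window, max_window].
 * All quantities are in pages.
 */
size_t adjust_window(size_t window, size_t resident, size_t requested,
                     size_t min_window, size_t max_window)
{
    if (resident < requested)
        window /= 2;              /* asked for too much: back off */
    else
        window += window / 4;     /* everything survived: try more */

    if (window < min_window)
        window = min_window;
    if (window > max_window)
        window = max_window;
    return window;
}
```

A prefetch library might call adjust_window() just before consuming each batch, feeding it the number of pages it requested and the number still resident at that moment.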
Andrew Morton likes the patch.
Jamie Lokier, though, wondered if it might not be a better idea to find a way to inform applications more directly that their pages are being evicted prior to use.
This is the first posting for this system call, so it has not gotten a lot of attention yet; more discussion will certainly be necessary before it could be merged. In the meantime, the libprefetch site has more information on this whole project.
fincore()
Posted Jan 28, 2010 4:49 UTC (Thu) by bradfitz (subscriber, #4378)
fincore()
Posted Jan 28, 2010 17:18 UTC (Thu) by iabervon (subscriber, #722)
It doesn't make sense to have a userspace heuristic for figuring out kernel limits when you need kernel support to implement it, particularly if the information you're getting only helps if you are right about the kernel's heuristics. Maybe the kernel will stop evicting pages that have been requested but not used when asked to prefetch more pages, and heuristics based on checking whether pages are in core and an assumption as to the kernel's use of the hints will give entirely wrong results.
fincore()
Posted Jan 29, 2010 17:38 UTC (Fri) by giraffedata (guest, #1954)
I agree. First of all, fadvise() does not request prefetch. It advises the kernel that you are going to access a certain part of the file soon. It is up to the prefetcher to decide how to exploit that information. Only the prefetcher, in the kernel, can properly decide how much memory to allocate for prefetching this particular file. Memory is a resource shared between processes, and coordinating resource usage between processes is fundamentally the kernel's responsibility. The application should just look out for itself.
fincore()
Posted Jan 29, 2010 6:19 UTC (Fri) by kleptog (subscriber, #1183)
But other than that I think it's a fabulous idea. Although you could achieve much the same benefits if you could do a read() and have it also return a flag indicating if the data was all in memory or not.