[Numpy-discussion] Possible roadmap addendum: building better text file readers

Nathaniel Smith njs at pobox.com
Sun Feb 26 16:00:30 EST 2012


On Sun, Feb 26, 2012 at 7:58 PM, Warren Weckesser
<warren.weckesser at enthought.com> wrote:
> Right, I got that.  Sorry if the placement of the notes about how to clear
> the cache seemed to imply otherwise.

OK, cool, np.

>> Clearing the disk cache is very important for getting meaningful,
>> repeatable benchmarks in code where you know that the cache will
>> usually be cold and where hitting the disk will have unpredictable
>> effects (i.e., pretty much anything doing random access, like
>> databases, which have complicated locality patterns, you may or may
>> not trigger readahead, etc.). But here we're talking about pure
>> sequential reads, where the disk just goes however fast it goes, and
>> your code can either keep up or not.
>>
>> One minor point where the OS interface could matter: it's good to set
>> up your code so it can use mmap() instead of read(), since this can
>> reduce overhead. read() has to copy the data from the disk into OS
>> memory, and then from OS memory into your process's memory; mmap()
>> skips the second step.
>
> Thanks for the tip.  Do you happen to have any sample code that demonstrates
> this?  I'd like to explore this more.

No, I've never actually run into a situation where I needed it myself,
but I learned the trick from Tridge so I tend to believe it :-).
mmap() is actually a pretty simple interface -- the only thing I'd
watch out for is that you want to mmap() the file in pieces (so as to
avoid running out of virtual address space on 32-bit systems), but you
want to use pretty big pieces (because each call to mmap()/munmap()
has overhead). So you might want to use chunks in the 32-128 MiB
range. Or since I guess
you're probably developing on a 64-bit system you can just be lazy and
mmap the whole file for initial testing. git uses mmap, but I'm not
sure it's very useful example code.
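
For what it's worth, here is a rough, untuned sketch of the chunked
approach using Python's standard mmap module. The names
iter_mmap_chunks and count_newlines are made up for illustration, the
64 MiB default is just a value from the range above, and counting
newlines stands in for whatever real parsing you'd do. The one
non-obvious detail is that the offset passed to mmap has to be a
multiple of mmap.ALLOCATIONGRANULARITY, so the chunk size is rounded
down to such a multiple.

```python
import mmap
import os


def iter_mmap_chunks(path, chunk_size=64 * 1024 * 1024):
    """Yield successive read-only mmap windows over the file at `path`.

    Hypothetical helper for illustration only.
    """
    # Offsets passed to mmap must be multiples of the allocation
    # granularity, so round the chunk size down to such a multiple
    # (but never below one granule).
    gran = mmap.ALLOCATIONGRANULARITY
    chunk_size = max(gran, (chunk_size // gran) * gran)

    with open(path, "rb") as f:
        size = os.fstat(f.fileno()).st_size
        offset = 0
        # An empty file cannot be mapped; the loop simply never runs.
        while offset < size:
            length = min(chunk_size, size - offset)
            window = mmap.mmap(
                f.fileno(), length, access=mmap.ACCESS_READ, offset=offset
            )
            try:
                yield window
            finally:
                # Unmap this window before mapping the next one.
                window.close()
            offset += length


def count_newlines(path, chunk_size=64 * 1024 * 1024):
    """Count b'\\n' bytes in a file by scanning it one mapped window at a time."""
    total = 0
    for window in iter_mmap_chunks(path, chunk_size):
        pos = window.find(b"\n")
        while pos != -1:
            total += 1
            pos = window.find(b"\n", pos + 1)
    return total
```

A real text reader would also have to deal with records that straddle
a chunk boundary, which this sketch sidesteps by only counting single
bytes.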

Also it's not going to do magic. Your code has to be fairly quick
before avoiding a single memcpy() will be noticeable.

HTH,
-- Nathaniel


