[Numpy-discussion] Possible roadmap addendum: building better text file readers

Frédéric Bastien nouiz at nouiz.org
Fri Mar 2 10:26:48 EST 2012


Hi,

mmap can give a speedup in some cases, but a slowdown in others, so care
must be taken when using it. For example, the speed difference between
read and mmap is not the same when the file is local as when it is
on NFS. On NFS, you need to read bigger chunks to make it worthwhile.

Another example is on an SMP computer. If, for example, you have an
8-core machine but only enough RAM for 1 or 2 copies of your dataset,
using mmap is a bad idea. If you read the file in chunks, the OS will
normally keep the file in its cache in RAM, so if you launch 8 jobs
they will all share the data through the system cache. If you use
mmap, I think this bypasses the OS cache, so you will always read the
file from disk. On NFS with a cluster of computers, this can put a
high load on the file server. So having a way to specify whether or
not to use mmap would be great, since you can't always guess the right
thing to do. (Unless I'm wrong and mmap doesn't bypass the OS cache.)
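
As an illustration, a minimal sketch of such a switch could look like
the code below (the load_numbers name, the default chunk size and the
line-by-line parsing are placeholders for the example, not an actual
NumPy API):

import mmap
import numpy as np

def load_numbers(path, use_mmap=False, chunk_size=64 * 1024 * 1024):
    """Toy loader: parse whitespace-separated numbers from a text file,
    either via plain buffered reads or via mmap."""
    with open(path, "rb") as f:
        if use_mmap:
            # Map the file read-only and parse straight from the mapping.
            with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
                values = []
                for line in iter(mm.readline, b""):
                    values.extend(float(tok) for tok in line.split())
                return np.array(values)
        # read() path: pull the file in big chunks; larger chunks tend to
        # pay off on NFS.  readlines(hint) stops after roughly chunk_size
        # bytes but always at a line boundary, so no token is ever split.
        values = []
        while True:
            lines = f.readlines(chunk_size)
            if not lines:
                break
            for line in lines:
                values.extend(float(tok) for tok in line.split())
        return np.array(values)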

Anyway, it is great to see people working on this problem; these were
just a few comments I had in mind when I read this thread.

Frédéric

On Sun, Feb 26, 2012 at 4:22 PM, Warren Weckesser
<warren.weckesser at enthought.com> wrote:
>
>
> On Sun, Feb 26, 2012 at 3:00 PM, Nathaniel Smith <njs at pobox.com> wrote:
>>
>> On Sun, Feb 26, 2012 at 7:58 PM, Warren Weckesser
>> <warren.weckesser at enthought.com> wrote:
>> > Right, I got that.  Sorry if the placement of the notes about how to
>> > clear
>> > the cache seemed to imply otherwise.
>>
>> OK, cool, np.
>>
>> >> Clearing the disk cache is very important for getting meaningful,
>> >> repeatable benchmarks in code where you know that the cache will
>> >> usually be cold and where hitting the disk will have unpredictable
>> >> effects (i.e., pretty much anything doing random access, like
>> >> databases, which have complicated locality patterns, where you may or
>> >> may not trigger readahead, etc.). But here we're talking about pure
>> >> sequential reads, where the disk just goes however fast it goes, and
>> >> your code can either keep up or not.
>> >>
>> >> One minor point where the OS interface could matter: it's good to set
>> >> up your code so it can use mmap() instead of read(), since this can
>> >> reduce overhead. read() has to copy the data from the disk into OS
>> >> memory, and then from OS memory into your process's memory; mmap()
>> >> skips the second step.
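
A minimal illustration of that difference, using a hypothetical flat
binary file of float64 values ("data.bin" is made up for the example;
np.fromfile goes through read(), while np.frombuffer over an mmap uses
the mapped pages in place):

import mmap
import numpy as np

# read() path: bytes are copied from the OS page cache into a buffer
# owned by the process before the array is built.
with open("data.bin", "rb") as f:
    a = np.fromfile(f, dtype=np.float64)

# mmap path: the array wraps the mapped pages directly, skipping that
# second copy.
with open("data.bin", "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    b = np.frombuffer(mm, dtype=np.float64)

numpy.memmap wraps essentially the same idea behind an ndarray
interface.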
>> >
>> > Thanks for the tip.  Do you happen to have any sample code that
>> > demonstrates
>> > this?  I'd like to explore this more.
>>
>> No, I've never actually run into a situation where I needed it myself,
>> but I learned the trick from Tridge so I tend to believe it :-).
>> mmap() is actually a pretty simple interface -- the only thing I'd
>> watch out for is that you want to mmap() the file in pieces (so as to
>> avoid VM exhaustion on 32-bit systems), but you want to use pretty big
>> pieces (because each call to mmap()/munmap() has overhead). So you
>> might want to use chunks in the 32-128 MiB range. Or since I guess
>> you're probably developing on a 64-bit system you can just be lazy and
>> mmap the whole file for initial testing. git uses mmap, but I'm not
>> sure it's very useful example code.
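
For reference, a rough sketch of that windowed pattern (the 64 MiB
window and the handle_chunk callback are placeholders; a real text
reader would also need to handle records that straddle a window
boundary):

import mmap
import os

CHUNK = 64 * 1024 * 1024  # example window size in the 32-128 MiB range

def process_in_windows(path, handle_chunk):
    """Map a large file one window at a time instead of all at once."""
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        offset = 0
        while offset < size:
            length = min(CHUNK, size - offset)
            # offset must be a multiple of mmap.ALLOCATIONGRANULARITY;
            # a 64 MiB window is a multiple of the usual granularities.
            with mmap.mmap(f.fileno(), length, access=mmap.ACCESS_READ,
                           offset=offset) as mm:
                handle_chunk(mm)
            offset += length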
>>
>> Also it's not going to do magic. Your code has to be fairly quick
>> before avoiding a single memcpy() will be noticeable.
>>
>> HTH,
>
>
>
> Yes, thanks!   I'm working on a mmap version now.  I'm very curious to see
> just how much of an improvement it can give.
>
> Warren


