[Numpy-discussion] Possible roadmap addendum: building better text file readers

Sun Feb 26 14:58:35 EST 2012

On Sun, Feb 26, 2012 at 1:49 PM, Nathaniel Smith <njs at pobox.com> wrote:

> On Sun, Feb 26, 2012 at 7:16 PM, Warren Weckesser
> <warren.weckesser at enthought.com> wrote:
> > On Sun, Feb 26, 2012 at 1:00 PM, Nathaniel Smith <njs at pobox.com> wrote:
> >> For this kind of benchmarking, you'd really rather be measuring the
> >> CPU time, or reading byte streams that are already in memory. If you
> >> can process more MB/s than the drive can provide, then your code is
> >> effectively perfectly fast. Looking at this number has a few
> >> advantages:
> >>  - You get more repeatable measurements (no disk buffers and stuff
> >> messing with you)
> >>  - If your code can go faster than your drive, then the drive won't
> >> make your benchmark look bad
> >>  - There are probably users out there that have faster drives than you
> >> (e.g., I just measured ~340 megabytes/s off our lab's main RAID
> >> array), so it's nice to be able to measure optimizations even after
> >> they stop mattering on your equipment.
> >
> >
> > For anyone benchmarking software like this, be sure to clear the disk
> > cache before each run.  In Linux:
>
> Err, my argument was that you should do exactly the opposite, and just
> worry about hot-cache times (or time reading a big in-memory buffer,
> to avoid having to think about the OS's caching strategies).
>
>

Right, I got that.  Sorry if the placement of the notes about how to clear
the cache seemed to imply otherwise.
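
A minimal sketch of the hot-cache measurement described above: parse a
byte stream that is already in memory and time it with CPU time rather
than wall-clock time, so the disk and the OS cache never enter the
picture. The names make_buffer, parse and cpu_throughput are
hypothetical, and parse() is only a stand-in for a real text reader,
not anything from numpy.

```python
import io
import time


def make_buffer(n_rows, n_cols=4):
    """Build CSV-like bytes entirely in memory (made-up test data)."""
    row = ",".join(["1.5"] * n_cols) + "\n"
    return (row * n_rows).encode("ascii")


def parse(stream):
    """Stand-in parser: convert every field to float, return the field count."""
    n_fields = 0
    for line in stream:
        fields = line.decode("ascii").split(",")
        n_fields += len([float(x) for x in fields])
    return n_fields


def cpu_throughput(data, repeats=3):
    """Best-of-N parsing throughput in MB/s, measured in CPU time."""
    best = float("inf")
    for _ in range(repeats):
        start = time.process_time()
        parse(io.BytesIO(data))  # the stream never touches the disk
        best = min(best, time.process_time() - start)
    # Guard against a timer too coarse to register the run.
    return len(data) / 1e6 / best if best > 0 else float("inf")


if __name__ == "__main__":
    data = make_buffer(200_000)
    print("%.1f MB/s (CPU time)" % cpu_throughput(data))
```

Comparing the printed figure with what the drive can deliver shows
whether the parser or the disk is the bottleneck.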

> Clearing the disk cache is very important for getting meaningful,
> repeatable benchmarks in code where you know that the cache will
> usually be cold and where hitting the disk will have unpredictable
> effects (e.g., pretty much anything doing random access, like
> databases, which have complicated locality patterns and may or may
> not trigger readahead). But here we're talking about pure
> sequential reads, where the disk just goes however fast it goes, and
> your code can either keep up or not.
>
> One minor point where the OS interface could matter: it's good to set
> up your code so it can use mmap() instead of read(), since this can
> reduce overhead. read() has to copy the data from the disk into OS
> memory, and then from OS memory into your process's memory; mmap()
> skips the second step.
>
>

Thanks for the tip.  Do you happen to have any sample code that
demonstrates this?  I'd like to explore this more.

Warren
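
A minimal sketch of the mmap()-versus-read() point quoted above, using
Python's standard mmap module. Counting newlines is only a stand-in for
real parsing work, and the function names are hypothetical. The read()
version copies each chunk into a bytes object owned by the process; the
mmap() version scans the mapped pages in place with mmap.find().

```python
import mmap
import os
import tempfile


def count_newlines_read(path, chunk_size=1 << 20):
    """Count newlines using read(): each chunk is copied into process memory."""
    count = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            count += chunk.count(b"\n")
    return count


def count_newlines_mmap(path):
    """Count newlines by scanning a read-only memory map of the file."""
    with open(path, "rb") as f:
        if os.fstat(f.fileno()).st_size == 0:
            return 0  # an empty file cannot be mapped
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            count = 0
            pos = m.find(b"\n")
            while pos != -1:
                count += 1
                pos = m.find(b"\n", pos + 1)
            return count


if __name__ == "__main__":
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(b"1.0,2.0\n3.0,4.0\n" * 1000)
        print(count_newlines_read(path), count_newlines_mmap(path))
    finally:
        os.remove(path)
```

Note that slicing an mmap object in Python copies the bytes, so any
saving comes from searching or parsing the mapping in place; whether it
is measurably faster depends on the platform and the file size.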