[Numpy-discussion] R: fast numpy.fromfile skipping data chunks

Wed Mar 13 11:54:24 EDT 2013

Thanks a lot for the feedback, I'll try to modify my function to overcome this issue.
Since I'm in the process of buying new hardware too, a slightly OT (but definitely related) question:
does an SSD provide a substantial improvement in these cases?
________________________________________
From: numpy-discussion-bounces at scipy.org [numpy-discussion-bounces at scipy.org] on behalf of Nathaniel Smith [njs at pobox.com]
Sent: Wednesday, 13 March 2013 16:43
To: Discussion of Numerical Python
Subject: Re: [Numpy-discussion] R: R: R: R: fast numpy.fromfile skipping data chunks

On 13 Mar 2013 15:16, "Andrea Cimatoribus" <Andrea.Cimatoribus at nioz.nl> wrote:
>
> Ok, this seems to be working (well, as soon as I get the right offset and things like that, but that's a different story).
> The problem is that it does not go any faster than my initial function compiled with Cython, and it is still a lot slower than fromfile. Is there a reason why, even with compiled code, reading from a file while skipping some records should be slower than reading the whole file?

Oh, in that case you're probably IO bound, not CPU bound, so Cython etc. can't help.

Traditional spinning-disk hard drives can read quite quickly, but take a long time to find the right place to read from and start reading. Your OS has heuristics in it to detect sequential reads and automatically start the setup for the next read while you're processing the previous read, so you don't see the seek overhead. If your reads are separated widely enough, these heuristics will get confused and you'll drop back to doing a new disk seek on every call to read(), which is deadly. (And would explain what you're seeing.) If this is what's going on, your best bet is to just write a Python loop that uses fromfile() to read some largeish (megabytes?) chunk, subsample it and throw away the rest, and repeat.
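
A minimal sketch of such a loop, under assumptions not stated in the thread: the file is a flat sequence of fixed-size records with no header, and the goal is to keep every `step`-th record. The function name `read_subsampled`, its parameters, and the default chunk size are all made up for illustration; a real file would need the right dtype and offset.

```python
import numpy as np

def read_subsampled(path, dtype, step, chunk_records=1_000_000):
    """Keep every `step`-th record of a flat binary file (hypothetical helper).

    The file is read sequentially in large chunks so the disk never has to
    seek between small reads; unwanted records are discarded in memory.
    """
    dtype = np.dtype(dtype)
    kept = []
    phase = 0  # index, within the next chunk, of the first record to keep
    with open(path, "rb") as f:
        while True:
            # fromfile continues from the current file position and returns
            # fewer than `count` items (possibly zero) at end of file.
            chunk = np.fromfile(f, dtype=dtype, count=chunk_records)
            if chunk.size == 0:
                break
            # Copy the slice so the large chunk buffer can be freed.
            kept.append(chunk[phase::step].copy())
            # Carry the sampling position across the chunk boundary.
            phase = (phase - chunk.size) % step
    if not kept:
        return np.empty(0, dtype=dtype)
    return np.concatenate(kept)

# Example call; "data.bin" and the float32 dtype are placeholders:
# samples = read_subsampled("data.bin", np.float32, step=100)
```

The chunk size is the tuning knob here: it should be large enough that each read is a long sequential transfer (a few megabytes or more), but small enough to fit comfortably in memory.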

-n