[Numpy-discussion] Memory efficient alternative for np.loadtxt and np.genfromtxt

Benjamin Root ben.root at ou.edu
Tue Oct 28 16:25:52 EDT 2014


As a bit of an aside, I have just discovered that for fixed-width text
data, numpy's text readers seem to edge out pandas' read_fwf(), and numpy
has the advantage of being able to specify the dtypes ahead of time (it
seems the pandas version just won't allow it, which means I end up with
float64's and object dtypes instead of float32's and |S12 dtypes where I
want them).
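
For reference, a minimal sketch of what I mean (the sample data is made
up; genfromtxt takes a sequence of field widths as `delimiter`, and the
dtypes can be pinned up front instead of inferred):

```python
import io
import numpy as np

# Hypothetical fixed-width sample: a 12-char ID column and two 8-char floats.
sample = io.StringIO(
    "STATION_0001     1.5     2.5\n"
    "STATION_0002     3.0     4.0\n"
)

# A sequence of ints as `delimiter` means fixed field widths; the dtypes
# (|S12, float32) are specified ahead of time rather than guessed, and
# autostrip removes the padding whitespace inside each field.
arr = np.genfromtxt(
    sample,
    delimiter=[12, 8, 8],
    dtype=[("id", "S12"), ("x", "f4"), ("y", "f4")],
    autostrip=True,
)
```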

Cheers!
Ben Root


On Tue, Oct 28, 2014 at 4:09 PM, Chris Barker <chris.barker at noaa.gov> wrote:

> A few thoughts:
>
> 1) yes, a faster, more memory-efficient text file parser would be great.
> Yeah, if your workflow relies on parsing lots of huge text files, you
> probably need another workflow. But it's a really common thing to need
> to do -- why not do it fast?
>
> 2) """you are describing a special case where you know the data size
> apriori (eg not streaming), dtypes are readily apparent from a small sample
> case
> and in general your data is not messy """
>
> sure -- that's a special case, but it's a really common special case (OK
> -- without knowing your data size, anyway...)
>
> 3)
>
>> Someone also posted some code or the draft thereof for using resizable
>> arrays quite a while ago, which would
>> reduce the memory overhead for very large arrays.
>>
>
> That may have been me -- I have a resizable array class, both a pure
> Python version and a not-quite-finished Cython version. In practice, if
> you add stuff to the array row by row (or item by item), it's no faster
> than putting it all in a list and then converting to an array -- but it
> IS more memory efficient, which seems to be the issue here. Let me know
> if you want it -- I really need to get it up on GitHub one of these days.
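
(The list-then-convert baseline I'm comparing against is just the obvious
thing -- a sketch, not my actual code:

```python
import numpy as np

# The simple baseline: accumulate rows in a Python list, convert once.
# Fast enough in time, but the list of lists holds every value as a
# boxed Python object until the final conversion -- that's the memory cost.
rows = []
for line in ["1.0 2.0", "3.0 4.0", "5.0 6.0"]:
    rows.append([float(tok) for tok in line.split()])
arr = np.asarray(rows, dtype=np.float64)
```

)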
>
> My take: for fast parsing of big files you need:
>
> To do the parsing/converting in C -- what's wrong with good old fscanf,
> at least for the basic types? It's pretty darn fast.
>
> Memory efficiency -- something like my growable array is not all that
> hard to implement and pretty darn quick -- you just do the usual trick:
> over-allocate a bit of memory, and when it gets full, re-allocate a
> larger chunk. It turns out, at least on the hardware I tested on, that
> the performance is not very sensitive to how much you over-allocate --
> if it's tiny (1 element) performance really sucks, but once you get to
> 10% or so (maybe less) over-allocation, you don't notice the difference.
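
(That trick can be sketched in pure Python along these lines -- a toy
illustration, not my actual class; the class name and growth factor are
made up:

```python
import numpy as np

class GrowableArray:
    """Over-allocating 1-D array: grow the backing buffer by a fixed
    fraction when it fills, so appends are amortized O(1)."""

    def __init__(self, dtype=np.float64, overalloc=1.25):
        self._buf = np.empty(8, dtype=dtype)
        self._n = 0
        self._overalloc = overalloc

    def append(self, value):
        if self._n == self._buf.size:
            # Full: re-allocate ~25% bigger and copy the old data over.
            bigger = np.empty(int(self._buf.size * self._overalloc) + 1,
                              dtype=self._buf.dtype)
            bigger[:self._n] = self._buf
            self._buf = bigger
        self._buf[self._n] = value
        self._n += 1

    def asarray(self):
        # A view of just the filled part (copy it if you need to keep it
        # past the next append).
        return self._buf[:self._n]
```

)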
>
> Keep the auto-figuring-out of the structure/dtypes separate from the
> high-speed parsing code. I'd say write the high-speed parsing code first
> -- that requires specification of the data types and structure -- then,
> if you want, write some nice pure-Python code that tries to auto-detect
> all that. If it's a small file, it's fast regardless. If it's a large
> file, the overhead of the fancy auto-detection will be lost in the noise,
> and you'll want the line-by-line parsing to be as fast as possible.
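
(That two-phase split might look something like this -- the sniffing
helper is hypothetical, just to show the shape of the idea:

```python
import io
import numpy as np

def sniff_column_types(sample_lines):
    """Guess a type per whitespace-separated column from a small sample:
    try int, then float, else fall back to a string dtype. (Hypothetical
    helper -- the point is that this slow, flexible step runs once, up
    front, and hands a full spec to the fast parser.)"""
    columns = list(zip(*(line.split() for line in sample_lines)))
    guessed = []
    for col in columns:
        for caster, dt in ((int, np.int64), (float, np.float64)):
            try:
                for tok in col:
                    caster(tok)
                guessed.append(dt)
                break
            except ValueError:
                continue
        else:
            guessed.append(np.dtype("U32"))
    return guessed

# Phase 1: sniff on a few lines; phase 2: the fast reader gets a full spec.
text = "1 2.5 a\n3 4.0 b\n"
types = sniff_column_types(text.splitlines()[:10])
arr = np.loadtxt(io.StringIO(text),
                 dtype={"names": ("c0", "c1", "c2"), "formats": types})
```

)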
>
> From a quick look, it seems that the pandas code is pretty nice -- maybe
> the 2X memory footprint should be ignored.
>
> -Chris
>
>
>> Cheers,
>>                                                 Derek
>>
>>
>>
>> _______________________________________________
>> NumPy-Discussion mailing list
>> NumPy-Discussion at scipy.org
>> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>>
>
>
>
> --
>
> Christopher Barker, Ph.D.
> Oceanographer
>
> Emergency Response Division
> NOAA/NOS/OR&R            (206) 526-6959   voice
> 7600 Sand Point Way NE   (206) 526-6329   fax
> Seattle, WA  98115       (206) 526-6317   main reception
>
> Chris.Barker at noaa.gov
>