[Numpy-discussion] Memory efficient alternative for np.loadtxt and np.genfromtxt
chris.barker at noaa.gov
Tue Oct 28 16:09:09 EDT 2014
A few thoughts:
1) yes, a faster, more memory efficient text file parser would be great.
Yeah, if your workflow relies on parsing lots of huge text files, you
probably need another workflow. But it's a really really common thing to
need to do -- why not do it fast?
2) """you are describing a special case where you know the data size
a priori (e.g. not streaming), dtypes are readily apparent from a small sample
and in general your data is not messy """
Sure -- that's a special case, but it's a really common special case (OK --
without knowing your data size, anyway...)
> Someone also posted some code or the draft thereof for using resizable
> arrays quite a while ago, which would
> reduce the memory overhead for very large arrays.
That may have been me -- I have a resizable array class, in both a pure
Python version and a not-quite-finished Cython version. In practice, if you
add stuff to the array row by row (or item by item), it's no faster than
putting it all in a list and then converting to an array -- but it IS more
memory efficient, which seems to be the issue here. Let me know if you want
it -- I really need to get it up on GitHub one of these days.
My take: for fast parsing of big files you need:
To do the parsing/converting in C -- what's wrong with good old fscanf, at
least for the basic types -- it's pretty darn fast.
Memory efficiency -- something like my growable array is not all that hard
to implement and pretty darn quick -- you just do the usual trick: over-
allocate a bit of memory, and when it gets full, re-allocate a larger chunk.
It turns out, at least on the hardware I tested on, that the performance is
not very sensitive to how much you over-allocate -- if it's tiny (1
element) performance really sucks, but once you get to 10% or so (maybe
less) over-allocation, you don't notice the difference.
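For illustration, the over-allocation trick could look something like the
sketch below. This is not the resizable array class mentioned above; the
class name GrowableArray and the parameters initial_capacity and growth are
made up for the example, and it only handles 1-D data.

```python
import numpy as np


class GrowableArray:
    """Hypothetical 1-D growable array (illustration only)."""

    def __init__(self, dtype=np.float64, initial_capacity=16, growth=0.25):
        # Allocate more room than is needed right away.
        self._data = np.empty(max(1, initial_capacity), dtype=dtype)
        self._size = 0
        self._growth = growth  # fraction of extra capacity added per resize

    def append(self, value):
        capacity = self._data.shape[0]
        if self._size == capacity:
            # Buffer is full: allocate a larger chunk and copy the data over.
            extra = max(1, int(capacity * self._growth))
            new_data = np.empty(capacity + extra, dtype=self._data.dtype)
            new_data[:self._size] = self._data
            self._data = new_data
        self._data[self._size] = value
        self._size += 1

    def __len__(self):
        return self._size

    def to_array(self):
        # Copy only the filled part so the spare capacity can be released.
        return self._data[:self._size].copy()
```

Appending parsed values one at a time and calling to_array() at the end
gives a regular NumPy array, without first building a Python list of
boxed values.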
Keep the auto-figuring out of the structure / dtypes separate from the high
speed parsing code. I'd say write high speed parsing code first -- that
requires specification of the data types and structure, then, if you want,
write some nice pure Python code that tries to auto-detect all that. If
it's a small file, it's fast regardless. If it's a large file, then the
overhead of the fancy parsing will be lost, and you'll want the line by
line parsing to be as fast as possible.
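A rough sketch of that separation, under some assumptions: parse_rows and
guess_converters are hypothetical names, the input is simple delimited text,
and the "fast" part is plain Python here, standing in for code that would
really be written in C or Cython.

```python
def parse_rows(lines, converters, delimiter=","):
    """Parse lines using an explicit, caller-supplied converter per column.

    Stand-in for the fast parser: it never guesses types.
    """
    rows = []
    for line in lines:
        fields = line.rstrip("\n").split(delimiter)
        rows.append(tuple(conv(f) for conv, f in zip(converters, fields)))
    return rows


def guess_converters(sample_lines, delimiter=","):
    """Guess int, float, or str for each column from a small sample."""
    split_rows = (line.rstrip("\n").split(delimiter) for line in sample_lines)
    converters = []
    for column in zip(*split_rows):
        for candidate in (int, float):
            try:
                for field in column:
                    candidate(field)
            except ValueError:
                continue  # this type does not fit, try the next one
            converters.append(candidate)
            break
        else:
            # Neither int nor float worked for every sampled value.
            converters.append(str)
    return converters


lines = ["1,2.5,a", "2,3.0,b", "3,4.5,c"]
converters = guess_converters(lines[:2])  # auto-detect from a small sample
rows = parse_rows(lines, converters)      # then parse with a fixed spec
```

The detection step only ever looks at the sample, so its cost does not grow
with the file size, and a caller who already knows the column types can skip
it and pass the converters directly.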
From a quick look, it seems that the pandas code is pretty nice -- maybe
the 2X memory footprint should be ignored.
Christopher Barker, Ph.D.
Emergency Response Division
NOAA/NOS/OR&R (206) 526-6959 voice
7600 Sand Point Way NE (206) 526-6329 fax
Seattle, WA 98115 (206) 526-6317 main reception
Chris.Barker at noaa.gov