[Numpy-discussion] load from text files Pull Request Review

Tue Sep 13 19:30:47 EDT 2011

On Tue, Sep 13, 2011 at 3:43 AM, Pierre GM <pgmdevlist at gmail.com> wrote:
>
> On Sep 13, 2011, at 01:38 , Christopher Jordan-Squire wrote:
>
>> I did some timings to see what the advantage would be, in the simplest
>> case possible, of taking multiple lines from the file to process at a
>> time. Assuming the dtype is already known. The code is attached. What
>> I found was I can't use generators to avoid constructing a list and
>> then making a tuple from the list.
>
> Still, I think there should be a way to use generators to create the final array (once your dtype is known and assuming you can skip invalid lines)...
>
>>  The catch is that genfromtxt
>> was loading datetimes as strings, while loadtable was loading them as
>> numpy datetimes. The conversion from string to datetime is somewhat
>> expensive, so I think that accounts for some of the extra time. The
>> range of timings--from 1.5 to 3.5 times as slow--reflects how many
>> lines are used to check for sizes and dtypes. As it turns out,
>> checking for those can be quite expensive, and the majority of the
>> time seems to be spent in the regular expression matching. (Though
>> Chris is using a slight variant on my pull request, and I'm getting
>> function times that are not as bad as his.) The cost of the size and
>> type checking was less apparent in the example I have timings on in a
>> previous email because in that case there was a huge cost for
>> converting data with commas instead of decimals and for the datetime
>> conversion.
>
> The problem with parsing dates with re is that depending on your separator, on your local conventions (e.g., MM-DD-YYYY vs DD/MM/YYYY) and on the string itself, you'll get very different results, not always the ones you want. Hence, I preferred to leave dates out of the basic converters and ask the user to supply her own instead. If you can provide functionality in loadtable to that effect, that'd be great.
>
>> Other than possibly changing loadtable to use np.NA instead of masked
>> arrays in the presence of missing data, I'm starting to feel like it's
>> more or less complete for now, and can be left to be improved in the
>> future. Most of the things that have been discussed are either
>> performance trade-offs or somewhat large re-engineering of the
>> internals.
>
> Well, it seems that loadtable doesn't work when you use positions instead of delimiters to separate the fields (e.g. below).
> What if you want to apply some specific conversion to a column? E.g., transform a string representing a hexadecimal value to an int?
>
> Apart from that, I do appreciate the effort you're putting into improving genfromtxt. It's needed, direly. Sorry that I can't find the time to really work on that (I do need to sleep sometimes)… But chats with Pauli V. and Ralf G., among others, during EuroScipy led me to think a basic reorganization of npyio is quite advisable.
>
>
> #C00:07 : YYYYMMDD (8)
> #C08:15 : HH:mm:SS (8)
> #C16:18 : XXX (3)
> #C19:25 : float (7)
> #C26:32 : float (7)
> #C33:39 : float (7)
> # np.genfromtxt('test.txt', delimiter=(8,8,3,7,7,7), usemask=True, dtype=None)
> 2011010112:34:56AAA001.234005.678010.123999.999
> 2011010112:34:57BBB001.234005.678010.123999.999
> 2011010112:34:58CCC001.234005.678010.123999.999
> 2011010112:34:59   001.234005.678010.123999.999
> 2011010112:35:00DDD         5.678010.123
> 2011010112:35:01EEE001.234005.678010.123999.999

Thanks for mentioning the fixed width file type. I had completely
missed that genfromtxt allows that. Though, in all honesty, I wasn't
really intending loadtable to be a drop-in replacement for genfromtxt.
It's meant more as a more robust and memory-efficient alternative. I
think I can add that functionality to loadtable, but it might require
adding some special-case code. Almost everything is geared towards
delimited text rather than fixed width text.
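
To make the special-casing concrete, here is a rough pure-Python sketch
of splitting fixed-width records by character widths. It is not
loadtable or genfromtxt code; the helper names (split_fixed_width,
to_float_or_none) are made up for illustration, and the widths assume
four 7-character float fields, as the sample rows above appear to have.

```python
from itertools import accumulate


def split_fixed_width(line, widths):
    """Cut one record into stripped fields of the given character widths."""
    ends = list(accumulate(widths))
    starts = [0] + ends[:-1]
    # Slicing past the end of a short line yields '' instead of raising,
    # so truncated records simply produce empty trailing fields.
    return [line[s:e].strip() for s, e in zip(starts, ends)]


def to_float_or_none(text):
    """Treat a blank field as missing (hypothetical missing-value policy)."""
    return float(text) if text else None


# Assumed layout: date (8), time (8), code (3), then four floats (7 each).
WIDTHS = (8, 8, 3, 7, 7, 7, 7)

rows = [
    "2011010112:34:56AAA001.234005.678010.123999.999",
    # A record with a blank first float and a missing last float.
    "2011010112:35:00DDD" + " " * 7 + "  5.678" + "010.123",
]

parsed = []
for row in rows:
    date, time, code, *values = split_fixed_width(row, WIDTHS)
    parsed.append((date, time, code or None,
                   [to_float_or_none(v) for v in values]))

for record in parsed:
    print(record)
```

The point is only that field boundaries come from positions rather than
from a delimiter search, so blank or truncated fields have to be mapped
to missing values explicitly.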

An idea that was floated today when I talked about loadtable at
Enthought was refactoring it as a class, and then letting some of the
internals that currently aren't exposed to the user be exposed. That
way the user could specify their own converters if desired without
having to add yet another parameter. In fact, it could make it
possible to remove some of the existing parameters by making them
instance variables, for example. How do people feel about that?
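
As a very rough sketch of what I mean (the class name, attributes, and
behaviour here are all hypothetical, not what loadtable does today),
converters could live on the instance instead of being passed as yet
another keyword argument:

```python
import io


class TableLoader:
    """Hypothetical class-based loader; NOT the actual loadtable API."""

    def __init__(self, delimiter=","):
        self.delimiter = delimiter
        # Column index -> callable. Columns without an entry stay strings.
        self.converters = {}

    def load(self, fileobj):
        rows = []
        for line in fileobj:
            line = line.strip()
            if not line or line.startswith("#"):
                continue  # skip blank lines and comment lines
            fields = [f.strip() for f in line.split(self.delimiter)]
            rows.append(tuple(
                self.converters.get(i, str)(field)
                for i, field in enumerate(fields)
            ))
        return rows


loader = TableLoader(delimiter=",")
# User-supplied conversions, e.g. a hexadecimal string to an int.
loader.converters[1] = lambda s: int(s, 16)
loader.converters[2] = float

data = io.StringIO("# name, flags, value\nalpha, ff, 1.5\nbeta, 0x10, 2.25\n")
table = loader.load(data)
print(table)
```

The same pattern would let a user plug in their own date parser for a
column without the loader having to guess at date formats.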

In terms of refactoring numpy io, was there anything concrete or
specific discussed?

-Chris JS
