[Numpy-discussion] load from text files Pull Request Review

Tue Sep 13 04:43:38 EDT 2011

On Sep 13, 2011, at 01:38 , Christopher Jordan-Squire wrote:

> I did some timings to see what the advantage would be, in the simplest
> case possible, of taking multiple lines from the file to process at a
> time, assuming the dtype is already known. The code is attached. What
> I found was that I can't use generators to avoid constructing a list
> and then making a tuple from the list.

Still, I think there should be a way to use generators to create the final array (once your dtype is known and assuming you can skip invalid lines)...
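
To make the idea concrete, here is a minimal sketch of feeding a generator straight into np.fromiter with a structured dtype. It is not what genfromtxt or loadtable does internally; the helper name parse_rows, the two-column layout and the sample text are assumptions made up for illustration.

```python
import io
import numpy as np

def parse_rows(lines, delimiter=","):
    """Yield one (int, float) tuple per valid line; silently skip invalid lines."""
    for line in lines:
        parts = line.strip().split(delimiter)
        try:
            yield int(parts[0]), float(parts[1])
        except (ValueError, IndexError):
            # Invalid or short line: skip it instead of aborting the load.
            continue

# Hypothetical input: the second line cannot be converted and is dropped.
text = io.StringIO("1,2.5\n2,oops\n3,4.5\n")

# The dtype is assumed to be known in advance.
dtype = np.dtype([("id", np.int64), ("value", np.float64)])

# np.fromiter consumes the generator row by row; no intermediate list is built here.
arr = np.fromiter(parse_rows(text), dtype=dtype)
```

Since the number of rows is not known up front, np.fromiter has to grow its buffer as it goes; passing count= when the size is known avoids that.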

> The catch is that genfromtxt
> was loading datetimes as strings, while loadtable was loading them as
> numpy datetimes. The conversion from string to datetime is somewhat
> expensive, so I think that accounts for some of the extra time. The
> range of timings--between 1.5 and 3.5 times as slow--reflects how many
> lines are used to check for sizes and dtypes. As it turns out,
> checking for those can be quite expensive, and the majority of the
> time seems to be spent in the regular expression matching. (Though
> Chris is using a slight variant on my pull request, and I'm getting
> function times that are not as bad as his.) The cost of the size and
> type checking was less apparent in the example I have timings on in a
> previous email because in that case there was a huge cost for
> converting data with commas instead of decimals and for the datetime
> conversion.

The problem with parsing dates with re is that, depending on your separator, your local conventions (e.g., MM-DD-YYYY vs DD/MM/YYYY) and the string itself, you'll get very different results, and not always the ones you want. Hence, I preferred to leave dates out of the basic converters and ask the user to supply her own instead. If you can provide that functionality in loadtable, that'd be great.
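
As a rough illustration of why the format has to come from the user, the sketch below builds a converter from an explicit strptime format. The factory name make_date_converter is hypothetical; a callable like the one it returns could be passed through the converters argument of genfromtxt, and whether loadtable would accept the same thing is not settled here.

```python
from datetime import datetime
import numpy as np

def make_date_converter(fmt):
    """Return a callable that parses one field with an explicit date format."""
    def convert(field):
        # Depending on the NumPy version and encoding, a converter may get bytes.
        if isinstance(field, bytes):
            field = field.decode("ascii")
        parsed = datetime.strptime(field.strip(), fmt)
        # Keep only the calendar date, stored as a day-resolution datetime64.
        return np.datetime64(parsed.date())
    return convert

# The same string means two different days depending on the convention.
ambiguous = "01-02-2011"
us_style = make_date_converter("%m-%d-%Y")   # month first
eu_style = make_date_converter("%d-%m-%Y")   # day first

as_us = us_style(ambiguous)   # January 2nd, 2011
as_eu = eu_style(ambiguous)   # February 1st, 2011
```

No amount of pattern matching on "01-02-2011" alone can tell which of the two is intended, which is the case for asking for an explicit format.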

> Other than possibly changing loadtable to use np.NA instead of masked
> arrays in the presence of missing data, I'm starting to feel like it's
> more or less complete for now, and can be left to be improved in the
> future. Most of the things that have been discussed are either
> performance trade-offs or somewhat large re-engineering of the
> internals.

Well, it seems that loadtable doesn't work when you use positions instead of delimiters to separate the fields (e.g., the sample below).
And what if you want to apply a specific conversion to a column, e.g., transform a string representing a hexadecimal value into an int?
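
For reference, this is roughly how the per-column conversion looks with genfromtxt today, using its converters argument. The two-column sample and the helper name hex_to_int are made up for illustration, and nothing here says how loadtable should expose the same feature.

```python
import io
import numpy as np

def hex_to_int(field):
    """Interpret a field such as 'ff' as a base-16 integer."""
    # int() with an explicit base accepts both str and bytes input.
    return int(field, 16)

# Hypothetical sample: a label column and a hexadecimal column.
sample = io.StringIO("a,ff\nb,10\n")

arr = np.genfromtxt(
    sample,
    delimiter=",",
    dtype=None,                    # let genfromtxt guess the other column
    encoding="utf-8",              # so that fields are handled as str
    converters={1: hex_to_int},    # column index -> user-supplied callable
)

# With dtype=None the columns get the default names f0, f1.
hex_values = arr["f1"]
```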

Apart from that, I do appreciate the effort you're putting into improving genfromtxt. It's needed, direly. Sorry that I can't find the time to really work on that (I do need to sleep sometimes)… But chats with Pauli V. and Ralf G., among others, during EuroSciPy led me to think a basic reorganization of npyio is quite advisable.

#C00:07 : YYYYMMDD (8)
#C08:15 : HH:mm:SS (8)
#C16:18 : XXX (3)
#C19:25 : float (7)
#C26:32 : float (7)
#C33:39 : float (7)
# np.genfromtxt('test.txt', delimiter=(8,8,3,7,7,7), usemask=True, dtype=None)
2011010112:34:56AAA001.234005.678010.123999.999
2011010112:34:57BBB001.234005.678010.123999.999
2011010112:34:58CCC001.234005.678010.123999.999
2011010112:34:59   001.234005.678010.123999.999
2011010112:35:00DDD         5.678010.123
2011010112:35:01EEE001.234005.678010.123999.999
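
For anyone who wants to see what "positions instead of delimiters" amounts to without going through genfromtxt, here is a minimal pure-Python sketch that slices a line by the widths listed in the header above. The function names are hypothetical, empty fields are mapped to None as a stand-in for missing values, and anything past the last declared width is ignored.

```python
from itertools import accumulate

# Field widths taken from the column description above.
WIDTHS = (8, 8, 3, 7, 7, 7)

def split_fixed_width(line, widths=WIDTHS):
    """Cut a line at cumulative character positions and strip each field."""
    ends = list(accumulate(widths))
    starts = [0] + ends[:-1]
    return [line[s:e].strip() for s, e in zip(starts, ends)]

def to_float_or_none(field):
    """Convert a non-empty field to float; treat an empty field as missing."""
    return float(field) if field else None

def parse_line(line):
    """Return (date, time, tag, float, float, float) for one fixed-width line."""
    date, time, tag, *floats = split_fixed_width(line)
    return (date, time, tag or None, *[to_float_or_none(f) for f in floats])

# First sample line: the trailing 999.999 lies beyond the declared widths.
full = parse_line("2011010112:34:56AAA001.234005.678010.123999.999")

# A line whose first float field is blank, built piece by piece for clarity.
partial = parse_line("20110101" + "12:35:00" + "DDD" + " " * 7 + "  5.678" + "010.123")
```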