On Sep 13, 2011, at 01:38 , Christopher Jordan-Squire wrote:
I did some timings to see what the advantage would be, in the simplest case possible, of taking multiple lines from the file to process at a time, assuming the dtype is already known. The code is attached. What I found is that I can't use generators to avoid constructing a list and then making a tuple from the list.
Still, I think there should be a way to use generators to create the final array (once your dtype is known and assuming you can skip invalid lines)...
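One way this could look (a sketch, not the actual loadtable internals; note that `np.fromiter` only accepts structured dtypes in NumPy 1.23 and later, which postdates this thread):

```python
import numpy as np
from io import StringIO

# stream parsed rows through a generator and let np.fromiter build the
# structured array directly, skipping lines that fail to parse
lines = StringIO("1,2.5\nbad line\n3,4.5\n")
dtype = np.dtype([('a', int), ('b', float)])

def rows(fh):
    for line in fh:
        parts = line.strip().split(',')
        try:
            yield (int(parts[0]), float(parts[1]))
        except (ValueError, IndexError):
            continue  # skip invalid lines

arr = np.fromiter(rows(lines), dtype=dtype)
# arr['a'] -> [1, 3]
```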
The catch is that genfromtxt was loading datetimes as strings, while loadtable was loading them as numpy datetimes. The conversion from string to datetime is somewhat expensive, so I think that accounts for some of the extra time. The range of timings (between 1.5 and 3.5 times as slow) reflects how many lines are used to check for sizes and dtypes. As it turns out, checking for those can be quite expensive, and the majority of the time seems to be spent in the regular-expression matching. (Though Chris is using a slight variant of my pull request, and I'm getting function times that are not as bad as his.) The cost of the size and type checking was less apparent in the example I gave timings for in a previous email, because in that case there was a huge cost for converting data with commas instead of decimal points, and for the datetime conversion.
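A rough way to see the string-to-datetime overhead for yourself (an illustrative micro-benchmark, not the timings from this thread):

```python
import timeit

# compare building a plain string array vs converting the same strings
# to datetime64 during array construction
setup = "import numpy as np; rows = ['2011-01-01'] * 100000"
t_str = timeit.timeit("np.array(rows)", setup=setup, number=5)
t_dt = timeit.timeit("np.array(rows, dtype='datetime64[D]')", setup=setup, number=5)
```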
The problem with parsing dates with re is that depending on your separator, on your local conventions (e.g., MM-DD-YYYY vs DD/MM/YYYY), and on the string itself, you'll get very different results, not always the ones you want. Hence, I preferred to leave dates out of the basic converters and instead ask the user to supply her own. If you can provide functionality in loadtable to that effect, that'd be great.
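With genfromtxt this is done through the `converters` argument, so the MM-DD vs DD-MM ambiguity is resolved explicitly by the user. A minimal sketch (the column layout and format string are made up for illustration; `encoding` requires NumPy 1.14+):

```python
import numpy as np
from datetime import datetime
from io import StringIO

# the user states the convention explicitly: here, US-style MM-DD-YYYY
data = StringIO("01-13-2011,1.5\n02-14-2011,2.5\n")
us_date = lambda s: datetime.strptime(s, "%m-%d-%Y")
arr = np.genfromtxt(data, delimiter=',', dtype=None,
                    encoding='utf-8', converters={0: us_date})
```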
Other than possibly changing loadtable to use np.NA instead of masked arrays in the presence of missing data, I'm starting to feel like it's more or less complete for now, and can be left to be improved in the future. Most of the things that have been discussed are either performance trade-offs or somewhat large re-engineering of the internals.
Well, it seems that loadtable doesn't work when you use positions instead of delimiters to separate the fields (e.g. the fixed-width example below). And what if you want to apply some specific conversion to a column? E.g., transform a string representing a hexadecimal value into an int?
Apart from that, I do appreciate the efforts you're putting into improving genfromtxt. It's needed, direly. Sorry that I can't find the time to really work on that (I do need to sleep sometimes)… But chats with Pauli V. and Ralf G., among others, during EuroScipy led me to think that a basic reorganization of npyio is quite advisable.
#C00:07 : YYYYMMDD (8)
#C08:15 : HH:mm:SS (8)
#C16:18 : XXX      (3)
#C19:25 : float    (7)
#C26:32 : float    (7)
#C33:39 : float    (7)
# np.genfromtxt('test.txt', delimiter=(8,8,3,7,7,7), usemask=True, dtype=None)
2011010112:34:56AAA001.234005.678010.123999.999
2011010112:34:57BBB001.234005.678010.123999.999
2011010112:34:58CCC001.234005.678010.123999.999
2011010112:34:59   001.234005.678010.123999.999
2011010112:35:00DDD         5.678010.123
2011010112:35:01EEE001.234005.678010.123999.999
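The commented genfromtxt call above can be run directly; a minimal runnable sketch using only the clean rows (so `usemask` is omitted; `encoding` assumes NumPy 1.14+):

```python
import numpy as np
from io import StringIO

# fixed-width parsing: each entry in delimiter is a field width matching
# the column layout in the comments; trailing characters are ignored
txt = StringIO(
    "2011010112:34:56AAA001.234005.678010.123999.999\n"
    "2011010112:34:57BBB001.234005.678010.123999.999\n"
)
arr = np.genfromtxt(txt, delimiter=(8, 8, 3, 7, 7, 7),
                    dtype=None, encoding='utf-8')
# fields: f0 = date as int, f1 = time string, f2 = code, f3..f5 = floats
```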