I did some timings to see what the advantage would be, in the simplest case possible, of taking multiple lines from the file to process at a time, assuming the dtype is already known. The code is attached. What I found was that I can't use generators to avoid constructing a list and then making a tuple from the list. It appears that the user must create a tuple to place in a numpy record array. (Specifically, if you remove the 'tuple' command from f2 in the attached code then you get an error.) Taking multiple lines at a time (using f4) does provide a speed benefit, but it's not very big since Python's re module won't let you capture more than 100 values, and I'm using capturing to extract the values. (This is done because we're allowing the user to use regular expressions to denote delimiters.)
In the example it's a bunch of space-delimited integers. f1 splits on the space and uses a list comprehension, f2 splits on the space and uses a generator, f3 uses regular expressions in a manner similar to the current code, f4 uses regular expressions on multiple lines at once, and f5 uses fromiter. (Though fromiter isn't as useful as I'd hoped because you have to have already parsed out a line, since this is a record array.) f6 and f7 use stripped down versions of Chris Barker's accumulator idea. The difference is that f6 uses resize when expanding the array while f7 uses np.empty followed by np.append. The f7 approach avoids the penalty from copying data that np.resize imposes. Note that f6 and f7 use the regular expression capturing line by line as in f3. To get a feel for the overhead involved with keeping track of string sizes, f8 is just f3 except with a list for the largest string sizes seen so far.
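Since the attachment isn't reproduced here, the following is only a minimal sketch of the split-based and regex-based approaches described above, not the attached code itself. The sample data, the dtype, and the names parse_split, parse_regex and ROW_RE are all made up for illustration.

```python
import re
import numpy as np

# Hypothetical stand-in for the test data: space-delimited integers.
lines = ["1 2 3", "4 5 6", "7 8 9"]
dt = np.dtype([("a", np.int64), ("b", np.int64), ("c", np.int64)])

def parse_split(lines, dt):
    """Split on spaces and fill one record per line (in the spirit of f1/f2)."""
    out = np.empty(len(lines), dtype=dt)
    for i, line in enumerate(lines):
        # The generator is wrapped in tuple() before assignment,
        # since a record is filled from a tuple of field values.
        out[i] = tuple(int(tok) for tok in line.split(" "))
    return out

# One capture group per field, separated by a delimiter pattern
# (in the spirit of f3, where the delimiter may be a regular expression).
ROW_RE = re.compile(r"(\S+) +(\S+) +(\S+)")

def parse_regex(lines, dt):
    """Extract the fields with regex capture groups, one line at a time."""
    out = np.empty(len(lines), dtype=dt)
    for i, line in enumerate(lines):
        match = ROW_RE.match(line)
        out[i] = tuple(int(g) for g in match.groups())
    return out
```

Both functions should produce the same record array for the same input; they differ only in how each line is tokenized.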
The speeds I get using timeit are:

f1     : 1.66ms
f2     : 2.01ms
f3     : 2.35ms
f4(2)  : 3.02ms (Odd that it starts out worse than f3 when you take 2 lines at a time.)
f4(5)  : 2.25ms
f4(10) : 2.02ms
f4(15) : 1.93ms
f4(20) : error
f5     : 2.28ms (As I said, fromiter can't do much when it's just filling in a record array. While it's slightly faster than f3, which it's based on, it also loads all the data as a list before creating a numpy array, which is rather suboptimal.)
f6     : 3.26ms
f7     : 2.77ms (Apparently it's a lot cheaper to do np.empty followed by append than to resize.)
f8     : 3.04ms (Compared to f3, this shows there's a non-trivial performance hit from keeping track of the sizes.)
It seems like taking multiple lines at once isn't a big gain when we're limited to 100 captured entries at a time (for Python 2.6, at least), especially since taking multiple lines at once would be rather complex: the code must still check each line to see whether it's commented out.
After talking to Chris Farrow, an Enthought developer, and doing some timing on a dataset he was working on, I had loadtable running about 1.7 to 3.3 times as fast as genfromtxt. The catch is that genfromtxt was loading datetimes as strings, while loadtable was loading them as numpy datetimes. The conversion from string to datetime is somewhat expensive, so I think that accounts for some of the extra time. The range of timings (between 1.5 and 3.5 times as slow) reflects how many lines are used to check for sizes and dtypes. As it turns out, checking for those can be quite expensive, and the majority of the time seems to be spent in the regular expression matching. (Though Chris is using a slight variant on my pull request, and I'm getting function times that are not as bad as his.) The cost of the size and type checking was less apparent in the example I gave timings for in a previous email, because in that case there was a huge cost for converting data with commas instead of decimals and for the datetime conversion.
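As a rough illustration of what the size check involves (this is not the actual loadtable code), the sketch below scans a sample of lines and records the widest string seen in each column, then builds a string dtype from those widths. The helper names max_field_widths and string_dtype and the "f0", "f1", ... field names are assumptions made for the example.

```python
import numpy as np

def max_field_widths(lines, delimiter=" ", sample=None):
    """Return the widest token seen in each column over the sampled lines.

    sample=None means every line is examined; a smaller sample is cheaper
    but can miss a longer string further down the file.
    """
    widths = []
    for line in lines[:sample]:
        for col, tok in enumerate(line.split(delimiter)):
            if col == len(widths):
                widths.append(0)  # first time this column is seen
            widths[col] = max(widths[col], len(tok))
    return widths

def string_dtype(widths):
    """Build a structured dtype of fixed-width unicode fields."""
    return np.dtype([("f%d" % i, "U%d" % w) for i, w in enumerate(widths)])
```

Every sampled line has to be tokenized and measured, which is where the extra cost comes from; checking fewer lines trades safety for speed.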
To give some further context, I compared np.genfromtxt and np.loadtable on the same 'pseudo-file' f used in the above tests, when the data is just a bunch of integers. The results were:
np.genfromtxt with dtype=None         : 4.45ms
np.loadtable with defaults            : 5.12ms
np.loadtable with check_sizes=False   : 3.7ms
So it seems that np.loadtable is already competitive with np.genfromtxt other than checking the sizes. And the checking sizes isn't even that huge a penalty compared to genfromtxt.
Based on all the above it seems like the accumulator is the most promising way that things could be sped up. But it's not completely clear to me by how much, since we still must keep track of the dtypes and the sizes.
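For anyone who hasn't followed the accumulator discussion, here is a minimal sketch of the general idea, assuming a fixed dtype and geometric over-allocation. It is not Chris Barker's implementation or f6/f7 from the attachment; the class name RecordAccumulator and its methods are hypothetical.

```python
import numpy as np

class RecordAccumulator:
    """Hypothetical growable buffer for structured rows (not a NumPy API)."""

    def __init__(self, dtype, capacity=16):
        self._data = np.empty(max(1, capacity), dtype=dtype)
        self._n = 0  # number of rows actually filled

    def append(self, row):
        if self._n == len(self._data):
            # Double the buffer so the copying cost is amortized over appends.
            grown = np.empty(2 * len(self._data), dtype=self._data.dtype)
            grown[: self._n] = self._data
            self._data = grown
        self._data[self._n] = tuple(row)
        self._n += 1

    def to_array(self):
        # Copy the filled part so the oversized buffer can be released.
        return self._data[: self._n].copy()

# Example use with an illustrative two-field dtype.
dt = np.dtype([("a", np.int64), ("b", np.float64)])
acc = RecordAccumulator(dt, capacity=4)
for i in range(100):
    acc.append((i, i / 2.0))
result = acc.to_array()
```

This version only handles growth in the number of rows. Tracking dtypes and string sizes as lines arrive would need extra bookkeeping on top, which is the part whose cost is still unclear.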
Other than possibly changing loadtable to use np.NA instead of masked arrays in the presence of missing data, I'm starting to feel like it's more or less complete for now, and can be left to be improved in the future. Most of the things that have been discussed are either performance trade-offs or somewhat large re-engineering of the internals.
On Thu, Sep 8, 2011 at 3:57 PM, Chris.Barker <Chris.Barker@noaa.gov> wrote:
On 9/8/11 1:43 PM, Christopher Jordan-Squire wrote:
I just ran a quick test on my machine of this idea. With
    dt = np.dtype([('x', np.float32), ('y', np.int32), ('z', np.float64)])
    temp = np.empty((), dtype=dt)
    temp2 = np.zeros(1, dtype=dt)

    def f():
        l = [0]*3
        l[0] = 2.54
        l[1] = 4
        l[2] = 2.3645
        j = tuple(l)
        temp2[0] = j

    def g():
        temp['x'] = 2.54
        temp['y'] = 4
        temp['z'] = 2.3645
        temp2[0] = temp
The timing results were 2.73 us for f and 3.43 us for g. So good idea, but it doesn't appear to be faster. (Though the difference wasn't nearly as dramatic as I thought it would be, based on Pauli's comment.)
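A self-contained version of this comparison might look like the following. The bracket indexing is reconstructed from context, and the time_both helper with its default repeat count is an assumption added for convenience; absolute timings will differ by machine and NumPy version.

```python
import timeit
import numpy as np

dt = np.dtype([("x", np.float32), ("y", np.int32), ("z", np.float64)])
temp = np.empty((), dtype=dt)   # 0-d structured scratch value
temp2 = np.zeros(1, dtype=dt)   # one-row destination array

def f():
    """Build a list, convert it to a tuple, and assign the tuple."""
    row = [0] * 3
    row[0] = 2.54
    row[1] = 4
    row[2] = 2.3645
    temp2[0] = tuple(row)

def g():
    """Fill a 0-d structured array field by field, then assign it."""
    temp["x"] = 2.54
    temp["y"] = 4
    temp["z"] = 2.3645
    temp2[0] = temp

def time_both(number=100000):
    """Return the mean time per call, in microseconds, for f and g."""
    return {
        fn.__name__: timeit.timeit(fn, number=number) / number * 1e6
        for fn in (f, g)
    }
```

Calling time_both() returns a dict keyed by function name, which can be compared against the 2.73 us and 3.43 us figures quoted above.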
My guess is that lines like:

    temp['x'] = 2.54

are slower (each requires a dict lookup and a conversion from a Python type to a "raw" type), while

    temp2[0] = temp

is faster, as that doesn't require any conversion.

Which means that if you have a larger struct dtype, it would be even slower, so clearly not the way to go for performance.
It would be nice to have a higher performing struct dtype scalar. As it is ordered, it might be nice to be able to index it with either the name or a numeric index.
-- Christopher Barker, Ph.D. Oceanographer
Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception
Chris.Barker@noaa.gov