load from text files Pull Request Review
Hi -- I've submitted a pull request for a new method for loading data from text files into a record array/masked record array. https://github.com/numpy/numpy/pull/143 Click on the link for more info, but the general idea is to create a regular expression for what entries should look like and loop over the file, updating the regular expression if it's wrong. Once the types are determined the file is loaded line by line into a pre-allocated numpy array. Compared to genfromtxt this function has several advantages/potential advantages:

* More modular (genfromtxt is a rather large, nearly 500 line, monolithic function. In my pull request no individual method is longer than around 80 lines, and they're fairly self-contained.)
* Delimiters can be specified via regexes
* Missing data can be specified via regexes
* It's a bit simpler and has sensible defaults
* It actually works on some (unfortunately proprietary) data that genfromtxt doesn't seem robust enough for
* It supports datetimes
* Fairly extensible for the power user
* Makes two passes through the file, the first to determine types/sizes for strings and the second to read in the data, and pre-allocates the array for the second pass. So no giant memory bloating for reading large text files
* Fairly fast, though I think there is plenty of room for optimizations

All that said, it's entirely possible that the innards which determine the type should be ripped out and submitted as a function on their own. I'd love suggestions for improvements, as well as suggestions for a better name. (Currently it's called loadtable, which I don't really like. It was just a working name.) -Chris Jordan-Squire
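For readers who want a concrete picture of the two-pass scheme, here is a minimal sketch of the general idea (this is not the code from the pull request; the helper names, the comma delimiter, and the restriction to int/float/string columns are simplifying assumptions, and it ignores headers, quoting and missing data):

import re
import numpy as np

# Hypothetical sketch of the two-pass scheme described above -- not the
# code from the pull request.  Assumes comma-delimited text, no header,
# no quoting, no missing values.
_INT_RE = re.compile(r'[+-]?\d+$')
_FLOAT_RE = re.compile(r'[+-]?(\d+\.?\d*|\.\d+)([eE][+-]?\d+)?$')

def guess_dtype(path):
    """Pass 1: widen each column's kind as lines are scanned."""
    kinds = None
    with open(path) as f:
        for line in f:
            tokens = line.strip().split(',')
            if kinds is None:
                kinds = ['i'] * len(tokens)
            for j, tok in enumerate(tokens):
                if kinds[j] == 'i' and not _INT_RE.match(tok):
                    kinds[j] = 'f'
                if kinds[j] == 'f' and not _FLOAT_RE.match(tok):
                    kinds[j] = 'S'
    codes = {'i': 'i8', 'f': 'f8', 'S': 'S64'}
    return np.dtype([('f%d' % j, codes[k]) for j, k in enumerate(kinds)])

def load(path):
    """Pass 2: fill a pre-allocated structured array row by row."""
    dt = guess_dtype(path)
    with open(path) as f:
        nrows = sum(1 for _ in f)
    convs = [int if dt[j].kind == 'i' else
             float if dt[j].kind == 'f' else str
             for j in range(len(dt.names))]
    out = np.empty(nrows, dtype=dt)
    with open(path) as f:
        for i, line in enumerate(f):
            out[i] = tuple(c(tok) for c, tok in
                           zip(convs, line.strip().split(',')))
    return out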
On 8/27/11 11:08 AM, Christopher Jordan-Squire wrote:
I've submitted a pull request for a new method for loading data from text files into a record array/masked record array.
Click on the link for more info, but the general idea is to create a regular expression for what entries should look like and loop over the file, updating the regular expression if it's wrong. Once the types are determined the file is loaded line by line into a pre-allocated numpy array.
nice stuff. Have you looked at my "accumulator" class, rather than pre-allocating? Less the class itself than the ideas behind it. It's easy enough to do, and would keep you from having to run through the file twice. The cost of memory re-allocation as the array grows is very small. I've posted the code recently, but let me know if you want it again. -Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker@noaa.gov
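(For context, a minimal sketch of the accumulator concept -- an array that over-allocates and grows via ndarray.resize, so values can be appended in a single pass. This is a guess at the idea for illustration, not Chris Barker's actual class.)

import numpy as np

class Accumulator(object):
    """Minimal sketch of the 'accumulator' idea: a growable 1-D array."""

    def __init__(self, dtype=float, capacity=128):
        self._buf = np.empty(capacity, dtype=dtype)
        self._n = 0

    def append(self, value):
        if self._n == self._buf.shape[0]:
            # grow geometrically; ndarray.resize may avoid a copy if the
            # allocation can be extended in place
            self._buf.resize(2 * self._buf.shape[0], refcheck=False)
        self._buf[self._n] = value
        self._n += 1

    def to_array(self):
        # return only the filled portion
        return self._buf[:self._n].copy()

# usage:
#     acc = Accumulator(dtype='f8')
#     for line in open('data.txt'):
#         acc.append(float(line))
#     arr = acc.to_array()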
On 30.08.2011, at 6:21PM, Chris.Barker wrote:
I've submitted a pull request for a new method for loading data from text files into a record array/masked record array.
Click on the link for more info, but the general idea is to create a regular expression for what entries should look like and loop over the file, updating the regular expression if it's wrong. Once the types are determined the file is loaded line by line into a pre-allocated numpy array.
nice stuff.
Have you looked at my "accumulator" class, rather than pre-allocating? Less the class itself than the ideas behind it. It's easy enough to do, and would keep you from having to run through the file twice. The cost of memory re-allocation as the array grows is very small.
I've posted the code recently, but let me know if you want it again.
I agree it would make a very nice addition, and could complement my pre-allocation option for loadtxt - however there I've also been made aware that this approach breaks streamed input etc., so the buffer.resize(…) methods in accumulator would be the better way to go. For loadtable this is not quite as straightforward, though, because the type auto-detection, strictly done, requires scanning the entire input, because a column full of ints could still produce a float in the last row… I'd say one just has to accept that this kind of auto-detection is incompatible with input streams, and with the necessity to scan the entire data first anyway, pre-allocating the array makes sense as well.

For better consistency with what people have likely got used to from npyio, I'd recommend some minor changes:

* make spaces the default delimiter
* enable automatic decompression (given the modularity, could you simply use np.lib._datasource.open() like genfromtxt?)

Cheers, Derek -- ---------------------------------------------------------------- Derek Homeier Centre de Recherche Astrophysique de Lyon ENS Lyon 46, Allée d'Italie 69364 Lyon Cedex 07, France +33 1133 47272-8894 ----------------------------------------------------------------
On 9/2/11 8:22 AM, Derek Homeier wrote:
I agree it would make a very nice addition, and could complement my pre-allocation option for loadtxt - however there I've also been made aware that this approach breaks streamed input etc., so the buffer.resize(…) methods in accumulator would be the better way to go.
Good point, that would be nice.
For loadtable this is not quite as straightforward, though, because the type auto-detection, strictly done, requires scanning the entire input, because a column full of ints could still produce a float in the last row…
hmmm -- it seems you could just as well be building the array as you go, and if you hit a change in the input, re-set and start again. In my tests, I'm pretty sure that the time spent on file I/O and string parsing swamps the time it takes to allocate memory and set the values. So there is little cost, and for the common use case, it would be faster and cleaner. There is a chance, of course, that you might have to re-wind and start over more than once, but I suspect that that is the rare case.
For better consistency with what people have likely got used to from npyio, I'd recommend some minor changes:
make spaces the default delimiter
+1
enable automatic decompression (given the modularity, could you simply use np.lib._datasource.open() like genfromtxt?)
I _think_ this would benefit from a one-pass solution as well -- so you don't need to de-compress twice. -Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker@noaa.gov
Sorry I'm only now getting around to thinking more about this. Been side-tracked by stats stuff. On Fri, Sep 2, 2011 at 10:50 AM, Chris.Barker <Chris.Barker@noaa.gov> wrote:
On 9/2/11 8:22 AM, Derek Homeier wrote:
I agree it would make a very nice addition, and could complement my pre-allocation option for loadtxt - however there I've also been made aware that this approach breaks streamed input etc., so the buffer.resize(…) methods in accumulator would be the better way to go.
I'll read more about this soon. I haven't thought about it, and I didn't realize it was breaking anything.
Good point, that would be nice.
For loadtable this is not quite as straightforward, though, because the type auto-detection, strictly done, requires scanning the entire input, because a column full of ints could still produce a float in the last row…
hmmm -- it seems you could just as well be building the array as you go, and if you hit a change in the input, re-set and start again.
I hadn't thought of that. Interesting idea. I'm surprised that completely resetting the array could be faster.
In my tests, I'm pretty sure that the time spent on file I/O and string parsing swamps the time it takes to allocate memory and set the values.
In my tests, at least for a medium sized csv file (about 3000 rows by 30 columns), about 10% of the time was spent determining the types in the first read through and 90% of the time was spent sticking the data in the array. However, that particular test took more time for reading in because the data was quoted (so converting '"3,25"' to a float took between 1.5x and 2x as long as '3.25') and the datetime conversion is costly. Regardless, that suggests making the data loading faster is more important than avoiding reading through the file twice. I guess that intuition probably breaks if the data doesn't fit into memory, though. But I haven't worked with extremely large data files before, so I'd appreciate refutation/confirmation of my priors.
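(For readers wondering what that conversion involves, it is roughly something like the following -- an illustrative guess, not the converter actually used in the pull request:)

def comma_float(tok):
    # strip surrounding quotes and treat ',' as the decimal separator,
    # so '"3,25"' -> 3.25; plain '3.25' still works
    return float(tok.strip().strip('"').replace(',', '.'))

assert comma_float('"3,25"') == 3.25
assert comma_float('3.25') == 3.25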
So there is little cost, and for the common use case, it would be faster and cleaner.
There is a chance, of course, that you might have to re-wind and start over more than once, but I suspect that that is the rare case.
Perhaps. I know that in the 'really annoying dataset that loading quickly and easily should be your use case' example I was given, about half-way through the data one of the columns got its first observation. (It was time series data where one of the columns didn't start being observed until 1/2 through the observation period.) So I'm not sure it'd be as rare as we'd like.
For better consistency with what people have likely got used to from npyio, I'd recommend some minor changes:
make spaces the default delimiter
+1
Sure.
enable automatic decompression (given the modularity, could you simply use np.lib._datasource.open() like genfromtxt?)
I _think_ this would benefit from a one-pass solution as well -- so you don't need to de-compress twice.
-Chris
-- Christopher Barker, Ph.D. Oceanographer
Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception
Chris.Barker@noaa.gov
On 02.09.2011, at 5:50PM, Chris.Barker wrote:
hmmm -- it seems you could just as well be building the array as you go, and if you hit a change in the input, re-set and start again.
In my tests, I'm pretty sure that the time spent on file I/O and string parsing swamps the time it takes to allocate memory and set the values.
So there is little cost, and for the common use case, it would be faster and cleaner.
There is a chance, of course, that you might have to re-wind and start over more than once, but I suspect that that is the rare case.
I still haven't studied your class in detail, but one could probably actually just create a copy of the array read in so far, e.g. changing it from a dtype=[('f0', '<i8'), ('f1', '<f8')] to dtype=[('f0', '<f8'), ('f1', '<f8')] as required - or even first implement it as a list or dict of arrays, that could be individually changed and only create a record array from that at the end. The required copying and extra memory use would definitely pale compared to the text parsing or the current memory usage for the input list. In my loadtxt version [https://github.com/numpy/numpy/pull/144] just parsing the text for comment lines adds ca. 10% time, while any of the array allocation and copying operations should at most be at the 1% level.
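(A small illustration of the copy-to-a-widened-dtype idea; it assumes the field names match, in which case astype on a structured array converts field by field:)

import numpy as np

# rows read so far, with column 'f0' guessed as int
a = np.array([(1, 2.0), (3, 4.0)], dtype=[('f0', '<i8'), ('f1', '<f8')])

# a float turned up in 'f0': copy to a widened dtype and keep going
widened = a.astype([('f0', '<f8'), ('f1', '<f8')])
widened['f0'][1] = 3.5   # e.g. re-store the offending value as a float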
enable automatic decompression (given the modularity, could you simply use np.lib._datasource.open() like genfromtxt?)
I _think_ this would benefit from a one-pass solution as well -- so you don't need to de-compress twice.
Absolutely; on compressed data the time for the extra pass jumps up to +30-50%. Cheers, Derek -- ---------------------------------------------------------------- Derek Homeier Centre de Recherche Astrophysique de Lyon ENS Lyon 46, Allée d'Italie 69364 Lyon Cedex 07, France +33 1133 47272-8894 ----------------------------------------------------------------
On 02.09.2011, at 6:16PM, Christopher Jordan-Squire wrote:
I hadn't thought of that. Interesting idea. I'm surprised that completely resetting the array could be faster.
I had experimented a bit with the fromiter function, which also increases the output array as needed, and this creates negligible overhead compared to parsing the text input (it is implemented in C, though, I don't know how the .resize() calls would compare to that; and unfortunately it's for 1D-arrays only).
In my tests, I'm pretty sure that the time spent on file I/O and string parsing swamps the time it takes to allocate memory and set the values.
In my tests, at least for a medium sized csv file (about 3000 rows by 30 columns), about 10% of the time was spent determining the types in the first read through and 90% of the time was spent sticking the data in the array.
This would be consistent with my experience (basically testing for comment characters and the length of line.split(delimiter) in the first pass).
However, that particular test took more time for reading in because the data was quoted (so converting '"3,25"' to a float took between 1.5x and 2x as long as '3.25') and the datetime conversion is costly.
Regardless, that suggests making the data loading faster is more important than avoiding reading through the file twice. I guess that intuition probably breaks if the data doesn't fit into memory, though. But I haven't worked with extremely large data files before, so I'd appreciate refutation/confirmation of my priors.
The lion's share of the data loading time, in my experience, is still the string operations (like the comma conversion you quote above), so I'd always expect any subsequent manipulations of the numpy array data to be very fast compared to that. Maybe this changes slightly with more complex data types like string records or datetime instances, but as you indicate, even for those the conversion seems to dominate the cost. Cheers, Derek -- ---------------------------------------------------------------- Derek Homeier Centre de Recherche Astrophysique de Lyon ENS Lyon 46, Allée d'Italie 69364 Lyon Cedex 07, France +33 1133 47272-8894 ----------------------------------------------------------------
On 9/2/11 9:16 AM, Christopher Jordan-Squire wrote:
I agree it would make a very nice addition, and could complement my pre-allocation option for loadtxt - however there I've also been made aware that this approach breaks streamed input etc., so the buffer.resize(…) methods in accumulator would be the better way to go.
I'll read more about this soon. I haven't thought about it, and I didn't realize it was breaking anything.
you could call it a missing feature, rather than breaking...
hmmm -- it seems you could just as well be building the array as you go, and if you hit a change in the input, re-set and start again.
I hadn't thought of that. Interesting idea. I'm surprised that completely resetting the array could be faster.
releasing memory and re-allocating doesn't take long at all.
In my tests, I'm pretty sure that the time spent on file I/O and string parsing swamps the time it takes to allocate memory and set the values.
In my tests, at least for a medium sized csv file (about 3000 rows by 30 columns), about 10% of the time was spent determining the types in the first read through and 90% of the time was spent sticking the data in the array.
I don't know how that can even be possible: Don't you have to load and parse the entire file to determine the data types? Once you've allocated, then all you are doing is setting a value in the array -- that has got to be fast. Also, the second time around, you may be taking advantage of disk cache, so that should be faster for that reason. Even so -- you may be able to save much of that 10%.
However, that particular test took more time for reading in because the data was quoted (so converting '"3,25"' to a float took between 1.5x and 2x as long as '3.25') and the datetime conversion is costly.
Didn't you have to do all that on the first pass as well? Or are you only checking for gross info -- length of rows, etc?
Regardless, that suggests making the data loading faster is more important than avoiding reading through the file twice. I guess that intuition probably breaks if the data doesn't fit into memory, though.
if the data don't fit into memory, then you need to go to memmapped arrays or something -- a whole new ball of wax.
There is a chance, of course, that you might have to re-wind and start over more than once, but I suspect that that is the rare case.
Perhaps. I know that in the 'really annoying dataset that loading quickly and easily should be your use case' example I was given, about half-way through the data one of the columns got its first observation.
OK -- but isn't that just one re-wind? On 9/2/11 9:17 AM, Derek Homeier wrote:
There is a chance, of course, that you might have to re-wind and start over more than once, but I suspect that that is the rare case.
I still haven't studied your class in detail, but one could probably actually just create a copy of the array read in so far, e.g. changing it from a dtype=[('f0', '<i8'), ('f1', '<f8')] to dtype=[('f0', '<f8'), ('f1', '<f8')] as required -
good point -- that would be a better way to do it, and only a tiny bit harder.
or even first implement it as a list or dict of arrays, that could be individually changed and only create a record array from that at the end.
I think that's a method that the OP is specifically trying to avoid -- a list of arrays uses substantially more storage than an array, though less than a list of lists. If each row is long, in fact, the list overhead would be small.
The required copying and extra memory use would definitely pale compared to the text parsing or the current memory usage for the input list.
That's what I expected -- the OP's timing seems to indicate otherwise, but I'm still skeptical as to what has been timed.
In my loadtxt version [https://github.com/numpy/numpy/pull/144] just parsing the text for comment lines adds ca. 10% time, while any of the array allocation and copying operations should at most be at the 1% level.
much more what I'd expect.
I had experimented a bit with the fromiter function, which also increases the output array as needed, and this creates negligible overhead compared to parsing the text input (it is implemented in C, though, I don't know how the .resize() calls would compare to that;
it's probably using pretty much the same code as .resize() internally anyway.
and unfortunately it's for 1D-arrays only).
That's not bad for this use -- make a row a struct dtype, and you've got a 1-d array anyway -- you can optionally convert to a 2-d array after the fact. I don't know why I didn't think of using fromiter() when I built accumulator. Though what I did is a bit more flexible -- you can add stuff later on, too, you don't need to do it all at once. -Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker@noaa.gov
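(A small example of the fromiter-with-a-record-dtype idea -- hypothetical parsing of two comma-separated columns; whether this beats pre-allocation is a separate question:)

import numpy as np

dt = np.dtype([('x', '<i8'), ('y', '<f8')])
lines = ["1,2.5", "3,4.75"]          # stand-in for a file object

def rows(source):
    for line in source:
        a, b = line.split(',')
        yield int(a), float(b)       # one tuple per record

arr = np.fromiter(rows(lines), dtype=dt)
# arr is a 1-D structured array; arr['x'] -> [1, 3], arr['y'] -> [2.5, 4.75]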
On Fri, Sep 2, 2011 at 3:54 PM, Chris.Barker <Chris.Barker@noaa.gov> wrote:
On 9/2/11 9:16 AM, Christopher Jordan-Squire wrote:
I agree it would make a very nice addition, and could complement my pre-allocation option for loadtxt - however there I've also been made aware that this approach breaks streamed input etc., so the buffer.resize(…) methods in accumulator would be the better way to go.
I'll read more about this soon. I haven't thought about it, and I didn't realize it was breaking anything.
you could call it a missing feature, rather than breaking...
hmmm -- it seems you could just as well be building the array as you go, and if you hit a change in the input, re-set and start again.
I hadn't thought of that. Interesting idea. I'm surprised that completely resetting the array could be faster.
releasing memory and re-allocating doesn't take long at all.
In my tests, I'm pretty sure that the time spent on file I/O and string parsing swamps the time it takes to allocate memory and set the values.
In my tests, at least for a medium sized csv file (about 3000 rows by 30 columns), about 10% of the time was spent determining the types in the first read through and 90% of the time was spent sticking the data in the array.
I don't know how that can even be possible:
Don't you have to load and parse the entire file to determine the data types?
Once you've allocated, then all you are doing is setting a value in the array -- that has got to be fast.
It doesn't have to parse the entire file to determine the dtypes. It builds up a regular expression for what it expects to see, in terms of dtypes. Then it just loops over the lines, only parsing if the regular expression doesn't match. It seems that a regex match is fast, but a regex fail is expensive. But the regex fails should be fairly rare, and are generally simple to catch. It was more expensive to keep track of the sizes for each line, as the doc string for loadtable describes. I couldn't find a good solution to cover all cases, so there's a combination of options to allow the user to pick the best case for them. Setting array elements is not as fast for the masked record arrays. You must set entire rows at a time, so I have to build up each row as a list, and convert to a tuple, and then stuff it in the array. And it's even slower for the record arrays with missing data because I must branch between adding missing data versus adding real data. Might that be the reason for the slower performance than you'd expect?
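(A rough sketch of the match-or-re-detect loop being described; update_pattern is a made-up placeholder for loadtable's type-widening logic, not its real API:)

import re

# current guess: two integer columns, comma-delimited
row_re = re.compile(r'\s*([+-]?\d+)\s*,\s*([+-]?\d+)\s*$')

def update_pattern(line):
    # placeholder for the real logic: widen the column types until the
    # offending line matches, and return the new compiled pattern
    return re.compile(r'\s*([^,]+?)\s*,\s*([^,]+?)\s*$')

rows = []
for line in ["1,2", "3,4", "5,6.5"]:       # stand-in for the file
    m = row_re.match(line)
    if m is None:                          # the expensive, rarer path
        row_re = update_pattern(line)
        m = row_re.match(line)
    rows.append(m.groups())                # the cheap path: just capture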
Also, the second time around, you may be taking advantage of disk cache, so that should be faster for that reason.
Even so -- you may be able to save much of that 10%.
I don't understand your meaning.
However, that particular test took more time for reading in because the data was quoted (so converting '"3,25"' to a float took between 1.5x and 2x as long as '3.25') and the datetime conversion is costly.
Didn't you have to do all that on the first pass as well? Or are you only checking for gross info -- length of rows, etc?
Regardless, that suggests making the data loading faster is more important than avoiding reading through the file twice. I guess that intuition probably breaks if the data doesn't fit into memory, though.
if the data don't fit into memory, then you need to go to memmapped arrays or something -- a whole new ball of wax.
There is a chance, of course, that you might have to re-wind and start over more than once, but I suspect that that is the rare case.
Perhaps. I know that in the 'really annoying dataset that loading quickly and easily should be your use case' example I was given, about half-way through the data one of the columns got its first observation.
OK -- but isn't that just one re-wind?
Sure. If it only happens for one column. But suppose your data is a bunch of time series, one per column, where they each start at different dates. You'd have a restart for each column. But I guess that point is pedantic since, regardless, the number of columns should be many fewer than the number of rows. I wonder if there are any really important cases where you'd actually lose something by simply recasting an entry to another dtype, as Derek suggested. That would avoid having to go back to the start simply by doing an in-place conversion of the data.
On 9/2/11 9:17 AM, Derek Homeier wrote:
There is a chance, of course, that you might have to re-wind and start over more than once, but I suspect that that is the rare case.
I still haven't studied your class in detail, but one could probably actually just create a copy of the array read in so far, e.g. changing it from a dtype=[('f0', '<i8'), ('f1', '<f8')] to dtype=[('f0', '<f8'), ('f1', '<f8')] as required -
good point -- that would be a better way to do it, and only a tiny bit harder.
or even first implement it as a list or dict of arrays, that could be individually changed and only create a record array from that at the end.
I think that's a method that the OP is specifically trying to avoid -- a list of arrays uses substantially more storage than an array, though less than a list of lists. If each row is long, in fact, the list overhead would be small.
The required copying and extra memory use would definitely pale compared to the text parsing or the current memory usage for the input list.
That's what I expected -- the OP's timing seems to indicate otherwise, but I'm still skeptical as to what has been timed.
So here's some of the important output from prun (in ipython) on the following call:

np.loadtable('biggun.csv', quoted=True, comma_decimals=True, NA_re=r'#N/A N/A|', date_re='\d{1,2}/\d{1,2}/\d{2}', date_strp='%m/%d/%y', header=True)

where biggun.csv was the 2857 by 25 csv file with datetimes and quoted data I'd mentioned earlier. (It's proprietary data, so I can't share the csv file itself.)

----------------------------------------------------------------------------------------------------------
377584 function calls (377532 primitive calls) in 2.242 CPU seconds

Ordered by: internal time

ncalls   tottime  percall  cumtime  percall  filename:lineno(function)
     1     0.540    0.540    1.866    1.866  loadtable.py:859(get_data_missing)
 91615     0.460    0.000    0.460    0.000  {built-in method match}
 58835     0.286    0.000    0.405    0.000  loadtable.py:301(comma_float_conv)
126267     0.270    0.000    0.270    0.000  {method 'replace' of 'str' objects}
  2857     0.125    0.000    0.151    0.000  core.py:2975(__setitem__)
  2857     0.113    0.000    0.295    0.000  _strptime.py:295(_strptime)
  2862     0.071    0.000    0.071    0.000  {numpy.core.multiarray.array}
  2857     0.053    0.000    0.200    0.000  loadtable.py:1304(update_sizes)
  2857     0.039    0.000    0.066    0.000  locale.py:316(normalize)
     1     0.028    0.028    0.373    0.373  loadtable.py:1165(get_nrows_sizes_coltypes)
  2857     0.024    0.000    0.319    0.000  {built-in method strptime}
  5796     0.021    0.000    0.021    0.000  {built-in method groups}
  2857     0.018    0.000    0.342    0.000  loadtable.py:784(<lambda>)
  2857     0.017    0.000    0.102    0.000  locale.py:481(getlocale)
  8637     0.016    0.000    0.016    0.000  {method 'get' of 'dict' objects}
  2857     0.016    0.000    0.016    0.000  {map}
  8631     0.015    0.000    0.015    0.000  {len}
---------------------------------------------------------------------------------------------------------------------

It goes on, but those seem to be the important calls. So I wasn't quite right on the 90-10 split, but 99.9% of the time is in two methods: getting the data (get_data_missing) and determining the sizes, dtypes (get_nrows_sizes_coltypes). Between those two the split is 17-83.
In my loadtxt version [https://github.com/numpy/numpy/pull/144] just parsing the text for comment lines adds ca. 10% time, while any of the array allocation and copying operations should at most be at the 1% level.
much more what I'd expect.
I had experimented a bit with the fromiter function, which also increases the output array as needed, and this creates negligible overhead compared to parsing the text input (it is implemented in C, though, I don't know how the .resize() calls would compare to that;
it's probably using pretty much the same code as .resize() internally anyway.
and unfortunately it's for 1D-arrays only).
That's not bad for this use -- make a row a struct dtype, and you've got a 1-d array anyway -- you can optionally convert to a 2-d array after the fact.
I don't know why I didn't think of using fromiter() when I built accumulator. Though what I did is a bit more flexible -- you can add stuff later on, too, you don't need to do it all at once.
I'm unsure how to use fromiter for missing data. It sounds like a potential solution when no data is missing, though. -Chris
-Chris
-- Christopher Barker, Ph.D. Oceanographer
Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception
Chris.Barker@noaa.gov
On 02.09.2011, at 11:45PM, Christopher Jordan-Squire wrote:
and unfortunately it's for 1D-arrays only).
That's not bad for this use -- make a row a struct dtype, and you've got a 1-d array anyway -- you can optionally convert to a 2-d array after the fact.
I don't know why I didn't think of using fromiter() when I built accumulator. Though what I did is a bit more flexible -- you can add stuff later on, too, you don't need to do it all at once.
I'm unsure how to use fromiter for missing data. It sounds like a potential solution when no data is missing, though.
Strange I haven't thought about it before either; I guess for record arrays it comes more natural to view them as a collection of 1D arrays. However, you'd need to construct a list or something of ncolumn iterators from the input - should not be too hard; but then how do you feed the ncolumn fromiter() instances synchronously from that?? As far as I can see there is no way to make them read one item at a time, row by row. Then there are additional complications with multi-D dtypes, and in your case, especially datetime instances, but the problem that all columns have to be read in in parallel really seems to be the showstopper here. Of course for "flat" 2D arrays of data (all the same dtype) this would work with simply reshaping the array - that's probably even the most common use case for loadtxt, but that method lacks way too much generality for my taste. Back to accumulator, I suppose. Cheers, Derek -- ---------------------------------------------------------------- Derek Homeier Centre de Recherche Astrophysique de Lyon ENS Lyon 46, Allée d'Italie 69364 Lyon Cedex 07, France +33 1133 47272-8894 ----------------------------------------------------------------
On Tue, Sep 6, 2011 at 9:32 AM, Derek Homeier <derek@astro.physik.uni-goettingen.de> wrote:
On 02.09.2011, at 11:45PM, Christopher Jordan-Squire wrote:
and unfortunately it's for 1D-arrays only).
That's not bad for this use -- make a row a struct dtype, and you've got a 1-d array anyway -- you can optionally convert to a 2-d array after the fact.
I don't know why I didn't think of using fromiter() when I built accumulator. Though what I did is a bit more flexible -- you can add stuff later on, too, you don't need to do it all at once.
I'm unsure how to use fromiter for missing data. It sounds like a potential solution when no data is missing, though.
Strange I haven't thought about it before either; I guess for record arrays it comes more natural to view them as a collection of 1D arrays. However, you'd need to construct a list or something of ncolumn iterators from the input - should not be too hard; but then how do you feed the ncolumn fromiter() instances synchronously from that?? As far as I can see there is no way to make them read one item at a time, row by row. Then there are additional complications with multi-D dtypes, and in your case, especially datetime instances, but the problem that all columns have to be read in in parallel really seems to be the showstopper here. Of course for "flat" 2D arrays of data (all the same dtype) this would work with simply reshaping the array - that's probably even the most common use case for loadtxt, but that method lacks way too much generality for my taste. Back to accumulator, I suppose.
Yes, I believe the thinking was that if your data is all one dtype that's simple enough to figure out, and there are other methods for reading in such an array to produce a 2-d array. This is strictly for structured arrays currently, though I suppose that could change. -Chris
Cheers, Derek -- ---------------------------------------------------------------- Derek Homeier Centre de Recherche Astrophysique de Lyon ENS Lyon 46, Allée d'Italie 69364 Lyon Cedex 07, France +33 1133 47272-8894 ----------------------------------------------------------------
On 9/2/11 2:45 PM, Christopher Jordan-Squire wrote:
It doesn't have to parse the entire file to determine the dtypes. It builds up a regular expression for what it expects to see, in terms of dtypes. Then it just loops over the lines, only parsing if the regular expression doesn't match. It seems that a regex match is fast, but a regex fail is expensive.
interesting -- I wouldn't have expected a regex to be faster than simple parsing, but that's why you profile!
Setting array elements is not as fast for the masked record arrays. You must set entire rows at a time, so I have to build up each row as a list, and convert to a tuple, and then stuff it in the array.
hmmm -- that is a lot -- I was thinking of a simple "set a value in an array". I've also done a bunch of this in C, where it's really fast. However, rather than:

build a row as a list
build a row as a tuple
stuff into array

could you create an empty array scalar, and fill that, then put that in your array:

In [4]: dt = np.dtype([('x',np.float32),('y', np.int32),('z', np.float64)])

In [5]: dt
Out[5]: dtype([('x', '<f4'), ('y', '<i4'), ('z', '<f8')])

In [6]: temp = np.empty((), dtype=dt)

In [9]: temp['x'] = 3

In [10]: temp['y'] = 4

In [11]: temp['z'] = 5

In [13]: a = np.zeros((4,), dtype = dt)

In [14]: a[0] = temp

In [15]: a
Out[15]:
array([(3.0, 4, 5.0), (0.0, 0, 0.0), (0.0, 0, 0.0), (0.0, 0, 0.0)],
      dtype=[('x', '<f4'), ('y', '<i4'), ('z', '<f8')])

(and you could pass the array scalar into accumulator as well)

maybe it wouldn't be any faster, but with re-using temp, and one less list-tuple conversion, and fewer python type to numpy type conversions, maybe it would.
it's even slower for the record arrays with missing data because I must branch between adding missing data versus adding real data. Might that be the reason for the slower performance than you'd expect?
could be -- I haven't thought about the missing data part much.
I wonder if there are any really important cases where you'd actually lose something by simply recasting an entry to another dtype, as Derek suggested.
In general, it won't be a simple re-cast -- it will be a copy to a subset -- which may be hard to write the code, but would save having to re-parse the data. Anyway, you know the issues, this is good stuff either way. -Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker@noaa.gov
Wed, 07 Sep 2011 12:52:44 -0700, Chris.Barker wrote: [clip]
In [9]: temp['x'] = 3
In [10]: temp['y'] = 4
In [11]: temp['z'] = 5 [clip] maybe it wouldn't be any faster, but with re-using temp, and one less list-tuple conversion, and fewer python type to numpy type conversions, maybe it would.
Structured array assignments have plenty of overhead in Numpy, so it could be slower, too:

x = np.array((1,2), dtype=[('a', int), ('b', float)])
x2 = [1,2,3]

%timeit x['a'] = 9
100000 loops, best of 3: 2.83 us per loop

%timeit x2[0] = 9
1000000 loops, best of 3: 368 ns per loop
On Wed, Sep 7, 2011 at 2:52 PM, Chris.Barker <Chris.Barker@noaa.gov> wrote:
On 9/2/11 2:45 PM, Christopher Jordan-Squire wrote:
It doesn't have to parse the entire file to determine the dtypes. It builds up a regular expression for what it expects to see, in terms of dtypes. Then it just loops over the lines, only parsing if the regular expression doesn't match. It seems that a regex match is fast, but a regex fail is expensive.
interesting -- I wouldn't have expected a regex to be faster than simple parsing, but that's why you profile!
Setting array elements is not as fast for the masked record arrays. You must set entire rows at a time, so I have to build up each row as a list, and convert to a tuple, and then stuff it in the array.
hmmm -- that is a lot -- I was thinking of a simple "set a value in an array". I've also done a bunch of this in C, where it's really fast.
However, rather than:
build a row as a list build a row as a tuple stuff into array
could you create an empty array scalar, and fill that, then put that in your array:
In [4]: dt = np.dtype([('x',np.float32),('y', np.int32),('z', np.float64)])
In [5]: dt Out[5]: dtype([('x', '<f4'), ('y', '<i4'), ('z', '<f8')])
In [6]: temp = np.empty((), dtype=dt)
In [9]: temp['x'] = 3
In [10]: temp['y'] = 4
In [11]: temp['z'] = 5
In [13]: a = np.zeros((4,), dtype = dt)
In [14]: a[0] = temp
In [15]: a Out[15]: array([(3.0, 4, 5.0), (0.0, 0, 0.0), (0.0, 0, 0.0), (0.0, 0, 0.0)], dtype=[('x', '<f4'), ('y', '<i4'), ('z', '<f8')])
(and you could pass the array scalar into accumulator as well)
maybe it wouldn't be any faster, but with re-using temp, and one less list-tuple conversion, and fewer python type to numpy type conversions, maybe it would.
I just ran a quick test on my machine of this idea. With

dt = np.dtype([('x',np.float32),('y', np.int32),('z', np.float64)])
temp = np.empty((), dtype=dt)
temp2 = np.zeros(1,dtype=dt)

In [96]: def f():
   ...:     l=[0]*3
   ...:     l[0] = 2.54
   ...:     l[1] = 4
   ...:     l[2] = 2.3645
   ...:     j = tuple(l)
   ...:     temp2[0] = j

vs

In [97]: def g():
   ...:     temp['x'] = 2.54
   ...:     temp['y'] = 4
   ...:     temp['z'] = 2.3645
   ...:     temp2[0] = temp
   ...:

The timing results were 2.73 us for f and 3.43 us for g. So good idea, but it doesn't appear to be faster. (Though the difference wasn't nearly as dramatic as I thought it would be, based on Pauli's comment.)

-Chris JS
it's even slower for the record arrays with missing data because I must branch between adding missing data versus adding real data. Might that be the reason for the slower performance than you'd expect?
could be -- I haven't thought about the missing data part much.
I wonder if there are any really important cases where you'd actually lose something by simply recasting an entry to another dtype, as Derek suggested.
In general, it won't be a simple re-cast -- it will be a copy to a subset -- which may be hard to write the code, but would save having to re-parse the data.
Anyway, you know the issues, this is good stuff either way.
-Chris
-- Christopher Barker, Ph.D. Oceanographer
Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception
Chris.Barker@noaa.gov
On 9/8/11 1:43 PM, Christopher Jordan-Squire wrote:
I just ran a quick test on my machine of this idea. With
dt = np.dtype([('x',np.float32),('y', np.int32),('z', np.float64)])
temp = np.empty((), dtype=dt)
temp2 = np.zeros(1,dtype=dt)

In [96]: def f():
   ...:     l=[0]*3
   ...:     l[0] = 2.54
   ...:     l[1] = 4
   ...:     l[2] = 2.3645
   ...:     j = tuple(l)
   ...:     temp2[0] = j

vs

In [97]: def g():
   ...:     temp['x'] = 2.54
   ...:     temp['y'] = 4
   ...:     temp['z'] = 2.3645
   ...:     temp2[0] = temp
   ...:
The timing results were 2.73 us for f and 3.43 us for g. So good idea, but it doesn't appear to be faster. (Though the difference wasn't nearly as dramatic as I thought it would be, based on Pauli's comment.)
my guess is that the lines like:

temp['x'] = 2.54

are slower (it requires a dict lookup, and a conversion from a python type to a "raw" type) and

temp2[0] = temp

is faster, as that doesn't require any conversion. Which means that if you had a larger struct dtype, it would be even slower, so clearly not the way to go for performance. It would be nice to have a higher performing struct dtype scalar -- as it is ordered, it might be nice to be able to index it with either the name or a numeric index. -Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker@noaa.gov
I did some timings to see what the advantage would be, in the simplest case possible, of taking multiple lines from the file to process at a time, assuming the dtype is already known. The code is attached.

What I found was I can't use generators to avoid constructing a list and then making a tuple from the list. It appears that the user must create a tuple to place in a numpy record array. (Specifically, if you remove the 'tuple' command from f2 in the attached then you get an error.)

Taking multiple lines at a time (using f4) does provide a speed benefit, but it's not very big since Python's re module won't let you capture more than 100 values, and I'm using capturing to extract the values. (This is done because we're allowing the user to use regular expressions to denote delimiters.)

In the example it's a bunch of space-delimited integers. f1 splits on the space and uses a list comprehension, f2 splits on the space and uses a generator, f3 uses regular expressions in a manner similar to the current code, f4 uses regular expressions on multiple lines at once, and f5 uses fromiter. (Though fromiter isn't as useful as I'd hoped because you have to have already parsed out a line, since this is a record array.) f6 and f7 use stripped down versions of Chris Barker's accumulator idea. The difference is that f6 uses resize when expanding the array while f7 uses np.empty followed by np.append. This avoids the penalty from copying data that np.resize imposes. Note that f6 and f7 use the regular expression capturing line by line as in f3. To get a feel for the overhead involved with keeping track of string sizes, f8 is just f3 except with a list for the largest string sizes seen so far.

The speeds I get using timeit are:

f1     : 1.66ms
f2     : 2.01ms
f3     : 2.35ms
f4(2)  : 3.02ms (Odd that it starts out worse than f3 when you take 2 lines at a time)
f4(5)  : 2.25ms
f4(10) : 2.02ms
f4(15) : 1.93ms
f4(20) : error
f5     : 2.28ms (As I said, fromiter can't do much when it's just filling in a record array. While it's slightly faster than f3, which it's based on, it also loads all the data as a list before creating a numpy array, which is rather suboptimal.)
f6     : 3.26ms
f7     : 2.77ms (Apparently it's a lot cheaper to do np.empty followed by append than to do resize)
f8     : 3.04ms (Compared to f3, this shows there's a non-trivial performance hit from keeping track of the sizes)

It seems like taking multiple lines at once isn't a big gain when we're limited to 100 captured entries at a time. (For Python 2.6, at least.) Especially since taking multiple lines at once would be rather complex, since the code must still check each line to see if it's commented out or not.

After talking to Chris Farrow, an Enthought developer, and doing some timing on a dataset he was working on, I had loadtable running about 1.7 to 3.3 times as fast as genfromtxt. The catch is that genfromtxt was loading datetimes as strings, while loadtable was loading them as numpy datetimes. The conversion from string to datetime is somewhat expensive, so I think that accounts for some of the extra time. The range of timings--between 1.5 to 3.5 times as slow--reflects how many lines are used to check for sizes and dtypes. As it turns out, checking for those can be quite expensive, and the majority of the time seems to be spent in the regular expression matching. (Though Chris is using a slight variant on my pull request, and I'm getting function times that are not as bad as his.)
The cost of the size and type checking was less apparent in the example I have timings on in a previous email because in that case there was a huge cost for converting data with commas instead of decimals and for the datetime conversion.

To give some further context, I compared np.genfromtxt and np.loadtable on the same 'pseudo-file' f used in the above tests, when the data is just a bunch of integers. The results were:

np.genfromtxt with dtype=None       : 4.45 ms
np.loadtable with defaults          : 5.12 ms
np.loadtable with check_sizes=False : 3.7 ms

So it seems that np.loadtable is already competitive with np.genfromtxt other than checking the sizes. And checking sizes isn't even that huge a penalty compared to genfromtxt.

Based on all the above it seems like the accumulator is the most promising way that things could be sped up. But it's not completely clear to me by how much, since we still must keep track of the dtypes and the sizes. Other than possibly changing loadtable to use np.NA instead of masked arrays in the presence of missing data, I'm starting to feel like it's more or less complete for now, and can be left to be improved in the future. Most of the things that have been discussed are either performance trade-offs or somewhat large re-engineering of the internals.

-Chris JS

On Thu, Sep 8, 2011 at 3:57 PM, Chris.Barker <Chris.Barker@noaa.gov> wrote:
On 9/8/11 1:43 PM, Christopher Jordan-Squire wrote:
I just ran a quick test on my machine of this idea. With
dt = np.dtype([('x',np.float32),('y', np.int32),('z', np.float64)])
temp = np.empty((), dtype=dt)
temp2 = np.zeros(1,dtype=dt)

In [96]: def f():
   ...:     l=[0]*3
   ...:     l[0] = 2.54
   ...:     l[1] = 4
   ...:     l[2] = 2.3645
   ...:     j = tuple(l)
   ...:     temp2[0] = j

vs

In [97]: def g():
   ...:     temp['x'] = 2.54
   ...:     temp['y'] = 4
   ...:     temp['z'] = 2.3645
   ...:     temp2[0] = temp
   ...:
The timing results were 2.73 us for f and 3.43 us for g. So good idea, but it doesn't appear to be faster. (Though the difference wasn't nearly as dramatic as I thought it would be, based on Pauli's comment.)
my guess is that the lines like:

temp['x'] = 2.54

are slower (it requires a dict lookup, and a conversion from a python type to a "raw" type)
and
temp2[0] = temp
is faster, as that doesn't require any conversion.
Which means that if you had a larger struct dtype, it would be even slower, so clearly not the way to go for performance.
It would be nice to have a higher performing struct dtype scalar -- as it is ordered, it might be nice to be able to index it with either the name or a numeric index.
-Chris
-- Christopher Barker, Ph.D. Oceanographer
Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception
Chris.Barker@noaa.gov
On Sep 13, 2011, at 01:38 , Christopher Jordan-Squire wrote:
I did some timings to see what the advantage would be, in the simplest case possible, of taking multiple lines from the file to process at a time. Assuming the dtype is already known. The code is attached. What I found was I can't use generators to avoid constructing a list and then making a tuple from the list.
Still, I think there should be a way to use generators to create the final array (once your dtype is known and assuming you can skip invalid lines)...
The catch is that genfromtxt was loading datetimes as strings, while loadtable was loading them as numpy datetimes. The conversion from string to datetime is somewhat expensive, so I think that accounts for some of the extra time. The range of timings--between 1.5 to 3.5 times as slow--reflects how many lines are used to check for sizes and dtypes. As it turns out, checking for those can be quite expensive, and the majority of the time seems to be spent in the regular expression matching. (Though Chris is using a slight variant on my pull request, and I'm getting function times that are not as bad as his.) The cost of the size and type checking was less apparent in the example I have timings on in a previous email because in that case there was a huge cost for converting data with commas instead of decimals and for the datetime conversion.
The problem with parsing dates with re is that depending on your separator, on your local conventions (e.g., MM-DD-YYYY vs DD/MM/YYYY) and on your string itself, you'll get very different results, not always the ones you want. Hence, I preferred to leave the dates out of the basic convertors and ask the user instead to give her own. If you can provide a functionality in loadtable to that effect, that'd be great.
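(For reference, genfromtxt's existing converters argument already supports that kind of user-supplied date parsing; the column index and format string below are made-up examples, and depending on the numpy/Python version the converter may receive bytes rather than str:)

import numpy as np
from datetime import datetime

def parse_date(tok):
    # decode defensively: the converter may be handed bytes or str
    if isinstance(tok, bytes):
        tok = tok.decode()
    return datetime.strptime(tok.strip(), '%m/%d/%y')

# column 0 holds dates like 09/02/11 (made-up layout)
data = np.genfromtxt('data.csv', delimiter=',', dtype=None,
                     converters={0: parse_date})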
Other than possibly changing loadtable to use np.NA instead of masked arrays in the presence of missing data, I'm starting to feel like it's more or less complete for now, and can be left to be improved in the future. Most of the things that have been discussed are either performance trade-offs or somewhat large re-engineering of the internals.
Well, it seems that loadtable doesn't work when you use positions instead of delimiters to separate the fields (e.g. below). What if you want to apply some specific conversion to a column? e.g., transform a string representing a hexadecimal number to an int?

Apart from that, I do appreciate the efforts you're putting in to improve genfromtxt. It's needed, direly. Sorry that I can't find the time to really work on that (I do need to sleep sometimes)… But chats with Pauli V., Ralf G. among others during EuroScipy led me to think a basic reorganization of npyio is quite advisable.

#C00:07 : YYYYMMDD (8)
#C08:15 : HH:mm:SS (8)
#C16:18 : XXX (3)
#C19:25 : float (7)
#C26:32 : float (7)
#C27:39 : float (7)
# np.genfromtxt('test.txt', delimiter=(8,8,3,7,7,7), usemask=True, dtype=None)
2011010112:34:56AAA001.234005.678010.123999.999
2011010112:34:57BBB001.234005.678010.123999.999
2011010112:34:58CCC001.234005.678010.123999.999
2011010112:34:59 001.234005.678010.123999.999
2011010112:35:00DDD 5.678010.123
2011010112:35:01EEE001.234005.678010.123999.999
On 9/12/11 4:38 PM, Christopher Jordan-Squire wrote:
I did some timings to see what the advantage would be, in the simplest case possible, of taking multiple lines from the file to process at a time.
Nice work, only a minor comment:
f6 and f7 use stripped down versions of Chris Barker's accumulator idea. The difference is that f6 uses resize when expanding the array while f7 uses np.empty followed by np.append. This avoids the penalty from copying data that np.resize imposes.
I don't think it does:

"""
In [3]: np.append?
----------
arr : array_like
    Values are appended to a copy of this array.

Returns
-------
out : ndarray
    A copy of `arr` with `values` appended to `axis`.  Note that `append`
    does not occur in-place: a new array is allocated and filled.
"""

There is no getting around the copying. However, I think resize() uses the OS memory re-allocate call, which may, in some instances, have over-allocated the memory in the first place, and thus not require a copy. So I'm pretty sure ndarray.resize is as good as it gets.
f6 : 3.26ms
f7 : 2.77ms (Apparently it's a lot cheaper to do np.empty followed by append than to do resize)
Darn that profiling proving my expectations wrong again! though I'm really confused as to how that could be! -Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker@noaa.gov
On Tue, Sep 13, 2011 at 2:41 PM, Chris.Barker <Chris.Barker@noaa.gov> wrote:
On 9/12/11 4:38 PM, Christopher Jordan-Squire wrote:
I did some timings to see what the advantage would be, in the simplest case possible, of taking multiple lines from the file to process at a time.
Nice work, only a minor comment:
f6 and f7 use stripped down versions of Chris Barker's accumulator idea. The difference is that f6 uses resize when expanding the array while f7 uses np.empty followed by np.append. This avoids the penalty from copying data that np.resize imposes.
I don't think it does:
""" In [3]: np.append? ---------- arr : array_like Values are appended to a copy of this array.
Returns ------- out : ndarray A copy of `arr` with `values` appended to `axis`. Note that `append` does not occur in-place: a new array is allocated and filled. """
There is no getting around the copying. However, I think resize() uses the OS memory re-allocate call, which may, in some instances, have over-allocated the memory in the first place, and thus not require a copy. So I'm pretty sure ndarray.resize is as good as it gets.
f6 : 3.26ms
f7 : 2.77ms (Apparently it's a lot cheaper to do np.empty followed by append than to do resize)
Darn that profiling proving my expectations wrong again! though I'm really confused as to how that could be!
Sorry, I cheated by reading the docs. :-)

"""
numpy.resize(a, new_shape)

Return a new array with the specified shape.

If the new array is larger than the original array, then the new array is
filled with repeated copies of a. Note that this behavior is different from
a.resize(new_shape) which fills with zeros instead of repeated copies of a.
"""

The copying I meant was that numpy.resize will fill the resized array with repeated copies of the data. So np.empty followed by np.append avoids that. -Chris
-Chris
-- Christopher Barker, Ph.D. Oceanographer
Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception
Chris.Barker@noaa.gov _______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
On Tue, Sep 13, 2011 at 3:43 AM, Pierre GM <pgmdevlist@gmail.com> wrote:
On Sep 13, 2011, at 01:38 , Christopher Jordan-Squire wrote:
I did some timings to see what the advantage would be, in the simplest case possible, of taking multiple lines from the file to process at a time. Assuming the dtype is already known. The code is attached. What I found was I can't use generators to avoid constructing a list and then making a tuple from the list.
Still, I think there should be a way to use generators to create the final array (once your dtype is known and assuming you can skip invalid lines)...
The catch is that genfromtxt was loading datetimes as strings, while loadtable was loading them as numpy datetimes. The conversion from string to datetime is somewhat expensive, so I think that accounts for some of the extra time. The range of timings--between 1.5 to 3.5 times as slow--reflects how many lines are used to check for sizes and dtypes. As it turns out, checking for those can be quite expensive, and the majority of the time seems to be spent in the regular expression matching. (Though Chris is using a slight variant on my pull request, and I'm getting function times that are not as bad as his.) The cost of the size and type checking was less apparent in the example I have timings on in a previous email because in that case there was a huge cost for converting data with commas instead of decimals and for the datetime conversion.
The problem with parsing dates with re is that depending on your separator, on your local conventions (e.g., MM-DD-YYYY vs DD/MM/YYYY) and on your string itself, you'll get very different results, not always the ones you want. Hence, I preferred to leave the dates out of the basic convertors and ask the user instead to give her own. If you can provide a functionality in loadtable to that effect, that'd be great.
Other than possibly changing loadtable to use np.NA instead of masked arrays in the presence of missing data, I'm starting to feel like it's more or less complete for now, and can be left to be improved in the future. Most of the things that have been discussed are either performance trade-offs or somewhat large re-engineering of the internals.
Well, it seems that loadtable doesn't work when you use positions instead of delimiters to separate the fields (e.g. below). What if you want to apply some specific conversion to a column ? e.g., transform a string representing a hexa to a int?
Apart from that, I do appreciate the efforts you're putting to improve genfromtxt. It's needed, direly. Sorry that I can't find the time to really work on that (I do need to sleep sometimes)… But chats with Pauli V., Ralf G. among others during EuroScipy lead me to think a basic reorganization of npyio is quite advisable.
#C00:07 : YYYYMMDD (8)
#C08:15 : HH:mm:SS (8)
#C16:18 : XXX (3)
#C19:25 : float (7)
#C26:32 : float (7)
#C27:39 : float (7)
# np.genfromtxt('test.txt', delimiter=(8,8,3,7,7,7), usemask=True, dtype=None)
2011010112:34:56AAA001.234005.678010.123999.999
2011010112:34:57BBB001.234005.678010.123999.999
2011010112:34:58CCC001.234005.678010.123999.999
2011010112:34:59 001.234005.678010.123999.999
2011010112:35:00DDD 5.678010.123
2011010112:35:01EEE001.234005.678010.123999.999
Thanks for mentioning the fixed width file type. I had completely missed genfromtxt allows that. Though, in all honesty, I wasn't really intending that loadtable be a drop-in replacement for genfromtxt. More like a more robust and memory efficient alternative. I think I can add that functionality to loadtable, but it might require adding some special case stuff. Most everything is geared towards delimited text rather than fixed width text. An idea that was floated today when I talked about loadtable at Enthought was refactoring it as a class, and then letting some of the internals that currently aren't exposed to the user be exposed. That way the user could specify their own converters if desired without having to add yet another parameter. In fact, it could make it possible to remove some of the existing parameters by making them instance variables, for example. How do people feel about that? In terms of refactoring numpy io, was there anything concrete or specific discussed? -Chris JS
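(To make the class idea concrete, here is a purely hypothetical sketch of what such an interface might look like -- none of these names exist in the pull request:)

class TableReader(object):
    """Hypothetical class-based front end for loadtable-style reading,
    with per-column converters exposed as state instead of extra
    function parameters."""

    def __init__(self, delimiter=r'\s+', na_re=None, check_sizes=True):
        self.delimiter = delimiter
        self.na_re = na_re
        self.check_sizes = check_sizes
        self.converters = {}            # column index/name -> callable

    def set_converter(self, column, func):
        self.converters[column] = func

    def read(self, path):
        # would run the two-pass detection/loading discussed earlier
        raise NotImplementedError

# e.g. Pierre's hex-column case could then be handled with:
#     reader = TableReader()
#     reader.set_converter(2, lambda s: int(s, 16))
#     arr = reader.read('test.txt')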
On 9/13/11 1:01 PM, Christopher Jordan-Squire wrote:
Sorry, I cheated by reading the docs. :-)
me too...
""" numpy.resize(a, new_shape)
Return a new array with the specified shape.
If the new array is larger than the original array, then the new array is filled with repeated copies of a. Note that this behavior is different from a.resize(new_shape) which fills with zeros instead of repeated copies of a. """
see the: "this behavior is different from a.resize(new_shape)"

so:

a.resize(new_shape, refcheck=True)

    Change shape and size of array in-place.

    Parameters
    ----------
    new_shape : tuple of ints, or `n` ints
        Shape of resized array.
    refcheck : bool, optional
        If False, reference count will not be checked. Default is True.

    Returns
    -------
    None
The copying I meant was that numpy.resize will fill the resized array with repeated copies of the data. So np.empty followed by np.append avoids that.
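A tiny example of the two fill behaviours described above (the values in the comments are what the documented behaviour implies; run it as a script, since interactive shells can hold extra references that make the in-place resize refuse):

import numpy as np

a = np.arange(3)          # array([0, 1, 2])

b = np.resize(a, (6,))    # new array, padded with repeated copies of a:
                          # array([0, 1, 2, 0, 1, 2])

a.resize((6,))            # in place; the new slots are zero-filled:
                          # a is now array([0, 1, 2, 0, 0, 0])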
numpy.ndarray.resize is a different method, and I'm pretty sure it should be as fast as or faster than np.empty + np.append. It is often confusing that there is a numpy function and an ndarray method with the same name and slightly different usage. -Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker@noaa.gov
On 9/14/11 1:01 PM, Christopher Barker wrote:
numpy.ndarray.resize is a different method, and I'm pretty sure it should be as fast as or faster than np.empty + np.append.
My profile:

In [25]: %timeit f1   # numpy.resize()
10000000 loops, best of 3: 163 ns per loop

In [26]: %timeit f2   # numpy.ndarray.resize()
10000000 loops, best of 3: 136 ns per loop

In [27]: %timeit f3   # numpy.empty() + append()
10000000 loops, best of 3: 136 ns per loop

those last two couldn't be more identical!

(though this is an exercise in unrequired optimization!)

My test code:

#!/usr/bin/env python

"""
test_resize

A test of various numpy re-sizing options
"""

import numpy

def f1():
    """numpy.resize"""
    l = 100
    a = numpy.zeros((l,))
    for i in xrange(1000):
        l += l
        a = numpy.resize(a, (l,))
    return None

def f2():
    """numpy.ndarray.resize"""
    l = 100
    a = numpy.zeros((l,))
    for i in xrange(1000):
        l += l
        a.resize(a, (l,))
    return None

def f3():
    """numpy.empty + append"""
    l = 100
    a = numpy.zeros((l,))
    for i in xrange(1000):
        b = np.empty((l,))
        a.append(b)
    return None

-- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker@noaa.gov
On Wed, Sep 14, 2011 at 4:25 PM, Christopher Barker <Chris.Barker@noaa.gov>wrote:
On 9/14/11 1:01 PM, Christopher Barker wrote:
numpy.ndarray.resize is a different method, and I'm pretty sure it should be as fast as or faster than np.empty + np.append.
My profile:
In [25]: %timeit f1 # numpy.resize() 10000000 loops, best of 3: 163 ns per loop
In [26]: %timeit f2 #numpy.ndarray.resize() 10000000 loops, best of 3: 136 ns per loop
In [27]: %timeit f3 # numpy.empty() + append() 10000000 loops, best of 3: 136 ns per loop
those last two couldn't be more identical!
(though this is an exercise in unrequired optimization!)
Are you sure the f2 code works? a.resize() takes only a shape tuple. As coded, you should get an exception. Ben Root
On 9/14/11 2:41 PM, Benjamin Root wrote:
Are you sure the f2 code works? a.resize() takes only a shape tuple. As coded, you should get an exception.
wow, what an idiot!

I think I just timed how long it takes to raise that exception...

And when I fix that, I get a memory error. When I fix that, I find that f3() wasn't doing the right thing. What an astonishing lack of attention on my part!

Here it is again, working, I hope!

In [107]: %timeit f1()
10 loops, best of 3: 50.7 ms per loop

In [108]: %timeit f2()
1000 loops, best of 3: 719 us per loop

In [109]: %timeit f3()
100 loops, best of 3: 19 ms per loop

So:
numpy.resize() is the slowest
numpy.empty + numpy.append() is faster
numpy.ndarray.resize() is the fastest

Which matches my expectations, for once!

-Chris

The code:

#!/usr/bin/env python

"""
test_resize

A test of various numpy re-sizing options
"""

import numpy

def f1():
    """numpy.resize"""
    extra = 100
    l = extra
    a = numpy.zeros((l,))
    for i in xrange(100):
        l += extra
        a = numpy.resize(a, (l,))
    return a

def f2():
    """numpy.ndarray.resize"""
    extra = 100
    l = extra
    a = numpy.zeros((l,))
    for i in xrange(100):
        l += extra
        a.resize((l,))
    return a

def f3():
    """numpy.empty + append"""
    extra = 100
    l = extra
    a = numpy.zeros((l,))
    for i in xrange(100):
        b = numpy.empty((extra,))
        a = numpy.append(a, b)
    return a

a1 = f1()
a2 = f2()
a3 = f3()

if a1.shape == a2.shape == a3.shape:
    print "they are all returning the same size array"
else:
    print "Something is wrong!"

-- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker@noaa.gov
On Wed, Sep 14, 2011 at 5:30 PM, Christopher Barker <Chris.Barker@noaa.gov> wrote:
On 9/14/11 2:41 PM, Benjamin Root wrote:
Are you sure the f2 code works? a.resize() takes only a shape tuple. As coded, you should get an exception.
wow, what an idiot!
I think I just timed how long it takes to raise that exception...
And when I fix that, I get a memory error.
When I fix that, I find that f3() wasn't doing the right thing. What an astonishing lack of attention on my part!
Here it is again, working, I hope!
In [107]: %timeit f1() 10 loops, best of 3: 50.7 ms per loop
In [108]: %timeit f2() 1000 loops, best of 3: 719 us per loop
In [109]: %timeit f3() 100 loops, best of 3: 19 ms per loop
So:
numpy.resize() is the slowest
numpy.empty + numpy.append() is faster
numpy.ndarray.resize() is the fastest
Which matches my expectations, for once!
Good catch! I didn't think the difference between np.resize and ndarray.resize would matter. (And I was getting inscrutable errors when I called ndarray.resize that told me to use np.resize instead.) -Chris JS
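Those inscrutable errors are most likely ndarray.resize's reference check: if anything else (a view, another name, an interactive shell's output cache) still points at the buffer, the in-place resize refuses and the message suggests np.resize instead. A small sketch of when it fires and the usual workarounds:

import numpy as np

a = np.zeros(100)
view = a[:10]            # any view keeps a reference to a's buffer

try:
    a.resize(200)        # refuses: a is referenced by another array
except ValueError:
    # Workarounds: delete the other references first, fall back to
    # np.resize (which copies), or pass refcheck=False if you are certain
    # nothing else uses the old buffer.
    a = np.resize(a, 200)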
participants (7)
- Benjamin Root
- Chris.Barker
- Christopher Barker
- Christopher Jordan-Squire
- Derek Homeier
- Pauli Virtanen
- Pierre GM