
All,

Please find attached to this message another implementation of np.loadtxt, which focuses on missing values. It's basically a combination of John Hunter et al.'s mlab.csv2rec, Ryan May's patches, and pieces of code I'd been working on over the last few weeks. Besides some helper classes (StringConverter to convert a string into something else, NameValidator to check names...), you'll find 3 functions:

* `genloadtxt` is the base function that does all the work. It outputs 2 arrays, one for the data (missing values being substituted by the appropriate default) and one for the mask. It would go in np.lib.io.

* `loadtxt` would replace the current np.loadtxt. It outputs an ndarray, with missing data filled in. It would also go in np.lib.io.

* `mloadtxt` would go into np.ma.io (to be created) and be renamed `loadtxt`. Right now, I needed a different name to avoid conflicts. It combines the outputs of `genloadtxt` into a single masked array.

You'll also find several series of tests that you can use as examples. Please give it a try and send me some feedback (bugs, wishes, suggestions). I'd like it to make the 1.3.0 release (I need some of the functionality to improve the corresponding function in scikits.timeseries, currently fubar...).

P.
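[Editor's note: a minimal sketch of how the attached functions might be used, going by the description above. `genloadtxt` and `mloadtxt` live in the attached module, not in a released numpy, and the exact signatures (e.g. the `delimiter` keyword) are assumptions here.]

```python
from StringIO import StringIO   # Python 2 of the era

# Hypothetical usage based on the description above.
s = StringIO("1, 2, 3\n4, , 6")             # second row has a missing field

data, mask = genloadtxt(s, delimiter=',')   # filled data + boolean mask
# data[1, 1] holds the column's default fill value; mask[1, 1] is True

marr = mloadtxt(StringIO("1, 2, 3\n4, , 6"), delimiter=',')
# marr carries the same information as a single masked array
```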

Hi Pierre 2008/12/1 Pierre GM <pgmdevlist@gmail.com>:
* `genloadtxt` is the base function that does all the work. It outputs 2 arrays, one for the data (missing values being substituted by the appropriate default) and one for the mask. It would go in np.lib.io.
I see the code length increased from 200 lines to 800. This made me wonder about the execution time: initial benchmarks suggest a 3x slow-down. Could this be a problem for loading large text files? If so, should we consider keeping both versions around, or by default bypassing all the extra hooks? Regards Stéfan

Stéfan van der Walt wrote:
Hi Pierre
2008/12/1 Pierre GM <pgmdevlist@gmail.com>:
* `genloadtxt` is the base function that does all the work. It outputs 2 arrays, one for the data (missing values being substituted by the appropriate default) and one for the mask. It would go in np.lib.io.
I see the code length increased from 200 lines to 800. This made me wonder about the execution time: initial benchmarks suggest a 3x slow-down. Could this be a problem for loading large text files? If so, should we consider keeping both versions around, or by default bypassing all the extra hooks?
I've wondered about this being an issue. On one hand, you hate to make existing code noticeably slower. On the other hand, if speed is important to you, why are you using ascii I/O?

I personally am not entirely against having two versions of loadtxt-like functions. However, the idea seems a little odd, seeing as how loadtxt was already supposed to be the "swiss army knife" of text reading.

I'm seeing a similar slowdown with Pierre's version of the code. The version of loadtxt that I cobbled together with the StringConverter class (and no missing value support) shows about a 50% slowdown, so clearly there's a performance penalty for trying to make a generic function that can be all things to all people. On the other hand, this approach reduces code duplication.

I'm not really opinionated on what the right approach is here. My only opinion is that this functionality *really* needs to be in numpy in some fashion. For my own use case, with the old version, I could read a text file and by hand separate out columns and mask values. Now, I open a file and get a structured array with an automatically detected dtype (names and types!) plus masked values.

My $0.02.

Ryan

--
Ryan May
Graduate Research Assistant
School of Meteorology
University of Oklahoma

2008/12/1 Ryan May <rmay31@gmail.com>:
I've wondered about this being an issue. On one hand, you hate to make existing code noticeably slower. On the other hand, if speed is important to you, why are you using ascii I/O?
More "I" than "O"! But I think numpy.fromfile, once fixed up, could fill this niche nicely.
I personally am not entirely against having two versions of loadtxt-like functions. However, the idea seems a little odd, seeing as how loadtxt was already supposed to be the "swiss army knife" of text reading.
I haven't investigated the code in too much detail, but wouldn't it be possible to implement the current set of functionality in a base-class, which is then specialised to add the rest? That way, you could always instantiate TextReader yourself for some added speed.
I'm not really opinionated on what the right approach is here. My only opinion is that this functionality *really* needs to be in numpy in some fashion. For my own use case, with the old version, I could read a text file and by hand separate out columns and mask values. Now, I open a file and get a structured array with an automatically detected dtype (names and types!) plus masked values.
That's neat! Cheers Stéfan

I agree, genloadtxt is a bit bloated, and it's no surprise it's slower than the initial one. I think that in order to be fair, comparisons must be performed with matplotlib.mlab.csv2rec, which also implements autodetection of the dtype. I'm quite in favor of keeping a lite version around.

On Dec 1, 2008, at 4:47 PM, Stéfan van der Walt wrote:
I haven't investigated the code in too much detail, but wouldn't it be possible to implement the current set of functionality in a base-class, which is then specialised to add the rest? That way, you could always instantiate TextReader yourself for some added speed.
Well, one of the issues is that we need to keep the function compatible w/ urllib.urlretrieve (Ryan, am I right?), which means not being able to go back to the beginning of a file (no call to .seek). Another issue comes from the possibility of defining the dtype automatically: you need to keep track of the converters, then do a second loop on the data. Those converters are likely the bottleneck, as you need to check whether each value can be interpreted as missing or not and respond appropriately.

I thought about creating a base class, with a specific subclass taking care of the missing values. I found out it would have duplicated a lot of code. In any case, I think that's secondary: we can always optimize pieces of the code afterwards. I'd like more feedback on corner cases and usage...
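[Editor's note: a toy sketch of the two-pass scheme just described. The class below is an illustrative stand-in for the attached StringConverter, not the real thing; in the real code each conversion is also where missing values are tested for, which is why the converters dominate the run time.]

```python
class TinyConverter(object):
    """Toy stand-in for StringConverter: upgrades int -> float -> str."""
    _ladder = [int, float, str]

    def __init__(self):
        self._i = 0

    def upgrade(self, field):
        # Bump the target type until the field can be converted.
        while True:
            try:
                self._ladder[self._i](field)
                return
            except ValueError:
                self._i += 1

    def __call__(self, field):
        return self._ladder[self._i](field)

lines = ["1,2.5,a", "3,4,b"]
rows = [line.split(',') for line in lines]
converters = [TinyConverter() for _ in rows[0]]
for row in rows:                      # first loop: settle each column's type
    for conv, field in zip(converters, row):
        conv.upgrade(field)
data = [[conv(f) for conv, f in zip(converters, row)] for row in rows]
# -> [[1, 2.5, 'a'], [3, 4.0, 'b']]
```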

Pierre GM wrote:
Another issue comes from the possibility to define the dtype automatically:
Does all that get bypassed if the dtype(s) is specified? Is it still slow in that case?

-Chris

--
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA 98115        (206) 526-6317   main reception

Chris.Barker@noaa.gov

On Dec 1, 2008, at 6:21 PM, Christopher Barker wrote:
Pierre GM wrote:
Another issue comes from the possibility to define the dtype automatically:
Does all that get bypassed if the dtype(s) is specified? Is it still slow in that case?
Good question. Having a dtype != None does skip a secondary loop. Once again, I'm sure there's plenty of room for optimization (e.g., different loops depending on whether the dtype is defined or not, whether missing values have to be taken into account or not, etc...). I just want to make sure that we're not missing any functionality and/or corner cases, and that the usage is intuitive enough, before spending some time optimizing...

Hi,

I need to convolve a 1d filter with 8 coefficients with a 2d array of shape (6,7). I can use convolve to perform the operation for each row. This will involve a for loop with a count of 6. I wonder if there is a fast way to do this in numpy without using a for loop. Does anyone know how to do it?

Thanks

Frank

Hi Frank 2008/12/2 frank wang <f.yw@hotmail.com>:
I need to convolve a 1d filter with 8 coefficients with a 2d array of shape (6,7). I can use convolve to perform the operation for each row. This will involve a for loop with a count of 6. I wonder if there is a fast way to do this in numpy without using a for loop. Does anyone know how to do it?
Since 6x7 is quite small, you can afford this trick:

a) Pad the (6,7) array to (6,14).
b) Flatten the array.
c) Perform the convolution.
d) Unflatten the array.
e) Take out the valid values.

Cheers
Stéfan
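[Editor's note: in code, the trick might look like this. Zero-padding each row to 7 + 8 - 1 = 14 columns guarantees that the full convolution of one row cannot bleed into the next, so a single 1-d convolution of the flattened array does all six rows at once.]

```python
import numpy as np

a = np.random.rand(6, 7)   # the data
h = np.random.rand(8)      # the 1-d filter

# a) pad each row to width 7 + 8 - 1 = 14 so rows can't bleed together
padded = np.zeros((6, 14))
padded[:, :7] = a

# b) flatten, c) convolve once, d) unflatten
flat = np.convolve(padded.ravel(), h)       # one 1-d convolution
rows = flat[:6 * 14].reshape(6, 14)

# e) each row now equals the full convolution of the original row
assert np.allclose(rows, [np.convolve(r, h) for r in a])
```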

This is what I thought to do. However, I am not sure whether this is a fast way to do it, and I also want to find a more general way to do it. I thought there might be a more elegant way to do it.

Thanks

Frank

On Mon, Dec 1, 2008 at 11:14 PM, frank wang <f.yw@hotmail.com> wrote:
This is what I thought to do. However, I am not sure whether this is a fast way to do it, and I also want to find a more general way to do it. I thought there might be a more elegant way to do it.
Thanks
Frank
Well, for just the one matrix not much will speed it up. If you have lots of matrices and the coefficients are fixed, then you can set up a "convolution" matrix whose columns are the coefficients shifted appropriately. Then just do a matrix multiply.

Chuck
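[Editor's note: a sketch of that idea, with illustrative names. Column j of the matrix C holds the 8 coefficients shifted down by j rows, so one matrix multiply computes the full convolution of every row.]

```python
import numpy as np

h = np.random.rand(8)          # fixed filter coefficients
n = 7                          # row length of the data
m = n + len(h) - 1             # length of a full convolution (14)

# Column j of C holds the coefficients shifted down by j rows.
C = np.zeros((m, n))
for j in range(n):
    C[j:j + len(h), j] = h

a = np.random.rand(6, n)       # one of many (6, 7) matrices
out = np.dot(a, C.T)           # shape (6, 14): all rows convolved at once
assert np.allclose(out[0], np.convolve(a[0], h))
```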

You can use the 2D convolution routines in either scipy.signal or numpy.numarray.nd_image.

Nadav

-----Original Message-----
From: numpy-discussion-bounces@scipy.org on behalf of frank wang
Sent: Tue 02-Dec-08 03:38
To: numpy-discussion@scipy.org
Subject: [Numpy-discussion] fast way to convolve a 2d array with 1d filter
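[Editor's note: a sketch of the scipy.signal route. Treating the 1-d filter as a 1x8 kernel makes a 2-D convolution routine filter along rows only.]

```python
import numpy as np
from scipy import signal

a = np.random.rand(6, 7)
h = np.random.rand(8)

# A 1x8 kernel convolves along the second axis only.
out = signal.convolve2d(a, h[np.newaxis, :])   # 'full' output: shape (6, 14)
assert np.allclose(out[0], np.convolve(a[0], h))
```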

On Mon, Dec 1, 2008 at 4:55 PM, Pierre GM <pgmdevlist@gmail.com> wrote:
On Dec 1, 2008, at 4:47 PM, Stéfan van der Walt wrote:
I haven't investigated the code in too much detail, but wouldn't it be possible to implement the current set of functionality in a base-class, which is then specialised to add the rest? That way, you could always instantiate TextReader yourself for some added speed.
Well, one of the issues is that we need to keep the function compatible w/ urllib.urlretrieve (Ryan, am I right?), which means not being able to go back to the beginning of a file (no call to .seek).
Well, the original version of loadtxt() checked for seek but didn't need it (fixed now), which kept me from using a urllib2.urlopen() object. If actually using seek() would speed up the new version of loadtxt(), feel free to use it; I'm more than capable of wrapping the urlopen() object within a StringIO. However, I am unconvinced that removing the 2nd loop and instead redoing the reading from the file will be much (if any) of a speed win.

Ryan

--
Ryan May
Graduate Research Assistant
School of Meteorology
University of Oklahoma
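[Editor's note: the wrapping Ryan mentions is a one-liner in the Python 2 of the era; the URL below is a placeholder.]

```python
import urllib2
import numpy as np
from cStringIO import StringIO

url = 'http://example.com/data.txt'   # placeholder URL
response = urllib2.urlopen(url)       # file-like, but not seekable
f = StringIO(response.read())         # buffered copy that supports .seek()
data = np.loadtxt(f)
```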

Stéfan van der Walt wrote:
important to you, why are you using ascii I/O?
ascii I/O is slow, so that's a reason in itself to want it not to be slower!
More "I" than "O"! But I think numpy.fromfile, once fixed up, could fill this niche nicely.
I agree -- for the simple cases, fromfile() could work very well -- perhaps it could even be used to speed up some special cases of loadtxt. But is anyone working on fromfile()?

By the way, I think overloading fromfile() for text files is a bit misleading for users -- I propose we have a fromtextfile() or something instead.

-Chris

--
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA 98115        (206) 526-6317   main reception

Chris.Barker@noaa.gov
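[Editor's note: a sketch of the simple, flat case fromfile's text mode already handles: one dtype, one separator, shape known in advance.]

```python
import numpy as np

np.savetxt('simple.txt', np.arange(12.).reshape(3, 4))
flat = np.fromfile('simple.txt', sep=' ')   # whitespace-separated floats
data = flat.reshape(3, 4)                   # shape must be known in advance
```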

On 1 Dec 2008, at 21:47 , Stéfan van der Walt wrote:
Hi Pierre
2008/12/1 Pierre GM <pgmdevlist@gmail.com>:
* `genloadtxt` is the base function that does all the work. It outputs 2 arrays, one for the data (missing values being substituted by the appropriate default) and one for the mask. It would go in np.lib.io.
I see the code length increased from 200 lines to 800. This made me wonder about the execution time: initial benchmarks suggest a 3x slow-down. Could this be a problem for loading large text files? If so, should we consider keeping both versions around, or by default bypassing all the extra hooks?
Regards Stéfan
As a historical note, we used to have scipy.io.read_array, which at the time was considered by Travis too slow and too "grandiose" to be put in Numpy. As a consequence, numpy.loadtxt() was created, which was simple and fast. Now it looks like we're going back to something grandiose. But perhaps it can be made grandiose *and* reasonably fast ;-).

Cheers,
Joris

P.S. As a reference: http://article.gmane.org/gmane.comp.python.numeric.general/5556/

On 12/2/2008 7:21 AM Joris De Ridder apparently wrote:
As a historical note, we used to have scipy.io.read_array which at the time was considered by Travis too slow and too "grandiose" to be put in Numpy. As a consequence, numpy.loadtxt() was created which was simple and fast. Now it looks like we're going back to something grandiose. But perhaps it can be made grandiose *and* reasonably fast ;-).
I hope this consideration remains prominent in this thread. Is the disappearance of read_array the reason for this change? What happened to it? Note that read_array_demo1.py is still in scipy.io despite the loss of read_array.

Alan Isaac
Participants (9): Alan G Isaac, Charles R Harris, Christopher Barker, frank wang, Joris De Ridder, Nadav Horesh, Pierre GM, Ryan May, Stéfan van der Walt