
There have been some recent attempts to improve the error reporting in genfromtxt <http://projects.scipy.org/numpy/ticket/1212>, which is great, because hunting down the problems reading in big and messy files is not fun.

I am working on a patch that keeps track of the line number and column number of where you are in parsing the file, so that this can be reported in the error. Is there a way to catch a raised error and add to it?

For instance, I have a problem in my file which leads to this error being raised from np.lib._iotools.StringConverter.upgrade:

ValueError: Converter is locked and cannot be upgraded

I added this into np.lib.io.genfromtxt around line 995.

linenum = 0
[...]
if dtype is None:
    try:
        colnum = 0
        for (converter, item) in zip(converters, values):
            converter.upgrade(item)
            colnum += 1
    except:
        raise ValueError, "I don't report the error from " \
            "_iotools.StringConverter.upgrade, but I do know that there " \
            "is a problem trying to convert a value at line %s and " \
            "column %s" % (linenum, colnum)
[...]
linenum += 1

I'd like to add line and column number information to the original error from _iotools. Any suggestions?

Cheers, Skipper

On Fri, Sep 25, 2009 at 1:00 PM, Skipper Seabold <jsseabold@gmail.com>wrote:
There have been some recent attempts to improve the error reporting in genfromtxt <http://projects.scipy.org/numpy/ticket/1212>, which is great, because hunting down the problems reading in big and messy files is not fun.
I am working on a patch that keeps track of the line number and column number of where you are in parsing the file, so that this can be reported in the error. Is there a way to catch a raised error and add to it?

For instance, I have a problem in my file which leads to this error being raised from np.lib._iotools.StringConverter.upgrade:

ValueError: Converter is locked and cannot be upgraded

I added this into np.lib.io.genfromtxt around line 995.

linenum = 0
[...]
if dtype is None:
    try:
        colnum = 0
        for (converter, item) in zip(converters, values):
            converter.upgrade(item)
            colnum += 1
    except:
        raise ValueError, "I don't report the error from " \
            "_iotools.StringConverter.upgrade, but I do know that there " \
            "is a problem trying to convert a value at line %s and " \
            "column %s" % (linenum, colnum)
[...]
linenum += 1

I'd like to add line and column number information to the original error from _iotools. Any suggestions?
There is no good way to edit the message of the original exception instance, as explained here: http://blog.ianbicking.org/2007/09/12/re-raising-exceptions/

Probably the easiest for your purpose is this:

def divbyzero():
    return 1/0

try:
    a = divbyzero()
except ZeroDivisionError as err:
    print 'problem occurred at line X'
    raise err

or if you want to catch any error:

try:
    yourcode()
except:
    print 'problem occurred at line X'
    raise

Maybe better to use a logger instead of print, but you get the idea.

Cheers, Ralf
Cheers, Skipper

On Fri, Sep 25, 2009 at 2:00 PM, Ralf Gommers <ralf.gommers@googlemail.com> wrote:
On Fri, Sep 25, 2009 at 1:00 PM, Skipper Seabold <jsseabold@gmail.com> wrote:
There have been some recent attempts to improve the error reporting in genfromtxt <http://projects.scipy.org/numpy/ticket/1212>, which is great, because hunting down the problems reading in big and messy files is not fun.
I am working on a patch that keeps track of the line number and column number of where you are in parsing the file, so that this can be reported in the error. Is there a way to catch a raised error and add to it?

For instance, I have a problem in my file which leads to this error being raised from np.lib._iotools.StringConverter.upgrade:

ValueError: Converter is locked and cannot be upgraded

I added this into np.lib.io.genfromtxt around line 995.

linenum = 0
[...]
if dtype is None:
    try:
        colnum = 0
        for (converter, item) in zip(converters, values):
            converter.upgrade(item)
            colnum += 1
    except:
        raise ValueError, "I don't report the error from " \
            "_iotools.StringConverter.upgrade, but I do know that there " \
            "is a problem trying to convert a value at line %s and " \
            "column %s" % (linenum, colnum)
[...]
linenum += 1

I'd like to add line and column number information to the original error from _iotools. Any suggestions?
There is no good way to edit the message of the original exception instance, as explained here: http://blog.ianbicking.org/2007/09/12/re-raising-exceptions/
Probably the easiest for your purpose is this:
def divbyzero():
    return 1/0

try:
    a = divbyzero()
except ZeroDivisionError as err:
    print 'problem occurred at line X'
    raise err

or if you want to catch any error:

try:
    yourcode()
except:
    print 'problem occurred at line X'
    raise
Maybe better to use a logger instead of print, but you get the idea.
Ok thanks. Using a logger might be a good idea. Skipper
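(For reference, a minimal sketch of the logger idea; the logger name and the stand-in failing call are hypothetical, reusing the linenum/colnum counters from the patch above:)

import logging
log = logging.getLogger("numpy.genfromtxt")   # hypothetical logger name

linenum, colnum = 42, 7      # as tracked by the patch above
try:
    float("R 1")             # stand-in for converter.upgrade(item)
except ValueError:
    # record where we were, then re-raise; a bare raise keeps the traceback
    log.error("problem occurred at line %s, column %s", linenum, colnum)
    raise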

Ralf Gommers wrote:
Probably the easiest for your purpose is this:
def divbyzero():
    return 1/0

try:
    a = divbyzero()
except ZeroDivisionError as err:
    print 'problem occurred at line X'
    raise err
I get an error with this syntax -- is this 2.6 only?

In [10]: run error.py
error.py:9: Warning: 'as' will become a reserved keyword in Python 2.6
------------------------------------------------------------
   File "error.py", line 9
     except ZeroDivisionError as err:
                              ^
SyntaxError: invalid syntax
print 'problem occurred at line X'
raise
Maybe better to use a logger instead of print, but you get the idea.
definitely don't print! a lib function should never print (unless maybe with a debug flag or something set). I don't know if there is a standard logging approach you could use.

I'd rather see info added to the Exception, or a new Exception raised with info.

Now, another option. It seems in this case that you know what Exception(s) you are trying to catch, and you want to add some information to the message. If you don't need to keep the old traceback, you can do something like:

try:
    4/0
except ZeroDivisionError, err:
    raise Exception("A new error with old message also: %s" % err.message)

It doesn't appear to be possible (well, at least not easy!) to add or change the message on an existing exception and then re-raise it (kind of makes me want mutable strings). I suspect that for this use this would suffice: what the user really wants to know is where in their file the error occurred, not where in the converting it occurred. This assumes that the converting code puts useful messages in the errors. Otherwise, there is info in the traceback module, so you can get more by doing this:

import traceback

try:
    4/0
except ZeroDivisionError, err:
    line, col = 45, 10
    raise Exception(traceback.format_exc() +
                    "\nerror took place at line: %i, column %i\n" % (line, col))

-Chris

--
Christopher Barker, Ph.D.
Oceanographer
Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959 voice
7600 Sand Point Way NE   (206) 526-6329 fax
Seattle, WA 98115        (206) 526-6317 main reception
Chris.Barker@noaa.gov
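(One pattern not mentioned above: Python 2's three-argument raise can attach the original traceback to a new exception, so the extra info need not cost you the traceback. A minimal sketch, not part of Chris's suggestion:)

import sys

try:
    4/0
except ZeroDivisionError, err:
    line, col = 45, 10
    new_err = Exception("%s (at line %i, column %i)" % (err, line, col))
    # three-argument raise: new exception object, original traceback
    raise new_err, None, sys.exc_info()[2]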

On Fri, Sep 25, 2009 at 3:08 PM, Christopher Barker <Chris.Barker@noaa.gov>wrote:
Ralf Gommers wrote:
Probably the easiest for your purpose is this:
def divbyzero():
    return 1/0

try:
    a = divbyzero()
except ZeroDivisionError as err:
    print 'problem occurred at line X'
    raise err
I get an error with this syntax -- is this 2.6 only?
Yes sorry, that's the 2.6/3.0 syntax. It should be

except ZeroDivisionError, err:

Anyway, the instance is not needed in my example because it can't be usefully modified and can be re-raised with a bare "raise".
Maybe better to use a logger instead of print, but you get the idea.
definitely don't print! a lib function should never print (unless maybe with a debug flag or something set). I don't know if there is a standard logging approach you could use.
I'd rather see info added to the Exception, or a new Exception raised with info.
The former is not possible, which motivated my example. The latter loses the original traceback. Logging seems to be the way to go if you want all the info. Cheers, Ralf

On 09/25/2009 12:00 PM, Skipper Seabold wrote:
There have been some recent attempts to improve the error reporting in genfromtxt <http://projects.scipy.org/numpy/ticket/1212>, which is great, because hunting down the problems reading in big and messy files is not fun.

I am working on a patch that keeps track of the line number and column number of where you are in parsing the file, so that this can be reported in the error. Is there a way to catch a raised error and add to it?

For instance, I have a problem in my file which leads to this error being raised from np.lib._iotools.StringConverter.upgrade:

ValueError: Converter is locked and cannot be upgraded

I added this into np.lib.io.genfromtxt around line 995.

linenum = 0
[...]
if dtype is None:
    try:
        colnum = 0
        for (converter, item) in zip(converters, values):
            converter.upgrade(item)
            colnum += 1
    except:
        raise ValueError, "I don't report the error from " \
            "_iotools.StringConverter.upgrade, but I do know that there " \
            "is a problem trying to convert a value at line %s and " \
            "column %s" % (linenum, colnum)
[...]
linenum += 1

I'd like to add line and column number information to the original error from _iotools. Any suggestions?
Cheers, Skipper
Hi, I am guessing that the converter is most likely causing the error. So presumably the file is read correctly without using the converter. If not, then you should address that first. If it is an input file, then what is it and how is genfromtxt called?

A question regarding the ticket rather than this patch: why do you want to raise an exception? The reason I did not do it was that it was helpful to identify all of the lines that have a specific problem. You cannot assume that a user will fix all lines with this problem, let alone fix all lines with similar problems.

Bruce

On Fri, Sep 25, 2009 at 2:03 PM, Bruce Southey <bsouthey@gmail.com> wrote:
On 09/25/2009 12:00 PM, Skipper Seabold wrote:
There have been some recent attempts to improve the error reporting in genfromtxt <http://projects.scipy.org/numpy/ticket/1212>, which is great, because hunting down the problems reading in big and messy files is not fun.

I am working on a patch that keeps track of the line number and column number of where you are in parsing the file, so that this can be reported in the error. Is there a way to catch a raised error and add to it?

For instance, I have a problem in my file which leads to this error being raised from np.lib._iotools.StringConverter.upgrade:

ValueError: Converter is locked and cannot be upgraded

I added this into np.lib.io.genfromtxt around line 995.

linenum = 0
[...]
if dtype is None:
    try:
        colnum = 0
        for (converter, item) in zip(converters, values):
            converter.upgrade(item)
            colnum += 1
    except:
        raise ValueError, "I don't report the error from " \
            "_iotools.StringConverter.upgrade, but I do know that there " \
            "is a problem trying to convert a value at line %s and " \
            "column %s" % (linenum, colnum)
[...]
linenum += 1

I'd like to add line and column number information to the original error from _iotools. Any suggestions?
Cheers, Skipper
Hi, I am guessing that the converter is most likely causing the error. So presumably the file is read correctly without using the converter. If not, then you should address that first.
If it is an input file, then what is it and how is genfromtxt called?
The converter is certainly causing the error, I just need to know where, so I can change it as appropriate. I've had to define hundreds of converters for qualitative data/survey responses (loops doing most of the work), so I'd rather just know where the converter is failing; then I can make the changes using the column number that is spit out in the error. Using your changes and my added exception has saved me an enormous amount of time compared with what I was doing before.
A question regarding the ticket rather than this patch: why do you want to raise an exception?
The reason I did not do it was that it was helpful to identify all of the lines that have a specific problem. You cannot assume that a user will fix all lines with this problem, let alone fix all lines with similar problems.
Good point. This would save me even more time. Perhaps then a logger is the way to go rather than print. My understanding is that print statements aren't allowed in numpy. Is this correct? Once I'm done, I will post my changes to the ticket and we can discuss some more. Skipper

Sorry all, I haven't been as responsive as I wished lately...

* About the patch: I don't like the idea of adding yet more tests in the main loop. I was more into leaving things as they are, but calling some error function if a 'setting an array element with a sequence' exception is raised. This function would take 'rows' as an input and would check the length of each row. That way, we don't slow things down when everything works, but just add some delay when they don't. I'll try to come up w/ something soon (in the next couple of weeks).

* About the converter error: there's indeed a bug in StringConverter.upgrade, I need to write some unittests to make sure I get it covered. If you could get me some sample code, that'd be great.
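(A minimal sketch of the checking function Pierre describes; all names here are hypothetical, not the actual patch:)

def check_rows(rows, nbcols):
    # report every row whose length differs from the expected nbcols
    bad = [(i, len(row)) for (i, row) in enumerate(rows)
           if len(row) != nbcols]
    if bad:
        details = "\n".join("row #%i has %i columns instead of %i"
                            % (i, n, nbcols) for (i, n) in bad)
        raise ValueError("Inconsistent number of columns:\n" + details)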

On Fri, Sep 25, 2009 at 2:17 PM, Pierre GM <pgmdevlist@gmail.com> wrote:
Sorry all, I haven't been as responsive as I wished lately...

* About the patch: I don't like the idea of adding yet more tests in the main loop. I was more into leaving things as they are, but calling some error function if a 'setting an array element with a sequence' exception is raised. This function would take 'rows' as an input and would check the length of each row. That way, we don't slow things down when everything works, but just add some delay when they don't. I'll try to come up w/ something soon (in the next couple of weeks).
Ok.
* About the converter error: there's indeed a bug in StringConverter.upgrade, I need to write some unittests to make sure I get it covered. If you could get me some sample code, that'd be great.
Hmm, I'm not sure that the error I'm seeing is the same as the bug we had previously discussed. In this case, the converters are wrong and I need to know about it. I will try to post an example of the two times I've seen this error raised when I get a minute. Skipper

On 09/25/2009 01:25 PM, Skipper Seabold wrote:
On Fri, Sep 25, 2009 at 2:17 PM, Pierre GM<pgmdevlist@gmail.com> wrote:
Sorry all, I haven't been as responsive as I wished lately...

* About the patch: I don't like the idea of adding yet more tests in the main loop. I was more into leaving things as they are, but calling some error function if a 'setting an array element with a sequence' exception is raised. This function would take 'rows' as an input and would check the length of each row. That way, we don't slow things down when everything works, but just add some delay when they don't. I'll try to come up w/ something soon (in the next couple of weeks).
Ok.
I tend to agree, but I think that the actual array() function gives a more meaningful error about mismatched data, such as indicating the row. I think that it would be too late to go back to the data and try to figure out why the exception occurred. If you wait until array() is called then you have passed up at least two opportunities to check the whole data: the data is parsed at least twice, the first in the itertools.chain loop and the second in the subsequent enumeration over rows (lines 981 and 1006 of the unpatched io.py).

Really it is a question of how useful the messages are and if (or when) genfromtxt should stop on an error. For a huge data set I can see that stopping on an error is useful because it avoids parsing all the data. But listing all the errors is also useful, especially when you can fix all the errors at once.
* About the converter error: there's indeed a bug in StringConverter.upgrade, I need to write some unittests to make sure I get it covered. If you could get me some sample code, that'd be great.
Hmm, I'm not sure that the error I'm seeing is the same as the bug we had previously discussed. In this case, the converters are wrong and I need to know about it. I will try to post an example of the two times I've seen this error raised when I get a minute.
Skipper
Please! Samples of using it would be great. Bruce

On Fri, Sep 25, 2009 at 3:34 PM, Bruce Southey <bsouthey@gmail.com> wrote: <snip>
* About the converter error: there's indeed a bug in StringConverter.upgrade, I need to write some unittests to make sure I get it covered. If you could get me some sample code, that'd be great.
Hmm, I'm not sure that the error I'm seeing is the same as the bug we had previously discussed. In this case, the converters are wrong and I need to know about it. I will try to post an example of the two times I've seen this error raised when I get a minute.
Skipper
Please! Samples of using it would be great.
As far as this goes, I added some examples to the docs wiki, but I think that genfromtxt and related would be best served by having their own wiki page that could maybe go here: <http://docs.scipy.org/numpy/docs/numpy-docs/user/>

Thoughts? I can work on it as I find time.

Also, while I'm thinking about it, I filed an enhancement ticket and patch to use the autostrip keyword to get rid of whitespace in strings: <http://projects.scipy.org/numpy/ticket/1238>

Skipper
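(Assuming the patch in ticket 1238 lands, usage would presumably look something like this:)

from StringIO import StringIO
import numpy as np

s = StringIO("A , 1\nB , 2")
data = np.genfromtxt(s, delimiter=",", dtype=None, autostrip=True)
# string fields come back as 'A', 'B' instead of 'A ', 'B '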

On Sep 25, 2009, at 3:42 PM, Skipper Seabold wrote:
As far as this goes, I added some examples to the docs wiki, but I think that genfromtxt and related would be best served by having their own wiki page that could maybe go here <http://docs.scipy.org/numpy/docs/numpy-docs/user/>
Thoughts? I can work on it as I find time.
+1
Also while I'm thinking about it, I filed an enhancement ticket and patch to use the autostrip keyword to get rid of whitespace in strings <http://projects.scipy.org/numpy/ticket/1238>
While you're at it, can you ask for adding the possibility to process a dtype like (int,int,float)? That was what I was working on before I started installing Snow Leopard...

On Fri, Sep 25, 2009 at 3:47 PM, Pierre GM <pgmdevlist@gmail.com> wrote:
On Sep 25, 2009, at 3:42 PM, Skipper Seabold wrote:
As far as this goes, I added some examples to the docs wiki, but I think that genfromtxt and related would be best served by having their own wiki page that could maybe go here <http://docs.scipy.org/numpy/docs/numpy-docs/user/>
Thoughts? I can work on it as I find time.
+1
Also while I'm thinking about it, I filed an enhancement ticket and patch to use the autostrip keyword to get rid of whitespace in strings <http://projects.scipy.org/numpy/ticket/1238>
While you're at it, can you ask for adding the possibility to process a dtype like (int,int,float)? That was what I was working on before I started installing Snow Leopard...
Sure. Should it be another keyword though, `type` maybe? dtype implies that it's a legal dtype, and I don't think (int,int,float) works, does it?

On Sep 25, 2009, at 3:51 PM, Skipper Seabold wrote:
While you're at it, can you ask for adding the possibility to process a dtype like (int,int,float)? That was what I was working on before I started installing Snow Leopard...
Sure. Should it be another keyword though, `type` maybe? dtype implies that it's a legal dtype, and I don't think (int,int,float) works, does it?
`type` would be asking for trouble. And no, (int,int,float) is not a valid dtype, but it can easily be processed as such.
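(A sketch of how a (int, int, float) tuple could be promoted to a legal structured dtype; the field names below are just the numpy defaults:)

import numpy as np

types = (int, int, float)
dt = np.dtype([('f%i' % i, t) for (i, t) in enumerate(types)])
# e.g. dtype([('f0', '<i8'), ('f1', '<i8'), ('f2', '<f8')]) on 64-bit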

On Fri, Sep 25, 2009 at 3:47 PM, Pierre GM <pgmdevlist@gmail.com> wrote:
On Sep 25, 2009, at 3:42 PM, Skipper Seabold wrote:
As far as this goes, I added some examples to the docs wiki, but I think that genfromtxt and related would be best served by having their own wiki page that could maybe go here <http://docs.scipy.org/numpy/docs/numpy-docs/user/>
Thoughts? I can work on it as I find time.
+1
The examples you put in the docstring are good I think. One more example demonstrating missing values would be useful. And +1 to a page in the user guide for anything else. Ralf

On Sep 25, 2009, at 3:56 PM, Ralf Gommers wrote:
The examples you put in the docstring are good I think. One more example demonstrating missing values would be useful. And +1 to a page in the user guide for anything else.
Check also what's done in tests/test_io.py, that gives an idea of what can be done and what cannot.

Pierre GM wrote:
That way, we don't slow things down when everything works, but just add some delay when they don't.
good goal, but if you don't keep track of where you are, wouldn't you need to re-parse the whole file to figure it out again?

Maybe a "debug" mode that the user could turn on and off would fit the need.

-Chris

On Sep 25, 2009, at 3:12 PM, Christopher Barker wrote:
Pierre GM wrote:
That way, we don't slow things down when everything works, but just add some delay when they don't.
good goal, but if you don't keep track of where you are, wouldn't you need to re-parse the whole file to figure it out again?
Indeed. But does it really matter? We're in a case where there's a problem already...
Maybe a "debug" mode that the user could turn on and off would fit the need.
Not a bad idea. Another option would be to give the user the possibility to skip the offending lines:

* we already know what number of columns to expect (nbcols)
* we check whether the current row has the correct nb of columns
* if it doesn't match, we skip or raise an exception with the corresponding line number.

But even if we skip, we need to log the line number to tell the user that there was a problem (issuing a warning?)
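(A sketch of the skip-or-raise idea above; the function and its names are hypothetical:)

import warnings

def validate_row(values, nbcols, linenum, skip=False):
    if len(values) == nbcols:
        return values
    msg = "line #%i: expected %i columns, got %i" % (linenum, nbcols,
                                                     len(values))
    if skip:
        warnings.warn(msg)   # log the problem, but keep going
        return None
    raise ValueError(msg)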

One more thought: Pierre GM wrote:
That way, we don't slow things down when everything works,
how long can it really take to increment an integer as each line is parsed? I'd suspect no one would even notice!
if you don't keep track of where you are, wouldn't you need to re-parse the whole file to figure it out again?
Indeed. But does it really matter? We're in a case where there's a problem already...
no, it doesn't.
Maybe a "debug" mode that the user could turn on and off would fit the need.
Not a bad idea. Another option would be to give the user the possibility to skip the offending lines:
In either case, I think I'd tend to use it something like this:

try:
    LoadTheFile()
except GenFromTxtException:
    LoadTheFile(debug=True)

But I suppose that block could be built in: without debug, it would simply raise an error when one was hit; with debug, it would go back and figure out more about the error and report it.

or you could have multiple error modes. error_mode is one of:

"fast": does it as fast as possible, and craps out with not much info on error
"first_error": stops on first error, and gives you some info about it.
"all_errors": keeps going after an error, and logs them all and reports back at the end.
"ignore_errors": skips any line with errors, loading the rest of the data -- I think I'd still want the error report, though.

but who's going to write that code?

-Chris
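(A rough sketch of what such an error_mode switch could look like; everything here is hypothetical, not a proposed implementation:)

def load_rows(lines, parse, error_mode="fast"):
    rows, errors = [], []
    for (i, line) in enumerate(lines):
        try:
            rows.append(parse(line))
        except ValueError, err:
            if error_mode == "fast":
                raise                 # crap out immediately, no extras
            if error_mode == "first_error":
                raise ValueError("line %i: %s" % (i, err))
            errors.append((i, str(err)))   # all_errors / ignore_errors
    if errors and error_mode == "all_errors":
        raise ValueError("bad lines: %s" % [i for (i, e) in errors])
    return rows, errors   # ignore_errors: the data plus an error report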

On Fri, Sep 25, 2009 at 3:51 PM, Christopher Barker <Chris.Barker@noaa.gov> wrote:
One more thought:
Pierre GM wrote:
That way, we don't slow things down when everything works,
how long can it really take to increment an integer as each line is parsed? I'd suspect no one would even notice!
A 1000 converters later... FWIW, I have a script that creates and savez arrays from several text files in total about 1.5 GB of text.

without the incrementing in genfromtxt
Run time: 122.043943 seconds

with the incrementing in genfromtxt
Run time: 131.698873 seconds

If we just want to always keep track of things, I would be willing to take a poorly measured 8 % slowdown, because the info that I took from the errors is the only thing that made what I was doing at all feasible.

Skipper

Skipper Seabold wrote:
FWIW, I have a script that creates and savez arrays from several text files in total about 1.5 GB of text.
without the incrementing in genfromtxt
Run time: 122.043943 seconds
with the incrementing in genfromtxt
Run time: 131.698873 seconds
If we just want to always keep track of things, I would be willing to take a poorly measured 8 % slowdown,
I also think 8% is worth it, but I'm still surprised it's that much. What additional code is inside the inner loop? (Or, I guess, the each-line loop...)

-Chris

On Mon, Sep 28, 2009 at 12:41 PM, Christopher Barker <Chris.Barker@noaa.gov> wrote:
Skipper Seabold wrote:
FWIW, I have a script that creates and savez arrays from several text files in total about 1.5 GB of text.
without the incrementing in genfromtxt
Run time: 122.043943 seconds
with the incrementing in genfromtxt
Run time: 131.698873 seconds
If we just want to always keep track of things, I would be willing to take a poorly measured 8 % slowdown,
I also think 8% is worth it, but I'm still surprised it's that much. What additional code is inside the inner loop? (Or, I guess, the each-line loop...)
-Chris
This was probably due to the way that I timed it, honestly. I only did it once. The only differences I made for that part were in the first post of the thread. Two incremented scalars for line numbers and column numbers and a try/except block. I'm really not against a debug mode if someone wants to do it, and it's deemed necessary. If it could be made to log all of the errors that would be extremely helpful. I still need to post some of my use cases though. Anything to help make data cleaning less of a chore... Skipper

On Sep 28, 2009, at 12:51 PM, Skipper Seabold wrote:
This was probably due to the way that I timed it, honestly. I only did it once. The only differences I made for that part were in the first post of the thread. Two incremented scalars for line numbers and column numbers and a try/except block.
I'm really not against a debug mode if someone wants to do it, and it's deemed necessary. If it could be made to log all of the errors that would be extremely helpful. I still need to post some of my use cases though. Anything to help make data cleaning less of a chore...
I was thinking about something this weekend: we could create a second list when looping on the rows, where we would store the length of each split row. After the loop, we can find whether these values match the expected number of columns `nbcols`, and where they don't. Then, we can decide to strip the `rows` list of its invalid values (that corresponds to skipping) or raise an exception, but in both cases we know where the problem is. My only concern is that we'd be creating yet another list of integers, which would increase memory usage. Would it be a problem? In other news, I should eventually be able to tackle that this week...
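(A sketch of the second-list idea; `rows` and `nbcols` play the roles they have in io.py, the sample data is made up:)

rows = [(1, 2, 3), (4, 5), (6, 7, 8)]   # sample split rows
nbcols = 3
lengths = [len(row) for row in rows]    # in practice built while looping
invalid = [i for (i, n) in enumerate(lengths) if n != nbcols]
if invalid:
    bad = set(invalid)
    # either strip the offending rows (i.e. skip them)...
    rows = [row for (i, row) in enumerate(rows) if i not in bad]
    # ...or raise, reporting exactly which rows are wrong:
    # raise ValueError("rows %s have the wrong number of columns" % invalid)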

On Mon, Sep 28, 2009 at 1:36 PM, Pierre GM <pgmdevlist@gmail.com> wrote:
On Sep 28, 2009, at 12:51 PM, Skipper Seabold wrote:
This was probably due to the way that I timed it, honestly. I only did it once. The only differences I made for that part were in the first post of the thread. Two incremented scalars for line numbers and column numbers and a try/except block.
I'm really not against a debug mode if someone wants to do it, and it's deemed necessary. If it could be made to log all of the errors that would be extremely helpful. I still need to post some of my use cases though. Anything to help make data cleaning less of a chore...
I was thinking about something this weekend: we could create a second list when looping on the rows, where we would store the length of each split row. After the loop, we can find whether these values match the expected number of columns `nbcols`, and where they don't. Then, we can decide to strip the `rows` list of its invalid values (that corresponds to skipping) or raise an exception, but in both cases we know where the problem is. My only concern is that we'd be creating yet another list of integers, which would increase memory usage. Would it be a problem? In other news, I should eventually be able to tackle that this week...
I don't think it would be prohibitively large. One of the datasets I was working with was about a million lines with about 500 columns in each. So, if this is how you'd actually do it, then you have:

L = [500] * 1201798
import sys
print sys.getsizeof(L)/(1000000.), "MB"
# 9.6144560000000006 MB

I can't think of a case where I would want to just skip bad rows. Also, I'd definitely like to know about each line that had problems in an error log if we're going to go through the whole file anyway. No hurry on this, just getting my thoughts out there after my experience. I will post some test cases tonight probably.

Skipper

Pierre GM wrote:
I was thinking about something this weekend: we could create a second list when looping on the rows, where we would store the length of each split row. After the loop, we can find whether these values match the expected number of columns `nbcols`, and where they don't. Then, we can decide to strip the `rows` list of its invalid values (that corresponds to skipping) or raise an exception, but in both cases we know where the problem is. My only concern is that we'd be creating yet another list of integers, which would increase memory usage. Would it be a problem?
I doubt it would be that big a deal, however... Skipper Seabold wrote:
One of the datasets I was working with was about a million lines with about 500 columns in each.
In this use case, it's clearly not a big deal, but it's probably pretty common for folks to have data sets with a smaller number of columns, maybe even two or so (I know I do sometimes). In that case, I suppose we're increasing memory usage by 50% or so, which may be an issue. Another idea: only store the indexes of the rows that have the "wrong" number of columns -- if that's a large number, then the user has bigger problems than memory usage!
I can't think of a case where I would want to just skip bad rows.
I can't either, but someone suggested it. It certainly shouldn't happen by default or without a big ol' message of some sort to the user's code.

-Chris

On Sep 29, 2009, at 12:37 PM, Christopher Barker wrote:
Another idea: only store the indexes of the rows that have the "wrong" number of columns -- if that's a large number, then the user has bigger problems than memory usage!
That was my first idea, but then it adds tests in the inside loop (which is what I'm trying to avoid)...
I can't think of a case where I would want to just skip bad rows.
I can't either, but someone suggested it. It certainly shouldn't happen by default or without a big ol' message of some sort to the user's code.
That was my intention. OK, I should be able to start working on that in the next few days. Meanwhile, it'd be great if y'all could send me some test cases (so that I can find which method works best). Cheers P.

Pierre GM wrote:
Another idea: only store the indexes of the rows that have the "wrong" number of columns -- if that's a large number, then the user has bigger problems than memory usage!
That was my first idea, but then it adds tests in the inside loop (which is what I'm trying to avoid)...
well, how does one test compare to:

read the line from the file
split the line into tokens
parse each token

I can't imagine it's significant, but I guess you only know with profiling.

How does it handle the wrong number of tokens now? if an exception is raised somewhere, then that's the only place you'd need to do anything extra anyway.
OK, I should be able to start working on that in the next few days.
cool!

-Chris

On Sep 29, 2009, at 3:28 PM, Christopher Barker wrote:
well, how does one test compare to:
read the line from the file
split the line into tokens
parse each token
I can't imagine it's significant, but I guess you only know with profiling.
That's on the parsing part. I'd like to keep it as light as possible.
How does it handle the wrong number of tokens now? if an exception is raised somewhere, then that's the only place you'd need to do anything extra anyway.
It silently fails outside the loop, when the list of split rows is converted into an array: if one row has a different length than the others, a "Creating array from a sequence" error occurs, but we can't tell where the problem is (because np.array does not tell us).

Pierre GM wrote:
How does it handle the wrong number of tokens now? if an exception is raised somewhere, then that's the only place you'd need to do anything extra anyway.
It silently fails outside the loop, when the list of split rows is converted into an array: if one row has a different length than the others, a "Creating array from a sequence" error occurs, but we can't tell where the problem is (because np.array does not tell us).
Which brings up a good point -- maybe some of this error reporting should go into np.array? It would be nice to know at least when the failure happened.

-Chris

On 09/29/2009 11:37 AM, Christopher Barker wrote:
Pierre GM wrote:
I was thinking about something this weekend: we could create a second list when looping on the rows, where we would store the length of each split row. After the loop, we can find whether these values match the expected number of columns `nbcols`, and where they don't. Then, we can decide to strip the `rows` list of its invalid values (that corresponds to skipping) or raise an exception, but in both cases we know where the problem is. My only concern is that we'd be creating yet another list of integers, which would increase memory usage. Would it be a problem?
I doubt it would be that big deal, however...
Probably more than memory is the execution time involved in printing these problem rows. There are already two loops over the data where you can measure the number of elements in the row, but the first may be more appropriate. So a simple solution is that in the first loop you could append the 'bad' rows to one list and append the 'good' rows to the existing row list, or just store the row number that is bad. Untested code for the corresponding part of io.py:

row_bad = []           # store bad rows
bad_row_numbers = []   # store just the row number
# simple row counter that probably should be the first data row,
# not the first line of the file
row_number = 0
for line in itertools.chain([first_line, ], fhd):
    values = split_line(line)
    # Skip an empty line
    if len(values) == 0:
        continue
    # Select only the columns we need
    if usecols:
        values = [values[_] for _ in usecols]
    # Check whether we need to update the converter
    if dtype is None:
        for (converter, item) in zip(converters, values):
            converter.upgrade(item)
    if len(values) != nbcols:
        # store the bad row so the user can search for that line
        row_bad.append(line)
        # store just the bad row number so the user can go to the
        # appropriate line(s) in the file
        bad_row_numbers.append(row_number)
    else:
        append_to_rows(tuple(values))
    row_number = row_number + 1

Note I assume that nbcols is the expected number of columns, but I seem to be one off with my counting. Then, if len(row_bad) is greater than zero, you could print out a warning and the rows, and then raise an exception or continue. The problem with continuing is that a user may not be aware that there is a warning.

Bruce

On Sep 29, 2009, at 1:57 PM, Bruce Southey wrote:
On 09/29/2009 11:37 AM, Christopher Barker wrote:
Pierre GM wrote:
Probably more than memory is the execution time involved in printing these problem rows.
The rows with problems will be printed outside the loop (with at least an associated warning, or possibly raising an exception). My concern is whether to store only the tuples (index of the row, nb of columns) for the invalid rows, or just create a list of nb of columns that I'd parse afterwards. The first solution requires an extra test in the loop; the second may waste some memory space. Bah, I'll figure it out. Please send me some test cases so that I can time/test the best option.

On 09/29/2009 01:30 PM, Pierre GM wrote:
On Sep 29, 2009, at 1:57 PM, Bruce Southey wrote:
On 09/29/2009 11:37 AM, Christopher Barker wrote:
Pierre GM wrote:
Probably more than memory is the execution time involved in printing these problem rows.
The rows with problems will be printed outside the loop (with at least an associated warning, or possibly raising an exception). My concern is whether to store only the tuples (index of the row, nb of columns) for the invalid rows, or just create a list of nb of columns that I'd parse afterwards. The first solution requires an extra test in the loop; the second may waste some memory space. Bah, I'll figure it out. Please send me some test cases so that I can time/test the best option.
Hi, The first case just has to handle a missing delimiter -- actually I expect that most of my cases would relate to this. So here is simple Python code to generate an arbitrarily large list with the occasional missing delimiter. I set it so it reads the desired number of rows and the frequency of bad rows from the Linux command line:

$ time python tbig.py 1000000 100000

If I comment out the extra prints in io.py that I put in, it takes about 22 seconds to finish if the delimiters are correct. If I have the missing delimiter, it takes 20.5 seconds to crash.

Bruce

On Tue, Sep 29, 2009 at 4:36 PM, Bruce Southey <bsouthey@gmail.com> wrote: <snip>
Hi, The first case just has to handle a missing delimiter -- actually I expect that most of my cases would relate to this. So here is simple Python code to generate an arbitrarily large list with the occasional missing delimiter.
I set it so it reads the desired number of rows and the frequency of bad rows from the Linux command line:

$ time python tbig.py 1000000 100000

If I comment out the extra prints in io.py that I put in, it takes about 22 seconds to finish if the delimiters are correct. If I have the missing delimiter, it takes 20.5 seconds to crash.
Bruce
I think this would actually cover most of the problems I was running into. The only other one I can think of is when I used a converter that I thought would work, but it got unexpected data. For example,

from StringIO import StringIO
import numpy as np

strip_rand = lambda x : float(('r' in x.lower() and x.split()[-1]) or
                              (not 'r' in x.lower() and x.strip() or 0.0))

# Example usage
strip_rand('R 40')
strip_rand(' ')
strip_rand('')
strip_rand('40')

strip_per = lambda x : float(('%' in x.lower() and x.split()[0]) or
                             (not '%' in x.lower() and x.strip() or 0.0))

# Example usage
strip_per('7 %')
strip_per('7')
strip_per(' ')
strip_per('')

# Unexpected usage
strip_per('R 1')

s = StringIO('D01N01,10/1/2003 ,1 %,R 75,400,600\r\nL24U05,12/5/2003 ,2 %,1,300, 150.5\r\nD02N03,10/10/2004 ,R 1,,7,145.55')

data = np.genfromtxt(s, converters={2 : strip_per, 3 : strip_rand},
                     delimiter=",", dtype=None)

I don't have a clean install right now, but I think this returned a "converter is locked for upgrading" error. I would just like to know where the problem occurred (line and column, preferably not zero-indexed), so I can go and have a look at my data.

One more note: being able to autostrip whitespace turned out to be very helpful. I didn't realize how much memory strings of spaces could take up, and as soon as I turned this on, I was able to process an array with a lot of whitespace without filling up my memory. So I think maybe autostrip should be turned on by default?

I will post anything else if it occurs to me.

Skipper

On 09/30/2009 10:22 AM, Skipper Seabold wrote:
On Tue, Sep 29, 2009 at 4:36 PM, Bruce Southey<bsouthey@gmail.com> wrote: <snip>
Hi, The first case just has to handle a missing delimiter -- actually I expect that most of my cases would relate to this. So here is simple Python code to generate an arbitrarily large list with the occasional missing delimiter.
I set it so it reads the desired number of rows and the frequency of bad rows from the Linux command line:

$ time python tbig.py 1000000 100000

If I comment out the extra prints in io.py that I put in, it takes about 22 seconds to finish if the delimiters are correct. If I have the missing delimiter, it takes 20.5 seconds to crash.
Bruce
I think this would actually cover most of the problems I was running into. The only other one I can think of is when I used a converter that I thought would work, but it got unexpected data. For example,
from StringIO import StringIO
import numpy as np

strip_rand = lambda x : float(('r' in x.lower() and x.split()[-1]) or
                              (not 'r' in x.lower() and x.strip() or 0.0))

# Example usage
strip_rand('R 40')
strip_rand(' ')
strip_rand('')
strip_rand('40')

strip_per = lambda x : float(('%' in x.lower() and x.split()[0]) or
                             (not '%' in x.lower() and x.strip() or 0.0))

# Example usage
strip_per('7 %')
strip_per('7')
strip_per(' ')
strip_per('')

# Unexpected usage
strip_per('R 1')
Does this work for you? I get:

ValueError: invalid literal for float(): R 1
s = StringIO('D01N01,10/1/2003 ,1 %,R 75,400,600\r\nL24U05,12/5/2003 ,2 %,1,300, 150.5\r\nD02N03,10/10/2004 ,R 1,,7,145.55')
Can you provide the correct line before the bad line? It just makes it easy to understand why a line is bad.
data = np.genfromtxt(s, converters = {2 : strip_per, 3 : strip_rand}, delimiter=",", dtype=None)
I don't have a clean install right now, but I think this returned a converter is locked for upgrading error. I would just like to know where the problem occured (line and column, preferably not zero-indexed), so I can go and have a look at my data.
I have a rather limited understanding here. I think the problem is that Python is raising a ValueError because your strip_per() is wrong. It is not informative to you because _iotools.py is not aware that an invalid converter will raise a ValueError. Therefore there needs to be some way to test whether the converter is correct or not. In this case I think it is the delimiter, so checking the column numbers should occur before the application of the converter to that row.

Bruce

On Wed, Sep 30, 2009 at 12:56 PM, Bruce Southey <bsouthey@gmail.com> wrote:
On 09/30/2009 10:22 AM, Skipper Seabold wrote:
On Tue, Sep 29, 2009 at 4:36 PM, Bruce Southey<bsouthey@gmail.com> wrote: <snip>
Hi, The first case just has to handle a missing delimiter -- actually I expect that most of my cases would relate to this. So here is simple Python code to generate an arbitrarily large list with the occasional missing delimiter.
I set it so it reads the desired number of rows and the frequency of bad rows from the Linux command line:

$ time python tbig.py 1000000 100000

If I comment out the extra prints in io.py that I put in, it takes about 22 seconds to finish if the delimiters are correct. If I have the missing delimiter, it takes 20.5 seconds to crash.
Bruce
I think this would actually cover most of the problems I was running into. The only other one I can think of is when I used a converter that I thought would work, but it got unexpected data. For example,
from StringIO import StringIO
import numpy as np

strip_rand = lambda x : float(('r' in x.lower() and x.split()[-1]) or
                              (not 'r' in x.lower() and x.strip() or 0.0))

# Example usage
strip_rand('R 40')
strip_rand(' ')
strip_rand('')
strip_rand('40')

strip_per = lambda x : float(('%' in x.lower() and x.split()[0]) or
                             (not '%' in x.lower() and x.strip() or 0.0))

# Example usage
strip_per('7 %')
strip_per('7')
strip_per(' ')
strip_per('')

# Unexpected usage
strip_per('R 1')
Does this work for you? I get:

ValueError: invalid literal for float(): R 1
No, that's the idea. Sorry this was a bit opaque.
s = StringIO('D01N01,10/1/2003 ,1 %,R 75,400,600\r\nL24U05,12/5/2003 ,2 %,1,300, 150.5\r\nD02N03,10/10/2004 ,R 1,,7,145.55')
Can you provide the correct line before the bad line? It just makes it easy to understand why a line is bad.
The idea is that I have a column which I expect to be percentages, but these are coded in by different data collectors, so some code a 0 for 0, some just leave it missing (which could just as well be 0), and some use the %. What I didn't expect was that some put in a money amount, hence the 'R 1', which my converter doesn't catch.
data = np.genfromtxt(s, converters = {2 : strip_per, 3 : strip_rand}, delimiter=",", dtype=None)
I don't have a clean install right now, but I think this returned a converter is locked for upgrading error. I would just like to know where the problem occured (line and column, preferably not zero-indexed), so I can go and have a look at my data.
I have a rather limited understanding here. I think the problem is that Python is raising a ValueError because your strip_per() is wrong. It is not informative to you because _iotools.py is not aware that an invalid converter will raise a ValueError. Therefore there needs to be some way to test whether the converter is correct or not.
_iotools does catch this, I believe, though I don't understand the upgrading and locking properly. The kludgy fix that I provided in the first post ("I don't report the error from _iotools.StringConverter...") catches that an error is raised from _iotools and tells me exactly where the converter fails, so I can go to, say, line 750,000, column 250 (and converter with key 249) instead of not knowing anything except that one of my ~500 converters failed somewhere in a 1 million line data file. If you still want to keep the error messages from _iotools.StringConverter, then maybe they could have a (%s, %s) added, and this could be filled in in genfromtxt when you know (line, column), or something similar, as was kind of suggested in a post in this thread I believe. Then again, this might not be possible. I haven't tried. (See the sketch after this message.)
In this case I think it is the delimiter, so checking the column numbers should occur before the application of the converter to that row.
Sometimes it was the case where I had an extra comma in a number, 1,000 say, and then the converter tried to work on the wrong column; and sometimes it was because my converter didn't cover every use case, because I didn't know it yet. Either way, I just needed a gentle nudge in the right direction. If that doesn't clear up what I was after, I can try to provide a more detailed code sample.

Skipper
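(A minimal sketch of the fill-in-the-position idea from the message above; the position variables are hypothetical, and as noted earlier in the thread, re-raising a new exception like this loses the original traceback:)

linenum, colnum = 3, 2    # hypothetical position tracked by genfromtxt
item = "R 1"
try:
    float(item)           # stand-in for converter.upgrade(item)
except ValueError, err:
    # append the position info that _iotools itself cannot know
    raise ValueError("%s (line %s, column %s)" % (err, linenum, colnum))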

On 09/30/2009 12:44 PM, Skipper Seabold wrote:
On Wed, Sep 30, 2009 at 12:56 PM, Bruce Southey<bsouthey@gmail.com> wrote:
On 09/30/2009 10:22 AM, Skipper Seabold wrote:
On Tue, Sep 29, 2009 at 4:36 PM, Bruce Southey<bsouthey@gmail.com> wrote: <snip>
Hi, The first case just has to handle a missing delimiter -- actually I expect that most of my cases would relate to this. So here is simple Python code to generate an arbitrarily large list with the occasional missing delimiter.
I set it so it reads the desired number of rows and the frequency of bad rows from the Linux command line:

$ time python tbig.py 1000000 100000

If I comment out the extra prints in io.py that I put in, it takes about 22 seconds to finish if the delimiters are correct. If I have the missing delimiter, it takes 20.5 seconds to crash.
Bruce
I think this would actually cover most of the problems I was running into. The only other one I can think of is when I used a converter that I thought would work, but it got unexpected data. For example,
from StringIO import StringIO
import numpy as np

strip_rand = lambda x : float(('r' in x.lower() and x.split()[-1]) or
                              (not 'r' in x.lower() and x.strip() or 0.0))

# Example usage
strip_rand('R 40')
strip_rand(' ')
strip_rand('')
strip_rand('40')

strip_per = lambda x : float(('%' in x.lower() and x.split()[0]) or
                             (not '%' in x.lower() and x.strip() or 0.0))

# Example usage
strip_per('7 %')
strip_per('7')
strip_per(' ')
strip_per('')

# Unexpected usage
strip_per('R 1')
Does this work for you? I get:

ValueError: invalid literal for float(): R 1
No, that's the idea. Sorry this was a bit opaque.
s = StringIO('D01N01,10/1/2003 ,1 %,R 75,400,600\r\nL24U05,12/5/2003 ,2 %,1,300, 150.5\r\nD02N03,10/10/2004 ,R 1,,7,145.55')
Can you provide the correct line before the bad line? It just makes it easy to understand why a line is bad.
The idea is that I have a column which I expect to be percentages, but these are coded in by different data collectors, so some code a 0 for 0, some just leave it missing (which could just as well be 0), and some use the %. What I didn't expect was that some put in a money amount, hence the 'R 1', which my converter doesn't catch.
data = np.genfromtxt(s, converters = {2 : strip_per, 3 : strip_rand}, delimiter=",", dtype=None)
I don't have a clean install right now, but I think this returned a converter is locked for upgrading error. I would just like to know where the problem occured (line and column, preferably not zero-indexed), so I can go and have a look at my data.
I have a rather limited understanding here. I think the problem is that Python is raising a ValueError because your strip_per() is wrong. It is not informative to you because _iotools.py is not aware that an invalid converter will raise a ValueError. Therefore there needs to be some way to test whether the converter is correct or not.
_iotools does catch this, I believe, though I don't understand the upgrading and locking properly. The kludgy fix that I provided in the first post ("I don't report the error from _iotools.StringConverter...") catches that an error is raised from _iotools and tells me exactly where the converter fails, so I can go to, say, line 750,000, column 250 (and converter with key 249) instead of not knowing anything except that one of my ~500 converters failed somewhere in a 1 million line data file. If you still want to keep the error messages from _iotools.StringConverter, then maybe they could have a (%s, %s) added, and this could be filled in in genfromtxt when you know (line, column), or something similar, as was kind of suggested in a post in this thread I believe. Then again, this might not be possible. I haven't tried.
I added another patch to ticket 1212: http://projects.scipy.org/numpy/ticket/1212

I tried to rework my first patch because I had forgotten that the header of the file that I was using was missing a delimiter. (Something I need to investigate more.) Hopefully it helps towards a better solution. While not the best solution, I added a try/except block around the 'converter.upgrade(item)' line, which appears to provide the results for your file. In addition, I modified the loop to enumerate the converter list so I could find which converter in the list fails. The output for your example:

Row Number: 3 Failed Converter 2 in list of converters
[('D01N01', '10/1/2003 ', 1.0, 75.0, 400, 600.0)
 ('L24U05', '12/5/2003', 2.0, 1.0, 300, 150.5)
 ('D02N03', '10/10/2004 ', 0.0, 0.0, 7, 145.55000000000001)]
This this case I think it is the delimiter so checking the column numbers should occur before the application of the converter to that row.
Sometimes it was the case where I had an extra comma in a number, 1,000 say, and then the converter tried to work on the wrong column; and sometimes it was because my converter didn't cover every use case, because I didn't know it yet. Either way, I just needed a gentle nudge in the right direction.
If that doesn't clear up what I was after, I can try to provide a more detailed code sample.
Skipper
I do not see how to write code to determine when a delimiter has more than one meaning. When there are more columns than expected, it can be very hard to determine which column is incorrect without additional information. We might be able to do that if we associate a format with each column, but then you would have to split the columns one by one and check each one as you do so. Probably not hard to do, but a lot of work to validate.

For example, I have numerous problems with dates in SAS because you have 2- or 4-digit years and 1- or 2-digit days and months. Any variation from what is expected leads to errors, e.g. if it expects 2-digit years and gets a 4-digit year. So I usually read dates as strings and then parse them as I want.

Bruce

On Tue, Sep 29, 2009 at 4:36 PM, Bruce Southey <bsouthey@gmail.com> wrote:
Hi, The first case just has to handle a missing delimiter -- actually I expect that most of my cases would relate to this. So here is simple Python code to generate an arbitrarily large list with the occasional missing delimiter.
I set it so it reads the desired number of rows and the frequency of bad rows from the Linux command line:

$ time python tbig.py 1000000 100000

If I comment out the extra prints in io.py that I put in, it takes about 22 seconds to finish if the delimiters are correct. If I have the missing delimiter, it takes 20.5 seconds to crash.
One other point that perhaps goes without saying is that we want to detect both missing and extra delimiters (e.g., commas in thousands like 1,000).

Skipper
participants (5): Bruce Southey, Christopher Barker, Pierre GM, Ralf Gommers, Skipper Seabold