Re: [Numpy-discussion] Question about improving genfromtxt errors

Oct. 2, 2009

      On 09/30/2009 12:44 PM, Skipper Seabold wrote:
...
On Wed, Sep 30, 2009 at 12:56 PM, Bruce Southey<bsouthey@gmail.com>  wrote:
...
On 09/30/2009 10:22 AM, Skipper Seabold wrote:
...
On Tue, Sep 29, 2009 at 4:36 PM, Bruce Southey<bsouthey@gmail.com>    wrote:
<snip>
...
Hi,
The first case just has to handle a missing delimiter - actually I expect
that most of my cases would relate to this. So here is simple Python code to
generate an arbitrarily large list with the occasional missing delimiter.
I set it so it reads the desired number of rows and frequency of bad rows
from the Linux command line.
$ time python tbig.py 1000000 100000
If I comment out the extra prints in io.py that I put in, it takes about 22
seconds to finish if the delimiters are correct. If I have the missing
delimiter it takes 20.5 seconds to crash.
Bruce
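The tbig.py script itself is not included in this message. As a rough illustration only, a generator for this kind of test data could look like the sketch below (Python 3). The function name make_rows, the column count, and the way a delimiter is dropped are all assumptions, not Bruce's actual code.

```python
import random


def make_rows(n_rows, bad_every, n_cols=5, delimiter=","):
    """Yield n_rows delimited lines; every bad_every-th line lacks one delimiter.

    Hypothetical helper: the layout (integer fields, fixed column count)
    is assumed for illustration.
    """
    for i in range(1, n_rows + 1):
        fields = [str(random.randint(0, 999)) for _ in range(n_cols)]
        if bad_every and i % bad_every == 0:
            # Fuse the first two fields to simulate a missing delimiter.
            fields[0:2] = [fields[0] + fields[1]]
        yield delimiter.join(fields)
```

Writing the output of make_rows(1000000, 100000) to a file, one line per row, would give a file of the size discussed above with a short row every 100,000 lines.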
I think this would actually cover most of the problems I was running
into.  The only other one I can think of is when I used a converter
that I thought would work, but it got unexpected data.  For example,
from StringIO import StringIO
import numpy as np
strip_rand = lambda x : float(('r' in x.lower() and x.split()[-1]) or
                              ('r' not in x.lower() and x.strip() or 0.0))
# Example usage
strip_rand('R 40')
strip_rand('  ')
strip_rand('')
strip_rand('40')
strip_per = lambda x : float(('%' in x.lower() and x.split()[0]) or
                             ('%' not in x.lower() and x.strip() or 0.0))
# Example usage
strip_per('7 %')
strip_per('7')
strip_per(' ')
strip_per('')
# Unexpected usage
strip_per('R 1')
Does this work for you?
I get an:
ValueError: invalid literal for float(): R 1
No, that's the idea.  Sorry this was a bit opaque.
...
...
s = StringIO('D01N01,10/1/2003 ,1 %,R 75,400,600\r\nL24U05,12/5/2003\
,2 %,1,300, 150.5\r\nD02N03,10/10/2004 ,R 1,,7,145.55')
Can you provide the correct line before the bad line?
It just makes it easy to understand why a line is bad.
The idea is that I have a column, which I expect to be percentages,
but these are coded in by different data collectors, so some code a 0
for 0, some just leave it missing which could just as well be 0, some
use the %.  What I didn't expect was that some put in a money amount,
hence the 'R 1', which my converter doesn't catch.
...
...
data = np.genfromtxt(s, converters = {2 : strip_per, 3 : strip_rand},
delimiter=",", dtype=None)
I don't have a clean install right now, but I think this returned a
"converter is locked for upgrading" error.  I would just like to know
where the problem occurred (line and column, preferably not
zero-indexed), so I can go and have a look at my data.
I have a rather limited understanding here. I think the problem is that Python
is raising a ValueError because your strip_per() is wrong. It is not
informative to you because _iotools.py is not aware that an invalid
converter will raise a ValueError. Therefore there needs to be some way
to test that the converter is correct or not.
_iotools does catch this I believe, though I don't understand the
upgrading and locking properly.  The kludgy fix that I provided in the
first post "I do not report the error from
_iotools.StringConverter...", catches that an error is raised from
_iotools and tells me exactly where the converter fails, so I can go
to, say line 750,000 column 250 (and converter with key 249) instead
of not knowing anything except that one of my ~500 converters failed
somewhere in a 1 million line data file.  If you still want to keep
the error messages from _iotools.StringConverter, then maybe they
could have a (%s, %s) added and then this can be filled in by
genfromtxt when you know (line, column) or something similar as was
kind of suggested in a post in this thread I believe.  Then again,
this might not be possible.  I haven't tried.
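To make the (line, column) idea concrete, here is a minimal sketch in plain Python 3 that does not touch genfromtxt or _iotools at all. The function convert_rows is hypothetical; it only shows how a converter failure can be re-raised with a 1-indexed row and column plus the 0-indexed converter key.

```python
def convert_rows(text, converters, delimiter=","):
    """Apply per-column converters and report where a failure happens.

    converters maps 0-indexed column numbers to callables, mirroring the
    converters argument of genfromtxt. Rows and columns in the error
    message are 1-indexed. Hypothetical helper, for illustration only.
    """
    rows = []
    for row_no, line in enumerate(text.splitlines(), start=1):
        fields = line.split(delimiter)
        for key, func in converters.items():
            try:
                fields[key] = func(fields[key])
            except (ValueError, IndexError) as exc:
                # Re-raise with the location so the bad cell can be found.
                raise ValueError(
                    "row %d, column %d (converter key %d): %s"
                    % (row_no, key + 1, key, exc)
                ) from exc
        rows.append(fields)
    return rows
```

With a percent converter on key 2 and the three-line sample from this thread, the error would point at row 3, column 3, which is where the 'R 1' sits.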
I added another patch to ticket 1212
http://projects.scipy.org/numpy/ticket/1212

I tried to rework my first patch because I had forgotten that the header 
of the file that I was using was missing a delimiter. (Something I need 
to investigate more.) Hopefully it helps towards a better solution.

I added a try/except block around the 'converter.upgrade(item)' line, 
which appears to provide the results for your file, though it is not the 
best solution. In addition, I modified the loop to enumerate the converter 
list so I could find which one in the list fails. The output for your 
example:

Row Number: 3 Failed Converter 2 in list of converters
[('D01N01', '10/1/2003 ', 1.0, 75.0, 400, 600.0)
  ('L24U05', '12/5/2003', 2.0, 1.0, 300, 150.5)
  ('D02N03', '10/10/2004 ', 0.0, 0.0, 7, 145.55000000000001)]
...
...
In this case I think it is the delimiter so checking the column
numbers should occur before the application of the converter to that row.
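A column-count check of this kind can be sketched independently of genfromtxt. The helper below (Python 3) is hypothetical and assumes a simple delimiter with no quoting; it just lists the rows whose field count differs from the expected one, using 1-indexed row numbers.

```python
def find_bad_rows(lines, expected_cols, delimiter=","):
    """Return (row number, field count) for rows with an unexpected width.

    Row numbers are 1-indexed. Quoted fields are not handled; this is an
    illustrative sketch, not what genfromtxt does internally.
    """
    bad = []
    for row_no, line in enumerate(lines, start=1):
        n_fields = len(line.rstrip("\r\n").split(delimiter))
        if n_fields != expected_cols:
            bad.append((row_no, n_fields))
    return bad
```

Running this before any converter is applied separates "wrong number of columns" problems from "converter got unexpected data" problems.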
Sometimes it was the case where I had an extra comma in a number 1,000
say and then the converter tried to work on the wrong column, and
sometimes it was because my converter didn't cover every use case,
because I didn't know it yet.  Either way, I just needed a gentle
nudge in the right direction.
If that doesn't clear up what I was after, I can try to provide a more
detailed code sample.
Skipper
_______________________________________________
I do not see how to write code to determine when a delimiter has more 
than one meaning. When there are more columns than expected, it can be 
very hard to determine which column is incorrect without additional 
information. We might be able to do that if we associate a format with 
each column. But then you would have to split columns one by one, checking 
each one as you do so. Probably not hard to do but a lot of work to 
validate it. For example, I have numerous problems with dates in SAS 
because you have 2 or 4 digit years, 1 or 2 digit days and months. But 
any variation from what is expected leads to errors, for example if it 
expects 2 digit years and gets a 4 digit year. So I usually read dates as 
strings and then parse them as I want.

Bruce
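The "read dates as strings, then parse" approach could look something like the sketch below (Python 3). The list of candidate formats is an assumption chosen to match the month/day/year dates in the sample data earlier in the thread; parse_date is a hypothetical helper.

```python
from datetime import datetime

# Candidate layouts, tried in order. This list is assumed for illustration;
# real data may need other formats.
DATE_FORMATS = ("%m/%d/%Y", "%m/%d/%y")


def parse_date(text, formats=DATE_FORMATS):
    """Parse a date string by trying each format in turn.

    Raises ValueError if no format matches, so bad cells are not
    silently accepted.
    """
    cleaned = text.strip()
    for fmt in formats:
        try:
            return datetime.strptime(cleaned, fmt).date()
        except ValueError:
            continue
    raise ValueError("unrecognized date: %r" % text)
```

Here the date column would be read as a plain string first and parse_date applied afterwards, so a 2 digit or 4 digit year is handled explicitly rather than guessed during loading.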