
On 09/30/2009 12:44 PM, Skipper Seabold wrote:
On Wed, Sep 30, 2009 at 12:56 PM, Bruce Southey<bsouthey@gmail.com> wrote:
On 09/30/2009 10:22 AM, Skipper Seabold wrote:
On Tue, Sep 29, 2009 at 4:36 PM, Bruce Southey<bsouthey@gmail.com> wrote: <snip>
Hi, The first case just has to handle a missing delimiter - actually I expect that most of my cases would relate to this. So here is simple Python code to generate an arbitrarily large list with the occasional missing delimiter.
I set it so it reads the desired number of rows and the frequency of bad rows from the Linux command line:
$ time python tbig.py 1000000 100000
If I comment out the extra prints in io.py that I put in, it takes about 22 seconds to finish if the delimiters are correct. If I have the missing delimiter it takes 20.5 seconds to crash.
Bruce
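The generator script itself is not shown in the thread. As a rough sketch only, a test file of that kind could be produced as below. The function name make_rows, its parameters, and the choice to fuse the first two fields are all assumptions for illustration, not the contents of tbig.py. It is written for Python 3.

```python
import random


def make_rows(n_rows, bad_every, n_cols=5, delimiter=",", seed=0):
    """Yield n_rows delimited lines; every bad_every-th line lacks one delimiter.

    Hypothetical helper: the real tbig.py is not shown in the thread.
    """
    rng = random.Random(seed)
    for i in range(1, n_rows + 1):
        fields = ["%.3f" % rng.random() for _ in range(n_cols)]
        if bad_every and i % bad_every == 0:
            # Fuse the first two fields to simulate a dropped delimiter.
            fields[0:2] = [fields[0] + " " + fields[1]]
        yield delimiter.join(fields)


# Small demonstration: 10 rows, every 5th row is bad.
rows = list(make_rows(10, 5))
```

With the invocation quoted above, the two command-line arguments would presumably map to n_rows and bad_every.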
I think this would actually cover most of the problems I was running into. The only other one I can think of is when I used a converter that I thought would work, but it got unexpected data. For example,
from StringIO import StringIO
import numpy as np

strip_rand = lambda x : float(('r' in x.lower() and x.split()[-1]) or (not 'r' in x.lower() and x.strip() or 0.0))

# Example usage
strip_rand('R 40')
strip_rand(' ')
strip_rand('')
strip_rand('40')

strip_per = lambda x : float(('%' in x.lower() and x.split()[0]) or (not '%' in x.lower() and x.strip() or 0.0))

# Example usage
strip_per('7 %')
strip_per('7')
strip_per(' ')
strip_per('')

# Unexpected usage
strip_per('R 1')
Does this work for you? I get: ValueError: invalid literal for float(): R 1
No, that's the idea. Sorry this was a bit opaque.
s = StringIO('D01N01,10/1/2003 ,1 %,R 75,400,600\r\nL24U05,12/5/2003\ ,2 %,1,300, 150.5\r\nD02N03,10/10/2004 ,R 1,,7,145.55')
Can you provide the correct line before the bad line? It just makes it easy to understand why a line is bad.
The idea is that I have a column which I expect to be percentages, but these are coded by different data collectors: some code a 0 for 0, some just leave it missing (which could just as well be 0), and some use the %. What I didn't expect was that some put in a money amount, hence the 'R 1', which my converter doesn't catch.
data = np.genfromtxt(s, converters = {2 : strip_per, 3 : strip_rand}, delimiter=",", dtype=None)
I don't have a clean install right now, but I think this returned a "converter is locked for upgrading" error. I would just like to know where the problem occurred (line and column, preferably not zero-indexed), so I can go and have a look at my data.
I have a rather limited understanding here. I think the problem is that Python is raising a ValueError because your strip_per() is wrong. The error is not informative to you because _iotools.py is not aware that an invalid converter will raise a ValueError. Therefore there needs to be some way to test whether the converter is correct.
_iotools does catch this, I believe, though I don't properly understand the upgrading and locking. The kludgy fix that I provided in the first post ("I do not report the error from _iotools.StringConverter...") catches the error raised from _iotools and tells me exactly where the converter fails, so I can go to, say, line 750,000, column 250 (the converter with key 249) instead of knowing nothing except that one of my ~500 converters failed somewhere in a 1 million line data file. If you still want to keep the error messages from _iotools.StringConverter, then maybe they could have a (%s, %s) added, which could be filled in by genfromtxt once (line, column) is known, or something similar, as was suggested in a post in this thread, I believe. Then again, this might not be possible. I haven't tried.
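To make the request concrete, here is a minimal pure-Python sketch of the idea: apply the converters column by column and re-raise any failure with a 1-based line and column. It does not touch genfromtxt or _iotools. The names convert_rows and percent_to_float are hypothetical, and percent_to_float is a simplified stand-in for strip_per, not the converter from this thread. Written for Python 3.

```python
def percent_to_float(text):
    """Simplified stand-in converter: '7 %' -> 7.0, '' -> 0.0."""
    text = text.replace("%", "").strip()
    return float(text) if text else 0.0


def convert_rows(lines, converters, delimiter=","):
    """Apply {column_index: function} converters to each delimited line.

    On failure, raise ValueError naming the 1-based line and column,
    plus the zero-based converter key.
    """
    out = []
    for line_no, line in enumerate(lines, start=1):
        fields = line.rstrip("\r\n").split(delimiter)
        for key, func in sorted(converters.items()):
            try:
                fields[key] = func(fields[key])
            except (ValueError, IndexError) as err:
                raise ValueError(
                    "line %d, column %d (converter key %d): %s"
                    % (line_no, key + 1, key, err)
                )
        out.append(fields)
    return out


# Rows modelled on the sample data in this thread.
sample = [
    "D01N01,10/1/2003 ,1 %,R 75,400,600",
    "L24U05,12/5/2003 ,2 %,1,300, 150.5",
    "D02N03,10/10/2004 ,R 1,,7,145.55",
]
```

Calling convert_rows(sample, {2: percent_to_float}) should fail on the third row, with a message that begins "line 3, column 3 (converter key 2)".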
I added another patch to ticket 1212:
http://projects.scipy.org/numpy/ticket/1212
I tried to rework my first patch because I had forgotten that the header of the file that I was using was missing a delimiter. (Something I need to investigate more.) Hopefully it helps towards a better solution. I added a try/except block around the 'converter.upgrade(item)' line, which appears to provide the results for your file, though it is not the best solution. In addition, I modified the loop to enumerate the converter list so I could find which one in the list fails. The output for your example:
Row Number: 3 Failed Converter 2 in list of converters
[('D01N01', '10/1/2003 ', 1.0, 75.0, 400, 600.0)
 ('L24U05', '12/5/2003', 2.0, 1.0, 300, 150.5)
 ('D02N03', '10/10/2004 ', 0.0, 0.0, 7, 145.55000000000001)]
In this case I think it is the delimiter, so checking the number of columns should occur before the converter is applied to that row.
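A check of that kind might look like the sketch below, which counts fields per row before any converter runs. The function name find_bad_rows is hypothetical, and taking the first row as the reference count when none is given is an assumption. Written for Python 3.

```python
def find_bad_rows(lines, delimiter=",", expected=None):
    """Return [(line_number, n_fields)] for rows with an unexpected field count.

    Line numbers are 1-based. If expected is None, the first row's
    field count is used as the reference (an assumption of this sketch).
    """
    bad = []
    for line_no, line in enumerate(lines, start=1):
        n_fields = len(line.rstrip("\r\n").split(delimiter))
        if expected is None:
            expected = n_fields
        if n_fields != expected:
            bad.append((line_no, n_fields))
    return bad


# One row with an extra delimiter, one with a missing delimiter.
report = find_bad_rows(["a,b,c", "1,2,3", "1,000,2,3", "4 5,6"])
```

This only flags the row. It cannot say which column is at fault, since a missing or extra delimiter shifts every field after it.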
Sometimes I had an extra comma in a number (1,000, say), so the converter tried to work on the wrong column; sometimes my converter didn't cover every use case, because I didn't know all the cases yet. Either way, I just needed a gentle nudge in the right direction.
If that doesn't clear up what I was after, I can try to provide a more detailed code sample.
Skipper
I do not see how to write code to determine when a delimiter has more than one meaning. When there are more columns than expected, it can be very hard to determine which column is incorrect without additional information. We might be able to do that if we associate a format with each column. But then you would have to split the columns one by one, checking each one as you go. Probably not hard to do, but a lot of work to validate. For example, I have numerous problems with dates in SAS because you can have 2 or 4 digit years and 1 or 2 digit days and months. Any variation from what is expected leads to errors, for example if it expects 2 digit years and gets a 4 digit year. So I usually read dates as strings and then parse them as I want.
Bruce
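The "read as strings, then parse" approach for dates can be sketched as follows: try a short list of formats in turn and fail loudly if none matches. The name parse_date and the two month/day/year formats are assumptions chosen to match the sample dates in this thread, not anything from SAS or numpy. Written for Python 3.

```python
from datetime import datetime

# Assumed formats: month/day/year with a 4-digit or a 2-digit year.
DATE_FORMATS = ("%m/%d/%Y", "%m/%d/%y")


def parse_date(text, formats=DATE_FORMATS):
    """Parse a date string by trying each format in turn.

    Surrounding whitespace is ignored. Raises ValueError if no format fits.
    """
    text = text.strip()
    for fmt in formats:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    raise ValueError("unrecognised date: %r" % text)


four_digit = parse_date("10/1/2003 ")
two_digit = parse_date("12/5/03")
```

Both 1 and 2 digit days and months are accepted by strptime, so only the year width needs separate formats here.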