
Pierre GM wrote:
I was thinking about something this weekend: we could create a second list when looping over the rows, where we would store the length of each split row. After the loop, we can check whether these values match the expected number of columns `nbcols`, and where they don't. Then we can decide to strip the `rows` list of its invalid entries (which corresponds to skipping) or raise an exception, but in both cases we know where the problem is. My only concern is that we'd be creating yet another list of integers, which would increase memory usage. Would that be a problem? A sketch of the idea follows below.
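[Editor's note: a minimal Python sketch of the bookkeeping Pierre describes, assuming a plain-Python reader loop. The names `read_rows`, `rows`, `lengths`, `nbcols`, and the error policy are illustrative placeholders, not the actual numpy.genfromtxt internals.]

```python
def read_rows(lines, nbcols, delimiter=",", skip_bad=False):
    """Split each line, recording the length of every split row,
    then flag rows whose length differs from the expected nbcols."""
    rows = []
    lengths = []  # second list: length of each split row
    for line in lines:
        values = line.strip().split(delimiter)
        rows.append(values)
        lengths.append(len(values))

    # Find the rows whose length does not match the expected column count.
    bad = [i for (i, n) in enumerate(lengths) if n != nbcols]
    if bad:
        if skip_bad:
            # Strip the invalid rows (i.e. skip them), knowing exactly where they were.
            bad_set = set(bad)
            rows = [row for (i, row) in enumerate(rows) if i not in bad_set]
        else:
            raise ValueError(
                "rows %s do not have the expected %d columns" % (bad, nbcols))
    return rows
```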
I doubt it would be that big a deal, however...

Skipper Seabold wrote:
One of the datasets I was working with was about a million lines with about 500 columns in each.
In this use case it's clearly not a big deal, but it's probably pretty common for folks to have data sets with a smaller number of columns, maybe even two or so (I know I do sometimes). In that case, I suppose we're increasing memory usage by 50% or so, which may be an issue. Another idea: only store the indexes of the rows that have the "wrong" number of columns -- if that's a large number, then the user has bigger problems than memory usage! A sketch of that variant follows below.
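[Editor's note: a sketch of the lighter bookkeeping Chris suggests, recording only the indexes of offending rows instead of a parallel list of all lengths. Again, the names are hypothetical and not the real loadtxt/genfromtxt internals.]

```python
def read_rows_light(lines, nbcols, delimiter=","):
    """Split each line, but only remember the indexes of rows whose
    split length differs from the expected nbcols."""
    rows = []
    bad = []  # indexes of rows with the "wrong" number of columns
    for i, line in enumerate(lines):
        values = line.strip().split(delimiter)
        if len(values) != nbcols:
            bad.append(i)  # extra memory is used only for bad rows
        rows.append(values)
    if bad:
        raise ValueError("unexpected column counts at rows %s" % bad)
    return rows
```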
I can't think of a case where I would want to just skip bad rows.
I can't either, but someone suggested it. It certainly shouldn't happen by default or without a big ol' message of some sort to the user's code.

-Chris

--
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA 98115        (206) 526-6317   main reception

Chris.Barker@noaa.gov