[Numpy-discussion] Question about improving genfromtxt errors

Tue Sep 29 12:37:14 EDT 2009

Pierre GM wrote:
> I was thinking about something this week-end: we could create a second  
> list when looping on the rows, where we would store the length of each  
> splitted row. After the loop, we can find if these values don't match  
> the expected number of columns `nbcols` and where. Then, we can decide  
> to strip the `rows` list of its invalid values (that corresponds to  
> skipping) or raise an exception, but in both cases we know where the  
> problem is.
> My only concern is that we'd be creating yet another list of integers,  
> which would increase memory usage. Would it be a problem ?

I doubt it would be that big a deal, however...

Skipper Seabold wrote:
>  One of the datasets I
> was working with was about a million lines with about 500 columns in
> each.

In this use case, it's clearly not a big deal, but it's probably pretty 
common for folks to have data sets with a smaller number of columns, 
maybe even two or so (I know I do sometimes). In that case, I suppose 
we're increasing memory usage by 50% or so, which may be an issue.

Another idea: only store the indexes of the rows that have the "wrong" 
number of columns -- if that's a large number, then the user has bigger 
problems than memory usage!
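
A minimal sketch of that idea, assuming a plain delimiter split rather
than genfromtxt's actual parsing loop. The names `find_bad_rows`,
`nbcols` as a parameter, and `delimiter` are hypothetical, chosen only
for illustration:

```python
def find_bad_rows(lines, nbcols, delimiter=","):
    """Split each line and record only the rows with the wrong column count.

    Returns (rows, bad), where `rows` holds every split row and `bad`
    holds (row index, number of columns found) for the mismatches only.
    """
    rows = []
    bad = []
    for i, line in enumerate(lines):
        fields = line.rstrip("\n").split(delimiter)
        rows.append(fields)
        if len(fields) != nbcols:
            # Only problem rows cost extra memory.
            bad.append((i, len(fields)))
    return rows, bad


lines = ["1,2,3", "4,5", "6,7,8", "9,10,11,12"]
rows, bad = find_bad_rows(lines, nbcols=3)
```

For clean input `bad` stays empty, so the extra memory is proportional
to the number of problem rows rather than to the size of the file.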

> I can't think of a case where I would want to just skip bad rows.

I can't either, but someone suggested it. It certainly shouldn't happen 
by default or without a big ol' message of some sort to the user's code.
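
One way to make skipping opt-in and loud, as a rough sketch. The names
`resolve_bad_rows` and `skip_invalid` are hypothetical and are not part
of genfromtxt's interface; the point is only that the default raises
and the skip path always warns:

```python
import warnings


def resolve_bad_rows(rows, bad_indices, skip_invalid=False):
    """Raise on bad rows by default; drop them with a warning if asked to."""
    if not bad_indices:
        return rows
    msg = "%d row(s) have the wrong number of columns, at indices %s" % (
        len(bad_indices), sorted(bad_indices))
    if not skip_invalid:
        # Default behaviour: fail, and say where the problem is.
        raise ValueError(msg)
    # Skipping was requested explicitly, but still tell the user.
    warnings.warn(msg + "; skipping them")
    bad = set(bad_indices)
    return [row for i, row in enumerate(rows) if i not in bad]
```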

-Chris

-- 
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115       (206) 526-6317   main reception

Chris.Barker at noaa.gov