[Numpy-discussion] Question about improving genfromtxt errors
Christopher Barker
Chris.Barker at noaa.gov
Tue Sep 29 12:37:14 EDT 2009
Pierre GM wrote:
> I was thinking about something this week-end: we could create a second
> list when looping on the rows, where we would store the length of each
> splitted row. After the loop, we can find if these values don't match
> the expected number of columns `nbcols` and where. Then, we can decide
> to strip the `rows` list of its invalid values (that corresponds to
> skipping) or raise an exception, but in both cases we know where the
> problem is.
> My only concern is that we'd be creating yet another list of integers,
> which would increase memory usage. Would it be a problem ?
I doubt it would be that big a deal, however...
Skipper Seabold wrote:
> One of the datasets I
> was working with was about a million lines with about 500 columns in
> each.
In this use case, it's clearly not a big deal, but it's probably pretty
common for folks to have data sets with a smaller number of columns,
maybe even two or so (I know I do sometimes). In that case, I suppose
we're increasing memory usage by 50% or so, which may be an issue.
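
To put rough numbers on that: the sketch below counts stored items only (one extra length per row versus the row's own fields), not actual bytes, so treat it as back-of-the-envelope arithmetic. The function name `relative_overhead` is made up for illustration.

```python
def relative_overhead(ncols):
    """Extra items stored per row (one row length) relative to the
    number of fields already stored for that row.

    This counts items, not bytes; real memory use depends on how the
    fields and the lengths are actually represented.
    """
    return 1.0 / ncols


# Two columns: one extra integer for every two fields.
two_col = relative_overhead(2)

# 500 columns: one extra integer for every 500 fields.
wide = relative_overhead(500)

print("2 columns:   +%.1f%%" % (100 * two_col))
print("500 columns: +%.1f%%" % (100 * wide))
```

So the extra list is negligible for wide files and only starts to matter for very narrow ones.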
Another idea: only store the indexes of the rows that have the "wrong"
number of columns -- if that's a large number, then the user has bigger
problems than memory usage!
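
Something like the following is what I have in mind. It's only a sketch, not what genfromtxt does: `split_and_check` and `load_rows` are hypothetical names, and I'm assuming a plain `str.split` on the delimiter stands in for the real row splitter.

```python
def split_and_check(lines, nbcols, delimiter=None):
    """Split each line and remember only the rows whose length is wrong.

    Returns (rows, bad), where bad is a list of
    (line index, number of fields found) pairs.
    """
    rows = []
    bad = []  # stays empty for a clean file, so it costs almost nothing
    for i, line in enumerate(lines):
        fields = line.split(delimiter)
        if len(fields) != nbcols:
            bad.append((i, len(fields)))
        rows.append(fields)
    return rows, bad


def load_rows(lines, nbcols, delimiter=None):
    """Raise an error that says exactly where the bad rows are."""
    rows, bad = split_and_check(lines, nbcols, delimiter)
    if bad:
        details = ", ".join("line %d (got %d columns)" % item for item in bad)
        raise ValueError("expected %d columns: %s" % (nbcols, details))
    return rows
```

With a clean file the `bad` list stays empty, so the memory cost only shows up when there is actually something to report.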
> I can't think of a case where I would want to just skip bad rows.
I can't either, but someone suggested it. It certainly shouldn't happen
by default or without a big ol' message of some sort to the user's code.
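
If skipping were offered, I'd picture something along these lines. Again a sketch under assumptions: `drop_bad_rows` is a made-up helper working on already-split rows, and the standard `warnings` module is just one way to deliver the message.

```python
import warnings


def drop_bad_rows(rows, nbcols):
    """Return only the rows with nbcols fields, warning loudly about the rest."""
    good = []
    bad = []  # indexes of the rows that get dropped
    for i, row in enumerate(rows):
        if len(row) == nbcols:
            good.append(row)
        else:
            bad.append(i)
    if bad:
        warnings.warn(
            "skipped %d row(s) with the wrong number of columns "
            "(expected %d); row indexes: %s" % (len(bad), nbcols, bad)
        )
    return good
```

The caller would have to ask for this explicitly; the default would still be to raise.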
-Chris
--
Christopher Barker, Ph.D.
Oceanographer
Emergency Response Division
NOAA/NOS/OR&R (206) 526-6959 voice
7600 Sand Point Way NE (206) 526-6329 fax
Seattle, WA 98115 (206) 526-6317 main reception
Chris.Barker at noaa.gov