
Pierre GM wrote:
I was thinking about something this weekend: we could create a second list when looping over the rows, where we would store the length of each split row. After the loop, we can check whether these values match the expected number of columns `nbcols`, and where they don't. Then we can decide to strip the `rows` list of its invalid entries (which corresponds to skipping) or raise an exception, but in both cases we know where the problem is. My only concern is that we'd be creating yet another list of integers, which would increase memory usage. Would that be a problem? A sketch of the idea follows below.
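[Editor's note: a minimal Python sketch of the bookkeeping Pierre describes, assuming a plain-Python reader loop. The names `read_rows`, `rows`, `lengths`, `nbcols`, and the error policy are illustrative placeholders, not the actual numpy.genfromtxt internals.]

```python
def read_rows(lines, nbcols, delimiter=",", skip_bad=False):
    """Split each line, recording the length of every split row,
    then flag rows whose length differs from the expected nbcols."""
    rows = []
    lengths = []  # second list: length of each split row
    for line in lines:
        values = line.strip().split(delimiter)
        rows.append(values)
        lengths.append(len(values))

    # Find the rows whose length does not match the expected column count.
    bad = [i for (i, n) in enumerate(lengths) if n != nbcols]
    if bad:
        if skip_bad:
            # Strip the invalid rows (i.e. skip them), knowing exactly where they were.
            bad_set = set(bad)
            rows = [row for (i, row) in enumerate(rows) if i not in bad_set]
        else:
            raise ValueError(
                "rows %s do not have the expected %d columns" % (bad, nbcols))
    return rows
```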
I doubt it would be that big a deal, however...

Skipper Seabold wrote:
One of the datasets I was working with was about a million lines with about 500 columns in each.
In this use case it's clearly not a big deal, but it's probably pretty common for folks to have data sets with a smaller number of columns, maybe even two or so (I know I do sometimes). In that case, I suppose we're increasing memory usage by 50% or so, which may be an issue. Another idea: only store the indexes of the rows that have the "wrong" number of columns -- if that's a large number, then the user has bigger problems than memory usage! A sketch of that variant follows below.
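[Editor's note: a sketch of the lighter bookkeeping Chris suggests, recording only the indexes of offending rows instead of a parallel list of all lengths. Again, the names are hypothetical and not the real loadtxt/genfromtxt internals.]

```python
def read_rows_light(lines, nbcols, delimiter=","):
    """Split each line, but only remember the indexes of rows whose
    split length differs from the expected nbcols."""
    rows = []
    bad = []  # indexes of rows with the "wrong" number of columns
    for i, line in enumerate(lines):
        values = line.strip().split(delimiter)
        if len(values) != nbcols:
            bad.append(i)  # extra memory is used only for bad rows
        rows.append(values)
    if bad:
        raise ValueError("unexpected column counts at rows %s" % bad)
    return rows
```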
I can't think of a case where I would want to just skip bad rows.
I can't either, but someone suggested it. It certainly shouldn't happen by default or without a big ol' message of some sort to the user's code.

-Chris

--
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R            (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA 98115        (206) 526-6317   main reception

Chris.Barker@noaa.gov