There must be a better way
Tim Chase
python.list at tim.thechases.com
Tue Apr 23 10:30:05 EDT 2013
On 2013-04-23 13:36, Neil Cerutti wrote:
> On 2013-04-22, Colin J. Williams <cjw at ncf.ca> wrote:
> > Since I'm only interested in one or two columns, the simpler
> > approach is probably better.
>
> Here's a sketch of how one of my projects handles that situation.
> I think the index variables are invaluable documentation, and
> make it a bit more robust. (Python 3, so not every bit is
> relevant to you).
>
> with open("today.csv", encoding='UTF-8', newline='') as today_file:
>     reader = csv.reader(today_file)
>     header = next(reader)
>     majr_index = header.index('MAJR')
>     div_index = header.index('DIV')
>     for rec in reader:
>         major = rec[majr_index]
>         rec[div_index] = DIVISION_TABLE[major]
>
> But a csv.DictReader might still be more efficient. I never
> tested. This is the only place I've used this "optimization".
> It's fast enough. ;)

I believe the csv module does its parsing at the C level, rather than
in pure Python, so it should be notably faster. The only times I've
had to do things by hand like that are when there are header
peculiarities that I can't control, such as mismatched case or
added/removed punctuation (client files are notorious for this). So I
often end up doing something like

  def normalize(header):
      return header.strip().upper()  # other cleanup as needed

  reader = csv.reader(f)
  headers = next(reader)
  header_map = dict(
      (normalize(header), i)
      for i, header
      in enumerate(headers)
  )
  item = lambda col: row[header_map[col]].strip()
  for row in reader:
      major = item("MAJR").upper()
      division = item("DIV")
      # ...

The function call might add overhead, in which case one could
just use explicit indirect indexing for each value assignment:

  major = row[header_map["MAJR"]].strip().upper()

but I usually find that processing CSV files leaves me I/O bound
rather than CPU bound.
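
For anyone who wants to try this, here is a rough, self-contained
sketch of the header-map idea next to a csv.DictReader variant that
normalizes its field names. It assumes Python 3; the SAMPLE data and
the two function names are made up for illustration, and I haven't
timed either version.

```python
import csv
import io

# Hypothetical sample data with messy headers (stray spaces, mixed case).
SAMPLE = " majr ,Div,Name\r\nmath , 2 ,Ann\r\ncs,3,Bob\r\n"


def normalize(header):
    return header.strip().upper()  # other cleanup as needed


def read_with_header_map(f):
    """Index each row through a normalized-header -> position map."""
    reader = csv.reader(f)
    headers = next(reader)
    header_map = {normalize(h): i for i, h in enumerate(headers)}
    results = []
    for row in reader:
        major = row[header_map["MAJR"]].strip().upper()
        division = row[header_map["DIV"]].strip()
        results.append((major, division))
    return results


def read_with_dictreader(f):
    """Same result via DictReader, after normalizing its field names."""
    reader = csv.DictReader(f)
    # Reading .fieldnames consumes the header row; assigning to it
    # replaces the keys used for every following row.
    reader.fieldnames = [normalize(h) for h in reader.fieldnames]
    return [
        (row["MAJR"].strip().upper(), row["DIV"].strip())
        for row in reader
    ]


rows_a = read_with_header_map(io.StringIO(SAMPLE, newline=""))
rows_b = read_with_dictreader(io.StringIO(SAMPLE, newline=""))
print(rows_a)
print(rows_a == rows_b)
```

Both functions should give the same (major, division) pairs, so
which one to use is mostly a matter of taste unless profiling says
otherwise.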
-tkc