There must be a better way

Tue Apr 23 10:30:05 EDT 2013

On 2013-04-23 13:36, Neil Cerutti wrote:
> On 2013-04-22, Colin J. Williams <cjw at ncf.ca> wrote:
> > Since I'm only interested in one or two columns, the simpler
> > approach is probably better.
> 
> Here's a sketch of how one of my projects handles that situation.
> I think the index variables are invaluable documentation, and
> make it a bit more robust. (Python 3, so not every bit is
> relevant to you).
> 
> with open("today.csv", encoding='UTF-8', newline='') as today_file:
>     reader = csv.reader(today_file)
>     header = next(reader)
>     majr_index = header.index('MAJR')
>     div_index = header.index('DIV')
>     for rec in reader:
>         major = rec[majr_index]
>         rec[div_index] = DIVISION_TABLE[major]
> 
> But a csv.DictReader might still be more efficient. I never
> tested. This is the only place I've used this "optimization".
> It's fast enough. ;)

I believe the csv module does its parsing at the C level rather than
in pure Python (DictReader itself is a thin Python wrapper around the
C reader), so it should still be reasonably fast.  The only times I've
had to do things by hand like that are when there are header
peculiarities that I can't control, such as mismatched case or
added/removed punctuation (client files are notorious for this).  So I
often end up doing something like

  import csv

  def normalize(header):
    return header.strip().upper() # other cleanup as needed

  reader = csv.reader(f) # f is the already-open CSV file
  headers = next(reader)
  header_map = dict(
    (normalize(header), i)
    for i, header
    in enumerate(headers)
    )
  # "row" is looked up at call time, so this sees the current loop row
  item = lambda col: row[header_map[col]].strip()
  for row in reader:
    major = item("MAJR").upper()
    division = item("DIV")
    # ...

The function calls might add overhead, in which case one could
just use explicit indirect indexing for each value assignment:

  major = row[header_map["MAJR"]].strip().upper()

but I usually find that processing CSV files leaves me I/O bound
rather than CPU bound.
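
For anyone who wants to try this out, here is a minimal self-contained
sketch of the two approaches discussed in this thread: the normalized
header-to-index map, and a csv.DictReader whose field names are
normalized the same way.  The sample data and the function names
rows_by_index and rows_by_dictreader are made up for illustration, and
it assumes Python 3.

```python
import csv
import io

# Hypothetical sample data with messy headers (odd case, stray spaces).
SAMPLE = " majr ,Div,Name\nmath , A1 ,Ann\ncs,B2,Bob\n"

def normalize(header):
    return header.strip().upper()  # other cleanup as needed

def rows_by_index(f):
    """Yield (major, division) using a normalized header-to-index map."""
    reader = csv.reader(f)
    headers = next(reader)
    header_map = {normalize(h): i for i, h in enumerate(headers)}
    # Look the indexes up once, outside the loop.
    majr_index = header_map["MAJR"]
    div_index = header_map["DIV"]
    for row in reader:
        yield row[majr_index].strip().upper(), row[div_index].strip()

def rows_by_dictreader(f):
    """Yield the same pairs via csv.DictReader with normalized field names."""
    reader = csv.DictReader(f)
    # Reading .fieldnames consumes the header row; assigning to it
    # replaces the keys used for every following row.
    reader.fieldnames = [normalize(h) for h in reader.fieldnames]
    for row in reader:
        yield row["MAJR"].strip().upper(), row["DIV"].strip()

by_index = list(rows_by_index(io.StringIO(SAMPLE)))
by_dict = list(rows_by_dictreader(io.StringIO(SAMPLE)))
print(by_index)
print(by_dict == by_index)
```

Both functions should yield the same pairs for the same input, so
which one to use is mostly a matter of taste unless profiling says
otherwise.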

-tkc