[Python-ideas] csv.DictReader could handle headers more intelligently.

J. Cliff Dyer jcd at sdf.lonestar.org
Wed Jan 23 02:06:08 CET 2013


Idea folks,

I'm working with some poorly-formed CSV files, and I noticed that
DictReader always and only pulls headers off of the first row.  But many
of the files I see have blank lines before the row of headers, sometimes
with commas to the appropriate field count, sometimes without.  The
current implementation's behavior in this case is likely never correct,
and certainly always annoying.  Given the following file:

---Start File 1---
,,
A,B,C
1,2,3
2,4,6
---End File 1---

csv.DictReader yields the rows:

    {'': 'C'}
    {'': '3'}
    {'': '6'}
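A minimal way to reproduce this, assuming Python 3 and feeding File 1 in through io.StringIO instead of a real file (the read_dicts helper is only for illustration):

```python
import csv
import io

# File 1 from above: a comma-only row precedes the real header row.
FILE_1 = ",,\nA,B,C\n1,2,3\n2,4,6\n"


def read_dicts(text):
    """Parse CSV text with the stock csv.DictReader and return plain dicts."""
    return [dict(row) for row in csv.DictReader(io.StringIO(text, newline=""))]


file_1_rows = read_dicts(FILE_1)
for row in file_1_rows:
    print(row)
```

The comma-only row parses as three empty strings, so all three columns share the key '' and only the last value in each row survives.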


And given a file starting with a zero-length line, like the following:

---Start File 2---

A,B,C
1,2,3
2,4,6
---End File 2---

It yields the following:

    {None: ['A', 'B', 'C']}
    {None: ['1', '2', '3']}
    {None: ['2', '4', '6']}
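The same kind of check for File 2, again assuming Python 3 and an in-memory io.StringIO. Here the zero-length line is read as an empty header list, so every value in the later rows counts as an "extra" field and is collected under restkey, which defaults to None:

```python
import csv
import io

# File 2 from above: a zero-length line precedes the real header row.
FILE_2 = "\nA,B,C\n1,2,3\n2,4,6\n"

reader = csv.DictReader(io.StringIO(FILE_2, newline=""))

# Accessing fieldnames consumes the first row, which is the empty line.
header = reader.fieldnames
file_2_rows = [dict(row) for row in reader]

print(header)
for row in file_2_rows:
    print(row)
```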

I think that in both cases, the proper response would be to treat the A,B,C
line as the header line.  The change that makes this work is pretty
simple.  In the fieldnames getter property, the "if self._fieldnames is
None:" conditional becomes "while self._fieldnames is None or not
any(self._fieldnames):"  As a subclass:

import csv


class DictReader(csv.DictReader):
    @property
    def fieldnames(self):
        while self._fieldnames is None or not any(self._fieldnames):
            try:
                self._fieldnames = next(self.reader)
            except StopIteration:
                break
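        # Keep line_num in sync with the underlying reader before returning;
        # placed after the return, this line would never run.
        self.line_num = self.reader.line_num
        return self._fieldnames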

    # Same as the original setter, just rewritten to associate with the
    # new getter property
    @fieldnames.setter
    def fieldnames(self, value):
        self._fieldnames = value

There might be some issues with existing code that depends on the {None:
['1','2','3']} construction, but I can't imagine a time when programmers
would want to see {'': '3'} with the 1 and 2 values getting lost.
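As a quick sanity check, here is a self-contained sketch that runs the subclass idea against both sample files. It assumes Python 3. The class name BlankSkippingDictReader and the parse helper are made up for this example, and the line_num update is placed before the return so that it actually runs:

```python
import csv
import io


class BlankSkippingDictReader(csv.DictReader):
    """Hypothetical name: skips leading rows that contain no non-empty cell."""

    @property
    def fieldnames(self):
        # Keep reading until a row with at least one non-empty cell turns up.
        while self._fieldnames is None or not any(self._fieldnames):
            try:
                self._fieldnames = next(self.reader)
            except StopIteration:
                break
        self.line_num = self.reader.line_num
        return self._fieldnames

    @fieldnames.setter
    def fieldnames(self, value):
        self._fieldnames = value


def parse(text):
    """Return all rows of the CSV text as plain dicts."""
    reader = BlankSkippingDictReader(io.StringIO(text, newline=""))
    return [dict(row) for row in reader]


FILE_1 = ",,\nA,B,C\n1,2,3\n2,4,6\n"   # comma-only first row
FILE_2 = "\nA,B,C\n1,2,3\n2,4,6\n"     # zero-length first row
WELL_FORMED = "A,B,C\n1,2,3\n2,4,6\n"  # header on the first row

rows_1 = parse(FILE_1)
rows_2 = parse(FILE_2)
rows_ok = parse(WELL_FORMED)

for row in rows_1:
    print(row)
```

With this sketch all three inputs should come out as the same two rows keyed by A, B and C. A file that contains nothing but blank rows just runs the loop to the end of the input and then yields no rows.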

Thoughts? Do folks think this is worth adding to the csv library, or
should I just keep using my subclass?

Cheers,
Cliff




