[Python-ideas] csv.DictReader could handle headers more intelligently.
Jeff Jenkins
jeff at jeffreyjenkins.ca
Wed Jan 30 15:04:47 CET 2013
I think this may have been lost somewhere in the last 90 messages, but
adding a warning to DictReader in the docs seems like it solves almost the
entire problem. New csv.DictReader users are informed, no one's old code
breaks, and a separate discussion can be had about whether it's worth
adding a csv.MultiDictReader which uses lists.
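For concreteness, this is the pitfall such a doc warning would describe (a minimal sketch with made-up data): DictReader builds one dict per row, so a repeated header silently keeps only the last column's value.

```python
import csv
import io

# A CSV whose header row repeats the name "id" (hypothetical data).
data = io.StringIO("id,name,id\n1,alice,2\n")

rows = list(csv.DictReader(data))
# The second "id" column overwrites the first: the value "1" is lost
# without any warning, and the row dict has only two keys.
first = rows[0]
```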
On Wed, Jan 30, 2013 at 7:59 AM, Shane Green <shane at umbrellacode.com> wrote:
> So I've done some thinking on it, a bit of research, etc., and have worked
> with a lot of different CSV content. There are a lot of parallels between
> the name/value pairs of an HTML form submission, and our use case.
>
> Namely:
> - There's typically only one value per name, but it's perfectly legal to
> have multiple values assigned to a name.
> - When multiple values are assigned to a name, order can be very
> important.
> - They originally made the mistake of mapping field names to singular
> values when there was only one value, and to lists of values when there
> were multiple.
> - That behaviour has since been deprecated, and their FieldStorage now
> always maps field names to lists of values.
>
> I've implemented a Record class I'm going to pitch for feedback. Although
> I followed the FieldStorage API for a couple of methods, it didn't
> translate very well because their values are complex objects. This Record
> class is a dictionary type that maps header names to the values from
> columns labeled by that same header. Most lists hold a single value,
> because headers usually aren't duplicated. When multiple values are in a
> field, they are listed in the order they were read from the CSV file. The
> API provides convenience methods for getting the first or last value
> listed for a given column name, making it very easy to work with singular
> values when desired. The dictionary API will likely be the primary
> mechanism for interacting with it; the record also knows the header and
> row sequences it was built from, and provides sequential access to them
> as well. In addition to supporting non-standard CSV, transformations,
> etc., this information makes it possible to reproduce correctly ordered
> CSV.
>
> I don't really know yet whether it would make sense to support any kind
> of manipulation of values on the record instances themselves, versus a
> copy()/update() approach to deriving modified records, but I did decide
> to wrap the row values in a tuple, making them read-only. This was for
> several reasons. One was to address a potential inconsistency that might
> arise should we decide to support editing; the other is that the record
> is the representation of the row read from the source file, and so it
> should always accurately reflect that content.
>
> About the code: I wrote it tonight and tested it for an hour, so it's not
> meant to be perfect or final, but it should stir up a very concrete
> discussion about the API, if nothing else ;-) I included a generator that
> seemed to work on some test files. It is most definitely not meant to be
> critiqued, or to be a distraction, but I've included it in case anyone
> ends up wanting to investigate things further. Although the iterator
> function provides a slightly different signature than DictReader, that's
> not because I'm trying to change anything; please keep in mind the
> generator was just a test. Also, I'd like to mention one last time that I
> don't think we should change what exists to reflect any of these changes:
> I was thinking it would be a new set of classes and functions that would
> become the preferred implementation in the future.
>
>
>
>
> class Record(dict):
>     def __init__(self, headers, fields):
>         super(Record, self).__init__()
>         if len(headers) != len(fields):
>             # I don't make decisions about how gaps should be filled.
>             raise ValueError("header/field size mismatch")
>         self._headers = headers
>         self._fields = tuple(fields)
>         for h, v in self.fielditems():
>             self.setdefault(h, []).append(v)
>
>     def fielditems(self):
>         """
>         Get header,value sequence that reflects CSV source.
>         """
>         return zip(self.headers(), self.fields())
>
>     def headers(self):
>         """
>         Get ordered sequence of headers reflecting CSV source.
>         """
>         return self._headers
>
>     def fields(self):
>         """
>         Get ordered sequence of values reflecting CSV row source.
>         """
>         return self._fields
>
>     def getfirst(self, name, default=None):
>         """
>         Get value of first field associated with header named
>         'name'; return 'default' if no such value exists.
>         """
>         return self[name][0] if name in self else default
>
>     def getlast(self, name, default=None):
>         """
>         Get value of last field associated with header named
>         'name'; return 'default' if no such value exists.
>         """
>         return self[name][-1] if name in self else default
>
>     def getlist(self, name):
>         """
>         Get values of all fields associated with header named 'name'.
>         """
>         return self.get(name, [])
>
>     def pretty(self, header=True):
>         lines = []
>         if header:
>             lines.append([("%s" % h).rjust(20) for h in self.headers()])
>         lines.append([("%s" % v).rjust(20) for v in self.fields()])
>         return "\n\n".join(["|".join(line).strip() for line in lines])
>
>     def __getslice__(self, start=0, stop=None):
>         return self.fields()[start:stop]
>
>
> import csv
> import itertools
>
> Undefined = object()
>
> def iterrecords(f, headers=None, bucketheader=Undefined,
>                 missingfieldsok=False, dialect="excel", *args, **kw):
>     rows = csv.reader(f, dialect, *args, **kw)
>     headcount = len(headers) if headers else 0
>     for row in itertools.ifilter(None, rows):
>         if not headers:
>             headers = row
>             headcount = len(headers)
>             continue
>         rowcount = len(row)
>         rowheaders = headers
>         if rowcount < headcount:
>             if not missingfieldsok:
>                 raise KeyError("row has fewer values than headers")
>         elif rowcount > headcount:
>             if bucketheader is Undefined:
>                 raise KeyError("row has more values than headers")
>             # Copy, so the bucket headers don't accumulate on 'headers'.
>             rowheaders = headers + [bucketheader] * (rowcount - headcount)
>         record = Record(rowheaders, row)
>         yield record
>
>
>
>
> I should probably also have noted the dictionary API behaviour, since
> it's not explicit:
> keys() -> list of unique header names.
> values() -> list of field-value lists.
> items() -> [(header, field-list), ...] pairs.
>
> And then of course dictionary lookup. One thing that comes to mind is
> that there's really no value in an unordered sequence of value lists;
> there could be some value in extending an OrderedDict, making all the
> iteration methods consistent and therefore something that could be used
> to write values back out, etc.
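> A minimal sketch of that OrderedDict variant (the name and details are
> mine, purely for illustration, not a concrete proposal):

```python
from collections import OrderedDict

class OrderedRecord(OrderedDict):
    """Sketch: map each header to the ordered list of its column
    values, iterating headers in first-seen order."""
    def __init__(self, headers, fields):
        super().__init__()
        if len(headers) != len(fields):
            raise ValueError("header/field size mismatch")
        self._headers = tuple(headers)
        self._fields = tuple(fields)
        # Duplicate headers accumulate values in source order.
        for h, v in zip(headers, fields):
            self.setdefault(h, []).append(v)

    def getfirst(self, name, default=None):
        return self[name][0] if name in self else default

    def getlast(self, name, default=None):
        return self[name][-1] if name in self else default

record = OrderedRecord(["a", "b", "a"], ["1", "2", "3"])
```

Because iteration follows header order, the same record could drive both dictionary-style lookup and ordered write-back of rows.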
>
>
>
>
>
> _______________________________________________
> Python-ideas mailing list
> Python-ideas at python.org
> http://mail.python.org/mailman/listinfo/python-ideas
>
>