[Python-ideas] Fwd: csv.DictReader could handle headers more intelligently.
Shane Green
shane at umbrellacode.com
Sun Jan 27 15:10:49 CET 2013
Something as simple as this (straw man) demonstrates what I mean:
> from collections import defaultdict  # needed for the base class
>
> class Record(defaultdict):
>     def __init__(self, headers, fields):
>         super(Record, self).__init__(list)  # missing keys default to []
>         self.headers = headers
>         self.fields = fields
>         map(self.enter, self.headers, self.fields)  # Python 2: map() runs eagerly
>     def valuemap(self, first=False):
>         # One value per name: the first or (by default) the last, as DictReader does.
>         index = 0 if first else -1
>         return dict((key, values[index]) for key, values in self.items())
>     def enter(self, header, *values):
>         if isinstance(header, int):
>             header = self.headers[header]
>         self[header].extend(values)
>     def itemseq(self):
>         return zip(self.headers, self.fields)
>     def __getitem__(self, spec):
>         if isinstance(spec, int):
>             return self.fields[spec]  # integer access is positional
>         return super(Record, self).__getitem__(spec)
>     def __getslice__(self, *args):  # Python 2 slicing protocol
>         return self.fields.__getslice__(*args)
>
>
This would let you access column values using header names, just like before. Each column's value(s) is now in a list, which contains multiple values whenever a column name appears more than once in the header.
Values can also be accessed sequentially using integer indexes, and valuemap() returns a standard dictionary that conforms exactly to the current behavior: a one-to-one mapping between column headers and values, with the last value associated with a given column name being the one used.
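To make that last point concrete, here is the behavior of csv.DictReader as it stands today (shown in modern Python for illustration): with duplicate headers, only the last value for each repeated name survives.

```python
import csv
import io

# Current csv.DictReader behavior with a duplicate "B" header:
# later values silently overwrite earlier ones.
data = "A,B,C,B\n1,2,3,4\n"
row = next(csv.DictReader(io.StringIO(data)))

assert row == {"A": "1", "B": "4", "C": "3"}  # the first "B" value, "2", is lost
```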
While I think the changes should be added without altering what exists, for backward-compatibility reasons, I've started to think the existing version should eventually be deprecated rather than maintained as a special case. Even when the format is perfect for the existing code, I don't see any big advantage to using it over this approach.
Keep in mind the example is just a quick straw man: performance is a big difference (and it no doubt has plenty of bugs), but that doesn't seem like the right thing to base the decision on, since performance can easily be improved later.
In summary, given headers: A, B, C, D, E, B, G
record.headers == ["A", "B", "C", "D", "E", "B", "G"]
record.fields == [0, 1, 2, 3, 4, 5, 6]
record["A"] == [0]
record["B"] == [1, 5]
# Note: sequentially accessed values are not wrapped in lists, and the second "B" column's value 5 stays in its original position (index 5).
record[0] == 0
record[1] == 1
record[2] == 2
record[3] == 3
record[4] == 4
record[5] == 5
record.items() == [("A", [0]), ("B", [1, 5]), …]
record.valuemap() == {"A": 0, "B": 5, …} # This returns exactly what DictReader produces today: a single value per named column, with the last value being the one used.
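The summary above can be checked with a minimal stand-in for the straw-man Record, just a defaultdict of lists, runnable on its own:

```python
from collections import defaultdict

headers = ["A", "B", "C", "D", "E", "B", "G"]
fields = [0, 1, 2, 3, 4, 5, 6]

# Group each value under its header name; repeated names accumulate.
record = defaultdict(list)
for name, value in zip(headers, fields):
    record[name].append(value)

assert record["A"] == [0]
assert record["B"] == [1, 5]  # both values of the repeated column survive

# valuemap(): one value per name, last one wins, as DictReader does today.
valuemap = dict((name, values[-1]) for name, values in record.items())
assert valuemap == {"A": 0, "B": 5, "C": 2, "D": 3, "E": 4, "G": 6}
```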
Shane Green
www.umbrellacode.com
408-692-4666 | shane at umbrellacode.com
Begin forwarded message:
> From: Shane Green <shane at umbrellacode.com>
> Subject: Re: [Python-ideas] csv.DictReader could handle headers more intelligently.
> Date: January 26, 2013 6:39:11 AM PST
> To: "Stephen J. Turnbull" <stephen at xemacs.org>
> Cc: python-ideas at python.org
>
> Okay, I like your point about DictReader having a place with a subset of CSV tables, and agree that, given that definition, it should throw an exception when it's fed something that doesn't conform to this definition. I like that.
>
> One thing, though: the new version would still let you access column data by name as well.
>
> Instead of
> row["timestamp"] == 1359210019.299478
>
> It would be
> row["timestamp"] == [1359210019.299478]
>
> And potentially
> row["timestamp"] == [1359210019.299478, 1359210019.299478]
>
> It could also be accessed as:
> row.headers[0] == "timestamp"
> row.headers[1] == "timestamp"
> row.values[0] == 1359210019.299478
> row.values[1] == 1359210019.299478
>
> Could still provide:
> for name,value in records.iterfirstitems(): # get the first value for each column with a given name.
> - or -
> for name,value in records.iterlastitems(): # get the last value for each column with a given name.
>
> And the exact functionality you have now:
> records.itervaluemaps() # or something… essentially just dict(record.iterlastitems()) for each record
>
> Overkill, but really simple things to add…
>
> The only thing this really adds to the "convenience" of the current DictReader for well-behaved tables is the ability to access values sequentially or by name; beyond that, the only difference would be iterating over a generator method's output instead of the instance itself.
>
>
>
>
> Shane Green
> www.umbrellacode.com
> 408-692-4666 | shane at umbrellacode.com
>
> On Jan 26, 2013, at 5:53 AM, "Stephen J. Turnbull" <stephen at xemacs.org> wrote:
>
>> Shane Green writes:
>>
>>> And while it's true that a dictionary is a dictionary and it works
>>> the way it works, the real point that drives home is that it's an
>>> inappropriate mechanism for dealing ordered rows of sequential
>>> values.
>>
>> Right! So use csv.reader, or csv.DictReader with an explicit
>> fieldnames argument.
>>
>> The point of csv.DictReader with default fieldnames is to take a
>> "well-behaved" table and turn it into a sequence of "poor-man's"
>> objects.
>>
>>> The final point is a simple one: while that CSV file format was
>>> stupid, it was perfectly legal. Something that deals with CSV
>>> content should not be losing any of its content.
>>
>> That's a reasonable requirement.
>>
>>> It also should [not] be barfing or throwing exceptions, by the way.
>>
>> That's not. As long as the module provides classes capable of
>> handling any CSV format (it does), it may also provide convenience
>> classes for special purposes with restricted formats. Those classes
>> may throw exceptions on input that doesn't satisfy the restrictions.
>>
>>> And what about fixing it by replacing it with a class that
>>> implements it correctly, [...]?
>>
>> Doesn't help users who want automatically detected access-by-name.
>> They must have unique field names. (I don't have a use case. I
>> assume the implementer of csv.DictReader did.<wink/>)
>>
>
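The iterfirstitems()/iterlastitems() helpers floated in the forwarded message could be sketched like this (hypothetical names from the thread, not an actual csv API; the record is the same defaultdict-of-lists stand-in for the straw-man class):

```python
from collections import defaultdict

def iterfirstitems(record):
    # Yield (name, first value) for each column name.
    for name, values in record.items():
        yield name, values[0]

def iterlastitems(record):
    # Yield (name, last value): matches what DictReader keeps today.
    for name, values in record.items():
        yield name, values[-1]

# Two columns that share the header "timestamp".
record = defaultdict(list)
for name, value in zip(["timestamp", "timestamp"], [1.0, 2.0]):
    record[name].append(value)

assert dict(iterfirstitems(record)) == {"timestamp": 1.0}
assert dict(iterlastitems(record)) == {"timestamp": 2.0}
```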