[Python-ideas] Fwd: csv.DictReader could handle headers more intelligently.
Shane Green
shane at umbrellacode.com
Sun Jan 27 15:10:49 CET 2013
Something as simple as this (straw man) demonstrates what I mean:
> from collections import defaultdict  # needed for the base class
>
> class Record(defaultdict):
>     def __init__(self, headers, fields):
>         super(Record, self).__init__(list)  # missing keys default to []
>         self.headers = headers
>         self.fields = fields
>         map(self.enter, self.headers, self.fields)  # Python 2: map() runs eagerly
>     def valuemap(self, first=False):
>         # One value per name: the first or (by default) the last, as DictReader does.
>         index = 0 if first else -1
>         return dict((key, values[index]) for key, values in self.items())
>     def enter(self, header, *values):
>         if isinstance(header, int):
>             header = self.headers[header]
>         self[header].extend(values)
>     def itemseq(self):
>         return zip(self.headers, self.fields)
>     def __getitem__(self, spec):
>         if isinstance(spec, int):
>             return self.fields[spec]  # integer access is positional
>         return super(Record, self).__getitem__(spec)
>     def __getslice__(self, *args):  # Python 2 slicing protocol
>         return self.fields.__getslice__(*args)
>
>
This would let you access column values using header names, just like before. Each column's value(s) is now in a list, which contains multiple values whenever a column name appears more than once in the header.
Values can also be accessed sequentially using integer indexes, and valuemap() returns a standard dictionary that conforms exactly to the current behavior: a one-to-one mapping between column headers and values, with the last value associated with a given column name being the one used.
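To make that last point concrete, here is the behavior of csv.DictReader as it stands today (shown in modern Python for illustration): with duplicate headers, only the last value for each repeated name survives.

```python
import csv
import io

# Current csv.DictReader behavior with a duplicate "B" header:
# later values silently overwrite earlier ones.
data = "A,B,C,B\n1,2,3,4\n"
row = next(csv.DictReader(io.StringIO(data)))

assert row == {"A": "1", "B": "4", "C": "3"}  # the first "B" value, "2", is lost
```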
While I think the changes should be added without altering what exists, for backward-compatibility reasons, I've started to think the existing version should eventually be deprecated rather than maintained as a special case. Even when the format is perfect for the existing code, I don't see any big advantage to using it over this approach.
Keep in mind the example is just a quick straw man: performance is a big difference (and it no doubt has plenty of bugs), but that doesn't seem like the right thing to base the decision on, since performance can easily be improved later.
In summary, given headers: A, B, C, D, E, B, G
record.headers == ["A", "B", "C", "D", "E", "B", "G"]
record.fields == [0, 1, 2, 3, 4, 5, 6]
record["A"] == [0]
record["B"] == [1, 5]
# Note: sequentially accessed values are not wrapped in lists, and the second "B" column's value 5 stays in its original position (index 5).
record[0] == 0
record[1] == 1
record[2] == 2
record[3] == 3
record[4] == 4
record[5] == 5
record.items() == [("A", [0]), ("B", [1, 5]), …]
record.valuemap() == {"A": 0, "B": 5, …} # This returns exactly what DictReader produces today: a single value per named column, with the last value being the one used.
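The summary above can be checked with a minimal stand-in for the straw-man Record, just a defaultdict of lists, runnable on its own:

```python
from collections import defaultdict

headers = ["A", "B", "C", "D", "E", "B", "G"]
fields = [0, 1, 2, 3, 4, 5, 6]

# Group each value under its header name; repeated names accumulate.
record = defaultdict(list)
for name, value in zip(headers, fields):
    record[name].append(value)

assert record["A"] == [0]
assert record["B"] == [1, 5]  # both values of the repeated column survive

# valuemap(): one value per name, last one wins, as DictReader does today.
valuemap = dict((name, values[-1]) for name, values in record.items())
assert valuemap == {"A": 0, "B": 5, "C": 2, "D": 3, "E": 4, "G": 6}
```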
Shane Green
www.umbrellacode.com
408-692-4666 | shane at umbrellacode.com
Begin forwarded message:
> From: Shane Green <shane at umbrellacode.com>
> Subject: Re: [Python-ideas] csv.DictReader could handle headers more intelligently.
> Date: January 26, 2013 6:39:11 AM PST
> To: "Stephen J. Turnbull" <stephen at xemacs.org>
> Cc: python-ideas at python.org
>
> Okay, I like your point about DictReader having a place with a subset of CSV tables, and agree that, given that definition, it should throw an exception when it's fed something that doesn't conform to this definition. I like that.
>
> One thing, though: the new version would still let you access column data by name as well.
>
> Instead of
> row["timestamp"] == 1359210019.299478
>
> It would be
> row["timestamp"] == [1359210019.299478]
>
> And potentially
> row["timestamp"] == [1359210019.299478, 1359210019.299478]
>
> It could also be accessed as:
> row.headers[0] == "timestamp"
> row.headers[1] == "timestamp"
> row.values[0] == 1359210019.299478
> row.values[1] == 1359210019.299478
>
> Could still provide:
> for name,value in records.iterfirstitems(): # get the first value for each column with a given name.
> - or -
> for name,value in records.iterlastitems(): # get the last value for each column with a given name.
>
> And the exact functionality you have now:
> records.itervaluemaps() # or something… essentially just dict(record.iterlastitems()) for each record
>
> Overkill, but really simple things to add…
>
> The only thing this really adds to the "convenience" of the current DictReader for well-behaved tables is the ability to access values sequentially or by name; beyond that, the only difference would be iterating over a generator method's output instead of the instance itself.
>
>
>
>
> Shane Green
> www.umbrellacode.com
> 408-692-4666 | shane at umbrellacode.com
>
> On Jan 26, 2013, at 5:53 AM, "Stephen J. Turnbull" <stephen at xemacs.org> wrote:
>
>> Shane Green writes:
>>
>>> And while it's true that a dictionary is a dictionary and it works
>>> the way it works, the real point that drives home is that it's an
>>> inappropriate mechanism for dealing ordered rows of sequential
>>> values.
>>
>> Right! So use csv.reader, or csv.DictReader with an explicit
>> fieldnames argument.
>>
>> The point of csv.DictReader with default fieldnames is to take a
>> "well-behaved" table and turn it into a sequence of "poor-man's"
>> objects.
>>
>>> The final point is a simple one: while that CSV file format was
>>> stupid, it was perfectly legal. Something that deals with CSV
>>> content should not be losing any of its content.
>>
>> That's a reasonable requirement.
>>
>>> It also should [not] be barfing or throwing exceptions, by the way.
>>
>> That's not. As long as the module provides classes capable of
>> handling any CSV format (it does), it may also provide convenience
>> classes for special purposes with restricted formats. Those classes
>> may throw exceptions on input that doesn't satisfy the restrictions.
>>
>>> And what about fixing it by replacing it with a class that
>>> implements it correctly, [...]?
>>
>> Doesn't help users who want automatically detected access-by-name.
>> They must have unique field names. (I don't have a use case. I
>> assume the implementer of csv.DictReader did.<wink/>)
>>
>
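The iterfirstitems()/iterlastitems() helpers floated in the forwarded message could be sketched like this (hypothetical names from the thread, not an actual csv API; the record is the same defaultdict-of-lists stand-in for the straw-man class):

```python
from collections import defaultdict

def iterfirstitems(record):
    # Yield (name, first value) for each column name.
    for name, values in record.items():
        yield name, values[0]

def iterlastitems(record):
    # Yield (name, last value): matches what DictReader keeps today.
    for name, values in record.items():
        yield name, values[-1]

# Two columns that share the header "timestamp".
record = defaultdict(list)
for name, value in zip(["timestamp", "timestamp"], [1.0, 2.0]):
    record[name].append(value)

assert dict(iterfirstitems(record)) == {"timestamp": 1.0}
assert dict(iterlastitems(record)) == {"timestamp": 2.0}
```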