Best/better way? (histogram)
Peter Otten
__peter__ at web.de
Wed Jan 28 03:52:49 EST 2009
Bernard Rankin wrote:
> I've got several versions of code here to generate a histogram-esque
> structure from rows in a CSV file.
>
> The basic approach is to use a Dict as a bucket collection to count
> instances of data items.
>
> Other than the try/except(KeyError) idiom for dealing with new bucket
> names, which I don't like as it describes the initial state of a KeyValue
> _after_ you've just described what to do with the existing value, I've
> come up with a few other methods.
>
> What seems like the most reasonable approach?
The simplest. That would be #3, cleaned up a bit:

from collections import defaultdict
from csv import DictReader
from pprint import pprint
from operator import itemgetter

def rows(filename):
    infile = open(filename, "rb")
    for row in DictReader(infile):
        yield row["CATEGORIES"]

def stats(values):
    histo = defaultdict(int)
    for v in values:
        histo[v] += 1
    return sorted(histo.iteritems(), key=itemgetter(1), reverse=True)

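
For readers on Python 3, where dict.iteritems() is gone and csv input is read in text mode, the same two steps might look roughly like the sketch below. Two things here are assumptions and not part of the code above: collections.Counter is swapped in for defaultdict(int), and rows() takes an already-open file object plus a hypothetical column parameter so that the example can run on made-up in-memory data.

```python
import csv
import io
from collections import Counter


def rows(infile, column="CATEGORIES"):
    """Yield one column's value from each row of an open CSV file.

    Assumption: takes a file object, not a filename, so it can be
    fed in-memory data. The column parameter is hypothetical.
    """
    for row in csv.DictReader(infile):
        yield row[column]


def stats(values):
    """Return (value, count) pairs, most frequent first."""
    # Counter replaces the defaultdict(int) loop; most_common()
    # replaces the explicit sorted(..., reverse=True) call.
    return Counter(values).most_common()


# Made-up sample data standing in for a real CSV file.
sample = io.StringIO("CATEGORIES\na|b\na|c\na|b\n")
result = stats(rows(sample))
print(result)
```

With a real file, something like open(filename, newline="") would take the place of the StringIO object.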
Should you need the inner dict (which doesn't seem to offer any additional
information) you can always add another step:

def format(items):
    result = []
    for raw, count in items:
        leaf = raw.rpartition("|")[2]
        result.append((raw, dict(count=count, leaf=leaf)))
    return result

pprint(format(stats(rows("sampledata.csv"))), indent=4, width=60)

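
To make the extra step concrete, here is a small sketch of what it produces for two invented (value, count) pairs. The function is renamed format_items here, which is an assumption made only to avoid shadowing the built-in format(); the input data is made up as well.

```python
def format_items(items):
    """Attach a dict holding the count and the last '|'-separated part."""
    result = []
    for raw, count in items:
        # rpartition splits on the last "|"; index 2 is the part after it.
        # If there is no "|" at all, index 2 is the whole string.
        leaf = raw.rpartition("|")[2]
        result.append((raw, {"count": count, "leaf": leaf}))
    return result


# Invented input: one nested category and one without a separator.
formatted = format_items([("a|b|c", 2), ("plain", 1)])
print(formatted)
```

The leaf is always derivable from the raw key, which is why the inner dict adds no information beyond the count.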
By the way, if you had broken the problem into steps like above you could
have offered four different stats() functions, which would have been a bit
easier to read...
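
As a rough illustration of that point, several counting strategies can sit behind the same interface and be compared side by side. The three variants below are hypothetical stand-ins written for Python 3, not the versions from the original post, and the helper name _ranked is invented for this sketch.

```python
from collections import defaultdict
from operator import itemgetter


def _ranked(histo):
    """Sort (value, count) pairs by count, highest first."""
    return sorted(histo.items(), key=itemgetter(1), reverse=True)


def stats_try_except(values):
    """The try/except KeyError idiom mentioned in the question."""
    histo = {}
    for v in values:
        try:
            histo[v] += 1
        except KeyError:
            histo[v] = 1
    return _ranked(histo)


def stats_get(values):
    """dict.get() with a default of 0 for unseen buckets."""
    histo = {}
    for v in values:
        histo[v] = histo.get(v, 0) + 1
    return _ranked(histo)


def stats_defaultdict(values):
    """defaultdict(int) creates missing buckets with a count of 0."""
    histo = defaultdict(int)
    for v in values:
        histo[v] += 1
    return _ranked(histo)


# Invented data; every variant should produce the same ranking.
data = ["x", "y", "x", "z", "x", "y"]
variants = (stats_try_except, stats_get, stats_defaultdict)
results = [f(data) for f in variants]
print(results[0])
```

Because each variant only has to accept an iterable and return the ranked pairs, the CSV reading and the output formatting stay untouched when one is swapped for another.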
Peter