Best/better way? (histogram)
Bernard Rankin
berankin99 at yahoo.com
Wed Jan 28 02:02:58 EST 2009
Hello,
I've got several versions of code here to generate a histogram-esque structure from rows in a CSV file.
The basic approach is to use a dict as a bucket collection to count instances of data items.
Other than the try/except(KeyError) idiom for dealing with new bucket names, which I don't like because it describes the initial state of a key's value _after_ you've just described what to do with the existing value, I've come up with a few other methods.
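For reference, a minimal sketch of the try/except(KeyError) idiom referred to above. The function name `count_with_try_except` and the sample list are made up for illustration and are not part of the script below.

```python
def count_with_try_except(items):
    """Count occurrences of each item using the try/except KeyError idiom."""
    counts = {}
    for item in items:
        try:
            # First: what to do with a bucket that already exists...
            counts[item] += 1
        except KeyError:
            # ...and only afterwards: the initial state of a new bucket.
            counts[item] = 1
    return counts


sample = ["a", "b", "a", "c", "a"]  # hypothetical data
result = count_with_try_except(sample)
print(result)
```

Note how the initial value only appears in the except branch, after the update has already been written; that ordering is the objection raised above.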
Which seems like the most reasonable approach?
Do you have any other ideas?
Is the try/except(KeyError) idiom really the best?
In the code below you will see several 4-line groups of code. Within each group, the n-th line belongs to the n-th solution to the problem. (Cases 1 & 2 do differ from cases 3 & 4 in the final outcome.)
Thank you
:)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
from collections import defaultdict
from csv import DictReader
from pprint import pprint
dataFile = open("sampledata.csv")
dataRows = DictReader(dataFile)
catagoryStats = defaultdict(lambda : {'leaf' : '', 'count' : 0})
#catagoryStats = {}
#catagoryStats = defaultdict(int)
#catagoryStats = {}
for row in dataRows:
    catagoryRaw = row['CATEGORIES']
    catagoryLeaf = catagoryRaw.split('|').pop()
    ## csb => Catagory Stats Bucket
    ## multi-statement lines are used for ease of method switching.
    csb = catagoryStats[catagoryRaw]; csb['count'] += 1; csb['leaf'] = catagoryLeaf
    #csb = catagoryStats.setdefault(catagoryRaw, {'leaf' : '', 'count' : 0}); csb['count'] += 1; csb['leaf'] = catagoryLeaf
    #catagoryStats[catagoryRaw] += 1
    #catagoryStats[catagoryRaw] = catagoryStats.get(catagoryRaw, 0) + 1
catagoryStatsSorted = list(catagoryStats.items())
catagoryStatsSorted.sort(key=lambda itemtuple: itemtuple[1]['count'], reverse=1)
#catagoryStatsSorted.sort(key=lambda itemtuple: itemtuple[1]['count'], reverse=1)
#catagoryStatsSorted.sort(key=lambda itemtuple: itemtuple[1], reverse=1)
#catagoryStatsSorted.sort(key=lambda itemtuple: itemtuple[1], reverse=1)
pprint(catagoryStatsSorted, indent=4, width=60)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
sampledata.csv
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
CATEGORIES,SKU
"computers|laptops|accessories",12345
"computers|laptops|accessories",12345
"computers|laptops|accessories",12345
"computers|servers|accessories",12345
"computers|servers|accessories",12345
"computers|servers|accessories",12345
"computers|servers|accessories",12345
"computers|servers|accessories",12345
"toys|really|super_fun",12345
"toys|really|super_fun",12345
"toys|really|super_fun",12345
"toys|really|not_at_all_fun",12345
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
output: (in case #1)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In [1]: %run catstat.py
[   (   'computers|servers|accessories',
        {'count': 5, 'leaf': 'accessories'}),
    (   'toys|really|super_fun',
        {'count': 3, 'leaf': 'super_fun'}),
    (   'computers|laptops|accessories',
        {'count': 3, 'leaf': 'accessories'}),
    (   'toys|really|not_at_all_fun',
        {'count': 1, 'leaf': 'not_at_all_fun'})]
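
For comparison with the case #1 output, here is a rough sketch of what cases 3 & 4 produce: plain integer counts per category, with no 'leaf' entry. It assumes a shortened in-memory stand-in for sampledata.csv, and the names `SAMPLE`, `count_categories` and `plain_counts` are invented for this illustration.

```python
import io
from collections import defaultdict
from csv import DictReader

# Shortened, in-memory stand-in for sampledata.csv (an assumption for this sketch).
SAMPLE = '''CATEGORIES,SKU
"computers|laptops|accessories",12345
"computers|laptops|accessories",12345
"computers|servers|accessories",12345
"computers|servers|accessories",12345
"computers|servers|accessories",12345
"toys|really|not_at_all_fun",12345
'''


def count_categories(csv_text):
    """Return (category, count) pairs, highest count first (the case 3 approach)."""
    counts = defaultdict(int)  # missing buckets start at 0
    for row in DictReader(io.StringIO(csv_text)):
        counts[row['CATEGORIES']] += 1
    return sorted(counts.items(), key=lambda itemtuple: itemtuple[1], reverse=True)


plain_counts = count_categories(SAMPLE)
for category, count in plain_counts:
    print(category, count)
```

Each bucket here is a bare integer, so the sort key is `itemtuple[1]` rather than `itemtuple[1]['count']`, which matches the last two sort lines in the script.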