flattening/rolling up/aggregating a large sorted text file

Eddie Corns eddie at holyrood.ed.ac.uk
Thu Mar 22 17:10:13 CET 2007

adtvff at yahoo.com writes:


>Given a large ascii file (delimited or fixed width) with one ID field
>and dimensions/measures fields, sorted by dimensions, I'd like to
>"flatten" or "rollup" the file by creating new columns: one for each
>combination of dimension level, and summing up measures over all
>records for a given ID.

>If the wheel has already been invented, great, please point me in the
>right direction. If not, please share some pointers on how to think
>about this problem in order to write efficient code.

>Is a hash with dimension level combinations a good approach, with
>values reset at each new ID level?

>I know MySQL, Oracle etc. will do this, but they all have a cap on #
>of columns allowed. SAS will allow unlimited columns, but I don't own

>001, blue, square,  4
>001, red , circle,    5
>001, red,  circle,    6

>ID, blue_circle, blue_square, red_circle, red_square
>002 ...

Something like:

import sys
from sets import Set   # the built-in set type does the same job on 2.4+

ids  = {}     # ID -> {colour_shape: running total}
keys = Set()  # every colour_shape combination actually seen

for line in sys.stdin:
    ID,COL,SHAPE,VAL = [s.strip() for s in line.split(',')]
    key = '%s_%s' % (COL,SHAPE)
    keys.add(key)
    cols = ids.setdefault(ID, {})   # first record for this ID gets a fresh dict
    cols[key] = cols.get(key, 0) + int(VAL)

skeys = sorted(keys)
print 'id', ',', ','.join(skeys)
for id in sorted(ids):
    print id, ',', ', '.join([str(ids[id].get(k, 0)) for k in skeys])

Doesn't keep all possible keys, just those that are actually used; the
sorted() calls keep the column and row order stable.
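For concreteness, here is the same hash-based rollup as a self-contained
sketch, with the OP's sample rows inlined instead of read from stdin so it
can be run directly (the data and column names are just the example above):

```python
# Roll up (ID, colour, shape, value) rows into one row per ID,
# with one column per colour_shape combination actually seen.
rows = [
    ('001', 'blue', 'square', 4),
    ('001', 'red',  'circle', 5),
    ('001', 'red',  'circle', 6),
]

ids  = {}      # ID -> {colour_shape: summed value}
keys = set()   # every colour_shape combination seen

for ID, col, shape, val in rows:
    key = '%s_%s' % (col, shape)
    keys.add(key)
    totals = ids.setdefault(ID, {})
    totals[key] = totals.get(key, 0) + val

# Sorted keys give a stable column order; missing combinations become 0.
header = ['id'] + sorted(keys)
table  = [[ID] + [ids[ID].get(k, 0) for k in sorted(keys)]
          for ID in sorted(ids)]

print(','.join(header))
for row in table:
    print(','.join(str(c) for c in row))
```

On the three sample records this emits one row for ID 001, with the two
red circle values summed to 11.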

Incidentally, I don't think you could do it in SQL at all in this wide
form, but you could do it in a more vertical fashion, e.g.
001, red, circle, 11
001, blue, square, 4
002, red, rhombus, 99
etc.
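That vertical shape is essentially a GROUP BY over (ID, dimensions): in
Python terms, a single dict keyed on the whole (ID, colour, shape) tuple.
A minimal sketch, again assuming the sample rows from the question:

```python
# Vertical rollup: one output row per (ID, colour, shape), values summed.
rows = [
    ('001', 'blue', 'square', 4),
    ('001', 'red',  'circle', 5),
    ('001', 'red',  'circle', 6),
    ('002', 'red',  'rhombus', 99),
]

totals = {}
for ID, col, shape, val in rows:
    k = (ID, col, shape)
    totals[k] = totals.get(k, 0) + val

for ID, col, shape in sorted(totals):
    print('%s, %s, %s, %d' % (ID, col, shape, totals[(ID, col, shape)]))
```

This sidesteps the column-count cap entirely, at the cost of the consumer
having to pivot the data itself if it really wants one row per ID.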

