Scalable python dict {'key_is_a_string': [count, some_val]}

krishna krishna.k.0001 at gmail.com
Sat Feb 20 01:36:28 EST 2010


I have to manage a couple of dicts holding a huge dataset (larger than
will fit in the memory on my system). Each dict has a key that is a
string (actually a tuple converted to a string) and a two-item list as
its value, with one element of the list being a count related to the
key. At the end, I have to sort the dictionary by the count.

The platform is Linux. I am planning to implement it by setting a
threshold beyond which I write the data out to files (3 columns: 'key
count some_val') and later merging those files. For the merge, I plan
to sort the individual files by the key column and walk through them
with one pointer per file, adding up the counts when entries from two
files match by key. The sorting would be done with the 'sort' command,
so the bottleneck is the 'sort' command.
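
A minimal sketch of the spill-and-merge plan described above, under
several assumptions that are not in the original description: the
columns are tab-separated (a stringified tuple can contain spaces), keys
and values contain no tabs or newlines, each file is sorted in Python
rather than with 'sort', and when a key appears in several files the
some_val from the first file is kept. The function names spill,
read_run and merge_runs are hypothetical.

```python
import heapq
import itertools
import os
import tempfile
from operator import itemgetter


def spill(counts, directory):
    """Write one in-memory dict to a key-sorted file and return its path.

    `counts` maps key -> [count, some_val], as in the subject line.
    Assumption: keys and values contain no tab or newline characters.
    """
    fd, path = tempfile.mkstemp(suffix=".run", dir=directory, text=True)
    with os.fdopen(fd, "w") as f:
        for key in sorted(counts):
            count, some_val = counts[key]
            f.write(f"{key}\t{count}\t{some_val}\n")
    return path


def read_run(path):
    """Yield (key, count, some_val) tuples from one spilled file."""
    with open(path) as f:
        for line in f:
            key, count, some_val = line.rstrip("\n").split("\t")
            yield key, int(count), some_val


def merge_runs(paths):
    """Walk all key-sorted files at once, summing counts for equal keys.

    heapq.merge keeps one pending entry per file, so memory use grows
    with the number of files, not with their size.
    Assumption: for a repeated key, the first file's some_val is kept.
    """
    merged = heapq.merge(*(read_run(p) for p in paths), key=itemgetter(0))
    for key, group in itertools.groupby(merged, key=itemgetter(0)):
        rows = list(group)
        yield key, sum(row[1] for row in rows), rows[0][2]
```

The caller would invoke spill() and clear the dict each time it crosses
the size threshold, then write the output of merge_runs() to a single
file. That file still has to be ordered by the count column, which
could be left to the 'sort' command as planned.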

Any suggestions, comments?

By the way, is there a Linux command that does the merging part?

Thanks,
Krishna




