splitting a large dictionary into smaller ones
python.list at tim.thechases.com
Mon Mar 23 13:57:23 CET 2009
> i have a very large dictionary object that is built from a text file
> that is about 800 MB -- it contains several million keys. ideally i
> would like to pickle this object so that i wouldn't have to parse this
> large file to compute the dictionary every time i run my program.
> however currently the pickled file is over 300 MB and takes a very
> long time to write to disk - even longer than recomputing the
> dictionary from scratch.
> i would like to split the dictionary into smaller ones, containing
> only hundreds of thousands of keys, and then try to pickle them. is
> there a way to easily do this?
While others have suggested databases, they may be overkill,
depending on your needs. Python 2.5+ ships the sqlite3 module, and
even older versions (at least back to 2.0) offer the anydbm module
(renamed "dbm" in 3.0), which lets you create an on-disk
string-to-string dictionary:
  import anydbm
  import csv

  db = anydbm.open("data.db", "c")

  # populate the dictionary from the source file,
  # using "db" as your dictionary
  f = file("800megs.txt")
  data = csv.reader(f, delimiter='\t')
  data.next()  # discard a header row
  for key, value in data:
      db[key] = value
  f.close()

  print db["some key"]
The resulting DB object is a little sparsely documented, but for
the most part it can be treated like a dictionary. The advantage
is that, if the source data doesn't change, you can parse once
and then just use your "data.db" file from there on out.
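The parse-once workflow looks roughly like this; a minimal sketch using
Python 3's dbm module (the renamed anydbm mentioned above), with a
hypothetical cache filename:

```python
import dbm

# First run: flag "c" creates the on-disk file if it doesn't exist.
# "data_cache" is a hypothetical filename for illustration.
db = dbm.open("data_cache", "c")
db["some key"] = "some value"   # dbm persists string/bytes pairs to disk
db.close()

# Later runs: reopen the existing file read-only ("r") instead of
# re-parsing the original 800 MB text file.
db = dbm.open("data_cache", "r")
value = db[b"some key"]         # values come back as bytes
db.close()
print(value.decode())
```

Lookups hit the disk file rather than an in-memory dict, so startup is
immediate at the cost of slower individual reads.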