very large dictionary
Steven D'Aprano
steve at REMOVE-THIS-cybersource.com.au
Sat Aug 2 02:54:17 EDT 2008
On Fri, 01 Aug 2008 00:46:09 -0700, Simon Strobl wrote:
> Hello,
>
> I tried to load a 6.8G large dictionary on a server that has 128G of
> memory. I got a memory error. I used Python 2.5.2. How can I load my
> data?
How do you know the dictionary takes 6.8G?
I'm going to guess an answer to my own question. In a later post, Simon
wrote:
[quote]
I had a file bigrams.py with a content like below:
bigrams = {
", djy" : 75 ,
", djz" : 57 ,
", djzoom" : 165 ,
", dk" : 28893 ,
", dk.au" : 854 ,
", dk.b." : 3668 ,
...
}
[end quote]
I'm guessing that the file is 6.8G of *text*. How much memory will it
take to import that? I don't know, but probably a lot more than 6.8G. The
compiler has to read the whole file in one giant piece, analyze it,
create all the string and int objects, and only then can it create the
dict. By my back-of-the-envelope calculations, the pointers alone will
require about 5GB, never mind the objects they point to.
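(A rough sketch of that arithmetic; the 25-bytes-per-line average is an
assumption, not something I've measured:)
entries = 6.8e9 / 25             # roughly 270 million key/value pairs
pointer_bytes = entries * 2 * 8  # one key and one value pointer per entry, 8 bytes each on a 64-bit build
print(pointer_bytes / 1e9)       # about 4.4, i.e. roughly 5GB of bare pointers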
I suggest trying to store your data as data, not as Python code. Create a
text file "bigrams.txt" with one key/value per line, like this:
djy : 75
djz : 57
djzoom : 165
dk : 28893
...
Then load it like this:
bigrams = {}
for line in open('bigrams.txt', 'r'):
    # build the dict one entry at a time instead of one huge literal
    key, value = line.split(':')
    bigrams[key.strip()] = int(value.strip())
This will be slower, but because it only needs to read the data one line
at a time, it might succeed where trying to slurp all 6.8G in one piece
will fail.
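If bigrams.py really does have one entry per line, as in the extract above,
the one-off conversion to bigrams.txt can be done the same way, a line at a
time. A sketch, assuming every entry follows the  "key" : value ,  layout
shown (entries that deviate from it would need extra handling):
out = open('bigrams.txt', 'w')
for line in open('bigrams.py', 'r'):
    line = line.strip()
    # skip the "bigrams = {" and "}" lines, keep only the entries
    if not line.endswith(','):
        continue
    key, _, value = line.rstrip(',').rpartition(':')
    out.write('%s : %s\n' % (key.strip().strip('"'), value.strip()))
out.close()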
--
Steven