[Tutor] managing memory large dictionaries in python

Oscar Benjamin oscar.j.benjamin at gmail.com
Wed Oct 17 00:30:12 CEST 2012


On 16 October 2012 21:03, Prasad, Ramit <ramit.prasad at jpmorgan.com> wrote:
> Abhishek Pratap wrote:
>> Sent: Tuesday, October 16, 2012 11:57 AM
>> To: tutor at python.org
>> Subject: [Tutor] managing memory large dictionaries in python
>>
>> Hi Guys
>>
>> For my problem I need to store 400-800 million 20-character keys in a
>> dictionary and do counting. This data structure takes about 60-100 GB
>> of RAM.
>> I am wondering if there are slick ways to map the dictionary to a file
>> on disk and not store it in memory but still access it as a dictionary
>> object. Speed is not the main concern in this problem and persistence
>> is not needed, as the counting will only be done once on the data. We
>> want the script to run on smaller-memory machines if possible.
>>
>> I did think about databases for this, but intuitively it looks like
>> overkill because for each key you have to first check whether it is
>> already present and increase its count by 1, and if not then insert
>> the key into the database.
>>
>> Just want to take your opinion on this.
>>
>> Thanks!
>> -Abhi
>
> I do not think that a database would be overkill for this type of task.

Neither do I, but I think there are also ways to make it more memory
efficient without the use of disk storage. I suspect that the bulk of
the memory usage is for the strings, and there are often ways to
store them more compactly in memory.

For example, are you using Python 3? You might be able to reduce the
memory consumption by 25-50% by using byte strings instead of unicode
strings (assuming that the keys are ASCII).
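
As a rough check (exact sizes vary between Python versions and
platforms) you can compare the per-object sizes directly, e.g.:

import sys

key = 'ABCDEFGHIJKLMNOPQRST'                # an example 20-character ASCII key
print(sys.getsizeof(key))                   # size of the str object in bytes
print(sys.getsizeof(key.encode('ascii')))   # size of the equivalent bytes object

The counting loop itself would then just use key.encode('ascii') as
the dictionary key.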

Further gains are possible if your strings only use a subset of ASCII
characters, as is the case for e.g. hex strings. If you can map the
strings reversibly to integers with a function (rather than a dict),
then you should be able to achieve a significant reduction in memory
usage by using ints as the dictionary keys.
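
For hex strings the mapping is just int(key, 16) one way and a
formatted string the other. A minimal sketch (read_keys() here is a
hypothetical stand-in for wherever your keys actually come from):

counts = {}
for key in read_keys():
    n = int(key, 16)    # reversible mapping: 20-char hex string -> int
    counts[n] = counts.get(n, 0) + 1

# To recover the original key from an integer n (assuming lowercase hex):
# key = '%020x' % n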

A 20-character hex key can theoretically be stored in 80 bits (4 bits
per hex digit), or 10 bytes. If you could hit this limit then 400-800
million keys would only need 4-8 GB of memory for the keys themselves
(which may still be too much).

> Your process may be trivial but the amount of data it has to manage is
> not trivial. You can use a simple database like SQLite. Otherwise, you
> could create a file for each key and update the count in there. It will
> run on a small amount of memory but will be slower than using a db.
>
> # Pseudocode
> key = get_key()
> filename = os.path.join(directory, key)
> if os.path.exists(filename):
>     # read the current count and write back count + 1
>     with open(filename) as f:
>         count = int(f.read())
>     with open(filename, 'w') as f:
>         f.write(str(count + 1))
> else:
>     with open(filename, 'w') as f:
>         f.write('1')

Depending on the file system, this could require a large amount of
disk space. If each file needed 4096 bytes then 400-800 million files
would come to around 2-3 TB with this solution.

>
> Given that SQLite is included in Python and is easy to use, I would just
> use that.

I would also try this. A standard database solution will likely give
the fewest headaches in the long run.
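
In case it is useful, a minimal sketch of the counting loop with the
sqlite3 module might look something like this (read_keys() is again a
hypothetical stand-in for your real input):

import sqlite3

conn = sqlite3.connect('counts.db')
conn.execute('CREATE TABLE IF NOT EXISTS counts (key TEXT PRIMARY KEY, n INTEGER)')

for key in read_keys():
    # insert the key with a count of 0 the first time it is seen, then increment
    conn.execute('INSERT OR IGNORE INTO counts VALUES (?, 0)', (key,))
    conn.execute('UPDATE counts SET n = n + 1 WHERE key = ?', (key,))

conn.commit()

You would probably want to commit in batches (say every million keys
or so) rather than holding one huge transaction open for the whole
run.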


Oscar

