[Tutor] managing memory large dictionaries in python
emile at fenx.com
Wed Oct 17 00:21:43 CEST 2012
On 10/16/2012 01:03 PM, Prasad, Ramit wrote:
> Abhishek Pratap wrote:
>> Sent: Tuesday, October 16, 2012 11:57 AM
>> To: tutor at python.org
>> Subject: [Tutor] managing memory large dictionaries in python
>> Hi Guys
>> For my problem I need to store 400-800 million 20-character keys in a
>> dictionary and do counting. This data structure takes about 60-100 GB
>> of RAM.
>> I am wondering if there are slick ways to map the dictionary to a file
>> on disk and not store it in memory but still access it as dictionary
>> object. Speed is not the main concern in this problem and persistence
>> is not needed as the counting will only be done once on the data. We
>> want the script to run on smaller memory machines if possible.
>> I did think about databases for this but intuitively it looks like
>> overkill, because for each key you have to first check whether it is
>> already present and increase the count by 1, and if not then insert
>> the key into the database.
>> Just want to take your opinion on this.
> I do not think that a database would be overkill for this type of task.
> Your process may be trivial but the amount of data it has to manage is
> not trivial. You can use a simple database like SQLite. Otherwise, you
> could create a file for each key and update the count in there. It will
> run on a small amount of memory but will be slower than using a db.
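For reference, the SQLite suggestion might look roughly like the sketch below. It is only an illustration, not anything posted in the thread: the table name counts, the function names and the batch size are all made up, and it has not been tried at anything like the 400-800 million key scale. The "check whether the key exists, then insert or increment" step collapses into two statements per batch, so the script itself never has to look a key up.

```python
import sqlite3


def _flush_batch(conn, batch):
    """Apply one batch of keys inside a single transaction."""
    with conn:  # commits on success, rolls back on error
        # Create a zero-count row for keys not seen before; existing rows are left alone.
        conn.executemany(
            "INSERT OR IGNORE INTO counts (key, n) VALUES (?, 0)", batch
        )
        # Add one per occurrence, so duplicates inside the batch are counted too.
        conn.executemany("UPDATE counts SET n = n + 1 WHERE key = ?", batch)


def count_keys(keys, db_path=":memory:", batch_size=100000):
    """Count occurrences of each key in an SQLite table instead of a dict.

    db_path and batch_size are illustrative defaults; pass a real file
    path so the table lives on disk rather than in RAM.
    """
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS counts "
        "(key TEXT PRIMARY KEY, n INTEGER NOT NULL)"
    )
    batch = []
    for key in keys:
        batch.append((key,))
        if len(batch) >= batch_size:
            _flush_batch(conn, batch)
            batch = []
    if batch:
        _flush_batch(conn, batch)
    return conn
```

The counts can then be read back with something like conn.execute("SELECT key, n FROM counts"). Batching many keys per transaction is the main thing that keeps this from being painfully slow; one commit per key would be far worse.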
Well, maybe -- depends on how many unique entries exist. Most vanilla
systems are going to crash (or give the appearance thereof) if you end
up with millions of file entries in a directory. If a filesystem-based
answer is sought, I'd consider generating a 16-bit CRC per key and
appending each key to the file named after its CRC (at most 65,536
files), then making a pass over those files, sorting each one and doing
the final counting.
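
A rough sketch of that CRC bucketing idea follows, under some assumptions of my own: keys are ASCII strings without newlines, binascii.crc_hqx from the standard library stands in for "a 16-bit CRC", and the function names, file naming scheme and flush_every threshold are invented for illustration. Keys are buffered and appended in chunks so that no more than one bucket file is open at a time.

```python
import binascii
import os
from collections import defaultdict
from itertools import groupby


def _flush_buckets(pending, out_dir):
    """Append buffered keys to their bucket files, then empty the buffer."""
    for crc, keys in pending.items():
        path = os.path.join(out_dir, "%04x.txt" % crc)
        with open(path, "a") as fh:
            fh.write("\n".join(keys) + "\n")
    pending.clear()


def bucket_keys(keys, out_dir, flush_every=1000000):
    """Append each key to a file named after its 16-bit CRC."""
    os.makedirs(out_dir, exist_ok=True)
    pending = defaultdict(list)
    buffered = 0
    for key in keys:
        # crc_hqx returns a 16-bit value, so there are at most 65,536 buckets.
        crc = binascii.crc_hqx(key.encode("ascii"), 0)
        pending[crc].append(key)
        buffered += 1
        if buffered >= flush_every:
            _flush_buckets(pending, out_dir)
            buffered = 0
    _flush_buckets(pending, out_dir)


def count_buckets(out_dir):
    """Yield (key, count) pairs, holding only one bucket in memory at a time."""
    for name in sorted(os.listdir(out_dir)):
        with open(os.path.join(out_dir, name)) as fh:
            keys = sorted(line.rstrip("\n") for line in fh)
        # After sorting, equal keys are adjacent, so groupby can count them.
        for key, group in groupby(keys):
            yield key, sum(1 for _ in group)
```

Identical keys always hash to the same file, so each bucket can be counted independently. If the keys spread evenly, 800 million of them over 65,536 buckets comes to roughly 12,000 keys per file, which sorts comfortably in memory.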