[Tutor] managing memory large dictionaries in python

Prasad, Ramit ramit.prasad at jpmorgan.com
Tue Oct 16 22:03:22 CEST 2012


Abhishek Pratap wrote:
> Sent: Tuesday, October 16, 2012 11:57 AM
> To: tutor at python.org
> Subject: [Tutor] managing memory large dictionaries in python
> 
> Hi Guys
> 
> For my problem I need to store 400-800 million 20 characters keys in a
> dictionary and do counting. This data structure takes about 60-100 Gb
> of RAM.
> I am wondering if there are slick ways to map the dictionary to a file
> on disk and not store it in memory but still access it as dictionary
> object. Speed is not the main concern in this problem and persistence
> is not needed as the counting will only be done once on the data. We
> want the script to run on smaller memory machines if possible.
> 
> I did think about databases for this but intuitively it looks like a
> overkill coz for each key you have to first check whether it is
> already present and increase the count by 1  and if not then insert
> the key into dbase.
> 
> Just want to take your opinion on this.
> 
> Thanks!
> -Abhi

I do not think that a database would be overkill for this type of task.
Your process may be trivial, but the amount of data it has to manage is
not. You can use a simple database like SQLite. Otherwise, you could
create a file for each key and update the count in there. It will run on
a small amount of memory but will be slower than using a db.

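# Sketch: one file per key, each file holding that key's count
import os

key = get_key()  # placeholder: returns the next key to count
filename = os.path.join(directory, key)  # directory: an existing folder
if os.path.exists(filename):
    # read the current count, then write it back incremented
    with open(filename) as f:
        count = int(f.read())
    with open(filename, 'w') as f:
        f.write(str(count + 1))
else:
    with open(filename, 'w') as f:
        f.write('1')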

Given that SQLite is included in Python and is easy to use, I would just
use that.
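
For comparison, here is a minimal sketch of the same check-then-increment
logic with the sqlite3 module from the standard library. The table name
"counts", its columns, and the helper names count_keys and get_count are
made up for illustration, and committing only once at the end is an
assumption that you may want to revisit for a run this large.

```python
import sqlite3


def count_keys(keys, db_path=":memory:"):
    """Count occurrences of each key in an SQLite table.

    Returns the open connection so the counts can be queried afterwards.
    Pass a file path as db_path to keep the table on disk instead of in RAM.
    """
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS counts "
        "(key TEXT PRIMARY KEY, n INTEGER NOT NULL)"
    )
    for key in keys:
        # Try to increment an existing row first ...
        cur = conn.execute(
            "UPDATE counts SET n = n + 1 WHERE key = ?", (key,)
        )
        if cur.rowcount == 0:
            # ... and insert the key with a count of 1 if no row matched.
            conn.execute(
                "INSERT INTO counts (key, n) VALUES (?, 1)", (key,)
            )
    conn.commit()
    return conn


def get_count(conn, key):
    """Return the stored count for key, or 0 if the key was never seen."""
    row = conn.execute(
        "SELECT n FROM counts WHERE key = ?", (key,)
    ).fetchone()
    return row[0] if row else 0
```

Called as count_keys(my_keys, "counts.db"), this keeps the table in a file
on disk, so memory use should stay small no matter how many distinct keys
there are. The "is the key already present?" check is a single UPDATE whose
rowcount tells you whether an INSERT is needed.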


-Ramit



