[Tutor] managing memory large dictionaries in python

Sat Oct 20 03:03:27 CEST 2012

On Oct 16, 2012, at 12:57 PM, Abhishek Pratap <abhishek.vit at gmail.com> wrote:

> Hi Guys
> 
> For my problem I need to store 400-800 million 20 characters keys in a
> dictionary and do counting. This data structure takes about 60-100 Gb
> of RAM.
> I am wondering if there are slick ways to map the dictionary to a file
> on disk and not store it in memory but still access it as dictionary
> object. Speed is not the main concern in this problem and persistence
> is not needed as the counting will only be done once on the data. We
> want the script to run on smaller memory machines if possible.
> 
> I did think about databases for this but intuitively it looks like a
> overkill coz for each key you have to first check whether it is
> already present and increase the count by 1  and if not then insert
> the key into dbase.
> 
> Just want to take your opinion on this.
> 
> Thanks!
> -Abhi
> _______________________________________________
> Tutor maillist  -  Tutor at python.org
> To unsubscribe or change subscription options:
> http://mail.python.org/mailman/listinfo/tutor

With apologies for coming to this late.  Just a couple of comments -

1)  If the keys were sorted before the counting operation was to take place, the problem could be run in successive batches in as small a foot print as desired.

2)  Sorting in limited memory space was a highly developed art back in the early days of IBM mainframe computing (when one had to use tape rather than disk for temp store and 64k, yes "k", was a lot of memory).  If you look at Knuth, Volume III: Sorting and Searching you will find several algorithms that would allow the sort to also run in as small a foot print as needed.

Possibly not interesting, but just saying…

Bill