splitting a large dictionary into smaller ones

Mon Mar 23 07:22:03 EDT 2009

As others have said, a database is probably the right answer.  There, 
the data is kept on disk, and only a few records at a time are read for 
each access, with modification transactions usually being synchronous.

However, there are use cases where your approach makes more sense.  And 
it shouldn't be too hard to do what you describe.  So it's probably 
worth implementing.

Use case:  suppose most accesses are to a 'local' set of keys, in some 
sense.  For example, if the keys were serial numbers, with the most 
recently created serial numbers likely to be accessed much more often 
than those a few years old.  Then if you split the serial numbers up in 
the most obvious way (using the high digits of the number, for example 
to index which external file the data is in), you'd find nearly all 
accesses would be to the file with the highest index.

So you create a wrapper class with an init and two member methods, read 
and write, plus a flush() method.  read takes a key, and returns a 
value, while write takes both key and value, and saves it away.  The 
init method creates a list of dictionaries (initially of length zero) 
and saves some information about the location and naming of the actual 
pickle files (or whatever you're using).

When a read or write method is called, you do the divide (or whatever 
filter you use) to find an index into the list of dictionaries.  If this 
index is beyond the end of the list, extend the list with None till it's 
big enough.  Now, if the particular entry in the list is None, you need 
to read in the corresponding pickle file, and add the created dictionary 
object into this index of the list.

At this point, you read or write your dictionary as usual.  The only 
other tricky thing is to save a 'dirty bit' for each loaded dictionary.  
When your wrapper's flush() is called, you pickle each of the dirty 
dictionaries back to the file system.

Lots of possible variants.  For example, you could use a dictionary of 
dictionary objects instead of a list of dictionaries.  That would be 
useful if the list will be large, and sparse.

The main thing I don't know here is how to overload the [] operator, if 
you need that particular syntax for the read and write methods.

per wrote:
> hi all,
>
> i have a very large dictionary object that is built from a text file
> that is about 800 MB -- it contains several million keys.  ideally i
> would like to pickle this object so that i wouldnt have to parse this
> large file to compute the dictionary every time i run my program.
> however currently the pickled file is over 300 MB and takes a very
> long time to write to disk - even longer than recomputing the
> dictionary from scratch.
>
> i would like to split the dictionary into smaller ones, containing
> only hundreds of thousands of keys, and then try to pickle them. is
> there a way to easily do this? i.e. is there an easy way to make a
> wrapper for this such that i can access this dictionary as just one
> object, but underneath it's split into several? so that i can write
> my_dict[k] and get a value, or set my_dict[m] to some value without
> knowing which sub dictionary it's in.
>
> if there aren't known ways to do this, i would greatly apprciate any
> advice/examples on how to write this data structure from scratch,
> reusing as much of the dict() class as possible.
>
> thanks.
>
> large_dict[a]
>
>