Request for comments on a design
Peter Otten
__peter__ at web.de
Sat Oct 23 04:50:53 EDT 2010
TomF wrote:
> I have a program that manipulates lots of very large indices, which I
> implement as bit vectors (via the bitarray module). These are too
> large to keep all of them in memory so I have to come up with a way to
> cache and load them from disk as necessary. I've been reading about
> weak references and it looks like they may be what I want.
>
> My idea is to use a WeakValueDictionary to hold references to these
> bitarrays, so Python can decide when to garbage collect them. I then
> keep a key-value database of them (via bsddb) on disk and load them
> when necessary. The basic idea for accessing one of these indexes is:
>
> _idx_to_bitvector_dict = weakref.WeakValueDictionary()
A WeakValueDictionary only keeps a bitarray alive for as long as something else
in your program holds a strong reference to it, so in a well-written script that
cache will be almost empty. You should compare the weakref approach against a
least-recently-used (LRU) caching strategy that keeps a fixed number of
bitarrays alive. In newer Pythons you can use collections.OrderedDict to
implement an LRU cache, or use the functools.lru_cache decorator.
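Untested sketch of the OrderedDict variant; the cache size is arbitrary, and
bitvector_from_db/cPickle are taken from your code below:

import collections
import cPickle

CACHE_SIZE = 100  # tune to whatever fits in memory
_lru_cache = collections.OrderedDict()

def retrieve_index(idx):
    try:
        bv = _lru_cache.pop(idx)  # cache hit; remove so it can be re-appended
    except KeyError:
        bv = cPickle.loads(bitvector_from_db[idx])  # miss; load from bsddb
    _lru_cache[idx] = bv  # (re)insert as most recently used
    if len(_lru_cache) > CACHE_SIZE:
        _lru_cache.popitem(last=False)  # evict the least recently used entry
    return bv

Unlike the weak dict, this keeps up to CACHE_SIZE bitarrays alive even when
nothing else in the program references them.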
> def retrieve_index(idx):
>     if idx in _idx_to_bitvector_dict and _idx_to_bitvector_dict[idx] is not None:
>         return _idx_to_bitvector_dict[idx]
In a multi-threaded environment the above may still return None or raise a
KeyError: the bitarray can be garbage-collected between the "in" test and the
subsequent lookups. A lookup that goes through a single get() call, sketched
below the quoted code, avoids that window.
>     else:  # it's been gc'd
>         bv_str = bitvector_from_db[idx]  # Load from bsddb
>         bv = cPickle.loads(bv_str)  # Deserialize the string
>         _idx_to_bitvector_dict[idx] = bv  # Re-initialize the weak dict element
>         return bv
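Something like this avoids the race by funnelling the lookup through a single
get() call (untested, same names as in your code):

def retrieve_index(idx):
    bv = _idx_to_bitvector_dict.get(idx)
    if bv is None:  # never cached, or already garbage-collected
        bv = cPickle.loads(bitvector_from_db[idx])
        _idx_to_bitvector_dict[idx] = bv
    return bv

There is still no locking around the bsddb access, but at least the cache
lookup itself can no longer blow up.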
>
> Hopefully that's not too confusing. Comments on this approach? I'm
> wondering whether the weakref stuff isn't duplicating some of the
> caching that bsddb might be doing.
Even if bsddb caches the raw pickled string somewhere, keeping the deserialised
bitarray around still saves the overhead of a cPickle.loads() call on every
access.
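Once functools.lru_cache is available to you, the decorator route mentioned
above reduces the loader to the following (untested sketch; under Python 3
pickle replaces cPickle):

import functools
import pickle

@functools.lru_cache(maxsize=128)  # keeps the 128 most recently used bitarrays alive
def retrieve_index(idx):
    return pickle.loads(bitvector_from_db[idx])  # load and deserialize only on a miss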
Peter