[Tutor] Shelve & immutable objects

eryksun eryksun at gmail.com
Thu Jan 2 17:00:46 CET 2014


On Thu, Jan 2, 2014 at 4:15 AM, Keith Winston <keithwins at gmail.com> wrote:
> Thanks for all this Eryksun (and Mark!), but... I don't understand why you
> brought gdbm in? Is it something underlying shelve, or a better approach, or
> something else? That last part really puts me in a pickle, and I don't
> understand why.

A Shelf is backed by a container with the following mapping methods:

    keys
    __contains__
    __getitem__
    __setitem__
    __delitem__
    __len__

Shelf will also try to call `close` and `sync` on the container if
available. For some reason no one has made Shelf into a context
manager (i.e. __enter__ and __exit__), so remember to close() it.
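In the meantime, `contextlib.closing` can guarantee the close() for you. A minimal sketch, using a plain dict as the backing store (the `backing`/`nums` names are just illustrative):

```python
import contextlib
import shelve

backing = {}  # plain dict standing in for a dbm database
# contextlib.closing calls close() on exit, standing in for the
# missing context-manager support on Shelf.
with contextlib.closing(shelve.Shelf(backing)) as sh:
    sh['nums'] = [1, 2, 3]  # pickled into the backing dict
# sh.close() has run; the pickled entry survives in `backing`
```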

For demonstration purposes, you can use a dict with Shelf:

    >>> sh = shelve.Shelf(dict={})
    >>> sh['alist'] = [1,2,3]

The mapping is referenced in the (badly named) `dict` attribute:

    >>> sh.dict
    {b'alist': b'\x80\x03]q\x00(K\x01K\x02K\x03e.'}

Keys are encoded as bytes (UTF-8 by default) and values are serialized
with pickle. Both steps exist to support dbm databases, which require
keys and values to be bytes.
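You can see both transformations by pulling the raw entry back out of the container and undoing them by hand (a sketch; `raw` and `value` are just illustrative names):

```python
import pickle
import shelve

sh = shelve.Shelf({})
sh['alist'] = [1, 2, 3]

raw = sh.dict['alist'.encode('utf-8')]  # key was UTF-8 encoded to bytes
value = pickle.loads(raw)               # value was stored as a pickle
```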

shelve.open returns an instance of shelve.DbfilenameShelf, which is a
subclass of Shelf specialized to open a dbm database.
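A minimal sketch of the round trip through shelve.open (the `demo` path under a temporary directory is purely for illustration):

```python
import os
import shelve
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'demo')

db = shelve.open(path)        # creates a dbm database file on disk
db['point'] = (1.0, 2.0)
db.close()

db2 = shelve.open(path)       # reopen: the data persisted
point = db2['point']
db2.close()
```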

Here's an overview of Unix dbm databases that Google turned up:

http://www.unixpapa.com/incnote/dbm.html

Note the size restrictions for keys and values in ndbm, which gdbm
doesn't have. Using gdbm lifts the restriction on the size of pickled
objects (the shelve docs vaguely suggest keeping them "fairly small").
Unfortunately gdbm isn't always available.

On my system, dbm defaults to creating a _gdbm.gdbm database, where
_gdbm is an extension module that wraps the GNU gdbm library (e.g.
libgdbm.so.3).
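You can ask the dbm package which backend it actually chose on your system. A small probe (the `probe` filename is arbitrary):

```python
import dbm
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'probe')
db = dbm.open(path, 'c')      # dbm picks the best available backend
db.close()

# whichdb inspects the created file(s) and reports the module name,
# e.g. 'dbm.gnu' when GNU gdbm is installed.
backend = dbm.whichdb(path)
```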

You can use a different database with Shelf (or a subclass), so long
as it has the required methods. For example, shelve.BsdDbShelf is
available for use with pybsddb (Debian package "python3-bsddb3"). It
exposes the bsddb3 database methods `first`, `next`, `previous`,
`last` and `set_location`.
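As a sketch of how little the container needs to provide: a hypothetical dict subclass that merely counts writes can be plugged straight into Shelf, since dict already supplies the required mapping methods.

```python
import shelve

class CountingStore(dict):
    """Hypothetical backing store: a dict that counts writes."""
    def __init__(self):
        super().__init__()
        self.writes = 0
    def __setitem__(self, key, value):
        self.writes += 1          # observe each write from Shelf
        super().__setitem__(key, value)

store = CountingStore()
sh = shelve.Shelf(store)
sh['a'] = [1]
sh['b'] = [2]
sh.close()   # dict has no sync/close; Shelf ignores their absence
```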

> Separately, I'm also curious about how to process big files.
> ...
> I'm also beginning to think about how to speed it up:

I defer to Steven's sage advice.

Look into databases such as sqlite3, and into numpy for numeric
arrays; also add the multiprocessing and concurrent.futures modules to
your todo list. Even if you know C/C++, I suggest using Cython to
create CPython extension modules.

http://www.cython.org
