[Tutor] Conditional attribute access / key access

Steven D'Aprano steve at pearwood.info
Tue Aug 31 11:47:13 CEST 2010


On Tue, 31 Aug 2010 12:44:08 am Knacktus wrote:
> Hey everyone,
>
> I have a huge number of data items coming from a database. 

Huge?

Later in this thread, you mentioned 200,000 items overall. That might 
be "huge" to you, but it isn't to Python. Here's an example:

class K(object):
    def __init__(self):
        self.info = {"id": id(self),
                    "name": "root " + str(id(self)), 
                    "children_ids": [2*id(self), 3*id(self)+1]}


And the size:

>>> import sys
>>> k = K()
>>> sys.getsizeof(k)
28
>>> sys.getsizeof(k.info)
136
>>> L = [K() for _ in xrange(200000)]
>>> sys.getsizeof(L)
835896

The sizes given are in bytes. So 200,000 instances of this class (28 
bytes each, plus 136 bytes for each info dict), plus the list to hold 
them, would take approximately 34 megabytes. An entry-level PC these 
days has 1000 megabytes of memory. "Huge"? Not even close.
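
To make the arithmetic explicit, here is a small sketch that recomputes 
the estimate from the figures in the session above. The sizes are the 
ones printed on that particular build and will differ on other Python 
versions and platforms; the variable names are mine, for illustration 
only.

```python
# Sizes in bytes, copied from the interpreter session above.
# They are specific to that build, so treat them as sample inputs.
instance_size = 28        # sys.getsizeof(k)
info_dict_size = 136      # sys.getsizeof(k.info)
list_size = 835896        # sys.getsizeof(L)
count = 200000            # number of instances

# Each instance carries one info dict; the list is counted once.
total_bytes = count * (instance_size + info_dict_size) + list_size
total_megabytes = total_bytes / 1e6

print("%d bytes, about %.1f MB" % (total_bytes, total_megabytes))
```

Note that sys.getsizeof only reports the object itself, not the strings, 
ints and lists stored inside the dict, so this is a rough lower bound 
rather than an exact figure.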

Optimizing with __slots__ is premature. Perhaps if you had 1000 times 
that many instances, then it might be worthwhile.



> So far 
> there're no restrictions about how to model the items. They can be
> dicts, objects of a custom class (preferable with __slots__) or
> namedTuple.
>
> Those items have references to each other using ids.

That approach sounds slow and ponderous to me. Why don't you just give 
items direct references to each other, instead of linking them 
indirectly via ids?

I presume you're doing something like this:

ids = {0: None}  # Map IDs to objects.
a = Part(0)
ids[1] = a
b = Part(1)  # b is linked to a via its ID 1.
ids[2] = b
c = Part(2)  # c is linked to b via its ID 2.
ids[3] = c

(only presumably less painfully).


If that's what you're doing, you should dump the ids and just do this:

a = Part(None)
b = Part(a)
c = Part(b)

Storing references to objects in Python is cheap -- it's only a pointer. 
Using indirection via an ID you manage yourself is a pessimation, not 
an optimization: it means more code, slower execution, and more memory 
too (because the integer IDs themselves are pointers to 12-byte 
objects, not 4-byte ints).
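
As a rough illustration of the difference, here is the same three-part 
chain walked both ways. The Part and PartById classes and the two depth 
functions are hypothetical, made up for this sketch; your real classes 
will look different.

```python
class Part(object):
    """Hypothetical part holding a direct reference to its parent."""
    def __init__(self, parent):
        self.parent = parent


class PartById(object):
    """Hypothetical part holding only the ID of its parent."""
    def __init__(self, parent_id):
        self.parent_id = parent_id


def depth(part):
    # Follow the references directly: one attribute lookup per step.
    steps = 0
    while part.parent is not None:
        part = part.parent
        steps += 1
    return steps


def depth_by_id(part, ids):
    # Same walk, but every step also needs a dict lookup in the ID table.
    steps = 0
    while ids[part.parent_id] is not None:
        part = ids[part.parent_id]
        steps += 1
    return steps


# Direct references.
a = Part(None)
b = Part(a)
c = Part(b)

# The same chain using a hand-managed ID table.
ids = {0: None}
x = PartById(0)
ids[1] = x
y = PartById(1)
ids[2] = y
z = PartById(2)
ids[3] = z

direct_depth = depth(c)
indirect_depth = depth_by_id(z, ids)
```

Both walks give the same answer, but the ID version needs the extra 
table, the extra lookups, and the bookkeeping to keep the table in sync.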

If you *need* indirection, say because you are keeping the data in a 
database and you want to only lazily load it when needed, rather than 
all at once, then the right approach is probably a proxy object:

class PartProxy(object):
    def __init__(self, database_id):
        self._info = None
        self.database_id = database_id
    @property
    def info(self):
        if self._info is None:
            self._info = get_from_database(self.database_id)
        return self._info
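
Here is a sketch of how that proxy behaves in use. The 
get_from_database function below is a stand-in I have invented so the 
example runs on its own; it just records each call and returns a dict. 
In real code it would be whatever query your database layer provides.

```python
calls = []  # Records every (fake) database hit, for demonstration only.


def get_from_database(database_id):
    # Hypothetical stand-in for a real database query.
    calls.append(database_id)
    return {"id": database_id, "name": "part %d" % database_id}


class PartProxy(object):
    def __init__(self, database_id):
        self._info = None
        self.database_id = database_id

    @property
    def info(self):
        # Load on first access only, then reuse the cached result.
        if self._info is None:
            self._info = get_from_database(self.database_id)
        return self._info


proxy = PartProxy(42)
calls_before_access = list(calls)  # Nothing has been loaded yet.

first = proxy.info    # Triggers the one and only load.
second = proxy.info   # Served from the cached value.
```

Creating the proxy costs nothing, the first access to info pays for the 
load, and every later access reuses the stored dict.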




-- 
Steven D'Aprano

