how to store great array?

Thomas Wouters thomas at xs4all.net
Tue Jan 30 04:21:05 EST 2001


On Mon, Jan 29, 2001 at 08:13:58PM +0000, Quinn Dunkan wrote:

> But python's lists are arrays.  Don't ask me why this confusing
> terminology was chosen, probably because some Dutch people doing usability
> tests thought 'list' sounded nicer for ABC a million years ago.

I'm not sure where list comes from, but I'm sure glad they're called lists
and not arrays! To C programmers, arrays mean 'lists of a single type'. To
people who never heard of arrays (like newbie programmers), arrays mean
'wtf?'. Only to people who judge a datatype by name instead of documentation
*and* are used to a programming language where a list is defined as a linked
list, might 'list' seem 'wrong'. I came from LambdaMOO, where you program in
a wonderfully OO (for a MUD, anyway) and very Python-like programming
language, and I was absolutely delighted when I met Python, which had the
first *real* lists I'd seen in years.

> Dict access is also constant, as is usual for a hash table (internally, it
> gets turned into simple array access).  But if your indices will be inteeger,
> then a plain array aka list will have less overhead.

Only if you have (near) sequential keys. If you have large gaps in your keys
(say 60% or more 'unused' keys, but a dict expert like Tim can probably
correct me :), you're better off with a dictionary -- you won't have to
allocate quite as much room for it.

> Python experts could tell you the exact memory usage.  It will be something
> like bytes_in_string+null + pointer + type_field + maybe_a_cached_hash +
> interned_flag.  So with ~4 bytes overhead for each 5 byte string, you're
> looking at 9*2_000_000 / 1024 / 1024 = 17MB for 2 million elements.

Actually, a string object is defined like this (from Include/stringobject.h):

typedef struct {
    PyObject_VAR_HEAD
#ifdef CACHE_HASH
    long ob_shash;
#endif
#ifdef INTERN_STRINGS
    PyObject *ob_sinterned;
#endif
    char ob_sval[1];
} PyStringObject;

where both CACHE_HASH and INTERN_STRINGS are defined by default.
PyObject_VAR_HEAD is a macro that contains PyOject_HEAD and an ob_size
integer, and PyObject_HEAD holds a pointer to the type struct and an int for
the reference count (if you don't enable extreme debugging.) So, all in all:

pointer + int + int + long + pointer (ob_sval is a placeholder that gets
allocated to the right size in the PyString_New) = 160 bytes on a 32bit
system, 256 bytes on a proper 64bit system, and 224 bytes on Win64. Plus the
length of the string in whatever size a char is on that particular system
(usually a single byte.)

-- 
Thomas Wouters <thomas at xs4all.net>

Hi! I'm a .signature virus! copy me into your .signature file to help me spread!




More information about the Python-list mailing list