[Python-Dev] Unicode objects more space efficient than plain strings? can that be?

Skip Montanaro skip@pobox.com
Wed, 1 May 2002 23:15:35 -0500


I'm busy absorbing all the great feedback I got on the Unicode how-to.
Thanks to all who've responded.  As Aahz has said, the best way to learn
something isn't to ask questions, it's to post an incorrect program (or in
this case, text).

After reading Simon's Perl/Unicode course notes and Marc-André's
Python/Unicode EuroPython slides, I formed a simple, seemingly obvious,
hypothesis:

    When considering just ASCII data, plain Python strings should be more
    space efficient than Unicode strings.

I compared ps output for two interactive sessions.  In the first, I executed
this statement at the interpreter prompt:

    l = [u"abc%d"%i for i in xrange(1000000)]

In the second I executed this similar statement:

    l = ["abc%d"%i for i in xrange(1000000)]

Ps showed that the interpreter consumed 57MB or so of virtual memory for the
list of Unicode strings case, and a whopping 152MB for the list of plain
strings case.  Just to be sure I wasn't dreaming, I repeated the crude
experiment.  Same result.  I then looked at the typedefs for Unicode and
string objects.  The sizes of the two structs are approximately the same.
There's certainly not a factor of three difference in the per-object
overhead.  I expect the raw Unicode buffer to refer to a chunk of memory
that is roughly two times the size of the plain string version of the bytes
because the internal representation is (I seem to recall from MAL's notes)
UCS2.  It seemed the only thing that might be a problem was string
interning, so based on the comment in stringobject.h about interning strings
that "look like" Python identifiers, I tried one more time with strings that
didn't look like identifiers (the automatic string interning would only
happen for string literals anyway, right?):

    l = ["-bc%d"%i for i in xrange(1000000)]

Same result.
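
The factor-of-two expectation for the raw buffers mentioned above can be
checked with a little arithmetic.  The sketch below is only an illustration:
raw_buffer_bytes is a made-up helper name, and the "utf-16-le" codec is
assumed to be a fair stand-in for an internal UCS2 buffer (for pure ASCII
data both use two bytes per character).  It counts payload bytes only and
says nothing about per-object overhead or what the allocator does.

```python
def raw_buffer_bytes(count):
    """Return (plain, ucs2) payload byte totals for "abc%d" % i, i < count.

    Hypothetical helper: counts character data only, not object headers.
    """
    plain = 0
    ucs2 = 0
    for i in range(count):
        text = u"abc%d" % i
        plain += len(text.encode("ascii"))      # one byte per character
        ucs2 += len(text.encode("utf-16-le"))   # two bytes per character, no BOM
    return plain, ucs2


plain, ucs2 = raw_buffer_bytes(1000)
print("plain payload: %d bytes" % plain)
print("UCS2 payload:  %d bytes" % ucs2)
```

By this estimate the Unicode list should need about twice the character
data of the plain list, which is the opposite of what ps reported.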

I ran these tests with a two-week old build from CVS.  I just tried it with
a build from today using xrange(100000) and got similar, though obviously
smaller virtual memory sizes.
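
For anyone repeating the comparison without watching ps, a rough in-process
version might look like the sketch below.  It rests on several assumptions
that are not part of the experiment above: it needs a Python 3 interpreter
with the tracemalloc module, bytes_allocated is a made-up helper name, and
Python 3's bytes type stands in for the plain string type.  It also counts
only allocations made through Python's allocator, not the virtual memory
size ps reports, so the numbers are not directly comparable to the 57MB and
152MB figures.

```python
import tracemalloc


def bytes_allocated(make_item, count):
    """Return the net bytes tracemalloc sees while building a list of items.

    Hypothetical helper: make_item(i) builds the i-th string.
    """
    tracemalloc.start()
    try:
        before, _peak = tracemalloc.get_traced_memory()
        items = [make_item(i) for i in range(count)]   # kept alive while measuring
        after, _peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return after - before


# Same shape as the two statements above, scaled down to 100000 items.
unicode_total = bytes_allocated(lambda i: "abc%d" % i, 100000)
plain_total = bytes_allocated(lambda i: b"abc%d" % i, 100000)
print("str list:   %d bytes" % unicode_total)
print("bytes list: %d bytes" % plain_total)
```

Running both cases in one process keeps interpreter startup cost out of the
comparison, which two separate ps readings cannot do.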

I must be missing something obvious, but what is it?  Something about
pymalloc?

Skip