[Python-Dev] Unicode objects more space efficient than plain strings? can that be?
Skip Montanaro
skip@pobox.com
Wed, 1 May 2002 23:15:35 -0500
I'm busy absorbing all the great feedback I got on the Unicode how-to.
Thanks to all who've responded. As Aahz has said, the best way to learn
something isn't to ask questions, it's to post an incorrect program (or in
this case, text).
After reading Simon's Perl/Unicode course notes and Marc-André's
Python/Unicode EuroPython slides, I formed a simple, seemingly obvious,
hypothesis:

    When considering just ASCII data, plain Python strings should be more
    space efficient than Unicode strings.
I compared ps output for two interactive sessions. In the first, I executed
this statement at the interpreter prompt:
    l = [u"abc%d" % i for i in xrange(1000000)]
In the second I executed this similar statement:
    l = ["abc%d" % i for i in xrange(1000000)]
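(The plain-string/Unicode split is long gone in modern Python, but the spirit of this experiment can still be sketched today by summing per-object sizes with sys.getsizeof, comparing bytes against text strings as the closest Python 3 analogue. This is a hypothetical modern reconstruction, not the original 2002 ps measurement, and it ignores allocator overhead entirely:)

```python
import sys

def total_size(objs):
    """Sum of per-object sizes only; ignores the list's own storage,
    interning, and allocator (pymalloc) slack."""
    return sum(sys.getsizeof(o) for o in objs)

# Python 3's closest analogue of plain string vs. Unicode: bytes vs. str.
n = 100000
byte_strings = [b"abc%d" % i for i in range(n)]
text_strings = ["abc%d" % i for i in range(n)]

print("bytes total:", total_size(byte_strings))
print("str total:  ", total_size(text_strings))
```

Note that this counts only the object headers and buffers that getsizeof can see, so it will not reproduce the virtual-memory numbers ps reports.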
Ps showed that the interpreter consumed 57MB or so of virtual memory for the
list of Unicode strings case, and a whopping 152MB for the list of plain
strings case. Just to be sure I wasn't dreaming, I repeated the crude
experiment. Same result. I then looked at the typedefs for Unicode and
string objects. The sizes of the two structs are approximately the same.
There's certainly not a factor of three difference in the per-object
overhead. I expect the raw Unicode buffer to refer to a chunk of memory
that is roughly two times the size of the plain string version of the bytes,
because the internal representation is (I seem to recall from MAL's notes)
UCS-2. It seemed the only thing that might be a problem was string
interning, so based on the comment in stringobject.h about interning strings
that "look like" Python identifiers, I tried one more time with strings that
didn't look like identifiers (the automatic string interning would only
happen for string literals anyway, right?):
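(The interning behaviour being second-guessed here can be observed directly in a modern CPython; a sketch follows, with the caveat that the exact caching rules are implementation details and have shifted between versions:)

```python
import sys

# Identifier-like string literals within the same compiled code unit
# are typically shared (interned) by CPython at compile time.
a = "abc123"
b = "abc123"
print(a is b)          # usually True in CPython: one shared literal

# Strings built at runtime are ordinarily distinct objects,
# even when their contents are equal...
c = "-bc%d" % 1
d = "-bc%d" % 1
print(c is d)          # usually False: two separate objects

# ...unless they are explicitly interned.
print(sys.intern(c) is sys.intern(d))  # True after explicit interning
```

This is consistent with the guess above: interning applies to literals at compile time, so strings produced by `%` formatting in a loop would not be interned either way.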
    l = ["-bc%d" % i for i in xrange(1000000)]
Same result.
I ran these tests with a two-week-old build from CVS. I just tried it with
a build from today using xrange(100000) and got similar, though obviously
smaller, virtual memory sizes.
I must be missing something obvious, but what is it? Something about
pymalloc?
Skip