[Python-Dev] Unicode objects more space efficient than plain strings? can that be?
Wed, 1 May 2002 23:15:35 -0500
I'm busy absorbing all the great feedback I got on the Unicode how-to.
Thanks to all who've responded. As Aahz has said, the best way to learn
something isn't to ask questions, it's to post an incorrect program (or in
this case, text).
After reading Simon's Perl/Unicode course notes and Marc-André's
Python/Unicode EuroPython slides, I formed a simple, seemingly obvious,
conclusion: When considering just ASCII data, plain Python strings should be
more space efficient than Unicode strings.
I compared ps output for two interactive sessions. In the first, I executed
this statement at the interpreter prompt:
l = [u"abc%d"%i for i in xrange(1000000)]
In the second I executed this similar statement:
l = ["abc%d"%i for i in xrange(1000000)]
Ps showed that the interpreter consumed 57MB or so of virtual memory for the
list of Unicode strings case, and a whopping 152MB for the list of plain
strings case. Just to be sure I wasn't dreaming, I repeated the crude
experiment. Same result. I then looked at the typedefs for Unicode and
string objects. The sizes of the two structs are approximately the same.
There's certainly not a factor of three difference in the per-object
overhead. I expect the raw Unicode buffer to refer to a chunk of memory
that is roughly two times the size of the plain string version of the buffer
because the internal representation is (I seem to recall from MAL's notes)
UCS2. It seemed the only thing that might be a problem was string
interning, so based on the comment in stringobject.h about interning strings
that "look like" Python identifiers, I tried one more time with strings that
didn't look like identifiers (the automatic string interning would only
happen for string literals anyway, right?):
l = ["-bc%d"%i for i in xrange(1000000)]
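
As a rough sanity check on the buffer-size expectation above, here is a
minimal sketch that adds up only the character data for the million
"abc%d" strings. It assumes one byte per character for plain strings and
two bytes per character for Unicode (the UCS2 recollection above), and it
ignores all per-object and allocator overhead. The helper name
payload_bytes is made up for this illustration.

```python
# Payload-only estimate for "abc0" .. "abc999999".
# Assumptions: 1 byte/char for plain strings, 2 bytes/char for Unicode
# (UCS2); object headers and allocator overhead are NOT counted.
N = 1000000


def payload_bytes(n, bytes_per_char):
    # Total number of characters across all n strings, times the
    # assumed fixed width of one character.
    total_chars = sum(len("abc%d" % i) for i in range(n))
    return total_chars * bytes_per_char


narrow = payload_bytes(N, 1)  # plain string buffers
wide = payload_bytes(N, 2)    # Unicode buffers under the UCS2 assumption

print("plain payload:   %.1f MB" % (narrow / 1e6))
print("Unicode payload: %.1f MB" % (wide / 1e6))
```

Under these assumptions the character data comes to just under 9MB for the
plain strings and just under 18MB for the Unicode strings. Both are well
below the sizes ps reported, so the raw buffers alone would not seem to
account for the gap between the two cases.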
I ran these tests with a two-week old build from CVS. I just tried it with
a build from today using xrange(100000) and got similar, though obviously
smaller virtual memory sizes.
I must be missing something obvious, but what is it? Something about