[Python-Dev] Alternative Implementation for PEP 292: Simple String Substitutions

Stephen J. Turnbull stephen at xemacs.org
Fri Sep 10 07:38:38 CEST 2004

>>>>> "Gareth" == Gareth McCaughan <gmccaughan at synaptics-uk.com> writes:

    Gareth> That said, I strongly agree that all textual data should
    Gareth> be Unicode as far as the developer is concerned; but, at
    Gareth> least in the USA :-), it makes sense to have an optimized
    Gareth> representation that saves space for ASCII-only text, just
    Gareth> as we have an optimized representation for small integers.

This is _not at all_ obvious.  As MAL just pointed out, if efficiency
is a goal, text algorithms often need to be different for operations
on texts that are dense in an 8-bit character space, vs texts that are
sparse in a 16-bit or 20-bit character space.  Note that that is what
</F> is talking about too; he points to SRE and ElementTree.
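As an illustration of the dense/sparse distinction (this is not code from
the thread, and it is written in present-day Python rather than the Python
of 2004), here is a minimal sketch of one character-class test done two
ways.  The function names and the sample character class are hypothetical,
chosen only for the example.

```python
# Dense 8-bit space: every possible value fits in a 256-entry table,
# so membership is a direct array index.
CLASS_TABLE = bytearray(256)
for b in b"aeiouAEIOU":
    CLASS_TABLE[b] = 1


def count_members_dense(data: bytes) -> int:
    """Count bytes in the class using a flat lookup table."""
    return sum(CLASS_TABLE[b] for b in data)


# Sparse space of about a million code points: a flat table would be
# almost entirely empty, so a hash-based set is used instead.
CLASS_SET = frozenset("aeiouAEIOU\u00e9\u3042")


def count_members_sparse(text: str) -> int:
    """Count characters in the class using a hash set."""
    return sum(1 for ch in text if ch in CLASS_SET)
```

The two functions answer the same question, but the data structure that
suits one character space is a poor fit for the other, so each has to be
written and maintained separately.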

Viewed in that light, the subtext to </F>'s comment is
"I don't want to separately maintain 8-bit versions of new text
facilities to support my non-Unicode applications, I want to impose
that burden on the authors of text-handling PEPs."  That may very well
be the best thing for Python; as </F> has done a lot of Unicode
implementation for Python, he's in a good position to make such
judgements.  But the development costs MAL refers to are bigger than
you are estimating, and will continue as long as that policy does.

While I'm very sympathetic to </F>'s view that there's more than one
way to skin a cat, and a good cat-handling design should account for
that, and conceding his expertise, nonetheless I don't think that
Python really wants to _maintain_ more than one text-processing system
by default.  Of course if you restrict yourself to the class of ASCII-
only strings, you can do better, and of course that is a huge class of
strings.  But that, as such, is important only to efficiency fanatics.

The question is, how often are people going to notice that when they
have pure ASCII they get a 100% speedup, or that they actually can
just suck that 3GB ASCII file into their 4GB memory, rather than
buffering it as 3 (or 6) 2GB Unicode strings?  Compare how often
people are going to notice that a new facility "just works" for
Japanese or Hindi.  I just don't see the former being worth the extra
effort, while the latter makes the "this or that" choice clear.  If a
single representation is enough, it had better be Unicode-based, and
the others can be supported in libraries (which turn binary blobs into
non-standard text objects with appropriate methods) as the need arises.
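
The "3 (or 6)" figure above can be checked with a little arithmetic.  The
sketch below assumes that the two cases correspond to fixed-width storage
at 2 and 4 bytes per character and that "GB" means 2**30 bytes; both are
readings of the text, not something the message states.  The helper name
is hypothetical.

```python
GB = 2**30  # assumption: binary gigabytes


def chunks_needed(ascii_bytes: int, bytes_per_char: int,
                  chunk_bytes: int = 2 * GB) -> int:
    """Number of chunk_bytes-sized buffers needed to hold the text
    once every ASCII character is widened to bytes_per_char bytes."""
    total = ascii_bytes * bytes_per_char
    return -(-total // chunk_bytes)  # ceiling division


file_size = 3 * GB
two_byte_chunks = chunks_needed(file_size, 2)   # 6 GB total, 3 buffers
four_byte_chunks = chunks_needed(file_size, 4)  # 12 GB total, 6 buffers
```

At one byte per character the same file is 3 GB and fits in 4 GB of
memory in one piece, which is the saving the optimized representation
would buy.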

Institute of Policy and Planning Sciences     http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba                    Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
               Ask not how you can "do" free software business;
              ask what your business can "do for" free software.
