
On Friday 2004-09-10 06:38, Stephen J. Turnbull wrote:
"Gareth" == Gareth McCaughan <gmccaughan@synaptics-uk.com> writes:
Gareth> That said, I strongly agree that all textual data should Gareth> be Unicode as far as the developer is concerned; but, at Gareth> least in the USA :-), it makes sense to have an optimized Gareth> representation that saves space for ASCII-only text, just Gareth> as we have an optimized representation for small integers.
> This is _not at all_ obvious. As MAL just pointed out, if efficiency is a goal, text algorithms often need to be different for operations on texts that are dense in an 8-bit character space, vs texts that are sparse in a 16-bit or 20-bit character space. Note that that is what </F> is talking about too; he points to SRE and ElementTree.
I hope you aren't expecting me to disagree.
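To make that concrete, here is the kind of thing I take it to mean. This is a toy sketch, not how SRE actually does it, and the code-point ranges below are made up for illustration: over a dense 8-bit space a character-class test can be a flat 256-entry table, while over a sparse 16- or 20-bit space you want something like sorted ranges and a binary search.

    import bisect

    # Dense 8-bit space: the whole character class fits in a 256-entry table.
    WORD_BYTES = bytearray(256)
    for c in b"abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789_":
        WORD_BYTES[c] = 1

    def is_word_byte(b):          # b is an integer byte value, 0..255
        return bool(WORD_BYTES[b])

    # Sparse 16/20-bit space: a flat table would be huge and mostly empty,
    # so keep sorted (start, end) code-point ranges and bisect into them.
    # These ranges are illustrative only, not a real Unicode word class.
    WORD_RANGES = [(0x0030, 0x0039), (0x0041, 0x005A), (0x005F, 0x005F),
                   (0x0061, 0x007A), (0x0391, 0x03C9), (0x0400, 0x04FF),
                   (0x4E00, 0x9FFF)]
    RANGE_STARTS = [lo for (lo, hi) in WORD_RANGES]

    def is_word_char(ch):
        cp = ord(ch)
        i = bisect.bisect_right(RANGE_STARTS, cp) - 1
        return i >= 0 and WORD_RANGES[i][0] <= cp <= WORD_RANGES[i][1]
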
> When viewed from that point of view, the subtext to </F>'s comment is "I don't want to separately maintain 8-bit versions of new text facilities to support my non-Unicode applications, I want to impose that burden on the authors of text-handling PEPs." That may very well be the best thing for Python; as </F> has done a lot of Unicode implementation for Python, he's in a good position to make such judgements. But the development costs MAL refers to are bigger than you are estimating, and will continue as long as that policy does.
How do you know what I am estimating?
> While I'm very sympathetic to </F>'s view that there's more than one way to skin a cat, and a good cat-handling design should account for that, and conceding his expertise, none-the-less I don't think that Python really wants to _maintain_ more than one text-processing system by default. Of course if you restrict yourself to the class of ASCII-only strings, you can do better, and of course that is a huge class of strings. But that, as such, is important only to efficiency fanatics.
No, it's important to ... well, people to whom efficiency matters. There's no need for them to be fanatics.
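To put a rough number on the space side of it: on an implementation that stores ASCII-only text at one byte per character and wider text at two or four, the difference is easy to measure. A sketch (the exact byte counts will vary with the implementation and build):

    import sys

    ascii_text = "a" * 1000000            # ASCII-only: one byte per character
    wide_text = "a" * 999999 + "\u20ac"   # one non-Latin-1 character widens the whole string

    # On a compact-ASCII implementation the first is roughly 1 MB plus a
    # small header; the second is roughly 2 MB (or 4 MB if every string
    # is stored at four bytes per character).
    print(sys.getsizeof(ascii_text))
    print(sys.getsizeof(wide_text))
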
> The question is, how often are people going to notice that when they have pure ASCII they get a 100% speedup, or that they actually can just suck that 3GB ASCII file into their 4GB memory, rather than buffering it as 3 (or 6) 2GB Unicode strings? Compare how often people are going to notice that a new facility "just works" for Japanese or Hindi.
Why is that the question, rather than "how often are people going to benefit from getting a 100% speedup when they have pure ASCII"? Or even "how often are people going to try out Python on an application that uses pure-ASCII strings, and decide to use some other language that seems to do the job much faster"?
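Checking for that kind of benefit is cheap, too. A sketch of the sort of micro-benchmark I have in mind (the haystack and needle are made up; on an implementation with a one-byte fast path for ASCII-only text the first measurement can come out substantially faster, and on one without it the two should be about the same):

    import timeit

    setup_ascii = 'hay = "a" * 10**6 + "needle"'
    setup_wide = 'hay = "a" * 10**6 + "needle" + "\\u20ac"'
    stmt = 'hay.find("needle")'

    # Same search over the same amount of text; the only difference is
    # whether the haystack contains a single non-ASCII character.
    print(timeit.timeit(stmt, setup=setup_ascii, number=200))
    print(timeit.timeit(stmt, setup=setup_wide, number=200))
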
> I just don't see the former being worth the extra effort, while the latter makes the "this or that" choice clear. If a single representation is enough, it had better be Unicode-based, and the others can be supported in libraries (which turn binary blobs into non-standard text objects with appropriate methods) as the need arises.
No question that if a single representation is enough then it had better be Unicode.

-- 
g