[Python-Dev] Alternative Implementation for PEP 292: Simple String Substitutions

Fri Sep 10 13:57:13 CEST 2004

On Friday 2004-09-10 06:38, Stephen J. Turnbull wrote:
> >>>>> "Gareth" == Gareth McCaughan <gmccaughan at synaptics-uk.com> writes:
> 
>     Gareth> That said, I strongly agree that all textual data should
>     Gareth> be Unicode as far as the developer is concerned; but, at
>     Gareth> least in the USA :-), it makes sense to have an optimized
>     Gareth> representation that saves space for ASCII-only text, just
>     Gareth> as we have an optimized representation for small integers.
> 
> This is _not at all_ obvious.  As MAL just pointed out, if efficiency
> is a goal, text algorithms often need to be different for operations
> on texts that are dense in an 8-bit character space, vs texts that are
> sparse in a 16-bit or 20-bit character space.  Note that that is what
> </F> is talking about too; he points to SRE and ElementTree.

I hope you aren't expecting me to disagree.

> When viewed from that point of view, the subtext to </F>'s comment is
> "I don't want to separately maintain 8-bit versions of new text
> facilities to support my non-Unicode applications, I want to impose
> that burden on the authors of text-handling PEPs."  That may very well
> be the best thing for Python; as </F> has done a lot of Unicode
> implementation for Python, he's in a good position to make such
> judgements.  But the development costs MAL refers to are bigger than
> you are estimating, and will continue as long as that policy does.

How do you know what I am estimating?

> While I'm very sympathetic to </F>'s view that there's more than one
> way to skin a cat, and a good cat-handling design should account for
> that, and conceding his expertise, nonetheless I don't think that
> Python really wants to _maintain_ more than one text-processing system
> by default.  Of course if you restrict yourself to the class of ASCII-
> only strings, you can do better, and of course that is a huge class of
> strings.  But that, as such, is important only to efficiency fanatics.

No, it's important to ... well, people to whom efficiency
matters. There's no need for them to be fanatics.

> The question is, how often are people going to notice that when they
> have pure ASCII they get a 100% speedup, or that they actually can
> just suck that 3GB ASCII file into their 4GB memory, rather than
> buffering it as 3 (or 6) 2GB Unicode strings?  Compare how often
> people are going to notice that a new facility "just works" for
> Japanese or Hindi.

Why is that the question, rather than "how often are people
going to benefit from getting a 100% speedup when they have
pure ASCII"? Or even "how often are people going to try out
Python on an application that uses pure-ASCII strings, and
decide to use some other language that seems to do the job
much faster"?
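
(For concreteness, here is a rough sketch of the arithmetic behind the
quoted "3 (or 6) 2GB Unicode strings" figure. It assumes fixed-width
storage of 2 or 4 bytes per character, and uses the standard codecs as
a stand-in for an interpreter's internal representation. The helper
name encoded_sizes and the sample string are made up for illustration.)

```python
# Bytes needed to store the same pure-ASCII text at 1, 2 and 4 bytes
# per character. The codecs only approximate an internal representation.
def encoded_sizes(text):
    return {
        "ascii": len(text.encode("ascii")),
        "utf-16-le": len(text.encode("utf-16-le")),
        "utf-32-le": len(text.encode("utf-32-le")),
    }

sample = "pure ASCII text " * 4  # illustrative sample, 64 characters
sizes = encoded_sizes(sample)
bytes_per_char = {name: size // len(sample) for name, size in sizes.items()}

# Scale up to the 3GB ASCII file from the quoted paragraph and count
# how many 2GB pieces the wide representations would need.
GB = 2**30
file_chars = 3 * GB
pieces_2gb = {
    name: file_chars * width // (2 * GB)
    for name, width in bytes_per_char.items()
    if width > 1
}

print(bytes_per_char)
print(pieces_2gb)
```

Under those assumptions the 2-byte form needs three 2GB pieces and the
4-byte form needs six, while the 1-byte form fits in 3GB as it stands.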

>                    I just don't see the former being worth the extra
> effort, while the latter makes the "this or that" choice clear.  If a
> single representation is enough, it had better be Unicode-based, and
> the others can be supported in libraries (which turn binary blobs into
> non-standard text objects with appropriate methods) as the need arises.

No question that if a single representation is enough then it
had better be Unicode.

-- 
g