[Python-Dev] Alternative Implementation for PEP 292: Simple String Substitutions
gmccaughan at synaptics-uk.com
Fri Sep 10 13:57:13 CEST 2004
On Friday 2004-09-10 06:38, Stephen J. Turnbull wrote:
> >>>>> "Gareth" == Gareth McCaughan <gmccaughan at synaptics-uk.com> writes:
> Gareth> That said, I strongly agree that all textual data should
> Gareth> be Unicode as far as the developer is concerned; but, at
> Gareth> least in the USA :-), it makes sense to have an optimized
> Gareth> representation that saves space for ASCII-only text, just
> Gareth> as we have an optimized representation for small integers.
> This is _not at all_ obvious. As MAL just pointed out, if efficiency
> is a goal, text algorithms often need to be different for operations
> on texts that are dense in an 8-bit character space, vs texts that are
> sparse in a 16-bit or 20-bit character space. Note that that is what
> </F> is talking about too; he points to SRE and ElementTree.
I hope you aren't expecting me to disagree.
> When viewed from that point of view, the subtext to </F>'s comment is
> "I don't want to separately maintain 8-bit versions of new text
> facilities to support my non-Unicode applications, I want to impose
> that burden on the authors of text-handling PEPs." That may very well
> be the best thing for Python; as </F> has done a lot of Unicode
> implementation for Python, he's in a good position to make such
> judgements. But the development costs MAL refers to are bigger than
> you are estimating, and will continue as long as that policy does.
How do you know what I am estimating?
> While I'm very sympathetic to </F>'s view that there's more than one
> way to skin a cat, and a good cat-handling design should account for
> that, and conceding his expertise, nonetheless I don't think that
> Python really wants to _maintain_ more than one text-processing system
> by default. Of course if you restrict yourself to the class of ASCII-
> only strings, you can do better, and of course that is a huge class of
> strings. But that, as such, is important only to efficiency fanatics.
No, it's important to ... well, people to whom efficiency
matters. There's no need for them to be fanatics.
> The question is, how often are people going to notice that when they
> have pure ASCII they get a 100% speedup, or that they actually can
> just suck that 3GB ASCII file into their 4GB memory, rather than
> buffering it as 3 (or 6) 2GB Unicode strings? Compare how often
> people are going to notice that a new facility "just works" for
> Japanese or Hindi.
Why is that the question, rather than "how often are people
going to benefit from getting a 100% speedup when they have
pure ASCII"? Or even "how often are people going to try out
Python on an application that uses pure-ASCII strings, and
decide to use some other language that seems to do the job
> I just don't see the former being worth the extra
> effort, while the latter makes the "this or that" choice clear. If a
> single representation is enough, it had better be Unicode-based, and
> the others can be supported in libraries (which turn binary blobs into
> non-standard text objects with appropriate methods) as the need arises.
No question that if a single representation is enough then it
had better be Unicode.
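
The memory arithmetic behind the quoted "3GB ASCII file" example can be sketched as follows. This is only an illustration, assuming a fixed-width 2-byte or 4-byte internal representation stands in for the "Unicode string" cost discussed above; the names `encoded_sizes` and `sample` are hypothetical and not taken from the thread.

```python
def encoded_sizes(text):
    """Return the number of bytes needed to store text in three encodings.

    'utf-16-le' and 'utf-32-le' are used (rather than 'utf-16'/'utf-32')
    so that no byte-order mark is added to the count.
    """
    return {
        "ascii": len(text.encode("ascii")),       # 1 byte per character
        "utf-16": len(text.encode("utf-16-le")),  # 2 bytes per ASCII character
        "utf-32": len(text.encode("utf-32-le")),  # 4 bytes per character
    }


# A small stand-in for a large pure-ASCII file (15,000 characters).
sample = "pure ASCII text" * 1000
sizes = encoded_sizes(sample)

for name, size in sizes.items():
    # The ratio shows how much larger each form is than the 8-bit form.
    print(f"{name}: {size} bytes ({size / sizes['ascii']:.0f}x)")
```

For pure-ASCII input the 2-byte form is twice the size and the 4-byte form four times the size of the 8-bit form, which is the space cost the two sides are weighing against the cost of maintaining more than one representation.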