
On Friday 2004-09-10 06:38, Stephen J. Turnbull wrote:
"Gareth" == Gareth McCaughan <gmccaughan@synaptics-uk.com> writes:
Gareth> That said, I strongly agree that all textual data should Gareth> be Unicode as far as the developer is concerned; but, at Gareth> least in the USA :-), it makes sense to have an optimized Gareth> representation that saves space for ASCII-only text, just Gareth> as we have an optimized representation for small integers.
> This is _not at all_ obvious. As MAL just pointed out, if efficiency is a goal, text algorithms often need to be different for operations on texts that are dense in an 8-bit character space, vs texts that are sparse in a 16-bit or 20-bit character space. Note that that is what </F> is talking about too; he points to SRE and ElementTree.
I hope you aren't expecting me to disagree.
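To make that concrete, here is the kind of thing I take it to mean. This is a toy sketch, not how SRE actually does it, and the code-point ranges below are made up for illustration: over a dense 8-bit space a character-class test can be a flat 256-entry table, while over a sparse 16- or 20-bit space you want something like sorted ranges and a binary search.

    import bisect

    # Dense 8-bit space: the whole character class fits in a 256-entry table.
    WORD_BYTES = bytearray(256)
    for c in b"abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789_":
        WORD_BYTES[c] = 1

    def is_word_byte(b):          # b is an integer byte value, 0..255
        return bool(WORD_BYTES[b])

    # Sparse 16/20-bit space: a flat table would be huge and mostly empty,
    # so keep sorted (start, end) code-point ranges and bisect into them.
    # These ranges are illustrative only, not a real Unicode word class.
    WORD_RANGES = [(0x0030, 0x0039), (0x0041, 0x005A), (0x005F, 0x005F),
                   (0x0061, 0x007A), (0x0391, 0x03C9), (0x0400, 0x04FF),
                   (0x4E00, 0x9FFF)]
    RANGE_STARTS = [lo for (lo, hi) in WORD_RANGES]

    def is_word_char(ch):
        cp = ord(ch)
        i = bisect.bisect_right(RANGE_STARTS, cp) - 1
        return i >= 0 and WORD_RANGES[i][0] <= cp <= WORD_RANGES[i][1]
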
> When viewed from that point of view, the subtext to </F>'s comment is "I don't want to separately maintain 8-bit versions of new text facilities to support my non-Unicode applications, I want to impose that burden on the authors of text-handling PEPs." That may very well be the best thing for Python; as </F> has done a lot of Unicode implementation for Python, he's in a good position to make such judgements. But the development costs MAL refers to are bigger than you are estimating, and will continue as long as that policy does.
How do you know what I am estimating?
> While I'm very sympathetic to </F>'s view that there's more than one way to skin a cat, and a good cat-handling design should account for that, and conceding his expertise, none-the-less I don't think that Python really wants to _maintain_ more than one text-processing system by default. Of course if you restrict yourself to the class of ASCII-only strings, you can do better, and of course that is a huge class of strings. But that, as such, is important only to efficiency fanatics.
No, it's important to ... well, people to whom efficiency matters. There's no need for them to be fanatics.
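To put a rough number on the space side of it: on an implementation that stores ASCII-only text at one byte per character and wider text at two or four, the difference is easy to measure. A sketch (the exact byte counts will vary with the implementation and build):

    import sys

    ascii_text = "a" * 1000000            # ASCII-only: one byte per character
    wide_text = "a" * 999999 + "\u20ac"   # one non-Latin-1 character widens the whole string

    # On a compact-ASCII implementation the first is roughly 1 MB plus a
    # small header; the second is roughly 2 MB (or 4 MB if every string
    # is stored at four bytes per character).
    print(sys.getsizeof(ascii_text))
    print(sys.getsizeof(wide_text))
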
> The question is, how often are people going to notice that when they have pure ASCII they get a 100% speedup, or that they actually can just suck that 3GB ASCII file into their 4GB memory, rather than buffering it as 3 (or 6) 2GB Unicode strings? Compare how often people are going to notice that a new facility "just works" for Japanese or Hindi.
Why is that the question, rather than "how often are people going to benefit from getting a 100% speedup when they have pure ASCII"? Or even "how often are people going to try out Python on an application that uses pure-ASCII strings, and decide to use some other language that seems to do the job much faster"?
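Checking for that kind of benefit is cheap, too. A sketch of the sort of micro-benchmark I have in mind (the haystack and needle are made up; on an implementation with a one-byte fast path for ASCII-only text the first measurement can come out substantially faster, and on one without it the two should be about the same):

    import timeit

    setup_ascii = 'hay = "a" * 10**6 + "needle"'
    setup_wide = 'hay = "a" * 10**6 + "needle" + "\\u20ac"'
    stmt = 'hay.find("needle")'

    # Same search over the same amount of text; the only difference is
    # whether the haystack contains a single non-ASCII character.
    print(timeit.timeit(stmt, setup=setup_ascii, number=200))
    print(timeit.timeit(stmt, setup=setup_wide, number=200))
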
> I just don't see the former being worth the extra effort, while the latter makes the "this or that" choice clear. If a single representation is enough, it had better be Unicode-based, and the others can be supported in libraries (which turn binary blobs into non-standard text objects with appropriate methods) as the need arises.
No question that if a single representation is enough then it had better be Unicode.

-- 
g