[Python-3000] How will unicode get used?

Thu Sep 21 03:58:30 CEST 2006

David Hopwood <david.nospam.hopwood at blueyonder.co.uk> wrote:
> Brett Cannon wrote:
[snip]
> > If you want that kind of
> > exposure, use the bytes type.  Otherwise assume the usage will be by people
> > ignorant of Unicode and thus want something that will work the way they are
> > used to when compared to working in ASCII.
> 
> It simply is not possible to do correct string processing in Unicode that
> will "work the way [programmers] are used to when compared to working in ASCII".
> 
> The Unicode standard is on-line at www.unicode.org, and is quite well written,
> with lots of motivation and explanation of how processing international texts
> necessarily differs from working with ASCII. There is no excuse for any
> programmer doing text processing not to have read it.

Since basically everyone using Python today performs "text processing"
in one way or another, you are saying that basically everyone should
read the Unicode spec before using Python.  Never mind that the
document is longer than most people want to read, and that you didn't
provide a link to the most applicable section (with regard to *using*
Unicode).  I will also mention that in the Unicode 4.0 spec, Chapter 5
"Implementation Guidelines" starts with:

'''
It is possible to implement a substantial subset of the Unicode Standard
as "wide ASCII" with little change to existing programming practice. ...
'''

It later goes on to explain where "wide ASCII" is not a reasonable
strategy, but I'm not sure that users of Python necessarily need to know
all of that.
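
As a rough illustration of one place where the "wide ASCII" model stops
matching what users expect, here is a minimal sketch.  The example is
my own, not taken from the spec, and it assumes an interpreter where
str is a sequence of code points (Python 3 behaviour).  The helper
name same_text is hypothetical.

```python
import unicodedata

# Two spellings of the same user-visible character "e with acute accent".
composed = "\u00e9"      # one code point: LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # two code points: "e" + COMBINING ACUTE ACCENT


def same_text(a, b):
    """Compare two strings after normalizing both to NFC.

    Hypothetical helper for illustration; NFC is one reasonable choice
    of normalization form, not the only one.
    """
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# Treating the strings as "wide ASCII" means comparing code point by
# code point, so the two spellings differ in both length and equality.
naive_equal = composed == decomposed
lengths = (len(composed), len(decomposed))

# Normalizing first makes the two spellings compare equal.
normalized_equal = same_text(composed, decomposed)
```

Length, indexing, and plain == all work here the way an ASCII
programmer expects on each string individually; it is only when the two
spellings meet that the difference shows up.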

> Should we nevertheless try to avoid making the use of Unicode strings
> unnecessarily difficult for people who have minimal knowledge of Unicode?
> Absolutely, but not at the expense of making basic operations on strings
> asymptotically less efficient. O(1) indexing and slicing is a basic
> requirement, even if it has to be done using code units.

I believe you mean "code points"; "code units" imply non-O(1) indexing
and slicing (variable-width characters).
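
To make the distinction concrete, here is a minimal sketch, assuming
UTF-16 as the variable-width encoding and Python 3 for the code.  The
function names utf16_units and code_point_at are hypothetical.  Picking
out a single code unit is a plain list index, but finding the n-th code
point means scanning from the start, because a surrogate pair takes two
units.

```python
import struct


def utf16_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    data = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))


def code_point_at(units, index):
    """Return the index-th code point stored in a list of UTF-16 units.

    This is a linear scan: each surrogate pair occupies two units, so
    the position of the n-th code point cannot be computed directly.
    """
    i = 0      # position in code units
    seen = 0   # number of code points passed so far
    while i < len(units):
        unit = units[i]
        is_pair = (
            0xD800 <= unit <= 0xDBFF
            and i + 1 < len(units)
            and 0xDC00 <= units[i + 1] <= 0xDFFF
        )
        if is_pair:
            # Combine high and low surrogate into one code point.
            cp = 0x10000 + ((unit - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            width = 2
        else:
            cp = unit
            width = 1
        if seen == index:
            return cp
        seen += 1
        i += width
    raise IndexError("code point index out of range")


# "a", U+1D11E MUSICAL SYMBOL G CLEF (outside the BMP), "b"
sample = "a\U0001D11Eb"
units = utf16_units(sample)
```

With this sample, units[1] is only the high half of the surrogate pair,
while code_point_at(units, 1) returns the whole character U+1D11E.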

 - Josiah