[Python-Dev] PEP 393 Summer of Code Project

Stephen J. Turnbull stephen at xemacs.org
Thu Aug 25 07:58:10 CEST 2011


Nick Coghlan writes:
 > GvR writes:

 > > Let's just define a Unicode string to be a sequence of code points and
 > > let libraries deal with the rest. Ok, methods like lower() should
 > > consider characters, but indexing/slicing should refer to code points.
 > > Same for '=='; we can have a library that compares by applying (or
 > > assuming?) certain normalizations. Tom C tells me that case-less
 > > comparison cannot use a.lower() == b.lower(); fine, we can add that
 > > operation to the library too. But this exceeds the scope of PEP 393,
 > > right?
 > 
 > Yep, I was agreeing with you on this point - I think you're right that
 > if we provide a solid code point based core Unicode type (perhaps with
 > some character based methods), then library support can fill the gap
 > between handling code points and handling characters.

+1  I don't really see an alternative to this approach.  The
underlying array has to be exposed because there are too many
applications that can take advantage of it, and analysis of decomposed
characters requires it.
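
As a rough illustration of that division of labour (code-point
indexing and '==' in the core, normalization-aware and caseless
comparison in a library), here is a minimal sketch. The helper names
nfc_equal and caseless_equal are hypothetical, str.casefold() assumes
Python 3.3 or later, and the caseless helper is a simplification that
skips the normalization steps full Unicode caseless matching calls for.

```python
import unicodedata

composed = "\u00e9"      # 'é' as a single code point
decomposed = "e\u0301"   # 'e' followed by COMBINING ACUTE ACCENT

# The core type counts and compares code points, not characters.
lengths = (len(composed), len(decomposed))
raw_equal = composed == decomposed


def nfc_equal(a, b):
    """Hypothetical library helper: compare after NFC normalization."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


def caseless_equal(a, b):
    """Hypothetical library helper: simplified caseless comparison."""
    return a.casefold() == b.casefold()


# lower() leaves 'ß' alone, so it cannot match "SS"; casefold() can.
lower_equal = "Stra\u00dfe".lower() == "STRASSE".lower()
fold_equal = caseless_equal("Stra\u00dfe", "STRASSE")
```

With these definitions, the two spellings of 'é' differ under plain
'==' but are equal under nfc_equal, and the lower()-based comparison
fails where the casefold()-based one succeeds.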

Making that array be an array of code points is a really good idea,
and Python already has that in the UCS-4 build.  PEP 393 is "just" a
space optimization that allows getting rid of the narrow build, with
all its wartiness.
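
To make the "space optimization" concrete, the sketch below estimates
how many bytes each code point costs by growing a string and watching
sys.getsizeof(). It assumes CPython 3.3 or later, where PEP 393 is
implemented; the helper name bytes_per_code_point is hypothetical, and
the exact numbers are an implementation detail rather than a language
guarantee.

```python
import sys


def bytes_per_code_point(ch, n=1000):
    """Estimate storage per code point by adding n more copies of ch."""
    return (sys.getsizeof(ch * (2 * n)) - sys.getsizeof(ch * n)) // n


samples = {
    "ascii": "a",
    "latin1": "\u00e9",       # é
    "bmp": "\u20ac",          # EURO SIGN
    "astral": "\U0001F600",   # a code point outside the BMP
}

widths = {name: bytes_per_code_point(ch) for name, ch in samples.items()}

# A non-BMP character is one code point: no surrogate pair is visible.
astral_length = len(samples["astral"])
```

On such a build, every string is indexed by code point regardless of
which storage width it happens to use.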

 > something to incorporate directly around the 3.4 timeframe (or
 > potentially even in 3.3, since it should already be possible to
 > develop such a wrapper based on UCS4 builds of 3.2)

I agree that it's possible, but I estimate that it's not feasible for
3.3 because we don't yet know the requirements.  This one really needs
to ferment and mature in PyPI for a while because we just don't know
how far the scope of user needs is going to extend.  Bidi is a
mudball[1]; confusable character indexes sound like a cool idea for
the web and email, but is anybody really going to use them?  And so on.

 > However, there may be an important distinction to be made on the
 > Python-the-language vs CPython-the-implementation front: is another
 > implementation (e.g. PyPy) *allowed* to implement character based
 > indexing instead of code point based for 2.x unicode/3.x str type? Or
 > is the code point indexing part of the language spec, and any
 > character based indexing needs to be provided via a separate type or
 > module?

+1 for language spec.  Remember, there are cases in Unicode where
you'd like to access base characters and the like.  So you need to be
able to get at individual code points in an NFD string.  You shouldn't
need to use different code for that in different implementations of
Python.
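
A minimal sketch of what that looks like with code-point indexing,
using only the standard unicodedata module (the variable names are
illustrative):

```python
import unicodedata

# NFD splits each precomposed character into base + combining marks.
s = unicodedata.normalize("NFD", "caf\u00e9")

# Plain indexing reaches the individual code points.
base_char = s[3]   # the base letter 'e'
mark = s[4]        # U+0301 COMBINING ACUTE ACCENT

# combining() returns 0 for base characters and a nonzero canonical
# combining class for combining marks.
mark_class = unicodedata.combining(mark)

# Keep only the base characters of the decomposed string.
bases = "".join(ch for ch in s if unicodedata.combining(ch) == 0)
```

Because this relies only on indexing and iteration over code points,
the same code would work on any implementation that specifies str that
way.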

Footnotes: 
[1]  Sure, we can implement the UAX#9 bidi algorithm, but it's not
good enough by itself: something as simple as

    "File name (default {0}): ".format(name)

can produce disconcerting results if the whole resulting string is
treated by the UBA.  Specifically, using the usual convention of
uppercase letters being an RTL script, name = "ABCD" will result in
the prompt:

    File name (default :(DCBA _

(where _ denotes the position of the insertion cursor).  The Hebrew
speakers on emacs-devel agreed that an example using a real Hebrew
string didn't look right to them, either.
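
One way to see why this prompt is awkward is to look at the bidi
classes of its characters. The sketch below uses
unicodedata.bidirectional() with a real Hebrew name in place of
"ABCD"; the helper name bidi_classes is hypothetical. It only inspects
the classes and does not implement the UBA.

```python
import unicodedata


def bidi_classes(text):
    """Pair each character with its Unicode bidirectional class."""
    return [(ch, unicodedata.bidirectional(ch)) for ch in text]


name = "\u05d0\u05d1\u05d2\u05d3"   # alef, bet, gimel, dalet
prompt = "File name (default {0}): ".format(name)

classes = bidi_classes(prompt)

# The characters that follow the RTL name: ")", ":" and " ".
tail = classes[-3:]
```

The Latin letters are strong LTR ("L") and the Hebrew letters strong
RTL ("R"), but the punctuation and space after the name have no strong
direction of their own, so the algorithm has to resolve them from
context rather than from the intent of the format string.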

