[Python-Dev] PEP 393 Summer of Code Project

Thu Aug 25 05:48:49 CEST 2011

On Thu, Aug 25, 2011 at 1:11 PM, Guido van Rossum <guido at python.org> wrote:
>> With narrow builds, code units can currently come into play
>> internally, but with PEP 393 everything internal will be working
>> directly with code points. Normalisation, combining characters and
>> bidi issues may still affect the correctness of unicode comparison and
>> slicing (and other text manipulation), but there are limits to how
>> much of the underlying complexity we can effectively hide without
>> being misleading.
>
> Let's just define a Unicode string to be a sequence of code points and
> let libraries deal with the rest. Ok, methods like lower() should
> consider characters, but indexing/slicing should refer to code points.
> Same for '=='; we can have a library that compares by applying (or
> assuming?) certain normalizations. Tom C tells me that case-less
> comparison cannot use a.lower() == b.lower(); fine, we can add that
> operation to the library too. But this exceeds the scope of PEP 393,
> right?

Yep, I was agreeing with you on this point - I think you're right that
if we provide a solid code point based core Unicode type (perhaps with
some character based methods), then library support can fill the gap
between handling code points and handling characters.
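
As a rough illustration of that gap, here is a minimal sketch of what
such library support might look like. It assumes Python 3.3 or later
(where str.casefold() exists); the helper names nfc_equal and
caseless_equal are hypothetical, not an existing API.

```python
import unicodedata

# Two spellings of the same user-perceived character "é".
composed = "\u00e9"       # one code point: LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"    # two code points: "e" + COMBINING ACUTE ACCENT


def nfc_equal(a, b):
    """Compare two strings after normalising both to NFC (hypothetical helper)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


def caseless_equal(a, b):
    """Case-insensitive comparison via case folding (hypothetical helper).

    Normalises before and after folding, since folding can produce
    text that is no longer normalised.
    """
    def fold(s):
        nfd = unicodedata.normalize("NFD", s)
        return unicodedata.normalize("NFD", nfd.casefold())
    return fold(a) == fold(b)


# Indexing, len() and == all operate on code points.
print(len(composed), len(decomposed))
print(composed == decomposed)
print(nfc_equal(composed, decomposed))

# lower() alone is not enough for caseless comparison.
print("Straße".lower() == "STRASSE".lower())
print(caseless_equal("Straße", "STRASSE"))
```

The core type stays a plain sequence of code points here; the
normalisation and folding policy lives entirely in the helpers.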

In particular, a unicode character based string type would be
significantly easier to write in Python than it would be in C (after
skimming Tom's bug report at http://bugs.python.org/issue12729, I
better understand the motivation and desire for that kind of interface
and it sounds like Terry's prototype is along those lines). Once those
mappings are thrashed out outside the core, then there may be
something to incorporate directly around the 3.4 timeframe (or
potentially even in 3.3, since it should already be possible to
develop such a wrapper based on UCS4 builds of 3.2).
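
To make the wrapper idea concrete, a very rough pure-Python sketch
follows. This is not Terry's prototype, and the class name CharString
is hypothetical. It also assumes a drastic simplification: a
"character" is taken to be a base code point plus any combining marks
that follow it, which covers much less than the full grapheme cluster
rules in Unicode Standard Annex #29.

```python
import unicodedata


class CharString:
    """Hypothetical wrapper that indexes by character instead of code point.

    Assumption: a character is a base code point followed by zero or
    more code points with a non-zero canonical combining class.
    """

    def __init__(self, text):
        self._clusters = []
        for ch in text:
            if self._clusters and unicodedata.combining(ch):
                # Attach the combining mark to the preceding base character.
                self._clusters[-1] += ch
            else:
                self._clusters.append(ch)

    def __len__(self):
        return len(self._clusters)

    def __getitem__(self, index):
        if isinstance(index, slice):
            return CharString("".join(self._clusters[index]))
        return self._clusters[index]

    def __str__(self):
        return "".join(self._clusters)


text = "cafe\u0301s"          # "cafés" with a decomposed "é"
chars = CharString(text)

print(len(text), len(chars))  # code points vs. characters
print(chars[3] == "e\u0301")  # the accent stays with its base letter
print(str(chars[:4]) == "cafe\u0301")
```

The underlying str is still indexed by code point; only the wrapper
changes what an index means, which is the kind of layering that can
be worked out outside the core.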

However, there may be an important distinction to be made on the
Python-the-language vs CPython-the-implementation front: is another
implementation (e.g. PyPy) *allowed* to implement character based
indexing instead of code point based for the 2.x unicode/3.x str type? Or
is the code point indexing part of the language spec, and any
character based indexing needs to be provided via a separate type or
module?

Regards,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia