[Python-Dev] PEP 393 Summer of Code Project

Stephen J. Turnbull stephen at xemacs.org
Wed Aug 31 14:21:38 CEST 2011


Glenn Linderman writes:

 > From comments Guido has made, he is not interested in changing the
 > efficiency or access methods of the str type to raise the level of
 > support of Unicode to the composed character, or grapheme cluster
 > concepts.

IMO, that would be a bad idea: higher-level Unicode support should
either be a wrapper around a full implementation such as ICU (or the
platform support in .NET or Java), or be written in pure Python at first.
Thus there is a need for an efficient array of code units type.  PEP
393 allows this to go to the level of code points, but evidently that
is inappropriate for Jython and IronPython.

 > The str type itself can presently be used to process other
 > character encodings:

Not really.  Remember, on input codecs always decode to Unicode and on
output they always encode from Unicode.  How do you propose to get
other encodings into the array of code units?
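To make the point concrete, here is a minimal sketch of the boundary
codecs enforce (standard CPython behavior, nothing hypothetical):

```python
# Codecs always cross the bytes/str boundary in one direction each:
raw = b"caf\xc3\xa9"            # UTF-8 encoded bytes
text = raw.decode("utf-8")      # decoding always produces str (Unicode)
assert text == "caf\u00e9"
back = text.encode("latin-1")   # encoding always produces bytes
assert back == b"caf\xe9"
```

There is no supported path that puts, say, Latin-1 or Shift-JIS code
units into a str without first decoding them to Unicode code points.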

 > [A "true Unicode" type] could be based on extensions to the
 > existing str type, or it could be based on the array type, or it
 > could based on the bytes type.  It could use an internal format of
 > 32-bit codepoints, PEP 393 variable-size codepoints, or 8- or
 > 16-bit codeunits.

In theory yes, but in practice all of the string methods and libraries
like re operate on str (and often but not always bytes; in particular,
codecs always decode from bytes and encode to bytes).

Why bother with anything except arrays of code points at the start?
PEP 393 makes that time-efficient and reasonably space-efficient as a
starting point and allows starting with re or MRAB's regex to get
basic RE functionality or good UTS #18 functionality respectively.
Plus str already has all the usual string operations (.startswith(),
.join(), etc), and we have modules for dealing with the Unicode
Character Database.  Why waste effort reintegrating with all that,
until we have common use cases that need more efficient representation?
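A minimal sketch of what comes for free today, using only the stdlib
(the combining-character example is mine, not from the thread):

```python
import unicodedata

s = "e\u0301"  # "e" followed by U+0301 COMBINING ACUTE ACCENT
# str already gives code-point-level access and the usual operations:
assert len(s) == 2
assert s.startswith("e")
# and unicodedata exposes the Unicode Character Database:
assert unicodedata.category("\u0301") == "Mn"         # Mark, nonspacing
assert unicodedata.normalize("NFC", s) == "\u00e9"    # composed form
```

A higher-level "true Unicode" type could build grapheme-cluster
iteration in pure Python on top of exactly these pieces.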

There would be some work in coming up with an appropriate UTF-16 to
code point API for Jython and IronPython, but Terry Reedy already has
a rather efficient library for that.
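The core arithmetic any such API needs is just the standard
surrogate-pair formula from the Unicode standard; this sketch is
illustrative and does not describe Terry's library:

```python
def code_point(high, low):
    """Combine a UTF-16 surrogate pair into a single code point
    (standard formula: D800-DBFF high, DC00-DFFF low surrogates)."""
    assert 0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

# U+1F600 is encoded in UTF-16 as the pair D83D DE00:
assert code_point(0xD83D, 0xDE00) == 0x1F600
```

The hard part for a narrow-build implementation is not this formula
but making indexing and slicing by code point efficient over it.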

So this discussion of alternative representations, including use of
high bits to represent properties, is premature optimization
... especially since we don't even have a proto-PEP specifying how
much conformance we want of this new "true Unicode" type in the first
place.

We need to focus on that before optimizing anything.
