[Python-Dev] len(chr(i)) = 2?

M.-A. Lemburg mal at egenix.com
Wed Nov 24 19:50:57 CET 2010


Alexander Belopolsky wrote:
> To conclude, I feel that rather than trying to fully support non-BMP
> characters as surrogate pairs in narrow builds, we should make it
> easier for application developers to avoid them. 

I don't understand what you're after here. Programmers can easily
avoid them by not using them :-)

> If abandoning
> internal use of UTF-16 is not an option, I think we should at least
> add an option for decoders that currently produce surrogate pairs to
> treat non-BMP characters as errors and handle them according to user's
> choice.

But what do you gain by doing this? You'd lose the round-trip
safety of those codecs, and that's not a good thing.
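
A rough sketch of the round-trip concern, written for a current Python 3
where str holds full code points. The helper decode_bmp_only is
hypothetical (it is not an existing codec option); it only imitates a
decoder that treats non-BMP characters as errors and replaces them:

```python
# Hypothetical helper, not an existing codec option: decode UTF-8 and
# replace every non-BMP code point with U+FFFD.
def decode_bmp_only(data: bytes) -> str:
    text = data.decode("utf-8")
    return "".join(ch if ord(ch) <= 0xFFFF else "\ufffd" for ch in text)

original = "a\U0001D11Eb".encode("utf-8")  # U+1D11E lies outside the BMP

# The plain decoder round-trips: decode followed by encode restores the input.
round_trip = original.decode("utf-8").encode("utf-8")

# The BMP-only variant cannot restore the input once the character is replaced.
lossy = decode_bmp_only(original).encode("utf-8")

print(round_trip == original)
print(lossy == original)
```

Any other error handler that drops or substitutes the character would
lose the original bytes in the same way.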

Note that most text processing APIs in Python work based on code
units, which in most cases represent single code points, but in
some cases can also represent surrogates (both on UCS-2 and on
UCS-4 builds).

E.g. str.center(n) centers the string in a padded string that
is composed of n code units. Whether that operation will result
in a text that's centered visually on output is a completely
different story. The original string could contain surrogates,
it could also contain combining code points, so the visual
presentation of the result may very well not be centered at
all; it may not even appear as having the length n to the user.
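
A small illustration of the combining code point case, assuming a
current Python 3 and using unicodedata.combining() as a rough stand-in
for "does not occupy a column of its own" (real terminal width rules
are more involved):

```python
import unicodedata

# 'e' followed by COMBINING ACUTE ACCENT: two code points, one visible character.
s = "e\u0301"

padded = s.center(6, "*")

# len() counts the units of the string, so the padded result has length 6 ...
length = len(padded)

# ... but the combining accent is drawn on top of the 'e', so fewer
# character cells are likely to show up on output.
visible = sum(1 for ch in padded if not unicodedata.combining(ch))

print(length, visible)
```

The string is padded with two fill characters on each side, yet the
result would typically be displayed as five cells rather than six.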

Since we're not going to change the semantics of those APIs,
it is OK to not support padding with non-BMP code points on
UCS-2 builds.

Supporting such cases would only cause problems:

* if the methods padded with surrogates, the resulting
  string would no longer have length n, breaking the
  assumption that len(str.center(n)) == n

* if the methods padded with half the number of surrogates
  to make sure that len(str.center(n)) == n, the resulting
  output to e.g. a terminal would be further off than what
  you already have with surrogates and combining code points
  in the original string.
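
The two cases can be sketched on a current Python 3 by counting UTF-16
code units explicitly, since that is what len() reports on a UCS-2
build. The helper utf16_len and the choice of U+1D11E as the fill
character are assumptions made for this illustration only:

```python
# Illustrative helper: number of UTF-16 code units in s, i.e. what len()
# would report for the same text on a UCS-2 build.
def utf16_len(s: str) -> int:
    return len(s.encode("utf-16-le")) // 2

fill = "\U0001D11E"  # non-BMP character, a surrogate pair in UTF-16
text = "ab"
n = 6

# Case 1: pad with whole fill characters. Four fill characters are added,
# each costing two code units, so the code unit length overshoots n.
full = text.center(n, fill)

# Case 2: pad with half as many fill characters so the code unit length
# equals n. Only one fill character fits on each side.
half = fill + text + fill

print(len(full), utf16_len(full))
print(len(half), utf16_len(half))
```

In the first case the code unit length is 10 instead of 6; in the second
it is 6, but only four characters are present, so the displayed result
is narrower than requested.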

More on codecs supporting surrogates:

  http://mail.python.org/pipermail/python-dev/2008-July/080915.html

Perhaps it's time to reconsider a project I once started
but that never got off the ground:

  http://mail.python.org/pipermail/python-dev/2008-July/080911.html

Here's the pre-PEP:

  http://mail.python.org/pipermail/python-dev/2001-July/015938.html

-- 
Marc-Andre Lemburg
eGenix.com
