[Python-Dev] len(chr(i)) = 2?
M.-A. Lemburg
mal at egenix.com
Wed Nov 24 19:50:57 CET 2010
Alexander Belopolsky wrote:
> To conclude, I feel that rather than trying to fully support non-BMP
> characters as surrogate pairs in narrow builds, we should make it
> easier for application developers to avoid them.
I don't understand what you're after here. Programmers can easily
avoid them by not using them :-)
> If abandoning
> internal use of UTF-16 is not an option, I think we should at least
> add an option for decoders that currently produce surrogate pairs to
> treat non-BMP characters as errors and handle them according to user's
> choice.
But what do you gain by doing this ? You'd lose the round-trip
safety of those codecs and that's not a good thing.
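To illustrate the round-trip concern, here is a minimal sketch. The helper decode_bmp_only is hypothetical and not an existing codec option; it only mimics a decoder that treats non-BMP characters as errors and substitutes U+FFFD for them:

```python
def decode_bmp_only(data, encoding="utf-8", replacement="\ufffd"):
    """Hypothetical decoder: decode normally, then replace every
    non-BMP code point, as a 'treat as error' policy might do."""
    text = data.decode(encoding)
    return "".join(c if ord(c) <= 0xFFFF else replacement for c in text)


original = "clef: " + chr(0x1D11E)   # U+1D11E lies outside the BMP
data = original.encode("utf-8")

# The regular codec round-trips the bytes unchanged.
roundtrip_ok = data.decode("utf-8").encode("utf-8") == data

# The restricted decoder cannot reproduce the original bytes.
lossy = decode_bmp_only(data)
roundtrip_lossy = lossy.encode("utf-8") == data
```

Once the non-BMP character has been replaced, the original bytes cannot be recovered from the decoded text.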
Note that most text processing APIs in Python work based on code
units, which in most cases represent single code points, but in
some cases can also represent surrogates (both on UCS-2 and on
UCS-4 builds).
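A small sketch of the code unit vs. code point distinction. It assumes a current CPython (3.3 or later), where len() counts code points, and uses the hypothetical helper utf16_code_units to compute what a UCS-2 build would have counted:

```python
def utf16_code_units(s):
    """Number of UTF-16 code units needed to store s.
    Hypothetical helper; raises on lone surrogates."""
    return len(s.encode("utf-16-le")) // 2


bmp_char = "a"
non_bmp_char = chr(0x10400)   # needs a surrogate pair in UTF-16

units_bmp = utf16_code_units(bmp_char)           # 1
units_non_bmp = utf16_code_units(non_bmp_char)   # 2
code_points_non_bmp = len(non_bmp_char)          # 1 on CPython 3.3+
```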
E.g. str.center(n) centers the string in a padded string that
is composed of n code units. Whether that operation will result
in a text that's centered visually on output is a completely
different story. The original string could contain surrogates,
it could also contain combining code points, so the visual
presentation of the result may very well not be centered at
all; it may not even appear as having the length n to the user.
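A short sketch of that effect with a combining code point. The count of non-combining code points is only a rough stand-in for what a user sees, not a real display-width computation:

```python
import unicodedata

# "e" followed by U+0301 COMBINING ACUTE ACCENT: two code points,
# normally rendered as a single glyph.
text = "e\u0301"
padded = text.center(6, "*")

# str.center() counts code points, so the result has length 6 ...
length = len(padded)

# ... but only 5 of them are non-combining, so the user would
# typically see 5 glyphs, with the text not visually centered in 6.
visible = sum(1 for c in padded if not unicodedata.combining(c))
```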
Since we're not going to change the semantics of those APIs,
it is OK to not support padding with non-BMP code points on
UCS-2 builds.
Supporting such cases would only cause problems:
* if the methods would pad with surrogates, the resulting
string would no longer have length n; breaking the
assumption that len(str.center(n)) == n
* if the methods would pad with half the number of surrogates
to make sure that len(str.center(n)) == n, the resulting
output to e.g. a terminal would be further off than what
you already have with surrogates and combining code points
in the original string.
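The two cases above can be simulated on any build by working with lists of UTF-16 code units. This is only a sketch: to_units and center_units are hypothetical helpers, not how str.center() is implemented, and the halve flag stands for the second option:

```python
def to_units(s):
    """Split s into UTF-16 code units, as a UCS-2 build stores it."""
    data = s.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little")
            for i in range(0, len(data), 2)]


def center_units(text, n, fill, halve=False):
    """Center text in n code units, padding with a fill character
    that may occupy more than one code unit."""
    text_u, fill_u = to_units(text), to_units(fill)
    margin = max(n - len(text_u), 0)
    if halve:
        # Option 2: use fewer fill characters to keep the length near n.
        margin //= len(fill_u)
    left = margin // 2
    right = margin - left
    return fill_u * left + text_u + fill_u * right


fill = chr(0x1D11E)  # non-BMP fill character: 2 code units

# Option 1: pad with 4 fill characters; the length is 10, not 6.
full = center_units("ab", 6, fill)

# Option 2: pad with 2 fill characters; the length is 6, but only
# 4 characters are displayed instead of 6.
halved = center_units("ab", 6, fill, halve=True)
```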
More on codecs supporting surrogates:
http://mail.python.org/pipermail/python-dev/2008-July/080915.html
Perhaps it's time to reconsider a project I once started
but that never got off the ground:
http://mail.python.org/pipermail/python-dev/2008-July/080911.html
Here's the pre-PEP:
http://mail.python.org/pipermail/python-dev/2001-July/015938.html
--
Marc-Andre Lemburg
eGenix.com