[Python-Dev] len(chr(i)) = 2?

Mon Nov 22 18:00:14 CET 2010

On Mon, Nov 22, 2010 at 11:13 AM, Nick Coghlan <ncoghlan at gmail.com> wrote:
..
>> Do you think these articles are helpful for someone learning how to
>> use chr() and ord() in Python for the first time?
>
> No, that's what the documentation of chr() and ord() is for. For that
> use case, it doesn't matter *what* the terms are.

I recently updated the chr() and ord() documentation and used the
"narrow/wide" terms.  I thought UCS2/4 proponents objected to that on
the basis that these terms are imprecise.

http://docs.python.org/dev/library/functions.html#chr
http://docs.python.org/dev/library/functions.html#ord

> They could say "in a
> FOO build this will do X, in a BAR build it will do Y, see <link> for
> a detailed explanation of the differences between FOO and BAR builds
> of Python" and be perfectly adequate for the task. If there is no
> appropriate documentation link to point to (probably somewhere in the
> C API docs if it isn't anywhere else) then that is a key issue that
> needs to be fixed, rather than trying to change the terms that have
> been in use for the better part of a decade already.
>

That's the point that I was trying to make.  Using somewhat vague
narrow/wide terms gives us an opportunity to describe exactly what is
going on without confusing the reader with the intricacies of the
Unicode Standard or Python's compliance with a particular version of
it.

> The raw meaning of UCS2/UCS4 mainly comes into the story when people
> are encountering this as a config option when building Python. The
> whole idea of changing the terms for the two build types *should* have
> been short circuited by the "status quo wins a stalemate" guideline,
> but apparently that didn't happen at the time.
>

It also comes up in the "Data model" reference section on strings,
which is currently out of date:

"""
Strings
The items of a string object are Unicode code units. A Unicode code
unit is represented by a string object of one item and can hold either
a 16-bit or 32-bit value representing a Unicode ordinal (the maximum
value for the ordinal is given in sys.maxunicode, and depends on how
Python is configured at compile time). Surrogate pairs may be present
in the Unicode object, and will be reported as two separate items. The
built-in functions chr() and ord() convert between code units and
nonnegative integers representing the Unicode ordinals as defined in
the Unicode Standard 3.0. Conversion from and to other encodings are
possible through the string method encode().
""" http://docs.python.org/dev/reference/datamodel.html

The out-of-date part is the reference to the Unicode Standard 3.0.  I
don't think we should refer to a specific version of Unicode here.  It
has little consequence for the "Python data model" and AFAICT does not
come into play anywhere except unicodedata, which is currently at
version 6.0.
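
For anyone who wants to check which Unicode version their interpreter's
unicodedata was built against, a minimal sketch (the variable name
"version" is mine, and the output depends on the Python release you run
it on):

```python
import unicodedata

# Version of the Unicode Character Database bundled with this interpreter.
# This is the one place where a specific Unicode version matters.
version = unicodedata.unidata_version
print(version)

# Character properties such as names come from that same database.
print(unicodedata.name("A"))
```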

The description of chr() and ord() is also not accurate on narrow
builds and neither is the statement "The items of a string object are
Unicode code units."
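
To make the narrow/wide difference concrete, here is a rough sketch
rather than anything taken from the docs.  Both helper names are
hypothetical.  The first counts the UTF-16 code units a code point
needs, which is what len(chr(i)) reports on a narrow build; on a wide
build len(chr(i)) is always 1.  The second shows how the two items of
the pair are derived.

```python
import sys


def utf16_code_units(i):
    """Return how many UTF-16 code units code point i needs (hypothetical helper)."""
    # Each UTF-16 code unit is two bytes; "utf-16-le" adds no BOM.
    # "surrogatepass" lets lone surrogates through instead of raising.
    return len(chr(i).encode("utf-16-le", "surrogatepass")) // 2


def surrogate_pair(i):
    """Split a supplementary code point into its (high, low) surrogates."""
    if not 0x10000 <= i <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = i - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


if __name__ == "__main__":
    # 0xFFFF on a narrow build, 0x10FFFF on a wide build.
    print(hex(sys.maxunicode))
    print(utf16_code_units(0x41))     # BMP character: one code unit
    print(utf16_code_units(0x10000))  # supplementary character: two code units
    print([hex(u) for u in surrogate_pair(0x10000)])
```

So on a narrow build the items of chr(0x10000) are the two surrogates,
not a single code point, which is why neither statement above holds
there.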