[Patches] [ python-Patches-1057588 ] chr, ord, unichr documentation updates

Wed Jan 19 05:52:56 CET 2005

Patches item #1057588, was opened at 2004-10-31 02:25
Message generated for change (Comment added) made by fdrake
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=305470&aid=1057588&group_id=5470

Category: Documentation
Group: Python 2.4
>Status: Pending
Resolution: None
Priority: 5
Submitted By: Mike Brown (mike_j_brown)
Assigned to: Fred L. Drake, Jr. (fdrake)
Summary: chr, ord, unichr documentation updates

Initial Comment:
The attached diff may be applied against v1.175 of
libfuncs.tex --
http://cvs.sourceforge.net/viewcvs.py/*checkout*/python/python/dist/src/Doc/lib/libfuncs.tex?content-type=text%2Fplain&rev=1.175

chr(): A str is not in any particular encoding, so
don't talk about ASCII, which does not apply to
arguments > 127 anyway. Also make reference to unichr().

ord(): A str is not in any particular encoding, so
don't talk about ASCII. Describe what the return value
represents for each type of string (str, unicode), and
mention the TypeError that will be raised on narrow
unicode builds of Python.

unichr(): Mention the restrictions on the argument
depending on whether Python was built with wide or
narrow unicode.

The precedent in unicode() is to refer to str objects
as "8-bit strings", so the wording of the above changes
was chosen accordingly.
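
The point about encodings can be illustrated with a minimal
sketch. It assumes Python 3, where bytes stands in for the
2.x str type and chr() covers the role of unichr(); the
variable names and the choice of the value 233 are
illustrative only and do not come from the patch.

```python
# Python 3 sketch: bytes stands in for the Python 2 "8-bit string".
value = 233                     # an 8-bit value outside the ASCII range
eight_bit = bytes([value])      # rough analogue of chr(233) in Python 2
round_trip = eight_bit[0]       # rough analogue of ord(chr(233))

# The byte itself carries no encoding; the character it stands for
# depends entirely on which codec is used to interpret it.
as_latin1 = eight_bit.decode("latin-1")
as_cp437 = eight_bit.decode("cp437")

# ASCII cannot describe this value at all.
try:
    eight_bit.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

# For Unicode strings, ord() and chr() map between a one-character
# string and its integer code point.
code_point = ord(as_latin1)
same_char = chr(code_point)
```

The two decoded characters differ even though the underlying
value is the same, which is why describing the result of
chr() or ord() as "ASCII" is misleading for arguments > 127.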

----------------------------------------------------------------------

>Comment By: Fred L. Drake, Jr. (fdrake)
Date: 2005-01-18 23:52

Message:
Logged In: YES 
user_id=3066

Is the patch here finished, or was additional work needed?

----------------------------------------------------------------------

Comment By: Mike Brown (mike_j_brown)
Date: 2004-11-02 01:56

Message:
Logged In: YES 
user_id=371366

You're right re: UCS2/UCS4. I can work up another patch.

I think you know this, but "code point" is not accurate
UTR#17-conformant terminology, as it just refers to the
single integer number from the code space that is available
to Unicode (0x0-0xD7FF and 0xE000-0x10FFFF), bearing in mind
that not all code points correspond to characters (all those
whose hex values end in FFFE and FFFF, for example).

If we are just talking about what a Unicode string is in a
general sense, we say it is just a sequence of characters --
a character being a unit like, say, "Latin small letter z",
or "plus sign", in a writing system ("script") like
Latin/Roman, Cyrillic, Hiragana, etc.

If we are talking about what the unicode type is in Python,
to be accurate, we should say it is a sequence of UCS2 or
UCS4 "code values", depending on how Python was compiled,
and note that in its printable representation, the unicode
type displays, for characters outside the ASCII range, the
"code points" represented by those code values. It does this
using the same syntax as for string literals, but treats
surrogate pairs of code values as being representative of a
single code point (e.g., a unicode object consisting of code
value 0xD800 followed by 0xDC00 is printably represented by
u'\U00010000' even though it's still a string of length 2 in
both UCS2 and UCS4 builds of Python).
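
The relationship between the code values 0xD800, 0xDC00 and
the code point 0x10000 can be checked with a little
arithmetic. This is a minimal sketch, assuming Python 3
(whose str type no longer has narrow and wide builds);
to_surrogate_pair is a hypothetical helper name, not
something in the standard library.

```python
import struct


def to_surrogate_pair(code_point):
    """Split a code point above 0xFFFF into two 16-bit code values."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not in the supplementary range")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)      # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits of the offset
    return high, low


pair = to_surrogate_pair(0x10000)

# Cross-check against the UTF-16 encoder in the standard library:
# the encoded form is two big-endian 16-bit code values.
encoded = "\U00010000".encode("utf-16-be")
from_codec = struct.unpack(">2H", encoded)
```

Both routes give the same two code values, which is the pair
a 16-bit representation has to store for that one code point.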

Is there a recommendation for how to refer unambiguously to
an instance of a unicode type? Is it a "unicode object"? How
about an instance of the str type? Is it an "8-bit string"?
I notice we say "byte string" a lot but apparently not
everyone is happy about that.

----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2004-11-01 06:11

Message:
Logged In: YES 
user_id=38388

The new wording is indeed better than the old one. +1 on that
change.

However, you should use the term "code point" consistently
and perhaps add a footnote explaining the difference between
code point, glyph and character (Unicode strings are arrays
of code points - not characters).

Another note: I don't particularly like the terms "narrow"
and "wide" Unicode builds. If possible, these terms should
be replaced by the more accurate technical terms UCS2 and
UCS4, since the error messages relating to this difference
also mention these technical terms rather than narrow or
wide builds.

----------------------------------------------------------------------

Comment By: Mike Brown (mike_j_brown)
Date: 2004-10-31 13:17

Message:
Logged In: YES 
user_id=371366

Oops, didn't mean to remove the assignment to fdrake when
adding previous comment.

----------------------------------------------------------------------

Comment By: Mike Brown (mike_j_brown)
Date: 2004-10-31 03:23

Message:
Logged In: YES 
user_id=371366

Also note that I did not suggest removing the example with
the letter "a". I just suggested removing the reference to
"ASCII" in particular.

Ideally, IMHO, the documentation for sequence types is where
one should mention the strong association between strings
and ASCII. It currently doesn't even really describe what a
string or Unicode string is. It should state that
non-Unicode strings are an abstraction in which each member
of the sequence is a "character" that is actually an 8-bit
value, as in Standard C, intended to represent a character
in an arbitrary encoding. It should also state that there is
an _informal_ convention, in documentation, of referring to
these values as being ASCII values. That convention exists
in part due to the notational conventions of string
literals, such as using "\t", "\n", and "\r" to represent
decimal values 9, 10, and 13, respectively (associations
that only make sense in ASCII or ASCII-based encodings), and
in part because it is easier to talk about the lower 128
values in terms of their ASCII equivalents (e.g. "chr(97)
produces the string 'a'").

Likewise, the unicode type could be described as being an
abstraction of 16-bit ("narrow") or 32-bit ("wide") code
units, depending on how Python was built, and so on.

I would see making such unambiguous statements as a
reasonable alternative to just deleting mentions of ASCII
from the library docs, although I think making all of the
changes would be best. People already have preconceived
notions of what a 'string' is, and I know from experience
that they tend not to worry about straightening out their
understanding of such nuances until they get burned by
assumptions built around statements like "ord() gives you
the ASCII value".
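
The associations mentioned above are easy to check. This is
a minimal sketch, assuming Python 3; the cp037 (EBCDIC)
codec is used here only as an example of an encoding that is
not ASCII-based, and the variable names are illustrative.

```python
# The escapes are notation for fixed numeric values.
escape_values = [ord(c) for c in "\t\n\r"]   # [9, 10, 13]

# The same values seen as 8-bit data (Python 3 bytes).
byte_values = list(b"\t\n\r")

# The familiar pairing of 97 and 'a'.
letter = chr(97)

# Under a non-ASCII-based encoding, the 8-bit value 97 is some
# other character, and 'a' is stored as some other value.
not_a = bytes([97]).decode("cp037")
a_in_ebcdic = "a".encode("cp037")
```

The pairing of 97 with 'a' holds only under ASCII or an
ASCII-based encoding, so it is a convention of the notation
and not a property of the 8-bit values themselves.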

----------------------------------------------------------------------

Comment By: Mike Brown (mike_j_brown)
Date: 2004-10-31 02:51

Message:
Logged In: YES 
user_id=371366

That kind of resistance to using accurate, strict
terminology just perpetuates common misunderstandings about
the relationship between characters and encodings.

----------------------------------------------------------------------

Comment By: Raymond Hettinger (rhettinger)
Date: 2004-10-31 02:38

Message:
Logged In: YES 
user_id=80475

The attachment didn't make it.  Try again.

And, FWIW, I think the documentation is perfectly clear as
is.  Though the ASCII reference is not strict, I think
taking it out would be a mistake.  Though many encodings are
possible, there is a strong relationship between the number
97 and the letter 'a'.  Mentioning ASCII makes that
relationship clear.

IOW, I'm -1 on changing it until a new bytes type is introduced.

----------------------------------------------------------------------
