[Python-Dev] Multilingual programming article on the Red Hat Developer blog
solipsis at pitrou.net
Wed Sep 17 11:37:43 CEST 2014
Seriously, can this discussion move somewhere else?
This does not belong on python-dev.
On Wed, 17 Sep 2014 18:56:02 +1000
Steven D'Aprano <steve at pearwood.info> wrote:
> On Wed, Sep 17, 2014 at 09:21:56AM +0900, Stephen J. Turnbull wrote:
> > Guido's mantra is something like "Python's str doesn't contain
> > characters or even code points, it contains code units."
> But is that true? If it were true, I would expect to be able to make
> Python text strings containing code units that aren't code points, e.g.
> something like "\U12340000" or chr(0x12340000) should work, but neither
> do. As far as I can tell, there is no way to build a string containing
> items which aren't code points.
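
As a quick check of the claim above, here is a minimal sketch (assuming CPython 3.3 or later; the variable names are only illustrative) showing that both spellings are rejected, while the largest code point is accepted:

```python
# Neither attempt produces a string: values above U+10FFFF are not code points.
chr_error = None
literal_error = None

try:
    chr(0x12340000)
except ValueError as exc:
    chr_error = exc          # chr() refuses anything outside range(0x110000)

try:
    # The literal is compiled from source text so that the failure can be
    # caught here instead of aborting the whole example at parse time.
    compile(r'"\U12340000"', "<example>", "eval")
except SyntaxError as exc:
    literal_error = exc      # the escape is rejected while parsing the literal

largest = chr(0x10FFFF)      # the highest code point is fine
print(type(chr_error).__name__, type(literal_error).__name__, len(largest))
```
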
> I don't think it is useful to say that strings *contain* code units,
> more that they *are made up from* code units. Code units are the
> implementation: 16-bit code units in narrow builds, 32-bit code units
> in wide builds, and either 8-, 16- or 32-bit code units in Python 3.3 and
> beyond. (I don't know of any Python implementation which uses UTF-8
> internally, but if there were one, it would use 8-bit code units.)
> It isn't very useful to say that in Python 3.3 the string "A" *contains*
> the 8-bit code unit 0x41. That's conflating two different levels of
> explanation (the high-level interface and the underlying implementation)
> and potentially leads to user confusion like
> # 8-bit code units are bytes, right?
> assert b'\x41' in "A"
> which is Not Even Wrong.
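
For illustration, a small sketch (Python 3; names are illustrative) of what happens when the two levels are mixed: the containment test is a type error, and bytes only appear once the string is explicitly encoded.

```python
# Mixing bytes and str in a containment test is a TypeError in Python 3.
mixing_error = None
try:
    b"\x41" in "A"
except TypeError as exc:
    mixing_error = exc

# The byte 0x41 only shows up after an explicit encoding step.
encoded = "A".encode("ascii")
print(type(mixing_error).__name__, encoded)
```
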
> I think it is correct to say that Python strings are sequences of
> Unicode code points U+0000 through U+10FFFF. There are no other
> restrictions, e.g. strings can contain surrogates, noncharacters, or
> nonsensical combinations of code points such as a U+0300 COMBINING GRAVE
> ACCENT combined with U+000A (newline).
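
A rough sketch of the "no other restrictions" point, assuming Python 3.3 or later (the sample names are made up for the example): str accepts a lone surrogate, a noncharacter, and a combining mark after a newline, and only the UTF-8 encoder objects to the surrogate.

```python
lone_surrogate = "\ud800"      # surrogate code point, no partner
noncharacter = "\uffff"        # one of the Unicode noncharacters
odd_combination = "\n\u0300"   # newline followed by COMBINING GRAVE ACCENT

# Each item in a str is one code point, whatever that code point means.
lengths = [len(s) for s in (lone_surrogate, noncharacter, odd_combination)]

# Encoding is where the restrictions show up: UTF-8 refuses surrogates...
encode_error = None
try:
    lone_surrogate.encode("utf-8")
except UnicodeEncodeError as exc:
    encode_error = exc

# ...but encodes a noncharacter without complaint.
noncharacter_bytes = noncharacter.encode("utf-8")
print(lengths, type(encode_error).__name__, noncharacter_bytes)
```
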
> > Implying
> > that dealing with characters (or the grapheme globs that occasionally
> > raise their ugly heads here) is an issue for higher-level facilities
> > than str to deal with.
> Agreed that Python doesn't offer a string type based on graphemes, and
> that such a facility belongs as a high-level library, not a built-in.
> Also agreed that talking about characters is sloppy. Nevertheless, for
> English speakers at least, "code point = character" isn't too awful a
> first approximation.
> > The point being that
> > > Basically, we are pretending that each smuggled byte is a single
> > > character
> > is something of a misstatement (good enough for present purpose of
> > discussing email, but not good enough for the general case of
> > understanding how this is supposed to work when porting the construct
> > to other Python implementations), while
> > > for string parsing purposes...but they don't match any of our
> > > parsing constants.
> > is precisely Pythonically correct. You might want to add "because all
> > parsing constants contain only valid characters by construction."
> I don't understand what you are trying to say here.
> > > [*] I worried a lot that this was re-introducing the bytes/string
> > > problem from python2.
> > It isn't, because the bytes/str problem was that given a str object
> > out of context you could not tell whether it was a binary blob or
> > text, and if text, you couldn't tell if it was external encoded text
> > or internal abstract text.
> > That is not true here because the representations of characters vs.
> > smuggled bytes in str are disjoint sets.
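
A sketch of the "disjoint sets" remark, assuming the smuggling under discussion is the surrogateescape error handler from PEP 383 (the thread does not name it in this message, so treat that as an assumption). Undecodable bytes land in U+DC80..U+DCFF, which no successfully decoded character occupies, so they can be picked out and restored:

```python
raw = b"caf\xe9"   # Latin-1 bytes, not valid ASCII

# Each undecodable byte 0x80..0xFF is mapped to U+DC80..U+DCFF.
text = raw.decode("ascii", errors="surrogateescape")

# The smuggled bytes are identifiable by code point range alone.
smuggled = [ch for ch in text if 0xDC80 <= ord(ch) <= 0xDCFF]

# Encoding with the same handler restores the original bytes exactly.
round_trip = text.encode("ascii", errors="surrogateescape")
print(ascii(text), len(smuggled), round_trip == raw)
```
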
> Nor am I sure what you are trying to say here.
> > Footnotes:
> >  In Unicode terminology, a code unit is the smallest computer
> > object that can represent a character (this is uniquely and sanely
> > defined for all real Unicode transformation formats aka UTFs). A code
> > point is an integer 0 - (17*256*256-1) that can represent a character,
> > but many code points such as surrogates and 0xFFFF are defined to be
> > non-characters.
> Actually not quite. "Noncharacter" is concretely defined in Unicode, and
> there are only 66 of them, many fewer than the surrogate code points
> alone. Surrogates are reserved, not noncharacters.
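
The counts can be checked with a short sketch (helper names are illustrative): the noncharacters are U+FDD0..U+FDEF plus the last two code points of each of the 17 planes, while the surrogates are U+D800..U+DFFF.

```python
import sys

total_code_points = 17 * 256 * 256   # code points U+0000 through U+10FFFF

def is_noncharacter(cp):
    # U+FDD0..U+FDEF, plus U+xxFFFE and U+xxFFFF in every plane
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE

def is_surrogate(cp):
    return 0xD800 <= cp <= 0xDFFF

noncharacters = sum(1 for cp in range(total_code_points) if is_noncharacter(cp))
surrogates = sum(1 for cp in range(total_code_points) if is_surrogate(cp))
print(total_code_points, noncharacters, surrogates)
```
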
> It is wrong to talk about "surrogate characters", but perhaps you mean
> to say that surrogates (by which I understand you to mean surrogate code
> points) are "not human-meaningful characters", which is not the same
> thing as a Unicode noncharacter.
> > Characters are those code points that may be assigned
> > an interpretation as a character, including undefined characters
> > (private space and reserved).
> So characters are code points which are characters, including undefined
> characters? :-)