Unicode utf-8 doesn't do back-and-forth?
Martin v. Löwis
loewis at informatik.hu-berlin.de
Mon Jul 8 10:05:25 EDT 2002
Piet van Oostrum <piet at cs.uu.nl> writes:
> Well, I looked into the Unicode specs and it says that even if single
> surrogates appear in a string, the UTF-8 encoding should generate a valid
> UTF-8 byte sequence, which on decoding should give the same surrogate. So
> I would say this is a bug in the UTF-8 encoding.
Which Unicode specs did you look at? Unicode TR #28 (aka Unicode 3.2),
http://www.unicode.org/unicode/reports/tr28/
says
<quote>
The definition of transformation formats such as UTF-8 allowed
conformant processes to interpret certain sequences called irregular
sequences. These irregular sequences are those that would be produced
by transforming supplementary code points as if they were a sequence
of two surrogate code points.
To tighten the definitions, in Unicode 3.2 such irregular sequences
are now illegal.
</quote>
Table 3.1B of the same document explicitly lists the byte sequences
that would denote code points D800-DFFF as illegal.
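As a rough illustration of what that means at the byte level, the
sketch below feeds a few sequences to a strict UTF-8 decoder. It
assumes a current CPython 3 interpreter, not the codec discussed in
this thread, and the helper name strict_utf8_ok is made up for the
example.

```python
# Assumption: run on a current CPython 3; strict_utf8_ok is a
# hypothetical helper, not part of any library.
def strict_utf8_ok(data: bytes) -> bool:
    """Return True if data decodes under the strict UTF-8 codec."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


lone_high = b"\xed\xa0\x80"               # would denote U+D800
lone_low = b"\xed\xbf\xbf"                # would denote U+DFFF
irregular = b"\xed\xa0\x80\xed\xb0\x80"   # U+10000 written as two surrogates
regular = b"\xf0\x90\x80\x80"             # U+10000 in its four-byte form

for label, data in [
    ("lone high surrogate", lone_high),
    ("lone low surrogate", lone_low),
    ("irregular pair", irregular),
    ("regular four-byte form", regular),
]:
    print(label, strict_utf8_ok(data))
```

Only the four-byte form should be accepted; the three sequences
starting with ED A0 or ED BF are the kind the table rules out.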
There is special permission given to recovery tools to deal with
irregular or illegal sequences without indicating an error, but the
standard Python UTF-8 codec certainly does not fall into this
category.
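For comparison, later Python 3 releases make this split explicit: the
default UTF-8 codec refuses lone surrogates in both directions, while
the opt-in "surrogatepass" error handler provides the lenient
behaviour a recovery tool might want. A small sketch, assuming Python
3.1 or later (again, not the codec under discussion here):

```python
# Assumption: Python 3.1+, where the "surrogatepass" error handler exists.
lone = "\ud800"  # a single high surrogate

# Strict mode: encoding a lone surrogate is an error.
try:
    lone.encode("utf-8")
    strict_encode_failed = False
except UnicodeEncodeError:
    strict_encode_failed = True

# Lenient mode: the caller explicitly asks for surrogates to pass through.
raw = lone.encode("utf-8", "surrogatepass")
round_tripped = raw.decode("utf-8", "surrogatepass")

# The same bytes are still rejected by the strict decoder.
try:
    raw.decode("utf-8")
    strict_decode_failed = False
except UnicodeDecodeError:
    strict_decode_failed = True

print(strict_encode_failed, raw, round_tripped == lone, strict_decode_failed)
```

The round trip only works when both sides opt in, which keeps the
default codec conformant.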
Regards,
Martin