recycling internationalized garbage

Fredrik Lundh fredrik at
Wed Mar 15 10:04:36 CET 2006

Martin wrote:

> > The point is that you can tell UTF-8 reliably.

RFC 3629 says "fairly reliably" rather than "reliably", but they mean
the same thing...

> > If the data decodes
> > as UTF-8, it *is* UTF-8, because no other encoding in the world
> > produces the same byte sequences (except for ASCII, which is
> > a UTF-8 subset).

or as the RFC puts it,

    "the probability that a string of characters in any other encoding
    appears as valid UTF-8 is low, diminishing with increasing string
    length."
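
to illustrate, here is a minimal sketch of the "does it decode?" check in
Python 3.  the helper name looks_like_utf8 and the sample strings are made
up for this example; nothing here comes from the RFC or from a standard
library API beyond bytes.decode itself:

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as strict UTF-8."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


# Ordinary Latin-1 text: 0xE9 followed by a space is not a valid
# UTF-8 sequence, so the check rejects it.
latin1_text = "café au lait".encode("latin-1")

# The same text encoded as UTF-8 passes the check.
utf8_text = "café au lait".encode("utf-8")

# A contrived Latin-1 string whose bytes (0xC3 0xA9) happen to be
# identical to the UTF-8 encoding of "é".
contrived = "Ã©".encode("latin-1")

for sample in (latin1_text, utf8_text, contrived):
    print(sample, looks_like_utf8(sample))
```

the contrived case is why the RFC hedges: a short string in a legacy
encoding can be valid UTF-8 by coincidence, but real text in such an
encoding rarely lines up that way, and the odds shrink as the string
gets longer.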


Ross Ridge wrote:

> It should be obvious that any 8-bit single-byte character set can
> produce byte sequences that are valid in UTF-8.

it should be fairly obvious that you don't know much about UTF-8...


More information about the Python-list mailing list