recycling internationalized garbage

Wed Mar 15 04:04:36 EST 2006

Martin wrote:

> > The point is that you can tell UTF-8 reliably.

RFC 3629 says "fairly reliably" rather than "reliably", but they mean
the same thing...

> > If the data decodes
> > as UTF-8, it *is* UTF-8, because no other encoding in the world
> > produces the same byte sequences (except for ASCII, which is
> > a UTF-8 subset).

or as the RFC puts it,

    "the probability that a string of characters in any other encoding
    appears as valid UTF-8 is low, diminishing with increasing string
    length".
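
a minimal sketch of that check, assuming Python 3 and a strict decoder
(the helper name looks_like_utf8 is made up for illustration, it is not
from the RFC or any library):

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8 (hypothetical helper)."""
    try:
        # the codec is strict by default: overlong forms, stray
        # continuation bytes and truncated sequences all raise
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


print(looks_like_utf8("café".encode("utf-8")))    # True
print(looks_like_utf8("café".encode("latin-1")))  # False: lone 0xE9 byte
print(looks_like_utf8(b"plain ascii"))            # True: ASCII is a subset
```

a True result for pure ASCII input says nothing either way, since those
bytes mean the same thing in both encodings.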

:::

Ross Ridge wrote:

> It should be obvious that any 8-bit single-byte character set can
> produce byte sequences that are valid in UTF-8.

it should be fairly obvious that you don't know much about UTF-8...
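
a rough way to see how seldom that happens: embed non-ASCII bytes between
ASCII letters, as they would appear in typical single-byte text, and count
how many combinations survive a strict UTF-8 decode.  this is only a
sketch, assuming Python 3; the helper and variable names are made up, and
counting raw byte combinations says nothing about how often they occur in
real text.

```python
def decodes_as_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


HIGH = range(0x80, 0x100)  # the 128 non-ASCII byte values

# one non-ASCII byte surrounded by ASCII letters
lone = sum(decodes_as_utf8(b"a" + bytes([b]) + b"z") for b in HIGH)

# two adjacent non-ASCII bytes surrounded by ASCII letters
pairs = sum(
    decodes_as_utf8(b"a" + bytes([b1, b2]) + b"z")
    for b1 in HIGH
    for b2 in HIGH
)

print(lone, "of", len(HIGH))       # 0 of 128
print(pairs, "of", len(HIGH) ** 2)  # 1920 of 16384
```

a lone non-ASCII byte never passes, and a pair passes only when a lead
byte in 0xC2..0xDF is followed by a continuation byte in 0x80..0xBF; every
additional non-ASCII character in the text has to satisfy the same
constraint.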

</F>