[I18n-sig] UTF-8 decoder in CVS still buggy

16 Jul 2000 20:55:41 +0200

"M.-A. Lemburg" <mal@lemburg.com> writes:

> > Thanks.  It's more consistent now, but I still don't like it. The
> > basic question is whether a bad sequence like "c0 80" shall be
> > replaced by one or multiple U+FFFD characters. I vote for a single
> > replacement character because it seems natural, but different people
> > may have different opinions here. ;-)
> 
> Is there a standard way of dealing with these errors ?

From Markus Kuhn's test file:

| According to ISO 10646-1, sections R.7 and 2.3c, a device receiving
| UTF-8 shall interpret a "malformed sequence in the same way that it
| interprets a character that is outside the adopted subset". This means
| usually that the malformed UTF-8 sequence is replaced by a replacement
| character (U+FFFD), which looks a bit like an inverted question mark,
| or a similar symbol. It might be a good idea to visually distinguish a
| malformed UTF-8 sequence from a correctly encoded Unicode character
| that is just not available in the current font but otherwise fully
| legal. For both cases, a clearly recognisable symbol should be used.
| Just ignoring malformed sequences or unavailable characters will make
| debugging more difficult and can lead to user confusion.

I've contacted Markus and he told me that the proposed approach (i.e.
replace the whole sequence with a replacement character) is used in
the UTF-8 xterm extension for XFree86.  OTOH, the C library interface
makes this approach a bit complicated to implement, so it's likely
that each octet in a malformed sequence is replaced by a replacement
character there.  In the future, if UTF-8-aware C libraries are widely
deployed, xterm might use them, resulting in a changed behavior, more
like the current Python one.
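
To make the difference between the two policies concrete, here is a
minimal sketch in Python. It is neither the decoder in CVS nor what
xterm does; the names decode(), expected_length() and the per_octet
flag are made up for illustration, and it assumes sequences of at most
four octets (the five- and six-octet forms are simply treated as bad
lead octets).

```python
REPLACEMENT = "\ufffd"

# Smallest code point that may legally use a sequence of each length;
# anything below it is an overlong encoding such as "c0 80".
MIN_VALUE = {1: 0x00, 2: 0x80, 3: 0x800, 4: 0x10000}


def expected_length(lead):
    """Sequence length announced by a lead octet, or 0 if it is not one."""
    if lead < 0x80:
        return 1
    if 0xC0 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF7:
        return 4
    return 0  # stray continuation octet or unsupported lead (assumption)


def decode(data, per_octet=False):
    """Decode UTF-8 bytes, replacing malformed sequences with U+FFFD.

    per_octet=False: one U+FFFD per malformed sequence.
    per_octet=True:  one U+FFFD per octet of the malformed sequence.
    """
    out = []
    i = 0
    while i < len(data):
        n = expected_length(data[i])
        # Take the lead octet plus the continuation octets that follow it,
        # but never more than the lead octet announced.
        j = i + 1
        while j < len(data) and j < i + n and data[j] & 0xC0 == 0x80:
            j += 1
        seq = data[i:j]

        cp = None
        if n and len(seq) == n:
            # Payload bits of the lead octet, then 6 bits per continuation.
            cp = seq[0] if n == 1 else seq[0] & (0xFF >> (n + 1))
            for b in seq[1:]:
                cp = (cp << 6) | (b & 0x3F)
            # Reject overlong forms, surrogates and out-of-range values.
            if cp < MIN_VALUE[n] or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
                cp = None

        if cp is None:
            out.append(REPLACEMENT * (len(seq) if per_octet else 1))
        else:
            out.append(chr(cp))
        i = j
    return "".join(out)


print(ascii(decode(b"a\xc0\x80b")))                  # 'a\ufffdb'
print(ascii(decode(b"a\xc0\x80b", per_octet=True)))  # 'a\ufffd\ufffdb'
```

With this sketch the whole discussion boils down to the default value
of one flag; the hard part is agreeing on where a malformed sequence
ends, which the sketch settles by consuming only the continuation
octets the lead octet asked for.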

> What do other languages do, e.g. Perl, TCL ?

Sorry, I don't know.  Anyone else?

> I don't have any problem changing the current implementation,
> but would of course like to stick to an accepted standard here.

There doesn't seem to be any standard yet, and I doubt that there is
already something like best common practice. :-(

[Test module]

> 100 LOCs is ok. Would you be willing to write this up and submit
> it as patch ?

It might take some time, but yes, I'm going to do it.

> (What's the copyright on Markus Kuhn's test suite ?)

I got permission to use it for this task from him.  Is this
sufficient, or do you need a disclaimer or something like that?