[I18n-sig] UTF-8 decoder in CVS still buggy
16 Jul 2000 20:55:41 +0200
"M.-A. Lemburg" <firstname.lastname@example.org> writes:
> > Thanks. It's more consistent now, but I still don't like it. The
> > basic question is whether a bad sequence like "c0 80" shall be
> > replaced by one or multiple U+FFFD characters. I vote for a single
> > replacement character because it seems natural, but different people
> > may have different opinions here. ;-)
> Is there a standard way of dealing with these errors ?
From Markus Kuhn's test file:
| According to ISO 10646-1, sections R.7 and 2.3c, a device receiving
| UTF-8 shall interpret a "malformed sequence in the same way that it
| interprets a character that is outside the adopted subset". This means
| usually that the malformed UTF-8 sequence is replaced by a replacement
| character (U+FFFD), which looks a bit like an inverted question mark,
| or a similar symbol. It might be a good idea to visually distinguish a
| malformed UTF-8 sequence from a correctly encoded Unicode character
| that is just not available in the current font but otherwise fully
| legal. For both cases, a clearly recognisable symbol should be used.
| Just ignoring malformed sequences or unavailable characters will make
| debugging more difficult and can lead to user confusion.
I've contacted Markus and he told me that the proposed approach (i.e.
replace the whole sequence with a replacement character) is used in
the UTF-8 xterm extension for XFree86. OTOH, the C library interface
makes this approach a bit complicated to implement, so it's likely
that each octet in a malformed sequence is replaced by a replacement
character there. In the future, if UTF-8-aware C libraries are widely
deployed, xterm might use them, resulting in a changed behavior, more
like the current Python one.
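
To make the two policies concrete, here is a minimal sketch in present-day
Python 3. It is not the patch discussed in this thread and not the code in
CVS. The names decode_utf8, expected_length and the per_octet flag are
made up for illustration, and it assumes the modern four-octet limit, so
the lead bytes 0xF8-0xFF are simply treated as invalid single octets. A
malformed sequence is taken to be a lead byte plus the continuation bytes
that follow it, up to the length the lead byte announces.

```python
REPLACEMENT = "\ufffd"

# Smallest code point that may legally use each sequence length;
# anything below it is an overlong encoding such as "c0 80".
MIN_CODE_POINT = {1: 0x0, 2: 0x80, 3: 0x800, 4: 0x10000}


def expected_length(lead):
    """Sequence length announced by a lead byte, or 0 if it cannot start one."""
    if lead < 0x80:
        return 1
    if 0xC0 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF7:
        return 4
    return 0  # stray continuation byte (0x80-0xBF) or 0xF8-0xFF


def decode_utf8(data, per_octet=False):
    """Decode bytes, replacing malformed sequences with U+FFFD.

    per_octet=False: one U+FFFD for the whole malformed sequence.
    per_octet=True:  one U+FFFD for every octet of that sequence.
    """
    out = []
    i = 0
    while i < len(data):
        n = expected_length(data[i])
        if n == 0:
            # A byte that cannot start a sequence is replaced on its own.
            out.append(REPLACEMENT)
            i += 1
            continue
        # Gather the continuation bytes that follow, at most n - 1 of them.
        j = i + 1
        while j < len(data) and j < i + n and 0x80 <= data[j] <= 0xBF:
            j += 1
        seq = data[i:j]
        ok = len(seq) == n  # False for a truncated sequence
        if ok:
            cp = seq[0] if n == 1 else seq[0] & (0xFF >> (n + 1))
            for b in seq[1:]:
                cp = (cp << 6) | (b & 0x3F)
            # Reject overlong forms, surrogates and out-of-range values.
            ok = (MIN_CODE_POINT[n] <= cp <= 0x10FFFF
                  and not 0xD800 <= cp <= 0xDFFF)
        if ok:
            out.append(chr(cp))
        else:
            out.append(REPLACEMENT * (len(seq) if per_octet else 1))
        i = j
    return "".join(out)


# The overlong "c0 80" under both policies:
single = decode_utf8(b"a\xc0\x80b")                  # 'a\ufffdb'
multiple = decode_utf8(b"a\xc0\x80b", per_octet=True)  # 'a\ufffd\ufffdb'
```

With per_octet=False the reader sees one replacement character for each
damaged character, which is the behaviour I am voting for; with
per_octet=True the output looks like what a decoder built on an octet-wise
C library interface would probably produce.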
> What do other languages do, e.g. Perl, TCL ?
Sorry, I don't know. Anyone else?
> I don't have any problem changing the current implementation,
> but would of course like to stick to an accepted standard here.
There doesn't seem to be any standard yet, and I doubt that there is
already something like best common practice. :-(
> 100 LOCs is ok. Would you be willing to write this up and submit
> it as patch ?
It might take some time, but yes, I'm going to do it.
> (What's the copyright on Markus Kuhn's test suite ?)
I got permission to use it for this task from him. Is this
sufficient, or do you need a disclaimer or something like that?