[I18n-sig] UTF-8 decoder in CVS still buggy
Walter Underwood
wunder@ultraseek.com
Sun, 23 Jul 2000 13:21:55 -0700
I'd rather that it not try to "repair" broken UTF-8. If it isn't UTF-8,
throw an exception and let the caller decide.
For example, when parsing XML, invalid UTF-8 means the whole document is
invalid. It is considered polite to say where the first invalid character
occurs, but it is not acceptable to continue parsing. An XML parser cannot
use a UTF-8 decoder that accepts invalid UTF-8.
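
To illustrate the "say where the first invalid character occurs" behavior,
here is a minimal sketch. It assumes a modern Python 3 interpreter, where
bytes.decode() is strict by default and UnicodeDecodeError carries a start
offset. The helper name first_invalid_offset is hypothetical and not part
of any library or of the decoder under discussion.

```python
from typing import Optional


def first_invalid_offset(data: bytes) -> Optional[int]:
    """Return the byte offset of the first invalid UTF-8 sequence, or None.

    Hypothetical helper for illustration: it relies on strict decoding and
    never tries to repair or skip bad input.
    """
    try:
        data.decode("utf-8")  # errors="strict" is the default
    except UnicodeDecodeError as exc:
        # exc.start is the index of the first byte that could not be decoded
        return exc.start
    return None


if __name__ == "__main__":
    document = b"<a>\xff</a>"
    offset = first_invalid_offset(document)
    if offset is not None:
        print(f"not well-formed: invalid UTF-8 at byte offset {offset}")
```

A caller such as an XML parser could report the offset and then stop,
rather than continuing with a "repaired" string.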
Code that deals with multiple encodings usually needs to do some encoding
guessing up front, before choosing a decoder. If the guess is wrong, I'd
want the decoder to fail, so we can try the next most likely encoding.
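
The guess-then-fall-back pattern described above could look roughly like
the sketch below. It assumes Python 3 and strict decoding. The function
name decode_with_candidates and the candidate list are made up for
illustration, and a real guesser would rank the candidates first.

```python
from typing import Iterable, Tuple


def decode_with_candidates(data: bytes, candidates: Iterable[str]) -> Tuple[str, str]:
    """Try each candidate encoding in order; return (text, encoding_name).

    Hypothetical helper: it depends on every decoder failing loudly on bad
    input, so a wrong guess is detected instead of silently "repaired".
    """
    for name in candidates:
        try:
            return data.decode(name), name  # strict error handling by default
        except UnicodeDecodeError:
            continue  # wrong guess, move on to the next most likely encoding
    raise ValueError("none of the candidate encodings decoded the data")


if __name__ == "__main__":
    raw = "caf\u00e9".encode("latin-1")
    text, used = decode_with_candidates(raw, ["utf-8", "latin-1"])
    print(used, text)
```

Encodings that accept any byte sequence, such as Latin-1, never fail, so
they only make sense at the end of the candidate list.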
We're busy converting our search engine to use Unicode, so I'm really
familiar with the issues right now.
wunder
--
Walter Underwood
Senior Staff Engineer, Ultraseek Server, Inktomi Corp.
http://www.ultraseek.com/
http://www.inktomi.com/