[I18n-sig] UTF-8 decoder in CVS still buggy

Walter Underwood wunder@ultraseek.com
Sun, 23 Jul 2000 13:21:55 -0700


I'd rather that it not try to "repair" broken UTF-8. If it isn't UTF-8,
throw an exception and let the caller decide.

For example, when parsing XML, invalid UTF-8 means the whole document is
invalid. It is considered polite to say where the first invalid character
occurs, but it is not acceptable to continue parsing. An XML parser cannot
use a UTF-8 decoder that accepts invalid UTF-8.
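
As a rough illustration of the behavior described above, here is a minimal
sketch in Python. It assumes a decoder that raises on malformed input (the
"strict" error mode) and an exception object that carries the offset of the
first bad byte. The helper name decode_utf8_strict is hypothetical and is
not part of any library.

```python
def decode_utf8_strict(data: bytes) -> str:
    """Decode UTF-8 without repairing anything.

    Hypothetical helper: on malformed input it raises ValueError and
    reports the byte offset of the first invalid sequence, so the caller
    (for example an XML parser) can reject the whole document.
    """
    try:
        # errors="strict" makes the decoder raise instead of substituting
        # or skipping bad bytes.
        return data.decode("utf-8", errors="strict")
    except UnicodeDecodeError as exc:
        # exc.start is the offset of the first byte that could not be decoded.
        raise ValueError(
            f"invalid UTF-8 at byte offset {exc.start}: {exc.reason}"
        ) from exc
```

A caller that must not continue after an error simply lets the exception
propagate; a caller that wants to recover can catch it and decide for itself.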

Code that deals with multiple encodings usually needs to do some encoding
guessing up front, before choosing a decoder. If the guess is wrong, I'd
want the decoder to fail, so we can try the next most likely encoding.
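
The guess-then-fall-back approach might look something like the sketch
below. It assumes the candidate encodings are already ordered from most to
least likely, and that every decoder fails loudly on bad input. The function
name decode_with_guesses and the candidate list are hypothetical, chosen
only for illustration.

```python
def decode_with_guesses(data: bytes, candidates):
    """Try each candidate encoding in order; return (text, encoding).

    Hypothetical helper. It relies on strict decoders: a wrong guess must
    raise, otherwise the loop would never reach the next candidate.
    """
    for encoding in candidates:
        try:
            return data.decode(encoding, errors="strict"), encoding
        except UnicodeDecodeError:
            # Wrong guess: move on to the next most likely encoding.
            continue
    raise ValueError("none of the candidate encodings decoded the data")


# Usage: Latin-1 accepts every byte sequence, so it only makes sense as the
# last candidate in the list.
text, used = decode_with_guesses(b"caf\xe9", ["utf-8", "latin-1"])
```

Note that a decoder that silently repairs its input would always "succeed"
on the first guess, which is exactly why this pattern needs strict failure.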

We're busy converting our search engine to use Unicode, so I'm really
familiar with the issues right now.

wunder
--
Walter Underwood
Senior Staff Engineer, Ultraseek Server, Inktomi Corp.
http://www.ultraseek.com/
http://www.inktomi.com/