[I18n-sig] UTF-8 decoder in CVS still buggy

M.-A. Lemburg mal@lemburg.com
Sun, 16 Jul 2000 17:36:18 +0200


Florian Weimer wrote:
> 
> "M.-A. Lemburg" <mal@lemburg.com> writes:
> 
> > I've checked in a fix which should remedy the problem.
> > Could you run the stress test using the fixed
> > interpreter ?
> 
> Thanks.  It's more consistent now, but I still don't like it. The
> basic question is whether a bad sequence like "c0 80" shall be
> replaced by one or multiple U+FFFD characters. I vote for a single
> replacement character because it seems natural, but different people
> may have different opinions here. ;-)

Is there a standard way of dealing with these errors?
What do other languages do, e.g. Perl, Tcl?

I don't have any problem changing the current implementation,
but would of course like to stick to an accepted standard here.
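
To make the two options concrete, here is a minimal sketch. It assumes
a current CPython 3 interpreter, not the decoder under discussion in
this thread. The handler name "replace_once" and its policy (one
U+FFFD for the offending byte plus any continuation bytes directly
after it) are hypothetical, chosen only to illustrate the "single
replacement character" behaviour.

```python
import codecs


def replace_once(exc):
    """Hypothetical error handler: one U+FFFD per run of bad bytes."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    data = exc.object
    end = exc.end
    # Swallow continuation bytes (0x80..0xBF) that directly follow the
    # byte range the decoder already flagged as invalid.
    while end < len(data) and 0x80 <= data[end] <= 0xBF:
        end += 1
    # Return the replacement text and the position to resume decoding at.
    return ("\ufffd", end)


codecs.register_error("replace_once", replace_once)

bad = b"a\xc0\x80b"  # "c0 80" is an overlong encoding of U+0000

# Built-in policy: each invalid byte is replaced separately.
many = bad.decode("utf-8", errors="replace")

# Sketch of the alternative: the whole bad sequence becomes one U+FFFD.
single = bad.decode("utf-8", errors="replace_once")

print(many.count("\ufffd"), single.count("\ufffd"))
```

With the built-in "replace" handler the two bytes yield two replacement
characters; with the sketched handler they yield one.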
 
> > BTW, how much code is the stress test ? Maybe we should add
> > some of it to the test suite.
> 
> Currently, it isn't automated (I only feed Markus Kuhn's UTF-8 test
> through the decoder), and I expect that an automated implementation
> would consist of around 100 lines of code.  (The test covers just the
> most important borderline cases.)

100 lines of code is OK. Would you be willing to write this up and
submit it as a patch?
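
A table-driven test along these lines might look like the sketch
below. The cases are hand-picked borderline values, not taken from
Markus Kuhn's file, and the expected results assume the strict UTF-8
decoder of a current CPython 3. The function name run_cases is
hypothetical.

```python
# Byte sequences that a strict decoder should accept, with the
# expected text: first and last code point of each encoded length.
VALID = [
    (b"\x7f", "\x7f"),
    (b"\xc2\x80", "\x80"),
    (b"\xdf\xbf", "\u07ff"),
    (b"\xe0\xa0\x80", "\u0800"),
    (b"\xef\xbf\xbf", "\uffff"),
    (b"\xf0\x90\x80\x80", "\U00010000"),
    (b"\xf4\x8f\xbf\xbf", "\U0010ffff"),
]

# Byte sequences that a strict decoder should reject.
INVALID = [
    b"\x80",              # lone continuation byte
    b"\xc0\x80",          # overlong encoding of U+0000
    b"\xe0\x80\x80",      # overlong three-byte form
    b"\xed\xa0\x80",      # encoded surrogate U+D800
    b"\xf4\x90\x80\x80",  # beyond U+10FFFF
    b"\xe2\x82",          # truncated three-byte sequence
]


def run_cases():
    """Return a list of descriptions of failed cases (empty if all pass)."""
    failures = []
    for raw, expected in VALID:
        try:
            if raw.decode("utf-8") != expected:
                failures.append(("wrong result", raw))
        except UnicodeDecodeError:
            failures.append(("rejected valid input", raw))
    for raw in INVALID:
        try:
            raw.decode("utf-8")
            failures.append(("accepted invalid input", raw))
        except UnicodeDecodeError:
            pass
    return failures


print(run_cases())
```

An empty list means every case behaved as listed; anything else names
the input that needs a closer look.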

(What's the copyright on Markus Kuhn's test suite?)

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/