[I18n-sig] UTF-8 decoder in CVS still buggy
Sun, 16 Jul 2000 17:36:18 +0200
Florian Weimer wrote:
> "M.-A. Lemburg" <email@example.com> writes:
> > I've checked in a fix which should remedy the problem.
> > Could you run the stress test using the fixed
> > interpreter ?
> Thanks. It's more consistent now, but I still don't like it. The
> basic question is whether a bad sequence like "c0 80" shall be
> replaced by one or multiple U+FFFD characters. I vote for a single
> replacement character because it seems natural, but different people
> may have different opinions here. ;-)
Is there a standard way of dealing with these errors?
What do other languages do, e.g. Perl or Tcl?
I don't have any problem changing the current implementation,
but would of course like to stick to an accepted standard here.
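
To make the two options concrete, here is a minimal sketch in present-day
Python. It is not the decoder in CVS: decode_lenient and its collapse flag
are hypothetical names invented for this illustration, and "a single
replacement character" is read here as "one U+FFFD per run of undecodable
bytes", which is only one possible interpretation of the proposal.

```python
REPLACEMENT = "\ufffd"


def decode_lenient(data: bytes, collapse: bool) -> str:
    """Decode UTF-8, substituting U+FFFD for bytes that cannot be decoded.

    collapse=False: each undecodable byte gets its own U+FFFD.
    collapse=True:  a run of undecodable bytes gets a single U+FFFD.
    (Hypothetical helper for illustration, not a real codec API.)
    """
    out = []
    i = 0
    in_bad_run = False
    while i < len(data):
        # Try the shortest slice starting at i that decodes strictly.
        for length in (1, 2, 3, 4):
            try:
                char = data[i:i + length].decode("utf-8")
            except UnicodeDecodeError:
                continue
            out.append(char)
            i += length
            in_bad_run = False
            break
        else:
            # No slice starting at i is well-formed: skip one byte.
            if not (collapse and in_bad_run):
                out.append(REPLACEMENT)
            in_bad_run = True
            i += 1
    return "".join(out)


bad = b"a\xc0\x80b"
per_byte = decode_lenient(bad, collapse=False)   # 'a', U+FFFD, U+FFFD, 'b'
collapsed = decode_lenient(bad, collapse=True)   # 'a', U+FFFD, 'b'
```

Either policy hides the overlong NUL from the caller; they differ only in
how many replacement characters end up in the output.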
> > BTW, how much code is the stress test ? Maybe we should add
> > some of it to the test suite.
> Currently, it isn't automated (I only feed Markus Kuhn's UTF-8 test
> through the decoder), and I expect that an automated implementation
> would consist of around 100 lines of code. (The test covers just the
> most important borderline cases.)
100 LOC is OK. Would you be willing to write this up and submit
it as a patch?
(What's the copyright on Markus Kuhn's test suite ?)
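
As a rough idea of what an automated version might look like, here is a
small table-driven sketch. The byte sequences are illustrative borderline
cases picked for this example, not copied from Markus Kuhn's file, and
run_stress_test is a hypothetical name. By default it exercises the
built-in bytes.decode of a current Python, so it says nothing about the
decoder in CVS.

```python
# Inputs a strict UTF-8 decoder is expected to reject.
MALFORMED = [
    b"\x80",              # lone continuation byte
    b"\xc0\x80",          # overlong encoding of U+0000
    b"\xe0\x80\xaf",      # overlong encoding of U+002F
    b"\xe2\x82",          # truncated three-byte sequence
    b"\xed\xa0\x80",      # encoded surrogate U+D800
    b"\xf4\x90\x80\x80",  # beyond U+10FFFF
    b"\xfe",              # byte that never occurs in UTF-8
]

# Boundary values of each sequence length, with the expected code point.
WELL_FORMED = [
    (b"\x7f", 0x7F),
    (b"\xc2\x80", 0x80),
    (b"\xdf\xbf", 0x7FF),
    (b"\xe0\xa0\x80", 0x800),
    (b"\xef\xbf\xbf", 0xFFFF),
    (b"\xf0\x90\x80\x80", 0x10000),
    (b"\xf4\x8f\xbf\xbf", 0x10FFFF),
]


def strict_decode(data: bytes) -> str:
    """Default decoder under test: the built-in strict UTF-8 codec."""
    return data.decode("utf-8")


def run_stress_test(decode=strict_decode):
    """Return a list of (problem, input) pairs; empty means all cases pass."""
    failures = []
    for data in MALFORMED:
        try:
            decode(data)
        except UnicodeError:
            continue  # rejected, as expected
        failures.append(("accepted malformed input", data))
    for data, code_point in WELL_FORMED:
        try:
            result = decode(data)
        except UnicodeError:
            failures.append(("rejected well-formed input", data))
            continue
        if result != chr(code_point):
            failures.append(("wrong code point", data))
    return failures


failures = run_stress_test()
```

Passing a different callable as decode would let the same table be pointed
at another decoder; testing the replacement behaviour rather than strict
rejection would need a second table with expected output strings.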
Python Pages: http://www.lemburg.com/python/