[I18n-sig] UTF-8 decoder in CVS still buggy

M.-A. Lemburg mal@lemburg.com
Sat, 02 Sep 2000 16:03:46 +0200

François Pinard wrote:
> [mal@lemburg.com]
> > Please keep us informed of any quirks you may experience during this
> > conversion.  We can use some real life reports for the new Unicode
> > support in Python to polish up the implementation and design.
>
> Hi, people.  I just recently subscribed to i18n-sig, and started to
> read the archives.  I hope you will tolerate my jumping into some
> conversations without having absorbed all the background.
>
> On the above topic, I did not check what exactly Python does, but I wanted to
> share that my `recode' program is not perfect in that area.  In particular,
> there is a requirement for UTF-8 to be valid that the sequence be minimal,
> which `recode' currently does not check on input.  Roughly speaking, a UTF-8
> sequence is not valid if it could have been expressed in fewer bytes.
>
> I've nothing against Python beating me at it! :-)

Could you give some examples?  I'm not sure I understand what you
mean by "could have been expressed with fewer bytes" -- perhaps
a multi-byte encoding where the top-most bytes are 0?
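
If that reading is right, the check could be sketched roughly as below.
This is only an illustration in present-day Python 3, not the decoder
in CVS; the names decode_one and MIN_CODE_POINT are made up for the
example, and it ignores other validity rules (surrogates, the upper
limit of the code space).

```python
# Smallest code point that genuinely needs a sequence of each length.
MIN_CODE_POINT = {1: 0x00, 2: 0x80, 3: 0x800, 4: 0x10000}

def decode_one(seq):
    """Decode a single UTF-8 sequence by hand.

    Returns (code_point, is_minimal).  is_minimal is False when the
    same code point could have been written in fewer bytes.
    """
    first = seq[0]
    if first < 0x80:
        length, value = 1, first
    elif first & 0xE0 == 0xC0:
        length, value = 2, first & 0x1F
    elif first & 0xF0 == 0xE0:
        length, value = 3, first & 0x0F
    elif first & 0xF8 == 0xF0:
        length, value = 4, first & 0x07
    else:
        raise ValueError("invalid lead byte")
    if len(seq) != length:
        raise ValueError("wrong sequence length")
    for byte in seq[1:]:
        if byte & 0xC0 != 0x80:
            raise ValueError("invalid continuation byte")
        # Each continuation byte contributes its low six bits.
        value = (value << 6) | (byte & 0x3F)
    return value, value >= MIN_CODE_POINT[length]

print(decode_one(b"/"))             # one byte, minimal
print(decode_one(b"\xc0\xaf"))      # same code point padded to two bytes
print(decode_one(b"\xe0\x80\xaf"))  # same code point padded to three bytes
```

All three inputs yield code point 0x2F ("/"), but only the first is
flagged as minimal; a strict decoder would reject the other two.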

Marc-Andre Lemburg
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/