[I18n-sig] UTF-8 decoder in CVS still buggy

M.-A. Lemburg mal@lemburg.com
Sat, 02 Sep 2000 19:05:08 +0200

Fredrik Lundh wrote:
> François Pinard wrote:
> > Hi, people.  I just recently subscribed to i18n-sig, and started to
> > read the archives.  Let me hope you will tolerate that I jump in some
> > conversations without having matured all the background.
> >
> > On the above topic, I did not check what Python exactly does, but I wanted to
> > share that my `recode' program is not perfect in that area.  In particular,
> > there is a requirement for UTF-8 to be valid that the sequence be minimal,
> > which `recode' currently does not check on input.  Roughly said, a UTF-8
> > sequence is not valid if it could have been expressed in fewer bytes.
>
> for security reasons, the UTF-8 codec gives you an "illegal encoding"
> error in this case.
>
> mal wrote:
> > Could you give some examples ? I'm not sure I understand what you
> > mean by "could have been expressed with fewer bytes" -- perhaps
> > a multi-byte encoding where the top-most bytes are 0 ?
>
> quoting RFC 2279:
>
>     Implementors of UTF-8 need to consider the security aspects of how
>     they handle illegal UTF-8 sequences.  It is conceivable that in some
>     circumstances an attacker would be able to exploit an incautious
>     UTF-8 parser by sending it an octet sequence that is not permitted by
>     the UTF-8 syntax.
>
>     A particularly subtle form of this attack could be carried out
>     against a parser which performs security-critical validity checks
>     against the UTF-8 encoded form of its input, but interprets certain
>     illegal octet sequences as characters.  For example, a parser might
>     prohibit the NUL character when encoded as the single-octet sequence
>     00, but allow the illegal two-octet sequence C0 80 and interpret it
>     as a NUL character.  Another example might be a parser which
>     prohibits the octet sequence 2F 2E 2E 2F ("/../"), yet permits the
>     illegal octet sequence 2F C0 AE 2E 2F.
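
To make the "fewer bytes" rule concrete, here is a rough sketch of the minimality check in present-day Python 3. It is not the codec's actual code: the names min_utf8_length and is_overlong are made up for illustration, and it only handles sequences of up to four bytes (RFC 2279 itself allows up to six).

```python
def min_utf8_length(code_point):
    """Fewest bytes UTF-8 needs to encode this code point."""
    if code_point < 0x80:
        return 1
    if code_point < 0x800:
        return 2
    if code_point < 0x10000:
        return 3
    return 4


def is_overlong(seq):
    """True if one UTF-8 sequence encodes a value that fits in fewer bytes.

    Hypothetical helper: seq must be a bytes object holding exactly one
    lead byte followed by its continuation bytes.
    """
    lead = seq[0]
    # The lead byte gives the sequence length and the top payload bits.
    if lead < 0x80:
        value, expected = lead, 1
    elif lead >> 5 == 0b110:
        value, expected = lead & 0x1F, 2
    elif lead >> 4 == 0b1110:
        value, expected = lead & 0x0F, 3
    elif lead >> 3 == 0b11110:
        value, expected = lead & 0x07, 4
    else:
        raise ValueError("not a UTF-8 lead byte")
    if len(seq) != expected:
        raise ValueError("wrong number of bytes for this lead byte")
    # Each continuation byte (10xxxxxx) contributes six more payload bits.
    for byte in seq[1:]:
        if byte >> 6 != 0b10:
            raise ValueError("not a continuation byte")
        value = (value << 6) | (byte & 0x3F)
    return min_utf8_length(value) < expected
```

With this, both of the RFC's examples are flagged: C0 80 carries the value 0x00 and C0 AE carries 0x2E ("."), and each of those fits in a single byte.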


>>> unicode('\xC0\x80','utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeError: UTF-8 decoding error: illegal encoding
>>> unicode('\x2F\x2E\x2E\x2F','utf-8')
u'/../'
>>> unicode('\x2F\xC0\xAE\x2E\x2F','utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeError: UTF-8 decoding error: illegal encoding
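
For readers on a current interpreter, the same three checks can be written roughly as follows in Python 3, where unicode() is gone and bytes.decode() does the job. The helper name try_decode is only an illustration, not part of any library.

```python
def try_decode(raw):
    """Decode raw bytes as strict UTF-8; return the text, or None if rejected."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return None


print(try_decode(b"\xC0\x80"))              # overlong NUL: prints None
print(try_decode(b"\x2F\x2E\x2E\x2F"))      # plain ASCII: prints /../
print(try_decode(b"\x2F\xC0\xAE\x2E\x2F"))  # overlong ".": prints None
```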

... so what's buggy about the codec ?

Marc-Andre Lemburg
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/