[I18n-sig] UTF-8 decoder in CVS still buggy

Fredrik Lundh <effbot@telia.com>
Sat, 2 Sep 2000 18:30:56 +0200

François Pinard wrote:
> Hi, people.  I just recently subscribed to i18n-sig, and started to
> read the archives.  Let me hope you will tolerate that I jump in some
> conversations without having matured all the background.
> On the above topic, I did not check what Python exactly does, but I wanted to
> share that my `recode' program is not perfect in that area.  In particular,
> there is a requirement for UTF-8 to be valid that the sequence be minimal,
> which `recode' currently does not check on input.  Roughly said, an UTF-8
> sequence is not valid if it could have been expressed in fewer bytes.

for security reasons, Python's UTF-8 codec gives you an "illegal encoding"
error for non-minimal sequences like this.
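
a quick way to see this behaviour is to feed the decoder a few overlong
sequences. this is only a sketch, assuming a current Python 3 interpreter,
where the error surfaces as UnicodeDecodeError rather than the "illegal
encoding" message mentioned above; the helper name is_rejected and the
sample table are made up for illustration.

```python
# Overlong (non-minimal) encodings of characters that fit in a single byte.
overlong_samples = {
    "NUL as C0 80": b"\xc0\x80",
    "'.' as C0 AE": b"\xc0\xae",
    "'/' as E0 80 AF": b"\xe0\x80\xaf",
}


def is_rejected(data):
    """Return True if the strict UTF-8 decoder refuses the byte string."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False


# Map each sample label to whether the decoder rejected it.
results = {label: is_rejected(data) for label, data in overlong_samples.items()}

for label, rejected in results.items():
    print(label, "->", "rejected" if rejected else "accepted")
```

the minimal forms of the same characters (b"\x00", b".", b"/") decode
without complaint.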

mal wrote:
> Could you give some examples ? I'm not sure I understand what you
> mean by "could have been expressed with fewer bytes" -- perhaps
> a multi-byte encoding where the top-most bytes are 0 ?

quoting RFC 2279:

    Implementors of UTF-8 need to consider the security aspects of how
    they handle illegal UTF-8 sequences.  It is conceivable that in some
    circumstances an attacker would be able to exploit an incautious
    UTF-8 parser by sending it an octet sequence that is not permitted by
    the UTF-8 syntax.

    A particularly subtle form of this attack could be carried out
    against a parser which performs security-critical validity checks
    against the UTF-8 encoded form of its input, but interprets certain
    illegal octet sequences as characters.  For example, a parser might
    prohibit the NUL character when encoded as the single-octet sequence
    00, but allow the illegal two-octet sequence C0 80 and interpret it
    as a NUL character.  Another example might be a parser which
    prohibits the octet sequence 2F 2E 2E 2F ("/../"), yet permits the
    illegal octet sequence 2F C0 AE 2E 2F.
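
the second RFC example can be reproduced in a few lines. the sketch below
models the "incautious parser" with a hypothetical naive_decode function
that handles only one- and two-octet sequences and skips the minimality
check; passes_byte_filter is an equally hypothetical stand-in for a check
done on the encoded bytes. neither is taken from any real codec.

```python
def naive_decode(data):
    """Decode 1- and 2-octet UTF-8 sequences WITHOUT the minimality check.

    Hypothetical helper modelling an incautious parser, not a real codec.
    """
    chars = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            # single octet: 0xxxxxxx
            chars.append(chr(b))
            i += 1
        elif (0xC0 <= b <= 0xDF and i + 1 < len(data)
                and (data[i + 1] & 0xC0) == 0x80):
            # two octets: 110xxxxx 10xxxxxx
            # a careful decoder would also reject results below 0x80,
            # which means rejecting the lead bytes C0 and C1
            chars.append(chr(((b & 0x1F) << 6) | (data[i + 1] & 0x3F)))
            i += 2
        else:
            raise ValueError("unsupported or malformed sequence")
    return "".join(chars)


def passes_byte_filter(data):
    """Security check applied to the encoded form, as in the RFC example."""
    return b"/../" not in data


payload = bytes([0x2F, 0xC0, 0xAE, 0x2E, 0x2F])  # 2F C0 AE 2E 2F

filter_ok = passes_byte_filter(payload)  # the byte-level check sees no "/../"
decoded = naive_decode(payload)          # but the naive decoder produces it

try:
    payload.decode("utf-8")              # the strict decoder refuses the input
    strict_rejects = False
except UnicodeDecodeError:
    strict_rejects = True
```

the payload slips past the byte-level filter, yet the naive decoder turns
C0 AE into "." and hands "/../" to whatever runs next. a decoder that
enforces minimal encodings stops the input before that happens.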