[I18n-sig] Re: Unicode 3.1 and contradictions.

Markus Kuhn Markus.Kuhn@cl.cam.ac.uk
Thu, 28 Jun 2001 12:48:40 +0100

Guido van Rossum wrote on 2001-06-28 11:25 UTC:
> > The UTF-8 representations of U+D800..U+DFFF, U+FFFE, and U+FFFF are not
> > allowed in a UTF-8 stream and a secure UTF-8 decoder must never output
> > any of these characters.
> Can you explain a bit more about the security issues?

There are two ways of processing UTF-8 encoded UCS text:

  a) as a UTF-8 bytestream
  b) as a stream of decoded integer code values (32-bit wchar_t, etc.)

Problems arise if security-relevant checks are done in one
representation and interpretation of the data is done in the other.

Imagine you have an application with the following processing steps:

  - read a UTF-8 string
  - apply a substring test to convince yourself that certain characters
    are not present in the string
  - decode UTF-8
  - use the decoded string in an application where presence of the
    tested characters could be security critical

The classical example is a Win32 web server, where a UTF-8 URL is fed
in, tested by a script in UTF-8 to be free of the byte sequence '/../',
and then UTF-8 decoded and fed into a UTF-16 API for file system access.
Even though the presence of '/../' encoded in ASCII was filtered out,
the same character sequence can still be smuggled past the filter by a
clever attacker using alternative encodings that an unsafe UTF-8 decoder
might accept, for instance an overlong sequence for any of the
characters in '/../'.
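
The attack can be illustrated with a minimal Python sketch. Both
lax_decode and passes_filter are hypothetical names invented for this
illustration; lax_decode stands in for an unsafe decoder that trusts
lead bytes and never checks for overlong forms. It is not the decoder
of any real product.

```python
def lax_decode(data: bytes) -> str:
    """Hypothetical UNSAFE decoder: no overlong or range checks at all."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            cp, n = b, 1
        elif b < 0xE0:
            cp, n = b & 0x1F, 2
        elif b < 0xF0:
            cp, n = b & 0x0F, 3
        else:
            cp, n = b & 0x07, 4
        # Blindly fold in the low six bits of each following byte.
        for c in data[i + 1:i + n]:
            cp = (cp << 6) | (c & 0x3F)
        out.append(chr(cp))
        i += n
    return "".join(out)


def passes_filter(raw: bytes) -> bool:
    """Byte-level security check: reject anything containing '/../'."""
    return b"/../" not in raw


# 0xC0 0xAE is an overlong two-byte form of '.' (U+002E).
attack = b"/cgi-bin/\xc0\xae\xc0\xae/secret"

filter_ok = passes_filter(attack)   # the filter sees no '/../'
decoded = lax_decode(attack)        # but the lax decoder produces one
```

Here filter_ok is True while decoded contains '/../'. A strict decoder,
such as Python's own bytes.decode("utf-8"), refuses the 0xC0 byte
instead of producing a '.'.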

This problem is most severe with non-ASCII representations of ASCII
characters by overlong UTF-8 sequences, because ASCII characters often
have many special functions associated with them, but it also occurs
with other tests. For example, it should be perfectly legitimate to test
a UTF-8 string to be free of non-BMP characters by simply testing that
no byte >= 0xF0 (the lead byte of a four-byte sequence) is present,
without the far less efficient use of a UTF-8 decoder.
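
As a rough sketch of how such a byte-level test can be defeated, the
Python below takes 0xF0, the smallest lead byte of a four-byte
sequence, as the threshold. The name looks_bmp_only is hypothetical,
and Python's "surrogatepass" error handler is used here only to imitate
a lax decoder that lets UTF-8 encoded surrogates through.

```python
def looks_bmp_only(raw: bytes) -> bool:
    """Byte-level test: no lead byte of a four-byte sequence present."""
    return all(b < 0xF0 for b in raw)


# U+D800 and U+DC00, each encoded as a three-byte UTF-8 sequence.
smuggled = b"\xed\xa0\x80\xed\xb0\x80"

check_ok = looks_bmp_only(smuggled)

# Imitate a lax decoder feeding a UTF-16 API.
lax = smuggled.decode("utf-8", "surrogatepass")
utf16 = lax.encode("utf-16-le", "surrogatepass")

# The UTF-16 consumer pairs the two surrogates into one character.
result = utf16.decode("utf-16-le")
```

The byte test passes, yet the UTF-16 side ends up holding U+10000, a
non-BMP character. With the default strict error handling,
smuggled.decode("utf-8") raises UnicodeDecodeError instead.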

Another risk is someone smuggling a UTF-8 encoded U+FFFE or U+FFFF into
a system, which, when decoded into UTF-16, might be interpreted as an
instruction to swap the byte sex (U+FFFE, the anti-BOM) or as some
generic escape-or-end-of-string/file character (U+FFFF).
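
A small Python sketch of the U+FFFE case follows. The helper
reject_noncharacters is a hypothetical name, and the scenario (a
consumer that sniffs a byte order mark from big-endian UTF-16 output)
is an assumption made for illustration. Note that Python's own UTF-8
codec does decode these two code points, so a separate check is needed
if they are to be refused.

```python
NONCHARACTERS = {0xFFFE, 0xFFFF}


def reject_noncharacters(text: str) -> str:
    """Raise ValueError if the decoded text contains U+FFFE or U+FFFF."""
    for offset, ch in enumerate(text):
        if ord(ch) in NONCHARACTERS:
            raise ValueError("U+%04X at offset %d" % (ord(ch), offset))
    return text


# UTF-8 encoding of U+FFFE followed by 'A'.
smuggled = b"\xef\xbf\xbeA"
decoded = smuggled.decode("utf-8")

# Written out as big-endian UTF-16, the stream now starts with FF FE,
# which a BOM-sniffing reader takes to mean little-endian.
as_utf16 = decoded.encode("utf-16-be")
misread = as_utf16.decode("utf-16")
```

The reader swaps the byte order and sees U+4100 where 'A' was written.
Calling reject_noncharacters(decoded) right after decoding would have
stopped the string before it reached the UTF-16 stage.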

The golden rule that there must be exactly one single UTF-8 byte
sequence that can result in the output of a certain Unicode character
and that Unicode code positions reserved for special non-character use
such as U+D800..U+DFFF, U+FFFE, and U+FFFF should never be generated by
a UTF-8 decoder eliminates all these potential pitfalls.
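
One way to put this rule into practice is sketched below in Python. The
function strict_decode and the table MIN_FOR_LENGTH are hypothetical
names for this illustration, and the upper limit of U+10FFFF is an
added assumption that follows current practice; it is not stated in the
text above.

```python
# Smallest code point that genuinely needs a sequence of each length.
MIN_FOR_LENGTH = {1: 0x00, 2: 0x80, 3: 0x800, 4: 0x10000}


def strict_decode(data: bytes) -> str:
    """Decode UTF-8, rejecting overlong forms and reserved code points."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            cp, n = b, 1
        elif 0xC0 <= b < 0xE0:
            cp, n = b & 0x1F, 2
        elif 0xE0 <= b < 0xF0:
            cp, n = b & 0x0F, 3
        elif 0xF0 <= b < 0xF8:
            cp, n = b & 0x07, 4
        else:
            raise ValueError("invalid lead byte 0x%02X at offset %d" % (b, i))
        tail = data[i + 1:i + n]
        if len(tail) != n - 1:
            raise ValueError("truncated sequence at offset %d" % i)
        for c in tail:
            if c & 0xC0 != 0x80:
                raise ValueError("bad continuation byte at offset %d" % i)
            cp = (cp << 6) | (c & 0x3F)
        # Exactly one byte sequence per character: refuse overlong forms.
        if cp < MIN_FOR_LENGTH[n]:
            raise ValueError("overlong sequence at offset %d" % i)
        # Never output surrogates, U+FFFE, or U+FFFF.
        if 0xD800 <= cp <= 0xDFFF or cp in (0xFFFE, 0xFFFF):
            raise ValueError("reserved code point U+%04X" % cp)
        if cp > 0x10FFFF:  # assumption: modern upper bound
            raise ValueError("code point out of range")
        out.append(chr(cp))
        i += n
    return "".join(out)
```

With a decoder of this shape, a check made on the byte stream and a
check made on the decoded characters cannot disagree about which
characters are present, which is the property the rule is meant to
guarantee.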



Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org,  WWW: <http://www.cl.cam.ac.uk/~mgk25/>