[I18n-sig] Re: Unicode 3.1 and contradictions.

Markus Kuhn Markus.Kuhn@cl.cam.ac.uk
Thu, 28 Jun 2001 16:47:59 +0100

Guido van Rossum wrote on 2001-06-28 14:51 UTC:
> > Imagine, you have an application with the following processing steps:
> > 
> >   - read a UTF-8 string
> >   - apply a substring test to convince yourself that certain characters
> >     are not present in the string
> >   - decode UTF-8
> >   - use the decoded string in an application where presence of the
> >     tested characters could be security critical
> I'd say that the security implementation of such an application is
> broken -- the check should have been done on the final data.  It
> seems you are trying to patch up a legacy system the wrong way.  Or am
> I missing something?  How can this be a common pattern?

We should not expect that any and all UTF-8 data has to be decoded
before it can be processed. UTF-8 has been very carefully designed to
allow much text processing (substring searching without case mapping,
etc.) to be done on UTF-8 data directly. Only a few operations (display,
case mapping, proper sorting) actually require a UTF-8 decoder. The name
"UCS Transfer Format" is in practice misleading, because processing
UTF-8 as opposed to just transferring it is often the right thing to do,
unless a buggy UTF-8 decoder would make that risky.
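
To make the risk concrete, here is a minimal sketch in present-day
Python 3 (an assumption; the function names are hypothetical and not
part of any codec API). It runs a substring test on the raw bytes and
then shows why a decoder that accepted overlong sequences would defeat
that test: the two bytes C0 AF are an overlong form of "/" that the
byte-level search cannot see, so the decoder has to reject them.

```python
def contains_substring(utf8_data: bytes, needle: str) -> bool:
    """Substring test performed directly on UTF-8 bytes, no decoding."""
    return needle.encode("utf-8") in utf8_data


def strict_decoder_accepts(utf8_data: bytes) -> bool:
    """Report whether Python's strict UTF-8 decoder accepts the data."""
    try:
        utf8_data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


clean = "caf\u00e9/menu".encode("utf-8")
sneaky = b"a\xc0\xafb"  # C0 AF: overlong two-byte form of U+002F "/"

print(contains_substring(clean, "/"))   # the real slash is found
print(contains_substring(sneaky, "/"))  # the overlong slash is not
print(strict_decoder_accepts(sneaky))   # a strict decoder refuses it
```

The byte-level test is only trustworthy if whatever decodes the data
later refuses everything the test could not have seen.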

> But we were talking about isolated surrogates.  How can passing
> through *isolated* surrogates cause a security violation?  It's not an
> overlong sequence!  (Assuming the decoder does the right thing for
> surrogate *pairs*.)

OK, that is far less of a security concern. However, an isolated
surrogate is usually a symptom of something else being wrong (e.g.,
UTF-16 strings being split at the wrong place, then converted to UTF-8,
then joined again), and if not spotted will lead to incorrect UTF-8
sequences in the end. Signalling an exception might often be better than
passing everything through quietly.
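
The split-and-rejoin failure can be sketched as follows, again assuming
present-day Python 3 and hypothetical helper names. The
"surrogatepass" error handler stands in for a converter that passes
lone surrogates through quietly; the default strict codec is the
variant that signals an exception.

```python
HIGH = "\ud800"  # first UTF-16 code unit of U+10000
LOW = "\udc00"   # second UTF-16 code unit of U+10000


def lenient_utf8(fragment: str) -> bytes:
    """Convert a fragment to UTF-8 without rejecting lone surrogates."""
    return fragment.encode("utf-8", "surrogatepass")


def strict_decoder_accepts(utf8_data: bytes) -> bool:
    """Report whether Python's strict UTF-8 decoder accepts the data."""
    try:
        utf8_data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Each half is converted on its own, then the results are joined.
joined = lenient_utf8(HIGH) + lenient_utf8(LOW)  # ED A0 80 ED B0 80
correct = "\U00010000".encode("utf-8")           # F0 90 80 80

print(joined == correct)               # the joined bytes are not U+10000
print(strict_decoder_accepts(joined))  # and a strict decoder rejects them
```

With the strict default, HIGH.encode("utf-8") raises UnicodeEncodeError
at the point of the split, which is where the bug is easiest to find.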

> > This problem is most severe with non-ASCII representations of ASCII
> > characters by overlong UTF-8 sequences, because ASCII characters often
> > have many special functions associated with them, but it also occurs with
> > other tests. For example, it should be perfectly legitimate to test a
> > UTF-8 string to be free of non-BMP characters by simply testing that no
> > byte >= 0xF0 is present, without the far less efficient use of a UTF-8
> > decoder.
> Why is testing for non-BMP characters part of a security screening?

If a database has a policy of not allowing non-BMP characters in a
field, then that policy can be violated. How bad that is depends on the
application. It was really just an example, not a specific risk.
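
A byte-level test of this kind might look like the sketch below
(Python 3 assumed, function name hypothetical). Note that lead bytes
E0..EF introduce three-byte sequences, which still encode BMP
characters; any sequence for a character beyond U+FFFF starts with a
byte of F0 or higher, so that is the threshold the sketch uses. As with
the other byte-level tests, the result is only meaningful if a later
decoder rejects overlong forms.

```python
def has_non_bmp(utf8_data: bytes) -> bool:
    """True if the UTF-8 data contains a lead byte for a character
    above U+FFFF, found by scanning bytes rather than decoding."""
    return any(byte >= 0xF0 for byte in utf8_data)


bmp_only = "price: 5 \u20ac".encode("utf-8")    # U+20AC is E2 82 AC
with_astral = "note \U0001d11e".encode("utf-8")  # U+1D11E is F0 9D 84 9E

print(has_non_bmp(bmp_only))
print(has_non_bmp(with_astral))
```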

> Sorry, you haven't convinced me that these tests should be applied by
> Python's standard UTF-8 codec.  Also, your use of "such as" suggests
> that the collection of dangerous code points is open-ended, but I find
> that hard to believe (since legacy codecs won't be updated).

My list of unwanted UTF-8 code points was just the one found in a note
in the UTF-8 definition in ISO 10646-1:1993 (R.4):

  NOTE 3 - Values of x in the range 0000 D800 .. 0000 DFFF are reserved
  for the UTF-16 form and do not occur in UCS-4.  The values 0000 FFFE and
  0000 FFFF also do not occur (see clause 8).  The mappings of these code
  positions in UTF-8 are undefined.
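
Read as a predicate on code points, the note amounts to something like
this sketch (Python 3 assumed; the function name is hypothetical and
this is not how any particular codec is implemented):

```python
def excluded_by_note_3(code_point: int) -> bool:
    """True for the values NOTE 3 lists: the surrogate range
    D800..DFFF and the two values FFFE and FFFF."""
    return 0xD800 <= code_point <= 0xDFFF or code_point in (0xFFFE, 0xFFFF)


for cp in (0x0041, 0xD7FF, 0xD800, 0xDFFF, 0xE000, 0xFFFD, 0xFFFE, 0xFFFF):
    print(f"{cp:04X}", excluded_by_note_3(cp))
```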


Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org,  WWW: <http://www.cl.cam.ac.uk/~mgk25/>