[I18n-sig] Unicode surrogates: just say no!

Tue, 26 Jun 2001 13:38:48 -0700

> Somebody please correct me: A conforming implementation must never
> encode a non-BMP character with six bytes in UTF-8; security people
> will shoot you if you say that two alternative representations for the
> same string are possible.
>...
> HOWEVER, I think what the spec says that implementation shall accept
> to receive non-BMP characters encoded in six bytes UTF-8. This is

The spec was recently changed to eliminate the ambiguity, precisely
because of security concerns.  You are never allowed to produce "non
shortest form".  The correct, conforming way to encode surrogate pairs in
UTF-8 is to convert the pair to UTF-32, and then convert the UTF-32 value
to UTF-8.
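
As a rough illustration of that two-step conversion, here is a minimal
Python sketch. The function names are made up for this example and are not
taken from any reference code; it assumes the input is a well-formed
high/low surrogate pair and rejects anything else.

```python
def surrogate_pair_to_scalar(high: int, low: int) -> int:
    """Combine a UTF-16 surrogate pair into one UTF-32 value (hypothetical helper)."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def scalar_to_utf8(cp: int) -> bytes:
    """Encode one UTF-32 value as shortest-form UTF-8, 1 to 4 bytes (hypothetical helper)."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


def surrogate_pair_to_utf8(high: int, low: int) -> bytes:
    """Pair -> UTF-32 -> UTF-8, never encoding each surrogate on its own."""
    return scalar_to_utf8(surrogate_pair_to_scalar(high, low))


# U+1D11E is D834 DD1E in UTF-16 and comes out as four bytes in UTF-8.
print(surrogate_pair_to_utf8(0xD834, 0xDD1E).hex())  # f09d849e
```

Encoding each surrogate separately would give two three-byte sequences (six
bytes in total), which is exactly the form a conforming encoder must not
produce.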

See:
	http://www.unicode.org/unicode/reports/tr27/

which is the definition of Unicode 3.1.  It says in the intro:

    Most notable among the corrigenda to the standard is a tightening of the
    definition of UTF-8, to eliminate a possible security
    issue with non-shortest-form UTF-8.

Later, there is a section "UTF-8 Corrigendum", which starts with the text
shown below.  Under the tightened definition, encoding always results in a
UTF-8 sequence of at most 4 bytes for every valid Unicode character
0..10FFFF.
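
A quick way to sanity-check the four-byte upper bound is to look at the
encoded length at the edges of each range. The sketch below leans on
Python's built-in UTF-8 encoder as a stand-in for a conforming
implementation; the variable names are only illustrative.

```python
# Code points at the boundaries where the UTF-8 length changes,
# plus the last valid code point, U+10FFFF.
boundaries = [0x0000, 0x007F, 0x0080, 0x07FF,
              0x0800, 0xFFFF, 0x10000, 0x10FFFF]

# Map each boundary code point to the length of its UTF-8 encoding.
lengths = {cp: len(chr(cp).encode("utf-8")) for cp in boundaries}

for cp, n in lengths.items():
    print(f"U+{cp:04X}: {n} byte(s)")

longest = max(lengths.values())
```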

(BTW, I have also been working on updated reference code for the
various UTF transformations, but have not yet posted it due to the
controversy surrounding the so-called UTF-8S proposal.)

	Rick

------------------------------------------------------

UTF-8 Corrigendum

The current conformance clause C12 in The Unicode Standard, Version 3.0  
forbids the generation of "non-shortest form" UTF-8, and forbids the  
interpretation of illegal sequences, but not the interpretation of  
"non-shortest form". Where software does interpret the non-shortest forms,  
security issues can arise. For example:

     Process A performs security checks, but does not check for
        non-shortest forms.
     Process B accepts the byte sequence from process A, and transforms it
        into UTF-16 while interpreting non-shortest forms.
     The UTF-16 text may then contain characters that should have been
        filtered out by process A.

To address this issue, the Unicode Technical Committee has modified the  
definition of UTF-8 to forbid conformant implementations from interpreting  
non-shortest forms for BMP characters, and clarified some of the  
conformance clauses.
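
To make the Process A / Process B scenario concrete, here is a small Python
sketch. The lenient decoder is a hypothetical stand-in for a
pre-corrigendum implementation (it handles only one- and two-byte
sequences), and the payload and names are invented for the example. The
strict check at the end uses Python's built-in decoder, which rejects
non-shortest forms.

```python
def lenient_decode(data: bytes) -> str:
    """Hypothetical lenient decoder: 1- and 2-byte sequences only,
    and it does NOT reject non-shortest forms."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            out.append(chr(b))
            i += 1
        elif 0xC0 <= b <= 0xDF and i + 1 < len(data) and 0x80 <= data[i + 1] <= 0xBF:
            # No check that the decoded value is >= 0x80, so overlong forms pass.
            out.append(chr(((b & 0x1F) << 6) | (data[i + 1] & 0x3F)))
            i += 2
        else:
            raise ValueError("sequence not handled by this sketch")
    return "".join(out)


# 0xC0 0xAF is a two-byte non-shortest form of "/" (U+002F).
payload = b"..\xc0\xaf.."

# Process A: a byte-level filter that looks only for the one-byte "/".
passes_filter = b"/" not in payload

# Process B: interprets the non-shortest form, so the "/" reappears.
leaked = lenient_decode(payload)

# A decoder that follows the tightened definition refuses the input.
try:
    payload.decode("utf-8")
    strict_rejects = False
except UnicodeDecodeError:
    strict_rejects = True

print(passes_filter, repr(leaked), strict_rejects)
```

The filter passes the payload because it never sees the byte 0x2F, yet the
lenient decoder hands "../.." to whatever runs next; the strict decoder
raises an error instead.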