[I18n-sig] Unicode surrogates: just say no!
Rick McGowan
rick@unicode.org
Tue, 26 Jun 2001 13:38:48 -0700
> Somebody please correct me: A conforming implementation must never
> encode a non-BMP character with six bytes in UTF-8; security people
> will shoot you if you say that two alternative representations for the
> same string are possible.
>...
> HOWEVER, I think what the spec says that implementation shall accept
> to receive non-BMP characters encoded in six bytes UTF-8. This is
The spec was recently changed to eliminate the ambiguity, precisely
because of security concerns. You are never allowed to produce
"non-shortest form" UTF-8. The correct, conforming way to encode a
surrogate pair in UTF-8 is to convert the pair to a single UTF-32 value,
and then convert that UTF-32 value to UTF-8.
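
As a rough illustration of that two-step conversion, here is a minimal
Python sketch. The function names are made up for this example and are
not taken from the reference code or from any library; it assumes the
input is a well-formed UTF-16 high/low surrogate pair.

```python
def surrogate_pair_to_code_point(high: int, low: int) -> int:
    """Combine a UTF-16 surrogate pair into one UTF-32 value (hypothetical helper)."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def code_point_to_utf8(cp: int) -> bytes:
    """Encode one code point as shortest-form UTF-8 (hypothetical helper)."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable Unicode character")
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # Non-BMP characters take exactly four bytes, never six.
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


# Example: the pair D834 DD1E is U+1D11E.
cp = surrogate_pair_to_code_point(0xD834, 0xDD1E)
utf8 = code_point_to_utf8(cp)
```

Encoding each surrogate separately as a three-byte sequence is what
produces the six-byte form; going through the single UTF-32 value avoids
it.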
See:
http://www.unicode.org/unicode/reports/tr27/
which is the definition of Unicode 3.1. It says in the intro:
Most notable among the corrigenda to the standard is a tightening of the
definition of UTF-8, to eliminate a possible security
issue with non-shortest-form UTF-8.
Later, there is a section "UTF-8 Corrigendum", which starts with the text
shown below. Converting through UTF-32 always results in a UTF-8 sequence
of at most 4 bytes, for all valid Unicode characters 0..10FFFF.
(BTW, I have also been working on updated reference code for the
various UTF transformations, but have not yet posted it due to the
controversy surrounding the so-called UTF-8S proposal.)
Rick
------------------------------------------------------
UTF-8 Corrigendum
The current conformance clause C12 in The Unicode Standard, Version 3.0
forbids the generation of "non-shortest form" UTF-8, and forbids the
interpretation of illegal sequences, but not the interpretation of
"non-shortest form". Where software does interpret the non-shortest forms,
security issues can arise. For example:
Process A performs security checks, but does not check for
non-shortest forms.
Process B accepts the byte sequence from process A, and transforms it into
UTF-16 while interpreting non-shortest forms.
The UTF-16 text may then contain characters that should have been filtered
out by process A.
To address this issue, the Unicode Technical Committee has modified the
definition of UTF-8 to forbid conformant implementations from interpreting
non-shortest forms for BMP characters, and clarified some of the
conformance clauses.
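
To make the process A / process B scenario concrete, here is a minimal
Python sketch of a single-sequence decoder. It is only an illustration
under stated assumptions: the function name, the allow_overlong flag, and
the MIN_FOR_LENGTH table are invented for this example and are not part
of the corrigendum or of any reference code. With allow_overlong=True it
behaves like the lenient process B; with the default it rejects
non-shortest forms.

```python
# Smallest code point that legitimately needs a sequence of each length
# (illustrative table, not from the corrigendum text).
MIN_FOR_LENGTH = {1: 0x00, 2: 0x80, 3: 0x800, 4: 0x10000}


def decode_sequence(seq: bytes, allow_overlong: bool = False) -> int:
    """Decode exactly one UTF-8 sequence to a code point (hypothetical helper)."""
    if not seq:
        raise ValueError("empty sequence")
    lead = seq[0]
    if lead < 0x80:
        length, cp = 1, lead
    elif 0xC0 <= lead <= 0xDF:
        length, cp = 2, lead & 0x1F
    elif 0xE0 <= lead <= 0xEF:
        length, cp = 3, lead & 0x0F
    elif 0xF0 <= lead <= 0xF7:
        length, cp = 4, lead & 0x07
    else:
        raise ValueError("invalid lead byte")
    if len(seq) != length:
        raise ValueError("wrong sequence length")
    for b in seq[1:]:
        if (b & 0xC0) != 0x80:
            raise ValueError("invalid continuation byte")
        cp = (cp << 6) | (b & 0x3F)
    if cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a valid Unicode character")
    # The shortest-form check: the value must actually need this many bytes.
    if not allow_overlong and cp < MIN_FOR_LENGTH[length]:
        raise ValueError("non-shortest form")
    return cp


# C0 AF is a two-byte, non-shortest encoding of "/" (U+002F).
lenient = decode_sequence(b"\xc0\xaf", allow_overlong=True)
try:
    decode_sequence(b"\xc0\xaf")
    strict_rejected = False
except ValueError:
    strict_rejected = True
```

A filter in process A that looks only for the byte 0x2F would miss
C0 AF, yet the lenient decoder turns it into "/" afterwards; the strict
decoder refuses the sequence instead.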