[I18n-sig] How does Python Unicode treat surrogates?

M.-A. Lemburg mal@lemburg.com
Sun, 24 Jun 2001 13:28:06 +0200


First of all, I'd like to say that we left the handling of surrogates
undefined back when we initially discussed the internal format 
for storing Unicode. The reasoning was simple: there were no
assigned code points outside the BMP (roughly the lower 16-bit range).

It was decided to use 16 bits per character as the basis for dealing with
Unicode in such a way that we get the disjunction of UTF-16 and
UCS-2 (Unicode 2.x). This allowed us to postpone the handling of
variable length problems to a later point in time.

Now with Unicode 3.1, the time has come to rethink these things,
since for the first time, there are assigned code points outside
the BMP which could eventually be used by programmers.

This means that we have to start thinking about how to treat
UTF-16 surrogates (two Py_UNICODE elements per Unicode character).
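
For reference, here is a minimal sketch of the UTF-16 surrogate arithmetic
itself. The function names to_surrogate_pair and from_surrogate_pair are
made up for illustration and are not part of Python's Unicode API; the
ranges and offsets are those of the UTF-16 encoding form.

```python
def to_surrogate_pair(code_point):
    """Split a code point in U+10000..U+10FFFF into two 16-bit words."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a code point outside the BMP")
    offset = code_point - 0x10000        # 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits -> low surrogate
    return high, low


def from_surrogate_pair(high, low):
    """Combine a high and a low surrogate word into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
```

With these helpers, to_surrogate_pair(0x10000) gives the two words
0xd800 and 0xdc00, and from_surrogate_pair reverses the mapping.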

The basic questions are:

1. How to treat lone surrogates (the Unicode char U+10000 is
   represented as the two words 0xd800 0xdc00 in UTF-16) ?

2. What to do when slicing of Unicode strings would break
   a surrogate pair ?

3. How to treat input data which has lone surrogate words 
   in strings (at the start, in the middle and at the end) ?

4. How to process requests for creating output data from 
   lone surrogate words ?
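
To make questions 2 to 4 concrete, here is a small sketch that works on
explicit lists of 16-bit words. It runs on a present-day Python 3 and only
shows one possible answer (strict rejection by the codecs); it says nothing
about how the implementation discussed here should behave. The helper
names utf16_units, units_to_bytes and strict_decode_fails are hypothetical.

```python
import struct


def utf16_units(text):
    """Return the UTF-16 code units (16-bit words) of a string as a list."""
    data = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))


def units_to_bytes(units):
    """Pack a list of 16-bit words into little-endian UTF-16 bytes."""
    return struct.pack("<%dH" % len(units), *units)


def strict_decode_fails(units):
    """True if a strict UTF-16 decoder rejects the given words."""
    try:
        units_to_bytes(units).decode("utf-16-le")
    except UnicodeDecodeError:
        return True
    return False


# "a", U+10000, "b" occupies four words: 0x61, 0xd800, 0xdc00, 0x62.
units = utf16_units("a\U00010000b")

# Question 2: slicing at word index 2 cuts the surrogate pair in half.
head, tail = units[:2], units[2:]

# Question 3: each half now contains a lone surrogate word, which a
# strict decoder refuses as input.
head_rejected = strict_decode_fails(head)
tail_rejected = strict_decode_fails(tail)

# Question 4: a strict encoder likewise refuses to produce output
# from a lone surrogate.
try:
    "\ud800".encode("utf-8")
    lone_output_rejected = False
except UnicodeEncodeError:
    lone_output_rejected = True
```

Rejecting lone surrogates is only one policy. Passing them through
unchanged or replacing them with a substitute character are the obvious
alternatives, and the questions above are about which one to pick.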

BTW, Python's Unicode implementation is bound to the standard
defined at www.unicode.org; moving over to ISO 10646 is not an
option.

-- 
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
______________________________________________________________________
Company & Consulting:                           http://www.egenix.com/
Python Software:                        http://www.lemburg.com/python/