[I18n-sig] How does Python Unicode treat surrogates?

Fredrik Lundh fredrik@pythonware.com
Mon, 25 Jun 2001 21:54:37 +0200


guido wrote:

> > That's because len(u) has nothing to do with the number of
> > characters in the string; it only counts the code units (Py_UNICODEs)
> > which are used to represent characters. The same is true for normal
> > strings, e.g. UTF-8 can use 1-4 code units (bytes in this
> > case) for a single character, and in Unicode you can create characters
> > by combining code units.
> 
> Total agreement.

I disagree: in python's current string model, there's a difference
between *encoded* byte buffers and character strings.
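
to make that distinction concrete, a minimal interactive sketch
(hypothetical session; any python 2.x with the unicode type should
behave like this):

    >>> u = u"\N{LATIN SMALL LETTER E WITH ACUTE}"   # a character string
    >>> len(u)                                       # one character
    1
    >>> s = u.encode("utf-8")                        # an *encoded* byte buffer
    >>> len(s)                                       # two code units (bytes)
    2
    >>> unicode(s, "utf-8") == u                     # decoding gets the character back
    True

the character string's length doesn't change; only the encoded
buffer's length depends on the codec.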

> So let me rephrase the question.  How hard would it be to offer the
> config-time choice between UCS-4 and UTF-16?

> If it's hard, why?

the core string type (which I wrote) should support this pretty
much out of the box.

probably a bit more work to fix the codecs (I didn't write them, so I
cannot tell for sure), but I doubt it amounts to much.

SRE and the unicode databases (me again) should also work
pretty much out of the box.
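
for reference, here's roughly how the two builds would look from
python code (a sketch only; it assumes the build choice is exposed
through a sys.maxunicode constant, and that \U escapes exist for
characters outside the BMP):

    >>> import sys
    >>> sys.maxunicode        # 0xFFFF on a UTF-16 build, 0x10FFFF on UCS-4
    65535
    >>> len(u"\U00010000")    # 2 on a UTF-16 build (a surrogate pair), 1 on UCS-4
    2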

> If we make a clean distinction between characters and storage units,
> and if we stick to the rule that u[i] accesses a storage unit, what's
> the conceptual difficulty?

I'm sceptical -- I see very little reason to maintain that distinction.
let's use either UCS-2 or UCS-4 for the internal storage, stick to the
"character strings are character sequences" concept, and keep the
UTF-16 surrogate issue where it belongs: in the codecs.
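
in other words, something like this (a sketch of a UCS-4 build, where
the utf-16 codec, not the string type, splits and joins surrogate
pairs):

    >>> u"\U00010000".encode("utf-16-be")            # the codec emits the pair
    '\xd8\x00\xdc\x00'
    >>> len('\xd8\x00\xdc\x00'.decode("utf-16-be"))  # and joins it back: one character
    1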

Cheers /F