[I18n-sig] How does Python Unicode treat surrogates?

Paul Prescod paulp@ActiveState.com
Mon, 25 Jun 2001 17:43:05 -0700

"Martin v. Loewis" wrote:
> > I agree. But I'd add that if different people really need different
> > performance/simplicity trade-offs then maybe we need multiple variants
> > of the Unicode object.
> The question really is: Those people that require a 16-bit Py_UNICODE,
> would they ever need characters outside the BMP?

Hard to tell. People usually want to have their cake and eat it too.
That is, they want the performance of a 16-bit Py_UNICODE but also want
to support the occasional non-BMP character that happens to show up in a
document.
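
To make the trade-off concrete, here is a minimal sketch of what a
16-bit representation has to do with a non-BMP character: split it into
a UTF-16 surrogate pair. The function name to_surrogate_pair is made up
for illustration and is not part of Python; the arithmetic is the
standard UTF-16 encoding rule.

```python
def to_surrogate_pair(code_point):
    """Split a non-BMP code point into UTF-16 (high, low) surrogates.

    Hypothetical helper for illustration only; not a Python API.
    """
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary-plane code point")
    offset = code_point - 0x10000          # 20 significant bits remain
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


# U+1D11E MUSICAL SYMBOL G CLEF needs two 16-bit code units.
high, low = to_surrogate_pair(0x1D11E)
print(hex(high), hex(low))  # prints: 0xd834 0xdd1e
```

With a 16-bit Py_UNICODE, every such character occupies two array
slots, which is exactly the detail the rest of this thread is arguing
about.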

> My guess is no, so Fredrik's proposal sounds good to me.

I'm not clear on what Fredrik's proposal is. He says: "let's use either
UCS-2 or UCS-4 for the internal storage". Is he saying:

 1. let's choose one or the other today
 2. let's make it a compile-time switch
 3. let's make it a runtime option

I could live with 1. for a while longer... I haven't heard a real user
complaint about our current model. The longer we put it off, the more
acceptable UCS-4 becomes.

I wouldn't be thrilled with 2., because it makes Python code harder to
move between machines (depends on your build options!)
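
As a sketch of the portability concern, code that cares about the width
would have to ask the interpreter how it was built. This assumes
sys.maxunicode is the way to ask: it is 0xFFFF when the interpreter
stores 16-bit code units and 0x10FFFF otherwise. The function name
unicode_build_width is hypothetical.

```python
import sys


def unicode_build_width():
    """Report which internal width this interpreter exposes.

    Hypothetical helper; it only inspects sys.maxunicode.
    """
    if sys.maxunicode == 0xFFFF:
        return "narrow"   # 16-bit code units, non-BMP needs surrogates
    return "wide"         # every code point fits in one unit


print(unicode_build_width())
```

Any program whose behaviour branches on that answer is a program that
runs differently on two machines with the same Python version.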

3. would be okay if it is handled intelligently.

Any of these is better to me than exposing the details of UTF-16 to the
Python programmer in our Unicode type!
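
For illustration, this is the kind of UTF-16 detail I mean: the number
of 16-bit code units in a string is not the number of characters once a
non-BMP character appears. The sketch counts code units by encoding to
UTF-16 explicitly, so it does not depend on how the interpreter stores
strings; utf16_code_units is a made-up name.

```python
def utf16_code_units(text):
    """Count the 16-bit code units needed to store text as UTF-16.

    Hypothetical helper; encodes without a BOM and divides by two.
    """
    return len(text.encode("utf-16-le")) // 2


bmp_only = "abc"
with_clef = "a\U0001D11E"   # 'a' plus U+1D11E MUSICAL SYMBOL G CLEF

print(utf16_code_units(bmp_only))   # prints: 3
print(utf16_code_units(with_clef))  # prints: 3 (1 for 'a', 2 for the clef)
```

If len() and indexing worked in code units, the second string would
report three items for two characters, and slicing could cut a
character in half.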