[I18n-sig] How does Python Unicode treat surrogates?
Fredrik Lundh
fredrik@pythonware.com
Mon, 25 Jun 2001 21:54:37 +0200
guido wrote:
> > That's because len(u) has nothing to do with the number of
> > characters in the string, it only counts the code units (Py_UNICODEs)
> > which are used to represent characters. The same is true for normal
> > strings, e.g. UTF-8 can use between 1-4 code units (bytes in this
> > case) for a single character, and in Unicode you can create characters
> > by combining code points.
>
> Total agreement.
I disagree: in Python's current string model, there's a difference
between *encoded* byte buffers and character strings.
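[a sketch of that distinction in modern Python 3 terms, where the two
types are bytes and str; the values shown are illustrative, not from
the original thread:]

```python
# Encoded byte buffers vs. character strings:
s = "caf\u00e9"           # character string: 4 characters
b = s.encode("utf-8")     # encoded byte buffer: 5 bytes ('é' takes 2)

assert len(s) == 4        # counts characters (code points)
assert len(b) == 5        # counts bytes of the UTF-8 encoding

# Combining marks are a separate issue: even a character string's
# length counts code points, not user-perceived characters.
e_acute = "e\u0301"       # 'e' + COMBINING ACUTE ACCENT
assert len(e_acute) == 2
```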
> So let me rephrase the question. How hard would it be to offer the
> config-time choice between UCS-4 and UTF-16?
> If it's hard, why?
the core string type (which I wrote) should support this pretty
much out of the box.
fixing the codecs is probably more work (I didn't write them, so I
cannot tell for sure), but I doubt it's that much.
SRE and the unicode databases (me again) should also work
pretty much out of the box.
> If we make a clean distinction between characters and storage units,
> and if stick to the rule that u[i] accesses a storage unit, what's the
> conceptual difficulty?
I'm sceptical -- I see very little reason to maintain that distinction.
let's use either UCS-2 or UCS-4 for the internal storage, stick to the
"character strings are character sequences" concept, and keep the
UTF-16 surrogate issue where it belongs: in the codecs.
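[this is what that looks like in today's Python 3, which adopted the
code-point model argued for here; the G clef character is just a
convenient non-BMP example:]

```python
# Surrogate pairs live only in the UTF-16 *encoding*, not in the string.
gclef = "\U0001D11E"              # MUSICAL SYMBOL G CLEF, outside the BMP
assert len(gclef) == 1            # one character at the string level

encoded = gclef.encode("utf-16-be")
assert len(encoded) == 4          # two 16-bit code units: a surrogate pair
assert encoded == b"\xd8\x34\xdd\x1e"   # high surrogate D834, low DD1E
```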
Cheers /F