[I18n-sig] How does Python Unicode treat surrogates?

M.-A. Lemburg mal@lemburg.com
Mon, 25 Jun 2001 22:52:54 +0200

Fredrik Lundh wrote:
> I wrote:
> > SRE and the unicode databases (me again) should also work
> > pretty much out of the box.
> a 32-bit version SRE works as expected, at least:
> >>> a = array.array("i", map(ord, "hello"))
> >>> m = sre.search("l+", a)
> >>> m
> <SRE_Match object at 008CECA8>
> >>> m.group(0)
> array('i', [108, 108])
> the DLL size is identical, and the performance is roughly the
> same.

That's good to know, but Guido was asking about supporting
both UTF-16 and UCS-4 by means of a configure switch. It is this
kind of dual approach that I consider hard to implement and
maintain.

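Dealing only with UTF-16 or only with UCS-4 would be much less
work, and this is what I am advocating: stick with UTF-16 for the
next few years and then maybe switch over to UCS-4. Note that the
switch will cause an incompatibility, because u[i] references code
units, and those change: a non-BMP character that takes two UTF-16
code units (a surrogate pair) takes only one UCS-4 code unit, so
every index after it shifts.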
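
To make the u[i] incompatibility concrete, here is a minimal sketch.
It assumes a Python 3 interpreter, where str is indexed by code point,
and emulates UTF-16 storage by encoding the string; the helper name
utf16_code_units is made up for this illustration and is not part of
any Python API.

```python
import struct

def utf16_code_units(text):
    """Return the UTF-16 code units of text as a list of ints.

    Hypothetical helper: emulates what u[i] would index in a
    UTF-16 based implementation.
    """
    data = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))

# 'a', then U+10000 (first code point outside the BMP), then 'b'
s = "a\U00010000b"

units = utf16_code_units(s)        # UTF-16 view: surrogate pair in the middle
points = [ord(ch) for ch in s]     # UCS-4 view: one unit per code point

print(len(units), len(points))     # 4 3
print(hex(units[2]))               # 0xdc00 (low surrogate)
print(hex(points[2]))              # 0x62   ('b')
```

The same index 2 lands on the low surrogate in the UTF-16 view and on
'b' in the UCS-4 view, which is the kind of breakage code written
against one representation would see after a switch.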

Marc-Andre Lemburg
CEO eGenix.com Software GmbH
Company & Consulting:                           http://www.egenix.com/
Python Software:                        http://www.lemburg.com/python/