[I18n-sig] How does Python Unicode treat surrogates?

M.-A. Lemburg mal@lemburg.com
Mon, 25 Jun 2001 21:18:13 +0200


Tom Emerson wrote:
> ...
> No, but we may as well stop going around on this, since my views are
> not going to happen.
> 
> In my view the string 'u' is a Unicode string. I don't care what sits
> underneath: 16-bits or 32-bits I don't care. As far as I'm concerned
> the string has three characters in it:
> 
> foo = u"\u4e00\U00020000a"
> 
> means that foo[0] == u"\u4e00", foo[1] == u"\U00020000", and foo[2] ==
> u"a".
> 
> The fact that this is represented internally different ways shouldn't
> matter to the user who only cares about characters.

While I agree with Guido that foo[i] should return the code
unit and not the code point, I think that providing a few
more Unicode methods (like the ones Mark mentioned) would
go a long way toward a compromise, e.g. foo.codepoint(1)
would then return u"\U00020000", foo.codelen() would return 3, etc.
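
To make the idea concrete, here is a rough sketch of what such helpers
might look like, written as plain functions over a list of UTF-16 code
units. The names codepoint() and codelen() are taken from the suggestion
above; code_units() and iter_codepoints() are hypothetical helpers, and
none of these exist in Python. The sketch uses Python 3 syntax and
assumes well-formed input; an unpaired surrogate is simply passed
through as its own code point.

```python
# Hypothetical helpers illustrating the proposal; not a real Python API.

def code_units(s):
    """Return the UTF-16 code units of the string s as a list of ints."""
    data = s.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


def iter_codepoints(units):
    """Yield code points, combining each high/low surrogate pair into one."""
    i = 0
    n = len(units)
    while i < n:
        u = units[i]
        is_pair = (
            0xD800 <= u <= 0xDBFF              # high surrogate
            and i + 1 < n
            and 0xDC00 <= units[i + 1] <= 0xDFFF  # followed by low surrogate
        )
        if is_pair:
            yield 0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            i += 2
        else:
            yield u
            i += 1


def codelen(units):
    """Number of code points (not code units) in the sequence."""
    return sum(1 for _ in iter_codepoints(units))


def codepoint(units, index):
    """Return the code point at the given code point index."""
    for i, cp in enumerate(iter_codepoints(units)):
        if i == index:
            return cp
    raise IndexError("code point index out of range")


units = code_units("\u4e00\U00020000a")
print(len(units))                  # number of code units
print(codelen(units))              # number of code points
print(hex(codepoint(units, 1)))    # the supplementary character
```

Indexing by code point this way is linear in the length of the string,
which is one reason to keep foo[i] as the constant-time code unit access
and offer code point access separately.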

Alternatively, we could of course also provide this functionality
in the form of functions in a separate module. (With the recent
controversies over methods vs. functions, I am not sure anymore
what the general guideline is for Python... string methods at least
don't seem to be too popular around here anymore; OK,
just rambling ;-)

-- 
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
______________________________________________________________________
Company & Consulting:                           http://www.egenix.com/
Python Software:                        http://www.lemburg.com/python/