[I18n-sig] Support for "wide" Unicode characters
M.-A. Lemburg
mal@lemburg.com
Thu, 28 Jun 2001 15:11:04 +0200
Guido van Rossum wrote:
>
> > > There is a new (experimental) define:
> > >
> > > #define PY_UNICODE_SIZE 2
> >
> > Doesn't sizeof(Py_UNICODE) do the same ?
>
> Not on a Cray! And not in the C standard. Ask Tim. :-)
Ah, OK... nice sofas these Crays, BTW ;-)
> > This introduces an incompatibility between narrow and wide
> > builds at run-time. PYC should not be harmed by this since they
> > store Unicode strings using UTF-8.
>
> Does UTF-8 transfer isolated surrogates correctly? I think that's
> necessary, otherwise I can't marshal or unmarshal literals containing
> those, which means that .pyc files for .py files containing those
> can't be read (or maybe aren't portable between wide and narrow
> interpreters).
It handles surrogates correctly, but rejects isolated ones on input
(easy to fix though) and passes them through on output. As I said
before, surrogate support is far from being complete.
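
To make the marshalling concern concrete, here is a minimal sketch of the two
policies for an isolated surrogate. It uses the UTF-8 codec of present-day
Python 3, not the 2001 code under discussion: the "strict" and "surrogatepass"
error handlers merely stand in for "reject" and "pass through", and the
variable names are illustrative only.

```python
# An isolated high surrogate, as it could appear in a string literal.
lone = "\ud800"

# Policy 1: reject isolated surrogates (the default "strict" handler).
try:
    lone.encode("utf-8")
    strict_encode_ok = True
except UnicodeEncodeError:
    strict_encode_ok = False

# Policy 2: pass isolated surrogates through as a plain 3-byte sequence.
passed = lone.encode("utf-8", "surrogatepass")

# A decoder that rejects the bytes cannot read back what the
# pass-through encoder wrote, which is the .pyc problem raised above.
try:
    passed.decode("utf-8")
    strict_decode_ok = True
except UnicodeDecodeError:
    strict_decode_ok = False

# With pass-through on both sides, the round trip is lossless.
restored = passed.decode("utf-8", "surrogatepass")
```

The round trip only works when encoder and decoder agree on the policy, which
is why rejecting on input while passing through on output is a problem for
marshalled literals.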
> Note that I'm OK with the UTF-8 encoder recognizing hi+lo surrogate
> pairs and encoding them as one Unicode character, since the decoder
> generates surrogates for non-BMP characters on a narrow platform.
That's what it currently does.
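
As a rough illustration of that behaviour (a sketch only, not the actual codec
source), the following models a narrow build's string as a list of 16-bit code
units. The function names utf16_units_to_utf8 and code_point_to_utf16_units
are hypothetical. A hi+lo pair is encoded as one 4-byte UTF-8 sequence, while
an isolated surrogate is passed through as a 3-byte sequence.

```python
def code_point_to_utf16_units(cp):
    """Split a code point into the code units a narrow build would store."""
    if cp < 0x10000:
        return [cp]
    cp -= 0x10000
    return [0xD800 | (cp >> 10), 0xDC00 | (cp & 0x3FF)]


def utf16_units_to_utf8(units):
    """Encode 16-bit code units as UTF-8, joining hi+lo surrogate pairs."""
    out = bytearray()
    i = 0
    n = len(units)
    while i < n:
        cp = units[i]
        i += 1
        # A high surrogate followed by a low surrogate forms one character.
        if 0xD800 <= cp <= 0xDBFF and i < n and 0xDC00 <= units[i] <= 0xDFFF:
            cp = 0x10000 + ((cp - 0xD800) << 10) + (units[i] - 0xDC00)
            i += 1
        if cp < 0x80:
            out.append(cp)
        elif cp < 0x800:
            out.append(0xC0 | (cp >> 6))
            out.append(0x80 | (cp & 0x3F))
        elif cp < 0x10000:
            # Isolated surrogates also land here and are passed through.
            out.append(0xE0 | (cp >> 12))
            out.append(0x80 | ((cp >> 6) & 0x3F))
            out.append(0x80 | (cp & 0x3F))
        else:
            out.append(0xF0 | (cp >> 18))
            out.append(0x80 | ((cp >> 12) & 0x3F))
            out.append(0x80 | ((cp >> 6) & 0x3F))
            out.append(0x80 | (cp & 0x3F))
    return bytes(out)


# U+1D11E is outside the BMP, so a narrow build holds it as two code units.
pair = code_point_to_utf16_units(0x1D11E)
encoded_pair = utf16_units_to_utf8(pair)
encoded_lone = utf16_units_to_utf8([0xD800])
```

Because the pair is joined before encoding, the bytes are the same as those a
wide build would produce for the single character, so the data stays portable
between the two kinds of interpreter.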
--
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
______________________________________________________________________
Company & Consulting: http://www.egenix.com/
Python Software: http://www.lemburg.com/python/