[I18n-sig] UCS-4 configuration
Guido van Rossum
guido@digicool.com
Wed, 27 Jun 2001 11:20:14 -0400
> > Another loose end: define sys.maxunicode.
>
> Breaking my promise not to touch the code, I've added this. I was not
> quite sure what type you meant to see in sys.maxunicode; I took
> integer, since U+FFFF is a non-character.
Correct. And thanks!
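
For illustration, a minimal sketch of how code might branch on this value, assuming sys.maxunicode is an integer equal to 0xFFFF in 2-byte mode and 0x10FFFF in wide mode. The helper name is_wide_build is hypothetical and not part of the patch:

```python
import sys

def is_wide_build(maxunicode=None):
    """Return True if the given (or current) maxunicode lies beyond the BMP.

    is_wide_build is a hypothetical helper, used here only for illustration.
    """
    if maxunicode is None:
        maxunicode = sys.maxunicode
    # 0xFFFF is the largest code point a 2-byte Py_UNICODE can hold.
    return maxunicode > 0xFFFF

print(is_wide_build())
```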
> > Note how the utf8 codec has encoded the surrogate pair as two 3-byte
> > utf8 sequences. I think it should either spit out an error or (I
> > think this is better -- "be forgiving in what you accept") recognize
> > the surrogate pair and spit out a 4-byte utf8 sequence. Note that in
> > 2-byte mode, this same string literal can be marshalled and
> > unmarshalled just fine!
>
> That was actually the same problem as with the test case: the UTF-8
> encoder would not use the surrogate code in wide mode. I've removed
> that restriction, so this test now also passes.
Thanks again!
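
To make the "recognize the surrogate pair and spit out a 4-byte utf8 sequence" behaviour concrete, here is a rough sketch in pure Python. It is not the codec's C code; the function names are made up for this example, and it only follows the standard UTF-16 and UTF-8 arithmetic:

```python
def combine_surrogates(high, low):
    """Combine a UTF-16 surrogate pair into a single code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

def utf8_encode_supplementary(cp):
    """Encode a code point in U+10000..U+10FFFF as 4 UTF-8 bytes."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point outside the supplementary range")
    return bytes([
        0xF0 | (cp >> 18),            # lead byte: 11110xxx
        0x80 | ((cp >> 12) & 0x3F),   # continuation: 10xxxxxx
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])

# U+1D11E is stored as the pair D834 DD1E in 2-byte mode.
cp = combine_surrogates(0xD834, 0xDD1E)
encoded = utf8_encode_supplementary(cp)
```

Encoding the two halves separately would instead give two 3-byte sequences, which is the output complained about above.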
> > Or should we change the marshalling format to do something that's more
> > transparent? It feels uncomfortable that in 2-byte mode we can easily
> > create unicode strings containing illegal sequences (e.g. lone
> > surrogates), but these strings can't be marshalled.
>
> You mean, they cannot be unmarshalled? With the current code,
> marshalling them works fine...
Yes.
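
The asymmetry can be sketched as follows, assuming a modern Python 3 interpreter in which the "surrogatepass" error handler stands in for a forgiving UTF-8 encoder and the default strict decoder stands in for the unmarshalling side. This is an analogy to the marshal behaviour discussed here, not the marshal code itself, and the function names are hypothetical:

```python
LONE_SURROGATE = "\ud800"  # a high surrogate with no low surrogate after it

def dump_forgiving(s):
    """Encode to UTF-8, letting lone surrogates through as 3-byte sequences."""
    return s.encode("utf-8", "surrogatepass")

def load_strict(data):
    """Decode UTF-8 strictly; lone surrogates are rejected."""
    return data.decode("utf-8")

def load_forgiving(data):
    """Decode UTF-8, accepting encoded lone surrogates."""
    return data.decode("utf-8", "surrogatepass")

data = dump_forgiving(LONE_SURROGATE)      # writing succeeds
try:
    load_strict(data)                      # reading back fails
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False
```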
> There was another problem with the unicode database; the code assumed
> that adding two Py_UNICODE values would wrap around at 65536. With
> that fixed and committed, the test suite passes for me.
Wow. And for both versions, too!
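
For anyone following along, the class of bug can be sketched like this. The message does not say how the database code was actually fixed, so the explicit mask below is an assumption, and the delta value is chosen only for illustration:

```python
def add_wrapping_16(code, delta):
    """Add two values modulo 65536, as a 16-bit Py_UNICODE would implicitly."""
    return (code + delta) & 0xFFFF

def add_wide(code, delta):
    """Plain addition, as happens once Py_UNICODE is wider than 16 bits."""
    return code + delta

# A delta of -32 stored as an unsigned 16-bit value (illustrative only).
DELTA = 0xFFE0

upper_narrow = add_wrapping_16(ord("a"), DELTA)  # wraps around to ord("A")
upper_wide = add_wide(ord("a"), DELTA)           # overflows past 0xFFFF
```

In 2-byte mode the wrap-around comes for free from the type's width; in wide mode it has to be written out, otherwise the result lands outside the intended range.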
Are there any open issues left? A list of those would help! Some I
can think of:
- Marc-Andre's message
- disable Unicode entirely with a configuration switch
- documentation
- marshalling UCS-2 strings containing lone surrogates
Anything else?
--Guido van Rossum (home page: http://www.python.org/~guido/)