[I18n-sig] UCS-4 configuration
Martin v. Loewis
martin@loewis.home.cs.tu-berlin.de
Wed, 27 Jun 2001 08:45:11 +0200
> Another loose end: define sys.maxunicode.
Breaking my promise not to touch the code, I've added this. I was not
quite sure what type you wanted sys.maxunicode to have; I chose an
integer, since U+FFFF is a non-character.
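
As a rough illustration of how an integer-valued sys.maxunicode can be
used, the sketch below classifies a build as narrow or wide. The helper
name build_width is hypothetical and only for illustration; the
thresholds assume that a 2-byte build reports 0xFFFF and a 4-byte build
reports 0x10FFFF.

```python
import sys


def build_width(maxunicode=sys.maxunicode):
    """Classify a build from the integer value of sys.maxunicode.

    build_width is a hypothetical helper, not part of the sys module.
    """
    # Anything above U+FFFF cannot fit in a 16-bit Py_UNICODE.
    return "wide" if maxunicode > 0xFFFF else "narrow"


# sys.maxunicode is a plain integer, so it can be compared and formatted.
print(type(sys.maxunicode).__name__, hex(sys.maxunicode), build_width())
```

Because the value is an integer, code can compare it against a code
point directly without first converting a character with ord().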
> Note how the utf8 codec has encoded the surrogate pair as two 3-byte
> utf8 sequences. I think it should either spit out an error or (I
> think this is better -- "be forgiving in what you accept") recognize
> the surrogate pair and spit out a 4-byte utf8 sequence. Note that in
> 2-byte mode, this same string literal can be marshalled and
> unmarshalled just fine!
That was actually the same problem as with the test case: the UTF-8
encoder did not use the surrogate-handling code in wide mode. I've
removed that restriction, so this test now passes as well.
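
To make the difference between the two encodings concrete, here is a
minimal sketch in pure Python. It is not the codec's C implementation;
the function names combine_surrogates and utf8_encode_code_point are
made up for this example. It shows a surrogate pair being combined into
one code point and written as a single 4-byte sequence, versus each
surrogate being written as its own 3-byte sequence.

```python
def combine_surrogates(high, low):
    """Combine a UTF-16 surrogate pair into one code point (hypothetical helper)."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def utf8_encode_code_point(cp):
    """Encode one integer code point as UTF-8 bytes, without validity checks."""
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


high, low = 0xD834, 0xDD1E  # surrogate pair for U+1D11E

# Pair recognized: one 4-byte sequence.
paired = utf8_encode_code_point(combine_surrogates(high, low))

# Pair not recognized: two separate 3-byte sequences.
unpaired = utf8_encode_code_point(high) + utf8_encode_code_point(low)

print(paired.hex(), unpaired.hex())
```

The first result is what a forgiving encoder would emit for the pair;
the second is the two 3-byte form described in the quoted text.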
> Or should we change the marshalling format to do something that's more
> transparent? It feels uncomfortable that in 2-byte mode we can easily
> create unicode strings containing illegal sequences (e.g. lone
> surrogates), but these strings can't be marshalled.
You mean, they cannot be unmarshalled? With the current code,
marshalling them works fine...
There was another problem in the Unicode database: the code assumed
that adding two Py_UNICODE values would wrap around at 65536, which
no longer holds when Py_UNICODE is wider than 16 bits. With that
fixed and committed, the test suite passes for me.
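
The wrap-around assumption can be simulated with plain integers. This
is only a sketch of the arithmetic, not the actual database code: the
helper add_py_unicode is hypothetical, and the values (a character plus
an offset of -32 stored as the unsigned 16-bit number 0xFFE0) are chosen
for illustration and are not taken from the source.

```python
def add_py_unicode(a, b, bits):
    """Add two unsigned values the way a C integer of `bits` width would."""
    return (a + b) & ((1 << bits) - 1)


ch = 0x0061      # illustrative character value
delta = 0xFFE0   # illustrative offset: -32 as an unsigned 16-bit number

# 16-bit Py_UNICODE: the sum wraps at 65536 by itself.
narrow = add_py_unicode(ch, delta, 16)

# 32-bit Py_UNICODE: no wrap at 65536, so the carry bit survives.
wide = add_py_unicode(ch, delta, 32)

# One possible fix: mask the sum explicitly instead of relying on the type width.
fixed = add_py_unicode(ch, delta, 32) & 0xFFFF

print(hex(narrow), hex(wide), hex(fixed))
```

With a 16-bit type the overflow is discarded silently; with a wider
type the same expression yields a value above 0xFFFF unless it is
masked.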
Regards,
Martin