[Python-Dev] UTF-16 code point comparison

Bill Tutt billtut@microsoft.com
Thu, 27 Jul 2000 07:43:44 -0700


Fredrik wrote:

> the original unicode implementation did just that, but Bill and
> Marc-Andre recently lowered the shields: the UTF-8 decoder
> now generates UTF-16 encoded data.  (so does \N{}, but
> that's a non-issue:

> my proposal is to tighten this up in 2.0: ifdef out the UTF-16
> code in the UTF-8 decoder, and ifdef out the UTF-16 stuff in
> the compare function.

Commenting the UTF-16 stuff out in the compare function is a valid point,
given our current Unicode string object.
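
For reference, here is roughly what a code point order comparison over 16-bit
storage involves. This is only a sketch in Python, not the actual C compare
function; the names utf16_fixup and compare_code_point_order are made up for
illustration, and the input is assumed to be well-formed UTF-16 code units.

```python
def utf16_fixup(unit):
    """Map a 16-bit code unit to a key that sorts in code point order."""
    if unit < 0xD800:
        return unit          # below the surrogate block: order is unchanged
    if unit >= 0xE000:
        return unit - 0x800  # U+E000..U+FFFF slide down below the surrogates
    return unit + 0x2000     # surrogates stand for >= U+10000, so sort last


def compare_code_point_order(a, b):
    """Compare two sequences of UTF-16 code units; return -1, 0 or 1."""
    for x, y in zip(a, b):
        if x != y:
            return -1 if utf16_fixup(x) < utf16_fixup(y) else 1
    # One sequence is a prefix of the other: the shorter one sorts first.
    return (len(a) > len(b)) - (len(a) < len(b))
```

A plain code unit comparison would put U+FFFF after the pair D800 DC00
(U+10000); the fix-up reverses that. Without surrogates in the data the two
orderings agree.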

I disagree strongly with disabling the surrogate support in UTF-8, and we
should fix the UTF-16 decoder.
Since the decoder/encoder support doesn't hurt any other piece of code by
emitting surrogate pairs, I don't see why you want to disable the code. 
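
To be concrete about what "emitting surrogate pairs" means: a decoder that
meets a code point above U+FFFF splits it into two 16-bit units. A minimal
Python sketch follows; the function name to_surrogate_pair is hypothetical
and this is not the decoder's actual code.

```python
def to_surrogate_pair(cp):
    """Split a code point in U+10000..U+10FFFF into (high, low) surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point does not need a surrogate pair")
    v = cp - 0x10000  # 20 significant bits remain
    high = 0xD800 | (v >> 10)    # top 10 bits go in the high surrogate
    low = 0xDC00 | (v & 0x3FF)   # bottom 10 bits go in the low surrogate
    return high, low
```

Code that never looks inside the string just sees two more 16-bit values.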

> (oddly enough, the UTF-16 decoder won't accept anything
> that isn't UCS-2 ;-)

Well that's an easy bug to fix.
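
The missing piece is just recombining a high/low pair into one code point.
A rough Python sketch, with a made-up function name and operating on a list
of 16-bit code units rather than raw bytes (byte order and BOM handling are
left out):

```python
def decode_utf16_units(units):
    """Turn a sequence of UTF-16 code units into a list of code points."""
    out = []
    i = 0
    n = len(units)
    while i < n:
        u = units[i]
        i += 1
        if 0xD800 <= u <= 0xDBFF:
            # High surrogate: it must be followed by a low surrogate.
            if i < n and 0xDC00 <= units[i] <= 0xDFFF:
                out.append(0x10000 + ((u - 0xD800) << 10) + (units[i] - 0xDC00))
                i += 1
            else:
                raise ValueError("unpaired high surrogate")
        elif 0xDC00 <= u <= 0xDFFF:
            raise ValueError("unpaired low surrogate")
        else:
            out.append(u)  # ordinary UCS-2 value
    return out
```

Whether unpaired surrogates should raise an error or be passed through is a
policy choice; the sketch raises.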

> let's wait until 2.1 before we support the full unicode character
> set (and I'm pretty sure "the right way" is UCS-4 storage and a
> unified string implementation, but let's discuss that later).

I've mentioned this before, but why discuss this later? Indeed, why would we
want to wait until 2.1 to fix it?
Especially if that means moving to UCS-4 storage in a minor release. Why not
just get the Unicode support correct this time around? Save the poor users of
the Python Unicode support from going nuts when we make these additional
confusing changes later.
If you think you want to move to UCS-4 later, don't wait, do it now.  Add
support for special surrogate handling later if we must, but please oh
please don't change the storage mechanism in memory in a later release.

Java and Win32 are both UTF-16 based, and the extra 16 bits per character
that UCS-4 needs are wasted for nearly all of the common Unicode data you'd
hope to handle. I think using UTF-16 as an internal storage mechanism
actually makes sense. Whether you want your character type to expose an
array of 16-bit values or the appearance of an array of UCS-4 characters is
a separate issue, and one I think should be answered now instead of being
fixed later.
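
To put numbers on the storage point, here is a small sketch. It assumes a
Python where strings index by code point and the "utf-16-le" and "utf-32-le"
codecs exist (true of current Python 3, not of the interpreter under
discussion); the helper names are invented for illustration.

```python
def storage_bytes(text):
    """Bytes needed to hold text as UTF-16 versus UCS-4 (UTF-32)."""
    return {
        "utf-16": len(text.encode("utf-16-le")),
        "ucs-4": len(text.encode("utf-32-le")),
    }


def code_unit_count(text):
    """Length as seen through an array of 16-bit values."""
    return len(text.encode("utf-16-le")) // 2
```

For text without surrogates UCS-4 takes exactly twice the space of UTF-16.
The two views of length only differ once a character outside the 16-bit
range shows up, which is the "array of 16-bit values" versus "array of UCS-4
characters" question.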

Bill