[XML-SIG] Re: Issues with Unicode type

Lars Marius Garshol larsga@garshol.priv.no
26 Sep 2002 11:42:58 +0200


* Lars Marius Garshol
|
| Actually, Windows 2000 displays non-BMP characters just fine. MSIE
| can be made to do it, Opera 6.0 does it just fine, Mozilla does not
| (I think) do it.

* Martin v. Loewis
| 
| Can you demonstrate this? 

I don't know why my word alone is not enough, but here you go:
  <URL: http://www.garshol.priv.no/tmp/nonbmp.png >

The page contains instructions for how to enable the display of such
characters. Note that I never did anything to enable surrogate support.

| I failed trying for myself, because:
| 
| - I have no fonts that have characters outside the BMP,

Use James Kass's Code 2001.

| - OpenType fonts that want to include non-BMP characters need
|   two char-to-glyph tables, one for UCS-2, and one for UCS-4.
| 
|   Reportedly, W2k will only use the UCS-2 table in a font that
|   contains non-BMP characters, so I somewhat doubt your statement. WXP
|   reportedly does support such fonts - but I have none.

The screenshot above is taken on Windows 2000. The font is Code 2001.
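
For reference, "surrogate support" here means handling code points above
U+FFFF, which UTF-16 stores as two 16-bit units. A minimal sketch of the
arithmetic, assuming Python 3 syntax and using U+20000 (the first CJK
Extension B code point) purely as an illustration; the function name is
hypothetical:

```python
def to_surrogate_pair(code_point):
    """Split a non-BMP code point into its UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    offset = code_point - 0x10000      # 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits go into the high surrogate
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits go into the low surrogate
    return high, low

high, low = to_surrogate_pair(0x20000)
print(hex(high), hex(low))  # 0xd840 0xdc00
```

A renderer that only looks at 16-bit units sees two unassigned values here
rather than one character, which is why both the OS and the font have to
cooperate.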

* Lars Marius Garshol
|
| Also, there are locales where non-BMP characters are essential.
| Cantonese is probably the best example. You can't write the
| Cantonese equivalent of the "-ing" ending in Cantonese with the
| BMP...
 
* Martin v. Loewis
|
| W2k/WXP support GB18030 with a special support package, but the font
| included (SimSun18030 aka NSimSun) does *not* support CJK
| Extension B, only CJK Extension A.

That may well be. In Opera we have our own GB 18030 converter. I would
prefer to pretend that the wretched mess does not exist, but contracts
with mainland Chinese companies require us to support it.
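
To illustrate why a GB 18030 converter has to deal with non-BMP
characters at all: the encoding assigns a four-byte sequence to every
code point above U+FFFF. A minimal sketch, assuming Python 3 and its
standard `gb18030` codec (this is not Opera's converter, which is not
shown here):

```python
ch = chr(0x20000)                    # a CJK Extension B code point, outside the BMP
encoded = ch.encode("gb18030")       # GB 18030 uses a four-byte form for it
decoded = encoded.decode("gb18030")  # and the conversion round-trips

print(len(encoded), decoded == ch)
```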
 
* Lars Marius Garshol
|
| Is the plan that Python will eventually be UCS-4 only?
 
* Martin v. Loewis
|
| It's my plan, but I don't think GvR shares it. When I first
| presented a Unicode type for Python at IPC6, Guido was quite
| upset about my proposal to use a 4-byte wchar_t as the underlying
| type, since he considered the space wastage unacceptable.
| 
| When Fredrik and I implemented PEP 261, I had to back out my change
| to make Py_UNICODE equal to wchar_t by default if wchar_t is four
| bytes.

That's sad. It would be good if we could eventually get Python to be
all UCS-4. 
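
For context, whether an interpreter stores 16-bit or 32-bit units shows
up in `sys.maxunicode`. A minimal sketch, assuming Python 3 syntax (the
helper name is hypothetical); it also compares the storage cost of
2-byte and 4-byte units, which is what the space objection is about:

```python
import sys

def is_wide_build():
    """True when the interpreter can index code points above U+FFFF directly."""
    return sys.maxunicode > 0xFFFF

text = "abc" + "\U00020000"  # three BMP characters plus one non-BMP character
utf16_bytes = len(text.encode("utf-16-le"))  # 2 bytes per BMP char, 4 for the non-BMP one
utf32_bytes = len(text.encode("utf-32-le"))  # always 4 bytes per character

print(is_wide_build(), utf16_bytes, utf32_bytes)
```

For mostly-BMP text the 4-byte representation costs roughly twice the
memory, in exchange for every character being a single unit.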

-- 
Lars Marius Garshol, Ontopian         <URL: http://www.ontopia.net >
ISO SC34/WG3, OASIS GeoLang TC        <URL: http://www.garshol.priv.no >