[XML-SIG] Re: Issues with Unicode type
Lars Marius Garshol
larsga@garshol.priv.no
26 Sep 2002 11:42:58 +0200
* Lars Marius Garshol
|
| Actually, Windows 2000 displays non-BMP characters just fine. MSIE
| can be made to do it, Opera 6.0 does it just fine, Mozilla does not
| (I think) do it.
* Martin v. Loewis
|
| Can you demonstrate this?
I don't know why my word alone is not enough, but here you go:
<URL: http://www.garshol.priv.no/tmp/nonbmp.png >
The page contains instructions for how to enable the display of such
characters. Note that I never did anything to enable surrogate support.
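
For readers unfamiliar with the term: in UTF-16 a character outside the
BMP is stored as a pair of 16-bit surrogates, and "surrogate support"
means the system recombines that pair into one character for display.
A minimal sketch of the arithmetic, assuming a current Python 3
interpreter (the helper name to_surrogates is made up for illustration):

```python
def to_surrogates(code_point):
    """Split a non-BMP code point into its UTF-16 (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    offset = code_point - 0x10000      # 20-bit value to be split in two
    high = 0xD800 + (offset >> 10)     # top 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits -> low surrogate
    return high, low

# U+20000 is the first code point of CJK Extension B.
high, low = to_surrogates(0x20000)
print(hex(high), hex(low))             # 0xd840 0xdc00
```

The same pair is what the "utf-16-be" codec produces for that character.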
| I failed trying for myself, because:
|
| - I have no fonts that have characters outside the BMP,
Use James Kass's Code 2001.
| - OpenType fonts that want to include non-BMP characters need
| two char-to-glyph tables, one for UCS-2, and one for UCS-4.
|
| Reportedly, W2k will only use the UCS-2 table in a font that
| contains non-BMP characters, so I somewhat doubt your statement. WXP
| reportedly does support such fonts - but I have none.
The screenshot above was taken on Windows 2000. The font is Code 2001.
* Lars Marius Garshol
|
| Also, there are locales where non-BMP characters are essential.
| Cantonese is probably the best example. You can't write the
| Cantonese equivalent of the "-ing" ending in Cantonese with the
| BMP...
* Martin v. Loewis
|
| W2k/WXP support GB18030 with a special support package, but the font
| included (SimSun18030 aka NSimSun) does *not* support CJK
| Extension B, only CJK Extension A.
That may well be. In Opera we have our own GB 18030 converter. I would
prefer to pretend that the wretched mess does not exist, but contracts
with mainland Chinese companies require us to support it.
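
To illustrate why GB 18030 matters for non-BMP text: it reaches
characters outside the BMP through four-byte sequences, while common
BMP ideographs stay at two bytes. The sketch below is not Opera's
converter; it assumes a current Python 3, whose standard library ships
a "gb18030" codec, and the variable names are illustrative only:

```python
# First code point of CJK Extension B (outside the BMP).
ext_b_char = chr(0x20000)
# A common BMP ideograph, for comparison.
bmp_char = "\u4e2d"

ext_b_bytes = ext_b_char.encode("gb18030")
bmp_bytes = bmp_char.encode("gb18030")

print(len(ext_b_bytes), len(bmp_bytes))

# Round trip: decoding must give back the original character.
assert ext_b_bytes.decode("gb18030") == ext_b_char
```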
* Lars Marius Garshol
|
| Is the plan that Python will eventually be UCS-4 only?
* Martin v. Loewis
|
| It's my plan, but I don't think GvR shares it. When I
| first presented a Unicode type for Python at IPC6, Guido was quite
| upset about my proposal to use a 4-byte wchar_t as the underlying
| type, since he considered the space wastage unacceptable.
|
| When Fredrik and I implemented PEP 261, I had to back out my change
| to make Py_UNICODE equal to wchar_t by default if wchar_t is four
| bytes.
That's sad. It would be good if we could eventually get Python to be
all UCS-4.
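
A rough way to see both the build width and the space trade-off being
discussed, assuming a current Python 3 interpreter (where sys.maxunicode
is always 0x10FFFF, so the check below reports UCS-4; on the narrow
builds of the time it would be 0xFFFF). The function name build_width
is hypothetical:

```python
import sys

def build_width():
    """Report the interpreter's Unicode range as 'UCS-4' or 'UCS-2'."""
    return "UCS-4" if sys.maxunicode > 0xFFFF else "UCS-2"

# Three ASCII letters plus one non-BMP character.
text = "abc" + chr(0x20000)

# 16-bit units: 2 bytes per BMP character, 4 for the surrogate pair.
utf16_bytes = len(text.encode("utf-16-le"))
# 32-bit units: 4 bytes per character, regardless of plane.
utf32_bytes = len(text.encode("utf-32-le"))

print(build_width(), utf16_bytes, utf32_bytes)
```

For mostly-BMP text the 32-bit form costs about twice the space, which
is the wastage objection quoted above.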
--
Lars Marius Garshol, Ontopian <URL: http://www.ontopia.net >
ISO SC34/WG3, OASIS GeoLang TC <URL: http://www.garshol.priv.no >