Multibyte Character Support for Python
Stephen J. Turnbull
stephen at xemacs.org
Mon May 13 05:26:02 EDT 2002
>>>>> "Martin" == Martin v Loewis <martin at v.loewis.de> writes:
Martin> No, that is not the only way. Just use UTF-16BE, and all
Martin> will be fine.
If that were feasible, then there would be no UTF-16LE, no UTF-16, and
no BOM. Or at the very least, Python could get away with aliasing
UTF-16 to UTF-16BE on _all_ platforms.
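A minimal sketch of why the byte order matters, assuming a current Python 3 interpreter rather than the 2.1.3 used in this post. The variable names are illustrative only. It encodes the same character with each of the three codecs and shows what happens when little-endian data is read as big-endian:

```python
import codecs

text = "a"  # U+0061

be = text.encode("utf-16-be")      # b'\x00a'  high byte first, no BOM
le = text.encode("utf-16-le")      # b'a\x00'  low byte first, no BOM
with_bom = text.encode("utf-16")   # BOM followed by the platform's native order

# The plain "utf-16" codec records the order it chose by writing a BOM first.
starts_with_bom = with_bom[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)

# Without a BOM the reader has to guess. Guessing wrong does not fail;
# it silently produces a different character (here U+6100 instead of U+0061).
misread = le.decode("utf-16-be")

print(be, le, with_bom, starts_with_bom, hex(ord(misread)))
```

Data produced elsewhere in little-endian order cannot simply be declared UTF-16BE, which is why the BOM and the two explicit variants exist.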
Here's some more codec fun:
bash-2.05a$ python
Python 2.1.3 (#1, Apr 20 2002, 10:14:34)
[GCC 2.95.4 20011002 (Debian prerelease)] on linux2
Type "copyright", "credits" or "license" for more information.
>>> import codecs
>>> dir(codecs)
['BOM', 'BOM32_BE', 'BOM32_LE', 'BOM64_BE', 'BOM64_LE', 'BOM_BE',
'BOM_LE', ...]
# BOM64* ??!? Hmm
>>> codecs.BOM_BE
'\xfe\xff'
>>> codecs.BOM64_BE
'\x00\x00\xfe\xff'
>>> codecs.BOM32_BE
'\xfe\xff'
>>>
# And how come there is no BOM8, which actually has some basis in the standard?
>>> f = codecs.open("/tmp/utf16","w","utf-16")
>>> f.write(u"a")
>>> f.close()
>>> f = codecs.open("/tmp/utf16","a","utf-16")
>>> f.write(u"a")
>>> f.close()
>>> f = open("/tmp/utf16","r")
>>> f.read()
'\xff\xfea\x00\xff\xfea\x00'
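The second BOM in the middle of the file is the problem: a decoder strips only the leading one and treats the next as an ordinary U+FEFF character. The following is a sketch of one possible workaround, assuming Python 3 and working on byte strings in memory instead of a real file. `append_utf16` is a hypothetical helper, not part of the `codecs` module:

```python
import codecs


def append_utf16(existing: bytes, text: str) -> bytes:
    """Hypothetical helper: append text to UTF-16 data without a second BOM."""
    if not existing:
        # First write: let the codec emit the BOM and pick the native order.
        return text.encode("utf-16")
    if existing.startswith(codecs.BOM_UTF16_LE):
        return existing + text.encode("utf-16-le")
    if existing.startswith(codecs.BOM_UTF16_BE):
        return existing + text.encode("utf-16-be")
    raise ValueError("existing data has no BOM, byte order unknown")


# What the session above effectively does: two independent encodes.
naive = "a".encode("utf-16") + "a".encode("utf-16")
naive_decoded = naive.decode("utf-16")   # stray U+FEFF between the two letters

# BOM-aware append: later writes reuse the order announced by the first BOM.
data = append_utf16(b"", "a")
data = append_utf16(data, "a")
fixed_decoded = data.decode("utf-16")

print(repr(naive_decoded), repr(fixed_decoded))
```

The design choice here is to trust the existing BOM instead of the platform's native order, so the appended text matches whatever the file already contains.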
Submitted as request #555360 on the tracker, including the request for
a BOM8 constant and the wrong sizes (or names?) of BOM64 and BOM32.
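For comparison, a short sketch that lists the BOM constants as they are named in current Python 3, assuming the documented `codecs.BOM_UTF8`, `BOM_UTF16_*` and `BOM_UTF32_*` names. These are not the names available in the 2.1.3 session above. The lookup of the old `BOM32_BE` and `BOM64_BE` names is defensive, since they may not be present in every version:

```python
import codecs

# Documented constants in current Python 3; the number in each name is the
# code-unit width in bits, so the byte lengths are 3 (UTF-8), 2 and 4.
names = ("BOM_UTF8", "BOM_UTF16_BE", "BOM_UTF16_LE",
         "BOM_UTF32_BE", "BOM_UTF32_LE")
boms = {name: getattr(codecs, name) for name in names}

for name, bom in boms.items():
    print(f"{name:13} {len(bom)} bytes  {bom!r}")

# The older names from the session, if this interpreter still provides them.
for old in ("BOM32_BE", "BOM64_BE"):
    print(old, repr(getattr(codecs, old, None)))
```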
--
Institute of Policy and Planning Sciences http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
My nostalgia for Icon makes me forget about any of the bad things. I don't
have much nostalgia for Perl, so its faults I remember. Scott Gilbert c.l.py