Multibyte Character Support for Python

Stephen J. Turnbull stephen at xemacs.org
Mon May 13 05:26:02 EDT 2002


>>>>> "Martin" == Martin v Loewis <martin at v.loewis.de> writes:

    Martin> No, that is not the only way. Just use UTF-16BE, and all
    Martin> will be fine.

If that were feasible, then there would be no UTF-16LE, no UTF-16, and
no BOM.  Or at the very least, Python could get away with aliasing
UTF-16 to UTF-16BE on _all_ platforms.
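
To make the distinction concrete, here is a minimal sketch of what the three codec names produce. It assumes a current Python 3 interpreter, not the 2.1.3 used later in this message, and the helper name utf16_variants is made up for illustration.

```python
import codecs

def utf16_variants(text):
    """Encode text with the three UTF-16 codec names discussed above."""
    return {
        "utf-16-be": text.encode("utf-16-be"),  # fixed big-endian, no BOM
        "utf-16-le": text.encode("utf-16-le"),  # fixed little-endian, no BOM
        "utf-16": text.encode("utf-16"),        # BOM first, then platform byte order
    }

variants = utf16_variants("a")
for name, data in variants.items():
    print(name, data)

# The same two bytes read with the wrong byte order decode to a
# different character, which is the ambiguity the BOM exists to resolve.
misread = variants["utf-16-be"].decode("utf-16-le")
print(repr(misread))
```

The plain "utf-16" result depends on the platform's byte order, so it cannot simply be treated as an alias for "utf-16-be" everywhere.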

Here's some more codec fun:

bash-2.05a$ python
Python 2.1.3 (#1, Apr 20 2002, 10:14:34) 
[GCC 2.95.4 20011002 (Debian prerelease)] on linux2
Type "copyright", "credits" or "license" for more information.
>>> import codecs
>>> dir(codecs)
['BOM', 'BOM32_BE', 'BOM32_LE', 'BOM64_BE', 'BOM64_LE', 'BOM_BE',
'BOM_LE', ...]
# BOM64* ??!?  Hmm
>>> codecs.BOM_BE
'\xfe\xff'
>>> codecs.BOM64_BE
'\x00\x00\xfe\xff'
>>> codecs.BOM32_BE
'\xfe\xff'
>>> 
# And howcum no BOM8, which actually has some basis in the standard?
>>> f = codecs.open("/tmp/utf16","w","utf-16")
>>> f.write(u"a")
>>> f.close()
>>> f = codecs.open("/tmp/utf16","a","utf-16")
>>> f.write(u"a")
>>> f.close()
>>> f = open("/tmp/utf16","r") 
>>> f.read()
'\xff\xfea\x00\xff\xfea\x00'
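
One way to avoid the second BOM is to look at the existing file before appending. The following is only a sketch, assuming a current Python 3 interpreter. The helper append_utf16 is hypothetical and not part of the codecs module. It works on raw bytes so that it does not depend on how any particular file wrapper handles append mode.

```python
import codecs
import os
import tempfile

def append_utf16(path, text):
    """Append text to a UTF-16 file, writing a BOM only when the file is empty."""
    if os.path.exists(path) and os.path.getsize(path) > 0:
        # Reuse the byte order announced by the BOM already in the file.
        with open(path, "rb") as f:
            head = f.read(2)
        if head == codecs.BOM_UTF16_LE:
            data = text.encode("utf-16-le")
        elif head == codecs.BOM_UTF16_BE:
            data = text.encode("utf-16-be")
        else:
            raise ValueError("no UTF-16 BOM at start of %r" % path)
    else:
        # New or empty file: the "utf-16" codec emits BOM + platform byte order.
        data = text.encode("utf-16")
    with open(path, "ab") as f:
        f.write(data)

def demo():
    """Repeat the two writes from the session above, using the helper."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        append_utf16(path, "a")
        append_utf16(path, "a")
        with open(path, "rb") as f:
            return f.read()
    finally:
        os.remove(path)

result = demo()
print(result)
```

With this approach the file holds one BOM followed by both characters, instead of a BOM in front of each write.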

I've submitted the duplicated BOM on append as request #555360 on the
tracker, along with a request for a BOM8 constant and a note about the
wrong sizes (or names?) of BOM64 and BOM32: BOM32_BE is 2 bytes and
BOM64_BE is 4 bytes.
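
For comparison, current Python 3 releases (not the 2.1.3 shown above) name the constants after the encoding and keep the BOM32/BOM64 names only as legacy aliases. A short sketch that prints each constant with its actual width in bits:

```python
import codecs

# Constants named after the encoding they mark.
named = {
    "BOM_UTF8": codecs.BOM_UTF8,
    "BOM_UTF16_BE": codecs.BOM_UTF16_BE,
    "BOM_UTF16_LE": codecs.BOM_UTF16_LE,
    "BOM_UTF32_BE": codecs.BOM_UTF32_BE,
    "BOM_UTF32_LE": codecs.BOM_UTF32_LE,
}

# Legacy names questioned above, kept as aliases.
legacy = {
    "BOM32_BE": codecs.BOM32_BE,
    "BOM64_BE": codecs.BOM64_BE,
}

for name, value in list(named.items()) + list(legacy.items()):
    # len(value) * 8 is the real width of the mark in bits.
    print(name, len(value) * 8, value)
```

The printed widths show the mismatch directly: the name BOM32_BE is attached to a 16-bit mark and BOM64_BE to a 32-bit one.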

-- 
Institute of Policy and Planning Sciences     http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba                    Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
 My nostalgia for Icon makes me forget about any of the bad things.  I don't
have much nostalgia for Perl, so its faults I remember.  Scott Gilbert c.l.py


