Multibyte Character Support for Python

Stephen J. Turnbull stephen at xemacs.org
Sat May 11 10:55:23 EDT 2002


>>>>> "Martin" == Martin v Loewis <martin at v.loewis.de> writes:

    Martin> UTF-16 as-a-CES is defined in RFC 2781, which, in section
    Martin> 3.3, says that the BOM SHOULD be inserted if the CES
    Martin> UTF-16 is used.

What you wrote is identical in content to what I wrote: "SHOULD" makes
the BOM optional if you have good reason to omit it.  The behavior of

    u"a".encode("UTF-16") + u"b".encode("UTF-16")

versus

    u"ab".encode("UTF-16")

is quite sufficient reason, to my mind.
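
The difference between those two expressions can be sketched as follows.
This is only an illustration, and it assumes a Python 3 interpreter, where
the "UTF-16" codec writes a BOM in the platform's native byte order at the
start of every encode() call.  The variable names are made up for the
example.

```python
import codecs

# Each encode("UTF-16") call prepends its own BOM (native byte order).
part_a = u"a".encode("UTF-16")
part_b = u"b".encode("UTF-16")

joined = part_a + part_b          # BOM + "a" + BOM + "b"
whole = u"ab".encode("UTF-16")    # BOM + "a" + "b"

bom = codecs.BOM_UTF16            # native-order BOM bytes

# The concatenation carries two BOMs, the single encode only one.
print(joined.count(bom), whole.count(bom))

# Decoding strips only the leading BOM; the second one survives as
# U+FEFF (ZERO WIDTH NO-BREAK SPACE) in the middle of the text.
decoded = joined.decode("UTF-16")
print(repr(decoded))
print(decoded == u"ab")
```

Under those assumptions the concatenated bytes do not decode back to
u"ab", which is the practical cost of inserting a BOM on every encode.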

It is, however, incorrect to cite RFC 2781 Section 3.3 "Choosing a
label for UTF-16 text" here.  Python strings have no explicit charset
labels, which is the subject of that section.  (At least I can't find
them in a string object using dir() etc.)  It simply does not apply.

Not to mention that RFC 2781 is not intended to apply to the Python
interpreter's internal operation at all, except for "Widely
Distributed Python"<wink>.  See section 1 "Introduction".


-- 
Institute of Policy and Planning Sciences     http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba                    Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
 My nostalgia for Icon makes me forget about any of the bad things.  I don't
have much nostalgia for Perl, so its faults I remember.  Scott Gilbert c.l.py


