[Python-Dev] Unicode byte order mark decoding

Stephen J. Turnbull stephen at xemacs.org
Tue Apr 5 15:04:34 CEST 2005

>>>>>>"MAL" == M  <mal at egenix.com> writes:

    MAL> Stephen J. Turnbull wrote:

    >> The Japanese "memopado" (Notepad) uses UTF-8 signatures; it
    >> even adds them to existing UTF-8 files lacking them.

    MAL> Is that a MS application ? AFAIK, notepad, wordpad and MS
    MAL> Office always use UTF-16-LE + BOM when saving text as "Unicode
    MAL> text".

Yes, it is an MS application.  I'll have to borrow somebody's box to
check, but IIRC UTF-8 is the native "text" encoding for Japanese now.
(Japanized applications generally behave differently from everything
else, as there are so many "standards" for encoding Japanese.)
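
For concreteness, here is a minimal sketch of the signature behaviour
described above: adding a UTF-8 signature to data that lacks one, and
dropping it again on read. It assumes a present-day Python 3. The
helper names add_utf8_signature and strip_utf8_signature are
hypothetical, made up for illustration, and this is not a claim about
how Notepad is actually implemented.

```python
import codecs


def add_utf8_signature(data: bytes) -> bytes:
    """Prepend the UTF-8 signature (EF BB BF) unless it is already there."""
    if data.startswith(codecs.BOM_UTF8):
        return data
    return codecs.BOM_UTF8 + data


def strip_utf8_signature(data: bytes) -> bytes:
    """Drop a leading UTF-8 signature if present; otherwise return data as-is."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


plain = "memo".encode("utf-8")
signed = add_utf8_signature(plain)

# The plain "utf-8" codec keeps the signature as a U+FEFF character,
# while the "utf-8-sig" codec removes it when decoding.
with_mark = signed.decode("utf-8")
without_mark = signed.decode("utf-8-sig")

if __name__ == "__main__":
    print(repr(with_mark))
    print(repr(without_mark))
```

Calling add_utf8_signature twice leaves the data unchanged the second
time, so a file never accumulates more than one signature.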

    MAL> The UTF-16 stream codecs implement this logic.

    MAL> The UTF-16 encode and decode functions will however always
    MAL> strip the BOM mark from the beginning of a string.

    MAL> If the application doesn't want this stripping to happen, it
    MAL> should use the UTF-16-LE or -BE codec resp.

That sounds like it would work fine almost all the time.  If it
doesn't, it's straightforward to work around, and it would certainly
be more convenient for the non-standards-geek programmer.
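
As a rough illustration of the difference between the generic codec
and the endian-specific ones, here is a small sketch. It assumes a
present-day Python 3, whose behaviour may differ in detail from the
codecs under discussion here, and the function names decode_generic
and decode_little_endian are hypothetical.

```python
import codecs


def decode_generic(data: bytes) -> str:
    """Decode with the BOM-aware "utf-16" codec, which consumes a leading BOM."""
    return data.decode("utf-16")


def decode_little_endian(data: bytes) -> str:
    """Decode with "utf-16-le", which leaves a leading BOM in the text as U+FEFF."""
    return data.decode("utf-16-le")


# Sample input: an explicit little-endian BOM followed by "abc".
sample = codecs.BOM_UTF16_LE + "abc".encode("utf-16-le")

stripped = decode_generic(sample)    # the BOM is used for byte order, then dropped
kept = decode_little_endian(sample)  # the BOM survives as the first character

if __name__ == "__main__":
    print(repr(stripped))
    print(repr(kept))
```

An application that needs to keep the mark picks the -LE or -BE codec;
everyone else uses plain "utf-16" and never sees it.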

School of Systems and Information Engineering http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba                    Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
               Ask not how you can "do" free software business;
              ask what your business can "do for" free software.
