[Python-Dev] Unicode byte order mark decoding
Stephen J. Turnbull
stephen at xemacs.org
Tue Apr 5 15:04:34 CEST 2005
>>>>> "MAL" == M <mal at egenix.com> writes:
MAL> Stephen J. Turnbull wrote:
>> The Japanese "memopado" (Notepad) uses UTF-8 signatures; it
>> even adds them to existing UTF-8 files lacking them.
MAL> Is that an MS application? AFAIK, notepad, wordpad and MS
MAL> Office always use UTF-16-LE + BOM when saving text as "Unicode
MAL> text".
Yes, it is an MS application. I'll have to borrow somebody's box to
check, but IIRC UTF-8 is the native "text" encoding for Japanese now.
(Japanized applications generally behave differently from everything
else, as there are so many "standards" for encoding Japanese.)
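
For readers who have to consume such files, here is a minimal sketch
of what a UTF-8 signature looks like from Python. It assumes a
present-day Python 3 and its "utf-8-sig" codec, which this message does
not mention; the sample text is made up for illustration.

```python
import codecs

# Hypothetical file contents: a UTF-8 signature followed by some text.
data = codecs.BOM_UTF8 + "メモ".encode("utf-8")

# The plain "utf-8" codec keeps the signature as a leading U+FEFF.
plain = data.decode("utf-8")

# "utf-8-sig" drops a leading signature if one is present ...
clean = data.decode("utf-8-sig")

# ... and also accepts input that has no signature at all.
no_sig = "メモ".encode("utf-8").decode("utf-8-sig")
```

Decoding with "utf-8-sig" is the tolerant choice when it is unknown
whether the writing application added a signature.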
MAL> The UTF-16 stream codecs implement this logic.
MAL> The UTF-16 encode and decode functions will however always
MAL> strip the BOM mark from the beginning of a string.
MAL> If the application doesn't want this stripping to happen, it
MAL> should use the UTF-16-LE or -BE codec resp.
That sounds like it would work fine almost all the time. If it
doesn't, it's straightforward to work around, and it would certainly
be more convenient for the non-standards-geek programmer.
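
The difference between the generic codec and the byte-order-specific
ones can be sketched as follows. This is only an illustration of how a
current Python 3 behaves, not necessarily the exact behaviour under
discussion in 2005; the sample string is arbitrary.

```python
import codecs

text = "abc"

# The generic "utf-16" codec writes a BOM when encoding ...
encoded = text.encode("utf-16")
has_bom = encoded[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)

# ... and consumes it again when decoding.
roundtrip = encoded.decode("utf-16")

# Build little-endian data with an explicit BOM in front.
raw = codecs.BOM_UTF16_LE + text.encode("utf-16-le")

# "utf-16-le" does no BOM handling, so U+FEFF stays in the result.
kept = raw.decode("utf-16-le")

# "utf-16" uses the BOM to pick the byte order, then strips it.
stripped = raw.decode("utf-16")
```

An application that needs to see the BOM can therefore decode with
"utf-16-le" or "utf-16-be" and inspect the first character itself.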
--
School of Systems and Information Engineering http://turnbull.sk.tsukuba.ac.jp
University of Tsukuba Tennodai 1-1-1 Tsukuba 305-8573 JAPAN
Ask not how you can "do" free software business;
ask what your business can "do for" free software.