[Python-bugs-list] [ python-Bugs-555360 ] UTF-16 BOM handling counterintuitive

noreply@sourceforge.net noreply@sourceforge.net
Mon, 13 May 2002 02:21:28 -0700


Bugs item #555360, was opened at 2002-05-13 18:21
You can respond by visiting: 
http://sourceforge.net/tracker/?func=detail&atid=105470&aid=555360&group_id=5470

Category: Unicode
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Stephen J. Turnbull (yaseppochi)
Assigned to: M.-A. Lemburg (lemburg)
Summary: UTF-16 BOM handling counterintuitive

Initial Comment:
A search on "Unicode BOM" doesn't turn up anything related.

Sorry, I don't have a 2.2 or CVS to hand.  Easy enough
to replicate, anyway.  The UTF-16 codec happily
corrupts files by writing a BOM before the encoded
text, even when the file is opened in append mode:

bash-2.05a$ python
Python 2.1.3 (#1, Apr 20 2002, 10:14:34) 
[GCC 2.95.4 20011002 (Debian prerelease)] on linux2
Type "copyright", "credits" or "license" for more information.
>>> import codecs
>>> f = codecs.open("/tmp/utf16","w","utf-16")
>>> f.write(u"a")
>>> f.close()
>>> f = codecs.open("/tmp/utf16","a","utf-16")
>>> f.write(u"a")
>>> f.close()
>>> f = open("/tmp/utf16","r") 
>>> f.read()
'\xff\xfea\x00\xff\xfea\x00'

Oops.
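
One way to avoid the second BOM is to write it by hand only when the file is empty, and to encode with a codec that never emits one.  The following is only a sketch in present-day Python 3 syntax, not the 2.1 interpreter shown above.  The helper name append_utf16le is hypothetical, and it assumes the file is (or will be) little-endian:

```python
import codecs
import os
import tempfile


def append_utf16le(path, text):
    """Append text as little-endian UTF-16, writing the BOM only once.

    Hypothetical helper for illustration; not part of the codecs module.
    """
    # A BOM belongs only at the very start of the file.
    needs_bom = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "ab") as f:
        if needs_bom:
            f.write(codecs.BOM_UTF16_LE)
        # "utf-16-le" fixes the byte order and never emits a BOM itself.
        f.write(text.encode("utf-16-le"))


# Same sequence as the session above: create the file, then append to it.
path = os.path.join(tempfile.mkdtemp(), "utf16")
append_utf16le(path, "a")
append_utf16le(path, "a")
with open(path, "rb") as f:
    print(f.read())
```

Appending to an existing big-endian file with this helper would mix byte orders, so a real fix would have to inspect the existing BOM first.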

Also, dir(codecs) shows BOM64* constants are defined
(to what purpose, I have no idea---Microsoft Word files
on Alpha, maybe?), but no BOM8, which actually has some
basis in the standards.  (I think the idea of a UTF-8
signature is an abomination, so you can leave it
out<wink>, but people who do use the BOM as signature
in UTF-8 files would find it useful.)  Hmm ...

>>> codecs.BOM_BE
'\xfe\xff'
>>> codecs.BOM64_BE
'\x00\x00\xfe\xff'
>>> codecs.BOM32_BE
'\xfe\xff'
>>> 

Urk!  I only count 32 bits in BOM64 and 16 bits in
BOM32!  Maybe BOM32_* was intended as an alias for
BOM_*, and BOM64_* was a continuation of the typo, as
it were?
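
The arithmetic behind that count, as a small sketch in present-day Python 3 syntax.  The byte strings are copied from the session above; the dictionary names are only for illustration:

```python
# Values as printed by the Python 2.1 session above.
reported = {
    "BOM_BE": b"\xfe\xff",
    "BOM32_BE": b"\xfe\xff",
    "BOM64_BE": b"\x00\x00\xfe\xff",
}

# Each byte contributes 8 bits.
bits = {name: 8 * len(value) for name, value in reported.items()}

for name, width in bits.items():
    print(name, width, "bits")
```

So the constant named BOM32_BE holds the 16-bit UTF-16 big-endian mark, and BOM64_BE holds the 32-bit UTF-32 big-endian mark.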

I wonder if this is the right interface, actually. 
Wouldn't prefixBOM() and checkBOM() methods for streams
and strings make more sense?  prefixBOM should be
idempotent, and checkBOM would return either a codec
(with size and endianness determined) or a codec.BOM*
constant.
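
To make the proposal concrete, here is a rough sketch for byte strings only, in present-day Python 3 syntax.  The names prefix_bom and check_bom are hypothetical, and returning a (BOM, codec name) pair is just one possible reading of "either a codec or a constant":

```python
import codecs

# Longest marks first: the UTF-32-LE BOM begins with the UTF-16-LE BOM,
# so checking the 16-bit marks first would misreport UTF-32 data.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def check_bom(data):
    """Return (bom, codec_name) for a leading BOM, or None if there is none."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return bom, name
    return None


def prefix_bom(data, bom):
    """Prepend bom unless data already starts with it (idempotent)."""
    return data if data.startswith(bom) else bom + data


once = prefix_bom("a".encode("utf-16-le"), codecs.BOM_UTF16_LE)
twice = prefix_bom(once, codecs.BOM_UTF16_LE)
print(once == twice, check_bom(twice))
```

One known ambiguity remains: UTF-16-LE text whose first character after the BOM is U+0000 is byte-for-byte indistinguishable from a UTF-32-LE BOM, so check_bom can only guess in that case.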

----------------------------------------------------------------------
