[Python-bugs-list] [ python-Bugs-555360 ] UTF-16 BOM handling counterintuitive

noreply@sourceforge.net noreply@sourceforge.net
Mon, 03 Jun 2002 04:54:01 -0700


Bugs item #555360, was opened at 2002-05-13 11:21
You can respond by visiting: 
http://sourceforge.net/tracker/?func=detail&atid=105470&aid=555360&group_id=5470

Category: Unicode
Group: None
Status: Open
Resolution: None
Priority: 3
Submitted By: Stephen J. Turnbull (yaseppochi)
Assigned to: M.-A. Lemburg (lemburg)
Summary: UTF-16 BOM handling counterintuitive

Initial Comment:
A search on "Unicode BOM" doesn't turn up anything related.

Sorry, I don't have a 2.2 or CVS to hand.  Easy enough
to replicate, anyway.  The UTF-16 codec happily
corrupts files by appending a BOM before writing
encoded text to the file:

bash-2.05a$ python
Python 2.1.3 (#1, Apr 20 2002, 10:14:34) 
[GCC 2.95.4 20011002 (Debian prerelease)] on linux2
Type "copyright", "credits" or "license" for more
information.
>>> import codecs
>>> f = codecs.open("/tmp/utf16","w","utf-16")
>>> f.write(u"a")
>>> f.close()
>>> f = codecs.open("/tmp/utf16","a","utf-16")
>>> f.write(u"a")
>>> f.close()
>>> f = open("/tmp/utf16","r") 
>>> f.read()
'\xff\xfea\x00\xff\xfea\x00'

Oops.
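For what it's worth, the append case can be worked around by sniffing the existing BOM and appending with an endianness-specific codec, which writes no BOM of its own. The following is a hypothetical sketch in modern Python (open_utf16_append is not part of the codecs module):

```python
import codecs
import os

def open_utf16_append(path):
    # Hypothetical helper: append UTF-16 text without duplicating the BOM.
    if os.path.exists(path) and os.path.getsize(path) >= 2:
        with open(path, "rb") as f:
            sig = f.read(2)
        # Reuse the byte order recorded by the existing BOM; the
        # endianness-specific codecs never write a BOM of their own.
        if sig == codecs.BOM_UTF16_LE:
            return codecs.open(path, "a", "utf-16-le")
        if sig == codecs.BOM_UTF16_BE:
            return codecs.open(path, "a", "utf-16-be")
        raise ValueError("existing file has no UTF-16 BOM")
    # New or empty file: let the utf-16 codec write the BOM once.
    return codecs.open(path, "w", "utf-16")
```

With this helper, repeating the write/close/append sequence above yields a single BOM followed by both characters.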

Also, dir(codecs) shows BOM64* constants are defined
(to what purpose, I have no idea---Microsoft Word files
on Alpha, maybe?), but no BOM8, which actually has some
basis in the standards.  (I think the idea of a UTF-8
signature is an abomination, so you can leave it
out<wink>, but people who do use the BOM as signature
in UTF-8 files would find it useful.)  Hmm ...

>>> codecs.BOM_BE
'\xfe\xff'
>>> codecs.BOM64_BE
'\x00\x00\xfe\xff'
>>> codecs.BOM32_BE
'\xfe\xff'
>>> 

Urk!  I only count 32 bits in BOM64 and 16 bits in
BOM32!  Maybe BOM32_* was intended as an alias for
BOM_*, and BOM64_* was a continuation of the typo, as
it were?

I wonder if this is the right interface, actually. 
Wouldn't prefixBOM() and checkBOM() methods for streams
and strings make more sense?  prefixBOM should be
idempotent, and checkBOM would return either a codec
(with size and endianness determined) or a codec.BOM*
constant.
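For illustration, the proposed checkBOM() might be sketched roughly like this (a hypothetical implementation, not anything in the codecs module; it returns a codec name and the number of signature bytes consumed rather than a codec object, using the BOM byte sequences defined by the Unicode standard):

```python
import codecs

# Signature table, longest first, so that UTF-32-LE (\xff\xfe\x00\x00)
# is not mistaken for UTF-16-LE (\xff\xfe).
_SIGNATURES = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),  # '\x00\x00\xfe\xff'
    (codecs.BOM_UTF32_LE, "utf-32-le"),  # '\xff\xfe\x00\x00'
    (codecs.BOM_UTF8, "utf-8"),          # '\xef\xbb\xbf'
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # '\xfe\xff'
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # '\xff\xfe'
]

def checkBOM(data):
    # Hypothetical checkBOM(): report which UTF signature, if any,
    # the byte string starts with, plus its length in bytes.
    for sig, name in _SIGNATURES:
        if data.startswith(sig):
            return name, len(sig)
    return None, 0
```

A caller could then decode the remainder with the detected codec, or prepend the matching BOM exactly once to get an idempotent prefixBOM().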

----------------------------------------------------------------------

>Comment By: Walter Dörwald (doerwalter)
Date: 2002-06-03 13:54

Message:
Logged In: YES 
user_id=89016

> The 32 vs. 16 refer to the number of bits in the
> Unicode internal type; they don't refer to the number of
> bits in the mark.

Yes, but unfortunately the constants in codecs are *not*
BOM32_?? and BOM16_??, but BOM64_?? and BOM32_??.


----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2002-06-02 19:16

Message:
Logged In: YES 
user_id=38388

I agree that opening a file in append mode should
probably be smarter, in the sense that the BOM is
only written when the file position is at the beginning
of the file; patches are welcome.

On the other points:
* I don't see the point of adding an 8-bit BOM mark
  (UTF-8 does not depend on byte order).
* The 32 vs. 16 refer to the number of bits in the
  Unicode internal type; they don't refer to the number of 
  bits in the mark.


----------------------------------------------------------------------

Comment By: Walter Dörwald (doerwalter)
Date: 2002-05-13 22:43

Message:
Logged In: YES 
user_id=89016

And if you're using a different encoding for the second 
open call, the data will really be corrupted:
f = codecs.open("/tmp/foo","w","utf-8")
f.write(u"ää")
f.close()
f = codecs.open("/tmp/foo","a","latin-1")
f.write(u"ää")
f.close()

But how should codecs.open be able to determine that the 
file is always opened with the same encoding, or which 
encoding was used the last time it was opened? And if it 
could, would it have to read the content using the old 
encoding and rewrite it with the new one to keep the 
file consistent?

I agree that the BOM names are broken.

> I wonder if this is the right interface, actually. 
> Wouldn't prefixBOM() and checkBOM() methods for streams 
> and strings make more sense? prefixBOM should be 
> idempotent, and checkBOM would return either a codec 
> (with size and endianness determined) or a codec.BOM* 
> constant.

You should consider UTF-16 to be a stateful encoding, so if 
you want to do your output in multiple pieces you have to 
use a stateful encoder, i.e. a StreamWriter:

>>> import codecs, cStringIO as StringIO
>>> stream = StringIO.StringIO()
>>> writer = codecs.getwriter("utf-16")(stream)
>>> writer.write(u"a")
>>> writer.write(u"b") 
>>> stream.getvalue()
'\xff\xfea\x00b\x00'
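The reading side is symmetric: a stateful StreamReader consumes the BOM once and remembers the detected byte order for all subsequent reads. A sketch in modern Python (io.BytesIO stands in for cStringIO here):

```python
import codecs
import io

stream = io.BytesIO()
writer = codecs.getwriter("utf-16")(stream)
writer.write(u"a")    # the first write emits the BOM
writer.write(u"b")    # later writes emit no further BOM
stream.seek(0)
reader = codecs.getreader("utf-16")(stream)
text = reader.read()  # the BOM is consumed, not returned as text
assert text == u"ab"
```

So as long as the same writer (or reader) object is used for the whole stream, the BOM appears exactly once.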


----------------------------------------------------------------------

You can respond by visiting: 
http://sourceforge.net/tracker/?func=detail&atid=105470&aid=555360&group_id=5470