[Python-bugs-list] [ python-Bugs-555360 ] UTF-16 BOM handling counterintuitive

noreply@sourceforge.net
Tue, 04 Jun 2002 08:19:17 -0700


Bugs item #555360, was opened at 2002-05-13 11:21
You can respond by visiting: 
http://sourceforge.net/tracker/?func=detail&atid=105470&aid=555360&group_id=5470

Category: Unicode
Group: None
>Status: Closed
>Resolution: Fixed
Priority: 3
Submitted By: Stephen J. Turnbull (yaseppochi)
Assigned to: Walter Dörwald (doerwalter)
Summary: UTF-16 BOM handling counterintuitive

Initial Comment:
A search on "Unicode BOM" doesn't turn up anything related.

Sorry, I don't have a 2.2 or CVS to hand.  Easy enough
to replicate, anyway.  The UTF-16 codec happily
corrupts files by appending a BOM before writing
encoded text to the file:

bash-2.05a$ python
Python 2.1.3 (#1, Apr 20 2002, 10:14:34) 
[GCC 2.95.4 20011002 (Debian prerelease)] on linux2
Type "copyright", "credits" or "license" for more
information.
>>> import codecs
>>> f = codecs.open("/tmp/utf16","w","utf-16")
>>> f.write(u"a")
>>> f.close()
>>> f = codecs.open("/tmp/utf16","a","utf-16")
>>> f.write(u"a")
>>> f.close()
>>> f = open("/tmp/utf16","r") 
>>> f.read()
'\xff\xfea\x00\xff\xfea\x00'

Oops.
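One way around this today is to write the signature exactly once yourself and otherwise use the endianness-specific codec, which never emits a BOM. A sketch in modern Python 3 syntax (the `BOM_UTF16_LE` constant name postdates this report, and standardizing on little-endian output is an assumption):

```python
import codecs
import os
import tempfile

# Hypothetical workaround sketch: emit the signature once, then always
# encode with "utf-16-le", which adds no BOM of its own.
path = os.path.join(tempfile.mkdtemp(), "utf16")

with open(path, "wb") as f:
    f.write(codecs.BOM_UTF16_LE)        # the one and only BOM
    f.write("a".encode("utf-16-le"))    # codec writes no BOM

with open(path, "ab") as f:
    f.write("a".encode("utf-16-le"))    # appending adds no second BOM

with open(path, "rb") as f:
    data = f.read()
# data is b'\xff\xfea\x00a\x00' -- one BOM, two characters
```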

Also, dir(codecs) shows BOM64* constants are defined
(to what purpose, I have no idea---Microsoft Word files
on Alpha, maybe?), but no BOM8, which actually has some
basis in the standards.  (I think the idea of a UTF-8
signature is an abomination, so you can leave it
out<wink>, but people who do use the BOM as signature
in UTF-8 files would find it useful.)  Hmm ...

>>> codecs.BOM_BE
'\xfe\xff'
>>> codecs.BOM64_BE
'\x00\x00\xfe\xff'
>>> codecs.BOM32_BE
'\xfe\xff'
>>> 

Urk!  I only count 32 bits in BOM64 and 16 bits in
BOM32!  Maybe BOM32_* was intended as an alias for
BOM_*, and BOM64_* was a continuation of the typo, as
it were?

I wonder if this is the right interface, actually. 
Wouldn't prefixBOM() and checkBOM() methods for streams
and strings make more sense?  prefixBOM should be
idempotent, and checkBOM would return either a codec
(with size and endianness determined) or a codec.BOM*
constant.
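For illustration, a checkBOM along those lines might look like the following sketch (Python 3 syntax, using the BOM_UTF* constant names adopted later; `check_bom` and `_SIGNATURES` are hypothetical names, and it returns a codec name plus signature length rather than a codec object):

```python
import codecs

# Longer 4-byte marks must be checked before 2-byte ones, because
# BOM_UTF32_LE (b'\xff\xfe\x00\x00') starts with BOM_UTF16_LE (b'\xff\xfe').
_SIGNATURES = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF8, "utf-8"),
]

def check_bom(data):
    """Return (codec_name, bom_length), or (None, 0) if no signature found."""
    for bom, name in _SIGNATURES:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0
```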

----------------------------------------------------------------------

>Comment By: Walter Dörwald (doerwalter)
Date: 2002-06-04 17:19

Message:
Logged In: YES 
user_id=89016

Checked in as:
Misc/NEWS 1.413
Lib/codecs.py 1.25
Doc/lib/libcodecs.tex 1.9


----------------------------------------------------------------------

Comment By: Stephen J. Turnbull (yaseppochi)
Date: 2002-06-04 11:50

Message:
Logged In: YES 
user_id=88738

"My" BOM8 would look exactly as you give it:  '\xef\xbb\xbf'
This would be useful in the same kinds of contexts as the
BOM16/32 variants, I would think.

----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2002-06-04 09:19

Message:
Logged In: YES 
user_id=38388

Stephen, what would your BOM8 look like?

As explained below, the two constants are there
for checking which signature was used, not so 
much for generating it (since this is up to the 
UTF-16/32 codecs).

UTF-8 doesn't need a BOM. Still, it can be used
as a signature, so I'd say we add BOM_UTF8
= '\xef\xbb\xbf' as well.

----------------------------------------------------------------------

Comment By: Stephen J. Turnbull (yaseppochi)
Date: 2002-06-04 04:09

Message:
Logged In: YES 
user_id=88738

The reason for a BOM8 is for use as a _signature_, cf.
ISO/IEC 10646-1, Annex F, as Amended by Amendment 2. 
Implementers of PEP 263 and those who have to interchange
with MS Notepad and other such applications that use a
leading ZERO-WIDTH NO-BREAK SPACE as a Unicode signature may
find it convenient.  The name BOM8 is for consistency with
the other signatures.

Of course you could trash _all_ the BOM names in favor of
"SIGNATURE_UTF(8|16|32)(_[BL]E)?", which applies in all cases.

----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2002-06-03 20:39

Message:
Logged In: YES 
user_id=38388

Yes, please (but do leave the existing ones around for
backward compatibility).

About the UTF-32 codec: sure, why not. Patches are
welcome!

----------------------------------------------------------------------

Comment By: Walter Dörwald (doerwalter)
Date: 2002-06-03 19:19

Message:
Logged In: YES 
user_id=89016

So should I name them BOM_UTF16_* and BOM_UTF32_*?
(IMHO it makes much more sense this way)

Maybe Python should get a UTF-32 codec (see
http://www.unicode.org/unicode/reports/tr19/)?


----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2002-06-03 19:02

Message:
Logged In: YES 
user_id=38388

Google'ing around a bit, I can't find a single reference to
something like a special BOM mark on 64-bit machines.

Perhaps this was just some wild idea which has no
real meaning?

Hmm, it could have a meaning for UTF-32... but then it
should really be BOM_UTF16_* vs. BOM_UTF32_*,
and have nothing to do with the internal storage format.


----------------------------------------------------------------------

Comment By: Walter Dörwald (doerwalter)
Date: 2002-06-03 17:41

Message:
Logged In: YES 
user_id=89016

In old code BOM, BOM_LE and BOM_BE are all 16-bit, so to be
backwards compatible maybe the attached patch diff2.txt
should be applied instead.

But this feels strange, because I'd expect that for a
--enable-unicode=ucs2 build codecs.BOM==codecs.BOM_UCS2 and
for an --enable-unicode=ucs4 build codecs.BOM==codecs.BOM_UCS4.


----------------------------------------------------------------------

Comment By: Walter Dörwald (doerwalter)
Date: 2002-06-03 16:54

Message:
Logged In: YES 
user_id=89016

> The only requirement we have for BOM is that it matches
> the BOM mark which actually gets written to the file.

But this is independent of the internal byte size of the
character type. UTF-16 always writes two bytes (except for
surrogates).

The attached diff.txt shows what I think the BOM stuff
should look like.


----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2002-06-03 15:58

Message:
Logged In: YES 
user_id=38388

The only requirement we have for BOM is that it matches
the BOM mark which actually gets written to the file.
The purpose of BOM_UCS2_* and BOM_UCS4_* is
to be able to figure out which underlying Unicode
version was used.

I'm not even sure whether there's a standard for this on
64-bit machines; could be that Microsoft invented something
here... (maybe that's also where the old names originated,
I don't know)

----------------------------------------------------------------------

Comment By: Walter Dörwald (doerwalter)
Date: 2002-06-03 15:35

Message:
Logged In: YES 
user_id=89016

codecs.py says:
BOM = struct.pack('=H', 0xFEFF)
this is not correct in a wide build. Should this be changed to
if sys.maxunicode>0xffff:
   BOM = struct.pack('=L', 0x0000FEFF)
else:
   BOM = struct.pack('=H', 0xFEFF)
(with two additional constants BOM_UCS2 and BOM_UCS4?)
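The difference between native and explicit byte order in struct can be seen directly (shown in modern Python 3 syntax with bytes literals):

```python
import struct
import sys

# '=H' packs a 16-bit value in *native* byte order, so the result depends
# on the machine -- exactly the property a BOM constant needs:
native = struct.pack('=H', 0xFEFF)
assert native == (b'\xff\xfe' if sys.byteorder == 'little' else b'\xfe\xff')

# Explicit-endian format codes are machine-independent:
assert struct.pack('<H', 0xFEFF) == b'\xff\xfe'              # 16-bit LE
assert struct.pack('>H', 0xFEFF) == b'\xfe\xff'              # 16-bit BE
assert struct.pack('<L', 0x0000FEFF) == b'\xff\xfe\x00\x00'  # 32-bit LE
```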

----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2002-06-03 14:57

Message:
Logged In: YES 
user_id=38388

Ok, please do and then close the bug.

About the append mode: I think you're right. It's
not worth the trouble. Applications can easily figure
this out for themselves (and then use the proper
non-BOM prepending codec name).

----------------------------------------------------------------------

Comment By: Walter Dörwald (doerwalter)
Date: 2002-06-03 14:38

Message:
Logged In: YES 
user_id=89016

> Perhaps we should add aliases called BOM_UCS2_*
> and BOM_UCS4_* and update the documentation
> accordingly ?

Sounds reasonable!

If you want, I'll change it (including documentation)

About the append mode: I don't think it's a good idea to try
to fix this. There is much that can go wrong: seeking an odd
number of bytes, mixed endianness on writes, using a
different encoding on the second write. And what about UTF-8
and UTF-7? What should happen if the user seeks into the
middle of a UTF-[78] byte sequence and starts to write?


----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2002-06-03 14:10

Message:
Logged In: YES 
user_id=38388

Hmm, you're right. Something is wrong here.

Perhaps we should add aliases called BOM_UCS2_*
and BOM_UCS4_* and update the documentation
accordingly ?

About the append mode: is file.mode considered to
be part of the file interface or not... I think that a patch
for the UTF-16 codec should check for this attribute
on the stream object to tell whether or not to prepend
the BOM mark.


----------------------------------------------------------------------

Comment By: Walter Dörwald (doerwalter)
Date: 2002-06-03 13:54

Message:
Logged In: YES 
user_id=89016

> The 32 vs. 16 refer to the number of bits in the
> Unicode internal type; they don't refer to the number of
> bits in the mark.

Yes, but unfortunately the constants in codecs are *not*
BOM32_?? and BOM16_??, but BOM64_?? and BOM32_??.


----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2002-06-02 19:16

Message:
Logged In: YES 
user_id=38388

I agree that opening a file in append mode should 
probably be smarter in the sense that the BOM is
only written in case file.seek() points to the beginning
of the file; patches are welcome.

On the other points:
* I don't see the point of adding an 8-bit BOM mark
  (UTF-8 does not depend on byte order).
* The 32 vs. 16 refer to the number of bits in the
  Unicode internal type; they don't refer to the number of 
  bits in the mark.


----------------------------------------------------------------------

Comment By: Walter Dörwald (doerwalter)
Date: 2002-05-13 22:43

Message:
Logged In: YES 
user_id=89016

And if you're using a different encoding for the second 
open call, the data will really be corrupted:
f = codecs.open("/tmp/foo","w","utf-8")
f.write(u"ää")
f.close()
f = codecs.open("/tmp/foo","a","latin-1")
f.write(u"ää")
f.close()

But how should codecs.open be able to determine that the 
file is always opened with the same encoding, or which 
encoding was used for the last open call? And if it 
could, would it have to read the content using the old 
encoding and rewrite it using the new encoding to keep the 
file consistent?
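The corruption is visible at the byte level without any file; a minimal sketch in modern Python 3 syntax:

```python
# The two writes above produce UTF-8 bytes followed by Latin-1 bytes for
# the same text; the concatenation is no longer valid UTF-8:
data = "ää".encode("utf-8") + "ää".encode("latin-1")
try:
    data.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False
# valid_utf8 is False; decoding as Latin-1 "succeeds" but yields mojibake
```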

I agree that the BOM names are broken.

> I wonder if this is the right interface, actually. 
> Wouldn't prefixBOM() and checkBOM() methods for streams 
> and strings make more sense? prefixBOM should be 
> idempotent, and checkBOM would return either a codec 
> (with size and endianness determined) or a codec.BOM* 
> constant.

You should consider UTF-16 to be a stateful encoding, so if 
you want to do your output in multiple pieces you have to 
use a stateful encoder, i.e. a StreamWriter:

>>> import codecs, cStringIO as StringIO
>>> stream = StringIO.StringIO()
>>> writer = codecs.getwriter("utf-16")(stream)
>>> writer.write(u"a")
>>> writer.write(u"b") 
>>> stream.getvalue()
'\xff\xfea\x00b\x00'
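
The read side is symmetric: in modern Python 3, a stateful StreamReader obtained via codecs.getreader consumes the BOM once and remembers the detected byte order for the rest of the stream. A sketch decoding the bytes produced above:

```python
import codecs
import io

# The "utf-16" reader strips the leading BOM and applies the endianness
# it announces to everything that follows.
stream = io.BytesIO(b'\xff\xfea\x00b\x00')
reader = codecs.getreader("utf-16")(stream)
text = reader.read()
# text == 'ab'
```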


----------------------------------------------------------------------
