[Python-3000] Pre-PEP: Easy Text File Decoding

Josiah Carlson jcarlson at uci.edu
Wed Sep 13 18:41:01 CEST 2006


"John S. Yates, Jr." <john at yates-sheets.org> wrote:
> 
> On Mon, 11 Sep 2006 18:16:15 -0700, "Paul Prescod" wrote:
> 
> > UTF-8 with BOM is the Microsoft preferred format.
> 
> I believe this is a gloss.  Microsoft uses UTF-16.  Because
> the basic character unit is larger than one byte it is crucial
> for interoperability to prefix a string of UTF-16 text with an
> indication of the order of bytes in each two byte unit.  This
> is the role of the BOM.  The BOM is not part of the text.  It
> is a wrapper or envelope.
> 
> It is a mistake on Microsoft's part to fail to strip the BOM
> during conversion to UTF-8.  There is no MEANINGFUL definition
> of BOM in a UTF-8 string.  But instead of stripping the wrapper
> and converting only the text payload Microsoft lazily treats
> both the wrapper and its payload as text.

I have actually had a variant of this particular discussion with Walter
Dörwald.  He brought up RFC 3629...

[Walter Dörwald]
I don't think it does. RFC 3629 isn't that clear about whether an
initial 0xEF 0xBB 0xBF sequence is to be interpreted as an encoding
signature or a ZWNBSP. But I think the following part of RFC 3629
applies here for Python source code:

   o  A protocol SHOULD also forbid use of U+FEFF as a signature for
      those textual protocol elements for which the protocol provides
      character encoding identification mechanisms, when it is expected
      that implementations of the protocol will be in a position to
      always use the mechanisms properly.  This will be the case when
      the protocol elements are maintained tightly under the control of
      the implementation from the time of their creation to the time of
      their (properly labeled) transmission.

[My reply, slightly altered for this context]
Because not all tools that manipulate data consumed and/or produced by
Python follow the coding: directive, "the protocol elements" are not
'tightly maintained'.  The inclusion of a "BOM" for utf-8 is therefore
a necessary "protocol element", at least for .py files, and it is
certainly advisable for other file types that _may not have_ the
equivalent of a Python coding: directive.
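
As a rough sketch of what that explicit envelope buys (the BOM table
and the sniff_encoding helper below are mine, built on the stdlib
codecs constants; they are not anything Python currently ships), a
consumer with no coding: directive available can pick the decoder
straight off the signature:

    import codecs

    # Map leading signatures to codec names.  Order matters:
    # BOM_UTF32_LE begins with the same two bytes as BOM_UTF16_LE.
    _BOMS = [
        (codecs.BOM_UTF32_BE, 'utf-32-be'),
        (codecs.BOM_UTF32_LE, 'utf-32-le'),
        (codecs.BOM_UTF8,     'utf-8'),
        (codecs.BOM_UTF16_BE, 'utf-16-be'),
        (codecs.BOM_UTF16_LE, 'utf-16-le'),
    ]

    def sniff_encoding(data, default='utf-8'):
        """Return (codec_name, payload), using a leading BOM as the
        encoding signature.  The signature is stripped from the
        payload, because it is envelope, not text."""
        for bom, name in _BOMS:
            if data.startswith(bom):
                return name, data[len(bom):]
        return default, data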


Explicit is better than implicit, and in this case we have the
opportunity to be explicit about the "envelope" or "the protocol
elements", which will guarantee proper interpretation by non-braindead
software.  Braindead software that doesn't understand a utf-* BOM should
be fixed by the developer or eschewed.
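
For what it's worth, CPython is already growing such support: the
'utf-8-sig' codec (new in 2.5) treats the signature as envelope rather
than payload, emitting it on encode and stripping it on decode.  A
minimal demonstration:

    import codecs

    text = u'hello'
    data = text.encode('utf-8-sig')   # prepends the 3-byte signature
    assert data[:3] == codecs.BOM_UTF8

    # Decoding strips the signature, and still works when it is absent.
    assert data.decode('utf-8-sig') == text
    assert text.encode('utf-8').decode('utf-8-sig') == text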


> You can take this further and imagine concatenating two UTF-8
> strings, one originally UTF-16 generated in a little-endian
> environment, the other originally UTF-16 generated in a big-
> endian environment.  If the BOMs are not pre-stripped then
> during raising of the concatenated result to UTF-16 you will
> get an object with embedded BOMs.  This is not meaningful.

And it is generally ignored, per the Unicode spec; it's a "zero width
non-breaking space": an invisible character with no visible effect on
the surrounding text.
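
A quick illustration of how that character sneaks in (plain stdlib;
utf-8-sig encodes each fragment with its own signature, and naive byte
concatenation keeps the second one):

    a = u'foo'.encode('utf-8-sig')    # signature + payload
    b = u'bar'.encode('utf-8-sig')    # signature + payload

    # Only the leading signature is stripped on decode; the second
    # survives as an embedded U+FEFF, i.e. a ZWNBSP.
    assert (a + b).decode('utf-8-sig') == u'foo\ufeffbar'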

> What does it mean within a UTF-16 string to encounter a BOM
> that contradicts the wrapper/envelope?  Does this mean that
> any correct UTF-16 utility must cope with a hybrid object whose
> byte order potentially changes mid-stride?

Unless you are doing something wrong (like literally concatenating the
byte representations of utf-16be and utf-16le encoded texts), this
won't happen.
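
If you do literally splice the bytes, the damage is easy to show; a
sketch:

    le = u'AB'.encode('utf-16-le')    # 41 00 42 00
    be = u'AB'.encode('utf-16-be')    # 00 41 00 42

    # Decoding the spliced bytes under a single byte order turns the
    # mismatched half into CJK ideographs, not back into 'AB'.
    assert (le + be).decode('utf-16-le') == u'AB\u4100\u4200'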


> /john, who has written a database loader that has to contend
> with (and clearly diagnoses) BOM in UTF-8 strings.

Given that a BOM is only supposed to be treated as a BOM when it is
literally the first few bytes of a string, I certainly hope you didn't
spend too much time on that support.

 - Josiah (who has written an editor with support for all UTF variants
with BOM, and UTF-8 + all other localized encodings using coding:
directives)

