[Python-3000] Pre-PEP: Easy Text File Decoding

Wed Sep 13 15:24:00 CEST 2006

On Mon, 11 Sep 2006 18:16:15 -0700, "Paul Prescod" wrote:

> UTF-8 with BOM is the Microsoft preferred format.

I believe this is a gloss.  Microsoft uses UTF-16.  Because
the basic code unit is larger than one byte, it is crucial
for interoperability to prefix a string of UTF-16 text with an
indication of the order of bytes in each two-byte unit.  This
is the role of the BOM.  The BOM is not part of the text.  It
is a wrapper or envelope.
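
A small sketch of the wrapper idea, using Python's standard codecs
module.  The sample text "hi" and the variable names are mine, purely
for illustration:

```python
import codecs

text = "hi"

# With an explicit byte order the codec writes no BOM at all.
le = text.encode("utf-16-le")   # low byte first
be = text.encode("utf-16-be")   # high byte first

# Prefixing the BOM wraps the payload so a reader in either
# environment can work out the byte order.
wrapped_le = codecs.BOM_UTF16_LE + le
wrapped_be = codecs.BOM_UTF16_BE + be

# The generic "utf-16" decoder reads the BOM, picks the byte order,
# and does not return the BOM as part of the text.
decoded_le = wrapped_le.decode("utf-16")
decoded_be = wrapped_be.decode("utf-16")
```

Both wrapped forms decode to the same two-character string, which is
the sense in which the BOM is envelope rather than text.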

It is a mistake on Microsoft's part to fail to strip the BOM
during conversion to UTF-8.  There is no MEANINGFUL definition
of BOM in a UTF-8 string.  But instead of stripping the wrapper
and converting only the text payload, Microsoft lazily treats
both the wrapper and its payload as text.
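
To make the difference concrete, here is roughly how it looks from
Python.  The sample bytes are an assumption standing in for a file
saved with a UTF-8 BOM; "utf-8" and "utf-8-sig" are the standard
codec names:

```python
import codecs

# Assumed sample: bytes as a BOM-prefixing editor might save them.
raw = codecs.BOM_UTF8 + "abc".encode("utf-8")

# The plain "utf-8" codec treats the BOM as text: it comes back as
# the character U+FEFF at the front of the string.
kept = raw.decode("utf-8")

# The "utf-8-sig" codec treats the BOM as a signature and drops it.
stripped = raw.decode("utf-8-sig")
```

Here kept is four characters long and stripped is three, so any
consumer that decodes with plain "utf-8" has to deal with the extra
character itself.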

You can see the logical fallacy if you imagine emitting UTF-16
text in an environment of one byte sex, reducing that text to
UTF-8, carrying it to an environment of the other byte sex and
raising it back to UTF-16.  The Unicode.org assumption is that
the generator orders the bytes of UTF-16 or UTF-32 units in
whatever way is most convenient for its own environment.
One prefixes a BOM to text objects to be persisted or passed
to differing byte-sex environments.  Such an object is not a
string but a means of inter-operation.

If the BOMs are not stripped during reduction to UTF-8 and are
reconstituted during raising to UTF-16 or UTF-32 then raising
must honor the BOM and the Unicode.org efficiency objective is
subverted.
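
A hedged sketch of such a round trip in Python.  The little-endian
source and the sample text are assumptions; the naive reduction is
simulated by decoding with an explicit byte order, which leaves the
BOM in the text:

```python
import codecs

# Assumed sample: UTF-16 text wrapped in a little-endian BOM.
wrapped = codecs.BOM_UTF16_LE + "abc".encode("utf-16-le")

# Reduction that fails to strip the wrapper: the BOM survives as
# U+FEFF and is then encoded into the UTF-8 output.
as_utf8 = wrapped.decode("utf-16-le").encode("utf-8")

# Raising back to UTF-16 in the local byte order: the codec writes
# its own BOM, and the carried-over U+FEFF follows right behind it.
raised = as_utf8.decode("utf-8").encode("utf-16")

# A reader strips only the first BOM; the second is returned as text.
reread = raised.decode("utf-16")
```

The raised bytes begin with two consecutive BOMs, and the re-read
string still starts with U+FEFF.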

You can take this further and imagine concatenating two UTF-8
strings, one originally UTF-16 generated in a little-endian
environment, the other originally UTF-16 generated in a big-
endian environment.  If the BOMs are not pre-stripped then
during raising of the concatenated result to UTF-16 you will
get an object with embedded BOMs.  This is not meaningful.
What does it mean within a UTF-16 string to encounter a BOM
that contradicts the wrapper/envelope?  Does this mean that
any correct UTF-16 utility must cope with a hybrid object whose
byte order potentially changes mid-stride?
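
The concatenation case can be sketched the same way.  The helper
name naive_to_utf8 and the sample strings are hypothetical; the
helper only imitates a converter that carries the BOM along as text:

```python
import codecs

def naive_to_utf8(wrapped: bytes, codec: str) -> bytes:
    """Hypothetical converter that keeps the BOM as U+FEFF."""
    return wrapped.decode(codec).encode("utf-8")

# Two assumed sources, one per byte order, each with its own BOM.
little = codecs.BOM_UTF16_LE + "one".encode("utf-16-le")
big = codecs.BOM_UTF16_BE + "two".encode("utf-16-be")

joined = naive_to_utf8(little, "utf-16-le") + naive_to_utf8(big, "utf-16-be")

# Even a signature-aware decoder removes only the leading BOM;
# the second one is left embedded in the middle of the text.
text = joined.decode("utf-8-sig")

# Stripping each wrapper before conversion avoids the problem.
clean = little.decode("utf-16") + big.decode("utf-16")
```

With the BOMs carried along, text comes out as "one", U+FEFF, "two";
with the wrappers stripped first, clean is simply "onetwo".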

/john, who has written a database loader that has to contend
with (and clearly diagnoses) BOM in UTF-8 strings.