PEP-0263 and default encoding

Wed Oct 1 18:09:05 EDT 2003

On Wed, 1 Oct 2003 11:58:22 +0000 (UTC), Klaus Alexander Seistrup <spam at magnetic-ink.dk> wrote:

>Erik Max Francis skrev:
>
>>> Please, could you explain what you mean by "the UTF-8 BOM"?
>>
>> Byte order marker.  It's a clever gimmick Unicode uses, where a few
>> valid Unicode characters are set aside for being used in sequence to
>> help determine whether an encoded Unicode stream is little-endian or
>> big-endian.
>
>Thanks, I also found a reference on unicode.org¹ that was useful.
>
>
>  // Klaus
>
> ¹) <http://www.unicode.org/unicode/faq/utf_bom.html>

A table of BOMs appears there:

00 00 FE FF     UTF-32, big-endian
FF FE 00 00     UTF-32, little-endian
FE FF           UTF-16, big-endian
FF FE           UTF-16, little-endian
EF BB BF        UTF-8
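
For illustration, the table above can be turned into a small signature sniffer. This is only a minimal sketch, assuming a modern Python 3; the BOM constants come from the standard codecs module, while the names BOMS and sniff_bom are made up here for the example.

```python
import codecs

# Check the 4-byte signatures first: the UTF-32 little-endian BOM
# (FF FE 00 00) begins with the UTF-16 little-endian BOM (FF FE).
BOMS = [
    (codecs.BOM_UTF32_BE, "UTF-32, big-endian"),
    (codecs.BOM_UTF32_LE, "UTF-32, little-endian"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_BE, "UTF-16, big-endian"),
    (codecs.BOM_UTF16_LE, "UTF-16, little-endian"),
]


def sniff_bom(data):
    """Return (description, bom_length), or (None, 0) if no BOM is found."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0
```

Note that a stream starting with FF FE 00 00 is ambiguous in principle (it could be UTF-16 little-endian text beginning with a NUL character); the sketch simply prefers the longer match.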

but I'm not sure I trust everything on that page. E.g., at the bottom it says,

"Last updated:  - Tuesday, December 09, 1902 16:15:05" ;-)

There appear to be a number of other typos as well, and some mysterious semantics, e.g., in

"""
Q: Can you summarize how I should deal with BOMs?

A: Here are some guidelines to follow: 

    1. A particular protocol (e.g. Microsoft conventions for
    .txt files) may require use of the BOM on certain Unicode
    data streams, such as files. When you need to conform to
    such a protocol, use a BOM.

    2. Some protocols allow optional BOMs in the case of
    untagged text. In those cases,

        o   Where a text data stream is known to be plain text,
            but of unknown encoding, BOM can be used as a
            signature. If there is no BOM, the encoding could be
            anything.

        o   Where a text data stream is known to be plain
            Unicode text (but not which endian), then BOM can be
            used as a signature. If there is no BOM, the text
            should be interpreted as big-endian.

    3. Where the precise type of the data stream is known (e.g.
    Unicode big-endian or Unicode little-endian), the BOM should
    not be used. [MD]
"""
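
As a rough sketch of the second bullet under (2), here is how I read it for UTF-16, assuming a modern Python 3. The helper name decode_utf16 is hypothetical; the codec names and BOM constants are the standard ones. As far as I know, Python's own "utf-16" codec falls back to the machine's native byte order when there is no BOM, which is why the big-endian default is spelled out explicitly.

```python
import codecs


def decode_utf16(data):
    """Decode UTF-16 bytes, using the BOM as a signature if present.

    With no BOM, interpret the text as big-endian, as the quoted
    guideline suggests.
    """
    if data.startswith(codecs.BOM_UTF16_LE):
        return data[len(codecs.BOM_UTF16_LE):].decode("utf-16-le")
    if data.startswith(codecs.BOM_UTF16_BE):
        return data[len(codecs.BOM_UTF16_BE):].decode("utf-16-be")
    # No signature: default to big-endian.
    return data.decode("utf-16-be")
```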

(3) sounds a little funny, though I think I know what it's trying
to say.

I don't understand (2), unless it's just saying you can make up
ad hoc markup using BOMs to indicate a binary packing scheme totally
orthogonally to what the packed bits might mean as an encoded data stream.

BOMs have always suggested Unicode to me, so this was a liberating notion,
intended or not ;-) In which case, why not UTF-xxz BOMs for zlib zip-format
packing, etc., where xx could be the usual, e.g., UTF-16lez or UTF-8z? I'd
bet the latter could save some bandwidth and disk space on some non-English
web sites, if browsers supported it for UTF-8 Unicode.
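
To make the "UTF-8z" idea concrete, here is a minimal sketch assuming a modern Python 3 and the standard zlib module. The names pack_utf8z and unpack_utf8z are purely hypothetical, no such encoding or BOM is defined anywhere, and the sample text is just repeated filler, so the sizes it prints say little about real pages.

```python
import zlib


def pack_utf8z(text):
    """Hypothetical 'UTF-8z': UTF-8 bytes run through zlib."""
    return zlib.compress(text.encode("utf-8"))


def unpack_utf8z(blob):
    """Inverse of pack_utf8z: decompress, then decode as UTF-8."""
    return zlib.decompress(blob).decode("utf-8")


# Repeated non-ASCII sample text; each accented or Cyrillic letter
# takes two bytes in UTF-8.
sample = "Grüße aus Köln. Привет из Москвы. " * 50
raw = sample.encode("utf-8")
packed = pack_utf8z(sample)

# Compare the plain UTF-8 size with the compressed size.
print(len(raw), len(packed))
```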

Actually, is there a standard for overall compressed HTML transfer already?
Or is it ignored in favor of letting lower levels do compression?
Haven't looked lately...

Regards,
Bengt Richter