PEP-0263 and default encoding
Bengt Richter
bokr at oz.net
Wed Oct 1 18:09:05 EDT 2003
On Wed, 1 Oct 2003 11:58:22 +0000 (UTC), Klaus Alexander Seistrup <spam at magnetic-ink.dk> wrote:
>Erik Max Francis wrote:
>
>>> Please, could you explain what you mean by "the UTF-8 BOM"?
>>
>> Byte order marker. It's a clever gimmick Unicode uses, where a few
>> valid Unicode characters are set aside for being used in sequence to
>> help determine whether an encoded Unicode stream is little-endian or
>> big-endian.
>
>Thanks, I also found a reference on unicode.org¹ that was useful.
>
>
> // Klaus
>
> ¹) <http://www.unicode.org/unicode/faq/utf_bom.html>
A table of BOMs appears:
    00 00 FE FF    UTF-32, big-endian
    FF FE 00 00    UTF-32, little-endian
    FE FF          UTF-16, big-endian
    FF FE          UTF-16, little-endian
    EF BB BF       UTF-8
but I'm not sure I trust everything on that page. E.g., at the bottom it says,
"Last updated: - Tuesday, December 09, 1902 16:15:05" ;-)
There appear to be a number of other typos as well, and some mysterious semantics, e.g., in
"""
Q: Can you summarize how I should deal with BOMs?

A: Here are some guidelines to follow:

   1. A particular protocol (e.g. Microsoft conventions for
      .txt files) may require use of the BOM on certain Unicode
      data streams, such as files. When you need to conform to
      such a protocol, use a BOM.

   2. Some protocols allow optional BOMs in the case of
      untagged text. In those cases,

      o Where a text data stream is known to be plain text,
        but of unknown encoding, BOM can be used as a
        signature. If there is no BOM, the encoding could be
        anything.

      o Where a text data stream is known to be plain
        Unicode text (but not which endian), then BOM can be
        used as a signature. If there is no BOM, the text
        should be interpreted as big-endian.

   3. Where the precise type of the data stream is known (e.g.
      Unicode big-endian or Unicode little-endian), the BOM should
      not be used. [MD]
"""
(3) sounds a little funny, though I think I know what it's trying
to say.
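
For what it's worth, here is how I read the table and the two bullets under (2), as a rough sketch only. It assumes a current Python 3 and uses the BOM constants from the standard codecs module; the names BOM_TABLE, sniff_bom and decode_utf16 are made up for illustration and are not from the FAQ.

```python
import codecs

# Signatures from the BOM table. The UTF-32 entries come first because
# the UTF-32 little-endian BOM (FF FE 00 00) begins with the same two
# bytes as the UTF-16 little-endian BOM (FF FE).
BOM_TABLE = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def sniff_bom(data):
    """Return (encoding, bom_length), or (None, 0) when no BOM is present.

    This is the first bullet of (2): the BOM acts as a signature, and
    without one the encoding could be anything.
    """
    for bom, name in BOM_TABLE:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


def decode_utf16(data):
    """Decode bytes known to be UTF-16 but of unknown byte order.

    This is the second bullet of (2): a BOM settles the byte order and
    is dropped; without one the text is read as big-endian.
    """
    if data.startswith(codecs.BOM_UTF16_LE):
        return data[2:].decode("utf-16-le")
    if data.startswith(codecs.BOM_UTF16_BE):
        return data[2:].decode("utf-16-be")
    return data.decode("utf-16-be")
```

One caveat with the sniffing: UTF-16 little-endian text whose first character after the BOM is U+0000 is indistinguishable from a UTF-32 little-endian BOM, so sniff_bom would report UTF-32 for it. Also note that Python's own "utf-16" codec falls back to the machine's native byte order when there is no BOM, which is why the sketch names "utf-16-be" explicitly.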
I don't understand (2), unless it's just saying you can make up
ad hoc markup, using BOMs to indicate a binary packing scheme that is
entirely orthogonal to what the packed bits might mean as an encoded
data stream. BOMs have always suggested Unicode to me, so this was a
liberating notion, intended or not ;-) In which case, why not UTF-xxz
BOMs for zlib zip-format packing, etc., where xx could be the usual,
e.g., UTF-16lez or UTF-8z? I'd bet the latter could save some bandwidth
and disk space on some non-English web sites, if browsers supported it
for UTF-8 Unicode.
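
A quick way to test that bet, as a sketch only: encode some non-ASCII text as UTF-8, run it through the standard zlib module, and compare the byte counts. The helper name utf8_sizes and the sample string are made up for illustration, and a short phrase repeated many times compresses far better than real page content would, so this shows only how to measure, not what savings to expect.

```python
import zlib


def utf8_sizes(text):
    """Return (raw_bytes, compressed_bytes) for text encoded as UTF-8."""
    raw = text.encode("utf-8")
    packed = zlib.compress(raw)
    # Round-trip check: decompressing must give back the original bytes.
    assert zlib.decompress(packed) == raw
    return len(raw), len(packed)


if __name__ == "__main__":
    # Hypothetical sample: a short Cyrillic phrase repeated 100 times.
    sample = "Привет, мир! " * 100
    raw_size, packed_size = utf8_sizes(sample)
    print(raw_size, packed_size)
```

Each Cyrillic letter takes two bytes in UTF-8, so the raw size is noticeably larger than the character count, which is where the hoped-for savings would come from.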
Actually, is there a standard for overall compressed HTML transfer already?
Or is it ignored in favor of letting lower levels do compression?
Haven't looked lately...
Regards,
Bengt Richter