Stephen J. Turnbull wrote:
"MAL" == M email@example.com writes:
MAL> The BOM (byte order mark) was a non-standard Microsoft invention to detect Unicode text data as such (MS always uses UTF-16-LE for Unicode text files).
The Japanese "memopado" (Notepad) uses UTF-8 signatures; it even adds them to existing UTF-8 files lacking them.
Is that a MS application ? AFAIK, notepad, wordpad and MS Office always use UTF-16-LE + BOM when saving text as "Unicode text".
MAL> -1; there's no standard for UTF-8 BOMs - adding it to the codecs module was probably a mistake to begin with. You usually only get UTF-8 files with BOM marks as the result of recoding UTF-16 files into UTF-8.
There is a standard for UTF-8 _signatures_, however. I don't have the most recent version of the ISO-10646 standard, but Amendment 2 (which defined UTF-8 for ISO-10646) specifically added the UTF-8 signature to Annex F of that standard. Evan quotes Version 4 of the Unicode standard, which explicitly defines the UTF-8 signature.
Ok, as signature the BOM does make some sense - whether to strip signatures from a document is a good idea or not is a different matter, though.
Here's the Unicode Cons. FAQ on the subject:
They also explicitly warn about adding BOMs to UTF-8 data since it can break applications and protocols that do not expect such a signature.
So there is a standard for the UTF-8 signature, and I know of applications which produce it. While I agree with you that Python's codecs shouldn't produce it (by default), providing an option to strip is a good idea.
However, this option should be part of the initialization of an IO stream which produces Unicodes, _not_ an operation on arbitrary internal strings (whether raw or Unicode).
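The distinction between stripping at stream initialization and stripping arbitrary strings can be sketched as follows. This is only an illustration: it assumes a current Python 3 interpreter and the standard library's `utf-8-sig` codec, neither of which is mentioned in this thread, and the sample bytes are made up.

```python
import codecs
import io

# Sample data (made up for illustration): a UTF-8 signature followed by text.
data = codecs.BOM_UTF8 + "hello".encode("utf-8")

# The plain utf-8 codec simply inverts the UTF and keeps U+FEFF as a character.
kept = data.decode("utf-8")

# Selecting utf-8-sig when the stream is set up strips one leading signature.
stream = io.TextIOWrapper(io.BytesIO(data), encoding="utf-8-sig")
stripped = stream.read()

# A string taken from the middle of a file is left untouched by plain utf-8,
# so a U+FEFF that is not at the start survives.
middle = "a\ufeffb".encode("utf-8").decode("utf-8")

print(repr(kept))      # '\ufeffhello'
print(repr(stripped))  # 'hello'
print(repr(middle))    # 'a\ufeffb'
```

The choice of codec is made once, by whoever knows that the bytes begin a file; nothing inspects or rewrites strings after they have been decoded.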
MAL> BTW, how do you know that s came from the start of a file and not from slicing some already loaded file somewhere in the middle ?
The programmer or the application might, but Python's codecs don't. The point is that this is also true of raw strings that happen to contain UTF-16 or UTF-32 data. The UTF-16 ("auto-endian") codec shouldn't strip leading BOMs either, unless it has been told it has the beginning of the string.
The UTF-16 stream codecs implement this logic.
The UTF-16 encode and decode functions will however always strip the BOM mark from the beginning of a string.
If the application doesn't want this stripping to happen, it should use the UTF-16-LE or -BE codec resp.
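A small sketch of the three behaviours just described, assuming a current Python 3 interpreter (where byte strings are written `b'...'`) and using `codecs.getincrementaldecoder` as a stand-in for the stream codecs; the sample bytes are made up.

```python
import codecs

# Little-endian UTF-16 bytes: BOM (FF FE) followed by "hi".
raw = codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")

# The auto-endian codec consumes the leading BOM to pick the byte order.
auto = raw.decode("utf-16")

# The explicit-endian codec keeps the BOM as an ordinary U+FEFF character.
explicit = raw.decode("utf-16-le")

# Stream-style decoding: only the BOM at the very start is consumed.
decoder = codecs.getincrementaldecoder("utf-16")()
first = decoder.decode(raw)
second = decoder.decode(raw)  # the same bytes again, now mid-stream

print(repr(auto))      # 'hi'
print(repr(explicit))  # '\ufeffhi'
print(repr(first))     # 'hi'
print(repr(second))    # '\ufeffhi'
```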
MAL> Evan Jones wrote:
>> This is *not* a valid Unicode character. The Unicode specification (version 4, section 15.8) says the following about non-characters:
>>
>>> Applications are free to use any of these noncharacter code points internally but should never attempt to exchange them. If a noncharacter is received in open interchange, an application is not required to interpret it in any way. It is good practice, however, to recognize it as a noncharacter and to take appropriate action, such as removing it from the text. Note that Unicode conformance freely allows the removal of these characters. (See C10 in Section 3.2, Conformance Requirements.)
>>
>> My interpretation of the specification means that Python should
The specification _permits_ silent removal; it does not recommend it.
>> silently remove the character, resulting in a zero length Unicode string. Similarly, both of the following lines should also result in a zero length Unicode string:
>>
>> >>> '\xff\xfe\xfe\xff'.decode( "utf16" )
>> u'\ufffe'
>> >>> '\xff\xfe\xff\xff'.decode( "utf16" )
>> u'\uffff'
I strongly disagree; these decisions should be left to a higher layer. In the case of specified UTFs, the codecs should simply invert the UTF to Python's internal encoding.
MAL> Hmm, wouldn't it be better to raise an error ? After all, a reversed BOM mark in the stream looks a lot like you're trying to decode a UTF-16 stream assuming the wrong byte order ?!
+1 on (optionally) raising an error.
The advantage of raising an error is that the application can deal with the situation in whatever way seems fit (by registering a special error handler or by simply using "ignore" or "replace").
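To make the error-handler options concrete, here is a minimal sketch assuming a current Python 3 interpreter. It uses an ill-formed UTF-8 byte rather than a reversed BOM, because that is a case where the codec is known to raise; the handler name `show-hex` and the function `show_hex` are hypothetical, invented for this example.

```python
import codecs

# Made-up input: the byte 0xFF can never appear in well-formed UTF-8.
bad = b"ab\xffcd"

# With the default "strict" handler the codec raises.
try:
    bad.decode("utf-8")
    raised = False
except UnicodeDecodeError:
    raised = True

# The application can instead ask for the bad byte to be dropped or replaced.
ignored = bad.decode("utf-8", "ignore")
replaced = bad.decode("utf-8", "replace")


def show_hex(exc):
    """Hypothetical handler: substitute <hex> for the undecodable bytes."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    chunk = exc.object[exc.start:exc.end]
    # Return the replacement text and the position at which to resume.
    return "<%s>" % chunk.hex(), exc.end


# Register the handler under a name, then select it like a built-in one.
codecs.register_error("show-hex", show_hex)
custom = bad.decode("utf-8", "show-hex")

print(raised)          # True
print(repr(ignored))   # 'abcd'
print(repr(replaced))  # 'ab\ufffdcd'
print(repr(custom))    # 'ab<ff>cd'
```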
I agree that much of this lies outside the scope of codecs and should be handled at an application or protocol level.
-1 on removing it or anything like that, unless under control of the application (ie, the program written in Python, not Python itself). It's far too easy for software to generate broken Unicode streams, and the choice of how to deal with those should be with the application, not with the implementation language.
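One way an application could take that decision itself, sketched under the assumption of a current Python 3 interpreter whose explicit-endian UTF-16 codecs pass U+FFFE through unchanged. The helper `decode_utf16_checked` is hypothetical and is not part of Python's codecs.

```python
def decode_utf16_checked(data, byteorder):
    """Hypothetical application-level helper, not part of Python's codecs.

    Decodes with an explicit byte order and treats a leading U+FFFE
    (a byte-swapped BOM) as a sign that the wrong order was assumed.
    """
    codec = {"little": "utf-16-le", "big": "utf-16-be"}[byteorder]
    text = data.decode(codec)
    if text.startswith("\ufffe"):
        raise ValueError("reversed BOM: data is probably not %s" % codec)
    return text


# Made-up sample: little-endian BOM followed by "a".
sample = b"\xff\xfea\x00"

ok = decode_utf16_checked(sample, "little")  # BOM kept as U+FEFF

try:
    decode_utf16_checked(sample, "big")
    rejected = False
except ValueError:
    rejected = True

print(repr(ok))   # '\ufeffa'
print(rejected)   # True
```

The codec itself only inverts the UTF; whether to raise, strip, or pass the character along is a policy expressed in the application's own code.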
Footnotes:  An egregious example was the Outlook Express distributed with early Win2k betas, which produced MIME bodies with apparent Content-Type: text/html; charset=utf-16, but the HTML tags and newlines were 7-bit ASCII!