On Apr 1, 2005, at 15:19, M.-A. Lemburg wrote:
The BOM (byte order mark) was a non-standard Microsoft invention to detect Unicode text data as such (MS always uses UTF-16-LE for Unicode text files).
Well, its origins do not really matter, since at this point the BOM is firmly encoded in the Unicode standard. It seems to me that it is in everyone's best interest to support it.
It is not needed for the UTF-8 because that format doesn't rely on the byte order and the BOM character at the beginning of a stream is a legitimate ZWNBSP (zero width non breakable space) code point.
You are correct: it is a legitimate character. However, its use as a ZWNBSP character has been deprecated:
The overloading of semantics for this code point has caused problems for programs and protocols. The new character U+2060 WORD JOINER has the same semantics in all cases as U+FEFF, except that it cannot be used as a signature. Implementers are strongly encouraged to use word joiner in those circumstances whenever word joining semantics is intended.
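To make the distinction concrete, here is a quick sketch (using modern Python syntax) showing that the two code points encode to different byte sequences, so only U+FEFF can ever appear as the UTF-8 signature:

```python
# U+FEFF doubles as BOM and (deprecated) ZWNBSP; U+2060 WORD JOINER
# carries only the word-joining semantics and never acts as a signature.
bom = "\ufeff"
word_joiner = "\u2060"

# Encoded at the start of a UTF-8 stream, U+FEFF is the UTF-8 signature.
print(bom.encode("utf-8"))          # b'\xef\xbb\xbf'
# U+2060 encodes to an unrelated byte sequence.
print(word_joiner.encode("utf-8"))  # b'\xe2\x81\xa0'
```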
Also, the Unicode specification is ambiguous on what an implementation should do about a leading ZWNBSP that is encoded in UTF-16. Like I mentioned, if you look at the Unicode standard, version 4, section 15.9, it says:
- Unmarked Character Set. In some circumstances, the character set information for a stream of coded characters (such as a file) is not available. The only information available is that the stream contains text, but the precise character set is not known.
This seems to indicate that it is permitted to strip the BOM from the beginning of UTF-8 text.
-1; there's no standard for UTF-8 BOMs - adding it to the codecs module was probably a mistake to begin with. You usually only get UTF-8 files with BOM marks as the result of recoding UTF-16 files into UTF-8.
This is clearly incorrect. The UTF-8 BOM is specified in the Unicode standard, version 4, section 15.9:
In UTF-8, the BOM corresponds to the byte sequence <EF BB BF>.
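Indeed, Python's codecs module already exposes that exact byte sequence; a quick check (modern Python syntax):

```python
import codecs

# The codecs module ships the UTF-8 BOM as a named constant.
print(codecs.BOM_UTF8)  # b'\xef\xbb\xbf'
# It is simply U+FEFF encoded in UTF-8.
print("\ufeff".encode("utf-8") == codecs.BOM_UTF8)  # True
```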
I regularly encounter files with UTF-8 BOMs produced by Windows applications when saving a text file as UTF-8. I believe Notepad and WordPad do this, for example, and I think UltraEdit does the same. I know that Scintilla definitely does.
At the very least, it would be nice to add a note about this to the documentation, and possibly add this example function that implements the "UTF-8 or ASCII?" logic.
Well, I'd say that's a very English way of dealing with encoded text ;-)
Please note I am saying only that something like this may want to be considered for addition to the documentation, and not to the Python standard library. This example function more closely replicates the logic that is used in those Windows applications when opening ".txt" files. It uses the default locale if there is no BOM:
    def autodecode(s):
        if s.startswith(codecs.BOM_UTF8):
            # The byte string s starts with the UTF-8 BOM
            out = s.decode("utf8")
            return out[1:]  # drop the decoded BOM (U+FEFF)
        else:
            return s.decode()
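As a point of comparison, newer versions of Python ship a "utf-8-sig" codec that performs exactly this stripping on decode (shown here with modern Python syntax, purely for illustration):

```python
# "utf-8-sig" consumes a leading BOM on decode; plain "utf-8" keeps it
# as a leading U+FEFF character.
data = b"\xef\xbb\xbfhello"
print(data.decode("utf-8-sig"))     # hello
print(repr(data.decode("utf-8")))   # '\ufeffhello'
```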
BTW, how do you know that s came from the start of a file and not from slicing some already loaded file somewhere in the middle ?
Well, the same argument could be applied to the UTF-16 decoder: how does it know that the string came from the start of a file, and not from slicing some already loaded file? The standard states that:
In the UTF-16 encoding scheme, U+FEFF at the very beginning of a file or stream explicitly signals the byte order.
So it is perfectly permissible to perform this type of processing if you consider a string to be equivalent to a stream.
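This is in fact what the UTF-16 codec already does when handed a string with a leading BOM; a quick sketch (modern Python syntax):

```python
import codecs

# The "utf-16" codec treats its input as the start of a stream:
# it reads the BOM to pick the byte order, then discards it.
text = "abc"
le = codecs.BOM_UTF16_LE + text.encode("utf-16-le")
be = codecs.BOM_UTF16_BE + text.encode("utf-16-be")
print(le.decode("utf-16"))  # abc
print(be.decode("utf-16"))  # abc
```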
My interpretation of the specification is that Python should silently remove the character; decoding a string containing only a BOM would then result in a zero-length Unicode string.
Hmm, wouldn't it be better to raise an error ? After all, a reversed BOM mark in the stream looks a lot like you're trying to decode a UTF-16 stream assuming the wrong byte order ?!
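The failure mode being described is easy to reproduce; a sketch (modern Python syntax): decoding a big-endian stream while assuming little-endian turns the BOM into the noncharacter U+FFFE.

```python
# A big-endian BOM, read with the wrong byte order, comes out as the
# byte-swapped noncharacter U+FFFE rather than U+FEFF.
data = "\ufeff".encode("utf-16-be")    # b'\xfe\xff'
print(repr(data.decode("utf-16-le")))  # '\ufffe'
```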
Well, either one is possible, however the Unicode standard suggests, but does not require, silently removing them:
It is good practice, however, to recognize it as a noncharacter and to take appropriate action, such as removing it from the text. Note that Unicode conformance freely allows the removal of these characters.
I would prefer silently removing them in str.decode(), since I believe in "be strict in what you emit, but liberal in what you accept." I think that this should apply only to str.decode(). Any other attempt to create noncharacters, such as unichr( 0xffff ), *should* raise an exception, because clearly the programmer is making a mistake.
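A sketch of the kind of check an application could apply itself (the helper name is my own invention, not a stdlib function): Unicode defines the noncharacters as U+FDD0..U+FDEF plus the last two code points of every plane (U+xxFFFE and U+xxFFFF).

```python
def is_noncharacter(cp):
    # Hypothetical helper, for illustration only: True for the
    # noncharacter code points defined by the Unicode standard.
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE

print(is_noncharacter(0xFFFF))  # True
print(is_noncharacter(0x0041))  # False
```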
Other than that: +1 on fixing this case.