<br><br><div><span class="gmail_quote">On 9/13/06, <b class="gmail_sendername">John S. Yates, Jr.</b> <<a href="mailto:john@yates-sheets.org">john@yates-sheets.org</a>> wrote:</span><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">
On Mon, 11 Sep 2006 18:16:15 -0700, "Paul Prescod" wrote:<br><br>> UTF-8 with BOM is the Microsoft preferred format.<br><br>It is a mistake on Microsoft's part to fail to strip the BOM<br>during conversion to UTF-8. There is no MEANINGFUL definition
<br>of BOM in a UTF-8 string. </blockquote><div><br>That is not true. <br><br>Q: Can a UTF-8 data stream contain the BOM
character (in UTF-8 form)? If yes, then can I still assume the remaining
UTF-8 bytes are in big-endian order?<br>
A: Yes, UTF-8 can contain a BOM. However, it makes <i>no</i>
difference as to the endianness of the byte stream. UTF-8 always has the
same byte order. An initial BOM is <i>only</i> used as a signature — an
indication that an otherwise unmarked text file is in UTF-8.<br><br>This is a very valuable function, and applications like Microsoft's Notepad, Apple's TextEdit and Vim take good advantage of it. <br> <br>"""
<br><pre>Vim will try to detect what kind of file you are editing. It uses the<br>encoding names in the <a href="http://www.vim.org/htmldoc/options.html#%27fileencodings%27">'fileencodings'</a> option. When using <a href="http://www.vim.org/htmldoc/mbyte.html#Unicode">
Unicode</a>, the default<br>value is: "ucs-bom,utf-8,latin1". This means that Vim checks the file to see<br>if it's one of these encodings:<br><br>        ucs-bom                File must start with a Byte Order <a href="http://www.vim.org/htmldoc/motion.html#Mark">
Mark</a> (BOM). This<br>                        allows detection of 16-bit, 32-bit and <a href="http://www.vim.org/htmldoc/mbyte.html#utf-8">utf-8</a> <a href="http://www.vim.org/htmldoc/mbyte.html#Unicode">Unicode</a><br>                        encodings.<br>        <a href="http://www.vim.org/htmldoc/mbyte.html#utf-8">
utf-8</a>                <a href="http://www.vim.org/htmldoc/mbyte.html#utf-8">utf-8</a> <a href="http://www.vim.org/htmldoc/mbyte.html#Unicode">Unicode</a>. This is rejected when a sequence of<br>                        bytes is illegal in <a href="http://www.vim.org/htmldoc/mbyte.html#utf-8">
utf-8</a>.<br><br>        latin1                The good old 8-bit encoding.</pre><pre>"""</pre>
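The detection order Vim describes above could be sketched in Python roughly as follows. This is only an illustration under my own assumptions: the function name <code>guess_decode</code> and the labels it returns are hypothetical, not an existing Python API. It checks for a BOM signature first, then tries strict UTF-8, and finally falls back to Latin-1.<br>

```python
import codecs

# BOM signatures mapped to (codec for the payload, label to report).
# UTF-32 LE must be tested before UTF-16 LE because its BOM (FF FE 00 00)
# begins with the UTF-16 LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le", "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be", "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8", "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16-le", "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be", "utf-16-be"),
]


def guess_decode(data: bytes):
    """Hypothetical helper: return (text, label) for raw file bytes."""
    # Step 1 ("ucs-bom"): a leading BOM acts as an encoding signature.
    for bom, codec, label in _BOMS:
        if data.startswith(bom):
            try:
                # Strip the signature, then decode the remainder strictly.
                return data[len(bom):].decode(codec), label
            except UnicodeDecodeError:
                break  # signature was misleading; try the next guesses
    # Step 2 ("utf-8"): rejected when any byte sequence is illegal.
    try:
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        pass
    # Step 3 ("latin1"): every byte value is valid, so this cannot fail.
    return data.decode("latin-1"), "latin-1"
```

For example, <code>guess_decode(b"\xef\xbb\xbfhi")</code> strips the three-byte UTF-8 signature and reports the BOM-marked variant, while a lone <code>b"\xe9"</code> is illegal as UTF-8 and falls through to Latin-1.<br><br>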
I'm pretty much proposing this same algorithm for Python's encoding guessing.<br><br></div> Paul Prescod<br><br></div>