<br><br><div><span class="gmail_quote">On 9/13/06, <b class="gmail_sendername">John S. Yates, Jr.</b> &lt;<a href="mailto:john@yates-sheets.org">john@yates-sheets.org</a>&gt; wrote:</span><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">

On Mon, 11 Sep 2006 18:16:15 -0700, &quot;Paul Prescod&quot; wrote:<br><br>&gt; UTF-8 with BOM is the Microsoft preferred format.<br><br>It is a mistake on Microsoft's part to fail to strip the BOM<br>during conversion to UTF-8.&nbsp;&nbsp;There is no MEANINGFUL definition

<br>of BOM in a UTF-8 string.&nbsp;&nbsp;</blockquote><div><br>That is not true. The Unicode FAQ addresses this directly:<br><br>Q: Can a UTF-8 data stream contain the BOM

        character (in UTF-8 form)? If yes, then can I still assume the remaining    

        UTF-8 bytes are in big-endian order?   

        A: Yes, UTF-8 can contain a BOM. However, it makes <i>no</i>  

        difference as to the endianness of the byte stream. UTF-8 always has the  

        same byte order. An initial BOM is <i>only</i> used as a signature — an  

        indication that an otherwise unmarked text file is in UTF-8.<br><br>This is a very valuable function, and applications like Microsoft's Notepad, Apple's TextEdit and Vim take good advantage of it. The Vim documentation describes its detection order like this:<br>&nbsp;<br>&quot;&quot;&quot;

<br><pre>Vim will try to detect what kind of file you are editing.  It uses the<br>encoding names in the <a href="http://www.vim.org/htmldoc/options.html#%27fileencodings%27">'fileencodings'</a> option.  When using <a href="http://www.vim.org/htmldoc/mbyte.html#Unicode">

Unicode</a>, the default<br>value is: &quot;ucs-bom,utf-8,latin1&quot;.  This means that Vim checks the file to see<br>if it's one of these encodings:<br><br>        ucs-bom                File must start with a Byte Order <a href="http://www.vim.org/htmldoc/motion.html#Mark">

Mark</a> (BOM).  This<br>                        allows detection of 16-bit, 32-bit and <a href="http://www.vim.org/htmldoc/mbyte.html#utf-8">utf-8</a> <a href="http://www.vim.org/htmldoc/mbyte.html#Unicode">Unicode</a><br>                        encodings.<br>        <a href="http://www.vim.org/htmldoc/mbyte.html#utf-8">

utf-8</a>                <a href="http://www.vim.org/htmldoc/mbyte.html#utf-8">utf-8</a> <a href="http://www.vim.org/htmldoc/mbyte.html#Unicode">Unicode</a>.  This is rejected when a sequence of<br>                        bytes is illegal in <a href="http://www.vim.org/htmldoc/mbyte.html#utf-8">

utf-8</a>.<br><br>        latin1                The good old 8-bit encoding.</pre><pre>&quot;&quot;&quot;</pre>
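<br>As a rough sketch only, the fallback order quoted above (BOM first, then strict UTF-8, then latin1) might look something like this in Python. The function name guess_decode and the BOMS table are made up for illustration and are not part of any existing or proposed Python API; the codecs.BOM_* constants are from the standard library:<br>
<pre>
```python
import codecs

# Hypothetical lookup table. Longer signatures come first because the
# UTF-32 LE BOM begins with the same two bytes as the UTF-16 LE BOM.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def guess_decode(data):
    """Return (text, encoding_name) for a bytes object (illustrative only)."""
    # Step 1, "ucs-bom": a leading BOM acts as a signature naming the
    # encoding. Strip it before decoding so it does not end up in the text.
    for bom, name in BOMS:
        if data.startswith(bom):
            return data[len(bom):].decode(name), name
    # Step 2, "utf-8": accept only if no byte sequence is illegal in UTF-8.
    try:
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        pass
    # Step 3, "latin1": every byte value maps to a character, so this
    # last resort cannot fail.
    return data.decode("latin-1"), "latin-1"
```
</pre>
With this sketch, bytes that start with the UTF-8 signature are reported as utf-8 with the BOM stripped, while a lone 0xE9 byte is not legal UTF-8 and so falls through to latin-1. One simplification to note: malformed data following a BOM raises an error here rather than falling through to the next candidate.<br><br>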

I'm pretty much proposing this same algorithm for Python's encoding guessing.<br><br></div>&nbsp;Paul Prescod<br><br></div>