[Python-3000] Pre-PEP: Easy Text File Decoding

Paul Prescod paul at prescod.net
Wed Sep 13 18:44:18 CEST 2006


On 9/13/06, John S. Yates, Jr. <john at yates-sheets.org> wrote:
>
> On Mon, 11 Sep 2006 18:16:15 -0700, "Paul Prescod" wrote:
>
> > UTF-8 with BOM is the Microsoft preferred format.
>
> It is a mistake on Microsoft's part to fail to strip the BOM
> during conversion to UTF-8.  There is no MEANINGFUL definition
> of BOM in a UTF-8 string.


That is not true. From the Unicode FAQ:

Q: Can a UTF-8 data stream contain the BOM character (in UTF-8 form)? If
yes, then can I still assume the remaining UTF-8 bytes are in big-endian
order?

A: Yes, UTF-8 can contain a BOM. However, it makes *no* difference as
to the endianness of the byte stream. UTF-8 always has the same byte order.
An initial BOM is *only* used as a signature — an indication that an
otherwise unmarked text file is in UTF-8.

This is a very valuable function, and applications like Microsoft's Notepad,
Apple's TextEdit and Vim take good advantage of it.
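
To make the "signature, not byte order" point concrete, here is a small
sketch using only the standard codecs module. The variable names are just for
illustration. U+FEFF encodes to the same three bytes in UTF-8 regardless of
platform, while the UTF-16 forms differ by byte order. The "utf-8-sig" codec
strips a leading signature on decode, whereas plain "utf-8" leaves it in the
text as U+FEFF.

```python
import codecs

bom = "\ufeff"

# UTF-8 has a single byte order, so the BOM is always the same three bytes.
utf8_bytes = bom.encode("utf-8")

# In UTF-16 the same character serializes differently per byte order.
utf16_le = bom.encode("utf-16-le")
utf16_be = bom.encode("utf-16-be")

# A UTF-8 file that starts with the signature.
data = codecs.BOM_UTF8 + "h\u00e9llo".encode("utf-8")

# "utf-8-sig" drops the signature; plain "utf-8" keeps it as U+FEFF.
text_without_sig = data.decode("utf-8-sig")
text_with_sig = data.decode("utf-8")
```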

"""

Vim will try to detect what kind of file you are editing.  It uses the
encoding names in the 'fileencodings' option
<http://www.vim.org/htmldoc/options.html#%27fileencodings%27>.
When using Unicode, the default value is: "ucs-bom,utf-8,latin1".  This means
that Vim checks the file to see if it's one of these encodings:

	ucs-bom		File must start with a Byte Order Mark (BOM).  This
			allows detection of 16-bit, 32-bit and utf-8 Unicode
			encodings.
	utf-8		utf-8 Unicode.  This is rejected when a sequence of
			bytes is illegal in utf-8.
	latin1		The good old 8-bit encoding.

"""

I'm pretty much proposing this same algorithm for Python's encoding
guessing.
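
A rough sketch of that three-step guess, assuming the whole file is already
in memory as bytes. The function name guess_decode and the _BOMS table are
made up for illustration and are not a proposed API. It checks for a BOM
first, then tries strict UTF-8, then falls back to Latin-1 (which accepts any
byte sequence).

```python
import codecs

# Longer BOMs come first: the UTF-32 LE BOM begins with the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def guess_decode(data):
    """Return (text, encoding_name) for a bytes object (hypothetical helper)."""
    # Step 1, "ucs-bom": a leading BOM names the encoding; strip it.
    for bom, name in _BOMS:
        if data.startswith(bom):
            return data[len(bom):].decode(name), name
    # Step 2, "utf-8": accept only if every byte sequence is legal UTF-8.
    try:
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Step 3, "latin1": every byte maps to a character, so this cannot fail.
        return data.decode("latin-1"), "latin-1"
```

Note that a file which starts with a BOM but is otherwise malformed raises
UnicodeDecodeError here instead of falling through to the later steps. Whether
that is the right behaviour is a design choice this sketch does not settle.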

 Paul Prescod