[Python-3000] Pre-PEP: Easy Text File Decoding
paul at prescod.net
Wed Sep 13 18:44:18 CEST 2006
On 9/13/06, John S. Yates, Jr. <john at yates-sheets.org> wrote:
> On Mon, 11 Sep 2006 18:16:15 -0700, "Paul Prescod" wrote:
> > UTF-8 with BOM is the Microsoft preferred format.
> It is a mistake on Microsoft's part to fail to strip the BOM
> during conversion to UTF-8. There is no MEANINGFUL definition
> of BOM in a UTF-8 string.
That is not true.
Q: Can a UTF-8 data stream contain the BOM character (in UTF-8 form)? If
yes, then can I still assume the remaining UTF-8 bytes are in big-endian
order?

A: Yes, UTF-8 can contain a BOM. However, it makes *no* difference as
to the endianness of the byte stream. UTF-8 always has the same byte order.
An initial BOM is *only* used as a signature — an indication that an
otherwise unmarked text file is in UTF-8.
This is a very valuable function and applications like Microsoft's Notepad,
Apple's TextEdit and VIM take good advantage of it.
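
To make the signature role concrete, here is a small sketch using Python's standard codecs module. The variable names are illustrative only. It relies on the "utf-8-sig" codec, which consumes a leading BOM on decoding and writes one on encoding, while the plain "utf-8" codec leaves the BOM in the text as U+FEFF.

```python
import codecs

# The UTF-8 signature is always the same three bytes; there is no
# byte-order variant as there is for UTF-16 or UTF-32.
assert codecs.BOM_UTF8 == b"\xef\xbb\xbf"

# A file saved as "UTF-8 with BOM" (illustrative data).
raw = codecs.BOM_UTF8 + "hello".encode("utf-8")

# Plain utf-8 keeps the BOM as a U+FEFF character at the start of the text.
plain = raw.decode("utf-8")

# utf-8-sig treats the BOM as a signature and strips it.
stripped = raw.decode("utf-8-sig")

# Encoding with utf-8-sig writes the signature back.
round_trip = "hello".encode("utf-8-sig")
```

A reader that only needs to tolerate the signature can therefore open files with "utf-8-sig" instead of "utf-8"; input without a BOM decodes the same way under both.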
Vim will try to detect what kind of file you are editing. It uses the
encoding names in the 'fileencodings' option. When using Unicode
<http://www.vim.org/htmldoc/mbyte.html#Unicode>, the default value is:
"ucs-bom,utf-8,latin1". This means that Vim checks the file to see if it's
one of these encodings:

    ucs-bom  File must start with a Byte Order Mark (BOM). This allows
             detection of 16-bit, 32-bit and utf-8 Unicode encodings.
    utf-8    utf-8 Unicode. This is rejected when a sequence of bytes is
             illegal in utf-8.
    latin1   The good old 8-bit encoding.
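
The fallback chain quoted above can be sketched in Python roughly as follows. This is only an illustration of the "ucs-bom,utf-8,latin1" order, not Vim's implementation and not a proposed API; the function name guess_decode and the table _BOMS are hypothetical.

```python
import codecs

# Hypothetical table of signatures. UTF-32 is checked before UTF-16
# because the UTF-32 LE BOM begins with the same bytes as the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def guess_decode(data):
    """Return (text, encoding_name) for a bytes object.

    Order of attempts, mirroring "ucs-bom,utf-8,latin1":
      1. a leading BOM selects the encoding and is stripped;
      2. otherwise strict UTF-8 is tried;
      3. otherwise latin-1, which accepts any byte sequence.
    """
    for bom, name in _BOMS:
        if data.startswith(bom):
            # A BOM followed by bytes invalid in that encoding raises
            # UnicodeDecodeError; this sketch does not fall back further.
            return data[len(bom):].decode(name), name
    try:
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        return data.decode("latin-1"), "latin-1"
```

For example, guess_decode(b"caf\xe9") falls through to latin-1 because the lone byte 0xE9 is not valid UTF-8, while the same word encoded as UTF-8 is accepted at the second step.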
I'm pretty much proposing this same algorithm for Python's encoding