Eclipse/PyDev - BOM Lexical Error

Terry Reedy tjreedy at udel.edu
Tue Oct 5 16:40:36 EDT 2010


On 10/5/2010 5:13 AM, TheOne wrote:

>> It's an MS-specific thing that makes a file identifiable as
>> UTF-8-encoded under Windows. The name BOM is obviously BS, but it's the
>> way it is.

> I didn't know that it's a MS-thing. (Is it really?)

Yes, who else would 'customize' an international standard by corrupting 
it when adopting it? Sort of like animals pissing on things to mark them 
as theirs.

Here is the relevant part of
https://secure.wikimedia.org/wikipedia/en/wiki/UTF-8
"Byte order mark

Many Windows programs (including Windows Notepad) add the bytes 0xEF, 
0xBB, 0xBF at the start of any document saved as UTF-8. This is the 
UTF-8 encoding of the Unicode byte order mark (BOM), and is commonly 
referred to as a UTF-8 BOM even though it is not relevant to byte order. 
The BOM can also appear if another encoding with a BOM is translated to 
UTF-8 without stripping it.

The presence of the UTF-8 BOM may cause interoperability problems with 
existing software that could otherwise handle UTF-8, for example:

     * Older text editors may display the BOM as "ï»¿" at the start of 
the document, even if the UTF-8 file contains only ASCII and would 
otherwise display correctly.
     * Programming language parsers not explicitly designed for UTF-8 
can often handle UTF-8 in string constants and comments, but cannot 
parse the BOM at the start of the file.
     * Programs that identify file types by leading characters may fail 
to identify the file if a BOM is present even if the user of the file 
could skip the BOM. Or conversely they will identify the file when the 
user cannot handle the BOM. An example is the Unix shebang syntax.
     * Programs that insert information at the start of a file will 
result in a file with the BOM somewhere in the middle of it (this is 
also a problem with the UTF-16 BOM). One example is offline browsers 
that add the originating URL to the start of the file.

If compatibility with existing programs is not important, the BOM could 
be used to identify if a file is UTF-8 versus a legacy encoding, but 
this is still problematic due to many instances where the BOM is added 
or removed without actually changing the encoding, or various encodings 
are concatenated together. Checking if the text is valid UTF-8 is more 
reliable than using BOM.
"
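
To make that last point concrete, here is a minimal sketch of the two 
checks the article contrasts: looking for the three BOM bytes versus 
testing whether the data decodes as UTF-8. The function names are my own 
for illustration; only codecs.BOM_UTF8 and bytes.decode come from the 
standard library.

```python
import codecs


def starts_with_utf8_bom(data: bytes) -> bool:
    """Return True if data begins with the bytes 0xEF, 0xBB, 0xBF."""
    return data.startswith(codecs.BOM_UTF8)


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as UTF-8 without error."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True
```

A file can pass the second check without having a BOM, which is why the 
BOM by itself says little about the encoding. Note also that plain ASCII 
passes the second check, so validity does not rule out a legacy encoding 
that happens to use only ASCII bytes.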

> Anyway, it would be great if I could make my Eclipse/PyDev
> understand the BOM character and suppress the lexical error msg.

It IS an error for *decoded* unicode strings to contain the BOM 
'character'. BOM is only intended for use in multibyte transfer 
*encodings*. Its very illegality within text is what makes it useful for 
its purpose. Eclipse understands that, hence

Eclipse/Pydev reports lexical error :
 >> >   Lexical error at line 1, column 1. Encountered: "\ufeff" (65279),
 >> > after : ""

Python deals with this by having separate standard utf_8 and utf_8_sig 
(signature) codecs for encoding and decoding:

 >>> bom = bytes((0xEF, 0xBB, 0xBF))
 >>> bom.decode('utf_8')
'\ufeff'
 >>> bom.decode('utf_8_sig')
''

So if you insist on mal-forming your files, you need to tell 
eclipse/pydev to use the equivalent of the utf_8_sig codec.
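
The other way out is to remove the BOM from the files themselves. A 
minimal sketch, assuming the files really are UTF-8 and are small enough 
to read into memory; strip_utf8_bom and strip_bom_from_file are 
hypothetical helper names, not part of Python or PyDev:

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM, if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


def strip_bom_from_file(path) -> bool:
    """Rewrite the file at path without its BOM; return True if changed."""
    with open(path, "rb") as f:
        data = f.read()
    cleaned = strip_utf8_bom(data)
    if cleaned == data:
        return False  # no BOM found, leave the file untouched
    with open(path, "wb") as f:
        f.write(cleaned)
    return True
```

This works on raw bytes, so nothing else in the file is re-encoded. An 
editor that is set to save with a BOM will put it back on the next save.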

-- 
Terry Jan Reedy




