Python unicode and Windows cmd.exe

Alf P. Steinbach alfps at start.no
Sun Mar 14 20:37:04 EDT 2010


* Mark Tolonen:
> 
> "Terry Reedy" <tjreedy at udel.edu> wrote in message 
> news:hnjkuo$n16$1 at dough.gmane.org...
> On 3/14/2010 4:40 PM, Guillermo wrote:
>> Adding the byte that some call a 'utf-8 bom' makes the file an invalid 
>> utf-8 file.
> 
> Not true.  From http://unicode.org/faq/utf_bom.html:
> 
> Q: When a BOM is used, is it only in 16-bit Unicode text?
> A: No, a BOM can be used as a signature no matter how the Unicode text 
> is transformed: UTF-16, UTF-8, UTF-7, etc. The exact bytes comprising 
> the BOM will be whatever the Unicode character FEFF is converted into by 
> that transformation format. In that form, the BOM serves to indicate 
> both that it is a Unicode file, and which of the formats it is in. 
> Examples:
> Bytes          Encoding Form
> 00 00 FE FF    UTF-32, big-endian
> FF FE 00 00    UTF-32, little-endian
> FE FF          UTF-16, big-endian
> FF FE          UTF-16, little-endian
> EF BB BF       UTF-8

Well, technically you're right, and Terry was wrong to say "There is no such 
thing as a utf-8 'byte order mark'. The concept is an oxymoron.". It's true that 
as a descriptive term "byte order mark" is an oxymoron for UTF-8, which has no 
byte order to mark. But in this particular context it's a name, not a 
description, and the UTF-8 BOM is not only technically allowed, as you point 
out, but sometimes required.
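
To make the FAQ's point concrete, here is a minimal sketch, assuming Python 3 
(the helper name bom_bytes is just made up for this illustration). It encodes 
the single character U+FEFF under each transformation format and shows the 
resulting bytes, which should match the table quoted above.

```python
import codecs

def bom_bytes(encoding):
    """Return the bytes of U+FEFF in the given encoding, as spaced hex."""
    data = "\ufeff".encode(encoding)
    return " ".join("%02X" % b for b in data)

# The explicit -be / -le codec names are used so that Python does not
# prepend a BOM of its own in front of the encoded character.
for name in ("utf-32-be", "utf-32-le", "utf-16-be", "utf-16-le", "utf-8"):
    print("%-12s %s" % (name, bom_bytes(name)))

# The standard library exposes the same byte sequences as constants.
assert "\ufeff".encode("utf-8") == codecs.BOM_UTF8
```

So the UTF-8 "BOM" is nothing more exotic than U+FEFF run through the UTF-8 
encoder like any other character.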

However, some tools are unable to process UTF-8 files that start with a BOM.

The most annoying example is the GCC compiler suite, in particular g++, which in 
its Windows MinGW manifestation insists on UTF-8 source code without a BOM, 
while Microsoft's compiler needs the BOM to recognize the file as UTF-8. The 
only way I found to satisfy both compilers is to preprocess the source code. 
The alternatives are a restriction to ASCII, or perhaps to Windows ANSI with 
wide character literals restricted to ASCII (exploiting a bug in g++ that lets 
it handle narrow character literals with non-ASCII chars). But preprocessing is 
not a general solution, since the g++ preprocessor, via another bug, accepts 
some constructs (which then compile nicely) that the compiler doesn't accept 
when explicit preprocessing isn't used. So it's a mess.
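
For tools that disagree about the BOM, adding or stripping it at the byte level 
is at least easy to script. The following is only a sketch, assuming Python 3; 
the function names are invented for this example, and I'm not claiming it 
solves the two-compiler problem described above, only that it converts between 
the two file forms.

```python
import codecs

def strip_utf8_bom(data):
    """Return data without a leading UTF-8 BOM (unchanged if none present)."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data

def add_utf8_bom(data):
    """Return data with exactly one leading UTF-8 BOM."""
    if data.startswith(codecs.BOM_UTF8):
        return data
    return codecs.BOM_UTF8 + data

source = "int main() { return 0; }\n".encode("utf-8")
with_bom = add_utf8_bom(source)       # form for tools that want the signature
without_bom = strip_utf8_bom(with_bom)  # form for tools that reject it

# When decoding to text, the "utf-8-sig" codec drops a leading BOM if present
# and otherwise behaves like plain "utf-8".
text = with_bom.decode("utf-8-sig")
```

Both functions are idempotent, so they can be applied to a file without first 
checking which form it is in.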


Cheers,

- Alf


