writing \feff at the beginning of a file

Thomas Jollans thomas at jollybox.de
Sat Aug 14 05:27:58 EDT 2010


On Saturday 14 August 2010, it occurred to Steven D'Aprano to exclaim:
> On Fri, 13 Aug 2010 18:25:46 -0400, Terry Reedy wrote:
> > A short background to MRAB's answer which I will try to get right.
> > 
> > The byte-order-mark was invented for UTF-16 encodings so the reader
> > could determine whether the pairs of bytes are in little or big endian
> > order, depending on whether the first two bytes are fe and ff or ff and
> > fe (or maybe vice versa, does not matter here). The concept is
> > meaningless for utf-8 which consists only of bytes in a defined order.
> > This is part of the Unicode standard.
> > 
> > However, Microsoft (or whoever) re-purposed (hijacked) that pair of
> > bytes to serve as a non-standard indicator of utf-8 versus any
> > non-unicode encoding. The result is a corrupted utf-8 stream that python
> > accommodates with the utf-8-sig(nature) codec (versus the standard utf-8
> > codec).
> 
> Is there a standard way to autodetect the encoding of a text file? I do
> this:

No, there is no way to autodetect the encoding of a text file. 

> Open the file in binary mode; if the first three bytes are
> codecs.BOM_UTF8, then it's a Microsoft UTF-8 text file; otherwise if the
> first two bytes are codecs.BOM_BE or codecs.BOM_LE, the encoding is
> utf-16-be or utf-16-le respectively.

Unless the file happens to be UCS-2/UTF-16, or it happens to be a UTF-8 file
with garbage at the top.
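
For reference, the BOM check described above could be sketched roughly like
this. The function name sniff_bom and its return shape are my own invention,
not anything from the standard library, and a match is only a hint, since the
same bytes can just as well open a file in some other encoding:

```python
import codecs

# The three marks mentioned above, paired with the codec to try.
# codecs.BOM_BE and codecs.BOM_LE are the UTF-16 byte order marks.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_BE, "utf-16-be"),
    (codecs.BOM_LE, "utf-16-le"),
]


def sniff_bom(head):
    """Return (encoding, bom_length) if head starts with a known BOM,
    otherwise (None, 0). head is the first few bytes of the file."""
    for bom, encoding in _BOMS:
        if head.startswith(bom):
            return encoding, len(bom)
    return None, 0
```

The caller would read the file in binary mode, skip bom_length bytes and
decode the rest with the returned encoding, falling back to something else
when the result is (None, 0).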

> If there's no BOM, then re-open the file and read the first two lines. If
> either of them match this regex 'coding[=:]\s*([-\w.]+)' then I take the
> encoding name from that. This matches Python's behaviour, and supports
> EMACS and vi encoding declarations.

This is a completely different method, and probably the most common in real 
usage:
 1. Assume the file is ASCII (or some similar code page), but be liberal about
    characters you don't recognize.
 2. Know the file format you're reading.
 3. Switch encoding once you have reached an indication of which exact
    character set to use:
      - For Python, use the coding cookie if it's there.
      - For XML, read the <?xml ... ?> declaration.
      - For HTML, look for a <meta http-equiv='Content-Type' ...> tag, or
        just guess.
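
A minimal sketch of those three steps, covering only the Python and XML cases.
The function name declared_encoding, the file_format argument and the XML
pattern are assumptions of mine; the cookie pattern is the one quoted above.
The lines are decoded as Latin-1 because that maps every byte to some
character, which is one way of being liberal about bytes you don't recognize:

```python
import re

# The cookie pattern quoted earlier in this thread.
COOKIE_RE = re.compile(r"coding[=:]\s*([-\w.]+)", re.ASCII)
# A simplified pattern for the encoding attribute of an XML declaration.
XML_DECL_RE = re.compile(
    r"""<\?xml[^>]*encoding\s*=\s*["']([-\w.]+)["']""", re.ASCII
)


def declared_encoding(head, file_format):
    """Return the encoding named in the first bytes of a file, or None.

    head is a bytes object; file_format is "python" or "xml".
    """
    if file_format == "python":
        # Only the first two lines are searched for the coding cookie.
        for line in head.splitlines()[:2]:
            match = COOKIE_RE.search(line.decode("latin-1"))
            if match:
                return match.group(1)
    elif file_format == "xml":
        match = XML_DECL_RE.match(head.decode("latin-1").lstrip())
        if match:
            return match.group(1)
    return None
```

Once a name comes back, you would re-decode the whole file with it; when None
comes back you are in the guessing situation.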

If no encoding is specified in a way you recognize, then you're out of luck. 
You'd usually just guess. (Better still, you'd know what encoding you're
dealing with in the first place, but that's too much to ask, I suppose...)
You can try to take an educated guess by cross-referencing character 
frequencies with tables for known encoding/language combinations. I think this 
is what Microsoft IE does when it encounters a web page of unspecified 
encoding.
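
A very crude sketch of that kind of guessing, to show the shape of the idea
only. Real detectors use proper per-language frequency tables; here the
"table" is just a set of non-ASCII characters you expect to see in the text,
and guess_encoding, candidates and expected_chars are names I made up:

```python
def guess_encoding(data, candidates, expected_chars):
    """Return the candidate encoding whose decoding of data looks most
    plausible, or None if no candidate can decode it at all.

    Plausibility is the fraction of non-ASCII characters in the decoded
    text that appear in expected_chars. Ties go to the earlier candidate.
    """
    best, best_score = None, -1.0
    for encoding in candidates:
        try:
            text = data.decode(encoding)
        except UnicodeDecodeError:
            continue  # this candidate cannot be right
        non_ascii = [ch for ch in text if ord(ch) > 127]
        if non_ascii:
            hits = sum(1 for ch in non_ascii if ch in expected_chars)
            score = hits / len(non_ascii)
        else:
            score = 0.0  # pure ASCII tells us nothing
        if score > best_score:
            best, best_score = encoding, score
    return best
```

For German text you might call it with candidates such as utf-8, latin-1 and
koi8-r and the umlauts plus the sharp s as expected_chars. It is still a
guess, and short inputs will fool it easily.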



