Writing \feff at the beginning of a file
Thomas Jollans
thomas at jollybox.de
Sat Aug 14 05:27:58 EDT 2010
On Saturday 14 August 2010, it occurred to Steven D'Aprano to exclaim:
> On Fri, 13 Aug 2010 18:25:46 -0400, Terry Reedy wrote:
> > A short background to MRAB's answer which I will try to get right.
> >
> > The byte-order mark was invented for UTF-16 encodings so the reader
> > could determine whether the pairs of bytes are in little- or big-endian
> > order, depending on whether the first two bytes are fe and ff or ff and
> > fe (or maybe vice versa; it does not matter here). The concept is
> > meaningless for UTF-8, which consists only of bytes in a defined order.
> > This is part of the Unicode standard.
> >
> > However, Microsoft (or whoever) re-purposed (hijacked) that pair of
> > bytes to serve as a non-standard indicator of utf-8 versus any
> > non-unicode encoding. The result is a corrupted utf-8 stream that python
> > accommodates with the utf-8-sig(nature) codec (versus the standard utf-8
> > codec).
>
> Is there a standard way to autodetect the encoding of a text file? I do
> this:
No, there is no way to autodetect the encoding of a text file.
> Open the file in binary mode; if the first three bytes are
> codecs.BOM_UTF8, then it's a Microsoft UTF-8 text file; otherwise if the
> first two bytes are codecs.BOM_BE or codecs.BOM_LE, the encoding is
> utf-16-be or utf-16-le respectively.
Unless the file happens to be UCS-2/UTF-16, or it happens to be a UTF-8 file
with garbage at the top.
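For what it's worth, the BOM-sniffing approach Steven describes can be
sketched like this (a sketch only; as noted above, a missing BOM tells you
nothing, and a file that merely starts with those bytes will be misdetected):

```python
import codecs

def sniff_bom(path):
    """Return a likely encoding based on a leading byte-order mark.

    Returns None when no known BOM is present -- which does NOT mean
    the file is ASCII; it just means this check is inconclusive.
    """
    with open(path, 'rb') as f:
        head = f.read(4)
    if head.startswith(codecs.BOM_UTF8):
        # Microsoft-style UTF-8 "signature"; utf-8-sig strips it on read
        return 'utf-8-sig'
    if head.startswith(codecs.BOM_UTF16_LE):
        return 'utf-16-le'
    if head.startswith(codecs.BOM_UTF16_BE):
        return 'utf-16-be'
    return None
```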
> If there's no BOM, then re-open the file and read the first two lines. If
> either of them match this regex 'coding[=:]\s*([-\w.]+)' then I take the
> encoding name from that. This matches Python's behaviour, and supports
> EMACS and vi encoding declarations.
This is a completely different method, and probably the most common in real
usage:
1. Assume the file is ASCII (or some similar code page), but be liberal about
characters you don't recognize
2. Know the file format you're reading.
3. Switch encoding once you have reached an indication of which exact
character set to use.
   For Python, use the coding cookie if it's there.
   For XML, read the <?xml ... ?> declaration.
   For HTML, look for a <meta http-equiv='Content-Type' ...> tag, or just
   guess.
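The Python case is the easiest of the three; the coding-cookie check from
Steven's post, using the same regex, might look like this (checking only
the first two lines, as Python itself does):

```python
import re

# The regex Steven quoted, which matches both Emacs- and vi-style
# encoding declarations, e.g. "# -*- coding: latin-1 -*-"
CODING_RE = re.compile(r'coding[=:]\s*([-\w.]+)')

def find_coding_cookie(path):
    """Return the declared encoding name, or None if there is no cookie."""
    with open(path, 'rb') as f:
        for _ in range(2):
            # Decode permissively: we only need the ASCII cookie itself
            line = f.readline().decode('ascii', 'replace')
            m = CODING_RE.search(line)
            if m:
                return m.group(1)
    return None
```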
If no encoding is specified in a way you recognize, then you're out of luck.
You'd usually just guess. (better still, you'd know what encoding you're
dealing with in the first place, but that's too much to ask, I suppose...)
You can try to take an educated guess by cross-referencing character
frequencies with tables for known encoding/language combinations. I think this
is what Microsoft IE does when it encounters a web page of unspecified
encoding.
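Libraries exist that do that kind of statistical guessing for you. A much
cruder stand-in (not real frequency analysis, just an assumption that one
of a few likely encodings applies) is to try candidates in order and take
the first that decodes cleanly:

```python
def guess_decode(data, candidates=('utf-8', 'cp1252', 'latin-1')):
    """Decode bytes by trying candidate encodings in order.

    Returns (text, encoding).  This is a guess, not a detection:
    latin-1 maps every byte value, so it always "succeeds" as a
    last resort even when it is wrong.
    """
    for enc in candidates:
        try:
            return data.decode(enc), enc
        except UnicodeDecodeError:
            continue
    # Unreachable with latin-1 in the candidate list
    raise ValueError('no candidate encoding decoded the data')
```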