writing \feff at the beginning of a file
Steven D'Aprano
steve at REMOVE-THIS-cybersource.com.au
Fri Aug 13 21:54:27 EDT 2010
On Fri, 13 Aug 2010 18:25:46 -0400, Terry Reedy wrote:
> A short background to MRAB's answer which I will try to get right.
>
> The byte-order-mark was invented for UTF-16 encodings so the reader
> could determine whether the pairs of bytes are in little or big endian
> order, depending on whether the first two bytes are fe and ff or ff and
> fe (or maybe vice versa, does not matter here). The concept is
> meaningless for utf-8 which consists only of bytes in a defined order.
> This is part of the Unicode standard.
>
> However, Microsoft (or whoever) re-purposed (hijacked) that pair of
> bytes to serve as a non-standard indicator of utf-8 versus any
> non-unicode encoding. The result is a corrupted utf-8 stream that Python
> accommodates with the utf-8-sig(nature) codec (versus the standard utf-8
> codec).
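A minimal demonstration of the difference between the two codecs:

    import codecs

    # A "Microsoft-style" UTF-8 file starts with the UTF-8 signature bytes.
    data = codecs.BOM_UTF8 + 'hello'.encode('utf-8')

    print(repr(data.decode('utf-8')))      # keeps the BOM as U+FEFF
    print(repr(data.decode('utf-8-sig')))  # strips the BOM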
Is there a standard way to autodetect the encoding of a text file? I do
this:
Open the file in binary mode; if the first three bytes are
codecs.BOM_UTF8, then it's a Microsoft UTF-8 text file; otherwise if the
first two bytes are codecs.BOM_BE or codecs.BOM_LE, the encoding is
utf-16-be or utf-16-le respectively.
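Something along these lines (a rough sketch; the function name is my own):

    import codecs

    def sniff_bom(filename):
        # Return (encoding, bom_length) based on a leading BOM,
        # or (None, 0) if no recognised BOM is found.
        with open(filename, 'rb') as f:
            prefix = f.read(3)
        if prefix == codecs.BOM_UTF8:
            return 'utf-8-sig', 3
        if prefix[:2] == codecs.BOM_BE:
            return 'utf-16-be', 2
        if prefix[:2] == codecs.BOM_LE:
            return 'utf-16-le', 2
        return None, 0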
(I don't bother to check for other BOMs, such as for utf-32. There are
*lots* of them, but in my experience those encodings are rarely used, and
many of the BOMs aren't defined in the codecs module, so I don't support
them.)
If there's no BOM, then re-open the file and read the first two lines. If
either of them matches the regex 'coding[=:]\s*([-\w.]+)', I take the
encoding name from the captured group. This matches Python's behaviour,
and supports Emacs and Vim encoding declarations.
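Roughly (again just a sketch):

    import re

    def sniff_declaration(filename):
        # Return the declared encoding from the first two lines, or None.
        coding_re = re.compile(r'coding[=:]\s*([-\w.]+)')
        with open(filename, 'rb') as f:
            for line in [f.readline(), f.readline()]:
                match = coding_re.search(line.decode('ascii', 'replace'))
                if match:
                    return match.group(1)
        return None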
Otherwise, there is no declared encoding, and I use whatever encoding I
like (whatever was specified by the user or the application default).
--
Steven