[Python-ideas] Fix default encodings on Windows
Steven D'Aprano
steve at pearwood.info
Wed Aug 10 23:14:04 EDT 2016
On Wed, Aug 10, 2016 at 04:40:31PM -0700, Steve Dower wrote:
> On 10Aug2016 1431, Chris Angelico wrote:
> >>* make the default open() encoding check for a BOM or else use utf-8
> >
> >-0.5. Is there any precedent for this kind of data-based detection
> >being the default?
There is precedent: the Python interpreter will accept a BOM instead of
an encoding cookie when importing .py files.
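A quick way to see that for yourself (the module name here is made up;
run it from the directory you write the file into):

    # Write a module whose only encoding declaration is the UTF-8 BOM,
    # then import it; CPython accepts the BOM in place of a coding cookie.
    with open('bommed.py', 'w', encoding='utf-8-sig') as f:
        f.write('GREETING = "hello"\n')

    import bommed
    print(bommed.GREETING)  # -> hello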
[Chris]
> >An explicit "utf-sig" could do a full detection,
> >but even then it's not perfect - how do you distinguish UTF-32LE from
> >UTF-16LE that starts with U+0000?
BOMs are a heuristic, nothing more. Arbitrary files could start with
anything, so of course the heuristic can guess wrong. But then if I
dumped a bunch of arbitrary Unicode codepoints in your lap and asked you
to guess the language, you would likely get it wrong too :-)
[Chris]
> >Do you say "UTF-32 is rare so we'll
> >assume UTF-16", or do you say "files starting U+0000 are rare, so
> >we'll assume UTF-32"?
The way I have done auto-detection based on BOMs is to start by reading
four bytes from the file in binary mode. (If there are fewer than four
bytes, it cannot be a text file with a BOM.) Compare those first four
bytes against the UTF-32 BOMs first, and the UTF-16 BOMs *second*
(otherwise UTF-16 will shadow UTF-32). Note that UTF-32 and UTF-16 each
have two BOMs (big-endian and little-endian). Then check for UTF-8, and
if you're really keen, UTF-7 and UTF-1.
def bom2enc(bom, default=None):
    """Return an encoding name from the first four bytes of a file."""
    if bom.startswith((b'\x00\x00\xFE\xFF', b'\xFF\xFE\x00\x00')):
        return 'utf_32'
    elif bom.startswith((b'\xFE\xFF', b'\xFF\xFE')):
        return 'utf_16'
    elif bom.startswith(b'\xEF\xBB\xBF'):
        return 'utf_8_sig'
    elif bom.startswith(b'\x2B\x2F\x76'):
        # The UTF-7 BOM is 2B 2F 76 followed by one of 38, 39, 2B or 2F.
        if len(bom) == 4 and bom[3] in b'\x2B\x2F\x38\x39':
            return 'utf_7'
    elif bom.startswith(b'\xF7\x64\x4C'):
        return 'utf_1'
    if default is None:
        raise ValueError('no recognisable BOM signature')
    return default
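You would then use it something like this (a sketch only; the wrapper
name is made up, and the default is whatever your application wants):

    def open_sniffed(path, default='utf_8'):
        # Sniff the first four bytes in binary mode, then reopen the
        # file in text mode with the detected (or default) encoding.
        with open(path, 'rb') as f:
            bom = f.read(4)
        return open(path, encoding=bom2enc(bom, default))

The utf_16, utf_32 and utf_8_sig codecs all consume the BOM themselves
when decoding, so the caller never sees it.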
[Steve Dower]
> The BOM exists solely for data-based detection, and the UTF-8 BOM is
> different from the UTF-16 and UTF-32 ones. So we either find an exact
> BOM (which IIRC decodes as a no-op spacing character, though I have a
> feeling some version of Unicode redefined it exclusively for being the
> marker) or we use utf-8.
The Byte Order Mark is always U+FEFF encoded into whatever bytes your
encoding uses. You should never use U+FEFF except as a BOM, but of
course arbitrary Unicode strings might include it in the middle of the
string Just Because. In that case, it may be interpreted as a legacy
"ZERO WIDTH NON-BREAKING SPACE" character. But new content should never
do that: you should use U+2060 "WORD JOINER" instead, and treat a U+FEFF
inside the body of your file or string as an unsupported character.
http://www.unicode.org/faq/utf_bom.html#BOM
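The difference is easy to see at the interpreter (utf_8 leaves the
signature in the decoded string as U+FEFF; utf_8_sig strips it):

    >>> data = '\N{EURO SIGN}'.encode('utf_8_sig')
    >>> data
    b'\xef\xbb\xbf\xe2\x82\xac'
    >>> data.decode('utf_8')      # BOM survives as U+FEFF
    '\ufeff€'
    >>> data.decode('utf_8_sig')  # BOM is stripped
    '€'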
[Steve]
> But the main reason for detecting the BOM is that currently opening
> files with 'utf-8' does not skip the BOM if it exists. I'd be quite
> happy with changing the default encoding to:
>
> * utf-8-sig when reading (so the UTF-8 BOM is skipped if it exists)
> * utf-8 when writing (so the BOM is *not* written)
Sounds reasonable to me.
Rather than hard-coding that behaviour, can we have a new encoding that
does that? "utf-8-readsig" perhaps.
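In the meantime, something close to it can be glued together from the
two existing codecs in pure Python (an untested sketch; the codec name
is just the proposed one):

    import codecs

    def _readsig_search(name):
        # Hypothetical codec: decode like utf_8_sig (skip a leading
        # BOM if present), encode like plain utf_8 (never write one).
        if name != 'utf_8_readsig':
            return None
        sig = codecs.lookup('utf_8_sig')
        plain = codecs.lookup('utf_8')
        return codecs.CodecInfo(
            name='utf_8_readsig',
            encode=plain.encode,
            decode=sig.decode,
            incrementalencoder=plain.incrementalencoder,
            incrementaldecoder=sig.incrementaldecoder,
            streamreader=sig.streamreader,
            streamwriter=plain.streamwriter,
        )

    codecs.register(_readsig_search)

    # open('spam.txt', encoding='utf-8-readsig') now skips a BOM when
    # reading and never writes one.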
[Steve]
> This provides the best compatibility when reading/writing files without
> making any guesses. We could reasonably extend this to read utf-16 and
> utf-32 if they have a BOM, but that's an extension and not necessary for
> the main change.
The use of a BOM is always a guess :-) Maybe I just happen to have a
Latin1 file that starts with "ï»¿", or a Mac Roman file that starts with
"Ôªø". Either case will be wrongly detected as UTF-8. That's the risk
you take when using a heuristic.
And if you don't want to use that heuristic, then you must specify the
actual encoding in use.
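That ambiguity is easy to demonstrate: the UTF-8 signature bytes decode
without error under either legacy codec:

    >>> b'\xEF\xBB\xBF'.decode('latin1')
    'ï»¿'
    >>> b'\xEF\xBB\xBF'.decode('mac_roman')
    'Ôªø'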
--
Steven D'Aprano