On Wed, Aug 10, 2016 at 04:40:31PM -0700, Steve Dower wrote:
On 10Aug2016 1431, Chris Angelico wrote:
- make the default open() encoding check for a BOM or else use utf-8
-0.5. Is there any precedent for this kind of data-based detection being the default?
There is precedent: the Python interpreter will accept a BOM instead of an encoding cookie when importing .py files.
An explicit "utf-sig" could do a full detection, but even then it's not perfect - how do you distinguish UTF-32LE from UTF-16LE that starts with U+0000?
BOMs are a heuristic, nothing more. If you're reading arbitrary files that could start with anything, then of course the heuristic can guess wrong. But then if I dumped a bunch of arbitrary Unicode codepoints in your lap and asked you to guess the language, you would likely get it wrong too :-)
Do you say "UTF-32 is rare so we'll assume UTF-16", or do you say "files starting U+0000 are rare, so we'll assume UTF-32"?
The way I have done auto-detection based on BOMs is to start by reading four bytes from the file in binary mode. (If there are fewer than four bytes, it cannot be a text file with a BOM.) Compare those first four bytes against the UTF-32 BOMs first, and the UTF-16 BOMs *second* (otherwise UTF-16 will shadow UTF-32). Note that there are two BOMs each (big-endian and little-endian). Then check for UTF-8, and if you're really keen, UTF-7 and UTF-1.
def bom2enc(bom, default=None):
    """Return encoding name from a four-byte BOM."""
    if bom.startswith((b'\x00\x00\xFE\xFF', b'\xFF\xFE\x00\x00')):
        return 'utf_32'
    elif bom.startswith((b'\xFE\xFF', b'\xFF\xFE')):
        return 'utf_16'
    elif bom.startswith(b'\xEF\xBB\xBF'):
        return 'utf_8_sig'
    elif bom.startswith(b'\x2B\x2F\x76'):
        # UTF-7's signature is 2B 2F 76 followed by one of 38, 39, 2B, 2F.
        if len(bom) == 4 and bom[3:] in (b'\x38', b'\x39', b'\x2B', b'\x2F'):
            return 'utf_7'
    elif bom.startswith(b'\xF7\x64\x4C'):
        return 'utf_1'
    if default is None:
        raise ValueError('no recognisable BOM signature')
    return default
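For context, here's a sketch of how that detection might drive open(). The helper name sniff_open is hypothetical, not anything in the stdlib, and it inlines the same checks (UTF-32 before UTF-16) rather than depending on the function above:

```python
import codecs

def sniff_open(path, default='utf-8'):
    """Open *path* for text reading, using the BOM (if any) to pick a codec."""
    with open(path, 'rb') as f:
        bom = f.read(4)
    # Check the four-byte UTF-32 signatures before the two-byte UTF-16 ones,
    # otherwise UTF-16LE shadows UTF-32LE.
    if bom.startswith((codecs.BOM_UTF32_BE, codecs.BOM_UTF32_LE)):
        enc = 'utf_32'
    elif bom.startswith((codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)):
        enc = 'utf_16'
    elif bom.startswith(codecs.BOM_UTF8):
        enc = 'utf_8_sig'
    else:
        enc = default
    # The utf_32, utf_16 and utf_8_sig codecs all consume the BOM themselves,
    # so no manual seeking past it is needed.
    return open(path, encoding=enc)
```

Note that a UTF-16LE file whose first character is U+0000 still fools this, which is exactly the ambiguity discussed above.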
The BOM exists solely for data-based detection, and the UTF-8 BOM is different from the UTF-16 and UTF-32 ones. So we either find an exact BOM (which IIRC decodes as a no-op spacing character, though I have a feeling some version of Unicode redefined it exclusively for being the marker) or we use utf-8.
The Byte Order Mark is always U+FEFF, encoded into whatever bytes your encoding uses. You should never use U+FEFF except as a BOM, but of course arbitrary Unicode strings might include it in the middle of the string Just Because. In that case, it may be interpreted as its legacy character, "ZERO WIDTH NO-BREAK SPACE". But new content should never do that: you should use U+2060 "WORD JOINER" instead, and treat a U+FEFF inside the body of your file or string as an unsupported character.
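A quick demonstration of both points: the character names, and the fact that utf-8-sig strips only a *leading* U+FEFF, leaving any later occurrence in the text:

```python
import unicodedata

# U+FEFF doubles as the BOM; its character name still reflects the legacy use.
print(unicodedata.name('\ufeff'))  # ZERO WIDTH NO-BREAK SPACE
print(unicodedata.name('\u2060'))  # WORD JOINER

# Decoding as utf-8-sig skips one leading U+FEFF; a mid-string one survives.
data = '\ufeffstart\ufeffmiddle'.encode('utf-8')
print(repr(data.decode('utf-8-sig')))  # 'start\ufeffmiddle'
```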
But the main reason for detecting the BOM is that currently opening files with 'utf-8' does not skip the BOM if it exists. I'd be quite happy with changing the default encoding to:
- utf-8-sig when reading (so the UTF-8 BOM is skipped if it exists)
- utf-8 when writing (so the BOM is *not* written)
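The asymmetry being proposed already exists as two separate codecs; a minimal demonstration of how they differ today:

```python
import codecs

raw = codecs.BOM_UTF8 + 'data'.encode('utf-8')  # file body with a UTF-8 BOM

# Reading with plain utf-8 leaks the BOM into the text as U+FEFF...
assert raw.decode('utf-8') == '\ufeffdata'
# ...while utf-8-sig skips it on the way in:
assert raw.decode('utf-8-sig') == 'data'

# On the way out, utf-8-sig *adds* a BOM, whereas plain utf-8 does not:
assert 'data'.encode('utf-8-sig') == raw
assert 'data'.encode('utf-8') == b'data'
```

The proposal is, in effect, utf-8-sig's decoder paired with plain utf-8's encoder.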
Sounds reasonable to me.
Rather than hard-coding that behaviour, can we have a new encoding that does that? "utf-8-readsig" perhaps.
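Such an encoding could be prototyped today with codecs.register, combining the utf-8-sig decoder with the plain utf-8 encoder. This is only a sketch: "utf-8-readsig" is the name suggested above, not an encoding that exists anywhere, and a real version would want more careful alias handling:

```python
import codecs

def _readsig_search(name):
    # Hypothetical codec: decode like utf-8-sig, encode like plain utf-8.
    if name not in ('utf-8-readsig', 'utf_8_readsig'):
        return None
    sig = codecs.lookup('utf-8-sig')
    plain = codecs.lookup('utf-8')
    return codecs.CodecInfo(
        name='utf-8-readsig',
        encode=plain.encode,                        # never write a BOM
        decode=sig.decode,                          # skip a BOM if present
        incrementalencoder=plain.incrementalencoder,
        incrementaldecoder=sig.incrementaldecoder,
        streamreader=sig.streamreader,
        streamwriter=plain.streamwriter,
    )

codecs.register(_readsig_search)
```

After registration, open(path, encoding='utf-8-readsig') would read files with or without a BOM transparently, and never emit one when writing.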
This provides the best compatibility when reading/writing files without making any guesses. We could reasonably extend this to read utf-16 and utf-32 if they have a BOM, but that's an extension and not necessary for the main change.
The use of a BOM is always a guess :-) Maybe I just happen to have a Latin1 file that starts with "ï»¿", or a Mac Roman file that starts with "Ôªø". Either case will be wrongly detected as UTF-8. That's the risk you take when using a heuristic.
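Those two strings aren't arbitrary: they are exactly what the three UTF-8 BOM bytes look like through those legacy codecs, so the files are byte-for-byte indistinguishable from UTF-8 with a BOM:

```python
bom = b'\xEF\xBB\xBF'  # the UTF-8 encoding of U+FEFF

# The same three bytes, seen through two single-byte legacy codecs:
print(bom.decode('latin-1'))    # ï»¿
print(bom.decode('mac-roman'))  # Ôªø
```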
And if you don't want to use that heuristic, then you must specify the actual encoding in use.