On Thu, Aug 11, 2016 at 1:14 PM, Steven D'Aprano firstname.lastname@example.org wrote:
On Wed, Aug 10, 2016 at 04:40:31PM -0700, Steve Dower wrote:
On 10Aug2016 1431, Chris Angelico wrote:
- make the default open() encoding check for a BOM or else use utf-8
-0.5. Is there any precedent for this kind of data-based detection being the default?
There is precedent: the Python interpreter will accept a BOM instead of an encoding cookie when importing .py files.
Okay, that's good enough for me.
An explicit "utf-sig" could do a full detection, but even then it's not perfect - how do you distinguish UTF-32LE from UTF-16LE that starts with U+0000?
BOMs are a heuristic, nothing more. Arbitrary files could start with anything, so of course the heuristic can guess wrong. But then if I dumped a bunch of arbitrary Unicode codepoints in your lap and asked you to guess the language, you would likely get it wrong too :-)
I have my own mental heuristics, but I can't recognize one Cyrillic language from another. And some Slavic languages can be written with either Latin or Cyrillic letters, just to further confuse matters. Of course, "arbitrary Unicode codepoints" might not all come from one language, and might not be any language at all.
(Do you wanna build a U+2603?)
Do you say "UTF-32 is rare so we'll assume UTF-16", or do you say "files starting U+0000 are rare, so we'll assume UTF-32"?
The way I have done BOM-based auto-detection is to start by reading four bytes from the file in binary mode. (If there are fewer than four bytes, it cannot be a text file with a BOM.)
Interesting. Are you assuming that a text file cannot be empty? Because 0xFF 0xFE could represent an empty file in UTF-16, and 0xEF 0xBB 0xBF likewise for UTF-8. Or maybe you don't care about files with less than one character in them?
Compare those first four bytes against the UTF-32 BOMs first, and the UTF-16 BOMs *second* (otherwise UTF-16 will shadow UTF-32). Note that each of those encodings has two BOMs (big-endian and little-endian). Then check for UTF-8, and if you're really keen, UTF-7 and UTF-1.
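The ordering described above can be sketched as follows. This is a minimal illustration, not anyone's actual code from the thread; `detect_bom` and the `default` fallback are my own names:

```python
import codecs

def detect_bom(path, default='utf-8'):
    """Guess an encoding from a leading BOM, else return `default`.

    The 4-byte UTF-32 BOMs must be tested before the 2-byte UTF-16
    BOMs, because the UTF-16-LE BOM (FF FE) is a prefix of the
    UTF-32-LE BOM (FF FE 00 00) and would otherwise shadow it.
    """
    with open(path, 'rb') as f:
        head = f.read(4)  # fewer than 4 bytes cannot hold a UTF-32 BOM
    for bom, name in [
        (codecs.BOM_UTF32_BE, 'utf-32'),
        (codecs.BOM_UTF32_LE, 'utf-32'),
        (codecs.BOM_UTF16_BE, 'utf-16'),
        (codecs.BOM_UTF16_LE, 'utf-16'),
        (codecs.BOM_UTF8,     'utf-8-sig'),
    ]:
        if head.startswith(bom):
            # Returning the endian-agnostic codec names is deliberate:
            # 'utf-16', 'utf-32' and 'utf-8-sig' all consume the BOM
            # themselves when the file is reopened in text mode.
            return name
    return default
```

Note this still cannot tell a UTF-16-LE file whose first character is U+0000 apart from a UTF-32-LE file, which is exactly the ambiguity raised above; it simply resolves it in favour of UTF-32.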
For a default file-open encoding detection, I would minimize the number of options. The UTF-7 BOM could be the beginning of a file containing Base 64 data encoded in ASCII, which is a very real possibility.
    elif bom.startswith(b'\x2B\x2F\x76'):
        # UTF-7 BOM: 2B 2F 76 followed by one of 2B 2F 38 39
        if len(bom) == 4 and bom[3] in b'\x2B\x2F\x38\x39':
            return 'utf_7'
So I wouldn't include UTF-7 in the detection. Nor UTF-1. Both are rare. Even UTF-32 doesn't necessarily have to be included. When was the last time you saw a UTF-32LE-BOM file?
But the main reason for detecting the BOM is that currently opening files with 'utf-8' does not skip the BOM if it exists. I'd be quite happy with changing the default encoding to:
- utf-8-sig when reading (so the UTF-8 BOM is skipped if it exists)
- utf-8 when writing (so the BOM is *not* written)
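The asymmetry being proposed matches how the two existing codecs already behave today:

```python
# Current behaviour: decoding BOM-prefixed bytes with 'utf-8' keeps the
# BOM as a U+FEFF character, while 'utf-8-sig' strips it. On the encode
# side, 'utf-8' never writes a BOM and 'utf-8-sig' always does.
data = b'\xef\xbb\xbfhello'

print(repr(data.decode('utf-8')))      # '\ufeffhello' -- BOM leaks through
print(repr(data.decode('utf-8-sig')))  # 'hello'       -- BOM skipped

print('hello'.encode('utf-8'))      # b'hello'
print('hello'.encode('utf-8-sig'))  # b'\xef\xbb\xbfhello'
```

So the proposal amounts to defaulting to the 'utf-8-sig' decoder but the plain 'utf-8' encoder.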
Sounds reasonable to me.
Rather than hard-coding that behaviour, can we have a new encoding that does that? "utf-8-readsig" perhaps.
+1. Makes the documentation easier by having the default value for encoding not depend on the value for mode.
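Such a codec can be prototyped in pure Python today. A minimal sketch, assuming the hypothetical name "utf-8-readsig" (not a real stdlib codec), which pairs the utf-8-sig decoder with the plain utf-8 encoder:

```python
import codecs

_sig = codecs.lookup('utf-8-sig')
_plain = codecs.lookup('utf-8')

def _search(name):
    # 'utf-8-readsig' is the hypothetical codec name from this thread.
    if name.replace('-', '_') != 'utf_8_readsig':
        return None
    return codecs.CodecInfo(
        name='utf-8-readsig',
        encode=_plain.encode,                         # never write a BOM
        decode=_sig.decode,                           # skip a BOM if present
        incrementalencoder=_plain.incrementalencoder,
        incrementaldecoder=_sig.incrementaldecoder,
        streamreader=_sig.streamreader,
        streamwriter=_plain.streamwriter,
    )

codecs.register(_search)
```

With that registered, open(path, encoding='utf-8-readsig') reads BOM-or-no-BOM files transparently and writes plain UTF-8, which is the behaviour being asked for.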