On Thu, Aug 11, 2016 at 02:09:00PM +1000, Chris Angelico wrote:
On Thu, Aug 11, 2016 at 1:14 PM, Steven D'Aprano <steve@pearwood.info> wrote:
The way I have done auto-detection based on BOMs is to start by reading four bytes from the file in binary mode. (If there are fewer than four bytes, it cannot be a text file with a BOM: even the shortest BOM plus a single character comes to at least four bytes, e.g. a two-byte UTF-16 BOM plus one two-byte code unit.)
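For concreteness, a minimal sketch of that approach (the detect_bom helper and its table are my own names, just for illustration; the codecs.BOM_* constants are standard):

import codecs

_BOMS = [
    # Longest signatures first: UTF-32-LE's FF FE 00 00 begins with
    # UTF-16-LE's FF FE, so the order of checks matters.
    (codecs.BOM_UTF32_LE, 'utf-32'),
    (codecs.BOM_UTF32_BE, 'utf-32'),
    (codecs.BOM_UTF8, 'utf-8-sig'),
    (codecs.BOM_UTF16_LE, 'utf-16'),
    (codecs.BOM_UTF16_BE, 'utf-16'),
]

def detect_bom(path, default='utf-8'):
    # Read the first four bytes in binary mode; no BOM-plus-text
    # file is shorter than that.
    with open(path, 'rb') as f:
        prefix = f.read(4)
    for bom, encoding in _BOMS:
        if prefix.startswith(bom):
            return encoding
    return default

The returned codec names ('utf-16', 'utf-32', 'utf-8-sig') all strip the BOM themselves, so open(path, encoding=detect_bom(path)) reads the file without a stray U+FEFF at the front.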
Interesting. Are you assuming that a text file cannot be empty?
Hmmm... not consciously, but I guess I was. If the file is empty, how do you know it's text?
Because 0xFF 0xFE could represent an empty file in UTF-16, and 0xEF 0xBB 0xBF likewise for UTF-8. Or maybe you don't care about files with fewer than one character in them?
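Both of those decode to an empty string in Python, for what it's worth:

>>> b'\xff\xfe'.decode('utf-16')          # just a UTF-16-LE BOM
''
>>> b'\xef\xbb\xbf'.decode('utf-8-sig')   # just a UTF-8 BOM
''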
I'll have to think about it some more :-)
For default file-open encoding detection, I would minimize the number of options. The UTF-7 BOM could be the beginning of a file containing Base64 data encoded in ASCII, which is a very real possibility.
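For instance (the collision is genuine: the UTF-7 signature '+/v8' is four ordinary Base64 characters, so the same leading bytes are legal under both readings):

>>> import base64
>>> b'+/v8-'.decode('utf-7')    # the UTF-7 signature decodes to a BOM...
'\ufeff'
>>> base64.b64decode(b'+/v8')   # ...but the same prefix is valid Base64
b'\xfb\xfb\xfc'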
I'm coming from the assumption that you're reading unformatted text in an unknown encoding, rather than some structured format. But we're getting off topic here. In the context of Steve's suggestion, we should only autodetect UTF-8. In other words: if there's a UTF-8 BOM, skip it; otherwise, treat the file as UTF-8.
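That policy is exactly what the stdlib's 'utf-8-sig' codec already implements (the b'spam' payload below is just a made-up example):

>>> b'\xef\xbb\xbfspam'.decode('utf-8-sig')   # BOM present: stripped
'spam'
>>> b'spam'.decode('utf-8-sig')               # no BOM: plain UTF-8
'spam'

and open(filename, encoding='utf-8-sig') behaves the same way for files.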
When was the last time you saw a UTF-32LE-BOM file?
Two minutes ago, when I looked at my test suite :-P

--
Steve