On 1/24/21 1:18 PM, MRAB wrote:
On 2021-01-24 17:04, Chris Angelico wrote:
On Mon, Jan 25, 2021 at 3:55 AM Stephen J. Turnbull firstname.lastname@example.org wrote:
Chris Angelico writes:

> Right, but as long as there's only one system encoding, that's not
> our problem. If you're on a Greek system and you want to decode
> ISO-8859-9 text, you have to state that explicitly. For the
> situations where you want heuristics based on byte distributions,
> there's always chardet.
But that's the big question. If you're just going to fall back to chardet, you might as well start there. No? Consider: if 'open' detects the encoding for you, *you can't find out what it is*. 'open' has no facility to tell you!
Isn't that what file objects have attributes for? You can find out, for instance, what newlines a file uses, even if it's being autodetected.
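For illustration, a minimal sketch of that kind of introspection on a text-mode file object (the temporary file and its contents are made up for the example; `encoding` and `newlines` are real attributes of `io.TextIOWrapper`):

```python
import os
import tempfile

# Made-up demo file written with known encoding and line endings.
path = os.path.join(tempfile.mkdtemp(), "demo.txt")
with open(path, "w", encoding="utf-8", newline="\n") as f:
    f.write("Hello\nWorld\n")

# In universal-newlines mode, the file object reports what it used/saw.
with open(path, encoding="utf-8") as f:
    f.read()
    enc, nl = f.encoding, f.newlines

print(enc)  # 'utf-8'
print(nl)   # '\n' -- the newline style detected while reading
```

An auto-detected encoding could plausibly be surfaced the same way, via the `encoding` attribute.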
> In theory, UTF-16 without a BOM can consist entirely of byte values
> below 128,
It's not just theory, it's my life. 62 of the 80 characters of the Japanese "hiragana" syllabary are composed of 2 printing ASCII characters (including SPC). A large fraction of the Han ideographs satisfy that condition, and I wouldn't be surprised if a majority of the 1000 most common ones do. (Not a good bet, because half of the ideographs have a low byte > 127, but the order of characters isn't random, so if a couple of popular radicals have 50 or so characters grouped in that range, you'd be much of the way there.)
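As a quick check of that claim, here's a sketch that counts the code points in the Unicode hiragana block whose UTF-16-BE bytes are all printable ASCII. (The block boundaries U+3041..U+3096 are my choice, and the block holds 86 code points rather than the 80 of the traditional syllabary, so the numbers differ slightly from the post's 62/80.)

```python
def printable_ascii(b):
    # "Printing ASCII" here means SPC (0x20) through '~' (0x7E).
    return 0x20 <= b <= 0x7E

count = 0
total = 0
for cp in range(0x3041, 0x3097):  # hiragana letters, assumed block range
    total += 1
    data = chr(cp).encode("utf-16-be")
    if all(printable_ascii(b) for b in data):
        count += 1

print(count, total)  # 62 86
```

The high byte is always 0x30 ('0'), so the result depends only on how many low bytes land in the printable range, and such text would sail straight past any "bytes < 128 means ASCII-compatible" check.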
> But there's no solution to that,
Well, yes, but that's my line. ;-)
Do you get files that lack the BOM? If so, there's fundamentally no way for the autodetection to recognize them. That's why, in my quickly-whipped-up algorithm above, I basically had it assume that no BOM means not UTF-16. After all, there's no way to know whether it's UTF-16-BE or UTF-16-LE without a BOM anyway (which is kinda the point of it), so IMO it's not unreasonable to assert that all files that don't start either b"\xFF\xFE" or b"\xFE\xFF" should be decoded using the ASCII-compatible detection method.
(Of course, this is *ONLY* if you don't specify an encoding. That part won't be going away.)
Well, if you see patterns like b'\x00H\x00e\x00l\x00l\x00o' then it's probably UTF-16-BE, and if you see patterns like b'H\x00e\x00l\x00l\x00o\x00' then it's probably UTF-16-LE.
You could also look for, say, sequences of Latin characters and sequences of Han characters.
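That NUL-byte-position heuristic could be sketched like this (the function name and the thresholds are arbitrary choices for illustration, not anything agreed in the thread): in UTF-16 text that is mostly ASCII-range characters, the zero bytes cluster at even offsets (UTF-16-BE) or odd offsets (UTF-16-LE).

```python
def guess_utf16_by_nuls(sample):
    """Guess a UTF-16 byte order from where the NUL bytes fall, or None."""
    even = sum(1 for i in range(0, len(sample), 2) if sample[i] == 0)
    odd = sum(1 for i in range(1, len(sample), 2) if sample[i] == 0)
    pairs = len(sample) // 2 or 1
    # Arbitrary rule: a majority of NULs on one side, none on the other.
    if even / pairs > 0.5 and odd == 0:
        return "utf-16-be"
    if odd / pairs > 0.5 and even == 0:
        return "utf-16-le"
    return None

print(guess_utf16_by_nuls("Hello".encode("utf-16-be")))  # utf-16-be
print(guess_utf16_by_nuls("Hello".encode("utf-16-le")))  # utf-16-le
print(guess_utf16_by_nuls(b"Hello"))                     # None
```

It fails, of course, exactly in the all-bytes-below-128 case discussed earlier, where a BOM-less UTF-16 file contains no NUL bytes at all.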
Yes, if you happen to see that sort of pattern you could perhaps make a guess, but since part of the goal is to avoid reading ahead much of the file, it isn't a very reliable test for confirming a UTF-16 file when it doesn't begin with Latin characters.