[Python-ideas] Re: Adding `open_text()` builtin function. (relating to PEP 597)

24 Jan 2021


      On 1/24/21 1:18 PM, MRAB wrote:
...
On 2021-01-24 17:04, Chris Angelico wrote:
...
On Mon, Jan 25, 2021 at 3:55 AM Stephen J. Turnbull wrote:
...
Chris Angelico writes:
 > Right, but as long as there's only one system encoding, that's not
 > our problem. If you're on a Greek system and you want to decode
 > ISO-8859-9 text, you have to state that explicitly. For the
 > situations where you want heuristics based on byte distributions,
 > there's always chardet.
But that's the big question.  If you're just going to fall back to
chardet, you might as well start there.  No?  Consider: if 'open'
detects the encoding for you, *you can't find out what it is*.  'open'
has no facility to tell you!
Isn't that what file objects have attributes for? You can find out,
for instance, what newlines a file uses, even if it's being
autodetected.
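For reference, a minimal sketch of the attributes being referred to, using only what text file objects expose today. The `newlines` attribute reports the line endings actually seen while reading, and `encoding` reports the encoding the file was opened with. Whether an autodetecting `open` would report its guess through `encoding` is an assumption about a feature that does not exist; the temporary file is only there to make the example self-contained.

```python
import os
import tempfile

# Write a small file with Windows line endings, bypassing newline translation.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as raw:
    raw.write(b"first\r\nsecond\r\n")

# Universal newlines mode (the default) records which line endings it saw.
with open(path, encoding="ascii") as f:
    text = f.read()
    seen_newlines = f.newlines   # filled in as lines are read
    used_encoding = f.encoding   # the encoding the file was opened with

os.remove(path)
print(repr(seen_newlines), used_encoding)
```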
...
 > In theory, UTF-16 without a BOM can consist entirely of byte values
 > below 128,
It's not just theory, it's my life.  62/80 of the Japanese "hiragana"
syllabary is composed of 2 printing ASCII characters (including SPC).
A large fraction of the Han ideographs satisfy that condition, and I
wouldn't be surprised if a majority of the 1000 most common ones do.
(Not a good bet because half of the ideographs have a low byte > 127,
but the order of characters isn't random, so if you get a couple of
popular radicals that have 50 or so characters in a group in that
range, you'd be much of the way there.)
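A small sketch of the hiragana point, assuming "hiragana" means the assigned code points U+3041..U+3096 in the Unicode Hiragana block (the denominator depends on which characters one counts, so it need not match the figure of 80 above). The helper name `is_printing_ascii` is made up for illustration. Every one of these characters has the high byte 0x30, which is the ASCII digit "0", so only the low byte decides whether the pair looks like text.

```python
def is_printing_ascii(data: bytes) -> bool:
    """True if every byte is printing ASCII, space (0x20) through tilde (0x7E)."""
    return all(0x20 <= b <= 0x7E for b in data)

# Assumption: the assigned hiragana code points U+3041..U+3096.
hiragana = [chr(cp) for cp in range(0x3041, 0x3097)]
ascii_lookalikes = [
    ch for ch in hiragana if is_printing_ascii(ch.encode("utf-16-be"))
]

# Two common hiragana, encoded as BOM-less big-endian UTF-16.
sample = "あい".encode("utf-16-be")
print(sample)                                   # b'0B0D'
print(len(ascii_lookalikes), "of", len(hiragana))
```

Under that assumption the count comes out at 62 characters whose two UTF-16 bytes are both printing ASCII, and the sample decodes without error as the ASCII string "0B0D".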
 > But there's no solution to that,
Well, yes, but that's my line. ;-)
Do you get files that lack the BOM? If so, there's fundamentally no
way for the autodetection to recognize them. That's why, in my
quickly-whipped-up algorithm above, I basically had it assume that no
BOM means not UTF-16. After all, there's no way to know whether it's
UTF-16-BE or UTF-16-LE without a BOM anyway (which is kinda the point
of it), so IMO it's not unreasonable to assert that all files that
don't start either b"\xFF\xFE" or b"\xFE\xFF" should be decoded using
the ASCII-compatible detection method.
(Of course, this is *ONLY* if you don't specify an encoding. That part
won't be going away.)
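A rough sketch of the rule described here, not the algorithm from the earlier message (which is not quoted in this post). The function name `guess_encoding`, the `fallback` parameter, and the "try UTF-8, else fall back" step standing in for "the ASCII-compatible detection method" are all assumptions made for illustration.

```python
import codecs

def guess_encoding(head: bytes, fallback: str = "latin-1") -> str:
    """Hypothetical helper: pick an encoding from the first bytes of a file."""
    # A BOM is the only thing treated as evidence of UTF-16.
    if head.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"  # this codec consumes the BOM and picks the byte order
    # Assumed stand-in for the ASCII-compatible detection: prefer UTF-8 if
    # the sample decodes. final=False tolerates a multibyte sequence that is
    # cut off at the end of the sample.
    try:
        codecs.getincrementaldecoder("utf-8")().decode(head, final=False)
    except UnicodeDecodeError:
        return fallback
    return "utf-8"

print(guess_encoding(b"\xff\xfeH\x00i\x00"))           # utf-16
print(guess_encoding("héllo".encode("utf-8")))         # utf-8
print(guess_encoding("あい".encode("utf-16-be")))      # utf-8 (a wrong guess)
```

The last call shows the cost of the rule: BOM-less UTF-16 that happens to consist of ASCII bytes is silently taken for UTF-8.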
Well, if you see patterns like b'\x00H\x00e\x00l\x00l\x00o' then it's
probably UTF-16-BE and if you see patterns like
b'H\x00e\x00l\x00l\x00o\x00' then it's probably UTF-16-LE.
You could also look for, say, sequences of Latin characters and
sequences of Han characters.
Yes, if you happen to see that sort of pattern, you could perhaps make a
guess, but since part of the goal is to avoid reading far ahead in the
file, it isn't a very reliable test for identifying UTF-16 files that
don't begin with Latin-1 characters.
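To make the limitation concrete, here is a sketch of the NUL-position heuristic and where it gives up. The function name `guess_bomless_utf16` and the 0.4 threshold are arbitrary choices for illustration, not anything proposed in the thread.

```python
def guess_bomless_utf16(sample: bytes, threshold: float = 0.4):
    """Hypothetical heuristic: guess UTF-16 byte order from NUL positions."""
    pairs = len(sample) // 2
    if pairs == 0:
        return None
    even_nuls = sample[0:pairs * 2:2].count(0)  # high bytes if big-endian
    odd_nuls = sample[1:pairs * 2:2].count(0)   # high bytes if little-endian
    if even_nuls / pairs >= threshold and odd_nuls == 0:
        return "utf-16-be"
    if odd_nuls / pairs >= threshold and even_nuls == 0:
        return "utf-16-le"
    return None  # no usable NUL pattern in this sample

print(guess_bomless_utf16("Hello".encode("utf-16-be")))       # utf-16-be
print(guess_bomless_utf16("Hello".encode("utf-16-le")))       # utf-16-le
print(guess_bomless_utf16("こんにちは".encode("utf-16-le")))  # None
```

The Japanese sample contains no NUL bytes at all, so a short look at the start of such a file gives the heuristic nothing to work with.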

-- 
Richard Damon