On Sun, Jan 24, 2021 at 2:31 AM Barry Scott <barry@barrys-emacs.org> wrote:
> I think that you are going to create a bug magnet if you attempt to auto-detect the encoding.
>
> The first problem I see is that the file may be a pipe, in which case you will block until you have enough data to do the auto-detection.
>
> The second problem is that the first N bytes are all ASCII, and only later do you see a Windows code page signature (odd lack of a UTF-8 signature).
Both can be handled, just as universal newlines can, by remaining in an "uncertain" state. When the file is first opened, we know nothing about its encoding. Once you request that anything be read (e.g. by pumping the iterator), it reads, just as it does now. Then:

1) If it looks like UTF-16, assume UTF-16. Rather than falling for the "Bush hid the facts" issue, this might be restricted to files that start with a BOM.
2) If it's entirely ASCII, decode it as ASCII and stay uncertain.
3) If it can be decoded as UTF-8, remember that this is a UTF-8 file, and from there on, error out if anything isn't UTF-8.
4) Otherwise, use the system encoding.

On subsequent reads, if we're still in the uncertain state, repeat steps 2-4. Until it finds a non-ASCII byte value, it doesn't really matter how the data is decoded. Unlike chardet, this can be done completely dependably.

I'm not sure what would happen if the system encoding isn't an eight-bit ASCII-compatible one, though. The algorithm might produce some odd results if the file looks like ASCII but then switches to some incompatible encoding. Can anyone give an example of a current in-use system encoding that would have this issue? How likely is it that you'd get even one line of text that purports to be ASCII?
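Roughly, the state machine I have in mind looks like this (an untested sketch; the class and names are just for illustration, and it assumes each read ends on a character boundary, where real code would use codecs' incremental decoders to carry split multi-byte sequences across reads):

import locale

class UncertainDecoder:
    def __init__(self):
        self.encoding = None  # None means we're still uncertain
        self.system = locale.getpreferredencoding(False)

    def decode(self, chunk):
        if self.encoding is None:
            # 1) Assume UTF-16 only when the file starts with a BOM,
            #    so "Bush hid the facts" files aren't misdetected.
            if chunk.startswith(b'\xff\xfe'):
                self.encoding = 'utf-16-le'
                chunk = chunk[2:]
            elif chunk.startswith(b'\xfe\xff'):
                self.encoding = 'utf-16-be'
                chunk = chunk[2:]
            else:
                try:
                    # 2) Entirely ASCII: decode it, but stay uncertain.
                    return chunk.decode('ascii')
                except UnicodeDecodeError:
                    try:
                        chunk.decode('utf-8')
                        # 3) Valid UTF-8: lock that in for the rest
                        #    of the file.
                        self.encoding = 'utf-8'
                    except UnicodeDecodeError:
                        # 4) Otherwise, fall back to the system encoding.
                        self.encoding = self.system
        # Once decided, every chunk is decoded with the chosen codec;
        # for UTF-8, this errors out on any later non-UTF-8 bytes.
        return chunk.decode(self.encoding)

Feeding it an all-ASCII chunk leaves it uncertain, and a later UTF-8 chunk locks the decision in:

dec = UncertainDecoder()
dec.decode(b'plain ascii\n')    # 'plain ascii\n', still uncertain
dec.decode(b'caf\xc3\xa9\n')    # 'café\n', now locked to UTF-8

ChrisA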