On Sun, Jan 24, 2021 at 2:31 AM Barry Scott <barry@barrys-emacs.org> wrote:
> I think that you are going to create a bug magnet if you attempt to auto-detect the encoding.
>
> The first problem I see is that the file may be a pipe, in which case you will block until you have enough data to do the auto-detection.
>
> The second problem is that the first N bytes are all ASCII, and only later do you see a Windows code page signature (odd lack of a UTF-8 signature).
Both can be handled, just as universal newlines can, by remaining in an "uncertain" state. When the file is first opened, we know nothing about its encoding. Once you request that anything be read (e.g. by pumping the iterator), it reads, just as it does now. Then:

1) If it looks like UTF-16, assume UTF-16. Rather than falling for the "Bush hid the facts" issue, this might be restricted to files that start with a BOM.
2) If it's entirely ASCII, decode it as ASCII and stay uncertain.
3) If it can be decoded as UTF-8, remember that this is a UTF-8 file, and from there on, error out if anything isn't UTF-8.
4) Otherwise, use the system encoding.

On subsequent reads, if we're still in the uncertain state, repeat steps 2-4. Until it finds a non-ASCII byte value, it doesn't really matter how the data is decoded. Unlike chardet, this can be done completely dependably.

I'm not sure what would happen if the system encoding isn't an eight-bit ASCII-compatible one, though. The algorithm might produce some odd results if the file looks like ASCII but then switches to some incompatible encoding. Can anyone give an example of a current in-use system encoding that would have this issue? How likely is it that you'd get even one line of text that purports to be ASCII?
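Roughly, the state machine I have in mind looks like this (an untested sketch; the class and names are just for illustration, and it assumes each read ends on a character boundary, where real code would use codecs' incremental decoders to carry split multi-byte sequences across reads):

import locale

class UncertainDecoder:
    def __init__(self):
        self.encoding = None  # None means we're still uncertain
        self.system = locale.getpreferredencoding(False)

    def decode(self, chunk):
        if self.encoding is None:
            # 1) Assume UTF-16 only when the file starts with a BOM,
            #    so "Bush hid the facts" files aren't misdetected.
            if chunk.startswith(b'\xff\xfe'):
                self.encoding = 'utf-16-le'
                chunk = chunk[2:]
            elif chunk.startswith(b'\xfe\xff'):
                self.encoding = 'utf-16-be'
                chunk = chunk[2:]
            else:
                try:
                    # 2) Entirely ASCII: decode it, but stay uncertain.
                    return chunk.decode('ascii')
                except UnicodeDecodeError:
                    try:
                        chunk.decode('utf-8')
                        # 3) Valid UTF-8: lock that in for the rest
                        #    of the file.
                        self.encoding = 'utf-8'
                    except UnicodeDecodeError:
                        # 4) Otherwise, fall back to the system encoding.
                        self.encoding = self.system
        # Once decided, every chunk is decoded with the chosen codec;
        # for UTF-8, this errors out on any later non-UTF-8 bytes.
        return chunk.decode(self.encoding)

Feeding it an all-ASCII chunk leaves it uncertain, and a later UTF-8 chunk locks the decision in:

dec = UncertainDecoder()
dec.decode(b'plain ascii\n')    # 'plain ascii\n', still uncertain
dec.decode(b'caf\xc3\xa9\n')    # 'café\n', now locked to UTF-8

ChrisA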