On Sun, Jan 24, 2021, at 13:18, MRAB wrote:
> Well, if you see patterns like b'\x00H\x00e\x00l\x00l\x00o' then it's probably UTF16-BE and if you see patterns like b'H\x00e\x00l\x00l\x00o\x00' then it's probably UTF16-LE.
> You could also look for, say, sequences of Latin characters and sequences of Han characters.
This is dangerous, as Microsoft discovered: a sequence of ASCII Latin characters can look a lot like a sequence of UTF-16 Han characters. On Windows, Notepad always writes UTF-16 with a BOM, even though it now writes UTF-8 without one by default.

Probably the winning combination is: if there is a UTF-16 BOM, it's UTF-16; else if the first few non-ASCII bytes encountered are valid UTF-8, it's UTF-8; otherwise it's the system default 'ANSI' locale.

The one problem with that is what to do if something like a pipe or a socket gets a sequence of bytes that is a valid *partial* UTF-8 character, then doesn't get any more data for a while. It's unacceptable to have to wait for more data before interpreting data that has already been read. Notepad has the luxury of only working on ordinary files, and of being able to scan the whole file before making a decision about the character set [I believe it mmaps the file rather than using ordinary open/read calls].
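
To make that rule concrete, here is a rough sketch in Python. It is not what Notepad does. The function name guess_encoding and the "cp1252" fallback are made up for illustration, with cp1252 standing in for whatever the system 'ANSI' code page happens to be. It also checks the whole buffer rather than only the first few non-ASCII bytes, and it treats pure ASCII as UTF-8. The incremental decoder with final=False is what lets a truncated trailing sequence pass as "valid so far".

```python
import codecs


def guess_encoding(data: bytes, fallback: str = "cp1252") -> str:
    """Guess an encoding: UTF-16 BOM, else valid UTF-8, else the fallback.

    `fallback` is a stand-in for the system 'ANSI' code page (an assumption).
    """
    # 1. A UTF-16 BOM (either byte order) wins outright.
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"

    # 2. Otherwise check that the bytes are valid UTF-8. With final=False
    #    an incomplete sequence at the very end is buffered rather than
    #    reported as an error, so a partial character is still accepted.
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        decoder.decode(data, final=False)
    except UnicodeDecodeError:
        # 3. Not UTF-8: fall back to the 'ANSI' code page.
        return fallback
    return "utf-8"


# The danger with guessing from content alone: plain ASCII, read as
# UTF-16-LE, decodes without error into Han characters.
misread = b"This".decode("utf-16-le")

# The partial-character problem: these two bytes are the start of a
# three-byte UTF-8 sequence, so the guess has to be made on incomplete
# evidence if no more data arrives.
partial_guess = guess_encoding(b"\xe2\x82")

# An incremental decoder can hold the partial character and finish it
# when the last byte shows up, without blocking on the earlier data.
stream = codecs.getincrementaldecoder("utf-8")()
first = stream.decode(b"abc\xe2\x82", final=False)
second = stream.decode(b"\xac", final=False)
```

Note that this does not solve the pipe/socket case, it only shows where it bites: a buffer ending in a lone 0xE9 is accepted as the start of a UTF-8 sequence, even though it could just as well be a complete 'é' in cp1252.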