On Sun, Jan 24, 2021 at 9:13 PM Stephen J. Turnbull email@example.com wrote:
> Chris Angelico writes:
> > Can anyone give an example of a current in-use system encoding that would have [ASCII bytes in non-ASCII text]?
> Shift JIS and Big5 (both can have bytes < 128 inside multibyte characters). I don't know if Big5 is still in use as the default encoding anywhere, but Shift JIS is, although its use is decreasing.
Sorry, let me clarify.
Can anyone give an example of a current system encoding (i.e. one that is likely to be the default currently used by open()) that can have byte values below 128 which do NOT mean what they would mean in ASCII? In other words, is it possible to read in a section of a file, think that it's ASCII, and then find that you decoded it wrongly?
> For both of those, once you encounter a non-ASCII byte you can just switch over, and none of the previous text was mis-decoded.
Good to know, so these two won't be a problem.
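A quick sketch of why the already-decoded prefix stays safe (my own example, not from your mail): in Shift JIS, a byte below 128 can appear as the *trail* byte of a multibyte character, but only after a non-ASCII lead byte, so you've already switched over by the time you see it.

```python
# 0x8140 is the Shift JIS ideographic space; its trail byte 0x40 is "@" in ASCII.
data = b"abc\x81\x40"

# The ASCII prefix "abc" decodes correctly on its own...
assert data[:3].decode("ascii") == "abc"

# ...and the full decode agrees: the 0x40 is consumed as part of the
# multibyte character (U+3000), never misread as "@".
assert data.decode("shift_jis") == "abc\u3000"
```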
I'm assuming here that there is a *single* default system encoding, meaning that the automatic handler has only three cases to worry about: UTF-16 (with BOM), UTF-8 (including pure ASCII), and the system encoding.
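To make the three cases concrete, here's a rough sketch of the sort of handler I have in mind (detect_and_decode is a hypothetical name, and falling back via locale.getpreferredencoding() is my assumption about how "the system encoding" would be obtained):

```python
import codecs
import locale

def detect_and_decode(raw: bytes) -> str:
    # Case 1: UTF-16 with a BOM - the BOM selects the byte order.
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return raw.decode("utf-16")
    # Case 2: UTF-8, which also covers pure ASCII.
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Case 3: fall back to the single default system encoding.
        return raw.decode(locale.getpreferredencoding(False))
```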
> But that's only if you *know* the language was Japanese (respectively Chinese). Remember, there is no encoding that can be distinguished from ISO 8859-1 (and several other Latin encodings) simply based on the bytes found, since it uses all 256 byte values.
Right, but as long as there's only one system encoding, that's not our problem. If you're on a Greek system and you want to decode ISO-8859-9 text, you have to state that explicitly. For the situations where you want heuristics based on byte distributions, there's always chardet.
> > How likely is it that you'd get even one line of text that purports to be ASCII?
> Program source code where the higher-level functions (likely to contain literal strings) come late in the file is frequently misdetected based on the earlier bytes.
Yup; and the real question is whether anything would have been decoded incorrectly. If you read in a bunch of ASCII-only text and yield it to the app, and then come across something that proves the file is not UTF-8, then as far as I'm aware you won't have to "un-yield" any of the previous text: it will all have been decoded correctly.
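That property is easy to demonstrate with an incremental decoder (a sketch, not the proposed implementation):

```python
import codecs

dec = codecs.getincrementaldecoder("utf-8")()

# ASCII-only input is yielded immediately, and that output is final.
assert dec.decode(b"plain ascii line\n") == "plain ascii line\n"

# A later invalid byte proves the file isn't UTF-8, but nothing
# already yielded has to be retracted - the error is raised here.
try:
    dec.decode(b"\xff")
except UnicodeDecodeError:
    pass  # the ASCII prefix above was still decoded correctly
else:
    raise AssertionError("0xFF should not decode as UTF-8")
```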
In theory, UTF-16 without a BOM can consist entirely of byte values below 128, and that's an absolute pain. There's no solution to that other than demanding a BOM, or hoping that the first few characters are all ASCII so you can spot "H\0e\0l\0l\0o\0" - which I wouldn't call reliable, although your odds probably aren't bad in real-world cases.
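For instance (my own illustration): ASCII text encoded as UTF-16-LE is all bytes below 128, and since NUL is a valid ASCII byte it even decodes "successfully" as UTF-8 - just to the wrong text - so byte inspection alone can't rule either encoding out.

```python
data = "Hello".encode("utf-16-le")         # BOM-less UTF-16, little-endian
assert data == b"H\x00e\x00l\x00l\x00o\x00"
assert all(b < 128 for b in data)          # every byte is in the ASCII range

# It also decodes cleanly as UTF-8 (NUL bytes are valid), producing
# NUL-interleaved garbage rather than an error.
assert data.decode("utf-8") == "H\x00e\x00l\x00l\x00o\x00"
```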