
windows-1252 is based on iso-8859-1. Thus, I'd like to be able to chain codecs as follows:

bytes.decode("windows-1252-ext", else=lambda r: r.decode("iso-8859-1"))

The "else" argument is a lambda that gets passed an object with a decode method identical to the bytes decode method, except that it leaves already-decoded characters untouched. In this case, "windows-1252-ext" would cover only the \x80-\x9F range, leaving it up to "iso-8859-1" to handle the rest.

A similar process would happen for encoding: encode with "windows-1252-ext", else "iso-8859-1".

(Technically, "windows-1252-ext" isn't needed - you can combine the existing "windows-1252" with "iso-8859-1" to get "windows-1252-c1".)

This would be a novel way to think of encodings: not just flat translation tables, but highly composable translation tables. I have a thing for composition.
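A minimal sketch of how this fallback behavior can already be approximated with the existing codecs error-handler machinery. The handler name "latin1_fallback" is my own invention; codecs.register_error and the cp1252/latin-1 codecs are standard. This relies on Python's "windows-1252" codec raising UnicodeDecodeError on the five bytes (\x81, \x8D, \x8F, \x90, \x9D) it leaves undefined, at which point the handler retries those bytes with "latin-1":

```python
import codecs

# Hypothetical error handler: when "windows-1252" hits an undefined byte,
# fall back to "latin-1", which maps every byte 0x00-0xFF to the same
# code point. The result is the "windows-1252-c1" hybrid described above.
def latin1_fallback(err):
    if isinstance(err, UnicodeDecodeError):
        fragment = err.object[err.start:err.end]
        return fragment.decode("latin-1"), err.end
    raise err

codecs.register_error("latin1_fallback", latin1_fallback)

# \x80 is defined in cp1252 (U+20AC, the euro sign); \x81 is not,
# so the handler maps it straight through to U+0081.
assert b"\x80\x81".decode("windows-1252", errors="latin1_fallback") == "\u20ac\u0081"
```

The key difference from the "else" proposal is that this only composes at the granularity the primary codec's error reporting allows, rather than being a first-class chaining mechanism.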

I see how this is another way to get what I was asking for: a way to decode some unfortunately common text encodings, ones that Web browsers use, in Python without having to import additional modules. I appreciate other ideas about how to solve this problem, but the generality here seems pretty unnecessary.

The world isn't making any _novel_ legacy encodings. There are 8 legacy encodings that Python has missed, and there's no reason to expect there to be any more of them.

It's worrisome to support arbitrary compositions of encodings. Most of these possible hybrid encodings haven't been used before, and using them would be a bad idea because there would be no reason to expect any other software in existence to be compatible with them.

Some of these legacy encodings (like the webbish version of windows-1255) are not the composition of two encodings that already exist in Python. So you'd have to define new encodings anyway.

On Fri, 19 Jan 2018 at 17:09 Soni L. <fakedme+py@gmail.com> wrote:
> windows-1252 is based on iso-8859-1. Thus, I'd like to be able to chain codecs as follows:
>
> bytes.decode("windows-1252-ext", else=lambda r: r.decode("iso-8859-1"))
>
> The "else" argument is a lambda that gets passed an object with a decode method identical to the bytes decode method, except that it leaves already-decoded characters untouched. In this case, "windows-1252-ext" would cover only the \x80-\x9F range, leaving it up to "iso-8859-1" to handle the rest.
>
> A similar process would happen for encoding: encode with "windows-1252-ext", else "iso-8859-1".
>
> (Technically, "windows-1252-ext" isn't needed - you can combine the existing "windows-1252" with "iso-8859-1" to get "windows-1252-c1".)
>
> This would be a novel way to think of encodings: not just flat translation tables, but highly composable translation tables. I have a thing for composition.

_______________________________________________
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/
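The encode direction mentioned in the proposal can be approximated the same way. This sketch uses the standard codecs error-handler API; the handler name "latin1_encode_fallback" is my own, and "cp1252" stands in for the primary table while "latin-1" serves as the fallback:

```python
import codecs

# Hypothetical encode-side fallback: if "cp1252" cannot encode a character
# (e.g. U+0081, which has no cp1252 mapping), retry it with "latin-1".
def latin1_encode_fallback(err):
    if isinstance(err, UnicodeEncodeError):
        fragment = err.object[err.start:err.end]
        return fragment.encode("latin-1"), err.end
    raise err

codecs.register_error("latin1_encode_fallback", latin1_encode_fallback)

# U+20AC encodes normally to \x80; U+0081 goes through the fallback.
assert "\u20ac\u0081".encode("cp1252", errors="latin1_encode_fallback") == b"\x80\x81"
```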
participants (2)
- Rob Speer
- Soni L.