[Python-Dev] Python3 "complexity"

Chris Angelico rosuav at gmail.com
Fri Jan 10 02:22:02 CET 2014


On Fri, Jan 10, 2014 at 11:53 AM, anatoly techtonik <techtonik at gmail.com> wrote:
>   2. introduce autodetect mode to open functions
>      1. read and transform on the fly, maintaining a buffer that
>         stores original bytes and their mapping to letters. The
>         mapping is updated as bytes frequency changes. When the
>         buffer is full, you have the best candidate.
>

Bad idea. Bad, bad idea! No biscuit. Sit!

This sort of magic is what gave us the "Bush hid the facts" bug in
Windows Notepad. If byte value distribution is used to guess encoding,
there's no end to the craziness that can result. How do you know that
the byte values 0x41 0x42 0x43 0x44 are supposed to mean upper-case
ASCII letters and not a 32-bit integer or floating-point value, or
some accented lower-case letter A's in EBCDIC, or anything else? Maybe
if you have a whole document, AND you know for sure that it's
linguistic text, then maybe - MAYBE - you could guess with reasonable
reliability. But even then, how can you be sure? Remember, too, you
might have to deal with something that's actually mis-encoded. If
you're told this is UTF-8 and you find the byte sequence ED B3 BF
(an encoded lone surrogate, which is not valid UTF-8), do you decide
that it can't possibly be UTF-8 and pick a different encoding to
decode with? That would produce no end of trouble, when what you
actually want is (most likely) to raise an error.
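
To make that concrete, here's a rough Python 3 sketch. The codec
choices are my own assumptions for illustration: cp037 stands in for
"EBCDIC" (there are many EBCDIC code pages), and utf-16-le is the
commonly described wrong guess in the Notepad case. None of this is
from the proposal itself.

```python
import struct

# The same four bytes, read four different ways. Nothing in the bytes
# themselves says which reading is the intended one.
raw = bytes([0x41, 0x42, 0x43, 0x44])

as_ascii = raw.decode("ascii")             # upper-case ASCII letters
as_int = int.from_bytes(raw, "big")        # 32-bit big-endian integer
as_float = struct.unpack(">f", raw)[0]     # 32-bit big-endian float
as_ebcdic = raw.decode("cp037")            # one EBCDIC code page (assumed)

# Plain ASCII text that also happens to decode cleanly as UTF-16-LE,
# which is why a guesser can get it wrong without noticing.
notepad = b"Bush hid the facts"
misread = notepad.decode("utf-16-le")

# Told "this is UTF-8" and handed ED B3 BF: the strict decoder refuses,
# rather than quietly switching to some other encoding.
bad = bytes([0xED, 0xB3, 0xBF])
try:
    bad.decode("utf-8")
    strict_error = None
except UnicodeDecodeError as exc:
    strict_error = exc

print(repr(as_ascii), as_int, as_float, repr(as_ebcdic))
print(repr(misread))
print(strict_error)
```

Every one of those decodes "succeeds" except the last, and the last is
the only one that's telling you the truth about a problem.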

ChrisA

