[Python-Dev] Python3 "complexity"

Steven D'Aprano steve at pearwood.info
Fri Jan 10 03:39:52 CET 2014


On Fri, Jan 10, 2014 at 12:22:02PM +1100, Chris Angelico wrote:
> On Fri, Jan 10, 2014 at 11:53 AM, anatoly techtonik <techtonik at gmail.com> wrote:
> >   2. introduce autodetect mode to open functions
> >      1. read and transform on the fly, maintaining a buffer that
> > stores original bytes
> >          and their mapping to letters. The mapping is updated as bytes frequency
> >          changes. When the buffer is full, you have the best candidate.
> >
> 
> Bad idea. Bad, bad idea! No biscuit. Sit!
> 
> This sort of magic is what causes the "bush hid the facts" bug in
> Windows Notepad. If byte value distribution is used to guess the
> encoding, there's no end to the craziness that can result.
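To make that failure concrete, here is a rough illustration (purely
illustrative Python, not Notepad's actual code) of what the old
IsTextUnicode heuristic effectively did: an even-length run of plain
ASCII also happens to be a plausible UTF-16-LE byte stream, and the
"guess by byte distribution" logic picked the wrong answer:

```python
# The famous Notepad bug: an 18-byte ASCII string, reinterpreted as
# UTF-16 little-endian, decodes to nine CJK ideographs.  Both decodings
# are "valid", which is why a plausibility heuristic can pick the
# wrong one.
data = b"bush hid the facts"

as_ascii = data.decode("ascii")      # the text the user typed
as_utf16 = data.decode("utf-16-le")  # what Notepad showed instead

print(as_ascii)   # bush hid the facts
print(as_utf16)   # nine Chinese-looking characters
```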

I think that heuristics to guess the encoding have their role to play, 
if the caller understands the risks. For example, an application might 
give the user the choice of specifying the codec, or having the app 
guess it. (I dislike the term "Auto detect", since that implies a level 
of certainty which often doesn't apply to real files.)

There is already a third-party library, chardet, which does this. 
Perhaps the std lib should include this? Perhaps chardet should be 
considered best-of-breed "atomic reactor", but the std lib could include 
a "battery" to do something similar. I don't think we ought to dismiss 
this idea out of hand.
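As a sketch of what such a std lib "battery" might look like (this is
illustrative only, and nothing like chardet's actual statistical
algorithm): try a short list of candidate codecs in order, and fall
back to Latin-1, which accepts any byte sequence — possibly wrongly,
which is exactly the risk the caller is accepting:

```python
def guess_decode(data, candidates=("utf-8", "cp1252")):
    """Illustrative only: return (text, codec) for the first candidate
    codec that decodes `data` without error.  Latin-1 is the final
    fallback because every byte sequence is valid Latin-1 -- the result
    may still be *wrong*, which is the risk the caller accepts.
    """
    for codec in candidates:
        try:
            return data.decode(codec), codec
        except UnicodeDecodeError:
            continue
    return data.decode("latin-1"), "latin-1"

print(guess_decode(b"caf\xc3\xa9"))  # valid UTF-8, so UTF-8 wins
print(guess_decode(b"caf\xe9"))      # invalid UTF-8, falls through to cp1252
```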


> How do you know that
> the byte values 0x41 0x42 0x43 0x44 are supposed to mean upper-case
> ASCII letters and not a 32-bit integer or floating-point value, 

Presumably if you're reading a file intended to be text, the bytes are 
meant to be text and not arbitrary binary blobs. Given that it is 2014 
and not 1974, chances are reasonably good that bytes 0x41 0x42 0x43 0x44 
are meant as ASCII letters rather than EBCDIC. But you can't be 
certain, and even if "ASCII capital A" is the right way to bet with
byte 0x41, it's much harder to guess what 0xC9 is intended as:

py> for encoding in "macroman cp1256 latin1 koi8_r".split():
...     print(b'\xC9'.decode(encoding))
...
…
ة
É
и


If you know the encoding via some out-of-band metadata, that's great. If 
you don't, or if the specified encoding is wrong, an application may not 
have the luxury of just throwing up its hands and refusing to process 
the data. Your web browser has to display something even if the web page 
lies about the encoding used or contains invalid data.
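Python gives applications the same "display something" escape hatch:
the error handlers accepted by decode let you substitute or skip bad
bytes instead of aborting. For example:

```python
# Bytes that claim to be UTF-8 but contain an invalid sequence.
data = b"caf\xe9 latte"   # 0xE9 is not a valid UTF-8 sequence here

# Strict decoding refuses to process the data at all:
try:
    data.decode("utf-8")
except UnicodeDecodeError as exc:
    print("strict decoding failed:", exc.reason)

# A browser-style application can substitute U+FFFD and carry on:
print(data.decode("utf-8", errors="replace"))   # caf<U+FFFD> latte
```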

Even though encoding issues are more than 40 years old, making this 
problem older than most programmers, it's still new to many people. 
(Perhaps they haven't been paying attention, or they've been living in 
denial that it would ever happen to them, or they've just been lucky 
enough to live in a pure ASCII world.) So a bit of sympathy is due to 
those struggling with this, 
but on the flip side, they need to HTFU and deal with it. Python 3 did 
not cause encoding issues, and in these days of code being interchanged 
all over the world, any programmer who doesn't have at least a basic 
understanding of this is like a programmer who doesn't understand why 
"<insert name of language> cannot multiply correctly":

py> 0.7*7 == 4.9
False
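And just as the float surprise has a standard remedy (compare with a
tolerance rather than `==`), for instance:

```python
import math

# 0.7 * 7 is not exactly 4.9 in binary floating point...
print(0.7 * 7)                      # slightly less than 4.9
print(0.7 * 7 == 4.9)               # False

# ...so the idiomatic comparison uses a tolerance:
print(math.isclose(0.7 * 7, 4.9))   # True
```

the remedy for text is equally well-established: know, or explicitly
choose, the encoding.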



-- 
Steven

