[Python-Dev] Python3 "complexity"

Chris Angelico rosuav at gmail.com
Fri Jan 10 05:03:10 CET 2014


On Fri, Jan 10, 2014 at 1:39 PM, Steven D'Aprano <steve at pearwood.info> wrote:
> On Fri, Jan 10, 2014 at 12:22:02PM +1100, Chris Angelico wrote:
>> On Fri, Jan 10, 2014 at 11:53 AM, anatoly techtonik <techtonik at gmail.com> wrote:
>> >   2. introduce autodetect mode to open functions
>> >      1. read and transform on the fly, maintaining a buffer that
>> > stores original bytes
>> >          and their mapping to letters. The mapping is updated as bytes frequency
>> >          changes. When the buffer is full, you have the best candidate.
>> >
>>
>> Bad idea. Bad, bad idea! No biscuit. Sit!
>>
>> This sort of magic is what brings the "bush hid the facts" bug in
>> Windows Notepad. If byte value distribution is used to guess encoding,
>> there's no end to the craziness that can result.
>
> I think that heuristics to guess the encoding have their role to play,
> if the caller understands the risks. For example, an application might
> give the user the choice of specifying the codec, or having the app
> guess it. (I dislike the term "Auto detect", since that implies a level
> of certainty which often doesn't apply to real files.)
>
> There is already a third-party library, chardet, which does this.
> Perhaps the std lib should include this? Perhaps chardet should be
> considered best-of-breed "atomic reactor", but the std lib could include
> a "battery" to do something similar. I don't think we ought to dismiss
> this idea out of hand.

I don't deny that chardet has its place, but would you use it like
this? (I'm assuming it works with Py3; the docs seem to imply Py2.)

import chardet  # third-party; pip install chardet

text = ""
with open("blah", "rb") as f:
    while True:
        data = f.read(256)
        if not data:
            break
        # Naive: a 256-byte chunk can split a multi-byte sequence,
        # and each chunk may guess a different encoding.
        text += data.decode(chardet.detect(data)['encoding'])

Certainly not. But that's what a file-open mode of "auto detect"
sounds like. At the very least, it has to do something like this
_until_ it has confidence; maybe it can retain the chardet state
after the first read, but it still has to decode whatever small
amount you read first. How can it handle this case?

first_char = open("blah", encoding="auto").read(1)

Somehow it needs to know how many bytes to read (and preferably not
read too many more - buffering a line or so is reasonable, buffering a
megabyte is not) and figure out where one character ends.
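Even with the encoding known in advance, "read one character" already forces the reader to buffer partial byte sequences; the stdlib does this with incremental decoders. A minimal stdlib-only sketch of the problem (the sample string is my own, purely illustrative):

```python
import codecs

data = "café".encode("utf-8")  # 5 bytes: 'é' encodes as two bytes

# Slicing mid-sequence breaks a plain decode:
try:
    data[:4].decode("utf-8")   # splits the two-byte 'é'
except UnicodeDecodeError:
    pass

# An incremental decoder buffers the partial sequence instead,
# emitting characters only once their bytes are complete.
dec = codecs.getincrementaldecoder("utf-8")()
chars = [dec.decode(bytes([b])) for b in data]  # feed one byte at a time
print("".join(chars))  # → café
```

An auto-detecting open() would need all of this machinery *plus* detection state that can revise its guess mid-stream.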

I see this as similar to the Python 2 input() function. It's not the
file-open builtin's job to do something as advanced and foot-shooting
as automatic charset detection. If you want that, you should be prepared
for its failures and the messes of partial reads, and call on chardet
yourself, same as you should use eval(input()) explicitly in Py3 (and,
in my opinion, eval(raw_input()) equally explicitly in Py2). I'm not
saying that chardet is bad, but I *am* saying, and I stand by this,
that an auto-detect option on file open is a bad idea.
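For a caller who does opt in explicitly, "be prepared for its failures" can be as simple as trying candidate codecs in order - a crude stdlib-only stand-in for chardet (the helper name and candidate list are my own, not a proposed API):

```python
def guess_decode(data, candidates=("utf-8", "latin-1")):
    """Try candidate encodings in order; return (text, encoding).

    latin-1 maps every byte value, so with it last the loop always
    "succeeds" -- which is exactly the kind of silent mis-decoding
    an explicit caller must knowingly accept.
    """
    for enc in candidates:
        try:
            return data.decode(enc), enc
        except UnicodeDecodeError:
            continue

text, enc = guess_decode("café".encode("utf-8"))
print(enc)  # → utf-8
```

The point is that the fallback behaviour is the caller's visible, deliberate choice - not something buried inside open().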

Unix comes with a 'file' command which will tell you even more about
what something is. (For what it thinks are text files, I believe it
uses heuristics similar to chardet's to guess an encoding.) Would you
want a parameter to the open() builtin that tries to read the file as
an image, an audio file, a document, or an executable, and
automatically decodes it to a PIL.Image, an mm.wave, etc., or executes
the code and returns its stdout, all entirely automatically? I don't
think so. Not open()'s job.

ChrisA

