[Python-Dev] What does a double coding cookie mean?

Serhiy Storchaka storchaka at gmail.com
Thu Mar 17 12:50:59 EDT 2016


On 17.03.16 16:55, Guido van Rossum wrote:
> On Thu, Mar 17, 2016 at 5:04 AM, Serhiy Storchaka <storchaka at gmail.com> wrote:
>>> Should we recommend that everyone use tokenize.detect_encoding()?
>>
>> Likely. However the interface of tokenize.detect_encoding() is not very
>> simple.
>
> I just found that out yesterday. You have to give it a readline()
> function, which is cumbersome if all you have is a (byte) string and
> you don't want to split it on lines just yet. And the readline()
> function raises SyntaxError when the encoding isn't right. I wish
> there were a lower-level helper that just took a line and told you
> what the encoding in it was, if any. Then the rest of the logic can be
> handled by the caller (including the logic of trying up to two lines).

The simplest way to detect the encoding of a bytes string:

     lines = data.splitlines()
     encoding = tokenize.detect_encoding(iter(lines).__next__)[0]

If you don't want to split all of the data into lines, the most efficient
way in Python 3.5 is:

     encoding = tokenize.detect_encoding(io.BytesIO(data).readline)[0]

In Python 3.5, creating io.BytesIO(data) has constant complexity: the data
is not copied.
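
For reference, here is a minimal self-contained sketch of the BytesIO approach. The helper name source_encoding and the sample data are made up for illustration; they are not part of the stdlib.

```python
import io
import tokenize


def source_encoding(data):
    """Return the source encoding declared in a bytes string.

    Hypothetical convenience wrapper, not a stdlib function.
    """
    # detect_encoding() calls readline at most twice and returns a
    # (encoding, lines_read) pair; only the encoding is needed here.
    return tokenize.detect_encoding(io.BytesIO(data).readline)[0]


# Made-up sample: the cookie is on the second line, after a shebang.
sample = b"#!/usr/bin/env python3\n# -*- coding: cp1251 -*-\nprint('hi')\n"
print(source_encoding(sample))            # cp1251
print(source_encoding(b"print('hi')\n"))  # utf-8 (no cookie, no BOM)
```

Note that detect_encoding() raises SyntaxError for an unknown or conflicting encoding declaration, so a caller handling untrusted input would probably want to catch that.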

In older versions, to detect the encoding without copying the data or
splitting all of it into lines, you should write a line iterator. For example:

     def iterlines(data):
         start = 0
         while True:
             end = data.find(b'\n', start) + 1
             if not end:
                 break
             yield data[start:end]
             start = end
         yield data[start:]

     encoding = tokenize.detect_encoding(iterlines(data).__next__)[0]

or

     it = (m.group() for m in re.finditer(b'.*\n?', data))
     encoding = tokenize.detect_encoding(it.__next__)[0]

I don't know which approach is more efficient.
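
One way to find out would be to time them. The following is only a rough sketch: the function names detect_find, detect_regex and detect_bytesio, the sample data and the repeat count are all arbitrary choices for illustration, and the numbers will depend on the Python version and on the input.

```python
import io
import re
import timeit
import tokenize


def iterlines(data):
    # Yield lines one at a time without splitting all of the data up front.
    start = 0
    while True:
        end = data.find(b'\n', start) + 1
        if not end:
            break
        yield data[start:end]
        start = end
    yield data[start:]


def detect_find(data):
    return tokenize.detect_encoding(iterlines(data).__next__)[0]


def detect_regex(data):
    it = (m.group() for m in re.finditer(b'.*\n?', data))
    return tokenize.detect_encoding(it.__next__)[0]


def detect_bytesio(data):
    return tokenize.detect_encoding(io.BytesIO(data).readline)[0]


# Made-up input: a coding cookie followed by many short lines.
data = b'# -*- coding: cp1251 -*-\n' + b'x = 1\n' * 100000

for func in (detect_find, detect_regex, detect_bytesio):
    seconds = timeit.timeit(lambda: func(data), number=1000)
    print(func.__name__, seconds)
```

Since detect_encoding() reads at most two lines, the timings mostly reflect the setup cost of each approach rather than the size of the data.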



