[Python-Dev] What does a double coding cookie mean?
Guido van Rossum
guido at python.org
Thu Mar 17 15:11:04 EDT 2016
On Thu, Mar 17, 2016 at 9:50 AM, Serhiy Storchaka <storchaka at gmail.com> wrote:
> On 17.03.16 16:55, Guido van Rossum wrote:
>> On Thu, Mar 17, 2016 at 5:04 AM, Serhiy Storchaka <storchaka at gmail.com>
>>>> Should we recommend that everyone use tokenize.detect_encoding()?
>>> Likely. However the interface of tokenize.detect_encoding() is not very
>> I just found that out yesterday. You have to give it a readline()
>> function, which is cumbersome if all you have is a (byte) string and
>> you don't want to split it on lines just yet. And the readline()
>> function raises SyntaxError when the encoding isn't right. I wish
>> there were a lower-level helper that just took a line and told you
>> what the encoding in it was, if any. Then the rest of the logic can be
>> handled by the caller (including the logic of trying up to two lines).
> The simplest way to detect the encoding of a bytes string:
> lines = data.splitlines()
> encoding = tokenize.detect_encoding(iter(lines).__next__)
This will raise SyntaxError if the encoding is unknown. That needs to
be caught in mypy's case and then it needs to get the line number from
the exception. I tried this and it was too painful, so now I've just
changed the regex that mypy uses to use non-eager matching
> If you don't want to split all data on lines, the most efficient way in
> Python 3.5 is:
> encoding = tokenize.detect_encoding(io.BytesIO(data).readline)
> In Python 3.5 io.BytesIO(data) has constant complexity.
Ditto with the SyntaxError though.
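
For illustration, here is a minimal sketch of what catching that error could look like. The helper name safe_detect_encoding is hypothetical (it is not part of tokenize or mypy), and the sketch does not try to recover a line number from the exception. Note that tokenize.detect_encoding() returns a tuple of (encoding, lines_read), so the result has to be unpacked.

```python
import io
import tokenize


def safe_detect_encoding(data):
    """Hypothetical wrapper: return (encoding, error_message).

    Exactly one of the two values is None. `data` is the source as bytes.
    """
    try:
        # detect_encoding() returns (encoding, list_of_lines_it_consumed).
        encoding, _consumed = tokenize.detect_encoding(io.BytesIO(data).readline)
    except SyntaxError as exc:
        # Raised for an unknown codec name or an undecodable first line.
        return None, str(exc)
    return encoding, None
```

A caller would then check the second element of the result instead of wrapping every call site in try/except.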
> In older versions, to detect the encoding without copying the data or
> splitting all of it into lines, you should write a line iterator. For example:
> def iterlines(data):
>     start = 0
>     while True:
>         end = data.find(b'\n', start) + 1
>         if not end:
>             break
>         yield data[start:end]
>         start = end
>     yield data[start:]
> encoding = tokenize.detect_encoding(iterlines(data).__next__)
> Or:
> it = (m.group() for m in re.finditer(b'.*\n?', data))
> encoding = tokenize.detect_encoding(it.__next__)
> I don't know what approach is more efficient.
Having my own regex was simpler. :-(
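
As a rough sketch of the kind of lower-level helper wished for above (one that takes a single line and reports the cookie, if any), something like the following could work. The pattern is the one given in PEP 263; the names cookie_in_line, first_lines and find_cookie are made up for this example and are not mypy's actual code. The non-greedy ".*?" is what makes the first of two cookies on one line win. Unlike tokenize, this does not validate or normalize the codec name, and it ignores BOM handling.

```python
import re

# PEP 263 pattern, applied to bytes. The non-greedy ".*?" picks the FIRST cookie.
COOKIE_RE = re.compile(rb'^[ \t\f]*#.*?coding[:=][ \t]*([-\w.]+)')
# A line that is empty or contains only a comment; the second line is only
# consulted when the first line looks like this.
BLANK_RE = re.compile(rb'^[ \t\f]*(?:[#\r\n]|$)')


def cookie_in_line(line):
    """Return the encoding named in one line (as str), or None."""
    m = COOKIE_RE.match(line)
    return m.group(1).decode('ascii') if m else None


def first_lines(data, count=2):
    """Yield at most `count` leading lines without splitting all of `data`."""
    start = 0
    for _ in range(count):
        end = data.find(b'\n', start)
        if end == -1:
            yield data[start:]
            return
        yield data[start:end + 1]
        start = end + 1


def find_cookie(data):
    """Look for a coding cookie in the first two lines of `data` (bytes)."""
    for line in first_lines(data):
        encoding = cookie_in_line(line)
        if encoding is not None:
            return encoding
        if not BLANK_RE.match(line):
            # Real code on this line: a cookie on the next line does not count.
            return None
    return None
```

Because nothing here raises, the caller decides how to report an unknown codec, and it already knows which line the cookie came from.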
--Guido van Rossum (python.org/~guido)