[Python-Dev] What does a double coding cookie mean?
Serhiy Storchaka
storchaka at gmail.com
Thu Mar 17 12:50:59 EDT 2016
On 17.03.16 16:55, Guido van Rossum wrote:
> On Thu, Mar 17, 2016 at 5:04 AM, Serhiy Storchaka <storchaka at gmail.com> wrote:
>>> Should we recommend that everyone use tokenize.detect_encoding()?
>>
>> Likely. However the interface of tokenize.detect_encoding() is not very
>> simple.
>
> I just found that out yesterday. You have to give it a readline()
> function, which is cumbersome if all you have is a (byte) string and
> you don't want to split it on lines just yet. And the readline()
> function raises SyntaxError when the encoding isn't right. I wish
> there were a lower-level helper that just took a line and told you
> what the encoding in it was, if any. Then the rest of the logic can be
> handled by the caller (including the logic of trying up to two lines).
The simplest way to detect the encoding of a bytes string is:
    import tokenize
    lines = data.splitlines()
    encoding = tokenize.detect_encoding(iter(lines).__next__)[0]
If you don't want to split all of the data into lines, the most efficient
way in Python 3.5 is:
    import io
    encoding = tokenize.detect_encoding(io.BytesIO(data).readline)[0]
In Python 3.5, creating io.BytesIO(data) takes constant time.
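As a minimal sketch, the BytesIO approach can be wrapped in a small helper that takes a bytes string directly. The name encoding_of is hypothetical, and returning None when detect_encoding() raises SyntaxError is only one possible policy, chosen here for illustration:

```python
import io
import tokenize


def encoding_of(data):
    """Return the source encoding of a bytes string, or None on a bad cookie.

    Hypothetical helper: it only wraps tokenize.detect_encoding().
    """
    try:
        # detect_encoding() returns (encoding, lines_read); only the
        # encoding is needed here.
        encoding, _lines_read = tokenize.detect_encoding(
            io.BytesIO(data).readline)
    except SyntaxError:
        # Raised for an unknown or invalid encoding declaration.
        # Swallowing it is an assumption of this sketch.
        return None
    return encoding


print(encoding_of(b"# -*- coding: latin-1 -*-\nx = 1\n"))  # iso-8859-1
print(encoding_of(b"x = 1\n"))  # utf-8 (the default)
```

Note that detect_encoding() normalizes the name, so a latin-1 cookie is reported as iso-8859-1.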
In older versions, to detect the encoding without copying the data or
splitting all of it into lines, you have to write a line iterator. For example:
    def iterlines(data):
        start = 0
        while True:
            end = data.find(b'\n', start) + 1
            if not end:
                break
            yield data[start:end]
            start = end
        yield data[start:]
    encoding = tokenize.detect_encoding(iterlines(data).__next__)[0]
or
    import re
    it = (m.group() for m in re.finditer(b'.*\n?', data))
    encoding = tokenize.detect_encoding(it.__next__)[0]
I don't know which approach is more efficient.
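One way to find out would be to time all of the variants on the same input. The following is only a rough sketch: the via_* wrapper names and the sample data are made up for illustration, and the numbers will depend on the Python version and on the size of the data.

```python
import io
import re
import timeit
import tokenize


def iterlines(data):
    # Yield lines (with their trailing newline) without splitting all data.
    start = 0
    while True:
        end = data.find(b'\n', start) + 1
        if not end:
            break
        yield data[start:end]
        start = end
    yield data[start:]


def via_splitlines(data):
    return tokenize.detect_encoding(iter(data.splitlines()).__next__)[0]


def via_bytesio(data):
    return tokenize.detect_encoding(io.BytesIO(data).readline)[0]


def via_iterlines(data):
    return tokenize.detect_encoding(iterlines(data).__next__)[0]


def via_regex(data):
    it = (m.group() for m in re.finditer(b'.*\n?', data))
    return tokenize.detect_encoding(it.__next__)[0]


# Made-up sample: a coding cookie followed by many short lines.
sample = b"# -*- coding: utf-8 -*-\n" + b"x = 1\n" * 100000

for func in (via_splitlines, via_bytesio, via_iterlines, via_regex):
    # Bind func as a default argument so each lambda times its own function.
    seconds = timeit.timeit(lambda f=func: f(sample), number=20)
    print("%-15s %.6f s" % (func.__name__, seconds))
```

Only the splitlines variant touches the whole input up front, so the gap between it and the others should grow with the size of the data.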