[Python-Dev] issue2180 and using 'tokenize' with Python 3 'str's

Nick Coghlan ncoghlan at gmail.com
Tue Sep 28 14:09:48 CEST 2010


On Tue, Sep 28, 2010 at 9:29 PM, Michael Foord
<fuzzyman at voidspace.org.uk> wrote:
>  On 28/09/2010 12:19, Antoine Pitrou wrote:
>> On Mon, 27 Sep 2010 23:45:45 -0400
>>> Steve Holden <steve at holdenweb.com> wrote:
>>> On 9/27/2010 11:27 PM, Benjamin Peterson wrote:
>>>> Tokenize only works on bytes. You can open a feature request if you
>>>> desire.
>>>>
>>> Working only on bytes does seem rather perverse.
>>
>> I agree, the morality of bytes objects could have been better :)
>>
> The reason for working with bytes is that source data can only be correctly
> decoded to text once the encoding is known. The encoding is determined by
> reading the encoding cookie.
>
> I certainly wouldn't be opposed to an API that accepts a string as well
> though.
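
(For context, the bytes-level step Michael refers to is
tokenize.detect_encoding, which reads at most the first two lines of
the source looking for a coding cookie. A minimal sketch, with an
illustrative byte string:)

    import io
    import tokenize

    data = b"# -*- coding: latin-1 -*-\nx = 1\n"
    # detect_encoding returns the encoding name plus any lines it
    # consumed while looking for the cookie
    encoding, consumed = tokenize.detect_encoding(io.BytesIO(data).readline)
    print(encoding)  # 'iso-8859-1' (tokenize normalizes 'latin-1')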

A very quick scan of _tokenize suggests it is designed to accept an
encoding of None, indicating that the line iterator will return
already decoded lines (str rather than bytes). This is confirmed by
the fact that the standard library itself uses it that way (via
generate_tokens).
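
For example, something along these lines already works (a minimal
sketch; the sample source string is just for illustration):

    import io
    import tokenize

    source = "x = 1 + 2\n"
    # generate_tokens skips encoding detection entirely and expects
    # readline to yield already decoded lines (str rather than bytes)
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        print(tok)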

An API that accepts a string, wraps a StringIO around it, then calls
_tokenize with an encoding of None would appear to be the answer here.
A feature request on the tracker is the best way to make that happen.
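
A rough sketch of what that might look like (assuming _tokenize keeps
its current private (readline, encoding) signature; the
tokenize_string name is just a placeholder):

    import io
    from tokenize import _tokenize

    def tokenize_string(source):
        # Wrap the str in a StringIO and pass encoding=None, skipping
        # encoding detection so readline yields str lines directly
        return _tokenize(io.StringIO(source).readline, None)

That keeps the existing bytes-based tokenize() entry point unchanged
while giving str input a properly supported spelling.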

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia

