[Python-Dev] Small tweak to tokenize.py?

Guido van Rossum guido at python.org
Sat Dec 2 19:06:50 CET 2006


On 12/2/06, Fredrik Lundh <fredrik at pythonware.com> wrote:
> Guido van Rossum wrote:
>
> >> it would be a good thing if it could, optionally, be made to report
> >> horizontal whitespace as well.
> >
> > It's remarkably easy to get this out of the existing API
>
> sure, but it would be even easier if I didn't have to write that code
> myself (last time I did that, I needed a couple of tries before the
> parser handled all cases correctly...).
>
> but maybe this could simply be handled by a helper generator in the
> tokenizer module, that simply wraps the standard tokenizer generator
> and inserts whitespace tokens where necessary?

A helper sounds like a promising idea. Anyone interested in
volunteering a patch?

> > keep track
> > of the end position returned by the previous call, and if it's
> > different from the start position returned by the next call, slice the
> > line text from the column positions, assuming the line numbers are the
> > same. If the line numbers differ, something has been eating \n tokens;
> > this shouldn't happen any more with my patch.
>
> you'll still have to deal with multiline strings, right?

No, they are returned as a single token whose start and stop correctly
reflect line/col of the begin and end of the token. My current code
(based on the second patch I gave in this thread and the algorithm
described above) doesn't have to special-case anything except the
ENDMARKER token (to break out of its loop :-).
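A minimal sketch of the helper generator discussed above, under the assumptions stated in the thread: track the end position of the previous token, and when the next token starts later on the same line, slice the whitespace out of the line text. The names `tokens_with_whitespace` and the `WS` marker are illustrative, not part of tokenize or this thread:

```python
import io
import tokenize

# Hypothetical marker for the inserted whitespace pseudo-tokens;
# not a real constant from the token module.
WS = "WS"

def tokens_with_whitespace(readline):
    """Wrap tokenize.generate_tokens(), yielding extra (WS, ...) tuples
    for the horizontal whitespace between consecutive tokens."""
    prev_row, prev_col = 1, 0
    for tok in tokenize.generate_tokens(readline):
        tok_type, string, (srow, scol), (erow, ecol), line = tok
        # Only slice when the previous token ended on the same line;
        # if the line numbers differ, no whitespace token is emitted.
        if srow == prev_row and scol > prev_col:
            ws = line[prev_col:scol]
            yield (WS, ws, (prev_row, prev_col), (srow, scol), line)
        yield tok
        prev_row, prev_col = erow, ecol

source = "x = 1 +  2\n"
toks = list(tokens_with_whitespace(io.StringIO(source).readline))
ws_strings = [t[1] for t in toks if t[0] == WS]
```

Multiline strings need no special case here: they arrive as a single token whose end position already points past the closing quote, so the next gap check simply starts from there.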

-- 
--Guido van Rossum (home page: http://www.python.org/~guido/)
