On 12/2/06, Fredrik Lundh email@example.com wrote:
Guido van Rossum wrote:
it would be a good thing if it could, optionally, be made to report horizontal whitespace as well.
It's remarkably easy to get this out of the existing API
sure, but it would be even easier if I didn't have to write that code myself (last time I did that, I needed a couple of tries before the parser handled all cases correctly...).
but maybe this could simply be handled by a helper generator in the tokenizer module, that simply wraps the standard tokenizer generator and inserts whitespace tokens where necessary?
A helper sounds like a promising idea. Anyone interested in volunteering a patch?
keep track of the end position returned by the previous call, and if it's different from the start position returned by the next call, slice the line text from the column positions, assuming the line numbers are the same.If the line numbers differ, something has been eating \n tokens; this shouldn't happen any more with my patch.
you'll still have to deal with multiline strings, right?
No, they are returned as a single token whose start and stop correctly reflect line/col of the begin and end of the token. My current code (based on the second patch I gave in this thread and the algorithm described above) doesn't have to special-case anything except the ENDMARKER token (to break out of its loop :-).