[Tutor] how to cope with text literals when parsing code?

Mon May 17 19:21:55 EDT 2004

On Mon, 2004-05-17 at 11:12, denis wrote:
> Hello,
> 
> In a module that parses Python code, I have a symbol_row class that builds a
> list of meaningful symbols (roughly what Guido van Rossum calls 'tokens' in
> the language reference) out of a line of code. It splits the line and
> defines each symbol's nature (e.g. keyword) and role (e.g. operator).
> Everything works fine, but I'm not satisfied with how it's done.
> 
> The splitline function, among other problems, has to cope with the
> well-known problem of explicit texts (so-called 'literals') that can hold
> anything, especially all kinds of signs that are used as marks for
> splitting. I couldn't find any smart and elegant algorithm for that.
> This is how I do it:
> 
> -1- find, read, store and replace the explicit texts by a placeholder
> (' $$$ ')
> -2- split the line
> -3- replace the placeholder by the original texts
> 
> I don't like that solution 'instinctively'; it hurts my sense of
> aesthetics, so to speak ;-)
> Also, it's not a general solution: it works only because the code can't
> hold just anything (any character or sequence of characters); or rather because
> if it does hold anything, it's not a valid piece of code and the problem of
> explicit text isn't relevant anymore.
> 
> Well, I would be happy to hear about alternative algorithms.
> (The guilty function is below; its names and comments are in a kind of English.)

Hi Denis,

Do you have specific requirements that call for you to do everything
yourself? If not, there is a standard module called tokenize:

http://docs.python.org/lib/module-tokenize.html

>>> import tokenize
>>> f = open('temp.py')
>>> [tt[1] for tt in tokenize.generate_tokens(f.readline)]

[' ', 'def', 'split_line', '(', 'self', ')', ':', '\n', ...]
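
To show how tokenize deals with the literal problem specifically, here is a small self-contained sketch. It assumes Python 3 (where tokens carry a .string attribute) and feeds a made-up line from memory instead of a file; the helper name token_strings is only for illustration.

```python
import io
import tokenize

def token_strings(source):
    """Return the text of each token in source, skipping empty/whitespace ones."""
    # generate_tokens wants a readline-style callable, so wrap the string.
    readline = io.StringIO(source).readline
    return [tok.string
            for tok in tokenize.generate_tokens(readline)
            if tok.string.strip()]

# The literal contains a comma, a colon and parentheses, all of which
# would normally be splitting marks.
line = "name = 'a, b: (c)' + other  # note\n"
print(token_strings(line))
```

The quoted text comes back as one token, separators and all, so no placeholder step is needed.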

If you need to do all the work yourself for some reason, the source code
in tokenize.py may help you out.
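
As far as I know, tokenize.py avoids the placeholder trick by matching with one large regular expression in which string literals are one of the alternatives, so a literal is consumed whole before any splitting mark inside it can be seen. A much reduced sketch of that idea follows. The pattern and the names TOKEN_RE and split_line are my own simplifications, not what tokenize.py actually contains, and it ignores triple-quoted strings, string prefixes and multi-character operators.

```python
import re

# One alternative per kind of symbol; the quoted-literal alternative comes
# first so that signs inside quotes are never seen as separate symbols.
TOKEN_RE = re.compile(r"""
    (?P<string>'(?:\\.|[^'\\])*'|"(?:\\.|[^"\\])*")   # quoted literal, escapes allowed
  | (?P<comment>\#.*)                                 # comment up to end of line
  | (?P<name>[A-Za-z_]\w*)                            # identifier or keyword
  | (?P<number>\d+(?:\.\d*)?)                         # simple integer or float
  | (?P<op>[^\s\w])                                   # any other single sign
""", re.VERBOSE)

def split_line(line):
    """Return (kind, text) pairs for one line; whitespace is skipped."""
    return [(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(line)]

for kind, text in split_line('x = "a + b" + \'c#d\'  # end'):
    print(kind, repr(text))
```

Because the scan goes left to right in a single pass, there is nothing to store and put back afterwards.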

Good luck.

Rich