[Tutor] how to cope with text literals when parsing code?

Mon May 17 11:12:32 EDT 2004

Hello,

In a module that parses Python code, I have a symbol_row class that builds a
list of meaningful symbols (roughly what Guido van Rossum calls 'tokens' in
the language reference) out of a line of code. It splits the line and
defines each symbol's nature (e.g. keyword) and role (e.g. operator).
Everything works fine, but I'm not satisfied with how it's done.

The split_line function, among other problems, has to cope with the
well-known problem of explicit texts (so-called 'literals') that can hold
anything, especially all kinds of signs that are used as marks for
splitting. I couldn't find any smart and elegant algorithm for that.
This is how I do it:

-1- find, read and store the explicit texts, replacing each one with a
placeholder (' $$$ ')
-2- split the line
-3- replace the placeholders with the original texts
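
The three steps can be sketched in a few lines. This is only an
illustration under stated assumptions, not the module's actual code: the
names STRING_RE and split_protected are hypothetical, and the pattern
recognizes only simple single-line '...' and "..." texts with backslash
escapes (no triple quotes, no prefixes). Sign handling is left out.

```python
import re

# Hypothetical pattern: simple single-line string literals only
# (backslash escapes allowed; no triple quotes, no r/u prefixes).
STRING_RE = re.compile(r'"(?:\\.|[^"\\])*"|\'(?:\\.|[^\'\\])*\'')
PLACEHOLDER = '$$$'

def split_protected(line):
    """Split `line` on whitespace while keeping explicit texts whole."""
    texts = []

    def stash(match):
        # Step 1: store the text and leave a spaced placeholder behind.
        texts.append(match.group(0))
        return ' ' + PLACEHOLDER + ' '

    protected = STRING_RE.sub(stash, line)
    # Step 2: split the line (here on whitespace only).
    symbols = protected.split()
    # Step 3: put the texts back, in order of appearance.
    remaining = iter(texts)
    return [next(remaining) if s == PLACEHOLDER else s for s in symbols]
```

For example, split_protected('print "a = b", x') keeps "a = b" as one
symbol even though it contains spaces and a sign.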

I don't like that solution 'instinctively'; so to say, it hurts my sense of
aesthetics ;-)
Also, it's not a general solution: it works only because the code can't
hold just anything (any character or sequence of characters); or rather
because if it does hold something else, it's not a valid piece of code and
the problem of explicit texts isn't relevant anymore.
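
A quick way to check that assumption: the character '$' has no meaning in
Python source outside of explicit texts and comments, so a line holding the
placeholder anywhere else does not compile. The helper name is_valid_code
is hypothetical. The last call shows a caveat: comments can hold anything
too, so they would presumably need the same protection as explicit texts.

```python
def is_valid_code(source):
    """Return True if `source` compiles as Python code."""
    try:
        compile(source, '<line>', 'exec')
    except SyntaxError:
        return False
    return True

print(is_valid_code("x = 1"))          # True
print(is_valid_code("x = $$$"))        # False: '$' is not valid in code
print(is_valid_code("x = '$$$'"))      # True: inside an explicit text
print(is_valid_code("x = 1  # $$$"))   # True: inside a comment
```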

Well, I would be happy to hear about alternative algorithms.
(The guilty function is below; it's named and commented in a kind of English.)

denis

********************************************
    def split_line(self):
        """
        A task that seems easy.
        """
        line = self.line
        # First, replace the texts (that could hold signs) by
        # placeholders: '$$$', surrounded with spaces
        pos, texts, placeholder = 0, [], '$$$'
        while pos < len(line):
            if line[pos] in quotes:
                # read the text
                chars, quote = line[pos:], line[pos]
                text = self.text_read(chars)
                size = len(text)
                if size == 0:   # '' was returned by text_read()
                    print('Error found while reading explicit text '
                          'at position: %d' % pos)
#===============debug=============================
                    return []
                texts.append(text)  # save the text
                # replace the text with a placeholder
                line = line[:pos] + \
                       space + placeholder + space + \
                       line[pos+size:]
                pos += len(placeholder) + 2    # placeholder + 2 spaces
            else:
                pos += 1
        #
        # Then, read across the line
        # to surround all signs with spaces.
        # We can't simply use replace(),
        # because some signs (<) are part of others (<=).
        # The signs made of two chars are tested first.
        pos = 0
        while pos < len(line):
            sign_found = False
            for sign in signs:
                if line[pos:].startswith(sign):
                    size = len(sign)
                    line = line[:pos] + \
                           space + sign + space + \
                           line[pos+size:]
                    pos += size + 2     # sign + 2 spaces around
                    sign_found = True
                    break               # don't check other signs!
            if not sign_found:
                pos += 1
        # now, erase useless spaces
        line = line.strip()
        two_spaces = space * 2
        while two_spaces in line:
            line = line.replace(two_spaces, space)
        #
        # finally split the line...
        self.symbols = line.split(space)
        # ...and replace the placeholders with the original texts
        iText = 0
        for iSymbol, symbol in enumerate(self.symbols):
            if symbol == placeholder:
                self.symbols[iSymbol] = texts[iText]
                iText += 1
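
The second loop depends on the two-character signs being tested before the
one-character ones. Below is a minimal sketch of that idea on its own. The
SIGNS list is hypothetical (the module's real `signs` sequence is not shown
above), and sorting it by decreasing length is one possible way to guarantee
the order; the pieces are collected in a list rather than re-slicing the line.

```python
# Hypothetical sign list; the real `signs` sequence is not shown in the post.
SIGNS = ['<', '<=', '=', '==', '+', '(', ')']
# Longest first, so that '<=' is tried before '<'.
SIGNS_BY_LENGTH = sorted(SIGNS, key=len, reverse=True)

def space_out_signs(line):
    """Return `line` with every sign separated by single spaces."""
    pieces, pos = [], 0
    while pos < len(line):
        for sign in SIGNS_BY_LENGTH:
            if line.startswith(sign, pos):
                pieces.append(' ' + sign + ' ')
                pos += len(sign)
                break               # don't check other signs
        else:
            # No sign starts here: copy the character unchanged.
            pieces.append(line[pos])
            pos += 1
    # Collapse the runs of spaces introduced above.
    return ' '.join(''.join(pieces).split())
```

With this list, space_out_signs('a<=b') gives 'a <= b' rather than
'a < = b'.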