simple string parsing ?

Fri Sep 10 05:40:41 EDT 2004

TAG <tonino.greco at gmail.com> wrote:

> > ((Of course, you ARE restricted to what Python considers 'tokens' so you
> > may need some postprocessing if you need a slightly different notion of
> > tokens))
> 
> luckily they should all be - but in the case that they are not - how
> can I check it?

With a little post-processing.  Say, for example, that you need := and :+
to be seen as single tokens; here's a Python 2.4 approach:

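import cStringIO
import tokenize

# Maps a token to the set of following tokens it should be merged with.
mergers = {':' : set('=+'), }

def tokens_of(x):
    # Iterate over just the token strings (item 1 of each token tuple).
    it = peekahead_iterator(toktuple[1] for toktuple in
            tokenize.generate_tokens(cStringIO.StringIO(x).readline)
         )
    for tok in it:
        if it.preview in mergers.get(tok, ()):
            yield tok+it.preview
            it.next()   # consume the token that was just merged
        else:
            yield tok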

x = 'fup(z:=97, y:+45):zap'
print list(tokens_of(x))

result is (the trailing '' comes from tokenize's ENDMARKER token):

['fup', '(', 'z', ':=', '97', ',', 'y', ':+', '45', ')', ':', 'zap', '']

Of course, you do need the handy 'peekahead_iterator', say something
like:

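class peekahead_iterator(object):
    # Wraps any iterable; .preview always holds the item that the next
    # call to next() will return, or the 'nothing' sentinel at the end.
    class nothing: pass
    def __init__(self, it):
        self._nit = iter(it).next
        self.preview = None
        self._step()    # prime .preview with the first item
    def __iter__(self): return self
    def next(self):
        result = self._step()
        if result is self.nothing: raise StopIteration
        else: return result
    def _step(self):
        # Return the current preview and advance preview by one item.
        result = self.preview
        try: self.preview = self._nit()
        except StopIteration: self.preview = self.nothing
        return result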
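
The code as posted relies on Python 2 features (cStringIO, the .next()
method).  For readers on Python 3, here is a rough sketch of the same
merging idea; it is an adaptation, not the code from this post.  It swaps
the peekahead class for a plain "pending token" variable, and it drops the
NEWLINE/NL/ENDMARKER tokens so the output has no trailing '' entries.  The
names merged_tokens and SKIP are made up for this illustration.  Note that
Python 3.8 and later already tokenize := as a single token, so there only
:+ actually gets merged.

```python
import io
import tokenize

# Maps a token to the set of following tokens it should be merged with.
mergers = {':': set('=+')}

# Token types that carry no visible text (an assumption of this sketch:
# the caller does not care about line structure).
SKIP = (tokenize.NEWLINE, tokenize.NL, tokenize.ENDMARKER)

def merged_tokens(text):
    """Yield the token strings of text, merging pairs listed in mergers."""
    readline = io.StringIO(text).readline
    toks = (t.string for t in tokenize.generate_tokens(readline)
            if t.type not in SKIP)
    pending = next(toks, None)      # token waiting to be emitted
    for upcoming in toks:           # plays the role of 'preview'
        if upcoming in mergers.get(pending, ()):
            yield pending + upcoming
            pending = next(toks, None)   # step past the merged token
        else:
            yield pending
            pending = upcoming
    if pending is not None:
        yield pending

x = 'fup(z:=97, y:+45):zap'
print(list(merged_tokens(x)))
# ['fup', '(', 'z', ':=', '97', ',', 'y', ':+', '45', ')', ':', 'zap']
```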

Splitting one token into several is easier (no peeking ahead is needed).
But both splitting and merging are fine, as long as the deviations
between what you want to see as tokens and what Python considers tokens
are minor.  If you have BIG divergences -- e.g., you do not want to
support triple-quoted strings as single tokens -- then you may be better
off with a completely different approach, as others have suggested.
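
To illustrate the splitting direction, here is a minimal sketch, written
for Python 3 rather than 2.4.  It assumes, purely for the sake of example,
that you want Python's ** and << tokens reported as two single-character
tokens each; the splitters table and the name split_tokens_of are
hypothetical.  No lookahead is involved: each token is either passed
through or replaced by its pieces.

```python
import io
import tokenize

# Hypothetical table: each key is a Python token to break up, each value
# is the sequence of pieces to emit in its place.
splitters = {'**': ('*', '*'), '<<': ('<', '<')}

# Token types with no visible text, dropped to keep the output tidy.
SKIP = (tokenize.NEWLINE, tokenize.NL, tokenize.ENDMARKER)

def split_tokens_of(text):
    """Yield the token strings of text, splitting those in splitters."""
    readline = io.StringIO(text).readline
    for tok in tokenize.generate_tokens(readline):
        if tok.type in SKIP:
            continue
        # Tokens not in the table come back as a one-item tuple.
        for piece in splitters.get(tok.string, (tok.string,)):
            yield piece

print(list(split_tokens_of('a**b << c')))
# ['a', '*', '*', 'b', '<', '<', 'c']
```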

Alex