simple string parsing ?
Alex Martelli
aleaxit at yahoo.com
Fri Sep 10 05:40:41 EDT 2004
TAG <tonino.greco at gmail.com> wrote:
> > ((Of course, you ARE restricted to what Python considers 'tokens' so you
> > may need some postprocessing if you need a slightly different notion of
> > tokens))
>
> luckily they should all be - but in the case that they are not - how
> can I check it ?
With a little post-processing. Say for example that you need := and :+
to be seen as single tokens; here's a Python 2.4 approach...:

    import cStringIO, tokenize

    # maps a token to the set of following tokens it should merge with
    mergers = {':' : set('=+'), }

    def tokens_of(x):
        it = peekahead_iterator(toktuple[1] for toktuple in
            tokenize.generate_tokens(cStringIO.StringIO(x).readline)
            )
        for tok in it:
            if it.preview in mergers.get(tok, ()):
                # glue the two tokens together, then skip the second one
                yield tok+it.preview
                it.next()
            else:
                yield tok

    x = 'fup(z:=97, y:+45):zap'
    print list(tokens_of(x))

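The result is:

    ['fup', '(', 'z', ':=', '97', ',', 'y', ':+', '45', ')', ':', 'zap', '']

(The trailing '' is the end-marker token that tokenize emits at the end
of the input; filter it out if you don't want it.)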
Of course, you do need the handy 'peekahead_iterator', say something
like:

    class peekahead_iterator(object):
        # sentinel stored in .preview once the underlying iterator is exhausted
        class nothing: pass

        def __init__(self, it):
            self._nit = iter(it).next
            self.preview = None
            self._step()    # load the first item into .preview

        def __iter__(self): return self

        def next(self):
            result = self._step()
            if result is self.nothing: raise StopIteration
            else: return result

        def _step(self):
            # return the current preview and advance preview by one item
            result = self.preview
            try: self.preview = self._nit()
            except StopIteration: self.preview = self.nothing
            return result

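
For anyone reading this on Python 3, where cStringIO and the .next()
method no longer exist, here is a rough port of the same idea. This is a
sketch, not the code from this post: the names PeekaheadIterator, NOTHING
and MERGERS are mine, and I assume you want whitespace-only tokens (the
end marker and the implicit newline) dropped, so the result has no
trailing ''. Note also that Python 3.8 and later already tokenize := as a
single token, so only :+ actually needs merging there.

```python
import io
import tokenize


class PeekaheadIterator:
    """Iterator wrapper that exposes the upcoming item as .preview."""
    NOTHING = object()  # sentinel: the underlying iterator is exhausted

    def __init__(self, iterable):
        self._it = iter(iterable)
        self.preview = self.NOTHING
        self._step()  # load the first item into .preview

    def __iter__(self):
        return self

    def __next__(self):
        result = self._step()
        if result is self.NOTHING:
            raise StopIteration
        return result

    def _step(self):
        # Return the current preview and advance preview by one item.
        result = self.preview
        self.preview = next(self._it, self.NOTHING)
        return result


# Assumed merge table: a token mapped to the following tokens it absorbs.
MERGERS = {':': set('=+')}


def tokens_of(source):
    """Yield token strings, merging pairs listed in MERGERS."""
    readline = io.StringIO(source).readline
    it = PeekaheadIterator(
        tok.string
        for tok in tokenize.generate_tokens(readline)
        if tok.string.strip()  # drop end-marker/newline/indent tokens
    )
    for tok in it:
        if it.preview in MERGERS.get(tok, ()):
            yield tok + next(it)  # consume the token we just merged
        else:
            yield tok


print(list(tokens_of('fup(z:=97, y:+45):zap')))
```

Like the original, this merges two tokens even when whitespace separates
them in the source; compare the tokens' start and end positions if that
matters to you.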
Splitting one token into several is easier (no peeking ahead is needed).
But both splitting and merging are fine, as long as the deviations
between what you want to see as tokens and what Python considers tokens
are minor. If you have BIG divergences -- e.g., you do not want to
support triple-quoted strings as single tokens -- then you may be better
off with a completely different approach, as others have suggested.
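
To illustrate the splitting case, here is a minimal Python 3 sketch. The
SPLITTERS table and the split_tokens name are hypothetical, chosen only
for the example: it pretends you want ** and -> to come out as two
one-character tokens each. As before, it assumes whitespace-only tokens
should be dropped.

```python
import io
import tokenize

# Hypothetical split table: a Python token mapped to the pieces wanted instead.
SPLITTERS = {'**': ('*', '*'), '->': ('-', '>')}


def split_tokens(source):
    """Yield token strings, replacing any token listed in SPLITTERS by its pieces."""
    readline = io.StringIO(source).readline
    for tok in tokenize.generate_tokens(readline):
        if not tok.string.strip():
            continue  # skip end-marker/newline/indent tokens
        # Each token is handled on its own, so no lookahead is required.
        yield from SPLITTERS.get(tok.string, (tok.string,))


print(list(split_tokens('a**b -> c')))
```

Each token is looked up and expanded independently, which is why no
peekahead wrapper is needed here.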
Alex