Alan Kennedy alanmk at hotmail.com
Mon Jan 19 15:38:50 CET 2004

Maarten van Reeuwijk wrote:
> I need to parse various text files in python. I was wondering if
> there was a general purpose tokenizer available. 

Indeed there is: Python comes with batteries included. Try the shlex
module from the standard library.
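
As a quick illustration of what shlex does out of the box (a small
sketch in present-day Python syntax; the sample string and the helper
name default_tokens are made up for the example), characters such as
'=' and '/' are neither word characters nor whitespace by default, so
they come back as separate one-character tokens:

```python
import shlex

def default_tokens(text):
    """Tokenise text with an unmodified, non-POSIX shlex lexer."""
    # shlex.shlex accepts a string (or a file-like object) as its input.
    return list(shlex.shlex(text))

# Hypothetical sample input, not taken from the original question.
print(default_tokens("name=value /tmp/x"))
```

If the separators should be discarded rather than returned, they have
to be added to the lexer's whitespace attribute. Note also that shlex
treats '#' as a comment start and honours quote characters by default.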


Try the following code: it seems to do what you want. If it doesn't,
then please be more specific on your tokenisation rules.

splitchars = [' ', '\n', '=', '/',]

source = """
thisshouldcome inthree parts
thisshould comeintwo
"""

import shlex
import StringIO

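def prepareToker(toker, splitters):
  # Add each splitter to the lexer's whitespace set, so that it
  # separates tokens and is not itself returned as a token.
  for s in splitters: # resists People's Front of Judea joke ;-D
    if toker.whitespace.find(s) == -1:
      toker.whitespace = "%s%s" % (s, toker.whitespace)
  return toker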

buf = StringIO.StringIO(source)
toker = shlex.shlex(buf)
toker = prepareToker(toker, splitchars)
for num, tok in enumerate(toker):
  print "%s:%s" % (num, tok)

Note that the use of the iteration-based interface in the above code
requires Python 2.3. If you need it to run on previous versions,
specify which one.
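
For what it's worth, the iteration can also be replaced by an explicit
get_token() loop. The following is only a sketch, written in
present-day Python syntax (io.StringIO and the print function; an old
2.x interpreter would want StringIO.StringIO and the print statement
instead). It assumes that get_token() returns an empty string at end of
input, which is the case for a non-POSIX lexer. The names tokenize and
SPLITCHARS are invented for the example.

```python
import io
import shlex

SPLITCHARS = [' ', '\n', '=', '/']

def tokenize(text, splitters=SPLITCHARS):
    """Return the tokens of text using an explicit get_token() loop."""
    toker = shlex.shlex(io.StringIO(text))
    # Treat every splitter as whitespace so it separates tokens
    # and is dropped from the output.
    for s in splitters:
        if s not in toker.whitespace:
            toker.whitespace += s
    tokens = []
    while True:
        tok = toker.get_token()
        if not tok:  # assumption: empty string marks end of input
            break
        tokens.append(tok)
    return tokens

print(tokenize("thisshouldcome inthree parts\nthisshould comeintwo\n"))
```

Collecting the tokens in a list keeps the loop condition in one place;
the caller can then number them however it likes.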


alan kennedy
