Looking for very simple general purpose tokenizer

Alan Kennedy alanmk at hotmail.com
Mon Jan 19 15:38:50 CET 2004

Maarten van Reeuwijk wrote:
> I need to parse various text files in python. I was wondering if
> there was a general purpose tokenizer available. 

Indeed there is: Python comes with batteries included. Try the shlex
module in the standard library.


Try the following code: it seems to do what you want. If it doesn't,
then please be more specific on your tokenisation rules.

import shlex
import StringIO

# characters that should act as token separators
splitchars = [' ', '\n', '=', '/',]

source = """
thisshouldcome inthree parts
thisshould comeintwo
"""

def prepareToker(toker, splitters):
  for s in splitters: # resists People's Front of Judea joke ;-D
    if toker.whitespace.find(s) == -1:
      toker.whitespace = "%s%s" % (s, toker.whitespace)
  return toker

buf = StringIO.StringIO(source)
toker = shlex.shlex(buf)
toker = prepareToker(toker, splitchars)
for num, tok in enumerate(toker):
  print "%s:%s" % (num, tok)

Note that the use of the iteration based interface in the above code
requires python 2.3. If you need it to run on previous versions,
specify which one.
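On earlier pythons, where shlex instances are not iterable, the
explicit get_token() loop does the same job. A sketch of that idiom
(with a made-up input string):

```python
import shlex

toker = shlex.shlex("thisshouldcome inthree parts")
while 1:
    tok = toker.get_token()
    if tok == toker.eof:  # eof is the empty string by default
        break
    print(tok)
```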


alan kennedy
check http headers here: http://xhaus.com/headers
email alan:              http://xhaus.com/contact/alan