generic tokenizer

Alex Martelli aleaxit at
Wed Sep 1 13:26:22 CEST 2004

Angus Mackay <yeah at> wrote:

> I remember python having a generic tokenizer in the library. all I want
> is to set a list of token separators and then read tokens out of a
> stream; the token separators should be returned as themselves.
> is there anything like this?

Not as such in the standard library: the functions in module tokenize
do not let you 'set a list of token separators'.  If what you're
tokenizing can fit in a string in memory, module re can help:

>>> import re
>>> x = re.compile(r'(\s+|,|;)')
>>> for w in x.split('a,b, c;d; e'): print repr(w), '+',
...
'a' + ',' + 'b' + ',' + '' + ' ' + 'c' + ';' + 'd' + ';' + '' + ' ' +
'e' +

Note that you get empty-string items when two separators abut.
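If the empty items get in the way, a comprehension drops them while
keeping the separators; a minimal sketch of that filtering step:

```python
import re

# same pattern as above: the capturing group makes split
# return the separators as list items too
x = re.compile(r'(\s+|,|;)')

# drop the empty strings produced when two separators abut
tokens = [w for w in x.split('a,b, c;d; e') if w]
# -> ['a', ',', 'b', ',', ' ', 'c', ';', 'd', ';', ' ', 'e']
```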

If the limitations of re.split (everything must fit in memory, &c) are
a problem, then the lex-like solutions I see somebody else suggested
may be more appropriate for your needs.
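As a middle ground, you can also drive re.split chunk by chunk over a
stream yourself.  A rough sketch (tokens_from_stream is a hypothetical
helper, not anything in the stdlib; it assumes single-character
separators, since a separator run straddling a chunk boundary would be
reported as two tokens):

```python
import re

def tokens_from_stream(stream, pattern=r'(\s+|,|;)', chunksize=4096):
    """Split a file-like object into tokens and separators without
    reading it all into memory.  Holds back the last piece of each
    chunk in case a token straddles a chunk boundary."""
    sep = re.compile(pattern)
    pending = ''
    while True:
        chunk = stream.read(chunksize)
        if not chunk:
            break
        parts = sep.split(pending + chunk)
        # the final piece may be an incomplete token: keep it
        # around and prepend it to the next chunk
        pending = parts.pop()
        for w in parts:
            if w:        # skip empties from abutting separators
                yield w
    if pending:
        yield pending
```

Used on a StringIO (or any open file), it yields the same sequence as
the filtered re.split above, one token at a time.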

