Tokenize a string or split on steroids

John Machin sjmachin at lexicon.net
Sat Mar 9 16:53:31 EST 2002


Fernando Rodríguez <frr at wanadoo.es> wrote in message news:<4lpj8u4nbsoqjf2h4upadti62jh4sf4i59 at 4ax.com>...
> I need to tokenize a string using several separator characters, not just one
> as split().  
> 
> For example, I want a function that returns ['one', 'two'] when given the
> string '{one}{two}' .
> 
> How can I do this? O:-)
 
re.split() has a plain vanilla flavour and a richer one -- banana
split :-)

>>> re.split(r'[{}]', '{one}{two}')
['', 'one', '', 'two', '']
>>> re.split(r'([{}])', '{one}{two}')
['', '{', 'one', '}', '', '{', 'two', '}', '']
>>>

You may want to handle the "empty" tokens that are returned by zapping
them with filter(None, your_token_list), or you may want to step through
the list applying some higher logic -- this depends on what you are
really trying to do.
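
For example:

>>> filter(None, re.split(r'[{}]', '{one}{two}'))
['one', 'two']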

However, let's review your requirements, as your example was a little
specific.

What output do you want for this input: 'foo{one}bar{two}zot'? Do you
want ['foo', 'one', 'bar', 'two', 'zot']? If so, use re.split. Do you
still want ['one', 'two']? If so, use re.findall.
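
That is:

>>> re.split(r'[{}]', 'foo{one}bar{two}zot')
['foo', 'one', 'bar', 'two', 'zot']
>>> re.findall(r'\{([^{}]*)\}', 'foo{one}bar{two}zot')
['one', 'two']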

For fancier tokenisation (of anything other than Python source, which
the standard tokenize module already handles), you will need to go
outside the Python distribution, e.g. to the mx.TextTools package.
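
If your needs are only moderately fancy, a hand-rolled tokeniser built
on the re module may still do. A rough sketch -- the token classes here
are purely illustrative, adjust the patterns to taste:

import re

# One named alternative per token class; names are illustrative only.
token_pat = re.compile(r'''
      (?P<name>   [A-Za-z_]\w* )    # identifiers
    | (?P<number> \d+          )    # integer literals
    | (?P<brace>  [{}]         )    # the braces themselves
    | (?P<skip>   \s+          )    # whitespace, discarded below
''', re.VERBOSE)

def tokenize(s):
    tokens = []
    pos = 0
    while pos < len(s):
        m = token_pat.match(s, pos)
        if m is None:
            raise ValueError('bad character at position %d' % pos)
        if m.lastgroup != 'skip':
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens

>>> tokenize('{one}{two}')
[('brace', '{'), ('name', 'one'), ('brace', '}'),
 ('brace', '{'), ('name', 'two'), ('brace', '}')]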


