Tokenize a string or split on steroids
John Machin
sjmachin at lexicon.net
Sat Mar 9 16:53:31 EST 2002
Fernando Rodríguez <frr at wanadoo.es> wrote in message news:<4lpj8u4nbsoqjf2h4upadti62jh4sf4i59 at 4ax.com>...
> I need to tokenize a string using several separator characters, not just one
> as split().
>
> For example, I want a function that returns ['one', 'two'] when given the
> string '{one}{two}' .
>
> How can I do this? O:-)
re.split() has a plain vanilla flavour and a richer one -- banana
split :-)
>>> re.split(r'[{}]', '{one}{two}')
['', 'one', '', 'two', '']
>>> re.split(r'([{}])', '{one}{two}')
['', '{', 'one', '}', '', '{', 'two', '}', '']
>>>
You may want to handle the "empty" tokens that are returned by zapping
them with filter(None, your_token_list), or you may want to step through
the list applying some higher logic -- this depends on what you are
really trying to do.
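Putting those two steps together (a sketch; the separator set and input string are just the ones from your example):

```python
import re

# Split on either brace; re.split() leaves empty strings wherever
# two delimiters are adjacent or a delimiter starts/ends the string.
tokens = re.split(r'[{}]', '{one}{two}')
print(tokens)  # ['', 'one', '', 'two', '']

# filter(None, ...) drops the empty strings, keeping only real tokens.
tokens = list(filter(None, tokens))
print(tokens)  # ['one', 'two']
```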
However, let's review your requirements, as your example was a little
specific.
What output do you want for this input: 'foo{one}bar{two}zot'? Do you
want ['foo', 'one', 'bar', 'two', 'zot']? If so, use re.split. Do you
still want ['one', 'two']? If so, use re.findall.
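To make the difference concrete, here is a small sketch comparing the two approaches on that input (the findall pattern is my own; any regex that captures the text between braces would do):

```python
import re

s = 'foo{one}bar{two}zot'

# re.split keeps everything between and around the separators...
print(re.split(r'[{}]', s))  # ['foo', 'one', 'bar', 'two', 'zot']

# ...while re.findall extracts only the brace-delimited pieces,
# via the capturing group.
print(re.findall(r'\{([^{}]*)\}', s))  # ['one', 'two']
```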
For fancier tokenisation (other than of Python source), you will need
to go outside the Python distribution, e.g. to the mx.TextTools package.