Looking for very simple general purpose tokenizer
Eric Brunel
eric.brunel at N0SP4M.com
Mon Jan 19 05:38:52 EST 2004
Maarten van Reeuwijk wrote:
> Hi group,
>
> I need to parse various text files in Python. I was wondering if there was a
> general-purpose tokenizer available. I know about split(), but this
> (otherwise very handy) method does not allow me to specify a list of
> splitting characters, only one at a time, and it removes my splitting
> operators (OK for spaces and \n's, but not for =, /, etc.). Furthermore, I
> tried tokenize, but this is specific to Python and is way too heavy for me.
> I am looking for something like this:
>
>
> splitchars = [' ', '\n', '=', '/', ....]
> tokenlist = tokenize(rawfile, splitchars)
>
> Is there something like this available inside Python or did anyone already
> make this? Thank you in advance
You may use re.findall for that:
>>> import re
>>> s = "a = b+c; z = 34;"
>>> pat = " |=|;|[^ =;]*"
>>> re.findall(pat, s)
['a', ' ', '=', ' ', 'b+c', ';', ' ', 'z', ' ', '=', ' ', '34', ';', '']
The pattern basically says: match either a space, a '=', a ';', or a non-empty
sequence of characters that are neither space, '=' nor ';' (using '+' rather
than '*' avoids matching empty strings). You may have to take care beforehand
of special characters like \n or \, which are very special in regular
expressions.
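
If you want the exact tokenize(rawfile, splitchars) interface from your
example, you can build the pattern from the character list. A minimal,
untested sketch (the name and arguments just mirror your example, and
re.escape takes care of escaping the special characters for you):

import re

def tokenize(text, splitchars):
    # One character class holding all splitting characters, escaped so
    # that characters like \ or ^ cannot break the pattern.
    cls = "".join(re.escape(c) for c in splitchars)
    # A token is either a single splitting character or a run of
    # anything else; the splitting characters are kept as tokens.
    return re.findall("[%s]|[^%s]+" % (cls, cls), text)

>>> tokenize("a = b+c; z = 34;", [' ', '\n', '=', ';'])
['a', ' ', '=', ' ', 'b+c', ';', ' ', 'z', ' ', '=', ' ', '34', ';']

If you don't care about keeping the whitespace tokens, just filter them out
afterwards with something like [t for t in tokens if t.strip()].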
HTH
--
- Eric Brunel <eric dot brunel at pragmadev dot com> -
PragmaDev : Real Time Software Development Tools - http://www.pragmadev.com