regex help: splitting string gets weird groups

Patrick Maupin pmaupin at gmail.com
Thu Apr 8 17:06:05 EDT 2010


On Apr 8, 3:40 pm, gry <georgeryo... at gmail.com> wrote:
> >    >>> s='555tHe-rain.in#=1234'
> >    >>> import re
> >    >>> r=re.compile(r'([a-zA-Z]+|\d+|.)')
> >    >>> r.findall(s)
> >    ['555', 'tHe', '-', 'rain', '.', 'in', '#', '=', '1234']
>
> This is nice and simple and has the invertible property that Patrick
> mentioned above.  Thanks much!

Yes, like using split(), this is invertible.  But you will see a
difference (and for a given task, you might prefer one way or the
other) if, for example, you put a few consecutive spaces in the middle
of your string: this pattern with findall() will return each
space individually, while split() will return them all together.
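
To make the difference concrete, here is a small sketch.  The findall()
pattern is gry's; the split() pattern is only an assumption for
illustration (it matches runs of letters or digits and nothing else),
since the splitter used earlier in the thread is not quoted here.

```python
import re

s = '555tHe   rain'

# gry's pattern: the bare '.' alternative matches one character at a time,
# so each space comes back as its own list element.
finder = re.compile(r'([a-zA-Z]+|\d+|.)')
found = finder.findall(s)
print(found)    # ['555', 'tHe', ' ', ' ', ' ', 'rain']

# Hypothetical split() pattern (an assumption, not the one from the thread).
# Text that no alternative matches is returned as a single string between
# the neighbouring matches, so the three spaces stay together.
splitter = re.compile(r'([a-zA-Z]+|\d+)').split
pieces = splitter(s)
print(pieces)   # ['', '555', '', 'tHe', '   ', 'rain', '']

# Both results are invertible: joining them gives back the input.
assert ''.join(found) == s
assert ''.join(pieces) == s
```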

You *can* fix up the pattern for findall() so that it has the same
properties as the split(), but it will almost always be a more
complicated pattern than the one for the equivalent split().
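
One possible fix-up, as a rough sketch: replace the bare '.' with an
explicit "run of anything else" alternative.  The name grouped_finder and
the exact character classes are my own choices for illustration.  Note
that this groups *every* run of other characters, not just spaces, so it
is only an approximation of what a given split() would return.

```python
import re

# Third alternative is the complement of the first two, repeated, so a run
# of "other" characters comes back as one element instead of one per char.
grouped_finder = re.compile(r'([a-zA-Z]+|\d+|[^a-zA-Z\d]+)')

with_spaces = '555tHe   rain'
spaced = grouped_finder.findall(with_spaces)
print(spaced)     # ['555', 'tHe', '   ', 'rain']

original = '555tHe-rain.in#=1234'
grouped = grouped_finder.findall(original)
print(grouped)    # ['555', 'tHe', '-', 'rain', '.', 'in', '#=', '1234']

# Still invertible, but the pattern has to spell out the complement class.
assert ''.join(spaced) == with_spaces
assert ''.join(grouped) == original
```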

Another thing you can do with split(): if you *think* you have a
pattern that fully covers every string you expect to throw at it, but
would like to verify this, you can make use of the fact that split()
returns a string between each match (and before the first match and
after the last match).  So if you expect that every character in your
entire string should be a part of a match, you can do something like:

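# splitter is assumed to be the split method of a compiled pattern
# whose whole expression sits in one capturing group
strings = splitter(s)
tokens = strings[1::2]            # odd indices: the matches
assert not ''.join(strings[::2])  # even indices: text between matches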
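
For anyone who wants to try the check above, here is a self-contained
sketch.  The token pattern and the tokenize() helper are hypothetical
(chosen to fit the example string from this thread).  It relies on
re.split() keeping the text of a capturing group in the result, which is
why the matches land on the odd indices.

```python
import re

# Hypothetical token pattern: letters, digits, or one of a few punctuation
# characters.  The single capturing group makes split() return the matches.
splitter = re.compile(r'([a-zA-Z]+|\d+|[-.#=])').split

def tokenize(s):
    """Return the tokens of s, or raise if any character went unmatched."""
    strings = splitter(s)
    leftover = ''.join(strings[::2])   # text before, between, after matches
    if leftover:
        raise ValueError('unmatched text: %r' % leftover)
    return strings[1::2]               # the matches themselves

tokens = tokenize('555tHe-rain.in#=1234')
print(tokens)   # ['555', 'tHe', '-', 'rain', '.', 'in', '#', '=', '1234']

# A space is not covered by the pattern, so the check catches it.
try:
    tokenize('555 tHe')
except ValueError as exc:
    print(exc)
```

Raising an exception instead of using a bare assert is just a design
choice here; it keeps the check active when Python runs with -O.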

Regards,
Pat


