regex help: splitting string gets weird groups
pmaupin at gmail.com
Thu Apr 8 23:06:05 CEST 2010
On Apr 8, 3:40 pm, gry <georgeryo... at gmail.com> wrote:
> > >>> s='555tHe-rain.in#=1234'
> > >>> import re
> > >>> r=re.compile(r'([a-zA-Z]+|\d+|.)')
> > >>> r.findall(s)
> > ['555', 'tHe', '-', 'rain', '.', 'in', '#', '=', '1234']
> This is nice and simple and has the invertible property that Patrick
> mentioned above. Thanks much!
Yes, like using split(), this is invertible (as long as the string has
no newlines, which '.' does not match unless you compile the pattern
with re.DOTALL). But you will see a difference (and for a given task,
you might prefer one way or the other) if, for example, you put a few
consecutive spaces in the middle of your string: this pattern with
findall() will return each space individually, while split() will
return them all together as one string.
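To make the difference concrete, here is a small sketch. The split()
pattern is only an assumption for illustration (the same alternatives
minus the catch-all '.'), not necessarily the one used earlier in the
thread:

```python
import re

s = 'the   rain'  # three consecutive spaces in the middle

# findall() with the catch-all '.': every space becomes its own item
finder = re.compile(r'([a-zA-Z]+|\d+|.)')
found = finder.findall(s)
print(found)   # ['the', ' ', ' ', ' ', 'rain']

# split() on a single capturing group (assumed pattern): whatever lies
# between two matches comes back as one string
splitter = re.compile(r'([a-zA-Z]+|\d+)').split
pieces = splitter(s)
print(pieces)  # ['', 'the', '   ', 'rain', '']

# Both results can be joined back into the original string
print(''.join(found) == s, ''.join(pieces) == s)
```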
You *can* fix up the pattern for findall() so that it has the same
properties as the split(), but it will almost always be a more
complicated pattern than for the equivalent split().
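For instance, one possible fix-up (a sketch, not the only option) is to
replace the catch-all '.' with a negated character class followed by
'+', so that runs of "other" characters stay together:

```python
import re

# Third alternative is the complement of the first two, repeated, so
# consecutive non-alphanumeric characters are returned as one item.
grouped = re.compile(r'([a-zA-Z]+|\d+|[^a-zA-Z\d]+)')

a = grouped.findall('555tHe-rain.in#=1234')
print(a)  # ['555', 'tHe', '-', 'rain', '.', 'in', '#=', '1234']

b = grouped.findall('the   rain')
print(b)  # ['the', '   ', 'rain']
```

Unlike split(), this does not produce the empty strings before the
first match or between adjacent matches, but joining the result still
gives back the original string.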
Another thing you can do with split(): if you *think* you have a
pattern that fully covers every string you expect to throw at it, but
would like to verify this, you can make use of the fact that split()
returns a string between each match (and before the first match and
after the last match). So if you expect that every character in your
entire string should be a part of a match, you can do something like:
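# splitter: the .split method of a compiled pattern with one capturing group
strings = splitter(s)
tokens = strings[1::2]            # odd indices: the matched tokens
assert not ''.join(strings[::2])  # even indices: text between matches, should all be empty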
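Put together as a runnable sketch. The pattern and the name
tokenize_strict are hypothetical, chosen only for illustration; the one
real requirement is that the pattern has exactly one capturing group,
so that the matches land on the odd indices:

```python
import re

# Assumed pattern: runs of letters, runs of digits, or runs of spaces
splitter = re.compile(r'([a-zA-Z]+|\d+| +)').split

def tokenize_strict(s):
    """Return the tokens of s, or raise if any character is not covered."""
    strings = splitter(s)
    tokens = strings[1::2]              # the matches
    leftovers = ''.join(strings[::2])   # text before, between, and after matches
    if leftovers:
        raise ValueError('unmatched characters: %r' % leftovers)
    return tokens

print(tokenize_strict('555 tHe rain'))  # ['555', ' ', 'tHe', ' ', 'rain']

try:
    tokenize_strict('the-rain')         # '-' is not covered by the pattern
except ValueError as exc:
    print(exc)
```

Raising an exception instead of using assert keeps the check active
even when Python runs with -O, which strips assert statements.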