regex help: splitting string gets weird groups
Jon Clements
joncle at googlemail.com
Thu Apr 8 16:37:14 EDT 2010
On 8 Apr, 19:49, gry <georgeryo... at gmail.com> wrote:
> [ python3.1.1, re.__version__='2.2.1' ]
> I'm trying to use re to split a string into (any number of) pieces of
> these kinds:
> 1) contiguous runs of letters
> 2) contiguous runs of digits
> 3) single other characters
>
> e.g. 555tHe-rain.in#=1234 should give: [555, 'tHe', '-', 'rain',
> '.', 'in', '#', '=', 1234]
> I tried:
> >>> re.match('^(([A-Za-z]+)|([0-9]+)|([-.#=]))+$', '555tHe-rain.in#=1234').groups()
>
> ('1234', 'in', '1234', '=')
>
> Why is 1234 repeated in two groups? and why doesn't "tHe" appear as a
> group? Is my regexp illegal somehow and confusing the engine?
>
> I *would* like to understand what's wrong with this regex, though if
> someone has a neat other way to do the above task, I'm also interested
> in suggestions.
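On the regex question itself, a small sketch that may help (it is not part of the posted reply, and the names `tokens` and `result` are only illustrative). The pattern is legal: a capturing group inside a repeated `(...)+` keeps only the text of its last match, so `.groups()` reports the final piece each alternative matched, not every piece. That is why '1234' shows up twice (once for the outer group, once for the digit group) and 'tHe' was overwritten by 'in'. `re.findall` over the same alternatives, without the outer repetition, returns every piece in order. The catch-all class `[^A-Za-z0-9]` for "single other characters" is an assumption; the original pattern only listed `-.#=`.

```python
import re

s = '555tHe-rain.in#=1234'

# Each group inside the repeated outer group is overwritten on every
# iteration, so only the last match of each alternative survives.
m = re.match(r'^(([A-Za-z]+)|([0-9]+)|([-.#=]))+$', s)
print(m.groups())

# findall returns every non-overlapping match, left to right.
# Assumption: any single non-alphanumeric character counts as "other".
tokens = re.findall(r'[A-Za-z]+|[0-9]+|[^A-Za-z0-9]', s)

# Convert digit runs to int; leave everything else as a string.
result = [int(t) if t[0] in '0123456789' else t for t in tokens]
print(result)
```

This keeps the original case of the letters, which the groupby version further down does not.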
Avoiding re's (for a bit of fun):
(no good for Unicode, obviously, since the lookup only covers ASCII)
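# Python 2 (izip and the print statement are gone in Python 3)
import string
from itertools import groupby, chain, repeat, count, izip

s = """555tHe-rain.in#=1234"""

# Letters share key 'L', digits share key 'D'; each punctuation
# character gets its own number so different ones never merge.
unique_group = count()
lookup = dict(
    chain(
        izip(string.ascii_letters, repeat('L')),
        izip(string.digits, repeat('D')),
        izip(string.punctuation, unique_group)
    )
)

# Per-key conversion; keys not listed (punctuation) pass through unchanged.
parse = dict(D=int, L=str.capitalize)
print [ parse.get(key, lambda L: L)(''.join(items)) for key, items in
        groupby(s, lambda L: lookup[L]) ]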
[555, 'The', '-', 'Rain', '.', 'In', '#', '=', 1234]
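Since the question was asked on Python 3.1, where `itertools.izip` and the `print` statement no longer exist, here is a hedged sketch of how the same approach might be ported. It is an adaptation, not the code as posted: the built-in `zip` replaces `izip`, `lookup` is filled with `dict.update` instead of `chain`, and `str` stands in for the identity lambda.

```python
import string
from itertools import groupby, count

s = '555tHe-rain.in#=1234'

# Map every ASCII letter to 'L', every digit to 'D', and each
# punctuation character to its own integer key.
lookup = {}
lookup.update((c, 'L') for c in string.ascii_letters)
lookup.update((c, 'D') for c in string.digits)
lookup.update(zip(string.punctuation, count()))

# Digit runs become ints, letter runs are capitalized (as in the
# original); punctuation keys are not listed and fall back to str.
parse = {'D': int, 'L': str.capitalize}

result = [parse.get(key, str)(''.join(items))
          for key, items in groupby(s, lookup.__getitem__)]
print(result)
```

Swapping `str.capitalize` for `str` keeps 'tHe' as typed. Characters outside the lookup (whitespace, non-ASCII) raise `KeyError`, and a run of the same punctuation character such as '--' comes back as one piece.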
Jon.