Help: Arbitrary number of groups in regex

Bengt Richter bokr at oz.net
Fri Aug 9 00:40:27 EDT 2002


On Thu, 08 Aug 2002 16:51:24 -0700 (PDT), "Sean 'Shaleh' Perry" <shalehperry at attbi.com> wrote:

>>> > Does anybody know what the PATTERN should be ?
>>>
>>> I don't believe it's possible.  Perhaps it should be.  If all you want
>>> to do is split a string into a sequence of characters, just do this:
>>>
>>> >>> tuple("abcde")
>>> ('a', 'b', 'c', 'd', 'e')
>>>
>> 
>> I'm starting to think it's impossible too. Perhaps I oversimplified
>> the problem in my question. What I have is actually an arbitrary number
>> of comma-separated values, each of which can be
>> composed of letters or numbers. I don't know the number in
>> advance. For instance, I might have the following input:
>> 
>> " FL234,  MK434,  9743"
>> 
>> I've tried to write a regex pattern which could return me each value
>> in a separate group, but I believe I have to 'split' the string first
>> and then parse each value separately.
>> 
>
>yeah it would be easiest if you split on commas and then parsed the individual
>entries.
>
>>>> re.split(r'[\s,]', " FL234,  MK434,  9743")
>['', 'FL234', '', '', 'MK434', '', '', '9743']
>
>you just have to skip the empties.

Or avoid them:
 >>> line = " FL234,  MK434,  9743"
 >>> re.split(r'\s*,\s*', line.strip())
 ['FL234', 'MK434', '9743']

or to be able to reuse the compiled re: 
 >>> rec = re.compile(r'\s*,\s*')
 >>> rec.split(line.strip())
 ['FL234', 'MK434', '9743']

This also handles input spread over several lines:
 >>> line2 = """ FL234,
 ...       MK434  ,
 ...       9743
 ... """
 >>> rec.split(line2.strip())
 ['FL234', 'MK434', '9743']
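
As a rough sketch, the strip-then-split idea can be wrapped in a small helper.
The name split_fields is made up for illustration, and returning an empty list
for blank input is my assumption about what a caller would want; without that
check, splitting an empty string gives [''].

```python
import re

# Same separator pattern as above: a comma with optional whitespace around it.
_SEP = re.compile(r'\s*,\s*')

def split_fields(text):
    """Split comma-separated values, ignoring whitespace around the commas."""
    text = text.strip()
    if not text:
        # Assumed behavior: blank input means "no fields" rather than [''].
        return []
    return _SEP.split(text)

print(split_fields(" FL234,  MK434,  9743"))   # ['FL234', 'MK434', '9743']
print(split_fields("   "))                     # []
print(split_fields("FL234,,9743"))             # ['FL234', '', '9743']
```

Note that a missing value between two commas still comes back as an empty
string, which may or may not be what you want.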

>
>split is nice if the input is fairly uniform whereas findall can pull pieces
>out of a larger mass.
So what was wrong with findall again?
 >>> re.findall(r'(\w+)', line)
 ['FL234', 'MK434', '9743']
 >>> re.findall(r'(\w+)', line2)
 ['FL234', 'MK434', '9743']
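
For comparison, here is a minimal sketch of why a single pattern with one
repeated group does not hand back every value: each repetition overwrites the
previous capture, so only the last one is left. The pattern is just one I made
up for this input, not anything from earlier in the thread.

```python
import re

line = " FL234,  MK434,  9743"

# One capturing group inside a repeated non-capturing group.
# Every repetition stores its match in group 1, replacing the earlier one.
m = re.match(r'\s*(?:(\w+)\s*,?\s*)+$', line)
print(m.groups())                  # ('9743',)

# findall collects every non-overlapping match instead.
print(re.findall(r'\w+', line))    # ['FL234', 'MK434', '9743']
```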

>
>If you use string.split(input, ',') here the pieces will have spaces in them
>which you will have to deal with.  That is why the regex I made breaks them off
>as well.
I assumed the comma was mandatory and that spaces were optional (multiple ok) on
either side of it. Pre-stripping leading and trailing whitespace from the original
string then avoids the empties. But if you know you're looking for alphanumeric
sequences and want to chuck everything else (e.g., you don't care about keeping
'MK4.34' whole with its '.'), findall seems the thing.
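
To make the difference concrete, here is a small sketch using a made-up input
that contains the 'MK4.34' case. It assumes Python 3.4 or later for
re.fullmatch; the variable names are only for illustration.

```python
import re

line = " FL234,  MK4.34,  9743"

# Splitting on the commas keeps each field whole, '.' included.
by_split = re.split(r'\s*,\s*', line.strip())
print(by_split)      # ['FL234', 'MK4.34', '9743']

# findall keeps only the alphanumeric runs, so the '.' breaks a field in two.
by_findall = re.findall(r'\w+', line)
print(by_findall)    # ['FL234', 'MK4', '34', '9743']

# If stray characters should be reported rather than dropped,
# split first and then check each field.
suspect = [f for f in by_split if not re.fullmatch(r'\w+', f)]
print(suspect)       # ['MK4.34']
```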

Regards,
Bengt Richter


