how to get all repeated group with regular expression
Jeremiah Dodds
jeremiah.dodds at gmail.com
Sat Nov 22 16:46:36 EST 2008
On Fri, Nov 21, 2008 at 9:12 PM, scsoce <scsoce at gmail.com> wrote:
> MRAB wrote:
>
>> <div class="moz-text-flowed" style="font-family: -moz-fixed">Steve Holden
>> wrote:
>>
>>> Please keep this on the list.
>>>
>>> scsoce wrote:
>>>
>>>> Steve Holden wrote:
>>>>
>>>>> scsoce wrote:
>>>>>
>>>>>
>>>>>> say, when I try to search and match every char from variable length
>>>>>> string, such as string '123456', i tried re.findall( r'(\d)*, '12346'
>>>>>> )
>>>>>>
>>>>>>
>>>>> I think you will find you missed a quote out there. Always better to
>>>>> copy and paste ...
>>>>>
>>>>>
>>>>>
>>>>>> , but only get '6', and the Python docs indeed say: "If a group is
>>>>>> contained in a part of the pattern that matched multiple times, the
>>>>>> last match is returned."
>>>>>>
>>>>>>
>>>>> So use
>>>>>
>>>>> r'(\d*)'
>>>>>
>>>>> instead and then the group includes all the digits you match.
>>>>>
>>>>>
>>>>>
>>>>>> Is that because the regex engine cannot remember all the past
>>>>>> history? Is that true of all regex engines, or only Python's?
>>>>>>
>>>>>>
>>>>> Different regex engines have different capabilities, so I can't speak
>>>>> to them all. If you wanted *all* the matches of *all* groups, how would
>>>>> you have them returned? As a list? That would make the case where there
>>>>> was only one match much trickier to handle. And what would you do with
>>>>>
>>>>> r'((\w)*\d)*)'
>>>>>
>>>>> Also, what about named groups? I can see enough potential
>>>>> implementation issues that I can perfectly understand why Python works
>>>>> the way it does, so I'd be interested to know why it doesn't make sense
>>>>> to you, and what you would prefer it to do.
>>>>>
>>>>> regards
>>>>> Steve
>>>>>
>>>>>
>>>> Maybe my explanation was not clear. I want to capture every matched part
>>>> of a repeated pattern, not only the last. Say, for the string '123456', I
>>>> want to be able to back-reference any one char, not only the '6'. I know
>>>> the example is very simple, so we could get the whole string using the
>>>> regex and get every char using other Python statements, but what if the
>>>> pattern in the group is complex?
>>>> I tested in Vim, and it can do the 'back reference':
>>>> == your text in Vim:
>>>> 123456
>>>> == pattern:
>>>> :%s/\(\d\)*/$2
>>>> text will turn to be:
>>>> 2
>>>>
>>>> 'Fraid the Python re implementers just decided not to do it that way.
>>>
>>> Nor Perl.
>>
>> Probably what you want is re.findall(r"(\d)", "123456"), which returns a
>> list of what it captured.
>>
>>
>> </div>
>>
> Yes, you are right, but this way findall() captures only the 'top' group.
> What I really need to do is capture nested and repeated patterns. Say, a
> <table> tag in HTML contains many <tr>, each <tr> contains many <td>, and
> the data in the <td> is what I need, so I wrote the regex like this:
> regx ='''
> <table.*\n
> (
> (\s*<tr.*\n
> (\s*<td.*</td>\n|\n)*
> \s*</tr>\n
> |\n)*
> )
> \s*</table>
> '''
> Steve Holden wrote:
>
>> I can see enough potential implementation
>> issues that I can perfectly understand why Python works the way it does,
>> so I'd be interested to know why it doesn't make sense to you, and what
>> you would prefer it to do.
>>
>>
>
> As Steve said, if re really cannot do this kind of work, I have to
> split the one-line regex up: capture <table> first, then loop to
> capture <tr>, then <td>, and so on. I don't like this approach compared
> with the single clean regex above.
>
If you're parsing structured markup like HTML, why not use something meant
for that? I personally find BeautifulSoup (
http://www.crummy.com/software/BeautifulSoup/) to be very good at this. For
instance, here's a code snippet I recently used to pull specific data out
of a table on a site:
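    # BeautifulSoup 4 import; the 2008-era package used
    # "from BeautifulSoup import BeautifulSoup" instead.
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(some_page, "html.parser")
    # Collect the stripped text of each <font> tag in the target rows' cells.
    opts = [fonttag.string.strip()
            for row in soup('tr', attrs={'class': 'targetClass'})
            for cell in row('td')
            for fonttag in cell('font')
            if fonttag.string]  # skip tags with no single text child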
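
If you would rather stay with the re module, the step-by-step approach
described above (capture <table>, then loop over <tr>, then <td>) can be
sketched roughly as follows. This is only an illustration under stated
assumptions: the sample markup and the variable names are made up, and the
patterns assume simple, non-nested tables. It also shows the behaviour
discussed earlier in the thread: a repeated group keeps only its last match,
while findall() with an unrepeated group returns every match.

```python
import re

# Hypothetical sample markup, made up for illustration only.
html = """
<table>
  <tr><td>a1</td><td>a2</td></tr>
  <tr><td>b1</td></tr>
</table>
"""

# A repeated group remembers only the last thing it matched.
m = re.match(r'(\d)*', '123456')
last_digit = m.group(1)

# Moving the repetition out of the pattern and into findall()
# returns one item per match instead.
all_digits = re.findall(r'(\d)', '123456')

# Stepwise extraction: one findall() per nesting level.
# re.S lets '.' span newlines; '.*?' stops at the first closing tag,
# so this sketch does not handle a table nested inside another table.
rows = [re.findall(r'<td[^>]*>(.*?)</td>', tr, re.S)
        for table in re.findall(r'<table[^>]*>(.*?)</table>', html, re.S)
        for tr in re.findall(r'<tr[^>]*>(.*?)</tr>', table, re.S)]

print(last_digit)
print(all_digits)
print(rows)
```

Each row comes back as its own list of cell strings, so the nesting is kept
without needing a single regex to remember every repetition. For real-world
HTML I would still reach for a proper parser first.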