<br><br><div class="gmail_quote">On Fri, Nov 21, 2008 at 9:12 PM, scsoce <span dir="ltr"><<a href="mailto:scsoce@gmail.com">scsoce@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">


MRAB wrote:<br>


<blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;"><div><div></div><div class="Wj3C7c">


<div class="moz-text-flowed" style="font-family: -moz-fixed">Steve Holden wrote:<br>


<blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">


Please keep this on the list.<br>


<br>


scsoce wrote:<br>


<blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">


Steve Holden wrote:<br>


<blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">


scsoce wrote:<br>


 <br>


<blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">


Say I try to search for and match every char in a variable-length<br>


string such as '123456'. I tried re.findall( r'(\d)*, '12346' )<br>


    <br>


</blockquote>


I think you will find you missed a quote out there. Always better to<br>


copy and paste ...<br>


<br>


 <br>


<blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">


, but I only get '6', and the Python docs indeed say: "If a group is contained<br>


in a part of the pattern that matched multiple times, the last match is<br>


returned."<br>


    <br>


</blockquote>


So use<br>


<br>


    r'(\d*)'<br>


<br>


instead and then the group includes all the digits you match.<br>
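<br>
A quick check of the difference, as a minimal sketch using only the standard re module (the variable names are just for illustration): a group placed inside the repetition keeps only its final iteration, while a group wrapped around the repetition spans the whole run.<br>

```python
import re

text = "123456"

# Group inside the repetition: only the last iteration is kept.
inner = re.match(r"(\d)*", text)
print(inner.group(0))  # the whole match
print(inner.group(1))  # what the repeated group captured last

# Group around the repetition: the group spans every digit matched.
outer = re.match(r"(\d*)", text)
print(outer.group(1))
```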


<br>


 <br>


<blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">


Is that because the regex engine cannot remember all the past history? Is this<br>


inherent to all regex engines, or only to Python's?<br>


    <br>


</blockquote>


Different regex engines have different capabilities, so I can't speak to<br>


them all. If you wanted *all* the matches of *all* groups, how would you<br>


have them returned? As a list? That would make the case where there was<br>


only one match much trickier to handle. And what would you do with<br>


<br>


  r'((\w)*\d)*'<br>


<br>


Also, what about named groups? I can see enough potential implementation<br>


issues that I can perfectly understand why Python works the way it does,<br>


so I'd be interested to know why it doesn't make sense to you, and what<br>


you would prefer it to do.<br>


<br>


regards<br>


 Steve<br>


  <br>


</blockquote>


Maybe my explanation was not clear. I want to capture every matched part<br>


in a repeated pattern, not only the last. Say, for the string '123456', I<br>


want to back-reference any one char, not only the '6'. I know the<br>


example is very simple, so we could get the whole string using a regex and<br>


get every char using other Python statements, but what if the pattern in<br>


the group is complex?<br>


I tested in Vim, and it can do the 'back reference':<br>


== your text in Vim:<br>


123456<br>


== pattern:<br>


:%s/\(\d\)*/$2<br>


the text becomes:<br>


2<br>


<br>


</blockquote>


'Fraid the Python re implementers just decided not to do it that way.<br>


<br>


</blockquote>


Nor Perl.<br>


<br>


Probably what you want is re.findall(r"(\d)", "123456"), which returns a list of what it captured.<br>
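<br>
For reference, a small sketch of that suggestion using only the standard re module. With a single, non-repeated group, findall returns one entry per match, and finditer is the variant to reach for when the positions matter too. The variable names are illustrative only.<br>

```python
import re

text = "123456"

# findall with one non-repeated group: one list entry per match.
digits = re.findall(r"(\d)", text)
print(digits)

# finditer yields match objects, so each capture's position is available too.
positions = [(m.group(1), m.start()) for m in re.finditer(r"(\d)", text)]
print(positions)
```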


<br>


<br></div></div>


</div><br>


</blockquote>


Yes, you are right, but this way findall() captures only the 'top' group. What I really need is to capture nested and repeated patterns. Say a <table> tag in HTML contains many <tr>, each <tr> contains many <td>, and the data in <td> is what I need, so I wrote the regex like this:<br>


   regx ='''<br>


             <table.*\n<br>


              (<br>


              (\s*<tr.*\n<br>


                   (\s*<td.*</td>\n|\n)*<br>


               \s*</tr>\n<br>


              |\n)*<br>


              )<br>


              \s*</table><br>


               '''<div class="Ih2E3d"><br>


Steve Holden wrote:<br>


<blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">


I can see enough potential implementation<br>


issues that I can perfectly understand why Python works the way it does,<br>


so I'd be interested to know why it doesn't make sense to you, and what<br>


you would prefer it to do.<br>


  <br>


</blockquote>


<br></div>


As Steve said, if re really cannot do this kind of work, I have to split the one-line regex up: capture <table> first, then loop to capture <tr>, and then <td>, and so on. I do not like this approach compared with the single clean regex above.<div>
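<br>
The staged approach described here could look roughly like the sketch below: one simple pattern per nesting level, applied from the outside in. The sample markup, the pattern names and the extract_tables helper are all made up for illustration, and the sketch assumes tables are not nested inside one another, which regexes of this kind cannot handle reliably.<br>

```python
import re

# Hypothetical sample markup, invented for this illustration.
html = """
<table border="1">
  <tr><td>a1</td><td>a2</td></tr>
  <tr><td>b1</td><td>b2</td></tr>
</table>
"""

# One simple pattern per nesting level; re.S lets "." match newlines.
TABLE = re.compile(r"<table\b[^>]*>(.*?)</table>", re.S | re.I)
ROW = re.compile(r"<tr\b[^>]*>(.*?)</tr>", re.S | re.I)
CELL = re.compile(r"<td\b[^>]*>(.*?)</td>", re.S | re.I)


def extract_tables(markup):
    """Return a list of tables, each a list of rows, each a list of cell strings."""
    return [
        [[cell.strip() for cell in CELL.findall(row)] for row in ROW.findall(table)]
        for table in TABLE.findall(markup)
    ]


tables = extract_tables(html)
print(tables)
```

Each level only ever sees the text captured by the level above it, so every repetition is kept without needing the engine to remember the history of a repeated group.<br>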


<div></div><div class="Wj3C7c"><br>


<br>


--<br>


<a href="http://mail.python.org/mailman/listinfo/python-list" target="_blank">http://mail.python.org/mailman/listinfo/python-list</a><br>


</div></div></blockquote></div><br>If you're parsing structured markup like HTML, why not use something meant for that? I personally find BeautifulSoup (<a href="http://www.crummy.com/software/BeautifulSoup/">http://www.crummy.com/software/BeautifulSoup/</a>) to be very good at this. For instance, here's a code snippet I recently used to pull specific data out of a table on a site:<br>


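<br># BeautifulSoup 3 import; the later bs4 package uses "from bs4 import BeautifulSoup"<br>from BeautifulSoup import BeautifulSoup<br><br># some_page holds the HTML text of the page<br>soup = BeautifulSoup(some_page)<br># Stripped text of every font tag inside the td cells of rows with class "targetClass"<br>opts = [fonttag.string.strip()<br>           for row in soup('tr', attrs={'class':'targetClass'})<br>           for cell in row('td')<br>           for fonttag in cell('font')]<br>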
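<br>
If installing a third-party package is not an option, the same idea can be sketched with the standard library's html.parser. Note that this is a swapped-in technique rather than BeautifulSoup, and the CellCollector class and the sample markup below are hypothetical names invented for the example. It gathers the text of every td, grouped by the tr that contains it.<br>

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect the text of every <td>, grouped by the <tr> that contains it."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row being read, or None outside a row
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text inside nested tags (such as <font>) also arrives here.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None  # drop any cell left unclosed in this row


# Hypothetical sample markup, invented for this illustration.
sample = (
    "<table><tr><td>a1</td><td><font>a2</font></td></tr>"
    "<tr><td>b1</td></tr></table>"
)
collector = CellCollector()
collector.feed(sample)
collector.close()
print(collector.rows)
```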