<div class="gmail_quote">On Tue, Jan 17, 2012 at 3:07 AM, Chris Kavanagh <span dir="ltr"><<a href="mailto:ckava1@msn.com">ckava1@msn.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
Hey guys, girls, hope everyone is doing well.<br>
<br>
Here's my question: when using regular expressions, the docs say that using parentheses "captures" the data. This has got me confused (doesn't take much). Can someone explain this to me, please?<br>
<br>
Here's an example to use. It's kinda long, so, if you'd rather provide your own shorter ex, that'd be fine. Thanks for any help as always.<br></blockquote><div><br></div><div>Here's a quick example:</div>
<div><br></div><div>import re</div><div><br></div><div>data = 'Wayne Werner fake-phone: 501-555-1234, fake-SSN: 123-12-1234'</div><div># With parentheses: two capture groups</div><div><div>parsed = re.search(r'(\d{3})-(\d{3}-\d{4})', data)</div></div>
<div>print(parsed.group())</div><div>print(parsed.groups())</div><div><br></div><div># Same pattern without parentheses: no capture groups</div><div><div>parsed = re.search(r'\d{3}-\d{3}-\d{4}', data)</div></div><div>print(parsed.group())</div><div>print(parsed.groups())</div>
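<div><br></div><div>You'll notice that both searches give you '501-555-1234' from .group(), but only the first one, with the parentheses, gives you the individual pieces from .groups(): ('501', '555-1234'). The second one returns an empty tuple because nothing was captured. This makes pulling out the individual groups pretty easy. Of course, capturing isn't just for storing the results. You can also use the captured group later on.</div>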
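<div><br></div><div>As a rough sketch of both uses (the variable names area_code and local are just my own labels, and the "(501) 555-1234" output format is an arbitrary choice for illustration), you can pull out one group at a time with .group(n), and refer to the groups as \1 and \2 in a replacement string:</div>

```python
import re

data = 'Wayne Werner fake-phone: 501-555-1234, fake-SSN: 123-12-1234'
match = re.search(r'(\d{3})-(\d{3}-\d{4})', data)

whole = match.group(0)       # the entire matched text
area_code = match.group(1)   # first pair of parentheses
local = match.group(2)       # second pair of parentheses
print(whole, area_code, local)   # 501-555-1234 501 555-1234

# Captured groups can be reused in a replacement string via \1 and \2.
reformatted = re.sub(r'(\d{3})-(\d{3}-\d{4})', r'(\1) \2', data)
print(reformatted)
```

<div>The fake SSN is left alone because its middle part only has two digits, so the pattern never matches it.</div>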
<div><br></div><div>Let's say, for some fictitious reason, you want to find every letter that appears doubled (twice in a row) in some data. If you were to do this the "brute force" way, you'd pretty much have to do something like this:</div>
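<div><br></div><div>found = []</div><div>for i in range(len(data) - 1):</div><div>&nbsp;&nbsp;&nbsp;&nbsp;if data[i] == data[i+1]:</div><div>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;if data[i] not in found:</div><div>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;found.append(data[i])</div><div>print(found)</div><div>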
<br></div><div>The regex OTOH looks like this:</div><div><div><br></div><div>In [29]: data = 'aaabababbcacacceadbacdb'</div><div><div><br></div><div>In [32]: parsed = re.findall(r'([a-z])\1', data)</div><div>
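<br></div><div>In [33]: parsed</div><div>Out[33]: ['a', 'b', 'c']</div></div><div><br></div></div><div>Here ([a-z]) captures a single letter, and \1 (a backreference) matches whatever that group just captured, so the pattern only matches a letter followed immediately by itself.</div><div><br></div><div>Now, that example was super contrived and also simple. Very few real-world applications will be as simple as that one - usually you have much crazier specifications, like finding every person who has blue eyes AND blue hair, but only if they're left-handed. Assuming you had data that looked like this:</div>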
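<div><br></div><div>Name&nbsp;&nbsp;&nbsp;&nbsp;Eye Color&nbsp;&nbsp;&nbsp;&nbsp;Hair Color&nbsp;&nbsp;&nbsp;&nbsp;Handedness&nbsp;&nbsp;&nbsp;&nbsp;Favorite type of potato</div><div>Wayne&nbsp;&nbsp;&nbsp;&nbsp;Blue&nbsp;&nbsp;&nbsp;&nbsp;Brown&nbsp;&nbsp;&nbsp;&nbsp;Dexter&nbsp;&nbsp;&nbsp;&nbsp;Mashed</div><div>Sarah&nbsp;&nbsp;&nbsp;&nbsp;Blue&nbsp;&nbsp;&nbsp;&nbsp;Blonde&nbsp;&nbsp;&nbsp;&nbsp;Sinister&nbsp;&nbsp;&nbsp;&nbsp;Spam(?)</div>
<div>Kane&nbsp;&nbsp;&nbsp;&nbsp;Green&nbsp;&nbsp;&nbsp;&nbsp;White&nbsp;&nbsp;&nbsp;&nbsp;Dexter&nbsp;&nbsp;&nbsp;&nbsp;None</div><div>Kermit&nbsp;&nbsp;&nbsp;&nbsp;Blue&nbsp;&nbsp;&nbsp;&nbsp;Blue&nbsp;&nbsp;&nbsp;&nbsp;Sinister&nbsp;&nbsp;&nbsp;&nbsp;Idaho</div><div><br></div><div><br></div><div>You could parse out the data using captures and backreferences [1].</div>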
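<div><br></div><div>A minimal sketch of what that might look like, under a few assumptions of mine: each record is one line, the columns are separated by spaces or tabs, every value is a single word, the header row has been dropped, and "Sinister" means left-handed. The name rows is just a placeholder for wherever the text comes from:</div>

```python
import re

# Assumed layout: one record per line, whitespace-separated single-word
# values, header row left out.
rows = """\
Wayne   Blue   Brown   Dexter    Mashed
Sarah   Blue   Blonde  Sinister  Spam(?)
Kane    Green  White   Dexter    None
Kermit  Blue   Blue    Sinister  Idaho
"""

# Group 1 captures the name, group 2 the eye color.
# The backreference \2 then requires the hair color to repeat the eye
# color exactly, and the line must continue with "Sinister".
pattern = re.compile(r'^(\w+)[ \t]+(Blue)[ \t]+\2[ \t]+Sinister\b',
                     re.MULTILINE)

matches = pattern.findall(rows)
print(matches)   # [('Kermit', 'Blue')]
```

<div>With more than one group in the pattern, findall returns a list of tuples, one tuple per matching line.</div>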
<div><br></div><div>HTH,</div><div>Wayne</div><div><br></div><div>[1] In this situation, of course, regex is overkill. It's easier to just .split() and compare. But if you're parsing something really nasty like EDI then sometimes a regex is just the best way to go[2].</div>
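<div><br></div><div>For comparison, here is roughly what the .split() version could look like, under the same assumptions about the layout (whitespace-separated, single-word values, no header row; rows and blue_lefties are names I made up):</div>

```python
# Assumed layout: one record per line, five whitespace-separated values.
rows = """\
Wayne   Blue   Brown   Dexter    Mashed
Sarah   Blue   Blonde  Sinister  Spam(?)
Kane    Green  White   Dexter    None
Kermit  Blue   Blue    Sinister  Idaho
"""

blue_lefties = []
for line in rows.splitlines():
    # Unpack the five columns; this raises ValueError on a malformed line.
    name, eyes, hair, hand, potato = line.split()
    if eyes == hair == 'Blue' and hand == 'Sinister':
        blue_lefties.append(name)

print(blue_lefties)   # ['Kermit']
```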
<div><br></div><div>[2] When people start to understand regexes, they're like the proverbial man who only has a hammer. As Jamie Zawinski said [3], "Some people, when confronted with a problem, think “I know, I'll use regular expressions.” Now they have two problems." I've come across very few occasions where regexes were actually useful, and it's usually when extracting very specifically formatted data (money, phone numbers, etc.) from copious amounts of text. I've not yet had a need to actually process words with one, especially in Python.</div>
<div><br></div><div>[3] <a href="http://regex.info/blog/2006-09-15/247">http://regex.info/blog/2006-09-15/247</a></div></div>