[Tutor] To: Wayne Werner, re Reg. Expressions Parenthesis

Chris Kavanagh ckava1 at msn.com
Thu Jan 19 01:09:12 CET 2012

For some reason I didn't get this email; I found it in the archives. I 
wanted to make sure I thanked Wayne for the help!

On Tue, Jan 17, 2012 at 3:07 AM, Chris Kavanagh <[hidden email]> wrote:
Hey guys, girls, hope everyone is doing well.

Here's my question: the docs say that when you use parentheses in a 
regular expression, it "captures" the data. This has got me confused 
(doesn't take much). Can someone explain this to me, please?

Here's an example to use. It's kinda long, so if you'd rather provide 
your own shorter example, that'd be fine. Thanks for any help as always.

Here's a quick example:

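import re

data = 'Wayne Werner fake-phone: 501-555-1234, fake-SSN: 123-12-1234'

# With parentheses: each parenthesized part is captured as a group.
parsed = re.search(r'(\d{3})-(\d{3}-\d{4})', data)
print(parsed.group())   # the whole match: 501-555-1234
print(parsed.groups())  # the captured groups: ('501', '555-1234')

# Without parentheses: the same text matches, but nothing is captured.
parsed = re.search(r'\d{3}-\d{3}-\d{4}', data)
print(parsed.groups())  # ()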

You'll notice that you can access the individual captured groups using 
the .groups() method of the match object. This makes pulling out the 
individual pieces pretty easy. Of course, capturing isn't just for 
storing the results. You can also refer back to a captured group later 
on, for example with a backreference such as \1.
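
As a minimal sketch of using captures later on: re.sub lets the 
replacement string refer to the captured groups as \1 and \2. The 
"(area code) number" output format is only an illustration, not 
something from the original example.

```python
import re

data = 'Wayne Werner fake-phone: 501-555-1234, fake-SSN: 123-12-1234'

# \1 is the first captured group (the area code) and \2 is the second
# (the rest of the number). The fake SSN does not fit the 3-3-4 digit
# pattern, so it is left untouched.
reformatted = re.sub(r'(\d{3})-(\d{3}-\d{4})', r'(\1) \2', data)
print(reformatted)
```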

Let's say, for some fictitious reason you want to find every letter that 
appears as a double in some data. If you were to do this the "brute 
force" way you'd pretty much have to do something like this:

found = []
for i in range(len(data) - 1):
    # Compare each character with its right-hand neighbour.
    if data[i] == data[i + 1]:
        if data[i] not in found:
            found.append(data[i])

The regex OTOH looks like this:

In [29]: data = 'aaabababbcacacceadbacdb'

In [32]: parsed = re.findall(r'([a-z])\1', data)

In [33]: parsed
Out[33]: ['a', 'b', 'c']

Now, that example was super contrived and also simple. Very few 
real-world applications will be as simple as that one - usually you have 
much crazier specifications, like find every person who has blue eyes 
AND blue hair, but only if they're left-handed. Assuming you had data 
that looked like this:

Name    Eye Color  Hair Color  Handedness  Favorite type of potato
Wayne   Blue       Brown       Dexter      Mashed
Sarah   Blue       Blonde      Sinister    Spam(?)
Kane    Green      White       Dexter      None
Kermit  Blue       Blue        Sinister    Idaho

You could parse out the data using captures and backreferences [1].
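
A rough sketch of what that could look like, assuming the columns are 
separated by runs of whitespace and that "Sinister" marks the 
left-handed rows. The pattern and the variable names are illustrative 
assumptions, not part of the original message.

```python
import re

# The table from above; columns are separated by runs of whitespace.
table = """\
Name    Eye Color  Hair Color  Handedness  Favorite type of potato
Wayne   Blue       Brown       Dexter      Mashed
Sarah   Blue       Blonde      Sinister    Spam(?)
Kane    Green      White       Dexter      None
Kermit  Blue       Blue        Sinister    Idaho
"""

# (\w+)   captures the name at the start of a line
# (Blue)  captures the eye color; the backreference \2 then requires
#         the hair color to be exactly the same text
# Sinister restricts the match to left-handed rows
pattern = re.compile(r'^(\w+)\s+(Blue)\s+\2\s+Sinister\b', re.MULTILINE)

matches = [m.group(1) for m in pattern.finditer(table)]
print(matches)
```

Only Kermit's row has blue eyes, blue hair, and Sinister handedness, so 
that is the only name the pattern picks up.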


[1] In this situation, of course, regex is overkill. It's easier to just 
.split() and compare. But if you're parsing something really nasty like 
EDI then sometimes a regex is just the best way to go[2].
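
For comparison, a minimal sketch of the .split() approach on the same 
table. It assumes every data row has single-word values in its first 
four columns; the variable names are hypothetical.

```python
table = """\
Name    Eye Color  Hair Color  Handedness  Favorite type of potato
Wayne   Blue       Brown       Dexter      Mashed
Sarah   Blue       Blonde      Sinister    Spam(?)
Kane    Green      White       Dexter      None
Kermit  Blue       Blue        Sinister    Idaho
"""

lefty_blues = []
for line in table.splitlines()[1:]:  # skip the header row
    # Split on whitespace and keep only the first four columns.
    name, eyes, hair, hand = line.split()[:4]
    if eyes == hair == 'Blue' and hand == 'Sinister':
        lefty_blues.append(name)

print(lefty_blues)
```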

[2] When people start to understand regexes they're like the proverbial 
man who only has a hammer. As Jamie Zawinski said[3], "Some people, when 
confronted with a problem, think 'I know, I'll use regular 
expressions.' Now they have two problems." I've come across very few 
occasions where regexes were actually useful, and it's usually for 
extracting very specifically formatted data (money, phone numbers, 
etc.) from copious amounts of text. I've not yet had a need to actually 
process words with it. Especially using Python.


Thanks again Wayne.