[Tutor] Reg. Expressions Parenthesis

Tue Jan 17 09:48:26 CET 2012

On Tue, Jan 17, 2012 at 3:07 AM, Chris Kavanagh <ckava1 at msn.com> wrote:

> Hey guys, girls, hope everyone is doing well.
>
> Here's my question, when using Regular Expressions, the docs say when
> using parenthesis, it "captures" the data. This has got me confused
> (doesn't take much), can someone explain this to me, please??
>
> Here's an example to use. It's kinda long, so, if you'd rather provide
> your own shorter ex, that'd be fine. Thanks for any help as always.
>

Here's a quick example:

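import re

data = 'Wayne Werner fake-phone: 501-555-1234, fake-SSN: 123-12-1234'

# With parentheses: the pattern has two capture groups
parsed = re.search(r'(\d{3})-(\d{3}-\d{4})', data)
print(parsed.group())    # whole match: 501-555-1234
print(parsed.groups())   # captured groups: ('501', '555-1234')

# Without parentheses: same match, but nothing is captured
parsed = re.search(r'\d{3}-\d{3}-\d{4}', data)
print(parsed.group())    # whole match: 501-555-1234
print(parsed.groups())   # no groups: ()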

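You'll notice that .group() gives you the whole match either way, while
.groups() gives you only the pieces wrapped in parentheses (and an empty
tuple when the pattern has none). This makes capturing the individual groups
pretty easy. Of course, capturing isn't just for storing the results. You
can also refer back to a captured group later in the same pattern, which is
called a backreference.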
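
As a rough sketch of both uses, here is the same phone pattern again. The
names area_code, local_number and reformatted are only illustrative, and the
"(501) 555-1234" output format is an assumption made up for this example:

```python
import re

data = 'Wayne Werner fake-phone: 501-555-1234, fake-SSN: 123-12-1234'
pattern = r'(\d{3})-(\d{3}-\d{4})'

match = re.search(pattern, data)

# Groups are numbered from 1, left to right by opening parenthesis.
area_code = match.group(1)
local_number = match.group(2)

# In a replacement string, \1 and \2 stand for the captured text.
reformatted = re.sub(pattern, r'(\1) \2', data)

print(area_code)
print(local_number)
print(reformatted)
```

The fake SSN is left alone because its middle part has only two digits, so
the pattern never matches it.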

Let's say, for some fictitious reason you want to find every letter that
appears as a double in some data. If you were to do this the "brute force"
way you'd pretty much have to do something like this:

found = []
for i in range(len(data) - 1):
    # a letter is doubled when it equals its right-hand neighbour
    if data[i] == data[i + 1] and data[i] not in found:
        found.append(data[i])
print(found)

The regex OTOH looks like this:

In [29]: data = 'aaabababbcacacceadbacdb'

In [32]: parsed = re.findall(r'([a-z])\1', data)

In [33]: parsed
Out[33]: ['a', 'b', 'c']
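
If it helps to see what the parentheses are doing there, this small sketch
compares what findall returns (the captured letter) with the full text each
match covered. The names captured and whole are just for illustration:

```python
import re

data = 'aaabababbcacacceadbacdb'
pattern = r'([a-z])\1'   # one letter, then that same letter again

# When the pattern has one group, findall returns the group, not the match.
captured = re.findall(pattern, data)

# finditer yields match objects, so the whole doubled text is available too.
whole = [m.group() for m in re.finditer(pattern, data)]

print(captured)
print(whole)
```

Note that findall can list the same letter more than once if it is doubled
in several places; wrapping the result in set() would remove the repeats.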

Now, that example was super contrived and also simple. Very few real-world
applications will be as simple as that one - usually you have much crazier
specifications, like find every person who has blue eyes AND blue hair, but
only if they're left handed. Assuming you had data that looked like this:

Name     Eye Color   Hair Color   Handedness   Favorite type of potato
Wayne    Blue        Brown        Dexter       Mashed
Sarah    Blue        Blonde       Sinister     Spam(?)
Kane     Green       White        Dexter       None
Kermit   Blue        Blue         Sinister     Idaho

You could parse out the data using captures and backreferences [1].
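
A minimal sketch of that, assuming the columns are separated by runs of
spaces and that "Sinister" means left-handed. The names table, pattern,
regex_names and split_names are hypothetical, and the plain .split() version
is included only for comparison:

```python
import re

table = """\
Name     Eye Color   Hair Color   Handedness   Favorite type of potato
Wayne    Blue        Brown        Dexter       Mashed
Sarah    Blue        Blonde       Sinister     Spam(?)
Kane     Green       White        Dexter       None
Kermit   Blue        Blue         Sinister     Idaho
"""

# Group 1 captures the name, group 2 the eye color; \2 then requires the
# hair color to be the same text as the eye color.
pattern = re.compile(r'^(\w+)[ \t]+(Blue)[ \t]+\2[ \t]+Sinister\b',
                     re.MULTILINE)
regex_names = [m.group(1) for m in pattern.finditer(table)]

# The same check without a regex: split each row and compare fields.
split_names = []
for line in table.splitlines()[1:]:          # skip the header row
    name, eyes, hair, hand = line.split()[:4]
    if eyes == hair == 'Blue' and hand == 'Sinister':
        split_names.append(name)

print(regex_names)
print(split_names)
```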

HTH,
Wayne

[1] In this situation, of course, regex is overkill. It's easier to just
.split() and compare. But if you're parsing something really nasty like EDI
then sometimes a regex is just the best way to go[2].

[2] When people start to understand regexes they're like the proverbial man
who only has a hammer. As Jamie Zawinski said[3], "Some people, when
confronted with a problem, think “I know, I'll use regular expressions.”
Now they have two problems." I've come across very few occasions where
regexes were actually useful, and it's usually extracting very specifically
formatted data (money, phone numbers, etc.) from copious amounts of text.
I've not yet had a need to actually process words with them, especially
using Python.

[3] http://regex.info/blog/2006-09-15/247