[Tutor] Regex's and "best practice"

Sat Nov 22 10:50:53 EST 2003

On 22 Nov 2003, Carl D Cravens <- raven at phoenyx.net wrote:

> Here's my question.  The From: line can appear in basically four forms,
> and I have a little chain of s///'s that try to find the "real name", and
> barring that, use the address, stripping out extra junk.  Here's the Perl
> snippet... (the "From: " has been stripped and the remainder is in $line)

> ## three formats to deal with (a bare address falls through)
> ## <address> , Name <address>, (Name) address
> $line =~ s/(^<)(.*)(>$)/$2/ ||
>     $line =~ s/<.*>// ||
>     $line =~ s/(.*)(\()(.*)(\))(.*)/$3/;

Without a little bit of cheating, your Python code will be more verbose.
But that's nothing new, nor should it be a hindrance (as long as the
number of lines doesn't differ too much).

[solution with re.sub]

> Not nearly as compact and simple as the Perl statement.

Compact, yes.

> Is this pretty much the best I can do?  The OR's were very convenient in

First, you could use search() instead of sub().  That would mean
rewriting the regexps a bit.  Second, you could use named groups
(a nice thing in Python) instead of the numbers.  Third, you could write
one or two functions which give you the terseness of Perl.

First the functions:

--8<---------------cut here---------------start------------->8---
# reg.py
import re

def assemble(regs):
    return [re.compile(reg).search for reg in regs]

def s(reglist, string, grp):
    for rcs in reglist:
        m = rcs(string)
        if m:
            return m.group(grp)
    return string
--8<---------------cut here---------------end--------------->8---

assemble() compiles each regexp and builds a list of their search
methods.  s() takes such a list, a string to be searched, and the name
of the group you want returned.  It returns as soon as one regexp
matches; if none matches, the original string is returned unchanged.
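
To see the "first match wins" and fallback behaviour in isolation, here
is a small sketch.  It repeats the two functions so that it runs on its
own; the patterns and test strings are made up for illustration and have
nothing to do with the From: problem.

```python
import re

def assemble(regs):
    # Compile each pattern and keep only its bound search method.
    return [re.compile(reg).search for reg in regs]

def s(reglist, string, grp):
    # Try each search method in order; return the named group of the
    # first match, or the unchanged string if nothing matches.
    for rcs in reglist:
        m = rcs(string)
        if m:
            return m.group(grp)
    return string

# Hypothetical patterns, only for demonstration.
demo = assemble([r'id=(?P<val>\d+)', r'name=(?P<val>\w+)'])

first = s(demo, 'id=42 name=bob', 'val')     # first pattern matches
second = s(demo, 'name=bob', 'val')          # falls through to the second
fallback = s(demo, 'nothing here', 'val')    # no match at all
```

The order of the list matters, just as the order of the ||-chained
s///'s does in Perl.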

--8<---------------cut here---------------start------------->8---
# reg2.py
from reg import *

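# Raw strings keep the backslashes intact for the regexp engine.
regs = assemble([r'(^<)(?P<name>.*)(>$)',
                 r'(?P<name>.*)<.*>',
                 r'(.*)(\()(?P<name>.*)(\))(.*)'])

filename = 'from_lines.txt'  # placeholder path; use your own file here

for line in open(filename):
    line = s(regs, line, 'name')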
--8<---------------cut here---------------end--------------->8---

assemble() is called here with a list of three regexps; the group of
interest has been named 'name'.  The construct (?P<group>...)
gives the name 'group' to a group.  The second regexp had to be
rewritten a bit: search() extracts text instead of deleting it, so the
regexp now captures the part in front of <address>.
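
A minimal sketch of what a named group buys you; the sample line is
invented.  A named group still gets a number as well, so both ways of
asking for it give the same result, but the name survives if you later
add or remove other groups.

```python
import re

# Invented sample line in the "(Name) address" form.
m = re.search(r'(\()(?P<name>.*)(\))', '(Jane Doe) jane@example.org')

by_name = m.group('name')   # look the group up by its name
by_number = m.group(2)      # the same group, counted by position
```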

Then our file is opened and every line is tried with our regexps.
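
If you want to try the three regexps without a file, something like the
following sketch should do.  The sample lines are invented, and
extract() is just a hypothetical stand-in for the s() function that
keeps the example self-contained.

```python
import re

# The three regexps from above, written as raw strings.
patterns = [re.compile(p) for p in (
    r'(^<)(?P<name>.*)(>$)',
    r'(?P<name>.*)<.*>',
    r'(.*)(\()(?P<name>.*)(\))(.*)',
)]

def extract(line):
    # Return the 'name' group of the first regexp that matches,
    # or the line itself if none does (the bare-address case).
    for pat in patterns:
        m = pat.search(line)
        if m:
            return m.group('name')
    return line

# Invented examples of the four forms a From: line can take.
samples = [
    '<jane@example.org>',
    'Jane Doe <jane@example.org>',
    '(Jane Doe) jane@example.org',
    'jane@example.org',
]

results = [extract(sample) for sample in samples]
```

Note that the second form keeps the blank in front of the "<", exactly
as the Perl s/<.*>// does; a strip() on the result would tidy that up.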

[...]
> Is there something obvious I'm missing, or is this a fair solution to the
> problem?

No, your solution is a fair one; IMO mine is only an alternative.
Which one you prefer is a matter of taste.

   Karl
-- 
Please do *not* send copies of replies to me.
I read the list