[Tutor] Re: OO approach

Danny Yoo dyoo@hkn.eecs.berkeley.edu
Mon, 25 Feb 2002 23:44:08 -0800 (PST)

On Tue, 26 Feb 2002, Prahlad Vaidyanathan wrote:

> > Can you show us the mailURL base class?  Perhaps it's possible to avoid
> > inheritance altogether by making the parent class general enough to handle
> > all cases.
> Well, I've attached the entire script here (~9K). I've just added
> another sub-class for the Wired Daily News as well, but I am yet to
> check if that works (it should).
> The script works all right, but something tells me there _must_ be a
> better way of doing this :-)

If I'm understanding things, it appears that mailURL has two distinct
modes that it runs under:

    1.  Reading content from a local file.
    2.  Reading content from a web url.
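
If the file-vs-url split is all that separates the two modes, a small
helper could hide it.  Here's a minimal sketch — the name get_content()
is my own invention, and it's written in modern Python rather than the
script's Python 2:

```python
import urllib.request

def get_content(source):
    """Return the text behind 'source', which may be either a
    local file path or an http url.  (Hypothetical helper; the
    original script would use Python 2's urllib instead.)"""
    if source.startswith(('http://', 'https://', 'ftp://')):
        with urllib.request.urlopen(source) as f:
            return f.read().decode('utf-8', 'replace')
    with open(source) as f:
        return f.read()
```

With something like this in place, the rest of mailURL never needs to
know where its text came from.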

The 'LinuxWeeklyNews', 'KernelTraffic', and 'LinuxDoc' classes have a
pretty regular structure for 'weekly' stuff.  On the flip side, WiredNews,
Register, and LinuxJournal classes look fairly similar as stuff that
handles 'daily' news.

Would it be possible to somehow unify these two categories into one?

That is, it might be possible to have a separate program that periodically
retrieves the Daily stuff and writes it to temporary files, formatted to
look like what the Weekly news sources provide.  That way, we can reduce
the complexity in mailURL and remove the need for retrieveURL().  If
you're running on Unix, 'cron' should be helpful for scheduling this, and
I'm pretty sure that Windows has a scheduler as well.
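
Such a fetcher could be quite small.  A sketch, again in modern Python
— mirror(), SOURCES, and the urls here are all illustrative names, not
anything from the script:

```python
import os
import urllib.request

def mirror(sources, cache_dir):
    """Hypothetical fetcher, run periodically (e.g. from cron):
    it copies each Daily source into a local file, so that
    mailURL only ever has to read files."""
    for name, url in sources.items():
        with urllib.request.urlopen(url) as f:
            data = f.read()
        with open(os.path.join(cache_dir, name + '.html'), 'wb') as out:
            out.write(data)

SOURCES = {'wired': 'http://www.wired.com/news/'}
# mirror(SOURCES, '/tmp/news-cache')   # e.g. once a day from cron
```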

The extractURLs() functions that do the "Daily" stuff have a very similar
structure; each one grabs every three-line paragraph that contains a url.
In fact, the significant difference I see between WiredNews.extractURLs(),
Register.extractURLs(), and LinuxJournal.extractURLs() is the last line:

    links[desc] = string.strip(line)            ## WiredNews
    links[desc] = string.strip(line)            ## Register
    links[desc] = string.strip(match.group(1))  ## LinuxJournal

We might be able to generalize these three regular expressions:

    regex = re.compile(r'http://www.theregister.co.uk/')
    regex = re.compile(r'(http://www.linuxjournal.com/article.php[^\s]*)')
    regex = re.compile(r'http://www.wired.com/news/..*')

from the main() function so that we can treat all three cases the same way
in extractURLs().  If each pattern puts a capturing group around the url,
the last line becomes identical:

    links[desc] = string.strip(match.group(1))  ## All three

And that should cut down on the class madness.  *grin*
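
To make that concrete, here's one shape the merged class might take.
This is only a sketch in modern Python, not the script's actual code:
NewsSource, url_pattern, and the way desc is derived are my own
inventions.

```python
import re

class NewsSource:
    """One class for all the 'Daily' sources: the only thing
    that varies is the url pattern, so take it as an argument
    instead of subclassing."""
    def __init__(self, name, url_pattern):
        self.name = name
        self.regex = re.compile(url_pattern)

    def extractURLs(self, text):
        links = {}
        # Walk the text paragraph by paragraph, as the original
        # extractURLs() methods did, keeping those with a url.
        for para in [p for p in text.split('\n\n') if p.strip()]:
            match = self.regex.search(para)
            if match:
                # Illustrative: use the first line as the description.
                desc = para.strip().split('\n')[0].strip()
                links[desc] = match.group(1).strip()
        return links

wired    = NewsSource('WiredNews', r'(http://www\.wired\.com/news/\S*)')
register = NewsSource('Register', r'(http://www\.theregister\.co\.uk/\S*)')
```

main() would then just build a list of NewsSource instances, one per
site, instead of picking a subclass for each.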

By the way, you might find the following module useful: it defines a
regular expression object, 'url_re', that matches urls (http, ftp, and a
few other schemes) and that you can use to findall() urls in a document.

## http_regular_expression.py
## This is a regular expression that detects HTTP urls.
## This is only a small sample of tchrist's very nice tutorial on
## regular expressions.  See:
##     http://www.perl.com/doc/FMTEYEWTK/regexps.html
## for more details.

import re

urls = '(%s)' % '|'.join("""http telnet gopher file wais ftp""".split())
ltrs = r'\w'
gunk = r'/#~:.?+=&%@!\-'
punc = r'.:?\-'
any = "%(ltrs)s%(gunk)s%(punc)s" % { 'ltrs' : ltrs,
                                     'gunk' : gunk,
                                     'punc' : punc }

url = r"""
    \b                            # start at word boundary
    (                             # begin \1 {
        %(urls)s    :             # need resource and a colon
        [%(any)s] +?              # followed by one or more
                                  #  of any valid character, but
                                  #  be conservative and take only
                                  #  what you need to....
    )                             # end   \1 }
    (?=                           # look-ahead non-consumptive assertion
            [%(punc)s]*           # either 0 or more punctuation
            [^%(any)s]            #  followed by a non-url char
        |                         # or else
            $                     #  then end of the string
    )                             # end look-ahead
    """ % {'urls' : urls,
           'any' : any,
           'punc' : punc }

url_re = re.compile(url, re.VERBOSE)

def _test():
    sample = """hello world, this is an url:
                http://python.org.  Can you find it?"""
    match = url_re.search(sample)
    print "Here's what we found: '%s'" % match.group(0)

if __name__ == '__main__':
    _test()
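
For findall(), the same idea works in one line.  A self-contained
demonstration in modern Python, with the pattern simplified for brevity
(the full re.VERBOSE pattern above is the one to use in practice):

```python
import re

# A simplified, self-contained version of the url pattern above,
# just to demonstrate findall(): scheme, '://', then url characters,
# ending on a character that isn't trailing punctuation.
url_re = re.compile(
    r'\b(?:http|ftp|telnet|gopher|file|wais)://'
    r'[\w/#~:.?+=&%@!\-]+[\w/#~=&%@]')

text = """See http://python.org/ and also
ftp://ftp.example.com/pub/file.txt for details."""

print(url_re.findall(text))
```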

Well, let's start refactoring!  *grin*

Good luck to you.