[Tutor] Re: OO approach [HTTP urls and regular expressions]

Mon, 25 Feb 2002 23:55:14 -0800 (PST)

On Mon, 25 Feb 2002, Danny Yoo wrote:

> By the way, you might find the following module useful: this is a regular
> expression that matches HTTP urls.  It defines an 'url_re' object that you
> can use to findall() urls in a document.

Yikes, there were some typos in that source code.  Let me send the
repaired version of http_regular_expression.py, as well as an example of
how we can use it:

###
>>> from http_regular_expression import url_re
>>> import urllib
>>>
>>> contents = urllib.urlopen('http://python.org').read()
>>>
>>> url_re.findall(contents)[:10]
[('http://barry.wooz.org/software/ht2html', 'http'),
('http://www.jython.org/', 'http'), ('http://www.endeavors.com/pippy/',
'http'), ('http://idlefork.sourceforge.net', 'http'),
('http://python.sourceforge.net/', 'http'),
('http://python.sourceforge.net/peps/', 'http'),
('http://aspn.activestate.com/ASPN/Cookbook/Python', 'http'),
('http://www.python.org/cgi-bin/faqw.py', 'http'),
('http://www.europython.org', 'http'),
('http://sourceforge.net/bugs/?group_id=5470', 'http')]
###

Cool, it works.  *grin*
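
Note that findall() hands back (url, scheme) tuples rather than plain
strings. That is because the pattern has two capturing groups: the outer
\1 group and the scheme alternation nested inside it. Here is a minimal
sketch of pulling out only the urls. It rebuilds a condensed, non-verbose
copy of the pattern so that it runs on its own; the sample text and the
names 'url_pattern', 'sample', 'pairs', and 'just_urls' are made up for
illustration and are not part of the module.

```python
import re

# Condensed copy of the url_re pattern, written without re.VERBOSE so
# that this sketch is self-contained (an assumption for illustration).
schemes = '(%s)' % '|'.join('http telnet gopher file wais ftp'.split())
punc = r'.:?\-'
any_chars = r'\w' + r'/#~:.?+=&%@!\-' + punc
url_pattern = re.compile(
    r'\b(%s:[%s]+?)(?=[%s]*[^%s]|$)' % (schemes, any_chars, punc, any_chars))

# Hypothetical sample text with two urls in it.
sample = "See http://python.org/doc/ and ftp://ftp.python.org/pub, ok?"

# findall() returns one tuple per match: (whole url, scheme).
pairs = url_pattern.findall(sample)

# finditer() gives match objects, so group(1) selects just the url.
just_urls = [m.group(1) for m in url_pattern.finditer(sample)]

print(pairs)
print(just_urls)
```

Taking the first element of each tuple from findall() would work just as
well; finditer() is handy when the match positions are also needed.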

######
## This is a regular expression that detects urls.  Despite the module
## name, it accepts the http, telnet, gopher, file, wais, and ftp schemes.
##
## This is only a small sample of tchrist's very nice tutorial on
## regular expressions.  See:
##
##     http://www.perl.com/doc/FMTEYEWTK/regexps.html
##
## for more details.

import re

urls = '(%s)' % '|'.join("""http telnet gopher file wais ftp""".split())
ltrs = r'\w'
gunk = r'/#~:.?+=&%@!\-'
punc = r'.:?\-'
any = "%(ltrs)s%(gunk)s%(punc)s" % { 'ltrs' : ltrs,
                                     'gunk' : gunk,
                                     'punc' : punc }

url = r"""
    \b                            # start at word boundary
    (                             # begin \1 {
        %(urls)s    :             # need resource and a colon
        [%(any)s] +?              # followed by one or more
                                  #  of any valid character, but
                                  #  be conservative and take only
                                  #  what you need to....
    )                             # end   \1 }
    (?=                           # look-ahead non-consumptive assertion
            [%(punc)s]*           # either 0 or more punctuation
            [^%(any)s]            #  followed by a non-url char
        |                         # or else
            $                     #  then end of the string
    )
    """ % {'urls' : urls,
           'any' : any,
           'punc' : punc }

url_re = re.compile(url, re.VERBOSE)

def _test():
    sample = """hello world, this is an url:
                http://python.org.  Can you find it?"""
    match = url_re.search(sample)
    if match is None:
        # search() returns None when nothing matches
        print("No url found.")
    else:
        print("Here's what we found: '%s'" % match.group(0))

if __name__ == '__main__':
    _test()
######
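
The look-ahead at the end is what keeps sentence punctuation out of the
match: the non-greedy [any]+? stops as soon as the rest of the text looks
like "optional punctuation, then a non-url character". Here is a small
sketch of how that plays out, again using a condensed copy of the pattern
so it runs alone. The helper 'first_url' and the sample sentences are
hypothetical, added only for illustration.

```python
import re

# Condensed, non-verbose copy of the pattern so the sketch runs alone.
schemes = '(%s)' % '|'.join('http telnet gopher file wais ftp'.split())
punc = r'.:?\-'
any_chars = r'\w/#~:.?+=&%@!\-' + punc
url_pattern = re.compile(
    r'\b(%s:[%s]+?)(?=[%s]*[^%s]|$)' % (schemes, any_chars, punc, any_chars))

def first_url(text):
    """Return the first url found in text, or None if there is none."""
    match = url_pattern.search(text)
    if match is None:
        return None
    return match.group(1)

# Punctuation followed by a non-url character is left out of the match.
mid_sentence = first_url("Try http://python.org/doc/. It is good.")
question = first_url("Is it http://python.org/?x=1? Maybe.")

# At the very end of the string only the '$' branch can succeed, so a
# trailing period stays attached to the url.
at_end = first_url("Try http://python.org.")

# No url at all: search() finds nothing and we get None back.
nothing = first_url("no links here")

print(mid_sentence)
print(question)
print(at_end)
print(nothing)
```

The end-of-string case is worth knowing about: when a url is the last
thing in the text, any trailing punctuation comes along with it, so it
may need to be stripped separately.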

Sorry about the mistake!