[Tutor] re: Finding an URL

Danny Yoo dyoo@acoma.Stanford.EDU
Tue, 10 Sep 2002 14:30:46 -0700 (PDT)


On Tue, 10 Sep 2002, Danny Yoo wrote:

> Hi Terje,
>
> This parsing of URLs in text seems like something that is reinvented
> quite a bit.  Perhaps it might be nice for someone to do a review of the
> code out there and package it nicely into the standard library?
>
[some text cut]
>
> I ported over Tom Christiansen's URL regular expression here:
>
>     http://mail.python.org/pipermail/tutor/2002-February/012481.html



Hmmm... I actually tried a silly test on the original program:

###
>>> url_re.findall('http://python.org...')
[('http://python.org...', 'http')]
###

and that's not right!  It should realize that the trailing dots aren't
part of the URL.  So I took a closer look at tchrist's original Perl
regular expression, and it appears to have a bug.  Doh.  Serves me right
for porting the regex directly... *grin*
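
To show one way a regex can end up swallowing trailing dots like that,
here's a small sketch.  It is my own simplified illustration (http only,
with the made-up names `ungrouped` and `grouped`), not the regex from the
original program, and I'm only assuming the problem is of this general
kind: an alternation inside a look-ahead that isn't grouped the way it
was meant to be.

```python
import re

# Hypothetical, simplified character sets written for this illustration.
ANY = r'\w/#~:.?+=&%@!\-'    # characters allowed inside a URL
PUNC = r'.:?\-'              # punctuation that may trail a URL

# Ungrouped alternation: "punctuation then a non-URL char" OR "end of string".
# At the end of the text only the bare $ branch can succeed, so the lazy
# match has to run all the way to the end, dots included.
ungrouped = re.compile(r'\bhttp:[%s]+?(?=[%s]*[^%s]|$)' % (ANY, PUNC, ANY))

# Grouped alternation: "punctuation, then (a non-URL char OR end of string)".
# Trailing punctuation is now skipped in both cases.
grouped = re.compile(r'\bhttp:[%s]+?(?=[%s]*(?:[^%s]|$))' % (ANY, PUNC, ANY))

text = 'http://python.org...'
print(ungrouped.findall(text))   # ['http://python.org...']
print(grouped.findall(text))     # ['http://python.org']
```

When the dots are followed by more text, say a space, both patterns stop
at 'http://python.org'; the two only differ at the very end of the input.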




The following version should work better for you:

###
"""parseUrls.py  A regular expression that detects URLs.

Danny Yoo (dyoo@hkn.eecs.berkeley.edu)

This is only a small sample of tchrist's very nice tutorial on
regular expressions.  See:

    http://www.perl.com/doc/FMTEYEWTK/regexps.html

for more details.

Note: this properly handles strings like "http://python.org.", where a
period follows the URL: the trailing period is left out of the match."""


import re

def grabUrls(text):
    """Given a text string, returns all the urls we can find in it."""
    return url_re.findall(text)


# The scheme names we recognize, joined into a non-capturing alternation.
urls = '(?: %s)' % '|'.join("http telnet gopher file wais ftp".split())
ltrs = r'\w'
gunk = r'/#~:.?+=&%@!\-'
punc = r'.:?\-'
# Everything allowed inside a URL.  Named any_chars rather than any so
# that it doesn't shadow the any() builtin of later Python versions.
any_chars = "%(ltrs)s%(gunk)s%(punc)s" % { 'ltrs' : ltrs,
                                           'gunk' : gunk,
                                           'punc' : punc }

url = r"""
    \b                            # start at word boundary
        %(urls)s    :             # need resource and a colon
        [%(any)s]  +?             # followed by one or more
                                  #  of any valid character, but
                                  #  be conservative and take only
                                  #  what you need to....
    (?=                           # look-ahead non-consumptive assertion
            [%(punc)s]*           # either 0 or more punctuation
            (?:   [^%(any)s]      #  followed by a non-url char
                |                 #   or end of the string
                  $
            )
    )
    """ % {'urls' : urls,
           'any' : any_chars,
           'punc' : punc }

url_re = re.compile(url, re.VERBOSE | re.MULTILINE)


def _test():
    sample = """hello world, this is an url:
                http://python.org.  Can you find it?"""
    match = url_re.search(sample)
    print("Here's what we found: '%s'" % match.group(0))

if __name__ == '__main__':
    _test()
###
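
One thing to keep in mind is that the scheme list is fixed, so an address
that starts with a scheme outside it (https, for instance) won't be
picked up.  Here is a rough sketch of building the same kind of pattern
for any list of schemes.  The helper name `make_url_re` is made up for
this example and isn't part of parseUrls.py; treat it as one possible
approach, not the definitive one.

```python
import re

LTRS = r'\w'
GUNK = r'/#~:.?+=&%@!\-'
PUNC = r'.:?\-'
ANY = LTRS + GUNK + PUNC     # every character allowed inside a URL

def make_url_re(schemes):
    """Hypothetical helper: compile a URL regex for the given scheme names."""
    alternatives = '|'.join(re.escape(s) for s in schemes)
    pattern = r"""
        \b                          # start at a word boundary
        (?: %(schemes)s ) :         # one of the schemes, then a colon
        [%(any)s] +?                # as few URL characters as possible
        (?=                         # ...as long as what follows is
            [%(punc)s]*             #  optional trailing punctuation
            (?: [^%(any)s] | $ )    #  then a non-URL char or end of line
        )
    """ % {'schemes': alternatives, 'any': ANY, 'punc': PUNC}
    return re.compile(pattern, re.VERBOSE | re.MULTILINE)

text = 'See https://www.python.org/doc/. Also ftp://ftp.example.org/pub, ok?'
print(make_url_re(['http', 'https', 'ftp']).findall(text))
print(make_url_re(['http']).findall('https://python.org'))   # []
```

Because the scheme has to be followed directly by a colon, 'http' alone
never matches the front of an 'https:' address, which is why the second
call comes back empty.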





Here's a small demonstration of one way to use this code:

###
>>> import urllib                    # Python 2 module layout
>>> from parseUrls import grabUrls
>>> grabUrls(urllib.urlopen('http://python.org').read())[:5]
['http://ht2html.sf.net',
 'http://sourceforge.net/bugs/?group_id=5470',
 'http://sourceforge.net/patch/?group_id=5470',
 'http://sourceforge.net/cvs/?group_id=5470',
 'http://www.jython.org/']
>>> grabUrls("this is a test")
[]
>>> grabUrls("http://python.org.....")
['http://python.org']
>>> grabUrls("In a hole in the ground there lived a hobbit.  "
...          + "See: http://www.coldal.org/hobbit.htm")
['http://www.coldal.org/hobbit.htm']
###
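
The session above uses Python 2's urllib.  On Python 3, urlopen lives in
urllib.request and read() hands back bytes, so the page has to be decoded
before the regex sees it.  A minimal sketch of that, assuming UTF-8 as
the default encoding; the names `grab_urls_from_bytes` and
`grab_urls_from_page` are invented for this example, and the pattern is
just a compact one-line spelling of the same idea:

```python
import re
from urllib.request import urlopen

ANY = r'\w/#~:.?+=&%@!\-'
PUNC = r'.:?\-'
URL_RE = re.compile(
    r'\b(?:http|telnet|gopher|file|wais|ftp):[%s]+?(?=[%s]*(?:[^%s]|$))'
    % (ANY, PUNC, ANY),
    re.MULTILINE)

def grab_urls_from_bytes(data, encoding='utf-8'):
    """Decode raw bytes (replacing undecodable ones), then scan for URLs."""
    return URL_RE.findall(data.decode(encoding, errors='replace'))

def grab_urls_from_page(address):
    """Fetch a page over the network and return the URLs in its body."""
    with urlopen(address) as response:
        return grab_urls_from_bytes(response.read())

print(grab_urls_from_bytes(b'<a href="http://www.jython.org/">Jython</a>'))
```

Calling grab_urls_from_page('http://python.org') needs a network
connection, and what it returns depends on the page as it is today.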


Hope this helps!