[Tutor] Re: Regex (almost solved, parser problem remaining)

Fri Aug 29 20:31:39 EDT 2003

I've looked at all the tips I got. Kirk, I tried the Wiki's parser, but 
it has the obvious problem that a plain-text link between double quotes 
is not recognized as a link (which is why I didn't use the 
backward-looking regex in the first place). I have not checked what 
happens if an <a> tag contains invalid HTML, for example because a 
link is constructed like this: <a href=http://invalidlink.com>Invalid</a>.

Based on Danny's example, I ended up using the SGML parser combined with 
a regex (code and a test are included at the bottom of this message). It 
also converts invalid HTML links (like the one shown above) to valid 
HTML and shortens the visible text of long links a bit.
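
The attribute re-quoting that repairs such links can be illustrated on 
its own. This is only a rough sketch, not the code from this message: it 
uses Python 3's html.parser.HTMLParser as a stand-in (sgmllib is not 
available in Python 3), and the class name QuoteFixer is made up for the 
illustration. Attributes without a value are not handled.

```python
from html.parser import HTMLParser


class QuoteFixer(HTMLParser):
    """Hypothetical helper: re-emits tags with every attribute value quoted."""

    def __init__(self):
        # convert_charrefs=False keeps "&..." handling out of this sketch
        super().__init__(convert_charrefs=False)
        self.elements = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as (name, value) pairs with any quotes already stripped
        parts = [tag] + ['%s="%s"' % (name, value) for name, value in attrs]
        self.elements.append("<%s>" % " ".join(parts))

    def handle_endtag(self, tag):
        self.elements.append("</%s>" % tag)

    def handle_data(self, data):
        self.elements.append(data)


fixer = QuoteFixer()
fixer.feed("<a href=http://invalidlink.com>Invalid</a>")
fixer.close()
fixed = "".join(fixer.elements)
print(fixed)  # <a href="http://invalidlink.com">Invalid</a>
```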

With my tests it works OK except for one thing: the SGML parser chokes 
on "&". The original says:

    go to news://bl_a.com/?ha-ha&query=tb for more info

and it's modified to:

    go to <a href="news://bl_a.com/?ha-ha">news://bl_a.com/?ha-ha</a>&query=tb for more info

The bit starting with "&" doesn't make it into the href attribute and 
hence also falls outside the link.
The parser treats "&" as the start of an entity ref or a char ref and 
by default would remove it entirely. I use handle_entityref to put it 
back in, but by then the preceding text has already been passed to 
handle_data and run through the linkify method, which generates the 
links. This means that the "&query" stuff is appended after the 
generated link instead of ending up inside it.
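
The split can be reproduced in isolation. The sketch below is not the 
parser from this message: it uses Python 3's html.parser.HTMLParser with 
convert_charrefs=False as a stand-in for sgmllib (on the assumption that 
the two behave alike for this input), and the EventLogger class is made 
up for the illustration. It only records the order in which the 
callbacks fire.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Hypothetical helper: records each callback and its argument."""

    def __init__(self):
        # with convert_charrefs=False, "&name" is reported via handle_entityref
        super().__init__(convert_charrefs=False)
        self.events = []

    def handle_data(self, data):
        self.events.append(("data", data))

    def handle_entityref(self, name):
        self.events.append(("entityref", name))


logger = EventLogger()
logger.feed("go to news://bl_a.com/?ha-ha&query=tb for more info")
logger.close()
for event in logger.events:
    print(event)
```

The text before the "&" reaches handle_data as a chunk of its own, so a 
URL regex applied per chunk never sees "&query=tb" as part of the URL.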

Any suggestions on how I can deal with this particular problem?

Andrei

===[ EXAMPLE RESULTS ]========
Put them in an HTML file and view it in a browser.

<pre>
ORIGINAL
Plain old link: http://www.mail.yahoo.com.
Containing numbers: ftp://bla.com/ding/co.rt,39,%93 or other
Go to news://bl_a.com/?ha-ha&query=tb for more info.
A real link: <a href="http://x.com">http://x.com</a>.
ftp://verylong.org/url/must/be/chopped/to/pieces/oritwontfit.html (long one)
<IMG src="http://b.com/image.gif" /> (a plain image tag)
<IMG src="http://images.com/image.gif" ALT="Nice image"/> (image tag with alt text)
<a href=http://fixedlink.com/orginialinvalid.html>fixed</a> (original invalid HTML)
Link containing an anchor "http://myhomepage.com/index.html#01".

MODIFIED
Plain old link: <a href="http://www.mail.yahoo.com">http://www.mail.yahoo.com</a>.
Containing numbers: <a href="ftp://bla.com/ding/co.rt,39,%93">ftp://bla.com/ding/co.rt,39,%93</a> or other
Go to <a href="news://bl_a.com/?ha-ha">news://bl_a.com/?ha-ha</a>&query=tb for more info.
A real link: <a href="http://x.com">http://x.com</a>.
<a href="ftp://verylong.org/url/must/be/chopped/to/pieces/oritwontfit.html">ftp://verylong.org...s/oritwontfit.html</a> (long one)
<a href="http://b.com/image.gif">http://b.com/image.gif (image)</a> (a plain image tag)
<a href="http://images.com/image.gif">Nice image (image)</a> (image tag with alt text)
<a href="http://fixedlink.com/orginialinvalid.html">fixed</a> (original invalid HTML)
Link containing an anchor "<a href="http://myhomepage.com/index.html#01">http://myhomepage.com/index.html#01</a>".
</pre>

===[ CODE ]====================
mytext = """
Plain old link: http://www.mail.yahoo.com.
Containing numbers: ftp://bla.com/ding/co.rt,39,%93 or other
Go to news://bl_a.com/?ha-ha&query=tb for more info.
A real link: <a href="http://x.com">http://x.com</a>.
ftp://verylong.org/url/must/be/chopped/to/pieces/oritwontfit.html (long one)
<IMG src="http://b.com/image.gif" /> (a plain image tag)
<IMG src="http://images.com/image.gif" ALT="Nice image"/> (image tag with alt text)
<a href=http://fixedlink.com/orginialinvalid.html>fixed</a> (original invalid HTML)
Link containing an anchor "http://myhomepage.com/index.html#01".
"""

import sgmllib, re
class LinkMaker(sgmllib.SGMLParser):
     """A parser which converts any stand-alone URL (meaning it is not
        inside a link, nor an attribute to a tag) to a proper link and
        can regenerate the modified HTML code.
        It also replaces IMG-tags with links to the images."""
     def __init__(self):
         sgmllib.SGMLParser.__init__(self)
         self.anchorlevel = 0 # pays attention to nested anchors
         self.elements = [] # stores the bits and pieces which will be joined
         # don't allow generation of links which are huge. If the MAXLENGTH
         # is exceeded by an URL, it will be chopped a bit in self.makeLink.
         self.MAXLENGTH = 40

     def unknown_starttag(self, tag, attrs):
         """Handle all start tags."""
         if tag == 'a':
             self.anchorlevel += 1 # store nested anchor depth
         if tag == 'img':
             # convert img-tag to link
             tag = 'a'
             # make a dictionary of attributes in order to convert those as well
             attribs = {"src": "", "alt": ""}
             for attrib in attrs:
                 attribs[attrib[0]] = attrib[1].strip()
             # generate alt attribute from link if otherwise not available
             if not attribs["alt"]:
                 attribs["alt"] = self.limitLength(attribs["src"])
             # only convert to link if a src attribute is present
             if attribs["src"]:
                 self.elements.append('<a href="%s">%s (image)</a>' % \
                                      (attribs["src"], attribs["alt"]))
         else: # for non-img tags normal handling
             # build a string containing the attribs
             attribsstring = " ".join(['%s="%s"' % attrib for attrib in attrs])
             if attribsstring:
                 elem = " ".join([tag, attribsstring])
             else:
                 elem = tag
             self.elements.append("<%s>" % elem)

     def unknown_endtag(self, tag):
         """Handle all end tags."""
         if tag == 'a':
             self.anchorlevel -= 1
             # don't allow anchorlevel <0 (invalid HTML in fact)
             self.anchorlevel = max(0, self.anchorlevel)
         # an img tag has already been emitted as a complete link by
         # unknown_starttag, so an explicit </img> must not add a stray </a>
         if tag == 'img':
             return
         self.elements.append("</%s>" % tag)

     def handle_entityref(self, ref):
         """Is called when an entity reference (&name) is found.
            These must pass through unmodified."""
         # Note: the parser consumes a terminating ";" if there is one,
         # and it is not restored here.
         self.elements.append("&%s" % ref)

     def limitLength(self, text):
         """Returns a string with a maximum length of self.MAXLENGTH."""
         if len(text)>self.MAXLENGTH:
             # don't allow the text to become too large
             text = "%s...%s" % (text[:self.MAXLENGTH//2-2],
                                 text[-(self.MAXLENGTH//2-2):])
         return text

     def makeLink(self, matchobj):
         """Function called whenever linkify matches. Takes the
            match object and returns a link."""
         url = matchobj.group() # this will be in the href
         # this will be the data of the tag (the visible link)
         text = self.limitLength(url)
         return '<a href="%s">%s</a>' % (url, text)

     def linkify(self, text):
         """Regex for finding URLs:
            URLs start with http(s)/ftp/news (?:http|https|ftp|news)
            followed by ://
            then one or more runs of letters, digits, commas, at signs,
            forward slashes, percent signs, colons, ampersands, hashes,
            question marks, equality signs, dashes and underscores,
            each run optionally followed by dots, but ending in a non-dot!
              (?:[a-zA-Z0-9,@/%:\&#\?=\-_]+\.*)+[a-zA-Z0-9,/%:\&#\?=\-_]

            Result:
              (?:http|https|ftp|news)://(?:[a-zA-Z0-9,@/%:\&#\?=\-_]+\.*)+[a-zA-Z0-9,/%:\&#\?=\-_]

            Tests:
               Plain old link: http://www.mail.yahoo.com.
               Containing numbers: ftp://bla.com/ding/co.rt,39,%93 or other
               Go to news://bl_a.com/?ha-ha&query=tb for more info.
               A real link: <a href="http://x.com">http://x.com</a>.
               ftp://verylong.org/url/must/be/chopped/to/pieces/oritwontfit.html (long one)
               <IMG src="http://b.com/image.gif" /> (a plain image tag)
               <a href=http://fixedlink.com/orginialinvalid.html>fixed</a> (original invalid HTML)
               Link containing an anchor <b>"http://myhomepage.com/index.html#01"</b>.
         """
         expression = r"(?:http|https|ftp|news)://(?:[a-zA-Z0-9,@/%:\&#\?=\-_]+\.*)+[a-zA-Z0-9,/%:\&#\?\=\-_]"
         linkparser = re.compile(expression, re.I)
         text = linkparser.sub(self.makeLink, text)
         return text

     def handle_data(self, data):
         """Handle data between tags. If not inside a link
            (anchorlevel==0), then make sure any URLs are
            converted to links."""
         if self.anchorlevel == 0:
             data = self.linkify(data)
         self.elements.append(data)

     def getResult(self):
         """Returns the updated HTML. This just consists of
            joining the elements."""
         return "".join(self.elements)

parser = LinkMaker()
parser.feed(mytext)
parser.close() # make the parser process any data it is still buffering
print parser.getResult()
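
For comparison, the regex part can be run on its own, outside any 
parser. This is a rough Python 3 restatement of linkify, makeLink and 
limitLength as plain functions; the snake_case names are made up for 
this sketch, and the redundant backslashes inside the character classes 
are dropped, which should not change what matches. Given the whole 
sentence at once, the pattern does include the "&query=tb" part in the 
link, which suggests the regex itself is not where the "&" gets lost.

```python
import re

MAX_LENGTH = 40  # same limit as self.MAXLENGTH in the class above

URL_PATTERN = re.compile(
    r"(?:http|https|ftp|news)://"
    r"(?:[a-zA-Z0-9,@/%:&#?=\-_]+\.*)+"
    r"[a-zA-Z0-9,/%:&#?=\-_]",
    re.I,
)


def limit_length(text, max_length=MAX_LENGTH):
    """Shorten text to head...tail if it is longer than max_length."""
    if len(text) > max_length:
        keep = max_length // 2 - 2
        text = "%s...%s" % (text[:keep], text[-keep:])
    return text


def make_link(match):
    """Turn a regex match into an anchor with a shortened visible text."""
    url = match.group()
    return '<a href="%s">%s</a>' % (url, limit_length(url))


def linkify(text):
    """Replace every URL found in text by a link to that URL."""
    return URL_PATTERN.sub(make_link, text)


print(linkify("Go to news://bl_a.com/?ha-ha&query=tb for more info."))
print(limit_length(
    "ftp://verylong.org/url/must/be/chopped/to/pieces/oritwontfit.html"))
```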