[Tutor] Re: Regex (almost solved, parser problem remaining)
Andrei
project5 at redrival.net
Fri Aug 29 20:31:39 EDT 2003
I've looked at all tips I got. Kirk, I tried the Wiki's parser but it
has the obvious problem that a plain-text link in between double quotes
will not be recognized as being a link (that being the reason I didn't
use the backward-looking regex in the first place). I have not looked at
what happens if an <a>-tag contains invalid HTML, for example because a
link is constructed like this: <a href=http://invalidlink.com>Invalid</a>.
Based on Danny's example, I ended up using the SGML parser combined with
a regex (code and a test included at the bottom of this message). It
also converts invalid HTML links (like the one shown above) to valid
HTML and chops long links up a bit.
With my tests it works OK except for one thing: the SGML parser chokes
on "&". The original says:
go to news://bl_a.com/?ha-ha&query=tb for more info
and it's modified to:
go to <a
href="news://bl_a.com/?ha-ha">news://bl_a.com/?ha-ha</a>&query=tb for
more info
The bit starting with "&" doesn't make it into the href attribute and
hence also falls outside the link.
The parser recognizes "&" as the start of an entity ref or a char ref
and by default would like to remove it entirely. I use
handle_entityref to put it back in, but by that time the preceding text
has already been run through the linkify method, which generates the
links. This means that the "&query" stuff is added after the generated
link instead of inside it.
Any suggestions on how I can deal with this particular problem?
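One possible direction, as a minimal sketch only: collect the text and the entity refs in a buffer and run the regex once the whole run of text is known, instead of linkifying inside handle_data. The BufferedText class, its flush method and make_link below are hypothetical names made up for illustration; they are not part of sgmllib. The callbacks are replayed by hand in the order described above, so the sketch runs without the parser.

```python
import re

# The URL pattern from the posted code, with redundant backslashes dropped.
URL_RE = re.compile(
    r"(?:http|https|ftp|news)://"
    r"(?:[a-zA-Z0-9,@/%:&#?=\-_]+\.*)+[a-zA-Z0-9,/%:&#?=\-_]",
    re.I)


def make_link(match):
    """Wrap a matched URL in an anchor tag (length limiting left out)."""
    url = match.group()
    return '<a href="%s">%s</a>' % (url, url)


class BufferedText(object):
    """Hypothetical helper, not part of sgmllib: it collects text and
    entity refs and only linkifies once the whole run of text is known."""

    def __init__(self):
        self.pending = []   # text fragments seen since the last tag
        self.elements = []  # finished output pieces

    def handle_data(self, data):
        # Do not linkify yet, more of the URL may still follow.
        self.pending.append(data)

    def handle_entityref(self, ref):
        # Put the "&" back, but keep it in the same run of text.
        self.pending.append("&" + ref)

    def flush(self):
        """Meant to be called at every start/end tag and after the last feed()."""
        text = "".join(self.pending)
        self.pending = []
        if text:
            self.elements.append(URL_RE.sub(make_link, text))


# Replay the callbacks for the "&query" example by hand.
buf = BufferedText()
buf.handle_data("go to news://bl_a.com/?ha-ha")
buf.handle_entityref("query")
buf.handle_data("=tb for more info")
buf.flush()
result = "".join(buf.elements)
print(result)
```

In the real parser, flush would presumably have to be called from unknown_starttag and unknown_endtag before the tag itself is appended, and it would need to skip the regex while anchorlevel is above 0.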
Andrei
===[ EXAMPLE RESULTS ]========
Put them in an HTML file and view it in a browser.
<pre>
<b><u>ORIGINAL</u></b>
Plain old link: http://www.mail.yahoo.com.
Containing numbers: ftp://bla.com/ding/co.rt,39,%93 or other
Go to news://bl_a.com/?ha-ha&query=tb for more info.
A real link: <a href="http://x.com">http://x.com</a>.
ftp://verylong.org/url/must/be/chopped/to/pieces/oritwontfit.html (long one)
<IMG src="http://b.com/image.gif" /> (a plain image tag)
<IMG src="http://images.com/image.gif" ALT="Nice image"/> (image tag
with alt text)
<a href=http://fixedlink.com/orginialinvalid.html>fixed</a> (original
invalid HTML)
Link containing an anchor <b>"http://myhomepage.com/index.html#01"</b>.
<b><u>MODIFIED</u></b>
Plain old link: <a
href="http://www.mail.yahoo.com">http://www.mail.yahoo.com</a>.
Containing numbers: <a
href="ftp://bla.com/ding/co.rt,39,%93">ftp://bla.com/ding/co.rt,39,%93</a>
or other
Go to <a
href="news://bl_a.com/?ha-ha">news://bl_a.com/?ha-ha</a>&query=tb for
more info.
A real link: <a href="http://x.com">http://x.com</a>.
<a
href="ftp://verylong.org/url/must/be/chopped/to/pieces/oritwontfit.html">ftp://verylong.org...s/oritwontfit.html</a>
(long one)
<a href="http://b.com/image.gif">http://b.com/image.gif (image)</a> (a
plain image tag)
<a href="http://images.com/image.gif">Nice image (image)</a> (image tag
with alt text)
<a href="http://fixedlink.com/orginialinvalid.html">fixed</a> (original
invalid HTML)
Link containing an anchor <b>"<a
href="http://myhomepage.com/index.html#01">http://myhomepage.com/index.html#01</a>"</b>.
</pre>
===[ CODE ]====================
mytext = """
Plain old link: http://www.mail.yahoo.com.
Containing numbers: ftp://bla.com/ding/co.rt,39,%93 or other
Go to news://bl_a.com/?ha-ha&query=tb for more info.
A real link: <a href="http://x.com">http://x.com</a>.
ftp://verylong.org/url/must/be/chopped/to/pieces/oritwontfit.html (long one)
<IMG src="http://b.com/image.gif" /> (a plain image tag)
<IMG src="http://images.com/image.gif" ALT="Nice image"/> (image tag
with alt text)
<a href=http://fixedlink.com/orginialinvalid.html>fixed</a> (original
invalid HTML)
Link containing an anchor <b>"http://myhomepage.com/index.html#01"</b>.
"""
import sgmllib, re

class LinkMaker(sgmllib.SGMLParser):
    """A parser which converts any stand-alone URL (meaning it is not
    inside a link, nor an attribute to a tag) to a proper link and
    can regenerate the modified HTML code.
    It also replaces IMG-tags with links to the images."""

    def __init__(self):
        sgmllib.SGMLParser.__init__(self)
        self.anchorlevel = 0 # pays attention to nested anchors
        self.elements = [] # stores the bits and pieces which will be joined
        # don't allow generation of links which are huge. If the MAXLENGTH
        # is exceeded by an URL, it will be chopped a bit in self.limitLength.
        self.MAXLENGTH = 40

    def unknown_starttag(self, tag, attrs):
        """Handle all start tags."""
        if tag == 'a':
            self.anchorlevel += 1 # store nested anchor depth
        if tag == 'img':
            # convert img-tag to link
            tag = 'a'
            # make a dictionary of attributes in order to convert those as well
            attribs = {"src": "", "alt": ""}
            for attrib in attrs:
                attribs[attrib[0]] = attrib[1].strip()
            # generate alt attribute from link if otherwise not available
            if not attribs["alt"]:
                attribs["alt"] = self.limitLength(attribs["src"])
            # only convert to link if a src attribute is present
            if attribs["src"]:
                self.elements.append('<a href="%s">%s (image)</a>' % \
                                     (attribs["src"], attribs["alt"]))
        else: # for non-img tags normal handling
            # build a string containing the attribs
            attribsstring = " ".join([ '%s=\"%s\"' % (attrib) for attrib in attrs ])
            if attribsstring:
                elem = " ".join([tag, attribsstring])
            else:
                elem = tag
            self.elements.append("<%s>" % elem)

    def unknown_endtag(self, tag):
        """Handle all end tags."""
        if tag == 'a':
            self.anchorlevel -= 1
            # don't allow anchorlevel <0 (invalid HTML in fact)
            self.anchorlevel = max(0, self.anchorlevel)
        # convert img-tag to link
        if tag == 'img':
            tag = 'a'
        self.elements.append("</%s>" % tag)

    def handle_entityref(self, ref):
        """Is called when an entity reference (&...) is found.
        These must pass through unmodified."""
        self.elements.append("&%s" % ref)

    def limitLength(self, text):
        """Returns a string with a maximum length of self.MAXLENGTH."""
        if len(text)>self.MAXLENGTH:
            # don't allow the text to become too large
            text = "%s...%s" % (text[:self.MAXLENGTH//2-2],
                                text[-(self.MAXLENGTH//2-2):])
        return text

    def makeLink(self, matchobj):
        """Function called whenever linkify matches. Takes the
        match object and returns a link."""
        url = matchobj.group() # this will be in the href
        # this will be the data of the tag (the visible link)
        text = self.limitLength(url)
        return '<a href="%s">%s</a>' % (url, text)

    def linkify(self, text):
        """Regex for finding URLs:
        URL's start with http(s)/ftp/news ((http)|(ftp)|(news))
        followed by ://
        then any number of non-whitespace characters including
        numbers, dots, forward slashes, commas, question marks,
        ampersands, equality signs, dashes, underscores and plusses,
        but ending in a non-dot!
        ([@a-zA-Z0-9,/%:&#\?=\-_]+\.*)+[a-zA-Z0-9,/%:\&#\?=\-_]
        Result:
        (?:http|https|ftp|news)://(?:[@a-zA-Z0-9,/%:\&#\?=\-_]+\.*)+[a-zA-Z0-9,/%:\&#\?=\-_]
        Tests:
        Plain old link: http://www.mail.yahoo.com.
        Containing numbers: ftp://bla.com/ding/co.rt,39,%93 or other
        Go to news://bl_a.com/?ha-ha&query=tb for more info.
        A real link: <a href="http://x.com">http://x.com</a>.
        ftp://verylong.org/url/must/be/chopped/to/pieces/oritwontfit.html (long one)
        <IMG src="http://b.com/image.gif" /> (a plain image tag)
        <a href=http://fixedlink.com/orginialinvalid.html>fixed</a> (original invalid HTML)
        Link containing an anchor <b>"http://myhomepage.com/index.html#01"</b>.
        """
        expression = (r"(?:http|https|ftp|news)://"
                      r"(?:[a-zA-Z0-9,@/%:\&#\?=\-_]+\.*)+[a-zA-Z0-9,/%:\&#\?\=\-_]")
        linkparser = re.compile(expression, re.I)
        text = linkparser.sub(self.makeLink, text)
        return text

    def handle_data(self, data):
        """Handle data between tags. If not inside a link
        (anchorlevel==0), then make sure any URLs are
        converted to links."""
        if self.anchorlevel == 0:
            data = self.linkify(data)
        self.elements.append(data)

    def getResult(self):
        """Returns the updated HTML. This just consists of
        joining the elements."""
        return "".join(self.elements)

parser = LinkMaker()
parser.feed(mytext)
print parser.getResult()
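
The regex and the chopping rule can also be tried on their own, without the parser. The following is a small sketch: limit_length and linkify are stand-alone rewrites of the limitLength and linkify methods, written for illustration only, and the pattern is the posted one with the redundant backslashes removed.

```python
import re

# Same character classes as the posted expression; "&", "?" and "=" need
# no backslash inside a character class.
URL_RE = re.compile(
    r"(?:http|https|ftp|news)://"
    r"(?:[a-zA-Z0-9,@/%:&#?=\-_]+\.*)+[a-zA-Z0-9,/%:&#?=\-_]",
    re.I)

MAXLENGTH = 40


def limit_length(text, maxlength=MAXLENGTH):
    """Keep the head and the tail of an over-long string, joined by '...'."""
    if len(text) > maxlength:
        keep = maxlength // 2 - 2   # 18 characters on each side for 40
        text = "%s...%s" % (text[:keep], text[-keep:])
    return text


def linkify(text):
    """Replace every URL found in text by an anchor tag."""
    def make_link(match):
        url = match.group()
        return '<a href="%s">%s</a>' % (url, limit_length(url))
    return URL_RE.sub(make_link, text)


# The trailing dot must stay outside the link.
short = linkify("Plain old link: http://www.mail.yahoo.com.")
# A URL longer than MAXLENGTH gets chopped in the middle.
chopped = limit_length(
    "ftp://verylong.org/url/must/be/chopped/to/pieces/oritwontfit.html")
print(short)
print(chopped)
```

Both printed values should agree with the corresponding lines in the MODIFIED part of the example results.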