[Tutor] Re: Regex

Tue Aug 26 21:38:14 EDT 2003

Perhaps I can help.

I am writing a wiki. Something very much like this comes up in converting the 
page into HTML for the recipient's browser. If one types
http://www.tinylist.org
into the page, it should be converted into a link, so that
<a href="http://www.tinylist.org">http://www.tinylist.org</a>
is the result it renders.
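
A minimal sketch of that conversion in modern Python follows. This is not the
wikinehesa code; the names URL_RE and linkify and the URL pattern itself are
assumptions made up for illustration, and the pattern is deliberately crude
(it would also swallow trailing punctuation such as a full stop).

```python
import re

# Hypothetical pattern: "http://" or "https://" followed by any run of
# characters that are not whitespace, angle brackets, or double quotes.
URL_RE = re.compile(r'https?://[^\s<>"]+')


def linkify(text):
    """Wrap every URL found in text in an <a href="..."> element."""
    def to_link(match):
        url = match.group(0)
        return '<a href="{0}">{0}</a>'.format(url)
    return URL_RE.sub(to_link, text)


print(linkify("http://www.tinylist.org"))
```

Note that this version converts every URL it sees, including ones already
inside a tag, which is exactly the problem the rest of this message is about.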

But if it is parsing the page, and it finds something that appears to be an 
operational link already, it should leave it alone. This is because a different 
function turns some code into an image tag with a src declaration pointing at 
the host's website, and this must not be broken.

The critical difference is one simple character: the " at the start of the address.
<img src="http://www.tinylist.org/images/wikinehesalogo2.gif">
http://www.tinylist.org/images/wikinehesalogo2.gif
The second is converted into a link. The first is disabled, by
turning the < and > into &lt; and &gt; respectively.
&lt;img src="http://www.tinylist.org/images/wikinehesalogo2.gif"&gt;
will not operate when examined by the browser. Moreover, the link constructor 
will not turn the address into a hotlink, because of the leading '"'.
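
The two steps described above can be sketched roughly as follows, in modern
Python. Again, this is an illustration and not the wikinehesa source: the
function names and the URL pattern are assumptions. The leading-quote test is
written as a fixed-width negative lookbehind, (?<!"), which the re module
supports.

```python
import re

# Hypothetical pattern: a URL that is NOT immediately preceded by a
# double quote. The lookbehind is one character wide, so it is legal.
BARE_URL_RE = re.compile(r'(?<!")https?://[^\s<>"]+')


def disable_tags(text):
    """Turn < and > into &lt; and &gt; so typed-in tags do not operate."""
    return text.replace("<", "&lt;").replace(">", "&gt;")


def link_bare_urls(text):
    """Turn URLs into hotlinks unless they start right after a quote."""
    return BARE_URL_RE.sub(
        lambda m: '<a href="{0}">{0}</a>'.format(m.group(0)), text)


def render_line(text):
    """Disable raw tags first, then link whatever bare URLs remain."""
    return link_bare_urls(disable_tags(text))


print(render_line('<img src="http://www.tinylist.org/images/wikinehesalogo2.gif">'))
print(render_line('http://www.tinylist.org/images/wikinehesalogo2.gif'))
```

The first line comes out with its angle brackets escaped and its address
untouched; the second comes out as a link. A URL written directly against an
escaped bracket would pick up the "&gt;" text, so this sketch only covers the
simple cases shown here.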

This is the source code of the program. Please feel free to steal as needed.
http://www.tinylist.org/wikinehesa.txt

To witness it in action, click this:
http://www.tinylist.org/cgi-bin/wikinehesa.py
Notice there is an image displayed in the body of the page, as well as a hotlink 
back to the main website. To examine the page's wikicode source, just click the 
EDIT THIS PAGE button.

I hope this is of some help.

Andrei wrote:

> Thanks, it *almost* helps, but I'm not trying to harvest the links. The 
> issue is that I do *not* want to get URLs if they're in between <a> 
> tags, nor if they're an attribute to some tag (img, a, link, whatever).
> 
> Perhaps I should have explained my goal more clearly: I wish to take a 
> piece of text which may or may not contain HTML tags and turn any piece 
> of text which is NOT a link, but is an URL into a link. E.g.:
> 
>   go to <a href="http://home.com">http://home.com</a>. [1]
>   go <a href="http://home.com">home</a>. [2]
> 
> should remain unmodified, but
> 
>   go to http://home.com [3]
> 
> should be turned into [1]. That negative lookbehind can do the job in 
> the large majority of the cases (by not matching URLs if they're 
> preceded by single or double quotes or by ">"), but not always since it 
> doesn't allow the lookbehind to be non-fixed length. I think one of the 
> parser modules might be able to help (?) but regardless of how much I 
> try, I can't get the hang of them, while I do somewhat understand regexes.
> 
> Andrei
> 
> 
> 
> _______________________________________________
> Tutor maillist  -  Tutor at python.org
> http://mail.python.org/mailman/listinfo/tutor
> 
> 
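
On the parser-module idea in the quoted message: one possible approach, given
here only as a sketch under assumptions, is to let the standard library's
html.parser walk the text, copy every tag through untouched, and run the URL
regex only on text that sits outside an <a> element. The class and function
names and the URL pattern are made up for illustration. The sketch does not
preserve doctype declarations and would escape the contents of script or
style blocks, so it is not meant for full pages.

```python
import re
from html import escape
from html.parser import HTMLParser

# Hypothetical URL pattern, as crude as a regex-only version would be.
URL_RE = re.compile(r'https?://[^\s<>"]+')


class Linkifier(HTMLParser):
    """Rebuild the input, linking URLs only in text outside <a> elements."""

    def __init__(self):
        super().__init__()      # character references arrive decoded
        self.out = []
        self.anchor_depth = 0   # greater than 0 while inside <a>...</a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.anchor_depth += 1
        # Copy the tag exactly as written, attributes and all.
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "a" and self.anchor_depth:
            self.anchor_depth -= 1
        self.out.append("</%s>" % tag)

    def handle_comment(self, data):
        self.out.append("<!--%s-->" % data)

    def handle_data(self, data):
        # Re-escape, because the parser has already decoded entities.
        text = escape(data, quote=False)
        if not self.anchor_depth:
            text = URL_RE.sub(
                lambda m: '<a href="{0}">{0}</a>'.format(m.group(0)), text)
        self.out.append(text)


def linkify_html(text):
    parser = Linkifier()
    parser.feed(text)
    parser.close()              # flush any text still buffered
    return "".join(parser.out)


for sample in ('go to <a href="http://home.com">http://home.com</a>.',
               'go <a href="http://home.com">home</a>.',
               'go to http://home.com'):
    print(linkify_html(sample))
```

URLs inside attributes are never seen by the regex, because the parser hands
over each tag as a whole, and text inside an existing link is skipped by the
depth counter, so no variable-length lookbehind is needed.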

-- 

end

Cheers!
         Kirk D Bailey

  +                              think                                +
   http://www.howlermonkey.net  +-----+        http://www.tinylist.org
   http://www.listville.net     | BOX |  http://www.sacredelectron.org
  "Thou art free"-ERIS          +-----+     'Got a light?'-Promethieus
  +                              think                                +

Fnord.