[Mailman-Users] Pipermail URL handling in archives

Sat Feb 23 01:37:24 CET 2008

Jim Popovitch wrote:

>On Fri, Feb 22, 2008 at 4:03 PM, Mark Sapiro <mark at msapiro.net> wrote:
>>  You could try to find the line
>>
>>  urlpat = re.compile(r'(\w+://[^>)\s]+)') # URLs in text
>>
>>  near the beginning of Mailman/Archiver/HyperArch.py and change it to
>>
>>  urlpat = re.compile(r'(\w+://[^>)\s]+?)\.?(\s|$)') # URLs in text
>
>Mark, that works well for the case I described.  I did find something
>else similar that doesn't work:
>
>     this is another url http://www.yahoo.com, and so is this
>http://www.google.com.
>
>Gets converted into:
>   this is another url <A
>HREF="http://www.yahoo.com,">http://www.yahoo.com,</A>
>            and so is this <A
>HREF="http://www.ibm.com">http://www.google.com</A>.

I assume that's a typo and 'ibm' should be 'google'.

>So, the problem seems to appear with commas too which makes me wonder
>if this can be resolved with this:
>
>   urlpat = re.compile(r'(\w+://[^>)\s]+?)(\.|,)?(\s|$)') # URLs in text
>
>but then I got to thinking about any other punctuation mark that
>might follow a URL... and my mind started spinning :-)

I think the suggestion above, (\.|,)?, would work for the comma, but
you could do it other ways, e.g. with a character class:

   urlpat = re.compile(r'(\w+://[^>)\s]+?)[.,;]?(\s|$)') # URLs in text

to handle '.', ',' and ';', and you could extend that with more
characters, but you really need to be careful: any character you add is
stripped even when it is actually part of the URL. Consider, for example,
<http://www.example.com/some/page#anchor.> which could be a valid URL
ending in '.'.
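
To see how the patterns in this thread differ, here is a small
stand-alone sketch that runs each one over the sample text and prints
what ends up in group(1), the part that would become the link. It only
exercises the regular expressions and does not touch HyperArch.py; the
helper find_urls and the constant names are made up for illustration,
and the 'caveat' string is an assumed example of a URL followed by a
period and more text.

```python
import re

# The three patterns discussed in this thread. Only group(1), the URL
# itself, is examined here.
ORIGINAL = re.compile(r'(\w+://[^>)\s]+)')
PERIOD_ONLY = re.compile(r'(\w+://[^>)\s]+?)\.?(\s|$)')
PUNCT_CLASS = re.compile(r'(\w+://[^>)\s]+?)[.,;]?(\s|$)')


def find_urls(pattern, text):
    """Return group(1) of every match of pattern in text (hypothetical helper)."""
    return [m.group(1) for m in pattern.finditer(text)]


# The sample text from the report above, joined onto one line.
sample = ('this is another url http://www.yahoo.com, and so is this '
          'http://www.google.com.')

# Assumed example: a URL followed by '.', whitespace and more text.
caveat = 'see http://www.example.com/some/page#anchor. for details'

for name, pat in (('original', ORIGINAL),
                  ('period only', PERIOD_ONLY),
                  ('[.,;] class', PUNCT_CLASS)):
    print('%-12s %s' % (name, find_urls(pat, sample)))

# The trailing '.' is dropped here whether or not it belongs to the URL.
print(find_urls(PUNCT_CLASS, caveat))
```

With the sample text, the original pattern keeps both the trailing ','
and the trailing '.', the period-only pattern still keeps the ',', and
the [.,;] version drops both. The last line shows the caveat: the
pattern cannot tell a sentence-ending '.' from one that is part of the
URL, so it is removed either way.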

-- 
Mark Sapiro <mark at msapiro.net>        The highway is for gamblers,
San Francisco Bay Area, California    better use your sense - B. Dylan