Hi,
I have a problem with some URLs being handled incorrectly. Specifically URLs that are at the end of a sentence followed by a period (full stop). Example:
To enroll visit this site: http://www.domain.tld/cgi-bin/enroll.pl.
Mailman/Pipermail converts that sentence like so:
To enroll visit this site: <A HREF="http://www.domain.tld/cgi-bin/enroll.pl.">http://www.domain.tld/cgi-bin/enroll.pl.</A>
The ending period (full stop) then invalidates the URL.
Is there any quick fix to 2.1.9 to resolve this?
Thanks,
-Jim P.
Jim Popovitch wrote:
I have a problem with some URLs being handled incorrectly. Specifically URLs that are at the end of a sentence followed by a period (full stop). Example:
To enroll visit this site: http://www.domain.tld/cgi-bin/enroll.pl.
Mailman/Pipermail converts that sentence like so:
To enroll visit this site: <A HREF="http://www.domain.tld/cgi-bin/enroll.pl.">http://www.domain.tld/cgi-bin/enroll.pl.</A>
The ending period (full stop) then invalidates the URL.
Is there any quick fix to 2.1.9 to resolve this?
You could try to find the line
urlpat = re.compile(r'(\w+://[^>)\s]+)') # URLs in text
near the beginning of Mailman/Archiver/HyperArch.py and change it to
urlpat = re.compile(r'(\w+://[^>)\s]+?)\.?(\s|$)') # URLs in text
Note this re is very lightly tested and may not work in all cases.
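A minimal sketch comparing the two patterns quoted above, run on the example sentence from the original report outside of Mailman. The variable names are illustrative only, and how HyperArch.py consumes the match is not shown in this thread.

```python
import re

# Original pattern as quoted above: everything up to whitespace, '>' or ')'.
old_urlpat = re.compile(r'(\w+://[^>)\s]+)')
# Suggested replacement: non-greedy URL, an optional trailing period,
# then whitespace or end of string.
new_urlpat = re.compile(r'(\w+://[^>)\s]+?)\.?(\s|$)')

sentence = "To enroll visit this site: http://www.domain.tld/cgi-bin/enroll.pl."

# group(1) is the part that would end up inside the anchor.
old_url = old_urlpat.search(sentence).group(1)
new_url = new_urlpat.search(sentence).group(1)

print(old_url)  # the sentence-ending period is part of the URL
print(new_url)  # the sentence-ending period is left out
```

With the new pattern the whole match, group(0), also spans the optional period and the following whitespace, so only group(1) holds the bare URL.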
Of course, if you can get the posters to surround their URLs with <>, there is no problem.
--
Mark Sapiro <mark@msapiro.net>        The highway is for gamblers,
San Francisco Bay Area, California    better use your sense - B. Dylan
On Fri, Feb 22, 2008 at 4:03 PM, Mark Sapiro <mark@msapiro.net> wrote:
You could try to find the line
urlpat = re.compile(r'(\w+://[^>)\s]+)') # URLs in text
near the beginning of Mailman/Archiver/HyperArch.py and change it to
urlpat = re.compile(r'(\w+://[^>)\s]+?)\.?(\s|$)') # URLs in text
Note this re is very lightly tested and may not work in all cases.
As always, Thank you Mark.
Of course, if you can get the posters to surround their URLs with <>, there is no problem.
:-) Herding cats would be easier.
Thanks again,
-Jim P.
On Fri, Feb 22, 2008 at 4:03 PM, Mark Sapiro <mark@msapiro.net> wrote:
You could try to find the line
urlpat = re.compile(r'(\w+://[^>)\s]+)') # URLs in text
near the beginning of Mailman/Archiver/HyperArch.py and change it to
urlpat = re.compile(r'(\w+://[^>)\s]+?)\.?(\s|$)') # URLs in text
Mark, that works well for the case I described. I did find something else similar that doesn't work:
this is another url http://www.yahoo.com, and so is this http://www.google.com.
Gets converted into: this is another url <A HREF="http://www.yahoo.com,">http://www.yahoo.com,</A> and so is this <A HREF="http://www.ibm.com">http://www.google.com</A>.
So, the problem seems to appear with commas too which makes me wonder if this can be resolved with this:
urlpat = re.compile(r'(\w+://[^>)\s]+?)(\.|,)?(\s|$)') # URLs in text
but then I got to thinking about any other punctuation mark that follows a URL... and my mind started spinning :-)
Any ideas, anyone?
-Jim P.
Jim Popovitch wrote:
On Fri, Feb 22, 2008 at 4:03 PM, Mark Sapiro <mark@msapiro.net> wrote:
You could try to find the line
urlpat = re.compile(r'(\w+://[^>)\s]+)') # URLs in text
near the beginning of Mailman/Archiver/HyperArch.py and change it to
urlpat = re.compile(r'(\w+://[^>)\s]+?)\.?(\s|$)') # URLs in text
Mark, that works well for the case I described. I did find something else similar that doesn't work:
this is another url http://www.yahoo.com, and so is this http://www.google.com.
Gets converted into: this is another url <A HREF="http://www.yahoo.com,">http://www.yahoo.com,</A> and so is this <A HREF="http://www.ibm.com">http://www.google.com</A>.
I assume that's a typo and 'ibm' should be 'google'.
So, the problem seems to appear with commas too which makes me wonder if this can be resolved with this:
urlpat = re.compile(r'(\w+://[^>)\s]+?)(\.|,)?(\s|$)') # URLs in text
but then I got to thinking about any other punctuation mark that follows a URL... and my mind started spinning :-)
I think the suggestion above - (\.|,)? would work for comma, but you could do it other ways - e.g.
urlpat = re.compile(r'(\w+://[^>)\s]+?)[.,;]?(\s|$)') # URLs in text
to handle '.', ',' and ';', and you could extend that with more characters, but you really need to be careful. Consider for example, <http://www.example.com/some/page#anchor.> which could be a valid URL ending in '.'.
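As a rough check of the '[.,;]' variant and of the caution above, the sketch below applies the pattern to three strings outside of Mailman. The helper find_urls and the sample strings are assumptions made for illustration, not part of HyperArch.py.

```python
import re

# The '[.,;]' variant suggested above.
urlpat = re.compile(r'(\w+://[^>)\s]+?)[.,;]?(\s|$)')

def find_urls(text):
    """Return group(1) of every match of urlpat in text."""
    return [m.group(1) for m in urlpat.finditer(text)]

# Comma and period after the URLs are both left out of group(1).
text = "this is another url http://www.yahoo.com, and so is this http://www.google.com."
print(find_urls(text))

# A URL that really ends in '.' loses that character as well.
bare = "See http://www.example.com/some/page#anchor. for details"
print(find_urls(bare))

# A URL immediately followed by '>' is followed by neither whitespace
# nor end of string, so this pattern finds no match in it at all.
bracketed = "<http://www.example.com/some/page#anchor.>"
print(find_urls(bracketed))
```

The last case only describes the regular expression in isolation; whether it matters for archived messages depends on code in HyperArch.py that is not quoted in this thread.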
--
Mark Sapiro <mark@msapiro.net>        The highway is for gamblers,
San Francisco Bay Area, California    better use your sense - B. Dylan
On Fri, Feb 22, 2008 at 7:37 PM, Mark Sapiro <mark@msapiro.net> wrote:
Gets converted into: this is another url <A HREF="http://www.yahoo.com,">http://www.yahoo.com,</A> and so is this <A HREF="http://www.ibm.com">http://www.google.com</A>.
I assume that's a typo and 'ibm' should be 'google'.
:-) Yep. I had used www.ibm.com and www.mbi.com in my test and changed them to G! and Y! for the email, but missed one reference.
So, the problem seems to appear with commas too which makes me wonder if this can be resolved with this:
urlpat = re.compile(r'(\w+://[^>)\s]+?)(\.|,)?(\s|$)') # URLs in text
but then I got to thinking about any other punctuation mark that follows a URL... and my mind started spinning :-)
I think the suggestion above - (\.|,)? would work for comma, but you could do it other ways - e.g.
urlpat = re.compile(r'(\w+://[^>)\s]+?)[.,;]?(\s|$)') # URLs in text
to handle '.', ',' and ';', and you could extend that with more characters, but you really need to be careful. Consider for example, <http://www.example.com/some/page#anchor.> which could be a valid URL ending in '.'.
Understood. I think the "[.,;]" would cover 99% of the possibilities of a URL in a sentence.
Thanks again!
-Jim P.
Jim Popovitch writes:
So, the problem seems to appear with commas too which makes me wonder if this can be resolved with this:
urlpat = re.compile(r'(\w+://[^>)\s]+?)(\.|,)?(\s|$)') # URLs in text
but then I got to thinking about any other punctuation mark that follows a URL... and my mind started spinning :-)
Any ideas, anyone?
Unfortunately sre doesn't support POSIX character classes (like [:punct:]) AFAIK, but I would say it's a good idea to make that
urlpat = re.compile(r'(\w+://[^>)\s]+?)[#,.;:\'"!?()]?(\s|$)') # URLs in text
for starters. It would be better to replace it with a real URL-matching regexp, though.
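One alternative that nobody in this thread proposed, sketched here only for comparison: keep the original greedy pattern and trim trailing punctuation from each match afterwards. The names TRAILING, split_url and linkify, and the choice of characters to trim, are assumptions for illustration. The sketch does no HTML escaping and is not a full URL-matching regexp.

```python
import re

# Original greedy pattern from HyperArch.py as quoted in this thread.
urlpat = re.compile(r'(\w+://[^>)\s]+)')

# Hypothetical set of characters treated as sentence punctuation when they
# end a match; this choice is an assumption, not something Mailman defines.
TRAILING = '.,;:!?\'"'

def split_url(candidate):
    """Split a greedy match into (url, trailing_punctuation)."""
    url = candidate.rstrip(TRAILING)
    return url, candidate[len(url):]

def linkify(text):
    """Wrap each URL in an anchor, leaving trailing punctuation outside it."""
    def repl(match):
        url, tail = split_url(match.group(1))
        return '<A HREF="%s">%s</A>%s' % (url, url, tail)
    return urlpat.sub(repl, text)

print(linkify("this is another url http://www.yahoo.com, "
              "and so is this http://www.google.com."))
print(linkify("(see http://www.example.com/page)."))
```

Trimming after the match copes with more than one trailing character, such as the ')' and '.' in the second call, where the '[.,;]' variant finds no match because ')' is followed by neither whitespace nor end of string. Like the other approaches, it still strips a final '.' from a URL that genuinely ends in one.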
Hi, Please pardon me for butting in, I will be the first to admit I am probably the least qualified to give an opinion here, but.... :-) It seems to me this is a DNS issue, no? Simply because the trailing character, be it a comma or period, more or less represents "root" as far as DNS is concerned. If you have a form front end it should not be that big a deal to just "trim" the period or comma from the URL. Or am I just totally out of touch here???
> Date: Fri, 22 Feb 2008 19:09:16 -0500
> From: yahoo@jimpop.com
> To: mailman-users@python.org
> Subject: Re: [Mailman-Users] Pipermail URL handling in archives
>
> On Fri, Feb 22, 2008 at 4:03 PM, Mark Sapiro <mark@msapiro.net> wrote:
> > You could try to find the line
> >
> > urlpat = re.compile(r'(\w+://[^>)\s]+)') # URLs in text
> >
> > near the beginning of Mailman/Archiver/HyperArch.py and change it to
> >
> > urlpat = re.compile(r'(\w+://[^>)\s]+?)\.?(\s|$)') # URLs in text
>
> Mark, that works well for the case I described. I did find something else similar that doesn't work:
>
> this is another url http://www.yahoo.com, and so is this http://www.google.com.
>
> Gets converted into:
> this is another url <A HREF="http://www.yahoo.com,">http://www.yahoo.com,</A> and so is this <A HREF="http://www.ibm.com">http://www.google.com</A>.
>
> So, the problem seems to appear with commas too which makes me wonder if this can be resolved with this:
>
> urlpat = re.compile(r'(\w+://[^>)\s]+?)(\.|,)?(\s|$)') # URLs in text
>
> but then I got to thinking about any other punctuation mark that follows a URL... and my mind started spinning :-)
>
> Any ideas, anyone?
>
> -Jim P.
On Fri, Feb 22, 2008 at 9:01 PM, Dov Oxenberg <boxenberg@hotmail.com> wrote:
Hi, Please pardon me for butting in, I will be the first to admit I am probably the least qualified to give an opinion here, but.... :-) It seems to me this is a DNS issue, no? Simply because the trailing character, be it a comma or period, more or less represents "root" as far as DNS is concerned. If you have a form front end it should not be that big a deal to just "trim" the period or comma from the URL. Or am I just totally out of touch here???
:-) Well, tell that to those who live in a point-n-click world. Yes, manually removing the comma or period (or any other sentence-ending punctuation character that immediately trails the URL) will work; however, that doesn't help search engines when Pipermail includes the trailing mark within the HTML anchors.
-Jim P.
participants (4): Dov Oxenberg, Jim Popovitch, Mark Sapiro, Stephen J. Turnbull