Regex single quotes in scraper script?
Christopher T King
squirrel at WPI.EDU
Sat Jul 17 07:07:07 CEST 2004
On Fri, 16 Jul 2004, Rock wrote:
> Being a real newbie with this I think I found the area of code that parses
> the href. It is in a file called parsefns.py
> the full excerpt is listed below but here is the regex line that I believe
> is not dealing with single quote.
> m = re.search(r'href\s*=\s*"?([^>" ]+)["> ]', text, re.I)
> I have tried many different variations but no luck and no luck getting hold
> of the author. Any ideas? Thx.
Good job tracking that down. Methinks you'll want to change it to read
m = re.search(r'href\s*=\s*["\']?([^>"\' ]+)["\'> ]', text, re.I)
This will possibly break some sites, though (namely those that use single
quotes inside their URLs, which will get cut off at the quote, but those are
broken anyways). A proper fix would require a tad more work (i.e. either a
much, much messier regex or a change in the function), and it's really late
right now ;)
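
For the curious, here is a rough sketch of what the messier regex might look
like, next to the two patterns above. It is untested against the real
parsefns.py: the names OLD, QUICK, STRICT and find_href are made up for
illustration, and I'm assuming the function only needs the first href in the
text. The idea is to give each quoting style its own alternative, so a quoted
URL can only end at the same kind of quote that opened it.

```python
import re

# Pattern quoted from parsefns.py: only knows about double quotes.
OLD = re.compile(r'href\s*=\s*"?([^>" ]+)["> ]', re.I)

# The quick fix: treat single quotes like double quotes everywhere.
QUICK = re.compile(r'href\s*=\s*["\']?([^>"\' ]+)["\'> ]', re.I)

# Hypothetical stricter pattern (not from the original script): one
# alternative each for double-quoted, single-quoted, and unquoted values.
STRICT = re.compile(r'''href\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>"']+))''', re.I)


def find_href(text):
    """Return the first href value in text, or None if there is none."""
    m = STRICT.search(text)
    if m is None:
        return None
    # Only one of the three groups takes part in any given match.
    return next(g for g in m.groups() if g is not None)


if __name__ == "__main__":
    samples = [
        "<a href='page.html'>",       # single-quoted
        '<a href="o\'brien.html">',   # single quote inside a double-quoted URL
        "<a href=page.html>",         # unquoted
    ]
    for tag in samples:
        print(tag, OLD.search(tag).group(1),
              QUICK.search(tag).group(1), find_href(tag))
```

On the second sample the quick fix stops at the embedded single quote, which
is the breakage mentioned above; the stricter pattern keeps the whole URL.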