[Tutor] Regex for a variable within a string

Kent Johnson kent37 at tds.net
Wed Apr 8 12:50:39 CEST 2009


On Wed, Apr 8, 2009 at 6:28 AM, David Cash <cashbang at googlemail.com> wrote:
> Hi, I'm new to python and have decided to develop a web crawler / file
> downloader as my first application. I am at the stage where the script
> requests a page and parses the page for URLs, then prints them out. However,
> I'd like to change my current regex that greps for 'http' to one that will
> grep for the url variable that is used in the connect string.
>
> I was hoping I could use something like p=re.compile((url).*?'"') but this
> is clearly not the right syntax. Apologies for such a newbie question!

We live for newbie questions :-)

I'm not too sure what you want to do. In general, if you have a string
in a variable and you want to include that string in a regex, you
should build a new string, then compile that. In your case, the way to
do what you asked for is
  p = re.compile(url + ".*?")

But I don't think this will do anything useful, for a couple of
reasons. The trailing .*? is non-greedy and nothing follows it, so it
matches the empty string and the regex finds only the exact URL
itself. Characters such as . and ? in the URL are also regex
metacharacters, so pass the URL through re.escape() if you want them
matched literally. And if your URL has a path component - for example
http://some.domain.com/index.html - then the regex will not find other
URLs in the domain, such as
http://some.domain.com/good/stuff/index.html.
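
If you do want to stay with a regex, here is a rough sketch of one way to
build it from a variable. It assumes the links appear inside double-quoted
attributes and that url holds the site root rather than a single page; the
sample html string and the other.example.org link are made up for
illustration.

```python
import re

# Hypothetical values for illustration; adjust to your own crawler.
url = "http://some.domain.com/"
html = ('<a href="http://some.domain.com/index.html">home</a> '
        '<a href="http://some.domain.com/good/stuff/index.html">stuff</a> '
        '<a href="http://other.example.org/">elsewhere</a>')

# re.escape() makes the dots in the URL match literally.
# [^"]* then consumes everything up to (but not including) the closing quote.
pattern = re.compile(re.escape(url) + r'[^"]*')

links = pattern.findall(html)
print(links)
```

Matching on the site root instead of a full page URL is what lets this pick
up other paths in the same domain.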

You should look at BeautifulSoup. It is an add-on module that parses
HTML and makes it easy to extract links. You also might be interested
in the urlparse module, which has functions that break up a URL into
components.

http://personalpages.tds.net/~kent37/kk/00009.html  # Intro to BeautifulSoup
http://docs.python.org/library/urlparse.html
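
As a rough sketch of the parse-then-filter idea, the example below uses only
the standard library. It is not BeautifulSoup: html.parser stands in for it
so the example runs without an add-on module, and in Python 3 the urlparse
functions live in urllib.parse. The LinkCollector class, page_url and the
sample markup are all made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit  # the urlparse module in Python 2


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


# Hypothetical page URL and markup, for illustration only.
page_url = "http://some.domain.com/index.html"
html = ('<a href="good/stuff/index.html">stuff</a> '
        '<a href="http://other.example.org/">elsewhere</a>')

collector = LinkCollector()
collector.feed(html)

# Resolve relative links against the page URL, then keep only the same host.
host = urlsplit(page_url).netloc
absolute = [urljoin(page_url, link) for link in collector.links]
same_host = [link for link in absolute if urlsplit(link).netloc == host]
print(same_host)
```

Comparing the netloc component avoids the path problem described earlier: a
link anywhere under the same host is kept, whatever its path.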

Kent

