[Tutor] Re trouble

tpc at csua.berkeley.edu
Mon Oct 27 15:33:27 EST 2003


Regarding extracting URLs from HTML documents with Python regular
expressions: this question has been asked many times on this list, and the
consensus is that you want to use HTMLParser instead. re doesn't keep state
(it cannot track which tag or attribute it is currently inside), so it is
not the right tool for this task.
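
As a rough illustration of the HTMLParser approach, here is a minimal
sketch. It assumes a current Python, where the module lives at html.parser
(it was named HTMLParser at the time of this thread). The names
LinkCollector, extract_addresses and the marker parameter are made up for
this example, and the rule "keep whatever follows http_3A/ in each href" is
taken from the question quoted below, not from any library.

```python
from html.parser import HTMLParser  # named HTMLParser in Python 2


class LinkCollector(HTMLParser):
    """Collect the part of each <a href> value that follows a marker."""

    def __init__(self, marker="http_3A/"):
        super().__init__()
        self.marker = marker      # hypothetical parameter for this sketch
        self.addresses = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs with quotes already removed
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value and self.marker in value:
                # keep everything after the first occurrence of the marker
                self.addresses.append(value.split(self.marker, 1)[1])


def extract_addresses(html_text):
    """Return the addresses found in html_text, in document order."""
    parser = LinkCollector()
    parser.feed(html_text)
    parser.close()
    return parser.addresses


sample = (
    '<a href="../../../../../../get.liste.kvakk.no/fs/http_3A/'
    'www.db.no/smurf/default.htm"><b>Dagbladet AS</b></a> &#91;<a '
    'href="../../../../../../get.liste.kvakk.no/is/http_3A/'
    'testside.no/smurf/default.htm">'
    '<font color="#CC3300"><b>Vis side</b></font></a>'
)
print(extract_addresses(sample))
# ['www.db.no/smurf/default.htm', 'testside.no/smurf/default.htm']
```

The parser hands over each href value already unquoted, so the "ends before
the first quote" part of the problem is handled without any pattern
matching; only the split on the marker is left.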

You can search the Tutor list archives here:

http://mail.python.org/pipermail/tutor/

On Mon, 27 Oct 2003, Øyvind Dale Spørck wrote:

> Hello,
>
>    I am using the re module to filter some web addresses out of HTML
> documents, but cannot seem to get it right. What should go in the
> parentheses of the re.search?
>
> Here is an example from the html:
>
> <a
> href="../../../../../../get.liste.kvakk.no/fs/http_3A/www.db.no/smurf/default.htm"><b>Dagbladet AS</b></a> &#91;<a
> href="../../../../../../get.liste.kvakk.no/is/http_3A/testside.no/smurf/default.htm"><font color="#CC3300"><b>Vis side</b></font></a>
>
> In other words, I would like to get a list of these addresses:
> www.db.no/smurf/default.htm
> testside.no/smurf/default.htm
>
> These addresses can be anything. I guess the common denominator is that they
> start after http_3A/ and end before the first ".
>
> How would I write that so that re picks out the right stuff?
>
> Thanks in advance,
> Øyvind
>




