[Tutor] printing the links of a page (regular expressions)
Kent Johnson
kent37 at tds.net
Sat May 6 14:25:17 CEST 2006
Alfonso wrote:
> I'm writing a script to retrieve and print some of the links on a page.
> These links begin with "/dog/", so I use a regular expression to try to
> find them. The problem is that the script only retrieves one link per
> line of the page. I mean, if a line has several links, the script only
> reports the first. I can't find the mistake. Does anyone have an idea
> what I have done wrong?
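You are reading the data line by line using readlines(), and you search
each line only once: regex.search() returns just the first match it finds.
regex.findall() or regex.finditer() would be a better choice than
regex.search(), since they return every non-overlapping match in the line.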
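
As a rough sketch of the difference, assuming a made-up sample line in place of the real page (the sample string and the find_dog_links name are only for illustration): note that findall() returns tuples when the pattern contains groups, as the original pattern does, so this version either drops the parentheses or uses finditer() with group().

```python
import re

# The pattern from the script, without the capturing groups, so that
# findall() returns plain strings instead of tuples.
regex = re.compile("/dog/[^ \"'<>;:,]+", re.I)

# The original pattern, groups included, for comparison.
grouped = re.compile("((/dog/)[^ \"'<>;:,]+)", re.I)

# Hypothetical sample line with several links (not from the real page).
line = ('<a href="/dog/beagle">Beagle</a> '
        '<a href="/dog/poodle">Poodle</a> '
        '<a href="/cat/tabby">Tabby</a>')


def find_dog_links(text):
    """Return every non-overlapping match in text, not just the first."""
    return regex.findall(text)


# search() stops at the first match, which is the behaviour reported.
first_only = regex.search(line).group()

# findall() collects all of them.
links = find_dog_links(line)

# With the grouped pattern, finditer() plus group() gives the whole match.
links_via_finditer = [m.group() for m in grouped.finditer(line)]

for link in links:
    print(link)
```

Since findall() scans the whole string it is given, the same call should also work on the complete page text from read(), without looping over lines at all.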
You might also be interested in sgmllib-based solutions to this problem,
which will generally be more robust than regex-based searching. For
example, see
http://diveintopython.org/html_processing/extracting_data.html
http://www.w3journal.com/6/s3.vanrossum.html#MARKER-9-26
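
A minimal sketch of the parser-based approach follows. sgmllib was removed in Python 3, so this uses the standard library's html.parser.HTMLParser instead, which works in the same callback style; it is a stand-in, not the sgmllib code from the links above. The DogLinkParser class and the sample HTML are made up for illustration.

```python
from html.parser import HTMLParser


class DogLinkParser(HTMLParser):
    """Collect href values of <a> tags that start with a given prefix."""

    def __init__(self, prefix="/dog/"):
        super().__init__()
        self.prefix = prefix
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased; attrs is a list of
        # (name, value) pairs, and value may be None.
        if tag != "a":
            return
        for name, value in attrs:
            # Compare case-insensitively, mirroring re.I in the regex.
            if name == "href" and value and value.lower().startswith(self.prefix):
                self.links.append(value)


# Hypothetical sample markup (not from the real page).
sample = ('<p><a href="/dog/beagle">Beagle</a>, '
          '<a HREF="/Dog/poodle">Poodle</a>, '
          '<a href="/cat/tabby">Tabby</a></p>')

parser = DogLinkParser()
parser.feed(sample)
parser.close()
print(parser.links)
```

Unlike the regex, this only looks at href attributes of anchor tags, so a "/dog/" string in ordinary text or in a script would not be reported.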
Kent
>
> Thank you very much for your help.
>
>
> import re
> from urllib import urlopen
>
> fileObj = urlopen("http://name_of_the_page")
> links = []
> regex = re.compile("((/dog/)[^ \"\'<>;:,]+)", re.I)
>
> for a in fileObj.readlines():
>     result = regex.search(a)
>     if result:
>         print result.group()