[Tutor] printing the links of a page (regular expressions)

Kent Johnson kent37 at tds.net
Sat May 6 14:25:17 CEST 2006


Alfonso wrote:
> I'm writing a script to retrieve and print some links of a page. These 
> links begin with "/dog/", so I use a regular expression to try to find 
> them. The problem is that the script only retrieves one link per line in 
> the page. I mean, if the line has several links, the script only reports 
> the first. I can't find where the mistake is. Does anyone have an idea 
> what I have done wrong? 

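You are reading the data line by line using readlines(), and you search 
each line only once. regex.search() returns just the first match in the 
string it is given, so any further links on the same line are skipped. 
regex.findall() or regex.finditer() would be a better choice than 
regex.search().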
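
As a rough sketch of the difference, here is the same pattern applied to a 
single line that holds two links. The sample line and the variable names are 
made up for illustration, and the code is written for Python 3, not the 
Python 2 of the original script. Note that the pattern contains two groups, 
so findall() returns a tuple per match instead of a plain string.

```python
import re

# Same pattern as in the original script, written as a raw string.
regex = re.compile(r"""((/dog/)[^ "'<>;:,]+)""", re.I)

# Hypothetical sample line containing two matching links.
line = ('<a href="/dog/beagle.html">Beagle</a> '
        '<a href="/dog/collie.html">Collie</a>')

# search() stops at the first match in the string.
first = regex.search(line).group()

# finditer() yields a match object for every non-overlapping match.
all_links = [m.group() for m in regex.finditer(line)]

# findall() returns one tuple of groups per match because the pattern
# has two groups; the first group is the whole link.
all_first_groups = [groups[0] for groups in regex.findall(line)]

print(first)
print(all_links)
```

Applying findall() or finditer() to the whole page at once (the result of 
read() instead of readlines()) should work as well, since the pattern does 
not depend on line boundaries.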

You might also be interested in sgmllib-based solutions to this problem, 
which will generally be more robust than regex-based searching. For 
example, see
http://diveintopython.org/html_processing/extracting_data.html
http://www.w3journal.com/6/s3.vanrossum.html#MARKER-9-26
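
For a feel of the parser-based approach, here is a minimal sketch. It is an 
assumption-laden stand-in, not the code from the pages above: sgmllib is not 
part of the Python 3 standard library, so this uses html.parser.HTMLParser 
instead. The class name DogLinkParser, the prefix argument, and the sample 
HTML are hypothetical.

```python
from html.parser import HTMLParser


class DogLinkParser(HTMLParser):
    """Collect href values of <a> tags that start with a given prefix."""

    def __init__(self, prefix="/dog/"):
        super().__init__()
        self.prefix = prefix
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value and value.lower().startswith(self.prefix):
                self.links.append(value)


# Hypothetical sample markup with several links on one line.
sample = ('<p><a href="/dog/beagle.html">Beagle</a>, '
          '<a href="/cat/siamese.html">Siamese</a>, '
          '<a href="/dog/collie.html">Collie</a></p>')

parser = DogLinkParser()
parser.feed(sample)
parser.close()
dog_links = parser.links
print(dog_links)
```

Unlike the regex, this only looks at href attributes of anchor tags, so text 
that merely contains "/dog/" elsewhere in the page is ignored.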

Kent

> 
> Thank you very much for your help.
> 
> 
> import re
> from urllib import urlopen
> 
> fileObj = urlopen("http://name_of_the_page")
> links = []
> regex = re.compile ( "((/dog/)[^ \"\'<>;:,]+)",re.I)
> 
> for a in fileObj.readlines():
>         result = regex.search(a)
>         if result:
>                 print result.group()