Pulling out <TITLE></TITLE>

Bengt Richter bokr at accessone.com
Wed Nov 21 04:13:05 EST 2001


On Sun, 18 Nov 2001 20:45:44 -0800, Brett Cannon <bac at OCF.Berkeley.EDU> wrote:

>You could just read each page and use a regex to fetch it:
>
>import re
># page is assumed to hold the page's HTML as a string
>title_value=re.search(r'<title>(?P<title>.*?)</title>',page,re.I)
>title_value.group('title')
>
Hm. What happens with the following page?

 <HTML><HEAD>
 <!-- (old title kept for reference, or possible restoring) 
 <TITLE>This is the old title</TITLE>
 -->
 <TITLE>Official new title</TITLE>
 </HEAD><Body>...whatever...</BODY></HTML>
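
To make the concern concrete, here is a minimal sketch, assuming Python 3 and its standard html.parser module (htmllib is not available in Python 3). The names PAGE, naive_title, TitleParser and parsed_title are made up for illustration and appear nowhere in the thread. The regex search takes the first textual match, which here sits inside the comment, while a parser that treats comments as comments should skip it.

```python
import re
from html.parser import HTMLParser

# The sample page from above, as one string.
PAGE = """<HTML><HEAD>
<!-- (old title kept for reference, or possible restoring)
<TITLE>This is the old title</TITLE>
-->
<TITLE>Official new title</TITLE>
</HEAD><Body>...whatever...</BODY></HTML>"""

TITLE_RE = re.compile(r'<title>(?P<title>.*?)</title>', re.I | re.S)


def naive_title(page):
    """Return the first regex match, whether or not it is inside a comment."""
    match = TITLE_RE.search(page)
    return match.group('title') if match else None


class TitleParser(HTMLParser):
    """Collect the text of the first <title> element outside comments."""

    def __init__(self):
        super().__init__()
        self.in_title = False   # currently between <title> and </title>
        self.done = False       # a complete title has already been seen
        self.parts = []         # text chunks of the title

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased, so <TITLE> matches too.
        if tag == 'title' and not self.done:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == 'title' and self.in_title:
            self.in_title = False
            self.done = True

    def handle_data(self, data):
        if self.in_title:
            self.parts.append(data)


def parsed_title(page):
    """Return the title text found by the parser, or None if there is none."""
    parser = TitleParser()
    parser.feed(page)
    parser.close()
    return ''.join(parser.parts).strip() or None


if __name__ == '__main__':
    print(naive_title(PAGE))
    print(parsed_title(PAGE))
```

For 30,000 pages the parser will likely be slower than a single regex search, so an alternative under the same assumptions would be to strip comments first and then apply the regex. Badly malformed pages may still need special handling either way.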

>On Sun, 18 Nov 2001, David A McInnis wrote:
>
>> I am writing a script to catalog about 30,000 html pages on my site and need
>> to pull out the value of <TITLE></TITLE>.
>>
>> I guess this is possible with htmllib, but I cannot figure it out.
>>
>> Thanks,
>> David
>>
>>
>>
>



