Extract Title from HTML documents

Mike Meyer mwm at mired.org
Fri Nov 5 03:29:22 EST 2004


Max M <maxm at mxm.dk> writes:

> Nickolay Kolev wrote:
>> Hi all,
>> I am looking for a way to extract the titles of HTML documents. I
>> have made an honest attempt at doing it, and it even works. Is there
>> an easier (faster / more efficient / clearer) way?
>
> You only need one tag here, so using a regex is ok.
>
> linkPattern = re.compile('((<title.*?>(.*?)</body>))', re.I|re.S)
                                               ^^^^
Shouldn't that be </title>?

          <mike

> match = linkPattern.search(source)
> if match is None:
>      result = ''
> else:
>      result = match.group(3)
>
> If you need more than just the title I would definitely go with
> BeautifulSoup.
>
> -- 
>
> hilsen/regards Max M, Denmark
>
> http://www.mxm.dk/
> IT's Mad Science
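
For what it's worth, with the closing tag corrected, the regex approach
could be sketched roughly as below. The names TITLE_PATTERN and
extract_title are made up for illustration, and collapsing whitespace in
the result is my assumption about what a caller would want; neither is
from the quoted post.

```python
import re

# Non-greedy match of the text between <title ...> and </title>.
# IGNORECASE accepts <TITLE>; DOTALL lets the title span line breaks.
TITLE_PATTERN = re.compile(r'<title\b[^>]*>(.*?)</title>',
                           re.IGNORECASE | re.DOTALL)


def extract_title(source):
    """Return the text of the first <title> element, or '' if none is found."""
    match = TITLE_PATTERN.search(source)
    if match is None:
        return ''
    # Assumption: collapse runs of whitespace so a multi-line title
    # comes back as a single line.
    return ' '.join(match.group(1).split())
```

Calling extract_title(source) on the page text returns just the title
string, and an empty string when the document has no <title> element, so
the caller never has to touch the match object.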

-- 
Mike Meyer <mwm at mired.org>			http://www.mired.org/home/mwm/
Independent WWW/Perforce/FreeBSD/Unix consultant, email for more information.


