web page text extractor

Andre Engels andreengels at gmail.com
Thu Jul 12 10:24:23 EDT 2007


2007/7/12, Andre Engels <andreengels at gmail.com>:

I forgot to include the imports. This line belongs above the function:

import urllib2, re

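> def textonly(url):
>    # Get the HTML source at url and return only the text, without tags
>    f = urllib2.urlopen(url)
>    text = f.read()
>    f.close()
>    # Matches a single tag that has no angle brackets inside it
>    r = re.compile(r'<[^<>]*>')
>    newtext = r.sub('',text)
>    # Repeat until nothing changes, to catch leftovers like '<a<b>>'
>    while newtext != text:
>       text = newtext
>       newtext = r.sub('',text)
>    return text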
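
For anyone without urllib2 (it is not available on Python 3), here is a rough sketch of the same idea using urllib.request. The split into a separate strip_tags helper, the TAG name, and the encoding parameter with its utf-8 default are my own assumptions for illustration, not part of the function above.

```python
import re
import urllib.request

# Matches one tag that contains no nested angle brackets.
TAG = re.compile(r'<[^<>]*>')

def strip_tags(html):
    # Remove tags repeatedly until nothing changes, so that
    # leftovers such as '<a<b>>' are cleaned up as well.
    newtext = TAG.sub('', html)
    while newtext != html:
        html = newtext
        newtext = TAG.sub('', html)
    return html

def textonly(url, encoding='utf-8'):
    # Fetch the page and return its text with the tags removed.
    # The encoding default is an assumption; real pages may differ.
    with urllib.request.urlopen(url) as f:
        html = f.read().decode(encoding, errors='replace')
    return strip_tags(html)
```

Keeping strip_tags separate means it can be tried on a plain string without fetching anything. Note that this only removes the tags themselves, so the contents of script and style elements stay in the result.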


-- 
Andre Engels, andreengels at gmail.com
ICQ: 6260644  --  Skype: a_engels


