web page text extractor
Andre Engels
andreengels at gmail.com
Thu Jul 12 10:24:23 EDT 2007
2007/7/12, Andre Engels <andreengels at gmail.com>:
I forgot to include this line at the top:
import urllib2, re
> def textonly(url):
>     # Get the HTML source on url and give only the main text
>     f = urllib2.urlopen(url)
>     text = f.read()
>     r = re.compile('\<[^\<\>]*\>')
>     newtext = r.sub('',text)
>     while newtext != text:
>         text = newtext
>         newtext = r.sub('',text)
>     return text
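
For anyone on Python 3, where urllib2 no longer exists, here is a rough sketch of the same idea. It is not a tested drop-in replacement: the helper name strip_tags, the encoding parameter and its 'utf-8' default are my own assumptions, and urllib.request.urlopen stands in for urllib2.urlopen. The repeated substitution is kept because one pass can leave a new tag behind when angle brackets are nested, e.g. '<<b>i>' becomes '<i>' after the first pass.

```python
import re
import urllib.request

# Matches one tag: '<', then anything except angle brackets, then '>'.
TAG = re.compile(r'<[^<>]*>')

def strip_tags(html):
    # Hypothetical helper: remove tags repeatedly until nothing changes.
    while True:
        stripped = TAG.sub('', html)
        if stripped == html:
            return stripped
        html = stripped

def textonly(url, encoding='utf-8'):
    # Fetch the page at url and return its text with tags removed.
    # The encoding default is an assumption; real pages may differ.
    with urllib.request.urlopen(url) as f:
        return strip_tags(f.read().decode(encoding, errors='replace'))
```

Like the original, this only deletes the tags themselves, so the contents of script and style elements and HTML entities such as &amp; are left in the result.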
--
Andre Engels, andreengels at gmail.com
ICQ: 6260644 -- Skype: a_engels