[Tutor] HTML --> TXT?
Corran Webster
cwebster@nevada.edu
Wed, 29 Mar 2000 11:10:42 -0800
At 11:47 AM -0500 29/3/00, Justin Sheehy wrote:
> "Curtis Larsen" <curtis.larsen@Covance.Com> writes:
>
> > Is there a fairly simple Python-ish way to convert an HTML file to text?
>
> Check out the htmllib and formatter modules. The HTMLParser and
> DumbWriter classes in those respective modules should do what you need.
In particular, the following should do the trick for a basic text-dump to
standard output:
----
from htmllib import HTMLParser
from formatter import AbstractFormatter, DumbWriter

source = open("myfile.html")
# With no arguments, DumbWriter writes plain text to standard output.
parser = HTMLParser(AbstractFormatter(DumbWriter()))
parser.feed(source.read())
parser.close()
source.close()
----
'source' can be any file-like object, such as the one returned by
urllib.urlopen. For example:
----
from htmllib import HTMLParser
from formatter import AbstractFormatter, DumbWriter
from urllib import urlopen

# urlopen returns a file-like object, so the rest is unchanged.
source = urlopen('http://www.yahoo.com/')
parser = HTMLParser(AbstractFormatter(DumbWriter()))
parser.feed(source.read())
parser.close()
source.close()
----
DumbWriter also takes two optional arguments: an output file to write to
instead of standard output, and a maximum line width (maxcol, 72 by
default) that controls where lines wrap.
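With the modules above, that would be something along the lines of
DumbWriter(outfile, 60). As a rough sketch of the same idea for current
Python 3 releases, where htmllib and formatter are no longer available,
the following uses html.parser and textwrap from the standard library
instead. The class name TextDumper and its 'out' and 'maxcol' parameters
are made up for this illustration and are not part of any library:

```python
import io
import sys
import textwrap
from html.parser import HTMLParser


class TextDumper(HTMLParser):
    """Hypothetical helper: collect text and write it, wrapped, to a file."""

    def __init__(self, out=None, maxcol=72):
        super().__init__()
        # Any object with a write() method will do; default to stdout.
        self.out = sys.stdout if out is None else out
        self.maxcol = maxcol          # maximum line width
        self.chunks = []

    def handle_data(self, data):
        # Called for the text found between tags.
        self.chunks.append(data)

    def close(self):
        super().close()
        # Collapse runs of whitespace, then re-wrap to the requested width.
        text = " ".join("".join(self.chunks).split())
        self.out.write(textwrap.fill(text, width=self.maxcol) + "\n")


out = io.StringIO()                   # stands in for open("myfile.txt", "w")
parser = TextDumper(out, maxcol=20)
parser.feed("<p>Some <b>bold</b> text that is long enough to wrap.</p>")
parser.close()
print(out.getvalue())
```

This is much cruder than AbstractFormatter (it knows nothing about
paragraphs or headings), but it shows where the output file and the wrap
width come into play.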
More sophisticated behaviour can be achieved by subclassing the Writer
and/or Formatter classes from the formatter module, or the HTMLParser class
(usually by overriding the start_<tag>, end_<tag> or do_<tag> methods for
specific tags, e.g. start_a and end_a for links). See the documentation for
details of the interfaces.
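To give a feel for the per-tag overriding approach, here is a minimal
sketch written for current Python 3, where htmllib is no longer available.
It is not htmllib's own implementation: html.parser only offers the generic
handle_starttag and handle_endtag hooks, so the hypothetical
TagDispatchParser class below forwards them to start_<tag> and end_<tag>
methods by hand. Both class names are invented for this example, and
passing the attributes as a dictionary is a simplification of my own:

```python
from html.parser import HTMLParser


class TagDispatchParser(HTMLParser):
    """Hypothetical base class: forwards tags to start_<tag>/end_<tag>."""

    def __init__(self):
        super().__init__()
        self.pieces = []

    def handle_starttag(self, tag, attrs):
        # Look for a method named after the tag; ignore tags without one.
        method = getattr(self, "start_" + tag, None)
        if method is not None:
            method(dict(attrs))

    def handle_endtag(self, tag):
        method = getattr(self, "end_" + tag, None)
        if method is not None:
            method()

    def handle_data(self, data):
        self.pieces.append(data)

    def text(self):
        return "".join(self.pieces)


class LinkAwareParser(TagDispatchParser):
    """Shows each link target in brackets after the link text."""

    href = None

    def start_a(self, attrs):
        self.href = attrs.get("href")

    def end_a(self):
        if self.href:
            self.pieces.append(" [%s]" % self.href)
        self.href = None

    def start_br(self, attrs):
        # Turn an explicit line break into a newline.
        self.pieces.append("\n")


parser = LinkAwareParser()
parser.feed('See <a href="http://www.python.org/">Python</a>.<br>Bye')
parser.close()
print(parser.text())
```

Tags without a matching method are simply dropped, so only the tags you
care about need a handler.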
Regards,
Corran