Converting HTML to ASCII
Jorgen Grahn
jgrahn-nntq at algonet.se
Sat Feb 26 20:46:05 EST 2005
On 26 Feb 2005 02:36:31 -0800, Paul Rubin wrote:
> Jorgen Grahn <jgrahn-nntq at algonet.se> writes:
>> You should probably do what some other poster suggested -- download
>> lynx or some other text-only browser and make your code execute it
>> in -dump mode to get the text-formatted html. You'll get that
>> working in an hour or so, and then you can see if you need something
>> more complicated.
>
> Lynx is pathetically slow for large files. It seems to use a
> quadratic algorithm for remembering where the links point, or
> something. I wrote a very crude but very fast renderer in C that I
> can post if someone wants it, which is what I use for this purpose.

That may be so, but lynx is fast enough for all the people who use it as a
general HTML-to-plaintext tool, so it's probably good enough for the OP.

w3m and links are other options. They provide better formatting than lynx,
and at least w3m has the -dump option.
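
A minimal sketch of the "execute it in -dump mode" approach from Python,
assuming lynx or w3m is installed and on the PATH. The function name
html_to_text and the DUMP_COMMANDS table are made up for illustration, and
the flags other than -dump (-nolist, -stdin, -T) are my assumptions about
reasonable defaults; check them against each tool's man page.

```python
import shutil
import subprocess

# Dump-mode command lines for two of the browsers named in this thread.
# Only "-dump" comes from the discussion; the other flags are assumptions.
DUMP_COMMANDS = {
    "lynx": ["lynx", "-dump", "-nolist", "-stdin"],
    "w3m": ["w3m", "-dump", "-T", "text/html"],
}


def html_to_text(html, browser="lynx", timeout=30):
    """Render an HTML string to plain text by piping it through a text browser."""
    if browser not in DUMP_COMMANDS:
        raise ValueError(f"unsupported browser: {browser}")
    if shutil.which(browser) is None:
        raise FileNotFoundError(f"{browser} is not installed or not on PATH")
    result = subprocess.run(
        DUMP_COMMANDS[browser],
        input=html,           # HTML goes in on stdin
        capture_output=True,  # formatted text comes back on stdout
        text=True,
        timeout=timeout,      # guard against very slow renders of large files
        check=True,           # raise CalledProcessError on a non-zero exit
    )
    return result.stdout
```

Calling html_to_text(page, browser="w3m") swaps in w3m without touching the
rest of the program, which makes it easy to compare the formatting and speed
of the two before deciding whether something more complicated is needed.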

I wouldn't mind if there were a reusable library for rendering HTML to text,
usable from various languages. I'd also like to see one (CSS-aware) for
rendering to troff or PostScript.
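
To give a feel for how little code a crude pure-Python renderer needs, here
is a minimal sketch built on the standard library's html.parser module. It is
not the reusable library wished for above: it ignores CSS, tables, and link
targets, and the names TextExtractor, html_to_plain, BLOCK_TAGS, and
SKIP_TAGS are hypothetical.

```python
from html.parser import HTMLParser

# Tags treated as line breaks, and tags whose content is dropped entirely.
# Both sets are illustrative choices, not a complete list.
BLOCK_TAGS = {"p", "div", "br", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}
SKIP_TAGS = {"script", "style"}


class TextExtractor(HTMLParser):
    """Collect text content, inserting newlines around block-level tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        # Keep text only when we are not inside <script> or <style>.
        if self.skip_depth == 0:
            self.parts.append(data)


def html_to_plain(html):
    """Return the text of an HTML string, one non-empty line per block."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    raw_lines = "".join(parser.parts).splitlines()
    # Collapse runs of whitespace within each line and drop empty lines.
    lines = (" ".join(line.split()) for line in raw_lines)
    return "\n".join(line for line in lines if line)
```

This makes a single pass over the input, so it should not show the slowdown
on large files complained about above, but the output is far plainer than
what lynx or w3m produce.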
/Jorgen
--
  // Jorgen Grahn <jgrahn@       Ph'nglui mglw'nafh Cthulhu
\X/                algonet.se>   R'lyeh wgah'nagl fhtagn!