parsing complex web pages

Tue Jul 15 20:46:08 EDT 2003

John J. Lee <jjl at pobox.com> wrote:
> John Hunter <jdhunter at ace.bsd.uchicago.edu> writes:
>> >>>>> "John" == John J Lee <jjl at pobox.com> writes:
>>     John> If it works well for you, why not stick with it?
> [...]
>> It did cause me to wonder though, whether some good python html->text
>> converters which render the html as text (ie, preserve visual layout),
>> were lurking out there beneath my radar screen.
>
> If they exist, it's unlikely they'll do as good a job as lynx (in
> general, not talking about Yahoo in particular), because there is so
> much awful HTML out there.  lynx has been around a long time.

And if lynx doesn't do what you want, there are other text browsers
available. links, elinks and w3m can also dump pages as text, and all of
them handle tables better than lynx (at least for viewing).

Looking at the sizes of these binaries, maybe it would be best to use
w3m for dumping web pages, since it is by far the smallest:

:!ls -l `which lynx links elinks w3m`
-rwxr-xr-x    root       619004 Jul 18  2001 /usr/bin/links
-rwxr-xr-x    root       807132 Jan 14  2003 /usr/local/bin/elinks
-rwxr-xr-x    root      1111090 Jan 21 19:58 /usr/local/bin/lynx
-rwxr-xr-x    root       351008 Jan 24 01:08 /usr/local/bin/w3m
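
If you want to drive one of these from Python, something along these
lines should do as a rough sketch. The function names are made up for
illustration, and the only thing assumed about the browsers is that each
of them accepts -dump followed by a URL and prints the rendered page on
stdout. It tries w3m first and falls back to whatever else is installed.

```python
import shutil
import subprocess

# Text-mode browsers that accept "-dump URL", in order of preference.
DUMP_BROWSERS = ("w3m", "elinks", "links", "lynx")


def build_dump_command(browser, url):
    """Return the argv list that dumps `url` as text using `browser`."""
    if browser not in DUMP_BROWSERS:
        raise ValueError("unsupported browser: %r" % (browser,))
    return [browser, "-dump", url]


def find_browser(candidates=DUMP_BROWSERS, which=shutil.which):
    """Return the first candidate found on PATH, or None if there is none."""
    for name in candidates:
        if which(name):
            return name
    return None


def dump_page(url, browser=None, timeout=30):
    """Render `url` to plain text with an external text-mode browser."""
    browser = browser or find_browser()
    if browser is None:
        raise RuntimeError("no text-mode browser found on PATH")
    result = subprocess.run(
        build_dump_command(browser, url),
        capture_output=True,
        text=True,
        errors="replace",  # do not fail on pages with odd encodings
        timeout=timeout,
        check=True,        # raise CalledProcessError on a non-zero exit
    )
    return result.stdout
```

Calling dump_page("http://www.python.org/") then returns the page as one
string, laid out the way the browser would show it on a terminal.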

-- 
Vlad