parsing complex web pages
minceme at start.no
Wed Jul 16 02:46:08 CEST 2003
John J. Lee <jjl at pobox.com> wrote:
> John Hunter <jdhunter at ace.bsd.uchicago.edu> writes:
>> >>>>> "John" == John J Lee <jjl at pobox.com> writes:
>> John> If it works well for you, why not stick with it?
>> It did cause me to wonder though, whether some good python html->text
>> converters which render the html as text (ie, preserve visual layout),
>> were lurking out there beneath my radar screen.
> If they exist, it's unlikely they'll do as good a job as lynx (in
> general, not talking about Yahoo in particular), because there is so
> much awful HTML out there. lynx has been around a long time.
And if lynx doesn't do what you want, there are also other text
browsers available. links, elinks and w3m can also dump pages as text,
and all of them handle tables better than lynx (at least for viewing).
Looking at the sizes for these browsers, maybe it would be best to use
w3m for dumping webpages:
:!ls -l `which lynx links elinks w3m`
-rwxr-xr-x root 619004 Jul 18 2001 /usr/bin/links
-rwxr-xr-x root 807132 Jan 14 2003 /usr/local/bin/elinks
-rwxr-xr-x root 1111090 Jan 21 19:58 /usr/local/bin/lynx
-rwxr-xr-x root 351008 Jan 24 01:08 /usr/local/bin/w3m
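
For calling one of these from Python, here is a rough sketch rather than a tested recipe. It assumes the browser is on your PATH and relies on the -dump option, which all four accept. The helper names (find_dumper, dump_command, dump_page) and the preference order with w3m first are my own choices for illustration, not anything the browsers prescribe.

```python
import shutil
import subprocess

# Assumed preference order: w3m first, following the size comparison above.
DUMPERS = ("w3m", "links", "elinks", "lynx")


def find_dumper(candidates=DUMPERS, which=shutil.which):
    """Return the first candidate browser found on PATH, or None."""
    for name in candidates:
        if which(name):
            return name
    return None


def dump_command(browser, url):
    """Build the argument list for a plain-text dump of url."""
    return [browser, "-dump", url]


def dump_page(url, browser=None, timeout=30):
    """Render url as text with an external text browser and return it."""
    browser = browser or find_dumper()
    if browser is None:
        raise RuntimeError("no text browser found on PATH")
    result = subprocess.run(
        dump_command(browser, url),
        capture_output=True,  # collect the rendered text from stdout
        text=True,            # decode output to str
        timeout=timeout,
        check=True,           # raise CalledProcessError on a non-zero exit
    )
    return result.stdout
```

Something like text = dump_page("http://www.python.org/") would then hand you the rendered page as a string, using whichever browser was found first.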