[Tutor] HTMLParser problem unable to find all the IMG
mlist-python at dideas.com
Thu Oct 28 21:36:03 CEST 2004
At 01:49 PM 10/28/2004, Lloyd Kvam wrote:
>On Thu, 2004-10-28 at 08:34, Chris Barnhart wrote:
> > The problem is that using the HTMLParser I'm not getting all the IMG
> > tags. I know this as I have another program that just uses string
> > processing that gets 2.5 times more IMG SRC tags. I also know this because
> > HTMLParser starttag is never called with the IMG that I'm after!
>For debugging I would suggest saving the html locally from both methods
>and confirming that they are the same. Then run your HTMLParser against
>the saved file and print the img tags that you find. You can redirect
>the output to a file. Then compare to the original html to see what
>tags got missed.
>Debugging against the live site means that you have no control on the
>data being fed through and no easy way to validate your program.
Yeah - I agree. I've saved the webpage to a file and am working through
it. It's nothing to do with capitalization....
The webpage is in a string now, and I've found that if I cut off the start
of the string, I can get "new" tags to appear. I'm trying to work through
and find exactly which tags are missing and why. It's kind of tedious!
One possible problem is that HTMLParser is HTML 2.0 compliant, but CNN's
output is 4.01.
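# MyParser is my HTMLParser subclass (not shown here).
f = open("cnn_py.html", "r")
html_full = f.read()
f.close()
html = html_full[20000:]  # cut off the first 20000 characters
h = MyParser()
h.feed(html)
h.close()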
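
A minimal sketch of the comparison Lloyd suggested, run against a string
instead of the live site. ImgCollector is a hypothetical stand-in for
MyParser, the regular expression is only a rough approximation of the
string-processing program, and the import assumes Python 3's html.parser
(the module was named HTMLParser in 2004). It lists the SRC values that the
string search finds but the parser never reports.

```python
import re
from html.parser import HTMLParser  # assumption: Python 3 module name


class ImgCollector(HTMLParser):
    """Hypothetical stand-in for MyParser: records the src of each IMG tag."""

    def __init__(self):
        super().__init__()
        self.srcs = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased, so IMG and img both match.
        if tag == "img":
            self.srcs.append(dict(attrs).get("src"))


def parser_srcs(html):
    """Return the src values that the parser reports, in document order."""
    p = ImgCollector()
    p.feed(html)
    p.close()
    return p.srcs


def regex_srcs(html):
    """Crude string-processing search, used only as a cross-check."""
    pattern = r'<img\b[^>]*?\bsrc\s*=\s*["\']?([^"\'\s>]+)'
    return re.findall(pattern, html, re.IGNORECASE)


def missed_by_parser(html):
    """Return src values the regex finds but the parser never reports."""
    found = set(parser_srcs(html))
    return [s for s in regex_srcs(html) if s not in found]
```

Calling missed_by_parser(html_full) on the saved page, and again on slices
such as html_full[20000:], should show which tags drop out and whether the
starting offset changes the result.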
When I figure this out, I'll make a new post.