[Tutor] HTMLParser problem unable to find all the IMG tags....
Chris Barnhart
mlist-python at dideas.com
Fri Oct 29 17:16:28 CEST 2004
Danny,
Thank you much for looking into this. I'll make your change to my copy of
HTMLParser.py and see how it works.
As for a bug report, you've traced the issue deeper and into territory that
is beyond me [I started Python 1 week ago, and don't know Perl in part due
to the RE stuff!] so I'd recommend that you submit it. I'd just be sending
them your insights if I did it myself.
It would also be valuable to document that HTMLParser is sensitive to minor
spec flaws.
When I get the CNN grabber working I'll post it to this thread....
Thanks,
Chris
At 05:05 PM 10/28/2004, you wrote:
>On Thu, 28 Oct 2004, Chris Barnhart wrote:
>
> > At 01:49 PM 10/28/2004, Lloyd Kvam wrote:
> > >On Thu, 2004-10-28 at 08:34, Chris Barnhart wrote:
> > > >
> > > > The problem is that using the HTMLParser I'm not getting all the IMG
> > > > tags. I know this as I have another program that just uses string
> > > > processing that gets 2.5 times more IMG SRC tags. I also know this
> > > > because HTMLParser starttag is never called with the IMG that I'm after!
> >
> >
> > The problem with my getting all the IMG tags from CNN is the lack of a
> > space separating a close quote and start of an attribute in at least one
> > of their IMG SRC statements.
>
>
>
>Hi Chris,
>
>
>Ah, that makes sense. Can you send a feature request or bug report to the
>Python developers? They keep a bug list on Sourceforge:
>
> http://sourceforge.net/tracker/?group_id=5470
>
>
>I did some diving through the code. The bug appears to be that the
>internal function HTMLParser.check_for_whole_start_tag() doesn't recognize
>that:
>
> <IMG SRC = "abc.jpg"WIDTH=5>
>
>is a whole start tag element.
>
>
>
>I think it might have to do with HTMLParser.locatestarttagend, because the
>regular expression there says:
>
>
>###
>locatestarttagend = re.compile(r"""
> <[a-zA-Z][-.a-zA-Z0-9:_]* # tag name
> (?:\s+ # whitespace before attribute name
> (?:[a-zA-Z_][-.:a-zA-Z0-9_]* # attribute name
> (?:\s*=\s* # value indicator
> (?:'[^']*' # LITA-enclosed value
> |\"[^\"]*\" # LIT-enclosed value
> |[^'\">\s]+ # bare value
> )
> )?
> )
> )*
> \s* # trailing whitespace
>""", re.VERBOSE)
>###
>
>
>that whitespace is required before every attribute name. CNN's
>HTML obviously doesn't have this.
>
>
>
>I wonder what happens if we relax the regular expression slightly, so that
>the whitespace is optional.
>
>###
> >>> import HTMLParser
> >>> import re
> >>> HTMLParser.locatestarttagend = re.compile(r"""
>... <[a-zA-Z][-.a-zA-Z0-9:_]* # tag name
>... (?:\s* # optional whitespace before
>... # attribute name
>... (?:[a-zA-Z_][-.:a-zA-Z0-9_]* # attribute name
>... (?:\s*=\s* # value indicator
>... (?:'[^']*' # LITA-enclosed value
>... |\"[^\"]*\" # LIT-enclosed value
>... |[^'\">\s]+ # bare value
>... )
>... )?
>... )
>... )*
>... \s* # trailing whitespace
>... """, re.VERBOSE)
> >>> class Parser(HTMLParser.HTMLParser):
>... def handle_starttag(self, tag, attrs):
>... print "START", tag, attrs
>...
> >>> p = Parser()
> >>> p.feed(' <IMG SRC = "abc.jpg"WIDTH=5>')
>START img [('src', 'abc.jpg'), ('width', '5')]
>###
>
>
>Ok, that makes the parser a little more permissive, so that it'll accept
>the screwed up HTML that CNN is providing us. *grin*
>
>
>Chris, would this work for you? Maybe we can pass this off to the Python
>developers and get it into the next release. Do you want to send the bug
>report to them?
>
>
>Good luck to you!
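
The diagnosis in the quoted message can be checked with the re module alone. What follows is a minimal sketch, not the actual HTMLParser code path: it rebuilds the two patterns quoted above and shows how much of CNN's tag each one leaves unmatched. It is written for Python 3, whereas the session above is Python 2. The names build, STRICT, RELAXED and rest_after_match are made up for this illustration and do not exist in HTMLParser.

```python
import re

# Attribute part shared by both patterns, copied from the regex quoted above.
ATTRIBUTE = r"""
    (?:[a-zA-Z_][-.:a-zA-Z0-9_]*      # attribute name
      (?:\s*=\s*                      # value indicator
        (?:'[^']*'                    # LITA-enclosed value
          |\"[^\"]*\"                 # LIT-enclosed value
          |[^'\">\s]+                 # bare value
        )
      )?
    )
"""


def build(separator):
    """Hypothetical helper: build a start-tag pattern with the given
    whitespace rule in front of each attribute name."""
    return re.compile(
        r"<[a-zA-Z][-.a-zA-Z0-9:_]*(?:" + separator + ATTRIBUTE + r")*\s*",
        re.VERBOSE,
    )


STRICT = build(r"\s+")   # whitespace required, as in the original pattern
RELAXED = build(r"\s*")  # whitespace optional, as in the relaxed pattern


def rest_after_match(pattern, text):
    """Return whatever follows the pattern's match at the start of text."""
    m = pattern.match(text)
    return text[m.end():]


tag = '<IMG SRC = "abc.jpg"WIDTH=5>'
print(repr(rest_after_match(STRICT, tag)))   # 'WIDTH=5>'
print(repr(rest_after_match(RELAXED, tag)))  # '>'
```

With the strict pattern the match stops right after the closing quote, so the next character is "W" rather than ">", and the tag does not look complete. With the relaxed pattern the match runs up to the ">", which is consistent with the START line printed in the session above.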