[Tutor] HTMLParser problem unable to find all the IMG tags....

Fri Oct 29 17:16:28 CEST 2004

Danny,

Thank you much for looking into this.  I'll make your change to my copy of 
HTMLParser.py and see how it works.

As for a bug report, you've traced the issue deeper and into territory that 
is beyond me [I started Python 1 week ago, and don't know Perl in part due 
to the RE stuff!] so I'd recommend that you submit it.  I'd just be sending 
them your insights if I did it myself.

It would also be valuable to document that HTMLParser is sensitive to minor 
spec flaws.

When I get the CNN grabber working I'll post it to this thread....

Thanks,
Chris

At 05:05 PM 10/28/2004, you wrote:

>On Thu, 28 Oct 2004, Chris Barnhart wrote:
>
> > At 01:49 PM 10/28/2004, Lloyd Kvam wrote:
> > >On Thu, 2004-10-28 at 08:34, Chris Barnhart wrote:
> > > >
> > > > The problem is that using the HTMLParser I'm not getting all the IMG
> > > > tags.  I know this as I have another program that just uses string
> > > > processing that gets 2.5 times more IMG SRC tags.  I also know this
> > > > because HTMLParser starttag is never called with the IMG that I'm
> > > > after!
> >
> >
> > The problem with my getting all the IMG tags from CNN is the lack of a
> > space separating a close quote and the start of an attribute in at least
> > one of their IMG SRC statements.
>
>
>
>Hi Chris,
>
>
>Ah, that makes sense.  Can you send a feature request or bug report to the
>Python developers?  They keep a bug list on Sourceforge:
>
>     http://sourceforge.net/tracker/?group_id=5470
>
>
>I did some diving through the code.  The bug appears to be that the
>internal function HTMLParser.check_for_whole_start_tag() doesn't recognize
>that:
>
>     <IMG SRC = "abc.jpg"WIDTH=5>
>
>is a whole start tag element.
>
>
>
>I think it might have to do with HTMLParser.locatestarttagend, because the
>regular expression there says:
>
>
>###
>locatestarttagend = re.compile(r"""
>   <[a-zA-Z][-.a-zA-Z0-9:_]*          # tag name
>   (?:\s+                             # whitespace before attribute name
>     (?:[a-zA-Z_][-.:a-zA-Z0-9_]*     # attribute name
>       (?:\s*=\s*                     # value indicator
>         (?:'[^']*'                   # LITA-enclosed value
>           |\"[^\"]*\"                # LIT-enclosed value
>           |[^'\">\s]+                # bare value
>          )
>        )?
>      )
>    )*
>   \s*                                # trailing whitespace
>""", re.VERBOSE)
>###
>
>
>that there's a required whitespace before every attribute name.  CNN's
>HTML obviously doesn't have this.
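
One way to check this reading in isolation is to run both forms of the
pattern against the problem tag and see where each match stops.  This is
only a sketch for Python 3: build_pattern is a hypothetical helper made up
for this example (it is not part of HTMLParser), and the pattern text is
copied from the regular expression quoted above.

```python
import re

# Attribute part of the pattern, copied from the quoted locatestarttagend.
ATTRIBUTE = r"""
    (?:[a-zA-Z_][-.:a-zA-Z0-9_]*     # attribute name
      (?:\s*=\s*                     # value indicator
        (?:'[^']*'                   # LITA-enclosed value
          |\"[^\"]*\"                # LIT-enclosed value
          |[^'\">\s]+                # bare value
         )
       )?
     )
"""


def build_pattern(leading_ws):
    """Hypothetical helper: build the start-tag pattern with the given
    whitespace rule in front of each attribute name."""
    return re.compile(
        r"<[a-zA-Z][-.a-zA-Z0-9:_]*"            # tag name
        + r"(?:" + leading_ws + ATTRIBUTE + r")*"
        + r"\s*",                               # trailing whitespace
        re.VERBOSE,
    )


strict = build_pattern(r"\s+")    # whitespace required (the quoted pattern)
relaxed = build_pattern(r"\s*")   # whitespace optional

tag = '<IMG SRC = "abc.jpg"WIDTH=5>'
strict_rest = tag[strict.match(tag).end():]
relaxed_rest = tag[relaxed.match(tag).end():]

print("strict pattern leaves: ", repr(strict_rest))
print("relaxed pattern leaves:", repr(relaxed_rest))
```

With the required whitespace the match stops right after the closing quote
and leaves WIDTH=5> unconsumed; with optional whitespace only the closing >
is left, which is what a whole start tag should look like.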
>
>
>
>I wonder what happens if we relax the regular expression slightly, so that
>the whitespace is optional.
>
>###
> >>> import HTMLParser
> >>> import re
> >>> HTMLParser.locatestarttagend =  re.compile(r"""
>...   <[a-zA-Z][-.a-zA-Z0-9:_]*          # tag name
>...   (?:\s*                             # optional whitespace before
>...                                      # attribute name
>...     (?:[a-zA-Z_][-.:a-zA-Z0-9_]*     # attribute name
>...       (?:\s*=\s*                     # value indicator
>...         (?:'[^']*'                   # LITA-enclosed value
>...           |\"[^\"]*\"                # LIT-enclosed value
>...           |[^'\">\s]+                # bare value
>...          )
>...        )?
>...      )
>...    )*
>...   \s*                                # trailing whitespace
>... """, re.VERBOSE)
> >>> class Parser(HTMLParser.HTMLParser):
>...     def handle_starttag(self, tag, attrs):
>...         print "START", tag, attrs
>...
> >>> p = Parser()
> >>> p.feed('    <IMG SRC = "abc.jpg"WIDTH=5>')
>START img [('src', 'abc.jpg'), ('width', '5')]
>###
>
>
>Ok, that makes the parser a little more permissive, so that it'll accept
>the screwed up HTML that CNN is providing us.  *grin*
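
For anyone trying the session above on Python 3, a rough equivalent might
look like the sketch below.  It assumes the html.parser module, which
replaced the HTMLParser module in Python 3, and it does not patch the
regular expression at all; the ImgCollector class name is made up for this
example.  Recent Python 3 releases appear to accept the missing space on
their own, but that has not been checked against every version.

```python
from html.parser import HTMLParser


class ImgCollector(HTMLParser):
    """Hypothetical example class: record every IMG start tag seen."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lower-cased; attrs is a list of
        # (name, value) pairs.
        if tag == "img":
            self.images.append(attrs)


collector = ImgCollector()
collector.feed('    <IMG SRC = "abc.jpg"WIDTH=5>')
collector.close()

for attrs in collector.images:
    print("START img", attrs)
```

Collecting the attributes in a list rather than printing from the handler
makes it easier to compare the count against a plain string search, which
is how the missing IMG tags were noticed in the first place.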
>
>
>Chris, would this work for you?  Maybe we can pass this off to the Python
>developers and get it into the next release.  Do you want to send the bug
>report to them?
>
>
>Good luck to you!