[ python-Bugs-921657 ] HTMLParser ParseError in start tag

SourceForge.net noreply at sourceforge.net
Wed Oct 13 12:16:26 CEST 2004


Bugs item #921657, was opened at 2004-03-23 13:17
Message generated for change (Comment added) made by nnseva
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=921657&group_id=5470

Category: Python Library
Group: Python 2.3
Status: Closed
Resolution: Fixed
Priority: 5
Submitted By: Bernd Zimmermann (bernd_zedv)
Assigned to: A.M. Kuchling (akuchling)
Summary: HTMLParser ParseError in start tag

Initial Comment:
When this (obviously correct) HTML is parsed:

<a href=mailto:xyz@domain.com>xyz</a>

this exception is raised:
HTMLParseError: junk characters in start tag: '@domain.com>', at line 1, column 1

I work around this by adding '@' to the
allowed character class:

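import re
import HTMLParser

# Replace the module-level attribute regex with one whose
# unquoted-value character class also accepts '@'.
HTMLParser.attrfind = re.compile(
    r'\s*([a-zA-Z_][-.:a-zA-Z_0-9]*)(\s*=\s*'
    r'(\'[^\']*\'|"[^"]*"|[-a-zA-Z0-9./,:;+*%?!&$\(\)_#=~@]*))?')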

myparser = HTMLParser.HTMLParser()
myparser.feed('<a ... ')
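
A rough sketch of why the extra '@' matters, assuming Python 3 and using
only the re module rather than the parser itself. The character class
without '@' is an assumption (the workaround's class with '@' taken out),
not a copy of the library source, and make_attrfind is a made-up helper
name used only for this illustration.

```python
import re

# Unquoted-value character classes. The first is ASSUMED to be the
# workaround's class minus '@'; the second is the class from the workaround.
UNQUOTED_WITHOUT_AT = r'[-a-zA-Z0-9./,:;+*%?!&$\(\)_#=~]*'
UNQUOTED_WITH_AT = r'[-a-zA-Z0-9./,:;+*%?!&$\(\)_#=~@]*'


def make_attrfind(unquoted):
    """Hypothetical helper: build an attribute regex around a value class."""
    return re.compile(
        r'\s*([a-zA-Z_][-.:a-zA-Z_0-9]*)(\s*=\s*'
        r'(\'[^\']*\'|"[^"]*"|' + unquoted + r'))?')


strict = make_attrfind(UNQUOTED_WITHOUT_AT)
relaxed = make_attrfind(UNQUOTED_WITH_AT)

# The part of the start tag that follows the tag name.
tag_rest = ' href=mailto:xyz@domain.com>'

strict_match = strict.match(tag_rest)
relaxed_match = relaxed.match(tag_rest)

# Group 3 is the attribute value; whatever follows the match is left over.
strict_value = strict_match.group(3)
strict_leftover = tag_rest[strict_match.end():]
relaxed_value = relaxed_match.group(3)
relaxed_leftover = tag_rest[relaxed_match.end():]

print(strict_value, repr(strict_leftover))
print(relaxed_value, repr(relaxed_leftover))
```

With the class that lacks '@', the value match stops at 'mailto:xyz' and
'@domain.com>' is left over, which is the text quoted in the error
message. With '@' allowed, the whole address is taken as the value and
only '>' remains.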



----------------------------------------------------------------------

Comment By: Vsevolod Novikov (nnseva)
Date: 2004-10-13 14:16

Message:
Logged In: YES 
user_id=325678

see request #1046092 to fix it

----------------------------------------------------------------------

Comment By: A.M. Kuchling (akuchling)
Date: 2004-06-05 19:32

Message:
Logged In: YES 
user_id=11375

Committed to the CVS HEAD; thanks!


----------------------------------------------------------------------

Comment By: A.M. Kuchling (akuchling)
Date: 2004-04-19 17:01

Message:
Logged In: YES 
user_id=11375

I don't believe this HTML is obviously correct.  
The section on attributes in the HTML 4.01 Recommendation
(http://www.w3.org/TR/html4/intro/sgmltut.html#h-3.2.2) says:

In certain cases, authors may specify the value of an
attribute without any quotation marks. The attribute value
may only contain letters (a-z and A-Z), digits (0-9),
hyphens (ASCII decimal 45), periods (ASCII decimal 46),
underscores (ASCII decimal 95), and colons (ASCII decimal
58). We recommend using quotation marks even when it is
possible to eliminate them.  

The regex is already more liberal than this, allowing slashes
and various other symbols, so we might as well add '@', but
you should also consider adding quotation marks to the
original attribute.
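
A minimal sketch of the quoting advice, assuming Python 3. It checks a
value against the characters the HTML 4.01 passage above allows without
quotation marks and quotes anything else. The names needs_quotes and
render_attr are hypothetical, chosen only for this example.

```python
import re
from html import escape

# Letters, digits, hyphens, periods, underscores and colons: the set the
# HTML 4.01 passage quoted above permits in an unquoted attribute value.
HTML401_UNQUOTED = re.compile(r'[A-Za-z0-9\-._:]+')


def needs_quotes(value):
    """Return True if the value contains anything outside the allowed set."""
    return HTML401_UNQUOTED.fullmatch(value) is None


def render_attr(name, value):
    """Hypothetical helper: emit name=value, quoting the value when required."""
    if needs_quotes(value):
        # escape() also replaces '&', '<', '>' and quote characters.
        return '%s="%s"' % (name, escape(value, quote=True))
    return '%s=%s' % (name, value)


print(render_attr('href', 'mailto:xyz@domain.com'))
print(render_attr('id', 'section-1.2'))
```

Here 'mailto:xyz@domain.com' gets quotation marks because '@' is outside
the allowed set, while 'section-1.2' can stay bare.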


----------------------------------------------------------------------


