filed: <a href="http://bugs.python.org/issue7311">http://bugs.python.org/issue7311</a><br><br><div class="gmail_quote">On Thu, Nov 12, 2009 at 12:24 AM, Michael Foord <span dir="ltr"><<a href="mailto:fuzzyman@voidspace.org.uk">fuzzyman@voidspace.org.uk</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">Hello Zhang Chiyuan,<br>
<br>
Can you file a bug on the Python issue tracker please:<br>
<br>
<a href="http://bugs.python.org" target="_blank">http://bugs.python.org</a><br>
<br>
Thanks<br>
<br>
Michael Foord<br>
<br>
Zhang Chiyuan wrote:<br>
<blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;"><div><div></div><div class="h5">
Hi all,<br>
<br>
I'm using BeautifulSoup to parse an HTML page, and it refuses to<br>
parse the page. Looking at the backtrace, I found that the problem<br>
is in Python's built-in HTMLParser.py. The web page I'm<br>
parsing contains some Chinese characters. There is a tag like <img<br>
src=/foo/bar.png alt=中文>; note that this is a legacy HTML page where the<br>
attribute values are not quoted. However, the regexp defined in<br>
HTMLParser.py is:<br>
<br>
attrfind = re.compile(<br>
r'\s*([a-zA-Z_][-.:a-zA-Z_0-9]*)(\s*=\s*'<br>
r'(\'[^\']*\'|"[^"]*"|[-a-zA-Z0-9./,:;+*%?!&$\(\)_#=~@]*))?')<br>
<br>
Note that Chinese characters (and any other non-ASCII<br>
characters) are not matched by the unquoted-value part of this regexp, so the parser raises an error on this tag. I'm not sure whether<br>
the HTML standard allows unquoted non-ASCII characters in<br>
attribute values. If it does, this seems to be a bug, and the character class for unquoted values<br>
would better be [^>\s] IMHO.<br>
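<br>
A minimal sketch of the mismatch, assuming Python 3's re module. The first pattern is the regexp quoted above; attrfind_relaxed and unquoted_value are hypothetical names used only for illustration and are not part of HTMLParser.py:<br>

```python
import re

# The pattern quoted in the message, copied verbatim.
attrfind = re.compile(
    r'\s*([a-zA-Z_][-.:a-zA-Z_0-9]*)(\s*=\s*'
    r'(\'[^\']*\'|"[^"]*"|[-a-zA-Z0-9./,:;+*%?!&$\(\)_#=~@]*))?')

# Hypothetical variant: same pattern, but the unquoted-value character
# class is replaced by [^>\s] as suggested in the message.
attrfind_relaxed = re.compile(
    r'\s*([a-zA-Z_][-.:a-zA-Z_0-9]*)(\s*=\s*'
    r'(\'[^\']*\'|"[^"]*"|[^>\s]*))?')


def unquoted_value(pattern, text):
    """Return the attribute value captured by group 3, or None if no match."""
    match = pattern.match(text)
    return match.group(3) if match else None


if __name__ == "__main__":
    for attr in (" src=/foo/bar.png", " alt=中文"):
        print(repr(attr),
              "original:", repr(unquoted_value(attrfind, attr)),
              "relaxed:", repr(unquoted_value(attrfind_relaxed, attr)))
```

With the original class, the value captured for alt=中文 is empty and the two Chinese characters are left unconsumed; the relaxed class captures them. Whether [^>\s] is the right class for every legacy page is an open question.<br>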
<br>
BTW: It seems that something like:<br>
<br>
<script><br>
var st = "<a></";<br>
</script><br>
<br>
cannot be parsed. :-/<br>
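<br>
A small reproduction sketch for the script case, assuming Python 3, where the module lives at html.parser (the report above concerns the older HTMLParser.py, so results may differ between versions). Recorder, SNIPPET and try_parse are hypothetical names used only for illustration:<br>

```python
from html.parser import HTMLParser


class Recorder(HTMLParser):
    """Collects start tags and text data seen while parsing."""

    def __init__(self):
        super().__init__()
        self.start_tags = []
        self.data = []

    def handle_starttag(self, tag, attrs):
        self.start_tags.append(tag)

    def handle_data(self, data):
        self.data.append(data)


# The snippet from the message: a string literal containing "</"
# inside a script element.
SNIPPET = '<script>\nvar st = "<a></";\n</script>'


def try_parse(markup):
    """Feed markup to a Recorder; return (parser, None) or (None, error)."""
    parser = Recorder()
    try:
        parser.feed(markup)
        parser.close()
    except Exception as exc:  # report any parser error instead of crashing
        return None, exc
    return parser, None


if __name__ == "__main__":
    parser, error = try_parse(SNIPPET)
    if error is not None:
        print("parse failed:", error)
    else:
        print("start tags:", parser.start_tags)
        print("script body:", repr("".join(parser.data)))
```

Running this against a given interpreter shows whether that version treats the script body as opaque text or trips over the "</" inside the string literal.<br>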
<br>
--<br>
pluskid<br>
<a href="http://blog.pluskid.org" target="_blank">http://blog.pluskid.org</a><br></div></div>
_______________________________________________<br>
Python-Dev mailing list<br>
<a href="mailto:Python-Dev@python.org" target="_blank">Python-Dev@python.org</a><br>
<a href="http://mail.python.org/mailman/listinfo/python-dev" target="_blank">http://mail.python.org/mailman/listinfo/python-dev</a><br>
<br>
</blockquote>
<br>
<br>
-- <br>
<a href="http://www.ironpythoninaction.com/" target="_blank">http://www.ironpythoninaction.com/</a><br>
<br>
</blockquote></div><br><br clear="all"><br>-- <br>pluskid<br><a href="http://blog.pluskid.org">http://blog.pluskid.org</a><br>