[XML-SIG] "Character reference too large" error with HtmlLib.Reader()
Douglas Bates
bates@stat.wisc.edu
30 Jul 2002 16:29:54 -0500
I am using Debian GNU/Linux 3.0 (woody) with the Debian python2.2-xml
(0.7.1-2) package. I am trying to build a DOM from an HTML URI using:
    from xml.dom.ext.reader.HtmlLib import Reader
    dom = Reader().fromUri(...)
The HTML returned from this URI occasionally has character references
such as
    <dt>Alberto Luce&#241;o and Jaime Puig-Pey</dt>
and I get errors like:
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "/usr/lib/python2.2/site-packages/_xmlplus/dom/ext/reader/HtmlLib.py", line 63, in fromUri
    return self.fromStream(stream, ownerDoc, charset)
  File "/usr/lib/python2.2/site-packages/_xmlplus/dom/ext/reader/HtmlLib.py", line 28, in fromStream
    self.parser.parse(stream)
  File "/usr/lib/python2.2/site-packages/_xmlplus/dom/ext/reader/Sgmlop.py", line 57, in parse
    self._parser.parse(stream.read())
ValueError: character reference too large
As I understand the code in Sgmlop.py, the default character set is
ISO-8859-1, and &#241; should be
    small n, tilde    ñ    &#241; --> ñ    &ntilde; --> ñ
in ISO-8859-1.
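
As a sanity check on that reading, here is a minimal sketch of what I would expect to happen to such a reference. It is written for a current Python 3 interpreter rather than the Python 2.2 setup above, it does not use PyXML or sgmlop at all, and expand_charrefs is a hypothetical helper name invented for this illustration. It assumes the reference arrives in decimal form (&#241;) and also accepts the hexadecimal form (&#xF1;).

```python
import re

# Matches decimal (&#241;) and hexadecimal (&#xF1;) numeric character references.
_CHARREF = re.compile(r"&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));")


def expand_charrefs(text):
    """Replace numeric character references with the characters they name.

    Hypothetical helper for illustration only; it is not part of PyXML.
    """
    def _sub(match):
        hex_digits, dec_digits = match.groups()
        code = int(hex_digits, 16) if hex_digits else int(dec_digits)
        return chr(code)

    return _CHARREF.sub(_sub, text)


# Code point 241 (0xF1) is "small n, tilde"; it fits in one ISO-8859-1 byte.
n_tilde = chr(241)
latin1_bytes = n_tilde.encode("iso-8859-1")

# The fragment from the page, with the reference expanded.
expanded = expand_charrefs("<dt>Alberto Luce&#241;o and Jaime Puig-Pey</dt>")
```

Since 241 is below 256, the reference should fit comfortably in an 8-bit character set, which is why the "too large" message surprises me.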
I welcome suggestions (and thank the developers for a wonderful
package).
-- 
Douglas Bates bates@stat.wisc.edu
Statistics Department 608/262-2598
University of Wisconsin - Madison http://www.stat.wisc.edu/~bates/