[XML-SIG] "Character reference too large" error with HtmlLib.Reader()

30 Jul 2002 16:29:54 -0500

I am using Debian GNU/Linux 3.0 (woody) with the Debian python2.2-xml
(0.7.1-2) package.  I am trying to build a DOM of an HTML URI using

from xml.dom.ext.reader.HtmlLib import Reader
dom = Reader().fromUri(...)

The HTML returned from this URI occasionally has character references
such as
		<dt>Alberto Luce&#241;o and Jaime Puig-Pey</dt>
and I get the following error:
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "/usr/lib/python2.2/site-packages/_xmlplus/dom/ext/reader/HtmlLib.py", line 63, in fromUri
    return self.fromStream(stream, ownerDoc, charset)
  File "/usr/lib/python2.2/site-packages/_xmlplus/dom/ext/reader/HtmlLib.py", line 28, in fromStream
    self.parser.parse(stream)
  File "/usr/lib/python2.2/site-packages/_xmlplus/dom/ext/reader/Sgmlop.py", line 57, in parse
    self._parser.parse(stream.read())
ValueError: character reference too large

As I understand the code in Sgmlop.py, the default character set is
ISO-8859-1, and &#241; should be

small n, tilde                       ñ    &#241; --> ñ    &ntilde; --> ñ

in ISO-8859-1.
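
A quick way to check that reading of the character table is sketched
below. It is only a sanity check of the code point, not a fix for the
parser: it assumes a present-day Python 3 interpreter and the standard
library rather than the Python 2.2 / PyXML setup described above, and
the helper name latin1_char is made up for illustration.

```python
import html
import unicodedata


def latin1_char(code_point):
    """Decode a single ISO-8859-1 byte value (0-255) to its character."""
    return bytes([code_point]).decode("iso-8859-1")


if __name__ == "__main__":
    # 241 == 0xF1, which fits in one ISO-8859-1 byte.
    ch = latin1_char(241)
    print(hex(241), unicodedata.name(ch))
    # The numeric and the named reference resolve to the same character.
    print(html.unescape("&#241;") == ch, html.unescape("&ntilde;") == ch)
```

If the reference really is within range, as this suggests, then the
ValueError would seem to come from how the parser is configured rather
than from the document itself, though that is a guess on my part.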

I welcome suggestions (and thank the developers for a wonderful
package).

-- 
Douglas Bates                            bates@stat.wisc.edu
Statistics Department                    608/262-2598
University of Wisconsin - Madison        http://www.stat.wisc.edu/~bates/