processing XHTML1.1 documents with xml.sax

Fri Aug 6 22:23:01 EDT 2004

Has anybody had any luck processing XHTML1.1 documents with xml.sax?
Whenever I try it, Python loads the W3C DTD from the top, then fails,
saying that there's an error in the external DTD.
All I need to do is rip through a bunch of XHTML documents and extract
some data. Does anybody know a quick way to do this without sax making
outgoing network connections and fussing with DTDs?

BTW, here is how to reproduce the error, if anybody cares.
Below is a document, 'hello.html', produced by the W3C's Amaya:

<?xml version="1.0" encoding="iso-8859-1"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
      "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
  <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
  <title>Hello  World</title>
  <meta name="generator" content="amaya 8.5, see http://www.w3.org/Amaya/" />
</head>

<body>
<p>hello world!</p>
</body>
</html>

and the script:

import xml.sax
import xml.sax.handler

# Parse with a do-nothing handler; the default ContentHandler ignores all events.
xml.sax.parse("hello.html", xml.sax.handler.ContentHandler())

Running it raises this error:

SAXParseException:
http://www.w3.org/TR/xhtml-modularization/DTD/xhtml-framework-1.mod:89:0:
error in processing external entity reference
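
One thing that might avoid the DTD fetch is turning off the reader's
feature_external_ges feature before parsing. The sketch below assumes the
default expat-based reader, which as far as I can tell skips the external DTD
when that feature is off. TitleHandler, extract_title and SAMPLE are made-up
names for illustration, and pulling out the <title> text is just a stand-in
for whatever data actually needs extracting.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler, feature_external_ges


class TitleHandler(ContentHandler):
    """Hypothetical handler that collects the text inside <title>."""

    def __init__(self):
        ContentHandler.__init__(self)
        self.in_title = False
        self.title_parts = []

    def startElement(self, name, attrs):
        if name == "title":
            self.in_title = True

    def endElement(self, name):
        if name == "title":
            self.in_title = False

    def characters(self, content):
        # The parser may deliver text in several chunks, so accumulate them.
        if self.in_title:
            self.title_parts.append(content)


def extract_title(source):
    """Parse a file name or file object without loading external entities."""
    parser = xml.sax.make_parser()
    # Assumption: with this feature off, the expat reader does not resolve
    # the external DTD, so no network connection should be attempted.
    parser.setFeature(feature_external_ges, False)
    handler = TitleHandler()
    parser.setContentHandler(handler)
    parser.parse(source)
    return "".join(handler.title_parts)


# A cut-down, in-memory version of hello.html for trying the function out.
SAMPLE = b"""<?xml version="1.0" encoding="iso-8859-1"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
      "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>Hello  World</title></head>
<body><p>hello world!</p></body>
</html>
"""
```

Calling extract_title("hello.html") or extract_title(io.BytesIO(SAMPLE))
should then return the title text. One caveat: with the DTD skipped, named
entities such as &nbsp; are not defined, so documents that use them would
need separate handling.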