jdownie jdownie at
Sun Apr 10 13:57:05 CEST 2011

What you suggested solved my problem, but unfortunately it also revealed that the HTML I was parsing did not comply with the DTD it was supposed to follow. There were a lot of missing end tags.

In light of this frustrating problem I've gone back to the source DocBook code. There are many standalone XML files with content that I want to parse out. One example that I am focussing on starts with this…

<?xml version='1.0' encoding='iso-8859-1'?>
<appendix xmlns="" xml:id="indexes">

My xml.sax parser fails with…

phpdoc/doc-base/funcindex.xml:3:8: undefined entity

I went looking for "FunctionIndex" and grep told me…

phpdoc/en/language-defs.ent:<!ENTITY FunctionIndex     "Function Index">

…and so I looked for files that explicitly reference "language-defs.ent", and grep tells me…

phpdoc/doc-base/install-unix.xml:<!ENTITY % language-defs     SYSTEM "./en/language-defs.ent">

By this stage, I'm a bit tangled up. Is a SAX parser the right way to parse DocBook files when they rely on entities defined in separate local files? I am hoping to extract structured information from this documentation to present in another format.
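
To make the failure concrete, here is a minimal sketch that reproduces it with xml.sax on an in-memory document. The <title> element, the helper names (TextCollector, extract_text, try_parse) and the inline DOCTYPE are my own assumptions for illustration; the real funcindex.xml gets its entities from separate .ent files instead. It only shows that the parser rejects a reference to an entity it has never seen declared, and accepts the same reference once a declaration is visible.

```python
import xml.sax


class TextCollector(xml.sax.ContentHandler):
    """Collects all character data seen while parsing."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        # The parser may deliver text in several pieces, so gather them all.
        self.chunks.append(content)


def extract_text(xml_bytes):
    """Parse xml_bytes with SAX and return the concatenated text content."""
    handler = TextCollector()
    xml.sax.parseString(xml_bytes, handler)
    return "".join(handler.chunks)


def try_parse(xml_bytes):
    """Return the extracted text, or the parser's error message on failure."""
    try:
        return extract_text(xml_bytes)
    except xml.sax.SAXParseException as exc:
        return "error: " + exc.getMessage()


# Hypothetical document: the entity is referenced but never declared.
UNDECLARED = (
    b"<?xml version='1.0' encoding='iso-8859-1'?>\n"
    b"<appendix><title>&FunctionIndex;</title></appendix>"
)

# Same document, with the entity declared in an internal DTD subset
# (an assumption for this sketch, not how phpdoc actually wires it up).
DECLARED = (
    b"<?xml version='1.0' encoding='iso-8859-1'?>\n"
    b"<!DOCTYPE appendix [\n"
    b'<!ENTITY FunctionIndex "Function Index">\n'
    b"]>\n"
    b"<appendix><title>&FunctionIndex;</title></appendix>"
)

if __name__ == "__main__":
    print(try_parse(UNDECLARED))
    print(try_parse(DECLARED))
```

The first call reports a parse error, while the second returns the text "Function Index". So the open question is how to make the declarations in language-defs.ent visible to the parser when each XML file is parsed on its own.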
