DOCTYPE + SAX
jdownie
jdownie at gmail.com
Sun Apr 10 07:57:05 EDT 2011
What you suggested solved my problem, but unfortunately it also revealed that the HTML I was parsing did not comply with the DTD it was supposed to follow. There were a lot of missing end tags.
In light of this frustrating problem, I've gone back to the source DocBook code. There are many isolated XML files with content that I want to parse out. One example that I am focussing on starts with this…
<?xml version='1.0' encoding='iso-8859-1'?>
<appendix xmlns="http://docbook.org/ns/docbook" xml:id="indexes">
<title>&FunctionIndex;</title>
My xml.sax parser fails with…
phpdoc/doc-base/funcindex.xml:3:8: undefined entity
I went looking for "FunctionIndex" and grep told me…
phpdoc/en/language-defs.ent:<!ENTITY FunctionIndex "Function Index">
…and so I looked for pages that explicitly reference "language-defs.ent", and grep tells me…
phpdoc/doc-base/install-unix.xml:<!ENTITY % language-defs SYSTEM "./en/language-defs.ent">
By this stage, I'm a bit tangled up. Is a SAX parser the right way to parse DocBook files when they rely on entities defined in separate local files? I am hoping to extract structured information from this documentation to present in another format.
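For reference, here is a minimal sketch of one possible workaround, not necessarily the right approach for the whole phpdoc tree. It assumes the .ent file contains only simple double-quoted declarations like the FunctionIndex line above, copies them into an internal DOCTYPE subset placed in front of the root element, and then lets xml.sax expand them as usual. The names inline_entities, TitleHandler and extract_titles are made up for illustration, and the closing </appendix> tag in the usage below is an assumption, since the real file is longer than the three lines quoted.

```python
import re
import xml.sax

# Matches only simple declarations such as:
#   <!ENTITY FunctionIndex "Function Index">
# Parameter entities (<!ENTITY % ...>) and SYSTEM entities are skipped.
ENTITY_RE = re.compile(r'<!ENTITY\s+([\w.:-]+)\s+"([^"]*)"\s*>')


def inline_entities(xml_text, ent_text, root_name):
    """Return xml_text with a DOCTYPE that declares the entities inline."""
    decls = "".join(
        '<!ENTITY %s "%s">' % (name, value)
        for name, value in ENTITY_RE.findall(ent_text)
    )
    doctype = "<!DOCTYPE %s [%s]>" % (root_name, decls)
    # The DOCTYPE has to come after the XML declaration, if there is one.
    match = re.match(r"\s*<\?xml[^>]*\?>", xml_text)
    pos = match.end() if match else 0
    return xml_text[:pos] + doctype + xml_text[pos:]


class TitleHandler(xml.sax.ContentHandler):
    """Collects the text of every <title> element (hypothetical example)."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.titles = []

    def startElement(self, name, attrs):
        if name == "title":
            self.in_title = True
            self.titles.append("")

    def characters(self, content):
        # Expanded entity text arrives here like any other character data.
        if self.in_title:
            self.titles[-1] += content

    def endElement(self, name):
        if name == "title":
            self.in_title = False


def extract_titles(xml_text, ent_text, root_name):
    """Parse xml_text with the entities from ent_text declared inline."""
    handler = TitleHandler()
    xml.sax.parseString(inline_entities(xml_text, ent_text, root_name), handler)
    return handler.titles
```

With the three lines from funcindex.xml (plus an assumed closing tag) as xml_text and the grep hit from language-defs.ent as ent_text, extract_titles(xml_text, ent_text, "appendix") should give back the expanded title text rather than failing with "undefined entity". Entities whose values use single quotes, or that refer to other files, would need more work than this.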