How to parse XHTML with xml.parsers.xmlproc?

Paavo Hartikainen pahartik at
Tue Sep 18 05:21:13 CEST 2001

Paavo Hartikainen writes:

> Now I seem to still have a problem with validating to DTD, while
> parsing alone without validating works already.

The validation part was broken in the older version (0.5.1-5) of the
python-xml package.  It took some Debian knowledge and pushing around
to build the newer python-xml (0.6.6-2) package (from the Debian woody
source package) on a Debian potato system.  Doing the same with the
python-distutils package, which does not seem to exist in potato at
all, did not require any tuning.  After upgrading the python-xml
package from the potato version to the woody one, the problem
disappeared.

> This is what fails in my test case:

So this time there was nothing wrong with my code.

> Complete, stand-alone simplified test case is available at
> <URL:> for now,
> including Python code, XHTML file, DTD catalog and related DTD
> files.

I had to fix the DTD/catalog file, since it must also list the DTD
files that the main DTD file includes, or they will not be found.
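
A quick way to spot such missing entries is to list the external
parameter entities that the main DTD declares and compare their public
identifiers with the catalog.  The following is only a minimal sketch in
current Python, independent of xmlproc; the helper names and the regular
expressions are assumptions of this example and cover only the simple
PUBLIC form used in the catalog shown in this message:

```python
import re

# Matches: <!ENTITY % name PUBLIC "public-id" "system-id">
ENTITY_RE = re.compile(
    r'<!ENTITY\s+%\s+\S+\s+PUBLIC\s+"([^"]+)"\s+"([^"]+)"\s*>')

# Matches catalog lines of the form: PUBLIC "public-id" system-id
CATALOG_RE = re.compile(r'^\s*PUBLIC\s+"([^"]+)"\s+(\S+)\s*$', re.MULTILINE)


def public_ids_in_dtd(dtd_text):
    """Return public identifiers of external parameter entities in a DTD."""
    return [pubid for pubid, _sysid in ENTITY_RE.findall(dtd_text)]


def catalog_entries(catalog_text):
    """Return a {public-id: system-id} dict for PUBLIC catalog entries."""
    return dict(CATALOG_RE.findall(catalog_text))


def missing_from_catalog(dtd_text, catalog_text):
    """List public identifiers the DTD references but the catalog lacks."""
    known = catalog_entries(catalog_text)
    return [p for p in public_ids_in_dtd(dtd_text) if p not in known]
```

Calling missing_from_catalog() with the text of xhtml1-strict.dtd and
the text of DTD/catalog should return an empty list once every included
file is listed.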

However, when I point to the catalog file like this:

cat = catalog.xmlproc_catalog("DTD/catalog", catalog.CatParserFactory())

The validator reaches the main DTD file (xhtml1-strict.dtd) just fine,
but the DTD files included from within that file are searched for in
"DTD/DTD/".  Is this expected behaviour?  This is what my "DTD/catalog"
looks like:

PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" xhtml1-strict.dtd
PUBLIC "-//W3C//ENTITIES Latin 1 for XHTML//EN" xhtml-lat1.ent
PUBLIC "-//W3C//ENTITIES Symbols for XHTML//EN" xhtml-symbol.ent
PUBLIC "-//W3C//ENTITIES Special for XHTML//EN" xhtml-special.ent

I would think it should try to read "DTD/xhtml-lat1.ent" instead of
"DTD/DTD/xhtml-lat1.ent" and so on...
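
To make the path arithmetic concrete, here is a small sketch in current
Python.  It does not use xmlproc, and the resolve() helper is
hypothetical.  The second half shows one way a doubled directory could
arise (a path that was already resolved gets resolved a second time);
that is only a guess at the mechanism, not something verified against
the xmlproc source:

```python
import posixpath


def resolve(base_file, relative_id):
    """Resolve a system identifier against the directory of base_file."""
    return posixpath.normpath(
        posixpath.join(posixpath.dirname(base_file), relative_id))


catalog_path = "DTD/catalog"

# Expected: the catalog entry is resolved once, against the directory
# that holds the catalog file.
expected = resolve(catalog_path, "xhtml-lat1.ent")

# Assumed failure mode: the already-resolved path is resolved again,
# this time against the main DTD file, which also lives in DTD/.
doubled = resolve("DTD/xhtml1-strict.dtd", expected)

print(expected)  # DTD/xhtml-lat1.ent
print(doubled)   # DTD/DTD/xhtml-lat1.ent
```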

My quick hack to get around it was to create a symbolic link like this:

ln -s ./ DTD/DTD
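
The link makes "DTD/DTD/" an alias for "DTD/" itself, so both spellings
of a path reach the same file.  A minimal sketch in current Python that
checks this in a scratch directory (the function name and the
placeholder file content are assumptions of this example; it needs a
platform with symbolic links):

```python
import os
import tempfile


def symlink_workaround_works():
    """Recreate 'ln -s ./ DTD/DTD' in a temporary directory and check it."""
    with tempfile.TemporaryDirectory() as root:
        dtd_dir = os.path.join(root, "DTD")
        os.mkdir(dtd_dir)

        # Placeholder standing in for the real entity file.
        real = os.path.join(dtd_dir, "xhtml-lat1.ent")
        with open(real, "w") as f:
            f.write("<!-- placeholder -->\n")

        # Same as: ln -s ./ DTD/DTD
        os.symlink("./", os.path.join(dtd_dir, "DTD"))

        # The doubled path now names the very same file.
        doubled = os.path.join(dtd_dir, "DTD", "xhtml-lat1.ent")
        return os.path.samefile(doubled, real)
```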

I will update my test case archive and leave it on my site; maybe it
could help someone else get started with python-xml.

 "pienena   /  Paavo "Rainbow Rat" Hartikainen
  minusta  /  E-mail: pahartik at
  tulee   /  URL:
  rotta" /  EFnet: pahartik at #Atari and #LionKing
