Parsing SGML document in Python program

Eric Brunel eric.brunel at pragmadev.com
Wed Oct 16 17:15:51 CEST 2002


Ilya Shambat wrote:

> Hello all,
> 
> I need to be able to parse an SGML document in a Python program. I
> need to know the syntax on how to do that. The project involves using
> a DTD, passed as a command line argument, to read all the SGML files
> that exist in a directory. Does anybody know how this is done?

There is an sgmllib module in the standard library, but it's not a full SGML 
parser. SGML has a lot of funky features that are quite hard to parse 
and that were apparently not considered in the sgmllib module. I have never 
used it, but as far as I can see from the docs, it doesn't use a DTD, so it's 
not really an SGML parser (XML seems to live well without a DTD, but doing 
so in SGML may be considered heretical ;-). It may however be usable if 
your documents are really simple.

I once had to do that and I couldn't find a parser directly usable from 
Python. That may have changed (just check the Vaults of Parnassus for one). 
The solution I used at the time was to rely on an external parser that 
produces easy-to-parse output. The one I used was nsgmls, part of James 
Clark's SP project. You can find it at http://www.jclark.com/sp/ ; just try 
it and you'll see that its output is really easy to read back into Python.
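
For illustration, here is a minimal sketch of reading nsgmls output back 
into Python. It is only a sketch under assumptions: nsgmls is taken to be on 
the PATH, each document's DOCTYPE is assumed to let nsgmls locate the DTD, 
and the names unescape, parse_esis and run_nsgmls are made up for this 
example. nsgmls writes one event per line, and the first character gives the 
event type: 'A' for an attribute of the next start tag, '(' and ')' for 
element start and end, '-' for character data. Only those four are handled 
here.

```python
import re
import subprocess


def unescape(data):
    """Decode the \\n (record end) and \\\\ (backslash) escapes in a data line.

    Other escapes (such as octal codes) are left untouched in this sketch.
    """
    return re.sub(r"\\(n|\\)",
                  lambda m: "\n" if m.group(1) == "n" else "\\",
                  data)


def parse_esis(lines):
    """Turn nsgmls output lines into a list of simple event tuples."""
    events = []
    pending_attrs = {}  # attribute lines come before their start tag
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            continue
        code, rest = line[0], line[1:]
        if code == "A":
            # Format: "Aname TYPE value"; IMPLIED attributes have no value.
            parts = rest.split(" ", 2)
            pending_attrs[parts[0]] = parts[2] if len(parts) > 2 else None
        elif code == "(":
            events.append(("start", rest, pending_attrs))
            pending_attrs = {}
        elif code == ")":
            events.append(("end", rest))
        elif code == "-":
            events.append(("data", unescape(rest)))
        # Every other line type is ignored in this sketch.
    return events


def run_nsgmls(path):
    """Run nsgmls on one file and return its parsed events.

    Assumes an 'nsgmls' executable on the PATH (hypothetical setup).
    """
    result = subprocess.run(["nsgmls", path], capture_output=True, text=True)
    return parse_esis(result.stdout.splitlines())
```

Handling a whole directory would then be a loop calling run_nsgmls on each 
file. How the DTD given on the command line is made visible to nsgmls (for 
example through the DOCTYPE declaration or a catalog) depends on the setup 
and is left out here.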

HTH
-- 
- Eric Brunel <eric.brunel at pragmadev.com> -
PragmaDev : Real Time Software Development Tools - http://www.pragmadev.com


