xml.sax feature question

christof hoeke csad7 at yahoo.com
Sat Oct 25 16:54:48 EDT 2003


hi,
this is my first try with sax (and one of my first utilities in python
too), so the code is not the best. i wrote a small utility which
finds all element names used in a bunch of xml files. the reason is
simply to find out which elements are used, as a DTD is only partly
available.

so with an os.path.walk over all xml files in a dir, including subdirs,
a simple sax ContentHandler stores all names in a dictionary (so each
name is kept only once, together with a count of its occurrences).

the problem i have is that if the xml file has a doctype declaration,
the sax parser tries to load the DTD and fails (IOError, of course).
partly this is because the path to the DTD is just a simple name in the
same dir, e.g. <!DOCTYPE contacts SYSTEM "contacts.dtd">, and i guess
the parser does not use the path os.path.walk uses (can i somehow give
the parser this information?). but it could also be a DTD which should
be loaded over a network which is not available at the time.

at the moment these files are not processed at all.
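
one way to keep an unreachable DTD from aborting the parse is to answer
the parser's lookup with an empty stream instead of a file or network
fetch. a minimal sketch in python 3 syntax (unlike the python 2 listing
further down), assuming the default expat-based reader; EmptyDTDResolver,
NameSet and element_names are made-up names for illustration only:

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler, EntityResolver, feature_external_ges
from xml.sax.xmlreader import InputSource


class EmptyDTDResolver(EntityResolver):
    """Hypothetical resolver: answers every external lookup with an empty stream."""

    def resolveEntity(self, publicId, systemId):
        source = InputSource(systemId)
        source.setByteStream(io.BytesIO(b""))  # nothing is read from disk or network
        return source


class NameSet(ContentHandler):
    """Hypothetical handler: remembers each element name once."""

    def __init__(self):
        super().__init__()
        self.names = set()

    def startElement(self, tag, attrs):
        self.names.add(tag)


def element_names(data):
    """Parse XML given as bytes and return the sorted element names."""
    handler = NameSet()
    parser = xml.sax.make_parser()
    parser.setContentHandler(handler)
    parser.setEntityResolver(EmptyDTDResolver())
    # assumption: external lookups are switched on, as in the situation described
    parser.setFeature(feature_external_ges, True)
    parser.parse(io.BytesIO(data))
    return sorted(handler.names)


SAMPLE = b'<!DOCTYPE contacts SYSTEM "contacts.dtd"><contacts><contact/></contacts>'
found = element_names(SAMPLE)
```

the price of this approach is that entity references which rely on
declarations in the skipped DTD would no longer be expanded.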

i guess simply setting a feature of the sax parser so that it does not
try to load any external DTDs should work. the question is: which
feature do i have to disable?
    p = xml.sax.make_parser()
    p.setFeature('http://xml.org/sax/features/validation', False)

i thought turning off validation would stop the parser from loading
external DTDs, but it still tries to load them.
any other suggestions?
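
a minimal sketch of one candidate setting, in python 3 syntax and
assuming the default expat-based reader: the standard sax feature
http://xml.org/sax/features/external-general-entities (available as
xml.sax.handler.feature_external_ges) governs whether the reader fetches
external entities, and with the expat reader that includes the external
DTD subset. NameCollector and collect_names are hypothetical names used
only for illustration. recent python 3 releases already ship with this
feature switched off, so there the explicit call mainly documents the
intent.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler, feature_external_ges, feature_validation


class NameCollector(ContentHandler):
    """Hypothetical handler: counts how often each element name occurs."""

    def __init__(self):
        super().__init__()
        self.names = {}

    def startElement(self, tag, attrs):
        self.names[tag] = self.names.get(tag, 0) + 1


def collect_names(source):
    """Parse a file name or file-like object without fetching external entities."""
    handler = NameCollector()
    parser = xml.sax.make_parser()
    parser.setContentHandler(handler)
    parser.setFeature(feature_validation, False)
    # the candidate feature: do not load external entities (incl. the DTD subset)
    parser.setFeature(feature_external_ges, False)
    parser.parse(source)
    return handler.names


# the DTD named here does not exist; the parse is still expected to succeed
SAMPLE = (b'<!DOCTYPE contacts SYSTEM "contacts.dtd">\n'
          b'<contacts><contact/><contact/></contacts>')
names = collect_names(io.BytesIO(SAMPLE))
```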


sorry for the rather lengthy explanation and code.
thanks a lot!
chris

the complete code for a better understanding of my problem:

import fnmatch, os.path, sys, xml.sax

class ElementList:
     name = {}

     class Names(xml.sax.ContentHandler):
         def startElement(self, tag, attr):
             # count every occurrence of each element name
             ElementList.name[tag] = ElementList.name.get(tag, 0) + 1

     def process(self, file):
         try:
             #xml.sax.parse(file, ElementList.Names())
             p = xml.sax.make_parser()
             p.setContentHandler(ElementList.Names())
             p.setFeature('http://xml.org/sax/features/validation', False)
             p.parse(file)
             print '\t', file
         except (xml.sax.SAXException, IOError), e:
             print '\tNOT PROCESSED', file, e

     def printList(self):
         print
         print '#\t<ELEMENTNAME>'
         print '-\t-------------'
         keys = self.name.keys()
         keys.sort()
         for key in keys:
             print self.name[key], '\t', key

class Lister:
     def __init__(self):
         self.el = ElementList()

     def process(self, dir):
         print
         print 'FILES'
         print '-----'
         def proc(junk, dir, files):
             for file in fnmatch.filter(files, '*.xml'):
                 self.el.process(os.path.join(dir, file))
         os.path.walk(dir, proc, None)

     def printList(self):
         self.el.printList()

#MAIN
if __name__ == '__main__':
     try:
         dir = sys.argv[1]
     except IndexError:
         # no start directory given on the command line
         print "usage: python lister.py startdir"
         sys.exit(1)
     l = Lister()
     l.process(dir)
     l.printList()




