[Tutor] python xml entity problem [xml parsing and DTD handling]

Danny Yoo dyoo@hkn.eecs.berkeley.edu
Tue, 3 Sep 2002 16:21:41 -0700 (PDT)


On Tue, 3 Sep 2002 smith@rfa.org wrote:

> I'm interested in parsing an XML file using the Python tools in Debian
> woody. Everything seems to be OK until I reach a "&MAN;". My Python
> script just passes over it. My guess is that I have an external entity
> resolver problem. I've been reading the Python XML book from O'Reilly and
> I believe I'm doing the right things, at least in terms of non-external
> entities. Does anybody have any examples, or how can I make the program
> recognize an external entity? I'm still very new to Python and XML, so maybe
> it's something I don't understand.

Hi Smith,

Yikes.  The documentation on Python and XML, in my opinion, is just
absolutely hideous; most of the documentation assumes that you want to
write this stuff in Java, which is just, well, lazy!  Even the 'Python &
XML' book barely touches on external references.


(Forgive me for the rant; I'm just so frustrated by the lack of
documentation in this area.  It just seems that the documentation and
examples could be greatly improved.  Perhaps we can do an xml-parsing
thing this week and send our examples over to the PyXML folks...  Hmmmm.)



For XML processing with DTDs to work well, it appears that you need to
give the address of a real DTD; otherwise, I think the system will just
ignore external references outright, which would explain why "&MAN;" gets
passed over!  The line:


<!DOCTYPE schedule SYSTEM "ftp://something.org/pub/xml_files/program.dtd">


should be changed to point to a real URL that your system can reach.
xml.sax uses the 'urllib' module to fetch the DTD from that address, so a
fill-me-in placeholder is not enough: the DTD URL has to be real.
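
If pointing the DOCTYPE at a reachable URL is awkward, another option is to
hand the parser the DTD yourself through an entity resolver.  Here's a
minimal sketch, written for Python 3.  The DTD_TEXT contents, the
"program.dtd" system id, and the class and function names are all made up
for illustration; they are not your real files.  One caveat I'm aware of:
since Python 3.7.1 the SAX parser no longer loads external DTDs or entities
by default, so the sketch switches feature_external_ges back on.

```python
import io
import xml.sax
import xml.sax.handler
from xml.sax.xmlreader import InputSource

# Hypothetical stand-in for the contents of program.dtd.
DTD_TEXT = b'<!ENTITY MAN "Mandarin">\n'

# Hypothetical document whose DOCTYPE points at an external DTD.
DOC = b"""<?xml version="1.0"?>
<!DOCTYPE schedule SYSTEM "program.dtd">
<schedule><service_id>&MAN;</service_id></schedule>"""


class LocalDTDResolver(xml.sax.handler.EntityResolver):
    """Answer every external lookup with the in-memory DTD text."""

    def resolveEntity(self, publicId, systemId):
        source = InputSource(systemId)
        source.setByteStream(io.BytesIO(DTD_TEXT))
        return source


class TextCollector(xml.sax.handler.ContentHandler):
    """Gather all character data so we can inspect it afterwards."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        self.chunks.append(content)


def parse_with_local_dtd(data):
    parser = xml.sax.make_parser()
    # Assumption: Python 3.7.1 or later, where this feature is off by default.
    parser.setFeature(xml.sax.handler.feature_external_ges, True)
    parser.setEntityResolver(LocalDTDResolver())
    collector = TextCollector()
    parser.setContentHandler(collector)
    parser.parse(io.BytesIO(data))
    return "".join(collector.chunks)


if __name__ == '__main__':
    print(parse_with_local_dtd(DOC))
```

The resolver here ignores the system id and always returns the same DTD,
which is fine for a one-DTD experiment; a real one would probably look the
id up in a table of local files.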



Also, there's a bug on line 3 of your DTD:

<!---  --->
<!ELEMENT pgm_block(id?,arch?,air_date?,air_time?, ...
<!---  --->

You need a space between the element name and the open parenthesis
character.  Doh!  *grin*
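
To see the difference the missing space makes, here's a small check written
for Python 3.  The tiny BAD and GOOD documents and the is_well_formed helper
are made up for this example, not taken from your files.  As far as I can
tell, the parser rejects the first declaration as not well-formed and
accepts the second:

```python
import xml.sax

# Hypothetical DTD fragment with no space before the parenthesis.
BAD = b"""<!DOCTYPE schedule [
    <!ELEMENT schedule(pgm_block)>
]>
<schedule/>"""

# The same document with the space added.
GOOD = BAD.replace(b"schedule(", b"schedule (")


def is_well_formed(data):
    """Return True if xml.sax can parse data without a parse error."""
    try:
        xml.sax.parseString(data, xml.sax.ContentHandler())
    except xml.sax.SAXParseException:
        return False
    return True


if __name__ == '__main__':
    print("without space:", is_well_formed(BAD))
    print("with space:   ", is_well_formed(GOOD))
```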



Once we make those fixes, then everything should be ok.  I've written an
example that uses a DTD that's embedded in the xml, so that the system
doesn't have to try looking for it online.


######
import xml.sax

xml_text = \
"""<?xml version='1.0' encoding="UTF-8" standalone="no"?>
<!DOCTYPE schedule [
    <!ELEMENT schedule (pgm_block,segment*)>
    <!ELEMENT pgm_block (id?,arch?,air_date?,air_time?,service_id?,
                         block_time?,sch_status,mc,produce)>
    <!ELEMENT id (#PCDATA)>
    <!ELEMENT arch (#PCDATA)>
    <!ELEMENT air_date (#PCDATA)>
    <!ELEMENT air_time (#PCDATA)>
    <!ELEMENT service_id (#PCDATA)>
    <!ENTITY  BUR "Burmese">
    <!ENTITY  KHM "Cambodian">
    <!ENTITY  CAN "Cantonese">
    <!ENTITY  KOR "Korean">
    <!ENTITY  LAO "Lao">
    <!ENTITY  MAN "Mandarin">
    <!ENTITY  TIB "Tibetan">
    <!ENTITY  UYG "Uyghur">
    <!ENTITY  VIE "Vietnamese">
]>
<?xml-stylesheet
type="text/xsl" href="ftp://something.org/pub/xml_files/program.xsl"?>
<schedule>
<pgm_block>
        <id></id>
        <arch>http://something/MAN/2000/02/test.mp3</arch>
        <air_date>2000/02/02</air_date>
        <air_time>16:00</air_time>
        <service_id>&MAN;</service_id>
        <block_time>00:70:00</block_time>
        <sch_status>archive</sch_status>
        <mc>AW</mc>
        <producer>AW</producer>
        <editor>XZ</editor>
</pgm_block>
</schedule>"""


class MyContentHandler(xml.sax.handler.ContentHandler):
    def __init__(self):
        xml.sax.handler.ContentHandler.__init__(self)
        self.indent = 0

    def startElement(self, name, attrs):
        self.printIndent()
        print("I see the start of", name)
        self.indent += 4

    def endElement(self, name):
        self.indent -= 4
        self.printIndent()
        print("I see the end of", name)

    def characters(self, text):
        # Skip the whitespace-only text that sits between elements.
        if text.strip():
            self.printIndent()
            print("I see characters", repr(text))

    def printIndent(self):
        # Print the current indentation and stay on the same line.
        print(" " * self.indent, end=" ")


if __name__ == '__main__':
    parser = xml.sax.make_parser()
    handler = MyContentHandler()
    parser.setContentHandler(handler)
    parser.feed(xml_text)
    parser.close()
#######
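
One thing to watch for once entities start expanding: a SAX parser is
allowed to deliver an element's text in several characters() calls, so the
text around an expanded entity may arrive in pieces.  A common approach is
to buffer the pieces and join them in endElement.  Here's a rough sketch for
Python 3; the TextByElement class, the collect_text helper, and the sample
document are my own inventions, and it only handles leaf elements:

```python
import xml.sax
import xml.sax.handler

# Hypothetical sample with text on both sides of the entity reference.
SAMPLE = b"""<!DOCTYPE schedule [
    <!ENTITY MAN "Mandarin">
]>
<schedule><service_id>RFA &MAN; service</service_id></schedule>"""


class TextByElement(xml.sax.handler.ContentHandler):
    """Record the joined text of each leaf element, keyed by element name."""

    def __init__(self):
        super().__init__()
        self.buffer = []
        self.text = {}

    def startElement(self, name, attrs):
        # A new element starts, so begin a fresh buffer.
        self.buffer = []

    def characters(self, content):
        # May be called several times for one run of text.
        self.buffer.append(content)

    def endElement(self, name):
        self.text[name] = "".join(self.buffer).strip()
        self.buffer = []


def collect_text(data):
    handler = TextByElement()
    xml.sax.parseString(data, handler)
    return handler.text


if __name__ == '__main__':
    print(collect_text(SAMPLE)["service_id"])
```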



I hope this helps!  If you have more questions, please feel free to ask.