XML: Doctype http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd

Ralf Schmitt ralf at brainbot.com
Tue Jun 15 10:16:38 EDT 2004


"Thomas Guettler" <guettli at thomas-guettler.de> writes:

> Hi,
>
> I want to parse XHTML.
>
> The doctype is http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd.
>
> When I try to parse it with SAX, the parser
> tries to connect via httplib to www.w3.org.
>
> I downloaded all necessary DTDs and changed the
> link to "xhtml1-transitional.dtd", which is now read
> from the local filesystem.
>
> One thing I don't like: I need to change the XML file
> by hand (remove http://www.w3.org....). Is there
> a way to tell the parser that it should look in
> the local filesystem before trying to download them?

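Here's some example code I posted a few days ago, which does exactly
what you want: it installs an entity resolver that opens each DTD from
the current directory instead of fetching it over the network.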

--------------------------------
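from xml.sax import handler, make_parser


class Handler(handler.ContentHandler):
    # Also serves as the EntityResolver (see setEntityResolver below).
    def resolveEntity(self, publicid, systemid):
        print "RESOLVE:", publicid, systemid
        # Strip everything up to the last '/' and open the file of that
        # name from the current directory instead of fetching the URL.
        return open(systemid[systemid.rfind('/')+1:], "rb")

    def characters(self, s):
        print repr(s)
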
doc = r'''<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<HTML>
&nbsp;&auml;
</HTML>
'''

h = Handler()
parser = make_parser()
parser.setContentHandler(h)
parser.setEntityResolver(h)

parser.feed(doc)
parser.close()
-------
Output:

RESOLVE: -//W3C//DTD XHTML 1.0 Transitional//EN http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd
RESOLVE: -//W3C//ENTITIES Latin 1 for XHTML//EN xhtml-lat1.ent
RESOLVE: -//W3C//ENTITIES Symbols for XHTML//EN xhtml-symbol.ent
RESOLVE: -//W3C//ENTITIES Special for XHTML//EN xhtml-special.ent
u'\n'
u'\xa0'
u'\xe4'
u'\n'
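
The code above is Python 2. As a rough sketch of the same idea for
Python 3: since Python 3.7.1 the SAX parser no longer processes external
general entities by default, so the feature has to be switched on before
the resolver is consulted. The names LocalResolver, TextCollector and
LOCAL_DTDS are made up for this illustration, and the one-line in-memory
DTD is only a stand-in for the downloaded files; a real setup would open
the local copies instead.

```python
import io
from xml.sax import handler, make_parser, xmlreader

# Hypothetical stand-in for the downloaded DTD files: a one-line DTD that
# declares only &nbsp;. A real setup would open the local copies instead.
LOCAL_DTDS = {
    "xhtml1-transitional.dtd": b'<!ENTITY nbsp "&#160;">',
}


class LocalResolver(handler.EntityResolver):
    """Serve external entities from LOCAL_DTDS instead of the network."""

    def __init__(self):
        self.requested = []  # system IDs the parser asked for

    def resolveEntity(self, publicId, systemId):
        self.requested.append(systemId)
        name = systemId.rsplit("/", 1)[-1]  # keep only the file name
        source = xmlreader.InputSource(systemId)
        source.setByteStream(io.BytesIO(LOCAL_DTDS[name]))
        return source


class TextCollector(handler.ContentHandler):
    """Collect character data so it can be inspected after parsing."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        self.chunks.append(content)


doc = b"""<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html>&nbsp;</html>
"""

resolver = LocalResolver()
collector = TextCollector()
parser = make_parser()
# External entities are off by default in current Python 3 releases.
parser.setFeature(handler.feature_external_ges, True)
parser.setContentHandler(collector)
parser.setEntityResolver(resolver)
parser.feed(doc)
parser.close()

text = "".join(collector.chunks)
print(resolver.requested)
print(repr(text))
```

Returning an InputSource with a byte stream already attached keeps the
parser from opening the system ID as a URL, which is the point of the
exercise. Only enable the feature for documents you trust, since it was
turned off by default for security reasons.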


>
> Regards,
>  Thomas

-- 
brainbot technologies ag
boppstrasse 64 . 55118 mainz . germany
fon +49 6131 211639-1 . fax +49 6131 211639-2
http://brainbot.com/  mailto:ralf at brainbot.com


