[XML-SIG] saxutils bug (was: value error when parsing XML)

Andrew Clover and-xml at doxdesk.com
Tue Aug 3 20:53:37 CEST 2004


Following up on Ajay's report, I've been looking at problems in the
saxutils function prepare_input_source:

   def prepare_input_source(source, base = ""):
     [...]
     sysid = source.getSystemId()
     if os.path.isfile(sysid):
       basehead = os.path.split(os.path.normpath(base))[0]
       source.setSystemId(os.path.join(basehead, sysid))
       f = open(sysid, "rb")

This allows a systemId to be either a filename or a URI, and tries to 
guess when it's a filename by sniffing to see if a file with the given 
name exists.

However, the filename-sniffing is done *before* the source's systemId is
resolved relative to its baseURI, and the unresolved systemId is then
used to open the file. The baseURI passed in is thus ignored completely,
and any relative systemId is resolved against the current working
directory instead of the enclosing baseURI.

For this reason, a document in a different directory from the CWD may
have trouble using external entities and the external DTD subset. If the
systemId is relative and no such file exists relative to the CWD (even
though one exists relative to the baseURI), the function will assume it
is a URI and attempt to urlopen it, resulting in the ValueError reported
by Ajay.
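
To make the failure mode above concrete, here is a minimal sketch (not
the saxutils code itself) that reproduces only the sniffing step. The
helper name demo and the file names doc.xml and foo.ent are made up for
illustration; the point is just that os.path.isfile on a bare relative
name consults the CWD, not the document's directory.

```python
import os
import tempfile

def demo():
    """Return (sniffed, resolved): whether 'foo.ent' is found by sniffing
    the bare name, versus after joining it to the document's directory."""
    with tempfile.TemporaryDirectory() as docdir, \
         tempfile.TemporaryDirectory() as cwd:
        # The entity lives next to the (hypothetical) document, not in the CWD.
        with open(os.path.join(docdir, "foo.ent"), "w") as f:
            f.write("<!ENTITY x 'y'>")
        base = os.path.join(docdir, "doc.xml")

        old_cwd = os.getcwd()
        os.chdir(cwd)
        try:
            # The test prepare_input_source applies: bare systemId, so the
            # lookup is relative to the CWD and the base is never consulted.
            sniffed = os.path.isfile("foo.ent")
            # Resolving against the base first finds the file.
            resolved = os.path.isfile(
                os.path.join(os.path.dirname(base), "foo.ent"))
        finally:
            os.chdir(old_cwd)
    return sniffed, resolved
```

When the sniff fails like this, the bare name falls through to the
URI branch, which is where the urlopen error comes from.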

This is the case when a filename is passed in to prepare_input_source 
(and hence, to the original parse() call), but it's also the case for 
file streams due to this line earlier in the function:

   if hasattr(f, "name"):
     source.setSystemId(f.name)

f.name is the filename the stream was opened with, which can also be 
relative. I believe it would be more appropriate to abspath the filename 
(not normpath as, I believe erroneously, used above) and convert it to 
an unambiguous file: URI.
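
As a rough sketch of what that conversion might look like (written
against the modern standard library rather than the saxutils code, and
with the helper names filename_to_uri and stream_system_id invented for
this example):

```python
import os
from pathlib import Path

def filename_to_uri(filename):
    """Turn a possibly relative filename into an absolute file: URI."""
    # abspath pins a relative name to the CWD once, at conversion time;
    # normpath alone would leave it relative and therefore ambiguous.
    return Path(os.path.abspath(filename)).as_uri()

def stream_system_id(stream):
    """Derive a systemId from a stream's name, if it has a usable one."""
    name = getattr(stream, "name", None)
    # Streams opened from a file descriptor carry an int name; skip those.
    if isinstance(name, str):
        return filename_to_uri(name)
    return None
```

The resulting systemId no longer depends on what the CWD happens to be
when an entity is later resolved against it.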

However, I believe the approach of detecting the difference between URI
and filename by file-sniffing on every entity access to be broken in
general. For example, a document at http://www.example.com/xml/foo.xml
that referenced the system ID 'foo.ent' would get the wrong external
entity if there just happened to be a 'foo.ent' in the current working
directory.

I would prefer to keep all InputSource systemIds as URIs; even when a 
filename was originally passed in it should be converted to a URI. 
Otherwise we cannot reliably deal with relative systemIds.
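
If every systemId is a URI, resolving a relative one becomes ordinary
URI resolution with no filesystem check involved. A minimal sketch,
using the standard urljoin and a hypothetical wrapper name
resolve_system_id (the file: paths below are made up):

```python
from urllib.parse import urljoin

def resolve_system_id(base_uri, system_id):
    """Resolve system_id against base_uri; absolute URIs pass through."""
    return urljoin(base_uri, system_id)

# The HTTP example from above: the entity is resolved next to the
# document, whatever files exist in the CWD.
http_ent = resolve_system_id("http://www.example.com/xml/foo.xml", "foo.ent")

# A local document behaves the same way once its name is a file: URI.
file_ent = resolve_system_id("file:///home/user/docs/doc.xml", "foo.ent")
```

Here http_ent is http://www.example.com/xml/foo.ent and file_ent is
file:///home/user/docs/foo.ent.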

However, as I have not played much with SAX, I'm hesitant to drop
patches on SourceForge just yet. Discussion of any potential problems
with this approach, and any better ways of detecting the difference
between a filename and a URI, would be appreciated.

cheers,

-- 
Andrew Clover
mailto:and at doxdesk.com
http://www.doxdesk.com/

