[XML-SIG] saxutils bug (was: value error when parsing XML)
Andrew Clover
and-xml at doxdesk.com
Tue Aug 3 20:53:37 CEST 2004
Following Ajay's report, I've been looking at problems in the saxutils
function prepare_input_source:

    def prepare_input_source(source, base = ""):
        [...]
        sysid = source.getSystemId()
        if os.path.isfile(sysid):
            basehead = os.path.split(os.path.normpath(base))[0]
            source.setSystemId(os.path.join(basehead, sysid))
            f = open(sysid, "rb")

This allows a systemId to be either a filename or a URI, and tries to
guess when it's a filename by sniffing to see if a file with the given
name exists.
However, the filename sniffing is done *before* the source's systemId is
resolved relative to its baseURI, and the unresolved systemId is then
used to open the file. The baseURI passed in is therefore ignored
completely when opening the file, and any relative systemId is resolved
against the current working directory instead of the enclosing baseURI.
For this reason, a document in a different directory from the CWD may
have trouble using external entities and the external DTD subset. If the
systemId is relative and no such file exists relative to the CWD (even
though one exists relative to the baseURI), the function will assume it
is a URI and attempt to urlopen it, resulting in the ValueError reported
by Ajay.
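
The CWD dependence can be reproduced without SAX at all. The following is a minimal sketch for Python 3, not the saxutils code itself: sniff_is_file is a hypothetical stand-in for the os.path.isfile(sysid) check quoted above, and demo builds two temporary directories to show that the result changes with the CWD while the base argument has no effect.

```python
import os
import tempfile

def sniff_is_file(sysid, base=""):
    # Hypothetical stand-in for the quoted check: base is accepted
    # but never consulted, exactly as in the sniffing step.
    return os.path.isfile(sysid)

def demo():
    old_cwd = os.getcwd()
    with tempfile.TemporaryDirectory() as docdir, \
         tempfile.TemporaryDirectory() as elsewhere:
        # The entity lives next to the document, not in the CWD.
        with open(os.path.join(docdir, "foo.ent"), "w") as f:
            f.write("<!-- entity -->")
        base = os.path.join(docdir, "foo.xml")
        try:
            os.chdir(elsewhere)
            from_elsewhere = sniff_is_file("foo.ent", base)
            os.chdir(docdir)
            from_docdir = sniff_is_file("foo.ent", base)
        finally:
            # Restore the CWD before the temporary directories are removed.
            os.chdir(old_cwd)
    return from_elsewhere, from_docdir
```

With the same sysid and the same base, the check only succeeds when the process happens to be running in the document's own directory.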
This is the case when a filename is passed in to prepare_input_source
(and hence, to the original parse() call), but it's also the case for
file streams, due to these lines earlier in the function:

    if hasattr(f, "name"):
        source.setSystemId(f.name)

f.name is the filename the stream was opened with, which can also be
relative. I believe it would be more appropriate to abspath the filename
(not normpath as, I believe erroneously, used above) and convert it to
an unambiguous file: URI.
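
A minimal sketch of that conversion, assuming Python 3 and its pathlib module (neither is part of the code under discussion); filename_to_file_uri is a hypothetical helper name:

```python
import os
from pathlib import Path

def filename_to_file_uri(filename):
    # abspath anchors a relative name at the current working directory;
    # as_uri() then percent-encodes it into a file: URI.
    return Path(os.path.abspath(filename)).as_uri()
```

The conversion would have to happen while the CWD is still the one the filename was given relative to, since the result depends on it.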
However, I believe the approach of detecting the difference between URI
and filename by file-sniffing on every entity access to be broken in
general. For example, a document at http://www.example.com/xml/foo.xml
that referenced the system ID 'foo.ent' would get the wrong external
entity if there just happened to be a 'foo.ent' in the current working
directory.
I would prefer to keep all InputSource systemIds as URIs; even when a
filename was originally passed in it should be converted to a URI.
Otherwise we cannot reliably deal with relative systemIds.
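
Once every systemId is a URI, resolution needs no filesystem access at all. The sketch below assumes Python 3's urllib.parse.urljoin; resolve_system_id is a hypothetical name, and the example URIs other than the example.com one are made up for illustration.

```python
from urllib.parse import urljoin

def resolve_system_id(sysid, base_uri):
    # Pure URI resolution: nothing is looked up on disk, so a stray
    # file in the CWD cannot change the result.
    return urljoin(base_uri, sysid)

# Relative reference from an http: document stays on the same server.
print(resolve_system_id("foo.ent", "http://www.example.com/xml/foo.xml"))

# Relative reference from a local document stays relative to that document.
print(resolve_system_id("../dtd/foo.dtd", "file:///data/docs/foo.xml"))
```

This leaves open the question of how to tell a filename from a URI at the point where the caller first passes one in; it only shows that no further guessing is needed after that point.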
However, as I have not played much with SAX, I'm hesitant to drop patches
on SourceForge just yet. Discussion of any potential problems with this
approach, and any better ways of detecting the difference between a
filename and a URI, would be appreciated.
cheers,
--
Andrew Clover
mailto:and at doxdesk.com
http://www.doxdesk.com/