[XML-SIG] Using PyExpat.py

Uche Ogbuji uche.ogbuji@fourthought.com
Mon, 19 Feb 2001 15:06:44 -0700


> > > > xml_dom_object = reader.fromUri(filename) #should work for either
> > > > URL or file
> > 
> > > Let's talk about this comment.  Is it really a good idea to build URL
> > > access right into the API here?
> > 
> > I can't find out whether this has been settled. Did you propose to
> > drop the support for URLs in the API, or the one for local files?
> 
> I'd like to drop support for URLs; I don't think the typical computer
> is sufficiently networked to make this work well.

In this case, the typical computer user will have a great deal of trouble
using any XML application in any language.  Almost all of them use URIs as
their basis, and for good reason.  Special support for local files is almost
universally a mere convenience.

Most XML processing specifications mandate that the URI of the XML entity that 
contains an infoset node is used as the basis for further processing.  To me, 
this argues strongly for dropping local files rather than URIs if we must 
choose.  Some XML specs would be very difficult to implement properly if the 
low-level tools became file-system-only readers.
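
As a rough sketch of why local files need not be a special case: a caller
who has a local filename can turn it into a file: URI explicitly and hand
that to a URI-based reader.  This is written for current Python 3, and
local_path_to_uri is a hypothetical helper name, not part of PyXML or the
standard library.

```python
from pathlib import Path


def local_path_to_uri(filename):
    """Turn a local filename into an absolute file: URI (hypothetical helper)."""
    # resolve() makes the path absolute; as_uri() only accepts absolute paths.
    return Path(filename).resolve().as_uri()


# "spam.xml" is an illustrative name; the file does not need to exist.
uri = local_path_to_uri("spam.xml")
print(uri)
```

The conversion is done by the caller, who knows the name is a local file,
so the reader itself never has to guess.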

> > We just had a report where urllib apparently decided to use "c" as the
> > protocol name; I'm not entirely sure what the exact cause was.
> 
> That's the ambiguity between local filenames and URLs.  You have to
> decide whether filenames passed to APIs are in local filename space or
> in URL space, and not try to guess based on what the name looks like.
> On the Mac, all absolute filenames look like foo:bar or
> foo:bar:bletch, so there you have even less to work with.

The Mac people should have spoken to the IETF a decade ago when URLs emerged,
or a bit later when URIs came out.  Again, I suspect that if this is the case,
they suffer much more pain from XML processing in general than is inflicted on
them by PyXML.
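
For what it is worth, the "c" protocol report is easy to reproduce with a
generic URL parser.  The following is only an illustration using the
Python 3 standard library (urllib.parse, not the urllib code discussed in
this thread), and the sample names are made up:

```python
from urllib.parse import urlparse

# Illustrative inputs only: a Windows path, a Mac-style name as described
# above, and an ordinary URL.
windows_scheme = urlparse("C:\\data\\spam.xml").scheme
mac_scheme = urlparse("foo:bar:bletch").scheme
url_scheme = urlparse("http://example.org/spam.xml").scheme

# Everything before the first colon is taken as the scheme whenever it is
# made up of valid scheme characters, so the drive letter becomes a "protocol".
print(windows_scheme, mac_scheme, url_scheme)
```

The parser is behaving as specified; the trouble starts only when a local
filename is handed to it as though it were a URI.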

> > > Case in point: I found this bit in saxutils.py:
> > > 
> > >         if os.path.isfile(sysid):
> > >             basehead = os.path.split(os.path.normpath(base))[0]
> > >             source.setSystemId(os.path.join(basehead, sysid))
> > >             f = open(sysid, "rb")
> > >         else:
> > >             source.setSystemId(urlparse.urljoin(base, sysid))
> > >             f = urllib.urlopen(source.getSystemId())
> > > 
> > > Now I don't know under which circumstances this gets triggered (the
> > > context is obscure)
> > 
> > prepare_input_source is invoked by every parser when processing the
> > argument to .parse(), so the common usage is
> > 
> >   p = make_parser()
> >   p.setContentHandler(something)
> >   p.parse(filename)
> > 
> > Instead of filename, you can have URLs, stream, and InputSource
> > objects (the Java API only supports InputSource here).
> 
> I would suggest to have separate APIs depending on the argument type,
> e.g. p.parseFile(filename), p.parseURL(url),
> p.parseStream(InputSource), p.parseString(text).  (And no, Java
> overloading wouldn't help much here, since three out of four APIs have
> string arguments.)

Sure, one can add a parseFile, but what do you do with

<?xml version='1.0'?>
<!DOCTYPE spam [
  <!ENTITY foo SYSTEM 'foo.bar'>
]>
<spam>&foo;</spam>

URI or file?

Note that this is a trick question, and the "trick" is *exactly* my point.
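
One way to spell out the trick, as a minimal sketch in Python 3: the system
identifier 'foo.bar' is a relative reference, so what it points to depends
on the base URI of the entity that contains it.  The base URIs and the
resolve_system_id name below are assumptions made for illustration.

```python
from urllib.parse import urljoin


def resolve_system_id(base_uri, system_id):
    """Resolve an entity's system identifier against the containing entity's URI."""
    return urljoin(base_uri, system_id)


# The same document text, loaded from two hypothetical locations.
from_web = resolve_system_id("http://example.org/docs/spam.xml", "foo.bar")
from_disk = resolve_system_id("file:///home/user/docs/spam.xml", "foo.bar")

print(from_web)
print(from_disk)
```

A parseFile entry point still has to make this decision for every external
entity, which is why the base of the document matters more than the name
of the API.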

> > > but I'd say it's a bad idea to just try to open a URL when a string
> > > isn't a local file.  Maybe *you* live in a world where the network
> > > is "always on" (and I do too!), but for plenty of folks, it's rather
> > > annoying to find that their modem starts dialing out each time they
> > > make a typo in a filename.
> > 
> > But would the modem actually start dialling? Wouldn't it rather
> > determine that the protocol is "file" and then report that the file is
> > missing? So I think it would either report an unknown url type, or an
> > ENOENT. What kind of typo did you think of?
> 
> Maybe I was thinking of another case (not involving PyXML) that was
> reported to me third hand, where a filename containing a colon on
> Windows (using Cygwin tools) ended up being interpreted as Unix rcp
> filename syntax, and the system was doing a host lookup on the part
> before the colon -- that really does make the modem dial!

Yes, but that does sound like a bug elsewhere.

> > > The application knows this, but the library doesn't.  It's also fine
> > > to have an alternative API that takes a URL instead of a local
> > > filename -- but it's not okay to attempt to overlap the two
> > > namespaces.
> > 
> > The application can always make sure that the right thing is processed
> > by opening it itself, and then passing that to the parser.
> 
> Sure, and if a string is given, it should be assumed to be a local
> filename unless the API name has "URL" in it.

It's not all that easy, as evidenced by my example above.


-- 
Uche Ogbuji                               Principal Consultant
uche.ogbuji@fourthought.com               +1 303 583 9900 x 101
Fourthought, Inc.                         http://Fourthought.com 
4735 East Walnut St, Ste. C, Boulder, CO 80301-2537, USA
Software-engineering, knowledge-management, XML, CORBA, Linux, Python