[XML-SIG] Error 404 and xml.dom.ext.reader

Mike Olson Mike.Olson@fourthought.com
12 Aug 2002 08:29:19 -0600


On Mon, 2002-08-12 at 07:32, Alexandre wrote:

I think switching to urllib2 is a good idea.

Mike

> Hello, I'm the current Debian maintainer for python-xml and
> python-4suite, and Jérôme has forwarded me your mail.
>
> > From: "J. Imlay" <jimlay@u.washington.edu>
> > To: jerome@debian.org
> > cc: jimlay@u.washington.edu
> > Date: Sat, 27 Jul 2002 00:35:40 -0700 (PDT)
> > Subject: python2.1-xml but with xml.dom.ext.reader.PyExpat?
> >
> > Hello, I know this isn't your department but I can't figure out who the
> > developer for this actually is. It looks like it's 4suite but I don't
> > think it is because I thought PyExpat was done by the PyExpat people who
> > are not 4Suite. If you could forward this to the appropriate party, (and
> > keep me in the cc if you will) I'd appreciate it.
>
> Actually, it's the PyXML code you are using (4DOM, to which xml.dom.ext
> belongs, was donated by the 4Suite team to the PyXML project). I'm
> cc'ing the PyXML mailing list for further discussion.
>
> >
> > from xml.dom.ext.reader import PyExpat
> > reader = PyExpat.Reader()
> > doc = reader.fromUri(uri)
> >
> > If the uri contains a # sign (as URIs with references to an anchor tag
> > do), the # sign should be ignored, no? Instead, if
> > uri="http://purl.org/file#" and you ask for the file, the webserver
> > (depending on how smart it is, apache figures it out, but not all web
> > servers do) will return a 404. And the url handler does not realize it's
> > a 404 and proceeds to choke on the non-xml output. So 2 things.
> >
> > 1. It should (I think, you of course can disagree if you think I am
> > ignorant) pick off the # before making the GET request.
> >
> > 2. If there is an HTTP error returned in the GET request it should return
> > that rather than trying to parse the 404 page as XML and dying with a
> > line 1 column 54 error. (the error baffled more than 1 programmer beyond
> > solvability, it took some haxoring to figure out it was the # at the end
> > of the URL that was bombing it)
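Point 1 above could be sketched roughly as follows. This is only a minimal
sketch, assuming a current Python 3 interpreter, where urldefrag lives in
urllib.parse (in Python 2 the same function is in the urlparse module).
strip_fragment is a hypothetical helper name, not something PyXML provides.

```python
from urllib.parse import urldefrag


def strip_fragment(uri):
    """Return uri without its '#fragment' part (hypothetical helper)."""
    # urldefrag splits the URI into (url, fragment). The fragment is only
    # meaningful to the client, so it is dropped before the GET request.
    url, _fragment = urldefrag(uri)
    return url


if __name__ == "__main__":
    for uri in ("http://purl.org/file#",
                "http://purl.org/file#anchor",
                "http://purl.org/file"):
        print(strip_fragment(uri))
```

A reader could call such a helper on the uri before opening it, so the
server never sees the trailing # at all.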
>
> This is certainly a bug, but after having given a look at the code in
> PyXML, I'd say that it is most likely a bug in the urllib module from
> the python standard library, which doesn't throw an exception when an
> HTTP error is encountered.
>
> >>> from urllib import urlopen
> >>> urlopen('http://purl.org/file#').read()
> '<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML
> 2.0//EN">\n<HTML><HEAD>\n<TITLE>404 Not
> Found</TITLE>\n</HEAD><BODY>\n<H1>Not Found</H1>\nThe requested URL
> /file was not found on this server.<P>\n</BODY></HTML>\n'
> >>> urlopen('http://purl.org/file').read()
> '<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML
> 2.0//EN">\n<HTML><HEAD>\n<TITLE>404 Not
> Found</TITLE>\n</HEAD><BODY>\n<H1>Not Found</H1>\nThe requested URL
> /file was not found on this server.<P>\n</BODY></HTML>\n'
>
> Now, this has been fixed in urllib2:
>
> >>> from urllib2 import urlopen
> >>> urlopen('http://purl.org/file').read()
> Traceback (most recent call last):
>   File "<stdin>", line 1, in ?
> <...>
>   File "/usr/lib/python2.1/urllib2.py", line 425, in http_error_default
>     raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
> urllib2.HTTPError: HTTP Error 404: Not Found
>
> Since support for Python1.5 has been dropped from PyXML, perhaps using
> urllib2 instead of urllib should be considered. I don't know if this
> module is available in Python2.0, though.
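For what it's worth, the behaviour discussed above (report the HTTP error
instead of feeding the error page to the parser) might look something like
this. It is a sketch only, written against a current Python 3, where
urllib2 has become urllib.request and HTTPError lives in urllib.error.
read_or_raise and its opener parameter are hypothetical names used for
illustration, not part of PyXML or 4Suite.

```python
from urllib.error import HTTPError
from urllib.request import urlopen


def read_or_raise(url, opener=urlopen):
    """Return the response body, or raise IOError on an HTTP error status.

    opener is a hypothetical hook (defaulting to urlopen) so the function
    can be exercised without touching the network.
    """
    try:
        response = opener(url)
    except HTTPError as err:
        # err.code holds the numeric status (e.g. 404). Surface it here
        # rather than passing the server's HTML error page to the parser.
        raise IOError("HTTP error %d fetching %s" % (err.code, url))
    try:
        return response.read()
    finally:
        response.close()
```

With something like this in the reader, a missing document would show up
as an I/O error naming the status code, not as an XML parse error at
line 1.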
>
> Any opinion?
>
> Alexandre Fayolle
> --
> LOGILAB, Paris (France).
> http://www.logilab.com   http://www.logilab.fr  http://www.logilab.org
> Narval, the first software agent available as free software (GPL).
>
> _______________________________________________
> XML-SIG maillist  -  XML-SIG@python.org
> http://mail.python.org/mailman/listinfo/xml-sig
--
Mike Olson                                Principal Consultant
mike.olson@fourthought.com                +1 303 583 9900 x 102
Fourthought, Inc.                         http://Fourthought.com
4735 East Walnut St,                      http://4Suite.org
Boulder, CO 80301-2537, USA
XML strategy, XML tools, knowledge management