How to Convert IO Stream to XML Document

naugiedoggie michael.a.powe at gmail.com
Sat Sep 11 17:03:50 EDT 2010


On Sep 10, 12:20 pm, jakecjacobson <jakecjacob... at gmail.com> wrote:
> I am trying to build a Python script that reads a Sitemap file and
> pushes the URLs to a Google Search Appliance.  I am able to fetch the
> XML document and parse it with regular expressions but I want to move
> to using native XML tools to do this.  The problem I am getting is if
> I use urllib.urlopen(url) I can convert the IO Stream to an XML
> document but if I use urllib2.urlopen and then read the response, I
> get the content but when I use minidom.parse() I get an "IOError:
> [Errno 2] No such file or directory:" error

Hello,

This may not be helpful, but I note that you are making two different
kinds of request, and judging from the documentation, the objects
returned by the urllib and urllib2 openers do not appear to be the
same.  I don't know why you are calling urllib.urlopen(url) in one
case and urllib2.urlopen(request) in the other, but I can tell you
that I have used the urllib2 opener to retrieve a web services
document in XML and then parsed it with minidom.parse().
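
For what it's worth, here is a minimal sketch of the difference between
handing minidom the response object and handing it the content.  It is
written for Python 3, where urllib2.urlopen became
urllib.request.urlopen, and it uses io.BytesIO as a stand-in for the
response so that it runs without a network.  The sample sitemap and the
helper names are made up for illustration.  minidom.parse() takes a
filename or a file-like object, so a string of XML passed to it is
treated as a path, which would be consistent with the "No such file or
directory" error; minidom.parseString() is the call for content that
has already been read.

```python
import io
from xml.dom import minidom

# Hypothetical sitemap content, standing in for what the server returns.
SITEMAP = b"<urlset><url><loc>http://example.com/</loc></url></urlset>"


def parse_from_stream(stream):
    """Parse XML from a file-like object (anything with a read() method)."""
    return minidom.parse(stream)


def parse_from_text(text):
    """Parse XML that has already been read into a string or bytes."""
    return minidom.parseString(text)


def first_loc(doc):
    """Return the text of the first <loc> element in the document."""
    return doc.getElementsByTagName("loc")[0].firstChild.data


# io.BytesIO stands in for the response object returned by urlopen().
response = io.BytesIO(SITEMAP)
loc_from_stream = first_loc(parse_from_stream(response))

# Content that was already read() goes through parseString() instead.
loc_from_text = first_loc(parse_from_text(SITEMAP))

# Handing the content itself to parse() makes it look for a file with
# that name, which fails with an OSError (IOError in Python 2).
try:
    minidom.parse(SITEMAP.decode("utf-8"))
    parse_error = None
except OSError as exc:
    parse_error = exc
```

So either pass the response straight to minidom.parse(), or call
read() and pass the result to minidom.parseString().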


>
> # THIS WORKS but will have issues if the IO Stream is a compressed file
> def GetPageGuts(net, url):
>         pageguts = urllib.urlopen(url)
>         xmldoc = minidom.parse(pageguts)
>         return xmldoc
>
> # THIS DOESN'T WORK, but I don't understand why
> def GetPageGuts(net, url):
>         request=getRequest_obj(net, url)
>         response = urllib2.urlopen(request)
>         response.headers.items()
>         pageguts = response.read()

Did you note the documentation says:

"One caveat: the read() method, if the size argument is omitted or
negative, may not read until the end of the data stream; there is no
good way to determine that the entire stream from a socket has been
read in the general case."

If read() returned before the end of the stream, the parser would be
given an incomplete document, which might be the cause of the parsing
problem.
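
If a short read is the worry, one defensive option is to keep calling
read() with an explicit size until it returns an empty result, which is
how a file-like object signals end of stream.  A rough sketch, again in
Python 3 with io.BytesIO standing in for the response; read_all and the
chunk size are my own choices, not anything from the urllib2
documentation:

```python
import io


def read_all(stream, chunk_size=8192):
    """Read a file-like object to the end, one chunk at a time."""
    chunks = []
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            # An empty result means the stream is exhausted.
            break
        chunks.append(chunk)
    return b"".join(chunks)


# Hypothetical payload, larger than one chunk so the loop runs several times.
payload = b"<urlset>" + b"<url/>" * 5000 + b"</urlset>"
data = read_all(io.BytesIO(payload), chunk_size=1024)
```

The result can then be handed to minidom.parseString().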

Thanks.

mp


