Parsing XML with ElementTree (unicode problem?)

oren.tsur at gmail.com
Mon Jul 23 10:29:36 EDT 2007


(this question was also posted in the devshed python forum:
http://forums.devshed.com/python-programming-11/parsing-xml-with-elementtree-unicode-problem-461518.html
).
-----------------------------

(it's a bit longish but I hope I give all the information)

1. here is my problem: I'm trying to parse an XML file (saved locally)
using ElementTree.parse but I get the following error:
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 13, column 327
apparently, the problem is caused by the token 'Saunière', because of
the accented character 'è'.

the thing is that I'm sure that python (ElementTree module and parse()
function) can handle this type of encoding since I obtain my xml file
from the web by opening it with:

from elementtree import ElementTree
from urllib import urlopen
query = ('http://ecs.amazonaws.com/onca/xml?'
         'Service=AWSECommerceService&AWSAccessKeyId=189P5TE3VP7N9MN0G302'
         '&Operation=ItemLookup&ItemId=1400079179'
         '&ResponseGroup=Reviews&ReviewPage=166')
root = ElementTree.parse(urlopen(query))

where query is a query to the AWS, and this specific query has the
'Saunière' in the response. (you could simply open the query with a
web browser and see the xml).

I create a local version of the XML file, containing only the tags
that are of interest. my file looks something like this (I replaced
some of the content with 'bla bla' string in order to make it fit
here):
<ReviewBatch>
<Review>
<ID>805</ID> <Rating>3</Rating>
<HelpfulVotes>5</HelpfulVotes> <TotalVotes>6</TotalVotes>
<Date>2004-04-03</Date>
<Summary>Not as good as Angels and Demons</Summary>
<Content>I found that this book was not as good and thrilling as
Angels and Demons. bla bla.</Content>
</Review>

<Review>
<ID>827</ID> <Rating>4</Rating>
<HelpfulVotes>2</HelpfulVotes> <TotalVotes>8</TotalVotes>
<Date>2004-04-01</Date>
<Summary>The Da Vinci Code, a master piece of words</Summary>
<Content>The Da Vinci Code by Dan Brown is a well-written bla bla. The
story starts out in Paris, France with a murder of Jacque Saunière,
the head curator at Le Louvre.bla bla </Content>
</Review>
</ReviewBatch>

BUT, then trying:

fIn = open(file, 'r')
# or even 'import codecs' and opening with
# 'fIn = codecs.open(file, encoding='utf-8')'
tree = ElementTree.parse(fIn)



where file is the saved file, I get the error above
(xml.parsers.expat.ExpatError: not well-formed (invalid token): line
13, column 327). so what's the difference? how come parsing works
in the first case but fails in the second? please advise.
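
A minimal sketch that reproduces this kind of error, under two assumptions that are not confirmed by the post: that the local file was written in Latin-1 (ISO-8859-1) rather than UTF-8, and that it has no XML declaration naming its encoding. It uses xml.etree.ElementTree, the standard-library version of the elementtree package, on Python 3; the helper try_parse is a hypothetical name used only for illustration.

```python
import xml.etree.ElementTree as ET  # stdlib version of the 'elementtree' package

# The same fragment as in the saved file, as a unicode string.
text = u'<Content>Jacque Sauni\u00e8re, the head curator</Content>'

def try_parse(data):
    """Return the parsed root element, or the ParseError raised by expat."""
    try:
        return ET.fromstring(data)
    except ET.ParseError as err:
        return err

# UTF-8 bytes, no XML declaration: UTF-8 is the default, so this parses.
utf8_result = try_parse(text.encode('utf-8'))

# Latin-1 bytes, no XML declaration: the single byte 0xE8 is not valid UTF-8.
latin1_result = try_parse(text.encode('latin-1'))

# The same Latin-1 bytes, with the encoding declared at the top of the document.
declared = b"<?xml version='1.0' encoding='iso-8859-1'?>" + text.encode('latin-1')
declared_result = try_parse(declared)

for name, result in [('utf-8', utf8_result),
                     ('latin-1', latin1_result),
                     ('latin-1 declared', declared_result)]:
    if isinstance(result, ET.ParseError):
        print(name, '-> error:', result)
    else:
        print(name, '-> ok:', repr(result.text))
```

If the assumption holds, either saving the file as UTF-8 or adding a matching encoding declaration should make the local parse behave like the parse of the web response.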

2. there is another problem that might be related: I get a similar
error if the content of the (locally saved) xml has special
characters such as '&', for example in 'angels & demons' (vs. 'angels
and demons'). is it the same problem? same solution?
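
A small sketch of the '&' case, which is a separate issue from encoding: a bare '&' in XML text starts an entity reference, so it has to be written as '&amp;'. The example assumes Python 3 and the standard-library xml.etree.ElementTree; the names raw_title and parses are hypothetical and only serve the illustration.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

raw_title = 'angels & demons'

def parses(xml_text):
    """Return True if xml_text is well-formed, False if expat rejects it."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

# Writing the text into the markup by hand leaves a bare '&' in the document.
bad = '<Summary>' + raw_title + '</Summary>'

# escape() replaces '&', '<' and '>' with the corresponding entities.
good = '<Summary>' + escape(raw_title) + '</Summary>'

# Building the element and serializing it escapes the text automatically.
elem = ET.Element('Summary')
elem.text = raw_title
serialized = ET.tostring(elem)

print(parses(bad), parses(good), serialized)
```

Letting ElementTree write the local file, instead of concatenating strings, avoids having to escape each field by hand.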

thanks!



