not quite 1252

Fri Apr 28 09:00:51 EDT 2006

Anton Vredegoor wrote:
> Serge Orlov wrote:
>
> > I extracted content.xml from a test file and the header is:
> > <?xml version="1.0" encoding="UTF-8"?>
> >
> > So any xml library should handle it just fine, without you trying to
> > guess the encoding.
>
> Yes, my header also says UTF-8. However, some kind person sent me an
> e-mail stating that since I am getting \x94 and such output when using
> repr (even if str is giving correct output) there could be some problem
> with the XML-file not being completely UTF-8. Or is there some other
> reason I'm getting these \x94 codes? Or maybe this is just as it should
> be and there's no problem at all?

Indeed, just load the file into ElementTree. Extending the example you
posted before:

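import elementtree.ElementTree as ET

data = zin.read(x)         # raw bytes of content.xml from the zip archive
doc = ET.fromstring(data)  # the parser reads the encoding from the XML header
officetag = "{http://openoffice.org/2000/office}"
body = doc.find(".//" + officetag + "body")  # doc is a local name, not self.doc
for fragment in body:      # iterating an element yields its children
   ... process one fragment of document's body ...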
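
For anyone who wants to try this without a real document at hand, here is a minimal self-contained sketch of the same idea. It is an illustration under assumptions rather than the code from the earlier post: it uses the standard-library xml.etree.ElementTree module instead of the standalone elementtree package, the archive is built in memory, and the names SAMPLE_XML, make_sample_archive and body_fragments are made up for this example.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

OFFICE_NS = "{http://openoffice.org/2000/office}"

# Hypothetical stand-in for a real content.xml: UTF-8 bytes with a UTF-8 header
# and a pair of curly quotes (U+201C, U+201D) in the first paragraph.
SAMPLE_XML = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    '<office:document-content xmlns:office="http://openoffice.org/2000/office">'
    '<office:body><p>\u201cquoted\u201d</p><p>plain</p></office:body>'
    '</office:document-content>'
).encode("utf-8")


def make_sample_archive():
    """Build a tiny zip archive in memory that contains only content.xml."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zout:
        zout.writestr("content.xml", SAMPLE_XML)
    return buf.getvalue()


def body_fragments(archive_bytes, member="content.xml"):
    """Return the child elements of the office:body element."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as zin:
        data = zin.read(member)      # raw bytes, no manual decoding
    doc = ET.fromstring(data)        # parser picks up the declared encoding
    body = doc.find(".//" + OFFICE_NS + "body")
    return [] if body is None else list(body)


fragments = body_fragments(make_sample_archive())
texts = [fragment.text for fragment in fragments]
print(texts)
print(ascii(texts[0]))               # escaped view of the same string
```

The point of the sketch is that the bytes go straight from the archive into the parser, so no encoding has to be guessed. The escaped view printed on the last line shows the same text as the plain one, only with non-ASCII characters written as escape codes.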