ElementTree XML parsing problem

Benjamin Kaplan benjamin.kaplan at case.edu
Wed Apr 27 14:41:49 EDT 2011


On Wed, Apr 27, 2011 at 2:26 PM, Mike <Mike at invalid.invalid> wrote:
> I'm using ElementTree to parse an XML file, but it stops at the second
> record (id = 002), which contains a non-standard ascii character, ä. Here's
> the XML:
>
> <?xml version="1.0"?>
> <snapshot time="Mon Apr 25 08:47:23 PDT 2011">
> <records>
> <record id="001" education="High School" employment="7 yrs" />
> <record id="002" education="Universität Bremen" employment="3 years" />
> <record id="003" education="River College" employment="5 yrs" />
> </records>
> </snapshot>
>
> The complaint offered up by the parser is
>
> Unexpected error opening simple_fail.xml: not well-formed (invalid token):
> line 5, column 40
>
> and if I change the line to eliminate the ä, everything is wonderful. The
> parser is perfectly happy with this modification:
>
> <record id="002" education="University Bremen" employment="3 yrs" />
>
> I can't find anything in the ElementTree docs about allowing additional text
> characters or coercing strange ascii to Unicode.
>
> Is there a way to coerce the text so it doesn't cause the parser to raise an
> exception?
>

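Have you tried specifying the file encoding? ä is not "strange ascii";
it's outside the ASCII range entirely. Your XML declaration doesn't name
an encoding, so the parser assumes UTF-8. If the file was actually saved
in some other encoding (Latin-1, for example), the byte for ä is not
valid UTF-8 and the parser reports an invalid token.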
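
To illustrate the encoding point, here is a minimal sketch. It assumes the
file was saved as Latin-1 (ISO-8859-1), which the original post does not
confirm; the variable names are made up for the example. The same bytes are
parsed twice, once with no encoding declared and once with the encoding
named in the XML declaration.

```python
import xml.etree.ElementTree as ET

# One record from the post, held as text.
record = u'<records><record id="002" education="Universit\u00e4t Bremen" /></records>'

# Assumption: the file on disk is Latin-1, so the a-umlaut is the single byte 0xE4.
latin1_body = record.encode('latin-1')

# No encoding declared: the parser assumes UTF-8 and rejects the lone 0xE4 byte.
try:
    ET.fromstring(b'<?xml version="1.0"?>' + latin1_body)
    undeclared_ok = True
except ET.ParseError:
    undeclared_ok = False

# Encoding declared: the same bytes now parse cleanly.
root = ET.fromstring(b'<?xml version="1.0" encoding="ISO-8859-1"?>' + latin1_body)
education = root.find('record').attrib['education']
```

If the file turns out to be in a different encoding, the name in the
declaration would need to change to match.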

> Here's my test script (simple_fail contains the offending line, and
> simple_pass contains the line that passes).
>
> import sys
> import xml.etree.ElementTree as ET
>
> def main():
>
>    xml_files = ['simple_fail.xml', 'simple_pass.xml']
>    for xml_file in xml_files:
>
>        print
>        print 'XML file: %s' % (xml_file)
>
>        try:
>            tree = ET.parse(xml_file)
>        except Exception, inst:
>            print "Unexpected error opening %s: %s" % (xml_file, inst)
>            continue
>
>        root = tree.getroot()
>        records = root.find('records')
>        for record in records:
>            print record.attrib['id'], record.attrib['education']
>
> if __name__ == "__main__":
>        main()
>
>
> Thanks,
>
> -- Mike --
>
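
If editing the XML file is not an option, the encoding can probably be
supplied from the script instead. The sketch below is one possible approach,
not something tested against the original file: it assumes the data is
Latin-1, recreates a small stand-in for simple_fail.xml in a temporary
directory, and passes an XMLParser with an explicit encoding to ET.parse.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Stand-in for simple_fail.xml. Assumption: Latin-1 bytes, no encoding declared.
xml_text = (u'<?xml version="1.0"?>\n'
            u'<snapshot><records>\n'
            u'<record id="002" education="Universit\u00e4t Bremen" employment="3 years" />\n'
            u'</records></snapshot>\n')
path = os.path.join(tempfile.mkdtemp(), 'simple_fail.xml')
with open(path, 'wb') as f:
    f.write(xml_text.encode('latin-1'))

# Tell the parser which encoding to use instead of letting it assume UTF-8.
parser = ET.XMLParser(encoding='iso-8859-1')
tree = ET.parse(path, parser=parser)

# Same traversal as the test script in the post.
records = tree.getroot().find('records')
educations = [record.attrib['education'] for record in records]
```

Declaring the encoding in the file itself is usually the more robust fix,
since every consumer of the file then sees it; the parser override only
helps this one script.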


