[lxml-dev] Fun with unicode errors
Hi,

Maybe you have an idea what could be happening here; otherwise I will (try to) come back with a more complete example.

For now I have this small code excerpt that behaves strangely (isinstance(output_xml, lxml.etree._Element) is True):

    # The two ET.tostring() invocations below, (1) and (2), show the
    # following behaviour:
    #   (1)      "works" (UnicodeDecodeError about el.text after (2))
    #   (1) (2)  "works" (UnicodeDecodeError about el.text after (2))
    #   (2)      does not work, lxml.etree.SerialisationError: IO_ENCODER
    #            about ET.tostring() (2)
    ET.tostring(output_xml)  # (1)

    # Make pretty-printing work by removing unnecessary whitespace:
    for el in output_xml.iter():
        ET.tostring(el)  # (2)
        if len(el) and el.text and not el.text.strip():
            el.text = None
        if el.tail and not el.tail.strip():
            el.tail = None

(1) and (2) are commented out to run them in the different combinations discussed.

If you have a hint about what might be the issue there, I'd be very glad to hear it. I'm using this code on WinXP SP3, Python 2.6.4, lxml 2.2.2 (pre-compiled package).

Expected behaviour would be to run without raising any exception in any of the runs. Something about output_xml changed, as this code snippet used to work.

Thanks,
Felix
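For readers who want to try the whitespace-stripping loop from the message above in isolation, here is a minimal, self-contained sketch. It is an illustration only: it uses the standard library's xml.etree.ElementTree as a stand-in (lxml.etree follows the same element API), and the function name and the sample document are made up, not taken from the thread.

```python
import xml.etree.ElementTree as ET  # stand-in for lxml.etree (assumption)


def strip_ignorable_whitespace(root):
    """Drop whitespace-only .text/.tail so a pretty-printer can re-indent."""
    for el in root.iter():
        # Only clear .text on elements that have children; whitespace in a
        # leaf element may be real data and is left alone.
        if len(el) and el.text and not el.text.strip():
            el.text = None
        if el.tail and not el.tail.strip():
            el.tail = None
    return root


# Hypothetical sample input, not the file discussed in the thread.
SAMPLE = "<root>\n  <a>\n    <b>text</b>\n  </a>\n  <c> </c>\n</root>"

root = strip_ignorable_whitespace(ET.fromstring(SAMPLE))
result = ET.tostring(root, encoding="unicode")
print(result)
```

Note that the loop only reads and assigns .text and .tail, so on a healthy tree it should never raise; that is what makes the exceptions reported above surprising.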
Hi,

The debugging continues. The issue below occurred when I read the file using:

    input_xml = ET.parse(input_filename).getroot()

If I change this to:

    input_xml = ET.XML(file(input_filename, "rb").read())

I get the UnicodeDecodeError in each of the (1)/(2) combinations.

    $ ./01_loop_debug.sh
    /c/python26/python.exe tools/xml merge/xmlmerge.py -i Build/CABxxxB.xml
    Traceback (most recent call last):
      File "tools/xml merge/xmlmerge.py", line 470, in <module>
        sys.exit(main(sys.argv))
      File "tools/xml merge/xmlmerge.py", line 447, in main
        output_xml = postprocess_xml(output_xml)
      File "tools/xml merge/xmlmerge.py", line 173, in postprocess_xml
        if el.tail and not el.tail.strip():
      File "lxml.etree.pyx", line 833, in lxml.etree._Element.tail.__get__ (src/lxml/lxml.etree.c:32942)
      File "apihelpers.pxi", line 620, in lxml.etree._collectText (src/lxml/lxml.etree.c:14919)
      File "apihelpers.pxi", line 1232, in lxml.etree.funicode (src/lxml/lxml.etree.c:19564)
    UnicodeDecodeError: 'utf8' codec can't decode byte 0xe5 in position 2: unexpected end of data

Hope that helps the understanding of the issue a bit.

Git is being a big help right now. If I reduce the input file below a certain size, the problem goes away. Hopefully I can isolate the cause soon (before my employer makes me reimplement the workaround again, which basically means inserting:

    xml = ET.XML(ET.tostring(xml, encoding="utf-8"))

in several places).

- Felix

-----Original Message-----
From: lxml-dev-bounces@codespeak.net [mailto:lxml-dev-bounces@codespeak.net] On behalf of Praktikant3 - SAG
Sent: Thursday, 29 October 2009 09:53
To: lxml-dev@codespeak.net
Subject: [lxml-dev] Fun with unicode errors
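The serialise-and-reparse workaround mentioned in this message can be sketched as follows. This is a hedged illustration rather than Felix's actual code: it uses the standard library's xml.etree.ElementTree in place of lxml.etree, and the helper name and sample document are hypothetical. The idea is simply that parsing the serialised bytes again yields a completely new tree that shares no state with the old one.

```python
import xml.etree.ElementTree as ET  # stand-in for lxml.etree (assumption)


def roundtrip(element):
    """Serialise to UTF-8 bytes and parse again, yielding a fresh tree."""
    return ET.XML(ET.tostring(element, encoding="utf-8"))


# Hypothetical sample document with one non-ASCII character.
original = ET.XML("<root><item>caf\u00e9</item></root>")
fresh = roundtrip(original)

print(fresh.find("item").text)
```

A round trip like this can hide a problem in the original tree, but it does not explain it, which is presumably why it is described above only as a stopgap.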
Forget what I said about the changed exceptions. There must be memory corruption. An unrelated change now makes the code raise the SerialisationError again. I'll keep working on this.

- Felix
Praktikant3 - SAG wrote:

> The debugging continues. The issue below occurred when I read the file using:
>
>     input_xml = ET.parse(input_filename).getroot()
>
> If I change this to:
>
>     input_xml = ET.XML(file(input_filename, "rb").read())
>
> I get the UnicodeDecodeError in each of the (1)/(2) combinations.

Your input file has a bogus encoding declaration and/or encoding errors.

> UnicodeDecodeError: 'utf8' codec can't decode byte 0xe5 in position 2: unexpected end of data

In UTF-8, 0xe5 is the start of a 3-byte sequence. It must be followed by two more bytes.

--
Marcello Perathoner
webmaster@gutenberg.org
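The point about 0xe5 can be reproduced in plain Python without any XML involved. The sketch below is only an illustration with made-up byte strings: it checks the bit pattern of the lead byte, decodes one complete three-byte sequence, and then triggers the same kind of error by cutting the input off right after the lead byte.

```python
# 0xE5 is 0b11100101: the 1110xxxx prefix marks the lead byte of a
# three-byte UTF-8 sequence.
lead = 0xE5
is_three_byte_lead = (lead & 0xF0) == 0xE0

# Lead byte followed by two continuation bytes decodes to one character.
complete = b"\xe5\xa5\xbd".decode("utf-8")

# Hypothetical truncated input: the data ends right after the lead byte.
error = None
try:
    b"ab\xe5".decode("utf-8")
except UnicodeDecodeError as exc:
    error = exc
    print(exc)
```

The "position 2" in the message is the offset of the lead byte within the buffer handed to the decoder, not necessarily an offset within the input file.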
Funny thing is:

    >>> s = file("input.xml", "rb").read()
    >>> '\xe5' in s
    False

and:

    >>> u = s.decode('utf-8')
    >>> len(u) == len(s)
    True
There are no multi-byte sequences at all; Python can decode the file in a straightforward manner.

- Felix

-----Original Message-----
From: Marcello Perathoner [mailto:marcello@perathoner.de]
Sent: Thursday, 29 October 2009 12:42
To: Praktikant3 - SAG
Cc: lxml-dev@codespeak.net
Subject: Re: [lxml-dev] Fun with unicode errors
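The byte-level check Felix describes works because every non-ASCII character occupies at least two bytes in UTF-8, so equal lengths before and after decoding imply pure ASCII. A small sketch of that check, with hypothetical helper names and made-up sample data (written for Python 3, where iterating over bytes yields integers):

```python
def find_non_ascii(raw):
    """Return (offset, byte value) for every byte >= 0x80 in raw."""
    return [(i, b) for i, b in enumerate(raw) if b >= 0x80]


def summarize(raw):
    """Repeat the checks from the message above on a bytes object."""
    text = raw.decode("utf-8")
    return {
        "contains_0xe5": b"\xe5" in raw,
        # Equal lengths mean every character was encoded as a single byte.
        "same_length": len(text) == len(raw),
        "non_ascii": find_non_ascii(raw),
    }


# Hypothetical inputs, not the file from the thread.
ascii_only = summarize(b"<root>abc</root>")
with_cjk = summarize("<root>\u597d</root>".encode("utf-8"))

print(ascii_only)
print(with_cjk)
```

If the file on disk passes this check, the 0xe5 byte in the traceback cannot have come from the file contents as read by Python.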
Hi,

thanks for the report.

Praktikant3 - SAG, 29.10.2009 09:53:
> Maybe you have an idea what could be happening here, otherwise I will (try to) come back with a more complete example. For now I have this small code excerpt that behaves strangely: (isinstance(output_xml, lxml.etree._Element) is True)
>
>     # The two ET.tostring() invocations below, (1) and (2), show the
>     # following behaviour:
>     #   (1)      "works" (UnicodeDecodeError about el.text after (2))
>     #   (1) (2)  "works" (UnicodeDecodeError about el.text after (2))
>     #   (2)      does not work, lxml.etree.SerialisationError: IO_ENCODER
>     #            about ET.tostring() (2)
>     ET.tostring(output_xml) # (1)

Ok, so, do I understand this correctly: a normal serialisation works, right? Only when you start deleting text content does it start failing to serialise?

>     # Make pretty-printing work by removing unnecessary whitespace:
>     for el in output_xml.iter():
>         ET.tostring(el) # (2)
>         if len(el) and el.text and not el.text.strip():
>             el.text = None
>         if el.tail and not el.tail.strip():
>             el.tail = None

I don't have your input file, so I can't test this. Please provide either the input file (private e-mail is ok) or at least the XML snippet that contains the text that makes this fail. To do this, try to remove only the .text or only the .tail attribute and see which one of them produces this problem. Then, print a tag trace on each iteration to see which element fails, and try to remove unrelated XML content until you can reproduce this with a short XML file.

Stefan
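The narrowing-down procedure suggested above (touch only .text or only .tail, and print a tag trace per iteration) could be sketched roughly like this. It is an assumption-laden illustration: the function name and flags are invented, the sample document is made up, and the standard library's xml.etree.ElementTree stands in for lxml.etree so that the example runs anywhere.

```python
import xml.etree.ElementTree as ET  # stand-in for lxml.etree (assumption)


def trace_elements(root, touch_text=True, touch_tail=True):
    """Visit every element, log its tag, and record any element that raises.

    Returns a list of (tag, exception) pairs for the failing elements.
    """
    failures = []
    for el in root.iter():
        print("visiting", el.tag)  # the tag trace
        try:
            if touch_text and len(el) and el.text and not el.text.strip():
                el.text = None
            if touch_tail and el.tail and not el.tail.strip():
                el.tail = None
            ET.tostring(el)
        except Exception as exc:  # broad on purpose: this is a diagnostic
            failures.append((el.tag, exc))
    return failures


# Hypothetical sample input; run once per attribute to compare.
root = ET.fromstring("<root> <a> <b/> </a> </root>")
failures = trace_elements(root, touch_tail=False)
print(failures)
```

Running it once with touch_tail=False and once with touch_text=False shows which of the two attributes is involved, and the last tag printed before a failure points at the element to keep while trimming the input file.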
Participants (3): Marcello Perathoner, Praktikant3 - SAG, Stefan Behnel