Python 2to3 Regression in c14n2 Serialization?!
Hi,

I'm not sure if I just overlooked something, but it seems that etree.tostring with method="c14n2" does not behave the same way in Python 2 and Python 3. In Python 2 it works as expected; in Python 3 it complains about an undeclared namespace, even though the namespace is declared (see the captured stdout in the tests).

I put together a simple TestCase class (https://pastebin.com/raw/fgMjy0Ax) that shows the different behaviour when run against the latest released lxml version, 4.6.4, under Python 2 and Python 3:

c:\python27\python.exe -m nose ./py3_test.py
....
----------------------------------------------------------------------
Ran 4 tests in 0.014s

OK

c:\python39\python.exe -m nose ./py3_test.py
EEEE
======================================================================
ERROR: test_python3_problem_bytesio_iterparse (py3_test.LXML_C14N2_RegressionTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "c:\devel\code\cs.requirements\py3_test.py", line 18, in test_python3_problem_bytesio_iterparse
    handle_div_end(event, element)
  File "c:\devel\code\cs.requirements\py3_test.py", line 13, in handle_div_end
    etree.tostring(element, method="c14n2")
  File "src\lxml\etree.pyx", line 3407, in lxml.etree.tostring
  File "src\lxml\serializer.pxi", line 943, in lxml.etree._tree_to_target
  File "src\lxml\serializer.pxi", line 1128, in lxml.etree.C14NWriterTarget.start
  File "src\lxml\serializer.pxi", line 1155, in lxml.etree.C14NWriterTarget._start
  File "src\lxml\serializer.pxi", line 1085, in lxml.etree.C14NWriterTarget._qname
ValueError: Namespace http://www.w3.org/1999/xhtml of name "div" is not declared in scope
-------------------- >> begin captured stdout << ---------------------
<class 'str'> <class 'str'> some_ns_id = http://www.example.com
<class 'str'> <class 'str'> xhtml = http://www.w3.org/1999/xhtml
--------------------- >> end captured stdout << ----------------------
======================================================================
ERROR: test_python3_problem_bytesio_iterparse_global_ns_registration (py3_test.LXML_C14N2_RegressionTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "c:\devel\code\cs.requirements\py3_test.py", line 34, in test_python3_problem_bytesio_iterparse_global_ns_registration
    handle_div_end(event, element)
  File "c:\devel\code\cs.requirements\py3_test.py", line 29, in handle_div_end
    etree.tostring(element, method="c14n2")
  File "src\lxml\etree.pyx", line 3407, in lxml.etree.tostring
  File "src\lxml\serializer.pxi", line 943, in lxml.etree._tree_to_target
  File "src\lxml\serializer.pxi", line 1128, in lxml.etree.C14NWriterTarget.start
  File "src\lxml\serializer.pxi", line 1155, in lxml.etree.C14NWriterTarget._start
  File "src\lxml\serializer.pxi", line 1085, in lxml.etree.C14NWriterTarget._qname
ValueError: Namespace http://www.w3.org/1999/xhtml of name "div" is not declared in scope
-------------------- >> begin captured stdout << ---------------------
<class 'str'> <class 'str'> some_ns_id = http://www.example.com
<class 'str'> <class 'str'> xhtml = http://www.w3.org/1999/xhtml
--------------------- >> end captured stdout << ----------------------
======================================================================
ERROR: test_python3_problem_filebased_iterparse (py3_test.LXML_C14N2_RegressionTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "c:\devel\code\cs.requirements\py3_test.py", line 49, in test_python3_problem_filebased_iterparse
    handle_div_end(event, element)
  File "c:\devel\code\cs.requirements\py3_test.py", line 44, in handle_div_end
    etree.tostring(element, method="c14n2")
  File "src\lxml\etree.pyx", line 3407, in lxml.etree.tostring
  File "src\lxml\serializer.pxi", line 943, in lxml.etree._tree_to_target
  File "src\lxml\serializer.pxi", line 1128, in lxml.etree.C14NWriterTarget.start
  File "src\lxml\serializer.pxi", line 1155, in lxml.etree.C14NWriterTarget._start
  File "src\lxml\serializer.pxi", line 1085, in lxml.etree.C14NWriterTarget._qname
ValueError: Namespace http://www.w3.org/1999/xhtml of name "div" is not declared in scope
-------------------- >> begin captured stdout << ---------------------
<class 'str'> <class 'str'> some_ns_id = http://www.example.com
<class 'str'> <class 'str'> xhtml = http://www.w3.org/1999/xhtml
--------------------- >> end captured stdout << ----------------------
======================================================================
ERROR: test_python3_problem_filebased_parse (py3_test.LXML_C14N2_RegressionTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "c:\devel\code\cs.requirements\py3_test.py", line 62, in test_python3_problem_filebased_parse
    serialize_div_element(div)
  File "c:\devel\code\cs.requirements\py3_test.py", line 58, in serialize_div_element
    etree.tostring(element, method="c14n2")
  File "src\lxml\etree.pyx", line 3407, in lxml.etree.tostring
  File "src\lxml\serializer.pxi", line 943, in lxml.etree._tree_to_target
  File "src\lxml\serializer.pxi", line 1128, in lxml.etree.C14NWriterTarget.start
  File "src\lxml\serializer.pxi", line 1155, in lxml.etree.C14NWriterTarget._start
  File "src\lxml\serializer.pxi", line 1085, in lxml.etree.C14NWriterTarget._qname
ValueError: Namespace http://www.w3.org/1999/xhtml of name "div" is not declared in scope
-------------------- >> begin captured stdout << ---------------------
<class 'str'> <class 'str'> some_ns_id = http://www.example.com
<class 'str'> <class 'str'> xhtml = http://www.w3.org/1999/xhtml
--------------------- >> end captured stdout << ----------------------
----------------------------------------------------------------------
Ran 4 tests in 0.010s

FAILED (errors=4)

Could you give me a hint as to whether this is an actual bug or just wrong usage? If it is a bug, should I create a new entry in your bug tracker, or will you add one directly?

Best regards,
Kai
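For comparison, the Python 3 standard library ships its own C14N 2.0 serializer, which can serve as a rough reference for what a canonicalized document with these two namespaces looks like. The sketch below is not the test case from the pastebin and does not use lxml; the document string is made up for illustration, reusing only the two namespace URIs from the captured stdout above. It assumes Python 3.8 or later, where `xml.etree.ElementTree.canonicalize` is available.

```python
import xml.etree.ElementTree as ET

# Hypothetical input document (an assumption, not the pastebin test data).
# It declares the same two namespaces that appear in the captured stdout.
XML = (
    '<root xmlns:xhtml="http://www.w3.org/1999/xhtml" '
    'xmlns:some_ns_id="http://www.example.com">'
    '<xhtml:div>text</xhtml:div>'
    '</root>'
)

# canonicalize() parses the text and returns the C14N 2.0 form as a str.
canonical = ET.canonicalize(XML)
print(canonical)
```

If lxml and the standard library disagree on the same input, that narrows the problem down to lxml's own tree-to-target code path rather than to the input document.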
Hi,

I dug a little deeper and built lxml myself to experiment with this issue.

I think this is an actual bug, and I also found a way to fix it. With my change, all tests run by "make test" pass, and so does the simple test suite linked in my previous mail, which was written to demonstrate the problem.

So far I have tested only on Linux (20.04 LTS, python2.7.18/python3.8.10), not on Windows or macOS. Could someone verify the patch on those platforms?

Patch against lxml master (v4.7.0a0/tag: lxml-4.7.0-pre - 982f8d5612925010a12a70748a077af846def6be): https://pastebin.com/raw/x0Zmb0Kn

Should I create a bug report in your Launchpad tracker to get this patch merged (if acceptable)?

What do you think about the way it has been fixed? I think the main problem is the byte string vs. unicode string comparison of namespaces/prefixes/URIs. I'm not sure whether there are more places where this needs to be fixed as well.

Greetings,
Kai
Hi,

thanks for investigating this.

Kai Hillmann wrote on 17.11.21 at 00:23:
> I tried to get a little bit deeper and started to build lxml myself to play around with this possible issue.
> I think I found out that this is an actual bug, but also how to fix it. I was able to do it in a way that all your tests of "make test" are green/ok as well as the simple test suite linked below in my previous mail which has been written to demonstrate this problem.
> Currently I tested under linux (20.04 LTS, python2.7.18/python3.8.10) only, not Windows, not MacOS, maybe someone of you could verify the patch on these platforms?
> Patch against LXML Master (v4.7.0a0/tag: lxml-4.7.0-pre - 982f8d5612925010a12a70748a077af846def6be): https://pastebin.com/raw/x0Zmb0Kn
> Should I create a bug report for this within your launchpad tracker to get this patch merged (if acceptable)?

A pull request (or patch) is usually OK. I'm not strict about requiring a ticket for each change. A PR would be better, though, including the tests, since it would get us a free CI run on the changes.

> What do you think about the way it has been fixed? I think the main problems here are the bytestring vs unicode string comparison regarding namespaces/prefixes/uris -- I'm not sure whether there are some more places where it needs to be fixed as well.

Yeah, right. I also don't think it's the right way to solve this, since it looks more like a data cleanliness issue. Meaning: why is there a mix of byte strings and unicode strings in the first place? In Py3, we should always have unicode strings in our hands. There must be some incorrect data conversion happening somewhere.

Stefan
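The byte string vs. unicode string mismatch discussed above can be illustrated without lxml. The sketch below is not lxml's serializer code: `find_prefix` and `to_text` are hypothetical helpers written only to show why a namespace lookup can report "not declared in scope" in Python 3 when one side of the comparison is `bytes` and the other is `str`, and how converting once at the boundary (rather than patching each comparison) avoids it.

```python
def find_prefix(declared, uri):
    """Return the prefix declared for `uri` (hypothetical helper)."""
    for declared_uri, prefix in reversed(declared):
        # In Python 3, bytes and str never compare equal, so a bytes URI
        # stored here will not match a str URI passed in.
        if declared_uri == uri:
            return prefix
    raise ValueError("Namespace %s is not declared in scope" % uri)


def to_text(value, encoding="utf-8"):
    """Decode bytes to str; leave str untouched (hypothetical helper)."""
    return value.decode(encoding) if isinstance(value, bytes) else value


XHTML = "http://www.w3.org/1999/xhtml"

# Assumed scenario: declarations arrive as byte strings,
# while the tag's namespace URI is a unicode string.
declared_mixed = [(b"http://www.w3.org/1999/xhtml", b"xhtml")]

try:
    find_prefix(declared_mixed, XHTML)
    mixed_lookup_failed = False
except ValueError:
    mixed_lookup_failed = True

# Normalizing at the boundary gives the lookup consistent str values.
declared_clean = [(to_text(u), to_text(p)) for u, p in declared_mixed]
clean_prefix = find_prefix(declared_clean, XHTML)

print(mixed_lookup_failed, clean_prefix)
```

In Python 2, an ASCII-only `str` and `unicode` with the same content compare equal, which would be consistent with the tests passing there. Whether this is what actually happens inside lxml is exactly the open question in this thread.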
participants (2)
- Kai Hillmann
- Stefan Behnel