Question about problem with NCR entities in lxml under PyPy

I have a simple test that fails using lxml 3.4.4 running under PyPy 2.6.1. It succeeds as expected under CPython 2.7.10. The two sample XML blocks are the same except that the failing sample includes a numeric character reference (NCR) for an en space (U+2002). I include the samples and the test at the bottom of this email.

sample2_etree parses and is queried successfully, but the xpath() call on sample_etree never assigns a value to bar. The problem I encounter is that any use of element.xpath(".//text()") on XML that contains an NCR raises a SystemError carrying a StackOverflow object.

Is there any testing of these characters in the lxml test suite, or are there any suggested workarounds?

Thanks.
- Jeff Doran

------- TEST -------

sample="""<name xmlns="http://www.epo.org/exchange" xmlns:ops=" http://ops.epo.org" xmlns:xlink="http://www.w3.org/1999/xlink">NEXT COMPUTER INC [US]</name>
"""

sample2="""<name xmlns="http://www.epo.org/exchange" xmlns:ops=" http://ops.epo.org" xmlns:xlink="http://www.w3.org/1999/xlink">NEXT COMPUTER INC [US]</name>
"""

def simple_xpath_test():
    # "import lxml" alone does not guarantee that the etree submodule is loaded
    import lxml.etree

    sample2_etree = lxml.etree.fromstring(sample2)
    print "____full sample2 %r", sample2
    bar = sample2_etree.xpath('.//text()')
    print "____bar = %r", bar
    assert (bar is not None)

    sample_etree = lxml.etree.fromstring(sample)
    print "____full sample %r", sample
    bar = sample_etree.xpath('.//text()')
    print "____bar = %r", bar
    assert (bar is not None)

======================================================================
ERROR: bias.tests.epo.test_epo_patent.simple_xpath_test
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/jeff/lexmachina/deus_lex/.tox/pypy/site-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/home/jeff/lexmachina/deus_lex/bias/bias/tests/epo/test_epo_patent.py", line 45, in simple_xpath_test
    bar = sample_etree.xpath('.//text()')
  File "lxml.etree.pyx", line 1507, in lxml.etree._Element.xpath (src/lxml/lxml.etree.c:52198)
  File "xpath.pxi", line 307, in lxml.etree.XPathElementEvaluator.__call__ (src/lxml/lxml.etree.c:152124)
SystemError: <StackOverflow object at 0x7fbc2f3167b0>
-------------------- >> begin captured stdout << ---------------------
____full sample2 %r <name xmlns="http://www.epo.org/exchange" xmlns:ops=" http://ops.epo.org" xmlns:xlink="http://www.w3.org/1999/xlink">NEXT COMPUTER INC [US]</name>
____bar = %r ['NEXT COMPUTER INC [US]']
____full sample %r <name xmlns="http://www.epo.org/exchange" xmlns:ops=" http://ops.epo.org" xmlns:xlink="http://www.w3.org/1999/xlink">NEXT COMPUTER INC [US]</name>
--------------------- >> end captured stdout << ----------------------

Jeff Doran wrote on 30.09.2015 at 20:43:
The difference is that lxml returns a subclass of a bytes object in the first case (the plain ASCII sample) and a subclass of a str/unicode object in the second (the sample with the NCR). PyPy doesn't like the latter because the inheritance happens at the C level.

https://bitbucket.org/pypy/pypy/issues/2021/pypy3-pytype_ready-crashes-for-e...

A work-around for users is to pass "smart_strings=False" into the xpath() call to make lxml return a plain unicode string object instead of a subclass.

I committed a work-around to latest master. Could you test it? You'll need Cython 0.23.x for the source build (pip install cython).

https://github.com/lxml/lxml/commit/7f98e37d668abc5231a462ead9c09568780600eb

Stefan
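
For illustration, the smart_strings=False work-around could look roughly like the sketch below. The sample document is a shortened, hypothetical variant of the one in the original post: the position of the en space NCR (&#8194;) and the helper name plain_text_nodes are assumptions made here, not taken from the thread.

```python
from lxml import etree

# Hypothetical sample: where the en space NCR (&#8194;) sits is an assumption.
SAMPLE = (b'<name xmlns="http://www.epo.org/exchange">'
          b'NEXT COMPUTER INC&#8194;[US]</name>')


def plain_text_nodes(xml_bytes):
    """Return all text nodes as plain strings (hypothetical helper)."""
    root = etree.fromstring(xml_bytes)
    # smart_strings=False asks lxml for plain string objects instead of
    # its string subclasses that also provide getparent().
    return root.xpath('.//text()', smart_strings=False)


plain = plain_text_nodes(SAMPLE)
print(repr(plain))
```

The trade-off is that the returned strings no longer offer getparent(), so code that walks from a text result back to its element would need to select the element itself instead.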

Participants (2):
- Jeff Doran
- Stefan Behnel