Partly erratic wrong behaviour, Python 3, lxml

Jussi Piitulainen jpiitula at ling.helsinki.fi
Thu Mar 4 05:46:06 EST 2010


Dear group,

I am observing weird semi-erratic behaviour that involves Python 3 and
lxml, is extremely sensitive to changes in the input data, and only
occurs when I name a partial result. I would like some help with this,
please. (Python 3.1.1; GNU/Linux; how do I find the lxml version?)
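
On the version question: as far as I can tell, lxml exposes its own
version and the libxml2 versions as tuples on the etree module. A
minimal sketch follows; the helper name lxml_versions is only for
illustration, and it returns None when lxml cannot be imported.

```python
def lxml_versions():
    """Return lxml and libxml2 version tuples, or None without lxml."""
    try:
        from lxml import etree
    except ImportError:
        return None
    return {
        # Version of the lxml package itself.
        'lxml': etree.LXML_VERSION,
        # libxml2 that lxml was compiled against.
        'libxml2 compiled': etree.LIBXML_COMPILED_VERSION,
        # libxml2 that is actually loaded at run time.
        'libxml2 running': etree.LIBXML_VERSION,
    }

if __name__ == '__main__':
    print(lxml_versions())
```

A mismatch between the compiled and running libxml2 versions may be
worth mentioning in a report like this one.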

The test script regress/Tribug is at the end of this message, with a
snippet to show the form of regress/tridata.py where the XML is.

What I observe is this. Parsing an XML document (wrapped in BytesIO)
with lxml.etree.parse and then extracting certain elements with xpath
sometimes fails so that I get three times the correct number of
elements. On the first run of the script it fails in one way, and on
each subsequent run in another way; the subsequent runs are repeatable.

Moreover, the bug only occurs when I give a name to the result from
lxml.etree.parse! This is seen below in the lines labeled "name1" or
"name2" that sometimes exhibit the bug, and lines labeled "nest1" or
"nest2" that never do. That is, this fails in some complex way:

        result = etree.parse(BytesIO(body))
        n = len(result.xpath(title))

This fails to fail:

        n = len(etree.parse(BytesIO(body)).xpath(title))

I have failed to observe the error interactively. I believe the
erroneous result lists are of the form [x x x y y y z z z] when they
should be [x y z], but I do not know whether the x's are identical
objects or copies. I will know more later, of course, when I have
written more complex tests, unless somebody can lead me to a more
profitable way of debugging this.
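
One way to tell identical objects from copies might be to count how
often each object identity occurs in a result list. The sketch below
is plain Python; the name identity_profile is hypothetical, and I am
assuming that the elements in one xpath result stay alive while the
list is held, so that id() is meaningful.

```python
from collections import Counter

def identity_profile(items):
    """Return the sorted multiplicities of object identities in items.

    The same three objects each repeated three times give [3, 3, 3];
    nine distinct objects (copies) give nine 1's.
    """
    counts = Counter(id(item) for item in items)
    return sorted(counts.values())

if __name__ == '__main__':
    x, y, z = object(), object(), object()
    print(identity_profile([x, x, x, y, y, y, z, z, z]))
    print(identity_profile([object() for _ in range(9)]))
```

In the test script this would be applied to result.xpath(title)
whenever its length is not 5.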

Two versions of the test runs follow, before and after a trivial
change to the test data. Each reported size is the n computed by one
of the above snippets, so every label should show a single line:
size 5 observed 1000 times.

A first run after removing regress/tridata.pyc:

[1202] $ regress/Tribug 
name1: size 5 observed 969 times
name1: size 15 observed 31 times
name2: size 5 observed 1000 times
nest1: size 5 observed 1000 times
nest2: size 5 observed 1000 times

All subsequent runs, with regress/tridata.pyc recreated:

[1203] $ regress/Tribug 
name1: size 5 observed 1000 times
name2: size 5 observed 978 times
name2: size 15 observed 22 times
nest1: size 5 observed 1000 times
nest2: size 5 observed 1000 times

After adding an empty comment <!-- --> to the XML document,
a first run:

[1207] $ regress/Tribug 
name1: size 5 observed 992 times
name1: size 15 observed 8 times
name2: size 5 observed 1000 times
nest1: size 5 observed 1000 times
nest2: size 5 observed 1000 times

And subsequent runs:

[1208] $ regress/Tribug 
name1: size 5 observed 991 times
name1: size 15 observed 9 times
name2: size 5 observed 998 times
name2: size 15 observed 2 times
nest1: size 5 observed 1000 times
nest2: size 5 observed 1000 times

---start of regress/Tribug---
#! /bin/env python3
# -*- mode: Python; -*-

from io import BytesIO
from lxml import etree
from tridata import body, title

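def naming():
    # Bind the parsed tree to a name before calling xpath on it.
    # This is the variant that sometimes reports 15 instead of 5.
    sizes = dict()
    for k in range(0,1000):
        result = etree.parse(BytesIO(body))
        n = len(result.xpath(title))
        sizes[n] = 1 + sizes.get(n, 0)
    return sizes

def nesting():
    # Call xpath directly on the unnamed result of etree.parse.
    # This variant has reported 5 every time so far.
    sizes = dict()
    for k in range(0,1000):
        n = len(etree.parse(BytesIO(body)).xpath(title))
        sizes[n] = 1 + sizes.get(n, 0)
    return sizes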

def report(label, sizes):
    for size, count in sizes.items():
        print('{}: size {} observed {} times'
              .format(label, size, count))

report('name1', naming())
report('name2', naming())
report('nest1', nesting())
report('nest2', nesting())
---end of regress/Tribug---
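
A variant of the tally that also records which iterations gave an
unexpected size might show whether the failures cluster. This is only
a sketch: tally, measure and expected are names made up here, and
wrapping the measurement in a function may itself change the
behaviour, given how sensitive the bug is.

```python
from collections import Counter

def tally(measure, runs=1000, expected=5):
    """Call measure() repeatedly; count sizes and note the odd runs."""
    sizes = Counter()
    odd_runs = []
    for k in range(runs):
        n = measure()
        sizes[n] += 1
        if n != expected:
            # Remember the iteration index of each unexpected size.
            odd_runs.append(k)
    return sizes, odd_runs

if __name__ == '__main__':
    # Stand-in measurement; the real one would parse and run xpath.
    fake = iter([5, 5, 15, 5])
    print(tally(lambda: next(fake), runs=4))
```

For the real test, measure would be a function that performs the
parse, binds the result to a name, and returns len(result.xpath(title)).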

The file regress/tridata.py contains only the two constants. I omit
most of the XML. It would be several screenfuls.

---start of regress/tridata.py---
body = b'''<OAI-PMH xmlns="http://www.opena...
...
</OAI-PMH>
'''

title = '//*[name()="record"]//*[name()="dc:title"]'
---end of regress/tridata.py---


