
Hello, lxml community,

I get string tails cut off in certain cases when I parse XML data.

How to reproduce:

Clone sample xml and python executable from here:
$ git clone git://github.com/Motiejus/lxml_report.git
and start it:
$ ./debug.py demo.xml

Actual output:
Got data len: 752
Got data len: 752
Got data len: 2169
Got data len: 375

Expected output:
Got data len: 752
Got data len: 752
Got data len: 2544
Got data len: 752

Is this a bug?

Python and (LXML) versions

I have tried on two amd64 installations:

Debian Stable:
python: 2.6.6-3+squeeze6 and 3.1.3-12
lxml: 2.2.8-2
libxslt1.1: 1.1.26-6
libxml2: 2.7.8.dfsg-2+squeeze1

Debian Sid:
python: 2.7.2-9 and 3.2.2-1
lxml: 2.3.2-1
libxslt1.1: 1.1.26-8
libxml2: 2.7.8.dfsg-6

This also happens with xml.etree.cElementTree, xml.etree.ElementTree and xml.sax. However, data sets differ for every backend, and it was easiest to create (a small) one for lxml.

Thank you for help.

Motiejus Jakštys

On Thu, Jan 19, 2012 at 11:59:52PM +0000, Motiejus Jakštys wrote:
> Hello, lxml community,
> I get string tails cut off in certain cases when I parse XML data.
> How to reproduce:
> Clone sample xml and python executable from here:
> $ git clone git://github.com/Motiejus/lxml_report.git
> and start it:
> $ ./debug.py demo.xml
> Actual output:
> Got data len: 752
> Got data len: 752
> Got data len: 2169
> Got data len: 375
> Expected output:
> Got data len: 752
> Got data len: 752
> Got data len: 2544
> Got data len: 752
The .xml file contains three elements with the contents you're looking for. The first two are small and result in one call to .start(), one call to .data(), one call to .end(). The third is larger, and results in one call to .start(), two calls to .data(), and one call to .end(). The data is delivered in two chunks. 2169 + 375 == 2544.
> Is this a bug?
In your code? You assume that .data() will contain the entire contents of an element.
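
One way to cope with this is to buffer the chunks in .data() and join them in .end(). Below is a minimal sketch of that idea, not the code from debug.py: the class name TextCollector, the helper collect_texts and the tag name "item" are made up for illustration. It uses the standard library's xml.etree.ElementTree so it runs without extra packages; lxml.etree.XMLParser(target=...) accepts a target with the same start/data/end/close methods.

```python
from xml.etree.ElementTree import XMLParser


class TextCollector:
    """Parser target that joins all text chunks of each matching element.

    Hypothetical example class: 'tag' is the element whose text we want.
    Nested elements with the same tag are not handled.
    """

    def __init__(self, tag):
        self.tag = tag
        self._chunks = None  # None while outside the element of interest
        self.texts = []      # one complete string per matching element

    def start(self, tag, attrib):
        if tag == self.tag:
            self._chunks = []

    def data(self, text):
        # May be called several times per element; only buffer here.
        if self._chunks is not None:
            self._chunks.append(text)

    def end(self, tag):
        # Only now is the element's text known to be complete.
        if tag == self.tag and self._chunks is not None:
            self.texts.append("".join(self._chunks))
            self._chunks = None

    def close(self):
        return self.texts


def collect_texts(xml_pieces, tag):
    """Feed the document piece by piece and return the collected texts."""
    parser = XMLParser(target=TextCollector(tag))
    for piece in xml_pieces:
        parser.feed(piece)
    return parser.close()  # returns whatever the target's close() returns
```

With a target like this, printing len(text) for each entry of the returned list reports one length per element, regardless of how many chunks the parser delivered.
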
> Python and (LXML) versions
> I have tried on two amd64 installations:
> Debian Stable:
> python: 2.6.6-3+squeeze6 and 3.1.3-12
> lxml: 2.2.8-2
> libxslt1.1: 1.1.26-6
> libxml2: 2.7.8.dfsg-2+squeeze1
> Debian Sid:
> python: 2.7.2-9 and 3.2.2-1
> lxml: 2.3.2-1
> libxslt1.1: 1.1.26-8
> libxml2: 2.7.8.dfsg-6
> This also happens with xml.etree.cElementTree, xml.etree.ElementTree and xml.sax. However, data sets differ for every backend, and it was easiest to create (a small) one for lxml.
> Thank you for help.
> Motiejus Jakštys
Marius Gedminas
--
If nothing else helps, read the documentation.

On Fri, Jan 20, 2012 at 04:08:40AM +0200, Marius Gedminas wrote:
> On Thu, Jan 19, 2012 at 11:59:52PM +0000, Motiejus Jakštys wrote:
> > Hello, lxml community,
> > Is this a bug?
>
> In your code? You assume that .data() will contain the entire contents of an element.

Great insight, thanks!

Motiejus