Mailman 3 "Inventing" XML elements - bug? - lxml - The Python XML Toolkit

19 Aug 2014

Hello everyone,

This is my first posting to this list, so please excuse me if anything
does not meet the standards.

I use lxml.html.fromstring() to parse HTML.

The parser tries to do its best to make something reasonable,
even when the input is broken. This usually works fine and the parser
does not "invent" elements, i.e. the resulting tree does not contain
elements that were never present in the input.
But inventing elements is exactly what I observe when the input is of
the following kind:

	<html><body> .... </body></html

Note the missing '>' at the end of the input!
Whether "inventing" elements is a bug in case of invalid input
is debatable, but what if the number of elements is nearly doubled?

Please consider the following script, which illustrates the effect.
It creates a sequence of <img> elements inside the <body> element
and, after parsing, checks the number of links reported by
iterlinks():

=======================================================================
from __future__ import print_function  # lets the script run on Python 2 and 3

import sys
import lxml.html
import lxml.etree

# One parser instance is shared by all fromstring() calls below.
parser = lxml.html.HTMLParser()

for imageCount in range(1, 20):
    # Produce a simple HTML document with some <img> elements.
    content = '<html>\n<body>\n%s</body>\n</html>' % (
        '\n'.join('<img src="verysmall-icon-%d.png" align="right">' % i
                  for i in range(imageCount))
    )
    # Parse this and assert the number of links found.
    # (This always works.)
    html = lxml.html.fromstring(content, parser=parser)
    imagesFound = len(list(html.iterlinks()))
    assert imagesFound == imageCount

    # Now remove the last '>' of the closing '</html>' tag.
    # After some tries, the parser "reuses" some of its
    # parsed tree fragments and appends them to the tree.
    # These fragments may even come from completely different
    # parsed documents.
    content = content[:-1]
    html = lxml.html.fromstring(content, parser=parser)
    imagesFound = len(list(html.iterlinks()))
    if imageCount != imagesFound:
        print('Input:\n%s\n%s\n%s' % ('-' * 40, content, '-' * 40))
        print('FAILURE: found %d img elements when only %d were present'
              % (imagesFound, imageCount))
        break

# Report the versions of everything involved.
versionFmt = "%-25s %s"
print()
print(versionFmt % ('Python', sys.version_info))
for vers in (
    'LXML_VERSION',
    'LIBXML_VERSION',
    'LIBXML_COMPILED_VERSION',
    'LIBXSLT_VERSION',
    'LIBXSLT_COMPILED_VERSION',
):
    print(versionFmt % (vers, getattr(lxml.etree, vers)))
=======================================================================
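
The script above shares one parser instance across all calls. To see whether
that shared instance plays a role, the same truncated inputs could also be
parsed with a fresh parser each time. The following is only a diagnostic
sketch: the helper names make_document, count_links and compare are
hypothetical, and nothing in this post establishes that a fresh parser
changes the outcome.

```python
def make_document(image_count):
    """Build the same small test page as the script above."""
    images = '\n'.join(
        '<img src="verysmall-icon-%d.png" align="right">' % i
        for i in range(image_count)
    )
    return '<html>\n<body>\n%s</body>\n</html>' % images


def count_links(content, shared_parser=None):
    """Count iterlinks() results, using a fresh parser unless one is given."""
    import lxml.html  # imported lazily; requires lxml to be installed

    parser = shared_parser
    if parser is None:
        parser = lxml.html.HTMLParser()  # assumption: fresh parser per document
    tree = lxml.html.fromstring(content, parser=parser)
    return len(list(tree.iterlinks()))


def compare(max_images=20):
    """Yield (image_count, shared_count, fresh_count) for truncated inputs."""
    import lxml.html

    shared = lxml.html.HTMLParser()
    for image_count in range(1, max_images):
        document = make_document(image_count)
        # Mirror the original script: parse the intact page first ...
        count_links(document, shared_parser=shared)
        # ... then the same page with the final '>' removed.
        truncated = document[:-1]
        yield (image_count,
               count_links(truncated, shared_parser=shared),
               count_links(truncated))
```

Iterating over compare() and printing every row where the three numbers
differ would show whether the extra elements appear only with the shared
parser, only with the fresh one, or with both.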

On my machine (Ubuntu 12.04) the output is:
=======================================================================
Input:
----------------------------------------
<html>
<body>
<img src="verysmall-icon-0.png" align="right">
<img src="verysmall-icon-1.png" align="right">
<img src="verysmall-icon-2.png" align="right">
<img src="verysmall-icon-3.png" align="right">
<img src="verysmall-icon-4.png" align="right">
<img src="verysmall-icon-5.png" align="right">
<img src="verysmall-icon-6.png" align="right">
<img src="verysmall-icon-7.png" align="right">
<img src="verysmall-icon-8.png" align="right">
<img src="verysmall-icon-9.png" align="right">
<img src="verysmall-icon-10.png" align="right"></body>
</html
----------------------------------------
[...]

[...] '>' at the end. The parser produced
a tree with a lot of fragments coming from other parsed pages...
You can imagine what happens then.
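
Until the cause is understood, a cheap sanity check might be to count the
<img> start tags independently and compare that number with what lxml
reports. The sketch below uses html.parser from the Python 3 standard
library instead of lxml; the names ImgCounter and count_img_tags are
hypothetical, and this is a cross-check under those assumptions, not a fix.

```python
from html.parser import HTMLParser  # Python 3 standard library, not lxml


class ImgCounter(HTMLParser):
    """Counts <img> start tags without building a tree."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # html.parser passes tag names in lower case.
        if tag == 'img':
            self.count += 1


def count_img_tags(content):
    """Return the number of <img> start tags found in content."""
    counter = ImgCounter()
    counter.feed(content)
    counter.close()  # flush whatever is still buffered, e.g. a cut-off tag
    return counter.count
```

For the test pages above, where the <img> elements are the only ones
carrying links, count_img_tags(content) should equal
len(list(html.iterlinks())); a larger lxml count would point to invented
elements. On real pages iterlinks() also reports other links, so the
comparison would have to be restricted to <img> elements.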

Yours,
Elmar.
-- 
LEO GmbH          | Elmar Bartel                 | 
Mühlweg 2b        | Phone: +49 (0)8104-90950141  | No signature here.
D-82054 Sauerlach | Fax:   +49 (0)8104-90950290  |
Germany           | Email: elmar@leo.org         |

Register Gericht: Amtsgericht München, HRB161107
Geschäftsführer:  Hans Riethmayer, Elmar Bartel
