"Inventing" XML elements - bug?
Hello Everyone,

This is my first posting to this list, so please excuse me if anything does not meet the standards.

I use lxml.html.fromstring() to parse HTML. The parser tries to do its best to make something reasonable, even when the input is broken. This works fine and the parser does not "invent" elements, i.e. the resulting tree does not contain elements never present in the input. But this is exactly what I observe when the input is of the following kind:

<html><body> .... </body></html

Note the missing '>' at the end of the input! Whether "inventing" elements is a bug in case of invalid input is debatable, but what if the number of elements is nearly doubled?

Please consider the following script which illustrates the effect: It creates a sequence of <img> elements inside the <body> element and, after parsing, checks the number of elements reported by iterlinks():
=======================================================================
import sys
import lxml.html
import lxml.etree

parser= lxml.html.HTMLParser()

failCount= 0
for imageCount in range(1,20):
    # Produce some simple HTML document with some <img> elements
    content= '<html>\n<body>\n%s</body>\n</html>' % (
        '\n'.join(['<img src="verysmall-icon-%d.png" align="right">' % i
                   for i in range(imageCount)])
    )
    # Parse this and assert the number of links found.
    # (this works always)
    html= lxml.html.fromstring(content, parser=parser)
    imagesFound= len([x for x in html.iterlinks()])
    assert(imagesFound == imageCount)

    # Now remove the last '>' of the closing '</html>' tag.
    # After some tries, the parser "reuses" some of its
    # parsed tree fragments and appends them to the tree.
    # These fragments may even come from completely different
    # parsed documents.
    content=content[:-1]
    html= lxml.html.fromstring(content, parser=parser)
    imagesFound= len([x for x in html.iterlinks()])
    if imageCount != imagesFound:
        print 'Input:\n%s\n%s\n%s' % ('-'*40, content, '-'*40)
        print 'FAILURE: found %d img elements when only %d were present' % (imagesFound, imageCount)
        break

versionFmt= "%-25s %s"
print
print versionFmt % ('Python', sys.version_info)
for vers in (
        'LXML_VERSION',
        'LIBXML_VERSION',
        'LIBXML_COMPILED_VERSION',
        'LIBXSLT_VERSION',
        'LIBXSLT_COMPILED_VERSION',
        ):
    print versionFmt % (vers, getattr(lxml.etree, vers))
=======================================================================
On my machine (Ubuntu 12.04) the output is:
=======================================================================
Input:
----------------------------------------
<html>
<body>
<img src="verysmall-icon-0.png" align="right">
<img src="verysmall-icon-1.png" align="right">
<img src="verysmall-icon-2.png" align="right">
<img src="verysmall-icon-3.png" align="right">
<img src="verysmall-icon-4.png" align="right">
<img src="verysmall-icon-5.png" align="right">
<img src="verysmall-icon-6.png" align="right">
<img src="verysmall-icon-7.png" align="right">
<img src="verysmall-icon-8.png" align="right">
<img src="verysmall-icon-9.png" align="right">
<img src="verysmall-icon-10.png" align="right"></body>

[...] ' at the end. The parser produced a tree with a lot of fragments coming from other parsed pages... You can imagine what happens then.

Yours,
Elmar.

-- 
LEO GmbH          | Elmar Bartel                 |
Mühlweg 2b        | Phone: +49 (0)8104-90950141  | No signature here.
D-82054 Sauerlach | Fax:   +49 (0)8104-90950290  |
Germany           | Email: elmar@leo.org         |
Registration court: Amtsgericht München, HRB161107
Managing directors: Hans Riethmayer, Elmar Bartel
Hi,

thanks for the report and the excellent example script. It makes it easy to reproduce the problem.

Elmar Bartel wrote on 19.08.2014 at 12:13:
I use lxml.html.fromstring() to parse html.
The parser tries to do its best to make something reasonable, even when the input is broken. This works fine and the parser does not "invent" elements, i.e. the resulting tree does not contain elements never present in the input. But this is exactly what I observe when the input is of the following kind:
<html><body> .... </body></html
Note the missing '>' at the end of the input! Whether "inventing" elements is a bug in case of invalid input is debatable, but what if the number of elements is nearly doubled?
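To make the two input variants concrete, here is a minimal sketch that uses only the standard library and Python 3 syntax. The helper names `make_document` and `drop_final_bracket` are made up for illustration and are not part of lxml; the sketch only builds the strings and does not parse them.

```python
def make_document(image_count):
    """Build a small, well-formed HTML page with image_count <img> tags."""
    images = "\n".join(
        '<img src="verysmall-icon-%d.png" align="right">' % i
        for i in range(image_count)
    )
    return "<html>\n<body>\n%s</body>\n</html>" % images


def drop_final_bracket(document):
    """Return the document without its last character (the final '>')."""
    return document[:-1]


good = make_document(3)
broken = drop_final_bracket(good)

print(good.endswith("</html>"))    # well-formed closing tag
print(broken.endswith("</html"))   # closing tag with the '>' cut off
print(broken.count("<img"))        # the <img> tags themselves are untouched
```

Both strings contain the same <img> tags; only the very last character differs.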
Please consider the following script which illustrates the effect: It creates inside the <body> element a sequence of <img> elements and checks after parsing the number of elements reported by iterlinks():
=======================================================================
import sys
import lxml.html
import lxml.etree

parser= lxml.html.HTMLParser()

failCount= 0
for imageCount in range(1,20):
    # Produce some simple HTML document with some <img> elements
    content= '<html>\n<body>\n%s</body>\n</html>' % (
        '\n'.join(['<img src="verysmall-icon-%d.png" align="right">' % i
                   for i in range(imageCount)])
    )
    # Parse this and assert the number of links found.
    # (this works always)
    html= lxml.html.fromstring(content, parser=parser)
    imagesFound= len([x for x in html.iterlinks()])
    assert(imagesFound == imageCount)

    # Now remove the last '>' of the closing '</html>' tag.
    # After some tries, the parser "reuses" some of its
    # parsed tree fragments and appends them to the tree.
    # These fragments may even come from completely different
    # parsed documents.
    content=content[:-1]
    html= lxml.html.fromstring(content, parser=parser)
    imagesFound= len([x for x in html.iterlinks()])
    if imageCount != imagesFound:
        print 'Input:\n%s\n%s\n%s' % ('-'*40, content, '-'*40)
        print 'FAILURE: found %d img elements when only %d were present' % (imagesFound, imageCount)
        break

versionFmt= "%-25s %s"
print
print versionFmt % ('Python', sys.version_info)
for vers in (
        'LXML_VERSION',
        'LIBXML_VERSION',
        'LIBXML_COMPILED_VERSION',
        'LIBXSLT_VERSION',
        'LIBXSLT_COMPILED_VERSION',
        ):
    print versionFmt % (vers, getattr(lxml.etree, vers))
=======================================================================
On my machine (Ubuntu 12.04) the output is:
=======================================================================
Input:
----------------------------------------
<html>
<body>
<img src="verysmall-icon-0.png" align="right">
<img src="verysmall-icon-1.png" align="right">
<img src="verysmall-icon-2.png" align="right">
<img src="verysmall-icon-3.png" align="right">
<img src="verysmall-icon-4.png" align="right">
<img src="verysmall-icon-5.png" align="right">
<img src="verysmall-icon-6.png" align="right">
<img src="verysmall-icon-7.png" align="right">
<img src="verysmall-icon-8.png" align="right">
<img src="verysmall-icon-9.png" align="right">
<img src="verysmall-icon-10.png" align="right"></body>

Python                    sys.version_info(major=2, minor=7, micro=3, releaselevel='final', serial=0)
LXML_VERSION              (3, 3, 5, 0)
LIBXML_VERSION            (2, 7, 8)
LIBXML_COMPILED_VERSION   (2, 7, 8)
LIBXSLT_VERSION           (1, 1, 26)
LIBXSLT_COMPILED_VERSION  (1, 1, 26)
=======================================================================
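The input printed above contains eleven <img> tags (icons 0 to 10). As a cross-check that does not depend on lxml, the start tags in such an input can be counted with Python's standard `html.parser` module. This is only a sketch in Python 3 syntax: the names `ImgCounter` and `count_img_tags` are hypothetical, and the counter says nothing about how libxml2 itself recovers from the truncated closing tag.

```python
from html.parser import HTMLParser


class ImgCounter(HTMLParser):
    """Count the <img> start tags seen in the input."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # html.parser lower-cases tag names before calling this hook.
        if tag == "img":
            self.count += 1


def count_img_tags(document):
    """Count <img> tags using a fresh parser, so no state is shared."""
    counter = ImgCounter()
    counter.feed(document)
    counter.close()
    return counter.count


images = "\n".join(
    '<img src="verysmall-icon-%d.png" align="right">' % i for i in range(11)
)
# Same shape as the failing input: the final '>' of '</html>' is missing.
truncated = "<html>\n<body>\n%s</body>\n</html" % images
print(count_img_tags(truncated))
```

Comparing such a reference count with the number of entries returned by iterlinks() is one way to flag a tree that contains more <img> elements than the input did.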
On a different machine (Solaris 10 ;-)
=======================================================================
Input:
----------------------------------------
<html>
<body>
<img src="verysmall-icon-0.png" align="right">
<img src="verysmall-icon-1.png" align="right">
<img src="verysmall-icon-2.png" align="right">
<img src="verysmall-icon-3.png" align="right">
<img src="verysmall-icon-4.png" align="right">
<img src="verysmall-icon-5.png" align="right">
<img src="verysmall-icon-6.png" align="right">
<img src="verysmall-icon-7.png" align="right">
<img src="verysmall-icon-8.png" align="right">
<img src="verysmall-icon-9.png" align="right">
<img src="verysmall-icon-10.png" align="right"></body>