lxml.etree.XMLSyntaxError: Extra content at the end of the document

Hi devs,

I have a problem that I couldn't solve after a thorough search. My XML parser keeps failing with the same error. My setup, working with files between 200 MB and 4 GB in size:

- Python 2.7
- lxml 2.3.1

As far as I understand, the problem is a mismatch in the XML syntax (start and end tags). The file is too large for me to open and look at the region between *751969:438466*, so I took the naive approach and used sed (sed -n 751969p filename) to print specific lines. Here is the output for line *751969* <http://pastebin.com/nkBxnxZS> and for line *438466* <http://pastebin.com/NurZtPME>. It clearly shows that the element start and end do not match. The trouble is that I can't open such a huge file and edit it by hand.

Note: I wrote a *validator* <http://pastebin.com/C9JnPz85> according to the given schema, and it shows the same problem.

Question
------------

- **How can I avoid this error when parsing in the future?**
- **Or how can I get rid of such an element in the parser, given that it is not mandatory?**

The error goes here:

    C:\Documents and Settings\****\Desktop>python example.py
    (751969, 438466)
    None
    file:///D:/files/average.mzML:751969:438466:FATAL:PARSER:ERR_DOCUMENT_END: Extra content at the end of the document
    Traceback (most recent call last):
      File "MainPaser.py", line 330, in <module>
        main()
      File "MainPaser.py", line 322, in main
        fast_iter(context, process_element)
      File "MainPaser.py", line 24, in fast_iter
        for event, elem in context:
      File "iterparse.pxi", line 478, in lxml.etree.iterparse.__next__ (src/lxml\lxml.etree.c:98432)
      File "iterparse.pxi", line 530, in lxml.etree.iterparse._read_more_events (src/lxml\lxml.etree.c:98953)
      File "parser.pxi", line 590, in lxml.etree._raiseParseError (src/lxml\lxml.etree.c:74696)
    lxml.etree.XMLSyntaxError: Extra content at the end of the document, line 751969, column 438466

Help me!!

I posted the same question on *Stack Overflow* <http://stackoverflow.com/questions/8082456/lxml-etree-xmlsyantaxerror>

On Thu, Nov 10, 2011 at 2:35 PM, Thaman chand <cloudchand@gmail.com> wrote:
As far as I understand, the problem is a mismatch in the XML syntax (start and end tags). The file is too large for me to open and look at the region between 751969:438466, so I took the naive approach and used sed (sed -n 751969p filename) to print specific lines. Here is the output for lines 751969 and 438466. It clearly shows that the element start and end do not match. The trouble is that I can't open such a huge file and edit it by hand.

How about writing a SAX parser to walk through the document and keep track of which elements still haven't been closed, reporting what's still open when you get to the end (or what's left over after the top-level element is closed, if the problem is in the other direction)?

--
Bob Kline
http://www.rksystems.com
mailto:bkline@rksystems.com
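
The SAX suggestion above could be sketched roughly as follows. This is only an illustration under some assumptions: it uses the standard library's xml.sax module rather than lxml, the names OpenElementTracker and report_open_elements are made up for this example, and the tiny in-memory document stands in for the real file (for a multi-gigabyte file you would pass the file name instead).

```python
import io
import xml.sax


class OpenElementTracker(xml.sax.ContentHandler):
    """Keeps a stack of the elements that have been opened but not yet closed."""

    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self.stack = []      # entries are (tag name, line where the tag opened)
        self.locator = None

    def setDocumentLocator(self, locator):
        # The parser hands us a locator that reports the current position.
        self.locator = locator

    def startElement(self, name, attrs):
        line = self.locator.getLineNumber() if self.locator else None
        self.stack.append((name, line))

    def endElement(self, name):
        # Only called for correctly matched end tags.
        self.stack.pop()


def report_open_elements(source):
    """Parse `source` (file name or file object) and return (open_stack, error).

    `error` is None for a well-formed document, otherwise the
    SAXParseException that stopped the parse.
    """
    handler = OpenElementTracker()
    error = None
    try:
        xml.sax.parse(source, handler)
    except xml.sax.SAXParseException as exc:
        error = exc
    return handler.stack, error


if __name__ == "__main__":
    sample = io.BytesIO(b"<a>\n<b>\n<c>\n</b>\n</a>")
    still_open, problem = report_open_elements(sample)
    print("still open:", still_open)
    print("parser stopped with:", problem)
```

If the stack is empty when the error is raised, the top-level element had already been closed, which points to the "other direction" case: something follows the end of the document.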

Thaman chand, 10.11.2011 20:35:
My XML parser keeps failing with the same error. My setup, working with files between 200 MB and 4 GB in size:
- Python 2.7
- lxml 2.3.1
As far as I understand, the problem is a mismatch in the XML syntax (start and end tags). The file is too large for me to open and look at the region between *751969:438466*.

The error you get is in line 751969, not in line 438466. 438466 is the column number: it's a *really* long line, with lots of text-encoded binary content.

You may be running into libxml2's default security limit for large text content (there to prevent things like the "billion laughs" attack). You can disable it with the "huge_tree" parser option.

http://lxml.de/parsing.html#parser-options

Stefan
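
To illustrate how the two numbers in the error should be read, here is a minimal sketch with a deliberately broken four-line document. The helper name locate_syntax_error and the sample bytes are assumptions made for this example; the only lxml pieces used are etree.fromstring, etree.XMLSyntaxError and its position attribute, which holds a (line, column) pair.

```python
from lxml import etree


def locate_syntax_error(xml_bytes):
    """Return the (line, column) of the first syntax error, or None if well-formed."""
    try:
        etree.fromstring(xml_bytes)
    except etree.XMLSyntaxError as exc:
        # position is a (line, column) tuple, not two line numbers.
        return exc.position
    return None


if __name__ == "__main__":
    # A second top-level element after </root> is "extra content".
    broken = b"<root>\n  <item/>\n</root>\n<extra/>"
    print(locate_syntax_error(broken))
```

So in a report such as 751969:438466, only the first number is something to pass to `sed -n ...p`; the second is an offset within that line.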

Stefan Behnel <stefan_ml <at> behnel.de> writes:
The error you get is in line 751969, not in line 438466. 438466 is the column number: it's a *really* long line, with lots of text-encoded binary content.
You may be running into libxml2's default security limit for large text content (there to prevent things like the "billion laughs" attack). You can disable it with the "huge_tree" parser option.
http://lxml.de/parsing.html#parser-options
Stefan

I am still haunted by the same error, lxml.etree.XMLSyntaxError: Extra content at the end of the document. I set the libxml2 huge_tree=True parser option, but it does not help.

Validator.py
------------

    from lxml import etree

    hugetree = etree.XMLParser(huge_tree=True)
    schema = etree.XMLSchema(file='mzML1.1.0.xsd')
    try:
        parser = etree.iterparse(open(r'D:\files\example.xml'), schema=schema,huge_tree=hugetree)
        for elementuple in parser:
            print elementuple
    except etree.XMLSyntaxError, e:
        print e.position
        print e.lineno
        print e.error_log
        raise

Error
-----

    file:///D:/files/example.xml:751969:438466:FATAL:PARSER:ERR_DOCUMENT_END: Extra content at the end of the document
    Traceback (most recent call last):
      File "validator.py", line 8, in <module>
        for aTuple in parser:
      File "iterparse.pxi", line 478, in lxml.etree.iterparse.__next__ (src/lxml\lxml.etree.c:98432)
      File "iterparse.pxi", line 530, in lxml.etree.iterparse._read_more_events (src/lxml\lxml.etree.c:98953)
      File "parser.pxi", line 590, in lxml.etree._raiseParseError (src/lxml\lxml.etree.c:74696)
    lxml.etree.XMLSyntaxError: Extra content at the end of the document, line 751969, column 438466

tchand, 11.11.2011 13:39:
I am still haunted by the same error, lxml.etree.XMLSyntaxError: Extra content at the end of the document. I set the libxml2 huge_tree=True parser option, but it does not help.

Validator.py
------------

    from lxml import etree

    hugetree = etree.XMLParser(huge_tree=True)
    schema = etree.XMLSchema(file='mzML1.1.0.xsd')
    try:
        parser = etree.iterparse(open(r'D:\files\example.xml'), schema=schema,huge_tree=hugetree)

That's not quite what I meant...

Try only this:

    from lxml import etree

    itp = etree.iterparse('D:\\files\\example.xml', huge_tree=True, remove_blank_text=True)
    for _, element in itp:
        #print(element.tag)
        element.clear()

I commented out the print() line since you appear to be doing this from Windows. The incredibly slow console there will just slow down the program too much.

Stefan

Stefan Behnel, 11.11.2011 14:27:
That's not quite what I meant...
Try only this:

    from lxml import etree

    itp = etree.iterparse('D:\\files\\example.xml', huge_tree=True, remove_blank_text=True)
    for _, element in itp:
        #print(element.tag)
        element.clear()

I commented out the print() line since you appear to be doing this from Windows. The incredibly slow console there will just slow down the program too much.

BTW, what's in line 751970 of the document, i.e. the line following the problematic line? Could you put just a couple of surrounding lines into pastebin so that we can see the context? And what's the encoding used in the file?

Stefan
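
One possible way to collect that context without opening the whole file in an editor is sketched below. It is only an illustration: show_context and its parameters are names invented for this example, the file is read in binary mode so the unknown encoding does not matter, and the column is treated as a byte offset, which may differ from the parser's column if the line contains multi-byte characters.

```python
def show_context(path, lineno, column, before=1, after=1, width=40):
    """Return [(line_number, excerpt), ...] around a reported error position.

    For the reported line the excerpt is a window of `width` bytes on each
    side of the column; for neighbouring lines it is their first 2 * width
    bytes. The file is streamed line by line, so it is never loaded whole.
    """
    results = []
    with open(path, "rb") as handle:
        for number, line in enumerate(handle, start=1):
            if number < lineno - before:
                continue
            if number > lineno + after:
                break
            if number == lineno:
                start = max(column - 1 - width, 0)
                excerpt = line[start:column - 1 + width]
            else:
                excerpt = line[:2 * width]
            results.append((number, excerpt.rstrip(b"\r\n")))
    return results


if __name__ == "__main__":
    # Path and position taken from the error message earlier in the thread.
    for number, excerpt in show_context(r"D:\files\example.xml", 751969, 438466):
        print(number, repr(excerpt))
```

Printing repr() of the bytes keeps control characters and stray non-ASCII bytes visible, which can also give a hint about the encoding.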

participants (4)
- Bob Kline
- Stefan Behnel
- tchand
- Thaman chand