
tchand, 11.11.2011 13:39:
Stefan Behnel<stefan_ml<at> behnel.de> writes:
Thaman chand, 10.11.2011 20:35:
I have been trying to figure out the cause of an error that keeps recurring in my XML parser. I have the following configuration and am dealing with files of 200 MB to 4 GB in size:
- Python 2.7
- lxml 2.3.1
As I understand it, the problem is a mismatch in the XML syntax (start and end tags). The file is too large for me to look inside at the particular position *751969:438466*.
The error you get is in line 751969, not in line 438466. 438466 is the column number - it's a *really* long line, with lots of text-encoded binary content.
You may be running into libxml2's default security limit for large text content (to prevent things like the "billion laughs" attack). You can disable it with the "huge_tree" parser option.
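Since the file is too large to open in an editor, the line:column pair from the error message can still be inspected directly. The following is only a minimal sketch using the standard library: peek_at is a hypothetical helper name, not part of lxml, and it assumes that the reported column is close to a byte offset within the line, which may not hold exactly for multi-byte characters.

```python
import os
import tempfile


def peek_at(path, line_no, column, context=40):
    """Return the bytes around a 1-based (line_no, column) position.

    Only one line is held in memory at a time, so this also works on
    multi-gigabyte files. Returns None if the file has fewer lines.
    """
    with open(path, "rb") as fh:
        for current, raw in enumerate(fh, start=1):
            if current == line_no:
                # Assumption: the column is treated as a byte offset.
                start = max(column - 1 - context, 0)
                return raw[start:column - 1 + context]
    return None


# Small demo on a throwaway file. A real call would pass the path of the
# large XML file and the line:column pair taken from the error message.
with tempfile.NamedTemporaryFile("wb", suffix=".xml", delete=False) as tmp:
    tmp.write(b"<root>\n<a>0123456789</a>\n</root>\n")
demo_path = tmp.name

snippet = peek_at(demo_path, 2, 4, context=3)
missing = peek_at(demo_path, 99, 1)
os.remove(demo_path)
print(snippet)
```

For the error discussed here, the call would look like peek_at(r'D:\files\example.xml', 751969, 438466), which returns the bytes surrounding the position the parser complained about.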
I am still haunted by the same error, lxml.etree.XMLSyntaxError: Extra content at the end of the document. I set the libxml2 huge_tree=True parser option, but it is not working.
Validator.py
------------
from lxml import etree
hugetree = etree.XMLParser(huge_tree=True)
schema = etree.XMLSchema(file='mzML1.1.0.xsd')
try:
    parser = etree.iterparse(open(r'D:\files\example.xml'), schema=schema,huge_tree=hugetree)
That's not quite what I meant... Try only this:

from lxml import etree
itp = etree.iterparse('D:\\files\\example.xml', huge_tree=True, remove_blank_text=True)
for _, element in itp:
    #print(element.tag)
    element.clear()

I commented out the print() line since you appear to be doing this from Windows. The incredibly slow console there will just slow down the program too much.

Stefan