Hi dev,
I have a problem that I couldn't figure out even after a thorough search.
I keep getting the same error in my XML parser. I am working with files of 200 MB to 4 GB in size and have the following configuration:
- Python 2.7
- lxml 2.3.1
As far as I understand, the problem is a mismatch in the XML syntax (start and end tags). The file is too large for me to open and look at the reported position 751969:438466. I tried the naive approach of printing a specific line with sed (sed -n 751969p filename), and the output for 751969 and 438466 clearly shows that the element start and end tags do not match. However, I can't open such a huge file and edit it manually.
Note: I have designed the validator according to the given schema, and it shows the same problem.
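To look at the reported position without opening the whole file in an editor, something like the following might help. It is only a minimal sketch: `show_context` and its `width` parameter are hypothetical names, the file is assumed to be UTF-8, and the line and column are treated as 1-based. The parser may count columns in bytes rather than characters, so the window should be read as approximate.

```python
import io


def show_context(path, line_no, column, width=40, encoding="utf-8"):
    """Return the text around (line_no, column) of a large text file.

    The file is streamed line by line, so only one line is held in
    memory at a time. line_no and column are assumed to be 1-based.
    """
    with io.open(path, "r", encoding=encoding, errors="replace") as handle:
        for current, line in enumerate(handle, start=1):
            if current == line_no:
                start = max(column - 1 - width, 0)
                end = column - 1 + width
                return line[start:end]
    return None  # the file has fewer lines than line_no
```

For the error in this post the call would be `show_context(filename, 751969, 438466)`, which returns up to 80 characters around the reported column.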
Question
------------
- **So how can I avoid such an error while parsing in the future?**
- **Or how can I get rid of such an element in the parser, given that it is not mandatory?**
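One possible workaround is to catch the syntax error inside the streaming loop, so that everything parsed before the bad position is still processed. The sketch below uses the standard library's `xml.etree.ElementTree` instead of lxml so that it is self-contained; `parse_until_error` and `handle_element` are hypothetical names. The same try/except pattern should presumably work with lxml by catching `lxml.etree.XMLSyntaxError`, but that has not been verified against lxml 2.3.1.

```python
import xml.etree.ElementTree as ET


def parse_until_error(path, handle_element):
    """Stream 'end' events and stop cleanly at the first syntax error.

    Returns None if the whole file parsed, otherwise the ParseError,
    whose .position attribute is a (line, column) tuple.
    """
    try:
        for _event, elem in ET.iterparse(path, events=("end",)):
            handle_element(elem)
            elem.clear()  # release the content of the processed element
    except ET.ParseError as error:
        return error
    return None
```

This does not repair the file. It only keeps the elements that were read before the error and reports where parsing stopped, which may be enough when the offending content sits after the end of the document.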
Here is the error output:
C:\Documents and Settings\****\Desktop>python example.py
(751969, 438466)
None
file:///D:/files/average.mzML:751969:438466:FATAL:PARSER:ERR_DOCUMENT_END: Extra content at the end of the document
Traceback (most recent call last):
  File "MainPaser.py", line 330, in <module>
    main()
  File "MainPaser.py", line 322, in main
    fast_iter(context, process_element)
  File "MainPaser.py", line 24, in fast_iter
    for event, elem in context:
  File "iterparse.pxi", line 478, in lxml.etree.iterparse.__next__ (src/lxml\lxml.etree.c:98432)
  File "iterparse.pxi", line 530, in lxml.etree.iterparse._read_more_events (src/lxml\lxml.etree.c:98953)
  File "parser.pxi", line 590, in lxml.etree._raiseParseError (src/lxml\lxml.etree.c:74696)
lxml.etree.XMLSyntaxError: Extra content at the end of the document, line 751969, column 438466
Help me!!
I have also posted the same question on Stack Overflow.