Jens Tröger wrote on 13.10.2017 at 23:39:
I just ran into the following error:
big-file.xml:3291: parser error : xmlSAX2Characters: huge text node
279ebcd8791394504dc9d4823772baa4bcc942a0871755e1ac3562f0369c69e1e2472dc202cb784a
^
big-file.xml:3291: parser error : Extra content at the end of the document
279ebcd8791394504dc9d4823772baa4bcc942a0871755e1ac3562f0369c69e1e2472dc202cb784a
^

The offending node is one of several like this:
<image id="image-8" class="image">425a68393141592…29c28481da477d780</image>
where the content of the node (i.e. the node.text property) is about 13 MB of text :-)
Is this a limitation of lxml or of one of the underlying XML libraries?
It's a default security restriction in libxml2. Disable it at your own risk.

http://lxml.de/parsing.html#parser-options

See, for example:

https://pypi.python.org/pypi/defusedxml/#attack-vectors

Stefan
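
For reference, a minimal sketch of what disabling the restriction can look like in lxml, assuming the parser option in question is `huge_tree` (listed on the parser-options page linked above). The helper name `parse_with_huge_text` and the generated test document are made up for illustration. Only do this for input you trust, since the limit exists to protect against the kind of attacks described on the defusedxml page.

```python
from lxml import etree


def parse_with_huge_text(xml_bytes):
    """Parse XML that contains very large text nodes (hypothetical helper).

    huge_tree=True lifts libxml2's built-in size restrictions, so this
    should only be used on trusted input.
    """
    parser = etree.XMLParser(
        huge_tree=True,          # disable libxml2's size/depth limits
        resolve_entities=False,  # keep entity expansion off as a precaution
    )
    return etree.fromstring(xml_bytes, parser)


# Build an illustrative document with a single text node of about 11 MB,
# similar in shape to the <image> element from the question.
payload = b"a" * (11 * 1024 * 1024)
doc = b'<root><image id="image-8" class="image">' + payload + b"</image></root>"

root = parse_with_huge_text(doc)
text_length = len(root[0].text)
```

For a file on disk, the same parser object can be passed as the second argument to `etree.parse()`.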