[lxml-dev] very long files with many XML entity refs
I have a sample XML file which contains <text>‡‡ .... </text> with 8,000,000 (eight million) repetitions of '‡'.

A test program for loading it and then writing it is:

    import sys
    #import cElementTree as ET
    from lxml import etree as ET

    f = open(sys.argv[1])          # path to the sample file
    et = ET.ElementTree(file=f)    # parse the whole document
    et.write('ooo')                # serialize with the default encoding

When it is run with cElementTree, it completes successfully in about 1 minute. When it is run with lxml, it does not complete, even after 12 hours!!! The process is constantly at 100% CPU. Further testing showed that it reaches the 'write' statement quite fast and gets stuck there.

Is this a bug, or is lxml just dead slow relative to cElementTree for this action?

Notes:
1) Nothing special about '‡', it is just a simple sample with the same character repeating. The original problem showed up with a long file of various entity refs (some encoding of binary data).
2) With shorter files (thousands of characters), cElementTree and lxml seemed to have similar speed.

TIA
Moshe
Moshe Cohen wrote:
*That* is slow. :)
Well, yes, there is something special about ‡ in that it's not ASCII, but you are encoding to "US-ASCII", which means that libxml2 has to encode all non-ASCII characters as character references. According to timeit, writing the file out as UTF-8 takes 173 milliseconds on my machine:

    et.write("eout.xml", encoding="UTF-8")

The complete I/O cycle runs in about two seconds on my machine (after warm-up), which is a lot faster than one minute :)

However, I do agree that the charref encoding in your example seems to be impressively slow in libxml2, and I have no idea why. It looks like a bug to me. I'll ask on the libxml2 list.

Stefan
participants (2)
- Moshe Cohen
- Stefan Behnel