
Hi there,

I am using this code
https://github.com/xml-director/xmldirector.demo/blob/master/xmldirector/dem...
for benchmarking the XSD parsing speed against some variants of the MODS schema. mods-3-1.xsd and mods-3-2.xsd take about 15 seconds to parse, while the other variants parse in less than 0.3 seconds. How can one explain this huge difference?

Andreas

----
mods-3-1.xsd 0.00209999084473 15.2129580975
--------------------------------------------------------------------------------
mods-3-2.xsd 0.00260806083679 15.2835290432
--------------------------------------------------------------------------------
mods-3-3.xsd 0.00289702415466 0.300955057144
--------------------------------------------------------------------------------
mods-3-4.xsd 0.00385713577271 0.313620090485
--------------------------------------------------------------------------------
mods-3-5.xsd 0.00278782844543 0.278451919556
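(Since the link above is truncated, here is a rough sketch of what such a timing loop presumably does. The loop body and the meaning of the two printed columns are reconstructed guesses, not the original script:)

    import time
    from lxml import etree

    SCHEMAS = ["mods-3-1.xsd", "mods-3-2.xsd", "mods-3-3.xsd",
               "mods-3-4.xsd", "mods-3-5.xsd"]

    while True:                                      # repeat the whole benchmark
        for name in SCHEMAS:
            with open(name, "rb") as f:
                start = time.time()
                schema_doc = etree.XML(f.read())     # first column: document parse time
                parse_time = time.time() - start

            start = time.time()
            schema = etree.XMLSchema(schema_doc)     # second column: schema construction time
            compile_time = time.time() - start

            print(name, parse_time, compile_time)
            print("-" * 80)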

Andreas Jung wrote on 11.04.2015 at 04:01:
Quick remark: "XML(f.read())" is very inefficient. But that's not the problem here, as we can see from your numbers.
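(As an aside, a minimal sketch of the more direct variant, assuming the benchmark builds the schema from XML(f.read()) as quoted above; the file name is just an example:)

    from lxml import etree

    # reading the file into a Python string first means an extra copy of the data:
    with open("mods-3-2.xsd", "rb") as f:
        schema_doc = etree.XML(f.read())

    # letting libxml2 read the file itself avoids that copy:
    schema_doc = etree.parse("mods-3-2.xsd")
    schema = etree.XMLSchema(schema_doc)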
mods-3-1.xsd and mods-3-2.xsd take about 15 seconds for parsing while the other variants parser in less than 0.3 seconds.
I assume the "while True" loops mean that you're repeating the whole benchmark multiple times to see if it's a setup problem or if it persists over time? And I assume it persists?
How can one explain this huge difference?
I'd run a profiler. Given that lxml doesn't really do anything here, Python level profiling won't help, so (assuming you're on Linux) I recommend callgrind and kcachegrind. They're really easy to use once you've found a suitable command line for callgrind, e.g.

    valgrind --tool=callgrind --toggle-collect=xmlSchemaParse \
        python my_parse_schema.py mods-3-2.xsd

See http://www.valgrind.org/docs/manual/cl-manual.html

That should generate a file callgrind.out.PID and tell you how many instructions it collected. Then run KCachegrind on that file. Click around to see where the problem is. The call tree will guide you there.

If you're using the system provided libxml2, then the output might be a little obfuscated. There should be debug packages of those libraries available for your system that fix this.

Stefan

Hi,

the xsd:import

    <xsd:import namespace="http://www.w3.org/XML/1998/namespace"
                schemaLocation="http://www.w3.org/2001/xml.xsd"/>

may be what is slow here? 3-3 uses another server to get xml.xsd.

jens
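(A quick way to check this hypothesis, outside the original benchmark: time the plain HTTP fetch of xml.xsd by itself and compare it to the ~15 second gap.)

    import time
    from urllib.request import urlopen

    # time a single download of the schema that mods-3-1/3-2 import from w3.org
    start = time.time()
    data = urlopen("http://www.w3.org/2001/xml.xsd").read()
    print("fetched %d bytes in %.1f seconds" % (len(data), time.time() - start))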

Jens Quade wrote on 11.04.2015 at 09:15:
Ah, right. I think I recall that the W3C has deliberately slowed down their DTD/schema delivery when they noticed that people just happily load and reload stuff from their servers, even in CI or benchmark scenarios.

The right way to do it is to provide all external dependencies in your local catalogue; then libxml2 will load them from there.

Stefan
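(A minimal sketch of such a catalogue; the local path is a placeholder, and the exact mapping element may vary, but libxml2 picks the catalogue up via the XML_CATALOG_FILES environment variable:)

    <?xml version="1.0"?>
    <!-- catalog.xml: resolve the W3C xml.xsd from a local copy instead of the network -->
    <catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
      <uri name="http://www.w3.org/2001/xml.xsd"
           uri="file:///usr/share/xml/xml.xsd"/>
    </catalog>

    # tell libxml2 (and thus lxml) where the catalogue lives before running the benchmark
    export XML_CATALOG_FILES=/path/to/catalog.xml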
