Hi,
here's a little status update regarding lxml on PyPy. I got the basics of
lxml.etree working so far, mostly by patching up Cython and tracking down
bugs in PyPy's cpyext (CPython C-API compatibility) layer. I'm still
getting crashes during error reporting and haven't looked at XPath or
XSLT yet. But given that those involve little Python interaction per se,
I don't expect major surprises on that front.
The results are very encouraging, given that PyPy lacks support for many of
the tweaks and hacks that are possible in CPython.
Here's a little parser benchmark:
$ python2.7 -m timeit -s 'import lxml.etree as et' 'et.parse("hamlet.xml")'
100 loops, best of 3: 4.61 msec per loop
$ pypy -m timeit -s 'import lxml.etree as et' 'et.parse("hamlet.xml")'
100 loops, best of 3: 5.74 msec per loop
Pretty acceptable. That makes lxml the fastest XML parser that currently
exists for PyPy.
And here's a worst case benchmark for element proxy instantiation and
iteration, likely the most heavily tuned parts of lxml when running in CPython:
$ python2.7 -m timeit -s 'import lxml.etree as et; \
t=et.parse("hamlet.xml")' 'list(t.iter())'
100 loops, best of 3: 2.71 msec per loop
$ pypy -m timeit -s 'import lxml.etree as et; \
t=et.parse("hamlet.xml")' 'list(t.iter())'
10 loops, best of 3: 28.2 msec per loop
That's about a factor of 10. Sounds huge, but it's actually not bad,
considering the amount of extra work that has to be done for PyPy here.
It certainly doesn't render lxml unusable; we are still talking
milliseconds, after all. And no tuning has gone into this part yet, so
it's not the final word.
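For context, here is a sketch of what that iteration benchmark exercises:
t.iter() walks the whole tree in document order, and materializing the list
forces one Python proxy object per underlying libxml2 node, which is exactly
the costly part on PyPy. The sketch uses the stdlib's xml.etree.ElementTree
as a stand-in (lxml.etree implements the same API) and a tiny inline document
instead of hamlet.xml:

```python
import io
import xml.etree.ElementTree as et  # lxml.etree offers the same interface

# A tiny stand-in document; the benchmark above uses hamlet.xml instead.
xml_doc = b"""<play>
  <act><scene><line>To be, or not to be</line></scene></act>
  <act><scene><line>The rest is silence</line></scene></act>
</play>"""

t = et.parse(io.BytesIO(xml_doc))

# t.iter() yields every element in document order; building the list
# instantiates one proxy object per node, which is what the benchmark times.
elements = list(t.iter())
print(len(elements))               # 7: play, 2x act, 2x scene, 2x line
print([e.tag for e in elements[:3]])
```

Swapping the import for "import lxml.etree as et" and the BytesIO document
for hamlet.xml gives the timed workload above.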
I'm pretty optimistic.
BTW, if you're interested in improvements on this front, you can help
get this done faster by using the "donate" button on lxml's project
home page. Any donation will help free up some of my time for this.
Stefan