Mailman 3 February 2012 - lxml - The Python XML Toolkit

[lxml-dev] lxml has its page on launchpad
by Stefan Behnel 11 Apr '23

Hi all, I added the lxml project to launchpad, the Ubuntu Bug-Tracker. It also has a FAQ engine and a couple of other goodies. https://launchpad.net/lxml It's easy to sign up for launchpad, BTW, no 90%-footnotes-contract. Have fun, Stefan

[lxml-dev] Checking whether a node is a comment/element
by Geoffrey Sneddon 10 Apr '23

Hi, What's the best way to check whether a given node is a comment or an element? For the former, I'm currently using isinstance(node, etree._Comment), which is rather obviously sub-optimal. -- Geoffrey Sneddon <http://gsnedders.com/>
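One commonly used check looks at the node's `tag` attribute instead of the private `_Comment` class: a comment node carries the `etree.Comment` factory as its tag, while an ordinary element carries a string. A minimal sketch, assuming Python 3; the helper names `is_comment` and `is_element` are made up for illustration, and the standard-library fallback is only there so the snippet also runs without lxml installed:

```python
try:
    from lxml import etree
except ImportError:  # fallback so the sketch runs without lxml
    import xml.etree.ElementTree as etree


def is_comment(node):
    """True if node is a comment (its tag is the Comment factory itself)."""
    return node.tag is etree.Comment


def is_element(node):
    """True if node is an ordinary element (its tag is a string)."""
    return isinstance(node.tag, str)


# Build a tiny tree with one comment and one element child.
root = etree.Element("root")
root.append(etree.Comment("a comment"))
root.append(etree.Element("child"))

kinds = [(is_comment(n), is_element(n)) for n in root]
```

Processing instructions can be tested the same way against `etree.ProcessingInstruction`.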

[lxml-dev] Reparenting a node
by Lawrence Oluyede 30 Jan '23

I have a doc A and a doc B. I'd like to put a node extracted from A into document B, but I always get a ValueError:

ValueError: Element is not a child of this node.

I didn't find any "setparent" in the API. How can I do this?

--
Lawrence, oluyede.org - neropercaso.it
"It is difficult to get a man to understand something when his salary depends on not understanding it" - Upton Sinclair
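There is no "setparent" call; attaching is done from the new parent's side with `append()` or `insert()`. A minimal sketch, assuming the node should be copied so that document A keeps its own version (the sample documents and the `item` tag are invented for illustration, and the standard-library fallback only keeps the snippet runnable without lxml):

```python
import copy

try:
    from lxml import etree
except ImportError:  # fallback so the sketch runs without lxml
    import xml.etree.ElementTree as etree

doc_a = etree.fromstring("<a><item id='1'>payload</item></a>")
doc_b = etree.fromstring("<b/>")

node = doc_a.find("item")

# Copy the subtree, then attach the copy to the root of document B.
doc_b.append(copy.deepcopy(node))

copied_ids = [child.get("id") for child in doc_b]
```

In lxml, calling `doc_b.append(node)` without the copy moves the element out of document A, because an lxml element has exactly one parent.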

[lxml-dev] lxml 2.0.5 released
by Stefan Behnel 11 Jan '23

Hi all,

lxml 2.0.5 is on PyPI. This is a bug-fix-only release of the stable 2.0 series.

Have fun,
Stefan

2.0.5 (2008-05-01)

Bugs fixed

* Resolving to a filename in custom resolvers didn't work.
* lxml did not honour libxslt's second error state "STOPPED", which let some XSLT errors pass silently.
* Memory leak in Schematron with libxml2 >= 2.6.31.

[lxml-dev] Building LXML Trunk
by Sidnei da Silva 31 Aug '22

Hi,

I've tried to build lxml from trunk today, on Win32. Got the following error:

src\lxml\etree.c(880) : error C2059: syntax error : ')'
src\lxml\etree.c(881) : error C2059: syntax error : ')'
src\lxml\etree.c(882) : error C2059: syntax error : ')'
src\lxml\etree.c(883) : error C2059: syntax error : ')'

Any clue? Smells like a Pyrex issue?

--
Sidnei da Silva
Enfold Systems http://enfoldsystems.com
Fax +1 832 201 8856 Office +1 713 942 2377 Ext 214

Memory leak when parsing XML files in sequence?
by Maarten van Gompel (proycon) 03 May '13

Hi,

I stumbled across what I think is a memory leak within the lxml module. I am parsing literally millions of mostly small XML files, in sequence, in the following simplified fashion:

    index = glob.glob('/path/to/dir/with/huge/number/of/xml/files/*xml')
    for f in index:
        d = lxml.etree.parse(f)

The problem is that on (almost) every iteration, memory usage increases. But note that d gets overwritten every time, so the reference to the previous document should be lost (I don't reference it anywhere else). Even an explicit 'del d' and gc.collect() within the loop doesn't help to free the extra memory. I used objgraph to debug a bit, and the Python reference counts remain unchanged, as I would expect, leaving me to conclude that this is a memory leak in the lxml module. This becomes problematic quickly when dealing with millions of XML files.

I attach a short log excerpt in which I extracted resident memory usage from ps after each iteration and measured the increase. Note that I only parse the documents, to be overwritten each time; I don't do anything else with them in this test case.

Is this a known problem? Is there anything else I explicitly need to do to free the memory used?

The problem does not reproduce if I reload the same document over and over again. Memory usage remains constant then. It only happens when new documents are loaded, and even then, in some rare cases, the problem does not occur for one or several iterations, most notably at the start of the log.

I also attach an example of an XML file.

Python 2.7.2 (ubuntu 11.10, x86_64)
lxml.etree : (2, 3, 0, 0)
libxml used : (2, 7, 8)
libxml compiled : (2, 7, 8)
libxslt used : (1, 1, 26)
libxslt compiled : (1, 1, 26)

Regards,

--
Maarten van Gompel (Proycon)
E-mail: proycon(a)anaproy.nl
Homepage: http://proycon.anaproy.nl
Google+: https://plus.google.com/105334152965507305708
Facebook: http://facebook.com/proycon
Twitter: http://twitter.com/proycon
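The measurement described above can be reproduced in-process rather than through ps. A rough sketch, assuming a Unix system (the `resource` module is not available on Windows); the `parse_all` helper and the five generated sample files are invented for illustration, and `ru_maxrss` is the process's peak resident size (kilobytes on Linux, bytes on macOS), so it shows growth but never shrinkage:

```python
import glob
import os
import resource  # Unix-only
import tempfile

try:
    from lxml import etree
except ImportError:  # fallback so the sketch runs without lxml
    import xml.etree.ElementTree as etree


def parse_all(paths):
    """Parse each file once and record peak RSS after every iteration."""
    samples = []
    for path in paths:
        doc = etree.parse(path)
        del doc  # drop the only reference to the parsed tree
        usage = resource.getrusage(resource.RUSAGE_SELF)
        samples.append(usage.ru_maxrss)
    return samples


# Generate a handful of small, distinct documents to parse in sequence.
with tempfile.TemporaryDirectory() as tmp:
    for i in range(5):
        with open(os.path.join(tmp, "doc%d.xml" % i), "w") as fh:
            fh.write("<doc n='%d'><w>token</w></doc>" % i)
    samples = parse_all(sorted(glob.glob(os.path.join(tmp, "*.xml"))))
```

Plotting or diffing `samples` over a large, varied corpus gives the per-iteration increase discussed in the post; with only five tiny files the values will barely move.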

[lxml-dev] confusing xpath performance characteristics
by jholg＠gmx.de 24 Aug '12

Hi,

I ran into some performance characteristics of lxml/libxml2 xpath that I find rather confusing:

I try to find the @type attribute of a certain element in an XML Schema (which contains lots of complexType definitions with lots of elements in them; unfortunately I can't post the schema):

>>> timeit.Timer(stmt="""xpath(schema)""", setup="""from lxml import etree, objectify; schema=etree.parse('NDM.xsd').getroot(); xpath = etree.XPath('//xs:element[@name="equity"]/@type', namespaces={'xs': 'http://www.w3.org/2001/XMLSchema'})""").repeat(number=10)
[0.095885038375854492, 0.096823930740356445, 0.096174955368041992]

So I think I'm being smart and give a little more path information - reckoning that this should *improve* performance:

>>> timeit.Timer(stmt="""xpath(schema)""", setup="""from lxml import etree, objectify; schema=etree.parse('NDM.xsd').getroot(); xpath = etree.XPath('/xs:schema//xs:element[@name="equity"]/@type', namespaces={'xs': 'http://www.w3.org/2001/XMLSchema'})""").repeat(number=10)
[0.1770780086517334, 0.1775970458984375, 0.17748594284057617]

Hm. Performance degrades slightly. I'm adding even more of the path to where my desired elements live in the schema:

>>> timeit.Timer(stmt="""xpath(schema)""", setup="""from lxml import etree, objectify; schema=etree.parse('xsd/NDM.xsd').getroot(); xpath = etree.XPath('/xs:schema/xs:complexType//xs:element[@name="equity"]/@type', namespaces={'xs': 'http://www.w3.org/2001/XMLSchema'})""").repeat(number=10)
[103.79744100570679, 103.83671712875366, 103.61817717552185]

What???

>>> timeit.Timer(stmt="""xpath(schema)""", setup="""from lxml import etree, objectify; schema=etree.parse('/ae/data/pydev/hjoukl/NDM/SVN_CO/TRUNK/ndm/reference/xsd/NDM.xsd').getroot(); xpath = etree.XPath('/xs:schema/xs:complexType/*/xs:element[@name="equity"]/@type', namespaces={'xs': 'http://www.w3.org/2001/XMLSchema'})""").repeat(number=10)
[0.044407129287719727, 0.044126987457275391, 0.044229030609130859]
>>>

Ok, this version's better than my naive approach, which seems logical to me. But why would '/xs:schema/xs:complexType//xs:element[@name="equity"]/@type' perform drastically slower than '/xs:schema/xs:complexType//xs:element[@name="equity"]/@type' ? libxml2 problem? Running the same xpaths in Oxygen I don't notice performance differences (can't profile this).

Holger

How to customize namespace prefix in objects inheriting from lxml.ElementBase
by Régis Décamps 27 Feb '12

Dear fellow developers,

I originally posted my question on stackoverflow: http://stackoverflow.com/questions/9265534/how-to-customize-namespace-prefi… objects-inheriting-from-lxml-elementbase

I understand that custom XML elements should inherit from `ElementBase`. For instance, I can create

    class FactVariable(etree.ElementBase):
        ''' Class that represents a XBRL fact variable.'''
        TAG = '{http://xbrl.org/2008/variable}factVariable'

        @property
        def label(self):
            return self.attrib['{http://www.w3.org/1999/xlink}label']

        @label.setter
        def label(self, value):
            self.attrib['{http://www.w3.org/1999/xlink}label'] = value

My problem is that when I create an XML tree and place such nodes, I get

    <ns0:factVariable xmlns:ns0="http://xbrl.org/2008/variable" label="azerty"/>

**Question**: I want the namespace to be prefixed `va`, not `ns0`. How can I change that?

I tried to set the `self.nsmap` property, but I get a "read-only" exception. Adding a key/value has no effect (as said in the documentation).

I also tried, without success:

    etree.register_namespace('va', 'http://xbrl.org/2008/variable')

Thanks in advance,
Régis
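The post reports that `register_namespace` had no effect in the poster's setup. In current releases of lxml and of the standard library, the call is documented to set the prefix chosen for elements created afterwards, so the order of the calls matters. A minimal sketch under that assumption, using a plain element rather than the `FactVariable` subclass (the `VARIABLE_NS` constant is introduced here for illustration, and the standard-library fallback only keeps the snippet runnable without lxml):

```python
try:
    from lxml import etree
except ImportError:  # fallback so the sketch runs without lxml
    import xml.etree.ElementTree as etree

VARIABLE_NS = "http://xbrl.org/2008/variable"

# Register the preferred prefix BEFORE creating elements in that namespace.
etree.register_namespace("va", VARIABLE_NS)

fact = etree.Element("{%s}factVariable" % VARIABLE_NS)
xml_bytes = etree.tostring(fact)
```

With lxml, the prefix can also be fixed per element at creation time, since the element constructors (including `ElementBase` subclasses) accept an `nsmap` keyword such as `nsmap={'va': VARIABLE_NS}`. That avoids touching the global registry and sidesteps the read-only `nsmap` property on existing elements.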

cp1252 encoding not found on Free BSD 8
by Tim Arnold 22 Feb '12

Hi,

This is a bug reported last August for Mac, but it is also happening for freebsd8.2 (amd64).
https://bugs.launchpad.net/lxml/+bug/707396

Python 2.7.1 (r271:86832, Apr 5 2011, 13:19:14)
[GCC 4.2.1 20070719 [FreeBSD]] on freebsd8

from lxml import etree
parser = etree.HTMLParser(encoding='cp1252')

Traceback (most recent call last):
  File "lxml_bug.py", line 11, in <module>
    parser = etree.HTMLParser(encoding='cp1252')
  File "parser.pxi", line 1423, in lxml.etree.HTMLParser.__init__ (src/lxml/lxml.etree.c:81303)
  File "parser.pxi", line 743, in lxml.etree._BaseParser.__init__ (src/lxml/lxml.etree.c:76172)
LookupError: unknown encoding: 'cp1252'

Here are my details:

Python : sys.version_info(major=2, minor=7, micro=1, releaselevel='final', serial=0)
lxml.etree : (2, 3, 1, 0)
libxml used : (2, 7, 8)
libxml compiled : (2, 7, 8)
libxslt used : (1, 1, 26)
libxslt compiled : (1, 1, 26)
platform.architecture() ('64bit', 'ELF')

thanks,
--Tim Arnold
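A possible workaround, assuming the failure is confined to the parser's own encoding lookup (the traceback points at the parser constructor), is to decode the bytes with Python's codec and hand the parser text instead. The sketch below uses an invented, well-formed sample string and `fromstring` so that it also runs on the standard-library fallback; whether this avoids the error on the affected FreeBSD build is not confirmed by the post:

```python
import codecs

try:
    from lxml import etree
except ImportError:  # fallback so the sketch runs without lxml
    import xml.etree.ElementTree as etree

# Sample bytes in Windows-1252: e-acute is 0xE9, the euro sign is 0x80.
raw = "<p>caf\u00e9 \u20ac5</p>".encode("cp1252")

# Python's own codec registry knows cp1252 ...
codec_name = codecs.lookup("cp1252").name

# ... so decode in Python and pass the parser text rather than bytes.
text = raw.decode("cp1252")
root = etree.fromstring(text)
```

lxml's HTML entry points, such as `etree.HTML()`, accept already-decoded text in the same way, so no `encoding` argument is needed on the parser in that case.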

Question about unicode strings
by Frank Millman 21 Feb '12

Hi all,

I happen to be following the mailing lists of both lxml and rpclib. The guys at rpclib want to make a change to their code base to fix what they see as a 'quirk' of lxml. I am not qualified to comment, but I thought I would post the issue here in case anyone can suggest a cleaner solution.

With Python 3 and lxml 2.3.3, if you pass a unicode string to etree.fromstring(...) and then retrieve a text node from the tree, you get a unicode string back. If you pass in a byte array, you get a byte array back.

With Python 2 and lxml 2.2.2 (I don't have 2.3.3), if you pass a unicode string that contains a non-ASCII character, you get a unicode string back. If you pass a unicode string that contains only ASCII characters, you get a normal (byte) string back.

This behaviour is causing a problem for a user of rpclib, so the proposal is that rpclib should always convert the string to unicode before returning it. I don't know how they know that they passed in a unicode string in the first place, but I assume they have a way of checking.

The maintainer of rpclib says "If you disagree, speak now or forever hold your silence :))"

So I thought I would mention it here and see if it sounds ok.

Thanks,
Frank Millman
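The normalization proposed for rpclib amounts to coercing whatever the tree returns into text before handing it on. A minimal sketch in Python 3 syntax; the `to_text` helper and its UTF-8 default are assumptions made for illustration, not rpclib's actual code:

```python
def to_text(value, encoding="utf-8"):
    """Return value as text (str); bytes are decoded, None passes through."""
    if value is None:
        return None
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


samples = [to_text(b"caf\xc3\xa9"), to_text("plain"), to_text(None)]
```

On Python 2 the same idea would test for `str` and return `unicode`, which covers the ASCII-only case described in the post.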
