Mailman 3 March 2010 - lxml - The Python XML Toolkit

[lxml-dev] lxml has its page on launchpad
by Stefan Behnel 11 Apr '23

Hi all,

I added the lxml project to launchpad, the Ubuntu Bug-Tracker. It also has a FAQ engine and a couple of other goodies.

https://launchpad.net/lxml

It's easy to sign up for launchpad, BTW, no 90%-footnotes-contract.

Have fun,
Stefan

[lxml-dev] Checking whether a node is a comment/element
by Geoffrey Sneddon 10 Apr '23

Hi,

What's the best way to check whether a given node is a comment or an element? For the former, I'm currently using isinstance(node, etree._Comment), which is rather obviously sub-optimal.

--
Geoffrey Sneddon
<http://gsnedders.com/>
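
One way to approach the question above without touching the private _Comment class is to look at the node's tag: in lxml, comments and processing instructions carry their factory function as the tag, while ordinary elements have a string tag. The following is only a minimal sketch under that assumption (Python 3, lxml installed); the helper name kind() and the sample document are made up for illustration.

```python
from lxml import etree

# Small made-up document with a comment, an element and a processing instruction
root = etree.fromstring("<root><!-- note --><child/><?pi data?></root>")

def kind(node):
    """Classify a node by inspecting its tag (hypothetical helper)."""
    if node.tag is etree.Comment:
        return "comment"
    if node.tag is etree.ProcessingInstruction:
        return "pi"
    if isinstance(node.tag, str):
        # Ordinary elements have a string tag such as "child" or "{ns}child"
        return "element"
    return "other"

kinds = [kind(node) for node in root]
print(kinds)
```

The tag comparison only relies on public names from lxml.etree, which is why it may be preferable to an isinstance() check against an underscore-prefixed class.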

[lxml-dev] Reparenting a node
by Lawrence Oluyede 30 Jan '23

I have a doc A and a doc B. I'd like to put a node extracted from A into document B, but I always get a ValueError:

    ValueError: Element is not a child of this node.

I didn't find any "setparent" in the API. How can I do this?

--
Lawrence, oluyede.org - neropercaso.it
"It is difficult to get a man to understand something when his salary depends on not understanding it" - Upton Sinclair
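
For what it's worth, there is no "setparent" because appending or inserting an element into another tree already moves it there. The snippet below is a minimal sketch of that behaviour, assuming Python 3 and lxml; the two tiny documents are invented for illustration, and copy.deepcopy() is shown as the option when document A should keep its node.

```python
import copy
from lxml import etree

# Two made-up documents standing in for "doc A" and "doc B"
doc_a = etree.fromstring("<a><item>payload</item></a>")
doc_b = etree.fromstring("<b><slot/></b>")

node = doc_a[0]

# append() moves the element: it leaves doc A and becomes a child of doc B
doc_b.append(node)
print(etree.tostring(doc_a))
print(etree.tostring(doc_b))

# To leave the source tree untouched, append a deep copy instead
doc_c = etree.fromstring("<c/>")
doc_c.append(copy.deepcopy(node))
print(etree.tostring(doc_c))
```

insert() and slice assignment follow the same move semantics, so the ValueError in the post most likely came from a call that expects the node to already be a child, such as remove() or replace().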

[lxml-dev] lxml 2.0.5 released
by Stefan Behnel 11 Jan '23

Hi all,

lxml 2.0.5 is on PyPI. This is a bug-fix-only release of the stable 2.0 series.

Have fun,
Stefan

2.0.5 (2008-05-01)

Bugs fixed

* Resolving to a filename in custom resolvers didn't work.
* lxml did not honour libxslt's second error state "STOPPED", which let some XSLT errors pass silently.
* Memory leak in Schematron with libxml2 >= 2.6.31.

[lxml-dev] Building LXML Trunk
by Sidnei da Silva 31 Aug '22

Hi,

I've tried to build lxml from trunk today, on Win32. Got the following error:

    src\lxml\etree.c(880) : error C2059: syntax error : ')'
    src\lxml\etree.c(881) : error C2059: syntax error : ')'
    src\lxml\etree.c(882) : error C2059: syntax error : ')'
    src\lxml\etree.c(883) : error C2059: syntax error : ')'

Any clue? Smells like a Pyrex issue?

--
Sidnei da Silva
Enfold Systems
http://enfoldsystems.com
Fax +1 832 201 8856
Office +1 713 942 2377 Ext 214

[lxml-dev] confusing xpath performance characteristics
by jholg＠gmx.de 24 Aug '12

Hi,

I ran into some performance characteristics of lxml/libxml2 xpath that I find rather confusing: I try to find the @type attribute of a certain element in an XML Schema (which contains lots of complexType definitions with lots of elements in them; unfortunately I can't post the schema):

    >>> timeit.Timer(stmt="""xpath(schema)""", setup="""from lxml import etree, objectify; schema=etree.parse('NDM.xsd').getroot(); xpath = etree.XPath('//xs:element[@name="equity"]/@type', namespaces={'xs': 'http://www.w3.org/2001/XMLSchema'})""").repeat(number=10)
    [0.095885038375854492, 0.096823930740356445, 0.096174955368041992]

So I think I'm being smart and give a little more path information - reckoning that this should *improve* performance:

    >>> timeit.Timer(stmt="""xpath(schema)""", setup="""from lxml import etree, objectify; schema=etree.parse('NDM.xsd').getroot(); xpath = etree.XPath('/xs:schema//xs:element[@name="equity"]/@type', namespaces={'xs': 'http://www.w3.org/2001/XMLSchema'})""").repeat(number=10)
    [0.1770780086517334, 0.1775970458984375, 0.17748594284057617]

Hm. Performance degrades slightly. I'm adding even more of the path to where my desired elements live in the schema:

    >>> timeit.Timer(stmt="""xpath(schema)""", setup="""from lxml import etree, objectify; schema=etree.parse('xsd/NDM.xsd').getroot(); xpath = etree.XPath('/xs:schema/xs:complexType//xs:element[@name="equity"]/@type', namespaces={'xs': 'http://www.w3.org/2001/XMLSchema'})""").repeat(number=10)
    [103.79744100570679, 103.83671712875366, 103.61817717552185]

What???

    >>> timeit.Timer(stmt="""xpath(schema)""", setup="""from lxml import etree, objectify; schema=etree.parse('/ae/data/pydev/hjoukl/NDM/SVN_CO/TRUNK/ndm/reference/xsd/NDM.xsd').getroot(); xpath = etree.XPath('/xs:schema/xs:complexType/*/xs:element[@name="equity"]/@type', namespaces={'xs': 'http://www.w3.org/2001/XMLSchema'})""").repeat(number=10)
    [0.044407129287719727, 0.044126987457275391, 0.044229030609130859]
    >>>

Ok, this version's better than my naive approach, which seems logical to me. But why would

    '/xs:schema/xs:complexType//xs:element[@name="equity"]/@type'

perform drastically slower than

    '/xs:schema/xs:complexType//xs:element[@name="equity"]/@type'

? libxml2 problem? Running the same xpaths in Oxygen I don't notice performance differences (can't profile this).

Holger
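
Since the schema in question can't be posted, a comparison like the one above can be reproduced against a synthetic schema. The following is only a rough sketch, assuming Python 3 and lxml: build_schema(), its size parameters and the generated element names are invented, and the timings it prints will depend on the installed libxml2, so no figures are claimed here. It first checks that all four expressions from the post select the same attribute and then times each one.

```python
import timeit
from lxml import etree

XS = "http://www.w3.org/2001/XMLSchema"
NSMAP = {"xs": XS}

def build_schema(n_types=50, n_elements=20):
    """Build a made-up schema with many complexTypes and one 'equity' element."""
    parts = ['<xs:schema xmlns:xs="%s">' % XS]
    for t in range(n_types):
        parts.append('<xs:complexType name="T%d"><xs:sequence>' % t)
        for e in range(n_elements):
            parts.append('<xs:element name="e%d_%d" type="xs:string"/>' % (t, e))
        parts.append("</xs:sequence></xs:complexType>")
    parts.append(
        '<xs:complexType name="Last"><xs:sequence>'
        '<xs:element name="equity" type="EquityType"/>'
        "</xs:sequence></xs:complexType>"
    )
    parts.append("</xs:schema>")
    return etree.fromstring("".join(parts))

# The four expressions from the post, in the order they were tried
PATHS = [
    '//xs:element[@name="equity"]/@type',
    '/xs:schema//xs:element[@name="equity"]/@type',
    '/xs:schema/xs:complexType//xs:element[@name="equity"]/@type',
    '/xs:schema/xs:complexType/*/xs:element[@name="equity"]/@type',
]

schema = build_schema()
compiled = [etree.XPath(path, namespaces=NSMAP) for path in PATHS]

# All variants should select the same attribute value
results = [xpath(schema) for xpath in compiled]

for path, xpath in zip(PATHS, compiled):
    seconds = timeit.timeit(lambda: xpath(schema), number=10)
    print("%8.4fs  %s" % (seconds, path))
```

Raising n_types and n_elements makes the differences between the variants easier to see, at the cost of a longer run.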

[lxml-dev] adding a namespace
by Wichert Akkerman 05 Apr '10

I am having some problems adding a new namespace to a parsed document. My goal is to take an input file like this:

    <html xmlns="http://www.w3.org/1999/xhtml">
      <body>
        <div id="one"><p>first paragraph</p></div>
        <div id="two"><p>second paragraph</p></div>
      </body>
    </html>

and turn it into this:

    <html xmlns="http://www.w3.org/1999/xhtml"
          xmlns:i18n="http://xml.zope.org/namespaces/i18n">
      <body>
        <div id="one"><p i18n:translate="string1">first paragraph</p></div>
        <div id="two"><p i18n:translate="string2">second paragraph</p></div>
      </body>
    </html>

The code is fairly simple, and looks like this (simplified from original):

    NS = "http://xml.zope.org/namespaces/i18n"
    tree = lxml.etree.parse(input)
    root = tree.getroot()
    count = 1
    if "i18n" not in root.nsmap:
        root.nsmap["i18n"] = NS
    for el in root.iter():
        if "{%s}translate" % NS in el.attrib:
            continue
        if hasText(el):
            el.attrib["{%s}translate" % NS] = "string%d" % count
            count += 1
    print lxml.etree.tostring(tree)

However the resulting output looks like this:

    <html xmlns="http://www.w3.org/1999/xhtml">
      <body>
        <div id="one"><p xmlns:ns0="http://xml.zope.org/namespaces/i18n" ns0:translate="string1">first paragraph</p></div>
        <div id="two"><p xmlns:ns1="http://xml.zope.org/namespaces/i18n" ns1:translate="string2">second paragraph</p></div>
      </body>
    </html>

While trying to debug this I noticed something odd: lxml allows you to modify the nsmap for an element, but ignores what you do:

    >>> root.nsmap
    {None: 'http://www.w3.org/1999/xhtml', 'py': 'http://genshi.edgewall.org/', 'xi': 'http://www.w3.org/2001/XInclude'}
    >>> root.nsmap["frop"]='http://frip'
    >>> root.nsmap
    {None: 'http://www.w3.org/1999/xhtml', 'py': 'http://genshi.edgewall.org/', 'xi': 'http://www.w3.org/2001/XInclude'}

I would expect that to either work, or raise an exception telling me I am trying to do something that is not allowed. The current behaviour feels a bit unpythonic.

It is possible to specify your own nsmap when creating elements, but I can not find an API to modify the nsmap for a parsed tree. Is that a missing feature, or is there another way to do this?

Wichert.
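
One possible workaround, given that an nsmap can only be supplied when an element is created, is to build a replacement root that carries the extra prefix and move the parsed children into it before setting the attributes. This is a sketch under stated assumptions rather than an official recipe: it uses Python 3 and lxml, with_extra_namespace() is a hypothetical helper, and has_text() is an invented stand-in for the hasText() function that the post does not show.

```python
from lxml import etree

NS = "http://xml.zope.org/namespaces/i18n"
SOURCE = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
    '<div id="one"><p>first paragraph</p></div>'
    '<div id="two"><p>second paragraph</p></div>'
    "</body></html>"
)

def has_text(el):
    """Invented stand-in for the hasText() helper mentioned in the post."""
    return bool(el.text and el.text.strip())

def with_extra_namespace(root, prefix, uri):
    """Return a new root declaring prefix -> uri, with root's children moved in."""
    nsmap = dict(root.nsmap)  # root.nsmap is a copy, so modifying it is harmless
    nsmap[prefix] = uri
    new_root = etree.Element(root.tag, attrib=dict(root.attrib), nsmap=nsmap)
    new_root.text = root.text
    for child in list(root):
        new_root.append(child)  # append() moves the child out of the old root
    return new_root

root = with_extra_namespace(etree.fromstring(SOURCE), "i18n", NS)

count = 1
for el in root.iter():
    if "{%s}translate" % NS in el.attrib:
        continue
    if has_text(el):
        # The i18n prefix is now declared on an ancestor, so lxml can reuse it
        el.set("{%s}translate" % NS, "string%d" % count)
        count += 1

result = etree.tostring(root, encoding="unicode")
print(result)
```

The old root is left empty after the move, so this approach suits cases where the original tree object is no longer needed.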

[lxml-dev] lxml iterparse generator not returning anything
by Joe Sarre 05 Apr '10

Hi everyone,

I'm finding that when using iterparse, the generator always throws StopIteration immediately, without returning any data. I must be doing something wrong, or I must have some kind of setup problem, but I'm struggling to work out what it is. If anybody has any ideas, then that would be greatly appreciated, or if this is a bug, I will raise it on the bug tracker.

My version details are:

    >>> print etree.LXML_VERSION
    (2, 2, 2, 0)
    >>> print etree.LIBXML_VERSION
    (2, 7, 6)
    >>> print etree.LIBXML_COMPILED_VERSION
    (2, 7, 3)
    >>> print etree.LIBXSLT_VERSION
    (1, 1, 26)
    >>> print etree.LIBXSLT_COMPILED_VERSION
    (1, 1, 24)

The most striking thing about this is that LIBXML_VERSION != LIBXML_COMPILED_VERSION, and LIBXSLT_VERSION != LIBXSLT_COMPILED_VERSION. If this version discrepancy is the real cause of the problem, then I think this issue is perhaps more appropriate for the Fedora mailing list, and you can ignore the rest of this mail.

An example in which I am seeing this (taken from http://codespeak.net/lxml/parsing.html#iterparse-and-iterwalk) is:

    """
    >>> from lxml import etree
    >>> from StringIO import StringIO
    >>> xml = '''<root>
    ...   <element key='value'>text</element>
    ...   <element>text</element>tail
    ...   <empty-element xmlns="http://testns/" />
    ... </root>'''
    >>> print xml
    <root>
      <element key='value'>text</element>
      <element>text</element>tail
      <empty-element xmlns="http://testns/" />
    </root>
    >>> context = etree.iterparse(StringIO(xml))
    >>> for action, elem in context:
    ...     print("%s: %s" % (action, elem.tag))
    end: element
    end: element
    end: {http://testns/}empty-element
    end: root
    """
    if __name__ == '__main__':
        import doctest
        doctest.testmod()

The result of putting this in a file and running it is that python complains:

    **********************************************************************
    File "test.py", line 20, in __main__
    Failed example:
        for action, elem in context:
            print("%s: %s" % (action, elem.tag))
    Expected:
        end: element
        end: element
        end: {http://testns/}empty-element
        end: root
    Got nothing
    **********************************************************************
    1 items had failures:
       1 of   6 in __main__
    ***Test Failed*** 1 failures.

Thanks in advance for any help,
Joe Sarre
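
To help separate a setup problem from a usage problem, the same documented example can be run as a plain script that also prints the runtime and compile-time library versions side by side. The sketch below assumes Python 3 and lxml (so it reads from a BytesIO rather than the Python 2 StringIO used in the post) and does not by itself confirm or rule out the version mismatch as the cause.

```python
from io import BytesIO
from lxml import etree

XML = b"""<root>
  <element key='value'>text</element>
  <element>text</element>tail
  <empty-element xmlns="http://testns/" />
</root>"""

# Runtime versus compile-time library versions, as compared in the post
print("libxml2 runtime:", etree.LIBXML_VERSION,
      "compiled:", etree.LIBXML_COMPILED_VERSION)
print("libxslt runtime:", etree.LIBXSLT_VERSION,
      "compiled:", etree.LIBXSLT_COMPILED_VERSION)

# With no events argument, iterparse() reports only "end" events
events = [(action, elem.tag) for action, elem in etree.iterparse(BytesIO(XML))]
for action, tag in events:
    print("%s: %s" % (action, tag))
```

If the list of events comes back empty on a given installation, that points at the installed libraries rather than at the calling code.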

[lxml-dev] Temporary data attached to custom subclasses
by Dave Kuhlman 01 Apr '10

I've been using the custom subclasses capability of lxml. It's slick.

I do, however, miss the ability to attach temporary data to the ElementBase subclasses. (See the warnings under "Element initialization" at http://codespeak.net/lxml/element_classes.html.) I can, as suggested by the docs, add attributes or children to the underlying etree.Element, but that means that I'd have to strip that temporary data off when I want to serialize the tree. (Please stop me if you've already heard this request, or if there is another solution.)

I'd have a solution (see below) to this need if I could get a value, say an ID, (1) that is unique to each node and (2) that does not change during the existence of the ElementTree. Note that this "ID" does not have to be meaningful, and does not need to enable me to do anything with the underlying XML object (other than re-identify it).

If I could get this opaque ID (or whatever it might be called), then I could use a dictionary and something like the following to store and retrieve temporary data::

    Datadict1 = {}

    def get_temp_data(node, datadict):
        id = node.get_opaque_id()
        if id in datadict:
            return datadict[id]
        else:
            data = {}
            datadict[id] = data
            return data

    def test():
        doc = lxml.etree.parse('somedoc.xml')
        root = doc.getroot()
        node = root[0]
        data = get_temp_data(node, Datadict1)
        value1 = 'some temporary data'
        data['key1'] = value1
        # o o o
        data = get_temp_data(node, Datadict1)
        print data['key1']

    test()

Looking at lxml-2.2.4/src/lxml/lxml.etree.pyx, it seems like that would be a trivial function to add. (See below.)

What do you think? It's a pretty simple solution. Has it been tried or rejected already?

Here is a patch that seems to add the necessary function. This function returns the C pointer to the libxml2 object that is underneath the lxml/etree object. Am I right that this value would be (1) unique and (2) persistent across the lifetime of the lxml/etree ElementTree?

    Index: lxml.etree.pyx
    ===================================================================
    --- lxml.etree.pyx (revision 71999)
    +++ lxml.etree.pyx (working copy)
    @@ -1185,6 +1185,21 @@
                 return None
             return _elementFactory(self._doc, c_node)
     
    +    def getopaqueid(self):
    +        u"""getopaqueid(self)
    +
    +        Returns an opaque ID for the underlying XML C node. This
    +        opaque ID is guaranteed (1) to be unique to each node
    +        and (2) not to change during the existence of the
    +        ElementTree.
    +        """
    +        cdef xmlNode* c_node
    +        cdef int intnode
    +        c_node = self._c_node
    +        intnode = <int>c_node
    +        opaqueid = intnode
    +        return opaqueid
    +
         def getnext(self):
             u"""getnext(self)

- Dave

--
Dave Kuhlman
http://www.rexx.com/~dkuhlman
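
Until something like an opaque ID exists, the dictionary idea above can be approximated with a key that lxml already exposes. The sketch below swaps in the element's path from ElementTree.getpath() as the key, which is a different technique from the proposed patch and only holds under the assumption that the tree's structure does not change while the temporary data is in use. It assumes Python 3 and lxml; the sample document and the name temp_data are invented.

```python
from lxml import etree

temp_data = {}

def get_temp_data(node, datadict):
    """Return the scratch dict for node, keyed by its path in the tree.

    Assumption: the tree is not restructured in between, since inserting or
    removing siblings changes the paths of the nodes that follow them.
    """
    key = node.getroottree().getpath(node)
    return datadict.setdefault(key, {})

# Made-up document standing in for 'somedoc.xml'
root = etree.fromstring("<doc><item/><item/></doc>")

data = get_temp_data(root[0], temp_data)
data["key1"] = "some temporary data"

# Looking the node up again, through a fresh index access, finds the same data
value = get_temp_data(root[0], temp_data)["key1"]
other = get_temp_data(root[1], temp_data)
print(value, other)
```

Nothing is written into the tree itself, so serializing it afterwards needs no clean-up step.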

[lxml-dev] How to get HTML charset ?
by David Shieh 31 Mar '10

Hi all,

I have used lxml for a long time and it works fine for me. But now I am confused about charsets. When I want to get the original charset of an HTML file, I use the code below:

    file_content = ''.join(
        [i.rstrip('\r\n ').lstrip() for i in response.readlines()]
    )
    html = lxml.html.fromstring(file_content)
    for i in html.xpath('head/meta'):
        print lxml.html.tostring(i)

Surprisingly, there is no output of any <meta http-equiv="Content-Type" .. /> element. So, how can I find out the original charset of this HTML?

BTW, I used urllib2 to get the charset, using the code below:

    req = urllib2.Request(url)
    try:
        response = urllib2.urlopen(req)
    except urllib2.HTTPError, e:
        print e.code
    else:
        print response.headers.getheader('Content-Type')

Not every site returns its charset; some sites don't return any charset information. What should I do if I really want to know the charset?

Thanks, guys.

Best wishes,
David

--
Attitude determines everything !
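
One way to read the charset a page declares about itself is to hand the raw bytes to lxml.html and inspect the meta elements anywhere in the document. This is a minimal sketch, assuming Python 3 and lxml; declared_charset() and the sample pages are invented, and it only reports what the markup claims, which may differ from the HTTP header or from the bytes actually used.

```python
import re
import lxml.html

def declared_charset(raw_html):
    """Return the charset named in a <meta> element, or None (hypothetical helper)."""
    doc = lxml.html.fromstring(raw_html)
    for meta in doc.xpath("//meta"):
        # HTML5 style: <meta charset="utf-8">
        if meta.get("charset"):
            return meta.get("charset").strip().lower()
        # Older style: <meta http-equiv="Content-Type" content="...; charset=...">
        if (meta.get("http-equiv") or "").lower() == "content-type":
            match = re.search(r"charset=([\w.:-]+)", meta.get("content") or "", re.I)
            if match:
                return match.group(1).lower()
    return None

# Made-up pages: one with a declaration, one without
SAMPLE = (
    b'<html><head><meta http-equiv="Content-Type" '
    b'content="text/html; charset=ISO-8859-1"><title>t</title></head>'
    b"<body><p>hi</p></body></html>"
)
PLAIN = b"<html><body><p>hi</p></body></html>"

charset = declared_charset(SAMPLE)
missing = declared_charset(PLAIN)
print(charset, missing)
```

When neither the HTTP header nor the markup names a charset, the encoding has to be guessed from the bytes, which is outside what this sketch attempts.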
