Mailman 3 September 2011 - lxml - The Python XML Toolkit

[lxml-dev] lxml has its page on launchpad
by Stefan Behnel 11 Apr '23


Hi all,

I added the lxml project to launchpad, the Ubuntu Bug-Tracker. It also has a FAQ engine and a couple of other goodies.

https://launchpad.net/lxml

It's easy to sign up for launchpad, BTW, no 90%-footnotes-contract.

Have fun,
Stefan


[lxml-dev] Checking whether a node is a comment/element
by Geoffrey Sneddon 10 Apr '23


Hi,

What's the best way to check whether a given node is a comment or an element? For the former, I'm currently using isinstance(node, etree._Comment), which is rather obviously sub-optimal.

--
Geoffrey Sneddon
<http://gsnedders.com/>
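One commonly documented approach, sketched here rather than taken from this thread, is to look at node.tag: for a comment it is the etree.Comment factory itself, while for an ordinary element it is a string. The helper names is_comment and is_element are made up for illustration, and the fallback import of the standard library's ElementTree (which follows the same convention) is an assumption added only so the snippet also runs where lxml is not installed.

```python
try:
    from lxml import etree
except ImportError:  # assumption: stdlib ElementTree uses the same convention
    import xml.etree.ElementTree as etree


def is_comment(node):
    """True if node is a comment: its tag is the Comment factory itself."""
    return node.tag is etree.Comment


def is_element(node):
    """True if node is a regular element: its tag is a plain string."""
    return isinstance(node.tag, str)


# Build a tiny tree with one comment child and one element child.
root = etree.Element("root")
root.append(etree.Comment("just a note"))
etree.SubElement(root, "child")

kinds = ["comment" if is_comment(n) else "element" for n in root]
print(kinds)
```

This avoids depending on the private etree._Comment class mentioned in the question.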


[lxml-dev] Reparenting a node
by Lawrence Oluyede 30 Jan '23


I have a doc A and a doc B. I'd like to put a node extracted from A into document B, but I always get a ValueError:

ValueError: Element is not a child of this node.

I didn't find any "setparent" in the API. How can I do this?

--
Lawrence, oluyede.org - neropercaso.it
"It is difficult to get a man to understand something when his salary depends on not understanding it" - Upton Sinclair
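A minimal sketch of one way to do this, not taken from a reply in this thread: append the node (or a copy of it) to an element of the target document. The document contents and variable names are invented for illustration, and the fallback to the standard library's ElementTree is an assumption added so the snippet runs without lxml.

```python
import copy

try:
    from lxml import etree
except ImportError:  # assumption: fall back to the stdlib API
    import xml.etree.ElementTree as etree

# Two independent documents (hypothetical content).
doc_a = etree.fromstring("<a><item id='1'>payload</item></a>")
doc_b = etree.fromstring("<b/>")

node = doc_a.find("item")

# Append a deep copy so that doc_a keeps its own <item>.
doc_b.append(copy.deepcopy(node))

print(etree.tostring(doc_b))
```

In lxml an element has exactly one parent, so calling doc_b.append(node) without the copy moves the element out of doc A and into doc B; no separate "setparent" call is needed.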


[lxml-dev] lxml 2.0.5 released
by Stefan Behnel 11 Jan '23


Hi all,

lxml 2.0.5 is on PyPI. This is a bug-fix-only release of the stable 2.0 series.

Have fun,
Stefan

2.0.5 (2008-05-01)

Bugs fixed

* Resolving to a filename in custom resolvers didn't work.
* lxml did not honour libxslt's second error state "STOPPED", which let some XSLT errors pass silently.
* Memory leak in Schematron with libxml2 >= 2.6.31.


[lxml-dev] Building LXML Trunk
by Sidnei da Silva 31 Aug '22


Hi,

I've tried to build lxml from trunk today, on Win32. Got the following error:

src\lxml\etree.c(880) : error C2059: syntax error : ')'
src\lxml\etree.c(881) : error C2059: syntax error : ')'
src\lxml\etree.c(882) : error C2059: syntax error : ')'
src\lxml\etree.c(883) : error C2059: syntax error : ')'

Any clue? Smells like a Pyrex issue?

--
Sidnei da Silva
Enfold Systems http://enfoldsystems.com
Fax +1 832 201 8856 Office +1 713 942 2377 Ext 214


[lxml-dev] confusing xpath performance characteristics
by jholg＠gmx.de 24 Aug '12


Hi,

I ran into some performance characteristics of lxml/libxml2 xpath that I find rather confusing:

I try to find the @type attribute of a certain element in an XML Schema (which contains lots of complexType definitions with lots of elements in them; unfortunately I can't post the schema):

>>> timeit.Timer(stmt="""xpath(schema)""", setup="""from lxml import etree, objectify; schema=etree.parse('NDM.xsd').getroot(); xpath = etree.XPath('//xs:element[@name="equity"]/@type', namespaces={'xs': 'http://www.w3.org/2001/XMLSchema'})""").repeat(number=10)
[0.095885038375854492, 0.096823930740356445, 0.096174955368041992]

So I think I'm being smart and give a little more path information, reckoning that this should *improve* performance:

>>> timeit.Timer(stmt="""xpath(schema)""", setup="""from lxml import etree, objectify; schema=etree.parse('NDM.xsd').getroot(); xpath = etree.XPath('/xs:schema//xs:element[@name="equity"]/@type', namespaces={'xs': 'http://www.w3.org/2001/XMLSchema'})""").repeat(number=10)
[0.1770780086517334, 0.1775970458984375, 0.17748594284057617]

Hm. Performance degrades slightly. I'm adding even more of the path to where my desired elements live in the schema:

>>> timeit.Timer(stmt="""xpath(schema)""", setup="""from lxml import etree, objectify; schema=etree.parse('xsd/NDM.xsd').getroot(); xpath = etree.XPath('/xs:schema/xs:complexType//xs:element[@name="equity"]/@type', namespaces={'xs': 'http://www.w3.org/2001/XMLSchema'})""").repeat(number=10)
[103.79744100570679, 103.83671712875366, 103.61817717552185]

What???

>>> timeit.Timer(stmt="""xpath(schema)""", setup="""from lxml import etree, objectify; schema=etree.parse('/ae/data/pydev/hjoukl/NDM/SVN_CO/TRUNK/ndm/reference/xsd/NDM.xsd').getroot(); xpath = etree.XPath('/xs:schema/xs:complexType/*/xs:element[@name="equity"]/@type', namespaces={'xs': 'http://www.w3.org/2001/XMLSchema'})""").repeat(number=10)
[0.044407129287719727, 0.044126987457275391, 0.044229030609130859]
>>>

Ok, this version's better than my naive approach, which seems logical to me. But why would
'/xs:schema/xs:complexType//xs:element[@name="equity"]/@type'
perform drastically slower than
'/xs:schema/xs:complexType//xs:element[@name="equity"]/@type' ?

libxml2 problem? Running the same xpaths in Oxygen I don't notice performance differences (can't profile this).

Holger


Re: [lxml] Multiple input documents
by Stefan Behnel 29 Sep '11


Evgeny Turnaev, 27.09.2011 15:27:
> 2011/9/27 Stefan Behnel:
>> Evgeny Turnaev, 27.09.2011 12:09:
>>> My question if related to XSLT document() function and processing
>>> multiple input documents in XSLT.
>>>
>>> Currently in our application fetches 3 to 7 separate xml documents merges
>>> all of them into single tree using append() or SubElement and passes merged tree
>>> into XSLT transformation.
>>>
>>> Is it possible in lxml to pass multiple trees into XSLT
>>> transformation and access them
>>> for example using document() function? If so then: will a document accessed by
>>> document() function be parsed for each access?
>>
>> It will be cached during the lifetime of one XSLT execution.
>
> So for the first time it will be parsed?

Yes.

> Or i can pass already parsed tree using custom resolver?

No, not currently. It could work to enable that, but given the way libxslt works here, it would always have to get deep copied internally. See the function _xslt_resolve_from_python() in xslt.pxi.

> Hmm. There seems no method like Resolver.resolve_document()
> What is a result of resolve_string() ? Is it a parsed tree?

The return values are opaque reference objects that should only be passed back from the user provided resolver method. They do not contain documents.

> You suggesting to cache
> result of resolve_string() and return cached tree for calls to
> document('my_doc') ?

No, I was saying that libxslt caches the documents it parses during an XSLT run. They will be discarded afterwards.

>>> Will i have to save document to disk to be
>>> able to load in in document() function or i can load already parsed
>>> tree from memory?
>>
>> You can use custom resolvers (see the docs) to pass arbitrary sources into
>> lxml's parsers and XSLT engine.
>>
>>
>>> Will it be faster to use document() than appending 3-4 of 40kb xml
>>> trees and 4-5 small (1kb)?
>>
>> Maybe not, but it depends on what you do. You should benchmark it.
>>
>>
>>> One other reason why i am asking it: we have a lot of merging of the
>>> same tries (<1kb) into different
>>> documents and a few merging of 40kb tries. So i thinked: why cant lxml
>>> use the same tree using document()
>>> instead of explicitly appending it into each xml before transformation.
>>
>> Yes, that sounds like you could simplify your processing. However, if that
>> makes it any faster, cleaner or 'better' by whatever metric, depends
>> entirely on your exact code.
>>
>>
>>> Is there any other any other performance hints?
>>
>> First question: do you really have a performance problem? If so, where?
>>
>> Or is your question more about refactoring the code to keep more of it in
>> XSLT for some design reason?
>
> No we don`t have any performance issues. Our application is IO bound
> (mostly waiting, although in some situations
> fetching is done from memcache (around 2ms) and in this case xslt
> transform time matters).
> Application is a bit chaotic in code and i am taking some
> investigation of how i can rewrite
> the whole thing and maybe also speedup. The profiling says that about
> a half of actual CPU time
> is in xslt transformation (not much in absolute value) and i am
> wondering if i can "cache" subtrees and
> pass them into xslt instead of appending to each xml individually. I
> will surely benchmark. (i think i will be
> faster than tree merging, although maybe less readable and more
> complicated in python part)

Ok, I take it that your focus is on code cleanup rather than optimisation. As I said, passing in multiple subtrees isn't guaranteed to be any faster than what you currently have, and it may just as well be slower. It may be a way to clean up the code, though, but since I don't see the code, I can only guess.

Stefan
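The custom-resolver route mentioned above could look roughly like the following sketch. It assumes the pattern from lxml's resolver documentation, where the resolvers registered on the parser that parsed the stylesheet are consulted for document() calls. The "mem:" URL scheme, the DOCUMENTS store and the MemoryResolver class are hypothetical names invented for this illustration. As the reply explains, the resolver hands over serialized data, not an already parsed tree, and libxslt parses it once per XSLT run.

```python
import io

from lxml import etree

# Hypothetical in-memory store of already fetched, serialized documents.
DOCUMENTS = {"mem:shared": b"<shared><title>Hello</title></shared>"}


class MemoryResolver(etree.Resolver):
    """Serve document() requests for known 'mem:' URLs from DOCUMENTS."""

    def resolve(self, system_url, public_id, context):
        data = DOCUMENTS.get(system_url)
        if data is None:
            return None  # not ours: let other resolvers / defaults handle it
        # The return value is an opaque reference, not a parsed document.
        return self.resolve_string(data, context)


STYLESHEET = b"""\
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:template match="/">
    <xsl:value-of select="document('mem:shared')/shared/title"/>
  </xsl:template>
</xsl:stylesheet>
"""

# Register the resolver on the parser used for the stylesheet itself.
parser = etree.XMLParser()
parser.resolvers.add(MemoryResolver())
transform = etree.XSLT(etree.parse(io.BytesIO(STYLESHEET), parser))

result = transform(etree.fromstring(b"<main/>"))
print(str(result))
```

Whether this is faster than merging the trees before the transformation depends on the documents and the stylesheet, so it would still need to be benchmarked.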


lxml 2.3.1 released
by Stefan Behnel 29 Sep '11


Hi everyone,

I'm happy to announce the release of lxml 2.3.1. This is the first bug fix release of the stable 2.3 series. It contains a number of behavioural corrections of the original 2.3 release, so updating is recommended.

http://lxml.de/
http://pypi.python.org/pypi/lxml/2.3.1/

This release was built using Cython 0.15.1. It is recommended (although not required) to use at least libxml2 2.7.8 with lxml, which fixes a number of important bugs compared to the previous 2.7.x releases.

Note that this release officially drops support for CPython 2.3, whose extended security-fix-only maintenance period ended back in March 2008. CPython 2.4.x, although equally outdated, continues to be supported due to its long term maintenance in certain Linux/Unix server installations.

If you are interested in training, commercial support or customisations regarding the lxml package, please contact me directly.

Have fun,
Stefan

Features added
--------------

* New option kill_tags in lxml.html.clean to remove specific tags and their content (i.e. their whole subtree).
* pi.get() and pi.attrib on processing instructions to parse pseudo-attributes from the text content of processing instructions.
* lxml.get_include() returns a list of include paths that can be used to compile external C code against lxml.etree. This is specifically required for statically linked lxml builds when code needs to compile against the exact same header file versions as lxml itself.
* Resolver.resolve_file() takes an additional option close_file that configures if the file(-like) object will be closed after reading or not. By default, the file will be closed, as the user is not expected to keep a reference to it.

Bugs fixed
----------

* HTML cleaning didn't remove 'data:' links.
* The html5lib parser integration now uses the 'official' implementation in html5lib itself, which makes it work with newer releases of the library.
* In lxml.sax, endElementNS() could incorrectly reject a plain tag name when the corresponding start event inferred the same plain tag name to be in the default namespace.
* When an open file-like object is passed into parse() or iterparse(), the parser will no longer close it after use. This reverts a change in lxml 2.3 where all files would be closed. It is the user's responsibility to properly close the file(-like) object, also in error cases.
* Assertion error in lxml.html.cleaner when discarding top-level elements.
* In lxml.cssselect, use the xpath 'A//B' (short for 'A/descendant-or-self::node()/B') instead of 'A/descendant::B' for the css descendant selector ('A B'). This makes a few edge cases consistent with the selector behavior in WebKit and Firefox, and makes more css expressions valid location paths (for use in xsl:template match).
* In lxml.html, non-selected <option> tags no longer show up in the collected form values.
* Adding/removing <option> values to/from a multiple select form field properly selects them and unselects them.

Other changes
-------------

* Static builds can specify the download directory with the --download-dir option.


Re: [lxml] Multiple input documents
by Evgeny Turnaev 29 Sep '11


Did I ask something wrong or obvious?

2011/9/27 Evgeny Turnaev <turnaev.e(a)gmail.com>:
> 2011/9/27 Stefan Behnel <stefan_ml(a)behnel.de>:
>> Evgeny Turnaev, 27.09.2011 12:09:
>>> Hi.
>>> My question if related to XSLT document() function and processing
>>> multiple input documents in XSLT.
>>>
>>> Currently in our application fetches 3 to 7 separate xml documents merges
>>> all of them into single tree using append() or SubElement and passes merged tree
>>> into XSLT transformation.
>>>
>>> Is it possible in lxml to pass multiple trees into XSLT
>>> transformation and access them
>>> for example using document() function? If so then: will a document accessed by
>>> document() function be parsed for each access?
>>
>> It will be cached during the lifetime of one XSLT execution.
>
> So for the first time it will be parsed? Or i can pass already parsed
> tree using custom resolver?
> Hmm. There seems no method like Resolver.resolve_document()
> What is a result of resolve_string() ? Is it a parsed tree? You
> suggesting to cache
> result of resolve_string() and return cached tree for calls to
> document('my_doc') ?
>
>>
>>> Will i have to save
>>> document to disk to be
>>> able to load in in document() function or i can load already parsed
>>> tree from memory?
>>
>> You can use custom resolvers (see the docs) to pass arbitrary sources into
>> lxml's parsers and XSLT engine.
>>
>>
>>> Will it be faster to use document() than appending 3-4 of 40kb xml
>>> trees and 4-5 small (1kb)?
>>
>> Maybe not, but it depends on what you do. You should benchmark it.
>>
>>
>>> One other reason why i am asking it: we have a lot of merging of the
>>> same tries (<1kb) into different
>>> documents and a few merging of 40kb tries. So i thinked: why cant lxml
>>> use the same tree using document()
>>> instead of explicitly appending it into each xml before transformation.
>>
>> Yes, that sounds like you could simplify your processing. However, if that
>> makes it any faster, cleaner or 'better' by whatever metric, depends
>> entirely on your exact code.
>>
>>
>>> Is there any other any other performance hints?
>>
>> First question: do you really have a performance problem? If so, where?
>>
>> Or is your question more about refactoring the code to keep more of it in
>> XSLT for some design reason?
>
> No we don`t have any performance issues. Our application is IO bound
> (mostly waiting, although in some situations
> fetching is done from memcache (around 2ms) and in this case xslt
> transform time matters).
> Application is a bit chaotic in code and i am taking some
> investigation of how i can rewrite
> the whole thing and maybe also speedup. The profiling says that about
> a half of actual CPU time
> is in xslt transformation (not much in absolute value) and i am
> wondering if i can "cache" subtrees and
> pass them into xslt instead of appending to each xml individually. I
> will surely benchmark. (i think i will be
> faster than tree merging, although maybe less readable and more
> complicated in python part)
>
>> Stefan

--
--------------------------------------------
Турнаев Евгений Викторович
+7 906 875 09 43
--------------------------------------------


Re: [lxml] Fwd: Version 2.3.1 with Python 2.5 windows
by Brandon Goldfedder 28 Sep '11


Christoph,

Thank you very much. This will really help!

-Brandon

On Sep 28, 2011 7:07 PM, "Christoph Gohlke" <cgohlke(a)uci.edu> wrote:
> http://www.lfd.uci.edu/~gohlke/pythonlibs/#lxml
>
> Christoph
>
>
> Brandon Goldfedder <brandon <at> goldfedder.com> writes:
>
>> All,
>> Hitting some issues and thought I would ask. Is there a static windows
>> build for Version 2.3.0 or 2.3.1 with Python 2.5 (need to do some
>> Schematron stuff). easy_install doesn't seem to think so. I tried some
>> steps to setup/build manually but they keep going boom (I tried to use
>> the libxml2 and libxslt directly but think I'm missing a step there as
>> well). If anyone has a pointer to a static windows build, or can point
>> me to the step I am missing in getting it to manually setup it would
>> greatly help. I'm trying to avoid moving from Python2.5.
>> Thanks,
>> Brandon Goldfedder
