[lxml-dev] Building Problems

Hello everybody,

while googling for an easier way to write XML data stores with Python I just found your project. The objectify module especially impressed me, so I wanted to install it to test things. Unfortunately I cannot use the provided Debian package, as there is only one for version 1.03, which does not include the objectify extension. So I downloaded the source of 1.1.1 from codespeak.net and extracted it, and this is what happened:

    # tar -xzf lxml-1.1.1.tgz
    # cd lxml-1.1.1
    # make clean test
    python setup.py build_ext -i
    Building lxml version 1.1
    running build_ext
    python test.py -p -v
    Traceback (most recent call last):
      File "test.py", line 591, in ?
        exitcode = main(sys.argv)
      File "test.py", line 554, in main
        test_cases = get_test_cases(test_files, cfg, tracer=tracer)
      File "test.py", line 254, in get_test_cases
        module = import_module(file, cfg, tracer=tracer)
      File "test.py", line 197, in import_module
        mod = __import__(modname)
      File "/tmp/lxml-1.1/src/lxml/tests/test_objectify.py", line 16, in ?
        from lxml import objectify
    ImportError: /tmp/lxml-1.1/src/lxml/objectify.so: undefined symbol: previousElement
    make: *** [test_inplace] Error 1

It builds just fine, but it crashes when it comes to the tests. There seems to be something missing. I have libxml2 version 2.6.26 and libxslt version 1.1.17 installed.

I investigated the problem a little further with ldd, which gives me:

    # ldd -d src/lxml/objectify.so
    undefined symbol: PyTraceBack_Type (src/lxml/objectify.so)
    undefined symbol: _Py_NoneStruct (src/lxml/objectify.so)
    undefined symbol: PyString_Type (src/lxml/objectify.so)
    undefined symbol: previousElement (src/lxml/objectify.so)
    undefined symbol: PyExc_SystemError (src/lxml/objectify.so)
    undefined symbol: PyExc_ValueError (src/lxml/objectify.so)
    undefined symbol: PyExc_TypeError (src/lxml/objectify.so)
    undefined symbol: PyBaseString_Type (src/lxml/objectify.so)
    undefined symbol: PyExc_IndexError (src/lxml/objectify.so)
    undefined symbol: PyTuple_Type (src/lxml/objectify.so)
    undefined symbol: PyExc_AttributeError (src/lxml/objectify.so)
    undefined symbol: PyClass_Type (src/lxml/objectify.so)
    undefined symbol: nextElement (src/lxml/objectify.so)
    undefined symbol: PyList_Type (src/lxml/objectify.so)
    undefined symbol: PyExc_NotImplementedError (src/lxml/objectify.so)
    undefined symbol: PyBool_Type (src/lxml/objectify.so)
    undefined symbol: PyInstance_Type (src/lxml/objectify.so)
    undefined symbol: PyType_Type (src/lxml/objectify.so)
    undefined symbol: PyObject_GC_Del (src/lxml/objectify.so)
    undefined symbol: PyExc_NameError (src/lxml/objectify.so)
    linux-gate.so.1 => (0xffffe000)
    libexslt.so.0 => /usr/lib/libexslt.so.0 (0xa7eb2000)
    libxslt.so.1 => /usr/lib/libxslt.so.1 (0xa7e80000)
    libxml2.so.2 => /usr/lib/libxml2.so.2 (0xa7d67000)
    libpthread.so.0 => /lib/tls/libpthread.so.0 (0xa7d55000)
    libc.so.6 => /lib/tls/libc.so.6 (0xa7c23000)
    libgcrypt.so.11 => /usr/lib/libgcrypt.so.11 (0xa7bd2000)
    libgpg-error.so.0 => /usr/lib/libgpg-error.so.0 (0xa7bcd000)
    libm.so.6 => /lib/tls/libm.so.6 (0xa7ba8000)
    libdl.so.2 => /lib/tls/libdl.so.2 (0xa7ba4000)
    libz.so.1 => /usr/lib/libz.so.1 (0xa7b90000)
    /lib/ld-linux.so.2 (0x75555000)
    libnsl.so.1 => /lib/tls/libnsl.so.1 (0xa7b7a000)

Can anybody give me a hint where previousElement is defined? Or did I do something wrong? Thanks in advance.

Achim Kern

PS: When responding, please put me in the CC, as I am currently not able to receive mail from mailing lists. Thank you.
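
When reading ldd -d output like the above, it can help to separate the symbols that belong to the CPython C-API from the rest. The Py*/_Py* names are usually supplied by the Python interpreter itself when the module is imported, so they tend to be expected here; the remaining names are the ones worth investigating. A minimal sketch, assuming that prefix rule is a good enough filter (the helper name non_python_symbols is made up for this illustration, and the sample is only a few lines copied from the output above):

```python
import re

# A few lines copied from the ldd -d output quoted in the message above.
LDD_OUTPUT = """\
undefined symbol: PyTraceBack_Type (src/lxml/objectify.so)
undefined symbol: _Py_NoneStruct (src/lxml/objectify.so)
undefined symbol: previousElement (src/lxml/objectify.so)
undefined symbol: PyExc_ValueError (src/lxml/objectify.so)
undefined symbol: nextElement (src/lxml/objectify.so)
undefined symbol: PyObject_GC_Del (src/lxml/objectify.so)
"""

SYMBOL_RE = re.compile(r"undefined symbol:\s+(\S+)")


def non_python_symbols(ldd_output):
    """Return undefined symbols that do not look like CPython C-API names.

    Assumption: names starting with "Py" or "_Py" come from the interpreter.
    """
    symbols = SYMBOL_RE.findall(ldd_output)
    return [s for s in symbols if not s.startswith(("Py", "_Py"))]


print(non_python_symbols(LDD_OUTPUT))
```

For the sample above this leaves previousElement and nextElement, which matches the symbol named in the ImportError.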

Hi, I know we've discussed that before, but I ran into it again: the StringElement class mirrors only a small subset of the native Python string methods. Converting my codebase to lxml.objectify, I now have to substitute every occurrence of <obj>.<some string method> with <obj>.pyval.<some string method>. Automating this depends on developing sed magic for all of these methods.

I now wonder if it wouldn't make more sense to add the string methods (*) to StringElement instead of putting work into the substitution scripts, making StringElement much more feature-rich. I don't think that having the methods has any negative impact on StringElement regarding child access, as
- data elements should normally not have child elements (although this is not enforced)
- you can always use the find() method to access child elements whose names conflict with element methods.

I could add that code and tests for it and post a patch. The one reason not to do this might be the rationale "when operating on objectify string elements, always use <obj>.pyval", which avoids the additional code. But then again, I think all these methods would just return _strValueOf(self).<method>(...), thus being rather maintenance-robust. Stefan?

Holger

(*) minus the sequence protocol stuff, of course

The contents of this e-mail are confidential. If you are not the named addressee or if this transmission has been addressed to you in error, please notify the sender immediately and then delete this e-mail. 
Any unauthorized copying and transmission is forbidden. E-Mail transmission cannot be guaranteed to be secure. If verification is required, please request a hard copy version.
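
The delegation idea in the message above can be sketched in plain Python. This is only an illustration under stated assumptions: StringValue is a hypothetical stand-in class, not lxml code, and it forwards through __getattr__ rather than through the _strValueOf() helper mentioned in the message.

```python
class StringValue(object):
    """Hypothetical stand-in for an objectify string element (not lxml code)."""

    def __init__(self, text):
        self.text = text

    @property
    def pyval(self):
        # Stand-in for the .pyval property: the plain Python string value.
        return self.text

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails. Forward ordinary
        # str methods to the value, but leave special names (and with them
        # the sequence protocol) alone.
        if name.startswith("__") or not hasattr(str, name):
            raise AttributeError(name)
        return getattr(self.pyval, name)


s = StringValue("  what  ")
print(s.strip().upper())        # delegated string methods
print(s.pyval.strip().upper())  # the explicit spelling via .pyval
```

Both spellings give the same result here. In objectify itself, attribute access is also how child elements are looked up, which is why the message points to find() for children whose names collide with method names.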

Hi Holger, Holger Joukl wrote:
Feel free to figure out which methods make sense (and can be supported) and then post a list or even a patch if you like.
Maintenance is not my main concern. The problem is that we provide an incomplete interface here, so it's "kinda compatible, but not quite", which I consider worse than "no string methods there". I fear that the choice of methods may look too arbitrary to understand. But as I said, feel free to convince me. Stefan

Hi, Stefan Behnel <behnel_ml@gkec.informatik.tu-darmstadt.de> wrote on 27.09.2006 18:05:03:
Maintenance is not my main concern. The problem is that we provide an incomplete interface here, so it's "kinda compatible, but not quite",
which I
I've experimented with that some more and came to think you're right. It's more of a documentation problem than maintenance and it is a lot more concise to have "wanna use string methods, use .pyval" than having a bunch of supported and some unsupported string methods. Greetings, Holger

Hi, I'm currently running into some optimization issues. Be warned, this post is rather lengthy...

First some background: I'm experimenting with a custom objectified datetime class based on Python's datetime that employs the dateutil.parser module to detect whether an element value is in a valid datetime format, i.e. the parse function from dateutil.parser is used to implement the type_check for the PyType type registry.

1) Invoking this parse method is quite expensive, so I want it to happen rarely. As I am using "recursive element dumping" by default, I found that every __str__ call accesses .pyval of the ObjectifiedDataElements in a tree, which in turn triggers parsing for my custom datetime class. As I don't really see a way to avoid this, I propose introducing an additional property "_pyval_repr" that can be overridden in subclasses, which makes it possible to simply return element.text if getting .pyval is expensive. Something like:

    *** ORIG/lxml-1.1/src/lxml/objectify.pyx Wed Sep 27 09:18:30 2006
    --- src/lxml/objectify.pyx Wed Oct 4 11:00:09 2006
    ***************
    *** 484,489 ****
    --- 484,493 ----
              def __get__(self):
                  return textOf(self._c_node)
    +     property _pyval_repr:
    +         def __get__(self):
    +             return self.pyval
    +
          def __str__(self):
              return textOf(self._c_node) or ''
    ***************
    *** 931,938 ****
      cdef object _dump(_Element element, int indent):
          indentstr = " " * indent
    !     if hasattr(element, "pyval"):
    !         value = element.pyval
          else:
              value = textOf(element._c_node)
          if value and not value.strip():
    --- 935,942 ----
      cdef object _dump(_Element element, int indent):
          indentstr = " " * indent
    !     if hasattr(element, "_pyval_repr"):
    !         value = element._pyval_repr
          else:
              value = textOf(element._c_node)
          if value and not value.strip():

This can substantially speed things up for complicated type_check routines (in my use case :)

2) Then I figured I would reduce the calls to ObjectifiedElement.__str__ in general. I am using a custom logging module that contains a function which converts its input arguments to strings, concatenates them and then writes them out through the logger (which substitutes stdout) if the log level of the caller meets the log level set for the output file/stdout. As the conversion to strings is performed before any log level checking, reversing this order leads to far fewer str() calls on the objects. To my astonishment, things actually slowed down massively, though. I tried to come up with a minimal example of what seems to happen, using only standard lxml:

Runs slow:
==========

    python2.4 -m timeit -v -s"""
    from lxml import etree
    from lxml import objectify
    parser = etree.XMLParser(remove_blank_text=True)
    lookup = etree.ElementNamespaceClassLookup(objectify.ObjectifyElementClassLookup())
    parser.setElementClassLookup(lookup)
    objectify.setDefaultParser(parser)
    objectify.enableRecursiveStr()
    root = objectify.Element('root')
    root.i = 17
    root.f = 238.3343
    root.s = 'what'
    root.d = '2006-03-03'
    print root.i
    print root.f
    print root.s
    print root.d
    """ "n = root.i; n = root.f; n = root.s; n = root.d"
    17
    238.3343
    what
    2006-03-03
    10 loops -> 0.0102 secs
    17
    238.3343
    what
    2006-03-03
    100 loops -> 0.101 secs
    17
    238.3343
    what
    2006-03-03
    1000 loops -> 1.02 secs
    17
    238.3343
    what
    2006-03-03
    17
    238.3343
    what
    2006-03-03
    17
    238.3343
    what
    2006-03-03
    raw times: 1.03 1.02 1.02
    1000 loops, best of 3: 1.02 msec per loop

Runs fast:
==========

    python2.4 -m timeit -v -s"""
    from lxml import etree
    from lxml import objectify
    parser = etree.XMLParser(remove_blank_text=True)
    lookup = etree.ElementNamespaceClassLookup(objectify.ObjectifyElementClassLookup())
    parser.setElementClassLookup(lookup)
    objectify.setDefaultParser(parser)
    objectify.enableRecursiveStr()
    root = objectify.Element('root')
    root.i = 17
    root.f = 238.3343
    root.s = 'what'
    root.d = '2006-03-03'
    print root
    """ "n = root.i; n = root.f; n = root.s; n = root.d"
    root = None [ObjectifiedElement]
        i = 17 [IntElement]
        f = 238.33430000000001 [FloatElement]
        s = 'what' [StringElement]
        d = '2006-03-03' [StringElement]
    10 loops -> 0.00109 secs
    root = None [ObjectifiedElement]
        i = 17 [IntElement]
        f = 238.33430000000001 [FloatElement]
        s = 'what' [StringElement]
        d = '2006-03-03' [StringElement]
    100 loops -> 0.00928 secs
    root = None [ObjectifiedElement]
        i = 17 [IntElement]
        f = 238.33430000000001 [FloatElement]
        s = 'what' [StringElement]
        d = '2006-03-03' [StringElement]
    1000 loops -> 0.0897 secs
    root = None [ObjectifiedElement]
        i = 17 [IntElement]
        f = 238.33430000000001 [FloatElement]
        s = 'what' [StringElement]
        d = '2006-03-03' [StringElement]
    10000 loops -> 0.905 secs
    root = None [ObjectifiedElement]
        i = 17 [IntElement]
        f = 238.33430000000001 [FloatElement]
        s = 'what' [StringElement]
        d = '2006-03-03' [StringElement]
    root = None [ObjectifiedElement]
        i = 17 [IntElement]
        f = 238.33430000000001 [FloatElement]
        s = 'what' [StringElement]
        d = '2006-03-03' [StringElement]
    root = None [ObjectifiedElement]
        i = 17 [IntElement]
        f = 238.33430000000001 [FloatElement]
        s = 'what' [StringElement]
        d = '2006-03-03' [StringElement]
    raw times: 0.893 0.911 0.911
    10000 loops, best of 3: 89.3 usec per loop

Recursively outputting root before accessing its child elements really speeds things up, even though I accessed all elements in the slow example, too. Why is this? I'm clueless.

Holger
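
The first proposal in this message, an overridable "_pyval_repr" hook, can be sketched in plain Python. This is a minimal illustration under stated assumptions: DataElement, DateTimeElement and dump() are hypothetical stand-ins rather than lxml classes, and datetime.strptime() stands in for the dateutil.parser call mentioned above.

```python
from datetime import datetime


class DataElement(object):
    """Hypothetical stand-in for ObjectifiedDataElement (not lxml code)."""

    def __init__(self, text):
        self.text = text

    @property
    def pyval(self):
        return self.text

    @property
    def _pyval_repr(self):
        # Default behaviour: a dump shows the converted value.
        return self.pyval


class DateTimeElement(DataElement):
    parse_calls = 0  # counts how often the expensive conversion runs

    @property
    def pyval(self):
        DateTimeElement.parse_calls += 1
        # Stand-in for an expensive parse such as dateutil.parser.parse().
        return datetime.strptime(self.text, "%Y-%m-%d")

    @property
    def _pyval_repr(self):
        # Override: show the raw text, so dumping never triggers the parse.
        return self.text


def dump(elements):
    """Simplified dump: one 'value [ClassName]' line per element."""
    return "\n".join(
        "%s [%s]" % (e._pyval_repr, type(e).__name__) for e in elements
    )


elements = [DataElement("17"), DateTimeElement("2006-03-03")]
print(dump(elements))
print(DateTimeElement.parse_calls)  # still zero: dumping did not parse
```

The conversion then only runs when .pyval is requested explicitly, which is the effect the patch above aims for.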

Hi, I ran into a problem using the objectify DataElement factory function. When implementing an _init method in a derived ObjectifiedDataElement class, it is impossible to access the element.text in _init because this has not yet been set when _init gets called by _elementFactory. Don't see a nice clean way to solve that. Maybe instrument _elementFactory with an optional skip_init argument that allows for a delayed manual call of _init in corner cases? Holger

Hi Holger, Holger Joukl wrote:
True, that's a problem.
Not a good idea, as it is rarely used. I already thought about adding a public C-API function for creating elements a while ago, that takes all necessary parameters including the text content. I think that's the cleanest solution. Stefan

Hi again, Holger Joukl wrote:
etree's C-API now has a new makeElement() function that creates an _Element straight through with everything it can carry: attributes, text, tail and a prefix mapping, either for an existing _Document or by creating a new document also. Objectify uses it to overcome the above problem. Stefan
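
The ordering problem and its fix can be shown with a small pure-Python model. Everything here is a hypothetical stand-in: the real makeElement() is a C-level function whose exact signature is not shown in this thread, so make_element() below only mirrors the idea of passing all content in before the _init() hook runs.

```python
class Element(object):
    """Hypothetical stand-in for an element class with an _init() hook."""

    def __init__(self):
        self.text = None
        self.tail = None
        self.attrib = {}

    def _init(self):
        """Hook for subclasses; the factory calls it once."""


class TextAwareElement(Element):
    def _init(self):
        # Needs the text content, so the text must already be set.
        self.seen_text = self.text


def factory_init_first(cls, text):
    # The problematic order: the hook runs before the content is filled in.
    element = cls()
    element._init()
    element.text = text
    return element


def make_element(cls, text=None, tail=None, attrib=None):
    # The fixed order: all content is set first, then the hook runs.
    element = cls()
    element.text = text
    element.tail = tail
    element.attrib.update(attrib or {})
    element._init()
    return element


print(factory_init_first(TextAwareElement, "2006-03-03").seen_text)
print(make_element(TextAwareElement, text="2006-03-03").seen_text)
```

With the first factory the hook sees no text at all; with the second it sees the value that was passed in.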

Hi, as a follow-up to my last post, some more strange observations. To find out why the call to str(root), aka objectify.dump(root), speeds things up:

    python2.4 -m timeit -v -s"""
    from lxml import etree
    from lxml import objectify
    parser = etree.XMLParser(remove_blank_text=True)
    lookup = etree.ElementNamespaceClassLookup(objectify.ObjectifyElementClassLookup())
    parser.setElementClassLookup(lookup)
    objectify.setDefaultParser(parser)
    objectify.enableRecursiveStr()
    root = objectify.Element('root')
    root.i = 17
    root.f = 238.3343
    root.s = 'what'
    root.d = '2006-03-03'
    objectify.dump(root)
    """ "n = root.i; n = root.f; n = root.s; n = root.d"
    10 loops -> 0.000898 secs
    100 loops -> 0.00887 secs
    1000 loops -> 0.0885 secs
    10000 loops -> 0.887 secs
    raw times: 0.893 0.899 0.903
    10000 loops, best of 3: 89.3 usec per loop

I implemented a visit function that does nothing more than visit every node:

    def visit(_Element element not None):
        """Return a recursively generated string representation of an element.
        """
        _visit(element)

    cdef object _visit(_Element element):
        for child in element.iterchildren():
            _visit(child)

But:

    /apps/pydev/gcc/3.4.4/bin/python2.4 -m timeit -v -s"""
    from lxml import etree
    from lxml import objectify
    parser = etree.XMLParser(remove_blank_text=True)
    lookup = etree.ElementNamespaceClassLookup(objectify.ObjectifyElementClassLookup())
    parser.setElementClassLookup(lookup)
    objectify.setDefaultParser(parser)
    objectify.enableRecursiveStr()
    root = objectify.Element('root')
    root.i = 17
    root.f = 238.3343
    root.s = 'what'
    root.d = '2006-03-03'
    objectify.visit(root)
    """ "n = root.i; n = root.f; n = root.s; n = root.d"
    10 loops -> 0.0104 secs
    100 loops -> 0.103 secs
    1000 loops -> 1.04 secs
    raw times: 1.04 1.02 1.03
    1000 loops, best of 3: 1.02 msec per loop

This is actually much slower again. Now if I change the visit code to:

    def visit(_Element element not None):
        """Return a recursively generated string representation of an element.
        """
        _visit(element)

    cdef object _visit(_Element element):
        element.items() # my only addition
        for child in element.iterchildren():
            _visit(child)

Now it's fast again:

    python2.4 -m timeit -v -s"""
    from lxml import etree
    from lxml import objectify
    parser = etree.XMLParser(remove_blank_text=True)
    lookup = etree.ElementNamespaceClassLookup(objectify.ObjectifyElementClassLookup())
    parser.setElementClassLookup(lookup)
    objectify.setDefaultParser(parser)
    objectify.enableRecursiveStr()
    root = objectify.Element('root')
    root.i = 17
    root.f = 238.3343
    root.s = 'what'
    root.d = '2006-03-03'
    objectify.visit(root)
    """ "n = root.i; n = root.f; n = root.s; n = root.d"
    10 loops -> 0.000887 secs
    100 loops -> 0.0087 secs
    1000 loops -> 0.088 secs
    10000 loops -> 0.874 secs
    raw times: 0.876 0.865 0.87
    10000 loops, best of 3: 86.5 usec per loop

All of this because of the additional element.items()??? I'm lost. I hope somebody can point out a serious misunderstanding of mine, show where my systematic testing error lies, or come up with an actual explanation :) As I'm abroad next week, I'll follow up on this on Tuesday in a week.

Greetings, Holger

Hi Holger, first of all: please create a new thread for a new topic instead of responding to an existing message. Most mail clients honour the "in reply to" hint in the header and sort such messages into the old thread. Then: what you observe are most likely GC 'issues'. The thing is: if the element already exists as a Python object, it is reused, which is much faster than creating a new one. So in the cases where your code runs faster, you can assume that the object survived a larger portion of your code without being re-instantiated. Recursive printing in particular instantiates the entire tree, so if the objects are not deleted directly afterwards, this has a performance effect on code that runs afterwards. Stefan
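
The reuse effect described above can be modelled in plain Python. This is only a sketch under stated assumptions: Proxy, get_proxy() and access_all() are hypothetical names, a weakref.WeakValueDictionary stands in for whatever bookkeeping lxml uses internally, and the counts rely on CPython freeing an object as soon as its last reference goes away.

```python
import weakref


class Proxy(object):
    """Hypothetical Python wrapper around a tree node (not lxml code)."""

    created = 0  # how many wrappers have been built so far

    def __init__(self, name):
        Proxy.created += 1
        self.name = name


# Wrappers stay in this table only while something else references them.
_live = weakref.WeakValueDictionary()


def get_proxy(name):
    """Reuse the live wrapper for a node, or build a new one."""
    proxy = _live.get(name)
    if proxy is None:
        proxy = Proxy(name)
        _live[name] = proxy
    return proxy


def access_all(names, rounds):
    """Access every node `rounds` times; return how many wrappers were built."""
    before = Proxy.created
    for _ in range(rounds):
        for name in names:
            n = get_proxy(name)  # rebinding n drops the previous wrapper
    return Proxy.created - before


names = ["i", "f", "s", "d"]
cold = access_all(names, 10)              # nothing keeps the wrappers alive
keep = [get_proxy(nm) for nm in names]    # hold references, like a full dump
warm = access_all(names, 10)              # every access finds a live wrapper
print(cold, warm)
```

In this model, the loop that mirrors "n = root.i; n = root.f; ..." rebuilds a wrapper on every access unless something else still holds the objects.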

Hi Achim, Achim Kern wrote:
1.1? Not 1.1.1?
running build_ext python test.py -p -v
You did build it, right? I assume this is a second try after already having built it once. Did you do "make clean" in between? That removes the ".c" files, which means you need a special Pyrex version to rebuild it. See "doc/build.txt". If you only unpack the tgz and build from that, you should not need Pyrex as the ".c" files are included. Please retry the above with a clean setup and if that still fails, send a complete copy of your attempted commands and the resulting output to the list. Stefan

Stefan Behnel wrote:
That sounds as though it should be the result of 'make distclean'; 'make clean' should remove .o files, etc., but leave the tree in a state in which 'make' can be re-run safely. Tres. -- Tres Seaver +1 202-558-7113 tseaver@palladion.com Palladion Software "Excellence by Design" http://palladion.com

Hi Tres, Tres Seaver wrote:
Well, it can be re-run safely, if you have Pyrex installed. The thing is, as long as you don't modify anything in the sources, there is not much of a reason to run "make clean", but if you make changes, you need Pyrex anyway to regenerate the ".c" files. So there is not much to win from distinguishing "make clean" and "make distclean". Stefan

Hi Stefan, thanks for your rapid answer. As I wasn't in the office until today, I wasn't able to answer earlier. Sorry for this. behnel_ml@gkec.informatik.tu-darmstadt.de writes:
I tried both, but neither seemed to work for me.
I assume not. :-(
I tested it with a clean version which I downloaded, and it builds like a dream. I am really not clear what happened the first time. Maybe it was because I messed something up with the Debian package which I had installed. Sorry for wasting your time. Regards, Achim

Hi, I know we've discussed that before but I ran into it again: The StringElement class mirrors only a small subset of native python string methods. Converting my codebase to lxml.objectify I now have to substitute every occurance of <obj>.<some string method> with <obj>.pyval.<some string method>. Automating this depends on developing sed magic for all of these methods. I now wonder if it wouldn't make sense to rather add the string methods (*) to StringElement instead of putting work into developing the substitution scripts, making StringElement much more feature rich. I don't think that having the methods has any negative impact on StringElement regarding child access as - data elements should normally not have child elements (although this is not enforced) - you can always access child elements with names that conflict with element methods using the find() method. I could add that code and tests for it and post a patch. The one reason to not do this might be the rationale "when operating on objectify string elements always use <obj>.pyval" and avoid the additional code. But then again I think all these methods would rather just return _strValueOf(self).<method>(...), thus being rather maintenance-robust. Stefan? Holger (*) minus the sequence protocol stuff, of course Der Inhalt dieser E-Mail ist vertraulich. Falls Sie nicht der angegebene Empfänger sind oder falls diese E-Mail irrtümlich an Sie adressiert wurde, verständigen Sie bitte den Absender sofort und löschen Sie die E-Mail sodann. Das unerlaubte Kopieren sowie die unbefugte Übermittlung sind nicht gestattet. Die Sicherheit von Übermittlungen per E-Mail kann nicht garantiert werden. Falls Sie eine Bestätigung wünschen, fordern Sie bitte den Inhalt der E-Mail als Hardcopy an. The contents of this e-mail are confidential. If you are not the named addressee or if this transmission has been addressed to you in error, please notify the sender immediately and then delete this e-mail. 
Any unauthorized copying and transmission is forbidden. E-Mail transmission cannot be guaranteed to be secure. If verification is required, please request a hard copy version.

Hi Holger, Holger Joukl wrote:
Feel free to figure out which methods make sense (and can be supported) and then post a list or even a patch if you like.
Maintenance is not my main concern. The problem is that we provide an incomplete interface here, so it's "kinda compatible, but not quite", which I consider worse than "no string methods there". I fear that the choice of methods may look too arbitrary to understand. But as I said, feel free to convince me. Stefan

Hi, Stefan Behnel <behnel_ml@gkec.informatik.tu-darmstadt.de> schrieb am 27.09.2006 18:05:03:
Maintenance is not my main concern. The problem is that we provide an incomplete interface here, so it's "kinda compatible, but not quite",
which I
I've experimented with that some more and came to think you're right. It's more of a documentation problem than maintenance and it is a lot more concise to have "wanna use string methods, use .pyval" than having a bunch of supported and some unsupported string methods. Greetings, Holger Der Inhalt dieser E-Mail ist vertraulich. Falls Sie nicht der angegebene Empfänger sind oder falls diese E-Mail irrtümlich an Sie adressiert wurde, verständigen Sie bitte den Absender sofort und löschen Sie die E-Mail sodann. Das unerlaubte Kopieren sowie die unbefugte Übermittlung sind nicht gestattet. Die Sicherheit von Übermittlungen per E-Mail kann nicht garantiert werden. Falls Sie eine Bestätigung wünschen, fordern Sie bitte den Inhalt der E-Mail als Hardcopy an. The contents of this e-mail are confidential. If you are not the named addressee or if this transmission has been addressed to you in error, please notify the sender immediately and then delete this e-mail. Any unauthorized copying and transmission is forbidden. E-Mail transmission cannot be guaranteed to be secure. If verification is required, please request a hard copy version.

Hi, I'm currently running into some optimization issues. Be warned this post is rather lenghty... First some background: I'm experimenting with a custom objectified datetime class based on Python's datetime that employs the dateutil.parser module to detect if some element value is in a valid datetime format, i.e. the parse function from dateutil.parser is used to implement the type_check for the PyType type registry. 1) Invoking this parse method is quite expensive, so I want this to happen rarely. As I am using "recursive element dumping" as default I found that for every __str__ call .pyval of the ObjectifiedDataElements in a tree is accessed, which in turn triggers parsing for my custom datetime class. As I don't really see a way to avoid this I propose the introduction of an additional property "_pyval_repr" that can be overridden in subclasses, which makes it possible to simply return element.text, if getting .pyval is expensive. S.th. like: *** ORIG/lxml-1.1/src/lxml/objectify.pyx Wed Sep 27 09:18:30 2006 --- src/lxml/objectify.pyx Wed Oct 4 11:00:09 2006 *************** *** 484,489 **** --- 484,493 ---- def __get__(self): return textOf(self._c_node) + property _pyval_repr: + def __get__(self): + return self.pyval + def __str__(self): return textOf(self._c_node) or '' *************** *** 931,938 **** cdef object _dump(_Element element, int indent): indentstr = " " * indent ! if hasattr(element, "pyval"): ! value = element.pyval else: value = textOf(element._c_node) if value and not value.strip(): --- 935,942 ---- cdef object _dump(_Element element, int indent): indentstr = " " * indent ! if hasattr(element, "_pyval_repr"): ! value = element._pyval_repr else: value = textOf(element._c_node) if value and not value.strip(): This can substantially speed up things for complicated type_check routines (in my usecase :) 2) Then, I figured to reduce the calls to ObjectifiedElement.__str__ in general. 
I am using a custom logging module that implies a function that converts its input arguments to strings, concatenates them and then writes them out through the logger (which substitutes stdout) if the loglevel of the caller meets the set loglevel for the output file/stdout. As the conversion to strings is performed before any loglevel checking, reversing this order leads to a lot less str() calls on the objects. To my astonishment things actually slowed massively down, though. I tried to come up with a minimal example of what seems to happen, using only lxml standard: Runs slow: ========== python2.4 -m timeit -v -s""" from lxml import etree from lxml import objectify parser = etree.XMLParser(remove_blank_text=True) lookup = etree.ElementNamespaceClassLookup(objectify.ObjectifyElementClassLookup()) parser.setElementClassLookup(lookup) objectify.setDefaultParser(parser) objectify.enableRecursiveStr() root = objectify.Element('root') root.i = 17 root.f = 238.3343 root.s = 'what' root.d = '2006-03-03' print root.i print root.f print root.s print root.d """ "n = root.i; n = root.f; n = root.s; n = root.d" 17 238.3343 what 2006-03-03 10 loops -> 0.0102 secs 17 238.3343 what 2006-03-03 100 loops -> 0.101 secs 17 238.3343 what 2006-03-03 1000 loops -> 1.02 secs 17 238.3343 what 2006-03-03 17 238.3343 what 2006-03-03 17 238.3343 what 2006-03-03 raw times: 1.03 1.02 1.02 1000 loops, best of 3: 1.02 msec per loop Runs fast: ========== python2.4 -m timeit -v -s""" from lxml import etree from lxml import objectify parser = etree.XMLParser(remove_blank_text=True) lookup = etree.ElementNamespaceClassLookup(objectify.ObjectifyElementClassLookup()) parser.setElementClassLookup(lookup) objectify.setDefaultParser(parser) objectify.enableRecursiveStr() root = objectify.Element('root') root.i = 17 root.f = 238.3343 root.s = 'what' root.d = '2006-03-03' print root """ "n = root.i; n = root.f; n = root.s; n = root.d" root = None [ObjectifiedElement] i = 17 [IntElement] f = 
238.33430000000001 [FloatElement] s = 'what' [StringElement] d = '2006-03-03' [StringElement] 10 loops -> 0.00109 secs root = None [ObjectifiedElement] i = 17 [IntElement] f = 238.33430000000001 [FloatElement] s = 'what' [StringElement] d = '2006-03-03' [StringElement] 100 loops -> 0.00928 secs root = None [ObjectifiedElement] i = 17 [IntElement] f = 238.33430000000001 [FloatElement] s = 'what' [StringElement] d = '2006-03-03' [StringElement] 1000 loops -> 0.0897 secs root = None [ObjectifiedElement] i = 17 [IntElement] f = 238.33430000000001 [FloatElement] s = 'what' [StringElement] d = '2006-03-03' [StringElement] 10000 loops -> 0.905 secs root = None [ObjectifiedElement] i = 17 [IntElement] f = 238.33430000000001 [FloatElement] s = 'what' [StringElement] d = '2006-03-03' [StringElement] root = None [ObjectifiedElement] i = 17 [IntElement] f = 238.33430000000001 [FloatElement] s = 'what' [StringElement] d = '2006-03-03' [StringElement] root = None [ObjectifiedElement] i = 17 [IntElement] f = 238.33430000000001 [FloatElement] s = 'what' [StringElement] d = '2006-03-03' [StringElement] raw times: 0.893 0.911 0.911 10000 loops, best of 3: 89.3 usec per loop Recursively outputting root before accessing its child elements really speeds things up, even though I accessed all elements in the slow example, too. Why is this? I'm clueless. Holger Der Inhalt dieser E-Mail ist vertraulich. Falls Sie nicht der angegebene Empfänger sind oder falls diese E-Mail irrtümlich an Sie adressiert wurde, verständigen Sie bitte den Absender sofort und löschen Sie die E-Mail sodann. Das unerlaubte Kopieren sowie die unbefugte Übermittlung sind nicht gestattet. Die Sicherheit von Übermittlungen per E-Mail kann nicht garantiert werden. Falls Sie eine Bestätigung wünschen, fordern Sie bitte den Inhalt der E-Mail als Hardcopy an. The contents of this e-mail are confidential. 
If you are not the named addressee or if this transmission has been addressed to you in error, please notify the sender immediately and then delete this e-mail. Any unauthorized copying and transmission is forbidden. E-Mail transmission cannot be guaranteed to be secure. If verification is required, please request a hard copy version.
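
The reordering described in the post above (check the loglevel first, convert the arguments to strings only if the message will actually be written) could be sketched roughly as follows. This is only an illustration, not the custom logging module from the post: the names `log`, `THRESHOLD` and `Counted` are made up for the example, and it uses Python 3 syntax rather than the Python 2.4 of the thread.

```python
import io
import sys

# Hypothetical levels and threshold; the real module's API is not shown in the post.
DEBUG, INFO, WARNING = 10, 20, 30
THRESHOLD = INFO

def log(level, *args, out=sys.stdout):
    """Write args to out, converting them to strings only when needed."""
    if level < THRESHOLD:
        return False  # cheap exit: no str() calls at all
    text = "".join(str(arg) for arg in args)
    out.write(text + "\n")
    return True

class Counted:
    """Helper object that counts how often it is converted to a string."""
    calls = 0

    def __str__(self):
        Counted.calls += 1
        return "<counted>"

buf = io.StringIO()
obj = Counted()
log(DEBUG, "value: ", obj, out=buf)    # below the threshold: obj is never stringified
log(WARNING, "value: ", obj, out=buf)  # at or above the threshold: one str() call
```

With the level check first, an expensive `__str__` (such as a recursive dump of an objectify tree) only runs for messages that are actually written.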

Hi,

I ran into a problem using the objectify DataElement factory function. When implementing an _init method in a derived ObjectifiedDataElement class, it is impossible to access element.text in _init, because the text has not yet been set when _init gets called by _elementFactory.

I don't see a clean way to solve that. Maybe instrument _elementFactory with an optional skip_init argument that allows for a delayed manual call of _init in corner cases?

Holger

Hi Holger, Holger Joukl wrote:
True, that's a problem.
Not a good idea, as it is rarely used. A while ago, I already thought about adding a public C-API function for creating elements that takes all necessary parameters, including the text content. I think that's the cleanest solution. Stefan

Hi again, Holger Joukl wrote:
etree's C-API now has a new makeElement() function that creates an _Element in one step with everything it can carry: attributes, text, tail and a prefix mapping, either for an existing _Document or by creating a new document. Objectify uses it to overcome the above problem. Stefan
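
The ordering problem discussed in this sub-thread can be pictured with a small pure-Python toy model. It does not mirror lxml's internals or the signature of makeElement(); the class `DataElement` and the two factory functions are hypothetical names chosen for the illustration. It only shows why handing the text to the factory, so that it is set before the _init hook runs, makes the text visible inside the hook.

```python
class DataElement:
    """Toy stand-in for an element class with a user-defined _init hook."""

    def __init__(self):
        self.text = None
        self.seen_by_init = "unset"

    def _init(self):
        # User hook: wants to look at the text content.
        self.seen_by_init = self.text

def factory_then_set_text(cls, text):
    """Problematic order: the hook runs before the text exists."""
    element = cls()
    element._init()
    element.text = text
    return element

def factory_with_text(cls, text):
    """Fixed order: content first, hook second."""
    element = cls()
    element.text = text
    element._init()
    return element

early = factory_then_set_text(DataElement, "17")
late = factory_with_text(DataElement, "17")
```

Here `early.seen_by_init` is still None, while `late.seen_by_init` holds the text that was passed to the factory.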

Hi,

as a follow-up to my last post, some more strange observations. To find out why the call to str(root), aka objectify.dump(root), speeds things up:

python2.4 -m timeit -v -s"""
from lxml import etree
from lxml import objectify
parser = etree.XMLParser(remove_blank_text=True)
lookup = etree.ElementNamespaceClassLookup(objectify.ObjectifyElementClassLookup())
parser.setElementClassLookup(lookup)
objectify.setDefaultParser(parser)
objectify.enableRecursiveStr()
root = objectify.Element('root')
root.i = 17
root.f = 238.3343
root.s = 'what'
root.d = '2006-03-03'
objectify.dump(root)
""" "n = root.i; n = root.f; n = root.s; n = root.d"
10 loops -> 0.000898 secs
100 loops -> 0.00887 secs
1000 loops -> 0.0885 secs
10000 loops -> 0.887 secs
raw times: 0.893 0.899 0.903
10000 loops, best of 3: 89.3 usec per loop

I implemented a visit function that does nothing more than visit every node:

def visit(_Element element not None):
    """Recursively visit every node below an element.
    """
    _visit(element)

cdef object _visit(_Element element):
    for child in element.iterchildren():
        _visit(child)

But:

/apps/pydev/gcc/3.4.4/bin/python2.4 -m timeit -v -s"""
from lxml import etree
from lxml import objectify
parser = etree.XMLParser(remove_blank_text=True)
lookup = etree.ElementNamespaceClassLookup(objectify.ObjectifyElementClassLookup())
parser.setElementClassLookup(lookup)
objectify.setDefaultParser(parser)
objectify.enableRecursiveStr()
root = objectify.Element('root')
root.i = 17
root.f = 238.3343
root.s = 'what'
root.d = '2006-03-03'
objectify.visit(root)
""" "n = root.i; n = root.f; n = root.s; n = root.d"
10 loops -> 0.0104 secs
100 loops -> 0.103 secs
1000 loops -> 1.04 secs
raw times: 1.04 1.02 1.03
1000 loops, best of 3: 1.02 msec per loop

This is actually much slower again. Now if I change the visit code to:

def visit(_Element element not None):
    """Recursively visit every node below an element.
    """
    _visit(element)

cdef object _visit(_Element element):
    element.items()  # my only addition
    for child in element.iterchildren():
        _visit(child)

Now it's fast again:

python2.4 -m timeit -v -s"""
from lxml import etree
from lxml import objectify
parser = etree.XMLParser(remove_blank_text=True)
lookup = etree.ElementNamespaceClassLookup(objectify.ObjectifyElementClassLookup())
parser.setElementClassLookup(lookup)
objectify.setDefaultParser(parser)
objectify.enableRecursiveStr()
root = objectify.Element('root')
root.i = 17
root.f = 238.3343
root.s = 'what'
root.d = '2006-03-03'
objectify.visit(root)
""" "n = root.i; n = root.f; n = root.s; n = root.d"
10 loops -> 0.000887 secs
100 loops -> 0.0087 secs
1000 loops -> 0.088 secs
10000 loops -> 0.874 secs
raw times: 0.876 0.865 0.87
10000 loops, best of 3: 86.5 usec per loop

All of this because of the additional element.items()??? I'm lost. I hope somebody can point out a serious misunderstanding of mine, show me where my systematic testing error lies, or come up with an actual explanation :)

As I'm abroad next week, I'll follow up on this a week from Tuesday.

Greetings,
Holger

Hi Holger,

first of all: please create a new thread for a new topic instead of responding to an existing message. Most mail clients honour the "in reply to" hint in the header and sort such messages into the old thread.

Then: what you observe are most likely GC 'issues'. The thing is: if the element already exists as a Python object, it is reused, which is much faster than creating a new one. So in the cases where your code runs faster, you can assume that the object survived a larger portion of your code without being re-instantiated. Recursive printing in particular instantiates the entire tree, so if the objects are not deleted directly afterwards, this has a performance effect on code that runs afterwards.

Stefan
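
The reuse effect described in the reply above can be imitated with a small pure-Python model. This is a toy under stated assumptions, not lxml's actual mechanism: `Node`, `Proxy`, `get_proxy` and the weak-value registry are hypothetical stand-ins, and the counts rely on CPython's immediate reference counting.

```python
import weakref

class Node:
    """Stands in for a C-level tree node (hypothetical)."""

    def __init__(self, tag):
        self.tag = tag

class Proxy:
    """Stands in for the Python object that wraps a node."""
    created = 0

    def __init__(self, node):
        Proxy.created += 1
        self.node = node

# Registry of proxies that are currently alive; entries vanish once
# the last strong reference to a proxy is dropped.
_live = weakref.WeakValueDictionary()

def get_proxy(node):
    """Return the live proxy for node, creating one only if none exists."""
    proxy = _live.get(id(node))
    if proxy is None:
        proxy = Proxy(node)
        _live[id(node)] = proxy
    return proxy

node = Node("i")

# No reference is kept: every access builds a fresh proxy.
for _ in range(3):
    get_proxy(node)
created_without_ref = Proxy.created

# A reference is kept: later accesses reuse the live proxy.
Proxy.created = 0
keep = get_proxy(node)
for _ in range(3):
    assert get_proxy(node) is keep
created_with_ref = Proxy.created
```

In this model, anything that keeps the wrapper objects alive for a while turns later accesses into cheap lookups, which is the kind of effect the timings in the thread suggest.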

Hi Achim, Achim Kern wrote:
1.1? Not 1.1.1?
running build_ext python test.py -p -v
You did build it, right? I assume this is a second try after already having built it once. Did you do "make clean" in between? That removes the ".c" files, which means you need a special Pyrex version to rebuild it. See "doc/build.txt". If you only unpack the tgz and build from that, you should not need Pyrex as the ".c" files are included. Please retry the above with a clean setup and if that still fails, send a complete copy of your attempted commands and the resulting output to the list. Stefan

Stefan Behnel wrote:
That sounds as though it should be the result of 'make distclean'; 'make clean' should remove .o files, etc., but leave the tree in a state in which 'make' can be re-run safely.

Tres.
--
Tres Seaver +1 202-558-7113 tseaver@palladion.com
Palladion Software "Excellence by Design" http://palladion.com

Hi Tres, Tres Seaver wrote:
Well, it can be re-run safely if you have Pyrex installed. The thing is, as long as you don't modify anything in the sources, there is not much of a reason to run "make clean", and if you do make changes, you need Pyrex anyway to regenerate the ".c" files. So there is not much to gain from distinguishing "make clean" and "make distclean". Stefan

Hi Stefan, thanks for your rapid answer. I wasn't in the office until today, so I wasn't able to answer sooner. Sorry for this. behnel_ml@gkec.informatik.tu-darmstadt.de writes:
I tried both, but neither seemed to work for me.
I assume not. :-(
I tested it with a clean version which I downloaded, and it builds like a dream. I'm really not clear what happened the first time. Maybe it was because I messed something up with the Debian package which I had installed. Sorry for wasting your time. Regards, Achim
participants (4)
- Achim Kern
- Holger Joukl
- Stefan Behnel
- Tres Seaver