[lxml-dev] [lxml][objectify] optimization of recursive object dumping

Hi, I've posted this before but messed with the threads, so here it is again: (Note: patch line numbers might differ, this was based on 1.1 branch of 2 weeks ago, but I could of course update this and send a new patch) First some background: I'm experimenting with a custom objectified datetime class based on Python's datetime that employs the dateutil.parser module to detect if some element value is in a valid datetime format, i.e. the parse function from dateutil.parser is used to implement the type_check for the PyType type registry. 1) Invoking this parse method is quite expensive, so I want this to happen rarely. As I am using "recursive element dumping" as default I found that for every __str__ call .pyval of the ObjectifiedDataElements in a tree is accessed, which in turn triggers parsing for my custom datetime class. As I don't really see a way to avoid this I propose the introduction of an additional property "_pyval_repr" that can be overridden in subclasses, which makes it possible to simply return element.text, if getting .pyval is expensive. S.th. like: *** ORIG/lxml-1.1/src/lxml/objectify.pyx Wed Sep 27 09:18:30 2006 --- src/lxml/objectify.pyx Wed Oct 4 11:00:09 2006 *************** *** 484,489 **** --- 484,493 ---- def __get__(self): return textOf(self._c_node) + property _pyval_repr: + def __get__(self): + return self.pyval + def __str__(self): return textOf(self._c_node) or '' *************** *** 931,938 **** cdef object _dump(_Element element, int indent): indentstr = " " * indent ! if hasattr(element, "pyval"): ! value = element.pyval else: value = textOf(element._c_node) if value and not value.strip(): --- 935,942 ---- cdef object _dump(_Element element, int indent): indentstr = " " * indent ! if hasattr(element, "_pyval_repr"): ! value = element._pyval_repr else: value = textOf(element._c_node) if value and not value.strip(): This can substantially speed up things for complicated type_check routines (in my usecase :) Holger Der Inhalt dieser E-Mail ist vertraulich. Falls Sie nicht der angegebene Empfänger sind oder falls diese E-Mail irrtümlich an Sie adressiert wurde, verständigen Sie bitte den Absender sofort und löschen Sie die E-Mail sodann. Das unerlaubte Kopieren sowie die unbefugte Übermittlung sind nicht gestattet. Die Sicherheit von Übermittlungen per E-Mail kann nicht garantiert werden. Falls Sie eine Bestätigung wünschen, fordern Sie bitte den Inhalt der E-Mail als Hardcopy an. The contents of this e-mail are confidential. If you are not the named addressee or if this transmission has been addressed to you in error, please notify the sender immediately and then delete this e-mail. Any unauthorized copying and transmission is forbidden. E-Mail transmission cannot be guaranteed to be secure. If verification is required, please request a hard copy version.

Hi Holger, Holger Joukl worote:
But that should only happen for normal text content (well, and dates). Numbers should always be parsed first.
Hmmm, I don't really like the idea of adding a new Python method only to optimise the debug output (which is what dump() is essentially meant for). I understand that you use this as default, but I don't think many people will rely on the performance of this function... Have you considered switching from "dump() by default" to "implement __str__() for all data types by hand"? There are not that many standard types... On the other hand, what if we did something like this: cdef object _dump(_Element element, int indent): indentstr = " " * indent if isinstance(element, ObjectifiedDataElement): value = str(element) else: ... Would that help? Stefan

Hi again, Stefan Behnel wrote:
I wrote up a patch that could do the trick. Sadly, it requires a behavioural change in NumberElement to return repr(value) for str(). Not that beautiful. I'm just posting it to give you an idea about what looks like a viable approach to me. I'll try to get 1.1.2 out tomorrow, so unless I get convinced by then that this is a sufficiently solid idea, this will have to wait for the next release. Stefan

Hi Stefan, Stefan Behnel <behnel_ml@gkec.informatik.tu-darmstadt.de> schrieb am 25.10.2006 19:32:34: that
this is a sufficiently solid idea, this will have to wait for the next release.
I'll probably be not able to look at anything before tomorrow, sorry. But as svn trunk usually works rock solid anyway I'm not that tied to official releases, anyway :) Holger Der Inhalt dieser E-Mail ist vertraulich. Falls Sie nicht der angegebene Empfänger sind oder falls diese E-Mail irrtümlich an Sie adressiert wurde, verständigen Sie bitte den Absender sofort und löschen Sie die E-Mail sodann. Das unerlaubte Kopieren sowie die unbefugte Übermittlung sind nicht gestattet. Die Sicherheit von Übermittlungen per E-Mail kann nicht garantiert werden. Falls Sie eine Bestätigung wünschen, fordern Sie bitte den Inhalt der E-Mail als Hardcopy an. The contents of this e-mail are confidential. If you are not the named addressee or if this transmission has been addressed to you in error, please notify the sender immediately and then delete this e-mail. Any unauthorized copying and transmission is forbidden. E-Mail transmission cannot be guaranteed to be secure. If verification is required, please request a hard copy version.

Hi Holger, Holger Joukl worote:
But that should only happen for normal text content (well, and dates). Numbers should always be parsed first.
Hmmm, I don't really like the idea of adding a new Python method only to optimise the debug output (which is what dump() is essentially meant for). I understand that you use this as default, but I don't think many people will rely on the performance of this function... Have you considered switching from "dump() by default" to "implement __str__() for all data types by hand"? There are not that many standard types... On the other hand, what if we did something like this: cdef object _dump(_Element element, int indent): indentstr = " " * indent if isinstance(element, ObjectifiedDataElement): value = str(element) else: ... Would that help? Stefan

Hi again, Stefan Behnel wrote:
I wrote up a patch that could do the trick. Sadly, it requires a behavioural change in NumberElement to return repr(value) for str(). Not that beautiful. I'm just posting it to give you an idea about what looks like a viable approach to me. I'll try to get 1.1.2 out tomorrow, so unless I get convinced by then that this is a sufficiently solid idea, this will have to wait for the next release. Stefan

Hi Stefan, Stefan Behnel <behnel_ml@gkec.informatik.tu-darmstadt.de> schrieb am 25.10.2006 19:32:34: that
this is a sufficiently solid idea, this will have to wait for the next release.
I'll probably be not able to look at anything before tomorrow, sorry. But as svn trunk usually works rock solid anyway I'm not that tied to official releases, anyway :) Holger Der Inhalt dieser E-Mail ist vertraulich. Falls Sie nicht der angegebene Empfänger sind oder falls diese E-Mail irrtümlich an Sie adressiert wurde, verständigen Sie bitte den Absender sofort und löschen Sie die E-Mail sodann. Das unerlaubte Kopieren sowie die unbefugte Übermittlung sind nicht gestattet. Die Sicherheit von Übermittlungen per E-Mail kann nicht garantiert werden. Falls Sie eine Bestätigung wünschen, fordern Sie bitte den Inhalt der E-Mail als Hardcopy an. The contents of this e-mail are confidential. If you are not the named addressee or if this transmission has been addressed to you in error, please notify the sender immediately and then delete this e-mail. Any unauthorized copying and transmission is forbidden. E-Mail transmission cannot be guaranteed to be secure. If verification is required, please request a hard copy version.
participants (2)
-
Holger Joukl
-
Stefan Behnel