Hi,

attached you will find a new version of my io patch for lxml. Features include:

- Output options for tostring() and write() methods.
- Use new names for output options as discussed on this mailing list.
- Use libxml2 io contexts to provide io to file-like objects without intermediary StringIO's
- Support for url keyword on XML() and parse() to indicate base url in case we're not parsing from a file.
- Added a function .tostring() to ElementTree and to Element. The module level tostring() now calls the 'tostring' method on its first argument.
- Support for parse options (although none are defined yet)
- Default encoding is utf-8.
- doctests
- unit tests
- A new function XMLTree that is just like XML but returns a tree instead. With Stefan's recent patches this function may become superfluous, as we can use ElementTree(XML(...)) instead.
- valgrind/unittest clean

Regards, Geert
All,

apologies for replying to my own email, but I forgot to mention that in order to test my patch you need to have the latest libxml2 from CVS. The patch that was required to libxml2 has since been accepted. This means that until a new version of libxml2 is released you need to use the CVS version.

Regards, Geert
_______________________________________________ lxml-dev mailing list lxml-dev@codespeak.net http://codespeak.net/mailman/listinfo/lxml-dev
Geert Jansen wrote:
attached you find a new version of my io patch for lxml. Features include:
- Output options for tostring() and write() methods.
- Use new names for output options as discussed on this mailing list.
- Use libxml2 io contexts to provide io to file-like objects without intermediary StringIO's
- Support for url keyword on XML() and parse() to indicate base url in case we're not parsing from a file.
- Added a function .tostring() to ElementTree and to Element. The module level tostring() now calls the 'tostring' method on its first argument.
- Support for parse options (although none are defined yet)
- Default encoding is utf-8.
- doctests
- unit tests
- A new function XMLTree that is just like XML but returns a tree instead. With Stefan's recent patches this function may become superfluous, as we can use ElementTree(XML(...)) instead.
- valgrind/unittest clean

in order to test my patch you need to have the latest libxml2 from CVS. The patch that was required to libxml2 has since been accepted. This means that until a new version of libxml2 is released you need to use the CVS version.
Hi!

Since we have the other thread open regarding trunk/branches: in case you consider it viable to move to the branch, would you mind trying to build a patch against the scoder2 branch? I'm currently refactoring it to split it into modules (as Philipp suggested), so don't try immediately. Once that's done, it should become easier for people to assess the impact of patches.

I know this is additional work for you, but since you have to wait for the libxml2 update anyway to get your stuff into lxml, we may well have decided on the branch issue by then (either positively or negatively or something in between), and I'd still like to see your modifications in both places once we can start building on the updated libxml2.

Thanks, Stefan
Hi Stefan,

Stefan Behnel wrote:
Since we have the other thread open regarding trunk/branches, just in case you consider it viable to move to the branch, would you mind trying to build a patch against the scoder2 branch? I'm currently refactoring it to split it into modules (as Philipp suggested), so don't try immediately. Once that's done, it should become easier for people to assess the impact of patches.

OK, once you've finished refactoring I'll give it a go. Splitting lxml up into smaller modules should definitely help here. I also think that the current modules are too big. While you're at it, maybe add a short file header to each of the files as well, stating its purpose and a small copyright statement? And the parser deserves its own module as well.

Regards, Geert
Geert Jansen wrote:
Stefan Behnel wrote:
Since we have the other thread open regarding trunk/branches, just in case you consider it viable to move to the branch, would you mind trying to build a patch against the scoder2 branch? I'm currently refactoring it to split it into modules (as Philipp suggested), so don't try immediately. Once that's done, it should become easier for people to assess the impact of patches.
OK, once you've finished refactoring I'll give it a go. Splitting lxml up into smaller modules should definitely help here. I also think that the current modules are too big.
Thanks. I'm currently looking through the helper functions and I'm not quite sure which to move where. Would you say that the I/O functions deserve their own module (also in the light of your patch)?
While you're at it, maybe add a short file header to each of the files as well, stating its purpose and a small copyright statement?
Sure. There is definitely too little documentation in lxml, so that for one would be a good start.
And the parser deserves its own module as well.
He-he, already done :) I did the first check-in, so if you want to look at it, feel free. I'll now go for the tests and split them up to match the code modules. Stefan
Hi!

Finally finding time to comment on your patch. One thing is that it's a bit more work to get it applied to my branch as the virtual ElementTrees have a context node associated with them. When you serialize or treat the ElementTree in any other way, you must start doing so at that specific node, not at the document root (although they *may* be the same, so that you can optimize).

Geert Jansen wrote:
attached you find a new version of my io patch for lxml. Features include:
- Output options for tostring() and write() methods.
- Use new names for output options as discussed on this mailing list.
I saw that you type check these options. That is rarely done in Python. If it's an ASCII string (like an encoding) and you want to be sure Pyrex handles it correctly, prefix it with str() or cast it into a char* (don't know if that always works, though). But this

+    if not isinstance(encoding, basestring):
+        raise TypeError, 'encoding must be a string'
+    if not encoding:
+        raise ValueError, 'an encoding must be specified'

seems too much in my eyes. Remember, we are all adults here. And, didn't we want to default to UTF-8?

On the other hand, something like this looks ok:

+    if isinstance(file, basestring):
+        ctxt = tree.xmlSaveToFilename(file, xml_encoding, options)
+        tree.xmlSaveDoc(ctxt, self._c_doc)
+        tree.xmlSaveClose(ctxt)
+    elif hasattr(file, 'write'):
+        ctxt = tree.xmlSaveToIO(<void*> _writeFileObject, NULL,
+                                <void*> file, xml_encoding, options)
+        tree.xmlSaveDoc(ctxt, self._c_doc)
+        tree.xmlSaveClose(ctxt)
+    else:
+        raise TypeError, 'expecting file name or file like object'

Difference being that 'file' accepts different types of input values, so a distinction has to be made anyway.

BTW, what is an "enchandler"? Anything enchanted? ;) What about calling that enc_handler?

Hmmm, I've seen things like this before:

+def tostring(tree_or_elem, encoding=None, **kwargs):
+    assert hasattr(tree_or_elem, 'tostring'), 'Expecting Element or ElementTree.'

What good is it to put assertions on public API methods? What can you 'assert' the user to do? I'd personally prefer a typed exception here rather than an AssertionError...
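To make the last point concrete, here is a minimal sketch of the module-level dispatcher raising a typed exception instead of asserting. It is written in current Python 3 syntax rather than the Pyrex of the patch, and `FakeElement` is a hypothetical stand-in used only to make the example runnable; neither is taken from the actual patch.

```python
def tostring(tree_or_elem, encoding="utf-8", **kwargs):
    """Dispatch to the object's own tostring() method (illustrative sketch)."""
    method = getattr(tree_or_elem, "tostring", None)
    if method is None:
        # Unlike an assert, this check is not stripped when Python runs with -O,
        # and callers can catch TypeError specifically.
        raise TypeError(
            "expected Element or ElementTree, got %s" % type(tree_or_elem).__name__
        )
    return method(encoding=encoding, **kwargs)


class FakeElement:
    """Hypothetical stand-in for an Element, only here to exercise the sketch."""

    def tostring(self, encoding="utf-8", **kwargs):
        return "<root/>".encode(encoding)
```

Calling `tostring(FakeElement())` returns the serialized bytes, while `tostring(42)` raises a `TypeError` that names the offending type.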
- Use libxml2 io contexts to provide io to file-like objects without intermediary StringIO's
+cdef int _readFileObject(void* context, char* buffer, int size):
+    cdef object file
+    cdef int nbytes
+
+    file = <object> context
+    try:
+        buf = file.read(size)
+    except IOError:
+        return -1
+    # Some file like objects such as StringIO can return unicode when
+    # read. This is not what it is supposed to be, as libxml2 expects
+    # and encoded string here in the (libxml2) encoding specified by the
+    # parser context. It is not guaranteed that Python supports the
+    # specified encoding. For now, we encode to UTF-8 and require the
+    # end-user to use the default encoding for these special file-like
+    # objects.
+    if isinstance(buf, unicode):
+        buf = buf.encode('utf-8')
+    nbytes = len(buf)
+    tree.strncpy(buffer, buf, size)
+    return nbytes

Although I like the simplicity of the two file I/O functions, this one is broken. Encoding to UTF-8 can produce more bytes than the number of characters that were read, so the encoded string may be longer than 'size'. In that case you end up cutting off bytes at the end and throwing them away.
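One possible way around the truncation is to keep the bytes that did not fit and hand them out on the next call. The sketch below shows the idea in plain Python 3; `ByteReader`, `read_into` and `read_all` are hypothetical names chosen for illustration, not part of the patch or of libxml2, and a real callback would still have to copy each chunk into the C buffer.

```python
import io


class ByteReader:
    """Hands out at most `size` bytes per call and keeps any overflow for later."""

    def __init__(self, fileobj):
        self._file = fileobj
        self._pending = b""  # encoded bytes that did not fit into the last chunk

    def read_into(self, size):
        if not self._pending:
            data = self._file.read(size)
            if isinstance(data, str):
                # One character may encode to several bytes in UTF-8,
                # so the result can be longer than `size`.
                data = data.encode("utf-8")
            self._pending = data
        chunk, self._pending = self._pending[:size], self._pending[size:]
        return chunk


def read_all(fileobj, size):
    """Drain a file-like object through ByteReader in chunks of at most `size` bytes."""
    reader = ByteReader(fileobj)
    chunks = []
    while True:
        chunk = reader.read_into(size)
        if not chunk:
            break
        chunks.append(chunk)
    return b"".join(chunks)
```

For example, `read_all(io.StringIO("héllo"), 4)` yields the full UTF-8 encoding of the text even though the first read encodes to more than four bytes. A chunk boundary may fall inside a multi-byte character; the sketch only guarantees that no bytes are lost across calls.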
- Support for url keyword on XML() and parse() to indicate base url in case we're not parsing from a file.
Sounds ok.
- Added a function .tostring() to ElementTree and to Element. The module level tostring() now calls the 'tostring' method on its first argument.
- Support for parse options (although none are defined yet)
- Default encoding is utf-8.
- doctests
- unit tests
I saw that you also changed the test cases for ElementTree (not only etree). Do you have the original ElementTree installed? Because you will have to run the tests against both to verify that the new interface stays compatible. The common approach is rather to change the code than the test cases if you intend to stay compatible.
- A new function XMLTree that is just like XML but returns a tree instead. With Stefan's recent patches this function may become superfluous, as we can ElementTree(XML(...)) instead.
True.
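For reference, the ElementTree(XML(...)) combination looks like this. The sketch uses the standard library's xml.etree.ElementTree, whose API lxml.etree mirrors; that the same lines work unchanged against the patched lxml is an assumption.

```python
from xml.etree.ElementTree import XML, ElementTree

# XML() parses a string and returns the root Element ...
root = XML("<doc><item>one</item></doc>")

# ... and wrapping that element gives the tree object XMLTree would return.
tree = ElementTree(root)
```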
- valgrind/unittest clean
That's good.

Some more comments:

ElementTree.parse seems to do a lot of checking to determine how to parse the document. I would prefer having this in a private helper function as it is useful in many other places where you could want to throw in a file (like XSLT, RelaxNG, ...)

I'm somewhat convinced now that it would be a good idea to put the I/O stuff (buffer encoding, input checking/parsing/...) into either the parser module (rather the input stuff) or into a new IO module (preferably the output stuff). The functions related to parsing are best put into the parser module, so I don't know if there is enough input stuff left to merit an I/O module or if it's rather stripped down to an O module.

Still, many other modules can do their own input since libxml2/libxslt has a number of functions for parsing RelaxNG, XSLT, etc. from files and memory. So maybe we should have some helper functions that determine if something is a file that can be parsed that way. An I/O module would be a good place for them. So I'd personally go for io.pxi.

So much for my comments, I hope they do not discourage you in your work. The rest looks pretty promising.

Stefan
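A helper of the kind described here, one that decides how an input should be handed to the parser, might look roughly like the following. This is only a sketch in plain Python 3; `classify_source` and its return values are hypothetical and not taken from lxml or from the patch.

```python
def classify_source(source):
    """Return 'filename' or 'file' depending on how `source` should be parsed."""
    if isinstance(source, (str, bytes)):
        # A string is treated as a path that the C library can open directly.
        return "filename"
    if hasattr(source, "read"):
        # Anything with a read() method is treated as a file-like object.
        return "file"
    raise TypeError("expecting file name or file-like object")
```

Callers such as a parser, an XSLT loader or a RelaxNG loader could then branch on the result instead of each repeating the same checks.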
participants (2)
- Geert Jansen
- Stefan Behnel