generating html with incremental writer
Hi, Is there a way to pass method='html' to etree.xmlfile()? Or any other way to serialize incrementally to html? Best, Burak
On 12/23/13 11:01, Stefan Behnel wrote:
Burak Arslan, 23.12.2013 09:56:
Is there a way to pass method='html' to etree.xmlfile()? Or any other way to serialize incrementally to html?
No, not currently. (Would be a new feature, maybe "htmlfile()".) You could generate XHTML, though.
I'm not too interested in XHTML, but I could work on implementing support for a "method" argument to etree.xmlfile (or write a new html.htmlfile, whichever you deem more appropriate). Did you actually re-implement bits of xmlNodeDumpOutput in xmlfile in a context-manager-friendly way? If so, should I start by looking at htmlNodeDumpFormatOutput from libxml? Do you think htmlfile should be a separate function or just a wrapper around xmlfile? I see most of lxml.html is a wrapper around similar calls in lxml.etree but htmlfile could be different enough to merit being a separate function. Any pointers/advice about getting this to work would be appreciated. Best, Burak
Burak Arslan, 23.12.2013 12:40:
I'm not too interested in XHTML, but I could work on implementing support for a "method" argument to etree.xmlfile (or write a new html.htmlfile, whichever you deem more appropriate).
Sure. I'd prefer having an etree.htmlfile() class. The "xmlfile" class itself is fairly short anyway, it's purely an API class. The real work is done by the _IncrementalFileWriter. I suggest you add a "method" argument to the latter and pass it either OUTPUT_METHOD_XML or OUTPUT_METHOD_HTML from the two frontends. Take care to disallow namespaces in HTML mode. Also, I'm not sure if there is anything to do about self-closing tags in HTML mode. I guess if people use the context manager to create them, they may just have to live with them being split into opening and closing tags ... I guess write_declaration() should raise an error in HTML mode. The rest might even work more or less as it is. Most of the code you need to write might actually end up in tests.
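The layering suggested above (two thin frontends that hand a "method" value to one shared writer) can be sketched in plain Python. This is only a toy model of the idea, not lxml's actual code: the class names ToyXmlFile, ToyHtmlFile and ToyIncrementalWriter, the constant values and the use of ValueError are all made up for illustration.

```python
import io

# Hypothetical stand-ins for the internal output-method constants.
OUTPUT_METHOD_XML = 0
OUTPUT_METHOD_HTML = 1


class ToyIncrementalWriter:
    """Toy shared writer; the frontends only differ in the method they pass."""

    def __init__(self, out, method):
        self._out = out
        self._method = method

    def write_declaration(self):
        # An XML declaration makes no sense in HTML mode, so refuse it.
        if self._method != OUTPUT_METHOD_XML:
            raise ValueError("only XML documents have declarations")
        self._out.write('<?xml version="1.0"?>\n')

    def start(self, tag):
        # Mirrors the suggestion to disallow namespaced tags in HTML mode.
        if self._method == OUTPUT_METHOD_HTML and tag.startswith("{"):
            raise ValueError("namespaced tags are not allowed in HTML mode")
        self._out.write("<%s>" % tag)

    def end(self, tag):
        self._out.write("</%s>" % tag)


class ToyXmlFile:
    """Pure API class: creates the writer with the XML method."""
    method = OUTPUT_METHOD_XML

    def __init__(self, out):
        self._out = out

    def __enter__(self):
        return ToyIncrementalWriter(self._out, self.method)

    def __exit__(self, *exc_info):
        return False  # never swallow exceptions


class ToyHtmlFile(ToyXmlFile):
    """Same frontend, different method value."""
    method = OUTPUT_METHOD_HTML


out = io.StringIO()
with ToyHtmlFile(out) as writer:
    writer.start("p")
    writer.end("p")
print(out.getvalue())
```

The point of the split is that the HTML-specific checks live in one place (the writer), so the second frontend costs only a few lines.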
Did you actually re-implement bits of xmlNodeDumpOutput in xmlfile in a context-manager-friendly way?
Yes, necessarily. lxml actually replicates a fair bit of libxml2's functionality where the latter doesn't fit its API well enough (or where lxml can do things more efficiently).
If so, should I start by looking at htmlNodeDumpFormatOutput from libxml?
No, _writeNodeToBuffer() handles this just fine when you pass the right method value. Stefan
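To see how much the method value alone changes the output, here is a small sketch using the public etree.tostring() function rather than the internal _writeNodeToBuffer(). That the two share the same serialisation path is my reading of the message above, not something verified here.

```python
from lxml import etree

br = etree.Element("br")

# Same element, two output methods.
as_xml = etree.tostring(br, method="xml")
as_html = etree.tostring(br, method="html")

print(as_xml)   # b'<br/>'
print(as_html)  # b'<br>'
```

XML output self-closes the empty element, while HTML output writes it as a void element without an end tag.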
Hello Stefan,
I just sent you a pull request: https://github.com/lxml/lxml/pull/142
On 12/23/13 17:51, Stefan Behnel wrote:
Burak Arslan, 23.12.2013 12:40:
I'm not too interested in XHTML, but I could work on implementing support for a "method" argument to etree.xmlfile (or write a new html.htmlfile, whichever you deem more appropriate).
Sure.
I'd prefer having an etree.htmlfile() class. The "xmlfile" class itself is fairly short anyway, it's purely an API class. The real work is done by the _IncrementalFileWriter. I suggest you add a "method" argument to the latter and pass it either OUTPUT_METHOD_XML or OUTPUT_METHOD_HTML from the two frontends.
Done.
Take care to disallow namespaces in HTML mode. Also, I'm not sure if there is anything to do about self-closing tags in HTML mode. I guess if people use the context manager to create them, they may just have to live with them being split into opening and closing tags ...
Nowadays html markup is abused for all sorts of javascript-y reasons, so I left it as it is. I also noticed that html.tostring doesn't suppress namespaces, but my patch does. Should I alter that behaviour?
I guess write_declaration() should raise an error in HTML mode.
Done.
The rest might even work more or less as it is. Most of the code you need to write might actually end up in tests.
It indeed seems to. Could you advise what kind of tests you think this code needs? Best regards, Burak
Burak Arslan wrote on 15.09.2014 at 13:56:
I just sent you a pull request: https://github.com/lxml/lxml/pull/142
Thanks!
On 12/23/13 17:51, Stefan Behnel wrote:
Burak Arslan, 23.12.2013 12:40:
I'm not too interested in XHTML, but I could work on implementing support for a "method" argument to etree.xmlfile (or write a new html.htmlfile, whichever you deem more appropriate).
Sure.
I'd prefer having an etree.htmlfile() class. The "xmlfile" class itself is fairly short anyway, it's purely an API class. The real work is done by the _IncrementalFileWriter. I suggest you add a "method" argument to the latter and pass it either OUTPUT_METHOD_XML or OUTPUT_METHOD_HTML from the two frontends.
Done.
See my comments in the pull request.
Take care to disallow namespaces in HTML mode. Also, I'm not sure if there is anything to do about self-closing tags in HTML mode. I guess if people use the context manager to create them, they may just have to live with them being split into opening and closing tags ...
Nowadays html markup is abused for all sorts of javascript-y reasons, so I left it as it is.
I also noticed that html.tostring doesn't suppress namespaces, but my patch does. Should I alter that behaviour?
I was referring to the xf.element() calls (sorry). If people write out subtrees as HTML that use namespaces somewhere, I think they're pretty much on their own. However, I think it should be an error if you try that directly with xf.element() in HTML mode.
The rest might even work more or less as it is. Most of the code you need to write might actually end up in tests.
It indeed seems to. Could you advise what kind of tests you think this code needs?
There are a couple of tests for xmlfile(), so the bulk of the code is already tested. What still needs testing is the HTML specific parts, i.e. any changes or additions to the API (including error cases), as well as the actual HTML specific serialisation (i.e. that you actually get what you asked for). Stefan
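A minimal sketch of what such HTML-specific tests might look like, assuming an lxml build that ships etree.htmlfile(). The class and test names are made up, and the error check catches the generic etree.LxmlError base class rather than committing to a specific exception type.

```python
import io
import unittest

from lxml import etree


class HtmlFileSketchTest(unittest.TestCase):
    def setUp(self):
        self.out = io.BytesIO()

    def test_declaration_is_rejected(self):
        # Error case: XML declarations are not valid in HTML mode.
        with etree.htmlfile(self.out) as xf:
            with self.assertRaises(etree.LxmlError):
                xf.write_declaration()
            # Write something so that closing the writer does not fail.
            xf.write(etree.Element("html"))

    def test_void_element_has_no_end_tag(self):
        # HTML-specific serialisation of a written subtree.
        with etree.htmlfile(self.out) as xf:
            xf.write(etree.Element("br"))
        self.assertEqual(self.out.getvalue(), b"<br>")

    def test_context_manager_element(self):
        # Elements opened via the context manager get start and end tags.
        with etree.htmlfile(self.out) as xf:
            with xf.element("p"):
                xf.write("text")
        self.assertEqual(self.out.getvalue(), b"<p>text</p>")
```

Saved as a module, this can be run with python -m unittest; it covers one error case and two serialisation cases and is nowhere near exhaustive.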
you were not supposed to merge this patch so soon :) there are two unresolved issues, one was posted in another thread, the other is this one, directly related.
On 09/15/14 15:21, Stefan Behnel wrote:
I also noticed that html.tostring doesn't suppress namespaces, but my patch does. Should I alter that behaviour?
I was referring to the xf.element() calls (sorry). If people write out subtrees as HTML that use namespaces somewhere, I think they're pretty much on their own. However, I think it should be an error if you try that directly with xf.element() in HTML mode.
    with etree.htmlfile(self._file) as xf:
        xf.write(etree.Element('{some_ns}some_tag'))

doesn't suppress namespaces

    with etree.htmlfile(self._file) as xf:
        with xf.element("{some_ns}some_tag"):
            pass

does suppress namespaces. this is inconsistent and needs to be fixed. see: https://github.com/plq/lxml/commit/378408d2b6e94a4c91410fc7bde5bba055f54785

You have three options:

1) silently filter namespaces out
2) throw an exception when using html serialization with namespaced elements
3) let namespaces pass

I'd choose 1 to make the lives of people who generate xhtml and html with the same code easier. It's your decision though.

Best,
Burak
Burak Arslan wrote on 16.09.2014 at 10:29:
you were not supposed to merge this patch so soon :)
It looked ok, though. :)
On 09/15/14 15:21, Stefan Behnel wrote:
I also noticed that html.tostring doesn't suppress namespaces, but my patch does. Should I alter that behaviour?
I was referring to the xf.element() calls (sorry). If people write out subtrees as HTML that use namespaces somewhere, I think they're pretty much on their own. However, I think it should be an error if you try that directly with xf.element() in HTML mode.
    with etree.htmlfile(self._file) as xf:
        xf.write(etree.Element('{some_ns}some_tag'))

doesn't suppress namespaces

    with etree.htmlfile(self._file) as xf:
        with xf.element("{some_ns}some_tag"):
            pass

does suppress namespaces. this is inconsistent
True.
and needs to be fixed.
Hmm, does it? What about this case:

    plain_p = etree.Element('p')
    etree.SubElement(plain_p, '{some_ns}some_tag')
    with etree.htmlfile(self._file) as xf:
        xf.write(plain_p)

Namespaces can be used at any place when writing out subtrees. I wouldn't want to validate the entire tree before writing it out.
see: https://github.com/plq/lxml/commit/378408d2b6e94a4c91410fc7bde5bba055f54785
You have three options:
1) silently filter namespaces out
2) throw an exception when using html serialization with namespaced elements
3) let namespaces pass
I'd choose 1 to make the lives of people who generate xhtml and html with the same code easier.
ISTM that 3) is the simpler and more obvious option, i.e. let libxml2 handle it. 1) would require running through the entire subtree, and if we find (XHTML) namespaces, make a copy of the subtree, remove the namespaces from the copy, and serialise it. That sounds like more than we should impose on users behind their back. Stefan
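For anyone who does want option 1 in their own code, the copy-and-strip step described above can be done on the user side before writing. A rough sketch, assuming that only element tags need handling (namespaced attributes are left alone); the helper name strip_namespaces_copy is made up and is not part of lxml.

```python
import copy

from lxml import etree


def strip_namespaces_copy(root):
    """Return a deep copy of root with element namespaces removed."""
    clone = copy.deepcopy(root)
    for el in clone.iter():
        # Comments and processing instructions have a non-string tag.
        if isinstance(el.tag, str):
            el.tag = etree.QName(el).localname
    # Drop namespace declarations that are no longer used.
    etree.cleanup_namespaces(clone)
    return clone


plain_p = etree.Element("p")
etree.SubElement(plain_p, "{some_ns}some_tag")

html = etree.tostring(strip_namespaces_copy(plain_p), method="html")
print(html)
```

This walks and copies the whole subtree on every call, which is exactly the cost the message above argues should not be imposed implicitly; doing it explicitly keeps that cost visible to the caller.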
Hello, On 09/16/14 18:55, Stefan Behnel wrote:
ISTM that 3) is the simpler and more obvious option, i.e. let libxml2 handle it.
done. see: https://github.com/plq/lxml/compare/lxml:master...master

for your convenience:

    git remote add plq git://github.com/plq/lxml
    git fetch plq
    git merge plq/master

also, don't forget about void elements. I know they are rare ones, but they are in the spec.

best,
burak
Burak Arslan wrote on 18.09.2014 at 21:13:
On 09/16/14 18:55, Stefan Behnel wrote:
ISTM that 3) is the simpler and more obvious option, i.e. let libxml2 handle it.
done. see: https://github.com/plq/lxml/compare/lxml:master...master
Thanks. I applied the changes manually and moved them around a bit. Stefan
participants (2): Burak Arslan, Stefan Behnel