[lxml-dev] html branch

I've started a branch with lxml.html, in http://codespeak.net/svn/lxml/branch/html

It currently includes:

lxml.doctestcompare: XML/HTML doctests
lxml.usedoctest: enable the doctest comparison from within a doctest
lxml.html.usedoctest: enable the doctest comparison, using the HTML parser
lxml.html:
  * lxml.html.HtmlMixin, defining on each element:
    - remove_element: element removes itself from a tree
    - remove_tag: element removes itself, but not its children, from a tree
    - find_rel_links: find <a rel="?">
    - find_class: find <* class="?">
  * HTML: parser
  * parse_elements: parse a fragment, return a list of elements
  * parse_element: parse a fragment, return a single element
  * Element: apparently a highly broken element factory (segfaults?!)
  * tostring: HTML serialization
lxml.defs: lists of HTML tags (e.g., block_tags)
lxml.clean: clean Javascript and other problem code from HTML
lxml.rewritelinks: change the links in a document
lxml.htmldiff: make human-readable diffs and blame reports

The usedoctest modules are based on a really horrible hack. It seems to work, except that for some reason lxml/html/tests/test_clean.txt is sometimes run without the doctest change. The other doctests aren't run like this, and when you explicitly run the test (e.g., python test.py test_clean) it runs fine. So something weird with the test runner, I guess.

--
Ian Bicking | ianb@colorstudy.com | http://blog.ianbicking.org
Write code, do good | http://topp.openplans.org/careers
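For illustration, a stdlib ElementTree sketch of what find_class does (the helper name and the whitespace-token matching are assumptions, not the branch's actual implementation):

```python
import xml.etree.ElementTree as ET

def find_class(root, class_name):
    # Sketch: return every element whose class attribute contains
    # class_name as one whitespace-separated token.
    return [el for el in root.iter()
            if class_name in el.get("class", "").split()]

root = ET.fromstring('<div><p class="note big">a</p><p class="other">b</p></div>')
print([el.text for el in find_class(root, "note")])  # → ['a']
```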

Hi Ian, Ian Bicking wrote:
I've started a branch with lxml.html, in http://codespeak.net/svn/lxml/branch/html
Sure, cool.
lxml.doctestcompare: XML/HTML doctests
As people would rarely import this, why not have it start with an underscore?
lxml.usedoctest: enable the doctest from within a doctest lxml.html.usedoctest: enable the doctest, using the HTML parser
Good idea. That way it automatically gets the same 'interface'. I'm not sure about the "use...", though. It needs to read well with "import":

    from lxml import usedoctest

Too many verbs IMHO (but as long as I can't come up with a better name, I'll just leave it as is :)
remove() already exists and removes the element you pass (not the element you call it on), so this becomes too ambiguous. Also, the more ElementTree-ish way would be to go through the parent:

    def cut_out_tree(self, element):
        if element.tail:
            previous = element.getprevious()
            if previous is None:
                # element is the first child, so the tail goes onto our text
                self.text = (self.text or '') + element.tail
            else:
                previous.tail = (previous.tail or '') + element.tail
        self.remove(element)

    def cut_out_element(self, element):
        pos = self.index(element)
        if element.text:
            self.text = (self.text or '') + element.text
        self.cut_out_tree(element)
        self[pos:pos] = element[:]
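A runnable stdlib version of the same tail-preserving removal (ElementTree has no getprevious(), so the parent looks up the index itself; the function name follows the proposal above):

```python
import xml.etree.ElementTree as ET

def cut_out_tree(parent, element):
    # Remove element and its children, merging its tail text into the
    # preceding sibling's tail (or the parent's text if it is first).
    children = list(parent)
    pos = children.index(element)
    if element.tail:
        if pos == 0:
            parent.text = (parent.text or "") + element.tail
        else:
            prev = children[pos - 1]
            prev.tail = (prev.tail or "") + element.tail
    parent.remove(element)

root = ET.fromstring("<div>a<b>bold</b> tail<i>x</i></div>")
cut_out_tree(root, root.find("b"))
print(ET.tostring(root, encoding="unicode"))  # → <div>a tail<i>x</i></div>
```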
I'll look into those, but they look ok at first glance.
* Element: apparently a highly broken element factory (segfaults?!)
Yup, that won't work that way. Element classes cannot be instantiated on their own. But you can do:

    Element = html_parser.makeelement
* tostring: HTML serialization
Based on XSLT, as I've seen before. Sure, why not.
lxml.[html.]defs: lists of HTML tags (e.g., block_tags)
Ok.
lxml.[html.]clean: clean Javascript and other problem code from HTML
That rather looks like an HtmlElement method to me: "cleanup(...)", and the clean_html() function would fit right into the top-level of the lxml.html module.
lxml.[html.]rewritelinks: change the links in a document
Maybe too special and too long for integration into the lxml.html and HtmlElement, not sure. Some of this might fit, though.
I'll take a look at these later. Stefan

Stefan Behnel wrote:
lxml.doctestcompare: XML/HTML doctests
As people would rarely import this, why not have it start with an underscore?
I guess... the usedoctest technique is a pretty egregious hack; I actually change doctest.OutputChecker.check_output.im_func.func_code because there's a local bound method that has to be changed. So the more conventional installation method still seems good, if there are interaction bugs.
I feel like there needs to be a verb in the name, since the import does stuff. The module itself is useless.
I am a little reluctant to add self-delete methods in general in Python, but with this technique I would *always* do el.getparent().cut_out_tree(el). I pretty much always find an element then get rid of it. Doing it from the parent is consistent but inconvenient. I agree the remove names are ambiguous -- both how they relate to each other, and that they seem similar to remove().
OK. What's the distinction between Element and SubElement?
* tostring: HTML serialization
Based on XSLT, as I've seen before. Sure, why not.
Yeah; it works. I hate the <meta http-equiv="Content-Type"> removal via a regex, but not removing it bugs the hell out of me and there's no other way I see to get rid of it. If I was more apt to dig in libxml2 code I'm sure there's a better technique, but I'm shy around C code.
The long signature of the function made me reluctant to do this. Any function with that many parameters feels non-authoritative to me. And I would encourage people to actually write their own clean function with the parameter defaults that are appropriate for their domain (e.g., clean_untrusted_comment, clean_wysiwyg_submission, etc.). I just guessed reasonable defaults for those keyword arguments.
This I feel a little more comfortable about than the cleanup. Especially making all links absolute is really convenient when you are doing parsing. I'd like to do some kind of query (returning all links in the document), but I'm not sure what that would look like. Generally *just* the link is kind of boring. Usually the link plus the element that has the link is more interesting. But some kinds of links don't have elements; CSS particularly. OTOH, a method that didn't cover that particular case (even though the rewriting did) would still be useful. Maybe it would return [(element_with_link, attribute_where_link_is), ...]. Or it could be (element_with_link, attribute_where_link_is, link), and for CSS that'd be (<style element>, None, link).

So potentially I see the methods:

    make_links_absolute(base_href)
    resolve_base_href()  # kind of icky, but still useful; for <base href>
    iter_links()         # as described
    rewrite_links(link_repl_func)

Does that make for too many methods? Doesn't seem too bad, especially since links are important.

I've also added two new methods: get_element_by_id() (a long name, but at least easy to remember) and text_only(), which gives the text of the tree with all the tags removed. I don't really like the text_only name, but the function is useful.

--
Ian Bicking | ianb@colorstudy.com | http://blog.ianbicking.org
Write code, do good | http://topp.openplans.org/careers
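A rough sketch of the proposed iter_links() return shape, using a stdlib ElementTree stand-in (the attribute list and tuple layout here are assumptions drawn from the discussion above, not the final API):

```python
import xml.etree.ElementTree as ET

LINK_ATTRS = ("href", "src", "action")  # partial, illustrative list

def iter_links(root):
    # Yield (element_with_link, attribute_where_link_is, link) tuples,
    # the shape discussed above; CSS links would use attribute None.
    for el in root.iter():
        for attr in LINK_ATTRS:
            link = el.get(attr)
            if link is not None:
                yield el, attr, link

root = ET.fromstring('<div><a href="a.html">a</a><img src="b.png"/></div>')
print([(el.tag, attr, link) for el, attr, link in iter_links(root)])
# → [('a', 'href', 'a.html'), ('img', 'src', 'b.png')]
```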

Hi Ian, Ian Bicking wrote:
Ok. I'll have to take a look at the hack anyway, but I believe you already searched enough for a better solution... But isn't there a way to copy over a bit of the doctest code to make this easier? Like the whole method in OutputChecker?
Good point. "usedoctest" it is then.
Ok, you're the one who has experience in using these functions.
I agree the remove names are ambiguous -- both how they relate to each other, and that they seem similar to remove().
Ok, so, what other words do we have for that? discard? extract? drop?
Both are factories: SubElement adds elements to an existing tree, while Element creates a new root element in a new tree. The tree classes are _Element, _Comment, _ProcessingInstruction and (since this week) _Entity. They are proxy classes that you can't instantiate yourself; only lxml can do that. Ah, BTW, inheriting from _Comment won't work, you have to use CommentBase (which inherits from _Comment). And if HtmlLookup stays that simple, we can even use ElementDefaultClassLookup(element=HtmlElement, comment=HtmlComment). That's internal config, so people won't notice if we ever have to change to something else.
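The Element/SubElement distinction in one stdlib ElementTree example (the same factory semantics apply in lxml):

```python
import xml.etree.ElementTree as ET

# Element creates a new root element in a new tree;
# SubElement creates a child attached to an existing element.
root = ET.Element("html")
body = ET.SubElement(root, "body")
body.text = "hello"

print(ET.tostring(root, encoding="unicode"))  # → <html><body>hello</body></html>
```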
That's why there's lxml :) Once we have a usable API, we can still see if there is any stuff we can reimplement in Pyrex, but Python code is best for now.
Ah, ok, good point. Still, I would like to keep the number of modules low. lxml.html should be as close to "one point for solving your HTML needs" as possible.
Agreed, and I think the above are good ones.
What about gettext() or gettextcontent() ? Having a very visible .text property makes it clear that these two do more. Even collecttext() would work well. (BTW, I keep favouring xpath's "//text()" or even "string()" for the implementation: fast and simple). I'll give the code a closer review when I find the time. Regards, Stefan
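For comparison, the stdlib ElementTree equivalent of the XPath string() approach (the function name follows the candidates floated above and is not a real API):

```python
import xml.etree.ElementTree as ET

def text_content(el):
    # Concatenate all text nodes in document order, tags stripped,
    # much like XPath's string() applied to the element.
    return "".join(el.itertext())

root = ET.fromstring("<p>Hello <b>world</b>!</p>")
print(text_content(root))  # → Hello world!
```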

Stefan Behnel wrote:
We can't copy over the code; at the point usedoctest is imported, doctest code is already running. We aren't adding our own runner, we're modifying the runner that is already in progress. Instead of swapping in the doctestcompare check_output, we could swap in code that does something simpler, like calls a method indirectly (and we could swap that method). Either way, it involves messing with func_code, because of that blasted bound method in __run. But a permanent change in code would at least make it less important to disable the patch.
I like drop, I'll switch to that.
OK. *Actually* putting them all in one module would make the module feel too big to me. I could import them all into __init__.py. That might make the import unnecessarily slow, I'm not sure. For some reason I've never used lazy-loading functions, though the implementation seems obvious enough; just something like:

    def clean(*args, **kw):
        from lxml.html import clean as clean_module
        return clean_module.clean(*args, **kw)

It breaks documentation tools, I guess (though at least I can refer to the real function in the docstring).
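A generic, self-contained version of the lazy-loading idea, sketched over the stdlib (the wrapper name is made up here):

```python
import importlib

def lazy_function(module_name, func_name):
    # Defer the module import until the first call, then cache
    # the resolved function for subsequent calls.
    cache = {}
    def wrapper(*args, **kw):
        if "fn" not in cache:
            cache["fn"] = getattr(importlib.import_module(module_name), func_name)
        return cache["fn"](*args, **kw)
    return wrapper

sqrt = lazy_function("math", "sqrt")
print(sqrt(9.0))  # → 3.0
```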
OK, switched to get_text_content(). Is there a style guideline for naming? I'm using underscores and avoiding smashed words, which would give gettextcontent(). Though the "get_" seems unnecessary; text_content() seems better to me.

I've been trying to use find_* for methods that return lists of nodes, and get_* for things that return a single node.

For a number of the methods I'd also like a function version that takes a string and returns a string. I think this makes it easier to convince people to use the functions. Obviously this doesn't make sense for a lot of the methods, but it does for clean, htmldiff, make_links_absolute, and maybe rewrite_links.

--
Ian Bicking | ianb@colorstudy.com | http://blog.ianbicking.org
Write code, do good | http://topp.openplans.org/careers

Hi Ian, Ian Bicking wrote:
I'll have to take a closer look into this, but won't have the time during the next week.
Avoiding imports tends to be not worth the effort. It already takes a while to import etree, so importing some more Python modules doesn't add much.
I wouldn't do that. Calling things happens much more often than importing them, so adding overhead to the call that is usually done only once feels wrong to me.
lxml has not been very consistent here, but I'm planning to get closer to PEP 8 in the long term: http://www.python.org/dev/peps/pep-0008/ ElementTree has traditionally used CamelCase for module names and "smashedwords" for methods, which is not quite compliant with what PEP 8 says today (long after ElementTree was written). But there aren't enough examples of multi-word methods to make that a naming principle. I think underscore names are just right.
Make it so.
I've been trying to use find_* for methods that return lists of nodes, and get_* for things that return a single node.
Ok, that sounds consistent, though it won't necessarily be immediately obvious to users, as it requires knowing a couple of examples before you actually start seeing the pattern - *if* you care for seeing one. Anyway, having a consistent naming pattern is always a good idea.
I like that pattern, too. Stefan

Stefan Behnel wrote:
The overhead of an import (if the module has already been imported) isn't very significant, and could be cached easily enough. That said, the clean module isn't particularly large and doesn't import much itself. But htmldiff is the only module of substantial size. I've integrated rewritelinks directly into __init__, which after refactoring the algorithm a bit isn't very big anyway. I dunno; I'm okay just requiring htmldiff to be imported directly, and importing clean into __init__.
So I made a generic wrapper for exposing methods as functions. It parses the first argument if it's not already a parsed document, then does something, and returns the result of the method, or the serialized form of the document if the method returns None. This might be a bit too fancy/automatic. But anyway, putting that aside, I was thinking that maybe the general pattern should be like:

    def make_links_absolute(doc, base_href, fragment=False):
        if isinstance(doc, basestring):
            if fragment:
                doc = parse_element(doc)
            else:
                doc = HTML(doc)
            return_string = True
        else:
            doc = copy.deepcopy(doc)
            return_string = False
        doc.make_links_absolute(base_href)
        if return_string:
            return tostring(doc)
        else:
            return doc

This also makes the function a handy way to do functional-style transformations of elements. It bothers me a bit to change the return type (which I generally dislike doing), except that it matches the input type, which seems like it might be okay. Does this seem okay?

Also, I'm wondering if (a) I should try to automatically determine fragment unless it is explicitly given, and/or (b) if parse_element doesn't work (raises an exception) I should use parse_element(doc, create_parent=True), which will wrap the fragment in a <div>.

--
Ian Bicking | ianb@colorstudy.com | http://blog.ianbicking.org
Write code, do good | http://topp.openplans.org/careers
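The same type-preserving pattern, runnable with only the stdlib (the attribute list and link handling are simplified assumptions, not the lxml.html implementation):

```python
import copy
from urllib.parse import urljoin
import xml.etree.ElementTree as ET

def make_links_absolute(doc, base_href):
    # Accept either a serialized fragment or an Element;
    # return the same kind that was passed in.
    return_string = isinstance(doc, str)
    root = ET.fromstring(doc) if return_string else copy.deepcopy(doc)
    for el in root.iter():
        for attr in ("href", "src"):
            if el.get(attr) is not None:
                el.set(attr, urljoin(base_href, el.get(attr)))
    return ET.tostring(root, encoding="unicode") if return_string else root

html = '<div><a href="x.html">x</a></div>'
print(make_links_absolute(html, "http://example.com/a/"))
# → <div><a href="http://example.com/a/x.html">x</a></div>
```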

Hi, Ian Bicking wrote:
I'm okay just requiring htmldiff to be imported directly, and importing clean into __init__.
Makes sense to me. htmldiff is definitely a module in its own right. Everything else just deals with different things to do with a tree.
Ok.
It looks Pythonic to me. You get out what you put in and whatever you put in, it does the same thing to it. So it's just a perfectly polymorphic function.
Defaulting to a "wrap with <div>" fallback means changing the input in a not really predictable way. That sounds like too much magic to me. In most cases, users will know what they are dealing with. Otherwise, they can well catch the exception and then fall back to an alternative *if they want*. I'm fine with having a function that can handle HTML trees or serialised HTML documents and requires users to parse things themselves if it's not a document. Stefan

Stefan Behnel wrote:
I imported a bunch of HTML cleaning tests from other sources, and in the process I found "parse this somehow and give me an element" to be very convenient. Of course, HTML() *does* exactly that kind of parsing, but at least for cleaning you usually don't want a full document, you really just want a fragment. And that's not too uncommon.

To make this easier I implemented a parse() function that does its best to parse your content. If your content is a full page, you get a full page back. If it's not a full page and it contains just one element, you get that element back. But if it's not a full page and it contains multiple elements, it gets wrapped in a <div>. This seems less intrusive than wrapping it in <html><body>, which is effectively what the standard parser does. <div> is really a generic wrapper (though I suppose since it is block level, it's not *entirely* generic -- it might be more ideal to see if the content contains any block-level elements, and if not just wrap in <span>).

Dealing with ordered lists of elements with no parent isn't that easy or natural anywhere in the API. If there was some kind of anonymous container then that would be a nice fit, but there isn't one. Is it possible to make something like that? It seems like a new kind of node could cause a lot of problems.

Notably, with the HTML parser you frequently get something out with more elements than were in the original. It'll add <p> or <div> tags fairly liberally, rearrange tags, etc., to make the document valid. So adding a <div> tag isn't that far from what can already happen.

--
Ian Bicking | ianb@colorstudy.com | http://blog.ianbicking.org
Write code, do good | http://topp.openplans.org/careers

Hi Ian, Ian Bicking wrote:
Ok, that makes sense.
Adding block elements might break things like CSS.
That's a good idea. The parse() function could do that as it already aims to be smart about what it returns (otherwise, you could just use the normal etree.parse() with an HTMLParser). If you pass it something that can't be returned as a single element, I find it legitimate to wrap it in something that fits. And if we've already determined that we need to wrap it, we can also check what to wrap it in by traversing the tree(s). As a quick check, we can walk through the parsed root elements to check if there are any block elements and only if not, we can traverse each tree completely. If we find at least one block element (easy to check the tag against a positive set), we wrap with <div>, otherwise, we wrap with <span>.
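The quick block-element check described above, sketched over stdlib ElementTree (the block-tag set here is deliberately partial, standing in for lxml.html.defs.block_tags):

```python
import xml.etree.ElementTree as ET

BLOCK_TAGS = {"div", "p", "ul", "ol", "li", "table", "pre", "blockquote",
              "h1", "h2", "h3", "h4", "h5", "h6"}  # partial, illustrative set

def wrap_fragments(elements):
    # Wrap parsed fragments in <div> if any of them contains a
    # block-level element, otherwise in the inline-safe <span>.
    has_block = any(el.tag in BLOCK_TAGS
                    for root in elements for el in root.iter())
    wrapper = ET.Element("div" if has_block else "span")
    wrapper.extend(elements)
    return wrapper

inline = [ET.fromstring("<b>one</b>"), ET.fromstring("<i>two</i>")]
print(wrap_fragments(inline).tag)  # → span
```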
It definitely would. Adding such a beast would cause overhead in basically all API functions, in traversal code, etc. I'd be very happy to avoid that.
True. As I said, having a parse() function that accompanies etree.parse() and that deliberately says "I return *one* element and I do it the smart way" is definitely the way to go. Stefan
