Mailman 3 [lxml-dev] Setting URL from lxml.html.fromstring, etc - lxml - The Python XML Toolkit


[lxml-dev] Setting URL from lxml.html.fromstring, etc


Ian Bicking

17 Feb 2008

11:38 p.m.

There doesn't seem to be any way to set a document's URL when parsing the document. E.g.:

    >>> from lxml import html
    >>> tree = html.parse('http://www.python.org')
    >>> tree.docinfo.URL
    'http://www.python.org'

But the parse function doesn't really take any arguments, and the URL attribute is read-only. Ideally you could do fromstring('...doc...', URL='location'). (Also I'm not sure why the URL shouldn't be writable.)

Ian


Stefan Behnel

18 Feb

8:33 a.m.

Hi Ian,

Ian Bicking wrote:

There doesn't seem to be any way to set a document's URL when parsing the document. E.g.:

    >>> from lxml import html
    >>> tree = html.parse('http://www.python.org')
    >>> tree.docinfo.URL
    'http://www.python.org'

But the parse function doesn't really take any arguments, and the URL attribute is read-only. Ideally you could do fromstring('...doc...', URL='location').

All keyword arguments that you pass to the parse/fromstring functions are passed on to lxml.etree's corresponding functions. That means, you can pass the "base_url" keyword. (Maybe that should be mentioned in the docstrings).
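A minimal sketch of passing "base_url" through lxml.html.fromstring(), assuming a current lxml release. The document string and the example.com location are made up for illustration; only the base_url keyword itself comes from this thread.

```python
from lxml import html

# Hypothetical document and location, used only for illustration.
SOURCE = "<html><body><a href='about.html'>About</a></body></html>"
LOCATION = "http://example.com/docs/index.html"

# base_url is handed on to the underlying lxml.etree parsing function.
doc = html.fromstring(SOURCE, base_url=LOCATION)

# The URL given at parse time is recorded on the tree ...
document_url = doc.getroottree().docinfo.URL

# ... and lxml.html falls back to it when resolving relative links.
doc.make_links_absolute()
link = doc.xpath("//a")[0].get("href")
```

Without base_url, docinfo.URL is None for a string source and make_links_absolute() needs an explicit URL argument.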


Also I'm not sure why the URL shouldn't be writable.

What would be the use case? The problem that arises is that the source URL of a document would no longer be an immutable identifier of the document. If it can change, it's less valuable for caching (for example). It's a different thing if you pass a URL to the parser because it can't know where the document came from, or if you change the 'source' of a document at will.

Stefan

Ian Bicking

6:04 p.m.

Stefan Behnel wrote:

Ian Bicking wrote:

There doesn't seem to be any way to set a document's URL when parsing the document. E.g.:

    >>> from lxml import html
    >>> tree = html.parse('http://www.python.org')
    >>> tree.docinfo.URL
    'http://www.python.org'

But the parse function doesn't really take any arguments, and the URL attribute is read-only. Ideally you could do fromstring('...doc...', URL='location').

All keyword arguments that you pass to the parse/fromstring functions are passed on to lxml.etree's corresponding functions. That means, you can pass the "base_url" keyword. (Maybe that should be mentioned in the docstrings).

Yeah... it's hard to figure out what method is underlying these. I've added a note to the docstring and an explicit base_url argument to the functions, so you can see the presence of the parameter more easily.

It does not appear that html.parse() takes a base_url argument (just as etree.parse does not). If you pass a URL or filename then I suppose that becomes the base. If you pass in a file-like object then I think it also works, if the file-like object has a geturl() method (like urllib's files do).

Also I'm not sure why the URL shouldn't be writable.

What would be the use case? The problem that arises is that the source URL of a document would no longer be an immutable identifier of the document. If it can change, it's less valuable for caching (for example). It's a different thing if you pass a URL to the parser because it can't know where the document came from, or if you change the 'source' of a document at will.

If you can just get it right during parsing it should be fine. But there's things like xml:base (doesn't apply to HTML; not sure how it's handled in XML), or unusual headers like Content-Location, which you might want to handle at a point in time when the document has already been parsed.

Probably not a problem, but it doesn't seem that much like a problem to make it writable too. Especially since the document itself is writable. Once you've edited the document, it's not *the* document at that URL anyway. Maybe you get a page, edit it, and serve it at a new location. Deliverance does this by getting the theme page, then injecting the content into that page -- but the theme page is the originally-parsed object, though it will be served at a different location. I'd like to be able to fix up that data. And I'm not sure how I'd make a copy of a document with a new URL, if the URL/document link is immutable. (Right now I'm mostly ignoring the URL, but it would be nice if I could actually trust it.)

Ian

Stefan Behnel

8:29 p.m.

Hi Ian,

Ian Bicking wrote:

Stefan Behnel wrote:

Ian Bicking wrote:

There doesn't seem to be any way to set a document's URL when parsing the document. E.g.:

    >>> from lxml import html
    >>> tree = html.parse('http://www.python.org')
    >>> tree.docinfo.URL
    'http://www.python.org'

But the parse function doesn't really take any arguments, and the URL attribute is read-only. Ideally you could do fromstring('...doc...', URL='location').

All keyword arguments that you pass to the parse/fromstring functions are passed on to lxml.etree's corresponding functions. That means, you can pass the "base_url" keyword. (Maybe that should be mentioned in the docstrings).

Yeah... it's hard to figure out what method is underlying these. I've added a note to the docstring and an explicit base_url argument to the functions, so you can see the presence of the parameter more easily.

That's good, then epydoc can pick it up.


It does not appear that html.parse() takes a base_url argument (just as etree.parse does not). If you pass a URL or filename then I suppose that becomes the base.

Yes. parse() is for parsing from files/URLs, so you'd normally have some kind of source name/URL. StringIO is a different thing, but then, in most cases where you could use parse(StringIO), it would be better to use fromstring(), which supports the "base_url" keyword.


If you pass in a file-like object then I think it also works, if the file-like object has a geturl() method (like urllib's files do).

The code we use is this:

    cdef _getFilenameForFile(source):
        # file instances have a name attribute
        try:
            return source.name
        except AttributeError:
            pass
        # gzip file instances have a filename attribute
        try:
            return source.filename
        except AttributeError:
            pass
        # urllib2 provides a geturl() method
        try:
            geturl = source.geturl
        except AttributeError:
            # can't determine filename
            return None
        else:
            return geturl()
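The lookup order in that function can be tried out with a plain-Python rendering. This is only a sketch: filename_for_source, FakeResponse and FakeNamedFile are hypothetical names invented here, and the path and URL are made up.

```python
import io


def filename_for_source(source):
    """Return .name, else .filename, else .geturl(), else None."""
    for attribute in ("name", "filename"):
        try:
            return getattr(source, attribute)
        except AttributeError:
            continue
    try:
        geturl = source.geturl
    except AttributeError:
        # Nothing on the object identifies where it came from.
        return None
    return geturl()


class FakeResponse(io.BytesIO):
    """Hypothetical stand-in for a urllib response object."""

    def geturl(self):
        return "http://example.com/page.html"


class FakeNamedFile(FakeResponse):
    """Hypothetical object that has both a name and a geturl() method."""

    name = "/tmp/page.html"


anonymous = filename_for_source(io.BytesIO(b"<html/>"))
from_url = filename_for_source(FakeResponse(b"<html/>"))
from_name = filename_for_source(FakeNamedFile(b"<html/>"))
```

A bare in-memory stream yields None, which is the case where an explicit URL has to come from the caller. When several attributes are present, the name attribute wins over geturl().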

Also I'm not sure why the URL shouldn't be writable.

What would be the use case? The problem that arises is that the source URL of a document would no longer be an immutable identifier of the document. If it can change, it's less valuable for caching (for example). It's a different thing if you pass a URL to the parser because it can't know where the document came from, or if you change the 'source' of a document at will.

If you can just get it right during parsing it should be fine. But there's things like xml:base (doesn't apply to HTML; not sure how it's handled in XML)

Not sure, but that should be handled in the parser. At least, it deals with parse-time information.


or unusual headers like Content-Location, which you might want to handle at a point in time when the document has already been parsed.

"Header" sounds more like something you'd also know in advance.


Probably not a problem, but it doesn't seem that much like a problem to make it writable too. Especially since the document itself is writable. Once you've edited the document, it's not *the* document at that URL anyway. Maybe you get a page, edit it, and serve it at a new location. Deliverance does this by getting the theme page, then injecting the content into that page -- but the theme page is the originally-parsed object, though it will be served at a different location. I'd like to be able to fix up that data. And I'm not sure how I'd make a copy of a document with a new URL, if the URL/document link is immutable. (Right now I'm mostly ignoring the URL, but it would be nice if I could actually trust it.)

I see. The URL is currently retrieved through "tree.docinfo" (i.e. the DocInfo class), which is completely read-only. I'll have to figure out the implications first - feel free to inject some ideas. :)

Stefan

Stefan Behnel

19 Feb

12:06 p.m.

Hi,

Ian Bicking wrote:

It does not appear that html.parse() takes a base_url argument (just as etree.parse does not). If you pass a URL or filename then I suppose that becomes the base. If you pass in a file-like object then I think it also works, if the file-like object has a geturl() method (like urllib's files do).

I added the base_url keyword to parse() for now, so that you can set the URL for file-like objects.

Stefan
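A short sketch of that keyword in use, assuming an lxml release in which lxml.html.parse() accepts base_url. The in-memory stream and the example.com URL are illustrative only.

```python
import io

from lxml import html

# An in-memory stream has no name, filename or geturl() to inspect,
# so the parser cannot work out a URL for it by itself.
stream = io.BytesIO(b"<html><body><p>hello</p></body></html>")

# Hypothetical location supplied by the caller.
tree = html.parse(stream, base_url="http://example.com/page.html")

document_url = tree.docinfo.URL
```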

Stefan Behnel

25 Feb

4:19 p.m.

Hi,

Ian Bicking wrote:

Probably not a problem, but it doesn't seem that much like a problem to make it writable too. Especially since the document itself is writable. Once you've edited the document, it's not *the* document at that URL anyway. Maybe you get a page, edit it, and serve it at a new location. Deliverance does this by getting the theme page, then injecting the content into that page -- but the theme page is the originally-parsed object, though it will be served at a different location. I'd like to be able to fix up that data. And I'm not sure how I'd make a copy of a document with a new URL, if the URL/document link is immutable. (Right now I'm mostly ignoring the URL, but it would be nice if I could actually trust it.)

Setting the document URL works on the current trunk. I also added a "base" property to Elements that is based on the xml:base attribute (or the appropriate fallback to the document URL).

Stefan
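A sketch of how the two features fit together, assuming an lxml release that includes both the writable docinfo.URL and the Element "base" property. The XML snippet and the example.com URLs are invented for illustration.

```python
from lxml import etree

# Hypothetical document: one child carries its own xml:base.
XML = (
    "<root>"
    '<section xml:base="http://example.com/sections/"/>'
    "</root>"
)

root = etree.fromstring(XML, base_url="http://example.com/doc.xml")
tree = root.getroottree()

# No xml:base on the root, so it falls back to the document URL.
root_base = root.base

# The child has an absolute xml:base, which takes precedence.
section_base = root[0].base

# The document URL can be reassigned after parsing.
tree.docinfo.URL = "http://example.com/moved.xml"
root_base_after = root.base
```

The child keeps its own base after the document URL changes, because its xml:base is absolute.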

Ian Bicking

26 Feb

8:55 p.m.

Stefan Behnel wrote:


Setting the document URL works on the current trunk.

Cool.


I also added a "base" property to Elements that is based on the xml:base attribute (or the appropriate fallback to the document URL).

Hmm... there's a property in lxml.html called .base_url, which previously just read docinfo.URL. Now it could read .base... but obviously that's silly, as it's just an alias. We could deprecate .base_url in lxml.html, or rename .base as .base_url, but having both ain't good.

Ian

Stefan Behnel

28 Feb

10:23 a.m.

Hi,

Ian Bicking wrote:

Stefan Behnel wrote:

I also added a "base" property to Elements that is based on the xml:base attribute (or the appropriate fallback to the document URL).

Hmm... there's a property in lxml.html called .base_url, which previously just read docinfo.URL. Now it could read .base... but obviously that's silly, as it's just an alias.

We could deprecate .base_url in lxml.html, or rename .base as .base_url, but having both ain't good.

I agree, wasn't aware of it. (Here, we are actually lucky that it wasn't writable already!)

But 'base' is a better name for the XML environment given 'xml:base'. It feels weird to set '.base_url' and have it set an xml:base attribute on the Element. Also, it might just be a URI, although that's unlikely.

Don't you think it should behave differently for XML and HTML? For XML, I'd expect it to depend on xml:base, while for HTML, it'd rather always depend on the document URL (and not set an xml:base attribute on assignment).

Stefan

Ian Bicking

5:06 p.m.

Stefan Behnel wrote:


Hi,

Ian Bicking wrote:

Stefan Behnel wrote:

I also added a "base" property to Elements that is based on the xml:base attribute (or the appropriate fallback to the document URL).

Hmm... there's a property in lxml.html called .base_url, which previously just read docinfo.URL. Now it could read .base... but obviously that's silly, as it's just an alias.

We could deprecate .base_url in lxml.html, or rename .base as .base_url, but having both ain't good.

I agree, wasn't aware of it. (Here, we are actually lucky that it wasn't writable already!)

But 'base' is a better name for the XML environment given 'xml:base'. It feels weird to set '.base_url' and have it set an xml:base attribute on the Element. Also, it might just be a URI, although that's unlikely.

Don't you think it should behave differently for XML and HTML? For XML, I'd expect it to depend on xml:base, while for HTML, it'd rather always depend on the document URL (and not set an xml:base attribute on assignment).

Sure, they act somewhat differently, but does it make sense to use two different names? I think they mean similar things in both cases, though perhaps the per-element base attribute in HTML shouldn't be writable. (Though the tree is kind of this weird invisible thing that you wouldn't know is there except for things like docinfo.URL, but a little documentation can fix that of course.)

Ian

Stefan Behnel

29 Feb

8:17 p.m.

Hi,

Ian Bicking wrote:

Stefan Behnel wrote:

Don't you think it should behave differently for XML and HTML? For XML, I'd expect it to depend on xml:base, while for HTML, it'd rather always depend on the document URL (and not set an xml:base attribute on assignment).

Sure, they act somewhat differently, but does it make sense to use two different names? I think they mean similar things in both cases, though perhaps the per-element base attribute in HTML shouldn't be writable. (Though the tree is kind of this weird invisible thing that you wouldn't know is there except for things like docinfo.URL, but a little documentation can fix that of course.)

ok, I do prefer 'base' then, though, as it matches xml:base. It also makes less sense in the HTML area than in the XML area, where you actually /have/ something like a base URL of an element, rather than just a URL of a document that the Element happens to be in. So, if you move an HTML Element from one tree to another, it will change its base URL, while in the XML world, you /can/ work around that if you need/want to.

I think we should deprecate 'base_url' in favour of 'base', and document the respective behaviour in the doc strings of both properties.

Stefan

Ian Bicking

10:35 p.m.

Stefan Behnel wrote:


Hi,

Ian Bicking wrote:

Stefan Behnel wrote:

Don't you think it should behave differently for XML and HTML? For XML, I'd expect it to depend on xml:base, while for HTML, it'd rather always depend on the document URL (and not set an xml:base attribute on assignment).

Sure, they act somewhat differently, but does it make sense to use two different names? I think they mean similar things in both cases, though perhaps the per-element base attribute in HTML shouldn't be writable. (Though the tree is kind of this weird invisible thing that you wouldn't know is there except for things like docinfo.URL, but a little documentation can fix that of course.)

ok, I do prefer 'base' then, though, as it matches xml:base. It also makes less sense in the HTML area than in the XML area, where you actually /have/ something like a base URL of an element, rather than just a URL of a document that the Element happens to be in. So, if you move an HTML Element from one tree to another, it will change its base URL, while in the XML world, you /can/ work around that if you need/want to.

I think we should deprecate 'base_url' in favour of 'base', and document the respective behaviour in the doc strings of both properties.

OK. Then would the html base attribute just be a read-only property then? Like:

    def base(self):
        return super(HtmlElement, self).base
    base = property(base)

I'm not terribly concerned about whether it is read-only or not. It's a little fuzzy, since HTML is parsed to the lxml representation, and though it will probably be serialized to HTML again (if it is serialized at all) and HTML doesn't have anything like xml:base, the lxml representation is not itself exactly HTML. And if you serialize to XHTML, then xml:base is available.

Also translating HTML to XHTML is kind of an outstanding issue for lxml.html, and it seems reasonable to me that XHTML could be parsed into the same classes as HTML. The only real caveat there is that XHTML uses different (namespaced) tag names. If you remove the namespaces, then the classes and the lookup apply just fine. (Presumably the lookup could be changed to support XHTML fairly easily.)

Ian

Stefan Behnel

1 Mar

8:33 a.m.

Hi Ian,

Ian Bicking wrote:

OK. Then would the html base attribute just be a read-only property then? Like:

    def base(self):
        return super(HtmlElement, self).base
    base = property(base)

I'm not terribly concerned about whether it is read-only or not. It's a little fuzzy, since HTML is parsed to the lxml representation, and though it will probably be serialized to HTML again (if it is serialized at all) and HTML doesn't have anything like xml:base, the lxml representation is not itself exactly HTML. And if you serialize to XHTML, then xml:base is available.

Hmm, true. However, if you use lxml.html, you're likely to stay in the HTML world, so I would prefer making this read-only. If you really want an xml:base attribute, you can set it yourself, and if you really want to set the document URL, it's better to be explicit than setting it through an Element.
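The read-only override discussed here can be sketched in plain Python. Both classes below are hypothetical stand-ins written for this illustration, not lxml's actual Element or HtmlElement, and the URLs are made up.

```python
class Element:
    """Hypothetical stand-in for an XML element with a writable base."""

    def __init__(self, base):
        self._base = base

    @property
    def base(self):
        return self._base

    @base.setter
    def base(self, value):
        self._base = value


class HtmlElement(Element):
    """Hypothetical HTML subclass that re-exposes base without a setter."""

    @property
    def base(self):
        return super().base


xml_el = Element("http://example.com/a.xml")
xml_el.base = "http://example.com/b.xml"  # allowed: the parent defines a setter

html_el = HtmlElement("http://example.com/a.html")
try:
    html_el.base = "http://example.com/b.html"
    assignment_failed = False
except AttributeError:
    # The subclass property has no setter, so assignment is rejected.
    assignment_failed = True
```

Redefining the property in the subclass hides the parent's setter, so reading still works while assignment raises AttributeError.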


Also translating HTML to XHTML is kind of an outstanding issue for lxml.html, and it seems reasonable to me that XHTML could be parsed into the same classes as HTML. The only real caveat there is that XHTML uses different (namespaced) tag names. If you remove the namespaces, then the classes and the lookup apply just fine. (Presumably the lookup could be changed to support XHTML fairly easily.)

That's a different topic, so I think we should discuss that in a separate thread.

Stefan
