why i can get nothing?
robert at roberthelmer.com
Sun Jan 15 18:03:18 EST 2012
On Sat, Jan 14, 2012 at 7:54 PM, contro opinion <contropinion at gmail.com> wrote:
> here is my code :
> import urllib
> import lxml.html
> tnodes = root.xpath("//a/@href[contains(string(),'mp4')]")
> for i,add in enumerate(tnodes):
>     print i,add
> why i can get nothing?
The problem is the document. The links you are trying to match on are
inside the script tags in the document; here's a simplified version:
So the anchor elements are not part of the DOM as far as lxml is
concerned (to see them, lxml would have to execute the JS, and the JS
would have to modify the DOM, before you could get them via XPath).
You could have lxml return just the script nodes that contain the text
you care about:
tnodes = root.xpath("//script[contains(.,'mp4')]")
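To see the difference between the two queries, here is a minimal sketch
against a made-up page. SAMPLE_HTML and the example.com URLs are my own
stand-ins, not the document from the original question; the only thing
assumed to match is that the anchors are written out by JavaScript.

```python
import lxml.html

# Hypothetical page: the anchors exist only as text inside a JS string.
SAMPLE_HTML = r"""
<html><body>
<script type="text/javascript">
document.write('<a href="http://example.com/clip1.mp4">clip 1<\/a>');
document.write('<a href="http://example.com/clip2.mp4">clip 2<\/a>');
</script>
<p>No real anchor elements here.</p>
</body></html>
"""

root = lxml.html.fromstring(SAMPLE_HTML)

# The original query: finds nothing, because no <a> element was ever parsed.
anchor_hrefs = root.xpath("//a/@href[contains(string(),'mp4')]")

# The script-node query: finds the <script> whose text mentions 'mp4'.
script_nodes = root.xpath("//script[contains(.,'mp4')]")

print(len(anchor_hrefs))  # 0
print(len(script_nodes))  # 1
```

The script node's text is then available via script_nodes[0].text_content().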
Then you will need a different tool for the rest of this; regex is not
perfect but should be good enough. It's probably not worth the effort to
parse or execute the JavaScript just to get the links out of this, but
it's an option.
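A rough sketch of the regex step, assuming the script text looks something
like the made-up script_text below (in practice it would come from the
script node, e.g. via text_content()). The HREF_MP4 pattern and the
example.com URLs are only illustrative, and the pattern will miss links
that are built up by string concatenation in the JS.

```python
import re

# Hypothetical script text, standing in for the text of a matched <script>.
script_text = r"""
document.write('<a href="http://example.com/clip1.mp4">clip 1<\/a>');
document.write('<a href="http://example.com/clip2.mp4">clip 2<\/a>');
"""

# Match a quoted href value (single or double quotes) containing ".mp4".
HREF_MP4 = re.compile(r"""href\s*=\s*(["'])([^"']*\.mp4[^"']*)\1""")

# group(2) is the URL itself, without the surrounding quotes.
links = [m.group(2) for m in HREF_MP4.finditer(script_text)]

for i, link in enumerate(links):
    print(i, link)
```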
More information about the Python-list mailing list