how can i use lxml with win32com?

Michiel Overtoom motoom at xs4all.nl
Sun Oct 25 04:13:05 EDT 2009


elca wrote:

> yes i want to extract this text 'CNN Shop' and linked page
> 'http://www.turnerstoreonline.com'.

Well then.
First, we'll get the page using urllib2:

     doc=urllib2.urlopen("http://www.cnn.com")

Then we'll feed it into the HTML parser:

     soup=BeautifulSoup(doc)

Next, we'll look at all the links in the page:

     for a in soup.findAll("a"):

and when a link has the text 'CNN Shop', we have a hit,
and print the URL:

         if a.renderContents()=="CNN Shop":
             print a["href"]


The complete program is thus:

import urllib2
from BeautifulSoup import BeautifulSoup

doc=urllib2.urlopen("http://www.cnn.com")
soup=BeautifulSoup(doc)
for a in soup.findAll("a"):
     if a.renderContents()=="CNN Shop":
         print a["href"]


The example above can be condensed, because BeautifulSoup's find function 
can also search by text:

     print soup.find("a",text="CNN Shop")

and since that's a navigable string, we can ascend to its parent and 
display the href attribute:

     print soup.find("a",text="CNN Shop").findParent()["href"]

So eventually the whole program could be collapsed into one line:

     print BeautifulSoup(urllib2.urlopen("http://www.cnn.com")).find("a", text="CNN Shop").findParent()["href"]

...but I think this is very ugly!
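For completeness: the same idea can be sketched without BeautifulSoup at all, using only the standard library's HTMLParser. This is just an illustration of the technique, run here on a made-up HTML snippet rather than the live CNN page; the class and variable names are my own invention.

```python
try:
    from HTMLParser import HTMLParser   # Python 2
except ImportError:
    from html.parser import HTMLParser  # Python 3

# A made-up fragment standing in for the downloaded page.
SAMPLE = '<p><a href="http://www.turnerstoreonline.com">CNN Shop</a></p>'

class LinkFinder(HTMLParser):
    """Collect the href of every <a> whose text equals `wanted`."""
    def __init__(self, wanted):
        HTMLParser.__init__(self)
        self.wanted = wanted
        self.current_href = None  # href of the <a> we are inside, if any
        self.hits = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.current_href = dict(attrs).get("href")

    def handle_data(self, data):
        if self.current_href is not None and data.strip() == self.wanted:
            self.hits.append(self.current_href)

    def handle_endtag(self, tag):
        if tag == "a":
            self.current_href = None

finder = LinkFinder("CNN Shop")
finder.feed(SAMPLE)
print(finder.hits[0])  # http://www.turnerstoreonline.com
```

More verbose than the BeautifulSoup version, of course, which is exactly why BeautifulSoup is the better tool for this job.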


> im very sorry my english.

Your English is quite understandable.  The hard part is figuring out what 
exactly you wanted to achieve ;-)

I have a question too.  Why did you think JavaScript was necessary to 
arrive at this result?

Greetings,


