<div dir="ltr"><br><div class="gmail_extra"><br><br><div class="gmail_quote">On Fri, Nov 29, 2013 at 12:44 PM, Joel Goldstick <span dir="ltr"><<a href="mailto:joel.goldstick@gmail.com" target="_blank">joel.goldstick@gmail.com</a>></span> wrote:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><br><div class="gmail_extra"><br><br><div class="gmail_quote"><div class="im">On Fri, Nov 29, 2013 at 12:33 PM, Mark Lawrence <span dir="ltr"><<a href="mailto:breamoreboy@yahoo.co.uk" target="_blank">breamoreboy@yahoo.co.uk</a>></span> wrote:<br>


</div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div><div><div class="im">On 29/11/2013 16:56, Max Cuban wrote:<br>

</div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div class="im"><div>

I have the following code to extract certain links from a webpage:<br>

<br>

from bs4 import BeautifulSoup<br>

import urllib2, sys<br>

import re<br>

<br>

def tonaton():<br>

     site = "<a href="http://tonaton.com/en/job-vacancies-in-ghana" target="_blank">http://tonaton.com/en/job-vacancies-in-ghana</a>"<br>

     hdr = {'User-Agent' : 'Mozilla/5.0'}<br>

     req = urllib2.Request(site, headers=hdr)<br>

     jobpass = urllib2.urlopen(req)<br>

     invalid_tag = ('h2')<br>

     soup = BeautifulSoup(jobpass)<br>

     print soup.find_all('h2')<br>

<br>

The links are contained in the 'h2' tags so I get the links as follows:<br>

<br>

<h2><a href="/en/cashiers-accra">cashiers </a></h2><br>

<h2><a href="/en/cake-baker-accra">Cake baker</a></h2><br>

<h2><a href="/en/automobile-technician-accra">Automobile Technician</a></h2><br>

<h2><a href="/en/marketing-officer-accra-4">Marketing Officer</a></h2><br>

<br>

But I'm interested in getting rid of all the 'h2' tags so that I have links only in this manner:<br>

<br>

<a href="/en/cashiers-accra">cashiers </a><br>

<a href="/en/cake-baker-accra">Cake baker</a><br>

<a href="/en/automobile-technician-accra">Automobile Technician</a><br>

<a href="/en/marketing-officer-accra-4">Marketing Officer</a><br>

<br>

<br></div></div>

This is more a Beautiful Soup question than a Python one.  Have you gone through their tutorial?  Check here:<br></blockquote></div></div></blockquote><div><br></div>They have an example that looks close here: <a href="http://www.crummy.com/software/BeautifulSoup/bs4/doc/" target="_blank">http://www.crummy.com/software/BeautifulSoup/bs4/doc/</a><br>


<br>One common task is extracting all the URLs found within a page’s &lt;a&gt; tags:<br><br>for link in soup.find_all('a'):<br>    print(link.get('href'))<br># <a href="http://example.com/elsie" target="_blank">http://example.com/elsie</a><br>


# <a href="http://example.com/lacie" target="_blank">http://example.com/lacie</a><br># <a href="http://example.com/tillie" target="_blank">http://example.com/tillie</a><br><br></div><div class="gmail_quote">In your case, you want the href value of the link inside each h2 element.<br>


<br></div><div class="gmail_quote">So this might be close (untested)<br></div></div></div></blockquote><div><br></div><div>Pardon my typo.  Try this: <br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

<div dir="ltr"><div class="gmail_extra"><div class="gmail_quote"><br>for link in soup.find_all('h2'):<br>    print(link.a.get('href'))<br># /en/cashiers-accra<br>


# /en/cake-baker-accra<br># /en/automobile-technician-accra<br># /en/marketing-officer-accra-4<span class="HOEnZb"><font color="#888888"><br><br><br>

</font></span></div><span class="HOEnZb"><font color="#888888"><div class="gmail_quote"><br> </div><br clear="all"><br>-- <br><div dir="ltr">

<div>Joel Goldstick<br></div><a href="http://joelgoldstick.com" target="_blank">http://joelgoldstick.com</a><br></div>

</font></span></div></div>

</blockquote></div><br><br clear="all"><br>-- <br><div dir="ltr"><div>Joel Goldstick<br></div><a href="http://joelgoldstick.com" target="_blank">http://joelgoldstick.com</a><br></div>

</div></div>
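
For anyone who wants the links themselves rather than only the href values, here is a minimal sketch of the same idea. It is untested against the live site. The sample markup, the name SAMPLE_HTML and the helper job_links are made up for illustration; the sketch assumes Python 3, bs4 installed, and that each job link is the first 'a' inside an 'h2', as in the output quoted in the question.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the fetched page.
# Assumption: each job link is the first <a> inside an <h2>.
SAMPLE_HTML = """
<h2><a href="/en/cashiers-accra">cashiers </a></h2>
<h2><a href="/en/cake-baker-accra">Cake baker</a></h2>
<h2>No link in this heading</h2>
"""


def job_links(html):
    """Return the first <a> Tag found inside each <h2> element."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for heading in soup.find_all("h2"):
        anchor = heading.a  # None when the heading contains no <a>
        if anchor is not None:
            links.append(anchor)
    return links


links = job_links(SAMPLE_HTML)
for anchor in links:
    print(anchor)              # the whole <a ...>...</a> element, without the <h2>
    print(anchor.get("href"))  # only the URL
```

Printing the tag gives the link without its surrounding 'h2', which is the form asked for in the original question. The None check skips headings that have no link, where link.a.get('href') would raise an AttributeError.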