strip away html tags from extracted links
Joel Goldstick
joel.goldstick at gmail.com
Fri Nov 29 12:45:54 EST 2013
On Fri, Nov 29, 2013 at 12:44 PM, Joel Goldstick
<joel.goldstick at gmail.com> wrote:
>
>
>
> On Fri, Nov 29, 2013 at 12:33 PM, Mark Lawrence <breamoreboy at yahoo.co.uk> wrote:
>
>> On 29/11/2013 16:56, Max Cuban wrote:
>>
>>> I have the following code to extract certain links from a webpage:
>>>
>>> from bs4 import BeautifulSoup
>>> import urllib2, sys
>>> import re
>>>
>>> def tonaton():
>>>     site = "http://tonaton.com/en/job-vacancies-in-ghana"
>>>     hdr = {'User-Agent' : 'Mozilla/5.0'}
>>>     req = urllib2.Request(site, headers=hdr)
>>>     jobpass = urllib2.urlopen(req)
>>>     invalid_tag = ('h2')
>>>     soup = BeautifulSoup(jobpass)
>>>     print soup.find_all('h2')
>>>
>>> The links are contained in the 'h2' tags so I get the links as follows:
>>>
>>> <h2><a href="/en/cashiers-accra">cashiers </a></h2>
>>> <h2><a href="/en/cake-baker-accra">Cake baker</a></h2>
>>> <h2><a href="/en/automobile-technician-accra">Automobile Technician</a></h2>
>>> <h2><a href="/en/marketing-officer-accra-4">Marketing Officer</a></h2>
>>>
>>> But I'm interested in getting rid of all the 'h2' tags so that I have
>>> links only in this manner:
>>>
>>> <a href="/en/cashiers-accra">cashiers </a>
>>> <a href="/en/cake-baker-accra">Cake baker</a>
>>> <a href="/en/automobile-technician-accra">Automobile Technician</a>
>>> <a href="/en/marketing-officer-accra-4">Marketing Officer</a>
>>>
>>>
>>> This is more a Beautiful Soup question than a Python one. Have you gone
>>> through their tutorial? Check here:
>>>
>>
> They have an example that looks close here:
> http://www.crummy.com/software/BeautifulSoup/bs4/doc/
>
> One common task is extracting all the URLs found within a page’s <a> tags:
>
> for link in soup.find_all('a'):
>     print(link.get('href'))
> # http://example.com/elsie
> # http://example.com/lacie
> # http://example.com/tillie
>
> In your case, you want the href value of the <a> child inside each h2 element.
>
> So this might be close (untested)
>
Pardon my typo. Try this:
>
> for link in soup.find_all('h2'):
>     print(link.a.get('href'))
> # /en/cashiers-accra
> # /en/cake-baker-accra
> # /en/automobile-technician-accra
> # /en/marketing-officer-accra-4
>
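If you want the whole <a> tag rather than only the href, the same loop can hand back link.a itself. Here is a minimal sketch of that idea, not tested against the live site: it parses a string built from the output you posted instead of fetching the page, the function name links_in_h2 and the extra heading without a link are made up for illustration, and it assumes the parser built into Python ("html.parser").

```python
from bs4 import BeautifulSoup

# Sample markup copied from the output quoted earlier in this thread.
# The last heading is a hypothetical extra, added only to show the None check.
SAMPLE_HTML = """
<h2><a href="/en/cashiers-accra">cashiers </a></h2>
<h2><a href="/en/cake-baker-accra">Cake baker</a></h2>
<h2><a href="/en/automobile-technician-accra">Automobile Technician</a></h2>
<h2><a href="/en/marketing-officer-accra-4">Marketing Officer</a></h2>
<h2>No link here</h2>
"""


def links_in_h2(html):
    """Return the <a> tag nested in each <h2>, skipping headings without one."""
    soup = BeautifulSoup(html, "html.parser")
    anchors = []
    for h2 in soup.find_all("h2"):
        # h2.a is the first <a> inside the heading, or None if there is none.
        if h2.a is not None:
            anchors.append(h2.a)
    return anchors


anchors = links_in_h2(SAMPLE_HTML)
for a in anchors:
    print(a)              # the whole <a> tag, without the <h2> wrapper
    print(a.get("href"))  # just the URL
```

The None check matters because link.a.get('href') raises AttributeError on any h2 that has no link inside it.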
--
Joel Goldstick
http://joelgoldstick.com