<div dir="ltr"><div><br></div><div>I asked this question earlier, but this version should make more sense. I am reposting because I don't want anyone who could potentially help to be put off by the initial mess, even though I updated that thread with a cleaner version as a reply.</div>
<div><br></div><div>I want to save the scraped links in my database so that, on subsequent runs, only new links are scraped and appended to the list.</div><div><br></div><div>This is my code below, but after it runs my database is still empty. What changes can I make to fix this? Thanks in advance.</div>
<div><br></div><div> </div><div> from django.template.loader import get_template</div><div> from django.shortcuts import render_to_response </div><div> from bs4 import BeautifulSoup</div><div> import urllib2, sys</div>
<div> import urlparse</div><div> import re</div><div> from listing.models import jobLinks</div><div><span class="" style="white-space:pre"> </span></div><div><span class="" style="white-space:pre"> </span># this function extracts the links</div>
<div> def businessghana():</div><div> &nbsp;&nbsp;&nbsp;&nbsp;site = "<a href="http://www.businessghana.com/portal/jobs">http://www.businessghana.com/portal/jobs</a>"</div><div> &nbsp;&nbsp;&nbsp;&nbsp;hdr = {'User-Agent' : 'Mozilla/5.0'}</div>
<div> &nbsp;&nbsp;&nbsp;&nbsp;req = urllib2.Request(site, headers=hdr)</div><div> &nbsp;&nbsp;&nbsp;&nbsp;jobpass = urllib2.urlopen(req)</div><div> &nbsp;&nbsp;&nbsp;&nbsp;soup = BeautifulSoup(jobpass)</div><div> &nbsp;&nbsp;&nbsp;&nbsp;for tag in soup.find_all('a', href = True):</div>
<div> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;tag['href'] = urlparse.urljoin('<a href="http://www.businessghana.com/portal/">http://www.businessghana.com/portal/</a>', tag['href'])</div><div> &nbsp;&nbsp;&nbsp;&nbsp;return map(str, soup.find_all('a', href = re.compile('.getJobInfo')))</div>
<div><span class="" style="white-space:pre"> </span></div><div><span class="" style="white-space:pre"> </span># result from businessghana() saved to a variable so it can be iterated as a list</div><div> all_links = businessghana()</div>
<div><br></div><div><span class="" style="white-space:pre"> </span># this function should save each link to the database unless the link already exists</div><div> def save_new_links(all_links):</div><div> &nbsp;&nbsp;&nbsp;&nbsp;current_links = jobLinks.objects.all()</div>
<div> &nbsp;&nbsp;&nbsp;&nbsp;for i in all_links:</div><div> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;if i not in current_links:</div><div> &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;jobLinks.objects.create(url=i)</div><div> </div><div><span class="" style="white-space:pre"> </span># I call the above function here, hoping that it will save to the database</div>
<div> save_new_links(all_links)</div><div><br></div><div><span class="" style="white-space:pre"> </span># return my HttpResponse with this function</div><div> def display_links(request):</div><div> &nbsp;&nbsp;&nbsp;&nbsp;name = all_links() </div>
<div> &nbsp;&nbsp;&nbsp;&nbsp;return render_to_response('jobs.html', {'name' : name})</div><div><span class="" style="white-space:pre"> </span></div><div><span class="" style="white-space:pre"> </span></div><div><span class="" style="white-space:pre"> </span>My Django models.py looks like this:</div>
<div><span class="" style="white-space:pre"> </span></div><div><span style="white-space:pre">from django.db import models
class jobLinks(models.Model):
    links = models.URLField()
    pub_date = models.DateTimeField('date retrieved')
    def __unicode__(self):
        return self.links</span></div><span style="white-space:pre"> <br></span><div><br></div></div>
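<div>For reference, the "only save links that are not stored yet" step can be sketched in plain Python. This is a minimal sketch, not a confirmed fix for the empty database. It compares URL strings with URL strings, whereas the posted check compares each scraped string against <code>jobLinks</code> model instances. The helper name <code>select_new_links</code> is hypothetical, and the Django calls shown in the comments are assumptions about how it might be wired in. Note that they use the model's <code>links</code> field, since the model above defines no <code>url</code> field, and they supply a value for <code>pub_date</code>.</div>

```python
def select_new_links(scraped_urls, existing_urls):
    """Return scraped URLs that are not already stored.

    Order is preserved and duplicates within the scraped batch are dropped.
    (Hypothetical helper; the name is not from the original post.)
    """
    seen = set(existing_urls)  # set membership tests are fast
    new_urls = []
    for url in scraped_urls:
        if url not in seen:
            seen.add(url)
            new_urls.append(url)
    return new_urls


# Assumed wiring inside the Django project (not executed here):
#
#   from django.utils import timezone
#   existing = jobLinks.objects.values_list('links', flat=True)
#   for url in select_new_links(all_links, existing):
#       jobLinks.objects.create(links=url, pub_date=timezone.now())

if __name__ == "__main__":
    stored = ["http://example.com/job/1"]
    scraped = [
        "http://example.com/job/1",
        "http://example.com/job/2",
        "http://example.com/job/2",
    ]
    print(select_new_links(scraped, stored))
```

<div>The helper expects plain URL strings, so the scraper would need to return each tag's <code>href</code> value rather than the string form of the whole <code>&lt;a&gt;</code> tag.</div>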