Please help me... I want to save scraped data automatically to my database (cleaner version)
Max Cuban
edzeame at gmail.com
Sat Jan 25 16:22:14 EST 2014
I asked this question earlier, but this version should make more sense than
the earlier one, and I don't want anyone who could potentially help to be
put off by the initial mess, even though I updated it with my cleaner
version as a reply.
I want the scraped links to be saved in my database so that on subsequent
runs it only scrapes and appends new links to the list.
This is my code below, but at the end of the day my database is empty. What
changes can I make to overcome this? Thanks in advance.
from django.template.loader import get_template
from django.shortcuts import render_to_response
from bs4 import BeautifulSoup
import urllib2, sys
import urlparse
import re
from listing.models import jobLinks
# this function extracts the links
def businessghana():
    site = "http://www.businessghana.com/portal/jobs"
    hdr = {'User-Agent': 'Mozilla/5.0'}
    req = urllib2.Request(site, headers=hdr)
    jobpass = urllib2.urlopen(req)
    soup = BeautifulSoup(jobpass)
    for tag in soup.find_all('a', href=True):
        tag['href'] = urlparse.urljoin('http://www.businessghana.com/portal/', tag['href'])
    return map(str, soup.find_all('a', href=re.compile('.getJobInfo')))
# result from businessghana() saved to a variable to make it iterable as a list
all_links = businessghana()
# this function should save the links to the database unless the link already exists
def save_new_links(all_links):
    current_links = jobLinks.objects.all()
    for i in all_links:
        if i not in current_links:
            jobLinks.objects.create(url=i)

# I called the above function here hoping that it will save to the database
save_new_links(all_links)
# return my HttpResponse with this function
def display_links(request):
    name = all_links()
    return render_to_response('jobs.html', {'name': name})
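
To make the behaviour I am after clearer, here is a rough sketch of what I
expect the saving step to do. This is not the code I am running, just an
illustration, and it assumes the URL lives in the "links" field of the model
in my models.py below and that "date retrieved" is simply the time the link
is first seen:

from django.utils import timezone
from listing.models import jobLinks

def save_new_links_sketch(all_links):
    # only create a row when no existing row already stores this link
    for link in all_links:
        if not jobLinks.objects.filter(links=link).exists():
            jobLinks.objects.create(links=link, pub_date=timezone.now())

If that is the wrong way to think about it, please correct me.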
My Django models.py looks like this:

from django.db import models

class jobLinks(models.Model):
    links = models.URLField()
    pub_date = models.DateTimeField('date retrieved')

    def __unicode__(self):
        return self.links
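
Just so it is obvious what I expect a single row to hold, I imagine saving
one link by hand would look something like this (the URL here is made up,
and I am guessing "date retrieved" should just be the time of the scrape):

from django.utils import timezone
from listing.models import jobLinks

# made-up example link, only to show the shape of one saved row
job = jobLinks(links='http://www.businessghana.com/portal/getJobInfo.asp?jobs=123',
               pub_date=timezone.now())
job.save()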