Python/Django Extract and append only new links

Max Cuban edzeame at gmail.com
Tue Dec 31 16:19:33 CET 2013


 I am putting together a project using Python 2.7 Django 1.5 on Windows 7.
I believe this should be on the django group but I haven't had help from
there so I figured I would try the python list
I have the following view:
views.py:

def foo():
    site = "http://www.foo.com/portal/jobs"
    hdr = {'User-Agent' : 'Mozilla/5.0'}
    req = urllib2.Request(site, headers=hdr)
    jobpass = urllib2.urlopen(req)
    soup = BeautifulSoup(jobpass)
    for tag in soup.find_all('a', href = True):
        tag['href'] = urlparse.urljoin('http://www.businessghana.com/portal/',
tag['href'])
    return map(str, soup.find_all('a', href = re.compile('.getJobInfo')))
def example():
    site = "http://example.com"
    hdr = {'User-Agent' : 'Mozilla/5.0'}
    req = urllib2.Request(site, headers=hdr)
    jobpass = urllib2.urlopen(req)
    soup = BeautifulSoup(jobpass)
    return map(str, soup.find_all('a', href = re.compile('.display-job')))
 foo_links = foo()
 example_links = example()
def all_links():
    return (foo_links + example_links)
def display_links(request):
    name = all_links()
    paginator = Paginator(name, 25)
    page = request.GET.get('page')
    try:
        name = paginator.page(page)
    except PageNotAnInteger:
        name = paginator.page(1)
    except EmptyPage:
        name = paginator.page(paginator.num_pages)
    return render_to_response('jobs.html', {'name' : name})

My template looks like this:

<ol>
{% for link in name %}
  <li> {{ link|safe }}</li>
{% endfor %}
 </ol>
 <div class="pagination">
<span class= "step-links">
    {% if name.has_previous %}
        <a href="?page={{ names.previous_page_number }}">Previous</a>
    {% endif %}
    <span class = "current">
        Page {{ name.number }} of {{ name.paginator.num_pages}}.
    </span>
    {% if name.has_next %}
        <a href="?page={{ name.next_page_number}}">next</a>
    {% endif %}
</span>
 </div>

 Right now as my code stands, anytime I run it, it scraps all the links on
the frontpage of the sites selected and presents them paginated *all afresh*.
However, I don't think its a good idea for the script to read/write all the
links that had previously extracted links all over again and therefore
would like to check for and append only new links. I would like to save the
previously scraped links so that over the course of say, a week, all the
links that have appeared on the frontpage of these sites will be available
on my site as older pages.

It's my first programming project and don't know how to incorporate this
logic into my code.

Any help/pointers/references will be greatly appreciated.

regards, Max
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/python-list/attachments/20131231/b76d3abf/attachment.html>


More information about the Python-list mailing list