[Tutor] FETCH URLs FROM WEBSITE

Válas Péter turtle at 64.hu
Sat Aug 1 19:43:30 CEST 2015


2015-08-01 12:48 GMT+02:00 Gaurav Lathwal <glathwal at gmail.com>:

> Hello everyone, I am new to Python, so please forgive me if my question is
> too dumb.
> I want to write a script that automatically downloads all the videos hosted
> on this site :-
>
> http://www.toonova.com/batman-beyond
>
> Now, the problem I am having is, I am unable to fetch the video urls of all
> the videos.
> I mean I can manually fetch the video urls using the chrome developer's
> console, but it's too time consuming.
> Is there any way to just fetch all the video urls using BeautifulSoup ?
>

I am not familiar with BS; I like to arrange things myself.
The first step is always to look at the page source with the naked eye and
try to guess a rule.
You should also observe the encoding in the page header, we will need it
(in this case UTF-8).

If I want to download things only once, and the rule is strict, sometimes I
simply generate the links in MS Excel with string functions. :-)
A more powerful way is to learn regular expressions, which are the miracles
of the programmers' world when it comes to text processing.
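
To give a feeling for both ideas, here is a small sketch. The example.com
address scheme and the HTML fragment are made up for the demonstration only;
they are not taken from the real site.

```python
import re

# Hypothetical URL scheme, assumed only for illustration.
template = 'http://www.example.com/show-episode-{}'

# The "Excel way": build the links with plain string functions.
generated = [template.format(n) for n in range(1, 4)]

# Made-up HTML fragment standing in for a real page source.
sample = ('<a href="http://www.example.com/show-episode-1">1</a>'
          '<a href="http://www.example.com/show-season-2-episode-2">2</a>')

# The regex way: .*? is non-greedy, so each match stops at the first
# "episode-<digits>" that follows the fixed beginning of the address.
pattern = re.compile(r'http://www\.example\.com/show.*?episode-\d+')
found = pattern.findall(sample)

print(generated)
print(found)
```

The generated list is only right if the addresses really follow a strict
rule, while the pattern also picks up the link with a season string in it.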

In this case the base of your program will be (save it in an editor with
utf-8 encoding):

# -*- coding: UTF-8 -*-
"""
Extract video links from http://www.toonova.com/batman-beyond
Python 3 syntax
"""

import re
import urllib.request

batmanpage = 'http://www.toonova.com/batman-beyond'
# Either has a season string before episode or not
videolinkpattern = re.compile(
    r'http://www\.toonova\.com/batman-beyond.*?episode-\d+')

# Open the page and decode it as UTF-8
page = urllib.request.urlopen(batmanpage).read().decode('utf-8')
for link in videolinkpattern.finditer(page):
    url = link.group()
    print(url)

Now each link is in url in turn, and you can download it inside the loop,
after or instead of the print.
Note that there is a 2nd page with only 2 links, so the simplest thing is to
download those 2 manually. If there were more, you could arrange an outer
loop over the web pages, but in this case this is not necessary.
The program won't work in Python 2 this way, because its urllib module is
different.
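
If you ever need that outer loop, it could look roughly like the sketch
below. The names fetch and collect_links are my own invention for this
example, and I have not run it against the live site, so treat it as a
starting point only.

```python
import re
import urllib.request

LINK_PATTERN = re.compile(
    r'http://www\.toonova\.com/batman-beyond.*?episode-\d+')


def fetch(url):
    """Download one page and decode it as UTF-8."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode('utf-8')


def collect_links(page_urls, fetch_page=fetch):
    """Return the links found on all the pages, in order, without repeats."""
    links = []
    for page_url in page_urls:  # outer loop over the web pages
        for match in LINK_PATTERN.finditer(fetch_page(page_url)):
            if match.group() not in links:  # the same link may occur twice
                links.append(match.group())
    return links
```

You would call collect_links with a list of the page addresses and then loop
over the result. The fetch_page parameter is only there so that the function
can be tried on saved page sources without touching the network.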

