[Tutor] web scraping using Python and urlopen in Python 3.3

Wed Nov 7 17:37:12 CET 2012

On 11/07/2012 10:44 AM, Seema V Srivastava wrote:
> Hi,
> I am new to Python, trying to learn it by carrying out specific tasks.  I
> want to start with trying to scrape the contents of a web page.  I have
> downloaded Python 3.3 and BeautifulSoup 4.
>
> If I call upon urlopen in any form, such as below, I get the error shown
> below the code.  Does urlopen not apply to Python 3.3?  If not, then
> what's the syntax I should be using?  Thanks so much.
>
> import urllib
> from bs4 import BeautifulSoup
> soup = BeautifulSoup(urllib.urlopen("http://www.pinterest.com"))
>
> Traceback (most recent call last):
>   File "C:\Users\Seema\workspace\example\main.py", line 3, in <module>
>     soup = BeautifulSoup(urllib.urlopen("http://www.pinterest.com"))
> AttributeError: 'module' object has no attribute 'urlopen'
>
>

Since you're trying to learn, let me point out a few things that would
let you teach yourself, which is usually quicker and more effective than
asking on a mailing list.  (Go ahead and ask, but if you figure out the
simpler ones yourself, you'll learn faster.)

(BTW, I'm using 3.2, but it'll probably be very close)

First, that error has nothing to do with BeautifulSoup.  If it had, I
wouldn't have responded, since I don't have any experience with BS.  The
way you could learn that for yourself is to factor the line giving the
error:

tmp = urllib.urlopen("http://www.pinterest.com")
soup = BeautifulSoup(tmp)

Now, you'll get the error on the first line, before doing anything with
BeautifulSoup.

Now that you have narrowed it to urllib.urlopen, go find the docs for
that.  I used DuckDuckGo, with keywords  python urllib urlopen, and the
first match was:
     http://docs.python.org/2/library/urllib.html

and even though these are the 2.7.3 docs, the note at the top of the page
tells you something useful:

Note

The urllib
<http://docs.python.org/2/library/urllib.html#module-urllib> module has
been split into parts and renamed in Python 3
to urllib.request, urllib.parse, and urllib.error. The /2to3/
<http://docs.python.org/2/glossary.html#term-to3> tool will
automatically adapt imports when converting your sources to Python 3.
Also note that the urllib.urlopen()
<http://docs.python.org/2/library/urllib.html#urllib.urlopen> function
has been removed in Python 3 in favor of urllib2.urlopen()
<http://docs.python.org/2/library/urllib2.html#urllib2.urlopen>.

Now, the next question I'd ask is whether you're working from a book (or
online tutorial), and whether that book is describing Python 2.x.  If so,
you might encounter this type of pain many times.
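
If your material does turn out to be written for 2.x, one common
workaround is to try the Python 3 location first and fall back to the
Python 2 one.  This is only a sketch of that idiom, not something taken
from any particular book, and it covers just this one renamed function:

```python
# Try the Python 3 location first; fall back to the Python 2 module.
try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib2 import urlopen  # Python 2

# Shows which module the name actually came from.
print(urlopen.__module__)
```

Other renamed modules need the same treatment one by one, so for anything
longer than a few lines it is usually easier to find material written for
3.x.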

Anyway, another place you can learn is from the interactive
interpreter.  Just run python3, and experiment.

>>> import urllib
>>> urllib.urlopen
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'module' object has no attribute 'urlopen'
>>> dir(urllib)
['__builtins__', '__cached__', '__doc__', '__file__', '__name__',
'__package__', '__path__']
>>>

Notice that dir shows us the attributes of urllib, and none of them look
directly useful.  That's because urllib is a package, not just a
module.  A package is a container for other modules.  We can also look
at __file__:

>>> urllib.__file__
'/usr/lib/python3.2/urllib/__init__.py'

That __init__.py is another clue;  that's the way packages are initialized.
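
If you'd rather have the interpreter list what is inside a package, the
standard pkgutil module can do that.  A small sketch (the variable name
submodules is just my choice for illustration):

```python
import pkgutil
import urllib

# urllib.__path__ only exists because urllib is a package;
# iter_modules() yields (finder, name, ispkg) for each submodule found there.
submodules = sorted(name for _, name, _ in pkgutil.iter_modules(urllib.__path__))
print(submodules)
```

On a normal install the list should include request, parse and error,
which is a hint about where urlopen may have moved to.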

The 2.x note above pointed at urllib2, but when I try importing urllib2
in Python 3, I get
   ImportError: No module named urllib2

So back to the website.  Using the dropdown at the upper left, I can
change from 2.7 to 3.3:
    http://docs.python.org/3.3/library/urllib.html

There it is quite explicit.

urllib is a package that collects several modules for working with URLs:

  * urllib.request
    <http://docs.python.org/3.3/library/urllib.request.html#module-urllib.request> for
    opening and reading URLs
  * urllib.error
    <http://docs.python.org/3.3/library/urllib.error.html#module-urllib.error> containing
    the exceptions raised by urllib.request
    <http://docs.python.org/3.3/library/urllib.request.html#module-urllib.request>
  * urllib.parse
    <http://docs.python.org/3.3/library/urllib.parse.html#module-urllib.parse> for
    parsing URLs
  * urllib.robotparser
    <http://docs.python.org/3.3/library/urllib.robotparser.html#module-urllib.robotparser> for
    parsing robots.txt files
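
You will probably want urllib.parse as well once you start following
links on a page.  Here is a small sketch; the path /some/page and the
query string are made up for illustration:

```python
from urllib.parse import urljoin, urlparse

# Split a URL into its components (the path and query here are hypothetical).
parts = urlparse("http://www.pinterest.com/some/page?x=1")
print(parts.scheme, parts.netloc, parts.path, parts.query)

# Resolve a relative link against the page it was found on.
absolute = urljoin("http://www.pinterest.com/some/page", "other")
print(absolute)
```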

So, if we continue to play with the interpreter, we can try:

>>> import urllib.request
>>> dir(urllib.request)

['AbstractBasicAuthHandler', 'AbstractDigestAuthHandler',
'AbstractHTTPHandler', 'BaseHandler', 'CacheFTPHandler',
'ContentTooShortError', 'FTPHandler', 'FancyURLopener', 'FileHandler',
'HTTPBasicAuthHandler', 'HTTPCookieProcessor',
'HTTPDefaultErrorHandler', 'HTTPDigestAuthHandler', 'HTTPError',
'HTTPErrorProcessor',
......
'urljoin', 'urlopen', 'urlparse', 'urlretrieve', 'urlsplit', 'urlunparse']

I chopped off part of the long list of names that are available in that
module.  But one of them is urlopen, which is what you were looking for
before.

So back to your own sources, try:

>>> tmp = urllib.request.urlopen("http://www.pinterest.com")
>>> tmp
<http.client.HTTPResponse object at 0x1df1c10>
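
That response object behaves like a file opened in binary mode, so you
have to read it and decode the bytes before you have text.  Here is a
minimal sketch of how I might wrap that up.  The function name
fetch_html, the 10 second timeout and the utf-8 fallback are my own
assumptions, not anything urllib requires:

```python
import urllib.error
import urllib.request


def fetch_html(url, timeout=10):
    """Return the body of url as text, or None if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            raw = response.read()  # bytes, not str
            # Use the charset from the Content-Type header when there is
            # one; falling back to utf-8 is an assumption.
            charset = response.headers.get_content_charset() or "utf-8"
    except urllib.error.URLError as err:  # HTTPError is a subclass of URLError
        print("request failed:", err)
        return None
    return raw.decode(charset, errors="replace")
```

You would call it as fetch_html("http://www.pinterest.com") and, if the
result is not None, hand the string to BeautifulSoup.  This only catches
URLError, so treat it as a starting point rather than complete error
handling.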

OK, the next thing you might wonder is what parameters urlopen might take:

>>> help(urllib.request.urlopen)
Help on function urlopen in module urllib.request:

urlopen(url, data=None, timeout=<object object>, *, cafile=None,
capath=None)
(END)
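
If you want the same information inside a script rather than in the
pager, the inspect module (3.3 and later) can give you the signature as
an object.  A short sketch; the exact keyword-only parameters you see
will depend on your Python version:

```python
import inspect
import urllib.request

# Build a Signature object for urlopen and walk its parameters.
sig = inspect.signature(urllib.request.urlopen)
for name, param in sig.parameters.items():
    print(name, param.kind)

param_names = list(sig.parameters)
```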

Hopefully, this will get you started with BeautifulSoup.  As I said
before, I have no experience with that part.

Note that I normally use the docs.python.org documentation much more. 
But a quick question to the interpreter can be very useful, especially
if you don't have internet access.

-- 

DaveA