[Tutor] How to read `https` url with urllib.request.urlopen?

Mon Sep 3 16:11:00 EDT 2018

On 09/03/2018 07:29 AM, Amrit Pandey wrote:
> I am learning web scraping with python.
> 
> With urllib.request.urlopen() I am able to fetch http urls, but https give some certificate error. How can we bypass the certificate check or is there any other configuration that is used for https urls?

If you really want to ignore certificate errors, you create an SSL
context with verification turned off and pass that to urllib. You should
probably not do that, though: a certificate error usually means a
problem on your end, not on the other end. The exception is if you're
scraping your own test server, which might not have official certs.
/Conceptually/ something like this, but there may be quite a few more
details needed:

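import ssl
import urllib.request

ctx = ssl.create_default_context()
ctx.check_hostname = False  # must be off before verify_mode can be CERT_NONE
ctx.verify_mode = ssl.CERT_NONE  # disables certificate verification
with urllib.request.urlopen(url, context=ctx) as f:
    data = f.read()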
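
If the error comes from your own test server, a less drastic option than turning verification off is to tell Python which CA file to trust. Here is a minimal sketch of both variants. The helper names `make_context` and `fetch` and the `cafile` argument value are hypothetical, picked for illustration only; `insecure=True` gives the same unverified context as above.

```python
import ssl
import urllib.request


def make_context(cafile=None, insecure=False):
    """Build an SSL context for urlopen (hypothetical helper, for illustration)."""
    # With cafile=None the system's default CA certificates are loaded.
    ctx = ssl.create_default_context(cafile=cafile)
    if insecure:
        # Order matters: verify_mode cannot be set to CERT_NONE
        # while check_hostname is still enabled.
        ctx.check_hostname = False
        ctx.verify_mode = ssl.CERT_NONE
    return ctx


def fetch(url, ctx):
    """Return the response body as bytes."""
    with urllib.request.urlopen(url, context=ctx) as f:
        return f.read()
```

Usage would be something like `fetch("https://localhost:8443/", make_context(cafile="my-test-ca.pem"))`, where both the URL and the file name are placeholders for your own setup.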

But in general your life will be much more pleasant if you use the
requests module instead of urllib.

There are also extensive and well-debugged web scraping packages in
Python. If you're actually looking to deploy something, as opposed to
learning something (and there's definitely nothing wrong with a learning
exercise!), you should look at scrapy and others.