[Tutor] Recursion depth exceeded in python web crawler
Mark Lawrence
breamoreboy at gmail.com
Thu Jun 14 22:01:05 EDT 2018
On 14/06/18 19:32, Daniel Bosah wrote:
> I am trying to modify code from a web crawler to scrape for keywords from
> certain websites. However, I'm trying to run the web crawler before I
> modify it, and I'm running into issues.
>
> When I ran this code -
>
>
>
>
> import threading
> from Queue import Queue
> from spider import Spider
> from domain import get_domain_name
> from general import file_to_set
>
>
> PROJECT_NAME = "SPIDER"
> HOME_PAGE = "https://www.cracked.com/"
> DOMAIN_NAME = get_domain_name(HOME_PAGE)
> QUEUE_FILE = '/home/me/research/queue.txt'
> CRAWLED_FILE = '/home/me/research/crawled.txt'
> NUMBER_OF_THREADS = 1
> # Capitalize variables and make them class variables to make them const variables
>
> threadqueue = Queue()
>
> Spider(PROJECT_NAME, HOME_PAGE, DOMAIN_NAME)
>
> def crawl():
>     change = file_to_set(QUEUE_FILE)
>     if len(change) > 0:
>         print str(len(change)) + ' links in the queue'
>         create_jobs()
>
> def create_jobs():
>     for link in file_to_set(QUEUE_FILE):
>         threadqueue.put(link)  # .put = put item into the queue
>     threadqueue.join()
>     crawl()
>
> def create_spiders():
>     for _ in range(NUMBER_OF_THREADS):  # _ basically if you don't want to act on the iterable
>         vari = threading.Thread(target=work)
>         vari.daemon = True  # makes sure that it dies when main exits
>         vari.start()
>
> #def regex():
>     #for i in files_to_set(CRAWLED_FILE):
>         #reg(i, LISTS)  # MAKE FUNCTION FOR REGEX; i is urls, LISTS is list or set of keywords
>
> def work():
>     while True:
>         url = threadqueue.get()  # pops item off queue
>         Spider.crawl_pages(threading.current_thread().name, url)
>         threadqueue.task_done()
>
> create_spiders()
>
> crawl()
>
>
> That used this class:
>
> from HTMLParser import HTMLParser
> from urlparse import urlparse
>
> class LinkFinder(HTMLParser):
>     def __init__(self, base_url, page_url):
>         super().__init__()
>         self.base_url = base_url
>         self.page_url = page_url
>         self.links = set()  # stores the links
>     def error(self, message):
>         pass
>     def handle_starttag(self, tag, attrs):
>         if tag == 'a':  # means a link
>             for (attribute, value) in attrs:
>                 if attribute == 'href':  # href relative url i.e. not having www
>                     url = urlparse.urljoin(self.base_url, value)
>                     self.links.add(url)
>     def return_links(self):
>         return self.links()
It's very unpythonic to define getters like return_links; just access
self.links directly.
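For example (a minimal sketch, not your original class — `handle_link` here is just a hypothetical stand-in for wherever you add links):

```python
class LinkFinder(object):
    """Minimal sketch: links is a plain attribute, read it directly."""
    def __init__(self):
        self.links = set()

    def handle_link(self, url):
        self.links.add(url)

finder = LinkFinder()
finder.handle_link('https://example.com/about')
print(finder.links)  # read the attribute directly, no return_links() needed
```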
>
>
> And this spider class:
>
>
>
> from urllib import urlopen  # connects to webpages from python
> from link_finder import LinkFinder
> from general import directory, text_maker, file_to_set, conversion_to_set
>
>
> class Spider():
>     project_name = 'Reader'
>     base_url = ''
>     Queue_file = ''
>     crawled_file = ''
>     queue = set()
>     crawled = set()
>
>     def __init__(self, project_name, base_url, domain_name):
>         Spider.project_name = project_name
>         Spider.base_url = base_url
>         Spider.domain_name = domain_name
>         Spider.Queue_file = '/home/me/research/queue.txt'
>         Spider.crawled_file = '/home/me/research/crawled.txt'
>         self.boot()
>         self.crawl_pages('Spider 1 ', base_url)
It strikes me as completely pointless to define this class when every
variable is at the class level and every method is defined as a static
method. Python isn't Java :)
[code snipped]
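A sketch of what the instance-attribute version could look like (same names as your post, with `boot`/`crawl_pages` omitted):

```python
class Spider(object):
    """Sketch: per-instance state instead of class-level variables."""
    def __init__(self, project_name, base_url, domain_name):
        self.project_name = project_name
        self.base_url = base_url
        self.domain_name = domain_name
        self.queue = set()    # each Spider gets its own queue...
        self.crawled = set()  # ...and its own crawled set

spider = Spider('SPIDER', 'https://www.cracked.com/', 'cracked.com')
print(spider.domain_name)  # cracked.com
```

Now two Spider objects no longer stomp on each other's state the way assignments to `Spider.xyz` do.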
>
> and these functions:
>
>
>
> from urlparse import urlparse
>
> # get subdomain name (name.example.com)
>
> def subdomain_name(url):
>     try:
>         return urlparse(url).netloc
>     except:
>         return ''
It's very bad practice to use a bare except like this as it hides any
errors and prevents you from using CTRL-C to break out of your code.
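Catch only the failures you actually expect. A sketch (Python 3 import shown; in Python 2 it's `from urlparse import urlparse` — and which exception types can occur here is an assumption on my part):

```python
from urllib.parse import urlparse  # Python 2: from urlparse import urlparse

def subdomain_name(url):
    try:
        return urlparse(url).netloc
    except (ValueError, AttributeError):
        # Only the errors we expect; KeyboardInterrupt and SystemExit
        # now propagate normally, so CTRL-C still works.
        return ''

print(subdomain_name('https://name.example.com/page'))  # name.example.com
```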
>
> def get_domain_name(url):
>     try:
>         variable = subdomain_name.split(',')
>         return variable[-2] + ',' + variable[-1]  # returns 2nd to last and last instances of variable
>     except:
>         return '''
The above line is a syntax error.
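There are two more problems on the way to it: `subdomain_name.split(',')` calls .split on the function object itself rather than on `subdomain_name(url)`, and the separator presumably wants to be '.' rather than ','. A sketch of what I assume the intent is (Python 3 import; Python 2 uses `from urlparse import urlparse`):

```python
from urllib.parse import urlparse

def subdomain_name(url):
    try:
        return urlparse(url).netloc
    except (ValueError, AttributeError):
        return ''

def get_domain_name(url):
    """Keep the last two dot-separated labels of the host name."""
    try:
        parts = subdomain_name(url).split('.')  # call the function, split on '.'
        return parts[-2] + '.' + parts[-1]      # e.g. 'example.com'
    except IndexError:                          # host with fewer than two labels
        return ''                               # a normal empty string, not '''

print(get_domain_name('https://name.example.com/page'))  # example.com
```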
>
>
> (there are more functions, but those are housekeeping functions)
>
>
> The interpreter returned this error:
>
> RuntimeError: maximum recursion depth exceeded while calling a Python
> object
>
>
> After calling crawl() and create_jobs() a bunch of times?
>
> How can I resolve this?
>
> Thanks
Just a quick glance, but crawl calls create_jobs which calls crawl again,
so the two functions recurse into each other, one stack frame per batch of
links, until Python's recursion limit is hit.
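The usual fix is to turn the pair into a single loop so the stack stays flat. A self-contained sketch with an in-memory stand-in for QUEUE_FILE and the workers (`fetch_new_links` is hypothetical, standing in for whatever Spider.crawl_pages discovers):

```python
def crawl(pending, crawled, fetch_new_links):
    """Loop version of crawl()/create_jobs(): drains the queue batch by
    batch without ever recursing, so the recursion limit is never hit."""
    while pending:
        batch = set(pending)   # snapshot this round's links
        pending.clear()
        for url in batch:
            if url in crawled:
                continue
            crawled.add(url)
            # newly discovered links feed the next iteration of the loop
            pending |= fetch_new_links(url) - crawled
    return crawled

# Tiny in-memory demo of the link graph:
links = {'a': {'b', 'c'}, 'b': {'c'}, 'c': set()}
result = crawl({'a'}, set(), lambda u: links[u])
print(sorted(result))  # ['a', 'b', 'c']
```

In your code the equivalent change is to have crawl() loop over file_to_set(QUEUE_FILE) and threadqueue.join() inside a while, instead of having create_jobs() call crawl() back.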
--
My fellow Pythonistas, ask not what our language can do for you, ask
what you can do for our language.
Mark Lawrence