[Tutor] customizing dark_harvest problems

Alan Gauld alan.gauld at btinternet.com
Thu Apr 7 21:10:55 EDT 2016


On 08/04/16 01:51, Jason Willis wrote:

> Though, I do know some things and can figure a little bit out when looking
> at source code I'm usually at a loss when understanding the entire workings
> of a program. 

And that's the problem here. The code is very specific to the page it is
parsing; simply substituting a different file will never work.

For example...

> DOC_ROOT          = 'http://freeproxylists.com'
> ELITE_PAGE        = 'elite.html'


>     def _extract_ajax_endpoints(self):
> 
>         ''' make a GET request to freeproxylists.com/elite.html '''
>         url = '/'.join([DOC_ROOT, ELITE_PAGE])
>         response = requests.get(url)
> 
>         ''' extract the raw HTML doc from the response '''
>         raw_html = response.text
> 
>         ''' convert raw html into BeautifulSoup object '''
>         soup = BeautifulSoup(raw_html)
> 
>         for url in soup.select('table tr td table tr td a'):
>             if 'elite #' in url.text:
>                 yield '%s/load_elite_d%s' % (DOC_ROOT,
> url['href'].lstrip('elite/'))


Notice that the last 'if' section has 'elite #' hard-coded in.
But the standard page doesn't use 'elite #'...

There are probably many more similar content-dependent things
in the code; I just happened to spot that one.
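To make the idea concrete, here is a rough sketch (not a drop-in fix) of
how the page-specific pieces could be pulled out into parameters, so that
adapting the scraper to another page means changing data rather than logic.
The marker text 'elite #' and the 'load_elite_d' URL template come from the
quoted code; the sample link list below is invented for illustration, since
the real code would build it with BeautifulSoup's soup.select(...).

```python
DOC_ROOT = 'http://freeproxylists.com'

def ajax_endpoints(links, marker, template):
    """Yield endpoint URLs built from (text, href) pairs whose link
    text contains the page-specific marker string."""
    for text, href in links:
        if marker in text:
            # take the last path component, e.g. 'elite/1460134159' -> '1460134159'
            yield template % (DOC_ROOT, href.rsplit('/', 1)[-1])

# Invented example data, standing in for links scraped from the page:
links = [('elite #1', 'elite/1460134159'), ('about', 'about.html')]
print(list(ajax_endpoints(links, 'elite #', '%s/load_elite_d%s')))
# -> ['http://freeproxylists.com/load_elite_d1460134159']
```

With that shape, pointing the scraper at a page that labels its links
differently would (in principle) only mean passing a different marker and
template, though the CSS selector itself would still need checking too.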

It would be better if you took the time (only a few hours, really) to
learn how to program in Python, so that you can actually understand
the code rather than making "poke 'n hope" changes.


-- 
Alan G
Author of the Learn to Program web site
http://www.alan-g.me.uk/
http://www.amazon.com/author/alan_gauld
Follow my photo-blog on Flickr at:
http://www.flickr.com/photos/alangauldphotos
