[Tutor] customizing dark_harvest problems
Alan Gauld
alan.gauld at btinternet.com
Thu Apr 7 21:10:55 EDT 2016
On 08/04/16 01:51, Jason Willis wrote:
> Though, I do know some things and can figure a little bit out when looking
> at source code I'm usually at a loss when understanding the entire workings
> of a program.
And that's the problem here. The code is very specific to the page it
is parsing, so simply substituting a different file will never work.
For example...
> DOC_ROOT = 'http://freeproxylists.com'
> ELITE_PAGE = 'elite.html'
> def _extract_ajax_endpoints(self):
>     ''' make a GET request to freeproxylists.com/elite.html '''
>     url = '/'.join([DOC_ROOT, ELITE_PAGE])
>     response = requests.get(url)
>
>     ''' extract the raw HTML doc from the response '''
>     raw_html = response.text
>
>     ''' convert raw html into BeautifulSoup object '''
>     soup = BeautifulSoup(raw_html)
>
>     for url in soup.select('table tr td table tr td a'):
>         if 'elite #' in url.text:
>             yield '%s/load_elite_d%s' % (DOC_ROOT,
>                                          url['href'].lstrip('elite/'))
Notice that the last 'if' test has 'elite #' hard-coded in.
But the standard page doesn't use 'elite #'...
There are probably many more similar content-dependent things
in the code; that's just the one I happened to spot.
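To illustrate the point, here is a minimal sketch of how that kind of
link extraction could take the label as a parameter instead of
hard-coding 'elite #'. This is stdlib-only (no requests/BeautifulSoup)
and the names (LinkCollector, extract_links) are hypothetical, not from
the original program:

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    ''' Collect hrefs of <a> tags whose text contains a given label. '''
    def __init__(self, label):
        super().__init__()
        self.label = label   # the text to match, passed in, not hard-coded
        self.links = []
        self._href = None    # href of the <a> tag currently open, if any
        self._text = []      # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self._href = dict(attrs).get('href')
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == 'a' and self._href is not None:
            if self.label in ''.join(self._text):
                self.links.append(self._href)
            self._href = None

def extract_links(html, label):
    ''' Return the hrefs of links whose text contains `label`. '''
    parser = LinkCollector(label)
    parser.feed(html)
    return parser.links
```

With a structure like that, pointing the program at a page that labels
its links differently is a one-argument change rather than an edit to
the parsing logic.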
It would be better if you took the time (only a few hours, really) to
learn how to program in Python, so that you can actually understand
the code rather than making "poke 'n hope" changes.
--
Alan G
Author of the Learn to Program web site
http://www.alan-g.me.uk/
http://www.amazon.com/author/alan_gauld
Follow my photo-blog on Flickr at:
http://www.flickr.com/photos/alangauldphotos