Finding sentinel text when using a thread pool...
Christopher Reimer
christopher_reimer at yahoo.com
Sat May 20 13:20:05 EDT 2017
On 5/20/2017 1:19 AM, dieter wrote:
> If your (590) pages are linked together (such that you must fetch
> a page to get the following one) and page fetching is the limiting
> factor, then this would limit the parallelizability.
The pages are not linked together. The URL takes a page number. If I
requested 1000 pages in sequence, the first 60% would have comments and
the remaining 40% would have the sentinel text. As more comments are
added, the dividing line between the last page with the oldest comments
and the first page with the sentinel text shifts over time.
After I changed the code to fetch 16 pages at a time, the run time
dropped by nine minutes.
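Roughly what the batched fetch looks like; the URL template and
sentinel string below are placeholders, not the real site's values:

from concurrent.futures import ThreadPoolExecutor

import requests

URL = 'https://example.com/comments?page={}'   # placeholder template
SENTINEL = 'No comments found.'                # placeholder sentinel
BATCH = 16

def fetch(number):
    # Fetch one page by number; returns (number, html).
    return number, requests.get(URL.format(number), timeout=10).text

def fetch_pages():
    # Fetch pages 16 at a time until a batch hits the sentinel.
    start = 1
    with ThreadPoolExecutor(max_workers=BATCH) as pool:
        while True:
            for number, html in pool.map(fetch, range(start, start + BATCH)):
                if SENTINEL in html:
                    return          # crossed the dividing line; stop
                yield number, html
            start += BATCH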
> If processing a selected page takes a significant amount of time
> (compared to the fetching), then you could use a work queue as follows:
> a page is fetched and the following page determined; if a following
> page is found, processing this page is put as a job into the work queue
> and page processing is continued. Free tasks look for jobs in the work queue
> and process them.
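If I understand the suggestion, it's roughly this pattern, where
fetch_page(n) returns a page plus the number of the following page (or
None at the end) and parse_page() does the heavy processing; both names
are stand-ins for my real code:

import queue
import threading

def run(fetch_page, parse_page, first_page, num_workers=4):
    jobs = queue.Queue()

    def worker():
        # Free task: pull a fetched page off the queue and process it.
        while True:
            html = jobs.get()
            if html is None:
                break               # shutdown signal
            parse_page(html)
            jobs.task_done()

    workers = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in workers:
        t.start()

    number = first_page
    while number is not None:
        html, number = fetch_page(number)   # fetch, find the follower
        jobs.put(html)                      # queue this page as a job

    jobs.join()                             # wait for parsing to finish
    for _ in workers:
        jobs.put(None)
    for t in workers:
        t.join()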
I'm looking into that now. The requester class yields one page at a
time. If I change the code to yield a list of 16 pages, I could parse 16
pages at a time. That change would require a bit more work, but it
would fix some problems that have been nagging me for a while about the
parser class.
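A small helper along these lines would do the batching, wrapping the
existing one-page-at-a-time generator (requester.pages() and
parser.parse() below are stand-ins for my class methods):

from itertools import islice

def batched(pages, size=16):
    # Group a one-page-at-a-time iterator into lists of `size` pages.
    it = iter(pages)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk

# for chunk in batched(requester.pages()):
#     parser.parse(chunk)    # parser now sees 16 pages per call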
Thank you,
Chris Reimer