[Twisted-Python] possible error in twisted app

a portion of my twisted app is having some problems. i think i figured out the issue -- but if I'm right.. i'll be a bit lost.

this portion of the app is essentially a web scraper. it grabs a batch of X urls from a data broker, and then updates a database with data about the URL (which either comes from an oEmbed endpoint, a third party data provider, or scraping the page if needed)

there's a lot of code that would be messy to follow, so i'll just explain it as best as possible, and provide some highlights.

the underlying logic is basically this:

- reactor starts an UpdateLinksService, that checks for new batches every 30 seconds
- the UpdateLinksService has an internal marker to check if it's still processing the last batch - or if it's safe to process
- to process the urls, the UpdateLinksService runs them in a request wrapper, that is supposed to be run through a defer.DeferredSemaphore() service
- when i'm done with the batch, i clear out the internal marker via a `deferred_list_finish` method.

looking at some aggressive debugging output, it looks like my work to process a url is happening /after/ i call deferred_list_finish. in other words, i've somehow structured this so that i'm instantly finished.

i *thought* i was running out of memory because i had some phantom deferreds running around. now i'm starting to think that i'm just stacking the queue faster than i work on it.

i've tried changing things around and using different return values, but then started getting "exceptions.AssertionError:" because "assert not isinstance(result, Deferred)" ( twisted/internet/defer.py", line 381, in callback )

the following is a rough composite of what is going on. if anyone sees an obvious fix, i'd be greatly appreciative.

thanks!
=================

    from twisted.internet import defer, threads

    class UpdateLinksService():

        def process_urls(self, urls):
            updates = []
            for url in urls:
                wrapper = RequestWrapper( self.semaphoreService, dbPool )
                d = wrapper.queue_url(url)
                updates.append(d)
            self.d_list = defer.DeferredList( updates )\
                .addCallback( self.deferred_list_finish )

    class RequestWrapper():

        def __init__(self, semaphore_service, dbPool):
            self.semaphoreService = semaphore_service
            self.dbPool = dbPool

        def queue_url( self, url ):
            self.url = url
            d = self.semaphoreService.run( self._to_thread )
            return d

        def _to_thread( self ):
            d = threads.deferToThread( self._thread_begin )
            return d

        def _thread_begin(self):
            # runs in a worker thread started by deferToThread
            worker = UrlWorker()
            d = self.dbPool.runInteraction( worker.process_url , self.url )

    class UrlWorker():

        def process_url(self, txn, url):
            # blocking stuff
            return True  # or False

The reason why I have _to_thread + _thread_begin as 2 functions, and UrlWorker separate, is for code re-use. The RequestWrapper functions are mostly all in a base class; i just subclass RequestWrapper and override _thread_begin and an error callback (not shown).

UrlWorker's various methods are used throughout my twisted daemon.

Hi Jonathan,
On Jan 17, 2014, at 6:22 PM, Jonathan Vanasco <twisted-python@2xlp.com> wrote:
the following is a rough composite of what is going on. if anyone sees an obvious fix, i'd be greatly appreciative.
I'd love to help, but a "rough composite" is hard to make guesses about, especially since you're talking about hard-to-predict memory-consumption behavior.
Can you attach a <http://sscce.org/> which is actually runnable, for example, with a canned list of input URLs (or better yet with an included web server so the URLs can be localhost and more predictable), so we can debug and diagnose a running program instead of ideas about the outline of one?
-glyph

Thanks for the offer to help.

I was hoping someone would see an apparent bug in the outline, so i wouldn't have to build an SSCCE. unfortunately, that wasn't going to fly, so I built out a self-contained version of the issue.

Before sharing it, I added in some docs references to the example... and then I noticed something peculiar, and seemed to have solved the problem !

the issue was this:

1. I used twisted.internet.defer.DeferredSemaphore to set up a semaphore service
2. I queued tasks with `semaphoreService.run( to_thread_function )`
3. `to_thread_function` ran a configurable method through `threads.deferToThread`
4. the configurable method ran something in twisted.enterprise.adbapi.ConnectionPool's `runInteraction`

when copying docs, i realized that I was running `deferToThread` and then `runInteraction`, which uses its own thread. so i had threads spawning threads.

the base 'scaffold' for this daemon has been modified and patched since 2005, so at some point i made an improvement and left some semi-functional legacy cruft in there.

i'm not sure of the specifics on how / why this manifests, but if I just use runInteraction and bypass using `deferToThread`, everything works out perfect.

if you're curious, i tossed the example online here https://github.com/jvanasco/twisted_gist_001

`constants.py` has some toggles for playing with the return values of the various functions ( controls the base class and subclass ). it also lets you toggle to use the broken functionality ( thread within a thread ) or what seems to work fine now.

happy i seemed to have solved this myself. still confused why the issue happened, but this fix ( only 1 thread ) seems to be fine and the more ideal approach.

On Jan 18, 2014, at 7:17 PM, GMail wrote:
Hi Jonathan,
On Jan 17, 2014, at 6:22 PM, Jonathan Vanasco <twisted-python@2xlp.com> wrote:
the following is a rough composite of what is going on. if anyone sees an obvious fix, i'd be greatly appreciative.
I'd love to help, but a "rough composite" is hard to make guesses about, especially since you're talking about hard-to-predict memory-consumption behavior.
Can you attach a <http://sscce.org/> which is actually runnable, for example, with a canned list of input URLs (or better yet with an included web server so the URLs can be localhost and more predictable), so we can debug and diagnose a running program instead of ideas about the outline of one?
-glyph

On Jan 19, 2014, at 1:27 PM, Jonathan Vanasco <twisted-python@2xlp.com> wrote:
Before sharing it, I added in some docs references to the example...
and then I noticed something peculiar, and seemed to have solved the problem !
.... aaaaand that right there is a major reason we ask people for SSCCEs when they ask questions ;-)
-glyph

On 19 Jan, 09:27 pm, twisted-python@2xlp.com wrote:
Thanks for the offer to help.
i'm not sure of the specifics on how / why this manifests, but if I just use runInteraction and bypass using `deferToThread`, everything works out perfect.
To intentionally slightly misinterpret you, the specifics might not be as interesting in this case as this general principle: Twisted APIs are not safe to call except from the thread the reactor is running in. There are exceptions, most notably `reactor.callFromThread`, but they are *extremely* rare.
Jean-Paul