[Twisted-Python] Problem with XMLRPC resource wrapped with guard basic auth
Using the current trunk r27366 (which is after #4014 fixed a related issue), I am having trouble with an implementation of web.guard-wrapped XMLRPC. This is a new test implementation to expose both a SOAP and an XML-RPC interface. SOAP works, but XML-RPC throws UnsupportedMethod POST. Here is my test code; can anybody tell me if I'm doing something wrong? Again, /soap works, but /rpc2 complains that the POST method isn't allowed.

###### test-script.py

import sys

from zope.interface import implements

from twisted.internet import reactor
from twisted.python import log
from twisted.web import server, resource, guard, xmlrpc, soap
from twisted.cred.portal import IRealm, Portal
from twisted.cred.checkers import InMemoryUsernamePasswordDatabaseDontUse


def getQuote():
    return "Victory to the bourgeois, you capitalist swine!"


class XMLRPCQuoter(xmlrpc.XMLRPC):
    def xmlrpc_quote(self):
        return getQuote()


class SOAPQuoter(soap.SOAPPublisher):
    def soap_quote(self):
        return getQuote()


class WebServicesRealm(object):
    implements(IRealm)

    def requestAvatar(self, avatarId, mind, *interfaces):
        if resource.IResource in interfaces:
            node = resource.Resource()
            node.putChild("rpc2", XMLRPCQuoter())
            node.putChild("soap", SOAPQuoter())
            return resource.IResource, node, lambda: None
        raise NotImplementedError()


if __name__ == "__main__":
    log.startLogging(sys.stdout)
    checkers = [InMemoryUsernamePasswordDatabaseDontUse(foo='bar')]
    webServicesWrapper = guard.HTTPAuthSessionWrapper(
        Portal(WebServicesRealm(), checkers),
        [guard.BasicCredentialFactory("test")])
    reactor.listenTCP(9999, server.Site(webServicesWrapper))
    reactor.run()

######

http://localhost:9999/soap *good*
http://localhost:9999/rpc2 *bad; POST isn't an allowed method*

Thanks,

TWKiel
On 12:00 am, asset@impactdamage.com wrote:
Using the current trunk r27366 (which is after #4014 fixed a related issue), I am having trouble with an implementation of web.guard wrapped XMLRPC. This is a new test implementation to expose both a soap and xmlrpc interface. SOAP works, but xmlrpc throws UnsupportedMethod POST.
I think you're being tricked by the confusing way in which this exception is displayed and by a slight implementation difference (and indeed externally visible behavioral difference) between XMLRPC and SOAPPublisher.

SOAPPublisher defines "render" and no other render methods. So it will accept any request method and treat it the same way. It will never raise UnsupportedMethod. XMLRPC, on the other hand, defines "render_POST" and no other render methods, so it will only accept POST requests. For any other request, it will raise UnsupportedMethod and create that exception with a tuple of which methods it does allow -- ('POST',) in this case.

The exception I see when I approach this server at /rpc2 with my web browser (which is the only thing I've tried, because it's too much work to put together a real xml-rpc client that supports basic auth) is just what I'd expect. The browser issues a GET, the XMLRPC resource rejects this, indicating it only accepts POSTs.

It may be worth improving the way UnsupportedMethod exceptions are stringified to make it more clear what's going on. Or, if this doesn't actually explain your problem, feel free to point that out and provide more details about how the client you're using behaves.

Jean-Paul
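P.S. For completeness, a minimal, untested sketch of such a client, relying on the standard library's xmlrpclib sending basic-auth credentials that are embedded in the URL, and assuming the test server above is running on localhost:9999 with the foo/bar account from the original post:

# Untested sketch: exercise the guarded XML-RPC endpoint with a real POST.
# Assumes the server from the original post on localhost:9999 and the
# foo/bar credentials it configures.
import xmlrpclib

# xmlrpclib adds an "Authorization: Basic ..." header when user:password
# is embedded in the URL.
proxy = xmlrpclib.ServerProxy("http://foo:bar@localhost:9999/rpc2")
print proxy.quote()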
So, I have a situation... I have an application whose basic function is, in simplified form:

    def main():
        get_web_page(main_page_from_params)

    def get_web_page(page_name):
        # set up a page-getter deferred; one of the callbacks gets the
        # links out of the page and sends them to get_them()

    def get_them(links):
        for l in links:
            if l is not being gotten or hasn't been got:
                deferred = get_web_page(l)

In other words, I feed in the top-level page, then recursively feed in any pages linked to by the current page, and they feed in all their links, until all pages are gotten.

I understand the concurrency issues with multiple Deferreds trying to add pages to the "get list" -- it's properly handled in the code (as far as I can tell, so far).

So, here's the question...

I have a "pages" list containing all of the pages. They are marked either gotten or in-flight. In-flight means I have a Deferred that's going to go get it (in get_web_page()). IOW, right now, if I don't already have the page, and I have a link to it, I just start a Deferred to go get it.

Should I limit the number of "in-flight" pages?

Currently, I'm scanning sites that have upwards of 5000 pages and it seems that, when I get too many Deferreds in flight, the app *appears* to crash. I'm not sure whether it's actually going out to lunch or just appears that way and, before I go instrumenting the app to death, can anyone tell me whether there is some sort of practical limit to how many "in-flight" Deferreds might start to cause issues, just due to the sheer number?

Thanks for any insight on this that anyone might offer. I expect the usual flurry of "you must post your exact code or we can't help you at all, moron" posts, but... In spite of my not having posted specific code, could someone with some actual experience in this please give me a clue, within an order of magnitude, how many Deferreds might start to cause real trouble?

Thanks,

S
On Tue, Oct 6, 2009 at 10:40 PM, Steve Steiner (listsin) <listsin@integrateddevcorp.com> wrote:
Should I limit the number of "in-flight" pages?
I'm not going to comment on that, because I don't know what your app is doing or why it appears to be dying. As you said, you didn't post code :). However, you can experiment with it pretty easily using DeferredSemaphore: http://twistedmatrix.com/documents/8.2.0/api/twisted.internet.defer.Deferred...
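A rough, untested sketch of that approach -- a DeferredSemaphore capping how many page fetches are in flight at once; the URL list and the handlePage callback are placeholders standing in for whatever your app actually does:

# Rough sketch: cap the number of in-flight page fetches with a
# DeferredSemaphore.  'urls' and 'handlePage' are placeholders for the
# real crawler's data and callbacks.
from twisted.internet import defer, reactor
from twisted.web.client import getPage

MAX_IN_FLIGHT = 10
sem = defer.DeferredSemaphore(MAX_IN_FLIGHT)

def handlePage(body, url):
    print "got %d bytes from %s" % (len(body), url)

def fetch(url):
    # run() waits for a free token, calls getPage(url), and releases the
    # token when the resulting Deferred fires.
    return sem.run(getPage, url).addCallback(handlePage, url)

urls = ["http://localhost:8080/page%d" % i for i in range(100)]
# consumeErrors=True so a failed fetch doesn't leave an unhandled error.
dl = defer.DeferredList([fetch(u) for u in urls], consumeErrors=True)
dl.addCallback(lambda _: reactor.stop())
reactor.run()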
Currently, I'm scanning sites that have upwards of 5000 pages and it seems that, when I get too many Deferreds in flight, the app *appears* to crash.
I'm not sure whether it's actually going out to lunch or just appears that way and, before I go instrumenting the app to death, can anyone tell me whether there is some sort of practical limit to how many "in-flight" deferreds might start to cause issues, just due to the sheer number?
If your app is doing something strange that you don't understand, you should instrument it until you understand it. Regardless of any practical advice you may receive as a temporary stopgap, there's always a chance that something *else* is going wrong, and by reducing the number of concurrent requests you're just decreasing its likelihood rather than properly fixing it.

It's highly unlikely that it's actually the number of Deferreds. A Deferred is just a Python object, so if you've got the RAM to store them and their associated callbacks, you should be fine. It's more likely that it has something to do with long callback chains, or hitting some kind of file-descriptor limit (what version of Twisted are you using?), or perhaps that 5000 pages is just a lot of pages to request and you might need to wait a while.

Good luck,

-Glyph
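P.S. One cheap piece of instrumentation for the file-descriptor theory, as an untested sketch: the standard library's resource module reports the per-process descriptor limit, and on Linux /proc/self/fd shows how many are currently open.

# Untested sketch: report the per-process file-descriptor limit and, on
# Linux, the number of descriptors this process currently has open.
import os
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print "fd limit: soft=%d hard=%d" % (soft, hard)

if os.path.isdir("/proc/self/fd"):
    # Each entry in /proc/self/fd is one open descriptor.
    print "currently open:", len(os.listdir("/proc/self/fd"))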
On Oct 6, 2009, at 10:57 PM, Glyph Lefkowitz wrote:
However, you can experiment with it pretty easily using DeferredSemaphore: http://twistedmatrix.com/documents/8.2.0/api/twisted.internet.defer.Deferred...
Cool, I didn't know about that, I'll give it a look. Thanks!
If your app is doing something strange that you don't understand, you should instrument it until you understand it.
It's not that I don't understand what's supposed to be happening, or that it's doing something strange; it just seems that sites up to about 2000 pages work fine, and then things get dicey. I was just looking for some guidance on the "max Deferreds" number that has been found in practical experience, more than anything else.
Regardless of any practical advice you may receive as a temporary stopgap, there's always a chance that something else is going wrong, and by reducing the number of concurrent requests you're just decreasing its likelihood rather than properly fixing it.
I understand, and agree. I'm not looking for a stopgap, just a ballpark "don't set more than 2000 in-flight Deferreds at one time" type of guideline. I understand that every situation is different; I'm working to limit my in-flight requests to a manageable number.
It's highly unlikely that it's actually the number of Deferreds. A Deferred is just a Python object, so if you've got the RAM to store them and their associated callbacks, you should be fine.
Yes, I understand that, thank you for clarifying.
It's more likely that it has something to do with long callback chains, or hitting some kind of file-descriptor limit
The callback chains are short, and I'm not getting a file-descriptor limit exception, or any exception that's getting percolated up.
(what version of Twisted are you using?)
Sorry for not including this earlier...

# python -V
Python 2.6.1
>>> import twisted
>>> twisted.__version__
'8.2.0'
I'm running right out of the release versions for these tests since that's what my users will have installed.
or perhaps that 5000 pages is just a lot of pages to request and you might need to wait a while.
Yes, it is a lot of stuff... What I'm working on determining is whether limiting the number of "in-flight" URL getters would be beneficial.

Thanks,

S
Good luck,
-Glyph
Your limit will usually be the number of file descriptors in the system, which can usually be changed via ulimit or your system's equivalent. On Linux I believe it defaults to 1024, so you should be able to handle 1024 simultaneous connections.

One thing of note is that you say you have concurrency issues handled -- but with asynchronous I/O, there are no concurrency issues, since there's no concurrency (at least, not at application level). This is confusing at first but it's important to understand.

All that said, you probably want to maintain a queue of URLs and some sort of graph representation of your data for purposes of finding loops (e.g. A links to B, B links to C, C links to A). You can then set an upper limit on the number of concurrent connections (say 1000) and track the number of deferreds in the system just based on when you start connections and when they finish (via callbacks). Your initial seed can start one URL, and then its callback can hit all linked nodes, and so on and so on. (A rough sketch of this queue-plus-visited-set idea follows the quoted message below.)

You might be hitting a cycle in the page-traversal graph, and that is causing you all sorts of problems in terms of recursion depth or running out of file descriptors. Without seeing your code or your target site, though, it's impossible to say.

Have you considered using another library for web spidering? I believe Scrapy (http://scrapy.org) is a good spidering tool, and it might be easier to use a decent library than to roll your own.

- Matt

On Tue, Oct 6, 2009 at 10:40 PM, Steve Steiner (listsin) <listsin@integrateddevcorp.com> wrote:
So, I have a situation...
I have an application whose basic function is, in simplified form:
def main(): get_web_page(main_page_from_params)
def get_web_page(page_name): set up a page getter deferred, one of the callbacks gets the links out of the page and sends them to get_them()
def get_them(links): for l in links: if l is not being gotten or hasn't been got: deferred = get_web_page(l)
In other words, I feed in the top level page, then recursively feed in any pages linked to by the current page, and they feed in all their links, until all pages are gotten.
I understand the concurrency issues with multiple Deferreds trying to add pages to the "get list" -- it's properly handled in the code (far as I can tell, so far).
So, here's the question...
I have a "pages" list containing all of the pages.
They are set to either gotten or in-flight.
In-flight means I have a deferred that's going to go get it (in get_web_page()).
IOW, right now, if I don't already have the page, and I have a link to it, I just start a deferred to go get it.
Should I limit the number of "in-flight" pages?
Currently, I'm scanning sites that have upwards of 5000 pages and it seems that, when I get too many Deferreds in flight, the app *appears* to crash.
I'm not sure whether it's actually going out to lunch or just appears that way and, before I go instrumenting the app to death, can anyone tell me whether there is some sort of practical limit to how many "in-flight" deferreds might start to cause issues, just due to the sheer number?
Thanks for any insight on this that anyone might offer.
I expect the usual flurry of "you must post your exact code or we can't help you at all, moron" posts, but...
In spite of my not having posted specific code, could someone with some actual experience in this please give me a clue, within an order of magnitude, how many deferreds might start to cause real trouble?
Thanks,
S
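Here is the rough, untested sketch mentioned above: a visited set to break cycles plus a simple in-flight counter to bound concurrency. extractLinks() and the seed URL are placeholders for whatever the real spider uses.

# Rough sketch of the queue-plus-visited-set idea: a 'seen' set breaks
# cycles (A -> B -> C -> A), and an in-flight counter keeps no more than
# MAX_IN_FLIGHT requests outstanding.  extractLinks() and the seed URL
# are placeholders for the real spider's logic.
import sys
from collections import deque

from twisted.internet import reactor
from twisted.python import log
from twisted.web.client import getPage

MAX_IN_FLIGHT = 50

pending = deque()   # URLs waiting to be fetched
seen = set()        # every URL ever queued, so cycles are ignored
in_flight = 0       # requests currently outstanding

def extractLinks(body):
    # Placeholder: parse 'body' and return a list of absolute URLs.
    return []

def enqueue(url):
    if url not in seen:
        seen.add(url)
        pending.append(url)
    pump()

def pump():
    # Start new fetches until the in-flight cap is reached.
    global in_flight
    while pending and in_flight < MAX_IN_FLIGHT:
        url = pending.popleft()
        in_flight += 1
        d = getPage(url)
        d.addCallback(gotPage, url)
        d.addErrback(log.err)   # don't let one bad page kill the crawl
        d.addBoth(finished)

def gotPage(body, url):
    print "fetched %s (%d bytes)" % (url, len(body))
    for link in extractLinks(body):
        enqueue(link)

def finished(result):
    global in_flight
    in_flight -= 1
    if in_flight == 0 and not pending:
        reactor.stop()
    else:
        pump()
    return result

if __name__ == "__main__":
    log.startLogging(sys.stdout)
    enqueue("http://localhost:8080/")   # placeholder seed URL
    reactor.run()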
On Oct 6, 2009, at 11:00 PM, Matt Perry wrote:
One thing of note is that you say you have concurrency issues handled -- but with asynchronous I/O, there are no concurrency issues, since there's no concurrency (at least, not at application level). This is confusing at first but it's important to understand.
The concurrency to which I was referring was having multiple deferreds adding to the "getlist" semi-simultaneously. They have to obtain a lock on the "getlist" before they can add new things to "get", then they release it.

Thanks,

S
If everything is happening in a single thread, you probably don't need to lock anything: callbacks run one at a time, so there's no concurrent access and therefore no race conditions. I have no idea how your app is written, so you may need them -- I don't know. Just an observation. (A tiny sketch of what that means in practice follows the quoted message below.)

- Matt

On Tue, Oct 6, 2009 at 11:13 PM, Steve Steiner (listsin) <listsin@integrateddevcorp.com> wrote:
On Oct 6, 2009, at 11:00 PM, Matt Perry wrote:
One thing of note is that you say you have concurrency issues handled -- but with asynchronous I/O, there are no concurrency issues, since there's no concurrency (at least, not at application level). This is confusing at first but it's important to understand.
The concurrency to which I was referring was having multiple deferreds adding to the "getlist" semi-simultaneously.
They have to obtain a lock on the "getlist" before they can add new things to "get", then they release it.
Thanks,
S
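The sketch mentioned above: a minimal, untested illustration that, as long as everything runs in the single reactor thread, a plain set is enough for the "getlist" -- no lock is needed because callbacks never run concurrently. fetchPage() and the URLs are placeholders.

# Minimal untested sketch: in a single reactor thread, callbacks run one
# at a time, so a plain set is a safe "getlist" without any locking.
# fetchPage() and the URLs are placeholders for the real page getter.
from twisted.internet import defer, reactor

getlist = set()   # shared by every callback, but never touched concurrently

def fetchPage(url):
    # Placeholder: pretend a page body arrives a moment later.
    d = defer.Deferred()
    reactor.callLater(0.1, d.callback, "<html>...</html>")
    return d

def addLinks(body, links):
    # Runs in the reactor thread; no other callback can run until this
    # returns, so the check-then-add below cannot race with another one.
    for link in links:
        if link not in getlist:
            getlist.add(link)

d = fetchPage("http://example.com/")
d.addCallback(addLinks, ["http://example.com/a", "http://example.com/b"])
d.addCallback(lambda _: reactor.stop())
reactor.run()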
Steve Steiner (listsin) wrote: [...]
I expect the usual flurry of "you must post your exact code or we can't help you at all, moron" posts, but...
I'll try to restrain myself ;)
In spite of my not having posted specific code, could someone with some actual experience in this please give me a clue, within an order of magnitude, how many deferreds might start to cause real trouble?
None. Deferreds aren't the problem; they are just Python objects. You can probably have *millions* of them without great difficulty. They are a symptom, not a cause. The problem is more likely the underlying operations that are linked to the Deferreds.

My two top guesses are:

 1) the web server failing to cope with thousands of concurrent requests gracefully, or
 2) the number of sockets hitting a system limit (the number of FDs you can pass to select(), or the max number of file descriptors, something like that),

in that order.

For the second one, assuming you're on Linux, you may benefit from a trivial change to use the epoll reactor rather than the default one. For the first one, you're at the mercy of the webserver. IIRC the RFCs say that clients SHOULD use no more than two concurrent connections to a server...

Regardless, I imagine you're unlikely to get much performance benefit from hammering a server with 1000 concurrent requests over something much smaller, like 5 or 10. So I'd use a DeferredSemaphore, or perhaps look into using Cooperator, and not worry about solving the mystery of how to make 1000s of concurrent requests work.

Of course, if you give more specific info about how your code fails and what it does I might be able to give more specific advice... ;)

-Andrew.
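P.S. Switching to the epoll reactor is just a matter of installing it before anything else imports the reactor -- a minimal sketch, assuming Linux and a Twisted build that ships twisted.internet.epollreactor:

# Minimal sketch: install the epoll-based reactor before anything else
# imports twisted.internet.reactor.  Assumes Linux and a Twisted build
# that includes twisted.internet.epollreactor.
from twisted.internet import epollreactor
epollreactor.install()

from twisted.internet import reactor   # this is now the epoll reactor

# ... build and start the crawler as before:
# reactor.run()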
On Oct 6, 2009, at 11:25 PM, Andrew Bennetts wrote:
Steve Steiner (listsin) wrote: [...]
I expect the usual flurry of "you must post your exact code or we can't help you at all, moron" posts, but...
I'll try to restrain myself ;)
Thanks, I appreciate your restraint. Must say, most posts without code drive me a little cuckoo, too. I hope this one was justified in not including specific code.
My two top guesses are:
1) the web server failing to cope with thousands of concurrent requests gracefully, or
Ah, that would certainly make sense in this particular case. I am asking the server for an *awful* lot of stuff all at one time.
2) the number of sockets is hitting a system limit (number of FDs you can pass to select(), or hitting the max number of file descriptors, something like that)
That may also be an issue, thank you for pointing that out.
IIRC the RFCs say that clients SHOULD use no more than two concurrent connections to a server...
As you said, the performance is going to be rate-limited by the server's ability to respond to requests anyway, so I think what I'll do is just make sure the "get me a page" queue doesn't put more than a couple of requests "in-flight" at the same time.

Thanks for helping me think this out!

S
participants (6):

- Andrew Bennetts
- asset
- exarkun@twistedmatrix.com
- Glyph Lefkowitz
- Matt Perry
- Steve Steiner (listsin)