hi, excuse my noobness, I have a few basic questions about twisted, or probably about web servers in general.
what is the advantage of using a single-threaded server?
i figured it makes it more scalable because there's too much overhead to have a thread for each user when you have many simultaneous users. but a friend i'm talking to now says that using i/o blocking threads is perfectly scalable for a large number of simultaneous users.
if that's true i can only see a disadvantage in using a single-threaded server -- having to use deferreds and stuff to make things asynchronous
i also don't understand how you're supposed to use deferreds. the twisted doc says deferreds won't *make* your code asynchronous. so let's say you have to do an sql query that takes 10 seconds, deferreds would be useless for making that not block unless you have a way of making that sql query non-blocking already? how is that done? do you run a separate thread of your own for each sql query? one thread for all sql queries?
also I wonder in a typical twisted app, just how slow should an operation be before you use a deferred? what if a user enters a username and password and i have to look that up in the database. do i use a deferred? just how bad should the query be before using a deferred?
(reading the twisted docs is like reading a brick wall for me, it would be nice if someone could just explain things to me in simple terms.)
thx
On Saturday 26 April 2008, inhahe wrote:
what is the advantage of using a single-threaded server?
If you use threads, your code can be interrupted at any place, except when you tell it not to (locking). If you use deferreds, your code can be interrupted only at exactly those places you have indicated. This makes it much easier to write correct code.
i figured it makes it more scalable because there's too much overhead to have a thread for each user when you have many simultaneous users. but a friend i'm talking to now says that using i/o blocking threads is perfectly scalable for a large number of simultaneous users.
It depends on a lot of things. For example, would you use thread pools or one thread per user? And how many users are you talking about? 100, 1000, 10000, ...?
It also depends on how efficient your OS is in handling threads. I remember having to compile a differently configured Linux kernel because it would run out of processes when creating about 1000 threads, but this was over 5 years ago, before NPTL, so this may no longer be an issue.
If you have a multi core / multi CPU machine, running multiple threads could spread the workload over different cores. For Python this doesn't really help though: the Python VM has the Global Interpreter Lock, which effectively means that unless you implement long-running operations in an extension written in C, there will only be one thread making progress at any time. So if you want to use multiple cores effectively in Python, you have to design your application to consist of separate communicating processes.
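If you want to see the GIL effect for yourself, a quick experiment like this (nothing Twisted-specific; the busy loop just stands in for CPU-bound Python work) makes it visible:

    import threading, time

    def spin(n):
        # pure-Python busy loop: holds the GIL the whole time it runs
        while n:
            n -= 1

    N = 10000000

    start = time.time()
    spin(N)
    spin(N)
    print("sequential:  %.2fs" % (time.time() - start))

    start = time.time()
    t1 = threading.Thread(target=spin, args=(N,))
    t2 = threading.Thread(target=spin, args=(N,))
    t1.start(); t2.start()
    t1.join(); t2.join()
    print("two threads: %.2fs" % (time.time() - start))

    # On standard CPython both figures come out roughly the same (the threaded
    # run is often a bit slower), because only one thread can execute Python
    # bytecode at a time.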
if that's true i can only see a disadvantage in using a single-threaded server -- having to use deferreds and stuff to make things asynchronous
The main advantage in my opinion is that it is much easier to write correct asynchronous code than correct threaded code. If you write threaded code and overlook one place it can be interrupted, you have a bug. If you write asynchronous code and overlook one place it should be interruptible, you get worse latency, but it is still correct.
Because the points at which different tasks are interleaved are much more predictable in asynchronous code, there is a reasonable chance that if your code passes your unit tests, it is actually correct. For threaded code, it's not uncommon that code passes its unit tests, but starts giving wrong results as soon as the server is put under high load.
It may sound strange that I'm saying asynchronous code is easier to write, since that is probably not the experience you have when you start doing it. But if you're writing a complex threaded application, you typically end up assigning each thread its own area of responsibility and getting its inputs and outputs from other threads using event queues. If you don't do this, threads will run through your application in unpredictable ways as the application grows in complexity and even assuming you have proper locking over all shared data, you can run into deadlocks if you don't always lock things in the same order (thread 1 locks A and then B, thread 2 locks B and then A -> possible deadlock).
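To make the lock-ordering point concrete, here is a minimal sketch of that deadlock; the sleeps are only there to make the bad interleaving easy to hit:

    import threading, time

    lock_a = threading.Lock()
    lock_b = threading.Lock()

    def worker1():
        with lock_a:              # thread 1: lock A, then B
            time.sleep(0.1)       # widen the race window
            with lock_b:
                pass

    def worker2():
        with lock_b:              # thread 2: lock B, then A
            time.sleep(0.1)
            with lock_a:
                pass

    t1 = threading.Thread(target=worker1)
    t2 = threading.Thread(target=worker2)
    t1.start(); t2.start()
    t1.join(); t2.join()          # with the sleeps in place, this usually hangs forever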
So you end up with a threaded application design where each thread runs in an isolated pocket, getting data from an event queue, processing it and then inserting it in another event queue. This is not all that different from the asynchronous situation in which you get an event from a reactor callback, do some processing and then register another callback.
As an aside, I think one of the problems with threads is that to write a piece of code correctly, you have to take into account which threads exist in your application. This means it is no longer possible to know whether for example a class is correct by looking at it in isolation. One of the advantages of object oriented programming is that you only have to care about whether a class correctly implements its interface, not how that class is used in an application. But when threading, this is no longer the case: a class that is correct in single threaded use can be incorrect in multi threaded use and a class that is correct in multi threaded use in one application can cause a deadlock in another application.
i also don't understand how you're supposed to use deferreds. the twisted doc says deferreds won't *make* your code asynchronous. so let's say you have to do an sql query that takes 10 seconds, deferreds would be useless for making that not block unless you have a way of making that sql query non-blocking already? how is that done? do you run a separate thread of your own for each sql query? one thread for all sql queries?
If there is an asynchronous API for doing a particular type of I/O, use that. If there isn't, you have to use a thread like you describe and use one of the thread safe reactor calls to pass the result.
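For example, something along these lines; the sleep just stands in for whatever blocking library call you actually need to make:

    import threading, time
    from twisted.internet import reactor

    def handle_result(result):
        # runs in the reactor thread, where normal Twisted code lives
        print("got %r" % (result,))
        reactor.stop()

    def blocking_work():
        time.sleep(2)             # pretend this is a blocking library call
        # callFromThread is the thread-safe way to hand the result back
        # to the reactor thread
        reactor.callFromThread(handle_result, 42)

    threading.Thread(target=blocking_work).start()
    reactor.run()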
My gut feeling tells me to use a thread pool, possibly of size 1, to access for example a database. But I haven't written code like this, so I have no experience to back this up. Every kind of I/O I wanted to do so far was already handled by Twisted. In the case of databases, use "adbapi".
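A minimal adbapi sketch, assuming a local SQLite file called users.db with a users table (check_same_thread=False is needed because the pool's worker threads, not the thread that created the connection, run the queries):

    from twisted.enterprise import adbapi
    from twisted.internet import reactor

    # each query runs on a pool thread; the rows come back as a Deferred
    dbpool = adbapi.ConnectionPool("sqlite3", "users.db",
                                   check_same_thread=False)

    def show(rows):
        print("rows: %r" % (rows,))
        reactor.stop()

    def failed(failure):
        failure.printTraceback()
        reactor.stop()

    d = dbpool.runQuery("SELECT name FROM users WHERE name = ?", ("alice",))
    d.addCallbacks(show, failed)
    reactor.run()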
also I wonder in a typical twisted app, just how slow should an operation be before you use a deferred? what if a user enters a username and password and i have to look that up in the database. do i use a deferred? just how bad should the query be before using a deferred?
It depends on the kind of database. If you have an in-memory database, you don't need a deferred. If you have a simple text file on a local disk, you probably don't need a deferred. If you contact a DB server on the same machine, you might get away with not using a deferred, but it would be better to use one. If you contact a DB server on a different machine, definitely use a deferred.
One simple check is to imagine what would happen if the DB is not available. If you use an in-memory DB, it will always be available. If you use a simple text file on a local disk, you will immediately get an error if opening it fails. If you contact a DB server, it is possible you get a timeout when connecting to it. Since server timeouts are typically in the order of seconds, this is not something you'd want to block your entire application on, so use a deferred.
In any case, Twisted offers "cred" as an authentication framework and cred always uses a deferred to give you the results of a credentials check. This is good because now you can easily switch from one type of credentials checker to another without changing the code that uses it.
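For example, with the toy in-memory checker that ships with Twisted (a database-backed checker would return a Deferred that only fires once the query finishes, but the calling code would look exactly the same):

    from twisted.cred.checkers import InMemoryUsernamePasswordDatabaseDontUse
    from twisted.cred.credentials import UsernamePassword
    from twisted.cred.error import UnauthorizedLogin

    checker = InMemoryUsernamePasswordDatabaseDontUse()
    checker.addUser(b"alice", b"secret")

    def logged_in(avatar_id):
        print("logged in as %r" % (avatar_id,))

    def bad_login(failure):
        failure.trap(UnauthorizedLogin)
        print("bad credentials")

    # requestAvatarId always hands back a Deferred, whatever the checker is
    d = checker.requestAvatarId(UsernamePassword(b"alice", b"secret"))
    d.addCallbacks(logged_in, bad_login)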
(reading the twisted docs is like reading a brick wall for me, it would be nice if someone could just explain things to me in simple terms.)
I think one of the problems is that many people who get started with Twisted are learning both asynchronous programming and Twisted at the same time, so there are a lot of new concepts to learn.
Bye, Maarten
inhahe wrote:
hi, excuse my noobness, I have a few basic questions about twisted, or probably about web servers in general.
There's nothing web-specific in these questions that I see... they apply to any network service serving requests.
what is the advantage of using a single-threaded server?
i figured it makes it more scalable because there's too much overhead to have a thread for each user when you have many simultaneous users. but a friend i'm talking to now says that using i/o blocking threads is perfectly scalable for a large number of simultaneous users.
That's basically right. Threads can be a scalability issue, particularly if you have many connections that are mostly idle — you end up with a lot of wasted memory (for stack space).
Another problem with threads is non-determinism. You can't easily construct a test suite that will find every possible race condition, because a thread can be pre-empted at any time. In effect you have a state machine with a massive number of states, many more than necessary. With a single thread, you can simply and reliably test what happens when events happen in a particular order.
Personally, I find this latter advantage more compelling. The performance differences are in many respects minor (and not clearly one way), especially compared to the overhead of using Python over C/C++. I find it *much* easier to write and test non-threaded code (and for that matter, I find it much easier to write and test Python). No point worrying about performance unless I can be confident in the correctness :)
if that's true i can only see a disadvantage in using a single-threaded server -- having to use deferreds and stuff to make things asynchronous
That is a disadvantage. Creating lots of objects and calling lots of functions can be a performance issue in Python. Deferreds are *much* nicer than the obvious alternative (passing callback functions to functions that produce asynchronous results), though.
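A small illustration of the difference; lookup_user here is a made-up stand-in for any Deferred-returning API:

    from twisted.internet import defer

    def lookup_user(name):
        # a pretend asynchronous API: it returns a Deferred immediately and
        # fires it later with the result (here we fire it straight away)
        d = defer.Deferred()
        d.callback({"name": name, "admin": False})
        return d

    def greet(name):
        print("hello, %s" % (name,))

    # callers attach callbacks to the returned Deferred instead of having to
    # pass a callback function down through every layer of code
    d = lookup_user("alice")
    d.addCallback(lambda user: user["name"].upper())
    d.addCallback(greet)
    d.addErrback(lambda failure: failure.printTraceback())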
Fundamentally, concurrent programming is more complex than non-concurrent. The question is which tradeoffs suit your problem best.
i also don't understand how you're supposed to use deferreds. the twisted doc says deferreds won't *make* your code asynchronous. so let's say you have to do an sql query that takes 10 seconds, deferreds would be useless for making that not block unless you have a way of making that sql query non-blocking already? how is that done? do you run a separate thread of your own for each sql query? one thread for all sql queries?
You've got it. If you have a blocking API, there's nothing you can do to make it non-blocking apart from running it in a thread (if it's kind enough to release the GIL) or running it in a subprocess (if you don't mind the overhead of spawning another process and the complexity of marshalling messages to it rather than simply sharing an address space).
Note that a common compromise between “a separate thread of your own for each sql query” and “one thread for all sql queries” is a thread pool with a limited number of threads. This is what twisted.enterprise.adbapi basically does to run SQL using the standard Python DB-API.
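The reactor's own bounded thread pool gives you the same compromise for any blocking call, not just SQL; a sketch, with the SQLite file name and query as stand-ins:

    import sqlite3
    from twisted.internet import reactor, threads

    reactor.suggestThreadPoolSize(5)        # cap the number of worker threads

    def blocking_query(sql):
        # runs on a pool thread, so it is allowed to block
        conn = sqlite3.connect("users.db")
        try:
            return conn.execute(sql).fetchall()
        finally:
            conn.close()

    def show(rows):
        print("rows: %r" % (rows,))

    d = threads.deferToThread(blocking_query, "SELECT name FROM users")
    d.addCallback(show)
    d.addErrback(lambda failure: failure.printTraceback())
    d.addBoth(lambda _: reactor.stop())
    reactor.run()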
also I wonder in a typical twisted app, just how slow should an operation be before you use a deferred? what if a user enters a username and password and i have to look that up in the database. do i use a deferred? just how bad should the query be before using a deferred?
The precise answer is: it depends.
The short answer is: if it does I/O or is obviously slower than instant, then it's blocking and should be avoided (in your main thread).
To be precise, it depends on your requirements: basically, what performance do you need? If a lookup in the database takes only, say, 30ms, and you don't have lots of concurrent requests, and they only need to do that one lookup, and you only need an average latency for replying to requests of 100ms, then you'd be pretty comfortable with just blocking for that lookup.
Typically, anything that doesn't return immediately, for some value of “immediately”, is good to treat as blocking, and thus something to avoid in your main thread. Small writes to disk are often fast enough to count as “immediate”. Small reads that are probably cached in RAM by your OS might be too. Querying a database usually isn't. It depends on your exact situation, but it sounds like you already have a good idea of the sorts of things to watch out for.
Basically, there's no magic substitute for measuring actual performance, and asking yourself “is it good enough?”
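For instance, something as crude as this tells you whether a given query fits inside your latency budget (the database file and table are just examples):

    import sqlite3, time

    conn = sqlite3.connect("users.db")
    start = time.time()
    conn.execute("SELECT * FROM users WHERE name = ?", ("alice",)).fetchall()
    elapsed_ms = (time.time() - start) * 1000
    print("query took %.1f ms" % (elapsed_ms,))
    # if that is a meaningful slice of your per-request budget, move it off
    # the reactor thread (deferToThread, adbapi, or a separate process)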
(reading the twisted docs is like reading a brick wall for me, it would be nice if someone could just explain things to me in simple terms.)
It sounds to me like you've actually understood things quite well. :)
-Andrew.
inhahe wrote:
hi, excuse my noobness, I have a few basic questions about twisted, or probably about web servers in general.
what is the advantage of using a single-threaded server?
It depends on what you mean by "thread".
If you mean an OS thread, with preemption, then the advantage of Twisted should be evident.
If instead you mean micro-threads (like Lua coroutines, Erlang processes, Stackless Python, or Python greenlets) with cooperative scheduling, then multiple threads should be better, because they are both efficient and easy to program.
[...]
Manlio Perillo
On 26 Apr, 06:44 am, inhahe@gmail.com wrote:
hi, excuse my noobness, I have a few basic questions about twisted, or probably about web servers in general.
what is the advantage of using a single-threaded server?
i figured it makes it more scalable because there's too much overhead to have a thread for each user when you have many simultaneous users. but a friend i'm talking to now says that using i/o blocking threads is perfectly scalable for a large number of simultaneous users.
One way in which a single-threaded server scales better can be understood in terms of how the operating system copes with increased load.
Let's say your site is being hit very, very hard, and you have to ssh in to change some stuff around to update it to deal with more load. With a single-threaded server, the operating system is scheduling two tasks: your SSH session and your web server. Therefore your SSH session gets plenty of time to talk to you. With a multi-threaded or multi-process server, it is scheduling zillions of tasks, and you have to think about limiting the number of processes that can run (which puts a hard limit on the number of concurrent users, one that can easily be too many or too few for your hardware, especially if those processes are doing network I/O of their own).
So, on a poorly-configured multithreaded server, the whole system slows to a crawl and you have to wait for the load to die down before you can do anything about it. With a single-process server like Twisted, lighttpd, or nginx, you can easily get in and poke at it with a separate maintenance tool like SSH.
There are also potentially performance differences between event-driven and multithreaded servers, but there's a huge amount of optimization work that has gone into both approaches, so I wouldn't want to say one is definitely faster. Twisted is definitely a lot slower than several competing servers which use a multi-process approach. However, it can be made to scale in a variety of interesting ways. (I would say that your friend is wrong in saying that i/o blocking threads is "perfectly scalable", though, especially in the naive case.)
if that's true i can only see a disadvantage in using a single-threaded server -- having to use deferreds and stuff to make things asynchronous
This is not a disadvantage. Deferreds are great; if you have a race condition firing two Deferreds, you can easily write a test to fire them in a different order and easily replicate the problem in a debugger to figure out what is going on. If you have the same problem with threads, you are basically screwed; it's very hard to reproduce in an environment where you can see what's going on and even harder to write a repeatable test for.
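For example, both orderings can be pinned down in ordinary trial tests; first_result here is a made-up function standing in for whatever code has the race:

    from twisted.trial import unittest
    from twisted.internet import defer

    def first_result(d1, d2):
        # report whichever of the two Deferreds fires first
        result = defer.Deferred()
        def record(value):
            if not result.called:
                result.callback(value)
            return value
        d1.addCallback(record)
        d2.addCallback(record)
        return result

    class OrderingTests(unittest.TestCase):
        def test_d1_first(self):
            d1, d2 = defer.Deferred(), defer.Deferred()
            result = first_result(d1, d2)
            d1.callback("one")
            d2.callback("two")
            return result.addCallback(self.assertEqual, "one")

        def test_d2_first(self):
            d1, d2 = defer.Deferred(), defer.Deferred()
            result = first_result(d1, d2)
            d2.callback("two")    # same code, opposite firing order
            d1.callback("one")
            return result.addCallback(self.assertEqual, "two")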
i also don't understand how you're supposed to use deferreds. the twisted doc says deferreds won't *make* your code asynchronous. so let's say you have to do an sql query that takes 10 seconds, deferreds would be useless for making that not block unless you have a way of making that sql query non-blocking already? how is that done? do you run a separate thread of your own for each sql query? one thread for all sql queries?
I could try to explain this, but you should really just read the Deferred howto and experiment at the Python prompt for a few hours with Deferred-returning APIs like twisted.web.client and twisted.enterprise.adbapi. The short answer is "twisted uses threads under the covers to do stuff with SQL, but to your code it just looks like a deferred because it's simpler".
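For instance, a first experiment at the prompt might look like this (getPage is the old one-call fetch in twisted.web.client; newer releases replace it with Agent or the treq package, but the Deferred shape is the same):

    from twisted.internet import reactor
    from twisted.web.client import getPage

    def show(body):
        print("got %d bytes" % (len(body),))

    d = getPage(b"http://example.com/")
    d.addCallback(show)
    d.addErrback(lambda failure: failure.printTraceback())
    d.addBoth(lambda _: reactor.stop())
    reactor.run()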
also I wonder in a typical twisted app, just how slow should an operation be before you use a deferred? what if a user enters a username and password and i have to look that up in the database. do i use a deferred? just how bad should the query be before using a deferred?
It's not a question of speed, it's a question of blocking. If you are doing CPU-intensive stuff, you might want to put it into a separate process so you don't need to break it up into lots of little chunks. (Look into spawnProcess.) However, in general the things that use Deferreds are the things that generate some output and then wait for some input in response.
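For example, getProcessOutput is a small convenience wrapper over spawnProcess that hands the child's output back as a Deferred; the program and file name below are just placeholders:

    from twisted.internet import reactor
    from twisted.internet.utils import getProcessOutput

    def show(output):
        print("child said: %r" % (output,))

    d = getProcessOutput("/usr/bin/md5sum", ["bigfile.dat"])
    d.addCallback(show)
    d.addErrback(lambda failure: failure.printTraceback())
    d.addBoth(lambda _: reactor.stop())
    reactor.run()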
(reading the twisted docs is like reading a brick wall for me, it would be nice if someone could just explain things to me in simple terms.)
Good luck.