PEP 611 -- why limit coroutines and classes?
I want to question two specific limits.

(a) Limiting the number of classes, in order to potentially save space in object headers, sounds like a big C API change, and I think it's better to lift this idea out of PEP 611 and debate the pros and cons separately.

(b) Why limit coroutines? It's just another Python object and has no operating resources associated with it. Perhaps your definition of coroutine is different, and you are thinking of OS threads?

--
--Guido van Rossum (python.org/~guido)
Pronouns: he/him (why is my pronoun here?)
<http://feministing.com/2015/02/03/how-using-they-as-a-singular-pronoun-can-c...>
I was thinking the same thing. We should distinguish limits with respect to the codegen process, which seem reasonable, vs runtime. Classes and coroutines are objects, and like objects in general, the program should have the option of filling its heap with any arbitrary objects. (Whether wise or not, this design is not for us to arbitrarily limit. For example, I recall that Eve Online is/was running large numbers of stackless coroutines, possibly well in excess of 1M.)

For some comparison: note that the JVM has made it easier to tune the use of the native heap for class objects since Java 8, in part to relax earlier constraints around "permgen" allocation - by default, class objects are automatically allocated from the heap without limit (this is managed by "metaspace"). I suppose if this were a tunable option, maybe it could be useful, but probably not - Java's ClassLoader design is prone to leaking classes, as we know from our work on Jython. There's nothing comparable, to my knowledge, for why this would be the case for CPython class objects more than other objects.

I also would suggest for PEP 611 that any limits are discoverable (maybe in sys) so they can be used by other implementations like Jython. There's no direct correspondence between LOC and generated Python or Java bytecode, but it could possibly still be helpful for some codegen systems. Jython is limited to 2**15 bytes per method due to label offsets, although we do have workarounds for certain scenarios, and could always compile, then run Python bytecode for large methods. (Currently we use CPython to do that workaround compilation, thanks!)

Lastly, PEP 611 currently erroneously conjectures that "For example, Jython might need to use a lower class limit of fifty or sixty thousand becuase of JVM limits."

- Jim
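As a rough illustration of that "compile, then run Python bytecode" fallback (a sketch in plain CPython, not Jython code; the generated function is made up for this example), one can generate a large method as source, compile it, and check how much bytecode it needs, which is the quantity that Jython's 2**15-byte label offsets constrain:

```
# Generate a large function as source, compile it with CPython, and
# inspect the size of the resulting bytecode for that single method.
src = "def big():\n"
src += "".join(f"    x{i} = {i}\n" for i in range(5000))
src += "    return x0\n"

namespace = {}
exec(compile(src, "<generated>", "exec"), namespace)

print(namespace["big"]())                      # 0
print(len(namespace["big"].__code__.co_code))  # bytecode size of the method, in bytes
```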
I also would suggest for PEP 611 that any limits are discoverable (maybe in sys) so it can be used by other implementations like Jython.
I agree, I think that sys would likely be the most reasonable place to read these limits from. Also, it seems like a good location for setting the limits, if that becomes an option. This would go along well with the existing sys.getrecursionlimit() and sys.setrecursionlimit().

In general, this proposal would be much easier to consider if the limits were customizable. I'm not sure if it would be reasonable for all of the options, but it would at least allow those who have a legitimate use case for going beyond the limits (either now or in the future) to still be able to do so.
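For illustration, the recursion limit already follows exactly this get/set pattern in sys; the PEP 611 equivalents sketched in the comments below are purely hypothetical names, not an existing or proposed API:

```
import sys

# The existing, real API for one runtime limit:
print(sys.getrecursionlimit())   # typically 1000
sys.setrecursionlimit(2000)
print(sys.getrecursionlimit())   # 2000

# A purely hypothetical shape for discoverable PEP 611 limits; these
# functions do NOT exist, they only sketch the suggested interface:
#
#   sys.get_limits()              # -> {"lines_per_module": 1_000_000, ...}
#   sys.set_limit("classes", n)   # error if the build cannot honour it
```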
On Mon, Dec 09, 2019 at 11:27:56PM -0500, Kyle Stanley wrote:
I agree, I think that sys would likely be the most reasonable place to read these limits from. Also, it seems like a good location for setting of the limits, if that becomes an option. This would go along well with the existing sys.getrecursionlimit() and sys.setrecursionlimit().
In general, this proposal would be much easier to consider if the limits were customizable. I'm not sure if it would be reasonable for all of the options, but it would at least allow those who have a legitimate use case for going beyond the limits (either now or in the future) to still be able to do so.
Someone will correct me if I'm wrong, but my reading of the PEP tells me that these are structural limits in the interpreter's internal data structures, so they can't be configured at runtime.

The impression I get is that you believe that the proposal is to add a bunch of runtime checks like this:

    # pseudo-code for the module-loading code
    lines = 0
    while 1:
        read and process line of code
        lines += 1
        if lines >= 1000000:
            GOTO error condition

In that case, it would be easy to change the 1000000 constant to a value that can be configured at runtime. You wouldn't even need to quit the running interpreter.

As I understand the PEP, the proposal is to change a bunch of C-level structs which currently contain 32-bit fields into 20-bit fields. To change that, you would need to recompile with new struct definitions and build a new interpreter binary.

If I'm right, making this configurable at *build-time* could be possible, but that's probably a bad idea. That's like the old "narrow versus wide Unicode" builds. It doubles (at least!) the amount of effort needed to maintain Python, for something which (if the PEP is correct) benefits nearly nobody but costs nearly everyone.

--
Steven
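To make the bit-width arithmetic concrete, here is a toy sketch using ctypes bit fields; it is not CPython's actual object header layout, just an illustration of why a 20-bit field tops out at about one million:

```
import ctypes

class ToyHeader(ctypes.Structure):
    # Toy layout only, not CPython's real object header.  Each 20-bit
    # field can hold values 0 .. 2**20 - 1, i.e. the PEP's "one million" cap.
    _fields_ = [
        ("class_id", ctypes.c_uint64, 20),
        ("code_len", ctypes.c_uint64, 20),
        ("flags",    ctypes.c_uint64, 24),
    ]

print(2 ** 20 - 1)               # 1048575, the largest value that fits in 20 bits
print(ctypes.sizeof(ToyHeader))  # 8: all three fields share a single 64-bit word
```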
Steve D'Aprano wrote:
As I understand the PEP, the proposal is to change a bunch of C-level structs which currently contain 32-bit fields into 20-bit fields. To change that, you would need to recompile with new struct definitions and build a new interpreter binary.
If I'm right, making this configurable at *build-time* could be possible, but that's probably a bad idea. That's like the old "narrow versus wide Unicode" builds. It doubles (at least!) the amount of effort needed to maintain Python, for something which (if the PEP is correct) benefits nearly nobody but costs nearly everyone.
Ah, if that's the case I'll withdraw the suggestion for allowing the limits to be configurable through sys then. Thanks for the clarification.

I have some basic understanding of how C structs work, but I'll admit that it's far from an area that I'm knowledgeable about. My feedback is mostly based on experience as a user and involvement with stdlib development, not as a C developer. I don't have a strong understanding of the implementation details behind the limits.

Although if they're not (reasonably) configurable by most users, I would consider that to further reinforce my previous statement that we should have some form of concrete evidence to prove that imposing the limits will provide tangible benefits to the vast majority of Python users.
(b) Why limit coroutines? It's just another Python object and has no operating resources associated with it. Perhaps your definition of coroutine is different, and you are thinking of OS threads?
This was my primary concern with the proposed PEP. At the moment, it's rather trivial to create one million coroutines, and the total memory taken up by each individual coroutine object is very minimal compared to each OS thread.

There's also a practical use case for having a large number of coroutine objects, such as asynchronously:

1) Handling a large number of concurrent clients on a continuously running web server that receives a significant amount of traffic.
2) Sending a large number of concurrent database transactions to run on a cluster of database servers.

I don't know that anyone is currently using production code that results in 1 million coroutine objects within the same interpreter at once, but something like this definitely scales over time. Arbitrarily placing a limit on the total number of coroutine objects doesn't make sense to me for that reason.

OS threads, on the other hand, take significantly more memory. From a recent (but entirely unrelated) discussion where the memory usage of threads was brought up, Victor Stinner wrote a program that demonstrated that each OS thread takes up approximately 13.2 kB on Linux, which I verified on kernel version 5.3.8. See https://bugs.python.org/msg356596.

For comparison, I just wrote a similar program to compare the memory usage between 1M threads and 1M coroutines:

```
import asyncio
import threading
import sys
import os


def wait(event):
    event.wait()


class Thread(threading.Thread):
    def __init__(self):
        super().__init__()
        self.stop_event = threading.Event()
        self.started_event = threading.Event()

    def run(self):
        self.started_event.set()
        self.stop_event.wait()

    def stop(self):
        self.stop_event.set()
        self.join()


def display_rss():
    os.system(f"grep ^VmRSS /proc/{os.getpid()}/status")


async def test_mem_coros(count):
    print("Coroutine memory usage before:")
    display_rss()
    coros = tuple(asyncio.sleep(0) for _ in range(count))
    print("Coroutine memory usage after creation:")
    display_rss()
    await asyncio.gather(*coros)
    print("Coroutine memory usage after awaiting:")
    display_rss()


def test_mem_threads(count):
    print("Thread memory usage before:")
    display_rss()
    threads = tuple(Thread() for _ in range(count))
    print("Thread memory usage after creation:")
    display_rss()
    for thread in threads:
        thread.start()
    print("Thread memory usage after starting:")
    for thread in threads:
        thread.run()
    print("Thread memory usage after running:")
    display_rss()
    for thread in threads:
        thread.stop()
    print("Thread memory usage after stopping:")
    display_rss()


if __name__ == '__main__':
    count = 1_000_000
    arg = sys.argv[1]
    if arg == 'threads':
        test_mem_threads(count)
    if arg == 'coros':
        asyncio.run(test_mem_coros(count))
```

Here are the results:

1M coroutine objects:

Coroutine memory usage before:
VmRSS: 14800 kB
Coroutine memory usage after creation:
VmRSS: 651916 kB
Coroutine memory usage after awaiting:
VmRSS: 1289528 kB

1M OS threads:

Thread memory usage before:
VmRSS: 14816 kB
Thread memory usage after creation:
VmRSS: 4604356 kB
Traceback (most recent call last):
  File "temp.py", line 60, in <module>
    test_mem_threads(count)
  File "temp.py", line 44, in test_mem_threads
    thread.start()
  File "/usr/lib/python3.8/threading.py", line 852, in start
    _start_new_thread(self._bootstrap, ())
RuntimeError: can't start new thread

(Python version: 3.8)
(Linux kernel version: 5.13)

As shown in the results above, 1M OS threads can't even be run at once, and the memory taken up just to create the 1M threads is ~3.6x more than it costs to concurrently await the 1M coroutine objects.
Based on that, I think it would be reasonable to place a limit of 1M on the total number of OS threads. It seems unlikely that a system would be able to properly handle 1M threads at once anyway, whereas that seems entirely feasible with 1M coroutine objects, especially on a high-traffic server.
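As a rough additional data point (a minimal sketch, not from the thread; the exact number varies by build), the per-object footprint of a single coroutine can be inspected directly, which underlines that it is just a small heap object:

```
import sys

async def handler():
    return None

coro = handler()            # no OS resources are acquired here, just a small heap object
print(type(coro))           # <class 'coroutine'>
print(sys.getsizeof(coro))  # roughly 100-200 bytes, build-dependent (attached frame not counted)
coro.close()                # close it so no "never awaited" warning is emitted
```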
On Mon, Dec 9, 2019, 18:48 Kyle Stanley <aeros167@gmail.com> wrote:
As shown in the results above, 1M OS threads can't even be run at once, and the memory taken up just to create the 1M threads is ~3.6x more than it costs to concurrently await the 1M coroutine objects. Based on that, I think it would be reasonable to place a limit of 1M on the total number of OS threads. It seems unlikely that a system would be able to properly handle 1M threads at once anyway, whereas that seems entirely feasible with 1M coroutine objects, especially on a high-traffic server.
This logic doesn't seem much different than would be for coroutines... Just need to wait for larger systems... With 100k threads started we're only using 8G memory, there are plenty of systems today with more than 80G of RAM
This logic doesn't seem much different than would be for coroutines... Just need to wait for larger systems...
With 100k threads started we're only using 8G memory, there are plenty of systems today with more than 80G of RAM
Well, either way, I think it's still a solid argument against imposing the 1M limit on coroutines. Arguing in favor of or against 1M OS threads wasn't the primary message I was trying to convey; it was just to demonstrate that 1M coroutines could be created and awaited concurrently on most current systems (with ~1.3GB+ of available RAM and virtual memory). But I think there's a reasonable question of practicality when it comes to running 1M OS threads simultaneously.

For especially high volumes of concurrent tasks, OS threads are generally not the best solution (for CPython, at least). They work for handling a decent number of IO-bound tasks such as sending out and processing network requests, but coroutine objects are significantly more efficient when it comes to memory usage. For child processes, we have watcher implementations that don't use OS threads at all, such as the recently added PidfdChildWatcher (https://docs.python.org/3.9/library/asyncio-policy.html#asyncio.PidfdChildWa...). There are also others that don't spawn a new thread per process.

That being said, you are correct that at some point, the memory usage for running 1M simultaneous OS threads will be perfectly reasonable. I'm just not sure there's a practical reason to do so, considering that more efficient means of implementing parallelism are available when memory usage becomes a significant concern.

Of course, the main question here is: "What benefit would imposing this particular limit on either coroutine objects or OS threads provide?". Personally, I'm not entirely convinced that placing a hard limit of 1M at once on either would result in a significant benefit to performance, efficiency, or security (mentioned in the PEP as reasons for imposing the limits). I could see it being more useful for other areas though, such as lines of code or bytecode instructions per object. I just think that placing a limit of 1M on current coroutine objects would not be reasonable. But between the two, I think a limit of 1M on OS threads is *more* reasonable in comparison.
On Mon, 9 Dec 2019 21:42:36 -0500 Kyle Stanley <aeros167@gmail.com> wrote:
(b) Why limit coroutines? It's just another Python object and has no operating resources associated with it. Perhaps your definition of coroutine is different, and you are thinking of OS threads?
This was my primary concern with the proposed PEP. At the moment, it's rather trivial to create one million coroutines, and the total memory taken up by each individual coroutine object is very minimal compared to each OS thread.
There's also a practical use case for having a large number of coroutine objects, such as for asynchronously:
1) Handling a large number of concurrent clients on a continuously running web server that receives a significant amount of traffic.
Not sure how that works? Each client has an accepted socket, which is bound to a local port number, and there are 65536 TCP port numbers available. Unless you're using 15+ coroutines per client, you probably won't reach 1M coroutines that way.
2) Sending a large number of concurrent database transactions to run on a cluster of database servers.
1M concurrent database transactions? Does that sound reasonable at all? Your database administrator probably won't like you.
something like this definitely scales over time. Arbitrarily placing a limit on the total number of coroutine objects doesn't make sense to me for that reason.
There are a lot of arbitrary limits inside a computer system. You just aren't aware of them because you don't hit them in practice. Claiming that limits shouldn't exist is just pointless.

Regards

Antoine.
On Thu, Dec 12, 2019 at 8:46 AM Antoine Pitrou <solipsis@pitrou.net> wrote:
1) Handling a large number of concurrent clients on a continuously running web server that receives a significant amount of traffic.
Not sure how that works? Each client has an accepted socket, which is bound to a local port number, and there are 65536 TCP port numbers available. Unless you're using 15+ coroutines per client, you probably won't reach 1M coroutines that way.
Each client has a socket, yes, but they all use the same local port number. Distinct sockets are identified by the tuple (TCP, LocalAddr, LocalPort, RemoteAddr, RemotePort) and can quite happily duplicate on any part of that as long as they can be distinguished by some other part.

There will be other limits, though. On Linux (and probably most Unix-like systems), every socket requires a file descriptor, and you're limited to a few hundred thousand of those, I think. On Windows, sockets aren't the same things as files, so I don't know what the limit actually is, but there'll be one somewhere.

ChrisA
On Wed, Dec 11, 2019 at 11:52 PM Antoine Pitrou <solipsis@pitrou.net> wrote:
On Mon, 9 Dec 2019 21:42:36 -0500 Kyle Stanley <aeros167@gmail.com> wrote:
There's also a practical use case for having a large number of coroutine objects, such as for asynchronously:
1) Handling a large number of concurrent clients on a continuously running web server that receives a significant amount of traffic.
Not sure how that works? Each client has an accepted socket, which is bound to a local port number, and there are 65536 TCP port numbers available. Unless you're using 15+ coroutines per client, you probably won't reach 1M coroutines that way.
I'm sorry, but the accepted socket has the same local port number as the listening one. Routing is performed by the (local_ip, local_port, remote_ip, remote_port) quad.

The listening socket can accept hundreds of thousands of concurrent client connections. The only thing that should be tuned for this is increasing the limit of file descriptors.
-- Thanks, Andrew Svetlov
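A minimal way to see this for yourself (a sketch assuming a local loopback test, with the port chosen by the OS): every accepted socket reports the same local port as the listener, and only the remote endpoint differs.

```
import socket

listener = socket.create_server(("127.0.0.1", 0))   # Python 3.8+
host, port = listener.getsockname()

clients = [socket.create_connection((host, port)) for _ in range(3)]
for _ in clients:
    conn, peer = listener.accept()
    # The local (host, port) is identical every time; only `peer` differs,
    # so the (local_ip, local_port, remote_ip, remote_port) quad stays unique.
    print("local:", conn.getsockname(), "remote:", peer)
    conn.close()

for c in clients:
    c.close()
listener.close()
```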
Andrew Svetlov wrote:
Not sure how that works? Each client has an accepted socket, which is bound to a local port number, and there are 65536 TCP port numbers available. Unless you're using 15+ coroutines per client, you probably won't reach 1M coroutines that way.
I'm sorry, but the accepted socket has the same local port number as the listening one. Routing is performed by (local_ip, local_port, remote_ip, remote_port) quad.
IIRC, that combination is typically referred to as the "port address" (IP addr + port num source, IP addr + port num destination). All four are required in TCP, and in UDP, the IP and source port are optional. So in UDP, this could potentially just be (remote_ip, remote_port).

Also, it's possible to bind multiple AF_INET (or AF_INET6) sockets to a single port address by using the SO_REUSEPORT socket option, which we discussed recently in bpo-37228 (https://bugs.python.org/issue37228). The only requirement is that the same UID is used for each socket bound to the same port address (from my understanding, SO_REUSEPORT is typically used for binding a single process to multiple listening sockets).

TL;DR: It's definitely possible to have more than one client per TCP port.

I'll admit that I've never personally seen production code that uses anywhere near 1M coroutine objects, but we shouldn't limit users from doing that without a good reason. At the present moment, it's rather trivial to create 1M coroutine objects on any system with ~1.3GB+ of available main memory (see my code example in https://mail.python.org/archives/list/python-dev@python.org/message/WYZHKRGN... ).

There's also the infamous "10M" problem, of accepting 10 million concurrent clients without significant performance issues. This is mostly theoretical at the moment, but there's an article that explains how the problem could be addressed by using 10M goroutines: https://goroutines.com/10m. I see no reason why this couldn't potentially be translated into Python's coroutines, with the usage of an optimized event loop policy such as uvloop.

But, either way, Mark Shannon removed the 1M coroutine limit from PEP 611, due to it having the least strong argument out of all of the proposed limits and a significant amount of push-back from the dev community.

Andrew Svetlov wrote:
The listening socket can accept hundreds of thousands of concurrent client connections. The only thing that should be tuned for this is increasing the limit of file descriptors.
The default soft limit of file descriptors per process on Linux is 1024 (which can be increased), but you could exceed a per-process limitation of file descriptors by forking child processes.

I have no idea what the realistic maximum limit of global FDs would be for most modern servers, but here's the upper bound on Linux kernel 5.3.13:

[aeros:~]$ cat /proc/sys/fs/file-max
9223372036854775807

My system's current hard limit of file descriptors is much lower, but is still fairly substantial:

[aeros:~]$ ulimit -nH
524288

I recall reading somewhere that every additional 100 file descriptors require approximately 1MB of main memory. Based on that estimate, 1M FDs would require ~10GB+. This doesn't seem unreasonable to me, especially on a modern server. I'd imagine the actual memory usage depends upon how much data is being buffered at once through the pipes associated with each FD, but I believe this can be limited through the FD option F_SETPIPE_SZ (https://linux.die.net/man/2/fcntl).

Note: I was unable to find a credible source on the minimum memory usage per additional FD, so clarification on that would be appreciated.
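For reference, the same per-process limit quoted above can also be read (and raised up to the hard limit) from Python via the resource module; a minimal, Unix-only sketch:

```
import resource

# Read the per-process file descriptor limits (RLIMIT_NOFILE).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"soft={soft} hard={hard}")   # e.g. soft=1024 hard=524288

# An unprivileged process may raise its own soft limit up to the hard limit.
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
print(resource.getrlimit(resource.RLIMIT_NOFILE))
```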
On Wed, 11 Dec 2019 23:17:48 -0500 Kyle Stanley <aeros167@gmail.com> wrote:
TL;DR: It's definitely possible to have more than one client per TCP port.
Thanks for correcting me. Not sure why, but I appear to make that mistake once every couple years.
I'm have no idea what the realistic maximum limit of global FDs would be for most modern servers though, but here's the upper bound limit on Linux kernel 5.3.13:
[aeros:~]$ cat /proc/sys/fs/file-max 9223372036854775807
Looks like 2**63 - 1 to me :-)
I recall reading somewhere that per additional 100 file descriptors, it requires approximately 1MB of main memory.
More than file descriptors per se, what's relevant here is the per-TCP connection overhead (unless you're interested in keeping closed TCP sockets around?). Which I guess is related to the latency*bandwidth product.

Regards

Antoine.
Antoine Pitrou wrote:
1M concurrent database transactions? Does that sound reasonable at all? Your database administrator probably won't like you.
I agree that 1M concurrent transactions would not be reasonable for the vast majority of database configurations; I didn't mean to specifically imply that 1M would be something reasonable today. That's why I said "a large number of concurrent transactions" instead of specifically saying "1M concurrent transactions". I honestly don't know if any databases are close to capable of handling that many transactions at once, at least not at the present time. But, although we may think it's ridiculous today, who's to say that won't be an occurrence in the future?

Imagine a scenario where there was a massive super-cluster of database servers that performed real-time update transactions every time a single item was checked in and out of some global inventory system. Currently, something like this would likely have to be implemented through batching, where x number of updates have to be queued or x amount of time has to pass before the next transaction is started. Alternatively, each facility or region would have its own local database that synchronizes with the global database every so often. But there could be a significant advantage in having a near-perfectly synchronized global inventory system, which would only be possible if it was updated in real time. IMO, the max number of concurrent transactions that a database system can handle at once is a very clear application of Moore's Law.

My point is that I don't want to arbitrarily restrict how many coroutine objects can exist at once without having a strong reason for doing so AND having a limit that's reasonable in the long term. 1M would have the advantage of being easy to remember, but other than that I see no reason why that should specifically be the limit for the max number of coroutines. As Guido mentioned at the start of the thread, a coroutine object is "just another Python object and has no operating resources associated with it". Other than main memory usage, there's no external limit to the max number of coroutine objects that can exist at once.

Note: Although coroutines were already dropped from PEP 611, I felt that this response was still worthwhile to write. I suspect that the topic of "coroutine object limits" is likely to come up again in the future.
On Thu, 12 Dec 2019 00:56:41 -0500 Kyle Stanley <aeros167@gmail.com> wrote:
IMO, the max number of concurrent transactions that the a database system can handle at once is a very clear application of Moore's Law.
I'm not quite sure that's the case. I think in reality Moore's Law has also helped databases become much larger and more complex, so it's not clear-cut. But in any case, Moore's Law is slowly dying, so I'm not sure that's a good argument for envisioning databases supporting 1M+ concurrent transactions in 10 years. Of course, never say never :-)
Note: Although coroutines were already dropped from PEP 611, I felt that this response was still worthwhile to write. I suspect that the topic of "coroutine object limits" is likely to come up again in the future.
Right.

Best regards

Antoine.
On Thu, Dec 12, 2019 at 7:05 PM Kyle Stanley <aeros167@gmail.com> wrote:
Antoine Pitrou wrote:
1M concurrent database transactions? Does that sound reasonable at all? Your database administrator probably won't like you.
I agree that 1M concurrent transactions would not be reasonable for the vast majority of database configurations, I didn't mean to specifically imply that 1M would be something reasonable today. That's why I said "a large number of concurrent transactions" instead of specifically saying "1M concurrent transactions". I honestly don't know if any databases are close to capable of handling that many transactions at once, at least not at the present time.
I have personally worked on a database back end that was processing ~800k transactions per second, about 10 years ago. It was highly-specialised, ran on Big Iron, and was implemented in C/C++ (mostly), FORTRAN, and IIRC a smidgin of assembler. So those numbers may not be currently realistic for Python, or for general purpose RDBMSs, but they are not completely ridiculous. Cheers, Duane. -- "I never could learn to drink that blood and call it wine" - Bob Dylan
On 2019-12-11 22:45, Antoine Pitrou wrote:
On Mon, 9 Dec 2019 21:42:36 -0500 Kyle Stanley <aeros167@gmail.com> wrote:
(b) Why limit coroutines? It's just another Python object and has no operating resources associated with it. Perhaps your definition of coroutine is different, and you are thinking of OS threads?
This was my primary concern with the proposed PEP. At the moment, it's rather trivial to create one million coroutines, and the total memory taken up by each individual coroutine object is very minimal compared to each OS thread.
There's also a practical use case for having a large number of coroutine objects, such as for asynchronously: [...] 2) Sending a large number of concurrent database transactions to run on a cluster of database servers.
1M concurrent database transactions? Does that sound reasonable at all? Your database administrator probably won't like you.
Right. Instead, you use a pool of DB connections, making each coroutine await its turn to talk to the DB. You can still have a million coroutines waiting. (Or as Guido put it, a coroutine is "just another Python object and has no operating resources associated with it". I still sense some confusion around what was meant there.)
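A minimal sketch of that pattern (illustrative only; the pool size and the do_query stand-in are made up): a bounded semaphore plays the role of the connection pool, while an arbitrary number of coroutines wait their turn cheaply.

```
import asyncio

async def do_query(pool, i):
    async with pool:            # wait for a free "connection" in the pool
        await asyncio.sleep(0)  # stand-in for real database I/O
        return i

async def main():
    pool = asyncio.Semaphore(20)   # made-up pool size: 20 concurrent DB connections
    # 100_000 coroutines exist at once, but only 20 talk to the "database"
    # at any moment; the rest just wait, and waiting is cheap.
    results = await asyncio.gather(*(do_query(pool, i) for i in range(100_000)))
    print(len(results))

asyncio.run(main())
```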
participants (10)

- Andrew Svetlov
- Antoine Pitrou
- Chris Angelico
- Duane Griffin
- Guido van Rossum
- Jim Baker
- Khazhismel Kumykov
- Kyle Stanley
- Petr Viktorin
- Steven D'Aprano