[Python-Dev] A more flexible task creation

Tin Tvrtković tinchester at gmail.com
Thu Jun 14 18:21:41 EDT 2018

Other folks have already chimed in, so I'll get to the point. Try writing a
simple asyncio web scraper (using maybe the aiohttp library) and create
5000 tasks for scraping different sites. My prediction is that a whole lot
of them will time out for various reasons.

Other responses inline.

On Thu, Jun 14, 2018 at 9:15 PM Chris Barker <chris.barker at noaa.gov> wrote:

> async is not parallel -- all the tasks will be run in the same thread
> (Unless you explicitly spawn another thread), and only one task is running
> at once, and the task switching happens when the task specifically releases
> itself.

asyncio is mostly used for IO-heavy workloads (note the name). If you're
doing IO in asyncio, the IO is most definitely parallel: many requests are
in flight at once, even though only one task runs Python code at a time.
The point of it is having a large number of open network connections at
the same time.
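
A minimal sketch of what I mean, using only the standard library. Here
asyncio.sleep stands in for waiting on a network response, and the names
fake_request and run_all are made up for illustration. The waits overlap,
so the total time should be close to one delay rather than n delays.

```python
import asyncio
import time


async def fake_request(i, delay=0.1):
    # asyncio.sleep is a stand-in for waiting on a real network response.
    await asyncio.sleep(delay)
    return i


async def run_all(n=100, delay=0.1):
    # Start n "requests" at once and wait for all of them to finish.
    start = time.perf_counter()
    results = await asyncio.gather(*(fake_request(i, delay) for i in range(n)))
    elapsed = time.perf_counter() - start
    return results, elapsed


if __name__ == "__main__":
    results, elapsed = asyncio.run(run_all())
    print(f"{len(results)} requests finished in {elapsed:.2f}s")
```

Everything runs in a single thread, but the waiting happens for all the
requests at the same time.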

> So why do queries fail with 10000 tasks? or ANY number? If the async DB
> access code is written right, a given query should not "await" unless it is
> in a safe state to do so.

Imagine you have a batch job you need to do. You need to fetch a million
records from your database, and you can't use a query to get them all - you
need a million individual "get" requests. Even if Python were infinitely
fast, and your bandwidth were infinite, can your database handle opening a
million new connections in parallel, in a very short time? Mine sure can't;
even a few hundred extra connections would be a potential problem. So you
want to do the work in chunks, but still not one by one.
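
One way to sketch "in chunks, but not one by one" is a fixed number of
worker coroutines pulling keys from a shared iterator, so at most `limit`
requests are in flight and you never create a million tasks. This is only
an illustration: fetch_record is a hypothetical stand-in for a single
database "get", and the stats dict exists just to show the bound holds.

```python
import asyncio


async def fetch_record(key, stats):
    # Hypothetical stand-in for one database "get"; tracks concurrency.
    stats["in_flight"] += 1
    stats["peak"] = max(stats["peak"], stats["in_flight"])
    await asyncio.sleep(0.001)  # pretend to wait on the database
    stats["in_flight"] -= 1
    return key * 2


async def fetch_all(keys, limit=50):
    # All workers share one iterator, so each key is handed out exactly once.
    pending = iter(keys)
    results = {}
    stats = {"in_flight": 0, "peak": 0}

    async def worker():
        # next() runs synchronously between awaits, so sharing is safe here.
        for key in pending:
            results[key] = await fetch_record(key, stats)

    # Only `limit` coroutines exist, no matter how many keys there are.
    await asyncio.gather(*(worker() for _ in range(limit)))
    return results, stats["peak"]


if __name__ == "__main__":
    results, peak = asyncio.run(fetch_all(range(10_000), limit=50))
    print(f"fetched {len(results)} records, peak concurrency {peak}")
```

An asyncio.Semaphore around each request would also cap the number of
open connections, but it would still create one task per key up front.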

> and threads aren't synchronous -- but they are concurrent.

Using threads implies coupling threads with IO: each thread does its
requests one at a time. This is generally called 'synchronous IO', as
opposed to asynchronous IO/asyncio.

>  because threads ARE concurrent, and there is no advantage to having more
> threads than can actually run at once, and having many more does cause
> thread-switching performance issues.

Weeell, technically threads in CPython don't really run in parallel (when
running Python bytecode), but for doing IO they do in practice. When doing
IO, there absolutely is an advantage to using more threads than can run at
once (in CPython only one thread running Python can run at once). You can
test it out yourself by writing a synchronous web scraper (using maybe the
requests library) and trying to scrape using a threadpool vs using a single
thread. You'll find the threadpool version is much faster.
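
If you don't want to hit real sites, here is a rough sketch of the same
experiment with time.sleep standing in for a blocking HTTP request (like a
blocking socket read, it releases the GIL while waiting). The names
blocking_fetch, run_serial and run_pool are made up for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def blocking_fetch(i, delay=0.05):
    # time.sleep is a stand-in for a blocking request; it releases the GIL.
    time.sleep(delay)
    return i


def run_serial(n):
    # One request after another in a single thread.
    return [blocking_fetch(i) for i in range(n)]


def run_pool(n, workers=20):
    # Same requests, spread over a pool of worker threads.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(blocking_fetch, range(n)))


def timed(fn, *args):
    # Return the function's result together with its wall-clock time.
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    _, serial_time = timed(run_serial, 40)
    _, pool_time = timed(run_pool, 40)
    print(f"single thread: {serial_time:.2f}s, threadpool: {pool_time:.2f}s")
```

With real network requests the numbers will be noisier, but the threadpool
should still come out well ahead.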