improving concurrency, reliability of devpi-server
Hi Florian, all,

there are at least three issues that somewhat interrelate and share the common topic of service reliability, concurrency, and interactions with remote pypi.python.org or devpi masters:

https://bitbucket.org/hpk42/devpi/issues/267/intermittent-assertionerror-in-...
multiple devpi-server processes write to the same (networked, shared) file system, resulting in failed transaction handling. devpi-server was not designed for this.

https://bitbucket.org/hpk42/devpi/issues/274/recurring-consistency-issues-wi...
under high load, database/transaction handling issues arise (although it is unclear what the precise scenario is and how to replicate it).

https://bitbucket.org/hpk42/devpi/issues/208/pip-gets-timeout-on-large-packa...
trying to install an uncached package that originates from pypi.python.org can fail if devpi-server cannot download the package fast enough.

Starting with the last issue, we probably need to re-introduce a way to stream remote files instead of first retrieving them in full and only then starting a client response. This should take into account that two threads (or even two processes) may try to retrieve the same file. In other words, we start a response as soon as we get an http return code and then forward-stream the content.

The first two issues could be mitigated by introducing a better read/write transaction separation. Background: GET-ting simple pages or release files can cause write transactions in a devpi-server process because we may need to retrieve and cache information from pypi.python.org or a devpi-server master. Currently, during the processing of the GET request, we at some point promote a READ transaction into a WRITE transaction through a call to keyfs.restart_as_write_transaction() and persist what we have. This all happens before the response is returned to the client. "Restarting as write" is somewhat brittle because something might have changed since we started our long-running request.

Strategy and notes on how to mitigate all three issues:

- release files: cache and stream chunks of what we receive remotely, all within a READ transaction and all within RAM. Ideally this is done in such a way that if multiple threads stream the same file, only one remote http request is made to fetch it; otherwise we end up retrieving large files multiple times unnecessarily. After the http response to the client is complete, we (try to) write the file to sqlite/the filesystem so that subsequent requests can be served from local storage. Here we need to be careful and consider that there might be multiple writers/streamers: if we discover that someone else has already written where we wanted to write, we can simply forget about our copy. (See the first sketch after this list.)

- simple pages: first retrieve the remote simple page into RAM, process it, serve the full pyramid response, and then (try to) cache it after the response is completed. Here we probably don't need to care whether multiple threads retrieve the same simple page concurrently, because simple pages are not big.

- we cache things in RAM because even for large files that shouldn't matter, given that servers typically have multiple gigabytes of RAM. It also lets us avoid synchronization issues with the file system (see the first issue, where multiple processes write to the file system).

- we always finish the response to the client before we attempt a write transaction. The write-transaction part should be implemented in a separate function, to make it clear what state we can rely on and what we must re-check. (Currently we do the READ->WRITE switch in the middle of a view function; see the second sketch after this list.)

- we also need to review how exactly we open the sqlite DB for writing and whether multiple processes correctly serialize their write attempts, particularly in the multi-process case.

- care must be taken with the waitress and nginx configuration and their buffering, see for example: http://www.4byte.cn/question/68410/pyramid-stream-response-body.html
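To make the release-files point concrete, here is a minimal sketch of the "one remote fetch, many streamers" idea. All names (ChunkBuffer, fetch_or_attach, stream_release_file) are hypothetical and not actual devpi-server code, requests is used just for illustration, and error handling, status-code checks and cache eviction are omitted:

    import threading
    import requests

    class ChunkBuffer:
        """Holds the chunks of one in-flight download in RAM."""
        def __init__(self):
            self.chunks = []       # chunks received so far
            self.done = False      # set once the remote response is exhausted
            self.cond = threading.Condition()

        def add(self, chunk):
            with self.cond:
                self.chunks.append(chunk)
                self.cond.notify_all()

        def finish(self):
            with self.cond:
                self.done = True
                self.cond.notify_all()

        def iter_chunks(self):
            """Yield chunks as they arrive; each client response gets its own iterator."""
            i = 0
            while True:
                with self.cond:
                    while i >= len(self.chunks) and not self.done:
                        self.cond.wait()
                    if i >= len(self.chunks):
                        return     # download finished and fully replayed
                    chunk = self.chunks[i]
                i += 1
                yield chunk

    _active = {}                   # url -> ChunkBuffer
    _active_lock = threading.Lock()

    def fetch_or_attach(url):
        """Only the first caller per url becomes the fetcher."""
        with _active_lock:
            buf = _active.get(url)
            if buf is not None:
                return buf, False
            buf = _active[url] = ChunkBuffer()
            return buf, True

    def stream_release_file(url):
        buf, is_fetcher = fetch_or_attach(url)
        if is_fetcher:
            def fetch():
                r = requests.get(url, stream=True)
                for chunk in r.iter_content(chunk_size=65536):
                    buf.add(chunk)
                buf.finish()
                # a separate write transaction would now (try to) persist
                # b"".join(buf.chunks) to sqlite/filesystem and drop the url
                # from _active; if someone else already wrote the file,
                # we simply forget our copy
            threading.Thread(target=fetch, daemon=True).start()
        return buf.iter_chunks()   # hand this to the WSGI response body

And a sketch of the "write only after the response is finished" rule, phrased as a response-body iterator that runs the write transaction once the last chunk has been handed to the WSGI server. request.response/app_iter is plain pyramid; fetch_and_render_remote_page(), persist_simple_page() and keyfs.write_transaction() are placeholder names, not the actual devpi-server API:

    def serve_simple_page(request, keyfs):
        # READ-only part: fetch the remote page and render it in RAM
        html = fetch_and_render_remote_page(request)      # placeholder

        def body_iter():
            yield html.encode("utf-8")
            # the body is out the door; now run the WRITE part in its own
            # function, which must re-check everything it relies on
            try:
                with keyfs.write_transaction():           # hypothetical helper
                    persist_simple_page(request, html)    # placeholder
            except Exception:
                pass  # e.g. another thread cached it first -- forget our copy

        request.response.app_iter = body_iter()
        return request.response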
any feedback or thoughts welcome.

holger

-- about me: http://holgerkrekel.net/about-me/ contracting: http://merlinux.eu
Hi Holger,

what's your impression of the additional code complexity that this will introduce?

We are currently facing another set of concurrency and performance problems in devpi. We easily have 300 package versions per +simple page of a package. A single request takes 0.2 up to 1 second, and with multiple concurrent read requests (~10) the latency goes up significantly. Still, this problem is manageable and we are working on a few performance patches to improve the situation. However, I fear that the large rework proposed here might make the code more complex and thus more difficult to tune.

Regards,
Stephan
Hi Stephan,

On Fri, Oct 30, 2015 at 10:06 +0000, Erb, Stephan wrote:
Hi Holger,
what's your impression of the additional code complexity that this will introduce?
There is some increased code complexity, but I think we should be able to contain it in separately testable classes/functions.
We are currently facing another set of concurrency and performance problems in devpi. We easily have 300 package versions per +simple page of a package. A single request takes 0.2 up to 1 second, and with multiple concurrent read requests (~10) the latency goes up significantly.
Do you have profiling data for the 300-versions-per-simple-page scenario? Is most of the time spent in get_releaselinks?
Still, this problem is manageable and we are working on a few performance patches to improve the situation. However, I fear that the large rework proposed here might make the code more complex and thus more difficult to tune.
I don't suspect the two efforts clash much. What did you do so far?

That said, we are currently caching at the "list of release file links" level, and I think it's worthwhile to check whether we should rather cache at the simple-page layer. Apart from performance improvements, it also has the potential to simplify the code if we manage to cache only at the simple-page level rather than in addition to the release-links caching.

best,
holger
-- about me: http://holgerkrekel.net/about-me/ contracting: http://merlinux.eu
Hi Holger,

in order not to de-rail this discussion any further, I have performed a brain dump in a separate ticket:

https://bitbucket.org/hpk42/devpi/issues/280/devpi-performance-issues

Best Regards,
Stephan
Hi Stephan,

On Fri, Oct 30, 2015 at 14:10 +0000, Erb, Stephan wrote:
Hi Holger,
in order not to de-rail this discussion any further, I have performed a brain dump in a separate ticket:
https://bitbucket.org/hpk42/devpi/issues/280/devpi-performance-issues
Thanks. FWIW I am wondering if we could avoid "copy_if_mutable" altogether. We'd need a recursive dict proxy which does what "copy_if_mutable" does, but lazily, e.g.:

    d = {"a": [1, 2, 3], "b": set()}
    d2 = make_recursive_readonly_proxy(d)

    d2["b"] = 3         # would give a readonly error
    d2["a"].append(4)   # would give a readonly error
    "x" in d2["b"]      # reading works fine (False here)
    ...

Is anybody aware of such a proxy? I found

https://pypi.python.org/pypi/dictproxyhack

but it only offers a non-recursive readonly dict interface, so the above readonly errors would not occur. It's not too hard to do an implementation which suffices for devpi-server purposes, but if there is a ready-made, solid solution we could use it.

FWIW I have also been thinking of using "pyrsistent", a well-thought-out library for working with immutable data structures:

http://pyrsistent.readthedocs.org/

It would help avoid some programming errors and accidental modifications. The basic idea is that any modifying operation returns a new reference:

    >>> map1 = pyrsistent.m(a=3, b=4)
    >>> map2 = map1.set("x", 5)
    >>> map1
    pmap({'a': 3, 'b': 4})
    >>> map2
    pmap({'a': 3, 'b': 4, 'x': 5})

There is no way to modify the map1 reference.

best,
holger
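For illustration, a rough sketch of what such a proxy could look like for devpi-server purposes. Only dicts, lists/tuples and sets are handled, mutating methods raise, and reads recursively wrap what they return; this is a sketch of the idea, not a ready-made solution (a real version would need the full set of dunder methods per type):

    class ReadonlyError(TypeError):
        pass

    def make_recursive_readonly_proxy(value):
        if isinstance(value, dict):
            return ReadonlyDict(value)
        if isinstance(value, (list, tuple)):
            return ReadonlySeq(value)
        if isinstance(value, (set, frozenset)):
            return ReadonlySet(value)
        return value              # ints, strings, ... are already immutable

    class ReadonlyDict:
        def __init__(self, d):
            self._d = d
        def __getitem__(self, key):
            return make_recursive_readonly_proxy(self._d[key])
        def __setitem__(self, key, value):
            raise ReadonlyError("this view is read-only")
        def __delitem__(self, key):
            raise ReadonlyError("this view is read-only")
        def __contains__(self, key):
            return key in self._d
        def __len__(self):
            return len(self._d)
        def __iter__(self):
            return iter(self._d)

    class ReadonlySeq:
        def __init__(self, seq):
            self._seq = seq
        def __getitem__(self, i):
            return make_recursive_readonly_proxy(self._seq[i])
        def __len__(self):
            return len(self._seq)
        def __iter__(self):
            return (make_recursive_readonly_proxy(x) for x in self._seq)
        def append(self, value):
            raise ReadonlyError("this view is read-only")

    class ReadonlySet:
        def __init__(self, s):
            self._s = s
        def __contains__(self, x):
            return x in self._s
        def __len__(self):
            return len(self._s)
        def __iter__(self):
            return iter(self._s)
        def add(self, x):
            raise ReadonlyError("this view is read-only")

With that, the example above behaves as described:

    d = {"a": [1, 2, 3], "b": set()}
    d2 = make_recursive_readonly_proxy(d)
    d2["a"].append(4)     # raises ReadonlyError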
-- about me: http://holgerkrekel.net/about-me/ contracting: http://merlinux.eu
Hi Holger,

I like the idea of getting rid of copy_if_mutable in one way or another.

Pyrsistent looks very promising. However, I am not sure it is thread-safe (it doesn't look like it), so we would have to be careful here.

Best Regards,
Stephan
Hi Stephan,

so with the recent PRs we do get rid of "copy_if_mutable", and thanks mostly to your PRs simple-page serving is twice as fast as before; on my machine it's 170 requests per second. I just hacked up a simple-project serving cache (without any invalidation, which is the hard part) which gets us to ~550 requests per second. I think we could get even faster if we bypassed all the transaction machinery which is implicitly used for each request. But first steps first.

Regarding cache invalidation, I think this is the simplest approach (a sketch of the bookkeeping follows the list):

- maintain a per-index LRU cache (we already have a utility class for that in keyfs) which maps project names to simple pages, and use/fill it from the simple-page serving as it is now.

- if an index configuration changes (a rare event), kill the caches of that index and all inheriting indexes.

- if a project name changes, kill that index's and all inheriting indexes' cache entries for this project name.

- at startup time, build a RAM data structure which tells us, for each index, about all dependent indexes (currently we only have the bases of an index). This data structure needs to be updated when the "bases" property of an index changes. It also tells us whether an index ultimately uses a mirroring index, currently only root/pypi. This data structure can and should be fully unit-tested without invoking any devpi machinery. It also needs to be thread-safe.

- part of the "do we have a cache hit" check is to see whether we depend on a mirroring index and whether its timeout has been reached. If so, we kill the cache and thus let the normal current logic run.

- the data structure which maps index names to per-index LRU cache instances can live on the XOM object.
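A rough sketch of that bookkeeping (IndexCaches and LRUCache are illustrative names; in devpi-server the existing LRU utility in keyfs would be used instead, and the locking needed for thread safety as well as the mirror-timeout check are elided):

    from collections import OrderedDict

    class LRUCache:
        """Tiny stand-in for the keyfs LRU utility."""
        def __init__(self, maxsize=1000):
            self.maxsize = maxsize
            self._data = OrderedDict()
        def get(self, key):
            if key in self._data:
                self._data.move_to_end(key)
                return self._data[key]
            return None
        def put(self, key, value):
            self._data[key] = value
            self._data.move_to_end(key)
            if len(self._data) > self.maxsize:
                self._data.popitem(last=False)   # evict least recently used
        def discard(self, key):
            self._data.pop(key, None)
        def clear(self):
            self._data.clear()

    class IndexCaches:
        def __init__(self, bases):
            # bases: index name -> list of its base index names
            self.caches = {}                     # index -> LRUCache(project -> page)
            # reverse map: index -> all indexes that (transitively) inherit from it
            self.inheritors = {ix: set() for ix in bases}
            for ix, ixbases in bases.items():
                stack = list(ixbases)
                while stack:
                    base = stack.pop()
                    self.inheritors[base].add(ix)
                    stack.extend(bases[base])

        def _cache(self, index):
            return self.caches.setdefault(index, LRUCache())

        def get_page(self, index, project):
            return self._cache(index).get(project)

        def set_page(self, index, project, page):
            self._cache(index).put(project, page)

        def kill_index(self, index):
            # index config changed: kill this index and all inheriting ones
            self._cache(index).clear()
            for inheritor in self.inheritors[index]:
                self._cache(inheritor).clear()

        def kill_project(self, index, project):
            # a project changed: kill only the affected entries
            self._cache(index).discard(project)
            for inheritor in self.inheritors[index]:
                self._cache(inheritor).discard(project)

For example, with bases = {"root/pypi": [], "company/base": ["root/pypi"], "company/dev": ["company/base"]}, a kill_project("root/pypi", "foo") also drops the cached "foo" pages of company/base and company/dev.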
Any comments or further thoughts on this? Otherwise we could put this into an issue for anyone who wants to tackle it (you? :)

holger
-- about me: http://holgerkrekel.net/about-me/ contracting: http://merlinux.eu
participants (2)
- Erb, Stephan
- holger krekel