[Catalog-sig] PyPI operational changes

Wed Dec 7 09:29:59 CET 2011

Over the last few weeks, I made a number of changes to the PyPI
installation, namely
- replace Apache with nginx
- replace FastCGI with uwsgi
- full vacuum of postgres, and activation of autovacuum
- introduce a separate uwsgi logging daemon

Together, these changes seem to have a positive effect on the
stability of PyPI.

Peak load average is down from 500 to about 10:

http://pypi.python.org/munin/localdomain/localhost.localdomain/load.html

I believe this is mainly due to switching from Apache to nginx.
Apache would spawn hundreds of worker threads in an overload situation,
which made things worse, not better.

Memory consumption is down. Application memory would fluctuate up
to 3.5G, and is now at 750M. Committed memory would increase up to
20G, and is now below 2G. Swap usage would reach 3G, and swap is now
practically unused (7M).

http://pypi.python.org/munin/localdomain/localhost.localdomain/memory.html

Peak usage is again probably reduced due to the change in process
model between Apache and nginx; in addition, the rejuvenation
feature of uwsgi (replacing a worker process after 1000 requests)
prevents Python processes from accumulating too much unused memory.

Postgres response time is improved. There had been occasional
transactions taking 1700s, and occasional queries taking 870s.
This is now down to 45s/30s for the last day:

http://pypi.python.org/munin/localdomain/localhost.localdomain/postgres_querylength_ALL.html

There are two factors that likely caused this reduction. On the
one hand, the postgres database had not been vacuumed:

http://pypi.python.org/munin/localdomain/localhost.localdomain/postgres_size_ALL.html

The reason for the failure to autovacuum was probably that the
installation was successively upgraded from a 7.x release, which
didn't do autovacuum, and that Debian at some point dropped the
cron job that did the manual vacuum. Tables and indices now fit
better into the address space, improving performance.

Performing the full vacuum caused an outage of about 20 min
two weeks ago.

In addition, I set the uwsgi harakiri timeout to 60s, causing
any query taking longer to be aborted. I believe such queries
still occasionally happen; it's not clear to me what HTTP
requests are triggering such long-running transactions.
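
One way to narrow down which requests are slow would be a small
WSGI middleware that logs any request exceeding a threshold below
the harakiri timeout. The following is only a sketch in modern
Python, not something that is deployed: the class name, the logger
name, and the 10s default are all hypothetical. It measures only
the time until the application returns its response iterable, and
a request killed by harakiri would never reach the logging call.

```python
import logging
import time

# Hypothetical logger name; configure handlers for it as needed.
log = logging.getLogger("slow-requests")


class SlowRequestLogger:
    """WSGI middleware logging requests that take longer than a threshold."""

    def __init__(self, app, threshold=10.0, clock=time.monotonic):
        self.app = app
        self.threshold = threshold  # seconds; assumed value, below harakiri
        self.clock = clock          # injectable so it can be tested

    def __call__(self, environ, start_response):
        start = self.clock()
        try:
            # Only the call into the application is timed, not the
            # later iteration over the returned body.
            return self.app(environ, start_response)
        finally:
            elapsed = self.clock() - start
            if elapsed >= self.threshold:
                log.warning(
                    "slow request: %s %s?%s took %.1fs",
                    environ.get("REQUEST_METHOD", "-"),
                    environ.get("PATH_INFO", "-"),
                    environ.get("QUERY_STRING", ""),
                    elapsed,
                )
```

Wrapping the application object as SlowRequestLogger(application)
would be enough to enable it.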

While I'm mostly happy with the current setup, one issue is
that uwsgi doesn't support proper log rotation; in particular,
it is unwilling to close-then-reopen the log files. Debian
tries to use the copytruncate approach of logrotate, but that
apparently didn't work too well (log space would constantly
increase). I have now written a UDP server which supports proper
log rotation and configured uwsgi to send log records to that
UDP port.
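
For illustration, a UDP log server of this kind could look roughly
like the sketch below. It is not the server actually running on
PyPI; the function and class names are hypothetical, and it relies
on the standard library's WatchedFileHandler, which reopens the log
file once logrotate has moved it away (this works on Unix only).

```python
import logging
import logging.handlers
import socketserver


def make_logger(path):
    """Return a logger writing raw lines to path, surviving rotation."""
    logger = logging.getLogger("uwsgi-udp")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    logger.propagate = False
    # WatchedFileHandler stats the file before each write and reopens
    # it if it was renamed or removed, so no copytruncate is needed.
    handler = logging.handlers.WatchedFileHandler(path)
    handler.setFormatter(logging.Formatter("%(message)s"))
    logger.addHandler(handler)
    return logger


class LogDatagramHandler(socketserver.BaseRequestHandler):
    """Write each received datagram as one log line."""

    def handle(self):
        data, _sock = self.request  # UDP requests are (bytes, socket)
        line = data.decode("utf-8", errors="replace").rstrip("\n")
        self.server.logger.info(line)


def make_server(host, port, log_path):
    """Create (but do not start) the UDP log server."""
    server = socketserver.UDPServer((host, port), LogDatagramHandler)
    server.logger = make_logger(log_path)
    return server
```

Calling serve_forever() on the object returned by make_server()
starts the server; the uwsgi side of the configuration is not
shown here.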

Regards,
Martin