[Distutils] [Python-ideas] PyPI search still broken

Donald Stufft donald at stufft.io
Thu Sep 10 15:31:13 CEST 2015

On September 10, 2015 at 8:48:05 AM, David Wilson (dw+python-ideas at hmmz.org) wrote:
> On Thu, Sep 10, 2015 at 03:07:14PM +0300, Ionel Cristian Mărieș wrote:
> > Wouldn't it be better if you'd just build an external search service?
> > Getting a list of packages and descriptions should be possible no?
> > (just asking, not 100% sure)
> That would be the idea. In fact preferably not build a service at all,
> just pay someone $50/mo for hosted ElasticSearch, rip out the guts of
> the old thing and write a small sync cron job similar to the one
> existing in the Bitbucket repo I linked.

The old PostgreSQL-based system has been gone for a while, and we already have Elasticsearch with a small cron job that runs every 3 hours to index the data.

When we moved the database to Heroku, this cron job started taking 6+ hours to
complete because we were fetching data in chunks that were too small. That
didn't actually hurt while the script and the database were running close to
each other. It got "fixed" a day or two ago by increasing the size of the
chunks we pull from 1000 to 10000 and by switching to a
SERIALIZABLE READ ONLY DEFERRABLE transaction, so that we only need to hold
a lock right at the very beginning. The job now finishes in 40 minutes. I
suspect further improvements to the indexing speed will require locating the
script in EC2 to get it closer to the PostgreSQL instance.
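
For anyone curious what the chunk size change amounts to, here is a rough
sketch. It is not the actual indexing script. The table name, the helper
names, and the use of sqlite3 as a stand-in database are all assumptions made
so the example runs anywhere. Against PostgreSQL, each chunk fetched through a
server-side cursor costs a network round trip, which is why small chunks hurt
once the database moved further away. The transaction statement is only shown
as a string, since sqlite3 does not understand it.

```python
import sqlite3
from typing import Any, Iterator, List

# PostgreSQL statement for the kind of transaction described above.
# It is not executed here because the demo uses sqlite3 as a stand-in.
PG_BEGIN_SNAPSHOT = "BEGIN ISOLATION LEVEL SERIALIZABLE, READ ONLY, DEFERRABLE"


def fetch_in_chunks(cursor: Any, chunk_size: int = 10000) -> Iterator[List[Any]]:
    """Yield lists of up to chunk_size rows from an already-executed DB-API cursor."""
    while True:
        rows = cursor.fetchmany(chunk_size)
        if not rows:
            break
        yield rows


def round_trips(total_rows: int, chunk_size: int) -> int:
    """Number of fetches needed to read total_rows (ceiling division)."""
    return -(-total_rows // chunk_size)


# Demo with a hypothetical "packages" table holding 25 rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE packages (name TEXT, summary TEXT)")
conn.executemany(
    "INSERT INTO packages VALUES (?, ?)",
    [(f"pkg{i}", f"summary {i}") for i in range(25)],
)

cur = conn.cursor()
cur.execute("SELECT name, summary FROM packages ORDER BY name")
demo_chunks = list(fetch_in_chunks(cur, chunk_size=10))
conn.close()

print([len(chunk) for chunk in demo_chunks])
```

Going from chunks of 1000 to 10000 cuts the number of fetches by a factor of
ten for the same data, so the per-fetch latency is paid ten times less often.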

Given that these problems seem to be *new* since the move of the database to
Heroku, I don't think the shape of our data in Elasticsearch or the actual
query we're using, which hasn't changed, is at fault. So I've been trying to
figure out what else we might have changed in the transition that would have
caused it.

Donald Stufft
PGP: 0x6E3CBCE93372DCFA // 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA
