[Distutils] Fixing Cheese Shop search

Mon Apr 22 01:11:55 CEST 2013

Hi there,

In a fit of madness caused by another 30 seconds-long PyPI search I
decided to investigate the code, in the hopes of perhaps finding
something simple that would alleviate the extremely long search time.

I discovered what appears to be a function that makes 6 SQL queries
for each term provided by the search user, in turn those queries
expand to what appear to be %SUBSTRING% table scans across the
releases table, which appears to contain upwards of half a gigabyte of
strings.

Now the root cause has been located, what to do about it? I looked at
hacking on the code, but it seems webui.py is already too massive for
its own good, and in any case PostgreSQL's options for efficient
search are quite limited. It might be all-round good if the size of
that module started to drop..

I wrote a crawler to pull a reasonable facsimile of the releases table
on to my machine via the XML-RPC API, then arranged for Xapian to
index only the newest releases for each package. The resulting
full-text index weighs in at a very reasonable 334mb, and searches
complete almost immediately, even on my lowly Intel Atom colocated
server.

I wrote a quick hack Flask app around it, which you can see here:

      http://5.39.91.176:5000/

The indexer takes as input a database produced by the crawler, which
is smart enough to know how to use PyPI's exposed 'changelog' serial
numbers. Basically it is quite trivial and efficient to run this setup
in an incremental indexing mode.

As you can see from the results, even my lowly colo is trouncing what
is currently on PyPI, and so my thoughts tend toward making an
arrangement like this more permanent.

The crawler code weighs in at 150 lines, the indexer a meagre 113
lines, and the Flask example app is 74 lines. Implementing an exact
replica of PyPI's existing scoring function is already partially done
at indexing time, and the rest is quite easy to complete (mostly
cutpasting code).

Updating the Flask example to provide an XML-RPC API (or similar),
then *initially augmenting* the old search facility seems like a good
start, with a view to removing the old feature entirely. Integrating
indexing directly would be pointless, the PyPI code really doesn't
need anything more added to it until it gets at least reorganized a
little.

So for the cost of 334mb of disk, a cron job, and a lowly VPS with
even just 700MB RAM, PyPI's search pains might be solved permanently.
Naturally I'm writing this mail because it bothers me enough to
volunteer help. :)

Prototype code is here: https://bitbucket.org/dmw/pypi-search (note:
relies on a pre-alpha quality DB library I'm hacking on)

Thoughts?