[Distutils] Fixing Cheese Shop search
David Wilson
dw at botanicus.net
Mon Apr 22 15:30:10 CEST 2013
http://pypi.h1.botanicus.net/ is the same demo running behind Apache
with mod_gzip on a Core i7 920.
On 22 April 2013 02:11, David Wilson <dw at botanicus.net> wrote:
> Hi there,
>
> In a fit of madness caused by another 30-second-long PyPI search, I
> decided to investigate the code, in the hope of finding something
> simple that would alleviate the extremely long search time.
>
> I discovered what appears to be a function that makes 6 SQL queries
> for each term provided by the searching user; in turn, those queries
> expand to what appear to be %SUBSTRING% table scans across the
> releases table, which seems to contain upwards of half a gigabyte of
> strings.
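For illustration, the slow path looks roughly like the sketch below. This is a hedged reconstruction, not the actual webui.py code: the column names and exact query shape are guesses. The point it demonstrates is real, though: a pattern with a leading '%' cannot use a B-tree index, so each of these queries is a sequential scan.

```python
# Hypothetical reconstruction of the search code's query generation.
# Column names below are guesses; what matters is the fan-out: each
# user term becomes several %substring% ILIKE matches, and a leading
# '%' defeats any B-tree index, forcing PostgreSQL into table scans.
COLUMNS = ("name", "version", "author", "summary", "description", "keywords")

def build_search_queries(terms):
    queries = []
    for term in terms:
        pattern = "%" + term + "%"
        for col in COLUMNS:
            sql = "SELECT name, version FROM releases WHERE %s ILIKE ?" % col
            queries.append((sql, (pattern,)))
    return queries

# One search term already costs six table scans; two terms cost twelve.
print(len(build_search_queries(["flask"])))         # 6
print(len(build_search_queries(["flask", "web"])))  # 12
```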
>
> Now that the root cause has been located, what to do about it? I
> looked at hacking on the code, but webui.py seems already too massive
> for its own good, and in any case PostgreSQL's options for efficient
> full-text search are quite limited. It would be all-round good if the
> size of that module started to drop.
>
> I wrote a crawler to pull a reasonable facsimile of the releases table
> on to my machine via the XML-RPC API, then arranged for Xapian to
> index only the newest releases for each package. The resulting
> full-text index weighs in at a very reasonable 334MB, and searches
> complete almost immediately, even on my lowly Intel Atom colocated
> server.
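The only subtle part of indexing "only the newest releases for each package" is the per-package selection itself. A minimal sketch of that step, assuming rows of (package, version, text); note the version comparison here is a naive numeric-tuple ordering, where a real indexer would want proper PEP 386-style comparison before handing each surviving row to Xapian:

```python
def version_key(version):
    # Naive ordering: compare dotted numeric components only. A real
    # indexer would need full PEP 386-style version comparison.
    return tuple(int(p) for p in version.split(".") if p.isdigit())

def newest_releases(rows):
    """Collapse (package, version, text) rows to one entry per package,
    keeping the release with the highest version."""
    newest = {}
    for name, version, text in rows:
        current = newest.get(name)
        if current is None or version_key(version) > version_key(current[0]):
            newest[name] = (version, text)
    return newest

rows = [
    ("flask", "0.9", "A microframework"),
    ("flask", "0.10", "A microframework"),
    ("lxml", "3.1.0", "XML processing toolkit"),
]
print(newest_releases(rows)["flask"][0])  # 0.10
```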
>
> I wrote a quick hack Flask app around it, which you can see here:
>
> http://5.39.91.176:5000/
>
> The indexer takes as input a database produced by the crawler, which
> is smart enough to know how to use PyPI's exposed 'changelog' serial
> numbers. Basically it is quite trivial and efficient to run this setup
> in an incremental indexing mode.
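Concretely, each incremental run only has to ask PyPI what changed since the serial recorded by the previous run. A sketch of the bookkeeping, assuming changelog entries of the form (name, version, timestamp, action, serial) as returned by PyPI's XML-RPC changelog_since_serial method:

```python
import xmlrpc.client

def packages_to_reindex(entries, last_serial):
    """Given changelog entries fetched from PyPI, return the set of
    packages needing reindexing and the new high-water serial."""
    todo = set()
    for name, version, timestamp, action, serial in entries:
        if serial > last_serial:
            todo.add(name)
    if entries:
        last_serial = max(last_serial, max(e[4] for e in entries))
    return todo, last_serial

def incremental_update(last_serial):
    # Network call shown for shape only; endpoint and method name are
    # per PyPI's XML-RPC interface at the time of writing.
    client = xmlrpc.client.ServerProxy("https://pypi.org/pypi")
    entries = client.changelog_since_serial(last_serial)
    return packages_to_reindex(entries, last_serial)
```

The stored serial makes the cron job idempotent: rerunning with the same serial simply reindexes the same small set of changed packages.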
>
> As you can see from the results, even my lowly colo is trouncing what
> is currently on PyPI, and so my thoughts tend toward making an
> arrangement like this more permanent.
>
> The crawler code weighs in at 150 lines, the indexer a meagre 113
> lines, and the Flask example app is 74 lines. An exact replica of
> PyPI's existing scoring function is already partially implemented at
> indexing time, and the rest is quite easy to complete (mostly
> cut-and-pasting code).
>
> Updating the Flask example to provide an XML-RPC API (or similar),
> then *initially augmenting* the old search facility, seems like a
> good start, with a view to removing the old feature entirely.
> Integrating indexing directly would be pointless; the PyPI code
> really doesn't need anything more added to it until it gets at least
> a little reorganization.
>
> So for the cost of 334MB of disk, a cron job, and a lowly VPS with
> even just 700MB of RAM, PyPI's search pains might be solved
> permanently.
> Naturally I'm writing this mail because it bothers me enough to
> volunteer help. :)
>
> Prototype code is here: https://bitbucket.org/dmw/pypi-search (note:
> relies on a pre-alpha quality DB library I'm hacking on)
>
> Thoughts?