Fixing Cheese Shop search

Hi there,

In a fit of madness caused by another 30-second-long PyPI search, I decided to investigate the code, in the hope of finding something simple that would alleviate the extremely long search time.

I discovered what appears to be a function that makes 6 SQL queries for each term provided by the searching user; in turn those queries expand to what appear to be %SUBSTRING% table scans across the releases table, which appears to contain upwards of half a gigabyte of strings.

Now that the root cause has been located, what to do about it? I looked at hacking on the code, but it seems webui.py is already too massive for its own good, and in any case PostgreSQL's options for efficient search are quite limited. It might be all-round good if the size of that module started to drop.

I wrote a crawler to pull a reasonable facsimile of the releases table onto my machine via the XML-RPC API, then arranged for Xapian to index only the newest release of each package. The resulting full-text index weighs in at a very reasonable 334MB, and searches complete almost immediately, even on my lowly Intel Atom colocated server.

I wrote a quick-hack Flask app around it, which you can see here: http://5.39.91.176:5000/

The indexer takes as input a database produced by the crawler, which is smart enough to use PyPI's exposed 'changelog' serial numbers, so it is quite trivial and efficient to run this setup in an incremental indexing mode.

As you can see from the results, even my lowly colo is trouncing what is currently on PyPI, so my thoughts tend toward making an arrangement like this more permanent.

The crawler weighs in at 150 lines, the indexer a meagre 113 lines, and the Flask example app 74 lines. An exact replica of PyPI's existing scoring function is already partially implemented at indexing time, and the rest is quite easy to complete (mostly cut-and-pasting code).
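To illustrate the difference described above — the prototype uses Xapian, but the core reason a full-text index beats per-term %SUBSTRING% scans can be shown with a toy inverted index (all names here are illustrative, not the prototype's actual code):

```python
# Toy sketch: a term lookup touches only the postings lists for the
# query terms, instead of scanning every row's text six times per term
# the way chained LIKE '%term%' queries do.
from collections import defaultdict

def build_index(releases):
    """releases: {package_name: description_text} (hypothetical shape;
    a real deployment would hand this text to Xapian's indexer)."""
    postings = defaultdict(set)
    for name, text in releases.items():
        for term in text.lower().split():
            postings[term].add(name)
    return postings

def search(postings, query):
    """AND-match every query term via set intersection."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = postings.get(terms[0], set()).copy()
    for term in terms[1:]:
        result &= postings.get(term, set())
    return result

releases = {
    "requests": "http library for humans",
    "flask": "a microframework for web apps",
    "httplib2": "comprehensive http client library",
}
index = build_index(releases)
print(sorted(search(index, "http library")))  # ['httplib2', 'requests']
```

The lookup cost scales with the number of matching postings rather than the total size of the stored text, which is why the 334MB index answers almost immediately while the substring scans take tens of seconds.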
Updating the Flask example to provide an XML-RPC API (or similar) and *initially augmenting* the old search facility seems like a good start, with a view to removing the old feature entirely. Integrating indexing directly would be pointless; the PyPI code really doesn't need anything more added to it until it gets at least a little reorganization.

So for the cost of 334MB of disk, a cron job, and a lowly VPS with even just 700MB of RAM, PyPI's search pains might be solved permanently. Naturally I'm writing this mail because it bothers me enough to volunteer help. :)

Prototype code is here: https://bitbucket.org/dmw/pypi-search (note: it relies on a pre-alpha-quality DB library I'm hacking on)

Thoughts?
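The incremental mode mentioned above hinges on PyPI's changelog serial numbers: each change event carries a monotonically increasing serial, so the crawler only needs to remember the last serial it processed. A minimal sketch of that bookkeeping (function and field names are illustrative, not the prototype's actual code; I believe the XML-RPC method is `changelog_since_serial`, but treat that as an assumption):

```python
# A real crawler would fetch events over XML-RPC, roughly:
#   import xmlrpc.client
#   pypi = xmlrpc.client.ServerProxy("https://pypi.python.org/pypi")
#   events = pypi.changelog_since_serial(last_serial)
# Each event is assumed to look like (name, version, timestamp, action, serial).

def plan_reindex(events, last_serial):
    """Collapse changelog events into the set of packages to reindex,
    plus the new high-water-mark serial to persist for the next run."""
    to_reindex = set()
    for name, version, timestamp, action, serial in events:
        if serial > last_serial:
            to_reindex.add(name)
            last_serial = serial  # events arrive in ascending serial order
    return to_reindex, last_serial

events = [
    ("requests", "1.2.0", 1366588800, "new release", 101),
    ("flask", "0.9", 1366588900, "remove file", 102),
    ("requests", "1.2.1", 1366589000, "new release", 103),
]
packages, serial = plan_reindex(events, last_serial=100)
print(sorted(packages), serial)  # ['flask', 'requests'] 103
```

Run from cron, each pass reindexes only the handful of packages that changed since the stored serial, which is why the whole setup stays cheap enough for a small VPS.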

http://pypi.h1.botanicus.net/ is the same demo running behind Apache with mod_gzip on a Core i7 920.

On Mon, Apr 22, 2013 at 11:30 PM, David Wilson <dw@botanicus.net> wrote:
Prototype code is here: https://bitbucket.org/dmw/pypi-search (note: relies on a pre-alpha quality DB library I'm hacking on)
Thoughts?
Hi David,

This certainly sounds intriguing (and even promising), but I believe Richard (Jones, the main PyPI maintainer) has just returned from a long business trip, so it may be a while before he catches back up with distutils-sig.

I'll see if I can nudge some other folks and get you a better answer...

Cheers,
Nick.

--
Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

Quasi-monthly ping :) I'm still happy to volunteer help.

David
participants (2)

- David Wilson
- Nick Coghlan