I would also be very grateful for feedback from the twisted point of view.
---------- Forwarded message ----------
From: Marc Byrd <dr.marc.byrd(a)gmail.com>
Date: Fri, Feb 29, 2008 at 10:17 AM
Subject: Seeking Validation - search web service using memcached
I'm looking for some validation for some work I've done for a client, and
I'm open to criticism ("mock me" ? ;^), relevant awareness of similar
projects, and alternatives.
When I looked around in about September 2007 for a good scalable search
solution for Ruby on Rails, I found the choices lacking. Firstly, none of
the solutions seemed to have an option for keeping the reverse indices
in memory across however many machines I might like to store them on.
Secondly, many of the solutions seemed too general-purpose and heavyweight
for my client's needs (which are basically to search for items from the db,
based on tags). But without addressing the first concern, I felt that
anything I implemented would not scale to the customer's needs and
aspirations, and that for such an investment, virtually unlimited scale
would be mandatory.
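For readers unfamiliar with the idea, a tag-based reverse index can be sketched in a few lines of Python. This is only an illustration of the general technique, not the code described in this message: the names `build_index` and `search` and the sample items are hypothetical, and the index here lives in a single process rather than being spread across memcached servers.

```python
from collections import defaultdict


def build_index(items):
    """Map each tag to the set of item ids that carry it."""
    index = defaultdict(set)
    for item_id, tags in items.items():
        for tag in tags:
            index[tag].add(item_id)
    return index


def search(index, tags):
    """Return the ids of items that carry every one of the given tags."""
    # .get() avoids creating empty entries for unknown tags.
    postings = [index.get(tag, set()) for tag in tags]
    if not postings:
        return set()
    # Start from the smallest set so intermediate results stay small.
    postings.sort(key=len)
    return set.intersection(*postings)


# Hypothetical sample data: item id -> tags.
items = {
    1: ["python", "search"],
    2: ["ruby", "search"],
    3: ["python", "memcached", "search"],
}
index = build_index(items)
matches = search(index, ["python", "search"])
```

In a distributed setup, each tag's set of ids would presumably be stored under its own key, so a multi-tag query becomes a handful of fetches followed by a set intersection on the client.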
Therefore I looked at memcached - well-proven on many large-scale sites for
caching, but to my knowledge not used in search. Note that memcached uses
an approach wherein the clients all calculate a server based on a given key,
such that no central (scale-limiting) controller is required. Having chosen
memcached, I next attempted to use various memcached connectors into RoR. I
found them at the time (Oct 2007 or so) to be slow and buggy; it took only a
couple of incidents of totally corrupting the entire cache to turn me away
from a Ruby approach to using memcached. Meanwhile, I knew from prior
experience that the Python client for memcached was both fast and reliable.
In the tests I ran, the Python memcached client was routinely 3x faster.
Python also seems to be quite fast at set operations.
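The "every client calculates the server from the key" idea can be illustrated with a minimal sketch. The server addresses and the `server_for` helper below are hypothetical, and the plain CRC32-modulo scheme is an assumption chosen for brevity; real memcached clients use their own hash functions, and some use consistent hashing so that adding a server remaps fewer keys.

```python
import zlib

# Hypothetical server list; every client must be configured with the same
# servers in the same order for them to agree on placement.
SERVERS = ["10.0.0.1:11211", "10.0.0.2:11211", "10.0.0.3:11211"]


def server_for(key, servers=SERVERS):
    """Pick a server from the key alone, with no central lookup."""
    # Mask to an unsigned 32-bit value so the result is the same everywhere.
    digest = zlib.crc32(key.encode("utf-8")) & 0xFFFFFFFF
    return servers[digest % len(servers)]


chosen = server_for("tag:python")
```

Because the choice depends only on the key and the shared server list, any number of clients can read and write without talking to a coordinator, which is the property the paragraph above relies on.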
Getting to the punchline, I used Python and memcached, wrapped in Twisted,
to provide a RESTful web service API, which is called from RoR to get ALL of
the information needed to render search results. The API has been extended
to allow the Ruby code to "fire and forget" new indexing info onto a deque
(FIFO queue), which is processed by a loosely coupled daemon - overhead to
Ruby is about 20ms.
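The fire-and-forget pattern can be sketched with `collections.deque` as below. This is a simplified, single-process illustration under stated assumptions: the names `enqueue_index_update` and `drain` are hypothetical, the index is a plain dict of sets, and the actual daemon described above is loosely coupled and may well run as a separate process.

```python
from collections import deque

# FIFO queue of pending index updates: append on the right, pop on the left.
index_queue = deque()


def enqueue_index_update(item_id, tags, queue=index_queue):
    """Called from the request handler: record the work and return at once."""
    queue.append((item_id, tuple(tags)))


def drain(queue, index, batch_size=100):
    """Called by the background worker: apply up to batch_size updates."""
    processed = 0
    while queue and processed < batch_size:
        item_id, tags = queue.popleft()  # oldest update first
        for tag in tags:
            index.setdefault(tag, set()).add(item_id)
        processed += 1
    return processed


# The caller only pays for the append; indexing happens later.
enqueue_index_update(42, ["python", "twisted"])
enqueue_index_update(43, ["python"])
tag_index = {}
applied = drain(index_queue, tag_index)
```

The caller never waits for the indexing itself, so its cost is limited to the request round trip plus one append, which is consistent with the small per-call overhead reported above.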
Prior to this approach, the client was using MyISAM full-text search.
Searches took about 10s for less frequently used search terms (5000 uses),
and 20+s for heavily used search terms (100k+ uses).
With the web service, the search results are routinely returned in 1-2
seconds, and the web service itself returns results to RoR within
100-200ms. Indexing is a challenge - the rank score needs to be updated
upon each viewing, but I've now gotten that to be almost real-time (5
minutes max). Plus I can re-index the entire database of 1M+ items in about
8 hours. The index is backed up nightly in case one of the memcached servers
fails (we're using 3). In addition to search, the search web service is
used for relatedness and for something like bookmarks.
So, is there anything out there that can touch these results and provide for
virtually unlimited scale (no central controller)?
Thanks in advance,
PS: Because of leaks in RMagick and its inferior performance compared to
the Python Imaging Library (PIL), I'm also considering a similar approach for
generating many different sizes of fairly large (10MB) images. A similar
fire and forget web service approach could be used to minimize the impact on
the RoR side. Early tests show a 10x speed improvement (even without the
fire and forget). Any thoughts there?