<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:x="urn:schemas-microsoft-com:office:excel" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40">
<head>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=us-ascii">
<meta name=Generator content="Microsoft Word 12 (filtered medium)">
<style>
<!--
/* Font Definitions */
@font-face
        {font-family:"Cambria Math";
        panose-1:2 4 5 3 5 4 6 3 2 4;}
@font-face
        {font-family:Calibri;
        panose-1:2 15 5 2 2 2 4 3 2 4;}
@font-face
        {font-family:Tahoma;
        panose-1:2 11 6 4 3 5 4 4 2 4;}
/* Style Definitions */
p.MsoNormal, li.MsoNormal, div.MsoNormal
        {margin:0in;
        margin-bottom:.0001pt;
        font-size:12.0pt;
        font-family:"Times New Roman","serif";}
a:link, span.MsoHyperlink
        {mso-style-priority:99;
        color:blue;
        text-decoration:underline;}
a:visited, span.MsoHyperlinkFollowed
        {mso-style-priority:99;
        color:purple;
        text-decoration:underline;}
span.EmailStyle17
        {mso-style-type:personal-reply;
        font-family:"Calibri","sans-serif";
        color:#1F497D;}
.MsoChpDefault
        {mso-style-type:export-only;
        font-size:10.0pt;}
@page Section1
        {size:8.5in 11.0in;
        margin:1.0in 1.0in 1.0in 1.0in;}
div.Section1
        {page:Section1;}
-->
</style>
<!--[if gte mso 9]><xml>
<o:shapedefaults v:ext="edit" spidmax="1026" />
</xml><![endif]--><!--[if gte mso 9]><xml>
<o:shapelayout v:ext="edit">
<o:idmap v:ext="edit" data="1" />
</o:shapelayout></xml><![endif]-->
</head>
<body bgcolor=white lang=EN-US link=blue vlink=purple>
<div class=Section1>
<p class=MsoNormal><span style='font-size:11.0pt;font-family:"Calibri","sans-serif";
color:#1F497D'>One idea has to do with the fact that there are only 26 (assuming
Latin alphabet) possible first letters, so I would try splitting up the list of
10,000 into 26 lists in a dictionary indexed by the first letter. Just doing
that is a big reduction of your search space. That way you won’t be doing the
same search every time for a particular first letter. It might even be
worthwhile to split each of those into 26 sublists based on the second letter.
Now you’ve chopped up your 10,000 words into 676 lists, each of which might be
small enough to send to the client without further searching. (Too bad you
won’t have an even distribution across all letters. Then each list would only
have 15 words in it.)<o:p></o:p></span></p>
<p class=MsoNormal><span style='font-size:11.0pt;font-family:"Calibri","sans-serif";
color:#1F497D'><o:p> </o:p></span></p>
<p class=MsoNormal><span style='font-size:11.0pt;font-family:"Calibri","sans-serif";
color:#1F497D'>You could also try using SQLite. I’m using right now in a Django
application, and I’m very happy with the setup and performance, especially for
read operations. With Django, I’m using their ORM, which is quite nice, so I’m
not doing any SQL directly. I think there can be problems with SQLite when you
attempt concurrent writes, but you wouldn’t have that.<o:p></o:p></span></p>
<p class=MsoNormal><span style='font-size:11.0pt;font-family:"Calibri","sans-serif";
color:#1F497D'><o:p> </o:p></span></p>
<p class=MsoNormal><span style='font-size:11.0pt;font-family:"Calibri","sans-serif";
color:#1F497D'>It’s hard to predict which would perform better, a tailor made
domain specific solution written in Python, or a general purpose in-memory
database written in C. I would start with which ever direction you are most
comfortable, and if you can’t get satisfactory performance, try the other
route.<o:p></o:p></span></p>
<p class=MsoNormal><span style='font-size:11.0pt;font-family:"Calibri","sans-serif";
color:#1F497D'><o:p> </o:p></span></p>
<p class=MsoNormal><span style='font-size:11.0pt;font-family:"Calibri","sans-serif";
color:#595959'>Greg<o:p></o:p></span></p>
<p class=MsoNormal><span style='font-size:11.0pt;font-family:"Calibri","sans-serif";
color:#1F497D'><o:p> </o:p></span></p>
<div>
<div style='border:none;border-top:solid #B5C4DF 1.0pt;padding:3.0pt 0in 0in 0in'>
<p class=MsoNormal><b><span style='font-size:10.0pt;font-family:"Tahoma","sans-serif"'>From:</span></b><span
style='font-size:10.0pt;font-family:"Tahoma","sans-serif"'>
tutor-bounces@python.org [mailto:tutor-bounces@python.org] <b>On Behalf Of </b>Dinesh
B Vadhia<br>
<b>Sent:</b> Thursday, April 10, 2008 8:13 AM<br>
<b>To:</b> tutor@python.org<br>
<b>Subject:</b> Re: [Tutor] Searching through large number of string items<o:p></o:p></span></p>
</div>
</div>
<p class=MsoNormal><o:p> </o:p></p>
<div>
<p class=MsoNormal><span style='color:navy'>The 10,000 string items are sorted.</span><o:p></o:p></p>
</div>
<div>
<p class=MsoNormal> <o:p></o:p></p>
</div>
<div>
<p class=MsoNormal><span style='color:navy'>The way the autocomplete works is
that when a user enters a char eg. 'f', the 'f' is sent to the server and
returns strings with the char 'f'. You can limit the number of items sent
back to the browser (say, limit to between 15 and 100). The string items
containing 'f' are displayed. The user can then enter another char eg.
'a' to make 'fa'. The autocomplete plugin will search the cache to find
all items containing 'fa' but may need to go back to the server to collect
others. And, so on. Equally, the user could backspace the 'f' and
enter 'k'. The 'k' will be sent to the server to find strings containing
'k', and so on.</span><o:p></o:p></p>
</div>
<div>
<p class=MsoNormal> <o:p></o:p></p>
</div>
<div>
<p class=MsoNormal><span style='color:navy'>One way to solve this is with
linear search which as you rightly pointed out has horrible performance (and it
has!). I'll try the binary search and let you know. I'll also look
at the trie structure.</span><o:p></o:p></p>
</div>
<div>
<p class=MsoNormal> <o:p></o:p></p>
</div>
<div>
<p class=MsoNormal><span style='color:navy'>An alternative is to create an
in-memory SQLite database of the string items. Any thoughts on that?</span><o:p></o:p></p>
</div>
<div>
<p class=MsoNormal> <o:p></o:p></p>
</div>
<div>
<p class=MsoNormal><span style='color:navy'>Dinesh</span><o:p></o:p></p>
</div>
<div>
<p class=MsoNormal> <o:p></o:p></p>
</div>
<div>
<p class=MsoNormal> <o:p></o:p></p>
</div>
<div>
<p class=MsoNormal><span style='font-size:10.0pt;font-family:"Arial","sans-serif"'>-----
Original Message ----- <o:p></o:p></span></p>
<div>
<p class=MsoNormal style='background:#E4E4E4'><b><span style='font-size:10.0pt;
font-family:"Arial","sans-serif"'>From:</span></b><span style='font-size:10.0pt;
font-family:"Arial","sans-serif"'> <a href="mailto:kent37@tds.net"
title="kent37@tds.net">Kent Johnson</a> <o:p></o:p></span></p>
</div>
<div>
<p class=MsoNormal><b><span style='font-size:10.0pt;font-family:"Arial","sans-serif"'>To:</span></b><span
style='font-size:10.0pt;font-family:"Arial","sans-serif"'> <a
href="mailto:dineshbvadhia@hotmail.com" title="dineshbvadhia@hotmail.com">Dinesh
B Vadhia</a> <o:p></o:p></span></p>
</div>
<div>
<p class=MsoNormal><b><span style='font-size:10.0pt;font-family:"Arial","sans-serif"'>Cc:</span></b><span
style='font-size:10.0pt;font-family:"Arial","sans-serif"'> <a
href="mailto:tutor@python.org" title="tutor@python.org">tutor@python.org</a> <o:p></o:p></span></p>
</div>
<div>
<p class=MsoNormal><b><span style='font-size:10.0pt;font-family:"Arial","sans-serif"'>Sent:</span></b><span
style='font-size:10.0pt;font-family:"Arial","sans-serif"'> Thursday, April 10,
2008 5:20 AM<o:p></o:p></span></p>
</div>
<div>
<p class=MsoNormal><b><span style='font-size:10.0pt;font-family:"Arial","sans-serif"'>Subject:</span></b><span
style='font-size:10.0pt;font-family:"Arial","sans-serif"'> Re: [Tutor] List
comprehensions<o:p></o:p></span></p>
</div>
</div>
<div>
<p class=MsoNormal><o:p> </o:p></p>
</div>
<p class=MsoNormal style='margin-bottom:12.0pt'>Dinesh B Vadhia wrote:<br>
> Kent<br>
> <br>
> I'm using a Javascript autocomplete plugin for an online web <br>
> application/service. Each time a user inputs a character, the
character <br>
> is sent to the backend Python program which searches for the character <br>
> in a list of >10,000 string items. Once it finds the character,
the <br>
> backend will return that string and N other adjacent string items where <br>
> N can vary from 20 to 150. Each string item is sent back to the JS
in <br>
> separate print statements. Hence, the for loop.<br>
<br>
Ok, this sounds a little closer to a real spec. What kind of search are <br>
you doing? Do you really just search for individual characters or are <br>
you looking for the entire string entered so far as a prefix? Is the <br>
list of 10,000 items sorted? Can it be?<br>
<br>
You need to look at your real problem and find an appropriate data <br>
structure, rather than showing us what you think is the solution and <br>
asking how to make it faster.<br>
<br>
For example, if what you have a sorted list of strings and you want to <br>
find the first string that starts with a given prefix and return the N <br>
adjacent strings, you could use the bisect module to do a binary search <br>
rather than a linear search. Binary search of 10,000 items will take <br>
13-14 comparisons to find the correct location. Your linear search will <br>
take an average of 5,000 comparisons.<br>
<br>
You might also want to use a trie structure though I'm not sure if that <br>
will let you find adjacent items.<br>
<a href="http://www.cs.mcgill.ca/~cs251/OldCourses/1997/topic7/">http://www.cs.mcgill.ca/~cs251/OldCourses/1997/topic7/</a><br>
<a href="http://jtauber.com/blog/2005/02/10/updated_python_trie_implementation/">http://jtauber.com/blog/2005/02/10/updated_python_trie_implementation/</a><br>
<br>
> I haven't done any profiling yet as we are still building the system but <br>
> it seemed sensible that replacing the for loop with a built-in would <br>
> help. Maybe not?<br>
<br>
Not. An algorithm with poor "big O" performance should be *replaced*,
<br>
not optimized.<br>
<br>
Kent<o:p></o:p></p>
</div>
</body>
</html>