<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:x="urn:schemas-microsoft-com:office:excel" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40">

<head>

<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=us-ascii">

<meta name=Generator content="Microsoft Word 12 (filtered medium)">

<style>

<!--

 /* Font Definitions */

 @font-face

        {font-family:"Cambria Math";

        panose-1:2 4 5 3 5 4 6 3 2 4;}

@font-face

        {font-family:Calibri;

        panose-1:2 15 5 2 2 2 4 3 2 4;}

@font-face

        {font-family:Tahoma;

        panose-1:2 11 6 4 3 5 4 4 2 4;}

 /* Style Definitions */

 p.MsoNormal, li.MsoNormal, div.MsoNormal

        {margin:0in;

        margin-bottom:.0001pt;

        font-size:12.0pt;

        font-family:"Times New Roman","serif";}

a:link, span.MsoHyperlink

        {mso-style-priority:99;

        color:blue;

        text-decoration:underline;}

a:visited, span.MsoHyperlinkFollowed

        {mso-style-priority:99;

        color:purple;

        text-decoration:underline;}

span.EmailStyle17

        {mso-style-type:personal-reply;

        font-family:"Calibri","sans-serif";

        color:#1F497D;}

.MsoChpDefault

        {mso-style-type:export-only;

        font-size:10.0pt;}

@page Section1

        {size:8.5in 11.0in;

        margin:1.0in 1.0in 1.0in 1.0in;}

div.Section1

        {page:Section1;}

-->

</style>

<!--[if gte mso 9]><xml>

 <o:shapedefaults v:ext="edit" spidmax="1026" />

</xml><![endif]--><!--[if gte mso 9]><xml>

 <o:shapelayout v:ext="edit">

  <o:idmap v:ext="edit" data="1" />

 </o:shapelayout></xml><![endif]-->

</head>

<body bgcolor=white lang=EN-US link=blue vlink=purple>

<div class=Section1>

<p class=MsoNormal><span style='font-size:11.0pt;font-family:"Calibri","sans-serif";

color:#1F497D'>One idea has to do with the fact that there are only 26 (assuming

Latin alphabet) possible first letters, so I would try splitting up the list of

10,000 into 26 lists in a dictionary indexed by the first letter. Just doing

that is a big reduction of your search space. That way you won&#8217;t be doing the

same search every time for a particular first letter. It might even be

worthwhile to split each of those into 26 sublists based on the second letter.

Now you&#8217;ve chopped up your 10,000 words into 676 lists, each of which might be

small enough to send to the client without further searching. (Too bad you

won&#8217;t have an even distribution across all letters. Then each list would only

have 15 words in it.)<o:p></o:p></span></p>

<p class=MsoNormal><span style='font-size:11.0pt;font-family:"Calibri","sans-serif";

color:#1F497D'><o:p>&nbsp;</o:p></span></p>

<p class=MsoNormal><span style='font-size:11.0pt;font-family:"Calibri","sans-serif";

color:#1F497D'>You could also try using SQLite. I&#8217;m using right now in a Django

application, and I&#8217;m very happy with the setup and performance, especially for

read operations. With Django, I&#8217;m using their ORM, which is quite nice, so I&#8217;m

not doing any SQL directly. I think there can be problems with SQLite when you

attempt concurrent writes, but you wouldn&#8217;t have that.<o:p></o:p></span></p>

<p class=MsoNormal><span style='font-size:11.0pt;font-family:"Calibri","sans-serif";

color:#1F497D'><o:p>&nbsp;</o:p></span></p>

<p class=MsoNormal><span style='font-size:11.0pt;font-family:"Calibri","sans-serif";

color:#1F497D'>It&#8217;s hard to predict which would perform better, a tailor made

domain specific solution written in Python, or a general purpose in-memory

database written in C. I would start with which ever direction you are most

comfortable, and if you can&#8217;t get satisfactory performance, try the other

route.<o:p></o:p></span></p>

<p class=MsoNormal><span style='font-size:11.0pt;font-family:"Calibri","sans-serif";

color:#1F497D'><o:p>&nbsp;</o:p></span></p>

<p class=MsoNormal><span style='font-size:11.0pt;font-family:"Calibri","sans-serif";

color:#595959'>Greg<o:p></o:p></span></p>

<p class=MsoNormal><span style='font-size:11.0pt;font-family:"Calibri","sans-serif";

color:#1F497D'><o:p>&nbsp;</o:p></span></p>

<div>

<div style='border:none;border-top:solid #B5C4DF 1.0pt;padding:3.0pt 0in 0in 0in'>

<p class=MsoNormal><b><span style='font-size:10.0pt;font-family:"Tahoma","sans-serif"'>From:</span></b><span

style='font-size:10.0pt;font-family:"Tahoma","sans-serif"'>

tutor-bounces@python.org [mailto:tutor-bounces@python.org] <b>On Behalf Of </b>Dinesh

B Vadhia<br>

<b>Sent:</b> Thursday, April 10, 2008 8:13 AM<br>

<b>To:</b> tutor@python.org<br>

<b>Subject:</b> Re: [Tutor] Searching through large number of string items<o:p></o:p></span></p>

</div>

</div>

<p class=MsoNormal><o:p>&nbsp;</o:p></p>

<div>

<p class=MsoNormal><span style='color:navy'>The 10,000 string items are sorted.</span><o:p></o:p></p>

</div>

<div>

<p class=MsoNormal>&nbsp;<o:p></o:p></p>

</div>

<div>

<p class=MsoNormal><span style='color:navy'>The way the autocomplete works is

that when a user enters a char eg. 'f', the 'f' is sent to the server and

returns strings with the char 'f'.&nbsp; You can limit the number of items sent

back to the browser (say, limit to between 15 and 100).&nbsp; The string items

containing 'f' are displayed.&nbsp; The user can then enter another char eg.

'a' to make 'fa'.&nbsp; The autocomplete plugin will search the cache to find

all items containing 'fa' but may need to go back to the server to collect

others.&nbsp; And, so on.&nbsp; Equally, the user could backspace the 'f' and

enter 'k'.&nbsp; The 'k' will be sent to the server to find strings containing

'k', and so on.</span><o:p></o:p></p>

</div>

<div>

<p class=MsoNormal>&nbsp;<o:p></o:p></p>

</div>

<div>

<p class=MsoNormal><span style='color:navy'>One way to solve this is with

linear search which as you rightly pointed out has horrible performance (and it

has!).&nbsp; I'll try the binary search and let you know.&nbsp; I'll also look

at the trie structure.</span><o:p></o:p></p>

</div>

<div>

<p class=MsoNormal>&nbsp;<o:p></o:p></p>

</div>

<div>

<p class=MsoNormal><span style='color:navy'>An alternative is to create an

in-memory SQLite database of the string items.&nbsp; Any thoughts on that?</span><o:p></o:p></p>

</div>

<div>

<p class=MsoNormal>&nbsp;<o:p></o:p></p>

</div>

<div>

<p class=MsoNormal><span style='color:navy'>Dinesh</span><o:p></o:p></p>

</div>

<div>

<p class=MsoNormal>&nbsp;<o:p></o:p></p>

</div>

<div>

<p class=MsoNormal>&nbsp;<o:p></o:p></p>

</div>

<div>

<p class=MsoNormal><span style='font-size:10.0pt;font-family:"Arial","sans-serif"'>-----

Original Message ----- <o:p></o:p></span></p>

<div>

<p class=MsoNormal style='background:#E4E4E4'><b><span style='font-size:10.0pt;

font-family:"Arial","sans-serif"'>From:</span></b><span style='font-size:10.0pt;

font-family:"Arial","sans-serif"'> <a href="mailto:kent37@tds.net"

title="kent37@tds.net">Kent Johnson</a> <o:p></o:p></span></p>

</div>

<div>

<p class=MsoNormal><b><span style='font-size:10.0pt;font-family:"Arial","sans-serif"'>To:</span></b><span

style='font-size:10.0pt;font-family:"Arial","sans-serif"'> <a

href="mailto:dineshbvadhia@hotmail.com" title="dineshbvadhia@hotmail.com">Dinesh

B Vadhia</a> <o:p></o:p></span></p>

</div>

<div>

<p class=MsoNormal><b><span style='font-size:10.0pt;font-family:"Arial","sans-serif"'>Cc:</span></b><span

style='font-size:10.0pt;font-family:"Arial","sans-serif"'> <a

href="mailto:tutor@python.org" title="tutor@python.org">tutor@python.org</a> <o:p></o:p></span></p>

</div>

<div>

<p class=MsoNormal><b><span style='font-size:10.0pt;font-family:"Arial","sans-serif"'>Sent:</span></b><span

style='font-size:10.0pt;font-family:"Arial","sans-serif"'> Thursday, April 10,

2008 5:20 AM<o:p></o:p></span></p>

</div>

<div>

<p class=MsoNormal><b><span style='font-size:10.0pt;font-family:"Arial","sans-serif"'>Subject:</span></b><span

style='font-size:10.0pt;font-family:"Arial","sans-serif"'> Re: [Tutor] List

comprehensions<o:p></o:p></span></p>

</div>

</div>

<div>

<p class=MsoNormal><o:p>&nbsp;</o:p></p>

</div>

<p class=MsoNormal style='margin-bottom:12.0pt'>Dinesh B Vadhia wrote:<br>

&gt; Kent<br>

&gt;&nbsp; <br>

&gt; I'm using a Javascript autocomplete plugin for an online web <br>

&gt; application/service.&nbsp; Each time a user inputs a character, the

character <br>

&gt; is sent to the backend Python program which searches for the character <br>

&gt; in a list of &gt;10,000 string items.&nbsp; Once it finds the character,

the <br>

&gt; backend will return that string and N other adjacent string items where <br>

&gt; N can vary from 20 to 150.&nbsp; Each string item is sent back to the JS

in <br>

&gt; separate print statements.&nbsp; Hence, the for loop.<br>

<br>

Ok, this sounds a little closer to a real spec. What kind of search are <br>

you doing? Do you really just search for individual characters or are <br>

you looking for the entire string entered so far as a prefix? Is the <br>

list of 10,000 items sorted? Can it be?<br>

<br>

You need to look at your real problem and find an appropriate data <br>

structure, rather than showing us what you think is the solution and <br>

asking how to make it faster.<br>

<br>

For example, if what you have a sorted list of strings and you want to <br>

find the first string that starts with a given prefix and return the N <br>

adjacent strings, you could use the bisect module to do a binary search <br>

rather than a linear search. Binary search of 10,000 items will take <br>

13-14 comparisons to find the correct location. Your linear search will <br>

take an average of 5,000 comparisons.<br>

<br>

You might also want to use a trie structure though I'm not sure if that <br>

will let you find adjacent items.<br>

<a href="http://www.cs.mcgill.ca/~cs251/OldCourses/1997/topic7/">http://www.cs.mcgill.ca/~cs251/OldCourses/1997/topic7/</a><br>

<a href="http://jtauber.com/blog/2005/02/10/updated_python_trie_implementation/">http://jtauber.com/blog/2005/02/10/updated_python_trie_implementation/</a><br>

<br>

&gt; I haven't done any profiling yet as we are still building the system but <br>

&gt; it seemed sensible that replacing the for loop with a built-in would <br>

&gt; help.&nbsp; Maybe not?<br>

<br>

Not. An algorithm with poor &quot;big O&quot; performance should be *replaced*,

<br>

not optimized.<br>

<br>

Kent<o:p></o:p></p>

</div>

</body>

</html>