Handling large datastore search

Dave Angel davea at ieee.org
Wed Nov 4 00:27:50 CET 2009

Ahmed Barakat wrote:
> In case I have a  huge datastore (10000 entries, each entry has like 6
> properties), what is the best way
> to handle the search within such a huge datastore, and what if I want to
> make a generic search, for example
> you write a word and i use it to search within all properties I have for all
> entries?
> Is the conversion to XML a good solution, or it is not?
> sorry for being new to web development, and python.
> Thanks in advance.

I don't see anything about your query which is specific to web
development, and there's no need to be apologetic for being new anyway.

One person's "huge" is another person's "pretty large."  I'd say 10000
items is pretty small if you're working on the desktop, since you can
readily hold all the data in memory.  I edit text files bigger than
that.  But I'll assume your data really is huge, or will grow to be
huge, or lives in an environment that treats it as huge.
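
To give a feel for the "small" case, here is a minimal sketch of a
generic search over data held entirely in memory.  The sample entries,
the property names, and the search_all() helper are all made up for
illustration; they are not taken from the original question.

```python
# Hypothetical sample data: 10000 entries with six string properties each.
entries = [
    {"id": str(i), "name": f"item{i}", "color": "red" if i % 2 else "blue",
     "city": "Cairo", "owner": f"user{i % 100}", "note": "sample"}
    for i in range(10000)
]

def search_all(entries, word):
    """Return every entry in which any property contains `word`.

    This is a plain linear scan with case-insensitive substring matching.
    """
    word = word.lower()
    return [e for e in entries
            if any(word in str(value).lower() for value in e.values())]

matches = search_all(entries, "user42")
print(len(matches))
```

A scan like this touches every property of every entry, which is
usually acceptable at this size and is the simplest thing that can work.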

When you're handling large amounts of data, there are always tradeoffs
between performance and other characteristics, usually size and
complexity.  If you have lots of data, you're probably best off using
a standard, well-tested system -- a real database.  The developers of
such things have decades of experience in making certain things fast,
reliable, and self-consistent.
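
As one possible illustration, the sqlite3 module in Python's standard
library gives you a real database with no server to set up.  The sketch
below is only an assumption about what the data might look like: the
table name, the column names, and the sample rows are invented.

```python
import sqlite3

# An in-memory database keeps the example self-contained; a file path
# would be used for data that has to persist.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE entries (id INTEGER PRIMARY KEY, name TEXT, city TEXT)"
)
conn.executemany(
    "INSERT INTO entries (id, name, city) VALUES (?, ?, ?)",
    [(i, f"item{i}", "Cairo" if i % 2 else "Giza") for i in range(10000)],
)

# Ask the database to keep an index on the column we expect to search by.
conn.execute("CREATE INDEX idx_entries_name ON entries (name)")
conn.commit()

row = conn.execute(
    "SELECT id, name, city FROM entries WHERE name = ?", ("item4242",)
).fetchone()
print(row)
```

Each index makes lookups on its column quick, at the cost of extra
storage and extra work on every insert or update.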

But considering only speed here, I have to point out that you need to
understand databases, and your particular database, pretty well to
really benefit from all the performance tricks in there.  Keeping it
abstract, you specify which parts of the data you need fast random
access to.  If you want fast search access to "all" of it, your database
will generally be huge, and very slow to update.  The best way to avoid
that is to pick a database mechanism that fits your search mechanism.
I hate to think how many man-centuries Google has dedicated to getting
fast random word access to its *enormous* database.  I'm sure they did
not build on a standard relational model.

If you plan to do it yourself, I'd say the last thing you want to do is
use XML.  XML may be a convenient way to store self-describing data, but
it's not quick to parse in large amounts.  Instead, store the raw data
in text form, with separate index files describing what is where.
Anything that's indexed will be found rapidly, while anything that isn't
will require a search of the raw data.
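
A rough sketch of that layout, under assumptions of my own: one
tab-separated record per line in a raw text file, plus a separate index
file (stored as JSON here purely for convenience) mapping each key to
the byte offset where its record starts.  The file names, the record
format, and the lookup() helper are hypothetical.

```python
import json
import os
import tempfile

workdir = tempfile.mkdtemp()
data_path = os.path.join(workdir, "entries.txt")   # raw data
index_path = os.path.join(workdir, "entries.idx")  # what is where

# Write the raw records, remembering the byte offset of each one.
offsets = {}
with open(data_path, "wb") as data:
    for i in range(10000):
        key = f"item{i}"
        offsets[key] = data.tell()
        line = f"{key}\tuser{i % 100}\tsample note {i}\n"
        data.write(line.encode("utf-8"))

# Store the index separately from the data.
with open(index_path, "w", encoding="utf-8") as idx:
    json.dump(offsets, idx)

def lookup(key):
    """Fetch one record by seeking straight to its offset."""
    with open(index_path, encoding="utf-8") as idx:
        index = json.load(idx)
    with open(data_path, "rb") as data:
        data.seek(index[key])
        return data.readline().decode("utf-8").rstrip("\n").split("\t")

record = lookup("item4242")
print(record)
```

A real program would load the index once and keep it around rather than
rereading it on every lookup; it is reloaded here only to show that the
index lives in its own file.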

There are algorithms for searching raw data that are faster than 
scanning every byte, but a relevant index will almost always be faster.
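
For the generic "type a word, search every property" case, one relevant
index is an inverted index: a mapping from each word to the entries that
contain it.  The following is a minimal sketch with invented sample
data; build_index() and find() are names I made up for the example.

```python
from collections import defaultdict

# Hypothetical entries, keyed by id.
entries = {
    1: {"name": "red bicycle", "city": "Cairo", "owner": "Sam"},
    2: {"name": "blue car", "city": "Giza", "owner": "Dana"},
    3: {"name": "red car", "city": "Cairo", "owner": "Dana"},
}

def build_index(entries):
    """Map each lowercased word to the set of entry ids containing it."""
    index = defaultdict(set)
    for entry_id, props in entries.items():
        for value in props.values():
            for word in str(value).lower().split():
                index[word].add(entry_id)
    return index

index = build_index(entries)

def find(word):
    """Return the sorted ids of entries containing `word` as a whole word."""
    return sorted(index.get(word.lower(), ()))

print(find("red"))
print(find("Dana"))
```

Note that this only matches whole words; substring matches would still
need a scan of the raw data or a more elaborate index.  The index also
has to be updated whenever an entry changes, which is the update cost
mentioned earlier.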

