Large Amount of Data

John Machin sjmachin at lexicon.net
Sat May 26 21:19:16 EDT 2007


On May 26, 6:17 pm, "Jack" <nos... at invalid.com> wrote:
> I have tens of millions (could be more) of documents in files. Each of them
> has other properties in separate files. I need to check if they exist,
> update and merge properties, etc.

And then save the results where?
Option (0) retain it in memory
Option (1) a file
Option (2) a database

And why are you doing this agglomeration of information? Presumably so
that it can be queried. Do you plan to load the whole file into memory
in order to satisfy a simple query?




> And this is not a one time job. Because of the quantity of the files, I
> think querying and
> updating a database will take a long time...

Don't think, benchmark.
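A minimal benchmark sketch along those lines (the table, column, and key
names are illustrative, and the row counts should be scaled toward the real
document count before trusting the numbers):

```python
import sqlite3
import time

N = 100_000  # illustrative; scale toward your real document count

# Build a sample table with an indexed primary key.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (doc_id TEXT PRIMARY KEY, props TEXT)")
conn.executemany(
    "INSERT INTO docs VALUES (?, ?)",
    ((f"doc{i}", f"props{i}") for i in range(N)),
)
conn.commit()

# The equivalent in-memory hash table.
table = {f"doc{i}": f"props{i}" for i in range(N)}

def bench(fn, repeat=10_000):
    """Time `repeat` lookups of existing keys."""
    start = time.perf_counter()
    for i in range(repeat):
        fn(f"doc{i % N}")
    return time.perf_counter() - start

def db_lookup(key):
    return conn.execute(
        "SELECT props FROM docs WHERE doc_id = ?", (key,)
    ).fetchone()[0]

def dict_lookup(key):
    return table[key]

print("sqlite lookups:", bench(db_lookup))
print("dict lookups:  ", bench(dict_lookup))
```

The dict will win on raw lookup speed, but the point of benchmarking is to
see whether the database is fast *enough* for the workload, given that it
doesn't need everything in memory.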

>
> Let's say, I want to do something a search engine needs to do in terms of
> the amount of
> data to be processed on a server. I doubt any serious search engine would
> use a database
> for indexing and searching. A hash table is what I need, not powerful
> queries.

Having a single hash table permits two not-very-powerful query methods:
(1) return the data associated with a single hash key; (2) trawl through
the whole hash table, applying various conditions to the data. If that is
all you want, then comparisons with a serious search engine are quite
irrelevant.
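Concretely, those two query styles look like this (the key names and
property fields here are made up for illustration):

```python
# A hash table mapping document IDs to their merged properties.
properties = {
    "doc1": {"size": 120, "modified": "2007-05-01"},
    "doc2": {"size": 450, "modified": "2007-05-20"},
    "doc3": {"size": 120, "modified": "2007-04-11"},
}

# Query type 1: fetch the record for one known key.
record = properties.get("doc2")

# Query type 2: trawl the whole table, testing conditions on the values.
small_docs = [key for key, props in properties.items()
              if props["size"] == 120]
```

Anything beyond these two patterns (range queries, joins, sorting) is
where a database starts to earn its keep.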

What is relevant is that the whole hash table has to be in virtual memory
before you can start either type of query. This is not the case with a
database. Type 1 queries (with a suitable index on the primary key)
should use only a fraction of the memory that a full hash table would.
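A sketch of a type 1 query against an on-disk SQLite database (schema and
names are illustrative): a fresh connection answers the query by reading
only a few index pages and the matching row, so the full table never has
to fit in memory.

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "docs.db")

# Populate an on-disk database; doc_id is the indexed primary key.
conn = sqlite3.connect(path)
conn.execute("CREATE TABLE docs (doc_id TEXT PRIMARY KEY, props TEXT)")
conn.executemany(
    "INSERT INTO docs VALUES (?, ?)",
    ((f"doc{i}", f"props{i}") for i in range(100_000)),
)
conn.commit()
conn.close()

# A brand-new connection: nothing is cached, yet the lookup only
# touches the index pages on disk, not the whole table.
conn = sqlite3.connect(path)
row = conn.execute(
    "SELECT props FROM docs WHERE doc_id = ?", ("doc12345",)
).fetchone()
print(row)
```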

What is the primary key of your data?




More information about the Python-list mailing list