Is there any library for indexing binary data?

甜瓜 littlesweetmelon at gmail.com
Thu Mar 25 21:45:44 EDT 2010


Many thanks for your kind reply. As you mentioned, a sparse array may
be the best choice.
Storing offsets rather than the payloads themselves can save a great
deal of memory.
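
For concreteness, here is a minimal sketch of that idea in Python. The
names and the on-disk layout are my assumptions (an 8-byte little-endian
id followed by a fixed LEN-byte payload, no struct padding); adjust
ID_FMT and LEN to the real Item layout. It scans the existing file once
to build an id -> offset dict, and then answers a query with a single
seek + read:

    import struct

    ID_FMT = "<q"                    # assumed: 8-byte little-endian id
    ID_SIZE = struct.calcsize(ID_FMT)
    LEN = 2048                       # assumed: fixed ~2 KB payload
    RECORD_SIZE = ID_SIZE + LEN

    def build_index(path):
        # One sequential scan: map each id to the byte offset of its record.
        index = {}
        with open(path, "rb") as f:
            offset = 0
            while True:
                header = f.read(ID_SIZE)
                if len(header) < ID_SIZE:
                    break
                (item_id,) = struct.unpack(ID_FMT, header)
                index[item_id] = offset      # store the offset, not the payload
                f.seek(LEN, 1)               # skip over the payload
                offset += RECORD_SIZE
        return index

    def lookup(path, index, item_id):
        # Seek straight to the record and read only its payload.
        offset = index.get(item_id)
        if offset is None:
            return None
        with open(path, "rb") as f:
            f.seek(offset + ID_SIZE)
            return f.read(LEN)

Of course, a plain Python dict with ~1e9 keys will not fit in memory,
which is exactly where a more compact sparse-array / Judy-style mapping
(or an on-disk index) has to take its place.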

1e7 queries per second is my ideal target, but at least 1e6 must be
achieved. Currently I have reached 5e6 lookups per second on one PC
(without incremental indexing, and with all incoming queries coming
from a local data stream). Since the table is very big and response
time is critical, the final system will definitely be distributed. I
hope the Judy array algorithm can simplify the indexing, so that I can
focus on data persistence and the distributed-computing side.
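
On the incremental-indexing point, the same layout lets new items be
appended and indexed in place. A rough sketch, reusing ID_FMT / LEN from
the sketch above and assuming a single writer:

    def insert(path, index, item_id, payload):
        # Append a new record to the data file and index its offset.
        assert len(payload) == LEN
        with open(path, "ab") as f:
            f.seek(0, 2)             # position at end of file
            offset = f.tell()        # this becomes the new record's offset
            f.write(struct.pack(ID_FMT, item_id))
            f.write(payload)
        index[item_id] = offset

In a distributed setup the index itself would also need to be persisted
and partitioned, but that is orthogonal to the append step.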

--
ShenLei

On 26 March 2010 at 02:55, Irmen de Jong <irmen-NOSPAM- at xs4all.nl> wrote:
> On 25-3-2010 10:55, 甜瓜 wrote:
>> Thank you irmen. I will take a look at pytable.
>> FYI, let me explain the case clearly.
>>
>> Originally, my big data table is simply array of Item:
>> struct Item
>> {
>>      long id;    // used as key
>>      BYTE payload[LEN];   // corresponding value with fixed length
>> };
>>
>> All items are stored in one file by using "stdio.h" function:
>>      fwrite(itemarray, sizeof(Item), num_of_items, fp);
>>
>> Note that the "id" values are unique but in random order. To speed up
>> searching, I regrouped / sorted them into two-level hash tables (in
>> the form of files). I want to employ a library to help me index
>> this table.
>>
>> Since the table contains about 10^9 items and LEN is about 2KB, it is
>> impossible to hold all data in memory. Furthermore, some new item may
>> be inserted into the array. Therefore incremental indexing feature is
>> needed.
>
> I see, I thought the payload data was small as well. What about this idea:
> Build a hash table where the keys are the ids from your Item structs and
> the values are the file seek offsets of the Item 'records' in your original
> datafile (although that might generate values of type long, which take
> more memory than int, so maybe we should use file_offset/sizeof(Item)
> instead). This way you can just keep your original data file (you only
> have to scan it to build the hash table) and you will avoid a lengthy
> conversion process.
>
> If this hash table still doesn't fit in memory, use a sparse array
> implementation of some sort that is more efficient at storing simple
> integers, or just put it into one of the database solutions mentioned in
> earlier responses.
>
> Another thing: I think that your requirement of 1e7 lookups per second
> is a bit steep for any solution where the dataset is not entirely in core
> memory, though.
>
> Irmen.
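
If the id -> offset map itself outgrows RAM, Irmen's fallback above
(putting it into a database) can be tried with nothing but the standard
library; a hedged sketch using sqlite3 (the table name and schema are my
own, chosen only for illustration):

    import sqlite3

    def build_sqlite_index(index_path, id_offset_pairs):
        # id_offset_pairs yields (item_id, offset) tuples, e.g. from the scan above.
        con = sqlite3.connect(index_path)
        con.execute("CREATE TABLE IF NOT EXISTS idx "
                    "(id INTEGER PRIMARY KEY, offset INTEGER)")
        con.executemany("INSERT OR REPLACE INTO idx VALUES (?, ?)",
                        id_offset_pairs)
        con.commit()
        return con

    def sqlite_lookup(con, item_id):
        row = con.execute("SELECT offset FROM idx WHERE id = ?",
                          (item_id,)).fetchone()
        return row[0] if row else None

A B-tree-backed index like this will likely not reach the 1e6-per-second
target, but it stays incremental and persistent while the in-memory
options are evaluated.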


