Is there any library for indexing binary data?

甜瓜 littlesweetmelon at gmail.com
Thu Mar 25 05:55:27 EDT 2010


Thank you, Irmen. I will take a look at PyTables.
FYI, let me explain my case more clearly.

Originally, my big data table is simply an array of Item:
struct Item
{
    long id;    // used as key
    BYTE payload[LEN];   // corresponding value with fixed length
};

All items are stored in one file using the "stdio.h" function fwrite:
    fwrite(itemarray, sizeof(Item), num_of_items, fp);
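For anyone who wants to read such a file from Python, here is a minimal sketch using the standard struct module. It assumes that "long" is 8 bytes, little-endian, that there is no padding in Item, and that LEN is exactly 2048; none of this is stated above, so adjust the format string to match the real layout. The names ITEM, write_items and read_item are made up for illustration.

```python
import struct

LEN = 2048  # assumption: the post only says LEN is "about 2KB"

# Assumed layout of Item: 8-byte little-endian signed id, then LEN payload
# bytes, with no padding in between or at the end.
ITEM = struct.Struct("<q%ds" % LEN)


def write_items(path, items):
    """Append (id, payload) pairs as fixed-size records.

    struct pads a short payload with zero bytes up to LEN.
    """
    with open(path, "ab") as fp:
        for item_id, payload in items:
            fp.write(ITEM.pack(item_id, payload))


def read_item(path, n):
    """Return (id, payload) of the n-th record by seeking straight to it."""
    with open(path, "rb") as fp:
        fp.seek(n * ITEM.size)
        data = fp.read(ITEM.size)
        if len(data) != ITEM.size:
            raise IndexError("no record %d" % n)
        return ITEM.unpack(data)
```

Because every record has the same size, the n-th item always starts at byte n * ITEM.size, so only the mapping from id to record position needs to be indexed.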

Note that "id" is unique but random, with no particular order. To speed up
searching, I regrouped/sorted the items into two-level hash tables (in
the form of files). I would like to use a library to help me index
this table.

Since the table contains about 10^9 items and LEN is about 2KB, it is
impossible to hold all the data in memory. Furthermore, new items may
be inserted into the array, so incremental indexing is
needed.
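One possible shape for such an incremental, on-disk index is sketched below with the standard sqlite3 module (the Sqlite suggestion from the quoted reply): the payloads stay in the flat file, and a small table maps each id to the byte offset of its record. This is only a sketch under assumptions, not a tested design for 10^9 rows: the record layout (8-byte little-endian id, LEN = 2048, no padding) is assumed, and the names open_index, append_item, lookup and the table "idx" are hypothetical. Even so, the index holds about 16 bytes of raw key and offset per item, which is far smaller than the roughly 2KB payload it points to.

```python
import os
import sqlite3
import struct

LEN = 2048  # assumption: the post only says LEN is "about 2KB"
ITEM = struct.Struct("<q%ds" % LEN)  # assumed layout: 8-byte id + payload


def open_index(index_path):
    """Open (or create) the on-disk index mapping id -> byte offset."""
    db = sqlite3.connect(index_path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS idx ("
        "id INTEGER PRIMARY KEY, pos INTEGER NOT NULL)"
    )
    return db


def append_item(db, data_path, item_id, payload):
    """Append one record to the data file and index it incrementally."""
    with db:  # commit on success, roll back if anything below raises
        with open(data_path, "ab") as fp:
            pos = fp.seek(0, os.SEEK_END)  # offset where the record will start
            # Raises sqlite3.IntegrityError for a duplicate id, before
            # anything is written to the data file.
            db.execute("INSERT INTO idx (id, pos) VALUES (?, ?)", (item_id, pos))
            fp.write(ITEM.pack(item_id, payload))
    return pos


def lookup(db, data_path, item_id):
    """Return the payload stored for item_id, or None if it is not indexed."""
    row = db.execute("SELECT pos FROM idx WHERE id = ?", (item_id,)).fetchone()
    if row is None:
        return None
    with open(data_path, "rb") as fp:
        fp.seek(row[0])
        _, payload = ITEM.unpack(fp.read(ITEM.size))
    return payload
```

A lookup then costs one B-tree search in the index plus one seek in the data file, and inserting a new item does not require rebuilding anything.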

Hope this helps you understand my case.

--
ShenLei


2010/3/25 Irmen de Jong <irmen at -nospam-xs4all.nl>:
> On 3/25/10 4:28 AM, 甜瓜 wrote:
>>
>> Howdy,
>>
>> Recently, I have been looking for a good library for building an index on
>> binary data. The Python bindings for Xapian & Lucene focus on text digestion
>> rather than binary data. Could anyone give me a recommendation? Is there any
>> library for indexing binary data, whether or not it is written in
>> Python?
>>
>> In my case, there is a very big datatable which stores structured
>> binary data, eg:
>> struct Item
>> {
>>     long id; // used as key
>>     double value;
>> };
>>
>> I want to build an index on the "id" field to speed up searching. Since
>> this datatable is not constant, the library should support incremental
>> indexing. If there is no suitable library, I will have to build the index
>> myself...
>>
>> Thank you in advance.
>>
>> --
>> ShenLei
>
> Put it into an Sqlite database? Or something else from
> http://docs.python.org/library/persistence.html.
> Or maybe http://www.pytables.org/ is more suitable to your needs (never used
> that one myself though).
> Or install a bank or 2 of memory in your box and read everything into memory
> in one big hashtable.
>
> Btw if you already have a big datatable in which the data is stored, I'm
> guessing that already is in some form of database format. Can't you write
> something that understands that database format?
>
> But I think you need to provide some more details about your data set.
>
> -irmen
> --
> http://mail.python.org/mailman/listinfo/python-list
>


