very large dictionaries
Pierre-Frédéric Caillaud
peufeu at free.fr
Fri Jun 18 04:50:36 EDT 2004
From what I understand...
- The match table is static (or does not change often)
- Your input data changes every time
(so you don't want to load it into a Postgres database every time)
Your problem:
- read the input data
- for each input row, generate several keys
- look up each key in the match table
- take the appropriate action
My solution:
- read the input data
- for each input row, generate several keys
- save the keys in a file (tempfile) in this format:
(key, record id)
Generate the tempfile in partitions:
generate N tempfiles, each containing one partition of the keys.
- For each tempfile:
- Load it
(choose the number of partitions so that each file fits into about
1/4 of your RAM)
- group by key (dict(key: [record ids]));
again, this fits into RAM.
The record IDs are already in order because they were generated that way.
- sort on the keys
(again fast, because it all fits in RAM; see the Python sketch just after this list)
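Roughly, in Python (the record source, the make_keys() function and the
tab-separated tempfile format are just placeholders for whatever your real
code does):

import os
from collections import defaultdict

N_PARTITIONS = 16   # pick this so one partition fits in about 1/4 of your RAM

def write_key_partitions(records, make_keys, tmpdir):
    # 'records' yields (record_id, record) pairs and make_keys(record)
    # returns the lookup keys for one record; both are placeholders.
    parts = [open(os.path.join(tmpdir, "part_%d.tmp" % i), "w")
             for i in range(N_PARTITIONS)]
    try:
        for record_id, record in records:
            for key in make_keys(record):
                # hash(key) decides which partition this (key, record id) pair lands in
                parts[hash(key) % N_PARTITIONS].write("%s\t%s\n" % (key, record_id))
    finally:
        for f in parts:
            f.close()

def load_partition(path):
    # Load one partition, group record ids by key and return the sorted
    # key list plus the key -> [record ids] dict (ids keep their input order).
    groups = defaultdict(list)
    with open(path) as f:
        for line in f:
            key, record_id = line.rstrip("\n").split("\t")
            groups[key].append(record_id)
    return sorted(groups), groups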
Now you have a sorted set of keys which you want to look up in a match table,
which is ALSO sorted
(you sorted it beforehand, because it doesn't change often).
Now start a key_pointer at the beginning of your key list, and a
match_pointer at the beginning of your match table.
You are basically comparing two sorted lists. This does not involve any
searching at all, just a very simple merge.
Your match table is not partitioned.
If key_pointer.key < match_pointer.key:
    advance key_pointer
    if at the end of your tempfile partition, load the next one and sort it.
elif key_pointer.key > match_pointer.key:
    advance match_pointer
elif key_pointer.key == match_pointer.key:
    you have a match: record the ID (and step past any duplicate match keys).
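In Python, the merge for one partition could look like this (match_table is
assumed to be the whole match table, pre-sorted into a list of
(key, match_id) tuples with duplicates allowed; sorted_keys and groups come
from load_partition() above):

def merge_lookup(sorted_keys, groups, match_table):
    # Walk the sorted keys and the sorted match table in lockstep and
    # return (record_id, match_id) pairs for every key that matches.
    results = []
    m = 0
    n = len(match_table)
    for key in sorted_keys:
        # skip match entries smaller than the current key
        while m < n and match_table[m][0] < key:
            m += 1
        # collect every match entry equal to the current key (duplicates included)
        j = m
        while j < n and match_table[j][0] == key:
            for record_id in groups[key]:
                results.append((record_id, match_table[j][1]))
            j += 1
        # if the key is smaller than match_table[m][0], it simply has no match
    return results

You run this once per tempfile partition, reusing the same sorted match table
each time, so the match table only has to be loaded once.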
What do you think? I have already used this algorithm and it's very powerful.
> Let me specify further. Our actual data is enormous, and tests with
> Postgres and MySQL indicate that the time taken just to build a single
> index is on the order of hours, which is too long. After analysis we
> believe that the important lookup info can be compressed to about 40
> million records of 48 bytes each. Furthermore, we have the flexibility
> to partition this into four somewhat equal parts. Each will hence be
> about 460MB in size. Even with structural overhead I see no reason
> this couldn't fit into memory. This is our match file.
>
> Our process involves taking an input disk file, and traversing it one
> record at a time. This we can sort by the same partition criteria used
> to split the match file (so that each match file can be loaded but
> once). For each input record we build a series of keys and compare
> them to the appropriate match file; thus there are multiple lookups
> per input record. An algorithm then determines which match record we
> wish to pick and we write an ID to the input.
>
> There is more to it than this, but these are the major elements. The
> input table may be millions of records long, but likely will be much
> smaller.
>
> The total processing time will be a sum of:
> time to sort/index input file
> time to traverse input file
> time to load in all parts of the match table
> time to build keys on input records
> time to search match table for each key
> time to write match key ID
> overhead time of routine
>
> A new wrinkle is that the match table may have duplicate keys, so
> storing it as a dictionary is not an option.
>
> The data is alphanumeric.
>
> Assume an arbitrarily powerful computer, since adding a GB of RAM is
> not an issue.
>
> The question I had, for those with experience with large data sets, is
> what structure would best handle this problem?
>
> -- robin
--
Using Opera's revolutionary e-mail client: http://www.opera.com/m2/