very large dictionaries

Pierre-Frédéric Caillaud peufeu at
Fri Jun 18 10:50:36 CEST 2004

	From what I understand...

	- The match table is static (or does not change often)
	- Your input data changes every time
		(so you don't want to load it in a postgres database everytime)

	Your problem :
	- read the input data.
	- for each input row, generate several keys
	- lookup each key in the match table
	- take appropriate action

	My solution :
	- read the input data.
	- for each input row, generate several keys
	- save the keys in tempfiles in this format :
		(key, record id)
		Generate the tempfiles in partitions :
		generate N tempfiles, each containing one partition of the keys.
	- For each tempfile :
		- Load it
			(choose the number of partitions such that each file fits into about  
1/4 of your RAM.)
		- group by keys (dict( key: [record ids] ))
			again it fits into RAM.
			Record IDs are already sorted because they were generated this way.
		- sort on keys
			(again, fast because it fits in RAM)
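	The partition / group / sort steps above could be sketched like this  
(a sketch only; names like partition_keys and make_keys are illustrative,  
not from the original post) :

```python
import tempfile

def partition_keys(rows, make_keys, n_partitions):
    """Write (key, record_id) pairs into N tempfiles, split by hash(key),
    so every occurrence of a given key lands in the same partition."""
    files = [tempfile.TemporaryFile(mode="w+") for _ in range(n_partitions)]
    for record_id, row in enumerate(rows):
        for key in make_keys(row):
            files[hash(key) % n_partitions].write("%s\t%d\n" % (key, record_id))
    return files

def load_partition(f):
    """Load one partition, group record ids by key, and return the groups
    sorted on key.  Record ids inside a group are already in generation
    order, so no extra sort is needed there."""
    f.seek(0)
    groups = {}
    for line in f:
        key, record_id = line.rstrip("\n").split("\t")
        groups.setdefault(key, []).append(int(record_id))
    return sorted(groups.items())
```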

	Now you have a sorted keyset which you want to look up in a match table  
which is ALSO sorted
	(you sorted it beforehand because it doesn't change often).

	Now start a key_pointer at the beginning of your key list, and a  
match_pointer at the beginning of your match table.
	You are basically merging two sorted lists. This involves no searching,  
just a very simple linear scan.
	Your match table is not partitioned.

	If key_pointer.key < match_pointer.key:
		advance key_pointer
		if at end of your tempfile partition, load the next one and sort it.
	elif key_pointer.key > match_pointer.key:
		advance match_pointer
	elif key_pointer.key == match_pointer.key:
		you have a match : emit the record id.
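	The whole two-pointer merge might look like this (a sketch with  
illustrative names ; it also handles duplicate keys in the match table) :

```python
def merge_join(keys, match):
    """keys:  sorted list of (key, [record_ids]) from one partition.
    match: sorted list of (key, match_id), duplicate keys allowed.
    Yields (record_id, match_id) for every matching pair."""
    ki = mi = 0
    while ki < len(keys) and mi < len(match):
        k, record_ids = keys[ki]
        m, match_id = match[mi]
        if k < m:
            ki += 1          # key absent from match table: advance key pointer
        elif k > m:
            mi += 1          # match entry too small: advance match pointer
        else:
            # Same key: pair every record id with every duplicate match entry,
            # then move both pointers past this key.
            j = mi
            while j < len(match) and match[j][0] == k:
                for rid in record_ids:
                    yield rid, match[j][1]
                j += 1
            ki += 1
            mi = j
```

	When a key-list partition is exhausted, you load and sort the next  
tempfile and keep the match_pointer where it is (the match table is not  
partitioned).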

	What do you think ? I've used this algorithm before and it's very powerful.


> Let me specify further. Our actual data is enormous, and tests with
> Postgres and MySQL indicate that the time taken just to build a single
> index is on the order of hours, which is too long. After analysis we
> believe that the important lookup info can be compressed to about 40
> million records of 48 bytes each. Furthermore, we have the flexibility
> to partition this into four somewhat equal parts. Each will hence be
> about 460MB in size. Even with structural overhead I see no reason
> this couldn't fit into memory. This is our match file.
> Our process involves taking an input disk file, and traversing it one
> record at a time. This we can sort by the same partition criteria used
> to split the match file (so that each match file can be loaded but
> once). For each input record we build a series of keys and compare
> them to the appropriate match file; thus there are multiple lookups
> per input record. An algorithm then determines which match record we
> wish to pick and we write an ID to the input.
> There is more to it than this, but these are the major elements. The
> input table may be millions of records long, but likely will be much
> smaller.
> The total processing time will be a sum of:
> 	time to sort/index input file
> 	time to traverse input file
> 	time to load in all parts of the match table
> 	time to build keys on input records
> 	time to search match table for each key
> 	time to write match key ID
> 	overhead time of routine
> A new wrinkle is that the match table may have duplicate keys, so
> storing it as a dictionary is not an option.
> The data is alphanumeric.
> Assume an arbitrarily powerful computer, since adding a GB of RAM is
> not an issue.
> The question I had, for those with experience with large data sets, is
> what structure would best handle this problem?
> -- robin

