Caching Strategies/Database question
Jerry Gitomer
jgitomer at erols.com
Thu Apr 26 20:03:50 EDT 2001
Brian Kelley wrote:
> I am using a dictionary to hold
>
> {string:[list of integers]}
>
> key: values
>
> The dictionary is randomly accessed to add an integer to the
> list referenced by the string.
>
> I have been exploring several different caching strategies,
> ranging from bsddb3 to the metakit toolkit, for caching and
> retrieving these to disk. One of the problems I am facing is
> that, because the dictionary is randomly accessed, performance
> goes WAY down with every strategy I have tried once the
> database gets to a certain size. (Note that integers are never
> removed, if that helps.) I have had success speed-wise with
> MySQL, but I simply cannot deliver that in my application; it
> is way too costly to administer.
>
> Has anyone examined a similar problem? Thanks in advance.
>
> Brian Kelley
Here I go, shooting from the hip without knowing all of the
facts. (To give a proper answer I really should know how large
each key/pointer pair is, how many records the database will
contain, and at what point you hit the performance wall.)
Assuming that data access is truly random, the least expensive
way of storing and accessing records (assuming you are going to
roll your own rather than go with an existing product) is to
add all new records at the end of the file, mark deleted
records rather than removing them, and build indexes to access
the records.
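For instance, here is a minimal sketch of such an append-only
data file (the RecordFile class, the record layout, and the
flag bytes are my own illustration, not anything Brian
specified):

    import os
    import struct

    LIVE, DELETED = b"L", b"D"

    class RecordFile:
        """Append-only data file; deletes are marked, never removed."""

        def __init__(self, path):
            # open read/write, creating the file if it does not exist
            self.f = open(path, "r+b" if os.path.exists(path) else "w+b")

        def append(self, payload):
            """Add one record at the end of the file; return its offset."""
            self.f.seek(0, 2)
            offset = self.f.tell()
            # record = 1-byte flag + 4-byte length + payload bytes
            self.f.write(LIVE + struct.pack(">I", len(payload)) + payload)
            return offset

        def read(self, offset):
            """Fetch the record stored at a known byte offset."""
            self.f.seek(offset)
            flag = self.f.read(1)
            (length,) = struct.unpack(">I", self.f.read(4))
            payload = self.f.read(length)
            return None if flag == DELETED else payload

        def delete(self, offset):
            """Mark the record deleted in place; space is not reclaimed."""
            self.f.seek(offset)
            self.f.write(DELETED)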
The trick is building and maintaining the indexes. The
solution is to build a multi-level index structure that is not
fully populated.
Each index entry will consist of three fields: the unique key
for the record, the byte offset of the record in the file, and
a one-byte indicator telling whether the entry addresses a
lower-level index block or a data record. The data and each
level of the index should be in separate files.
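A fixed-width layout for such an entry might look like this
(the 27-byte key field and the indicator values are my
assumptions; only the three fields, and the 32-byte total
below, come from the description):

    import struct

    # 27-byte key + 4-byte offset + 1-byte indicator = 32 bytes
    ENTRY = struct.Struct(">27sIc")
    TO_INDEX, TO_DATA = b"I", b"D"

    def pack_entry(key, ptr, kind):
        """Serialize one entry; keys are null-padded to 27 bytes."""
        return ENTRY.pack(key.encode()[:27], ptr, kind)

    def unpack_entry(raw):
        key, ptr, kind = ENTRY.unpack(raw)
        return key.rstrip(b"\0").decode(), ptr, kind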
Assume that each such entry is 32 bytes long and that your OS
block size is 8K. Rather than fully populating each block with
256 entries, populate each index block with only 250 entries.
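In other words:

    BLOCK_SIZE = 8192                    # OS block size
    ENTRY_SIZE = 32                      # fixed entry width from above
    CAPACITY = BLOCK_SIZE // ENTRY_SIZE  # 256 entries fit per block
    FILL = 250                           # leave six slots free per block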
To build the index, create a sorted list for the entire
database and then populate the bottom level of the index.
Next, build a sorted list consisting of the first entry in each
block of the lower level. (Note that in this, and in any
higher-level index block, the indicator must be set to show
that the entry points to an index block and not a data block.)
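One way to sketch that build step (the function name and the
in-memory block representation, a list of (key, ptr, kind)
tuples per block, are mine):

    TO_INDEX, TO_DATA = b"I", b"D"   # indicator values, as above

    def build_level(entries, kind, fill=250):
        """Group sorted (key, ptr) pairs into blocks of `fill` entries.
        Returns the blocks plus the (first key, block number) pairs
        to promote into the next level up."""
        blocks, promoted = [], []
        for i in range(0, len(entries), fill):
            block = [(key, ptr, kind) for key, ptr in entries[i:i + fill]]
            promoted.append((block[0][0], len(blocks)))
            blocks.append(block)
        return blocks, promoted

    # bottom level points at data records; higher levels at index blocks
    # level0, up = build_level(sorted_key_offset_pairs, TO_DATA)
    # level1, _  = build_level(up, TO_INDEX)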
Build the second-level index. At this point consider the
following: each block in the second-level index will point to
250 blocks in the first-level index, each of which will in turn
point to 250 data records. If the second-level index is held
in memory, each second-level block will let you reach any of
62,500 records with no more than two disk accesses. If a
third-level index is needed, you will be able to access any one
of 15,625,000 records with no more than three disk accesses.
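A lookup then reads one block per level. Here is a sketch over
the in-memory representation above; in the real file-based
version, each step below the memory-resident top level would be
one disk read:

    from bisect import bisect_right

    TO_INDEX, TO_DATA = b"I", b"D"

    def lookup(levels, key):
        """levels[0] is the top index level (one block here),
        levels[-1] the bottom.  For kind TO_INDEX, ptr is a block
        number in the next level down; for TO_DATA it is a
        data-file byte offset."""
        block = levels[0][0]
        depth = 0
        while True:
            keys = [e[0] for e in block]
            i = bisect_right(keys, key) - 1   # rightmost key <= search key
            if i < 0:
                return None                   # key sorts before everything
            _, ptr, kind = block[i]
            if kind == TO_DATA:
                return ptr                    # caller verifies the key there
            depth += 1
            block = levels[depth][ptr]        # one "disk access" per level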
The reason for not fully populating the database and the
indexes is to allow for expansion without a total
reorganization. The catch is that if you need to add seven
records to an index block, you must split it: create a new
index block and move half of the entries from the full block
into the new one. In general this should only happen at the
first level, but you must still make provision for either
splitting or a total index reorganization.
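In sketch form, again over the in-memory blocks from above:

    from bisect import insort

    def insert_entry(block, entry, capacity=256):
        """Insert a (key, ptr, kind) entry in key order.  On
        overflow, split: keep the lower half here and return the
        new upper-half block so the caller can register its first
        key one level up."""
        insort(block, entry)           # tuples compare by key first
        if len(block) <= capacity:
            return None                # still fits, no split
        mid = len(block) // 2
        new_block = block[mid:]        # upper half moves to a new block
        del block[mid:]
        return new_block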
Hope this helps. Also, if you do implement it please
contribute the code.
--
Jerry Gitomer
Once I learned how to spell DBA, I became one