[DB-SIG] Q: database integrity, flush and corrupted files....

Fri, 03 Jan 2003 22:24:42 +0100

Michael Scharf wrote:
> Hi Marc-Andre
> 
> M.-A. Lemburg wrote:
>  > You do know about mxBeeBase, don't you?
> 
> Yes, but I forgot that it exists. I just tried it and it seems
> to solve my problem. (I was about to implement Extendible Hashing
> but the BeeBase B+Tree is fast enough for my app).

Be sure to use the version in egenix-mx-base 2.1 beta5 or
later; the one in 2.0.4 has some bugs which are fixed in that
release.

> I looked at BeeDict. The comments seem to indicate
> that an index is not a reliable data structure. My data
> is actually a 16-byte string (MD5 checksum) and an
> integer. Would you recommend storing the data in a flat
> file as well, in case the index gets corrupted?

mxBeeBase is a toolkit, so you can build your own style
of storage on top of it. E.g. the BeeDict implementation
uses a data file and an index for fast access. If the index
gets corrupted, there is built-in support for automatic
recreation of the index.
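
The data-file-plus-index idea can be sketched without mxBeeBase.
The following is only an illustration, not the BeeDict code: it
assumes fixed-size records of a 16-byte MD5 digest plus a 4-byte
unsigned integer, and the names RECORD, append_record and
rebuild_index are made up for this example. The "index" is a plain
dict that can be rebuilt at any time by scanning the flat file.

```python
import struct

# Assumed record layout: 16-byte MD5 digest + 4-byte big-endian unsigned int.
RECORD = struct.Struct(">16sI")


def append_record(path, digest, value):
    """Append one fixed-size record to the flat data file."""
    with open(path, "ab") as f:
        f.write(RECORD.pack(digest, value))


def rebuild_index(path):
    """Scan the data file and return {digest: value}; later records win."""
    index = {}
    with open(path, "rb") as f:
        while True:
            chunk = f.read(RECORD.size)
            if len(chunk) < RECORD.size:
                break  # end of file, or a truncated trailing record
            digest, value = RECORD.unpack(chunk)
            index[digest] = value
    return index
```

Because every record has the same size, a partially written record
at the end of the file is simply skipped during the scan.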

The system also supports locking, so it is safe to
use a single mxBeeBase DB from multiple processes.

> With a simple storage system like BeeStorage, it seems possible
> to recover after a failure (without too much data loss).
> Is there a recovery procedure for an index?

Yes. Please see the source code for details. It is documented
in great detail... the written docs still need some work.

> I have another (little) problem. The BeeBase index complains
> about strings containing null bytes. What is the best way to
> 'quote' the null bytes?

Good question. I've never had that problem... why would you
want to have NULLs in strings that you use for an index ?
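
A raw 16-byte MD5 digest, as in the data described above, can
contain null bytes. One possible workaround, shown here as a sketch
and not as an mxBeeBase feature, is to hex-encode the digest before
using it as a key. The helper names quote_key, unquote_key and
md5_key are made up for this example.

```python
import binascii
import hashlib


def quote_key(raw):
    """Hex-encode raw bytes: the result is ASCII and has no null bytes."""
    return binascii.hexlify(raw)


def unquote_key(key):
    """Reverse quote_key and return the original raw bytes."""
    return binascii.unhexlify(key)


def md5_key(data):
    """Return a 32-byte hex key for the MD5 digest of data."""
    return quote_key(hashlib.md5(data).digest())
```

The encoded key is twice as long as the raw digest (32 bytes instead
of 16), and lowercase hex keeps the byte-wise sort order of the raw
keys.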

> What is the meaning of the keysize? What happens if keys are longer?

That's just for managing the B*Tree data structure. Keys can be
up to 158 bytes. Longer keys slow down lookups though. If you don't
need sorted lookups, hashing is faster.

> What is the meaning of sectorsize? (I used 1024 (the maximum!?) instead
> of 256, and the database of 2 million records gets smaller and is
> constructed faster.)

The sector size should be chosen to match the physical buffer
size of the IO device where you keep the index. 1024 is the
maximum value... a larger sectorsize creates a larger index
file, but nowadays it's probably OK to use 1024 bytes for
the sectorsize.

>  > No; flushing has to be done explicitly using f.flush() for this.
> 
> Hmm, but if I write a gigabyte file on a 32 MB computer, the system
> has to write out something at some point.

Sure; I thought you were talking about "what can I do to
be sure that the data gets written to the disk" and that's
what .flush() does.
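
To be precise about the layers involved: in Python, f.flush() hands
the file object's buffer to the operating system, and os.fsync()
then asks the OS to write its own cache out to the device. A minimal
sketch, with write_durably being a made-up helper name:

```python
import os


def write_durably(path, data):
    """Write data and ask the OS to push it to the storage device."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()             # Python's buffer -> operating system
        os.fsync(f.fileno())  # OS cache -> device, as far as the OS can tell
```

Whether the data has really reached the platter after os.fsync()
still depends on the drive and the file system settings.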

> Ok, it might be that this
> is 'permanent' and the os writes it to the disk in a hidden mode,
> but it becomes 'permanent' with flush. But what happens if there
> is a power failure during the flush?

Bad thing :-(   File/disk corruption is likely.
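
For small files that are rewritten as a whole, a common way to limit
the damage is to write a temporary file first and then rename it over
the old one. This is a general technique, not something mxBeeBase is
known to do, and it does not help with in-place updates of a large
database file. The name atomic_write is made up for this sketch, and
os.replace() requires Python 3.3 or later.

```python
import os
import tempfile


def atomic_write(path, data):
    """Replace path with data so readers see the old or the new file."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file in the same directory so the rename
    # stays on one file system.
    fd, tmp = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, path)  # swap the new file in under the final name
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise
```

If the power fails before the rename, the old file is still there;
the leftover temporary file can be deleted on the next start.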

>  > Flushing usually refers to writing line buffers out to the
>  > disk. Whether the file system does write-through or not
>  > depends on the settings of the file system. It also depends
>  > on how you open the file, ie. buffered or not.
> 
> OK, I'll test BeeBase a bit more and I might not be interested in
> those questions anymore ;-)

Great :-)

-- 
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
_______________________________________________________________________
eGenix.com -- Makers of the Python mx Extensions: mxDateTime,mxODBC,...
Python Consulting:                               http://www.egenix.com/
Python Software:                    http://www.egenix.com/files/python/