Re: [Mailman-Developers] Requirements for a new archiver

On Wed, 29 Oct 2003 13:59:06 -0800 Peter C Norton <spacey-mailman@lenin.nu> wrote:
On Wed, Oct 29, 2003 at 10:14:52PM +0100, Brad Knowles wrote:
I don't believe that there are any databases in existence that "... can be made more suitable to the problem."
In theory you can add data types to postgresql. Not that I've done it myself, but its been done.
True, but that doesn't answer the question of whether an RDBMS is a good storage tool for messages. I spent a couple months of spare time last year building an archiving system I liked atop PostgresQL using fully decomposed SQL structures for all the message bits. It was not a pretty exercise, and the results were worse. Brad makes excellent points in his comments on poor BLOB support, the value if DBs for meta-data, and disaster recovery ease.
--
J C Lawrence
---------(*) Satan, oscillate my metallic sonatas.
claw@kanga.nu He lived as a devil, eh?
http://www.kanga.nu/~claw/ Evil is a name of a foeman, as I live.

On Wed, Oct 29, 2003 at 05:10:40PM -0500, J C Lawrence wrote:
True, but that doesn't answer the question of whether an RDBMS is a good storage tool for messages. I spent a couple months of spare time last year building an archiving system I liked atop PostgresQL using fully decomposed SQL structures for all the message bits. It was not a pretty exercise, and the results were worse. Brad makes excellent points in his comments on poor BLOB support, the value if DBs for meta-data, and disaster recovery ease.
I may not have made it clear, but I'm focusing on the metadata. Once you've parsed rfc822/2822, then it may become easier to have things in the database that can manipulate those types. I.e. to do be able to do simple searches for a property of given arbitrary headers (w/o having to have a database schema that consists of a few known headers and "others" which you then have to treat as a blob or as text).
-Peter
-- The 5 year plan: In five years we'll make up another plan. Or just re-use this one.

On Oct 29, 2003, at 2:28 PM, Peter C. Norton wrote:
I may not have made it clear, but I'm focusing on the metadata. Once you've parsed rfc822/2822, then it may become easier to have things in the database that can manipulate those types. I.e. to do be able to do simple searches for a property of given arbitrary headers (w/o having to have a database schema that consists of a few known headers and "others" which you then have to treat as a blob or as text).
my only real worry is that from what I've seen, 99.99% of the time, the user is going to want content searches. header stuff is fine, but of really low priority in the scheme of things (necessary to put useful things together, meaningless if you can't content/context search in fulltext).
that's why I'm leaning, blob issues or no, towards full-text storage in MySQL 4. Because if you can't easily chop up the message body content and find the messages you want to deal with, elegant storage of the headers is irrelevant...
I think you need that, too. But until you get a reasonable context search for the message body, designing the rest is silly. And it seems to me there are few better methods than dumping the text into MySQL and letting it do the work. Compromises, tradeoffs and etc notwithstanding...

by the way, this statement is in conflict with my previous statemenet of "use cycbufs". I'm fully aware of that conflict, too. resolving it will be one of the big challenges.
On Oct 29, 2003, at 4:12 PM, Chuq Von Rospach wrote:
that's why I'm leaning, blob issues or no, towards full-text storage in MySQL 4. Because if you can't easily chop up the message body content and find the messages you want to deal with, elegant storage of the headers is irrelevant...

At 4:12 PM -0800 2003/10/29, Chuq Von Rospach wrote:
that's why I'm leaning, blob issues or no, towards full-text storage in MySQL 4. Because if you can't easily chop up the message body content and find the messages you want to deal with, elegant storage of the headers is irrelevant...
I think you could do full word indexing per message, and then
store that index information in the database. Searching for phrases would require hitting the message bodies themselves, but searching for individual words could be done on indexed fields.
-- Brad Knowles, <brad.knowles@skynet.be>
"They that can give up essential liberty to obtain a little temporary safety deserve neither liberty nor safety." -Benjamin Franklin, Historical Review of Pennsylvania.
GCS/IT d+(-) s:+(++)>: a C++(+++)$ UMBSHI++++$ P+>++ L+ !E-(---) W+++(--) N+ !w--- O- M++ V PS++(+++) PE- Y+(++) PGP>+++ t+(+++) 5++(+++) X++(+++) R+(+++) tv+(+++) b+(++++) DI+(++++) D+(++) G+(++++) e++>++++ h--- r---(+++)* z(+++)

On Oct 29, 2003, at 5:52 PM, Brad Knowles wrote:
I think you could do full word indexing per message, and then store that index information in the database. Searching for phrases would require hitting the message bodies themselves, but searching for individual words could be done on indexed fields.
you could, but is it worth doing it yourself when MySQL is building it for you?
http://www.mysql.com/doc/en/Fulltext_Search.html
http://jeremy.zawodny.com/blog/archives/000576.html
http://www.zend.com/zend/tut/tutorial-ferrara1.php
If you were just storing into a TEXT and then doing SELECT LIKE into it, I'd agree with you. But MySQL is doing interesting things here. Why not leverage it?

At 7:00 PM -0800 2003/10/29, Chuq Von Rospach wrote:
you could, but is it worth doing it yourself when MySQL is building it for you?
From the top of this page:
6.8 MySQL Full-text Search
As of Version 3.23.23, MySQL has support for full-text indexing and searching. Full-text indexes in MySQL are an index of type FULLTEXT. FULLTEXT indexes are used with MyISAM tables only and can be created from CHAR, VARCHAR, or TEXT columns at CREATE TABLE time or added later with ALTER TABLE or CREATE INDEX. For large datasets, it will be much faster to load your data into a table that has no FULLTEXT index, then create the index with ALTER TABLE (or CREATE INDEX). Loading data into a table that already has a FULLTEXT index could be significantly slower.
Moreover, mail messages will be a undetermined variable length.
Can MySQL support a 32-bit VARCHAR? What about type TEXT? Or 8-bit or even 16-bit character sets? Since you might be storing a lot of MIME bodypart types, can it handle BLOBs, and can it handle them well? Or, do you do parsing within your archive application and store the entire message somewhere outside of the database, while storing a FULLTEXT index of only the bodypart types you declare to be human-readable?
What if you want to do a case-sensitive search? In that case, it
doesn't look like FULLTEXT or MATCH will do you any good, since MATCH is declared to be case-insensitive. Or what if you want to search for hyphenated literals? It seems that MATCH considers them to be word breaks even within literal searches.
If you were just storing into a TEXT and then doing SELECT LIKE into it, I'd agree with you. But MySQL is doing interesting things here. Why not leverage it?
I'm not sure it really helps in this case. I'm not sure it can
handle the amounts of data that might need to be stored into a field, or the different character sets that might need to be used. I'm also concerned about what using this function might do to the overall speed and size of the database.
On the page quoted above, look for benchmark data reported by Jim
Nguyen and John Takacs. Two million rows with text and multiple word searches (three or more) taking 30-seconds to a minute to complete, is not good performance. Three to five million rows, with searches taking 50 seconds or more for single words, is not good performance.
Now, consider how many words might be in a single message
(hundreds to thousands or even tens of thousands), and how many messages might be in a single archive (thousands to millions). If each message was contained within a row, this would be dead-Universe slow.
-- Brad Knowles, <brad.knowles@skynet.be>
"They that can give up essential liberty to obtain a little temporary safety deserve neither liberty nor safety." -Benjamin Franklin, Historical Review of Pennsylvania.
GCS/IT d+(-) s:+(++)>: a C++(+++)$ UMBSHI++++$ P+>++ L+ !E-(---) W+++(--) N+ !w--- O- M++ V PS++(+++) PE- Y+(++) PGP>+++ t+(+++) 5++(+++) X++(+++) R+(+++) tv+(+++) b+(++++) DI+(++++) D+(++) G+(++++) e++>++++ h--- r---(+++)* z(+++)
participants (4)
-
Brad Knowles
-
Chuq Von Rospach
-
J C Lawrence
-
Peter C. Norton