[Archiver-dev] UpLib and archiving
janssen at parc.com
Sun Oct 17 23:01:12 CEST 2010
Just noticed this list, and thought I'd sign up.
I build the UpLib archive system, at http://uplib.parc.com/.
The latest release includes new support for building very large
archives. UpLib has some support for email archiving already, including
thread analysis and a built-in IMAP server, but that support needs to be
re-worked for efficiency to support large archives. So I'm thinking
about that just now.
1. An email thread analysis library which works on a mixin, say
ThreadableEmail, so that different email packages could use it.
2. Support for multipart/related parsing.
3. Indexing for search. UpLib currently indexes email into PyLucene
with the following fields:
    contents (tokenized -- just the body text, not the headers)
    email-guid (untokenized -- a hash of the message-id)
    email-from-name (tokenized, only used if present)
    email-attachment-to (untokenized, for attachments, guid of message)
    email-thread-index (untokenized, thread ID)
    email-references (untokenized, zero or more email-guids)
    email-in-reply-to (untokenized, zero or more email-guids)
    email-recipient-names (untokenized [should be tokenized])
    email-recipients (untokenized -- who the message was sent to)
Attachments are extracted, and indexed separately, with links from the
attachment to the message, and links from the message to its
attachments. This is a nice feature of UpLib compared with systems built
specifically for mail archiving -- it can also archive images, Word, PDF,
etc., and do proper metadata indexing on all of the various types.
It also tries to leverage Lucene's multi-language support, by
running a language guesser over the text of the email, and selecting
the Lucene Analyzer which most closely matches that language.
So, is this a good list of indexing fields? Bad list? Where does
the Dublin Core factor into this?
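To make the field list in item 3 concrete, here is a rough sketch of deriving most of those fields from a raw message with Python's standard email package. It is not UpLib's code. The function names (guid, msg_ids, index_fields) are made up for illustration, SHA-1 is an assumption (the post only says "a hash of the message-id"), and reading recipients from To and Cc is also an assumption. email-thread-index and email-attachment-to are left out because they depend on thread analysis and attachment extraction.

```python
import hashlib
import re
from email import message_from_string
from email.utils import getaddresses

# Matches message IDs of the form <local@domain> inside a header value.
_MSGID = re.compile(r"<[^<>\s]+>")


def msg_ids(header_value):
    """Return every <...> message ID found in a header value."""
    return _MSGID.findall(header_value or "")


def guid(message_id):
    # Assumption: SHA-1 hex digest; the post does not say which hash is used.
    return hashlib.sha1(message_id.strip().encode("utf-8")).hexdigest()


def index_fields(raw):
    """Build a dict of index fields (names from the list above) for one message."""
    msg = message_from_string(raw)
    senders = getaddresses(msg.get_all("From", []))
    # Assumption: "recipients" means the To and Cc headers.
    recipients = getaddresses(msg.get_all("To", []) + msg.get_all("Cc", []))

    # contents: the first text/plain part only, never the headers.
    body = ""
    for part in msg.walk():
        if part.get_content_type() == "text/plain":
            payload = part.get_payload(decode=True) or b""
            body = payload.decode(part.get_content_charset() or "utf-8", "replace")
            break

    own = msg_ids(msg.get("Message-ID", ""))
    fields = {
        "contents": body,
        "email-guid": guid(own[0]) if own else None,
        "email-references": [guid(m) for m in msg_ids(msg.get("References", ""))],
        "email-in-reply-to": [guid(m) for m in msg_ids(msg.get("In-Reply-To", ""))],
        "email-recipient-names": [name for name, _ in recipients if name],
        "email-recipients": [addr for _, addr in recipients if addr],
    }
    # email-from-name is only used if present.
    sender_names = [name for name, _ in senders if name]
    if sender_names:
        fields["email-from-name"] = sender_names[0]
    return fields
```

Hashing the reference headers with the same function as the message's own ID means a reply's email-in-reply-to value equals the parent's email-guid, so an exact-match (untokenized) lookup is enough to follow a link.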
4. Archive server frameworks. My IMAP server is currently built on top
of Medusa, like the rest of UpLib. No one's working on Medusa.
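
As a rough illustration of item 1, a mixin-based thread library might look something like the sketch below. This is not UpLib's implementation, and it is not the full JWZ threading algorithm: it only groups messages that are connected through their In-Reply-To/References IDs, using a small union-find. The method names thread_message_id and thread_parent_ids and the function group_threads are hypothetical.

```python
class ThreadableEmail:
    """Mixin: any email class can take part in threading by supplying
    these two accessors (hypothetical names, not an existing API)."""

    def thread_message_id(self):
        raise NotImplementedError

    def thread_parent_ids(self):
        """Message IDs from In-Reply-To and References, in any order."""
        raise NotImplementedError


def group_threads(messages):
    """Group messages into threads: two messages share a thread when they
    are connected, directly or indirectly, by a reference."""
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for m in messages:
        mid = m.thread_message_id()
        find(mid)  # register messages that reference nothing
        for ref in m.thread_parent_ids():
            union(mid, ref)  # ref may be a message missing from the archive

    threads = {}
    for m in messages:
        threads.setdefault(find(m.thread_message_id()), []).append(m)
    return list(threads.values())
```

Because the grouping only calls the two mixin methods, a different email package would just need a thin subclass or adapter that returns its own message IDs. Referenced messages that are missing from the archive still link their replies together.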