[Archiver-dev] UpLib and archiving

Bill Janssen janssen at parc.com
Tue Oct 19 03:13:26 CEST 2010


Barry Warsaw <barry at python.org> wrote:

> On Oct 17, 2010, at 02:01 PM, Bill Janssen wrote:
> 
> >I build the UpLib archive system, at http://uplib.parc.com/.
> >
> >The latest release includes new support for building very large
> >archives.  UpLib has some support for email archiving already, including
> >thread analysis and a built-in IMAP server, but that support needs to be
> >re-worked for efficiency to support large archives.  So I'm thinking
> >about that just now.
>
> Very cool!  The state of the art in open source email archivers has been
> stagnant for years.  I think a huge number of people would like to see a new
> offering, but getting things off the ground has always been too daunting.
> Maybe uplib will be the platform to build a nextgen email archiver on top of.

UpLib is not specifically about email -- it's a general purpose digital
document archiver.  (But most email is not just about email, either.
See for instance
http://www.parc.com/content/attachments/email_habitat_exploration_4360_parc.pdf.)
I pour my email into it, so I've added some features to it to help with
the process of reading and finding email, like the threading code.  It
knows how to parse emails (using the email package) and render them, and
do threading, etc.

> 
> >1.  An email thread analysis library which works on a mixin, say
> >    ThreadableEmail, so that different email packages could use it.
> 
> Which different email packages do you mean?  Is that "different versions of
> the stdlib email package" or something else?

Different email archival implementations, I was thinking.  For instance,
I'm re-writing the thread support in UpLib to handle email thread
forests backed by an SQLite DB.  I'd like to be able to use a library to
deduce the threads, but keep them in my own format.  To implement the
two separate IMAP threading algorithms in RFC 5256, you need "Subject",
"Date", "Message-ID" (the *normalized* Message-ID), "References", and
"In-Reply-To".  So such a library would need an email Message type
which provides these fields in some fashion.

In UpLib, I'd want these to be either subtypes of Document, or perhaps a
separate record type created solely for the purposes of threading.
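
To make the field requirements concrete, here is a minimal Python sketch of
the record such a threading library might consume, built from a stdlib
email Message.  It is not UpLib code: the names ThreadFields and
thread_fields are hypothetical, and the Message-ID "normalization" is
reduced to stripping angle brackets and whitespace, which is a
simplification of what RFC 5256 describes.

```python
import re
from dataclasses import dataclass
from datetime import datetime
from email import message_from_string
from email.message import Message
from email.utils import parsedate_to_datetime
from typing import List, Optional

# Matches one <message-id> token; header folding whitespace is skipped over.
_MSGID_RE = re.compile(r"<([^<>\s]+)>")


def _msgids(value: Optional[str]) -> List[str]:
    """Return the bare message-ids found in a header value, in order."""
    return _MSGID_RE.findall(value or "")


@dataclass
class ThreadFields:
    """Hypothetical record holding only what a threader needs."""
    subject: str
    date: Optional[datetime]
    message_id: Optional[str]
    references: List[str]
    in_reply_to: List[str]


def thread_fields(msg: Message) -> ThreadFields:
    """Extract the five threading fields from a parsed message."""
    ids = _msgids(msg.get("Message-ID"))
    raw_date = msg.get("Date")
    try:
        date = parsedate_to_datetime(raw_date) if raw_date else None
    except (TypeError, ValueError):
        date = None  # unparseable Date header
    return ThreadFields(
        subject=str(msg.get("Subject", "")),
        date=date,
        message_id=ids[0] if ids else None,
        references=_msgids(msg.get("References")),
        in_reply_to=_msgids(msg.get("In-Reply-To")),
    )


# Small usage example with a made-up message.
raw = (
    "Message-ID: <a@example.org>\n"
    "In-Reply-To: <b@example.org>\n"
    "References: <c@example.org>\n <b@example.org>\n"
    "Subject: Re: hello\n"
    "Date: Tue, 19 Oct 2010 03:13:26 +0200\n"
    "\n"
    "body\n"
)
fields = thread_fields(message_from_string(raw))
```

An archiver could keep its own storage format and only hand records of
this shape to the threading code, which is the separation described above.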

> 
> >2.  Support for multipart/related parsing.
> 
> That would be nice.
> 
> >3.  Indexing for search.  UpLib currently indexes email into PyLucene
> >    with the following fields:
> >
> >      date (untokenized)
> >      contents (tokenized -- just the body text, not the headers)
> 
> How does it (or how do you envision it) working with non-text/plain parts?

It's got some code to deal with that.  If the Content-Type is non-text,
it just processes it as it would any other document of that
content-type.  Otherwise, if it's text/* or multipart, the email parsing
code picks one part, either text/plain or text/html (or a series of
such), and makes it the "main" part.  The text extracted from that is
the "contents".

Other parts are typically classified as "attachments", broken off, and
separately indexed.  So text in them is also indexed, but it's indexed
in the role of an attachment to a particular email message.  Attachments
show up as little icons in the visual rendering of the document.
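
The main-part/attachment split described above can be sketched with the
stdlib email package.  This is an illustration under simplifying
assumptions, not UpLib's actual logic: the function name
split_main_and_attachments is hypothetical, only the first inline
text/plain or text/html leaf is taken as the "main" part, and every other
leaf (including an HTML alternative) is treated as an attachment.

```python
from email import message_from_string
from email.message import Message
from typing import List, Optional, Tuple


def split_main_and_attachments(
    msg: Message,
) -> Tuple[Optional[Message], List[Message]]:
    """Pick one text part as the main body; collect all other leaf parts."""
    main: Optional[Message] = None
    attachments: List[Message] = []
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers carry no content of their own
        is_attachment = part.get_content_disposition() == "attachment"
        is_text = part.get_content_type() in ("text/plain", "text/html")
        if main is None and is_text and not is_attachment:
            main = part
        else:
            attachments.append(part)
    return main, attachments


# Small usage example with a made-up two-part message.
raw = (
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/mixed; boundary="XX"\n'
    "\n"
    "--XX\n"
    "Content-Type: text/plain\n"
    "\n"
    "hello\n"
    "--XX\n"
    "Content-Type: application/pdf\n"
    'Content-Disposition: attachment; filename="a.pdf"\n'
    "\n"
    "%PDF-1.4\n"
    "--XX--\n"
)
main, attachments = split_main_and_attachments(message_from_string(raw))
```

Each returned attachment could then be indexed as its own document with a
link back to the message it came from.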

> >      email-message-id (untokenized)
> >      email-guid (untokenized -- a hash of the message-id)
> 
> There's also this: http://wiki.list.org/display/DEV/Stable+URLs
> 
> Stable URLs on archive regeneration is absolutely critical and predictable
> URLs without communication between the MLM and archiver is highly desirable.
> The algorithm is simple, but I don't know how that works with uplib's notion
> of an email message's canonical URL.
> 
> >      email-subject (tokenized)
> >      email-from-name (tokenized, only used if present)
> >      email-from-address (untokenized)
> >      email-attachment-to (untokenized, for attachments, guid of message)
> >      email-thread-index (untokenized, thread ID)
> >      email-references (untokenized, zero or more email-guids)
> >      email-in-reply-to (untokenized, zero or more email-guids)
> >      email-recipient-names (untokenized [should be tokenized])
> >      email-recipients (untokenized -- who the message was sent to)
> >
> >    Attachments are extracted, and indexed separately, with links from the
> >    attachment to the message, and links from the message to its
> >    attachments.  This is a nice feature of UpLib over more specifically
> >    mail-archiving systems -- it can also archive images, Word, PDF, etc.,
> >    and do proper metadata indexing on all of the various types.
>
> And that is *very* cool.  How do you handle security issues, i.e. html parts
> with evil content (javascript) or Content-Disposition filenames that lie about
> their type?

I don't run Javascript, so evil parts get to stick around and
potentially do damage in the future.  UpLib has an extensible system of
data analysis engines called "rippers" which are automatically run on
each document; if you were concerned about the possibility of lingering
malware, a malware-detector ripper could be added to flag and/or remove
such content.  I actually do process Javascript lightly to remove some
irritating pieces, like "capture" redirects.

As for the Content-Disposition filenames: UpLib runs its own
content-type determiner over the content to try to see what it is rather
than just relying on the filename, though it will fall back to the
filename if it can't figure it out.  And I've hardcoded some typical
situations.
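
The "look at the content first, fall back to the filename" approach can be
sketched as follows.  This is an assumption-laden illustration rather than
UpLib's determiner: the function name guess_content_type and the short
signature table are hypothetical, and a real detector would recognize far
more formats.

```python
import mimetypes
from typing import Optional

# A few well-known leading byte signatures (illustrative, not exhaustive).
_SIGNATURES = [
    (b"%PDF-", "application/pdf"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
]


def guess_content_type(data: bytes, filename: Optional[str] = None) -> str:
    """Trust the bytes first; use the claimed filename only as a fallback."""
    for magic, content_type in _SIGNATURES:
        if data.startswith(magic):
            return content_type
    if filename:
        by_name, _encoding = mimetypes.guess_type(filename)
        if by_name:
            return by_name
    return "application/octet-stream"
```

With this ordering, a PDF whose Content-Disposition filename claims to be
"holiday.jpg" is still classified as application/pdf.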

> >    It also tries to leverage Lucene's multi-language support, by
> >    running a language guesser over the text of the email, and selecting
> >    the Lucene Analyzer which most closely matches that language.
> 
> Wow, neat!
> 
> >    So, is this a good list of indexing fields?  Bad list?  Where does
> >    the Dublin Core factor into this?
> 
> It seems like a reasonably good start.  Dunno about Dublin Core.
> 
> >4.  Archive server frameworks.  My IMAP server is currently built on top
> >    of Medusa, like the rest of UpLib.  No one's working on Medusa.
> 
> How hard would it be to slot in Twisted?  Something I've always wanted to see
> was an archiver that supported IMAP, NNTP, and web access.  Twisted seems like
> the obvious choice.

I've got the Web access and the IMAP support, but not NNTP -- never had
the need for it.  Twisted seems to have support going forward that
Medusa no longer has, and at some point I plan to port UpLib from Medusa
to Twisted.

> What about plugins?  There are a few areas that come to mind about things I'd
> like to have pluggable:
> 
> * Content hyperlinks.  Let's say you've got an archive for a -commits list.
>   It would be nice to be able to dig out things like bug numbers and vcs
>   revisions and hyperlink them to tracker or viewvcs pages.

UpLib automatically recognizes and stores hyperlinks found in Word, PDF,
Powerpoint, etc. as part of the standard metadata extraction process.
There's also a ripper which recognizes URLs and stores them as links.

In house, we have some support for entity-finding: person or corporation
or location names, dates, etc.  They are also automatically turned into
links, and show up as hyperlinks in the Web and Java UI tools.  What
you're suggesting is more of that.

> * Take-down support.  If a list admin wants to remove a posting, she should be
>   able to do that without disrupting email threads or breaking URLs.  One way
>   I've thought about doing that is a dynamic rendering plugin that checked the
>   to-be-displayed message against a blacklist, and if there's a hit, it would
>   substitute the body of the message with something like "Content unavailable
>   due to take-down notice.  Contact postmaster at python.org for detail."

Yeah, kind of provide a stand-in for the real message.  UpLib re-threads
if you modify the corpus, so removal is automatic.  It also includes a
capability to "replace" the content of an existing document, which
sounds like what you'd want for the above.
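
A render-time stand-in of the kind discussed above might look like the
following minimal sketch.  It is not UpLib's "replace" capability: the
function name render_message, the plain dict of headers, and the notice
text are all assumptions made for illustration.  The headers are left
untouched, so threading fields and URLs derived from the Message-ID are
unaffected.

```python
from typing import Dict, Set

TAKEDOWN_NOTICE = "Content unavailable due to take-down notice."


def render_message(
    headers: Dict[str, str], body: str, taken_down: Set[str]
) -> str:
    """Render headers plus body, substituting a stand-in for blocked ids."""
    if headers.get("Message-ID") in taken_down:
        body = TAKEDOWN_NOTICE  # only the displayed body changes
    head = "\n".join(f"{name}: {value}" for name, value in headers.items())
    return f"{head}\n\n{body}"
```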

> * Email address obfuscation.  Obviously we'd want to support that, but using
>   what algorithm?  xxx'ing out the domain?  Using a central forwarding
>   service?  How do we recognize email addresses?

I don't obfuscate anything, really.  But this is an issue for a public
Web UI design, I think.

> * Send-me-this-message button.  I do a Google search and find a message in an
>   archive from 4 years ago.  It's relevant to the problem I'm now having and
>   I'd like to respond to it in my normal email reader.  Maybe IMAP/NNTP is the
>   right way to go, or there could be a button to allow the user to forward the
>   message to herself.

Nice idea.  I've got an extension which (sort of) supports this (you can
email a copy of any document to anyone), and the user can define new
buttons to add to the UI in her config file.  I, for instance, added a
button which shows me all the email threads which have been updated
today:

Today's Mail, /action/basic/email_threads?query=date:today+$email, _blank

Of course, the normal UpLib Web UI puts appropriate "mailto:" links
around people's names, and adds "Reply-To" and "Reply-To-All" links to
the message.  Just click on that and it opens up in your MUA.

> Interested to hear your thoughts.  This would be a cool project to work on,
> and maybe we should also engage mailman-developers.  Thanks for releasing it
> under the GPL[*].

Well, there's lots to do :-).  The current IMAP server, for instance, is
more about getting the IMAP protocol right than about efficiency.  When
you go into python-dev-sized archives without breaking them into chunks
(like the per-month view in Mailman archives), it poops out.  Shouldn't
do that.

My normal development process is to write any new code as an UpLib
extension, then if it works I eventually fold it into the codebase.
Extensions are easy to add (just plunk them in a directory, and point
the repository at that directory), and there are a number of examples
included with the source code.  The IMAP server is an extension, for
instance.

Bill

> 
> -Barry
> 
> [*] While GPLv2 would be incompatible with Mailman 3's GPLv3, I don't think it
> matters.  The two systems will be lightly connected, though we would have to
> think about the integration points.  MM3 has a plugin system for archivers.

