[Mailman-Developers] GSoC 15 - Interested in contributing to Hyperkitty

Wed Mar 25 18:02:01 CET 2015

Thanks for your feedback, Aurelien.

> we'll need something like a task queue and a daemon process or a cron job

In my proposal I suggested using one of several asynchronous job queue
libraries, such as Celery or Huey. These can all use Redis as a back-end.
Because I have no experience with asynchronous job queues, I'm not sure if
this is too much baggage for our purposes. Maybe we just don't want the
extra dependencies. Regarding cron jobs, there's also django-background-task
<https://github.com/lilspikey/django-background-task>, which is a simple
Django add-on that might do what we need. Again, if we don't want/need the
extra dependency, rolling our own cron job should be fairly
straightforward.

> If we choose to pre-build the mbox files, we can't simply have them
> served through the webserver, because some lists are private

Then there is also an authentication step? I noticed on the test server
that I can't actually look at any of the mailing lists because they're all
private. So we should be able to use pre-existing code for this step?

> with possible attachments, we may be creating hundreds of megabytes or
> maybe gigabytes of data

When we create the mbox file, do we simply note that an attachment existed
(e.g. "Attachment: myattachment.txt") or do we actually put the attachment
in the mbox? AFAIK mbox is a plaintext format, so if the latter is the case
then I'm not exactly sure how this would work...
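
To partly answer my own question, here is a minimal sketch of how the second
option could work, assuming we use Python's standard mailbox and email
modules (HyperKitty's real storage code may look quite different, and the
addresses and file names below are made up). The attachment is stored as a
base64-encoded MIME part, so the mbox file itself stays plain text:

```python
import mailbox
import os
import tempfile
from email.message import EmailMessage

# Build a message with a small binary attachment (all values are made up).
msg = EmailMessage()
msg["From"] = "alice@example.com"
msg["To"] = "list@example.com"
msg["Subject"] = "Attachment test"
msg.set_content("See the attached file.")
msg.add_attachment(b"\x00\x01binary data", maintype="application",
                   subtype="octet-stream", filename="myattachment.bin")

# Write the message to a fresh mbox file in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "list.mbox")
box = mailbox.mbox(path)
box.add(msg)
box.flush()
box.close()

# The file on disk contains no raw binary: the attachment is base64 text
# inside a MIME part.
with open(path, "rb") as f:
    raw = f.read()

# Reading the message back and decoding the part recovers the original bytes.
reader = mailbox.mbox(path)
stored = reader[0]
attachment = [part for part in stored.walk() if part.get_filename()][0]
payload = attachment.get_payload(decode=True)
reader.close()
```

The cost is size: base64 inflates the attachment by roughly a third, which
matters if we are already worried about gigabytes of data.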

Are there going to be any issues handling non-ASCII (Unicode) characters or
with file locks? Right now it looks like we should only have one process
handling the mbox, but is it possible that more than one could be spawned
somehow?
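
For reference, a rough sketch of what both concerns might look like with
the standard library, assuming mailbox.mbox is what we end up using (the
message content and paths are invented). The mailbox is locked around the
write, and the non-ASCII body is given an explicit UTF-8 charset by the
email package:

```python
import mailbox
import os
import tempfile
from email.message import EmailMessage

# A message with a non-ASCII body; set_content() labels it as UTF-8.
msg = EmailMessage()
msg["From"] = "bob@example.com"
msg["Subject"] = "Encoding test"
msg.set_content("Non-ASCII body: ünïcödé")

path = os.path.join(tempfile.mkdtemp(), "list.mbox")
box = mailbox.mbox(path)

# lock() raises mailbox.ExternalClashError if the lock cannot be acquired,
# for example because another process already holds it.
box.lock()
try:
    box.add(msg)
    box.flush()
finally:
    box.unlock()
box.close()

# Read the message back and decode the body using its declared charset.
reader = mailbox.mbox(path)
stored = reader[0]
charset = stored.get_content_charset()
body = stored.get_payload(decode=True).decode(charset)
reader.close()
```

This only protects against other programs that honour the same locking
conventions, so it is not a substitute for making sure a single job owns
each file.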

Another possible "nice-to-have" feature I thought of yesterday is a
download link that scripts can use to get archives (e.g.
"/download?year=x&month=y"). On the other hand, maybe this is just a
security risk that has no actual use case, but I'd still like to have a
second opinion on this.
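
To make the idea concrete, here is a small sketch of the filtering such a
link would need, assuming the archive is available as an mbox and that the
Date header is trustworthy. The helper name messages_for_month and the
sample messages are hypothetical, and a real view would still have to
validate the parameters and check permissions:

```python
import mailbox
import os
import tempfile
from email.message import EmailMessage
from email.utils import parsedate_to_datetime
from urllib.parse import parse_qs


def messages_for_month(box, year, month):
    """Yield messages whose Date header falls in the given year and month."""
    for msg in box:
        try:
            sent = parsedate_to_datetime(msg["Date"])
        except (TypeError, ValueError):
            continue  # missing or unparseable Date header
        if sent.year == year and sent.month == month:
            yield msg


def make_message(subject, date):
    """Build a tiny test message (hypothetical sample data)."""
    msg = EmailMessage()
    msg["From"] = "alice@example.com"
    msg["Subject"] = subject
    msg["Date"] = date
    msg.set_content("body")
    return msg


path = os.path.join(tempfile.mkdtemp(), "list.mbox")
box = mailbox.mbox(path)
box.add(make_message("February post", "Sat, 14 Feb 2015 09:00:00 +0000"))
box.add(make_message("March post", "Wed, 25 Mar 2015 18:02:01 +0100"))
box.flush()

# Parse the query string of a request like /download?year=2015&month=3.
query = parse_qs("year=2015&month=3")
year, month = int(query["year"][0]), int(query["month"][0])
subjects = [m["Subject"] for m in messages_for_month(box, year, month)]
box.close()
```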

Additionally, here are some tentative weekly goals I have for the project.
Feedback on the order/plausibility of these would be awesome!

Week 1) Given an email message, the message headers and body are extracted
and stored in a local file in mbox format. All unit tests passing.
Week 2) Attachments are represented in the mbox file as well. Email
addresses are escaped. There are no encoding errors (no boxes or ?s). All
unit tests passing.
Week 3) Explore options for possible asynchronous queue managers.
Weeks 4-5) When a mailing list archive is created, a background process
(implemented using the chosen backend) is attached to it for managing its
mbox files. Existing processes are started when the server starts, and the
server can efficiently manage all of these (possibly tens or hundreds of)
tasks. All unit tests passing.
Week 6) Clean up code and tests before midterm review. All unit tests
passing.
Weeks 7-8) Each background process unzips two mbox files, one for the
entire list and one for the past month, adds any messages that have arrived
in the past hour (in mbox format) and rezips the archive. All unit tests
passing.
Weeks 9-10) Mbox archives are served by HyperKitty upon request. HyperKitty
does not at this point authenticate users. All unit tests passing.
Week 11) HyperKitty authenticates the user before serving the mbox request.
If the request is denied, the user is notified via the UI. All unit tests
passing.
Week 12) Code review and cleaning, final check on unit tests (they should
all be passing).
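
One note on the weeks 7-8 step: the unzip/rezip cycle might be avoidable.
A gzip file may contain several compressed members back to back, and
Python's gzip module reads them as one stream, so an hourly batch could
simply be appended. This is only a sketch with invented file names and
hand-written mbox text, not a tested design:

```python
import gzip
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "list.mbox.gz")

# Two hourly batches, already rendered as mbox text (sample data).
first_batch = (b"From MAILER-DAEMON Wed Mar 25 17:00:00 2015\n"
               b"Subject: one\n\nfirst\n\n")
second_batch = (b"From MAILER-DAEMON Wed Mar 25 18:00:00 2015\n"
                b"Subject: two\n\nsecond\n\n")

# Opening in "ab" mode appends a new gzip member instead of rewriting
# the existing compressed data.
with gzip.open(path, "ab") as f:
    f.write(first_batch)
with gzip.open(path, "ab") as f:
    f.write(second_batch)

# Reading decompresses all members in order, as if they were one file.
with gzip.open(path, "rb") as f:
    combined = f.read()
```

The trade-off is a slightly worse compression ratio, since each appended
batch is compressed on its own.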

Thanks,
David

On Wed, Mar 25, 2015 at 4:18 AM, Aurelien Bompard <aurelien at bompard.org>
wrote:

> Hey David, here are my thoughts on the challenges:
>
> > 1) Determine which messages to include in the mbox.
> >     An entire list archive is clearly one choice, but is there also
> > interest in generating mbox files for specific threads, list archives
> > between specific dates, etc.?
>
> Hmm, depending on the architecture we choose, we may not have a lot of
> options. I'd like to see at least "whole-list" and "last 30 days"
> archives though, this last one being useful to those who want to use
> their mail client and "seed" it with the latest discussion to reply
> in-thread.
>
> > 2) For each message, append plaintext to mbox file.
> >     Is this the part where we risk "blocking the UI"? Certainly for
> > hundreds of thousands of messages, this will be a computationally
> > intensive step, so will this have to be run in a separate thread?
>
> Yeah, with a lot of messages, and with possible attachments, we may be
> creating hundreds of megabytes or maybe gigabytes of data. This has to
> be done outside of the webserver process, so we'll need something like
> a task queue and a daemon process or a cron job. Or we could be
> building and appending to the mbox files when new messages arrive,
> which would take up more disk space but would be more fluid from a UI
> point of view. It would also probably be much more resource-intensive
> than a cron job, because the mbox files will be large and should be
> gzipped, so it would be better to append a batch of emails than
> opening and closing on each incoming email.
> I'm leaning towards pre-rendering the mbox files in a regular cron job
> and warning the user in the UI that the archive contains all email up
> to the last hour, for example.
> We can't use the prototype archiver because we need to filter the
> message content and escape email addresses to protect from spam
> harvesters, like MM2.1 currently does.
>
> > 3) Present mbox file to user for download.
> >     I'm hoping this is a trivial step, but I'm not sure about some of the
> > specifics. For example, is Hyperkitty only able to run on apache, or is
> > the choice of web server entirely up to the web admin? How we ultimately
> > serve the file will depend on these details.
>
> HyperKitty runs on Django, which can be served by whichever
> WSGI-compliant server the admin chooses (Apache's mod_wsgi, uWSGI,
> gunicorn, etc.). If we choose to pre-build the mbox files, we can't
> simply have them served through the webserver, because some lists are
> private (only available to subscribers).
>
> I hope that clarifies a bit.
>
> Aurélien