Re: [Mailman-Developers] GSoC 15 - Interested in contributing to Hyperkitty

March 25, 2015

Thanks for your feedback, Aurelien.
...
we'll need something like a task queue and a daemon process or a cron job
In my proposal I suggested using any of several asynchronous job queue
libraries, such as Celery or Huey. Both can use Redis as a back-end.
Because I have no experience with asynchronous job queues, I'm not sure if
this is too much baggage for our purposes. Maybe we just don't want the
extra dependencies. Regarding cron jobs, there's also django-background-task
<https://github.com/lilspikey/django-background-task>, which is a simple
django addon that might do what we need. Again, if we don't want/need the
extra dependency, rolling our own cron job should be fairly
straightforward.
...
If we choose to pre-build the mbox files, we can't simply have them
served through the webserver, because some lists are private
Then there is also an authentication step? I noticed on the test server
that I can't actually look at any of the mailing lists because they're all
private. So we should be able to use pre-existing code for this step?
...
with possible attachments, we may be creating hundreds of megabytes or
maybe gigabytes of data
When we create the mbox file, do we simply note that an attachment existed
(e.g. "Attachment: myattachment.txt") or do we actually put the attachment
in the mbox? AFAIK mbox is a plaintext format, so if the latter is the case
then I'm not exactly sure how this would work...
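For what it's worth, a quick experiment suggests the latter is workable: mbox stores each message in its full on-the-wire form, so a MIME attachment sits inside the message as an encoded (here base64) text part. Below is a minimal sketch using only Python's standard library. The addresses, the attachment name, and the helper names `build_message` and `write_and_read_back` are made up for illustration.

```python
import mailbox
import os
import tempfile
from email.message import EmailMessage


def build_message():
    """Build a sample message with one small binary attachment (made-up data)."""
    msg = EmailMessage()
    msg["From"] = "alice@example.com"
    msg["To"] = "list@example.com"
    msg["Subject"] = "Patch attached"
    msg.set_content("See the attached file.")
    # Binary content is base64-encoded, so the mbox file itself stays text.
    msg.add_attachment(
        b"\x00\x01binary payload",
        maintype="application",
        subtype="octet-stream",
        filename="myattachment.bin",
    )
    return msg


def write_and_read_back(path):
    """Append the sample message to an mbox, then recover the attachment."""
    box = mailbox.mbox(path)
    box.add(build_message())
    box.flush()
    box.close()

    box = mailbox.mbox(path)
    try:
        stored = next(iter(box))  # iterating a mailbox yields its messages
        for part in stored.walk():
            if part.get_filename():
                # decode=True reverses the base64 transfer encoding
                return part.get_filename(), part.get_payload(decode=True)
    finally:
        box.close()
    return None


if __name__ == "__main__":
    demo_path = os.path.join(tempfile.mkdtemp(), "list.mbox")
    print(write_and_read_back(demo_path))
```

The downside is size: base64 inflates attachments by roughly a third, which matters for the disk-space concern above.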
Are there going to be any issues handling non-ASCII (Unicode) characters or
with file locks? Right now it looks like we should only have one process
handling the mbox, but is it possible that more than one could be spawned
somehow?
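To make the question concrete, here is a rough sketch of how a single writer could guard its appends with the locking that Python's `mailbox` module already provides, and how a non-ASCII body round-trips. It assumes a POSIX system and a writable directory. The helper names and sample addresses are hypothetical, and this says nothing about how HyperKitty would actually schedule writers.

```python
import mailbox
import os
import tempfile
from email.message import EmailMessage


def make_message(body):
    """Build a small text message; the addresses are placeholders."""
    msg = EmailMessage()
    msg["From"] = "alice@example.com"
    msg["To"] = "list@example.com"
    msg["Subject"] = "Encoding test"
    msg.set_content(body)  # non-ASCII text is stored as UTF-8
    return msg


def append_batch(path, messages):
    """Append messages while holding the mbox lock."""
    box = mailbox.mbox(path)
    # lock() raises mailbox.ExternalClashError if another process holds the lock.
    box.lock()
    try:
        for msg in messages:
            box.add(msg)
        box.flush()
    finally:
        box.unlock()
        box.close()


def read_bodies(path):
    """Return the decoded text body of every message in the mbox."""
    box = mailbox.mbox(path)
    try:
        return [m.get_payload(decode=True).decode("utf-8") for m in box]
    finally:
        box.close()


if __name__ == "__main__":
    demo_path = os.path.join(tempfile.mkdtemp(), "list.mbox")
    append_batch(demo_path, [make_message("plain ASCII"), make_message("Ça marche: 日本語")])
    print(read_bodies(demo_path))
```

If a second process were spawned by accident, it would fail at `lock()` rather than interleave its writes with the first one.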
Another possible "nice-to-have" feature I thought of yesterday is a
download link that scripts can use to get archives (e.g.
"/download?year=x&month=y"). On the other hand, maybe this is just a
security risk that has no actual use case, but I'd still like to have a
second opinion on this.
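To illustrate what I have in mind, and how the parameters could be validated so the link is less of a risk, here is a small sketch that turns such a query string into a date range. The parameter names come from the example URL above. The function name `month_range` and the choice of UTC are my assumptions, and nothing here is tied to Django or HyperKitty.

```python
from datetime import datetime, timezone
from urllib.parse import parse_qs


def month_range(query_string):
    """Turn 'year=2015&month=3' into a half-open UTC interval [start, end)."""
    params = parse_qs(query_string)
    try:
        year = int(params["year"][0])
        month = int(params["month"][0])
    except (KeyError, ValueError):
        raise ValueError("year and month must both be integers")
    if not 1 <= month <= 12:
        raise ValueError("month must be between 1 and 12")
    start = datetime(year, month, 1, tzinfo=timezone.utc)
    # December rolls over to January of the following year.
    end = datetime(year + month // 12, month % 12 + 1, 1, tzinfo=timezone.utc)
    return start, end


if __name__ == "__main__":
    print(month_range("year=2015&month=3"))
```

Messages whose date falls inside the interval would go into the requested archive, and anything that fails validation would be rejected before touching the database.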
Additionally, here are some tentative weekly goals I have for the project.
Feedback on the order/plausibility of these would be awesome!
Week 1)  Given an email message, the message headers and body are extracted
and stored in a local file in mbox format. All unit tests passing.
Week 2)  Attachments are represented in the mbox file as well. Email
addresses are escaped. There are no encoding errors (no boxes or ?s). All
unit tests passing.
Week 3)  Explore options for possible asynchronous queue managers.
Weeks 4-5) When a mailing archive is created, a background process
(implemented using chosen backend) is attached to it for managing its mbox
files. Existing processes are started when the server starts, and the
server can efficiently manage all of these (possibly tens/hundreds?) of
tasks. All unit tests passing.
Week 6) Clean code and tests before midterm review. All unit tests passing.
Week 7-8)  Each background process unzips two mbox files, one for the
entire list and one for the past month, adds any messages that have come in
in the past hour (in mbox format) and rezips the archive. All unit tests
passing.
Weeks 9-10)  Mbox archives are served by HyperKitty upon request. HyperKitty
does not at this point authenticate users. All unit tests passing.
Week 11) HyperKitty authenticates the user before serving the mbox request.
If the request is denied, the user is notified via the UI. All unit tests
passing.
Week 12) Code review and cleaning, final check on unit tests (they should
all be passing).
Thanks,
David
On Wed, Mar 25, 2015 at 4:18 AM, Aurelien Bompard <aurelien@bompard.org>
wrote:
...
Hey David, here are my thoughts on the challenges:
...

Determine which messages to include in the mbox.
An entire list archive is clearly one choice, but is there also
interest in generating mbox files for specific threads, list archives
between specific dates, etc.?

Hmm, depending on the architecture we choose, we may not have a lot of
options. I'd like to see at least "whole-list" and "last 30 days"
archives though, this last one being useful to those who want to use
their mail client and "seed" it with the latest discussion to reply
in-thread.
...

For each message, append plaintext to mbox file.
Is this the part where we risk "blocking the UI"? Certainly for
hundreds of thousands of messages, this will be a computationally intensive
step, so will this have to be run in a separate thread?

Yeah, with a lot of messages, and with possible attachments, we may be
creating hundreds of megabytes or maybe gigabytes of data. This has to
be done outside of the webserver process, so we'll need something like
a task queue and a daemon process or a cron job. Or we could be
building and appending to the mbox files when new messages arrive,
which would take up more disk space but would be more fluid from a UI
point of view. It would also probably be much more resource-intensive
than a cron job, because the mbox files will be large and should be
gzipped, so it would be better to append a batch of emails than to
open and close the file on each incoming email.
I'm leaning towards pre-rendering the mbox files in a regular cron job
and warning the user in the UI that the archive contains all email up
to the last hour, for example.
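As a rough illustration of the batching idea (a sketch, not a design decision): the gzip format allows several compressed members to be concatenated in one file, and Python's `gzip` module reads such a file back as a single stream. A cron job could therefore append each hour's batch as a new member without decompressing the existing archive. The function names are illustrative, and the sketch assumes the entries are already serialized in mbox format.

```python
import gzip
import os
import tempfile


def append_batch_gz(path, raw_entries):
    """Append already-serialized mbox entries as one new gzip member."""
    with gzip.open(path, "ab") as fh:
        for raw in raw_entries:
            fh.write(raw)


def read_all_gz(path):
    """Read the whole archive; concatenated members decompress as one stream."""
    with gzip.open(path, "rb") as fh:
        return fh.read()


if __name__ == "__main__":
    demo_path = os.path.join(tempfile.mkdtemp(), "list.mbox.gz")
    append_batch_gz(demo_path, [b"From a@example.com\n\nfirst\n\n"])
    append_batch_gz(demo_path, [b"From b@example.com\n\nsecond\n\n"])
    print(read_all_gz(demo_path).decode("ascii"))
```

Each member carries its own header and compresses independently, so many tiny members compress worse than a few large ones, which is one more argument for hourly batches over per-message appends.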
We can't use the prototype archiver because we need to filter the
message content and escape email addresses to protect against spam
harvesters, like MM2.1 currently does.
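A minimal sketch of that kind of escaping, assuming the simple "user at domain" substitution seen in Mailman 2.1 archives. The regular expression and the function name are illustrative only, and the pattern is deliberately loose, so it will not match every valid address.

```python
import re

# Loose pattern for illustration: local part, "@", then a dotted domain.
EMAIL_RE = re.compile(r"([A-Za-z0-9._%+-]+)@([A-Za-z0-9.-]+\.[A-Za-z]{2,})")


def obscure_addresses(text):
    """Replace 'user@example.com' with 'user at example.com'."""
    return EMAIL_RE.sub(r"\1 at \2", text)


if __name__ == "__main__":
    print(obscure_addresses("Contact alice@example.com for details."))
```

The same filter would have to run over headers as well as bodies, otherwise the From lines in the mbox would still expose the addresses.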
...

Present mbox file to user for download.
I'm hoping this is a trivial step, but I'm not sure about some of the
specifics. For example, is HyperKitty only able to run on Apache, or is the
choice of web server entirely up to the web admin? How we ultimately serve
the file will depend on these details.

HyperKitty runs on Django, which can be served by whichever
WSGI-compliant server the admin chooses (Apache's mod_wsgi, uWSGI,
gunicorn, etc.). If we choose to pre-build the mbox files, we can't
simply have them served through the webserver, because some lists are
private (only available to subscribers).
I hope that clarifies things a bit.
Aurélien



David Udelson
