Re: [Mailman-Developers] GSoC 15 - Interested in contributing to Hyperkitty

Thanks for your feedback, Aurelien.
we'll need something like a task queue and a daemon process or a cron job
In my proposal I suggested using any of several asynchronous job queue libraries, such as Celery or Huey. These can all use Redis as a back-end. Because I have no experience with asynchronous job queues, I'm not sure if this is too much baggage for our purposes. Maybe we just don't want the extra dependencies. Regarding cron jobs, there's also django-background-task <https://github.com/lilspikey/django-background-task>, which is a simple Django add-on that might do what we need. Again, if we don't want or need the extra dependency, rolling our own cron job should be fairly straightforward.
If we choose to pre-build the mbox files, we can't simply have them served through the webserver, because some lists are private
Then there is also an authentication step? I noticed on the test server that I can't actually look at any of the mailing lists because they're all private. So we should be able to use pre-existing code for this step?
with possible attachments, we may be creating hundreds of megabytes or maybe gigabytes of data
When we create the mbox file, do we simply note that an attachment existed (e.g. "Attachment: myattachment.txt") or do we actually put the attachment in the mbox? AFAIK mbox is a plaintext format, so if the latter is the case then I'm not exactly sure how this would work...
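On the plaintext question: as far as I can tell, an mbox file simply stores each message in its full MIME form, so a binary attachment would travel as an encoded (typically base64) text part rather than as raw bytes. Below is a minimal sketch using only Python's standard `mailbox` and `email` modules; the addresses, file name, payload, and helper names are made up for illustration and are not HyperKitty code.

```python
import mailbox
import os
import tempfile
from email.message import EmailMessage

# Hypothetical binary payload standing in for a real attachment.
ATTACHMENT_BYTES = b"\x00\x01\x02 binary payload"


def build_message_with_attachment():
    """Build a small message carrying one binary attachment."""
    msg = EmailMessage()
    msg["From"] = "alice@example.com"
    msg["To"] = "list@example.com"
    msg["Subject"] = "Attachment demo"
    msg.set_content("See the attached file.")
    # Binary data is given a base64 transfer encoding by the email package.
    msg.add_attachment(
        ATTACHMENT_BYTES,
        maintype="application",
        subtype="octet-stream",
        filename="myattachment.bin",
    )
    return msg


def write_and_read_back(path):
    """Store the message in an mbox, then return (raw file bytes, attachment bytes)."""
    box = mailbox.mbox(path)
    try:
        box.add(build_message_with_attachment())
    finally:
        box.close()  # close() flushes pending writes

    with open(path, "rb") as f:
        raw = f.read()

    payload = None
    box = mailbox.mbox(path)
    try:
        for stored in box:
            for part in stored.walk():
                if part.get_filename() == "myattachment.bin":
                    # decode=True reverses the base64 transfer encoding.
                    payload = part.get_payload(decode=True)
    finally:
        box.close()
    return raw, payload
```

Calling `write_and_read_back()` on a scratch path should show that the file on disk is plain text while the attachment bytes still round-trip unchanged. The cost is size: base64 inflates the attachment by roughly a third.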
Are there going to be any issues handling non-ASCII (Unicode) characters or with file locks? Right now it looks like we should only have one process handling the mbox, but is it possible that more than one could be spawned somehow?
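For what it's worth, the standard library's `mailbox.mbox` has `lock()` and `unlock()` methods, and `lock()` raises `mailbox.ExternalClashError` when the lock cannot be acquired, so a second writer should fail loudly rather than corrupt the file. Here is a rough sketch of how a writer could hold the lock while appending, with a non-ASCII body to check the encoding round-trip. The helper names are hypothetical, and the sketch assumes UTF-8 text bodies in single-part messages.

```python
import mailbox
from contextlib import contextmanager
from email.message import EmailMessage


@contextmanager
def locked_mbox(path):
    """Open an mbox and hold its lock for the duration of the with-block."""
    box = mailbox.mbox(path)
    try:
        box.lock()  # raises mailbox.ExternalClashError if the lock is unavailable
        yield box
    finally:
        box.close()  # flushes pending writes and releases the lock if held


def make_message(subject, body):
    """Build a simple single-part text message (illustrative addresses)."""
    msg = EmailMessage()
    msg["From"] = "alice@example.com"
    msg["Subject"] = subject
    msg.set_content(body)
    return msg


def read_bodies(path):
    """Return the decoded text body of every message in the mbox."""
    box = mailbox.mbox(path)
    try:
        return [m.get_payload(decode=True).decode("utf-8") for m in box]
    finally:
        box.close()
```

A writer would then do something like `with locked_mbox(path) as box: box.add(make_message("Hello", "Grüße, 世界"))`. Whether the locking is enough for our setup depends on how many processes end up touching the files, which I can't judge yet.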
Another possible "nice-to-have" feature I thought of yesterday is a download link that scripts can use to get archives (e.g. "/download?year=x&month=y"). On the other hand, maybe this is just a security risk that has no actual use case, but I'd still like to have a second opinion on this.
Additionally, here are some tentative weekly goals I have for the project. Feedback on the order/plausibility of these would be awesome!
Week 1) Given an email message, the message headers and body are extracted and stored in a local file in mbox format. All unit tests passing.
Week 2) Attachments are represented in the mbox file as well. Email addresses are escaped. There are no encoding errors (no boxes or ?s). All unit tests passing.
Week 3) Explore options for possible asynchronous queue managers.
Weeks 4-5) When a mailing list archive is created, a background process (implemented using the chosen backend) is attached to it for managing its mbox files. Existing processes are started when the server starts, and the server can efficiently manage all of these (possibly tens/hundreds?) of tasks. All unit tests passing.
Week 6) Clean code and tests before midterm review. All unit tests passing.
Weeks 7-8) Each background process unzips two mbox files, one for the entire list and one for the past month, adds any messages that have come in during the past hour (in mbox format), and rezips the archive. All unit tests passing.
Weeks 9-10) Mbox archives are served by HyperKitty upon request. HyperKitty does not at this point authenticate users. All unit tests passing.
Week 11) HyperKitty authenticates the user before serving the mbox request. If the request is denied, the user is notified via the UI. All unit tests passing.
Week 12) Code review and cleaning, final check on unit tests (they should all be passing).
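One possible shortcut for the unzip/append/rezip step in weeks 7-8: the gzip format allows several compressed members to be concatenated, and Python's `gzip` module reads such a file back as one stream. An hourly batch could therefore be appended to the existing `.mbox.gz` without decompressing it first. The sketch below is only an illustration of that idea, not a settled design; `append_batch` and `read_subjects` are hypothetical names, and the batch is serialized through a temporary plain mbox so that the `mailbox` module handles the "From " separator lines.

```python
import gzip
import mailbox
import os
import shutil
import tempfile
from email.message import EmailMessage


def make_message(subject):
    """Build a tiny text message (illustrative content only)."""
    msg = EmailMessage()
    msg["From"] = "alice@example.com"
    msg["Subject"] = subject
    msg.set_content("body of " + subject)
    return msg


def append_batch(gz_path, messages):
    """Append a batch of messages to a gzipped mbox as a new gzip member."""
    with tempfile.TemporaryDirectory() as tmp:
        plain = os.path.join(tmp, "batch.mbox")
        box = mailbox.mbox(plain)
        try:
            for message in messages:
                box.add(message)
        finally:
            box.close()
        # Mode "ab" adds a new compressed member after the existing data.
        with open(plain, "rb") as src, gzip.open(gz_path, "ab") as dst:
            shutil.copyfileobj(src, dst)


def read_subjects(gz_path):
    """Decompress the whole archive and list the subjects in order."""
    with tempfile.TemporaryDirectory() as tmp:
        plain = os.path.join(tmp, "archive.mbox")
        with gzip.open(gz_path, "rb") as src, open(plain, "wb") as dst:
            shutil.copyfileobj(src, dst)
        box = mailbox.mbox(plain)
        try:
            return [m["Subject"] for m in box]
        finally:
            box.close()
```

The trade-off is that many small members compress a little worse than one large stream, so an occasional full recompression might still be worthwhile. The "past month" archive would need rebuilding anyway, since old messages have to drop out of it.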
Thanks, David
On Wed, Mar 25, 2015 at 4:18 AM, Aurelien Bompard <aurelien@bompard.org> wrote:
Hey David, here are my thoughts on the challenges:
- Determine which messages to include in the mbox. An entire list archive is clearly one choice, but is there also interest in generating mbox files for specific threads, list archives between specific dates, etc.?
Hmm, depending on the architecture we choose, we may not have a lot of options. I'd like to see at least "whole-list" and "last 30 days" archives though, this last one being useful to those who want to use their mail client and "seed" it with the latest discussion to reply in-thread.
- For each message, append plaintext to mbox file. Is this the part where we risk "blocking the UI"? Certainly for hundreds of thousands of messages, this will be a computationally intensive step, so will this have to be run in a separate thread?
Yeah, with a lot of messages, and with possible attachments, we may be creating hundreds of megabytes or maybe gigabytes of data. This has to be done outside of the webserver process, so we'll need something like a task queue and a daemon process or a cron job. Or we could build and append to the mbox files when new messages arrive, which would take up more disk space but would be more fluid from a UI point of view. It would also probably be much more resource-intensive than a cron job, because the mbox files will be large and should be gzipped, so it would be better to append a batch of emails than to open and close the file on each incoming email. I'm leaning towards pre-rendering the mbox files in a regular cron job and warning the user in the UI that the archive contains all email up to the last hour, for example. We can't use the prototype archiver because we need to filter the message content and escape email addresses to protect against spam harvesters, like MM2.1 currently does.
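To make the address-escaping step concrete, here is a minimal sketch that rewrites "user@example.com" as "user at example.com" in message text, in the spirit of the obscuring found in MM2.1's archives. The regular expression is a deliberate simplification and an assumption of this sketch, not the pattern Mailman itself uses; `obscure_addresses` is a hypothetical helper name.

```python
import re

# Simplified address pattern (an assumption for illustration): a local part,
# "@", then a domain with at least one dot. Real addresses can be more exotic.
EMAIL_RE = re.compile(
    r"([A-Za-z0-9._%+-]+)@([A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+)"
)


def obscure_addresses(text):
    """Replace the "@" in anything that looks like an email address with " at "."""
    return EMAIL_RE.sub(r"\1 at \2", text)
```

For example, `obscure_addresses("Contact alice@example.com.")` gives `"Contact alice at example.com."`. A real filter would also have to decide what to do with headers such as From and Message-ID, which this sketch does not address.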
- Present mbox file to user for download. I'm hoping this is a trivial step, but I'm not sure about some of the specifics. For example, is HyperKitty only able to run on Apache, or is the choice of web server entirely up to the web admin? How we ultimately serve the file will depend on these details.
HyperKitty runs on Django, which can be served by whichever WSGI-compliant server the admin chooses (Apache's mod_wsgi, uWSGI, gunicorn, etc.). If we choose to pre-build the mbox files, we can't simply have them served through the webserver, because some lists are private (only available to subscribers).
I hope that clarifies a bit.
Aurélien