Custom feature request - ripe ncc - archive expires

Hi there,
Hopefully you're the right contact to ask this. Please let us know who else to send the email to if not.
We're interested in the development of a feature in mailman3 where we can configure it to automatically expire/remove threads in/from the archive older than x number of days.
We can discuss the details, including financial compensation for the development effort, once we get further into this. For now I'm looking for the right person to talk to about this.
Is there a way forward for this?
Thank you so much in advance!
Marco van Tol RIPE NCC

On 9/11/24 00:23, Marco van Tol wrote:
The script at https://www.msapiro.net/scripts/prune_arch3 could be easily modified to do this and then be run periodically by cron.
--
Mark Sapiro <mark@msapiro.net>        The highway is for gamblers,
San Francisco Bay Area, California    better use your sense - B. Dylan
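A minimal sketch of the selection step such a modified script would need: picking the messages older than x days. It uses a stand-in dataclass rather than the HyperKitty model, so the names `ArchivedMessage` and `expired` are hypothetical. In a Django shell the equivalent would presumably be a filter on the message date (something like `date__lt=cutoff`), but that field name is an assumption and is not taken from the script itself.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class ArchivedMessage:
    """Stand-in for an archived email; NOT the HyperKitty model."""
    message_id: str
    date: datetime


def expired(messages, max_age_days, now=None):
    """Return the messages whose date is older than max_age_days."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=max_age_days)
    return [m for m in messages if m.date < cutoff]
```

Passing `now` explicitly keeps the function testable; a cron job would simply leave it out.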

Hi Mark,
Thanks again for your answer.
First of all, all of what I write below happened on a test list server, so no real harm was done. I'm just curious how to recover from this, and how to get to what I need. :-)
On Wed, 11 Sept 2024 at 21:16, Mark Sapiro <mark@msapiro.net> wrote:
I made a modified version of the script, and ran it. (attachment: script-1.py)
The script went on its way for a bit, and then blew up. (attachment: error-run-1.txt)
Now the archive for the list is broken in the sense that when I search for "*", and order it by "earliest first", it will show a server error. The mailmanweb.log also gives errors when this happens. (attachment: mailman-web.log)
I tried to run ./manage.py update_index, which did not fix the issue for the archive.
I then ran ./manage.py rebuild_index, which did fix the issue for the archive.
Following this I ran the script again, and it showed similar output, including an error message after a while, but on every run it would delete more messages.
I could keep running it until all the messages that I needed to be gone were gone, and then do a final rebuild_index to get the server back in shape.
The mailmanweb.log would show the occasional message like this while I was doing this: WARNING 2024-10-16 12:31:52,043 41 hyperkitty.tasks Cannot rebuild the thread cache: thread 28 does not exist.
These did not re-appear after I rebuilt the index.
Is there some call I need to make to refresh the Email.objects list between runs?
Do I just blindly rerun Email.objects.filter(<args>) after every call to msg.delete()?
And secondary: can I avoid the call to rebuild_index? The actual production server has a massive count of messages.
Thanks!
Marco van Tol RIPE NCC
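The "rerun the filter after every delete" idea can be sketched with an in-memory stand-in. `FakeArchive` and `prune` are hypothetical names. The cascade in the stand-in is only a device to make earlier query results go stale; it is not a claim about what HyperKitty actually does on delete.

```python
class FakeArchive:
    """In-memory stand-in for the archive (hypothetical, not HyperKitty)."""

    def __init__(self, parents):
        # Maps message id -> parent id (None for a thread starter).
        self.parents = dict(parents)

    def query(self, ids):
        """Return the requested ids that still exist, like a fresh filter()."""
        return [i for i in ids if i in self.parents]

    def delete(self, msg_id):
        """Delete a message and, in this toy model, its replies."""
        if msg_id not in self.parents:
            raise KeyError(f"{msg_id} no longer exists")
        children = [c for c, p in self.parents.items() if p == msg_id]
        del self.parents[msg_id]
        for child in children:
            self.delete(child)


def prune(archive, wanted_ids):
    """Delete wanted_ids, re-querying once after every delete."""
    deleted = 0
    while True:
        remaining = archive.query(wanted_ids)
        if not remaining:
            return deleted
        archive.delete(remaining[0])
        deleted += 1
```

Iterating over a list fetched once would hit the `KeyError` as soon as it reached a message that had already disappeared; the loop above never holds on to a stale result, at the cost of one query per deleted message.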

On Wed, 16 Oct 2024 at 14:45, Marco van Tol <mvantol@ripe.net> wrote:
[...]
I made a change to the script that's really blunt but does work.
See the attached change. I can make it more efficient by calling the Email.objects.filter() once per loop instead of the current 2, but if you have any other improvements they'd be welcome.
> And secondary: can I avoid the call to rebuild_index? The actual production server has a massive count of messages.
This one very much stands; hopefully I can integrate the message deletion more directly into the archive search index.
Thank you very much in advance!
Marco van Tol RIPE NCC

On 10/16/24 07:02, Marco van Tol wrote:
I suspect the issue there is in the order of deletion and a message which is a parent in a thread gets deleted before the child.
I don't have any suggestions.
> And secondary: can I avoid the call to rebuild_index? The actual production server has a massive count of messages.
If the massive count is distributed over multiple lists with not so many messages you can run the Django update_index_one_list job to just do one list.
--
Mark Sapiro <mark@msapiro.net>        The highway is for gamblers,
San Francisco Bay Area, California    better use your sense - B. Dylan
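A small sketch of how the per-list reindex could be driven from the pruning job. The command name `update_index_one_list` comes from the message above and `./manage.py` from earlier in the thread; passing the list address as a positional argument is an assumption, and `reindex_commands` is a hypothetical helper. Nothing is executed here, the function only builds the argument lists.

```python
def reindex_commands(list_addresses, manage_py="./manage.py"):
    """Build one argv per list that had messages deleted.

    ASSUMPTION: the command takes the list address as a positional
    argument. Check the installed HyperKitty before relying on this.
    """
    unique = sorted(set(list_addresses))
    return [[manage_py, "update_index_one_list", address] for address in unique]
```

Each returned argv could then be handed to `subprocess.run(argv, check=True)` once the invocation has been confirmed on the test server.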

Hi Mark,
Thank you so much for spending some of your scarce time on this for us.
On Wed, 16 Oct 2024 at 22:58, Mark Sapiro <mark@msapiro.net> wrote:
That's very much what it looks like, yeah. Perhaps I could somehow sort the returned array of messages on the date and then delete them from recent-to-old.
That's okay, thanks for thinking about it.
This is very helpful, thank you so much!
Marco van Tol RIPE NCC
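The "delete from recent to old" idea, sketched on stand-in objects. `Msg` and `deletion_order` are hypothetical names. With a Django queryset the same ordering would presumably come from `order_by` on the date field in descending order; the field name is an assumption. The sketch also assumes a reply always carries a later date than its parent, which real mail headers do not guarantee.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class Msg:
    """Stand-in for an archived email; NOT the HyperKitty model."""
    message_id: str
    date: datetime
    parent_id: Optional[str] = None


def deletion_order(messages):
    """Newest first, so replies are deleted before the messages they answer."""
    return sorted(messages, key=lambda m: m.date, reverse=True)
```

If a reply has an earlier date than its parent (a wrong clock on the sender's side, for example), this ordering would still delete the parent first, so it reduces the problem rather than ruling it out.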

On Thu, 17 Oct 2024 at 09:38, Marco van Tol <mvantol@ripe.net> wrote:
[...]
I tried this, and everything works fine with the last version of my script, except for one sort-of minor thing, and that's the message count if you search for "*". It won't update to the right message count in the top middle of the page until I do a "rebuild_index".
I'm afraid "update_index" only looks at the messages changed since the last time it was run, and misses the fact that messages have disappeared from the beginning. (I agree removing messages from an archive is far from optimal, but I'm not doing this for myself :D )
It would be a minor thing if I hadn't told people this is the way to find out the number of messages in a list, and also to find the most recent post to a list.
Marco van Tol RIPE NCC
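A toy model of the hypothesis above, to make the suspected mechanism concrete. It is not Haystack or HyperKitty code, and nobody in this thread has verified the real behaviour in the source; `incremental_update` and `stale_entries` are hypothetical names. The index is a set of message ids and the database maps ids to a last-modified timestamp.

```python
def incremental_update(index, database, last_run):
    """Add entries modified after last_run; never removes anything."""
    for msg_id, modified in database.items():
        if modified > last_run:
            index.add(msg_id)
    return index


def stale_entries(index, database):
    """Ids still in the index although the message is gone."""
    return index - set(database)
```

In this model, deleting message 1 from a three-message archive and then adding a fourth leaves the index with four entries while the database holds three, which matches the wrong message count seen in the "search for *" result. Only a pass that compares the index against the database (or a full rebuild) removes the stale entry.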

Marco van Tol writes:
I tried this, and everything works fine with the last version of my script, except for one sort-of minor thing,
Please don't deprecate your requirements. If you need it, and you do, we want to give it to you. Sure, sometimes it is harder than you imagine or in our judgment it's not worth as much as something else we could do, but you needn't be shy about asking for it. I'm also pretty sure it's hardly a unique requirement, at least it won't be for long, between GDPR and other worries about privacy of archived data in many contexts.
How much time does it cost to do that? If it's expensive enough that on "monthly cleaning day" you've got some lists that stay unsynced for many minutes or hours, we might need to rearchitect the index to be per list.
Have you looked at the code to verify this? I agree it's consistent with the Mailman behavior you see. Unfortunately I'm not sure that all the indexers we claim to support would be able to support such deletions without a full rebuild.
Steve

Hi Stephen,
Thank you so much for your email.
On Thu, 17 Oct 2024 at 17:43, Stephen J. Turnbull <turnbull@sk.tsukuba.ac.jp> wrote:
Fair enough :)
I hadn't thought about that one yet, but indeed, thank you.
At the moment the entire list server has roughly 300.000 messages. From memory last time it took slightly under an hour.
A 300.000 count is a lot less than others have, but it makes it sort of okay for our server, today. If we time it right.
It is a 24x7 service though, so even at the best timing there's risk for a few people to have degraded service on the archives while the index rebuilds. But right now I think if we time it right once per month it's probably okay.
It would be nice if an improvement would be somewhere on a list of nice-to-haves. Or perhaps that list just only gets longer, it may well. :-)
I have not verified it in the code, I aim to have a look at some point. Indeed I was writing this with the witnessed behaviour in mind.
Thanks Stephen!
Marco van Tol RIPE NCC
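A back-of-the-envelope estimate based on the single data point above. It treats "slightly under an hour" as an upper bound of 60 minutes for 300,000 messages (about 83 messages per second) and assumes rebuild time scales linearly with message count, which is not established anywhere in this thread. `rebuild_minutes` is a hypothetical helper.

```python
def rebuild_minutes(message_count, reference_count=300_000, reference_minutes=60.0):
    """Linear estimate of rebuild time, scaled from one observed run."""
    return message_count * reference_minutes / reference_count
```

Under those assumptions an archive of one million messages would need roughly 200 minutes, which gives a feel for when a full rebuild on "monthly cleaning day" stops being acceptable.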

You are probably correct. It probably doesn't remove index entries for messages no longer in the archive. I'm not sure, but this may depend on the particular haystack backend.
What happens if you search for * and sort earliest first? Does it find deleted messages, and are there errors?
--
Mark Sapiro <mark@msapiro.net>        The highway is for gamblers,
San Francisco Bay Area, California    better use your sense - B. Dylan

Hi Mark,
Thanks for all your help on this topic so far.
On Fri, 18 Oct 2024 at 05:38, Mark Sapiro <mark@msapiro.net> wrote:
[...]
That's the strange, but also the good part. With my current version of the script, which refreshes the Email.objects.filter() array on every iteration, this works fine too. The top message becomes the next one that should not have been deleted, as expected.
The only thing that is not correct is the message count on the "search for *" result, after I delete a bunch from the beginning.
With the first version of my script, that runs into errors, the web page gives the server error. I don't intend to ever use that version any more.
Thanks!
Marco van Tol RIPE NCC

participants (3)
- Marco van Tol
- Mark Sapiro
- Stephen J. Turnbull