[Mailman-Users] [Mailman-cabal] GDPR

Tue May 22 22:40:51 EDT 2018

On 05/22/2018 07:33 PM, Stephen J. Turnbull wrote:
> I would imagine that it is the subthread rooted at the first post 
> containing complainant's PII -- "Personally Identifying Information".

I feel like that's a self-referencing definition.

A "thread" is "a subthread rooted at the first post containing PII".

I agree that's where the focus should start.  But I don't think it 
defines a thread in the way that I'm asking.

What is their working definition of "thread"?

Let's say:

1)  Bla
2)   +--- Re: Bla
3)   +--- Re: Bla
4)   |     +--- BlaBlaBla
5)   +--- Re: Bla
6)         +--- I hijacked this thread because I need help!!!

Let's say the PII was in message 3 and the person replying to it in 
message 4 removed the PII.  Do messages 3 and 4 need to be removed (or 
otherwise modified)?

Let's say that message 1 had the PII, messages 2, 3, and 5 quoted it, 
but 4 did not and 6 is a hijacker that hit reply on the most convenient 
message (under his cursor) and removed all content.  Do messages 4 and 6 
need to be removed?

What is the "(sub)thread" that needs to be removed?
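
To make the two readings concrete, here is a minimal Python sketch of 
the example thread above.  The Message class, the has_pii flag, and the 
helper names are all made up for illustration; nothing here comes from 
Mailman or from the GDPR text.  It compares "the subthread rooted at the 
first post containing PII" against "only the messages that contain PII".

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    """One archived post; has_pii is a hypothetical flag set by a reviewer."""
    number: int
    subject: str
    has_pii: bool = False
    replies: list = field(default_factory=list)


def walk(msg):
    """Yield msg and every message below it, in thread order."""
    yield msg
    for reply in msg.replies:
        yield from walk(reply)


def first_pii_post(root):
    """Return the first message in thread order flagged as PII, or None."""
    return next((m for m in walk(root) if m.has_pii), None)


def subthread_numbers(root):
    """Reading A: everything at or below the first post containing PII."""
    start = first_pii_post(root)
    return sorted(m.number for m in walk(start)) if start else []


def pii_numbers(root):
    """Reading B: only the messages that actually contain the PII."""
    return sorted(m.number for m in walk(root) if m.has_pii)


def build_example(pii_in):
    """Build the six-message thread from the diagram, flagging pii_in."""
    m = {n: Message(n, s, n in pii_in) for n, s in [
        (1, "Bla"), (2, "Re: Bla"), (3, "Re: Bla"),
        (4, "BlaBlaBla"), (5, "Re: Bla"), (6, "I hijacked this thread"),
    ]}
    m[1].replies = [m[2], m[3], m[5]]
    m[3].replies = [m[4]]
    m[5].replies = [m[6]]
    return m[1]
```

With PII only in message 3, reading A selects messages 3 and 4 while 
reading B selects only 3.  With PII in message 1 and quoted in 2, 3, and 
5, reading A selects all six messages, including 4 and 6, which contain 
no PII at all.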

> That is going to depend on the presence of PII in the messages.  If *whole 
> messages* are to be deleted, that would presumably involve content that 
> somehow identifies the person.  I would expect that we don't have to 
> delete whole bug reports on this list just because somebody requests 
> their PII be redacted.

I agree that it's possible to remove / redact PII without deleting the 
items containing the PII.

Think about it this way: spooks don't shred the entire sheet of paper. 
Instead they take a black marker and redact just the pieces that need to 
be removed.

I'm afraid that the infinite wisdom of politicians will say that the 
entire paper needs to be shredded.

I think it also significantly depends on what needs to be redacted. 
Removing "supercalifragilisticexpialidocious" is a LOT different than 
removing "Grant Taylor" from the Mailman-Users archive. 
"supercalifragilisticexpialidocious" would be like reference to an 
event.  "Grant Taylor" would be any mention of my (or an impostor's) name.

The former is likely MUCH simpler to do than the latter.  The latter 
will also impact MANY more messages.
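
As a rough illustration of the black marker approach, here is a small 
Python sketch that blanks out given terms instead of dropping the whole 
message.  The redact function and the "[REDACTED]" marker are my own 
invention for this example, not anything Mailman or its archivers 
provide.

```python
import re

BLACK_MARKER = "[REDACTED]"  # hypothetical marker text


def redact(text, terms):
    """Replace each term (case-insensitive, whole words only) with a marker.

    Returns a tuple of the redacted text and the number of replacements.
    """
    if not terms:
        return text, 0
    # Try longer terms first so a full name wins over a shorter overlap.
    ordered = sorted(terms, key=len, reverse=True)
    alternatives = "|".join(re.escape(t) for t in ordered)
    pattern = re.compile(rf"\b(?:{alternatives})\b", re.IGNORECASE)
    return pattern.subn(BLACK_MARKER, text)
```

This only catches literal matches.  A name that is abbreviated, 
misspelled, or wrapped across a line break by quoting would slip 
through, which is part of why redacting a name is so much harder than 
redacting a unique word.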

> What worries me more is the implications for blockchain, or more 
> precisely, DAG-based VCSes that use hashes for integrity check like git: 
> the identity of commits will change if authors and emails are redacted, 
> including if a commit log refers to PII of a bug reporter as they often 
> do.  I guess you'd need to maintain an index of pointers from old commit 
> ids, or at least for branches and tags (we do have the reflog in git).

I don't want to try to work that out.
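
That said, why the ids change is easy enough to show, even if the fix 
isn't.  Git names an object by the SHA-1 of a short header plus the 
object's content, and a commit's content includes the author line and 
the parent's id.  The sketch below builds simplified commit bodies by 
hand; the commit_id helper, the example addresses, and the fixed 
timestamp are assumptions for illustration, and it is not how you would 
rewrite real history.

```python
import hashlib

# Well-known id of git's empty tree object.
EMPTY_TREE = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"


def git_object_id(kind, body):
    """Hash as git does: SHA-1 over '<kind> <size>', a NUL byte, then body."""
    header = f"{kind} {len(body)}".encode() + b"\x00"
    return hashlib.sha1(header + body).hexdigest()


def commit_id(tree, parent, author, message):
    """Return the id of a simplified commit (author doubles as committer)."""
    stamp = "1527043251 -0400"  # illustrative timestamp and timezone
    lines = [f"tree {tree}"]
    if parent:
        lines.append(f"parent {parent}")
    lines += [f"author {author} {stamp}",
              f"committer {author} {stamp}",
              "", message, ""]
    return git_object_id("commit", "\n".join(lines).encode())


original = commit_id(EMPTY_TREE, None, "Some Reporter <sr@example.org>", "Fix")
redacted = commit_id(EMPTY_TREE, None, "Redacted <r@example.invalid>", "Fix")

# A later commit by someone else still changes, because it names its parent.
child_of_original = commit_id(EMPTY_TREE, original, "Dev <d@example.org>", "Next")
child_of_redacted = commit_id(EMPTY_TREE, redacted, "Dev <d@example.org>", "Next")
```

Redacting one author line changes that commit's id, and every 
descendant changes with it, even though nobody touched the 
descendants' own content.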

> And heaven help you if you're a security conscious group like the Linux 
> kernel and use signed commits.  I guess the person who does the redaction 
> would sign the new commits, but that's pretty yucky -- that person could 
> do anything and nobody would know when it happened because you have to 
> delete the old commits and blobs that get redacted.

Yep.

> As I understand the "right to be forgotten", it's *not* a right to 
> arbitrarily edit content stored by someone else, it's the right to redact 
> *all* PII in that content.

Agreed.

In this case, I don't think that supercalifragilisticexpialidocious 
qualifies under GDPR's right to be forgotten.  }:-)

> It's not just messages from a person, it's headers containing their name 
> and email address, attribution lines for quoted material, quoted .sigs, 
> etc etc.

Agreed.

What about headers containing a message ID from an uncommon / single user 
domain like mine?  I'd say that anything that can be used to narrow an 
identity down to a group of fewer than 1000 people would probably need to 
be redacted.  (I just chose 1000 arbitrarily, but it's a starting point.)
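
Here is one way that rule of thumb could be sketched in Python.  It 
counts distinct senders per Message-ID domain in an archive and reports 
the domains that fall under the threshold.  The function name and the 
(sender, message_id) input shape are assumptions of mine, and the number 
of senders seen in one archive is only a crude stand-in for how many 
people a domain could really identify.

```python
from collections import defaultdict


def domains_below_threshold(posts, threshold=1000):
    """Return Message-ID domains used by fewer than `threshold` senders.

    posts is an iterable of (sender_address, message_id) pairs.
    """
    senders = defaultdict(set)
    for sender, message_id in posts:
        # "<abc123@example.net>" -> "example.net"
        domain = message_id.strip("<>").rpartition("@")[2].lower()
        senders[domain].add(sender.lower())
    return sorted(d for d, who in senders.items() if len(who) < threshold)
```

A single user domain like mine would show up with exactly one sender, 
so its Message-ID headers would be flagged no matter where the 
threshold is set.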

> You're missing
> 
> 0)  Randos accessing public archives.

What other modes have we collectively missed?

> For (0), the only logging would be IP addresses in the webserver.

True.

> No.  The accessing IPs will be in the webserver logs, but I don't think 
> there is any logging in either Mailman 2 or Mailman 3 of authentication 
> data.  All there would be is the implication that authentication was 
> successful if that data were accessed.

Okay.

I wonder if there's any correlation between the IP that authenticated 
and the IP that accessed data.

> In Mailman 2 there's no PII data whatsoever except for email address 
> and (maybe) display name in the subscriber data.

I expect that either of those, the email address -or- the display name, 
is enough to count as PII.

I believe it's fair to say that people expect gtaylor (at) 
tnetconsulting (dot) net to reference a single person.  I also believe 
it's fair to say that most people expect most email addresses to be 
associated with one person.  The only exceptions to the rule are things 
like positional addresses: sales@ or info@ or webmaster@.

> I suppose you could put phone #s and junk like that in the display name, 
> but GDPR is more concerned with the database fields that might store 
> PII than the actual content.

1)  I'd consider the phone numbers in the display name to be a form of 
display name.
2)  *sigh*  It sounds like GDPR is talking about specific fields that 
could contain PII, even if they don't, while ignoring other fields that 
erroneously do contain PII.

> However, in Mailman 2 the various list passwords are shared, and would 
> not identify individuals in cases with multiple moderators or list owners.

IMHO that's an operational mis-step.  I get that it does happen.  But I 
think that it shouldn't.  People tend to share the root password on Unix 
too, despite multiple other options that make sharing unnecessary.

> Indeed.  The problem is identifying them if they do, since they can 
> just use normal filesystem operations from the shell, which are not 
> normally logged at all.

Where I've worked, it was assumed that if you had an ID on the box and 
file system level permission to access things then you effectively had 
accessed it.  —  If you can't prove that they didn't access the data, 
then you assume that they did access the data.

> In Mailman 3, we can configure databases like PostgreSQL, which I suppose 
> can log access to the subscriber databases, and which make it hard 
> (but not impossible) to access data via ordinary filesystem operations.

Having an RDBMS (et al) manage the files doesn't prevent file level 
access.  I can very likely still copy the DB file(s) and do my own thing 
with them to extract the data.

This is where (and why) DB encryption comes into play.  Though it doesn't 
help if a rogue admin has access to the decryption key through any 
method.  (This includes extracting it out of memory.)  }:-)

> However, I think that the issue here is basically moot.  You keep host 
> access logs to check for suspicious IP addresses (attempting to) log 
> in, and otherwise (for #2 and #3) you just give the list of all the 
> people who can access that data in the normal course of their duties.

Yep.

> I don't think the issue with logging is pinning down a particular access 
> to specific data, but rather determining who *could* access that data.

Yep. Yep.

> The relevant access might have been by a long-since fired engineer who 
> did a Snowden on your database.  How could you possibly know?

Yep. Yep. Yep.

> I don't understand the "exclude third party site hosters".  The GDPR 
> requirement is not to *limit* access, it's to *log* access.

I was trying to imply that companies would need to host their own list 
servers.  Meaning that they couldn't outsource it to 3rd party 
companies, which have their own host system administrators.

> I'm pretty sure they're referring to CRM-type databases where you track 
> customer interactions over time, linked by PII, and build up a profile. 
> One-off "for sale" posts wouldn't matter.  However, if this were a common 
> activity on the list, the *archives* might qualify as such a database.

~chuckle~

How many grains of sand does it take to make a pile?

IMHO none.  You just have to declare the pile's location.

> Sure, the point is to make it difficult for 3rd parties to discover 
> that history ex post.

Okay.  I want to make sure I'm understanding you correctly.  (Part of) 
GDPR is not about (just) knowing who has (had at the time) legitimate 
access to data, but additionally making it more difficult for other 3rd 
parties to gain access to the data in the future, because the data is 
removed from the corpus that the 3rd party is subsequently given access 
to.

> I don't think the legislators envisioned people invoking these rights 
> frivolously or maliciously (though I do :-/).

Agreed.

> Backups would need to be redacted as well, I suppose.

Um... that also presents a severe technical problem.  One that could 
impose large operational expenses.  Suppose a company contracts to store 
their backup tapes off site.  This means that they would need to recall 
the tapes that need to be redacted, redact them, and send the tapes back 
to the offsite storage.  This may involve an additional company that is 
simply the courier.  Let's not forget about the off site company's 
handling fees and the courier's fees.  Both ways for each tape.  Let's 
also throw in company policies that dictate that only X number of drives 
can be in transit or recalled at one time.  That's a logistical 
nightmare that could take more than a trivial amount of time to 
complete, at untold cost.  Ouch!
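
To put rough shape on that, here is a back-of-the-envelope Python 
sketch.  Every input is a made-up placeholder, and it assumes that fees 
are charged per tape in each direction and that batches go out one after 
another, which may not match any real storage contract.

```python
import math


def recall_plan(tapes, max_out, days_per_round_trip, handling_fee, courier_fee):
    """Estimate batches, elapsed days, and fees for recalling tapes.

    max_out is the policy limit on how many may be out of storage at once
    and must be at least 1.
    """
    batches = math.ceil(tapes / max_out)
    # Each tape is handled and couriered twice: out of storage and back.
    fees = tapes * 2 * (handling_fee + courier_fee)
    return batches, batches * days_per_round_trip, fees
```

For example, recall_plan(120, 10, 3, 5, 15) comes out to 12 batches, 
36 days, and 4800 in fees, before anyone has spent a single hour 
actually redacting anything.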

> I have no idea what you mean by "ongoing discovery".

Ah.

Let's say that Wile E. Coyote decides to sue Acme because of their bad 
products.  As soon as the lawsuit is initiated, chances are very good 
that Acme's lawyers will 1) tell them not to destroy any records or 2) tell 
Acme's IT staff that they can no longer rotate out any backups that may 
contain data pertinent to the lawsuit.  This is to facilitate the legal 
process of discovering evidence to be used in the case.  (Whether the 
evidence is for or against Mr. Coyote doesn't matter.)

I frequently hear this referred to as one of two things: 
"Litigation Hold" or "(Electronic) Discovery".  Discovery is the more 
common term and applies to more than just electronics.

> Not Mailman host's problem, assuming all subscribers have properly been 
> opted in and are allowed to opt out at will, as is normally the case.

What about that pesky time where the moderator hasn't approved the 
unsubscribe request?  (I think I remember seeing that option in Mailman.)

> Distributing content downstream is the purpose of the software, and 
> subscribers are aware of that.  The only edge cases I can imagine offhand 
> is the one discussed elsewhere in the thread, where a subscriber posts a 
> third party's information without permission, and possibly an open-post 
> list where the poster doesn't realize that it's open subscription/public 
> archives/whatever.

I think you misinterpreted what I was referring to.  Or I'm 
misinterpreting your reply.

I'm talking about 3rd party spam filtering services that are in the path 
downstream of Mailman, between Mailman and the recipient's server.  They 
collect logs / data all the time.  Usually those logs and that data are 
what help them be better at their job of spam filtering.

> Not Mailman host's problem.

Okay.

> Sure, but you probably won't like what the courts consider reasonable.

"reasonable" is always subject to deliberation.

Lawyers get paid to tell a judge that "It will cost $Company $50,000 
to recover the messages that $Plaintiff is requesting from 
$Defendant as part of their sunshine law request.  Here's why:

1)  We don't have a server that we can use so we must buy a low end 
machine.  (Legit, when there is only one mail server and the business 
can't be without mail for days / weeks.)
2)  We need another tape drive to do the restores.
3)  It will take $X number of (wo)man hours at $Y dollars per hour.
4)  We, $Defendant's lawyers, must go through the emails at $YYYYY 
dollars per hour to make sure there's nothing given out that's outside 
of the sunshine law request.
5)  You just expanded the scope of your discovery?  Well, now we need to 
increase #1 and #2 to go through the last 5 years of things in the next 
three weeks.  Also #3 and #4.  }:-)

So … the total bill for your sunshine request comes to just over 
$50,000.  Are you willing to pay that bill to get an answer to your 
question via a sunshine law request?"

Aside:  A sunshine law request is a request from a citizen to a 
governmental body for data that was arguably paid for by tax funding 
and on behalf of citizens, thus the citizen effectively owns the data in 
a roundabout way.  I don't know how widespread that is.

> You lock up the backups offline unless and until the court asks for them 
> or you actually need to restore.  That reasonably addresses the privacy 
> issue itself, and you're covered by the "essential to business purpose" 
> clause for the duration of the court order.

6)  We have to buy additional tapes to replace the tapes that are on 
Lit' Hold.
7)  We have to pay for more storage to accommodate #6.  (Or we have to 
pay someone to house the tapes in a secure manner.)

I digress.

-- 
Grant. . . .
unix || die