Basic workflow of the ARC implementation
Aditya Divekar writes:
We need to generate a private and a public key for signing. For testing, and while working on the code, I can probably generate the keys locally using the openssl tool.
In production, these keys are a *site* resource. I see no reason why we need to generate keys automatically in this case, and in fact many sites will use their DKIM keys AIUI.
As a rough sketch of implementation,
- ARC Seal
The tags -
NOTE - Here we need to check the given message for pre-existing signatures in most of the fields. For this I think a separate module can be created to extract any previous ARC headers from the message. The code for this can again be reused from the dkimpy package.
This should be trivial to do with the Python email package, too. I don't really see that a separate module would be useful, since we'll want to extract a fixed set of headers (ARC- and DKIM-specified). Of course it should be factored into a separate function (or perhaps a generic "extract_fields" function and a couple of derivatives with the DKIM list (just DKIM-Signature?) and the ARC list (ARC-Seal, ARC-Authentication-Results, and ARC-Message-Signature)).
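A minimal sketch of that factoring, using only the standard library email package. The names extract_arc_fields and extract_dkim_fields and the sample message are illustrative assumptions, not code from dkimpy or Mailman:

```python
from email import message_from_string

# Field lists taken from the discussion above.
ARC_FIELDS = ("ARC-Seal", "ARC-Message-Signature", "ARC-Authentication-Results")
DKIM_FIELDS = ("DKIM-Signature",)


def extract_fields(msg, names):
    """Return {field name: [values in message order]}; absent fields map to []."""
    return {name: msg.get_all(name, []) for name in names}


def extract_arc_fields(msg):
    # Hypothetical derivative for the ARC header set.
    return extract_fields(msg, ARC_FIELDS)


def extract_dkim_fields(msg):
    # Hypothetical derivative for DKIM.
    return extract_fields(msg, DKIM_FIELDS)


# Made-up sample message with one ARC-Seal and one DKIM-Signature.
raw = (
    "ARC-Seal: i=1; a=rsa-sha256; cv=N; d=example.org; s=sel; b=abc\n"
    "DKIM-Signature: v=1; a=rsa-sha256; d=example.com; s=sel; b=xyz\n"
    "From: someone@example.com\n"
    "Subject: test\n"
    "\n"
    "Hello\n"
)
msg = message_from_string(raw)
arc = extract_arc_fields(msg)
dkim = extract_dkim_fields(msg)
```

An empty list for a field doubles as the "not found" signal, so no separate flag is needed.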
GSoC meta: You don't need to go as far as deciding the factoring of this function in your proposal, but on the other hand it wouldn't hurt. It's low importance (see below for high-priority tasks).
Another factoring issue: should you "import dkimpy" and call dkimpy.foo, or should you "from dkimpy import foo, bar, baz"? (Doesn't need to be settled for a while, and you can even try both at a small cost in time and effort.)
i: The value for this tag can be determined by performing a check on the original signature and seeing if there were previous ARC headers. If yes, we increment the value of the previous "i" by 1, and if no, make it 1.
Note that the I-D (Internet-Draft) provides a specific algorithm for canonicalizing i in the presence of missing i or duplicate i.
a: The value for this is fixed.
t,s,d: These can be obtained by using the pre-existing dkimpy package.
b: Now we can compute the header hash using the dkimpy package again by using the headers given here <https://tools.ietf.org/html/draft-andersen-arc-02#section-5.1.1.3>. Here, we call the dkimpy package, get the signature for the above headers, and then put it in the "b=" tag.
cv: Use the same check as "i": if there is already an ARC set, i.e. i>1, we set it to "V", else "N".
OK.
The ARC Seal gets computed here.
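The "i" and "cv" rules described above can be sketched as follows. This is a simplification under stated assumptions: it takes the highest "i" found in the existing ARC-Seal values and does not implement the I-D's algorithm for missing or duplicate instances, and it sets "V" without actually validating the earlier chain. The function names are hypothetical.

```python
import re

# Matches an "i=<digits>" tag at the start of the value or after a ";".
_INSTANCE_RE = re.compile(r"(?:^|;)\s*i\s*=\s*(\d+)\s*(?:;|$)")


def instance_of(header_value):
    """Return the integer value of the i= tag, or None if there is none."""
    match = _INSTANCE_RE.search(header_value)
    return int(match.group(1)) if match else None


def next_instance(arc_seal_values):
    """Return 1 if there are no earlier seals, else the highest i plus 1."""
    instances = [i for i in map(instance_of, arc_seal_values) if i is not None]
    return max(instances) + 1 if instances else 1


def chain_validation_tag(instance):
    # ASSUMPTION: mirrors the rule as stated in the thread ("V" when i > 1).
    # A real implementation must validate the earlier chain first.
    return "V" if instance > 1 else "N"
```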
NOTE - To provide the "s" and "d" (selector and domain tag values), we will need to publish records where the key can be stored so that verifiers can query it (I still have to look up this mechanism).
OK.
GSoC meta: you can get away without actually specifying the mechanism in your proposal, but you should be able to say where you will look it up when the time comes. (ARC Seal will not necessarily be your first milestone, and you can book up on standards "just in time".) Knowing where to look is medium priority IMO. (IMO means "other mentors may feel differently, pay more attention to comments they make than the ones I make"!)
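For orientation only: DKIM (RFC 6376) publishes public keys as DNS TXT records at "selector._domainkey.domain". The sketch below assumes ARC reuses that mechanism unchanged, which should be confirmed against the I-D. The function names are hypothetical.

```python
def key_record_name(selector, domain):
    """DNS name of the TXT record holding the public key (DKIM convention)."""
    return f"{selector}._domainkey.{domain}"


def key_record_value(public_key_b64):
    """TXT record content for an RSA key; p= carries the base64 public key."""
    return f"v=DKIM1; k=rsa; p={public_key_b64}"
```

For example, selector "arc" and domain "lists.example.org" give the record name "arc._domainkey.lists.example.org".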
- ARC Message Signature
The tags -
i: The value for this tag is determined similarly to the "i" tag for the ARC Seal.
Isn't this inaccurate? This "i" must *match* the Seal "i", no?
a,t,s,d: These can again be obtained from the dkimpy package.
bh: The body hash. This can be obtained from the package. Here, we set the canonicalization to 'relaxed' and get the body hash.
"The package" == dkimpy, right?
h: The "h=" header list is signed with the implicit list (as given in the draft) and any explicit list that we want in addition.
Now, for mailing lists, the recommended headers are List-Id:List-Unsubscribe:List-Archive:List-Post:List-Help:List-Subscribe:Reply-To: and any other fields added by the list, such as Precedence, XTopics, or others. We will also sign the DKIM signature of the previous mail here if available (suggested by the draft). For this signing, we can use a modified version where the FROZEN_HEADERS (headers that are signed by default) specify the implicit headers according to the ARC specs. (Another option could be to get all the implicit+explicit headers signed by the package, extract the h header, and modify it to include only the explicit headers.)
b: The signature, calculated from the package. The h tag is used as described above.
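To make the "bh" step concrete, here is a standard-library sketch of the "relaxed" body canonicalization from RFC 6376 followed by a SHA-256 hash. It is a stand-in for illustration, not dkimpy's code, and it assumes the fixed "a" value is rsa-sha256. The function names are hypothetical.

```python
import base64
import hashlib
import re


def relaxed_body(body: bytes) -> bytes:
    """Apply RFC 6376 'relaxed' body canonicalization."""
    lines = re.split(rb"\r?\n", body)
    # Collapse runs of spaces/tabs to one space, then drop trailing whitespace.
    lines = [re.sub(rb"[ \t]+", b" ", line).rstrip(b" ") for line in lines]
    # Ignore all empty lines at the end of the body.
    while lines and lines[-1] == b"":
        lines.pop()
    # A non-empty body ends with exactly one CRLF; an empty body stays empty.
    return b"".join(line + b"\r\n" for line in lines)


def body_hash(body: bytes) -> str:
    """Base64 of the SHA-256 digest of the canonicalized body (the bh= value)."""
    digest = hashlib.sha256(relaxed_body(body)).digest()
    return base64.b64encode(digest).decode("ascii")
```

Bodies that differ only in whitespace runs or trailing blank lines hash to the same value, which is the point of choosing 'relaxed' for list traffic.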
- ARC Authentication Results. The "i" tag simply takes the same value as the above "i" tags. Now from our previous conversation, as you suggested, the previous MTA that sent us the mail cannot necessarily be trusted. So in the case where we don't trust the previous MTA, we will have to perform our own DMARC, SPF, DKIM testing of the received mail. If a previous ARC chain exists, i.e. cv=V, then we perform the ARC test too.
I'm not sure you understood me. We *always* verify the preceding MTA's claims, even if we trust them, because of spoofing and man-in-the-middle attacks on the Internet itself. (More precisely, verification is required by the I-D, and the rationale is spoofing and MITM.) The MTA we "may or may not trust" is our *own* MTA.
Now for performing the tests -
In one of the earlier mails, we discussed the use of the "authres" package for generating the authentication results header. The package conforms to the RFC7001 format, and the format now in use is RFC7601. But according to the changes that I verified, we can use the package without any changes. (The changes were mostly related to extra specifications that are optional, and can be skipped for our purposes.) So the "authres" package can be called here for generating the AAR.
Great!
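To show the rough shape of the field being generated, here is a hand-rolled formatter. It is deliberately not the authres API; it is a standard-library stand-in. The layout (the "i=" tag first, then the authserv-id, then one "method=result" clause per test) is an assumption to be checked against the I-D, and format_aar is a hypothetical name.

```python
def format_aar(instance, authserv_id, results):
    """Build an ARC-Authentication-Results field.

    results is a list of (method, result, properties) tuples, where
    properties is a dict such as {"smtp.mailfrom": "example.net"}.
    """
    parts = [f"i={instance}", authserv_id]
    for method, result, props in results:
        clause = f"{method}={result}"
        for key, value in props.items():
            clause += f" {key}={value}"
        parts.append(clause)
    return "ARC-Authentication-Results: " + "; ".join(parts)


header = format_aar(
    1,
    "lists.example.org",
    [
        ("spf", "pass", {"smtp.mailfrom": "example.net"}),
        ("dkim", "fail", {"header.d": "example.net"}),
    ],
)
```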
If we need to perform the ARC test, then the module for that will have to be implemented manually. Though most of the code from the package for DKIM verification can be used.
Yes.
This is also the point where we detect if the mail is spam or not.
No. ARC modifies the message, and therefore is a Handler, which does not make decisions about disposition. We provide our authentication results to later Rules, probably by adding a field (or several) to msg_data. (This is a Mailman design policy, for detailed rationale ask Barry. The basic point is that passing ARC or even DMARC certainly does not mean the message is not spam, and vice versa, even if failure is a strong indication.)
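As a sketch of "adding a field to msg_data": the function below only records results in a plain dict and makes no disposition decision. The key name "auth_results" and the function name are hypothetical, and the dict stands in for Mailman's message metadata.

```python
def record_auth_results(msgdata, results):
    """Merge authentication results into msgdata for later rules to read."""
    # ASSUMPTION: "auth_results" is an illustrative key, not a Mailman one.
    msgdata.setdefault("auth_results", {}).update(results)
    return msgdata


msgdata = {}
record_auth_results(msgdata, {"spf": "pass", "dkim": "fail"})
record_auth_results(msgdata, {"arc": "none"})
```

A later rule can then read msgdata["auth_results"] and apply whatever policy the site has chosen.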
If the arc test fails, then there is something fishy here. DKIM, DMARC, SPF may fail, but the failure of this test means the mail is not authentic. At this point the mail should probably be discarded (or any other measures that need to be taken).
No. The mail cannot be proved authentic by ARC, but that may be due to changes to the message at intervening hops. There may be more sophisticated tests (eg, a PGP signature on a MIME body) that can prove it authentic.
Now coming to the testing part. There can be a number of tests like verifying the generated ARC signature, changing the body of the message, failing when the implicitly signed AMS headers are changed and other such tests.
This is a little vague, but testing is hard. You'll learn it as you go along.
Is this the workflow you were expecting to see, or should I write a more explanatory draft?
What you have written so far is basically OK as far as the *work*flow part goes. However, you also need a *schedule* with *milestones*.
The *schedule* should present the Google deadlines, any times that you will be out of communication (eg, because of travel), and any periods of more than two days that you expect to be unable to work. Google expect you to treat the internship as a fulltime job, so "working time" is like 40 hours a week. But you should not expect to be able to make up two days or more that you would normally be working by working times that you normally don't. A few people can do that, most can't. Therefore, schedule as if you can't until you've proven you can. Also, many people can successfully work 4 days x 10 hours, but more than 10 hours/day and you're pushing human limits. So you should be thinking in terms of a 5 day work week, or maybe 4 if you have really good reason or previous experience.
Finally, your schedule should include your milestones. A *milestone* is more than just a subgoal for your project. It needs to be objectively verifiable. "ARC Seal code completed" is not a milestone. "ARC Seal code completed, unit tests pass, merge request posted to GitLab" is a milestone. A better one would say "unit tests *pass with 100% branch coverage*." Note that all of these are things you check yourself and don't depend on others. You can include things like "review passed" and "code merged", but these depend on others so you should leave lots of room for delay. "Code completed" is not a milestone because it doesn't specify the quality of the code.
I count about 9 well-defined tasks in your message that could be made into good milestones. You should try to come up with your own list, but if you're really not happy with your list, send it to me and ask for help and I'll give you my ideas.
In an ideal world with experienced GSoC students, we would like to see approximately one milestone per week in your schedule. More than that and it becomes a theoretical exercise because you always end up finding out that the right order to do the subtasks is different from the one you scheduled. Schedules are only useful if there's a reasonable expectation you'll keep to them. On the other hand you should have more than two, because there should be one milestone describing what you'll be reviewed on for the midterm, and another for the final. The point of having a schedule is so you'll know how good your plan was and whether you'll be able to keep your promises.
Keep in mind that if you get in schedule trouble, you have four options: (1) work harder, (2) get help, (3) renegotiate your commitments for midterm and final, and (4) get lucky. Don't depend on #4 -- that's how students fail. The amount of help you can get and renegotiation you can do are limited, of course, but when #1 looks likely to fail, #2 and #3 may be available.
(Also, wherever I have mentioned the use of dkimpy, a lot of custom implementation is needed to suit our requirements.)
You should say that when you first mention dkimpy. In general this is expected, though, so not terribly important.
Hi Steve!
This should be trivial to do with the Python email package, too. I don't really see that a separate module would be useful, since we'll want to extract a fixed set of headers (ARC- and DKIM-specified). Of course it should be factored into a separate function (or perhaps a generic "extract_fields" function and a couple of derivatives with the DKIM list (just DKIM-Signature?) and the ARC list (ARC-Seal, ARC-Authentication-Results, and ARC-Message-Signature)).
Depending on the message, if it has previous Authentication Results added as a header, that can be extracted too. The entire message can be parsed, and then all the possible headers involved in the authentication process, i.e. DKIM signature, Authentication results, and ARC headers, can be extracted. If a header is not found, a suitable flag can be set for it.
For example, if no previous ARC headers were found, a flag can be set. This can later be used in deciding the flow of the mail, such as the "i" tag value, whether we need to perform ARC authentication for the previous ARC headers, and other fields that depend on the occurrence of any previous ARC set.
Another factoring issue: should you "import dkimpy" and call dkimpy.foo, or should you "from dkimpy import foo, bar, baz"? (Doesn't need to be settled for a while, and you can even try both at a small cost in time and effort.)
From what I've read, I should use "from ... import ..." when only a few methods (around 3-4) are required from the package, the entire package is not needed, and naming conflicts are watched for. But yes, I can settle it later during the coding part :)
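The two styles side by side, shown with a standard-library module so the snippet runs anywhere. The same trade-off applies to dkimpy, whose importable module is named dkim.

```python
# Style 1: import the module and qualify each call; the origin is always visible.
import email.utils

# Style 2: import selected names; shorter call sites, but the names now live
# in this module's namespace and can collide with local definitions.
from email.utils import parseaddr

addr = "Aditya <aditya@example.com>"
qualified = email.utils.parseaddr(addr)
direct = parseaddr(addr)
```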
Isn't this inaccurate? This "i" must *match* the Seal "i", no?
Yes, I meant the same thing. The "i" tag for all three of the AAR, the AMS and the AS will be the same, and will be one higher than the previous instance of the "i" tag. If there is no previous instance, it will be 1.
a,t,s,d: These can again be obtained from the dkimpy package.
bh: The body hash. This can be obtained from the package. Here, we set the canonicalization to 'relaxed' and get the body hash.
"The package" == dkimpy, right?
Yes.
- ARC Authentication Results. The "i" tag simply takes the same value as the above "i" tags. Now from our previous conversation, as you suggested, the previous MTA that sent us the mail cannot necessarily be trusted. So in the case where we don't trust the previous MTA, we will have to perform our own DMARC, SPF, DKIM testing of the received mail. If a previous ARC chain exists, i.e. cv=V, then we perform the ARC test too.
I'm not sure you understood me. We *always* verify the preceding MTA's claims, even if we trust them, because of spoofing and man-in-the-middle attacks on the Internet itself. (More precisely, verification is required by the I-D, and the rationale is spoofing and MITM.) The MTA we "may or may not trust" is our *own* MTA.
Yes. So for every mail we receive, we always perform the authentication tests for SPF, DKIM, and DMARC (and ARC if present).
No. The mail cannot be proved authentic by ARC, but that may be due to changes to the message at intervening hops. There may be more sophisticated tests (eg, a PGP signature on a MIME body) that can prove it authentic.
Okay! So we only need to add the ARC headers and forward the message to the subscribers, leaving it up to their MTAs to act on them.
Now coming to the testing part. There can be a number of tests like verifying the generated ARC signature, changing the body of the message, failing when the implicitly signed AMS headers are changed and other such tests.
This is a little vague, but testing is hard. You'll learn it as you go along.
I will come up with at least a few concrete tests that need to be performed for each of the modules given below (in the milestones), and run them by you once before the proposal.
I count about 9 well-defined tasks in your message that could be made into good milestones. You should try to come up with your own list, but if you're really not happy with your list, send it to me and ask for help and I'll give you my ideas.
I have come up with the following probable milestones for the project. (The project has been divided into milestones on the basis of the separate modules that will be created. Each module is a milestone.)
- ARC Authentication Result - spf verification code completed. tests passed. merge request created.
- ARC Authentication Result - dkim verification code completed. tests passed. merge request created.
- ARC Authentication Result - dmarc verification code completed. tests passed. merge request created.
- ARC Authentication Result - arc verification code completed. tests passed. merge request created.
- ARC Authentication Result - generate AAR from the previous milestones code. tests passed. merge request created.
- ARC Message Signature code completed. tests passed. merge request created.
- ARC Seal code completed. tests passed. merge request created.
- Generate the ARC set of headers from the previous milestones code, and prepend them to the message. tests passed. merge request created.
*As you mentioned, branch coverage will be the aim behind all the tests for each module (Branch coverage would mean considering all possible scenarios of the workflow).
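The final milestone, prepending the generated ARC set, can be sketched on the raw message bytes. The function name, the placeholder header values, and the order in which the three fields are passed are all assumptions here. The I-D should be consulted for the required ordering.

```python
def prepend_headers(raw_message: bytes, headers):
    """Return raw_message with the given (name, value) fields placed on top.

    The first pair in headers becomes the first line of the result.
    """
    block = b"".join(
        f"{name}: {value}\r\n".encode("ascii") for name, value in headers
    )
    return block + raw_message


original = b"From: a@example.com\r\nSubject: hi\r\n\r\nbody\r\n"
# Placeholder values only; real ones come from the earlier milestones.
arc_set = [
    ("ARC-Seal", "i=1; a=rsa-sha256; cv=N; d=example.org; s=sel; b=..."),
    ("ARC-Message-Signature", "i=1; a=rsa-sha256; d=example.org; s=sel; bh=...; b=..."),
    ("ARC-Authentication-Results", "i=1; lists.example.org; spf=pass"),
]
result = prepend_headers(original, arc_set)
```

Working on bytes leaves the existing header block and body untouched, which matters because earlier signatures cover them.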
Notes -
I've broken down the AAR set into different milestones, since each method will require the use of different functions and packages. (ie. spf, dkim, dmarc, arc).
Regarding milestone 8, separate modules will be responsible for generating the components of the ARC header set, and these can be combined at the end for getting the complete ARC set. This will be useful for testing purposes.
We need to perform the DMARC testing manually since the gs.dmarc package only provides the DMARC policy query functionality. The gs.dmarc package can be used to query the DMARC record of the RFC5322.From domain. Then, using the aspf and adkim relaxed/strict tag values and the SPF/DKIM results, we can determine whether the DMARC authentication is a pass or fail, as given in RFC7489 (the DMARC specification). Do you know of a better alternative?
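A rough sketch of that manual check, under loud simplifications: the organizational domain is taken to be the last two labels (a real implementation needs the Public Suffix List), only one DKIM result is considered, and the published policy (p=) is ignored. All names are hypothetical.

```python
def org_domain(domain):
    # ASSUMPTION: last two labels only; wrong for suffixes such as co.uk.
    return ".".join(domain.lower().rstrip(".").split(".")[-2:])


def aligned(from_domain, auth_domain, mode):
    """mode is 's' (strict: exact match) or 'r' (relaxed: same org domain)."""
    from_domain = from_domain.lower().rstrip(".")
    auth_domain = auth_domain.lower().rstrip(".")
    if mode == "s":
        return from_domain == auth_domain
    return org_domain(from_domain) == org_domain(auth_domain)


def dmarc_result(from_domain, spf, dkim, aspf="r", adkim="r"):
    """spf and dkim are (result, authenticated domain) tuples.

    DMARC passes if at least one mechanism both passes and is aligned
    with the RFC5322.From domain.
    """
    spf_ok = spf[0] == "pass" and aligned(from_domain, spf[1], aspf)
    dkim_ok = dkim[0] == "pass" and aligned(from_domain, dkim[1], adkim)
    return "pass" if spf_ok or dkim_ok else "fail"
```

The aspf and adkim values would come from the record returned by the gs.dmarc query, with "r" used when a tag is absent.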
I would like your opinion on these milestones, and if possible your ideas can be merged with these to come up with a better list :)
Thanks.
Aditya.
participants (2): Aditya Divekar, Stephen J. Turnbull