[Mailman-Developers] Basic workflow of the ARC implementation

Sun Mar 6 07:51:16 EST 2016

Aditya Divekar writes:

 > We need to generate a private and a public key for the signing purposes.
 > For testing purposes, and while working on the code, I can probably
 > generate the keys locally using the openssl tool.

In production, these keys are a *site* resource.  I see no reason why
we need to generate keys automatically in this case, and in fact many
sites will use their DKIM keys AIUI.

 > As a rough sketch of implementation,
 > 
 > 1. ARC Seal
 > 
 > The tags -
 > 
 > NOTE - Here we need to check the given message for any
 > pre-existing signatures in most of the fields. For this I think a
 > separate module can be created which can extract the previous ARC
 > headers, if they exist, from the message.  The code for this can
 > again be reused from the dkimpy package.

This should be trivial to do with the Python email package, too.  I
don't really see that a separate module would be useful, since we'll
want to extract a fixed set of headers (ARC- and DKIM-specified).  Of
course it should be factored into a separate function (or perhaps a
generic "extract_fields" function and a couple of derivatives: one
with the DKIM list (just DKIM-Signature?) and one with the ARC list
(ARC-Seal, ARC-Authentication-Results, and ARC-Message-Signature)).
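
A minimal sketch of that factoring, using only the standard email
package.  Only "extract_fields" is a name from the text above; the two
wrappers, the constants, and the sample message are hypothetical.

```python
from email import message_from_string

# Field lists as named in the text above.
ARC_FIELDS = ("ARC-Seal", "ARC-Authentication-Results", "ARC-Message-Signature")
DKIM_FIELDS = ("DKIM-Signature",)


def extract_fields(msg, names):
    """Return (name, value) pairs for the named fields, in message order."""
    wanted = {name.lower() for name in names}
    return [(name, value) for name, value in msg.items()
            if name.lower() in wanted]


def extract_arc_fields(msg):
    return extract_fields(msg, ARC_FIELDS)


def extract_dkim_fields(msg):
    return extract_fields(msg, DKIM_FIELDS)


# Made-up sample message, for illustration only.
raw = (
    "ARC-Seal: i=1; a=rsa-sha256; cv=N; d=example.org; s=sel; b=abc\n"
    "ARC-Message-Signature: i=1; a=rsa-sha256; d=example.org; s=sel; bh=xyz; b=def\n"
    "DKIM-Signature: v=1; a=rsa-sha256; d=example.com; s=sel; bh=xyz; b=ghi\n"
    "Subject: hello\n"
    "\n"
    "body\n"
)
msg = message_from_string(raw)
arc = extract_arc_fields(msg)
dkim = extract_dkim_fields(msg)
```

Field-name matching is case-insensitive here, as it is in the email
package itself.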

GSoC meta: You don't need to go as far as deciding the factoring of
this function in your proposal, but on the other hand it wouldn't
hurt.  It's low importance (see below for high-priority tasks).

Another factoring issue: should you "import dkim" (the module name
that the dkimpy package installs) and call dkim.foo, or should you
"from dkim import foo, bar, baz"?  (Doesn't need to be settled for a
while, and you can even try both at a small cost in time and effort.)

 > i: The value for this tag can be determined by performing a check
 > on the original signature and seeing if there were previous ARC
 > headers. If yes, we increment the value of the previous "i" by 1,
 > and if no, make it 1.

Note that the I-D (Internet-Draft) provides a specific algorithm for
canonicalizing i in the presence of missing i or duplicate i.
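
For comparison, the naive rule in the quoted text can be sketched as
below.  This is *not* the I-D's algorithm: it does nothing about
duplicate i values and silently skips a seal with no i.  The function
name and the regular expression are illustrative only.

```python
import re

# Matches an "i=<digits>" tag at the start of a tag list or after a ";".
_INSTANCE_RE = re.compile(r"(?:^|;)\s*i\s*=\s*(\d+)\s*(?:;|$)")


def next_instance(seal_values):
    """Naive rule: highest existing i plus one, or 1 if there is none."""
    instances = []
    for value in seal_values:
        match = _INSTANCE_RE.search(value)
        if match:
            instances.append(int(match.group(1)))
    return max(instances, default=0) + 1
```

Here seal_values is assumed to be the list of ARC-Seal field values
already extracted from the message.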

 > a: The value for this is fixed.
 > 
 > t,s,d: These can be obtained by using pre-existing dkimpy package.
 > 
 > b: Now we can compute the header hash using the dkimpy package
 > again by using the headers given here
 > <https://tools.ietf.org/html/draft-andersen-arc-02#section-5.1.1.3>.
 > Here, we call the dkimpy package and get the signature for the above
 > headers and then affix it to the "b=" tag.
 > 
 > cv: Use the same check as for "i": if there is already an ARC set,
 > i.e. i>1, then we make it "V", else "N".

OK.
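
Putting the Seal tags together, a rough sketch of assembling the
unsigned tag list might look like this.  Assumptions: "rsa-sha256" as
the fixed a= value, the tag order, and the function name are all
illustrative; the "V"/"N" values follow the quoted text; b= is left
empty for the signer to fill in.

```python
import time


def build_unsigned_seal(instance, domain, selector, timestamp=None):
    """Assemble ARC-Seal tags with an empty b=, ready to be signed."""
    if timestamp is None:
        timestamp = int(time.time())
    # Per the quoted text: "V" if a previous ARC set exists, else "N".
    cv = "V" if instance > 1 else "N"
    tags = [
        ("i", str(instance)),
        ("a", "rsa-sha256"),  # assumed value for the fixed algorithm tag
        ("t", str(timestamp)),
        ("cv", cv),
        ("d", domain),
        ("s", selector),
        ("b", ""),  # signature goes here after signing
    ]
    return "; ".join(f"{name}={value}" for name, value in tags)
```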

 > The ARC Seal gets computed here.
 > 
 > NOTE-For giving the "s" and "d" (selector and domain tag values),
 > we will need to produce records for these where the key can be
 > stored so that it is available for query by the verifiers (I still
 > have to look up this mechanism).

OK.

GSoC meta: you can get away without actually specifying the mechanism
in your proposal, but you should be able to say where you will look it
up when the time comes.  (ARC Seal will not necessarily be your first
milestone, and you can book up on standards "just in time".)  Knowing
where to look is medium priority IMO.  (IMO means "other mentors may
feel differently, pay more attention to comments they make than the
ones I make"!)

 > 2. ARC Message Signature
 > 
 > The tags -
 > 
 > i: The value for this tag is determined similar to the "i" tag for
 > the ARC Seal.

Isn't this inaccurate?  This "i" must *match* the Seal "i", no?
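
A check along those lines could be sketched as follows, assuming the
usual "tag=value; tag=value" syntax.  Both function names are
hypothetical, and the parser is deliberately simple.

```python
def parse_tags(value):
    """Parse a 'tag=value; tag=value' list into a dict."""
    tags = {}
    for item in value.split(";"):
        item = item.strip()
        if not item:
            continue
        # Split on the first "=" only, so base64 padding survives.
        name, _, tag_value = item.partition("=")
        tags[name.strip()] = tag_value.strip()
    return tags


def instances_match(seal_value, ams_value):
    """True if both fields carry an i= tag with the same value."""
    seal_i = parse_tags(seal_value).get("i")
    ams_i = parse_tags(ams_value).get("i")
    return seal_i is not None and seal_i == ams_i
```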

 > a,t,s,d: These can again be obtained from the dkimpy package.
 > 
 > bh: The body hash. This can be obtained from the package. Here, we set the
 > canonicalization to 'relaxed' and get the body hash.
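
For reference, here is a standalone sketch of the "relaxed" body
canonicalization from RFC 6376 (section 3.4.4) followed by the hash.
SHA-256 is an assumption, the function names are illustrative, and a
real implementation would rely on the DKIM library rather than this.

```python
import base64
import hashlib
import re


def relaxed_body(body: bytes) -> bytes:
    """Apply 'relaxed' body canonicalization (RFC 6376, section 3.4.4)."""
    lines = re.split(rb"\r?\n", body)
    # Drop whitespace at the end of each line, then collapse runs of
    # spaces and tabs inside the line to a single space.
    lines = [re.sub(rb"[ \t]+", b" ", line.rstrip(b" \t")) for line in lines]
    # Ignore all empty lines at the end of the body.
    while lines and lines[-1] == b"":
        lines.pop()
    if not lines:
        return b""
    return b"\r\n".join(lines) + b"\r\n"


def body_hash(body: bytes) -> str:
    """Base64 SHA-256 digest of the canonicalized body."""
    digest = hashlib.sha256(relaxed_body(body)).digest()
    return base64.b64encode(digest).decode("ascii")
```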

"The package" == dkimpy, right?

 > h: The "h=" header list is signed with the implicit list (as given
 > in the draft) and any explicit list that we want in addition.

 > Now, for mailing lists, the recommended headers are -
 > List-Id:List-Unsubscribe:List-Archive:List-Post:List-Help:List-Subscribe:Reply-To:
 > and any other fields added by the list, like Precedence, X-Topics,
 > and so on.
 > We will also sign the DKIM signature of the previous mail here if
 > available (suggested by the draft).
 > For this signing, we can use a modified version where the
 > FROZEN_HEADERS (headers that are signed by default) will specify
 > the implicit headers according to ARC specs (Another option could
 > be to get all the implicit+explicit headers signed by the package,
 > extract the h header, and modify it to include only the explicit
 > headers.)
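
One way to build the explicit part of that list is sketched below.
The field names come from the quoted list; the function name is
hypothetical, and listing only the fields actually present in the
message is a design choice of this sketch, not something the draft is
known to require.  Which implicit fields belong in h= is left to the
draft.

```python
from email import message_from_string

# Field names taken from the quoted list above.
LIST_FIELDS = (
    "List-Id", "List-Unsubscribe", "List-Archive", "List-Post",
    "List-Help", "List-Subscribe", "Reply-To",
)


def build_h_tag(msg, candidates):
    """Return a colon-separated h= value naming each candidate present in msg."""
    names = []
    seen = set()
    for name in candidates:
        key = name.lower()
        # msg.get() matches field names case-insensitively.
        if key not in seen and msg.get(name) is not None:
            seen.add(key)
            names.append(name)
    return ":".join(names)


# Made-up sample message, for illustration only.
msg = message_from_string(
    "List-Id: <test.example.org>\n"
    "Reply-To: test@example.org\n"
    "DKIM-Signature: v=1; d=example.com; s=sel; b=abc\n"
    "Subject: hello\n"
    "\n"
    "body\n"
)
h_value = build_h_tag(msg, LIST_FIELDS + ("DKIM-Signature",))
```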

 > b: The signature, calculated from the package. The h tag is used as
 > described above.
 > 
 > 
 > 3. ARC Authentication Results.
 > The "i" tag simply takes on the value same as the above "i" tags.
 > Now from our previous conversation, as you suggested, the
 > previous MTA that sent us the mail cannot necessarily be
 > trusted. So in the case where we don't trust the previous
 > MTA, we will have to perform our own DMARC, SPF, DKIM testing of
 > the received mail. If a previous ARC chain exists, i.e. cv=V, then
 > we perform the ARC test too.

I'm not sure you understood me.  We *always* verify the preceding
MTA's claims, even if we trust them, because of spoofing and
man-in-the-middle attacks on the Internet itself.  (More precisely,
verification is required by the I-D, and the rationale is spoofing and
MITM.)  The MTA we "may or may not trust" is our *own* MTA.

 > Now for performing the tests -

 > In one of the earlier mails, we discussed the use of the "authres"
 > package for generating the authentication results header. The
 > package conforms to the RFC7001 format, and now the format used is
 > RFC7601. But according to the changes that I verified, we can use
 > the package without any changes.  (The changes were mostly related
 > to extra specifications that are optional and can be skipped for
 > our purposes.)
 > So the "authres" package can be called here for generating the AAR.

Great!

 > If we need to perform the ARC test, then the module for that will
 > have to be implemented manually, though most of the code from the
 > package for DKIM verification can be reused.

Yes.

 > This is also the point where we detect if the mail is spam or not.

No.  ARC modifies the message, and therefore is a Handler, which does
not make decisions about disposition.  We provide our authentication
results to later Rules, probably by adding a field (or several) to
msg_data.  (This is a Mailman design policy, for detailed rationale
ask Barry.  The basic point is that passing ARC or even DMARC
certainly does not mean the message is not spam, and vice versa, even
if failure is a strong indication.)
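
In code, the division of labor might look roughly like this.  It
assumes the Mailman 3 handler shape (a "process" method taking mlist,
msg, and msgdata); the class name, the "arc_results" key, and the
check callables are hypothetical stand-ins.

```python
class ARCResultsHandler:
    """Handler-shaped sketch: records results, decides nothing."""

    name = "arc-results-sketch"

    def __init__(self, checks):
        # checks maps a method name ("dkim", "spf", "arc", ...) to a
        # callable that takes the message and returns a result string.
        self._checks = checks

    def process(self, mlist, msg, msgdata):
        # Store the results for later Rules; no disposition is chosen here.
        msgdata["arc_results"] = {
            method: check(msg) for method, check in self._checks.items()
        }


def arc_failed(msgdata):
    """What a later Rule might consult; the Rule, not the Handler, decides."""
    return msgdata.get("arc_results", {}).get("arc") == "fail"
```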

 > If the arc test fails, then there is something fishy here. DKIM,
 > DMARC, SPF may fail, but the failure of this test means the mail is
 > not authentic. At this point the mail should probably be discarded
 > (or any other measures that need to be taken).

No.  The mail cannot be proved authentic by ARC, but that may be due
to changes to the message at intervening hops.  There may be more
sophisticated tests (eg, a PGP signature on a MIME body) that can
prove it authentic.

 > Now coming to the testing part. There can be a number of tests like
 > verifying the generated ARC signature, changing the body of the
 > message, failing when the implicitly signed AMS headers are changed
 > and other such tests.

This is a little vague, but testing is hard.  You'll learn it as you
go along.
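
As one concrete shape the tamper tests could take, here is a sketch
with unittest.  Note the loud assumption: "sign" and "verify" are
HMAC stand-ins invented for this example, not ARC or DKIM signing;
the real tests would call the actual signing and verification code.

```python
import hashlib
import hmac
import unittest

KEY = b"test-only-key"  # stand-in secret, not a real signing key


def sign(headers, body):
    """Stand-in signer: HMAC over the header lines and the body."""
    header_block = "\r\n".join(f"{name}:{value}" for name, value in headers)
    data = header_block.encode("ascii") + b"\r\n\r\n" + body
    return hmac.new(KEY, data, hashlib.sha256).hexdigest()


def verify(headers, body, signature):
    return hmac.compare_digest(sign(headers, body), signature)


class TamperTests(unittest.TestCase):
    def setUp(self):
        self.headers = [("From", "a@example.com"), ("Subject", "hi")]
        self.body = b"hello\r\n"
        self.signature = sign(self.headers, self.body)

    def test_unmodified_message_verifies(self):
        self.assertTrue(verify(self.headers, self.body, self.signature))

    def test_changed_body_fails(self):
        self.assertFalse(verify(self.headers, b"goodbye\r\n", self.signature))

    def test_changed_signed_header_fails(self):
        headers = [("From", "a@example.com"), ("Subject", "changed")]
        self.assertFalse(verify(headers, self.body, self.signature))
```

Saved in a module, these run with "python -m unittest".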

 > Is this the workflow you were expecting to see, or should I write a
 > more explanatory draft?

What you have written so far is basically OK as far as the *work*flow
part goes.  However, you also need a *schedule* with *milestones*.

The *schedule* should present the Google deadlines, any times that you
will be out of communication (eg, because of travel), and any periods
of more than two days that you expect to be unable to work.  Google
expects you to treat the internship as a full-time job, so "working
time" is about 40 hours a week.  But if you lose two or more normal
working days, you should not expect to make them up by working at
times when you normally don't.  A few people can do that, most
can't.  Therefore, schedule as if you can't until you've proven you
can.  Also, many people can successfully work 4 days x 10 hours, but
more than 10 hours/day and you're pushing human limits.  So you should
be thinking in terms of a 5 day work week, or maybe 4 if you have a
really good reason or previous experience.

Finally, your schedule should include your milestones.  A *milestone*
is more than just a subgoal for your project.  It needs to be
objectively verifiable.  "ARC Seal code completed" is not a milestone.
"ARC Seal code completed, unit tests pass, merge request posted to
GitLab" is a milestone.  A better one would say "unit tests *pass with
100% branch coverage*."  Note that all of these are things you check
yourself and don't depend on others.  You can include things like
"review passed" and "code merged", but these depend on others so you
should leave lots of room for delay.  "Code completed" is not a
milestone because it doesn't specify the quality of the code.

I count about 9 well-defined tasks in your message that could be made
into good milestones.  You should try to come up with your own list,
but if you're really not happy with your list, send it to me and ask
for help and I'll give you my ideas.

In an ideal world with experienced GSoC students, we would like to see
approximately one milestone per week in your schedule.  More than
that and it becomes a theoretical exercise because you always end up
finding out that the right order to do the subtasks is different from
the one you scheduled.  Schedules are only useful if there's a
reasonable expectation you'll keep to them.  On the other hand you
should have more than two, because there should be one milestone
describing what you'll be reviewed on for the midterm, and another for
the final.  The point of having a schedule is so you'll know how good
your plan was and whether you'll be able to keep your promises.

Keep in mind that if you get in schedule trouble, you have four
options: (1) work harder, (2) get help, (3) renegotiate your
commitments for midterm and final, and (4) get lucky.  Don't depend on
#4 -- that's how students fail.  The amount of help you can get and
renegotiation you can do are limited, of course, but when #1 looks
likely to fail, #2 and #3 may be available.

 > (Also wherever I have mentioned the use of dkimpy, a lot of custom
 > implementation is needed to suit our requirements. )

You should say that when you first mention dkimpy.  In general this is
expected, though, so not terribly important.