[Distutils] Automation for creating, updating and destroying a TUF-secured PyPI mirror

Trishank Karthik Kuppusamy tk47 at students.poly.edu
Mon Apr 8 20:41:17 CEST 2013


Hello everyone,

I have been testing and refining the pypi.updateframework.com automation 
over the past week, and looking at how much TUF metadata is generated 
for PyPI.

In this email, I am going to focus only on the PyPI data under /simple; 
let us call that "simple data".

Now, if we assume that every developer will have her own key to sign the 
simple data for her package, then this is what the TUF metadata could 
look like:

metadata/targets.txt
====================
Delegation from the targets to the targets/simple role, with the former 
role being responsible for no target data because it has none of its own.

metadata/targets/simple.txt
===========================
Delegation from targets/simple to the targets/simple/packageI role, with 
the former role being responsible for one target datum: simple/index.html.

metadata/targets/simple/packageI.txt
====================================
The targets/simple/packageI role is responsible only for the simple data 
at simple/packageI/index.html.

In this upper bound case, where every developer is responsible for
signing her own package, the metadata size can be estimated as follows:

- metadata/targets.txt is, at most, a few KB, and can be safely
ignored.
- metadata/targets/simple/packageI.txt is about 1KB.
- metadata/targets/simple.txt is about the sum of all 
metadata/targets/simple/packageI.txt files. (This is a very rough estimate!)

Therefore, if we have 30,000 developer packages on PyPI (roughly the 
current number of packages), then we would have about 29 MB of 
metadata/targets/simple/packageI.txt, and another 29 MB of 
metadata/targets/simple.txt, for a rough total of 58MB. If PyPI has 45GB 
of total data (roughly what I saw from my last mirror), then the simple 
metadata is about 0.13% of total data size.

This may seem like a lot of metadata, but let us remember a few 
important things:

- So far, the metadata is simply uncompressed JSON. We are considering 
metadata compression or difference schemes.
- This assumes the upper bound case, where every package developer is
responsible for her own package, which means that we have to talk about
a lot of keys (random data).
- This is a one-time initial download cost. An update to PyPI is 
unlikely to change all the simple data; therefore, updates to the simple 
metadata will be cheap, because a TUF client would only download updated 
metadata. We could amortize the initial simple metadata download cost by 
distributing it with PyPI installers (e.g. pip).

Could we do better? Yes!

As Nick Coghlan has suggested, PyPI could begin adopting TUF by signing 
for all of the developer packages itself. This means that we could reuse 
a key for multiple developer packages instead of dedicating a key per 
package. The tradeoff here is that if one such "shared key" is 
compromised, then multiple packages (but not all of them) could be 
compromised.

Suppose, then, that we use a shared key to sign up to, say, 1,000
developer packages. To estimate the simple metadata size in this case,
let us first define some terms:

NP = # of developer packages
NPK = # of developer packages signed by a key
NR = # of roles (each responsible for NPK packages) = math.ceil(NP/NPK)
K = average key metadata size
D = average delegated role metadata size given one target path
P = average target path length
T = average simple target (index.html) metadata size

metadata/targets/simple.txt
===========================
Most of the metadata here deals with all of the keys, and the roles, 
used to sign simple data. Therefore, the size of the keys and roles 
metadata will dominate this file.

key metadata size = NR*K
role metadata size = NR*(D+NPK*P)

Takeaway: the lower the NPK (the number of developer packages signed by 
a key), then the higher the NR, and the larger the metadata. We would 
save metadata by setting NPK to, say, 1,000, because then one key could 
describe 1,000 packages.
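
As a rough illustration, the two formulas can be written as a small
Python function. The byte values chosen for K, D and P below are
placeholders that I am assuming for the sake of the example, not
measured averages, and the function name is hypothetical.

```python
import math

def simple_txt_size(NP, NPK, K, D, P):
    """Estimate the key and role metadata in metadata/targets/simple.txt.

    Returns (NR, key metadata size, role metadata size), sizes in bytes.
    """
    NR = int(math.ceil(NP / float(NPK)))
    key_size = NR * K
    role_size = NR * (D + NPK * P)
    return NR, key_size, role_size

# Assumed placeholder sizes in bytes; NOT measurements.
K, D, P = 500, 200, 30

for NPK in (1, 1000):
    NR, key_size, role_size = simple_txt_size(30000, NPK, K, D, P)
    print("NPK=%d NR=%d keys=%d bytes roles=%d bytes"
          % (NPK, NR, key_size, role_size))
```

Note that the NR*NPK*P term stays at roughly NP*P whatever the NPK,
because every target path must still be listed once. What shrinks as
NPK grows is the per-role cost, NR*(K+D).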

metadata/targets/simple/roleI.txt
=================================
When NPK=1, then this file would be equivalent to 
metadata/targets/simple/packageI.txt.

It is a small metadata file if we assume that it only talks about the 
simple data (index.html) for one package. Most of the metadata talks 
about key signatures, and target metadata. If we increase NPK, then 
clearly the target metadata would increase in size:

target metadata size = NPK*T < NPK*1KB

Takeaway: the target metadata would increase in size, but it certainly 
will not increase as much as it would have if we had signed each 
developer package with a separate key.

Finally, the question is how the savings in metadata/targets/simple.txt
would compare to the "growth" of the metadata/targets/simple/roleI.txt
files. Ultimately, the higher the NPK (and thus the lower the NR), the
less metadata we would spend on keys (random data). Everything else
would remain about the same, because there would still be the same
number of targets, and thus the same amount of target metadata. So, we
would have net savings.
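
To make this comparison concrete, here is a minimal sketch that adds up
simple.txt and all of the roleI.txt files for a few values of NPK. It
uses the terms defined earlier plus one extra term of my own, S, an
assumed fixed per-file overhead (signatures and so on) for each
roleI.txt. All byte values are placeholders, not measurements.

```python
import math

def total_simple_metadata(NP, NPK, K, D, P, T, S):
    """Estimated total bytes of simple.txt plus all roleI.txt files.

    S is a hypothetical per-roleI.txt overhead (e.g. signatures); it is
    an assumption added for this sketch.
    """
    NR = int(math.ceil(NP / float(NPK)))
    simple_txt = NR * K + NR * (D + NPK * P)   # keys + roles
    role_files = NR * (S + NPK * T)            # overhead + targets
    return simple_txt + role_files

# Assumed placeholder sizes in bytes; NOT measurements.
sizes = dict(K=500, D=200, P=30, T=300, S=400)

for NPK in (1, 10, 100, 1000):
    total = total_simple_metadata(30000, NPK, **sizes)
    print("NPK=%4d  total ~ %.1f MB" % (NPK, total / (1024.0 * 1024.0)))
```

Under these assumed sizes the total falls steadily as NPK grows and
levels off near NP*(P+T), the part that depends only on the number of
targets.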

I hope this clears up some questions about metadata size. If anything
was confusing because I did not explain it well enough, or if I got
something wrong, please be sure to let me know. My machine is nearly
done generating all the simple metadata, so we can make better estimates
then.

-Trishank


