Automation for creating, updating and destroying a TUF-secured PyPI mirror
Hello PyPI,
We hope attendees had a great time at PyCon 2013! We certainly enjoyed presenting our lightning talk on securing PyPI with TUF (https://www.youtube.com/watch?v=2sx1lS6cT3g).
Since then, we have been busy improving TUF and implementing machinery to automatically secure PyPI with TUF: https://github.com/dachshund/pypi.updateframework.com
You may also have noticed that the root metadata for our prototype mirror of PyPI+TUF expired yesterday. This aligns nicely with our plan to replace our hand-maintained PyPI+TUF mirror with the automatic one. We expect to have it ready very soon; until then, we welcome your first impressions of our automation. You could try it on your machine right away!
Finally, we are working continuously on improving TUF, especially on ensuring that the metadata scales with the data. We welcome your feedback on these issues and more (https://github.com/akonst/tuf/issues?state=open).
-Trishank
Hello everyone,
I have been testing and refining the pypi.updateframework.com automation over the past week, and looking at how much TUF metadata is generated for PyPI.
In this email, I am going to focus only on the PyPI data under /simple; let us call that "simple data".
Now, if we assume that every developer will have her own key to sign the simple data for her package, then this is what the TUF metadata could look like:
metadata/targets.txt
====================
Delegation from the targets role to the targets/simple role, with the former role being responsible for no target data because it has none of its own.
metadata/targets/simple.txt
===========================
Delegation from targets/simple to the targets/simple/packageI role, with the former role being responsible for one target datum: simple/index.html.
metadata/targets/simple/packageI.txt
====================================
The targets/simple/packageI role is responsible only for the simple data at simple/packageI/index.html.
In this upper bound case, where every developer is responsible for signing her own package, one can estimate the metadata size like so:
- metadata/targets.txt is, at most, a few KB, and can be safely ignored.
- metadata/targets/simple/packageI.txt is about 1KB.
- metadata/targets/simple.txt is about the sum of all metadata/targets/simple/packageI.txt files. (This is a very rough estimate!)
Therefore, if we have 30,000 developer packages on PyPI (roughly the current number of packages), then we would have about 29MB of metadata/targets/simple/packageI.txt files, and another 29MB of metadata/targets/simple.txt, for a rough total of 58MB. If PyPI has 45GB of total data (roughly what I saw from my last mirror), then the simple metadata is about 0.13% of total data size.
This may seem like a lot of metadata, but let us remember a few important things:
- So far, the metadata is simply uncompressed JSON. We are considering metadata compression or difference schemes.
- This assumes the upper bound case, where every package developer is responsible for her own package, which means that we have to talk about a lot of keys (random data).
- This is a one-time initial download cost. An update to PyPI is unlikely to change all the simple data; therefore, updates to the simple metadata will be cheap, because a TUF client would only download updated metadata. We could amortize the initial simple metadata download cost by distributing it with PyPI installers (e.g. pip).
Could we do better? Yes!
As Nick Coghlan has suggested, PyPI could begin adopting TUF by signing for all of the developer packages itself. This means that we could reuse a key for multiple developer packages instead of dedicating a key per package. The tradeoff here is that if one such "shared key" is compromised, then multiple packages (but not all of them) could be compromised.
In this case, where we use a shared key to sign up to, say, 1,000 developer packages, we would have the following simple metadata size. First, let us define some terms:
NP = # of developer packages
NPK = # of developer packages signed by a key
NR = # of roles (each responsible for NPK packages) = math.ceil(NP/NPK)
K = average key metadata size
D = average delegated role metadata size given one target path
P = average target path length
T = average simple target (index.html) metadata size
metadata/targets/simple.txt
===========================
Most of the metadata here deals with all of the keys, and the roles, used to sign simple data. Therefore, the size of the keys and roles metadata will dominate this file.
key metadata size = NR*K
role metadata size = NR*(D+NPK*P)
Takeaway: the lower the NPK (the number of developer packages signed by a key), the higher the NR, and the larger the metadata. We would save metadata by setting NPK to, say, 1,000, because then one key could describe 1,000 packages.
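A minimal Python sketch of the two formulas above, showing how the keys and roles portion of metadata/targets/simple.txt changes with NPK. The byte values chosen for K, D and P are placeholders for illustration only, not measurements, and the function name is hypothetical.

```python
import math

def simple_txt_size(NP, NPK, K, D, P):
    """Estimate the keys + roles metadata in metadata/targets/simple.txt.

    Uses the definitions from the message:
      key metadata size  = NR*K
      role metadata size = NR*(D + NPK*P)
    """
    NR = math.ceil(NP / NPK)  # number of roles, one key per role
    key_size = NR * K
    role_size = NR * (D + NPK * P)
    return NR, key_size, role_size

# Placeholder averages in bytes (assumptions, not measured values).
K, D, P = 500, 100, 40
NP = 30000  # roughly the number of packages cited in the message

for NPK in (1, 100, 1000):
    NR, key_size, role_size = simple_txt_size(NP, NPK, K, D, P)
    print(f"NPK={NPK:>4}  NR={NR:>5}  keys={key_size:>9} B  roles={role_size:>9} B")
```

Whatever values are plugged in, the NR*NPK*P term stays close to NP*P, so the savings come from the NR*K and NR*D terms, which shrink as NR falls.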
metadata/targets/simple/roleI.txt
=================================
When NPK=1, this file would be equivalent to metadata/targets/simple/packageI.txt. It is a small metadata file if we assume that it only talks about the simple data (index.html) for one package. Most of the metadata talks about key signatures and target metadata. If we increase NPK, then clearly the target metadata would increase in size:
target metadata size = NPK*T < NPK*1KB
Takeaway: the target metadata would increase in size, but it certainly will not increase as much as it would have if we had signed each developer package with a separate key.
Finally, the question is how the savings in metadata/targets/simple.txt would compare to the "growth" of the metadata/targets/simple/roleI.txt files. Ultimately, the higher the NPK (and thus the lower the NR), the less we would be talking about keys (random data). Everything else would remain the same, because there would still be the same number of targets, and thus the same amount of target metadata. So, we would have net savings.
I hope this clears up some questions about metadata size. If there was something confusing because I did not explain it well enough or I got something wrong, please be sure to let me know. My machine is nearly done generating all the simple metadata, so we can make better estimates then.
-Trishank
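As a rough check of the upper-bound arithmetic earlier in this message, the following Python sketch recomputes the totals from the stated inputs (30,000 packages, about 1KB per packageI.txt file, 45GB of data). It assumes binary units (1MB = 1024KB), which the message does not specify.

```python
# Inputs taken from the message; binary units are an assumption.
NUM_PACKAGES = 30000
PER_PACKAGE_KB = 1      # approximate size of each packageI.txt
TOTAL_DATA_GB = 45

per_package_files_mb = NUM_PACKAGES * PER_PACKAGE_KB / 1024
# simple.txt is estimated as the sum of the per-package files.
simple_txt_mb = per_package_files_mb
total_mb = per_package_files_mb + simple_txt_mb
share = total_mb / (TOTAL_DATA_GB * 1024)

print(f"packageI.txt files: {per_package_files_mb:.1f} MB")
print(f"simple.txt:         {simple_txt_mb:.1f} MB")
print(f"total:              {total_mb:.1f} MB ({share:.2%} of the data)")
```

Summing before rounding gives about 58.6MB, slightly above the 58MB obtained by adding the two rounded 29MB figures; the share of total data still rounds to 0.13%.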
FYI: For anyone who wants the executive summary, we think the TUF metadata will be under 1MB and even with very broad / rapid adoption of TUF in the next year or two will stay <3MB or so.
Note that this cost is only paid upon the initial run of the client tool. Everything after that just downloads diffs (or at least will once we fix an open ticket).
Thanks, Justin
On Tue, Apr 9, 2013 at 9:58 AM, Justin Cappos <jcappos@poly.edu> wrote:
FYI: For anyone who wants the executive summary, we think the TUF metadata will be under 1MB and even with very broad / rapid adoption of TUF in the next year or two will stay <3MB or so.
Is that after compression? Or did Trishank miscount the number of digits for the initial email? Cheers, Nick.
_______________________________________________ Distutils-SIG maillist - Distutils-SIG@python.org http://mail.python.org/mailman/listinfo/distutils-sig
-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
His 29MB and 58MB numbers assume that every developer has their own key right now. We don't think this is likely to happen and propose initially signing everything that the developers don't sign with a single PyPI key.
It also assumes there are no abandoned packages / devel account. I also think many devels won't go back and sign all old versions of their software. So my number is definitely a back of the envelope calculation using Trishank's data. Trishank's calculations are much more expressive, but are the "worst case" size.
Thanks, Justin
OK, that makes sense - thanks for the clarification. Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
Correct. Justin based his back-of-the-envelope calculation on some very rough prior estimates of mine, so they may be a little off. Nevertheless, our argument remains: sharing a key across, say, a thousand packages will certainly reduce the metadata by quite a bit. Combine that with compression or difference schemes, and you get even more savings.
What size keys?
On 4/9/13 7:47 AM, Daniel Holth wrote:
What size keys?
2048 bits, which is the minimum key size TUF currently allows for security purposes. Which range of key sizes do you think PyPI would be comfortable with?
I have finished generating the /simple metadata, and it comes to about 52MB, which is not too far off from my estimate of 59MB. Remember: this is the worst-case size for simple metadata.
I have now started generating the /packages metadata. If all goes well, I should be able to test pip against a realistic TUF-secured PyPI mirror fairly soon.
All of this is taking longer than I want it to because, well, automation is generally tricky business! We are getting there :)
Okay, so we have finished generating the TUF metadata for a complete (if not the latest) set of PyPI packages.
Summary of the largest metadata, assuming the worst case of a key per package on PyPI:
release.txt: 11MB
/simple metadata: 52MB
/packages metadata: 96MB
All in all, the metadata sums to about 159MB. With the data being 45GB, that works out to the metadata size being 0.35% of the data size.
Remember: this is the worst case for the metadata, where every PyPI package has its own key, and there is a role for every possible target subdirectory. The metadata is also uncompressed JSON.
As we have said before, we think we can do better (e.g., by reusing keys for multiple packages), and we are working on it. Simultaneously, we are testing a TUF-enabled version of pip against a TUF-secured PyPI mirror.
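The summary figures in this message can be re-derived with a few lines of Python. This is only a sketch of the arithmetic; the sizes are the ones reported above, and the use of binary units (1GB = 1024MB) is an assumption.

```python
# Sizes in MB as reported in the message; binary units are an assumption.
sizes_mb = {
    "release.txt": 11,
    "/simple metadata": 52,
    "/packages metadata": 96,
}
TOTAL_DATA_GB = 45

total_mb = sum(sizes_mb.values())
share = total_mb / (TOTAL_DATA_GB * 1024)
print(f"total metadata: {total_mb} MB, {share:.2%} of {TOTAL_DATA_GB} GB")
```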
Participants (4): Daniel Holth, Justin Cappos, Nick Coghlan, Trishank Karthik Kuppusamy