On Tue, Apr 9, 2013 at 9:58 AM, Justin Cappos <jcappos@poly.edu> wrote:
> FYI: For anyone who wants the executive summary, we think the TUF metadata
> will be under 1MB and even with very broad / rapid adoption of TUF in the
> next year or two will stay <3MB or so.

Is that after compression? Or did Trishank miscount the number of
digits for the initial email?

Cheers,
Nick.

>
> Note that this cost is only paid upon the initial run of the client tool.
> Everything after that just downloads diffs (or at least will once we fix an
> open ticket).
>
> Thanks,
> Justin
>
>
>
> On Mon, Apr 8, 2013 at 2:41 PM, Trishank Karthik Kuppusamy
> <tk47@students.poly.edu> wrote:
>>
>> Hello everyone,
>>
>> I have been testing and refining the pypi.updateframework.com automation
>> over the past week, and looking at how much TUF metadata is generated for
>> PyPI.
>>
>> In this email, I am going to focus only on the PyPI data under /simple;
>> let us call that "simple data".
>>
>> Now, if we assume that every developer will have her own key to sign the
>> simple data for her package, then this is what the TUF metadata could look
>> like:
>>
>> metadata/targets.txt
>> ====================
>> Delegation from the targets to the targets/simple role, with the former
>> role being responsible for no target data because it has none of its own.
>>
>> metadata/targets/simple.txt
>> ===========================
>> Delegation from targets/simple to the targets/simple/packageI role, with
>> the former role being responsible for one target datum: simple/index.html.
>>
>> metadata/targets/simple/packageI.txt
>> ====================================
>> The targets/simple/packageI role is responsible only for the simple data
>> at simple/packageI/index.html.
>>
>> In this upper bound case, where every developer is responsible for signing
>> her own package, one can estimate the metadata size to be like so:
>>
>> - metadata/targets.txt is at most a few KB, and can be
>> safely ignored.
>> - metadata/targets/simple/packageI.txt is about 1KB.
>> - metadata/targets/simple.txt is about the sum of all
>> metadata/targets/simple/packageI.txt files. (This is a very rough estimate!)
>>
>> Therefore, if we have 30,000 developer packages on PyPI (roughly the
>> current number of packages), then we would have about 29 MB of
>> metadata/targets/simple/packageI.txt, and another 29 MB of
>> metadata/targets/simple.txt, for a rough total of 58MB. If PyPI has 45GB of
>> total data (roughly what I saw from my last mirror), then the simple
>> metadata is about 0.13% of total data size.
>>
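The upper-bound arithmetic above can be checked with a short sketch. It assumes binary units (1 MB = 1024 KB, 1 GB = 1024 MB), which is what reproduces the quoted figures of about 29 MB, 58 MB and 0.13%. The function name and its parameters are hypothetical, chosen only for this illustration.

```python
def upper_bound_estimate(num_packages=30_000, per_package_kb=1.0, repo_gb=45.0):
    """Rough upper-bound size of the simple metadata, one key per package."""
    # Sum of all metadata/targets/simple/packageI.txt files, in MB.
    per_package_total_mb = num_packages * per_package_kb / 1024
    # metadata/targets/simple.txt is estimated as roughly the same size.
    simple_txt_mb = per_package_total_mb
    total_mb = per_package_total_mb + simple_txt_mb
    # Share of the whole repository, as a percentage.
    percent_of_repo = 100 * total_mb / (repo_gb * 1024)
    return per_package_total_mb, total_mb, percent_of_repo


per_role_mb, total_mb, percent = upper_bound_estimate()
print(f"{per_role_mb:.1f} MB + {per_role_mb:.1f} MB = {total_mb:.1f} MB "
      f"({percent:.2f}% of repo)")
```

Changing `num_packages` or `per_package_kb` shows how sensitive the estimate is to the "about 1KB per package" assumption.
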
>> This may seem like a lot of metadata, but let us remember a few important
>> things:
>>
>> - So far, the metadata is simply uncompressed JSON. We are considering
>> metadata compression or difference schemes.
>> - This assumes the upper bound case, where every package developer is
>> responsible for her own package, which means that we have to talk about a
>> lot of keys (random data).
>> - This is a one-time initial download cost. An update to PyPI is unlikely
>> to change all the simple data; therefore, updates to the simple metadata
>> will be cheap, because a TUF client would only download updated metadata. We
>> could amortize the initial simple metadata download cost by distributing it
>> with PyPI installers (e.g. pip).
>>
>> Could we do better? Yes!
>>
>> As Nick Coghlan has suggested, PyPI could begin adopting TUF by signing
>> for all of the developer packages itself. This means that we could reuse a
>> key for multiple developer packages instead of dedicating a key per package.
>> The tradeoff here is that if one such "shared key" is compromised, then
>> multiple packages (but not all of them) could be compromised.
>>
>> In this case, where we use a shared key to sign up to, say, 1,000
>> developer packages, we would have the following simple metadata size.
>> First, let us define some terms:
>>
>> NP = # of developer packages
>> NPK = # of developer packages signed by a key
>> NR = # of roles (each responsible for NPK packages) = math.ceil(NP/NPK)
>> K = average key metadata size
>> D = average delegated role metadata size given one target path
>> P = average target path length
>> T = average simple target (index.html) metadata size
>>
>> metadata/targets/simple.txt
>> ===========================
>> Most of the metadata here deals with all of the keys, and the roles, used
>> to sign simple data. Therefore, the size of the keys and roles metadata will
>> dominate this file.
>>
>> key metadata size = NR*K
>> role metadata size = NR*(D+NPK*P)
>>
>> Takeaway: the lower the NPK (the number of developer packages signed by a
>> key), the higher the NR, and the larger the metadata. We would save
>> metadata by setting NPK to, say, 1,000, because then one key could describe
>> 1,000 packages.
>>
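As a minimal sketch, the two formulas for metadata/targets/simple.txt can be evaluated directly. The byte values chosen for K, D and P below are placeholders invented for illustration, not measurements of real TUF metadata, and the function name is hypothetical.

```python
import math


def simple_txt_size(np_, npk, k, d, p):
    """Return (NR, key metadata size, role metadata size) for simple.txt."""
    nr = math.ceil(np_ / npk)          # NR = number of roles
    key_size = nr * k                  # key metadata size = NR*K
    role_size = nr * (d + npk * p)     # role metadata size = NR*(D+NPK*P)
    return nr, key_size, role_size


# Placeholder sizes in bytes; assumptions, NOT measured values.
K, D, P = 400, 200, 30
NP = 30_000

for npk in (1, 1_000):
    nr, key_size, role_size = simple_txt_size(NP, npk, K, D, P)
    print(f"NPK={npk}: NR={nr}, keys={key_size} B, roles={role_size} B")
```

Note that the NR*NPK*P term stays close to NP*P whatever NPK is, so in this model the savings come from the NR*(K+D) part.
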
>> metadata/targets/simple/roleI.txt
>> =================================
>> When NPK=1, then this file would be equivalent to
>> metadata/targets/simple/packageI.txt.
>>
>> It is a small metadata file if we assume that it only talks about the
>> simple data (index.html) for one package. Most of the metadata talks about
>> key signatures, and target metadata. If we increase NPK, then clearly the
>> target metadata would increase in size:
>>
>> target metadata size = NPK*T < NPK*1KB
>>
>> Takeaway: the target metadata would increase in size, but it certainly
>> will not increase as much as it would have if we had signed each developer
>> package with a separate key.
>>
>> Finally, the question is how the savings in metadata/targets/simple.txt
>> would compare to the "growth" of the metadata/targets/simple/roleI.txt
>> files. Ultimately, the higher the NPK (and thus the lower the NR), the
>> less we would be talking about keys (random data). Everything else would
>> remain the same, because there would still be the same number of targets,
>> and thus the same amount of target metadata. So, we would have net savings.
>>
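The net-savings argument can be sketched by adding up both parts of the model. This is a rough illustration only: the byte values for K, D, P and T are invented placeholders (T is merely kept under the 1KB bound mentioned above), the function name is hypothetical, and the model ignores per-file overhead such as signatures on each roleI.txt file.

```python
import math


def total_simple_metadata(np_, npk, k, d, p, t):
    """Total bytes across simple.txt and all roleI.txt files (rough model)."""
    nr = math.ceil(np_ / npk)
    simple_txt = nr * k + nr * (d + npk * p)   # keys + delegated roles
    role_files = np_ * t                       # one target entry per package
    return simple_txt + role_files


# Placeholder sizes in bytes; assumptions, not measurements.
K, D, P, T = 400, 200, 30, 500
NP = 30_000

per_package_keys = total_simple_metadata(NP, 1, K, D, P, T)
shared_keys = total_simple_metadata(NP, 1_000, K, D, P, T)
savings = per_package_keys - shared_keys
print(f"NPK=1: {per_package_keys} B, NPK=1000: {shared_keys} B, "
      f"saved: {savings} B")
```

Because the target metadata term NP*T does not depend on NPK, the difference between the two cases reduces to the change in NR multiplied by (K+D).
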
>> I hope this clears up some questions about metadata size. If there was
>> something confusing because I did not explain it well enough or I got
>> something wrong, please be sure to let me know. My machine is nearly done
>> generating all the simple metadata, so we can make better estimates then.
>>
>> -Trishank
>>
--
Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia