On Mon, Oct 13, 2008 at 6:16 PM, "Martin v. Löwis" firstname.lastname@example.org wrote:
Maybe we could use one subfolder per alphabet letter. Would that simplify anything?
PyPI uses one directory per letter to reduce the number of files in a single directory, in case ext3 doesn't deal with large directories well. For the stats, the "large directories" argument wouldn't count.
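To illustrate the layout, the per-letter bucketing could be computed like this (a minimal sketch; the function name is made up and the real PyPI layout may differ in detail):

```python
def letter_dir(package_name):
    """Return the per-letter subdirectory for a package (hypothetical helper).

    Grouping packages under one directory per leading letter keeps any
    single directory small, a workaround for file systems (e.g. ext3)
    that may handle very large directories poorly.
    """
    return package_name[0].lower()

# e.g. "zc.buildout" would live under "z/zc.buildout"
print(letter_dir("zc.buildout"))  # z
print(letter_dir("Django"))       # d
```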
OTOH, if you do have separate pages per letter, the master server would still need to download all individual files. Having them split into chunks just increases the load, rather than reducing it.
Yes, I thought you were concerned about the size of that file rather than the number of calls PyPI would need to perform.
You would need to specify a timestamp for each single download though, so that PyPI knows which hits to count since the last date it checked the mirror.
No. It would just compute the grand total from scratch each time.
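A minimal sketch of that idea, assuming each mirror exposes a simple package-to-hits mapping (the data shapes and names here are made up for illustration):

```python
from collections import Counter

def grand_total(mirror_stats):
    """Recompute total download counts from scratch.

    mirror_stats is an iterable of per-mirror dicts mapping package
    name to hit count. No per-download timestamps are needed, because
    nothing incremental is kept between runs: the master simply sums
    everything again on each pass.
    """
    total = Counter()
    for stats in mirror_stats:
        total.update(stats)  # adds counts key by key
    return total

mirrors = [
    {"foo": 10, "bar": 3},  # stats page from mirror A
    {"foo": 7, "baz": 1},   # stats page from mirror B
]
print(grand_total(mirrors))  # Counter({'foo': 17, 'bar': 3, 'baz': 1})
```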
OTOH you would lose an interesting piece of information: how downloads evolve over time.
As a packager, I can see some interesting use cases. For example, when foo 2.0 gets out, I can watch foo 1.0 downloads decrease and foo 2.0 downloads rise (and if they don't, make sure I have promoted 2.0 correctly).
People would be able to build interesting statistics tools from that. Of course, this would only be possible if PyPI provides the same timestamped pages for the grand total.
This leads to another point we haven't discussed yet: it would be interesting to keep the user-agent info in the mirrors, and to make sure all the automatic package-grabbing software out there has its own user-agent ID.
For instance, knowing that 90% of the downloads of a given package were done by zc.buildout is interesting. IIRC, we cannot know that right now, and I could work on the zc.buildout side for that, because it currently uses the setuptools user-agent ID.
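To illustrate, a mirror could tally hits per package and per user agent from its access log; the log format, function name, and agent strings below are all hypothetical:

```python
from collections import defaultdict

def count_by_user_agent(log_entries):
    """Tally downloads per (package, user agent) pair.

    log_entries is an iterable of (package, user_agent) tuples, as a
    mirror might extract them from its web server access log. This only
    works if each downloading tool sends a distinct User-Agent header.
    """
    counts = defaultdict(int)
    for package, agent in log_entries:
        counts[(package, agent)] += 1
    return dict(counts)

entries = [
    ("foo", "zc.buildout/1.0"),
    ("foo", "zc.buildout/1.0"),
    ("foo", "setuptools/0.6"),
]
stats = count_by_user_agent(entries)
# With distinct user-agent IDs, buildout-driven downloads can be told
# apart from plain setuptools ones.
print(stats[("foo", "zc.buildout/1.0")])  # 2
```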