It seems like the download counts on PyPI aren't accurate. Though the really useful packages seem to have higher numbers than the packages that only apply to a specific target audience, I'm fairly certain that the numbers are more affected by robots and such than actual users.
Recently I started a service that requires membership. In the last month, PyPI reports 3000 downloads of the client, yet Google Analytics only reports a handful of visits to the website. I have even less membership signups (as expected, so soon after launch). Why are the download counts so inflated?
What has to be done to get this to be accurate?
I've included two screenshots of PyPI and GA.
Dustin Oprea
Mostly new packages will get roughly 2-3k of downloads from what appears to be mirroring infrastructure. I’m hesitant to mess with the traffic numbers at all because I don’t want them to be inaccurate *and* artificial vs just inaccurate (assuming you think it’s the number of people downloading your project).
On Oct 25, 2013, at 1:22 PM, Dustin Oprea dustin@randomingenuity.com wrote:
It seems like the download counts on PyPI aren't accurate. Though the really useful packages seem to have higher numbers than the packages that only apply to a specific target audience, I'm fairly certain that the numbers are more affected by robots and such than actual users.
Recently I started a service that requires membership. In the last month, PyPI reports 3000 downloads of the client, yet Google Analytics only reports a handful of visits to the website. I have even less membership signups (as expected, so soon after launch). Why are the download counts so inflated?
What has to be done to get this to be accurate?
I've included two screenshots of PyPI and GA.
Dustin Oprea
<Selection_001.png><Selection_002.png>
_______________________________________________
Distutils-SIG maillist - Distutils-SIG@python.org
https://mail.python.org/mailman/listinfo/distutils-sig
----------------- Donald Stufft PGP: 0x6E3CBCE93372DCFA // 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA
On Fri, Oct 25, 2013 at 13:49 -0400, Donald Stufft wrote:
Mostly new packages will get roughly 2-3k of downloads from what appears to be mirroring infrastructure. I’m hesitant to mess with the traffic numbers at all because I don’t want them to be inaccurate *and* artificial vs just inaccurate (assuming you think it’s the number of people downloading your project).
Is it not possible that the analysis code or Fastly's delivery of logs has bugs? The "inflation" problem only happens sometimes after all. If it were mirroring infrastructure it should be more consistent.
holger
On Oct 25, 2013, at 1:57 PM, holger krekel holger@merlinux.eu wrote:
Is it not possible that the analysis code or Fastly's delivery of logs has bugs? The "inflation" problem only happens sometimes after all. If it were mirroring infrastructure it should be more consistent.
Sure, it’s possible. The analysis code is OSS in the pypi repo, and the log delivery is just done via syslog. I have no idea how Fastly creates those logs, though.
----------------- Donald Stufft PGP: 0x6E3CBCE93372DCFA // 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA
Is there any way that we can use the user-agent to either identify users or identify mirrors?
Can we pass a flag or signature from "pip"? It won't reflect downloads from website, but this probably won't affect the numbers much. In this case, we might just reword it to "pip Downloads".
This is a distressing issue. It doesn't seem like package owners have any usable usage data.
Dustin Oprea
On 26 Oct 2013 04:51, "Dustin Oprea" myselfasunder@gmail.com wrote:
Is there any way that we can use the user-agent to either identify users or identify mirrors?
Can we pass a flag or signature from "pip"? It won't reflect downloads from website, but this probably won't affect the numbers much. In this case, we might just reword it to "pip Downloads".
This is a distressing issue. It doesn't seem like package owners have any usable usage data.
Most downloads happen through the Fastly CDN - the numbers are derived from the Fastly logs rather than being direct. The code that does that log analysis is in https://bitbucket.org/pypa/pypi/src (Donald would be able to provide a more direct reference to the relevant source).
However, separating downloads between mirroring, automatic deployments and integration and actual direct downloads isn't something PyPI has ever done, or is really able to do in a systematic way. "pip install thatproject" (and equivalent commands for other tools) looks the same to PyPI regardless of whether it's a human or a script running the command.
That's why Donald's recent download analysis was able to split it up by tools, but not by purpose.
Now, exposing more of that analytical data to package owners on an ongoing basis is an interesting idea, but one that would be a *very* long way down the priority list for the current development team.
However, if someone else were to figure out a way to expose the data users needed to do their own analysis, it might be possible to support that, although it may be better to look at offering that through Warehouse (aka PyPI.next) rather than the existing PyPI software ( https://github.com/dstufft/warehouse). There's a demo instance (using live data) running at preview-pypi.python.org, but that's mostly focused on backwards compatibility testing for the tool APIs at this point rather than being navigable through a web browser.
Cheers, Nick.
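As a rough illustration of the point above (downloads can be split by tool via the user-agent, but not by purpose), here is a minimal Python sketch. The prefix table, the labels, and the function names `classify` and `count_by_tool` are all assumptions made for this example; they are not taken from the PyPI analysis code, and real tools or mirrors may send different user-agent strings.

```python
from collections import Counter

# ASSUMPTION: illustrative user-agent prefixes only. The strings that real
# tools and mirrors send may differ, and this table is not from PyPI.
KNOWN_PREFIXES = {
    "pip/": "pip",
    "bandersnatch/": "mirror",
    "Mozilla/": "browser",
}


def classify(user_agent):
    """Return a coarse tool label for a user-agent string."""
    for prefix, label in KNOWN_PREFIXES.items():
        if user_agent.startswith(prefix):
            return label
    return "unknown"


def count_by_tool(user_agents):
    """Tally download requests per tool label."""
    return Counter(classify(ua) for ua in user_agents)
```

Note what this cannot do: a "pip/..." request from a person at a terminal and one from a deployment script land in the same bucket, which is exactly the limitation described above.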
On Oct 25, 2013, at 9:35 PM, Nick Coghlan ncoghlan@gmail.com wrote:
Most downloads happen through the Fastly CDN - the numbers are derived from the Fastly logs rather than being direct. The code that does that log analysis is in https://bitbucket.org/pypa/pypi/src (Donald would be able to provide a more direct reference to the relevant source).
https://bitbucket.org/pypa/pypi/src/0c749f947b167b6643ed94ceac2b3e1ab478d5bd...
It buckets data and puts it into redis. Really “dumb” but it works well enough until better infrastructure can be put into place.
However, separating downloads between mirroring, automatic deployments and integration and actual direct downloads isn't something PyPI has ever done, or is really able to do in a systematic way. "pip install thatproject" (and equivalent commands for other tools) looks the same to PyPI regardless of whether it's a human or a script running the command.
That's why Donald's recent download analysis was able to split it up by tools, but not by purpose.
Yea this part is hard/impossible :/
Now, exposing more of that analytical data to package owners on an ongoing basis is an interesting idea, but one that would be a *very* long way down the priority list for the current development team.
Nice analytics for package owners is on the road map, but it’s, as you mentioned, down the road map a ways.
However, if someone else were to figure out a way to expose the data users needed to do their own analysis, it might be possible to support that, although it may be better to look at offering that through Warehouse (aka PyPI.next) rather than the existing PyPI software (https://github.com/dstufft/warehouse). There's a demo instance (using live data) running at preview-pypi.python.org, but that's mostly focused on backwards compatibility testing for the tool APIs at this point rather than being navigable through a web browser.
----------------- Donald Stufft PGP: 0x6E3CBCE93372DCFA // 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA
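The "buckets data and puts it into redis" approach mentioned above might look roughly like the following Python sketch. It is not the actual PyPI code: the hourly granularity, the key format, and the names `bucket_key` and `bucket_downloads` are assumptions for illustration, and a plain dict stands in for redis so the example is self-contained.

```python
from collections import defaultdict
from datetime import datetime


def bucket_key(project, timestamp):
    """Build a per-project, per-hour key (hypothetical key format)."""
    return "downloads:{}:{}".format(project, timestamp.strftime("%Y-%m-%d-%H"))


def bucket_downloads(events):
    """Count (project, timestamp) download events per hourly bucket.

    ASSUMPTION: a dict stands in for redis here. With a redis client,
    the increment would typically be an INCR on the same key.
    """
    counts = defaultdict(int)
    for project, timestamp in events:
        counts[bucket_key(project, timestamp)] += 1
    return dict(counts)
```

Every log line counts equally in a scheme like this, so requests from mirrors and from people are indistinguishable once bucketed, which is consistent with the inflated numbers discussed in this thread.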