<p dir="ltr">
On Jul 22, 2015 5:12 PM, "Brett Cannon" <<a href="mailto:bcannon@gmail.com">bcannon@gmail.com</a>> wrote:<br>
><br>
><br>
><br>
> On Wed, Jul 22, 2015 at 2:19 PM Wes Turner <<a href="mailto:wes.turner@gmail.com">wes.turner@gmail.com</a>> wrote:<br>
>><br>
>> <a href="https://github.com/dstufft/pypi-stats">https://github.com/dstufft/pypi-stats</a><br>
>><br>
>> <a href="https://github.com/dstufft/pypi-external-stats">https://github.com/dstufft/pypi-external-stats</a><br>
><br>
><br>
> I'm not quite sure what I'm supposed to get from those links, Wes, as that code still scrapes every project individually and downloads them, while all I'm trying to do is avoid scraping PyPI and instead just download a single file (plus I don't want the package files, just the metadata already returned by the JSON API).</p>
<p dir="ltr">An online query or an offline dump?</p>
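<p dir="ltr">For concreteness, this is the per-project JSON API being discussed; a minimal sketch using requests, with one GET per project (exactly the per-project round trip a bulk dump would avoid):</p>
<pre>
# Sketch: fetch one project's metadata from PyPI's JSON API.
# Assumes the third-party `requests` package is installed.
import requests

def get_metadata(project):
    # Documented per-project JSON endpoint.
    url = "https://pypi.python.org/pypi/%s/json" % project
    response = requests.get(url)
    response.raise_for_status()
    return response.json()

metadata = get_metadata("requests")
print(metadata["info"]["version"])
</pre>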
<p dir="ltr">><br>
> -Brett<br>
>  <br>
>><br>
>> - [ ] a flat BigQuery dataset queryable w/ pandas.io.gbq, a la GitHub Archive, would be great</p>
<p dir="ltr"><a href="http://pandas.pydata.org/pandas-docs/version/0.16.2/io.html#io-bigquery">http://pandas.pydata.org/pandas-docs/version/0.16.2/io.html#io-bigquery</a></p>
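<p dir="ltr">The sketch: a query against a hypothetical flat PyPI metadata table (the table name below is made up; GitHub Archive's public dataset is the model), using the pandas 0.16.2 gbq interface linked above:</p>
<pre>
# Sketch: query a (hypothetical) flat PyPI metadata table on BigQuery
# with pandas.io.gbq, the way GitHub Archive timelines are queried.
# [pypi.metadata] is a made-up table name; "my-project-id" stands in
# for your own Google Cloud project used to bill the query.
from pandas.io import gbq

query = """
SELECT name, version, summary
FROM [pypi.metadata]
LIMIT 10
"""
df = gbq.read_gbq(query, project_id="my-project-id")
print(df.head())
</pre>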
<p dir="ltr">>> - [ ] it's probably worth adding RDFa to PyPI and Warehouse pages (in addition to the JSON already extracted and served by the API) for #search</p>
<p dir="ltr"><a href="https://github.com/pypa/warehouse/blob/master/warehouse/packaging/models.py">https://github.com/pypa/warehouse/blob/master/warehouse/packaging/models.py</a></p>
<p dir="ltr"><a href="https://github.com/pypa/warehouse/blob/master/tests/unit/packaging/test_models.py">https://github.com/pypa/warehouse/blob/master/tests/unit/packaging/test_models.py</a></p>
<p dir="ltr"><a href="https://github.com/pypa/warehouse/blob/master/warehouse/packaging/views.py">https://github.com/pypa/warehouse/blob/master/warehouse/packaging/views.py</a></p>
<p dir="ltr"><a href="https://github.com/pypa/warehouse/blob/master/warehouse/templates/packaging/detail.html">https://github.com/pypa/warehouse/blob/master/warehouse/templates/packaging/detail.html</a></p>
<p dir="ltr"><a href="https://github.com/pypa/warehouse/blob/master/warehouse/routes.py">https://github.com/pypa/warehouse/blob/master/warehouse/routes.py</a></p>
<p dir="ltr"><a href="https://github.com/pypa/warehouse/blob/master/tests/unit/legacy/api/test_json.py">https://github.com/pypa/warehouse/blob/master/tests/unit/legacy/api/test_json.py</a></p>
<p dir="ltr"><a href="https://github.com/pypa/warehouse/blob/master/warehouse/legacy/api/json.py">https://github.com/pypa/warehouse/blob/master/warehouse/legacy/api/json.py</a></p>
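<p dir="ltr">FWIW, whichever form the data arrives in (bulk dump or per-project JSON), the kind of analysis in Brett's post reduces to walking the metadata records; a minimal sketch, where all_metadata is a placeholder for however the records were obtained:</p>
<pre>
# Sketch: count projects declaring Python 3 support via their trove
# classifiers. `all_metadata` is a placeholder iterable of per-project
# JSON records (from a bulk dump, or from per-project GETs).
def count_py3_projects(all_metadata):
    total = 0
    py3 = 0
    for record in all_metadata:
        total += 1
        classifiers = record.get("info", {}).get("classifiers", [])
        if any(c.startswith("Programming Language :: Python :: 3")
               for c in classifiers):
            py3 += 1
    return py3, total
</pre>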
<p dir="ltr">>><br>
>> On Jul 22, 2015 4:08 PM, "Brett Cannon" <<a href="mailto:bcannon@gmail.com">bcannon@gmail.com</a>> wrote:<br>
>>><br>
>>> When I wrote <a href="https://nothingbutsnark.svbtle.com/python-3-support-on-pypi">https://nothingbutsnark.svbtle.com/python-3-support-on-pypi</a> I had to write a script to download every project's JSON metadata by scraping the simple index and then making the appropriate GET request for each project's JSON metadata. It worked, but it was somewhat of a hassle.<br>
>>><br>
>>> Is there some dump somewhere that is built daily, weekly, or monthly of all the metadata on PyPI for offline analysis?<br>
>>><br>
</p>