Data on requirement files on GitHub
Hi,

I ran a couple of queries against GitHub's public BigQuery dataset [0] last week. I'm interested in requirements files in particular, so I ran a query extracting all available requirements files.

Since queries against this dataset are rather expensive ($7 for a query over all repos), I thought I'd share the raw data here [1]. The data contains the repo name, the requirements file path, and the contents of the file. Each line is a JSON blob; read it with:

    import json

    with open('data.json') as f:
        for line in f:
            data = json.loads(line)

Maybe that's of interest to some of you. If you have any ideas on what to do with the data, please let me know.

—
Jannis Gebauer

[0]: https://cloud.google.com/bigquery/public-data/github
[1]: https://github.com/jayfk/requirements-dataset
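A minimal sketch of iterating over the dump with nothing but the standard library. The post describes repo name, path, and content fields, but the exact JSON key names (`repo_name`, `path`, `content`) are guesses, and the skip-on-decode-error behavior is a defensive choice for a large scraped dump, not something the post specifies:

```python
import json


def iter_requirement_files(path="data.json"):
    """Yield one parsed record per JSON line, skipping blank or malformed lines."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                yield json.loads(line)
            except json.JSONDecodeError:
                # Tolerate the odd corrupt line rather than aborting the whole scan.
                continue
```

Used as `for record in iter_requirement_files(): ...`, each `record` would then be a dict with the repo name, file path, and file contents under whatever keys the dump actually uses.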
Looks like a fun chunk of data. What's the query you used? Can you add a
README to the repo with some description so others can iterate on it
(maybe look into setup.py files?)
Nick
On Tue, Mar 7, 2017 at 5:06 AM, Jannis Gebauer wrote:
I ran a couple of queries against GitHubs public big query dataset [0] last week. I’m interested in requirement files in particular, so I ran a query extracting all available requirement files.
_______________________________________________
Distutils-SIG maillist - Distutils-SIG@python.org
https://mail.python.org/mailman/listinfo/distutils-sig
I had some fun parsing and plotting the data (very simple, just the top
packages for now). See here:
https://github.com/lkraider/requirements-dataset/blob/master/index.ipynb
Let me know if you would accept a pull request so others can use that as a
starting point.
Regards,
--
Paul Eipper
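For anyone who wants a text-only starting point alongside the notebook, a naive top-packages tally can be done with the standard library. The `content` field name is an assumption about the dump's schema, and the name extraction below is deliberately crude: it ignores extras, environment markers, URLs, and `-r`/`-e` includes:

```python
import json
import re
from collections import Counter

# Leading distribution name of a requirement line, e.g. "Django>=1.8" -> "Django".
NAME_RE = re.compile(r"^([A-Za-z0-9][A-Za-z0-9._-]*)")


def top_packages(jsonl_path, n=10):
    """Count package names across all requirement files in the JSON-lines dump."""
    counts = Counter()
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            try:
                content = json.loads(line).get("content", "")
            except json.JSONDecodeError:
                continue
            for req in content.splitlines():
                req = req.strip()
                if not req or req.startswith(("#", "-")):
                    continue  # comments and pip options (-r, -e, --index-url)
                m = NAME_RE.match(req)
                if m:
                    counts[m.group(1).lower()] += 1
    return counts.most_common(n)
```

Lowercasing the names folds together the `Django`/`django` spelling variants that show up in real requirement files.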
On Wed, Mar 8, 2017 at 1:36 PM, Nick Timkovich wrote:
Looks like a fun chunk of data, what's the query you used? Can you add a README to the repo with some description if others want to iterate on it (maybe look into setup.py's?)
PS: it took 2 hours to parse the dataset into the linearized version (stored as "parsed.json") on my notebook.
--
Paul Eipper
On Thu, Mar 9, 2017 at 7:39 PM, Paul Eipper wrote:
I had some fun parsing and plotting the data (very simple, just the top packages for now). See here: https://github.com/lkraider/requirements-dataset/blob/master/index.ipynb
https://en.wikipedia.org/wiki/BigQuery
BigQuery Dashboards
- http://bigqueri.es/c/github-archive
- https://redash.io/data-sources/google-bigquery
- https://github.com/getredash/redash
- https://github.com/getredash/redash/blob/master/requirements.txt
- https://github.com/getredash/redash/blob/master/Dockerfile
- https://github.com/docker/docker/blob/master/builder/dockerfile/parser/parse...
- https://github.com/DBuildService/dockerfile-parse/issues
- https://github.com/getredash/redash/blob/master/docker-compose.yml
Software Configuration Management / Dependency Management applications for
BigQuery:
- https://opensource.googleblog.com/2017/03/operation-rosehub.html
- "Googlers used BigQuery and GitHub to patch thousands of vulnerable
projects"
https://www.reddit.com/r/bigquery/comments/5x0x5z/googlers_used_bigquery_and...
BigQuery Python Libraries
google-cloud-bigquery
- | Src: https://github.com/GoogleCloudPlatform/google-cloud-python
- | Pypi: https://pypi.python.org/pypi/google-cloud-bigquery
- | Docs: https://cloud.google.com/bigquery/docs/reference/libraries#client-libraries-...
google-api-python-client
- | Src: https://github.com/google/google-api-python-client
- | Pypi: https://pypi.python.org/pypi/google-api-python-client
- pandas.io.gbq uses google-api-python-client:
- Docs: http://pandas.pydata.org/pandas-docs/stable/io.html#google-bigquery-experime...
- read_gbq() http://pandas.pydata.org/pandas-docs/stable/generated/pandas.io.gbq.read_gbq...
- to_gbq() http://pandas.pydata.org/pandas-docs/stable/generated/pandas.io.gbq.to_gbq.h...
Open Source Big Data Components for things like BigQuery:
Apache Drill
- | Wikipedia: https://en.wikipedia.org/wiki/Apache_Drill
- Apache Drill is similar to Google Dremel (which powers Google BigQuery)
- https://pypi.python.org/pypi/drillpy
Apache Beam
- | Wikipedia: https://en.wikipedia.org/wiki/Apache_Beam
- | Src: https://github.com/apache/beam
- | Docs: https://beam.apache.org/documentation/sdks/python/
- | Docs: https://beam.apache.org/get-started/quickstart-py/
- | Docs: https://github.com/apache/beam/tree/master/sdks/python/apache_beam/examples
- Google Cloud Dataflow's SDK is now part of Apache Beam
- https://cloud.google.com/dataflow/model/bigquery-io
Parsing (and MAINTAINING) Pip Requirements.txt Files:
- | Src: https://github.com/pypa/pip/tree/master/pip/req
- https://github.com/pypa/pip/issues/3884#issuecomment-236454008
- https://github.com/pypa/pip/issues/1479
- -> Pipfile, Pipfile.lock (``pipenv install pkgname --dev``)
- https://github.com/pyupio/safety-db#tools
- https://pyup.io/
- https://libraries.io/github/librariesio/pydeps
- https://github.com/librariesio/pydeps
- https://libraries.io/
- Pipfile, Pipfile.lock
- | PyPI: https://pypi.python.org/pypi/pipenv
- | PyPI: https://pypi.python.org/pypi/requirements-parser
- | PyPI: https://pypi.python.org/pypi/pipfile
- | Src: https://github.com/kennethreitz/pipenv
- These save to the Pipfile:
- ``pipenv install pkgname``
- ``pipenv install pkgname --dev``
- https://github.com/kennethreitz/pipenv/blob/master/pipenv/utils.py
- pip reqs.txt <--> Pipfile
... Thought I'd get these together; hopefully they're useful.
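The parsers listed above (requirements-parser, pip's own `pip/req`, Pipfile) handle the full requirements syntax. As a rough illustration of the simple cases only, here is a stdlib-only sketch; the helper name and regex are mine, not from any of those tools, and anything involving extras, markers, URLs, or editable installs needs a real parser:

```python
import re

# Very rough split of a simple requirement line into (name, specifier).
# Real-world lines (extras, environment markers, URLs, -e installs) need a
# proper parser such as requirements-parser.
REQ_RE = re.compile(r"^(?P<name>[A-Za-z0-9][A-Za-z0-9._-]*)\s*(?P<spec>[<>=!~][^#;]*)?")


def split_requirement(line):
    """Return (name, specifier) for a plain requirement line, else None."""
    line = line.strip()
    if not line or line.startswith(("#", "-")):
        return None  # comment or pip option, not a requirement
    m = REQ_RE.match(line)
    if not m:
        return None
    spec = (m.group("spec") or "").strip().replace(" ", "")
    return m.group("name"), spec
```

For example, `split_requirement("Django >= 1.8, < 2.0")` would yield the name and a normalized `>=1.8,<2.0` specifier, while comment and option lines come back as `None`.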
Cool Jupyter notebook!
( https://github.com/lkraider/requirements-dataset/blob/master/index.ipynb )
participants (4)
- Jannis Gebauer
- Nick Timkovich
- Paul Eipper
- Wes Turner