web API to get a list of all modules in stdlib
Hi,

I am doing an exercise as part of an agile UX data mining team, and I need to get a list of Python modules:

https://stackoverflow.com/questions/6463918/how-can-i-get-a-list-of-all-the-...

But this gives only the modules that were compiled into a specific interpreter, and I need a list of the modules that are de facto included in the stdlib standard.

I also need this for all Python versions, and to be able to fetch it in CSV, JSON or HTML table format over the web, so that the result of my work can be validated and the experiment repeated as necessary.

I see the data as a necessary step to organize work around an "externally evolving standard library", so the way to query it should be reasonably sustainable and obvious.

It might be possible to generate something from the docs, like:

https://docs.python.org/2.7.2/dataset/modules.json

This way you get static information without the ability to version or refresh it (still good to have anyway, to compare docs and other sources).

Or it may be a dedicated URL:

https://api.python.org/2.7.2/stdlib/modules/

The result is HTML by default.

?format=csv - result is csv
?format=yaml

I need in particular:

- module name
- files that comprise module sources
- os supported

So, basically I need official support for this:

https://bitbucket.org/techtonik/python-stdlib/src/092af75da07cb264070115fb9a...

Because I don't have the means to maintain this myself and feel tired of trying to think about how it can be maintained from outside.

If I have this mapping, I can make a diagram of how many patches per module are sitting on the tracker, and it may open a can of worms for many other fishy stats that will be attractive for people to work on. Actually, the code that sorts patches by module is already in that repository. It is also unlicensed, to free it from the restrictions that copyright law places on distributed development, so neither you nor I need to sign a CLA to develop it further.

So, where is the first-class info about the module structure of the stdlib? Where should this info be fetched from when accessed automatically from the web? How should it be kept up to date for all Python versions?

-- anatoly t.
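A minimal sketch of what the requested `?format=` switch might produce, assuming each module is described by the three fields listed above. The `MODULES` records and the `render_modules` helper are hypothetical and purely illustrative; no such endpoint or data set is known to exist.

```python
import csv
import io
import json

# Hypothetical records with the three requested fields. The values are
# illustrative only and are not taken from any official data set.
MODULES = [
    {"name": "winreg", "files": ["PC/winreg.c"], "os": ["windows"]},
    {"name": "json", "files": ["Lib/json/__init__.py"], "os": ["all"]},
]


def render_modules(records, fmt="json"):
    """Serialize module records the way a ?format= parameter might select."""
    rows = sorted(records, key=lambda r: r["name"])
    if fmt == "json":
        return json.dumps(rows, indent=2)
    if fmt == "csv":
        out = io.StringIO()
        writer = csv.writer(out, lineterminator="\n")
        writer.writerow(["name", "files", "os"])
        for r in rows:
            # Multi-valued fields are joined with ';' to stay in one cell.
            writer.writerow([r["name"], ";".join(r["files"]), ";".join(r["os"])])
        return out.getvalue()
    raise ValueError("unsupported format: %r" % fmt)
```

Joining multi-valued fields with `;` is just one possible choice for keeping the CSV flat; a real data set would need to document its own convention.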
Here is a hack that does the thing locally, not from the web, along with its rationale:

https://github.com/jackmaney/python-stdlib-list

Quoting here just in case there are still people who can talk with trolls:

Python Standard Library List

This package includes lists of all of the standard libraries for Python 2.6, 2.7, 3.2, 3.3, and 3.4, along with the code for scraping the official Python docs to get said lists.

Listing the modules in the standard library? Wait, why on Earth would you care about that?!

Because knowing whether or not a module is part of the standard library will come in handy in a project of mine <https://github.com/jackmaney/pypt>. And I'm not the only one <http://stackoverflow.com/questions/6463918/how-can-i-get-a-list-of-all-the-p...> who would find this useful. Or, the TL;DR answer is that it's handy in situations when you're analyzing Python code and would like to find module dependencies.

After googling for a way to generate a list of Python standard libraries (and looking through the answers to the previously-linked Stack Overflow question), I decided that I didn't like the existing solutions. So, I started by writing a scraper for the TOC of the Python Module Index for each of the versions of Python above.

However, web scraping can be a fragile affair. Thanks to a suggestion <https://github.com/jackmaney/python-stdlib-list/issues/1#issuecomment-865172...> by @ncoghlan <https://github.com/ncoghlan>, and some further help from @birkenfeld <https://github.com/birkenfeld> and @epc <https://github.com/epc>, the population of the lists is now done by grabbing and parsing the Sphinx object inventory for the official Python docs of each relevant version.
The above library extracts module info from the Sphinx objects.inv database, https://docs.python.org/2.7/objects.inv, which is a binary format and requires local parsing. Ideally, Sphinx should publish open data about the stdlib structure, such as:

https://docs.python.org/2.7/dataset/1.0/modules.json
https://docs.python.org/2.7/dataset/1.0/modules.csv

listing all module names, sorted by name. Then you can easily load this data into a web app or use it as a table for analysis.

http://www.w3.org/2013/csvw/wiki/Main_Page
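To make the "local parsing" step concrete, here is a minimal sketch of reading module names out of an objects.inv file. It assumes the version 2 inventory layout (four plain-text header lines followed by a zlib-compressed body, one object per line); `stdlib_modules_from_inventory` is a hypothetical helper and is not the code used by python-stdlib-list.

```python
import re
import zlib

# Each decompressed line holds: name, domain:role, priority, uri, display name.
_LINE = re.compile(r"(.+?)\s+(\S+)\s+(-?\d+)\s+(\S*)\s+(.*)")


def stdlib_modules_from_inventory(data):
    """Return sorted module names from the raw bytes of an objects.inv file."""
    header, _, rest = data.partition(b"\n")
    if b"Sphinx inventory version 2" not in header:
        raise ValueError("only version 2 inventories are handled here")
    # Skip the remaining three header lines (project, version, zlib notice).
    for _ in range(3):
        _, _, rest = rest.partition(b"\n")
    text = zlib.decompress(rest).decode("utf-8")
    names = set()
    for line in text.splitlines():
        match = _LINE.match(line)
        # Keep only entries documented as Python modules.
        if match and match.group(2) == "py:module":
            names.add(match.group(1))
    return sorted(names)
```

The input bytes could come from any HTTP client, for example `urllib.request.urlopen(url).read()` on Python 3. Note that the result lists what the docs describe, which may differ from what a given interpreter actually ships.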
Hi Anatoly,

On 03/23/2015 01:06 PM, anatoly techtonik wrote:
> Hi,
> I am doing an exercise as a part of agile ux data mining team, and I need to get a list of Python modules:
> https://stackoverflow.com/questions/6463918/how-can-i-get-a-list-of-all-the-...
> But this gives only the modules that were compiled into specific interpreter, and I need a list of modules that are de-facto included in stdlib standard.
> I also need this for all Python versions, and be able to fetch it as csv, json or html table format over web so that result of my work could be validated and experiment repeated as necessary.
> I see the data as the necessary step to organize a work around "externally evolving standard library", so a way to query it should be somewhat sustainable and obvious.
> It might be possible to generate something from docs, like:
> https://docs.python.org/2.7.2/dataset/modules.json
> This way you get static information without ability to version or refresh the info (still good to have anyway to compare docs and other sources).

+1 for the idea to publish the final results to avoid "reparsing the wheel".

IMHO it could be interesting for new versions to have some kind of "sys.stdlib_module_names" (as stated in SO). Why not propose it on python-ideas?

Regards,
francis
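For reference, an attribute with exactly this name, `sys.stdlib_module_names`, is available in Python 3.10 and later; it did not exist when this thread was written. A minimal sketch of using it, where `is_stdlib_module` is a hypothetical helper and the fallback behaviour for older interpreters is an assumption of this example:

```python
import sys


def is_stdlib_module(name):
    """Return True/False on Python 3.10+, or None if the attribute is missing."""
    known = getattr(sys, "stdlib_module_names", None)
    if known is None:
        return None  # Older interpreter: no built-in list to consult.
    # Only top-level names are listed, so "os.path" is checked as "os".
    return name.partition(".")[0] in known
```

This only answers the question for the running interpreter's own version, so it does not cover the "all Python versions" part of the original request.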
On Mon, Apr 6, 2015 at 5:44 PM, francis <francismb@email.de> wrote:
> +1 for the idea to publish the final results to avoid "reparsing the wheel".
> IMHO it could be interesting for new versions to have some kind of "sys.stdlib_module_names" (as stated in SO). Why not propose it on python-ideas?

Done. But I omitted the `sys.stdlib_module_names` part, because for my use case in the https://bitbucket.org/techtonik/python-stdlib project I need more data exported than just names. For example, I collect the paths to the module sources, so that further processing can be done on the real module files:

https://bitbucket.org/techtonik/python-stdlib/src/tip/stdlib.json?at=default

-- anatoly t.
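A minimal sketch of collecting a source path for a module name, assuming it is enough to ask the locally installed interpreter. `module_source_path` is a hypothetical helper; it is not how the python-stdlib project builds stdlib.json, and it reports installed file locations rather than paths in the CPython source tree.

```python
import importlib.util


def module_source_path(name):
    """Return the file a module is loaded from, or None if it has no file."""
    try:
        spec = importlib.util.find_spec(name)
    except (ImportError, ValueError):
        return None
    if spec is None or not spec.has_location:
        return None  # Not found, or built-in/frozen with no file on disk.
    return spec.origin
```

Built-in modules such as `sys` have no file location, which is one reason a names-only list cannot replace a mapping from modules to source files.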
participants (2)
- anatoly techtonik
- francis