Manuel Gutierrez Algaba writes:
No, and because of a very simple reason. Although Linux apropos [Really long explanation elided.] structuring are not enough.
Manuel, I think I see what you're looking for. (For context: I have studied traditional information retrieval, but not natural language processing approaches.) Let me try to boil down what you've described to a (much) more concise description, and then follow on with my comments. If I misunderstand what you're asking for, please clarify.

My summary of what you explained: You are looking for a concept-based search mechanism, which can preferably describe what sorts of relationships the located items have to each other ("this is an example of that", etc.). You indicate an advantage of automatic concept extraction based on the content. From a user interface perspective, it sounds like each "chunk" of documentation presented should have some sort of entry box or button that searches for other chunks related to the chunk on that page.

My response: I think this would be really nice to have. As far as I'm aware, such systems are still largely research projects, with some applications having reached deployment (you point to good examples). To do this for the Python documentation (defined as broadly as needed), what is most needed is someone who can donate time and know-how. I don't know enough about the AI aspects or the natural language processing aspects. The user interface issues are also non-trivial (especially if the interface is to be distilled all the way down to a single button and maybe a text-entry box). But I'd be glad to work with someone regarding interpretation of the existing documentation and any improvements that could be made to make the processing more effective.

There are two aspects to this which are related but not tightly bound: extraction of "concepts" and use of concepts to locate interesting information. Concepts can be extracted from the text using AI/NLP tools or can be marked explicitly in the documentation source. I must admit a bias toward the latter approach, but automated techniques may have progressed sufficiently to make them viable. I do not see any reason for the approaches to concept extraction to be mutually exclusive.
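To make the split between extracting concepts and using them a bit more concrete, here is a minimal sketch in Python. Everything in it is hypothetical: the `ConceptIndex` class, the chunk identifiers, the `EXPLICIT`/`AUTOMATIC` labels, and the scoring rule (an explicitly marked link counts double) are assumptions for illustration, not part of any existing documentation tool.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical labels for how a concept was attached to a chunk.
EXPLICIT = "explicit"    # marked by hand in the documentation source
AUTOMATIC = "automatic"  # proposed by an AI/NLP extraction tool


@dataclass(frozen=True)
class Assignment:
    """One concept attached to one chunk, with its origin recorded."""
    chunk_id: str
    concept: str
    source: str


class ConceptIndex:
    """Maps concepts to chunks and remembers how each link was made."""

    def __init__(self):
        self._by_concept = defaultdict(set)
        self._by_chunk = defaultdict(set)

    def add(self, chunk_id, concept, source):
        assignment = Assignment(chunk_id, concept, source)
        self._by_concept[concept].add(assignment)
        self._by_chunk[chunk_id].add(assignment)

    def related(self, chunk_id):
        """Return other chunks sharing a concept with chunk_id, best first.

        Assumed scoring: an explicit link adds 2, an automatic link adds 1.
        Ties are broken alphabetically by chunk identifier.
        """
        scores = defaultdict(int)
        for mine in self._by_chunk.get(chunk_id, ()):
            for other in self._by_concept[mine.concept]:
                if other.chunk_id != chunk_id:
                    scores[other.chunk_id] += 2 if other.source == EXPLICIT else 1
        return sorted(scores, key=lambda c: (-scores[c], c))


# Usage: the chunk identifiers below are made up for the example.
index = ConceptIndex()
index.add("lib/module-re", "regular expressions", EXPLICIT)
index.add("lib/module-string", "regular expressions", EXPLICIT)
index.add("tut/strings", "regular expressions", AUTOMATIC)
print(index.related("lib/module-re"))
```

Both extraction approaches feed the same index, so they can coexist; the lookup side only needs the recorded origin if it wants to weight or present the results differently.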
What constitutes a "chunk" needs to be clearly defined, both for purposes of hyper-navigation and for percolation of concept assignments up and down the document structure hierarchy. Software that uses a concept-to-chunk database may need to know about the extraction techniques (at least the explicit vs. automatic dichotomy), especially for purposes of ranking or presentation.

I think we can go a long way using techniques based on explicit markup in the documentation. The index construction markup is one example of "meta" information being located in the documents, and other aspects of the markup are becoming increasingly "logical" rather than presentation-based. There is no reason that two things can't both happen: 1) additional meta information be added to the documents to allow explicit encoding of concept-like information, and 2) processing software infer relationships between chunks based on existing markup.

With the coming conversion of the documentation to SGML, I expect some information present in the documentation today will become more explicit, making it somewhat easier to create processing software that doesn't have to make as many basic inferences as it has to today. (Yes, I realize that this doesn't come from SGML, but the conversion is an excellent opportunity for us to refine the markup in more useful ways than has been the case with the existing markup.)

I'm quite interested in hearing from people about what information would be useful if marked explicitly, and how it could be used.

  -Fred

--
Fred L. Drake, Jr. <fdrake@acm.org>
Corporation for National Research Initiatives
1895 Preston White Dr. Reston, VA 20191