An 'apropos' utility for documentation
Dear folks,

I'm a Linux user who always has a bad time when installing a new package or reading the mammoth documentation that most programs include. This morning I had a simple idea that I'd like you to consider and comment on.

Do you know the Unix command 'apropos'? Well, the idea is to write an apropos command (in Python, of course) containing all the 'concepts' and tips related to the documentation of a program.

To be exact: imagine that your program can be installed on AIX, Linux, and MS-DOS, and that every platform needs certain procedures. Apart from describing the differences in a README file, it would be easy to build a database like this:

core_of_it.py : AIX, Linux, MS-DOS, config.py, ../bin/files, setup, structures, socket

This information could be entered on line by the author, or it could be generated automatically (at least part of it). The fields of the database could be grouped so that more complex searches could be done... Or the fields themselves could be entries in the database.

Has this already been done? Could any of the maintainers of program documentation consider this feature? Do you think it is useful? Do you think it is easy to do?

Regards/Saludos
Manolo

-------------
My addresses / mis direcciones:
www.ctv.es/USERS/irmina ---> lritaunas peki project/proyecto in python
www.ctv.es/USERS/irmina/pyttex.htm ---> page of spanish users of latex / pagina de usuarios en espanyol de latex
www.ctv.es/USERS/irmina/texpython.htm --> page of drawing utility for tex / pagina de utilidad de dibujo para Latex
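A minimal sketch of the database proposed in this message, assuming a plain in-memory mapping from a file to its keywords. The helper names `build_index` and `apropos` are hypothetical, and only the sample entry comes from the message itself.

```python
from collections import defaultdict

# Sample entry taken from the message; a real database would hold many files.
ENTRIES = {
    "core_of_it.py": ["AIX", "Linux", "MS-DOS", "config.py", "../bin/files",
                      "setup", "structures", "socket"],
}


def build_index(entries):
    """Invert file -> keywords into lowercase keyword -> set of files."""
    index = defaultdict(set)
    for filename, keywords in entries.items():
        for keyword in keywords:
            index[keyword.lower()].add(filename)
    return index


def apropos(index, term):
    """Return the sorted files that have a keyword containing `term`."""
    term = term.lower()
    hits = set()
    for keyword, files in index.items():
        if term in keyword:
            hits |= files
    return sorted(hits)


index = build_index(ENTRIES)
print(apropos(index, "linux"))
print(apropos(index, "sock"))
```

Substring matching is used here only because the traditional apropos also matches partial words; an exact-match lookup would be a one-line change.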
Manuel Gutierrez Algaba writes:
Well, the idea is to do an apropos command ( written in python of course) containing all the 'concepts' and tips related with the documentation of a program.
Sorry for not responding to this sooner. I'm not sure if I understand what you want very well. Are you asking for a more elaborate form of the traditional apropos command, or are you looking for an apropos that operates on the Python library?

If the former, I can see it taking the form of an advanced manual-searching interface, hopefully tied in (somehow) with the standard man/apropos system.

If you're looking for an apropos that operates on the Python documentation, that's something for which support could be added to the logical markup of the documentation to some degree, and then an external utility could be used to build and query the database. This is certainly something we can consider as the source form of the documentation moves from LaTeX to SGML.

Please elaborate / clarify on your idea; I'm interested!

-Fred

--
Fred L. Drake, Jr. <fdrake@acm.org>
Corporation for National Research Initiatives
1895 Preston White Dr. Reston, VA 20191
FYI: if I understand the idea, it's similar to something I did a long while back using simple tools on Unix (the .info version of the manuals, and spawning the TTY version of 'info' on them, w/ a little hacked-up index which mapped words to GNU info nodes). See the code at: http://starship.skyport.net/~da/ihelp/. Having the markup in the doc would make that kind of project maintainable in the long run (the reason why I haven't updated 'ihelp' in years).

--david
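A rough sketch of the kind of word-to-node index described in this message; it is not the 'ihelp' code. The index contents, the info file name `python-lib` and the helper `info_command` are hypothetical. The only real interface assumed is the `--file` and `--node` options of the standalone GNU `info` reader.

```python
# Hypothetical index: search word -> (info file, node name).
WORD_TO_NODE = {
    "socket": ("python-lib", "socket"),
    "ftp": ("python-lib", "ftplib"),
}


def info_command(word, index=WORD_TO_NODE):
    """Return the argv list that would open the node for `word`, or None."""
    entry = index.get(word.lower())
    if entry is None:
        return None
    info_file, node = entry
    return ["info", "--file=" + info_file, "--node=" + node]


# The command is only built here, not run.
print(info_command("FTP"))
```

The returned list could be handed to `subprocess.call` to spawn the TTY reader on the matching node.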
On Mon, 7 Dec 1998, Fred L. Drake wrote:
asking for a more elaborate form of the traditional apropos command,
No, and for a very simple reason. Although Linux apropos gives you a vague idea of any term, it is never enough when you need precise information. It's a good starting point and that's all. And I think the maintainers of the programs are responsible for feeding the apropos database. So I think I could gain very little if the rest of the programmers of the world don't pay more attention to supplying good apropos information with their programs. For example, if the programmers of 'fetchmail' don't tell apropos what 'fetchmail' is, then I can't do anything at all with a better apropos tool.
or are you looking for an apropos that operates on the Python library? If the former, I can see it taking the form of an advanced manual-searching interface, hopefully tied in (somehow) with the standard man/apropos system.
Well, this is the first point. Although the Python Library documentation is very good, it can be better, and all the information related to the Python distribution can be better. The idea is as simple as this: imagine I want to write a communications program with sockets. The documentation is straightforward: Module Socket, Module Select. Or is it? There's a module called SocketServer, and ftplib and telnetlib have other examples, and on Sun, for example, there may be other types of sockets. And in some contrib directory or in a FAQ there could be related info that is interesting for my problem.

So, instead of reading a great deal of documentation, scanning an even bigger deal and being suspicious about information hidden in some FAQ or lib or module, why not make the information reveal itself? It's not a matter of more comments or more decoration in the documentation. It's as if all the information started to say: "Hey, I'm sockets-related information, I'm waiting for you!"

For example, let's take a look at ftplib.py:

The second line says: ... RFC 959 ...
Another one says (a.o.s.): ftp.login
a.o.s.: python .... localhost ...
a.o.s.: import socket
a.o.s.: SOCK_STREAM
a.o.s.: gethostbyname
a.o.s.: netrc
a.o.s.: macros
a.o.s.: sys.argv
....

Fortunately, I'm a lamer and a real newbie in most things, so I can enjoy certain pleasures that most wizards enjoyed long ago: learning new things. So perhaps that list of words says very little to you, or perhaps you find it logical (it's an ftplib, what do you expect?). Well, I can see the following in this list:

RFC 959 is related to ftp. So if I find RFC 959 in an email, or I want to know which RFC is the ftp RFC, that information is interesting in both directions.

Next, ftp.login says several things: ftp is another user of the system, and ftp.login seems the logical way to do a 'low level' ftp.

Then you see 'localhost'. It's incredible, but a newbie doesn't know that his machine is an ftp server too!!
And he doesn't know that localhost is the natural test room for his ftp scripts!! I'd say more: a newbie doesn't know what localhost is for at all!!!

Next, I see sockets. Well, obviously ftp is a communications utility, and all of them are based on sockets. Not so obvious for real newbies. And even for the wizard who is looking for some example of sockets, this information could be a great reminder, much more than many may think. SOCK_STREAM is a good piece of information that tells us: "Hey, in ftplib you've got an example of SOCK_STREAM". Think of the reverse problem: you're looking for an example of SOCK_STREAM and you start doing greps here and there. As you see, a lib is not only a collection of useful routines, it's a source of information and examples.

But the system I suggest goes further. Imagine we do to the Python FAQ the very same thing we've done with ftplib. So we get a library-FAQ apropos system. And we do a 'search' for sockets: now information will appear about problems people had with sockets, solutions, examples, and related stuff in the library. This sounds comprehensive. And if you keep a digested file of the emails of comp.python.org, for example, and you have that file tagged with those keywords, you can benefit as a library developer, not only from feedback but also by drastically reducing the time you spend reading email, while supplying a knowledge database that is very interesting for anyone.

Well, let's sum up point 1:
- The information is there, it's rich, but it's dispersed!

Now, let's go to point 2: imagine now that you want to know how Python deals with multitasking. Hmm, that's rather general. So you can't use the structure of the FAQ, nor the HTML structure of the library documentation. So, smile, you'll have to read all the Python documentation!! Or ... suppose that when you visited Module Select you marked 'asynchronous-calls', 'multitasking', ... there, and when you visited SocketServer you marked 'fork', 'thread', 'multitasking', ...
and when you visited Thread you marked 'multitasking'. So what do you see? Well, it's as if the 'information' could be seen in several ways.

Another example: Python allows arbitrary function arguments. It says so in the tutorial and gives an example, but as a matter of fact you can find more and more examples in the modules, perhaps richer or closer to what you want! A question: do you know where there is an example of polynomials in the lib? Yes, in zmod! In fact, when somebody writes a program that does something, it is perhaps far more interesting for the community to see how he solved certain problems; that is, instead of the documentation, it would be far more interesting to see how he approached sockets or the GUI.

So, let's sum up point 2:
- It doesn't matter whether it's LaTeX, HTML, XML, ...: the information will always be HARDWIRED in some way. People will want other perspectives.

Now, let's go to point 3: well, what's the system you are talking about? :) This is the hard part of this letter, and it has to do with linguistics. Yes, they're semas (in Spanish they're 'semas'). If we have this sentence:

I'll go to the cinema tomorrow

or

I'm going to the cinema very soon

the sema decomposition will be:

me go cinema future

A sema is a unit of meaning: a very simple meaning, usually indivisible, but not always. Attributing semas to pieces of source is what I mean: supplying meaning to the sources. Semantics and semas are simply the universal HTML, XML, ... They're the universal reference for humans. You don't need pointers, links or sections, just sets of semas attributed to pieces of code. The final look of the information should depend on the required semas. What I'm saying is not new: indices, apropos, Yahoo and man are a kind of sema dealers. At a very low level, I admit.

Point 4: Is this craziness? Is this programmable? Can it be done in an automatic way? Where are the limits of semantics?
Let's start with the upper limit of semantic documentation, which is this: you sit in front of the computer, you tell it what you want, it consults its semantic knowledge database, extracts the related info and writes the program! (Please don't mention Godel or Turing to me, nobody is going to do such weird computations!!) The lower limit is the Unix apropos: the programmer tells the system: "Hey system, you've got a new item related to the sema 'graphics'", for example.

Well, a step further in the apropos model would be to include such semas in Python code, in a fixed and structured way. For example:

Keyword : sockets, multitasking, nice-list-handling

So when they're parsed they can be included in a database. Hmm, is nice-list-handling a sema? If so, how am I going to search for nice-list-handling? Well, as a matter of fact, semas are the highest meta-information man can think of (because thinking is handling language (semas)), so nice-list-handling can be divided into simpler semas: nice, list, handling. This division can take place in a dictionary or by analyzing the structure of the sema.

But perhaps this system is far more interesting when you generate the semas automatically. Yes, you only have to scan files searching for certain patterns. And the more you apply AI techniques to attribute pieces of code, the more automated information you get. As a matter of fact, Yahoo does this.

Finally, the trick of all this is to find the appropriate semas and the appropriate patterns. But even this is rewarding, because you get a very literal description (in the form of semas) of the semantic world of Python: its faults and its strengths. The same as the several words in Inuktitut (Eskimo) for snow, but applied to computer science. In any case, any effort in this field pays off: the information is getting bigger and bigger every day, and hardwired methods of structuring are not enough.
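The 'Keyword :' step could be sketched roughly as follows. This is only an illustration under stated assumptions: the comment convention `# Keyword : ...`, the helper names `extract_semas` and `build_sema_index`, and the sample module are all hypothetical, and compound semas are split on hyphens as the simplest form of "analyzing the structure of the sema".

```python
import re
from collections import defaultdict

# Hypothetical convention: a comment line of the form "# Keyword : a, b, c".
KEYWORD_LINE = re.compile(r"^\s*#\s*Keyword\s*:\s*(.+)$", re.IGNORECASE)


def extract_semas(source):
    """Collect the semas declared on 'Keyword :' comment lines in `source`."""
    semas = []
    for line in source.splitlines():
        match = KEYWORD_LINE.match(line)
        if match:
            for item in match.group(1).split(","):
                item = item.strip().lower()
                if item:
                    semas.append(item)
    return semas


def build_sema_index(sources):
    """Map each sema, and each hyphen-separated part of it, to module names."""
    index = defaultdict(set)
    for name, source in sources.items():
        for sema in extract_semas(source):
            index[sema].add(name)
            for part in sema.split("-"):
                index[part].add(name)
    return index


# Hypothetical module source carrying one keyword line.
SOURCES = {
    "example_module.py":
        "# Keyword : sockets, multitasking, nice-list-handling\nimport socket\n",
}

index = build_sema_index(SOURCES)
print(sorted(index["list"]))
print(sorted(index["nice-list-handling"]))
```

With this layout a search for 'list' and a search for 'nice-list-handling' both reach the same module, which is one possible answer to the question of how to search for a compound sema.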
Regards/Saludos
Manolo

"...abandon the countryside and your houses and come to defend the sea and the city... do not lament the houses or the land, but the human lives, for houses and land do not give us men; it is men who give us those" -Pericles
Manuel Gutierrez Algaba writes:
No, and because of a very simple reason. Although Linux apropos [Really long explanation elided.] structuring are not enough.
Manuel,

I think I see what you're looking for. (For context: I have studied traditional information retrieval, but not natural language processing approaches.) Let me try to boil down what you've described to a (much) more concise description, and then follow on with my comments. If I misunderstand what you're asking for, please clarify.

My summary of what you explained: You are looking for a concept-based search mechanism, which can preferably describe what sorts of relationships the located items have to each other ("this is an example of that", etc.). You indicate an advantage of automatic concept extraction based on the content. From a user interface perspective, it sounds like each "chunk" of documentation presented should have some sort of entry box or button that searches for other chunks related to the chunk on that page.

My response: I think this would be really nice to have. As far as I'm aware, such systems are still largely research projects, with some applications having reached deployment (you point to good examples). To do this for the Python documentation (defined as broadly as needed), the most-needed thing is someone who can donate time and know-how. I don't know enough about the AI aspects or the natural language processing aspects. The user interface issues are also non-trivial (esp. if the interface can be distilled all the way down to a single button and maybe a text-entry box). But I'd be glad to work with someone regarding interpretation of the existing documentation and any improvements that could be made to make the processing more effective.

There are two aspects to this which are related but not tightly bound: extraction of "concepts" and use of concepts to locate interesting information. Concepts can be extracted from the text using AI/NLP tools or can be marked explicitly in the documentation source. I must admit a bias toward the latter approach, but automated techniques may have progressed sufficiently to make them viable. I do not see any reason for the approaches to concept extraction to be mutually exclusive.
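As a rough illustration of how the two extraction routes could coexist, here is a minimal sketch. It assumes a toy vocabulary match as a stand-in for real AI/NLP extraction; the names `VOCABULARY`, `automatic_concepts` and `concepts_for_chunk` are hypothetical and nothing here reflects existing documentation tooling.

```python
import re

# Hypothetical vocabulary used by the naive "automatic" extractor.
VOCABULARY = {"socket", "thread", "fork", "select"}


def automatic_concepts(text):
    """Naive stand-in for AI/NLP extraction: vocabulary words found in text."""
    words = set(re.findall(r"[a-z_]+", text.lower()))
    return words & VOCABULARY


def concepts_for_chunk(text, explicit=()):
    """Merge both routes, recording where each concept came from."""
    merged = {concept: "automatic" for concept in automatic_concepts(text)}
    for concept in explicit:
        # Explicit markup overrides an automatic guess for the same concept.
        merged[concept.lower()] = "explicit"
    return merged


chunk = "SocketServer can fork or start a thread for each request."
print(concepts_for_chunk(chunk, explicit=["multitasking", "fork"]))
```

Keeping the origin of each concept alongside it means a later query tool can still tell explicit assignments from inferred ones.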
What constitutes a "chunk" needs to be clearly defined, both for purposes of hyper-navigation and percolation of concept assignments up and down the document structure hierarchy. Use of a concept-to-chunk database may need to know about the extraction techniques (at least the explicit vs. automatic dichotomy), especially for purposes of ranking or presentation.

I think we can go a long way using techniques based on explicit markup in the documentation. The index construction markup is one example of "meta" information being located in the documents, and other aspects of the markup are becoming increasingly "logical" rather than presentation-based. There is no reason that two things can't both happen:

1) additional meta information be added to the documents to allow explicit encoding of concept-like information, and
2) processing software imply relationships between chunks based on existing markup.

With the coming conversion of the documentation to SGML, I expect some information present in the documentation today will become more explicit, making it somewhat easier to create processing software that doesn't have to make as many basic inferences as it has to today. (Yes, I realize that this doesn't come from SGML, but the conversion is an excellent opportunity for us to refine the markup in more useful ways than has been the case with the existing markup.)

I'm quite interested in hearing from people about what information would be useful if marked explicitly, and how it could be used.

-Fred
participants (3): David Ascher, Fred L. Drake, Manuel Gutierrez Algaba