
hi all!

i got such a great response to my last query that i'm trying another one ;-) is there anything out there already that i can use to parse python, c++, and java source files to get a listing and count of the lexemes that occur in each?

i spent the better part of an afternoon writing python scripts to remove comments and docstrings so that i could compare line numbers, and i'm afraid parsing to get at the lexemes is beyond my ability within the time i have left to prepare my thesis.

any suggestions?

thanks again!

jeff elkner
yorktown high school
arlington, va

The tokenize module might do what you want.

"""Tokenization help for Python programs.

generate_tokens(readline) is a generator that breaks a stream of text into Python tokens. It accepts a readline-like method which is called repeatedly to get the next line of input (or "" for EOF). It generates 5-tuples with these members:

    the token type (see token.py)
    the token (a string)
    the starting (row, column) indices of the token (a 2-tuple of ints)
    the ending (row, column) indices of the token (a 2-tuple of ints)
    the original line (string)

It is designed to match the working of the Python tokenizer exactly, except that it produces COMMENT tokens for comments and gives type OP for all operators.

Older entry points

    tokenize_loop(readline, tokeneater)
    tokenize(readline, tokeneater=printtoken)

are the same, except instead of generating tokens, tokeneater is a callback function to which the 5 fields described above are passed as 5 arguments, each time a new token is found."""

---
Patrick K. O'Brien
Orbtech
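For the lexeme-counting task, generate_tokens can be wired up directly. Here's a minimal sketch (the function and names are mine, not from the post) that tallies every token string in a piece of Python source, skipping the purely structural tokens:

```python
import token
import tokenize
from collections import Counter
from io import StringIO

# Token types that are layout artifacts rather than lexemes proper.
SKIP = {token.NEWLINE, token.INDENT, token.DEDENT, token.ENDMARKER, tokenize.NL}

def count_lexemes(source):
    """Return a Counter mapping each lexeme (token string) to its frequency."""
    counts = Counter()
    readline = StringIO(source).readline
    for tok_type, tok_string, start, end, line in tokenize.generate_tokens(readline):
        if tok_type not in SKIP:
            counts[tok_string] += 1
    return counts

if __name__ == "__main__":
    src = "x = 1\nx = x + 1  # bump\n"
    for lexeme, n in sorted(count_lexemes(src).items()):
        print(repr(lexeme), n)
```

Note that, as the docstring says, comments come through as COMMENT tokens, so each distinct comment is counted as a lexeme too; filter tokenize.COMMENT into SKIP if that's unwanted.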

On 1 Apr 2002, Jeffrey Elkner wrote:
> is there anything out there already that i can use to parse python, c++, and java source files to get a listing and count of the lexemes that occur in each?
The Antlr parser generator by Terence Parr, http://www.antlr.org/, has an example lexer/parser for Java 1.3, so you might be able to generate a Java lexer and parser with Antlr and then drive it from Jython. I also saw a link to a production-quality C lexer and parser. This project looks interesting; if I have time, I'll see if I can cook up something. *grin* Good luck to you!
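Short of generating a real Antlr lexer, a rough regex-based tokenizer can get approximate lexeme counts for the C++ and Java files. The sketch below is my own illustration, not a conforming lexer: comments are stripped, and string/char literals are collapsed into a single placeholder identifier so their contents don't get tokenized as code.

```python
import re
from collections import Counter

# Strip // and /* */ comments first, then collapse string and char
# literals to the placeholder identifier STRINGLITERAL.
COMMENT_RE = re.compile(r'//[^\n]*|/\*.*?\*/', re.DOTALL)
STRING_RE = re.compile(r'"(?:\\.|[^"\\])*"|\'(?:\\.|[^\'\\])*\'')

# Identifiers/keywords, numbers, punctuation, and runs of operator chars.
# An approximation of a C-family token set, good enough for counting.
TOKEN_RE = re.compile(r'[A-Za-z_]\w*|\d+(?:\.\d+)?|[{}()\[\];,]|[-+*/%<>=!&|^~?:.]+')

def count_c_family_lexemes(source):
    """Approximate lexeme counts for C-family source (C++, Java)."""
    source = COMMENT_RE.sub(' ', source)
    source = STRING_RE.sub(' STRINGLITERAL ', source)
    return Counter(TOKEN_RE.findall(source))

if __name__ == "__main__":
    java = 'int x = 1; // init\nString s = "hi";\nx = x + 1;'
    for lexeme, n in sorted(count_c_family_lexemes(java).items()):
        print(lexeme, n)
```

This won't handle every corner case (nested templates, escape sequences in odd places, preprocessor directives), but for comparing lexeme counts across student programs it should be in the right ballpark.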
participants (3)
- Danny Yoo
- Jeffrey Elkner
- Patrick K. O'Brien