a little parsing challenge ☺

gene heskett gheskett at wdtv.com
Sun Jul 17 10:26:59 EDT 2011


On Sunday, July 17, 2011 10:12:27 AM Xah Lee did opine:

> 2011-07-16
> 
> folks, this one will be interesting one.
> 
> the problem is to write a script that can check a dir of text files
> (and all subdirs) and reports if a file has any mismatched matching
> brackets.
> 
> • The files will be utf-8 encoded (unix style line ending).
> 
> • If a file has mismatched matching-pairs, the script will display the
> file name, and the  line number and column number of the first
> instance where a mismatched bracket occures. (or, just the char number
> instead (as in emacs's “point”))
> 
> • the matching pairs are all single unicode chars. They are these and
> nothing else: () {} [] “” ‹› «» 【】 〈〉 《》 「」 『』
> Note that ‘single curly quote’ is not consider matching pair here.
> 
> • You script must be standalone. Must not be using some parser tools.
> But can call lib that's part of standard distribution in your lang.
> 
> Here's a example of mismatched bracket: ([)], (“[[”), ((, 】etc. (and
> yes, the brackets may be nested. There are usually text between these
> chars.)
> 
> I'll be writing a emacs lisp solution and post in 2 days. Ι welcome
> other lang implementations. In particular, perl, python, php, ruby,
> tcl, lua, Haskell, Ocaml. I'll also be able to eval common lisp
> (clisp) and Scheme lisp (scsh), Java. Other lang such as Clojure,
> Scala, C, C++, or any others, are all welcome, but i won't be able to
> eval it. javascript implementation will be very interesting too, but
> please indicate which and where to install the command line version.
> 
> I hope you'll find this a interesting “challenge”. This is a parsing
> problem. I haven't studied parsers except some Wikipedia reading, so
> my solution will probably be naive. I hope to see and learn from your
> solution too.
> 
> i hope you'll participate. Just post solution here. Thanks.
> 
>  Xah

This is a very old solution, some of it nearly 30 years old.
Written in C, the trick is in doing it in python I guess.
======================begin cntx.c=======================
/*^k^s
.ds2
.hb
.fb^k^s^b                     Cntx.c, page #^k^s^b
*****************************************************************
*                                                               *
*                       CC (C Checker)                          *
*                                                               *
*             C Source Brackets, Parenthesis, brace,            *
*                    quote and comment Checker                  *
*                                                               *
*                T. Jennings  -- Sometime in 1983               *
*                Slightly tweaked and renamed MINILINT          *
*                       KAB Aug 84                              *
*                Ported to OS9 and further tweaked              *
*                       Brian Paquette March 91                 *
*          Brackets, single, double quote counters added        *
*                   & renamed Cntx  04/09/91                    *
*                       by   Gene Heskett                       *
*                                                               *
*  some additional code for ignoring "syntax" chars inside of   *  
*      double quoted strings added 3/21/93 by Gene Heskett      *
*  same for single quoted stuffs 7/10/93 by Gene Heskett        *
* And long lines handling ability added too.                    *
* Adding tab ignorers and a counter to tally how many 11/21/93  *
****************************************************************/
#define OS9           /* used for nested comment handling*/
                      /* comment out for non OS9/6809*/

#include <stdio.h>
#include <ctype.h>
#include <string.h>
#define  FALSE 0
#define  TRUE  1

#ifdef   OS9
#define  NO  " No "
#define  YES " Yes "
char *cmnt;
#endif

/* Very crude but very effective C source debugger. Counts the numbers of
matching braces, parentheses and comments, and displays them at the left 
edge of the screen. The best way to see what it does is to do it; try

        cntx -v cntx.c

Properly handles parens and braces inside comments; they are ignored.
Also ignores single quotes if doubles are odd number, so singles
can be used in a printf string for punctuation.  IF you see the doubles
are odd at line end (the printout tally), all bets are OFF! 
Enter cntx on the command line by itself for a usage note.
*/

main(argc,argv)
int argc;
char *argv[];
{
     FILE *f;
     char c,secnd_c,lastc;
     int bracket,parens,braces,squote,dquote,comments;
     int perr,bkerr,brerr,sqerr,dqerr;
     int verbose,okay;
     int filearg = 0;
     int line, col, tabc;

     if ((argc < 2)||(argc > 3)) getout(0);
     if (argc == 3)
     {
       verbose = TRUE;      /* already tested for -v switch  */
       if((argv[1][0] == '-') && (toupper(argv[1][1]) == 'V'))
         filearg = 2;       /*file name pointed to by argv[2] */
       if((argv[2][0] == '-') && (toupper(argv[2][1]) == 'V'))
         filearg = 1;
       if(!filearg) getout(192);
     }
     else
     {
       verbose = FALSE;
       filearg = 1;
     }
     if ((f = fopen(argv[filearg],"r")) == NULL)
     {
       fprintf(stderr,"Cntx: can't open '%s'\n",argv[1]);
       getout(216);
     }
     bracket= braces= parens= comments= squote= dquote= 0;
     perr= bkerr= brerr= sqerr= dqerr= 0;
     line=  col= tabc= 0;
     secnd_c= lastc= '\0';
     
     while ((c = getc(f)) != EOF)
     {
        while(c==0x09) /* ignore, but tally the count */
        {
           tabc+=1;
           c=getc(f);
        }

/* print running tally if in verbose mode and at beginning of line*/
/* OS9 version prints status of whether or not one is in a comment rather*/
/* than a count, as the Microware C compiler does not nest comments*/
     
       if ((col == 0) && verbose )
       {
#ifdef OS9
         if (comments)
           cmnt = YES;
         else cmnt = NO;
         printf("%d:   [%d]   {%d}   (%d)   \'%d\'   \"%d\"   /*%s*/ 
tabcnt=%d\n\n",
                      line,bracket,braces,parens,squote,dquote,cmnt,tabc);
#else
         printf("%d:   [%d]   {%d}   (%d)   \'%d\'   \"%d\"   /*%d*/\n\n",
                      line,bracket,braces,parens,squote,dquote,comments);
#endif
       }
     
/* additions to help tally squote & dquote errors at line end,
squotes and dquotes should match if we don't count those squotes
present when dquotes are odd number as in inside a printf or
puts statement.  Also if they are part of an escape sequence,
don't count */
     
       if (col == 0 && (squote % 2) ) ++sqerr;
       if (col == 0 && (dquote % 2) ) ++dqerr;
       if (col == 0 && bracket )     ++bkerr;

/* now clears the error to get back in step */
       if (col == 0) squote=dquote=0; 

/* Don't count parens and braces that are inside comments. This of course
assumes that comments are properly matched; in any case, that will be the
first thing to look for. */
     
       if (comments <= 0)
       {     /* 3/20/93, 7/10/93 taking sensitivity out of quoted stuffs */
		/* here, do ++dquote if its not a char constant like this 
*/
	 if ( c == '"' ) ++dquote; /* a little simpler */

         if ( !(dquote & 1) ) /* was the && of those */
         {
           if (c == '{' ) ++braces;
           if (c == '(' ) ++parens;
           if (lastc != '\'' && secnd_c == '[' && c != '\'' ) ++bracket;
/* here, skip squotes in a "text string's" */
           if ( secnd_c != '\\' && c== '\'' && !(dquote) ) ++squote;
           if ( lastc == '\\' && secnd_c == '\\' && c == '\'' ) ++squote;
           if (c == '}' ) --braces;
           if (c == ')' ) --parens;
           if (lastc != '\'' && secnd_c == ']' && c != '\'' ) --bracket;
         } 
       }

/* Now do comments. This properly handles nested comments;
whether or not the compiler does is your responsibility */

#ifdef OS9

/* The Microware C compiler for OS9 does NOT nest comments. */
/* The comment-close-mark (asterisk-backslash) will terminate */
/* (see K & R) a comment no matter how many '/*' come before it*/

       if ((c == '/') && (secnd_c == '*'))
         comments = 0;
       if ((c == '*') && (secnd_c == '/') && (comments == 0))
         ++comments;
#else
       if ( (c == '/' ) && (secnd_c == '*' ) ) --comments;
       if ( (c == '*' ) && (secnd_c == '/' ) ) ++comments;
#endif
       ++col;
       if (c == '\n' && secnd_c != '\\' )
       {            /* non-escaped newline == New Line */
         col= 0;                 /* set column 0 */
         ++line;
       }
       if (verbose)
         putchar(c);                 /* display text */
       lastc= secnd_c;                       /* update last char */
       secnd_c= c;
     }
     if (verbose)
     {
#ifdef OS9
       if (comments)
         cmnt = YES;
       else cmnt = NO;
       printf("EOF:   [%d]   {%d}   (%d)   \'%d\'   \"%d\"   /*%s*/\n", 
                bracket,braces,parens,squote,dquote,cmnt);
#else
       printf("EOF:   [%d]   {%d}   (%d)   \'%d\'   \"%d\"   /*%d*/\n", 
                bracket,braces,parens,squote,dquote,comments);
#endif
     }
     okay = TRUE;
     if (bracket||bkerr) puts("Unbalanced brackets\n"), okay = FALSE;
     if (braces) puts("Unbalanced braces\n"),okay = FALSE;
     if (parens) puts("Unbalanced parentheses\n"),okay = FALSE;
     if (sqerr||(squote%2)) puts("Unmatched single quotes\n"),okay=FALSE;
     if (dqerr||(dquote%2)) puts("Unmatched double quotes\n"),okay=FALSE;
     if (comments) puts("Unbalanced comments\n"),okay = FALSE;
     if (okay)  puts("No errors found\n");
}
getout(errex)
int    errex;
{
     fprintf(stderr,"Usage: Cntx [-v] <filename> [-v]\n");
     fprintf(stderr,"       -v = verbose mode \n");
     exit(errex);
}
=====================end cntx.c====================
=================begin cntx.hlp====================
This   "Cntx"  is based rather loosely on the  previously uploaded
file  called  MINILINT, in that if you use the -v option, it  will
show  you  the  file  and its report on a line by  line  basis  as
MINILINT  did.  Cntx however will also check for use and misuse of
more  of the usual "C" punctuation.  Its smart enough to ignore an
"escaped"  character,  or those buried in a text string  inside  a
printf("[[{{'etc");  statement.   The basic organization  is  from
"MINILINT",  but much expanded in checking scope.  It still is NOT
a "lint" which is why I didn't call it that, but it has turned out
to  be awfully handy. Ported to the Amiga, it found some stuff  in
the  code I was feeding DICE that I had totally missed, and  which
was  not being properly reported by DICE either, the errors it was
spitting  out  made  no  sense whatsoever. I had  somehow  lost  a
terminating  "}" in one of the PRINTFORM files in the  translation
to  a  C that required proto statements.  Cntx found it,  even  if
Dillons Integrated C Environments "dcc" didn't. But it still isn't
a "lint", not yet.

Usage: cntx [-v] filename

Without  the -v, it rapidly scans the whole source file and  gives
only  a final report of "no errors found" or "mismatched brackets",
etc.

Added 3/20/93 MEH: One more conditional test now causes it to skip
thru  any parens, braces or brackets found within a double  quoted
string  such as the format string for printf.  As the tally  needs
to  be  reset  at the start of a new line to  maintain  the  error
checking  phasing  in  case  there is an error, the  total  double
quote  count for the whole file is no longer kept. Only the  error
tally  now  shows at the end of a file scan.  So to see  the  line
with  the  error,  you  must use the -v option,  preferably  on  a
pauseing  screen so that one screen full of data can be seen at  a
time.   I liked the totals myself, but this does work better.  Now
Edition 4.

Added 7/10/93 MEH: Essentially the same as the above paragraph but
for  stuff  inside a pair of single quotes, so now *any* character
can be single quoted without being an error.  Now Edition 5.
===================end cntx.hlp=====================

Sometimes this list is hilarious with its re-inventions of the wheel. :)

The above code isn't the final version, I had it running on linux too, but 
one of fedoras (or W.D.s pissy drives) infamous crashes caused that version 
to come up missing, forgotten when I was doing the salvage operation to a 
new hard drive.

Cheers, gene
-- 
"There are four boxes to be used in defense of liberty:
 soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
A holding company is a thing where you hand an accomplice the goods while
the policeman searches you.



More information about the Python-list mailing list