[Tutor] indented grammar parsing

spir denis.spir at free.fr
Sun Jan 11 19:11:53 CET 2009


Hello,

this is a rather specific question about parsing an indented grammar.
I have a working implementation (presented below) that may be worth a critic review -- if you like
it.

First issue is: is there a way to express an indented formatfing using a common grammar
language such as BNF (or regex)? If yes, how to do that? If not, what kind of formal grammar is
actually such a format? Is there any standard way to express it?

I have searched python's grammar for this: actually, for any compound statement, a block (=
section content) is simply renamed 'suite' with no more format expression. The lexical analysis
section on indentation reads:

'' The indentation levels of consecutive lines are used to generate INDENT and DEDENT tokens, using
a stack, as follows. Before the first line of the file is read, a single zero is pushed on the
stack; this will never be popped off again. The numbers pushed on the stack will always be
strictly increasing from bottom to top. At the beginning of each logical line, the line’s
indentation level is compared to the top of the stack. If it is equal, nothing happens. If it is
larger, it is pushed on the stack, and one INDENT token is generated. If it is smaller, it must be
one of the numbers occurring on the stack; all numbers on the stack that are larger are popped
off, and for each number popped off a DEDENT token is generated. At the end of the file, a DEDENT
token is generated for each number remaining on the stack that is larger than zero. ''

I guess from this text that python parsers have a pre-parse phase that generate the so-called
INDENT and DEDENT tokens -- which are then used during actual parsing. So that a typical section
looks like this for the parser:

header
	INDENT
	line
	line
	line
	DEDENT

Which is, in fact, recreating a common block start / stop format (e.g. C's {block}). Is it? Do you
know whether python parsers actually do that during a kind of pre-parse phase, or whether their
designers have found a way of matching indent/dedent together with the rest of the source text?

Thank you,
Denis

==================

My trial:
I have have implemented an overall parser to parse an indented grammar in the simplest (recursive)
case I could specify:
* no blank line, no comment
* a single format of inline pattern
* each indent level marked as a single (tab) char
* only step 1 indentation increase

To do this, I use a set of tricks. Below the grammar (using a pseudo BNF/regex
format -- actually it is implemented using a custom parser generator):

inline 	: [a..z0..9_]+
line 	: Indent inline NL
bloc 	: (section | line)+
section : line IndentMore bloc IndentLess
all	: <= bloc>

Oviously, I use 3 names starting with Indent, that are not defined. They are kinds of
pseudo-patterns and all rely on an external variable that holds the current indent level and is
hosted as an attribute of Indent:
* Indent: matches if a proper indentation is found, meaning equal to current indent level --
advances pointer
* IndentMore: matches if the actual indentation found is =level+1 -- does not advance pointer
(=lookahead) -- increments current level
* IndentLess: matches if the actual indentation found is <level -- does not advance pointer --
decrements current level of -1 whatever the actual indent level found

Last note causes that n IndentLess matches will be generated even when the indent level is
brutally decremented of n steps at once -- so that in a valid text each IndentMore match gets its
IndentLess counter-part.
I wonder about the indent level record stack introduced in CPython's description: I need only a
single variable.

As a example, here is the actual code of IndentLess:

class IndentLess(Pattern):
	def __init__(self):
		Pattern.__init__(self)
	def _check(self,text,pos):
		''' check indent level decrease (of any number of step)
			-- but decrement current level by 1 only in any case
			~ no pointer advance <==> Right() '''
		# match indentation
		start_pos = pos
		(result,pos) = ZeroOrMore(TAB)._check(text,pos)
		# case proper indent decrement
		indent_level = pos - start_pos
		if indent_level < Indent.level:
			Indent.level -= 1
			result = Result(self, None, text,start_pos,start_pos)
			return (result,start_pos)	# (no pointer advance)
		# case no indent decrement (by step 1 or more)
		# (may also be end-of-text reached)
		if pos >= len(text):
			raise EndOfText(self, text, start_pos)
		message = 	"No proper indentation decrement:\nindent level %s --> %s"\
				%(Indent.level, indent_level)
		raise NoMatch(self, text, start_pos, message)
	def _full_expr(self):
		return "INDENT--"

Below a test text and the corresponding parse result shown in tree view:

=======
1
2
3
	31
	32
4
5
6
	61
	62
	63
		631
		632
7
=======

=======
all
	line:1
	line:2
	section
		line:3
		bloc
			line:31
			line:32
	line:4
	line:5
	section
		line:6
		bloc
			line:61
			line:62
			section
				line:63
				bloc
					line:631
					line:632
	line:7
=======

I first want to use this structure to parse config files written according to an recursive
indented format. So that the result would be a (record) object holding properties which actually
each are either a parameter or a nested sub-config, e.g.:

config
	name : val
	name : val
	connexion
		name : val
		name : val
	transmission
		name : val
		name : val


------
la vida e estranya


More information about the Tutor mailing list