parsing an Excel formula with the re module

John Machin sjmachin at lexicon.net
Thu Jan 14 19:16:04 EST 2010


On Jan 13, 7:15 pm, Paul McGuire <pt... at austin.rr.com> wrote:
> On Jan 5, 1:49 pm, Tim Chase <python.l... at tim.thechases.com> wrote:
>
>
>
> > vsoler wrote:
> > > Hence, I need toparseExcel formulas. Can I do it by means only of re
> > > (regular expressions)?
>
> > > I know that for simple formulas such as "=3*A7+5" it is indeed
> > > possible. What about complex for formulas that include functions,
> > > sheet names and possibly other *.xls files?
>
> > Where things start getting ugly is when you have nested function
> > calls, such as
>
> >    =if(Sum(A1:A25)>42,Min(B1:B25), if(Sum(C1:C25)>3.14,
> > (Min(C1:C25)+3)*18,Max(B1:B25)))
>
> > Regular expressions don't do well with nested parens (especially
> > arbitrarily-nesting-depth such as are possible), so I'd suggest
> > going for a full-blown parsing solution like pyparsing.
>
> > If you have fair control over what can be contained in the
> > formulas and you know they won't contain nested parens/functions,
> > you might be able to formulate some sort of "kinda, sorta, maybe
> > parses some forms of formulas" regexp.
>
> > -tkc
>
> This might give the OP a running start:

Unfortunately "this" will blow up after only a few paces; see
below ...

>
> from pyparsing import (CaselessKeyword, Suppress, Word, alphas,
>     alphanums, nums, Optional, Group, oneOf, Forward, Regex,
>     operatorPrecedence, opAssoc, dblQuotedString)
>
> test1 = "=3*A7+5"
> test2 = "=3*Sheet1!$A$7+5"

test2a ="=3*'Sheet 1'!$A$7+5"
test2b ="=3*'O''Reilly''s sheet'!$A$7+5"


> test3 = "=if(Sum(A1:A25)>42,Min(B1:B25), " \
>      "if(Sum(C1:C25)>3.14, (Min(C1:C25)+3)*18,Max(B1:B25)))"

Many functions can take a variable number of args and they are not
restricted to cell references e.g.

test3a = "=sum(a1:a25,10,min(b1,c2,d3))"

The arg separator is comma or semicolon depending on the locale ... a
parser should accept either.


> EQ,EXCL,LPAR,RPAR,COLON,COMMA,DOLLAR = map(Suppress, '=!():,$')
> sheetRef = Word(alphas, alphanums)
> colRef = Optional(DOLLAR) + Word(alphas,max=2)
> rowRef = Optional(DOLLAR) + Word(nums)
> cellRef = Group(Optional(sheetRef + EXCL)("sheet") + colRef("col") +
>                     rowRef("row"))
>
> cellRange = (Group(cellRef("start") + COLON + cellRef("end"))
> ("range")
>                 | cellRef )
>
> expr = Forward()
>
> COMPARISON_OP = oneOf("< = > >= <= != <>")
> condExpr = expr + COMPARISON_OP + expr
>
> ifFunc = (CaselessKeyword("if") +
>           LPAR +
>           Group(condExpr)("condition") +

that should be any expression; at run-time it expects a boolean (TRUE
or FALSE) or a number (0 means false, non-0 means true). Text causes a
#VALUE! error. Trying to subdivide expressions into conditional /
numeric /text just won't work.


>           COMMA + expr("if_true") +
>           COMMA + expr("if_false") + RPAR)
> statFunc = lambda name : CaselessKeyword(name) + LPAR + cellRange +
> RPAR
> sumFunc = statFunc("sum")
> minFunc = statFunc("min")
> maxFunc = statFunc("max")
> aveFunc = statFunc("ave")
> funcCall = ifFunc | sumFunc | minFunc | maxFunc | aveFunc
>
> multOp = oneOf("* /")
> addOp = oneOf("+ -")

needs power op "^"

> numericLiteral = Regex(r"\-?\d+(\.\d+)?")

Sorry, that "-" in there is a unary minus operator. What about 1e23 ?

> operand = numericLiteral | funcCall | cellRange | cellRef
> arithExpr = operatorPrecedence(operand,
>     [
>     (multOp, 2, opAssoc.LEFT),
>     (addOp, 2, opAssoc.LEFT),
>     ])
>
> textOperand = dblQuotedString | cellRef
> textExpr = operatorPrecedence(textOperand,
>     [
>     ('&', 2, opAssoc.LEFT),
>     ])

Excel evaluates excessively permissively, and the punters are
definitely not known for self-restraint. The above just won't work:
2.3 & 4.5 produces text "2.34.5", while "2.3" + "4.5" produces number
6.8.




More information about the Python-list mailing list