[Tutor] Parsing problem

Mon Jul 25 17:16:09 CEST 2005

Liam -

I made some changes and timed them, I think this problem is solvable.  (All
timings are done using your data, on a P4 800MHz laptop.)

1. Baseline, the current state, in the parser code you sent me:

bracketGroup << ( pp.Group( LBRACE + ( pp.empty ^ pp.OneOrMore(assignment) ^
pp.OneOrMore(identifier) ^ pp.OneOrMore(pp.dblQuotedString) ^
pp.OneOrMore(number) ^ pp.OneOrMore(bracketGroup) ) + RBRACE ) )

Time: 02:20.71 (mm:ss.sss)

Just for general edification, '^' means Or, and it will evaluate all the
alternatives and choose the longest match (in regexp docs, this is sometimes
referred to as "greedy" matching); '|' means MatchFirst, and it will only
evaluate alternatives until it finds a match (which I refer to as "eager"
matching).  In the past, I've had only slight results converting '^' to '|',
but since this is a recursive expression, evaluating all of the possible
alternatives can involve quite a bit of processing before selecting the
longest.

2. Convert to '|', replace empty with ZeroOrMore:

bracketGroup << ( pp.Group( LBRACE + pp.ZeroOrMore( assignment | identifier
| pp.dblQuotedString | number | bracketGroup ) + RBRACE ) )

Time: 00:14.57

This is getting us somewhere!  Replaced empty and OneOrMore's with a single
ZeroOrMore, and changed from '^' to '|'.  Since there is no overlap of the
various alternatives *in their current order*, it is safe to use '|'. (This
would not be the case if assignment came after identifier - this should be a
hint on how to resolve the 'empty' problem.)  One problem with this
expression is that it will permit mixed bracket groups, such as { "A" 10
b=1000 {} }.

3. Go back to baseline, change '^' to '|', *and put empty at the end*

bracketGroup << ( pp.Group( LBRACE + ( pp.OneOrMore(assignment) |
pp.OneOrMore(identifier) | pp.OneOrMore(pp.dblQuotedString) |
pp.OneOrMore(number) | pp.OneOrMore(bracketGroup) | pp.empty ) + RBRACE ) )

Time: 00:12.04

Best solution yet!  This is faster than #2, since, once a match is made on
the first term within  a bracketGroup, all others in the group are expected
to be the same type.  Since '|' means "take first match", we resolve empty's
"accept anything" behavior simply by putting it at the end of the list.

4. Make change in #3, also convert '^' to '|' in RHS.

RHS << ( pp.dblQuotedString | identifier | number | bracketGroup )

Time: 00:01.15

Voila!  I'm happy to say, this is the first time I've seen a 100X
improvement, mostly by replacing '^' by '|'.  While this is not *always*
possible (see the CORBA IDL parser in the examples directory), it is worth
the effort, especially with a recursive expression.

The one item to be wary of when using '|' is when expressions mask each
other.  The easiest example is when trying to parse numbers, which may be
integers or reals.  If I write the expression as (assuming that integers
will match a sequence of digits, and reals will match digits with a decimal
point and some more digits):

number = (integer | real)

I will never match a real number! The integer expression "masks" the real,
and since it occurs first, it will match first.  The two solutions are:

number = (integer ^ real)
Or
number = (real | integer)

That is, use an Or, which will match the longest, or reorder the MatchFirst
to put the most restrictive expression first.

Welcome to pyparsing, please let me know how your project goes!

-- Paul

-----Original Message-----
From: Liam Clarke [mailto:cyresse at gmail.com] 
Sent: Monday, July 25, 2005 8:31 AM
To: Paul McGuire
Subject: Re: [Tutor] Parsing problem

Hi Paul, 

I've attached the latest version. It includes my sample data within the
file. The sample data came in at 8 minutes 32 seconds without Pysco, 5
minutes 25 with, on a  650MHz Athlon.

I was pondering whether breaking the test data down into separate bits via
some preprocessing and feeding the simpler data structures in would help at
all.

Unfortunately, as I'm using pp.empty to deal with empty bracket sets (which
were causing my 'expected "}" ' exceptions), using | matches to pp.empty
first. 

I'm not sure how to get around the empty brackets without using that.

I also get the feeling that pyparsing was more designed for making parsing
small complex expressions easy, as opposed to my data churning. That said, I
can think of about ten different projects I'd played with before giving up
because of a problem that pyparsing handles elegantly.
Usually it was regexes that got me. So if I have to attack this another way,
at least I know the basics of a good module now. :)

Much thanks for your time and energy, having read your listing on the c2
wiki (I searched for pyparsing on the offchance) I can see you're a busy
man, and I'm grateful for your efforts to help me.

Regards, 

Liam Clarke

On 7/26/05, Paul McGuire <paul at alanweberassociates.com> wrote:

	Liam-

	Please send me your latest version of the grammar, and I will post
	suggestions on the Tutor list.

	-- Paul

	-----Original Message-----
	From: Liam Clarke [mailto: cyresse at gmail.com
<mailto:cyresse at gmail.com> ]
	Sent: Monday, July 25, 2005 7:38 AM
	To: Paul McGuire
	Cc: tutor at python.org
	Subject: Re: [Tutor] Parsing problem

	Hi Paul,

	Well various tweaks and such done, it parses perfectly, so much
thanks, I 
	think I now have a rough understanding of the basics of pyparsing.

	Now, onto the fun part of optimising it. At the moment, I'm looking
at 2 - 5
	minutes to parse a 2000 line country section, and that's with psyco.
Only 
	problem is, I have 157 country sections...

	I am running a 650 MHz processor, so that isn't helping either. I
read this
	quote on http://pyparsing.sourceforge.net .

	"Thanks again for your help and thanks for writing pyparser! It
seems my
	code needed to be optimized and now I am able to parse a 200mb file
in 3
	seconds. Now I can stick my tongue out at the Perl guys ;)" 

	I'm jealous, 200mb in 3 seconds, my file's only 4mb.

	Are there any general approaches to optimisation that work well?

	My current thinking is to use string methods to split the string
into each
	component section, and then parse each section to a bare minimum
key, value. 
	ie - instead of parsing

	x = { foo = { bar = 10 bob = 20 } type = { z = { } y = { } }}

	out fully, just parse to "x":"{ foo = { bar = 10 bob = 20 } type = {
z = { }
	y = { } }}"

	I'm thinking that would avoid the complicated nested structure I
have now,
	and I could parse data out of the string as needed, if needed at
all.

	Erk, I don't know, I've never had to optimise anything.

	Much thanks for creating pyparsing, and doubly thank-you for your
assistance 
	in learning how to use it.

	Regards,

	Liam Clarke

	On 7/25/05, Liam Clarke <cyresse at gmail.com> wrote:

	        Hi Paul,

	        My apologies, as I was jumping into my car after sending
that email, 
	it clicked in my brain.
	        "Oh yeah... initial & body..."

	        But good to know about how to accept valid numbers.

	        Sorry, getting a bit too quick to fire off emails here. 

	        Regards,

	        Liam Clarke

	        On 7/25/05, Paul McGuire < paul at alanweberassociates.com
	<mailto: paul at alanweberassociates.com
<mailto:paul at alanweberassociates.com> > > wrote:

	                Liam -

	                The two arguments to Word work this way:
	                - the first argument lists valid *initial*
characters
	                - the second argument lists valid *body* or
subsequent
	characters

	                For example, in the identifier definition,

	                identifier = pp.Word(pp.alphas, pp.alphanums +
"_/:.")

	                identifiers *must* start with an alphabetic
character, and
	then may be
	                followed by 0 or more alphanumeric or _/: or .
characters.
	If only one
	                argument is supplied, then the same string of
characters is
	used as both
	                initial and body.  Identifiers are very typical for
2
	argument Word's, as
	                they often start with alphas, but then accept digits
and
	other punctuation.
	                No whitespace is permitted within a Word.  The Word
matching
	will end when a
	                non-body character is seen.

	                Using this definition:

	                integer = pp.Word(pp.nums+"-+.", pp.nums)

	                It will accept "+123", "-345", "678", and ".901".
But in a
	real number, a
	                period may occur anywhere in the number, not just as
the
	initial character,
	                as in "3.14159".  So your bodyCharacters must also
include a
	".", as in:

	                integer = pp.Word(pp.nums+"-+.", pp.nums+".")

	                Let me say, though, that this is a very permissive
	definition of integer -
	                for one thing, we really should rename it something
like
	"number", since it
	                now accepts non-integers as well!  But also, there
is no
	restriction on the
	                frequency of body characters.  This definition would
accept
	a "number" that
	                looks like "3.4.3234.111.123.3234".  If you are
certain that
	you will only
	                receive valid inputs, then this simple definition
will be
	fine.  But if you
	                will have to handle and reject erroneous inputs,
then you
	might do better
	                with a number definition like:

	                number = Combine( Word( "+-"+nums, nums ) +
	                                  Optional( point + Optional( Word(
nums ) )
	) )

	                This will handle "+123", "-345", "678", and "0.901",
but not
	".901".  If you
	                want to accept numbers that begin with "."s, then
you'll
	need to tweak this
	                a bit further.

	                One last thing: you may want to start using
setName() on
	some of your
	                expressions, as in:

	                number = Combine( Word( "+-"+nums, nums ) +
	                                  Optional( point + Optional( Word(
nums ) )
	)
	                ).setName("number")

	                Note, this is *not* the same as setResultsName.
Here
	setName is attaching a
	                name to this pattern, so that when it appears in an
	exception, the name will
	                be used instead of an encoded pattern string (such
as
	W:012345...).  No need
	                to do this for Literals, the literal string is used
when it
	appears in an
	                exception.

	                -- Paul

	        --

	        'There is only one basic human right, and that is to do as
you damn 
	well please.
	        And with it comes the only basic human duty, to take the
	consequences.'

	--
	'There is only one basic human right, and that is to do as you damn
well
	please.
	And with it comes the only basic human duty, to take the
consequences.' 

--
'There is only one basic human right, and that is to do as you damn well
please.
And with it comes the only basic human duty, to take the consequences.'