[Tutor] Parsing problem

Paul McGuire paul at alanweberassociates.com
Mon Jul 25 15:11:04 CEST 2005


Liam -

Could you e-mail me your latest grammar?  The last version I had includes
this definition for RHS:

RHS << ( pp.dblQuotedString.setParseAction(pp.removeQuotes) ^
         identifier ^
         integer ^
         pp.Group( LBRACE + pp.ZeroOrMore( assignment ^ RHS ) + RBRACE ) )

What happens if you replace the '^' operators with '|', as in:

RHS << ( pp.dblQuotedString.setParseAction(pp.removeQuotes) |
         identifier |
         integer |
         pp.Group( LBRACE + pp.ZeroOrMore( assignment | RHS ) + RBRACE ) )

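I think earlier on, you needed to use '^' because your various terms were
fairly vague (you were still using Word(pp.printables), which would accept
just about anything).  But now I think there is little ambiguity between a
quoted string, identifier, etc., and simple '|' (MatchFirst) expressions will
do.  Since '^' (Or) evaluates every alternative and keeps the longest match,
while '|' stops at the first alternative that matches, this should cut down
the work done at each level of nesting.  This is about the only optimization
I can think of.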
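
To make the difference concrete, here is a minimal sketch.  The two Literal
alternatives are only there to show how '^' and '|' pick a match.  The
definitions of LBRACE, RBRACE, EQUALS, identifier, integer and assignment are
my assumptions, since I don't have your latest grammar; substitute your own.

```python
import pyparsing as pp

# '^' builds an Or: every alternative is tried and the longest match wins.
# '|' builds a MatchFirst: alternatives are tried in order, first match wins.
longest = pp.Literal("ab") ^ pp.Literal("abc")
first = pp.Literal("ab") | pp.Literal("abc")
print(longest.parseString("abc").asList())  # ['abc']
print(first.parseString("abc").asList())    # ['ab']

# Assumed stand-ins for the parts of the grammar not shown in this message.
LBRACE = pp.Suppress("{")
RBRACE = pp.Suppress("}")
EQUALS = pp.Suppress("=")
identifier = pp.Word(pp.alphas, pp.alphanums + "_/:.")
integer = pp.Word(pp.nums + "-+", pp.nums)

RHS = pp.Forward()
assignment = pp.Group(identifier + EQUALS + RHS)

# copy() keeps the parse action off the shared module-level dblQuotedString.
RHS << (pp.dblQuotedString.copy().setParseAction(pp.removeQuotes)
        | identifier
        | integer
        | pp.Group(LBRACE + pp.ZeroOrMore(assignment | RHS) + RBRACE))

sample = "x = { foo = { bar = 10 bob = 20 } type = { z = { } y = { } }}"
result = assignment.parseString(sample, parseAll=True)
print(result.asList())
```

With '|', the order of alternatives matters: assignment has to come before RHS
inside the braces, otherwise a bare identifier would match first and the "="
would be left unparsed.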

-- Paul
 

-----Original Message-----
From: Liam Clarke [mailto:cyresse at gmail.com] 
Sent: Monday, July 25, 2005 7:38 AM
To: Paul McGuire
Cc: tutor at python.org
Subject: Re: [Tutor] Parsing problem

Hi Paul, 

Well, with various tweaks and such done, it parses perfectly, so many thanks.
I think I now have a rough understanding of the basics of pyparsing. 

Now, onto the fun part of optimising it. At the moment, I'm looking at 2 - 5
minutes to parse a 2000 line country section, and that's with psyco. Only
problem is, I have 157 country sections...

I am running a 650 MHz processor, so that isn't helping either. I read this
quote on http://pyparsing.sourceforge.net.

"Thanks again for your help and thanks for writing pyparser! It seems my
code needed to be optimized and now I am able to parse a 200mb file in 3
seconds. Now I can stick my tongue out at the Perl guys ;)"

I'm jealous: 200 MB in 3 seconds, and my file's only 4 MB.

Are there any general approaches to optimisation that work well?

My current thinking is to use string methods to split the string into each
component section, and then parse each section to a bare minimum key, value
pair, i.e., instead of parsing 

x = { foo = { bar = 10 bob = 20 } type = { z = { } y = { } }}

out fully, just parse it to

"x":"{ foo = { bar = 10 bob = 20 } type = { z = { } y = { } }}"

I'm thinking that would avoid the complicated nested structure I have now,
and I could parse data out of the string as needed, if needed at all.
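
A rough sketch of that splitting idea, using only string methods and a brace
counter.  The function name split_sections is made up for illustration, and it
assumes every top-level value is a braced block, the braces are balanced, and
no brace appears inside a quoted string.

```python
def split_sections(text):
    """Map each top-level key to the raw text of its braced value."""
    sections = {}
    pos = 0
    while True:
        eq = text.find("=", pos)
        if eq == -1:
            break
        key = text[pos:eq].strip()
        start = text.find("{", eq)
        if start == -1:
            break
        # Walk forward, counting braces until the opening one is closed.
        depth = 0
        for i in range(start, len(text)):
            if text[i] == "{":
                depth += 1
            elif text[i] == "}":
                depth -= 1
                if depth == 0:
                    break
        else:
            raise ValueError("unbalanced braces after key %r" % key)
        sections[key] = text[start:i + 1]
        pos = i + 1
    return sections


sample = "x = { foo = { bar = 10 bob = 20 } type = { z = { } y = { } }}"
print(split_sections(sample))
```

Each raw value could then be handed to the full grammar only when it is
actually needed.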

Erk, I don't know, I've never had to optimise anything. 

Much thanks for creating pyparsing, and doubly thank-you for your assistance
in learning how to use it. 

Regards, 

Liam Clarke

On 7/25/05, Liam Clarke <cyresse at gmail.com> wrote:

	Hi Paul, 
	
	My apologies, as I was jumping into my car after sending that email,
	it clicked in my brain. 
	"Oh yeah... initial & body..."
	
	But good to know about how to accept valid numbers.
	
	Sorry, getting a bit too quick to fire off emails here.
	
	Regards, 
	
	Liam Clarke
	
	
	On 7/25/05, Paul McGuire <paul at alanweberassociates.com> wrote:
	

		Liam -
		
		The two arguments to Word work this way:
		- the first argument lists valid *initial* characters
		- the second argument lists valid *body* or subsequent characters
		
		For example, in the identifier definition, 
		
		identifier = pp.Word(pp.alphas, pp.alphanums + "_/:.")
		
		identifiers *must* start with an alphabetic character, and then may be
		followed by 0 or more alphanumeric or _/: or . characters.  If only one
		argument is supplied, then the same string of characters is used as both
		initial and body.  Identifiers are very typical for 2 argument Word's, as
		they often start with alphas, but then accept digits and other punctuation.
		No whitespace is permitted within a Word.  The Word matching will end when a
		non-body character is seen.
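
A small sketch of the initial/body behaviour described above.  The input
strings are made up for illustration.

```python
import pyparsing as pp

identifier = pp.Word(pp.alphas, pp.alphanums + "_/:.")

# Matching stops at the first non-body character (a space, then a '-').
print(identifier.parseString("unit_1/inf:a.b next").asList())  # ['unit_1/inf:a.b']
print(identifier.parseString("abc-def").asList())              # ['abc']

# A digit is not a valid *initial* character here, so this input is rejected.
try:
    identifier.parseString("9lives")
    starts_with_digit_ok = True
except pp.ParseException:
    starts_with_digit_ok = False
print(starts_with_digit_ok)  # False

# With a single argument, the same characters serve as initial and body.
digits = pp.Word(pp.nums)
print(digits.parseString("2005-07-25").asList())  # ['2005']
```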
		
		Using this definition:
		
		integer = pp.Word(pp.nums+"-+.", pp.nums)
		
		It will accept "+123", "-345", "678", and ".901".  But in a real number, a
		period may occur anywhere in the number, not just as the initial character,
		as in "3.14159".  So your bodyCharacters must also include a ".", as in:
		
		integer = pp.Word(pp.nums+"-+.", pp.nums+".")
		
		Let me say, though, that this is a very permissive definition of integer -
		for one thing, we really should rename it something like "number", since it
		now accepts non-integers as well!  But also, there is no restriction on the
		frequency of body characters.  This definition would accept a "number" that
		looks like "3.4.3234.111.123.3234".  If you are certain that you will only
		receive valid inputs, then this simple definition will be fine.  But if you
		will have to handle and reject erroneous inputs, then you might do better
		with a number definition like:
		
		point = pp.Literal(".")
		number = pp.Combine( pp.Word( "+-"+pp.nums, pp.nums ) +
		                     pp.Optional( point + pp.Optional( pp.Word( pp.nums ) ) ) )
		
		This will handle "+123", "-345", "678", and "0.901", but not ".901".  If you
		want to accept numbers that begin with "."s, then you'll need to tweak this
		a bit further.
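
To compare the permissive and the stricter definitions side by side, here is a
short sketch.  The names loose and accepts are made up for this example, and
point is assumed to be a Literal(".").

```python
import pyparsing as pp

# The permissive single-Word version: any mix of digits and periods.
loose = pp.Word(pp.nums + "-+.", pp.nums + ".")

# The stricter version: optional sign, digits, at most one decimal point.
point = pp.Literal(".")
number = pp.Combine(pp.Word("+-" + pp.nums, pp.nums)
                    + pp.Optional(point + pp.Optional(pp.Word(pp.nums))))


def accepts(expr, text):
    """Return True if expr matches the whole of text."""
    try:
        expr.parseString(text, parseAll=True)
        return True
    except pp.ParseException:
        return False


for text in ["+123", "-345", "678", "0.901", ".901", "3.4.3234.111.123.3234"]:
    print(text, accepts(loose, text), accepts(number, text))
```

The last two inputs are where the definitions differ: loose accepts both,
while number rejects both.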
		
		One last thing: you may want to start using setName() on some of your
		expressions, as in:
		
		number = pp.Combine( pp.Word( "+-"+pp.nums, pp.nums ) +
		                     pp.Optional( point + pp.Optional( pp.Word( pp.nums ) ) )
		                   ).setName("number")
		
		Note, this is *not* the same as setResultsName.  Here setName is attaching a
		name to this pattern, so that when it appears in an exception, the name will
		be used instead of an encoded pattern string (such as W:012345...).  No need
		to do this for Literals, the literal string is used when it appears in an
		exception.
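
A minimal sketch of the two methods, using a plain Word so the effect is easy
to see.  The names "integer" and "count" are made up for this example, and the
exact wording of the exception text varies between pyparsing versions.

```python
import pyparsing as pp

# setName labels the pattern itself; the label shows up in error messages.
integer = pp.Word(pp.nums).setName("integer")
try:
    integer.parseString("abc")
    message = ""
except pp.ParseException as err:
    message = str(err)
print(message)

# setResultsName labels the matched tokens so they can be looked up by name.
count = integer.setResultsName("count")
print(count.parseString("42")["count"])
```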
		
		-- Paul
		
		
		




	-- 
	
	'There is only one basic human right, and that is to do as you damn
	well please.
	And with it comes the only basic human duty, to take the
	consequences.' 






