[Tutor] Parsing problem

Liam Clarke cyresse at gmail.com
Tue Jul 26 10:22:15 CEST 2005


...Oh my gosh, that is awesome. Thanks so much. I had started playing with 
the positioning of various patterns and using |, but it was getting into the 
early AM, so I stopped. Prematurely, it seems.

I also got a point 0.1 second increase in speed by merging number and 
identifier, as the values will always be treated as strings, and are being 
written by a programme, so there's very little need to error check what's 
being parsed in. And it felt nice to improve it a wee bit myself. : )

Also interesting is that our processors, which aren't overly far apart in 
clock speed, vary so greatly in processing this problem. Maybe Intel is 
better.... *grin*

Much thanks sir. 

Regards, 

Liam Clarke

On 7/26/05, Paul McGuire <paul at alanweberassociates.com> wrote:
> 
> Liam -
> 
> I made some changes and timed them, I think this problem is solvable. (All
> timings are done using your data, on a P4 800MHz laptop.)
> 
> 1. Baseline, the current state, in the parser code you sent me:
> 
> bracketGroup << ( pp.Group( LBRACE + ( pp.empty ^ pp.OneOrMore(assignment) 
> ^
> pp.OneOrMore(identifier) ^ pp.OneOrMore(pp.dblQuotedString) ^
> pp.OneOrMore(number) ^ pp.OneOrMore(bracketGroup) ) + RBRACE ) )
> 
> Time: 02:20.71 (mm:ss.sss)
> 
> Just for general edification, '^' means Or, and it will evaluate all the
> alternatives and choose the longest match (in regexp docs, this is 
> sometimes
> referred to as "greedy" matching); '|' means MatchFirst, and it will only
> evaluate alternatives until it finds a match (which I refer to as "eager"
> matching). In the past, I've had only slight results converting '^' to 
> '|',
> but since this is a recursive expression, evaluating all of the possible
> alternatives can involve quite a bit of processing before selecting the
> longest.
> 
> 2. Convert to '|', replace empty with ZeroOrMore:
> 
> bracketGroup << ( pp.Group( LBRACE + pp.ZeroOrMore( assignment | 
> identifier
> | pp.dblQuotedString | number | bracketGroup ) + RBRACE ) )
> 
> Time: 00:14.57
> 
> This is getting us somewhere! Replaced empty and OneOrMore's with a single
> ZeroOrMore, and changed from '^' to '|'. Since there is no overlap of the
> various alternatives *in their current order*, it is safe to use '|'. 
> (This
> would not be the case if assignment came after identifier - this should be 
> a
> hint on how to resolve the 'empty' problem.) One problem with this
> expression is that it will permit mixed bracket groups, such as { "A" 10
> b=1000 {} }.
> 
> 3. Go back to baseline, change '^' to '|', *and put empty at the end*
> 
> bracketGroup << ( pp.Group( LBRACE + ( pp.OneOrMore(assignment) |
> pp.OneOrMore(identifier) | pp.OneOrMore(pp.dblQuotedString) |
> pp.OneOrMore(number) | pp.OneOrMore(bracketGroup) | pp.empty ) + RBRACE ) 
> )
> 
> Time: 00:12.04
> 
> Best solution yet! This is faster than #2, since, once a match is made on
> the first term within a bracketGroup, all others in the group are expected
> to be the same type. Since '|' means "take first match", we resolve 
> empty's
> "accept anything" behavior simply by putting it at the end of the list.
> 
> 4. Make change in #3, also convert '^' to '|' in RHS.
> 
> RHS << ( pp.dblQuotedString | identifier | number | bracketGroup )
> 
> Time: 00:01.15
> 
> Voila! I'm happy to say, this is the first time I've seen a 100X
> improvement, mostly by replacing '^' by '|'. While this is not *always*
> possible (see the CORBA IDL parser in the examples directory), it is worth
> the effort, especially with a recursive expression.
> 
> The one item to be wary of when using '|' is when expressions mask each
> other. The easiest example is when trying to parse numbers, which may be
> integers or reals. If I write the expression as (assuming that integers
> will match a sequence of digits, and reals will match digits with a 
> decimal
> point and some more digits):
> 
> number = (integer | real)
> 
> I will never match a real number! The integer expression "masks" the real,
> and since it occurs first, it will match first. The two solutions are:
> 
> number = (integer ^ real)
> Or
> number = (real | integer)
> 
> That is, use an Or, which will match the longest, or reorder the 
> MatchFirst
> to put the most restrictive expression first.
> 
> Welcome to pyparsing, please let me know how your project goes!
> 
> -- Paul
> 
> 
> -----Original Message-----
> From: Liam Clarke [mailto:cyresse at gmail.com]
> Sent: Monday, July 25, 2005 8:31 AM
> To: Paul McGuire
> Subject: Re: [Tutor] Parsing problem
> 
> Hi Paul,
> 
> I've attached the latest version. It includes my sample data within the
> file. The sample data came in at 8 minutes 32 seconds without Pysco, 5
> minutes 25 with, on a 650MHz Athlon.
> 
> I was pondering whether breaking the test data down into separate bits via
> some preprocessing and feeding the simpler data structures in would help 
> at
> all.
> 
> Unfortunately, as I'm using pp.empty to deal with empty bracket sets 
> (which
> were causing my 'expected "}" ' exceptions), using | matches to pp.empty
> first.
> 
> I'm not sure how to get around the empty brackets without using that.
> 
> I also get the feeling that pyparsing was more designed for making parsing
> small complex expressions easy, as opposed to my data churning. That said, 
> I
> can think of about ten different projects I'd played with before giving up
> because of a problem that pyparsing handles elegantly.
> Usually it was regexes that got me. So if I have to attack this another 
> way,
> at least I know the basics of a good module now. :)
> 
> Much thanks for your time and energy, having read your listing on the c2
> wiki (I searched for pyparsing on the offchance) I can see you're a busy
> man, and I'm grateful for your efforts to help me.
> 
> Regards,
> 
> Liam Clarke
> 
> 
> On 7/26/05, Paul McGuire <paul at alanweberassociates.com> wrote:
> 
> Liam-
> 
> Please send me your latest version of the grammar, and I will post
> suggestions on the Tutor list.
> 
> -- Paul
> 
> -----Original Message-----
> From: Liam Clarke [mailto: cyresse at gmail.com
> <mailto:cyresse at gmail.com> ]
> Sent: Monday, July 25, 2005 7:38 AM
> To: Paul McGuire
> Cc: tutor at python.org
> Subject: Re: [Tutor] Parsing problem
> 
> Hi Paul,
> 
> Well various tweaks and such done, it parses perfectly, so much
> thanks, I
> think I now have a rough understanding of the basics of pyparsing.
> 
> Now, onto the fun part of optimising it. At the moment, I'm looking
> at 2 - 5
> minutes to parse a 2000 line country section, and that's with psyco.
> Only
> problem is, I have 157 country sections...
> 
> I am running a 650 MHz processor, so that isn't helping either. I
> read this
> quote on http://pyparsing.sourceforge.net .
> 
> "Thanks again for your help and thanks for writing pyparser! It
> seems my
> code needed to be optimized and now I am able to parse a 200mb file
> in 3
> seconds. Now I can stick my tongue out at the Perl guys ;)"
> 
> I'm jealous, 200mb in 3 seconds, my file's only 4mb.
> 
> Are there any general approaches to optimisation that work well?
> 
> My current thinking is to use string methods to split the string
> into each
> component section, and then parse each section to a bare minimum
> key, value.
> ie - instead of parsing
> 
> x = { foo = { bar = 10 bob = 20 } type = { z = { } y = { } }}
> 
> out fully, just parse to "x":"{ foo = { bar = 10 bob = 20 } type = {
> z = { }
> y = { } }}"
> 
> I'm thinking that would avoid the complicated nested structure I
> have now,
> and I could parse data out of the string as needed, if needed at
> all.
> 
> Erk, I don't know, I've never had to optimise anything.
> 
> Much thanks for creating pyparsing, and doubly thank-you for your
> assistance
> in learning how to use it.
> 
> Regards,
> 
> Liam Clarke
> 
> On 7/25/05, Liam Clarke <cyresse at gmail.com> wrote:
> 
> Hi Paul,
> 
> My apologies, as I was jumping into my car after sending
> that email,
> it clicked in my brain.
> "Oh yeah... initial & body..."
> 
> But good to know about how to accept valid numbers.
> 
> Sorry, getting a bit too quick to fire off emails here.
> 
> Regards,
> 
> Liam Clarke
> 
> 
> On 7/25/05, Paul McGuire < paul at alanweberassociates.com
> <mailto: paul at alanweberassociates.com
> <mailto:paul at alanweberassociates.com> > > wrote:
> 
> 
> Liam -
> 
> The two arguments to Word work this way:
> - the first argument lists valid *initial*
> characters
> - the second argument lists valid *body* or
> subsequent
> characters
> 
> For example, in the identifier definition,
> 
> identifier = pp.Word(pp.alphas, pp.alphanums +
> "_/:.")
> 
> identifiers *must* start with an alphabetic
> character, and
> then may be
> followed by 0 or more alphanumeric or _/: or .
> characters.
> If only one
> argument is supplied, then the same string of
> characters is
> used as both
> initial and body. Identifiers are very typical for
> 2
> argument Word's, as
> they often start with alphas, but then accept digits
> and
> other punctuation.
> No whitespace is permitted within a Word. The Word
> matching
> will end when a
> non-body character is seen.
> 
> Using this definition:
> 
> integer = pp.Word(pp.nums+"-+.", pp.nums)
> 
> It will accept "+123", "-345", "678", and ".901".
> But in a
> real number, a
> period may occur anywhere in the number, not just as
> the
> initial character,
> as in "3.14159". So your bodyCharacters must also
> include a
> ".", as in:
> 
> integer = pp.Word(pp.nums+"-+.", pp.nums+".")
> 
> Let me say, though, that this is a very permissive
> definition of integer -
> for one thing, we really should rename it something
> like
> "number", since it
> now accepts non-integers as well! But also, there
> is no
> restriction on the
> frequency of body characters. This definition would
> accept
> a "number" that
> looks like "3.4.3234.111.123.3234". If you are
> certain that
> you will only
> receive valid inputs, then this simple definition
> will be
> fine. But if you
> will have to handle and reject erroneous inputs,
> then you
> might do better
> with a number definition like:
> 
> number = Combine( Word( "+-"+nums, nums ) +
> Optional( point + Optional( Word(
> nums ) )
> ) )
> 
> This will handle "+123", "-345", "678", and "0.901",
> but not
> ".901". If you
> want to accept numbers that begin with "."s, then
> you'll
> need to tweak this
> a bit further.
> 
> One last thing: you may want to start using
> setName() on
> some of your
> expressions, as in:
> 
> number = Combine( Word( "+-"+nums, nums ) +
> Optional( point + Optional( Word(
> nums ) )
> )
> ).setName("number")
> 
> Note, this is *not* the same as setResultsName.
> Here
> setName is attaching a
> name to this pattern, so that when it appears in an
> exception, the name will
> be used instead of an encoded pattern string (such
> as
> W:012345...). No need
> to do this for Literals, the literal string is used
> when it
> appears in an
> exception.
> 
> -- Paul
> 
> 
> 
> 
> 
> 
> 
> --
> 
> 'There is only one basic human right, and that is to do as
> you damn
> well please.
> And with it comes the only basic human duty, to take the
> consequences.'
> 
> 
> 
> 
> --
> 'There is only one basic human right, and that is to do as you damn
> well
> please.
> And with it comes the only basic human duty, to take the
> consequences.'
> 
> 
> 
> 
> 
> 
> --
> 'There is only one basic human right, and that is to do as you damn well
> please.
> And with it comes the only basic human duty, to take the consequences.'
> 
> 


-- 
'There is only one basic human right, and that is to do as you damn well 
please.
And with it comes the only basic human duty, to take the consequences.'
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.python.org/pipermail/tutor/attachments/20050726/7f511586/attachment-0001.htm


More information about the Tutor mailing list