[Tutor] Parsing problem
Paul McGuire
paul at alanweberassociates.com
Mon Jul 25 17:16:09 CEST 2005
Liam -
I made some changes and timed them, I think this problem is solvable. (All
timings are done using your data, on a P4 800MHz laptop.)
1. Baseline, the current state, in the parser code you sent me:
bracketGroup << ( pp.Group( LBRACE + ( pp.empty ^ pp.OneOrMore(assignment) ^
pp.OneOrMore(identifier) ^ pp.OneOrMore(pp.dblQuotedString) ^
pp.OneOrMore(number) ^ pp.OneOrMore(bracketGroup) ) + RBRACE ) )
Time: 02:20.71 (mm:ss.sss)
Just for general edification, '^' means Or, and it will evaluate all the
alternatives and choose the longest match (in regexp docs, this is sometimes
referred to as "greedy" matching); '|' means MatchFirst, and it will only
evaluate alternatives until it finds a match (which I refer to as "eager"
matching). In the past, I've had only slight results converting '^' to '|',
but since this is a recursive expression, evaluating all of the possible
alternatives can involve quite a bit of processing before selecting the
longest.
2. Convert to '|', replace empty with ZeroOrMore:
bracketGroup << ( pp.Group( LBRACE + pp.ZeroOrMore( assignment | identifier
| pp.dblQuotedString | number | bracketGroup ) + RBRACE ) )
Time: 00:14.57
This is getting us somewhere! Replaced empty and OneOrMore's with a single
ZeroOrMore, and changed from '^' to '|'. Since there is no overlap of the
various alternatives *in their current order*, it is safe to use '|'. (This
would not be the case if assignment came after identifier - this should be a
hint on how to resolve the 'empty' problem.) One problem with this
expression is that it will permit mixed bracket groups, such as { "A" 10
b=1000 {} }.
3. Go back to baseline, change '^' to '|', *and put empty at the end*
bracketGroup << ( pp.Group( LBRACE + ( pp.OneOrMore(assignment) |
pp.OneOrMore(identifier) | pp.OneOrMore(pp.dblQuotedString) |
pp.OneOrMore(number) | pp.OneOrMore(bracketGroup) | pp.empty ) + RBRACE ) )
Time: 00:12.04
Best solution yet! This is faster than #2, since, once a match is made on
the first term within a bracketGroup, all others in the group are expected
to be the same type. Since '|' means "take first match", we resolve empty's
"accept anything" behavior simply by putting it at the end of the list.
4. Make change in #3, also convert '^' to '|' in RHS.
RHS << ( pp.dblQuotedString | identifier | number | bracketGroup )
Time: 00:01.15
Voila! I'm happy to say, this is the first time I've seen a 100X
improvement, mostly by replacing '^' by '|'. While this is not *always*
possible (see the CORBA IDL parser in the examples directory), it is worth
the effort, especially with a recursive expression.
The one item to be wary of when using '|' is when expressions mask each
other. The easiest example is when trying to parse numbers, which may be
integers or reals. If I write the expression as (assuming that integers
will match a sequence of digits, and reals will match digits with a decimal
point and some more digits):
number = (integer | real)
I will never match a real number! The integer expression "masks" the real,
and since it occurs first, it will match first. The two solutions are:
number = (integer ^ real)
Or
number = (real | integer)
That is, use an Or, which will match the longest, or reorder the MatchFirst
to put the most restrictive expression first.
Welcome to pyparsing, please let me know how your project goes!
-- Paul
-----Original Message-----
From: Liam Clarke [mailto:cyresse at gmail.com]
Sent: Monday, July 25, 2005 8:31 AM
To: Paul McGuire
Subject: Re: [Tutor] Parsing problem
Hi Paul,
I've attached the latest version. It includes my sample data within the
file. The sample data came in at 8 minutes 32 seconds without Pysco, 5
minutes 25 with, on a 650MHz Athlon.
I was pondering whether breaking the test data down into separate bits via
some preprocessing and feeding the simpler data structures in would help at
all.
Unfortunately, as I'm using pp.empty to deal with empty bracket sets (which
were causing my 'expected "}" ' exceptions), using | matches to pp.empty
first.
I'm not sure how to get around the empty brackets without using that.
I also get the feeling that pyparsing was more designed for making parsing
small complex expressions easy, as opposed to my data churning. That said, I
can think of about ten different projects I'd played with before giving up
because of a problem that pyparsing handles elegantly.
Usually it was regexes that got me. So if I have to attack this another way,
at least I know the basics of a good module now. :)
Much thanks for your time and energy, having read your listing on the c2
wiki (I searched for pyparsing on the offchance) I can see you're a busy
man, and I'm grateful for your efforts to help me.
Regards,
Liam Clarke
On 7/26/05, Paul McGuire <paul at alanweberassociates.com> wrote:
Liam-
Please send me your latest version of the grammar, and I will post
suggestions on the Tutor list.
-- Paul
-----Original Message-----
From: Liam Clarke [mailto: cyresse at gmail.com
<mailto:cyresse at gmail.com> ]
Sent: Monday, July 25, 2005 7:38 AM
To: Paul McGuire
Cc: tutor at python.org
Subject: Re: [Tutor] Parsing problem
Hi Paul,
Well various tweaks and such done, it parses perfectly, so much
thanks, I
think I now have a rough understanding of the basics of pyparsing.
Now, onto the fun part of optimising it. At the moment, I'm looking
at 2 - 5
minutes to parse a 2000 line country section, and that's with psyco.
Only
problem is, I have 157 country sections...
I am running a 650 MHz processor, so that isn't helping either. I
read this
quote on http://pyparsing.sourceforge.net .
"Thanks again for your help and thanks for writing pyparser! It
seems my
code needed to be optimized and now I am able to parse a 200mb file
in 3
seconds. Now I can stick my tongue out at the Perl guys ;)"
I'm jealous, 200mb in 3 seconds, my file's only 4mb.
Are there any general approaches to optimisation that work well?
My current thinking is to use string methods to split the string
into each
component section, and then parse each section to a bare minimum
key, value.
ie - instead of parsing
x = { foo = { bar = 10 bob = 20 } type = { z = { } y = { } }}
out fully, just parse to "x":"{ foo = { bar = 10 bob = 20 } type = {
z = { }
y = { } }}"
I'm thinking that would avoid the complicated nested structure I
have now,
and I could parse data out of the string as needed, if needed at
all.
Erk, I don't know, I've never had to optimise anything.
Much thanks for creating pyparsing, and doubly thank-you for your
assistance
in learning how to use it.
Regards,
Liam Clarke
On 7/25/05, Liam Clarke <cyresse at gmail.com> wrote:
Hi Paul,
My apologies, as I was jumping into my car after sending
that email,
it clicked in my brain.
"Oh yeah... initial & body..."
But good to know about how to accept valid numbers.
Sorry, getting a bit too quick to fire off emails here.
Regards,
Liam Clarke
On 7/25/05, Paul McGuire < paul at alanweberassociates.com
<mailto: paul at alanweberassociates.com
<mailto:paul at alanweberassociates.com> > > wrote:
Liam -
The two arguments to Word work this way:
- the first argument lists valid *initial*
characters
- the second argument lists valid *body* or
subsequent
characters
For example, in the identifier definition,
identifier = pp.Word(pp.alphas, pp.alphanums +
"_/:.")
identifiers *must* start with an alphabetic
character, and
then may be
followed by 0 or more alphanumeric or _/: or .
characters.
If only one
argument is supplied, then the same string of
characters is
used as both
initial and body. Identifiers are very typical for
2
argument Word's, as
they often start with alphas, but then accept digits
and
other punctuation.
No whitespace is permitted within a Word. The Word
matching
will end when a
non-body character is seen.
Using this definition:
integer = pp.Word(pp.nums+"-+.", pp.nums)
It will accept "+123", "-345", "678", and ".901".
But in a
real number, a
period may occur anywhere in the number, not just as
the
initial character,
as in "3.14159". So your bodyCharacters must also
include a
".", as in:
integer = pp.Word(pp.nums+"-+.", pp.nums+".")
Let me say, though, that this is a very permissive
definition of integer -
for one thing, we really should rename it something
like
"number", since it
now accepts non-integers as well! But also, there
is no
restriction on the
frequency of body characters. This definition would
accept
a "number" that
looks like "3.4.3234.111.123.3234". If you are
certain that
you will only
receive valid inputs, then this simple definition
will be
fine. But if you
will have to handle and reject erroneous inputs,
then you
might do better
with a number definition like:
number = Combine( Word( "+-"+nums, nums ) +
Optional( point + Optional( Word(
nums ) )
) )
This will handle "+123", "-345", "678", and "0.901",
but not
".901". If you
want to accept numbers that begin with "."s, then
you'll
need to tweak this
a bit further.
One last thing: you may want to start using
setName() on
some of your
expressions, as in:
number = Combine( Word( "+-"+nums, nums ) +
Optional( point + Optional( Word(
nums ) )
)
).setName("number")
Note, this is *not* the same as setResultsName.
Here
setName is attaching a
name to this pattern, so that when it appears in an
exception, the name will
be used instead of an encoded pattern string (such
as
W:012345...). No need
to do this for Literals, the literal string is used
when it
appears in an
exception.
-- Paul
--
'There is only one basic human right, and that is to do as
you damn
well please.
And with it comes the only basic human duty, to take the
consequences.'
--
'There is only one basic human right, and that is to do as you damn
well
please.
And with it comes the only basic human duty, to take the
consequences.'
--
'There is only one basic human right, and that is to do as you damn well
please.
And with it comes the only basic human duty, to take the consequences.'
More information about the Tutor
mailing list