...Oh my gosh, that is awesome. Thanks so much. I had started playing
with the positioning of various patterns and using |, but it was
getting into the early AM, so I stopped. Prematurely, it seems.<br>
<br>
I also got a point 0.1 second increase in speed by merging number and
identifier, as the values will always be treated as strings, and are
being written by a programme, so there's very little need to error
check what's being parsed in. And it felt nice to improve it a wee bit
myself. : )<br>
<br>
Also interesting is that our processors, which aren't overly far apart
in clock speed, vary so greatly in processing this problem. Maybe Intel
is better.... *grin*<br>
<br>
Much thanks sir. <br>
<br>
Regards, <br>
<br>
Liam Clarke<br><br><div><span class="gmail_quote">On 7/26/05, <b class="gmail_sendername">Paul McGuire</b> <<a href="mailto:paul@alanweberassociates.com">paul@alanweberassociates.com</a>> wrote:</span><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">
Liam -<br><br>I made some changes and timed them, I think this problem is solvable. (All<br>timings are done using your data, on a P4 800MHz laptop.)<br><br>1. Baseline, the current state, in the parser code you sent me:
<br><br>bracketGroup << ( pp.Group( LBRACE + ( pp.empty ^ pp.OneOrMore(assignment) ^<br>pp.OneOrMore(identifier) ^ pp.OneOrMore(pp.dblQuotedString) ^<br>pp.OneOrMore(number) ^ pp.OneOrMore(bracketGroup) ) + RBRACE ) )
<br><br>Time: 02:20.71 (mm:ss.sss)<br><br>Just for general edification, '^' means Or, and it will evaluate all the<br>alternatives and choose the longest match (in regexp docs, this is sometimes<br>referred to as "greedy" matching); '|' means MatchFirst, and it will only
<br>evaluate alternatives until it finds a match (which I refer to as "eager"<br>matching). In the past, I've had only slight results converting '^' to '|',<br>but since this is a recursive expression, evaluating all of the possible
<br>alternatives can involve quite a bit of processing before selecting the<br>longest.<br><br>2. Convert to '|', replace empty with ZeroOrMore:<br><br>bracketGroup << ( pp.Group( LBRACE + pp.ZeroOrMore( assignment | identifier
<br>| pp.dblQuotedString | number | bracketGroup ) + RBRACE ) )<br><br>Time: 00:14.57<br><br>This is getting us somewhere! Replaced empty and OneOrMore's with a single<br>ZeroOrMore, and changed from '^' to '|'. Since there is no overlap of the
<br>various alternatives *in their current order*, it is safe to use '|'. (This<br>would not be the case if assignment came after identifier - this should be a<br>hint on how to resolve the 'empty' problem.) One problem with this
<br>expression is that it will permit mixed bracket groups, such as { "A" 10<br>b=1000 {} }.<br><br>3. Go back to baseline, change '^' to '|', *and put empty at the end*<br><br>bracketGroup << ( pp.Group( LBRACE + (
pp.OneOrMore(assignment) |<br>pp.OneOrMore(identifier) | pp.OneOrMore(pp.dblQuotedString) |<br>pp.OneOrMore(number) | pp.OneOrMore(bracketGroup) | pp.empty ) + RBRACE ) )<br><br>Time: 00:12.04<br><br>Best solution yet! This is faster than #2, since, once a match is made on
<br>the first term within a bracketGroup, all others in the group are expected<br>to be the same type. Since '|' means "take first match", we resolve empty's<br>"accept anything" behavior simply by putting it at the end of the list.
<br><br>4. Make change in #3, also convert '^' to '|' in RHS.<br><br>RHS << ( pp.dblQuotedString | identifier | number | bracketGroup )<br><br>Time: 00:01.15<br><br>Voila! I'm happy to say, this is the first time I've seen a 100X
<br>improvement, mostly by replacing '^' by '|'. While this is not *always*<br>possible (see the CORBA IDL parser in the examples directory), it is worth<br>the effort, especially with a recursive expression.<br><br>The one item to be wary of when using '|' is when expressions mask each
<br>other. The easiest example is when trying to parse numbers, which may be<br>integers or reals. If I write the expression as (assuming that integers<br>will match a sequence of digits, and reals will match digits with a decimal
<br>point and some more digits):<br><br>number = (integer | real)<br><br>I will never match a real number! The integer expression "masks" the real,<br>and since it occurs first, it will match first. The two solutions are:
<br><br>number = (integer ^ real)<br>Or<br>number = (real | integer)<br><br>That is, use an Or, which will match the longest, or reorder the MatchFirst<br>to put the most restrictive expression first.<br><br>Welcome to pyparsing, please let me know how your project goes!
<br><br>-- Paul<br><br><br>-----Original Message-----<br>From: Liam Clarke [mailto:<a href="mailto:cyresse@gmail.com">cyresse@gmail.com</a>]<br>Sent: Monday, July 25, 2005 8:31 AM<br>To: Paul McGuire<br>Subject: Re: [Tutor] Parsing problem
<br><br>Hi Paul,<br><br>I've attached the latest version. It includes my sample data within the<br>file. The sample data came in at 8 minutes 32 seconds without Pysco, 5<br>minutes 25 with, on a 650MHz Athlon.<br><br>I was pondering whether breaking the test data down into separate bits via
<br>some preprocessing and feeding the simpler data structures in would help at<br>all.<br><br>Unfortunately, as I'm using pp.empty to deal with empty bracket sets (which<br>were causing my 'expected "}" ' exceptions), using | matches to
pp.empty<br>first.<br><br>I'm not sure how to get around the empty brackets without using that.<br><br>I also get the feeling that pyparsing was more designed for making parsing<br>small complex expressions easy, as opposed to my data churning. That said, I
<br>can think of about ten different projects I'd played with before giving up<br>because of a problem that pyparsing handles elegantly.<br>Usually it was regexes that got me. So if I have to attack this another way,<br>at least I know the basics of a good module now. :)
<br><br>Much thanks for your time and energy, having read your listing on the c2<br>wiki (I searched for pyparsing on the offchance) I can see you're a busy<br>man, and I'm grateful for your efforts to help me.<br><br>Regards,
<br><br>Liam Clarke<br><br><br>On 7/26/05, Paul McGuire <<a href="mailto:paul@alanweberassociates.com">paul@alanweberassociates.com</a>> wrote:<br><br> Liam-<br><br> Please send me your latest version of the grammar, and I will post
<br> suggestions on the Tutor list.<br><br> -- Paul<br><br> -----Original Message-----<br> From: Liam Clarke [mailto: <a href="mailto:cyresse@gmail.com">cyresse@gmail.com</a><br><mailto:<a href="mailto:cyresse@gmail.com">
cyresse@gmail.com</a>> ]<br> Sent: Monday, July 25, 2005 7:38 AM<br> To: Paul McGuire<br> Cc: <a href="mailto:tutor@python.org">tutor@python.org</a><br> Subject: Re: [Tutor] Parsing problem<br>
<br> Hi Paul,<br><br> Well various tweaks and such done, it parses perfectly, so much<br>thanks, I<br> think I now have a rough understanding of the basics of pyparsing.<br><br> Now, onto the fun part of optimising it. At the moment, I'm looking
<br>at 2 - 5<br> minutes to parse a 2000 line country section, and that's with psyco.<br>Only<br> problem is, I have 157 country sections...<br><br> I am running a 650 MHz processor, so that isn't helping either. I
<br>read this<br> quote on <a href="http://pyparsing.sourceforge.net">http://pyparsing.sourceforge.net</a> .<br><br> "Thanks again for your help and thanks for writing pyparser! It<br>seems my<br> code needed to be optimized and now I am able to parse a 200mb file
<br>in 3<br> seconds. Now I can stick my tongue out at the Perl guys ;)"<br><br> I'm jealous, 200mb in 3 seconds, my file's only 4mb.<br><br> Are there any general approaches to optimisation that work well?
<br><br> My current thinking is to use string methods to split the string<br>into each<br> component section, and then parse each section to a bare minimum<br>key, value.<br> ie - instead of parsing<br>
<br> x = { foo = { bar = 10 bob = 20 } type = { z = { } y = { } }}<br><br> out fully, just parse to "x":"{ foo = { bar = 10 bob = 20 } type = {<br>z = { }<br> y = { } }}"<br><br> I'm thinking that would avoid the complicated nested structure I
<br>have now,<br> and I could parse data out of the string as needed, if needed at<br>all.<br><br> Erk, I don't know, I've never had to optimise anything.<br><br> Much thanks for creating pyparsing, and doubly thank-you for your
<br>assistance<br> in learning how to use it.<br><br> Regards,<br><br> Liam Clarke<br><br> On 7/25/05, Liam Clarke <<a href="mailto:cyresse@gmail.com">cyresse@gmail.com</a>> wrote:<br><br>
Hi Paul,<br><br> My
apologies, as I was jumping into my car after sending<br>that email,<br> it clicked in my brain.<br> "Oh
yeah... initial & body..."<br><br> But
good to know about how to accept valid numbers.<br><br> Sorry,
getting a bit too quick to fire off emails here.<br><br> Regards,<br><br> Liam Clarke<br><br><br> On
7/25/05, Paul McGuire < <a href="mailto:paul@alanweberassociates.com">paul@alanweberassociates.com</a><br> <mailto: <a href="mailto:paul@alanweberassociates.com">paul@alanweberassociates.com</a><br><mailto:
<a href="mailto:paul@alanweberassociates.com">paul@alanweberassociates.com</a>> > > wrote:<br><br><br> Liam
-<br><br> The
two arguments to Word work this way:<br> -
the first argument lists valid *initial*<br>characters<br> -
the second argument lists valid *body* or<br>subsequent<br> characters<br><br> For
example, in the identifier definition,<br><br> identifier
= pp.Word(pp.alphas, pp.alphanums +<br>"_/:.")<br><br> identifiers
*must* start with an alphabetic<br>character, and<br> then may be<br> followed
by 0 or more alphanumeric or _/: or .<br>characters.<br> If only one<br> argument
is supplied, then the same string of<br>characters is<br> used as both<br> initial
and body. Identifiers are very typical for<br>2<br> argument Word's, as<br> they
often start with alphas, but then accept digits<br>and<br> other punctuation.<br> No
whitespace is permitted within a Word. The Word<br>matching<br> will end when a<br> non-body
character is seen.<br><br> Using
this definition:<br><br> integer
= pp.Word(pp.nums+"-+.", pp.nums)<br><br> It
will accept "+123", "-345", "678", and ".901".<br>But in a<br> real number, a<br> period
may occur anywhere in the number, not just as<br>the<br> initial character,<br> as
in "3.14159". So your bodyCharacters must also<br>include a<br> ".", as in:<br><br> integer
= pp.Word(pp.nums+"-+.", pp.nums+".")<br><br> Let
me say, though, that this is a very permissive<br> definition of integer -<br> for
one thing, we really should rename it something<br>like<br> "number", since it<br> now
accepts non-integers as well! But also, there<br>is no<br> restriction on the<br> frequency
of body characters. This definition would<br>accept<br> a "number" that<br> looks
like "3.4.3234.111.123.3234". If you are<br>certain that<br> you will only<br> receive
valid inputs, then this simple definition<br>will be<br> fine. But if you<br> will
have to handle and reject erroneous inputs,<br>then you<br> might do better<br> with
a number definition like:<br><br> number
= Combine( Word( "+-"+nums, nums ) +<br> Optional(
point + Optional( Word(<br>nums ) )<br> ) )<br><br> This
will handle "+123", "-345", "678", and "0.901",<br>but not<br> ".901". If you<br> want
to accept numbers that begin with "."s, then<br>you'll<br> need to tweak this<br> a
bit further.<br><br> One
last thing: you may want to start using<br>setName() on<br> some of your<br> expressions,
as in:<br><br> number
= Combine( Word( "+-"+nums, nums ) +<br> Optional(
point + Optional( Word(<br>nums ) )<br> )<br> ).setName("number")<br><br> Note,
this is *not* the same as setResultsName.<br>Here<br> setName is attaching a<br> name
to this pattern, so that when it appears in an<br> exception, the name will<br> be
used instead of an encoded pattern string (such<br>as<br> W:012345...). No need<br> to
do this for Literals, the literal string is used<br>when it<br> appears in an<br> exception.<br><br> --
Paul<br><br><br><br><br><br><br><br> --<br><br> 'There
is only one basic human right, and that is to do as<br>you damn<br> well please.<br> And
with it comes the only basic human duty, to take the<br> consequences.'<br><br><br><br><br> --<br> 'There is only one basic human right, and that is to do as you damn<br>well<br> please.<br> And with it comes the only basic human duty, to take the
<br>consequences.'<br><br><br><br><br><br><br>--<br>'There is only one basic human right, and that is to do as you damn well<br>please.<br>And with it comes the only basic human duty, to take the consequences.'<br><br></blockquote>
</div><br><br clear="all"><br>-- <br>'There is only one basic human right, and that is to do as you damn well please.<br>And with it comes the only basic human duty, to take the consequences.'