Hi Paul, <br>
<br>
Well, with various tweaks and such done, it parses perfectly, so many thanks.
I think I now have a rough understanding of the basics of pyparsing. <br>
<br>
Now, on to the fun part: optimising it. At the moment, I'm looking at
2 to 5 minutes to parse a 2000-line country section, and that's with
psyco. The only problem is, I have 157 country sections...<br>
<br>
I am running a 650 MHz processor, so that isn't helping either. I read this quote on <br>
<a href="http://pyparsing.sourceforge.net">http://pyparsing.sourceforge.net</a>.<br>
<br>
<i>"Thanks again for your help and thanks for writing pyparser! It seems my code needed to be optimized
and now I am able to parse a 200mb file in 3 seconds. Now I can stick my tongue out
at the Perl guys ;)"</i><br><br>
I'm jealous: 200 MB in 3 seconds, and my file's only 4 MB.<br>
<br>
Are there any general approaches to optimisation that work well?<br>
<br>
My current thinking is to use string methods to split the string into
its component sections, and then parse each section down to a bare-minimum
key/value pair, i.e. instead of parsing <br>
<br>
x = { foo = { bar = 10 bob = 20 } type = { z = { } y = { } }}<br>
<br>
out fully, just parse to "x":"{ foo = { bar = 10 bob = 20 } type = { z = { } y = { } }}"<br>
<br>
I'm thinking that would avoid the complicated nested structure I have now, and I could parse data <br>
out of the string as needed, if needed at all.<br>
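<br>
Roughly what I have in mind, as a sketch only. The name split_top_level is one I've made up, and it assumes that braces never appear inside quoted strings and that every value is either a single token or a balanced { ... } block. It also normalises whitespace in the stored value rather than keeping it byte for byte. <br>
<br>

```python
def split_top_level(text):
    """Map each top-level key to the raw (unparsed) text of its value.

    Nested braces are only counted, never parsed, so the inner
    structure stays a plain string until it is actually needed.
    Malformed input (missing value, unbalanced braces) raises
    IndexError or ValueError.
    """
    # Pad braces with spaces so that a plain split() isolates them.
    tokens = text.replace("{", " { ").replace("}", " } ").split()
    result = {}
    i = 0
    while i != len(tokens):
        key, equals = tokens[i], tokens[i + 1]
        if equals != "=":
            raise ValueError("expected '=' after %r" % key)
        i += 2
        start = i
        depth = 0
        # Consume one value: a single token, or a balanced brace block.
        while True:
            if tokens[i] == "{":
                depth += 1
            elif tokens[i] == "}":
                depth -= 1
            i += 1
            if depth == 0:
                break
        result[key] = " ".join(tokens[start:i])
    return result


sections = split_top_level(
    "x = { foo = { bar = 10 bob = 20 } type = { z = { } y = { } }}"
)
```

<br>
Each raw value could then be handed to the pyparsing grammar only when it is actually wanted. <br>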
<br>
Erk, I don't know, I've never had to optimise anything. <br>
<br>
Many thanks for creating pyparsing, and doubly thank you for your assistance in learning how to use it. <br>
<br>
Regards, <br>
<br>
Liam Clarke<br><div><span class="gmail_quote">On 7/25/05, <b class="gmail_sendername">Liam Clarke</b> &lt;<a href="mailto:cyresse@gmail.com">cyresse@gmail.com</a>&gt; wrote:</span><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">
Hi Paul, <br>
<br>
My apologies. As I was jumping into my car after sending that email, it clicked in my brain. <br>
"Oh yeah... initial &amp; body..."<br>
<br>
But good to know about how to accept valid numbers.<br>
<br>
Sorry, getting a bit too quick to fire off emails here.<br>
<br>
Regards, <br><span class="sg">
<br>
Liam Clarke<br><br></span><div><span class="q"><span class="gmail_quote">On 7/25/05, <b class="gmail_sendername">Paul McGuire</b> &lt;<a href="mailto:paul@alanweberassociates.com" target="_blank" onclick="return top.js.OpenExtLink(window,event,this)">
paul@alanweberassociates.com</a>&gt; wrote:</span></span><div><span class="e" id="q_1054c5296ffaec77_4"><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">
Liam -<br><br>The two arguments to Word work this way:<br>- the first argument lists valid *initial* characters<br>- the second argument lists valid *body* or subsequent characters<br><br>For example, in the identifier definition,
<br><br>identifier = pp.Word(pp.alphas, pp.alphanums + "_/:.")<br><br>identifiers *must* start with an alphabetic character, and then may be<br>followed by 0 or more alphanumeric or _/: or . characters. If only one
<br>argument is supplied, then the same string of characters is used as both<br>initial and body. Identifiers are very typical of two-argument Words, as<br>they often start with alphas, but then accept digits and other punctuation.
<br>No whitespace is permitted within a Word. The Word matching will end when a<br>non-body character is seen.<br><br>Using this definition:<br><br>integer = pp.Word(pp.nums+"-+.", pp.nums)<br><br>It will accept "+123", "-345", "678", and ".901". But in a real number, a
<br>period may occur anywhere in the number, not just as the initial character,<br>as in "3.14159". So your bodyCharacters must also include a ".", as in:<br><br>integer = pp.Word(pp.nums+"-+.",
pp.nums+".")<br><br>Let me say, though, that this is a very permissive definition of integer -<br>for one thing, we really should rename it something like "number", since it<br>now accepts non-integers as well! But also, there is no restriction on the
<br>frequency of body characters. This definition would accept a "number" that<br>looks like "3.4.3234.111.123.3234". If you are certain that you will only<br>receive valid inputs, then this simple definition will be fine. But if you
<br>will have to handle and reject erroneous inputs, then you might do better<br>with a number definition like:<br><br>number = Combine( Word( "+-"+nums, nums ) +<br> Optional(
point + Optional( Word( nums ) ) ) )<br><br>This will handle "+123", "-345", "678", and "0.901", but not ".901". If you<br>want to accept numbers that begin with "."s, then you'll need to tweak this
<br>a bit further.<br><br>One last thing: you may want to start using setName() on some of your<br>expressions, as in:<br><br>number = Combine( Word( "+-"+nums, nums ) +<br> Optional(
point + Optional( Word( nums ) ) )<br>).setName("number")<br><br>Note, this is *not* the same as setResultsName. Here setName is attaching a<br>name to this pattern, so that when it appears in an exception, the name will
<br>be used instead of an encoded pattern string (such as W:012345...). No need<br>to do this for Literals, the literal string is used when it appears in an<br>exception.<br><br>-- Paul<br><br><br></blockquote></span></div>
</div><br>
<br clear="all"><br>-- <div><span class="e" id="q_1054c5296ffaec77_6"><br>'There is only one basic human right, and that is to do as you damn well please.<br>And with it comes the only basic human duty, to take the consequences.'
</span></div></blockquote></div><br><br clear="all"><br>-- <br>'There is only one basic human right, and that is to do as you damn well please.<br>And with it comes the only basic human duty, to take the consequences.'