[Tutor] Parsing problem

Liam Clarke cyresse at gmail.com
Mon Jul 25 14:38:15 CEST 2005


Hi Paul, 

Well, with various tweaks done, it parses perfectly, so many thanks. I 
think I now have a rough understanding of the basics of pyparsing. 

Now, onto the fun part of optimising it. At the moment, I'm looking at 2 - 5 
minutes to parse a 2000 line country section, and that's with psyco. Only 
problem is, I have 157 country sections...

I am running a 650 MHz processor, so that isn't helping either. I read this 
quote on http://pyparsing.sourceforge.net:

*"Thanks again for your help and thanks for writing pyparser! It seems my 
code needed to be optimized and now I am able to parse a 200mb file in 3 
seconds. Now I can stick my tongue out at the Perl guys ;)"*

I'm jealous: 200mb in 3 seconds, and my file's only 4mb.

Are there any general approaches to optimisation that work well?

My current thinking is to use string methods to split the string into each 
component section, and then parse each section to a bare minimum key, value, 
i.e. instead of parsing 

x = { foo = { bar = 10 bob = 20 } type = { z = { } y = { } }}

out fully, just parse to

"x":"{ foo = { bar = 10 bob = 20 } type = { z = { } y = { } }}"

I'm thinking that would avoid the complicated nested structure I have now, 
and I could parse data out of the string as needed, if needed at all.
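
A minimal sketch of that idea, using plain string scanning rather than 
pyparsing. The function name split_top_level and the sample text are made up 
for illustration, and it assumes every value is either a balanced { ... } 
block or a single token with no whitespace. Quoted strings, comments and 
repeated keys in the real files would need extra handling.

```python
def split_top_level(text):
    """Split 'key = value' pairs at nesting depth 0.

    Each value is kept as raw, unparsed text. Assumes a value is either a
    balanced { ... } block or a single whitespace-free token (hypothetical
    simplification of the real file format).
    """
    pairs = {}
    pos = 0
    n = len(text)
    while True:
        eq = text.find("=", pos)
        if eq == -1:
            break
        key = text[pos:eq].strip()

        # Skip whitespace between '=' and the start of the value.
        start = eq + 1
        while start < n and text[start].isspace():
            start += 1

        if start < n and text[start] == "{":
            # Walk forward, counting braces until the block closes.
            depth = 0
            end = start
            while end < n:
                ch = text[end]
                if ch == "{":
                    depth += 1
                elif ch == "}":
                    depth -= 1
                    if depth == 0:
                        break
                end += 1
            if depth != 0:
                raise ValueError("unbalanced braces after key %r" % key)
            end += 1  # include the closing brace
        else:
            # Bare token: runs up to the next whitespace.
            end = start
            while end < n and not text[end].isspace():
                end += 1

        # Note: a repeated key overwrites the earlier value.
        pairs[key] = text[start:end]
        pos = end
    return pairs


sample = "x = { foo = { bar = 10 bob = 20 } type = { z = { } y = { } }} name = ENG"
sections = split_top_level(sample)

# Parse one level deeper only when a value is actually needed.
inner = split_top_level(sections["x"][1:-1])
```

Here sections["x"] is still the raw "{ ... }" text, and the inner call only 
runs for the sections that are actually wanted. Whether this beats the full 
grammar would have to be measured on the real data.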

Erk, I don't know, I've never had to optimise anything. 

Many thanks for creating pyparsing, and double thanks for your assistance 
in learning how to use it. 

Regards, 

Liam Clarke

On 7/25/05, Liam Clarke <cyresse at gmail.com> wrote:
> 
> Hi Paul, 
> 
> My apologies, as I was jumping into my car after sending that email, it 
> clicked in my brain. 
> "Oh yeah... initial & body..."
> 
> But good to know about how to accept valid numbers.
> 
> Sorry, getting a bit too quick to fire off emails here.
> 
> Regards, 
> 
> Liam Clarke
> 
> On 7/25/05, Paul McGuire <paul at alanweberassociates.com> wrote:
> > 
> > Liam -
> > 
> > The two arguments to Word work this way:
> > - the first argument lists valid *initial* characters
> > - the second argument lists valid *body* or subsequent characters
> > 
> > For example, in the identifier definition, 
> > 
> > identifier = pp.Word(pp.alphas, pp.alphanums + "_/:.")
> > 
> > Identifiers *must* start with an alphabetic character, and then may be
> > followed by 0 or more alphanumeric, "_", "/", ":" or "." characters. If
> > only one argument is supplied, then the same string of characters is used
> > as both initial and body. Identifiers are typical candidates for
> > two-argument Words, as they often start with alphas, but then accept
> > digits and other punctuation. No whitespace is permitted within a Word.
> > The Word matching will end when a non-body character is seen.
> > 
> > Using this definition:
> > 
> > integer = pp.Word(pp.nums+"-+.", pp.nums)
> > 
> > It will accept "+123", "-345", "678", and ".901". But in a real number, a
> > period may occur anywhere in the number, not just as the initial
> > character, as in "3.14159". So your bodyCharacters must also include a
> > ".", as in:
> > 
> > integer = pp.Word(pp.nums+"-+.", pp.nums+".")
> > 
> > Let me say, though, that this is a very permissive definition of integer -
> > for one thing, we really should rename it something like "number", since
> > it now accepts non-integers as well! But also, there is no restriction on
> > the frequency of body characters. This definition would accept a "number"
> > that looks like "3.4.3234.111.123.3234". If you are certain that you will
> > only receive valid inputs, then this simple definition will be fine. But
> > if you will have to handle and reject erroneous inputs, then you might do
> > better with a number definition like:
> > 
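> > # assumes "from pyparsing import *"; point is not defined in this message
> > point = Literal(".")
> > number = Combine( Word( "+-"+nums, nums ) +
> >                   Optional( point + Optional( Word( nums ) ) ) )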
> > 
> > This will handle "+123", "-345", "678", and "0.901", but not ".901". If
> > you want to accept numbers that begin with a ".", then you'll need to
> > tweak this a bit further.
> > 
> > One last thing: you may want to start using setName() on some of your
> > expressions, as in:
> > 
> > number = Combine( Word( "+-"+nums, nums ) +
> >                   Optional( point + Optional( Word( nums ) ) )
> >                 ).setName("number")
> > 
> > Note, this is *not* the same as setResultsName. Here setName is attaching
> > a name to this pattern, so that when it appears in an exception, the name
> > will be used instead of an encoded pattern string (such as W:012345...).
> > No need to do this for Literals; the literal string is used when it
> > appears in an exception.
> > 
> > -- Paul



-- 
'There is only one basic human right, and that is to do as you damn well 
please.
And with it comes the only basic human duty, to take the consequences.'