pyparsing: how to negate a grammar

Paul McGuire ptmcg at austin.rr._bogus_.com
Sun Jan 9 18:38:10 EST 2005


<knguyen at megisto.com> wrote in message
news:1105280829.930207.169770 at c13g2000cwb.googlegroups.com...
> Hi Paul,
>
> I am trying to extract HTTP response codes  from a HTTP page send from
> a web server.  Below is my test program. The program just hangs.
>
> Thanks,
> Khoa
> ##################################################
>
<snip sample program>
>
Khoa -

Thanks for supplying a little more information to go on.  The problem you
are struggling with has to do with pyparsing's handling or non-handling of
whitespace, which I'll admit takes some getting used to.

In general, pyparsing works its way through the input string, matching input
characters against the defined pattern. This gets a little tricky when
dealing with whitespace (which includes '\n' characters).  In particular,
restOfLine will read up to the next '\n', but will not go past it - AND
restOfLine will match an empty string.  So if you have a grammar that
includes repetition, such as OneOrMore(restOfLine), this will read up to the
next '\n', and then keep matching the empty string at that same position
forever.  This is very nearly the case you have in your code,
ZeroOrMore(BodyLine), in which BodyLine is
    BodyLine = Group(nonHTTP + restOfLine)
You need to include something to consume the terminating '\n', which is the
purpose of the LineEnd() class.  Change BodyLine to
    BodyLine = Group(nonHTTP + restOfLine + LineEnd())
and this will break the infinite looping that occurs at the end of the first
body line.  (If you like, use LineEnd().suppress() to keep the '\n' tokens
from being included with your other parsed data.)

Now there is one more problem - another infinite loop at the end of the
string.  By similar reasoning, it is resolved by changing
    nonHTTP = ~Literal("HTTP/1.1")
to
    nonHTTP = ~Literal("HTTP/1.1") + ~StringEnd()

After making those two changes, your program runs to completion on my
system.
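
To make the mechanism concrete, here is a small sketch that uses the standard
re module as a stand-in for pyparsing (it is not pyparsing code, and the
names read_lines and REST_OF_LINE are made up for illustration).  The pattern
plays the role of restOfLine, the consume_newline step plays the role of
LineEnd(), and the end-of-input check plays the role of ~StringEnd().  A step
budget is used so the stalled case terminates instead of hanging.

```python
import re

# Stand-in for restOfLine: stops before '\n' and is happy to match "".
REST_OF_LINE = re.compile(r"[^\n]*")


def read_lines(text, consume_newline, max_steps=20):
    """Repeatedly match 'rest of line'; return (lines, stalled)."""
    pos = 0
    lines = []
    for _ in range(max_steps):
        if pos >= len(text):
            # Analogous to the ~StringEnd() guard: stop at end of input.
            return lines, False
        match = REST_OF_LINE.match(text, pos)
        lines.append(match.group())
        pos = match.end()
        if consume_newline and text.startswith("\n", pos):
            # Analogous to LineEnd(): step past the terminating '\n'.
            pos += 1
    # Step budget used up: the position stopped advancing.
    return lines, True


text = "first\nsecond\n"

lines, stalled = read_lines(text, consume_newline=False)
print(lines[:3], stalled)   # ['first', '', ''] True

print(read_lines(text, consume_newline=True))   # (['first', 'second'], False)
```

Without the newline step, every match after the first is an empty string at
the same offset, which is the hang you were seeing.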

Usually, when someone has some problems with this kind of "line-sensitive"
parsing, I recommend that they consider using pyparsing in a different
manner, or use some other technique.  For instance, you might use
pyparsing's scanString generator to match on the HTTP lines, as in

for toks, start, end in StatusLine.scanString(data):
    print(toks, toks[0].StatusCode, toks[0].ReasonPhrase)
    print(start, end)

which gives
[['HTTP/1.1', '200', ' OK']] 200  OK
0 15
[['HTTP/1.1', '400', ' Bad request']] 400  Bad request
66 90
[['HTTP/1.1', '500', ' Bad request']] 500  Bad request
142 166

If you need the intervening body text, you can use the start and end values
to extract it in slices from the input data string.
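
A rough sketch of that slicing follows.  The helper name body_slices is
hypothetical, and the sample data and spans are hand-built here so the
snippet runs without pyparsing; the only assumption about scanString is that
it yields (tokens, start, end) triples, as in the loop above.

```python
def body_slices(data, matches):
    """Return the text that follows each match, up to the next match.

    matches is any iterable of (tokens, start, end) triples; the tokens
    are ignored here, only the offsets are used.
    """
    bodies = []
    prev_end = None
    for _toks, start, end in matches:
        if prev_end is not None:
            # Text between the previous match and this one.
            bodies.append(data[prev_end:start])
        prev_end = end
    if prev_end is not None:
        # Text after the last match runs to the end of the input.
        bodies.append(data[prev_end:])
    return bodies


# Made-up sample input, not the data from the original program.
data = "HTTP/1.1 200 OK\nbody one\nHTTP/1.1 400 Bad request\nbody two\n"

# Hand-built spans standing in for what scanString would report.
matches = []
for line in ("HTTP/1.1 200 OK", "HTTP/1.1 400 Bad request"):
    start = data.find(line)
    matches.append((None, start, start + len(line)))

print(body_slices(data, matches))   # ['\nbody one\n', '\nbody two\n']
```

With the real grammar, the call would presumably be
body_slices(data, StatusLine.scanString(data)).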

Or, since your data is reasonably well-formed, you could just use readlines,
or data.split('\n'), and find the HTTP lines using startswith().  While this
is a brute force approach, it will certainly run many times faster than
pyparsing.
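
Something along these lines would do for the brute force version.  This is
only a sketch: the function name status_lines and the sample data are
invented for illustration, and it assumes each status line has the form
"HTTP/1.1 <code> <reason>".

```python
def status_lines(data, prefix="HTTP/1.1"):
    """Return (status_code, reason_phrase) pairs for lines starting with prefix."""
    results = []
    for line in data.split("\n"):
        line = line.rstrip("\r")  # tolerate CRLF line endings
        if line.startswith(prefix):
            # Split into version, code, and the rest (the reason phrase).
            parts = line.split(None, 2)
            code = parts[1] if len(parts) > 1 else ""
            reason = parts[2] if len(parts) > 2 else ""
            results.append((code, reason))
    return results


# Made-up sample input, not the data from the original program.
data = "HTTP/1.1 200 OK\nsome body text\nHTTP/1.1 400 Bad request\nmore text\n"
print(status_lines(data))   # [('200', 'OK'), ('400', 'Bad request')]
```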

In any event, best of luck using pyparsing, and write back if you have other
questions.

-- Paul




