[Tutor] Parse Text File

Stefan Lesicnik stefan at lsd.co.za
Thu Jun 11 20:44:27 CEST 2009


> > Hi Denis,
> >
> > Thanks for your input. So I decided I should use a pyparser and try it
> > (I'm a relative Python noob though!)
>

Hi Everyone!

I have made some progress, although I believe it is mainly due to luck and
not a lot of understanding (a vague understanding, maybe).

Hopefully this can help someone else out...


> This is due to Combine(), which glues (back) together matched string bits. To
> work safely, it disables the default separator-skipping behaviour of
> pyparsing. So that
>   real = Combine(integral+fractional)
> would correctly not match "1 .2". Right?
> See a recent reply by Paul McGuire about this topic on the pyparsing list
> http://sourceforge.net/mailarchive/forum.php?thread_name=FE0E2B47198D4F73B01E263034BDCE3C%40AWA2&forum_name=pyparsing-users
> and the pointer he gives there.
> There are several ways to correctly cope with that.
>

^ was a useful link - I still sometimes struggle with whitespace and
Combine / Group...
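
For anyone else who trips over the same thing, here is a minimal sketch of the Combine() behaviour described in the quote above. It assumes pyparsing is installed; the names integral, fractional, loose and matches are only illustrative and are not part of my script.

```python
from pyparsing import Combine, Literal, ParseException, Word, nums

integral = Word(nums)
fractional = Literal(".") + Word(nums)

# Without Combine, whitespace between the pieces is skipped
# and each piece comes back as a separate token.
loose = integral + fractional

# With Combine (default adjacent=True), the pieces must touch
# and are returned glued together as a single string.
real = Combine(integral + fractional)


def matches(parser, s):
    """Return True if parser matches at the start of s (illustrative helper)."""
    try:
        parser.parseString(s)
        return True
    except ParseException:
        return False


print(loose.parseString("1 .2").asList())  # ['1', '.', '2']
print(real.parseString("1.2").asList())    # ['1.2']
print(matches(real, "1 .2"))               # False
```

So the plain sequence happily accepts "1 .2", while the Combine version rejects it, which is the point made in the quote.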


Below is my code that works as I expect (I think...)


#!/usr/bin/python

import sys
from pyparsing import (alphas, nums, ZeroOrMore, Word, Group, Suppress,
                       Combine, Literal, OneOrMore, SkipTo, printables, White)

text='''
[04 Jun 2009] DSA-1812-1 apr-util - several vulnerabilities
        {CVE-2009-0023 CVE-2009-1955 CVE-2009-1243}
        [etch] - apr-util 1.2.7+dfsg-2+etch2
        [lenny] - apr-util 1.2.12+dfsg-8+lenny2
[01 Jun 2009] DSA-1808-1 drupal6 - insufficient input sanitising
        {CVE-2009-1844}
        [lenny] - drupal6 6.6-3lenny2
[01 Jun 2009] DSA-1807-1 cyrus-sasl2 cyrus-sasl2-heimdal - arbitrary code execution
        {CVE-2009-0688}
        [lenny] - cyrus-sasl2-heimdal 2.1.22.dfsg1-23+lenny1
        [lenny] - cyrus-sasl2 2.1.22.dfsg1-23+lenny1
        [etch] - cyrus-sasl2 2.1.22.dfsg1-8+etch1
'''

lsquare = Literal('[')
rsquare = Literal(']')
lbrace = Literal('{')
rbrace = Literal('}')
dash = Literal('-')

space = White('\x20')
newline = White('\n')

spaceapp = White('\x20') + Literal('-') + White('\x20')
spaceseries = White('\t')

date = Combine(lsquare.suppress() + Word(nums, exact=2) + Word(alphas) +
               Word(nums, exact=4) + rsquare.suppress(),
               adjacent=False, joinString='-')
dsa = Combine(Literal('DSA') + dash + Word(nums, exact=4) + dash +
              Word(nums, exact=1))
app = Combine(Word(printables) + SkipTo(spaceapp))
desc = Combine(spaceapp.suppress() + ZeroOrMore(Word(alphas)) +
               SkipTo(newline))
cve = Combine(lbrace.suppress() + OneOrMore(Literal('CVE') + dash +
              Word(nums, exact=4) + dash + Word(nums, exact=4) +
              SkipTo(rbrace) + Suppress(rbrace) + SkipTo(newline)))

series = OneOrMore(Group(lsquare.suppress() +
                         OneOrMore(Literal('lenny') ^ Literal('etch') ^ Literal('sarge')) +
                         rsquare.suppress() + spaceapp.suppress() +
                         Word(printables) + SkipTo(newline)))

record = date + dsa + app + desc + cve + series

def parse(text):
    for data,dataStart,dataEnd in record.scanString(text):
        yield data

for i in parse(text):
    print(i)



My output is as follows:

['04-Jun-2009', 'DSA-1812-1', 'apr-util', 'several vulnerabilities',
'CVE-2009-0023 CVE-2009-1955 CVE-2009-1243', ['etch', 'apr-util',
'1.2.7+dfsg-2+etch2'], ['lenny', 'apr-util', '1.2.12+dfsg-8+lenny2']]
['01-Jun-2009', 'DSA-1808-1', 'drupal6', 'insufficient input sanitising',
'CVE-2009-1844', ['lenny', 'drupal6', '6.6-3lenny2']]
['01-Jun-2009', 'DSA-1807-1', 'cyrus-sasl2 cyrus-sasl2-heimdal',
'arbitrary code execution', 'CVE-2009-0688', ['lenny', 'cyrus-sasl2-heimdal',
'2.1.22.dfsg1-23+lenny1'], ['lenny', 'cyrus-sasl2',
'2.1.22.dfsg1-23+lenny1'], ['etch', 'cyrus-sasl2', '2.1.22.dfsg1-8+etch1']]
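
The dates in that output come back as '04-Jun-2009' rather than '04 Jun 2009' because of the adjacent=False and joinString='-' arguments on the date expression. As a small sketch of just that piece in isolation (assuming pyparsing is installed, and as far as I understand it):

```python
from pyparsing import Combine, Literal, Word, alphas, nums

lsquare = Literal('[')
rsquare = Literal(']')

# adjacent=False allows whitespace to be skipped between the pieces;
# joinString='-' is inserted between the matched tokens when they are
# glued together. The suppressed brackets do not appear in the result.
date = Combine(lsquare.suppress() + Word(nums, exact=2) + Word(alphas) +
               Word(nums, exact=4) + rsquare.suppress(),
               adjacent=False, joinString='-')

print(date.parseString('[04 Jun 2009]').asList())  # ['04-Jun-2009']
```

Testing one sub-expression at a time like this made it much easier for me to see where whitespace was or was not being skipped.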


Thanks to everyone who offered assistance and prodding in the right
direction.

Stefan