how to parse numeric data files

Andrew Dalke adalke at mindspring.com
Tue Apr 29 14:13:15 EDT 2003


george young:
> Currently we have a nasty mess of awk/shell/C/fortran programs that
> extract and process some data from these files.  I have a dream of
> a suite of simple, clear, maintainable python programs to do these tasks.

> Ideally, for each file format(there may be a dozen or so), a clear,
> concise descriptor file (yacc-like language definition? whatever)
> would drive the parsing process for reading that kind of data file,
> providing a nice clear API to the data.

Bioinformatics is another field which has this problem of many
"simple" file formats.  By simple I mean they are easy to parse
with regular expressions, and don't need a context-free grammar of
the kind used by yacc and other parser generators.  But they are complex
because they are very stateful - it's hard to partition the language
up into "tokens" which apply to a "grammar".

The solution most bioinformatics groups have adopted is to hand-write
a parser for each format.  Some build a data structure directly,
while others generate events so that a handler can build the
appropriate data structure.  The event model is nice because if
the event types are standardized, then it's sometimes possible to
reuse one handler for several file formats.
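
As a rough illustration of the event model (a sketch only, not code from
any bioinformatics package), two small hand-written parsers can emit the
same kind of event to one shared handler.  Every name here (DictHandler,
parse_colon_header, parse_equals_header, the field() event) is
hypothetical, and the "name=value" format is invented for the example.

```python
import io


class DictHandler:
    """Handler that collects field(name, value) events into a dict."""

    def __init__(self):
        self.fields = {}

    def field(self, name, value):
        self.fields[name] = value


def parse_colon_header(lines, handler):
    """Emit one field event per 'Name: value' line.

    Stops at the first line that does not contain ': '.
    """
    for line in lines:
        name, sep, value = line.rstrip("\n").partition(": ")
        if not sep:
            break
        handler.field(name, value)


def parse_equals_header(lines, handler):
    """Emit the same field events for a hypothetical 'name=value' format."""
    for line in lines:
        name, sep, value = line.strip().partition("=")
        if sep:
            handler.field(name, value)


colon_handler = DictHandler()
parse_colon_header(
    io.StringIO("PH_lot_id: YES_2_1.25V_1.0V\nDevice_id: 4Mb_dc1\nOperator_id: EEA\n"),
    colon_handler,
)

equals_handler = DictHandler()
parse_equals_header(io.StringIO("Device_id=4Mb_dc1\n"), equals_handler)

print(colon_handler.fields)
print(equals_handler.fields)
```

Because both parsers speak the same event vocabulary, DictHandler does
not need to know which file format the data came from.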

I've taken that one step further and developed a parser generator
called Martel (part of the Biopython project) which takes a possibly
very large regular expression as the language definition, and creates
SAX2 events, so everything can be processed as if it's in XML.  See
  http://www.dalkescientific.com/Martel/
for more information about it.

Martel builds on mxTextTools, a low-level tool that takes a while
to understand but is quite powerful.  Other relevant
tools, like SimpleParse, also build on that engine.

> [Below is a sample of one of the worst formats, shortened from a 40MB
> file!]
> -- George

Martel parses in memory and uses about 2x to 4x as much memory as the
size of the input file.  If most of the file is a large number of blocks in
the same format (that is, optional header + repeats of a common format +
optional footer) then Martel includes some workarounds for that case.

> PH_lot_id: YES_2_1.25V_1.0V
> Device_id: 4Mb_dc1
> Operator_id: EEA

A Martel grammar would look something like

from Martel import *
def Field(name):
  return Str(name + ": ") + ToEol(name)

header = (Field("PH_lot_id") + Field("Device_id") +
          Field("Operator_id") + ... )

> PFD: YES_2_1.25V_1.0V 4Mb_dc1 4M_digit_capture1 VDD=1.25V \
>    VLEAKLOW=0V V_TCSETP=1.0V

Or, if you wanted more detail:

PFD = (Str("PFD: ") + Word("PFD_YES") + Str(" ") + Word("PFD_SPAM") +
       Str(" ") + Word("PFD_SOURCE") + Str(" VDD=") + Float("PFD_VDD") +
       Str("V ") + Str("VLEAKLOW=") + Float("PFD_VLEAKLOW") + ... )
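
For comparison, roughly the same pattern can be written with the
standard library's re module using named groups.  This is only a sketch
of the underlying regular expression, not what Martel generates: the
group names are chosen for illustration, the number pattern is an
assumption about what a float field should accept, and the sample is
assumed to sit on a single line with no trailing-backslash continuation.

```python
import re

# Assumed pattern for a simple decimal number such as "1.25" or "0".
NUMBER = r"[-+]?\d+(?:\.\d+)?"

PFD_RE = re.compile(
    r"PFD: (?P<PFD_YES>\S+) (?P<PFD_SPAM>\S+) (?P<PFD_SOURCE>\S+)"
    r" VDD=(?P<PFD_VDD>" + NUMBER + r")V"
    r" VLEAKLOW=(?P<PFD_VLEAKLOW>" + NUMBER + r")V"
)

# Single-line sample built from the fields in the quoted data.
line = "PFD: YES_2_1.25V_1.0V 4Mb_dc1 4M_digit_capture1 VDD=1.25V VLEAKLOW=0V"

match = PFD_RE.match(line)
fields = match.groupdict() if match else {}

# Convert the numeric fields from text to float.
for key in ("PFD_VDD", "PFD_VLEAKLOW"):
    if key in fields:
        fields[key] = float(fields[key])

print(fields)
```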

The resulting parser is used like an XML (SAX) parser:

expression = header + PFD + ....

parser = expression.make_parser()
parser.setContentHandler( ... )
parser.parse(open("filename"))
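
The content handler is an ordinary SAX handler.  Below is a minimal
sketch of one that collects the text of selected elements; FieldCollector
is a hypothetical name.  To keep the example runnable without Martel, it
is driven here by the standard library's XML parser on a hand-written
document, on the assumption that the element names match the tag names
used in the grammar.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class FieldCollector(ContentHandler):
    """Collect the text content of the named elements into a dict."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = set(wanted)
        self.fields = {}
        self._current = None   # element currently being captured
        self._chunks = []      # text pieces seen inside that element

    def startElement(self, name, attrs):
        if name in self.wanted:
            self._current = name
            self._chunks = []

    def characters(self, content):
        # SAX may deliver one element's text in several pieces.
        if self._current is not None:
            self._chunks.append(content)

    def endElement(self, name):
        if name == self._current:
            self.fields[name] = "".join(self._chunks)
            self._current = None


# Stand-in for the event stream: plain XML with assumed element names.
document = (
    b"<header>"
    b"<PH_lot_id>YES_2_1.25V_1.0V</PH_lot_id>"
    b"<Device_id>4Mb_dc1</Device_id>"
    b"<Operator_id>EEA</Operator_id>"
    b"</header>"
)

collector = FieldCollector(["PH_lot_id", "Device_id"])
xml.sax.parseString(document, collector)
print(collector.fields)
```

A handler written this way does not depend on where the events come
from, so the same class could be passed to setContentHandler().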

                    Andrew
                    dalke at dalkescientific.com





