[TriZPUG] More Fun With Text Processing
cbc at unc.edu
Fri Apr 3 20:08:35 CEST 2009
On 4/3/2009 11:31 AM, Josh Johnson wrote:
> I've got a tabular list, it's the output from a command-line program,
> and I need to parse it into some sort of structure.
> What do you guys think?
Oh, yes. This is my thing. Data management.
And it's a place where Python is really shiny.
BTW, somewhat dated but still awesome text for "free." I bought the dead
tree version years ago and don't regret it:
It may be that you didn't come to the TriZPUG meeting where I did this
same thing you are wanting. It's called parsing *structured text.* Not
the STX kind of structured text. The kind that says: these things of
interest will be embedded within these tokens in this order within a
file. Tokens can be things like blank lines, newlines, or whitespace.
Anyway, here's some poorly maintained code I presented at a meeting.
Poorly maintained because the instrument which collects the files to
parse broke, and my interest along with it:
(Ha! In the middle of writing this email, the German engineered
replacement for the French engineered broken piece of crap arrived on a
big truck. I've been unpacking for an hour and now I'm going to go play
with my new toy.)
But the part you are interested in is here:
which processes files which look like this:
Notice that file is a lot like yours. It has header lines with columns
of data below them. It even has intermixed headers and data within
blocks of header/data groups. And then it has block after block.
OK, looking at the code in trac's subversion viewer, look down around
line 72 at RawData's __init__ method. That splits a multiline string
representing the file (identifier data) up into header/data-grouping
blocks, strips whitespace, and extends self (a list-like object) with
objects called Samples representing each block. Because each block
represents a sample. Your file may or may not have such blocks.
All the heavy lifting there was easily accomplished with the split
method because I have a handy delimiter ('$') between blocks.
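The idea is roughly this (a minimal sketch with made-up data, not the
actual RawData code):

```python
# Split a multiline string on a '$' delimiter between blocks,
# strip whitespace, and keep only the non-empty blocks.
raw_text = """\
header A
1 2 3
$
header B
4 5 6
"""

blocks = [block.strip() for block in raw_text.split("$") if block.strip()]

print(len(blocks))                # 2
print(blocks[0].splitlines()[0])  # 'header A'
```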
So, that __init__ method extends a list like object with items which are
Sample objects. Look at line 122 of Sample's __init__ method. Here we
split a sample block into the header metadata sub-blocks and the main
data sub-block of each sample. We expect three metadata headers and one
data sub-block per sample separated by blank lines. I use the regular
expression module to split those up and add attributes of 'header' and
'body' to the Sample object with each. Those attributes represent Header
and Body objects.
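That blank-line split can be sketched like this (the sample text and
names are made up; the real Sample code lives in the project repository):

```python
import re

# One sample block: metadata sub-blocks and a data sub-block,
# separated by blank lines (illustrative data, not the real file format).
sample = """meta one
meta two

meta three
meta four

altitude temp
100 5.2
200 4.8"""

# split wherever one or more blank lines occur
sub_blocks = re.split(r"\n\s*\n", sample)

# everything but the last piece is header metadata; the last is the body
*header_blocks, body_block = sub_blocks

print(len(header_blocks))          # 2
print(body_block.splitlines()[0])  # 'altitude temp'
```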
Looking at Header's __init__ method on line 158 shows splitting the
header sub-blocks into alternating lines of column names and column
data. A Header is a dictionary-like object. I use zip while splitting
each column name and column data line simultaneously to update the
dictionary with key/value pairs from the names and data.
You have to make sure your column names are unique so they can serve as
dictionary keys.
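The zip trick looks something like this (illustrative column names, not
the real file's):

```python
# A header sub-block: one line of column names over one line of values.
header_block = "site operator date\nRDU smith 2009-04-03"

names_line, values_line = header_block.splitlines()

# zip pairs each name with its value; dict() turns the pairs into a mapping
header = dict(zip(names_line.split(), values_line.split()))

print(header["site"])  # 'RDU'
```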
Look at the Body's __init__ method on line 186. I do almost the same
thing as with Header, except I make a list of dictionaries. The list is
ordered by the altitude of each measurement in the sample. Each
dictionary is a bunch of variable name/measure value pairs.
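A sketch of that list-of-dictionaries shape, again with made-up data:

```python
# First line is variable names; remaining lines are measurements.
# Build one dict per altitude row.
body_block = """\
altitude temp humidity
100 5.2 0.81
200 4.8 0.79
300 4.1 0.75"""

lines = body_block.splitlines()
names = lines[0].split()
rows = [dict(zip(names, line.split())) for line in lines[1:]]

print(len(rows))             # 3
print(rows[0]["altitude"])   # '100'
```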
So that gets everything into a complex data object, RawData, by just
using string methods and regular expressions. I overrode a bunch of
__getitem__ methods to make it easy to access by things like time and
altitude. I also included some methods to assist in making deep copies.
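Overriding __getitem__ for that kind of convenience access might look
like this (hypothetical names; the real RawData methods are in the
repository):

```python
# A list subclass that also accepts a string key: look up a row by its
# 'altitude' value instead of its position.
class Rows(list):
    def __getitem__(self, key):
        if isinstance(key, str):
            for row in self:
                if row["altitude"] == key:
                    return row
            raise KeyError(key)
        # fall back to normal positional indexing
        return list.__getitem__(self, key)

rows = Rows([{"altitude": "100", "temp": "5.2"},
             {"altitude": "200", "temp": "4.8"}])

print(rows[0]["temp"])      # positional access: '5.2'
print(rows["200"]["temp"])  # access by altitude: '4.8'
```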
Just visually look at your file to see where the split points and
patterns are. See what structures exist within structures. Apply the
same approach at each level.
All you are really interested in is parsing your file and getting the
data into Python objects, which you can then throw at things like the
csv module for reports, or in my case, matplotlib for making charts and
graphs.
Look at the comments at the top of the code. A RawData object is just a
bunch of Samples. Each Sample has a Header object full of metadata and a
Body object with a dictionary of data for each altitude. The Python
object I made resembles the file format. I suggest you do that before
applying a bunch of transformations to the format or data.
Get your data into a RawData type of object and then use it to feed the
__init__ methods of other objects which format and transform (like
FormattedData, AdjustedData, and ArrayData in the trunk of the project).
This may help:
Some people have objected to the one hour out of forty we spend on the
re module during PyCamp. But considering how much most people who do
Python use it every day, it's kind of something you need to know pretty well.
The chapter in Dive Into Python covers string matching methods and shows
their limitations before plowing into regular expressions. But you can
do a hell of a lot with string methods before having to submit to
regular expressions. Pick whatever is the path of least resistance for
any particular structure block you are trying to parse out.
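Here's a hedged illustration of that choice with made-up data: the same
line parsed two ways. Plain string methods handle the easy case; re
does the same job and scales better once the delimiters get irregular.

```python
import re

line = "altitude=100 temp=5.2"

# string-method route: split on whitespace, then on '='
fields = dict(part.split("=") for part in line.split())

# regex route: capture name/value pairs in one pass
fields_re = dict(re.findall(r"(\w+)=([\d.]+)", line))

print(fields)  # {'altitude': '100', 'temp': '5.2'}
assert fields == fields_re
```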
office: 332 Chapman Hall phone: (919) 599-3530
mail: Campus Box #3300, UNC-CH, Chapel Hill, NC 27599