[TriZPUG] More Fun With Text Processing

Chris Calloway cbc at unc.edu
Fri Apr 3 20:08:35 CEST 2009


On 4/3/2009 11:31 AM, Josh Johnson wrote:
> I've got a tabular list, it's the output from a command-line program, 
> and I need to parse it into some sort of structure.
> What do you guys think?

Oh, yes. This is my thing. Data management.

And it's a place where Python is really shiny.

BTW, here's a somewhat dated but still awesome text for "free." I 
bought the dead
tree version years ago and don't regret it:

http://gnosis.cx/TPiP/

It may be that you didn't come to the TriZPUG meeting where I did this 
same thing you are wanting to do. It's called parsing *structured 
text.* Not the STX kind of structured text. The kind that says, these 
things of interest will be embedded within these tokens in this order 
within a file. Tokens can be things like blank lines, newlines, or 
whitespace.

Anyway, here's some poorly maintained code I presented at a meeting. 
Poorly maintained because the instrument which collects the files to 
parse broke, and my interest along with it:

http://trac.nccoos.org/dataproc/browser/sodar/trunk

(Ha! In the middle of writing this email, the German engineered 
replacement for the French engineered broken piece of crap arrived on a 
big truck. I've been unpacking for an hour and now I'm going to go play 
with my new toy.)

But the part you are interested in is here:

http://trac.nccoos.org/dataproc/browser/sodar/trunk/sodar/rawData.py?rev=120

which processes files that look like this:

http://whewell.marine.unc.edu/data/nccoos/level0/dukeforest/sodar/store/2007_07/20070704.dat

Notice that file is a lot like yours. It has header lines with columns 
of data below them. It even has intermixed headers and data within 
blocks of header/data groups. And then it has block after block.

OK, looking at the code in trac's subversion viewer, look down around 
line 72 at RawData's __init__ method. That splits a multiline string 
representing the file (the identifier data) up into header/data-grouping 
blocks, strips whitespace, and extends the self list-like object with 
objects called Samples representing each block. Because each block 
represents a sample. Your file may or may not have such blocks.

All the heavy lifting there was easily accomplished with the split 
method because I have a handy delimiter ('$') between blocks.
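If your file has a similar delimiter, the shape of it is something like 
this toy version (not the actual rawData.py code; this Sample is just a 
stub, and the '$' delimiter is specific to the sodar format):

class Sample:
    def __init__(self, block):
        self.block = block  # one header/data grouping, parsed further later

class RawData(list):
    def __init__(self, data):
        # split the whole file on the '$' delimiter, strip whitespace,
        # toss empty blocks, and wrap each survivor in a Sample
        blocks = [b.strip() for b in data.split('$') if b.strip()]
        self.extend(Sample(block) for block in blocks)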

So, that __init__ method extends a list-like object with items which are 
Sample objects. Look at line 122 of Sample's __init__ method. Here we 
split a sample block into the header metadata sub-blocks and the main 
data sub-block of each sample. We expect three metadata headers and one 
data sub-block per sample separated by blank lines. I use the regular 
expression module to split those up and add attributes of 'header' and 
'body' to the Sample object with each. Those attributes represent Header 
and Body objects.
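In miniature, and assuming blank lines are the only separator (the real 
code is fussier about the format), that split looks like:

import re

def split_sample(block):
    # blank lines (maybe with stray whitespace) separate the sub-blocks
    sub_blocks = re.split(r'\n\s*\n', block.strip())
    # three metadata header sub-blocks, then the main data sub-block
    headers, body = sub_blocks[:-1], sub_blocks[-1]
    return headers, body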

Looking at Header's __init__ method on line 158 shows splitting the 
header sub-blocks into alternating lines of column names and column 
data. A Header is a dictionary-like object. I use zip while splitting 
each column name and column data line simultaneously to update the 
dictionary with key/value pairs from the names and data.
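Here's the zip trick boiled down, with made-up column names:

def parse_header(sub_block):
    # alternating lines: column names on one line, their data on the next
    header = {}
    lines = sub_block.splitlines()
    for names, values in zip(lines[::2], lines[1::2]):
        header.update(zip(names.split(), values.split()))
    return header

parse_header('TIME  DATE\n12:00  2007/07/04')
# -> {'TIME': '12:00', 'DATE': '2007/07/04'}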

You have to make sure your column names can be valid dictionary keys.

Look at Body's __init__ method on line 186. I do almost the same thing 
as with Header, except I make a list of dictionaries. The list is 
ordered by altitude of each measurement in the sample. Each dictionary 
is a bunch of variable name/measure value pairs.
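Same idea in miniature, with invented variable names:

def parse_body(sub_block):
    # first line names the variables; each later line is one altitude
    lines = sub_block.splitlines()
    names = lines[0].split()
    return [dict(zip(names, line.split())) for line in lines[1:]]

parse_body('ALT SPEED DIR\n10 3.2 270\n20 4.1 265')
# -> [{'ALT': '10', 'SPEED': '3.2', 'DIR': '270'},
#     {'ALT': '20', 'SPEED': '4.1', 'DIR': '265'}]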

So that gets everything into a complex data object, RawData, by just 
using string methods and regular expressions. I overrode a bunch of 
__getitem__ methods to make it easy to access by things like time and 
altitude. I also included some methods to assist in making deep copies 
my way.
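The __getitem__ part, boiled way down (the real code indexes 
differently, and 'TIME' here is a made-up metadata key; sample.header is 
the dictionary-like Header from above):

class RawData(list):
    def __getitem__(self, key):
        if isinstance(key, str):
            # rawdata['12:00'] looks a sample up by its time metadata
            for sample in self:
                if sample.header.get('TIME') == key:
                    return sample
            raise KeyError(key)
        # plain integers and slices fall through to normal list indexing
        return list.__getitem__(self, key)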

Just visually look at your file to see where the split points and 
patterns are. See what structures exist within structures. Apply the 
appropriate methods.

All you are really interested in is parsing your file and getting the 
data into Python objects, which you can then throw at things like the 
csv module for reports, or in my case, matplotlib for making charts and 
graphs.

Look at the comments at the top of the code. A RawData object is just a 
bunch of Samples. Each Sample has a Header object full of metadata and a 
Body object with a dictionary of data for each altitude. The Python 
object I made resembles the file format. I suggest you do that before 
applying a bunch of transformations to the format or data.

Get your data into a RawData type of object and then use it to feed the 
__init__ methods of other objects which format and transform (like 
FormattedData, AdjustedData, and ArrayData in the trunk of the project).
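Schematically, something like this (FormattedData here is a stand-in, 
not the real class in the trunk; it assumes the header/body attributes 
sketched above):

class FormattedData:
    def __init__(self, rawdata):
        # flatten: merge each sample's header metadata into each of its
        # per-altitude body dictionaries, one flat row per altitude
        self.rows = [dict(sample.header, **level)
                     for sample in rawdata
                     for level in sample.body]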

This may help:

http://diveintopython.org/regular_expressions/index.html

Some people have objected to the one hour out of forty we spend on the 
re module during PyCamp. But considering how much most people who do 
Python use it every day, it's kind of something you need to know pretty well.

The chapter in Dive Into Python covers string matching methods and shows 
their limitations before plowing into regular expressions. But you can 
do a hell of a lot with string methods before having to submit to 
regular expressions. Pick whatever is the path of least resistance for 
any particular structure block you are trying to parse out.
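For instance, both of these pull the value out of a 'name: value' line, 
but the first one never touches the re module:

line = 'station: dukeforest'

value = line.split(':', 1)[1].strip()  # plain string methods

import re
value = re.match(r'\w+:\s*(.*)', line).group(1)  # regex, for harder cases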

-- 
Sincerely,

Chris Calloway
http://www.secoora.org
office: 332 Chapman Hall   phone: (919) 599-3530
mail: Campus Box #3300, UNC-CH, Chapel Hill, NC 27599




