[TriZPUG] More Fun With Text Processing

David Handy david at handysoftware.com
Fri Apr 3 18:44:36 CEST 2009


On Fri, Apr 03, 2009 at 11:31:57AM -0400, Josh Johnson wrote:
> Ok all,
> Since we've got a brain trust of pythonistas that know how to deal with  
> strings, here's a problem I'm facing right now that I'd like some input 
> on:
>
> I've got a tabular list, it's the output from a command-line program,  
> and I need to parse it into some sort of structure.
>
> Here's an example of the data (the headings and column width will vary):
> TARGET         VOLUME GROUP        LENGTH     AVAILABLE         NPE  MIRROR
> 1.1               HIGHAVAIL    5001.023GB    4501.008GB     1192337  2.1
> 1.3                  BACKUP    5001.023GB    4250.759GB     1192337
> 1.4                  BACKUP    3000.613GB    3000.353GB      715402
> 2.2               HIGHAVAIL    5001.023GB    5001.015GB     1192337  1.2
> 2.3                  BACKUP    5001.023GB    5000.763GB     1192337
> 2.4                  BACKUP    3000.613GB    3000.353GB      715402
>
> I'd like a structure I can work with, like say, a list of hashes.
>
> My initial approach involves treating the header row as the guide for  
> the field lengths, and then extracting substrings for each field in each  
> row.

It's a bummer you have to screen-scrape like that. Any chance you can get at
the source for the command-line utility whose output you are parsing, and
access the underlying data yourself directly from Python?

You said that the headings and column widths could vary. Are there just a
handful of different variations? If so, I would study those, and then
hard-code the start and end position of each field in your script. I'd make
it table-driven, like this:

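# UNTESTED, use at own risk

import sys

# (field name, start column, end column); the end column is exclusive
option1_fields = [
    ('TARGET', 0, 14),
    ('VOLUME GROUP', 15, 30),
    # etc
    ]

data = []
# Assumes the command's output arrives on standard input
for lineno, line in enumerate(sys.stdin):
    if lineno == 0:
        continue  # skip the header row
    d = {}
    for fieldname, start, end in option1_fields:
        d[fieldname] = line[start:end].strip()
    data.append(d)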

That gives you your list of dictionaries.
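
If there turn out to be too many variations to hard-code, the field table
could probably be worked out from the output itself. Here is a rough sketch
of that idea. The names find_fields and parse_table are made up for
illustration, and it assumes that neighbouring columns are always separated
by at least one character position that is blank in every line, header
included, and that the output contains no tabs. The sample rows are taken
from your example.

```python
# Sketch only: find_fields and parse_table are hypothetical helper names.
# Assumption: each column is separated from the next by at least one
# character position that is blank in EVERY line, header included.

SAMPLE = """\
TARGET         VOLUME GROUP        LENGTH     AVAILABLE         NPE  MIRROR
1.1               HIGHAVAIL    5001.023GB    4501.008GB     1192337  2.1
1.3                  BACKUP    5001.023GB    4250.759GB     1192337
1.4                  BACKUP    3000.613GB    3000.353GB      715402
"""


def find_fields(lines):
    """Return (start, end) pairs, one per run of columns that is
    non-blank in at least one line. The end column is exclusive."""
    width = max(len(line) for line in lines)
    # Pad short lines so every line can be indexed at every column
    padded = [line.ljust(width) for line in lines]
    fields = []
    start = None
    for col in range(width):
        blank = all(line[col] == ' ' for line in padded)
        if not blank and start is None:
            start = col                    # a field begins here
        elif blank and start is not None:
            fields.append((start, col))    # the field ended just before col
            start = None
    if start is not None:
        fields.append((start, width))      # last field runs to the end
    return fields


def parse_table(text):
    """Return a list of dictionaries keyed by the header names."""
    lines = [line.rstrip() for line in text.splitlines() if line.strip()]
    fields = find_fields(lines)
    header, rows = lines[0], lines[1:]
    names = [header[start:end].strip() for start, end in fields]
    data = []
    for row in rows:
        d = {}
        for name, (start, end) in zip(names, fields):
            # Slicing past the end of a short row just yields ''
            d[name] = row[start:end].strip()
        data.append(d)
    return data


data = parse_table(SAMPLE)
```

The catch is that a heading with a space in it, like VOLUME GROUP, only
stays in one piece if some data row has a character at that position. If
every value in a column happened to have a space in the same place, the
column would be split in two, so the hard-coded table is still the safer
choice when the layouts are known.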

David H

>
> I also thought about just doing a split on spaces, but some of the  
> fields could have spaces in their data.
>
> What do you guys think?
>
> JJ
> _______________________________________________
> TriZPUG mailing list
> TriZPUG at python.org
> http://mail.python.org/mailman/listinfo/trizpug
> http://trizpug.org is the Triangle Zope and Python Users Group

-- 
David Handy
Computer Programming is Fun!
Beginning Computer Programming with Python
http://www.handysoftware.com/cpif/

