Breaking String into Values

Wed Apr 3 12:46:54 EST 2002

Jeff Shannon <jeff at ccvcorp.com> wrote:
> rgwright at ux.cso.uiuc.edu says...
>> I am working on reading in a data file format which is set up as a series
>> of lines that look like this:
>> 
>> 3500035000010104A Foo 45
>> 
>> I want to break it up into variables as follows:
>> a = 35000, b = 35000, c = 10104, d = 'A', e = 'Foo', f = 45
>
> So, I would have something like this:
>
> format = { 'a': (0,5),
>            'b': (5,10),
>            'c': (10,15),
>            ....        }
>
> class Record:
>     def __init__(self, data):
>         for field,zone in format.items():
>             setattr(self, field, data[zone[0]:zone[1]])
>
> If you want to do error checking of most of the fields, I'd 
> define a series of functions that do whatever checking and/or 
> converting you need, and include those in the dictionary:
>
> def ConfirmValidId(text):
>     ....
>
> format = { 'a': (ConfirmValidId, 0, 5),
>            'b': (int, 5, 10),   ... }
>
> Then your setattr line becomes:
>
>         setattr(self, field, zone[0]( data[zone[1]:zone[2]] ) )

Thanks for the help, everyone (this is sort of a group reply). What I am
working on is the Census TIGER/Line database, which is quite large. Each
state is composed of a number of regions (Virginia has 135), and each region
has 17 database tables.

The first table (RT1) looks like this (one 228-character record, wrapped here
onto three lines):
10301 103802945 K  Loch                          Cir   A41        299          1
        298          200002366923669              51516506509382793827          
350003500001010401010410051000 -76314472+37052074 -76312226+37052312
followed by 8000+ similar lines.

The region I am testing on has 5 MB of database tables that have to be read
in. The entire TIGER/Line database is about 1 GB compressed.

I cleaned up my original code (removing all but a few of the temp variables)
into something like this. ToInt converts a string to an integer with error
checking and returns None when the string is blank.

class RT1Line:
    def __init__(self,line):
        TI = ToInt
        # Record Type
        temp = line[0:1].rstrip()
        if not temp.strip():
            raise ParseError('Blank value not allowed')
        self.rt = temp
        # Version Number
        temp = line[1:5]
        if not temp.strip():
            raise ParseError('Blank value not allowed')
        self.version = TI(temp,'version')
        # TIGER/LINE ID, Permanent Record Number
        temp = line[5:15]
        if not temp.strip():
            raise ParseError('Blank value not allowed')
        self.tlid = TI(temp,'tlid')
        # Single Side Source Code
        self.side1 = TI(line[15:16],'side1')
        # Linear Segment Source Code
        self.source = line[16:17].rstrip()
        ...

This read in the RT1 file in 14.0 seconds.
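ToInt and ParseError are used in the class above but not shown. A minimal
sketch of what they might look like, going only by the description (integer
conversion with error checking, None for a blank field); the message text and
the use of the field name in it are assumptions:

```python
class ParseError(Exception):
    """Raised when a field cannot be parsed (name taken from the code above)."""


def ToInt(text, name):
    """Convert a fixed-width field to int; return None if the field is blank.

    'name' is only used to make the error message more useful (assumed).
    """
    text = text.strip()
    if not text:
        return None
    try:
        return int(text)
    except ValueError:
        raise ParseError('Bad integer in field %s: %r' % (name, text))


print(ToInt(' 103802945', 'tlid'))
print(ToInt(' ', 'side1'))
```

Note that the blank check has to be a test on the stripped string; writing
`temp.isalnum` without parentheses names the method instead of calling it, so
the test can never fail.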

Next I abused the struct module to split up the string. This also got rid
of all of the temp variables. The struct.unpack line is horrible, but the file
is generated by another program. The error checking and cleanup code looks
better, though.

class RT1Line:
    def __init__(self,line):
        TI = ToInt
        v = struct.unpack(
            '1s4s10s1s1s2s30s4s2s3s11s11s11s11s1s1s1s1s5s5s5s5s1s1s1s1s'
            '2s2s3s3s5s5s5s5s5s5s6s6s4s4s10s9s10s9s',
            line[:-2])
        # Record Type
        if not v[0].strip():
            raise ParseError('Blank value not allowed')
        self.rt = v[0]
        ...

This read in the RT1 file in 13.1 seconds.
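For anyone trying the struct approach on a current Python 3, here is a rough
sketch of the same idea. It assumes the file is read in binary mode (struct
wants bytes there) and covers only the first five fields shown above; the
'211x' pad simply skips the rest of the 228-byte record in place of the
elided fields. The names RT1_HEAD, to_int and parse_rt1_head are made up for
the example:

```python
import struct

# Widths of the first five fields come from the post; '211x' skips the
# remaining bytes of the 228-byte record (the other fields are not shown).
RT1_HEAD = struct.Struct('1s4s10s1s1s211x')


def to_int(text):
    """Hypothetical stand-in for ToInt: blank -> None, otherwise int."""
    text = text.strip()
    return int(text) if text else None


def parse_rt1_head(record):
    """Unpack one 228-byte record whose line ending has been removed."""
    fields = [raw.decode('ascii') for raw in RT1_HEAD.unpack(record)]
    rt, version, tlid, side1, source = fields
    if not rt.strip():
        raise ValueError('Blank value not allowed')
    return {
        'rt': rt,
        'version': to_int(version),
        'tlid': to_int(tlid),
        'side1': to_int(side1),
        'source': source.rstrip(),
    }


# First 17 characters of the sample RT1 line, padded out to record length.
sample = b'10301 103802945 K'.ljust(228)
print(parse_rt1_head(sample))
```

Compiling the format once with struct.Struct avoids re-parsing the format
string for every line, which may or may not matter for a file this size.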

Finally I tried your method of keeping the format information in a
dict and using setattr. This required that I write a cleanup/error-checking
function for every type of value.

format = { 'rt': (NoBlankLeftAlpha,0,1),
           'version': (NoBlankInt,1,5),
           'tlid': (NoBlankInt,5,15),
           'side1': (Int,15,16),
           ... }

class RT1Line:
    def __init__(self,line):
        for field,zone in format.items():
            setattr(self, field, zone[0]( line[zone[1]:zone[2]] ) )

This read in the RT1 file in 23 seconds.
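A self-contained sketch of the table-driven version, for reference. The
converter names are the ones used above, but their bodies are guesses based
on the names, and only the four fields shown are included. One variation is
assumed here: a list of tuples with prebuilt slice objects (FIELDS, a made-up
name) in place of the dict, which saves two index lookups per field. I have
not measured whether that narrows the gap with struct.unpack.

```python
class ParseError(Exception):
    """Raised when a required field is blank."""


def NoBlankLeftAlpha(text):
    """Left-justified text field that must not be blank (assumed meaning)."""
    text = text.rstrip()
    if not text:
        raise ParseError('Blank value not allowed')
    return text


def Int(text):
    """Integer field; a blank field becomes None (assumed meaning)."""
    text = text.strip()
    return int(text) if text else None


def NoBlankInt(text):
    """Integer field that must not be blank (assumed meaning)."""
    value = Int(text)
    if value is None:
        raise ParseError('Blank value not allowed')
    return value


# (attribute name, converter, column range) for the fields shown in the post.
FIELDS = [
    ('rt', NoBlankLeftAlpha, slice(0, 1)),
    ('version', NoBlankInt, slice(1, 5)),
    ('tlid', NoBlankInt, slice(5, 15)),
    ('side1', Int, slice(15, 16)),
]


class RT1Line:
    def __init__(self, line):
        for name, convert, zone in FIELDS:
            setattr(self, name, convert(line[zone]))


rec = RT1Line('10301 103802945 K')
print(rec.rt, rec.version, rec.tlid, rec.side1)
```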

I think that I will go with the struct.unpack way. It is relatively clean
and since I generate these files with other code the nasty unpack line
is not so bad.

Thanks to everyone who responded.

-- 
Robert Wright
rgwright at uiuc dot edu