ASCII delimited files

Thomas A. Bryan tbryan at python.net
Thu Nov 11 03:03:55 CET 1999


Roger Irwin wrote:
> 
> Is there any function or module available for parsing ASCII delimited files,
> before I go and re-invent the wheel writing my own.

I'm not sure exactly what you're looking for, but I've appended something
that I was playing with one day.  It was meant as an easy way to create an
object that can parse and validate ASCII delimited files.
It might be terribly slow: I never timed it.

Basically, you create a DelimFldParser object from a list of
instances of DelimParserField subclasses and a delimiter.  Each
DelimParserField subclass knows how to handle a specific "column"
of the ASCII file.  The DelimFldParser is then handed a file
object (anything with a readline() method, really), and it
returns a list of lists.  Each inner list holds the values
returned by the DelimParserField objects for one line of the file.
I also assume that every line of the file has the same number
of "columns"; a line with a different count fails an assertion.
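
To make the flow above concrete, here is a minimal sketch of the same idea in current Python, reading from an in-memory file object.  It is not the appended script: the names Field and parse_file are hypothetical stand-ins, and it raises ValueError instead of using assertions.

```python
import io

class Field:
    """Hypothetical column handler: a name, a converter, and a validity check."""
    def __init__(self, name, convert=str, check=None):
        self.name = name
        self.convert = convert
        self.check = check

def parse_file(file_obj, fields, delimiter=None):
    """Return one list of converted values per line of file_obj."""
    rows = []
    # iter(readline, "") keeps calling readline() until it returns "" at EOF.
    for line_no, line in enumerate(iter(file_obj.readline, ""), start=1):
        parts = line.rstrip("\n").split(delimiter)
        if len(parts) != len(fields):
            raise ValueError("line %d: expected %d fields, got %d"
                             % (line_no, len(fields), len(parts)))
        row = []
        for field, raw in zip(fields, parts):
            value = field.convert(raw)
            if field.check is not None and not field.check(value):
                raise ValueError("line %d: bad value %r for %s"
                                 % (line_no, value, field.name))
            row.append(value)
        rows.append(row)
    return rows

fields = [Field("letter"), Field("amount", float, lambda v: 0 <= v <= 10)]
sample = io.StringIO("a 10\nb 3.5\n")   # anything with readline() works
rows = parse_file(sample, fields)
print(rows)  # [['a', 10.0], ['b', 3.5]]
```

With delimiter=None the split is on runs of whitespace, which is also what the appended script does by default.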

I implemented three sample DelimParserField subclasses.  NumericRngField
converts ASCII values to floats and checks that they fall within a given
range.  EnumField checks that the field value is in a specified list of
values.  RegexpField verifies field values against a regular expression.
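
Adding a new kind of column should only take a small subclass that overrides convert() and/or verify().  As a rough sketch, assuming the dates in the file are month/day/year, a date field might look like the following.  DateField and its earliest/latest arguments are hypothetical, and the base class here is just a minimal stand-in so the example runs on its own.

```python
import datetime

class DelimParserField:
    """Minimal stand-in for the base class: pass values through unchecked."""
    def __init__(self, name):
        self.name = name
    def convert(self, value):
        return value
    def verify(self, value):
        pass

class DateField(DelimParserField):
    """Hypothetical field: turns month/day/year text into a datetime.date."""
    def __init__(self, name, earliest, latest):
        DelimParserField.__init__(self, name)
        self.earliest = earliest
        self.latest = latest
    def convert(self, value):
        # strptime raises ValueError if the text is not a valid date.
        return datetime.datetime.strptime(value, "%m/%d/%Y").date()
    def verify(self, value):
        if not (self.earliest <= value <= self.latest):
            raise ValueError("%s is not between %s and %s"
                             % (value, self.earliest, self.latest))

field = DateField("Shipped",
                  datetime.date(1970, 1, 1), datetime.date(1999, 12, 31))
shipped = field.convert("9/10/1999")
field.verify(shipped)
print(shipped)  # 1999-09-10
```

Unlike a regular expression check, this also rejects impossible dates such as 13/45/1999, because the conversion itself fails.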

I wrote this thing to read and verify files before importing them 
into a database.  I never really had much chance to use it, though.
I would love to see someone optimize it, because it makes the
task of building a parser for a new ASCII file format very
simple.  It would be great, for example, for dealing with delimited
data exported from a database or for parsing a delimited file
for import into a database.

---Tom


#!/usr/bin/python 

import re

class DelimFldParser:
    def __init__(self, fields, delimiter=None):
        """fields is an ordered list of DelimParserField instances"""
        self.delimiter = delimiter   # None splits on runs of whitespace
        self.fields = fields
        self.numCols = len(fields)
        self.cols = []
        for el in fields:
            self.cols.append(el.name)
    def parseLine(self, line):
        # Drop the line terminator so it cannot end up in the last field
        # when an explicit delimiter is used.
        values = line.rstrip('\r\n').split(self.delimiter)
        assert len(values) == self.numCols, \
            "Expected %d fields in the following line.\n%s" % (self.numCols, line)
        for idx in range(self.numCols):
            values[idx] = self.fields[idx].convert(values[idx])
            self.fields[idx].verify(values[idx])
        return values
    def parseFile(self, fileObj):
        data = []
        line = fileObj.readline()
        while line:
            data.append(self.parseLine(line))
            line = fileObj.readline()
        return data
    def __str__(self):
        s = '<DelimFldParser: '
        for el in self.fields:
            s = s + el.name + ', '
        s = s[:-2] + ' >'
        return s

class DelimParserField:
    def __init__(self, name):
        self.name = name
    def convert(self,value):
        return value
    def verify(self,value):
        pass

class EnumField(DelimParserField):
    def __init__(self,name,validValues):
        DelimParserField.__init__(self,name)
        self.validValues = validValues[:]
    def verify(self,value):
        assert value in self.validValues, \
            "%s is not one of %s" % (value,self.validValues)

class NumericRngField(DelimParserField):
    def __init__(self,name,start,stop):
        DelimParserField.__init__(self,name)
        self.min = start
        self.max = stop
    def convert(self,value):
        return float(value)
    def verify(self,value):
        assert self.min <= value <= self.max, \
          "%s is not between %s and %s" % (value,self.min,self.max)

class RegexpField(DelimParserField):
    def __init__(self,name,regexp,flags=None):
        DelimParserField.__init__(self,name)
        if flags:
            self.re = re.compile(regexp,flags)
        else:
            self.re = re.compile(regexp)
    def verify(self,value):
        assert self.re.search(value), \
           "%s does not match the pattern '%s'" % (value, self.re.pattern)



if __name__ == '__main__':
    fh = open('delimParser.test','w')
    fh.write("""a 10 9/10/1999
b 3.5 10/11/1974
c 5.7 09/10/1974
""")
    fh.close()
    fh = open('delimParser.test','r')
    myParser = DelimFldParser((EnumField('Enum',('a','b','c')),
                               NumericRngField('Range',0,10),
                               RegexpField('RegExp',r'\d{1,2}/\d{2}/\d{4}')))
    print(myParser)
    output = myParser.parseFile(fh)
    fh.close()
    print(output)
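
One thing to watch with RegexpField: it calls search(), so the pattern only has to match somewhere inside the field, not the whole field.  The short check below, using the date pattern from the test above and a made-up bad value, shows the difference.  It assumes a current Python, where compiled patterns also have fullmatch().

```python
import re

pattern = re.compile(r'\d{1,2}/\d{2}/\d{4}')
bad_value = "x9/10/19999"   # made-up value with junk around a valid date

print(bool(pattern.search(bad_value)))       # True: "9/10/1999" is found inside
print(bool(pattern.fullmatch(bad_value)))    # False: the whole field must match
print(bool(pattern.fullmatch("9/10/1999")))  # True
```

If whole-field matching is what you want, anchoring the pattern itself with ^ and $ has much the same effect without changing the class.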



