Reading by positions plain text files
javivd
javiervandam at gmail.com
Mon Dec 13 18:29:52 EST 2010
On Dec 12, 11:21 pm, Dennis Lee Bieber <wlfr... at ix.netcom.com> wrote:
> On Sun, 12 Dec 2010 07:02:13 -0800 (PST), javivd
> <javiervan... at gmail.com> declaimed the following in
> gmane.comp.python.general:
>
>
>
> > f = open(r'c:\somefile.txt', 'w')
>
> > f.write('0123456789\n0123456789\n0123456789')
>
> Not the most explanatory sample data... It would be better if the
> records had different contents.
>
> > f.close()
>
> > f = open(r'c:\somefile.txt', 'r')
>
> > for line in f:
>
> Here you extract one "line" from the file
>
> >     f.seek(3,0)
> >     print f.read(1) #just to know if it's printing the right column
>
> And here you ignored the entire line you read, seeking to the fourth
> byte from the beginning of the file, and reading just one byte from it.
>
> I have no idea of how seek()/read() behaves relative to line
> iteration in the for loop... Given the small size of the test data set
> it is quite likely that the first "for line in f" resulted in the entire
> file being read into a buffer, and that buffer scanned to find the line
> ending and return the data preceding it; then the buffer position is set
> to after that line ending so the next "for line" continues from that
> point.
>
> But in a situation with a large data set, or an unbuffered I/O
> system, the seek()/read() could easily result in resetting the file
> position used by the "for line", so that the second call returns
> "456789\n"... And all subsequent calls too, resulting in an infinite
> loop.
>
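A minimal sketch of the alternative implied above: take the column from the line the loop has already returned, instead of calling seek()/read() on the file while iterating over it. It is written for Python 3 (the listing further down targets Python 2.5), and the function name column_of_each_line and the in-memory sample data are illustrative assumptions, not something from this thread.

```python
import io


def column_of_each_line(lines, index):
    """Return the character at position `index` of every line that is long enough."""
    result = []
    for line in lines:
        text = line.rstrip("\n")  # drop the line ending so it is never returned as a column
        if index < len(text):
            result.append(text[index])
    return result


# An in-memory stand-in for the file, with distinct records so the result is easy to check.
sample = io.StringIO("0123456789\nabcdefghij\nABCDEFGHIJ")
print(column_of_each_line(sample, 3))  # prints ['3', 'd', 'D']
```

Because the file position is never touched inside the loop, the read-ahead buffering described above cannot interfere with the iteration.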
> Presuming the assignment requires pulling multiple selected fields
> from individual records, where each record is of the same
> format/spacing, AND that the field selection can not be preprogrammed...
>
> Sample data file (use fixed width font to view):
> -=-=-=-=-=-
> Wulfraed       09Ranger  1915
> Bask Euren     13Cleric  1511
> Aethelwulf     07Mage    0908
> Cwiculf        08Mage    1008
> -=-=-=-=-=-
>
> Sample format definition file:
> -=-=-=-=-=-
> Name 0-14
> Level 15-16
> Class 17-24
> THAC0 25-26
> Armor 27-28
> -=-=-=-=-=-
>
> Code to process (Python 2.5, with minimal error handling):
> -=-=-=-=-=-
>
> class Extractor(object):
>     def __init__(self, formatFile):
>         ff = open(formatFile, "r")
>         self._format = {}
>         self._length = 0
>         for line in ff:
>             form = line.split("\t") #file must be tab separated
>             if len(form) != 2:
>                 print "Invalid file format definition: %s" % line
>                 continue
>             name = form[0]
>             columns = form[1].split("-")
>             if len(columns) == 1: #single column definition
>                 start = int(columns[0])
>                 end = start
>             elif len(columns) == 2:
>                 start = int(columns[0])
>                 end = int(columns[1])
>             else:
>                 print "Invalid column definition: %s" % form[1]
>                 continue
>             self._format[name] = (start, end)
>             self._length = max(self._length, end)
>         ff.close()
>
>     def __call__(self, line):
>         data = {}
>         if len(line) < self._length:
>             print "Data line is too short for required format: ignored"
>         else:
>             for (name, (start, end)) in self._format.items():
>                 data[name] = line[start:end+1]
>         return data
>
> if __name__ == "__main__":
>     FORMATFILE = "SampleFormat.tsv"
>     DATAFILE = "SampleData.txt"
>
>     characterExtractor = Extractor(FORMATFILE)
>
>     df = open(DATAFILE, "r")
>     for line in df:
>         fields = characterExtractor(line)
>         for (name, value) in fields.items():
>             print "Field name: '%s'\t\tvalue: '%s'" % (name, value)
>         print
>
>     df.close()
> -=-=-=-=-=-
>
> Output from running above code:
> -=-=-=-=-=-
> Field name: 'Armor' value: '15'
> Field name: 'THAC0' value: '19'
> Field name: 'Level' value: '09'
> Field name: 'Class' value: 'Ranger  '
> Field name: 'Name' value: 'Wulfraed       '
>
> Field name: 'Armor' value: '11'
> Field name: 'THAC0' value: '15'
> Field name: 'Level' value: '13'
> Field name: 'Class' value: 'Cleric  '
> Field name: 'Name' value: 'Bask Euren     '
>
> Field name: 'Armor' value: '08'
> Field name: 'THAC0' value: '09'
> Field name: 'Level' value: '07'
> Field name: 'Class' value: 'Mage    '
> Field name: 'Name' value: 'Aethelwulf     '
>
> Field name: 'Armor' value: '08'
> Field name: 'THAC0' value: '10'
> Field name: 'Level' value: '08'
> Field name: 'Class' value: 'Mage    '
> Field name: 'Name' value: 'Cwiculf        '
> -=-=-=-=-=-
>
> Note that string fields have not been trimmed, and numeric fields
> are still in text format... The format definition file would need to be
> expanded to include a "string", "integer", "float" (and "Boolean"?) code
> in order for the extractor to do proper type conversions.
>
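A rough sketch of the type-conversion idea described above, using plain functions rather than a class. It assumes a hypothetical third tab-separated column in the format definition holding a "string", "integer" or "float" code; that column, the CONVERTERS table and the names parse_format and extract are illustrative and not part of the code in this thread. It is written for Python 3 and has no error handling.

```python
# Hypothetical type codes mapped to the callables that convert the raw text.
CONVERTERS = {
    "string": str.strip,  # trim the padding of fixed-width text fields
    "integer": int,
    "float": float,
}


def parse_format(lines):
    """Parse 'name<TAB>start-end<TAB>type' lines into (name, start, end, convert) tuples."""
    fields = []
    for line in lines:
        name, columns, kind = line.rstrip("\n").split("\t")
        start_text, _, end_text = columns.partition("-")
        start = int(start_text)
        end = int(end_text) if end_text else start  # single-column definition
        fields.append((name, start, end, CONVERTERS[kind]))
    return fields


def extract(line, fields):
    """Slice each field out of a fixed-width line (end column inclusive) and convert it."""
    return dict((name, convert(line[start:end + 1]))
                for name, start, end, convert in fields)


# Same columns as the sample format definition, plus the assumed type code.
FORMAT = [
    "Name\t0-14\tstring",
    "Level\t15-16\tinteger",
    "Class\t17-24\tstring",
    "THAC0\t25-26\tinteger",
    "Armor\t27-28\tinteger",
]

# First sample record, built with ljust() so the column widths are explicit.
line = "Wulfraed".ljust(15) + "09" + "Ranger".ljust(8) + "19" + "15"
record = extract(line, parse_format(FORMAT))
print(record["Name"], record["Level"], record["Class"], record["THAC0"], record["Armor"])
# prints: Wulfraed 9 Ranger 19 15
```

Keeping the converters in a dictionary means a new type code only needs one extra entry, with no change to the extraction loop.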
> --
> Wulfraed Dennis Lee Bieber AF6VN
> wlfr... at ix.netcom.com HTTP://wlfraed.home.netcom.com/
Clearly it's working. Although this code is beyond my Python knowledge
(I don't get along with classes, maybe it's a good moment to learn
about them...), I'll dig into it.
Thanks a lot! It really helps...
J