need fast parser for comma/space delimited numbers
Les Schaffer
godzilla at netmeg.net
Sat Mar 18 09:13:15 EST 2000
I have written an application for reading in large amounts of
space/comma delimited numbers from ASCII text files for statistical
processing.
I originally used regular expressions (re) for splitting, but I was able to cut
the time required for data file parsing down to a third by using
string.split on the comma or space.
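
For readers who want to see the two splitting approaches side by side, here is a minimal sketch. It is written in present-day Python 3 rather than the string module used in this post, and the sample lines, the pattern, and the function names are illustrative assumptions, not the application's actual code.

```python
import re

# Sample lines in the two layouts (space-delimited and comma-delimited).
space_line = "356 0.23514 0.1784"
comma_line = "378, 0.001094949000, 0.000031531040, 0.005158320000"

# Assumed pattern: one or more commas and/or whitespace characters.
DELIMS = re.compile(r"[,\s]+")

def split_with_re(line):
    """Split on commas or whitespace with a regular expression."""
    return [field for field in DELIMS.split(line) if field]

def split_plain(line, sep=None):
    """Split with str.split; sep=None means any run of whitespace."""
    return [field.strip() for field in line.split(sep) if field.strip()]
```

Both functions return the same list of number strings for a given line; which one is faster on a particular data set is something to measure (for example with timeit) rather than assume.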
Still, the app takes about 5 minutes to parse a typical set of data
files. I'd like to drop that down to a minute if possible.
Which means I probably need to wrap in a C module with something like
an sscanf. Or maybe just a function which finds the delimiters and
delivers the number parts of string (defined by delimiters) to atoi
and atof functions.
But before i get started, i imagine someone else has already done
this.
anyone have pointers to said code or suggestions? i'll happily post my
code if there is none out there already.
the two file formats look like this:
700 lines like this times about 100 files:
...
356 0.23514 0.1784
357 0.2206 0.22021
358 0.27676 0.41483
359 0.10083 0.33827
360 0.072568 0.3547
361 0.17443 0.41647
362 0.30491 0.27886
363 0.25666 0.32906
364 0.22523 0.46709
365 0.28276 0.65154
...
181 lines like this times about 100 files:
...
-4 90.43153 99.08258
-3 92.77277 100.00000
-2 93.88273 98.95287
-1 96.51977 98.49538
0 99.23279 98.57191
1 100.00000 97.05283
2 98.52036 93.01269
...
and an occasional file that looks like so:
...
378, 0.001094949000, 0.000031531040, 0.005158320000
379, 0.001231154000, 0.000035215210, 0.005802907000
380, 0.001368000000, 0.000039000000, 0.006450001000
381, 0.001502050000, 0.000042826400, 0.007083216000
382, 0.001642328000, 0.000046914600, 0.007745488000
...
that 0.3547 over on the right: that ain't my fault, that's the test
instrument's formatting of the output strings, which i have zero
control over.
also note the varied number of digits in each number, also outta my
control.
i'd prefer that the first column of the data be parsed as an integer,
but that's not absolutely essential.
les schaffer
here's the core of what i have now. the lines of the ASCII data file
are already in python as a list of strings (strLines passed to
grabArrayData() ).
# methods of the parser class; the module needs `import string`

def __parseIFF(self, str):
    """Grab one int and the rest floats from string array
    str. Return array with first element independent variable and
    rest dependent variables"""
    array = [string.atoi(str[0])]
    for item in str[1:]:
        array.append(string.atof(item))
    return array

def __parseFFF(self, str):
    """Grab one set of floats from string array str. Return array with
    first element independent variable and rest dependent
    variables"""
    return map(string.atof, str)

def __breakStringOnSpace(self, str):
    """break one line str containing numbers in a string format on
    whitespace (or comma), return array with the strings representing the
    numbers."""
    # filter(None, ...) drops the empty strings left by repeated delimiters
    return filter(None, string.splitfields(str, self.splitStr))

def setBreakExp(self, brk):
    """set whether we are using commas instead of white-space for
    splitting"""
    self.splitStr = brk

def grabArrayData(self, strLines):
    """Feed grabArrayData an array of strings containing data,
    strLines.
    grabArrayData returns Numeric arrays for the independent and
    dependent variables."""
    (mPoints, mValues) = self.__createArrays(strLines)
    # self.parse set to either __parseFFF or __parseIFF in
    # __init__
    parse = self.parse
    breakString = self.__breakStringOnSpace
    for i in range(0, len(strLines)):
        num = parse(breakString(strLines[i]))
        mPoints[i] = num[0]
        mValues[0:self.rows, i] = num[1:]
    return mPoints, mValues
here's where things stand at the moment:
186209 function calls in 305.260 CPU seconds

Ordered by: standard name

ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
 86436  105.620    0.001  105.620    0.001  NumberParser.py:161(__parseIFF)
 86632   50.300    0.001   50.300    0.001  NumberParser.py:180(__breakStringOnSpace)
   196    0.450    0.002    0.560    0.003  NumberParser.py:189(__createArrays)
   196  122.010    0.623  278.380    1.420  NumberParser.py:206(grabArrayData)
[snip]
so a little over one-third the time is in performing the for-loop
inside grabArrayData, about 1/3 the time is doing atoi/atof, and 1/6
of the time is breaking strings on white-space or commas.
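
The fractions quoted above can be checked from the tottime column of the profile. This is only the arithmetic, using the numbers as printed; the dictionary keys are labels chosen for this sketch.

```python
# Total CPU seconds and per-function tottime, copied from the profile output.
total = 305.260
tottime = {
    "grabArrayData": 122.010,         # the for-loop itself
    "__parseIFF": 105.620,            # atoi/atof conversions
    "__breakStringOnSpace": 50.300,   # splitting on whitespace or commas
}

# Share of the total run time spent inside each function.
shares = {name: seconds / total for name, seconds in tottime.items()}

for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

The loop comes out at about 40% of the total, the conversions at about 35%, and the splitting at about 16%, which matches the rough thirds and sixth described above.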
my next move: move the __parseXFF and __breakStringOnSpace code inside the
for loop rather than making (expensive???) function calls on each
line. or perhaps doing all the breaking first, then the atoi'ing next.
other suggestions?
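
One way the "break everything first, then convert" idea might look is sketched below. It is written in present-day Python 3 with plain lists instead of Numeric arrays, and the function name split_then_convert and the comma-to-space replacement are assumptions made for illustration. Whether it is actually faster than the per-line method calls has not been measured here.

```python
def split_then_convert(lines):
    """Break all lines into fields first, then convert in a second pass.

    Returns (points, values): the first column as ints and the
    remaining columns of each row as floats.
    """
    # Pass 1: turn commas into spaces so one whitespace split handles
    # both file layouts, and skip lines that are empty after splitting.
    split_rows = [line.replace(",", " ").split() for line in lines]
    split_rows = [row for row in split_rows if row]

    # Pass 2: convert the strings, with no per-line method calls.
    points = [int(row[0]) for row in split_rows]
    values = [[float(field) for field in row[1:]] for row in split_rows]
    return points, values


sample = [
    "356 0.23514 0.1784",
    "-4 90.43153 99.08258",
    "378, 0.001094949000, 0.000031531040, 0.005158320000",
]
points, values = split_then_convert(sample)
```

The design choice is to keep each pass a single comprehension, so the interpreter overhead of calling a Python-level function once per line is avoided.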