need fast parser for comma/space delimited numbers

Les Schaffer godzilla at netmeg.net
Sat Mar 18 09:13:15 EST 2000


I have written an application for reading in large amounts of
space/comma delimited numbers from ASCII text files for statistical
processing. 

I originally used re expressions for splitting, but i was able to cut
the time required for data file parsing down to a third by using
string.split on the comma or space.
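
For illustration, the two splitting approaches might look roughly like
the sketch below. It uses present-day Python string methods rather than
the string module, and the names split_with_re and split_with_str are
made up for this example. Both are meant to return the same tokens for
comma- or whitespace-delimited lines; which one is faster should be
measured on real data.

```python
import re

# One or more commas and/or whitespace characters count as a delimiter.
_DELIMS = re.compile(r"[,\s]+")

def split_with_re(line):
    """Split on runs of commas and/or whitespace using a regular expression."""
    # Leading or trailing delimiters produce empty strings; drop them.
    return [tok for tok in _DELIMS.split(line) if tok]

def split_with_str(line):
    """Turn commas into spaces, then let str.split() handle the whitespace."""
    # split() with no argument collapses runs of whitespace and strips the ends.
    return line.replace(",", " ").split()
```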

Still, the app takes about 5 minutes to parse a typical set of data
files. I'd like to drop that down to a minute if possible.

Which means i probably need to wrap in a C module with something like
an sscanf. Or maybe just a function which finds the delimiters and
delivers the number parts of the string (defined by delimiters) to the
atoi and atof functions.

But before i get started, i imagine someone else has already done
this. 

anyone have pointers to said code or suggestions? i'll happily post my
code if there is none out there already.

the two file formats look like this:

700 lines like this times about 100 files:

...
356	0.23514	0.1784
357	0.2206	0.22021
358	0.27676	0.41483
359	0.10083	0.33827
360	0.072568	0.3547
361	0.17443	0.41647
362	0.30491	0.27886
363	0.25666	0.32906
364	0.22523	0.46709
365	0.28276	0.65154
... 


181 lines like this times about 100 files:
...
    -4     90.43153  99.08258   
    -3     92.77277  100.00000  
    -2     93.88273  98.95287   
    -1     96.51977  98.49538   
    0      99.23279  98.57191   
    1      100.00000 97.05283   
    2      98.52036  93.01269   
...

and an occasional file that looks like so:
...
378,  0.001094949000,  0.000031531040,  0.005158320000
379,  0.001231154000,  0.000035215210,  0.005802907000
380,  0.001368000000,  0.000039000000,  0.006450001000
381,  0.001502050000,  0.000042826400,  0.007083216000
382,  0.001642328000,  0.000046914600,  0.007745488000
...

that 0.3547 over on the right: that ain't my fault, that's the test
instrument's formatting of the output strings, which i have zero
control over.

also note the varying number of digits in the numbers, also outta my
control.

i'd prefer if the first column of the data be parsed as an integer,
but that's not absolutely essential.
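
A minimal sketch of that requirement, assuming every line starts with
an integer column followed by one or more float columns. The name
parse_line is hypothetical and the built-in int() and float() stand in
for atoi and atof:

```python
def parse_line(line):
    """Return (first column as int, remaining columns as a list of floats)."""
    # Treat commas as whitespace so one code path covers all three formats.
    fields = line.replace(",", " ").split()
    return int(fields[0]), [float(f) for f in fields[1:]]

# One sample line from each of the three file formats.
print(parse_line("360\t0.072568\t0.3547"))
print(parse_line("    -4     90.43153  99.08258   "))
print(parse_line("378,  0.001094949000,  0.000031531040,  0.005158320000"))
```

The short 0.3547 and the longer 0.072568 go through the same float()
call, so the varying number of digits needs no special handling.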


les schaffer

here's the core of what i have now. the lines of the ASCII data file
are already in python as a list of strings (strLines passed to
grabArrayData() ).


    def __parseIFF(self, str):

        """Grab one int and the rest floats from string array
        str. Return array with first element independent variable and
        rest dependent variables"""

        array = [string.atoi(str[0])]
        for item in str[1:]:
            array.append(string.atof(item))
        return array

    def __parseFFF(self, str):

        """Grab one set of floats from string array str. Return array with
        first element independent variable and rest dependent
        variables"""

        return map(string.atof, str)

    def __breakStringOnSpace(self, str):

        """break one line str containing numbers in a string format on
        whitespace (or comma), return array with the strings representing the
        numbers."""

        return filter(None, string.splitfields(str, self.splitStr))

    def setBreakExp(self, brk):

        """set whether we are using commas instead of white-space for
        splitting"""
        self.splitStr = brk

    def grabArrayData(self, strLines):

        """Feed grabArrayData an array of strings containing data,
        strLines.

        grabArrayData returns Numeric arrays for the independent and
        dependent variables."""

        (mPoints, mValues) = self.__createArrays(strLines)

        # self.parse is set to either __parseFFF or __parseIFF in
        # __init__
        parse = self.parse
        breakString = self.__breakStringOnSpace

        for i in range(0, len(strLines)):
            num = parse(breakString(strLines[i]))
            mPoints[i] = num[0]
            mValues[0:self.rows, i] = num[1:]

        return mPoints, mValues


here's where things stand at the moment:

         186209 function calls in 305.260 CPU seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    86436  105.620    0.001  105.620    0.001 NumberParser.py:161(__parseIFF)
    86632   50.300    0.001   50.300    0.001 NumberParser.py:180(__breakStringOnSpace)
      196    0.450    0.002    0.560    0.003 NumberParser.py:189(__createArrays)
      196  122.010    0.623  278.380    1.420 NumberParser.py:206(grabArrayData)
[snip]

so a little over one-third the time is in performing the for-loop
inside grabArrayData, about 1/3 the time is doing atoi/atof, and 1/6
of the time is breaking strings on white-space or commas.
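
Those fractions can be checked directly from the tottime column of the
profile above. This is only arithmetic on the reported numbers; the
dictionary layout is just for the example:

```python
total = 305.260  # total CPU seconds reported by the profiler

# tottime: time spent inside each function itself, excluding its callees
tottime = {
    "grabArrayData": 122.010,
    "__parseIFF": 105.620,
    "__breakStringOnSpace": 50.300,
}

# Share of the total run time attributable to each function body.
shares = {name: t / total for name, t in tottime.items()}

for name, share in shares.items():
    print("%-22s %5.1f%%" % (name, 100 * share))
```

That works out to roughly 40% for the loop in grabArrayData, 35% for
the atoi/atof conversion, and 16% for the string splitting.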


my next move: move the __parseXFF and __breakStringOnSpace inside the
for loop rather than making (expensive???) function calls on each
line. or perhaps doing all the breaking first then the atoi'ing next.
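
The second idea (all the breaking first, then all the converting) could
be sketched as below. This is an untimed illustration, not the code
from NumberParser.py: parse_lines is a hypothetical name, plain lists
stand in for the Numeric arrays, and present-day built-ins replace
atoi and atof. Whether it actually beats the per-line loop would have
to be profiled.

```python
def parse_lines(lines):
    """Split every line first, then convert column by column.

    Returns (points, values) where points holds the integer first
    column and values[j][i] is dependent variable j at point i.
    """
    # Pass 1: break every line into string tokens, with no per-line
    # method calls; commas are treated as whitespace.
    rows = [line.replace(",", " ").split() for line in lines]
    rows = [r for r in rows if r]  # skip blank lines

    # Pass 2: convert the first column to int and the rest to float.
    points = [int(r[0]) for r in rows]
    # zip(*...) transposes the rows into one sequence per dependent variable.
    columns = zip(*[r[1:] for r in rows])
    values = [[float(x) for x in col] for col in columns]
    return points, values
```

For example, parse_lines(open(path).readlines()) would take the same
list of strings that grabArrayData() receives, where path is whatever
data file is being read.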

other suggestions?


