Hello,

I have a problem with numarray, specifically with the function numarray.all.

I want to compare two files. To do this, I read the files with a function (readcol, in my module readcol2) that can return them as a Python list or as a numarray array (string or numeric).

I compare every line of one file against every line of the other. If I use the array format and numarray.all, the comparison takes forever for two big files. If I use Python lists, it is very fast. I think there is a problem here, or at least room for improvement. If I understand the goal of numarray correctly, it was written to speed up some parts of Python, but here it slows things down a lot.

A very simple sample that shows the effect is at the bottom of this mail.

Thanks for numarray; I hope I am not bothering you. My comments are only meant to help improve numarray. I have been able to find the problem, so now I can avoid it.

H.

def readcol(fname, comments='%', columns=None, delimiter=None, dep=0, arraytype='list'):
    """
    Load ASCII data from fname into an array and return the array.

    The data must be regular: the same number of values in every row.
    fname must be a filename (the file is opened with file(fname)).

    Input:
      - fname     : the name of the file to read

    Optional input:
      - comments  : the string that starts a comment.
                    The default is the MATLAB comment character '%'.
      - columns   : list or tuple containing the (1-based) columns to use.
      - delimiter : the string that separates the columns.
      - dep       : an integer giving the number of lines to skip at the
                    start of the file (useful to skip description lines).
      - arraytype : a string indicating which kind of result you want:
                    numeric array ('numeric'), character array
                    ('numstring') or list ('list'). The default is 'list'.

    matfile data is not currently supported, but see Nigel Wade's matfile:
    ftp://ion.le.ac.uk/matfile/matfile.tar.gz

    Example usage:

      x,y = transpose(readcol('test.dat'))  # data in two columns
      X = readcol('test.dat')               # a matrix of data
      x = readcol('test.dat')               # a single column of data
      x = readcol('test.dat', '#')          # use '#' as the comment delimiter

    Initial function from pylab (J. Hunter), changed by me for my specific needs.
    """
    from numarray import array, transpose

    fh = file(fname)
    X = []
    numCols = None
    nline = 0
    if columns is None:
        for line in fh:
            nline += 1
            if dep is not None and nline <= dep: continue
            # split() instead of line[:line.find(comments)]: find() returns -1
            # when there is no comment, which would drop the last character
            line = line.split(comments, 1)[0].strip()
            if not len(line): continue
            if arraytype == 'numeric':
                row = [float(val) for val in line.split(delimiter)]
            else:
                row = [val.strip() for val in line.split(delimiter)]
            thisLen = len(row)
            if numCols is not None and thisLen != numCols:
                raise ValueError('All rows must have the same number of columns')
            numCols = thisLen   # remember the width so the check above can fire
            X.append(row)
    else:
        for line in fh:
            nline += 1
            if dep is not None and nline <= dep: continue
            line = line.split(comments, 1)[0].strip()
            if not len(line): continue
            row = line.split(delimiter)
            if arraytype == 'numeric':
                row = [float(row[i-1]) for i in columns]
            elif arraytype == 'numstring':
                row = [row[i-1].strip() for i in columns]
            else:
                row = [row[i-1].strip() for i in columns]
            thisLen = len(row)
            if numCols is not None and thisLen != numCols:
                raise ValueError('All rows must have the same number of columns')
            numCols = thisLen
            X.append(row)

    if arraytype == 'numeric':
        X = array(X)
        r,c = X.shape
        if r==1 or c==1: X.shape = max([r,c]),
    elif arraytype == 'numstring':
        import numarray.strings   # problem if numeric+pylab
        X = numarray.strings.array(X)
        r,c = X.shape
        if r==1 or c==1: X.shape = max([r,c]),
    return X

-------------------------------------------
files_test_creation.py
-------------------------------------------

f1 = file('test1.dat','w')
for i in range(10000):
    f1.write(str(i)+' '+str(i+1)+' '+str(i+2)+'\n')
f1.close()

f2 = file('test2.dat','w')
for i in range(10000):
    f2.write(str(i)+' '+str(i+1)+' '+str(i+2)+'\n')
f2.close()

-------------------------------------------
numarray_pb_sample.py
-------------------------------------------

import numarray
import readcol2

data1 = readcol2.readcol('test1.dat',columns=[1,2,3],comments='#',delimiter=' ',dep=1,arraytype='numstring')
data2 = readcol2.readcol('test2.dat',columns=[1,2,3],comments='#',delimiter=' ',dep=1,arraytype='numstring')

# or in non-string array form (same result)
## data1 = readcol2.readcol('test1.dat',columns=[1,2,3],comments='#',delimiter=' ',dep=1,arraytype='numeric')
## data2 = readcol2.readcol('test2.dat',columns=[1,2,3],comments='#',delimiter=' ',dep=1,arraytype='numeric')

for a_i in range(data1.shape[0]):
    for b_i in range(data2.shape[0]):
        if numarray.all(data1[a_i,:] == data2[b_i,:]):
            print a_i,b_i

-------------------------------------------
python_list_sample.py
-------------------------------------------

import readcol2

data1 = readcol2.readcol('test1.dat',columns=[1,2,3],comments='#',delimiter=' ',dep=1,arraytype='list')
data2 = readcol2.readcol('test2.dat',columns=[1,2,3],comments='#',delimiter=' ',dep=1,arraytype='list')

for a_i in range(len(data1)):
    for b_i in range(len(data2)):
        if data1[a_i] == data2[b_i]:
            print a_i,b_i
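Both samples compare every row of one file against every row of the other, so the number of comparisons grows with the product of the two row counts, whichever container is used. As a rough sketch of a way to avoid the nested loop altogether, the rows of one file can be indexed in a dictionary and looked up directly. This assumes a current Python 3 (not the Python 2 of the samples) and rows that can be turned into tuples; `matching_rows` is a hypothetical helper name, not part of numarray or readcol2.

```python
from collections import defaultdict


def matching_rows(data1, data2):
    """Return (a_i, b_i) index pairs for which data1[a_i] == data2[b_i].

    Hypothetical helper: rows are converted to tuples so they can serve
    as dictionary keys, replacing the nested loop by one pass per input.
    """
    # Map each distinct row of data2 to the list of indices where it occurs.
    index_by_row = defaultdict(list)
    for b_i, row in enumerate(data2):
        index_by_row[tuple(row)].append(b_i)

    # Look up every row of data1; pairs come out in nested-loop order.
    pairs = []
    for a_i, row in enumerate(data1):
        for b_i in index_by_row.get(tuple(row), ()):
            pairs.append((a_i, b_i))
    return pairs


# Same layout as test1.dat / test2.dat, but small enough to check by eye.
data1 = [[str(i), str(i + 1), str(i + 2)] for i in range(5)]
data2 = [[str(i), str(i + 1), str(i + 2)] for i in range(3, 8)]
print(matching_rows(data1, data2))  # → [(3, 0), (4, 1)]
```

This does not address the per-call cost of numarray.all itself; it only reduces how many row comparisons are needed in the first place.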
Hi H,

I did some work on this problem based on your previous post, but apparently my response never made it to numpy-discussion. In a nutshell, I made numarray 12x faster for a benchmark like your numarray_pb_sample.py by speeding up string comparisons and improving all(). The changes are in numarray CVS, but there is no SourceForge release that contains them yet; numarray-1.4.0 is still several weeks away. If you want to try CVS from UNIX/Linux, just do:

    % cvs -d:pserver:anonymous@cvs.sourceforge.net:/cvsroot/numpy login
    % cvs -z3 -d:pserver:anonymous@cvs.sourceforge.net:/cvsroot/numpy co -P numarray

Regards,
Todd
_______________________________________________
Numpy-discussion mailing list
Numpy-discussion@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/numpy-discussion
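To check a speed-up like the one quoted in this thread on one's own machine, each sample loop can be wrapped in a function and timed. The following is only a minimal sketch, assuming a current Python 3 and the standard library; `time_call` and `list_compare` are hypothetical names, and the array-based variant is left out because numarray is not assumed to be installed.

```python
import time


def time_call(func, *args, repeat=3):
    """Return the best wall-clock time in seconds over `repeat` runs of func(*args)."""
    best = float("inf")
    for _ in range(repeat):
        start = time.perf_counter()
        func(*args)
        best = min(best, time.perf_counter() - start)
    return best


def list_compare(data1, data2):
    """Nested-loop comparison on plain lists, as in python_list_sample.py."""
    return [(a_i, b_i)
            for a_i in range(len(data1))
            for b_i in range(len(data2))
            if data1[a_i] == data2[b_i]]


# A reduced data set (300 rows) so the quadratic loop finishes quickly.
rows = [[str(i), str(i + 1), str(i + 2)] for i in range(300)]
elapsed = time_call(list_compare, rows, rows)
print("list comparison: %.4f s" % elapsed)
```

Timing the array-based loop the same way, before and after an upgrade, would give a ratio that can be compared with the figure mentioned above.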
Thank you very much. I had not seen any answer before, which is why I reduced the sample so much :)

I'll try it now.
participants (2)
- Humufr
- Todd Miller