[Numpy-discussion] Help using numPy to create a very large multi dimensional array

Bruno Santos bacmsantos at gmail.com
Tue Apr 17 12:03:29 EDT 2007


I tried the expression you suggested, but I'm not getting the desired
result.
My text file looks like this:

# num rows=115 num columns=2634
AbassiM.txt 0.033023 0.033023 0.033023 0.165115 0.462321....0.000000
AgricoleW.txt 0.038691 0.038691 0.038691 0.232147 0.541676....0.215300
AliR.txt 0.041885 0.041885 0.041885 0.125656 0.586395....0.633580
.....
....
....
ZhangJ.txt 0.047189 0.047189 0.047189 0.155048 0.613452....0.000000

Using the line of code you gave, I don't obtain a matrix with that shape.
Instead I obtain the following: array([], shape=(0, 115), dtype=float64)
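
The empty result is probably because fromfile in text mode stops at the
first token it cannot parse as a number, and here the file starts with a
'#' header line and every row starts with a file name. Below is a minimal
sketch of one way around that, assuming the layout shown above (an optional
'#' header line, then a label followed by whitespace-separated floats on
each row). The function name load_vectors and the sample data are made up
for illustration.

```python
import io
import numpy as np

def load_vectors(lines):
    """Parse 'label v1 v2 ...' rows into (labels, 2-D float array)."""
    labels, rows = [], []
    for line in lines:
        fields = line.split()
        # Skip blank lines and the '# num rows=...' header line.
        if not fields or fields[0].startswith('#'):
            continue
        labels.append(fields[0])
        rows.append([float(x) for x in fields[1:]])
    return labels, np.array(rows, dtype=float)

# Tiny made-up sample in the same layout as the real file.
sample = io.StringIO(
    "# num rows=2 num columns=3\n"
    "AbassiM.txt 0.033023 0.033023 0.165115\n"
    "AliR.txt 0.041885 0.041885 0.125656\n"
)
labels, matrix = load_vectors(sample)
print(matrix.shape)
```

With the real file, passing the open file object in place of sample should
give a 115x2634 array, provided the header line is accurate and the column
count does not include the file name.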

2007/4/13, Charles R Harris <charlesr.harris at gmail.com>:
>
>
>
> On 4/13/07, Bruno Santos <bacmsantos at gmail.com> wrote:
> >
> > Dear Sirs,
> > I'm trying to use NumPy to solve a speed problem with Python. I need to
> > perform agglomerative clustering as a first step to k-means clustering.
> > My problem is that I'm using a very large list in Python and the script
> > is taking more than 9 minutes to process all the information, so I'm trying
> > to use NumPy to create a matrix.
> > I'm reading the vectors from a text file and I end up with an array of
> > 115*2634 float elements. How can I create this structure with NumPy?
> >
> > Here is my code in Python:
> > #Read each document vector into a matrix
> >     matrix = []
> >     for line in vecfile:
> >         fields = line.split()
> >         #Skip the file name in column 0 and convert the rest to floats
> >         matrix.append([float(x) for x in fields[1:]])
> >     vecfile.close()
>
>
> I don't know what your text file looks like or how many elements are in
> each line, but assuming 115 entries/line and spaces, something like the
> following will read in the data:
>
> m = N.fromfile('name of text file', sep=' ').reshape(-1,115)
>
> This assumes you have done import numpy as N and will result in a 2634x115
> array, which isn't very large.
>
> >     #Read the desired number of final clusters
> >     numclust = input('Input the desired number of clusters: ')
> >
> > #Clustering process
> >     clust = rows
> >     ind = [-1, -1]
> >     list_j=[]
> >     list_k=[]
> >     while (clust > numclust):
> >         min = 2147483647
> >         print('Number of Clusters %d \n' % clust)
> >         #Find the 2 most similar vectors in the file
> >         for j in range(0, clust):
> >             list_j=matrix[j]
> >             for k in range(j+1, clust):
> >                 list_k=matrix[k]
> >                 dist=0
> >                 for e in range(0, columns):
> >                     result = list_j[e] - list_k[e]
> >                     dist += result * result
> >                 if (dist < min):
> >                     ind[0] = j
> >                     ind[1] = k
> >                     min = dist
>
> >         #Combine the two most similar vectors by averaging them
> >         for e in range(0, columns):
> >             matrix[ind[0]][e] = (matrix[ind[0]][e] + matrix[ind[1]][e]) / 2.0
> >         clust = clust -1
> >
> >         #Move up all the remaining vectors
> >         for k in range(ind[1], (rows - 1)):
> >             for e in range(0, columns): matrix[k][e]=matrix[k+1][e]
>
>
> This is the slow step, order N^3 in the number of vectors. It can be
> vectorized, but perhaps there is a better implementation of this algorithm.
> There may be an agglomerative clustering algorithm already available in
> scipy; the documentation indicates that k-means clustering software is
> available. Perhaps someone closer to that library can help you there.
>
> Chuck
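
As a rough illustration of the vectorization mentioned above, the inner
loops that search for the closest pair can be replaced with array
operations. This is only a sketch under assumptions: it keeps the approach
of the quoted code (average the two closest vectors, then drop the second
one), the names closest_pair and agglomerate are invented here, and it
recomputes all pairwise distances on every pass, so it is not meant as an
efficient clustering implementation.

```python
import numpy as np

def closest_pair(m):
    """Return (j, k), j < k, of the two rows with the smallest squared distance."""
    sq = (m * m).sum(axis=1)
    # Squared Euclidean distances via |a|^2 + |b|^2 - 2 a.b
    d = sq[:, None] + sq[None, :] - 2.0 * m.dot(m.T)
    # Mask the diagonal and lower triangle so each pair is considered once.
    d[np.tril_indices(len(m))] = np.inf
    j, k = np.unravel_index(np.argmin(d), d.shape)
    return int(j), int(k)

def agglomerate(m, numclust):
    """Merge the closest rows pairwise until numclust rows remain."""
    m = np.array(m, dtype=float)  # work on a copy
    while len(m) > numclust:
        j, k = closest_pair(m)
        m[j] = (m[j] + m[k]) / 2.0   # replace row j with the average
        m = np.delete(m, k, axis=0)  # drop row k, shifting later rows up
    return m

# Made-up points for illustration only.
points = [[0, 0], [0, 1], [10, 10], [10, 11], [50, 50]]
print(agglomerate(points, 3))
```

The distance matrix is built with one matrix product instead of three
nested Python loops, which is where most of the time in the quoted code
goes. Rounding in that product can make very small distances slightly
inaccurate, so treat the result as approximate for near-duplicate rows.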