
I have a very long list that contains many repeated elements. The elements of the list can be either all numbers, or all strings, or all dates [datetime.date].
I want to convert the list into a matrix where each unique element of the list is assigned a consecutive integer starting from zero.
I've done it by brute force below. Any tips for making it faster? (5x would make it useful; 10x would be a dream.)
list2index.test()
Numbers: 5.84955787659 seconds Characters: 24.3192870617 seconds Dates: 39.288228035 seconds
import datetime, time from numpy import nan, asmatrix, ones
def list2index(L):
# Find unique elements in list uL = dict.fromkeys(L).keys()
# Convert list to matrix L = asmatrix(L).T
# Initialize return matrix idx = nan * ones((L.size, 1))
# Assign numbers to unique L values for i, uLi in enumerate(uL): idx[L == uLi,:] = i
def test():
L = 5000*range(255) t1 = time.time() idx = list2index(L) t2 = time.time() print 'Numbers:', t2-t1, 'seconds'
L = 5000*[chr(z) for z in range(255)] t1 = time.time() idx = list2index(L) t2 = time.time() print 'Characters:', t2-t1, 'seconds'
d = datetime.date step = datetime.timedelta L = 5000*[d(2006,1,1)+step(z) for z in range(255)] t1 = time.time() idx = list2index(L) t2 = time.time() print 'Dates:', t2-t1, 'seconds'

Keith Goodman wrote:
I have a very long list that contains many repeated elements. The elements of the list can be either all numbers, or all strings, or all dates [datetime.date].
I want to convert the list into a matrix where each unique element of the list is assigned a consecutive integer starting from zero.
If what you want is that the first unique element get's zero, the second one, I don't think the code below will work in general since the dict does not preserve order. You might want to look at the results for the character case to see what I mean. If you're looking for something else, you'll need to elaborate a bit. Since list2index doesn't return anything, it's not entirely clear what the answer consists of. Just idx? Idx plus uL?
I've done it by brute force below. Any tips for making it faster? (5x would make it useful; 10x would be a dream.)
Assuming I understand what you're trying to do, this might help:
def list2index2(L): idx = ones([len(L)]) map = {} for i, x in enumerate(L): index = map.get(x) if index is None: map[x] = index = len(map) idx[i] = index return idx
It's almost 10x faster for numbers and about 40x faster for characters and dates. However it produces different results from list2index in the second two cases. That may or may not be a good thing depending on what you're really trying to do.
-tim
list2index.test()
Numbers: 5.84955787659 seconds Characters: 24.3192870617 seconds Dates: 39.288228035 seconds
import datetime, time from numpy import nan, asmatrix, ones
def list2index(L):
# Find unique elements in list uL = dict.fromkeys(L).keys()
# Convert list to matrix L = asmatrix(L).T
# Initialize return matrix idx = nan * ones((L.size, 1))
# Assign numbers to unique L values for i, uLi in enumerate(uL): idx[L == uLi,:] = i
def test():
L = 5000*range(255) t1 = time.time() idx = list2index(L) t2 = time.time() print 'Numbers:', t2-t1, 'seconds' L = 5000*[chr(z) for z in range(255)] t1 = time.time() idx = list2index(L) t2 = time.time() print 'Characters:', t2-t1, 'seconds' d = datetime.date step = datetime.timedelta L = 5000*[d(2006,1,1)+step(z) for z in range(255)] t1 = time.time() idx = list2index(L) t2 = time.time() print 'Dates:', t2-t1, 'seconds'
Using Tomcat but need to do more? Need to support web services, security? Get stuff done quickly with pre-integrated technology to make your job easier Download IBM WebSphere Application Server v.1.0.1 based on Apache Geronimo http://sel.as-us.falkag.net/sel?cmd=lnk&kid=120709&bid=263057&da... _______________________________________________ Numpy-discussion mailing list Numpy-discussion@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/numpy-discussion

Tim Hochberg wrote:
Keith Goodman wrote:
I have a very long list that contains many repeated elements. The elements of the list can be either all numbers, or all strings, or all dates [datetime.date].
I want to convert the list into a matrix where each unique element of the list is assigned a consecutive integer starting from zero.
If what you want is that the first unique element get's zero, the second one, I don't think the code below will work in general since the dict does not preserve order. You might want to look at the results for the character case to see what I mean. If you're looking for something else, you'll need to elaborate a bit. Since list2index doesn't return anything, it's not entirely clear what the answer consists of. Just idx? Idx plus uL?
I've done it by brute force below. Any tips for making it faster? (5x would make it useful; 10x would be a dream.)
Assuming I understand what you're trying to do, this might help:
def list2index2(L): idx = ones([len(L)]) map = {} for i, x in enumerate(L): index = map.get(x) if index is None: map[x] = index = len(map) idx[i] = index return idx
It's almost 10x faster for numbers and about 40x faster for characters and dates. However it produces different results from list2index in the second two cases. That may or may not be a good thing depending on what you're really trying to do.
Ugh! I fell victim to premature optimization disease. The following is both clearer and faster: Sigh.
def list2index3(L): idx = ones([len(L)]) map = {} for i, x in enumerate(L): if x not in map: map[x] = len(map) idx[i] = map[x] return idx
-tim
list2index.test()
Numbers: 5.84955787659 seconds Characters: 24.3192870617 seconds Dates: 39.288228035 seconds
import datetime, time from numpy import nan, asmatrix, ones
def list2index(L):
# Find unique elements in list uL = dict.fromkeys(L).keys()
# Convert list to matrix L = asmatrix(L).T
# Initialize return matrix idx = nan * ones((L.size, 1))
# Assign numbers to unique L values for i, uLi in enumerate(uL): idx[L == uLi,:] = i
def test():
L = 5000*range(255) t1 = time.time() idx = list2index(L) t2 = time.time() print 'Numbers:', t2-t1, 'seconds' L = 5000*[chr(z) for z in range(255)] t1 = time.time() idx = list2index(L) t2 = time.time() print 'Characters:', t2-t1, 'seconds' d = datetime.date step = datetime.timedelta L = 5000*[d(2006,1,1)+step(z) for z in range(255)] t1 = time.time() idx = list2index(L) t2 = time.time() print 'Dates:', t2-t1, 'seconds'
Using Tomcat but need to do more? Need to support web services, security? Get stuff done quickly with pre-integrated technology to make your job easier Download IBM WebSphere Application Server v.1.0.1 based on Apache Geronimo http://sel.as-us.falkag.net/sel?cmd=lnk&kid=120709&bid=263057&da... _______________________________________________ Numpy-discussion mailing list Numpy-discussion@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/numpy-discussion
Using Tomcat but need to do more? Need to support web services, security? Get stuff done quickly with pre-integrated technology to make your job easier Download IBM WebSphere Application Server v.1.0.1 based on Apache Geronimo http://sel.as-us.falkag.net/sel?cmd=lnk&kid=120709&bid=263057&da... _______________________________________________ Numpy-discussion mailing list Numpy-discussion@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/numpy-discussion

On 8/29/06, Tim Hochberg tim.hochberg@ieee.org wrote:
Keith Goodman wrote:
I have a very long list that contains many repeated elements. The elements of the list can be either all numbers, or all strings, or all dates [datetime.date].
I want to convert the list into a matrix where each unique element of the list is assigned a consecutive integer starting from zero.
If what you want is that the first unique element get's zero, the second one, I don't think the code below will work in general since the dict does not preserve order. You might want to look at the results for the character case to see what I mean. If you're looking for something else, you'll need to elaborate a bit. Since list2index doesn't return anything, it's not entirely clear what the answer consists of. Just idx? Idx plus uL?
The output I wanted (in my mind, but unfortunately not in my previous email) is idx and uL where uL[0] corresponds to the zeros in idx, uL[1] corresponds to the ones in idx. etc.
I'd also like the uL's to be ordered (now I see that characters and dates aren't ordered, ooops, thanks for telling me about that). Or optionally ordered by a second list input which if present would be used instead of the unique values of L.
Thank you all for the huge improvements to my code. I'll learn a lot studying all of them.

something like this?
def list2index(L): uL=sorted(set(L)) idx=dict((y,x) for x,y in enumerate(uL)) return uL,asmatrix(fromiter((idx[x] for x in L),dtype=int))
//Torgil
On 8/29/06, Keith Goodman kwgoodman@gmail.com wrote:
On 8/29/06, Tim Hochberg tim.hochberg@ieee.org wrote:
Keith Goodman wrote:
I have a very long list that contains many repeated elements. The elements of the list can be either all numbers, or all strings, or all dates [datetime.date].
I want to convert the list into a matrix where each unique element of the list is assigned a consecutive integer starting from zero.
If what you want is that the first unique element get's zero, the second one, I don't think the code below will work in general since the dict does not preserve order. You might want to look at the results for the character case to see what I mean. If you're looking for something else, you'll need to elaborate a bit. Since list2index doesn't return anything, it's not entirely clear what the answer consists of. Just idx? Idx plus uL?
The output I wanted (in my mind, but unfortunately not in my previous email) is idx and uL where uL[0] corresponds to the zeros in idx, uL[1] corresponds to the ones in idx. etc.
I'd also like the uL's to be ordered (now I see that characters and dates aren't ordered, ooops, thanks for telling me about that). Or optionally ordered by a second list input which if present would be used instead of the unique values of L.
Thank you all for the huge improvements to my code. I'll learn a lot studying all of them.
Using Tomcat but need to do more? Need to support web services, security? Get stuff done quickly with pre-integrated technology to make your job easier Download IBM WebSphere Application Server v.1.0.1 based on Apache Geronimo http://sel.as-us.falkag.net/sel?cmd=lnk&kid=120709&bid=263057&da... _______________________________________________ Numpy-discussion mailing list Numpy-discussion@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/numpy-discussion

On 8/29/06, Torgil Svensson torgil.svensson@gmail.com wrote:
something like this?
def list2index(L): uL=sorted(set(L)) idx=dict((y,x) for x,y in enumerate(uL)) return uL,asmatrix(fromiter((idx[x] for x in L),dtype=int))
Wow. That's amazing. Thank you.

def list2index(L): uL=sorted(set(L)) idx=dict((y,x) for x,y in enumerate(uL)) return uL,asmatrix(fromiter((idx[x] for x in L),dtype=int,count=len(L)))
adding the count will save you a little more time, and temporary memory [see related thread].
//Torgil
On 8/29/06, Keith Goodman kwgoodman@gmail.com wrote:
On 8/29/06, Torgil Svensson torgil.svensson@gmail.com wrote:
something like this?
def list2index(L): uL=sorted(set(L)) idx=dict((y,x) for x,y in enumerate(uL)) return uL,asmatrix(fromiter((idx[x] for x in L),dtype=int))
Wow. That's amazing. Thank you.
Using Tomcat but need to do more? Need to support web services, security? Get stuff done quickly with pre-integrated technology to make your job easier Download IBM WebSphere Application Server v.1.0.1 based on Apache Geronimo http://sel.as-us.falkag.net/sel?cmd=lnk&kid=120709&bid=263057&da... _______________________________________________ Numpy-discussion mailing list Numpy-discussion@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/numpy-discussion

def list2index(L): idx=dict((y,x) for x,y in enumerate(set(L))) return asmatrix(fromiter((idx[x] for x in L),dtype=int))
# old $ python test.py Numbers: 29.4062280655 seconds Characters: 84.6239070892 seconds Dates: 117.560418844 seconds
# new $ python test.py Numbers: 1.79700994492 seconds Characters: 1.6025249958 seconds Dates: 1.7974088192 seconds
16, 52 and 100 times faster
//Torgil
On 8/29/06, Keith Goodman kwgoodman@gmail.com wrote:
I have a very long list that contains many repeated elements. The elements of the list can be either all numbers, or all strings, or all dates [datetime.date].
I want to convert the list into a matrix where each unique element of the list is assigned a consecutive integer starting from zero.
I've done it by brute force below. Any tips for making it faster? (5x would make it useful; 10x would be a dream.)
list2index.test()
Numbers: 5.84955787659 seconds Characters: 24.3192870617 seconds Dates: 39.288228035 seconds
import datetime, time from numpy import nan, asmatrix, ones
def list2index(L):
# Find unique elements in list uL = dict.fromkeys(L).keys()
# Convert list to matrix L = asmatrix(L).T
# Initialize return matrix idx = nan * ones((L.size, 1))
# Assign numbers to unique L values for i, uLi in enumerate(uL): idx[L == uLi,:] = i
def test():
L = 5000*range(255) t1 = time.time() idx = list2index(L) t2 = time.time() print 'Numbers:', t2-t1, 'seconds' L = 5000*[chr(z) for z in range(255)] t1 = time.time() idx = list2index(L) t2 = time.time() print 'Characters:', t2-t1, 'seconds' d = datetime.date step = datetime.timedelta L = 5000*[d(2006,1,1)+step(z) for z in range(255)] t1 = time.time() idx = list2index(L) t2 = time.time() print 'Dates:', t2-t1, 'seconds'
Using Tomcat but need to do more? Need to support web services, security? Get stuff done quickly with pre-integrated technology to make your job easier Download IBM WebSphere Application Server v.1.0.1 based on Apache Geronimo http://sel.as-us.falkag.net/sel?cmd=lnk&kid=120709&bid=263057&da... _______________________________________________ Numpy-discussion mailing list Numpy-discussion@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/numpy-discussion

You can get some speed up for numeric data: def list2index2(L): aL = asarray(L) eL = empty_like(L) for v,k in enumerate(set(L)): eL[aL == k] = v return numpy.asmatrix(eL).T
fwiw, Alan Isaac
participants (4)
-
Alan G Isaac
-
Keith Goodman
-
Tim Hochberg
-
Torgil Svensson