Getting an array of indexes from a unicode string

Shane Holloway shane.holloway at ieee.org
Fri Nov 3 11:27:30 EST 2006


I'm looking for a better way to map the characters of a unicode  
string to indexes into an array of geometry.  The following code is  
functional, but it seems sub-optimal with all that numpy has to offer::

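     # decode the incoming byte string so ord() sees characters, not UTF-8 bytes
     textOrds = map(ord, text.decode('utf-8'))
     idx = indexMap[textOrds]    # unicode ordinal -> index into geometry
     textGeo = geometry[idx]     # geometry selected for each character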

text is a simple Python string coming in.  I then manually convert it
to unicode ordinals.  Those are then mapped through indexMap, which
happens to be a 1-to-1 mapping between unicode ordinals and valid
indexes into geometry.  I then use the idx array to take a selection
from geometry for the text.
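
For readers who want to run the pipeline described above, here is a minimal self-contained sketch. It is written for Python 3, where a str is already unicode. The glyph set, the contents of geometry, and the way indexMap is filled are hypothetical stand-ins, not the data from the original program.

```python
import numpy as np

# Hypothetical stand-ins: 'geometry' holds one row per glyph, and
# 'indexMap' maps a code point (0..65535) to a row of 'geometry'.
glyphs = "abc "
geometry = np.arange(len(glyphs) * 2, dtype=float).reshape(len(glyphs), 2)
indexMap = np.zeros(0x10000, dtype=np.intp)
for row, ch in enumerate(glyphs):
    indexMap[ord(ch)] = row

text = "cab"
textOrds = np.array([ord(ch) for ch in text])  # code points of the text
idx = indexMap[textOrds]                       # code point -> geometry row
textGeo = geometry[idx]                        # one geometry row per character
```

Note that in this sketch unmapped code points silently fall back to row 0, because indexMap is initialised with zeros. A real table would probably reserve a dedicated "missing glyph" row.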

As I mentioned before, this works alright, however two things seem
inefficient.  First is the manual mapping to unicode ordinals.  Is
there a way to have numpy do that for me?  Second is the mapping
through indexMap, because it is only sparsely populated -- usually
only 2-5 thousand entries out of the 64 thousand allocated.  I've
thought of using unicode.translate, but characters cannot be used for
indexes in numpy.
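
One possible direction for both points, offered as a sketch rather than a recommendation: encoding the text as UTF-32 lets numpy.frombuffer read the code points directly, and a sorted array of the code points actually in use, searched with numpy.searchsorted, can stand in for the sparse 64K table. The name knownOrds and the sample glyph set are hypothetical, and the sketch assumes the rows of geometry are stored in sorted code-point order. It is written for Python 3.

```python
import numpy as np

text = "cab"
# Encoding to UTF-32 (little-endian, no BOM) yields four bytes per code point,
# which numpy can view directly as unsigned 32-bit integers.
textOrds = np.frombuffer(text.encode("utf-32-le"), dtype="<u4")

# Compact alternative to a 64K-entry table: keep only the code points in use,
# sorted, and locate each character with a binary search.
knownOrds = np.array(sorted(ord(ch) for ch in "abc "), dtype="<u4")
idx = np.searchsorted(knownOrds, textOrds)
```

The trade-off is a binary search per character instead of a direct lookup, in exchange for a table sized to the glyphs in use. searchsorted returns an insertion point, so characters missing from knownOrds would need a separate check before idx is used to index geometry.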

What are your collective thoughts on making this cleaner and more  
efficient?

Thanks,
-Shane Holloway
