[Numpy-discussion] How to create a boolean sub-array from a larger string array?

Sat Jun 23 02:14:23 EDT 2007

Andriy Basilisk wrote:
> Hello all,
> 
> My challenge is this:
> I'm working on an application that parses numerical data from a text
> report using regular expressions, and then places the results in Numpy
> matrices for processing.  The data contains integers, floats, and
> boolean values.  The boolean values are represented in the text file
> by either an empty string '', or by a star '*'.  The regex parser
> creates a sequence of nested lists that is readily converted to an MxN
> string-type matrix.  Then, the relevant rows of that matrix are
> sliced to create the required sub-matrices.
> 
> Here is a simplified sample of my solution so far:
> 
> import numpy as _N
> data = [['1', '5.30', '', '3.44', '*'],
>         ['2', '-4.12', '*', '-1.24', ''],
>         ['3', '0.45', '', '3.22', '*']]
> mdat = _N.mat(data).T       # mdat.shape is now (5,3)
> ids = mdat[0,].astype(_N.int)       #this works for str->int
> # same idea also works for str->float64
> noms = mdat[(1,3),].astype(_N.float64)
> ## The following technique would be nice, but
> ## it causes a ValueError: invalid literal for int() with base 10: ''
> outs = mdat[(2,4),].astype(_N.bool)
> ## Instead, I have to convert the strings to '0' or '1'
> ## explicitly, then cast them to a bool matrix:
> for i, b in enumerate(mdat[(2,4),].T):
>     mdat[2, i] = 1 if mdat[2, i] else 0
>     mdat[4, i] = 1 if mdat[4, i] else 0
> outs = mdat[(2,4),].astype(_N.bool)
> 
> I was expecting the above to behave similarly to the Python bool()
> function on strings:
>    >>> bool(''), bool('*')
>    (False, True)
> but it doesn't work that way.
> 
> Can anyone enlighten me as to why slices of my string matrix cannot be
> cast to boolean matrices?

It's kind of a toss-up as to what's needed in general. I suspect that for the
majority of cases, one deals with strings of '0' and '1' instead of empty
strings and non-empty strings.

You can always use something like

  mdat[[2,4]] == '*'

to get the boolean array you want. This scheme can work with any string
representation of True and False.
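
As a minimal sketch of that comparison approach, the example below rebuilds the
sample data from the question and derives the boolean rows with `==` instead of
`astype`. It assumes a plain `ndarray` (via `np.array`) rather than the matrix
type used in the original post, and the names `sdat` and `outs_nonempty` are
illustrative only.

```python
import numpy as np

data = [['1', '5.30', '', '3.44', '*'],
        ['2', '-4.12', '*', '-1.24', ''],
        ['3', '0.45', '', '3.22', '*']]

# Assumption: a plain string ndarray stands in for the matrix in the question.
# After the transpose each row holds one field, so the shape is (5, 3).
sdat = np.array(data).T

# Numeric fields convert directly from their string form.
ids = sdat[0].astype(int)
noms = sdat[[1, 3]].astype(np.float64)

# Boolean fields: compare against the marker instead of casting.
outs = sdat[[2, 4]] == '*'

# Variant that mimics Python's bool() on strings: any non-empty string is True.
outs_nonempty = sdat[[2, 4]] != ''
```

The comparison returns a boolean array of the same shape as the selected rows,
so no intermediate '0'/'1' rewrite of the string data is needed. To use a
different marker, such as 'Y', change the right-hand side of the comparison.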

-- 
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
 that is made terrible by our own mad attempt to interpret it as though it had
 an underlying truth."
  -- Umberto Eco