chararray stripping trailing whitespace a bug?
I've been working with pyfits, which uses numpy chararrays. I've discovered the hard way that chararrays silently remove trailing whitespace:
>>> import numpy as np
>>> a = np.array(['a '])
>>> b = a.view(np.chararray)
>>> a[0]
'a '
>>> b[0]
'a'
Note the string values stored in memory are unchanged. This behaviour caused a bug in a program I've been writing, and seems like a bad idea in general. Is it intentional?
Neil
From the chararray docstring:
Versus a regular Numpy array of type `str` or `unicode`, this class adds the following functionality:
1) values automatically have whitespace removed from the end when indexed
So I guess it is a feature, not a bug. :)
Warren
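The docstring behaviour quoted above can be checked directly. This is a minimal sketch, assuming a NumPy build that still ships the legacy chararray class (reached here as np.char.chararray, which may emit a deprecation warning on recent releases). The variable names are illustrative only.

```python
import numpy as np

# A regular string array whose single element has a trailing space.
a = np.array(['a '])

# Reinterpret the same buffer as a chararray; no data is copied.
b = a.view(np.char.chararray)

plain_item = str(a[0])      # indexing the ndarray keeps the space
stripped_item = str(b[0])   # indexing the chararray strips it

print(repr(plain_item), repr(stripped_item))
print(np.shares_memory(a, b))        # both objects use one buffer
print(a.tobytes() == b.tobytes())    # the stored bytes are identical
```

The stripping happens only when an element is read through the chararray; the stored data is the same for both objects.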
_______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
On Mon, May 10, 2010 at 11:23 AM, Neil Crighton <neilcrighton@gmail.com> wrote:
I've been working with pyfits, which uses numpy chararrays. I've discovered the hard way that chararrays silently remove trailing whitespace:
>>> a = np.array(['a '])
>>> b = a.view(np.chararray)
>>> a[0]
'a '
>>> b[0]
'a'
Note the string values stored in memory are unchanged. This behaviour caused a bug in a program I've been writing, and seems like a bad idea in general. Is it intentional?
Neil
This is an intentional "feature", not a bug.
Chris
--
Christopher Hanley
Senior Systems Software Engineer
Space Telescope Science Institute
3700 San Martin Drive
Baltimore MD, 21218
(410) 338-4338
Ah, ok, thanks. I missed the explanation in the doc string because I'm using version 1.3 and forgot to check the web docs. For the record, this was my bug: I read a fits binary table with pyfits. One of the table fields was a chararray containing a bunch of flags ('A','B','C','D'). I tried to use in1d() to identify all entries with flags of 'C' or 'D'. So
c = pyfits_table.chararray_column
mask = np.in1d(c, ['C', 'D'])
It turns out the actual stored values in the chararray were 'A ', 'B ', 'C ' and 'D '. in1d() converts the chararray to an ndarray before performing the comparison, so none of the entries matches 'C' or 'D'. What is the best way to ensure this doesn't happen to other people? We could change the array set operations to special-case chararrays, but this seems like an ugly solution. Is it possible to change something in pyfits to avoid this?
Neil
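The failure mode described above can be reproduced without pyfits. This is a sketch under stated assumptions: the flags array is made-up data standing in for the table column, np.isin is used in place of in1d (which newer NumPy releases deprecate), and np.char.rstrip is one possible way to strip the padding explicitly, not something proposed in this thread.

```python
import numpy as np

# Hypothetical stand-in for the FITS column: flags padded to two characters.
flags = np.array(['A ', 'B ', 'C ', 'D ', 'C '])

# Comparing the padded values against unpadded targets matches nothing.
naive_mask = np.isin(flags, ['C', 'D'])

# Stripping trailing whitespace explicitly makes the comparison work.
mask = np.isin(np.char.rstrip(flags), ['C', 'D'])

print(naive_mask.tolist())
print(mask.tolist())
```

Stripping on a plain ndarray keeps the behaviour explicit at the call site instead of depending on which array class the column happens to be.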
Also from the docstring:
"""
.. note::
   The `chararray` class exists for backwards compatibility with Numarray, it is not recommended for new development. Starting from numpy 1.4, if one needs arrays of strings, it is recommended to use arrays of `dtype` `object_`, `string_` or `unicode_`, and use the free functions in the `numpy.char` module for fast vectorized string operations.
"""
Neil Crighton wrote:
Ah, ok, thanks. I missed the explanation in the doc string because I'm using version 1.3 and forgot to check the web docs.
For the record, this was my bug: I read a fits binary table with pyfits. One of the table fields was a chararray containing a bunch of flags ('A','B','C','D'). I tried to use in1d() to identify all entries with flags of 'C' or 'D'. So
c = pyfits_table.chararray_column
mask = np.in1d(c, ['C', 'D'])
It turns out the actual stored values in the chararray were 'A ', 'B ', 'C ' and 'D '. in1d() converts the chararray to an ndarray before performing the comparison, so none of the entries matches 'C' or 'D'.
This inconsistency is fixed in Numpy 1.4 (which included a major overhaul of chararrays). in1d will perform the auto whitespace-stripping on chararrays, but not on regular ndarrays of strings.
What is the best way to ensure this doesn't happen to other people? We could change the array set operations to special-case chararrays, but this seems like an ugly solution. Is it possible to change something in pyfits to avoid this?
Pyfits continues to use chararray since not doing so would break existing code relying on this behavior. And there are many use cases where this behavior is desirable, particularly with fixed-length strings in tables.
The best way to get around it from your code is to cast the chararray pyfits returns to a regular ndarray. The cast does not perform a copy, so it should be very efficient:
In [6]: from numpy import char
In [7]: import numpy as np
In [8]: c = char.array(['a ', 'b '])
In [9]: c
Out[9]: chararray(['a', 'b'], dtype='|S2')
In [10]: np.asarray(c)
Out[10]: array(['a ', 'b '], dtype='|S2')
I suggest casting to either chararray or ndarray depending on whether you want the auto-whitespace-stripping behavior.
Mike
--
Michael Droettboom
Science Software Branch
Operations and Engineering Division
Space Telescope Science Institute
Operated by AURA for NASA
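The claim that the cast does not copy can be checked with np.shares_memory. This is a small sketch, again assuming the legacy np.char.chararray class is available in the installed NumPy. The names raw, stripping and plain are illustrative.

```python
import numpy as np

raw = np.array(['a ', 'b '])

# View as chararray when whitespace stripping on indexing is wanted.
stripping = raw.view(np.char.chararray)

# Cast back to a base-class ndarray when the stored values are needed.
plain = np.asarray(stripping)

print(type(plain) is np.ndarray)
print(np.shares_memory(raw, plain))
print(repr(str(stripping[1])), repr(str(plain[1])))
```

Because both directions are views on one buffer, switching between the two behaviours costs no extra memory for the string data.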
This inconsistency is fixed in Numpy 1.4 (which included a major overhaul of chararrays). in1d will perform the auto whitespace-stripping on chararrays, but not on regular ndarrays of strings.
Great, thanks.
Pyfits continues to use chararray since not doing so would break existing code relying on this behavior. And there are many use cases where this behavior is desirable, particularly with fixed-length strings in tables.
The best way to get around it from your code is to cast the chararray pyfits returns to a regular ndarray.
My problem was I didn't know I needed to get around it :) But thanks for the suggestion, I'll use that in future when I need to switch between chararrays and ndarrays.
Neil
participants (4)
- Christopher Hanley
- Michael Droettboom
- Neil Crighton
- Warren Weckesser