[Numpy-discussion] [SciPy-dev] Deprecate chararray [was Plea for help]

Tue Sep 22 17:22:00 EDT 2009

Michael:

First, thank you very much for your detailed and thorough analysis and recap
of the situation - it sounds to me like chararray is now in good hands! :-)

On Tue, Sep 22, 2009 at 10:58 AM, Michael Droettboom <mdroe at stsci.edu> wrote:

> Sorry to resurrect a long-dead thread, but I've been continuing Chris
>

IMO, no apology necessary!

> Hanley's investigation of chararray at Space Telescope Science Institute
> (and the broader astronomical community) for a while and have some
> findings to report back.
>
> What I've taken from this thread is that chararray is in need of a
> maintainer.  I am able to devote some time to the cause, but first would
>

Yes, thank you!

> like to clarify what it will take to make its continued inclusion more
> comfortable.
>
> Let me start with the use case.  chararrays are extensively returned
> from pyfits (a tool to handle the standard astronomy data format).
> pyfits is the basis of many applications, and it would be impossible to
> audit all of that code.  Most authors of those tools do not track
> numpy-discussion closely, which is why we don't hear from them on this
> list, but there is a great deal of pyfits-using code.
>
> Doing some spot-checking on this code, a common thing I see is SQL-like
> queries on recarrays of objects.  For instance, it is very common to
> have a table of objects, with a "Target" column which is a string, and
> do something like (where c is a chararray of the 'Target' column):
>
>   subset = array[np.where(c.startswith('NGC'))]
>
> Strictly speaking, this is a use case for "vectorized string
> operations", not necessarily for the chararray class as it presently
> stands.  One could almost as easily do:
>
>   subset = array[np.where([x.startswith('NGC') for x in c])]
>
> ...and the latter is even slightly faster, since chararray currently
> loops in Python anyway.
>
> Even better, though, I have some experimental code to perform the loop
> in C, and I get 5x speed up on a table with ~120,000 rows.  If that were
> to be included in numpy, that's a strong argument against recommending
> list comprehensions in user code.  The use case suggests the continued
> existence of vectorized string operations in numpy -- whether that
> continues to be chararray, or some newer/better interface + chararray
> for backward compatibility, is an open question.  Personally I think a
> less object-oriented approach and just having a namespace full of
> vectorized string functions might be cleaner than the current situation
> of needing to create a view class around an ndarray.  I'm suggesting
> something like the following, using the same example, where {STR} is
> some namespace we would fill with vectorized string operations:
>
>   subset = array[np.where(np.{STR}.startswith(c, 'NGC'))]
>
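
For what it's worth, here is a minimal sketch of that free-function style.
It assumes a NumPy version in which np.char exposes module-level functions
such as np.char.startswith; the 'Target' values and magnitudes are made up
for illustration and do not come from any real pyfits table.

```python
import numpy as np

# Hypothetical 'Target' column and a companion column (made-up values).
targets = np.array(['NGC 1300', 'M31', 'NGC 224', 'IC 10'])
mags = np.array([10.4, 3.4, 3.4, 11.8])

# Free-function style: no chararray view class is needed.
mask = np.char.startswith(targets, 'NGC')
subset = mags[mask]

# The same selection with a Python-level loop, for comparison.
mask_loop = np.array([t.startswith('NGC') for t in targets])
```

Both masks select the same rows; the free function just keeps the loop out
of user code.
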
> Now on to chararray as it now stands.  I view chararray as really two
> separable pieces of functionality:
>
>   1) Convenience to perform vectorized string operations using
> '.method' syntax, or in some cases infix operators (+, *)
>   2) Implicit "rstrip"ping of values
>
> (Note that raw ndarrays truncate values at the first NULL character,
> like C strings, but chararrays will strip any and all whitespace
> characters from the end).
>
> Changing (2) just seems to be asking to be the source of subtle bugs.
> Unfortunately, there's an inconsistency between 1) and 2) in the present
> implementation.  For example:
>
> In [9]: a = np.char.array(['a  '])
>
> In [10]: a
> Out[10]: chararray(['a'], dtype='|S3')
>
> In [11]: a[0] == 'a'
> Out[11]: True
>
> In [12]: a.endswith('a')
> Out[12]: array([False], dtype=bool)
>
> This is *the* design wart of chararray, IMHO, and one that's difficult
> to fix without breaking compatibility.  It might be a worthwhile
> experiment to remove (2) and see how much we really break, but it would
> be impossible to know for sure.
>
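
One way to sidestep the inconsistency, sketched here on a plain ndarray
rather than a chararray, is to strip explicitly before applying any string
operation. This assumes np.char.rstrip and np.char.endswith are available
as free functions; the sample values are invented.

```python
import numpy as np

# Plain ndarray of strings: trailing spaces are part of each value.
padded = np.array(['a  ', 'bc '])

# Without stripping, the suffix test sees the padding.
raw = np.char.endswith(padded, 'a')

# Stripping explicitly makes element values and string operations agree.
stripped = np.char.rstrip(padded)
clean = np.char.endswith(stripped, 'a')
```

Here the stripping is a visible step in user code rather than something the
container does implicitly on element access.
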
> Now to address the concerns enumerated in this thread.  Unfortunately, I
> don't know where this thread began before it landed on the Numpy list,
> so I may be missing details which would help me address them.
>
> > 0) "it gets very little use" (an assumption you presumably dispute);
> >
> Certainly not true from where I stand.
>

I'm convinced.

> > 1) "is pretty much undocumented" (less true than a week ago, but still
> > true for several of the attributes, with another handful or so falling
> > into the category of "poorly documented");
> >
> I don't quite understand this one -- 99% of the methods are wrappers
> around standard Python string methods.  I don't think we should
> redocument those.  I agree it needs a better top level docstring about
>

OK, that's what I needed to hear (and I don't believe anyone stated it
explicitly before - I'm sure I'll be corrected if I'm wrong).  In that case,
finishing these off is as simple as stating in each function's docstring
that it wraps the corresponding Python string method (albeit in a way
compliant w/ the numpy docstring standard, of course; see below).
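
As a rough illustration of such a docstring (a sketch only: startswith_each
is a hypothetical name, not a numpy function, and the section layout
reflects my reading of the docstring standard), a wrapper can simply defer
to the Python method it wraps:

```python
import numpy as np

def startswith_each(a, prefix):
    """
    Test whether each element of `a` starts with `prefix`.

    Calls `str.startswith` element-wise; see its documentation for
    the full semantics.

    Parameters
    ----------
    a : array_like of str
        Input strings.
    prefix : str
        Prefix to test for.

    Returns
    -------
    out : ndarray of bool
        True where the corresponding element starts with `prefix`.

    See Also
    --------
    str.startswith
    """
    a = np.asarray(a)
    # Plain Python loop over the flattened elements (illustrative only).
    flags = [s.startswith(prefix) for s in a.ravel().tolist()]
    return np.array(flags, dtype=bool).reshape(a.shape)
```
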

<snip>

> > 6) it is, on its face, "counter to the spirit" of NumPy.
> >
> I don't quite know what this means -- but I do find the fact that it's a
> view class with methods a little bit clumsy.  Is that what you meant?
>

The rest of the arguments effectively become moot, but I will clarify what I
meant by 6).  As I understood - and understand - it, the central purpose of
numpy is to provide a fast (i.e., implemented in C) Python API for a
_numerical_ multidimensional array object.  It sounds like there is a need
for a fast Python API for vectorized string operations, but IMO numpy is not
the place for it (maybe a sub-package in scipy?  It could still use numpy
"under the hood," of course).  That said, my primary concern at present is
getting everything that _is_ currently in numpy documented, and now, so it
shall be.

> So here's my TODO list related to all this:
>
> 1) Fix bugs in Trac
> 2) Improve documentation (though probably not in a method-by-method way)
>

So, you're volunteering to do this?  Great, thanks!  (Please be sure, of
course, to conform to the numpy docstring standard:

http://projects.scipy.org/numpy/wiki/CodingStyleGuidelines#docstring-standard

with clarification of referral practice, such as it is, at:

http://docs.scipy.org/numpy/Questions+Answers/#documenting-equivalent-functions-and-methods
)

> 3) Improve unit test coverage
> 4a) Create C-based vectorized string operations
> 4b) Refactor chararray in terms of those
> 4c) Design and create an interface to those methods that will be the
> "right way" going forward
>
> Anything else?
>

Looks great to me!

With much thanks again!!!

DG

>
> Mike
>
>
> --
> Michael Droettboom
> Science Software Branch
> Operations and Engineering Division
> Space Telescope Science Institute
> Operated by AURA for NASA
>
> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion at scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>