[Patches] [ python-Patches-520694 ] arraymodule.c improvements

noreply@sourceforge.net noreply@sourceforge.net
Sun, 24 Feb 2002 07:56:28 -0800


Patches item #520694, was opened at 2002-02-20 14:38
You can respond by visiting: 
http://sourceforge.net/tracker/?func=detail&atid=305470&aid=520694&group_id=5470

Category: None
Group: None
Status: Open
Resolution: Accepted
Priority: 3
Submitted By: Jason Orendorff (jorend)
Assigned to: Martin v. Löwis (loewis)
Summary: arraymodule.c improvements

Initial Comment:
This patch brings the array module a little
more up-to-date.

There are two changes:

1. Modernize the array type, memory management,
   and so forth.  As a result, the array()
   builtin is no longer a function but a type.
   array.array is array.ArrayType.
   Also, it can now be subclassed in Python.

2. Add a new typecode 'u', for Unicode
   characters.
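
A minimal sketch of what change 1 allows, assuming a current Python 3
interpreter rather than the 2.2-era tree the patch targets. The Samples
class and its mean() method are hypothetical names used only for
illustration:

```python
from array import array


# Hypothetical subclass, only for illustration.
class Samples(array):
    """An array with one extra convenience method."""

    def mean(self):
        return sum(self) / len(self)


s = Samples('h', [1, 2, 3, 6])

print(isinstance(array, type))  # array is a type object, not a factory function
print(isinstance(s, array))     # instances of the subclass are still arrays
print(s.typecode, s.mean())
```

The subclass inherits the constructor signature (typecode, initializer)
unchanged, so no __new__ override is needed for a case this simple.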

The patch includes changes to test/test_array.py
to test the new features.

I would like to make a further change: add an
arrayobject.h include file, and provide some
array operations there, giving them names like
PyArray_Check(), PyArray_GetItem(), and
PyArray_GET_DATA().  Is such a change likely
to find favor?



----------------------------------------------------------------------

>Comment By: Martin v. Löwis (loewis)
Date: 2002-02-24 07:56

Message:
Logged In: YES 
user_id=21627

There is a flaw in the extension of arrays to Unicode: There
is no easy way to get back the Unicode string. You have to use

u"".join(arr.tolist())

This is slightly annoying, since it is the only
case where it is not possible to get back the original
constructor arguments.
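
A sketch of the round trip in question, assuming a current Python 3
interpreter. The join-based idiom is the one quoted above; tounicode() is
a method that current versions of the array module provide, not part of
the patch as reviewed here. The fallback to typecode 'w' is an assumption
for interpreters where 'u' is no longer accepted:

```python
import warnings
from array import array, typecodes

# 'u' is deprecated in recent Python 3 releases; use 'w' if 'u' is gone.
CODE = 'u' if 'u' in typecodes else 'w'

original = 'h\u00e9llo'

with warnings.catch_warnings():
    # Newer interpreters warn when 'u' is used; silence that for the demo.
    warnings.simplefilter('ignore', DeprecationWarning)
    arr = array(CODE, original)

via_join = ''.join(arr.tolist())   # the workaround from the comment above
via_method = arr.tounicode()       # direct conversion in current versions

print(via_join == original, via_method == original)
```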

Also, what is the rationale for removing __members__?

----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2002-02-22 05:39

Message:
Logged In: YES 
user_id=38388

How about simplifying the whole setup altogether and
adding arrays as standard Python types (i.e. put the code
in Objects/ and add the new include file to Include/).

About the inter-module C API export: I'll write up a PEP
about this which will hopefully result in a new standard
support mechanism for this in Python. (BTW, the
approach I used in _ssl/_socket does use PyCObjects)

----------------------------------------------------------------------

Comment By: Martin v. Löwis (loewis)
Date: 2002-02-22 05:25

Message:
Logged In: YES 
user_id=21627

With the rationale given, I'm now in favour of all parts of
the patch.

As for exposing the API, you need to address MAL's concerns:
PyArray_* won't be available to other extension modules,
instead, you need to expose them through a C object.

However, I recommend *not* to follow the approach taken in
socket/ssl; I agree with Tim's concerns here. Instead, the
approach taken by cStringIO (via cStringIO.cStringIO_API) is
much better (i.e. put the burden of using the API onto any
importer, and out of Python proper).


----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2002-02-21 00:40

Message:
Logged In: YES 
user_id=38388

About the Unicode bit: if "u" maps to Py_UNICODE I for one 
don't have any objections. The internal encoding is
available in lots of places, so that argument doesn't
count and I'm sure it can be put to some good use
for fast manipulation of large Unicode strings.

I very much like the new exposure of the type at C level;
however I don't understand how you would use it without
adding the complete module to the libpythonx.x.a (unless
you add some sort of inter-module C API import mechanism
like the one I added to _socket and _ssl) ?!


----------------------------------------------------------------------

Comment By: Jason Orendorff (jorend)
Date: 2002-02-20 18:03

Message:
Logged In: YES 
user_id=18139

> What is the rationale for expanding PyObject_VAR_HEAD?
> It doesn't seem to achieve anything.

It didn't make sense for array to be a VAR_HEAD type.

VAR_HEAD types are variable-size: the last member
defined in the struct for such a type is an array of
length 1, and type->tp_itemsize is nonzero.  See
e.g. PyType_GenericAlloc(), and how it decides whether
to call PyObject_INIT or PyObject_INIT_VAR: It checks
type->tp_itemsize.

The new arraymodule.c calls PyType_GenericAlloc; the
old one didn't.  So a change seemed warranted.  Since
Arraytype has tp_itemsize == 0, it seemed most consistent
to make it a non-VAR type and initialize the ob_size
field myself.
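
The item-size distinction can be observed from Python, since every type
exposes its C-level item size as the __itemsize__ attribute. A small
sketch, assuming CPython 3; the exact nonzero values are implementation
details and differ between builds:

```python
from array import array

# Types whose instances carry their items inline in the object struct
# report a nonzero item size; types that keep their data in a separately
# allocated buffer (or have no items at all) report zero.
for tp in (tuple, bytes, list, dict, array):
    kind = 'variable-size' if tp.__itemsize__ else 'fixed-size'
    print(f'{tp.__name__:6} __itemsize__={tp.__itemsize__}  ({kind} struct)')
```

Note that this is the item size of the type, which is unrelated to the
itemsize attribute of an array instance (the width of one element).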

I'm pretty sure I got the right interpretation of this;
but if not, someone wiser in the ways of Python will
speak up.  :)

(While I was looking at this, I noticed this:
http://sourceforge.net/tracker/index.php?func=detail&aid=520768&group_id=5470&atid=305470)


----------------------------------------------------------------------

Comment By: Jason Orendorff (jorend)
Date: 2002-02-20 17:15

Message:
Logged In: YES 
user_id=18139

> I don't like the Unicode part of it at all.

Well, I'm not attached to it.  It's very easy
to subtract it from the patch.

> What can you do with this feature?

The same sort of thing you might do with an array
of type 'c'.  For example, change individual
characters of a (Unicode) string and then run a
(Unicode) re.match on it.
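
A sketch of that use case, assuming a current Python 3 interpreter. There
the re module matches str patterns only against str, so the array is
converted back with tounicode() before matching; the sample text and
pattern are made up for illustration:

```python
import re
import warnings
from array import array, typecodes

# 'u' is deprecated in recent Python 3 releases; use 'w' if 'u' is gone.
CODE = 'u' if 'u' in typecodes else 'w'

with warnings.catch_warnings():
    warnings.simplefilter('ignore', DeprecationWarning)
    buf = array(CODE, 'color: grey')

buf[9] = 'a'                 # change one character in place
text = buf.tounicode()       # back to an ordinary string for the regex
match = re.match(r'color: (\w+)', text)

print(text, match.group(1))
```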

> It seems to unfairly prefer a specific Unicode encoding,
> without explaining what that encoding is, and without a
> clear use case why this encoding is desirable.

Well, why should array('h', '\x00\xff\xaa\xbb')
be allowed?  Why is that encoding preferable to any
other particular encoding of short ints?  Easy:
it's the encoding of the C compiler where Python was
built.  For 'u' arrays, the encoding used is just the
encoding that Python uses internally.
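
To make the 'h' analogy concrete, here is a sketch for Python 3, where
the raw initializer is a bytes object. It assumes a C short is two bytes,
which holds on mainstream platforms but is not guaranteed:

```python
import struct
import sys
from array import array

raw = b'\x00\xff\xaa\xbb'

shorts = array('h', raw)                   # interpreted in native layout
native = list(struct.unpack('@2h', raw))   # same bytes, native byte order
little = list(struct.unpack('<2h', raw))   # forced little-endian
big = list(struct.unpack('>2h', raw))      # forced big-endian

print(sys.byteorder, shorts.tolist())
print(shorts.tolist() == native)           # array agrees with the native view
print(little == big)                       # the two byte orders disagree
```

The array result always matches the native interpretation, which is the
point being made: the "encoding" is whatever the build platform uses.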

However, it's not intended to be used in any situation
where encode()/decode() would be appropriate.  I never
even thought about that possibility when I wrote it.

The behavior of a 'u' array is intended to be more
like this:  Suppose A = array('u', ustr).  Then:
    len(A) == len(ustr)
    A[0] == ustr[0]
    A[1] == ustr[1]
    ...

That is, a 'u' array is an array of Unicode characters.
Encoding is not an issue, any more than with the
built-in unicode type.
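
The intended behavior can be written down as checks. This is a sketch
for a current Python 3 interpreter, restricted to characters in the Basic
Multilingual Plane, because on platforms with a 16-bit wchar_t the
element count can differ for characters outside it:

```python
import warnings
from array import array, typecodes

# 'u' is deprecated in recent Python 3 releases; use 'w' if 'u' is gone.
CODE = 'u' if 'u' in typecodes else 'w'

ustr = 'na\u00efve'

with warnings.catch_warnings():
    warnings.simplefilter('ignore', DeprecationWarning)
    A = array(CODE, ustr)

same_length = len(A) == len(ustr)
same_items = all(A[i] == ustr[i] for i in range(len(ustr)))
same_slice = A[1:3].tounicode() == ustr[1:3]

print(same_length, same_items, same_slice)
```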

(If ustr is a non-Unicode string, then the behavior
is different -- more in line with what 'b', 'h', 'i',
and the others do.)

If your concern is that Python currently "hides" its
internal encoding, and the 'u' array exposes this
unnecessarily, then consider these two examples that
don't involve arrays:

>>> x = u'\U00012345'  # One Unicode codepoint...
>>> len(x)
2             # hmm.
>>> x[0]
u'\ud808'     # aha.  UTF-16.
>>> x[1]
u'\udf45'

>>> str(buffer(u'abc'))   # Example two.
'a\x00b\x00c\x00'
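
The two code units in the first example follow from the standard UTF-16
surrogate-pair arithmetic. A sketch for Python 3, where len() of that
string is 1 and the code units are only visible after encoding; the
helper name utf16_surrogates is hypothetical:

```python
def utf16_surrogates(code_point):
    """Split a code point above U+FFFF into its UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError('code point is not outside the BMP')
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)      # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits
    return high, low


high, low = utf16_surrogates(0x12345)
encoded = '\U00012345'.encode('utf-16-be')
abc = 'abc'.encode('utf-16-le')

print(hex(high), hex(low))
print(encoded.hex(), abc)
```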

> It also seems to overlap with the Unicode object's
> .encode method, which is much more general.

Wow.  Well, that wasn't my intent.

It is intended, rather, to offer parity with 'c'.
Java has byte[], short[], int[], long[], float[],
double[], and char[]... Python doesn't currently have
char[].  Shouldn't it?


----------------------------------------------------------------------

Comment By: Martin v. Löwis (loewis)
Date: 2002-02-20 15:02

Message:
Logged In: YES 
user_id=21627

What is the rationale for expanding PyObject_VAR_HEAD? It
doesn't seem to achieve anything.

I don't like the Unicode part of it at all. What can you do
with this feature? It seems to unfairly prefer a specific
Unicode encoding, without explaining what that encoding is,
and without a clear use case why this encoding is desirable.
It also seems to overlap with the Unicode object's .encode
method, which is much more general.

----------------------------------------------------------------------
