[Numpy-discussion] GSOC 2013

Wed Mar 6 03:38:12 EST 2013

On Mar 5, 2013 7:53 PM, "Nathaniel Smith" <njs at pobox.com> wrote:
>
> On 4 Mar 2013 23:21, "Jaime Fernández del Río" <jaime.frio at gmail.com>
wrote:
> >
> > On Mon, Mar 4, 2013 at 2:29 PM, Todd <toddrjen at gmail.com> wrote:
> >>
> >>
> >> 5. Currently dtypes are limited to a set of fixed types, or
combinations of these types.  You can't have, say, a 48 bit float or a
1-bit bool.  This project would be to allow users to create entirely new,
non-standard dtypes based on simple rules, such as specifying the length of
the sign, length of the exponent, and length of the mantissa for a custom
floating-point number.  Hopefully this would mostly be used for reading in
non-standard data and not used that often, but for some situations it could
be useful for storing data too (such as large amounts of boolean data, or
genetic code which can be stored in 2 bits and is often very large).
> >
> >
> > I second this general idea. Simply having a pair of packbits/unpackbits
functions that could work with 2 and 4 bit uints would make my life easier.
If it were possible to have an array of dtype 'uint4' that used half the
space of a 'uint8', but could have ufuncs and the like run on it, it would
be pure bliss. Not that I'm complaining, but a man can dream...
>
> This would be quite difficult, since it would require reworking the guts
of the ndarray data structure to store strides and buffer offsets in bits
rather than bytes, and probably with endianness handling too. Indexing is
all done at the ndarray buffer-of-bytes layer, without any involvement of
the dtype.
>
> Consider:
>
> a = zeros(10, dtype=uint4)
> b = a[1::3]
>
> Now b is a view onto a discontiguous set of half-bytes within a...
>
> You could have a dtype that represented several uint4s that together
added up to an integral number of bytes, sort of like a structured dtype.
Or packbits()/unpackbits(), like you say.
>
> -n
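
To make the two points above concrete, here is a minimal sketch. It first
shows that strides are counted in whole bytes, using uint8 as a stand-in
because uint4 is not an existing dtype, and then does the packing workaround
by hand for 4-bit values. pack_uint4 and unpack_uint4 are hypothetical helper
names, not numpy functions, and the high-nibble-first order is just an
assumption.

```python
import numpy as np

# Strides are expressed in bytes, so the smallest possible step is one byte.
a = np.zeros(10, dtype=np.uint8)
b = a[1::3]
assert a.strides == (1,)   # one byte per element
assert b.strides == (3,)   # the view steps three whole bytes at a time


def pack_uint4(values):
    """Pack 1-D values in the range 0..15 two per byte (hypothetical helper)."""
    v = np.asarray(values, dtype=np.uint8)
    if v.ndim != 1:
        raise ValueError("expected a 1-D sequence")
    if np.any(v > 15):
        raise ValueError("values must fit in 4 bits")
    if v.size % 2:
        # Pad with a zero nibble so the result is a whole number of bytes.
        v = np.concatenate([v, np.zeros(1, dtype=np.uint8)])
    # Even positions go to the high nibble, odd positions to the low nibble.
    return (v[0::2] << 4) | v[1::2]


def unpack_uint4(packed, count):
    """Inverse of pack_uint4; count drops the padding nibble, if any."""
    p = np.asarray(packed, dtype=np.uint8)
    out = np.empty(p.size * 2, dtype=np.uint8)
    out[0::2] = p >> 4
    out[1::2] = p & 0x0F
    return out[:count]
```

The caller has to remember the original element count, since the packed
buffer alone cannot tell a padding nibble from a real zero.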

Then perhaps such a project could be a four-stage thing.

1. Allow for the creation of int, uint, float, bool, and complex dtypes
with an arbitrary number of bytes

2. Allow for the creation of dtypes which are integer fractions of a byte
(1, 2, or 4 bits), with the data padded to a whole byte.

3. Have an optional internal value in an array that tells it to exclude the
last n bits of the last byte.  This would be used to hide the padding from
step 2.  This should be abstracted into a general-purpose method for
excluding bits from the byte-to-dtype conversion so it can be used in step
4.

4. Allow for the creation of dtypes that are non-integer fractions of a
byte or non-integer multiples of a byte (3, 5, 6, 7, 9, 10, 11, 12, etc.
bits). Each element in the array would be stored as a certain number of
bytes, with the method from 3 used to cut it down to the right number of
bits.  So a 3 bit dtype would have two elements per byte with 2 bits
excluded. A 5 bit dtype would have one element per byte with 3 bits excluded.
A 12 bit dtype would have one element in two bytes, with 4 bits excluded
from the second byte.
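
The arithmetic behind step 4 can be sketched in a few lines. This is only an
illustration of the layout rule described above, not an existing numpy
facility; padded_layout is a hypothetical name, and it assumes that elements
of 8 bits or fewer share a single byte while wider elements each get their
own run of whole bytes.

```python
def padded_layout(bits):
    """Return (storage_bytes, elements_per_unit, excluded_bits) for a dtype
    of the given bit width, following the rule sketched in step 4."""
    if bits < 1:
        raise ValueError("bit width must be positive")
    if bits <= 8:
        # As many whole elements as fit in a single byte.
        storage_bytes = 1
        elements = 8 // bits
    else:
        # One element, rounded up to a whole number of bytes.
        storage_bytes = -(-bits // 8)  # ceiling division
        elements = 1
    excluded = storage_bytes * 8 - elements * bits
    return storage_bytes, elements, excluded


# The three examples from step 4:
assert padded_layout(3) == (1, 2, 2)    # two elements per byte, 2 bits excluded
assert padded_layout(5) == (1, 1, 3)    # one element per byte, 3 bits excluded
assert padded_layout(12) == (2, 1, 4)   # one element in two bytes, 4 bits excluded
```

Widths of 1, 2, and 4 bits come out with zero excluded bits per byte, which
is the step 2 case; there the only padding is at the end of the array.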

This approach would allow for arbitrary numbers of bits without breaking
the internal representation, would have each stage building off the
previous stage, and we would still have something useful even if not all
the stages are completed.