[Python-Dev] An ability to specify start and length of slices
Noam Raphael
noamr at myrealbox.com
Thu Jun 3 18:07:48 EDT 2004
Hello,
I often find myself wanting a slice of a specific length, rather
than a slice with a specific end. I suggest adding the syntax
object[start:>length] (or object[start:>length:jump]) alongside the
existing syntax.
Two examples:
1. Say I have a list with the number of panda bears hunted in each
month, starting from 1900. Now I want to know how many panda bears were
hunted in year y. Currently, I have to write something like this:
sum(huntedPandas[(y-1900)*12:(y-1900)*12+12])
If my suggestion is accepted, I would be able to write:
sum(huntedPandas[(y-1900)*12:>12])
2. Many data files contain fields of fixed length. As an example, say
I want to get the color of the first pixel of a 24-bit color BMP file.
Say I have a function, s2i, which takes a 4-byte string and converts it
into a 32-bit integer. The four bytes starting at byte no. 10 hold the
size of the header, in bytes. Right now, if I don't want to use
temporary variables, I have to write:
picture[s2i(picture[10:14]):s2i(picture[10:14])+4]
I think this is nicer (and quicker):
picture[s2i(picture[10:>4]):>4]
(I mean to show that when working with data files, it's common to have
slices of specific length, and that the proposed syntax makes things
clear and simple. I took BMP just as an example - I know about PIL.)
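For comparison, here is a rough sketch of how the two examples look in today's Python when a temporary variable is used. The data is made up, the buffer is a synthetic byte string rather than a real BMP file, and s2i is written here as a hypothetical stand-in based on struct.unpack (little-endian byte order is an assumption).

```python
import struct


def s2i(four_bytes):
    # Hypothetical stand-in for the s2i helper mentioned above:
    # interpret 4 bytes as a little-endian unsigned 32-bit integer.
    return struct.unpack("<I", four_bytes)[0]


# Example 1: monthly counts starting in January 1900 (made-up data).
hunted_pandas = list(range(36))  # three years: 1900, 1901, 1902
y = 1901
start = (y - 1900) * 12
# With the proposed syntax this would read hunted_pandas[start:>12].
total = sum(hunted_pandas[start:start + 12])

# Example 2: a synthetic buffer (NOT a real BMP file). Bytes 10..13 hold
# the offset at which the data of interest begins.
header_size = 14
picture = bytes(10) + struct.pack("<I", header_size) + b"\x01\x02\x03\x04"
offset = s2i(picture[10:14])
# With the proposed syntax this would read picture[offset:>4].
first_field = picture[offset:offset + 4]
```

The temporary variables (start, offset) are exactly what the proposed syntax is meant to make unnecessary.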
Other solutions (from comp.lang.python responses):
1. Of course, the longer form may be used, and a temporary variable may
be used to avoid repeated function calls.
2. The idiom object[start:][:length] may be used. However, it can be
very inefficient if the list is long, because object[start:] first
copies everything from start to the end of the list. Another advantage
of the proposed syntax is that it can be used in multi-dimensional
slices (for example, ar[:,x:>3,:])
3. The programmer may define the function lambda object, start, length:
object[start:start+length]. This does make expressions quite short, but
it isn't very readable IMHO, and doesn't deal with multi-dimensional slices.
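A variation on the helper-function idea, sketched under assumptions: instead of returning the sliced object, the helper returns a slice object, which can then be placed inside a multi-dimensional subscript. The names span and Recorder are hypothetical and exist only for illustration. The helper is only meant for a non-negative start and length and does not handle the negative cases discussed later in this message.

```python
def span(start, length, step=None):
    """Return a slice covering `length` items beginning at `start`.

    Hypothetical helper; assumes start >= 0 and length >= 0.
    """
    return slice(start, start + length, step)


class Recorder:
    # Toy container that simply returns whatever key it is indexed with,
    # so the shape of a multi-dimensional subscript can be inspected.
    def __getitem__(self, key):
        return key


data = list(range(100))
by_helper = data[span(40, 5)]    # same items as data[40:45]
by_chaining = data[40:][:5]      # copies the tail first, then trims it

# A slice object can sit inside a tuple subscript, like ar[:, x:x+3, :].
key = Recorder()[:, span(2, 3), :]
```

This is more verbose than the proposed ':>' form, so it does not answer the readability point, only the multi-dimensional one.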
Objections (also from comp.lang.python):
1. There should be only one way to do something in Python.
2. Some don't like how it looks.
3. l[a:b] yields an empty list when a>b, and l[a:>b] doesn't.
My responses:
1. Changes should be taken seriously, and the language must be kept
simple and easy to read, but that doesn't mean that there should be only
one way to do something. Just an example: you could write l[:,:,:,3],
but the ellipsis token lets you write l[...,3].
2. I can't really argue with that, besides saying that it looks fine to
me. The symbol '>' generally means "move to the right". I think that
l[12345:>10] can easily be read as "start from 12345, and move 10 steps
to the right. Take all the items you passed over."
3. l[a:>b] doesn't look like l[a:b] and it means something altogether
different. Besides, l[a:b:-1] doesn't yield an empty list when a > b.
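The existing behaviour referred to in objection 3 and in the response to it can be checked directly. This is a small illustration with made-up values:

```python
l = list(range(10))
a, b = 7, 3

forward = l[a:b]        # a > b with the default step: nothing is selected
backward = l[a:b:-1]    # a > b with a negative step: walks from 7 down to 4
```

Here forward is the empty list, while backward contains the items at indices 7, 6, 5 and 4.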
Some technical details:
My proposal only affects the conversion from Python code into byte-code.
This is why it is easy to implement and has no side effects, as far as I
can see.
I changed the definition of "subscript" in the Grammar file from:
subscript: '.' '.' '.' | test | [test] ':' [test] [sliceop]
into:
subscript: '.' '.' '.' | test | ([test] ':' [test] | test ':>' test) [sliceop]
and added the ':>' token to tokenizer.c and token.h.
I then extended compile.c to handle the new syntax.
The byte code produced is basically simple: Calculate start, calculate
length, and add start to length to get the usual start, end. It gets a
bit complicated because you want range(10)[3:>-5], for example, to yield
an empty list, and using the method described, it will be equivalent to
range(10)[3:-2], that is, to [3,4,5,6,7]. So the byte-code my
implementation produces checks to see if the resulting end is negative
and start is positive, and if so, puts -sys.maxint, instead of
start+length, as end. -sys.maxint is used instead of the more obvious
choice, 0, so that range(10)[3:>-5:-1] will yield [3,2,1,0] and not [3,2,1].
This can be optimized, because I expect that usually length will be an
integer given explicitly in the Python code, in which case no testing
has to be done in the byte-code.
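The end calculation described above can be sketched in plain Python. This is not the attached patch (which works at the byte-code level); it is an illustration with a hypothetical function name. It uses sys.maxsize because sys.maxint exists only in Python 2, and it reads "start is positive" as "start is not negative", which is an assumption.

```python
import sys


def start_length_to_slice(start, length, step=None):
    """Translate (start, length) into an ordinary slice object.

    Hypothetical illustration of the rule described above.
    """
    end = start + length
    # If start counts from the front but the computed end went below zero,
    # a plain slice would treat the end as "counted from the back".
    # Substitute a hugely negative end so it clamps to the front instead.
    if start >= 0 and end < 0:
        end = -sys.maxsize
    return slice(start, end, step)


items = list(range(10))
normal = items[start_length_to_slice(3, 4)]            # like items[3:7]
clamped = items[start_length_to_slice(3, -5)]          # end would be -2
reversed_ = items[start_length_to_slice(3, -5, -1)]    # walks back to index 0
naive_zero = items[3:0:-1]                             # what end=0 would give
```

With these values, clamped is empty and reversed_ reaches index 0, whereas using 0 as the substitute end (naive_zero) stops one item short.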
Attached are the 4 diffs. I'm sorry, they are against the Python-2.3.3
release (the sourceforge CVS doesn't work for me currently), but I
expect them to work fine with the CVS head.
To summarize, this is a small addition, with no side effects or
backward-compatibility issues, which will help me and others.
Well, what do you think? I would like to hear your comments.
Best wishes,
Noam Raphael
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Grammar.diff
Type: text/x-patch
Size: 808 bytes
Desc: not available
Url : http://mail.python.org/pipermail/python-dev/attachments/20040604/cf70e9a5/Grammar.bin
-------------- next part --------------
A non-text attachment was scrubbed...
Name: compile.c.diff
Type: text/x-patch
Size: 2278 bytes
Desc: not available
Url : http://mail.python.org/pipermail/python-dev/attachments/20040604/cf70e9a5/compile.c.bin
-------------- next part --------------
A non-text attachment was scrubbed...
Name: token.h.diff
Type: text/x-patch
Size: 767 bytes
Desc: not available
Url : http://mail.python.org/pipermail/python-dev/attachments/20040604/cf70e9a5/token.h.bin
-------------- next part --------------
A non-text attachment was scrubbed...
Name: tokenizer.c.diff
Type: text/x-patch
Size: 550 bytes
Desc: not available
Url : http://mail.python.org/pipermail/python-dev/attachments/20040604/cf70e9a5/tokenizer.c.bin