how to duplicate array entries
Steven D'Aprano
steve at REMOVE-THIS-cybersource.com.au
Mon Jan 11 16:58:00 EST 2010
On Mon, 11 Jan 2010 10:03:04 +0100, Alf P. Steinbach wrote:
>>> A Python "list" is an array functionality-wise.
>>>
>>> If one isn't observant of that fact then one ends up with O(n^2) time
>>> for the simplest things.
>>
>> Well that's certainly not true. Some operations may be O(N**2), but
>> others are not: list.append() is amortized O(N) and for individual
>> appends, may be can be as fast as O(1).
>
> The second sentence may or may not be true. I don't know of any
> fundamental 'list' operations that have quadratic time. Is there?
I should hope not, but you seem to indicate that there are, or could be
-- you were the one who made the claim of O(N**2).
If you're talking about functions someone might write themselves, well
there is no limit on the awfulness of that. No doubt you're aware of
Bogosort, which has performance O(N!) instead of O(N*log N). Or the
(sadly very common) "Polish Painter Algorithm".
With a bit of work, one might even come up with a sometimes-O(N**2)
algorithm based on dicts:
def increment(d, key):
"""Add one to the value of dict d[key]"""
for k in d.keys():
value = d.pop(key)
if k == key:
value = value+1
d[key] = value
If all the keys hash to the same values, CPython dict lookup degenerates
to O(N) due to the key collisions. If you then remove each key from the
internal linked list, and re-add it to the end, then you might end up
with something as bad as O(N**2).
Of course the above is a stupidly bad function, but it's not as bad as
some things I've seen in real life.
> The first sentence is just baffling -- what on Earth is the "that"
> that you think is not true?
>
> OK, I can guess (correct me if I'm guessing wrong, please): you think
> I'm talking about elementary operations. I'm not. I'm talking about
> algorithmic complexity for loops doing e.g. insertions.
I'm sorry that I baffled you, but I was responding to your implication
that failure to realise that Python lists are implemented as arrays will
likely lead to people inadvertently writing O(N**2) algorithms when they
need not.
Of course you are correct in the sense that there is no limit to the
awfulness to code that can be written by somebody sufficiently
incompetent, ignorant, inexperienced or unlucky. But as far as typical
behaviour goes, I do not believe that the risks of writing such O(N**2)
algorithms is any higher than it otherwise would be if they were called
arrays. In fact, calling them "arrays" might give some people the false
impression that every single append requires an expensive resize
operation, which is not the case.
I will give you one thing: if you assume that Python lists are linked
lists, then you might expect list.insert(0, x) to be O(1) and list[i] to
be O(N). But in fact they are O(N) and O(1) respectively. Since indexing
is typically far more common than insertion, this is a good thing.
But since Python lists are quite rich, and already provide methods for
things like reversal and sorting, I just don't see that the risk of
writing poorly performing functions is any greater because they are
called "lists" rather than "arrays". Particularly if you treat them as an
unspecified implementation of abstract lists, rather than assuming they
are specifically linked-lists.
>>> Using the term "array" accentuates and clarifies this most important
>>> aspect.
>>
>> But Python lists are designed to behave as lists.
>
> No, I'm sorry, they're not.
>
> A Python 'list' has de facto constant time indexing, or "random access".
>
> A linked list -- what the informal "list" means in programming --
> does not have constant time indexing.
A linked list is a specific implementation of an even more general
abstract data type, the list. This is why people specify "linked list"
rather than "list". Abstract types are never defined in terms of
performance, because of course you can't say what the performance of an
abstract thing is, only of functional behaviour.
Both arrays and linked lists (as well as hash tables, as in Lua, or
trees) can be used to implement an abstract list type, and they have
different performance characteristics. A list in this sense is defined by
behaviour (fundamental operations provided), not performance
characteristics.
[...]
>> Just because CPython
>> implements them using arrays doesn't make them arrays. Other Python
>> implementations might use other implementations...
>
> No, I'm sorry, not without screwing up existing Python programs.
That's the choice that the implementer makes when doing something,
anything, different from CPython.
> Indexing is assumed as constant time in every CPython program.
*Every* program? Really? I've written many programs that don't assume
anything about indexing except that it is "fast enough".
In the same way people are often indifferent between hash tables with
constant-time lookup and binary trees with O(log N) lookup.
> That is,
> in your own words, but here correct, that's "certainly not true". ;-)
>
> No (sensible) programming language allows a language implementation to
> change the running time of common loops from linear to quadratic.
Unless the Python reference manually specifically guarantees such
performance characteristics, it is entirely an implementation choice
whether to optimize for insertions or indexing or something else all
together.
Of course in practice CPython is the proverbial 300lb gorilla in the
room, and people's expectations will be shaped by it, so many
implementers will likely try to match CPython's performance
characteristics.
> It would be decidedly un-pythonic. ;-)
>
>
>> If the creator of CLPython is out there, perhaps might like to comment
>> on whether he uses Lisp linked-lists for the Python list type?
>
> If he does then you're talking about a different language than the one
> that CPython implements: constant time indexing is very different from
> linear time. It doesn't matter if some bananas are called oranges. They
> don't turn into oranges no matter what they're called.
But both bananas and oranges are fruit, and if the menu says "Fruit of
the Day", the chef is free to choose between oranges or bananas. Unless
the menu promises that the fruit will be round, juicy, made up of
segments, rich in Vitamin C and low in potassium, there's nothing wrong
with supplying bananas.
Likewise, both linked-lists and arrays are suitable for implementing
abstract lists, and unless the language reference promises CPython's
performance characteristics the implementer is free to vary them.
[...]
>>>>>>> list(itertools.chain(*([x]*3 for x in [1,2,3])))
>>>> [1, 1, 1, 2, 2, 2, 3, 3, 3]
>>> And I think it's worth noting that, for the general case, this
>>> notation is also hiding a gross inefficiency, first constructing
>>> sub-arrays and then copying them and joining them.
>>
>> I wouldn't call that a gross inefficiency -- that's a gross
>> exaggeration unless count is very large, and even then, possibly not
>> that large. Only one such sub-array (sub-list) exists at any time.
>> (Unless I am grossly misinformed.)
>
> I'm sorry but to the best of my knowledge you're misinformed.
>
> Unless there's some pretty advanced lazy evaluation involved the *
> operator has to collect the subarrays into an array formal argument for
> the 'chain' routine.
>
> And at that point they all exist at the same time.
I think you are correct, to be honest I didn't notice the * operator and
was thinking that chain was being called on an iterator argument, not
collected into a tuple. Oh well, that's the downside of using operators
instead of named functions :/
--
Steven
More information about the Python-list
mailing list