how to duplicate array entries

Steven D'Aprano steve at REMOVE-THIS-cybersource.com.au
Mon Jan 11 16:58:00 EST 2010


On Mon, 11 Jan 2010 10:03:04 +0100, Alf P. Steinbach wrote:

>>> A Python "list" is an array functionality-wise.
>>>
>>> If one isn't observant of that fact then one ends up with O(n^2) time
>>> for the simplest things.
>> 
>> Well that's certainly not true. Some operations may be O(N**2), but
>> others are not: list.append() is amortized O(1), so appending N items
>> one at a time is only O(N).
> 
> The second sentence may or may not be true. I don't know of any
> fundamental 'list' operations that have quadratic time. Is there?

I should hope not, but you seem to indicate that there are, or could be 
-- you were the one who made the claim of O(N**2).

If you're talking about functions someone might write themselves, well 
there is no limit on the awfulness of that. No doubt you're aware of 
Bogosort, which has average performance O(N*N!) instead of O(N*log N). 
Or the (sadly very common) "Polish Painter Algorithm". 

With a bit of work, one might even come up with a sometimes-O(N**2) 
algorithm based on dicts:


def increment(d, key):
    """Add one to the value of dict d[key] -- deliberately badly."""
    # Iterate over a snapshot of the keys so the dict can safely be
    # mutated inside the loop.
    for k in list(d.keys()):
        # Pointlessly pop and re-insert the target key on every pass;
        # each pop and each re-insert is a fresh lookup.
        value = d.pop(key)
        if k == key:
            value = value + 1
        d[key] = value

If all the keys hash to the same value, CPython dict lookups degenerate 
to O(N) because of the collisions. Since the loop pops the target key 
and re-inserts it on every pass, each iteration does O(N) work, and the 
whole function can end up as bad as O(N**2).
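
To make "all the keys hash to the same value" concrete, here is a sketch 
of a key class whose instances deliberately collide (the class and its 
name are mine, purely for illustration):

class BadKey:
    """Illustration only: every instance hashes to the same value, so a
    dict full of BadKeys degenerates into one long collision chain."""
    def __init__(self, name):
        self.name = name
    def __hash__(self):
        return 42                 # constant hash: maximal collisions
    def __eq__(self, other):
        return isinstance(other, BadKey) and self.name == other.name

d = dict((BadKey(i), 0) for i in range(1000))  # each insert probes past the rest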

Of course increment() above is a stupidly bad function, but it's not as 
bad as some things I've seen in real life.
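
For contrast, the obvious way to write it is a single dict operation, 
amortized O(1) per call when the keys hash sensibly (a minimal sketch):

def increment(d, key):
    """Add one to d[key], treating a missing key as 0."""
    d[key] = d.get(key, 0) + 1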


> The first sentence is just baffling  --  what on Earth is the "that"
> that you think is not true?
> 
> OK, I can guess (correct me if I'm guessing wrong, please): you think
> I'm talking about elementary operations. I'm not. I'm talking about
> algorithmic complexity for loops doing e.g. insertions.

I'm sorry that I baffled you, but I was responding to your implication 
that failure to realise that Python lists are implemented as arrays will 
likely lead to people inadvertently writing O(N**2) algorithms when they 
need not.

Of course you are correct in the sense that there is no limit to the 
awfulness of code that can be written by somebody sufficiently 
incompetent, ignorant, inexperienced or unlucky. But as far as typical 
behaviour goes, I do not believe that the risk of writing such O(N**2) 
algorithms is any higher than it otherwise would be if they were called 
arrays. In fact, calling them "arrays" might give some people the false 
impression that every single append requires an expensive resize 
operation, which is not the case.

I will give you one thing: if you assume that Python lists are linked 
lists, then you might expect list.insert(0, x) to be O(1) and list[i] to 
be O(N). But in fact they are O(N) and O(1) respectively. Since indexing 
is typically far more common than insertion, this is a good thing.
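
And for the cases where front-insertion really is the common operation, 
the standard library already provides a better fit. A small sketch using 
collections.deque, which gives O(1) appends at either end (indexing into 
the middle of a deque is O(N), though):

from collections import deque

items = deque()
for i in range(5):
    items.appendleft(i)      # O(1), unlike list.insert(0, i) which is O(N)

print(items)                 # deque([4, 3, 2, 1, 0])
print(items[0], items[-1])   # indexing either end is O(1)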

But since Python lists are quite rich, and already provide methods for 
things like reversal and sorting, I just don't see that the risk of 
writing poorly performing functions is any greater because they are 
called "lists" rather than "arrays". Particularly if you treat them as an 
unspecified implementation of abstract lists, rather than assuming they 
are specifically linked-lists.
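
To illustrate: the supplied methods already do the right thing regardless 
of the implementation, whereas a hand-rolled loop that quietly assumes 
cheap front-insertion is exactly where the quadratic behaviour creeps in 
(a sketch; timings left to the reader):

def reversed_badly(items):
    """O(N**2) on an array-backed list: every insert(0, ...) has to
    shift everything already in the result."""
    result = []
    for item in items:
        result.insert(0, item)
    return result

def reversed_sensibly(items):
    """O(N), with no knowledge of the implementation required."""
    result = list(items)
    result.reverse()
    return result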



>>> Using the term "array" accentuates and clarifies this most important
>>> aspect.
>> 
>> But Python lists are designed to behave as lists.
> 
> No, I'm sorry, they're not.
> 
> A Python 'list' has de facto constant time indexing, or "random access".
> 
> A linked list  --  what the informal "list" means in programming  -- 
> does not have constant time indexing.

A linked list is a specific implementation of an even more general 
abstract data type, the list. This is why people specify "linked list" 
rather than just "list". Abstract types are never defined in terms of 
performance, because you can't specify the performance of an abstract 
thing, only its functional behaviour.

Both arrays and linked lists (as well as hash tables, as in Lua, or 
trees) can be used to implement an abstract list type, and they have 
different performance characteristics. A list in this sense is defined by 
behaviour (fundamental operations provided), not performance 
characteristics.
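
As a toy illustration of behaviour-defined lists, here is a sketch of a 
list type backed by a dict, in the spirit of Lua tables (the class and 
its name are mine, not anything in the standard library). It behaves 
like a list while having very different performance characteristics:

from collections.abc import MutableSequence

class DictList(MutableSequence):
    """A behavioural list backed by a dict, much as Lua stores sequences.

    It offers the same fundamental operations as a built-in list, but
    with very different performance characteristics.
    """

    def __init__(self, iterable=()):
        self._items = {}                    # maps index -> value
        for item in iterable:
            self.append(item)

    def __len__(self):
        return len(self._items)

    def _fix_index(self, index):
        # Translate negative indexes and reject out-of-range ones.
        n = len(self._items)
        if index < 0:
            index += n
        if not 0 <= index < n:
            raise IndexError("list index out of range")
        return index

    def __getitem__(self, index):
        return self._items[self._fix_index(index)]

    def __setitem__(self, index, value):
        self._items[self._fix_index(index)] = value

    def __delitem__(self, index):
        index = self._fix_index(index)
        n = len(self._items)
        # Shift everything after the deleted slot down by one.
        for i in range(index, n - 1):
            self._items[i] = self._items[i + 1]
        del self._items[n - 1]

    def insert(self, index, value):
        n = len(self._items)
        if index < 0:
            index += n
        index = max(0, min(index, n))       # clamp, as list.insert does
        # Shift everything from index upwards by one slot.
        for i in range(n, index, -1):
            self._items[i] = self._items[i - 1]
        self._items[index] = value


faux = DictList([1, 2, 3])
faux.append(4)                              # mixin methods come for free
faux.reverse()
print(list(faux))                           # [4, 3, 2, 1]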


[...]
>> Just because CPython
>> implements them using arrays doesn't make them arrays. Other Python
>> implementations might use other implementations...
> 
> No, I'm sorry, not without screwing up existing Python programs.

That's the choice that the implementer makes when doing something, 
anything, different from CPython.


> Indexing is assumed as constant time in every CPython program. 

*Every* program? Really? I've written many programs that don't assume 
anything about indexing except that it is "fast enough".

In the same way people are often indifferent between hash tables with 
constant-time lookup and binary trees with O(log N) lookup.



> That is,
> in your own words, but here correct, that's "certainly not true". ;-)
> 
> No (sensible) programming language allows a language implementation to
> change the running time of common loops from linear to quadratic.

Unless the Python reference manual specifically guarantees such 
performance characteristics, it is entirely an implementation choice 
whether to optimize for insertions or indexing or something else 
altogether.

Of course in practice CPython is the proverbial 300lb gorilla in the 
room, and people's expectations will be shaped by it, so many 
implementers will likely try to match CPython's performance 
characteristics. 



> It would be decidedly un-pythonic. ;-)
> 
> 
>> If the creator of CLPython is out there, perhaps might like to comment
>> on whether he uses Lisp linked-lists for the Python list type?
> 
> If he does then you're talking about a different language than the one
> that CPython implements: constant time indexing is very different from
> linear time. It doesn't matter if some bananas are called oranges. They
> don't turn into oranges no matter what they're called.

But both bananas and oranges are fruit, and if the menu says "Fruit of 
the Day", the chef is free to choose between oranges or bananas. Unless 
the menu promises that the fruit will be round, juicy, made up of 
segments, rich in Vitamin C and low in potassium, there's nothing wrong 
with supplying bananas.

Likewise, both linked-lists and arrays are suitable for implementing 
abstract lists, and unless the language reference promises CPython's 
performance characteristics the implementer is free to vary them.



[...]
>>>>>>> list(itertools.chain(*([x]*3 for x in [1,2,3])))
>>>> [1, 1, 1, 2, 2, 2, 3, 3, 3]
>>> And I think it's worth noting that, for the general case, this
>>> notation is also hiding a gross inefficiency, first constructing
>>> sub-arrays and then copying them and joining them.
>> 
>> I wouldn't call that a gross inefficiency -- that's a gross
>> exaggeration unless count is very large, and even then, possibly not
>> that large. Only one such sub-array (sub-list) exists at any time.
>> (Unless I am grossly misinformed.)
> 
> I'm sorry but to the best of my knowledge you're misinformed.
> 
> Unless there's some pretty advanced lazy evaluation involved the *
> operator has to collect the subarrays into an array formal argument for
> the 'chain' routine.
> 
> And at that point they all exist at the same time.

I think you are correct. To be honest, I didn't notice the * operator and 
was thinking that chain was being called with an iterator argument, 
rather than the sub-lists being collected into a tuple of arguments. Oh 
well, that's the downside of using operators instead of named 
functions :/
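
For what it's worth, itertools does offer a named, lazy alternative that 
avoids unpacking all the sub-lists into an argument tuple up front (a 
sketch; the output matches the example quoted above):

import itertools

# chain.from_iterable pulls sub-lists from the generator one at a time,
# so only one [x]*3 sub-list needs to exist at any given moment.
flat = list(itertools.chain.from_iterable([x] * 3 for x in [1, 2, 3]))
print(flat)     # [1, 1, 1, 2, 2, 2, 3, 3, 3]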




-- 
Steven


