Ordering python sets
Lie Ryan
lie.1296 at gmail.com
Tue Nov 4 15:40:52 EST 2008
On Sun, 02 Nov 2008 02:08:37 +0000, Steven D'Aprano wrote:
> On Sat, 01 Nov 2008 18:58:59 +0000, Tim Rowe wrote:
>
>> 2008/10/27 <bearophileHUGS at lycos.com>:
>>> Lie Ryan:
>>>
>>>>Oh no, the two dict implementation would work _exactly_ the same from
>>>>the outside, they are transparently interchangeable. Only the
>>>>performance characteristic differs because of the different
>>>>implementation.<
>>>
>>> I don't agree with the general idea. If the operations done by your
>>> data structure have different computational complexity, then they are
>>> fit for different usages. When you program you must know what
>>> computational complexity each of the operations of your data
>>> structure has, otherwise there's no way to know the complexity of your
>>> whole program, so instead of programming you just become a mage
>>> who tries magical spells and hopes for the best.
>>
>> No, when you program you know how you will be using the data structure,
>> so you can choose the implementation that's right for the application.
>
> I don't understand why you have prefixed that sentence with "No". It
> seems like you are agreeing with bearophile's point.
On a linguistic note: as a non-native speaker of English, I've never
relied on the literal Yes or No and would instead rely on the text that
follows. In some languages, situations where English speakers would
answer Yes are answered with No, and vice versa.
>> That's what the container libraries for other languages do. At the
>> moment, you just have one implementation, and have to hope it's right
>> for the job.
>
> To the extent that this is true, it is a sign of the poverty of the
> standard Python library. But in fact it isn't true. Python has quite a
> decent collection of collection types. Just off the top of my head:
>
> tuple, namedtuple, list, set, frozenset, dict, defaultdict, queue, deque
>
> set, list and dict are highly optimized for general use, and are very
> suitable for building new data structures. And you are wrong to say that
> we have to "hope it's right for the job", we can build perfectly
> acceptable pure-Python data structures from these building blocks, or
> add our own C extensions. You're never forced to use a standard library
> data structure, it's a trade-off of your time and effort against runtime
> efficiency. And because the standard library types are generally so
> excellent, that trade-off is often (but not always) a no-brainer.
There are cases where the algorithm you use triggers the standard
library data structure's worst-case scenario.
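A concrete (and hedged) illustration of that point, using only the standard library: draining a list from the front hits the array-list's worst case, while deque, a different implementation of the same abstract queue, avoids it entirely.

```python
from collections import deque
from timeit import timeit

def drain_list(n):
    # Using a list as a FIFO queue hits list's worst case:
    # pop(0) is O(n), because every remaining element shifts left.
    q = list(range(n))
    while q:
        q.pop(0)

def drain_deque(n):
    # deque implements the same abstract queue, but popleft() is O(1).
    q = deque(range(n))
    while q:
        q.popleft()

t_list = timeit(lambda: drain_list(10_000), number=5)
t_deque = timeit(lambda: drain_deque(10_000), number=5)
print(f"list:  {t_list:.3f}s")
print(f"deque: {t_deque:.3f}s")
```

On any recent CPython the deque version is dramatically faster, even though both loops are functionally identical from the caller's point of view.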
>> Adding an *optional* parameter that says, in effect, "I want this list
>> optimised for writes and reads at both ends" or "I want this list
>> optimised for fully random reads but writes only at the end" doesn't
>> *lose* you any information about what you're programming with.
>
> It might not lose information, but it does obfuscate it.
>
> Right now, we can do this:
<snip>
I've rebutted this same argument multiple times. How often do you think
you'd need to check an object's implementation? Compare that to how
often we need to ensure that the object we're handling is a list
replacement.
> That's assuming the implementation is available at runtime at all; if
> it's not, then you have lost information.
The implementation property is always available, because it's a non-
optional argument when registering the data structure.
>> Of course it's essential that the data structure has identical
>> /functional/ behaviour whatever optimisation you use.
>
> "Essential"? That's an interesting point. Let's look at an actual
> example.
>
> Consider one of Lie's initial examples (paraphrased):
>
> D1 = dict({1: 'a', 2: 'b'})  # standard hash-table
> D2 = dict({1: 'a', 2: 'b'}, implementation="binary tree")
>
> You say it's "essential" that the binary tree implementation has
> identical functional behaviour to standard hash-table implementation of
> dict. Okay, let's see where that leads us. The functional behaviour of
> hash-tables is that they are O(1) for reads, while binary trees are
> O(log(N)). The only way for them to have identical functional behaviour
> is to *artificially* slow down dicts to the lowest common speed. That is
> a Bad Thing, and I don't understand why you think it is essential.
I can't see why you consider a function's performance characteristic to
be its behavior; a "characteristic" of something is a property of it,
not its behavior.
> we no longer have to
> artificially change the behaviour of the type -- but we do have to
> artificially cripple the API of the type!
I don't get why you think we've got to restrict it to the lowest common
denominator. We expect people who set a field called "implementation"
to know enough about the limitations of the implementation they chose.
Regular Python programmers today already have to be aware that the
(hashed) dict's limitation is that its keys must be hashable. A true
dict/mapping type doesn't have that limitation, so we already have such
"artificial" limitations right now. The way it is now gives the
impression that the hashed dict's limitations are the limitations of
dict/mapping in general. Instead, it should be: dict is a mapping type
from any object to any object, but the default hashed dict's limitation
is that its keys must be hashable.
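That limitation is easy to demonstrate with today's built-in dict; a tree-based mapping keyed by comparison rather than by hash would not need to reject such keys:

```python
# The built-in (hashed) dict requires keys to be hashable.
# This is a limitation of the implementation, not of the
# abstract mapping type.
d = {}
d[(1, 2)] = "ok"        # tuples are hashable: accepted

caught = False
try:
    d[[1, 2]] = "boom"  # lists are unhashable: rejected
except TypeError:
    caught = True

print(d[(1, 2)], caught)
```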
Aside: also, with my way, we could classify a "feature" into three
statuses: limitation, exact replacement, and extension. A limitation is
a minor point the implementation simply couldn't fulfill (not
fulfilling major points would fail registration), exact replacement
covers the regular things, and an extension is a feature that, if used,
might make changing implementations harder, so it should only be used
when that implementation's extension is the natural way to write the
algorithm.
> But maybe you'll argue that Lie's examples are bad examples.
Yes, I agree it is a bad example for this issue. When I wrote it, what
I had in mind was to demonstrate how it'd look in a simple way. Way
too simple, it seems.
> Perhaps
> what you have in mind is something like the difference between "dicts
> implemented as hash tables with chaining" and "dicts implemented as hash
> tables with open addressing". (Aside: what sort of open addressing?
> Linear? Quadratic? Something else?) At least now we're talking about
> different implementations with the same API and more-or-less the same
> functional behaviour.
Not quite. I'm looking for more radical differences. This is more
apparent in array-list vs bintree-list. One is better for straight
traversal and random access, while the other is better for searching
and inserting.
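The trade-off can be sketched with the stdlib bisect module: keeping an array-list sorted gives O(log n) search, but every insert still pays O(n) for shifting, whereas a bintree-list would make both O(log n) at the cost of O(1) indexing.

```python
import bisect

# Array-backed list: O(1) random access, O(n) mid-list insertion.
# bisect finds the insertion point in O(log n), but insort still
# shifts elements; a bintree-list would insert in O(log n) too.
data = []
for x in [50, 20, 80, 10, 30]:
    bisect.insort(data, x)

print(data)                        # stays sorted
print(data[2])                     # O(1) random access
i = bisect.bisect_left(data, 30)   # O(log n) search
print(i, data[i] == 30)
```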
> I'd suggest that 99% of the time, client code simply shouldn't care
> about the differences. The existing implementations are Good Enough.
> Have you *ever* found yourself saying "I can't use the built-in dict,
> because it uses chaining and for my application I need linear open
> addressing"? (Or vice versa.)
Yes, and that's why we have a reasonable default.
> I doubt you've ever said that. I doubt you know anyone who has said it.
> I doubt you know *of* anyone who has said it. And even if you have, then
> you are free to go and write your own dict type as a C extension type
> and call it dict_with_linear_openaddressing and write this:
>
> D = dict_with_linear_openaddressing(data)
>
> instead of:
>
> D = dict(data, implementation = "linear open addressing")
>
> Yet again Lie's proposal gains you nothing. In fact, as an application
> writer, you're better off writing your own type rather than waiting for
> it to be added to the built-in implementation of dict.
There are times when it'd be too much work to implement, test, and
integrate it. There are times when we know that an alternative
implementation could make our program faster, but we simply lack the
time to implement and test it. I was talking about rapid application
development, or even its wilder cousin, the quick one-time disposable
script. (I'm not implying that only rapid programming benefits from
this; it is only one example.)
Plus, currently it is impossible, without extensive reading of the
whole docs, to search for other data structures that implement the same
abstract type as list/dict/anything. Their documentation is scattered
here and there. It would be helpful if help(list) could list what
implementations are available (no pun intended).
<snip>
> Here's an even better way:
>
> class StandardFoo(object):
>     def __init__(self, data):
>         self.data = foo(data)
>     def __len__(self):
>         return len(self.data)
>
> class MagicFoo(object):
>     def __init__(self, data):
>         self.data = bar(data)
>     def __len__(self):
>         return len(self.data[0]) + len(self.data[1:])
>
> class GreenFoo(object):
>     def __init__(self, data):
>         self.data = baz(data)
>         self._len = len(self.data)
>     def __len__(self):
>         return self._len
>
>
> and let the client code call whichever implementation they want.
You missed the last part: integrating the implementation into current
code. That's what makes the difference.
And I wasn't talking about any of those you mentioned.
It's more like this:
class StandardFoo(object):
    def __init__(self, data):
        self.data = foo(data)
    def __len__(self):
        return len(self.data)

class MagicFoo(object):
    def __init__(self, data):
        self.data = bar(data)
    def __len__(self):
        return len(self.data[0]) + len(self.data[1:])

class GreenFoo(object):
    def __init__(self, data):
        self.data = baz(data)
        self._len = len(self.data)
    def __len__(self):
        return self._len

class Foo(AbstractType): pass

Foo.register({'standard': StandardFoo, 'green': GreenFoo,
              'magic': MagicFoo})
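To make the idea concrete, here is a minimal, runnable sketch of what AbstractType.register and the implementation keyword might look like (all names here are hypothetical, chosen to match the example above; real class bodies are stubbed out):

```python
class AbstractType:
    # Maps implementation names to concrete classes.
    _implementations = {}

    @classmethod
    def register(cls, mapping):
        cls._implementations.update(mapping)

    def __new__(cls, data, implementation="standard"):
        # Dispatch construction to the chosen implementation.
        impl = cls._implementations[implementation]
        return impl(data)

class StandardFoo:
    def __init__(self, data):
        self.data = list(data)
    def __len__(self):
        return len(self.data)

class MagicFoo:
    def __init__(self, data):
        self.data = list(data)
    def __len__(self):
        return len(self.data)

class Foo(AbstractType):
    pass

Foo.register({"standard": StandardFoo, "magic": MagicFoo})

f1 = Foo([1, 2, 3])                          # reasonable default
f2 = Foo([1, 2, 3], implementation="magic")  # explicit choice
print(type(f1).__name__, len(f1))
print(type(f2).__name__, len(f2))
```

Client code keeps writing Foo(...) everywhere; only the construction site ever mentions the implementation name.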
> [snip]
>> They're not different data structures from the client point of view.
>
> But of course they are. They have different functional behaviour, and
> different APIs.
> Unless we artificially cripple the implementations
Then please state the implementation's limitations in the docs. People
who know enough to choose a non-standard implementation should be
knowledgeable enough about the limitations and extensions of the
implementation they chose.
> and
> make them all identical (which defeats the purpose of having different
> implementations!) they are different data structures, and those
> differences will be visible to the client code.
Yes, they're different data structures, but the same abstract type.
> This scheme is a case of over-generalisation. It's also impractical: who
> is going to spend the time and effort to do it? It's hard enough to get
> somebody to write a binary tree implementation for the standard library,
> without expecting them to *also* write a wrapper for it to make it look
> like a dict.
It's hard to write because everyone knows no one is going to use it,
since the implementation would be buried under sixty feet of soil and
sand somewhere in the dark corners of the docs. And when you say "it's
hard to get somebody to write it", do you realize how many people have,
in desperation, written their own implementation of something because
they couldn't find it in the standard library? And I think the
'register' way would make it unnecessary to write wrappers: as long as
you've provided a compatible data structure, the generic register
function would handle the rest.
> It's working against the grain of Python, which is well
> suited for the one-type = one-API model instead of the one-type = many-
> APIs model which Lie is suggesting. And for what benefit? The only
> benefit suggested so far is that we can write:
>
> dict({1: 'a', 2: 'b'}, implementation="binary tree")
>
> instead of
>
> binarytree({1: 'a', 2: 'b'})
>
> If you call that a benefit. I don't.
A real benefit is apparent if we have this kind of snippet in our code
(again, in my habit of oversimplifying: f is handling too many things
for its own good, isinstance is usually evil, and there is a strong
smell of bad design):

def f(listofiterable, somelist):
    for lst in listofiterable:
        if isinstance(lst, list):
            lst.extend(somelist)
        elif isinstance(lst, (set, dict)):
            lst.update(somelist)
        else:
            for n in somelist:
                lst.append(n)
If you have that kind of code, with many copies scattered around, then
when you want to change the data structure you've got to hunt down all
those isinstance calls and change them to the new data structure. The
other solution, shadowing 'list' (i.e. the previous name), doesn't work
if we also want to use the original data structure in other, unrelated
places. With implementation, only the object creation would need to be
searched, and isinstance(something, list) would pass it as an
array-list replacement.
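For comparison, a duck-typed rewrite of that function (a hypothetical sketch, not a claim about any particular library) avoids the isinstance hunt entirely by probing the richest API first and falling back:

```python
def f(listofiterable, somelist):
    # Dispatch on capability, not on concrete class, so a registered
    # list replacement passes without any isinstance edits.
    for lst in listofiterable:
        if hasattr(lst, "extend"):      # list-like
            lst.extend(somelist)
        elif hasattr(lst, "update"):    # set-like
            lst.update(somelist)
        else:                           # bare mutable sequence
            for n in somelist:
                lst.append(n)

a, b = [1], {2}
f([a, b], [3, 4])
print(a, b)
```

Any object exposing extend (or update) now works, whichever implementation produced it.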