Merging lists has made my brain hurt.

Thorsten Kampe thorsten at thorstenkampe.de
Mon Oct 7 08:55:41 EDT 2002


* Alex Martelli
> Thorsten Kampe wrote:
>> Treating lists a priori as sets for reasons of speed is definitely
>> not the way to go because you're "losing information". I think it's
>> better
>
> Of course, you have to know which information matters.

I meant this: in a real world scenario it's often not practical to
say: use this program, but beware, it's relies on an input without
duplicates or relies on all items to be hashable, because you can't
tell in advance whether the conditions will be met.

Not for the example you were answering of course but for a general
purpose 'intersection program'.

>> to treat them as tuples (in a mathematical sense) whereby you can
>> always delete 'duplicates' afterwards if you don't care about them.
>
> If you don't care about duplicates you're much better off
> normalizing first, and if you care about COUNT of duplicates but not
> ordering (i.e., a bag) you're generally still better off normalizing
> first.

What do you mean exactly by "normalizing"?

>> Example: What's the intersection between [2, 1, 2, 3, 2] and [0, 1,
>> 2, 3, 2, 4]? An intersection (union, symmetric difference) of
>> tuples can only be meaningful for the /corresponding/ items of the
>> tuples/lists, so it's neither [2, 1, 3] nor [2, 1, 2, 3, 2] but [2,
>> 1, 2, 3].
>
> But why not [1, 2, 3, 2] instead? I know of no algebras defining
> operations called intersection &c that are not commutative.

Sorry, I wrote quite imprecisely. My interpretation of 'tuple' was
contradictory to the traditional mathematical meaning 'order and
repetition counts' but it meant "repetition counts, order doesn't".

> What you mean by "corresponding" there is rather mysterious to me,
> but wouldn't it mean "same item in the SAME position"?

I was asking myself: what would be a sensible generalization for the
common set operations (intersection, union, etc.) for tuples?

Precisely: what is (or should be) the intersection of [2, 2] and [2,
2, 2]? Treat them as sets? Then the result is '[2]'.

Treat them as tuples? The most common algorithm is: for each item in
first seq: if it is in second seq, then add it to the intersection.

But then:
intersect([2, 2], [2, 2, 2]) = [2, 2]
intersect([2, 2, 2], [2, 2]) = [2, 2, 2]

You "connect" the "corresponding" items (the first '2' in seq1 with
the first in seq2, the second '2' in seq1 with the second '2' in seq2,
the third '2' in seq1 with the third '2' in seq2 ... *ooops*)
   [1, 2, 2, 2, 3]
    |  |  |     |     <- "non proportional font required"
[0, 1, 2, 2,    3, 4]

So the intersection should contain all the items that can be
"paired"/"connected". The last '2' of seq2 has no 'partner' so it
shouldn't be part of the intersection.

>> I tried to write a reliable (though maybe slow) program that takes
>> care of this issue (especially that intersection is _commutative_):
>
> But it's NOT -- [1, 2, 3, 2] != [2, 1, 2, 3]. You're losing
> information, specifically order information. A correct definition
> should be:
>
> def intersect(seq1, seq2):
>     return [ x for x, y in zip(seq1, seq2) if x == y ]

This is correct for "if first item of seq1 = first item of seq2" but I
had never use for such an "intersection". I could not see any "real
use" for intersect([1, 2, 3], [3, 1, 2]) = [].

> If what you want is to treat sequences as BAGS, then they're better
> represented by dicts again -- order is the only information present
> in a list and not in a dict, [...]

Well, usually not:
>>> {1: 2, 1: 2} == {1:2}
1

>>> [2, 2] == [2]
0

> So to do extensive processing of seqs-as-bags have converter funcs:
>
> def seqToBag(seq):
>    bag = {}
>    for item in seq: bag[item] = 1 + bag.get(item, 0)
>
> def bagToSeq(bag):
>    return [ x for key in bag for x in [key]*bag[key]]
>
> and for example "intersection" of bags then becomes, e.g:
>
> def intersectBags(bag1, bag2):
>     result = {}
>     for item in bag1:
>         count = min(bag1[item], bag2.get(item, 0))
>         if count>0: result[item] = count
          ^^^^^^^^^^^ <- this is superfluous IMO
>     return result

Yes, definitely better than my "append to result, remove from original
list to prevent item being counted twice" version. And a lot faster
(except for the abnormal "very large seq0, very small seq1" case).

This is what I made of your code (it's a little optimized compared to
my previous version so that the loop goes always through the smaller
sequence.) I didn't change the 'union' and "symmetric difference" part
because you have to loop through count0 *and* count1.

#v+
def boolean(seq0, seq1, boolean_modus):
    """" perform 'and', 'not', 'or', or 'xor' operation on sequences """
    ## auxiliary functions
    def tocount(seq):  # convert to dict where key is the hashable representation
        count = {}     # of the object and value is the object count
        for item in seq:
            count[repr(item)] = count.get(repr(item), 0) + 1
        return count

    def toseq(count):  # convert back to sequence
        return [item for key in count for item in [eval(key)] * count[key]]

    if boolean_modus == 'and':    # intersection
        count0 = tocount(seq0)
        count1 = tocount(seq1)

        if count0 > count1:  # loop through the shorter list
            count0, count1 = count1, count0
        for item in count0:
            count0[item] = min(count0[item], count1.get(item, 0))
        return toseq(count0)

    elif boolean_modus == 'not':  # set difference
        count0 = tocount(seq0)
        count1 = tocount(seq1)

        if count0 < count1:  # loop through the shorter list
            for item in count0:
                count0[item] = count0[item] - count1.get(item, 0)
        else:
            for item in count1:
                count0[item] = count0.get(item, 0) - count1[item]
        return toseq(count0)

    elif boolean_modus == 'or':   # union
        return seq0 + boolean(seq1, seq0, 'not')

    elif boolean_modus == 'xor':  # symmetric difference
        return boolean(boolean(seq0, seq1, 'or'), boolean(seq0, seq1, 'and'), 'not')
#v-


Thorsten



More information about the Python-list mailing list