Comparing sequences with range objects
Ian Hobson
hobson42 at gmail.com
Sun Apr 10 00:37:51 EDT 2022
On 09/04/2022 13:14, Christian Gollwitzer wrote:
> On 08.04.22 at 09:21, Antoon Pardon wrote:
>>> The first is really hard. Not only may information be missing, no
>>> single piece of information is unique or immutable. Two people may have
>>> the same name (I know about several other "Peter Holzer"s), a single
>>> person might change their name (when I was younger I went by my middle
>>> name - how would you know that "Peter Holzer" and "Hansi Holzer" are the
>>> same person?), they will move (= change their address), change jobs,
>>> etc. Unless you have a unique immutable identifier that's enforced by
>>> some authority (like a social security number[1]), I don't think there
>>> is a chance to do that reliably in a program (although with enough data,
>>> a heuristic may be good enough).
>>
>> Yes I know all that. That is why I keep a bucket of possible duplicates
>> per "identifying" field that is examined and use some heuristics at the
>> end of all the comparing instead of starting to weed out the duplicates
>> at the moment something differs.
>>
>> The problem is that when an identifying field is judged to be unusable,
>> the bucket to be associated with it should conceptually contain all other
>> records (which in this case are the indexes into the population list).
>> But that will eat a lot of memory. So I want some object that behaves as
>> if it were an (immutable) list of all these indexes without actually
>> containing them. A range object almost works; the only problem is that
>> it is not comparable with a list.
>
>
> Then write your own comparator function?
>
> Also, if the only case where this actually works is the index of all
> other records, then a simple boolean flag "all" vs. "these items in the
> index list" would suffice, wouldn't it?
>
> Christian
>
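For reference, the mismatch described in the quoted text can be reproduced with a short sketch. The helper name same_items is hypothetical and only illustrates what "write your own comparator" might mean for a range and a list; it is not taken from the thread.

```python
def same_items(a, b):
    """Return True if two sequences hold equal items in the same order."""
    if len(a) != len(b):
        return False
    return all(x == y for x, y in zip(a, b))


everything = range(5)          # behaves like all indexes without storing them
as_list = [0, 1, 2, 3, 4]

# A range never compares equal to a list, even with identical contents.
print(everything == as_list)
# An element-wise helper sidesteps that without building a second list.
print(same_items(everything, as_list))
```
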
Writing a comparator function is only possible for a given key. So my
approach would be:

1) Write a comparator function that takes params X and Y, such that:
       if key data is missing from both X and Y, return 0
       if key data is missing from X, return 1
       if key data is missing from Y, return -1
       if X > Y, return 1
       if X < Y, return -1
       return 0  # They are equal and key data for both is present
2) Sort the data using the comparator function. Records with missing
   key data end up at the end.
3) Run through the sorted data with a trailing enumeration loop, merging
   matching records together and copying out any records with missing
   key data.
4) If no records were copied out with missing key data, then you are
   done, so exit.
5) Choose a new key and repeat from step 1).

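
The steps above can be sketched roughly as follows. This is only an illustration under several assumptions that are not in the thread: records are dicts, missing key data is represented by None, "merging" means filling empty fields from the matching record, and the next pass runs over the merged records plus the ones set aside. The names make_comparator, merge_pass and deduplicate are hypothetical.

```python
from functools import cmp_to_key


def make_comparator(key):
    """Build a comparator for one key; records with missing data sort last."""
    def compare(x, y):
        kx, ky = x.get(key), y.get(key)
        if kx is None and ky is None:
            return 0
        if kx is None:
            return 1
        if ky is None:
            return -1
        if kx > ky:
            return 1
        if kx < ky:
            return -1
        return 0  # equal, and key data is present for both
    return compare


def merge_pass(records, key):
    """Steps 1 to 3: sort on key, merge neighbours, set aside incomplete records."""
    ordered = sorted(records, key=cmp_to_key(make_comparator(key)))
    merged, missing = [], []
    for rec in ordered:
        if rec.get(key) is None:
            missing.append(rec)              # copied out: key data missing
        elif merged and merged[-1][key] == rec[key]:
            previous = merged[-1]            # the trailing (previous) record
            for field, value in rec.items():
                if previous.get(field) is None:
                    previous[field] = value  # assumed merge policy: fill gaps
        else:
            merged.append(dict(rec))         # copy, so the input is not mutated
    return merged, missing


def deduplicate(records, keys):
    """Steps 4 and 5: stop when nothing was set aside, else try the next key."""
    for key in keys:
        merged, missing = merge_pass(records, key)
        records = merged + missing
        if not missing:
            break
    return records


people = [
    {"name": "A. Example", "email": "a@example.org", "phone": None},
    {"name": "A. Example", "email": "a@example.org", "phone": "123"},
    {"name": "B. Example", "email": None, "phone": "456"},
]
print(deduplicate(people, ["email", "phone"]))
```

Using functools.cmp_to_key keeps the comparator in the shape described in step 1; a plain key function would also work in Python 3, but it would need its own handling for missing values.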
Regards
Ian
--
Ian Hobson
Tel (+66) 626 544 695