Comparing sequences with range objects
Antoon Pardon
antoon.pardon at vub.be
Fri Apr 8 03:21:29 EDT 2022
On 8/04/2022 at 08:24, Peter J. Holzer wrote:
> On 2022-04-07 17:16:41 +0200, Antoon Pardon wrote:
>> On 7/04/2022 at 16:08, Joel Goldstick wrote:
>>> On Thu, Apr 7, 2022 at 7:19 AM Antoon Pardon<antoon.pardon at vub.be> wrote:
>>>> I am working with a list of data from which I have to weed out duplicates.
>>>> At the moment I keep for each entry a container with the other entries
>>>> that are still possible duplicates.
> [...]
>> Sorry I wasn't clear. The data contains information about persons. But not
>> all records need to be complete. So a person can occur multiple times in
>> the list, while the records are all different because they are missing
>> different bits.
>>
>> So all records with the same firstname can be duplicates. But if I have
>> a record in which the firstname is missing, it can at that point be
>> a duplicate of all other records.
> There are two problems. The first one is how do you establish identity.
> The second is how do you weed out identical objects. In your first mail
> you only asked about the second, but that's easy.
>
> The first is really hard. Not only may information be missing, no
> single piece of information is unique or immutable. Two people may have
> the same name (I know about several other "Peter Holzer"s), a single
> person might change their name (when I was younger I went by my middle
> name - how would you know that "Peter Holzer" and "Hansi Holzer" are the
> same person?), they will move (= change their address), change jobs,
> etc. Unless you have a unique immutable identifier that's enforced by
> some authority (like a social security number[1]), I don't think there
> is a chance to do that reliably in a program (although with enough data,
> a heuristic may be good enough).
Yes, I know all that. That is why I keep a bucket of possible duplicates
per "identifying" field that is examined and use some heuristics at the
end of all the comparing instead of starting to weed out the duplicates
at the moment something differs.
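
A rough illustration of the bucket idea, as a sketch only: the records,
the field names and the `build_buckets` helper are hypothetical and are
not taken from the actual program. Records that lack the field are
simply skipped here.

```python
from collections import defaultdict


def build_buckets(population, field):
    """Map each record index to the indexes of the other records
    that have the same value for `field`.

    Records where the field is missing (None) get no bucket here.
    """
    by_value = defaultdict(set)
    for idx, record in enumerate(population):
        value = record.get(field)
        if value is not None:
            by_value[value].add(idx)

    buckets = {}
    for indexes in by_value.values():
        for idx in indexes:
            # Every other record with the same value is a possible duplicate.
            buckets[idx] = indexes - {idx}
    return buckets


# Hypothetical population; None marks a missing field.
population = [
    {"firstname": "Ann", "lastname": "Smith"},
    {"firstname": "Ann", "lastname": None},
    {"firstname": None, "lastname": "Smith"},
    {"firstname": "Bob", "lastname": "Jones"},
]
firstname_buckets = build_buckets(population, "firstname")
```

Record 2 has no firstname, so it gets no bucket in this sketch.
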
The problem is that when an identifying field is judged to be unusable,
the bucket to be associated with it should conceptually contain all other
records (which in this case are the indexes into the population list).
But that will eat a lot of memory. So I want some object that behaves as
if it is an (immutable) list of all these indexes without actually containing
them. A range object almost works; the only problem is that it is not
comparable with a list.
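
A minimal sketch of such an object, assuming "comparable" means equality
testing against a list. The class name `RangeList` is made up for this
example. It wraps a built-in range, so it stores only start, stop and
step, and it inherits the rest of the sequence interface from
`collections.abc.Sequence`.

```python
from collections.abc import Sequence


class RangeList(Sequence):
    """Immutable, list-like view of a range that compares equal to lists."""

    def __init__(self, *args):
        # Accepts the same arguments as range(): stop, or start/stop[/step].
        self._range = range(*args)

    def __len__(self):
        return len(self._range)

    def __getitem__(self, index):
        if isinstance(index, slice):
            # Slicing a range gives another range, so no list is built.
            r = self._range[index]
            return RangeList(r.start, r.stop, r.step)
        return self._range[index]

    def __contains__(self, value):
        # Delegate to range, which tests membership of ints arithmetically.
        return value in self._range

    def __eq__(self, other):
        if isinstance(other, RangeList):
            return self._range == other._range
        if isinstance(other, list):
            # Element-wise comparison; this walks the list but allocates nothing.
            return len(self) == len(other) and all(
                a == b for a, b in zip(self, other)
            )
        return NotImplemented

    def __repr__(self):
        return f"RangeList({self._range!r})"
```

Because list's own `__eq__` returns NotImplemented for an unknown type,
Python falls back to the reflected comparison, so `[0, 1, 2] == RangeList(3)`
works in both directions. Defining `__eq__` without `__hash__` makes the
objects unhashable, which matches list behaviour.
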
--
Antoon Pardon.