Comparing sequences with range objects
MRAB
python at mrabarnett.plus.com
Thu Apr 7 13:40:35 EDT 2022
On 2022-04-07 16:16, Antoon Pardon wrote:
> Op 7/04/2022 om 16:08 schreef Joel Goldstick:
>> On Thu, Apr 7, 2022 at 7:19 AM Antoon Pardon <antoon.pardon at vub.be> wrote:
>>> I am working with a list of data from which I have to weed out duplicates.
>>> At the moment I keep for each entry a container with the other entries
>>> that are still possible duplicates.
>>>
>>> The problem is sometimes that is all the rest. I thought to use a range
>>> object for these cases. Unfortunately I sometimes want to sort things
>>> and a range object is not comparable with a list or a tuple.
>>>
>>> So I have a list of items where each item is itself a list or range object.
>>> I of course could sort this by using list as a key function but that
>>> would defeat the purpose of using range objects for these cases.
>>>
>>> So what would be a relatively easy way to get the same result without wasting
>>> too much memory on entries that haven't any weeding done on them.
>>>
>>> --
>>> Antoon Pardon.
>> I'm not sure I understand what you are trying to do, but if your data
>> has no order, you can use set to remove the duplicates
>
> Sorry I wasn't clear. The data contains information about persons. But not
> all records need to be complete. So a person can occur multiple times in
> the list, while the records are all different because they are missing
> different bits.
>
> So all records with the same firstname can be duplicates. But if I have
> a record in which the firstname is missing, it can at that point be
> a duplicate of all other records.
>
This is how I'd approach it:
from collections import defaultdict

# Make a list of groups, where each group is a list of potential duplicates.
# Initially, all of the records are potential duplicates of each other.
records = [list_of_records]

# Split the groups into subgroups according to the first name.
new_records = []
for group in records:
    subgroups = defaultdict(list)
    for record in group:
        # A missing first name is assumed to be stored as None.
        subgroups[record['first_name']].append(record)

    # Records without a first name could belong to any of the subgroups.
    missing = subgroups.pop(None, [])
    for subgroup in subgroups.values():
        subgroup.extend(missing)

    # If no record in the group has a first name, keep the group as it is.
    if not subgroups:
        new_records.append(missing)

    new_records.extend(subgroups.values())

records = new_records

# Now repeat for the last name, etc.
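
To make the "repeat for the last name, etc." step concrete, here is a rough sketch that wraps the splitting in a function and applies it once per field. The function name split_groups, the field names, and the sample records are all made up for illustration, and it assumes that a missing value is either None or an absent key.

```python
from collections import defaultdict


def split_groups(groups, field):
    """Split each group into subgroups whose records share a value for field."""
    new_groups = []
    for group in groups:
        subgroups = defaultdict(list)
        for record in group:
            # .get() treats an absent key the same as an explicit None.
            subgroups[record.get(field)].append(record)

        # Records without a value could belong to any of the subgroups.
        missing = subgroups.pop(None, [])
        if not subgroups:
            # Nothing to split on, so the group stays as it was.
            new_groups.append(missing)
            continue

        for subgroup in subgroups.values():
            subgroup.extend(missing)
        new_groups.extend(subgroups.values())
    return new_groups


# Hypothetical sample data.
list_of_records = [
    {'first_name': 'Ann', 'last_name': 'Lee'},
    {'first_name': 'Ann', 'last_name': None},
    {'first_name': None, 'last_name': 'Lee'},
    {'first_name': 'Bob', 'last_name': 'Ray'},
]

groups = [list_of_records]
for field in ('first_name', 'last_name'):
    groups = split_groups(groups, field)

# A group with a single record has no potential duplicates left.
candidates = [group for group in groups if len(group) > 1]
```

Note that a record with missing fields can end up in more than one group, so the groups are not a partition of the original list. The groups hold references to the same record dicts, so the extra memory is only the lists themselves.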