[Tutor] Better way to remove lines from a list?

Tue May 12 17:59:18 EDT 2020

boB Stepp wrote:

>   I have a test file with the following contents:
> 
> ADR;TYPE=HOME:;;11601 Southridge Dr;Little Rock;AR;72212-1733;US;11601 Sout
>   hridge Dr\nLittle Rock\, AR 72212-1733\nUS
> ADR;TYPE=WORK:;;1912 Green Mountain Dr;Little Rock;AR;72212;US;1912 Green M
>   ountain Dr\nLittle Rock\, AR 72212\nUS
>   more meaningless stuff
>   even more meaningless stuff
> ADR:100;;4700 E McCain Blvd;North Little Rock;AR;72117;US;4700 E McCain Blv
>   d\n100\nNorth Little Rock\, AR 72117\nUS
> 
> I wish to remove the part of lines starting with "ADR" from the last
> semi-colon to the EOL *and* any following lines that continue this
> duplicated address.  As far as I can tell every such instance in my actual
> vCard file has these subsequent lines starting with a single space before
> a new legitimate vCard property line occurs which always has a character
> in the first column of the line.
> 
> I have a solution that works relying on these file-specific facts.  After
> reading the file into a list using readlines() I have this function to do
> this processing:
> 
> def clean_address(vCard):
>      cleaned_vCard = []
>      for index, line in enumerate(vCard):
>          clean_line = line
>          if line.startswith("ADR"):
>              clean_line = line.rpartition(";")[0]
>              while True:
>                  if vCard[index + 1].startswith(" "):
>                      vCard.pop(index + 1)
>                  else:
>                      break
>          cleaned_vCard.append(clean_line)
>      return cleaned_vCard
> 
> In the inner while loop I wanted to do the equivalent of saying "advance
> the outer for loop while staying inside the while loop".  If I were
> able to do this I would not need to modify the vCard list in place.  I
> tried to find a way to do this with ideas of next() or .__next__(), but I
> could not discover online how to access the for loop's iterator.  I feel
> sure there is a better way to do what I want to accomplish, possibly
> completely altering the logic of my function or doing something along my
> above speculations.
> 
> The other thing that bothers me is the fragility of my approach.  I am
> relying on two things that I am sure are not true for a general export of
> a Google vCard:  (1) What if I have an exceptionally long legitimate address
> that cannot be encompassed on a single line starting with "ADR"?  In this
> case my function as written would not yield a correct address.  (2) I am
> relying on illegitimate address duplicates starting on following lines
> beginning with a single space.  For my particular vCard file I don't think
> these will affect me, but I would like to make this more robust just
> because it is the right thing to do.  But at the moment I don't see how.
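
On advancing the outer loop from inside the inner one: a for loop does not
expose its iterator, but you can create the iterator yourself with iter() and
call next() on it. A minimal sketch, keeping the same file-specific
assumptions as your function (continuation lines start with a space). Unlike
your version it re-adds the newline that rpartition() strips along with the
duplicate, and it never modifies the list passed in:

```python
def clean_address(lines):
    """Return a cleaned copy of lines; the input list is left untouched."""
    cleaned = []
    it = iter(lines)
    line = next(it, None)  # None signals that the iterator is exhausted
    while line is not None:
        if line.startswith("ADR"):
            # Drop everything from the last semicolon on, then restore the
            # newline that was cut off together with the duplicate address.
            cleaned.append(line.rpartition(";")[0] + "\n")
            # Consume the continuation lines of the duplicated address.
            line = next(it, None)
            while line is not None and line.startswith(" "):
                line = next(it, None)
        else:
            cleaned.append(line)
            line = next(it, None)
    return cleaned
```

The default argument of next() avoids a StopIteration (or, in the list-based
version, an IndexError) when an ADR entry is the last thing in the file.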

I doubt that the extra stuff in the ADR lines is illegitimate and think that 
the best solution would be to find a tool that can parse the data as-is.

However, practicality beats purity. So how about merging each folded line
back into one and then removing everything from the 8th semicolon onwards?
Something like:

# assuming that the colon after one of your ADRs is a typo

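def cleaned(line):
    # Keep everything before the 8th semicolon; the rest is the duplicate.
    if line.startswith("ADR;"):
        line = ";".join(line.split(";")[:8])
    return line + "\n"

# Unfold continuation lines (newline + space), then clean line by line.
cleaned_text = "".join(
    cleaned(line) for line in text.replace("\n ", "").splitlines()
)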

where text is the complete file as a string.
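
If the "ADR:100" line is not a typo but simply an ADR property without a TYPE
parameter, counting semicolons from the start of the line is off by one for
it. A hedged variant that counts within the value instead: split at the first
colon and keep the seven structured address components. The helper name
clean_text() is made up for this example, and the sketch assumes no escaped
"\;" inside a component and no ":" inside the parameters, which a real vCard
parser would have to handle.

```python
def cleaned(line):
    """Trim an unfolded ADR line to its seven structured components."""
    name, sep, value = line.partition(":")
    # The property name is "ADR", optionally followed by ";PARAM=..." parts.
    if sep and name.split(";")[0] == "ADR":
        # Assumption: no escaped "\;" in the value, no ":" in the parameters.
        line = name + ":" + ";".join(value.split(";")[:7])
    return line + "\n"


def clean_text(text):
    # Unfold first: a newline followed by one space continues the line.
    unfolded = text.replace("\n ", "")
    return "".join(cleaned(line) for line in unfolded.splitlines())


text = (
    "ADR;TYPE=WORK:;;1912 Green Mountain Dr;Little Rock;AR;72212;US;1912 Green M\n"
    " ountain Dr\\nLittle Rock\\, AR 72212\\nUS\n"
    "ADR:100;;4700 E McCain Blvd;North Little Rock;AR;72117;US;4700 E McCain Blv\n"
    " d\\n100\\nNorth Little Rock\\, AR 72117\\nUS\n"
)
print(clean_text(text), end="")
```

With the sample above, both entries should end at ";US", whether or not the
line carries a TYPE parameter.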