[Tutor] Better way to remove lines from a list?
Peter Otten
__peter__ at web.de
Tue May 12 17:59:18 EDT 2020
boB Stepp wrote:
> I have a test file with the following contents:
>
> ADR;TYPE=HOME:;;11601 Southridge Dr;Little Rock;AR;72212-1733;US;11601 Sout
>  hridge Dr\nLittle Rock\, AR 72212-1733\nUS
> ADR;TYPE=WORK:;;1912 Green Mountain Dr;Little Rock;AR;72212;US;1912 Green M
>  ountain Dr\nLittle Rock\, AR 72212\nUS
> more meaningless stuff
> even more meaningless stuff
> ADR:100;;4700 E McCain Blvd;North Little Rock;AR;72117;US;4700 E McCain Blv
>  d\n100\nNorth Little Rock\, AR 72117\nUS
>
> I wish to remove the part of lines starting with "ADR" from the last
> semi-colon to the EOL *and* any following lines that continue this
> duplicated address. As far as I can tell every such instance in my actual
> vCard file has these subsequent lines starting with a single space before
> a new legitimate vCard property line occurs which always has a character
> in the first column of the line.
>
> I have a solution that works relying on these file-specific facts. After
> reading the file into a list using readlines() I have this function to do
> this processing:
>
> def clean_address(vCard):
>     cleaned_vCard = []
>     for index, line in enumerate(vCard):
>         clean_line = line
>         if line.startswith("ADR"):
>             clean_line = line.rpartition(";")[0]
>             while True:
>                 if vCard[index + 1].startswith(" "):
>                     vCard.pop(index + 1)
>                 else:
>                     break
>         cleaned_vCard.append(clean_line)
>     return cleaned_vCard
>
> In the inner while loop I wanted to do the equivalent of saying "advance
> the outer for loop while staying inside the while loop". If I were
> able to do this I would not need to modify the vCard list in place. I
> tried to find a way to do this with ideas of next() or .__next__(), but I
> could not discover online how to access the for loop's iterator. I feel
> sure there is a better way to do what I want to accomplish, possibly
> completely altering the logic of my function or doing something along my
> above speculations.
>
> The other thing that bothers me is the fragility of my approach. I am
> relying on two things that I am sure are not true for a general export of a
> Google vCard: (1) What if I have an exceptionally long legitimate address
> that cannot be encompassed on a single line starting with "ADR"? In this
> case my function as written would not yield a correct address. (2) I am
> relying on illegitimate address duplicates starting on following lines
> beginning with a single space. For my particular vCard file I don't think
> these will affect me, but I would like to make this more robust just
> because it is the right thing to do. But at the moment I don't see how.
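
On the question of advancing the outer loop from inside the inner one: a for
loop does not give you a name for its iterator, but you can create the
iterator yourself with iter() and pull items from it with next(). The
following is only a minimal sketch under the same assumption as above, namely
that continuation lines always start with a space. The names `lines` and `it`
are mine, and unlike the quoted function it puts the newline back after
rpartition() and never modifies the list passed in.

```python
def clean_address(lines):
    """Return a new list without ADR duplicates and their continuation lines."""
    cleaned = []
    it = iter(lines)           # one explicit iterator shared by both loops
    line = next(it, None)      # None signals that the input is exhausted
    while line is not None:
        if line.startswith("ADR"):
            # rpartition() drops the last field together with the trailing
            # newline, so the newline is added back here.
            cleaned.append(line.rpartition(";")[0] + "\n")
            # Advance past every continuation line (leading space).
            line = next(it, None)
            while line is not None and line.startswith(" "):
                line = next(it, None)
        else:
            cleaned.append(line)
            line = next(it, None)
    return cleaned


sample = [
    "ADR;TYPE=WORK:;;1912 Green Mountain Dr;Little Rock;AR;72212;US;1912 Green M\n",
    " ountain Dr\\nLittle Rock\\, AR 72212\\nUS\n",
    "more meaningless stuff\n",
]
print("".join(clean_address(sample)), end="")
```

Because next(it, None) returns a default instead of raising StopIteration,
an ADR line at the very end of the file needs no special handling.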
I doubt that the extra stuff in the ADR lines is illegitimate and think that
the best solution would be to find a tool that can parse the data as-is.
However, practicality beats purity. So how about merging each folded line
with its continuation lines and then removing everything starting with the
8th semicolon? Like
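# assuming that the colon after one of your ADRs is a typo
def cleaned(line):
    if line.startswith("ADR;"):
        # keep the first eight fields, drop the duplicated address after them
        line = ";".join(line.split(";")[:8])
    return line + "\n"

# a newline followed by a space marks a continuation; removing the pair
# rejoins the folded line before it is split into fields
cleaned_text = "".join(
    cleaned(line) for line in text.replace("\n ", "").splitlines()
)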
where text is the complete file as a string.
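
In case the "ADR:" form in the sample is not a typo, here is a variant sketch,
not a tested solution for real exports. It splits each line at the first colon
and keeps the first seven semicolon-separated components of the value, seven
being the number of components the vCard specification defines for ADR. The
helper names unfold() and clean_text() are mine. It assumes that no parameter
contains a colon and that no component contains an escaped semicolon ("\;");
either would need a real parser.

```python
def unfold(text):
    """Rejoin folded lines: newline plus one space marks a continuation."""
    return text.replace("\n ", "")


def cleaned(line):
    """Trim an ADR value to seven components; pass other lines through."""
    name_and_params, colon, value = line.partition(":")
    # The property name is whatever precedes the first ';' (or the ':').
    if colon and name_and_params.split(";")[0] == "ADR":
        value = ";".join(value.split(";")[:7])
        line = name_and_params + ":" + value
    return line + "\n"


def clean_text(text):
    """Unfold the whole file, then clean it line by line."""
    return "".join(cleaned(line) for line in unfold(text).splitlines())
```

With this variant both "ADR;TYPE=HOME:..." and "ADR:100;;..." lines lose the
trailing duplicate, while lines without a colon or with another property name
are returned unchanged.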