[Tutor] tab separated file handling

Wed Jun 13 09:03:37 EDT 2018

Niharika Jakhar wrote:

> hi everyone!
> I am working with a tsv file which has NA and empty values.
> I have used csv package to make a list of list of the data.
> I want to remove NA and empty values.
> 
> This is what I wrote:
> 
> 
> #removes row with NA values
>         for rows in self.dataline:
>             for i in rows:
>                 if i == 'NA' or i ==  '':
>                     self.dataline.remove(rows)
> 
> 
> This is what the terminal says:
> 
>     self.dataline.remove(rows)
> ValueError: list.remove(x): x not in list
> 
> 
> This is how the file looks like:
> 
> d23 87 9 NA 67 5 657 NA 76 8 87 78 90 800
> er 21 8 908 9008 9 7 5 46 3 5 757 7 5

I believe your biggest problem is the choice of confusing names ;)

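Let's assume you start with a "table" as a list of rows where one "row" is a 
list of "value"s. With those names your loop reads

for row in table:
    for value in row:
        if value == "NA" or value == "":
            table.remove(row)

This fails as soon as a row contains two unwanted values, like the first row 
in your sample. The first "NA" removes the row from the table, the inner loop 
carries on, and the second "NA" tries to remove a row that is no longer there.

If you remove just the offending values instead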

for row in table:
    for value in row:
        if value == "NA" or value == "":
            row.remove(value)

you run into another problem:

>>> row = ["a", "b", "NA", "c", "d"]
>>> for value in row:
...     print("checking", value)
...     if value == "NA": row.remove(value)
... 
checking a
checking b
checking NA
checking d

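Do you see the bug? "c" is never tested. Removing "NA" shifts "c" into the 
slot the loop has already visited, so the iteration moves straight on to "d". 
If "c" were another "NA" it would survive the removal. You should never add 
or remove items from a list while you are iterating over it.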
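
To make that concrete, here is a small sketch (the sample row is made up) 
where the skipped item really is a second "NA", next to a version that builds 
a new list instead of removing in place:

```python
row = ["a", "NA", "NA", "b"]  # made-up sample data

# Removing while iterating: the second "NA" slides into the slot
# the loop has already visited and is never checked.
buggy = list(row)
for value in buggy:
    if value == "NA":
        buggy.remove(value)

# Building a new list leaves the list being iterated untouched.
fixed = [value for value in row if value != "NA"]

print(buggy)
print(fixed)
```

The first list still contains one "NA"; the second does not.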

There are a few workarounds, but in your case I think the best approach is 
not to put them in the table in the first place:

import csv

with open("whatever.tsv", newline="") as f:
    reader = csv.reader(f, delimiter="\t")
    table = [
        [value for value in row if value not in {"NA", ""}]
        for row in reader
    ]

Here the inner list comprehension rejects "NA" and "" immediately. That will 
of course lead to rows of varying length. If your intention was to skip rows 
containing NA completely, use

with open("whatever.tsv", newline="") as f:
    reader = csv.reader(f, delimiter="\t")
    table = [
        row for row in reader if "NA" not in row and "" not in row
    ]

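or, with the unwanted values spelled out once,

ILLEGAL = {"NA", ""}

with open("whatever.tsv", newline="") as f:
    reader = csv.reader(f, delimiter="\t")
    table = [
        row for row in reader if not ILLEGAL.intersection(row)
    ]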
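
If you want to try both variants without touching your real file, here is a 
self-contained sketch. It assumes the two sample lines from your mail are 
really tab-separated; io.StringIO stands in for the open file, and the 
function names drop_values() and drop_rows() are just made up for this 
example.

```python
import csv
import io

# Stand-in for the real file; assumes tab-separated fields.
data = (
    "d23\t87\t9\tNA\t67\t5\t657\tNA\t76\t8\t87\t78\t90\t800\n"
    "er\t21\t8\t908\t9008\t9\t7\t5\t46\t3\t5\t757\t7\t5\n"
)
ILLEGAL = {"NA", ""}


def drop_values(lines):
    """Keep every row, but without the unwanted values."""
    reader = csv.reader(lines, delimiter="\t")
    return [
        [value for value in row if value not in ILLEGAL]
        for row in reader
    ]


def drop_rows(lines):
    """Keep only the rows that contain no unwanted value."""
    reader = csv.reader(lines, delimiter="\t")
    return [row for row in reader if not ILLEGAL.intersection(row)]


stripped = drop_values(io.StringIO(data))
complete = drop_rows(io.StringIO(data))

print([len(row) for row in stripped])
print(len(complete))
```

With this sample the first variant keeps both rows but the first one ends up 
two values shorter, while the second variant keeps only the "er" row.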