[Tutor] compare and arrange file

Wed Jul 13 17:02:10 CEST 2011

Edgar Almonte wrote:

> fist time i saw the statement "with" , is part of the module csv ? ,
> that make a loop through the file ? is not the sortkey function
> waiting for a paramenter ( the row ) ? i don't see how is get pass ,
> the "key" is a parameter of sorted function ? , reader is a function
> of csv module ? if "with..." is a loop  that go through the file is
> the second with inside of that loop ? ( i dont see how the write of
> the output file catch the line readed

with open(filename) as fileobj:
   do_something_with(fileobj)

is a shortcut for

fileobj = open(filename)
do_something_with(fileobj)
fileobj.close()

But it is not only shorter; it also guarantees that the file will be closed 
even if an error occurs while it is being processed.

In my code the file object is wrapped into a csv.reader.

for row in csv.reader(fileobj, delimiter="|"):
    # do something with row

walks through the file one row at a time where the row is a list of the 
columns. To be able to sort these rows you have to read them all into 
memory. You typically do that with

rows_iter = csv.reader(fileobj, delimiter="|")
rows = list(rows_iter)

and can then sort the rows with

rows.sort()

Because this two-step process is so common there is a builtin sorted() that 
converts an iterable (the rows here) into a sorted list. Now consider the 
following infile:

$ cat infile.txt
aXXXXXXXX XXXX| 0000000000000.00| 0000000011111.11|
bXXXXXXXX XXXX| 0000000000000.00| 0000000011111.11|
XXXXXXXXX XXXX| 0000000000000.00| 0000000088115.39|
XXXXXXXXX XXXX| 0000000090453.29| 0000000000000.00|
XXXXXXXXX XXXX| 0000000000000.00| 0000000090443.29|
cXXXXXXXX XXXX| 0000000011111.11| 0000000000000.00|
XXXXXXXXX XXXX| 0000000088115.39| 0000000000000.00|
XXXXXXXXX XXXX| 0000000000000.00| 0000000088335.39|
XXXXXXXXX XXXX| 0000000090453.29| 0000000000000.00|
XXXXXXXXX XXXX| 0000000088335.39| 0000000000000.00|
XXXXXXXXX XXXX| 0000000090443.29| 0000000000000.00|
dXXXXXXXX XXXX| 0000000011111.11| 0000000000000.00|

If we read it and sort it we get the following:

>>> import csv
>>> with open("infile.txt") as fileobj:
...     rows = sorted(csv.reader(fileobj, delimiter="|"))
...
>>> from pprint import pprint
>>> pprint(rows)
[['XXXXXXXXX XXXX', ' 0000000000000.00', ' 0000000088115.39', ''],
 ['XXXXXXXXX XXXX', ' 0000000000000.00', ' 0000000088335.39', ''],
 ['XXXXXXXXX XXXX', ' 0000000000000.00', ' 0000000090443.29', ''],
 ['XXXXXXXXX XXXX', ' 0000000088115.39', ' 0000000000000.00', ''],
 ['XXXXXXXXX XXXX', ' 0000000088335.39', ' 0000000000000.00', ''],
 ['XXXXXXXXX XXXX', ' 0000000090443.29', ' 0000000000000.00', ''],
 ['XXXXXXXXX XXXX', ' 0000000090453.29', ' 0000000000000.00', ''],
 ['XXXXXXXXX XXXX', ' 0000000090453.29', ' 0000000000000.00', ''],
 ['aXXXXXXXX XXXX', ' 0000000000000.00', ' 0000000011111.11', ''],
 ['bXXXXXXXX XXXX', ' 0000000000000.00', ' 0000000011111.11', ''],
 ['cXXXXXXXX XXXX', ' 0000000011111.11', ' 0000000000000.00', ''],
 ['dXXXXXXXX XXXX', ' 0000000011111.11', ' 0000000000000.00', '']]

Can you infer the sort order of the list of lists above? The rows are sorted 
by the first item in the list, then rows whose first item compares equal are 
sorted by the second and so on. This sort order is not something built into 
the sort() method, but rather the objects that are compared. sort() uses the 
usual operators like < and == internally. Now what would you do if you 
wanted to sort your data by the third column, say? Here the key parameter 
comes into play. You can use it to provide a function that takes an item in 
the list to be sorted and returns something that is used instead of the 
items to compare them to each other:

>> def extract_third_column(row):
...     return row[2]
...
>>> rows.sort(key=extract_third_column)
>>> pprint(rows)
[['XXXXXXXXX XXXX', ' 0000000088115.39', ' 0000000000000.00', ''],
 ['XXXXXXXXX XXXX', ' 0000000088335.39', ' 0000000000000.00', ''],
 ['XXXXXXXXX XXXX', ' 0000000090443.29', ' 0000000000000.00', ''],
 ['XXXXXXXXX XXXX', ' 0000000090453.29', ' 0000000000000.00', ''],
 ['XXXXXXXXX XXXX', ' 0000000090453.29', ' 0000000000000.00', ''],
 ['cXXXXXXXX XXXX', ' 0000000011111.11', ' 0000000000000.00', ''],
 ['dXXXXXXXX XXXX', ' 0000000011111.11', ' 0000000000000.00', ''],
 ['aXXXXXXXX XXXX', ' 0000000000000.00', ' 0000000011111.11', ''],
 ['bXXXXXXXX XXXX', ' 0000000000000.00', ' 0000000011111.11', ''],
 ['XXXXXXXXX XXXX', ' 0000000000000.00', ' 0000000088115.39', ''],
 ['XXXXXXXXX XXXX', ' 0000000000000.00', ' 0000000088335.39', ''],
 ['XXXXXXXXX XXXX', ' 0000000000000.00', ' 0000000090443.29', '']]

The key function you actually need is a bit more sophisticated. You want 
rows with equal nonzero values to end close together, no matter whether the 
interesting value is in the second or third column. Let's try:

>>> def extract_nonzero_column(row):
...     if row[1] == ' 0000000000000.00':
...             return row[2]
...     else:
...             return row[1]
...

Here's a preview of the keys:

>>> for row in rows:
...     print extract_nonzero_column(row)
...
 0000000088115.39
 0000000088335.39
 0000000090443.29
 0000000090453.29
 0000000090453.29
 0000000011111.11
 0000000011111.11
 0000000011111.11
 0000000011111.11
 0000000088115.39
 0000000088335.39
 0000000090443.29

Looks good, let's apply:

>>> pprint(sorted(rows, key=extract_nonzero_column))
[['cXXXXXXXX XXXX', ' 0000000011111.11', ' 0000000000000.00', ''],
 ['dXXXXXXXX XXXX', ' 0000000011111.11', ' 0000000000000.00', ''],
 ['aXXXXXXXX XXXX', ' 0000000000000.00', ' 0000000011111.11', ''],
 ['bXXXXXXXX XXXX', ' 0000000000000.00', ' 0000000011111.11', ''],
 ['XXXXXXXXX XXXX', ' 0000000088115.39', ' 0000000000000.00', ''],
 ['XXXXXXXXX XXXX', ' 0000000000000.00', ' 0000000088115.39', ''],
 ['XXXXXXXXX XXXX', ' 0000000088335.39', ' 0000000000000.00', ''],
 ['XXXXXXXXX XXXX', ' 0000000000000.00', ' 0000000088335.39', ''],
 ['XXXXXXXXX XXXX', ' 0000000090443.29', ' 0000000000000.00', ''],
 ['XXXXXXXXX XXXX', ' 0000000000000.00', ' 0000000090443.29', ''],
 ['XXXXXXXXX XXXX', ' 0000000090453.29', ' 0000000000000.00', ''],
 ['XXXXXXXXX XXXX', ' 0000000090453.29', ' 0000000000000.00', '']]

Almost there, but we want the rows with nonzero values in the third column 
before those with nonzero values in the second. Emile's and my solution was 
to add a flag to the sort key, True if the nonzero value is in the first, 
False else because False < True in python:

>>> sorted([True, False, False, True, False])
[False, False, False, True, True]

Just as with the rows the flag is the second item in the result tuple, and 
is only used to sort the items if their first value compares equal.

>>> def sortkey(row):
...     if row[1] == ' 0000000000000.00':
...             return row[2], False
...     else:
...             return row[1], True
...
>>> rows.sort(key=sortkey)
>>> pprint(rows)
[['aXXXXXXXX XXXX', ' 0000000000000.00', ' 0000000011111.11', ''],
 ['bXXXXXXXX XXXX', ' 0000000000000.00', ' 0000000011111.11', ''],
 ['cXXXXXXXX XXXX', ' 0000000011111.11', ' 0000000000000.00', ''],
 ['dXXXXXXXX XXXX', ' 0000000011111.11', ' 0000000000000.00', ''],
 ['XXXXXXXXX XXXX', ' 0000000000000.00', ' 0000000088115.39', ''],
 ['XXXXXXXXX XXXX', ' 0000000088115.39', ' 0000000000000.00', ''],
 ['XXXXXXXXX XXXX', ' 0000000000000.00', ' 0000000088335.39', ''],
 ['XXXXXXXXX XXXX', ' 0000000088335.39', ' 0000000000000.00', ''],
 ['XXXXXXXXX XXXX', ' 0000000000000.00', ' 0000000090443.29', ''],
 ['XXXXXXXXX XXXX', ' 0000000090443.29', ' 0000000000000.00', ''],
 ['XXXXXXXXX XXXX', ' 0000000090453.29', ' 0000000000000.00', ''],
 ['XXXXXXXXX XXXX', ' 0000000090453.29', ' 0000000000000.00', '']]

Now you just have to write the result back to a file.

Two problems remain: unpaired values are not detected and duplicate values 
destroy the right-left-right-left alteration, too.

Bonus: Python's list.sort() method is "stable", it doesn't affect the 
relative order of items in the list that compare equal. This allows to 
replace one sort with a complex key with two sorts with simpler keys:

>>> rows.sort(key=lambda row: row[1])
>>> rows.sort(key=lambda row: max(row[1:]))