[Tutor] compare and arrange file

Thu Jul 14 14:13:36 CEST 2011

On Wed, Jul 13, 2011 at 11:02 AM, Peter Otten <__peter__ at web.de> wrote:
> Edgar Almonte wrote:
>
>> fist time i saw the statement "with" , is part of the module csv ? ,
>> that make a loop through the file ? is not the sortkey function
>> waiting for a paramenter ( the row ) ? i don't see how is get pass ,
>> the "key" is a parameter of sorted function ? , reader is a function
>> of csv module ? if "with..." is a loop  that go through the file is
>> the second with inside of that loop ? ( i dont see how the write of
>> the output file catch the line readed
>
> with open(filename) as fileobj:
>   do_something_with(fileobj)
>
> is a shortcut for
>
> fileobj = open(filename)
> do_something_with(fileobj)
> fileobj.close()
>
> But it is not only shorter; it also guarantees that the file will be closed
> even if an error occurs while it is being processed.
>
> In my code the file object is wrapped into a csv.reader.
>
> for row in csv.reader(fileobj, delimiter="|"):
>    # do something with row
>
> walks through the file one row at a time where the row is a list of the
> columns. To be able to sort these rows you have to read them all into
> memory. You typically do that with
>
> rows_iter = csv.reader(fileobj, delimiter="|")
> rows = list(rows_iter)
>
> and can then sort the rows with
>
> rows.sort()
>
> Because this two-step process is so common there is a builtin sorted() that
> converts an iterable (the rows here) into a sorted list. Now consider the
> following infile:
>
> $ cat infile.txt
> aXXXXXXXX XXXX| 0000000000000.00| 0000000011111.11|
> bXXXXXXXX XXXX| 0000000000000.00| 0000000011111.11|
> XXXXXXXXX XXXX| 0000000000000.00| 0000000088115.39|
> XXXXXXXXX XXXX| 0000000090453.29| 0000000000000.00|
> XXXXXXXXX XXXX| 0000000000000.00| 0000000090443.29|
> cXXXXXXXX XXXX| 0000000011111.11| 0000000000000.00|
> XXXXXXXXX XXXX| 0000000088115.39| 0000000000000.00|
> XXXXXXXXX XXXX| 0000000000000.00| 0000000088335.39|
> XXXXXXXXX XXXX| 0000000090453.29| 0000000000000.00|
> XXXXXXXXX XXXX| 0000000088335.39| 0000000000000.00|
> XXXXXXXXX XXXX| 0000000090443.29| 0000000000000.00|
> dXXXXXXXX XXXX| 0000000011111.11| 0000000000000.00|
>
> If we read it and sort it we get the following:
>
>>>> import csv
>>>> with open("infile.txt") as fileobj:
> ...     rows = sorted(csv.reader(fileobj, delimiter="|"))
> ...
>>>> from pprint import pprint
>>>> pprint(rows)
> [['XXXXXXXXX XXXX', ' 0000000000000.00', ' 0000000088115.39', ''],
>  ['XXXXXXXXX XXXX', ' 0000000000000.00', ' 0000000088335.39', ''],
>  ['XXXXXXXXX XXXX', ' 0000000000000.00', ' 0000000090443.29', ''],
>  ['XXXXXXXXX XXXX', ' 0000000088115.39', ' 0000000000000.00', ''],
>  ['XXXXXXXXX XXXX', ' 0000000088335.39', ' 0000000000000.00', ''],
>  ['XXXXXXXXX XXXX', ' 0000000090443.29', ' 0000000000000.00', ''],
>  ['XXXXXXXXX XXXX', ' 0000000090453.29', ' 0000000000000.00', ''],
>  ['XXXXXXXXX XXXX', ' 0000000090453.29', ' 0000000000000.00', ''],
>  ['aXXXXXXXX XXXX', ' 0000000000000.00', ' 0000000011111.11', ''],
>  ['bXXXXXXXX XXXX', ' 0000000000000.00', ' 0000000011111.11', ''],
>  ['cXXXXXXXX XXXX', ' 0000000011111.11', ' 0000000000000.00', ''],
>  ['dXXXXXXXX XXXX', ' 0000000011111.11', ' 0000000000000.00', '']]
>
> Can you infer the sort order of the list of lists above? The rows are sorted
> by the first item in the list, then rows whose first item compares equal are
> sorted by the second and so on. This sort order is not something built into
> the sort() method, but rather the objects that are compared. sort() uses the
> usual operators like < and == internally. Now what would you do if you
> wanted to sort your data by the third column, say? Here the key parameter
> comes into play. You can use it to provide a function that takes an item in
> the list to be sorted and returns something that is used instead of the
> items to compare them to each other:
>
>>> def extract_third_column(row):
> ...     return row[2]
> ...
>>>> rows.sort(key=extract_third_column)
>>>> pprint(rows)
> [['XXXXXXXXX XXXX', ' 0000000088115.39', ' 0000000000000.00', ''],
>  ['XXXXXXXXX XXXX', ' 0000000088335.39', ' 0000000000000.00', ''],
>  ['XXXXXXXXX XXXX', ' 0000000090443.29', ' 0000000000000.00', ''],
>  ['XXXXXXXXX XXXX', ' 0000000090453.29', ' 0000000000000.00', ''],
>  ['XXXXXXXXX XXXX', ' 0000000090453.29', ' 0000000000000.00', ''],
>  ['cXXXXXXXX XXXX', ' 0000000011111.11', ' 0000000000000.00', ''],
>  ['dXXXXXXXX XXXX', ' 0000000011111.11', ' 0000000000000.00', ''],
>  ['aXXXXXXXX XXXX', ' 0000000000000.00', ' 0000000011111.11', ''],
>  ['bXXXXXXXX XXXX', ' 0000000000000.00', ' 0000000011111.11', ''],
>  ['XXXXXXXXX XXXX', ' 0000000000000.00', ' 0000000088115.39', ''],
>  ['XXXXXXXXX XXXX', ' 0000000000000.00', ' 0000000088335.39', ''],
>  ['XXXXXXXXX XXXX', ' 0000000000000.00', ' 0000000090443.29', '']]
>
> The key function you actually need is a bit more sophisticated. You want
> rows with equal nonzero values to end close together, no matter whether the
> interesting value is in the second or third column. Let's try:
>
>>>> def extract_nonzero_column(row):
> ...     if row[1] == ' 0000000000000.00':
> ...             return row[2]
> ...     else:
> ...             return row[1]
> ...
>
> Here's a preview of the keys:
>
>>>> for row in rows:
> ...     print extract_nonzero_column(row)
> ...
>  0000000088115.39
>  0000000088335.39
>  0000000090443.29
>  0000000090453.29
>  0000000090453.29
>  0000000011111.11
>  0000000011111.11
>  0000000011111.11
>  0000000011111.11
>  0000000088115.39
>  0000000088335.39
>  0000000090443.29
>
> Looks good, let's apply:
>
>>>> pprint(sorted(rows, key=extract_nonzero_column))
> [['cXXXXXXXX XXXX', ' 0000000011111.11', ' 0000000000000.00', ''],
>  ['dXXXXXXXX XXXX', ' 0000000011111.11', ' 0000000000000.00', ''],
>  ['aXXXXXXXX XXXX', ' 0000000000000.00', ' 0000000011111.11', ''],
>  ['bXXXXXXXX XXXX', ' 0000000000000.00', ' 0000000011111.11', ''],
>  ['XXXXXXXXX XXXX', ' 0000000088115.39', ' 0000000000000.00', ''],
>  ['XXXXXXXXX XXXX', ' 0000000000000.00', ' 0000000088115.39', ''],
>  ['XXXXXXXXX XXXX', ' 0000000088335.39', ' 0000000000000.00', ''],
>  ['XXXXXXXXX XXXX', ' 0000000000000.00', ' 0000000088335.39', ''],
>  ['XXXXXXXXX XXXX', ' 0000000090443.29', ' 0000000000000.00', ''],
>  ['XXXXXXXXX XXXX', ' 0000000000000.00', ' 0000000090443.29', ''],
>  ['XXXXXXXXX XXXX', ' 0000000090453.29', ' 0000000000000.00', ''],
>  ['XXXXXXXXX XXXX', ' 0000000090453.29', ' 0000000000000.00', '']]
>
> Almost there, but we want the rows with nonzero values in the third column
> before those with nonzero values in the second. Emile's and my solution was
> to add a flag to the sort key, True if the nonzero value is in the first,
> False else because False < True in python:
>
>>>> sorted([True, False, False, True, False])
> [False, False, False, True, True]
>
> Just as with the rows the flag is the second item in the result tuple, and
> is only used to sort the items if their first value compares equal.
>
>>>> def sortkey(row):
> ...     if row[1] == ' 0000000000000.00':
> ...             return row[2], False
> ...     else:
> ...             return row[1], True
> ...
>>>> rows.sort(key=sortkey)
>>>> pprint(rows)
> [['aXXXXXXXX XXXX', ' 0000000000000.00', ' 0000000011111.11', ''],
>  ['bXXXXXXXX XXXX', ' 0000000000000.00', ' 0000000011111.11', ''],
>  ['cXXXXXXXX XXXX', ' 0000000011111.11', ' 0000000000000.00', ''],
>  ['dXXXXXXXX XXXX', ' 0000000011111.11', ' 0000000000000.00', ''],
>  ['XXXXXXXXX XXXX', ' 0000000000000.00', ' 0000000088115.39', ''],
>  ['XXXXXXXXX XXXX', ' 0000000088115.39', ' 0000000000000.00', ''],
>  ['XXXXXXXXX XXXX', ' 0000000000000.00', ' 0000000088335.39', ''],
>  ['XXXXXXXXX XXXX', ' 0000000088335.39', ' 0000000000000.00', ''],
>  ['XXXXXXXXX XXXX', ' 0000000000000.00', ' 0000000090443.29', ''],
>  ['XXXXXXXXX XXXX', ' 0000000090443.29', ' 0000000000000.00', ''],
>  ['XXXXXXXXX XXXX', ' 0000000090453.29', ' 0000000000000.00', ''],
>  ['XXXXXXXXX XXXX', ' 0000000090453.29', ' 0000000000000.00', '']]
>
> Now you just have to write the result back to a file.
>
> Two problems remain: unpaired values are not detected and duplicate values
> destroy the right-left-right-left alteration, too.
>
> Bonus: Python's list.sort() method is "stable", it doesn't affect the
> relative order of items in the list that compare equal. This allows to
> replace one sort with a complex key with two sorts with simpler keys:
>
>>>> rows.sort(key=lambda row: row[1])
>>>> rows.sort(key=lambda row: max(row[1:]))
>
>
> _______________________________________________
> Tutor maillist  -  Tutor at python.org
> To unsubscribe or change subscription options:
> http://mail.python.org/mailman/listinfo/tutor
>

Thanks a lot peter for you time to explain this to me , i do a fast
read of everything and i get almost everything , i will re-read later
and play with some test for better understand , thanks again !