[Tutor] improve the code

Peter Otten __peter__ at web.de
Fri Nov 4 09:10:42 CET 2011


lina wrote:

> On Wed, Nov 2, 2011 at 12:14 AM, Peter Otten <__peter__ at web.de> wrote:
>> lina wrote:
>>
>>>> sorted(new_dictionary.items())
>>>
>>> Thanks, it works, but there is still a minor question,
>>>
>>> can I sort based on the general numerical value?
>>>
>>> namely not:
>>> :
>>> :
>>> 83ILE 1
>>> 84ALA 2
>>> 8SER 0
>>> 9GLY 0
>>> :
>>> :
>>>
>>> rather 8 9 ...83 84,
>>>
>>> Thanks,
>>
>> You need a custom key function for that one:
>>
>>>>> import re
>>>>> def gnv(s):
>> ...     parts = re.split(r"(\d+)", s)
>> ...     parts[1::2] = map(int, parts[1::2])
>> ...     return parts
>> ...
>>>>> items = [("83ILE", 1), ("84ALA", 2), ("8SER", 0), ("9GLY", 0)]
>>>>> sorted(items, key=lambda pair: (gnv(pair[0]), pair[1]))
>> [('8SER', 0), ('9GLY', 0), ('83ILE', 1), ('84ALA', 2)]
> 
> 
> Thanks, I can follow the procedure and get the exact results, but
> still don't understand this part
> 
> parts = re.split(r'"(\d+)",s)
> 
> r"(\d+)", sorry,
> 
>>>> items
> [('83ILE', 1), ('84ALA', 2), ('8SER', 0), ('9GLY', 0)]
> 
> 
>>>> parts = re.split(r"(\d+)",items)
> Traceback (most recent call last):
>   File "<pyshell#78>", line 1, in <module>
>     parts = re.split(r"(\d+)",items)
>   File "/usr/lib/python3.2/re.py", line 183, in split
>     return _compile(pattern, flags).split(string, maxsplit)
> TypeError: expected string or buffer

I was a bit lazy and hoped you would accept the gnv() function as a black 
box...
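
As for the traceback: re.split() expects a single string as its second 
argument, but items is a list of (name, count) tuples. A minimal sketch of 
the difference, assuming the items list from your session (split_names and 
error are just illustrative names):

```python
import re

items = [("83ILE", 1), ("84ALA", 2), ("8SER", 0), ("9GLY", 0)]

# Passing the whole list fails: re.split() wants one string, not a list.
try:
    re.split(r"(\d+)", items)
except TypeError as exc:
    error = exc
    print("TypeError:", exc)

# Splitting each name separately works.
split_names = [re.split(r"(\d+)", name) for name, count in items]
print(split_names)
```

That is why gnv() takes one string at a time and leaves the looping over 
the list to sorted().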

Here's a step-through:
re.split() takes two arguments: a pattern that describes where to split, and 
the string to split. In the following example the pattern is the character 
"_":

>>> re.split("_", "alpha_beta___gamma")
['alpha', 'beta', '', '', 'gamma']

You can see that this simple form works just like 
"alpha_beta___gamma".split("_"), and finds an empty string between two 
adjacent "_". If you want both "_" and "___" to work as a single separator 
you can change the pattern to "_+", where the "+" means one or more of the 
previous:

>>> re.split("_+", "alpha_beta___gamma")
['alpha', 'beta', 'gamma']

If we want to keep the separators, we can wrap the whole expression in 
parens:

>>> re.split("(_+)", "alpha_beta___gamma")
['alpha', '_', 'beta', '___', 'gamma']

Now for the step that is less obvious: we can change the separator from a 
run of "_" to a run of digits. Regular expressions have two ways to spell 
"any digit": [0-9] or \d:

>>> re.split("([0-9]+)", "alpha1beta123gamma")
['alpha', '1', 'beta', '123', 'gamma']

I chose the other spelling, \d (which will also accept non-ASCII digits):

>>> re.split(r"(\d+)", "alpha1beta123gamma")
['alpha', '1', 'beta', '123', 'gamma']

At this point we are sure that the list alternates between two kinds of 
strings: non-integer str, integer str, ..., non-integer str. The first and 
the last item are always a non-integer str (possibly an empty one).
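
To see why the first and last entries are always non-integer strings, it 
helps to try inputs that start or end with digits. A small sketch 
(split_digits is a name made up for this illustration):

```python
import re

def split_digits(text):
    """Split text on runs of digits and keep the digit runs in the result."""
    return re.split(r"(\d+)", text)

print(split_digits("83ILE"))   # ['', '83', 'ILE']
print(split_digits("ILE83"))   # ['ILE', '83', '']
print(split_digits("ILE"))     # ['ILE']
```

An empty string fills in whenever the digits sit at the very start or end, 
so the digit runs always land at the odd indices.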

>>> parts = re.split(r"(\d+)", "alpha1beta123gamma")

So

>>> parts[1::2]
['1', '123']

will always give us the parts that can be converted to an integer

>>> parts
['alpha', '1', 'beta', '123', 'gamma']
>>> parts[1::2] = map(int, parts[1::2])
>>> parts
['alpha', 1, 'beta', 123, 'gamma']

We need to do the conversion because strings won't sort the way we like:

>>> sorted(["2", "20", "10"])
['10', '2', '20']
>>> sorted(["2", "20", "10"], key=int)
['2', '10', '20']

We now have the complete gnv() function

>>> def gnv(s):
...     parts = re.split(r"(\d+)", s)
...     parts[1::2] = map(int, parts[1::2])
...     return parts
...

and can successfully sort a simple list of strings like

>>> values = ["83ILE", "84ALA", "8SER", "9GLY"]
>>> sorted(values, key=gnv)
['8SER', '9GLY', '83ILE', '84ALA']

The sorted() function calls gnv() once for every item in the list and uses 
the results to determine the order of the items. In old Python versions, 
where list.sort() had no key argument, you had to do this manually with 
"decorate, sort, undecorate":

>>> decorated = [(gnv(item), item) for item in values]
>>> decorated
[(['', 83, 'ILE'], '83ILE'), (['', 84, 'ALA'], '84ALA'), (['', 8, 'SER'], 
'8SER'), (['', 9, 'GLY'], '9GLY')]
>>> decorated.sort()
>>> decorated
[(['', 8, 'SER'], '8SER'), (['', 9, 'GLY'], '9GLY'), (['', 83, 'ILE'], 
'83ILE'), (['', 84, 'ALA'], '84ALA')]
>>> undecorated = [item for key, item in decorated]
>>> undecorated
['8SER', '9GLY', '83ILE', '84ALA']
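
The three steps can be wrapped in a helper. This is only a sketch of the 
idiom; dsu_sorted is a hypothetical name, not something from the standard 
library:

```python
import re

def gnv(s):
    """Split s into text and integer parts so numbers compare numerically."""
    parts = re.split(r"(\d+)", s)
    parts[1::2] = map(int, parts[1::2])
    return parts

def dsu_sorted(values, key):
    """Sort values by key(value) using decorate, sort, undecorate."""
    decorated = [(key(item), item) for item in values]  # decorate
    decorated.sort()                                    # sort by (key, item)
    return [item for _, item in decorated]              # undecorate

values = ["83ILE", "84ALA", "8SER", "9GLY"]
print(dsu_sorted(values, gnv))
```

Note that with equal keys the tuples fall back to comparing the items 
themselves, which is a small difference from sorted(values, key=gnv).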

For your actual data 

>>> items
[('83ILE', 1), ('84ALA', 2), ('8SER', 0), ('9GLY', 0)]

you need to extract the first item from each (x, y) pair:

>>> def first_gnv(item):
...     return gnv(item[0])
...
>>> first_gnv(("83ILE", 1))
['', 83, 'ILE']

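But what if there are items with the same x? sorted() is stable, so items 
with equal keys simply keep their input order, and the result depends on 
how the input happened to be ordered: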

>>> sorted([("83ILE", 1), ("83ILE", 2)], key=first_gnv)
[('83ILE', 1), ('83ILE', 2)]
>>> sorted([("83ILE", 2), ("83ILE", 1)], key=first_gnv)
[('83ILE', 2), ('83ILE', 1)]

Let's take y into account, too:

>>> def first_gnv(item):
...     return gnv(item[0]), item[1]
...
>>> sorted([("83ILE", 1), ("83ILE", 2)], key=first_gnv)
[('83ILE', 1), ('83ILE', 2)]
>>> sorted([("83ILE", 2), ("83ILE", 1)], key=first_gnv)
[('83ILE', 1), ('83ILE', 2)]

We're done!

>>> sorted(items, key=first_gnv)
[('8SER', 0), ('9GLY', 0), ('83ILE', 1), ('84ALA', 2)]
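
Putting it all together for a dictionary like the one you started with. 
The contents of new_dictionary below are an assumption, made up from the 
four sample lines in your post:

```python
import re

def gnv(s):
    """Split s into text and integer parts so numbers compare numerically."""
    parts = re.split(r"(\d+)", s)
    parts[1::2] = map(int, parts[1::2])
    return parts

def first_gnv(item):
    """Key for (name, count) pairs: numeric-aware name first, then count."""
    return gnv(item[0]), item[1]

# Hypothetical stand-in for the real new_dictionary.
new_dictionary = {"83ILE": 1, "84ALA": 2, "8SER": 0, "9GLY": 0}

ordered = sorted(new_dictionary.items(), key=first_gnv)
for name, count in ordered:
    print(name, count)
```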

(If you look back into my previous post, can you find the first_gnv() 
function?)


