Limitation of os.walk

Steven D'Aprano steven at REMOVE.THIS.cybersource.com.au
Wed May 12 23:29:00 EDT 2010


On Wed, 12 May 2010 19:04:47 +0000, kj wrote:

> In <mailman.82.1273630064.32709.python-list at python.org> Terry Reedy
> <tjreedy at udel.edu> writes:
> 
>>On 5/11/2010 3:49 PM, kj wrote:
>>> PS: I never understood why os.walk does not support hooks for key
>>> events during such a tree traversal.
> 
>>Either 1) it is intentionally simple, with the expectation that people
>>would write there own code for more complicated uses or 2) no one has
>>submitted a 'full-featured' version or 3) both.
> 
> I hope it's not (1): I want the language I use to include more
> "batteries" not fewer.

How many more? Ten? Twenty? A hundred? Ten thousand? Ten million? Where 
do you draw the line?

Every extra battery is (say) a hundred extra lines of code, which means 
fifty thousand more potential bugs. It means more burden on the 
maintainers, and more burden on people learning the language. More 
batteries can make it more, not less, difficult to code:

# Python 1.1
"I need a collection of values, should I use a list, a tuple or a dict?"

vs

# Python 5000:
"I need a collection of values, should I use a list, a tuple, a dict, an 
array, a bag, a SortedBag, a table, a SortedTable, a NamedTuple, a 
Record, a stack, a queue, a deque, a linked list, a HashTable, a doubly-
linked list, a set, a frozenset, a skip list, a jump list, a binary tree, 
a  B tree, a B+ tree, a SortedList, a red-black tree, an A* tree, a B* 
tree, a SortedList, a StringList, a CaseInsensitiveSortedStringList, a 
CaseInsensitiveRedBlackTree, a HashSet, a CaseInsensitiveHashSet, an 
ArrayList, a ConcurrentQueue, a ConcurrentStack, a KeyedCollection, an 
EnumerableBag, a MultiSet, a StringMultiSet, a SortedSet, a 
DoubleEndedQueue, a Buffer, a CircularQueue, a heap, a binary search 
tree, an AssociatedArray, a Vector, a SparseArray, an XOR-linked list, a 
weight-balanced tree, a RandomizedBinarySearchTree, a ThreadedBinaryTree, 
an AVL tree, a splay tree, a rope, a binomial heap, a fibonacci heap, a 
trie, a B-trie, a judy array, an and/or tree, an enfilade, a treap, a 
dancing tree, a queap, or a fusion tree?" (Whew!)


There's nothing wrong with the existence of all these data structures. If 
you need one, you can use it. But they don't all need to be included in 
the standard library. That just increases the cognitive load on 
programmers without really adding that much benefit. I've seen people 
stress over the choice of a tuple or a list.



> <petpeeve>It seems that a similar "simplicity argument" was invoked to
> strip the cmp option from sort in Python 3.  Grrrr.  Simplicity is
> great, but when the drive for it starts causing useful functionality to
> be thrown out, then it is going too far.  Yes, I know that it is
> possible to do everything that sort's cmp option does through clever
> tricks with key, 

Not that clever. In general, key-based sorting is simpler than cmp-based 
sorting. In addition, the use of a key function is a basic technique 
which every coder should have in their mental toolbox, and it is 
applicable to more than just sorting. E.g. max() and min() also support 
key functions in Python 2.6 and better. On the other hand a cmp function 
is specific to sorting, and nothing but sorting.



> but grokking and coding this maneuver requires a *lot*
> more Python-fu than cmp did, which makes this functionality a lot less
> accessible to beginners that the intrinsic complexity of the problem
> warrants.

I disagree that key-based sorting requires any more Python-fu than cmp 
sorting. I believe they're equivalent. There are cmp functions you can 
write which aren't (easily?) convertible to key, but most of them (all?) 
aren't useful or sensible. E.g. they rely on side-effects, don't define a 
sensible sort order, or break stability of sorting:

>>> text = "here is a number of words in some order".split()
>>>
>>> sorted(text, cmp=lambda a,b: cmp(a.upper(), b.lower()))
['order', 'some', 'in', 'words', 'of', 'number', 'a', 'is', 'here']
>>>
>>> sorted(text[2:] + text[:2], cmp=lambda a,b: cmp(a.upper(), b.lower()))
['is', 'here', 'order', 'some', 'in', 'words', 'of', 'number', 'a']


Key-based sorting doesn't let you do this, but since I can't think of any 
application where you want the sort order to be dependent on the initial 
order, I believe this is a good thing. It is much harder to write buggy 
sorts using key than with cmp.



> And for what?  To get rid of an *option* that could be easily
> disregarded by anyone who found it too "complex"? It makes no sense to
> me.</petpeeve>


It's not that cmp is too complex, but that it's mere existence adds 
complexity to the list data type, it is usually slow and inefficient, and 
it can introduce subtle sorting bugs.


In the past, I've suggested that if people really believe that there is a 
use-case for sorting with a comparison function, they should fork it from 
the built-in list object and release it on PyPI as a third-party package. 
I still believe this.



-- 
Steven



More information about the Python-list mailing list