[Tutor] decomposing a problem

Avi Gross avigross at verizon.net
Wed Dec 26 11:02:07 EST 2018


[REAL SUBJECT: Beats me]
Steven,

I often find that I try to make a main point and people then focus on
something else, like an example.

So, do we agree on the main point that choosing a specific data structure or
algorithm (or even computer language) too soon can lead to problems that can
be avoided if we first map out the problem and understand it better?

So when someone asks a question here like "how do I get a data structure
to do this?", the answer might include asking whether that is the right
data structure for the job. And yes, there may be no single RIGHT choice,
only degrees of fit that include subjective components.

The rest of your post really is a rehash of the periodic efficiency
discussions.

I do not concede that efficiency can be ignored because computers are fast.
I do concede that optimizing is often not worth the effort, that you can
inadvertently make things worse, and that there are tradeoffs.

Let me be specific. The side topic was how to get a random key from an
existing dictionary. If you do this ONCE, it may be no big deal to make a
list of all the keys, index it by a random number, and move on. I did
supply a solution that might (or might not) run faster by using a generator
to get one item at a time and stopping when found: less space, but not
necessarily less time.
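To make that concrete, here is a minimal sketch of both approaches (the
dictionary d is just a made-up example):

    import random

    d = {f"key{i}": i for i in range(1000)}  # hypothetical sample data

    # Approach 1: materialize every key into a list, index at random.
    key1 = random.choice(list(d))

    # Approach 2: pick a random position, then walk the keys one at a
    # time and stop when we reach it -- less peak memory, maybe not
    # less time.
    n = random.randrange(len(d))
    for i, k in enumerate(d):
        if i == n:
            key2 = k
            break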

But what I often need to do is to segment lots of data into two piles: one
for training some machine learning algorithm, with the remainder reserved
for verification. The split must be random or the entire project may become
meaningless. So if your data structure is a dictionary whose key names are
promptly abandoned, you cannot just call popitem() umpteen times and expect
random results, as items may come out in a very specific order. If you want
75% of the data in the training section and 25% reserved, and you have
millions of records, what is a good way to go? Worse, some algorithms, such
as random forests, take numerous random samples for each tree as they go
along.
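For that use case, a straightforward sketch (all names here are
illustrative) samples the keys once instead of popping repeatedly:

    import random

    data = {f"record{i}": i for i in range(1_000_000)}  # stand-in data

    # Sample 75% of the keys once, rather than popping repeatedly.
    keys = list(data)
    train_keys = set(random.sample(keys, k=int(0.75 * len(keys))))

    training = {k: data[k] for k in train_keys}
    holdout = {k: v for k, v in data.items() if k not in train_keys}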

So you can reasonably wonder whether the dictionary is the best way to go
if there is no faster way to get random records. And if you are told that,
while doing this, you also need to allow simultaneous access to the data
structure, with new key/value pairs being inserted, modified, or removed,
you may wonder whether other choices would be best overall.

One quite obvious solution is to not create a full list each and every time
you want one random key. There are many ways to arrange either to have the
list of keys standing around all the time or to have it created on first
need and kept around as long as needed. A simple generator that stays
around can do that. Asking for two million keys one after another,
rebuilding the list from scratch each time, can slow things down even as
garbage collection works overtime. Other solutions might simply copy the
dictionary into a data frame or whatever it takes. There may be merits to
each solution as well as drawbacks. In any case, spending too much time on
it may not be worthwhile. And as is often noted, using something already in
existence, often already highly optimized and debugged, is often a deciding
factor.
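As one illustration of the build-once-and-keep-it idea, here is a small
hypothetical helper that caches the key list and rebuilds it only when the
dictionary changes size (a crude invalidation test, purely for
illustration):

    import random

    class RandomKeyPicker:
        """Cache the key list so repeated random picks avoid rebuilding it."""

        def __init__(self, mapping):
            self.mapping = mapping
            self._keys = None
            self._size = -1

        def pick(self):
            # Rebuild the cached list only when the dict has changed size.
            # (A real version would need better invalidation.)
            if self._keys is None or len(self.mapping) != self._size:
                self._keys = list(self.mapping)
                self._size = len(self.mapping)
            return random.choice(self._keys)

    d = {i: i * i for i in range(100)}
    picker = RandomKeyPicker(d)
    print(picker.pick())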

I believe I mentioned that an implementation that creates a list from a
generator may be faster than your own algorithm, even if yours appears more
efficient in some way. My earlier mention is an example. I agree that the
implementation of lists supports indexed access and that fetching the nth
item is rapid. I was referring to algorithms that generate one item at a
time, finding the target on average halfway through and, in the worst case,
reading the entire sequence. In theory they can be faster than creating the
entire list at once, but in practice maybe not, and in any case this may
not be the best place to optimize.
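A sketch of that one-at-a-time style, using itertools.islice to pull the
nth key without building the full list (again, illustrative only):

    import itertools
    import random

    d = {f"key{i}": i for i in range(1_000_000)}  # illustrative only

    # Advance an iterator n steps and take the next key:
    # O(n) time on average, O(1) extra space, worst case reads every key.
    n = random.randrange(len(d))
    key = next(itertools.islice(iter(d), n, None))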

I tend to avoid extremes, so let me make one thing very clear.

I neither like it when people make snap decisions about how to do something
with software and refuse to reconsider when reality intervenes NOR do I like
it when they need endless (often fairly meaningless) steps to plan it all
out so they may never actually start it. I like a prototyping approach with
flexibility. Do some planning, try some things out, perhaps with a set of
tools that lets you put a skeleton of the idea together to see if it fits.
If not, adjust. When it looks reasonable, flesh it out including lots more
rigor and bulletproofing.

For this forum, the questions that come in range from people who have no
idea how to even start a project to those who are well along and get stuck
at some point. The former may need help in making a general plan of attack.
The latter may just need someone to point out an error or design flaw. One
size does not fit all.

And, as Steven points out, there are tradeoffs. For a student exercise
intended to teach a feature of the language, it is not that important that
the function they write checks that it was called with sensible arguments,
like a positive number when asked how many ice cream scoops are wanted. For
production code, much more may be required. And extremely clever solutions,
offered but perhaps not even understood, often defeat the purpose.
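As a contrast with the student version, a hypothetical production-minded
function might validate its argument before doing any work:

    def order_scoops(count):
        """Return a confirmation, validating the argument first."""
        if not isinstance(count, int):
            raise TypeError(f"count must be an int, got {type(count).__name__}")
        if count <= 0:
            raise ValueError("count must be a positive number of scoops")
        return f"Order placed for {count} scoop(s)."

    print(order_scoops(3))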

-----Original Message-----
From: Tutor <tutor-bounces+avigross=verizon.net at python.org> On Behalf Of
Steven D'Aprano
Sent: Wednesday, December 26, 2018 12:45 AM
To: tutor at python.org
Subject: Re: [Tutor] decomposing a problem

On Tue, Dec 25, 2018 at 11:56:21PM -0500, Avi Gross wrote:

> I find that many people are fairly uncomfortable with abstraction and 
> tend to resist a pure top down approach by diving to any solutions 
> they may envision.

https://blog.codinghorror.com/it-came-from-planet-architecture/

> As someone asked on another python list, is there a better way to get 
> a random key for a dictionary. Well, not easily without expanding all 
> keys into a list of perhaps huge length.

Define "better".

What do you value? Time, space, simplicity or something else?

One of the most harmful things to value is "cleverness" for its own sake.
Some people tend to value a "clever" solution even when it wastes time,
space and is over complex and therefore hard to maintain or debug.

Even when the practical response to the "clever" solution is "YAGNI".

What counts as "huge"? To me, picking a random key from a list of 100 keys
is "huge". Copy out 100 keys to a list by hand and then pick one? 
What a PITA that would be.

But to your computer, chances are that ten million keys is "small". One
hundred million might be pushing "largish". A billion, or perhaps ten
billion, could be "large". Fifty, a hundred, maybe even a thousand billion
(a trillion) would be "huge".

Unless you expect to be handling at least a billion keys, there's probably
no justification for anything more complex than:

    random.choice(list(dict.keys()))

Chances are that it will be faster *and* use less memory than any clever
solution you come up with -- and even if it does use more memory, it uses it
for a few milliseconds, only when needed, unlike a more complex solution
that inflates the size of the data structure all the time, whether you need
it or not.

Of course there may be use-cases where we really do need a more complex,
clever solution, and are willing to trade off space for time (or sometimes
time for space). But chances are YAGNI.


> Followed by a search of much of that list to get the nth index.

That's incorrect. Despite the name, Python lists aren't linked lists[1]
where you have to traverse N items to get to the Nth item. They're arrays,
where indexing requires constant time.
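One rough way to see that for yourself is a timing sketch like this (not a
rigorous benchmark):

    import timeit

    small = list(range(1_000))
    large = list(range(10_000_000))

    # Indexing the last element costs about the same for both lists,
    # which is what array-backed (not linked) storage predicts.
    print(timeit.timeit(lambda: small[-1], number=1_000_000))
    print(timeit.timeit(lambda: large[-1], number=1_000_000))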


[...]
> If they keep demanding one function to master all, you can end up with 
> fairly awful spaghetti code.

https://en.wikipedia.org/wiki/God_object




[1] Technically speaking, this is not a requirement of the language, only a
"quality of implementation" question. A Python interpreter could offer
built-in lists using linked lists under the hood, with O(N) indexing. But
all the major implementations -- CPython, Stackless, PyPy, Jython,
IronPython, Cython, Nuitka, even (I think) MicroPython -- use arrays as the
list implementation. Given how simple arrays are, I think it is fair to
assume that any reasonable Python interpreter will do the same.



--
Steve