[Tutor] decomposing a problem

Avi Gross avigross at verizon.net
Fri Dec 28 15:34:19 EST 2018


Steve,

I am going to just respond to one part of your message and will snip the
rest. I am not in disagreement with most of what you say and may simply
stress different aspects. I will say that unless I have reason to, I don't
feel a need to test speeds for an academic discussion. Had this been a real
project, sure. Even then, if it will need to run on multiple machines using
multiple incarnations of python, the results will vary, especially if the
data varies too. You suggest that discussions backed by real data are
better. Sure. But when a discussion is abstract enough, I think it is
perfectly reasonable to say "may be faster" to mean that, until you try it,
there are few guarantees. Many times a method seems superior until you reach
a pathological case. One sorting algorithm is fast except when the data is
almost fully sorted already.

So why do I bother saying things like MAY? It seems to be impossible to
please everybody. There are many things with nuance and exceptions. When I
state things one way, some people (often legitimately) snipe. When I don't
insist on certainty, others have a problem with that. When I make it short, I
am clearly leaving many things out. When I go into as much detail as I am
aware of, I get feedback that it is too long or boring or it wanders too
much. None of this is a problem as much as a reality about tradeoffs.

So before I respond, here is a general statement. I am NOT particularly
interested in much of what we discuss here from a specific point of view.
Someone raises a question and I think about it. They want to know of a
better way to get a random key from a dictionary. My thought is that if I
needed that random key, maybe I would not have stored it in a dictionary in
the first place. But, given that the data is in a dictionary, I wonder what
could be done. It is an ACADEMIC discussion with a certain amount of hand
waving. Sometimes I do experiment and show what I did. Other times I say I
am speculating and if someone disagrees, fine. If they show solid arguments
or point out errors on my part or create evidence, they can change my mind. 
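
(As a throwaway illustration of the sort of academic answer I mean, one
simple way, assuming the dictionary fits comfortably in memory, is:

import random

# materialize the keys as a list, then pick one at random
mydict = {"a": 1, "b": 2, "c": 3}
random_key = random.choice(list(mydict))

Nothing clever, and other approaches may well be better; it is just the kind
of hand waving I mean.)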

You (Steve) are an easy person to discuss things with, but there are some who
are less so. People who have some idea of my style and understand the kind of
discussion I am having at that point and who let me understand where they
are coming from, can have a reasonable discussion. The ones who act like TV
lawyers who hear that some piece of evidence has less than one in a
quadrillion chance of happening and then say BUT THERE IS A CHANCE, so
reasonable doubt ... are hardly worth debating.

You replied to one of my points with this about a way to partition data:

---
The obvious solution:

import random

# shuffle the keys, then split roughly 3/4 for training and 1/4 reserved
keys = list(mydict.keys())
random.shuffle(keys)
index = len(keys)*3//4
training_data = keys[:index]
reserved = keys[index:]
---

(In the above, "---" is not python but a separator!)

That is indeed a very reasonable way to segment the data. But it sort of
makes my point. If the data is stored in a dictionary, the way to access it
ended up being to make a list and play with that. I would still need to get
the values one at a time from the dictionary, such as in the ways you also
showed and which I omit here.
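
Something like the following, reusing the mydict and training_data names from
your snippet (a sketch, not the only way):

# pull the values back out of the dictionary, one key at a time
training_values = [mydict[k] for k in training_data]

# or keep the key/value pairing intact with a dict comprehension
training_subset = {k: mydict[k] for k in training_data}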

For me, it seems more natural in this case to simply have the data in a data
frame, where I have lots of tools and methods available. Yes, underneath it
all, providing an array of indices or True/False Booleans to index the data
frame can be slow, but it feels more natural. Yes, python has additional
paradigms I may not have used in R such as list comprehensions and
dictionary comprehensions that are conceptually simple. But I did use the
R-onic (to coin a phrase nobody would ironically use) equivalents that can
also be powerful and I need not discuss here in a python list. Part of
adjusting to python includes unlearning some old habits and attitudes and
living off this new land. [[Just for amusement, the original R language was
called S so you might call its way of doing things Sonic.]]
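
If the data lived in, say, a pandas data frame (pandas is my assumption here,
and the column names are invented), the same split might look like:

import numpy as np
import pandas as pd

# invented example data; any data frame with rows to split would do
df = pd.DataFrame({"zip_code": ["02134", "10001", "60601", "94105"],
                   "value": [1.0, 2.0, 3.0, 4.0]})

# a True/False mask marking roughly 3/4 of the rows as training data
mask = np.random.rand(len(df)) < 0.75
training = df[mask]
reserved = df[~mask]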

I see a balance between various ways the data is used. Clearly it is
possible to convert it between forms and for reasonable amounts of data it
can be fast enough. But as you note, at some point you can just toss one
representation away, so maybe you need not bother using it in the first
place. Keep it simple.

In many real-life situations, you are storing many units of data and often
have multiple ways of indexing the data. There are representations that do
much of the work for you. Creating a dictionary where each item is a list or
other data structure can emulate such functionality and even have advantages,
but if your coding style is more comfortable with another way, why bother,
unless you are trying to learn other ways and be flexible?
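
As a tiny sketch of that dictionary-of-lists idea (the keys and fields here
are invented for illustration):

# each key maps to a list holding one record's fields; the field order
# has to be remembered by convention (first name, last name, zip code)
people = {
    "alice": ["Alice", "Smith", "02134"],
    "bob": ["Bob", "Jones", "10001"],
}

# index "by name" and then "by position": the zip code is field 2
alice_zip = people["alice"][2]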

As I have mentioned too many times, my most recent work was in R and I
sometimes delight and other times groan at the very different ways some
things are done when using specific modules or libraries. But even within a
single language and environment there are radically different approaches. The
naked (albeit evolving) languages often offer a reasonable way to do things
but developers often add new layers and paradigms that can become the more
standard way of doing things. Base python can do just about anything with
just lists. All you have to remember is that you stored the zip code in the
23rd element. But programmers created things like named tuples or
specialized objects like dataframes that may be easier or more powerful or
less prone to some kinds of errors or whatever.
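
A quick sketch of what I mean by that last contrast (the field names are
invented):

from collections import namedtuple

# a plain list forces you to remember positions, e.g. "the zip code is
# element 2"; a named tuple gives the field a name instead
Person = namedtuple("Person", ["first", "last", "zip_code"])

p = Person("Alice", "Smith", "02134")
print(p.zip_code)   # clearer and less error prone than p[2]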




