[Tutor] Response to responses about list of lists: a meta exercise in mailinglist recursion

Wed Jul 14 02:19:34 CEST 2010

On Tue, 13 Jul 2010 10:40:44 pm Siren Saren wrote:
> I'm not sure if there's a way to submit responses 'live' or
> whether it's better to respond to subthreads at once or together, so
> I'll err on the side of discretion and just send one response. 

Generally it's better, or at least more common, to respond to each 
response individually. But that's mostly because people on mailing 
lists receive individual pieces of mail instead of a single giant 
digest containing the entire day's email traffic.

> In response to the first question: the consensus seems to be
> that there is no good way to sort a non-alternating one-to-many list
> like this, so my strategy of deriving the index numbers of every
> item, as awkward as it appears, may actually be the best approach.

No, a better approach is to split the list into separate lists:

mixedlist = [
 'Crime and punishment', 10, 40, 30, 
 "Brother's Karamazov", 22, 55, 9000, 
 "Father's and Sons", 100,
 'Anna Karenina', 1, 2, 4, 7, 9,
]

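# untested
current_book = None
current_pages = []
result = []
for item in mixedlist:
    # Is it a title, or an integer page?
    if isinstance(item, str):
        # It's a title. Save the previous book (if any) before starting
        # a new one.
        if current_book is not None:
            result.append( (current_book, current_pages) )
        current_book = item
        current_pages = []
    else:
        # It's a page number.
        current_pages.append(item)
# Don't forget the last book: no title follows it to trigger the append.
if current_book is not None:
    result.append( (current_book, current_pages) )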

This will split the mixed list of titles and page numbers into a 
consolidated list of (title, list-of-page-numbers) like this:

booklist = [
 ("Crime and punishment", [10, 40, 30]),
 ("Brother's Karamazov", [22, 55, 9000]),
 ("Father's and Sons", [100]),
 ("Anna Karenina", [1, 2, 4, 7, 9]),
]

It's easy to adapt it to use a dictionary instead of a list. Change 
result to an empty dict {} instead of an empty list [], and change 
every occurrence of:

result.append( (current_book, current_pages) )

into:

result[current_book] = current_pages
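
Here is a minimal sketch of that dictionary variant wrapped in a 
function, written for Python 3. The name collate_to_dict is my own 
invention for illustration. It assumes titles are unique (a repeated 
title would overwrite the earlier entry) and that the list starts with 
a title. It also stores each page list in the dict as soon as the 
title is seen, so there is no separate "save the previous book" step.

```python
def collate_to_dict(mixedlist):
    """Group page numbers under the title that precedes them.

    Hypothetical helper for illustration; assumes unique titles.
    """
    result = {}
    current_pages = None
    for item in mixedlist:
        if isinstance(item, str):
            # A title: start a fresh page list and register it right away,
            # so the final book needs no special handling after the loop.
            current_pages = []
            result[item] = current_pages
        elif current_pages is None:
            # A page number with no title before it is malformed input.
            raise ValueError("page number %r appears before any title" % (item,))
        else:
            current_pages.append(item)
    return result


mixedlist = [
    'Crime and punishment', 10, 40, 30,
    "Brother's Karamazov", 22, 55, 9000,
    "Father's and Sons", 100,
    'Anna Karenina', 1, 2, 4, 7, 9,
]
books = collate_to_dict(mixedlist)
print(books['Anna Karenina'])
```

With a dict you can look up a book's pages directly by title instead 
of scanning the whole list.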

One other thing... it's possible that your data is provided to you in 
text form, so that you have a single string like:

"Crime and punishment, page 10, page 40, page 30, ..."

instead of the more useful list of titles and integers. Some people have 
suggested using regular expressions to process the page numbers. That's 
fine, but it's rather overkill, like using a bulldozer to move a 
shovelful of dirt. Here's an (untested) filter to convert the one long 
string into a list of titles and integer pages:

result = []
for item in long_string.split(','):
    item = item.strip()  # get rid of leading and trailing spaces
    if item.startswith('page '):
        item = int(item[5:])
    result.append(item)

This assumes the data is well-formed, there are no typos such as "pgae 
15" in the data, and most importantly, no commas in the book titles.

This can be written as a two-liner, at the cost of some readability:

items = [item.strip() for item in long_string.split(',')]
result = [int(s[5:]) if s.startswith('page ') else s for s in items]

Making it into a one-liner is left as an exercise for the masochistic.

This demonstrates the basic principle of data processing of this kind. 
You start with a bunch of data in one form, and you transform it into 
another, slightly different and more useful form, using a series of 
individual filters or transformation functions:

Start with one long string.
Divide it into a list of substrings.
Transform it into a list of strings and integers.
Collate the integers with the string they belong to.
Do something useful with the pairs of (string, list-of-integers)
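
To make those steps concrete, here is a rough sketch that chains them 
together, one small function per step, written for Python 3. The 
function names (split_items, convert_pages, collate) and the sample 
string are made up for illustration. It makes the same assumptions as 
before: well-formed data and no commas in the titles.

```python
def split_items(long_string):
    # Step 2: divide the long string into a list of stripped substrings.
    return [item.strip() for item in long_string.split(',')]


def convert_pages(items):
    # Step 3: turn 'page N' substrings into integers, leave titles alone.
    return [int(s[5:]) if s.startswith('page ') else s for s in items]


def collate(mixed):
    # Step 4: pair each title with the page numbers that follow it.
    result = []
    current_book = None
    current_pages = []
    for item in mixed:
        if isinstance(item, str):
            if current_book is not None:
                result.append((current_book, current_pages))
            current_book = item
            current_pages = []
        else:
            current_pages.append(item)
    # The last book has no following title, so save it here.
    if current_book is not None:
        result.append((current_book, current_pages))
    return result


# Step 1: start with one long string (invented sample data).
long_string = ("Crime and punishment, page 10, page 40, page 30, "
               "Anna Karenina, page 1, page 2")

booklist = collate(convert_pages(split_items(long_string)))

# Step 5: do something useful, here printing each title with sorted pages.
for title, pages in booklist:
    print(title, sorted(pages))
```

Keeping each step in its own function means you can test or replace 
one stage without touching the others.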

[...]
> I apologize for using 'reply,' I've never used a mailing
> list before and didn't understand what would happen.

Using reply is fine, but trim your response! Think of it like this. 
Suppose you received dozens of letters each day, from many different 
people, but the post office consolidated it all into a single envelope 
and delivered it once a day (a "digest"). If you wanted to reply to 
something from Fred, you wouldn't photocopy the entire day's collection 
of mail, all two hundred letters, and stick it at the bottom of your 
reply, would you? Well, that's what your mail program does, by default. 
It only takes a second to delete that excess baggage from your reply 
before hitting send.

> Is there some 
> online forum where I could post a message directly rather than
> mailing it in?

I don't think there's an online forum, because this is an email mailing 
list, not a web forum.

If you're starting a new discussion, or raising a new question, make a 
fresh, blank email, put a descriptive title in the subject line, and 
put tutor at python.org as the To address. 

If you're replying to an existing message, using reply is fine, but just 
trim the quoted text. You have a delete key -- learn how to use it :)

> I see that other people are somehow responding to my message from the
> more real-time updates I can get on activestate, but I don't know how
> they are doing it since I haven't received the mailing yet that would
> include my message and its responses.

That's because they'll be subscribed directly to the tutor mailing list, 
and getting individual pieces of mail as they're sent rather than a 
queued up and consolidated digest.

> If my list had a million books in it and 10 million page
> numbers, would the approach I've outlined in my initial post be the
> best for sorting them? 

Ten million items isn't much for modern computers with gigabytes of 
memory. It's *approaching* "much", but hasn't quite reached it yet, and 
for most applications, it doesn't matter if it takes 2 seconds to 
pre-process your data instead of 0.5 second, so long as that's a 
one-off cost. If you have to do it again and again, that's another 
thing!
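
If you'd rather measure than guess, something like the following 
sketch will time the collation step on synthetic data. The helper 
names (make_mixedlist, collate, time_collate) and the generated 
"Book N" titles are invented for the example, and it needs Python 3.3 
or later for time.perf_counter. It runs a small case here. Raise the 
numbers to 1000000 books and 10 pages each to try the scale you asked 
about. The timings will depend entirely on your machine.

```python
import time


def make_mixedlist(n_books, pages_per_book):
    # Build synthetic data: 'Book 0', 1, 2, ..., 'Book 1', 1, 2, ...
    mixed = []
    for i in range(n_books):
        mixed.append('Book %d' % i)
        mixed.extend(range(1, pages_per_book + 1))
    return mixed


def collate(mixed):
    # Pair each title with the page numbers that follow it.
    result = []
    current_book = None
    current_pages = []
    for item in mixed:
        if isinstance(item, str):
            if current_book is not None:
                result.append((current_book, current_pages))
            current_book = item
            current_pages = []
        else:
            current_pages.append(item)
    if current_book is not None:
        result.append((current_book, current_pages))
    return result


def time_collate(n_books, pages_per_book):
    # Time only the collation, not the creation of the test data.
    mixed = make_mixedlist(n_books, pages_per_book)
    start = time.perf_counter()
    result = collate(mixed)
    elapsed = time.perf_counter() - start
    return len(result), elapsed


count, seconds = time_collate(10000, 10)
print("collated %d books in %.3f seconds" % (count, seconds))
```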

-- 
Steven D'Aprano