[Tutor] Newbie Here -- Averaging & Adding Madness Over a Given (x) Range?!?!

Thu Feb 14 23:48:18 CET 2013

On 15/02/13 07:55, Michael McConachie wrote:

> Essentially:
>
> 1.  I have a list of numbers that already exist in a file.  I generate this file by parsing info from logs.
> 2.  Each line contains an integer on it (corresponding to the number of milliseconds that it takes to complete a certain repeated task).
> 3.  There are over a million entries in this file, one per line; at any given time it can be just a few thousand, or more than a million.
>
>     Example:
>     -------
>     173
>     1685
>     1152
>     253
>     1623

A million entries sounds like a lot to you or me, but to your computer, it's not. When you start talking tens or hundreds of millions, that's possibly a lot.

Do you know how to read those numbers into a Python list? Here is the "baby step" way to do so:

data = []  # Start with an empty list.
f = open("filename")  # Obviously you have to use the actual file name.
for line in f:  # Read the file one line at a time.
    num = int(line)  # Convert each line into an integer (whole number)
    data.append(num)  # and append it to the end of the list.
f.close()  # Close the file when done.

Here's a more concise way to do it:

with open("filename") as f:
    data = [int(line) for line in f]
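
If the file might contain blank lines or stray spaces (that is an assumption on my part; your description says one integer per line), a slightly more defensive sketch could look like this. The name read_numbers is made up for illustration, and it accepts any iterable of lines, so an open file works as well as the in-memory sample used here:

```python
def read_numbers(lines):
    """Return a list of ints, one per non-blank line.

    `lines` can be an open file or any other iterable of strings.
    """
    numbers = []
    for line in lines:
        text = line.strip()  # Drop the trailing newline and any spaces.
        if text:  # Skip lines that are empty after stripping.
            numbers.append(int(text))
    return numbers


# In-memory lines standing in for the file contents.
sample = ["173\n", "1685\n", "\n", "1152\n", "  253  \n", "1623"]
data = read_numbers(sample)
print(data)
```

With a real file you would pass the file object instead of the sample list, inside the same "with open(...) as f:" block shown above.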

Once you have that list of numbers, you can sum the whole lot:

sum(data)

or just a range of the items:

sum(data[:100])  # The first 100 items.

sum(data[100:200])  # The second 100 items.

sum(data[-50:])  # The last 50 items.

sum(data[1000:])  # Item 1001 to the end.  (See below.)

sum(data[5:99:3])  # Every third item, starting at index 5 and ending at index 98.

This is called "slicing", and it is perhaps the most powerful and useful technique that Python gives you for dealing with lists. The rules, though, are not necessarily the most intuitive.

A slice is either a pair of numbers separated with a colon, inside the square brackets:

     data[start:end]

or a triple:

     data[start:end:step]

Any of these three numbers can be left out. The default values are:

start=0
end=length of the sequence being sliced
step=1

They can also be negative. If start or end is negative, it is interpreted as counting "from the end" rather than "from the beginning".

Item positions are counted from 0, which will be very familiar to C programmers. The start index is included in the slice; the end index is excluded.
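
The rules so far can be checked directly at the interpreter. Here is a small sketch using a made-up ten-item list as stand-in data; each assert passes silently if the rule holds:

```python
data = list(range(10))  # Stand-in data: [0, 1, 2, ..., 9]

# Leaving a number out uses its default.
assert data[:3] == data[0:3] == [0, 1, 2]
assert data[7:] == data[7:len(data)] == [7, 8, 9]
assert data[::1] == data

# A negative start or end counts from the end.
assert data[-3:] == [7, 8, 9]
assert data[:-7] == [0, 1, 2]

# start is included and end is excluded, so a slice has end - start items.
assert data[2:5] == [2, 3, 4]
assert len(data[2:5]) == 5 - 2

# Adjacent slices share no items and miss none.
assert data[:4] + data[4:] == data

print(sum(data[2:5]))  # 2 + 3 + 4
```

The last assert is the practical payoff of excluding the end position: data[:n] and data[n:] always split a list cleanly in two.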

The model that you should think of is to imagine the sequence of items labelled with their index, starting from zero, and with a vertical line *between* each position. Here is a sequence of 26 items, showing the index in the first line and the value in the second:

|0|1|2|3|4|5|6|7|8|9| ... |25|
|a|b|c|d|e|f|g|h|i|j| ... |z |

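When you take a slice, the cuts are always made at the vertical line to the left of each index you give. For example, [2:8] cuts to the left of index 2 and to the left of index 8, keeping items 2 through 7. So, if the above is called "letters", we have: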

letters[0:4]  # returns "abcd"

letters[2:8]  # returns "cdefgh"

letters[2:8:2]  # returns "ceg"

letters[-3:]  # returns "xyz"
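
If you want to try these yourself, here is a minimal sketch. Using string.ascii_lowercase from the standard library as "letters" is my choice for illustration; any 26-character string would do:

```python
import string

letters = string.ascii_lowercase  # "abcdefghijklmnopqrstuvwxyz"

print(letters[0:4])    # abcd
print(letters[2:8])    # cdefgh
print(letters[2:8:2])  # ceg
print(letters[-3:])    # xyz
```

Slicing a string gives back a string, and slicing a list gives back a list; the index rules are the same for both.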

> Eventually what I'll need to do is:
>
> 1.  Index the file and/or count the lines, as to identify each line's positional relevance so that it can average any range of numbers that are sequential; one to one another.

No need. Python already does that, automatically, when you read the data into a list.

> 2.  Calculate the difference between any given (x) range.  In order to be able to ask the program to average every 5, 10, 100, 100, or 10,000 etc. -->  until completion.  This includes the need to dealing with stray remainders at the end of the file that aren't divisible by that initial requested range.

I don't quite understand you here. First you say "difference", then you say "average". Can you show a sample of data, say, 10 values, and the sorts of typical calculations you want to perform, with the answers you expect to get?

For example, here are 10 numbers:

103, 104, 105, 109, 111, 112, 115, 120, 123, 128

Here are the running averages of 3 values:

(103+104+105)/3

(104+105+109)/3

(105+109+111)/3

(109+111+112)/3

(111+112+115)/3

(112+115+120)/3

(115+120+123)/3

(120+123+128)/3

Is that what you mean? If so, then Python can deal with this trivially, using slicing. With your data stored in list "data", as above, I can say:

for i in range(len(data) - 2):  # The last full window of 3 starts at index len(data)-3.
    print(sum(data[i:i+3]))

to print the running sums taking three items at a time. Divide each sum by 3.0 to get the corresponding running average.
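
If the window size needs to vary (3, 5, 100, ...), the same idea can be wrapped in a function. This is only a sketch under the assumption that overlapping "running" averages are what you want; running_averages is a name I made up:

```python
def running_averages(data, n):
    """Return the averages of every run of n consecutive items."""
    if n <= 0:
        raise ValueError("window size must be positive")
    # There are len(data) - n + 1 full windows; none if data is too short.
    return [sum(data[i:i + n]) / float(n)
            for i in range(len(data) - n + 1)]


data = [103, 104, 105, 109, 111, 112, 115, 120, 123, 128]
averages = running_averages(data, 3)
print(len(averages))  # 8 windows for 10 items
print(averages[0])    # (103+104+105)/3 = 104.0
```

This re-adds every window from scratch, which is simple rather than fast. For a million entries and small windows that should be fine; for very large windows you would want to keep a running total instead.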

The rest of your post just confuses me. Until you explain exactly what calculations you are trying to perform, I can't tell you how to perform them :-)
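
That said, one guess at what "average every 5, 10, 100" with "stray remainders" might mean is non-overlapping blocks, where the last block can come up short. If that is the idea, here is a sketch. The name block_averages is hypothetical, and averaging the short final block over the items it actually has is my assumption, not something you specified:

```python
def block_averages(data, n):
    """Average consecutive, non-overlapping blocks of n items.

    The final block may have fewer than n items; it is averaged over
    however many items it actually contains.
    """
    if n <= 0:
        raise ValueError("block size must be positive")
    averages = []
    for start in range(0, len(data), n):
        block = data[start:start + n]  # Slicing past the end is safe.
        averages.append(sum(block) / float(len(block)))
    return averages


data = [103, 104, 105, 109, 111, 112, 115, 120, 123, 128]
print(block_averages(data, 4))  # Two full blocks, then a block of 2.
```

Because a slice that runs past the end of the list simply stops at the end, the remainder needs no special handling.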

-- 
Steven