[Tutor] Logical error?

Sat May 3 03:53:27 CEST 2014

Hi Bob, and welcome!

My responses interleaved with yours, below.

On Fri, May 02, 2014 at 11:19:26PM +0100, Bob Williams wrote:
> Hi,
> 
> I'm fairly new to coding and python. My system is linux (openSUSE
> 13.1). 

Nice to know. And I see you have even more information about your system 
in your email signature, including your email client and uptime. But 
what you don't tell us is what version of Python you're using. I'm going 
to guess that it is something in the 3.x range, since you call print as 
a function rather than a statement, but I can't be sure.

Fortunately in this case I don't think the exact version matters.

[...]
> fullPath = []   # declare (initially empty) lists
> truncPath = []
> 
> with codecs.open('/var/log/rsyncd.log', 'r') as rsyncd_log:
>     for line in rsyncd_log.readlines():
>         fullPath += [line.decode('utf-8', 'ignore').strip()]

A small note about performance here. If your log files are very large 
(say, hundreds of thousands or millions of lines), this part can become 
*horribly* slow. There are two things to watch out for, a minor one and 
a major one.

First, rsyncd_log.readlines() reads the entire file into a list in one 
go. Since your loop then copies every line into fullPath, you end up 
with two large lists of lines. There are ways to solve that, and process 
the lines lazily, one line at a time, without needing to store the whole 
file. But that's not the big problem.
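
For the record, here is a minimal sketch of what "lazily" means. It 
relies only on the fact that a file object can be iterated over directly. 
The function name stripped_lines is made up for illustration, and 
io.StringIO stands in for a real open log file.

```python
import io

def stripped_lines(log_file):
    """Yield each line with surrounding whitespace removed, one at a time."""
    for line in log_file:   # iterating a file reads it line by line
        yield line.strip()  # no list of all lines is ever built

# io.StringIO is a stand-in for the result of open(...).
sample = io.StringIO("first line  \n  second line\n")
for line in stripped_lines(sample):
    print(line)
```

Only one line needs to be in memory at any moment, however large the 
file is.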

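The big problem is lurking in this line:

    fullPath += [line.decode('utf-8', 'ignore').strip()]

Building a list by repeated concatenation is the classic route to an 
O(N**2) algorithm. Do you know that terminology? Very briefly: O(1) 
means approximately constant time: tripling the size of the input makes 
no difference to the processing time. O(N) means linear time: tripling 
the input triples the processing time. O(N**2) means quadratic time: 
tripling the input increases the processing time not by a factor of 
three, but a factor of three squared, or nine.

Strictly speaking, in CPython the in-place form += extends the existing 
list, so the line as written only wastes a throwaway one-item list for 
each line of the log. But the nearly identical

    fullPath = fullPath + [line.decode('utf-8', 'ignore').strip()]

copies the whole list every time through the loop, and that is 
quadratic. With small files and fast computers, you won't notice. But 
with huge files and a slow computer, that could be painful.

The safer habit is:

    fullPath.append(line.decode('utf-8', 'ignore').strip())

which says what you mean and stays well clear of the O(N**2) performance 
trap.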
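
To put numbers on the quadratic growth, here is a small sketch that 
counts how many list elements get copied when a list is rebuilt by 
concatenation on every pass (result = result + [item]), compared with 
one append per item. The function names are invented for this example, 
and the counts are simple bookkeeping, not timings.

```python
def copies_when_rebuilding(n):
    """Count element copies made by n rounds of result = result + [item]."""
    result = []
    copies = 0
    for item in range(n):
        # Concatenation builds a brand-new list: every existing element
        # is copied into it, plus the one new element.
        copies += len(result) + 1
        result = result + [item]
    return copies

def stores_when_appending(n):
    """Count element stores made by n calls to result.append(item)."""
    result = []
    stores = 0
    for item in range(n):
        result.append(item)  # one store per item (internal resizing aside)
        stores += 1
    return stores

for n in (1000, 3000):
    print(n, copies_when_rebuilding(n), stores_when_appending(n))
```

Tripling n from 1000 to 3000 takes the first count from 500500 to 
4501500, roughly a factor of nine, while the second count merely 
triples.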

>     if fullPath[-1][0:10] == today:
>         print("\n   Rsyncd.log has been modified in the last 24 hours...")
>     else:
>         print("\n   No recent rsync activity. Nothing to do.\n")
>         sys.exit()
> 
> # Search for lines starting with today's date and containing 'recv'
> # Strip everything up to and including 'recv' and following last '/'
> # path separator
> for i in range(0, len(fullPath)):
>     if fullPath[i][0:10] == today and 'recv' in fullPath[i]:
>         print("got there")
>         begin = fullPath[i].find('recv ')
>         end = fullPath[i].rfind('/')
>         fullPath[i] = fullPath[i][begin+5:end]
>         truncPath.append(fullPath[i])
>         print("   ...and the following new albums have been added:\n")
>     else:
>         print("   ...but no new music has been downloaded.\n")
>         sys.exit()

Now at last we get to your immediate problem: the loop above is 
intended to iterate over the lines of fullPath. But it starts at the 
beginning of the file, where the entries may not be from today. The 
first time it hits a line which is not from today, the else branch calls 
sys.exit(), so the program exits before it gets a chance to advance to 
the more recent days. That probably means that it looks at the first 
line in the log, determines that it is not from today, and exits.
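
Here is a cut-down sketch of that behaviour. The two log lines are 
invented (I am only assuming each line starts with a ten-character 
date), the function names are mine, and break stands in for your 
sys.exit() so the result can be inspected.

```python
def matches_with_early_exit(lines, today):
    """Mimic the original loop: stop at the first non-matching line."""
    found = []
    for line in lines:
        if line[0:10] == today and 'recv' in line:
            found.append(line)
        else:
            break  # stands in for sys.exit() in the original loop
    return found

def matches_scanning_everything(lines, today):
    """Skip non-matching lines instead of giving up on them."""
    return [line for line in lines
            if line[0:10] == today and 'recv' in line]

# Hypothetical log contents: yesterday's entry comes first.
log = [
    "2014/05/01 09:00:00 [1] recv old/album/track.flac",
    "2014/05/02 22:10:01 [2] recv new/album/track.flac",
]
print(matches_with_early_exit(log, "2014/05/02"))
print(matches_scanning_everything(log, "2014/05/02"))
```

The first function returns an empty list because yesterday's line ends 
the loop before today's line is ever seen; the second finds today's 
line.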

I'm going to suggest a more streamlined algorithm. Most of it is actual 
Python code, assuming you're using Python 3. Only the "process this 
line" part needs to be re-written.

new_activity = False  # Nothing has happened today (so far).
# `today` is the 10-character date string from your own script.
with open('/var/log/rsyncd.log', 'r',
          encoding='utf-8', errors='ignore') as rsyncd_log:
    for line in rsyncd_log:
        line = line.strip()
        if line[0:10] == today and 'recv' in line:
            new_activity = True
            # process this line here  <== fix this

if not new_activity:
    print("no new albums have been added today")

This has the benefit that every line is touched only once, not three 
times as in your version, and the whole file is never held in memory at 
once. You should be able to adapt this to your needs.
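
If it helps, here is one possible way to fill in the processing step, 
reusing the find/rfind slicing from your own loop. Treat it as a sketch: 
the function name new_paths and the sample log lines are made up, and 
io.StringIO stands in for the real log file.

```python
import io

def new_paths(log_file, today):
    """Return the truncated paths from today's 'recv ' lines."""
    paths = []
    for line in log_file:
        line = line.strip()
        begin = line.find('recv ')
        if line[0:10] == today and begin != -1:
            end = line.rfind('/')  # last path separator on the line
            # Keep what lies between 'recv ' and the final '/'.
            paths.append(line[begin + 5:end])
    return paths

# Hypothetical log contents, for illustration only.
sample = io.StringIO(
    "2014/05/01 09:00:00 [1] recv music/Old/Album/01.flac\n"
    "2014/05/02 22:10:01 [2] recv music/Artist/Album/01 Track.flac\n"
    "2014/05/02 22:10:05 [2] sent something\n"
)
print(new_paths(sample, "2014/05/02"))
```

An empty result then plays the role of new_activity being False: if the 
list is empty, nothing was received today.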

Good luck, and feel free to ask questions!

-- 
Steven