[Tutor] Logical error?
Steven D'Aprano
steve at pearwood.info
Sat May 3 03:53:27 CEST 2014
Hi Bob, and welcome!
My responses interleaved with yours, below.
On Fri, May 02, 2014 at 11:19:26PM +0100, Bob Williams wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> Hi,
>
> I'm fairly new to coding and python. My system is linux (openSUSE
> 13.1).
Nice to know. And I see you have even more infomation about your system
in your email signature, including your email client and uptime. But
what you don't tell us is what version of Python you're using. I'm going
to guess that it is something in the 3.x range, since you call print as
a function rather than a statement, but can't be sure.
Fortunately in this case I don't think the exact version matters.
[...]
> fullPath = [] # declare (initially empty) lists
> truncPath = []
>
> with codecs.open('/var/log/rsyncd.log', 'r') as rsyncd_log:
> for line in rsyncd_log.readlines():
> fullPath += [line.decode('utf-8', 'ignore').strip()]
A small note about performance here. If your log files are very large
(say, hundreds of thousands or millions of lines) you will find that
this part is *horribly horrible slow*. There's two problems, a minor and
a major one.
First, rsyncd_log.readlines will read the entire file in one go. Since
you end up essentially copying the whole file, you end up with two large
lists of lines. There are ways to solve that, and process the lines
lazily, one line at a time without needing to store the whole file. But
that's not the big problem.
The big problem is this:
fullPath += [line.decode('utf-8', 'ignore').strip()]
which is an O(N**2) algorithm. Do you know that terminology? Very
briefly: O(1) means approximately constant time: tripling the size of
the input makes no difference to the processing time. O(N) means linear
time: tripling the input triples the processing time. O(N**2) means
quadratic time: tripling the input increases the processing time not by
a factor of three, but a factor of three squared, or nine.
With small files, and fast computers, you won't notice. But with huge
files and a slow computer, that could be painful.
Instead, a better approach is:
fullPath.append(line.decode('utf-8', 'ignore').strip())
which avoids the O(N**2) performance trap.
> if fullPath[-1][0:10] == today:
> print("\n Rsyncd.log has been modified in the last 24 hours...")
> else:
> print("\n No recent rsync activity. Nothing to do.\n")
> sys.exit()
>
> # Search for lines starting with today's date and containing 'recv'
> # Strip everything up to and including 'recv' and following last '/'
> path separator
> for i in range(0, len(fullPath)):
> if fullPath[i][0:10] == today and 'recv' in fullPath[i]:
> print("got there")
> begin = fullPath[i].find('recv ')
> end = fullPath[i].rfind('/')
> fullPath[i] = fullPath[i][begin+5:end]
> truncPath.append(fullPath[i])
> print(" ...and the following new albums have been added:\n")
> else:
> print(" ...but no new music has been downloaded.\n")
> sys.exit()
Now at last we get to your immediate problem: the above is
intended to iterate over the lines of fullPath. But it starts at the
beginning of the file, which may not be today. The first time you hit a
line which is not today, the program exits, before it gets a chance to
advance to the more recent days. That probably means that it looks at
the first line in the log, determines that it is not today, and exits.
I'm going to suggest a more streamlined algorithm. Most of it is actual
Python code, assuming you're using Python 3. Only the "process this
line" part needs to be re-written.
new_activity = False # Nothing has happened today.
with open('/var/log/rsyncd.log', 'r',
encoding='utf-8', errors='ignore') as rsyncd_log:
for line in rsyncd_log:
line = line.strip()
if line[0:10] == today and 'recv' in line:
new_activity = True
process this line # <== fix this
if not new_activity:
print("no new albums have been added today")
This has the benefit that every line is touched only once, not three
times as in your version. Performance is linear, not quadratic. You
should be able to adapt this to your needs.
Good luck, and feel free to ask questions!
--
Steven
More information about the Tutor
mailing list