[Tutor] look back comprehensively
Avi Gross
avigross at verizon.net
Tue Nov 13 23:59:24 EST 2018
I have been thinking about the thread we have had where the job seemed to be
to read in a log file and, if some string was found, process the line before
it and generate some report. Is that generally correct?
The questioner suggested they needed the entire file both as one string and
as a list of strings. One suggestion, if so, was to read the entire file
twice, once as a whole and once as lines. Another suggestion was to read
either version ONCE and use cheaper Python methods to make the second copy.
We also looked at a similar issue about using a buffer to keep the last N
lines.
I thought of another tack that may fall in between those approaches but
still allow serious functionality.
OUTLINE:
Use just the readlines version to get a list of strings, one per line.
Assuming the text being searched for is static, there is no need for regular
expressions. You can simply ask whether
'something' in line
But if you find it that way, you do not have an index, so using enumerate (or
zip) to make (index, line) tuples can be of use. A list comprehension over
the enumerate object can collect the indices where the requested 'something'
was found (and, optionally, the line contents), and those indices let you
look at the line (or more lines) before.
Here is an example using a fake program where I create the four lines and
generate only the tuples needed for further processing, or just the indices
of the line ABOVE where the text was found. It can be done with simple list
comprehensions or made into a generator expression (a sketch of the latter
follows the output below).
-CODE-
"""
Sample code showing how to read a (simulated) file
and search for a fixed string and return the item
number in a list of strings for further processing
including of earlier lines.
"""
from pprint import pprint

# Make test data without use of file
fromfile = str1 = """alpha line one
beta line two
gamma line three
alphabet line four"""

lines = fromfile.split('\n')
print("RAW data: ", lines)    # just for illustration

errors = [(index, line)
          for (index, line) in enumerate(lines)
          if 'bet' in line]

just_indices = [index - 1
                for (index, line) in enumerate(lines)
                if 'bet' in line]

print("ERROR tuples:")        # just for illustration
pprint(errors)
print("Just error indices:")
pprint(just_indices)
-END-CODE-
-OUTPUT-
RAW data: ['alpha line one', 'beta line two', 'gamma line three', 'alphabet line four']
ERROR tuples:
[(1, 'beta line two'), (3, 'alphabet line four')]
Just error indices:
[0, 2]
-END-OUTPUT-
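As mentioned, the same filter can be written as a generator expression, which
produces the indices lazily instead of building a whole list first. A minimal
sketch, reusing the same lines list and the same 'bet' test:

# Generator expression: same filter as the list comprehensions above,
# but the indices are produced one at a time as you iterate.
index_gen = (index - 1
             for (index, line) in enumerate(lines)
             if 'bet' in line)

for index in index_gen:
    print("line before a match:", lines[index])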
Again, this shows it done two ways, and only one is needed. But the next step
would be to iterate over the results and process the earlier line to find
whatever it is you need to report. There are many ways to do that, such as:
for (index, ignore) in errors:
or
for index in just_indices:
You can also apply a regular expression to one line at a time, and so on.
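For example, here is a minimal sketch of that next step; the report_on()
helper is hypothetical and just stands in for whatever processing or regular
expression work you actually need:

def report_on(previous_line):
    # Hypothetical stand-in for the real per-line processing.
    print("line before the match:", previous_line)

for (index, ignore) in errors:     # errors holds the (index, line) tuples built above
    report_on(lines[index - 1])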
Again, other methods mentioned work fine, and using a deque to store earlier
lines in a limited buffer while not reading the entire file into memory
would also be a good way.
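A minimal sketch of that deque approach, assuming the data comes from a file
named 'logfile.txt' and a three-line buffer, both of which are just
illustrative:

from collections import deque

recent = deque(maxlen=3)                  # keeps only the last 3 lines seen
with open('logfile.txt') as f:
    for line in f:
        if 'bet' in line and len(recent) > 0:
            previous_line = recent[-1]    # the line just before the match
            print("line before the match:", previous_line.rstrip())
        recent.append(line)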
Warning: the above assumes the searched-for text will never be found in the
zeroth line. Otherwise, you need to check, because accessing line -1 will
actually return the last line!
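One simple guard, sticking with the comprehension style above, is to drop any
match found on the zeroth line:

# Skip matches on line 0, since lines[-1] would silently give the last line.
safe_indices = [index - 1
                for (index, line) in enumerate(lines)
                if 'bet' in line and index > 0]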
As stated many times, there seem to be an amazing number of ways to do
anything. As an example, I mentioned using zip above. One obvious method is
to zip the lines with a range, making it look just like enumerate. A more
subtle one would be to make a copy of the list of lines, the same length, but
with the contents shifted by one position. Zip that to the original and you
get tuples of (line 0, null), then (line 1, line 0), up to (line n, line n-1).
Yes, that doubles memory use, but you can solve so much more in one somewhat
more complicated list comprehension. Anyone want to guess how? One way is
sketched below.
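To make those zip ideas concrete, here are minimal sketches of both; using
None as the 'null' partner for line 0 is just one choice:

# zip with a range behaves just like enumerate
pairs = list(zip(range(len(lines)), lines))    # [(0, 'alpha line one'), (1, 'beta line two'), ...]

# pair each line with the line before it (None for line 0)
shifted = [None] + lines[:-1]
line_and_previous = list(zip(lines, shifted))  # [('alpha line one', None),
                                               #  ('beta line two', 'alpha line one'), ...]

# one comprehension that finds 'bet' and pulls out the previous line directly
previous_of_matches = [previous
                       for (line, previous) in line_and_previous
                       if 'bet' in line and previous is not None]
# previous_of_matches is ['alpha line one', 'gamma line three']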
If you recall, the original problem applied some regular expression to match
something on the previous line. Let us make believe you wrote a function
called do_the_match(line, re) that applies the regular expression to the line
of text and returns the matching text, or perhaps an empty string if nothing
is found. If you define that function, then the following code will work.
First, make my funny zip. I make lines_after as a copy of lines shifted over
by one, or more exactly, circularly permuted by one. The goal is to search in
each line of lines_after and, where the text is found, do the regular
expression match on the corresponding earlier line.
>>> lines_after = lines[1:]
>>> lines_after.append(lines[0])
>>> lines_after
['beta line two', 'gamma line three', 'alphabet line four', 'alpha line one']
>>> lines
['alpha line one', 'beta line two', 'gamma line three', 'alphabet line four']
>>> list(zip(lines_after, lines))
[('beta line two', 'alpha line one'), ('gamma line three', 'beta line two'), ('alphabet line four', 'gamma line three'), ('alpha line one', 'alphabet line four')]
So the list comprehension looks something like this:
matches = [do_the_match(line, re)
           for (line_after, line) in zip(lines_after, lines)
           if 'something' in line_after]
Need I mention the above code was used in Python 3.7.0 ???
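For completeness, here is a minimal sketch of what such a do_the_match()
function might look like; the particular regular expression and the renaming
of the second parameter (to avoid shadowing the re module) are my own
assumptions:

import re

def do_the_match(line, pattern):
    """Apply the regular expression to the line and return the
    matching text, or an empty string if nothing matched."""
    found = re.search(pattern, line)
    return found.group() if found else ''

# Using the sample data from earlier: search for 'bet' in the later line,
# then pull the first word out of the line before it.
matches = [do_the_match(line, r'^\w+')
           for (line_after, line) in zip(lines_after, lines)
           if 'bet' in line_after]
print(matches)     # ['alpha', 'gamma']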
I think it is important to try to learn the idioms and odd customs of a
language, even if that sometimes costs more memory or CPU, but it may be
better not to overly complicate the code until nobody understands it. The
latter may be, in a mild way, 'elegant', but even with more descriptive
variable names it might be harder to understand than a simple enumerate-based
version or several passes using more normal iteration statements.
Feel free to comment. I have a thick skin and love to learn from others.
Avi