[Tutor] look back comprehensively

Avi Gross avigross at verizon.net
Mon Dec 24 19:45:04 EST 2018


There is linear thinking and then there is less linear thinking.

As Alan, Mats and others said, there are often choices we can make and many
approaches.

If your goal is to solve a problem NOW and right where you are, any method
you can find is great.

If your goal is to solve it repeatedly, in many possible places, or more
efficiently, or if you are trying to learn more about a particular
language or method like Python, then you have constraints to consider.

The linear solution might be to solve the entire problem in one way. That
may be python or it may be some UNIX tool like AWK.

The flexible solutions may include doing it in stages, perhaps switching
tools along the way.

So for the logfile solution, some see two paradigms. One is to read in the
entire file, no matter how big. The other is to read in no more than a line
at a time.

Bogus choice.

In the logfile example, there are many other choices. If you can recognize
the beginning of a region and then the end, sure, you can buffer lines. But
how about a solution where you simply read the entire file (a line at a
time) while writing only the lines you might need to a second file? When
done, if that file is empty, move on.

If not, open that smaller file, perhaps reading it all in at once, and
process that small file containing perhaps a few error logs. Or, heck, maybe
each error region was written into a different file and you process them one
at a time with even less complex code.
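
To make that concrete, here is a rough sketch in Python; the file names and
the is_interesting() test are made up for illustration:

import os

def filter_lines(source_path, filtered_path, is_interesting):
    # Pass 1: stream the big log one line at a time, copying only
    # candidate lines into a much smaller working file.
    with open(source_path) as src, open(filtered_path, "w") as dst:
        for line in src:
            if is_interesting(line):
                dst.write(line)

# Hypothetical test: keep anything that looks error related.
filter_lines("big.log", "errors_only.log", lambda line: "ERROR" in line)

# Pass 2: if anything survived, the working file is small enough
# to read in all at once and process at leisure.
if os.path.getsize("errors_only.log") > 0:
    with open("errors_only.log") as small:
        interesting = small.readlines()
        # ... process the handful of error lines here ...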

Unless efficiency is paramount, many schemes like these can result in a
better division of labor, one that is easier to understand and perhaps even
easier to code. And some of these schemes may even have other advantages.

Here is yet another weird idea. REWIND. If the log file is a real file, you
can wait till you reach the end condition that tells you that you also need
earlier lines. Make a note of your position in the file and calculate an
estimate of how far back in the file you need to go, perhaps a very generous
estimate. Say you rewind a thousand bytes and start reading lines again,
perhaps discarding the first one as likely to be incomplete. You can read
all these lines into a buffer, figure out which lines you need starting from
the end, do what you want with it, toss the buffer, reset the file pointer,
and continue!
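
A rough sketch of that rewind trick; the file is opened in binary mode so
the byte arithmetic on seek() is well defined, and the end-of-region test
and the thousand-byte guess are just assumptions:

LOOKBACK = 1000        # generous guess at how far back the region can start

def process_region(lines):
    # Hypothetical handler for the recovered lines.
    print("recovered", len(lines), "lines")

with open("big.log", "rb") as f:        # binary mode keeps seek()/tell() simple
    for raw in iter(f.readline, b""):
        line = raw.decode(errors="replace")
        if "#ERRNO" in line:            # assumed end condition that needs earlier lines
            here = f.tell()             # note our position
            back = max(0, here - LOOKBACK)
            f.seek(back)                # rewind a generous amount
            chunk = f.read(here - back).decode(errors="replace")
            lines = chunk.splitlines()
            if back > 0:
                lines = lines[1:]       # first line is likely incomplete; discard it
            process_region(lines)
            f.seek(here)                # reset the file pointer and continue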

That solution is not linear at all. If you have a huge log file with few
(or no) errors to process, this may even be faster than a scheme which
buffers the last hundred lines and is constantly rolling those lines over
for naught.

I am not criticizing any approach but suggesting that one good approach to
problems is to not close in on one particular solution prematurely. Be open
to other solutions and even think outside that box. There may be many ways
to look back. 

As for the UNIX tools, one nice thing about them was using them in a
pipeline where each step made some modification that often merely set
things up for the next step to modify further. The solution did not depend
on one tool doing everything.

Even within Python, you can combine many modules to get a job done rather
than building everything from scratch.
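
For instance, a quick scan for error numbers across several logs needs
nothing beyond the standard library; the pattern and file names here are
invented:

import fileinput                    # chains several files into one stream of lines
import re                           # regex matching, much as awk would do it
from collections import Counter

errno_pat = re.compile(r"#ERRNO:\s*(\d+)")      # assumed log format

counts = Counter()
for line in fileinput.input(["app1.log", "app2.log"]):
    m = errno_pat.search(line)
    if m:
        counts[m.group(1)] += 1

print(counts.most_common())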

Which leads to the question of how you would design a log file if you knew
you needed to be able to search it efficiently for your particular
application.

I would offer something like HTML as an example. 

To some extent, the design of many elements looks like this:

<BODY>
...
</BODY>

The idea is that you can write code that starts saving info when it reaches
the opening tag, and when it reaches the closing tag that ends the region,
you make a decision. If you are in the right region, process it. If not,
toss it and just move on.

Of course, this scheme does not actually work for many of the tags in HTML.
For many tags the closing tag is optional and its absence is tolerated.
Some elements may also be nested within each other.

But when it comes to log files, if some line says:

***

And that marks the start of an error region, and you then wait till the end
of the region to see which error it was, on a line like:

#ERRNO: 26

Then you can ask your code to ignore lines till it sees the first marker.
Buffer any subsequent lines till you recognize the second marker and process
the buffer. Then go back to ignoring till ...
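
A minimal sketch of that ignore/buffer/process loop, assuming those two
markers and a made-up handle_error() for the finished buffer:

import re

ERRNO_RE = re.compile(r"#ERRNO:\s*(\d+)")       # assumed end-of-region line

def handle_error(errno, lines):
    # Hypothetical: decide here whether this error number is one you care about.
    print("error", errno, "with", len(lines), "buffered lines")

buffer = None                        # None means ignoring; a list means buffering
with open("app.log") as log:
    for line in log:
        if buffer is None:
            if line.startswith("***"):           # first marker: start buffering
                buffer = []
        else:
            m = ERRNO_RE.match(line)
            if m:                                # second marker: process, then reset
                handle_error(int(m.group(1)), buffer)
                buffer = None                    # back to ignoring till ...
            else:
                buffer.append(line)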

A real logfile may contain many sections for many purposes. It may even
have interleaved lines from different processes, to the point where your
design may require every line to start with something unique, like the
process ID of the writer. That interleaving makes parsing such a file very
hard, perhaps requiring multiple passes, sort of like the schemes described
above. So sometimes a better design is multiple log files that can be
merged if needed.

Of course if you must use what exists, ....
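
If what exists is an interleaved file whose lines at least start with the
writer's process ID, one coping strategy is a single pass that splits it
into one working file per writer (the line format here is an assumption):

# Split an interleaved log into one file per process ID in a single pass.
# Assumes each line starts with the writer's PID, e.g. "1234 some message".
writers = {}
try:
    with open("combined.log") as log:
        for line in log:
            parts = line.split(None, 1)
            if not parts:                        # skip blank lines
                continue
            pid = parts[0]
            if pid not in writers:
                writers[pid] = open(f"by_pid_{pid}.log", "w")
            writers[pid].write(line)
finally:
    for f in writers.values():
        f.close()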

-----Original Message-----
From: Tutor <tutor-bounces+avigross=verizon.net at python.org> On Behalf Of
Mats Wichmann
Sent: Monday, December 24, 2018 10:15 AM
To: tutor at python.org
Subject: Re: [Tutor] look back comprehensively

On 12/24/18 2:14 AM, Alan Gauld via Tutor wrote:
> On 24/12/2018 05:25, Asokan Pichai wrote:
> 
>> That said, sometimes text processing at the shell can precede or even 
>> replace some of these. Of course that assumes Unix/Linux OS.

> In fact for most log file analysis I still use [ng]awk.
> Its hard to beat the simplicity of regex based event handling for 
> slicing text files.

Sure... there's nothing wrong with using "the appropriate tool for the job"
rather than making everything look like a Python problem.  There's Perl,
too... the ultimate log analysis toolkit. If only I could read what I wrote
a week later :)




