[Tutor] Increase performance of the script

Tue Dec 11 10:37:58 EST 2018

Hi All,

I used your solution, but found a strange issue with deque:

I am using Python 2.6.6:

>>> import collections
>>> d = collections.deque('abcdefg')
>>> print 'Deque:', d
  File "<stdin>", line 1
    print 'Deque:', d
                 ^
SyntaxError: invalid syntax
>>> print ('Deque:', d)
Deque: deque(['a', 'b', 'c', 'd', 'e', 'f', 'g'])
>>> print d
  File "<stdin>", line 1
    print d
          ^
SyntaxError: invalid syntax
>>> print (d)
deque(['a', 'b', 'c', 'd', 'e', 'f', 'g'])

In Python 2.6 the print statement normally works as print "Solution".

However, after importing collections I have to call print with parentheses,
as in print("Solution"). Is this a known issue?

Please let me know.

Thanks,

On Mon, Dec 10, 2018 at 10:30 PM <tutor-request at python.org> wrote:

> Today's Topics:
>
>    1. Re: Increase performance of the script (Peter Otten)
>    2. Re: Increase performance of the script (Steven D'Aprano)
>    3. Re: Increase performance of the script (Steven D'Aprano)
>
>
>
> ---------- Forwarded message ----------
> From: Peter Otten <__peter__ at web.de>
> To: tutor at python.org
> Cc:
> Bcc:
> Date: Sun, 09 Dec 2018 21:17:53 +0100
> Subject: Re: [Tutor] Increase performance of the script
> Asad wrote:
>
> > Hi All,
> >
> > I have the following code to search for an error and print the
> > solution.
> >
> > The size of /A/B/file1.log may vary from 5 MB to 5 GB.
> >
> > f4 = open (r" /A/B/file1.log  ", 'r' )
> > string2=f4.readlines()
>
> Do not read the complete file into memory. Read one line at a time and
> keep only those lines around that you may have to look at again.
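
A minimal sketch of that advice, assuming a bounded collections.deque is an acceptable way to hold the recent lines. The function name, the marker text and the sample lines are made up for illustration and are not taken from the log file discussed here:

```python
from collections import deque


def lines_up_to_marker(lines, marker, keep=4):
    """Return the last `keep` lines read, ending with the first line that contains marker."""
    tail = deque(maxlen=keep)  # older lines are discarded automatically
    for line in lines:  # works for an open file as well as for a list
        tail.append(line)
        if marker in line:
            return list(tail)
    return []  # the marker never appeared


# Hypothetical sample data standing in for a log file.
sample = ["one\n", "two\n", "three\n", "four\n", "error 13\n", "five\n"]
context = lines_up_to_marker(sample, "error")
```

Only `keep` lines are held at any moment, so memory use should not grow with the size of the file.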
>
> > for i in range(len(string2)):
> >     position=i
> >     lastposition =position+1
> >     while True:
> >          if re.search('Calling rdbms/admin',string2[lastposition]):
> >           break
> >          elif lastposition==len(string2)-1:
> >           break
> >          else:
> >           lastposition += 1
>
> You are trying to find a group of lines. The way you do it, for a file with
> the structure
>
> foo
> bar
> baz
> end-of-group-1
> ham
> spam
> end-of-group-2
>
> you find the groups
>
> foo
> bar
> baz
> end-of-group-1
>
> bar
> baz
> end-of-group-1
>
> baz
> end-of-group-1
>
> ham
> spam
> end-of-group-2
>
> spam
> end-of-group-2
>
> That looks like a lot of redundancy which you can probably avoid. But wait...
>
>
> >     errorcheck=string2[position:lastposition]
> >     for i in range ( len ( errorcheck ) ):
> >         if re.search ( r'"error(.)*13?"', errorcheck[i] ):
> >             print "Reason of error \n", errorcheck[i]
> >             print "script \n" , string2[position]
> >             print "block of code \n"
> >             print errorcheck[i-3]
> >             print errorcheck[i-2]
> >             print errorcheck[i-1]
> >             print errorcheck[i]
> >             print "Solution :\n"
> >             print "Verify the list of objects belonging to Database "
> >             break
> >     else:
> >         continue
> >     break
>
> you throw away almost all of that hard work and only look at those four
> lines? It looks like you only need the "error...13" line, the three lines
> that precede it, and the last "Calling..." line occurring before the
> "error...13".
>
> > The problem I am facing is a performance issue: it takes some minutes to
> > print out the solution. Please advise if there can be performance
> > enhancements to this script.
>
> If you want to learn the Python way you should try hard to write your
> scripts without a single
>
> for i in range(...):
>     ...
>
> loop. This style is usually the last resort. It may work for small datasets,
> but as soon as you have to deal with large files performance dives.
> Even worse, these loops tend to make your code hard to debug.
>
> Below is a suggestion for an implementation of what your code seems to be
> doing that only remembers the four most recent lines and works with a single
> loop. If that saves you some time, use that time to clean the scripts you
> have lying around from occurrences of "for i in range(....): ..." ;)
>
>
> from __future__ import print_function
>
> import re
> import sys
> from collections import deque
>
>
> def show(prompt, *values):
>     print(prompt)
>     for value in values:
>         # "{0}" rather than "{}" so that this also runs on Python 2.6
>         print(" {0}".format(value.rstrip("\n")))
>
>
> def process(filename):
>     tail = deque(maxlen=4)  # the last four lines
>     script = None
>     with open(filename) as instream:
>         for line in instream:
>             tail.append(line)
>             if "Calling rdbms/admin" in line:
>                 script = line
>             elif re.search('"error(.)*13?"', line) is not None:
>                 show("Reason of error:", tail[-1])
>                 # script is still None if no "Calling" line came first
>                 show("Script:", script or "(unknown)")
>                 show("Block of code:", *tail)
>                 show(
>                     "Solution",
>                     "Verify the list of objects belonging to Database"
>                 )
>                 break
>
>
> if __name__ == "__main__":
>     filename = sys.argv[1]
>     process(filename)
>
>
>
>
>
>
> ---------- Forwarded message ----------
> From: "Steven D'Aprano" <steve at pearwood.info>
> To: tutor at python.org
> Cc:
> Bcc:
> Date: Mon, 10 Dec 2018 09:43:20 +1100
> Subject: Re: [Tutor] Increase performance of the script
> On Sun, Dec 09, 2018 at 03:45:07PM +0530, Asad wrote:
> > Hi All,
> >
> > I have the following code to search for an error and print the
> > solution.
> >
> > The size of /A/B/file1.log may vary from 5 MB to 5 GB.
> [...]
>
> > The problem I am facing is a performance issue: it takes some minutes to
> > print out the solution. Please advise if there can be performance
> > enhancements to this script.
>
> How many minutes is "some"? If it takes 2 minutes to analyse a 5GB file,
> that's not bad performance. If it takes 2 minutes to analyse a 5MB file,
> that's not so good.
>
>
>
> --
> Steve
>
>
>
>
> ---------- Forwarded message ----------
> From: "Steven D'Aprano" <steve at pearwood.info>
> To: tutor at python.org
> Cc:
> Bcc:
> Date: Mon, 10 Dec 2018 11:00:58 +1100
> Subject: Re: [Tutor] Increase performance of the script
> On Sun, Dec 09, 2018 at 03:45:07PM +0530, Asad wrote:
> > Hi All,
> >
> > I have the following code to search for an error and print the
> > solution.
>
> Please tidy your code before asking for help optimizing it. We're
> volunteers, not being paid to work on your problem, and your code is too
> hard to understand.
>
> Some comments:
>
>
> > f4 = open (r" /A/B/file1.log  ", 'r' )
> > string2=f4.readlines()
>
> You have a variable "f4". Where are f1, f2 and f3?
>
> You have a variable "string2", which is a lie, because it is not a
> string, it is a list.
>
> I will be very surprised if the file name you show is correct. It has a
> leading space, and two trailing spaces.
>
>
> > for i in range(len(string2)):
> >     position=i
>
> Poor style. In Python, you almost never need to write code that iterates
> over the indexes (this is not Pascal). You don't need the assignment
> position=i. Better:
>
> for position, line in enumerate(lines):
>     ...
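
A small sketch of the enumerate() idiom, using made-up sample lines (the file name example.sql is hypothetical):

```python
# Hypothetical sample data standing in for the lines of the log file.
lines = ["alpha\n", "Calling rdbms/admin/example.sql\n", "omega\n"]

found = []
for position, line in enumerate(lines):
    # position is the index and line is the item itself,
    # so no lines[position] lookup is needed.
    if "Calling rdbms/admin" in line:
        found.append((position, line.rstrip("\n")))
```

Here found ends up holding the index and text of each matching line.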
>
>
> >     lastposition =position+1
>
> Poorly named variable. You call it "last position", but it is actually
> the NEXT position.
>
>
> >     while True:
> >          if re.search('Calling rdbms/admin',string2[lastposition]):
>
> Unnecessary use of regex, which will be slow. Better:
>
>     if 'Calling rdbms/admin' in line:
>         break
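
A sketch of how the two tests could be compared, assuming the search text is a fixed string with no regex metacharacters that matter. The log line is made up, and the timings depend on the machine, so none are quoted here:

```python
import re
import timeit

line = "Calling rdbms/admin/example.sql at some point\n"  # made-up log line

# For a fixed piece of text, both tests give the same answer.
by_regex = re.search("Calling rdbms/admin", line) is not None
by_substring = "Calling rdbms/admin" in line

# Measure both on your own machine instead of relying on a rule of thumb.
regex_seconds = timeit.timeit(
    lambda: re.search("Calling rdbms/admin", line), number=10000
)
substring_seconds = timeit.timeit(
    lambda: "Calling rdbms/admin" in line, number=10000
)
```

Printing regex_seconds and substring_seconds afterwards shows the difference for this particular line.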
>
>
> >           break
> >          elif lastposition==len(string2)-1:
> >           break
>
> If you iterate over the lines, you don't need to check for the end of
> the list yourself.
>
>
> A better solution is to use the *accumulator* design pattern to collect
> a block of lines for further analysis:
>
> # Untested.
> with open(filename, 'r') as f:
>     block = []
>     inside_block = False
>     for line in f:
>         line = line.strip()
>         if inside_block:
>             if line == "End of block":
>                 inside_block = False
>                 process(block)
>                 block = []  # Reset to collect the next block.
>             else:
>                 block.append(line)
>         elif line == "Start of block":
>             inside_block = True
>     # At the end of the loop, we might have a partial block.
>     if block:
>         process(block)
>
>
> Your process() function takes a single argument, the list of lines which
> makes up the block you care about.
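
A self-contained sketch of the accumulator pattern described above, assuming the same placeholder markers "Start of block" and "End of block". The function name collect_blocks and the sample lines are hypothetical, and the blocks are returned in a list rather than passed to process() so the result is easy to inspect:

```python
def collect_blocks(lines, start="Start of block", end="End of block"):
    """Return a list of blocks, each a list of the lines between start and end."""
    blocks = []
    block = []
    inside_block = False
    for line in lines:
        line = line.strip()
        if inside_block:
            if line == end:
                inside_block = False
                blocks.append(block)
                block = []  # reset to collect the next block
            else:
                block.append(line)
        elif line == start:
            inside_block = True
    if block:
        blocks.append(block)  # a partial block at the end of the input
    return blocks


# Hypothetical sample input; the second block is never closed.
sample = [
    "noise\n",
    "Start of block\n",
    "a\n",
    "b\n",
    "End of block\n",
    "more noise\n",
    "Start of block\n",
    "c\n",
]
blocks = collect_blocks(sample)
```

Passing an open file instead of the sample list should work the same way, since the function only iterates over its argument.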
>
> If you need to know the line numbers, it is easy to adapt:
>
>     for line in f:
>
> becomes:
>
>     # The second argument (Python 2.6 and later) starts the count at 1
>     # instead of 0.
>     for linenumber, line in enumerate(f, 1):
>
> and:
>
>     block.append(line)
>
> becomes
>
>     block.append((linenumber, line))
>
>
> If you re-write your code using this accumulator pattern, using ordinary
> substring matching and equality instead of regular expressions whenever
> possible, I expect you will see greatly improved performance (as well as
> code that is much, much easier to understand and maintain).
>
>
>
> --
> Steve
>

-- 
Asad Hasan
+91 9582111698