[Tutor] help with re module and parsing data

Steven D'Aprano steve at pearwood.info
Mon Mar 7 23:33:06 CET 2011


On Mon, 7 Mar 2011 06:54:30 pm vineeth wrote:
> Hello all, I am doing some analysis on my trace file. I am finding the
> lines Recvd-Content and Published-Content. I am able to find those
> lines, but the re module, as expected, just gives the word that is
> being searched for. However, I require the entire line, similar to
> grep in Unix. Can someone tell me how to do this? I am doing it the
> following way.

If you want to match *lines*, then you need to process each line 
individually, not the whole file at once. Something like this:

with open('file.txt') as f:
    for line in f:
        if "Recvd-Content" in line or "Published-Content" in line:
            process_match(line)

A simple substring test should be enough, and it will be *really* fast.
If you need a heavier-duty test you can use a regex instead, but
remember that regexes are usually slow.

import re

pattern = 'whatever...'
with open('file.txt') as f:
    for line in f:
        if re.search(pattern, line):
            process_match(line)


Some further comments below:


> import re
> file = open('file.txt','r')
> file2 = open('newfile.txt','w')
>
> LineFile = ' '

Why do you initialise "LineFile" to a single space, instead of the empty 
string?


> for line in file:
>      LineFile += line

Don't do that! Seriously, that is completely the wrong way.

What this does is something like this:

Set LineFile to " ".
Read one line from the file.
Make a copy of LineFile plus line 1.
Assign that new string to LineFile.
Delete the old contents of LineFile.
Read the second line from the file.
Make a copy of LineFile plus line 2.
Assign that new string to LineFile.
Delete the old contents of LineFile.
Read the third line from the file.
Make a copy of LineFile plus line 3.
and so on... 

Can you see how much copying of data is being done? If there are 1000
lines in the file, the first line gets copied 1000 times, the second
line 999 times, the third 998 times... that is 1000 + 999 + 998 + ...
+ 1 = 500500 line copies in total, just to build one string. See this
essay for more about why this is s-l-o-w:

http://www.joelonsoftware.com/articles/fog0000000319.html
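If you want to see the effect for yourself, here is a rough timing
sketch (the sizes are illustrative only; on CPython the += case may
look deceptively fast because of the optimization mentioned below):

import timeit

setup = "lines = ['x' * 60] * {n}"
concat = "s = ''\nfor line in lines:\n    s += line"
join = "s = ''.join(lines)"

for n in (1000, 10000, 100000):
    # Each statement builds one string from n lines, repeated 10 times.
    t_concat = timeit.timeit(concat, setup.format(n=n), number=10)
    t_join = timeit.timeit(join, setup.format(n=n), number=10)
    print(n, t_concat, t_join)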

Now, it turns out that *some* versions of Python have a clever 
optimization which, *sometimes*, can speed that up. But you shouldn't 
rely on it. The better way to build one string out of many is:

accumulator = []
for s in some_strings:
    accumulator.append(s)  # appending to a list is cheap
result = ''.join(accumulator)  # one final copy instead of thousands
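And if the strings are already available in an iterable, the whole
loop collapses into a single call:

result = ''.join(some_strings)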

But in your case, when reading from a file, an even better way is to 
just read the whole file in one chunk:

with open('file.txt') as f:
    LineFile = f.read()
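And if you do slurp the whole file like that, you can still get
grep-style whole lines back from a regex: with the re.MULTILINE flag,
^ and $ match at every line boundary. A sketch, reusing the markers
from your question:

import re

with open('file.txt') as f:
    text = f.read()

# With re.MULTILINE, ^ and $ anchor at each line, so every match is
# an entire line, just like grep.
pattern = re.compile(r'^.*(?:Recvd-Content|Published-Content).*$',
                     re.MULTILINE)
for line in pattern.findall(text):
    print(line)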



-- 
Steven D'Aprano

