[Tutor] help with re module and parsing data
Steven D'Aprano
steve at pearwood.info
Mon Mar 7 23:33:06 CET 2011
On Mon, 7 Mar 2011 06:54:30 pm vineeth wrote:
> Hello all I am doing some analysis on my trace file. I am finding the
> lines Recvd-Content and Published-Content. I am able to find those
> lines but the re module as predicted just gives the word that is
> being searched. But I require the entire line similar to a grep in
> unix. Can some one tell me how to do this. I am doing the following
> way.
If you want to match *lines*, then you need to process each line
individually, not the whole file at once. Something like this:
for line in open('file.txt'):
    if "Recvd-Content" in line or "Published-Content" in line:
        process_match(line)
A simple substring test should be enough, and it will be *really* fast.
If you need a more heavy-duty test, you can use a regex, but remember
that regexes are usually slow.
import re

pattern = 'whatever...'
for line in open('file.txt'):
    if re.search(pattern, line):
        process_match(line)
Some further comments below:
> import re
> file = open('file.txt','r')
> file2 = open('newfile.txt','w')
>
> LineFile = ' '
Why do you initialise "LineFile" to a single space, instead of the empty
string?
> for line in file:
> LineFile += line
Don't do that! Seriously, that is completely the wrong way.
What this does is something like this:
Set LineFile to " ".
Read one line from the file.
Make a copy of LineFile plus line 1.
Assign that new string to LineFile.
Delete the old contents of LineFile.
Read the second line from the file.
Make a copy of LineFile plus line 2.
Assign that new string to LineFile.
Delete the old contents of LineFile.
Read the third line from the file.
Make a copy of LineFile plus line 3.
and so on...
Can you see how much copying of data is being done? If there are 1000
lines in the file, the first line gets copied 1000 times, the second
line 999 times, the third 998 times... See this essay for more about
why this is s-l-o-w:
http://www.joelonsoftware.com/articles/fog0000000319.html
Now, it turns out that *some* versions of Python have a clever
optimization which, *sometimes*, can speed that up. But you shouldn't
rely on it. The better way to add many strings is:
accumulator = []
for s in some_strings:
    accumulator.append(s)
result = ''.join(accumulator)
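For completeness, when the strings are already in a sequence, the
accumulator loop collapses to a single join call. A tiny self-contained
check (the sample strings are invented):

```python
# Made-up sample data standing in for lines collected from a file.
some_strings = ["spam", "ham", "eggs"]

# Building up a list and joining once at the end...
accumulator = []
for s in some_strings:
    accumulator.append(s)
result = ''.join(accumulator)

# ...gives the same string as joining the sequence directly.
direct = ''.join(some_strings)
```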
But in your case, when reading from a file, an even better way is to
just read from the file in one chunk!
LineFile = open('file.txt','r').read()
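Putting the pieces together, the grep-like filter the original poster
wanted could be sketched like this. The file names are placeholders; a
temporary file stands in for the real trace so the example runs on its
own:

```python
import os
import tempfile

# Create a stand-in trace file so the example is self-contained.
trace = "Recvd-Content: a\nnoise\nPublished-Content: b\n"
fd, path = tempfile.mkstemp(text=True)
with os.fdopen(fd, 'w') as f:
    f.write(trace)

# Filter matching lines, as grep would, keeping each whole line.
with open(path) as src, open(path + '.out', 'w') as dst:
    for line in src:
        if "Recvd-Content" in line or "Published-Content" in line:
            dst.write(line)

with open(path + '.out') as f:
    kept = f.read()

# Clean up the temporary files.
os.remove(path)
os.remove(path + '.out')
```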
--
Steven D'Aprano