[Tutor] RE troubles (fwd)

Danny Yoo dyoo at hkn.eecs.berkeley.edu
Tue Aug 17 19:16:51 CEST 2004


[Forwarding to tutor at python.org.  Please don't send questions to me alone:
send them to the Tutor list instead.  Why limit yourself to just one head,
when you can get input from the whole community?]


---------- Forwarded message ----------
Date: Tue, 17 Aug 2004 09:54:12 +0200 (CEST)
From: "Øyvind" <python at kapitalisten.no>
To: dyoo at hkn.eecs.berkeley.edu
Subject: Re: [Tutor] RE troubles

Hello again,

  I have one more question you hopefully might be able to help me with.

  I am using the following to extract the words and sentences out of the
documents, but it seems like the last character is often missing. Here
are a few lines that I would like to extract from:

	<li type=circle><a href=http://www.kvasir.no/query?what=no&q=coop+byggmix
[17/Aug/2004:09:44:22 +0200]&cf=spy target="_top">coop byggmix
</a><a href=javascript:onClick=launchWarning('/nokkelhullet/o.html')><img
src=http://www.kvasir.no/nokkelhullet/o.gif border=0></a>
<br><li type=circle><a
href=http://www.kvasir.no/query?q=christiano+ronaldo&submit=S%F8k&what=images
[17/Aug/2004:09:44:21 +0200]&cf=spy target="_top">christiano ronaldo
</a><a href=javascript:onClick=launchWarning('/nokkelhullet/o.html')><img
src=http://www.kvasir.no/nokkelhullet/o.gif border=0></a>
<br><li type=circle><a
href=http://www.kvasir.no/query?q=erop&submit=S%F8k&what=no
[17/Aug/2004:09:44:21 +0200]&cf=spy target="_top">erop
</a>
<br><li type=circle><a
href=http://www.kvasir.no/query?what=no&q=Birgita+S%F8strene
[17/Aug/2004:09:44:21 +0200]&cf=spy target="_top">Birgita Søstrene
</a><a href=javascript:onClick=launchWarning('/nokkelhullet/o.html')><img
src=http://www.kvasir.no/nokkelhullet/o.gif border=0></a>
<br><li type=circle><a
href=http://www.kvasir.no/query?q=bademilj%F8&submit=S%F8k&what=web-no
[17/Aug/2004:09:44:21 +0200]&cf=spy target="_top">bademiljø
</a>

From here I want:
coop byggmix
christiano ronaldo
erop
Birgita Søstrene
bademiljø


and the code I use to extract is:

rawstr = r"""target="_top">(.*?)\n</a
"""
compile_obj = re.compile(rawstr,  re.IGNORECASE| re.MULTILINE| re.VERBOSE)
matchstr = side.read()
liste = compile_obj.findall(matchstr)
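
[A minimal sketch for testing the pattern offline, not part of the
original message. It assumes Python 3 and that two entries from the sample
above are held in a string named `sample` (a hypothetical name) rather than
read from the live page through `side`. Printing each match with repr()
makes any hidden trailing characters visible.]

```python
import re

# Two entries copied from the sample above, joined with plain "\n" line
# endings. This is an assumption: the live page may use different ones.
sample = (
    '<br><li type=circle><a\n'
    'href=http://www.kvasir.no/query?q=erop&submit=S%F8k&what=no\n'
    '[17/Aug/2004:09:44:21 +0200]&cf=spy target="_top">erop\n'
    '</a>\n'
    '<br><li type=circle><a\n'
    'href=http://www.kvasir.no/query?what=no&q=Birgita+S%F8strene\n'
    '[17/Aug/2004:09:44:21 +0200]&cf=spy target="_top">Birgita Søstrene\n'
    '</a>\n'
)

# Same pattern and flags as in the message.
rawstr = r"""target="_top">(.*?)\n</a
"""
compile_obj = re.compile(rawstr, re.IGNORECASE | re.MULTILINE | re.VERBOSE)
liste = compile_obj.findall(sample)

# repr() shows each captured string exactly, including any stray whitespace.
for item in liste:
    print(repr(item))
```

[On this sample both strings come back whole, so comparing repr() of the
raw page text around a truncated entry with the text above may help narrow
down where the two differ.]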

The page that I get the info from is updated continuously and can be found
at http://search.kvasir.no/nokkelhullet/myspy/ . When I read the output,
the last character of the sentences/words is often missing. However, I
have not been able to find a pattern in which sentences/words lose
the last character, only that approximately 1 out of 5 does so.

Is there something in the code that can explain why? Is there something
I can change to get the whole string that I am looking for?

Thanks in advance,
Øyvind


