[Tutor] String searching

Craig Hagerman craig@osa.att.ne.jp
Mon, 05 Jun 2000 19:39:37 +0900


Hello Andre,



> Earlier in this thread Craig mentioned that it would probably be best
> to use regular expressions, and because of the problems mentioned above
> I think they are what I need.

I really like using Regular Expressions, and they surely can help with a
problem such as yours.


>  But since I am
> not very good at regular expressions I can not come up with one that
> correctly handles the above mentioned problems.

I recommend the documentation on regular expressions available from the
Python site. It is pretty good at getting you up to speed. The relevant
chapter in "Python: Essential Reference" (Beazley) is useful as well.

> it to find links even if they are like "(http://www.link.com)" or when
> a dot is placed right after the link. Then I want to extract this link,
> but _only_ the part of it that is actually the link (not the
> surrounding parenthesis for an example.)

Is this text you are searching just plain text or is it marked up in html?
If it is html then the solution is very easy: just search for link tags (ie
<a href = "UrlLocation">) with a regex and extract the Url. Here is one
solution for that case:

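import re
# Capture the text between the first pair of quotes that follows "<a href".
isLink = re.compile(r'<a\shref\s?.*?["\'](.+?)["\']', re.IGNORECASE)

# infile is assumed to be an already-open file object.
for line in infile:
    match = isLink.search(line)   # search once and reuse the match object
    if match:
        link = match.group(1)     # only the Url, without the quotes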

I will explain this regex:
 \s matches one whitespace character, which should appear between 'a' and
'href'. There may or may not be a space after 'href', so the question mark
(\s?) matches it zero or one times. I have noticed some people write extra
html code between the 'a href' and "Url" so  .*?  is there to take care of
any extraneous code (it also skips over the equals sign). The Url should be
the only thing within quotes, whether single or double, so a character class
holding both quote characters sits on either side of it (you have to escape
the single quote because the pattern itself is written in single quotes). A
| is not needed inside square brackets; there it is just a literal pipe
character. Everything between the quotes is made into a group (ie  (.+?)  ).
This is the first group, so when we search and ask for group one (ie
isLink.search(line).group(1) ) the regex matches the entire expression from
<a href....  until the final quote mark but returns only the grouped
expression.
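
To see the difference between the whole match and the group, here is a small
sketch run on a made-up line of html. The sample string and the names HREF
and find_href are only for illustration, and the pattern is a slightly
loosened variant of the one above (it allows any amount of whitespace), so
treat it as a starting point rather than a finished parser:

```python
import re

# Capture whatever sits between the first pair of quotes after "<a href".
HREF = re.compile(r'<a\s+href\s*.*?["\'](.+?)["\']', re.IGNORECASE)

def find_href(line):
    """Return the first quoted Url after '<a href' in line, or None."""
    match = HREF.search(line)
    if match is None:
        return None
    return match.group(1)

sample = '<p>See <A HREF = "http://www.python.org/doc/">the docs</A>.</p>'
match = HREF.search(sample)
print(match.group(0))   # whole match: from <A HREF up to the closing quote
print(match.group(1))   # only the Url between the quotes
```

Note that a link tag split over two lines will be missed when you search one
line at a time.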


If the text you are searching is plain text then it is no more difficult. You
must think about how to specify the different cases. The Url could be
surrounded by quotes or parentheses, or brackets... but probably NOT by
extraneous letters (ie awordhttp:www.somewhere.comaword).

You could cover each of these cases in your regex:

isLink = re.compile('["\'(\s]http: ....otherstuff...  ["\'\s]')

but this could get longer and longer when you realize that there could also
be asterisks around the link etc. A simpler way is just to specify a
non-alphanumeric character, which is \W (this already covers white space,
\s), and let it appear once or not at all before the expression to be
extracted. So the regex may look something like this:

isLink = re.compile(r'\W?((?:http|ftp):[\w.~/?-]*\w)', re.IGNORECASE)

The explanation:
First look for a non-alphanumeric character ( \W, which includes white
space) and match it or not. Then look for either "http" or "ftp" followed by
a colon; the (?: ) keeps that choice from becoming a group of its own. The
next part in brackets matches a string of alphanumeric characters or a dot,
tilde, forward slash, question mark or dash ( [\w.~/?-] ; \w already covers
the underscore, and the dash goes last so that it is not read as a range).
That string must end with an alphanumeric character (\w). The search has to
be greedy here (* rather than *?): a dot is itself a non-alphanumeric
character, so a non-greedy search would stop well before the end of the
link. Being greedy, it runs to the end of the link and then backs up over a
trailing dot, so no test for the character after the link is needed, and a
link at the very end of the string still matches. The outer parentheses make
the whole link group one. The entire thing is case insensitive since the
scheme may be written "HTTP" as well as "http". I *think* that this will
work for links like the ones you describe, but try it on your own text.
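
Here is a small sketch of the plain-text case that you can run and play
with. The names URL and find_links and the sample strings are made up for
illustration, and the pattern is one possible variant, not the only way to
do it: it uses \b so that a link glued onto the end of a word is skipped,
and a greedy character class so that the match runs to the end of the link
and then drops trailing punctuation.

```python
import re

# Scheme, colon, then a run of Url-ish characters that must end in a letter,
# digit or underscore, so trailing punctuation is left out of the match.
URL = re.compile(r'\b((?:http|ftp):[\w.~/?-]*\w)', re.IGNORECASE)

def find_links(text):
    """Return every link found in text, without surrounding punctuation."""
    # With a single group in the pattern, findall returns the group's text.
    return URL.findall(text)

print(find_links('(http://www.link.com)'))
print(find_links('Look at http://www.link.com. It is good.'))
print(find_links('Get FTP://ftp.link.com/pub/file.txt now'))
print(find_links('awordhttp://www.link.comaword'))   # glued to a word: no match
```

The character class is deliberately short. Links that contain characters
such as ":" (a port number), "=", "&" or "%" will be cut off at that
character, and a trailing "/" is dropped, so extend the class if your text
needs it.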

Anyway, I hope this gets you started on regular expressions. There is a
great deal of power there once you get the hang of them.

Craig Hagerman