[Tutor] String searching

Remco Gerlich scarblac@pino.selwerd.nl
Tue, 6 Jun 2000 13:54:59 +0200


On Sun, Jun 04, 2000 at 05:15:20PM +0200, André Dahlqvist wrote:
> > > >>> word = "http://www.fenx.com"
> > > >>> import string
> > > >>> if string.split(word,'://')[0] in ('http','ftp'):
> 
> I mentioned earlier that this solution worked good for my needs,

And I mentioned before that this exact version doesn't work because it
falsely matches the words 'http' and 'ftp'...
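
To make that concrete, here is a small sketch of the false match. The
function name looks_like_url is made up for illustration, and I'm assuming a
Python where split is available as a string method rather than through the
string module:

```python
def looks_like_url(word):
    # Same test as the quoted snippet, written with the str.split method.
    return word.split('://')[0] in ('http', 'ftp')

print(looks_like_url("http://www.fenx.com"))  # True, as intended
print(looks_like_url("http"))                 # True as well: a false match
print(looks_like_url("ftp"))                  # True as well: a false match
print(looks_like_url("hello"))                # False
```

When the separator isn't in the word, split returns the whole word as the
only element, so the bare words pass the test.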

> but I
> have now found a case which I would need for it to handle, but it
> doesn't. Most of the links in the text that I am searching are
> separated from the rest of the text with a space, like this: "For more
> info see http://www.somelink.com .", which is as I see it an attempt to
> ease extraction. But in a few places in the text there are links
> enclosed in parenthesis where they haven't put a space at the end nor
> the beginning: "(http://www.somelink.com)"

Yes. People use <> for that a lot too. Not [], since that can occur in URLs.

> Earlier in this thread Craig mentioned that it would probably be best
> to use regular expressions, and because of the problems mentioned above
> I think they are what I need.

Right. They will do OK, at least until you run into the next problem :)

> So I read up on regular expressions, and
> found a solution that could find the URLs in the text. But since I am
> not very good at regular expressions I can not come up with one that
> correctly handles the above mentioned problems. That is, I would like
> it to find links even if they are like "(http://www.link.com)" or when
> a dot is placed right after the link.

You can't strip the dot - what if the dot is part of the URL? That's
completely legal.

> Then I want to extract this link,
> but _only_ the part of it that is actually the link (not the
> surrounding parenthesis for an example.)
> 
> Sorry for the long explanation, I wanted to make sure I correctly
> described what I wanted to do.

Ok, we want a regex to match:
1. whitespace, ( or <
2. http or ftp
3. ://
4. some characters that aren't whitespace, (, ), < or >
5. whitespace, > or ), or < or ( (in case of http://foo<http://bar>, which is two URLs).

First try:

import re
# re.VERBOSE makes re ignore the spaces in the pattern, so the parts stay readable
r = re.compile(r"[\s(<] (http|ftp) :// [^\s()<>]* [\s<>()]",
         re.VERBOSE+re.IGNORECASE)

We want to capture parts 2-4 as one group, so we should put () around them.
There is already a () around http|ftp to match either of those, but we don't
want to capture that group, so we need (?:) there.

Second try:

r = re.compile(r"[\s(<] ( (?:http|ftp) :// [^\s()<>]* ) [\s<>()]",
         re.VERBOSE+re.IGNORECASE)
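
A quick sketch of the difference between the first and second try, using
findall. The sample text and the names first, second and FLAGS are my own,
chosen only for illustration:

```python
import re

FLAGS = re.VERBOSE | re.IGNORECASE

# First try: the only capturing group is (http|ftp).
first = re.compile(r"[\s(<] (http|ftp) :// [^\s()<>]* [\s<>()]", FLAGS)
# Second try: one capturing group around parts 2-4, (?:) around http|ftp.
second = re.compile(r"[\s(<] ( (?:http|ftp) :// [^\s()<>]* ) [\s<>()]", FLAGS)

text = "see (http://www.somelink.com) for details"
first_result = first.findall(text)
second_result = second.findall(text)

print(first_result)   # ['http']
print(second_result)  # ['http://www.somelink.com']
```

With exactly one capturing group in the pattern, findall returns what that
group matched, which is why the first try only gives back the scheme.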

Now the only problem is that if there are two URLs right after each other,
the first one "consumes" the whitespace between them, so that the second
can't match (the whitespace before it is already matched). So part 5 should
match, but not consume the character. This is what (?=) does.

Third try:

r = re.compile(r"[\s(<] ( (?:http|ftp) :// [^\s()<>]* ) (?=[\s<>()])",
         re.VERBOSE+re.IGNORECASE)
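
Here is a small sketch of the "consuming" problem, comparing the second and
third try on two URLs separated by a single space. The sample text and the
variable names are assumptions of mine, not part of the original question:

```python
import re

FLAGS = re.VERBOSE | re.IGNORECASE

# Second try: the closing delimiter is part of the match (consumed).
second = re.compile(r"[\s(<] ( (?:http|ftp) :// [^\s()<>]* ) [\s<>()]", FLAGS)
# Third try: the closing delimiter is only checked with (?=...), not consumed.
third = re.compile(r"[\s(<] ( (?:http|ftp) :// [^\s()<>]* ) (?=[\s<>()])", FLAGS)

text = " http://a http://b "
second_result = second.findall(text)
third_result = third.findall(text)

print(second_result)  # ['http://a']
print(third_result)   # ['http://a', 'http://b']
```

In the second try the space after http://a is used up by the first match, so
nothing is left to satisfy part 1 for http://b.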

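Ah, but I've also forgotten the beginning of the string! A URL should be
recognized there too. ^ matches at the start of the string (a URL at the
start of a later line is already covered, since the newline before it counts
as whitespace). The same goes for the end of the string, which $ matches.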

Fourth try:
r = re.compile(r"(?:^|[\s(<]) ( (?:http|ftp) :// [^\s()<>]* ) (?=$|[\s<>()])",
         re.VERBOSE+re.IGNORECASE)
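
As a sketch of what ^ and $ buy us, compare the third and fourth try on a
string that starts and ends with a URL. The sample text and names are made
up for illustration:

```python
import re

FLAGS = re.VERBOSE | re.IGNORECASE

# Third try: needs a delimiter character on both sides of the URL.
third = re.compile(r"[\s(<] ( (?:http|ftp) :// [^\s()<>]* ) (?=[\s<>()])", FLAGS)
# Fourth try: also accepts the start and the end of the string.
fourth = re.compile(
    r"(?:^|[\s(<]) ( (?:http|ftp) :// [^\s()<>]* ) (?=$|[\s<>()])", FLAGS)

text = "http://start and http://end"
third_result = third.findall(text)
fourth_result = fourth.findall(text)

print(third_result)   # []
print(fourth_result)  # ['http://start', 'http://end']
```

The third try finds nothing here: the first URL has no character in front of
it, and the second has none behind it.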

And this one works.

Actually, I suddenly realize that the last term isn't necessary. [^\s()<>]*
is greedy, so it takes as many characters as it can, and the match can only
stop at the end of the string or right before one of those characters.

Fifth try:

r = re.compile(r"(?:^|[\s(<]) ( (?:http|ftp) :// [^\s()<>]* )",
        re.VERBOSE+re.IGNORECASE)
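
A rough check of that claim, comparing the fourth and fifth try on a few
strings. This is only a spot check on samples I picked myself, not a proof,
and the names are mine:

```python
import re

FLAGS = re.VERBOSE | re.IGNORECASE

# Fourth try: with the trailing lookahead.
fourth = re.compile(
    r"(?:^|[\s(<]) ( (?:http|ftp) :// [^\s()<>]* ) (?=$|[\s<>()])", FLAGS)
# Fifth try: the same pattern without the lookahead.
fifth = re.compile(r"(?:^|[\s(<]) ( (?:http|ftp) :// [^\s()<>]* )", FLAGS)

samples = [
    "http://start and http://end",
    "two in a row: http://a http://b",
    "nested <ftp://foo<http://bar>>",
    "no links here",
]

# Collect both results per sample so they can be compared side by side.
results = [(fourth.findall(s), fifth.findall(s)) for s in samples]
for with_lookahead, without_lookahead in results:
    print(with_lookahead == without_lookahead, without_lookahead)
```

On these samples both patterns return the same lists.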

>>> import re
>>> r = re.compile(r"(?:^|[\s(<]) ( (?:http|ftp) :// [^\s()<>]* )",
...                re.VERBOSE+re.IGNORECASE)
>>> r.findall("http://test FTP://spam<ftp://foo> (ftp)http://WHEE")
['http://test', 'FTP://spam', 'ftp://foo']

It doesn't find the last one, since that one directly follows a ")", but I
don't think it should.
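
For completeness, a sketch of the fifth try applied to the kinds of text from
the question. The name url_re and the third sample sentence are mine, added
for illustration:

```python
import re

url_re = re.compile(r"(?:^|[\s(<]) ( (?:http|ftp) :// [^\s()<>]* )",
                    re.VERBOSE | re.IGNORECASE)

samples = [
    "For more info see http://www.somelink.com .",
    "(http://www.somelink.com)",
    "See http://www.link.com.",
]

found = [url_re.findall(s) for s in samples]
for urls in found:
    print(urls)
# ['http://www.somelink.com']
# ['http://www.somelink.com']
# ['http://www.link.com.']
```

Note that the dot directly after the link stays in the last result. The
pattern can't tell whether it belongs to the URL, so it keeps it.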

Anyway, I didn't know much about regular expressions before this, and I know
more now. Learn by experimenting ;-).
-- 
Remco Gerlich,  scarblac@pino.selwerd.nl