[Tutor] RegEx query

Mon Dec 19 11:12:44 CET 2005

Hi Kent,

I apologise for the not overly helpful initial post.

I had six possible uris to deal with -

/thread/28742/
/thread/28742/?s=1291819247219837219837129
/thread/28742/5/
/thread/28742/5/?s=1291819247219837219837129
/thread/28742/?goto=lastpost
/thread/28742/?s=1291819247219837219837129&goto=lastpost

The only ones I wanted to match were the first two.

My initial pattern /thread/[0-9]*?/(\?s\=.*)?(?!lastpost)$

matched the first two and the last in redemo.py (which I've got
stashed as a py2exe bundle, should I ever find myself sans Python but
having to use regexes).
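
For anyone wanting to reproduce this outside redemo.py, a minimal sketch follows. It assumes re.search semantics; the names URIS, INITIAL and matched are made up for illustration.

```python
import re

# The six URIs listed above.
URIS = [
    "/thread/28742/",
    "/thread/28742/?s=1291819247219837219837129",
    "/thread/28742/5/",
    "/thread/28742/5/?s=1291819247219837219837129",
    "/thread/28742/?goto=lastpost",
    "/thread/28742/?s=1291819247219837219837129&goto=lastpost",
]

# The initial pattern, unchanged.
INITIAL = re.compile(r"/thread/[0-9]*?/(\?s\=.*)?(?!lastpost)$")

# Collect every URI the pattern finds a match in.
matched = [uri for uri in URIS if INITIAL.search(uri)]
for uri in matched:
    print(uri)
```

Run as written, this should print the first, second and sixth URIs, which is the unwanted result described above.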

I managed to sort it by using

/thread
/[0-9]*?/
(\?s\=\w*)?$

The s avoids the fifth possibility (?goto=lastpost), and the \w precludes the & in the last uri.
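
A quick sketch of checking that workaround against all six URIs, assuming the three pattern lines are meant to be compiled with re.VERBOSE and applied with re.search (URIS, FIXED and matched are illustrative names only):

```python
import re

# The six URIs listed earlier.
URIS = [
    "/thread/28742/",
    "/thread/28742/?s=1291819247219837219837129",
    "/thread/28742/5/",
    "/thread/28742/5/?s=1291819247219837219837129",
    "/thread/28742/?goto=lastpost",
    "/thread/28742/?s=1291819247219837219837129&goto=lastpost",
]

# re.VERBOSE ignores the whitespace and newlines inside the pattern,
# so it can be laid out over several lines as above.
FIXED = re.compile(r"""
    /thread
    /[0-9]*?/
    (\?s\=\w*)?$
    """, re.VERBOSE)

matched = [uri for uri in URIS if FIXED.search(uri)]
for uri in matched:
    print(uri)
```

Only the first two URIs should be printed: \w* stops at the & in the last one, so the $ can never be reached there.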

But, circumventing the problem irks me no end, as I haven't fixed what
I was doing wrong, which means I'll probably do it again, and avoiding
problems instead of resolving them feels too much like programming for
the Win32 api to me.
(Where removing a service from the service database doesn't actually
remove the service from the service database until you open and close
a handle to the service database a second time...)

So yes, any advice on how to use negative lookaheads would be great. I
get the feeling it was the .* before it.
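
A sketch of what seems to be going on, and one possible fix. A negative lookahead only inspects the text that follows the current position. In the initial pattern the greedy .* has already consumed "lastpost" by the time the lookahead runs, so it is looking at the empty end of the string and always succeeds. Moving the lookahead in front of the .* and putting a .* inside it is one way around that. The names GREEDY and LOOKAHEAD_FIRST are made up here, and the second pattern is only a suggestion, not a tested drop-in for the real site.

```python
import re

# The initial pattern: the lookahead sits after the greedy .*
GREEDY = re.compile(r"/thread/[0-9]*?/(\?s\=.*)?(?!lastpost)$")

# Suggested variant: check for "lastpost" anywhere ahead *before*
# the query string is consumed.
LOOKAHEAD_FIRST = re.compile(r"/thread/[0-9]*?/(?!.*lastpost)(\?s\=.*)?$")

last = "/thread/28742/?s=1291819247219837219837129&goto=lastpost"

# The optional group swallows the whole query string, "lastpost" included,
# so nothing is left for (?!lastpost) to reject.
swallowed = GREEDY.search(last).group(1)
print(swallowed)

# With the lookahead moved forward, the last URI is rejected...
rejected = LOOKAHEAD_FIRST.search(last) is None
print(rejected)

# ...while the two wanted URIs still match.
wanted = [
    "/thread/28742/",
    "/thread/28742/?s=1291819247219837219837129",
]
still_match = all(LOOKAHEAD_FIRST.search(uri) for uri in wanted)
print(still_match)
```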

As for my problem with BeautifulSoup, I'm not sure what was happening
there. It was happening in interactive console only, and I can't
replicate it today, which suggests to me that I've engaged email
before brain again.

I do like BeautifulSoup, however. Although people keep telling me about
some XPath programme that's apparently better, I like BeautifulSoup;
it works.

Regards,

Liam Clarke

On 12/18/05, Kent Johnson <kent37 at tds.net> wrote:
> Liam Clarke wrote:
> > Hi all,
> >
> > Using Beautiful Soup and regexes.. I've noticed that all the examples
> > used regexes like so - anchors = parseTree.fetch("a",
> > {"href":re.compile("pattern")} )  instead of precompiling the pattern.
> >
> > Myself, I have the following code -
> >
> >>>>z = []
> >>>>x = q.findNext("a", {"href":re.compile(".*?thread/[0-9]*?/.*",
> >
> > re.IGNORECASE)})
> >
> >
> >>>>while x:
> >
> > ...   num = x.findNext("td", "tableColA")
> > ...   h = (x.contents[0],x.attrMap["href"],num.contents[0])
> > ...   z.append(h)
> > ...   x = x.findNext("a",{"href":re.compile(".*?thread/[0-9]*?/.*",
> > re.IGNORECASE)})
> > ...
> >
> > This gives me a correct set of results. However, using the following -
> >
> >
> >>>>z = []
> >>>>pattern = re.compile(".*?thread/[0-9]*?/.*", re.IGNORECASE)
> >>>>x = q.findNext("a", {"href":pattern)})
> >
> >
> >>>>while x:
> >
> > ...   num = x.findNext("td", "tableColA")
> > ...   h = (x.contents[0],x.attrMap["href"],num.contents[0])
> > ...   z.append(h)
> > ...   x = x.findNext("a",{"href":pattern} )
> >
> > will only return the first found tag.
> >
> > Is the regex only evaluated once or similar?
>
> I don't know why there should be any difference unless BS modifies the compiled regex
> object and for some reason needs a fresh one each time. That would be odd and I don't see
> it in the source code.
>
> The code above has a syntax error (extra paren in the first findNext() call) - can you
> post the exact non-working code?
> >
> > (Also any pointers on how to get negative lookahead matching working
> > would be great.
> > the regex (/thread/[0-9]*)(?!\/) still matches "/thread/28606/" and
> > I'd assumed it wouldn't.
>
> Putting these expressions into Regex Demo is enlightening - the regex matches against
> "/thread/2860" - in other words the "not /" is matching against the 6.
>
> You don't give an example of what you do want to match so it's hard to know what a better
> solution is. Some possibilities
> - match anything except a digit or a slash - [^0-9/]
> - match the end of the string - $
> - both of the above - ([^0-9/]|$)
>
> Kent
>
> >
> > Regards,
> >
> > Liam Clarke
> > _______________________________________________
> > Tutor maillist  -  Tutor at python.org
> > http://mail.python.org/mailman/listinfo/tutor