[Python-Dev] Re: What to do about the Wiki?

M.-A. Lemburg mal@lemburg.com
Wed, 31 Jul 2002 19:02:51 +0200


Guido van Rossum wrote:
>>    Guido> Juergen Hermann, MoinMoin's author, said he fixed a few things,
>>    Guido> but also said that MoinMoin is essentially vulnerable to
>>    Guido> "recursive wget" (e.g. someone trying to suck up the entire Wiki
>>    Guido> by following links).  Apparently this is what brought the site
>>    Guido> down this weekend -- if I understand correctly, an in-memory log
>>    Guido> was growing too fast.
>>
>>I'm a bit confused by these statements.  MoinMoin is a CGI script.  I don't
>>understand where "recursive wget" and "in-memory log" would come into play.
>>I recently fired up two Wikis on the Mojam server.  I never see any
>>long-running process which would suggest there's an in-memory log which
>>could grow without bound.  The MoinMoin package does generate HTTP
>>redirects, but while they might coax wget into firing off another request,
>>it should be handled by a separate MoinMoin process on the server side.  You
>>should see the load grow significantly as the requests pour in, but
>>shouldn't see any one MoinMoin process gobbling up all sorts of resources.
>>Jürgen, can you elaborate on these themes a little more?
>
>
> Juergen seems offline or too busy to respond.  Here's what he wrote on
> the matter.  I guess he's reading the entire log into memory and
> updating it there.

Jürgen is talking about the file event.log which MoinMoin writes.
This is not read into memory. New events are simply appended to
the file.
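
For reference, appending an event is just a constant-time file
append. A rough sketch of the mechanism in Python (the record
layout here is made up, not MoinMoin's actual format):

    import time

    def log_event(kind, pagename, logfile='event.log'):
        # Open in append mode and add one record; nothing is
        # held in memory between requests.
        f = open(logfile, 'a')
        f.write('%d\t%s\t%s\n' % (time.time(), kind, pagename))
        f.close()

So the log costs disk space, not RAM.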

Now, since the Wiki has recursive links, such as the "LikePages"
links on all pages and history links like the per-page info
screen, a recursive wget is likely to run for quite a while
(all the more so because the URL level doesn't change much and
thus probably doesn't trigger any depth restrictions in wget-like
crawlers) and to generate lots of events...
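
To illustrate: even a depth-limited crawl along the lines of

    wget -r -l 5 http://wiki.example.org/

(the host name is a placeholder) will cover most of a wiki, since
wget's -l option counts link hops, and in a densely cross-linked
wiki nearly every page and action URI is reachable within a few
hops from anywhere.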

What was the cause of the breakdown? A full disk, or a process
claiming all resources?

> | Subject: [Pydotorg] wiki
> | From: Juergen Hermann <jh@web.de>
> | To: "pydotorg@python.org" <pydotorg@python.org>
> | Date: Mon, 29 Jul 2002 20:32:31 +0200
> | Hi!
> |
> | I looked into the wiki, and two things killed us:
> |
> | a) apart from google hits, some $!&%$""$% did a recursive wget. And the
> | wiki spans a rather wide uri space...
> |
> | b) the event log grows much faster than I'm used to, thus some
> | "simple" algorithms don't hold for this size.
> |
> |
> | Solutions:
> |
> | a) I just updated the wiki software, the current cvs contains a
> | robot/wget filter that forbids any access except to "view page" URIs
> | (i.e. we remain open to google, but no more open than absolutely
> | needed). If need be, we can forbid access altogether, or only allow
> | google.
> |
> | b) I'll install a cron job that rotates the logs, to keep them short.
> |
> | I shortened the logs manually for now. So if you all agree, we could
> | activate the wiki again.
> |
> |
> | Ciao, Jürgen
>
> Reading this again, I think we should give it a try again.
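
Side note: such a robot filter doesn't need to be fancy. A rough
Python sketch of the idea (made-up names, not the actual MoinMoin
code):

    ROBOT_MARKERS = ('wget', 'bot', 'crawler', 'spider')  # illustrative

    def allow_request(user_agent, query_string):
        # Suspected robots may only fetch plain "view page" URIs,
        # i.e. requests without an action=... parameter (info,
        # history, diff and friends stay off limits to them).
        agent = user_agent.lower()
        for marker in ROBOT_MARKERS:
            if marker in agent:
                return 'action=' not in query_string
        return True  # normal browsers keep full access

Combined with the cron job rotating event.log, that should keep
both problems in check.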

--
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
_______________________________________________________________________
eGenix.com -- Makers of the Python mx Extensions: mxDateTime,mxODBC,...
Python Consulting:                               http://www.egenix.com/
Python Software:                    http://www.egenix.com/files/python/