Googlebot and the mail.python.org python-dev archive
Is anyone else having trouble getting the python.org mail archive to turn up in Google searches for python-dev messages?

I prefer to use that archive rather than one of the multitude of 3rd party archives when linking posts from PEPs and tracker issues, but for the last few weeks I've had to go find the messages directly on the archive pages rather than being able to grab them from a search.

Example search (note that the top python.org hits are from 2006, but a 3rd party archive has the discussion I was after at the top of the list):
http://www.google.com/search?hl=en&q=inurl%3Apython-dev+contextlib.nested&btnG=Search

Searching the python.org archive specifically shows that the relevant recent messages aren't in the search index at all:
http://www.google.com/search?q=inurl:pipermail+inurl:python-dev+contextlib.nested&hl=en&filter=0

Cheers,
Nick.

--
Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
I think the better syntax would be to add site:mail.python.org to the query, but you're right, that doesn't seem to find recent messages. Maybe the absence of a robots.txt file on mail.python.org could be a partial explanation?

(Disclaimer: I may work for Google, and Google's first crawler may have been written in Python, but I haven't the foggiest idea about how our crawler works these days.)

--Guido

On Fri, Feb 27, 2009 at 4:03 PM, Nick Coghlan <ncoghlan@gmail.com> wrote:
Is anyone else having trouble getting the python.org mail archive to turn up in Google searches for python-dev messages?
I prefer to use that archive rather than one of the multitude of 3rd party archives when linking posts from PEPs and tracker issues, but for the last few weeks I've had to go find the messages directly on the archive pages rather than being able to grab them from a search.
Example search (note that the top python.org hits are from 2006, but a 3rd party archive has the discussion I was after at the top of the list): http://www.google.com/search?hl=en&q=inurl%3Apython-dev+contextlib.nested&btnG=Search
Searching the python.org archive specifically shows that the relevant recent messages aren't in the search index at all: http://www.google.com/search?q=inurl:pipermail+inurl:python-dev+contextlib.nested&hl=en&filter=0
Cheers, Nick.
-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
--Guido van Rossum (home page: http://www.python.org/~guido/)
Guido van Rossum schrieb:
I think the better syntax would be to add site:mail.python.org to the query, but you're right, that doesn't seem to find recent messages. Maybe the absence of a robots.txt file on mail.python.org could be a partial explanation?
Doesn't the absence of a robots.txt mean "you may index everything"?

Georg

--
Thus spake the Lord: Thou shalt indent with four spaces. No more, no less. Four shall be the number of spaces thou shalt indent, and the number of thy indenting shall be four. Eight shalt thou not indent, nor either indent thou two, excepting that thou then proceed to four. Tabs are right out.
Georg Brandl <g.brandl <at> gmx.net> writes:
Guido van Rossum schrieb:
I think the better syntax would be to add site:mail.python.org to the query, but you're right, that doesn't seem to find recent messages. Maybe the absence of a robots.txt file on mail.python.org could be a partial explanation?
Doesn't the absence of a robots.txt mean "you may index everything"?
It does. However, pages such as:
http://mail.python.org/pipermail/python-dev/
(and, it seems, all other pipermail-generated archive pages) have the following HTML tag in them:
<META NAME="robots" CONTENT="noindex,follow">
which explicitly instructs Web spiders *not* to index contents nor follow links.

Regards

Antoine.
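Georg's point about a missing robots.txt can be checked with Python's standard `urllib.robotparser`. This is only a minimal sketch and says nothing about how Googlebot itself behaves: an empty rule set stands in for the missing file (an assumption, made so the example needs no network access), and the user-agent string is just an illustrative label.

```python
from urllib.robotparser import RobotFileParser

# Assumption: an empty robots.txt stands in for the missing file on
# mail.python.org, so the sketch runs without network access.
parser = RobotFileParser()
parser.parse([])

url = "http://mail.python.org/pipermail/python-dev/"

# With no rules at all, every user agent may fetch every URL.
allowed = parser.can_fetch("Googlebot", url)
print(allowed)  # True
```

So robots.txt cannot be what keeps the archive out of the index; any restriction has to come from somewhere else, such as the per-page meta tag.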
Antoine Pitrou wrote:
Georg Brandl <g.brandl <at> gmx.net> writes:
Guido van Rossum schrieb:
I think the better syntax would be to add site:mail.python.org to the query, but you're right, that doesn't seem to find recent messages. Maybe the absence of a robots.txt file on mail.python.org could be a partial explanation?
Doesn't the absence of a robots.txt mean "you may index everything"?
It does. However, pages such as: http://mail.python.org/pipermail/python-dev/ (and, it seems, all other pipermail-generated archive pages) have the following HTML tag in them: <META NAME="robots" CONTENT="noindex,follow"> which explicitly instructs Web spiders *not* to index contents nor follow links.
That's not quite true - that meta tag says not to index the current page, but *do* follow the links to other pages. The archive page and the monthly summary pages say the same two things.

Once you get down to the individual post level, then it switches around - the meta tags on those pages say to index the page and NOT to follow links.

Those settings actually make a certain amount of sense - they should encourage the actual messages to turn up in search results rather than the index pages pointing to those messages.

The top-level list of mailing lists and the description pages for each list don't have the meta tag at all, so they should all be both indexed and the links followed.

However, I checked on Wayback and it hasn't archived anything from mail.python.org since late 2007, suggesting there may be something about the current setup that well-behaved web crawlers don't like.

Is pydotorg-www still the place for website questions?* If so, I should probably take this over there...

Cheers,
Nick.

* I ask because that list doesn't appear to have seen any traffic since May last year...

--
Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
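The index/follow distinction Nick describes can be made concrete with a short Python sketch built on the standard `html.parser` module. The class and function names are hypothetical, the `index,nofollow` tag for individual posts is reconstructed from the description in this thread rather than copied from a page, and the sketch ignores other directives such as `none` or `noarchive`.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives from <meta name="robots"> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names; values keep their case.
        attr = dict(attrs)
        if tag == "meta" and (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            self.directives.update(
                part.strip().lower() for part in content.split(",") if part.strip()
            )


def robots_policy(html):
    """Return (may_index, may_follow); both default to True when no tag is present."""
    parser = RobotsMetaParser()
    parser.feed(html)
    may_index = "noindex" not in parser.directives
    may_follow = "nofollow" not in parser.directives
    return may_index, may_follow


# Tag quoted in the thread for the archive and monthly summary pages.
archive_index = '<META NAME="robots" CONTENT="noindex,follow">'
# Assumption: individual posts carry this tag, per the description above.
single_message = '<META NAME="robots" CONTENT="index,nofollow">'
# Pages without the tag (list overview, list description pages).
no_tag = "<html><head><title>Lists</title></head></html>"

print(robots_policy(archive_index))   # (False, True)
print(robots_policy(single_message))  # (True, False)
print(robots_policy(no_tag))          # (True, True)
```

Read this way, a crawler can reach every post through the index pages while only the posts themselves are eligible for the search index.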
Nick Coghlan <ncoghlan <at> gmail.com> writes:
<META NAME="robots" CONTENT="noindex,follow"> which explicitly instructs Web spiders *not* to index contents nor follow links.
That's not quite true - that meta tag says not to index the current page, but *do* follow the links to other pages. The archive page and the monthly summary pages say the same two things.
For some mysterious reason my brain had read "nofollow" in the above. Well, never mind.

cheers

Antoine.
On Sat, Feb 28, 2009 at 09:53:10PM +1000, Nick Coghlan wrote:
Is pydotorg-www still the place for website questions?* If so, I should probably take this over there...
Just 'pydotorg' is the current list (http://mail.python.org/mailman/listinfo/pydotorg).

Looking at the access logs, mail.python.org is being actively crawled:

66.249.71.119 - - [28/Feb/2009:18:32:51 +0100] "GET /pipermail/python-list/2004-June/265194.html HTTP/1.1" 304 - "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
72.30.79.38 - - [28/Feb/2009:18:32:51 +0100] "GET /pipermail/csv/2003-February/000368.html HTTP/1.0" 200 3929 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp/3.0; http://help.yahoo.com/help/us/ysearch/slurp)"
65.55.211.30 - - [28/Feb/2009:18:32:51 +0100] "GET /pipermail/python-list/2006-May/382528.html HTTP/1.1" 200 4028 "-" "msnbot/1.1 (+http://search.msn.com/msnbot.htm)"

Is it maybe that the site is just too large, so the search engines index only 10,000 messages from it? One possible solution might be to block crawling of the python-list archive; it's enormous, and already available through Google's Usenet search.

--amk
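A rough sketch of what blocking the python-list archive could look like, checked with Python's standard `urllib.robotparser`. The robots.txt content is hypothetical (no such file existed on mail.python.org at the time of this thread), and whether excluding python-list would actually get more python-dev messages indexed is exactly the open question.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt; an illustration of the suggestion, not a
# file that was ever deployed on mail.python.org.
rules = """\
User-agent: *
Disallow: /pipermail/python-list/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

base = "http://mail.python.org"

# The python-list archive is excluded for every crawler...
list_ok = parser.can_fetch(
    "Googlebot", base + "/pipermail/python-list/2006-May/382528.html"
)
# ...while the python-dev archive stays crawlable.
dev_ok = parser.can_fetch("Googlebot", base + "/pipermail/python-dev/")

print(list_ok, dev_ok)  # False True
```

Rules in robots.txt match by path prefix, so a single `Disallow` line covers every month and every message under that list.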
On Sat, Feb 28, 2009 at 10:37:09AM +0000, Antoine Pitrou wrote:
have the following HTML tag in them: <META NAME="robots" CONTENT="noindex,follow"> which explicitly instructs Web spiders *not* to index contents nor follow links.
I believe this tells spiders not to index this page, but to follow its links. Individual messages have "index,nofollow".

--amk
participants (5): A.M. Kuchling, Antoine Pitrou, Georg Brandl, Guido van Rossum, Nick Coghlan