Search engine 'results' counts (was Re: PHP vs Python)

Alex Martelli aleaxit at yahoo.com
Tue Jan 16 07:40:45 EST 2001


"Peter Hansen" <peter at engcorp.com> wrote in message
news:3A63E928.12A56BC2 at engcorp.com...
    [snip]
> > python AND (mod_snake OR NOT snake) => 186,508 pages
> > php => 177,182 pages (overestimated as it's the default extension)
>
> On the topic of using search engines to compare counts
> of 'users' (not that I think you were very serious :),
> here's what you get from google:
>
> 'python -snake': 617,000 pages
> 'php': 9,990,000 pages
>
> I'm not exactly sure what that means, and I didn't waste
> my time looking beyond the first page.  I just strongly
> suspect this is yet another time when the 'pages found'
> count is essentially meaningless.

Searching is an art, not a science.  With Google (searching
WITH the doublequotes, so that each query matches only as an
exact phrase: an important little trick!-), the page counts are:

"java language"      -> 112,000
"perl language"      ->  42,500
"pascal language"    ->   8,350
"python language"    ->   9,360
"php language"       ->   5,820
"ruby language"      ->     936
"erlang language"    ->     171

"java programming"   -> 139,000
"perl programming"   ->  69,700
"pascal programming" ->  18,800
"python programming" ->   8,290
"php programming"    ->   8,980
"ruby programming"   ->     156
"erlang programming" ->     192

"java programmer"    ->  38,300
"perl programmer"    ->  14,700
"pascal programmer"  ->   1,960
"python programmer"  ->     928
"php programmer"     ->   1,260
"ruby programmer"    ->       5
"erlang programmer"  ->      18


So, Java's cultural dominance and Perl's secure second
place seem well confirmed, with PHP and Python below
both, and Ruby and Erlang a step lower still.  But these
counts cannot settle, within experimental error, the
relative ordering of Python and PHP, any more than that
of Ruby and Erlang: in each pair, which one leads depends
on the phrase searched.
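
To make that concrete, here is a small sketch (modern Python 3,
separate from the script further down) that ranks the languages
once per phrase, using only the counts quoted above.  The name
rank_by_phrase is purely illustrative.

```python
# Hit counts copied from the tables above (Google, January 2001).
counts = {
    "language": {
        "java": 112000, "perl": 42500, "pascal": 8350, "python": 9360,
        "php": 5820, "ruby": 936, "erlang": 171,
    },
    "programming": {
        "java": 139000, "perl": 69700, "pascal": 18800, "python": 8290,
        "php": 8980, "ruby": 156, "erlang": 192,
    },
    "programmer": {
        "java": 38300, "perl": 14700, "pascal": 1960, "python": 928,
        "php": 1260, "ruby": 5, "erlang": 18,
    },
}

def rank_by_phrase(counts):
    """Return {phrase: [languages sorted by descending hit count]}."""
    return {
        phrase: sorted(hits, key=hits.get, reverse=True)
        for phrase, hits in counts.items()
    }

rankings = rank_by_phrase(counts)
for phrase, order in rankings.items():
    print("%-12s %s" % (phrase, " > ".join(order)))
```

Java and Perl come out first and second for every phrase, while
Python leads PHP only for "language", and Ruby leads Erlang only
for "language" as well.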


Anyway, here's one way to do these searches without
too much pain (a rather quick & dirty script):

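import urllib, re, locale

# %%22 turns into a literal %22 (a URL-encoded doublequote) around the query
google = "http://www.google.com/search?q=%%22%s%%22&num=2"
# the hit count shows up on the results page as 'of about <b>N</b>'
numres = re.compile('of about <b>([,0-9]+)</b>')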

def results(wordlist):
    # fetch the results page for the quoted phrase, return its hit count
    query = '+'.join(wordlist)
    query_result = urllib.urlopen(google % query)
    for line in query_result.readlines():
        mo = numres.search(line)
        if mo:
            number_string = mo.group(1).replace(',','')
            return int(number_string)
    return 0    # no count found anywhere on the page

def combined(headwords, tailwords):
    longest_result = 0
    table = {}
    for head in headwords:
        table[head] = {}
        for tail in tailwords:
            result = results([head,tail])
            result = locale.format("%d",result,1)
            table[head][tail] = result
            if len(result)>longest_result:
                longest_result = len(result)
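    # column widths: the widest '"head tail"' pair (2 quotes + 1 space
    # + 1 spare) and the widest formatted count
    longest_head = max(map(len,headwords))
    longest_tail = max(map(len,tailwords))
    pairs_length = longest_head+longest_tail+4
    format = '%%-%ds %%%ds' % (pairs_length, longest_result)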
    for tail in tailwords:
        for head in headwords:
            pair = '"'+head+' '+tail+'"'
            hits = table[head][tail]
            print format % (pair, hits)
        print

def searcher(wordlist):
    separator = wordlist.index('+')
    combined(wordlist[:separator], wordlist[separator+1:])

if __name__=='__main__':
    import sys
    # 'US' is a Windows locale name; other platforms spell it differently
    locale.setlocale(locale.LC_NUMERIC, 'US')
    searcher(sys.argv[1:])
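
The parsing step can be tried without touching the network.  The
following is a minimal sketch in modern Python 3, reusing the same
regular expression as the script above; the sample string is a
made-up fragment (the real page layout is an assumption), and
parse_count and group_thousands are illustrative names only.

```python
import re

# Same pattern as the script above; it assumes the results page
# contains text such as "of about <b>9,360</b>".
numres = re.compile(r'of about <b>([,0-9]+)</b>')

def parse_count(page_text):
    """Return the hit count found in page_text, or 0 if there is none."""
    mo = numres.search(page_text)
    if mo is None:
        return 0
    return int(mo.group(1).replace(',', ''))

def group_thousands(n):
    """Format an integer with comma separators, independent of locale."""
    return format(n, ',d')

# Hypothetical fragment of a results page, invented for illustration.
sample = 'Results <b>1</b> - <b>2</b> of about <b>9,360</b>.'
print(group_thousands(parse_count(sample)))
```

Using format(n, ',d') instead of the locale module is a swapped-in
choice here: it sidesteps the platform-specific locale name, at the
cost of always grouping with commas.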


So, for example:

D:\AWeb\pick>python sego.py tcl rexx + language programming programmer
"tcl language"      10,300
"rexx language"      3,330

"tcl programming"    2,430
"rexx programming"   2,950

"tcl programmer"       368
"rexx programmer"      149


D:\AWeb\pick>
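
The argument splitting and column layout can likewise be checked
offline.  This is only a sketch in modern Python 3: split_args and
layout are illustrative names, and the counts are copied from the
sample run above in place of live queries.

```python
def split_args(args):
    """Split ['a', 'b', '+', 'x', 'y'] into (['a', 'b'], ['x', 'y'])."""
    sep = args.index('+')
    return args[:sep], args[sep + 1:]

def layout(heads, tails, hits):
    """Return the table lines; hits maps (head, tail) to a formatted count."""
    # Widest quoted pair: 2 quotes + 1 space + 1 spare, as in the script.
    pair_width = max(map(len, heads)) + max(map(len, tails)) + 4
    hits_width = max(len(hits[h, t]) for h in heads for t in tails)
    lines = []
    for tail in tails:
        for head in heads:
            pair = '"%s %s"' % (head, tail)
            lines.append('%-*s %*s' % (pair_width, pair,
                                       hits_width, hits[head, tail]))
        lines.append('')    # blank line between groups
    return lines

# Counts copied from the sample run above, standing in for live queries.
hits = {
    ('tcl', 'language'): '10,300', ('rexx', 'language'): '3,330',
    ('tcl', 'programming'): '2,430', ('rexx', 'programming'): '2,950',
    ('tcl', 'programmer'): '368', ('rexx', 'programmer'): '149',
}
heads, tails = split_args('tcl rexx + language programming programmer'.split())
table = layout(heads, tails, hits)
print('\n'.join(table))
```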


Alex