Web based AI advice for neophyte???

Sat Nov 3 16:01:42 EST 2001

This scriptlet now works sort of interestingly. I am hoping that others
might try it out and offer advice on how to improve it. It uses a
modified version of the multiChoiceGuesser code that Max M posted on
this list a few weeks ago. I changed it to be slightly more
discriminating in its choice of a possible answer to an English language
question: it uses urllib to query Google with various combinations of
the question and the possible answers. It is far more fun and
provocative than it was last week. I also added an algorithm to actually
find an answer even if the user does not offer possible choices.

This is done by using a modified and extended version of the NLQ class
posted on the web by "Adam", which simply parses an English sentence and
selects keywords. I query Google with the question and then use NLQ to
make a list of keywords from the results page to use as possible
answers.
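
As a rough illustration of the keyword step, here is a minimal sketch of
stop-word filtering. It is a simplified stand-in, not Adam's NLQ class:
the function name extract_keywords and the short STOP_WORDS list are
made up for this example (the script's IGNORE_WORDS list is much
longer), and it is written for a current Python.

```python
import string

# A small, hypothetical stop-word list used only for this illustration.
STOP_WORDS = set(["the", "a", "an", "of", "is", "was", "in", "and",
                  "what", "who"])

def extract_keywords(sentence, stop_words=STOP_WORDS):
    """Return the lowercased words of `sentence` that are not stop words."""
    keywords = []
    for word in sentence.lower().split():
        # Remove punctuation stuck to either end of the word.
        word = word.strip(string.punctuation)
        if word and word not in stop_words:
            keywords.append(word)
    return keywords

keywords = extract_keywords("Who was the first president of the United States?")
# keywords == ["first", "president", "united", "states"]
```

Run over a whole results page instead of one sentence, the same filter
gives the list of candidate answers.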

Max M's original code worked on a mainloop with a test suite. I have
modified it to accept user input from the command line, and now also to
prompt the user to offer options or choices from the command line.
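
To make the scoring idea concrete, here is a minimal sketch of the
hit-ratio comparison that the comments in the script describe. Some
assumptions of mine: the hit counts are passed in as plain dictionaries
rather than fetched from Google, the name pick_answer and the sample
numbers are invented for illustration, and the code is written for a
current Python rather than copied from the script. For each option it
divides the hits for the option alone by the hits for the question plus
the option and prefers the smallest ratio, so an option does not win
merely because it is a common word.

```python
def pick_answer(question_hits, option_hits):
    """Return the option with the lowest option-only / question+option ratio.

    question_hits: {option: hits for a search on "question option"}
    option_hits:   {option: hits for a search on the option by itself}
    """
    best_option, best_ratio = None, None
    for option in option_hits:
        # Treat a missing or zero count as 1 to avoid dividing by zero.
        together = question_hits.get(option, 0) or 1
        ratio = float(option_hits[option]) / together
        if best_ratio is None or ratio < best_ratio:
            best_option, best_ratio = option, ratio
    return best_option

# Invented hit counts, for illustration only.
together = {"Paris": 900, "London": 300}
alone = {"Paris": 9000, "London": 30000}
answer = pick_answer(together, alone)   # -> "Paris" (ratio 10 beats 100)
```

One weakness carries over from this simple ratio: an option with no hits
at all scores zero and therefore wins, so a real version would probably
want to discard such options first.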

Future improvements could include using an improved NLQ to select key
phrases, not just keywords; using several alternative algorithms to test
the likely validity of a given answer, and averaging the results; having
Merlin determine the type of query being issued and modify his response
techniques accordingly (NLQ already has a start on determining types of
queries, but I have not utilized that functionality much yet); and
having Merlin ask questions of his own to help clarify the questioner's
intent when necessary.

I could also add prompts to ask the user for various criteria to be used
in making decisions, along with "weights" for each criterion, then query
the web with each option-criterion pair and calculate the weighted
averages of the Google hits, sort of like I do in an earlier hack I
called decision analysis.
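
A minimal sketch of that weighted-average idea follows. Nothing like it
is in the script yet; the function name weighted_scores, the options,
the criteria, and the hit counts are all hypothetical, and the counts
are supplied by hand instead of coming from Google.

```python
def weighted_scores(hits, weights):
    """Return {option: weighted average of its hit counts}.

    hits:    {(option, criterion): hit count for that pair}
    weights: {criterion: weight the user gave that criterion}
    """
    total_weight = float(sum(weights.values()))
    scores = {}
    for (option, criterion), count in hits.items():
        scores.setdefault(option, 0.0)
        scores[option] += weights[criterion] * count / total_weight
    return scores

# Invented numbers, for illustration only.
weights = {"taste": 3, "cost": 1}
hits = {("tea", "taste"): 100, ("tea", "cost"): 20,
        ("coffee", "taste"): 40, ("coffee", "cost"): 80}
scores = weighted_scores(hits, weights)
best = max(scores, key=scores.get)   # -> "tea" (80.0 against 50.0)
```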

I hope others will come up with even more innovative ideas.

Last week I posted a one-weekend hack of this that was way premature and
well nigh worthless. Although this is still a very simple hack, and I am
not a professional coder, this version is nonetheless far better than
last week's, which I should never have posted at such a preliminary
stage. I hope you will try this script out even if you found last week's
to be vacuous.

The script can be found at http://www.awaretek.com/askMerlin.py which
might be a better version to use than the pasted-in one below, which may
not properly preserve whitespace. Also, the
http://www.awaretek.com/askMerlin site will be kept up to date with
later, hopefully improved, versions.

Ron Stephens

#!/usr/bin/python
# AskMerlin is a script I did by putting together two scripts, modifying
# them both, and adding input/output routines around them.
#
# First, I utilized the multiChoiceGuesser script that Max M posted on the
# newsgroup comp.lang.python a couple of weeks ago. This uses urllib to go
# out to the web and judge the appropriateness of a given answer by how
# many hits it gets on Google when coupled with the original question in a
# Google search.
#
# My contributions were to enable the program to ask for both an original
# question and then for options to choose from. I also set up a small
# routine to choose a most appropriate answer in the case that no options
# are given. This is done by using the second program, NLQ, to pick out
# keywords from the page returned by a Google search of the question by
# itself. These keywords are then used as options, or possible answers, to
# the question, and multiChoiceGuesser is applied to the question along
# with all of the keywords generated by NLQ. The result can take a long
# time, but eventually it gets there, always. (???)
#
# Also, I added to multiChoiceGuesser the requirement to do two Google
# searches for each option: one on the original question plus the option,
# and one on the option by itself. Then we calculate a ratio between each
# option's Google hit score and its question/option Google hit score, thus
# avoiding merely choosing the option that has overwhelmingly high hits
# all by itself.
#
# Surely better algorithms can vastly improve this program!!!
#
# I am hoping someone, or some folks, will come up with improved
# variations and algorithms.
#
# Various algorithms could be tried, and then the results from the various
# algorithms could be averaged in order to produce more accurate results.
#
# Currently, Merlin may have a low IQ, but he has potential for the future.
# Anyway, Merlin can already answer just about any question.
# Someday, perhaps he will even answer correctly, or at least with wisdom,
# most or all of the time.
# ;-)))))))))))))

# NLQ:
# A short program called NLQ, or natural language query, which can be
# found online at http://gurno.com/adam/nlq/#download
# NLQ is a class that takes an inputted query and outputs (1) keywords and
# (2) a category for the type of question being asked. I am primarily
# interested in using the keywords extracted from a query by NLQ. I
# shamelessly modified NLQ to add many more IGNORE_WORDS and otherwise
# spruce it up.

# NLQ.py is still rather dumb, but hey, he has potential ;-))))).

import urllib
from urllib import urlencode, urlopen
import re

import string, sys

# stuff
__version__ = "0.1"

# define the question types...
UNKNOWN = 0
KNOWLEDGE = 1
COMPREHENSION = 2
APPLICATION = 3
ANALYSIS = 4
SYNTHESIS = 5
EVALUATION = 6

KNOWLEDGE_WORDS = ["name",
                   "list",
                   "recall",
                   "define",
                   "tell",
                   "match",
                   "who",
                   "what",
                   "when",
                   "describe",
                   "where"]

COMPREHENSION_WORDS = ["retell"]
APPLICATION_WORDS = ["why"]
ANALYSIS_WORDS = ["how",
                "classify",
                "outline",
                "diagram"]
SYNTHESIS_WORDS = []
EVALUATION_WORDS = []

# All lowercase, because NLQ lowercases its input before comparing.
PRONOUNS = ["he",
        "she",
        "it",
        "me",
        "you",
        "they",
        "them",
        "we",
        "who",
        "myself",
        "yourself",
        "ourself",
        "i",
        "my"]

VERBS = ["is",
         "was",
         "are",
         "were",
         "be",
         "shall",
         "am",
         "isn't",
         "can't",
         "won't",
         "shouldn't",
         "couldn't",
         "aren't",
         "do",
         "don't",
         ]

OTHER_WORDS = ["if",
                "to",
                "too",
                "there",
                "will",
                "the",
                "a",
                "let",
                "i'll",
                "this",
                "these",
                "those",
                "let",
                "*.",
               "+*",
               ".*",
               "<*",
               ">*",
               "=*",
               "*=",
               "*<",
               "*>",
               "*.",
               "*-",
               "-*",
               "*:",
               ":*",
               ";*",
               "*;",
               "*,",
               ",*",
               "*.*",
               "*,*",
               "*;*",
               "*:*",
               "*+*",
               "*=*",
               "*-*",
               "*_*",
               "*<*",
               "*>*",
               "*?*",
               "*/*",
               "of",
               "and",
               "for",
               "very",
               "not",
               "in",
               "on",
               "up",
               "has",
               "from",
               "which",
               "and",
               "on",
               "of",
               "or",
               "not",
               "by",
               "can",
               "that",
               "your",
               "with",
               "their",
               "over",
               "back",
               "link",
               "about",
               "an",
               "at",
               "his",
               "enter",
               "into",
               "so",
               "was",
               "a",
               "as",
               "but"]

IGNORE_WORDS = (VERBS + PRONOUNS + OTHER_WORDS + KNOWLEDGE_WORDS +
                COMPREHENSION_WORDS + APPLICATION_WORDS + ANALYSIS_WORDS)

def determine_type (word):
        # for right now this only matches the first word.  Soon it will
        # take the whole string and attempt to match using that.
        return_type = UNKNOWN
        if word in KNOWLEDGE_WORDS:
                return_type = KNOWLEDGE
        elif word in APPLICATION_WORDS:
                return_type = APPLICATION
        elif word in ANALYSIS_WORDS:
                return_type = ANALYSIS
        elif word in SYNTHESIS_WORDS:
                return_type = SYNTHESIS
        elif word in EVALUATION_WORDS:
                return_type = EVALUATION
        elif word in COMPREHENSION_WORDS:
                return_type = COMPREHENSION
        return return_type

class NLQ:
        def __init__(self, a_string):
                self.tuple = string.split(string.lower(a_string))
                self.type = UNKNOWN
                if self.tuple:
                        self.type = determine_type (self.tuple[0])
                self.keywords = []

                # Any word containing one of these characters is skipped.
                bad_chars = "~@#$%^&<>:;{}[*()_-+=?"

                # The first word only sets the type, so start at the second.
                for word in self.tuple[1:]:
                        skip = 0
                        for char in bad_chars:
                                if char in word:
                                        skip = 1
                                        break
                        if skip:
                                continue

                        if word[0] not in string.letters:
                                continue

                        # Strip one trailing non-letter (e.g. a comma or
                        # period) before the word is checked and kept.
                        if word[-1] not in string.letters:
                                word = word[:-1]

                        # IGNORE_WORDS already includes VERBS, PRONOUNS
                        # and OTHER_WORDS.
                        if word in IGNORE_WORDS:
                                continue

                        self.keywords.append (word)

        def __repr__(self):
                return "type: %s\nkeywords: %s" % (self.type, self.keywords)

class multiChoiceGuesser:

    def __init__(self, question='', replys=()):
        self.question = question
        self.replys   = replys

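    def guessedAnswer(self):
        # For each reply, compare the hits for the reply alone (y) with the
        # hits for the question plus the reply (x). The reply with the
        # lowest y/x ratio is taken as the answer.
        hits = []

        for reply in self.replys:
            x = float(self._getGoogleHits(self.question + ' ' + reply))
            y = float(self._getGoogleHits(reply))

            if x == 0:
                x = 1.0    # avoid dividing by zero
            hits.append(y / x)

        return hits.index(min(hits))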

    def _getGoogleHits(self, query):
        query = urllib.urlencode({'q':query})
        urlHandle = urllib.urlopen('http://www.google.com/search?%s' % query)
        try:
            googlePage = urlHandle.read()
        finally:
            # Close the connection whether or not the read succeeded.
            urlHandle.close()
        # Pull the hit count out of the "of about <b>...</b>" text.
        match = re.search('of about <b>(.*?)</b>.', googlePage, re.S)
        if match is None:
            return 0
        try:
            hits = int(re.sub(',', '', match.group(1)))
        except ValueError:
            hits = 0
        return hits

def guess(question, choices):
    mcg = multiChoiceGuesser(question, choices)
    print 'The question is: ', question
    print 'The most likely answer is: ', choices[mcg.guessedAnswer()]
    print ''

def get_list(heading, prompt):

        print heading
        print
        print "(enter a blank line to end the list)"
        ret = []
        i = 1
        while 1:
                line = raw_input(prompt % i)
                if not line:
                        break
                ret.append(line)
                i=i+1
        print
        return ret

def _getGooglePage(question):
        # Fetch the raw Google results page for the question by itself.
        query = urllib.urlencode({'q': question})
        urlHandle = urllib.urlopen('http://www.google.com/search?%s' % query)
        try:
                return urlHandle.read()
        finally:
                urlHandle.close()

def find_choices(question):
        # With no options from the user, use the keywords on the Google
        # results page for the question as the candidate answers.
        print "Since you did not give Merlin any options, it may take a while"
        print "as he thinks. Please be patient; if you do not touch your"
        print "keyboard or mouse for a few minutes, Merlin will respond ;-)))))"

        source = _getGooglePage(question)

        # Keywords that come from the question itself are not answers.
        bad = NLQ(question).keywords
        choices = [word for word in NLQ(source).keywords if word not in bad]

        # Discard the first 13 and the last 13 keywords.
        del choices[:13]
        del choices[-13:]
        return choices

prompt = "What is your question?"

while 1:

        question = raw_input (prompt)

        choices = get_list("Enter your options:", "Option %d: ")

        if choices == []:
                choices = find_choices(question)

        if choices:
                guess(question, choices)
        else:
                print "Merlin could not find any possible answers."

        prompt = "What is your next question?"