Search Engines, Chinese and Python.

sweeting at neuronet.com.my
Sat Apr 10 13:51:07 EDT 1999


The "how do you build a search engine in Python?" question has been
asked and answered enough times that I'll spare you all the agony. Given
the choice, I'd use Ultraseek*, WAIS or something similar rather than
rebuild this myself, but I need it to work in Chinese. Foreseeing a
great demand for this (for myself and in general) and failing to find a
decent ready-made solution, I figured that I may as well have a stab at
it. (If all else fails, at least I may improve my Mandarin.)

Snipping from a thread last December:

[snip]
>Richard Jones <richard.jones at fulcrum.com.au> wrote:
>:    The short answer is "you don't".
>
The big answer might be (this is not a Gadfly answer, but hey):

(This uses indexing on item submission to speed fetching.)

A pair of (g)dbm files.  One stores your entries under some unique
per-item id.

Churn through each new item looking for words. "Stop and stem" this
list (i.e. kill "and", "at", "the"; standardise case; collapse "runner",
"running" -> "run" or whatever). Can be tricky :-) You can automate
the stop-list just by counting word occurrences.
The HTML parser, rfc822 et al. could also be used to pull out
details for searches like "url:www.host.com".

The second file holds a relation between words and the documents that
contain them, i.e. an inverted list.

Query comes in: search for "python programming":
  list1 = db2["python"]
  list2 = db2["programming"]

The intersection of the lists is the set of documents that contain both
words.  As things get big, you may need to overload __getitem__
to return a smaller list and store things like the number of times
the word appears in an item (you can then sort the inverted lists
on this attribute).

Hope this helps.

(Start small!)
-- James Preston, waiting for Godot.
[/snip]

Is it really as easy as that?
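
For what it's worth, here is a minimal sketch of the scheme quoted above,
so there is something concrete to poke at. Everything in it is an
assumption for illustration: two plain dicts (db1, db2) stand in for the
pair of (g)dbm files, STOP_WORDS is a hand-picked stop-list, and stem() is
a deliberately crude suffix stripper, not a real stemming algorithm. The
names add_item() and search() are hypothetical.

```python
import re
from collections import Counter

# Hypothetical hand-picked stop-list; the quoted post suggests deriving
# one automatically by counting word occurrences.
STOP_WORDS = {"and", "at", "the", "a", "of", "in", "is", "to"}

def stem(word):
    """Crude suffix stripping, for illustration only (not a real stemmer)."""
    for suffix in ("ning", "ing", "ner", "er", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

def terms(text):
    """Lower-case the text, split it into words, then stop and stem them."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [stem(w) for w in words if w not in STOP_WORDS]

db1 = {}  # stand-in for the first file: item id -> original text
db2 = {}  # stand-in for the second file: term -> {item id: count}

def add_item(item_id, text):
    """Store the item and update the inverted list at submission time."""
    db1[item_id] = text
    for term, count in Counter(terms(text)).items():
        db2.setdefault(term, {})[item_id] = count

def search(query):
    """Return ids of items containing every query term, best match first."""
    postings = [db2.get(term, {}) for term in terms(query)]
    if not postings:
        return []
    hits = set(postings[0])
    for posting in postings[1:]:
        hits &= set(posting)  # intersection of the inverted lists
    # Rank by total occurrences of the query terms, ties broken by id.
    return sorted(hits, key=lambda i: (-sum(p[i] for p in postings), i))

add_item("doc1", "Python programming is fun")
add_item("doc2", "Programming in Python: more Python, more programming")
add_item("doc3", "The runner and the python")

print(search("python programming"))
print(search("running"))
```

Swapping the dicts for real dbm files would mean serialising the
per-term mapping (dbm values are byte strings), which is presumably where
the "can be tricky" part starts.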

It seems that the real work is in the indexing, and this is going to be
even more of a chore with Chinese because words aren't separated by
spaces, so we'll also have to build a parsing engine to work that out :(
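
One possible shape for that parsing step is dictionary-based forward
maximum matching: at each position, take the longest run of characters
found in a word list, and fall back to a single character otherwise. This
is only a sketch under stated assumptions, not a recommendation. LEXICON
is a made-up toy word list, segment() is a hypothetical name, and the
input is assumed to be already decoded to a Python str rather than raw
double-byte data.

```python
# Hypothetical toy lexicon; a usable one would need far more entries.
LEXICON = {"中文", "搜索", "引擎", "搜索引擎", "我们", "需要"}
MAX_WORD_LEN = max(len(word) for word in LEXICON)

def segment(text):
    """Split text into words by greedy longest-match against LEXICON."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until the lexicon
        # matches; a single character is always accepted as a fallback.
        for size in range(min(MAX_WORD_LEN, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in LEXICON:
                words.append(candidate)
                i += size
                break
    return words

print(segment("我们需要中文搜索引擎"))
```

The resulting words could then be fed to the indexer in place of the
space-separated words used for English. Greedy matching can pick the
wrong split for ambiguous strings, so treat it as a starting point.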

If anybody has worked with Chinese text and has any caveats regarding the
above project or programming with double-byte characters in general, I'm
all ears. I'm still struggling to get my servers/scripts to write Chinese
to the screen of Chinese-Windows machines, let alone to program this into
a database.

Thank you very much,

chas
