Search Engines, Chinese and Python.

Sun Apr 11 01:03:50 EDT 1999

I have a sneaky feeling that somebody on the Python list first mentioned
this URL a couple of years back but I've just rediscovered the great
resource on CJK processing (I knew I bookmarked it for a reason) :

http://www.ora.com/people/authors/lunde/cjk_inf.html

so I answer the second half of my own question.
Now if maybe the Infoseek guys are interested in porting their
engine to the most widely-spoken language in the world ;-)

chas

*just as an aside, it was due to Infoseek that I first looked at Python;
 always thought it was the best search engine, read that they used Python just
 as I was despairing with another P-language...  haven't looked back since. :)

  sweeting at neuronet.com.my wrote:
> The "how do you build a search engine in Python ?" question has been
> asked and answered enough times so I'll spare you all the agony. Given
> the choice, I'd use Ultraseek*, WAIS or something rather than rebuild
> this again myself; but I need this to work in Chinese. Foreseeing a
> great demand for this (for myself and in general) and failing to find a
> decent ready-made solution, I figured that I may as well have a stab at
> it. (If all else fails, at least I may improve my Mandarin).
>
> Snipping from a thread last December :
>
> [snip]
> >Richard Jones <richard.jones at fulcrum.com.au> wrote:
> >:    The short answer is "you don't".
> >
> The big answer might be (this is not a Gadfly answer, but hey):
>
> (This uses indexing on item submission to speed fetching.)
>
> A pair of (g)dbm's.  One that stores your entries under some unique
> per-item id.
>
> Churn through each new item looking for words. "Stop and stem" this
> list (i.e. kill "and", "at", "the"; standardise case; collapse "runner",
> "running" -> "run" or whatever. Can be tricky :-) You can automate
> the stop-list just by counting word occurrences.)
> The HTML parser, rfc822 et al. could also be used to pull out
> details for searches like "url:www.host.com".
>
> The second file holds a relation between words and the documents that
> contain them, i.e. an inverted list.
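
A minimal sketch of the indexing step described above, using plain
in-memory dicts in place of the two dbm files. The names (words_in,
common_words, build_index) and the sample documents are hypothetical,
stemming is left out, and the automatic stop list simply treats any word
found in nearly every document as a stop word, which is only one way to
"count word occurrences".

```python
import re
from collections import Counter, defaultdict

WORD = re.compile(r"[a-z0-9]+")


def words_in(text):
    """Standardise case and split the text into alphanumeric words."""
    return WORD.findall(text.lower())


def common_words(docs, min_fraction=0.9):
    """Candidate stop words: words found in at least min_fraction of the docs."""
    doc_freq = Counter()
    for text in docs.values():
        doc_freq.update(set(words_in(text)))  # count each word once per document
    cutoff = min_fraction * len(docs)
    return {word for word, n in doc_freq.items() if n >= cutoff}


def build_index(docs, stop_words):
    """Inverted list: word -> {doc_id: number of occurrences in that doc}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        counts = Counter(w for w in words_in(text) if w not in stop_words)
        for word, n in counts.items():
            index[word][doc_id] = n
    return dict(index)


# Hypothetical sample data, for illustration only.
docs = {
    "doc1": "The cat and the dog",
    "doc2": "The Python runner",
    "doc3": "Programming the Python, in Python",
}
stop_words = common_words(docs)       # {'the'} for this sample
index = build_index(docs, stop_words)
print(index["python"])                # {'doc2': 1, 'doc3': 2}
```

Keeping the per-document counts in the inverted list costs little at
index time and makes it possible to rank results later.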
>
> Query comes in: search for "python programming":
>   list1 = db2["python"]
>   list2 = db2["programming"]
>
> The intersection of the lists is the set of documents that contain both
> words.  As things get big, you may need to overload the getitem
> to return a smaller list and store things like the number of times
> the word appears in an item (you can then sort the inverted lists
> on this attribute).
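
The query step might look roughly like this. It is a sketch that assumes
the inverted list maps each word to a {doc_id: count} dict; the search
function and the sample index are hypothetical, and ranking by the summed
counts is just one simple choice.

```python
def search(index, query):
    """Return ids of documents containing every query word, best match first."""
    words = query.lower().split()
    if not words:
        return []
    # One posting dict per query word; an unknown word matches nothing.
    postings = [index.get(word, {}) for word in words]
    common = set(postings[0])
    for posting in postings[1:]:
        common &= set(posting)  # keep only documents seen for every word
    # Rank by total occurrences of the query words, then by id for stability.
    return sorted(common, key=lambda d: (-sum(p[d] for p in postings), d))


# Hypothetical inverted list: word -> {doc_id: occurrence count}.
index = {
    "python": {"doc1": 3, "doc2": 1, "doc3": 2},
    "programming": {"doc2": 4, "doc3": 1},
}
print(search(index, "python programming"))  # ['doc2', 'doc3']
```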
>
> Hope this helps.
>
> (Start small!)
> -- James Preston, waiting for Godot.
> [/snip]
>
> Is it really as easy as that ?
>
> It seems that the real work is in the indexing and this is going to be
> even more of a chore with Chinese because words aren't separated by
> spaces  - so we'll also have to build a parsing engine to work that out :(
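
One common way to sidestep word segmentation, at least for a first
version, is to index overlapping two-character sequences (bigrams)
instead of words. This is a swapped-in technique, not something proposed
in the thread. The sketch below assumes Python 3 Unicode strings and only
covers the basic CJK Unified Ideographs block; the cjk_bigrams name is
hypothetical.

```python
import re

# Runs of characters from the basic CJK Unified Ideographs block (U+4E00..U+9FFF).
CJK_RUN = re.compile(r"[\u4e00-\u9fff]+")


def cjk_bigrams(text):
    """Split each run of Chinese characters into overlapping 2-character terms."""
    terms = []
    for run in CJK_RUN.findall(text):
        if len(run) == 1:
            terms.append(run)  # a lone character is indexed as-is
        else:
            terms.extend(run[i:i + 2] for i in range(len(run) - 1))
    return terms


print(cjk_bigrams("我喜欢搜索引擎"))
# ['我喜', '喜欢', '欢搜', '搜索', '索引', '引擎']
```

The terms can go into the same kind of inverted list as space-separated
words; a query is split into bigrams the same way. The price is a larger
index and some false matches across word boundaries.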
>
> If anybody has worked with Chinese text and has any caveats with regard to the
> above project or programming double-byte characters in general, I'm all ears.
> I'm still struggling with getting my servers/scripts to write Chinese to
> the screen of Chinese-Windows machines let alone programming this into a
> database.
>
> Thank you very much,
>
> chas
>
> -----------== Posted via Deja News, The Discussion Network ==----------
> http://www.dejanews.com/       Search, Read, Discuss, or Start Your Own
>
