[Moin-user] Help: Problem of Xapian search

Thomas Waldmann tw-public at gmx.de
Mon Apr 27 02:58:38 EDT 2009


Hi Byron,

> For example, I have a section of text as follows:
> 
> "紐約洋基今天在新球場鏖戰14局,靠著卡布瑞拉的再見全壘打,9比7轟走奧克蘭運動
> 家,拿下主場首次延長賽勝利。"
> (English: "The New York Yankees battled through 14 innings at their new
> stadium today and, on Cabrera's walk-off home run, beat the Oakland
> Athletics 9-7 for their first extra-inning win at home.")
> 
> If I search for the string "卡布瑞拉"
> 
> When Xapian is disabled, I get the correct search result.

Yes, the slow search mostly does a substring search.

> But when I enable Xapian, MoinMoin cannot find the string.
> 
> If I enter the string "靠著卡布瑞拉的再見全壘打", MoinMoin does show the
> result.

Xapian search only finds stuff that was put into the index.

At indexing time, it runs the text through a tokenizer and puts the
tokens the tokenizer yields into the index.

For example, for a string like "foo plus bar is FooBar.", the tokenizer
should yield something like: foo, plus, bar, is, FooBar, Foo, Bar.
As you can see, the CamelCase word FooBar is split and also yielded as
its components, while the normal splitting just happens at whitespace
and punctuation.
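Roughly, that behaviour can be sketched like this (a simplified illustration of the description above, not moin's actual indexing code):

```python
import re

def tokenize(text):
    """Split on whitespace/punctuation, then additionally yield the
    components of CamelCase words, as described above."""
    for word in re.findall(r"\w+", text):
        yield word
        # A CamelCase word is also yielded as its capitalized parts.
        parts = re.findall(r"[A-Z][a-z]+", word)
        if len(parts) > 1:
            yield from parts

print(list(tokenize("foo plus bar is FooBar.")))
# -> ['foo', 'plus', 'bar', 'is', 'FooBar', 'Foo', 'Bar']
```

A query is tokenized the same way, so it can only match tokens that actually ended up in the index.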

The builtin tokenizer works quite well for English and other alphabetic
languages, but I guess it is not appropriate for Chinese or other
logographic languages.

So what is needed for Chinese is a Chinese tokenizer.
Most current moin developers have no clue about Chinese, but
well-commented patches are welcome. Whoever works on that, please
communicate with us often.

Please note that this problem is quite hard to solve in a generic way,
especially if the kind/language of text is unknown or mixed.

> So, when I enable Xapian, I cannot find a word within a sentence, in
> wiki page content as well as in attachment content.
> 
> If I want to enable the Xapian function, how can I resolve this problem?

If you have 100% Chinese content, you need a different tokenizer.
If you have mixed content, you need multiple tokenizers and some
intelligent means of switching between them depending on the content.

Cheers,

Thomas
