[Chicago] python libraries for latent semantic indexing

Jeremy McMillan aphor at me.com
Thu Sep 23 17:17:54 CEST 2010


Hey, I actually have a use case! I'm working on an expert system, and  
I want it to perform inference based on a network of salience. Getting  
data isn't hard; it's the metadata that's difficult. I have lots of  
existing (mostly) HTML, Excel spreadsheets, Word docs, and  
PowerPoint :).

I think I can cull some ontological data from the corpus, which can be  
used for inferences, but I need a measure of quality (I like to say  
salience, because it affords perspective: a measure of link strength  
from one node to another). Using supervised learning puts data quality  
back into the heavy-lifting-for-users category. I'm hoping to get an  
80% solution out of a combination of unsupervised learning techniques.

The first problem is that a lot of the text contains jargon that has  
different meanings in different domains. I think LSA may be able to  
provide a factor of salience by teasing domain metadata out of the  
information represented in the corpus. If it's computationally  
feasible, and it looks like it *might* be, maybe LSA relevance can be  
used to seed the network/graph of salience from node to node? Can one  
be inferred from the other, or is this apples and oranges?
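To make that concrete, here is a minimal sketch of the kind of thing I'm picturing (the toy corpus, the k=2 truncation, and the use of numpy's SVD are all just placeholders, not a real pipeline): factor a term-document matrix, then treat cosine similarity between documents in the latent space as a candidate salience score between nodes.

```python
# Toy LSA: factor a term-document matrix with SVD, then measure
# document-to-document "salience" as cosine similarity in latent space.
import numpy as np

# Two tiny domains: networking docs and accounting docs (placeholder data).
docs = [
    "router firewall network packet",
    "packet network latency router",
    "invoice ledger balance audit",
    "audit balance ledger",
]

# Build a raw term-document count matrix (terms x docs).
vocab = sorted({w for d in docs for w in d.split()})
A = np.zeros((len(vocab), len(docs)))
for j, d in enumerate(docs):
    for w in d.split():
        A[vocab.index(w), j] += 1

# Truncated SVD: keep k latent "topic" dimensions.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
doc_vecs = (np.diag(s[:k]) @ Vt[:k]).T  # one k-dim vector per document

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Documents from the same domain should land close together in latent
# space; documents from different domains should not.
print(cosine(doc_vecs[0], doc_vecs[1]))  # networking vs networking: high
print(cosine(doc_vecs[0], doc_vecs[2]))  # networking vs accounting: near zero
```

If something like this scales, the pairwise cosines could seed the initial edge weights in the salience graph, to be refined later.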

My corpus is in a Plone site, so I think I will start with  
ZCTextIndex, which does cosine-similarity relevance ranking, rather  
than trying to bolt on something else.

http://wiki.zope.org/zope2/ZCTextIndex
http://www.zope.org/Members/dedalu/ZCTextIndex_python
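For anyone unfamiliar with what that ranking does, here's the cosine measure in miniature, in plain Python (this is just the general idea, not ZCTextIndex's actual code, which also applies term weighting):

```python
# Cosine relevance ranking in miniature: score a query against each
# document by the angle between their term-count vectors.
from collections import Counter
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two Counter term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Placeholder documents.
docs = {
    "doc1": "plone zope index catalog zope",
    "doc2": "excel spreadsheet pivot table",
}
query = Counter("zope index".split())

# Rank documents by similarity to the query, best first.
ranked = sorted(docs, key=lambda d: cosine(query, Counter(docs[d].split())), reverse=True)
print(ranked)  # doc1 should outrank doc2 for this query
```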

Additionally, I found an old Geoffrey Hinton Google TechTalk on  
YouTube about using neural nets ("Restricted Boltzmann Machines") for  
feature detection (mainly OCR), but with an aside showing a simple  
document-classification example.

31:37 is the document analysis example
http://www.youtube.com/watch?v=AyzOUbkUf3M

On Sep 23, 2010, at 9:13 AM, sheila miguez wrote:

> Subject: Re: [Chicago] python libraries for latent semantic indexing
>
> No one does this? seriously?
>
> I'm not doing it for anything serious. I thought it would be amusing
> to use a corpus based on the set of science fiction writers I like
> who also blog and/or make their works available online. Then I was
> going to try random amusing crap like 'sort by noir transhumanism'
> on my facebook wall.
>
> Though I'm thinking I won't have a large enough corpus, and also
> that I might not know enough to get anything other than nonsensical
> noise from whatever I end up with.
>
> But, it is an amusing way to pass the time.
>
> On Sep 21, 2010 2:53 PM, "sheila miguez" <shekay at pobox.com> wrote:
>
> I would be interested in hearing a talk about how someone has used
> python for LSA, LDA, &c. analysis. Playing around myself, I found a
> python library called gensim, and a java library called mallet.
>
> http://nlp.fi.muni.cz/projekty/gensim/
>
> I have not played around enough to give a talk, and being a complete
> newbie with this, I would not want to. I'd like to hear someone with
> relevant experience give a talk.
>
> thanks thanks


