[Chicago] python libraries for latent semantic indexing

sheila miguez shekay at pobox.com
Thu Sep 23 17:38:03 CEST 2010


Coolness. Give a talk :)


I learned about LSA back in the dark ages from a psychology prof.

I've been catching up. Check out some of the other methods. From a
quick gloss, they may be better able to help with the jargon, since
they should improve on LSA's handling of polysemy. So if jargon is
used in one place to mean something other than its common use, you
have a better chance of catching that.
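The "other methods" here means topic models like LDA, which gensim and
mallet both implement for real. Just to show the idea, here's a toy
collapsed Gibbs sampler for LDA in plain numpy (illustrative only; the
documents and hyperparameters are made up, and you'd use a library for
anything real):

```python
import numpy as np

# Toy collapsed Gibbs sampler for LDA -- a sketch of the idea, not
# gensim's or mallet's implementation.
rng = np.random.default_rng(0)

docs = [
    ["space", "ship", "alien", "ship"],
    ["noir", "detective", "city", "noir"],
    ["alien", "city", "ship", "detective"],
]
vocab = sorted({w for d in docs for w in d})
word_id = {w: i for i, w in enumerate(vocab)}

K, V, D = 2, len(vocab), len(docs)   # topics, vocab size, doc count
alpha, beta = 0.1, 0.01              # Dirichlet hyperparameters

# Count tables: doc-topic, topic-word, topic totals, token assignments.
ndk = np.zeros((D, K)); nkw = np.zeros((K, V)); nk = np.zeros(K)
z = []
for d, doc in enumerate(docs):
    zs = []
    for w in doc:
        k = rng.integers(K)
        zs.append(k)
        ndk[d, k] += 1; nkw[k, word_id[w]] += 1; nk[k] += 1
    z.append(zs)

for _ in range(200):                 # Gibbs sweeps
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k, v = z[d][i], word_id[w]
            # Remove this token's assignment, resample, put it back.
            ndk[d, k] -= 1; nkw[k, v] -= 1; nk[k] -= 1
            p = (ndk[d] + alpha) * (nkw[:, v] + beta) / (nk + V * beta)
            k = rng.choice(K, p=p / p.sum())
            z[d][i] = k
            ndk[d, k] += 1; nkw[k, v] += 1; nk[k] += 1

theta = (ndk + alpha) / (ndk + alpha).sum(axis=1, keepdims=True)
print(theta)  # per-document topic mixtures
```

The polysemy angle: because each token gets its own topic assignment, a
jargon word can sit in different topics in different documents, which is
exactly what LSA's single word vector can't do.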


On Thu, Sep 23, 2010 at 10:17 AM, Jeremy McMillan <aphor at me.com> wrote:
> Hey, I actually have a use case! I'm working on an expert system, and I want
> it to perform inference based on a network of salience. Getting data isn't
> hard, it's the metadata that's difficult. I have lots of existing (mostly)
> HTML, Excel spreadsheets, Word docs, and PowerPoint :).
> I think I can cull some ontological data from the corpus, which can be used
> for inferences, but I need a measure of quality (I like to say salience,
> because it affords perspective: a measure of link strength from one node to
> another). Using supervised learning puts data quality back into the "heavy
> lifting for users" category. I'm hoping to get an 80% solution out of a
> combination of unsupervised learning methods.
> The first problem is that a lot of the text contains jargon which has
> different meaning in different domains. I think LSA may be able to provide a
> factor of salience by teasing domain metadata out of information represented
> in the corpus. If it's computationally feasible, and it looks like it
> *might* be, maybe LSA relevance can be used to seed the network/graph of
> salience from node to node? Should I infer this means that, or is this
> apples and oranges?
> My corpus is in a Plone, so I think I will start with ZCTextIndex, which
> does Cosine Rule relevance ranking, rather than trying to bolt on something
> else.
> http://wiki.zope.org/zope2/ZCTextIndex
> http://www.zope.org/Members/dedalu/ZCTextIndex_python
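For anyone who hasn't seen it, the Cosine Rule relevance ranking that
ZCTextIndex does boils down to cosine similarity between term-frequency
vectors. A toy version (hypothetical documents; not ZCTextIndex's actual
code, which also does TF-IDF weighting and indexing):

```python
import math
from collections import Counter

# Toy cosine-similarity relevance ranking, the idea behind ZCTextIndex's
# "Cosine Rule" scoring. Documents are hypothetical examples.
def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

docs = {
    "doc1": "alien ship lands in the city",
    "doc2": "noir detective walks the city at night",
}
query = Counter("alien ship".split())
vecs = {name: Counter(text.split()) for name, text in docs.items()}
ranked = sorted(vecs, key=lambda n: cosine(query, vecs[n]), reverse=True)
print(ranked)  # doc1 ranks first: it shares "alien" and "ship" with the query
```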
> Additionally, I found an old Geoffrey Hinton Google TechTalk YouTube video
> on using Neural Nets, "Restricted Boltzmann Machines," for feature detection
> (OCR mainly), but with an aside showing a simple document classification
> example.
> 31:37 is the document analysis example
> http://www.youtube.com/watch?v=AyzOUbkUf3M
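The RBM in that talk is trainable in a few lines with one-step
contrastive divergence (CD-1). A minimal numpy sketch on toy binary
word-presence vectors (made-up data; real document models stack these
into deeper nets, as Hinton shows):

```python
import numpy as np

# Minimal Bernoulli RBM trained with one-step contrastive divergence
# (CD-1), in the spirit of the Hinton talk. Toy word-presence vectors.
rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

X = np.array([[1, 1, 0, 0],   # docs using words 0-1
              [1, 1, 0, 0],
              [0, 0, 1, 1],   # docs using words 2-3
              [0, 0, 1, 1]], dtype=float)

n_vis, n_hid, lr = X.shape[1], 2, 0.1
W = 0.01 * rng.standard_normal((n_vis, n_hid))
b_v, b_h = np.zeros(n_vis), np.zeros(n_hid)

for _ in range(500):
    v0 = X
    ph0 = sigmoid(v0 @ W + b_h)                       # hidden probabilities
    h0 = (rng.random(ph0.shape) < ph0).astype(float)  # sample hidden units
    v1 = sigmoid(h0 @ W.T + b_v)                      # reconstruction
    ph1 = sigmoid(v1 @ W + b_h)
    W += lr * (v0.T @ ph0 - v1.T @ ph1) / len(X)      # CD-1 weight update
    b_v += lr * (v0 - v1).mean(axis=0)
    b_h += lr * (ph0 - ph1).mean(axis=0)

codes = sigmoid(X @ W + b_h)   # hidden "feature" activations per document
print(codes.round(2))
```

The hidden activations are the learned features; in the talk these feed
a classifier or a deeper layer.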
> On Sep 23, 2010, at 9:13 AM, sheila miguez wrote:
>
> Subject: Re: [Chicago] python libraries for latent semantic indexing
>
> No one does this? Seriously?
>
> I'm not doing it for anything serious. I thought it would be amusing to use
> a corpus based on the set of science fiction writers I like who also blog
> and/or make their works available online. Then I was going to try random
> amusing crap like 'sort by noir transhumanism' on my facebook wall.
>
> Though I'm thinking I won't have a large enough corpus, and also that I might
> not know enough to get anything other than nonsensical noise from whatever I
> end up with.
>
> But, it is an amusing way to pass the time.
>
> On Sep 21, 2010 2:53 PM, "sheila miguez" <shekay at pobox.com> wrote:
>
> I would be interested in hearing a talk about how someone has used
> python for LSA, LDA, &c. analysis. Playing around myself, I found a
> python library called gensim, and a java library called mallet.
>
> http://nlp.fi.muni.cz/projekty/gensim/
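At its core, the LSA that gensim's LSI model implements is just a
truncated SVD of a term-document count matrix. A numpy-only sketch with
made-up documents (gensim does this properly, streaming over large
corpora):

```python
import numpy as np

# LSA in a nutshell: truncated SVD of a term-document count matrix.
# Toy corpus; gensim's LSI handles real corpora without building the
# dense matrix.
docs = [
    "alien ship alien",
    "ship space station",
    "noir detective city",
    "city detective night",
]
vocab = sorted({w for d in docs for w in d.split()})
A = np.array([[d.split().count(w) for d in docs] for w in vocab],
             dtype=float)           # rows = terms, columns = documents

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2                               # number of latent "concepts" to keep
doc_vecs = (np.diag(s[:k]) @ Vt[:k]).T   # documents in latent space

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# The two sci-fi docs land near each other in latent space even though
# they share only one word; the noir docs end up on the other axis.
print(cos(doc_vecs[0], doc_vecs[1]), cos(doc_vecs[0], doc_vecs[2]))
```

Ranking a corpus by cosine similarity to a "noir" or "transhumanism"
query vector in that latent space is essentially the 'sort by noir
transhumanism' idea above.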
>
> I have not played around enough to give a talk, and being a complete
> newbie with this, I would not want to. I'd like to hear someone with
> relevant experience give a talk.
>
> thanks thanks
>
>
> _______________________________________________
> Chicago mailing list
> Chicago at python.org
> http://mail.python.org/mailman/listinfo/chicago
>
>



-- 
sheila

